
Reconfigurable Fault Tolerance for Space Systems

Permanent Link: http://ufdc.ufl.edu/UFE0045337/00001

Material Information

Title: Reconfigurable Fault Tolerance for Space Systems
Physical Description: 1 online resource (119 p.)
Language: english
Creator: Jacobs, Adam Michael
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2013

Subjects

Subjects / Keywords: abft -- fault -- fpga -- performability -- performance -- reconfigurable -- reliability -- tolerance
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Commercial SRAM-based, field-programmable gate arrays (FPGAs) have the capability to provide space applications with the necessary performance, energy-efficiency, and adaptability to meet next-generation mission requirements. However, mitigating an FPGA's susceptibility to radiation-induced faults is challenging. Triple-modular redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but TMR incurs substantial overheads such as increased area and power requirements. In order to reduce overhead while providing sufficient radiation mitigation, this research proposes a framework for reconfigurable fault tolerance (RFT) that enables system designers to dynamically adjust a system's level of redundancy and fault mitigation based on the varying radiation incurred at different orbital positions. To realize this goal and validate the effectiveness of the approach, three areas are investigated and addressed. First, a method for accurately estimating time-varying fault rates in space systems and a reliability and performance model for adaptive systems are needed to quantify the effectiveness of the RFT approach. Using multiple case-study orbits, our models predict that adaptive fault-tolerance strategies are able to improve unavailability by 85% over low-overhead fault tolerance techniques and performability by 128% over traditional, static TMR fault tolerance. Second, low-overhead fault-tolerance techniques which can be used within the RFT framework for improved performance must be investigated. The effectiveness of Algorithm-Based Fault Tolerance (ABFT) for FPGA-based systems is explored for matrix multiplication and FFT. ABFT kernels were developed for an FPGA platform, and reliability was measured using fault-injection testing. We show that matrix multiplication and FFTs with ABFT can provide improved reliability (vulnerability reduced by 98%) with low resource overhead, and scale favorably with additional parallelism. 
Third, methods for facilitating the integration of RFT hardware into existing PR-based systems and architectures are explored. We expand the RFT framework to be used with bus-based or point-to-point architectures. We design a fault-tolerant task-scheduling algorithm which can schedule RFT tasks in a dynamically-changing fault environment in order to maximize system performability. Combined, these three areas demonstrate the capability of RFT to provide both performance and reliability in space. Using low-overhead fault-tolerance techniques and reconfiguration, RFT can meet the strict constraints of next-generation space systems.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Adam Michael Jacobs.
Thesis: Thesis (Ph.D.)--University of Florida, 2013.
Local: Adviser: George, Alan Dale.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2013
System ID: UFE0045337:00001


Full Text

PAGE 1

RECONFIGURABLE FAULT TOLERANCE FOR SPACE SYSTEMS

By

ADAM M. JACOBS

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

PAGE 2

© 2013 Adam M. Jacobs

PAGE 3

To my parents for all of their patience and support

PAGE 4

ACKNOWLEDGMENTS

This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422 and IIP-1161022. The author gratefully acknowledges vendor equipment and/or tools provided by various vendors that helped make this work possible. The author also thanks fellow graduate student, Grzegorz Cieslewski, for developing the FPGA fault-injection tool used to gather many of the experimental results for this work.

PAGE 5

TABLE OF CONTENTS

ACKNOWLEDGMENTS ... 4
LIST OF TABLES ... 7
LIST OF FIGURES ... 8
ABSTRACT ... 10

CHAPTER

1 INTRODUCTION ... 12
2 BACKGROUND AND RELATED RESEARCH ... 18
  2.1 FPGA Performance and Power Efficiency ... 18
  2.2 Partial Reconfiguration ... 18
  2.3 Single-Event Effects ... 19
  2.4 FPGAs in Space Systems ... 20
  2.5 Low-Overhead Fault Tolerance Methods ... 23
  2.6 Algorithm-Based Fault Tolerance ... 25
  2.7 Task Scheduling ... 26
3 FRAMEWORK FOR RECONFIGURABLE FAULT TOLERANCE ... 28
  3.1 RFT Hardware Architecture ... 29
    3.1.1 RFT Controller Operation ... 30
    3.1.2 MicroBlaze Operation ... 32
    3.1.3 Environment-Based Fault Mitigation ... 34
    3.1.4 RFT Controller Resource and Performance Overheads ... 35
  3.2 RFT Fault-Rate Model ... 36
  3.3 RFT Performability Model ... 39
  3.4 Results and Analysis ... 43
    3.4.1 Validation Case Study ... 44
    3.4.2 Orbital Case Studies ... 47
      3.4.2.1 Low-Earth orbit case study ... 48
      3.4.2.2 Highly-elliptical orbit case study ... 50
  3.5 Conclusions ... 53
4 ALGORITHM-BASED FAULT TOLERANCE FOR FPGA SYSTEMS ... 63
  4.1 Matrix Multiplication ... 64
    4.1.1 Checksum-Based ABFT for Matrix Multiplication ... 64
    4.1.2 Matrix-Multiplication Architectures ... 66
      4.1.2.1 Baseline, serial architecture ... 66
      4.1.2.2 Fine-grained parallel architecture ... 66

PAGE 6

      4.1.2.3 Coarse-grained parallel architecture ... 67
      4.1.2.4 Architectural modifications for ABFT ... 67
    4.1.3 Resource-Overhead Experiments ... 68
      4.1.3.1 Resource overhead of serial architectures ... 69
      4.1.3.2 Resource overhead of parallel architectures ... 70
    4.1.4 Fault-Injection Experiments ... 72
      4.1.4.1 Design vulnerability of serial architectures ... 74
      4.1.4.2 Design vulnerability of parallel architectures ... 75
    4.1.5 Analysis of Matrix-Multiplication Architectures ... 76
  4.2 Fast Fourier Transform ... 77
    4.2.1 Checksum-Based ABFT for FFTs ... 78
    4.2.2 FFT Architectures ... 79
      4.2.2.1 Radix-2 Burst-IO FFT architecture ... 80
      4.2.2.2 Radix-2 Pipelined FFT architecture ... 80
      4.2.2.3 Architectural modifications for ABFT ... 80
    4.2.3 Resource-Overhead Experiments ... 82
      4.2.3.1 Resource overhead of Burst-IO architecture ... 82
      4.2.3.2 Resource overhead of Pipelined architecture ... 83
    4.2.4 FFT Fault Injection ... 83
    4.2.5 Analysis of FFT Architectures ... 85
  4.3 Conclusions ... 86
5 RFT SYSTEM INTEGRATION FOR RAPID SYSTEM DEVELOPMENT ... 94
  5.1 Dynamically Generated RFT Components ... 94
    5.1.1 RFT Controller Point-to-Point Interface ... 95
    5.1.2 Parameterized and Configurable Voting Logic ... 96
  5.2 Task Scheduling for RC Systems in Dynamic Fault-Rate Environments ... 97
    5.2.1 Selection Criteria for Fault-Tolerant Mode ... 98
      5.2.1.1 FT-mode selection using thresholds ... 98
      5.2.1.2 Time-resource metric for FT-mode selection ... 99
    5.2.2 Scheduler for RFT ... 100
      5.2.2.1 RFT architecture description ... 100
    5.2.3 Software Simulation ... 101
    5.2.4 Analysis and Results ... 102
      5.2.4.1 Constant fault rates ... 102
      5.2.4.2 Dynamic fault-rate case studies ... 103
      5.2.4.3 Scheduling improvements ... 104
  5.3 Conclusions ... 105
6 CONCLUSIONS ... 111

REFERENCES ... 113
BIOGRAPHICAL SKETCH ... 119

PAGE 7

LIST OF TABLES

Table ... page

3-1 RFT fault-tolerance modes. ... 54
3-2 RFT controller resource usage. ... 54
3-3 Fault-injection results for RFT components. ... 54
3-4 RFT Markov model validation. ... 55
3-5 Unavailability and performability for LEO case study. ... 55
3-6 Unavailability and performability for HEO case study. ... 55
4-1 Resource utilization and overhead of serial MM designs. ... 87
4-2 Serial matrix multiplication fault-injection results. ... 87
4-3 Resource utilization and overhead of FFT designs. ... 87
4-4 FFT fault-injection results. ... 87
5-1 RFT controller resource usage. ... 106
5-2 Dynamic scheduling results. ... 106

PAGE 8

LIST OF FIGURES

Figure ... page

3-1 System-on-chip architecture with RFT controller. ... 55
3-2 RFT controller PLB-to-PRR interface. ... 56
3-3 RFT fault-rate model. ... 56
3-4 Phased-mission Markov model transitioning between TMR and DWC modes. ... 57
3-5 Markov models of RFT modes. ... 57
3-6 RFT validation Markov models. ... 58
3-7 LEO fault rates using the RFT fault-rate model. ... 58
3-8 LEO system availability. ... 59
3-9 Effects of adaptive thresholds on availability and performability. ... 60
3-10 HEO fault rates using the RFT fault-rate model. ... 61
3-11 HEO system availability. ... 62
4-1 Matrix-multiplication architectures. ... 88
4-2 Matrix-multiplication pseudocode. ... 88
4-3 Matrix-multiplication ABFT-Extra architecture. ... 88
4-4 Slice overhead of fine-grained parallel matrix multiplication. ... 89
4-5 Slice overhead of coarse-grained parallel matrix multiplication. ... 89
4-6 DSP48 and BlockRAM overhead of parallel matrix multiplication. ... 90
4-7 Fault vulnerability of fine-grained parallel matrix multiplication. ... 91
4-8 Fault vulnerability of coarse-grained parallel matrix multiplication. ... 92
4-9 FFT architectures. ... 92
4-10 Required ABFT threshold value for floating-point FFTs. ... 93
5-1 Research areas for Phase 3. ... 106
5-2 Comparison of RFT architectures. ... 107
5-3 Flowchart of scheduling simulator. ... 107
5-4 Effect of arrival rate on fault-free operation. ... 108

PAGE 9

5-5 Effect of fault rate on task rejection. ... 108
5-6 Fault-rate profile for case studies. ... 109
5-7 Resource fragmentation from adaptive placement. ... 110

PAGE 10

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

RECONFIGURABLE FAULT TOLERANCE FOR SPACE SYSTEMS

By

Adam M. Jacobs

May 2013

Chair: Alan D. George
Major: Electrical and Computer Engineering

Commercial SRAM-based, field-programmable gate arrays (FPGAs) have the capability to provide space applications with the necessary performance, energy-efficiency, and adaptability to meet next-generation mission requirements. However, mitigating an FPGA's susceptibility to radiation-induced faults is challenging. Triple-modular redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but TMR incurs substantial overheads such as increased area and power requirements. In order to reduce overhead while providing sufficient radiation mitigation, this research proposes a framework for reconfigurable fault tolerance (RFT) that enables system designers to dynamically adjust a system's level of redundancy and fault mitigation based on the varying radiation incurred at different orbital positions. To realize this goal and validate the effectiveness of the approach, three areas are investigated and addressed.

First, a method for accurately estimating time-varying fault rates in space systems and a reliability and performance model for adaptive systems are needed to quantify the effectiveness of the RFT approach. Using multiple case-study orbits, our models predict that adaptive fault-tolerance strategies are able to improve unavailability by 85% over low-overhead fault tolerance techniques and performability by 128% over traditional, static TMR fault tolerance.

PAGE 11

Second, low-overhead fault-tolerance techniques which can be used within the RFT framework for improved performance must be investigated. The effectiveness of Algorithm-Based Fault Tolerance (ABFT) for FPGA-based systems is explored for matrix multiplication and FFT. ABFT kernels were developed for an FPGA platform, and reliability was measured using fault-injection testing. We show that matrix multiplication and FFTs with ABFT can provide improved reliability (vulnerability reduced by 98%) with low resource overhead, and scale favorably with additional parallelism.

Third, methods for facilitating the integration of RFT hardware into existing PR-based systems and architectures are explored. We expand the RFT framework to be used with bus-based or point-to-point architectures. We design a fault-tolerant task-scheduling algorithm which can schedule RFT tasks in a dynamically-changing fault environment in order to maximize system performability.

Combined, these three areas demonstrate the capability of RFT to provide both performance and reliability in space. Using low-overhead fault-tolerance techniques and reconfiguration, RFT can meet the strict constraints of next-generation space systems.

PAGE 12

CHAPTER 1
INTRODUCTION

As remote sensor technology for space systems increases in fidelity, the amount of data collected by orbiting satellites and other space vehicles will continue to outpace the ability to transmit that data to other stations (e.g., ground stations, other satellites). Increasing future systems' onboard data-processing capabilities can alleviate this downlink bottleneck, which is caused by bandwidth limitations and high-latency transmission. Onboard data processing enables much of the raw data to be interpreted, reduced, and/or compressed onboard the space system before transmitting the results to ground stations or other space systems, thus reducing data transmission requirements. For applications with low data transmission requirements, improved onboard data processing can enable more complex and autonomous capabilities. These autonomous capabilities will enable new classes of space missions such as in situ scientific studies, constellations of multiple, coordinated space systems, or automated maneuvering and guidance of deep-space systems. Finally, increased onboard data processing can enable future space systems to keep up with increasingly stringent real-time constraints. However, increasing the onboard data-processing capabilities requires high-performance computing, which has largely been absent from space systems.

In addition to the high performance requirements of these enhanced future systems, space environments impose several other stringent, high-priority design constraints. System design decisions must consider the system's size, weight, and power (SWaP) and heat dissipation, which are dictated by the system's physical platform configuration. The satellite's physical dimensions restrict the system's size, the photovoltaic solar array's capacity restricts the power generation, and all heat dissipation must occur from passive, radiative cooling, which can be slow. In addition to SWaP requirements, system designers must consider system component reliability and system availability

PAGE 13

because faulty space systems are either impossible or prohibitively expensive to service. Radiation-hardened devices, with increased protection from long-term radiation exposure (total ionizing dose), provide system reliability and correctness. While device-level radiation hardening increases component lifetimes and reliability, these hardened devices are expensive and have dramatically reduced performance as compared to non-hardened commercial-off-the-shelf (COTS) components.

In order to design an optimal system, the performance, SWaP, and reliability requirements of future space applications must be considered together. System design philosophy must consider the worst-case operating scenario (e.g., typically worst-case radiation), which may dramatically limit the overall system performance even though the worst-case scenario may be infrequent (e.g., radiation levels typically change based on orbital position). However, if components used for redundancy and reliability can be dynamically repurposed to perform useful, additional computation, future space systems could meet both high-performance and high-reliability requirements. Future systems could maintain high reliability during worst-case operating environments while achieving high performance during less radiation-intensive periods. Current design methodologies do not account for this type of adaptability. Therefore, in order for future space systems to achieve high levels of performance, more sophisticated and adaptive system design methodologies are necessary.

One approach for adaptive high-performance space system design leverages hardware-adaptive devices such as field-programmable gate arrays (FPGAs), which provide parallel computations at a high level of performance per unit size, mass, and power [Williams et al. 2010]. Fortunately, many space applications, such as synthetic aperture radar (SAR) [Le et al. 2004], hyperspectral imaging (HSI) [Hsueh and Chang 2008], image compression [Gupta et al. 2006], and other image processing applications [Dawood et al. 2002], where onboard data processing can significantly reduce data transmission requirements, are amenable to an FPGA's highly parallel architecture.

PAGE 14

Since reconfiguration enables FPGAs to perform a wide variety of application-specific tasks, these systems can rival application-specific integrated circuit (ASIC) performance while maintaining a general-purpose processor's flexibility. An SRAM-based FPGA can be reconfigured multiple times within a single application, allowing a single FPGA to be used for multiple functions by time-multiplexing the FPGA's hardware resources, reducing the number of concurrently active processing modules when an application does not require all processing modules all of the time. Thus, FPGA reconfiguration facilitates small, lightweight, yet powerful systems that can be optimized for a space application's time-varying hardware requirements.

In order to leverage FPGAs in space systems, the FPGA must operate correctly and reliably in high-radiation environments, such as those found in near-Earth orbits. Currently, most radiation-hardened FPGAs have antifuse-based configuration memories that are immune to single-event upsets (SEUs). However, these hardened FPGAs have reconfiguration limitations and small capacities, reducing the primary performance benefits offered by COTS SRAM-based FPGAs. Fortunately, when combined with special system design techniques, SRAM-based FPGAs can be viable for space systems.

An SRAM-based FPGA's primary computational limitation is the possibility of SEUs causing errors within the FPGA user logic and routing resources, which can manifest as configuration memory upsets or logic memory (e.g., flip-flops, user RAM) upsets (i.e., resulting in deviations from the expected application behavior). Fault-tolerant techniques, such as triple-modular redundancy (TMR) and memory scrubbing, can protect the system from most SEUs and significantly decrease the SEU-induced errors, but designing an FPGA-based space system using TMR introduces at least 200% area overhead for each protected module. Depending on the expected upset rates for a given space system, other lower-overhead fault tolerance methods could be used to provide sufficient reliability while maximizing the resources available for performance.
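The TMR trade-off described above can be illustrated with a small software model. The sketch below is not taken from the dissertation: it assumes each replica's output is an integer word, models the voter as a bitwise two-of-three majority, and counts only the two extra replicas when computing area overhead (the voter's own area is ignored, which is why the text says "at least" 200%). The function names are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def replication_overhead_percent(replicas: int) -> float:
    """Extra area, in percent of one module, added by replication alone.

    Assumption: voter and routing area are ignored, so this is a lower bound.
    """
    return (replicas - 1) * 100.0


# Illustrative values only: one replica suffers a single-bit upset.
golden = 0b10110010
upset = golden ^ 0b00001000

# A single faulty replica is masked by the other two.
masked = majority_vote(golden, upset, golden)

# Two replicas with the same upset outvote the correct one.
unmasked = majority_vote(upset, upset, golden)
```

In this model a single upset is masked, while two identical upsets defeat the voter, which is one reason TMR is usually paired with scrubbing to repair upsets before they accumulate.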

PAGE 15

When designing a traditional space system, system designers estimate the expected worst-case upset rates and include an additional safety margin. However, since single-event upset (SEU) rates vary based on orbital position and the majority of orbital positions experience relatively low upset rates, a system designed for the worst-case upset-rate scenario contains processing resources that are wasted during the frequent low-upset-rate periods. In order to provide the necessary reliability during high-upset-rate periods and reduce the processing overhead incurred during low-upset-rate periods, the fault tolerance method must change based on the current upset rate. For example, during high-upset-rate periods, the system can be reconfigured to provide high reliability at the expense of reduced processing capabilities, while during low-upset-rate periods the system can be reconfigured to provide higher performance by re-provisioning the excess hardware (used for high reliability during high-upset-rate periods) to application functionality. This upset-rate-based adaptability provides high performance while maintaining reliability.

This research proposes a framework for reconfigurable fault tolerance (RFT) that enables FPGA-based space systems to dynamically adapt the amount of fault tolerance based on the current upset rate. To realize this goal and validate the effectiveness of the approach, three areas must be addressed and investigated. First, a method for accurately estimating time-varying fault rates in space systems and a reliability and performance model for adaptive systems are needed to quantify the effectiveness of the RFT approach. Second, techniques for low-overhead fault tolerance which can be used within the RFT framework for improved performance must be investigated. Third, tools which facilitate the creation of the RFT hardware components are required to enable RFT integration into existing PR-based systems and architectures. The research presented in this document is divided into three phases, each proposing a solution to the goals listed above.
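The upset-rate-based adaptation described above can be sketched as a simple threshold policy. This is a minimal illustration under stated assumptions, not the RFT controller's actual logic: the threshold values, their units, the "high-performance" mode name, and both function names are hypothetical. TMR and duplication with compare (DWC) are the only mode names taken from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    replicas: int  # copies of each module instantiated in this mode


# Hypothetical mode set; only TMR and DWC are named in the source text.
HIGH_PERFORMANCE = Mode("high-performance", replicas=1)
DWC = Mode("DWC", replicas=2)
TMR = Mode("TMR", replicas=3)


def select_mode(upset_rate: float,
                dwc_threshold: float = 1.0,
                tmr_threshold: float = 10.0) -> Mode:
    """Pick a fault-tolerance mode from the current upset rate.

    Thresholds are illustrative placeholders in arbitrary rate units.
    """
    if upset_rate >= tmr_threshold:
        return TMR
    if upset_rate >= dwc_threshold:
        return DWC
    return HIGH_PERFORMANCE


def independent_tasks(total_regions: int, mode: Mode) -> int:
    """Number of distinct tasks that fit when each needs `replicas` regions."""
    return total_regions // mode.replicas


# Example: three reconfigurable regions across a made-up rate profile.
profile = [0.2, 0.5, 4.0, 25.0, 3.0, 0.1]
schedule = [(rate, select_mode(rate).name,
             independent_tasks(3, select_mode(rate))) for rate in profile]
```

With three regions, this policy runs three independent tasks in the low-rate periods and a single triplicated task in the high-rate period, which is the re-provisioning of redundant hardware that the text describes.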

PAGE 16

The first phase of this research addresses the need for performance and reliability modelling of adaptive systems with changing environments, such as FPGA-based space systems. A fault-rate estimation methodology for systems in near-Earth orbits is developed to estimate the time-varying fault rates experienced during a specified orbit. A phased-mission Markov model is then used to estimate performance and reliability of an adaptive system using several adaptation schedules. In this phase, we also develop and implement an RFT controller design which can provide adaptive fault tolerance in an FPGA. The reliability and performance models are then experimentally validated using fault-injection testing on the RFT controller.

The second phase of this research investigates the effectiveness of techniques for low-overhead fault tolerance in FPGA systems. Algorithm-based fault tolerance (ABFT) is a technique that can be used with many linear-algebra operations, such as matrix multiplication or LU decomposition [Huang and Abraham 1984], to provide fault tolerance with as little as 5-10% overhead. Traditionally, ABFT has been implemented in software, with multiprocessor arrays, and in hardware, with systolic arrays, to protect application datapaths. Our ABFT approach may be used in FPGA applications to provide both datapath and configuration memory protection with low overhead. Other fault-tolerance techniques (e.g., duplication with compare, error-correcting codes, concurrent error detection, reduced-precision redundancy) are also examined and evaluated for FPGA resource overhead and reliability within the context of the RFT framework.

In the third phase of this research, methods for integrating the RFT framework with pre-existing partial reconfiguration architectures will be examined, enabling fault tolerance for pre-existing systems with minimal design changes. Three aspects of system integration will be considered. First, a tool that will enable the dynamic creation of system-specific RFT controllers, allowing mission-specific, resource-optimized hardware. Second, support and integration of an RFT controller with the existing

PAGE 17

PR architectures to enable fault tolerance with minimal design modifications. Third, investigation of fault-tolerant task scheduling in the presence of changing fault rates.

The remaining chapters of this document are organized as follows. Chapter 2 surveys previous work related to the topics common to all three phases of this research. Chapter 3 describes the RFT hardware architecture, the RFT fault-rate model, and the RFT performability model and provides case-study examples for two near-Earth orbits. Chapter 4 evaluates the use of low-overhead fault tolerance techniques in FPGA systems, analyzes reliability results obtained using fault injection in matrix-multiplication (MM) and FFT case studies, and suggests design modifications for higher reliability. Chapter 5 demonstrates a dynamic RFT hardware-creation methodology and a point-to-point RFT controller implementation, and explores heuristics for an environmentally-aware task scheduler for RFT systems. Finally, Chapter 6 presents conclusions and outlines directions for possible future research.
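As a concrete illustration of the checksum-based ABFT for matrix multiplication cited above [Huang and Abraham 1984], the following sketch shows the general idea in software. It is a generic rendering of the checksum scheme, not the FPGA kernels developed in this work; all function names are hypothetical, and integer matrices are assumed so that checksum comparisons are exact (floating-point data would need a comparison threshold).

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_checksum(a):
    """Append a row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append to each row of `b` the sum of that row."""
    return [row + [sum(row)] for row in b]


def find_errors(c_full):
    """Return (bad_rows, bad_cols) whose data disagree with their checksums."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols


def correct_single_error(c_full):
    """Repair one corrupted data element located by its row and column."""
    bad_rows, bad_cols = find_errors(c_full)
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        m = len(c_full[0]) - 1
        others = sum(c_full[i][k] for k in range(m) if k != j)
        c_full[i][j] = c_full[i][m] - others
    return c_full


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Multiplying the checksum-augmented operands yields the product
# together with its own row and column checksums.
c_full = matmul(column_checksum(A), row_checksum(B))

# Inject a fault into one data element, then locate and repair it.
faulty = [row[:] for row in c_full]
faulty[0][1] += 4
located = find_errors(faulty)
repaired = correct_single_error([row[:] for row in faulty])
```

The extra work is one additional row and column per operand, which is why the relative overhead of this kind of scheme shrinks as the matrices grow.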

PAGE 18

CHAPTER 2
BACKGROUND AND RELATED RESEARCH

In this chapter, we motivate the need for SRAM-based FPGAs in space systems. The FPGAs used in space systems must be rated for a sufficient total accumulated ionizing dose (TID) to ensure long-term device functionality over a given mission duration. Single-event effects (SEEs), caused by collisions with high-energy protons and heavy ions (i.e., radiation), are the primary short-term reliability concern in space systems. In order for FPGA-based systems to maintain reliability, a combination of radiation-hardened FPGAs and complex fault-tolerance techniques is required to mitigate errors. Since many of the common fault-tolerance techniques require substantial temporal or spatial area overhead, new, low-overhead techniques allow more reconfigurable logic to be used for actual, useful computation instead of for redundancy.

2.1 FPGA Performance and Power Efficiency

SRAM-based FPGAs offer a very large amount of configurable logic and have the ability to modify portions of a design during run-time, giving these FPGAs the capability to efficiently perform a wide range of applications by using a high degree of parallelism while running at low clock rates. Williams et al. [2010] developed computational density metrics to quantify and predict performance of SRAM-based FPGAs for specific types of algorithms and applications. Their analysis showed that FPGAs were capable of providing between 3 and 60 times more performance per unit power than many conventional general-purpose processors, depending on the types of operations being considered. The performance and power efficiency of FPGAs are extremely desirable for the small power budgets of space systems.

2.2 Partial Reconfiguration

Partial reconfiguration (PR) enables a user to modify a portion of an FPGA's configuration while the remainder of the FPGA remains operational. PR time-multiplexes mutually exclusive application-specific processing modules on the same hardware and

PAGE 19

only the modules that need to be reconfigured halt operation, which makes PR attractive for real-time systems. Currently, Xilinx supports PR for the Virtex-4 and newer devices [Xilinx 2010a], while Altera has recently announced PR support for Stratix-V devices [Altera 2010]. During the system design phase, system designers define the FPGA's partially reconfigurable regions (PRRs) and partially reconfigurable modules (PRMs) and route signals to/from the PRRs through bus macro PR primitives (Xilinx 9.2 PR tool flow) or partition pins (Xilinx 12.1 PR tool flow). Partial bitstreams, communicated through external configuration interfaces (e.g., SelectMAP, JTAG), are used to reconfigure the PRRs with the PRMs. On Xilinx devices, the Internal Configuration Access Port (ICAP) is an internal configuration interface, allowing user logic to directly reconfigure PRRs, removing the need for additional external configuration support devices. Additionally, since partial bitstreams are typically much smaller than full bitstreams (which are used to configure the entire FPGA), PR reduces bitstream storage requirements. The size of each partial bitstream is related to the size of its PRR.

2.3 Single-Event Effects

SEEs occur when high-energy particles, such as protons, neutrons, or heavy ions, collide with silicon atoms on a device, depositing the ion's electric charge into the device's circuit. Protons and electrons are trapped within the Earth's Van Allen belts, while heavy ions are mainly produced by galactic cosmic rays and solar flares. When a high-energy particle collides with a silicon device, the energy of the collision can cause the logical values stored in sequential memory elements to be inverted [Karnik and Hazucha 2004]. Errors caused by these particles are often referred to as soft errors or SEUs, as there is no permanent circuitry damage and any affected memories can be corrected by re-writing the correct values. Single-event functional interrupts (SEFIs), another type of SEE, can cause a semi-permanent fault that requires a circuit to be

power-cycled to restore correct operation. Single-event latchups (SELs) are destructive SEEs that occur when a particle causes a parasitic forward-biased structure within the device substrate that can allow destructively high amounts of current to flow through the substrate, potentially damaging the device. SELs can be avoided by using appropriate device manufacturing processes, and devices produced using silicon-on-insulator (SOI) processes are largely immune. SEL immunity is an important property for selecting devices for many space systems.

2.4 FPGAs in Space Systems

FPGA configuration memories can be constructed using several different technologies (e.g., antifuse, flash, and SRAM), each with performance and reliability tradeoffs. Traditionally, antifuse-based FPGAs have been used in space systems to provide simple processing capabilities and glue logic to interconnect multiple peripherals or to combine/replace the functionality of multiple ASICs. Antifuse-based FPGAs are one-time programmable, and the configuration process creates/fixes the FPGA's physical routing interconnect structure. This configuration process provides an inherent level of fault tolerance from SEUs since the antifuse-based routing cannot be reversed. Additionally, many commercially available antifuse-based FPGAs (e.g., Actel RTSX-SU [Actel 2010a]) include replicated flip-flop cells to prevent upsets in the sequential logic. Antifuse devices also generally have a high TID threshold and immunity to SELs. Unfortunately, when compared to flash-based or SRAM-based FPGAs, antifuse-based FPGAs contain a relatively small number of available logic gates and the fixed-logic structure limits performance potential.

Flash-based FPGAs (e.g., Actel RT-ProASIC3 [Actel 2010b]) attempt to maintain an antifuse-based FPGA's reliability while increasing the amount of configurable logic (logic available for a system designer to implement application functionality) and allowing reconfiguration. Flash-based FPGA configuration memories are composed of radiation-tolerant flash cells that provide reliability for combinational logic; however,

system designers must insert sequential logic replication to fully protect the FPGA from faults. Even though flash-based FPGAs can be fully reconfigured to support multiple applications, flash-based FPGAs do not support the PR capability that is available on some SRAM-based FPGAs due to a lack of architectural and vendor support for such a capability. Concerns over the TID effects on flash-based logic (floating-gate transistors) have prevented widespread acceptance of flash-based FPGAs in space systems [Wang 2003].

SRAM-based FPGAs are the most radiation-susceptible type of FPGA since the design functionality is stored in vulnerable SRAM cells (i.e., configuration memory), and configuration memory upsets cause functional changes to the design's logic. Traditionally, this vulnerability has prevented SRAM-based FPGAs from being used in highly critical space applications; however, some space systems use space-qualified SRAM-based FPGAs for onboard processing. Space-qualified FPGAs are similar to COTS FPGAs but are produced using epitaxial wafers, use ceramic, hermetically-sealed packaging, and have been tested to ensure that damaging SEL events will not compromise the system. These devices are rated for TID levels high enough to be used in space systems [Xilinx 2010c]. Even with these reliability techniques, these space systems must still use several fault-mitigation strategies, such as TMR and configuration scrubbing, to ensure that system upsets are detectable and recoverable. (We note that unless specified, all further FPGA references implicitly refer to SRAM-based FPGAs.)

Traditional FPGA-based space system designs leverage spatial TMR. TMR uses a reliable majority voter connected to three identical module replicas in order to detect and mask errors in any one module. In the context of TMR, a module refers to the functional unit being replicated, which can range from a single logic gate to an entire device. There are two primary TMR variations: external and internal. External TMR uses three independent FPGAs working in lockstep where each FPGA implements a module replica and the outputs are connected to an external radiation-hardened voter

that compares the results. External TMR requires significant hardware overhead (each protected module is triplicated and board layout complexity is significantly increased), but is reliable. The RCC board produced by SEAKR Engineering [Troxel et al. 2008] uses external TMR to provide reliable computation using Xilinx Virtex-4 FPGAs. Alternatively, internal TMR creates three identical modules within a single FPGA, and the majority voter resides internally or externally [Carmichael et al. 1999]. Internal TMR can reduce the number of physical FPGAs required to implement a space application, but may increase the chance of a common-mode failure, where multiple modules fail simultaneously from a single fault. For example, a SEFI may cause multiple internally-replicated modules to fail, whereas externally-replicated FPGAs would be immune. Several tools assist system designers in incorporating TMR into space system designs [Pratt et al. 2006; Xilinx 2004].

In addition to TMR, configuration scrubbing prevents error accumulation in FPGA configuration memory. While TMR masks individual errors, TMR does not correct the underlying fault and cannot correct errors that occur in multiple modules. Scrubbing uses an external device to read back the FPGA's configuration memory and compares the read configuration memory to a known good copy. Alternatively, some FPGAs calculate an error correction code (ECC) during configuration read-back for every configuration frame, which can be used to detect and correct configuration faults. If a mismatch is detected, the correct configuration can be written using PR without halting the entire FPGA operation [Xilinx 2010a]. Traditionally, scrubbing is performed by an external radiation-hardened microcontroller to ensure reliability of the reconfiguration process. However, a self-scrubber may be implemented within the FPGA using the ICAP available in Xilinx Virtex-4 and newer devices. Xilinx has also developed a soft error mitigation (SEM) IP core which can perform configuration memory error detection and correction for user designs [Xilinx 2013b].
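The two mechanisms described above, majority voting and read-back scrubbing, can be sketched in a few lines of Python. This is a minimal software illustration under stated assumptions, not the voter or scrubber of any cited system: replica outputs and configuration frames are modeled as plain integers, and the names `majority_vote` and `scrub` are hypothetical.

```python
from typing import List


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at
    least two of the three replica outputs, masking one faulty replica."""
    return (a & b) | (a & c) | (b & c)


def scrub(frames: List[int], golden: List[int]) -> List[int]:
    """Compare read-back frames against a known good copy, rewrite any frame
    that differs, and return the indices of the repaired frames."""
    repaired = []
    for index, (frame, good) in enumerate(zip(frames, golden)):
        if frame != good:
            # Assumption: this assignment stands in for a partial
            # reconfiguration write of the corrected frame.
            frames[index] = good
            repaired.append(index)
    return repaired


# One replica suffers a bit flip; the voter still returns the correct word.
assert majority_vote(0b1011, 0b1011, 0b0011) == 0b1011
```

The sketch reflects the division of labor in the text: the voter masks an error in one replica without repairing it, while the scrubber repairs the stored configuration so that errors do not accumulate.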

Despite these drawbacks, SRAM-based FPGAs have been used in many space systems, including earth-observing science satellites, communication satellites, and satellites and rovers for the Venus and Mars missions. For instance, space-qualified Xilinx Virtex-1000 (XQVR1000) devices were used on the Mars Exploration Rovers for motor control functions and four XQR4000XLs were used for the lander pyrotechnics [Ratter 2004]. Configuration read-back and scrubbing were used for detection and correction of SEUs, and the full system was cycled once per Martian evening to remove persistent errors. These systems, which contained very little logic as compared to today's standards and lacked the ability for PR, were not used for data processing.

The increased logic resources on recent FPGA families enable future systems to use FPGAs for image processing and other computation-intensive applications. The SpaceCube created at NASA Goddard Space Flight Center uses multiple commercial Xilinx Virtex-4 FPGAs along with a radiation-hardened processor to provide onboard processing capabilities for a variety of missions [Flatley 2010]. The SpaceCube provided computational power for the Relative Navigation Sensors (RNS) experiment during Hubble Servicing Mission 4 and is currently being used as an on-orbit test platform aboard the Naval Research Laboratory's MISSE-7 experiment on the International Space Station (ISS). Each FPGA in the system contains a self-scrubber module, protected with TMR, to correct errors and prevent error accumulation.

2.5 Low-Overhead Fault Tolerance Methods

While TMR is the most common fault tolerance method for FPGAs, the high area overhead due to replicating modules partially negates some FPGA benefits. Therefore, much research has focused on developing alternative, low-overhead fault-tolerance methods for low-upset environments, such as replication-based fault tolerance [Laprie et al. 1990], ECCs [Rao and Fujiwara 1989], and application-specific optimizations to provide low-cost reliability. Many of these alternative techniques can detect errors quickly, but may require additional processing or complete re-computation to correct

the errors. When the expected upset rates are low, the re-computation rates may be acceptably low, depending on application throughput requirements.

Replication-based fault tolerance represents the most commonly used type of fault-mitigation strategy, due to conceptual simplicity and high fault coverage. TMR can detect single errors and can correct/mask single errors using a majority voter. Duplication with compare (DWC) is an alternative replication-based method that compares the outputs of duplicated modules. DWC reduces the resource overhead by one half as compared to TMR, but DWC cannot correct errors and must fully re-compute data when errors are detected [Johnson et al. 2008]. Shim et al. [2004] proposed reduced-precision redundancy (RPR) for numerical computation. RPR triplicates an application module as in TMR, but the replicas have lower precision or only operate on the most-significant bits of application data, ensuring that the most-significant bits are protected and errors in the least-significant bits are treated as noise in the system. RPR reduced the number of detectable faults (fault coverage) but resulted in significant resource savings while maintaining sufficient signal-to-noise ratios for many DSP applications. Morgan et al. [2007] investigated the use of temporal redundancy and quadded logic as additional methods for providing fault tolerance through redundancy. However, due to inefficient mapping to the underlying FPGA architecture, their methods did not provide fault tolerance improvement over TMR and imposed a large area overhead. The effects of the fault-tolerant methods were offset by the increased cross-sectional vulnerability of the larger FPGA designs.

Even though these replication-based methods for fault tolerance are suitable for FPGAs, several new fault-tolerant FPGA architectures leverage an FPGA's high capacity and flexibility while maintaining reliability. Alnajiar et al. [2009] proposed a hypothetical coarse-grained multi-context FPGA architecture that supported TMR, DWC, and single-context modes. Their work explored the effects of soft errors and aging on the proposed architecture, with an example Viterbi decoder module mapped to the

hypothetical architecture. Kyriakoulakos et al. [2009] proposed simple modifications to the current Virtex-5 and Virtex-6 architectures to allow native support for DWC and TMR. By adding an XOR-gate between each 5-input LUT within a larger 6-input LUT, the existing architecture and synthesis tools required only minimal changes to support their approach, while incurring only 17.5% and 76% slice utilization overhead for DWC and TMR, respectively.

2.6 Algorithm-Based Fault Tolerance

Algorithm-based fault tolerance (ABFT) is a method that can be used with many linear-algebra operations to provide fault tolerance without the use of explicit replication. The ABFT method was originally described for matrix multiplication and LU decomposition [Huang and Abraham 1984] but has been expanded to protect other algorithms composed of linear operations, such as QR decomposition, Fast Fourier Transform [Tao and Hartmann 1993; Wang and Jha 1994], and finite element analysis [Mishra and Banerjee 2003; Roy-Chowdhury et al. 1996]. The traditional description of ABFT was designed for systolic arrays, but the method has also been used in multiprocessor-based, high-performance computing [Yao et al. 2012].

ABFT augments an original data matrix with row and/or column checksums and the linear-algebra operation is performed on the new, augmented matrix. If the linear-algebra operation is computed successfully, the resulting augmented matrix will contain valid, consistent checksums [Huang and Abraham 1984]. ABFT checksum generation and comparison have lower computational complexity than the primary linear-algebra operation. ABFT computational overhead is generally low and, as a proportion of total computation, decreases as the matrix size increases. The mathematical basis of ABFT for the matrix multiplication and FFT algorithms will be shown in Section 3 and Section 4, respectively.

When using ABFT with floating-point algorithms, the introduction of rounding and precision errors requires the use of a threshold comparison to differentiate rounding

errors from incorrectly computed data. The determination of a sufficient threshold is significantly influenced by properties of the input data, and can affect error coverage and the number of false positive results [Chowdhury and Banerjee 1996; Roy-Chowdhury and Banerjee 1993]. Additionally, the original description of ABFT considered algorithms executed on systolic arrays, with each processing element computing a single data element of the result matrix. In these systems, a single fault could not propagate to other processing elements and would only result in a single erroneous result element. Silva et al. [1998] investigated these vulnerabilities in traditional ABFT implementations and proposed methods for improving fault coverage using a Robust ABFT approach.

2.7 Task Scheduling

Task scheduling algorithms can be categorized as either online or offline. Offline scheduling algorithms have complete knowledge of all tasks that must be scheduled. The general scheduling optimization problem is NP-hard, but efficient heuristics exist, and offline schedules can be pre-determined at compile-time. With online scheduling, tasks arrive at the scheduler periodically over time, and must be placed around previously scheduled tasks. Task arrival rates and patterns can greatly affect the quality of the scheduler's results. Arndt et al. [2000] examined many online scheduling algorithms for distributed parallel computers, and used simulation to evaluate their performance. Of the several algorithms studied, the First Fit algorithm, which prioritizes scheduling by the arrival time of each task, provided good schedule lengths while minimizing the average wait times.

Real-time systems introduce additional constraints for task scheduling. Each task must be completed by a deadline, otherwise the results will no longer be relevant or needed. A hard deadline must be met, otherwise the system is considered failed. A firm deadline can be missed, but the usefulness of the result after the deadline is zero. A soft deadline can be missed, but the value of the result decreases after the deadline has passed. For a hard real-time system all deadlines must be met, but the goal of a soft

real-time system is to meet as many deadlines as possible while optimizing for other criteria. Traditionally, schedulers attempt to minimize criteria such as makespan (total schedule length) or average task latency.

Han et al. [2003] created a fault-tolerant scheduling algorithm for periodic real-time software tasks. For each primary task, an alternate less-precise task is also used to generate a sufficient result before the deadline. These alternate tasks are scheduled as close to the task deadline as possible. In the case of a primary task failure, the alternate task will be executed. If the primary task succeeds, the alternate task is discarded. Their algorithm is intended for offline use and to protect systems against software faults. Pathan [2006] extends the rate-monotonic scheduling algorithm to support temporal TMR (RM-FT), scheduling multiple copies of tasks to mask faults in any one copy. RM-FT also requires tasks to be periodic in order to perform a scheduling analysis. Scheduling aperiodic real-time tasks is a more difficult problem and is currently being studied.

Scheduling tasks for reconfigurable computing creates an additional level of complexity. Instead of scheduling a task for a one-dimensional array of processors, scheduling on a two-dimensional FPGA fabric becomes a constrained placement problem. Additionally, tasks may have multiple hardware or software implementations, increasing the overall search space. Banerjee et al. [2005] present an offline KLFM heuristic [Kernighan and Lin 1970] which incorporates detailed placement information in order to provide high-quality schedules. Mei et al. [2000] combine a genetic algorithm to determine HW/SW placement with a traditional list scheduling algorithm to enable online scheduling of real-time reconfigurable embedded systems. Steiger et al. [2004] developed two heuristic scheduling algorithms, Horizon and Stuffing, which provide good results while limiting the computational requirements.
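The hard, firm, and soft deadline classes defined earlier in this section differ only in what a late result is worth, which can be illustrated with a small value function. The sketch below is hypothetical: the function name `result_value`, the exception `HardDeadlineMiss`, and in particular the linear decay used for soft deadlines are assumptions, since the text states only that the value of a soft-deadline result decreases after the deadline.

```python
class HardDeadlineMiss(Exception):
    """Raised when a hard deadline is missed (the system is considered failed)."""


def result_value(kind: str, finish: float, deadline: float,
                 full_value: float = 1.0, decay_per_unit: float = 0.1) -> float:
    """Value of a task result under the hard, firm, and soft deadline classes."""
    if finish <= deadline:
        return full_value  # on-time results keep their full value in every class
    lateness = finish - deadline
    if kind == "hard":
        raise HardDeadlineMiss(f"hard deadline missed by {lateness}")
    if kind == "firm":
        return 0.0  # a late result is useless, but the system has not failed
    if kind == "soft":
        # Assumption: linear decay with lateness, floored at zero.
        return max(0.0, full_value - decay_per_unit * lateness)
    raise ValueError(f"unknown deadline kind: {kind!r}")
```

A soft real-time scheduler in this picture would try to maximize the sum of `result_value` over all tasks, whereas a hard real-time scheduler must never trigger the exception.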

CHAPTER 3
FRAMEWORK FOR RECONFIGURABLE FAULT TOLERANCE

RFT leverages COTS components in space systems to achieve high performance and flexibility while maintaining reliability. Beyond traditional, spatial TMR, alternative fault-mitigation methods may be appropriate given a particular application's performance requirements and set of expected environmental factors (e.g., radiation). Other methods, such as temporal TMR, ABFT, software-implemented fault tolerance (SIFT), or checkpointing and rollback, may be suitable for system-level protection. Each alternative fault-mitigation method has tradeoffs between performance, reliability, and overhead, and Pareto-optimal operation may change over a system's lifetime. Therefore, the main goal of our proposed RFT framework is to enable a system to autonomously adapt to Pareto-optimal operation based on the system's current environmental situation.

Our RFT framework consists of three main elements: a PR-based hardware architecture; a fault-rate model for estimating orbital SEU rates; and a methodology for modeling RFT system performability. The hardware architecture, described in Section 3.1, is similar to other traditional SoC architectures used for reconfigurable computing, with multiple, identical PRRs, which are leveraged for module redundancy. The hardware architecture allows the system to execute each processing module (PRM) in several possible fault-tolerance modes, each with differing performance and reliability characteristics. RFT software adapts the amount of per-module redundancy in accordance with the current environment and temporarily pauses only the reconfigured hardware modules to preserve and record application state while changing the fault-tolerance mode, allowing the remainder of the system to continue processing. Additionally, modules with constant state can continue processing during the adaptation process. Section 3.2 presents a model for estimating expected fault rates in potential space-system orbits. Finally, Section 3.3 presents a performability model that quantifies the RFT benefits in environments with varying upset rates.

3.1 RFT Hardware Architecture

Figure 3-1 shows the high-level architecture of an FPGA-based SoC design with PRRs integrated with an RFT controller. The main architectural components include a microprocessor, a memory controller, I/O ports, PRRs (1...N) for PRMs, and the system interconnect, which connects all of these components to the microprocessor. Since we leverage Xilinx FPGAs, the microprocessor is a MicroBlaze (less resource-intensive processors such as PicoBlaze can also be used) and the system interconnect is a processor local bus (PLB). The MicroBlaze orchestrates PRR reconfiguration using the ICAP, maintains the state of the currently active PRMs, and initiates fault-tolerance mode switching. All architectural components except for the PRRs are protected using TMR since these components' functionality is crucial to the entire system's reliability. Although the ICAP cannot be replicated, the signals to and from the ICAP are also protected with TMR. Tools such as Xilinx's TMRTool [Xilinx 2004] or BYU's EDIF-based TMR tool [Pratt et al. 2006] automate TMR design creation by applying low-level TMR voting on the design's original, unprotected netlist. For additional SEU protection, FPGA configuration scrubbing should be performed in order to prevent error accumulation. Scrubbing can be performed with an external scrubber and radiation-hardened configuration storage or with an internal scrubber using the internal configuration ECC present in Virtex-4 and later devices.

Each PRM uses a PLB-compatible interface that connects to the RFT controller or directly to the PLB based on whether the PRM is an RFT-enabled module (a module replicated/instantiated by the RFT controller) or not, respectively. The RFT controller instantiates the bus macros or other low-level components required for interfacing with the PRRs. The RFT controller also contains multiple majority voters and comparators (voting logic) that can be used to detect or correct errors by evaluating the replicated PRMs' outputs. The RFT-enabled modules are used in parallel to create

redundancy-based, fault-protection modes (e.g., DWC, TMR) by interfacing with the RFT controller's voting logic. Additionally, other single-module fault-protection modes, such as ABFT, can be used for individual PRRs, and the RFT controller provides additional fault tolerance components, such as watchdog timers, to detect hanging conditions within PRMs. Table 3-1 lists the currently supported fault-tolerance modes of RFT and the modes' fault-tolerance type and PRR requirements. The MicroBlaze evaluates the system's current performance requirements and monitors external stimuli (radiation) using external sensors to determine when the fault-tolerance mode should be switched, at which time the MicroBlaze reconfigures the appropriate PRRs and the RFT controller's internal voting scheme between the PRRs' outputs for the new fault-tolerance mode.

3.1.1 RFT Controller Operation

Figure 3-2 illustrates the interface between PRRs and the PLB for an RFT controller that can operate in single-module and redundancy-based fault-tolerance modes. This RFT controller interface routes input signals from the system interconnect to the appropriate PRRs and routes voting output signals back to the PLB. The abstraction of this logic into the RFT controller enables pre-existing PRMs to leverage the fault-tolerance modes and interface with the RFT controller with minimal modifications.

To communicate data and control signals between the MicroBlaze, the PRRs, and the RFT controller, the system designer assigns, at design time, a large memory-mapped region of the MicroBlaze's address space to the RFT controller. In order to route signals to specified PRRs, the system designer also subdivides this region into smaller subregions and assigns a subregion to each PRR, while taking into consideration the memory interface requirements of each potential PRM. Each PRM must implement the actual memory interface (e.g., dual-port RAM, FIFO-based interface), which allows pre-existing PRMs to interface with the RFT controller with minor modifications since no specific interface is required. When communication data from the MicroBlaze arrives

over the PLB, the RFT controller's address decoder processes the input address and determines the destination PRR(s) based on the current fault-tolerance mode. While operating in single-module fault-tolerance modes, the data is passed directly to the PRR specified by the decoded address. While operating in redundancy-based fault-tolerance modes, the data is passed to multiple PRRs using appropriate enable signals to the PRRs' interfaces.

The RFT controller routes the outputs from each individual PRR, the outputs from the voting logic via the FT-Mode register, or recorded internal status information over the PLB to the MicroBlaze. When the MicroBlaze requests a PRR's output, the RFT controller's output multiplexer (Output Mux in Fig. 3-2) selects the appropriate PRR output to route to the PLB controller based on the decoded address from the MicroBlaze and the current FT-Mode value. In single-module fault-tolerance modes, the RFT controller's output multiplexer routes signals directly from the PRRs to the PLB controller, which bypasses the RFT controller's internal voting logic. In redundancy-based fault-tolerance modes, the output multiplexer routes the verified outputs from the RFT controller's voting logic to the PLB controller.

In addition to providing routing and voting logic for redundancy-based fault-tolerance modes, the RFT controller supports additional fault-tolerance capabilities that do not require redundant PRMs. If the system is operating in a single-module fault-tolerance mode, each PRM may use the RFT controller's watchdog timers or perform internal fault detection. The RFT controller provides each PRR with an optional watchdog timer interface using a signal that must be asserted periodically within a user-defined time interval (usually on the order of seconds). If the PRM does not assert the watchdog timer reset signal within the time interval, the RFT controller's interrupt generator asserts an interrupt signal to the MicroBlaze, alerting the MicroBlaze of a possible failed PRR. Additionally, PRMs that perform internal fault detection must include an interrupt signal to notify the RFT controller of internally detected errors, which the RFT

controller propagates to the MicroBlaze. PRMs using ABFT perform self-checking (or self-correction if the ABFT operation permits) on internally generated checksums and send an interrupt signal to the RFT controller when data errors are detected. Similarly, PRMs that use internal TMR can also signal detected and corrected errors to the RFT controller. PRMs without any specific fault-tolerance features can also use the RFT controller's watchdog timer, which still allows the RFT controller to detect module hang-ups or other operational errors.

3.1.2 MicroBlaze Operation

In an RFT system, the MicroBlaze enables additional operational features, such as fault-tolerance methods to complement the fault-tolerance modes offered by the RFT controller. Additionally, the MicroBlaze maintains the system's fault-tolerance state (e.g., active PRMs and current PRRs' fault-tolerance modes) and orchestrates PRR reconfiguration and the fault-tolerance mode switching process.

To protect the application state while reconfiguring PRRs, the MicroBlaze provides support for PRM checkpointing and rollback. A checkpoint, which the MicroBlaze stores in external memory, consists of the minimum set of application state information needed to restore the application's current state. In the event of a fault, an application can use the previous checkpoint to roll back to a known good state, instead of wasting execution time by beginning execution at an initial starting state. A PRM's state can be checkpointed if that state can be read and modified by the MicroBlaze. In an RFT system, the state of all PRMs should be checkpointed periodically in order to reduce wasted computation in the event of a fault-induced PRM restart. The stored checkpoints can then be used during fault recovery procedures to improve system availability.

In addition to handling PRM checkpointing, the MicroBlaze also handles fault recovery and reconfiguration procedures. If the RFT controller's voting logic detects faults in the PRMs' outputs, the RFT controller records fault status information about the faulty PRM (e.g., error location and time) in internal FT-status registers and sends an

interrupt to the MicroBlaze, which initiates the reconfiguration procedure. The FT-status registers record fault status information that may be used by the MicroBlaze to make fault-tolerance mode decisions or to provide system log information to system operators.

For single-module fault-tolerance modes, the MicroBlaze corrects the faulty PRM by reconfiguring the PRR with the PRM's original bitstream over the ICAP. For the DWC fault-tolerance mode, both of the associated PRMs must be reconfigured since the faulty PRM cannot be identified. If checkpoints exist for the PRMs, the MicroBlaze initializes the new PRMs using these checkpoints. Alternatively, if the RFT system is operating in the TMR fault-tolerance mode, the MicroBlaze checkpoints one of the non-faulty PRMs while the faulty PRM is reconfigured. After the non-faulty PRM has been checkpointed, the RFT controller pauses the non-faulty PRMs using clock gating to keep the PRMs synchronized. Once the faulty PRM has been reconfigured, the MicroBlaze initializes the newly reconfigured PRM from the most recent checkpoint, and the RFT controller can re-enable the clocks for all three of the replicated PRMs to resume operation.

Finally, the MicroBlaze orchestrates the switching procedure between different RFT fault-tolerance modes. Fault-tolerance mode switching can be triggered by external events, by a priori knowledge of the operating environment, or by application-triggered events, and the fault-tolerance mode switching procedure may vary on a per-system basis depending on the specific system performance and reliability requirements. The MicroBlaze can use information from the FT-status registers to make Pareto-optimal decisions about future fault-tolerance modes. Before fault-tolerance mode switching, the MicroBlaze ensures that sufficient PRRs are available for the new fault-tolerance mode. Additional PRRs may be required or PRRs may be freed when switching from a single-module fault-tolerance mode to a redundancy-based fault-tolerance mode or vice versa, respectively. The MicroBlaze signals the RFT controller to change fault-tolerance modes by writing to the RFT controller's FT-Mode register. The MicroBlaze reconfigures the PRRs involved in the fault-tolerance mode switching via the ICAP with partial

bitstreams for the appropriate PRM or a blank partial bitstream if the PRR is not required for the new fault-tolerance mode. When the reconfiguration process is complete, the MicroBlaze signals the RFT controller, by rewriting the FT-Mode register, to resume PRM operation.

3.1.3 Environment-Based Fault Mitigation

RFT fault-tolerance mode switching can be triggered by a priori knowledge of the operating environment, by application-triggered events, or by external events. While a priori knowledge and application-triggered events are convenient for modeling purposes, real-world systems leverage measurements from attached sensors to determine the system's current environmental status. Additionally, due to the unpredictability of space weather conditions, such as solar flare events, an RFT system must be able to respond dynamically to the changing environment.

In an RFT system, the current expected fault rate can be estimated either directly or indirectly. An external radiation sensor can be directly interfaced with the FPGA, allowing the MicroBlaze to track the current fault rate and predict future fault rates. Alternatively, the RFT system can indirectly determine fault rates by tracking the number of data and configuration faults detected during operation. Since an FPGA's fabric is composed of large SRAM arrays, these arrays can be used as makeshift radiation detectors. If a fault is detected in the FPGA configuration or data memory, either through readback during scrubbing or from the RFT controller's logic, the fault can be recorded or can be used to make decisions about which RFT fault-tolerance mode to use.

One simple, rule-based method for choosing the RFT fault-tolerance mode is to use a sliding window approach. For instance, a system may use the following rules:

1. If there were any faults in the past 5 minutes, use the DWC fault-tolerance mode.
2. If there were more than 5 faults in the past 5 minutes, use the TMR fault-tolerance mode.

3. Only transition from the TMR fault-tolerance mode to a lower-reliability mode after 5 minutes of fault-free operation.

The size of the sliding window and the RFT fault-tolerance mode choices are system- and mission-dependent. A larger window size can provide a more conservative fault-tolerance strategy, while a small window size can more quickly adapt to spikes in the experienced fault rate. Implementing a rule with hysteresis, such as rule (3), can produce a more reliable strategy, while requiring a smaller window size.

3.1.4 RFT Controller Resource and Performance Overheads

To quantify the RFT controller's resource overhead, we implemented an RFT-based system on a Virtex-4 FX60-based platform, similar to the SpaceCube. The design uses six PRRs connected to a MicroBlaze through a PLB-connected RFT controller and operates at 100 MHz. Each PRR contains 2,000 slices for user logic. Each PRR can operate in high-performance (HP) mode, DWC-mode with the PRR's neighboring PRRs, or TMR-mode with two consecutive, neighboring PRRs. The resources (LUT, FF, and slice) required for the RFT controller and constituent modules are detailed in Table 3-2. The final column of Table 3-2 shows the percentage of the Virtex-4 FX60 used by each module. The RFT controller logic, excluding the PLB controller that would be required for non-RFT designs, requires approximately 900 slices. The largest modules within the RFT controller are the (optional) watchdog timers and the voting logic. Overall, the complete RFT controller uses approximately 5% of the total FPGA.

Since frequent PRR reconfiguration combined with lengthy PRR reconfiguration time can impose a performance overhead, we measured the reconfiguration time of a single PRR. The PRR reconfiguration time was measured using timing functions on a MicroBlaze with a PLB-attached ICAP controller running at 100 MHz. We measured the PRR reconfiguration time for a 2,000-slice PRR as approximately 15 ms. Even though the RFT mode-switching overhead is dominated by this PRR reconfiguration time, the mode-switching time incurs a negligible performance penalty due to the likely

low frequency of mode-switching. Additionally, the switching overhead only occurs when increasing the fault tolerance redundancy (e.g., switching from DWC to TMR). When decreasing the fault tolerance redundancy, the retained modules can continue running without interruption.

3.2 RFT Fault-Rate Model

In order to analyze RFT's reliability benefits, a suitable fault-rate model that incorporates varying fault rates, system performance capabilities, and system fault-tolerance modes is required. Traditional reliability analysis focuses on quantifying permanent hardware failures in systems with long lifetimes. Since most processors and FPGAs have sufficiently long lifetimes, permanent hardware failures due to long-term, end-of-life failures can be ignored. Therefore, reliability analysis for space systems focuses on short-term failure analysis by modeling SEU-induced computational failures. For FPGA-based systems, these failures can cause either single or multiple data errors through corrupted data or configuration memory.

FPGA upset rates for space systems are correlated with the magnetic field strength of the system's current orbital position. For example, as a system passes into the Van Allen radiation belt, the trapped charged particles in the belt have a higher likelihood of interacting with the system. For systems in a low-earth orbit, only a portion of the orbit passes through the inner Van Allen belt, at a location known as the South Atlantic Anomaly (SAA). The SAA represents the low-altitude area with a large number of trapped particles from the inner Van Allen belt's closest point to the Earth. Existing fault-rate modeling generally produces a single, average orbital upset rate; however, this average is not sufficient for RFT. In order for RFT to adapt the fault-tolerance mode, the fault-rate model must estimate the expected fault rates based on the instantaneous orbital position. To account for the orbit-dependent, time-varying fault rates, data from several sources can be combined to form a more accurate estimate than a single average rate. Our fault-rate model combines orbital position

and trajectory, magnetic field strength, and cosmic ray particle interaction data to provide an accurate estimate of instantaneous fault rates. We use these fault-rate estimates as input into multiple system-level Markov models in order to calculate reliability, availability, and performability of RFT systems.

Our RFT fault-rate model combines three existing models to estimate time-varying fault rates. Figure 3-3 illustrates the three models used as well as the inputs and outputs to each model. To generate accurate fault estimates, a system's time-varying orbital position must be modeled. Orbital position can be estimated using the SGP4 model, which is a simplified general perturbation modeling algorithm. NORAD, which is responsible for tracking space objects and space debris, developed SGP4 for tracking near-earth satellites [Hoots and Roehrich 1980]. SGP4 accurately calculates a satellite's position given a set of orbital elements (apogee, perigee, inclination, etc.) collectively referred to as a two-line element (TLE). Given a system's TLE, the RFT fault-rate model uses SGP4 to generate the system's position information, in terms of latitude, longitude, and altitude, over the user-defined modeling time period.

Next, the RFT fault-rate model passes the SGP4 positioning information for each point along a specified orbit to the International Association of Geomagnetism and Aeronomy's (IAGA) International Geomagnetic Reference Field (IGRF) model [Maus et al. 2005], which models the Earth's magnetosphere. IGRF combines magnetosphere data collected from satellites and observatories around the world in order to create the most accurate and up-to-date model possible (the model is updated every five years). For a given orbital position, IGRF outputs a McIlwain L-parameter, representing the set of magnetic field lines that cross the Earth's magnetic equator at a number of Earth-radii equal to the value of the L-parameter. The inner Van Allen radiation belt corresponds to L-values between 1.5 and 2.5. The outer Van Allen belt corresponds to L-values between 4 and 6. In addition to identifying regions with trapped particles, the McIlwain L-parameter can be used to estimate the effect of geomagnetic shielding (cutoff rigidity)

PAGE 38

from galactic cosmic rays. The estimated L-parameters are then used, along with the outputs of CREME96, to estimate SEU rates.

The CREME96 model [Tylka et al. 1997], a publicly available SEU estimation tool, generates fault-rate estimates and has been used extensively to predict heavy-ion and proton-induced SEU rates, as well as to estimate the expected total ionizing dose in modern electronics. CREME96 combines orbital parameters, space system physical characteristics, and silicon device process information to create a highly accurate SEU simulation. Traditionally, CREME96 generates an average fault rate for a particular orbit by averaging hundreds of orbits together. However, CREME96 can also generate fault rates for orbital segments, which can be segmented using the McIlwain L-parameter. By running several simulations with very narrow L-parameter segments, CREME96 can obtain estimated fault rates for each segment. As the width of each segment decreases, the generated fault rates become more precise and continuous. The L-parameter outputs from the IGRF model are then mapped to the appropriate orbital segment and the associated fault rate. The SGP4, IGRF, and CREME96 models collectively generate a time-varying fault-rate estimate over the course of a specified orbit.

We implemented the RFT fault-rate model as a C++-based program that connects the separate models together and passes data between them. SGP4's algorithms, along with C++/Java reference code, are publicly available via the Internet. A FORTRAN-based implementation of the IGRF algorithm for calculating position-based McIlwain L-parameters is publicly available from NASA Goddard [Macmillan and Maus 2010]. Fault-rate information can be generated from the CREME96 model through a web-based interface and stored for efficient offline use. Orbits described by TLEs can be visualized using open-source tools, such as JSatTrak [Gano 2010]. The RFT fault-rate model program accepts TLE data as input and generates time-varying fault-rate estimates as output. These fault-rate estimates are then used by the RFT performability model.
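
The final mapping step described above (L-parameter to orbital segment to stored fault rate) can be sketched as a simple table lookup. This is only an illustration in Python, not the C++ program used in this work: the segment boundaries loosely follow the belt L-values quoted in this section, while the rates, the names `SEGMENT_UPPER_L`, `SEGMENT_RATE`, and `RATE_BEYOND_LAST`, and both functions are hypothetical placeholders rather than CREME96 output.

```python
from bisect import bisect_right

# Hypothetical segment table. Each entry of SEGMENT_UPPER_L is the upper
# L-value bound of one segment; SEGMENT_RATE holds an illustrative upset
# rate (faults per device-day) for that segment. These numbers are
# placeholders, not results from CREME96.
SEGMENT_UPPER_L = [1.2, 1.5, 2.5, 4.0, 6.0]
SEGMENT_RATE = [5.0, 20.0, 400.0, 30.0, 150.0]
RATE_BEYOND_LAST = 60.0  # assumed rate for L-values above the last bound


def fault_rate_for_l(l_value):
    """Return the stored fault rate of the segment that contains l_value.

    Segments are half-open: [previous bound, upper bound).
    """
    idx = bisect_right(SEGMENT_UPPER_L, l_value)
    if idx == len(SEGMENT_RATE):
        return RATE_BEYOND_LAST
    return SEGMENT_RATE[idx]


def fault_rate_profile(l_series):
    """Map time-ordered (time_s, L) samples to (time_s, fault_rate) samples."""
    return [(t, fault_rate_for_l(l)) for t, l in l_series]


# Example: two samples along an orbit, one below the inner belt and one
# inside the outer belt (according to the placeholder table above).
profile = fault_rate_profile([(0, 1.3), (60, 4.5)])
```

Narrower segments simply mean a longer table; the lookup stays logarithmic in the number of segments.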

3.3 RFT Performability Model

System reliability is the probability that a system is operating without faults after a specified time period. Assuming exponentially-distributed random faults at rate λ, the system reliability is traditionally defined as:

    R(t) = e^{-\lambda t}    (3-1)

Mean-time-to-failure (MTTF) and mean-time-to-repair (MTTR) are the average amounts of time before a system encounters a failure (or repair) event. For a fault rate λ (or repair rate μ), MTTF (or MTTR) is defined as:

    MTTF = \int_0^\infty e^{-\lambda t} \, dt = \frac{1}{\lambda}, \qquad MTTR = \int_0^\infty e^{-\mu t} \, dt = \frac{1}{\mu}    (3-2)

System availability, which is similar to system reliability, estimates the long-term, steady-state probability that the system is operating correctly, and is defined as:

    A = \frac{MTTF}{MTTF + MTTR}    (3-3)

System unavailability is the complement of availability, and is often used for convenience when discussing systems with very high availability. Unavailability is defined as:

    UA = 1 - A = \frac{MTTR}{MTTF + MTTR}    (3-4)

Space system reliability and availability can be accurately modeled using Markov models. A Markov model is composed of states and transition rates. A state represents the current operating state of the system and the transition rates represent the transitions from an operating state to a failure state (i.e., failure rates), or from a failure state to an operating state (i.e., repair rates). The Markov model can be transformed into a series of equations, which can be solved or approximated numerically using tools such as SHARPE, an open-source fault-modeling tool [Sahner and Trivedi 1987], to determine probabilities of each state. System reliability and availability can be
directly determined from the calculated state probabilities. For Markov models, the instantaneous availability of repairable systems measures the probability of being in an available vs. failed state at a given point in time. These types of models are frequently used to estimate the effects of TMR, scrubbing, and other fault-tolerance methods in FPGAs and other electronics [Dobias et al. 2005; Garvie and Thompson 2004; Pratt et al. 2007].

Markov reward models, a type of weighted Markov model, can be used to extend the concept of system availability to measure system performability of adaptable systems [Ciardo et al. 1990; Meyer 1982]. Performability is a metric that combines system availability with the amount of work produced by the system and gives a measure of total work performed. Performability is especially useful for gracefully degradable systems or other systems that have changing characteristics over time. Assuming that X(t) is a semi-Markov process with state space S and is continuous over time t > 0, the instantaneous performability is defined by

    Performability(t) = \sum_{a \in S} Perf(a) \, P\{X(t) = a\}    (3-5)

where Perf(a) is the system performance in state a. System performance can be defined using any desired performance metric (e.g., throughput, execution time, etc.) and performability is measured similarly. In this context, instantaneous availability can be viewed as a special case where Perf(a) = 1 in available states and Perf(a) = 0 otherwise.

For reconfigurable FPGA systems where the system configuration changes over time (e.g., our RFT architecture), the system must be modeled as a phased-mission system. A phased-mission system is described using a set of unique models for each phase of the mission. The states at the end of a given phase's model map to the states of the following phase's model during phase transitions, and the phase duration can be modeled as either probabilistic or deterministic [Alam et al. 2006; Kim and Park 1994].
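
The reliability, availability, and performability definitions in this section can be checked numerically with a few lines of Python. The sketch below assumes a single repairable unit with exponential fault rate `lam` and repair rate `mu`; the function names, the rates, and the reward values are illustrative assumptions and do not come from the case studies.

```python
import math


def reliability(lam, t):
    """R(t) = exp(-lam * t) for exponentially distributed faults."""
    return math.exp(-lam * t)


def availability(lam, mu):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    mttf = 1.0 / lam  # mean time to failure
    mttr = 1.0 / mu   # mean time to repair
    return mttf / (mttf + mttr)


def unavailability(lam, mu):
    """UA = 1 - A."""
    return 1.0 - availability(lam, mu)


def performability(state_probs, perf):
    """Sum of Perf(a) * P{X(t) = a} over all states a."""
    return sum(perf[state] * prob for state, prob in state_probs.items())


# Illustrative values only: one fault per 1000 s, one repair per 60 s.
lam_example, mu_example = 1.0 / 1000.0, 1.0 / 60.0
a_example = availability(lam_example, mu_example)

# With Perf = 1 in the available state and 0 otherwise, performability
# reduces to availability, as noted in the text.
probs = {"up": a_example, "down": 1.0 - a_example}
as_availability = performability(probs, {"up": 1.0, "down": 0.0})

# A reward of 2 in the available state (e.g., two units of normalized
# throughput) scales the result accordingly.
as_throughput = performability(probs, {"up": 2.0, "down": 0.0})
```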

We use the RFT fault-rate model's generated fault-rate estimates to drive multiple system-level Markov models in order to calculate reliability, availability, and performability of RFT systems. In order to incorporate varying fault rates (due to orbital position) and varying system topologies (due to RFT), we leverage a phased-mission Markov approach. We model the RFT system as a collection of individual phases, with each phase consisting of a period of time where the fault-tolerance mode, failure rates, and repair rates are constant. The phase lengths are both application-dependent and orbit-dependent. At the end of each phase, the pre-transition state probabilities are mapped onto initial probabilities for the post-transition Markov model.

Figure 3-4 illustrates an example high-level model transitioning from TMR to DWC at time t1, and then transitioning from DWC back to TMR at time t2. Fault rates (λ) and repair rates (μ) are represented as directed graph edges between states. At each phase transition (denoted by the dashed vertical lines), state probabilities are re-mapped (dashed arrows). When re-mapping state probabilities from TMR to DWC, two TMR states are merged into a single operational state. When re-mapping from DWC to TMR, the single operational DWC state maps to the most-similar TMR state.

Each fault-tolerance mode has an associated Markov model. Each state of the Markov model corresponds to the number of operational devices in a system (e.g., PRRs in an FPGA-based SoC). Transitions between states occur when a device changes state from operational to failed, or vice-versa. FPGA upset rates are estimated using the RFT fault-rate model described in Section 3.2. The system repair rates are based on the system-designer-specified scrub rate and checkpointing rate and the system and PRR reconfiguration time, which can be obtained experimentally. Transitions between fault-tolerance modes are mission-dependent. State-mapping functions ensure a continuous availability function, although performability may contain discontinuities at phase transitions.
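
The phased-mission procedure described above can be illustrated with a small, self-contained Python sketch. It is not the SHARPE-based flow used in this work: the three-state TMR and two-state DWC generators, the forward-Euler integration, and the choice of the fully operational TMR state as the "most-similar" target are simplifying assumptions made here for illustration, and all rates in the example are hypothetical.

```python
def euler_step(p, q, dt):
    """One forward-Euler step of dp/dt = p * Q for a row vector p."""
    n = len(p)
    return [p[j] + dt * sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]


def solve_phase(p0, q, duration, dt=1.0):
    """Integrate the state probabilities over one constant-rate phase."""
    p = list(p0)
    for _ in range(int(round(duration / dt))):
        p = euler_step(p, q, dt)
    return p


def tmr_q(lam, mu):
    """Assumed TMR model: 0 = all good, 1 = one fault (masked), 2 = failed."""
    return [[-3 * lam, 3 * lam, 0.0],
            [mu, -(mu + 2 * lam), 2 * lam],
            [mu, 0.0, -mu]]


def dwc_q(lam, mu):
    """Assumed DWC model: 0 = operational, 1 = failed."""
    return [[-2 * lam, 2 * lam],
            [mu, -mu]]


def tmr_to_dwc(p):
    """Merge the two available TMR states into the operational DWC state."""
    return [p[0] + p[1], p[2]]


def dwc_to_tmr(p):
    """Map operational DWC to the fully operational TMR state (assumption)."""
    return [p[0], 0.0, p[1]]


def run_mission(phases, mu):
    """phases: list of (mode, fault_rate_per_s, duration_s) tuples.

    Returns (mode, availability at end of phase) for every phase.
    """
    p, mode = [1.0, 0.0, 0.0], "TMR"  # mission starts fault-free in TMR
    history = []
    for next_mode, lam, duration in phases:
        if next_mode != mode:
            p = tmr_to_dwc(p) if next_mode == "DWC" else dwc_to_tmr(p)
            mode = next_mode
        q = tmr_q(lam, mu) if mode == "TMR" else dwc_q(lam, mu)
        p = solve_phase(p, q, duration)
        history.append((mode, 1.0 - p[-1]))  # last state is the failed state
    return history


# Hypothetical mission: high fault rate, quiet period, high fault rate again.
example = run_mission(
    [("TMR", 1e-3, 600), ("DWC", 1e-4, 3000), ("TMR", 1e-3, 600)],
    mu=1.0 / 60.0,
)
```

Both mapping functions preserve the total probability of the available states, which is why availability stays continuous across a phase transition even though the reward (throughput) assigned to those states may jump.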

Figure 3-5 shows Markov model representations for each fault-tolerance mode in a system with six PRRs. Figures 3-5(a) and 3-5(b) represent systems where each PRR is operating in the HP or ABFT fault-tolerance modes, respectively. Figure 3-5(c) represents a system using three independent pairs of PRMs operating in the DWC fault-tolerance mode. Figure 3-5(d) represents a system using two independent sets of three PRMs operating in the TMR fault-tolerance mode. Solid circles represent the Markov model's available operating states and dashed circles represent failed operating states.

Using the definition of performability from Eq. 3-5, each state in a Markov reward model is assigned a performance throughput value. For this analysis, system performance is normalized to the work performed by a single PRR. The performance throughput value of each state is represented in the Markov model by the value in the rectangles. For example, six independent PRRs running concurrently in the HP fault-tolerance mode would have a performance throughput value of 6, while a system using six PRRs with two independent sets of PRRs operating in the TMR fault-tolerance mode would have a performance throughput value of 2.

The ABFT Markov model is used to demonstrate a generic reliability model for modules that contain some form of internal fault tolerance and are capable of self-detecting data errors. In particular, ABFT provides a low-overhead method for detecting errors in certain linear algebra operations. While hardware-implemented ABFT may not have 100% fault coverage, it still improves on the HP model, which cannot detect when corrupt data is returned. In the Markov model for the single-module ABFT mode, we estimate the performance of a single module as 80% of the default, unprotected module due to the performance overhead associated with generating and comparing checksums for fault detection [Acree et al. 1993]. Additionally, ABFT modules are modeled as having a 20% higher fault rate than unprotected modules due to the increased module size from the ABFT logic. The repair rate for the ABFT model uses the system's scrubbing rate, supplying a
worst-case repair rate for cases when ABFT logic does not detect (or even introduces) an error, relying on external scrubbing and configuration readback to detect errors. Although these numbers are application- and implementation-specific, the numbers represent overhead costs that must be considered. The ABFT model can also be used to model other modules using user-implemented fault-tolerance techniques.

For a given phased-mission Markov model, a TLE is used to determine the expected fault rates using the RFT fault-rate model in Section 3.2. These fault rates, along with a description of the mission-specific criteria for using each fault-tolerance mode, are used to split a full space mission into distinct phases for Markov modeling. RFT phases are periods of time with constant fault rates, repair rates, and fault-tolerance modes. For each phase, the previous phase's state probabilities are mapped onto the new phase's Markov model as initial probabilities. SHARPE processes the new model to numerically solve the state probabilities and reliability metrics for the current phase. SHARPE's results provide the initial probabilities for the next phase. This process is repeated iteratively for each phase for the entire mission's duration and SHARPE's results are aggregated to produce overall reliability results.

3.4 Results and Analysis

In this section, we present one validation case study and two orbital case studies to evaluate the potential reliability and performance benefits of our RFT framework for space systems. The validation case study uses FPGA fault injection to estimate fault rates and error coverage, which can then be used by the other case studies. The orbital case studies represent FPGA-based space systems operating in two common orbits, with multiple performance and reliability requirements. The first case study represents a space system operating in low-Earth orbit (LEO), the Earth Observing-1 (EO-1) satellite. Space systems in a LEO experience relatively low radiation. The second case study represents a space system operating in a highly elliptical orbit (HEO), which is a much
harsher radiation orbit. Each case study will compare multiple adaptive fault-tolerance strategies to a traditional static TMR strategy.

3.4.1 Validation Case Study

In order to validate our reliability models, faults must be injected into an executing RFT system. In this section, we present FPGA fault-injection results gathered using the Simple, Portable Fault Injector (SPFI) [Cieslewski et al. 2010]. SPFI performs fault injection using both full and partial reconfiguration, which reduces the time required to modify configuration memory and improves the speed of fault injection. We validate our reliability models by correlating SPFI's results with analytical Markov model results.

For the validation case study, we implemented a simplified RFT-based system on a Xilinx ML505 FPGA development platform. The RFT-based system had a 3-PRR RFT controller that allowed HP and TMR voting and had watchdog timer functionality. Each PRR contained a matrix multiplication (MM) PRM, and a MicroBlaze processor in the static region streamed data from a UART to an MM PRM and streamed results back to the UART. The SPFI fault-injection tool enabled individual system components (e.g., PRMs, RFT controller, MicroBlaze) to be tested independently without modifying the entire system.

Table 3-3 shows the RFT system's fault-injection results. For each system component, Table 3-3 indicates the number of injections performed and the number of data errors and system hangs detected. The fault vulnerability for each component is scaled to the FPGA's total number of configuration bits to estimate the component's design vulnerability factor (DVF). The DVF of a component or device represents the percentage of bits that are vulnerable to faults and can result in observable errors. Most Xilinx FPGA designs have a DVF that ranges from 1% to 10% [Xilinx 2010b] due to the large amount of configuration memory devoted to routing. The DVF for each component is calculated by measuring the component's fault rate, estimating the number of vulnerable bits by scaling the fault rate to the size of the area occupied by the component, and dividing the number of vulnerable bits by the total number
of FPGA configuration bits.

For a single MM PRM with the RFT controller using HP mode, only 1.6% of faults that occurred in the PRR were found to cause observable errors in the output. The approximately 41,497 vulnerable bits in each PRR, which occupies approximately one-eighth of the FPGA, result in a DVF_MM of 0.197%. In this case, the majority of the PRR is unused, resulting in very few vulnerable bits and a low DVF. Faults injected into the RFT controller resulted in a DVF_RFT of 0.022%. The MicroBlaze was not protected using TMR or other design techniques and had a DVF_MB of 0.342%. Based on these component fault-injection results, the total FPGA DVF is estimated to be 0.955% (3 × DVF_MM + DVF_RFT + DVF_MB).

Figure 3-6 shows Markov model representations for each fault-tolerance mode in a system with three PRRs. Figures 3-6(a) and 3-6(b) represent systems where each PRR is operating in the HP or the TMR fault-tolerance mode, respectively. These models are similar to the models presented in Section 3.3, without additional states for performability, and with the addition of state transitions to account for fault coverage. In a TMR system, the coverage transitions account for the percentage of faults that TMR does not mask and that cause the system to immediately enter the failed state. These non-covered faults occur because the design is not fully protected by TMR.

Initial fault-injection testing revealed that our system appeared to have two possible fault scenarios. In the first scenario, a fault could cause the system to remain operational but produce erroneous data. This state was recoverable using periodic scrubbing. In the second scenario, a fault could cause the system to hang until a full reconfiguration was performed. We expanded the Markov models to include these behaviors as additional states. Figure 3-6(a) shows the HP Markov model with two unavailable states that account for the two fault scenarios. The probabilities of transitioning to Faulty Data or System Hang were determined from the component fault-injection testing. The DVF_FPGA was estimated from the sum of each of the components tested.
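
The DVF bookkeeping above is plain arithmetic and can be reproduced directly. The short Python sketch below uses only the component DVF values quoted in this section; the dictionary and function names are illustrative choices, not part of the original tooling.

```python
# Component DVF values (in percent) as reported in this section.
DVF_PERCENT = {"MM": 0.197, "RFT": 0.022, "MB": 0.342}
NUM_PRRS = 3  # the validation system uses three PRRs, each holding an MM PRM


def total_dvf_percent(dvf, num_prrs):
    """Total FPGA DVF = num_prrs * DVF_MM + DVF_RFT + DVF_MB."""
    return num_prrs * dvf["MM"] + dvf["RFT"] + dvf["MB"]


def system_dvf_percent(dvf):
    """DVF of the components outside the PRRs: RFT controller + MicroBlaze."""
    return dvf["RFT"] + dvf["MB"]


def dvf_percent(vulnerable_bits, total_config_bits):
    """DVF as the percentage of configuration bits that are vulnerable."""
    return 100.0 * vulnerable_bits / total_config_bits


fpga_dvf = round(total_dvf_percent(DVF_PERCENT, NUM_PRRS), 3)  # 0.955
sys_dvf = round(system_dvf_percent(DVF_PERCENT), 3)            # 0.364
```

The second value matches the DVF_sys term used by the TMR validation model, where faults in the unprotected static logic bypass the voter.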

A similar approach was taken for the TMR Markov model shown in Figure 3-6(b). The additional degraded state is used to model faults that have been masked by the TMR protection provided by the RFT controller. The probability of transitioning to the degraded state is provided by the DVF_MM from previous testing. The DVF_sys term represents faults in the RFT controller (DVF_RFT) and MicroBlaze (DVF_MB). From fault-injection testing, we estimate the DVF_sys to be approximately 0.364%.

By assigning fault rates, repair rates, and coverage to the Markov model, we can calculate the system availability. Table 3-4 shows the fault and repair rates used in this analysis. Using a fault rate of 1 fault per 10 seconds and a repair rate of 1 scrub per 5 seconds, the RFT system will have a 98.82% availability in HP mode and 99.31% availability in TMR mode. When the ratio of fault rate to scrub rate is increased, the benefits of TMR become more pronounced. Using a fault rate of 1 fault per 2 seconds and a repair rate of 1 scrub per 10 seconds, the RFT system will have a 92.25% availability in HP mode and 95.32% availability in TMR mode. These high availabilities, even in HP mode, are due to frequent scrubbing and the very low DVF of the FPGA design.

Finally, we validate the Markov model results using fault injection. In a continuously-running RFT system, faults are injected at a specified rate, which is randomly varied using a Poisson distribution to simulate the error model used in the Markov model. Scrubbing, using partial reconfiguration, occurs at user-defined periodic intervals. Full reconfiguration occurs at the next scrubbing cycle after a PRM error has been detected by the RFT controller. Full reconfigurations also occur if the external testing program detects that the FPGA system has entered the System Hang state. Availability can be experimentally determined by the ratio of time the system is operating correctly to the total experiment runtime. For each run, 10,000 faults were randomly injected into a running system. Table 3-4 shows the results of the availability experiment and the relative error from the analytical model. At low fault rates, the HP and TMR modes both provide approximately
99% availability. With high fault rates, the system had an availability of 92.59% in the HP mode and 93.89% in the TMR mode.

The Markov model availability methodology provides a simple and effective method for determining the effects of fault and repair rates on system availability without exhaustive testing. In general, the HP models predicted slightly lower availability than what was observed during experimental testing, while the TMR models overestimated the availability of the system. All availability results were within 1.5% of the Markov models' predictions. The Markov model accuracy can be improved by providing more accurate fault-injection results. The availability values obtained through experiments can be improved by increasing the length of the testing period.

Fault-injection testing did highlight implementation issues that must be handled in any high-reliability design. The use of TMR had a lower than expected benefit due to the unprotected MicroBlaze processor. Since the DVF_MB was larger than the DVF_MM, the availability of the system was dominated by the MicroBlaze's availability. For high reliability, the MicroBlaze must be protected with TMR or replaced with an alternative fault-tolerant processor (e.g., FT-LEON3).

3.4.2 Orbital Case Studies

For each orbital case study, the system under test is an FPGA System-on-Chip implementation of the RFT hardware architecture described in Section 3.1. In order to calculate device vulnerability, the parameters for the Xilinx Virtex-4 FX60 will be used as inputs to the RFT fault-rate model described in Section 3.2. The radiation susceptibility parameters of the Virtex-4 device family are obtained from the Xilinx Radiation Test Consortium's (XRTC) published results [Swift et al. 2008]. The generated fault rate is linearly scaled from a full device to the size of the PRR to produce PRR fault rates. The RFT controller is connected to 6 PRRs, allowing for several combinations of the TMR, DWC, and ABFT fault-tolerance modes discussed in Section 3.3. The performability
model from Section 3.3 is used to evaluate the effectiveness of each fault-tolerance strategy.

3.4.2.1 Low-Earth orbit case study

Space systems in LEO are commonly used for earth-observing science applications, such as HSI or SAR. Both HSI and SAR have large data sets, which can be significantly reduced through on-board processing using a variety of algorithms that are decomposable into basic kernels that can be parallelized and implemented on an FPGA system for high performance and power efficiency. For example, HSI can be decomposed into a sequence of matrix multiplications and matrix inversions, and SAR can be decomposed into vector multiplication and FFTs. Since these mathematical operations are linear, ABFT can be used to protect the computation results from SEUs. For applications that cannot be protected with ABFT, TMR or DWC must be used to provide fault tolerance.

The TLE used to generate the fault rates for this LEO case study is from the EO-1 satellite. The EO-1 orbit is circular at an altitude of 700 km and a 98.12° inclination with a mean travel time of 98 minutes. Figure 3-7(a) shows the orbital track of EO-1 and the shaded circle represents the EO-1's field of view. Figure 3-7(b) shows the estimated number of upsets per hour that occur in the EO-1 orbit over several orbital periods. The average fault rate of the Virtex-4 FX60 in the EO-1's orbit is 16.5 faults per device-day (combined configuration memory and BRAM vulnerability). Each local maximum occurs when the satellite is closest to the Earth's magnetic poles. Fault rates in EO-1's orbit are low because the orbit is lower than the Van Allen belts and is fully within the Earth's magnetosphere, which deflects a large amount of radiation.

We examine the availability and performability of TMR, DWC, and ABFT fault-tolerance modes in LEO, as well as three adaptive fault-tolerance strategies to maximize application performability: 10% two-mode, 50% two-mode, and three-mode adaptive strategies. For the adaptive fault-tolerance strategies, the fault-tolerance mode switching is determined by comparing the current upset rate with a fault-rate threshold. The 10%
two-mode adaptive strategy uses the ABFT fault-tolerance mode when the upset rate is in the lowest 10% of the expected fault rates and the TMR fault-tolerance mode otherwise. The 50% two-mode adaptive strategy uses a similar strategy as the 10% two-mode, but with a higher fault-rate threshold. The three-mode adaptive strategy uses ABFT when the upset rate is in the lowest 10% of the expected fault rates, TMR when the upset rate is in the highest 50% of the expected fault rates, and DWC otherwise. In all modes, the system performs scrubbing to ensure that configuration memory errors are removed from the system. For this LEO case study, the system uses a 60-second scrub cycle, which is also used as the system repair rate for the Markov models.

Figure 3-8(a) shows the availability of the LEO system while statically using the four fault-tolerance models described in Figure 3-5. While the static HP strategy's availability quickly declines (due to the lack of any fault-tolerance mechanisms), the static TMR, DWC, and ABFT strategies all maintain availability above 95%. The dynamically changing availability is directly related to the current fault rate; however, since the repair rate of the system (through scrubbing) is much larger than the expected fault rates, most configuration memory faults are rapidly mitigated. The system availability while using the three adaptive fault-tolerance strategies is shown in Figure 3-8(b). Each adaptive strategy improves average availability over the static ABFT strategy, and the adaptive strategies can increase the minimum availability to above 99.5%.

Table 3-5 displays the availability results for each fault-tolerance strategy in terms of unavailability, showing the probability of a system failure. The 10% two-mode adaptive strategy reduces average unavailability by 88% as compared to the static ABFT strategy. For extremely high availability, static TMR is required, as the maximum unavailability for the adaptive strategies is more than 100 times higher than that of TMR.

The system performability for the static and adaptive fault-tolerance strategies for a LEO is also shown in Table 3-5. Due to the low overall upset rates and good availability of the static ABFT strategy, the static ABFT strategy achieves the highest
performability. The 50% two-mode adaptive strategy achieves an average performability throughput of 4.01, a 100% improvement over static TMR, while improving unavailability over the static ABFT strategy by 73%. The 10% two-mode strategy has lower average performability, while maintaining better system availability. The three-mode strategy has better performability than the static DWC mode while having better availability than the static DWC or ABFT modes, but is outperformed by the 50% two-mode strategy.

We point out that the fault-rate threshold for the two-mode strategies can be used to adjust the availability and performability parameters for a space system. Figure 3-9 illustrates the effect of changing the threshold of the two-mode adaptive strategy for the LEO case study. As the threshold is raised, more time is spent using the ABFT mode, which lowers system availability while increasing performability. For the LEO case study, most of the performance gains from using ABFT can be obtained by using a low threshold value because much of the orbit will be under this threshold. With a 10% threshold, an RFT system would spend an approximately equal amount of time in each of the ABFT and TMR modes. Raising the adaptive threshold higher than 50% results in limited performance gains at the expense of decreased availability. Further analysis of these thresholds in mode-switching strategies can enable optimization toward Pareto-optimal goals.

3.4.2.2 Highly-elliptical orbit case study

The HEO is a common type of orbit used mostly by communication satellites. From the ground, satellites traveling in an HEO can appear stationary in the sky for long periods of time. HEOs also offer visibility of the Earth's polar regions, which most geosynchronous satellites do not. The HEO used for this case study is a Molniya orbit, named for the communication satellites that first used this orbit. A TLE for a Molniya-1 satellite was used to generate fault rates for the HEO case study. This orbit has a perigee of 1,100 km, an apogee of 39,000 km, and a 63.4° inclination with a mean travel time of 12 hours. The average amount of radiation throughout the orbit is much higher
than the LEO case study, and much larger amounts of radiation are encountered when the satellite passes through the Van Allen belts. The average fault rate in the Molniya-1 orbit is 62 faults per device-day. For most of the orbit, the fault rate averages 7 faults per device-day, but the large fault-rate peaks that occur near perigee increase the overall fault rate. Figure 3-10(a) illustrates the Molniya-1 orbit and Figure 3-10(b) shows the estimated number of upsets per hour that an FPGA might experience in an HEO.

Space systems outside of the Earth's magnetosphere, either in geosynchronous orbit or in interplanetary space, experience relatively constant fault rates that vary based on the occurrence of solar flares and other space weather conditions. Solar flares can send a wave of high-energy particles into space, causing a brief period of extremely high fault rates. These fault-rate spikes, while different in origin, look similar to the fault-rate peaks that occur in this case study. The analysis used in this case study can also be used to estimate RFT system performance in the presence of different space weather conditions.

We evaluate the availability and performability of TMR, DWC, and ABFT fault-tolerance modes in an HEO. In order to maximize performability of applications in an HEO, we also examine two adaptive fault-tolerance strategies. The ABFT/TMR adaptive strategy uses the ABFT fault-tolerance mode when upset rates are in the lowest 5% of expected fault rates and the TMR fault-tolerance mode otherwise. The DWC/TMR adaptive strategy uses the same fault-rate thresholds as ABFT/TMR, but switches between the DWC and TMR modes. In each mode, the system performs scrubbing to ensure that configuration memory errors are removed from the system. For the HEO case study, the system uses a 10-second scrub cycle to account for the increased average fault rates experienced in an HEO; this scrub cycle is also used as the system repair rate for the Markov models.

Figures 3-11(a) and 3-11(b) show the availability of the HEO case study system while using three static fault-tolerance strategies and two adaptive strategies. The static TMR strategy maintains an average availability of 99.93%, but the availability
drops as low as 95.1% while passing through the peak fault-rate periods. While using the static DWC and ABFT strategies, the availability drops significantly during the peak fault-rate periods, making these strategies unsuitable for systems that must maintain continuous operation (due to the extremely high upset rate, many systems shut down operation during peak fault-rate periods). However, outside of the peak fault-rate periods, the minimum availability for the static DWC (99.8%) and ABFT (99.3%) strategies is high enough to be tolerated by many applications. The two adaptive strategies use DWC and ABFT fault-tolerance modes during low fault-rate periods and TMR fault-tolerance mode during the peak fault-rate periods. Using the adaptive strategies, the availability never falls below the TMR strategy's minimum availability of 95.1% during peak fault-rate periods, while maintaining higher availability at other times. (Note the change of scale in Figure 3-11(b).)

Table 3-6 shows the unavailability and performability of the fault-tolerance strategies described in this section. The DWC/TMR adaptive strategy reduces average unavailability by 80% as compared to the static DWC strategy. The ABFT/TMR adaptive strategy reduces average unavailability by 85% as compared to the static ABFT strategy.

The system performability for the static and adaptive fault-tolerance strategies for an HEO is shown in Table 3-6. While the TMR strategy exhibits the lowest performability, the TMR strategy also has the lowest variation in performability and highest availability, allowing for predictable performance levels. The performability of the static ABFT and DWC strategies is significantly reduced in the peak fault-rate periods, but quickly returns to acceptable levels after passing through the peak fault-rate periods. The maximum unavailability for each of the adaptive strategies occurs during the peak fault-rate periods. Since the RFT system is using the TMR fault-tolerance mode during these segments, the adaptive strategies have the same maximum unavailability as the static TMR strategy. The DWC/TMR adaptive strategy increases performability over the TMR strategy by 46%. The ABFT/TMR adaptive strategy increases average performability
over the TMR strategy by 128% and reduces performability by only 4% relative to the static ABFT strategy, while significantly improving unavailability over the ABFT strategy.

3.5 Conclusions

In this work, we have presented a novel and comprehensive framework for reconfigurable fault tolerance capable of creating and modeling FPGA-based reconfigurable architectures that enable a system to self-adapt its fault-tolerance strategy in accordance with dynamically varying fault rates. The PR-based RFT hardware architecture enables several redundancy-based fault-tolerance modes (e.g., DWC, TMR) and additional fault-tolerance features (e.g., watchdog timers, ABFT, checkpointing and rollback), as well as a mechanism for dynamically switching between modes. The combination of these fault-tolerance features enables the use of COTS SRAM-based FPGAs in harsh environments. Future work will automate and optimize the creation of RFT controllers for specific system configurations in order to increase developer productivity, facilitate adoption, and reduce system overhead.

In addition to the hardware architecture, we have demonstrated a fault-rate model for RFT to accurately estimate upset rates and capture time-varying radiation effects for arbitrary satellite orbits using a collection of existing, publicly available tools and models. Our model provides important characterization of fault rates for space systems over the course of a mission that cannot be captured by average fault rates. The HEO case study demonstrated the large range of fault rates experienced in certain elliptical orbits, and the potential for performance improvements when using RFT.

Using the results from the fault-rate model, our Markov-model-based RFT performability model was used to demonstrate the benefits of using an adaptive fault-tolerance system architecture. The performability model was validated using FPGA fault injection on an experimental RFT system, and multiple static and adaptive fault-tolerance strategies for space systems in dynamically changing environments were evaluated. The RFT performability model demonstrated that while TMR provides
a lower bound on availability, less reliable, low-overhead methods can be used during low-fault-rate periods to improve performance. We evaluated two case study orbits and observed that adaptive fault-tolerance strategies are able to improve unavailability by 85% over static ABFT and performability by 128% over traditional, static TMR fault tolerance. This additional performance can lead to more capable and power-efficient FPGA-based onboard processing systems in the future.

Table 3-1. RFT fault-tolerance modes.

Fault-tolerance mode                          Fault-tolerance type   PRRs required
Triple Modular Redundancy (TMR)               Redundancy             3
Duplication with Compare (DWC)                Redundancy             2
High-Performance (HP) - no fault protection   Single-module          1
Algorithm-Based Fault Tolerance (ABFT)        Single-module          1
Internal TMR                                  Single-module          1

Table 3-2. RFT controller resource usage.

Module name       LUTs used   FFs used   Slices used   FPGA utilization
PLB Controller    479         375        345           1.4%
Address Decoder   238         0          136           0.5%
RFT Registers     82          64         48            0.2%
Watchdog Timers   498         390        366           1.4%
Voting Logic      438         0          234           0.9%
Output Mux        227         0          132           0.5%
Total             1962        829        1261          5.0%

Table 3-3. Fault-injection results for RFT components.

System component       Faults injected   Data errors   System hangs   Vulnerable bits (est.)   DVF (%)
Matrix Multiply (MM)   100,000           1,501         71             41,497                   0.197%
RFT Controller (RFT)   100,000           63            157            4,576                    0.022%
MicroBlaze (MB)        100,000           150           1,219          86,584                   0.342%
Full FPGA                                                                                      0.955%

Table 3-4. RFT Markov model validation.

Fault period    Repair period   FT mode   Markov model   Experimental   Model
(time/fault)    (time/scrub)              availability   availability   error
10 s            5 s             HP        98.82%         98.99%         0.2%
10 s            5 s             TMR       99.31%         99.11%         0.2%
2 s             10 s            HP        92.25%         92.59%         0.4%
2 s             10 s            TMR       95.32%         93.89%         1.5%

Table 3-5. Unavailability and performability for LEO case study.

Strategy       Average          Maximum          Average          Minimum
               unavailability   unavailability   performability   performability
TMR            9.8 × 10^-6      3.8 × 10^-5      2.00             2.00
DWC            3.9 × 10^-3      1.1 × 10^-2      3.00             2.98
ABFT           1.6 × 10^-2      4.7 × 10^-2      4.79             4.76
2-Mode (10%)   1.9 × 10^-3      4.6 × 10^-3      3.51             2.00
2-Mode (50%)   4.4 × 10^-3      1.8 × 10^-2      4.01             2.00
3-Mode         2.7 × 10^-3      5.4 × 10^-3      3.69             2.00

Table 3-6. Unavailability and performability for HEO case study.

Strategy       Average          Maximum          Average          Minimum
               unavailability   unavailability   performability   performability
TMR            7.2 × 10^-4      4.9 × 10^-2      2.00             1.95
DWC            1.4 × 10^-2      4.1 × 10^-1      2.98             2.46
ABFT           4.9 × 10^-2      9.2 × 10^-1      4.73             2.62
DWC/TMR        2.6 × 10^-3      4.9 × 10^-2      2.92             1.95
ABFT/TMR       7.2 × 10^-3      4.9 × 10^-2      4.56             1.95

Figure 3-1. System-on-chip architecture with RFT controller.

Figure 3-2. RFT controller PLB-to-PRR interface.

Figure 3-3. RFT fault-rate model.

Figure 3-4. Phased-mission Markov model transitioning between TMR and DWC modes.

Figure 3-5. Markov models of RFT modes. A) 6x HP. B) 6x ABFT. C) 3x DWC. D) 2x TMR.

Figure 3-6. RFT validation Markov models. A) HP mode. B) TMR mode.

Figure 3-7. LEO fault rates using the RFT fault-rate model. A) Visualization of the EO-1 TLE data using JSatTrak [Gano 2010]. B) Expected fault rates over several orbits.

PAGE 59

Figure 3-8. LEO system availability. A) System availability of static strategies. B) System availability of adaptive strategies.

PAGE 60

Figure 3-9. Effects of adaptive thresholds on availability and performability.

PAGE 61

Figure 3-10. HEO fault rates using the RFT fault-rate model. A) Visualization of Molniya-1 TLE data using JSatTrak [Gano 2010]. B) Expected fault rates over time.

PAGE 62

Figure 3-11. HEO system availability. A) System availability of static strategies (zoomed image on right). B) System availability of adaptive strategies (zoomed image on right).

PAGE 63

CHAPTER 4
ALGORITHM-BASED FAULT TOLERANCE FOR FPGA SYSTEMS

Algorithm-based fault tolerance (ABFT) is a fault-tolerance method that can be used with many linear-algebra operations, such as matrix multiplication or LU decomposition [Huang and Abraham 1984]. Additionally, the ABFT approach can be extended to other linear operators such as the Fourier transform. Fortunately, many common space applications are composed of linear-algebra operations; e.g., HSI features matrix multiplication [Jacobs et al. 2008], while SAR features fast Fourier transforms. Other algorithms can often be converted to fit an algebraic framework. Traditionally, ABFT has been implemented in software, with multiprocessor arrays, and in hardware, with systolic arrays, to protect application datapaths. Our ABFT approach may be used in FPGA applications to provide both datapath and configuration-memory protection with low overhead. By demonstrating the effectiveness of ABFT in FPGA systems, we can enable the use of FPGAs in future space missions where resource constraints and reliability are the major challenges. In the larger perspective of an RFT system, ABFT enables additional design-space options for optimizing total system performability.

In this chapter, we present an analysis of multiple fault-tolerance methods on Xilinx FPGAs, including TMR and ABFT. We examine the resource usage of each method and measure the vulnerability of the design using a fault-injection tool. We then examine possible design tradeoffs and modifications that can enable higher reliability.

The remainder of this chapter is organized as follows. Section 4.1 describes multiple hardware architectures for ABFT matrix multiplication, along with analysis of resource usage and fault vulnerability. Section 4.2 describes an ABFT-based algorithm for FFTs, FFT case-study architectures, and fault-vulnerability analysis. Section 4.3 presents conclusions, provides suggestions for developing more reliable designs, and outlines directions for future research.

PAGE 64

4.1 Matrix Multiplication

Matrix multiplication (MM) is used as a key kernel in a large number of signal-processing applications, and has been shown to benefit from the performance of FPGAs [Dave et al. 2007; Wu et al. 2010; Zhuo and Prasanna 2004]. The traditional formulation of ABFT uses MM as a motivating case, demonstrating the method's ability to detect and correct errors while limiting computational complexity and overhead. MM also has the benefit of simple parallel decomposition strategies, which trade additional FPGA resource usage for decreased total execution time and work well with ABFT.

In Section 4.1.1, the mathematical description of ABFT for MM is presented. In Section 4.1.2, several possible MM-ABFT architectures are described. Section 4.1.3 analyzes the resource-overhead tradeoffs for each of the proposed architectures. Section 4.1.4 presents an analysis of fault-injection results obtained from each of the proposed architectures. Finally, Section 4.1.5 summarizes the results from the matrix-multiplication analysis.

4.1.1 Checksum-Based ABFT for Matrix Multiplication

The following definitions provide the mathematical background for ABFT. To obtain the weighted checksums, the initial data must be multiplied by an encoder matrix. Without loss of generality and to simplify the notation, we assume that a generic matrix is square with dimensions of $N \times N$.

Definition 1: An encoder matrix is a matrix whose product with the data matrix yields the desired checksums. For the remainder of this chapter we refer to the encoder matrix as $E_N$. Depending on the size of the encoding matrix, $E_N$ may encode more than one checksum row or column. The $E_N$ used in this chapter has dimensions of $N \times 1$.

$$E_N = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}^T \tag{4-1}$$

PAGE 65

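Definition 2: A column checksum matrix $A_C$ is an initial data matrix $A$ that has been augmented with extra rows of checksums. Such a matrix has dimensions of $(N+1) \times N$ and has the form:

$$A_C = \begin{bmatrix} A \\ E_N^T A \end{bmatrix} \tag{4-2}$$

Similarly, a row checksum matrix $A_R$ can be obtained by augmenting a data matrix $A$ with additional columns. Such a matrix has dimensions of $N \times (N+1)$ and has the following form:

$$A_R = \begin{bmatrix} A & A E_N \end{bmatrix} \tag{4-3}$$

Definition 3: The product of a column checksum matrix $A_C$ and a row checksum matrix $B_R$ produces a full checksum matrix $C_F$. Such a matrix has dimensions of $(N+1) \times (N+1)$ and the form:

$$A_C B_R = \begin{bmatrix} A \\ E_N^T A \end{bmatrix} \begin{bmatrix} B & B E_N \end{bmatrix} = \begin{bmatrix} AB & AB E_N \\ E_N^T AB & E_N^T AB E_N \end{bmatrix} = \begin{bmatrix} C & C E_N \\ E_N^T C & E_N^T C E_N \end{bmatrix} = C_F \tag{4-4}$$

The associative property of the matrix product allows for verification of the multiplication procedure by simply recalculating the checksums and comparing them with the ones obtained through the matrix multiplication. In general, operations that preserve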

PAGE 66

weighted checksums are called checksum-preserving, and the matrix product is an example of such a function.

4.1.2 Matrix-Multiplication Architectures

Several matrix-multiplication modules were created to examine the reliability of hardware-based ABFT. This section discusses a serial architecture and two possible parallelization strategies for MM, along with the design decisions that were made for the ABFT architectures. For this analysis, 32-bit integer precision was used in each design.

4.1.2.1 Baseline, serial architecture

The minimal-hardware, serial architecture for the matrix-multiplication function consists of a single multiply-accumulator (MAC), an address-generation module, and three data-storage modules (RAM), as shown in Figure 4-1(a). Two memories are used to store the input matrices A and B, and one memory is used for the resulting output matrix C. The address generator iterates through the correct matrix indices (i, j, k in Figure 4-2), sends data stored in the two input RAMs to the MAC, and generates the appropriate address for output values. This MM module can be used for any size matrix, the limiting factor being data storage. For this analysis, the input and output matrices are each stored in a 1024-deep, 32-bit Xilinx Block RAM (BRAM) component. This architecture requires O(N^3) cycles to fully calculate an output matrix. MM computational throughput can be improved by exploiting parallelism with additional MAC units.

4.1.2.2 Fine-grained parallel architecture

The fine-grained parallel MM architecture unrolls the inner loop of the MM algorithm as shown in Figure 4-2. Each element in the result matrix C is the dot product of a row from matrix A and a column from matrix B. This parallel architecture uses multiple processing elements to compute the dot product in parallel. With the fine-grained parallel architecture shown in Figure 4-1(b), the outputs of several multipliers are connected to an adder-tree structure, allowing the parallel computation of partial dot products, which are then accumulated into the final, full dot product. By fully

PAGE 67

parallelizing the dot product (using N multipliers), the execution time of the full algorithm can be reduced from O(N^3) to O(N^2). This method requires accessing multiple memory elements in parallel, and may be limited by the total number of memory elements or DSP components on the FPGA.

4.1.2.3 Coarse-grained parallel architecture

The coarse-grained parallel approach essentially unrolls the outer loop of the MM algorithm in Figure 4-2. Each row in the result matrix is calculated from a row in matrix A and every element of matrix B. This data parallelism enables each processing element to calculate a unique portion of matrix C by using a portion of matrix A and all of matrix B, as shown in Figure 4-1(c). When each processing element computes a single row, the execution time of the full algorithm can be reduced from O(N^3) to O(N^2). This method requires reading multiple values from matrix A and writing multiple values to matrix C in parallel, and requires the same total number of memory and DSP elements as the fine-grained parallel architecture.

4.1.2.4 Architectural modifications for ABFT

The addition of ABFT logic requires the creation of two functions, ABFT checksum generation and ABFT checksum verification. Each of these functions requires a simple accumulator (or a MAC for weighted checksums). In order to accommodate the calculated checksum data, the BRAM storage must be large enough to hold the ABFT-augmented matrices. Checksum generation sums each column of matrix A and writes the checksum into the matrix A BRAM, creating the column checksum matrix, A_C. Next, checksum generation performs the same process for the rows of matrix B to create a row checksum matrix, B_R. The checksum verification function sums the columns and rows of matrix C and compares the sums to the checksum values in the matrix C BRAM. If a mismatch is detected, an error signal is asserted until the module is reset. For some applications, such as image processing, data errors that occur in low-significance bits may be ignored. ABFT accomplishes this by comparing the

PAGE 68

difference of two generated checksums to a user-defined threshold value. However, for maximum coverage, this threshold should be set to zero for integer operations. Figure 4-3 shows an example of an ABFT-enabled MM architecture where an additional MAC is used for the checksum generation and verification functions. The MAC hardware that exists for the main MM operation may be reused for creating checksums (ABFT-Shared), or an additional accumulator can be used for this purpose (ABFT-Extra). For the baseline architecture, an additional ABFT accumulator would incur almost 100% overhead. However, when the ABFT encoding matrix in Equation 4-1 is used, multipliers are not required and the ABFT hardware can be simplified. For parallel designs with multiple processing elements, the overhead of a single MAC for ABFT calculations is amortized.

To implement ABFT error correction, the indices of faulty rows and columns can be temporarily stored in registers. Faulty elements exist at the intersection of a faulty row and a faulty column. To correct a faulty element, the checksum-generation module must recalculate the column checksum, ignoring the value at the faulty row index. This sum would then be subtracted from the matrix C checksum value to obtain the correct value, which would be stored in the matrix C RAM. Using the encoding matrix defined in Equation 4-1, ABFT will be able to detect single- and multiple-element errors. However, for specific distributions of multiple-element errors, the correction algorithm may fail. Weighted-checksum encoding matrices can be used to provide additional fault-localization capabilities for multiple-element errors. The ABFT designs discussed in later sections perform error detection only.

4.1.3 Resource-Overhead Experiments

In this section we analyze the overhead of the architectures presented in Section 4.1.2 and compare them to traditional fault-tolerance mitigation strategies. For this analysis, we use a 32-bit integer MM module and vary the number of processing elements and the type of parallelization. These MM modules can perform computation

PAGE 69

on matrices up to 32 × 32 elements in size, limited only by the amount of BRAM dedicated to storage. We compare fine-grained and coarse-grained parallel MM designs with several fault-tolerant designs (TMR, ABFT, and hybrid TMR/ABFT). The baseline and ABFT designs were synthesized using the Xilinx synthesis tool while the TMR designs used Synplify Premier. Each design was implemented on a Xilinx ML605 development board with a Virtex6-LX240T FPGA. The results of this comparison, with 1 processing element per design, are shown in Table 4-1 and are discussed below.

4.1.3.1 Resource overhead of serial architectures

The baseline (non-fault-tolerant) MM architecture with a single processing element uses 3 BRAMs to store input and output data. Matrices A and B are each stored in separate BRAM components in order to allow simultaneous accesses, while matrix C only requires a single BRAM. DSP48 components are reserved for the multiply-accumulate unit; each 32-bit MAC is implemented in 3 DSP48 units. All remaining addressing and indexing logic (counters, state machines, etc.) is implemented using slices (LUTs and FFs).

The ABFT-Shared design does not require any additional BRAM or DSP48 units over the baseline design. The additional logic needed to handle addressing matrices during checksum generation and verification increases the number of required slices by 14%. The ABFT-Extra design uses 100% more DSP48s and 32% more slices for the additional MAC; no additional BRAMs are needed.

For comparison, Table 4-1 also shows the resource usage of a TMR design. The serial TMR design has 146% overhead as compared to the baseline design. This TMR design was created using the high-reliability features of the Synplify Premier synthesis tool, which inserts low-level TMR voting into the ABFT-MM core's netlist. Alternative methods for creating TMR designs are available from Xilinx [Xilinx 2004], BYU-LANL [Pratt et al. 2006], Mentor Graphics [Mentor Graphics 2013], and others. As expected, 200% more DSPs and BRAMs are required for TMR. However, slice usage in

PAGE 70

the TMR design also increased less than expected. This difference may be caused by optimization during the Xilinx mapping and place-and-route processes. Since a Xilinx logic slice has many internal components (multiple lookup tables), additional logic may be packed into partially used slices, reducing the need for additional slices. We also examine a hybrid design based on the ABFT-Extra design which uses TMR on the address generator and all state machines within the design, but only uses ABFT along the datapath. This hybrid approach results in a fine-grained design that has approximately 156% overhead for slices but only 100% overhead on the limited DSP48 resources.

4.1.3.2 Resource overhead of parallel architectures

As additional processing elements are added to the existing serial MM designs, the resources required for the datapath should increase much more than those required for the control path. This asymmetric scaling makes resource overhead dependent on the amount of parallelism. Figure 4-4 and Figure 4-5 compare the slice overhead of each of the designs discussed in Section 4.1.3.1 while varying the number of processing elements from 1 to 32, which is a fully parallel implementation for the 32 × 32 MM. Figure 4-6 compares the DSP48 resource overhead for each of the designs. For each parallel design, both coarse-grained and fine-grained parallelizations are examined. Overhead of each fault-tolerant design is calculated based on the resource usage of the equivalent non-fault-tolerant design.

In the serial designs, the overhead of both fine-grained and coarse-grained ABFT-Extra designs is higher than that of the ABFT-Shared designs. However, the ABFT-Extra slice overhead decreases for highly parallel designs. In contrast, the ABFT-Shared designs show the opposite trend. For modules with a large number of processing elements, the ABFT-Shared control logic can require significantly more slices than the baseline, unprotected design. The overhead in the ABFT-Shared designs is attributed

PAGE 71

to additional multiplexers needed to correctly share access to the design's processing elements. Each of the explored ABFT designs has less than 60% slice overhead.

The overhead of the TMR designs ranges from 65% to 150% additional slices, significantly lower than the expected 200% overhead. In the fine-grained designs, relative overhead was smaller in the highly parallel implementations. However, for the coarse-grained designs, the overhead increased. The TMR design has a higher overhead (200%+) of LUTs; however, each slice is composed of multiple LUTs, and a slice is considered used if any of its LUTs are used. The TMR design can have a lower slice overhead when the design can be more efficiently packed into slices by using more LUTs per slice. The fine-grained architecture is able to take advantage of this efficient packing, while the coarse-grained architecture is not. For large designs (e.g., 32 PEs), the number of fully used slices in the original design increases, increasing TMR overhead by requiring additional slices for the TMR logic.

The slice overhead of the hybrid ABFT/TMR design is comparable to that of the TMR design. Designs with many processing elements have less resource overhead than smaller designs. The fine-grained design's slice overhead scales much more favorably than that of the coarse-grained design, which is correlated with the slice overhead for the TMR design. This slice overhead could be reduced by more selectively choosing which portions of logic to protect with TMR.

The limiting resources for the MM designs are the DSP48 and BRAM components. Figure 4-6 shows the overhead of DSP48 and BRAM components for the various designs. The results are identical for the fine-grained and coarse-grained parallelizations. As more processing elements are used in the ABFT-Extra MM module, the DSP48 overhead created by the additional ABFT MAC unit becomes extremely low (3.125% with 32 processing elements). The ABFT-Shared design requires zero additional DSP48 units. Additionally, neither the ABFT-Shared nor the ABFT-Extra method requires additional BRAMs. The Hybrid design has the same DSP48 overhead as the

PAGE 72

ABFT-Extra design because only control logic is replicated. The DSP48 and BRAM overhead for the TMR design was expected to be 200%, independent of the amount of parallelism. However, the BRAM usage actually increased by less than 100% for non-serial designs due to the underlying FPGA architecture, the design size, and optimizations performed by Synopsys' synthesizer. The Virtex-6 BRAM can be used as one 1024-element RAM or as two 512-element RAMs. The Xilinx tool will only infer the larger, 1024-element RAMs, while Synopsys can more efficiently use the smaller 512-element RAMs. Therefore, the 200% overhead is mitigated by the need for half as many large BRAMs, and the BRAM overhead will asymptotically approach 50% with more processing elements.

From a resource-overhead perspective, the ABFT designs have the desired properties of a low-overhead fault-tolerance method. All of the examined ABFT designs exhibited a slice overhead of less than 60%, with extremely low DSP48 resource overhead. The hybrid ABFT/TMR design should provide additional fault tolerance while not requiring replication of the limited DSP48 resources on the FPGA. In the next section, fault-injection testing will determine the effectiveness of the designs discussed in this section.

4.1.4 Fault-Injection Experiments

While Section 4.1.3 has shown that ABFT provides lower overhead compared to other fault-tolerance strategies, the reliability of ABFT must also be evaluated. In order to validate our ABFT design, faults must be injected into an executing system. In this section, we present FPGA fault-injection results gathered using the Simple, Portable Fault Injector (SPFI) [Cieslewski et al. 2010]. SPFI performs fault injection by modifying FPGA configuration frames within a design's bitstream, re-programming the FPGA, and comparing the resulting output against known values. Partial reconfiguration is used to reduce the time required to modify configuration memory and to improve the speed of fault injection. However, due to the length of time required to exhaustively test an entire

PAGE 73

FPGA design, statistical sampling is used to estimate the total number of vulnerable bits in a given design.

We implemented multiple fault-tolerant designs discussed in Section 4.1.2 on a Xilinx ML605 FPGA development platform. Each design used a 32-bit integer MM module with up to 32 processing elements. A UART was also implemented on the FPGA to stream input test vectors to the MM module and to report results back to a verification program. During the design phase, the MM module is constrained to a small portion of the FPGA. The SPFI fault-injection tool enables targeted injections, allowing the UART to be avoided during fault injection.

Table 4-2 shows the fault-injection results for each of the fault-tolerant MM designs from Section 4.1.2. For each design, the table indicates the number of injections performed, the number of undetected data errors, and the number of system hangs detected. The measured percentage of faults is then scaled to the size of the injection area to estimate the total number of vulnerable bits. In this analysis, vulnerable bits are the total number of bits which cause silent (undetected) data errors or system hangs. For ABFT designs, bits resulting in false-positive results were not considered vulnerable, since they do not allow faulty data to be propagated back to the host system.

The fault vulnerability for each component is divided by the FPGA's total number of configuration bits to estimate the component's design vulnerability factor (DVF). A design's configuration DVF represents the percentage of configuration bits that are vulnerable to faults and can result in errors. The total DVF of a component includes vulnerable configuration bits and BRAM data bits which may affect computation results. Reliability of a given design can then be calculated from the FPGA's total fault rate scaled by the design's DVF. Most Xilinx FPGA designs have a DVF that ranges from 1% to 10% [Xilinx 2010b] due to the large amount of configuration memory devoted to routing.
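The scaling described above can be sketched in a few lines. This is a minimal illustration of the arithmetic only, not the SPFI tool's implementation: the function names are hypothetical, and the injection-area and device sizes in the usage note are made-up round numbers rather than values from the tables.

```python
def estimate_vulnerable_bits(injections, data_errors, hangs, injection_area_bits):
    """Scale the sampled failure fraction to the full injection area.

    Only silent data errors and system hangs count as failures; for ABFT
    designs, false positives are excluded, as in the analysis above.
    """
    if injections <= 0:
        raise ValueError("at least one injection is required")
    failure_fraction = (data_errors + hangs) / injections
    return failure_fraction * injection_area_bits


def dvf_percent(vulnerable_bits, total_configuration_bits):
    """Design vulnerability factor: vulnerable bits as a percentage of the
    device's total configuration bits."""
    return 100.0 * vulnerable_bits / total_configuration_bits


def effective_fault_rate(device_fault_rate, dvf_pct):
    """Design-level fault rate: the device's total fault rate scaled by DVF."""
    return device_fault_rate * dvf_pct / 100.0
```

For example, with a hypothetical campaign of 100,000 injections that produced 1,500 silent data errors and 70 hangs inside a 2,000,000-bit injection area on a device with 50,000,000 configuration bits, `estimate_vulnerable_bits(100_000, 1_500, 70, 2_000_000)` gives about 31,400 vulnerable bits, and `dvf_percent` of that value gives a DVF of about 0.063%.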

PAGE 74

4.1.4.1 Design vulnerability of serial architectures

The non-fault-tolerant baseline MM design has an estimated 14,802 vulnerable configuration bits. The majority of these vulnerable bits result in undetected data errors. Other bits cause the system to become unresponsive, or hang. These system hangs may be caused by errors in the MM state-machine logic, or by errors in FPGA routing logic which prevent connections from the MM module to the communication UART. The baseline design is small and uses less than 0.5% of the total logic slices available on the FPGA. However, the configuration DVF of the baseline design is 0.026%. When vulnerable data stored in BRAMs is also included, the total DVF of the baseline design is 0.198%.

The ABFT-Shared design has an estimated 8,756 vulnerable bits (approximately 60% of the unprotected design) and a DVF of 0.016%. The vulnerable bits can affect address generation, checksum generation or verification, or the error-detection status register. BRAM data bits are initially unprotected, but are effectively safe once the ABFT checksum-generation process has taken place. If the input BRAMs, prior to the checksum calculation, are considered vulnerable, the ABFT-Shared design has a DVF of 0.127%.

In the ABFT-Extra design, the independent MAC unit for checksum generation and verification was expected to isolate faults in the main datapath from the ABFT checksum calculations, leading to a more reliable design. However, in the comparison of serial designs, the vulnerabilities of the two ABFT designs were similar, with the ABFT-Shared design having 5% fewer vulnerable bits than the ABFT-Extra design. Both ABFT designs experience a similar number of data errors, but fewer faults cause the ABFT-Extra design to hang.

The TMR design has an estimated 1,390 vulnerable bits and a total DVF of 0.002%. The majority of these bits are the result of routing faults or TMR majority-voter faults. This result represents a realistic lower bound on total vulnerability of the MM design. The TMR reliability results in Table 4-2 were obtained by using the Synplify Premier

PAGE 75

high-reliability tool, which resulted in significantly more reliable designs than a naive VHDL-level TMR approach, since the tool performs TMR with finer granularity and more frequent voting.

The hybrid ABFT design has approximately 30% lower vulnerability than the ABFT-Extra and ABFT-Shared designs, but higher vulnerability than the TMR design. However, the hybrid design has lower resource usage compared to the TMR design (see Table 4-1). For applications where BRAM or DSP48 resources are constrained, the hybrid ABFT design provides a good compromise between low vulnerability (0.012% configuration DVF) and low DSP48 usage (100% overhead).

4.1.4.2 Design vulnerability of parallel architectures

Although the serial implementations of the ABFT designs demonstrated a modest reduction in vulnerability, the benefits of ABFT are expected to be more visible in parallel implementations. With multiple processing elements, faulty elements will only affect a subset of the output data, improving fault localization. Figures 4-7 and 4-8 show the number of vulnerable bits for each design while varying the number of processing elements. In each of the baseline designs, the number of vulnerable bits increases linearly with the amount of parallelization. Intuitively, as the total area required for the design increases, the number of vulnerable bits increases proportionally. The coarse-grained design has a higher vulnerability per processing element than the fine-grained design due to larger resource requirements.

As more processing elements are used, the vulnerability of the ABFT designs increases, up to the 16-PE design. The rate of increase is much slower than that of the baseline designs, and therefore the improvement in reliability is positively correlated with the amount of parallelism employed. Additionally, in the fully parallel 32-PE designs, the ABFT designs exhibit their highest reliability. When fully parallelized, the control logic for the MM calculation is significantly simplified, resulting in fewer system hangs. For each

PAGE 76

of the parallel implementations, the ABFT-Shared design was more vulnerable than the ABFT-Extra design, with 20%-50% more vulnerable bits. Meanwhile, there was not a significant difference in vulnerability between the coarse-grained and fine-grained ABFT designs. Any benefits from fault localization in the coarse-grained design were offset by the increased resource requirements.

As expected, the TMR design has significantly fewer vulnerable bits than any of the other designs. Additionally, as the number of processing elements scales up, the number of vulnerable bits stays constant.

Finally, the hybrid ABFT design has a fault vulnerability similar to that of the ABFT-Extra design upon which it was based. The fine-grained hybrid design has lower vulnerability for implementations with up to 4 processing elements. For larger designs, the vulnerabilities are similar. For the coarse-grained designs, the hybrid design has 5%-15% fewer vulnerable bits than the ABFT-Extra design. In both cases, the number of false positives and system hangs is up to 70% lower in the hybrid design.

4.1.5 Analysis of Matrix-Multiplication Architectures

The DVFs of the serial matrix-multiplication designs tested in this section are very low, due to the small resource requirements in comparison to the size of the FPGA being used. However, the effectiveness of each design is readily measurable. While the TMR design is the most reliable design, the serial hybrid ABFT design does reduce the number of vulnerable bits by 56% over the baseline design. For the 32-MAC parallel design, the fine-grained hybrid ABFT design has 98% fewer vulnerable bits than the baseline design, while only incurring 25% slice overhead, 3.125% DSP48 overhead, and 0% BRAM overhead.

The observed errors were categorized into three types: silent data errors, system hangs, and false positives. While silent data errors must be reduced as much as possible, system hangs and false positives are correctable within a system-level reliability framework. Each of the tested designs, including the TMR design, experienced

PAGE 77

system hangs which prevented the system from returning data. In order to recover from system hangs, an external watchdog timer may be employed to reset the system, reconfigure the FPGA, and resume processing. Additionally, each of the ABFT designs experienced false-positive results. False-positive results cannot be identified during run-time, but the error flag will trigger the recovery procedure, causing an FPGA reconfiguration and re-computation of the last data set. By reducing false positives, the performance overhead incurred by recomputing data can be reduced. The hybrid ABFT design experienced up to 70% fewer false positives than the ABFT-Shared or ABFT-Extra designs.

Additionally, an examination of result matrices with data errors revealed that many faulty matrices contained all zero values, producing an incorrect result with a valid (zero) checksum. It may be possible to further improve the reliability of the ABFT designs by modifying the ABFT encoding matrix to ensure that output matrices cannot contain all zeros.

4.2 Fast Fourier Transform

The Fast Fourier Transform (FFT) is another key kernel in many space-based applications, such as synthetic-aperture radar and beamforming. While the FFT can be computed using traditional general-purpose processors or DSPs, the FFT has an efficient, high-throughput FPGA implementation. Unlike matrix multiplication, the FFT does not naturally fit into the traditional ABFT framework.

In Section 4.2.1, the mathematical description of our block-based ABFT approach for the FFT is presented. In Section 4.2.2, multiple FFT architectures are discussed for use within a generic ABFT-enabled FFT architecture. Section 4.2.3 analyzes the resource-overhead tradeoffs for each of the proposed architectures. Section 4.2.4 presents an analysis of fault-injection results obtained from each of the proposed architectures. Finally, Section 4.2.5 summarizes the results from the FFT analysis.

PAGE 78

4.2.1 Checksum-Based ABFT for FFTs

Since the Fourier transform is a linear operator, the properties of linearity can be used to show that ABFT techniques can be applied to the Fast Fourier Transform. For two vectors $a$ and $b$, the following equality holds for all linear operators $F$ (including the FFT):

$$F(a + b) = F(a) + F(b) \tag{4-5}$$

Using this property we can create a block-based ABFT technique that can be used to detect errors in blocks of FFTs. Instead of performing detection on each input vector, we form a matrix $A$ composed of individual signal vectors ($x_0, x_1, x_2, \ldots$). Following the approach in Section 4.1, we can then augment this matrix with an additional checksum row by using the same encoding vector $E_N$ as in Equation 4-1.

$$A = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-2} \\ x_{N-1} \end{bmatrix} \tag{4-6}$$

$$A_C = \begin{bmatrix} A \\ E_N^T A \end{bmatrix} = \begin{bmatrix} A \\ \sum_{i=0}^{N-1} x_i \end{bmatrix} \tag{4-7}$$

By performing an FFT on each row of the augmented matrix $A_C$, we produce an output matrix $B_C$ which contains the transformed values of each input vector and

PAGE 79

the transformed checksum vector. The transformed checksum vector will be a valid checksum for the output matrix, as shown in Equation 4-8.

$$F(A_C) = \begin{bmatrix} F(x_0) \\ \vdots \\ F(x_{N-1}) \\ F\left(\sum_{i=0}^{N-1} x_i\right) \end{bmatrix} \tag{4-8}$$

$$B_C = \begin{bmatrix} X_0 \\ \vdots \\ X_{N-1} \\ \sum_{i=0}^{N-1} X_i \end{bmatrix} \tag{4-9}$$

Using the property of linearity described in Equation 4-5, the checksum rows of $F(A_C)$ and $B_C$ are equivalent. In the event of an error during computation, the error will be detectable, although more complex encoding vectors are required to detect which vector is faulty. The absolute overhead of using this ABFT technique is dependent on the number of vectors used per ABFT block. Given $N$ vectors of $M$ elements, the additional overhead of checksum generation and verification is $O(NM)$, while the additional core computation is $O(M \log M)$. As a function of ABFT block size, the relative overhead of ABFT is $O(1/N)$.

4.2.2 FFT Architectures

Multiple FFT modules were created to examine the reliability of hardware-based ABFT. Unlike the previous matrix-multiplication architectures, this analysis uses pre-built, single-precision floating-point operators to construct each design. Commonly, pre-existing IP is used to reduce development costs by improving design and verification time. The FFT operators and floating-point arithmetic operators used in the following sections are pre-existing IP generated using the Xilinx CoreGen tool [Xilinx 2013a]. The

PAGE 80

Xilinx FFT operator has several possible architectures, including a high-performance pipelined architecture and a minimal-hardware serial architecture, which are compared in this section. The ABFT logic for each design was generated by combining these pre-existing IP cores to perform the checksum operations. This section discusses trade-offs of each FFT architecture and the design decisions that were made for the ABFT architecture.

4.2.2.1 Radix-2 Burst-IO FFT architecture

The basic, serial architecture for the fast Fourier transform function consists of a single butterfly operator (complex multiplication and addition), an address-generation module, and two data-storage modules (BRAM), as shown in Figure 4-9(a). One memory is used to store the input matrix A, and one memory is used for the resulting output matrix B. The address generator iterates through the correct matrix indices, sends data stored in the input BRAMs to the FFT operator, and generates the appropriate address for output values. This FFT module can be used for any size matrix as long as enough data storage is supplied. For an M-point FFT, this architecture requires O(M log(M)) cycles to calculate an output vector.

4.2.2.2 Radix-2 Pipelined FFT architecture

The pipelined FFT architecture has the same high-level architectural diagram as in Figure 4-9(a). However, internally it uses log(M) butterfly operators to enable high throughput. The latency of the pipelined architecture is approximately O(2M) cycles, and the time to compute N vectors is O(NM) cycles, resulting in a significant improvement in processing speed while calculating multiple, consecutive FFTs.

4.2.2.3 Architectural modifications for ABFT

The addition of ABFT logic requires the creation of two functions, ABFT checksum generation and ABFT checksum verification, each requiring a floating-point accumulator. Unlike the matrix-multiplication case study, the ABFT checksum logic is significantly different from the algorithm's main computation, and the logic must be implemented

PAGE 81

separately. Checksum generation sums each column of matrix A using a floating-point accumulator and writes the checksum into a checksum BRAM. After the FFT operation, the checksum verification function sums the columns of matrix B and writes the value to a checksum BRAM. Then, the checksum verification function uses a floating-point comparison operator to compare checksum values in the checksum BRAMs. If a mismatch is detected, an error-found signal is asserted until the module is reset. In order to reduce the BRAM requirements, the last row of the matrix A and B BRAMs may be reserved for checksums. In this case, the initially generated checksum will be written to the matrix A BRAM and the verification checksum will overwrite that checksum value after computation is complete. Figure 4-9(b) shows an example of an ABFT-enabled FFT architecture where checksums are stored within the matrix A and B BRAMs.

Due to the nature of floating-point arithmetic, the calculated checksums may not be exact. Quantization error is introduced due to the order of additions as well as the internal rounding within the FFT module. For most applications using FFTs, such as image processing, data errors that occur in low-significance bits may be safely ignored, and errors that avoid detection should not significantly impact the higher-level application. In order to handle errors in the highly significant bits, ABFT detects errors by comparing the difference of the two generated checksums to a user-defined threshold value. For maximum coverage, this threshold should be set as close to zero as possible. Unfortunately, determination of the threshold value is both algorithm-specific and data-specific. In order to estimate the required threshold values while calculating blocks of FFTs, MATLAB simulations were used to measure the maximum quantization error encountered while using various input data sets. For each simulation, test vectors were generated using a uniform random distribution on the interval (-x/2, x/2), where x is the range of the input data. Figure 4-10 shows the required threshold to avoid false positives for the block-based ABFT for FFT while varying the range of the input data. The observed linear trend simplifies the selection of an acceptable threshold value.
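The block-based check and its threshold comparison can be illustrated in software. The sketch below is not the hardware design: it assumes a naive O(M^2) DFT in place of the CoreGen FFT core, double-precision Python arithmetic instead of single-precision operators, and hypothetical function names. The `make_faulty` wrapper is a stand-in for a fault that corrupts one output vector.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform; stands in for the FFT operator."""
    m = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / m) for n in range(m))
            for k in range(m)]


def column_sums(rows):
    """Element-wise sum of equal-length vectors (all-ones encoding vector)."""
    return [sum(column) for column in zip(*rows)]


def abft_fft_block(block, threshold, transform=dft):
    """Transform each vector in `block` and verify the block checksum.

    Returns (outputs, error_found). The check relies on linearity: the
    transform of the summed inputs should equal the sum of the transforms.
    """
    input_checksum = column_sums(block)          # checksum generation
    outputs = [transform(x) for x in block]      # main computation
    reference = transform(input_checksum)        # transformed checksum row
    recomputed = column_sums(outputs)            # checksum verification
    max_difference = max(abs(r - c) for r, c in zip(reference, recomputed))
    return outputs, max_difference > threshold


def make_faulty(transform, bad_call, delta):
    """Wrap `transform` so that the call numbered `bad_call` (counting from
    zero) has `delta` added to its first output element."""
    state = {"calls": 0}

    def faulty(x):
        y = transform(x)
        if state["calls"] == bad_call:
            y[0] += delta
        state["calls"] += 1
        return y

    return faulty
```

With a threshold of 1e-9 and a small block of well-scaled inputs, a fault-free run raises no error, a corruption of 1.0 in one output vector is flagged, and a corruption of 1e-12 passes unnoticed. The last case mirrors the point above that errors below the threshold are deliberately tolerated.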

PAGE 82

4.2.3 Resource-Overhead Experiments

In this section we analyze the overhead of the ABFT architectures presented in Section 4.2.2 and compare them to traditional fault-tolerance mitigation strategies. For this analysis, 64-point FFTs are used, enabling up to 16 complex input vectors to fit within two BRAM components. The floating-point Burst-IO and Pipelined FFT modules are generated using the Xilinx CoreGen utility. We compare the two baseline FFT architectures with several fault-tolerant designs (TMR, ABFT, and hybrid TMR/ABFT). The results of this comparison are shown in Table 4-3.

4.2.3.1 Resource overhead of Burst-IO architecture

In the baseline Burst-IO architecture, 4 BRAMs are used to store input and output data. The FFT module uses an additional 4 BRAMs to hold intermediate values during processing. Additionally, 8 DSP48 units are used for the FFT butterfly computation. The ABFT design requires 4 additional DSP48 units and 260 slices for the floating-point addition, floating-point comparison, and control-logic functionality, representing 42% slice overhead and 50% DSP overhead. Unlike the MM case study, the FFT TMR design has approximately 200% overhead for each of the slice, BRAM, and DSP resources. Since the FFT components are assembled from pre-built netlists produced by CoreGen, little optimization is performed during synthesis, resulting in the expected amount of resource overhead. In the hybrid FFT design, each of the state machines controlling FFT or ABFT components in the design is protected using TMR while the FFT and ABFT floating-point operators remain unchanged. The hybrid design has 48% slice overhead and 50% DSP overhead. For larger FFTs (e.g., 128+ points), the Burst-IO DSP48 resource requirements remain constant while additional BRAMs are required for intermediate result storage. The 50% DSP48 overhead of the ABFT design is therefore independent of the parameters of the FFT operator.

4.2.3.2 Resource overhead of Pipelined architecture

In the baseline Pipelined architecture, 4 BRAMs are used to store input and output data. The FFT module uses 16 DSP48 units for the FFT butterfly computation and one additional BRAM to hold intermediate values during processing. The Pipelined FFT uses almost twice as many slices and DSP48s, but fewer BRAMs, than the Burst-IO version. The ABFT design increases the resource requirements by the same amount as the Burst-IO design (i.e., 4 DSP48s and 260 slices). Since the Pipelined design is approximately twice as large as the Burst-IO design, the overhead is correspondingly lower at 23% slice overhead and 25% DSP overhead. The TMR design has approximately 200% overhead of each of the slice, BRAM, and DSP resources. The Pipelined hybrid design uses the same methodology as the Burst-IO hybrid design and has only 26% slice overhead and 25% DSP overhead.

For larger FFTs (e.g., 128+ points), the size of the Pipelined design increases because more pipeline stages and butterfly operators are required. As the size of the FFT increases, the ABFT requirements remain static, and the ABFT resource overhead is reduced. For example, a 1024-point FFT would require 34 DSP48 components, and the 4 DSP48 components for the ABFT checksums would add only 12% overhead to the baseline design.

4.2.4 FFT Fault Injection

In order to validate our ABFT FFT designs, faults must be injected into an executing system. In this section, we present FPGA fault-injection results gathered using the methodology described in Section 4.1.4. For the ABFT designs, the value for the checksum error threshold was calculated from the input test vectors during a fault-free processing run. Table 4-4 shows the fault-injection results for each of the fault-tolerant FFT designs from Section 4.2.2. For each design, the table indicates the number of injections performed, the number of undetected data errors, and the number of system hangs detected. The measured percentage of faults is then scaled to the size of the injection area to

estimate the total number of vulnerable bits. In this analysis, vulnerable bits are the total number of bits which cause silent (undetected) data errors or system hangs. For ABFT designs, bits resulting in false-positive results were not considered vulnerable, since they do not allow faulty data to be propagated back to the host system.

The Burst-IO baseline design has an estimated 48,012 vulnerable configuration bits and a configuration DVF of 0.084%. The total DVF includes an additional 131,072 memory bits in BRAM, increasing the baseline design's total DVF to 0.314%. In the baseline design, 80% of the configuration memory faults cause data errors while 20% cause system hangs. The Burst-IO ABFT design reduces the total number of vulnerable configuration bits by 80% to 9,624. Additionally, in the ABFT design, the Matrix A BRAM data bits are only vulnerable before the checksum value has been computed and the Matrix B BRAM values are always protected, improving the total DVF of the design to 0.125%. The most dangerous type of error, the silent data error, is reduced by 95% over the baseline design. The distribution of error types in the ABFT design is reversed from the baseline design's distribution; 80% of configuration faults result in system hangs while only 20% cause data errors. The hybrid ABFT/TMR design reduces the design vulnerability by an additional 30%, to only 6,662 vulnerable bits. The hybrid design decreases the occurrence of system hangs more than data errors. Finally, the TMR FFT design has 3,278 vulnerable bits, 94% fewer vulnerable bits than the baseline design, while protecting all BRAM data bits.

The Pipelined baseline design is much more vulnerable than the Burst-IO counterpart, with 182,192 vulnerable configuration bits (0.320% configuration DVF). The increased vulnerability comes from the larger resource requirements and increased routing complexity of the pipelined architecture. The Pipelined ABFT design reduces data errors by 97% but only reduces system hangs by 33%. The Pipelined ABFT design's error-type distribution is more skewed than the Burst-IO case; 85% of errors result in system hangs and 15% result in data errors. The Pipelined hybrid ABFT/TMR

design reduces the design vulnerability by an additional 20%, to 22,251 vulnerable configuration bits. The hybrid design decreases the occurrence of system hangs more than data errors. Finally, the pipelined TMR FFT design has 9,306 vulnerable bits, 95% fewer vulnerable configuration bits than the baseline design, while also protecting all BRAM data bits.

4.2.5 Analysis of FFT Architectures

The FFT architectures examined in this section are significantly more complex than the matrix-multiplication architectures of Section 4.1. Due to algorithm complexity and the use of floating-point precision operators, the resource requirements are much higher. For example, the baseline Burst-IO design is 300% larger than the baseline MM design. Additionally, the floating-point precision also increased the resource requirements of the ABFT logic. However, even with the smaller Burst-IO architecture, the overhead of the ABFT design was only 50%, significantly better than the TMR alternative. For the pipelined FFT designs, larger FFTs do not increase the resources required by the ABFT component, further improving resource overhead.

In each of the FFT designs, the number of vulnerable bits that cause system hangs is much higher than in the MM case studies. The generated FFT components have substantially more control logic than the MM designs, which can lead to system hangs when upset. In the baseline FFT designs, most vulnerable bits result in corrupt data, but the proportion of vulnerable bits causing system hangs is higher than in the MM case studies. In each of the fault-tolerant designs, more vulnerable bits result in system hangs than corrupt data. While the hybrid design does not significantly reduce data corruption, it does lower the occurrence of system hangs. Although TMR provides the highest reliability designs, ABFT is effective in reducing undetected data errors. If a system-level mechanism (e.g., watchdog timer) is used to recover from system hangs, the ABFT designs can approach the reliability of TMR.
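
The vulnerability figures discussed in Sections 4.2.4 and 4.2.5 come from scaling the observed failure fraction to the size of the injection area, then comparing designs against the baseline. A minimal sketch of that arithmetic follows. The function names are illustrative, and the injection-area size is an assumption: it is back-computed from Table 4-4 (roughly 2.115 million bits) rather than stated in this section.

```python
def estimate_vulnerable_bits(data_errors, hangs, injections, area_bits):
    """Scale the observed failure fraction to the number of bits in the injection area."""
    failure_fraction = (data_errors + hangs) / injections
    return failure_fraction * area_bits

def reduction_percent(baseline, improved):
    """Percentage reduction of a count relative to the baseline design."""
    return 100.0 * (baseline - improved) / baseline

# Assumed injection-area size, inferred from Table 4-4 and not given in the text.
ASSUMED_AREA_BITS = 2_115_000

# Burst-IO baseline row of Table 4-4: 1,809 data errors and 461 hangs in 100,000 injections.
baseline_bits = estimate_vulnerable_bits(1809, 461, 100_000, ASSUMED_AREA_BITS)

# Reductions quoted for the Burst-IO ABFT design.
config_bit_reduction = reduction_percent(48_012, 9_624)   # vulnerable configuration bits
silent_error_reduction = reduction_percent(1809, 98)      # undetected data errors
```

Under the assumed area, the baseline estimate lands within a few bits of the 48,012 reported in the table, and the two reductions round to the 80% and 95% figures quoted in the text.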

4.3 Conclusions

In this chapter, we have presented a novel analysis of ABFT for low-overhead fault tolerance in FPGA systems. Several matrix-multiplication and FFT designs employing TMR and ABFT fault-tolerance techniques were developed and tested using an FPGA fault-injection tool. The results demonstrated that ABFT was capable of reducing the number of vulnerable configuration bits in a design while also protecting most memory bits. While the TMR fault-mitigation approach had the lowest vulnerability, a hybrid ABFT/TMR design approach was able to lower configuration vulnerability while maintaining low overhead. With matrix multiplication, configuration vulnerability was reduced by 90% in highly-parallel ABFT architectures while using less than 20% slice and DSP48 resource overhead. For FFT architectures, configuration DVF was lowered by 85% while using 50% or less additional resources. As the size and parallelism of each of the ABFT architectures increase, the relative effectiveness of ABFT also increases.

Matrix multiplication and Fast Fourier Transforms are key kernels in many space-based processing systems. The ABFT architectures described in this work are generic and can be applied to many of these systems. Additionally, other linear operations, such as LU or QR decomposition, can also be protected using ABFT.

The vulnerability results show that ABFT can be used as part of a comprehensive fault-tolerance strategy for space-based FPGA systems. ABFT detects most data errors that occur during processing. ECC techniques (e.g., Hamming, parity) can be used to protect data memory with minimal overhead. A higher-level watchdog mechanism can be used to recover from system hangs and crashes. Many of these techniques, in combination with TMR, are already in use in space systems. By selectively removing replicated components and using ABFT in their place, space systems can increase their processing capabilities without sacrificing reliability.

Table 4-1. Resource utilization and overhead of serial MM designs.

Fault tolerance   Slice count   Slice overhead   BRAM count   BRAM overhead   DSP48 count   DSP48 overhead
None              158           -                3            -               3             -
ABFT-Shared       180           14%              3            0%              3             0%
ABFT-Extra        208           32%              3            0%              6             100%
TMR               388           146%             9            200%            9             200%
Hybrid            404           156%             3            0%              6             100%

Table 4-2. Serial matrix multiplication fault-injection results.

Design name    Faults injected   Data errors   System hangs   Vulnerable config bits (est.)   Configuration DVF   Total DVF
Baseline       100,000           3,114         368            14,802                          0.026%              0.198%
ABFT-Shared    100,000           1,056         633            8,756                           0.016%              0.127%
ABFT-Extra     100,000           1,055         712            9,161                           0.016%              0.127%
TMR            100,000           118           16             1,390                           0.002%              0.002%
Hybrid ABFT    100,000           348           281            6,522                           0.012%              0.123%

Table 4-3. Resource utilization and overhead of FFT designs.

Architecture   Fault tolerance   Slice count   Slice overhead   BRAM count   BRAM overhead   DSP48 count   DSP48 overhead
Burst-IO       None              607           -                8            -               8             -
Burst-IO       ABFT              867           43%              8            0%              12            50%
Burst-IO       TMR               1834          202%             24           200%            24            200%
Burst-IO       Hybrid            901           48%              8            0%              12            50%
Pipelined      None              1129          -                5            -               16            -
Pipelined      ABFT              1390          23%              5            0%              20            25%
Pipelined      TMR               3404          202%             15           200%            48            200%
Pipelined      Hybrid            1424          26%              5            0%              20            25%

Table 4-4. FFT fault-injection results.

Design name          Faults injected   Data errors   System hangs   Vulnerable config bits (est.)   Configuration DVF   Total DVF
Burst-IO-Baseline    100,000           1,809         461            48,012                          0.084%              0.314%
Burst-IO-ABFT        100,000           98            357            9,624                           0.017%              0.125%
Burst-IO-TMR         100,000           27            128            3,278                           0.006%              0.006%
Burst-IO-Hybrid      100,000           80            235            6,662                           0.012%              0.120%
Pipelined-Baseline   100,000           6,891         1,723          182,192                         0.320%              0.550%
Pipelined-ABFT       100,000           180           1,112          27,327                          0.048%              0.156%
Pipelined-TMR        100,000           68            97             9,306                           0.016%              0.016%
Pipelined-Hybrid     100,000           162           890            22,251                          0.039%              0.147%

Figure 4-1. Matrix-multiplication architectures. A) Baseline. B) Fine-grained parallel. C) Coarse-grained parallel.

1 for(i=0;i

Figure 4-4. Slice overhead of fine-grained parallel matrix multiplication.

Figure 4-5. Slice overhead of coarse-grained parallel matrix multiplication.

Figure 4-6. DSP48 and Block RAM overhead of parallel matrix multiplication.

Figure 4-7. Fault vulnerability of fine-grained parallel matrix multiplication.

Figure 4-8. Fault vulnerability of coarse-grained parallel matrix multiplication.

Figure 4-9. FFT architectures. A) Serial. B) ABFT.

Figure 4-10. Required ABFT threshold value for floating-point FFTs.

CHAPTER 5
RFT SYSTEM INTEGRATION FOR RAPID SYSTEM DEVELOPMENT

The RFT-based system presented in Section 3.1 is intended to be a reference design that can be adapted to actual space-system architectures. In this chapter, we investigate possible methods and tools that will improve the usability of RFT-based systems for system developers and so facilitate adoption of our framework. Our approach, shown in Figure 5-1, has identified two areas of RFT system design that may be improved through external tools. The first area, generating RFT hardware for arbitrary system configurations, can be facilitated by creating multiple system templates for various interconnection architectures. Specifically, we target architectural support in order to provide compatibility with commonly used partial reconfiguration architectures, such as VAPRES [Jara-Berrocal and Gordon-Ross 2010]. The second area, software-based support for intelligent fault-tolerance mode selection, will combine fault-rate modeling and task scheduling to optimize job placement for high performability. Research into each of these areas will facilitate the adoption of the RFT framework into new and existing space systems.

In this chapter, we outline our approach for improving hardware and software support for RFT systems. Section 5.1 discusses RFT architectural templates, integration of an RFT controller into an existing VAPRES PR-based system, and parameterized VHDL designs for RFT controller logic. RFT controller generation and system integration will reduce the time required to develop RFT systems by using existing designs with little modification and dynamically creating new RFT-related components. Section 5.2 discusses reliable task scheduling, which will reduce the complexity of RFT software requirements.

5.1 Dynamically Generated RFT Components

The RFT controller created for Section 3.1 was designed for a small RFT system consisting of only three PRRs and connected to a host CPU through a bus-based

interconnect. However, for maximum flexibility, the RFT framework must support systems with an arbitrary number of PRRs and interconnect types. Additionally, a system designer may decide to support only a subset of the fault-tolerance modes allowed by the RFT framework. Dynamic RFT controller generation allows a system designer to specify the size and supported features of an RFT controller customized for their specific system. By disabling unneeded features, the overhead of the RFT controller can be reduced. The components of the RFT controller can be divided into two categories: system interconnect interfaces and fault-tolerance features. The type of system interconnect affects the structure of the Address Decoder component. The fault-tolerance features affect the Voting Logic, Output Mux, and Watchdog Timers components.

5.1.1 RFT Controller Point-to-Point Interface

The RFT hardware architecture from Section 3.1 was used as an example bus-based PR architecture. However, to improve usability with pre-existing systems, it is possible to adapt the RFT controller to be used in other PR architectures. By adapting the RFT controller to be used within a point-to-point VAPRES PR system, it will be possible to quickly provide fault tolerance to pre-existing VAPRES systems. Other existing PR architectures can be classified as either bus-based or point-to-point.

The initial RFT controller connects to the rest of the system through the Processor Local Bus (PLB). PRMs connected to the RFT controller must act as slave devices. Slave devices can respond to bus requests, but cannot initiate transfers. For example, the processor can read from a PRM's memory, but the PRM cannot write a value directly to the host processor's memory. The RFT controller instantiates a slave PLB controller to handle communication with the host processor. The PLB controller converts the bus signals into a simplified set of user signals.

The primary difference between the system from Section 3.1 and a VAPRES-based system is the use of Fast Simplex Links (FSL) for communications between the

MicroBlaze processor and the PRRs. An FSL template for the RFT controller must be created in order to operate with a point-to-point architecture (VAPRES). An FSL is a FIFO queue that directly connects the MicroBlaze with user logic. The MicroBlaze instruction set has specific intrinsic functions for reading from and writing to the FSLs. These direct links from the PRMs to the MicroBlaze increase throughput and reduce latency from bus contention.

For an RFT controller operating in high-performance mode, data from an FSL could be passed directly to its connected PRR. However, an FSL-based RFT controller must be able to send data from an arbitrary FSL to any other FSL in order to operate in TMR or DWC modes. For example, in TMR mode, any data written to FSL #0 must also be simultaneously written to FSL #1 and FSL #2 to ensure correct functionality. In order to accomplish this, the FSL-based RFT controller must include an additional bi-directional FIFO queue for each PRR. The RFT controller must process all incoming data, route the data to the appropriate PRM, and write any output data to the correct output FSL. This requires an additional 2 BRAMs per PRR plus the necessary control logic. Table 5-1 shows a comparison of the resource requirements for the bus-based and point-to-point RFT controllers.

Since most PR-based systems can be classified as either bus-based or point-to-point, the two RFT controllers created from this work can serve as reference designs. Although alternative buses or protocols may be used in future systems, the overall architecture should remain applicable.

5.1.2 Parameterized and Configurable Voting Logic

While the RFT controller's input- and output-address decoder hardware may be dependent on the interface to the larger PR system, many other components are only dependent on the number of PRRs in the design. These components, created using VHDL, can be parameterized in order to be easily adapted to arbitrarily large systems. The main parameter, NUM_PRRS, scales the RFT controller interfaces to the

appropriate size for the system. The parameterized components we examine are the watchdog timers, RFT status registers, output mux, and voting logic.

The ENABLE_WATCHDOG parameter enables the creation of watchdog timer logic for each PRR in the design. Additionally, a watchdog configuration register is created for each PRR. For PRMs which do not wish to use the watchdog functionality, the configuration register can be used to disable the timer. NUM_PRRS status registers are created.

There are several parameters related to the creation of voting logic within the RFT controller. The ENABLE_TMR and ENABLE_DWC parameters create TMR or DWC voters, respectively. One TMR voter is created for every three PRRs, and one DWC voter is created for every two PRRs. By not using $\binom{\mathrm{NUM\_PRRS}}{3}$ or $\binom{\mathrm{NUM\_PRRS}}{2}$ voters, respectively, we reduce the total number of required resources at the expense of some flexibility. Based on the parameterized voters, the output multiplexer can then select from the complete enumeration of the possible RFT-mode combinations.

These simple parameters allow for the creation of a scalable RFT controller design. By enabling or disabling specific fault-tolerance features, the RFT controller can be targeted and optimized for specific architectures or applications. As system size increases, the parameterized design automatically creates the logic required for additional PRRs.

5.2 Task Scheduling for RC Systems in Dynamic Fault-Rate Environments

In order to optimize a system for performance and reliability, the fault-tolerant mode must be selected carefully based upon the current fault conditions experienced by the system. In this section, we propose a fault-tolerant scheduler that can schedule real-time tasks and that uses a heuristic to select a fault-tolerant RFT mode for each task. We then evaluate the effectiveness of the proposed heuristic using two case-study simulations.

5.2.1 Selection Criteria for Fault-Tolerant Mode

RFT mode switching can be triggered by a priori knowledge of the operating environment, application-triggered events, or external events. In an RFT system, the expected fault rate can be estimated either directly or indirectly. An external radiation sensor can be directly interfaced with the FPGA, allowing the system to track the current fault rate and predict future fault rates. Alternatively, the RFT system can indirectly determine fault rates using models of the expected fault environment from Chapter 3. By correlating the space system's current position to an existing model, a fault-rate estimate can be used to make scheduling decisions.

In the following sections we assume that tasks can be scheduled with no fault tolerance (Simplex), duplication with compare (DWC), or triple-modular redundancy (TMR). Data errors in tasks are detected by comparing or voting on the output of each task replica. We also assume that the system can estimate the current fault environment using a pre-existing model of the system's orbit.

5.2.1.1 FT-mode selection using thresholds

One of the most straightforward methods for selecting an appropriate fault-tolerance mode is the use of thresholds. At very high fault rates, TMR is required to maintain reliability. At low fault rates, DWC or Simplex modes may provide sufficient reliability while increasing performance. At each time step, the current fault rate is measured, and new tasks are assigned based on the pre-selected rules.

The optimal threshold occurs at the fault rate where the reliable performance of TMR and DWC are equal. The ideal scheduling heuristic will select TMR when the current fault rate is above the fault-rate threshold, $f_{thresh}$, and will select DWC otherwise. Selecting the appropriate values for the threshold is dependent on the application and environment. Determining $f_{thresh}$ requires information about fault rates, task frequency, task load, and other factors which may not be static throughout a system's operation.
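
The threshold rule described above can be sketched in a few lines. This is an illustration only: the `FTMode` enumeration, the function name, and the example threshold and fault-rate values are assumptions rather than part of the RFT implementation.

```python
from enum import Enum

class FTMode(Enum):
    """Fault-tolerance modes; each value is the number of PRRs the mode occupies."""
    SIMPLEX = 1
    DWC = 2
    TMR = 3

def select_mode_by_threshold(fault_rate, f_thresh):
    """Choose TMR when the current fault rate exceeds the threshold, DWC otherwise."""
    return FTMode.TMR if fault_rate > f_thresh else FTMode.DWC

# Illustrative values only: a hypothetical threshold and a short fault-rate trace (faults/sec).
F_THRESH = 2.5e-3
trace = [1.0e-4, 8.0e-4, 3.0e-3, 9.0e-3, 5.0e-4]
modes = [select_mode_by_threshold(rate, F_THRESH) for rate in trace]
```

The rule itself is trivial; the difficulty lies in choosing the threshold, which depends on the application and environment as noted above.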

5.2.1.2 Time-resource metric for FT-mode selection

Instead of depending upon a user-defined threshold value to determine a fault-tolerance strategy, we explore a possible metric which can estimate the optimal threshold. The metric combines the computation time ($\tau$) and fault probability ($f$) of a single task in order to select between FT modes. By developing a scheduling metric that incorporates a dynamic fault rate, we intend to improve overall system performance without the need for expert user input.

Given the current fault rate, $f$, the reliability of a task at the end of its computation time, $R$, is given by the following equations (depending on mode):

$$R_{Simplex} = (1-f)^{\tau}$$
$$R_{DWC} = (1-f)^{2\tau}$$
$$R_{TMR} = 3(1-f)^{2\tau} - 2(1-f)^{3\tau} \qquad (5\text{-}1)$$

If tasks are running in DWC or TMR mode, faults are discovered at the end of the task's computation time. Faulty tasks are then rescheduled until they complete successfully. The average number of times a task must be executed in order to successfully complete is given by a geometric series, which yields the effective execution time:

$$\tau_e = \tau \sum_{n=0}^{\infty} (1-R)^n = \frac{\tau}{R} \qquad (5\text{-}2)$$

We define a time-resource coefficient, $\rho$, which combines the effective execution time with the required resources of a given task ($\rho = N \tau_e$, where $N$ is the number of PRRs the task occupies). Then, by comparing the $\rho$ of a DWC or TMR task, we can determine which mode is optimal for reliability (lower is better). In Equation 5-3, we solve for conditions where DWC will provide lower $\rho$ than TMR.

$$\frac{2\tau}{(1-f)^{2\tau}} \leq \frac{3\tau}{3(1-f)^{2\tau} - 2(1-f)^{3\tau}} \qquad (5\text{-}3)$$

Simplifying Equation 5-3 provides the following simple relation:

$$(1-f)^{\tau} \geq \frac{3}{4} \qquad (5\text{-}4)$$

Based on the definition of $\rho$, DWC provides more reliable performance than TMR when $R_{Simplex}$ is greater than 0.75. For low fault rates, DWC provides higher overall performance. For very high fault rates, or very long execution times, the reliability of TMR scheduling is preferred. A similar analysis can be performed for simplex tasks; however, simplex scheduling has no method for detecting faults. We use this conclusion as the basis for an adaptive fault-tolerance threshold.

5.2.2 Scheduler for RFT

Traditional fault-tolerant scheduling algorithms assume that the fault rate experienced by the system will be constant, and that the fault-tolerance strategy will also be constant. For an RFT-based system, a scheduler which uses the current fault rate is necessary to maximize system utilization while maintaining system availability. The fault-tolerant scheduler presented in this section can schedule tasks in any FT mode based on user-defined thresholds or the $\rho$-metric.

5.2.2.1 RFT architecture description

The RFT system described in Chapter 3 contains a microprocessor connected to several large partially-reconfigurable regions (PRRs) through a shared system bus. During normal system operation, unique tasks can be scheduled to any of the PRRs. Depending on system configuration, the outputs of three contiguous PRRs can be voted on to provide coarse-grained TMR functionality, or two PRRs can provide DWC functionality. Each of these PRRs is large and identical in size, and can be represented with a 1D area model, reducing many of the scheduling problems presented in Section 2.7.
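
The time-resource comparison that the scheduler can use for mode selection (Section 5.2.1.2) can be sketched as follows. This is a minimal illustration assuming the reliability expressions and the time-resource coefficient as reconstructed in that section; the function and variable names are hypothetical, and the symbol names follow this text rather than any reference implementation.

```python
REGIONS = {"Simplex": 1, "DWC": 2, "TMR": 3}   # PRRs occupied by each mode

def reliability(mode, f, tau):
    """Probability that a task of length tau finishes correctly at fault probability f."""
    s = (1.0 - f) ** tau            # single-replica (Simplex) reliability
    if mode == "Simplex":
        return s
    if mode == "DWC":
        return s ** 2               # both replicas must be fault-free
    if mode == "TMR":
        return 3 * s ** 2 - 2 * s ** 3   # at least two of three replicas fault-free
    raise ValueError(f"unknown mode: {mode}")

def time_resource(mode, f, tau):
    """Time-resource coefficient: regions used times effective execution time tau / R."""
    return REGIONS[mode] * tau / reliability(mode, f, tau)

def select_mode_adaptive(f, tau):
    """Pick DWC when its coefficient is no larger than TMR's, otherwise TMR."""
    if time_resource("DWC", f, tau) <= time_resource("TMR", f, tau):
        return "DWC"
    return "TMR"

# Illustrative inputs: a short task at a low fault rate and a long task at a high one.
low_rate_choice = select_mode_adaptive(f=0.001, tau=55)
high_rate_choice = select_mode_adaptive(f=0.01, tau=100)
```

Comparing the two coefficients directly is equivalent to testing whether the Simplex reliability is at least 0.75, so a scheduler could evaluate that single expression instead. Note that the decision depends on the task's execution time as well as the fault rate, which is why tasks of different lengths may receive different modes at the same fault rate.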

5.2.3 Software Simulation

In order to evaluate our scheduling technique and possible heuristics, a software-based discrete-time simulator was developed in C++. The simulator enables us to specify task arrival rates, task deadlines, dynamic fault rates, and scheduling algorithms. In addition to scheduling tasks, the simulator can also inject faults into tasks and force re-scheduling of failed tasks.

Figure 5-3 shows the basic overview of how the simulator is used. At each time step, tasks are randomly added to a task pool. This process is modeled as a Poisson process with mean arrival rate $\lambda$. All tasks in the task pool are scheduled, if possible, and then moved to the reservation list. When multiple tasks arrive simultaneously, tasks are scheduled using an earliest-deadline-first (EDF) heuristic. The scheduler does not employ preemption; tasks are scheduled on arrival only, and new tasks must be placed around the existing schedule. Tasks which cannot be scheduled before their deadlines are rejected. If a task is scheduled to begin at the current time step, the simulator moves the task from the reservation list to the execution list.

After tasks have been scheduled, faults are injected into each PRR with probability $f$ in order to simulate the dynamic fault environment. At the end of the task's execution, the outcome of the task is determined based upon the number of faults encountered (i.e., Simplex and DWC tasks fail with 1 fault, TMR tasks fail with faults in 2 or more PRRs). Multiple faults within a single PRR have no additional effect on the system. If the task fails, the scheduler returns the task to the task queue to be re-scheduled. All other reserved tasks (scheduled, but not yet executing) are also returned to the task queue to be rescheduled. Fault-tolerant tasks are rescheduled until they successfully complete or can no longer meet their deadline. When tasks must be re-scheduled due to faults, they are treated as new tasks for scheduling purposes, although their original deadline is maintained. The fault-tolerant mode for rescheduled tasks will be based on the fault rate at the time of rescheduling.
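
The fault-injection and outcome rules described above can be sketched as follows. This is written in Python for brevity and is not the C++ simulator itself; the function names and the way random numbers are drawn are assumptions made for illustration.

```python
import random

REGIONS = {"Simplex": 1, "DWC": 2, "TMR": 3}   # PRRs occupied by each mode

def task_fails(mode, faulty_prrs):
    """Outcome rule: TMR tolerates one faulty PRR; Simplex and DWC tolerate none.

    faulty_prrs is the number of the task's PRRs that saw at least one fault.
    """
    if mode == "TMR":
        return faulty_prrs >= 2
    return faulty_prrs >= 1

def run_task(mode, exec_steps, f, rng):
    """Simulate one execution; return True if the task completes successfully."""
    hit = [False] * REGIONS[mode]
    for _ in range(exec_steps):
        for prr in range(len(hit)):
            # Each PRR is struck independently with probability f per time step.
            if rng.random() < f:
                hit[prr] = True     # repeated faults in one PRR have no extra effect
    return not task_fails(mode, sum(hit))

# Illustrative run with a hypothetical per-step fault probability.
rng = random.Random(42)
completed = sum(run_task("TMR", exec_steps=55, f=0.002, rng=rng) for _ in range(1000))
```

A full simulator would wrap this in the scheduling loop, returning failed tasks to the task queue with their original deadlines.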

5.2.4 Analysis and Results

In the following analysis, task execution times ($t_{exec}$) are uniformly distributed in [10, 100] time steps with deadlines ($t_{deadline}$) of [100, 200] time units. For simplicity, we assume that a time step is 1 second. Simplex tasks use one processing region, DWC tasks use two processing regions, and TMR tasks use three processing regions. The simulated system uses 12 processing regions ($N_{PRRs}$) in order to enable flexibility in placing TMR and DWC tasks. For each experiment, we measure the performance of each metric with the scheduler's guarantee ratio (percentage of total tasks scheduled successfully) while attempting to schedule 100,000 tasks.

5.2.4.1 Constant fault rates

Initially, the simulator is used to get a fault-free baseline for comparison purposes. Figure 5-4 shows the effect of arrival rate on the performance of the system. At low arrival rates, Simplex, DWC, and TMR scheduling can all meet the system demand. However, arrival rates higher than 0.06 tasks per second begin to impact the schedulability of the TMR system because there are not enough resources to handle all incoming tasks. Using DWC for fault tolerance will result in higher guarantee ratios since more DWC tasks can be scheduled at any one time. The lack of a fault-tolerance mechanism excludes the use of Simplex scheduling in the presence of faults.

In order to investigate the effect of fault rates on our system, we chose a constant arrival rate of 0.075 tasks per second. With this arrival rate, the DWC system can successfully schedule all incoming tasks, while the TMR system cannot. Using the arrival rate in this way, we attempt to define a system which requires DWC to meet performance demands but can temporarily use TMR to meet reliability constraints. The effect on the guarantee ratio is shown in Figure 5-5. At low fault rates the DWC mode provides higher throughput, while TMR outperforms DWC at high fault rates. Our adaptive metric produces high throughput at low fault rates, closely tracking the performance of DWC, but performs between TMR and DWC at intermediate fault rates.
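
A rough offered-load estimate helps explain why the chosen arrival rate separates the DWC and TMR systems. The sketch below is a back-of-the-envelope check added for illustration, not part of the original analysis: it multiplies the arrival rate by the mean execution time and the number of regions per task, and it ignores deadlines, fragmentation, and re-execution after faults.

```python
N_PRRS = 12                      # processing regions in the simulated system
MEAN_EXEC = (10 + 100) / 2       # mean of the uniform [10, 100] execution-time range

def offered_load(arrival_rate, mean_exec_time, regions_per_task):
    """Average number of PRRs kept busy by arriving work in the fault-free case."""
    return arrival_rate * mean_exec_time * regions_per_task

tmr_load = offered_load(0.075, MEAN_EXEC, 3)        # about 12.4 regions, above the 12 available
dwc_load = offered_load(0.075, MEAN_EXEC, 2)        # about 8.3 regions, below the 12 available
tmr_load_at_006 = offered_load(0.06, MEAN_EXEC, 3)  # about 9.9 regions
```

Under these simplifying assumptions, TMR at 0.075 tasks per second demands slightly more than the 12 available regions on average, while DWC leaves headroom for re-executing failed tasks. TMR begins to reject tasks at lower rates than this estimate suggests, which is plausible given that bursty arrivals and deadlines limit how much of the average capacity can actually be used.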

At higher fault rates, the adaptive heuristic produces a guarantee ratio close to that of TMR. From these constant-fault-rate results, an ideal-threshold heuristic can be determined. The crossover fault rate for TMR and DWC occurs at 0.0025 faults/sec.

5.2.4.2 Dynamic fault-rate case studies

In order to get a benefit from the adaptive scheduling methods, the fault rates experienced by the system must vary. We present two fault profiles which represent patterns commonly seen in space missions. Figure 5-6 shows the fault profiles used for the following analysis, based on the fault model in [Jacobs et al. 2012]. The first profile is a sinusoidal pattern with a 90-minute period, which is characteristic of fault rates in Low-Earth Orbit (LEO). The second pattern (Burst) represents Highly-Elliptical Orbits (HEO), where the system experiences low fault rates for most of the orbit, with a large burst when making the closest approach to Earth, once every 12 hours.

For the following case studies, four different scheduling heuristics are examined. The TMR-only and DWC-only heuristics will schedule every task in their respective mode. The ideal-threshold heuristic will use the fault-rate threshold measured in the previous section (0.0025 faults/sec) to choose between the DWC and TMR modes. The adaptive heuristic uses the time-resource relation of Section 5.2.1.2 (Equation 5-4) to determine the FT mode for each task. Each heuristic will be evaluated using an arrival rate of 0.075 tasks/sec and the same parameters used in Section 5.2.4.1.

For the Sinusoidal case study, fault rates are low compared to the average task execution time. For this fault profile, scheduling tasks with the DWC-only, ideal-threshold, or adaptive heuristics provides an equivalent rejection ratio, 0.2%. There are enough system resources to make re-computation of failed DWC tasks better than simply using TMR to protect against all failures. The fault rate rarely gets high enough for the threshold or adaptive heuristics to schedule tasks in TMR mode. The adaptive heuristic performs well, reducing the number of rejected tasks over the TMR-only strategy by 94%, while maintaining a low average task latency of 8 seconds per task.

In the Burst case study, fault rates are low except during a short window of time with extremely high fault rates. Unlike in the previous case study, the adaptive heuristics benefit from the large range of fault rates, and each heuristic has different performance characteristics. For this fault-rate profile, the adaptive heuristic performs well, with 11% fewer rejected tasks than the DWC-only strategy and 48% fewer than the TMR-only strategy. Additionally, the adaptive heuristic has only 3% more rejected tasks than the ideal-threshold heuristic. TMR is only optimal when the high-fault-rate burst occurs. Otherwise, DWC will better utilize the system resources. The Burst fault profile is ideal for all dynamic metrics, since the two phases are highly separated.

5.2.4.3 Scheduling improvements

At extreme fault rates (high or low), the adaptive heuristic will schedule all incoming tasks in the same mode. However, for moderate fault rates, both DWC and TMR tasks will be scheduled depending on task computation time. One drawback to the dynamic fault-tolerant selection metrics is the resource fragmentation that occurs when different-sized objects are placed on the FPGA fabric. This effect can produce schedules similar to Figure 5-7, where fragmentation causes poor utilization of the available FPGA resources. As tasks arrive to the system, they are placed in PRRs $p_0$ through $p_5$, in either DWC or TMR mode. At time $t_6$ there are enough unused PRRs in the system for a DWC task, but because the resources are not contiguous the task cannot be placed.

Unfortunately, the simplistic EDF scheduling heuristic currently in use does not account for FPGA fragmentation. The low effectiveness of the adaptive metric at intermediate fault rates in Figure 5-5 can be explained by this FPGA fragmentation. By using a placement-aware scheduler, the adaptive heuristic should perform closer to optimal for all fault rates. For example, delaying the placement of task $F_{DWC}$ until $t_6$ would enable a more compact placement in PRRs $p_2$ and $p_3$, leaving space for the placement of an additional DWC task. Alternatively, task $F_{DWC}$ could be placed in PRRs $p_4$ and $p_5$ at time $t_5$, leaving room for future DWC tasks in PRRs $p_2$ and $p_3$. An FPGA placement-aware scheduling

algorithm such as the Horizon or Stuffing scheduler [Steiger et al. 2004] should be incorporated in order to improve the performance of the adaptive heuristic.

Fault-rate lag is another possible problem. If a task is scheduled using a specific mode but does not execute for a long period of time, a different FT mode may become more appropriate. Limiting the scheduling window to only schedule a few tasks at a time may prevent this lag. Alternatively, using a prediction of the future fault rate during scheduling may reduce this effect. Finally, preemption enables the scheduler to return a currently running task to the task queue in order to start a higher-priority task. By adding preemption capabilities to the scheduler, the guarantee ratio of all the tested metrics can be improved.

5.3 Conclusions

In Phase 3 we have adapted the original bus-based RFT hardware architecture for use with a PR architecture based on point-to-point connections (VAPRES) and quantified the design tradeoffs. Additionally, by providing bus-based and point-to-point hardware templates, the RFT architecture can now be more easily ported to future systems. The additional parameterization of internal TMR components allows for user customization of RFT features such as watchdog timers or voting logic configurations.

We have also presented a novel metric for determining optimal fault-tolerance settings for reconfigurable fault-tolerant systems. An RFT scheduler and simulator were developed in order to test the effectiveness of the adaptive scheduling heuristic and to compare its performance to traditional static fault-tolerance strategies. When using our adaptive FT strategy in Burst-like fault environments, we maintain system reliability while reducing the number of rejected tasks by 48% compared to a static TMR fault-tolerance strategy and 11% compared to static DWC. In the Sinusoidal case study, the static DWC, ideal-threshold, and adaptive heuristics reduce the number of rejected tasks by 94% compared to the static TMR strategy. We have demonstrated that the adaptive heuristic performs similarly to

an optimal user-defined threshold, without the need for detailed system simulation and measurement.

Table 5-1. RFT controller resource usage.

Module name      Slices used   BRAMs used
PLB Controller   345           0
FSL Controller   768           12

Table 5-2. Dynamic scheduling results.

Case study   FT metric   Guarantee ratio   Reject ratio   Avg. latency (s)
Sinusoidal   TMR-Only    0.970             0.030          51.2
Sinusoidal   DWC-Only    0.998             0.002          8.0
Sinusoidal   Ideal       0.998             0.002          7.9
Sinusoidal   Adaptive    0.998             0.002          8.2
Burst        TMR-Only    0.935             0.065          53.6
Burst        DWC-Only    0.962             0.038          9.9
Burst        Ideal       0.967             0.033          11.7
Burst        Adaptive    0.966             0.034          11.2

Figure 5-1. Research areas for Phase 3.

Figure 5-2. Comparison of RFT architectures. A) PLB-based architecture. B) FSL-based architecture.

Figure 5-3. Flowchart of scheduling simulator.

Figure 5-4. Effect of arrival rate on fault-free operation. ($N_{PRRs} = 12$, $t_{exec} \in [10, 100]$, $t_{deadline} \in [100, 200]$)

Figure 5-5. Effect of fault rate on task rejection. ($N_{PRRs} = 12$, $t_{exec} \in [10, 100]$, $t_{deadline} \in [100, 200]$, arrival = 0.075)

Figure 5-6. Fault-rate profile for case studies.

Figure 5-7. Resource fragmentation from adaptive placement.

CHAPTER 6
CONCLUSIONS

In this research, a comprehensive framework for providing reconfigurable fault tolerance for FPGA-based space systems has been developed. The framework features three primary areas, each addressed by one of the research phases presented in this document.

In Phase 1, the initial RFT framework consisting of a hardware architecture, fault-rate model, and a performability model is presented. The hardware architecture was developed and implemented on a Virtex-5 platform, enabling reconfigurable fault tolerance with support for TMR, DWC, or user-defined fault-tolerance modes. Additionally, RFT fault-rate and performability models were used to predict the performance and reliability of an RFT system in multiple specified orbits. For highly-elliptical orbits, adaptive fault-tolerance strategies were shown to increase system performability by 128% over TMR while improving unavailability by 85% over ABFT. Fault injection was used to validate the reliability of the architecture and the accuracy of the performance model (1.5% error).

In Phase 2, an in-depth reliability and overhead analysis of FPGA designs using ABFT is presented. By identifying fault-tolerance techniques with low overhead and high reliability, a spectrum of reliability and performance characteristics becomes available for RFT systems, enabling system flexibility. Fault-injection testing of matrix multiplication and Fast Fourier Transform FPGA designs shows that ABFT can reduce design vulnerability by up to 98% with less than 25% overhead. Selectively applying redundancy (TMR) can allow for even more reliable designs.

In Phase 3, methods for integrating the RFT framework with pre-existing PR systems, and making the fault-tolerance features of the architecture easier to use, are demonstrated. RFT hardware will be dynamically generated based on individual system parameters. The adaptive scheduling heuristic will simplify the decision to use a specific RFT fault-tolerant mode and enable the use of environmentally-aware task schedulers. Combined, these three phases of research provide an RFT framework which is capable

PAGE 112

ofprovidingadaptivefault-tolerancetoexistingFPGAsystems,enablingtheirpossibleuseasspacesystems.ThecontributionsofPhase1includetheRFThardwarearchitecture,theorbitalfault-ratemodel,andthephased-missionperformabilitymodel.Thefault-rateandperformabilitymodelsareapplicabletomanyspacesystems,enablingatime-varyingfaultmodelwhereonlystatic,time-averagedmodelsarenormallyconsidered.FromPhase2,theanalysisofABFTreliabilityonFPGAarchitecturesdemonstratestheusefulnessofABFTasareliable,low-overheadalternativetechniqueinlow-to-mediumfault-rateenvironments.ThedesigntechniquesdiscoveredinthisresearchwillpromotetheuseofABFTasaalternativefault-tolerancetechniqueforspacesystems,enablinghigherperformance,lowerpowerconsumption,andlowercostsbyreducingthenumberofprocessorsneededtoperformonboarddataprocessing.ThecontributionsfromPhase3facilitatestheuseofRFTtechniquesinfuturesystemsbypartiallyautomatingtheprocessoffault-toleranthardwaredesign,allowingsystemdesignerstofocustheireffortsonotherpartsoftheirpotentialsystem.Optimizingtheselectionoffault-tolerancemodesthroughfault-ratepredictionandschedulerheuristicswillenablesystemstomaintainhighperformabilityautomatically. 112
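As an illustration of the ABFT idea summarized for Phase 2, the following minimal sketch shows the classic row/column checksum scheme for matrix multiplication in software. It is not the FPGA kernel developed in this research; the function names (matmul, add_column_checksum, add_row_checksum, abft_matmul) and the inject parameter are hypothetical, and integer data is assumed so that checksum comparisons are exact.

```python
# Minimal sketch of checksum-based ABFT for matrix multiplication.
# All names are illustrative assumptions, not taken from the dissertation.

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to each row of b one extra element holding that row's sum."""
    return [row + [sum(row)] for row in b]

def abft_matmul(a, b, inject=None):
    """Multiply checksum-augmented matrices and verify the result.

    inject, if given, is (row, col, delta): a hypothetical way to corrupt
    one element of the full result and mimic a fault.
    Returns (product, inconsistent_rows, inconsistent_cols).
    """
    c_full = matmul(add_column_checksum(a), add_row_checksum(b))
    if inject is not None:
        i, j, delta = inject
        c_full[i][j] += delta
    n, m = len(a), len(b[0])
    # Each data row must sum to the checksum element in the last column.
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    # Each data column must sum to the checksum element in the last row.
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    product = [row[:m] for row in c_full[:n]]
    return product, bad_rows, bad_cols

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(abft_matmul(a, b))                    # fault-free run
    print(abft_matmul(a, b, inject=(0, 1, 5)))  # one corrupted element
```

A single corrupted data element shows up as exactly one inconsistent row and one inconsistent column, which locates the error. With floating-point data, the exact equality tests above would have to be replaced by a comparison against a tolerance.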


BIOGRAPHICAL SKETCH

Adam Jacobs earned his Bachelor of Science degree in electrical engineering from the University of Florida in 2005. After graduation, he participated in an internship at Honeywell International in Clearwater, FL before returning to the University of Florida for graduate studies. Adam received his Master of Science degree in electrical and computer engineering in 2007 before joining the doctoral program. While pursuing his degree, Adam worked as a research assistant in the High-Performance Computing and Simulation (HCS) Research Lab and the NSF Center for High-Performance Reconfigurable Computing (CHREC). In support of his studies, Adam interned at Goddard Space Flight Center in 2010, gaining experience in embedded processing systems for space. After graduation, he will be moving to Austin, TX, where he has accepted a position in the processor design group of ARM.