
Performance Analysis and Verification for High-Level Synthesis

Permanent Link: http://ufdc.ufl.edu/UFE0042483/00001

Material Information

Title: Performance Analysis and Verification for High-Level Synthesis
Physical Description: 1 online resource (115 p.)
Language: english
Creator: CURRERI,JOHN A
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2011

Subjects

Subjects / Keywords: ASSERTION -- FPGA -- HLS -- PERFORMANCE -- VERIFICATION
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: High-Level Synthesis (HLS) for Field-Programmable Gate Arrays (FPGAs) facilitates the use of Reconfigurable Computing (RC) resources for application developers by using familiar, higher-level syntax, semantics, and abstractions. Thus, HLS typically enables faster development times than traditional Hardware Description Languages (HDLs). However, this higher level of abstraction is typically not maintained throughout the design process. Once the program is translated from source code to an HDL, analysis tools provide results at the HDL level. The research performed in this document focuses on providing higher-level analysis of application behavior throughout the design process to increase developer productivity. Just as knowledge of assembly code is no longer required for most programmers of microprocessors today, the goal of this work is to evaluate methods that can eliminate the need for programmers to understand HDLs in order to develop a high-performance, error-free application for FPGA platforms. This document is divided into three phases. Phase one addresses the challenges associated with prototyping an in-circuit performance analysis framework for HLS applications while leveraging existing visualizations. Phase two focuses on the challenges of detecting communication bottlenecks and creates a novel visualization designed specifically for HLS and FPGA systems. Phase three addresses the challenges associated with in-circuit integration of assertion-based verification into HLS. The performance analysis and verification frameworks presented in this document are believed to be (after extensive literature review) the first of their kind for HLS. Case studies using various FPGA platforms and HLS tools show the utility and low overhead of these frameworks. The frameworks explored by this research will help lay the foundation for future performance analysis and verification research and tools targeting HLS.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by JOHN A CURRERI.
Thesis: Thesis (Ph.D.)--University of Florida, 2011.
Local: Adviser: George, Alan D.
Local: Co-adviser: Stitt, Greg.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2011
System ID: UFE0042483:00001

Full Text

PAGE 1

PERFORMANCE ANALYSIS AND VERIFICATION FOR HIGH-LEVEL SYNTHESIS
By
JOHN A. CURRERI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2011

PAGE 2

© 2011 John A. Curreri

PAGE 3

This dissertation is dedicated to my father, Dr. Peter Curreri

PAGE 4

ACKNOWLEDGMENTS

This work was supported in part by the industry and university cooperative research (I/UCRC) program of the National Science Foundation under Grant No. EEC-0642422. The authors gratefully acknowledge vendor equipment and/or tools provided by Aldec, Altera, GiDEL, Impulse Accelerated Technologies, SRC Computers, Inc. and XtremeData, Inc. The authors would also like to acknowledge University of Washington Adaptive Computing Machines and Emulators (ACME) Lab for an XD1000 version of the backprojection application that was ported to Novo-G.

PAGE 5

TABLE OF CONTENTS
page
ACKNOWLEDGMENTS ... 4
LIST OF TABLES ... 8
LIST OF FIGURES ... 9
ABSTRACT ... 11
CHAPTER
1 INTRODUCTION ... 13
2 BACKGROUND AND RELATED RESEARCH ... 16
2.1 Reconfigurable Computing ... 16
2.2 High-Level Synthesis ... 17
2.3 Performance Analysis of Parallel Applications ... 18
2.4 Field-Programmable Gate Array Performance Monitoring ... 18
2.5 Automatic Analysis and Visualization ... 20
2.6 Assertion-Based Verification ... 20
3 PERFORMANCE ANALYSIS FRAMEWORK FOR HIGH-LEVEL SYNTHESIS ... 23
3.1 High-Level Synthesis Performance Analysis Challenges ... 24
3.1.1 Instrumentation Challenges ... 25
3.1.1.1 Instrumentation levels ... 25
3.1.1.2 Instrumentation selection ... 26
3.1.2 Measurement Challenges ... 28
3.1.2.1 Measurement techniques ... 28
3.1.2.2 Measurement data extraction ... 29
3.1.3 Analysis Challenges ... 30
3.1.4 Visualization Challenges ... 31
3.2 High-Level Synthesis Performance Analysis Framework ... 34
3.2.1 Instrumentation Automation ... 35
3.2.1.1 Automated C source instrumentation ... 35
3.2.1.2 Automated hardware description language instrumentation ... 36
3.2.2 Performance Analysis Tool Integration ... 39
3.2.2.1 Performance data extraction ... 39
3.2.2.2 Performance data visualization ... 39
3.3 Molecular-Dynamics Case Study ... 42
3.4 Conclusions ... 46
4 COMMUNICATION BOTTLENECK DETECTION AND VISUALIZATION ... 50
4.1 Communication Performance Analysis ... 52

PAGE 6

4.1.1 Instrumentation and Measurement ... 52
4.1.2 Visualizations ... 55
4.1.3 Analysis ... 57
4.2 Experimental Framework ... 58
4.2.1 Instrumentation ... 58
4.2.2 Visualization ... 60
4.3 Experimental Results ... 60
4.3.1 Triple Data Encryption Standard (DES) ... 61
4.3.2 Molecular Dynamics ... 61
4.3.3 Backprojection ... 63
4.4 Conclusions ... 65
5 ASSERTION-BASED VERIFICATION FOR HIGH-LEVEL SYNTHESIS ... 67
5.1 Assertion Synthesis and Optimizations ... 69
5.1.1 Assertion Parallelization ... 71
5.1.2 Resource Replication ... 72
5.1.3 Resource Sharing ... 74
5.2 In-Circuit Timing-Analysis Assertions ... 75
5.3 Hang-Detection Assertions ... 78
5.4 Assertion Framework ... 82
5.4.1 Unoptimized Assertion Framework ... 82
5.4.2 Assertion Framework Optimizations ... 84
5.4.3 Timing-Analysis and Hang-Detection Extensions ... 85
5.4.4 High-Level Synthesis Tool and Platform ... 86
5.5 Experimental Results ... 87
5.5.1 Detecting Simulation Inconsistencies ... 87
5.5.2 Assertion Parallelization Optimization ... 88
5.5.2.1 Triple Data Encryption Standard (DES) case study ... 89
5.5.2.2 Edge-detection Case Study ... 89
5.5.2.3 State Machine Overhead Analysis ... 90
5.5.3 Resource Replication Optimization ... 92
5.5.4 Resource Sharing Optimization ... 93
5.5.5 In-circuit Timing Analysis ... 95
5.5.6 Hang Detection ... 99
5.5.7 Assertion Limitations ... 100
5.6 Conclusions ... 102
6 CONCLUSIONS ... 104
APPENDIX
A TARGET SYSTEMS ... 107
A.1 ProcStar-III ... 107
A.2 XD1000 ... 107
A.3 SRC-7 ... 107

PAGE 7

B APPLICATION CASE STUDIES ... 108
B.1 Molecular Dynamics ... 108
B.2 Backprojection ... 108
B.3 Triple Data Encryption Standard (DES) ... 108
B.4 Edge Detection ... 108
REFERENCES ... 109
BIOGRAPHICAL SKETCH ... 115

PAGE 8

LIST OF TABLES
Table page
3-1 Performance analysis overhead ... 47
4-1 Triple Data Encryption Standard (DES) instrumentation overhead ... 64
5-1 Triple Data Encryption Standard (DES) assertion overhead ... 90
5-2 Edge-detection assertion overhead ... 91
5-3 Single-comparison assertion ... 93
5-4 Pipelined single-comparison assertion ... 93
5-5 Individual backprojection timing assertion overhead ... 98
5-6 Looped backprojection timing assertion overhead ... 99
5-7 Data Encryption Standard (DES) hang-detection overhead ... 101

PAGE 9

LIST OF FIGURES
Figure page
2-1 Simplified block diagram of a Field-Programmable Gate Array ... 17
2-2 Performance analysis steps ... 19
3-1 Example performance visualization ... 32
3-2 Hardware Measurement Module (HMM) addition and application instrumentation ... 36
3-3 Design flow for Impulse C performance analysis ... 37
3-4 Hardware Measurement Module (HMM) ... 38
3-5 Profile tree table visualization for Triple Data Encryption Standard (DES) ... 40
3-6 Molecular-Dynamics (MD) hardware subroutine ... 43
3-7 Accelerator process source with profiling percentages ... 44
3-8 Pie chart visualization of the states in Accelerator process. State b5s0 corresponds to the pipeline and state b6s2 corresponds to the output stream. ... 45
3-9 Bar chart visualization depicting MD kernel performance improvement due to increased stream buffer size (128 to 4096 elements). ... 46
3-10 Output stream overhead for the Accelerator process. ... 47
4-1 Parallel Performance Wizard (PPW) data transfer visualization ... 51
4-2 Streaming-communication call visualization ... 56
4-3 Direct memory access (DMA) communication call visualization ... 56
4-4 Streaming buffer analysis ... 58
4-5 Tool flow for Impulse C visualization ... 59
4-6 Communication visualization of streaming DES ... 62
4-7 Communication visualization of DES using DMA transfer ... 63
4-8 Molecular-dynamics bandwidth visualization (half of FPGA cropped to enlarge image) ... 64
4-9 Communication visualization of Backprojection ... 65
5-1 Assertion framework ... 70
5-2 Application's state machine ... 73

PAGE 10

5-3 Using timing-analysis assertions with a filter application ... 77
5-4 Manually using American National Standards Institute C (ANSI C) assertions for hang detection ... 79
5-5 Using hang-detection assertions with a filter application ... 80
5-6 High-Level Language (HLL) assertion instrumentation ... 83
5-7 In-circuit verification example ... 88
5-8 Simple streaming loopback ... 94
5-9 Assertion frequency scalability ... 95
5-10 Optimized assertion resource scalability ... 96
5-11 Adding timing assertions individually to backprojection ... 97
5-12 Adding timing assertions in a loop to backprojection ... 97

PAGE 11

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
PERFORMANCE ANALYSIS AND VERIFICATION FOR HIGH-LEVEL SYNTHESIS
By
John A. Curreri
May 2011
Chair: Alan D. George and Greg Stitt
Major: Electrical and Computer Engineering

High-Level Synthesis (HLS) for Field-Programmable Gate Arrays (FPGAs) facilitates the use of Reconfigurable Computing (RC) resources for application developers by using familiar, higher-level syntax, semantics, and abstractions. Thus, HLS typically enables faster development times than traditional Hardware Description Languages (HDLs). However, this higher level of abstraction is typically not maintained throughout the design process. Once the program is translated from source code to an HDL, analysis tools provide results at the HDL level. The research performed in this document focuses on providing higher-level analysis of application behavior throughout the design process to increase developer productivity. Just as knowledge of assembly code is no longer required for most programmers of microprocessors today, the goal of this work is to evaluate methods that can eliminate the need for programmers to understand HDLs in order to develop a high-performance, error-free application for FPGA platforms.

This document is divided into three phases. Phase one addresses the challenges associated with prototyping an in-circuit performance analysis framework for HLS applications while leveraging existing visualizations. Phase two focuses on the challenges of detecting communication bottlenecks and creates a novel visualization designed specifically for HLS and FPGA systems. Phase three addresses the challenges associated with in-circuit integration of assertion-based verification into HLS. The performance analysis and verification frameworks presented in this document are believed to be

PAGE 12

(after extensive literature review) the first of their kind for HLS. Case studies using various FPGA platforms and HLS tools show the utility and low overhead of these frameworks. The frameworks explored by this research will help lay the foundation for future performance analysis and verification research and tools targeting HLS.

PAGE 13

CHAPTER 1
INTRODUCTION

Reconfigurable Computing (RC), which typically uses Field-Programmable Gate Arrays (FPGAs) as processors, has shown favorable energy-efficiency advantages as compared to other processors for High-Performance Computing (HPC) and High-Performance Embedded Computing (HPEC) [1]. However, the adoption of FPGAs for application acceleration has been hampered by design-productivity issues, since developing applications using a low-level Hardware Description Language (HDL) can be a time-consuming process. In order to address this problem, High-Level Synthesis (HLS) tools have been developed to allow application design using High-Level Languages (HLLs) such as C. Unfortunately, many of the tools needed to develop a correctly functioning, high-performance application are absent from HLS tools. HLS tools lack performance analysis tools to measure the performance of computation and communication of the application at runtime. There is also a lack of verification tools to check the correctness, real-time constraints, or the origin of hangs of the application at runtime. The frameworks and tools presented in this dissertation allow the user to perform performance analysis and verification on their application at runtime while having a low resource and performance overhead, thus boosting developer productivity and the competitiveness of FPGA-based solutions.

The research in this document is divided into three phases. In Phase 1, several challenges are explored and addressed when expanding upon an existing performance analysis framework (ReCAP) [2] to support HLS tools. New instrumentation techniques are explored for HLS code. Methods for performance data extraction are also explored, as well as methods to present performance data in the context of C source code. Performance visualizations are leveraged from an existing performance analysis tool called Parallel Performance Wizard (PPW) [3] to provide seamless performance analysis across both microprocessors and FPGAs. The performance analysis framework for HLS is evaluated through a case study showing its ability to speed up applications while having

PAGE 14

a low-resource overhead. The goal of Phase 1 is to address the challenges associated with prototyping an in-circuit performance analysis tool for HLS that is familiar to parallel programmers who typically target microprocessors.

Phase 2 focuses on the challenges of detecting communication bottlenecks and creating novel performance visualizations for HLS. New instrumentation is explored to measure the average bandwidth of communication channels in the FPGA. Research is performed to explore the techniques and methods needed to automatically determine the communication bottlenecks. A visualization in the form of a directed graph is prototyped for manual analysis. These new HLS bottleneck detection and visualization techniques are validated via case studies. The goal of Phase 2 is to further increase design productivity by evaluating novel HLS-specific visualization techniques that enable bottleneck detection.

In the third phase, the challenges associated with prototyping assertion-based verification for HLS designs are explored and addressed. Techniques needed to perform verification of HLS code are explored in circuit as opposed to using simulation. Tradeoffs are explored that reduce FPGA resource overhead and runtime performance overhead. The effectiveness of this framework is demonstrated by showing its ability to find verification failures not caught in HLS simulation. Novel enhancements to assertion-based verification are explored to allow real-time deadline checking and hang detection. The goal of Phase 3 is to address the challenges associated with prototyping a low-overhead assertion-based verification tool for HLS that is familiar and useful to parallel programmers who have not used FPGAs.

The rest of this document is organized as follows. Chapter 2 discusses background and related research. Chapter 3 describes the performance analysis framework for HLS (Phase 1). Chapter 4 explains the extended performance analysis framework that incorporates visualizations for bottleneck detection (Phase 2). Chapter 5 gives details about the framework for in-circuit assertion-based verification of HLS (Phase 3).

PAGE 15

Finally, Chapter 6 provides concluding remarks. Brief descriptions of the target systems and application case studies can be found in Appendices A and B, respectively.

PAGE 16

CHAPTER 2
BACKGROUND AND RELATED RESEARCH

The background and related research in this chapter is divided into six sections. Section 2.1 provides background on reconfigurable computing. Section 2.2 explains the productivity gain from using High-Level Synthesis tools. Section 2.3 covers the steps used to perform performance analysis on a parallel application along with a brief survey of microprocessor-based parallel performance analysis tools. Section 2.4 discusses FPGA performance monitoring and the lack of in-circuit FPGA performance analysis tools. Section 2.5 describes related work in automatic analysis and visualization. Section 2.6 gives details on related work in assertion-based verification.

2.1 Reconfigurable Computing

This section provides a brief background for Reconfigurable Computing (RC). More detailed surveys of RC can be found in Compton and Hauck [4] or Hartenstein [5]. RC systems have the ability to change the organization of their hardware to best suit a particular application. RC systems are most commonly built with FPGAs. Unlike microprocessors, which are programmed via instructions, FPGAs are programmed by connecting various hardware elements together. Figure 2-1 shows a simplified block diagram of the most common FPGA hardware elements. The look-up tables inside an FPGA can be programmed to provide algebraic or logic operations needed by the program, while routing switches connect look-up tables together to create more complex operations. Flip-flops are used to hold values during each clock cycle. FPGAs also contain memory in the form of block RAMs to store data too large to hold with registers. Input-output ports facilitate off-chip communication. For a more detailed description of an FPGA architecture, refer to Altera's Stratix handbook [6] or Xilinx's Virtex user guide [7].

FPGAs can be integrated into computing systems at multiple levels. The XD1000 [8] integrates an FPGA in one of the two Opteron sockets on a dual-processor motherboard. The SRC-7 [9] uses a riser board plugged into the memory slot to communicate with

PAGE 17

Figure 2-1. Simplified block diagram of a Field-Programmable Gate Array

the CPU. The GiDEL ProcStar-III [10] is a PCI Express card that allows FPGAs to be accessed as a peripheral device. The Cray XD1 [11] uses a custom RapidArray interconnect to allow microprocessors, FPGAs and other nodes to communicate with each other. The large variety of RC architectures and applications can pose a challenge for general-purpose verification and performance analysis tools.

2.2 High-Level Synthesis

This section provides a brief background for HLS. A survey of HLS tools can be found in Holland et al. [12] and a productivity analysis of HLS tools can be found in El-Araby et al. [13]. Traditionally, circuit designers program FPGAs using HDLs such as VHDL or Verilog. HDLs are inherently able to express parallelism because the components that make up a circuit work in parallel. However, software programmers tend to find HDLs cumbersome because they use clock signals to control the timing of the application. HLS tools allow applications to be designed using graphical techniques or with an HLL such as C, Java, or Fortran. HLL-based HLS tools typically have Application Programming Interface (API) calls to increase parallelism or add synchronization and communication to an application. Although the higher level of abstraction provided by HLS tools increases developer productivity when writing applications, productivity with current tools is then

PAGE 18

lowered by the low-level abstraction of accurate verification and performance analysis tools for HLS-generated HDL code.

2.3 Performance Analysis of Parallel Applications

Performance analysis can be divided into five steps (as derived from Malony's work on the TAU performance analysis framework for traditional processors [14]) whose end goal is to produce an optimized application. These steps are Instrument, Measure, Analyze, Present, and Optimize (see Figure 2-2). The instrumentation step inserts the necessary code (i.e., additional hardware in the FPGA's case) to access and record application data at runtime, such as variables or signals to capture performance indicators. Measurement is the process of recording and storing the performance data at runtime while the application is executing. After execution, analysis of performance data to identify potential bottlenecks can be performed. Some tools such as TAU can automatically analyze the measured data to help the user find potential bottlenecks, while other tools rely solely upon the developer to analyze the results. In either case, data is typically presented to the programmer via text, charts, or other visualizations to allow for further analysis. Finally, optimization is performed by modifying the application's code or possibly changing the application's platform based upon insights gained via the previous steps. Since automated optimization is an open area of research, optimization at present is typically a manual process. These steps may then be repeated as many times as the developer deems necessary, resulting in an optimized application. This methodology is employed by a number of existing software parallel performance analysis tools, including PPW (Parallel Performance Wizard) [3], TAU (Tuning and Analysis Utilities) [14], KOJAK (Kit for Objective Judgment And Knowledge-based detection of performance bottlenecks) [15], HPCToolkit [16], and SvPablo [17].

2.4 Field-Programmable Gate Array Performance Monitoring

To the best of our knowledge after a comprehensive literature search, little previous work exists outside this research concerning runtime performance analysis for HLS

PAGE 19

Figure 2-2. Performance analysis steps

applications. Hardware performance measurement modules have been integrated into FPGAs before; however, they were designed specifically for monitoring the execution of soft-core processors [18]. The Owl framework, which provides performance analysis of system interconnects, uses FPGAs for performance analysis, but does not actually monitor the performance of hardware inside the FPGA itself [19]. Range adaptive profiling has been prototyped on FPGAs but was not used for profiling an application executing on an FPGA [20]. Runtime debugging of an HLS tool, Sea Cucumber, has been developed by Hemmert et al. [21]. However, the Sea Cucumber debugger does not support performance analysis. Calvez et al. [22] describe performance analysis for Application-Specific Integrated Circuits (ASICs) while DeVille et al. [23] discuss performance monitoring probes for FPGA circuits; however, neither work targets HLS tools. This work significantly extends our previous work on performance analysis for HDL applications [2] by expanding this HDL framework to address the challenges of high-level synthesis tools.
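As a rough illustration of the Instrument and Measure steps from Section 2.3, the following C sketch models in software what a simple hardware event counter might record: on every clock cycle it samples one monitored signal and counts how often that signal is asserted. The structure and function names (`event_counter`, `counter_tick`, `counter_utilization`) are hypothetical and chosen only for this example; they are not taken from any framework or tool discussed in this chapter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a hardware event counter (hypothetical, for illustration). */
struct event_counter {
    uint64_t active_cycles;  /* cycles in which the monitored signal was high */
    uint64_t total_cycles;   /* all cycles observed so far */
};

/* Model one clock edge: sample the monitored signal (nonzero = asserted). */
void counter_tick(struct event_counter *c, int signal)
{
    assert(c != NULL);
    c->total_cycles++;
    if (signal)
        c->active_cycles++;
}

/* Fraction of observed cycles in which the signal was asserted. */
double counter_utilization(const struct event_counter *c)
{
    assert(c != NULL);
    return c->total_cycles == 0
        ? 0.0
        : (double)c->active_cycles / (double)c->total_cycles;
}
```

In this model, calling `counter_tick` once per simulated cycle corresponds to measurement, and reading `counter_utilization` after the run corresponds to extracting the recorded data for analysis. A real in-circuit counter would be additional hardware rather than a function call.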

PAGE 20

2.5 Automatic Analysis and Visualization

Automatic analysis has been developed in performance analysis tools such as TAU [14], KOJAK [15], PPW [3], and Paradyn [24]. TAU uses ParaProf [25] for profile analysis. KOJAK and TAU use the Expert performance analysis tool [26] to perform trace pattern matching. PPW uses an automatic analysis system [27] developed by Su et al.

A summary of early visualizations from Kiviat diagrams to 3D isosurface displays is described by Heath et al. [28]. ParaGraph [29], a performance analysis tool for MPI programs, utilizes many of these early visualizations. Miller wrote an essay [30] on effective techniques and pitfalls of performance visualizations. A modern tool, TAU, uses Vampir [31], Jumpshot [32], and Paraver [33] for trace visualization. Knüpfer et al. describe an extension to Vampir for visualization of patterns in trace data [34]. KOJAK has a 3D load-balancing case study of an n-body simulation [35]. Santamaría et al. give an overview of graphing techniques for visualizing networks [36]. Haynes et al. [37] give an example of a visualization of network performance in a computing cluster. Little work on visualizations of FPGA performance was found in an extensive literature review. DRIVE [38] provides simulation and visualization for dynamic FPGA reconfiguration. The sample visualization presented in their paper shows blocks of an FPGA inactive, active, or reconfiguring. Their framework only provides simulation and does not provide in-circuit monitoring. Related work exists for performance analysis [2] and visualization [39] of HDL applications for FPGAs. However, that work is ill-suited for HLS applications due to lack of source-code correlation and bandwidth visualizations.

2.6 Assertion-Based Verification

Many languages and libraries enable assertions in HDLs during simulation, such as VHDL assertion statements, SystemVerilog Assertions (SVA) [40], the Open Verification Library (OVL) [41], and the Property Specification Language (PSL) [42]. Previous work has also introduced in-circuit assertions via hardware assertion checkers for each assertion in a design. Tools targeted at ASIC design provide assertion checkers using SVA [43],

PAGE 21

PSL [44], and OVL [45]. Academic tools such as Camera's debugging environment [46] and commercial tools such as Temento's DiaLite also provide assertion checkers for HDL. Kakoee et al. show that in-circuit assertions [45] can also improve reliability, with a higher fault coverage than Triple Modular Redundancy (TMR) for a FIR filter and a Discrete Cosine Transform (DCT).

Logic analyzers such as Xilinx's ChipScope [47] and Altera's SignalTap [48] can also be used for in-circuit debugging. These tools can capture the values of HDL signals and extract the data using a JTAG cable. However, the results presented by these tools are not at the source level of HLS tools. A source-level debugger has been built for the Sea Cucumber synthesizing compiler [21] that enables breakpoints and monitoring of variables in FPGAs. Our work is complementary in that it enables HLL assertions and can potentially be used with any HLS tool.

Checking timing constraints of HDL applications can be performed with many of the methods mentioned above. SVA, PSL and OVL assertions can be used to check the timing relationship between expected values of signals in an HDL application [49]. A timed C-like language, TC (timed C), has been developed for checking OVL assertions inserted as C comments for use during modeling and simulation [50]. In-circuit logic analyzers such as ChipScope [47] and SignalTap [48] can also be used to trace application signals and check timing constraints for signal values. The HLS tool, Carte, provides timing macros [51] which return the value of a 64-bit counter that is set to zero upon FPGA reset. However, most HLS tools (including Impulse C) do not provide this functionality. In-circuit implementation of high-level assertions is a more general approach that potentially supports any HLS tool and enables designers to use ANSI-C assertions.

After a comprehensive literature search, we found no previous work related to hang detection of HLS applications. Hang detection for microprocessors has been implemented on FPGAs [52]. Nakka et al. [53] separate hang detection for microprocessors into three categories. First, Instruction-Count Heartbeat (ICH) detects a hung process not executing

PAGE 22

any instructions. Second, Infinite-Loop Hang Detector (ILHD) detects a process which never exits a loop. Finally, Sequential-Code Hang Detector (SCHD) detects a process that never exits a loop because the target address for the completion of a loop is corrupted. Although similar detection categories could be used for hardware processes generated by HLS tools, the methods needed for hang detection are different; hardware processes typically use state machines for control flow rather than using instructions. The related work found for microprocessor hang detection is typically used to increase reliability of the system by terminating the hung process rather than to help an application developer find the problematic line of code.

Although HDL assertions could be integrated into HLS-generated HDL, such an approach has several disadvantages. Any changes to the HLL source or a different version of the HLS tool could cause changes to the generated HDL (e.g., reorganization of code or renaming of signals), which requires the developer to manually reinsert the assertions into the new HDL. It is also possible that the developer may not be able to program in HDL or the HLS tool may encrypt or obfuscate generated HDL (e.g., LabVIEW-FPGA). HLL assertions for HLS avoid these problems by adding assertions at the source level. Specifically, ANSI-C [54] assertions were chosen to be synthesized to hardware since they are a standard assertion widely used by software programmers. Synthesizing ANSI-C assertions would allow existing assertions already written for software programs to be checked while running in circuit.
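For readers less familiar with them, the ANSI-C assertions referred to above are calls to the standard `assert` macro from `<assert.h>`. The sketch below shows ordinary software usage only: the function `checked_sum` and its bound of 1024 samples are hypothetical choices for this example, and nothing here is specific to any HLS tool. On a microprocessor, a failed `assert` reports the failing expression with its file and line and then calls `abort`; how a failure would be reported from hardware is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: sums n 16-bit samples while checking
 * source-level invariants with standard ANSI-C assertions. */
uint32_t checked_sum(const uint16_t *samples, size_t n)
{
    uint32_t total = 0;

    assert(samples != NULL);  /* precondition: a valid input buffer */
    assert(n <= 1024);        /* precondition: assumed buffer bound, keeps sums in range */

    for (size_t i = 0; i < n; i++) {
        total += samples[i];
        /* invariant: the sum of i+1 samples cannot exceed (i+1) * 65535 */
        assert(total <= (uint32_t)(i + 1) * UINT16_MAX);
    }
    return total;
}
```

Because each assertion is written against source-level variables (`samples`, `n`, `total`), it survives any reorganization of generated HDL, which is the property the paragraph above relies on. Compiling with `NDEBUG` defined removes the checks entirely in a software build.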


CHAPTER 3
PERFORMANCE ANALYSIS FRAMEWORK FOR HIGH-LEVEL SYNTHESIS

High-level synthesis tools translate high-level languages (e.g., Impulse C [55] or Carte C [56]) to hardware configurations on FPGAs. Today's HLLs simplify software developers' transition to reconfigurable computing and its performance advantages without the steep learning curve associated with traditional HDLs. While HDL developers have become accustomed to debugging code via simulators, software developers typically rely heavily upon debugging and performance analysis tools. In order to accommodate the typical software development process, HLS tools support debugging at the HLL source-code level on a traditional microprocessor without performing translation from source to HDL. Current commercial HLS tools provide few (if any) runtime tools (i.e., while the application is executing on one or more FPGAs) for debugging or performance analysis at the source-code level. In addition, research on runtime performance analysis for FPGAs is lacking, with few exceptions, none of which is targeted towards HLS tools.

While it is possible for debugging and simulation techniques to estimate basic performance, many well-researched debugging techniques may not be suited for performance analysis. For example, halting an FPGA to read back its state will cause the FPGA to become temporarily inaccessible from the CPU, potentially resulting in performance problems that did not exist before. Thus, this approach is not viable due to the unacceptable level of disturbance caused to the application's behavior and timing. Alternatively, performance can be analyzed through simulation. However, cycle-accurate simulations of complex designs on an FPGA are slow and increase in complexity as additional system components are added to the simulation. Most (if not all) cycle-accurate simulators for FPGAs focus upon signal analysis and do not present the results at the source-code level to a software developer.

RC applications have potential for high performance, but HLS-based applications can fall far short of that potential due to the layer of abstraction hiding much of the


implementation (and thus performance) details. Performance analysis tools can aid the developer in understanding application behavior as well as in locating and removing performance bottlenecks. Due to the HLL abstraction layer, it is essential for performance analysis to provide performance data at that same level, allowing correlation between performance data and source line.

This work focuses upon performance analysis of an HLS-based application on a reconfigurable system by monitoring the application at runtime. The majority of insight gained was from performance analysis with high-level languages from experiences with Impulse C, a language designed by Impulse Accelerated Technologies, which maps a reduced set of C statements to HDL; however, insights gained from Carte C, a language designed by SRC Computers, are also discussed to provide an alternate perspective of HLL performance analysis. A performance analysis framework was developed based on Impulse C and this framework was prototyped into an automated tool in order to demonstrate its effectiveness on a molecular-dynamics application.

This chapter is organized as follows. Section 3.1 covers the challenges of performance analysis for HLL. Next, Section 3.2 provides details about the performance analysis framework for Impulse C. Section 3.3 then presents a case study using a molecular-dynamics kernel written in Impulse C. Finally, Section 3.4 concludes and presents ideas for future work.

3.1 High-Level Synthesis Performance Analysis Challenges

While all stages of performance analysis mentioned above are of interest for HLS-based applications, discussion in this chapter is limited to the challenges of instrumentation, measurement, analysis, and visualization; optimization is beyond the scope of this work. Thus, Section 3.1.1 covers the challenges of instrumenting an application, Section 3.1.2 explains the challenges associated with measuring performance data from an application, Section 3.1.3 discusses the challenges of analyzing performance data, and Section 3.1.4 examines the challenges of visualizing performance data.


3.1.1 Instrumentation Challenges

Instrumentation, the first step of performance analysis, enables access to application data at runtime. For HLS-based applications, this step raises two key issues: at what level of abstraction should modifications be made, and how to best select what should be accessed to gain a clear yet unobtrusive view of the application's performance. Tradeoffs concerning the level of abstraction are discussed in Section 3.1.1.1, while the selection of what to monitor is covered in Section 3.1.1.2.

3.1.1.1 Instrumentation levels

Three main instrumentation levels have been investigated: HLL software, HLL hardware and HDL. Each instrumentation level offers advantages to a performance analysis tool for HLS-based applications. For details on instrumentation levels below HDL, see Graham et al. [57].

The most obvious choice for instrumentation is to directly modify the HLL source code. Instrumentation can be added to the software code, requiring timing to be handled by the CPU. Each timing call is sent from FPGA to CPU over a communication channel. This is currently used by Impulse C developers since no HLL hardware timing functions are available. For a small number of coarse-grained measurements (e.g., for phases of the hardware application), the communication overhead and timing granularity are acceptable.

Instrumentation can also be added to the HLL source code that describes FPGA hardware. Most high-level synthesis tools lack this feature. Carte C is an exception in that it allows the developer to manually control and retrieve cycle counters, which, along with the FPGA's clock frequency and some adjustment for skew, provides accurate timing information between the CPU and FPGA. The main advantage of this method is simplicity; code is added to record data at runtime, and this data can be easily correlated with the source line that was modified. It is also possible that the HLL source code may be the only level that can be instrumented (e.g., if encrypted net-lists and bitstreams are employed).


Instrumentation can also be inserted after the application has been mapped from HLL to HDL. Instrumentation of VHDL or Verilog provides greater flexibility than instrumentation at the HLL level since measurement hardware can be fully customized to the application's needs, rather than depending upon built-in HLL timing functions. Adding instrumentation after the HLL-to-HDL mapping guarantees that measurement hardware will run in parallel with the hardware being timed, minimizing the effect of measurement on the application's performance and behavior. In contrast, Carte C introduces delays into a program when its timing functions are used. However, using instrumentation below the HLL source level does require additional effort to map information gathered at the HDL level back to the source level. This process is problematic due to the diversity of mapping schemes and translation techniques employed by various high-level synthesis tools and even among different versions of the same tool. For example, if a performance tool relies upon a textual relation between HLL variables and HDL signals, then the performance tool would fail if this naming scheme was modified in a subsequent release of the high-level synthesis tool.

While the simplicity of HLL instrumentation is desirable, the HDL level was instrumented in order to provide fine-grained performance analysis that would otherwise be impossible for HLS tools that lack hardware timing functions. Even for HLS tools that do provide hardware timing functions, HDL instrumentation may incur less overhead and generally provides greater flexibility than HLL instrumentation. HDL instrumentation is also utilized by FPGA logic analyzers such as Xilinx's ChipScope [47] or Altera's SignalTap [48].

3.1.1.2 Instrumentation selection

Application performance can generally be considered in terms of communication and computation. Many HLS tools, such as Impulse C and Carte C, have built-in functions for communication; these functions typically have associated status signals at the HDL level that can be instrumented to determine usage statistics such as transfer rate or idle time.
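As a rough illustration of the usage statistics mentioned above, the following Python sketch derives an idle fraction and a transfer rate from cycle counts of the kind that counters attached to a channel's status signals might report. The function name, its parameters, and the sample numbers are hypothetical; they are not part of Impulse C, Carte C, or the framework described in this chapter.

```python
def channel_stats(busy_cycles, idle_cycles, bytes_moved, clock_hz):
    """Summarize one communication channel from two cycle counters.

    busy_cycles / idle_cycles are assumed to come from counters driven by
    the channel's HDL status signals (hypothetical instrumentation).
    """
    total_cycles = busy_cycles + idle_cycles
    if total_cycles == 0:
        return {"idle_fraction": 0.0, "bytes_per_second": 0.0}
    return {
        "idle_fraction": idle_cycles / total_cycles,
        # elapsed time = total_cycles / clock_hz, so rate = bytes * clock / cycles
        "bytes_per_second": bytes_moved * clock_hz / total_cycles,
    }


# Illustrative numbers only: a 100 MHz clock and 1000 observed cycles.
stats = channel_stats(busy_cycles=750, idle_cycles=250,
                      bytes_moved=3000, clock_hz=100_000_000)
```

A high idle fraction on a channel feeding a busy process, or the reverse, is the kind of signal a developer would then investigate at the source level.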


Instrumenting computation is more complex due to the various ways that computation can be mapped to hardware. However, these mappings are constrained by the fact that each high-level synthesis tool must preserve the semblance of program order and thus will require some control structure to manage this ordering. For example, Impulse C maps computation onto (possibly multi-level) state machines, using the top-level state machine to provide high-level ordering of program tasks. For Carte C, computation is mapped to code blocks that activate each other using completion signals. While these control structures are useful for coarse-grained timing information, additional information can be obtained from substates within a single state of the top-level state machine or signals within a code block, which are used, for example, to control a single loop that has been pipelined. In Impulse C, this pipeline would consist of the idle, initialize, run, and flush substates, where the initialize and flush substates indicate pipelining overhead and thus provide indication of lost performance. Additionally, signals such as stall, break, write, and continue can be instrumented on a per-pipeline-stage basis to obtain even more details if needed. For Carte C, less detail is available since pipelined loops are not broken up into explicit stages and state machines are not exposed for instrumentation. Nonetheless, intermediate signals connecting Carte C's hardware macros inside a code block can be instrumented, which provide the necessary information to determine pipelined loop iterations and stalls. Overall, it is the control structures employed to maintain program order that provide key data for monitoring performance of these applications.

It may also be beneficial to monitor application data directly (i.e., an HLL variable) if such information provides a better understanding of application performance and behavior. For example, a loop control variable may be beneficial to monitor if it represents the progress of an algorithm. Unfortunately, selection of an application variable is, in general, not automatable due to the need for high-level, application-specific knowledge to understand the variable's purpose and expected value.
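The pipeline substates named above lend themselves to a simple overhead estimate. The sketch below is a hedged illustration that assumes per-substate cycle counts are already available from instrumentation; the dictionary keys follow the substate names in the text, while the function name and the sample counts are invented for the example.

```python
def pipeline_overhead(substate_cycles):
    """Fraction of non-idle pipeline cycles spent in initialize or flush.

    substate_cycles maps substate name -> cycle count. Treating only
    'initialize' and 'flush' as overhead follows the text above; excluding
    'idle' from the denominator is an assumption of this sketch.
    """
    active = {name: n for name, n in substate_cycles.items() if name != "idle"}
    total = sum(active.values())
    if total == 0:
        return 0.0
    overhead = active.get("initialize", 0) + active.get("flush", 0)
    return overhead / total


# Hypothetical counts for one pipelined loop.
example = {"idle": 100, "initialize": 20, "run": 960, "flush": 20}
overhead = pipeline_overhead(example)
```

With these made-up counts, 40 of the 1000 non-idle cycles are overhead. A short loop that is entered many times would show a much larger share, since each entry pays the initialize and flush cost again.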


Instrumentation of state machines was chosen for Impulse C and completion signals for Carte C since they provide timing data similar to software profilers. Since these control structures are needed to preserve the order of execution of the HLL, they should be targeted for automatic instrumentation.

When comparing Impulse C and Carte C, it is evident that instrumentation is the primary step of performance analysis that requires change for a new HLS tool. The remaining steps can remain basically unchanged as long as the designer is focused primarily on the timing of HLL source code and reverse mapping is performed.

3.1.2 Measurement Challenges

After instrumentation code has been inserted into the developer's application, monitored values must be recorded (measured) and sent back to the host processor. Section 3.1.2.1 presents standard techniques for measuring application data while Section 3.1.2.2 discusses the challenges of extracting measurement data.

3.1.2.1 Measurement techniques

Regardless of the programming language used, the two common modes for measuring performance data are profiling and tracing. Profiling records the number of times that an event has occurred, often using simple counters. To conserve the logic resources of an FPGA, it is possible to store a larger number of counters in block RAM if it can be guaranteed that only one counter within a block RAM will be updated each cycle. This technique is useful for large state machines, since they can only be in one state at any given clock cycle. Profiling data can be collected either when the program is finished (post-mortem) or sampled (collected periodically) during execution. At the cost of communication overhead, sampling can provide snapshots of profile data at various stages of execution that would otherwise be lost by a post-mortem retrieval of performance data.

In contrast, tracing records timestamps indicating when individual events occurred and, optionally, any data associated with each event. Due to the potential for generating large amounts of data, trace records typically require a buffer for temporary storage (e.g.,


Block RAM) until they can be offloaded to a larger memory, such as the host processor's main memory. While logic resources in the FPGA can also be used for trace data storage, this resource is scarce and of lower density than block RAM, making logic resources ill-suited for general trace data. If available, other memory resources such as larger, preferably on-board SRAM or DRAM can be used to store trace data as well before it is sent to the host processor. Tracing does provide a more complete picture of application behavior, capturing the sequence and timing of events. Thus, when needed, tracing can be justified despite the often high memory and communication overhead. The challenges and techniques associated with measurement for HLLs are similar to those of HDLs [2]. Therefore, their profiling and tracing measurement techniques are used in this framework.

3.1.2.2 Measurement data extraction

Measurement data gathered in the FPGA typically is transferred to permanent storage for analysis. This data is commonly first buffered in large, lower-latency memories while awaiting transfer. FPGA logic analyzers such as Xilinx's ChipScope or Altera's SignalTap use JTAG [58] as an interface for extracting measured data in order to debug hardware. However, while a JTAG interface is available on many FPGA computing platforms, it is not well suited towards data extraction for runtime performance analysis. JTAG is a low-bandwidth serial communication interface and, in an HPC environment, the setup of all required JTAG cables coupled with the possible need to add additional hardware in order to receive the JTAG data is cumbersome and scales poorly.

As an alternative to JTAG, many HLS tools use communication interfaces to transfer data between the CPU and FPGA. In order to extract measurement data, an additional communication channel can be added to the application's source code. Using these built-in interfaces is advantageous since no change to the physical system is required to support performance analysis. Thus, data extraction was chosen using HLL communication channels since it is more portable and better suited for typical HPC environments.
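To make the profiling and tracing modes of Section 3.1.2.1 concrete, the following software model mimics per-state profile counters and a bounded trace buffer that is drained over a channel to the host. It is only a sketch: the class names, the drop-when-full policy, and the sample state sequence are assumptions made for illustration, not a description of the measurement hardware used in this work.

```python
from collections import deque


class StateProfiler:
    """Per-state cycle counters, analogous to words in one block RAM."""

    def __init__(self, num_states):
        self.counts = [0] * num_states

    def tick(self, state):
        # A state machine is in exactly one state per cycle, so only one
        # counter is updated per cycle.
        self.counts[state] += 1


class TraceBuffer:
    """Bounded store of (cycle, event) records awaiting transfer to the host."""

    def __init__(self, depth):
        self.depth = depth
        self.records = deque()
        self.dropped = 0  # assumed policy: count records lost when full

    def record(self, cycle, event):
        if len(self.records) < self.depth:
            self.records.append((cycle, event))
        else:
            self.dropped += 1

    def drain(self):
        """Return and clear buffered records, as a host read might."""
        out = list(self.records)
        self.records.clear()
        return out


# Hypothetical run: six cycles of a three-state machine.
profiler = StateProfiler(num_states=3)
trace = TraceBuffer(depth=2)
previous = None
for cycle, state in enumerate([0, 1, 1, 2, 1, 1]):
    profiler.tick(state)
    if state == 2 and previous != 2:
        trace.record(cycle, "enter_state_2")
    previous = state

drained = trace.drain()
```

The profile counters answer "how long in each state" with a fixed, small amount of data, while the trace answers "when", at the cost of storage that grows with the number of events.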


However, since the HLL communication channel is shared with the application, care must be taken not to disturb the application's original behavior.

Communication overhead can depend upon several factors. One major factor concerns how much data is generated. Profile counters and trace buffers should be sized according to the number of events expected (with some margin of safety). Events should also be defined frugally to minimize the amount of data recorded while still obtaining the information needed to analyze performance. For example, while it may be ideal to monitor the exact time and number of cycles for all writes, it may be sufficient to know the number of writes exceeding a certain number of cycles.

Another source of overhead comes from the HLL's communication interface. The bandwidth of streaming and memory-mapped communication interfaces can vary significantly between HLS tools as well as between FPGA platforms using the same tool, depending upon implementation. Therefore, it is important for performance analysis tools to support as many communication interfaces (e.g., streaming, DMA) as possible to provide flexibility and reduce overhead.

3.1.3 Analysis Challenges

While analysis of performance data has historically been very difficult to automate, automatic analysis can improve developer productivity by quickly locating performance bottlenecks. Automatic analysis of HLS-based applications could focus upon recognizing common performance problems such as potentially slow communication functions or idle hardware processes. For example, processes replicated to exploit spatial parallelism can be monitored to determine which are idle and for what length of time, giving pertinent load-balancing information to the developer. Processes can also be replicated temporally in the form of a pipeline of processes and monitored for bottlenecks. High-level synthesis tools can also pipeline loops inside of a process, either automatically (e.g., Carte C) or explicitly via directed pragmas (e.g., Impulse C). In this case, automatic analysis would


determine how many cycles in the pipeline were unproductive and the cause of these problems (e.g., data not available, flushing of pipeline).

Automatic analysis can also be useful in determining communication characteristics that may cause bottlenecks, such as the rate or change in rate of communication. For example, streams that receive communication bursts may require larger buffers, or an application may be ill-suited for a specific platform due to a lack of bandwidth. The timing of communication can also be important; shared communication resources such as SRAMs often experience contention and should, in general, be monitored. Monitoring these communication characteristics can aid in the design of a network that keeps pipelines at peak performance. Integration of automatic analysis into the framework will be saved for future work. Further study on common reconfigurable computing bottlenecks in applications is needed in order to develop a general-purpose automatic analysis framework.

3.1.4 Visualization Challenges

One of the strengths of reconfigurable computing is that it allows the programmer to implement application-specific hardware. However, visualizations for high-performance computing are typically designed to show computation on general-purpose processors and communication on networks that allow all-to-all communication. Thus, these visualizations are ill-suited for HLS-based applications, treating heterogeneous components and communication architectures as homogeneous.

Koehler et al. [2] presented a mockup visualization for HDL performance analysis. This visualization was organized in the format of the system architecture and provides details on CPU usage, interconnect bandwidths, and FPGA state machine percentages. The visualization concepts presented by Koehler can be extended and presented in greater detail for HLS-based applications. HDL code generated by high-level synthesis tools has a predefined structure, making it more feasible to automatically generate meaningful visualizations for HLS-based applications.


Visualizations for HLS-based applications can show performance data in the context of the application's architecture, as specified by the programmer. A potential (mockup) visualization is shown in Figure 3-1. In this example, profile data representing the time the application spent in various states is presented using pie charts. The communication architecture and its associated performance are also shown, connecting the processes together with streaming or DMA communication channels (Profile View and Default Profile Key in Figure 3-1). This type of representation allows all of the application profile data to be displayed in a single visualization while capturing the application's architecture.

Figure 3-1. Example performance visualization

Rather than presenting each state used by an HLL for DMA or streaming communication, these states can be assigned to one of three categories: transferring, idle, or blocking. For example, blocking can occur if a stream FIFO becomes full and a streaming call is made;


the function will then block, preventing further execution until an element is removed from the FIFO. However, blocking can also occur when DMA communication requires a shared resource that is currently servicing another request. By categorizing any communication channel state into one of three categories, the visualization provides better scalability (states are effectively summarized) and is more readily understood (all communication channels use the same categories).

A similar categorization is used for hardware processes (where there can be hundreds of states, making categorization essential in order to obtain meaningful visualizations). States of a hardware process are assigned to one of three categories: active communication, active computation, and a miscellaneous category. Active communication can be defined as time spent for streaming or DMA calls that are non-blocking. Active computation can include the use of variables and states corresponding to an active pipeline. In the case where both computation and communication are taking place simultaneously in a process (e.g., a Carte C parallel section), time should only be added to the computation category since the overhead of communication is being hidden. The miscellaneous category acts as a catch-all for initialization and overhead such as pipeline flushing. However, the definition of overhead can vary depending on the application and programmer.

A further complication exists due to the lack of a one-to-one mapping between HLL code and hardware states. In order to help the programmer link HLL source code to the above categories, each line in the source code can be color-coded to match the corresponding category's color (Source View in Figure 3-1). Note that some lines of code may receive no color at all if these lines of code do not require any execution time in hardware (e.g., defining a variable creates a signal in the HDL but does not consume cycles). In the case where a single line of code contains multiple states, commented lines of code can be automatically added beneath that line of code to allow the programmer to make a more fine-grained selection between states that could fall into different categories.


For example, the four states of an Impulse C pipeline (run, flush, initialize, and idle) can be added as comments below a CO PIPELINE pragma. The run state would be considered active computation whereas the initialize, flush and idle states would fall under the miscellaneous category. If multiple lines of code are grouped into a single state, then those lines of code can only be color-coded as a whole.

While presenting a breakdown of time spent in processes can provide a good indication of application performance, pinpointing the cause of bottlenecks may require more detailed trace-based data. Since local storage for trace data is likely to be limited, and since communication channels are likely to be shared with the application, it is important to define efficient triggers to minimize the trace data generated. Thresholds for trace triggers can be set after examining the Profile View for bottlenecks. As an example, the programmer may want to trigger a trace event when a stream buffer becomes full (Timeline View in Figure 3-1) or when a pipeline is stalled for 100 cycles. The programmer can iteratively refine thresholds that control trace-event triggers, if necessary, to reduce bandwidth needed for performance data.

In order to provide a scalable visualization for user-specified trace data, trace events can be displayed in a single timeline view for the entire application. This one-dimensional view will allow the programmer to quickly scan a timeline and find trace events that indicate a potential bottleneck. In contrast, traditional performance analysis visualizations such as Jumpshot [32] would dedicate a row to each parallel node (in this case, each process shown in the Profile View of Figure 3-1 since they are all operating in parallel). This two-dimensional view forces the programmer to scan a potentially large number of rows on one axis over a large period of time on the second axis in order to find a bottleneck.

3.2 High-Level Synthesis Performance Analysis Framework

A performance analysis tool for Impulse C was developed in order to illustrate techniques that address many of the major challenges described in Section 3.1. Impulse C


was selected due to its support for a variety of platforms. In order to improve the usability of the tool first described in Curreri et al. [59], two main issues needed to be addressed: automation of instrumentation and integration with an existing software performance analysis tool. In Section 3.2.1, the methods needed to automate instrumentation are discussed. Section 3.2.2 then covers the steps taken to add performance analysis for Impulse C to an existing software performance analysis tool in order to create a unified, dual-paradigm performance analysis tool.

3.2.1 Instrumentation Automation

Section 3.1.1 concluded that while instrumentation could be inserted at a number of levels ranging from HLL source code to binary, instrumenting at the HDL level provided the best tradeoff between flexibility and portability. Additionally, Section 3.1.2.2 concluded that extracting measurement data using HLL communication channels offers the greatest portability and is often better suited to an HPC environment than JTAG or other communication interfaces. Thus, instrumentation must be performed in two stages. Section 3.2.1.1 describes automated HLL instrumentation that inserts communication channels for measured performance data. Section 3.2.1.2 then presents an automated tool for HDL instrumentation in order to record and store application behavior at runtime.

3.2.1.1 Automated C source instrumentation

In order to communicate measured performance data from hardware back to software, the framework first instruments Impulse C source code. Before instrumentation is added, the Impulse C application consists of software processes running on the host processors, hardware processes running on the FPGAs and communication channels to connect them (shown with white boxes and arrows in Figure 3-2 on the left-hand side). Automatic instrumentation modifies this structure by inserting separate definitions for a hardware process, a software process, and communication channels (shown with dark boxes and arrows in Figure 3-2 on the right-hand side). The software process can be declared as an "extern" function to be added later. Since the hardware process will be overwritten during


HDL instrumentation, a simple loopback process suffices, depicted by the cross-hatched arrow in Figure 3-2.

Figure 3-2. Hardware Measurement Module (HMM) addition and application instrumentation

Figure 3-3 provides a flowchart illustrating the typical design flow (lightly shaded steps) as well as the changes made to that flow (darkly shaded steps) when instrumenting code for performance analysis. Instrumentation of Impulse C source code is handled automatically by the HLL Instrumenter, as shown in step two of Figure 3-3; for Impulse C, the HLL Instrumenter consists of a Perl script driven by a Java GUI frontend. The HLL Instrumenter modifies only the Impulse C hardware file; the software file remains unchanged. Since Impulse C hardware files must include a configure function to set up Impulse C processes and communication channels, the HLL Instrumenter searches for the definition of the configure function, adding the loopback hardware process code and "extern" software process declaration at the beginning of this function. The configure function will also be modified so that it declares the new software and hardware process and provides the necessary communication channels between them.

3.2.1.2 Automated hardware description language instrumentation

Once the application is mapped to HDL code (step 3 in Figure 3-3), the HDL Instrumenter is employed, providing the application developer the choice between default


Figure 3-3. Design flow for Impulse C performance analysis

and custom instrumentation (step 4 in Figure 3-3). Since signals that correspond to state machines in Impulse C are prime candidates for instrumentation, the default option simply monitors all state machines. Impulse C relies upon state machines in the generated HDL code to preserve the structure of the original C code. The state machine structure is primarily determined by statements that represent a branch in execution, such as if, while, for, etc. Impulse C handles C statements within a branch by placing them either in a single state or in multiple sequential states depending upon their aggregated delay. However, a loop that is pipelined is always represented as one state within this state machine. Instrumenting state machines aids the user in better understanding where time is being spent inside their hardware processes.

Custom instrumentation allows the application developer to instrument Impulse C variables. Due to the fact that Impulse C variable names are used within the names of corresponding HDL signals, the HDL signals can often be identified easily by the programmer. Variables can be instrumented by counting each time the variable is above or below some threshold, although more advanced instrumentation, such as histograms, can be constructed if desired. Currently, custom instrumentation typically involves specifying thresholds via concurrent conditional VHDL statements (i.e., VHDL "when/else" statements). These conditional statements convert the instrumented signal value from


the hardware process to a one or zero value that will then control a profile counter or trace buffer. However, this process could be simplified (e.g., histogram generation only requires a value range and bin size to be specified by the user).

Once the signals to be instrumented have been selected, they are routed into the hardware loopback process (thin black arrow in Figure 3-2). The loopback process is then replaced by the Hardware Measurement Module (HMM) shown in the lower dark box in Figure 3-2. The HMM contains customized hardware with profiling and tracing capabilities (see Figure 3-4) and was originally designed for HDL performance analysis [2]. The HMM allows HDL signals to be used in arbitrary expressions that define events such as "buffer is full" or "component is idle." These events are used to trigger custom profile counters or trace buffers depending upon the type and level of detail of performance data required. A cycle counter is also provided for synchronization and timing information. The module control provides the interface to software for transferring data back to the host processor at runtime as well as clearing or stopping the module during execution.

Figure 3-4. Hardware Measurement Module (HMM)

Once instrumentation is complete, the HDL is ready to be converted to a bitstream (step 5 in Figure 3-3) and programmed into the FPGA. In general, the techniques used in


Section 3.2.1 and Figure 3-2 should be valid for any HLS tool that employs communication channels and generates unencrypted and non-obfuscated HDL.

3.2.2 Performance Analysis Tool Integration

In order to provide the programmer with traditional microprocessor performance data (as well as a GUI frontend for viewing performance data) with minimal design effort, the Impulse C hardware performance analysis framework was integrated into an existing performance analysis tool, Parallel Performance Wizard (PPW) [3], which is designed to provide performance analysis for several parallel programming languages and models including UPC (Unified Parallel C) and MPI (Message Passing Interface). Section 3.2.2.1 explains how PPW+RC (i.e., PPW augmented with the framework for RC performance analysis) extracts performance data from the HMM. Section 3.2.2.2 then describes how measurement data for Impulse C hardware processes are visualized by PPW+RC.

3.2.2.1 Performance data extraction

The software process that extracts measurement data from the HMM was originally inserted into the application software. For Impulse C, the measurement extraction process now resides in the PPW+RC backend. For other HLLs, name shifting could be used to intercept function calls for measurement data extraction. Integrating the measurement extraction process with PPW+RC allows PPW+RC to automatically gather performance data from Impulse C hardware processes as well as to convert received data to match the format of and mesh well with data being collected on the microprocessor (e.g., all times measured on the FPGA must be converted from cycle counts to nanoseconds in order to allow easy comparison of the time spent in hardware and software processes). Once execution of the application is finished, performance data from both hardware and software is stored in a single file for review.

3.2.2.2 Performance data visualization

The PPW+RC frontend is a Java GUI that is used to visualize performance data recorded and stored by the PPW+RC backend. Assuming the programmer has selected


the default instrumentation mode (see Section 3.2.1.2), the performance data file will contain information on all state machines (which are employed by the hardware processes); this information can easily be viewed in an expandable table or via pie charts or other graphical views. State machines associated with pipelines will also be displayed if pipelining was used. Figure 3-5 shows the PPW+RC frontend displaying timing for software processes, hardware processes, and pipelines for a DES encryption application [60]. PPW+RC can also compare multiple executions of an application to allow careful analysis of the benefits and effects of various modifications and optimizations, or possibly just to study non-deterministic behavior between executions of the same version of the application.

Figure 3-5. Profile tree table visualization for Triple Data Encryption Standard (DES)
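Section 3.2.2.1 notes that times measured on the FPGA must be converted from cycle counts to nanoseconds before they can be compared with software timings. A minimal sketch of that conversion follows, together with the per-state percentages that a profile table might list. The 100 MHz clock, the state names, and the cycle counts are hypothetical values chosen for the example, and the function names are not part of PPW+RC.

```python
def cycles_to_ns(cycles, clock_hz):
    """Convert a cycle count to nanoseconds for a given clock frequency."""
    return cycles * 1_000_000_000 / clock_hz


def state_report(state_cycles, clock_hz):
    """Map each state name to (time in ns, percent of all counted cycles)."""
    total = sum(state_cycles.values())
    report = {}
    for name, cycles in state_cycles.items():
        percent = 100.0 * cycles / total if total else 0.0
        report[name] = (cycles_to_ns(cycles, clock_hz), percent)
    return report


# Hypothetical profile of one hardware process clocked at 100 MHz.
report = state_report({"input": 500, "pipeline": 1500, "output": 500},
                      clock_hz=100_000_000)
```

Here one cycle corresponds to 10 ns, so the 1500 pipeline cycles become 15,000 ns and account for 60% of the counted cycles.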


Currently, the HDL names for signals and states are presented in visualizations. Techniques for fully automating the reverse-mapping of HDL code states to HLL source lines of code are complex and may not work for all cases. One possible technique is to perform a graph analysis comparison on the hardware control-flow graph (e.g., state machine graph) and the control-flow graph produced by a software compiler (e.g., one obtained by compiling with gcc after removing non-ANSI-C-compliant HLL statements). This could allow loops or branches in the state machine graph to be matched with loop or branch source code. Another reverse-mapping technique involves matching HLL-specific code statements with corresponding hardware. For example, HLL pipelined loops can be matched with pipeline hardware. Additional techniques such as variable name-matching (e.g., via matching similar names in both the HLL source code and the generated HDL in Impulse C to match variables to signals) can aid in matching states to source code line numbers.

Ideally, HLS tool vendors would provide support for reverse mapping, greatly simplifying the above process. For example, the data-flow graph file generated by Carte C links hardware code blocks to source lines of code, allowing the tool to perform automatic reverse mapping. For Impulse C, reverse mapping is currently tool-assisted. Impulse Accelerated Technologies, creators of Impulse C, has expressed interest in adding comments to their HDL output to make the reverse-mapping process fully automated [61]. Reverse-mapping would allow PPW+RC to provide source line correlation for both hardware and software processes, allowing the application developer to easily locate the line(s) of code associated with measured performance data. With fully automated reverse-mapping support, the unfamiliar concept of hardware states can be abstracted away, allowing the software application developer to see similar performance analysis visualizations for both software and hardware.
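The variable name-matching idea mentioned above can be sketched as a simple substring search. This is an illustrative assumption about how such matching might work, not the tool's actual method: the function name is invented, and the signal names below are made up rather than taken from real Impulse C output.

```python
def match_variables_to_signals(variables, signals):
    """Map each HLL variable name to the HDL signal names that contain it.

    Matching is a case-insensitive substring test, which is only a
    heuristic: very short variable names will produce false positives.
    """
    matches = {}
    for var in variables:
        needle = var.lower()
        matches[var] = [sig for sig in signals if needle in sig.lower()]
    return matches


# Hypothetical names for illustration only.
hll_variables = ["pos_x", "count"]
hdl_signals = ["p_accel_pos_x_reg", "p_accel_count_next", "p_accel_state"]
matches = match_variables_to_signals(hll_variables, hdl_signals)
```

As Section 3.1.1.1 cautions, any scheme of this kind depends on the naming convention of a particular tool release and would need to be revisited if that convention changed.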


3.3 Molecular-Dynamics Case Study

To demonstrate the benefits of HLS performance analysis and explore its associated overhead, a molecular-dynamics (MD) kernel written in Impulse C was analyzed. MD simulates interactions between atoms and molecules over discrete time intervals. MD simulations take into account standard physics, van der Waals forces, and other interactions to calculate the movement of molecules over time. Alam et al. [62] provide a more in-depth overview of MD simulations. This simulation keeps track of 16,384 molecules, each of which uses 36 bytes (4 bytes each to store its position, velocity, and acceleration in each of the X, Y, and Z directions). Analysis is focused on the kernel of the MD application that computes distances between atoms and molecules.

Serial MD code optimized for traditional microprocessors was obtained from Oak Ridge National Lab (ORNL). The MD code was redesigned in Impulse C using an XD1000 [8] as the target platform. The XD1000 is a reconfigurable system from XtremeData Inc. containing a dual-processor motherboard with an Altera Stratix-II EP2S180 FPGA on a module in one of the two Opteron sockets. The HyperTransport interconnect provides a sustained bandwidth of about 500 MB/s between the FPGA and host processor with Impulse C. Using this platform, a speedup of 6.2 times was obtained versus the serial baseline running on the 2.2 GHz Opteron processor in the same XD1000 server. Using the prototype performance analysis tool, the performance of the MD code was analyzed to determine if further speedup could be obtained.

There are three hardware processes defined in the MD hardware subroutine (Figure 3-6). The two processes named Collector and Distributor are used to transfer data to and from SRAM, respectively, in order to provide a stream of data running through the third process, Accelerator. Accelerator calculates the position values of molecules and is pipelined using Impulse C pragmas. To increase performance, the Accelerator process is replicated 16 times, which nearly exhausts the FPGA's resources.
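The data-set figures quoted above can be checked with a few lines of arithmetic. The sketch below reproduces the 36 bytes per molecule and derives the total data-set size; the transfer-time estimate is an added back-of-the-envelope figure, not a measurement from this study, and it assumes 1 MB = 10^6 bytes and ignores latency and protocol overhead.

```python
MOLECULES = 16_384
BYTES_PER_VALUE = 4
QUANTITIES = 3  # position, velocity, acceleration
AXES = 3        # X, Y, Z

bytes_per_molecule = BYTES_PER_VALUE * QUANTITIES * AXES  # 36 bytes, as stated
dataset_bytes = MOLECULES * bytes_per_molecule            # 589,824 bytes

# Assumption: "about 500 MB/s" is taken as 500e6 bytes per second.
BANDWIDTH_BYTES_PER_S = 500e6
transfer_ms = dataset_bytes / BANDWIDTH_BYTES_PER_S * 1e3  # roughly 1.18 ms
```

Under these assumptions, moving the whole data set once across the interconnect takes on the order of a millisecond.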


Figure 3-6. Molecular-Dynamics (MD) hardware subroutine

The MD kernel was instrumented and analyzed, with a focus on understanding the behavior of the state machine inside of each Accelerator process (Figure 3-7). The number of cycles spent in each state was recorded by the HMM and sent back to the host processor post-mortem. Upon examination, three groups of states in the main loop of the Accelerator process were of particular interest. The first group keeps track of the total number of cycles used by the input stream (arrows pointing to Accelerator in Figure 3-6) of the Accelerator process. The second group of states keeps track of the total number of cycles used by the pipeline inside of the Accelerator process. Finally, the third group of states keeps track of the total number of cycles used by the output stream (arrows pointing to the Collector in Figure 3-6) in the Accelerator process. Tracing was used to find the start and stop times of the FPGA and all Accelerator processes. The cycle counts from these three groups were then converted into a percentage of the Accelerator runtime (Figure 3-7) by dividing by the total number of cycles used by the MD hardware subroutine (i.e., FPGA runtime). Since the state groups vary by less than one-third of a percent when compared across all 16 Accelerators, data is presented from only one of the Accelerator processes.

Figure 3-7. Accelerator process source with profiling percentages

The performance analysis tool successfully identified a bottleneck in the MD hardware subroutine. In the Accelerator processes, almost half of the execution time was used by the output stream to send data to the Collector process (state b6s2 in Figures 3-7 and 3-8). An optimal communication network would allow the pipeline performing MD operations to execute for nearly 100% of the FPGA runtime, minimizing the number of cycles spent blocking for a transfer to complete. This behavior indicates that the stream buffers, which hold 32-bit integers, are becoming full and causing the pipeline to stall. Increasing the buffer size of the streams by 32 times only required a change of one constant in the program. This increase changes the stream buffer size to 4096 bytes for all 16 input and output streams of the Accelerator processes. Since the Impulse C compiler can only increase the stream buffer size by a power of 2, a buffer size of 4096 bytes is the maximum size that will pass place and route. Figure 3-9 shows the tradeoff between application runtime and various stream buffer sizes. The larger stream buffers reduced the number of idle cycles generated by the output stream (top bar in Figure 3-9) while the pipeline's runtime (bottom bar in Figure 3-9) remained the same, thus reducing the MD kernel's runtime. This simple change increased the speedup of the application from 6.2 to 7.8 versus the serial baseline running on the 2.2 GHz Opteron processor.


Figure 3-8. Pie chart visualization of the states in the Accelerator process. State b5s0 corresponds to the pipeline and state b6s2 corresponds to the output stream.

Although per-loop analysis is currently not supported by automatic instrumentation and visualization, more detailed analysis of the blocking stream calls can be performed by examining the stream transfer time for each loop iteration. Since these times are so small, they will be presented in cycles. The outer loop of the Accelerator process (Figure 3-7) is performed once per molecule (16,384 times). Only one of the three output transfer states in the loop generates idle cycles (Figure 3-9); thus only that transfer state needs to be monitored. After each loop iteration, the number of cycles required by the output stream to transfer data is counted. The cycle count range is segmented into sub-ranges or bins, each 256 cycles wide. A counter is used for each range to keep track of the number of times the transfer count falls in that range. Figure 3-10 shows the stream transfer cycle count segmented into bins. Per-loop analysis of the output stream provides additional insight into the bottleneck and the effect of the buffer size on the loop iteration cycle count. As the buffer size increases, longer cycle counts become less frequent and cluster into different regions. The region corresponding to a one-cycle stream transfer represents the case where no idle cycles are generated. Even with a stream buffer size of 4096 bytes, fewer than 30% of the stream transfers are ideal.
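The binning described above can be sketched in a few lines of Python. This is an illustrative software model only: in the case study the bins are maintained by counters in hardware, and the sample cycle counts below are hypothetical rather than measured data. The function names are likewise assumptions made for the sketch.

```python
from collections import Counter

BIN_WIDTH = 256  # cycles per bin, as stated in the text


def bin_transfer_cycles(cycle_counts, bin_width=BIN_WIDTH):
    """Count how many per-iteration transfer times fall into each bin.

    Bin k covers the half-open range [k * bin_width, (k + 1) * bin_width).
    """
    histogram = Counter()
    for cycles in cycle_counts:
        histogram[cycles // bin_width] += 1
    return dict(sorted(histogram.items()))


def ideal_fraction(cycle_counts):
    """Fraction of transfers that finish in one cycle (no idle cycles)."""
    if not cycle_counts:
        return 0.0
    return sum(1 for c in cycle_counts if c == 1) / len(cycle_counts)


# Hypothetical per-iteration output-stream transfer times, in cycles.
samples = [1, 1, 300, 255, 256, 1024, 1]
print(bin_transfer_cycles(samples))
print(ideal_fraction(samples))
```

Keeping one counter per bin, rather than storing every per-iteration value, is what keeps the measurement cheap: the storage needed grows with the number of bins, not with the 16,384 loop iterations.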


Figure 3-9. Bar chart visualization depicting MD kernel performance improvement due to increased stream buffer size (128 to 4096 elements).

The overhead caused by instrumentation and measurement of the Accelerator process with a stream buffer size of 4096 bytes on the XD1000 is shown in Table 3-1. The instrumented version in Table 3-1 includes all additional hardware for performance analysis (i.e., the HMM and additions to the Impulse C communication wrapper). Instrumentation and measurement hardware increased total FPGA logic utilization by 3.90%. Profile counters and timers used an additional 3.70% of the FPGA's logic registers, whereas tracing buffers required 1.27% additional block memory implementation bits. An additional 2.73% of combinational Adaptive Look-Up Tables (ALUTs) were also needed. For routing, instrumentation increased block interconnect usage by 2.56%. Finally, the FPGA experienced a slight frequency reduction of 2.64% due to instrumentation. Overall, the overhead for performance analysis was found to be quite modest.

Figure 3-10. Output stream overhead for the Accelerator process.

Table 3-1. Performance analysis overhead

EP2S180                         Original            Instrumented        Overhead
Logic used (143520)             126252 (87.97%)     131851 (91.87%)     +5599 (+3.90%)
Comb. ALUT (143520)             100344 (69.92%)     104262 (72.65%)     +3918 (+2.73%)
Registers (143520)              104882 (73.08%)     110188 (76.78%)     +5306 (+3.70%)
Block memory (9383040 bits)     3437568 (36.64%)    3557376 (37.91%)    +119808 (+1.27%)
Block interconnect (536440)     288877 (53.85%)     300987 (56.11%)     +12110 (+2.56%)
Frequency (MHz)                 80.57               78.44               -2.13 (-2.64%)

3.4 Conclusions

High-level languages have the potential to make reconfigurable computing more productive and easier to use for application developers. While this higher abstraction level allows the high-level synthesis tools to implement many of the design details, it can also make it easier for the developer to introduce bottlenecks into their application. Performance analysis tools allow the application developer to understand where time is spent in their application so that the best strategy for application optimization can be taken.

Many challenges for performance analysis of HLS-based FPGA applications have been identified in this chapter. A number of instrumentation levels and associated challenges were discussed; in the end, instrumentation at the HDL level was chosen for its flexibility and portability between high-level synthesis tools and platforms. In addition, many different HLL structures have the potential to be instrumented once mapped to an HDL. Instrumentation of control hardware employed to maintain program order, pipelines, and communication channels was discussed. These structures are amenable to automated instrumentation and can provide performance data relevant to a wide range of applications. Communication of measured data via JTAG and HLL channels at runtime was also discussed. HLL communication channels were selected due to their portability between platforms and minimal external hardware requirements. The use of measured performance data was explained for automatic bottleneck detection at the HLL source-code level to increase developer productivity. An HLS-specific visualization displaying performance information in the context of the application's architecture was presented, providing the programmer with an overview of application performance. The visualization also displayed bottleneck-specific tracing.

An automated framework for performance analysis of Impulse C was also presented that implements many of the techniques needed to address the challenges above. The framework incorporates automatic instrumentation of Impulse C hardware processes. State machines in each process can be instrumented to provide an execution time breakdown. Timing data gathered from these state machines are collected and visualized using a performance analysis tool, Parallel Performance Wizard (PPW). Since PPW was originally designed for performance analysis of parallel computing languages (e.g., MPI and UPC), the extension to PPW allows it to provide performance analysis of both hardware and software simultaneously.


A case study was presented to demonstrate the utility of profiling and tracing application behavior in hardware, allowing the developer to gain an understanding of where time was spent on the reconfigurable processor. Low overhead was observed (in terms of FPGA resources) when adding instrumentation and measurement hardware, demonstrating the ability to analyze applications that use a large portion of the FPGA. In addition, a slight reduction in frequency (less than 3%) resulted from instrumentation. Since data was gathered after execution completed, there was no communication overhead. Although the mapping between HDL and HLL code is not currently presented to the programmer, future plans include linking all HDL-based data back to the HLL source, permitting the programmer to remain fully unaware of the HDL generated by a high-level synthesis tool. Additional future work includes developing more advanced visualizations for HLS-based applications and expanding the tool to support performance analysis for Carte C.


CHAPTER 4
COMMUNICATION BOTTLENECK DETECTION AND VISUALIZATION

High-level synthesis (HLS) tools such as Impulse C [55] and Carte [56] increase developer productivity by allowing developers to program Field-Programmable Gate Arrays (FPGAs) at a higher abstraction level. This higher abstraction level is achieved by using the C programming language rather than programming at the register-transfer level (RTL) using hardware description languages (HDLs). However, programming at a higher abstraction level can also lead to a loss in performance that can be difficult for a developer to optimize due to a lack of visibility into the synthesized circuit structure and runtime behavior.

Performance analysis tools are commonly used to identify performance bottlenecks in software development, but there is a lack of such tools designed specifically for HLS. Existing simulators and debugging tools provide some performance-analysis capabilities by enabling developers to monitor signals and state machines. However, these approaches are lacking in one or more of several areas: speed or accuracy, source-code correlation, and high-level metrics and visualizations. HLS simulators run C source on the microprocessor, which is fast and provides source-code correlation, but is inaccurate because the timing and synchronization of hardware is not taken into account. Thus, HLS simulators are ill-suited for performance analysis. Cycle-accurate HDL simulators can provide accurate monitoring of HDL structures generated by HLS tools, but simulating large designs can be time consuming and current HDL simulators do not provide correlation back to HLS source code. Logic analyzers such as ChipScope [47] and SignalTap [48] provide fast access to HDL signal values but do not provide correlation to C source code.

One of the key capabilities missing from existing performance analysis approaches is visualization of high-level metrics such as bandwidth and visualizations of the application's communication architecture. Such visualizations are common in the high-performance computing (HPC) community for parallel programming languages and libraries such as MPI and UPC. For example, Figure 4-1 shows the number of bytes transferred between initiating threads and data affinity threads for an application executing on 32 nodes. This visualization was generated by the Parallel Performance Wizard (PPW), a performance analysis tool that enabled a performance improvement of over 14% after optimizing the FT benchmark (fast Fourier transform) [3].

Figure 4-1. Parallel Performance Wizard (PPW) data transfer visualization

This chapter presents a communication-bottleneck detection and visualization tool for HLS. Our tool provides the developer with a high-level visualization of the bandwidth of all communication between processes of an application for both the CPU and FPGA and graphically identifies bottlenecks via color coding. We evaluated our techniques using Impulse C, but the techniques also apply to other HLS tools, such as Carte. A prototype tool was developed for Impulse C due to its support for multiple platforms. To measure bandwidth, the tool automatically adds hardware counters and software timers to the application code and then uses the measurements to generate a communication visualization of the application. Optimizing the bottlenecks shown by the communication visualization required only several minutes of developer effort and enabled a 2.18x speedup of a Triple DES application on an XD1000 platform by XtremeData. Bottlenecks were also detected on a Molecular Dynamics application on an XD1000 platform, resulting in a speedup of 1.25x, in addition to a Backprojection application on the Novo-G supercomputing platform (24 nodes each with two GiDEL PROCStar III boards for a total of 192 FPGAs). The measured overhead of our tool was less than 2 percent resource overhead and 3 percent frequency overhead.

The remainder of the chapter is presented as follows. Section 2.6 provides related research. Section 4.1 presents the challenges for the HLS performance-analysis tool for communication-bottleneck detection. Section 4.2 presents the framework developed for Impulse C. Section 4.3 provides results from the application case studies. Section 4.4 discusses conclusions from the presented work.

4.1 Communication Performance Analysis

The focus of this work is on techniques to automatically visualize the efficiency of inter-process communication for both microprocessors and FPGAs in applications using high-level synthesis. To create the visualization, the performance analysis tool instruments application code to measure bandwidth of all communication channels during execution. A developer can then analyze the visualization to identify bottlenecks and determine appropriate optimizations to increase performance.

The remainder of this section presents techniques for visualizing and analyzing bandwidth. Section 4.1.1 presents instrumentation and measurement challenges. Section 4.1.2 describes the visualization techniques. Section 4.1.3 discusses analysis techniques for identifying bottlenecks in the visualization.

4.1.1 Instrumentation and Measurement

Communication channels between processes in high-level synthesis tools normally fall into two categories: streaming and DMA transfers. Streaming transfers use buffers to temporarily store data between two communicating processes, where one process writes to the buffer and the other process reads from the buffer. Direct memory access (DMA) transfers move data to/from dedicated memory such as SRAM.

The performance analysis tool instruments communication channels by adding monitoring circuits connected to the synthesized communication calls in the source code of each process on the FPGA. During execution, the monitoring circuits store measured bandwidths locally in registers, which are extracted by the microprocessor after the application has finished executing to avoid communication overhead. The tool then combines the measured bandwidths for both the FPGA and CPU to form a visualization of communication channel efficiency.

Bandwidth (or transfer rate) can be measured using Equation 4-1, where B is bandwidth, D is data transferred, and T is transfer time. For all measurements in this chapter, time is the amount of time taken by a communication API call to perform a transfer, including idle time as a result of blocking. For example, a stream write call will block if the streaming buffer being written to is full. This measurement of time does not penalize applications that perform mostly computation with little communication since only the communication performed by the application is being timed. In software, time is measured by adding a wrapper around each communication call type (e.g., stream read) in the source code. The wrapper function call measures the time and records the transfer size passed as a parameter for each communication call.

$$B = \frac{D}{T} \tag{4-1}$$

For communication API calls performed in hardware, bandwidth measurement can become more complex. The invocation of a communication API call is determined by which state is active in a state machine (e.g., in Impulse C) or by handshaking signals (e.g., in Carte). Counters are used to monitor how long the signals associated with a particular communication API call are active. Since the counters measure time in cycles instead of seconds, the frequency of the FPGA must be determined in order to convert cycles to seconds as shown in Equation 4-2, where C is cycles and F is frequency.

$$T = \frac{C}{F} \tag{4-2}$$

Some communication calls have a static transfer size while others can change dynamically. We have observed that streaming transfers typically have a set width fixed during hardware generation that cannot be changed at runtime. DMA transfer sizes can be static or dynamic during program execution depending on how the application is designed. For a static transfer size, only the invocation of a communication API call must be monitored. The fixed size of each transfer can be parsed from the source code. The data transferred, D, can be determined by Equation 4-3, where S is the fixed size of each transfer and I is the number of invocations.

$$D = S I \tag{4-3}$$

Communication invocations are detected by monitoring signals or states corresponding to communication API calls. For each invocation, a counter is incremented for only the first cycle that the signal or state for the communication API call is active. Additional cycles may be needed to finish the data transfer of the communication call. For dynamic transfer sizes, a counter is used to sum the total bytes transferred each time the API call is invoked. Equation 4-4 is used to compute the bandwidth of a communication channel that uses API calls with static and dynamic transfer sizes. Equation 4-4 is derived by substituting Equations 4-2 and 4-3 into Equation 4-1 while adding together the static and dynamic versions of D and C. In Equation 4-4, n and m represent the number of API calls with static and dynamic transfer sizes, respectively, for a given communication channel.
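The measurement arithmetic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the function names, the tuple layout of the inputs, and the example numbers (a 100 MHz clock and the counter values) are all hypothetical. Only the relationships B = D/T, T = C/F, and D = SI, and the summing of static and dynamic contributions, come from the text.

```python
def transfer_time_seconds(cycles, frequency_hz):
    """Equation 4-2: T = C / F."""
    return cycles / frequency_hz


def static_bytes(size_bytes, invocations):
    """Equation 4-3: D = S * I for a call with a fixed transfer size."""
    return size_bytes * invocations


def channel_bandwidth(frequency_hz, static_calls, dynamic_calls):
    """Bandwidth of one channel in bytes per second (Equation 4-1).

    static_calls:  list of (S, I, C) tuples, one per static-size API call
    dynamic_calls: list of (D, C) tuples, one per dynamic-size API call
    Bytes and cycles are summed over all calls before dividing.
    """
    total_bytes = sum(static_bytes(s, i) for s, i, _ in static_calls)
    total_bytes += sum(d for d, _ in dynamic_calls)
    total_cycles = sum(c for _, _, c in static_calls)
    total_cycles += sum(c for _, c in dynamic_calls)
    if total_cycles == 0:
        return 0.0  # no communication was observed on this channel
    return total_bytes / transfer_time_seconds(total_cycles, frequency_hz)


# Hypothetical counter values: one static call (4-byte transfers, 1000
# invocations, 2000 active cycles) and one dynamic call (4000 bytes,
# 2000 active cycles) on an assumed 100 MHz clock.
bandwidth = channel_bandwidth(100e6, [(4, 1000, 2000)], [(4000, 2000)])
print(f"{bandwidth / 1e6:.1f} MB/s")
```

Because only cycles in which a communication call is active are counted, a process that spends most of its time computing does not lower the reported bandwidth, which matches the measurement policy stated earlier in this section.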


$$B = \frac{F\left(\sum_{i=1}^{n} S_i I_i + \sum_{j=1}^{m} D_j\right)}{\sum_{i=1}^{n} C_i + \sum_{j=1}^{m} C_j} \tag{4-4}$$

It is also important to supply the visualization with the maximum bandwidth for a particular communication type. The maximum bandwidths can be measured using our tool by running a benchmark application. A typical bandwidth benchmark would need to perform four types of transfers between processes for both streaming and DMA communication: CPU to CPU, CPU to FPGA, FPGA to CPU, and FPGA to FPGA. Alternatively, vendor-provided maximum bandwidths can be used. The maximum bandwidths can then be given to the visualization generator in comma-separated values (CSV) format to enable color coding.

4.1.2 Visualizations

Software parallel performance analysis tools generate their visualizations in a format that assumes a homogeneous network between nodes with the capability of all-to-all communication, as shown in Figure 4-1. HLS tools, in contrast, can generate a heterogeneous, partially connected network with multiple modes of communication between processes. Furthermore, bandwidths between processes can change depending on the endpoints (e.g., FPGA, CPU, or memory) and type of communication. Therefore, a visualization tailored to the application's communication architecture between processes is needed to represent the bandwidth between each node.

The visualization tool constructs a directed graph based on the source code of the application. The nodes of the graph correspond to application processes and data buffers, while the edges of the graph correspond to the direction in which data is transferred between the nodes. In each visualization, the visualization tool draws boxes corresponding to one or more CPUs, FPGAs, and external memories. The tool represents processes as ovals that are placed into their respective processor. The communication API calls inside each process are used to create the edges that connect processes. The tool also visualizes memories in the directed graph. Streaming buffers are shown as diamonds. Figure 4-2 shows an example of streaming communication (Figure 4-2A) with a corresponding visualization of a process writing to a stream buffer (Figure 4-2B). SRAM used for DMA is shown as a separate box from the CPU or FPGA. Each DMA buffer is displayed as a separate box inside of the corresponding SRAM box. Figure 4-3 shows a DMA-communication call inside a function (Figure 4-3A), and the corresponding visualization of a process writing to an SRAM buffer (Figure 4-3B).

Figure 4-2. Streaming-communication call visualization. A) Source code. B) Visualization.

Figure 4-3. Direct memory access (DMA) communication call visualization. A) Source code. B) Visualization.

The tool annotates bandwidths measured for each communication API call next to each edge connecting a process and buffer. To make it easier to find bottlenecks, the edges of the graph corresponding to data transfers are color coded depending on the ratio of measured bandwidth to the maximum bandwidth of the data transfer, where shades of red are used for below 50% bandwidth utilization and shades of green are used for above 50% bandwidth utilization. Percentages of maximum bandwidth are also provided next to the measured bandwidth for each edge of the graph. As stated before, applications that perform mostly computation with little communication will not be penalized in the visualization, since bandwidth is only measured while communication is being performed by the application.

4.1.3 Analysis

This section describes analysis techniques to detect communication bottlenecks. All bottlenecks (i.e., communication calls with low bandwidth) are shown in a shade of red in the visualization. By analyzing the data flow and changes in bandwidth in the visualization of the application, developers can determine potential remedies for each bottleneck.

The following analysis techniques can be applied manually or automated to suggest potential optimizations to increase bandwidth utilization and speed up an application. For streaming transfers, it is important to note the ratio of input and output bandwidths to a streaming buffer. One potential bottleneck may occur when a streaming buffer is full, in which case there will be a lower input bandwidth than output bandwidth, as shown in Figure 4-4A. Lower input bandwidth occurs because the writing process must block when the buffer is full. To optimize this bottleneck, the buffer size can be increased to increase bandwidth. Another bottleneck is caused by empty stream buffers, represented with the opposite ratio, which causes the reading process to block as shown in Figure 4-4B. To optimize this bottleneck, data rates upstream should be increased to prevent the buffer from becoming empty by changing streaming widths, pipelining, or switching to a different communication method. An additional bottleneck can be caused by low bandwidths on both sides of the streaming buffer, which can result from streaming bursts of data into streaming buffers that become full and empty at different times during the execution of the application. To reduce this bottleneck, a combination of both optimization methods can be used.

Figure 4-4. Streaming buffer analysis. A) Buffer full. B) Buffer empty.

SRAM buffers, unlike streaming buffers, must share ports between multiple processes that may try to access the buffer simultaneously. When multiple processes make simultaneous transfers, processes making DMA calls must block, resulting in a potential bottleneck. To reduce this bottleneck, synchronization can be added between processes to more effectively share bandwidth. Using small DMA transfer sizes can also cause bottlenecks, which can be optimized by sending larger chunks of data to increase bandwidth.

4.2 Experimental Framework

This section presents the framework designed to visualize communication bottlenecks in Impulse C. Section 4.2.1 gives details on the extra steps added to the Impulse C design process for instrumentation and source-code correlation. Section 4.2.2 describes which software package was used to generate the visualization.

4.2.1 Instrumentation

Since Impulse C is not open-source, instrumentation is added to the source code and hardware files. Perl scripts are used to parse and modify these files. A Java GUI front end is used to select files and instrumentation features.


The following steps are used to instrument the application and generate performance analysis data files as shown in Figure 4-5. In step 1, the developer selects the source-code files to be instrumented. In step 2, the tool adds an extra communication channel to the application that is used to transfer bandwidth measurement data after the application is finished executing. After the instrumented source code is used to generate hardware HDL files in step 3, the tool allows HDL files to be selected in step 4 to add the instrumentation described in Section 4.1.1. By default, all communication states in all processes are instrumented; however, the GUI for the hardware instrumenter allows selection of which processes or individual communication states should be instrumented to reduce resource usage of the FPGA. The Impulse C XHW file is used for source-code correlation. Each communication state is annotated with its communication type and direction (e.g., stream write) and corresponding source line of code. Once the hardware is instrumented, the tool adds a custom software process to the software source code to gather the bandwidth measurement data and write the measurements to a file. Currently, only static transfer sizes are instrumented by the tool as described in Section 4.1.1. Dynamic transfer size instrumentation was not required for any of the case studies in Section 4.3.

Figure 4-5. Tool flow for Impulse C visualization


4.2.2 Visualization

After the bit file is generated in step 5 and the application is executed on the target platform in step 6, the bandwidth measurement data file can be used to generate the bandwidth visualization in step 7. The tool uses the source-code files to reconstruct the communication architecture as described in Section 4.1.2. The tool uses Graphviz [63] to generate a multi-level visualization using SVG output and linking between multiple SVG files. The generated HDL files and XHW file are used by the tool to visualize the state machine of each hardware process along with source-code correlation for each state. Currently, the analysis must be performed manually but could be automated in the future. For example, input-output bandwidth ratios of streaming buffers could be used to determine if the buffers are becoming full or empty. Optimization suggestions could be automatically provided as described in Section 4.1.3.

4.3 Experimental Results

This section presents case studies used to evaluate the communication visualization for Impulse C applications. Section 4.3.1 gives details on how the visualization was used to speed up a DES application. Section 4.3.2 demonstrates multiple communication bottlenecks in a Molecular Dynamics application, in addition to corresponding optimizations to reduce the bottlenecks. Section 4.3.3 shows a communication visualization for multiple FPGAs on a Backprojection application.

The framework currently uses Impulse C and Quartus. The target platforms are the XtremeData XD1000 [8], containing a dual-processor motherboard with an Altera Stratix-II EP2S180 FPGA in one of the Opteron sockets, and the Novo-G supercomputer [64] at the University of Florida, containing 48 GiDEL PROCStar III [10] cards each with four Stratix-III EP3SE260 FPGAs. Impulse C 3.3 is used for the XD1000 while Impulse C 3.6 with an in-house platform support package is used for Novo-G. Quartus 9 was used for Triple DES and Backprojection. Molecular Dynamics would not fit using Quartus 9 and required Quartus 8.1 to fit due to the high resource utilization of the Molecular Dynamics cores.


4.3.1 Triple Data Encryption Standard (DES)

Triple DES [60] is a block cipher used for encryption. The application consists of a modified version of the Triple DES code provided by Impulse C, which sends text files to the FPGA to be encoded and then decoded again before being sent back and compared. We evaluated this application on the XD1000 platform. Figure 4-6 shows the resulting visualization of the DES application, which solely uses streaming communication. By analyzing the visualization, we identified that transfers between the CPU and FPGA were a bottleneck because of their low bandwidth relative to the FPGA internal bandwidth. The stream buffer blocks decrypted ic has a higher input bandwidth than output bandwidth, indicating that streaming communication is being blocked because the buffer is empty. Since the XD1000 platform supports DMA transfers and the application is not utilizing DMA for the FPGA, we used DMA transfers to increase the transfer rates between the CPU and FPGA as shown in Figure 4-7. Large black arrows have been added to Figures 4-6 and 4-7 to point out the bandwidth increase. The change from streaming to DMA communication was a simple modification and took about half an hour to complete. This modification enabled a 2.18x speedup and greatly increased bandwidth usage. The overhead of the instrumentation and additional Impulse C communication channels used for measurement and data extraction was quite modest, at less than 2% resource overhead and 4% frequency overhead, as shown in Table 4-1.

Figure 4-6. Communication visualization of streaming DES

Figure 4-7. Communication visualization of DES using DMA transfer

4.3.2 Molecular Dynamics

Molecular Dynamics (MD) simulates interactions between atoms and molecules over discrete time intervals. Alam et al. [62] provide a more in-depth overview of MD simulations. For our experiments, the simulation keeps track of 16,384 molecules, each of which uses 36 bytes (4 bytes each for its position, velocity, and acceleration in each of the X, Y, and Z directions). The kernel of the MD application computes distances between atoms and molecules. Serial C MD code was obtained from Oak Ridge National Lab (ORNL) and optimized to run on the FPGA using Impulse C. We evaluated this application on the XD1000 platform.

This case study provides a comparison between the modified PPW tool presented in Section 3.3 and the bandwidth visualization tool presented in this chapter. By analyzing Figure 4-8, we identified a bottleneck resulting from all of the streaming buffers with the letter p becoming full, since the input bandwidth is lower than output bandwidth. The streaming buffers with the letter a have low bandwidth both on the inputs and outputs. The low bandwidths measured are caused by the streaming buffer becoming full and the bottom process blocking during stream reads. The bottom process requires data from all a streams to be ready simultaneously for its stream read state. If one a buffer becomes empty, then the other can become full while the bottom process waits for data. To reduce the bottleneck, we increased the stream buffer size. The stream buffer size can be increased by changing a constant in the application. Finding the maximum buffer size that would fit on the FPGA only required several minutes of developer effort (not including the time for Quartus to implement the designs). Maximizing the stream buffer size achieved a speedup of 1.25x compared to the original FPGA implementation. The speedup compared to the serial baseline running on the 2.2 GHz Opteron processor changed from 6.2x to 7.8x after optimizing the detected bottlenecks.

4.3.3 Backprojection

Backprojection is a DSP algorithm for tomographic reconstruction of data via image transformation. We evaluated this application on all four FPGAs of the PROCStar III board in the Novo-G supercomputer. Since the in-house Impulse C platform support package for Novo-G currently requires the use of GiDEL software API calls, some manual instrumentation and file generation was required. Only one FPGA was instrumented, although its data is representative of all other FPGAs. Figure 4-9 shows the resulting visualization. One obvious bottleneck is that PCI data transfers are 8-25% of peak speeds.


Table 4-1. Triple Data Encryption Standard (DES) instrumentation overhead

EP2S180                                  Original           Inst.              Overhead
Logic used (out of 143520)               21192 (14.77%)     24032 (16.75%)     +2840 (+1.98%)
Comb. ALUT (out of 143520)               12972 (9.04%)      14935 (10.41%)     +1963 (+1.37%)
Registers (out of 143520)                13896 (9.68%)      16242 (11.32%)     +2346 (+1.64%)
Block RAM (9383040 bits)                 143520 (1.59%)     158400 (1.69%)     +9216 (+0.10%)
Block interconnect (out of 536440)       39067 (7.28%)      42969 (7.87%)      +3162 (+0.59%)
Frequency (MHz)                          77.02              74.47              -2.55 (-3.31%)

Figure 4-8. Molecular-dynamics bandwidth visualization (half of FPGA cropped to enlarge image)

The streaming buffers sino in show signs of becoming full since their input bandwidth is lower than their output bandwidth. This bottleneck could potentially be reduced by increasing the streaming buffer size. Unfortunately, we were not able to evaluate this optimization due to limitations of the current platform support package for Novo-G, which only supports fixed sizes of streaming buffers. Alternatively, bandwidths could be increased by using DMA transfers instead of streaming transfers, but DMA transfers are not yet supported by the platform support package. Although the potential speedup is not known, the visualization allows a developer to easily identify the communication bottlenecks, which could be reduced in just several minutes for a system with a complete platform support package.

Figure 4-9. Communication visualization of Backprojection

4.4 Conclusions

In this chapter, we introduced a communication visualization tool for high-level synthesis that allows a developer to quickly locate communication bottlenecks. The application's communication API calls are visualized as a directed graph of communication links between the CPU, FPGA, and communication buffers. Bandwidths are color coded to allow the developer to quickly locate inefficient data transfers. By analyzing the graph for bandwidth distribution and ratios, optimizations can be made by increasing streaming buffer sizes or switching to different communication types. Case studies were provided to show how the bandwidth visualization can be used to detect and eliminate bottlenecks related to communication with little effort, which resulted in FPGA application speedups ranging from 1.25x to 2.18x. In addition, the tool provides source-code correlation, which hides the HLS-generated HDL from the application developer. Future work includes automating analysis and providing suggestions for optimization. The visualization can also be expanded by including performance analysis of computation for each process.


CHAPTER5 ASSERTION-BASEDVERIFICATIONFORHIGH-LEVELSYNTHESIS Field-programmablegatearrays(FPGAs)showsignicantpowera ndperformance advantagesascomparedtomicroprocessors[ 1 ],buthavenotgainedwidespreadacceptance largelyduetoprohibitiveapplicationdesigncomplexity.Hi gh-levelsynthesis(HLS) signicantlyreducesapplicationdesigncomplexitybyenabl ingapplicationswrittenina high-levellanguage(HLL)suchasCtobeexecutedonFPGAs.Howeve r,limitedHLS supportforverication,debugging,andtiminganalysishasc ontributedtolimitedusageof suchtools. Forverication,designersusingHLScanuseassertion-basedveri cation(ABV), awidelyusedtechniqueinelectronicdesignautomation(EDA)t ools[ 65 ],toverify runtimebehaviorbyexecutinganapplicationthatcontains assertionsagainstatestbench. However,assertion-basedvericationofprogramswritteninCu singHLStools,such asImpulseC[ 55 ]andCarte[ 56 ],isoftenlimitedtosoftwaresimulationoftheFPGA's portionofthecode,whichcanbeproblematicduetocommonin consistenciesbetween simulatedbehaviorandactualcircuitbehavior.Suchinconsi stenciesmostcommonly resultfromtimingdierencesbetweenthesoftwarethread-ba sedsimulationofthecircuit andtheactualFPGAexecution[ 66 ].Insomecases,theseinconsistenciesmaycausean applicationthatbehavesnormallyinsoftwaresimulationton evercomplete(i.e.,hang) whenexecutingontheFPGA.DebugginganHLS-generatedcircui ttoidentifythecause ofsuchhangsisasignicantchallengethatcurrentlyrequire sexcessivedesignereort. Timinganalysis,aprocedurewhichdeterminesifperformance constraintsaremet,is anadditionallimitationofmanyHLStools.Althoughtimingana lysisiswidelyusedin physicaldesigntools,inmanycasesHLStoolsdonotconsidertimin gconstraints.Even worse,designersareunawareoftheperformanceofdierentre gionsofanHLS-generated circuit,whichmakesoptimizationmoredicult.Althoughti mingmeasurementscanbe 67

taken during high-level simulation, such measurements are based on software simulation and do not reflect actual circuit performance [51].

One potential solution to these verification, debugging, and timing-analysis problems is for designers using HLS to use post-synthesis register-transfer-level (RTL) simulation. However, such an approach requires a designer to manually add assertions to HLS-generated hardware-description-language (HDL) code, which is a cumbersome process (as compared to adding assertions at the source level), and there are numerous situations where such simulations may be infeasible or undesirable. For example, a designer may use HLS to create a custom core that is part of a larger multiprocessor system that may be too complex to model with cycle accuracy. Even if such modeling were realized, slow simulation speeds can make such verification prohibitive to many designers.

Ideally, designers could overcome these limitations by specifying assertions in high-level code, which the HLS tool could integrate into generated circuits to verify behavior and timing, while also assisting with debugging. To achieve this goal, we present HLS techniques to efficiently support in-circuit assertions. These techniques enable a designer to use assertions at the source level while checking the behavior and timing of the application. Furthermore, we leverage such assertions to enable a debugging technique referred to as hang detection that reports the specific high-level regions of code where a hang occurs. To realize these in-circuit-assertion-based techniques, this chapter addresses several key challenges: scalability, transparency, and portability. Scalability (large numbers of assertions) and transparency (low overhead) are interrelated challenges that are necessary to enable thorough in-circuit assertions while minimizing effects on program behavior. We address these challenges by introducing optimizations to minimize performance and area overhead, which could potentially be integrated into any HLS tool. Portability of in-circuit assertion synthesis, for verification or timing analysis, is critical because HLS tools can target numerous platforms and must therefore avoid platform-specific implementations. The presented techniques achieve portability by

communicating all assertion failures over the HLS-provided communication channels.

Using a semi-automated framework that implements the presented HLS techniques, we show that in-circuit assertions can be used to rapidly identify bugs and violations of timing constraints that do not occur during software simulation, while only introducing a small overhead (e.g., reduction in frequency on the order of less than 3% and increase in FPGA resource utilization of 0.7% or less have been observed with several application case studies on an Altera Stratix-II EP2S180 and Stratix-III EP3SE260). Various case studies with optimized assertions have shown a 3x reduction in resource usage and improved assertion performance by as much as 100% compared to unoptimized assertion synthesis. Such work has the potential to improve designer productivity and to enable the use of FPGAs by non-experts who may otherwise lack the skills required to verify and optimize HLS-generated circuits.

This chapter is presented as follows. Section 2.6 discusses related work. Assertion-synthesis techniques and optimizations are explained in Section 5.1. Section 5.2 discusses timing analysis. Hang detection is described in Section 5.3. Section 5.4 describes the experimental setup and framework used to evaluate the presented techniques. Section 5.5 presents experimental results. Section 5.6 provides conclusions.

5.1 Assertion Synthesis and Optimizations

American National Standards Institute C (ANSI-C) assertions, when combined with a testbench, can be used as a verification methodology to define and test the behavior of an application. Each individual assertion is used to check a specific run-time Boolean expression that should evaluate to true for a properly functioning application. If the expression evaluates to false, the assertion prints failure information to the standard error stream including the file name, line number, function name, and expression that failed; after this information is displayed, the program aborts.

The presented HLS optimizations for in-circuit assertions assume a system architecture consisting of at least one microprocessor and FPGA, and an application modeled as a task

graph. These assumptions are common to existing HLS approaches [55]. Therefore, the discussed techniques are potentially widely applicable with minor changes for different languages or tools.

In-circuit assertions are integrated into the application by generating a single assertion checker for each assertion and an assertion notification function, as shown in the top right-hand side of Figure 5-1. The assertion checker implements the corresponding Boolean assertion condition by fetching all data, computing all intermediate values, and signaling the assertion notification function upon failure. The assertion notification function is responsible for printing information regarding all assertion failures and halting the application.

Figure 5-1. Assertion framework

The assertion notification function can run simultaneously with the application as a task waiting for failure messages from the assertion checkers. The task is defined essentially as a large switch statement per communication channel that implements one case for each hardware-mapped assertion. Although a hardware/software partitioning algorithm could potentially map the assertion notification function task to either hardware or software, typically the assertion notification function will be implemented in software due to the need to communicate with standard error. Although the added HLS communication channels in the task graph could greatly increase the I/O requirements for

hardware/software communication, such a situation is avoided by time-multiplexing all communication over a single physical I/O channel (e.g., PCIe bus, single pin). Performance overhead due to this time-multiplexing should be minimal or even nonexistent (depending on the HLS tool) since ANSI-C assertions only send messages upon failure and halt the program after the first failed assertion.

One potential method to synthesize assertion checkers into circuits is described as follows. Semantically, an assert is similar to an if statement. Thus, assertions could be synthesized by converting each assertion into an if statement, where the condition for the if statement is the complemented assertion condition and the body of the if statement transfers all failure information to the assertion notification function. Although such a straightforward conversion of assert statements may be appropriate for some applications, in general this conversion will result in significant area and performance overhead. To deal with this overhead, we present three categories of optimizations that improve the scalability and transparency of in-circuit assertions, which are described in the following sections.

5.1.1 Assertion Parallelization

To maximize transparency of in-circuit assertions, the circuit for the assertion checker should have a minimal effect on the performance of the original application. However, by synthesizing assertions via direct conversion to if statements, the synthesis tool modifies the application's control-flow graph and resulting state machine, which adds an arbitrarily long delay depending on the complexity of the assertion statement. For Impulse C, the delay of the assertion assert((j <= 0 || a[0] == i) && (b[0] == 2 || i > 0)) can be shown by comparing the corresponding subset of the application's state machine before (Figure 5-2A) and after (Figure 5-2B) the assertion is added. For this example, the assertion can add up to seven cycles of delay to the original application for each execution of the assertion. While seven cycles may be acceptable for some applications, if this assertion occurred in a performance-critical loop, the assertion could potentially reduce the loop's

rate (i.e., the reciprocal of throughput) to 12.5% of its original single-cycle performance (one iteration every eight cycles instead of one per cycle), which could significantly affect how application components interact with each other.

HLS tools can minimize the effect of assertions on the application's control-flow graph by executing the assertions in parallel with the original application. To perform this optimization, HLS can convert each assertion statement into a separate task (e.g., a process in Impulse C) that enables the original application task to continue execution while the assertion is evaluated. Instead of waiting for the assertion, the application simply transfers data needed by the assertion task, and then proceeds.

For the previous assertion example, the optimization reduced the overhead from seven cycles to a single cycle as shown in Figure 5-2C. The optimization was unable to completely eliminate overhead due to resource contention for shared block RAMs. Such overhead is incurred when the assertion task and the application task simultaneously require access to a shared resource.

5.1.2 Resource Replication

As mentioned in the previous section, resource contention between assertions and the application can lead to performance overhead. Contention between assertions can happen even when assertions are executed in parallel. To minimize this overhead, HLS can perform resource replication by duplicating shared resources.

For example, arrays in C can be synthesized into block RAMs. A common source of overhead is the limited number of ports on block RAMs that are simultaneously used by both the application tasks and assertion tasks. When accessing different locations of the block RAM, the circuit must time-multiplex the data to appropriate tasks, which causes performance overhead. HLS can effectively increase the number of ports by replicating the shared block RAMs, such that all replicated instances are updated simultaneously by a single task. This optimization ensures that all replicated instances contain the same data, while enabling an arbitrary number of tasks to access data from the shared resource without delay.
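The replication idea can be sketched at the source level as follows. This is a minimal illustration in plain C, not the tool's generated code: the array names, the size N, and the helper functions are all hypothetical. It assumes the pattern of keeping a second array that receives the same writes as the original while only the assertion reads from it, so that in hardware each array could map to its own block RAM.

```c
#include <assert.h>
#include <stddef.h>

#define N 8  /* illustrative array size (assumption) */

/* Original array: in hardware this would map to one block RAM whose
   ports are used by the application task. */
static int data[N];

/* Replica: receives every write made to `data`, but is read only by the
   assertion, so assertion reads do not compete for the original's ports. */
static int data_replica[N];

/* All writes go through one helper so both copies stay identical. */
static void write_both(size_t idx, int value) {
    data[idx] = value;
    data_replica[idx] = value;
}

/* Assertion condition evaluated against the replica only. */
static int replica_in_range(size_t idx, int lo, int hi) {
    return data_replica[idx] >= lo && data_replica[idx] <= hi;
}

/* Application step: writes a value, checks it via the replica,
   and continues to read from the original array. */
static int app_step(size_t idx, int value) {
    write_both(idx, value);
    assert(replica_in_range(idx, 0, 255));
    return data[idx];
}
```

Routing every write through a single helper is one simple way to keep the copies consistent; the cost is the extra storage for the replica, which is the area side of the tradeoff.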

Figure 5-2. Application's state machine: A) without assertion, B) with unoptimized, serial assertion, C) with parallel assertion

Resource replication provides the ability to reduce performance overhead at the cost of increased area overhead. Such tradeoffs are common to HLS optimizations and are typically enabled by user-specified optimization strategies (i.e., optimize for performance as opposed to area). One potential limitation of resource replication is that for a large

number of replicated resources, the increased area overhead could eventually reduce the clock speed, which may outweigh the reduced cycle delays. However, for the case study in Section 5.5.2.3, resource replication improved performance by 33%, allowing the application's pipeline rate to remain the same.

5.1.3 Resource Sharing

Whereas the previous two optimizations dealt with performance overhead, in-circuit assertions can also have a large area overhead. Although an assertion checker circuit will generally cause some overhead due to the need to evaluate the assertion condition, HLS can minimize the overhead by sharing resources between assertions. For example, if a particular task has ten assertions with a multiplication in the condition, resource sharing could potentially share a single multiplier among all the assertions.

Although resource sharing is a common HLS optimization [67] for individual tasks, sharing resources across assertions adds several challenges due to the requirement that all statements sharing resources must be guaranteed to not require the resources at the same time. For task-graph-based applications, assertions may occur in different tasks at different times. This uncertainty prevents an HLS tool from statically detecting mutually exclusive execution of all assertions.

Due to this limitation, HLS can potentially apply existing resource-sharing techniques to assertions within non-pipelined regions of individual tasks, because those assertions are guaranteed to not start at the same time. However, due to the assertion parallelization optimization, different starting times for two assertions do not guarantee that their execution does not overlap. For example, an assertion with a complex condition may not complete execution before a later assertion requires a shared resource. To deal with this situation, HLS can implement all assertions that share resources as a pipeline that can start a new assertion every cycle. Although this pipeline will add latency to all assertions in the same task that require access to the shared resources, such latency does not affect

the application, and only delays the notification of program failure. This technique of pipelined assertion checking is evaluated in Section 5.5.2.1.

Resource sharing could potentially be extended to support an arbitrary number of simultaneous assertions in multiple tasks by synthesizing a pipelined assertion checker circuit that implements a group of simultaneous assertions. To prevent simultaneous access to shared resources, the circuit could buffer data from different assertions using FIFOs (e.g., one buffer per assertion) and then process the data from the FIFOs in a round-robin manner. This extension requires additional consideration of appropriate buffer sizes to avoid having to stall the application tasks, and an appropriate partitioning of assertions into assertion checker circuits, which we leave as future work.

In some cases, resource sharing may improve performance in addition to reducing area overhead. This benefit is made possible by enabling placement and routing to achieve a faster clock due to fewer resources. However, resource sharing will at some point experience diminishing returns, and may eventually reduce clock frequency due to a large increase in multiplexers and other steering logic.

5.2 In-Circuit Timing-Analysis Assertions

For applications with real-time requirements, particularly in embedded systems, verification must guarantee that all timing constraints are met (a process referred to as timing analysis) in addition to checking the correctness of application behavior. If an HLS-generated application does not meet timing constraints during execution, then it would be helpful to know the location of the section of code that is violating constraints in order to focus optimization effort. However, determining the performance of an HLS-generated application can be difficult. HLS tools, such as Impulse C and Carte, provide some compile-time feedback about the rate and latency of a pipelined loop, but it is largely unknown how many cycles a particular line of code will require. While it is possible to determine the number of cycles a line (or lines) of code will take by examining the HDL generated by the tool, delay can be data-dependent, as shown in the possible

traversals of the state machine generated by the evaluation of the conditional statement if ((j <= 0 || a[0] == i) && (b[0] == 2 || i > 0)) in Figure 5-2B. However, such a process requires significant designer effort and requires the designer to have knowledge of the HLS-generated code. While a delay range for the computation in each line of code could be provided by the HLS tool via static analysis, the delay of communication calls cannot be determined by static analysis. Software simulation cannot provide accurate timing due to timing differences between thread execution on the microprocessor and execution on the FPGA. In this section, we describe the additional concepts and methods needed to extend in-circuit assertions to perform timing analysis for applications built with HLS tools.

Figure 5-3 illustrates usage of timing-analysis assertions for an audio filtering application designed with an HLS tool. In this example, the application designer has determined that the filter takes too long to execute on the FPGA by measuring the time to run the application on the FPGA. However, the application designer is unsure of which part of the application in the FPGA is not meeting timing constraints. Using timing-analysis assertions, the application designer can check the timing of different application regions in the FPGA, as shown in the figure in addition to the case study in Section 5.5.5. Data-dependent delays can be checked to see if they are within bounds for each loop iteration. Although not shown in the figure, the same method can be used to check streaming communication calls for delays caused by buffers becoming full or empty.

In order to enable ANSI-C assertions to check the timing of an application, time must be accessible via a variable. In C, time is typically determined via a function call. In Figure 5-3, the ANSI-C function, clock, is used to return the current time in cycles. To measure the time of a section of code, the clock function should be called before and after that section of code, with the difference between the two times providing the execution time (in cycles). To perform timing analysis, an assertion can be used to check a comparison between the expected time and the measured time. For example, in Figure 5-3, the code in the loop for each filter is expected to take less than 100 cycles.
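The measurement pattern described above can be sketched in standard C as shown here. This is a hedged illustration rather than the code of Figure 5-3: the filter_stage body, the MAX_CYCLES name, and the helper within_budget are hypothetical. Note also that on a CPU the standard clock function counts CLOCKS_PER_SEC ticks of processor time, whereas the in-circuit version described in this chapter returns FPGA cycles.

```c
#include <assert.h>
#include <time.h>

#define MAX_CYCLES 100  /* budget taken from the example in the text */

/* Returns nonzero when the region between the two clock() samples
   took fewer than `budget` ticks. */
static int within_budget(clock_t start, clock_t end, clock_t budget) {
    return (end - start) < budget;
}

/* Hypothetical stand-in for one filter stage; the body is illustrative. */
static int filter_stage(int sample) {
    return (sample * 3) / 4;
}

/* Pattern from the text: sample clock() before and after the region,
   then assert on the difference between the two samples. */
static int timed_filter_stage(int sample) {
    clock_t start = clock();
    int out = filter_stage(sample);
    clock_t end = clock();
    assert(within_budget(start, end, MAX_CYCLES));
    return out;
}
```

Keeping the comparison in a small helper makes the budget check easy to reuse for several regions, each with its own pair of clock samples.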

Figure 5-3. Using timing-analysis assertions with a filter application

For timing-analysis assertions, time can potentially be represented in many different formats. However, returning time in terms of cycles will require the least amount of overhead. The ANSI-C library provides the clock timing function that returns the number of clock ticks that have elapsed since the program started. However, for C programmers who may want to express time in terms of seconds rather than cycles, the ANSI-C constant expression CLOCKS_PER_SEC can be used to convert clock ticks to time in seconds. The clock frequency of the FPGA could be determined by comparison with timestamps sent from the CPU. However, an assertion may need to be checked on the first cycle after an FPGA restart. Since determining the frequency of the FPGA automatically

could take too long, a preprocessor constant FPGA_FREQ is used to define the FPGA frequency in Hz.

The type defined for representing clock ticks in ANSI-C is clock_t, which typically corresponds to a long integer. For added flexibility when used in hardware, time can be returned and stored as a 32-bit or 64-bit value. A 64-bit value is used by default. To select a 32-bit value, the preprocessor constant CLOCK_T_32 must be defined in the code. A 32-bit value can be used to reduce overhead but will overflow after about 43 seconds for a clock speed of 100 MHz (2^32 cycles at 10^8 cycles per second). During software simulation, the assertions using timing information are ignored, which allows simulation to check correctness of the application while ignoring the timing of the microprocessor.

To enable synthesis of timing assertions, a counter, which is set to zero upon reset, is added in each hardware process that contains a clock statement. The value returned by the clock statement is generated by latching the counter signal for each transition of the state machine. Use of a latched counter signal ensures that the timer value is consistently taken at the beginning of each state transition for states that execute more than one cycle. One potential problem with this approach is that HLS tools often reorder statements to maximize parallelism. Therefore, clock statements could potentially be reordered, leading to incorrect timing results. However, such a problem is easily addressed by making a synthesis tool aware of clock statements. In this chapter, we alternatively evaluated the techniques using instrumentation due to the inability to modify commercial HLS tools. Although instrumentation could experience reordering problems, for the evaluated examples, reordering of clock statements did not occur.

5.3 Hang-Detection Assertions

A common problem with FPGA applications is a failure to finish execution, which is often referred to as hanging. Common causes of hanging include infinite loops, synchronization deadlock, blocking communication calls that wait indefinitely to send or receive data, etc. Determining the cause of a hanging application, referred to as hang

detection, is difficult for HLS-generated FPGA designs. While a debugger could be used to trace down the problem during software simulation, the inaccuracies of software simulation can miss hangs that occur during FPGA execution. To deal with this problem, we extend in-circuit assertions to enable hang detection for HLS-generated applications.

One challenge of hang detection using assertions is that it is assumed that the assertion will eventually be checked. If the application waits indefinitely for a line of code to finish (e.g., an infinitely blocking communication call) then a different detection method is needed, since the assertion after the hung line will never be executed, as shown in Figure 5-4A. Without some mechanism to alert the developer to the current state of the program, it will be difficult to pinpoint the problem. For example, in the filter application (see Figure 5-5), the source of the problem that is causing the application to hang could be in any of the software or hardware processes.

Figure 5-4. Manually using American National Standards Institute C (ANSI-C) assertions for hang detection: A) timing assertion, B) assertions set to fail

One potential solution is to use assertions in a counterintuitive way by adding assertions periodically throughout the code that are designed to fail (i.e., assert(0)). By also defining the NABORT flag, failed assertions will not cause the application to abort, which allows the developer to manually create an application heartbeat (i.e., a signal sent as a notification that the process is alive) that traces the execution of the application on the FPGA, as shown in Figure 5-4B. In the filter application example, multiple assertions would need to be placed in strategic locations in each FPGA process to determine the events that take place before the application hangs. The resolution (in terms of lines of

code) would be determined by how many assertions are used. Unfortunately, if a large number of assertions are used, then large amounts of communication and FPGA resources could be used by the assertions. Although this approach works, it requires significant designer effort and has large overhead.

Figure 5-5. Using hang-detection assertions with a filter application

To reduce effort and overhead, we present a more automated method of hang detection that does not require user instrumentation and instead uses watchdog timers to monitor the time between changes of the signals that represent the state of the hardware process. The monitoring circuit has software-accessible registers that contain the current state of all hardware processes and the state of any hardware process that it has detected as hung. Hang detection is triggered using a watchdog timer for a hardware process that

signals when a state takes longer than a user-defined number of cycles; the assertion pragma, #pragma assert FPGA_watch_dog, sets this timeout period, which is reset anytime a state transition occurs. The watchdog timer is sized to be just large enough to hold the cycle count given in the pragma to reduce FPGA resource and frequency overhead. In software, a separate thread is spawned to monitor the hardware hang detector to check for hung states (i.e., expired watchdog timers). If a hardware process has hung, then the state in the registers is matched to the corresponding line of code via a lookup table generated by parsing an intermediate translation file (both Impulse C and Carte create these files). The states of all other hardware processes are given for reference.

In software, many HLS applications will wait indefinitely at some point in their execution for the FPGA to respond with some form of communication or synchronization. For those applications, hangs caused in the FPGA hardware will also cause the software to hang on the communication or synchronization API call for the FPGA. Although traditional debugging tools can be used to detect these hangs in software, software hang detection is provided to monitor the HLS API calls for convenience. A thread is spawned for all API calls of the HLS tool. The thread will check if the API call finishes within a time period set by the assertion pragma, #pragma assert API_watch_dog. If the API call takes longer than the timeout period, then the current line of code for the API call and all hardware processes will be printed to standard output and the program will abort.

This automated approach simplifies the addition of hang detection to an application, as shown for the filter application in Figure 5-5 and the case study in Section 5.5.6, compared to manually adding assert(0) statements. Two assertion pragmas are added to the application before instrumentation to set the watchdog timeout periods in hardware and software. Although hangs can be caused by the interaction between two or more (hardware or software) processes, providing the state of the hung process along with the current state of all other hardware processes can greatly narrow down the source of the problem.
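The hardware watchdog behavior described above can be modeled in software as a small per-process counter. The following is a minimal behavioral sketch, assuming a counter that resets on every state transition and flags a hang once a state has persisted for more than the configured number of cycles. The type watchdog_t, its fields, and both functions are hypothetical names used only for illustration; the actual circuit is generated in HDL.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a per-process watchdog; field names are illustrative. */
typedef struct {
    uint32_t timeout_cycles; /* limit that the pragma would configure */
    uint32_t counter;        /* cycles spent in the current state */
    int last_state;          /* state observed on the previous cycle */
    int hung;                /* set once the timer expires */
    int hung_state;          /* state in which the hang was detected */
} watchdog_t;

static void watchdog_init(watchdog_t *w, uint32_t timeout_cycles,
                          int initial_state) {
    w->timeout_cycles = timeout_cycles;
    w->counter = 0;
    w->last_state = initial_state;
    w->hung = 0;
    w->hung_state = -1;
}

/* Call once per clock cycle with the process's current state.
   Returns nonzero once a hang has been detected. */
static int watchdog_tick(watchdog_t *w, int state) {
    if (w->hung) {
        return 1;
    }
    if (state != w->last_state) {
        /* A state transition resets the timer. */
        w->counter = 0;
        w->last_state = state;
    } else if (++w->counter > w->timeout_cycles) {
        /* The state has persisted past the limit: record it as hung. */
        w->hung = 1;
        w->hung_state = state;
    }
    return w->hung;
}
```

The recorded hung_state plays the role of the software-accessible register: a monitoring thread could read it and map it to a source line through a lookup table.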

Several improvements can be added to further enhance hang detection of HLS applications. The feedback given to the application developer can be increased by reporting more than the last state of each process in the FPGA. For example, a trace buffer of a user-defined size could be added that would capture the sequence of states that occurred before the hardware process hung. Also, infinite loops in a hardware process will only trigger software API hang detection. Since infinite loops will not stay in a single state to trigger the hang-detection method mentioned above, detection of infinite loops in hardware could also be incorporated by adding a second counter for each process that is dedicated to counting the number of cycles spent in states that are known to be inside one or more loops. The overhead of hang detection could be reduced by allowing the user to select which processes to monitor. The hang-detection counters could be removed for some or all processes while still allowing the current state of the process to be periodically retrieved or retrieved by software API hang detection. This approach would give the user the option to customize hang detection to fit designs that nearly fill the FPGA.

5.4 Assertion Framework

To evaluate the assertion-synthesis techniques, we created a prototype tool framework for Impulse C that implements the techniques via instrumentation of HLL and HDL code. It should be noted that we use instrumentation because we are unable to modify the proprietary Impulse C tool. All of the techniques are fully automatable and ideally would be directly integrated into an HLS tool.

5.4.1 Unoptimized Assertion Framework

To implement basic in-circuit assertion functionality, the framework uses HLL instrumentation to convert assert statements into HLS-compliant code in three main stages. First, the C code for the FPGA is parsed to find functions containing assertion statements, converting any assertion statements to an equivalent if statement. A false evaluation produces a message that will be retrieved from the FPGA by the CPU, uniquely identifying the assertion. Next, communication channels are generated to transfer

these messages from the FPGA to the CPU. Finally, the assertion notification function is defined as a software function executing on the CPU to receive, decode, and display failed assertions using the ANSI-C output format. An example of this automated code instrumentation is shown in Figure 5-6.

Figure 5-6. High-Level Language (HLL) assertion instrumentation

To notify the user of an assertion failure, the framework uses an error code that uniquely identifies the failed assertion based on the line number and file name of the assertion. Once the assertion notification function decodes the assertion identifier, the user is notified by printing to the standard error stream by the CPU for the current framework. The framework could be extended to work without a CPU by having the assertion identifier stored to memory, displayed on an LCD, or even flashed as a sequence on an LED by the FPGA. Alternatively, an FPGA could potentially use a softcore processor.

Note that other changes are needed to route the stream to the CPU, such as API calls to create and maintain the stream. The stream must also be added as a parameter to the function. The output of the framework is valid Impulse C code, allowing further modifications to the source code with no other changes to the Impulse C tool flow. Once verification of the application is finished, the constant NDEBUG can be used to disable all

assertions and reduce the FPGA resource overhead for the final application. An additional nonstandard constant NABORT can be used to allow the application to continue instead of aborting due to an assertion failure.

5.4.2 Assertion Framework Optimizations

In order to evaluate the optimizations presented in Section 5.1, a hybrid mix of manual HLL and HDL instrumentation was used. To enable assertion parallelization (Section 5.1.1), the framework modifies the HLL code to move assertions into a separate Impulse C process. The framework introduces temporary variables to extract data needed by the assertion. HDL instrumentation then connects the temporary variables and trigger conditions between processes. The results of this optimization can be found in Section 5.5.2.

Resource replication was performed using manual HLL instrumentation. Resource replication is described in Section 5.1.2. An extra array was added to the source code that performed the same writes as the original array, but reads were only performed by the assertion, as shown in Section 5.5.3.

The following manual hybrid instrumentation was used to evaluate resource sharing as described in Section 5.1.3. Although resource sharing could potentially be applied to any shared resource, we evaluate the optimization for shared communication channels, which are common to all Impulse C applications. HLL instrumentation creates a streaming communication channel per Impulse C process and sends the identifier of the assertion upon assertion failure. Creating a streaming communication channel per Impulse C process can become expensive in terms of resources if a large number of Impulse C processes contain assertions. To reduce the number of streams created for each process, a single bit of the stream is used per assertion to indicate if an assertion has failed. This technique allows Impulse C processes to more efficiently utilize the streaming communication channels. When streaming communication resources are shared, a separate process is created via HLL instrumentation that can handle failure signals from up to

32 assertions per process if a 32-bit communication channel is used. For example, if all 32 assertions fail simultaneously then all 32 bits of the communication channel will simultaneously be asserted. The failure signals are connected to assertions using HDL instrumentation for efficiency. The overhead reduction associated with using this technique is explored in the case study that is presented in Section 5.5.4.

5.4.3 Timing-Analysis and Hang-Detection Extensions

Semiautomatic hybrid instrumentation was used to support timing functions presented in Section 5.2. Impulse C does not support ANSI-C library calls, so the clock function calls must be removed. A placeholder variable is declared and used in place of the clock statement in the source code. After hardware generation, a Perl script is used to instrument the HDL. A counter is added in each hardware process that contains a clock statement, which is set to zero upon reset. A second signal is added to the process that latches the counter signal upon transition of the state machine. The placeholder variable, synthesized into a signal with a similar name in HDL, is replaced with the latched counter signal.

Semiautomatic hybrid instrumentation was used for hang detection in Section 5.3. For software hang detection, a wrapper was added around each of the Impulse C library API calls, which added the threaded hang detection. The modified software API calls required extra parameters for access to the hardware hang-detection registers. Automatic parsing of the xhw file generated by Impulse C allows states to be converted to line numbers. For hardware hang detection, a hardware process supporting register transfer to software is automatically added to the source code. After Impulse C generates the HDL, the state machine signals of all other hardware processes are automatically routed into the hang-detection process. The hang-detection circuit is then manually added by overwriting part of the register transfer process.

Although many of the steps for adding timing-analysis and hang-detection instrumentation were manual, all of the steps could be automated via Perl scripts. Ideally, modification to

the Impulse C tool would be made instead of instrumenting source and intermediate code. However, because Impulse C is proprietary, such modification was not possible for this work.

5.4.4 High-Level Synthesis Tool and Platform

The framework currently uses Impulse C. Impulse C is a high-level synthesis tool to convert a program written in a subset of ANSI-C to hardware in an FPGA. Impulse C is primarily designed for streaming applications based upon the communicating sequential process model but also supports shared memory communication [66]. Speedups can be achieved in Impulse C applications by running multiple sequential processes in parallel, pipelining loops and adding custom HDL-coded function calls.

Quartus 9 was used for synthesis and implementation of the Impulse C-generated circuits. The target platforms are the XtremeData XD1000 [8], containing a dual-processor motherboard with an Altera Stratix-II EP2S180 FPGA in one of the Opteron sockets, and the Novo-G supercomputer [64] at the University of Florida, containing 48 GiDEL PROCStar III [10] cards each with four Stratix-III EP3SE260 FPGAs. Impulse C 3.3 is used for the XD1000 while Impulse C 3.6 with an in-house platform support package is used for Novo-G. Although the XD1000 and Novo-G are high-performance computing platforms, Impulse C also supports embedded PowerPC and MicroBlaze processors [66]. Furthermore, Novo-G and the XD1000 are representative of FPGA-based embedded systems that combine CPUs with one or more FPGAs. The presented overhead results would likely be similar for other embedded platforms, assuming similar Impulse C wrapper implementations.

Although we currently evaluate HLS assertions using Impulse C, the techniques are easily extended to support other languages. For example, in Carte, Impulse C's streaming transfers would be replaced with DMA transfers. The software-based assertion notification function (see Figure 5-1) would then need to monitor Carte's FPGA function calls for failed assertions as opposed to monitoring Impulse C's FPGA processes.
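As a recap of the unoptimized instrumentation from Section 5.4.1, the sketch below shows the general shape of the conversion in plain C: each assert becomes an if statement on the complemented condition that sends an identifier, and a CPU-side notification function decodes that identifier with one switch case per assertion. Everything here is an assumption for illustration: the identifiers, the single-variable failure_channel that stands in for the FPGA-to-CPU stream, and the file name and line numbers in the messages are hypothetical, and no Impulse C API calls are used. The sketch also returns instead of aborting so that it can be exercised, similar in spirit to defining NABORT.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical identifiers assigned by the instrumentation step. */
enum { ASSERT_NONE = 0, ASSERT_ID_RANGE = 1, ASSERT_ID_NONZERO = 2 };

/* Stand-in for the FPGA-to-CPU stream: holds one pending failure code. */
static int failure_channel = ASSERT_NONE;

static void send_failure(int id) {
    if (failure_channel == ASSERT_NONE) {
        failure_channel = id; /* keep only the first failure */
    }
}

/* Instrumented form of:
       assert(x >= 0 && x < 128);
       assert(d != 0);
   Each assert is replaced by an if on the complemented condition. */
static int hw_process(int x, int d) {
    if (!(x >= 0 && x < 128)) {
        send_failure(ASSERT_ID_RANGE);
    }
    if (!(d != 0)) {
        send_failure(ASSERT_ID_NONZERO);
        return 0; /* avoid the division the assertion was guarding */
    }
    return x / d;
}

/* CPU-side notification: one switch case per hardware-mapped assertion.
   Returns the decoded identifier (ASSERT_NONE if nothing failed). */
static int notify(FILE *err) {
    int id = failure_channel;
    switch (id) {
    case ASSERT_ID_RANGE:
        fprintf(err, "filter.c:12: hw_process: "
                     "Assertion `x >= 0 && x < 128' failed.\n");
        break;
    case ASSERT_ID_NONZERO:
        fprintf(err, "filter.c:13: hw_process: "
                     "Assertion `d != 0' failed.\n");
        break;
    default:
        break;
    }
    failure_channel = ASSERT_NONE;
    return id;
}
```

Sending only a small identifier keeps the hardware side cheap, since the file name, line number, and expression text live in the software lookup rather than in the circuit.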

5.5 Experimental Results

This section presents experimental results that evaluate the utility and overhead of the presented assertion synthesis, timing analysis and hang detection. Section 5.5.1 motivates the need for in-circuit assertions by illustrating a case study where assertions pass during simulation but fail during FPGA execution. Section 5.5.2 illustrates the performance and overhead improvements of the assertion parallelization optimization. Section 5.5.3 evaluates performance benefits of resource replication. Section 5.5.4 evaluates the scalability of assertions in terms of resource and frequency overhead by applying resource sharing optimizations to the communication channels. Section 5.5.5 presents the overhead of using assertions for timing analysis. Section 5.5.6 evaluates two hang-detection methods used on an application that fails to complete.

The designs used in the case studies occupy a relatively small part of the FPGA (24% of logic used in Section 5.5.5). Designs with higher resource utilization may lead to greater performance degradation and resource overhead of assertions due to, for example, increased difficulty in placement and routing. In addition, resource replication might not be applicable for designs that are almost full.

5.5.1 Detecting Simulation Inconsistencies

In this section, we illustrate how assertions can be used for in-circuit verification and debugging to catch inconsistencies between software simulation and FPGA execution of an application. The code in Figure 5-7 shows how assertion statements can be used for in-circuit verification by identifying bugs not found using software simulation. The first assertion is used to detect a translation mistake from source code to hardware. (Footnote 1: It is possible for a translation mistake to also have an effect on an assertion.) The assertion statement (line 6) never fails in simulation but fails when executed on the XD1000 platform. Upon inspection of the generated HDL, it is observed that Impulse C performs an erroneous 5-bit comparison of c2 and c1 (line 4). The 64-bit comparison of

4294967286 > 4294967296 (which evaluates to false) becomes a 5-bit comparison of 22 > 0 (which evaluates to true), allowing the array address to become negative (line 4). In contrast, the simulator executing the source code on the CPU sets the address to zero (line 5). Impulse C will generate a correct comparison when c1 and c2 are 32-bit variables.

The second assertion (line 8) is used to check the output of an external HDL function (line 7), which is used to gain extra performance over HLS-generated HDL. When an external HDL function is used, the developer must provide a C source equivalent for software simulation. However, the behavior and timing of the C source for simulation may differ from the behavior of the external HDL function during hardware execution, again demonstrating a need for in-circuit verification.

Figure 5-7. In-circuit verification example

For demonstration purposes, this example case is intentionally simplistic. Similar conclusions could be drawn using a cycle-accurate HDL simulator. However, in practice, inconsistencies caused by the timing of interaction between the CPU and FPGA would be very difficult to model in a cycle-accurate simulator.

5.5.2 Assertion Parallelization Optimization

This section provides results for the parallelization optimization of assertions. Section 5.5.2.1 shows improvements from optimization for Triple-DES encryption. Section 5.5.2.2 shows optimization improvements for edge-detection. While the applications in the previous sections evaluate frequency overhead, Section 5.5.2.3 evaluates state machine performance overhead (in terms of additional cycles) and optimization improvements.
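The arithmetic behind the failing comparison in Section 5.5.1 can be reproduced on a CPU. The following is a minimal sketch and not the code of Figure 5-7: the helper names `low_bits` and `truncated_greater` are hypothetical, and masking both operands to their low 5 bits is an assumed model of the erroneous comparator described in that section.

```c
#include <stdint.h>
#include <stdio.h>

/* Keep only the low `width` bits of v (assumed model of a narrow comparator). */
uint64_t low_bits(uint64_t v, unsigned width)
{
    return (width >= 64) ? v : (v & ((UINT64_C(1) << width) - 1));
}

/* Evaluate a > b after truncating both operands to `width` bits. */
int truncated_greater(uint64_t a, uint64_t b, unsigned width)
{
    return low_bits(a, width) > low_bits(b, width);
}

/* Print the comparison from Section 5.5.1 at full width and at 5 bits. */
void show_truncation(void)
{
    const uint64_t a = UINT64_C(4294967286); /* 2^32 - 10 */
    const uint64_t b = UINT64_C(4294967296); /* 2^32 */

    printf("64-bit result: %d\n", truncated_greater(a, b, 64));
    printf(" 5-bit result: %d (operands become %llu and %llu)\n",
           truncated_greater(a, b, 5),
           (unsigned long long)low_bits(a, 5),
           (unsigned long long)low_bits(b, 5));
}
```

Calling `show_truncation()` from a test program shows the full-width comparison evaluating to false and the truncated one to true, with the operands reduced to the 22 and 0 quoted in the text.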

5.5.2.1 Triple Data Encryption Standard (DES) Case Study

The first application case study shows the area and clock frequency overhead associated with adding performance-optimized assertion statements to a Triple-DES [60] application provided by Impulse C, which sends encrypted text files to the FPGA to be decoded. Two assertion statements were added in a performance-critical region of the application to verify that the decrypted characters are within the normal bounds of an ASCII text file. Table 5-1 shows all sources of overhead, including the streaming communication channels generated by Impulse C for sending failed assertions back to the CPU. The overhead numbers were found to be quite modest, with resource usage increasing by at most 0.12% of the device and the maximum clock frequency dropping by less than 4 MHz.

For this case study, the optimized assertions were checked in a separate pipeline process to reduce the overhead generated by the assertion comparison. Assertion failures are sent by another process to ensure that assertions can be checked each cycle. The state machine of the application remained unchanged because the optimized assertions were checked in a separate task working in parallel with the application. Since the application's state machine remained the same, the only performance overhead comes from the maximum clock frequency reduction. The resource overhead for optimized assertions actually decreased as compared to unoptimized assertions. The ALUT (Adaptive Look-Up Table) and routing resources needed by Quartus to achieve a maximum frequency of 144.7 MHz for unoptimized assertions were 0.06% greater than the ALUT and routing resources needed for optimized assertions that achieved a maximum frequency of 142 MHz.

5.5.2.2 Edge-detection Case Study

The following case study integrates performance-optimized assertions into an edge-detection application. The edge-detection application, provided by Impulse C, reads a 16-bit grayscale bitmap file on the microprocessor, processes it with pipelined 5x5 image kernels on the FPGA, and streams the image containing edge-detection information back.

Since the FPGA is programmed to process an image of a specific size, two assertions were added to check that the image size (height and width) received by the FPGA matches the hardware configuration. The assertions were added in a region of the application that was not performance critical. As shown in Table 5-2, the overhead numbers for this case study were also modest, with resource usage increasing by at most 0.06% on the EP2S180.

Table 5-1. Triple Data Encryption Standard (DES) assertion overhead

EP2S180                              Original          Assert            Overhead
Logic Used (out of 143520)           13677 (9.53%)     13851 (9.65%)     +174 (+0.12%)
Comb. ALUT (out of 143520)           7929 (5.52%)      8025 (5.59%)      +96 (+0.07%)
Registers (out of 143520)            10019 (6.98%)     10055 (7.01%)     +36 (+0.03%)
Block RAM (9383040 bits)             222912 (2.37%)    223488 (2.38%)    +576 (+0.01%)
Block Interconnect (out of 536440)   24657 (4.60%)     24878 (4.64%)     +221 (+0.04%)
Frequency (MHz)                      145.7             142.0             -3.7 (-2.54%)

For the edge-detection case study, the optimized assertions were checked in a separate process to reduce the overhead generated by the assertion comparison. Since the application's state machine remained the same and the maximum clock frequency did not decrease, the application did not incur any performance overhead due to the addition of the assertions. The frequency increase is likely due to randomness in placement and routing results of similar designs. The performance optimization of the assertions increased ALUT resource utilization from 0.03% to 0.06% on the EP2S180.

5.5.2.3 State Machine Overhead Analysis

This section presents a generalized analysis of performance overhead caused by adding assertions with a single comparison and the performance improvement via

optimizations. The results in this section present overhead in terms of cycles and exclude changes to clock frequency, which were discussed in the previous section. We evaluate single-comparison assertions to determine a lower bound on the optimization improvements. To measure the performance overhead of adding assertions, we examine the state machines and pipelines generated by Impulse C. Impulse C allows loops (e.g., for loops or while loops) to be pipelined. Assertions added to a pipeline can modify the pipeline's characteristics. Each pipeline generated by Impulse C has a latency (time in cycles for one iteration of a loop to complete) and a rate (time in cycles needed to finish the next loop iteration). Assertions that are not in a pipelined loop will add latency (i.e., one or more additional states) to the state machine that preserves the control flow of the application. As stated in Section 5.4.2, assertions can be optimized to reduce or eliminate the overhead of assertions in terms of additional clock cycles required to finish application execution. These optimizations move the comparisons to a separate Impulse C process so that they can be checked in parallel with the application. Any remaining clock cycle overhead after optimization comes from the data movement needed for assertion checking.

Table 5-2. Edge-detection assertion overhead

EP2S180                              Original          Assert            Overhead
Logic Used (out of 143520)           12250 (8.54%)     12273 (8.56%)     +23 (+0.02%)
Comb. ALUT (out of 143520)           6726 (4.69%)      6809 (4.75%)      +83 (+0.06%)
Registers (out of 143520)            9371 (6.53%)      9417 (6.56%)      +46 (+0.03%)
Block RAM (9383040 bits)             141120 (1.50%)    141696 (1.51%)    +576 (+0.01%)
Block Interconnect (out of 536440)   19904 (3.71%)     19994 (3.73%)     +90 (+0.02%)
Frequency (MHz)                      77.5              79.3              +1.8 (+2.32%)
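To make the latency and rate terms concrete, the sketch below estimates the cycle count of a pipelined loop as latency + (iterations - 1) * rate. This is a common first-order pipeline model that is assumed here for illustration; it ignores stalls, and the function names and the iteration count are hypothetical. The latency and rate values are those of the scalar-variable case discussed with Table 5-4.

```c
#include <stdio.h>

/* First-order estimate: the first iteration completes after `latency` cycles
 * and each further iteration completes `rate` cycles after the previous one. */
unsigned long pipeline_cycles(unsigned long latency, unsigned long rate,
                              unsigned long iterations)
{
    if (iterations == 0)
        return 0;
    return latency + (iterations - 1) * rate;
}

/* Compare a loop before and after adding an unoptimized assertion. */
void print_pipeline_example(void)
{
    const unsigned long n = 512; /* hypothetical iteration count */
    unsigned long original = pipeline_cycles(2, 1, n);    /* latency 2, rate 1 */
    unsigned long unoptimized = pipeline_cycles(3, 2, n); /* latency 3, rate 2 */

    printf("original:    %lu cycles\n", original);
    printf("unoptimized: %lu cycles (%.2fx)\n", unoptimized,
           (double)unoptimized / (double)original);
}
```

Under this model the ratio approaches the ratio of the two rates for long loops, which is why a one-cycle rate overhead on a rate-1 loop roughly halves throughput.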

Table 5-3 shows the latency overhead for non-pipelined, single-comparison assertions. In most cases, assertions with these comparisons will increase latency by one cycle. With optimizations, this latency overhead is reduced to zero since extracting data in most cases will not add latency to the application. In the case where an array is accessed in consecutive cycles by the application and an assertion, an unoptimized assertion will have a latency overhead of two cycles because of block RAM port limitations. With optimizations, this latency overhead is reduced to one cycle to extract data from the array or block RAM. For more complex assertions, the latency will increase for unoptimized assertions while the latency for optimized assertions will remain the same, as seen when comparing Figure 5-2B and Figure 5-2C. Even with the multiple array accesses in assert((j <= 0 || a[0] == i) && (b[0] == 2 || i > 0)), only one cycle is needed to retrieve the array data.

Table 5-4 shows pipeline latency and rate overhead observed for a single comparison. Adding an unoptimized assertion using a scalar variable to a pipelined loop increased the latency from 2 to 3, resulting in an overhead of one cycle, and degraded the rate from 1 to 2 for the pipeline. Although the rate overhead was a single cycle, this corresponds to a 2x slowdown in performance because the throughput is reduced to half of the original loop. This overhead comes from adding a streaming communication call. For the optimized assertion, the streaming communication call was moved to a separate process that reduced the latency and rate overhead to zero, resulting in a 2x speedup compared to the unoptimized assertions. For assertions using arrays in pipelined loops, adding an assertion caused a two-cycle latency overhead that increased the latency from 2 to 4. The assertion degraded the rate from 2 to 3, which is a one-cycle rate overhead, i.e., 50% more cycles per loop iteration.

5.5.3 Resource Replication Optimization

As mentioned in Section 5.5.2.3, Table 5-4 shows pipeline latency and rate overhead observed for a single comparison. For assertions used in pipelined loops checking an array

data structure, the assertion overhead was reduced via resource replication by adding an additional array to the process with read access dedicated to the assertion, as described in Section 5.4.2. With a duplicate array, only the latency increased (from 2 to 3) and the rate remained the same, which corresponds to a 33% rate improvement over the non-optimized version. A similar improvement could be gained for a non-pipelined assertion that checks multiple indexes to the same array.

Table 5-3. Single-comparison assertion

                             Latency Overhead
Assertion data structure     Unoptimized    Optimized
Scalar variable              1              0
Array (non-consecutive)      1              0
Array (consecutive)          2              1

Table 5-4. Pipelined single-comparison assertion

                             Overhead
                             Unoptimized         Optimized
Assertion data structure     Latency    Rate     Latency    Rate
Scalar variable              1          1        0          0
Array                        2          1        1          0

5.5.4 Resource Sharing Optimization

This section demonstrates the improvement in scalability from resource sharing optimization techniques. We evaluate scalability by measuring the resource and clock frequency overhead incurred by adding assertions to a large number of Impulse C processes, providing an extremely pessimistic scenario in terms of overhead. A single assertion is added per process, which results in a separate streaming communication channel for each process. A single greater-than comparison is made per process, generally requiring only minor changes to the process state machine. In this study, the application consists of a simple streaming loopback as shown in Figure 5-8. The loopback also stores the value and retrieves the value at each stage. Each process added to the

application adds an extra stage in the loopback (e.g., for 4 FPGA processes, shown as L in Figure 5-8, incoming data would be passed from the input to the FPGA, passing through each of the processes before being returned to the CPU). The assertion in each process ensures the number being passed is greater than zero. Each process adds overhead in terms of an assertion, shown as A in Figure 5-8, and an extra Impulse C streaming communication channel, shown as C in Figure 5-8, to notify the CPU of failed assertions. For a 32-bit stream, up to 32 assertions can be connected to the streaming communication channel before a new streaming communication channel is needed.

Figure 5-8. Simple streaming loopback

The previously discussed straightforward conversion of assert statements to if statements was used. The unoptimized assertions with 128 processes (128 assertions) had a resource overhead on the EP2S180 of 4.07% ALUTs (the highest resource percentage overhead). In addition, the maximum frequency decreased from 190 MHz for the 128-process original application to 154 MHz, or an 18.8% overhead, as shown in Figure 5-9 for the 128-process application with unoptimized assertions.

By applying the resource sharing optimization only to the communication channels (and not the assertion resources), so that only a single bit of the stream is used per assertion as described in Section 5.4.2, the resource overhead was decreased. The resource overhead on the EP2S180, as shown in Figure 5-10, was reduced to 1.34% of ALUTs, or over a 3x improvement for the 128-process application with assertions. Assertion

optimizations increased the maximum frequency for the 128-process application to 189 MHz, as shown in Figure 5-9, which represents over an 18% improvement. The frequency of the application with assertion optimizations (189.3 MHz) was very close to the original application's frequency of 190.6 MHz. While the resource usage increased consistently for all three tests (original, unoptimized, and optimized) from 1 to 128 processes, the maximum frequencies reported by Quartus did not consistently decrease as the number of processes increased until 32 processes were added. The frequency overhead decreased from 32 to 128 processes with optimized assertions because the application added one stream per process while the assertions only added one stream per 32 processes since 32-bit streaming communication was used. This demonstrates the benefits of the resource sharing optimization for streaming communication channels.

Figure 5-9. Assertion frequency scalability

5.5.5 In-circuit Timing Analysis

This section provides a case study showing the utility and overhead of adding assertions with timing statements to a backprojection application. Backprojection is a DSP algorithm for tomographic reconstruction of data via image transformation. For the

backprojection application, instrumentation was added to a nested loop (see Figure 5-11). Two 32-bit timing calls were added around the inner pipelined loop to measure the time required for the pipelined loop to finish generating 512 pixels. After the timing calls, ten assertions were added to find the maximum time required for the pipelined loop to finish for all outer-loop iterations. Since the inner loop has 512 iterations, a minimum of 512 cycles should be needed to complete the loop; however, more cycles could be required for stalls and flushing of the pipeline. To test these assumptions, ten assertions were added to check the timing of the loop with exponentially increasing maximum times, and NABORT was defined to stop the application from aborting. After execution, only the first assertion passed evaluation, which means that the maximum time for the inner loop is between 640 and 1023 cycles.

Figure 5-10. Optimized assertion resource scalability

This technique allows the application designer to quickly check timing in multiple regions of the application with minimal disturbance to the application in terms of resource and communication overhead. After evaluating the feedback from the assertions, the application designer can modify the application to stream back the exact timing values for

problematic regions of code. In addition, the assertion feedback provided before modifying the application can be used to make sure that the timing values streamed back are valid. It is possible that the addition of large data transfers could change the timing of the application.

Figure 5-11. Adding timing assertions individually to backprojection

Figure 5-12. Adding timing assertions in a loop to backprojection

Table 5-5. Individual backprojection timing assertion overhead

EP3SE260                             Original           Assert             Overhead
Logic Used (out of 203520)           48285 (23.72%)     49702 (24.42%)     +1417 (+0.70%)
Comb. ALUT (out of 203520)           32962 (16.20%)     33132 (16.28%)     +170 (+0.08%)
Registers (out of 203520)            44098 (21.67%)     44595 (21.91%)     +497 (+0.24%)
Block RAM (15040512 bits)            7114752 (47.30%)   7114752 (47.30%)   0 (0%)
Block Interconnect (out of 694728)   101317 (14.58%)    102740 (14.79%)    +1423 (+0.20%)
Frequency (MHz)                      131.9              132.5              +0.6 (+0.45%)

The backprojection application runs on all four Stratix-III EP3SE260 FPGAs on the GiDEL PROCStar III [10] card. Overhead is only given for one FPGA since the image is split between all four FPGAs. Ideally, a single assertion could check an array of values in a loop for more compact code (see Figure 5-12). However, that approach increases overhead when synthesized with Impulse C, as shown in Table 5-6, as compared to using individual assertions, as shown in Table 5-5. For individual assertions, no additional block RAM was used since assertion failures were transferred via registers rather than using streaming communication on the PROCStar III. The logic overhead of 0.7% is the highest of all the application case studies but is reasonable given that timing calls and multiple assertions were used. The maximum FPGA frequency stayed about the same, with an insignificant increase of 0.6 MHz. For a single assertion in a loop, the overhead increased in all categories except for routing. The additional overhead is likely caused by additional complexity of the state machine and the usage of block RAM. The lower routing overhead is probably due to only having to make connections to a single assertion.
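The two coding styles compared above can be sketched in plain C as follows. This is an illustration under stated assumptions rather than the code of Figure 5-11 or Figure 5-12: `CHECK` is a hypothetical stand-in for an assertion compiled with NABORT defined (it records a failure instead of aborting), the bounds are made-up values that double at each step, and only four checks are shown instead of ten.

```c
#include <stddef.h>
#include <stdio.h>

static int failed_checks = 0;

/* Hypothetical non-aborting assertion: report and count, then continue. */
#define CHECK(cond)                                                   \
    do {                                                              \
        if (!(cond)) {                                                \
            ++failed_checks;                                          \
            printf("check failed at line %d: %s\n", __LINE__, #cond); \
        }                                                             \
    } while (0)

/* Style 1: one assertion per bound, written out individually. */
int check_individually(unsigned long elapsed)
{
    failed_checks = 0;
    CHECK(elapsed <= 1024);
    CHECK(elapsed <= 2048);
    CHECK(elapsed <= 4096);
    CHECK(elapsed <= 8192);
    return failed_checks;
}

/* Style 2: a single assertion inside a loop over an array of bounds. */
int check_in_loop(unsigned long elapsed)
{
    static const unsigned long bounds[] = {1024, 2048, 4096, 8192};

    failed_checks = 0;
    for (size_t i = 0; i < sizeof bounds / sizeof bounds[0]; ++i)
        CHECK(elapsed <= bounds[i]);
    return failed_checks;
}
```

On a CPU both functions report the same failures, and the number of failed checks brackets the measured time between two adjacent bounds. The difference reported in this section only appears after synthesis, where the looped form needs an array and a larger state machine.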

Table 5-6. Looped backprojection timing assertion overhead

EP3SE260                             Original           Assert             Overhead
Logic Used (out of 203520)           48285 (23.72%)     50169 (24.65%)     +1884 (+0.93%)
Comb. ALUT (out of 203520)           32962 (16.20%)     33459 (16.44%)     +497 (+0.24%)
Registers (out of 203520)            44098 (21.67%)     44657 (21.94%)     +559 (+0.27%)
Block RAM (15040512 bits)            7114752 (47.30%)   7123968 (47.37%)   +9216 (+0.07%)
Block Interconnect (out of 694728)   101317 (14.58%)    102621 (14.77%)    +1304 (+0.19%)
Frequency (MHz)                      131.9              131.3              -0.6 (-0.45%)

5.5.6 Hang Detection

This section shows how in-circuit assertions can be used to detect when an application fails to complete (i.e., hangs), even when software simulation runs to completion. In an effort to speed up a decoder and encoder version of the DES application described in Section 5.5.2.1, modifications were made that caused the application to complete in software simulation and yet hang on the XD1000. Since Impulse C does not support printf in hardware, assertions were used to provide a heartbeat and "trace" the execution of processes on the FPGA. Although this is not a common use of assertions in software, it can be useful to use assertions as a positive indicator rather than a negative indicator when an application is known to crash or hang. Statements of the form assert(0) were placed at important points in the code for each FPGA process, and NABORT was defined to stop the application from aborting. The new code with assertions added was executed via both software simulation and execution on the target platform. After comparing the line numbers of the failed assertions of both runs, it was found that the hang occurred at a memory read, which was causing the process to hang instead of exiting a loop. By

identifying the problematic line of code using in-circuit assertions, we were able to debug the application and determine that the memory read should have been a memory write. This correction allowed the process to complete execution.

Next, automated hang detection was used on the same problematic DES application. The software hang detector was triggered by the timeout of a communication call. The line number of the software API call was reported back along with the line number (taken before the API call was made) that the hardware process was currently executing. Although hardware hang detection was working correctly in the FPGA, the hardware hang detector was not able to notify the application designer of the problematic line of code since the software API call in conjunction with the erroneous line in the hardware process caused all communication between the CPU and FPGA to stop. To solve this problem, a sleep of one second was placed above the software API call that was reported as hung in the previous run. The addition of the sleep allowed the hardware hang detector to report back the exact line number for the memory read that should have been a memory write.

The resource overhead of using automatic hang detection on the Triple-DES application is shown in Table 5-7. Hang detection had the highest, but still reasonable, percentage of ALUT (0.32%) and routing (0.25%) overhead because of the comparisons and connections made to the state machine of the encoder and decoder hardware process. The assertion pragma, #pragma assert_FPGA_watch_dog, was set to a timeout of a hundred million cycles, which needed a 30-bit timing register. When using a 64-bit register, the frequency overhead increased to 5.7%. However, such overhead is very pessimistic because even with a 10 GHz clock speed, a 64-bit register supports a maximum timeout of about 58 years. For more typical cases, the frequency overhead should be less than 5.7%.

5.5.7 Assertion Limitations

The main limitation of in-circuit assertions is that overhead is dependent on the complexity of the assertion statements. For example, a designer could potentially verify a signal processing filter using an assertion statement that performs an FFT and then checks

to see if a particular frequency is below a pre-defined value. In this case, the synthesized assertion would contain a circuit for an FFT, which could have a large overhead. Note that such overhead is not a limitation of the presented synthesis techniques, but rather a fundamental limitation of in-circuit assertions.

Table 5-7. Data Encryption Standard (DES) hang-detection overhead

EP2S180                              Original          Assert            Overhead
Logic Used (out of 143520)           21051 (14.67%)    21739 (15.15%)    +688 (+0.48%)
Comb. ALUT (out of 143520)           12986 (9.05%)     13440 (9.36%)     +454 (+0.32%)
Registers (out of 143520)            13884 (9.67%)     14015 (9.77%)     +121 (+0.09%)
Block RAM (9383040 bits)             149184 (1.59%)    149184 (1.59%)    0 (0%)
Block Interconnect (out of 536440)   38924 (7.26%)     40241 (7.50%)     +1317 (+0.25%)
Frequency (MHz)                      78.8              77.0              -1.80 (-2.28%)

To minimize this overhead, we suggest certain coding practices. Whenever possible, designers should use assertion statements that compare pre-computed values. Designers should try to avoid consolidating assertions in loops with comparison values stored in arrays because the unnecessary usage of arrays and loops with assertions can increase overhead as shown in Section 5.5.5. Designers should try to avoid using many logical operators because these operators can cause the HLS tool to create a large state machine to check all combination possibilities of the assertion as shown in Figure 5-2B. By following these guidelines, the assertions will require a minimum amount of resources. Assertion parallelization optimization and resource replication optimization can increase the resource overhead to reduce the performance overhead. Accessing the same array multiple times in an assertion (e.g., assert(a[i] > a[i-1])) can be costly either in terms

of performance or resources, depending on whether the resource replication optimization is used. Even accessing an array only once in an assertion could be costly if the application would normally be using the same array element in the same clock cycle.

5.6 Conclusions

High-level synthesis tools often rely upon software simulation for verification and debugging, executing FPGA processes as threads on the CPU. However, FPGA programming bugs not exposed by software simulation become difficult to remedy once the application is executing on the target platform. Similarly, HLS tools often lack detailed timing-analysis capabilities, making it difficult for an application designer to determine which regions of an application do not meet timing constraints during FPGA execution. The assertion-based verification techniques presented in this chapter provide ANSI-C-style verification both for the FPGA and CPU while in simulation and when executing on the target platform. This approach allows assertions to be seamlessly transferred from simulation to execution on the FPGA without requiring the designer to understand HDL or cycle-accurate simulators. The ability of assertions to verify a portion of the application's functionality and debug errors not found during software simulation was demonstrated. ANSI-C timing functions allowed assertions to check application time constraints during execution. Automated hang detection provided source information indicating where software or hardware processes failed to complete in a timely manner. The techniques were shown to enable debugging of errors not found during software simulation while incurring a small area overhead of 0.7% or less and a maximum clock frequency overhead of less than 3% for several application case studies on an EP2S180 and EP3SE260. The presented techniques were shown to be highly scalable, reducing resource overhead of 128 assertions by over 3x, requiring only 1.34% ALUT resources and improving clock frequency by over 18%. The performance overhead of optimized assertions was also demonstrated to be low, with no performance impact observed in the edge-detection case study in terms of frequency degradation or increased

cycle usage. A general analysis of performance for single-comparison assertions showed that the presented optimizations resulted in a throughput increase ranging from 33% to 100% when compared to unoptimized assertions, potentially eliminating all throughput overhead. Future work includes further exploration and automation of hang detection.
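The heartbeat technique of Section 5.5.6 can be mimicked in ordinary C. In this minimal sketch, `HEARTBEAT` is a hypothetical stand-in for an assert(0) compiled with NABORT defined: it records the source line on which it was reached and lets execution continue. The trace buffer, its size, the toy process, and the helper `first_divergence` are assumptions made for illustration and are not part of the presented framework.

```c
#include <stddef.h>

#define MAX_TRACE 64

/* Line numbers of the heartbeats reached so far in one run. */
typedef struct {
    int lines[MAX_TRACE];
    size_t count;
} Trace;

/* Hypothetical non-aborting assert(0): log the current line and continue. */
#define HEARTBEAT(trace)                                 \
    do {                                                 \
        if ((trace)->count < MAX_TRACE)                  \
            (trace)->lines[(trace)->count++] = __LINE__; \
    } while (0)

/* Toy process: `hang_after_read` models a run that never gets past the read. */
void toy_process(Trace *t, int hang_after_read)
{
    HEARTBEAT(t); /* process started */
    HEARTBEAT(t); /* about to read memory */
    if (hang_after_read)
        return;   /* stands in for a process that stops making progress */
    HEARTBEAT(t); /* read completed */
}

/* Index of the first heartbeat present in `reference` but not matched in `run`,
 * or -1 if `run` reached every heartbeat that `reference` reached. */
long first_divergence(const Trace *reference, const Trace *run)
{
    for (size_t i = 0; i < reference->count; ++i)
        if (i >= run->count || run->lines[i] != reference->lines[i])
            return (long)i;
    return -1;
}
```

Comparing the trace from software simulation (the reference) with the trace reported by the hardware run points at the first heartbeat that was never reached, which narrows the hang to the code between two heartbeats.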

CHAPTER 6
CONCLUSIONS

Using HDL tools for performance analysis and verification of HLS applications can be difficult. The most obvious challenge is correlating results from HDL to HLS source code. HLS tools typically provide correlation data in intermediate representation files. Although it is tedious for developers to read through such files, intermediate representation files can be parsed to provide source-line correlation and variable names linked to a particular state of a state machine. Another challenge that is specific to HLS is data extraction of results. Use of the HLS communication calls was chosen over JTAG for ease of use and wider platform support. For HLS tools that do not have open-source support, a method of generating a communication loopback in hardware is used. The loopback stub in HDL is then overwritten and used as an interface to gather results post mortem. Although many HLS tools strive to be ANSI-C compliant, most if not all HLS tools do not support the use of ANSI-C libraries. Part of this challenge was addressed by providing support for ANSI-C assertion and timing calls for verification and performance checking. Finally, the challenges of visualizing HLS applications were addressed. Visualizations from an existing performance analysis tool were leveraged and custom communication visualizations for bottleneck detection were created to help developers optimize their applications.

The challenges stated above were addressed in three phases. In Phase 1, a framework for performance analysis of HLS applications was described. Techniques were developed for instrumenting, measuring, and presenting performance data of HLS applications. This framework provides automated profiling of HLS code. The state machines generated by the HLS tool were instrumented and cycle counts were measured to determine the timing of the HLS application. HLS profiling was integrated into the Parallel Performance Wizard (PPW) tool to provide comprehensive performance analysis for multiple CPUs and FPGAs working together. Performance analysis visualizations from PPW were also leveraged. This framework and tool is believed to be the first HLS performance analysis tool via

an extensive literature review. The performance analysis tool was used on a Molecular Dynamics application. The overhead of streaming transfers in the FPGA was found to be high. The streaming buffer size was increased, which increased the speedup of the Molecular Dynamics application from 6.2 to 7.8 versus the serial baseline.

In Phase 2, a communication bottleneck visualization was explored. Instrumentation of communication API calls was added to monitor the bandwidth of the communication architecture. Methods to analyze the bandwidth characteristics of a communication architecture to suggest potential optimizations were presented. Graphviz was used to generate a multi-level visualization of the communication architecture of the application and the state machines of each process. Based on an extensive literature review, this framework and tool to visualize communication bottlenecks of HLS applications is believed to be the first of its kind. Several case studies were used to illustrate analysis techniques to identify communication bottlenecks. A speedup of 6.29 times was achieved for a Triple DES application by switching from streaming to DMA communication. Bottlenecks were also found in a Molecular Dynamics application and a Backprojection application.

In Phase 3, an assertion-based verification framework was described for HLS tools. The framework allows ANSI-C assertions to be synthesized into FPGA circuits, reporting valuable information concerning any assertion failures such as source-line information and the condition that failed. Hang detection was also explored for HLS applications. Watchdog timers were used to determine if the state machine hangs on a single state, and software timers were used to detect hangs of API communication calls. Based on an extensive literature review, the framework and tool for in-circuit assertion-based verification of HLS applications is believed to be the first of its kind. Several case studies demonstrate the low overhead and importance of assertion-based verification for HLS. Triple DES and edge detection were used as case studies for assertion parallelization optimization. Minimal overhead in terms of additional execution cycles and required FPGA resources was shown. In-circuit timing analysis was performed for Backprojection. Timing calls were added to

measure the time required for the pipelined loop to finish generating 512 pixels. Hang detection was used on a problematic version of Triple DES. The hardware hang detector was able to report back the exact line number of a memory read communication call that should have been a memory write.

The contributions of this research include frameworks and tools for in-circuit performance analysis and assertion-based verification of HLS applications. An HLS framework was demonstrated for performance analysis and assertion-based verification that is widely applicable to various RC platforms and HLS tools. The source-level results given by these tools can increase developer productivity and reduce the need for RC programmers to understand HDL. As the abstraction level of programming languages for FPGAs increases and HDL programming becomes a footnote in history, source-level productivity tools will become even more important.

Future research includes extending tool coverage to HLS graphical tools or other HLS programming languages besides C. Performance prediction was not explored in this research. Performance prediction of HLS applications could be performed by a combination of static HDL code analysis and dynamic analysis of the source code running on a microprocessor. This analysis combination would allow performance analysis to take place without requiring a lengthy place and route of HDL to the FPGA, and would provide a more accurate prediction of performance than simply executing the source code on a microprocessor. State machines would be analyzed for cycle counts of various execution paths. The branches taken on the microprocessor would be monitored and the cycle counts would be used to predict the execution time on the FPGA. Source correlation would be used to match branches in the source code and state machines. More advanced techniques would be required to model communication bottlenecks caused by streaming and DMA buffers.
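As a rough illustration of the prediction approach outlined above, the sketch below multiplies a per-path cycle count by the number of times each path was taken and divides the total by the clock frequency. It is hypothetical: the `PathProfile` type, the functions `predict_cycles` and `predict_seconds`, and any numbers passed to them are assumptions, and the model ignores the communication bottlenecks mentioned above.

```c
#include <stddef.h>

/* One execution path through a generated state machine. */
typedef struct {
    unsigned long cycles_per_visit; /* from static analysis of the state machine */
    unsigned long visits;           /* from branch monitoring on the microprocessor */
} PathProfile;

/* Sum the cycles spent on all paths. */
unsigned long long predict_cycles(const PathProfile *paths, size_t n)
{
    unsigned long long total = 0;

    for (size_t i = 0; i < n; ++i)
        total += (unsigned long long)paths[i].cycles_per_visit * paths[i].visits;
    return total;
}

/* Convert the predicted cycle count to seconds at a clock of `clock_hz`. */
double predict_seconds(const PathProfile *paths, size_t n, double clock_hz)
{
    return (double)predict_cycles(paths, n) / clock_hz;
}
```

For example, two paths costing 3 and 5 cycles that are taken 1000 and 200 times give 4000 cycles, or 40 microseconds at an assumed 100 MHz clock.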

APPENDIX A
TARGET SYSTEMS

A.1 ProcStar-III

ProcStar-III [10] is the basis of the Novo-G RC supercomputer. Novo-G has 24 ProcStar-III boards. Each ProcStar-III has 4 Altera Stratix-III EP3SE260 FPGAs. A full Impulse C [55] platform support package will be needed in order to seamlessly utilize the performance analysis and verification in this document.

A.2 XD1000

The XD1000 [8] is a reconfigurable system from XtremeData Inc. The XD1000 contains a dual-processor motherboard with an Altera Stratix II EP2S180 FPGA on a module in one of the two Opteron sockets. The HyperTransport interconnect provides a communication channel between the FPGA and host processor with Impulse C [55].

A.3 SRC-7

SRC-7 [9] is a reconfigurable system by SRC Computers, Inc. with specialized internal network and memories. The SRC-7 model 71TS comes with Series-H MAP with EP2S180 FPGAs, a Hi-Bar switch and Global Common Memory. Carte C [56] can only execute on the SRC machines.

APPENDIX B
APPLICATION CASE STUDIES

B.1 Molecular Dynamics

Molecular Dynamics (MD) simulates interactions between atoms and molecules over discrete time intervals. MD simulations take into account standard physics, van der Waals forces, and other interactions to calculate the movement of molecules over time. Alam et al. [62] provide a more in-depth overview of MD simulations. The simulation used in this document keeps track of 16,384 molecules, each of which uses 36 bytes (4 bytes to store its position, velocity, and acceleration in each of the X, Y, and Z directions). Analysis is focused on the kernel of the MD application that computes distances between atoms and molecules. Serial MD code optimized for traditional microprocessors was obtained from Oak Ridge National Lab (ORNL). This application runs on the XD1000 platform.

B.2 Backprojection

Backprojection is a DSP algorithm for tomographic reconstruction of data via image transformation. The application was originally designed for the XD1000 and ported to Novo-G. This application runs on all four FPGAs of the ProcStar-III board in the Novo-G supercomputer.

B.3 Triple Data Encryption Standard (DES)

Triple-DES [60] is a block cipher used for encryption. A modified version of the Triple-DES application provided by Impulse C was used in this work. The application sends encrypted text files to the FPGA to be decoded. This application runs on the XD1000 platform.

B.4 Edge Detection

The edge-detection application was provided by Impulse C. The edge-detection application reads a 16-bit grayscale bitmap file on the microprocessor, processes it with pipelined 5x5 image kernels on the FPGA, and streams the image containing edge-detection information back. This application runs on the XD1000 platform.
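The per-molecule storage described in Section B.1 can be checked with a short C sketch. The struct layout and the use of single-precision `float` for each 4-byte value are assumptions; the data layout of the original MD code is not shown in this document.

```c
#include <stddef.h>

#define NUM_MOLECULES 16384

/* Assumed layout: three 4-byte components each for position,
 * velocity, and acceleration (X, Y, and Z directions). */
typedef struct {
    float position[3];
    float velocity[3];
    float acceleration[3];
} Molecule;

/* Total bytes needed to hold the state of every molecule in the simulation. */
size_t md_state_bytes(void)
{
    return (size_t)NUM_MOLECULES * sizeof(Molecule);
}
```

On platforms where `float` occupies 4 bytes, this layout gives the 36 bytes per molecule stated above and 589,824 bytes (576 KiB) for all 16,384 molecules.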

REFERENCES

[1] J. Williams, A. George, J. Richardson, K. Gosrani, and S. Suresh, "Fixed and reconfigurable multi-core device characterization for HPEC," in Proc. of High Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington, MA, September 2008.
[2] Seth Koehler, John Curreri, and Alan D. George, "Performance analysis challenges and framework for high-performance reconfigurable computing," Parallel Computing, vol. 34, no. 4-5, pp. 217-230, 2008.
[3] Hung-Hsun Su, M. Billingsley, and A. D. George, "Parallel Performance Wizard: A performance analysis tool for partitioned global-address-space programming," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, April 2008, pp. 1-8.
[4] Katherine Compton and Scott Hauck, "Reconfigurable computing: a survey of systems and software," ACM Comput. Surv., vol. 34, no. 2, pp. 171-210, 2002.
[5] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Design, Automation and Test in Europe, 2001. Conference and Exhibition 2001. Proceedings, 2001, pp. 642-649.
[6] Altera, "Stratix IV device handbook," http://www.altera.com/literature/hb/stratix-iv/stratix4_handbook.pdf, Retrieved September 2010.
[7] Xilinx, "Virtex-5 FPGA user guide," http://www.xilinx.com/support/documentation/user_guides/ug190.pdf, Retrieved September 2010.
[8] XtremeData Inc., "XD1000 FPGA coprocessor module for socket 940," http://www.xtremedatainc.com/pdf/XD1000_Brief.pdf, Retrieved September 2010.
[9] SRC Computers, "SRC-7 MAPstation," http://www.srccomp.com/products/docs/SRC7_MAPstation_70000-AF.pdf, Retrieved September 2010.
[10] GiDEL, "PROCStar III PCIe x8 computation accelerator," http://www.gidel.com/pdf/PROCStarIII%20Product%20Brief.pdf, Retrieved September 2010.
[11] Cray, "Cray XD1 datasheet," http://www.hpc.unm.edu/%7Etlthomas/buildout/Cray_XD1_Datasheet.pdf, Retrieved September 2010.
[12] B. Holland, M. Vacas, V. Aggarwal, R. DeVille, I. Troxel, and A. George, "Survey of C-based application mapping tools for reconfigurable computing," in Proc. of 8th International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), Washington, DC, September 2005.

[13] E. El-Araby, M. Taher, M. Abouellail, T. El-Ghazawi, and G. B. Newby, "Comparative analysis of high level programming for reconfigurable computers: Methodology and empirical study," in Programmable Logic, 2007. SPL '07. 2007 3rd Southern Conference on, Feb. 2007, pp. 99–106.
[14] Sameer S. Shende and Allen D. Malony, "The TAU parallel performance system," International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287–311, 2006.
[15] Bernd Mohr and Felix Wolf, Euro-Par 2003 Parallel Processing, chapter: KOJAK - A Tool Set for Automatic Performance Analysis of Parallel Programs, pp. 1301–1304, Springer Berlin, 2003.
[16] John Mellor-Crummey, Robert J. Fowler, Gabriel Marin, and Nathan Tallent, "HPCVIEW: A tool for top-down analysis of node performance," Journal of Supercomputing, vol. 23, no. 1, pp. 81–104, 2002.
[17] L. A. DeRose and D. A. Reed, "SvPablo: A multi-language architecture-independent performance analysis system," in Parallel Processing, 1999. Proceedings. 1999 International Conference on, 1999, pp. 311–318.
[18] J. G. Tong and M. A. S. Khalid, "A comparison of profiling tools for FPGA-based embedded systems," in Electrical and Computer Engineering, 2007. CCECE 2007. Canadian Conference on, April 2007, pp. 1687–1690.
[19] Martin Schulz, Brian S. White, Sally A. McKee, Hsien-Hsin S. Lee, and Jürgen Jeitner, "Owl: next generation system monitoring," in CF '05: Proceedings of the 2nd conference on Computing frontiers, New York, NY, USA, 2005, pp. 116–124, ACM.
[20] Shashidhar Mysore, Banit Agrawal, Rodolfo Neuber, Timothy Sherwood, Nisheeth Shrivastava, and Subhash Suri, "Formulating and implementing profiling over adaptive ranges," ACM Transactions on Architecture and Code Optimization (TACO), vol. 5, no. 1, pp. 1–32, 2008.
[21] K. Scott Hemmert, Justin L. Tripp, Brad L. Hutchings, and Preston A. Jackson, "Source level debugger for the Sea Cucumber synthesizing compiler," Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 228–237, 2003.
[22] Jean P. Calvez and Olivier Pasquier, "Performance monitoring and assessment of embedded HW/SW systems," in Proc. International Conference on Computer Design (ICCD): VLSI in Computers and Processors, 1995, pp. 52–57, IEEE Computer Society.
[23] R. DeVille, I. Troxel, and A. D. George, "Performance monitoring for run-time management of reconfigurable devices," International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), pp. 175–181, 2005.

[24] Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall, "The Paradyn parallel performance measurement tool," Computer, vol. 28, no. 11, pp. 37–46, 1995.
[25] Robert Bell, Allen D. Malony, and Sameer Shende, Euro-Par 2003 Parallel Processing, chapter: ParaProf: A Portable, Extensible, and Scalable Tool for Parallel Performance Profile Analysis, pp. 17–26, Springer Berlin, 2003.
[26] Felix Wolf, Bernd Mohr, Jack Dongarra, and Shirley Moore, Euro-Par 2004 Parallel Processing, chapter: Efficient Pattern Search in Large Traces Through Successive Refinement, pp. 47–54, Springer Berlin, 2004.
[27] Hung-Hsun Su, Max Billingsley, and Alan D. George, "A distributed, programming model-independent automatic analysis system for parallel applications," in 14th IEEE International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) of IPDPS 2009, May 2009.
[28] M. T. Heath, A. D. Malony, and D. T. Rover, "Parallel performance visualization: from practice to theory," Parallel and Distributed Technology: Systems and Applications, IEEE, vol. 3, no. 4, pp. 44–60, Winter 1995.
[29] M. T. Heath and J. A. Etheridge, "Visualizing the performance of parallel programs," Software, IEEE, vol. 8, no. 5, pp. 29–39, Sep 1991.
[30] Barton P. Miller, "What to draw? when to draw?: an essay on parallel program visualization," J. Parallel Distrib. Comput., vol. 18, no. 2, pp. 265–269, 1993.
[31] Felix Wolf, Felix Freitag, Bernd Mohr, Shirley Moore, and Brian Wylie, "Large event traces in parallel performance analysis," in 8th Workshop Parallel Systems and Algorithms (PASA), Lecture Notes in Informatics, Frankfurt/Main, 2006, pp. 264–273.
[32] Omer Zaki, Ewing Lusk, William Gropp, and Deborah Swider, "Toward scalable performance visualization with Jumpshot," International Journal of High Performance Computing Applications, vol. 13, no. 3, pp. 277–288, 1999.
[33] Vincent Pillet, Jesús Labarta, Toni Cortes, and Sergi Girona, "PARAVER: A tool to visualize and analyze parallel code," Tech. Rep., In WoTUG-18, 1995.
[34] Andreas Knüpfer, Bernhard Voigt, Wolfgang E. Nagel, and Hartmut Mix, Applied Parallel Computing. State of the Art in Scientific Computing, chapter: Visualization of Repetitive Patterns in Event Traces, pp. 430–439, Springer Berlin, 2008.
[35] P. Gibbon, W. Frings, S. Dominiczak, and B. Mohr, "Performance analysis and visualization of the N-Body Tree code PEPC on massively parallel computers," John von Neumann Institute for Computing, NIC Series, vol. 33, pp. 367–374, 2006.

[36] Rodrigo Santamaría and Roberto Therón, Smart Graphics, chapter: Overlapping Clustered Graphs: Co-authorship Networks Visualization, pp. 190–199, Springer Berlin, 2008.
[37] R. Haynes, P. Crossno, and E. Russell, "A visualization tool for analyzing cluster performance data," in Cluster Computing, 2001. Proceedings. 2001 IEEE International Conference on, 2001, pp. 295–302.
[38] Kiran Bondalapati and Viktor K. Prasanna, Field Programmable Logic and Applications, chapter: DRIVE: An Interpretive Simulation and Visualization Environment for Dynamically Reconfigurable Systems, pp. 31–40, Springer Berlin, 1999.
[39] Seth Koehler and Alan D. George, "Performance visualization and exploration for reconfigurable computing applications," International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), 2010.
[40] Accellera, "SystemVerilog 3.1a language reference manual," http://www.eda.org/sv/SystemVerilog_3.1a.pdf, Retrieved September 2010.
[41] Accellera, "OVL open verification library manual, ver. 2.4," http://www.accellera.org/activities/ovl, Retrieved September 2010.
[42] Accellera, "PSL language reference manual, ver. 1.1," http://www.eda.org/vfv/docs/PSL-v1.1.pdf, Retrieved September 2010.
[43] M. Pellauer, M. Lis, D. Baltus, and R. Nikhil, "Synthesis of synchronous assertions with guarded atomic actions," in Formal Methods and Models for Co-Design, 2005. MEMOCODE '05. Proceedings. Third ACM and IEEE International Conference on, July 2005, pp. 15–24.
[44] M. Boule, J.-S. Chenard, and Z. Zilic, "Assertion checkers in verification, silicon debug and in-field diagnosis," in Quality Electronic Design, 2007. ISQED '07. 8th International Symposium on, March 2007, pp. 613–620.
[45] M. R. Kakoee, M. Riazati, and S. Mohammadi, "Enhancing the testability of RTL designs using efficiently synthesized assertions," in Quality Electronic Design, 2008. ISQED 2008. 9th International Symposium on, March 2008, pp. 230–235.
[46] K. Camera and R. W. Brodersen, "An integrated debugging environment for FPGA computing platforms," in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, Sept. 2008, pp. 311–316.
[47] Xilinx, "Xilinx ChipScope Pro software and cores user guide, v. 9.2i," http://www.xilinx.com/ise/verification/chipscope_pro_sw_cores_9_2i_ug029.pdf, Retrieved September 2010.
[48] Altera, "Design debugging using the SignalTap II embedded logic analyzer," http://www.altera.com/literature/hb/qts/qts_qii53009.pdf, Retrieved September 2010.

[49] Harry D. Foster, Adam C. Krolnik, and David J. Lacey, Assertion-Based Design, Springer, 2004.
[50] Farn Wang and Fang Yu, Real-Time and Embedded Computing Systems and Applications, chapter: OVL Assertion-Checking of Embedded Software with Dense-Time Semantics, pp. 254–278, Springer Berlin, 2004.
[51] SRC Computers, Inc., "SRC-7 Carte v3.2 C programming environment guide," 2009.
[52] P. Klemperer, R. Farivar, G. P. Saggese, N. Nakka, Z. Kalbarczyk, and R. Iyer, "FPGA implementation of the Illinois reliability and security engine," in International Conference on Dependable Systems and Networks (DSN), June 2006, pp. 220–221.
[53] Nithin Nakka, Giacinto Paolo Saggese, Zbigniew Kalbarczyk, and Ravishankar K. Iyer, Dependable Computing - EDCC 2005, chapter: An Architectural Framework for Detecting Process Hangs/Crashes, pp. 103–121, Springer Berlin, 2005.
[54] GNU, "The GNU C library reference manual," http://www.gnu.org/software/libc/manual, Retrieved September 2010.
[55] D. Pellerin and S. Thibault, Practical FPGA Programming in C, Prentice Hall PTR, 2005.
[56] D. S. Poznanovic, "Application development on the SRC Computers, Inc. systems," in Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, April 2005, pp. 78a–78a.
[57] Paul Graham, Brent Nelson, and Brad Hutchings, "Instrumenting bitstreams for debugging FPGA circuits," Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 41–50, 2001.
[58] IEEE, "IEEE standard test access port and boundary-scan architecture," IEEE Std 1149.1-2001, pp. i–200, 2001.
[59] John Curreri, Seth Koehler, Brian Holland, and Alan D. George, "Performance analysis with high-level languages for high-performance reconfigurable computing," Proc. 16th International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 23–30, 2008.
[60] NIST, "Data encryption standard (DES)," http://csrc.nist.gov/publications/fips/fips46-3/fips46-3.pdf, Retrieved September 2010.
[61] D. Pellerin, "Email transaction on adding comments to HDL for HLL source correlation," Dec. 2007.
[62] Sadaf R. Alam, Jeffrey S. Vetter, Pratul K. Agarwal, and Al Geist, "Performance characterization of molecular dynamics techniques for biomolecular simulations," in Proc. 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2006, pp. 59–68, ACM.

[63] John Ellson, Emden Gansner, Lefteris Koutsofios, Stephen C. North, and Gordon Woodhull, Graph Drawing, chapter: Graphviz - Open Source Graph Drawing Tools, pp. 594–597, Springer Berlin, 2002.
[64] CHREC, "CHREC facilities," http://www.chrec.org/facilities.html.
[65] Deepchip, "Mindshare vs. marketshare," http://www.deepchip.com/items/snug07-01.html, Retrieved September 2010.
[66] Impulse Accelerated Technologies, "CoDeveloper user's guide," 2008.
[67] Giovanni De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.

BIOGRAPHICAL SKETCH

John Curreri is a Ph.D. candidate in the Department of Electrical and Computer Engineering. He received his Bachelor of Science degree in computer engineering from the University of Alabama in Huntsville and a Master of Science degree from the University of Florida. He is a member of the translation and execution productivity group at the Center for High-Performance Reconfigurable Computing (CHREC). His current focus area is the performance analysis and verification of High-Level Synthesis. He has also been in the Advanced Space Computing (ASC) group at the High-performance Computing and Simulation (HCS) Research Laboratory. In the ASC group, his focus area while working with Honeywell on the National Aeronautics and Space Administration's (NASA's) New Millennium Project was system services. His research interests are reconfigurable, parallel, and fault-tolerant computing. He is an active member of the UF student branch of the Institute of Electrical and Electronics Engineers (IEEE) and a member of the American Institute of Aeronautics and Astronautics (AIAA).