
ReCAP

Permanent Link: http://ufdc.ufl.edu/UFE0042327/00001

Material Information

Title: ReCAP: A Novel Framework for Application Performance Optimization in Reconfigurable Computing
Physical Description: 1 online resource (128 p.)
Language: English
Creator: Koehler, Seth
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: analysis, application, bottleneck, exploration, fpga, hardware, optimization, parallel, performance, rc, reconfigurable, visualization
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Reconfigurable-computing (RC) applications employing both microprocessors and FPGAs (field-programmable gate arrays) have potential for large speedup when compared with traditional (software-based) parallel applications. However, this potential is marred by the additional complexity of these hybrid systems, making it difficult to identify performance bottlenecks and achieve desired performance. Performance analysis concepts and tools are well researched and widely available for traditional parallel applications but are lacking in RC, despite being of great importance due to the applications' increased complexity. In this document, we explore challenges, tradeoffs, and various techniques for automated instrumentation, low-overhead measurement, hierarchical visualization, performance exploration, common bottleneck detection, and tool-assisted optimization for RC applications. We specifically propose the Reconfigurable-Computing Application Performance (ReCAP) framework to address these challenges and provide a cohesive structure for the various tradeoffs and techniques discussed. ReCAP seeks to facilitate or automate much of the time-consuming and error-prone processes associated with manually analyzing performance for an RC application, enabling an application designer to more productively locate and remedy performance bottlenecks. Through this research, novel concepts as well as infrastructure for performance analysis and optimization of RC applications are introduced, analyzed, prototyped, and evaluated. Case studies using a prototype ReCAP tool are provided across a representative set of RC systems and applications to demonstrate the effectiveness of this framework. This tool is believed to be the first of its kind for RC, a relatively new but increasingly important paradigm of computing.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Seth Koehler.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Local: Adviser: George, Alan D.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2012-12-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0042327:00001





Full Text

PAGE 1

RECAP: A NOVEL FRAMEWORK FOR APPLICATION PERFORMANCE OPTIMIZATION IN RECONFIGURABLE COMPUTING

By
SETH L. KOEHLER

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2010

PAGE 2

© 2010 Seth L. Koehler

PAGE 3

I dedicate this dissertation to my Lord and savior, Jesus Christ – I would not be here without you; my wife, who has loved and supported me greatly throughout this journey (and who has made more chocolate cakes for my late-night work than should be numbered); my kids, Sammy and Emma, who have shown me love that can only come from your children – I am privileged to be your father; my parents (both mine and my wife's) for prayers, financial support, advice, and general moral support; my grandparents, siblings, other family members, and friends for your prayers, thoughts, and for good conversations that make me think. I am truly grateful for you all.

PAGE 4

ACKNOWLEDGMENTS

This work is supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422 and by equipment or tools provided by Aldec, Altera, Cray, GiDEL, Nallatech, Synplicity, Xilinx, and XtremeData. I would like to thank my committee – Dr. Timothy Davis, Dr. Ann Gordon-Ross, Dr. Herman Lam, Dr. Jih-Kwon Peir, and Dr. Gregory Stitt – for their advice and assistance as well as my advisor, Dr. Alan D. George, for his guidance, direction, and significant support through this work. I would also like to thank John Curreri for his work on an N-Queens application case study as well as overall insight into various parts of this work. Additionally, I thank Dr. Karthik Nagarajan for his initial implementation of a two-dimensional probability density function estimator used as a case study in this work. Finally, I would like to express my thanks to Rafael Garcia who aided significantly in the production of our first, throw-away prototype of an RC performance analysis tool.

PAGE 5

TABLE OF CONTENTS

ACKNOWLEDGMENTS ..... 4
LIST OF TABLES ..... 7
LIST OF FIGURES ..... 8
ABSTRACT ..... 10

CHAPTER

1 INTRODUCTION ..... 12
2 BACKGROUND AND RELATED RESEARCH ..... 18
  2.1 Reconfigurable Computing ..... 18
  2.2 Performance Analysis of Parallel Applications ..... 21
    2.2.1 Visualization ..... 23
    2.2.2 Knowledge-Based Bottleneck Detection ..... 25
  2.3 FPGA Performance Modeling and Simulation ..... 27
  2.4 FPGA Performance Analysis ..... 29
3 PERFORMANCE ANALYSIS FRAMEWORK FOR RECONFIGURABLE COMPUTING APPLICATIONS (PHASE 1) ..... 32
  3.1 Challenges for RC Performance Analysis ..... 32
    3.1.1 Challenges for Hardware Instrumentation ..... 33
      3.1.1.1 What to instrument ..... 33
      3.1.1.2 Levels of instrumentation ..... 35
      3.1.1.3 Modifying the application ..... 37
    3.1.2 Challenges for Hardware Measurement ..... 38
      3.1.2.1 Recording and storing performance data ..... 39
      3.1.2.2 Managing shared resources ..... 41
    3.1.3 Challenges for Performance Presentation ..... 43
    3.1.4 Unified Performance Analysis Tool ..... 46
  3.2 Framework ..... 47
    3.2.1 Instrumentation ..... 48
    3.2.2 Measurement ..... 50
  3.3 Case Study ..... 52
  3.4 Conclusions ..... 56
4 PERFORMANCE VISUALIZATION AND EXPLORATION (PHASE 2) ..... 58
  4.1 RC Performance Visualization ..... 60
  4.2 RC Performance Exploration ..... 65
  4.3 Results ..... 73

PAGE 6

  4.4 Conclusions ..... 75
5 PLATFORM-AWARE KNOWLEDGE-BASED BOTTLENECK DETECTION (PHASE 3) ..... 76
  5.1 Platform Templates ..... 78
    5.1.1 Software ..... 78
    5.1.2 Hardware ..... 82
  5.2 Bottleneck Detection in RC Applications ..... 83
  5.3 Common Bottlenecks in RC Applications ..... 93
    5.3.1 Communication Bottlenecks ..... 95
    5.3.2 Synchronization Bottlenecks ..... 98
    5.3.3 Internal Overhead Bottlenecks ..... 101
    5.3.4 Imbalance Bottlenecks ..... 102
  5.4 Case Studies ..... 103
    5.4.1 Time-Domain Finite Impulse Response ..... 103
    5.4.2 2-D Probability Density Function Estimation ..... 108
  5.5 Conclusions ..... 112
6 CONCLUSIONS ..... 114

APPENDIX

A SYSTEMS SUPPORTED BY RECAP TOOL ..... 116
  A.1 Nallatech H101 Cluster ..... 116
  A.2 XtremeData XD1000 ..... 116
  A.3 GiDEL PROCStar III Cluster ..... 117
B APPLICATION CASE STUDIES ..... 118
  B.1 N-Queens ..... 118
  B.2 Time-Domain Finite Impulse Response ..... 118
  B.3 Collatz Conjecture ..... 118
  B.4 2-D Probability Density Function Estimation ..... 119

REFERENCES ..... 120
BIOGRAPHICAL SKETCH ..... 128

PAGE 7

LIST OF TABLES

3-1 Comparison of source and binary instrumentation. ..... 37
3-2 Performance Analysis Overhead. ..... 53
5-1 RC platforms employed during case studies. ..... 104
5-2 Datasets evaluated for TDFIR benchmark. ..... 105
5-3 2D PDF results for both the Nallatech and XD1000 platforms. Speedup is given with respect to the software baseline executed on a Pentium 4 Xeon 3.2 GHz CPU. ..... 112

PAGE 8

LIST OF FIGURES

1-1 Potential communication bottlenecks (represented by arrows) in RC applications. ..... 13
2-1 Typical island layout of FPGA logic and routing fabric ..... 19
2-2 Stages of performance analysis ..... 23
3-1 Mockup Jumpshot visualization of an RC application with 8 CPUs and 8 FPGAs. ..... 43
3-2 Example hierarchical display of a complex RC application ..... 45
3-3 Additions made by source-level instrumentation of an RC application. ..... 47
3-4 Instrumentation of user source code. ..... 50
3-5 Hardware Measurement Module. ..... 51
3-6 Distribution of cycles spent in core state machines of N-Queens. ..... 55
3-7 Speedup of N-Queens Application. ..... 56
4-1 Traditional timeline visualization (Jumpshot) example demonstrating difficulties associated with presenting RC data ..... 59
4-2 Example of user-defined pragmas ..... 61
4-3 ReCAP visualizations for 2-core Collatz application ..... 64
4-4 Example of a system-level visualization of an RC application executing on a 6-CPU/3-FPGA RC system assumed to be part of a larger cluster ..... 66
4-5 Example of a node-level visualization of an RC application, showing 2 CPUs and an FPGA from the left hand side of the system-level visualization ..... 66
4-6 Example application for exploration. ..... 68
4-7 Potential timelines when optimizing an example application. ..... 68
4-8 Performance exploration methodology. ..... 70
4-9 Performance exploration using CST/CT updates ..... 72
4-10 Performance exploration for Collatz application. ..... 74

PAGE 9

5-1 Directed graph of an RC application that takes input from a sensor, processes data using a two-core pipeline, potentially offloads data to threads on a multi-core CPU for further processing, and finally stores results in DDR memory. ..... 84
5-2 Examples of user-defined pragmas. ..... 86
5-3 Default “reasons” provided by ReCAP for classifying API calls or HDL branches. ..... 89
5-4 Example of bottleneck detection results, showing inclusion of warning icons to indicate blocks with bottlenecks and a portion of the detailed bottleneck report. ..... 92
5-5 Taxonomy of common bottlenecks in an RC system. ..... 94
5-6 Platform transfer rate vs. transfer size across various RC systems and communication types, demonstrating common communication problems. The non-blocking transfer rate was computed as the sustained cumulative transfer rate of eight concurrent non-blocking transfers. ..... 97
5-7 TDFIR performance on various devices (including both the original and optimized FPGA performance after bottleneck detection). ..... 106
5-8 Speedup of initial and improved versions of the 2D PDF application when compared to a CPU baseline. ..... 109

PAGE 10

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

RECAP: A NOVEL FRAMEWORK FOR APPLICATION PERFORMANCE OPTIMIZATION IN RECONFIGURABLE COMPUTING

By
Seth L. Koehler
December 2010

Chair: Alan D. George
Major: Computer Engineering

Reconfigurable-computing (RC) applications employing both microprocessors and FPGAs (field-programmable gate arrays) have potential for large speedup when compared with traditional (software-based) parallel applications. However, this potential is marred by the additional complexity of these hybrid systems, making it difficult to identify performance bottlenecks and achieve desired performance. Performance analysis concepts and tools are well researched and widely available for traditional parallel applications but are lacking in RC, despite being of great importance due to the applications' increased complexity. In this document, we explore challenges, tradeoffs, and various techniques for automated instrumentation, low-overhead measurement, hierarchical visualization, performance exploration, common bottleneck detection, and tool-assisted optimization for RC applications. We specifically propose the Reconfigurable-Computing Application Performance (ReCAP) framework to address these challenges and provide a cohesive structure for the various tradeoffs and techniques discussed. ReCAP seeks to facilitate or automate much of the time-consuming and error-prone processes associated with manually analyzing performance for an RC application, enabling an application designer to more productively locate and remedy performance bottlenecks. Through this research, novel concepts as well as infrastructure for performance analysis and optimization of RC applications are introduced, analyzed, prototyped, and evaluated. Case studies using

PAGE 11

a prototype ReCAP tool are provided across a representative set of RC systems and applications to demonstrate the effectiveness of this framework. This tool is believed to be the first of its kind for RC, a relatively new but increasingly important paradigm of computing.

PAGE 12

CHAPTER 1
INTRODUCTION

While the performance of parallel computing systems (e.g., multicore CPUs, clusters, multiprocessor system-on-chips (MPSoCs)) continues to increase in both high-performance embedded computing (HPEC) and high-performance computing (HPC), the growing importance of power consumption has caused programmers in both fields to rely increasingly on explicit parallelism and accelerators rather than on instruction-level parallelism or clock frequency increases [1], [2], [3]. Due to these trends, reconfigurable computing (RC) [4], which typically employs both CPUs and reconfigurable hardware such as field-programmable gate arrays (FPGAs), has emerged as a viable field for providing orders-of-magnitude performance gains over microprocessors [5], [6] while simultaneously consuming little power in comparison to both microprocessors and GPUs [7], [8], [9]. However, the behavior of an RC application can be particularly difficult to observe and understand due to additional levels of hierarchical parallelism and complex interactions between heterogeneous resources inherent in such systems [10]. RC applications are by definition a superset of traditional parallel computing applications, containing all the problems and complexity of these applications and more due to their use of both microprocessors and FPGAs. To handle this complexity, performance analysis tools (tools that record application behavior during execution on a given system) are indispensable for application analysis and optimization, even more so than in traditional parallel computing where such tools are already commonly used and highly valued.

Unfortunately, traditional performance analysis tools are only equipped to monitor application behavior from the CPU's perspective. Systems such as the Cray XD1 [11] or those employing CPU-socket-compatible FPGA boards (e.g., XtremeData [12] or DRC [13]) are advancing the FPGA from slave to peer with CPUs, enabling the FPGA to independently interact with resources including main memory, CPUs, or other

PAGE 13

FPGAs. Due to the expanded capabilities of FPGAs in RC applications, conventional tools have an increasingly incomplete view of application performance, necessitating hardware-aware performance analysis tools that can provide a complete view of RC application performance. To illustrate this need, Figure 1-1 shows the hierarchy of parallelism and myriad of interactions inside an RC system, differentiating between communication that can be monitored (light arrows) and communication that cannot be monitored (dark arrows) by traditional performance analysis tools. With FPGA communication paths to CPUs, other FPGAs, and various levels of memory, the amount of unmonitored communication is significant, hindering the application designer's ability to understand and improve application performance.

Figure 1-1. Potential communication bottlenecks (represented by arrows) in RC applications.

To address a current lack of RC performance analysis research and tools, we present the details of our RC performance analysis framework and tool, ReCAP (reconfigurable computing application performance), which we believe to be the first of its kind. ReCAP is a single framework and tool for obtaining, visualizing, and analyzing performance data for an application using both CPU and FPGA devices inside an RC system, thus presenting a unified view of RC application behavior at runtime. We draw upon research, concepts, and techniques from HPC and HPEC performance analysis including instrumentation, measurement, visualization, bottleneck detection, and optimization strategies as well as from FPGA debugging, performance prediction, and tool portability. Without the support of a performance tool that implements such a

PAGE 14

framework, application designers are forced to manually obtain access and interpret information believed to affect performance as well as formulate optimizations and predict their effects on the application, requiring a great deal of designer effort and expertise. ReCAP specifically provides statistical and timing information for application events, displays RC-targeted visualizations, detects common bottlenecks, and suggests potential optimization strategies for detected bottlenecks, thus permitting rapid, intuitive understanding of application behavior in both software and hardware, often with little-to-no user assistance. Simultaneously, we attempt to minimize resource and runtime overhead, reducing the chance that monitoring will alter the performance behavior of the application. With these capabilities, ReCAP aims to significantly increase application designer productivity, assisting designers throughout the entire optimization process.

We subdivide our research into three phases. The first phase involves exploration of the challenges faced in attaining low-overhead, automatable techniques for instrumentation and runtime measurement of RC applications as well as for basic visualization of performance data. Techniques are discussed to address these challenges and details of the ReCAP framework are presented, providing a comprehensive methodology for RC performance analysis. This framework is then integrated into an existing parallel performance analysis tool, Parallel Performance Wizard (PPW) [14], [15], and results from a case study are provided to demonstrate ReCAP's utility and minimal overhead.

The second phase of this research then details an extension to the research presented in Phase 1 in order to provide performance visualization and exploration for RC applications. Traditional visualization strategies have serious limitations for RC due to their assumption of typically homogeneous, fixed-architecture devices (e.g., CPUs). In contrast, RC applications commonly employ intricate hierarchies of heterogeneous parallelism that does not scale well within current visualizations. Further, visualizations

PAGE 15

cannot abstract away most of the hardware (as is typical for CPU-based visualization) since this hardware is now part of the designer's application. Without RC-targeted visualization, an application designer is hindered in their ability to quickly understand, and thus optimize, their application's behavior. Additionally, once an application designer formulates potential optimizations, it is often unclear, due to complexity inherent in RC applications, as to what the effects of these optimizations will be on the entire application's performance. While design-space exploration (DSE) and performance prediction techniques are well researched, these techniques are typically employed before an application is constructed, and thus must make assumptions that can reduce prediction accuracy. In contrast, ReCAP's performance exploration employs runtime performance data to predict the effects of an optimization, sacrificing flexibility in the magnitude of changes that can be modeled for potentially increased prediction accuracy. This phase will involve the study of visualization and DSE techniques as well as the development of visualizations within ReCAP. These techniques are then demonstrated within an application case study.

In the third phase of this research, we present a general framework for platform-aware, knowledge-based bottleneck detection, aiding an application designer in quickly locating and remedying performance bottlenecks across a diverse range of RC platforms. Applications often contain similar bottlenecks such as load-balancing or communication-related problems, but locating and resolving these bottlenecks remains a difficult and tedious process, especially given different approaches needed to effectively monitor and address these bottlenecks on diverse RC systems and APIs. The presented framework and tool are extensible in that users may easily add support for their own platform, as long as it fits within ReCAP's platform-template model, and may also easily modify the bottleneck knowledge base. The first taxonomy of common RC bottlenecks, per an extensive literature review, is also presented (including detection and optimization strategies) and used to populate ReCAP's knowledge base. Although useful for novice

PAGE 16

RC designers who may be unaware of many potential bottlenecks and optimizations, experienced RC designers can benefit as well from quick feedback on the location and severity of performance problems.

It is important to note that while we focus on improving runtime performance throughout this research, these concepts and techniques can also be of use for reducing power consumption or resource usage as well. For example, an increase in raw application performance may allow an application designer to decrease the number of hardware cores or the clock frequency on the FPGA while still providing the required performance. Conversely, optimization suggestions (such as those given in Phase 3) may indicate that resources could be reduced without sacrificing performance, allowing these resources to be repurposed elsewhere for improved performance or to be left unused, reducing power or even permitting a smaller FPGA to be employed. Thus, optimizations may be used to achieve a balance amongst runtime performance, resource usage, and power consumption.

This research consists of background, details, and results from each phase of research as well as concluding remarks and appendices. Specifically, Chapter 2 provides background and related research in the area of traditional (CPU-based) performance analysis including visualization and bottleneck detection, RC debugging techniques that relate to performance analysis in RC, and limited research in RC performance analysis and RC tool portability. Chapter 3 then details the challenges, techniques, and tradeoffs of performance analysis in RC, proposing the initial ReCAP framework based on this research and providing a case study to demonstrate the framework's utility on an RC application (Phase 1). Next, Chapter 4 covers challenges and techniques associated with model-appropriate visualization of RC performance data and performance exploration using runtime performance data, including details of a prototype visualization for rapid, intuitive understanding of RC application behavior (Phase 2). Then, Chapter 5 describes an extension to the Phase 1 and 2 frameworks

PAGE 17

that aids in the detection and optimization of common performance bottlenecks across diverse RC systems, providing the user with an easy mechanism to add support for additional RC systems and bottlenecks as needed (Phase 3). Finally, conclusions are provided in Chapter 6 with appendices containing data on RC systems (Appendix A) and RC applications (Appendix B) used in the study and evaluation of ReCAP.

PAGE 18

CHAPTER 2
BACKGROUND AND RELATED RESEARCH

The background and related research is split into four sections. Section 2.1 presents an overview of RC systems and application design, focusing almost exclusively on FPGAs. Then, Section 2.2 details background and related research in performance analysis of parallel applications (including visualization and bottleneck detection). Next, Section 2.3 provides an overview of RC performance modeling and simulation. Finally, Section 2.4 covers background and related research specifically for FPGA performance analysis, along with related concepts from analytical and simulative FPGA performance prediction and FPGA debugging.

2.1 Reconfigurable Computing

Only a brief, related overview of the field of RC is given here; see Compton and Hauck's superb overview of this field [4] as well as a treatment of RC for accelerating computation [16] for a more thorough treatment. RC includes a broad array of devices that allow some form of hardware programmability, in contrast to instruction-controlled devices or hard-wired application-specific integrated circuits (ASICs). While RC employs devices such as programmable array logic devices (PALs), complex programmable logic devices (CPLDs), and field-programmable object arrays (FPOAs) [17], field-programmable gate arrays (FPGAs) have become the predominant device used in RC systems [4], [16]. Modern FPGAs typically consist of a mesh of logic blocks and routing blocks, which are depicted in Figure 2-1 (see Altera's Stratix II [18] and Xilinx's Virtex-4 [19] devices as well as [4] for more information). Logic blocks typically contain LUTs (lookup tables, which define the mapping between some number of inputs and one or more output bits) and state holding circuitry such as DFFs (D flip-flops) or latches. Routing blocks, such as connection and switch boxes, are used to connect logic components to each other and to I/O pins in a myriad of ways. FPGAs often

PAGE 19

contain coarse-grained components as well, such as memories, arithmetic blocks (e.g., multipliers), embedded processors, and communications hardware.

Figure 2-1. Typical island layout of FPGA logic and routing fabric

Designers have traditionally programmed RC systems using hardware description languages (HDLs) such as VHDL and Verilog. The hardware portion of the design is described in an HDL, synthesized into a netlist (gate-level model of design) via synthesis, and then implemented into an object file via mapping, place and route, and object file generation. The software portion of the application that interacts with the FPGA is written using a typical language such as C/C++ and uses an API to access the FPGA (e.g., function or method calls, memory-mapped I/O). This software is then compiled and executed on the target RC system's CPU; this software typically configures the FPGA with the generated object file (although the object file may be preloaded before software execution).

Alternatively, designers may use high-level languages (HLLs) such as C/C++ or Java to describe FPGA hardware. The HLL must then be translated into an HDL or netlist by the compiler and may then follow the standard flow given in the previous

PAGE 20

paragraph; this process is commonly referred to as high-level synthesis (HLS). HLLs are an attractive alternative to HDLs since HLLs are significantly easier to use, enable faster development times, and may aid portability across platforms. In practice, these benefits may be offset by performance that is lower and more difficult to predict (when compared to a pure-HDL design) as well as by incomplete translation tools that only permit a subset of the HLL to be used [20].

In addition to the CPUs and FPGAs found in RC systems, other accelerators such as graphics processing units (GPUs), the Cell processor, and digital-signal processors (DSPs), may be employed as well. Each device provides a tradeoff between metrics such as speed, cost, power, ease of programming, and flexibility; see [6], [7], [8], [9], [21], [22] for comparisons between these devices. In contrast to these accelerators, FPGAs permit a programmer to customize hardware directly for a given application, potentially yielding greater speed and efficiency (in terms of resources employed by the application as well as power consumption). While ASICs also permit hardware to be customized and perform even better in terms of speed and efficiency, ASICs are fixed after fabrication and incur prohibitive, non-recurring costs. Due to these costs, many ASIC designs target 130 nm or larger process nodes, eliminating any advantage these ASICs might have in terms of speed or efficiency with respect to current 40 nm FPGAs [22]. Thus, when compared to other accelerators, FPGAs are well-suited for general-purpose application acceleration due to their ability to provide fast, efficient, customized logic for a given task while simultaneously retaining significant flexibility (in that the device may be reconfigured within milliseconds to accelerate another application).

Working in a larger, heterogeneous system, FPGAs can assist CPUs as a coprocessor, act as peers with other CPUs, or even as a master depending on the system architecture and application. FPGAs may be coupled within the system at various points including inside the CPU (as a functional unit), in a system-on-chip (SoC),

PAGE 21

in a proprietary coprocessor, in a standalone CPU-socket, in a peripheral connection such as PCI-X/e, or as a standalone unit connected via external I/O such as Ethernet or USB [4]. A tighter coupling usually implies a lower latency to access the device, which may in turn allow the FPGA to be used for finer-grained tasks; tasks accelerated by loosely coupled FPGAs are typically coarse-grained in order to compensate for increased latency (e.g., given a 1 ms latency between the CPU and FPGA, the FPGA is unlikely to accelerate a simple addition or multiplication since the CPU could compute this itself in far less time).

In summary, RC incorporates a large number of devices, languages, and system architectures that have the potential to significantly accelerate applications while simultaneously using less power and retaining flexibility (when compared to traditional microprocessor-only systems), permitting the acceleration of a wide range of tasks. This flexibility is concomitantly the source of strength and weakness for RC devices. Flexibility allows custom hardware to be employed for the task at hand, typically improving performance and reducing power usage by removing the overhead associated with the fetch, decode, and management of instructions as well as with the memory hierarchy that is ubiquitous in microprocessors. Unfortunately, this flexibility also corresponds to increased difficulty in programming, debugging, and analyzing the performance of RC applications within these devices.

2.2 Performance Analysis of Parallel Applications

A “performance bottleneck” refers to a (proper) subset of components in a system that adversely affects system performance as a whole (the term “bottleneck” is taken from the notion of a bottle's neck that restricts the flow of liquid from the larger bottle through a smaller opening). Thus, the goal of performance analysis is to understand a program's runtime behavior on a given system in order to locate and remedy performance bottlenecks. Concepts for optimization, such as making the common case
fast (often derived from Amdahl's law) and focusing on bottlenecks in the application's critical path, motivate assorted questions in performance analysis such as the following:

- Where does an application spend most of its time?
- Where and why does an application get delayed?
- What resources are used heavily by an application?
- What effect would an optimization of one portion of an application have on the entire application?

Designers may take a manual approach to performance analysis, inserting timing routines and print statements into their code to locate problems and find additional information needed to remedy the problem. However, given the tedious and error-prone techniques involved in gaining access to, recording, storing, analyzing, and presenting a typically vast amount of data related to application performance, performance analysis tools were developed to facilitate this process; these tools assist a user in obtaining and presenting performance data recorded at runtime from the actual execution of an application on a given system (hereafter, “performance analysis” refers to tool-assisted performance analysis rather than manual performance analysis).

For parallel-computing applications, Malony's TAU framework [23] provides a good introduction to the various challenges and techniques in performance analysis. As shown in Figure 2-2, performance analysis may be decomposed into stages including gaining access to application data (instrumentation), recording and storing that data at runtime (measurement), optionally analyzing recorded data for performance problems (automated analysis), visualizing performance data and analysis results (presentation) in order to allow the designer to carry on further analyses (manual analysis), and finally strategizing and implementing changes within the application in order to ameliorate located performance bottlenecks (optimization). These steps may be repeated until desired performance is achieved or no further performance gains seem likely.
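
The last of the questions listed in this section (what effect optimizing one portion of an application has on the whole) is commonly estimated with Amdahl's law, which the text cites as the origin of "making the common case fast". The following Python sketch is illustrative only; the function name and the sample fractions are assumptions chosen for the example, not values or code from this work.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the runtime is accelerated
    by a factor of `local_speedup` (Amdahl's law)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    # The unaccelerated part keeps its cost; the rest shrinks by local_speedup.
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # Hypothetical case: a region is made 10x faster; vary how much of
    # the total runtime that region originally accounted for.
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(amdahl_speedup(fraction, 10.0), 2))
```

The sketch shows why locating where an application spends most of its time matters: the same tenfold local improvement yields a much smaller overall gain when the optimized region covers only half of the runtime.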

Figure 2-2. Stages of performance analysis

A number of performance analysis tools for parallel programs have been developed with a variety of purposes and features. Tools may be simple MPI (message-passing interface) library wrappers such as mpiP [24], visualization suites such as Jumpshot [25] and Vampir [26], or full-fledged performance tools supporting multiple languages, programming models, methods of instrumentation and measurement, and automated analyses (e.g., IBM High-Performance Computing Toolkit [27] and High-Productivity Computing Systems Toolkit [28], Intel Trace Analyzer and Collector Tool (based on Vampir) [29], Paradyn [30], Paraver [31], PPW [14], [15], Scalasca (formerly known as KOJAK) [32], SCALEA [33], SvPablo [34], TAU [23]). Chung et al.'s recent study of performance analysis tools on the Blue Gene/L provides a comparison of some of these tools [35]. These tools provide a wealth of concepts and techniques that are leveraged throughout this research. ReCAP extends PPW by adding support for RC applications, leveraging PPW's support for traditional parallel applications using CPUs.

2.2.1 Visualization

Performance visualization seeks to rapidly convey application behavior to a user. Due to a lack of previous work in visualization for RC applications, we employ research in visualization for traditional CPU-based parallel applications, which suggests that visualizations should be concise, easy-to-use, informative, appropriate to the programming model, scalable, and interactive [36], [37]. Performance data may be
presented in a number of ways, including text-based methods such as data tables as well as more common graphical forms such as charts, graphs, timeline views, Kiviat diagrams, hypercubes, animations, etc. [37], [38]. Examples of visualization methodologies and tools can be found in HPCToolkit's HPCView [39], the IBM High Performance Computing Toolkit's PeekPerf [40], Intel's Trace Analyzer and Collector [29], Jumpshot [25], Paradyn [30], Paraver [31], PPW [14], [15], Scalasca's CUBE [32], SCALEA [33], SvPablo [34], and TAU's Paraprof and PerfExplorer [23].

Goals such as simplicity and scalability have received significant study. Knupfer et al. discuss problems associated with the size of generated trace data, problems associated with visualizing this amount of data, and techniques to reduce the amount of visualized data by exploiting patterns found and highlighting any deviation from these patterns [41]. Hackstadt et al. focus on challenges and techniques of scalability of timeline views, demonstrating the use of 3-D visualizations [42]. Reed et al. present techniques and associated benefits of using virtual reality to present an ever-increasing amount of data in a sensible fashion [43].

Visualization research pertaining to “appropriateness to programming model” has traditionally received less attention in HPC due to MPI's dominance in the parallel programming community, although the advent and spread of newer parallel languages, such as partitioned global address space (PGAS) languages, has been coupled with new visualizations for those languages [14], [15]. Nonetheless, visualization can often benefit from a user's high-level model of the application that is not present in (or not easily extracted from) source code, using this model to structure performance data in an intuitive, application-specific context (and potentially reducing instrumentation overhead as well) [44], [45], [46]. Specifically, Sefika et al. propose the idea of architecture-aware visualization to aid in the design of complex software systems (they specifically focus on the design of the Choices operating system), where “architecture” here refers to the high-level conceptual organization of software [45]. Their approach involves
instrumenting the application and visualizing performance data in terms of the high-level subsystems designers envisioned during formulation of the system (e.g., file system, device system, virtual memory system), before these were translated to many lower-level classes, objects, and methods that implement these systems. They demonstrate that by visualizing the performance in terms of the application as the designer wrote and envisioned it, such an approach is immediately more intuitive to the designer and facilitates analyses that are not possible when viewing the design as a disparate set of classes, objects, and methods. García et al. employ this concept in the embedded realm by describing instrumentation and visualization in terms of a behavioral model of the application to gain additional insight into concurrent execution inherent in these types of systems [44]. In both cases, a designer's model of the system is used both to guide instrumentation and to visualize performance data. This permits more targeted, low-overhead instrumentation and allows visualizations to provide an intuitive view of overall system performance that matches the programming model in use. From this vantage point, a designer can view problem areas from a high level, delve deeper into component details to find additional information within a hierarchy, and then quickly connect their findings back to the high-level abstraction of their design, significantly streamlining the optimization process.

2.2.2 Knowledge-Based Bottleneck Detection

Knowledge-based bottleneck detection is a form of automated analysis designed to locate and describe common performance bottlenecks in an application based upon specific knowledge about where and how bottlenecks may occur. The goal of knowledge-based bottleneck detection is to reduce both the effort and expertise required to optimize an application, accelerating the optimization process; for a thorough overview of various frameworks and tools for automatic analysis, see [47]. Without bottleneck detection, the designer must understand the intricacies involving where bottlenecks can occur, understand how to detect each bottleneck, instruct the
tool to monitor relevant data, interpret the performance data to locate bottlenecks, understand what optimizations may be effective against each bottleneck, and determine the expected performance improvement if a given bottleneck were remedied (in order to ascertain whether the bottleneck is worth remedying).

While no direct research exists for enumerating or categorizing RC bottlenecks, nor has any work demonstrated automatic bottleneck detection for RC applications, DeHon et al. presented numerous “design patterns” for the purpose of improving application efficiency on reconfigurable systems; these design patterns could be included as optimization suggestions for relevant bottlenecks. For traditional HPC systems (i.e., systems without FPGAs), Mohr and Wolf argue for the necessity of automatic detection and categorization of bottlenecks when analyzing the performance of an application, which they implement in the KOJAK performance tool (now called Scalasca) [48]. This tree-based categorization breaks up total time spent into execution and idle time, with execution further broken down by language used (useful for multi-language applications such as MPI/OpenMP hybrid applications). Each language is then subdivided into communication, I/O, and synchronization-related bottlenecks, which are then further subdivided; for example, the communication category is divided into subcategories such as collective operations (e.g., late broadcast, early reduce, all-to-all) and point-to-point operations (e.g., late sender/receiver). Jorba et al. discuss another knowledge-based bottleneck detection tool, KappaPI 2, that employs an XML format to store its knowledge base of bottleneck definitions, allowing the tool to locate patterns representing performance problems in trace data collected from message-passing applications [49]. Static code analysis is used in conjunction with a specification of bottleneck “instances” (also in XML format) to indicate the cause or conditions of each bottleneck and to provide hints as to how the bottleneck may be remedied. Su et al. provide details of their automated analysis framework within Parallel Performance Wizard (PPW) [47]. Their tool provides automated, model-independent detection as
well as distributed analysis capabilities to reduce analysis time. Truong and Fahringer provide a very detailed scheme for performance overhead classification in SCALEA, a profile- and trace-based performance analysis tool that allows user specification of metrics and code regions of interest [33]. Their classification for performance bottlenecks includes data movement, synchronization, control of parallelism, additional computation, and loss of parallelism, each of which is further subdivided to better identify the specific reason for the bottleneck (e.g., the “data movement” category subdivides into numerous subcategories including point-to-point and collective communication, remote get/put, memory accesses including cache behavior and page faults, and both local and remote file I/O). Chung et al. detail their framework and tool for automatic bottleneck detection using an extensible, rule-based system and performance data obtained from the IBM High-Performance Computing Toolkit [50]. “Rules” are created using “metrics” (both of which may be user-defined), which in turn can depend on parameters such as the target system, enabling the tool to provide detailed information for all bottlenecks detected, including the expected performance improvement if the bottleneck were removed. However, there are a number of differences in RC systems and applications that must be addressed.

2.3 FPGA Performance Modeling and Simulation

FPGA performance analysis should not be confused with analytical modeling or simulation, which provide estimates of application performance that must eventually be verified against actual runtime performance; performance analysis is essential for capturing actual application behavior on a target system for the purpose of optimization. Nonetheless, performance modeling and simulation (both of which are used in the more general context of design-space exploration, or DSE) are vital for predicting application performance. DSE employs analytical modeling or simulation to strategically explore alternative application designs or system architectures based on metrics such as performance, power, and resources. As generally no source code exists, the user
supplies parameters or models of application or system behavior to predict performance. Balsamo et al. provide a good survey of performance-driven modeling techniques for software applications [51]. Process algebras and calculi can also be used to predict performance, from which some concepts are leveraged in ReCAP's performance exploration framework [52].

Holland et al. proposed an analytical model called the RC amenability test (RAT) for quickly estimating the speedup of an RC application over a software baseline using various parameters including system bandwidth, read/write efficiency, input and output size, and FPGA clock frequency [53]. Smith et al. present an analytical model for a network of shared, heterogeneous workstations each containing RC hardware, including many parameters for predicting load imbalances and shared contention on nodes as well as the communication and computation parameters of the system and application itself [54]. Such techniques have been shown to be very useful for early estimates of application performance before application development has even begun. However, analytical models may leave out many details that can significantly affect performance on an actual system, may be inaccurate if parameters (or even interactions between parameters) are mis-predicted, and are generally unhelpful in explaining why an application performed differently than expected.

Simulation-based methods for performance analysis typically provide higher fidelity than analytical methods since less detail is abstracted away. While simulation is also very useful for estimating application performance before investing significant resources into RC systems and application development, simulation must generally balance accuracy and speed; extremely accurate simulations can be orders of magnitude slower than actual execution. In addition, building a simulator is a significant undertaking in itself. Reardon et al. developed an environment for simulating RC application performance on an RC system [55]. The application is characterized by a script while the RC system is typically built from pre-defined components (e.g., network
switches, CPUs, FPGAs). An error of under 7% was reported for simulations that executed reasonably fast (roughly 12 to 14 seconds) for a synthetic-aperture radar (SAR) application, although this accuracy depends in turn on the accuracy of the RC-application script model. Densmore et al. presented an approach for automatically extracting performance information from an IP core library on an embedded RC system and incorporating this data into the METROPOLIS simulation environment for more accurate simulation [56]. However, this work assumed that the design made substantial use of these IP cores. Bondalapati and Prasanna provide details of their DRIVE framework for simulation and visualization of dynamically reconfigurable systems [46]. High-level parameterized models for reconfigurable hardware are provided by the user, which are then used to analyze expected application performance and produce visualizations. Unfortunately, even with accurate simulations of an FPGA, it is generally not possible to model the entire system at a high level of accuracy (e.g., modeling all CPUs, communication hardware, memories) due to performance constraints, a lack of high-fidelity models for each component, or a lack of interactivity between different simulators for each component. Without an inclusive simulation of the system, the behavior and performance of the simulated application may differ significantly from the actual application.

2.4 FPGA Performance Analysis

While preliminary work exists in RC performance analysis, this field is significantly less mature than its software counterpart. DeVille et al.'s paper investigates the use of distributed and centralized performance analysis probes in an FPGA but is limited in scope to efficient, simple measurement within a single FPGA [57]. Schulz et al.'s OWL framework paper proposes the use of FPGAs for system-level performance analysis, monitoring system components such as cache lines, buses, etc. [58]. However, their work is directed at monitoring software behavior from hardware, rather than monitoring hardware itself. Similarly, Kirschbaum et al. detail the HarMonIC (Hardware Monitor for
Interprocess Communication), but their focus is on monitoring bus communication in embedded environments using an FPGA, rather than monitoring an application on the FPGA itself [59].

The field of RC debugging provides some overlapping techniques for instrumentation and measurement of application data. Unfortunately, this overlap is limited by fundamental differences in their respective purposes. For example, debug techniques such as breakpointing and FPGA readback are useful for reading (and even controlling) the entire state of the FPGA. Camera et al. propose these techniques in the BORPH (Berkeley Operating system for ReProgrammable Hardware) [60] while Wheeler et al. propose a fairly portable approach to readback and control of the FPGA using a custom scan-chain [61]. Bellows also discusses similar ideas, including clock-stepping and transparent access to memories, in the SLAAC (System Level Applications of Adaptive Computing) project [62]. Unfortunately, these techniques must “pause” the FPGA application's clock in order to retrieve data, effectively isolating the FPGA from the rest of the system, which typically cannot be paused in unison. While isolation is encouraged in debugging, it is extremely problematic in performance analysis since component interaction in the system is a key factor. Tools such as Altera's SignalTap [63] and Xilinx's ChipScope [64] do allow an FPGA to run at or near full speed in a system (minimizing changes to application behavior), but are designed to monitor exact values at each cycle over a given period to ensure correctness, much like a logic analyzer. In contrast, performance analysis assumes correctness and is instead concerned with timeliness of application progress, often allowing some data to be summarized or ignored. By reducing the data recorded, fewer storage and communication resources are necessary to monitor an application, minimizing the distortion of the original application's behavior. In addition, SignalTap and ChipScope require separate connectors (e.g., JTAG) to acquire data, which are not readily accommodated or available for many systems.

Finally, we briefly discuss portability issues for RC performance analysis. Due to the lack of standardized APIs for RC systems, tool designers must either attempt to support a myriad of platform APIs or provide the user with a mechanism to add support for their own platform. Due to the difficulties in adding and maintaining support for, or even gaining access to, each new platform and API version, the first approach typically results in a limited number of supported platforms and thus limited tool applicability. The latter approach is taken by some HLS tools, such as Impulse C's platform-support packages [65] or ROCCC's platform interface abstraction layer [66], as well as by frameworks supporting generic communication between heterogeneous devices, such as Auto-Pipe [67] or the System-Level Coordination Framework (SCF) [68]. Unfortunately, for performance analysis (and specifically bottleneck detection), additional information is required beyond what is necessary for portability (e.g., bytes transferred or expected bandwidth for each transfer type). In addition, adding platform support may require significant expertise and effort (e.g., both Impulse C and ROCCC require a manual conversion between a platform's API and their specified interface), limiting the potential for widespread use.

CHAPTER 3
PERFORMANCE ANALYSIS FRAMEWORK FOR RECONFIGURABLE-COMPUTING APPLICATIONS (PHASE 1)

In this phase of research, we explore the challenges faced in attaining low-overhead, automatable techniques for instrumentation and runtime measurement of RC applications as well as for visualization of obtained performance data. While significant challenges exist in each of the five stages of performance analysis, the instrumentation, measurement, and to a lesser extent presentation stages form the core of performance analysis that is built upon by more advanced visualization, automated analysis, and optimization. Thus, this phase focuses on challenges associated with instrumentation and measurement while briefly discussing concepts towards presentation and a unified performance analysis tool. We also discuss techniques to address these challenges and present the details of the ReCAP framework that provides the core methodology for RC performance analysis using these techniques. We then integrate this framework into an existing parallel performance analysis tool, Parallel Performance Wizard (PPW) [14], [15], and provide results from a case study to demonstrate the utility and minimal overhead of our ReCAP tool.

3.1 Challenges for RC Performance Analysis

Within instrumentation and measurement, the key goals of performance analysis tools are the following (adapted from [23]):

1. Perturb the original application's behavior as little as possible (minimize impact).
2. Record sufficient detail & structure to accurately reconstruct application behavior (maximize fidelity).
3. Allow flexibility to monitor diverse applications and systems (maximize adaptability and portability).
4. Require as little effort from the designer as possible (minimize inconvenience).

Goals 1 and 2 are opposed to one another, as are Goals 3 and 4. Thus the challenges faced generally stem from attempting to reach a compromise. The
following two goals for presentation provide some context for the four goals above, as presentation uses the measured data to reconstruct application behavior:

5. Display only what is necessary to capture application behavior and bottlenecks (be concise).
6. Format data to allow rapid understanding of application behavior (be intuitive).

3.1.1 Challenges for Hardware Instrumentation

Instrumenting a hardware design involves gaining entry points to signals (i.e., wires) in the application. A logic analyzer exemplifies this process with logic probes connected to external pins that are in turn connected to values of interest in the application. By taking advantage of the reconfigurability of an FPGA, we can use the built-in routing resources to temporarily access application data, acquiring the necessary entry points for measurement. Instrumentation involves choosing what data to instrument, choosing the level(s) of instrumentation (e.g., source, binary, etc.), and finally modifying the application at the chosen level to gain access to the selected data. These issues are discussed in the following subsections.

3.1.1.1 What to instrument

Instrumenting an application begins with a selective process that determines what data to record and what to ignore. The data chosen should reflect application behavior as closely as possible while simultaneously minimizing perturbation of that behavior (Goals 1 and 2). While application knowledge is useful in making these selections, it is desirable to automate this time-consuming process when possible (Goal 4). Software performance analysis has demonstrated that such automation is possible by using knowledge of what constitutes a common performance bottleneck to guide instrumentation. Thus, one key challenge in FPGA instrumentation is determining where and how performance bottlenecks can occur in a typical FPGA design.

Applications consist of communication and computation, both of which must be monitored to understand application behavior. Software performance analysis
typically monitors specific constructs that invoke communication explicitly or implicitly through synchronization primitives such as barriers and locks. Computation is typically monitored by timing function calls or other control structures such as loops, which are similar to the mechanisms used to control subcomponents in hardware (e.g., state machines, pipelines, loop counters, etc.). Thus, these hardware communication and control constructs provide a starting point for studying common performance bottlenecks in an FPGA.

In an FPGA, communication includes off-board (e.g., to another FPGA, CPU, main memory, etc.), on-board (e.g., to on-board DDR memory or other FPGAs connected to the FPGA on the same board), or on-chip (between components inside the FPGA device) communication. Off-chip communication (i.e., off-board and on-board communication) is widely known to be a potential bottleneck in FPGA-based system designs. Nonetheless, on-chip communication can be a significant bottleneck as well, especially if some form of routing network or data distribution is implemented in the design (a common technique used in applications containing multiple cores to exploit parallelism). Instrumenting on-chip communication between components (e.g., to observe frequency of communication or bytes transferred) can help the designer to better understand how the component is used. However, due to the large amount of parallelism possible in an FPGA, monitoring all on-chip communication can incur significant overhead.

Control can become a bottleneck when too many cycles are used for setup, completion, or bookkeeping tasks. However, the primary reason for instrumenting control is to gain insight into the application's behavior, helping the designer to locate other bottlenecks. As an example, if a state machine contains a state that waits for data from an FFT core, recording the number of cycles spent in this wait-state can determine whether the FFT core is a bottleneck in the application. This information is comparable
to that obtained by a software performance analysis tool monitoring the amount of time an FFT subroutine required.

It is important to note that instrumentation should generally be restricted to clocked elements in hardware. Synthesis and place-and-route tools already optimize delays associated with unclocked (combinatorial) signals; these delays can be analyzed via timing analysis, simulation, or debugging tools. Even in designs that are primarily combinatorial, there is inevitably some clocked portion of the design that handles control or communication (and often multiple levels of control and communication), demonstrating the wide applicability of these concepts across designs (Goal 3).

Thus, communication and control are reasonable points to instrument initially. However, application knowledge can often give further insight into what should be instrumented. Certain control and communication may be unnecessary to monitor in a specific application; performance may be better understood by monitoring a specific input value to a component. This application knowledge is extremely difficult to automate, and thus determining what to instrument remains a significant challenge in RC performance analysis.

3.1.1.2 Levels of instrumentation

Before reaching the challenge of modifying an application for measurement, the level at which instrumentation will occur must be selected. The hardware portion of an RC application can be instrumented at any level between source code (e.g., VHDL) and the FPGA configuration file (binary loaded directly onto the FPGA). While it is also possible to use system-level instrumentation (e.g., OWL [58] discussed in Section 2.4), this approach lacks portability due to the requirement of dedicated hardware to monitor system components such as cache lines, buses, etc. In addition, data unrelated to the application is also captured, such as the behavior of the operating system and other running applications, making system-level instrumentation less suitable for performance
analysis of a specific application. System-level instrumentation is thus not considered further here.

Graham et al. provide an excellent look at the various levels and associated tradeoffs of application-level instrumentation inside an FPGA [69]. They indicate that while instrumenting at intermediate levels between source code and binary offers some advantages (e.g., modifying clean abstract syntax trees as opposed to source code or binaries), these advantages are not significant enough to counterbalance the poor documentation and difficulty of accessing these levels (some levels exist only in memory during synthesis and implementation). Thus, the levels of instrumentation are in practice polarized into source-level and binary-level instrumentation.

Source instrumentation is attractive since it is easier to implement, is fairly portable across devices, is flexible with respect to which signals can be monitored, and often minimizes the change in area and speed of the instrumented design due to optimization of the design after instrumentation. Source instrumentation also offers the possibility of source correlation, allowing behavior to be linked back to source code.

In contrast, binary-level instrumentation is attractive because it requires less time to instrument a design (e.g., minutes instead of hours, as it occurs after place-and-route), is portable across languages for a specific device, and perturbs the design layout less, again since it is mostly added after the design has been optimized and implemented. Unfortunately, binary instrumentation for FPGAs is very difficult, much more so than instrumenting assembly code in a software binary. In addition, binary instrumentation loses some flexibility since synthesis and implementation may have significantly transformed or eliminated some data during optimization or made some data inaccessible via the FPGA routing fabric. Links between behavior and source code are also lost.

It is also possible to apply instrumentation at both levels, allowing the designer to select the appropriate compromise for each instrumented datum. Table 3-1 provides a summary of the comparison between source and binary instrumentation.

Table 3-1. Comparison of source and binary instrumentation

                      Source-level                  Binary-level
Difficulty            Text parsing                  Bit-level signal routing
Design perturbation   Low change in area & speed    Low change in on-chip physical layout
Time to instrument    Long (hours)                  Short (minutes)
Portability           Good across devices           Good across languages
Flexibility           Access to all signals         Some data inaccessible
Source correlation    Possible                      Generally not possible

3.1.1.3 Modifying the application

Once an instrumentation level has been selected, the application must be modified to allow access to whatever data has been chosen for instrumentation. While both source and binary instrumentation can draw heavily from similar techniques in software and FPGA debugging, automatic instrumentation based upon the decision to instrument control and communication (discussed in Section 3.1.1.1) still poses a challenge for FPGA instrumentation. For example, software instrumentation might involve scanning source code for specific API calls that are harbingers of communication (e.g., an MPI_Send call in an MPI program), whereas FPGA communication and control are not as easy to detect in either the HLL or HDL portion of an application. Due to a lack of standards, FPGA communication in HLLs is currently proprietary, appearing in forms such as vendor function calls, software pointers to FPGA memories, and I/O calls. In an HDL, communication with a CPU, other FPGA, or on-board memory can either be proprietary or conform to one of many standards depending on the actual hardware available (e.g., PCI-X/e, HyperTransport, RapidIO, SDRAM, SRAM). On-chip communication, while better defined by the component inputs and outputs, still poses similar difficulties. For example, components in an FPGA may make use of a read enable signal which can be named arbitrarily, or the component may read data only when a complex set of conditions is true. These issues are addressed in Section 5.
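
The software-side scan for communication calls described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the input is C source text, the set of MPI routine names is hand-picked and non-exhaustive, and the helper `find_comm_calls` and the sample program are hypothetical (they are not part of ReCAP or PPW). A real tool would use a proper parser, since a regular expression can also match inside comments and string literals.

```python
import re

# Assumed, non-exhaustive set of MPI routines treated as communication.
COMM_CALLS = ("MPI_Send", "MPI_Recv", "MPI_Bcast", "MPI_Barrier")

# Match a listed routine name followed by an opening parenthesis.
_CALL_RE = re.compile(r"\b(" + "|".join(COMM_CALLS) + r")\s*\(")


def find_comm_calls(source):
    """Return a list of (line_number, call_name) for each match found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _CALL_RE.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


# Hypothetical input used only to exercise the sketch.
SAMPLE = """\
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Send(buf, n, MPI_INT, 1, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
}
"""

if __name__ == "__main__":
    for lineno, name in find_comm_calls(SAMPLE):
        print(lineno, name)
```

The scan works because MPI gives communication a fixed, well-known vocabulary. As the text notes, no comparably uniform naming exists for FPGA communication (an arbitrarily named read-enable signal, for instance), which is why this approach does not transfer directly to HDL code.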

Source-level instrumentation for hardware can employ a parser to scan application code and insert lines to extract the desired data at runtime (e.g., in VHDL, the component interface can be modified to allow access to performance data). The challenge here lies in the expressiveness of the given language; the parser must be able to cope with the various ways in which a designer may structure or express the behavior of their application. For example, the use of an enumerated type in VHDL along with a clocked “case” statement using that type would usually suggest a state machine. However, the same structure could be represented with constants and a complicated if-then-else structure.

Binary-level instrumentation suffers similar difficulties. Now control and communication must be detected from a fully optimized and implemented design. While escaping the problem of the source language's expressiveness, the hierarchy and structure behind much of the application have been flattened and reformed during synthesis and implementation. Given a set of physical lookup tables (LUTs) in the FPGA to monitor, binary-level instrumentation can be performed by synthesizing and implementing the original design as usual, except for the need to reserve space and connection points for the measurement component. After synthesis and implementation, tools such as Xilinx's JBits SDK [70] can be used to place the measurement component in the device and route signals to it from the application.

3.1.2 Challenges for Hardware Measurement

Measurement is concerned with how to record and store data selected during instrumentation. An integral challenge of this process is to record enough data to understand application behavior while at the same time minimizing perturbation caused by recording (Goals 1 and 2). Due to limited resources and a lack of resource virtualization on an FPGA, resource sharing between the application and measurement framework presents a unique challenge for RC performance analysis.

PAGE 39

3.1.2.1 Recording and storing performance data

To balance fidelity and overhead, software performance analysis employs techniques such as tracing (recording individual event times and associated data) and profiling (recording summary statistics and trends, not when specific events occurred). These methods can be triggered to record information under specific conditions (event-based) or periodically (sampling). The efficacy of one technique over another is dependent upon what behavior needs to be observed in the application.

Tracing is the methodology of recording data and the current time (based on a device clock) for individual events, allowing the duration and relative ordering of these events to be analyzed. To maintain event ordering between devices, clock offset and drift must be periodically monitored on all CPUs and FPGAs; methods such as those in [71] estimate round-trip delay, enabling clock drift to be corrected postmortem. While closely related to hardware debugging, tracing in performance analysis must be sustainable for an indefinite period of time in order to capture application behavior (debug techniques often record until memory is exhausted). To reduce the amount of data recorded, event-based tracing records data only under specified conditions, whereas sample-based tracing records data periodically. Based on the event conditions or sampling frequency, a different compromise is reached between fidelity and perturbation.

Profiling differs from tracing in that no specific event timing is stored. Rather, summary statistics of the data are maintained, usually with simple counters that are extremely fast and fairly small. Profiling sacrifices some of the fidelity of tracing for less perturbation of the design. Profile counters can provide statistics such as totals, maximums, minimums, averages, and even variance and standard deviation, although at the cost of additional hardware (and possible performance degradation). As with tracing, profile counters can be updated based upon an event or by periodically checking some condition (sampling).
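The difference between the two recording styles can be made concrete with a small software model. This is only a sketch under stated assumptions: the class names, the fixed-capacity list standing in for a hardware trace buffer, and the sample data are all hypothetical, and the model says nothing about how such units would be built in an FPGA.

```python
class ProfileCounter:
    """Summary statistics only: no record of when each event occurred."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def update(self, value):
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    def average(self):
        return self.total / self.count if self.count else 0.0


class TraceBuffer:
    """Fixed-capacity event log; records arriving when full are dropped and counted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.dropped = 0

    def record(self, cycle, value):
        if len(self.records) < self.capacity:
            self.records.append((cycle, value))  # keeps the event time
        else:
            self.dropped += 1


def measure(samples, capacity, event=lambda value: True):
    """Feed (cycle, value) samples to both units whenever the event condition holds."""
    profile, trace = ProfileCounter(), TraceBuffer(capacity)
    for cycle, value in samples:
        if event(value):  # event-based triggering
            profile.update(value)
            trace.record(cycle, value)
    return profile, trace


if __name__ == "__main__":
    # Hypothetical data: (cycle, bytes transferred); record only non-zero transfers.
    data = [(0, 4), (3, 0), (7, 10), (9, 6)]
    profile, trace = measure(data, capacity=2, event=lambda v: v > 0)
    print(profile.count, profile.minimum, profile.maximum, profile.average())
    print(trace.records, trace.dropped)
```

The profile counter stays a constant size no matter how many events arrive, while the trace buffer preserves timing but eventually has to drop records, which mirrors the fidelity versus perturbation compromise described above.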

One significant difference between software and hardware performance analysis with respect to profiling and tracing is parallelism. While software measurement requires additional instructions to profile or trace the application that generally degrade performance, profile counters and trace buffers can work independently of the application and each other in hardware. Thus, hardware performance analysis can incur no performance degradation if sufficient resources are available and the design's maximum clock frequency is unaffected. In addition, it is possible to monitor extremely fine-grained events, even those occurring every cycle. Another significant difference involves limited memory availability in an FPGA. Software performance analysis typically has hundreds of megabytes of memory or more to store profile and trace data, while an FPGA such as Xilinx's Virtex-4 LX100 device contains only 540 KB of block RAM and 13.5 KB of memory in logic cells [72]. While profiling typically requires far less memory than tracing, profile counters that must be accessed simultaneously are likely to be placed in logic cells (a set of profile counters can be placed in block RAM if only one counter in the set will be updated per cycle). As an example, 512 36-bit profile counters require a minimum of 16.7% of logic cells in Xilinx's Virtex-4 LX100 device [72], and yet could be stored in a single block RAM (representing only 0.4% of block RAM on the same device). In contrast, tracing is well suited for block RAM and thus can make use of additional storage. Unfortunately, trace data can easily exhaust block RAM resources, yielding a shortage of resources for both profiling and tracing.

In hardware, the tradeoffs between tracing and profiling provide a significant challenge to automating the selection of the measurement type to use for a specific signal in an FPGA. While the designer may recognize data that would be problematic for tracing or poorly represented by profile counters, this knowledge is rarely explicit in the application code or FPGA configuration file. Worse, once a selection has been made, the measurement framework necessary to monitor this selection may not fit in the remaining logic or memory on the FPGA, or the measurement framework may
cause significant degradation of the maximum frequency at which the application can run. Thus, finding a balance between perturbation and fidelity may require significant knowledge of both the application and tradeoffs in measurement strategies.

3.1.2.2 Managing shared resources

One of the greatest challenges in RC performance analysis is the management of shared resources that were once exclusively controlled by the application. Although the sharing of on-chip resources is important, this sharing is handled by the synthesis and implementation tool, and thus is of less concern than off-chip memory and communication sharing, which must be managed manually. While recording performance data would ideally require no off-chip communication (or possibly use a separate communication channel such as JTAG), the typical volume of trace data, the limited number and size of large memories, and the limited number and bandwidth of communication channels will generally necessitate sharing of memory (e.g., the FPGA on-board memory or the CPU main memory), the communication channel, or both.

Software performance analysis tools can share memory, communication, and processor time with the application through operating system and hardware virtualization (processes, virtual memory, sockets, etc.). FPGAs have none of this infrastructure, requiring the performance analysis tool to handle these complexities. To share the FPGA interconnect, performance analysis frameworks must ensure performance and application data can be distinguished so that each is delivered to the correct location, usually by allocating memory and address space for performance data to use exclusively. Arbitration between the application and performance analysis hardware is also necessary to ensure that only one can access the interconnect at a time.

One added complexity is that the communication architecture may only allow the CPU to initiate a data transfer from the FPGA to main memory. This scenario can be handled by instrumenting the application software to periodically poll the performance analysis hardware for data, either directly between other application tasks or via a
separate process or thread. If supported, interrupts can be used to have the CPU initiate a transfer, although interrupts are often scarce and thus may also need to be shared if used by the application. When CPU-initiated data transfers are used, the application and performance monitoring software must use locks to guarantee they do not access the FPGA simultaneously.

Schemes to minimize the perturbation of application performance (Goal 1) must also be considered when sharing any resource. These schemes generally reduce to conflict resolution between the performance framework and the application. With FPGA-initiated data transfers, decisions may be fairly fine-grained, allowing a performance data transfer to be interrupted to permit application use of the shared resource. CPU-initiated data transfers typically cannot be interrupted, requiring coarse-grained techniques such as adapting the polling frequency dynamically to account for application use and available performance data.

It is important to note that while measurement determines the need for communication sharing, instrumentation is affected as well, since it must now be aware of the application's communication scheme and seamlessly integrate with it. Communication schemes such as memory maps or network packets are used with a variety of interconnects and by diverse APIs for FPGAs. In order to automate instrumentation, the tool must be able to detect and assign some unused portion of the application's address space (or some other unique identifier not used by the application), connect to the application at the proper level in the design in order to avoid dealing with the details of a specific interconnect, and detect and add locks around all FPGA communication in the software portion of the application. Providing both automated instrumentation and measurement techniques to support shared resources is thus a significant challenge for measurement in RC performance analysis.
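Two of the software-side mechanisms discussed in this section, a lock around every FPGA access and a polling period that adapts to the amount of waiting performance data, can be sketched as follows. This is a minimal sketch under stated assumptions: `read_perf_data` stands in for a vendor-specific FPGA read and is hypothetical, as are the thresholds and period limits in `next_interval`; the text does not prescribe a particular adaptation rule.

```python
import threading

# One lock shared by the application and the polling thread so that only
# one of them talks to the FPGA at a time.
fpga_lock = threading.Lock()


def guarded(fpga_call, *args):
    """Run any FPGA access (application or measurement) while holding the lock."""
    with fpga_lock:
        return fpga_call(*args)


def next_interval(interval, records, low=4, high=64, shortest=0.001, longest=0.1):
    """Adapt the polling period (seconds) to the number of records just read.

    Hypothetical rule: halve the period when many records were waiting,
    double it when few were, and clamp to [shortest, longest].
    """
    if records >= high:
        interval /= 2
    elif records <= low:
        interval *= 2
    return min(max(interval, shortest), longest)


def poll_loop(read_perf_data, stop_event, store, interval=0.01):
    """CPU-initiated retrieval: poll until asked to stop, appending data to store."""
    while not stop_event.is_set():
        records = guarded(read_perf_data)  # hypothetical vendor read call
        store.extend(records)
        interval = next_interval(interval, len(records))
        stop_event.wait(interval)  # sleeps, but wakes early if stopped
```

In use, the application would route its own FPGA calls through `guarded` as well, and would run `poll_loop` in a `threading.Thread`, setting `stop_event` when measurement should end.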

3.1.3 Challenges for Performance Presentation

Conventional trace-based display tools such as Jumpshot [25] use timeline views to show communication and computation for parallel computing applications. These timeline views can be extended to include FPGAs as additional processing elements for RC applications. A (mockup) visualization example is shown in Figure 3-1. In this example, the CPUs (nodes 0-7) are performing work and receiving data from the FPGAs (nodes 8-15). Nodes 3 (CPU) and 11 (FPGA) complete first near the middle of the diagram, while nodes 4 and 12 are lagging, completing toward the end, finally allowing global synchronization of all nodes before a new iteration begins.

Figure 3-1. Mockup Jumpshot visualization of an RC application with 8 CPUs and 8 FPGAs.

Unfortunately, such timeline views scale poorly as system sizes increase to hundreds or thousands of nodes. Knüpfer et al. argue that as trace data sizes continue to grow, timeline views must continue to show smaller fractions of this data and yet still convey meaningful information to the user [41]. They propose detecting repetitive patterns and collapsing these patterns visually into a single box, thus highlighting irregular behavior in the application to increase scalability. These problems are present in RC performance presentation as well and are further complicated by the task of concisely, yet accurately, displaying the heterogeneous parallelism inside the FPGA. Traditional performance analysis tools treat any possibility for parallel execution (e.g., via multiple cores in a CPU, CPUs in a symmetric multiprocessor, or nodes in a cluster)
as separate “threads” that serially execute functions, communicate, wait for other nodes to complete, etc. Due to the vast amount of parallelism possible in an FPGA, such an abstraction can be unwieldy (violating Goal 5). Worse yet, treating the FPGA as a large, uniform multicore device is inaccurate, given the differing types of components and the specific hierarchy within which they exist. On the opposite end of the spectrum, portraying an FPGA as a single processing element (as is done in Figure 3-1) may not be useful as this excludes too much parallelism inside the FPGA.

Hierarchical views may be better suited to the heterogeneous devices and behavior in a large-scale RC system. For example, Figure 3-2 shows one possibility of such a display capturing both potential and actual communication and computation in the context of a system. Actual communication rates are given numerically, while the maximum bandwidth for each channel is depicted in the associated rectangular boxes. Computation is depicted similarly, representing the percentage of time that the device or FPGA core was busy. From the figure, four possible performance bottlenecks are readily visible: the network interface, the communication channel to CPU 1, the two cores in FPGA 1, and the communication and computation surrounding CPUs 4 and 5 and FPGA 2. It is also evident that there is relatively little use of resources such as CPUs 2 and 3 and FPGA 0. A more detailed profile and trace data view could be integrated when the user clicks on a node or communication channel (shown on left of figure). For example, from the detailed communication channel view, the designer is able to see the duration of a communication spike that would not be evident from a statistical average, maximum, or minimum.

From a practical standpoint, obtaining the communication and computation shown in Figure 3-2 is fairly straightforward, requiring profile (and optionally trace) data to be collected on both the CPUs and FPGAs. Generating the diagram layout automatically is non-trivial since the system architecture must be understood, but this problem could be handled by querying the system for as many parameters as possible, filling in
the remaining gaps manually for a given system; this process would only need to be performed once per system unless the configuration was changed. If actual sustained throughput values are desired rather than maximum theoretical bandwidth values, these must be obtained as well, usually via microbenchmarks.

Figure 3-2. Example hierarchical display of a complex RC application.

Ultimately, the purpose of any performance visualization is to aid the designer in forming strategies for optimization. For example, one possible improvement to the application in Figure 3-2 would be to offload some of the workload from FPGA 1 to underutilized resources such as FPGA 0 or CPUs 2 and 3 (or both). Another possibility consists of sharing some tasks currently assigned to CPUs 4 and 5 with CPUs 2 and 3. Both of these improvements are dependent on the ability to further parallelize or partition tasks, and have the possibility of affecting communication significantly. For example, CPUs 4 and 5 have nearly saturated their communication channels; moving some of the work to CPUs 2 and 3 may incur additional communication or may distribute some of the communication from CPUs 4 and 5 to CPUs 2 or 3. Given potential bottlenecks and solutions, the designer can apply application knowledge to make the appropriate optimizations.
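The reading applied to the hierarchical display (comparing actual use against the maximum for each channel or device) can be expressed as a simple utilization check. The sketch below is illustrative only: the thresholds, resource names, and numbers are hypothetical and are not values from Figure 3-2.

```python
def classify(resources, busy=0.9, idle=0.1):
    """Split resources into likely bottlenecks and underused ones.

    resources maps a name to (measured, maximum) in the same unit, e.g. MB/s
    for a channel, or busy cycles versus total cycles for a device.
    The busy/idle thresholds are assumptions, not values from the text.
    """
    bottlenecks, underused = [], []
    for name, (measured, maximum) in sorted(resources.items()):
        utilization = measured / maximum
        if utilization >= busy:
            bottlenecks.append(name)
        elif utilization <= idle:
            underused.append(name)
    return bottlenecks, underused


if __name__ == "__main__":
    # Hypothetical measurements: (actual, maximum).
    example = {
        "CPU 1 channel": (950, 1000),  # nearly saturated channel
        "CPU 2": (50, 100),            # moderately used device
        "FPGA 0": (5, 100),            # mostly idle device
    }
    print(classify(example))
```

Such a check only points at candidates; as noted above, deciding whether work can actually be moved still depends on the designer's knowledge of the application.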

3.1.4 Unified Performance Analysis Tool

To create a holistic view of an RC application's behavior, a unified software/hardware tool is essential. Separate tools will give a disjointed view of the system, requiring significant effort to stitch the two views back together. In addition, each tool must make decisions about instrumentation and measurement without any knowledge of what is being monitored by the other. A unified tool can take advantage of strategically choosing where to monitor a specific event (i.e., from software, hardware, or both) based upon factors such as efficiency, difficulty in accessing information, and accuracy of that information. Also, some instrumentation and measurement techniques require complementary modifications to software and hardware (e.g., modifying a memory map to allow CPU-initiated transfers of performance data).

We use Parallel Performance Wizard (PPW) [14], [15] as a specific software performance analysis tool to discuss integration here, although these concepts should apply to other tools as well. PPW supports performance analysis for PGAS programming models such as UPC and SHMEM as well as for message-passing models such as MPI. PGAS performance analysis is enabled by way of the Global Address Space Performance (GASP) interface [73], which specifies the interaction between a performance tool (such as PPW) and the programming model implementation (such as Berkeley UPC or gcc-upc). Based on a specific language, many constructs such as synchronization primitives will warrant monitoring, which the compiler instruments by using event callback functions (user-defined events are also possible). These events can then be received by any tool supporting the GASP interface, where the tool can choose to profile, trace, or ignore these events.

To track FPGA activity from software, the GASP interface can be extended with generic events such as FPGA reset, configure, send, and receive. Upon receiving an FPGA event, the performance tool could store information such as average bytes transferred, maximum time taken to reconfigure the FPGA, or minimum latency
observed, providing a detailed view of FPGA communication from software. However, automatically adding these extended GASP functions around FPGA communication is difficult due to the variety of ways FPGA communication can appear in software. Ideally, a standard API for FPGA access could make detection of FPGA calls trivial. In the absence of such a standard, the performance analysis tool must detect each vendor's FPGA access methods and map them to the appropriate generic event.

3.2 Framework

In this section we propose an initial framework to instrument and measure an FPGA's performance at runtime using the same communication channel used by the application for performance data transfers. For simplicity, portability, and flexibility in what can be monitored, our framework employs source instrumentation (specifically of VHDL). To ensure applicability to systems without FPGA-initiated transfers, interrupts, or access to other large memories, we use CPU-initiated retrieval of FPGA performance data at runtime using only on-chip FPGA resources. We discuss the instrumentation methodology first, followed by the measurement portion of the framework.

Figure 3-3. Additions made by source-level instrumentation of an RC application.

3.2.1 Instrumentation

Figure 3-3 illustrates the changes to an RC application during instrumentation in order to support measurement. These changes can be divided into the following seven steps:

1. All signals, variables, component ports, and other data available in the HDL source files are enumerated along with their types, locations in the hierarchy, and other useful information. This information is gathered by parsing the user's HDL code directly via a standard VHDL grammar and parser. The user's HDL code is then partially elaborated to gain information such as number of iterations in a VHDL “for-generate” statement or whether a component within a VHDL “if-generate” statement is active, permitting more accurate and targeted instrumentation (e.g., by instrumenting only one of four application cores on the FPGA).

2. An automated selection of data is made based on a desire to monitor communication and computation. For example, all “case” or “if-else” statements (or both) can be profiled, average use of the input and output ports of the top-level file can be monitored, and any identifiable control signals for subcomponents can be profiled or traced. This automation is based on common practices in VHDL code (e.g., state machines are often used for component control and generally appear in “case” statements), and thus may fail to find data to monitor (or conversely may monitor unnecessary data) depending on coding style and the nature of the application. Therefore, the user is optionally given the ability to override any decision made by the tool by adding or removing items to monitor (e.g., signals, variables, component ports), specifying whether to use profiling or tracing, and choosing the amount of FPGA resources to devote to monitoring.

3. Given the final list of what data is to be monitored (and how to monitor it), the tool automatically modifies a copy of the user's VHDL code to output all monitored data to the user's top-level file (illustrated in Figure 3-4a). Data types are converted to std_logic_vector whenever possible for simplicity of monitoring (e.g., enumerated types are converted to std_logic_vector using the state's position in the enumerated list); any type that is not understood is passed out as a new type (e.g., HMM_Type1) defined identically to the original type, assuming the user will manually handle the analysis. Note that the new type ensures the user's data can traverse the component hierarchy to the top-level file; the user's original type may have been defined only in the component where it was used. Each component in the design's hierarchy must output its monitored data as well as all monitored data from its subcomponent(s), if any.

4. A new top-level file is created by duplicating the user's top-level file interface and splicing into the communication scheme to allow the performance tool to gain access to the interface as well (e.g., the performance tool might be assigned unused address space to allow routing of incoming data to the correct location). This step is extremely difficult to automate in a fool-proof manner and thus is permitted to fail, allowing manual control by the user if necessary. Note that the new top-level file is permitted to have additional HDL files above it (e.g., a wrapper to interface with a bus or other interconnect) that the user has no interest in instrumenting. In order to increase automation, ReCAP v0.7 and later eliminate this new top-level file by making all changes directly to a copy of the original top-level file and performing name-shifting on signals connected to the input and output ports. This methodology ensures the user no longer needs to modify their hardware project to include additional files.

5. The signals, variables, and other data to be monitored are then connected to the Hardware Measurement Module (HMM) (see next section), which handles all recording of performance data and transferring of that data to the CPU when requested. Any analysis or combination of the signals is handled here, such as triggering an event only if both the error flag and write enable are true. If only part of a signal or a subset of components needs to be monitored, then only these signals or components are connected to the HMM, leaving the remainder unconnected for removal by the synthesis tool. In contrast, ReCAP v0.7 and later perform any analysis or combination of signals as close to their origin as possible in the hierarchy, passing up a single-bit event line from that point in the hierarchy; this methodology allows for lower resource overhead when employing event pipelining (a technique where events may be registered several times to help reduce or eliminate a drop in the application's maximum frequency when instrumented).

6. In software, a Hardware Data Transfer Module (HDTM) is added that will start immediately after the user's initialization of the FPGA. This module will execute as a separate thread that periodically polls the FPGA for performance data and then transfers that data to main memory. This thread uses generic FPGA calls, coupled with the appropriate mapping between these calls and the actual vendor-specific FPGA calls, to improve portability. In order to reduce overhead and support systems without threads, ReCAP v0.7 and later provide (and default to) post-mortem-only collection; threaded data collection is typically only desirable when trace data will exhaust FPGA memory resources, thus justifying periodic data collection to reduce or eliminate dropped trace records.

7. If threads are employed for the HDTM, a lock is placed around any call to the FPGA in the user's application, as the FPGA must be guarded against simultaneous access by the application and data transfer thread. The same lock is already present in the HDTM (this and previous step are illustrated in Figure 3-4b).

In step 6, the software interface to launch and manage this thread is wrapped into four simple calls: HMM_Init, HMM_Start, HMM_Stop, and HMM_Finalize (shown in Figure
3-4b). The initialize and finalize routines manage the setup and cleanup of all necessary memory and thread resources on the software side. The start and stop routines act as a stopwatch, launching and stopping the data transfer thread (and optionally the hardware measurement itself) for portions of code that do not require monitoring. In general, it is possible to override key vendor API functions to manage this thread automatically, allowing HLL source instrumentation (steps 6 and 7) to be fully automated. Note that source instrumentation of the user's HDL requires the application to be resynthesized and implemented before executing, while source instrumentation of the user's HLL requires use of the vendor's API to access the FPGA.

Figure 3-4. Instrumentation of user source code. A) HDL source code modifications. B) HLL source code modifications.

3.2.2 Measurement

At the center of measurement in the FPGA is the Hardware Measurement Module (HMM). The HMM is responsible for implementing all profile, trace, and sampling capabilities, as well as packaging that data for retrieval by software. The HMM allows quick customization of (and easy access to) all of these resources, eliminating the time-consuming and error-prone process of manually measuring performance. Features
include arbitrary counter and trace sizes (limited by resources); storage of maximums, minimums, and averages of selected values; counters for each trace buffer indicating the number of records dropped (trace records may be dropped due to insufficient buffer space); ability to clear, stop, hold, and acknowledge errors in profile and trace units; packaging of all performance data to the specified interface width for export to the CPU; and storage of records in logic cells, block RAM, or on-board memories (if available). Figure 3-5 illustrates the design of the HMM.

Figure 3-5. Hardware Measurement Module.

At runtime, the polling thread inserted by instrumentation periodically retrieves all trace data (and optionally profile data as well) from the FPGA. To minimize perturbation of the application's communication channel, the polling rate can be adaptive, increasing or decreasing based upon a target usage of the communication channel, the application's usage of the communication channel, or the number of recently dropped trace records on the FPGA. The HMM receives a request for profile data, trace data, or module statistics (e.g., dropped trace records), and splits the data up into pieces the size of the communication channel width. The HMM can also receive
commands to clear, stop, or acknowledge overflows of profile counters and trace buffers. Sampling capabilities are also available, allowing trace buffers to record for a specified number of cycles. Once data is retrieved from the HMM, this data may be lightly processed to reduce storage overhead.

3.3 Case Study

To demonstrate the benefits and importance of RC performance analysis techniques, as well as explore the associated overhead, we present results from a case study using a prototype version of the framework discussed in Section 3.2. In our prototype version, the instrumentation steps are performed manually (Section 3.2.1), with profile counters and a trace buffer available in the HMM. The data transfer module retrieves data at a fixed rate, polling once per millisecond.

For our case study, we executed the N-Queens benchmark application on two RC systems. The first RC system, the Cray XD1, consists of six nodes, each containing two Opteron 250 CPUs and a Xilinx Virtex-2 Pro 50 FPGA connected via a high-speed interconnect (3.2 GB/s ideal peak) [11]. The second RC system is a 16-node InfiniBand cluster, each node containing a 3.2 GHz Intel Xeon EM64T processor and a Nallatech H101-PCIXM application accelerator [74] employing a Xilinx Virtex-4 LX100 user FPGA and connected via a PCI-X bus (1 GB/s ideal peak). The N-Queens application was implemented using UPC (software) and VHDL (hardware). Compilation for the Cray XD1 was performed using Synplicity's Synplify Pro 8.6.2, Xilinx's ISE 7.1.04i, and Berkeley UPC 2.4.0, while compilation for the 16-node InfiniBand cluster was performed using Nallatech's Dimetalk 3.1.5, Xilinx's ISE 9.1.03i, and Berkeley UPC 2.4.0.

The N-Queens problem asks for the number of distinct ways that N queens can be placed onto an N × N chessboard such that no two queens can attack each other [75]. As only one queen can be in each column, a simple algorithm was employed to check all possible positions via a back-tracking, depth-first search. Parallelism was exploited by assigning two queens within the first two columns; each core then receives
a partial board and generates all possible solutions by moving queens in the remaining N − 2 columns, returning the number of solutions to software. The program was executed on both RC systems using a board size of 16 × 16. The N-Queens application was first executed without hardware instrumentation to acquire baseline timing, and then with instrumentation to collect measured data. The HMM was configured to include 16 profile counters in each FPGA (six for monitoring application communication, nine for monitoring an N-Queens core state machine, and one to monitor the number of solutions found by that core) and one 2 KB trace buffer to monitor the exact cycle in which any core in the application completed.

Table 3-2 provides the overhead incurred by adding instrumentation to the N-Queens cores and periodically measuring profile counters and trace data from the N-Queens application at runtime. From this data, a maximum overhead bandwidth of 33.3 KB/s was observed, which is negligible when compared to the interconnect bandwidth (the application used very little bandwidth as well, polling the device only once per 100 milliseconds). Less than 7% of the FPGA's logic resources and 2% of the block RAM were needed to monitor the application. Frequency degradation ranged from 1% on the XD1 to no degradation on the larger LX100 devices in the Nallatech cluster.

Table 3-2. Performance Analysis Overhead.

(a) Cray XD1
Design Aspect            Orig.    Device %   Inst.    Device %   Diff.     Diff. %
Slices (23616 total)     9041     38.3%      9901     41.9%      +860      3.7%
Block RAM (232 total)    11       4.7%       15       6.5%       +4        1.7%
Frequency (MHz)          124                 123                 -1        -0.8%
Communication (B/s)      80                  33290               +33210

(b) Nallatech Cluster
Design Aspect            Orig.    Device %   Inst.    Device %   Diff.     Diff. %
Slices (49152 total)     23086    47.0%      26218    53.3%      +3132     6.4%
Block RAM (240 total)    21       8.8%       22       9.2%       +1        0.4%
Frequency (MHz)          101                 101                 0         0.0%
Communication (B/s)      40                  29860               +29820
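The partitioning scheme described for the N-Queens cores (fix queens in the first two columns, then search the remaining N − 2 columns by back-tracking and sum the per-board counts) can be sketched in software as follows. This is not the VHDL or UPC implementation used in the case study; the function names are hypothetical, and for simplicity the sketch generates only valid starting boards, whereas the text mentions that invalid starting boards also reached the hardware cores.

```python
def count_from(n, cols_placed):
    """Count completions of a partial board.

    cols_placed[i] is the row of the queen in column i; remaining columns
    are filled by depth-first, back-tracking search.
    """
    col = len(cols_placed)
    if col == n:
        return 1
    total = 0
    for row in range(n):
        # A new queen is safe if it shares no row and no diagonal.
        if all(row != r and abs(row - r) != col - c
               for c, r in enumerate(cols_placed)):
            total += count_from(n, cols_placed + [row])
    return total


def partial_boards(n):
    """Yield every valid placement of two queens in the first two columns."""
    for r0 in range(n):
        for r1 in range(n):
            if r1 != r0 and abs(r1 - r0) != 1:
                yield [r0, r1]


def n_queens(n):
    """Sum the per-partial-board counts, as software would sum core results."""
    return sum(count_from(n, board) for board in partial_boards(n))


if __name__ == "__main__":
    print(n_queens(8))
```

Each call to `count_from` on one partial board corresponds to the unit of work handed to a core; a small board size is used here because a pure-Python search of the 16 × 16 board from the case study would be far too slow.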

It is important to note that even though some decrease in maximum frequency may occur, some FPGA systems have only coarse-grained or fixed clocks, polarizing the importance of frequency degradation. For example, if the FPGA clock operates at either 75 or 100 MHz, a drop from 102 MHz to 101 MHz would have no effect while a drop from 102 MHz to 99 MHz would necessitate a significant drop in FPGA speed, reducing the accuracy of performance data measured. However, both the XD1 and the Nallatech hardware allow for fine-grained control of the clock, permitting a change of 1 MHz or less via the Digital Clock Managers (DCMs).

The number of cycles spent in each state of an N-Queens core state machine was monitored in order to understand a core's behavior at runtime. While not accessible from a software performance analysis tool, this information is easily obtained by using as many profile counters as there are states, with each counter incrementing when that state occurs (technically, one of the profile counters is unnecessary since its value can be ascertained from the other counters and the total cycle count). From this data, the percentage of cycles spent in each state was calculated and is shown in Figure 3-6. More than a third of the total time is spent determining whether any queens can attack each other. While this state would normally be targeted for optimization, it was already heavily optimized, leaving little room for improvement. However, Figure 3-6 also shows that the ResetAttackChecker state consumes 12% of the total state machine cycles, which is surprising given the relatively small job that this state performs. Thus, a relatively simple modification was made to combine the ResetAttackChecker state, as well as the Finished and ResetQueenRow states, with the remaining states, yielding a potential speedup of 16.3% versus the non-optimized version (based upon removing these states from the graph); the actual speedup is expected to be less as communication and setup portions of the application have not changed. While merging states could reduce the maximum frequency of the design, a negligible drop in core clock frequency of the optimized version was observed. The optimized N-Queens core
was then measured on the target systems, giving an average speedup of 10.5%. This performance gain was greatly facilitated by the use of hardware performance analysis, removing guesswork from understanding the application's behavior and aiding in the detection of performance bottlenecks.

Figure 3-6. Distribution of cycles spent in core state machines of N-Queens.

The trace buffer was used to monitor the cycle in which any core in the device completed in order to understand the penalty of the application's static scheduling, which requires all cores in the device to complete before receiving further work. Tracing data (ignoring trivial completions of invalid starting boards) revealed that the first core to complete was idle 25% of the time, waiting for the last core to complete; on average cores were idle 10% of the time. Thus, a dynamic scheduling algorithm could theoretically improve speedup by 11%.

Figure 3-7 shows the speedup of the parallel software and both the initial and optimized hardware versions of N-Queens over the baseline sequential C version. The 8-node software version was able to achieve a speedup of 7.9 over the sequential baseline. The cluster executing the optimized hardware on 8 FPGAs achieved a speedup of 37.1 over the baseline.
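Returning to the static-scheduling analysis, the text does not state how the 11% figure was derived. One reading consistent with it is that, if cores are idle a fraction f of the time on average, perfect load balancing would shorten each batch to (1 − f) of its length, for a gain of 1 / (1 − f). The sketch below computes idle fractions from per-core completion cycles under that assumption; the completion cycles are hypothetical values chosen only to reproduce the 25% and 10% figures quoted above.

```python
def idle_fractions(completion_cycles):
    """Fraction of a batch each core spends waiting for the slowest core."""
    batch_length = max(completion_cycles)
    return [(batch_length - c) / batch_length for c in completion_cycles]


def ideal_dynamic_gain(completion_cycles):
    """Upper bound on speedup if all idle time were eliminated.

    Assumes work could be rebalanced perfectly and at no cost, which a real
    dynamic scheduler would not achieve.
    """
    idle = idle_fractions(completion_cycles)
    mean_idle = sum(idle) / len(idle)
    return 1.0 / (1.0 - mean_idle)


if __name__ == "__main__":
    # Hypothetical trace data: cycle at which each of four cores finished.
    cycles = [750, 900, 950, 1000]
    print(idle_fractions(cycles))       # first core idle 25% of the batch
    print(ideal_dynamic_gain(cycles))   # about 1.11, i.e. roughly 11%
```

With a 10% mean idle fraction the bound is 1 / 0.9, or about 1.11, which matches the 11% quoted in the text under this interpretation.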

Figure 3-7. Speedup of N-Queens Application.

3.4 Conclusions

In the first phase of this research, we have explored various challenges faced in RC performance analysis. While we discussed some challenges that are shared by software performance analysis, many of these challenges are more difficult in or unique to RC. Challenges such as resource sharing, automation of instrumentation and measurement, and tradeoffs between accurate measurement and overhead incurred need to be addressed for performance analysis to be successful. Furthermore, difficulties in representing increasingly large FPGAs and RC systems in meaningful visualizations are significant barriers as well. To address these challenges, we presented a framework to instrument an RC application and measure runtime performance data as well as concepts for visualizations involving RC resources. We argued that, due to the complexity inherent in large-scale RC systems and applications, unification of software
and hardware performance analysis into a single tool is crucial to efficiently record and understand application behavior at runtime.

To demonstrate the overhead and benefits of these techniques, results from an N-Queens case study were provided. Using N-Queens on two RC platforms, we demonstrated that our prototype hardware measurement module (HMM) incurred little overhead. Measuring application behavior using profile counters and trace buffers cost no more than 6.4% of the logic resources in a medium-sized FPGA, 1.7% of the block RAM, 1% in frequency degradation, and 33 KB/s in bandwidth when polled once per millisecond. From the performance data returned, including statistics on time spent in the main N-Queens state machine, the behavior of the application was readily understood, resulting in a 10.5% speedup with minimal modifications.

CHAPTER4 PERFORMANCEVISUALIZATIONANDEXPLORATION(PHASE2) Intherstphaseofthisresearch,basicvisualizationsalr eadyavailableintraditional HPCperformancetoolswereusedtoprovidequickgraphicalun derstandingofFPGA performancedata,andpotentialconceptsforfutureworkin RCvisualizationwere presented.Thesevisualizationsareessentialinquicklyc onveyingcomplexapplication behavioratruntime,allowinganapplicationdesignertora pidlylocateandaddress performanceproblemsintheirapplication.AsmentionedinPh ase1,traditionalHPC visualizationsarerootedinahomogeneous-parallel-thre admodelwhereinallthreads executeonthesame(orsimilar)hardware,allthreadsarege nerallyassumedtobeat thesamelevelintheapplicationhierarchy,andwhereasing lethreadisconsideredto besequentialinnature.Unfortunately,theseassumptions donotholdinRCsystems withheterogeneousdevices,special-purposecomponents, multiplehierarchiesof parallelism,andparallelismevenwithinthesamecomponen t(e.g.,componentsmay performcomputationandcommunicationsimultaneously).I nattemptingtoconvertthis disparatemodeltothehomogeneous-parallel-threadmodel ,asignicantamountof informationisneithercapturednorpresentedintuitively totheuser. Figure 4-1 demonstratesthedifcultiesassociatedwithpresentingR Cdatain atraditionaltimelineviewusingJumpshot.Assumingthisvi sualizationdepictseight CPUsandboxesforeachCPUrepresentdifferentfunctions,thi svisualizationcanbeof use.However,assumingthisvisualizationdepictseightdi fferentRCcomponentsand boxesrepresentwork,overhead,andsoforth,theresultsar efarlessclearsincethe visualizationleavesthepurposeandrelationshipbetween differentlinesunspecied. Unfortunately,thenumberofcomponentsinsideevenoneFPGA foratypicalRCdesign mayexceedthescalabilityofsuchagure;inamulti-CPU/FPGA system,thisgure wouldonlybecomeworse.Thus,thisphaseofresearchseekst oextendtheframework 58

and tool detailed in Phase 1 to provide RC performance visualization that is appropriate to the heterogeneous software/hardware environment of RC systems and applications.

Figure 4-1. Traditional timeline visualization (Jumpshot) example demonstrating difficulties associated with presenting RC data

Additionally, even given an intuitive RC performance visualization where bottlenecks can be quickly located and optimizations formulated, a user is still left with the challenge of predicting the expected performance improvement for each optimization, weighing this improvement against other expected costs (e.g., implementation effort, power, resources used). If the prediction is inaccurate, a significant amount of implementation and reverification effort may be wasted. While a significant amount of work has been performed in RC performance modeling within DSE, this literature typically makes significant assumptions about the system and application that can reduce accuracy and are unnecessary now that an implementation for a specific system has been realized. Rather, we present a performance exploration methodology that employs runtime performance data to capture application behavior, allowing our framework to provide accurate predictions of application performance if optimizations were made. Thus, in this phase of research, we also extend our Phase 1 framework to provide RC performance exploration (in addition to the aforementioned visualization framework and tool), culminating in an application case study that demonstrates the utility of both RC performance visualization and exploration. We first present our RC performance visualization and exploration research (including details of our prototype implementation

for RC visualization), followed by a demonstration of both visualization and exploration via an application case study.

4.1 RC Performance Visualization

Unfortunately, traditional parallel-processor performance visualizations are not readily suited for RC applications. First, these visualizations typically assume devices are general-purpose, interchangeable, and within a relatively shallow hierarchy (e.g., timeline visualization with one processor per row). In contrast, many components programmed on the FPGA are special-purpose (e.g., a component that scatters data to cores) and may be connected in arbitrarily deep hierarchies. If treated as a flat list of interchangeable devices, much of the structure and semantics of the application are lost. Second, even with heavily pipelined, superscalar processors executing instructions out-of-order, a single thread or process is still abstractly viewed as roughly sequential (since sequential ordering of application code is preserved), while a single FPGA component may perform an arbitrary number of tasks in a given cycle (HDL code is inherently parallel). Finally, HDLs lack standardized, high-level communication and synchronization functions (e.g., MPI_Send), complicating attempts to automatically classify and visualize such behavior.

To address these issues, we follow a similar approach to [44], [45], having a user provide a high-level model of application behavior. However, to make our framework easy to use, we provide simple pragmas that allow a user to augment their source code with high-level information while not changing the application itself. Each pragma defines the current state of a hardware or software “block” (the finest-grained unit that operates in parallel with all other blocks, typically a software thread or process or a VHDL process block or Verilog always block); note that while the term “state” is used, no explicit state machine is needed in the application's source code to employ our methodology. In software, pragmas are only used to classify FPGA API calls (since PPW, and therefore ReCAP, already monitors traditional parallel-program

communication and synchronization). In hardware, pragmas are used to classify how time is spent in each clocked block. Figure 4-2 shows several examples of software and hardware pragmas.

Figure 4-2. Example of user-defined pragmas. A) SW pragmas. B) HW pragmas.

Each pragma-defined state is given a name and one or more classifications such as wait_send, wait_recv, send, recv, and busy (performing internal tasks), although other categories such as wait_sync and overhead could be useful as well (the chosen categories are similar to those visualized in current performance tools [32] and used for performance modeling [52]). The “busy” category requires no arguments, while “wait” and “communication” categories require two arguments specifying dependencies and a “purpose” tag used to distinguish between similarly-typed messages from the same block. While a pragma automatically inherits all conditions that contain it, an optional condition field (after the colon) permits additional control over when a block is in a given state.

Given this framework, we extended ReCAP to automatically monitor these pragma-defined states. While trace (timeline) information is ideal for viewing exact details of application behavior (and is supported by ReCAP on both CPUs and FPGAs), limitations on FPGA memory make it difficult to rely on such information. Instead, we

build a Markov model of all CPU and FPGA blocks by recording time spent in each state and state transition, accepting a potential loss in fidelity. While this approach incurs $O(n^2)$ counters, where $n$ is the number of states, it is common for many transitions to be impossible. Thus, we use a basic sparse-matrix representation for state transitions on both devices; due to the potentially high cost of implementing sparse matrices on FPGAs, we provide optional syntax in hardware pragmas to specify which transitions are possible, thus creating a static sparse matrix. A synthesis tool could potentially determine impossible transitions automatically, but this is difficult and may not be possible in all cases.

To present performance in an intuitive, model-appropriate fashion, we visualize performance data within the system and application hierarchy. While HDL code is naturally hierarchical, we found the digraph generated from block dependencies better suited for understanding behavior than the actual structure of code (since HDL components or modules may simply pass data through or connect two sub-components together). For our visualization, we define the “system,” “node,” “device,” “block,” and “state” levels, allowing any of these levels to have one or more nested groups within it. A “node” is generally the smallest networked homogeneous unit of a system (which may contain multiple “devices” such as CPUs and FPGAs), while the block level refers to the definition of “block” given earlier (e.g., thread, VHDL process block), which are broken into “states” via user-supplied pragmas. Examples of application- or system-specific grouping include FPGAs on the same card and threads or cores performing similar tasks. By employing a hierarchical structure, scalability is improved and our visualization is appropriate to the hierarchical structure common in RC applications. As these levels are fairly generic, our visualization and exploration framework could also apply to other computing devices (e.g., GPUs, DSPs).

In ReCAP, we prototyped this visualization (without grouping mentioned above) by generating a single file per logical CPU for use in Graphviz [76]; thus, multi-CPU/FPGA

applications are supported, but the system-level view is not yet constructed. Graphviz can then generate scalable vector graphics (SVG) files, allowing a user to open the top-level SVG file and traverse the system and application hierarchy via hyperlinked nodes. Figure 4-3 shows actual tool-generated output for the device, node, and block levels, which have been cropped for brevity. The state level (not shown) is also visualized within our tool as an HTML table including the number of iterations, user pragma, source location, tool overhead, time spent in the state (including min, max, average, and total), and similar summaries of bytes transferred and communication bandwidth. Dependency information between blocks within the FPGA or between the FPGA and CPU is also visualized (e.g., the arrow connecting the two blocks within the device level in Figure 4-3). While visualization of CPU-CPU communication and synchronization is currently unimplemented in this view, statistical and timeline views of this behavior are available in PPW+RC's frontend GUI. Figure 4-3 will be analyzed for performance problems in Section 4.3.

The block level (for CPUs and FPGAs) shows each user-defined state's name and type, along with the cumulative time spent in that state (numerically and visually via partially filled bars). In addition, transitions are labeled with a percentage representing the frequency of a given transition in comparison to all outgoing transitions from the state. In CPU blocks, transitions are also labeled with the total time spent transitioning from one state to another (numerically and visually), as an arbitrary amount of work may occur between FPGA API calls. The CPU block level is shown at the bottom of Figure 4-3.

At higher levels in the hierarchy, it is crucial to summarize data such that a user can quickly determine whether that item deserves further attention. From a performance analysis perspective, an application is performing ideally if all computational resources committed to the application are fully used; certainly better algorithms can have less utilization and still perform better, but performance analysis is concerned with

Figure 4-3. ReCAP visualizations for 2-core Collatz application

maximizing a given solution, assuming all computation performed is necessary. Deviation from this ideal is considered a (local) performance problem and potentially an application bottleneck (a critical performance problem degrading the entire application's performance) and should be highlighted.
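
The summary-statistics bookkeeping described in this section (cumulative time per state plus a sparse set of transition counters, from which the per-transition percentages shown at the block level are derived) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not ReCAP's implementation: the class name BlockProfile, its methods, and the state names are hypothetical.

```python
from collections import defaultdict


class BlockProfile:
    """Summary statistics for one block (hypothetical sketch, not ReCAP code)."""

    def __init__(self):
        # Cumulative time (seconds) spent in each pragma-defined state.
        self.time_in_state = defaultdict(float)
        # Sparse transition counters: only (src, dst) pairs that actually
        # occur are stored, instead of a dense n-by-n matrix.
        self.transitions = defaultdict(int)

    def record(self, state, seconds, next_state=None):
        """Add one visit to `state` and, optionally, the transition leaving it."""
        self.time_in_state[state] += seconds
        if next_state is not None:
            self.transitions[(state, next_state)] += 1

    def transition_percentages(self, state):
        """Frequency of each outgoing transition relative to all outgoing ones."""
        outgoing = {dst: n for (src, dst), n in self.transitions.items()
                    if src == state}
        total = sum(outgoing.values())
        return {dst: 100.0 * n / total for dst, n in outgoing.items()}


# Made-up measurements for a single block, for illustration only.
profile = BlockProfile()
profile.record("busy1", 2.0, "send1")
profile.record("send1", 0.5, "busy1")
profile.record("busy1", 1.0, "wait1")
profile.record("wait1", 0.25, "busy1")
profile.record("busy1", 1.0, "send1")

print(dict(profile.time_in_state))
print(profile.transition_percentages("busy1"))
```

Storing only the observed (source, destination) pairs mirrors the sparse representation discussed above; on an FPGA the set of permitted pairs would instead be fixed ahead of time through the optional pragma syntax.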

Therefore, we display total time spent in each category when summarizing blocks within the device and node levels, and we display the maximum of each category's totals (along with the minimum total busy time) when summarizing devices and nodes themselves. The maximum wait-time, communication-time, and minimum busy-time show deviations from the ideal, while the maximum busy-time shows hot-spots that may benefit significantly from optimization (including algorithmic improvement). It may also be of use to provide standard deviation and automatic grouping or binning of blocks with other similarly performing blocks, depending on the application.

Based on our visualization approach, it is often beneficial to define as few states as necessary to describe high-level application behavior, inserting additional pragmas in key areas if more detail is needed. In addition, while hierarchical structuring has many advantages, it may also be desirable to query or aggregate performance data. For example, a query such as SELECT * FROM FPGA[*] WHERE TIME(SEND) > 0.1*TOTAL_TIME() could be used to find all FPGA states sending data more than 10% of the time. Due to significant implementation cost, this feature is not currently implemented.

The visualization implemented here is not intended to be a definitive visualization for RC. Additional views and data could be of use. For example, multi-node visualizations containing communication channel statistics, such as shown in Figure 4-4 (similar to Figure 3-2 shown in Section 3.1.3), and a node-level view visualizing communication, buffer, and memory statistics, such as shown in Figure 4-5, could be of significant use. Further visualization research is left for future work.

4.2 RC Performance Exploration

Once performance is visualized, the user must formulate potential optimizations and weigh expected performance gains against expected costs of optimization (e.g., effort, power, resources). Unfortunately, predicting performance can be difficult in complex RC applications, potentially leading to wasted implementation effort. As mentioned in

Figure 4-4. Example of a system-level visualization of an RC application executing on a 6-CPU/3-FPGA RC system assumed to be part of a larger cluster

Figure 4-5. Example of a node-level visualization of an RC application, showing 2 CPUs and an FPGA from the left-hand side of the system-level visualization

Section 2.3, DSE is invaluable for providing early performance estimates. However, these estimates may be inaccurate due to assumptions concerning the system and application; thus, we employ actual performance data gathered for our visualization to support performance exploration in the optimization stage. Our proposed exploration methodology complements existing methods in DSE and performance prediction.

Since our goal is to maximize performance of an existing system and design, we take advantage of detailed performance data, targeting finer-grained changes and more accurate exploration rather than broad changes common in traditional DSE. Ideally, our data and methodology could be integrated with existing DSE tools (e.g., augmenting application and system models with real performance data), enabling more accurate prediction and providing an end-to-end environment in which to study application performance, from conceptual design through optimization.

We now motivate our approach. Figure 4-6 shows mockup performance data for an application consisting of two blocks. Each state contains the cumulative time (CT) spent in that state (over all iterations) as well as the number of incoming and outgoing transitions, from which a potential timeline can be constructed (Figure 4-7A). Supposing an optimization could reduce the CT of the busy2 state from 3 s to 1 s, Amdahl's law suggests an ideal speedup of 22.9% ($1 / \left(1 - \frac{3\,\text{s}}{10.75\,\text{s}} + \frac{3\,\text{s}/10.75\,\text{s}}{3}\right)$) is possible. However, in reality, the performance of other blocks could inhibit us from achieving this potential.

Assuming the average case, the above optimization will reduce each of the four iterations of busy2 by 0.5 s, leaving gaps in Block B's timeline (Figure 4-7B). Unfortunately, the first three gaps precede a wait state; since Block B was already waiting on Block A, Block B will have to wait that much longer. However, the final gap is followed by communication with Block A, where Block A is waiting for this communication. Thus, Block A could now wait less time, allowing the final send/recv pair to complete 0.5 s earlier (Figure 4-7C) and providing a final speedup of 4.9%, much less than the ideal 22.9% (this ideal could be realized with a similar change to busy1 instead).

While the previous example focused on changing a state's CT (along with the corresponding adjustments to other states), other changes can be modeled as well. For example, changing FPGA clock frequency can be modeled as scaling the total time of each FPGA state (except those communicating or waiting on external resources such

Figure 4-6. Example application for exploration.

Figure 4-7. Potential timelines when optimizing an example application. A) Initial timeline. B) After local modification. C) Final result.

as SRAMs or CPU). Changing a block's replication factor by $F$ (e.g., doubling CPUs or FPGA cores) can be modeled as scaling all busy and communication state times by a factor of $1/F$ (assuming perfect load balancing). Our framework does not constrain the magnitude of such changes, allowing users to investigate performance if a larger or faster device were employed; however, potential changes in resource usage and

achievable frequency should be considered when targeting specific devices. Ideally, a tool could automatically map these higher-level changes into corresponding state-level changes rather than requiring the user to manually perform this mapping.

This approach makes several assumptions. Since a generated timeline represents the average case, it could differ significantly from actual runtime behavior (especially with shared effects such as contention). However, a number of RC applications exhibit very regular behavior, and thus our approach is reasonable given the significant memory reduction afforded by using summary statistics (even trace data can be insufficient to predict performance as a single execution may not include all execution paths). Also, although a user's change to the CT of one or more states could affect the CT of a busy or communication state, we assume only the CT of wait states is adjusted, as this case is far more common and easier to model. Finally, although matching communication states between blocks can wait on each other (e.g., in one iteration Block A waits on Block B while in the next Block B waits on Block A), we assume that communication can be shifted such that only one block is waiting on another; while not always possible, this represents an ideal case if such a synchronization problem is remedied. We show in Section 4.3 that our methodology can still be fairly accurate, even with non-uniform behavior.

Unfortunately, generating timelines consistent across all blocks can be difficult, as matching send/recv pairs must be aligned globally (e.g., in Figure 4-7A, Block B's first wait state was lengthened in order to line up communication states). Thus, for simplicity and efficiency, we forgo timeline generation, instead employing a heuristic methodology that calculates the CST, or cumulative time a state is shifted in the timeline (e.g., the final -0.5 s shift in Figure 4-7C), as well as the CT for each state; pseudocode for our approach is provided in Figures 4-8A and 4-8B. Note that while CST/CT calculation may yield different results from a timeline approach (as the latter ensures matching transfers occur together rather than just ensuring that dependencies, on average, are met), this

does not imply the timeline approach is more accurate; both are effectively assuming a schedule that may differ from actual execution.

Figure 4-8. Performance exploration methodology. A) Block-propagation methodology. B) Dependency-handling methodology.

We now re-predict performance for our example application via the CST/CT methodology. As shown in Figure 4-9A, the user changes the CT of busy2 from 3 s to 1 s (checkered). The propBlk function (Figure 4-8A) is then called with the busy2 state

and -2 s CST. This function is responsible for propagating a change in expected CST throughout a block, adding any dependencies to a queue for later processing. The CST is split among all potential next states based on the frequency of transitions to each. If a state has been visited before, the handleCycle function (definition not shown) finds all states outside the cycle that have a transition to them from within the cycle (exit states) and determines the correct split of CST between these states (determined either by iteration or geometric series if the number of iterations is sufficiently large). In our example application, the call to propBlk adds dependencies for recv2 and send2 to the queue and results in the final CST values shown (black ellipses) in Figure 4-9A.

At this point, each dependency is incrementally applied by calling handleDep with the four relevant states in the dependency: the source s, its corresponding wait state sw, the destination d, and its corresponding wait state dw (Figure 4-8B). If there is no corresponding wait state for either s or d, an implicit one with zero CT is added. The handleDep function attempts to shift communication as early as possible, resolving any conflict detected by increasing wait time at the source. Figure 4-9B shows the result of triggering the recv2 dependency, which resolves a conflict where Block B could progress faster but Block A cannot, thus increasing the wait time in wait2 to 4.25 s. Figure 4-9C then shows the final outcome after triggering the send2 dependency, which allows the recv1 state in Block A to shift 0.5 s earlier (by reducing the CT of wait1 to 0.25 s), yielding a final reduction of 0.5 s out of 10.75 s (4.9% speedup) for the application, as predicted earlier. Note that our methodology not only predicts overall performance, but also the CT for each state, thus aiding a user in locating and potentially remedying bottlenecks in an optimized version of the application before the optimization has been implemented.

While our intent is to allow users to modify a state's CT, block replication, or frequency within our visualization, performance exploration is not currently implemented

Figure 4-9. Performance exploration using CST/CT updates. A) Result of initial update to Block B. B) Result of recv2 dependency handling. C) Result of send2 dependency handling.

in ReCAP. Thus, we manually employ our exploration framework to demonstrate its utility.
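
The arithmetic behind the running example in this section can be checked with a short script. The Python sketch below evaluates Amdahl's law for the busy2 change, compares it with the 0.5 s reduction obtained from the CST/CT walk-through, and shows the replication-factor scaling rule. The function and variable names are illustrative and not part of ReCAP, and the state times passed to the scaling helper are made up for the example.

```python
def amdahl_speedup(fraction, local_speedup):
    """Ideal overall speedup when `fraction` of runtime is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def scale_for_replication(states, factor):
    """Scale busy and communication state times by 1/factor.

    `states` maps a state name to (category, cumulative_time_s). Wait states
    are left untouched; perfect load balancing is assumed, as in the text.
    """
    scaled = {}
    for name, (category, ct) in states.items():
        if category in ("busy", "send", "recv"):
            ct = ct / factor
        scaled[name] = (category, ct)
    return scaled


TOTAL = 10.75        # application time in the example (s)
BUSY2_BEFORE = 3.0   # CT of busy2 before the optimization (s)
BUSY2_AFTER = 1.0    # CT of busy2 after the optimization (s)

ideal = amdahl_speedup(BUSY2_BEFORE / TOTAL, BUSY2_BEFORE / BUSY2_AFTER)
predicted = TOTAL / (TOTAL - 0.5)   # only 0.5 s is actually recovered

print(f"ideal:     {100 * (ideal - 1):.1f}% speedup")
print(f"predicted: {100 * (predicted - 1):.1f}% speedup")

# Hypothetical state times, used only to show the 1/F scaling rule.
example = {"busy1": ("busy", 4.0), "wait1": ("wait", 1.0), "send1": ("send", 0.5)}
print(scale_for_replication(example, 2))
```

Under these assumptions the script reports 22.9% for the ideal case and 4.9% for the prediction, matching the values quoted in this section.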

4.3 Results

To validate the utility of our framework, we investigate the performance of a Collatz application, which tests the lengths of sequences generated under repetitive iteration of a simple function on the natural numbers (see [77] for details). While this application is not readily used for practical purposes, it contains patterns that closely resemble cryptanalysis (i.e., a large number space is pre-filtered by a CPU, the FPGA performs a large parallel search, and finally the CPU collects and performs post-processing on returned numbers) [78], [79]. In addition, both the CPU and FPGA are involved in computation, with FPGA cores requiring an unknown number of cycles to complete, thus providing a case study that would be difficult to predict performance for via other methods.

The application was executed on a Pentium-4 3.2 GHz 64-bit Xeon processor containing a Nallatech H101-X PCI-X card with a Virtex-4 LX100. We used GCC 4.4.2 with “O3” optimization and Xilinx ISE 11.3 with default settings to compile the application software and hardware, respectively; the same application served as a C baseline (with FPGA tasks performed on the CPU instead). All execution times were computed from the average of three executions, with the FPGA operating at 100 MHz. Each execution consisted of running sequence-length tests on the first 154 billion numbers.

Figure 4-3 shows visualizations generated by ReCAP for an initial 2-core version of the Collatz application. From the visualizations, we quickly noticed the CPU is waiting a significant amount of time for the FPGA to complete its work (100.8 s, as shown in the CPU-block-level visualization), representing poor hardware/software partitioning that could be remedied easily by increasing the number of cores in the FPGA. Assuming the wait-time problem can be removed, the inStatus1 state is taking up a significant portion of the remaining time (13.5%). Decreasing the transmission frequency (and correspondingly increasing buffer size) could reduce this overhead. In support of this decision, clicking on the send1 state indicates an average bandwidth of 42.5 MB/s and

message size of 0.75 KB (not shown); platform benchmarks indicated that increasing message size could improve bandwidth significantly.

We now predict the effects of scaling the number of FPGA cores in our application from 2 to 56 via our performance exploration methodology (64 cores could not be validated due to FPGA resource limitations). Figure 4-10 shows the ideal prediction given by Amdahl's law (triangles), our methodology's prediction (squares and dotted line), actual application performance (asterisk), and performance achieved from an optimized version of the application discussed below (circles). Since FPGA computation represents the majority of the application time, Amdahl's law predicts near-linear speedup, with a 56-core version achieving a total 54.2x speedup over the software baseline. However, our performance exploration predicts that performance will scale in a roughly linear fashion until leveling off around 12 cores, with the 56-core version achieving a total 12.1x speedup over the software baseline.

Figure 4-10. Performance exploration for Collatz application.

Our actual 56-core version achieved a total 11.7x speedup over the software baseline, resulting in a maximum error in predicted performance of only 3.3% for our performance exploration methodology (using Amdahl's law yields 361% error), thus

demonstrating our framework's utility, even in the presence of non-uniform behavior. As examining the predicted state CTs indicated the FPGA would be mostly idle if 56 cores were employed, we developed an auto-tuning step to perform coarse-grained load-balancing between the CPU and FPGA and reduced transmission frequency (as mentioned earlier), yielding a 3.8x speedup over the original 56-core design (44.7x speedup over the software baseline), as shown in Figure 4-10.

Due to the number of application resources used (85% LUTs, 32% regs, and 57% block RAMs (BRAMs)), we chose to instrument only 2 of the 56 hardware cores (all other software and hardware were instrumented normally). We observed a change of +1.7% (or less) in software runtime, -2% LUTs, +4% regs, +35% BRAMs, and no change in maximum frequency. The unexpected changes to LUTs and BRAMs were due to instrumentation interfering with a RAM-packing synthesis optimization that decreases BRAMs at the expense of LUTs. Unfortunately, ReCAP replicates components in loops to ease instrumentation, preventing ISE from detecting this optimization for this application; better handling of loops in ReCAP would avoid this issue. Disabling this optimization yields a more informative baseline of 77% LUTs, 32% regs, and 92% BRAMs for the original application, yielding an actual overhead of +6% LUTs, +4% regs, and +0% BRAMs.

4.4 Conclusions

In conclusion, this phase provided a methodology for both performance visualization and performance exploration of RC applications, implementing a prototype visualization by extending the ReCAP framework and tool described in Phase 1. The utility of our methodology was demonstrated by predicting application performance before actually implementing changes, yielding only 3.3% error when validated. Based on these results, a scalability problem and inefficient communication were quickly located and remedied, yielding a 3.8x speedup of the original 56-core design (increasing application speedup from 11.7x to 44.7x when compared to the software baseline).
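
The resource-overhead figures reported in Section 4.3 follow from simple arithmetic, sketched below in Python. The sketch assumes the reported changes are expressed in percentage points of device resources, which is the reading under which the quoted numbers agree; the dictionary and variable names are illustrative only.

```python
# Resource usage of the original (uninstrumented) 56-core design, in % of device.
original = {"LUTs": 85, "regs": 32, "BRAMs": 57}
# Change observed after instrumentation, assumed to be in percentage points.
observed_change = {"LUTs": -2, "regs": 4, "BRAMs": 35}
# Original design with the RAM-packing optimization disabled.
baseline_no_packing = {"LUTs": 77, "regs": 32, "BRAMs": 92}

# Usage of the instrumented design, then overhead against the fair baseline.
instrumented = {k: original[k] + observed_change[k] for k in original}
overhead = {k: instrumented[k] - baseline_no_packing[k] for k in original}

for resource, delta in overhead.items():
    print(f"{resource}: {delta:+d}%")

# Speedup of the tuned design relative to the original 56-core design.
tuned_vs_original = 44.7 / 11.7
print(f"tuned vs. original 56-core design: {tuned_vs_original:.1f}x")
```

Comparing against the baseline with RAM packing disabled isolates the cost of the instrumentation itself from the side effect of the lost synthesis optimization.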

CHAPTER 5
PLATFORM-AWARE KNOWLEDGE-BASED BOTTLENECK DETECTION (PHASE 3)

While the extended ReCAP framework from Phase 2 provides performance analysis for RC applications, including performance visualization and exploration, the designer must still manually search for, discover, and characterize bottlenecks as well as formulate optimization strategies to address each bottleneck, requiring significant effort and expertise. However, different applications may share similar bottlenecks and thus may benefit from similar techniques to locate these bottlenecks as well as from similar suggestions on how to remedy them. Thus, knowledge-based bottleneck detection (a promising area of research in traditional HPC that attempts to locate common bottlenecks) could reduce both the expertise and effort required to optimize applications, significantly accelerating the optimization process.

Unfortunately, providing bottleneck detection for RC applications incurs additional challenges not typically present in traditional HPC. For example, while attempts have been made to standardize some aspects of RC-system APIs, such as OpenFPGA's GenAPI [80], the continued widespread diversity of both hardware and software APIs for RC systems complicates bottleneck detection, as there are no standardized constructs such as MPI_Send (a construct within the MPI library that sends data from the calling node to another node in a system) that provide hooks for recording relevant statistics (e.g., measured bandwidth or bytes transferred). Another key challenge exists in defining, locating, reporting on, and suggesting remedies for application bottlenecks. Many approaches in traditional HPC are ill-suited for RC applications due to common assumptions that all system processing elements are relatively homogeneous, general-purpose, and at roughly the same level in the system hierarchy (i.e., applications execute on CPUs that can perform general computation and communication, individually or in groups). In contrast, components within an RC application are almost certainly heterogeneous, may be designed solely for scatter communication or pipeline control

(non-computational components), exist in fairly deep and intricate hierarchies, and may perform an arbitrary amount of work in a single cycle, thus resisting conventional bottleneck detection and classification schemes. In addition, an FPGA's hardware flexibility significantly expands the possibilities for optimization, increasing the complexity of formulating bottleneck remedies. Finally, from a practical standpoint, trace data, which is used heavily for bottleneck detection in traditional HPC, cannot be relied on in FPGAs as these devices typically have limited memories and real-time requirements (i.e., it can be difficult to “pause” an application on an FPGA due to low-level interaction with external hardware, such as memories or other FPGAs).

Thus, in this phase of research we extend our Phase 1 and 2 framework and tool with platform-aware, knowledge-based bottleneck detection to address the current lack of bottleneck detection available for RC applications. The presented framework and tool are extensible in that users may easily add support for their own platform, as long as it fits within our platform-template model, and may also easily modify the bottleneck knowledge base. We also present what we believe to be the first taxonomy of common RC bottlenecks, including detection and optimization strategies, which we use to populate our ReCAP tool's knowledge base. Although this work should aid novice RC designers who may be unaware of many potential bottlenecks and optimizations, experienced RC designers can benefit as well from quick feedback on the location and severity of performance problems. We first focus on platform templates to address bottleneck detection and portability across diverse RC platforms, followed by a discussion of our bottleneck detection framework and tool. We then present our taxonomy of common bottlenecks that ReCAP can detect and suggest optimizations for, followed by a demonstration of this research on two application case studies on a total of three diverse RC systems.

5.1 Platform Templates

Platform templates have two purposes in ReCAP: bottleneck detection and portability. In this section we detail our platform-template framework and its implementation in ReCAP. While this framework handles many common API styles in both software and hardware, we will also point out cases not currently handled as well as concepts that could potentially address these situations.

ReCAP provides a single location within the HDL Instrumenter that encapsulates all platform-specific information, permitting a typical user to quickly add support for their own platform; this information is divided into several software and hardware API tabs for easier access. ReCAP allows each platform to be given a unique name and saves all platform information in a separate file, allowing platform templates to be easily loaded and shared. In fact, ReCAP can support user APIs built on top of a platform's API, allowing several different platform templates to coexist for the same platform; the user simply selects the corresponding template that matches the current API in use.

5.1.1 Software

In order to locate bottlenecks involving CPU-FPGA communication, or to even determine when such communication is occurring, ReCAP must know what API calls are possible as well as some auxiliary information (e.g., purpose, bytes transferred if a transfer) in order to effectively instrument these calls. Specifically, a user must enter a C/C++ prototype for a function, macro, or class method into the HDL Instrumenter.¹ As this information can be directly obtained from the platform API's header, which must be available on the system, the user can simply copy and paste this information into ReCAP.

¹ Macros are currently handled by providing an equivalent function prototype; while this approach could cause problems by presupposing argument and return types, an improved implementation could avoid this issue by handling macros separately.

For each prototype supplied, the user must indicate its type (e.g., configuration, acquisition, data transfer, release). In addition, C/C++ expressions can be given to identify the FPGA number associated with this call (if any), bytes transferred (for data transfers), and filenames (for configuration). These expressions are free to access function arguments, global variables, class methods, or class public fields; for example, the FPGA number expression may simply be fpgaNum or this->getFpgaId(), assuming these represent an argument in the prototype or a public class method, respectively.

From this information, ReCAP identifies and instruments each API call in the user's source code. Function- and macro-based API calls are overridden via macros that effectively change the name of each API call in user source code in order to instead call instrumented wrapper functions or macros, which record various applicable statistics (e.g., time spent, bytes transferred, bandwidth achieved) in addition to performing the original API call.² Class methods are instrumented by subclassing each class within a platform API, with each subclass method recording statistics before calling the corresponding base class' method; in addition, any instantiation of a platform API class in the user's source code is replaced with an instantiation of the instrumented subclass in a copy of the user's source code. Unfortunately, this approach would not be sufficient for constructors and destructors due to the order in which constructors and destructors are called. For example, timing a class constructor in a platform's API would require a timer to be started before that constructor was called, whereas the subclass' constructor won't be called until after the platform API's constructor. To handle this situation, ReCAP's subclass employs multiple inheritance, first inheriting from another ReCAP-generated class that, in the case of the constructor, handles starting timers and

² The use of macros for overriding functions causes problems with C++'s function overloading, although method overloading is handled correctly; an improved implementation could handle this case properly via name-mangling for overloaded functions.

other measurement code before the platform API's constructor is called; the subclass' constructor then stops timers and records statistics. Destructors are handled in the reverse fashion due to the reverse order in which destructors are called.

For bottleneck detection, ReCAP also allows a user to associate a default “reason” for an API call (e.g., initializing, broadcasting, waiting to send due to a full buffer); “reasons” will be discussed in more detail in Section 5.2. For transfer functions, microbenchmark data can be entered in tabular form, where each row includes the transfer size (in bytes) and time taken to transfer that amount of data (in seconds). Microbenchmark data permits ReCAP to detect bottlenecks and make specific suggestions for communication-related bottlenecks (discussed in Section 5.3.1). ReCAP could also perform microbenchmarks automatically, such as upon installation or during a special calibration step, although additional information about how to use each API call would then be required; in the absence of microbenchmark data, ReCAP could simply detect performance deviations between different instances of the same API call (these techniques have been demonstrated in traditional HPC). Finally, a generic text field for platform-specific bottleneck suggestions is provided to associate known issues with given API calls. For example, on a Cray XD1 system [11], it is roughly 200 times more efficient to have an FPGA write data into the CPU's memory than to have the CPU read from the FPGA's QDR SRAMs [81].

A single API call could represent more than one behavior. For example, a transfer function may represent a read or a write depending on a class field or function argument, or standard library functions may be used to access the FPGA, such as an XtremeData XD1000 system's [12] use of the standard C open function to initialize the FPGA. To support these scenarios, ReCAP supports a condition to determine whether or not monitoring is active for a given API call, allowing a single transfer that can read and write to be monitored separately for each case (by entering the prototype twice with conditions testing for a read and write, respectively) and for other uses of


standard functions (such as open) to be ignored unless the correct arguments are given that indicate an FPGA access. A subtype label is also provided to allow a user to easily distinguish different conditional versions of the same API call in visualizations and bottleneck reports.

While the information discussed above is sufficient for monitoring and bottleneck detection, ReCAP must also be able to initiate transfers in order to retrieve hardware performance data at runtime, requiring additional information about how to actually perform these transfers. While a general implementation could provide additional detail about all transfer types, we reduce the amount of data a user must enter by requiring this additional information for only one send and receive type. Thus, the user must provide include directives for all FPGA libraries, a minimum and maximum transfer size permitted, the data type for FPGA transfers, and a short code fragment (usually one or two lines) that shows how to send (or receive) an array of data to (or from) the FPGA. ReCAP must also be aware of how to allocate, free, and access these arrays; default code for these tasks is provided, but can be overridden since FPGAs sometimes require page-aligned data, and thus special allocation functions and data structures. Further, ReCAP may be provided with conditions indicating a send or receive error for better error checking. Additional miscellaneous information can be provided as well, including a size multiplier on the FPGA data type for platforms that have different data widths in hardware and software, a file exclusion list to prevent instrumenting the inside of a user-defined API, and application-specific constants needed for FPGA access.

Unfortunately, some issues remain with our approach. First, some APIs employ memory-mapped FPGA access, where reading or writing to a specific memory location from the CPU actually constitutes an FPGA transfer. It is not possible, in general, to statically determine whether a given pointer access is in the FPGA's memory range, although runtime support is possible and many common cases could be detected statically. Thus, ReCAP currently requires wrapping such pointer accesses in simple


macros or functions before they can be monitored; functions can be inlined to prevent performance loss. Second, we assume the read and write functions used by ReCAP to retrieve hardware performance data have addresses associated with them; in fact, during instrumentation, the user must provide an address range that is unused by the application so that ReCAP can hijack and use this range for transferring data. This address-based approach precludes hijacking stream-based API calls that lack address information; this issue could be resolved in a number of ways including hijacking another API call that does provide address information (ReCAP only needs one address-based read and write API call), by embedding a special marker in the data stream and escaping that marker in any data sent by the application, or by setting a specific flag on the FPGA if such a feature were available and unused by the application. Despite these limitations, we have found that most platforms can be supported quickly (e.g., in a few hours). For example, ReCAP currently supports software and hardware APIs for a Nallatech PCI-X card [74], an XtremeData XD1000 system [12], and a GiDEL PROCStar III PCIe card [82]; Table 5-1 in Section 5.4 contains more information on these platforms.

5.1.2 Hardware

In order to determine when communication occurs in hardware, ReCAP must know what events and data are associated with a read or write. Thus, ReCAP requests information concerning the names of data and address signals for both incoming and outgoing transfers as well as HDL conditions that indicate when data is available for (or when data can be sent by) the user's application. In addition, a user can provide an HDL code fragment to perform required actions when sending or receiving data (e.g., setting a valid or acknowledge flag high). Also, due to the wide variety of ways data can be transferred, ReCAP supports both memory-style access and block-transfer-style access (e.g., DMA transfers where the address and number of words are sent along with a request flag). For block transfers, the user must provide the signal name indicating the number of words and the condition indicating a transfer request (address signals


were already specified with the data signals earlier); the transfer ends when the number of words reaches 0. The user may also select whether the API will keep the address up-to-date on each cycle, or whether only a starting address is provided, in which case ReCAP will manage updating the address internally. Further, due to different latencies required by different platforms, the user may specify the number of cycles to delay outgoing data, including 0 for the same cycle. Finally, ReCAP requires the user to provide the top-level clock signal name (ReCAP currently only operates in one clock domain, although an extended implementation could support multiple clock domains) and reset condition.

As with software, we again assume an address-based scheme; if one is not present, techniques mentioned above are applicable here as well. We also note that our current framework only considers attachment to a CPU-FPGA communication port, whereas the FPGA may connect to another FPGA or external memory as well. We leave the extension of this framework to monitor these types of ports for future work, but this extension should be similar to the techniques presented here. Thankfully, this limitation only prevents automatic monitoring and bottleneck detection on these ports, as our current implementation can monitor any port via a few hardware pragmas in the top-level file; pragmas are discussed in Section 5.2.

5.2 Bottleneck Detection in RC Applications

As discussed at the beginning of this chapter, there are a number of factors beyond system and API diversity that complicate bottleneck detection in RC applications, including intricate hierarchies of interconnected components, component heterogeneity, and non-computational components. For example, Figure 5-1 depicts a simple mockup RC application that distributes sensor input between two application core pipelines, performs some computation using SRAM in that pipeline, collects results, and potentially offloads data for further processing to a dual-core CPU (with two software threads per core) before storing results in DDR memory. This application exhibits heterogeneity


since many different blocks exist (e.g., the P1, P2, and P3 pipeline stages are likely performing fairly different tasks), non-computational components (e.g., buffers, distributors, and collectors), and complex interaction amongst blocks (e.g., data may traverse through 6 or 10 blocks via several paths). Thus, these differences must be addressed when detecting bottlenecks in RC applications.

[Figure 5-1: diagram not recoverable from the extracted text.]
Figure 5-1. Directed graph of an RC application that takes input from a sensor, processes data using a two-core pipeline, potentially offloads data to threads on a multi-core CPU for further processing, and finally stores results in DDR memory.

We leverage our Phase 2 research in performance visualization and exploration for abstracting application behavior as a directed graph of blocks, where a “block” represents a software thread, a clocked (see footnote 3) VHDL “process” block, or a clocked Verilog “always” block (shown as black boxes in Figure 5-1). One key requirement of a block is that it operates in parallel with all other blocks, possibly with some dependencies due to interactions between blocks. VHDL “entity-architecture” pairs, Verilog “modules”, and CPU cores are called “components,” and may contain one or more blocks. While each block may perform an arbitrary amount of communication and computation (e.g., a single block may be simultaneously communicating with several blocks while computing a multiply-and-accumulate), we choose a block to be the fundamental unit of parallelism in our abstraction of an application. This choice represents a tradeoff between a desire for detailed recording and modeling of fine-grained parallelism and a desire to minimize extraneous detail recorded and visualized.

Footnote 3: Purely asynchronous logic will already have had timing optimized during synthesis and implementation and can be monitored on clock boundaries if necessary.

The performance visualization and exploration research discussed in Phase 2 also defines a pragma-based syntax, providing an application designer with a simple, unobtrusive methodology for specifying high-level information concerning application behavior; we extend this syntax here for the purpose of bottleneck detection. Figure 5-2 provides some examples of our extended syntax for software and hardware pragmas. Extensions to the syntax defined in our previous work include subdividing a “busy” category into “work” and internal “overhead” as well as the addition of a “reason” argument; for convenience, we briefly describe each part of a pragma's syntax below, including aspects defined in our previous work as well as these extensions.

Each pragma in software or hardware defines a “state” that the given block is in when that pragma is reached in source code; these pragmas are placed either before an API call or before the first statement in an HDL branch. Pragmas can indicate a block is working, performing internal overhead, communicating, or waiting. Specifically, the following categories are permitted:

- Work: tasks directly associated with the objective of the application (e.g., matrix multiplication or convolution); this should be maximized
- Overhead: internal, auxiliary tasks that are artifacts of the implementation, such as bookkeeping or loop counter updates; this should be minimized
- Send/Recv: movement of application data to or from this block


- Wait send/Wait recv: synchronization with another block (e.g., waiting to send data to a locked resource or waiting to receive data from a block that is currently working)

[Figure 5-2: pragma listings not recoverable from the extracted text. Panels: A) SW pragmas; B) HW pragmas.]
Figure 5-2. Examples of user-defined pragmas.

Additionally, a condition can be provided to indicate when a pragma is active; thus multiple pragmas can be defined per API call or HDL branch with conditions indicating which pragma is active at any given time. For example, only one of the two hardware pragmas in the “waitAck” state in Figure 4-2B is active, depending on whether the “ack” signal is 0 or 1. All categories except for “work” (which has no arguments) accept an optional first argument representing the “reason” the application is performing the given operation (we will provide exact details for “reasons” later in this section). Finally, the communication and wait categories require a target and message ID to be provided (second and third arguments); a “target” indicates which block(s) the current block is


communicating with or waiting for while a “message ID” distinguishes between different communication involving the same blocks.

Figure 5-2A shows several examples of software pragmas for bottleneck detection. The first pragma is given a unique, user-defined name of “writeX” and is categorized as a “send.” The reason given is “data,” indicating the following API call is sending data, as opposed to control messages, to a process block labeled “in” within the “top” VHDL entity/architecture (targets are given using hierarchical references, or using specially named blocks such as $CPU or $MEM); a message ID of “x1” is given. Note that the condition indicates this pragma will be active only when “words” is greater than 0, as presumably the following API call does not represent communication if 0 words are sent. The second pragma, named “waitResult,” indicates the following API call is waiting to receive data because of an empty buffer; the API call is waiting to receive data which will have a message ID of “r1” from the “top.out” process block.

Figure 5-2B shows several examples of hardware pragmas. The first, named “getX,” is complementary to the first software pragma. The process block is described to be receiving data from the CPU with the same message ID; note that it is possible to receive multiple message IDs if needed (e.g., if different threads or different states within the same thread will interact with the same HDL pragma state). The next two pragmas indicate when the given block is working (e.g., performing some multiplication) and performing an internal overhead task with the “update” reason (e.g., updating a loop counter). The last two hardware pragmas demonstrate the use of conditions to differentiate behavior, even within the same state; in fact, no state machine needs to be present in hardware at all. The pragma named “waitNext” indicates the VHDL process block is waiting to receive an acknowledgment from another VHDL process block, “top.out,” when the “ack” signal is low. However, when the “ack” signal is high, the second pragma, named “next,” is active, indicating a control message (the acknowledge) is being received with the given message ID.
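The pragma arguments described above can be summarized in a small data model. The following is a minimal Python sketch and not ReCAP's actual pragma syntax or implementation (the real syntax is the one shown in Figure 5-2); the class name Pragma, its field names, the category spellings, and the validate helper are all hypothetical. It encodes only the rules stated in the text: “work” takes no arguments, the other categories accept an optional reason, and the communication and wait categories require a target and a message ID.

```python
from dataclasses import dataclass
from typing import Optional

# Categories named in the text; the identifier spellings are assumptions.
COMM_OR_WAIT = {"send", "recv", "wait_send", "wait_recv"}
CATEGORIES = {"work", "overhead"} | COMM_OR_WAIT


@dataclass
class Pragma:
    name: str                         # unique, user-defined state name
    category: str                     # one of CATEGORIES
    reason: Optional[str] = None      # optional for every category except "work"
    target: Optional[str] = None      # hierarchical block reference (assumed "top.in" style)
    message_id: Optional[str] = None
    condition: Optional[str] = None   # expression text; pragma is active when it holds


def validate(p: Pragma) -> list:
    """Return a list of rule violations (empty when the pragma is well formed)."""
    errors = []
    if p.category not in CATEGORIES:
        errors.append("unknown category: " + p.category)
    if p.category == "work" and (p.reason or p.target or p.message_id):
        errors.append("'work' takes no arguments")
    if p.category in COMM_OR_WAIT and not (p.target and p.message_id):
        errors.append(p.category + " requires a target and a message ID")
    return errors


# The two software pragmas described for Figure 5-2A, restated in this model.
# The reason string "empty buffer" and the target "top.in" are illustrative spellings.
write_x = Pragma("writeX", "send", reason="data", target="top.in",
                 message_id="x1", condition="words > 0")
wait_result = Pragma("waitResult", "wait_recv", reason="empty buffer",
                     target="top.out", message_id="r1")
```

Calling validate(write_x) returns an empty list, while a “send” pragma without a target or message ID is reported as malformed.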


Bottlenecks can occur when one or more blocks are working slower than other blocks they interact with. In order to detect bottlenecks, we monitor time spent in each state defined by a pragma as well as transitions between these states, thus generating a Markov model of each block's behavior. ReCAP employs three constructs in bottleneck detection: reasons, metrics, and bottleneck rules; the latter two are discussed below and are similar in notion to the rules, metrics, and parameters in [50] as well as to other techniques used in knowledge-based bottleneck tools [49], [33].

“Reasons” provide an explanation for why an application is performing a given task. In addition to just indicating that a given block is waiting to receive data, a user can now specify the block is waiting to receive because of an empty buffer or because of contention (or both; multiple “reasons” can be provided for a single state). These “reasons” are user-editable and are closely tied to bottleneck rules discussed below. For example, a contention bottleneck can be detected by searching for blocks that could achieve at least 1.05x speedup if all time spent in states with “contention” as a given “reason” were eliminated. Figure 5-3 shows the different “reasons” provided in ReCAP for each category.

The key concept behind “reasons” is that it is relatively easy for an application designer, if provided a list of “reasons,” to look at an API call or branch of HDL code and determine whether it is waiting for an acknowledge, sending control messages, or updating a loop counter, whereas it is relatively difficult for a tool to ascertain such high-level information automatically. Conversely, it is possible for a tool, given the above information for each API call or HDL branch of interest, to analyze the performance of each block for potential bottlenecks (accounting for block dependencies), whereas this task is fairly difficult for an application designer to perform manually.

Figure 5-3. Default “reasons” provided by ReCAP for classifying API calls or HDL branches.

“Metrics” are tool-provided measurements of runtime behavior (e.g., duration of an event, bytes transferred, bandwidth observed), which can be reported for any block or overall, and can be filtered by any combination of “reasons;” metrics are used to define bottleneck rules, as discussed below. For example, a metric could return the minimum bandwidth for a software API function that was waiting to send due to a full buffer or waiting to receive due to an empty buffer, or a metric could return the total time a hardware block spent communicating or waiting. Metrics can be hardware- or software-specific, and currently include items such as total time spent; min/avg/max/total bytes transferred (or bandwidth observed); number of calls, call groups, min/avg/max consecutive calls, and call type (for software API calls); various statistics for microbenchmark data for a given API call (for comparing with actual bandwidth achieved); and miscellaneous metrics for computing formulas such as percentages and speedup. Unfortunately, adding, modifying, or removing metrics requires some detailed tool knowledge, and thus metrics are not user-customizable,


although we have localized where metric information is defined to facilitate the inclusion of additional metrics.

Since it is desirable to have a more flexible, user-defined metric structure, we have begun development of a hardware directive framework for adding user-defined hardware modules (with some constraints on port interfaces) that could extend ReCAP's measurement capabilities, which we briefly mention here. These directives can be defined in terms of each other, and thus only a limited subset of hardware metrics must actually be defined in HDL; the remainder are formed by a simple macro syntax that can compose new metrics from other ones. The hardware subset includes directives such as conditional constructs (from basic to multi-cycle pattern-based conditions), sum, min/max, histogram, and several directives aiding in iteration which generate arrays of linear or exponential sequences; composite directives include items for correlation, additional multi-cycle pattern-based conditions, and even computing average and standard deviation. Software metrics could be similarly defined using C/C++ code modules.

“Bottleneck rules” include a description of the bottleneck, the bottleneck condition (any boolean C/C++ expression, typically employing one or more metrics), textual suggestions for resolving the bottleneck along with arguments to insert into the text (given in printf-style format), whether the bottleneck applies to individual blocks or whether it can be applied to the entire application (or both), any additional metrics the user may be interested in, and information concerning the original time and new time if the bottleneck were resolved (for computing speedup). Speedup is used to filter out bottlenecks that, if remedied, would not improve application performance by at least the user-defined speedup threshold. Bottleneck rules may be added or modified by the user through ReCAP's GUI and are saved along with all “reasons” in a separate file to facilitate sharing of bottleneck detection strategies in a community; the file format is


currently that used by the standard Java “Properties” class, although other formats such as XML could also be of use.

ReCAP detects and produces reports concerning bottlenecks at runtime immediately after the user application has finished executing by testing all applicable bottleneck rules on all software and hardware blocks and for the application as a whole. We augmented ReCAP's SVG-based visualization from our previous Phase 2 research with warning icons for blocks containing bottlenecks. These icons directly link to an HTML file containing bottlenecks detected in that block; internal named anchors are used to jump within a single bottleneck file, and thus all bottlenecks may be reviewed directly as well. All reported bottlenecks display information concerning the bottleneck type; potential speedup if the bottleneck were remedied; suggestions for remedying the bottleneck, which can include specific data from tool metrics; all values of metrics used to determine whether to display this bottleneck; and any other user-specified metrics of interest. Currently, speedup presented is ideal, assuming no other bottlenecks prevent the application from improving by the given amount; in reality, dependencies amongst blocks could result in considerably less speedup. Our prior performance exploration research in Phase 2 dealt directly with this estimation problem; thus, while not discussed or implemented in this work, integrating performance exploration would significantly improve the accuracy of speedup estimates. Finally, platform-specific suggestions and microbenchmark data are also included (the latter as a link to an HTML table). Figure 5-4 shows an example of ReCAP's augmented visualization, which contains warning icons for blocks with bottlenecks detected, as well as the associated bottleneck information.

Although users are free to add or modify “reasons,” “bottleneck rules,” and “platform templates,” typical users of ReCAP need only add pragmas to their user source code to take full advantage of bottleneck detection; Section 5.3 will present our taxonomy of common RC bottlenecks (and associated “reasons”) that are provided by default


in ReCAP. In addition, while we deal with only CPUs or FPGAs here, our block-based approach could be applicable to a broad class of heterogeneous systems (e.g., GPUs, DSPs, Cell processors); only the methodology for instrumentation and measurement must change.

[Figure 5-4: screenshot contents not recoverable from the extracted text.]
Figure 5-4. Example of bottleneck detection results, showing inclusion of warning icons to indicate blocks with bottlenecks and a portion of the detailed bottleneck report.

As mentioned at the beginning of this chapter, our current implementation relies solely on profile data for bottleneck detection, even though traditional HPC employs trace data for this purpose; to compensate, ReCAP does provide a number of useful time-dependent profile metrics, such as those reporting on consecutive API calls and transitions between different states in both software and hardware. While ReCAP supports tracing in both software and hardware, the largest FPGAs currently contain


less than 10 MB of on-chip memory; in contrast, a 40-block, 100 MHz design generating 64-bit trace records every 10 cycles would require 3.2 GB/s. Further, future devices with larger memories are likely to employ larger applications, which will likely generate larger amounts of trace data. Thus, reducing the number of trace events along with compressing or otherwise pre-processing trace data to save storage could be of particular use. Also, unlike with CPUs, it can be difficult to pause a user application in an FPGA in order to write trace data to memory; the FPGA may interact with external hardware in a timing-dependent manner, allowing a pause to miss critical data or become unsynchronized with another device. If handshaking were required between the FPGA and all external devices, pausing the user's application on the FPGA may be a viable method for recording trace data, albeit by incurring additional runtime overhead. We finally note that ReCAP does currently employ a memory hierarchy to improve tracing capabilities; local trace buffers are constructed from on-chip memory and then fed into a single, per-chip trace collector using either internal or external memory, the latter of which is connected manually at this time. This approach allows high-bandwidth recording for small bursts of trace records for each block while providing storage for larger amounts of trace data. As tracing can provide a wealth of data useful for bottleneck detection, further research into efficient performance-based tracing methodologies for FPGAs could be of great use.

5.3 Common Bottlenecks in RC Applications

In this section we attempt to systematically explore and taxonomize potential RC bottlenecks, drawing upon our experience with RC applications as well as concepts and techniques from knowledge-based bottleneck detection in traditional HPC. Since an RC application may contain all the problems of a standard parallel application, traditional HPC bottleneck detection tools are quite beneficial in tuning the CPU portion of an RC application, allowing this work to be easily integrated with the significant amount of literature and tools already present for traditional HPC. In fact, since ReCAP builds upon


PPW, it inherits all of PPW's software bottleneck analysis capabilities. Thus, we focus on bottlenecks that may occur due to CPU-FPGA communication and for bottlenecks within the FPGA, culminating in a taxonomy of potential RC application bottlenecks (including detection and optimization strategies for each bottleneck) that dovetails with traditional HPC knowledge-based bottleneck detection research. Figure 5-5 provides our taxonomy of possible bottlenecks, which will be discussed for the remainder of this section. Note that while this taxonomy contains common bottlenecks that apply to both software and hardware, these bottlenecks are defined separately in ReCAP to permit different detection and optimization strategies for the same bottleneck.

[Figure 5-5: diagram not recoverable from the extracted text.]
Figure 5-5. Taxonomy of common bottlenecks in an RC system.
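Many entries in this taxonomy rely on the same test described in Section 5.2: estimate the speedup a block would see if the time attributed to a given “reason” were eliminated, and report the bottleneck only if that speedup meets a threshold. The Python sketch below illustrates that test under simplifying assumptions. The function names and the per-state input layout are hypothetical, the 1.05x default merely echoes the contention example given in Section 5.2, and the computed speedup is ideal in the sense noted there, since dependencies between blocks are ignored.

```python
def ideal_speedup(total_time, removable_time):
    """Speedup if `removable_time` of a block's `total_time` were eliminated."""
    if not 0 <= removable_time < total_time:
        raise ValueError("removable_time must be in [0, total_time)")
    return total_time / (total_time - removable_time)


def detect(state_times, reason, threshold=1.05):
    """Return the ideal speedup if it meets `threshold`, otherwise None.

    `state_times` maps a state name to (seconds spent, set of reasons);
    this layout is an assumption made for the example.
    """
    total = sum(t for t, _ in state_times.values())
    removable = sum(t for t, reasons in state_times.values() if reason in reasons)
    speedup = ideal_speedup(total, removable)
    return speedup if speedup >= threshold else None


# Hypothetical profile of one block: 8 s of work, 2 s waiting on a shared lock.
profile = {
    "compute": (8.0, set()),
    "waitLock": (2.0, {"contention"}),
}
result = detect(profile, "contention")   # 10 / (10 - 2) = 1.25
```

With this made-up profile, the contention rule fires with an ideal speedup of 1.25, whereas a reason that accounts for no time yields a speedup of 1.0 and is filtered out.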


We break our discussion into four general bottleneck categories: communication (Section 5.3.1), synchronization (Section 5.3.2), internal overhead (Section 5.3.3), and imbalances (Section 5.3.4). Since a number of bottlenecks below are detected simply by determining if a significant portion of time for a block was spent performing tasks with an associated reason (where “significant” implies a possible speedup of more than a user-defined threshold if the problem were remedied), we will only highlight detection strategies that require additional conditions to be tested.

5.3.1 Communication Bottlenecks

Bottlenecks may occur during communication between blocks (recall that a block can be a VHDL process, Verilog always block, or software thread, and thus this includes communication between CPUs and FPGAs, blocks within an FPGA, and even between FPGAs). If a block spends a significant portion of time communicating with other blocks, this constitutes a potential “general communication” bottleneck. However, some blocks may be solely purposed for communication (especially common in FPGAs); in hardware, this case could be detected by searching for blocks with no pragmas specifying the “work” category, although, in software, ReCAP currently considers all time spent between API calls to be “work,” making such detection more difficult. Suggestions for ameliorating this generic bottleneck include overlapping communication with other tasks if possible, employing bit-packing or compression, and repartitioning the algorithm to reduce the data transferred. Note that if more specific communication bottlenecks are detected in a block, basic bottleneck suggestions will still be displayed, thus eliminating the need to repeat these suggestions for more specific cases.

Another common scenario in communication is that transfer rates can vary drastically with transfer size and type depending on the protocols and interconnects used; see Figure 5-6 for a subset of transfer sizes and types for an XtremeData XD1000 platform, a Pentium-4 Xeon system equipped with a Nallatech H101-X accelerator, and a quad-core Xeon E5520 system equipped with two quad-FPGA GiDEL PROCStar III


cards. Thus, using an inefficient transfer type or size constitutes a potential bottleneck. ReCAP currently only detects this bottleneck in software, since hardware communication is often not packetized and thus incurs no overhead beyond the actual data transfer, although more complicated protocols could be used in hardware for inter-FPGA communication if the FPGAs were not tightly coupled. For example, Figures 5-6A and 5-6B demonstrate the potential for large differences between read and write performance, whereas Figure 5-6C shows how different transfer types may perform better for different transfer sizes. Figure 5-6B also shows that non-overlapped transfers, where multiple simultaneous transfers can occur via either non-blocking communication or collective functions (e.g., broadcast, scatter, gather, reduce), can be a potential bottleneck in RC applications; this is also true in hardware where it is ideal for a block to perform as much communication as possible in parallel with other tasks, and thus communication should be overlapped wherever possible. These phenomena have been well-researched in non-RC contexts [83].

ReCAP detects such bottlenecks by comparing actual bandwidth recorded with the best microbenchmark bandwidths from the platform template; as mentioned in Section 5.1, ReCAP could also automatically perform microbenchmarks or detect performance deviations between similar API calls. From this data, ReCAP can make specific suggestions on the best type to use if transfer size was held constant, the best size to use if transfer type was held constant, and the best overall size and type to use for the platform. The best size is given as a range (e.g., indicating transfer sizes that achieve 95% of maximum bandwidth for that transfer type), with a link to microbenchmark data provided for further investigation. This type of specific suggestion, along with potential speedups, is just one example of where ReCAP's suggestions can be useful, even for expert RC designers. Suggestions for increasing transfer size include unifying different transfers together (either by placing different items consecutively in the memory map or by embedding control, masking, or address information in the


communication stream) and creating buffers, or increasing buffer sizes, to handle larger transfers (if the data cannot be processed in real-time) (see footnote 4). Transfer size may be decreased simply by breaking a large transfer into pieces, in which case buffers may be able to be reduced in size to save resources. As an example, transferring thirty-two 128 KB packets on the Nallatech platform is more efficient than transferring a single 4 MB packet (360 MB/s vs. 291 MB/s).

Footnote 4: ReCAP's actual suggestions often contain additional description and examples not included here for brevity, but which are quite useful for less-experienced RC designers.

[Figure 5-6: plots not recoverable from the extracted text. Panels: A) Nallatech H101-PCIXM; B) GiDEL PROCStar III; C) XtremeData XD1000.]
Figure 5-6. Platform transfer rate vs. transfer size across various RC systems and communication types, demonstrating common communication problems. The non-blocking transfer rate was computed as the sustained cumulative transfer rate of eight concurrent non-blocking transfers.

Control messages deserve specific mention as these are commonly employed in RC applications and typically incur significant overhead. ReCAP detects control messages directly from pragmas that specify communication to be control oriented. Suggestions for addressing control bottlenecks include moving as much control logic onto the FPGA as possible (e.g., instead of reading from an FPGA register, testing a value, and then possibly starting some action on the FPGA, the register test can be moved onto the FPGA), employing unused address bits in a read to carry control information, or increasing the size/duration of the work being controlled (e.g., loop unrolling), thus reducing the amount of control needed.

In software and hardware, it may be possible for both the application and interface to stall a transfer (e.g., due to no data being available or preemption by another transfer), potentially causing a stalling bottleneck on behalf of either the application or interface. These bottlenecks can be detected in hardware by monitoring the appropriate signals on the interface port; however, it is rare that an API would surface such information in software, and stalling could occur in some intermediate buffer not easily monitored from either software or hardware. These bottlenecks could be remedied by the addition of buffers to prevent stalling or, in the case of the application, by increasing the rate at which the application can accept data (e.g., via replication of application components); while a similar suggestion could be made for improving an API, this is typically not possible for an application designer, who often does not have the source code for an API available, and thus reducing the application's data processing speed is suggested as this is unlikely to affect performance and may save resources or power.

5.3.2 Synchronization Bottlenecks

Synchronization bottlenecks indicate a block is waiting on one or more other blocks, either to transfer data or to reach a certain point in execution before continuing.


If two blocks must communicate synchronously, one block may arrive at its side of communication later than another, permitting a “late sender” or “late receiver” bottleneck (common terminology in traditional HPC bottleneck detection). Suggestions for remedying this situation include performing unrelated computation or communication while waiting, buffering or double-buffering (depending on whether the application is streaming-based or block-memory-based), or remedying an imbalance between blocks (either by increasing the efficiency of the slower block or decreasing the efficiency of faster blocks, the latter of which could reduce power or resources used). If employing a buffer in the streaming case, ReCAP suggests a transmission size that is at most half the total buffer size (to ensure the buffer can be refilled before it drains) and at least a quarter of the total buffer size (to maximize communication bandwidth).

As buffering is a common technique for addressing synchronization bottlenecks, additional common bottlenecks include a full-buffer or empty-buffer bottleneck. ReCAP suggests an imbalance may exist, as mentioned above, or that burst traffic may be occurring if the buffer is full and empty often, in which case the buffer size should be increased to handle larger bursts or the transmission size should be decreased and the transmission frequency increased to smooth out the bursts, such as by having a separate thread handle transmission periodically.

Several additional types of synchronization bottlenecks are also handled in ReCAP. A polling bottleneck can occur when the status of another block is repeatedly queried until some condition is met. In hardware, this situation is extremely common and is similar to a normal synchronization bottleneck; however, in software, this behavior is problematic, wasting communication bandwidth and processor time. ReCAP detects this bottleneck by searching for any API call that is waiting to receive data and is, on average, executed three or more times consecutively. Suggestions include using interrupts if supported by the system and API, moving any unrelated tasks between polling calls, and using a separate thread to perform the poll, potentially adding an indicator into data returned


from the poll, such as “percent done,” to estimate the time the thread should sleep before polling again.

An acknowledge bottleneck can occur when waiting for a block to acknowledge some event. In this case, ReCAP suggests speculatively continuing without the acknowledge if an acknowledge is expected, storing any checkpoint or state information needed to restart or retry a task if the acknowledge does not arrive so long as this storage is not prohibitive in terms of memory requirements. A barrier bottleneck indicates that a block is waiting for other blocks to reach a given point of execution before continuing. In this case, decoupling these blocks via buffering may be possible if no feedback or resource contention exists that would prevent blocks from continuing. For example, to average the results from several blocks, a running average could be computed or a buffer could be added to the output of each block, with the average being computed from the buffer output.

A contention bottleneck occurs when blocks are spending a significant time waiting on a shared resource. There are a number of suggestions for alleviating this type of bottleneck. One suggestion involves reducing time needed to acquire or release a lock, usually by adding an arbiter that manages all requests to the shared device. Another suggestion is to increase the efficiency or replication of the shared resource itself, such as by placing the memory in a faster clock domain that can handle two transactions per application cycle or by either replicating a shared memory in hardware or using an additional memory bank from software (issuing reads to different copies while writes are issued to all for consistency). Additional suggestions include forcing a staggered ordering to reduce or eliminate locking, increasing (or decreasing) granularity of tasks performed between lock and release if the cost for locking is high (or low), ensuring no block holds a shared resource any longer than necessary, ensuring the minimum locking is performed to still ensure consistency, and finally reducing the

PAGE 101

efficiency of other blocks or the number of blocks accessing the shared resource to potentially save resources with little performance loss.

5.3.3 Internal Overhead Bottlenecks

A block can spend a significant amount of time performing bookkeeping or other internal overhead tasks, thus allowing an internal-overhead bottleneck to occur. As with other generic bottlenecks, this bottleneck may be addressed by parallelizing, pipelining, or otherwise overlapping these tasks with others or by increasing the size/duration of the work associated with internal overhead so that fewer overhead tasks are performed (e.g., loop unrolling). However, many specific variants of this bottleneck are also detected. Initialization, finalization, and update bottlenecks occur when significant time is spent in a block performing initialization, update, or finalization tasks. Suggestions include reducing or eliminating this overhead; for example, if a histogram is to be accumulated in memory and thus needs to be cleared for each new dataset, it may be possible to instead maintain a bit-vector that indicates which memory locations have been accessed since the last dataset and thus determine whether to store or add a given value to the current memory location.

A multiple-initialization or multiple-finalization bottleneck indicates that software has not only spent a significant amount of time configuring or releasing the FPGA, but that software performed this task several times; this typically indicates several different configuration files have been loaded during runtime to accelerate different phases of an application. These bottlenecks are detected by ensuring the API call is of the appropriate type (e.g., configure, release) and called at least twice. Suggestions for optimization include adding functionality to, or generalizing functionality in, each configuration file to reduce the number of configurations, rescheduling the CPU's work if the same configuration file is loaded multiple times so that overhead from reprogramming the same file is reduced, and repartitioning the algorithm to minimize the amount of functionality needed by the
FPGA, such as by moving some pre- or post-processing tasks from the FPGA to the CPU.

An interrupt-processing bottleneck may occur if a block is interrupted too often by another block. Note that while the physical interrupt is communication, this refers to interrupt handling, and thus is an internal task. In this case, the number of exceptional circumstances that cause interrupts should be reduced, such as by increasing precision to reduce overflow interrupts or by resolving the most common interrupts locally, if possible. A delay bottleneck occurs when a block must delay for some internal reason, such as when waiting for an extra-long combinatorial path or internal pipeline, the latter of which is handled specifically in the “pipeline fill/drain” and “pipeline flush/stall” bottlenecks. Delay bottleneck suggestions focus on reducing the latency causing the delay or overlapping these latencies with other delays or useful work. Suggestions for addressing a “pipeline fill/drain” bottleneck include filling a pipeline with the next dataset while still processing or draining the previous dataset. Suggestions for addressing a “pipeline flush/stall” bottleneck include decreasing latency or pipeline stages, although this must be balanced with the effect on the FPGA's maximum frequency; moving detection of flush conditions earlier to minimize stages flushed; and having pipelines with a large number of stalls process data from multiple streams, interleaving independent data into the pipeline to minimize dependency stalls.

5.3.4 Imbalance Bottlenecks

An imbalance of computation between two or more related blocks is also considered a potential bottleneck. A stage imbalance refers to an imbalance where a block depends on other blocks, such as in a pipeline. Optimization of this bottleneck involves improving the performance of slower blocks (e.g., through additional stage division or replication) or reducing the performance of faster blocks to potentially save resources. A load imbalance indicates an imbalance between parallel blocks that receive data, potentially indirectly, from the same source block, such as is common with replicated cores.
Possible optimizations for this bottleneck include improving the data distribution scheme (e.g., employing a look-ahead round-robin scheme that determines whether any of the next four or eight blocks are idle, skipping them all if so, or employing a priority-based selection scheme such as least-recently used), adding input buffers, or changing the replication factor to possibly distribute data more evenly (e.g., simply using a prime replication factor may distribute data more evenly than a heavily composite replication factor).

ReCAP detects both bottlenecks by searching through the block dependency graph given by user pragmas to determine where potential bottlenecks are located; a number of metrics are provided that return statistics over these block groups to facilitate such bottleneck detection. Within the provided suggestions, specific data about the best and worst performing blocks as well as the average performance of all related blocks are also given to allow a designer to determine how severe the imbalance is and what techniques may be best in addressing the bottleneck.

5.4 Case Studies

To demonstrate the utility of RC bottleneck detection and platform templates, our extended ReCAP tool was employed on two different applications on a total of three diverse RC platforms: a time-domain finite impulse response benchmark [84] on a GiDEL PROCStar III [82] and a 2-D probability density function estimator application [85] on both a Nallatech H101-PCIXM card [74] and an XtremeData XD1000 system [12]. Table 5-1 provides details for these RC systems.

5.4.1 Time-Domain Finite Impulse Response

The time-domain finite impulse response (TDFIR) benchmark is part of the HPEC challenge benchmark suite [84] and has been accelerated on GPUs as well [86]. For an FPGA-accelerated version, we implemented convolution for real numbers rather than complex; however, for consistency of results, we report all numbers in GFLOPS, thus accounting for the fact that each basic computation involves only two floating-point
operations rather than eight for the complex versions. This case study was performed on the GiDEL system (Table 5-1) using Quartus 9.1 SP2 and GCC 4.4.3 with -O3 optimization; all times given are the average of three executions. The TDFIR benchmark was able to execute at 125 MHz on the FPGA for both the original and optimized versions, and thus all executions were performed at this frequency for uniformity of results. All FPGA benchmark execution times include all data transfer times between the CPU and FPGA as well as any other needed CPU tasks such as data movement; we only exclude the FPGA initialization/finalization time, since the configuration file could be preloaded once and then used indefinitely (e.g., for streaming large amounts of data through).

Name                  CPU(s)                         FPGA(s)                                      API type
Nallatech H101-PCIXM  Pentium 4 Xeon 3.2 GHz         One Virtex-4 LX100 (PCI-X card)              C, simple memory
XtremeData XD1000     Dual-core Opteron 285 2.6 GHz  One Stratix II S180 (HyperTransport socket)  C++, DMA
GiDEL PROCStar III    Quad-core Xeon 5520 2.26 GHz   Four Stratix III E260s (PCIe card)           C++, simple memory

Table 5-1. RC platforms employed during case studies.

Three datasets were used for evaluation, which consisted of random data with the same kernel size, input size, and iterations (i.e., the number of different sub-datasets with the given kernel and input size that must be computed). Datasets A and B are the standard datasets given in the HPEC challenge, while dataset C is the largest dataset from [86] (Table 5-2). We compare FPGA results from a Stratix III E260 to both the Intel Xeon E5520 processor (the host processor in the GiDEL system) and the results given in [86] for an NVIDIA 8800 GTX.

Upon executing the FPGA version of TDFIR, two problems were observed. First, the single-FPGA version was slower than either the CPU or GPU version for the first two datasets, although the FPGA version did achieve a 9.1x speedup over the
CPU and a 3.0x speedup over the GPU on dataset C (Figure 5-7A). Second, when scaling up from 1 to 4 FPGAs, performance scaled poorly, resulting in less than 1.06x speedup for datasets A and B and less than 2.1x speedup for dataset C, even though no communication or synchronization was needed between FPGAs; different iterations were simply scattered to the different FPGAs (Figure 5-7B).

Dataset  Kernel Size  Input Size  Iterations
A        12           1024        20
B        128          4096        64
C        4096         32768       128

Table 5-2. Datasets evaluated for TDFIR benchmark.

We then employed ReCAP's automatic bottleneck detection on the FPGA version of TDFIR, using dataset C for execution. Overhead for obtaining performance data was acceptable, incurring an additional 5.1% in software time, 1.8% of FPGA logic resources, 0.8% of FPGA register resources, and a negligible (less than 1%) decrease in frequency; the benchmark still met the required 125 MHz. ReCAP detected several communication bottlenecks in software including “inefficient communication size and type” (ideal 11.63x speedup), “control overhead” (ideal 3.60x speedup), a “late sender” bottleneck for the FPGA (the CPU is ready to receive data but the FPGA is late in sending that data, ideal 2.52x speedup), and a “late sender” bottleneck for the CPU (the FPGA is ready to receive data but the CPU is late in sending that data, ideal 1.63x speedup). As mentioned earlier, the speedup numbers are ideal, and thus our performance exploration framework in Phase 2 should be used for more accurate predictions. For example, while ReCAP suggests above that an 11.63x speedup is possible if the associated bottleneck is remedied, ReCAP's visualization shows the FPGA is working 54.1% of the time when processing dataset C, thus limiting speedup to at most $1/0.541 = 1.85$x (the FPGA is working much less of the time for datasets A and B and thus better speedup is possible). Nonetheless, the ideal speedup numbers given
are still useful for providing an upper bound and serve as a severity indicator for a given bottleneck and thus are often a good ranking of which bottlenecks to address first.

Figure 5-7. TDFIR performance on various devices (including both the original and optimized FPGA performance after bottleneck detection). A) TDFIR performance. B) Original FPGA TDFIR scalability. C) Optimized FPGA TDFIR scalability. D) FPGA TDFIR performance (dataset C).

We first chose to address the “inefficient communication size and type” bottleneck. For one API call, bottleneck detection indicated a 2.52x speedup was possible if the transfer size were increased to between 32 MB and 64 MB (or higher for asynchronous communication) while a 2.39x speedup was possible by switching to low-latency transfers. As changing the transfer size would be more difficult and yet not result in much additional performance beyond that gained by switching the transfer type, we chose the latter approach for this API call. However, for several other API calls, solely switching to a different communication type was not recommended by ReCAP, and thus
we increased the transfer size by a factor of 8 to 10 via buffers and logic to handle batch transfers (see footnote 5). These changes alone resulted in a 1.47x improvement in performance for dataset C on one FPGA and a 1.68x performance improvement for 4 FPGAs. However, with only a 2.4x speedup between the 1 and 4 FPGA versions, scalability was still low.

We next addressed the “late sender” bottlenecks, for which ReCAP suggested overlapping the waiting period with other computation or communication, using asynchronous transfers if available, as well as employing double or higher-order buffering. Thus, we overlapped communication using asynchronous transfers and quadruple buffering with three different external memories as well as internal memory, providing an additional 1.12x improvement in performance for dataset C on one FPGA and an additional 1.73x performance improvement for 4 FPGAs; the 1-FPGA version experienced less performance increase due to heavy use of its computational units. As shown in Figure 5-7C, the second optimization resulted in a scalability of over 90% of the ideal ($3.61/4.00$) for dataset C and noticeable improvement in scalability for dataset B.

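The improvement factors quoted above compose multiplicatively, and scalability can be read as the multi-FPGA speedup divided by the number of FPGAs. The following minimal Python sketch reproduces that arithmetic from the figures reported in this section. The function names are illustrative assumptions and are not part of ReCAP.

```python
def cumulative_speedup(factors):
    """Multiply successive improvement factors into one overall speedup."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


def scaling_efficiency(speedup, devices):
    """Fraction of ideal (linear) scaling achieved on `devices` devices."""
    return speedup / devices


# Improvement factors reported for dataset C after each optimization step.
one_fpga = cumulative_speedup([1.47, 1.12])
four_fpga = cumulative_speedup([1.68, 1.73])

# Reported 1-to-4 FPGA speedup after the second optimization.
efficiency = scaling_efficiency(3.61, 4)

print(f"1 FPGA: {one_fpga:.2f}x, 4 FPGAs: {four_fpga:.2f}x, "
      f"scaling efficiency: {efficiency:.3f}")
```

Because the per-step factors are themselves rounded, the products computed this way agree with the overall figures reported in the text only to the precision given there.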
Thus, by employing these two optimizations, performance was improved by 1.65x for the single FPGA version (achieving 27.3 GFLOPS) and 2.9x for the 4 FPGA version (achieving 98.9 GFLOPS) on dataset C, with the 4-FPGA version achieving a total of 54.4x speedup over the software baseline; final performance for each dataset is shown as black bars in Figure 5-7A and with circle markers in Figure 5-7D. Further, the single FPGA version now performed 1.15x better than the CPU on dataset B, whereas the original version had performed 6.7 times slower than the CPU. While performance on dataset A was increased by 5.7x, the FPGA continued to perform poorly due to communication overhead; the NVIDIA 8800 GTX experienced a similar, albeit less pronounced effect for this dataset, as seen in Figure 5-7A. Further improvements suggested by ReCAP, such as moving a memory clear operation from software into FPGA logic, would likely achieve some additional speedup, although the FPGA cores were observed to be working 68.7% of the time in the 4-FPGA version, limiting the amount of further speedup possible without employing additional resources or a better algorithm. It is noteworthy that given the reported bottlenecks and optimization suggestions, actual optimizations were made within two days, resulting in higher performance as well as improved productivity compared to manual, ad-hoc bottleneck location and optimization.

[Footnote 5] ReCAP suggested increasing the transfer size much further for optimal transmission, which was infeasible due to memory limitations. Thus, we manually determined a balance between significantly better bandwidth from the microbenchmark table, memory overhead, and achieving even divisibility into the number of iterations for each dataset; the last requirement resulted in a batch size of 10 for dataset A since 8 does not divide evenly into the stipulated 20 iterations for that dataset.

5.4.2 2-D Probability Density Function Estimation

The 2-D probability density function (2DPDF) estimation application is used in various engineering, financial, and scientific fields where non-parametric probabilistic approaches are required; the application is computationally intensive, involving $O(mn^2)$ operations where $m$ is the number of sample points and $n$ is the number of bins per dimension. Our implementation uses the Parzen-window algorithm and a fixed-point format ([18,9] external precision, [48,18] internal precision, given in [total, fractional] format) [85].

Our experimental setup consisted of the Nallatech system (Table 5-1), using GCC 4.4.3 and Xilinx ISE 11.5 to compile all C and VHDL files, respectively. A software baseline was written in C using only integer arithmetic to better compare to the FPGA's fixed-point format, as the CPU's floating-point version was slower. This baseline was compiled with -O3 optimization and executed on the attached Pentium 4 Xeon 3.2 GHz processor; all execution times were computed from the average of 3 executions.

The 2DPDF application was capable of operation at 100 MHz or higher for all design variants, and thus 100 MHz was used for uniformity of performance results. Similar to the convolution case study, all FPGA application execution times include all data transfer
times between the CPU and FPGA as well as any other needed CPU tasks such as data movement; we only exclude the FPGA initialization/finalization time, since the configuration file could be preloaded once and then used indefinitely (e.g., for streaming large amounts of data through).

Initially, we attempted to gain speedup by extending the 2DPDF application to a multi-core design within the FPGA. Figure 5-8 shows the initial execution times for several multi-core variants (circle markers). While our software baseline required 250.5 seconds to process 1,024,000 points, the 20-core FPGA design required only 64.5 seconds for the same dataset, resulting in a 3.9x speedup. However, Figure 5-8 demonstrates that these additional cores provided diminishing performance improvements.

Figure 5-8. Speedup of initial and improved versions of the 2DPDF application when compared to a CPU baseline.

We then employed ReCAP's automatic bottleneck detection on the 20-core design. Overhead for obtaining performance data was acceptable, incurring an additional 14.2% in software time (due to millions of API calls during execution), 4.3% of FPGA logic resources, 2.3% of FPGA register resources, and a maximum 2.2% decrease in frequency. ReCAP identified a number of potential bottlenecks including a “CPU late sender” bottleneck with ideal 16.21x speedup, a “control” overhead bottleneck with ideal
1.55x speedup, an “inefficient communication size/type” bottleneck with ideal 9.83x speedup, a “control” overhead bottleneck with ideal 1.38x speedup, and an “FPGA late sender” bottleneck with ideal 1.32x speedup. Specifically, bottlenecks were also detected in five of the twelve individual API calls in software involving clearing memory, starting and stopping cores, checking to see if cores were complete, and reading the output; each API call's potential speedup ranged from 1.15x to 1.34x. The generic “all overhead” bottleneck on the FPGA indicated that, if the FPGA were fully utilized, a possible 16.81x speedup could be achieved.

Based on the optimizations suggested above, we focused on consolidating the large number of small transfers performed and on moving control logic onto the FPGA. Specifically, several input buffers were increased in size from 2 KB to 32 KB, the minimum ideal transfer size suggested by ReCAP; output buffers were placed consecutively in the memory map, permitting a single larger read rather than several smaller reads; register data was consolidated to reduce the amount of data polled; and control logic was moved onto the FPGA, performing tasks such as automatically clearing intermediate FPGA buffers when receiving new data rather than relying on software to manually control this process.

Speedup for the improved 2DPDF application increased from 3.9x to 44.4x when compared with the software baseline, resulting in an 11.4x performance improvement between the unoptimized and optimized versions (square markers in Figure 5-8) and demonstrating far more linear speedup with respect to the number of cores employed. Interestingly, we also discovered that the new version incurred slightly less rounding error since FPGA data is rounded when transferred to the CPU and fewer transfers were employed in the optimized version, although the error incurred by the original version was deemed acceptable. Again, it is noteworthy that given the reported bottlenecks and optimization suggestions, actual optimizations were made within a day, resulting in a
significant performance increase as well as improved productivity compared to manual approaches for bottleneck detection and optimization.

As a final attempt to improve performance, we ported the optimized 2DPDF application to the XD1000 platform (Table 5-1), which allowed us to increase computational resources by a factor of 2.4. Software and hardware were compiled with GCC 4.3.2 with -O3 optimization and Quartus 9.1 SP2, respectively. We obtained an additional 2.5x speedup over the Nallatech implementation; note that a speedup greater than 2.4x is possible due to faster transfers afforded by the HyperTransport interconnect on the XD1000. We used ReCAP's automatic bottleneck detection on the ported application, incurring an additional 5.2% of software runtime overhead, 4.7% of FPGA logic resources, 1.8% of FPGA register resources, and a 15.3% frequency degradation (due to the fact that the application filled 87% of the device before instrumentation, and thus 92% with instrumentation); the 100 MHz requirement was still met by the instrumented version.

ReCAP's bottleneck detection returned several potential bottlenecks, but the ideal speedups were much lower than in the previous cases. For example, in software, all speedups were less than 4.2x; in hardware, the cores showed speedup potential of at most 1.31x (note that a stage imbalance bottleneck was the exception, showing a potential of up to 5.5x; however, the stage imbalance it referred to involved a basic distribution core that performs very little communication and no work; with the integration of the aforementioned performance exploration framework, such false positives could be significantly reduced). Thus, given the lackluster performance improvements predicted, we chose not to optimize further, although there are scenarios where a user may believe the potential speedup warrants additional effort, such as if the application must be executed many times or if runtime were significantly longer. This result underscores an often overlooked benefit of bottleneck detection: it can often be as useful in indicating what not to optimize as it is in indicating what to optimize. Table 5-3 gives execution
times for the 2DPDF application on both platforms, showing that the XD1000 version achieved a 112.9x speedup over the original Pentium 4 Xeon 3.2 GHz software baseline.

Application Version         Runtime (s)  Speedup
Pentium 4 Xeon CPU          250.515      1.0
Nallatech (FPGA original)   64.506       3.9
Nallatech (FPGA optimized)  5.648        44.4
XD1000 (FPGA)               2.218        112.9

Table 5-3. 2DPDF results for both the Nallatech and XD1000 platforms. Speedup is given with respect to the software baseline executed on a Pentium 4 Xeon 3.2 GHz CPU.

5.5 Conclusions

In this phase of research, we presented what we believe to be the first automatic bottleneck detection framework and tool for RC applications, including a framework for platform templates that permits more accurate, platform-aware bottleneck detection as well as tool portability across diverse RC systems. These templates are easily created by end-users, typically in a few hours provided the platform fits within the generic platform-template model. In addition, we formulated what we believe to be the first taxonomy of common bottlenecks for RC applications, along with associated detection and optimization strategies for each of these bottlenecks, to populate ReCAP's knowledge base for bottleneck detection. Our bottleneck knowledge base is extensible, providing for user detection of bottlenecks not envisioned by the authors. The knowledge-based bottleneck detection framework and platform-template system were implemented by extending our work from Phases 1 and 2, providing users with a full-featured RC performance analysis framework and tool for a diverse set of RC systems that can significantly accelerate the optimization process.

We then demonstrated bottleneck detection in ReCAP via two case studies involving a time-domain finite impulse response (TDFIR) benchmark from the HPEC challenge and a 2-D probability density function (2DPDF) estimation application on
a total of three diverse platforms. ReCAP reported a number of bottleneck types discovered in both software and hardware along with optimization suggestions and potential speedup for each bottleneck type. Several of these optimization suggestions were employed to achieve an additional 2.9x speedup for TDFIR, resulting in a total 54.4x speedup over the CPU baseline, and an additional 11.4x speedup for 2DPDF estimation, resulting in a 44.4x speedup over the CPU baseline. We ported 2DPDF to the XD1000 platform (which provided an additional 2.5x speedup due to increased computational resources) and, based on lackluster potential speedup reported by bottleneck detection, did not further optimize the application, resulting in a total 112.9x speedup over the Pentium 4 Xeon 3.2 GHz CPU baseline.
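The 2DPDF speedups summarized above follow directly from the runtimes listed in Table 5-3. As a minimal sketch, the Python below recomputes them from those runtimes; the dictionary keys and the helper name `speedup` are illustrative choices, not identifiers from ReCAP.

```python
# Runtimes in seconds as listed in Table 5-3.
runtimes = {
    "Pentium 4 Xeon CPU": 250.515,
    "Nallatech (FPGA original)": 64.506,
    "Nallatech (FPGA optimized)": 5.648,
    "XD1000 (FPGA)": 2.218,
}


def speedup(slower, faster):
    """Speedup of `faster` relative to `slower`, rounded to one decimal."""
    return round(runtimes[slower] / runtimes[faster], 1)


baseline = "Pentium 4 Xeon CPU"
for name in runtimes:
    print(f"{name}: {speedup(baseline, name)}x over the software baseline")

# Step-to-step gains: optimization on the Nallatech card, then the port.
optimization_gain = speedup("Nallatech (FPGA original)",
                            "Nallatech (FPGA optimized)")
port_gain = speedup("Nallatech (FPGA optimized)", "XD1000 (FPGA)")
print(f"optimization: {optimization_gain}x, port to XD1000: {port_gain}x")
```

Computing each speedup from the raw runtimes, rather than multiplying the rounded factors, avoids compounding rounding error.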

CHAPTER 6
CONCLUSIONS

In this research, the first framework and tool for general-purpose performance analysis of RC applications (per an extensive literature review), called ReCAP (reconfigurable computing application performance), was presented. Background and related work were presented in performance analysis for traditional high-performance computing (HPC) systems (including visualization and bottleneck detection), reconfigurable computing (RC), RC performance modeling and simulation, and finally RC debugging, performance analysis, and tool portability. An initial framework was defined in Phase 1 for instrumentation, measurement, and basic presentation of performance data, extending a current HPC performance analysis tool (PPW) to support performance analysis for multi-CPU/FPGA systems. In Phase 2, an extension to Phase 1 was presented that provides full-fledged performance visualization capabilities in ReCAP to aid the user in rapidly understanding application behavior. In addition, a methodology for performance exploration using runtime performance data was also presented, providing a framework for accurately predicting changes to application performance if optimizations were implemented. Finally, in Phase 3, ReCAP was further extended to provide automatic detection of bottlenecks (including potential optimization strategies and expected speedup). This phase of research included an enumeration and categorization of common RC-application bottlenecks as well as a platform-template system permitting more accurate, platform-aware bottleneck detection across diverse systems and tool portability. Case studies were performed across three systems to demonstrate the utility of ReCAP's bottleneck detection and platform-template frameworks.

This work contributes the ReCAP performance framework and tool along with related research and findings, where such research and framework was lacking. The ReCAP framework and tool has been demonstrated to be applicable to a diverse set
of RC systems and applications as well as shown to significantly aid designers in quickly understanding and remedying performance bottlenecks in their RC applications. This framework contributes to all five areas of performance analysis as mentioned in Section 2.2: instrumentation, measurement, automated analysis, visualization, and optimization (via performance exploration in Phase 2 as well as optimization suggestions given for addressing bottlenecks in Phase 3). In addition, this framework has also contributed to ongoing research in performance analysis for HLL-based RC applications as demonstrated in Curreri et al. [20].

Future areas for research include expanding the flexibility and utility of bottleneck detection through user-defined metrics, integration and refinement of the performance exploration framework, trace-based RC bottleneck detection, and even employing data-mining for bottleneck detection. In addition, integration of this research into DSE tools could benefit both tools, providing a comprehensive environment for studying application behavior from conceptual design through optimization. Further research is also warranted in RC tool portability, such as exploring a generic framework for monitoring FPGA-FPGA and FPGA-memory interfaces, as well as further study and refinement of visualization techniques for large-scale heterogeneous applications, such as including communication channel usage and efficiency. Finally, research on the difficult task of automatic optimization could be of significant use, reducing the burden on the user to manually optimize their RC applications.

APPENDIX A
SYSTEMS SUPPORTED BY RECAP TOOL

A.1 Nallatech H101 Cluster

This cluster consists of 16 nodes connected by DDR InfiniBand and Gigabit Ethernet. Each node contains the following hardware:

• 3.2 GHz Pentium 4 Intel Xeon EM64T Hyper-threading processor with 2 MB L2 cache
• 2 GB of registered-ECC DDR333 RAM
• Mellanox InfiniHost III Lx (MHGS18-XTC) PCI-E x8 DDR InfiniBand card
• Nallatech H101-PCIXM Application Accelerator [74]
  – 133 MHz PCI-X card with Xilinx XC4VLX100 user FPGA
  – 512 MB DDR2 SDRAM (single bank)
  – 16 MB DDR-II SRAM (across 4 banks)
  – 4 high-speed Serial I/O channels (2.5 Gb/s each)
  – 25 W power consumption (typical)

A.2 XtremeData XD1000

This system consists of a dual-processor motherboard containing a dual-core Opteron and an in-socket FPGA module [12].

• 2.6 GHz Opteron 285
• 4 GB DDR400 RAM
• XD1000 in-socket FPGA module (connected via HyperTransport)
  – Altera Stratix II EP2S180 FPGA
  – 4 GB DDR RAM (128-bit, 333 MHz)
  – 4 MB ZBT SRAM (32-bit, 200 MHz)
  – 32 MB FLASH

A.3 GiDEL PROCStar III Cluster

This system consists of 24 compute nodes connected by DDR InfiniBand and Gigabit Ethernet. While this system includes a host CPU, the included GiDEL PROCStar III board [82] contains 4 tightly-coupled FPGAs and can be seen as representative of an embedded environment for this research, as the board can stand alone without any CPU if needed. Each node contains the following hardware:

• Intel E5520 2.26 GHz quad-core Xeon with QP
• 6 GB DDR3 1333 RAM
• Two GiDEL PROCStar III boards, each containing...
  – Four Altera Stratix III E260 FPGAs
  – 4.25 GB DDR memory per FPGA (three banks per FPGA: 256 MB, 2 GB, and 2 GB)
  – High-speed direct-connection between neighboring FPGAs
  – PCIe interconnect (or card may stand alone if supplied external power)

APPENDIX B
APPLICATION CASE STUDIES

B.1 N-Queens

The N-Queens problem asks for the number of distinct ways that $N$ queens can be placed onto an $N \times N$ chessboard such that no two queens can attack each other. The RC benchmark was written by Vikas Aggarwal (HCS Lab, University of Florida), adapted by Gabe Barfield (HCS Lab, University of Florida), and then further adapted by John Curreri (CHREC, University of Florida). [75]

B.2 Time-Domain Finite Impulse Response

The time-domain finite impulse response (TDFIR) benchmark is part of the HPEC challenge benchmark suite [84]. The RC benchmark performs convolution for real numbers rather than complex. The initial RC implementation of this benchmark was written by Dr. Gregory Stitt (CHREC, University of Florida). [84]

B.3 Collatz Conjecture

The Collatz conjecture is a number theoretic conjecture involving sequences of integers. This application searches large ranges of natural numbers in an attempt to find record-length sequences of numbers that do not terminate at the trivial cycle containing the number 1. The algorithm employs parallel kernels which process 96-bit integers, computing sequences using heavily pipelined custom logic as well as logic to skip the initial part of most sequences. While this application is not readily used for practical purposes, it contains patterns that closely resemble cryptanalysis (i.e., a large number space is pre-filtered by a CPU, the FPGA performs a large parallel search, and finally the CPU collects and performs post-processing on returned numbers) [78], [79]. This application is also a good example of a non-deterministic RC application that makes heavy use of both CPUs and FPGAs. This application was written by the author. [77]
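To make the search for record-length sequences concrete, here is a minimal software-only Python sketch. It is not the author's implementation: it omits the 96-bit pipelined hardware kernels, the CPU pre-filtering, and the logic that skips the initial part of most sequences. The names `collatz_steps` and `record_holders` are hypothetical.

```python
def collatz_steps(n):
    """Count the steps for n to reach 1 under the Collatz map."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        # Halve even values; map odd values to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def record_holders(limit):
    """Return (n, steps) pairs that set a new step-count record for n < limit."""
    records = []
    best = -1
    for n in range(1, limit):
        steps = collatz_steps(n)
        if steps > best:
            best = steps
            records.append((n, steps))
    return records


for n, steps in record_holders(100):
    print(f"{n} reaches 1 after {steps} steps")
```

Each starting value is independent of the others, which is what makes the search range easy to divide among parallel kernels.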

B.4 2-D Probability Density Function Estimation

The 2-D probability density function (2DPDF) estimation algorithm is employed in various engineering, financial, and scientific fields where non-parametric probabilistic approaches are required. The application uses a Parzen-window algorithm to estimate two-dimensional probability density functions via an exhaustive permutation of data between two vectors in a replicated, pipelined design. The complexity of this algorithm is $O(mn^2)$ where $m$ is the number of sample points and $n$ is the number of bins per dimension. The initial RC implementation of this application was written by Dr. Karthik Nagarajan (CHREC, University of Florida). [85]
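The $O(mn^2)$ cost can be seen in a small floating-point sketch of Parzen-window estimation on an $n \times n$ grid. This is an illustration only, not the fixed-point, pipelined RC design described above. The Gaussian kernel, the bandwidth parameter `h`, and the function names are assumptions, since the kernel used by the application is not stated here.

```python
import math


def gaussian(u):
    """Standard normal kernel (an assumed choice for illustration)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_2d(samples, bins, h):
    """Estimate a 2-D PDF at every pair of bin centers.

    samples: list of (x, y) points (m entries)
    bins:    bin centers shared by both dimensions (n entries)
    h:       kernel bandwidth (assumed equal in both dimensions)
    """
    if not samples:
        raise ValueError("at least one sample is required")
    m, n = len(samples), len(bins)
    grid = [[0.0] * n for _ in range(n)]
    for x, y in samples:                    # m sample points
        kx = [gaussian((b - x) / h) for b in bins]
        ky = [gaussian((b - y) / h) for b in bins]
        for i in range(n):                  # n bins in the first dimension
            for j in range(n):              # n bins in the second dimension
                grid[i][j] += kx[i] * ky[j]
    scale = 1.0 / (m * h * h)
    return [[value * scale for value in row] for row in grid]


estimate = parzen_2d([(0.0, 0.0)], [-1.0, 0.0, 1.0], 1.0)
print(estimate[1][1])
```

The three nested loops perform one multiply-accumulate per (sample, bin, bin) triple, giving the $m n^2$ operation count. Since those updates are independent across bins, replicating and pipelining the inner computation is a natural fit for hardware.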

REFERENCES

[1] L. A. Barroso, “The price of performance,” Queue, vol. 3, no. 7, pp. 48–53, 2005.
[2] J. Laudon, “Performance/watt: the new server focus,” SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 5–13, 2005.
[3] C. H. Crawford, P. Henning, M. Kistler, and C. Wright, “Accelerating computing with the cell broadband engine processor,” in Proceedings of the 5th Conference on Computing Frontiers. New York, NY, USA: ACM, 2008, pp. 3–12.
[4] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and software,” ACM Computing Surveys, vol. 34, no. 2, pp. 171–210, 2002.
[5] P. Garcia, K. Compton, M. Schulte, E. Blem, and W. Fu, “An overview of reconfigurable hardware in embedded systems,” EURASIP Journal on Embedded Systems, vol. 2006, no. 1, pp. 13–13, 2006.
[6] R. Tessier and W. Burleson, “Reconfigurable computing for digital signal processing: A survey,” The Journal of VLSI Signal Processing, vol. 28, pp. 7–27, 2001, 10.1023/A:1008155020711. [Online]. Available: http://dx.doi.org/10.1023/A:1008155020711
[7] J. Williams, A. D. George, J. Richardson, K. Gosrani, C. Massie, and H. Lam, “Characterization of fixed and reconfigurable multi-core devices for application acceleration,” ACM Transactions on Reconfigurable Technology and Systems, vol. 3, no. 4, to appear, January 2011.
[8] J. Williams, A. D. George, J. Richardson, K. Gosrani, and S. Suresh, “Computational density of fixed and reconfigurable multi-core devices for application acceleration,” in Proceedings of Reconfigurable Systems Summer Institute 2008, Urbana, IL, July 2008.
[9] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach, “Accelerating compute-intensive applications with GPUs and FPGAs,” in Proceedings of the 2008 Symposium on Application Specific Processors. Washington, DC, USA: IEEE Computer Society, June 2008, pp. 101–107.
[10] J. L. Tripp, A. A. Hanson, M. Gokhale, and H. Mortveit, “Partitioning hardware and software for reconfigurable supercomputing applications: A case study,” in Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, November 2005, p. 27.
[11] Cray, “Cray XD1 datasheet,” http://www.hpc.unm.edu/%7Etlthomas/buildout/Cray_XD1_Datasheet.pdf, Retrieved August 2010.
[12] XtremeData Inc., “XD1000™ development system,” http://old.xtremedatainc.com/index.php?option=com_content&view=article&id=109&Itemid=170, Retrieved August 2010.
[13] DRC, “RPU110: DRC reconfigurable processor unit,” http://www.drccomputer.com/pdfs/DRC_RPU110_datasheet.pdf, Retrieved August 2010.
[14] H.-H. Su, M. Billingsley, and A. D. George, “Parallel performance wizard: A performance analysis tool for partitioned global-address-space programming,” in Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing, April 2008, pp. 1–8.
[15] H.-H. Su, M. Billingsley III, and A. D. George, “Parallel performance wizard: A performance system for the analysis of partitioned global-address-space applications,” International Journal of High-Performance Computing Applications, accepted and in press.
[16] M. B. Gokhale and P. S. Graham, Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays. Springer, 2005.
[17] D. Helgemo, “Digital signal processing at 1 GHz in a field-programmable object array,” Proceedings of the 2003 IEEE International Systems-on-Chip Conference, pp. 57–60, September 2003.
[18] Altera, “Stratix II device handbook,” http://www.altera.com/literature/hb/stx2/stratix2_handbook.pdf, Retrieved August 2010.
[19] Xilinx, “Virtex-4 FPGA user guide,” http://www.xilinx.com/support/documentation/user_guides/ug070.pdf, Retrieved August 2010.
[20] J. Curreri, S. Koehler, A. D. George, B. Holland, and R. Garcia, “Performance analysis framework for high-level language applications in reconfigurable computing,” ACM Transactions on Reconfigurable Technology and Systems, vol. 3, no. 1, pp. 1–23, 2010.
[21] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick, “The potential of the cell processor for scientific computing,” in Proceedings of the 3rd Conference on Computing Frontiers. New York, NY, USA: ACM, 2006, pp. 9–20.
[22] V. Betz and S. Brown, “Recent FPGA advances and challenges,” in Proceedings of the 2010 International Conference on Engineering of Reconfigurable Systems and Algorithms, Las Vegas, NV, USA, July 2010.
[23] S. S. Shende and A. D. Malony, “The TAU parallel performance system,” International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287–311, 2006.
[24] J. Vetter and C. Chambreau, “mpiP: Lightweight, scalable MPI profiling,” http://mpip.sourceforge.net/, Retrieved August 2010.
[25] C. E. Wu, A. Bolmarcich, M. Snir, D. Wootton, F. Parpia, A. Chan, E. Lusk, and W. Gropp, “From trace generation to visualization: a performance framework for distributed parallel systems,” in Proceedings of the 2000 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, November 2000, p. 50.
[26] W. E. Nagel, A. Arnold, M. Weber, and K. Solchenbach, “VAMPIR: Visualization and analysis of MPI resources,” Supercomputer, vol. 12, pp. 69–80, 1996.
[27] IBM, “The IBM High Performance Computing Toolkit,” http://domino.research.ibm.com/comm/research_projects.nsf/pages/hpct.index.html, Retrieved August 2010.
[28] D. Klepacki, I.-H. Chung, C. Guojing, and S. Wen, “IBM high productivity computing systems toolkit,” http://domino.research.ibm.com/comm/research_projects.nsf/pages/hpcst.index.html/%24FILE/HPCST_README.pdf, Retrieved August 2010.
[29] Intel, “Performance optimization of EXASolution with the Intel® trace analyzer and collector tool,” http://www.intel.com/cd/software/products/asmo-na/eng/408233.htm, Retrieved August 2010.
[30] P. C. Roth and B. P. Miller, “On-line automated performance diagnosis on thousands of processes,” in Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, NY, USA: ACM, 2006, pp. 69–80.
[31] G. Jost, H. Jin, J. Labarta, J. Gimenez, and J. Caubet, “Performance analysis of multilevel parallel applications on shared memory architectures,” in Proceedings of the 17th International Symposium on Parallel and Distributed Processing. Washington, DC, USA: IEEE Computer Society, 2003, p. 80.2.
[32] Z. Szebenyi, B. J. Wylie, and F. Wolf, “Scalasca parallel performance analyses of PEPC,” in Proceedings of the 1st Workshop on Productivity and Performance in conjunction with Euro-Par 2008, ser. Lecture Notes in Computer Science, vol. 5415. Springer, 2009, pp. 305–314.
[33] H.-l. Truong and T. Fahringer, “SCALEA: A performance analysis tool for distributed and parallel programs,” in Lecture In 8th International Europar Conference. Berlin/Heidelberg, Germany: Springer-Verlag, 2002, pp. 41–55.
[34] L. D. Rose, Y. Zhang, and D. A. Reed, “SvPablo: A multi-language performance analysis system,” in Proceedings of the 10th International Conference on Computer Performance Evaluation: Modeling Techniques and Tools. London, UK: Springer-Verlag, 1998, pp. 352–355.
[35] I.-H. Chung, R. E. Walkup, H.-F. Wen, and H. Yu, “MPI performance analysis tools on Blue Gene/L,” in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM Press, November 2006, p. 123.
[36] B. P. Miller, “What to draw? When to draw?: an essay on parallel program visualization,” Journal of Parallel and Distributed Computing, vol. 18, no. 2, pp. 265–269, 1993.

[37] M. T. Heath, A. D. Malony, and D. T. Rover, “Parallel performance visualization: From practice to theory,” IEEE Parallel and Distributed Technology, vol. 3, no. 4, pp. 44–60, 1995.
[38] M. T. Heath and J. A. Etheridge, “Visualizing the performance of parallel programs,” IEEE Software, vol. 8, no. 5, pp. 29–39, 1991.
[39] J. Mellor-Crummey, R. Fowler, G. Marin, and N. Tallent, “HPCView: A tool for top-down analysis of node performance,” The Journal of Supercomputing, vol. 23, p. 2002, 2001.
[40] IBM, “The IBM High Performance Computing Toolkit: PeekPerf documentation,” http://domino.research.ibm.com/comm/research_projects.nsf/pages/hpct.peekperf.html/%24FILE/peekperf.pdf, Retrieved August 2010.
[41] A. Knüpfer, B. Voigt, W. E. Nagel, and H. Mix, “Visualization of repetitive patterns in event traces,” in Applied Parallel Computing: State of the Art in Scientific Computing. Springer Berlin/Heidelberg, 2008, pp. 430–439.
[42] S. Hackstadt, A. Malony, and B. Mohr, “Scalable performance visualization for data-parallel programs,” Proceedings of the 1994 Scalable High-Performance Computing Conference, pp. 342–349, May 1994.
[43] D. A. Reed, K. A. Shields, W. H. Scullin, L. F. Tavera, and C. L. Elford, “Virtual reality and parallel systems performance analysis,” Computer, vol. 28, no. 11, pp. 57–67, 1995.
[44] J. García, J. Entrialgo, and D. Garcia, “An instrumentation and visualization technique for performance analysis of high-performance industrial embedded applications,” in Proceedings of the 16th IEEE Instrumentation and Measurement Technology Conference, vol. 2, 1999, pp. 958–963.
[45] M. Sefika, A. Sane, and R. H. Campbell, “Architecture-oriented visualization,” in Proceedings of the 11th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications. New York, NY, USA: ACM, 1996, pp. 389–405.
[46] K. Bondalapati and V. K. Prasanna, “DRIVE: An interpretive simulation and visualization environment for dynamically reconfigurable systems,” in Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications. Springer Berlin/Heidelberg, 1999, pp. 31–40.
[47] H.-H. Su, M. Billingsley III, and A. D. George, “A distributed, programming model-independent automatic analysis system for parallel applications,” in Proceedings of the 14th IEEE International Workshop on High-Level Parallel Programming Models and Supportive Environments held in conjunction with IPDPS 2009, Rome, Italy, May 2009.

[48] B. Mohr and F. Wolf, Euro-Par 2003 Parallel Processing. Berlin/Heidelberg, Germany: Springer-Verlag, 2003, ch. KOJAK A Tool Set for Automatic Performance Analysis of Parallel Programs, pp. 1301–1304.
[49] J. Jorba, T. Margalef, and E. Luque, Applied Parallel Computing: State of the Art in Scientific Computing. Berlin/Heidelberg, Germany: Springer-Verlag, 2008, ch. Search of Performance Inefficiencies in Message Passing Applications with KappaPI 2 Tool, pp. 409–419.
[50] I.-H. Chung, G. Cong, D. Klepacki, S. Sbaraglia, S. Seelam, and H.-F. Wen, “A framework for automated performance bottleneck detection,” in Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing, Miami, FL, USA, April 2008, pp. 1–7.
[51] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni, “Model-based performance prediction in software development: a survey,” IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 295–310, May 2004.
[52] A. Clark and S. Gilmore, “State-aware performance analysis with extended stochastic probes,” in Proceedings of the 5th European Performance Engineering Workshop on Computer Performance Engineering. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 125–140.
[53] B. Holland, K. Nagarajan, and A. D. George, “RAT: RC amenability test for rapid performance prediction,” ACM Transactions on Reconfigurable Technology and Systems, vol. 1, no. 4, pp. 1–31, 2009.
[54] M. C. Smith and G. D. Peterson, “Parallel application performance on shared high performance reconfigurable computing resources,” Performance Evaluation, vol. 60, no. 1-4, pp. 107–125, 2005.
[55] C. Reardon, E. Grobelny, A. D. George, and G. Wang, “A simulation framework for rapid analysis of reconfigurable computing systems,” ACM Transactions on Reconfigurable Technology and Systems, to appear.
[56] D. Densmore, A. Donlin, and A. Sangiovanni-Vincentelli, “FPGA architecture characterization for system level performance analysis,” in Proceedings of the 2006 Conference on Design, Automation and Test in Europe. 3001 Leuven, Belgium: European Design and Automation Association, 2006, pp. 734–739.
[57] R. DeVille, I. Troxel, and A. George, “Performance monitoring for run-time management of reconfigurable devices,” in Proceedings of 2005 International Conference on Engineering of Reconfigurable Systems and Algorithms, June 2005, pp. 175–181.
[58] M. Schulz, B. S. White, S. A. McKee, H.-H. S. Lee, and J. Jeitner, “Owl: next generation system monitoring,” in Proceedings of the 2nd Conference on Computing Frontiers. New York, NY, USA: ACM Press, May 2005, pp. 116–124.

[59] A. Kirschbaum, J. Becker, and M. Glesner, “Run-time monitoring of communication activities in a rapid prototyping environment,” in Proceedings of the 9th IEEE International Workshop on Rapid System Prototyping. Washington, DC, USA: IEEE Computer Society, 1998, p. 52.
[60] K. Camera, H. K.-H. So, and R. W. Brodersen, “An integrated debugging environment for reprogrammble hardware systems,” in Proceedings of the 6th International Symposium on Automated Analysis-Driven Debugging. New York, NY, USA: ACM Press, September 2005, pp. 111–116.
[61] T. Wheeler, P. Graham, B. E. Nelson, and B. Hutchings, “Using design-level scan to improve FPGA design observability and controllability for functional verification,” in Proceedings of the 11th International Conference on Field-Programmable Logic and Applications. London, UK: Springer-Verlag, 2001, pp. 483–492.
[62] P. Bellows, “High-visibility debug-by-design for FPGA platforms,” Journal of Supercomputing, vol. 32, no. 2, pp. 105–118, 2005.
[63] Altera, “Design debugging using the SignalTap II embedded logic analyzer,” http://www.altera.com/literature/hb/qts/qts_qii53009.pdf, Retrieved August 2010.
[64] Xilinx, “Xilinx ChipScope Pro software and cores user guide, v. 9.2i,” http://www.xilinx.com/ise/verification/chipscope_pro_sw_cores_9_2i_ug029.pdf, Retrieved August 2010.
[65] R. Bodenner, “Creating platform support packages,” http://www.impulseaccelerated.com/AppNotes/APP109_PSP/IATAPP109_PSP.pdf, Retrieved August 2010.
[66] University of California at Riverside, “ROCCC 2.0 user's manual - revision 0.5.1,” http://roccc.cs.ucr.edu/documentation/files/UserManual-0.5.1.pdf, Retrieved August 2010.
[67] R. Chamberlain, M. Franklin, E. Tyson, J. Buckley, J. Buhler, G. Galloway, S. Gayen, M. Hall, E. Shands, and N. Singla, “Auto-Pipe: Streaming applications on architecturally diverse systems,” Computer, vol. 43, no. 3, pp. 42–49, March 2010.
[68] V. Aggarwal, R. Garcia, G. Stitt, A. George, and H. Lam, “SCF: a device and language-independent task coordination framework for reconfigurable, heterogeneous systems,” in Proceedings of the 3rd International Workshop on High-Performance Reconfigurable Computing Technology and Applications. New York, NY, USA: ACM, 2009, pp. 19–28.
[69] P. Graham, B. Nelson, and B. Hutchings, “Instrumenting bitstreams for debugging FPGA circuits,” in Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. Washington, DC, USA: IEEE Computer Society, April 2001, pp. 41–50.

[70] S. A. Guccione, D. Levi, and P. Sundararajan, “Jbits: A Java-based interface for reconfigurable computing,” in Proceedings of 2nd Military and Aerospace Applications of Programmable Devices and Technologies Conference, September 1999, p. 27.
[71] F. Cristian, “A probabilistic approach to distributed clock synchronization,” in Proceedings of 9th International Conference on Distributed Computing Systems, June 1989, pp. 288–296.
[72] Xilinx, “Virtex-4 family overview,” http://direct.xilinx.com/bvdocs/publications/ds112.pdf, Retrieved August 2010.
[73] A. Leko, D. Bonachea, H.-H. Su, H. Sherburne, B. Golden, and A. D. George, “GASP! A standardized performance analysis tool interface for global address space programming models,” in Proceedings of the 2006 Workshop on State-of-the-Art in Scientific and Parallel Computing, June 2006.
[74] Nallatech, “H101-PCIXM PCI-X FPGA accelerator card,” http://www.nallatech.com/PCI-Express-Cards/h101-pcixm.html, Retrieved August 2010.
[75] C. Erbas, S. Sarkeshik, and M. M. Tanik, “Different perspectives of the N-Queens problem,” in Proceedings of the 1992 ACM Annual Conference on Communications. New York, NY, USA: ACM Press, March 1992, pp. 99–108.
[76] J. Ellson, E. Gansner, L. Koutsofios, S. C. North, and G. Woodhull, Lecture Notes in Computer Science: Graph Drawing. Springer Berlin/Heidelberg, 2002, ch. Graphviz - Open Source Graph Drawing Tools, pp. 594–597.
[77] T. O. E. Silva, “Maximum excursion and stopping time record-holders for the 3x+1 problem: computational results,” Mathematics of Computation, vol. 68, no. 225, pp. 371–384, 1999.
[78] J.-J. Quisquater, F.-X. Standaert, G. Rouvroy, J.-P. David, and J.-D. Legat, “A cryptanalytic time-memory tradeoff: First FPGA implementation,” in Field-Programmable Logic and Applications. Reconfigurable Computing Is Going Mainstream. London, UK: Springer-Verlag, 2002, pp. 780–789.
[79] T. Pornin and J. Stern, “Software-hardware trade-offs: Application to A5/1 cryptanalysis,” in Proceedings of the 2nd International Workshop on Cryptographic Hardware and Embedded Systems. London, UK: Springer-Verlag, 2000, pp. 318–327.
[80] OpenFPGA, “OpenFPGA GenAPI version 0.4 draft for comment,” http://www.openfpga.org/Standards%20Documents/OpenFPGA-GenAPIv0.4.pdf, Retrieved August 2010.
[81] J. L. Tripp, H. S. Mortveit, A. A. Hansson, and M. Gokhale, “Metropolitan road traffic simulation on FPGAs,” in Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. Washington, DC, USA: IEEE Computer Society, April 2005, pp. 117–126.
[82] GiDEL, “GiDEL PROCStar III™ PCIe x8 computation accelerator,” http://www.gidel.com/pdf/PROCStarIII%20Product%20Brief.pdf, Retrieved August 2010.
[83] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, “LogGP: incorporating long messages into the LogP model - one step closer towards a realistic model for parallel computation,” in Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures. New York, NY, USA: ACM, 1995, pp. 95–105.
[84] R. Haney, T. Meuse, J. Kepner, and J. Lebak, “The HPEC challenge benchmark suite,” in Proceedings of the 9th Annual High-Performance Embedded Computing Workshop, Lexington, MA, USA, September 2005.
[85] K. Nagarajan, B. Holland, C. Slatton, and A. D. George, “Scalable and portable architecture for probability density function estimation on FPGAs,” in Proceedings of the 16th International Symposium on Field-Programmable Custom Computing Machines. Washington, DC, USA: IEEE Computer Society, 2008, pp. 302–303.
[86] M. P. McGraw-Herdeg, D. P. Enright, and B. S. Michel, “Benchmarking the NVIDIA 8800GTX with the CUDA development platform,” in Proceedings of the 11th Annual High-Performance Embedded Computing Workshop, Lexington, MA, USA, September 2007.

BIOGRAPHICAL SKETCH

Seth Koehler is a Ph.D. graduate from the Department of Computer and Information Science and Engineering at the University of Florida. He received two B.S. degrees, in computer engineering and mathematics, from the University of Florida in 2003. His research focuses on the development of performance analysis and verification concepts and tools for reconfigurable systems. Additional interests include algorithms, number theory, numerical analysis, and game theory.