Abstraction and Simulation for Strategic Design-Space Exploration in Reconfigurable Computing

Permanent Link: http://ufdc.ufl.edu/UFE0041561/00001

Material Information

Title: Abstraction and Simulation for Strategic Design-Space Exploration in Reconfigurable Computing
Physical Description: 1 online resource (114 p.)
Language: english
Creator: Reardon, Casey
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Due to recent trends in computing that favor device technologies and applications exploiting explicit parallelism to achieve greater performance, reconfigurable computing (RC) systems consisting of highly parallel applications executing on platforms featuring reconfigurable devices such as FPGAs are becoming an increasingly important option for accelerating applications in high-performance and embedded computing. Unfortunately, the time and difficulty associated with developing applications for RC platforms are often prohibitive, making it difficult to exploit the potential gains in performance and power savings that RC can offer. In order to facilitate RC productivity, better concepts and tools are needed to allow designers to plan and analyze their designs before coding a specific (and possibly fruitless) implementation, a process we call formulation. The research presented here defines a formal framework and presents a set of techniques to address the need for better RC formulation, which includes 1) a script-based discrete-event simulation framework for rapid analysis of RC systems, 2) an abstract modeling language for conveniently representing RC systems that can be integrated with existing prediction and analysis methods, and 3) an algorithm for automated scheduling and partitioning of applications onto scalable RC platforms. Case studies show the simulation framework to provide performance prediction results across multiple applications and platforms with errors of less than 10% and in a fraction of the time that traditional functional simulators require. The abstract modeling framework is demonstrated in modeling a number of RC systems and serves as an effective interface to the simulation framework. The automated scheduling algorithm for scalable RC systems efficiently provides users with near-optimal partitions and schedules of heterogeneous task graphs mapped to potentially large-scale distributed RC-based clusters.
Combined, these techniques allow RC designers to efficiently model, analyze, and document their designs early in the development process, which is projected to provide large improvements in overall productivity and thus expand the usage of RC technologies.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Casey Reardon.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Local: Adviser: George, Alan D.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0041561:00001


Full Text

ACKNOWLEDGMENTS

This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422. I would like to gratefully acknowledge vendor equipment and/or tools provided by Xilinx, Altera, MLDesign Technologies, SRC, Nallatech, XtremeData, and the George Washington University that helped make this work possible. Special thanks to my committee and fellow lab members whose efforts contributed to the research presented here, including Gongyu Wang, Brian Holland, Eric Grobelny, Mark Oden, Adam Jacobs, and many others. And finally, a very special thanks to Dr. George for guiding and advising me throughout my doctoral studies.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND AND RELATED RESEARCH
  2.1 Reconfigurable Computing
  2.2 RC Performance and Power Prediction
  2.3 Abstract Modeling
  2.4 Automated Mapping and Design-Space Exploration

3 RCSE: A SIMULATION FRAMEWORK FOR RAPID SYSTEM ANALYSIS
  3.1 Simulation Framework Overview
    3.1.1 Application Domain
    3.1.2 Simulation Domain
  3.2 Simulation Framework Walkthrough
    3.2.1 Characterization and Calibration Results
    3.2.2 Validation Results
  3.3 Application Case Studies
    3.3.1 HSI Performance Case Studies
    3.3.2 MD Performance Case Studies
    3.3.3 MD Power Case Study
  3.4 Conclusions

4 RCML: AN ABSTRACT MODELING LANGUAGE FOR CREATING SYSTEM-LEVEL ESTIMATION MODELS
  4.1 RCML Overview
    4.1.1 Algorithm Modeling
    4.1.2 Architecture Modeling
    4.1.3 System Modeling and Mapping
    4.1.4 RCML Editor
  4.2 RCSE Input Generator Tool
  4.3 RCML+RCSE Case Study
  4.4 Conclusions

5 [...]
  5.1 Problem Definition and Assumptions
  5.2 Algorithm Overview
    5.2.1 List-based Scheduling Stage
    5.2.2 Iterative Design-space Exploration Stage
    5.2.3 Illustrative Example
  5.3 RCML Integration
  5.4 Case Studies
  5.5 Conclusions

6 CONCLUSIONS

APPENDIX

A DESCRIPTIONS OF APPLICATIONS USED THROUGHOUT RESEARCH
  A.1 Molecular Dynamics
  A.2 Hyperspectral Imaging

B DESCRIPTIONS OF RC PLATFORMS USED THROUGHOUT RESEARCH
  B.1 Sigma Cluster
  B.2 H101 Cluster
  B.3 SRC-6 Platform
  B.4 XD1000 Platform
  B.5 Novo-G Platform

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Summary of Experimental Platforms used in RCSE Case Studies
3-2 Summary of Validation Results
3-3 Summary of Key Power Model Parameters
4-1 RCML Algorithm Model Constructs
4-2 RCML Architecture Model Constructs
4-3 Simulative Validation Results
5-1 Summary of Key LBS-stage Algorithm Metrics
5-2 Task Parameters for Walkthrough Example
5-3 Calculated Task Metrics for Walkthrough Example
5-4 Scheduling Algorithm Case Study Results

LIST OF FIGURES

1-1 FDTE Model
1-2 RC Formulation Framework
2-1 Typical FPGA Architecture
3-1 RCSE Framework Diagram
3-2 Sample RC Application Script
3-3 Throughput for FPGA reads (a) and writes (b) on H101 system based on I/O benchmark
3-4 Throughput for FPGA reads (a) and writes (b) on XD1000 system based on I/O benchmark
3-5 Throughput for FPGA reads (a) and writes (b) on SRC-6 system based on I/O benchmark
3-6 Predicted Speedup of ACSM vs. Spectral Bands
3-7 Predicted Speedup of ACSM vs. SRAM Banks on XD1000 System
3-8 Predicted Speedup of HSI vs. System Size
3-9 Predicted Performance of MD on XD1000 System
4-1 Abstraction Pyramid Comparing Levels of Modeling for Hardware Applications (adapted from [44])
4-2 Overview of RCML Methodology
4-3 Example RCML Algorithm Model of a Matrix Multiply
4-4 RCML Architecture Model of Experimental Platform
4-5 Screenshot of RCML Model Editor in Eclipse
4-6 RCML Algorithm Model of HSI Application
4-7 HSI Predicted Runtime vs. Number of Nodes
5-1 The Gap In Existing Automated Scheduling Research
5-2 Pseudocode for Automated Scheduling Algorithm
5-3 Task Graph for Walkthrough Example
5-4 Screenshot of ADSE Integration in RCML Environment
5-5 [...]
5-6 MVA Task Graph

CHAPTER 1
INTRODUCTION

As computing applications and platforms continue to increase in size and complexity, two reformations are taking place. The first is an architectural reformation, where improvements in the computational capabilities of devices are achieved through explicit parallelism, since it has become increasingly difficult to provide comparable performance increases through higher clock speeds and implicit instruction-level parallelism alone. In response to this architectural reformation, an application reformation is underway as well, requiring explicit parallelism in applications to exploit the parallelism of new hardware. Under these two reformations, reconfigurable computing (RC) is becoming recognized as an increasingly important and viable paradigm for high-performance computing at a time when the size and power consumption of traditional supercomputers and clusters have grown to alarming levels. With RC, the performance potential of underlying hardware resources in a system can be fully realized in a highly adaptive manner. RC extends the fields of large-scale and embedded high-performance computing by incorporating and dynamically reconfiguring reconfigurable devices at run-time to accelerate operations that would otherwise be performed in software. Hybrid systems of microprocessors and FPGAs can leverage system-level concepts from conventional high-performance computing while accommodating hardware reconfigurability, providing even greater benefits. For these reasons, recent trends and research suggest RC systems will become common in fields such as computer vision [1], digital signal processing [2], embedded computing [3], and many others.

Despite the gains in performance and efficiency over traditional systems that RC can provide, development for RC can be a prohibitively long and difficult process. Hardware circuits are often written using a hardware description language (HDL), [...]

Developing applications for both RC and general computing platforms typically follows the iterative four-stage process illustrated in Fig. 1-1, which we call the FDTE model. The initial stage in the FDTE model is formulation, where strategic planning and exploration are performed before coding a specific implementation. Early exploration during formulation can help ensure that the proposed implementation will meet project specifications before intensive coding has begun. In practice, this stage is often bypassed or performed haphazardly using quick and informal techniques, which can have dire consequences later in the development process. In the design stage, code is developed to implement the specific solution arrived at during formulation. The translation stage is then responsible for converting the design code into an executable format that can be fed to the target platform. In the execution stage, the program is executed, and runtime data is gathered to aid in the debugging, verification, and optimization processes. The data acquired during execution is typically fed back to the design stage, where changes to the code are made to address the issues observed during execution. Thus, the developer must continuously iterate through these three stages until the specifications of the project are met, which can be a costly process.

With overall development times and costs expanding, effective formulation techniques become increasingly critical to ensure timely development. Without effective formulation techniques and tools, costly iterations through the design, translation, and execution (DTE) stages are performed, many of which could have been avoided through careful planning and analysis. The introduction of reconfigurable devices can further increase the design complexity of such systems as well. In addition to traditional design-space parameters such as processor speed, memory subsystem performance, and network interconnect, RC designers must also consider FPGA resources, IO subsystem performance, and reconfiguration capabilities. The large design space further increases the need to conduct strategic algorithm and architecture explorations early in the development process, i.e., during formulation.

Figure 1-1. FDTE Model

Previous work [4] has estimated that formulation may improve productivity by up to 10x, an improvement made largely from reductions of costly design iterations. In order to realize this gain in productivity, tools and techniques are needed that support a comprehensive and efficient formulation process. The research presented here focuses on defining and enabling a complete formulation framework tailored for RC. To realize this goal, three areas in formulation need to be addressed and supported. First, a technique for predicting the performance and power consumption of preliminary designs is needed; otherwise, designers cannot make informed decisions about the quality of their preliminary designs. Second, an abstract model is needed to allow designers to intuitively and meaningfully represent their designs during formulation, which can then integrate more efficiently with prediction tools and be used for code generation when transitioning to the design stage. Third, algorithms for automated mapping and design-space exploration (DSE) can further automate RC formulation by efficiently generating automated schedules and partitions that can be analyzed or implemented by the user. Figure 1-2 illustrates the relationship between these three areas in formulation. The research presented in this document is divided into three phases, each proposing a novel solution to one of the primary areas of RC formulation introduced here.

Figure 1-2. RC Formulation Framework

The first phase of my research addresses the need for performance prediction by presenting an innovative framework for simulative prediction of applications on RC systems which focuses on balancing simulation speed and fidelity, called the RC Simulation Environment (RCSE). Based on the Fast and Accurate Simulation Environment (FASE) methodology [5], the RCSE framework employs a process where RC applications are described by a sequence of high-level events, forming a script which stimulates discrete-event system models. Using a high level of abstraction, this framework can provide a broad range of timely and meaningful predictions for a [...]
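The idea of a script of high-level events stimulating discrete-event models can be illustrated with a minimal sketch. It is not the RCSE implementation: the event names, the `MODELS` table, the `Simulator` class, and every rate and latency below are hypothetical placeholders chosen only to show the mechanism, and the sketch does not model contention for shared resources.

```python
import heapq

# Hypothetical timing models: each maps a script event to a duration in seconds.
# All rates and latencies are placeholder values, not measured platform data.
MODELS = {
    "cpu_compute": lambda ev: ev["ops"] / 1e9,             # assumed 1 Gop/s host CPU
    "fpga_write": lambda ev: 10e-6 + ev["bytes"] / 500e6,  # assumed 10 us latency, 500 MB/s
    "fpga_compute": lambda ev: ev["cycles"] / 100e6,       # assumed 100 MHz FPGA core
    "fpga_read": lambda ev: 10e-6 + ev["bytes"] / 500e6,
}


class Simulator:
    """Minimal discrete-event kernel: a time-ordered queue of callbacks."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0  # tie-breaker so simultaneous events pop in insertion order

    def schedule(self, delay, action):
        heapq.heappush(self._queue, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action()
        return self.now


def play(sim, name, script, log):
    """Feed one node's script to the simulator, one event after another."""

    def step(i):
        if i == len(script):
            return
        ev = script[i]
        duration = MODELS[ev["type"]](ev)
        log.append((name, ev["type"], sim.now, duration))
        sim.schedule(duration, lambda: step(i + 1))

    sim.schedule(0.0, lambda: step(0))


script = [
    {"type": "cpu_compute", "ops": 2_000_000},
    {"type": "fpga_write", "bytes": 1_000_000},
    {"type": "fpga_compute", "cycles": 100_000},
    {"type": "fpga_read", "bytes": 500_000},
]

sim, log = Simulator(), []
play(sim, "node0", script, log)
play(sim, "node1", script[:2], log)  # a second, shorter script runs concurrently
predicted_runtime = sim.run()
print(f"predicted runtime: {predicted_runtime * 1e3:.3f} ms")
```

Because the script only names events and sizes, the same script can be replayed against a different `MODELS` table to compare candidate platforms without rewriting the application description.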

In the second phase of my research, a novel abstract estimation modeling environment tailored for representing RC systems and applications is introduced, called the Reconfigurable Computing Modeling Language (RCML). RCML models provide an abstract representation of RC systems during formulation, which can be fed to various prediction tools for analysis. The RCML framework enables users to separately model the application and execution platform architecture under study, providing specific constructs for defining parallelism, communication patterns, and other common aspects in RC applications. A complete system model is then created by mapping an application and platform model together. This process allows individual models to be reused for any number of mappings, and supports an iterative design-space exploration process in formulation. Furthermore, an input generator is created that integrates RCML with the RCSE framework developed in the first phase of research, supporting automated simulative prediction of systems modeled in RCML.

In the third phase of my research, an algorithm uniquely designed to support automated scheduling and partitioning of applications on large-scale RC systems is presented. The algorithm extends existing techniques used for RC-based scheduling and partitioning to include support for scalable, multi-node architectures. The algorithm developed in this phase of research is divided into two stages. In the first stage, novel list-based scheduling (LBS) heuristics are used to quickly generate an initial schedule. The second stage of the algorithm consists of an extension of the Kernighan-Lin heuristic that iteratively analyzes a set of moves for each unfixed node in the task graph, [...]
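The two-stage structure described above (a fast list-based pass followed by iterative move-based refinement) can be sketched generically. This is not the dissertation's algorithm: the priority metric (a standard upward rank), the task graph, the costs, and the device names are all assumptions, transfer costs are ignored, and the second stage is a plain greedy single-move improvement pass standing in for the Kernighan-Lin extension.

```python
from functools import lru_cache

# Hypothetical task graph: per-device execution cost and successor lists.
COST = {
    "A": {"cpu": 2, "fpga": 3},
    "B": {"cpu": 6, "fpga": 2},
    "C": {"cpu": 3, "fpga": 4},
    "D": {"cpu": 2, "fpga": 2},
}
SUCC = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
DEVICES = ["cpu", "fpga"]
PRED = {t: [p for p in SUCC if t in SUCC[p]] for t in SUCC}


@lru_cache(maxsize=None)
def upward_rank(task):
    """Mean cost of the task plus the longest mean-cost path to an exit task."""
    mean = sum(COST[task].values()) / len(COST[task])
    return mean + max((upward_rank(s) for s in SUCC[task]), default=0.0)


def finish_times(assign, order):
    """Run tasks in priority order, one at a time per device; no transfer costs."""
    free, finish = {}, {}
    for t in order:
        dev = assign[t]
        start = max([free.get(dev, 0.0)] + [finish[p] for p in PRED[t]])
        finish[t] = start + COST[t][dev]
        free[dev] = finish[t]
    return finish


def list_schedule():
    """Stage 1: visit tasks by descending rank, pick the earliest-finish device."""
    order = sorted(COST, key=upward_rank, reverse=True)
    assign = {}
    for i, t in enumerate(order):
        assign[t] = min(
            DEVICES,
            key=lambda d: finish_times({**assign, t: d}, order[: i + 1])[t],
        )
    return order, assign


def refine(order, assign):
    """Stage 2: re-map single tasks while the makespan strictly improves."""
    best = max(finish_times(assign, order).values())
    improved = True
    while improved:
        improved = False
        for t in order:
            for d in DEVICES:
                trial = {**assign, t: d}
                span = max(finish_times(trial, order).values())
                if span < best:
                    assign, best, improved = trial, span, True
    return assign, best


order, initial = list_schedule()
initial_span = max(finish_times(initial, order).values())
final, final_span = refine(order, initial)
print(order, final, final_span)
```

With strictly positive costs, sorting by descending upward rank always places a task after its predecessors, which is why `finish_times` can process the list front to back. The refinement loop terminates because it accepts a move only when the makespan strictly decreases.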

The remainder of this document summarizes the background, technical details, and results of the research discussed here, and is organized as follows. Background and related research pertaining to RC and the research in this document is summarized in Chapter 2. In Chapter 3, the RCSE framework for simulative performance prediction of RC systems is presented (Phase 1). In Chapter 4, the abstract modeling language RCML is presented, along with its integration with the RCSE framework (Phase 2). The automated scheduling and partitioning algorithm for large-scale RC systems is introduced and illustrated in Chapter 5 (Phase 3). Finally, the conclusions and acknowledgements relating to the research presented here are summarized in Chapter 6. Detailed information regarding the applications and RC execution platforms used throughout this research is provided in Appendix A and Appendix B, respectively.

CHAPTER 2
BACKGROUND AND RELATED RESEARCH

The background and related research in this chapter is divided into four sections. Section 2.1 provides a brief general background of RC. Section 2.2 discusses methods of performance and power prediction relating to RC, with additional focus on simulation. Next, Section 2.3 covers related research in abstract modeling languages used in computing. Finally, Section 2.4 discusses existing algorithms and models used for automated mapping and DSE of RC applications.

2.1 Reconfigurable Computing

[...] [6]. The first is to employ a general-purpose microprocessor that executes applications from software code. A general-purpose microprocessor offers a highly flexible computing solution, supporting the execution of any application that can be described using the processor's instruction set. The second approach is the use of one or more hardware circuits fabricated to perform a specific function, such as an Application-Specific Integrated Circuit (ASIC). ASICs typically surpass microprocessors in performance and efficiency, but lack flexibility since the configuration of the circuit cannot be changed after fabrication. Reconfigurable computing offers a solution that can potentially combine the benefits of both computing paradigms, implementing highly efficient hardware circuits while allowing device reconfiguration to maintain computing flexibility. For this reason, RC is being adopted as an increasingly popular topic in numerous research fields [7]. The remainder of this section provides a very brief overview of a few topics in RC. For additional background information, the survey in [8] provides a very useful overview of relevant tools and technologies relating to RC.

While RC encompasses many different technologies, the Field-Programmable Gate Array (FPGA) is the most prominent device used in RC. An FPGA is a fine-grained reconfigurable device made up of an array of logic blocks. Each logic block typically includes a programmable lookup table (LUT) and a latch or flip-flop to hold state information. The logic blocks are normally laid out in a 2-dimensional array, connected by routing resources that run throughout the chip. Modern FPGAs usually include other specialized resources such as memory banks, multipliers, and embedded processors. The typical architecture of FPGAs is illustrated in Fig. 2-1 [9].

Figure 2-1. Typical FPGA Architecture

RC systems usually include one or more FPGAs working alongside a host microprocessor, acting as accelerators used to speed up key computations running on the system. FPGAs can be integrated within heterogeneous systems in multiple ways, e.g., as an external device connected to the host server using an interconnect such as USB or Ethernet, connected as a peripheral device using PCI, located on a system-on-chip (SoC), or residing in a CPU socket on the server's motherboard. Each form of RC system architecture offers a varying degree of coupling between the FPGA and host CPU, which can significantly affect the efficiency of communication between the CPU and FPGA.
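A first-order way to see why coupling matters is to model each CPU-FPGA transfer as a fixed setup latency plus a bandwidth-limited term. The sketch below makes that assumption explicit; the link names and all latency and bandwidth figures are hypothetical placeholders, not measurements of any platform discussed in this document.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Time to move nbytes over a link with a fixed per-transfer setup latency."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def effective_throughput(nbytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually achieved once the setup latency is included."""
    return nbytes / transfer_time(nbytes, latency_s, bandwidth_bytes_per_s)


# Hypothetical link parameters: (latency in seconds, peak bandwidth in bytes/s).
LINKS = {
    "network-attached": (50e-6, 100e6),
    "peripheral bus": (5e-6, 1e9),
    "in-socket": (0.5e-6, 3e9),
}

for name, (lat, bw) in LINKS.items():
    for size in (1_024, 1_048_576):
        eff = effective_throughput(size, lat, bw)
        print(f"{name:18s} {size:>9d} B  {eff / 1e6:8.1f} MB/s  ({eff / bw:.0%} of peak)")
```

Under this model, small transfers are dominated by latency on every link, so an application that exchanges many small buffers sees only a fraction of the peak bandwidth, and the penalty is largest on the most loosely coupled link.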

2.2 RC Performance and Power Prediction

[...] [10], or HDL circuit simulators such as ModelSim [11]. While execution-driven simulation can provide very accurate and useful results for small systems, these simulations are often too slow and impractical when considering large-scale systems and applications. This problem becomes compounded when researchers need results [...]

[...] [12 14]. This technique reduces simulation times by executing only a portion of the benchmark under consideration, which is then fed into a detailed processor simulator. One drawback of this approach is the need for detailed warming, where sections of code are executed in detail simply to estimate the state of the system at the beginning of a block of sampled code. Furthermore, it is unclear how such an approach would be applied to parallel systems, where the communication between processors could be omitted during sampling.

Trace-driven simulation offers another alternative by using a method of abstraction to formulate a representation of the application used as input into the simulation models. While using an abstract representation makes it impossible to replicate the exact behavior of the entire system, relatively accurate performance results can be gathered in significantly less time. The exact speed and accuracy of a trace-based simulator depend on the fidelity of the simulation models and the method of abstraction used for gathering the trace data. Several research projects have offered varying approaches for achieving accurate simulation results for high-performance computing systems without the need for prohibitively long simulations [15 17]. Unfortunately, none of the simulation projects listed thus far support the simulation of RC-based applications and systems, but instead focus on simulating general-purpose computing systems.

While many research projects have investigated the domain of traditional computing simulation, comparatively few efforts have focused on system-level performance prediction of RC systems. The RC Amenability Test (RAT) defines a set of analytic equations for predicting the potential speedup of a particular FPGA core design [18]. Accurate throughput analysis is the primary focus of the RAT methodology, which uses a series of simple equations to predict the performance of the application based [...]

[...] [19]. Results from case studies show these equations to be reasonably accurate for the applications under study, but the accuracy of these results diminishes as the cluster size increases. While the analytical models place great emphasis on load balancing, it can be hard to predict the precise effects of resource contention in complex systems with such models. Another set of analytical equations has been proposed which is designed to find potential performance bottlenecks imposed by the memory transfer system with applications that utilize FPGAs [20]. These analytic models abstractly characterize algorithms based on their data-movement patterns and computational density, which are then employed to identify the memory layer that will act as the largest bottleneck for the algorithm. Thus, the analytical models are limited to determining the best performance the hardware architecture is capable of supporting, as opposed to predicting the actual performance the architecture is expected to produce. Though analytic models are often valuable for providing useful data very quickly, their effectiveness is limited for performance prediction of large and complex RC systems. Therefore, the RCSE methodology developed in Phase 1 of the research presented here focuses on using simulative techniques to provide performance prediction data.

Other projects have focused on simulative models and virtual prototypes to achieve performance prediction of RC systems. One example is the Hybrid System Architecture Model (HySAM) and DRIVE simulation framework [21]. Within HySAM, architectures are defined by a set of attributes describing the capabilities of the system, while applications are partitioned into a series of tasks split between the CPU and RC device that are fed to DRIVE for visualization. A separate project created a co-simulation environment which integrated two existing cycle-accurate simulators [22]. The authors employed [...]

[...] [23]. The Simics-based simulator employs a cycle-accurate processor and memory simulator, spending significant time and effort capturing the precise pattern of memory accesses when RC device transfers are involved. Kernels executed by the reconfigurable device are simply performed by a software equivalent in the simulator to ensure functional correctness, while the hardware execution time is provided to the simulator so as to correctly advance simulation time. Results have shown both of the latter two simulation environments to be accurate in the case studies presented, which one would expect when using cycle-accurate models. However, in both cases, the results are only reported for single-node machines that contain a single processor chip and RC device. Also, no information is given describing the time and accuracy associated with each simulator when the system size is scaled to a large number of nodes. Since the simulation framework presented in Chapter 3 aims to provide timely prediction results for large-scale systems during formulation, a trace-driven approach is used as opposed to the execution-driven techniques employed by the Simics and SimpleScalar/ModelSim co-simulation environments.

In complement to performance prediction, a healthy amount of research exists in the field of power consumption modeling and prediction for hardware devices, including FPGAs. In addition to their potential for increasing the performance of certain applications, reconfigurable devices also offer the advantage of power savings over many current processors. Since FPGAs typically operate at clock speeds far below those of current microprocessors, while only using the resources needed to configure the desired circuit, the power savings from using an RC-based solution can be significant. Consequently, the ability to predict the power consumption of reconfigurable [...]

[...] [24 25]. These tools apply information gathered during synthesis of the FPGA design to accurately estimate the power consumption of the circuit. Similarly, the power evaluation framework from [26] is designed for evaluating the power efficiency of FPGA architectures based on attributes such as their LUT size, cluster size, and wiring scheme. At a higher level, power consumption can be estimated earlier in the design cycle using a model similar to the one described in [27]. High-level parameters of the device and the design are combined to allow designers to gain a rough estimate of power consumption before the circuit design is complete and ready to be tested within the CAD tool environment. Given that the RCSE framework is intended to provide analysis of potential designs during formulation, a high-level power model similar to the one presented by [27] is used. The remaining power estimation models require information describing the final core design, which is not often available in the formulation stage.

2.3 Abstract Modeling

[...] [28 29]. AADL models are collections of pre-defined components which are further characterized through properties, interfaces, and networks of subcomponents. Most AADL components can be classified as either software, execution platform, or composite components. AADL supports language extensions through property sets [...]

[...] [30], memory [31], and real-time scheduling requirements [32] of the system. AADL is also commonly used to automatically generate an implementation of the modeled system. RCML models can potentially be used for automated implementation as well, especially when the tasks in the algorithm model can be linked to cores in an existing core library. While such capabilities are critical in allowing users to transition to less abstract models of their RC system, automated implementation from RCML models will be the subject of future work and thus is beyond the scope of the research presented in this document. An extensible open-source tool known as OSATE is available for creating and working with AADL models in a textual and graphical environment [33].

An even more popular modeling language within the computing field is the Unified Modeling Language (UML) [34]. UML is widely used throughout the software-development industry as a language for planning and designing object-oriented, software-intensive projects. The UML specification defines 13 types of diagrams that each provide a partial view of the underlying meta-model, allowing users to interact with the diagram type(s) best suited for describing their problem. UML profiles such as SysML [35] and MARTE [36] have been created to support modeling complex and real-time embedded systems in UML, respectively. Like AADL, MARTE and SysML define extensive modeling specifications with software, hardware, and system constructs for modeling complex systems with a large number of interacting software processes and devices. Since these profiles are designed to support large modeling projects through many development [...]

PAGE 26

Several modeling environments have been developed for building and evaluating functional models of heterogeneous and RC systems. Ptolemy is a well-known modeling framework for simulating and prototyping heterogeneous systems supporting multiple models of computation (MoC) [37]. Ptolemy studies the support of heterogeneous system design through the use of hierarchical heterogeneity [38]. Hierarchical heterogeneous models are divided into local submodels which share a common MoC. Each locally homogeneous submodel can then be treated as a single component in a model network with components featuring potentially several MoCs whose interfaces and interactions are automated through Ptolemy. As with most existing model-based development environments for embedded system design, Ptolemy provides users with an abstract executable model as an entry point in the development process. The RCML environment is intended to complement such tools by providing a framework to perform early design-space exploration at a higher level of abstraction, before specifying a functional model.

Unlike Ptolemy, Metropolis and Artemis are two examples of model-based frameworks that provide support for separately and independently modeling the functionality and architecture of an embedded system. Metropolis is a platform-based design environment based on a metamodel with formal execution semantics [39]. The Metropolis metamodel can be used to specify the function (using existing or new MoCs), architecture, and mapping of the system. The Metropolis framework also includes tools


[40].

The Artemis framework provides high-level modeling and simulation methods and tools for system-level performance evaluation and exploration of heterogeneous and reconfigurable embedded multimedia systems [41, 42]. Artemis describes a systematic approach to explore embedded system architectures at multiple levels of abstraction, following popular embedded design concepts such as separation of concerns [43] and the Y-chart design methodology [44]. The Artemis workbench includes a system-level modeling and simulation environment known as Sesame [45]. In both Artemis and Sesame, applications are described as a Kahn Process Network (KPN) [46] at the highest level of abstraction, while architecture models are implemented from a library of generic building blocks using either Pearl or SystemC. As the architecture is described in more detail, the abstract application model is refined into a more detailed dataflow graph through trace transformations [47]. While Sesame and Metropolis have been shown to be effective approaches to heterogeneous embedded system design, they are not particularly suited for estimation modeling of RC systems. For example, a KPN assumes parallel processes communicate using unbounded FIFO channels, but effective use of limited buffering resources on RC devices is often critical to maximize performance. In addition, complex communication patterns and data parallelism are not conveniently expressed using their basic constructs, but could easily be generated and constructed from an existing RCML model, which is intended to function at a higher level of abstraction than Artemis and Metropolis support.

A hierarchical model-based framework for FPGA development in RC is presented by [48], which includes support for evaluation of design alternatives early in the development process. The framework integrates a high-level performance estimator (HiPerE) and a design-space exploration tool (DESERT) for efficient evaluation of candidate mappings against user-specified performance requirements onto


[49]. DESERT is used to prune the design space to eliminate any candidate design that fails to meet user-specified constraints. HiPerE is then employed to quickly evaluate the remaining designs using trace-based simulations, in terms of energy and latency. Both DESERT and HiPerE are integrated into the MILAN integrated simulation environment for embedded system design [50]. While the framework in [48] overlaps to some degree with the RCML framework presented in Chapter 4, there are several key differences. First, the framework in [48] is intended to serve as a development environment, unlike RCML, which is a modeling environment designed for estimation-level modeling of RC systems. Furthermore, RCML includes a larger number of pre-defined specialized constructs for modeling RC applications and platforms, intended to further improve productivity. Finally, HiPerE is only designed to evaluate architectures that can be described by GenM, which is designed for SoC architectures. RCML is designed to model and analyze a broader set of high-performance RC architectures.

[51, 52]:


It can be seen in Equation 2 that the number of possible mappings increases exponentially with the number of processors and tasks. For this reason, mapping algorithms are applied that can suggest a single mapping that is close to optimal, if not optimal itself. Many previous algorithms have been developed for HW/SW partitioning [53 55] and scheduling on parallel heterogeneous systems [56]. Meanwhile, comparatively few research projects have focused on algorithms for automated scheduling and partitioning in RC environments, though several approaches have been proposed, which are summarized in the remainder of this section.

A pair of algorithms has been developed for scheduling in real-time systems with reconfigurable architectures. An evolutionary scheduling algorithm built into the CORDS co-synthesis framework for distributed real-time reconfigurable embedded systems is discussed in [57]. An integrated partitioning and scheduling algorithm for co-synthesis of hard real-time RC systems is described in [58], which uses the Latest Deadline First heuristic for software tasks and then moves tasks to an FPGA co-processor (adding FPGAs to the architecture as needed) until a feasible schedule exists that meets all real-time deadlines.

Stitt proposes a pair of efficient sub-heuristics for hardware/software partitioning and multi-version implementation exploration of reconfigurable hardware circuits [59]. The sub-heuristics are specialized for support of balanced (i.e., execution time is approximately evenly distributed throughout each region of the algorithm) and unbalanced applications. Li et al. describe a hardware/software partitioning approach for embedded reconfigurable architectures developed for the Nimble framework [60]. The algorithm identifies regions of interest in the task graph, and exhaustively searches those regions for an optimal partitioning of the local region that results in near-optimal solutions for the global system.
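The growth of the mapping space noted at the start of this passage can be made concrete with a short calculation. The sketch below assumes the simplest unconstrained case, in which each of n tasks may be assigned independently to any of p processors, giving p^n mappings. This counting rule and the function name are illustrative assumptions and may differ from the exact form of Equation 2.

```python
def count_mappings(num_tasks: int, num_processors: int) -> int:
    """Number of ways to assign each task to one processor (assumed p**n rule)."""
    if num_tasks < 0 or num_processors < 1:
        raise ValueError("need num_tasks >= 0 and num_processors >= 1")
    return num_processors ** num_tasks


if __name__ == "__main__":
    # Even a small platform makes exhaustive search impractical very quickly.
    for n in (5, 10, 20, 40):
        print(f"{n:2d} tasks on 4 processors: {count_mappings(n, 4):.3e} mappings")
```

Under this assumption, doubling the number of tasks squares the size of the search space, which is why heuristics that propose a single near-optimal mapping are preferred over enumeration.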


[61] introduces an iterative algorithm to determine scheduling and module placement of tasks in a partial reconfiguration environment. The algorithm, which uses a list-based scheduler to evaluate the quality of moves during each iteration, assumes the target architecture consists of a single CPU and FPGA and supports configuration pre-fetching.

Another iterative algorithm similar to the Kernighan-Lin heuristic is proposed in [62], which combines HW/SW partitioning, hardware design-space exploration (DSE), and scheduling. MAGELLAN extends this work by implementing design-space exploration for loop-unrolling and defines additional types of moves for scheduling control-dataflow task graphs [63]. As with all of the existing approaches discussed in this section, their algorithm assumes a single-node architecture with few devices. But since their algorithm supports scheduling, partitioning and automated DSE of RC applications with loop-unrolling and dynamic reconfiguration (common characteristics of scalable RC applications), the algorithm presented in Chapter 5 largely builds upon the work in [62, 63] in order to support large-scale RC systems.
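The move-based style of iterative partitioning mentioned above can be illustrated with a toy example. The sketch below is not the algorithm of [62] or MAGELLAN, and it is simpler than Kernighan-Lin (it never accepts a non-improving move). It is a greedy single-move improvement loop under assumed inputs: each task has a software time, a hardware time, and a hardware area, tasks run sequentially, and the FPGA has a fixed area budget. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    sw_time: float   # execution time on the CPU (s)
    hw_time: float   # execution time on the FPGA (s)
    hw_area: int     # FPGA area needed if mapped to hardware (slices)


def total_time(tasks, in_hw):
    """Sequential runtime for a partition (no overlap or transfer cost modeled)."""
    return sum(t.hw_time if t.name in in_hw else t.sw_time for t in tasks)


def partition(tasks, area_budget):
    """Greedy iterative improvement: repeatedly apply the single best move."""
    in_hw = set()
    while True:
        best_gain, best_move = 0.0, None
        used = sum(t.hw_area for t in tasks if t.name in in_hw)
        for t in tasks:
            if t.name in in_hw:
                gain = t.hw_time - t.sw_time      # candidate move back to software
            elif used + t.hw_area <= area_budget:
                gain = t.sw_time - t.hw_time      # candidate move to hardware
            else:
                continue                          # does not fit on the device
            if gain > best_gain:
                best_gain, best_move = gain, t.name
        if best_move is None:
            return in_hw                          # no improving move remains
        in_hw ^= {best_move}                      # toggle the chosen task


tasks = [Task("fft", 10.0, 1.0, 3000),
         Task("filter", 4.0, 2.0, 2500),
         Task("io", 1.0, 3.0, 500)]
hw_tasks = partition(tasks, area_budget=5000)
```

Each accepted move strictly reduces the modeled runtime, so the loop terminates; the result is a local optimum only, which is the usual trade-off for this family of heuristics.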


In this chapter, a framework for simulating applications on RC systems called the RC Simulation Environment (RCSE) is presented, which focuses on balancing simulation speed and fidelity. Built atop the Fast and Accurate Simulation Environment (FASE) methodology [5], RCSE employs a process where reconfigurable computing applications are described by a sequence of high-level events, forming a script which stimulates discrete-event system models. Using a high level of abstraction, this framework can rapidly provide a broad range of predictions for a given reconfigurable application and system when the performance of the application is data-independent. Furthermore, simulation results can be obtained for very large-scale and complex RC systems without requiring an inordinate amount of time or resources. These results can be obtained efficiently during the formulation stage, allowing designers to make informed strategic decisions early in the development process.

The remainder of this chapter is organized as follows. A detailed description of the RCSE framework is presented in Section 3.1. Section 3.2 provides a walkthrough with RCSE, including validation results to verify the accuracy of our approach. Results from a selected set of application case studies are provided in Section 3.3, demonstrating the effectiveness and usefulness of this framework. Finally, conclusions from this phase of research are summarized in Section 3.4.


To realize the formulation-based development flow described in the previous paragraph, a simulation framework was created to support efficient algorithmic and architectural exploration over many design parameters. This section details the RCSE framework for performance prediction of RC systems that balances speed, accuracy and flexibility. To achieve this goal, additional effort is invested within the application domain to characterize the behavior of the application so that instruction-level details of the program can be abstracted to reduce the complexity of the discrete-event simulation. This framework is split into two separate domains: the application domain and the simulation domain. This split allows users to characterize applications independently of the candidate system architectures while supporting concurrent model development that is independent of the potential applications. This independence offers a high level of data and model reusability and modularity, which in turn facilitates rapid analyses of numerous virtually prototyped systems and applications. With this reusability, users can effectively iterate within the formulation stage by making changes within one domain and simply reusing the model from the second domain for the next simulation as the design space is explored. The overall structure of the RCSE framework, and the key steps within each domain, are illustrated in Fig. 3-1. Sections 3.1.1 and 3.1.2 provide details and examples of the procedures employed in the application and simulation domains, respectively.

The Fast and Accurate Simulation Environment (FASE) is a simulation methodology that attempts to balance speed and accuracy for simulating message-passing parallel applications on HPC clusters using a two-stage simulation process [5]. The RCSE framework presented in this section leverages key concepts and models used in FASE, but also proposes many extensions and additions to support RC-based analyses.


Figure 3-1. RCSE framework diagram

While FASE defines two primary stages in its simulation process, the RCSE framework defines six primary stages, including the hardware core characterization stage, which is unique to RC. In addition, new sets of discrete-event models for RC-based architectural components have been developed, and the scripting language has been redefined and significantly expanded in order to effectively represent RC applications.

As previously mentioned, FASE is primarily designed for simulating message-passing parallel applications on HPC clusters. Since the RCSE framework incorporates many of the same approaches used by FASE, certain types of systems are not effectively modeled with RCSE. For example, FASE and RCSE are currently designed to predict the performance of a single application on an unloaded system; extensions are therefore necessary to support analysis on a multi-user or loaded system. Streaming applications are not easily supported either, since they cannot be naturally described with the current scripting language. Finally, RCSE assumes that actions on the RC device are initiated by commands from a host microprocessor; thus, systems in which the RC device acts autonomously are not currently supported either. This last limitation could be addressed with extensions that would allow separate scripts to be specified for defining the behavior of autonomous RC devices.


3-1

Hardware core characterization, which is not a part of the original FASE framework, defines the behavior of the kernel or function to be performed within the RC device. The characterization of hardware cores is performed by defining a set of parameters for the core. These parameters include computational delay (in terms of latency and throughput), resource utilization, input data size, and output data size. The computation time for an RC core can be obtained from multiple sources. For cases where the hardware core design is complete and available to the user, this value may be measured experimentally, or taken from the delays reported by cycle-accurate functional simulations of the hardware design in vendor-supplied tools. Both methods provide reasonably accurate results assuming deterministic behavior of the RC devices. While it is possible to integrate a cycle-accurate hardware simulator into our environment, it makes little sense to use one in cases where a deterministic hardware core is executed numerous times during a single run of an application. Instead of performing costly cycle-accurate simulations numerous times, it is more efficient to extract the performance data from one single run in this early stage. Furthermore, when a finished hardware core is not available to the user, cores can be simulated at the beginning of the development cycle by using parameter estimates based on initial core designs. By


[18], quick estimates of the core performance can be obtained by users with a fundamental understanding of their algorithm. Following the RAT methodology allows the user to estimate the full set of parameters needed to characterize a hardware core. The resource utilization parameters are important in order to consider performance gains when scaling up device size by squeezing more cores onto the fabric at once, thus executing on more data in parallel. The modeling of partial reconfiguration is supported as well by allowing cores to be exchanged within the RC device during the simulation. However, an in-depth study of this technique is not included in this document. The final parameters, the input and output data sizes, allow accurate modeling of transactions with the RC device. These parameters are critical in order to accurately capture the communication delays incurred when passing data between various node components and the RC device.

Another vital step in the application domain that can be conducted in parallel with characterizing the hardware core is the application characterization stage. This step involves identifying and gathering a sequence of key events that defines the behavior and attributes of the target application. The events currently supported include computation conducted by host processors, computation performed by RC devices, along with inter- and intra-node communication. When the performance of the application is data-independent, a specific sequence of these events can be employed to accurately capture the behavior of any parallel or serial RC application. However, gathering these sequences of events is a key challenge when dealing with RC applications due to the large number of programming interfaces implemented to transfer data to and from RC devices. RCSE addresses this challenge by using a scripting language that allows the user to manually represent the behavior of an RC application. When using this approach, classic instrumentation tools can be employed to gather the data relevant to defining host computation and inter-node communication, such as the Performance Application Programming Interface (PAPI) from the University of


Tennessee [64]. This framework currently uses PAPI to trace the number of instructions and memory accesses performed by the host processor during each computation block. This data is used to estimate the amount of time spent by the processor in that computation block, along with predicting the amount of contention that would be experienced with shared resources. Presently, the incorporation of RC events into scripts must be performed manually, since no common standard exists that defines interactions with RC devices. However, the adoption of a standard RC programming interface or abstract modeling language would remove the requirement of manual script creation, since RC-related events could be detected by an automated code or model parser. Combined with other standards such as the Message Passing Interface (MPI) [65], a well-defined and widely accepted programming interface for inter-node communication, the full automatic characterization of a large number of RC and parallel applications could be supported. However, until a common interface is finalized, manual script generation provides the most flexible alternative to cover the wide range of RC application and RC device combinations.

Figure 3-2. Sample RC application script

The final step in the application domain deals with generating scripts that represent the behavior of the target applications. The information collected during the application and core characterization steps defines both the structure and values needed to construct


Fig. 3-2 illustrates an example of an RC application script for a single node of a parallel application. The example script is a simplified version of the script for target detection in the HSI application defined and used in Sections 3.2 and 3.3. The script begins by configuring a single target detection (TD) core on the device named dev1 while providing core details obtained during core characterization. In this example, the TD core is defined to occupy 4500 slices, operate at 200 MHz, and execute on 8192-byte input and output data chunks each in 1000 clock cycles. After the TD core has been configured, the script initializes the node for MPI-based communication via the MPI_Init call, followed by 288 s of computation by the host processor. Afterwards, the node receives 32 MB of data from node 0, which must have a matching MPI_Send in its own script. The script then proceeds by defining 100 iterations of a loop, with each iteration containing an FPGA-based execution of TD followed by a block of host computation. Once the 100 iterations have completed, the node returns data to node 0 via the MPI_Send function. After the data has been returned, a block of host computation follows, after which the script is complete and the simulation terminates. It must be noted that the current scripting syntax is designed to support most of the functionalities exercised in current RC applications and is easily expandable to support other RC events that may arise as programming interfaces mature and new technologies are introduced.
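The structure of such a script can be pictured as an ordered list of events. The following sketch is not RCSE's scripting syntax; it is a hypothetical Python representation that uses only the core parameters quoted above (1000 cycles at 200 MHz per 8192-byte chunk, 100 loop iterations) to estimate the time spent in FPGA execution events. Event names, tuple layouts, and the reading of 32 MB as 32 * 2**20 bytes are assumptions made for illustration; host computation and transfer times are deliberately left unmodeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSpec:
    name: str
    slices: int          # resource utilization
    clock_hz: float      # operating frequency
    cycles: int          # clock cycles per invocation
    in_bytes: int        # input chunk size
    out_bytes: int       # output chunk size

    def exec_time(self) -> float:
        """Seconds per invocation, assuming deterministic core latency."""
        return self.cycles / self.clock_hz


# TD core as characterized in the example script description.
td = CoreSpec("TD", slices=4500, clock_hz=200e6, cycles=1000,
              in_bytes=8192, out_bytes=8192)

# Hypothetical event list mirroring the order of the described script.
script = [
    ("configure", "dev1", td.name),
    ("mpi_init",),
    ("host_compute",),
    ("mpi_recv", 0, 32 * 2**20),      # 32 MB from node 0 (assumed binary MB)
    ("loop", 100, [("rc_execute", "dev1", td.name), ("host_compute",)]),
    ("mpi_send", 0),
    ("host_compute",),
]


def rc_exec_seconds(events, cores):
    """Sum core execution time over all rc_execute events, expanding loops."""
    total = 0.0
    for ev in events:
        if ev[0] == "rc_execute":
            total += cores[ev[2]].exec_time()
        elif ev[0] == "loop":
            total += ev[1] * rc_exec_seconds(ev[2], cores)
    return total


fpga_seconds = rc_exec_seconds(script, {td.name: td})
```

With these parameters each TD invocation takes 5 microseconds of core time, so the FPGA execution events account for only a small part of the scripted runtime; a full simulation would add the host computation and communication delays produced by the system models.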


3-1

The first step in the simulation domain is the model development stage. In this stage, models of key system components are constructed. The simulation environment used for building component and system models is Mission-Level Designer (MLD) from MLDesign Technologies [66]. MLD is a graphical discrete-event simulation tool that supports modular, hierarchical designs of arbitrary systems, allowing for quick development times of models with varying degrees of fidelity. Since the goal of RCSE is performance prediction of RC systems, component models do not fully incorporate mechanisms to manipulate data. In fact, the actual data is abstracted away by only considering how much data rather than what data. As a result, the models focus on the performance timing of the interactions between components exercised by the application. Focusing on system performance facilitates quick model design times (due to the reduced detail associated with each component model) and improved simulation speed (due to less processing needed to execute each component model).

An overview of key component models in the current RC model library is described here, all of which are new or significantly updated compared to the models in the original FASE library. The first component, the RC Script Parser, converts script commands into appropriate data structures processed by the models. The RC Middleware model manages transactions between the host processor and the RC device, and also includes performance-critical overhead incurred by the device drivers. The RC Core model was developed as a generic black-box model, such that any hardware core could be


Once the component models have been developed, the next step in the simulation domain is the calibration of those models. In model calibration, the parameters of a component model are selected such that the model's performance matches that of the corresponding real-world technology. In general, the component models in the RC library have all been designed to be generic and parameterizable. Each of the primary component models accepts a parameter file as an input, used to tune a generic component model to match the particular device of interest. The set of parameters defined in a parameter file is unique for each type of component. For example, the parameter file for an instance of an RC Fabric model includes parameters such as the size, reconfiguration time, number of I/O pins, and maximum clock speed of the FPGA board. By correctly setting the values of the parameter file, the generic RC Fabric model can be calibrated to represent any number of different FPGA or other RC-based modules. In order to calibrate the component models in some cases, experimental measurements must be obtained from benchmarks that exercise the target component. Once the experimental data has been gathered, the parameter file is filled out to match the component attributes to the measured data points. Various metrics such as average


The current framework focuses on the components exercised when transferring data between the host processor and the RC device due to the common bottleneck that arises in this data path for many RC applications. As such, the corresponding component models often require an in-depth calibration process in order to accurately capture the performance of the data transfers. However, when attempting to calibrate the components exercised during FPGA data transfers, it can be very difficult to benchmark either the I/O bus or the drivers in isolation due to the complexity of the system and the proprietary nature of the device drivers, respectively. To resolve this problem, the performance of transactions between the host and RC device is measured as a whole (as seen by the top-level application), and the optimal set of parameters for the drivers is obtained using a parameter solver developed by the authors. The parameter solver, built in Matlab, defines bounds on the communication latency and bandwidth from the experimental data, then iterates over the bounded latency and bandwidth ranges to find the parameter combination that optimizes the communication model, within a definable granularity. By using this iterative approach, the user can choose the error metric used by the parameter solver to define the optimal parameter set. Calibration results obtained using the parameter solver, which is applied to calibrate both intra- and inter-node communication systems, are presented in Section 3.2.1.

The final stage of the simulation domain is system analysis. In this stage, the RC application script is processed by the system models, producing performance results for each candidate system architecture. The performance results can be analyzed to identify bottlenecks in the virtually prototyped systems and conduct what-if scenarios and tradeoff analyses with respect to various design options such as algorithm decompositions and mappings and individual component specifications. The simulation


3.3

The simulation results presented in this chapter are composed of case studies with two applications, HSI and MD, whose details are presented in Appendix A. The applications were simulated on three different RC experimental platforms. The three platforms employed in these experiments each represent unique styles of RC system architectures, thus illustrating the versatility of RCSE. Details regarding each of the three platforms can be found in Appendix B, and are briefly summarized here in Table 3-1.

3.1.1, the PAPI library is employed for gathering memory access data. After


running the application on each system, application scripts similar to the one illustrated in Fig. 3-2 were manually generated from the instrumentation data for each application on each system. It should be noted that the generation of an application script for a system model does not require executing the application under study on the target system. While doing so will most likely produce the most accurate characterization data for a given scenario, there are many cases where this option is simply not available to the user. For cases where both the application and system are not available for instrumentation, an application script can be generated from scratch based on knowledge of the algorithm, or from instrumentation data gathered on a separate system.

Table 3-1. Summary of experimental platforms used in RCSE case studies

In addition to characterizing the applications under study, the hardware cores used in each application must be characterized as well. This characterization data can come from multiple sources, such as experimental measurements, functional-level simulations of the FPGA core, or early design calculations such as those presented in [18]. In the case studies presented here, core characterization data was obtained through experimental measurements, which in turn were applied to tune mathematical models for the execution time of each core. For example, the execution time of the primary FPGA core for MD can be divided into three phases and is characterized as


3 represents the time required to scatter each molecule from on-board memory to the computational pipelines, which must be performed before force calculations begin. The second product represents the force calculations performed between each pair of molecules, as each molecule is broadcast one at a time to all of the computational pipelines. The final product represents the additional time required to write each result back to on-board memory after calculations of all of the molecular interactions have been completed. A similar approach is employed for the execution time of the two HSI cores, which were characterized simply as the product of an experimentally measured constant and the input image size.

For each system, a system model within the discrete-event simulation environment was constructed. Once completed, the system model parameters were then calibrated to match the performance characteristics of the target systems. For this phase of research, the calibration results are limited to the calibration of the I/O interconnect between the RC device and host processor, and the system-area network used to connect nodes within each cluster. To obtain the experimental data used during calibration, a micro-benchmark executed on each system is used to measure transfer times for various sized data transfers between a host processor and its local FPGA, or between nodes across the data network. For all of the calibration results presented in this section, the model parameter values used to define the fitted curve were derived using the Matlab-based parameter solver cited in Section 3.1.2. In each case, the parameter solver is applied to minimize the mean percentage error between the modeled data and experimental performance.

Figs. 3-3A and 3-3B show the experimental results versus simulation results in terms of throughput values for host processor reads from the FPGA (FPGA reads) and


host processor transfers to the FPGA (FPGA writes), in an H101 node. For both of these cases, a penalty for large data transfers is automatically factored in by the parameter solver, due to the negative trend in average throughput for each set of experimental data as the transfer size grew beyond 256 KB. The modeling of this phenomenon was incorporated into the parameter solver and discrete-event models to represent the declining throughput performance sometimes experienced for extremely large data transfers, often caused by windowing or memory buffer overruns within the target device or operating system. Another behavior illustrated here is that for similar-sized transfers, the host processor in an H101 node is often able to write data to the FPGA at twice the rate the processor can read data from the FPGA. Such a disparity often has large implications on the performance of RC applications that are communication-bound; therefore, it is important for the simulation framework to capture this behavior. The RC Middleware models in the current framework allow the user to capture this behavior by introducing additional delays to those experienced by simply traversing the system interconnect.

Figure 3-3. Throughput for FPGA reads (a) and writes (b) on H101 system based on I/O benchmark
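The calibration procedure described in Section 3.1.2 can be sketched as a bounded grid search. The example below is a Python stand-in, not the authors' Matlab solver: it assumes a two-parameter transfer model, t = latency + size / bandwidth, takes the search bounds as explicit arguments rather than deriving them from the data, uses the mean absolute percentage error as the metric, and ignores the large-transfer penalty. The "measurements" are synthetic values generated for illustration.

```python
def model_time(size_bytes, latency_s, bandwidth_bps):
    """Assumed transfer model: fixed latency plus size over bandwidth."""
    return latency_s + size_bytes / bandwidth_bps


def mean_pct_error(samples, latency_s, bandwidth_bps):
    """Mean absolute percentage error of the model against measured times."""
    errs = [abs(model_time(s, latency_s, bandwidth_bps) - t) / t
            for s, t in samples]
    return 100.0 * sum(errs) / len(errs)


def solve(samples, lat_bounds, bw_bounds, steps=50):
    """Scan a grid inside the bounds; return (error, latency, bandwidth)."""
    (lat_lo, lat_hi), (bw_lo, bw_hi) = lat_bounds, bw_bounds
    best = None
    for i in range(steps + 1):
        lat = lat_lo + (lat_hi - lat_lo) * i / steps
        for j in range(steps + 1):
            bw = bw_lo + (bw_hi - bw_lo) * j / steps
            err = mean_pct_error(samples, lat, bw)
            if best is None or err < best[0]:
                best = (err, lat, bw)
    return best


# Synthetic measurements from a known model (20 us latency, 400 MB/s).
sizes = [1024 * 2**k for k in range(9)]            # 1 KB .. 256 KB
samples = [(s, model_time(s, 20e-6, 400e6)) for s in sizes]
err, lat, bw = solve(samples, (0.0, 100e-6), (100e6, 600e6))
```

The `steps` argument plays the role of the definable granularity, and swapping `mean_pct_error` for another function is how a different error metric would be selected in this sketch.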


Figs. 3-4A and 3-4B show the calibration data for host processor reads and writes, respectively, with the FPGA's on-board memory on an XD1000 node. The I/O model curves for the XD1000 did not include a penalty for large transfers or a communication context switch. For the final system, Figs. 3-5A and 3-5B show the calibration data for host processor reads and writes, respectively, with the FPGA's on-board memory on the SRC-6 HiBar system. The SRC system exhibits much higher data throughput while writing to the FPGA as opposed to reading, similar to the behavior observed with the H101 node. Furthermore, unlike the first two systems, a communication context switch is used by the parameter solver and simulation models, with the switching point occurring at 16 KB on the SRC-6. A communication context switch is where the communication middleware attempts to increase efficiency by changing between internal transfer mechanisms or protocols, often based on the size of data to be transferred. A communication context switch should not be confused with an FPGA context switch, which refers to a change in the configuration of an FPGA's reprogrammable fabric. When a communication context switch is applied, the performance of the transfer is modeled using two separate sets of latency and throughput parameters. The use of the appropriate set depends on which side of the switching point the size of the transfer data lies.

The modeled transfer times for the data sets across all three systems yielded a mean percentage error that ranged between 2.1% and 5.1% versus their experimental counterparts. These average values are encouraging, especially considering that they are mildly inflated in several cases by single erratic experimental data points that yield double-digit percent errors when compared to the calibrated model results. Additionally, it is important to note that the I/O behavior with the RC device varies significantly from system to system. In the three systems discussed here, one suffered performance penalties for large data transfers, another exhibited a communication context switch between small and large transfers, while two of the three systems allowed the host


processor to write to the FPGA at a significantly higher rate than for FPGA reads of the same size. All of these factors illustrate the importance of a flexible calibration and modeling approach that can handle such disparate behaviors. Finally, the same process is employed to calibrate modeled data transfers between individual H101 or XD1000

Figure 3-4. Throughput for FPGA reads (a) and writes (b) on XD1000 system based on I/O benchmark

Figure 3-5. Throughput for FPGA reads (a) and writes (b) on SRC-6 system based on I/O benchmark


It is worth noting that users will not always be able to calibrate their component models with experimental data from benchmarking experiments, especially when considering notional platforms. In these cases, it is strongly suggested to use calibration data that can be obtained from a system most similar to the systems being simulated as a starting point. For example, a user might want to predict the performance gains expected by using a wider HyperTransport bus in the next-generation XD1000 platform. An educated approximation of the hypothetical XD1000 system is obtained by scaling the appropriate parameters of the calibrated XD1000 model to reflect the wider bus. Delays for the middleware would be altered by the simulation models as well if the host processor or memory subsystem were updated too.

Table 3-2 summarizes the application validation results. Validation results were collected for each application/system combination where the corresponding FPGA core was available. For each of the HSI validation experiments, a 256×256 pixel input image with 1024 spectral bands was used. The simulative HSI predictions are validated against experimental executions of HSI on one and four H101 nodes, and using one CPU and one FPGA module on the SRC-6. For the MD validation experiment, a molecular system of 16,384 molecules is processed for five time steps on the XD1000 system. The maximum error between the experimental runtime and the predicted runtime across all of the validation experiments is 6.74%, where HSI is simulated on a single-node H101 system.
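The error figure quoted above follows from the usual relative-error formula. As a quick check, the snippet below recomputes the percentage errors from the experimental and predicted runtimes reported in Table 3-2. The formula |predicted - measured| / measured is assumed to be the one used, and the function and variable names are illustrative.

```python
def pct_error(measured_s: float, predicted_s: float) -> float:
    """Absolute prediction error as a percentage of the measured runtime."""
    return 100.0 * abs(predicted_s - measured_s) / measured_s


# (application, system, experimental runtime in s, simulative prediction in s)
runs = [
    ("HSI", "H101 (1 node)", 116.12, 123.95),
    ("HSI", "H101 (4 nodes)", 44.16, 43.02),
    ("HSI", "SRC-6", 86.49, 85.05),
    ("MD", "XD1000", 4.41, 4.39),
]

# Percentage error per run, rounded to two decimals as in the table.
errors = {(app, system): round(pct_error(m, p), 2) for app, system, m, p in runs}

# Run with the largest prediction error.
worst = max(errors, key=errors.get)
```

Under that assumed formula the recomputed values agree with the table's error column, and the single-node H101 run of HSI is the worst case.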


Table 3-2. Summary of validation results

Application  System          Exper. Runtime  Simulative Prediction  % Error  Simulation Time
HSI          H101 (1 Node)   116.12 s        123.95 s               6.74%    14.76 s
HSI          H101 (4 Nodes)  44.16 s         43.02 s                2.58%    107.32 s
HSI          SRC-6           86.49 s         85.05 s                1.66%    17.66 s
MD           XD1000          4.41 s          4.39 s                 0.45%    1.45 s

As previously mentioned, prediction accuracy is sacrificed in order to allow simulations to complete more quickly and without requiring a coded implementation. Validation errors under 10% produced by the RCSE are considered acceptable. Predictions within this bound provide designers with useful insight while performing strategic design-space exploration. As with many tools, the exact accuracy of the predictions will largely depend on the quality of the input provided by the user. Instances where characterization data comes from a system that is identical or very similar to the system being simulated will produce prediction results exhibiting error percentages similar to those observed in Table 3-2. As the simulated system is varied, uncertainty is introduced into the process and the predictions become less reliable. In these cases, calibration and validation serve as an important starting point in design-space exploration, so that simulations involving hypothetical scenarios have a reliable grounding in reality. Given that the results obtained in each of the validation experiments yielded acceptable error rates for this framework, we can proceed to perform simulative analyses of these systems.

The final column of Table 3-2 reports the time measured for each simulation to execute on a 3.2 GHz Xeon processor. The simulation of MD on the XD1000 system is the quickest simulation by a substantial margin, largely because there are a very limited number of computational iterations and data transfers between system components during the application. Conversely, the simulation of HSI on a 4-node H101 system required the longest simulation time. The increased simulation time results from a number of


Using the platforms and applications previously introduced and validated in Section 3.2, this section presents design-space exploration case studies intended to demonstrate the usefulness of RCSE. These case studies illustrate the analysis and insight that RCSE can quickly provide to developers under the situations discussed in the previous paragraph. First, several performance studies are conducted using HSI, analyzing changes to the application parameters and device characteristics on multiple systems. Next, a set of case studies is presented analyzing the scalability of MD on a distributed XD1000 system. Finally, the RCSE framework is used to estimate the power consumption of the FPGA core design for MD.


Figs. 3-6A, 3-6B, and 3-6C plot the speedup of the FPGA-based ACSM calculation on all three systems versus the number of spectral bands in the image, using one and two FPGA nodes. For all of the HSI case studies, speedup results are compared against the original software version of HSI running on a single 2.8 GHz Xeon processor, as found in the H101 system. On the H101 system (Fig. 3-6A), the performance increases dramatically as the number of spectral bands increases from 128 to 1024. This performance improvement is primarily due to the minimization of the communication penalty for small data transfers. On the XD1000 and SRC-6 systems, the predicted speedup is relatively constant for all numbers of spectral bands. Compared to the H101 system, the SRC-6 and XD1000 provide highly efficient high-speed communication between the host CPU and the FPGA; thus, very little efficiency is gained by increasing the size of each data transfer.

Another observation is that the XD1000 system only obtains approximately half the maximum speedup of the H101 and SRC-6 systems. This result is troubling considering the XD1000 contains the largest FPGA and the fastest host-FPGA interconnect of all three systems. After viewing these simulation results, a reexamination of the FPGA core design for the XD1000 was performed, which identified the single bank of SRAM on the XD1000 FPGA module as the performance bottleneck. In contrast, the FPGAs on the H101 and SRC-6 systems both contain multiple banks of SRAM, allowing multiple data values to be extracted from memory at the same time. Thus, the number of pixel vectors that can be processed in parallel is the same as the number of values that can be extracted from memory simultaneously. In this case, adding additional banks of memory to the FPGA could dramatically improve the

performance of this FPGA core, and the simulation framework can be used to quickly estimate the expected performance change. In Fig. 3-7, the performance of ACSM on the XD1000 is simulated as the number of SRAM banks on the FPGA is increased for an image with 1024 spectral bands. Approximately linear speedup is predicted as SRAM banks are added to the FPGA, since the FPGA on the XD1000 is large enough to hold the additional parallel kernels.

[Figure: Predicted Speedup of ACSM vs. Spectral Bands. Panels: B, SRC-6 System; C, XD1000 System]

The final HSI-based simulative study analyzes the performance of the entire HSI application for various system sizes with an image consisting of 1024 spectral

bands.

[Figure: Predicted Speedup of ACSM vs. SRAM Banks on XD1000 System]

[Figure: Predicted Speedup of HSI vs. System Size. Panel: B, XD1000 System]

Case studies such as this one provide useful information to system designers investigating systems to efficiently perform the target application. Figs. 3-8A and 3-8B show the speedup of the entire HSI application on the H101 and XD1000 systems respectively, ranging in size from one to eight nodes. For each system, the predicted speedup of the RC implementation is provided for two different input image sizes. Larger speedups are projected in every case when the image size is increased to 256×256 pixels. The larger image offers more opportunities to parallelize computations on

A and used for validation in Section 3.2.2. The application is distributed across multiple nodes by assigning each node to calculate partial accelerations for all molecules in the system, which are then returned to the root processing node and summed for each time step. Since the partial accelerations for each molecule are known once the molecule has finished passing through all of the computational pipelines within the FPGA, the value can be immediately written to on-board memory while the remaining molecules are passing through the pipelines. For this reason, we can assign a value of zero to the constant C in Eq. 3 used to characterize the execution time of the MD core. The second parallelization strategy (Core 2) computes the full accelerations for only a portion of the molecules at each node. The computational structure of Core 2 reduces the amount of data returned by each node at the end of the time step, since each node is only responsible for returning acceleration data for a fraction of the total molecules. Conversely, this design increases the execution time on the FPGA for each time step, since the total acceleration for each molecule is stored within the pipelines and is not

3 must be set equal to A for Core 2, as the time required to write every molecule back to SRAM is approximately equivalent to the time required to read each molecule from SRAM.

Fig. 3-9A shows the simulation results for the execution time of MD on a single XD1000 node versus the number of molecules in the application. The results in Fig. 3-9A illustrate that for any number of molecules, the relative speedup between the FPGA-accelerated versions and the software baseline is relatively constant. Furthermore, it is interesting to note the very close proximity in performance between Core 1 and Core 2. While Core 1 always executes slightly faster than Core 2, the difference is practically negligible for a single-node XD1000 system, which becomes important when considering distributed versions of this application in the following experiment.

[Figure: Predicted Performance of MD on XD1000 System. Panel: B, Speedup vs. System Size]

The next experiment analyzes the projected speedup of MD parallelized across a distributed XD1000 system. A data set with 131,072 molecules is used here since

Fig. 3-9B shows the speedup of the distributed MD application on system sizes ranging from one to 16 nodes. For this case study, speedup results are compared against the original software version of MD running on a single 2.2 GHz Opteron processor, as found in the XD1000 system. Due to the very high ratio of computation to communication, the performance of both the parallelized software and the FPGA-accelerated applications scales almost linearly with the number of nodes. As the size of the system grows to 8 nodes and beyond, Core 2 begins to outperform Core 1 by a noticeable margin. For larger system sizes, Core 2 becomes more efficient due to the reduced amount of data returned to the root node at the end of each time step. The difference becomes more pronounced as the system size grows, since a larger number of nodes are simultaneously communicating with the root node at the beginning and end of each time step. Core 2 is designed to alleviate that network contention, at the expense of decreased performance within the FPGA. Designers can use this data before the design (or coding) stage to decide which core design is ideal for their scenario, depending on the size and type of system.
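The communication side of the Core 1 versus Core 2 trade-off can be illustrated with a small counting sketch. It tallies how many acceleration records arrive at the root node in one time step under each distribution scheme described above. This is not part of RCSE: the function and parameter names are hypothetical, molecules are assumed to divide evenly among nodes, and latency and network contention are ignored.

```python
def records_returned_to_root(num_molecules: int, num_nodes: int, core: str) -> int:
    """Count acceleration records sent to the root node in one time step.

    Assumption for illustration: one record per molecule per message, and
    molecules divide evenly among the nodes for Core 2.
    """
    if num_molecules < 0 or num_nodes < 1:
        raise ValueError("need a non-negative molecule count and at least one node")
    if core == "core1":
        # Core 1: every node returns partial accelerations for all molecules.
        return num_nodes * num_molecules
    if core == "core2":
        # Core 2: each node returns full accelerations for its own share only.
        return num_nodes * (num_molecules // num_nodes)
    raise ValueError(f"unknown core design: {core}")


for nodes in (1, 8, 16):
    c1 = records_returned_to_root(131_072, nodes, "core1")
    c2 = records_returned_to_root(131_072, nodes, "core2")
    print(f"{nodes:2d} nodes: Core 1 = {c1:>9,d}  Core 2 = {c2:>9,d}")
```

Under these assumptions the Core 1 traffic grows linearly with the node count while the Core 2 traffic stays fixed, which is consistent with the widening gap reported for larger systems.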

27]: In Eq. 3, Vcore represents the voltage level of the FPGA core, fmax is the maximum operating frequency of the core, Ures represents the resource utilization of the core, Tlog is the average transistor toggle rate of the reconfigurable logic, and Kp is a device technology constant. Values for Vcore, Ures, and fmax are core-specific and obtained during core characterization. The toggle rate for Tlog is difficult to determine without a functional-level simulation, but values ranging between 0.15 and 0.2 are commonly used as estimates of this parameter. Finally, Kp is a device constant that is specifically defined in data sheets for Xilinx FPGAs, while Altera data sheets typically provide multiple constants relating to individual chip resources that are averaged in RCSE to obtain an estimate for Kp [67].

Table 3-3 summarizes the key parameters used by the power estimation equations, including the source of the value and where it is specified within the RCSE framework. To test the accuracy of our power model, the estimated FPGA power consumption is compared to the value generated by Altera's PowerPlay tool. PowerPlay is a tool within Altera's Quartus-II FPGA development environment used to obtain accurate power estimates of an FPGA design based on a functional simulation of the final core (similar to the XPower Analyzer for Xilinx FPGAs). Using these values with Eq. 3

an estimated power consumption of 10.25 W is generated, compared to an estimate of 13.30 W from PowerPlay, resulting in a 3.05 W difference and a 22.9% error. In this case study, the power estimation error is primarily caused by the derived value of Ures. The value of Ures is calculated to be the average utilization of slices, BRAM, and DSP resources by the core. Unfortunately, this average does not provide an accurate representation of the core for power estimation purposes under scenarios where the utilization of one resource is very high compared to the other two (e.g., the MD core has a very high memory utilization compared to slice and DSP utilization).

[Table: Summary of Key Power Model Parameters]

Despite these limitations, results from early power prediction can provide useful data to RC designers. Designers can obtain rough determinations regarding the satisfaction of system power constraints from using their FPGA design, or whether significant power savings will be achieved versus an alternative device technology. Furthermore, this framework can track the total energy consumption of the FPGA throughout the lifetime of the application. The discrete-event device models track the total time the FPGA core spends performing active computation compared to the entire length of the application. By applying the total FPGA power consumption estimate to active computation periods and the static power consumption estimate to the remaining time, a prediction of the total energy consumption for the FPGA can be calculated. Meanwhile, estimating the power consumption of other components such as microprocessors, memories [68], and others is well studied. Data sheet parameters or existing high-level models based on architectural parameters [69] can be used to obtain

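The error figure and the energy bookkeeping described in this section can be checked with a short sketch. The first function reproduces the reported 22.9% error from the two power estimates; the second applies the total power estimate to active periods and the static power estimate to the remaining time. The function names, the static power value, and the timing values in the usage lines are hypothetical illustrations, not numbers from the case study.

```python
def percent_error(estimate: float, reference: float) -> float:
    """Relative error of an estimate against a reference value, in percent."""
    return abs(estimate - reference) / reference * 100.0


def fpga_energy_joules(total_power_w: float, static_power_w: float,
                       active_time_s: float, total_time_s: float) -> float:
    """Energy = total power while computing + static power for the rest."""
    if not 0.0 <= active_time_s <= total_time_s:
        raise ValueError("active time must lie within the application runtime")
    idle_time_s = total_time_s - active_time_s
    return total_power_w * active_time_s + static_power_w * idle_time_s


# RCSE estimate (10.25 W) versus the PowerPlay estimate (13.30 W).
print(f"{percent_error(10.25, 13.30):.1f}% error")

# Hypothetical run: 30 s of active computation in a 100 s application,
# with an assumed static power of 2.0 W (illustrative value only).
print(f"{fpga_energy_joules(10.25, 2.0, 30.0, 100.0):.1f} J")
```

Keeping the active and remaining periods separate mirrors the way the discrete-event device models are described as tracking active computation time against total application length.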

In this chapter, a novel abstract modeling language tailored for creating estimation models representing RC systems, the Reconfigurable Computing Modeling Language (RCML), is introduced. The goal of RCML is to provide a modeling environment that allows users to efficiently model their designs during the formulation stage of RC development. The RCML framework enables users to separately model the algorithm and execution platform architecture under study, providing specific constructs for defining parallelism, communication patterns, and other common aspects in RC applications. A complete system model is then created by mapping an application and platform model together. This structure allows reuse of individual models for any number of system mappings.

One key advantage of RCML is that the abstract models can be fed to any number of analysis tools to produce prediction data, thus enabling rapid design-space exploration. Without a link to tools for prediction and analysis, the utility of abstract models is limited and users are more likely to jump straight to implementation without proper planning during formulation. Therefore, an automated input generator is developed as part of this phase of research to integrate the performance prediction capabilities of RCSE with RCML. Future tools can also be developed to perform code generation from RCML models, increasing the usefulness of RCML by reducing the amount of coding required in the design stage of development.

The remainder of this chapter is organized as follows. An overview of RCML is provided in Section 4.1, which includes a subsection for each class of models defined in RCML. Section 4.2 discusses the automated input generator developed to provide a link between RCML and RCSE. In Section 4.3, a pair of case studies is presented to illustrate the effectiveness of design-space exploration through RCML using the simulation input generator. Finally, conclusions and future work are summarized in Section 4.4.

[Figure: Abstraction Pyramid Comparing Levels of Modeling for Hardware Applications (adapted from [44])]

44] and illustrated in Fig. 4-1, summarizes the levels of abstraction designers can use to model their hardware systems, and the tradeoffs between each level. At the top of the abstraction pyramid, a back-of-the-envelope model is a set of simple mathematical relationships used to approximate basic performance metrics of the system. Estimation models are a more sophisticated and accurate alternative to back-of-the-envelope models, using more complex model inputs and calculations to predict system performance. While both of the previous two models are used to quickly predict and approximate key performance metrics, neither describes the full functional behavior or timing of a system. In contrast, an abstract executable model describes the functional behavior of the system, without information that describes the system behavior in terms of timing. A cycle-accurate model describes the functional behavior and timing of an architecture, as a multiple of clock cycles. Finally, a synthesizable model is one that contains enough behavior and timing details such that the model can be realized in silicon. HDLs such as Verilog and VHDL are common examples of synthesizable models.

RCML differs from most existing modeling environments because it is tailored to operate at the estimation-model level of abstraction. Existing modeling environments often used in system design such as Ptolemy and Simulink, and high-level language compilers such as Impulse-C, typically use an abstract executable model as the user's entry point for specifying a design and moving towards a final implementation. Such tools require users to define and code a functional implementation in order to drive a full simulation or other analysis. Instead, RCML is intended to allow users to quickly model their systems before coding a functional implementation (i.e., during formulation), using abstract constructs and attributes to define component behavior. Models built with RCML can then be used to bridge to functional specification environments such as Ptolemy and Impulse-C, complementing these tools in the overall development process. In other words, an environment such as RCML can be used to raise the initial level

An overview of the RCML methodology is illustrated in Fig. 4-2, which follows the Y-chart approach originally proposed for designing programmable embedded systems [44]. There are three different types of models within RCML's framework to facilitate efficient system design and exploration. First, algorithm models in RCML are platform-independent, RC-specialized task graphs describing the algorithm's task-level parallelism and communication. Second, architecture models are component diagrams that describe the makeup and capabilities of the execution platform. Finally, system models consist of an algorithm model mapped onto an architecture model, providing a complete representation of the system that can be used for analysis. By defining separate models for applications and platforms, these individual models can be developed independently and reused among any number of system mappings. RCML also supports hierarchical modeling, such that blocks in a model may be comprised of sub-blocks that extend down one or more levels in the hierarchy, facilitating modeling at multiple levels of detail.

Additional information is added to an RCML model by defining custom attributes for a model or individual element. Each attribute consists of a name and value specified by the user, and any number of attributes can be defined for an RCML model element. Attributes allow users to freely embed behavioral information throughout their model, which can be used by external tools for analysis. The remainder of this section describes each type of model in RCML. Examples of RCML models built using the previously described Eclipse-based editor are presented in Section 4.3.

[Figure: Overview of RCML Methodology]

4-1. Algorithm constructs in RCML can be divided into two groups, function blocks and data blocks. Data blocks are used to represent the major entities of data that will be produced, consumed, and processed by the algorithm. Each data block is made up of a specified number of data elements. A data element can be as simple as a single primitive type, such as an integer or float, or a composition of primitive types. Two types of data blocks are currently defined in RCML, data sets (a single, finite group of data) and data streams (an infinite, continuous data entity). The data blocks are designed to make it easy for users to explore how system performance changes as the scale or size of their problem

An algorithm in RCML is modeled as a collection of tasks, in which each task is represented by a function block. There are currently two basic types of function blocks defined in RCML, data-driven and control-driven. Execution in data-driven blocks is triggered by the receipt of data on specified input data connections. The amount of data needed to trigger a block's execution can be defined by the user, for cases where multiple pieces of data are needed before a task can begin. Control-driven blocks are executed when the relevant control signals entering that block have been triggered. A repetitive function block is a task defined to iteratively execute a specified number of times upon each triggering, such as a basic loop. The number of repetitions for a function block can either be a static number or linked to the size of a data block (e.g., a function block may be defined to iterate once for each data element in a data set). Attributes can be used to further define the computational requirements of the function block, e.g., processing latencies, hardware resource requirements, etc.

Although an RC designer could potentially represent an algorithm as a pure task graph, it is important in RC to be able to concisely express common types of parallelism. For this reason, RCML includes two specialized parallel constructs. The first parallel construct is the process line, a special function block used to express deep functional parallelism in an algorithm. A process line consists of a sequence of separate tasks that data must pass through, each of which may execute concurrently on different pieces of data (typically realized as a pipeline in hardware circuits). RCML process lines are customized by defining the number of stages in the line along with function blocks to characterize each stage. The second parallel construct RCML provides is the replicated function block, used to concisely express wide data parallelism in an

algorithm. Replicated function blocks represent tasks that can be physically replicated, each able to concurrently perform the same task on different data. Wide functional parallelism, where different tasks operate concurrently on the same (or different) data, is naturally expressed as parallel paths in the task graph. These parallel constructs facilitate design-space exploration by allowing users to easily change the amount of parallelism the system features, in order to efficiently study the effects of varying levels and types of parallelism on system performance.

RCML Algorithm Model Constructs
  Basic Function Block: Algorithm task triggered either by incoming data (data-driven) or control signals (control-driven)
  Repetitive Function Block: Algorithm task whose execution is repeated a specified number of times
  Process Line Function Block: Task sequence that can execute concurrently to support deep parallelism
  Replicated Function Block: A task that is physically replicated to support wide parallelism

The communication and dependencies between tasks are defined by connections. A connection can carry an arbitrary size of undefined data, elements of a data block, or a control signal, all defined using built-in parameters for each connection. Each connection also has a priority, which defines the ordering (if any) between multiple outgoing connections leaving the same function block. Additionally, a communication pattern must be specified for any connection involving replicated function blocks. Each pattern defined in RCML mimics a standard function in the Message Passing Interface (MPI) [65], even though the connection may or may not result in inter-node communication. Some of the supported patterns include broadcast and scatter for one-to-many connections, gather for many-to-one connections, and all-to-all for

In addition to the basic constructs for defining data and tasks in an algorithm, a pair of miscellaneous algorithm constructs is currently supported in RCML. The first and most important miscellaneous construct is the buffer. Buffers are used to model locations in the algorithm where data elements may be stored in a queue. Buffers are a critical component of many embedded and RC applications. Each buffer in the algorithm must be mapped to a memory instance in the system mapping stage. Buffers do not affect the functionality of the algorithm and thus are not necessary to build a representative algorithm model, but are used to provide finer implementation details used during analysis and code generation after system mapping. A conditional branch construct is also provided, which models a segment of the algorithm that will follow one of multiple outgoing paths depending upon the evaluation of the condition. The conditional branch can also provide an alternative way to model loops in addition to a repetitive function block.

An example of a simple RCML algorithm model for performing a matrix multiplication is presented in Fig. 4-3. There are three data sets listed for this algorithm along the bottom of the diagram, each one representing a matrix with 100×100 integers taking part in the multiplication. Even though no visible connections are depicted between the data sets and the rest of the algorithm model, the data sets are linked to various elements of the task graph, which will be discussed later in this paragraph. The task graph begins

with a task (labeled FetchInputs) which retrieves the two input matrices.

[Figure: Example RCML Algorithm Model of a Matrix Multiply]

The retrieval of the two matrices is assumed during the execution of the FetchInputs task since the output connections from FetchInputs are the first to transfer elements of those data sets. The following task (labeled CalcDotProducts) then performs the multiplications and additions to compute the elements of the resulting product matrix in a pipelined process. In this matrix-multiply algorithm, the Broadcast connection from FetchInputs to CalcDotProducts first sends the entire data set Matrix A to each instance of CalcDotProducts. The number and type of data elements transmitted across each connection are not shown, but are specified as attributes inside each connection. For example, selecting the Broadcast connection would show attributes that specify that all of Matrix A is being transmitted over that connection. After the broadcast is complete, the columns of Matrix B are scattered between each kernel via the Scatter connection between the same two blocks. The ordering of the two transfers between FetchInputs and CalcDotProducts is depicted by the numeric labels next to the beginning of each connection leaving the FetchInputs task (the priority labels are omitted when a connection is the only one leaving a function block and thus no ordering needs to be defined). The CalcDotProducts task will then perform its task once for every column of elements (i.e., 100 integers) of Matrix B that it receives, as specified along the top of the CalcDotProducts function block. Since each of the 100 columns of Matrix B may be processed

As previously stated, an RCML algorithm model is designed to represent the structure, dataflow, and parallelism in an RC application for performing early design-space exploration, and thus only an abstract description of the algorithm behavior is needed. For example, the FetchInputs function block in the matrix-multiply algorithm model does not specify how the algorithm splits up the Matrix B data set along columns to be divided among kernels in CalcDotProducts. Instead, the model only specifies that the FetchInputs task is splitting the Matrix B data set and distributing the data among instances of CalcDotProducts. Similarly, the MergeResults task does not specify how the final product Matrix C is assembled and stored, only that the task will not execute until all of the results have been gathered from the CalcDotProducts kernels before forming and sending out the Matrix C data set that represents the product of the multiplication. This is a key advantage of using an estimation modeling environment such as RCML, as more detailed behavior is often unnecessary to accurately estimate application performance and perform design-space exploration. Nevertheless, in some cases users will want to model parts of their algorithms with more detail, specifically with cores designed to be implemented on reconfigurable devices. Future work will investigate incorporating alternative modeling frameworks, such as the CMD framework proposed in [70], into the RCML environment to support estimation modeling of reconfigurable cores at lower levels of abstraction.
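To make the structure of the matrix-multiply model concrete, the following sketch records the same blocks, data sets, and prioritized connections in plain Python data classes. It is only an illustration of the information an algorithm model carries, not RCML's actual file format or API: the class names, field names, the replica count, and the use of a gather pattern on the final connection are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """A finite group of data elements (an RCML data set)."""
    name: str
    num_elements: int


@dataclass
class FunctionBlock:
    """A task; replicas > 1 stands in for a replicated function block."""
    name: str
    replicas: int = 1
    attributes: dict = field(default_factory=dict)


@dataclass
class Connection:
    """A dependency between tasks with an MPI-like pattern and a priority."""
    source: str
    target: str
    pattern: str
    data: str
    priority: int = 1


def outgoing_in_order(connections, block_name):
    """Return the connections leaving a block, lowest priority number first."""
    leaving = [c for c in connections if c.source == block_name]
    return sorted(leaving, key=lambda c: c.priority)


def build_matrix_multiply(replicas=4):
    """Build the example model; the replica count is a hypothetical parameter."""
    data_sets = [DataSet(n, 100 * 100) for n in ("MatrixA", "MatrixB", "MatrixC")]
    blocks = [
        FunctionBlock("FetchInputs"),
        # Iterates once per received column of Matrix B (100 integers).
        FunctionBlock("CalcDotProducts", replicas, {"iterate_per": "MatrixB column"}),
        FunctionBlock("MergeResults"),
    ]
    connections = [
        Connection("FetchInputs", "CalcDotProducts", "scatter", "MatrixB", priority=2),
        Connection("FetchInputs", "CalcDotProducts", "broadcast", "MatrixA", priority=1),
        # Assumed pattern: results are gathered before MergeResults executes.
        Connection("CalcDotProducts", "MergeResults", "gather", "MatrixC"),
    ]
    return data_sets, blocks, connections


data_sets, blocks, connections = build_matrix_multiply(replicas=4)
for c in outgoing_in_order(connections, "FetchInputs"):
    print(c.priority, c.pattern, c.data)
```

Note that nothing in the sketch says how Matrix B is split or how Matrix C is assembled; as in the text, only the structure, ordering, and amount of data are captured.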

RCML Architecture Model Constructs
  Reconfigurable Processor: A processing element that supports reconfiguration, such as an FPGA
  Memory Block: An instance of memory or storage in the system
  Cache Block: A specialized memory block for multi-level caches
  Interconnect Bus: A system or network bus for data transfers
  Interconnect Switch: A network switch for data transfers
  Middleware: A middleware package used by platform component(s)
  Source/Sink Device: Generic devices, such as a sensor (source) or display (sink)

4-2. Architecture blocks can also be grouped together to form a node using the node container, which can then be replicated in order to easily model scalable platforms.

Each of the generic architecture constructs can be further refined to represent a specific instance of that class by defining custom attributes. For example, the attributes for a network switch can be used to define the speed, width, protocol, and the type of switching backplane of the switch. With this approach, the network switch class can represent a particular instance of a HyperTransport, Ethernet, or other interconnect switch. Memory blocks could be refined to represent memory types from SRAM to external disks by defining attributes for their size, speeds, and protocol. Memory blocks may also be embedded within a processor or device block to represent an internal memory for the host block. Not only does this approach support a very flexible and robust modeling environment capable of modeling many types of architectures, but users can also efficiently perform design-space exploration by changing architectural attributes of interest. For example, studying the effect of using a future generation of FPGAs can be approximated by increasing the number of resources defined for a reconfigurable processor. Architectural scalability studies can often be easily performed by simply changing the number of nodes for a node container.

Fig. 4-4 shows an example of an RCML architecture model used to represent the Sigma cluster of FPGA-enhanced embedded compute nodes (see Appendix B).
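The two exploration moves described above, changing an attribute of a block and changing the replication count of a node container, can be sketched as follows. The class names, attribute names, and all numeric values are hypothetical placeholders chosen for this illustration; they are neither RCML's attribute vocabulary nor parameters of any real device.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ArchBlock:
    """A generic architecture construct refined by name/value attributes."""
    kind: str
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class NodeContainer:
    """A group of blocks forming one node, replicated `count` times."""
    blocks: list
    count: int = 1

    def instantiate(self):
        """Return one independent copy of the block list per node."""
        return [copy.deepcopy(self.blocks) for _ in range(self.count)]


def scale_attribute(block, attribute, factor):
    """Return a copy of a block with one numeric attribute scaled."""
    scaled = copy.deepcopy(block)
    scaled.attributes[attribute] = block.attributes[attribute] * factor
    return scaled


# Hypothetical baseline node: one FPGA and one SRAM block.
fpga = ArchBlock("reconfigurable_processor", "fpga0", {"logic_cells": 50_000})
sram = ArchBlock("memory", "sram0", {"size_bytes": 8 * 2**20, "protocol": "SRAM"})
cluster = NodeContainer([fpga, sram], count=4)

# Approximate a future FPGA generation by doubling its resources,
# then study scalability by raising the node count.
future_fpga = scale_attribute(fpga, "logic_cells", 2)
bigger_cluster = NodeContainer([future_fpga, sram], count=16)

print(len(cluster.instantiate()), len(bigger_cluster.instantiate()))
print(future_fpga.attributes["logic_cells"])
```

Because each variant is just a changed attribute or count, the algorithm model can stay untouched while the platform is varied, which is the reuse the separate-model structure is meant to provide.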

[Figure: RCML Architecture Model of Experimental Platform]

Each node contains a PowerPC microprocessor with multi-level cache attached to it, and an FPGA card containing a Xilinx Virtex-4 SX55 FPGA and SRAM memory block. The microprocessor and FPGA are connected by a PCI bus. The node elements all reside within a node container, and 16 instances of the node are specified to realize the full cluster. Each node is connected to a Gigabit Ethernet switch to model full connectivity between nodes. Finally, two middleware components are defined in this model. Other architecture components can link to middleware instances by declaring middleware associations in their attributes. Since it may be common for a middleware package to reside on a large number of component instances, no visual connection is depicted in the figure. Instead, the information is accessed by viewing the attributes of an individual component, which show all of the middleware instances that the component is associated with. In the example in Fig. 4-4, the MPI MW middleware component is linked to the PowerPC microprocessor in each node (via attributes not depicted in the figure) and is used to characterize additional delays introduced by the MPI communication layer and protocols used for inter-node transfers. The RC MW component is linked to the PowerPC and the Virtex-4 FPGA in each node, and is used to characterize additional communication delays introduced by the middleware associated with the FPGA board. Parameters defining the sizes and speeds of the various devices are embedded as attributes in the individual components, and are accessible through

4-5 (it should be noted that the RCML model depicted in Fig. 4-5 is an example of an algorithm model, but the interfaces for defining and editing attributes are generally the same within the tool when editing any type of RCML model).

The basic mapping in RCML requires that each function block in the algorithm model be mapped to a processing element in the architecture model. Each non-hierarchical function block can only be mapped to a single processing element, unless the function block is replicated, in which case each instance may be mapped individually. Each stage of a process line may also be independently mapped to different processing elements. In addition to the spatial mapping that is required for all tasks, tasks executing on an RC device must be mapped temporally as well, since dynamic reconfiguration can be used during runtime to change the tasks supported by the device. Thus, a reconfigurable core manager is also provided to allow users to explicitly define groups of tasks that will reside concurrently on an RC device at any point in time. The information from the core manager is used by analysis tools to determine the necessary reconfiguration events and their associated impact on system performance.

In many cases, users will want to define additional information regarding how their application is mapped onto the execution platform. For these cases, optional mapping procedures are supported to define additional mapping information. For example,

4.1.1 in the Eclipse-based RCML Editor is shown in Fig. 4-5. Additional examples of RCML models built using the Eclipse RCML tool are presented in Section 4.3.

3 has been developed. This tool provides an automated link between RCML and RCSE to enable efficient design-space exploration by automating the process of analyzing models built in RCML with RCSE. The tool presented here automatically generates application scripts for the RCSE framework from an RCML system model. Additionally, this tool also

creates parameter files for each component in the RCML architecture model that are used by the equivalent component within the simulation environment.

[Figure: Screenshot of RCML Model Editor in Eclipse]

In order to generate a complete application script, each function block must contain attributes to describe the computational demands of the task. Attribute sets that enable a RAT analysis for the system are typically sufficient for the simulation tool. Alternatively, users may define attributes for the number of clock cycles required by tasks mapped to RC devices, or the number of instructions and memory/cache accesses performed by tasks mapped to microprocessors.

A separate application script is needed for each participating processor in the system. Therefore, the tool creates a script for each processor with one or more function blocks mapped to it. Each function block mapped to a fixed processor generates a computation entry in the mapped processor's script, while each RC core defined in the system mapping creates a function call to the RC device preceded by a core configuration command. Linked function blocks mapped onto different processing elements will generate communication events in the scripts. A script command for

Alternatively, automatic generation of parameter files is straightforward. For each architecture block, the block's attributes are compared against a list of pre-defined attributes for that component class, and the values for matching attributes are recorded. The parameter files and application scripts represent all inputs needed to drive a simulative analysis of the system in RCSE.

A) is predicted through simulations in RCSE driven by inputs automatically generated from an RCML model. The RCML algorithm diagram for HSI, built using the Eclipse RCML editor, is shown in Fig. 4-6. A 512×512 input image with 1024 spectral bands and 8 target classifications is assumed, unless otherwise noted. The platform used in these case studies is the Sigma cluster, detailed in Appendix B. The corresponding RCML architecture diagram is shown in Fig. 4-4. The two RCML models presented here are mapped together to create the system model analyzed in each of the following case studies. For all simulative experiments, the FPGA is used to accelerate the ACSM stage of HSI only. The only changes made to the RCML model between simulative experiments are the system mapping based on the number of nodes used, and the data set size to reflect the size of the input image.

The first simulative experiments are used to validate the accuracy of the simulation inputs and models. Table 4-3 summarizes results from experiments comparing simulative projections of HSI performance against experimental results presented in [71]. Simulative results from three of the four validation experiments yield predictions that are within 4% of the experimental runtime. The remaining validation experiment

yields an error of 7.4%, still within a range that can provide useful insight to designers during early design-space exploration. These experiments demonstrate that accurate simulation inputs are generated by the tool. Furthermore, the inputs are generated much more efficiently and reliably through an automated tool; otherwise, users would be required to reproduce scripts manually for each simulative experiment. Thus, having an environment and tool to automate these tasks greatly increases the productivity of simulative analysis.

[Figure: RCML Algorithm Model of HSI Application]

Table 4-3. Simulative Validation Results (columns: Simulative Prediction, Experimental Runtime, Error)

The second simulative experiment analyzes the scalability of HSI on larger systems by scaling the platform in the RCML model up to 16 nodes to demonstrate analysis of hypothetical systems that are not physically available to the designer. Having validated the accuracy of the simulations for 2- and 4-node systems, we can use the RCML model and simulation tools to project the performance of HSI on larger systems. Fig. 4-7 shows the simulative projected performance of HSI as the system size is scaled from two to 16 nodes for a 512×512 image with 256 and 1024 spectral bands. When

scaling up to eight nodes with 1024 spectral bands, significant reductions in runtime are projected due to parallelism that can be exploited during the ACSM and TC stages. But Fig. 4-7 also shows that the runtime is minimally reduced when scaling beyond 12 nodes. The penalties for communication and the serialization of the weight computation stage limit the parallel efficiency of HSI on larger system sizes. Furthermore, smaller relative gains in performance are observed with 256 spectral bands, whose runs have a smaller computation-to-communication ratio, thus gains from parallel execution are offset by communication overhead.

[Figure: HSI Predicted Runtime vs. Number of Nodes]
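The flattening of the runtime curve can be reasoned about with a simple back-of-the-envelope model. The sketch below is not the RCSE simulation and does not reproduce the reported HSI results. It assumes an Amdahl-style split into a serialized stage and a perfectly divisible parallel stage, plus a communication term that grows linearly with node count, and all numeric values are hypothetical.

```python
def predicted_runtime(nodes: int, serial_s: float, parallel_s: float,
                      comm_per_node_s: float) -> float:
    """Runtime = serialized work + divided parallel work + communication.

    Assumptions for illustration: the parallel portion divides evenly
    across nodes and communication cost grows linearly with node count.
    """
    if nodes < 1:
        raise ValueError("at least one node is required")
    return serial_s + parallel_s / nodes + comm_per_node_s * nodes


# Hypothetical workload: 10 s serialized, 160 s parallelizable,
# 0.5 s of added communication per participating node.
for n in (2, 4, 8, 12, 16):
    print(n, round(predicted_runtime(n, 10.0, 160.0, 0.5), 2))
```

With these placeholder values the gain from 2 to 8 nodes is large, while the gain from 12 to 16 nodes is small, the same qualitative shape as the trend described above. Lowering the parallel term relative to the communication term, as with fewer spectral bands, shrinks the gains further.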

In this chapter, an algorithm designed to support automated scheduling and partitioning of algorithms on scalable RC systems is proposed and discussed. The emergence of RC in increasingly large-scale high-performance and embedded computing systems raises many challenges. One challenge that designers must confront is the problem of HW/SW partitioning and scheduling for complex parallel applications onto a large-scale heterogeneous system. While manually determining and specifying an optimal schedule may be a straightforward process for simple, embarrassingly parallel applications, it can be difficult and time-consuming for complex applications with many parallel tasks on large systems. The challenge is even greater when considering a reconfigurable heterogeneous system, such as an RC platform containing FPGAs and microprocessors, where each type of device is likely to provide a different level of support and performance for a given task. Additionally, users must also decide when to reconfigure the platform's RC devices to support a new set of tasks.

To address these challenges, automated HW/SW partitioning and scheduling techniques can be employed to suggest an optimal or near-optimal mapping (i.e., assignment of tasks to specific instances of a resource) of the parallel algorithm onto the execution platform in terms of performance (or other metric(s)). Algorithms based on methods such as Integer Linear Programming (ILP) can even guarantee an optimal schedule, but their computational complexity makes these algorithms undesirable for many situations, especially with large-scale systems. Furthermore, while previous approaches have addressed various scheduling and HW/SW partitioning problems (see Section 2.4), they have typically focused on either parallel SW-based systems or HW/SW partitioning for ASICs or small-scale RC architectures. As a result, existing research is lacking in the area of scheduling and partitioning for systems that are scalable, heterogeneous, and dynamically reconfigurable (Fig. 5-1).

[Figure: The Gap in Existing Automated Scheduling Research]

To fill the gap illustrated in Fig. 5-1 and address the challenges previously discussed, this phase of research focuses on creating a novel automated scheduling and partitioning algorithm designed for supporting large-scale RC systems. The algorithm presented in this chapter, which extends existing RC-based scheduling and partitioning techniques to efficiently support large-scale systems, is divided into two stages. In the first stage, a novel priority and allocation scheme to extend existing list-based scheduling (LBS) techniques is used to generate an initial schedule. In the second stage, an extension of the Kernighan-Lin heuristic [72] is used that iteratively analyzes a set of moves for each unfixed node in the task graph, implementing the best move after each iteration to move the schedule towards a better solution. The iterative process continues until all tasks have participated in a move or until no more moves produce an improved schedule. Two case studies are presented demonstrating that this algorithm performs well for large-scale RC applications and in a fraction of the time compared to our baseline scheduling algorithm, which employs simulated annealing.

The algorithm developed in this phase of research serves as an important link between the research outcomes in Phase 1 and Phase 2. The RCML framework designed in Phase 2 enables users to quickly and intuitively build abstract models of their system designs during formulation that can easily be altered as design parameters change. The RCSE framework developed in Phase 1 allows users to effectively predict the performance of their designs, allowing designers to evaluate


4.2, mapping and design-space exploration is still a manual iterative process. This phase of research serves to provide a tighter coupling between RCSE and RCML, allowing users to more quickly iterate between design (via RCML) and analysis (via RCSE). Thus, this phase of research addresses the third and final component of the comprehensive RC formulation framework introduced in Chapter 1 and depicted in Fig. 1-2.

The remainder of this chapter is organized as follows. Basic problem definitions and assumptions are summarized in Section 5.1. In Section 5.2, an overview of the algorithm is presented, along with an illustrative example. Integration of the automated scheduling algorithm with RCML is shown in Section 5.3. Case studies testing and demonstrating the effectiveness of the scheduling algorithm are presented in Section 5.4. Finally, the conclusions and future work pertaining to this phase of research are discussed in Section 5.5.

The platform architecture is assumed to consist of one or more uniform, fully connected nodes. Each node contains one or more CPUs and one or more FPGAs, as defined by the user's architecture model. Parameters are assigned to each FPGA to characterize the logic and memory capacities of each device, as well as the


The application is assumed to be specified as a directed acyclic task graph $G(V, E)$, where the vertices $V$ represent tasks and the edges $E$ represent data dependencies between tasks. For each task $V_i$, $T_P(V_i)$ defines the execution time of task $V_i$ on processor type $P$ (i.e., CPU or FPGA). Such values can typically be ascertained from profiling tools, synthesis reports, or predictive modeling techniques, and are commonly accepted variables in scheduling algorithms. $U_{P,log}(V_i)$ and $U_{P,mem}(V_i)$ define the baseline logic and memory utilization, respectively, of task $V_i$ for processor type $P$. Currently, the utilizations are only used to determine if resource constraints are violated; potential performance degradations from memory or logic contention within a device are not considered. A utilization of 1 is assumed for all tasks with regard to software-based CPUs (i.e., only one task may execute at a time on any CPU, and each core of a multi-core CPU is treated as a separate CPU). An instance of $T_P(V_i) = null$ signifies that execution of task $V_i$ is not supported on processor type $P$. Each task also contains a parameter defining the degree of loop unrolling (or parallelization) that can be exploited for concurrent execution on one or more devices. An unrolled task may span multiple devices and multiple nodes, as such tasks often comprise the bulk of an


5.1. Following pre-processing, the scheduling and partitioning algorithm is divided into two stages. In the first stage, novel list-based scheduling heuristics are used to quickly generate an initial schedule for the system. In the second stage, an extension of the Kernighan-Lin heuristic is employed that takes the initial schedule and iteratively analyzes a set of moves for each unfixed node in the task graph, implementing the best move after each iteration. Pseudocode for the general algorithm is presented in Fig. 5-2. The following subsections detail the two stages of the algorithm followed by a basic example which illustrates how the algorithm operates.
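
The control flow of the second stage can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the implementation used in this research: the cost model is a toy one (independent tasks, schedule length equal to the load of the busiest processor), only one kind of move is considered (reassigning a task to another processor), and every name and number in it (`schedule_length`, `iterative_dse`, the example times) is hypothetical. It keeps only the loop structure described above: evaluate every move of every unfixed task, apply the best one, mark that task as fixed, and stop when no move improves the schedule.

```python
from typing import Dict, List, Optional

# times[task][proc] = execution time of the task on that processor (hypothetical data)
Times = Dict[str, Dict[str, float]]


def schedule_length(mapping: Dict[str, str], times: Times) -> float:
    """Toy cost model: tasks are independent, so the schedule length is the
    total load of the busiest processor."""
    load: Dict[str, float] = {}
    for task, proc in mapping.items():
        load[proc] = load.get(proc, 0.0) + times[task][proc]
    return max(load.values())


def iterative_dse(mapping: Dict[str, str], times: Times) -> Dict[str, str]:
    """Repeatedly apply the single best move among all unfixed tasks."""
    mapping = dict(mapping)
    unfixed: List[str] = list(mapping)
    while unfixed:
        current = schedule_length(mapping, times)
        best_gain = 0.0
        best_task: Optional[str] = None
        best_proc: Optional[str] = None
        for task in unfixed:
            for proc in times[task]:
                if proc == mapping[task]:
                    continue
                trial = dict(mapping)
                trial[task] = proc
                gain = current - schedule_length(trial, times)
                if gain > best_gain:
                    best_gain, best_task, best_proc = gain, task, proc
        if best_task is None:       # no move improves the schedule: stop
            break
        mapping[best_task] = best_proc
        unfixed.remove(best_task)   # the moved task becomes fixed
    return mapping


times = {
    "A": {"cpu0": 8.0, "fpga0": 2.0},
    "B": {"cpu0": 3.0, "fpga0": 4.0},
    "C": {"cpu0": 5.0, "fpga0": 1.0},
}
initial = {"A": "cpu0", "B": "cpu0", "C": "cpu0"}  # stand-in for an LBS result
final = iterative_dse(initial, times)
print(final, schedule_length(final, times))
```

In this toy run the schedule length drops from 16 to 3 in two moves (A, then C, to `fpga0`); the third iteration finds no improving move for B and the loop ends. A real scheduler would replace `schedule_length` with a full rescheduling of the task graph, including dependencies, communication, and reconfiguration.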


Algorithm
  // Preprocessing
    Allocate Resources and Map V_hp to Platform
    Update ListUnscheduledTasks
  end while
  // Iterative DSE Cycle Stage
    for each Task V_k in ListUnfixedTasks do
    if ScheduleImprovementFromBestMove > 0 then
      Set Task V_BestMove as Fixed
  end while
  return FinalSchedule

Figure 5-2. Pseudocode for Automated Scheduling Algorithm

this process, and the remainder of this subsection defines the priority metrics used during the LBS stage and the procedure used to map tasks to the architecture once they have been selected for scheduling.

Table 5-1. Summary of Key LBS-stage Algorithm Metrics
Symbol   Name   Description

Before the LBS process begins, a few metrics are calculated for each task, which are summarized in Table 5-1. These metrics are unique to this algorithm and represent extensions to previous LBS techniques. The first is called the Device Affinity Weight (DAW), which represents how strongly a task favors execution on a particular type of processor over any others. For a task $V_i$ and processor type $P$, the DAW is calculated with the equation

where $T'_P(V_i)$ represents the normalized processing time for task $V_i$ on processor type $P$, which takes into account resource utilization by $V_i$ in addition to the total processing time. Assuming $N_P$ is the number of processors of type $P$ on a single node of the platform, and $U_{avg}$ is the average of the memory and logic utilization of the task on processor type $P$, $T'_P(V_i)$ is calculated as follows.

Currently, the LBS stage assumes only one task can occupy an FPGA at any time but allows a single task to be unrolled on one or more devices. The iterative DSE stage later considers potential groupings of separate tasks into hardware cores.

Each task will have a DAW value for each type of processor in the architecture. The DAW metric, whose value ranges from 0 to 1, will give more weight to tasks that strongly favor one type of processor over others, or can only execute on one type of
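
The equations for the DAW and for $T'_P(V_i)$ are not reproduced in the text above. The Python sketch below uses one form that is consistent with the surrounding definitions and with the DAW values reported for the walkthrough example in Table 5-3; it is an assumption, not necessarily the original formulation. Specifically, it assumes $T'_P(V_i) = T_P(V_i) \cdot U_{avg} / N_P$ and $DAW_P(V_i) = 1 - T'_P(V_i) / \sum_Q T'_Q(V_i)$, with a task that supports only one processor type given a DAW of 1 for that type and 0 otherwise. The device counts (four FPGAs and two CPUs per node) are inferred from the example, and the function names are hypothetical.

```python
from typing import Dict, Optional

# Assumed device counts per node, inferred from the walkthrough example.
N_P: Dict[str, int] = {"FPGA": 4, "CPU": 2}


def normalized_time(t: float, u_avg: float, n_p: int) -> float:
    """Assumed form of T'_P(V_i): execution time scaled by the average
    utilization and divided by the number of processors of that type."""
    return t * u_avg / n_p


def daw(times: Dict[str, Optional[float]],
        u_avg: Dict[str, float]) -> Dict[str, float]:
    """Assumed form of DAW_P(V_i) = 1 - T'_P / sum of T' over supported types.
    A time of None means the task is not supported on that processor type."""
    t_norm = {p: normalized_time(t, u_avg[p], N_P[p])
              for p, t in times.items() if t is not None}
    if len(t_norm) == 1:  # task can only execute on one processor type
        only = next(iter(t_norm))
        return {p: (1.0 if p == only else 0.0) for p in times}
    total = sum(t_norm.values())
    return {p: (1.0 - t_norm[p] / total if p in t_norm else 0.0)
            for p in times}


# Execution times and utilizations taken from Table 5-2 (CPU utilization is 1).
tasks = {
    "Task 2": ({"FPGA": 16.0, "CPU": 200.0}, {"FPGA": 0.50, "CPU": 1.0}),
    "Task 3": ({"FPGA": 128.0, "CPU": 128.0}, {"FPGA": 0.25, "CPU": 1.0}),
    "Task 4": ({"FPGA": 4.0, "CPU": 3.0}, {"FPGA": 1.00, "CPU": 1.0}),
    "Task 5": ({"FPGA": 32.0, "CPU": 32.0}, {"FPGA": 0.50, "CPU": 1.0}),
}
daw_values = {name: daw(t, u) for name, (t, u) in tasks.items()}
for name, values in daw_values.items():
    print(name, {p: round(v, 2) for p, v in values.items()})
```

Under these assumptions the sketch yields FPGA/CPU weights of 0.98/0.02, 0.89/0.11, 0.60/0.40, and 0.80/0.20 for Tasks 2 through 5, matching the DAW columns of Table 5-3.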


As can be seen from Eq. 5, a higher priority will be given primarily to tasks that are more computationally intensive, while also favoring tasks with a strong processor affinity from their DAW set. This allows the algorithm to preferentially schedule tasks that are more demanding in terms of computation and resource restrictions, allowing the less demanding tasks to use leftover resources in the architecture. This behavior is important since efficient scheduling on large-scale systems will often require the scheduler to effectively make use of all resources in the system.

Once a task has been selected for scheduling based on its LBS priority, processing resources must be allocated for the task. This involves determining the number, type, and location of processor(s) used to execute the task. While DAW is used to determine when a task is scheduled via the task priority, it is not used to


5.2.3. Thus, for a given processor type $P$ and task $V_i$,

where $n$ represents the set of unscheduled tasks that reside at the same primary t-level of task $V_i$. The CW serves as a bound that defines the fraction of available processors of type $P$ that may be allocated to the task. A processor (which could be a CPU or FPGA) is considered available if there is no other task executing on it at the earliest possible start time for the task. CW values are dynamically maintained: each time a task is mapped and scheduled, CW values are recalculated for all unscheduled tasks.

When a task is selected to be scheduled, it will be mapped onto the platform to minimize the finishing time of the entire task, bounded by the number of resources it is allowed. The expected completion time of the task is the sum of the required computation, incoming communication, and hardware reconfiguration times. When two separate tasks are mapped to the same FPGA, a reconfiguration of the FPGA is taken into account, which may delay the time at which the latter task may receive data and execute on the FPGA. When calculating potential communication delays, the algorithm takes into account whether data must be transmitted within a node and/or between nodes, and applies the appropriate communication rates defined in Section 5.1.
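
The completion-time estimate described above is a simple sum, which the following Python sketch makes concrete. The function name, its parameters, and the example rates are hypothetical and chosen only for illustration; the actual communication rates come from the platform parameters of Section 5.1. For simplicity the sketch treats a transfer as either entirely within a node or entirely between nodes.

```python
def expected_completion(start: float,
                        compute_time: float,
                        bytes_in: float,
                        same_node: bool,
                        intra_node_rate: float,
                        inter_node_rate: float,
                        reconfig_time: float = 0.0) -> float:
    """Earliest start time plus incoming communication, hardware
    reconfiguration (zero unless the FPGA must be reconfigured), and
    computation. All times in seconds, rates in bytes per second."""
    rate = intra_node_rate if same_node else inter_node_rate
    comm_time = bytes_in / rate
    return start + comm_time + reconfig_time + compute_time


# Hypothetical numbers: 2 GB of input, 1 GB/s within a node, 0.5 GB/s between nodes.
local = expected_completion(0.0, 10.0, 2e9, True, 1e9, 0.5e9, reconfig_time=0.5)
remote = expected_completion(0.0, 10.0, 2e9, False, 1e9, 0.5e9, reconfig_time=0.5)
print(local, remote)
```

With these illustrative numbers, mapping the task on the same node as its producer finishes at 12.5 s, while a mapping on a different node finishes at 14.5 s, so the local mapping would be preferred if both were within the task's processor bound.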


At the beginning of this stage of the algorithm, all tasks are marked as unfixed, followed by the iterative cycle. During each iteration, the algorithm analyzes all potential moves for each unfixed task. Moves for each task include changing the number, type, and/or location of processors allocated for the task. These moves are similar to many


[63]. Only valid moves are considered for each task; thus, e.g., the algorithm will not change the number of devices allocated to tasks that do not support loop unrolling. In addition, our algorithm extends previous work by considering a new move for creating hardware cores. In this move, combinations of hardware tasks are grouped into cores, such that reconfiguration and communication is unnecessary between tasks in the same core on the same device. For each move, the expected runtime of the whole application is recalculated using the scheduler described in Section 5.2.1. At the end of each iteration, the move that provides the largest improvement to the overall schedule is selected, and the task associated with the move is changed from unfixed to fixed. For the case of a hardware core move, the associated tasks are fixed into the core, but the core itself is initialized as unfixed and may be re-allocated (or grouped to form another core) in successive iterations. The iterative process repeats until all tasks and cores are fixed or until there is an iteration where no moves improve the schedule.

5.4). All devices communicate with each other over a 1 GB/s bus. Fig. 5-3 shows the task graph that will be used in this example, which was built using the RCML environment. The stacked-box icon for Tasks 1, 2, 3, and 5 represents a task that supports loop unrolling. The number directly above the stacked boxes represents the degree of loop unrolling that the task supports (e.g., Task 1 may be unrolled and parallelized by a factor of 32). Parameters for the total serial runtime and baseline FPGA hardware utilization are attached to each task and summarized in Table 5-2.

Figure 5-3. Task Graph for Walkthrough Example

Table 5-2. Task Parameters for Walkthrough Example
Task     FPGA time   CPU time   Logic   Memory
Task 1                          NA      NA
Task 2   16 s        200 s      0.50    0.50
Task 3   128 s       128 s      0.25    0.25
Task 4   4 s         3 s        1.00    1.00
Task 5   32 s        32 s       0.50    0.50

Table 5-3. Calculated Task Metrics for Walkthrough Example
Task     DAW_FPGA   DAW_CPU   Pri    CW_FPGA   CW_CPU
Task 1                        5.0    NA        0.31     0   2
Task 2   0.98       0.02      3.9    0.22      0.12     1   0
Task 3   0.89       0.11      30.7   0.78      0.57     3   0
Task 4   0.60       0.40      2.4    0.07      0.08     0   1
Task 5   0.80       0.20      3.2    1.00      1.00     4   0

The algorithm begins with preprocessing, which first calculates the DAW values for each task, which are shown in the second and third columns of Table 5-3. Since the parameters for Task 1 signify that it can only execute on a CPU, $DAW_{CPU}(V_1)$ is given a value of 1 and $DAW_{FPGA}(V_1)$ a value of 0. Meanwhile, Task 4 has an FPGA execution time of 4 s and a CPU execution time of 3 s (this could represent a control-intensive task with little parallelism that would see no speedup when mapped to hardware). Based on Eq. 5, $DAW_{FPGA}(V_4)$ is calculated as $1 - \frac{1}{1.5 + 1} = 0.6$. Similarly, $DAW_{CPU}(V_4)$ is calculated as $1 - \frac{1.5}{1.5 + 1} = 0.4$. The fourth column of Table 5-3, labeled Pri, lists the resulting priority values for each task, which are calculated using Eq. 5.

As can be seen from Fig. 5-3, Tasks 1-3 have a primary t-level of one since they are directly below the root, while Task 4 has a t-level of two. Task 3, which has a primary t-level of one, also resides on the second t-level of the graph since its only child (Task 5) has a t-level of three. Therefore, the CW values for Tasks 1-3 will all take each other into account when determining their own CW. Meanwhile, the CW for Task 4 will factor the load from Task 3 (since Task 3 resides on t-level two), but Task 3 will not factor the load from Task 4 in its own CW (since Task 4 does not reside on t-level one). The initial CW values for each task are shown in the final two columns of Table 5-3.

When the LBS phase begins, Tasks 1-3 are initially considered since they may all execute after the application has started (represented by the Start task). Task 3 is scheduled first, having the highest priority among the three first-level tasks. The CW values limit Task 3 to use up to three of the FPGAs in the architecture and one CPU. Given these constraints, using three FPGAs is the fastest way for Task 3 to execute on the platform (giving Task 3 a completion time of 11 seconds), thus it is allocated three FPGAs and scheduled accordingly. This causes the CW values for Tasks 1 and 2 to be readjusted without consideration of Task 3. This means that $CW_{FPGA}(V_2)$ becomes 1 since no other unscheduled tasks may execute concurrently with it on an FPGA(s). The $CW_{CPU}$ weights for Tasks 1 and 2 are readjusted to 0.72 and 0.28 respectively. Task 1
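
The readjusted CW values quoted above are consistent with simply renormalizing the remaining weights once Task 3 leaves the set of unscheduled tasks. The Python sketch below checks that arithmetic using the initial CW values from Table 5-3. Three further assumptions are made purely for illustration and are not stated explicitly in the text: that the processor bound is the CW fraction of the device count rounded down (with a minimum of one), that a node has four FPGAs and two CPUs, and that an unrolled task's time on $k$ FPGAs scales as $T \cdot U_{avg} / k$. The function names are hypothetical.

```python
import math
from typing import Dict


def renormalize(cw: Dict[str, float]) -> Dict[str, float]:
    """Rescale the weights of the remaining unscheduled tasks to sum to 1."""
    total = sum(cw.values())
    return {task: w / total for task, w in cw.items()}


def processor_bound(cw: float, n_devices: int) -> int:
    """Assumed rule: a task may use at most floor(CW * N) devices, at least 1."""
    return max(1, math.floor(cw * n_devices))


# Initial CPU weights from Table 5-3.
cw_cpu_initial = {"Task 1": 0.31, "Task 2": 0.12, "Task 3": 0.57}

# Task 3 is scheduled first, so it is dropped and the rest are renormalized.
remaining = {t: w for t, w in cw_cpu_initial.items() if t != "Task 3"}
cw_cpu_after = renormalize(remaining)

# Bounds for Task 3 with CW_FPGA = 0.78 and CW_CPU = 0.57 (4 FPGAs, 2 CPUs assumed).
fpga_bound_task3 = processor_bound(0.78, 4)
cpu_bound_task3 = processor_bound(0.57, 2)

# Task 3 on three FPGAs: 128 s serial time at 0.25 average utilization.
time_task3 = 128.0 * 0.25 / fpga_bound_task3

print({t: round(w, 2) for t, w in cw_cpu_after.items()})
print(fpga_bound_task3, cpu_bound_task3, round(time_task3, 1))
```

Under these assumptions the sketch gives CPU weights of 0.72 and 0.28 for Tasks 1 and 2, a bound of three FPGAs and one CPU for Task 3, and an execution time of roughly 10.7 s on three FPGAs, in line with the 11 seconds quoted above.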


At this point of the algorithm, the initial schedule from the LBS stage has been generated, and the iterative DSE process begins. Since no moves can improve the schedule generated by the list-based scheduler, no moves would be implemented during the first iteration and the algorithm would complete.

4. A baseline scheduling algorithm, which employs simulated annealing, has also been developed and integrated into the RCML environment in order to run comparative experiments, which are presented in Section 5.4. The automated algorithms can be accessed and performed for any RCML system model by simply right-clicking the model and selecting the appropriate option from the menu (Fig. 5-4). Both algorithms operate independent of any existing mapping information. When the algorithm completes, a report summarizing the resulting schedule generated by the algorithm is presented to the user, who then has the option to either automatically apply the mapping information to the RCML model or discard the generated mapping so that the RCML model maintains its previous state.


Figure 5-4. Screenshot of ADSE Integration in RCML Environment

[73], which is detailed in Appendix B. The Novo-G platform model consists of 24 nodes, each containing one Xeon E5520 CPU and four Altera Stratix-III E260 FPGAs on a single board. A PCIe 8x connection connects the CPU and FPGA board, while the FPGA board supports 25.6 Gb/s communication directly between FPGAs. The nodes are all connected to a 20 Gb/s DDR InfiniBand switch. For each case study, the expected execution time of the application (hereafter referred to as the schedule length) generated by the initial LBS stage (LBS Only) and the full algorithm (LBS+DSE) are compared to the schedule generated by a baseline simulated annealing (SA) scheduling algorithm developed for this environment. Furthermore, the processing time for each algorithm is also reported to analyze the processing requirements of each algorithm. Each of the scheduling algorithms is executed and timed on a desktop computer with a 1.86 GHz Intel Core 2 6300 CPU.

Figure 5-5. HSI Task Graph

The first case study uses a simplified model of HSI, whose task graph for this case study is illustrated in Fig. 5-5. As previously discussed, the stacked icons for the ACSM Calc and TC tasks signify that loop unrolling can be employed to the degree specified by the number just above the boxes. All tasks in HSI may execute on CPUs, while the ACSM Calc and TC tasks may also be mapped to FPGAs.

The second case study is a mean-value analysis (MVA) structured task graph. MVA is a common technique used in analysis of queueing networks, and its task graph structure is often used as a benchmark for scheduling algorithms. To additionally stress the scheduling algorithms, a typical MVA task graph is augmented with sizable loop unrolling amounts assigned to several of the tasks. Randomized FPGA and CPU execution times are assigned to each task as well. The task graph for the MVA case study is shown in Fig. 5-6.

Figure 5-6. MVA Task Graph

Table 5-4. Scheduling Algorithm Case Study Results
Scheduling Algorithm   Schedule Length   Processing Time   Schedule Length   Processing Time
LBS Only               15.1 s            0.3 s             41.0 s            0.8 s

Table 5-4 summarizes the results from each of the scheduling case studies. The schedule length columns represent the expected completion time of the application based on the generated schedule. For the HSI case study, the results show that all of the scheduling algorithms produced the same schedule length of 15.1 s. Since the HSI task graph consisted of a single path of tasks, it was relatively easy for both the LBS algorithm and the simulated annealing algorithm to generate what turned out to be an optimal schedule. Since the LBS stage of our algorithm produced an optimal schedule, no moves were implemented during the iterative DSE stage and the schedule remained unchanged at the end of the algorithm. While it is not remarkable that each algorithm produced an optimal schedule for this task graph, the processing time spent by our algorithm is considerably less than the processing time of the simulated annealing algorithm. Since HSI is likely to be representative of the structure of many RC applications, the results of this case study are encouraging. Additionally, the algorithm could provide reliability and significant time savings to users who would otherwise need to manually perform scheduling and analysis as part of the design process.


Unlike the HSI case study, six moves are implemented during the DSE stage of our algorithm for the MVA case study. The additional moves and iterative cycles during the DSE stage are the primary reasons for the larger discrepancy between the LBS Only and LBS+DSE processing times in the MVA case study versus the HSI case study. Since the processing time of the algorithm is largely proportional to the number of iterative cycles performed during the DSE stage, it is important that a reasonably good schedule is generated during the initial LBS stage in order to reduce the number of moves that the DSE stage needs to make.



As previously discussed, RC is becoming an important option for realizing highly efficient and flexible systems for high-performance and/or embedded computing. Despite the advantages it has to offer, RC has yet to experience widespread adoption in many computing fields in part because of the prohibitive time and difficulty associated with developing applications for RC systems. Many factors contribute in limiting development productivity on RC systems, making it difficult to exploit the potential gains in performance and power savings that RC can provide. To improve RC productivity, better concepts and tools are needed which allow designers to plan and analyze their designs before coding a specific (and possibly fruitless) implementation, a process we call formulation. With better tools and techniques for formulation, RC productivity can be significantly increased by minimizing the time and resources developers spend iterating among the design, translation, and execution stages of development. For these reasons, the research in this document examines constructing a comprehensive framework for performing formulation with RC systems. The formulation process proposed here, initially illustrated in Fig. 1-2, features three primary components: performance prediction and analysis, abstract modeling, and design-space exploration. Each of the three components in the formulation process is addressed by one of the phases of this research. Combined, these three phases represent a novel set of concepts and tools that can be used to productively perform RC formulation, potentially leading to substantial improvements in overall productivity and thus helping RC gain acceptance among a broader group of programmers and designers.

In Phase 1, a framework for rapid simulative performance prediction of RC systems called RCSE was presented. RCSE represents the first trace-based simulation framework designed for predicting the performance and power consumption of RC systems one or more orders of magnitude more quickly than traditional cycle-accurate


In Phase 2, a unique abstract modeling language for representing RC systems was introduced, called RCML. RCML is an estimation modeling environment tailored for RC, enabling users to quickly build meaningful models of their RC systems that can be analyzed and used as a bridge to the design stage of development. The RCML framework includes three types of models: algorithm models which represent applications as RC-specialized task graphs, architecture models which represent execution platforms as a collection of components, and system models which map together an algorithm and architecture model. Constructs in RCML are tailored to allow RC designers to quickly and concisely define their parallel algorithm and execution platform at varying levels of abstraction. Furthermore, an automated input generator was presented that automatically creates scripts and parameter files to stimulate a simulative analysis of the RCML model with RCSE. A case study involving automated


In Phase 3, the first known algorithm designed to perform HW/SW partitioning, scheduling, and DSE for large-scale RC systems was presented. The algorithm uses a two-stage process that performs initial scheduling and partitioning in the first stage, followed by DSE in the second stage. A novel priority and allocation scheme that extends existing list-based scheduling techniques is used in the first stage to obtain an initial schedule for the system. In the second stage, an extension of the iterative Kernighan-Lin heuristic is used to perform DSE and improve upon the initial schedule generated in the first stage of the algorithm, featuring new moves for analyzing task groupings into reconfigurable cores. After an illustrative example, a pair of case studies was presented showcasing the performance of the scheduling algorithm compared to a baseline simulated annealing (SA) scheduling algorithm. Case study results showed that the algorithm performed nearly as well as the SA baseline in terms of the quality of the generated schedules, while requiring only a fraction of the time to execute. In addition to speeding up the typical mapping process, the proposed algorithm allows users to more efficiently iterate between the abstract model of their strategic designs and analysis tools by automatically suggesting and specifying a system mapping that is best suited for more detailed analysis (via tools such as RCSE) and implementation.

The contributions of the research presented in this document include the design, creation, and testing of the novel RCSE and RCML frameworks and tools, as well as an algorithm for efficient automated mapping and DSE of scalable RC applications integrated into the RCML environment. Case studies show not only the effectiveness of each individual component of my research, but also how these tools and models integrate with each other to form a comprehensive formulation framework that can significantly improve overall productivity of RC development. In addition, the RCML


[70]. Additional ongoing research includes extending the functionality of RCML to efficiently bridge between the formulation and design stages of development, via tools to perform code generation and linking to design environments such as the System-Level Coordination Framework (SCF) for heterogeneous application design [74].


In this appendix, the applications used in case studies throughout this research are described in detail.

[75]. Large-scale molecular dynamics simulators such as AMBER [76] and NAMD [77] use these same classical physics principles but can calculate not only the Lennard-Jones potential but also the nonbonded electrostatic energies and the forces of covalent bonds, their angles, and torsions, making them applicable to not only inert atoms but also complex molecules such as proteins.

The RC implementation of the parallel algorithm used in this research was adapted from code provided by Oak Ridge National Laboratory (ORNL) [78]. For each time step, the set of molecules in the MD system is sent to the FPGA in order to determine the interactions between every pair of molecules. Each molecule is then passed through a set of parallel computational pipelines, where each computational pipeline determines a partial net acceleration for that molecule. As each molecule exits the computational pipelines, the partial accelerations are accumulated and immediately written to memory while the remaining molecules are still being processed. After all molecules have been


[79]. A hyperspectral image is a collection of 2-D images, all of the same scene but each containing a small unique portion of the overall spectrum picked up by the sensors. The HSI algorithm used as part of this research is the same as the implementation described in [71], which can be divided into three stages: calculation of the auto-correlation sample matrix (ACSM), weight computation, and target classification. The FPGA is employed to accelerate two different stages of the benchmark, ACSM calculation and target detection, which requires the FPGA to be completely reconfigured between stages. For calculation of the ACSM, the FPGA core receives a vector for each pixel in the HSI image with the number of elements equal to the number of spectral bands in the image data. The outer product is calculated, and a running sum of the resulting matrices is constantly kept until all pixels have been processed, at which point the final matrix can be returned to the host processor. Meanwhile, target detection is primarily composed of correlating each pixel vector with a vector that represents the spectral signature of the target classification of interest. For each pixel vector sent to the FPGA, the core returns a floating-point detection value for each target classification specified at runtime.
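
The two accelerated stages described above reduce to simple vector operations, sketched here in plain Python. This is only an illustration of the arithmetic, not the FPGA core or the implementation in [71]: the function names are hypothetical, the running sum is returned without any normalization, the intermediate weight-computation stage is skipped, and the detection stage is modeled as a plain dot product between each pixel vector and each supplied signature vector.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def acsm_running_sum(pixels: Sequence[Sequence[float]]) -> Matrix:
    """Accumulate the outer product x * x^T of every pixel vector.
    Each pixel vector has one element per spectral band."""
    bands = len(pixels[0])
    acc: Matrix = [[0.0] * bands for _ in range(bands)]
    for x in pixels:
        for i in range(bands):
            for j in range(bands):
                acc[i][j] += x[i] * x[j]
    return acc


def detect(pixel: Sequence[float],
           signatures: Sequence[Sequence[float]]) -> List[float]:
    """Return one detection value per target class: the correlation (dot
    product) of the pixel vector with that class's signature vector."""
    return [sum(p * s for p, s in zip(pixel, sig)) for sig in signatures]


# Tiny hypothetical image: two pixels, two spectral bands.
pixels = [[1.0, 2.0], [3.0, 4.0]]
acsm = acsm_running_sum(pixels)
scores = detect(pixels[0], [[1.0, 0.0], [0.5, 0.5]])
print(acsm, scores)
```

For the two-pixel example the accumulated matrix is [[10, 14], [14, 20]] and the first pixel scores 1.0 and 1.5 against the two example signatures. The nested loops over band pairs are independent of one another, which is what makes the outer-product stage a natural fit for parallel hardware.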


In this appendix, the execution platforms used in case studies throughout this research are described in detail.

33 MHz PCI card featuring Xilinx Virtex4-SX55 FPGA
Four (4) 4 MB DDR2 SRAM banks

133 MHz PCI-X card featuring a Xilinx XC4VLX100 FPGA
One (1) 512 MB bank DDR2 SRAM (4 GB/s bandwidth)
Four (4) 4 MB banks DDR2 SSRAM (2 GB/s bandwidth)


Opteron-socket card featuring Altera Stratix II EP2S180 FPGA
One (1) 4 GB DDR RAM
One (1) 4 MB ZBT SRAM
32 MB FLASH


Quad-FPGA card featuring four (4) Stratix-III E260 FPGAs
One (1) 4 GB DDR RAM
One (1) 4 MB ZBT SRAM
32 MB FLASH

*Currently, the Novo-G platform is being upgraded to house a second ProcStar-III FPGA board in each compute node, doubling the number of FPGAs in the cluster.


[1] K. Bondalapati and V. K. Prasanna, Reconfigurable computing systems, Proc. IEEE, vol. 90, no. 7, pp. 1201, July 2002.
[2] R. Tessier and W. Burleson, Reconfigurable computing for digital signal processing: A survey, J. VLSI Signal Processing, vol. 28, no. 1-2, pp. 7, May 2001.
[3] P. Garcia, K. Compton, M. Schulte, E. Blem, and W. Fu, An overview of reconfigurable hardware in embedded systems, EURASIP J. Embedded Systems, vol. 2006, pp. 1, 2006.
[4] S. Merchant, B. Holland, C. Reardon, A. George, H. Lam, and G. Stitt, Strategic challenges for application development productivity in reconfigurable computing, in Proc. National Aerospace & Electronics Conference (NAECON), 2008.
[5] E. Grobelny, D. Bueno, I. Troxel, A. George, and J. Vetter, FASE: A framework for scalable performance prediction of HPC systems and applications, Trans. Society Modeling and Simulation International, vol. 83, no. 10, pp. 721, Oct. 2007.
[6] K. Compton and S. Hauck, An introduction to reconfigurable computing, IEEE Computer, Apr. 2000.
[7] R. Hartenstein, A decade of reconfigurable computing: A visionary retrospective, in Proc. Design, Automation, and Test in Europe (DATE), 2001, pp. 642.
[8] K. Compton and S. Hauck, Reconfigurable computing: A survey of systems and software, ACM Computing Surveys, vol. 34, no. 2, pp. 171, June 2002.
[9] B. Zeidman, The death of the structured ASIC, Chip Design Magazine, May 2006.
[10] D. Burger and T. M. Austin, The SimpleScalar tool set, version 2.0, ACM SIGARCH Computer Architecture News, vol. 25, no. 3, pp. 13, June 1997.
[11] Mentor Graphics, ModelSim SE User's Manual, 2010.
[12] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe, Statistical sampling of microarchitecture simulation, ACM Trans. Modeling and Computer Simulation (TOMACS), vol. 16, no. 3, pp. 197, July 2006.
[13] G. Hamerly, E. Perelman, J. Lau, and B. Calder, Simpoint 3.0: Faster and more flexible program phase analysis, J. Instruction-Level Parallelism, vol. 7, pp. 1, Sep. 2005.
[14] T. Lafage and A. Seznec, Choosing representative slices of program execution for microarchitecture simulations: A preliminary application to the data stream, in Workload characterization of emerging computer applications, 2001, Kluwer International Series In Engineering And Computer Science Series, pp. 145.


[15] R. A. Uhlig and T. N. Mudge, Trace-driven memory simulation: A survey, ACM Computing Surveys (CSUR), vol. 29, no. 2, pp. 128, June 1997.
[16] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, and A. Purkayastha, A framework for performance modeling and prediction, in Proc. ACM/IEEE SC2002 Conf., 2002, pp. 21.
[17] S. Pllana and T. Fahringer, Performance prophet: A performance modeling and prediction tool for parallel and distributed programs, in Proc. 2005 Int. Conf. Parallel Processing, 2005, pp. 509.
[18] B. Holland, K. Nagarajan, C. Conger, A. Jacobs, and A. George, RAT: A methodology for predicting performance in application design migration to FPGAs, in Proc. High-Performance Reconfigurable Computing Technologies and Apps Workshop (HPRCTA), 2007, pp. 1.
[19] M. C. Smith and G. D. Peterson, Analytical modeling for high-performance reconfigurable computers, in Proc. SCS Int. Symp. Performance Evaluation of Computer and Telecomm. Systems (SPECTS), San Diego, CA, July 2002.
[20] C. P. Steffen, Parameterization of algorithms and FPGA accelerators to predict performance, in Proc. Reconfigurable System Summer Institute (RSSI), 2007, pp. 17.
[21] K. K. Bondalapati, Modeling and Mapping for Dynamically Reconfigurable Hybrid Architectures, Ph.D. dissertation, University of Southern California, Los Angeles, CA, Aug. 2001.
[22] R. Enzler, C. Plessl, and M. Platzner, System-level performance evaluation of reconfigurable processors, J. Microprocessors and Microsystems, vol. 29, no. 2-3, pp. 63, Apr. 2005, Special Issue on FPGA Tools and Techniques.
[23] W. Fu and K. Compton, A simulation platform for reconfigurable computing research, in Proc. Int. Conf. Field Programmable Logic and Applications (FPL), Aug. 2006, pp. 1.
[24] K. K. Poon, S. J. Wilton, and A. Yan, A detailed power model for field-programmable gate arrays, ACM Trans. Design Automation of Electronic Systems, vol. 10, no. 2, pp. 279, Apr. 2005.
[25] J. H. Anderson and F. N. Najm, Power estimation techniques for FPGAs, IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 12, no. 10, pp. 1015, Oct. 2004.
[26] F. Li, D. Chen, L. He, and J. Cong, Architecture evaluation for power-efficient FPGAs, in Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, Feb. 2003, pp. 175.


[27] K. Weiss, C. Oetker, I. Katchan, T. Steckstor, and W. Rosenstiel, Power estimation approach for SRAM-based FPGAs, in Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, 2000, pp. 195.
[28] SAE, SAE Standards: AS5506, Architecture Analysis & Design Language (AADL), Society of Automotive Engineers, Nov. 2004.
[29] P. Feiler, D. Gluch, and J. Hudak, The architecture analysis and design language (AADL): An introduction, Tech. Rep. (CMU/SEI-2006-TN-011, ADA455842), Software Engineering Institute, Carnegie Mellon University, 2006.
[30] A. Rugina, K. Kanoun, and M. Kaaniche, An architecture-based dependability modeling framework using AADL, in Proc. 10th IASTED Intl Conf. Software Engineering and Applications (SEA'2006), Nov. 2006.
[31] F. Singhoff, J. Legrand, L. Nana, and L. Marce, Scheduling and memory requirements analysis with AADL, in Proc. 2005 Annual ACM SIGAda Int. Conf. Ada, Dec. 2005, pp. 1.
[32] O. Sokolsky, I. Lee, and D. Clarke, Schedulability analysis of AADL models, in Proc. 20th IEEE Int. Parallel & Distributed Processing Symp., Apr. 2006.
[33] SEI AADL Team, An extensible source AADL tool environment (OSATE), Tech. Rep., Software Engineering Institute, Carnegie Mellon University, June 2006.
[34] UML Revision Task Force, OMG Unified Modeling Language Specification, v1.4, Object Management Group, Sep. 2001.
[35] Object Management Group, OMG Systems Modeling Language (OMG SysML), v1.1, Object Management Group, formal/2008-11-01 edition, Nov. 2008.
[36] Object Management Group, UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems, v1.0, Object Management Group, formal/2009-11-02 edition, Nov. 2009.
[37] J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt, Ptolemy: A framework for simulating and prototyping heterogeneous systems, Int. J. Computer Simulation, vol. 4, pp. 152, Apr. 1994.
[38] J. Eker, J. W. Janneck, E. A. Lee, J. Liu, X. Liu, J. Ludvig, S. Neuendorffer, S. Sachs, and Y. Xiong, Taming heterogeneity - the Ptolemy approach, Proc. IEEE, vol. 91, no. 1, pp. 127, Jan. 2003.
[39] F. Balarin, Y. Watanabe, H. Hsieh, L. Lavagno, C. Passerone, and A. Sangiovanni-Vincentelli, Metropolis: an integrated electronic system design environment, IEEE Computer Mag., vol. 36, no. 4, pp. 45, Apr. 2003.


[40] D. Densmore, A. Donlin, and A. Sangiovanni-Vincentelli, FPGA architecture characterization for system level performance analysis, in Proc. Conf. Design, Automation and Test in Europe (DATE), 2006, pp. 764.
[41] A. D. Pimentel, The Artemis workbench for system-level performance evaluation of embedded systems, Intl. J. Embedded Systems, vol. 3, no. 3, pp. 181, 2005.
[42] A. D. Pimentel, L. O. Hertzberger, P. Lieverse, P. van der Wolf, and E. F. Deprettere, Exploring embedded-systems architectures with Artemis, IEEE Computer Mag., vol. 34, no. 11, pp. 57, Nov. 2001.
[43] K. Keutzer, S. Malik, A. R. Newton, J. M. Rabaey, and A. Sangiovanni-Vincentelli, System-level design: Orthogonalization of concerns and platform-based design, IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, pp. 1523, Dec. 2000.
[44] B. Kienhuis, E. F. Deprettere, P. van der Wolf, and K. Vissers, Embedded Processor Design Challenges, chapter A Methodology to Design Programmable Embedded Systems: The Y-Chart Approach, pp. 18, Springer, 2002.
[45] A. D. Pimentel, C. Erbas, and S. Polstra, A systematic approach to exploring embedded system architectures at multiple abstraction levels, IEEE Trans. Computers, vol. 55, no. 2, pp. 99, Feb. 2006.
[46] G. Kahn, The semantics of a simple language for parallel programming, in Proc. IFIP Congress, 1974, number 74.
[47] P. Lieverse, P. van der Wolf, and E. Deprettere, A trace transformation technique for communication refinement, in Proc. Int. Symp. Hardware/Software Codesign, 2001, pp. 134.
[48] S. Mohanty and V. K. Prasanna, A model-based extensible framework for efficient application design using FPGA, ACM Trans. Design Automation of Electronic Systems, vol. 12, no. 2, pp. 13, Apr. 2007.
[49] S. Mohanty, V. K. Prasanna, S. Neema, and J. Davis, Rapid design-space exploration of heterogeneous embedded systems using symbolic search and multi-granular simulation, in Proc. Joint Conf. Languages, compilers and tools for embedded systems (LCTES/SCOPES), 2002, pp. 18.
[50] A. Bakshi, V. K. Prasanna, and A. Ledeczi, Milan: A model based integrated simulation framework for design of embedded systems, in Proc. ACM SIGPLAN workshop on Languages, compilers and tools for embedded systems (LCTES'01), 2001, pp. 82.
[51] C. Roig, A. Ripoll, and F. Guirado, A new task graph model for mapping message passing applications, IEEE Trans. Parallel and Distributed Systems, vol. 18, no. 12, pp. 1740, Dec. 2007.


[52] P. J. Cameron, Combinatorics: Topics, Techniques, and Algorithms, Cambridge University Press, 1994.
[53] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli, System-level hardware/software partitioning based on simulated annealing and tabu search, J. Design Automation for Embedded Systems, vol. 2, no. 1, pp. 5, 1997.
[54] J. Henkel, A low-power hardware/software partitioning approach for core-based embedded systems, in Proc. Design Automation Conference (DAC), 1999, pp. 122.
[55] G. Stitt, R. Lysecky, and F. Vahid, Dynamic hardware/software partitioning: a first approach, in Proc. Design Automation Conference (DAC), 2003, pp. 250.
[56] D. Menasce, D. Saha, S. C. da Silva Porto, V. Almeida, and S. Tripathi, Static and dynamic processor scheduling disciplines in heterogeneous parallel architectures, J. Parallel and Distributed Computing, vol. 28, no. 1, pp. 1, July 1995.
[57] R. P. Dick and N. K. Jha, Cords: Hardware-software co-synthesis of reconfigurable real-time distributed embedded systems, in Proc. Int. Conf. Computer Aided Design (ICCAD), Nov. 1998, pp. 62.
[58] F. M. Ali and A. S. Das, Hardware-software co-synthesis of hard real-time systems with reconfigurable FPGAs, J. Computers & Electrical Engineering, vol. 30, no. 7, pp. 471, 2004.
[59] G. Stitt, Hardware/software partitioning with multi-version implementation exploration, in Proc. of the ACM Great Lakes symposium on VLSI (GLSVLSI'08), May 2008, pp. 143.
[60] Y. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure, and J. Stockwood, Hardware-software co-design of embedded reconfigurable architectures, in Proc. Design Automation Conference (DAC), June 2000, pp. 507.
[61] S. Banerjee, E. Bozorgzadeh, and N. Dutt, Physically-aware hw-sw partitioning for reconfigurable architectures with partial dynamic reconfiguration, in Proc. Design Automation Conf. (DAC), June 2005, pp. 335.
[62] K. S. Chatha and R. Vemuri, An iterative algorithm for hardware-software partitioning, hardware design space exploration and scheduling, J. Design Automation for Embedded Systems, vol. 5, no. 3-4, pp. 281, Aug. 2000.
[63] K. S. Chatha and R. Vemuri, Magellan: Multiway hardware-software partitioning and scheduling for latency minimization of hierarchical control-dataflow task graphs, in Proc. 9th Int. Symposium on Hardware/Software Codesign (CODES'01), 2001, pp. 42.

[64] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, "A portable programming interface for performance evaluation on modern processors," Int. J. High Performance Applications, vol. 14, no. 3, pp. 189, Aug. 2000.
[65] D. W. Walker, "The design of a standard message passing interface for distributed memory concurrent computers," J. Parallel Computing, vol. 20, no. 4, pp. 657, 1994.
[66] G. Schorcht, I. Troxel, K. Farhanigan, P. Unger, D. Zinn, C. Mick, A. D. George, and H. Salzwedel, "System-level simulation modeling with MLDesigner," in Proc. 11th IEEE/ACM Int. Symp. Modeling, Analysis and Simulation of Computer and Telecomm. Systems (MASCOTS), 2003, pp. 207.
[67] Altera, "Evaluating power for Altera devices," Application Note 74 version 3.1, Altera Corp., July 2001.
[68] P. Hicks, M. Walnock, and R. M. Owens, "Analysis of power consumption in memory hierarchies," in Proc. Int. Symp. Low Power Electronics and Design, Aug. 1997, pp. 239.
[69] E. Macii, M. Pedram, and F. Somenzi, "High-level power modeling, estimation, and optimization," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 11, pp. 1061, Nov. 1998.
[70] G. Wang, G. Stitt, H. Lam, and A. George, "A framework for core-level modeling and design of reconfigurable computing algorithms," in Proc. High-Performance Reconfigurable Computing Technology and Applications Workshop (HPRCTA), Nov. 2009, pp. 29.
[71] A. Jacobs, C. Conger, and A. George, "Multiparadigm space processing for hyperspectral imaging," in Proc. IEEE Aerospace Conference, 2008.
[72] B. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell System Technical J., Feb. 1970.
[73] NSF Center for High-Performance Reconfigurable Computing (CHREC), "Novo-G: Adaptively custom research computer," updated December 2009, cited March 2010, available from http://www.chrec.org/facilities.html.
[74] V. Aggarwal, R. Garcia, G. Stitt, A. George, and H. Lam, "SCF: A device- and language-independent task coordination framework for reconfigurable, heterogeneous systems," in Proc. High-Performance Reconfigurable Computing Technology and Apps Workshop (HPRCTA), Nov. 2009, pp. 19.
[75] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids, Oxford University Press, 1987.

[76] D. A. Pearlman, D. A. Case, J. W. Caldwell, W. S. Ross, T. E. C. III, S. DeBolt, D. Ferguson, G. Seibel, and P. Kollman, "Amber, a package of computer programs for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to simulate the structural and energetic properties of molecules," Computer Physics Communications, vol. 91, no. 1-3, pp. 1, Sep. 1995.
[77] M. Nelson, W. Humphrey, A. Gursoy, A. Dalke, L. Kale, R. Skeel, and K. Schulten, "NAMD: a parallel, object-oriented molecular dynamics program," Int. J. High Performance Computing Applications, vol. 10, no. 4, pp. 251, 1996.
[78] S. Alam, P. Agrawal, M. Smith, J. Vetter, and D. Caliga, "Using FPGA devices to accelerate biomolecular simulations," IEEE Computer Magazine, vol. 39, no. 4, pp. 66, Mar. 2007.
[79] C.-I. Chang, H. Ren, and S.-S. Chiang, "Real-time processing algorithm for target detection and classification in hyperspectral imagery," IEEE Trans. on Geoscience and Remote Sensing, vol. 39, no. 4, pp. 760, Apr. 2004.

Casey Reardon was born in 1982 in Alexandria, Virginia. The youngest of two children, Casey grew up in Alexandria and graduated from West Potomac High School in the spring of 2000 where he was a state-champion track athlete. Following high school, Casey enrolled at Duke University in the fall of 2000 where he was a member of the varsity track and field team. He received a B.S. degree at Duke in 2004 after double-majoring in electrical engineering and computer science. Casey enrolled in the Ph.D. program in the Department of Electrical and Computer Engineering at the University of Florida in the fall of 2004, as a recipient of the University of Florida's Presidential Fellowship. His initial research efforts at Florida included leading a project investigating modeling and simulation of advanced optical networks in aircraft for the U.S. Naval Air Systems Command. He has since been conducting research as a member of the NSF Center for High-Performance Reconfigurable Computing (CHREC) at the University of Florida in modeling and simulation of reconfigurable computing systems. Casey received his Ph.D. from the University of Florida in the spring of 2010.