<%BANNER%>

SHMEM+ and SCF

Permanent Link: http://ufdc.ufl.edu/UFE0042821/00001

Material Information

Title: SHMEM+ and SCF: System-Level Programming Models for Scalable Reconfigurable Computing
Physical Description: 1 online resource (119 p.)
Language: english
Creator: AGGARWAL,VIKAS
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2011

Subjects

Subjects / Keywords: FPGA -- HETEROGENEOUS -- MESSAGE -- MODEL -- PARALLEL -- PASSING -- PGAS -- PORTABILITY -- PRODUCTIVITY -- PROGRAMMING -- RECONFIGURABLE
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Heterogeneous computing systems comprised of FPGAs coupled with standard microprocessors are becoming an increasingly popular solution for building future high-performance computing (HPC) and high-performance embedded computing (HPEC) systems due to their higher performance and energy efficiency versus their CPU-only counterparts. Unfortunately, the time and difficulty associated with developing scalable, parallel applications for reconfigurable computing (RC) platforms is often prohibitive, making it difficult to exploit the potential gains in performance and energy savings. Design and implementation of applications for these systems involve many system-wide considerations such as algorithm decomposition and architecture mappings to exploit multiple levels of parallelism, inter-device communication and control, and system-level debug and verification. Thus, system-level languages with constructs for expressing multiple levels and forms of parallelism are vital for productive implementation of designs. In order to improve a developer's productivity as well as the portability of scalable RC applications, we propose and analyze two different approaches for establishing communication in parallel RC applications. The first approach extends the traditional partitioned, global-address-space (PGAS) model to a multilevel abstraction by integrating a hierarchy of multiple memory components present in reconfigurable HPC systems into a single virtual memory layer. Based on this model, we adapt the SHMEM communication library to become what we call SHMEM+, the first known SHMEM library enabling coordination between FPGAs and CPUs in a reconfigurable HPC system. The second approach provides a system coordination framework (SCF) based on a message-passing programming model. 
SCF enables transparent communication and synchronization between tasks running on heterogeneous processing devices in a system by hiding low-level communication details while presenting a uniform communication interface across heterogeneous devices. Besides improving developer productivity, both of these approaches enhance the portability of applications (by hiding the platform-specific details from applications). Further improvements in productivity are obtained by combining techniques for modeling performance with the aforementioned approaches for application design. By allowing developers to estimate the performance of their application, modeling enables the developers to make critical design decisions before undertaking an expensive implementation. Case studies illustrate the merits and effectiveness of the proposed approaches through increased performance and portability of applications along with improvement in developer productivity. The resulting communication libraries and coordination framework are projected to provide large productivity improvements, thus expanding the use of RC technology in the fields of HPC and HPEC.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by VIKAS AGGARWAL.
Thesis: Thesis (Ph.D.)--University of Florida, 2011.
Local: Adviser: George, Alan D.
Local: Co-adviser: Stitt, Greg.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2011
System ID: UFE0042821:00001


This item has the following downloads:


Full Text

PAGE 1

SHMEM+ AND SCF: SYSTEM-LEVEL PROGRAMMING MODELS
FOR SCALABLE RECONFIGURABLE COMPUTING

By

VIKAS AGGARWAL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2011

PAGE 2

© 2011 Vikas Aggarwal

PAGE 3

I dedicate this dissertation to my family (my parents and my caring sister), and my friends. I am truly grateful to you all.

PAGE 4

ACKNOWLEDGMENTS

I would like to thank God and my fate; this dissertation could not have materialized without either. Mom, Dad, and my dear sister, I am truly grateful for your patience, prayers, and support (amongst many other things) – thank you for being so understanding. I would like to thank my committee, Dr. Herman Lam and Dr. Prabhat Mishra, for their time, advice and assistance, as well as my advisors, Dr. Alan D. George and Dr. Greg Stitt, for their guidance, direction, and significant support through this work. I would also like to thank my friends, Swami, Kunal, Rick, and Dana for their support, care, and help in all the big and little ways – I can't thank you all enough. Additionally, I would like to thank my former and current lab members, in particular Rafael Garcia, Changil Yoon, and Kishore Yalamanchili for helping me in various parts of this research. And lastly, I would like to thank the activity of running! You really helped me keep a sane head on my shoulders, helped me resolve many problems on a fine sunny day by myself, helped me come to peace with things beyond my control, and kept me from giving up on myself several times. I would like to thank all the guys with whom I have shared several runs. I am grateful to the group of "Team Asha" runners for all the memories that I will treasure. This work is supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422 and by equipment and/or tools provided by Altera and GiDEL.

PAGE 5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS .................................. 4
LIST OF TABLES ...................................... 7
LIST OF FIGURES ..................................... 8
ABSTRACT ......................................... 10

CHAPTER

1 INTRODUCTION ................................... 12

2 BACKGROUND AND RELATED RESEARCH ................... 18
  2.1 Reconfigurable Computing .......................... 18
  2.2 Parallel-Programming Models ......................... 19
  2.3 Performance Modeling ............................. 23

3 SHMEM+: A MULTILEVEL-PGAS PROGRAMMING MODEL FOR SCALABLE RC SYSTEMS (PHASE 1) .............................. 26
  3.1 Multilevel PGAS ................................ 27
  3.2 Overview of SHMEM+ ............................. 32
    3.2.1 SHMEM+ Interface ........................... 34
    3.2.2 SHMEM+ Application Example .................... 35
    3.2.3 Design of SHMEM+ .......................... 36
    3.2.4 Design Methodology with SHMEM+ ................. 38
  3.3 Experimental Results ............................. 39
    3.3.1 Benchmarking Performance of Communication Routines ...... 40
    3.3.2 Case Study 1: Content-Based Image Retrieval (CBIR) ....... 43
    3.3.3 Case Study 2: Two-Dimensional FFT ................. 50
  3.4 Conclusions ................................... 53

4 PERFORMANCE MODELING FOR MULTILEVEL COMMUNICATION IN SHMEM+ (PHASE 2) ................................ 55
  4.1 Overview of Multilevel Communication Model ................ 56
  4.2 Early DSE in Applications ........................... 60
    4.2.1 Performance Estimation ........................ 62
    4.2.2 Experimental Results .......................... 66
  4.3 Optimizing SHMEM+ Functions ........................ 68
    4.3.1 Performance Estimation ........................ 69
    4.3.2 Experimental Results .......................... 71
  4.4 Self-Tuning SHMEM+ Library ......................... 72
    4.4.1 Experimental Results .......................... 74

PAGE 6

  4.5 Conclusions ................................... 75

5 SYSTEM COORDINATION FRAMEWORK (PHASE 3) .............. 77
  5.1 Overview of SCF ................................ 77
    5.1.1 Programming Model .......................... 81
    5.1.2 Architectural Model ........................... 81
    5.1.3 Communication Model ......................... 82
  5.2 SCF Ecosystem ................................ 85
    5.2.1 SCF Task-Graph Language and Mapping .............. 85
    5.2.2 Tool Flow ................................ 87
    5.2.3 Interfacing with External Modeling Tools ............... 90
  5.3 Results and Analysis .............................. 91
    5.3.1 Experimental Setup .......................... 92
    5.3.2 Custom Communication Synthesis .................. 93
    5.3.3 Case Study: Target Tracking ...................... 95
    5.3.4 Case Study: Backprojection ...................... 99
  5.4 Conclusions ................................... 101

6 COMPARISON OF SHMEM+ AND SCF ...................... 103
  6.1 Programming Model and Complexity ..................... 103
  6.2 Performance of Data Transfers ........................ 104
  6.3 Data-Transfer Capabilities ........................... 105
  6.4 Conclusions ................................... 106

7 CONCLUSIONS ................................... 107

APPENDIX

A DESCRIPTIONS OF RC PLATFORMS EMPLOYED ............... 109
  A.1 Mu Cluster ................................... 109
  A.2 Novo-G ..................................... 109
  A.3 Heterogeneous Testbed ............................ 110

B DESCRIPTIONS OF APPLICATIONS EMPLOYED ................ 111
  B.1 Target Tracking Using Kalman Filter ..................... 111
  B.2 Content-Based Image Retrieval (CBIR) ................... 111
  B.3 Backprojection ................................. 112

REFERENCES ....................................... 113
BIOGRAPHICAL SKETCH ................................ 119

PAGE 7

LIST OF TABLES

Table page

3-1 Baseline functions currently supported in SHMEM+ library. ........... 33
3-2 End-to-end latency of transfers between various combinations of devices for traditional SHMEM on Mu cluster, and SHMEM+ on Mu cluster and Novo-G. .. 42
3-3 Major factors that contribute towards increased developer productivity when using SHMEM+. ................................... 46
3-4 Development hours spent in developing CBIR application. Time is reported in terms of 8-hour work days. ............................. 48
3-5 Development hours spent for developing two-dimensional FFT application. .. 53
4-1 Observed time and estimated time to perform gather operation by single-FPGA designs using three different approaches on Novo-G. .............. 65
4-2 Observed time and estimated time to perform gather operation for quad-FPGA designs on Novo-G using Approaches 1 and 2. .................. 65
4-3 Observed time and estimated time to perform gather operation for quad-FPGA designs on Novo-G using Approach 3. ....................... 65
4-4 Estimating performance of packetized transfers between two remote FPGAs on Novo-G for P = 2MB and 512KB. ........................ 71
4-5 Relative error between estimated bandwidth and observed bandwidth of transfers between remote FPGAs. ......................... 73
4-6 Performance improvement for transfers between remote FPGAs obtained by self-tuned SHMEM+ library vs. baseline. ...................... 75
5-1 SCF library primitives. ................................ 83
5-2 Execution time and speedup of target-tracking application under different mapping scenarios for three systems. ....................... 96
5-3 Productivity improvement for target-tracking application. ............. 97
5-4 Overhead measurements for target-tracking application. ............. 97
5-5 Major factors that contribute to increased developer productivity using SCF. .. 98
5-6 Computation and communication performance of backprojection task on a CPU or FPGA. .................................... 100
5-7 Performance of backprojection application under different mapping scenarios. 101

PAGE 8

LIST OF FIGURES

Figure page

1-1 Overview of tools and languages employed in design of RC applications. ... 13
2-1 Source code for a SHMEM application that transfers an array of data from PE 1 to PE 0. ...................................... 20
3-1 System architecture of a typical RC machine. ................... 27
3-2 High-level abstractions for programming heterogeneous systems. ........ 28
3-3 Distribution of memory and system resources in multilevel PGAS. ....... 29
3-4 Differences in physical and logical resource abstractions of a typical RC system. ........................................ 30
3-5 Data transfer choices available to the developer with SHMEM+. ......... 32
3-6 Multi-FPGA "add-one" application example. .................... 35
3-7 Code snippets for a multi-FPGA add-one application. ............... 37
3-8 Design of the SHMEM+ library. ........................... 37
3-9 Design methodology for application development using multilevel PGAS and SHMEM+. ....................................... 38
3-10 Bandwidth of point-to-point routines using SHMEM+ and Quadrics SHMEM for transfers between two CPUs. .......................... 41
3-11 Bandwidth of point-to-point routines on Novo-G using SHMEM+ for transfers between different devices. .............................. 42
3-12 Processing steps involved in parallel algorithm for CBIR using SHMEM+. ... 43
3-13 Performance comparison of different implementations of parallel CBIR application on Mu cluster. .............................. 45
3-14 Performance comparison of different implementations of parallel CBIR application on Novo-G. ................................ 49
3-15 Performance comparison of different implementations of parallel CBIR application on Novo-G. ................................ 50
3-16 Abstract representation of processing steps involved in a parallel two-dimensional FFT algorithm. .................................... 51
3-17 Performance comparison of different implementations of parallel 2-D FFT algorithm on Novo-G. ................................ 53

PAGE 9

4-1 Transfer capabilities supported by various communication routines in SHMEM+. 57
4-2 Results for CBIR application observed from experiments conducted on Novo-G for a search database consisting of 22,000 images, each of size 128×128 pixels of 8 bits. .............................. 61
4-3 Estimated performance of gather operation on Novo-G using three different approaches. ..................................... 66
4-4 Speedup observed for quad-FPGA designs of CBIR using the three approaches over a sequential baseline executing on a single CPU on Novo-G. 67
4-5 Intermediate steps involved in data transfers between two remote FPGAs using SHMEM+. ................................... 68
4-6 Bandwidth of transfers between remote FPGAs. ................. 72
4-7 Bandwidth of transfers between remote FPGAs obtained by baseline SHMEM+ library and self-tuned version of SHMEM+. .............. 74
4-8 Packet size determined by self-tuned SHMEM+ library for transfers between two FPGAs over a range of data sizes on different systems. '-' indicates no packetization was beneficial. All numbers represent bytes of data (e.g. 1M = 1MB). ......................................... 75
5-1 Application design philosophy using SCF. ..................... 78
5-2 Architectural model of an example system. .................... 82
5-3 An example application graph description in SCF task-graph language. .... 86
5-4 Tool flow in SCF. ................................... 87
5-5 Example application in SCF environment. ..................... 89
5-6 Representation of an example application. ..................... 91
5-7 SCF architectural model for experimental system. ................ 92
5-8 Results for custom communication synthesis. ................... 94
5-9 Task graph for target-tracking application. ..................... 95
5-10 Task graph for backprojection application. ..................... 99
6-1 The average round-trip bandwidth for a ping-pong microbenchmark using communication routines in SHMEM+ and SCF. .................. 105
A-1 System architecture of Novo-G RC supercomputer. ................ 110

PAGE 10

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

SHMEM+ AND SCF: SYSTEM-LEVEL PROGRAMMING MODELS
FOR SCALABLE RECONFIGURABLE COMPUTING

By

Vikas Aggarwal

May 2011

Chair: Alan D. George
Cochair: Greg Stitt
Major: Electrical and Computer Engineering

Heterogeneous computing systems comprised of FPGAs coupled with standard microprocessors are becoming an increasingly popular solution for building future high-performance computing (HPC) and high-performance embedded computing (HPEC) systems due to their higher performance and energy efficiency versus their CPU-only counterparts. Unfortunately, the time and difficulty associated with developing scalable, parallel applications for reconfigurable computing (RC) platforms is often prohibitive, making it difficult to exploit the potential gains in performance and energy savings. Design and implementation of applications for these systems involve many system-wide considerations such as algorithm decomposition and architecture mappings to exploit multiple levels of parallelism, inter-device communication and control, and system-level debug and verification. Thus, system-level languages with constructs for expressing multiple levels and forms of parallelism are vital for productive implementation of designs.

In order to improve a developer's productivity as well as the portability of scalable RC applications, we propose and analyze two different approaches for establishing communication in parallel RC applications. The first approach extends the traditional partitioned, global-address-space (PGAS) model to a multilevel abstraction by integrating a hierarchy of multiple memory components present in reconfigurable HPC

PAGE 11

systems into a single virtual memory layer. Based on this model, we adapt the SHMEM communication library to become what we call SHMEM+, the first known SHMEM library enabling coordination between FPGAs and CPUs in a reconfigurable HPC system. The second approach provides a system coordination framework (SCF) based on a message-passing programming model. SCF enables transparent communication and synchronization between tasks running on heterogeneous processing devices in a system by hiding low-level communication details while presenting a uniform communication interface across heterogeneous devices. Besides improving developer productivity, both of these approaches enhance the portability of applications (by hiding the platform-specific details from applications).

Further improvements in productivity are obtained by combining techniques for modeling performance with the aforementioned approaches for application design. By allowing developers to estimate the performance of their application, modeling enables the developers to make critical design decisions before undertaking an expensive implementation. Case studies illustrate the merits and effectiveness of the proposed approaches through increased performance and portability of applications along with improvement in developer productivity. The resulting communication libraries and coordination framework are projected to provide large productivity improvements, thus expanding the use of RC technology in the fields of HPC and HPEC.

PAGE 12

CHAPTER 1
INTRODUCTION

The power bottleneck created by high clock frequencies has forced computer architects to consider alternative methods for increasing system performance, focusing on parallel [1] and often heterogeneous architectures [2]. Systems from domains ranging from embedded systems [3, 4] to high-performance computing [5-9] now increasingly combine microprocessors with fixed-logic, reconfigurable-logic, and/or heterogeneous multicore and many-core devices [2, 10-12]. These technologies are driving systems to become ever more powerful and efficient, but unfortunately also more complex to program, with multiple types and levels of hardware parallelism to be understood and exploited.

A special class of such systems featuring reconfigurable computing (RC), based on closely coupled microprocessors and FPGAs, offers an attractive solution for both high-performance computing (HPC) and high-performance embedded computing (HPEC) systems. Numerous studies have demonstrated that RC systems can achieve performance improvements ranging from 10× [13] to more than 1000× [9, 14] over their microprocessor-based counterparts while concomitantly reducing energy consumption. Despite their superior performance, RC systems have yet to make a significant impact on the HPC and HPEC market, largely because of increased complexity of application development.

Application development for RC systems requires specification of computation, communication, and control for FPGAs and other components such as microprocessors and serial/parallel bus interfaces. Design and implementation of these systems involve many system-wide considerations such as algorithm decomposition and architecture mappings to exploit multiple levels of parallelism, inter-device communication and control, and system-level debug and verification. Although some of these issues can be explored through modeling, many issues are left unresolved until an actual

PAGE 13

implementation. Thus, system-level languages with constructs for expressing multiple levels and forms of parallelism are vital for efficient design implementation.

Figure 1-1. Overview of tools and languages employed in design of RC applications.

Figure 1-1 shows an overview of the programming languages and tools employed by developers for implementing an RC application. Although programming languages for device-level design continue to evolve and have raised the level of implementation abstraction above HDL to a level associated with HLL-based design, system-level design issues have received limited attention. Similar to MPI [15], SHMEM [16, 17], UPC [18, 19], etc. in the conventional parallel-computing community, coordination tools for RC systems are needed to harness the best of the device-level design languages and yet operate at a higher level of abstraction, expressing and managing computations and data across multiple devices. However, there are characteristic differences between RC systems and traditional parallel systems, such as additional levels of memory and communication in the system as well as different execution models of heterogeneous

PAGE 14

devices present in the system which warrant a programming model that can address these differences. Currently, application developers employ ad-hoc methods and multiple libraries (and APIs) to develop RC applications, which require significant coding modifications for porting an application to a different system. As a result, the development productivity for scalable RC applications has suffered and applications have lacked portability. In addition, most existing formulation tools that are employed for modeling and early design-space exploration (DSE) lack a seamless transition to design tools which is essential for utilizing the knowledge gained by DSE.

This research focuses on defining and developing novel tools for establishing communication in scalable RC applications. To realize this goal, we explore and analyze two complementary approaches to raise the level of abstraction of communication for RC applications, SHMEM+ and System Coordination Framework (SCF). SHMEM+ is an extended and adapted version of the SHMEM communication library [17] for heterogeneous systems, based on Partitioned, Global Address Space (PGAS) [20]. By contrast, SCF provides an approach based on a message-passing programming model for exchanging data between heterogeneous devices in a system. An application designer will typically choose between these two approaches based upon the requirements of a specific application and preference for a programming model, which is similar to the choice between programming models based on message-passing and PGAS in the traditional parallel-programming community. Because these approaches provide a higher level of abstraction, thus distancing a developer from details of the actual implementation and attainable performance, modeling and DSE tools become important to provide high-performance communication as well as to allow the developers to accurately predict performance (to avoid wasted effort). Both of the aforementioned approaches provide necessary bridges to formulation tools (existing or new tools proposed in this research) for performing early DSE, while facilitating a smoother transition from the formulation stage to the design stage.

PAGE 15

The research presented in this document is sub-divided into three phases. In the first phase, we explore techniques for adapting a programming model based on PGAS for scalable RC systems. We extend the traditional PGAS model to a multilevel-PGAS programming model, which abstracts various levels of memory hierarchy available in such systems and presents the designer with a flattened, unified view of system memory. Based on this model, we extend and adapt the SHMEM communication library to become what we call SHMEM+, the first known SHMEM library that enables coordination between FPGAs and CPUs in a reconfigurable system. We present the design of SHMEM+ and describe the methodology for developing applications using SHMEM+. Our design of the SHMEM+ library is highly portable and provides peak communication bandwidth comparable to a vendor-proprietary version of SHMEM. In addition, applications designed with SHMEM+ yield improved developer productivity compared to current methods of multi-device RC design and exhibit a high degree of portability. We analyze the performance and benefits of our implementation of SHMEM+ through two case studies on two different platforms.

One important challenge when using libraries such as SHMEM+ is choosing an appropriate communication strategy. Variances include, for example, who initiates the transfer, what functions are employed, what intermediate steps are involved, etc. Such factors can have a significant impact on the overall performance of an application. In order to evaluate the impact of communication on application performance and obtain optimal performance, a concrete understanding of the underlying communication infrastructure is often imperative. In the second phase of this research, we introduce a new performance model, the multilevel communication model, for estimating the performance of various data transfers encountered in SHMEM+. Such a model can lead to improvements in application performance and productivity. We illustrate the merits of our model using three use cases. In the first use case, the model enables application developers to perform early design-space exploration of communication

PAGE 16

patterns in their applications before undertaking the laborious and expensive process of implementation, yielding improved performance and productivity. Second, the model enables system developers to quickly optimize the performance of data-transfer routines within tools such as SHMEM+ when these are being ported to a new platform. Third, the model augments the SHMEM+ communication library to automatically improve the performance of data transfers by self-tuning its internal parameters to match platform capabilities. Results from our experiments with these use cases suggest marked improvement in performance, productivity, and portability.

In the third phase of this research, we explore an alternate approach for establishing communication between various devices in a heterogeneous system. To complement the research in the first two phases, the research in this phase extends and adapts concepts based on a message-passing programming model, commonly employed in traditional HPC and HPEC systems, for heterogeneous systems. We investigate a novel framework called System Coordination Framework (SCF), to simplify application design for heterogeneous systems by enabling transparent communication and synchronization between tasks running on different devices. Our framework consists of a library of message-passing coordination primitives suitable for potentially any language or device, a framework that allows an application to be expressed as a static task-graph and each task of an application to be defined in potentially any language, and a set of tools that can create customized communication methods for a given system architecture based on the mapping of tasks to devices. With SCF, many low-level architectural details are hidden from application developers by allowing them to specify coordination between tasks using message-passing primitives when defining each task in a language (or tool) of their choice. By hiding low-level architectural details from an application designer, SCF can improve developer productivity, provide higher levels of application portability, and offer rapid design-space exploration of different task/device mappings. We present the technical details of the framework and an implementation of required tools which


we developed to demonstrate its effectiveness. To further improve productivity, we incorporate support for existing formulation tools such as RCML [21] and provide automated translation of models specified using RCML into a format compliant with SCF (i.e., task graphs).

This document consists of background, technical details, and results from each phase of research and is organized as follows. Background and existing research pertaining to the research in this document are summarized in Chapter 2. Chapter 3 provides a detailed description and analysis of our multilevel PGAS model and the SHMEM+ communication library. Next, Chapter 4 augments the research in Phase 1 by developing a performance model, called the multilevel communication model, for representing various data transfers encountered in SHMEM+ and for estimating performance. In Chapter 5, a detailed description and evaluation of SCF is presented. Chapter 6 compares and contrasts SHMEM+ and SCF based on certain characteristics. Finally, the conclusions from this research are provided in Chapter 7. Detailed information regarding the RC platforms and the applications used throughout this research is provided in Appendix A and Appendix B, respectively.


CHAPTER 2
BACKGROUND AND RELATED RESEARCH

The background and related research presented in this chapter are split into three sections. Section 2.1 provides a brief and relevant background from the field of reconfigurable computing. Section 2.2 discusses the existing parallel-programming models that are commonly employed for developing parallel applications, emphasizing heterogeneous systems. Finally, Section 2.3 discusses the existing methods for modeling communication and application performance, with additional focus on RC systems.

2.1 Reconfigurable Computing

In this section we present a brief, related overview of the field of RC, focusing mainly on the aspects of application development on RC systems. For a comprehensive literature survey from the field of RC, see Compton and Hauck's work in [22]. Although RC employs a variety of devices such as Programmable Logic Arrays (PLAs) or Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs) have become the predominant enabling technology in RC systems. Unprecedented levels of computational density, increased internal memory bandwidth, and improved support for double-precision floating-point arithmetic in modern FPGAs have made them an extremely attractive technology in the fields of HPC and HPEC.

Reconfigurable HPC and HPEC systems are usually comprised of one or more FPGAs working alongside a host microprocessor, the former of which acts as an accelerator used to speed up key computations running on the system. FPGAs are employed in such systems in multiple ways, ranging from a peripheral device on a PCIe-based card to an accelerator residing in a CPU socket on the server's motherboard. Each form of RC-system architecture offers a varying degree of coupling between the FPGA and host CPU, which can significantly affect the complexity and efficiency of communication between the two devices. This research focuses


on alleviating problems and issues with establishing communication between heterogeneous devices in scalable RC systems.

RC-application developers traditionally programmed RC systems using Hardware Description Languages (HDLs) such as VHDL and Verilog to describe the hardware portion of an application. The software portion of the application that interacts with the FPGA is typically specified in a programming language such as C/C++, using an API to access the FPGA (e.g., writing and reading data from memory, accessing FPGA registers). More recently, High-Level Languages (HLLs), offering a level of abstraction closer to software, have been employed for describing hardware circuits for an FPGA. Although easier to use than VHDL, most of these languages are still relatively immature and can suffer from some limitations. For example, HLLs such as SystemC [23] and Impulse C [24] offer improved user productivity, but trade off execution and resource efficiency for limited portability and reusability. The Carte programming environment may achieve higher efficiency but at the cost of portability [25]. Additionally, while enabling faster hardware development for an FPGA, HLLs typically only address a subset of the system-level issues involved with programming on scalable RC systems. The communication models and libraries presented in this research focus on resolving system-level issues encountered in the development of scalable RC applications while permitting flexibility in the tools and languages used for describing hardware circuits for FPGAs.

2.2 Parallel-Programming Models

Traditionally, developers of parallel programs have performed coordination between tasks using either message-passing libraries such as MPI [15] or shared-memory libraries such as OpenMP [26]. Recently, languages and libraries that present a partitioned global address space (PGAS) to the programmer, such as UPC [18, 19] and SHMEM [17], have become more visible and popular. These languages provide a simple interface for developers of parallel applications through implicit or explicit one-sided


data-transfer functions while providing comparable performance to message-passing libraries [27]. In particular, the SHMEM communication library is currently experiencing a growth in interest in the HPC community due to its innate simplicity, low overhead, and emphasis upon explicit, high-bandwidth, one-sided communications. The SHMEM communication library consists of a set of routines that allow exchange of data between cooperating parallel processes (called processing elements, or PEs). Programs developed using SHMEM follow the single-program, multiple-data (SPMD) model [28], and are similar in style to programs based on MPI. SHMEM routines support remote data transfers through put (or get) operations, which transfer data to (or from) a different PE using remote pointers which allow direct references to data objects owned by the remote PE. Several other operations are also supported, such as broadcast, collective reduction, synchronization operations, and atomic memory operations. Figure 2-1 shows the source code of an example application which uses the SHMEM library to transfer an array of data from the "source" variable on PE 1 to the "dest" variable on PE 0.

Figure 2-1. Source code for a SHMEM application that transfers an array of data from PE 1 to PE 0.


Since most such languages and libraries were developed for traditional HPC systems, they have typically been limited to homogeneous systems of microprocessors connected via commodity interconnect technology (e.g., Ethernet, InfiniBand). Although researchers have extended some of these libraries to a heterogeneous mix of microprocessors connected via different network technologies [29-31], the heterogeneity supported by these libraries has been limited to different types of microprocessors.

Owing to the emergence of a plethora of devices that are used for application acceleration and coupled with microprocessors in HPC, there has been a quest for exploring parallel-programming models that are better suited for heterogeneous systems. Some researchers have attempted to build hybrid models using multiple models for a system [32]. System-level libraries and languages such as MPI and UPC were used for coordination between tasks executing on different nodes of a cluster, and libraries such as OpenMP for coordination between tasks within each node. Hybrid models require the developer to partition their design into multiple levels and acquire expertise with multiple programming models, languages, libraries, and tools. By contrast, this research attempts to abstract these details from an application developer, presenting them with an integrated programming model and a uniform communication interface.

Other research groups have shown interest in asynchronous execution in the PGAS model, leading to Asynchronous PGAS (APGAS) [33], which lays the foundation for active-message programming and fine-grained concurrency. However, APGAS is largely tailored towards spawning massively parallel, multi-threaded kernel computations at run-time on accelerators such as GPUs and is not well-suited for FPGA devices.

The work in [34] extends the UPC programming model to abstract a system of microprocessors and accelerators through a two-level hierarchy of parallelism. While their work relates to SHMEM+ in that both seek to provide a unified programming model,


their approach is quite different from ours in that it relies on identifying and extracting sections of code in a UPC program that are amenable to hardware acceleration, re-directing them through a source-to-source translator and a high-level synthesis tool to generate hardware designs. Instead of providing a means for creating hardware designs, SHMEM+ provides a parallel programming model amenable to HPC systems that have a hierarchy of computational devices and memory resources, while deferring to and leveraging the efficiency of existing and emerging high-level synthesis tools to raise the abstraction for device-level design and generate the appropriate hardware.

TMD-MPI [35] extends the MPI library to support message-passing between heterogeneous devices, such as a mix of FPGAs and microprocessors. Although conceptually similar, SCF, the coordination framework presented in Phase 3 of this research, has several novel aspects. For instance, TMD-MPI uses a dynamic, task-graph representation of an application and a communication architecture based on a packet-switched, Network-on-Chip (NoC) design. In contrast, applications are defined as static task graphs in SCF, which, because of a known mapping, can yield designs which provide improved performance.

Auto-Pipe and the X language presented in [36, 37] provide a framework for developing pipelined applications distributed across the resources of a heterogeneous system. Although SCF adopts a similar approach, it extends this concept to allow applications with arbitrary task-graphs. To provide a high-level abstraction to developers, Auto-Pipe presents an abstraction of FIFOs for communication between any two tasks of an application. While simplifying the job of developers, this abstraction requires Auto-Pipe to re-organize and packetize the data for efficient transfers between any two tasks. By contrast, SCF employs message-passing semantics for communication, which allow developers to specify an appropriate message size for obtaining high performance for data transfers. Recently, much effort has been channeled towards standardizing the interface between microprocessors and enhanced devices such as FPGAs [38] and


GPUs [39]. The focus of these efforts has been on low-level interaction between accelerator devices and microprocessors, without making any assumptions on higher-level forms of communication and synchronization. The scope of SCF and SHMEM+ is larger and complements such efforts, as they can overlay a communication framework on top of such existing APIs/languages.

2.3 Performance Modeling

There are various communication models prevalent in the field of HPC that aim to provide developers with a better understanding of the underlying communication infrastructure. Such models provide valuable ideas and useful insight towards our performance model for multilevel communication proposed in Phase 2 of this research for estimating the performance of data transfers in SHMEM+.

Communication models have been prevalent in the field of HPC for estimating the performance of communication in parallel applications. Models like Parallel Random Access Machine (PRAM) [40], Bulk Synchronous Parallel (BSP) [41], LogP [42], LogGP [43], PLogP [44], etc. aim to be fairly generic and architecture-independent, and provide high-level estimates of communication performance for parallel programs. However, these models were developed for traditional HPC systems based on a homogeneous set of microprocessors and cannot be directly employed to represent the multilevel transfers in reconfigurable HPC systems. Other researchers have attempted to build system-level models for heterogeneous systems. Heterogeneous LogGP (HLogGP) [45] considers extensions of LogGP for multiple processor speeds and communication networks within a cluster. In [46], system-level modeling concepts form the basis for a proposed model for heterogeneous clusters. However, the primary emphasis of these models was to target systems based on heterogeneous microprocessors, and thus these approaches are ill-suited for multilevel systems based on accelerators.

In contrast, comparatively few efforts have focused on system-level performance prediction of RC systems. In [47], the RC Amenability Test (RAT) defines an analytical


model for performance estimation of an RC algorithm for a given RC platform prior to any implementation. Although it provides a fairly accurate representation of an algorithm targeting a single device, it was not intended to model the effects of scalable, multi-device applications. By contrast, our proposed model focuses on modeling and estimation of the communication performance in large-scale RC systems while leveraging models such as RAT for estimating the performance of computational parts of the application. The work in [48] proposes a set of application-specific analytical equations for performance prediction of iterative synchronous programs running on heterogeneous clusters with RC devices. However, [48] does not delve into the details of communication modeling and therefore does not consider the nuances of multilevel communication enabled by tools such as SHMEM+.

This research also shares and leverages concepts from other academic projects that aim to improve developer productivity for heterogeneous systems through modeling and simulation. Ptolemy [49] studies modeling, simulation, and design of concurrent real-time embedded systems based on different models of computation. Other researchers have proposed techniques for abstract modeling and estimation to allow users to quickly and more abstractly build model representations of their proposed applications. These models can be easily analyzed and modified in order to efficiently perform design-space exploration before implementation. We leverage one such formulation tool in our work to enable abstract modeling, the RC Modeling Language (RCML). RCML [21] provides hierarchical models for the algorithm, system architecture, and total application mapping, with specialized constructs to express parallelism, communication patterns, and other common aspects in RC. RCML is intended to allow users to quickly model systems before lengthy coding of an implementation, using abstract constructs and quantitative attributes to define behavior. SCF, described in Phase 3 of this research, provides automated translation of RCML models to task-graph


specifications required by SCF to further improve productivity by allowing users to quickly transition from formulation to design.


CHAPTER 3
SHMEM+: A MULTILEVEL-PGAS PROGRAMMING MODEL FOR SCALABLE RC SYSTEMS (PHASE 1)

This chapter presents a multilevel-PGAS programming model for RC systems, which abstracts the memory hierarchy available in the system and presents the designer with a flattened, unified view of the system memory. Furthermore, we employ the model to develop SHMEM+ (i.e., an extended SHMEM library), the first known implementation of SHMEM that enables communication and synchronization between FPGAs and CPUs in scalable RC systems. Using SHMEM+, designers can create scalable, parallel applications that execute over a mix of microprocessors and FPGAs. The high-level abstraction provided by SHMEM+ can yield significant improvement in developer productivity. Concomitantly, for the decomposed tasks of a parallel application, developers of FPGA cores can employ high-level synthesis tools and languages (e.g., Impulse C, Carte, Handel-C) for creating hardware designs for FPGAs to further improve productivity. This study analyzes the performance of our implementation of SHMEM+ and investigates its inherent strengths through two case studies on two different platforms. Although the work in this chapter focuses on HPC systems and applications, the proposed multilevel PGAS model and SHMEM+ library can also be extended to systems based on other types of accelerators such as GPUs, many-core processors, etc.

The remainder of this chapter is organized as follows. A detailed description of the multilevel-PGAS programming model is presented in Section 3.1. Section 3.2 gives an overview of the design of SHMEM+. In Section 3.3, the performance of data-transfer routines available in SHMEM+ is benchmarked. The section also presents two case studies to illustrate the design methodology and evaluate the advantages of application design using SHMEM+. Finally, conclusions from this phase of research are summarized in Section 3.4.


Figure 3-1. System architecture of a typical RC machine.

3.1 Multilevel PGAS

Next-generation RC systems will be targeting FPGA devices in their system architectures in exotic ways to extract performance, ranging from closely coupled, in-socket accelerators to PCIe-based accelerator cards. Figure 3-1 depicts an example RC system, where every node contains a set of processing units (PUs), each a microprocessor or FPGA. With FPGA devices and multicore CPUs, each with one or more associated memory modules, all within a single node, becoming pervasive in high-end computing systems, the existence of multiple levels of memory hierarchy and different permutations of communication is becoming increasingly difficult to ignore. As a result, application developers are presented with a daunting task of orchestrating data amongst heterogeneous devices and several memory components by employing multiple APIs.

There is a need for a parallel-programming model that provides application developers with a high level of abstraction and presents a simplified view of the system, somewhat akin to that provided by the global memory layer in PGAS. However, there are various challenges involved in applying an existing parallel-programming model such as


PGAS to reconfigurable HPC systems. Some of the concepts and semantics associated with the PGAS-based programming model on traditional systems are not directly applicable to such hybrid systems. For example, a majority of parallel programs are described using the SPMD model, where each node in the computational system executes the same program while working on a different part of the input data. Heterogeneous systems comprised of devices with different programming paradigms often require an application developer to create separate programs, one for each type of device in the system, and necessitate a re-definition of SPMD for such systems. Similarly, the multi-tier memory hierarchy that exists in reconfigurable HPC systems warrants a re-examination of the distribution of the virtual, global memory layer of the PGAS programming model over the physical memory resources of a system.

Figure 3-2. High-level abstractions for programming heterogeneous systems. A) An ideal programming abstraction for application developers. B) A more practical and realizable approach.

From the application developers' point of view, an ideal programming model should provide an abstraction where the heterogeneous devices present in the system are treated as logically equivalent PUs (Figure 3-2A). Using such a model, each PU will execute a program instance of an SPMD application, obtained by translating the source


code into logically equivalent operations in different programming paradigms. The PGAS interface on each PU would be responsible for presenting a logically homogeneous system view to the application developers. While it may provide a simplified view of the system, such an abstraction would be difficult to implement and may lead to inefficient utilization of system resources. For example, FPGA devices yield exceptional performance for computations which have a high degree of parallelism, but can lead to inefficiencies when implementing the complete functionality of an SPMD program.

A more practical solution would raise the level of abstraction for application developers while making efficient use of the specialized resources present in a system. Figure 3-2B shows such an approach, where each individual task of an SPMD application is further partitioned across, and collectively executed by, all the PUs on a node. Such a solution can also extend the concept of partitioned, global address space to a multilevel abstraction, which integrates a hierarchy of multiple memory components into a single, virtual memory layer. We call this model multilevel PGAS.

Figure 3-3. Distribution of memory and system resources in multilevel PGAS.

Figure 3-3 shows the physical distribution of memory components that form the global address space in multilevel PGAS. Memory blocks associated with all PUs in the system, irrespective of their physical location and hierarchy in the system architecture, can form a part of the virtual memory layer and have globally unique memory addresses in the system. Both CPUs and FPGAs provide the interfaces required for the global memory abstraction for their corresponding memory blocks. Note that all physical memory


blocks do not have to be a part of the PGAS. The memory blocks that do not form a part of the virtual, global memory layer can be used by their PUs for storing local variables. It should be noted that the memory blocks shown in Figure 3-3 correspond only to off-chip memory resources for the focus of our work. On-chip memory structures of an FPGA such as block RAMs and register files are treated as local storage and not exposed as a part of the PGAS. Such modeling of local storage is similar to that of microprocessor cache and registers, which are hidden from the PGAS layer in traditional HPC systems. Such resources were not included in the global address space in our design because the memory consistency required by parallel applications may preclude the usage of BRAMs as shared resources in most cases. However, our framework does not prevent the usage of such resources in the global address space if it can be supported by the target FPGA platform.

Figure 3-4. Differences in physical and logical resource abstractions of a typical RC system. A) Physical resource layout. B) Logical abstraction provided by multilevel-PGAS model.

Figure 3-4 depicts a detailed view of the physical distribution of resources within each node and its equivalent logical abstraction provided by the multilevel-PGAS model (used by SHMEM+). Although the figure depicts two processing units per node, one CPU and one FPGA, it can be generalized to include any number and variety. As shown in Figure 3-4A, the global address space, partitioned across multiple nodes in the system, is composed of memory blocks which are physically distributed across different processing units within a node. However, the logical abstraction presented to a


designer (shown in Figure 3-4B) is a flattened view of the node's shared memory. Thus, application designers do not have to understand the distribution of data over the physical memory resources when accessing a remote node.

The PGAS interface on each node is responsible for providing application designers with an abstraction of a single, integrated memory block. Similar to the case for memory resources, the logical view of the PGAS interface presented to the developer is different from its physical implementation. The physical implementation of the interface itself is system-dependent and can be realized in different ways by system architects. While each node provides the entire functionality required by the PGAS interface, each PU within a node may implement only a subset of this functionality. The distribution of these responsibilities amongst the PUs within a node is dictated by their capabilities in the system. For example, in our current design, the CPUs provide a majority of the SHMEM functionality and the FPGAs only provide assistance for transfers to and from the FPGA's memory using the vendor-specific memory controllers. As future work, we intend to investigate the feasibility of FPGA-initiated transfers, which will require more extensive support from FPGAs and may lead to some resource utilization on the FPGAs by SHMEM+, unlike our current design.

The multilevel-PGAS model supports two additional features which help in attaining high performance for applications. First, it allows application developers to specify affinity of various application data to specific memory components within a node during memory allocation. Therefore, the data can be placed in a memory block closer to the processing unit that operates on it most frequently. Second, the multilevel-PGAS model requires explicit transfers between local memory components. In systems equipped with multiple non-coherent memory blocks within a node, DMA operations are often employed for data transfer between different memory blocks, which are expensive operations and can significantly hamper application performance. Having explicit calls for data transfers within a local node eliminates the possibility of inefficiencies caused


by transparent but expensive transfers which are implicitly embedded in the application code.

Figure 3-5. Data transfer choices available to the developer with SHMEM+.

3.2 Overview of SHMEM+

Using the multilevel-PGAS programming model, we extend conventional SHMEM to become what we call SHMEM+, a communication library which enables additional communication capabilities between heterogeneous devices. Using SHMEM+, designers can create highly scalable applications that execute over a mix of microprocessors and FPGAs. Previous implementations of the SHMEM API have targeted specific systems [16] and often lacked portability. SHMEM+ is built over services provided by Global Address Space Networking (GASNet, from UC Berkeley) [50], a language-independent communications middleware that provides network-independent, high-performance primitives tailored for implementing parallel GAS languages. As a result, SHMEM+ can be easily ported to other systems that are supported by GASNet by simply modifying the FPGA interfaces that employ vendor-specific APIs.

SHMEM+ offers developers a high-productivity environment for establishing communication in an RC application by providing several choices for


data transfers between devices of a heterogeneous RC system, some of which did not exist in the conventional SHMEM library. Figure 3-5 illustrates the different options for data transfers provided by SHMEM+ in a system with a CPU and an FPGA on each node. The existing transfer capabilities are marked by labels 'a' (CPU-only SHMEM) and 'b' (platform-specific APIs) in the figure, and the ones introduced by SHMEM+ are labeled as 'x' and 'y'. While these additional data-transfer options simplify the process of developing parallel applications and improve productivity, developers should understand the tradeoffs associated with such transfers. For example, a direct transfer between two remote FPGAs eliminates the need for a developer to carefully orchestrate the data through the local CPUs on the source and destination nodes. However, it might occasionally reduce the opportunities of overlapping intermediate steps of communication that exist in the application. SHMEM+ does not force developers to work at a particular level of abstraction. Instead, it provides transfer functions which improve productivity along with functions which allow more detailed control over data transfers, for achieving higher performance. The choice of the data transfers employed will often depend on the characteristics and structure of the target application.

Table 3-1. Baseline functions currently supported in the SHMEM+ library.

  Function        SHMEM+ call        Type    Purpose
  Initialization  shmem_init         Setup   Initializes SHMEM library and other resources
  Comm. ID        my_pe              Setup   Provides a unique ID for each process
  Comm. size      num_pes            Setup   Provides number of PEs in the system
  Finalize        shmem_finalize     Setup   De-allocates resources and gracefully terminates
  Malloc          shmalloc           Setup   Allocates memory for shared variables
  Get             shmem_int_g        P2P     Reads single element from a remote node
  Put             shmem_int_p        P2P     Writes single element to a remote node
  Get             shmem_getmem       P2P     Bulk read from a remote node
  Put             shmem_putmem       P2P     Bulk write to a remote node
  Quiet           shmem_quiet        Synch.  Waits for completion of outstanding puts
  Barrier Sync.   shmem_barrier_all  Synch.  Synchronizes all the nodes
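To make the transfer choices of Figure 3-5 concrete, the pseudocode below (our paraphrase, not code from the dissertation; buffer and handle names are hypothetical) contrasts the pre-existing paths (a, b) with the paths added by SHMEM+ (x, y):

```
/* (a) CPU to remote CPU: conventional SHMEM */
shmem_putmem(cpu_buf, local_data, nbytes, pe);

/* (b) CPU to local FPGA: vendor-specific API */
FPGA_write(fpga_handle, offset, local_data, nbytes);

/* Paths introduced by SHMEM+: same put interface, FPGA-resident targets */
/* (x) CPU to remote FPGA memory */
shmem_putmem(fpga_buf, local_data, nbytes, pe);

/* (y) direct transfer between the FPGA memories of two nodes,
       initiated by the local CPU */
shmem_putmem(remote_fpga_buf, local_fpga_buf, nbytes, pe);
```

The point of the sketch is that in (x) and (y) the destination is selected purely by the symmetric address passed to the call, so FPGA memory is reached without any vendor API at the call site.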


3.2.1 SHMEM+ Interface

Our design of SHMEM+ as described in this chapter focuses on a subset of baseline functions selected from the entire API function set of SHMEM. In this chapter, we discuss 11 baseline functions shown in Table 3-1, which include five setup functions, four point-to-point messaging calls, and two synchronization routines. Some of these functions can be easily extended to support other SHMEM functions; such is the case for single-element and contiguous data-transfer routines. In this version of SHMEM+, we focus primarily on blocking communication. However, we also provide limited support for non-blocking communication, as in the case of transfers between a CPU and its local FPGAs. More extensive support for non-blocking transfers is the focus of our ongoing research and future work. In addition, all of the various transfers are currently initiated by CPU devices, which invoke SHMEM+ functions to transfer data between any two locations.
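The CPU-initiated, address-based routing that makes this possible can be pictured as follows (pseudocode; the partition test and Active Message step are our illustration of the dispatch behavior described in this section, with FPGA_write being the vendor-API wrapper named later in the chapter):

```
/* Inside shmem_putmem(dst, src, nbytes, pe) -- illustrative only */
if pe == my_pe():
    if dst lies in the FPGA partition:  FPGA_write(dst, src, nbytes)
    else:                               memcpy(dst, src, nbytes)
else:
    if dst lies in the FPGA partition:  send an AM request; the target
                                        CPU performs FPGA_write on arrival
    else:                               one-sided put via GASNet's Extended API
```

A single user-visible call thus covers CPU memory and FPGA memory, local and remote, which is what eliminates the need for multiple APIs at the application level.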


Figure 3-6. Multi-FPGA "add-one" application example. A) Task-graph of the application along with the desired mapping of tasks onto devices. B) An abstract representation of the architecture of the target RC system.

It is our objective to keep the interface of SHMEM+ consistent with previous SHMEM implementations. However, the functionality provided by SHMEM+ has been extended in various ways to incorporate support for FPGAs and provide the multilevel-PGAS abstraction. Thus, SHMEM+ functions perform these extra tasks in addition to the ones performed by traditional SHMEM routines. For example, the shmem_init routine performs FPGA initialization (i.e., configuration of the FPGA with the required hardware design) and FPGA memory-management operations, concomitant to the initialization and management of CPU memory segments as performed by the traditional shmem_init function. The routines for data transfer (variations of shmem_get and shmem_put) perform exchanges between any two devices, such as two CPUs or between a CPU and an FPGA, etc. Based on the target memory address specified in the function, SHMEM+ identifies whether the requested data resides in CPU or FPGA memory and employs the appropriate means of transferring the data. In addition, transfers to both remote and local FPGAs can be performed using the same interface, eliminating the need for multiple APIs. Without SHMEM+, application developers must decompose the algorithm in multiple stages, using the conventional SHMEM library for system-level decomposition and lower-level vendor APIs for distributing the functions across various PUs within a node, and then carefully orchestrate the communication through multiple libraries based on the location and type of the target device. The memory allocation routine (shmalloc), which allocates memory for data variables from the shared address space, has been modified to allow users to specify the affinity of any data to a particular memory block in the system. For example, a set of data that is operated upon by an FPGA can be specified to be allocated in FPGA memory which, as explained in Section 3.1, can be beneficial for application performance. The application developer conveys this information by specifying the "type" parameter (type = 0 for CPU memory, 1 for FPGA memory) in the shmalloc function call.

3.2.2 SHMEM+ Application Example

In order to develop a better understanding and appreciation for SHMEM+, we highlight the differences introduced by SHMEM+ in an application through an example. Figure 3-6A shows the task-graph of a multi-FPGA "add-one" application along with


the desired mapping of each task onto a device (labeled alongside in the task graph). The task "Send data", which is mapped on the CPU of Node 0, sends input data to the "Add one" tasks (mapped on the first FPGA of each node) and collects the output data on completion of processing. The architecture of the target RC system for the application is presented in Figure 3-6B. Figure 3-7 lists the code snippets of the add-one application designed using (a) SHMEM+ and (b) a combination of CPU-only SHMEM and vendor-specific APIs. With SHMEM+, an application developer has the capability of transferring the input data to various FPGAs (both local and remote) using the same interface of SHMEM+. Traditionally, this transfer would have been achieved by distributing the data from the CPU on Node 0 to the CPUs on the other nodes. All the nodes would then have to perform a synchronization operation to ensure the receipt of data before proceeding with a transfer to their local FPGAs. Although Figure 3-7B summarizes the transfer to the local FPGA using a single function call, the process is often less than trivial and non-uniform across different FPGA platforms. Even for this simplified example, it is easy to see the benefits of employing SHMEM+ over traditional methods of application design. More complex applications with more intricate communication patterns will benefit from larger reductions in program complexity and developer effort.

3.2.3 Design of SHMEM+

Figure 3-8A illustrates the software architecture of SHMEM+. It makes use of GASNet's Core API, Extended API, and Active Message (AM) services. The setup functions, which perform memory allocation and other initialization tasks, employ the "Core API" services of GASNet. The data transfers to/from CPU memory were built using the "Extended API," which provides direct support for high-level operations such as remote memory access. As a result, SHMEM+ functions that perform transfers between two CPUs can be implemented by simply providing wrappers around the underlying GASNet functions. Since transfers to/from FPGA memory are not directly supported by


underlying GASNet functions, they were developed using the AM service in conjunction with FPGA interfaces that we created for our FPGA platform (more details about our platform are provided in Section 3.3).

Figure 3-7. Code snippets for a multi-FPGA add-one application. A) Designed using SHMEM+. B) Designed using SHMEM.

Figure 3-8. Design of the SHMEM+ library. A) Software architecture of SHMEM+. B) Data transfer example using Active Messages.
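The Active-Message path for reads of remote FPGA memory, summarized from the sequence of Figure 3-8B, can be sketched as follows (pseudocode; the handler names are our own, while FPGA_read and FPGA_write are the vendor-API wrappers described in the text):

```
/* Node 1 (initiator): read nbytes of FPGA memory on remote PE pe2 */
shmem_getmem(dst, fpga_addr, nbytes, pe2):
    send AM request (fpga_addr, nbytes) to the CPU of pe2
    wait until the reply handler signals completion

/* Node 2: AM request handler, runs on the target node's CPU */
on_am_request(fpga_addr, nbytes):
    FPGA_read(fpga_addr, tmp_buf, nbytes)   /* board -> host buffer */
    send AM reply carrying tmp_buf back to the initiator

/* Node 1: AM reply handler */
on_am_reply(payload):
    copy payload into the user-specified dst
    signal completion
```

Note that the FPGA itself stays passive throughout: both handlers run on CPUs, which is why these transfers consume no FPGA resources in the current design.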


Figure 3-8B shows the sequence of steps involved in a transfer using Active Messages when a CPU requests data from a remote FPGA. The CPU on Node 1 initiates the transfer by calling the shmem_getmem function, which sends an AM request to the CPU on the remote node (Node 2). Upon receiving the AM request message, Node 2 invokes an AM request handler which reads the requested data from the local FPGA into a temporary buffer and sends an AM reply message containing the requested data to the initiating node. When Node 1 receives the reply message, it invokes a corresponding reply handler to copy the incoming data into the user-specified location. The message handlers shown in the figure employ FPGA_read and FPGA_write functions, which we developed using the FPGA-board vendor's API to communicate with FPGA memory. Due to overhead incurred by AM services and data access to/from the FPGA board, communication with an FPGA can result in slightly higher latency and lower bandwidth when compared to CPU transfers.

Figure 3-9. Design methodology for application development using multilevel PGAS and SHMEM+.

3.2.4 Design Methodology with SHMEM+

One of the important goals of multilevel PGAS and SHMEM+ is to provide developers with a framework for building large-scale RC applications using familiar techniques of parallel programming. The design methodology for building applications using SHMEM+ is described in Figure 3-9. A developer begins with a baseline


algorithm of the application (step a). A parallel algorithm is obtained by system-level decomposition (step b) of the baseline into multiple tasks, each of which is assigned to a node in the target RC system. Various conventional techniques of decomposition can be employed during this stage, such as SPMD, pipelining, etc. Each task of the parallel algorithm is further decomposed into constituent functions, which are distributed amongst the processing units in each node (step c). In the following step (step d), the developer describes the functions mapped on FPGAs as hardware engines using a hardware description language (HDL) or HLLs. Finally, the remainder of the functions are described in software to be mapped on CPUs (step e). The software also provides the FPGA with the control signals required by the hardware engines developed in the previous step. The functions that are mapped on the CPUs employ SHMEM+ routines to access the PGAS in the system and perform synchronization operations. Although the current version of SHMEM+ only allows CPU-initiated transfers, the capability of FPGA-initiated transfers in the future could provide numerous opportunities for innovative application design. The design flow described here is further exemplified through multiple case studies in Section 3.3.

3.3 Experimental Results

In this section, we present the performance obtained for various memory transfers with SHMEM+ and compare it against the performance obtained with the vendor-proprietary, CPU-only version of SHMEM provided by Quadrics for QsNet systems. We then present two case studies to illustrate the design methodology and evaluate various advantages of application design using SHMEM+. To evaluate portability and scalability of applications designed using SHMEM+, we conducted our experiments on two different systems, the Mu cluster and the Novo-G RC supercomputer, which are detailed in Appendix A.

The SHMEM+ library was initially developed on the Mu cluster and later ported to Novo-G. Porting the SHMEM+ library to the Novo-G system was a straightforward


process, requiring just an installation of GASNet on the new system. Ideally, any system that is supported by GASNet can be supported by SHMEM+ with ease, as in the case of Novo-G. However, in general, the process of porting SHMEM+ onto a new platform requires solving (a) system-level issues and (b) FPGA platform-level issues. To support a new FPGA platform, the system architects are responsible for creating functions to interact with the FPGA board. Although the time to port SHMEM+ is largely dependent on the skills of the system developer and the complexity of the vendor APIs for the FPGA board, it should be on the order of a couple of weeks for an experienced system developer. Our current design of the SHMEM+ library allows each of the four FPGAs on each node to support up to 2 GB of shared memory, which forms a part of the PGAS layer. The remainder of the memory is available to the FPGAs for storing local data.

3.3.1 Benchmarking Performance of Communication Routines

Figure 3-10 depicts the performance of point-to-point communication in SHMEM+ for transfers between two CPUs on our two systems and compares it with the performance obtained by the SHMEM library from Quadrics on the Mu cluster. Bulk communication routines such as shmem_getmem and shmem_putmem attain a peak throughput of about 850 MB/s on the Mu cluster. The bandwidth obtained with SHMEM+ calls, for transfers between two CPUs, is comparable to the proprietary version of SHMEM available from Quadrics for our Mu cluster. The SHMEM+ routines for these transfers benefit from direct support provided by GASNet and thus incur minimal overheads. Novo-G offers higher peak bandwidth (over 1400 MB/s, approx. 75% of the maximum capacity of the network) for point-to-point transfers between two CPUs when compared to the Mu cluster, due to its faster interconnect. Although the peak bandwidth obtained on Novo-G is higher than on the Mu cluster, the performance of GASNet for smaller message sizes is better on the Mu cluster. Since there was no known implementation of SHMEM available for InfiniBand, we compared the performance of SHMEM+ with the performance of the synchronous data-transfer functions present in MVAPICH [51] (an


Figure 3-10. Bandwidth of point-to-point routines using SHMEM+ and Quadrics SHMEM for transfers between two CPUs. Note that in the performance results for the Mu cluster, the line graphs of SHMEM GET and SHMEM+ GET overlap each other. A) On Mu cluster. B) On Novo-G.

implementation of MPI over InfiniBand), which is a commonly used communication library. The graphs indicate that SHMEM+ outperforms MVAPICH for large data transfers and offers a higher peak bandwidth.

Figure 3-11A shows the performance of data transfers between a CPU and an FPGA using SHMEM+ routines on Novo-G. The "Local PUT" and "Local GET" labels represent the bandwidth of data transfers between a host CPU and its local FPGA on the same node. The bandwidth of such local transfers is specific to the particular FPGA board and depends upon a variety of factors associated with the interconnect(s) between the CPU and FPGA, the efficiency of the communication controller on the board, etc. Many RC systems offer a higher bandwidth for read operations from an FPGA (FPGA to CPU) when compared to write operations (CPU to FPGA). Similarly, our system yields a peak bandwidth of approximately 275 MB/s for local put operations (CPU to FPGA) and approximately 1000 MB/s for local get operations (FPGA to CPU). The "Remote PUT" and "Remote GET" labels represent the bandwidth of data transfers between a CPU and an FPGA on a different node. As expected, the bandwidth for such transfers is observed to be lower than the bandwidth attained for local transfers. Figure 3-11B


Figure 3-11. Bandwidth of point-to-point routines on Novo-G using SHMEM+ for transfers between different devices. A) CPU and FPGA. B) Two FPGAs. The label 'Local' in the graphs represents transfers between devices which are on the same node, whereas the label 'Remote' represents transfers between devices on separate nodes. Note that for transfers between two FPGAs, the line graphs of Local GET and Local PUT overlap each other.

shows the performance of transfers between two FPGAs. The label "Local" indicates the two devices are on the same node, whereas "Remote" represents transfers between devices on different nodes.

Table 3-2. End-to-end latency of transfers between various combinations of devices for traditional SHMEM on Mu cluster, and SHMEM+ on Mu cluster and Novo-G. The times are reported in microseconds for a data transfer of 32 bytes.

Transfers between     SHMEM (Mu)   SHMEM+ (Mu)   SHMEM+ (Novo-G)
Two remote CPUs         3.87 us       3.76 us        7.96 us
CPU & local FPGA      205.49 us     213.60 us      644.92 us
CPU & remote FPGA     211.67 us     219.49 us      664.70 us
Two remote FPGAs      427.09 us     425.27 us     1292.22 us

Table 3-2 reports the end-to-end latency (EEL) observed for transfers between various combinations of devices. The smallest data size for transfers in our experiments was restricted to 32 bytes by the requirements of the FPGA board. The second and third columns in the table compare the latencies observed for traditional SHMEM with those observed for SHMEM+ on the Mu cluster. The differences between the two are less than 5% for all the cases. The table also lists the EEL observed for transfers on


Figure 3-12. Processing steps involved in the parallel algorithm for CBIR using SHMEM+.

Novo-G. The latency for data transfers using GASNet on Novo-G is higher than on the Mu cluster, which concurs with the performance graphs presented earlier. From the results presented in this section, it can be observed that the performance of SHMEM+ compares well with conventional SHMEM for transfers between two CPUs, and SHMEM+ performs reasonably well for communication with an FPGA.

3.3.2 Case Study 1: Content-Based Image Retrieval (CBIR)

In this section, we use the CBIR application (Appendix B) to illustrate the design methodology associated with SHMEM+, and evaluate various advantages of application design using SHMEM+. The processing steps involved in the parallel algorithm employed in our experiments are shown in Figure 3-12. We employed the design flow described in Section 3.2 to derive this algorithm as follows:

Step a: The serial algorithm iterates over the set of images in the database to calculate their feature vectors and determine their similarity with the query image. Once


all the images in the database have been processed, the results are sorted in decreasing order of their similarity.

Step b: A parallel algorithm is obtained by distributing the set of images in the database over the processing nodes in the system.

Step c: The set of images is further partitioned amongst the processing units within each node. The number of images to be processed on each FPGA and CPU was determined based on their processing capability. By exploiting fine-grain parallelism available in the algorithm, FPGAs are able to process images at a faster rate than the CPUs and are assigned a larger subset of images.

Step d: Using VHDL, we developed a hardware engine for FPGAs, which iterates over its assigned set of input images to compute their feature vectors and evaluate their similarity to the specified query image.

Step e: We developed software code for processing a subset of images on the CPU, providing each FPGA with control signals to initiate processing and wait for completion, and transferring results from all processing units in the system to the root node (node 0) at completion.

In addition to the software parallelism described in the algorithm above, our hardware design for each FPGA instantiates multiple computational kernels that operate on five images in parallel. Figure 3-13 compares the execution time and speedup versus a serial software baseline for different implementations of a CBIR algorithm on the Mu cluster. Our experiments were conducted for an image size of 128x128, with the search database consisting of approximately 2800 images. Advantages of using FPGA devices are evident through faster execution times for RC-based implementations over software-only solutions. The FPGAs were able to process images at a much faster rate than the CPUs, leading to over 30x speedup with four nodes when employing FPGA devices. Figure 3-13 also compares the performance of the algorithm implemented using SHMEM+ with a solution implemented using a combination of the Quadrics SHMEM


Figure 3-13. Performance comparison of different implementations of the parallel CBIR application on the Mu cluster. Software designs involve only CPU devices whereas RC designs involve both a CPU and FPGAs on each node. Experiments were conducted for a search database consisting of 2800 images, each of size 128x128. A) Execution time of different designs. B) Speedup obtained by different designs when compared to a serial software baseline running on a single processor.

(CPU-only) library and platform-specific APIs for interaction with FPGAs. It is evident that the application developed using SHMEM+ incurs minimal overhead compared to traditional techniques of development, where expert developers have access to vendor APIs.

More importantly, SHMEM+ provides application developers with a parallel-programming model that enables productive and portable design of scalable RC applications. A variety of factors that contribute towards improvement in developer productivity are listed in Table 3-3. Applications developed without using SHMEM+ exhibit higher conceptual complexity. Application developers are often forced to employ multiple libraries with varying APIs to incorporate communication amongst a cluster of host CPUs and to facilitate coordination between a host CPU and its local FPGAs. In addition, any communication with an FPGA on a remote node must be explicitly routed by the developer through the host CPU on that node. The processing on the host CPU of the remote node must be interrupted to service this communication request, which further increases the complexity of developing the


parallel program. With SHMEM+, a developer is oblivious to such details and exposed to a higher level of abstraction. Similarly, high-level SHMEM+ functions eliminate the need to explicitly perform various intermediate steps of communication, leading to a reduction in code size. Portability and scalability increase the application lifespan and reduce the recurring cost that would have been involved without the use of a library like SHMEM+. A combination of the factors shown in Table 3-3 (and more) has a collective influence on the development time for an application. Although a comprehensive analysis of the impact of each of the factors listed here is beyond the scope of this research, to understand the productivity gains of SHMEM+, we present a brief discussion about the total development hours spent in application development by our team.

Table 3-3. Major factors that contribute towards increased developer productivity when using SHMEM+.

Program complexity: High-level abstraction provided by SHMEM+ functions shields the application developer from various underlying details.
Learning curve: Familiar APIs and programming model lead to a reduced learning period when migrating to a new system.
Source lines of code: Each SHMEM+ function can perform several intermediate steps of communication, which eliminates the need for extra code and function calls.
Application portability: Applications have a longer lifecycle as they can be executed on a variety of platforms.
Scalability: Reduction in recurring developer effort to execute an application on systems of different sizes.

Table 3-4 compares the development hours spent by our team during various stages of application design employing (a) traditional techniques of implementation and (b) SHMEM+. Although the numbers cited are specific to our team personnel, we believe they are a fair estimate of the improvements expected from SHMEM+. For the numbers cited in Table 3-4, we assume the developer has experience in parallel programming and in creating FPGA designs using VHDL. In addition, we assume


the developer is new to the RC system and hence has to undergo a learning process to familiarize with the platform, which is often the case when porting applications to a new system. The rows in Table 3-4 report the time spent (in terms of 8-hour workdays) in the phases of application development which required a significant amount of time and effort. The time spent in each activity also includes the hours spent on debugging in that phase, wherever applicable. Since SHMEM+ does not modify the process of developing hardware cores for FPGAs, the time required for FPGA-core development (first two rows of the table) remains unaffected for both techniques, but has been included here for completeness. It should be noted that we employed HDL for developing our hardware cores; further reductions in effort can be obtained by employing HLLs, if they are supported by the target platform. The time spent in parallel-software development includes the amount of time a developer spends familiarizing with the platform-specific API, learning the SHMEM API, and finally designing the parallel application using these APIs. When using SHMEM+, a developer employs the SHMEM+ interface for interacting with FPGA memory and hence has to understand only the subset of platform-specific APIs required for sending (or receiving) control signals, spending less time in this phase (third row in the table). By contrast, the learning period involved for the SHMEM API remains unaffected, as both techniques expose the developer to a similar interface. Due to the higher level of abstraction provided by SHMEM+, a reduction in the time for designing the parallel application was observed, as indicated by the fifth row in the table. As shown in the sixth row, an overall reduction in time and effort of about 30% was obtained for the total time spent in the various phases of parallel-software development. Such an improvement could translate to significant savings in the development hours and money spent on the design of a complex application. For example, an application that required 10 weeks of development time could now be completed in just 7 weeks. Applications with more complex communication patterns are expected to have higher gains in productivity.


Table 3-4. Development hours spent in developing the CBIR application. Time is reported in terms of 8-hour workdays.

Development phase                          Traditional SHMEM   SHMEM+
FPGA-core development
  Learning platform-specific wrappers      10 days             10 days
  App-core design                          15 days             15 days
Parallel-software development
  Platform-specific API learning period    5 days              2 days
  SHMEM API learning period                5 days              5 days
  Parallel-application design              10 days             7 days
  Total software development time          20 days             14 days

Since SHMEM+ applications do not employ any vendor-specific APIs for interaction with FPGAs, applications developed using SHMEM+ are highly portable. As long as the SHMEM+ library can be supported on a target RC system, any application designed with SHMEM+ can execute on it without requiring changes to the application source code. We evaluated this portability by migrating the CBIR application from our Mu cluster to Novo-G. This process did not require any modification to the software and hardware source code. A simple re-compilation of the software was required to obtain an executable for the new system. It should be noted that for a system with a different FPGA board, some modifications will be needed for the hardware designs. Figure 3-14 shows the performance of the CBIR application scaling up to 16 nodes on Novo-G. We expanded our search database to include approximately 22,000 images on this larger system. The graphs in Figure 3-14 compare the performance of a design developed using SHMEM+ with a design based on traditional SHMEM. It is evident that designs developed using SHMEM+ continue to offer comparable performance to designs based on traditional SHMEM for larger system sizes. The minor variation in performance of the design created using SHMEM+ when compared to the one created with SHMEM is due to the difference in the communication mechanism used to gather the results on the root node in the last processing step. As with any form of high-level abstraction, a tradeoff exists between productivity and performance. For our application


Figure 3-14. Performance comparison of different implementations of the parallel CBIR application on Novo-G. Software designs involve only CPU devices whereas RC designs involve both a CPU and FPGAs on each node. Experiments were conducted for a search database consisting of 22,000 images, each of size 128x128. A) Execution time of different designs. B) Speedup obtained by different designs when compared to a serial software baseline running on a single processor.

design using SHMEM+, the ability to use direct transfers to the remote FPGAs eliminates the opportunity to overlap the intermediate steps of communication for this design. However, SHMEM+ does not force developers to work at a particular level of abstraction. Instead, it provides application developers with multiple options to meet the demands of the application. Although the performance penalty incurred by the designs created using SHMEM+ is minimal, we made a minor modification to the implementation of the quad-FPGA designs to eliminate this penalty, as discussed in the following paragraphs.

Parallel algorithms employing multiple FPGAs on each node exhibit more complex communication patterns and often require increased developer effort to obtain an efficient implementation. SHMEM+ has the ability to support multiple FPGAs on each FPGA board using the same interface, which further simplifies the development process and yields additional improvement in productivity. Figure 3-15 compares the performance of different designs employing all four FPGAs on each node of Novo-G. The algorithm designed using SHMEM+ was modified slightly to optimize the collection of results at the end on the root node (Node 0). Instead of the root node


Figure 3-15. Performance comparison of different implementations of the parallel CBIR application on Novo-G. Software designs involve only CPU devices whereas RC designs involve both a CPU and four FPGAs on each node. Experiments were conducted for a search database consisting of 22,000 images, each of size 128x128. Designs based on SHMEM+ continue to offer minimal overheads when compared to traditional techniques of application development. A) Execution time of different designs. B) Speedup obtained by different designs when compared to a serial software baseline running on a single processor.

using a "GET" routine to receive results from all the FPGAs, each processing node uses a "PUT" function to transfer the results from each of its local FPGAs to the root node. This optimization allows the designs created with SHMEM+ to exhibit excellent scaling behavior and minimal overheads when compared to the designs created using a combination of CPU-only SHMEM and vendor APIs.

3.3.3 Case Study 2: Two-Dimensional FFT

As our next case study of a parallel application, a two-dimensional FFT was chosen because of its emphasis on a more complex communication pattern and its relevance in a variety of application domains such as medical imaging systems, Synthetic Aperture RADAR (SAR) systems, and image processing [52-54]. The heavy computational demands of the Fourier transform [55] pose tremendous pressure on the capabilities of computation platforms in most real-world applications, as a result of which several researchers have explored FPGA implementations of it [56, 57]. A 2-D FFT


Figure 3-16. Abstract representation of processing steps involved in a parallel two-dimensional FFT algorithm.

operation on an image is performed by decomposing it into a series of 1-D FFTs over the rows of the image, followed by a series of Fourier transforms over the columns.

Our parallel implementation of the 2-D FFT algorithm distributes rows of the input image across the computational nodes, which perform a 1-D FFT over their assigned subset of rows, as shown in Figure 3-16. A corner turn (distributed transpose), which involves all-to-all communication between the processing nodes, is required to re-distribute the data across all the nodes. The nodes then compute 1-D FFTs over the columns of the image. Another corner turn is required to re-organize the data and recover the transformed output image. Following the design flow described earlier, we derive our implementation as follows:

Step a: The serial algorithm computes the 2-D FFT of the image by performing a series of 1-D FFTs over the rows followed by 1-D FFTs over the columns of the image.

Step b: Our parallel algorithm is obtained by using block decomposition to distribute a subset of rows and columns to be transformed on each node. An all-to-all communication is required to re-distribute the data between the two stages of 1-D FFTs.

Step c: The FPGA on each node is able to perform 1-D FFT operations faster than the CPU and is hence assigned to transform the assigned set of rows and columns.


Since the CPUs are more efficient in re-organizing the data in their local memory than the FPGAs, they are assigned to perform the corner turn.

Step d: Using VHDL, we developed a hardware engine for FPGAs to perform a series of 1-D FFTs over its assigned set of input data. The FPGA waits for a control signal from the CPU to begin processing and indicates completion of the transforms through another control signal.

Step e: Software code on the CPU is responsible for transferring the input data to the FPGAs and reading the transformed output once the FPGA completes processing. The software code also performs an all-to-all communication to complete the corner turn. In addition, it provides the control signals required by the FPGAs.

Figure 3-17 compares the execution time and speedup versus a serial software baseline for different implementations of a parallel 2-D FFT algorithm on Novo-G. Our experiments were conducted for an 8k x 8k image. Similar to our first case study, designs for the 2-D FFT implemented using SHMEM+ yield performance which is comparable to designs implemented using a combination of the traditional, CPU-only SHMEM library and platform-specific APIs. The minor difference in the performance of these two designs is within reasonable limits of experimental error. A comparison of the hours spent in developing the 2-D FFT algorithm using both of these techniques is presented in Table 3-5. For this case study, we assume a parallel application developer is familiar with the SHMEM API and does not have to spend any effort/time learning it. However, since platform-specific learning is a common occurrence each time the application is ported to a new platform, we retain the learning period for the platform-specific API. Our estimates indicate an improvement of approximately 25% in developer productivity. Portability experiments were also conducted for the second case study. Since most of the results and inferences from these experiments were consistent with the first case study, they are not repeated here.


Figure 3-17. Performance comparison of different implementations of the parallel 2-D FFT algorithm on Novo-G. Software designs involve only CPU devices whereas RC designs involve both a CPU and FPGAs on each node. Experiments were conducted for an 8k x 8k image. A) Execution time of different designs. B) Speedup obtained by different designs when compared to a serial software baseline running on a single processor.

Table 3-5. Development hours spent in developing the two-dimensional FFT application.

Development phase                          Traditional SHMEM   SHMEM+
FPGA-core development
  Learning platform-specific wrappers      15 days             15 days
  App-core design                          15 days             15 days
Parallel-software development
  Platform-specific API learning period    5 days              2 days
  Parallel-application design              12 days             10 days
  Total software development time          17 days             12 days

3.4 Conclusions

The lack of integrated, system-wide, parallel programming models has limited current RC applications to small system sizes. To realize the full potential of reconfigurable HPC systems, parallel-programming models and languages that are suited to such systems are critical yet lacking. In this phase of research, we presented a parallel-programming model and a communication library for scalable, heterogeneous, reconfigurable systems. The multilevel-PGAS model proposed in this chapter is able to capture key characteristics of RC systems, such as different levels of memory hierarchy and differences in the execution model of the heterogeneous devices present in the system.


The existence of such a programming model will enable productive development of scalable, parallel applications for reconfigurable HPC systems.

Using the multilevel-PGAS programming model, we extended the existing SHMEM library to SHMEM+, the first known version of the library that enables designers to create scalable applications that execute over a mix of microprocessors and FPGAs. SHMEM+ offers developers of RC applications a high level of abstraction that allows them to facilitate complex communication between heterogeneous devices in an application, while providing high productivity and performance. Results from our experiments and case studies demonstrate that the performance offered by SHMEM+ is comparable to an existing vendor-proprietary version of SHMEM. Our case studies showcase the simplified design process involved with SHMEM+ for developing scalable RC applications, which is very similar to traditional methods for development of parallel applications. More importantly, the higher level of abstraction provided by SHMEM+ leads to significant improvement in productivity without sacrificing performance significantly. Although it is difficult to quantify the productivity gains, our case studies demonstrate an average improvement in productivity of about 30%. In addition, by hiding the details of vendor-specific FPGA communication from developers, SHMEM+ creates highly portable applications.


CHAPTER 4
PERFORMANCE MODELING FOR MULTILEVEL COMMUNICATION IN SHMEM+ (PHASE 2)

This chapter focuses on Phase 2 of this research. Programming models and libraries for heterogeneous, parallel, and reconfigurable computing such as SHMEM+ are useful for supporting communication involving a diverse mix of processor devices. However, to evaluate the impact of communication on application performance and obtain optimal performance, a concrete understanding of the underlying communication infrastructure is often imperative. One important challenge when using libraries such as SHMEM+ is choosing an appropriate communication strategy. Variables include who initiates the transfer, what functions are employed, what intermediate steps are involved, etc. Such factors can have a significant impact on the overall performance of an application. Therefore, it is critical for an application designer to understand the underlying communication infrastructure. Traditional HPC systems and applications assist designers in understanding and optimizing communication infrastructure by employing communication models that use a set of parameters to provide a high-level representation of communication, while hiding the unnecessary details. Although a few general-purpose communication models have been prevalent for traditional HPC systems, these models were not developed to target multilevel systems such as reconfigurable HPC systems. As a result, these models lack sufficient details for representing multilevel systems completely and accurately.

In this phase of research, we introduce a new multilevel communication model for representing the various data transfers provided by the SHMEM+ library and for predicting their performance. The proposed model provides a system-level representation which integrates the effect of the intermediate steps of communication present in a multilevel transfer. Additionally, the model allows for the capability of overlapping these intermediate steps, which is critical for obtaining high performance. The potential merits of the proposed model are showcased in this chapter through several examples.


Besides attaining higher performance from the communication infrastructure, such a model can lead to considerable improvement in the productivity of application and system developers.

The remainder of this chapter is formatted as follows. An overview of the multilevel communication model is presented in Section 4.1. In Section 4.2, we showcase the effectiveness of the model in enabling early DSE for an application. Section 4.3 illustrates the potential use of the model to optimize the performance of the SHMEM+ library for a target RC system. In Section 4.4, we discuss a mechanism to augment SHMEM+ with a self-tuning capability using the model. Finally, Section 4.5 provides conclusions from this phase of research.

4.1 Overview of Multilevel Communication Model

The SHMEM+ library described in Chapter 3 provides a high-productivity environment for establishing communication in RC systems by abstracting the details of the underlying data transfers with a uniform, high-level interface. Figure 4-1 (reproduced from Figure 3-5) illustrates the different options for data transfers provided by the SHMEM+ library in a system with a CPU and an FPGA on each node (although the library can support any number and variety of devices). Some of the direct-transfer capabilities shown in the figure, such as the ones labeled 'x' and 'y', were enabled by SHMEM+ and did not exist previously.

While having various options to transfer data from one device to another simplifies the development of parallel applications, users should understand the underlying mechanisms and tradeoffs associated with such transfers in order to obtain optimal performance. Since only the CPU on each node can physically perform inter-node communication, some of the data transfers are internally achieved through multiple intermediate transfer steps. Consider a function that provides direct transfer between two remote FPGAs (labeled as 'y' in Figure 4-1). This function is internally performed by (a) transferring data to internal buffers on the local CPU, (b) followed by a transfer to the


Figure 4-1. Transfer capabilities supported by various communication routines in SHMEM+.

CPU on the remote node, and (c) finally writing the data to the memory of the destination FPGA. However, such a function may serialize the aforementioned transfer steps, which could have been explicitly parallelized by an application developer. For example, in an application where data is collected from several FPGA devices on a remote CPU, an application developer can explicitly parallelize the transfer of data from the FPGAs to their local CPUs on all nodes and then transfer the data from the CPUs on each node to the remote CPU where the data is collected. To enable better understanding of the underlying details of communication present in SHMEM+, we derive a multilevel communication model for the same.

The communication time for a point-to-point data transfer between any two devices of the system shown in Figure 4-1 can be represented by a generic model described as:

    T_comm = f(T_cpu<->cpu) + f(T_cpu<->fpga)    (4-1)


where
T_comm : total time for communication,
T_cpu<->cpu : transfer time between two CPUs on remote nodes,
T_cpu<->fpga : total time for the various transfers between a CPU and its local FPGA,
f(T) = (serial time T for communication) - (part of time T hidden by overlap with other communication)
     = non-overlapped part of time T.

The goal of the model in Equation 4-1 is to estimate the performance of a data-transfer routine by incorporating each intermediate step of the transfer separately. Depending upon the source and destination devices, a communication routine may require two transfers between a CPU and an FPGA (such as the case for the transfer labeled 'y' in Figure 4-1). T_cpu<->fpga in Equation 4-1 represents the cumulative time of both transfers in such cases. The multilevel model does not estimate the performance of individual transfer steps or impose the use of a specific model for the same. Instead, it allows the use of different communication models for representing each component of the complete transfer. Such modeling is important because it is often difficult to describe the various intermediate steps of a transfer using a single communication model. In our experiments we perform microbenchmarking to determine the time for the intermediate steps of communication accurately. Many RC systems exhibit asymmetric performance for read (FPGA to CPU) and write (CPU to FPGA) operations. To account for this asymmetry in the speed of read and write operations for an FPGA, the model can be rewritten as:

    T_comm = f(T_cpu<->cpu) + f(T_cpu<-fpga) + f(T_cpu->fpga)    (4-2)


where
T_cpu<-fpga : transfer time for a read operation from the FPGA,
T_cpu->fpga : transfer time for a write operation to the FPGA.

By overlapping various steps of communication, it is possible to reduce the overall time taken by the transfer. It should be noted that the model in Equation 4-2 (and Equation 4-1) only considers the effective (non-overlapped) time of each level of communication that influences the overall performance of communication. The provision for "effective" time is an important feature of the model because communication with the FPGA is invariably an expensive operation and is often overlapped with other steps of communication to improve overall performance. This feature of the model will be employed to optimize transfers between two FPGAs on remote nodes in Section 4.3. In addition to estimating communication time, the multilevel communication model can also be combined with existing approaches such as RAT [47] to estimate the execution time of the complete application.

In the following three sections, we discuss several benefits of the multilevel communication model through three use cases. First, the model can enable application developers to perform early design-space exploration of communication patterns in their applications before undertaking the laborious and expensive process of implementation, yielding improved performance and productivity. Second, the model can be employed by system developers to quickly optimize the performance of data-transfer routines within SHMEM+ when being ported to a new platform. Third, the model can be used to augment SHMEM+ to automatically improve the performance of data transfers by self-tuning its internal parameters to match platform capabilities, which improves the portability of SHMEM+. The experiments for this phase of research employ two different systems, the Mu cluster and Novo-G, which are described in Appendix A.


4.2 Early DSE in Applications

One advantage of the multilevel communication model is that it enables early design-space exploration (DSE), which can improve performance and reduce development time for an application. By providing performance estimates, the model allows developers to make design decisions about the communication infrastructure before undertaking expensive implementation. We illustrate this capability using the example of a content-based image retrieval (CBIR) application, detailed in Appendix B.

The parallel algorithm employed in our experiments distributes the set of images to be searched over a set of nodes and allows multiple processing devices to evaluate these images simultaneously (more details about the implementation of the algorithm can be found in Section 3.3.2). The steps involved in our parallel implementation based on SHMEM+ are as follows:

1. All nodes perform initialization using shmem_init, which also configures the local FPGAs on each node with the desired bitfile.

2. CPUs on all nodes read their subset of input images from a storage device (such as a local hard disk or a network storage device) along with the feature vector of the query image.

3. CPUs transfer the subset of images to their local FPGAs for hardware acceleration using the shmem_putmem function.

4. CPUs initiate the execution on their local FPGAs through a "GO" signal. FPGAs on all nodes compute feature vectors and similarity measures for their subset of images in parallel.

5. FPGAs signal the completion of execution to the local CPUs through a "DONE" signal. Once computation on the CPUs and FPGAs on each node is complete, all nodes synchronize using shmem_barrier_all.

6. Finally, the similarity values from all of the FPGAs are gathered on the root node using one of several approaches (determined by our analysis presented in Section 4.2.1). The results are then sorted in decreasing order of similarity.

By employing multiple FPGAs to accelerate the application, the computation time of the algorithm can be reduced significantly, as shown by the performance of a typical


implementation in Figure 4-2A. The single-FPGA designs refer to designs which employ multiple nodes with each node using a single FPGA. Similarly, the quad-FPGA designs execute over multiple nodes with each node using four local FPGAs. As the application is scaled across more nodes, communication starts becoming a significant proportion of the total time. Figure 4-2B shows that the amount of time required to collect the results on the root node increases substantially with an increasing number of processing nodes. The effect is more pronounced for the quad-FPGA designs, as the results need to be collected from more FPGAs (64 FPGAs for a system size of 16 nodes). As a result, the relative performance gain from employing more nodes (and FPGAs) starts decreasing.

Figure 4-2. Results for the CBIR application observed from experiments conducted on Novo-G for a search database consisting of 22,000 images, each of size 128×128 pixels of 8 bits. A) Scaling behavior of the application. B) Time to gather results (4 bytes per image, i.e., 88KB of data) on the root node as a percentage of application execution time.

To resolve this bottleneck, we need to understand the mechanism by which the application collects results on the root node. The gather operation in the application can be implemented in multiple ways, such as:

Approach 1: By using a shmem_getmem function on the root node to receive the output data from each of the FPGAs involved in the application individually.


Approach 2: By using a shmem_putmem function on every node to send the results computed by each FPGA to the root node individually.

Approach 3: By first collecting the results from the local FPGAs on every node's CPU and then sending them to the root node collectively using shmem_putmem on each node.

4.2.1 Performance Estimation

While performing the same gather operation, the aforementioned approaches can offer different performance and require different levels of developer effort. To evaluate the impact of these three approaches on the overall performance of the CBIR application, we estimate the performance of the gather operation using each approach. The difference in performance becomes apparent by applying our multilevel communication model to the three approaches. The estimated time for performing the gather operation can be expressed using the model as:

T_gather = Σ_{i=1}^{n} T_comm,i    (4-3)

T_comm,i = f(T_cpu←fpga) + f(T_cpu↔cpu) + f(T_cpu→fpga)
         = (non-overlapped part of T_cpu←fpga) + (non-overlapped part of T_cpu↔cpu) + 0    (4-4)

where

- n: number of nodes involved in the gather operation,
- T_gather: total time for the gather operation,
- T_comm,i: communication time corresponding to the i-th node.

Since there are no write operations to an FPGA, the corresponding term becomes zero in Equation 4-4. While Approach 1 seems the most intuitive way to perform a gather, the process of invoking a get operation to receive data from all devices


individually serializes all of the transfers. As a result, none of the communication time can be overlapped. Therefore, the time to perform the gather operation for Approach 1 can be described as follows:

Approach 1, single-FPGA design:
T_gather = (n−1) · (T_cpu←fpga + T_cpu↔cpu) + T_cpu←fpga    (4-5)

Approach 1, quad-FPGA design:
T_gather = 4 · [(n−1) · (T_cpu←fpga + T_cpu↔cpu) + T_cpu←fpga]    (4-6)

Note that the root node only requires a read operation from its local FPGA, which does not involve a network transaction (the last term in Equations 4-5 and 4-6).

By contrast, Approach 3 requires more effort from a developer, as it first collects the data from the FPGAs on the local CPU before sending the data across to the root node. However, this approach allows each node to overlap the read operation from its local FPGAs (which is usually an expensive operation) with the same operation on the other nodes. Therefore, the time to perform the gather operation for Approach 3 can be represented as follows:

Approach 3, single-FPGA design:
T_gather = T_cpu←fpga + (n−1) · T_cpu↔cpu    (4-7)

Approach 3, quad-FPGA design:
T_gather = 4 · T_cpu←fpga + (n−1) · T_cpu↔cpu    (4-8)

Approach 2, which appears very similar to Approach 1 on the surface, behaves much like Approach 3. Allowing each node to put the data directly from the local FPGAs to the root node essentially overlaps the part of the transfer which reads the data from the local FPGAs (into a temporary buffer in SHMEM+) with the same operation on the


other nodes. As a result, Approach 2 yields performance that is comparable to the third approach while requiring comparatively lower developer effort. The performance of Approach 2 is identical to Approach 3 for single-FPGA designs. For quad-FPGA designs, Approach 2 requires four CPU-to-CPU transfers from each node to the root node (each of a smaller data size), as opposed to the single transfer required by Approach 3. The estimated time can be represented as follows:

Approach 2, single-FPGA design:
T_gather = T_cpu←fpga + (n−1) · T_cpu↔cpu    (4-9)

Approach 2, quad-FPGA design:
T_gather = 4 · [T_cpu←fpga + (n−1) · T_cpu↔cpu]    (4-10)

To compute the estimates for the time required by the gather operation, we performed microbenchmarking to determine T_cpu←fpga and T_cpu↔cpu for different data sizes. Table 4-1 lists the input parameters (n, T_cpu←fpga, and T_cpu↔cpu), the estimated gather times, and the corresponding times observed experimentally for performing the gather operation using single-FPGA designs on Novo-G. The results reported in the table correspond to a gather operation collecting 8MB of data on the root node (to allow for longer gather times in our analysis, a data size much larger than required by our CBIR implementation was chosen). The estimates listed in the table for Approaches 1, 2 and 3 are computed using Equations 4-5, 4-9, and 4-7, respectively. For example, when n = 16 (last row of Table 4-1), the estimate for Approach 3 can be computed from Equation 4-7 as:

T_gather = T_cpu←fpga + (n−1) · T_cpu↔cpu = 1.51 + (16−1) · 0.37 = 7.06 ms


Table 4-1. Observed and estimated time to perform the gather operation by single-FPGA designs using three different approaches on Novo-G. All times are reported in milliseconds. The total amount of data collected on the root node is 8MB in all cases.

 n   T_cpu←fpga  T_cpu↔cpu   Approach 1      Approach 2      Approach 3
                             Est.    Obs.    Est.    Obs.    Est.    Obs.
 1   8.14        5.72         8.14    8.33    8.14    8.27    8.14    8.22
 2   4.93        2.88        12.74   12.41    7.81    7.98    7.81    7.98
 4   2.93        1.45        16.07   16.79    7.28    6.90    7.28    6.84
 8   1.91        0.73        20.39   22.28    7.02    6.45    7.02    6.39
16   1.51        0.37        29.71   35.12    7.06    6.17    7.06    6.16

Table 4-2. Observed and estimated time to perform the gather operation for quad-FPGA designs on Novo-G using Approaches 1 and 2. All times are reported in milliseconds. The total amount of data collected on the root node is 8MB in all cases.

 n   T_cpu←fpga  T_cpu↔cpu   Approach 1      Approach 2
                             Est.    Obs.    Est.    Obs.
 1   2.93        1.45        11.72   12.00   11.72   12.00
 2   1.91        0.73        18.20   20.85   10.56   11.99
 4   1.51        0.37        28.60   31.72   10.48    8.54
 8   1.24        0.20        45.28   52.50   10.56    8.29
16   1.24        0.10        85.36   81.37   10.96    8.21

The estimated and observed times for the quad-FPGA designs on Novo-G are listed in Tables 4-2 and 4-3, along with the corresponding input parameters required for computing the estimates. In Approach 3, CPUs collect data from their local FPGAs before sending the collected data to the root node. As a result, T_cpu↔cpu for Approach 3 is different than for Approaches 1 and 2. The results are tabulated in separate tables

Table 4-3. Observed and estimated time to perform the gather operation for quad-FPGA designs on Novo-G using Approach 3. All times are reported in milliseconds. The total amount of data collected on the root node is 8MB in all cases.

 n   T_cpu←fpga  T_cpu↔cpu   Approach 3
                             Est.    Obs.
 1   2.93        5.72        11.72   13.93
 2   1.91        2.88        10.52   10.99
 4   1.51        1.45        10.39   10.64
 8   1.24        0.73        10.07    8.57
16   1.24        0.37        10.51    8.95


(Tables 4-2 and 4-3). The estimates listed in the tables for Approaches 1, 2 and 3 are computed using Equations 4-6, 4-10 and 4-8, respectively. For example, when n = 8 (in Table 4-2), the estimate for Approach 2 can be computed from Equation 4-10 as:

T_gather = 4 · [T_cpu←fpga + (n−1) · T_cpu↔cpu] = 4 · [1.24 + (8−1) · 0.20] = 10.56 ms

Figure 4-3 shows the estimated performance of the gather operation using the different approaches for (a) single-FPGA designs and (b) quad-FPGA designs. While the time required for gathering the data increases significantly for the first approach, it stays relatively constant for the second and third approaches. The effect is more pronounced for the quad-FPGA designs because there are more processing devices involved in the gather operation.

Figure 4-3. Estimated performance of the gather operation on Novo-G using three different approaches. A) For single-FPGA designs. B) For quad-FPGA designs. The total amount of data collected on the root node is 8MB in all cases.

4.2.2 Experimental Results

The estimates computed by use of the proposed model were verified by comparing them with the experimental performance observed for the three approaches (reported in Tables 4-1, 4-2 and 4-3). The average error between the estimates and experimental results was 9.4% (which is considered reasonably accurate given the focus on early DSE prior to any implementation). A few cases experienced higher errors (between


20-25%), such as the quad-FPGA design over 16 nodes using Approach 2. With more processing devices, the time for the intermediate steps of the transfer became considerably small, such that large variations were observed in the experimental data and in the results of microbenchmarking (which form the inputs for the estimation model). As a result, even a minor deviation in measurements manifested as a larger relative error. Although not ideal, the absolute error had little impact on our design decisions, as the trends observed between the experimental data and the estimates based on our model were consistent.

Figure 4-4. Speedup observed for quad-FPGA designs of CBIR using the three approaches over a sequential baseline executing on a single CPU on Novo-G. Experiments were conducted for a search database consisting of 22,000 images, each of size 128×128 pixels of 8 bits.

From the behavior of the three approaches, it appears that a CBIR application employing Approach 1 would suffer significant performance degradation as the system size increases, whereas with Approaches 2 and 3 the application would continue to offer satisfactory performance. The choice between using either of the two approaches (2 or 3) would be based on the level of programming complexity and the effort required from the developer. Approach 2 appears promising, as it offers performance comparable to Approach 3 while requiring lower developer effort. To observe the effect of the three approaches on overall application performance, we developed a full CBIR implementation using each of the three approaches. Figure 4-4 shows the performance obtained for quad-FPGA designs of the CBIR application on Novo-G using the three approaches. The results agree with the behavior predicted based


Figure 4-5. Intermediate steps involved in data transfers between two remote FPGAs using SHMEM+.

on the estimates from our model. Approaches 2 and 3 offer similar performance, better than Approach 1. By using the proposed model to quickly perform early DSE of communication patterns, we were able to improve the performance of the application by approximately 42%. This exercise showcased the use of our multilevel communication model as a tool for enabling DSE in the early phase of application development. Such a tool can eliminate several iterations of an expensive design cycle and help improve developer productivity.

4.3 Optimizing SHMEM+ Functions

The multilevel communication model enables system developers to quickly optimize the performance of a communication library such as SHMEM+ when porting it to a new system. Consider the case of data transfers between two remote FPGAs. As shown in Figure 4-5, such a transfer can be performed by (1) reading the data from the source FPGA to its local CPU; (2) transferring the data from the local CPU to the CPU on the remote node; and (3) finally, writing the data from the CPU on the remote node to the destination FPGA.

Instead of performing Steps 1, 2 and 3 sequentially, an efficient implementation of the transfer shown in Figure 4-5 may overlap the intermediate steps of such a transfer. For example, Step 3 can be overlapped with Steps 1 and 2 collectively. In order to overlap various steps of a transfer, the data needs to be divided into smaller packets.


The size of the data packets can have a significant impact on the overall performance of the transfer. A small packet size would lead to low performance for the intermediate steps of the transfer, while a large packet size would limit the amount of communication that can be efficiently overlapped. Determining an appropriate packet size can be a laborious task and may occasionally require development of a testbench and numerous executions of the testbench with various packet sizes until satisfactory performance is obtained.

4.3.1 Performance Estimation

The proposed model can assist system developers in determining an appropriate packet size without requiring any test code. To estimate the performance of the transfer for a particular packet size, our model can be employed as follows. Let

- N: number of packets,
- D: size of the data being transferred, in bytes,
- P: size of the data packets, in bytes,
- T(L): time T for transferring L bytes.

Then,

N = ⌈D/P⌉    (4-11)

T_step1 = T_cpu←fpga(P) when D > P;  T_cpu←fpga(D) when D ≤ P    (4-12)

T_step2 = T_cpu↔cpu(P) when D > P;  T_cpu↔cpu(D) when D ≤ P    (4-13)

T_step3 = T_cpu→fpga(P) when D > P;  T_cpu→fpga(D) when D ≤ P    (4-14)



For D ≤ P, the entire data is transferred as a single packet, whereas for D > P, the transfer is broken into packets of size P. By overlapping Step 3 with Steps 1 and 2 collectively, the overall time of the transfer can be described as:

T_comm = f(T_cpu←fpga + T_cpu↔cpu) + f(T_cpu→fpga)
       = (non-overlapped part of (T_cpu←fpga + T_cpu↔cpu)) + (non-overlapped part of T_cpu→fpga)
       = (T_step1 + T_step2) + (N−1) · max((T_step1 + T_step2), T_step3) + T_step3    (4-15)

Since Steps 1 and 2 are collectively overlapped with Step 3, only the time for the greater of the two components affects the overall time of the transfer. In addition, Steps 1 and 2 for the first packet and Step 3 for the last packet cannot be overlapped and are included separately. By applying Equation 4-15 for different packet sizes, the overall time of the transfer can be estimated for various packet sizes. Table 4-4 presents the estimates computed for transfer time and bandwidth when P = 2MB and P = 512KB. The input parameters (D, N, T_step1, T_step2 and T_step3) required to compute the estimates are also listed in the table. The values reported in the table for T_step1, T_step2 and T_step3 were determined empirically using microbenchmarks on Novo-G. For example, the estimates for P = 2MB and D = 16MB can be computed using Equation 4-15 as:

T_comm = (T_step1 + T_step2) + (N−1) · max((T_step1 + T_step2), T_step3) + T_step3
       = (3 + 1.45) + (8−1) · max((3 + 1.45), 8) + 8
       = 68.45 ms

B/W = D / T_comm = (16 · 1024 · 1024 bytes) / (68.45 ms) ≈ 245.1 MB/s


Table 4-4. Estimated performance of packetized transfers between two remote FPGAs on Novo-G for P = 2MB and 512KB.

For P = 2MB
 D        N    T_step1  T_step2  T_step3  T_comm   B/W
 (Bytes)       (ms)     (ms)     (ms)     (ms)     (MB/s)
 512K      1   1.17     0.37     3.01       4.55   115.2
 1M        1   2.00     0.73     4.99       7.72   135.8
 2M        1   3.00     1.45     8.00      12.45   168.4
 4M        2   3.00     1.45     8.00      20.45   205.1
 8M        4   3.00     1.45     8.00      36.45   230.1
 16M       8   3.00     1.45     8.00      68.45   245.1
 32M      16   3.00     1.45     8.00     132.45   253.3

For P = 512KB
 D        N    T_step1  T_step2  T_step3  T_comm   B/W
 (Bytes)       (ms)     (ms)     (ms)     (ms)     (MB/s)
 512K      1   1.17     0.37     3.01       4.55   115.2
 1M        2   1.17     0.37     3.01       7.56   138.7
 2M        4   1.17     0.37     3.01      13.58   154.4
 4M        8   1.17     0.37     3.01      25.62   163.7
 8M       16   1.17     0.37     3.01      49.70   168.8
 16M      32   1.17     0.37     3.01      97.86   171.4
 32M      64   1.17     0.37     3.01     194.18   172.8

Following a similar approach, the performance of the packetized transfers can be estimated for other packet sizes. Figure 4-6A shows the estimated bandwidth for a variety of packet sizes on Novo-G. Based on the estimates, the packet size which offers the best performance can be determined. The results indicate that for P = 512KB the overall bandwidth is lower than the non-packetized baseline. A packet size of 2MB was found to offer satisfactory performance over the large range of data sizes under consideration. Depending on the size of the data transfer involved, the improvement obtained ranged from 11% to 24% (for the range of data sizes under consideration).

4.3.2 Experimental Results

Figure 4-6B shows the bandwidth of packetized transfers between two remote FPGAs for various packet sizes, recorded experimentally using a testbench on Novo-G. The trend observed from the experimental data concurs with the estimates generated using our model. While it took our team two to three hours to run microbenchmarks and


Figure 4-6. Bandwidth of transfers between remote FPGAs. A) Estimated by using the multilevel communication model. B) Observed experimentally on Novo-G using a testbench.

compute the estimates (from the input parameters, as shown in Table 4-4) to determine the best packet size, the process of developing a testbench and conducting several trials to determine the best packet size experimentally took on the order of two days. System developers can benefit greatly from the reduction in time and effort afforded by using modeling and estimation to perform such optimizations on a target system.

The errors observed between the estimated and observed bandwidth are reported in Table 4-5. The average of the errors reported in the table is under 6%. A few cases (especially transfers involving small data sizes) experienced higher errors. For smaller data sizes, the time for the intermediate steps of the transfers (which forms the input to our model) became considerably small. As a result, even a minor variation in the microbenchmarking results manifested as a larger relative error in the estimates computed using the model. Nevertheless, the behavior of the estimated performance concurs with the observed performance, which helped us in determining the best packet size. A similar methodology can also be employed to optimize other data transfers in the SHMEM+ library.

4.4 Self-Tuning SHMEM+ Library

Depending on the capabilities of the interconnect technology and the I/O bus, a communication library such as SHMEM+ may have different ranges of optimal operation on different systems. For example, a certain packet size for the transfers discussed in


Table 4-5. Relative error between estimated bandwidth and observed bandwidth of transfers between remote FPGAs.

 Data size   Non-packetized   512KB    2MB      8MB
 (Bytes)     baseline         packet   packet   packet
 512K        15.5%            18.3%    14.7%    18.8%
 1M           3.3%             5.6%     3.5%     4.1%
 2M           9.3%             8.4%     8.9%     7.9%
 4M           5.3%             2.5%     7.1%     1.5%
 8M           2.2%             1.1%     3.6%     2.5%
 16M          1.4%             0.4%     2.2%     2.5%
 32M          0.5%            -1.0%     1.4%     1.0%

Section 4.3 may lead to satisfactory performance on some systems while yielding sub-optimal performance on others. In addition, variation in the system load can also change the operational characteristics of a system. Allowing a communication library such as SHMEM+ to automatically tune itself to the capabilities of a target system can increase its usefulness in achieving portable performance.

The methodology described in Section 4.3 can be extended to automate the optimization process. The multilevel communication model can be embedded in the SHMEM+ library to allow it to automatically tune its performance on a target system, hence preserving performance while improving portability. Such a capability also enables applications to dynamically tune the SHMEM+ library according to the system load at several points during an application's execution.

To incorporate this feature in SHMEM+, we modify the initialization function (shmem_init) in the SHMEM+ library to invoke a "self-optimization" function. This function executes a series of microbenchmarks to determine the input parameters for estimation. The parameters can be the transmission times for different data sizes over the network and the I/O bus, or a set of model parameters (e.g., LogGP, PLogP) which can then be used to estimate the transmission times over these interconnects. The self-optimization function then uses the multilevel communication model to determine the packet sizes required to obtain optimal performance for the various SHMEM+ routines.


The information generated by the self-optimization function is also stored to a file, from which it can later be retrieved to eliminate the need for performing these tests for every execution of an application. However, if the system performance varies over a period of time, an application may choose to invoke the optimization function to re-tune the performance of the library.

Figure 4-7. Bandwidth of transfers between remote FPGAs obtained by the baseline SHMEM+ library and the self-tuned version of SHMEM+. A) On Novo-G employing the slower DMA engine. B) On Novo-G employing the faster DMA engine. C) On the Mu cluster.

4.4.1 Experimental Results

We augmented the SHMEM+ library with self-tuning capability and evaluated its effectiveness on several different system configurations. Figure 4-7 compares the bandwidth observed for transfers between remote FPGAs using the baseline SHMEM+ (without self-tuning) with the bandwidth observed for those transfers with a self-tuned SHMEM+. Novo-G and the Mu cluster differ in the interconnect technology used by the two systems. We also added diversity in the capabilities of the I/O bus by employing two different versions of the DMA engine (with varying transfer speeds) provided by the FPGA-board vendor on the Novo-G system. Figure 4-7 shows that the self-tuned SHMEM+ library was able to automatically determine the appropriate packet size for transfers on the different systems and offer significantly improved performance. Figure 4-8 lists the packet sizes that were determined to offer the best performance by the self-optimization function for transfers between remote FPGAs. Table 4-6 highlights the


improvement in bandwidth obtained by the routines in the self-tuned SHMEM+ library over the baseline version. More importantly, the self-tuning capability improves the portability of SHMEM+.

Figure 4-8. Packet size determined by the self-tuned SHMEM+ library for transfers between two FPGAs over a range of data sizes on different systems. '-' indicates no packetization was beneficial. All numbers represent bytes of data (e.g., 1M = 1MB).

Table 4-6. Performance improvement for transfers between remote FPGAs obtained by the self-tuned SHMEM+ library vs. baseline.

 Data size   Novo-G (slower   Novo-G (faster   Mu
 (Bytes)     DMA engine)      DMA engine)      cluster
 4M           6.9%             5.5%            104.3%
 8M          14.5%            22.0%             74.6%
 16M         18.9%            23.7%             59.3%
 32M         25.1%            27.7%             52.6%

4.5 Conclusions

Programming models and libraries for heterogeneous, parallel, and reconfigurable computing such as SHMEM+ provide a useful mechanism for establishing communication in such systems. Moreover, a communication model is an important tool for optimizing the performance of an application and developing a concrete understanding of the underlying communication infrastructure. The new multilevel communication model proposed in this phase of research provides a system-level representation that integrates the effects of the multiple levels of communication routinely encountered in scalable RC systems. The model has provisions for accurately representing the opportunities for overlapping intermediate steps of communication, which is critical for obtaining high performance.


In this chapter, we demonstrated the benefits of our model as a tool for performing early DSE to optimize the communication infrastructure, yielding improved performance and productivity. An improvement of 42% was observed in the overall performance of our CBIR application. The communication model also enabled us to simplify the process of optimizing the transfer functions in SHMEM+ for a target system. Furthermore, the model allowed us to augment the SHMEM+ library to automatically tune its performance based on the capabilities of a system, hence making SHMEM+ more portable. Improvements in performance of up to 100% were obtained in certain cases by the self-tuned routines in SHMEM+.


CHAPTER 5
SYSTEM COORDINATION FRAMEWORK (PHASE 3)

In this chapter, a framework for establishing communication between the various devices in a heterogeneous system is presented: the System Coordination Framework (SCF). SCF consists of a library of message-passing coordination primitives suitable for potentially any language or device; a framework that allows an application to be expressed as a static task graph, with each task of the application defined in potentially any language; and a set of tools that can create customized communication methods for a given system architecture based on the mapping of tasks to devices. One key advantage of SCF is that many low-level architectural details are hidden from application developers, allowing them to simply define each task in a language of their choice, while specifying coordination between tasks using message-passing primitives. By hiding low-level architectural details from the application developer, SCF can improve application-development productivity and offer rapid exploration of different task-to-device mappings without modifications to task-definition code, which often yields improved system performance. Furthermore, since applications employ a uniform, high-level interface to define communication, SCF helps improve application portability. Our preliminary results show marked improvement in productivity and portability while incurring minimal performance overheads.

The remainder of this chapter is organized as follows. A detailed overview of the design of SCF is presented in Section 5.1. In Section 5.2, we describe the tools present in the SCF ecosystem. Section 5.3 evaluates our implementation of SCF on two different platforms and presents results from two case studies. Finally, Section 5.4 provides a summary of conclusions.

5.1 Overview of SCF

Figure 5-1 presents an overview of the application-design methodology with SCF. This description illustrates a bottom-up approach where an application is built by


connecting constituent tasks. However, it can easily be adapted to a top-down design flow by switching the order of the first two stages involved in the process.

The designer begins by defining individual tasks, as shown in Figure 5-1A, using potentially any language, compiler, or synthesis tool per task, thus enabling tool-chain interoperability by allowing different tasks to be written in different languages. Such behavior is critical for heterogeneous systems, where each device may require a different, specialized language [58, 59]. Existing IP cores could also be used as task definitions, with only minor extensions to integrate with the SCF send and receive mechanisms.

Figure 5-1. Application design philosophy using SCF. A) The designer defines tasks independently from the task graph and device mapping. B) Creates a task graph by simply interconnecting tasks without changing individual task definitions. C) And then maps tasks onto devices. D) Using automated communication synthesis to create efficient coordination mechanisms based on the mapping.

An important part of each task definition is defining the input and output to other tasks, which can have a large effect on designer productivity. Without SCF, defining task interactions is dependent on the source device and the receiver device, often requiring


different device/platform-specific APIs for different mappings. In many cases, changing a mapping requires time-consuming modifications to task-definition code. With SCF, a designer specifies all task interactions using a message-passing library called the SCF library. When defining task interactions with this library, a designer can be completely unaware of the source or destination device, a key advantage that enables task portability across multiple devices, task reuse in different applications, and transparency of low-level, device-specific communication details. Furthermore, because SCF is aware of the source and destination device for all inter-task data transfers, SCF can potentially allow automatic conversion between data formats (for example, conversion of the endianness of data), which further improves designer productivity.

After defining individual tasks, the designer then builds the complete application by simply connecting the inputs and outputs of the various tasks to form a task graph, as shown in Figure 5-1B. Note that this process does not require any changes to the definitions of the individual tasks. Currently, SCF can map a task to any device, provided that the code for the task can be implemented on that device. In the worst case, a designer would have to provide task code for each device under consideration. However, improvements in high-level synthesis tools (e.g., C to VHDL) can provide the capability of converting the task-definition code of one device to another. SCF does not attempt to automate this conversion of code specified using a particular language or vendor tools to another.

Once the task graph is defined, the designer maps individual tasks to specific devices in the system architecture (as shown in Figure 5-1C), again without making any modifications to the task-definition code or the task graph of the application. SCF can be used with systems from domains ranging from HPC to embedded computing. Figure 5-1C shows several examples of system architectures where SCF can be employed, such as a cluster of CPU nodes (HPC system) optionally equipped with accelerators, a


combination of FPGAs and an embedded processor (high-performance embedded system), or a system based on a stand-alone FPGA (embedded system).

As shown in Figure 5-1D, SCF uses the specified mapping to automatically implement all data transfers in the task graph using the specific communication capabilities of the system. For example, on a platform comprised of a host microprocessor and an accelerator board with multiple FPGAs (such as our experimental system in Section 5.3.1), SCF could implement communication between any FPGA and the host CPU using on-board memory, whereas communication between different FPGAs could be implemented using physical wires on the board, while hiding all implementation details from the designer. Furthermore, knowledge of the task interactions and their specific mapping onto system resources prior to compilation allows SCF to perform optimizations for reconfigurable devices, such as FPGAs, that are not possible in approaches that use dynamic routing to enable arbitrary communication between tasks, such as packet-switched NoC designs [35]. Such optimizations can improve application performance and reduce area requirements.

The transparency provided by SCF enables a designer to rapidly explore mappings of tasks to different devices, which is often critical for meeting design constraints. Although much previous work has focused on automatic design-space exploration [60, 61], such exploration is still largely a manual process for heterogeneous systems. It should be noted that SCF itself does not provide any automated design-space exploration. Rather, this work abstracts away from the designer the details of coordination between the various tasks mapped over heterogeneous resources. Designers may still perform optimization and design-space exploration by means of external tools, and SCF provides an easy-to-use entry point for implementing those designs. The following subsections explain SCF in more detail and discuss the programming and communication model adopted by this framework.


5.1.1 Programming Model

There are four terms that define the SCF programming model. A task in SCF is the finest, indivisible unit of computation that can be mapped onto a device. Each SCF task communicates data with other tasks using message-passing primitives. A task graph is a directed graph formed by connecting the communication interfaces (i.e., inputs and outputs) of various tasks together using edges. The edges in SCF task graphs do not specify any precedence or order of execution, but represent the interaction between two tasks. The SCF programming model places no restriction on where or when the communication primitives are used inside tasks, which allows representation of communication between tasks using different underlying models such as dataflow graphs (synchronous and asynchronous), communicating sequential processes, and generic message-passing models. Task-definition code implements the computational portion of a task using code in potentially any language, such as C++, VHDL, Impulse C, CUDA, OpenCL, etc., while using SCF library primitives for specifying communication. As long as the message-passing constructs of the library can be specified in a language, SCF can support task definitions using that language. Finally, a mapping in SCF defines how tasks are mapped onto specific devices and system resources.

5.1.2 Architectural Model

To enable coordination between heterogeneous devices on as many systems as possible, SCF uses a hierarchical architecture model that captures structures common to heterogeneous systems, while abstracting away details that designers may not require. The SCF architecture model consists of three levels of abstraction: devices, platforms, and systems. All SCF tasks execute on SCF devices, which are the finest-grained computational resources of a given system. Every SCF device is part of an SCF platform, which is a subsystem containing a set of SCF devices interconnected by a specific communication topology. Each SCF platform is part of


Figure 5-2. Architectural model of an example system.

an SCF system, at the top of the hierarchy, which connects a set of platforms using a specific communication topology.

Figure 5-2 illustrates how SCF architecture models can be used to represent common systems. The figure shows a representation of a cluster of nodes connected over Ethernet (or any network technology), where some nodes consist of a CPU (D1 devices in Figure 5-2), or a CPU and an accelerator board, each having multiple FPGAs (D2 and D3 devices). This architecture is represented as an SCF model in the following way. The FPGAs and CPUs are SCF devices, each node collectively acts as an SCF platform, and the cluster of all nodes forms an SCF system. The different levels of abstraction are not necessarily mutually exclusive; a single physical device could be an SCF device, platform, and system. For example, platform P1, which is comprised of a single device, is both an SCF device and a platform. Such flexibility allows the SCF architectural model to represent diverse systems.

5.1.3 Communication Model

One key advantage of the multiple levels of abstraction in the architecture model is that communication can be made transparent to the designer by distributing communication responsibilities throughout the system. Furthermore, such transparency eases conversion of existing devices and platforms into SCF systems. SCF-compliant devices are capable of handling all device-level communication, which we define to

PAGE 83

becommunicationbetweentasksmappedontothesameSCFdevic e.Forexample, amicroprocessorisSCF-compliantifitiscapableofsupport ingcommunication betweenmultipletasksmappedontoit.Thephysicalimpleme ntationdoesnotaffect SCFcomplianceandcouldvaryfordifferentdevices;itcould beachievedviaa message-passinglibraryormessagequeuessupportedbyope ratingsystems.Similarly, SCF-compliantplatformsprovidecommunicationroutinesth atareresponsibleforall datatransferswhenthereceiverisimplementedonadiffere ntdeviceinthesame platform.SCFplatformsarecapableoftransferringmessage stotheappropriateSCF deviceintargetplatform.Again,thephysicalimplementati onofsuchcommunication couldvaryfordifferentplatforms.Forexample,inanSCFpla tformcontainingmultiple processorcores,messagescouldbepassedthroughFIFOsins haredmemorywhereas, foranSCFplatformcomprisedofmultipleFPGAs,streamingdata transfercould beachievedthroughphysicalwires.Alternatively,SCFresor tstothesystem-level communicationroutinesifthereceivertaskismappedonadi fferentplatform. Table5-1.SCFlibraryprimitives. FunctionAPIType InitializationSCF InitSetup TerminationSCF FinalizeSetup SendSCF SendPoint-to-Point RecvSCF RecvPoint-to-Point Despitetheexistenceofdistinctcommunicationlevels,SCF hidestheselevelsfrom designers,whoareinsteadexposedtoasinglecommunicatio nAPI.WithoutSCF,a designerwouldhavetogothroughanextensiveprocessofusi ngad-hocmethodsof establishingcommunicationbetweenanytwodevicesonwhic hthecommunicatingtasks aremapped.Furthermore,thedesignerwouldhavetore-esta blishthecommunication mechanismfollowinganychangesintheresourcemapping.Wit hSCF,adesigner simplyspeciestheinputandoutputofeachtaskwhilerelyi ngonthetoolsinour framework(discussedinSection 5.2 )toimplementthecommunication. 83

Communication between SCF tasks uses the message-passing model, which is commonly employed in the parallel computing community [62]. With message passing, all communication in a task-definition code is specified explicitly as function calls that send or receive data and require both participating tasks to execute their respective functions to complete a data transfer. Such a communication model is generic enough to interface with potentially any programming model associated with any device. Table 5-1 presents the coordination primitives currently supported by the SCF library. The setup routine, SCF_Init, performs initialization operations and allocation of resources for all levels of communication and underlying libraries. SCF_Finalize performs complementary termination functions such as de-allocation of the resources which were set up during initialization. SCF_Send and SCF_Recv are communication calls (which could be synchronous or asynchronous) that provide data transfer between tasks. In Section 5.3 we further describe how this simple set of communication routines is adapted to the programming models of the devices present in our experimental system.

Note that this API is intentionally much simpler than other message-passing libraries (e.g., MPI), which typically contain constructs for scatter, gather, broadcast, etc. With SCF, a designer defining a task is not required to know if the inputs and outputs of the task are used for point-to-point communication or collective communication. The designer simply defines the inputs and outputs of the task. Then, at the task-graph level, collective communication operations can be specified via specialized edges between tasks. By separating specialized communication from task definitions, SCF increases portability of task-definition codes to different applications. The simple language that we developed (described in Section 5.2.1) for describing task graphs currently supports unicast edges (single source, single receiver) for point-to-point communication and multicast edges (single source and multiple receivers, or multiple sources and a single receiver) for broadcast, scatter, and gather operations.
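The matched pairing of SCF_Send and SCF_Recv described above can be illustrated with a short sketch. Only the primitive names come from Table 5-1; the signatures below (an edge label, a buffer, and a byte count) are hypothetical stand-ins, and the transport is mocked with an in-memory channel map so the example is self-contained:

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the SCF primitives of Table 5-1; the real
// library maps these onto device-, platform-, or system-level transports.
static std::map<std::string, std::vector<char>> channels;  // label -> pending message

void SCF_Init()     { channels.clear(); }  // resource setup
void SCF_Finalize() { channels.clear(); }  // resource teardown

// Sender side: post 'size' bytes on the edge identified by 'label'.
void SCF_Send(const std::string& label, const void* buf, size_t size) {
    const char* p = static_cast<const char*>(buf);
    channels[label].assign(p, p + size);
}

// Receiver side: the matching call that completes the transfer.
void SCF_Recv(const std::string& label, void* buf, size_t size) {
    std::memcpy(buf, channels[label].data(), size);
}

// Both calls name the same edge label; the task graph, not the
// task-definition code, decides which tasks the edge connects.
int produce_and_consume() {
    SCF_Init();
    int x = 42;
    SCF_Send("edge1", &x, sizeof x);
    int y = 0;
    SCF_Recv("edge1", &y, sizeof y);
    SCF_Finalize();
    return y;
}
```

In a real SCF program the send and receive execute in different tasks, possibly on different devices; the communication synthesizer substitutes an implementation appropriate to where each task is mapped.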

5.2 SCF Ecosystem

The SCF ecosystem refers to the set of tools (commercial tools as well as tools developed as a part of this phase of research) which an application developer will typically employ for developing applications with SCF. In the following subsections, we describe the tools that we developed to support SCF on our experimental systems, and their interaction with commercial tools for developing applications in SCF.

5.2.1 SCF Task-Graph Language and Mapping

To evaluate SCF, we created a set of tools in the Eclipse environment [63, 64]. These tools include an SCF task-graph editor that allows developers to specify applications as task graphs and specify the mapping of tasks onto system resources, as well as a communication synthesizer that uses the information specified to generate appropriate communication routines.

We developed a simple language for specification of application task graphs in the task-graph editor. The primary constructs in this language are task, edge, and loop. Every task of an application that implements a part of its computation is represented in a graph using a task construct. The task-graph specification language requires that tasks have unique names and well-defined communication interfaces for any communication going to (and from) a task. The communication interfaces are specified using the prefix input or output before the interface name of the respective tasks. Figure 5-3 shows an example of a task graph (of the target-tracking application detailed in Section 5.3.3) specified using the SCF task-graph language. There are four tasks in the example graph (T1, T2, T3, and T4). Task T1 has three output interfaces (out2, out3, and out4) and three input interfaces (in2, in3, and in4).

Figure 5-3. An example application graph description in the SCF task-graph language.

Each communication interface of a task is connected to an interface of another task using an edge. The SCF task-graph language currently supports unicast edges for point-to-point communication (defined using type edge) and multicast edges for collective communication (defined using type bedge for broadcast, gedge for gather, and sedge for scatter). Depending on the type of an edge, it could connect one or more input and output interfaces. In Figure 5-3, the interfaces for task T1 are connected to the interfaces of other tasks using unicast edges (i2, o2, etc.). The edge assignments are based on the following convention: the output interfaces are assigned to an edge (for example, out2 of T1 is assigned to o2), and the input interfaces receive their values from an edge (for example, in2 receives its value from i2 in T1).

To simplify specification of task graphs of scalable applications, which typically include several instantiations of the same task, the language provides a loop construct. A loop construct creates a loop iterator which is only defined inside the body of the loop. The range of the loop iterator controls the number of tasks instantiated by that loop. The values of the loop iterator are employed in the body of the loop (enclosed between < >) to specify unique names (or IDs) for the instantiated tasks and to make any edge assignments. Figure 5-3 employs the loop construct to define and perform edge assignment for tasks T2, T3, and T4.
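Since Figure 5-3 itself is an image, the snippet below is a hypothetical reconstruction of what such a description might look like, using the task, edge, and loop constructs named above; the concrete syntax accepted by the actual editor may differ:

```
task T1 {
    output out2, out3, out4
    input  in2, in3, in4
}

loop k = 2 to 4 {
    task T<k> {
        input  in1
        output out1
    }
    edge o<k>: T1.out<k> -> T<k>.in1    # unicast, point-to-point
    edge i<k>: T<k>.out1 -> T1.in<k>
}
```

A broadcast from T1 to all three instantiated tasks would instead be declared with a single bedge connecting one output interface to multiple input interfaces (and likewise gedge or sedge for gather and scatter).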

Figure 5-4. Tool flow in SCF.

The mapping information, which defines how the tasks of an application are mapped onto the resources of a target system, is specified in a separate file using the task-graph editor. The basic mapping in SCF requires that each task of an application is assigned to a device on a target system. The assignment of a task to a device is performed by specifying the unique address of that device in the system. In addition, a few other attributes are also provided to direct SCF tools in synthesizing appropriate communication routines, such as the target IDE, the communication capabilities available at different levels of the hierarchy (device, platform, and system), etc. The next section provides an example of a mapping file along with the task graph of the corresponding application, while describing the steps involved in developing an application using SCF.

5.2.2 Tool Flow

The SCF tool flow shown in Figure 5-4 begins with a task-definition step. Designers define the computations of tasks, or use appropriate cores, using any language, compiler, or synthesis tool, while defining all task interactions using primitives from the SCF library. Figure 5-5A shows the task graph of an example application consisting of two tasks, which will be mapped to a CPU and FPGA, respectively. As shown in Figure 5-5B, one task, written in C++, generates random numbers and outputs the random numbers to an output interface called "out1" while the other task, described using Handel-C, receives these numbers through an interface called "in1" and accumulates them. Note that these two tasks are defined independently of each other, using different programming languages. The communication is specified using the SCF library, which is adapted to meet the requirements of the programming environment associated with the target device.

After defining tasks, a designer specifies the task graph of the application in SCF's task-graph editor. Figure 5-5C shows the corresponding task graph (".scf" file) for the application under consideration. As shown in Figure 5-5C, the output of task "random" is connected to the input of task "accumu" through the "edge1" edge. Note that the names of the interfaces used in the task graph are the same as the labels supplied (as one of the parameters) to the library routines in the task-definition code in Figure 5-5B.

Next, an application designer specifies the mapping information and provides the architectural model for a target system. An example mapping (".map" file) is shown in Figure 5-5D. Although this mapping process is currently performed manually, existing design-space exploration techniques could be integrated into the SCF tool flow to provide optimal mapping suggestions to application designers.

Communication synthesis (Step 4 in Figure 5-4) analyzes the mapping and the architectural model to automatically create efficient implementations for all edges in the task graph. At each level of hierarchy in the SCF architecture model, communication synthesis determines what mechanisms to use, and based on this information it generates definitions of the communication functions for different devices and platforms. In the simplest case, communication synthesis determines if an edge of the task graph corresponds to device-level, platform-level, or system-level communication and translates SCF library functions to an underlying library specified in the mapping file. For example, on a platform consisting of a CPU and a PCI-X FPGA board, communication synthesis would define an SCF_Send from a host CPU to an FPGA using a vendor-specific API to transfer data over PCI-X. Although communication synthesis currently implements all communication as a mapping onto underlying vendor API calls, there are numerous possibilities for future work.

Figure 5-5. Example application in SCF environment.

The SCF tools extract information from the ".scf" and ".map" files, and further invoke separate plug-ins (one for each IDE) to automatically generate the communication routines for different programming languages. For the example shown in Figure 5-5, there will be a separate plug-in for C++ and Handel-C, each of which will generate the behavior of the SCF_Send and SCF_Recv routines for their respective tasks. Such a structure allows new computational devices, along with their programming languages and tools, to be easily integrated into our implementation by simply creating a plug-in for the new tool (or programming language). We hope such a framework will be amenable to vendors of future technology, and provide an easy mechanism for using their technology with other devices in the system. After communication synthesis, the user combines the definitions for the SCF library functions with their corresponding task-definition code and compiles them collectively using the native compiler of the associated programming language to form an SCF executable (which represents a set of multiple executables, one for each participating device in a target system) that can run on the target system.

5.2.3 Interfacing with External Modeling Tools

The format of SCF task graphs allows easy interfacing with external tools that offer the capability of early design-space exploration through algorithm modeling. Such tools also provide application developers with a mechanism for describing applications as abstract models (in certain cases graphical) which can then be automatically translated into SCF task graphs. To provide developers with a graphical interface for describing task graphs of applications and performing early design-space exploration, we integrated our framework with RCML, a modeling language for abstract modeling of RC applications. While RCML has several interesting features for modeling and analysis of an application prior to any implementation, this section discusses the equivalence between RCML models and SCF task graphs and the capability of automatically translating the former into the latter through an example. We refer interested readers to [21] for further details about RCML.

Figure 5-6 shows an RCML model of a target-tracking application (the application presented as a case study in Section 5.3.3) and its equivalent task graph generated automatically by our framework. An application in RCML is modeled as a collection of blocks and RC-specialized constructs, in which each parallel task is represented by a function block. Each function block of an RCML model can directly correspond to a task in SCF task graphs, or multiple function blocks can be combined into a single SCF task. We achieve the combination of multiple RCML blocks into a single SCF task by defining an attribute called TaskID for each function block. Function blocks with the same value for TaskID are combined into a single task when translating an RCML model into its equivalent SCF task graph. Connections that link the blocks in an RCML model represent communication and dependencies, similar to the edges in SCF task graphs. RCML supports all of the various communication patterns currently provided in our implementation of SCF such as broadcast, scatter, gather, and point-to-point communication.

Figure 5-6. Representation of an example application. A) As an RCML model. B) Its equivalent task graph in SCF.

5.3 Results and Analysis

In this section, we present experiments illustrating the productivity and performance advantages of SCF. The two experimental systems employed in this phase of research are detailed in Appendix A (heterogeneous testbed and Novo-G). We first describe the coordination library which we developed to make these systems SCF compliant. We then demonstrate the advantages of custom communication in SCF by comparing it with communication based on a packet-switched, NoC communication architecture. Finally, we illustrate the capabilities of SCF and the benefits of rapid design-space exploration enabled by SCF through two case studies.

5.3.1 Experimental Setup

To evaluate the strengths and weaknesses of SCF, we conducted our experiments on two different systems. The first system is the heterogeneous testbed detailed in Appendix A. This system can be represented in the SCF architectural model as shown in Figure 5-7. The combination of the four FPGAs and the host CPU forms an SCF platform (P1 in the figure) and the second CPU forms another SCF platform (P2 in the figure). These two platforms collectively form our first experimental system. Each FPGA and CPU is an SCF device. The second system is the Novo-G RC supercomputer detailed in Appendix A.

Figure 5-7. SCF architectural model for experimental system.

We developed the SCF library to support various levels of communication on both systems. System-level communication between the processors on different servers is established using MPI as the underlying communication mechanism. Platform-level communication for the host processor on each server, which allows the host processor to interact with the FPGAs, was supported by using API calls provided by GiDEL. Note that an application designer is not exposed to MPI or GiDEL APIs. Instead, a designer simply specifies all the communication using the coordination library calls, which SCF tools automatically map onto the appropriate underlying communication mechanism (e.g., MPI and the GiDEL API in our case). Platform-level communication on the FPGAs is supported through send and recv entities which we developed in VHDL and which are analogous to send (or recv) function calls on a CPU. The entities connect to the user's VHDL applications through data and control ports (such as 'go' and 'done' signals), and perform handshaking operations through registers (instantiated on the FPGA) to exchange notifications and locations of outstanding messages with the host CPU and other FPGAs.

For these experiments, we did not implement full SCF compliance on each system, and instead implemented only the types of communication necessary for the targeted applications. For example, our implementation of the SCF library does not currently support direct communication between devices such as a CPU and an FPGA on different nodes, or between two FPGAs on different nodes. Note that these limitations are only related to our implementation of the library on these machines, and are not limitations of SCF. Such communication can be implemented through a two-level hierarchy, where the data from any device is first transferred to the host CPU on the originating SCF platform. Then, the data is transferred to the host CPU on the destination SCF platform, from which it can be transferred to the destination device.

5.3.2 Custom Communication Synthesis

One of the advantages of SCF is its ability to synthesize custom communication well suited to the requirements of the application and the capabilities of the system. We demonstrate the advantages of such a scheme versus a generic, packet-switched solution using a simple example, which involves sending data from two tasks to a third task. The task graph of the application is shown in Figure 5-8A, in which tasks T2 and T3 send data to T1. The figure also shows two possible designs for this example, both of which were implemented on a single FPGA.
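The custom synthesis demonstrated here builds on the level-selection step described in Section 5.2.2: given the resource mapping, the synthesizer decides whether each edge is device-level, platform-level, or system-level communication and picks a transport accordingly. A minimal, self-contained sketch of that decision follows; the Mapping structure and the address fields are hypothetical, since the real tools read this information from the ".map" file:

```cpp
#include <cassert>
#include <string>

// Hypothetical device address: which platform and which device a task
// was mapped to. In SCF this information comes from the ".map" file.
struct Mapping {
    int platform;
    int device;
};

// Classify one task-graph edge, following the simplest case of
// communication synthesis described in Section 5.2.2.
std::string edge_level(const Mapping& src, const Mapping& dst) {
    if (src.platform != dst.platform) return "system";    // e.g., MPI between nodes
    if (src.device != dst.device)     return "platform";  // e.g., vendor API over PCI-X
    return "device";                                      // e.g., queues on one device
}
```

The chosen level then determines which underlying library the SCF_Send and SCF_Recv calls on that edge are translated to.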

Figure 5-8. Results for custom communication synthesis. A) Task graph of an application. B) NoC-based design. C) Customized communication. D) Speedup obtained.

The first design (Figure 5-8B) employs a simple router that has a connection to each task, with a FIFO to buffer outgoing data on its output ports. While this design is generic and can support a variety of permutations for data communication between connected tasks, it fails to exploit information specified by data dependencies in the task graph. The second design (Figure 5-8C), alternatively, adopts a customized design based on details of data communication extracted from the task graph. It instantiates T1 with two ports and connects them to T2 and T3 directly. As a result of this optimization, the latter design offers superior performance over the generic solution, as indicated by the application execution times for two different data sizes in Figure 5-8D.

In a similar manner, custom communication, one of the important components of SCF, can lead to better application designs in other situations. Although communication synthesis is currently defined by implementing all communication as a mapping onto underlying features (including vendor-provided features as well as the functionality that we developed as a part of this phase of research) for establishing communication, there are numerous automatic synthesis possibilities enabled by SCF which will be explored in our future research.

5.3.3 Case Study: Target Tracking

Figure 5-9. Task graph for target-tracking application.

In this section, we analyze the overall performance of the target-tracking application (Appendix B), which tracks three objects using three Kalman filters, using SCF to rapidly explore different resource mappings. Figure 5-9 shows the task graph of the application. Task T1, the sensor process, creates the inputs for all three filters and is implemented in C++. The three Kalman filters (tasks T2, T3, T4) can be mapped on a CPU (with its design implemented in C++) or on an FPGA (as a VHDL design).

Table 5-2 presents execution times of the application under various mapping scenarios for three different systems. System I is our first experimental system (represented in Figure 5-7). Systems II and III represent notional systems, emulating the characteristics of a system that has lower bandwidth and higher latency of communication between the CPU and FPGAs. When power consumption is a major consideration, systems often employ FPGAs running at a lower frequency, perhaps attached to the system over a lower-speed bus. We implemented these emulations by adding extra delays on the FPGA in our experimental system to reduce communication bandwidth. The rows of the table represent the number of the Kalman-filter tasks (amongst T2 to T4) mapped to CPUs and FPGAs. T1 is always mapped on the CPU device of platform P1 (in Figure 5-7). Each of the other three tasks either time-shares the CPU device on platform P2 with other tasks mapped onto it or executes on an FPGA on platform P1.

Table 5-2. Execution time of target-tracking application under different mapping scenarios for three systems. Speedup shows the performance gain of the optimal mapping compared to the all-CPU baseline (i.e., the 3-CPU, 0-FPGA mapping).

  Mappings         Execution time (ms)
                   System I   System II   System III
  3-CPU, 0-FPGA    243        243         243
  2-CPU, 1-FPGA    152        152         167
  1-CPU, 2-FPGA    67         86          167
  0-CPU, 3-FPGA    8          86          167
  Speedup          28.3       2.8         1.4

Table 5-2 shows that the three systems, although similar, yield different optimal mappings (where the optimal mapping is defined as the mapping which achieves the best performance and, in case of a tie, uses the least number of FPGAs). With SCF, exploring these different mappings only required simple modifications to the resource-mapping file, and none to the source code of the task definitions. The SCF communication synthesis tool adapted the communication infrastructure based on the information in the mapping file. In contrast, any such changes traditionally would require modifications to the application source code and designer intervention to create communication infrastructure to match the new resource mapping. Moreover, the process would have to be repeated multiple times until a suitable level of performance is obtained. With SCF, we were able to perform design-space exploration rapidly, which led to speedups ranging from 1.4 times faster to more than 28.

In order to understand the productivity gains obtained by employing SCF, we recorded the development hours spent by our team during our experiments, in addition to the source lines of code (SLOC) involved in certain parts of the application code. Table 5-3 shows the increase in source lines of code involved in establishing communication from the CPU to each FPGA without SCF. A large part of this improvement comes from hiding details of communication from the designer while presenting a simpler interface through the SCF library that does not change depending on the source and destination of the communication.

Table 5-3. Productivity improvement for target-tracking application.

  Productivity improvement (for FPGA comm.)
                      Without SCF       With SCF          Improvement
  SLOC                357               112               3.18
  Development hours
    Conservative      40 hrs (1 week)   16 hrs (2 days)   2.5
    Optimistic        80 hrs (2 weeks)  16 hrs (2 days)   5

Table 5-4. Overhead measurements for target-tracking application.

  Overhead (on FPGA)
                                 Without SCF    With SCF       Overhead
  Performance                    76 MHz         75 MHz         1.3%
  ALUTs used                     2095/143520    2152/143520    2.6%
  Interconnect resources used    11%            12%            8.3%

Table 5-3 also presents productivity estimates, which are explained as follows. There are a variety of factors that contribute towards improvement in developer productivity, which are listed in Table 5-5. Applications developed without using SCF exhibit higher conceptual complexity. Developers are often forced to employ (and learn) multiple libraries with varying APIs to incorporate communication between different combinations of devices. Alternatively, with SCF, application developers do not have to go through a learning curve to familiarize themselves with a vendor-specific API when migrating to a new system. Similarly, high-level SCF functions eliminate the need to explicitly perform various intermediate steps of communication, leading to a reduction in code size. For example, SCF can enable data transfers from a CPU to an FPGA on a remote platform through a single send routine. Without SCF, any communication with an FPGA on a remote platform would have to be explicitly routed by a developer through the host CPU on that platform. The processing on the host CPU of the remote platform would have to be interrupted to service this communication request, which further increases the complexity of developing parallel programs. Portability increases the application lifespan, promotes code reuse, and reduces the recurring cost that would have been involved without the use of a framework like SCF. Additionally, a developer does not have to make complex modifications to the source code several times to evaluate different mapping configurations. A combination of the factors shown in Table 5-5 (and more) had a collective influence on the development time for the target-tracking application and led to a reduction in the application development time as reported in Table 5-3.

Table 5-5. Major factors that contribute to increased developer productivity using SCF.

  Program complexity: High-level abstraction provided by SCF functions hides various underlying details from application developers.
  Learning curve: Familiar APIs and programming model lead to a reduced learning period when migrating to a new system.
  Source lines of code: Each SCF function can perform several intermediate steps of communication, which eliminates the need for extra code and function calls.
  Application portability: Allows parts of an application to be re-used by other applications. Additionally, applications can be executed on a variety of platforms, which reduces recurring costs.
  Design-space exploration of mappings: No modifications are required in the source code to evaluate different resource mappings using SCF.

Based on these contributing factors, we estimate SCF can reduce development time and improve productivity by a factor ranging from approximately 2.5 to 5, as shown in Table 5-3. The optimistic case represents a designer who is unfamiliar with the system, and thus has to undergo a steep learning process for the APIs of each device. The conservative case represents an experienced designer who is familiar with the tools and vendor APIs for that particular system. Although these numbers are specific to our experimental system and team personnel, we believe them to be a fair estimate of the improvements we expect to obtain with SCF.

Table 5-4 lists the overhead incurred by our VHDL design employing SCF for communication, in comparison to an optimized, handwritten design developed for the same application. These results show that the communication routines employed by the SCF tools result in modest overheads in terms of both resources and performance.

5.3.4 Case Study: Backprojection

Figure 5-10. Task graph for backprojection application.

In this section, we present a case study on the backprojection application (detailed in Appendix B) running on Novo-G. Due to the heavy computation demands of the application, most implementations decompose the backprojection algorithm in an embarrassingly parallel manner to exploit its inherent parallelism. In this case study, we first write a parallel program to accelerate the backprojection algorithm on a cluster of microprocessors. We then seamlessly migrate parts of the application onto FPGAs to further accelerate the application and determine the best mapping of the algorithm onto system resources. For our evaluation, we employ the backprojection algorithm to generate a 512×512 image. We decompose the computational load of the algorithm over four processing devices (CPUs or FPGAs), each of which receives its input data and sends its result to a task mapped on a CPU, as shown by the task graph in Figure 5-10. Table 5-6 lists the computation time and communication time for the backprojection task measured on each of the two device types on Novo-G. The times for transferring both the input and output to a task are listed in the rows labeled Data transfer. The data-transfer time for the CPU corresponds to the time for transferring data between two CPUs on different SCF platforms, while for the FPGA it represents the transfer time between a CPU and an FPGA on the same platform.

Table 5-6. Computation and communication performance of backprojection task on a CPU or FPGA.

  Function                            Resource   Time (ms)
  Backprojection computation          CPU        2145.75
  Backprojection computation          FPGA       3.28
  Data transfer (input/output data)   CPU        0.33/0.85
  Data transfer (input/output data)   FPGA       1.50/2.16

Table 5-7 lists the execution time of the application under various mapping scenarios on Novo-G. The mapping configuration in the table denotes the ordered mapping of tasks T2-T5 (in Figure 5-10) onto the system resources of Novo-G. In the simplest case (case 1), all of the tasks sharing the computational load of the backprojection algorithm are mapped onto four different cores of a quad-core processor on a single platform. For case 2, the tasks are mapped on different CPUs that reside in different SCF platforms. The next three cases (cases 3-5) achieve similar performance. The performance in these cases is limited by the slowest-performing task (i.e., tasks mapped onto CPUs). It is interesting to observe the slight variation in performance between cases 4 and 5 due to a change in the order of task mappings. The last case (case 6) implements each of the tasks sharing the computational load on a different FPGA of a single SCF platform. Due to the superior computational capability of FPGAs, this mapping offers the best performance.

While the optimal mapping is reasonably self-explanatory for this case study, it may not always be obvious. For example, one may not expect case 3 (with two FPGAs) to be slower than case 1 with only CPUs. The different mapping possibilities are listed here to illustrate the ease of evaluating various mapping configurations of application tasks using SCF, which may be critical for finding optimal mappings in other applications. Determining the performance of the various mapping configurations listed in Table 5-7 did not require any change in the source code by application developers. Instead, a developer just made modifications to the mapping file in the SCF editor and followed steps 4-7 from Figure 5-4 to generate a particular executable. Since most of the inferences on the productivity analysis from this case study were consistent with the first case study, they are not repeated here.

Table 5-7. Performance of backprojection application under different mapping scenarios. The mapping configuration represents the ordered mapping for the four tasks sharing the computational load of the application. Speedup compares the performance of a particular mapping configuration with a serial software baseline which requires 8.4 s on a single CPU of Novo-G. For the device mappings, all FPGAs are always on the same platform as the host CPU to which task T1 is mapped. CPU devices are on different platforms except in case 1.

  Case   Mapping configuration               Time (ms)   Speedup
  1      CPU, CPU, CPU, CPU                  2213.8      3.80
         (all CPUs on same platform)
  2      CPU, CPU, CPU, CPU                  2323.4      3.62
         (all CPUs on different platforms)
  3      CPU, CPU, FPGA, FPGA                2317.2      3.63
  4      CPU, FPGA, FPGA, FPGA               2276.9      3.70
  5      FPGA, FPGA, FPGA, CPU               2285.4      3.68
  6      FPGA, FPGA, FPGA, FPGA              12.5        672.96

5.4 Conclusions

To complement the work in Phases 1 and 2 of this research and to address challenges involving task coordination in future reconfigurable, heterogeneous systems, we have presented a novel framework for system-wide coordination that enables communication and synchronization between tasks running on heterogeneous processing devices in a system. SCF hides the low-level communication details from the application designer, resulting in improved productivity. By allowing designers to define communication independently of the devices in a system, SCF improves application portability. In addition, SCF allows designers to define tasks using potentially any language, which enhances the inter-operability between different vendor tools.

In this chapter, we evaluated an implementation of SCF and its associated tools and libraries through experiments on two different systems. Our experiments indicate that custom communication, one of the important components of SCF, creates designs that offer superior performance over generic solutions. We also illustrated the simplified design process involved with SCF for developing RC applications, which enabled rapid design-space exploration of various mappings to obtain higher performance from applications in our case studies. More importantly, the higher level of abstraction offered by this framework led to substantially improved designer productivity. Although productivity gains are often difficult to quantify, our case studies demonstrate an improvement ranging between 2.5× and 5×.

PAGE 103

CHAPTER6 COMPARISONOFSHMEM+ANDSCF Inthischapter,wecompareandcontrastthecapabilitiesan dcharacteristicsof SHMEM+andSCF.Bothoftheapproachespresentedinthisresearch ,SHMEM+ andSCF,providealternativetechniquesfordevelopingscal able,parallelapplications forRCsystems.Thechoiceoftheapproachadoptedforaparti cularapplicationis drivenlargelybytherequirementsofthatapplicationandd eveloper'spreferencefor aprogrammingmodel.WhileSHMEM+isbasedonthePGASprogramming model, SCFusesthemessage-passingmodelforestablishingcommuni cation.Asaresult, theseapproacheshaveseveralcontrastingfeaturessuchas themodelofdecomposition adoptedfortheapplication,theprogrammingcomplexityof feredtoanapplication developer,theperformanceofthecommunicationroutinesi neach,aswellastheir data-transfercapabilities.Inthefollowingsections,we evaluatebothoftheapproaches basedontheaforementionedcharacteristicsandcompareth eperformanceobtainedby datatransferroutinesineach. 6.1ProgrammingModelandComplexity ApplicationsdevelopedusingSHMEM+typicallyfollowtheSPMDmo delfor algorithmdecomposition,i.e.,everyparticipatingnodei nasystemexecutesan instanceofthesameprogram.Eachprograminstanceispartit ionedacross,and collectivelyexecutedbyalltheheterogeneousdevicesint hesystem.Bycontrast,SCF allowsanyarbitrarydecompositionforaparallelapplicat ion.TaskgraphsinSCFcan representapplicationsdescribedusingSPMDdecompositions ,aswellasfunctional decompositionstocreatepipelines.Inaddition,SCFalsoal lowsanyassignmentof applicationtasksontotheheterogeneousresourcesinasys tem. MostRCapplicationsarecomposedoftwoportions,thehardw arecorewrittenin anHDL,andthesoftwarecodethatisspeciedinaprogrammin glanguagesuchas C/C++.BothSHMEM+andSCFaimtoreducetheprogramcomplexityof aparallel,RC 103

PAGE 104

applicationincludingbothsoftwaredevelopment,aswella shardware-coredevelopment. Thetwo-sidedcommunicationroutinesinSCFrequireanappli cationdeveloperto providematchingsendandreceiveoperationstocompletead atatransfer.Bycontrast, SHMEM+accomplishesthesamedatatransferusingone-sidedge tandputroutines.As aresult,anapplicationdeveloperdoesnothavetoprogramm atchingsendandreceive pairwhenusingSHMEM+.Consequently,programswrittenusing SHMEM+provide lowersoftwareprogramcomplexity. Fordescriptionofhardwarecores,SCFprovidessupportford atatransfersthrough VHDLentitieswhichoperateinamannersimilartosendandrec eivefunctionsin software.Incontrast,thehardwarecoresdevelopedforSHMEM +applicationscurrently employvendor-specicfunctionalitytofetchdatafromthe on-boardmemory.Asa result,SCFofferslowerprogrammingcomplexityforhardwar ecorescomparedto SHMEM+. 6.2PerformanceofDataTransfers One-sidedcommunicationroutinesinSHMEM+donotrequirecoo rdinationwith theothertaskinvolvedinadatatransfer.Consequently,co mmunicationroutinesin SHMEM+havethepotentialtoofferhigherperformancethanthe onesinSCF,which requirethecommunicatingtaskstosynchronizepriortocom pletingatransfer.However, forcertaintypesofcommunications(suchas,communicatio nbetweentwoFPGAs),the customcommunicationsynthesisinSCFcanoptimizethecommu nicationinfrastructure tooffersuperiorperformance. Figure 6-1 showstheaveragebandwidthoftransfersforaping-pongben chmark transferringdatabetweenaCPUandanFPGA(whenbothdevicesa reonthesame node)usingcommunicationroutinesfromSHMEM+andSCF.Theper formanceof transfersusingSHMEM+routinesissuperiortothatofroutine sinSCF.Thelower bandwidthforSCFroutinescanbeattributedtotheoverheads ofadatatransfer encounteredonanFPGAwhenperformingaping-pongtest.The SCF Send and 104


Figure 6-1. The average round-trip bandwidth for a ping-pong microbenchmark using communication routines in SHMEM+ and SCF.

SCF_Recv entities on FPGAs write (or read) one data element in every clock cycle. As a result, the time required to complete a data transfer to (or from) the on-board memory from (or to) an application core on an FPGA becomes substantial for large data transfers. Fortunately, most applications can overlap this overhead with useful computation and therefore do not experience a performance penalty when employing SCF.

6.3 Data-Transfer Capabilities

Both SHMEM+ and SCF enable new functionality and direct transfers between devices that were previously achieved by employing multiple communication libraries (and APIs). For example, SHMEM+ enables the capability of communicating directly with FPGAs on a remote node by employing active messages for such transfers. Although not a limitation of our framework, such functionality is difficult to implement in SCF, and is currently missing from our implementation. By contrast, an advantage of SCF is that FPGAs are treated as peers to CPUs in the system and have the ability of initiating transfers. To enable such transfers in SHMEM+, FPGAs need a mechanism for accessing the system memory to complete a one-sided communication, a feature which is often missing on many RC systems where FPGAs are connected to the host


processor through an I/O bus. Consequently, FPGA-initiated transfers are difficult to provide in SHMEM+ and are missing from our current implementation.

6.4 Conclusions

Both SHMEM+ and SCF address system-wide issues and challenges involved in developing scalable applications for reconfigurable HPC and HPEC systems. Each approach has its unique capabilities (and certain limitations) that might make one approach more suitable than the other for a given application. An application developer will typically choose between the two approaches based upon the requirements of a given application and preference for a programming model, which is similar to the choice between different programming models in the traditional parallel-programming community.


CHAPTER 7
CONCLUSIONS

This research addresses system-level issues and challenges involved in developing scalable applications for reconfigurable HPC and HPEC systems. To this end, we have proposed two complementary approaches for establishing communication between tasks of parallel RC applications: SHMEM+ and the System Coordination Framework. By providing a high-level abstraction and a uniform communication interface, both of these approaches aim to improve developer productivity and application portability. Further improvements in productivity are obtained by bridging these design approaches with appropriate formulation tools to enable early design-space exploration.

In Phase 1 of this research, we proposed and investigated a new parallel-programming model called multilevel-PGAS, and based on this model, we created SHMEM+, the first known version of the SHMEM library for reconfigurable HPC systems. We developed a prototype for the SHMEM+ library and demonstrated its effectiveness on Novo-G through two case studies. Results from our experiments and case studies demonstrate that SHMEM+ is capable of offering performance comparable to an existing vendor-proprietary version of SHMEM while simultaneously yielding improved productivity and portability.

In Phase 2, a new multilevel communication model for estimating the performance of multilevel transfers present in SHMEM+ was proposed and evaluated. The model is extremely beneficial to application developers for providing a concrete understanding of the underlying communication infrastructure as well as optimizing the communication in these applications. Besides attaining higher performance from the communication infrastructure, our model can lead to considerable improvement in the productivity of application and system developers. Results from our experiments illustrate the merits and accuracy of the multilevel communication model through three different use-case scenarios.


In Phase 3, we proposed and analyzed a novel framework for system-wide coordination which provides an approach based on message passing for establishing communication between various devices in a heterogeneous system. By allowing the use of different tools and languages for describing various tasks of a parallel application, SCF promotes interoperability amongst various tools. In addition, by hiding the low-level communication details, SCF yields improved productivity and application portability. We demonstrated the benefits of developing applications with SCF using two case studies where improvement in application performance was obtained due to rapid design-space exploration. In addition, a productivity improvement ranging from 2.5× to 5× was achieved with minimal performance overheads.

Contributions of this research include the multilevel-PGAS programming model, the SHMEM+ communication library, the system coordination framework for task-level coordination, and the multilevel communication model for estimating performance of multilevel transfers. Combined, these tools and frameworks can overcome several existing challenges in developing parallel RC applications that have limited the potential of RC in the fields of HPC and HPEC.

Although this research focuses mainly on scalable RC systems and applications, the communication tools and frameworks proposed here can be extended to systems based on other types of accelerators such as GPUs, many-core processors, etc. The potential uses and impact of these tools on such systems could be explored in future research. In addition, the prototype of the SHMEM+ library developed in this research is expected to be a foundation for development of a more complete library in the future.


APPENDIX A
DESCRIPTIONS OF RC PLATFORMS EMPLOYED

This appendix describes the different RC platforms that were employed by various case studies throughout this research.

A.1 Mu Cluster

The Mu cluster consists of four Linux servers connected via QsNet II from Quadrics. Each node of the cluster features:

• 2 GHz AMD Opteron 246 processor
• 1 GB of registered-ECC DDR-400 RAM
• PROCStar-III PCIe x8 quad-FPGA card from GiDEL [65]
  – 4 Altera Stratix-III EP3SE260 FPGAs
  – 256 MB DDR II
  – Two banks of 2 GB DDR II (SODIMM)
  – High-speed direct connection between neighboring FPGAs

A.2 Novo-G

Novo-G consists of 24 computer servers connected by DDR InfiniBand and GigE. An architecture overview of Novo-G is provided in Figure A-1. Each node consists of the following hardware:

• 2.26 GHz Intel Xeon E5520 quad-core processor
• 6 GB of ECC DDR3, 1333 MHz RAM
• Mellanox DDR InfiniBand PCIe card
• PROCStar-III™ PCIe x8 quad-FPGA card from GiDEL [65]
  – 4 Altera® Stratix®-III EP3SE260 FPGAs
  – 256 MB DDR II
  – Two banks of 2 GB DDR II (SODIMM)
  – High-speed direct connection between neighboring FPGAs


Figure A-1. System architecture of Novo-G RC supercomputer.

A.3 Heterogeneous Testbed

This testbed consists of two standalone Windows servers connected via Gigabit Ethernet. One of the nodes is comprised of a 3 GHz Xeon processor. The second node consists of:

• 2 GHz Athlon 3200+ processor
• PROCStar-II™ PCI-X quad-FPGA card from GiDEL [66]
  – 4 Altera® Stratix®-II EP2S180 FPGAs
  – 64 MB DDR II
  – 1 GB DDR II (SODIMM)
  – High-speed direct connection between neighboring FPGAs


APPENDIX B
DESCRIPTIONS OF APPLICATIONS EMPLOYED

B.1 Target Tracking Using Kalman Filter

Target tracking using Kalman filtering [67] is a method for predicting the trajectory of environmental targets such as vehicles, missiles, animals, hostiles, or even unidentified objects. A Kalman filter is commonly employed in signal-processing applications to estimate the dynamic system state in a noisy environment. We selected this application in this research due to its numerous constraints that are often met using heterogeneous devices. There are a variety of factors, such as error tolerance, sampling rate of input, and target proximity to sensors, that determine the exact operational characteristics required for a particular target object. Different targets have varying requirements which mandate an appropriate computational platform such as an FPGA, an embedded processor, or a desktop processor (CPU).

B.2 Content-Based Image Retrieval (CBIR)

Content-Based Image Retrieval (CBIR) is a common application in computer vision and consists of searching a large database of digital images for the ones that are visually similar to a given query image, where the search is based on contents of the image. The content in this context can be one of the several features present in the image, such as colors, shapes, textures, or any other information that can be derived from the image. CBIR has been widely adopted in many domains such as biomedicine, military, commerce, education, and Web image classification and searching. Each image in a CBIR system is represented by a feature vector, which is based on characteristics of the image as cited above. Similarity between a query image and the set of images in the database is determined by measuring similarity between their feature vectors. The processes of determining the feature vector and analyzing images for similarities are often the most computationally intensive stages in any CBIR system [68]. There


are various forms of parallelism available in the application that can be exploited by RC systems to accelerate the search process [69].

Our implementation presented in this research employs a technique based on auto-correlogram of color components [70], where the feature vector is based on color information in the image. A correlogram of an image corresponds to a table where the rows are indexed by color pairs (c_i, c_j) such that the d-th column in row (c_i, c_j) stores the probability of finding a pixel of color c_j at a distance d from a pixel of color c_i in the image. For the case of auto-correlogram, the table only consists of rows where c_i = c_j. In this work, we use a modified version of auto-correlogram, which stores an absolute count of the occurrences of a pixel of color c_i instead of the probability of such an event. Similarity between two images is determined by calculating the sum of absolute differences between their feature vectors.

B.3 Backprojection

Backprojection [71] is the process (also commonly referred to as a transform in literature) of reconstructing the internal structure of a scanned object from emission data (or cross-section data) acquired by a scanner. The process is employed in various application domains such as medical imaging, synthetic aperture radar (SAR), electron microscopy, etc. Backprojection can be viewed as mapping of raw data into the image space. It is typical for applications to have thousands (to millions) of data points for each cross-section and thousands of cross-sections from different angles. The process of transforming each data element from the sensors to the image space is computationally demanding, having a complexity of O(n^3) when generating 2D images [72].


REFERENCES

[1] K. Olukotun and L. Hammond, "The future of microprocessors," Queue, vol. 3, no. 7, pp. 26–29, 2005.

[2] T. Chen, R. Raghavan, J. N. Dale, and E. Iwata, "Cell broadband engine architecture and its first implementation: a performance view," IBM Journal of Research and Development, vol. 51, no. 5, pp. 559–572, 2007.

[3] P. Bhat, Y. Lim, and V. Prasanna, "Issues in using heterogeneous HPC systems for embedded real time signal processing applications," in Real-Time Computing Systems and Applications, 1995. Proceedings., Second International Workshop on, 1995, pp. 134–141.

[4] C. Erbas and A. D. Pimentel, "Utilizing synthesis methods in accurate system-level exploration of heterogeneous embedded systems," in IEEE Workshop on Signal Processing Systems (SIPS), 2003, pp. 310–315.

[5] K. Barker, K. Davis, A. Hoisie, D. Kerbyson, M. Lang, S. Pakin, and J. Sancho, "Entering the petaflop era: the architecture and performance of Roadrunner," in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008, pp. 1–11.

[6] Cray, "Cray XD1 datasheet," http://www.hpc.unm.edu/%7Etlthomas/buildout/Cray_XD1_Datasheet.pdf, Retrieved February 2011.

[7] XtremeData Inc., "XD1000 development system," http://old.xtremedatainc.com/index.php?option=com_content&view=article&id=109&Itemid=170, Retrieved February 2011.

[8] T. El-Ghazawi, E. El-Araby, M. Huang, K. Gaj, V. Kindratenko, and D. Buell, "The promise of high-performance reconfigurable computing," Computer, vol. 41, pp. 69–76, 2008.

[9] C. Pascoe, A. Lawande, H. Lam, A. George, Y. Sun, and W. Farmerie, "Reconfigurable supercomputing with scalable systolic arrays and in-stream control for wavefront genomics processing," in Proceedings of Symposium on Application Accelerators in High-Performance Computing (SAAHPC), 2010.

[10] Xilinx, "Virtex-5 family overview," http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf, Retrieved February 2011.

[11] Altera, "Stratix IV device handbook," http://www.altera.com/literature/hb/stratix-iv/stratix4_handbook.pdf, Retrieved February 2011.
[12] Tilera, "TILE64 processor product brief," http://www.tilera.com/sites/default/files/productbriefs/PB010_TILE64_Processor_A_v4.pdf, Retrieved February 2011.


[13] K. Shih, A. Balachandran, K. Nagarajan, B. Holland, C. Slatton, and A. George, "Fast real-time LIDAR processing on FPGAs," in Proc. of International Conference on Engineering of Reconfigurable Systems and Algorithms, Las Vegas, NV, USA, 2008.

[14] O. Storaasli, "Accelerating genome sequencing 100-1000X with FPGAs," in Manycore and Reconfigurable Supercomputing Conference (MRSC), The Queen's University of Belfast, Northern Ireland, 2008.

[15] "MPI standard," http://www.mcs.anl.gov/research/projects/mpi/, Retrieved February 2011.

[16] Cray, "Cray T3E Fortran optimization guide - 004-2518-002," http://docs.cray.com/books/004-2518-002/html-004-2518-002/z826920364dep.html, Retrieved February 2011.

[17] SGI, "Introduction to the SHMEM programming model," http://docs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man%25%20&fname=/usr/share/catman/man3/shmem.3.html, Retrieved February 2011.

[18] T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper, "UPC language specifications v1.0," http://upc.gwu.edu/docs/upc_spec_1.1.1.pdf, Retrieved February 2011.

[19] W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren, "Introduction to UPC and language specification," University of California-Berkeley, Berkeley, CA, USA, Tech. Rep., 1999.

[20] W. Carlson, T. El-Ghazawi, B. Numrich, and K. Yelick, "Programming in the partitioned global address space model," 2003.

[21] C. Reardon, B. Holland, A. George, G. Stitt, and H. Lam, "RCML: An environment for estimation modeling of reconfigurable computing systems," ACM Transactions on Embedded Computing Systems (TECS), to appear.

[22] K. Compton and S. Hauck, "Reconfigurable computing: a survey of systems and software," ACM Comput. Surv., vol. 34, no. 2, pp. 171–210, 2002.

[23] "Open SystemC initiative," http://www.systemc.org, Retrieved February 2011.

[24] D. Pellerin and E. A. Thibault, Practical FPGA Programming in C. Prentice Hall, 2005.

[25] D. Poznanovic, "Application development on the SRC Computers, Inc. systems," in Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 2005, pp. 78a–78a.
[26] "The OpenMP API specification for parallel programming," http://openmp.org/wp/, Retrieved February 2011.


[27] R. Nishtala, P. H. Hargrove, D. O. Bonachea, and K. A. Yelick, "Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap," in IEEE International Parallel & Distributed Processing Symposium, Rome, Italy, 2009, pp. 1–12.

[28] F. Darema, "The SPMD model: past, present and future," in Proceedings of the 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface. London, UK: Springer-Verlag, 2001.

[29] A. Lastovetsky and R. Reddy, "HeteroMPI: Towards a message-passing library for heterogeneous networks of computers," Journal of Parallel and Distributed Computing, vol. 66, no. 2, pp. 197–220, 2006.

[30] R. Graham, G. Shipman, B. Barrett, R. Castain, G. Bosilca, and A. Lumsdaine, "Open MPI: A high-performance, heterogeneous MPI," Cluster Computing, IEEE International Conference on, pp. 1–9, 2006.

[31] F. I. Massetto, A. M. G. Junior, and L. M. Sato, "HyMPI - a MPI implementation for heterogeneous high performance systems," in GPC '06, 2006, pp. 314–323.

[32] M. Farreras, V. Marjanovic, E. Ayguade, and J. Labarta, "Gaining asynchrony by using hybrid UPC/SMPSs," in Workshop on Asynchrony in the PGAS Programming Model, Yorktown Heights, NY, USA, Jun 2009.

[33] Workshop on Asynchrony in the PGAS Programming Model, 2009, available at http://research.ihost.com/apgas09/.

[34] T. El-Ghazawi, O. Serres, S. Bahra, M. Huang, and E. El-Araby, "Parallel programming of high-performance reconfigurable computing systems with Unified Parallel C," in Proc. of Reconfigurable Systems Summer Institute, Urbana, Illinois, 2008.

[35] M. Saldana, A. Patel, C. Madill, D. Nunes, W. Danyao, H. Styles, A. Putnam, R. Wittig, and P. Chow, "MPI as an abstraction for software-hardware interaction for HPRCs," in HPRCTA '08: Proceedings of the Third International Workshop on High-Performance Reconfigurable Computing Technology and Applications. New York, NY, USA: ACM, Nov 2008.

[36] M. Franklin, E. Tyson, J. Buckley, P. Crowley, and J. Maschmeyer, "Auto-Pipe and the X language: a pipeline design tool and description language," in Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, 2006, p. 10 pp.
[37] R. D. Chamberlain, M. A. Franklin, E. J. Tyson, J. H. Buckley, J. Buhler, G. Galloway, S. Gayen, M. Hall, E. B. Shands, and N. Singla, "Auto-Pipe: Streaming applications on architecturally diverse systems," Computer, vol. 43, pp. 42–49, 2010.


[38] OpenFPGA, "OpenFPGA GenAPI version 0.4 draft for comment," http://www.openfpga.org/Standards%20Documents/OpenFPGA-GenAPIv0.4.pdf, Retrieved February 2011.

[39] Khronos Group, "OpenCL 1.0 specification," http://www.khronos.org/registry/cl/specs/opencl-1.0.43.pdf, Retrieved February 2011.

[40] S. Fortune and J. Wyllie, "Parallelism in random access machines," in Proceedings of the 10th ACM Symposium on Theory of Computing, 1978, pp. 114–118.

[41] L. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.

[42] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken, "LogP: towards a realistic model of parallel computation," SIGPLAN Notices, vol. 28, no. 7, pp. 1–12, 1993.

[43] A. Alexandrov, M. Ionescu, K. Schauser, and C. Scheiman, "LogGP: Incorporating long messages into the LogP model for parallel computation," Journal of Parallel and Distributed Computing, vol. 44, no. 1, pp. 71–79, 1997.

[44] T. Kielmann, H. Bal, and S. Gorlatch, "Bandwidth-efficient collective communication for clustered wide area systems," in Proc. International Parallel and Distributed Processing Symposium (IPDPS), 1999, pp. 492–499.

[45] J. Bosque and L. Perez, "HLogGP: a new parallel computational model for heterogeneous clusters," in Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid, 2004, pp. 403–410.

[46] A. Lastovetsky, I.-H. Mkwawa, and M. O'Flynn, "An accurate communication model of a heterogeneous cluster based on a switch-enabled Ethernet network," in Proceedings of the 12th International Conference on Parallel and Distributed Systems, 2006, pp. 15–20.

[47] B. Holland, K. Nagarajan, and A. George, "RAT: RC Amenability Test for rapid performance prediction," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 1, no. 4, pp. 1–31, 2009.

[48] M. C. Smith and G. D. Peterson, "Analytical modeling for high-performance reconfigurable computers," in SCS International Symposium on Performance Evaluation of Computer and Telecommunications Systems (SPECTS), San Diego, CA, July 2002.
[49] L. H. T., C. Hylands, E. Lee, J. Liu, X. Liu, S. Neuendorffer, Y. Xiong, Y. Zhao, and H. Zheng, "Overview of the Ptolemy project," 2003.

[50] D. Bonachea and J. Jeong, "GASNet: A portable high-performance communication layer for global address-space languages," CS258 Parallel Computer Architecture Project, Spring 2002.


[51] Network-Based Computing Laboratory, "MVAPICH: MPI over InfiniBand and iWARP," http://mvapich.cse.ohio-state.edu, Retrieved February 2011.

[52] E. O. Brigham, The Fast Fourier Transform and its Application. Prentice Hall, 1988.

[53] R. Gonzales and R. E. Woods, Digital Image Processing. Addison-Wesley, 2002.

[54] I. Uzun, A. Amira, and A. Bouridane, "FPGA implementations of fast Fourier transforms for real-time signal and image processing," Vision, Image and Signal Processing, IEE Proceedings, vol. 152, no. 3, pp. 283–296, June 2005.

[55] J. W. Cooley and J. W. Tukey, "An algorithm for the machine computation of the complex Fourier series," Mathematics of Computation, vol. 19, pp. 297–301, 1965.

[56] N. Shirazi, P. M. Athanas, and A. L. Abbott, Field Programmable Logic and Application. Springer Berlin, 1995, ch. Implementation of a 2-D fast Fourier transform on an FPGA-based custom computing machine, pp. 282–292.

[57] K. D. Underwood, R. R. Sass, and W. B. Ligon III, "Acceleration of a 2D-FFT on an adaptable computing cluster," in FCCM '01: Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Washington, DC, USA, 2001, pp. 180–189.

[58] D. Pellerin and S. Thibault, Practical FPGA Programming in C, 1st ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2005.

[59] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, 1st ed. Addison-Wesley Professional, July 2010.

[60] K. Chatha and R. Vemuri, "Magellan: multiway hardware-software partitioning and scheduling for latency minimization of hierarchical control-dataflow task graphs," in Hardware/Software Codesign, 2001. CODES 2001. Proceedings of the Ninth International Symposium on, 2001, pp. 42–47.

[61] B. P. Dave, "CRUSADE: Hardware/software co-synthesis of dynamically reconfigurable heterogeneous real-time distributed embedded systems," Design, Automation and Test in Europe Conference and Exhibition, vol. 0, p. 97, 1999.

[62] C. A. R. Hoare, "Communicating sequential processes," Commun. ACM, vol. 21, pp. 666–677, August 1978.

[63] Eclipse, "Eclipse Classic 3.4.1," http://www.eclipse.org/downloads/packages/eclipse-classic-341/ganymedesr1, Retrieved February 2011.
[64] OpenArchitectureWare, "Xtext reference documentation," http://www.openarchitectureware.org/pub/documentation/4.1//r80_xtextReference.pdf, Retrieved February 2011.


[65] GiDEL, "GiDEL PROCStar III," http://www.gidel.com/PROCStarIII.htm, Retrieved February 2011.

[66] GiDEL, "GiDEL PROCStar II," http://www.gidel.com/PROCStarII.htm, Retrieved February 2011.

[67] C. Lee and Z. Salcic, "A fully-hardware-type maximum-parallel architecture for Kalman tracking filter in FPGAs," in Information, Communications and Signal Processing (ICICS), 1997, pp. 1243–1247 vol. 2.

[68] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, Feature Extraction, Foundations and Applications. Springer, 2006.

[69] C. Skarpathiotis and K. Dimond, Field Programmable Logic and Application. Springer Berlin, 2004, ch. A Hardware Implementation of a Content Based Image Retrieval Algorithm, pp. 1165–1167.

[70] J. Huang, S. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih, "Image indexing using color correlograms," in Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, Jun 1997, pp. 762–768.

[71] B. M. W. Tsui and E. C. Frey, "Analytic image reconstruction methods in emission computed tomography," in Quantitative Analysis in Nuclear Medicine Imaging. Springer US, 2006, pp. 82–106.

[72] N. Subramanian, "A C-to-FPGA solution for accelerating tomographic reconstruction," Master's thesis, University of Washington, 2009.


BIOGRAPHICAL SKETCH

Vikas Aggarwal is a Ph.D. graduate from the Department of Electrical and Computer Engineering at the University of Florida. Prior to beginning his doctoral studies, he worked with IBM Research Lab in India, and Tandel Systems Inc. in the United States. He received his M.S. in electrical and computer engineering from the University of Florida in December of 2005. Prior to coming to the States, he received his B.S. degree from the department of ECE at Guru Gobind Singh Indraprastha University, India, in August of 2003.