Fast and Accurate Simulation Environment (FASE) for High-Performance Computing Systems and Applications

Permanent Link: http://ufdc.ufl.edu/UFE0022068/00001

Material Information

Title: Fast and Accurate Simulation Environment (FASE) for High-Performance Computing Systems and Applications
Physical Description: 1 online resource (146 p.)
Language: english
Creator: Grobelny, Eric
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: hpc, hpec, simulation
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: As systems of computers become more complex in terms of their architecture, interconnect, and heterogeneity, the optimum configuration and use of these machines becomes a major challenge. To reduce the penalties caused by poorly configured systems, simulation is often used to predict the performance of key applications to be executed on the new systems. Simulation provides the capability to observe component and system characteristics (e.g., performance and power) in order to make vital design decisions. However, simulating high-fidelity models can be very time consuming and even prohibitive when evaluating large-scale systems. The Fast and Accurate Simulation Environment (FASE) framework seeks to support large-scale system simulation by using high-fidelity models to capture the behavior of only the performance-critical components while employing abstraction techniques to capture the effects of those components with little impact on the system. To achieve this balance of accuracy and simulation speed, FASE provides a methodology and associated toolset to evaluate numerous architectural options. This approach allows users to make system design decisions based on quantifiable demands of their key applications rather than using manual analysis which can be error prone and impractical for large systems. The framework accomplishes this evaluation through a novel approach of combining discrete-event simulation with an application characterization scheme in order to remove unnecessary details while focusing on components critical to the performance of the application. In addition, FASE is extended to support in-depth availability analyses and quick evaluations of data-intensive applications. In this document, we present the methodology and techniques behind FASE and include several case studies validating systems constructed using various applications and interconnects. 
The studies show that FASE produces results with acceptable accuracy (i.e., maximum error of 23.3% and under 6% in most cases) when predicting the performance of complex applications executing on HPC systems. Furthermore, when using FASE to analyze data-intensive applications, the framework achieves over 1500x speedup with less than 1% error when compared to the traditional, function-level modeling approach.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Eric Grobelny.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: George, Alan D.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2009-08-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0022068:00001

FAST AND ACCURATE SIMULATION ENVIRONMENT (FASE) FOR HIGH-PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS

By

ERIC M. GROBELNY

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008

© 2008 Eric M. Grobelny

To my family and friends

ACKNOWLEDGMENTS

I would like to thank, first and foremost, my parents, brother, and sister for their love and guidance. I would also like to express my utmost gratitude to my advisor, Dr. Alan George, for supporting me through graduate school and teaching me the necessary skills to become a researcher and innovator. Another crucial person who taught me much about research in computer engineering is Dr. Jeff Vetter. Furthermore, I wish to express my appreciation to my sponsors (the Department of Defense, Honeywell, and the University of Florida) for their financial aid. Without it I would be in extreme debt. Finally, I would like to thank Mr. Robert Henuber for planting the seed that inspired and motivated me to become a doctor of philosophy in computer engineering.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
2 BACKGROUND AND RELATED RESEARCH
3 FAST AND ACCURATE SIMULATION ENVIRONMENT (PHASE 1)
  3.1 Application Domain
    3.1.1 Application Characterization
    3.1.2 Stimulus Development
  3.2 Simulation Domain
    3.2.1 Component Design
    3.2.2 System Development
    3.2.3 System Analysis
  3.3 Results and Analysis
    3.3.1 Model Validation
    3.3.2 System Validation
    3.3.3 Case Study: Sweep3D
      3.3.3.1 Experiment 1: Accuracy
      3.3.3.2 Experiment 2: Speed
      3.3.3.3 Experiment 3: Virtual system prototyping
  3.4 Conclusions
4 PERFORMANCE AND AVAILABILITY PREDICTIONS OF VIRTUALLY PROTOTYPED SYSTEMS FOR SPACE-BASED APPLICATIONS (PHASE 2)
  4.1 Background
    4.1.1 Project Overview
    4.1.2 DM System Architecture
    4.1.3 DM Middleware Architecture
  4.2 Approach
    4.2.1 Physical Prototype
    4.2.2 Markov-Reward Modeling
      4.2.2.1 Data node model
      4.2.2.2 System model
    4.2.3 Discrete-Event Simulation Modeling
    4.2.4 Fault Model Library
  4.3 Results and Analysis
    4.3.1 Model Calibration
      4.3.1.1 Component model calibration and validation
      4.3.1.2 System performability model
    4.3.2 Case Study: Fast Fourier Transform
    4.3.3 Case Study: Synthetic Aperture Radar
      4.3.3.1 Amenability study
      4.3.3.2 In-depth application analysis
      4.3.3.3 Flight system
  4.4 Conclusions
5 HYBRID SIMULATIONS TO IMPROVE THE ANALYSIS TIME OF DATA-INTENSIVE APPLICATIONS (PHASE 3)
  5.1 Introduction
  5.2 Background and Related Research
  5.3 Hybrid Simulation Approach
    5.3.1 Function-Level Training
    5.3.2 Analytical Modeling
    5.3.3 Micro-Simulation
  5.4 Results and Analysis
    5.4.1 Simulation Setup
    5.4.2 Performance Modeling
    5.4.3 Contention Modeling
    5.4.4 Case Study
  5.5 Conclusions
6 CONCLUSIONS

APPENDIX

A EXPERIMENTAL AND SIMULATIVE SETUP
  A.1 Experimental Setup
  A.2 Simulation Setup

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
3-1 The FASE component library
3-2 Experimental versus simulation execution times for matrix multiply
3-3 Ratio of simulation to experimental wall-clock execution time
3-4 Compute node specifications for each cluster in heterogeneous system
3-5 Experimental versus simulation errors for Sweep3D
4-1 The DM middleware components
4-2 Data node model states
4-3 Failure and recovery rates of the node model
4-4 Summary of DM component models
4-5 Summary of fault models
4-6 Baseline system parameters
4-7 The FFT algorithmic variations and system enhancements
4-8 Checkpoint options explored using patch-based SAR application
4-9 Architectural enhancements explored for flight system
5-1 Summary of relevant simulation models
5-2 Key system parameters
5-3 Hybrid source model parameters
5-4 Data set sizes for each HSI data transaction
5-5 Simulation times for various HSI image sizes
A-1 Computation systems at the HCS Lab at UF

LIST OF FIGURES

Figure
3-1 High-level data-flow diagram of FASE framework
3-2 The FASE application characterization process
3-3 InfiniBand model latency validation
3-4 InfiniBand model throughput validation
3-5 The TCP/IP/Ethernet model latency validation
3-6 The TCP/IP/Ethernet model throughput validation
3-7 The SCI model latency validation
3-8 The SCI model throughput validation
3-9 Sweep3D algorithm
3-10 Experimental versus simulative execution times for Sweep3D
3-11 Ratios of simulation to experimental wall-clock completion time for varying system and data set sizes
3-12 Execution times for Sweep3D running on various system configurations
3-13 Maximum speedups for Sweep3D running on various network configurations
3-14 Speedups for Sweep3D running on 8192-node InfiniBand system
4-1 System hardware architecture of the dependable multiprocessor
4-2 System software architecture of the dependable multiprocessor
4-3 Logical diagram and photograph of DM testbed
4-4 Markov-reward data node model
4-5 Markov-reward system model
4-6 The DM node models
4-7 The DM flight system model
4-8 Example fault-enabled system
4-9 Throughput validations for network and MDS subsystem models
4-10 Markov versus simulation DM system performability comparison
4-11 Dataflow diagram of parallel 2D FFT
4-12 Execution time per image for baseline and enhanced systems
4-13 Parallel 2D FFT execution times per image for various performance-enhancing techniques
4-14 Distributed 2D FFT execution times per image for various performance-enhancing techniques
4-15 SAR dataflow with optional checkpoint stages and patched data decomposition
4-16 Amenability results via Markov model for patch-based SAR application
4-17 System performability percentages and throughputs for patch-based SAR
4-18 System performability and throughput for 8192-element patch-based SAR executing on various system sizes
4-19 Speedups of architectural enhancements for patch-based SAR
4-20 System performability and throughput of 20-node DM flight system executing patch-based SAR
5-1 High-level example systems employing hybrid modeling
5-2 High-level diagram of hybrid simulation approach
5-3 Function-level training procedure
5-4 Example three-flow micro-simulation
5-5 PingPong accuracy results
5-6 MDSTest accuracy results
5-7 PingPong speedup results
5-8 MDSTest speedup results
5-9 MDSTest accuracy and speedup results using hybrid modeling approach
5-10 The HSI data decomposition and dataflow diagram
5-11 The HSI accuracy and speedup results for two hybrid configurations
A-1 The MLD development environment

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

FAST AND ACCURATE SIMULATION ENVIRONMENT (FASE) FOR HIGH-PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS

By Eric M. Grobelny
August 2008
Chair: Alan George
Major: Electrical and Computer Engineering

As systems of computers become more complex in terms of their architecture, interconnect, and heterogeneity, the optimum configuration and use of these machines becomes a major challenge. To reduce the penalties caused by poorly configured systems, simulation is often used to predict the performance of key applications to be executed on the new systems. Simulation provides the capability to observe component and system characteristics (e.g., performance and power) in order to make vital design decisions. However, simulating high-fidelity models can be very time consuming and even prohibitive when evaluating large-scale systems. The Fast and Accurate Simulation Environment (FASE) framework seeks to support large-scale system simulation by using high-fidelity models to capture the behavior of only the performance-critical components while employing abstraction techniques to capture the effects of those components with little impact on the system. To achieve this balance of accuracy and simulation speed, FASE provides a methodology and associated toolset to evaluate numerous architectural options. This approach allows users to make system design decisions based on quantifiable demands of their key applications rather than using manual analysis which can be error prone and impractical for large systems. The framework accomplishes this evaluation through a novel approach of combining discrete-event simulation with an application characterization scheme in order to remove unnecessary details while focusing on components critical to the performance of the application. In addition, FASE is extended to support in-depth availability analyses and quick evaluations of data-intensive applications. In this document, we present the methodology and techniques behind FASE and include several case studies validating systems constructed using various applications and interconnects.

The studies show that FASE produces results with acceptable accuracy (i.e., maximum error of 23.3% and under 6% in most cases) when predicting the performance of complex applications executing on HPC systems. Furthermore, when using FASE to analyze data-intensive applications, the framework achieves over 1500x speedup with less than 1% error when compared to the traditional, function-level modeling approach.

CHAPTER 1
INTRODUCTION

Substantial capital resources are invested annually to expand the computational capacity and improve the quality of tools scientists have at their disposal to solve grand-challenge problems in physics, life sciences and other disciplines. Typically each new large-scale, high-performance computing (HPC) system deployed at a national lab, industry research center, or other site exemplifies the latest in technology and frequently outperforms its predecessors as measured by the execution of generic benchmark suites. While a supercomputer's raw computational potential can be readily predicted and demonstrated for these generic benchmarks, an application scientist's ability to harness the new system's potential for their specific application is not guaranteed. Few applications stress systems in exactly the same manner, especially at large system sizes, and therefore predicting how best to allocate limited funds to build an optimally configured supercomputer is challenging. Lacking quantitative data and a clear methodology to understand and explore the design space, developers typically rely on intuition, or instead simply use manual analysis to identify the best available option for each system component, which may often lead to inefficiencies in the system architecture. A more structured methodology is required to provide developers the means to perform an accurate cost-benefit analysis to ensure resources are efficiently allocated.

The Fast and Accurate Simulation Environment (FASE) has been developed to address this critical need. FASE is a comprehensive methodology and associated toolset for performance prediction and system-design exploration that encompasses a means to characterize applications, design and build virtual prototypes of candidate systems, and study the performance of applications in a quick, accurate, and cost-effective manner. FASE approaches the grand challenge that is performance prediction of HPC applications on candidate systems by splitting the problem into two domains: the application and the
simulation domains. Though some interdependencies must exist between the two realms, this split isolates the work conducted in either domain so that application analysis data and system models can be reused with little effort when exploring other applications or system designs.

Unlike other performance prediction environments, FASE provides the unique capability of virtually prototyping candidate systems via a graphical user interface. This feature not only provides substantial time and cost savings as compared to developing an experimental prototype, but also captures structural dependencies (e.g., network contention) within the computational subsystems allowing users to explore decomposition and load balancing options. Furthermore, virtual prototyping can help forecast the technological advances in specific components that would be most critical to improving the performance of select applications. More importantly, cross-over points in key metrics, such as network latency, can be identified by quantitatively assessing where to apply Amdahl's law for a particular application and system pair. In order to ensure all options are examined, analysis can also include the reuse potential of currently deployed systems in order to determine if upgrading or otherwise augmenting those existing systems will provide a better return on investment compared to building an entirely new system.

Another unique feature of FASE is its iterative design and analysis procedures in which results from one or more initial runs are used to refine the application characterization process as well as dictate the fidelity of the component models employed in candidate architectures. Iterations during these stages can result in highly targeted performance data that drives simulated systems optimized for speed and accuracy. To accommodate optimized system models, the framework supports a combination of analytical and simulation model types. This combination allows users to effectively adjust the focal points of the simulations to the components with the greatest impact on system performance through the use of the simulation models while still accounting for the less influential components by using faster analytical models. As a result, the complexity of the overall system design is reduced thus decreasing simulation time. In summary,
FASE allows designers to evaluate the performance of a specific set of applications while exploring a rich set of system design options available for deployment via an upgradeable, modular component library. The list below details the main contributions of the FASE framework.

1. A systematic, iterative methodology that describes the various options available at each step of the design and analysis process and illuminates the implications and issues with each option.
2. The FASE toolset that provides an application characterization tool supporting MPI-based C and Fortran applications to collect performance data and a graphical, object-oriented (using C++) simulation environment to virtually prototype and evaluate candidate systems.
3. A pre-built model library that contains a variety of HPC architectural components facilitating rapid prototyping and evaluation of systems with varying degrees of detail at all the key subsystems.

This study consists of three phases of research. The first phase focuses on designing and developing a robust and comprehensive methodology and toolkit that users can employ to quickly and accurately predict the performance of interesting HPC systems and applications. More specifically, the work conducted in the first phase provides a detailed procedure to characterize an application, design and build component and system models, and analyze the application's performance on various system configurations via simulation. A basic toolkit, consisting of an application analysis tool, a simulator, and pre-built simulation models of key components and sample systems, facilitates the prediction and analysis procedure. After creating the foundation of FASE, we perform case studies using a matrix multiply benchmark and a real scientific application in order to validate the speed and accuracy of FASE.

The second phase of the study explores the use of the FASE framework to perform an in-depth performability analysis of space-based systems. This work consists of the design and development of component and system models to represent the candidate space system as well as a simulation-based fault injection framework to conduct scalability
and performability analyses of different system configurations and application variations. The scalability analysis employs the 2D Fast Fourier Transform (2D FFT) as the key benchmark kernel due to its computational relevance in many space-based applications. After identifying the strengths and weaknesses of the base architecture, we conduct a performability study of the system under different environmental conditions. The study uses the Synthetic Aperture Radar (SAR) application, which incorporates the 2D FFT kernel in order to perform image processing. The study reports insight on key architectural and algorithmic options that provide performance and availability enhancements for future space systems.

Phase three expands the foundation of FASE by incorporating hybrid simulation techniques to address scalability issues when analyzing data-intensive applications. The research conducted within this phase focuses on the design and development of techniques that combine the strengths of the function-level and analytical modeling approaches in order to reduce simulation times in applications that process very large data sets. The proposed approach is validated using simple benchmark programs executing on an emerging space-based platform while its capabilities are demonstrated by analyzing a data-intensive, remote-sensing application called hyperspectral imaging (HSI).

The remainder of this document consists of the technical details of the study. Chapter 2 provides background information on the basic concepts involved in the performance prediction process through simulation. The chapter also presents brief overviews of previous research that shares similar methods and goals with FASE. Appendix A describes the various facilities and tools used to conduct the research. Chapter 3 details the FASE framework while Chapter 4 applies and extends the framework to perform scalability and performability analyses on a space-based system for a thorough architectural and algorithmic evaluation. Chapter 5 extends FASE by enhancing pre-built models with the necessary mechanisms needed to support hybrid simulations for the
analysis of data-intensive applications. Finally, Chapter 6 summarizes the work presented in this document.
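
Chapter 1 mentions using Amdahl's law to locate cross-over points in metrics such as network latency. As a reminder of the arithmetic behind that idea, the following is a minimal sketch and not part of the FASE toolset: the function name `amdahl_speedup` and the example numbers (20% of run time spent in communication, a network that is 10 times faster) are assumptions chosen purely for illustration.

```python
def amdahl_speedup(fraction_improved, component_speedup):
    """Overall speedup when only part of the run time is accelerated.

    fraction_improved: share of the original run time spent in the
        component being improved (0.0 to 1.0).
    component_speedup: factor by which that component gets faster (> 0).
    """
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    unchanged = 1.0 - fraction_improved
    return 1.0 / (unchanged + fraction_improved / component_speedup)


# Hypothetical application: 20% of run time is communication,
# and the candidate network is 10 times faster.
overall = amdahl_speedup(0.2, 10.0)

# Upper bound if communication became (almost) free.
limit = amdahl_speedup(0.2, 1e12)
```

Under these assumed numbers the overall gain stays below 1.25 no matter how fast the network becomes, which is the kind of diminishing return a cross-over analysis is meant to expose.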

CHAPTER 2
BACKGROUND AND RELATED RESEARCH

Building large-scale systems to determine how an application will perform is simply not a viable option due to time and cost constraints. Therefore, other methods of precursory investigation are needed, and several different types of modeling techniques exist to aid in this process. Analytical [1], [2] and statistical [3], [4] modeling are two such options and both methods involve a representation of the system using mathematical equations or Markov models in order to gain insight on how a particular system will perform based on certain parameters. These models can become very complex (especially when considering the large number of configuration parameters of large-scale HPC systems) and still remain inaccurate due to the overall complexity of the systems. In addition, it is difficult for analytical modeling to address the higher-order transient features of an application, such as network contention, and often over-simplifications are employed to make the equations solvable [5].

Computer simulation is an alternative that brings accuracy and flexibility to this challenge. Real hardware can be modeled at any degree of fidelity required by the user and dictated by the application, allowing the system to be tailored to the application and vice versa. A simulation-based approach provides the user with the flexibility to model important components very accurately while sacrificing the accuracy of less vital components by modeling them at a lower fidelity. In addition to these benefits, computer simulation supports the scaling of specific component parameters, allowing for the modeling of next-generation technologies which may not be currently available for experimental study. Analysis based on such models can also provide concrete evidence that may influence the future roadmap of system component manufacturers.

Classically, there are two types of computer simulation environments: execution-driven and trace-driven. Execution-driven simulations often use a program's machine code as the input to the simulation models and also have near clock-cycle fidelity, producing accurate results at the cost of slow simulation speeds [6], [7]. Though very
useful for detailed studies of small-scale systems, execution-driven simulations tend to become impractical with regard to time when used to simulate large, complex systems. Trace-driven simulations employ higher-level representations of an application's execution to drive the simulations [8], [9]. These representations are generally captured using an instrumented version of the application under study running on a given hardware configuration. In essence, the trace-driven simulations require extra time during the "characterization" stage to thoroughly understand and capture relevant information about the application. This additional time spent during characterization is typically amortized during the simulation stage by having the input available for numerous simulated system configurations. The accuracy of trace-driven simulations not only depends on the fidelity of the models, but also on the detail of the information obtained corresponding to the application.

A non-traditional simulation type is model-driven. Model-driven simulations use formal models developed within the simulation environment in order to emulate the behavior of an application. Essentially, these models produce output data that stimulates the components of the simulated system as if the real application were running. Though the development of the models can be very time consuming, depending on the complexity of the modeled application, once the model is developed it can be used within any system without any extra work.

In order to perform a trace-driven simulation, a representation of the application is necessary. Traces and profiles are two types of application representations that can be used to portray the behavior of a program [10]. Traces gather a chronology of events (such as computation or communication) as they occur in time, allowing a user to observe what a program is doing during a specific time period. Because traces are dependent on the execution time of a program, long-running programs can produce extremely large trace logs. These large logs can be impractical or even impossible to store depending on how much detail is recorded by the trace program. Profiles, by contrast, do not record events in time, but rather tally the events in order to provide useful statistics to the end user.
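
The distinction between a trace and a profile can be made concrete with a small example. The sketch below is illustrative only: the `TraceEvent` record layout, the `build_profile` helper, and the event values are hypothetical and do not reflect the formats used by any particular tracing or profiling tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One timestamped record in a hypothetical trace log."""
    start: float      # seconds since program start
    duration: float   # seconds spent in the event
    kind: str         # e.g. "compute" or "send"
    nbytes: int = 0   # payload size for communication events


def build_profile(trace):
    """Collapse a chronological trace into per-kind tallies (a profile)."""
    profile = defaultdict(lambda: {"count": 0, "time": 0.0, "bytes": 0})
    for event in trace:
        entry = profile[event.kind]
        entry["count"] += 1
        entry["time"] += event.duration
        entry["bytes"] += event.nbytes
    return dict(profile)


# A tiny made-up trace: event order and timestamps are preserved,
# so the log grows with the run time of the program.
trace = [
    TraceEvent(0.0, 2.0, "compute"),
    TraceEvent(2.0, 0.5, "send", nbytes=1024),
    TraceEvent(2.5, 1.0, "compute"),
    TraceEvent(3.5, 0.5, "send", nbytes=2048),
]

# The profile keeps only totals, so its size depends on the number
# of event kinds rather than on how long the program ran.
profile = build_profile(trace)
```

The trace can always be reduced to the profile, but the ordering information needed to replay events in a simulation cannot be recovered from the profile alone.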

Theoverheadincurredfromtheexecutionofextracodeusedtocollecttheproleortracedatacanbequitesimilar,butisultimatelydependentonthelevelofdetailproledortracedaswellastheapplicationunderstudy.Whiletracegenerationmayimposepenaltiesassociatedwiththecreationoflargetracelesdependingonthefrequencyofdiskaccess,ithastheadvantagethatverylittleadditionalprocessingisneededasintermediateresultsarecalculated.Bycontrast,prolingtypicallyrequiresverylittleleaccess,butmaydemandthefrequentcalculationofintermediateresultsfortheapplicationprole.Inessence,atraceprovidesrawdatadescribingtheexecutionofanapplicationandaproleoutputsaprocessedformofthisdata.Bothprolesandtracescanbeusefultoolsintrace-drivensimulation,dependingonthetypeofsimulationmodelsusedforthestudyandtheamountofinformationthatisdesired.FurtherdiscussiononhowtraceandproletoolsareleveragedwithintheFASEframeworkaredescribedinChapter 3 .Tracesandprolescanbecollectedforbothsequentialandparallelprogramsthoughthisstudyfocusesonthelatter.Variousparallelprogrammingmodelsexistinordertofacilitatecomputationacrossmulti-process,multi-nodesystems.Thesemodelsincludesharedaddress,messagepassing,anddataparallel[ 11 ].Duetoitsproliferationwithinthescienticcommunity,thisstudyfocusesonthemessagepassingparadigm.Morespecically,weconsidertheMessagePassingInterfaceMPIasthedefactostandardforinter-processcommunicationviamessagepassing[ 12 ],[ 13 ].TheMPIstandarddenesanumberoffunctionsthatallowapplicationdeveloperstopassinformationfromasourceprocesstooneormoredestinationprocesses.Thedestinationprocessescanbelocaltothesendingprocessorcanbelocatedonaremotenoderequiringtransactionsacrosssomeinterconnect.MoredetailsontheMPIstandardanditsimplementationcanbefoundin[ 12 ].Predictingtheperformanceofanarbitraryapplicationexecutingonaspecicsystemarchitecturehasbeenalong-soughtgoalforapplicationandsystemdesignersalike.Throughouttheyears,researchershavedesignedanumberofpredictionframeworksthat 19


employ varying techniques and methodologies in order to produce accurate and meaningful results. The approach taken by each framework allows it to overcome certain obstacles or focus on specific application or system types while inevitably sacrificing some key feature such as accuracy, speed, or scalability. The following paragraphs briefly describe existing prediction environments similar to FASE and the methodologies and general limitations of each.

The Integrated Simulation Environment (ISE) [14], a precursor to FASE, employs hardware-in-the-loop (HWIL) simulation in which actual MPI programs are executed using real hardware coupled with high-fidelity simulation models. The computation time of each simulated node is calculated via a process spawned on a "process machine," while network models calculate the delays for the communication events. Though the ISE shows reasonable simulation time and accuracy, the scalability of the framework is limited due to the large number of processes needed to represent large-scale systems. The POEMS project [15] adheres to an approach similar to HWIL simulation in that systems can be evaluated based on results from actual hardware and simulation models as well as performance data collected from actual software and analytical models. The simulation environment used by POEMS integrates SimpleScalar with the COMPASS simulator [16], enabling direct execution-driven, parallel simulation of MPI programs. These high-fidelity simulators can provide very accurate and insightful results, though the simulations can require large amounts of time, especially when dealing with very large systems.

While ISE and POEMS use execution-driven simulations, other projects employ a model-driven approach. CHAOS [17] creates application emulators that reproduce the computation, communication, and I/O characteristics of the program under study. The emulators are then used to pass their information to a suite of simulators that, in turn, determine the performance of the modeled application on the target simulated system. The PACE [18] framework defines a custom language, called CHIP3S, that is used to generate a model describing a parallel program's flow control and workload.


Similarly, Performance Prophet [19] uses the Unified Modeling Language (UML) to model parallel and distributed applications. Both PACE and Performance Prophet use these models to drive systems composed of both analytical and simulation models in order to balance speed and accuracy. Due to the inherent difficulty of automatically creating an accurate application model, all three projects require intimate knowledge of the programs they represent. As a result, the source code must be analyzed and profiled to collect the necessary information to reconstruct the program's execution.

Work conducted by the Performance Evaluation Research Center (PERC) focuses on performance prediction of parallel systems under the assumption that an application's performance on a single processor is memory bound and the interconnect dictates the scalability of the program under study [20], [21]. The framework uses a three-step process:

1. Collect memory performance parameters of the considered machine
2. Collect memory access patterns independent of the underlying architecture on which the application is executed
3. Algebraically convolve the results from steps one and two, then feed the results to the DIMEMAS [22] network simulator developed at the European Center for Parallelism of Barcelona

The researchers of PERC have reported accurate performance predictions using a wide range of applications. However, collecting memory access patterns can be a very time-consuming task that results in large amounts of data for each application considered. Also, the DIMEMAS network simulator is relatively simplistic and has the potential for large inaccuracies when analyzing communication-intensive applications with potential contention issues.

Performance prediction is the main theme of FASE and the frameworks mentioned above. However, performance analysis is an integral technique used to characterize the applications under study prior to simulation. The accuracy and detail of this characterization data can greatly influence the accuracy and speed of the simulation


frameworks that use it. While a great number of performance analysis tools exist for various purposes [23], SvPablo [24], Paradyn [25], and TAU [26] are the tools most applicable to the FASE simulation framework. SvPablo may be used to analyze the performance of MPI and OpenMP parallel programs; it allows for interactive instrumentation and correlates performance data with the original source code. Paradyn is also compatible with MPI programs and offers the advantage that no modifications to the source code are needed due to dynamic instrumentation at the binary level. Paradyn's focus is on the explicit identification of performance bottlenecks. TAU is an MPI performance analysis tool that can provide performance data on a thread-level basis, and provides the user with a choice of three different instrumentations. Many performance analysis tools including SvPablo and TAU are based on or support the popular Performance Application Programming Interface (PAPI), which is also supported by the Sequoia tool used by FASE. Each of these analysis tools can be incorporated into the pre-simulation stage of FASE in conjunction with or as a replacement for Sequoia to provide additional information on an application's behavior.

FASE brings together many of these general techniques and methods in order to provide a robust, flexible performance prediction framework. In addition, FASE features an environment that allows users to build virtual representations of candidate architectures. These virtual prototypes capture structural dependencies such as network congestion and workload distribution that can greatly impact an application's performance. Many of the frameworks described in this section use simplistic communication models that have difficulties capturing such issues. Furthermore, FASE employs a systematic, iterative methodology that produces highly modular application characterization data and component models. The framework illustrates the relationships between the application performance information and the architectural models such that features and mechanisms within the models can be identified and altered to improve prediction accuracy and simulation speed. The other frameworks propose using models of various


fidelities in order to speed up the simulations; however, they do not explicitly describe the decision-making process of choosing or designing a model's fidelity. Finally, FASE provides a fully extensible pre-built model library with components ranging from the application layer down to the hardware layer. Unlike many of the frameworks described above, FASE includes a number of detailed middleware models that can have significant impacts on the performance of the overall system. Many of the pre-built models are highly tunable, thus allowing a single, generic model to represent many different implementations based on the parameter values set by the user. The combination of these features makes FASE a powerful, flexible environment for rapidly evaluating applications executing on a variety of candidate systems.


CHAPTER 3
FAST AND ACCURATE SIMULATION ENVIRONMENT (PHASE 1)

The Fast and Accurate Simulation Environment (FASE) is a framework designed to facilitate system design according to the needs of one or more key applications. FASE provides a methodology and corresponding toolkit to evaluate the performance of virtually-prototyped systems to determine, in a timely and cost-effective manner, the ideal system configuration for a specific set of applications. In order to promote quick and modular predictions, the FASE framework is broken into two primary domains: the application domain and the simulation domain.

In the application domain, FASE employs various tools and techniques to characterize the behavior of an application in order to create accurate representations of its execution. The information gathered is used not only to identify and understand the characteristics of the various phases inherent to the application, but also to generate the stimulus data to drive the simulation models. The characterization data can be collected using one or more tools depending on the application, the capabilities of the employed tools, and the simulation models used during simulation. Once the data is collected, it can be used in numerous simulations without any modifications, thus facilitating the exploration of various system configurations. More details on the application domain are provided in Section 3.1.

The simulation domain incorporates the design, development, and analysis of the virtually-prototyped systems to be studied. In this domain, component models are designed and validated in order to create systems that incorporate new or emerging technologies. To ease system development, FASE provides a library of pre-constructed models tailored to accommodate the design of HPC environments. Once a system has been constructed, characterization data from any number of applications can be used as stimulus to the simulation, thus allowing rapid analyses of the system under varying workloads. More information on the various aspects of the simulation domain is provided in Section 3.2.


Figure 3-1 illustrates a high-level representation of the process associated with the FASE framework. The dark gray blocks represent steps in the application domain while the simulation steps are denoted by white blocks. Notice that a user can work in both domains concurrently. Also note that the framework incorporates multiple feedback paths that allow the user to follow an iterative process by which insight is gained through application characterization and simulation, and used to refine the models and application analysis data employed for future iterations. Section 3.2.3 contains further details on how an iterative methodology may be employed in FASE.

Figure 3-1. High-level data-flow diagram of FASE framework

3.1 Application Domain

The application domain is a critical part of the FASE framework. In this domain, important information is gathered that provides insight on the behaviors of an application during its execution. The main goal within the application realm is to gather enough information about an application so that systems in the simulation environment are stimulated as if they were really running the code. As such, this domain is decomposed into two main stages: (1) application characterization and (2) stimulus development. The


application characterization stage employs analysis tools to collect pertinent performance data that illustrates how the application exercises the computational subsystems. The data that can be collected includes communication information, computation information, memory accesses, and disk I/O. This data can then be used directly or processed and analyzed during the stimulus development stage. In the stimulus development stage, raw data gathered during characterization is used to provide valid input to the simulation models such that the simulated system's components are exercised as if the real program were executing on it. More details on both stages as well as the various options available in each are provided in the following sections.

3.1.1 Application Characterization

Application characterization is a vital step in FASE that enables accurate predictions of the performance of an application executing on a target system. The goal of characterization is to identify and track the performance-critical attributes that dictate an application's performance based on both application and target-system parameters. FASE provides a framework in which users can analyze their applications using existing systems (also known as instrumentation platforms) in order to prepare for simulation. The basic methodology by which the user analyzes the application is shown in Figure 3-2. The tools employed in each iteration initially depend on the user's experience with the application and can then change based on the results from the previous iteration of analysis. The selected tools should be capable of capturing the inherent qualities of the application while minimizing the collection of information resulting from dependencies on the underlying architecture of the instrumentation platform. Perturbation (i.e., the additional overhead imposed on the system due to instrumentation) should also be considered to ensure data accuracy. Though characterization is part of the application domain, the simulation models should also be considered during tool selection. For example, if the processor model to be used in simulation supports only instruction-based information, then the analysis tools selected should provide at least that information for that particular model.


Multiple tools can be used based on the details they provide, but output data must be converted to a common format in order to drive the simulation models.

Figure 3-2. The FASE application characterization process

FASE incorporates a feedback loop (Loop 1 in Figure 3-1) at the characterization stage such that multiple characterizations may be performed to first understand the main bottleneck of the program (e.g., processor, network, memory, disk I/O, etc.), and then focus on the collection of information that characterizes the main bottleneck while abstracting the components of lesser impact. This data can then be fed into the simulation environment and analyzed until the system exposes a different component as the bottleneck. If desired, the characterization data and models can then be switched or adjusted to incorporate the appropriate information and components to capture the nature of the new bottleneck. In fact, the performance data can provide all the


necessary information for applications with an arbitrary bottleneck, while the simulation models incorporate abstraction by only using the data that corresponds to capabilities supported in their designs. Further details on the simulation phase and how application characterization influences design decisions are described in the next section.

The initial deployment of FASE employs a single analysis tool called the Sequoia Toolkit developed at Oak Ridge National Laboratory. Sequoia is a trace-based tool that supports the analysis of C and FORTRAN applications using the MPI parallel programming model, the current de facto standard for large-scale message-passing platforms [28]. Instrumentation is conducted during link-time by using the profiling MPI (PMPI) wrapper functions. PMPI is defined in the MPI standard and provides an easy interface for profiling tools to analyze MPI programs [12]. Therefore, a Sequoia user must only rebuild his code by linking the Sequoia library, simplifying the data collection process. Though not required, Sequoia also supports additional functions that can be manually inserted into the application to start or stop data collection as well as denote various phases within the code to facilitate analysis.

The Sequoia Toolkit explicitly supports the logging of communication and computation events. A communication event in Sequoia is defined as any MPI function encountered during the execution of the code. The tool collects relevant information such as source, destination, and transfer size for all key MPI communication functions and also logs important non-communication functions (e.g., MPI_Topology, MPI_Comm_create). Collecting communication events at the MPI level inherently isolates the characterization data from the underlying network of the instrumentation platform, thus allowing the data to be used on a variety of simulated systems that employ different interconnect technologies. Network topology dependencies are also removed during characterization since network transfers between machines are captured as high-level semantics representing process-to-process communications rather than architecturally-dependent characteristics such as latency and bandwidth.
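The following minimal sketch illustrates why MPI-level communication events are independent of the interconnect. The record layout, the class names, and the first-order latency-plus-bandwidth cost model are assumptions made for illustration; they are not the Sequoia trace format or the FASE network models. The point is only that the same recorded events can be replayed against different network descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommEvent:
    """Hypothetical MPI-level record: who talks to whom and how much, nothing about the wire."""
    function: str       # e.g. "MPI_Send"
    source: int         # sending rank
    destination: int    # receiving rank
    size_bytes: int     # transfer size


@dataclass(frozen=True)
class Interconnect:
    """Hypothetical first-order network description supplied by the simulation side."""
    name: str
    latency_s: float
    bandwidth_bytes_per_s: float

    def transfer_time(self, event):
        # Assumed cost model: fixed latency plus serialization time.
        return self.latency_s + event.size_bytes / self.bandwidth_bytes_per_s


# The same trace is reused for every candidate network (all numbers are illustrative).
trace = [CommEvent("MPI_Send", 0, 1, 1_000_000), CommEvent("MPI_Send", 1, 0, 4_000)]
networks = [Interconnect("net_a", 5e-6, 1.0e9), Interconnect("net_b", 50e-6, 1.25e8)]

totals = {net.name: sum(net.transfer_time(e) for e in trace) for net in networks}
for name, seconds in totals.items():
    print(f"{name}: {seconds:.6f} s of communication")
```

Because the events carry no latency or bandwidth values of their own, swapping the interconnect changes only the simulation side, not the characterization data.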


Computation events occur between communication events. Sequoia supports two mechanisms that measure computation statistics during an application's execution: timing functions and the Performance Application Programming Interface (PAPI). PAPI is an API from the University of Tennessee that provides access to hardware counters on a variety of platforms that can be used to track a wide range of low-level performance statistics such as number of instructions issued, L1 cache misses, and data TLB misses [29]. Every logged event, both computation and communication, includes both wall-clock and CPU-time measurements. With PAPI enabled, computation events include additional performance information on clock cycles, instruction count, and the number of loads, stores, and floating point operations executed. Sequoia does not explicitly support the collection of I/O data. However, rough estimates can be calculated by comparing wall-clock and CPU times for each computation event.

The characterization stage does suffer from one inherent problem that must be addressed in order to provide an environment to predict the performance of large-scale systems. This issue arises due to the restrictions placed on the user by the physical testbed used to collect characterization data. For example, to collect accurate information about an application running on a 1024-node system, a 1024-node testbed must be available to the user. Of course, not all users have access to systems with thousands of nodes for evaluation purposes, though they may be interested in observing the execution of their applications on these larger systems. In order to overcome this limitation, FASE incorporates two techniques: the process method and the extrapolation method.

The process method allows each physical node in the instrumentation platform to run multiple processes and thus gather characterization information for multiple simulated nodes. The downside to this approach is resource contention at shared components that can lead to inaccurate representations of the application's execution. Also, a single node can only support a limited number of processes. This limitation is encountered when OS or middleware restrictions are met or when the node becomes so bogged down that the


application cannot finish within a reasonable amount of time. Initial tests show that the process method produces communication events identical to those found using the traditional approach. However, computation events can suffer from large inaccuracies due to memory and cache contention issues, especially in tests using many processes per node with larger datasets. The experiments conducted in Section 3.3 employ traces collected using the traditional approach; however, research is currently underway to remedy the inaccuracies of the process method in order to facilitate large-scale system evaluation.

The extrapolation method observes trends in the application's behavior while changing system size and dataset size, and then formulates a rough model of the application based on the findings. The model describes the communication, computation, and other behaviors of the application using a high-level language. The language can then be read by an extrapolation program to produce traces for an arbitrary system size and application dataset size. Details on extrapolating communication patterns of large-scale scientific applications can be found in [30]. This approach supports accurate generation of traces and does not suffer from the limitations of the process method, though it can be quite difficult to determine the trends of an application, especially when dealing with applications that behave dynamically based on measured values. Although many more issues can arise when using the extrapolation-based approach, this topic is out of the scope of this research.

3.1.2 Stimulus Development

After an application has been characterized, the information collected is used to develop stimulus data used as input to the simulation stage. FASE supports three methods of providing input to simulation models. These methods include a trace-based approach, a model-driven approach, and a hybrid approach. The exact method employed is left to the user, though the selection should depend on the type of application under study, the amount of effort that can be afforded, and the amount of knowledge gathered on the internals of the application. Details on each method are provided below.


The trace-based approach is the quickest and most automated method available to the FASE user. The method can use either raw or processed performance data collected during the characterization stage according to the type of information required by the simulation models. However, the trace-based approach does place some restrictions on the user. First, the simulation environment must have a trace reader that is capable of translating the performance information into data structures native to the simulation environment. This restriction requires a common format to which all performance data must conform. Therefore, if multiple tools are employed to gather characterization data, their outputs must be merged and converted to a common format that is supported by the trace reader. The second issue is that trace data must be collected for each system and dataset size under consideration. As system and dataset sizes increase, the trace data from complex applications could potentially require extremely large amounts of storage space, and thus care must be taken to keep trace files manageable. In the current version of FASE, this limitation can be alleviated by collecting data for only certain regions of code or a limited number of iterations through the use of specific instrumentation constructs supported by Sequoia.

The model-driven approach requires much more manual effort by the user than the trace-based approach. This method uses a formal model of the application's behavior based on either a thorough analysis of characterization data collected while varying system and dataset size or source code analysis. The developed models have the capability of reproducing the behaviors of complex, adaptive applications that cannot be captured using the trace-based approach. In general, this approach begins by identifying key application parameters that affect its performance. The next step is to ascertain the parameters having the greatest impact on performance and then determine the various component models the application will exercise during execution. Once these steps are complete, the actual model is developed such that it executes the correct computation, communication, and other events based on the behaviors discovered during


characterization. The actual type of model employed in this approach is not limited by the FASE framework. Markov chains, stochastic and analytical models, and explicit simulative models are a few model types that can be used within FASE as long as they can interface with the simulation environment.

The last approach supported by FASE is the hybrid approach. In this approach, a mix of trace and model-driven stimulus is used in order to combine the accuracy and ease-of-use of trace-based simulations with the flexibility and dynamism of the model-driven approach. In this method, the application is characterized at a very high level to identify structured and dynamic areas of code. The structured areas are used to generate trace data while small-scale formal models are employed to represent the dynamic areas. This mixture of techniques decreases the amount of trace data needed, reduces the amount of effort required to formulate formal models, and maintains relatively accurate representations of the application's behavior.

The initial deployment of FASE uses the trace-based approach as its primary stimulus. The pre-built FASE model library includes a Sequoia trace reader that translates Sequoia data into the necessary data structure in the simulation environment. Though both model-driven and hybrid approaches are defined within the FASE framework, this phase of research focuses on simulations conducted using only the trace-based approach.

3.2 Simulation Domain

The simulation domain consists of three stages: (1) component design, (2) system development, and (3) system analysis. The first of the three stages involves the creation of the necessary components used to build the systems under study. This stage can be particularly time-consuming depending on the complexity of the component as well as level of fidelity; however, it is a one-time penalty that must be paid to gain the benefits of simulation. The initial release of FASE includes several pre-built models of common components (detailed in Section 3.2.1) to aid users in this process and more will be added


in the future. The next step in the simulation domain is the development of the candidate systems. The process of constructing virtual systems typically requires less time than component design, although construction time normally increases with system size and complexity. Similar to stage one, the overhead of building a system must be paid only once, though numerous applications and configurations can be analyzed using the system. Finally, the third stage allows us to reap the benefits of the FASE framework and process. System analysis uses the components and systems constructed in stages one and two, and the application stimulus data from the application domain, in order to predict the performance of an application on a configured system. Since many variations of systems are likely to be analyzed, this stage is assumed to be the most time-sensitive. In the following subsections, we discuss each of the three simulation stages in more detail.

3.2.1 Component Design

Each component that is of interest to the user must first be designed and developed in the simulation environment of choice. The models must not only represent the behavior of the component, but also correspond to the level of detail provided by the characterization tools and the abstraction level to which the application lends itself. In most cases, certain parts of a component will be abstracted, while other parts that are known to affect performance will be modeled in more detail. The decision of where to add fidelity to components and where to abstract to save development and simulation time should be based on trends and application attributes discovered during characterization.

The components designed should incorporate a variety of key parameters that dictate their behavior and performance. An important step in the component design is tweaking these parameters to accurately portray the target hardware. The actual values supplied to the models should be based on empirical data collected using similar hardware or on the predicted performance if the components are future technologies. For example, the network and middleware models shown in Table 3-1 were validated according to real hardware in the High-performance Computing and Simulation (HCS) Research Lab at the


University of Florida. The experimental setup and results for these validation tests are presented in Section 3.3.

The FASE development environment uses a graphical, discrete-event simulation tool called Mission Level Designer (MLD) from MLDesign Technologies [27]. Its core components, called primitives, perform basic functions such as arithmetic and data flow control. The behaviors exemplified by each primitive are described using the object-oriented C++ language to promote modularity. More complex functions are commonly realized by connecting multiple primitives such that data is manipulated as it flows from one primitive to another. Alternately, users may write customized primitives to provide the equivalent functionality. MLD was selected as the simulation tool for FASE for three main reasons. First, it is a full-featured tool that supports component and system design as well as the capabilities to simulate the developed systems, all through a GUI. Second, MLD supports various design features that facilitate quick design times even for very complex systems. Finally, the authors have much experience using the tool and many models have been created outside of FASE that can be imported with little or no modification. Although FASE currently uses MLD as its simulation environment of choice, it may be adapted to support additional simulation environments in the future.

A wide range of pre-constructed models populate the initial FASE library in order to provide a starting point for users. Each model was designed and developed based on the hardware or software it represents through the use of technical details provided by corresponding standards and other literature. The fidelity of each model in the pre-built library corresponds to the current HPC focus of the initial deployment of FASE as well as the capabilities of Sequoia. As a result, the library incorporates high-fidelity network and communication middleware models to capture scalability characteristics while providing lower fidelity models for components such as a CPU, memory, and disk. Table 3-1 highlights the more important component-level models currently populating the pre-built FASE library. It is worth noting that a variety of components not listed in Table


3-1 can be developed using those that are listed. For example, an explicit multicore CPU model does not currently exist in the FASE library. However, by combining two CPU models, a shared memory model, and two trace files, one can analyze the performance of an application running on a multicore machine with little effort.

The network models listed in Table 3-1 share similar characteristics. Each model receives high-level data structures that define the various parameters required to create and output one or more network transactions between multiple nodes. Each network model also has numerous user-definable parameters such as link rate, maximum data size, and buffer size that dictate the performance of communication events. Furthermore, the models include a number of parameters that define the capabilities of the subsystems that supply the network interfaces with the necessary data to be transferred. For example, the InfiniBand model incorporates the parameters LocalInterconnectLatency and LocalInterconnectBW to define the latency and bandwidth of the interconnect between host memory and the InfiniBand host channel adapter (HCA). These parameters are used to calculate the performance penalties incurred from transferring data from memory to the HCA. These calculations effectively abstract away the complex behaviors of the underlying transfer mechanisms while still accounting for their performance impacts.

The middleware models in Table 3-1 provide the performance-critical capabilities of the protocol each represents. The TCP model is a single, generic model with a variety of parameters that enable a user to configure it as a specific implementation. Similarly, the MPI model also incorporates many parameters so that particular implementations can be represented using a single model. The MPI layer is modeled using two layers such that the general, high-level functionality of the MPI protocol forms the network-independent layer while the second layer employs interface models that translate MPI data into network-specific data structures. This layered approach allows a common interface to be used in all systems featuring MPI while providing "plug-and-play" capabilities to support MPI transfers over various interconnects.
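The two-layer MPI structure described above can be sketched in a few lines. This is an illustrative outline only: the class names, the dictionary-based transfer units, and the payload sizes are hypothetical and do not reflect the MLD primitives or the actual FASE models. The sketch shows a network-independent layer whose send call is unchanged while the network-specific interface model underneath it is swapped.

```python
import math


class InterfaceModel:
    """Hypothetical base class for the network-specific layer."""
    unit_name = "unit"
    max_payload_bytes = 1

    def translate(self, source, destination, size_bytes):
        # Split one MPI message into network-specific transfer units.
        count = max(1, math.ceil(size_bytes / self.max_payload_bytes))
        return [
            {"kind": self.unit_name, "src": source, "dst": destination, "index": i}
            for i in range(count)
        ]


class TcpInterface(InterfaceModel):
    unit_name = "tcp_segment"
    max_payload_bytes = 1460  # illustrative value, not taken from the FASE models


class InfiniBandInterface(InterfaceModel):
    unit_name = "ib_packet"
    max_payload_bytes = 2048  # illustrative value, not taken from the FASE models


class MpiLayer:
    """Network-independent layer: the same send() call whichever interface is plugged in."""

    def __init__(self, interface):
        self.interface = interface

    def send(self, source, destination, size_bytes):
        return self.interface.translate(source, destination, size_bytes)


over_tcp = MpiLayer(TcpInterface()).send(0, 1, 4096)
over_ib = MpiLayer(InfiniBandInterface()).send(0, 1, 4096)
print(len(over_tcp), over_tcp[0]["kind"], "|", len(over_ib), over_ib[0]["kind"])
```

Adding support for another interconnect in this sketch means adding one more interface class; the upper layer and the code that calls it stay untouched, which is the plug-and-play property the layered design aims for.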


Table 3-1. The FASE component library

Class | Type | Model name | Fidelity | Description
Networks | InfiniBand | Host Channel Adapter (HCA) | High | Conducts IB protocol processing on incoming and outgoing packets for IB compute nodes
Networks | InfiniBand | Switch | High | Device supporting cut-through and store-and-forward routing using crossbar backplane
Networks | InfiniBand | Channel Interface | Medium | Dynamic buffering mechanism
Networks | Ethernet | Network Interface Card (NIC) | Medium | Conducts Ethernet protocol processing on incoming and outgoing frames for Ethernet compute nodes
Networks | Ethernet | Switch | High | Device supporting cut-through and store-and-forward routing using crossbar or bus backplane
Networks | SCI | Link Controller | High | Conducts SCI protocol processing on incoming and outgoing packets for SCI compute nodes
Middleware | IP | IP Interface | Low | Handles IP address resolution
Middleware | TCP | TCP Connection | High | Provides reliable sockets between two devices
Middleware | TCP | TCP Manager | High | Manages TCP connections to ensure that the correct socket receives its corresponding segments
Middleware | MPI | MPICH2 | High | Provides MPI interface using TCP as the transport layer
Middleware | MPI | MVAPICH | High | Provides MPI interface for InfiniBand
Middleware | MPI | MP-MPICH | High | Provides MPI interface for SCI
Processors | Generic Processor | Generic Processor | Low | Supports timing information to model computation
Operating Systems | Generic OS | Generic OS | Low | Supports some memory management capabilities of the OS
Memories | Generic Memory | Generic Memory | Low | Models read and write accesses based on data size
Disks | Generic Disk | Generic Disk | Low | Models read and write accesses based on data size
Exotics | Reconfigurable Device | Reconfigurable Device | Medium | Models a specialized coprocessor (e.g., FPGA) that computes application kernels


3.2.2 System Development

After component development, systems must be created to analyze potential configurations. The systems developed should correspond to the demands of the applications as discovered via characterization. FASE provides the capability to not only change the system size and the components in the system, but also tweak component parameters such as the network's latency and bandwidth, middleware delays, and processing capabilities. This feature allows the user to scrutinize the effects of configuration changes ranging from minor system upgrades to complete system redesign using exotic hardware.

Scalability issues in this stage are dependent on the simulation environment rather than the application. Timely and efficient development of a massively parallel system in a given simulation environment can quickly become an issue as system sizes scale to very high levels, and the creation of systems with thousands of nodes can become an almost unwieldy task. Since FASE is focused on rapid analysis of arbitrary systems, it must address this issue. Among the ways FASE supports the creation of large systems, the MLD simulation tool supports hierarchical, XML-based designs such that a single module can encapsulate multiple compute nodes, yet the simulator still maintains the fidelity of the underlying components and includes their effect in the analysis. Systems are created using a graphical interface which is automatically translated into XML code describing system-level details such as the number and type of each component and how they are interconnected. In addition, MLD supports dynamic instances where, if a model is created according to certain guidelines, a single block can represent a user-specified number of identical components. Finally, a more advanced method of large-scale system creation can use hierarchical simulations (rather than hierarchical models within a single simulation), where small-scale systems are built and analyzed with the intent of producing intermediate traces to stimulate a higher-level simulation in which multiple nodes from each small-scale system then act as a single node in the large-scale system. This method has the potential to not only speed up development times for these systems, but also to


reduce simulation runtimes. While this technique has not yet been employed, we plan to examine its potential benefits in future research to improve the scalability of the FASE simulation environment.

3.2.3 System Analysis

After the application has been thoroughly analyzed and the components and initial systems have been developed, the user can begin analyzing the performance of various applications executing on the target systems. The stimulus data from the application domain is used as input to the simulation models in order to induce network traffic and processing. One powerful feature of FASE is its ability to carry out multiple simulation runs using different system configurations, but all based on the same set of stimulus data. Therefore, the additional time spent in application characterization allows the system analysis to proceed much more quickly.

During simulation, statistics can be gathered on numerous aspects of the system such as application runtime, network bandwidth, average network latency, and average processing time. In addition to the "profiling"-type data that is collected, it may also be desired to collect "traces" from the simulation. These traces differ from the stimulus traces in that they are more architecture-dependent and less generic, but they serve at least one common function: giving the user further insight into the performance of the application under study. The breadth and depth of performance statistics and results collected during simulation will determine the level of insight available post-simulation, and the results collected should be tailored towards the needs of the user. However, the type and fidelity of results collected may also negatively impact simulation runtime, so the user should be careful not to collect an excessive amount of unnecessary result data. The simulation environment may also be tailored to output results in a common format, such that they may be viewed in a performance visualization tool such as Jumpshot [31].

In some cases, results obtained during system analysis can lead to additional insight in terms of the application bottleneck, requiring re-characterization of the application as


shown in feedback loop 4 in Figure 3-1. In this situation, the steps from the application domain are repeated, followed by possible additional component and system designs, and repeated system analysis. Traces collected during simulation may also potentially be used to drive future simulations in this iterative process to solve an optimization problem and determine the ideal system configuration.

3.3 Results and Analysis

This section presents results and analysis for experiments conducted using FASE. In each of the following subsections, we introduce the experimental setup used to collect stimulus data (i.e., traces) and experimental numbers against which the simulation results are compared. The first subsection presents the calibration procedures followed to validate the three main network models in the current library: InfiniBand, TCP over IP over Ethernet, and the direct-connect network based on the Scalable Coherent Interface (SCI) protocol [32]. In each case, the appropriate MPI middleware layer is also incorporated on each interconnect in the modeling environment. Section 3.3.2 presents a simple scalability study for a matrix multiply benchmark using the aforementioned interconnects. Finally, Section 3.3.3 showcases the features and capabilities of FASE through a comprehensive scalability analysis of the Sweep3D application from the ASCI Blue Benchmark suite.

3.3.1 Model Validation

In order to test and validate the FASE construction set, the MLD models were first calibrated to accurately represent some prevalent systems available in our lab. Validation of the network and middleware models was conducted using a testbed comprised of 16 dual-processor 1.4 GHz Opteron nodes each having 1 GB of main memory and running the 64-bit CentOS Linux variant with kernel version 2.6.9-22. The nodes employed the Voltaire HCA 400 attached to the Voltaire ISR 9024 switch for 10 Gbps InfiniBand connectivity while Gigabit Ethernet was provided using integrated Broadcom BCM5404C LAN controllers connected via a Force10 S50 switch. The direct-connect network model was calibrated to represent SCI hardware supplied by Dolphin Inc. A simple PingPong


MPI program that measures low-level network performance was used to calibrate the models to best represent the 16-node cluster's performance over a specific interconnect. Three MPI middleware layers were modeled including MVAPICH-0.9.5, MPICH2-1.0.5, and MP-MPICH-1.3.0 for InfiniBand, TCP, and SCI, respectively.

Figures 3-3 through 3-8 show the experimentally gathered network performance values of the InfiniBand, TCP/IP/Ethernet and SCI testbeds compared to those produced by the simulation model. The performance of each configuration closely matches that of the testbed, and the average error between the experimental tests and simulative models was 5% for InfiniBand, 3.6% for TCP/IP/Ethernet, and 2.7% for SCI. Throughput calculated from the PingPong benchmark latencies for message sizes up to 32 MB shows that the simulative bandwidths closely follow the measured bandwidths and have an average error roughly equal to that found in the latency experiments. These results show the component models are highly accurate when compared to the real systems, but it is worth noting that dips in the measured throughput are readily apparent at 4 MB and 256 KB for the InfiniBand and TCP/IP/Ethernet networks, respectively. The decreases in bandwidth are due to overheads incurred in the software layers as the software employs different mechanisms to accommodate the larger-sized transfers, and the corresponding models abstract these throttling points with the goal of best matching the overall trend.

Figure 3-3. InfiniBand model latency validation: (a) small message sizes; (b) large message sizes
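As a rough illustration of how the throughput and average-error figures quoted above could be derived from PingPong measurements, the following Python sketch converts one-way latencies into throughput and computes a mean absolute percentage error between measured and simulated values. The function names, the MB/s units, the error definition, and the sample numbers are all assumptions made for illustration; the text does not state the exact formulas or data used.

```python
def throughput_mb_per_s(message_bytes, one_way_latency_s):
    """Throughput (MB/s) implied by sending message_bytes in one_way_latency_s."""
    if one_way_latency_s <= 0:
        raise ValueError("latency must be positive")
    return message_bytes / one_way_latency_s / 1e6


def mean_percent_error(measured, simulated):
    """Mean absolute percentage error of simulated values against measured ones."""
    if len(measured) != len(simulated) or not measured:
        raise ValueError("need two equal-length, non-empty sequences")
    total = sum(abs(s - m) / m for m, s in zip(measured, simulated))
    return 100.0 * total / len(measured)


# Hypothetical sample data (NOT taken from Figures 3-3 through 3-8).
sizes_bytes = [1024, 65536, 1048576]
measured_latency_s = [10e-6, 100e-6, 1.25e-3]
simulated_latency_s = [10.5e-6, 96e-6, 1.25e-3]

# Error of the simulated latencies against the measured ones.
latency_error = mean_percent_error(measured_latency_s, simulated_latency_s)

# Throughput derived from each latency, then the error between the two curves.
measured_bw = [throughput_mb_per_s(n, t)
               for n, t in zip(sizes_bytes, measured_latency_s)]
simulated_bw = [throughput_mb_per_s(n, t)
                for n, t in zip(sizes_bytes, simulated_latency_s)]
bandwidth_error = mean_percent_error(measured_bw, simulated_bw)

print(f"latency error:   {latency_error:.2f}%")
print(f"bandwidth error: {bandwidth_error:.2f}%")
```

Because throughput here is computed directly from latency, the two error values track each other closely, in line with the statement that the bandwidth error is roughly equal to the latency error.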


Figure 3-4. InfiniBand model throughput validation

Figure 3-5. The TCP/IP/Ethernet model latency validation: (a) small message sizes; (b) large message sizes

Figure 3-6. The TCP/IP/Ethernet model throughput validation

3.3.2 System Validation

After validating the network and middleware models, we proceeded to examine the accuracy and speed of FASE using a simple benchmark: matrix multiply. The selected


Figure 3-7. The SCI model latency validation: (a) small message sizes; (b) large message sizes

Figure 3-8. The SCI model throughput validation

implementation took a master/worker approach with the master node transmitting a segment of matrix A and the entire matrix B to the corresponding workers in sequential order and then receiving the results from each node in the same order. Measurements were collected using the InfiniBand and Gigabit Ethernet models for system sizes of 2, 4, and 8 nodes and data set sizes of 500 × 500, 1000 × 1000, 1500 × 1500, and 2000 × 2000. Each data element is a 64-bit double-precision, floating-point value. The following paragraph outlines the procedure taken to conduct the analysis of the matrix multiply using the FASE methodology and corresponding toolkit. First, the matrix multiply code was instrumented by linking the Sequoia library and the resulting binary was executed for each combination of system and data set size. Trace files for each combination were automatically generated by the Sequoia instrumentation code during execution and served as the stimulus to the


simulation models. The component models from the FASE library were used to create six systems, one for each combination of the three system sizes and two networks analyzed. This step was conducted while collecting characterization data from the matrix multiply. After the trace files were collected and the systems built, each system was simulated using the corresponding trace files for each system size. It should be noted that for this particular experiment, the Sequoia traces were collected using the testbed's InfiniBand network, though simulations were run using both InfiniBand and Gigabit Ethernet networks in order to show the portability of the traces and highlight the flexibility of FASE. Finally, experimental runtimes were measured for each system and data set size running on both networks in order to determine the errors associated with the simulated systems. Table 3-2 presents the experimental and simulative results for the various networks.

Table 3-2. Experimental versus simulation execution times for matrix multiply

                   InfiniBand                     Gigabit Ethernet
System   Data      Exp.      Sim.      Error      Exp.      Sim.      Error
size     size      (sec)     (sec)                (sec)     (sec)
2        500       3.32      3.33      0.21%      3.42      3.38      1.24%
         1000      49.45     49.40     0.09%      48.81     49.58     1.58%
         1500      187.40    187.03    0.20%      187.50    187.43    0.04%
         2000      459.36    458.80    0.12%      460.48    459.50    0.21%
4        500       1.07      1.06      1.48%      1.16      1.12      3.12%
         1000      16.69     16.54     0.93%      16.63     16.76     0.79%
         1500      62.87     62.49     0.61%      63.29     63.06     0.37%
         2000      153.82    153.19    0.41%      154.43    154.02    0.26%
8        500       0.52      0.46      12.70%     0.62      0.58      7.60%
         1000      7.23      7.09      2.66%      7.71      7.50      2.69%
         1500      27.51     26.93     2.12%      28.24     27.94     1.06%
         2000      66.91     65.90     1.52%      68.18     67.60     0.86%

From Table 3-2, one can see that the simulations closely matched the experimental execution times of the matrix multiply. The maximum error, 12.7%, occurred at smaller data set sizes with larger systems due to shorter runtimes that are more greatly affected


by any timing deviations from various anomalies such as OS task management, dynamic middleware techniques, etc. The effects of such anomalies are normally amortized when analyzing the typical, long-running HPC application.

The simulation times for each run of the matrix multiply were also collected in order to quantify the slowdown of using simulation versus real hardware, thus highlighting the "fast" portion of FASE. Table 3-3 shows that the ratios of simulation to experimental wall-clock times are very low and in some cases (e.g., small systems with large data sizes), the simulation actually completes faster than the hardware, as represented by a ratio less than one. The ratios less than one are directly related to the amount of characterization information collected as well as the high level of abstraction of computation events in the simulation models. In the case of the matrix multiply, computation was abstracted through the use of timing and this information was fed into a low-fidelity processor model, thus accommodating short simulation times. As system size and problem size scale, more time is spent triggering high-fidelity network models, thus slowing the simulations. However, the wall-clock simulation times observed within FASE are orders of magnitude faster than cycle-accurate simulators where a 1000× or greater slowdown in wall-clock execution time is common.

3.3.3 Case Study: Sweep3D

Now that the system models have been validated using the matrix multiply benchmark, they can be used to predict the performance of any application executing on them given the proper characterization and stimulus development steps have been conducted. In order to display the full capabilities and features of FASE, a more complex application was selected. The Sweep3D algorithm forms the foundation of a real Accelerated Strategic Computing Initiative (ASCI) application and solves a 1-group, time-independent, discrete ordinate 3D Cartesian geometry neutron transport problem [33], [34]. As shown in Figure 3-9, each iteration of the algorithm involves two main steps. The first step solves the streaming operator by "sweeping" each angle of the Cartesian


Table 3-3. Ratio of simulation to experimental wall-clock execution time

                   InfiniBand                     Gigabit Ethernet
System   Data      Exp.      Sim.      Ratio      Exp.      Sim.      Ratio
size     size      (sec)     (sec)                (sec)     (sec)
2        500       3.32      5.58      1.68       3.42      6.21      1.82
         1000      49.45     19.42     0.39       48.81     24.40     0.50
         1500      187.40    42.71     0.23       187.50    54.60     0.29
         2000      459.36    75.10     0.16       460.48    96.40     0.21
4        500       1.07      9.62      8.96       1.16      11.20     9.64
         1000      16.69     33.44     2.00       16.63     44.10     2.65
         1500      62.87     73.28     1.17       63.29     99.10     1.57
         2000      153.82    131.05    0.85       154.43    174.00    1.13
8        500       0.52      17.92     34.22      0.62      22.60     36.17
         1000      7.29      60.25     8.27       7.71      88.70     11.52
         1500      27.51     129.06    4.69       28.24     195.00    6.92
         2000      66.91     230.99    3.45       68.18     357.00    5.24

geometry using blocking, point-to-point communication functions while the second step uses an iterative process to solve the scattering operator employing collective functions. There are numerous input parameters that may be set including the number of processing elements in the X and Y dimensions of the logical system as well as the number of grid points (i.e., double-precision, floating-point values) assigned to the XYZ dimensions of the data cube. The default data set sizes supplied with Sweep3D are 50 × 50 × 50 and 150 × 150 × 150, though the experiments in this section will also explore an intermediate data set: 100 × 100 × 100.

In this section, we present a set of experiments designed to quantify the accuracy and speed of the FASE simulation environment with respect to the Sweep3D application. The first experiment illustrates the accuracy of FASE by comparing experimentally measured execution times of Sweep3D versus the times produced by corresponding simulated systems. Experiment 2 analyzes the speed of simulations conducted using FASE and demonstrates how the speed scales with system size. The section concludes with a final experiment that showcases the full potential of the FASE


framework by providing a detailed simulative analysis of the Sweep3D application running on systems with various sizes, interconnect technologies, topologies, and middleware attributes.

Figure 3-9. Sweep3D algorithm

3.3.3.1 Experiment 1: Accuracy

The first set of experiments performed using Sweep3D was very similar to those from the last section. The main difference, however, is that the system under study for these experiments leverages the abilities of FASE to simulate heterogeneous components. This system is composed of a heterogeneous, 64-node Linux cluster that features four types of computational nodes as listed in Table 3-4.

Table 3-4. Compute node specifications for each cluster in heterogeneous system

            Node count   Processor                  Memory         OS         Kernel
Cluster 1   10           1.4 GHz Opteron            1 GB DDR333    CentOS     2.6.9-22
Cluster 2   14           3.2 GHz Xeon with EM64T    2 GB DDR333    CentOS     2.6.9-22
Cluster 3   30           2.0 GHz Opteron            1 GB DDR400    CentOS     2.6.9-22
Cluster 4   10           2.4 GHz Xeon               1 GB DDR266    Redhat 9   2.4.20-8

All measurements and Sequoia traces were gathered over the Gigabit Ethernet interconnect. Figure 3-10 shows a comparison of the execution times from the physical and simulated systems while increasing the system and data set sizes. Table 3-5 displays


Figure 3-10. Experimental versus simulative execution times for Sweep3D

the errors between experimental and simulative execution times. In this experiment, we observed slightly higher error rates than the matrix multiply benchmark and network validation tests, but this trend is to be expected considering the increased complexity of the Sweep3D application and heterogeneous system used for the study. In all but five cases, error rates were below 10%, with many cases showing errors around 1%. In the cases with 10% error or greater, we can largely attribute the higher values to extraneous data traffic and spurious OS activity among other effects due to non-dedicated resources. Once again, the increased error rates occurred in cases where either data set sizes were small or system sizes were relatively large or both. The maximum error observed is 23.28%, which is under the acceptable threshold for predicting performance of simulated systems since real-world implementations of the hardware and software devices will have a great effect on the actual performance of the final system.

3.3.3.2 Experiment 2: Speed

For each experiment, the testbed execution time of the application was compared to the simulation time of the virtually-prototyped system. As seen in Figure 3-11, the simulation time increases with both data set size and system size. Increases in either characteristic raise the amount of network traffic, thus causing more computation time to


Table 3-5. Experimental versus simulation errors for Sweep3D

                 Data set size
System size      50 × 50 × 50    100 × 100 × 100    150 × 150 × 150
2                0.69%           0.13%              0.21%
4                0.90%           5.38%              9.76%
8                0.60%           0.25%              2.84%
16               8.36%           10.11%             15.31%
32               5.95%           23.28%             1.03%
64               14.66%          17.36%             1.02%

be spent processing interactions between the higher-fidelity models. In fact, the longest simulation time, one hour, occurred for the 64-node system using the 150 × 150 × 150 data set.

Figure 3-11. Ratios of simulation to experimental wall-clock completion time for varying system and data set sizes

While a one-hour simulation time is within an acceptable tolerance, if we extrapolate the timing results to even larger system and data set sizes, we find that a system of 1024 nodes computing a 250 × 250 × 250 grid-point data set will take approximately 70 hours. In order to cut the time when simulating the computation of very large data sets on large-scale systems, the stimulus development and simulation techniques previously described in this chapter can be employed. In this case, knowledge of the application's execution characteristics can help to speed up simulations. The Sweep3D application


performs twelve iterations of its core code, with each compute node having identical communication blocks and very similar computation blocks per iteration. The affinity across iterations allows us to use the performance data collected during a single iteration to extrapolate the total time taken to execute all twelve iterations. This process can lead to decreased simulation times with little effect on simulation accuracy. In fact, removing all but a single iteration from the Sweep3D traces resulted in a simulation speedup of 9.6× (a total of 7 minutes rather than 65 minutes) while impacting the accuracy of the model by less than 1%.

3.3.3.3 Experiment 3: Virtual system prototyping

Now that we have a fast and accurate baseline system for the Sweep3D application, we can explore the effects of changing the system configuration. This experiment explores the performance impact of increasing the number of nodes in the system by scaling the processing power of each node. This scaling effectively extrapolates the performance of the baseline 64-node system to represent the performance of Sweep3D executing on systems up to 8192 nodes. The results provided in this experiment present the best-case values of each system size since network issues such as switch congestion and multi-hop transfers that arise from adding additional nodes to a system are not considered. Though the actual network pressures of these systems are not fully represented, the results do provide an upper bound performance of the Sweep3D application running on the corresponding system. This upper bound can be used to quickly identify whether or not a particular system is suitable to run a particular application, thus facilitating the evaluation of numerous configurations with the intent of pinpointing a small subset of the original candidate systems to simulate in more detail. Future work will incorporate the stimulus development techniques discussed in Section 3.1.2 so that network contention is considered to improve the accuracy of the performance predictions.

The various networks that we examine include standard Gigabit Ethernet (GigE), an enhanced version of GigE, 10 GigE, InfiniBand, a 2D 8 × 8 direct-connect SCI network,


Figure 3-12. Execution times for Sweep3D running on various system configurations

and a 3D 4 × 4 × 4 SCI network. Each configuration included in this study attempts to shed light on how changes in system size, network bandwidth, network latency, middleware characteristics, and topology affect the overall performance of Sweep3D, each of which is easily configurable via the FASE framework. The results from this experiment are displayed in Figure 3-12, with the 64-node Gigabit Ethernet system used as the baseline.

Figure 3-12 shows the Gigabit Ethernet system experienced nearly linear speedups until the system size reached 1024 nodes. At this point, the communication of Sweep3D begins to dominate execution time. In general, the trends displayed in Figure 3-12 were surprising, given an initial timing analysis of the algorithm showed that half the execution time was spent in communication blocks. Upon further analysis we determined that these trends are due to the "Late Sender" problem, where processes post their MPI_Recvs before the matching MPI_Send is executed by the corresponding process, causing the receiving process to become idle. In the case with 8192 nodes, the application becomes network-bound, even with the Late Sender problem. Therefore, we must change the focus to other network technologies to alleviate the communication bottleneck.
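The Late Sender effect described above can be illustrated with a toy timing model. The Python sketch below is neither the Sweep3D communication pattern nor part of FASE; it assumes a hypothetical linear chain of ranks in which every rank posts its blocking receive at time zero, each rank computes for a fixed time before sending downstream, and network transfer time is ignored entirely. All names are illustrative.

```python
def late_sender_idle(recv_post_time, send_start_time):
    """Idle time of a receiver whose blocking receive is posted before the send starts."""
    return max(0.0, send_start_time - recv_post_time)


def chain_idle_times(num_ranks, compute_s):
    """Per-rank idle time in a linear chain where rank i waits on rank i-1.

    Assumptions (illustrative only): all receives are posted at t = 0,
    every rank computes for compute_s seconds after its input arrives,
    and the network transfer itself takes zero time.
    """
    idle = []
    upstream_send = 0.0  # time at which the upstream rank starts its send
    for rank in range(num_ranks):
        recv_posted = 0.0
        # Rank 0 has no upstream dependency, so it never waits.
        wait = 0.0 if rank == 0 else late_sender_idle(recv_posted, upstream_send)
        idle.append(wait)
        # This rank sends once it has finished waiting and computing.
        upstream_send = recv_posted + wait + compute_s
    return idle


idle = chain_idle_times(num_ranks=4, compute_s=2.0)
print(idle)
```

In this toy model the time spent blocked in a receive is set entirely by upstream computation, so time that a profile attributes to communication blocks would not shrink with a faster network, which matches the observation that the application behaves as processor-bound until communication itself begins to dominate.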


Figure 3-13. Maximum speedups for Sweep3D running on various network configurations

The first attempt to remedy the communication bottleneck for larger system sizes employed optimized versions of TCP and MPI. Specifically, we increased key parameters such as TCP window size and TCP maximum transfer unit (effectively enabling the use of jumbo frames) as well as reduced MPI overhead. This case, labeled Enhanced GigE, showed little improvement in total execution time, leading us to conclude that the bandwidth and latency of the network dictate the performance of communication events rather than the middleware. The next tests conducted replaced the Gigabit Ethernet interconnect with high-performance networks such as 10 GigE, InfiniBand and SCI. Figure 3-12 shows that in smaller systems, the interconnect has little effect on the performance of Sweep3D. However, it becomes apparent that beyond 1024 nodes a faster interconnect provides better speedups. For example, the 10 GigE and InfiniBand interconnects offer speedups of 4.1× and 5.31×, respectively, over the baseline GigE system. Figure 3-13 illustrates the maximum speedups for the various high-performance networks that were tested.

Not only do these tests analyze the effects of adding bandwidth and lowering latency, but the SCI cases demonstrate the power of virtual system prototyping by exploring the impact of mapping the Sweep3D algorithm to drastically different topologies,


specifically a 2D and 3D torus. At first glance, the Sweep3D algorithm seems to map well to direct-connect network topologies based on its nearest neighbor communication pattern; however, from Figure 3-13 it appears that this is not exactly the case. Further analysis of the application and underlying architecture shows that three fundamental characteristics of the SCI network hinder further performance improvements. First, the SCI protocol uses small payload sizes of 128 bytes which cannot effectively amortize communication overhead. The second and third characteristics are the one-way packet flow of a dimensional ring and the packet forwarding or dimension switching delays that occur at each intermediate node while routing packets to the destination. Under certain circumstances, as dictated by the Sweep3D algorithm, few packets experience more than one forwarding delay due to the target node being the next neighboring node in the dimensional ring's packet flow. This case provides optimal mapping between the algorithm and the network architecture resulting in excellent performance. However, this scenario is only one of four communication flows used by Sweep3D. The remaining communications result in numerous transactions that must travel almost the entire length of a dimensional ring in order to reach their destination, resulting in many multi-hop transfers, higher latencies, and lower overall speedups. These negative effects could easily be lost when using low-fidelity, analytical network models due to their inability to capture structural characteristics that can greatly impact overall system performance.

Not only can FASE be used to analyze the effects of interchanging network technologies, but it can also provide a highly detailed analysis of a specific technology. We chose the InfiniBand network for the in-depth study since it showed the best performance of the six configurations. The InfiniBand model, as well as the MPI middleware model, has numerous user-definable parameters that can be changed to correspond to the current and future versions of the technologies. These parameters provide a mechanism to perform fine-grained analyses to squeeze as much performance as possible from a specific technology while also providing valuable insight on any new bottlenecks that may arise


Figure 3-14. Speedups for Sweep3D running on 8192-node InfiniBand system

during future upgrades. From the results in Figure 3-14, one can see that even at 8192 nodes, network enhancements have little effect on the InfiniBand system's performance. The greatest speedup from changes in the communication layer comes from middleware enhancements that achieve a 1.21× performance boost over the baseline. These results indicate that further improvements of the InfiniBand system's computation capabilities are needed in order to show any significant speedups when tweaking the middleware and network layers. The enhancement that increases network and processing performance as well as decreases the overhead associated with the middleware (see Figure 3-14, Middleware + Network + Processor enhancement) reinforces this claim by providing a 2.44× speedup. This study provides but a brief summary of the modifications that can be made to test the effectiveness of key technologies when analyzing an application using FASE.

3.4 Conclusions

The task of designing powerful yet cost-efficient HPC systems is becoming extremely daunting due to not only the increasing complexity of individual computation and I/O components but also the effective mapping of grand-challenge applications to the underlying architecture. In this first phase of research, we presented a framework called the Fast and Accurate Simulation Environment (FASE) that aids system engineers


overcome the challenges of designing systems that target specific applications through the use of analysis tools in conjunction with discrete-event simulation. Within the application domain, FASE provides a methodology to analyze and extract an application's performance-critical data which is then used to discover trends and limitations as well as provide stimuli for simulation models for virtual prototyping. We also provided background on various options for performance prediction of HPC systems through modeling and simulation, and outlined the need for a solution that can provide fast simulation times with accurate results. The FASE framework was then outlined and its different components and features were described.

To showcase the capabilities of FASE, we gathered a variety of results showing the performance of various system configurations. We first provided validation results for our InfiniBand, TCP/IP over Ethernet, and SCI models, showing that the network models that serve as the backbone of the case studies in this paper have been carefully tuned to accurately match real hardware. We then showed the results of a matrix multiply case study where we compared experimental to simulated execution times for a parallel MPI-based matrix multiply benchmark. In most cases, errors in the models were less than 1%, with the maximum error of 12.7% occurring in a case with a small data set size and a large system size. These conditions result in short experimental runtimes of less than one second where transient effects such as OS task management and page faults can cause unpredictable deviations in application execution time. In terms of simulation speed, the slowdowns observed by simulating the parallel matrix multiply were very low, and in some cases, the simulation actually completed before the actual system finished executing the code.

The final case study presented a series of experiments using the Sweep3D benchmark, which is the main kernel of a real Accelerated Strategic Computing Initiative application. We performed simulation and hardware experiments over a range of data set sizes using a Gigabit Ethernet-based system, and again found errors to be very low in most cases.


A maximum error of 23.28% was observed, which is considered acceptable when dealing with predicting the performance of a complex application running on an HPC system. Again, the cases with high errors correspond to non-typical HPC scenarios with larger systems working on small data sets resulting in conditions that amplify deviations in experimentally measured runtimes. We also proposed and employed tactics to speed up simulation by nearly 10× while sacrificing less than 1% accuracy.

With a fast and accurate baseline system established for Sweep3D, we proceeded to use the FASE methodology to predict the performance of the application for systems with various sizes and network technologies. We found that the application was actually more processor-bound than initially anticipated due to the MPI "Late Sender" problem where a process posts an MPI_Recv before the corresponding MPI_Send is executed, causing the receiving process to become idle. However, as the systems increased in size, Sweep3D did become network bound, with 10 Gigabit Ethernet and InfiniBand both providing significant performance improvements over the Gigabit Ethernet baseline mainly due to their increased bandwidth. The analysis of Sweep3D concluded with an in-depth look at its execution on an InfiniBand system while varying fine-grained parameters such as network bandwidth, packet size, and middleware overhead. These modifications provided minimal performance improvements because the algorithm's bottleneck changed from communication to computation when network backpressure was reduced by choosing an improved network technology (i.e., InfiniBand).

The work conducted during this phase of research produced a flexible and comprehensive framework for performance modeling and prediction. This framework provides a generalized methodology for application characterization, design and development of component and system models, and analysis of applications running on the virtual systems under consideration. The work also produced a set of tools and a model library to facilitate performance prediction. The case studies validated the usefulness of FASE by displaying both fast and accurate results when comparing the observed


experimental and simulative values. The studies also illustrated the capabilities of FASE for analyzing the effects of architectural variations in order to improve the scalability of applications. The contributions and accomplishments of this work have been compiled into a manuscript and published in Simulation: Transactions of The Society for Modeling and Simulation International [35].


CHAPTER 4
PERFORMANCE AND AVAILABILITY PREDICTIONS OF VIRTUALLY PROTOTYPED SYSTEMS FOR SPACE-BASED APPLICATIONS (PHASE 2)

This chapter presents two detailed case studies of the FASE framework applied to analyze the effects of various configuration and algorithmic changes to a space-based system. The first case study looks at performance and scalability issues of the NASA Dependable Multiprocessor (DM) executing a key application kernel, the Fast Fourier Transform. The second evaluates the performance and availability of the Synthetic Aperture Radar (SAR) application running on the DM system in a faulty environment such as space. The following sections provide the details and results of these case studies as well as a novel analysis approach to accurately predict system performability. The first section presents background information on the motivations and initial design of the space system. The next section supplies details on the approach taken to design and develop the necessary models to virtually explore the performance, scalability, and availability trade-offs of the candidate system. It also describes the unique analysis approach used to study SAR executing on the DM system. Finally, the experiments and results from both case studies are presented followed by the conclusions drawn from the analyses.

4.1 Background

4.1.1 Project Overview

NASA and other space agencies have had a long and relatively productive history of space exploration as exemplified by recent rover missions to Mars. Traditionally, space exploration missions have essentially been remote-control platforms with all major decisions made by operators located in control centers on Earth. The onboard computers in these remote systems have contained minimal functionality, partially in order to satisfy design size and power constraints, but also to reduce complexity and therefore minimize the cost of developing components that can endure the harsh environment of space. Hence, these traditional space computers have been capable of doing little more than executing small sets of real-time spacecraft control procedures, with little or no processing features


remaining for instrument data processing. This approach has proven to be an effective means of meeting tight budget constraints because most missions to date have generated a manageable volume of data that can be compressed and post-processed by ground stations. However, as outlined in NASA's latest strategic plan and other sources, the demand for onboard processing is predicted to increase substantially due to several factors [36]. As the capabilities of instruments on exploration platforms increase in terms of the number, type and quality of images produced in a given time period, additional processing capability will be required to cope with limited downlink bandwidth and line-of-sight challenges. Substantial bandwidth savings can be achieved by performing preprocessing and, if possible, knowledge extraction on raw data in-situ. Beyond simple data collection, the ability for space probes to autonomously self-manage will be a critical feature to successfully execute planned space-exploration missions. Autonomous spacecraft have the potential to substantially increase their return on investment through opportunistic explorations conducted outside the Earth-bound operator control loop. To achieve this goal, the required processing capability becomes even more demanding when decisions must be made quickly for applications with real-time deadlines. However, providing the required level of onboard processing capability for such advanced features and simultaneously meeting tight budget requirements is a challenging problem that must be addressed.

In response, NASA has initiated several projects to develop technologies that address the onboard processing gap. One such program, NASA's New Millennium Program (NMP), provides a venue to test emergent technology for space. The Dependable Multiprocessor (DM) is one of the four experiments on the upcoming NMP Space Technology 8 (ST8) mission, to be launched in 2009, and the experiment seeks to deploy Commercial-Off-The-Shelf (COTS) technology to boost onboard processing performance per watt [37]. The DM system combines COTS processors and networking


components (e.g., Ethernet) with a novel and robust middleware system that provides a means to customize application deployment and recovery features, and thereby maximize system efficiency while maintaining the required level of reliability by adapting to the harsh environment of space. In addition, the DM system middleware provides a parallel processing environment comparable to that found in high-performance COTS clusters with which application scientists are familiar. By adopting a standard development strategy and runtime environment, the additional expense and time loss associated with porting applications from the laboratory to the spacecraft payload can be significantly reduced.

4.1.2 DM System Architecture

Building upon the strengths of past research efforts [38], [39], [40], the DM system provides a cost-effective, standard processing platform with a seamless transition from ground-based computational clusters to space systems. By providing development and runtime environments familiar to earth and space science application developers, project development time, risk and cost can be substantially reduced. The DM hardware architecture (see Figure 4-1) follows an integrated-payload concept whereby components can be incrementally added to a standard system infrastructure inexpensively [41]. The DM platform is composed of a collection of COTS data processors (augmented with runtime-reconfigurable COTS FPGAs) interconnected by redundant COTS packet-switched networks such as Ethernet or RapidIO [42]. To guard against unrecoverable component failures, COTS components can be deployed with redundancy, and the choice of whether redundant components are used as cold or hot spares is mission-specific. The scalable nature of non-blocking switches provides distinct performance advantages over traditional bus-based architectures and also allows network-level redundancy to be added on a per-component basis. Additional peripherals or custom modules may be added to the network to extend the system's capability; however, these peripherals are outside of the scope of the base architecture.


Figure 4-1. System hardware architecture of the dependable multiprocessor

Future versions of the DM system may be deployed with a full complement of COTS components but, in order to reduce project risk for the DM experiment, components that provide critical control functionality are radiation-hardened in the baseline system configuration. The DM is controlled by one or more System Controllers, each a radiation-hardened single-board computer, which monitor and maintain the health of the system. Also, the system controller is responsible for interacting with the main controller for the entire spacecraft. Although system controllers are highly reliable components, they can be deployed in a redundant fashion for highly critical or long-term missions with cold or hot sparing. A radiation-hardened Mass Data Store (MDS) with onboard data handling and processing capabilities provides a common interface for sensors, downlink systems and other peripherals to attach to the DM system. Furthermore, the MDS provides a globally accessible and secure location for storing checkpoints, I/O and other system data. The primary data flow in the system is from instrument to Mass Data Store, through the cluster, back to the Mass Data Store, and finally to the ground via the spacecraft's Communication Subsystem. Because the MDS is a highly reliable component, it will likely have an adequate level of reliability for most missions and therefore need not be replicated. However, redundant spares or a fully distributed memory approach may be required for some missions. In fact, results from an investigation of the system performance suggest that a monolithic and centralized MDS may limit the scalability of certain applications and these results are presented in Section 4.3.


4.1.3 DM Middleware Architecture

The DM middleware has been designed with the resource-limited environment typical of embedded space systems in mind and yet is meant to scale up to hundreds of data processors per the goals for future generations of the technology. A top-level overview of the DM software architecture is illustrated in Figure 4-2. A key feature of this architecture is the integration of generic job management and software fault-tolerant techniques implemented in the middleware framework. The DM middleware is independent of and transparent to both the specific mission application and the underlying platform. This transparency is achieved for mission applications through well-defined, high-level Application Programming Interfaces (APIs) and policy definitions, and at the platform layer through abstract interfaces and library calls that isolate the middleware from the underlying platform. This method of isolation and encapsulation makes the middleware services portable to new platforms.

Figure 4-2. System software architecture of the dependable multiprocessor

To achieve a standard runtime environment to which science application designers are accustomed, a commodity operating system such as a Linux variant forms the basis for the software platform on each system node including the control processor and mass data store (i.e., the Hardened Processor seen in Figure 4-2). Providing a COTS runtime system allows space scientists to develop their applications on inexpensive ground-based clusters and transfer their applications to the flight system with minimal effort. Such an


easy path to flight deployment will reduce project costs and development time, ultimately leading to more science missions deployed over a given period of time. Table 4-1 provides descriptions of the other DM middleware components.

Table 4-1. The DM middleware components

Component | Description
High-Availability Middleware (HAM) | Provides standard communication interface between all software components including user applications. Guarantees in-order delivery of all messages and supports seamless switching between redundant networks.
Fault-Tolerance Manager (FTM) | Central fault recovery agent for DM system. Monitors status of software agents and reliable messaging middleware. Updates JM tables upon resource changes affecting application scheduling.
Job Manager (JM) | Centralized component that schedules jobs, allocates resources, dispatches processes, and directs application recovery.
Job Manager Agents (JMA) | Distributed software agents that fork the execution of jobs and manage required runtime job information on the local host.
Fault-tolerant Embedded Message Passing Interface (FEMPI) | Application-independent, fault-tolerant, message passing interface adhering to the MPI standards. Provides a subset of the MPI API and supports various fault recovery modes.
MDS Server | Services all data operations between applications and mass memory.

4.2 Approach

The FASE framework presented in Chapter 3 provides an ideal environment for exploring the design options involved with system configuration of the DM system. The models in the pre-built library were designed so that components could be configured for embedded or traditional HPC systems through simple parameter tweaks representing the capacity or capability of a specific resource. Therefore, various design trade-offs can be explored using a variety of hardware and software models in order to analyze their effects on system performance and scalability. In order to apply the FASE framework to study COTS components in space, the original design was extended to support fault injection capabilities. These additional features allow users to explore not only performance-oriented issues, but also those dealing


with fault tolerance and availability. In order to facilitate the use of these new capabilities, the researcher introduces a novel approach to predict the performability, a metric that combines performance and availability to describe degradable systems [43], of COTS-based payload processing systems. This approach analyzes systems in three complementary domains: (1) physical prototype, (2) Markov-reward model, and (3) discrete-event simulation model. Techniques from each domain represent cornerstones in the analysis process though each has its strengths and weaknesses. Physical prototypes offer validity in measured values but provide limited scalability and adaptability. Markov-reward models allow for quick performability measurements for specific failure and recovery rates, but are not suitable when modeling complex systems due to the high dimensionality required for high-fidelity models. Finally, simulation provides a free-form environment to evaluate systems with arbitrary configurations and workloads, but often suffers from increased development time and lengthy analyses. By intelligently leveraging the strengths in each domain, a quick and precise analysis of various system configurations and applications can be achieved that includes a variety of arbitrary workloads and fault-injection campaigns.

The process begins with the evaluation of the prototype system, where real-world performance values such as network latency and component recovery times are measured and used to calibrate the Markov-reward and discrete-event simulation models that would otherwise lack validity. Next, quick performability evaluations of the system's fault-tolerant software architecture are conducted using Markov modeling techniques to identify efficient designs and workloads. Thus, the Markov models trim an otherwise large design space to eliminate the time spent analyzing poor designs. The final step uses pre-built or customized simulation models to analyze architectural enhancements and dependencies within the selected systems and applications at a level of detail that cannot be achieved in the previous domains. The resulting methodology allows candidate systems to be thoroughly and accurately analyzed for both performance and availability


thus allowing designers to compare alternate fault-tolerant architectures for aerospace applications. We apply the three-stage methodology to analyze and quantify the performance and fault-tolerant characteristics of the DM management software and proposed flight system. The following subsections provide more details on the modeling efforts involved with this work.

4.2.1 Physical Prototype

The first stage of the analysis approach involves the development and testing of a prototype system that represents a scaled-down version of the proposed system. The prototype DM system was designed and developed to mirror when possible and emulate when necessary the features of a typical satellite system. As shown in Figure 4-3, the prototype hardware consists of a collection of COTS Single-Board Computers (SBCs) running a Linux-based operating system interconnected with redundant Gigabit Ethernet networks. One SBC is augmented with an FPGA coprocessor, and a reset controller and power supply are incorporated for power-off resets. Six SBCs are used to mirror four data processor boards and emulate the functionality of the two radiation-hardened control and MDS nodes. Each SBC comprises a 1 GHz PowerPC processor, 1 GB of main memory and dual Gigabit Ethernet NICs. A Linux workstation emulates the role of the Spacecraft Command and Control Processor, which is responsible for communication with and external control of the DM system but is outside the scope of this paper. The MPI middleware layer used on the testbed is FEMPI 1.0, a custom, fault-tolerant implementation of a selected subset of the MPI standard [44]. GoAhead's SelfReliant 4.1 is used as the high-availability middleware which provides network communication, liveliness information, and network failover. Finally, the MDS storage device was emulated via a 5400 RPM hard drive.

The prototype is used to measure the achievable performance of the system executing microbenchmarks that exercise its network and MDS subsystems. It is also employed


to gather the response times necessary to detect failed components within the system. Further details on how the prototype is used to validate the DM models are presented in Section 4.3.

(a) Logical diagram (b) Photograph
Figure 4-3. Logical diagram and photograph of DM testbed

4.2.2 Markov-Reward Modeling

After the prototype system has been developed and performance and other key metrics have been measured, the analysis transitions to the Markov-reward modeling domain. Within this domain, quick evaluations are conducted to explore various fault and recovery rates in order to identify the workloads and system configurations that are most interesting for further study. In these studies, steady-state performability (SSP) is the common fitness metric used to describe the performance of degradable multiprocessor computer systems [45]. The SSP allows users to predict a mean computational performance of the system which takes into account both short- and long-term effects which could otherwise cause skew in experimental measurements of system performance. A typical method used to estimate the SSP involves using Markov-reward models (MRMs), constructs based on continuous-time Markov chains (CTMCs). MRMs combine state probabilities, obtained from steady-state analyses of CTMCs, and reward rates based on computational performance to calculate the SSP of a system. Formally, the SSP is defined as the expected asymptotic reward or


\[ SSP = \sum_{i \in S} \pi_i r_i \tag{4-1} \]

where $S$ represents the set of all possible states that the given system can occupy, $\pi_i$ denotes the steady-state probability of the system occupying state $i$, and $r_i$ stands for the reward rate for the $i$-th state [46].

4.2.2.1 Data node model

The data node model focuses on calculating the performability of the node under faulty conditions. To simplify the node model we assumed that jobs are continuously scheduled on the node by the JM. Also, the delay between the completion of one job and start of another is considered to be negligible since the runtime of each job is significantly larger than the scheduling time. These assumptions allow the model to be realized as a six-state CTMC as shown in Figure 4-4. Each state corresponds to a particular condition of the data processing node where the three primary components, the application, JMA, and system (i.e., HAM and operating system), are either operational or not.

Figure 4-4. Markov-reward data node model

The $S_{APP}$ state occurs when the application is executing on the node and all other services are running correctly. This state is the only node configuration that has an


associated non-zero reward rate value equal to 1, which makes the SSP equivalent to the availability for this model. In order for the model to transition out of the $S_{APP}$ state, an SEE-related error must occur causing a hang or crash of the application, JMA, or system governed by the fault rates $\lambda_{AF}$, $\lambda_{JF}$, and $\lambda_{SF}$, respectively. Since the recovery policy (node reset) for node-wide errors and HAM errors is identical, the failure rates are combined to simplify the model. Each failure rate is inversely proportional to the independent variable, $MTBF_{NODE}$ (mean time between faults for the node), whose reciprocal is the SEE rate experienced by the node. The majority of SEEs are expected to impact the CPU; therefore, each of the aforementioned fault rates is obtained by scaling the node fault rate $1/MTBF_{NODE}$ by the CPU utilization of the given software component ($CPU\%_{APP}$, $CPU\%_{JMA}$, $CPU\%_{SYS}$). The CPU utilization is defined by the following equation.

\[ CPU\%_{APP} + CPU\%_{JMA} + CPU\%_{SYS} = 100\% \tag{4-2} \]

The $S_{DET}$ state denotes a detection delay when the error has occurred in the application causing it to abort or crash. This delay is associated with the heartbeat interval of the running application to the JMA. The $S_{JMA}$ state denotes a configuration in which there is no application running but the rest of the system is functioning properly. To transition to the $S_{REC}$ state the JMA must start an application with rate $\mu_{FR}$ which is inversely proportional to the time required by the system to start the process. The $S_{REC}$ state symbolizes the application recovering from a crash. The rate $\mu_{RC}$, at which the system can transition back to $S_{APP}$, is dependent on the checkpointing interval of the application as well as the size of the checkpoint and the transfer time from the MDS. When the JMA fails all running applications are terminated and the model enters the $S_{SYS}$ state. Upon entering the $S_{SYS}$ state, the HAM will immediately attempt to start a JMA with a start rate of $\mu_{FR}$. If the operating system or HAM fails before the JMA starts up, the node model will switch to the $S_{DOWN}$ state and the node will be rebooted. The reboot rate $\mu_{RB}$ dictates the time required for the system to cycle power to the node,


start the operating system and HAM, and reconnect to the system. Tables 4-2 and 4-3 summarize the states and parameters incorporated into the node model, respectively.

Table 4-2. Data node model states

Symbol | Running Components | Description
$S_{APP}$ | SYS, JMA, Application | System is functioning correctly.
$S_{DET}$ | SYS, JMA | Application has crashed or hung.
$S_{JMA}$ | SYS, JMA | The JMA is ready to start or restart the application.
$S_{REC}$ | SYS, JMA, Application | The application is recovering from the crash.
$S_{SYS}$ | SYS | The JMA has crashed or hung; the application is automatically killed.
$S_{DOWN}$ | None | The system has crashed and requires reboot.

Table 4-3. Failure and recovery rates of the node model

Symbol | Rate [1/s] or Value | Description | Type
$MTBF_{NODE}$ | variable | Mean time between faults for a node. | Input
$\lambda_{AF}$ | $CPU\%_{APP} / MTBF_{NODE}$ | Application fault rate. | Derived
$\lambda_{JF}$ | $CPU\%_{JMA} / MTBF_{NODE}$ | JMA fault rate. | Derived
$\lambda_{SF}$ | $CPU\%_{SYS} / MTBF_{NODE}$ | System fault rate. | Derived
$\mu_{RB}$ | 0.0333 | System reboot rate (HAM and OS). | Measured
$\mu_{RC}$ | 0.069061 | Application recovery rate. | Measured
$\mu_{FR}$ | 14.27 | System fork rate. | Measured
$\mu_{DT}$ | 0.8333 | Failed application detection rate. | Measured
$CPU\%_{APP}$ | 70% | Portion of CPU used by Application. | Estimated
$CPU\%_{JMA}$ | 5% | Portion of CPU used by JMA. | Estimated
$CPU\%_{SYS}$ | 25% | Portion of CPU used by OS and HAM. | Estimated

The rates specified in Table 4-3 are divided into three categories: derived, measured, and estimated. The derived rates are calculated based on equations using the SEE rate while the measured rates were obtained by experimental measurements on the DM


prototype system. The CPU utilization values chosen reflect the estimated workload of typical applications running on the DM system.

4.2.2.2 System model

The goal of the Markov model representing the DM system is to approximate the SSP of the system with an arbitrary number of nodes. For this model, we assume each node executes completely independent workloads and, as a result, the model represents a best-case approximation and sets an upper bound on the SSP of the system. The fact that the radiation-hardened control and MDS nodes in the DM system are not susceptible to SEEs further simplifies the model. To predict the SSP of such a system, we developed a Markov-reward model with $N+1$ states, where $N$ is the number of compute nodes in the cluster (see Figure 4-5).

Figure 4-5. Markov-reward system model

Each state in the system model denotes the number of nodes that are currently in the $S_{APP}$ state at a given time. Most commonly, the reward rate associated with each state is simply set to the number of nodes in the $S_{APP}$ state. However, in systems such as those with hot spares or those that incur overhead penalties, each node's reward rate can be modified accordingly. The node failure rate $\lambda_{ND}$, defined in Equation 4-3, is equivalent to the aggregate rate of all transitions from the $S_{APP}$ state, while the recovery rate of a node, $\mu_{ND}$, is the rate at which the node model can transition back to the $S_{APP}$ state. Equation 4-4 provides a formal definition of this recovery rate.

\[ \lambda_{ND} = \lambda_{AF} + \lambda_{JF} + \lambda_{SF} = \frac{1}{MTBF_{NODE}} \tag{4-3} \]
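The rate definitions of Table 4-3 and Equation 4-3, together with the N+1-state system model described above, can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the SHARPE model used in this work: the birth-death transition rates (k times the node failure rate downward, (N - k) times the node recovery rate upward) are assumed because Figure 4-5 is not reproduced here, the recovery rate `mu_nd` is a hypothetical placeholder, and all function names are illustrative.

```python
from math import isclose


def node_fault_rates(mtbf_node_s, cpu_share):
    """Split the node fault rate 1/MTBF_NODE by CPU share (Table 4-3)."""
    return {name: share / mtbf_node_s for name, share in cpu_share.items()}


def system_steady_state(n_nodes, lam, mu):
    """Steady-state probabilities of a birth-death chain.

    State k is the number of nodes in S_APP. Assumed transitions:
    k -> k-1 at rate k*lam and k -> k+1 at rate (n_nodes-k)*mu.
    """
    weights = [1.0]
    for k in range(n_nodes):
        # Detailed balance: pi_k * (N-k)*mu = pi_{k+1} * (k+1)*lam
        weights.append(weights[-1] * (n_nodes - k) * mu / ((k + 1) * lam))
    total = sum(weights)
    return [w / total for w in weights]


def steady_state_performability(probs, rewards):
    """Equation 4-1: sum of state probability times reward rate."""
    return sum(p * r for p, r in zip(probs, rewards))


# CPU shares from Table 4-3; 28,800 s corresponds to three faults per day.
CPU_SHARE = {"AF": 0.70, "JF": 0.05, "SF": 0.25}
rates = node_fault_rates(28800.0, CPU_SHARE)
lam_nd = sum(rates.values())          # Equation 4-3
mu_nd = 0.05                          # hypothetical recovery rate [1/s]
n = 8
probs = system_steady_state(n, lam_nd, mu_nd)
ssp = steady_state_performability(probs, range(n + 1))  # reward = nodes in S_APP
print(f"lambda_ND = {lam_nd:.3e} 1/s, SSP = {ssp:.4f} of {n} nodes")
```

Under these assumptions the chain has a binomial steady state, so the SSP reduces to N times the per-node availability, which is consistent with the best-case, independent-workload reading of the system model.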


\[ \mu_{ND} = \frac{P_{S_{APP}} \, \lambda_{ND}}{1 - P_{S_{APP}}} \tag{4-4} \]

where $P_{S_{APP}}$ is the steady-state probability of the node model occupying the $S_{APP}$ state.

To simulate the data node and system models we use the SHARPE tool which is commonly used to simulate Markov chains, Petri nets, and hierarchical models for availability, reliability and dependability calculations. The tool is actively developed at Duke University [47]. In order to coarsely evaluate the performance of the DM system the researcher developed a hierarchical Markov-reward model that allows for rapid evaluation of potential computational rates achievable for a range of applications under varying fault conditions. Unfortunately, such a basic model lacks the fidelity and precision to explore the effects of network, CPU, MDS, or scheduling performance, which in conjunction with fault conditions can significantly affect the SSP. The quality of the SSP obtained from the Markov-reward model is further evaluated and compared to the simulative model in Section 4.3.

4.2.3 Discrete-Event Simulation Modeling

The final step in the three-stage analysis involves in-depth evaluations of virtually prototyped systems using discrete-event simulation models. In the simulation domain, analyses of application and architectural configurations are conducted with the intent to identify the settings that produce the highest performance and availability. Based on the node and system architectures from Sections 4.1.2 and 4.1.3, discrete-event simulation models of key components were designed and developed. Each model adheres to the FASE methodology of balancing speed and accuracy, and some models actually extend or enhance the pre-existing models in the FASE library. The components listed in Table 4-4 were formally modeled to not only capture the correct functionality of the corresponding technology but also to incorporate their impacts on system performance and fault tolerance. From these core component models, node and system models were developed. Figure 4-6 illustrates the middleware models that compose


a data processing node, system control node, and MDS node. Finally, the virtual flight system was developed using the final design architecture (see Figure 4-7).

Table 4-4. Summary of DM component models

Component | Library | Description
Fault Tolerance Manager | DM | Detects component failures, notifies Job Manager, and takes necessary recovery procedures.
Job Manager | DM | Schedules and manages jobs. Handles task restarts based on available resources.
Job Manager Agent | DM | Starts and monitors applications on data node. Notifies Job Manager when application failure detected.
High-Availability Middleware | DM | Provides reliable communication between nodes in system. Monitors JM, JMA, and other nodes for failures. Notifies FTM of failures.
MDS Server | DM | Handles data access requests to the mass data store.
TCP Layer | FASE | Provides TCP protocol for reliable communication between nodes.
IP Layer | FASE | Provides IP protocol for all network transfers.
Ethernet NIC | FASE | Provides Ethernet protocol for all network transfers. Supports multiple ports.
Ethernet Switch | FASE | Provides Ethernet connectivity between nodes. Supports variety of backplane and routing options.

4.2.4 Fault Model Library

The models described in the previous section capture the functionality and performance-based characteristics as seen in their real-world counterparts. However, they do not include fault detection and recovery mechanisms needed to function properly when exposed to a fault. As a result, a fault model library was designed to integrate key features that enhance the models to react appropriately under various fault campaigns. In addition, the fault model library provides the necessary components to generate and inject faults into an arbitrary system. The models in the library were specifically designed so that new and pre-existing models could be "fault-enabled" with few additions or modifications. The fault models were also designed to create a fault hierarchy such that a single, high-level component could be affected by a fault and the mechanisms would


(a) Compute node (b) Control node (c) MDS node
Figure 4-6. The DM node models

Figure 4-7. The DM flight system model


automatically propagate the fault to all lower-level entities. This hierarchical design not only captures the area of influence of a particular fault type, but it also provides an infrastructure to define interdependencies between various components. Table 4-5 lists the fault model components accompanied with brief descriptions of their functionality.

Table 4-5. Summary of fault models

Component | Description
Fault Generator | Controls when faults are generated. Generation times based on random distributions or user defined.
Fault Controller | Injects faults into system. Selects target component randomly or based on user-defined susceptibility matrix. Monitors the status of all fault managers and modules in the system.
Fault Manager | Propagates injected faults from the fault controller to all lower-level faulty devices. Aids in the recovery process of managed components when necessary.
Fault Module | Provides high-level fault mechanisms such as detection and recovery to integrate into new and pre-existing models. Inherits Fault Base mechanisms and data structures.
Fault Base | Provides low-level fault data structures and mechanisms to integrate into new and pre-existing models. These data structures and mechanisms deal with scheduling events, managing faulty events, and managing memory (e.g., preventing memory leaks).

Figure 4-8 illustrates an example of a fault-enabled system based on the DM architecture. The figure shows the various hardware and software components that can be affected by faults as well as some of the fault models that manage the injection, detection, and reaction to faults in the system. Fault Modules have been integrated into each of the hardware and software components in the system and Fault Managers are used to manage groups of modules. The actual groupings of faulty components are based on either the physical proximity of the components in a device or the management system that controls the liveliness of the component. Another factor to consider when creating these groups is how each component reacts to specific faults. For example, the Data Node Fault Manager in Figure 4-8 manages how faults are injected at the node level such that if the corresponding data node becomes faulty, it will pass the fault to the lower-level fault


managers within the NIC, Middleware, and Applications blocks so that they can disable their corresponding components (e.g., Port, JMA, HAM, or SAR).

Figure 4-8. Example fault-enabled system

Faults are injected into the system by the Fault Control Block which is composed of the Fault Generator and Fault Controller. The Fault Generator creates time-based fault events as dictated by either a random distribution or the user. The Fault Controller receives events from the Fault Generator and provides the injection capabilities necessary to stimulate the virtual system with failures. The Fault Controller targets specific components in the system according to a susceptibility matrix that defines the probability each listed component will experience a fault. The actual percentage values supplied within the susceptibility matrix are user-defined and both the utilization and physical size of each component should be considered when setting the values. Once the Fault Controller determines its target component, it injects the fault into the system via Fault Managers. The Fault Managers relay the fault to the target component so the


target can react according to a particular policy defined by the user. The Fault Module is incorporated into each fault-enabled component and provides virtual detection and recovery functions that can be redefined to allow the user to configure the device to take the necessary actions as dictated by its fault-tolerant policies.

The models within the fault library were programmed primarily using the object-oriented C++ language so that other systems and models can be easily retrofitted with fault injection capabilities. Also, the models were designed with extensibility in mind to support a wide range of detection and recovery methods for many types of faults. All the components described in the previous section have been retrofitted with the necessary fault models and each has been configured to react to and recover from faults as dictated by the policies set up for the DM system. Details on these policies can be found in [48].

4.3 Results and Analysis

This section describes the methods used to analyze and identify performance and availability issues in the DM system architecture. The section presents experiments used to validate the Markov and simulation models through the use of experimentally gathered measurements from the prototype system. These measurements are used to calibrate the models and validation results are presented where applicable. After the models are calibrated, a scalability study of an important application kernel, the 2D FFT, is presented. The goal of this investigation is to find bottlenecks that exist in the proposed design and explore the effects of changing key system features regarding both architectural and algorithmic variations to better map the application to the underlying architecture. The scalability study is followed by an in-depth performability analysis of the DM system using the SAR application in order to evaluate the trade-offs between performance, scalability, and availability. The section concludes with an evaluation of the proposed 20-node flight system incorporating the optimal configurations to maximize performability.
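As a reference point for the fault-injection experiments that follow, the behavior of the Fault Control Block described in Section 4.2.4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the C++ fault library itself: the class and function names mirror Table 4-5 but are hypothetical, exponentially distributed gaps are only one of the random distributions the Fault Generator may use, and the susceptibility values reuse the CPU shares of Table 4-3 purely as placeholders.

```python
import random
from math import isclose


class FaultGenerator:
    """Produces fault times with exponentially distributed gaps (assumed)."""

    def __init__(self, mean_gap_s, rng):
        self.mean_gap_s = mean_gap_s
        self.rng = rng

    def times(self, horizon_s):
        t = self.rng.expovariate(1.0 / self.mean_gap_s)
        while t < horizon_s:
            yield t
            t += self.rng.expovariate(1.0 / self.mean_gap_s)


class FaultController:
    """Selects a target component according to a susceptibility matrix."""

    def __init__(self, susceptibility, rng):
        if not isclose(sum(susceptibility.values()), 1.0):
            raise ValueError("susceptibility values must sum to 1")
        self.components = list(susceptibility)
        self.weights = list(susceptibility.values())
        self.rng = rng

    def pick_target(self):
        return self.rng.choices(self.components, weights=self.weights, k=1)[0]


def run_campaign(mean_gap_s, horizon_s, susceptibility, seed=0):
    """Return a list of (time, target) fault events for one campaign."""
    rng = random.Random(seed)
    generator = FaultGenerator(mean_gap_s, rng)
    controller = FaultController(susceptibility, rng)
    return [(t, controller.pick_target()) for t in generator.times(horizon_s)]


# Placeholder susceptibility matrix (values borrowed from Table 4-3).
SUSCEPTIBILITY = {"Application": 0.70, "HAM/System": 0.25, "JMA": 0.05}
events = run_campaign(100.0, 1_000_000.0, SUSCEPTIBILITY, seed=1)
print(len(events), "faults injected; first event:", events[0])
```

In the actual library the selected fault would then be relayed through the Fault Managers to the target's Fault Module; that propagation step is omitted here to keep the sketch short.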


4.3.1 Model Calibration

4.3.1.1 Component model calibration and validation

Model validation is a critical step in any modeling effort that attempts to provide accurate results comparable to those produced by real systems. Validation of complex systems such as the DM system is difficult to accomplish; therefore, we attempt to overcome this challenge by decomposing and validating its subsystems: the network subsystem and the MDS subsystem. The network subsystem encompasses all software and hardware layers employed during a data transfer which correspond to the HAM, TCP, IP, and Ethernet models. In order to validate this subsystem, a simple PingPong MPI program that measures low-level network performance between two nodes was executed on the prototype system described in Section 4.2.1 and the results were used to calibrate the models to best represent the testbed's network and middleware performance. Figure 4-9a illustrates the experimentally gathered throughput values as compared to those produced by the simulated system. The figure shows the simulation model closely matches the performance measured on the testbed with a mean relative error of 1.27%. A similar mean relative error was observed when comparing the experimental and simulative latency measurements across the studied message sizes.

(a) Network subsystem (b) MDS subsystem
Figure 4-9. Throughput validations for network and MDS subsystem models

Once the network subsystem was validated, the performance of another key subsystem, the MDS, was calibrated according to experimental measurements. Again,


a simple benchmark was developed that transfers data of varying sizes to and from the MDS node. The validation results are shown in Figure 4-9b, with the MDS subsystem model producing mean relative errors of 1.58% for writes and 2.03% for reads. From the validation process and documentation, the DM system's main component parameters were calibrated in order to most accurately represent the testbed system. The values for key parameters are listed in Table 4-6 and this configuration corresponds to the baseline system used in the following experiments.

Table 4-6. Baseline system parameters

Parameter Name | Value
Processor power | 1200 MIPS, 600 MFLOPS
MPI maximum throughput | 57 MB/s
MPI message latency | 13.6 ms
HAM buffer size | 2000000 bytes
Network bandwidth | Non-blocking 1000 Mb/s
Network switch latency | 5 µs
MDS bandwidth (write/read) | 60/40 MB/s
MDS latency (write/read) | 300/500 µs
MDS open file overhead | 8 ms

4.3.1.2 System performability model

In addition to component calibration, the simulation system models were validated with regard to their near-idealistic performability as compared to those produced by the Markov-reward models presented in Section 4.2.2. For this experiment, a serial version of an LU decomposition kernel was scheduled on each node in the tested systems. Each LU job processed matrices of 1000 × 1000 elements with each element being an 8-byte double. The experiment varied the system size from four to thirty-two nodes by powers of two while exploring ten $MTBF_{NODE}$ values. The minimum fault rate expected for each DM data node was estimated to be three faults per day, which corresponds to the maximum $MTBF_{NODE}$ value analyzed (28,800 seconds = 8 hours) and a relatively hospitable


environment. The remaining rates were selected to analyze the system's performability in harsher conditions and the results from this study are shown in Figure 4-10.

(a) Performability (b) Error
Figure 4-10. Markov versus simulation DM system performability comparison

Figure 4-10a shows a comparison between the performability numbers collected for the Markov and simulation models while Figure 4-10b illustrates the relative errors of the simulation results as compared to those produced by the Markov model. One can see that for larger $MTBF_{NODE}$ values, the two analysis techniques yield near identical results. However, deviations between the approaches become apparent when analyzing systems exposed to more faults (i.e., small $MTBF_{NODE}$ values). These deviations result from the varying levels of detail captured by each modeling approach. In this instance, the simulation model captures extra performance penalties such as network and scheduling delays that affect the performance of the system. In high fault conditions, these penalties begin to accumulate due to numerous job restarts thus negatively impacting the system's overall performability. In addition, the deviations at the smaller $MTBF_{NODE}$ values increase with the system size due to scheduling overhead as well as resource contention in the network and MDS node not modeled in the Markov-reward models.

4.3.2 Case Study: Fast Fourier Transform

The experiments conducted for this portion of the study explore the performance and scalability of the 2D FFT kernel executing on the DM system. A fault-tolerant, parallel


2D FFT serves as the baseline algorithm, which distributes an image evenly over N processing nodes and performs a logical transpose of the data via a corner turn. A single iteration of the FFT, illustrated in Figure 4-11, includes several stages of computation, inter-processor communication (i.e., corner turn), and several MDS accesses (i.e., image read and write and checkpoint operations).

Figure 4-11. Dataflow diagram of parallel 2D FFT

The results of the baseline simulation (see Figure 4-12) show that the performance of the FFT slightly worsens as the number of data nodes increases. In order to pinpoint the cause of the performance decrease of the FFT, the processor, network, and MDS characteristics were greatly enhanced (i.e., up to 1000-fold). The results in Figure 4-12 show that enhancing the processor and network has little effect on the performance of the FFT, while MDS improvements greatly decrease execution time and enhance scalability. The FFT application performance is so directly tied to MDS performance because of the high number of accesses to the MDS, the large MDS access latencies, and the serialization of accesses to the MDS.

After the MDS was verified as the bottleneck for the 2D FFT, several options were explored in order to mitigate the negative effects of the central memory. The options included algorithmic variations, enhancing the performance of the MDS, and combinations of these techniques. Table 4-7 lists the different variations. Each technique offers performance enhancements over the baseline algorithm (i.e., P-FFT). Figure 4-13 shows that the parallel FFT with distributed checkpointing and


distributed data provides the best speedup (up to 740×) over the baseline because it eliminates all MDS accesses. Individually, the distributed checkpointing and distributed data techniques result in only a minimal performance increase since the time taken to access the MDS still dominates the total execution time. MDS performance enhancements reduce the execution time of the parallel FFT by a factor of 5. Switching the FFT algorithm (see Figure 4-14) to the distributed version achieves a 2.5× speedup over the baseline which can then be further increased to 14× and 100× by employing MDS improvements and distributed data, respectively. It is noteworthy that the distributed FFT is well suited for larger system sizes since the number of MDS accesses remains constant as system size increases.

Figure 4-12. Execution time per image for baseline and enhanced systems

Results for the parallel 2D FFT (Figure 4-13) magnify the effects of the MDS on the system's performance. Though the parallel FFT's general trend shows worse performance as system size scales, the top four lines show numerous anomalies where the performance of the FFT actually improves as the number of nodes in the system increases. These anomalies arise from the total number of MDS accesses needed to compute a single image for the entire system. For example, a dip in execution time occurs in the baseline parallel FFT algorithm when moving from 18 to 19 nodes. The total number of MDS accesses of the parallel FFT using 18 nodes is 90 while the number of accesses decreases to 76 for


the 19-node case. Since the MDS is the system's bottleneck, the execution time of the algorithm benefits from the reduction of MDS accesses. Only in the parallel FFT with distributed data and distributed checkpointing option do we see the "zig-zags" disappear due to no data transfers occurring between the nodes and the MDS. The distributed FFT (see Figure 4-14) also does not show any performance anomalies due to the nature of the algorithm. That is, the number of MDS accesses remains constant per image since only one node is responsible for computing that image.

Table 4-7. The FFT algorithmic variations and system enhancements

Algorithm/Technique | Description | Label
Parallel FFT | Baseline parallel 2D FFT. | P-FFT (Baseline)
Parallel FFT with distributed checkpointing | Parallel 2D FFT with "nearest neighbor" checkpointing: data node i saves checkpoint data to data node (i+1) mod N, where i is a unique integer (0 ≤ i ≤ N−1) and N is the number of tasks in a specific job. | P-FFT-DCP
Parallel FFT with distributed data | Parallel 2D FFT with each node collecting a portion of an image for processing, thus eliminating the data retrieval and data save stages. | P-FFT-DD
Parallel FFT with distributed checkpointing and distributed data | Combination of both distribution techniques described above. | P-FFT-DCP-DD
Parallel FFT with MDS enhancements | Parallel 2D FFT using a performance-enhanced MDS. The MDS bandwidth is improved 100-fold and the access latency is reduced by a factor of 50. | P-FFT-MDSe
Distributed FFT | A variation of the 2D FFT that has each node process an entire image rather than a part of the image. | D-FFT
Distributed FFT with distributed data | Distributed 2D FFT algorithm with each node collecting an entire image to process. | D-FFT-DD
Distributed FFT with MDS enhancements | Distributed 2D FFT algorithm using a performance-enhanced MDS. | D-FFT-MDSe
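The "nearest neighbor" checkpointing rule listed for P-FFT-DCP in Table 4-7 can be sketched in Python as follows. This is a minimal sketch: only the (i + 1) mod N mapping comes from the table, while the function names and the recoverability check are illustrative assumptions that follow from that mapping and are not stated in the source.

```python
def checkpoint_partner(i, n):
    """Data node that stores the checkpoint of data node i: (i + 1) mod N."""
    if not 0 <= i < n:
        raise ValueError("node index must satisfy 0 <= i <= N-1")
    return (i + 1) % n


def checkpoint_source(j, n):
    """Data node whose checkpoint is held by data node j (inverse mapping)."""
    if not 0 <= j < n:
        raise ValueError("node index must satisfy 0 <= j <= N-1")
    return (j - 1) % n


def recoverable(failed, n):
    """True if every failed node's checkpoint sits on a surviving node.

    Illustrative assumption: a checkpoint is lost when its holder fails.
    """
    failed = set(failed)
    return all(checkpoint_partner(i, n) not in failed for i in failed)


n_tasks = 18
for node in (0, 8, 17):
    print(f"node {node} checkpoints to node {checkpoint_partner(node, n_tasks)}")
```

Because each node writes to a peer rather than to the MDS, this scheme removes the checkpoint traffic from the central memory, which is the effect the P-FFT-DCP variation is meant to study.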


The results in Figures 4-13 and 4-14 correspond to 1 MB images; thus, we conducted simulations to analyze the effects of larger image sizes. Our results showed that the algorithms and enhancements reversed the trend for the parallel FFT. That is, the execution times improved as the system size grew, though the improvements were minimal. Also, the sporadic performance jumps were amortized due to the large number of MDS accesses as compared to the variance in the number of accesses. The distributed FFT with distributed data was the only option that showed a large improvement because more processing could occur when data was more readily available for the processors. The results demonstrate that a realistic application can be effectively executed by the DM system if the mass memory subsystem is improved to allow for parallel memory accesses and distributed checkpoints.

Figure 4-13. Parallel 2D FFT execution times per image for various performance-enhancing techniques

4.3.3 Case Study: Synthetic Aperture Radar

For the next case study, we evaluate the performance and availability of a more complex application, SAR. Synthetic Aperture Radar (SAR) is a high-resolution,


broad-area imaging process used for reconnaissance, surveillance, targeting, navigation, and other operations requiring highly detailed, terrain-structural information [49]. It uses a two-dimensional, space-variant convolution that can be decomposed into two domains of processing: range and azimuth. In order to correctly transition between the range and azimuth domains, the data must be reordered via a transpose operation [50]. This case study analyzes a fault-tolerant version of SAR that incorporates an optional checkpointing stage (striped block in Figure 4-15a) to save and recover rollback points in the event of a failed job. Figure 4-15a illustrates the dataflow of the fault-tolerant SAR application.

Figure 4-14. Distributed 2D FFT execution times per image for various performance-enhancing techniques

There are various implementations of the SAR application each differing based on the data decomposition across participating processing nodes and, thus, the amount of communication and computation conducted by each node [51]. For this study, we consider the patch-based approach which splits each SAR image into "patches" along the azimuth dimension and distributes each patch to an available compute node. This patched version does not require communication between participating nodes although each node must


aDataowdiagram bDatadecompositionFigure4-15.SARdataowwithoptionalcheckpointstagesandpatcheddatadecomposition fetchandprocessadditionaldatatoensurecorrectresults.Figure 4-15 billustratesthedatadecompositionofthepatch-basedimplementation.ThebaselineSARapplicationusedthroughoutthisstudyprocesses280005616-elementimageswhichistheapproximatesizeoftheimagescollectedbytheEuropeanRemote-SensingERSsatellites[ 52 ].Eachelementisstoredwithinthemassdatastoreasacomplexpairof8-bitintegers-bytestotal,thetypicalformatusedforrawSARdata.WhenthedataisimportedbytheSARapplication,eachelementisexpandedtoacomplexpairof32-bitoating-pointnumbers-bytesperpairinordertoimproveprecisionandreducethepotentialforround-oerrors,andtherangedimensionispaddedto8192elementstoincreasetheeciencyoftheFFTcalculations.WhenSARiscomplete,thepaddedelementsintherangedimensionareremovedfromtheprocessedimageandtheremainingelementsareconvertedtocomplexpairsof16-bitshortintegers-bytesperpairthusreducingtheamountofstorageneededtostorethedatabacktotheMDS.ThepatchdatasizeisP5616elements,wherePisthepatchsize,andtheoverheaddatasizeis12965616elementsperpatch.Foreachsimulationrun,theDM 84

PAGE 85

system is observed over ten, 100-minute orbits and the radiation-hardened control and MDS nodes are assumed to experience no failures. For the fault-injection experiments, the fault control block described in Section 4.2.4 creates and inserts faults into the system using an exponential distribution with a mean, MTBF_SYSTEM, defined as:

$$MTBF_{SYSTEM} = \frac{MTBF_{NODE}}{N} \qquad (4\text{-}5)$$

MTBF_NODE is the mean time between faults per node and N is the number of nodes in the system. The MTBF_NODE rates considered in the fault experiments are identical to those investigated in Section 4.3.1.2 and represent radiation conditions ranging from minimal to extreme. Faults are injected into a particular node based on a uniform distribution and the specific component to target on the selected node is dictated by the following percentages: SAR Application = 70%, HAM/System = 25%, and JMA = 5%. These percentage values were estimated based on the anticipated behavior of the SAR application executing on a DM data node. Also, faults can be injected into recovering components, thus restarting the recovery process.

The following sections describe the techniques and capabilities within the two modeling domains of the three-stage analysis as they are applied to evaluate the performance and availability of SAR executing on the DM system. Section 4.3.3.1 presents the amenability study of SAR using the Markov-reward model to determine if the application maps well to the DM architecture. After the study, we enter the discrete-event simulation domain in order to explore the various application and architectural options available to improve performance and availability. Section 4.3.3.2 reports the performability and system throughput of the patch-based SAR application while considering various patch sizes. It then evaluates the different checkpoint storage options with the intent of improving the fault tolerance of the application. Finally, we

PAGE 86

investigate numerous architectural enhancements to support efficient computing on a twenty-node system and conclude with a final analysis of the proposed flight system.

4.3.3.1 Amenability study

This section begins with a preliminary analysis of the SAR application using the Markov-reward model to determine whether the workload characteristics of this particular application are appropriate for the DM system. The study presents best-case performability numbers of the patch-based SAR application employing 2048-, 4096-, and 8192-element patches executing on systems using 4, 8, 16, and 32 data nodes. The fault injection rates explored are identical to those used in Section 4.3.1.2.

Figure 4-16. Amenability results via Markov model for patch-based SAR application

From Figure 4-16, we can see that the patched SAR shows promising performability numbers for each system size and patch size while the system experiences relatively benign conditions. However, as the fault rate increases, we observe decreasing performability in all cases. Furthermore, larger patch sizes have negative effects on the performability of the system due to the increased amount of time required to complete the processing of each SAR job. In fact, the Markov model reported a difference of 32.9% in performability

PAGE 87

between the 2048-element patch and the 8192-element patch at the highest fault rate. From these results, we observe that the patch-based SAR application produces good performability numbers for the fault rates targeted for the DM system, thus making it a good candidate for further investigation using the discrete-event simulation approach.

4.3.3.2 In-depth application analysis

For the next step in the case study, we transition into the final stage of our proposed methodology: the discrete-event simulation domain. Within this stage, we have the capability to analyze many interesting application and system options while exposing each configuration to various fault conditions. This section focuses on the various options available for the patch-based SAR application. Similar to the amenability study, we explore the impact of patch size on the system's performability while considering patches with 2048, 4096, and 8192 elements in the azimuth dimension. The objective of this study is to determine which patch size achieves the best performance. Figure 4-17 illustrates the results collected from the simulations. The left column of charts shows the performability percentages of each patch size and the right column displays the corresponding throughputs while varying system size and MTBF_NODE. From the figures in the left column, we see that the 2048-element patch size has the highest performability in all system sizes and fault rates due to the application's short execution time. As the patch size grows, the execution time for each SAR job lengthens, thus increasing the probability of a fault interrupting the completion of a job. These results concur with those reported by the Markov-reward model in the previous section. However, the simulations produced lower performability percentages than those observed using the Markov model. In fact, a maximum reduction of 43.0% in the performability of larger systems experiencing high fault rates was reported by the discrete-event simulation models. Another key observation is that the performability of each system decreases as the system size enlarges, thus indicating a potential bottleneck within the system. Our

PAGE 88

(a) 2048-element performability (b) 2048-element throughput (c) 4096-element performability (d) 4096-element throughput (e) 8192-element performability (f) 8192-element throughput
Figure 4-17. System performability percentages and throughputs for patch-based SAR

PAGE 89

Markov-reward model fails to show this trend due to its inability to capture architectural dependencies. Despite observing decreased performability when using larger patches, the system throughputs for these jobs can be much higher than the 2048-element case. In fact, the figures in the right column show improved throughputs up to 1.86× and 2.29× for the 4096- and 8192-element cases, respectively, in low-fault environments. However, the 2048-element case does outperform the 8192-element patch size when the MTBF_NODE rate is less than 200 seconds. In these configurations, the mean times between faults are shorter than the average execution times of the applications using larger patch sizes, thus increasing the probability of a fault causing a SAR job to fail. Finally, for each patch size and fault rate, the throughput reaches its peak value in systems with eight data nodes. Although the exact reason for this peak in performance is not fully clear from this particular study, we hypothesize that the centralized MDS is the main cause. The study conducted in the next section verifies this hypothesis by observing the effects of scaling the performance of key system components. For this study, we are most concerned with light to moderate fault rates (i.e., MTBF_NODE values between 28000 and 3600 seconds) where the performability percentages for all patch and system sizes are very similar. Therefore, we focus on the 8192-element patch configuration for further analysis of SAR and the DM system in order to maximize the system's throughput.

Now that SAR has been configured properly with regard to performance and availability, we can evaluate various checkpointing options with the goal of improving application fault tolerance. Table 4-8 provides a list of the checkpoint options along with brief descriptions of each. For this evaluation, we observe the performability and throughput of four system sizes (4, 8, 16, and 32 data nodes) while incorporating the optional checkpoint stage for the SAR application (see Figure 4-15(a)). The objective of this study is to observe the impact on performance and availability of checkpointing within the

PAGE 90

SAR application. Furthermore, we wish to measure and compare the overheads associated with storing the checkpoint data to a reliable, centralized location (MDS node) versus checkpointing to unreliable, distributed data nodes.

Table 4-8. Checkpoint options explored using patch-based SAR application
Checkpoint option | Description | Label
No checkpointing | No checkpointing conducted. | NoCP
MDS checkpointing | Checkpoint data is stored on the MDS node. | MDSCP
Data node checkpointing | Checkpoint data is stored on the nearest neighbor data node. | DataCP

Figure 4-18 shows the performability and throughput observed for each checkpoint option while varying system size and MTBF_NODE. The performability is reported by the solid lines in the charts while the throughput is represented by dotted lines. When comparing the results from all four figures, we see that as system size increases, performability decreases for all MTBF_NODE rates while throughput increases. The MDSCP checkpoint option reports both the lowest performability and throughput for all cases. Again, the likely cause of this poor performance is the increased pressure placed on the centralized MDS node. In the smaller system sizes, the DataCP option reports an approximate 11% drop in performability. Also, using the DataCP option lowers throughputs by 33.5% and 27.5% for systems consisting of two and four data nodes, respectively. However, this option does show slight benefits in the 16- and 32-node systems. For both system sizes, nearly equivalent performability percentages were reported and a maximum speedup of 1.08× in throughput was measured when compared to the NoCP case.

Another important observation from this study is that performability does not always translate into raw performance (i.e., throughput). Figure 4-18(b) shows nearly equivalent performability percentages between the three checkpoint options at low fault rates. Conversely, the throughput of the NoCP option is 2.5× and 1.3× greater than the MDSCP and DataCP options, respectively. This observation suggests that while performability is a useful metric to evaluate the overall utilization of a degradable

PAGE 91

system in a faulty environment, it does not represent the true performance of a specific application since it does not differentiate between meaningful computation and the processing conducted due to extra mechanisms such as checkpointing.

(a) 4 data nodes (b) 8 data nodes (c) 16 data nodes (d) 32 data nodes
Figure 4-18. System performability and throughput for 8192-element patch-based SAR executing on various system sizes

The results attained in this study suggest that the patch-based SAR application is not well suited for checkpointing. Although the DataCP option achieved improved throughput over the NoCP option in large system sizes, the improvement was minimal and the additional network transactions and demands on the data nodes could easily negate any gains if multiple jobs were executed on each node. In the final study of the flight system, we focus on the 8192-element patched SAR application with checkpointing disabled.
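As an illustration of the fault-injection scheme used throughout these experiments, the following Python sketch draws fault times from an exponential distribution with mean MTBF_NODE / N, picks the target node uniformly, and selects the target component using the 70/25/5 percentages given earlier. It is not the FASE fault control block; the function name generate_faults, the seed parameter, and the tuple layout are assumptions made only for this example.

```python
import random

# Component targeting percentages quoted in the text for a DM data node.
COMPONENT_WEIGHTS = {"SAR application": 0.70, "HAM/System": 0.25, "JMA": 0.05}


def generate_faults(mtbf_node, num_nodes, duration, seed=0):
    """Return a time-ordered list of (time, node, component) fault events.

    Hypothetical helper: inter-arrival times are exponential with mean
    MTBF_SYSTEM = MTBF_NODE / N, the node is chosen uniformly, and the
    component is chosen according to COMPONENT_WEIGHTS.
    """
    rng = random.Random(seed)
    mtbf_system = mtbf_node / num_nodes
    components = list(COMPONENT_WEIGHTS)
    weights = list(COMPONENT_WEIGHTS.values())
    faults = []
    t = 0.0
    while True:
        # expovariate takes the rate, i.e. the reciprocal of the mean.
        t += rng.expovariate(1.0 / mtbf_system)
        if t >= duration:
            return faults
        node = rng.randrange(num_nodes)
        component = rng.choices(components, weights=weights)[0]
        faults.append((t, node, component))
```

For ten 100-minute orbits (60,000 seconds) on a 20-node system with an MTBF_NODE of 3600 seconds, generate_faults(3600, 20, 60000) would be expected to return on the order of 60000 / 180, or roughly 333, faults.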

PAGE 92

4.3.3.3 Flight system

In this section, we explore the performability of the patch-based SAR application with 8192-element patches executing on the DM flight system composed of twenty data nodes [53]. From the previous section, we have shown that the performance of the DM system beyond eight nodes suffers due to an unidentified bottleneck within the system. In this study, we enhance and modify the system architecture in order to identify and remedy this bottleneck and thereby efficiently support twenty data nodes processing SAR. Table 4-9 lists the various architectural enhancements investigated. The objective of this study is to modify the current DM system design to support twenty data nodes and to improve the performability and throughput of the system with realistic upgrades and augmentations.

Table 4-9. Architectural enhancements explored for flight system
Enhancement | Description | Label
Processor | Increases processing power of floating-point unit by 2× and increases throughput and decreases latency of middleware by 2×. | Proc
MDS storage device | Increases bandwidth and decreases latency of MDS storage device by 2×. | MDSe
MDS nodes | Incorporates N MDS nodes within system. | NMDS

Before the enhancements were studied, we measured the performability and throughput of a 20-node system using default settings that served as the baseline configuration. To identify the system bottleneck, we target two main components of the system: the data node processor and the MDS storage device. By accelerating the data processor by 2×, we assume that the floating-point unit and middleware layer attain equivalent boosts in performance. Therefore, this enhancement improves the performance of both floating-point computations and network transfers. Figure 4-19(a) shows that upgrading the processor provides no performance gains for large MTBF_NODE values; however, a 1.58× speedup is attained for high fault rates due to reduced execution times as compared to the baseline. When we improve the performance of the MDS storage device, we observe speedups ranging from 1.7× to 2.1× over the baseline. These

PAGE 93

speedups suggest that the MDS is the main bottleneck of the system for the SAR application. To further substantiate this claim, we augment the current system design with additional MDS nodes in order to reduce contention. Figure 4-19(a) illustrates that the DM system employing one extra MDS node doubles the performance observed from the baseline system for low fault rates and more than triples it in extreme faulty conditions. Employing three MDS nodes in the 20-node DM system further improves the performance of the system by 2.5× in light and moderate faulty conditions and 4.4× in high fault rates. An interesting observation from Figure 4-19 is that the speedup reported for each enhancement increases with the fault rate. This increase is caused by the reduction in execution time of the SAR application, thus allowing the job to complete more frequently without experiencing a failure.

(a) Component enhancements (b) Combination enhancements
Figure 4-19. Speedups of architectural enhancements for patch-based SAR

Next, we investigate the impact of combining the individual component enhancements in order to maximize system performance. From Figure 4-19(b) one can see that significant speedups are achieved from combining an upgraded processor and MDS storage device with a system using multiple MDS nodes. The system using two MDS nodes showed speedups ranging from 12.8× to 4× while the system incorporating three MDS nodes was found to have a maximum speedup of 17.5× in high fault conditions and 5.1× in the environments in which data nodes experience three faults per day.
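The observation that speedups grow with the fault rate can be illustrated with a deliberately simplified analytical model; it is not the FASE simulation and is offered only as a sketch. We assume faults arrive as a Poisson process with mean spacing MTBF, that a job interrupted by a fault restarts from scratch, and that recovery takes no time. Under these assumptions, the expected time to complete a job of length T is MTBF × (exp(T / MTBF) − 1). The function names and the 300-second execution time in the usage note are hypothetical.

```python
import math


def expected_job_time(exec_time, mtbf):
    """Expected wall-clock time to finish one job under the restart model.

    Assumes Poisson faults with mean spacing `mtbf`, restart from scratch
    on every fault, and zero recovery time (all simplifying assumptions).
    """
    rate = 1.0 / mtbf
    return math.expm1(rate * exec_time) / rate


def speedup_from_enhancement(exec_time, factor, mtbf):
    """Throughput speedup when an enhancement divides execution time by `factor`."""
    baseline = expected_job_time(exec_time, mtbf)
    enhanced = expected_job_time(exec_time / factor, mtbf)
    return baseline / enhanced
```

With a hypothetical 300-second job and an enhancement that halves execution time, speedup_from_enhancement(300, 2, mtbf) stays close to 2 when the MTBF is large and rises well above 2 as the MTBF approaches the job length, which mirrors the qualitative trend reported for Figure 4-19.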

PAGE 94

From the results in the architectural study, the MDS was deemed the main bottleneck of the system. In order to remedy this contention point for the 20-node flight system, we propose using three MDS nodes with enhanced storage devices. We also include upgraded data-node processors in order to accelerate data computations and middleware processing. Using this upgraded design, we conduct the final performability study of the SAR application. Figure 4-20 provides the results.

Figure 4-20. System performability and throughput of 20-node DM flight system executing patch-based SAR

The results in Figure 4-20 show that the proposed flight system performs well in most radiation conditions. In fact, the performability of the system is predicted to exceed 99.5% in light to moderate fault conditions (i.e., MTBF_NODE > 7200 seconds). The minimum performability of the system executing the patch-based SAR application is 54.0%. The throughputs achieved by the flight system were greatly improved even at the highest fault rates. The minimum throughput was measured to be 316 images per orbit and the maximum is 587 images per orbit. Assuming 100% of the system's data nodes are dedicated to processing SAR images, the throughput of the proposed flight system dictates that it should be able to support a sustainable input rate of 29.35 MB/sec of raw SAR data from the sensors. However, the expected input rate currently used by the ERS satellites was calculated to be approximately 11.90 MB/sec. This difference translates

PAGE 95

into a situation where only 9 data nodes are required to compute SAR jobs while the remaining nodes are free to perform other compute jobs, conduct test diagnostics, or simply remain idle to conserve power. This final observation tells us that the proposed DM flight system architecture executing the patch-based SAR application is more than suitable to handle the large demands for computation seen in the ERS satellites.

4.4 Conclusions

This phase of research presented an approach that combines analysis techniques using small-scale prototype systems, Markov-reward models, and discrete-event simulation models in order to quickly and accurately evaluate the performance and availability (i.e., performability) of aerospace systems and applications. The combination of these techniques allowed us to calibrate component models using experimental measurements, quickly pinpoint workloads and fault rates supported by the management software via Markov-reward models, and thoroughly investigate specific applications executing on virtually prototyped systems through discrete-event simulation. Details of each analysis technique and the extensions and enhancements to the FASE framework were outlined. Next, models were calibrated to reflect the performance of the physical testbed system in order to ensure accurate results. Finally, two case studies reported performance, scalability, and availability predictions of the 2D FFT kernel and SAR application executing on the NASA Dependable Multiprocessor to showcase the capabilities of the presented approach.

For model calibration, we used a small-scale prototype system consisting of four data nodes, one MDS node, and one control node to collect performance measurements. These measured values were then used to calibrate the Markov and simulation models, and the MDS and network subsystem models were validated using simple benchmarks to confirm the accuracy of the simulation models. A performability validation experiment was then conducted using an LU decomposition kernel to compare results between the Markov-reward models and simulation models while each was subjected to various fault

PAGE 96

rates. The results showed differences between the modeling approaches of less than 1% for low fault rates, though larger errors were observed for higher fault rates due to shortcomings (e.g., modeling architectural dependencies) of the Markov model.

After validation, the DM system models were used to evaluate the performance and scalability of the system executing the 2D FFT kernel. This study exposed the centralized MDS as a potential performance bottleneck in jobs that frequently access the MDS. Various techniques were explored to mitigate the MDS bottleneck including distributed checkpointing, distributing interconnections between sensors and data processors (i.e., distributed data), algorithm variations, and improving the performance of the MDS. The study showed that eliminating extraneous MDS accesses was the best option, though enhancing the MDS memory was also a good option for increasing performance. Regarding scalability, changing the algorithm from a parallel to a distributed approach and including distributed checkpointing provides the best performance improvement of all the options analyzed. For large image sizes (i.e., 64 MB), the distributed FFT with distributed data was the only option that showed a large improvement because more processing could occur when data was more readily available for the processors.

The second case study used the patch-based SAR application to study its performance and availability while running on the DM system. The Markov model was initially employed to quickly determine that the performability of the application using various patch sizes was suitable for further evaluation on the DM system. Once this preliminary analysis was successful, we continued our three-stage analysis of SAR using the discrete-event simulation models. Simulation results were similar to those produced by the Markov-reward model but with one key difference. The performability values predicted by the simulation model were much lower than those produced by the Markov model when considering high fault rates and large systems. For example, the simulation-based performability of a 32-node system experiencing an MTBF_NODE rate of 60 seconds was found to be 6.1% compared to the 49.1% reported by the Markov model. Again, the

PAGE 97

Markov model did not capture the architectural dependencies of the SAR application on the MDS node and thus the results overlooked its impact on the performability of the larger systems. The simulation model was also used to gauge the overall system throughput while executing SAR. The application produced its highest throughput (16 images per orbit) when executing on an 8-node system; however, in some cases, systems beyond this size reported reductions in throughput due to contention at the MDS. Finally, checkpointing was employed in an attempt to improve the fault tolerance of the SAR application. Two storage options were explored such that checkpoint data was stored either on the MDS node or on a neighboring data node. The results showed that for all system sizes and fault rates, checkpointing to the MDS node reduced performability and system throughput. Checkpointing to a neighboring data node was found to have slight benefits in larger system sizes; however, the gains were much too small and checkpointing was deemed unnecessary for the patch-based SAR application executing on the DM system. During the study, certain system configurations resulted in each checkpoint option producing similar performability values yet drastically different throughputs. This observation suggested that while performability is a useful metric to gauge the robustness of a degradable system under varying fault environments, it does not accurately represent the efficiency with which an application executes on a system. Thus, the results from this study reinforce the need to couple Markov and discrete-event simulation modeling for comprehensive analyses of aerospace systems and applications.

After the SAR application was configured for optimal performability and throughput, we explored architectural enhancements in order to identify and alleviate the bottlenecks within the system. The results found that the MDS was the main throttling point of the system due to the cumulative effects of each data node accessing and transmitting data during various phases of SAR. To alleviate this contention point, we enhanced the capabilities of the MDS storage device, which allowed the system to nearly double its throughput. We also observed similar speedups in system throughput by incorporating

PAGE 98

additional MDS nodes into the system. The DM system using two MDS nodes received a 2.0× boost in throughput while a design using three MDS nodes achieved a 2.5× improvement under moderate fault conditions. As the fault rate increased, greater speedups were observed for each enhancement over the baseline case. Shortened execution times reduced the effects of the higher fault rates, thus increasing the overall efficiency and throughput of the system.

The case study concluded with a performability evaluation of the final flight system that incorporated twenty enhanced data nodes and three enhanced MDS nodes. The proposed flight system was exposed to various fault rates and its maximum throughput was observed to be approximately 587 images per orbit in relatively light fault conditions and 316 images per orbit in the worst conditions studied. The DM's performability was observed to be over 99.5% when considering light to moderate radiation conditions (i.e., less than one fault every two hours per data node).

The work conducted during this phase of research produced a novel, three-stage analysis process for predicting the performance and availability of high-performance, embedded systems and applications. The FASE framework was extended by incorporating a fault model library. This additional library allows FASE users to inject faults into arbitrary systems in order to conduct in-depth availability and performability analyses. The case studies demonstrated the capabilities of the enhanced framework as applied to the DM system. The contributions and accomplishments of this work have been compiled into two manuscripts. The first manuscript presents the simulation work involved with the 2D FFT study and was published in [48]. The second paper introduces the three-stage analysis approach for predicting the performance and availability of radiation-susceptible systems and applications. This paper was submitted to ACM Transactions on Embedded Computing Systems [54].
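The relationship between the flight system's throughput in images per orbit and the sustainable raw-data input rate quoted in Section 4.3.3.3 can be checked with a short calculation. The sketch below assumes that "MB" denotes 2^20 bytes, that each raw image holds 28000 × 5616 elements at 2 bytes per element, and that throughput scales linearly with the number of data nodes; the function names are hypothetical.

```python
import math

IMAGE_ELEMENTS = 28000 * 5616   # elements per raw SAR image
RAW_BYTES_PER_ELEMENT = 2       # complex pair of 8-bit integers
ORBIT_SECONDS = 100 * 60        # one 100-minute orbit
MB = 2 ** 20                    # assumption: MB means 2**20 bytes


def sustainable_input_rate(images_per_orbit):
    """Raw-data input rate in MB/sec matched by a given image throughput."""
    raw_image_bytes = IMAGE_ELEMENTS * RAW_BYTES_PER_ELEMENT
    return images_per_orbit * raw_image_bytes / ORBIT_SECONDS / MB


def nodes_required(input_rate, system_rate, total_nodes):
    """Data nodes needed for input_rate, assuming linear scaling with nodes."""
    return math.ceil(total_nodes * input_rate / system_rate)
```

Evaluating sustainable_input_rate(587) gives a value within rounding of the 29.35 MB/sec figure reported for the flight system, and nodes_required(11.90, 29.35, 20) gives the 9 data nodes cited for the ERS input rate.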

PAGE 99

CHAPTER 5
HYBRID SIMULATIONS TO IMPROVE THE ANALYSIS TIME OF DATA-INTENSIVE APPLICATIONS (PHASE 3)

The FASE framework presented in Chapter 3 laid the foundation for fast and accurate performance predictions of arbitrary applications executing on a wide range of systems. The information presented encompasses strategies to address the general issues of characterizing applications, designing and developing components and systems, and analyzing the performance of the virtual systems for the applications under study. However, in practice, the techniques and tools developed succumb to lengthy simulation times when studying data-intensive applications. These long analysis times are exacerbated when considering large-scale, parallel systems to the point at which simulation becomes prohibitive.

This section presents the work conducted for the third and final phase of the dissertation. This research focuses on reducing the time needed to simulate data-intensive systems and applications via a novel, hybrid simulation approach. The section starts by discussing the motivations followed by the presentation of background information and related research. A detailed description of the hybrid modeling approach illustrates the techniques employed to enhance and extend the FASE framework to speed up simulations evaluating data-intensive applications. Experiments and results follow, validating the successes of the proposed technique. This section also presents a case study that illustrates the accuracy and speed of the hybrid approach as it is applied to the DM system described in Chapter 4. Finally, conclusions are drawn and key insight is offered.

5.1 Introduction

As scientists discover new methods for exploration and discovery of both earth- and space-bound phenomena, the amount of data collected by the accompanying equipment can become staggering. With this increase in data, new applications are needed to analyze and interpret the information in order to identify the areas of interest. As a result, key areas of science such as the geosciences, remote sensing, and systems biology

PAGE 100

are employing applications that process typical datasets in the multi-gigabyte range and beyond. For example, NASA satellite systems performing remote sensing tasks generate 50 gigabytes of images per hour while multiple terabytes of data on the human genetic code were collected for the Human Genome project [55]. In order to efficiently process these large quantities of data, new systems must be designed that push the limits of processing and I/O technologies. However, designing the most effective system for a specific application set is a nontrivial task due to the vast number of technologies and techniques available to designers. Simulation is often used to facilitate the design process, with discrete-event simulation being a particularly effective method to verify and analyze specific characteristics of an architecture or protocol before deployment. This capability allows developers to save vast amounts of time and money by circumventing the development and evaluation of premature prototypes. It also expands the design space to include architectures that use both new and emerging technologies, thus allowing system designers to select the best configuration for a particular set of applications.

Although discrete-event simulation provides many benefits, one major shortcoming is the time required to analyze complex systems executing data-intensive applications. Generally speaking, the time required to analyze a system using discrete-event simulation is highly dependent on the number of discrete events generated by the model. In high-fidelity simulation models, these large datasets are typically split into a considerable number of fragments with each fragment processed individually in order to mimic the behavior of the data transaction. This fragmentation generates a proportional number of discrete events that must be scheduled and processed throughout the simulated system, thus dramatically lengthening simulation time. Simulation times are exacerbated by the fact that the majority of the candidate systems employ numerous distributed components working in parallel to complete a given task. The combination of large, complex systems processing sizeable datasets can cause simulations to run for days, if not longer. Such lengthy analyses are often prohibitive due to unacceptable increases in design and

PAGE 101

development times. Thus, methods for improving the efficiency and speed of discrete-event simulations, while sacrificing as little accuracy as possible, are essential if simulation is to remain an effective tool for prototyping and evaluating architectures executing data-intensive applications.

In this chapter, we present a novel approach to hybrid simulation modeling that aims to reduce simulation time for data-intensive applications while retaining a high degree of accuracy. Our approach combines the accuracy of function-level models with the speed of analytical models and "micro-simulations" to achieve fast and accurate results. The modeling procedure uses a technique called function-level training to collect performance measurements from the simulated system. These measurements essentially take a snapshot of the current state of the system and are used to calibrate the analytical models. The calibrated analytical model is then employed to calculate the time required to complete the current transaction assuming that the state of the model remains relatively unchanged throughout its execution. Micro-simulations also use measurements gathered during the training period, though they are employed to account for device and contention delays at components actively participating in the data transaction.

The method operates with each hybrid transaction beginning in the function-level training procedure in which functional models are employed to collect statistics that characterize the current status of the system. When the training period is complete, the analytical model calibrates itself using the collected statistics, calculates the time required to complete the current transaction, and schedules the transmission of a final data structure using the function-level model. Finally, this last data structure traverses the system invoking micro-simulations at each component it encounters until it reaches the destination device model. With this approach, the redundant processing that typically occurs during large data transfers is replaced by a single calculation (i.e., analytical model) and a fixed number of micro-simulations that collectively approximate the ultimate outcome of these repetitious computations. The

PAGE 102

resulting hybrid methodology supports the timely evaluation and analysis of a class of applications that has traditionally stressed discrete-event simulators.

5.2 Background and Related Research

Within the past decade, technology has advanced in leaps and bounds to support both the acquisition and storage of large quantities of data relating to a wide range of fields including finance, commerce, scientific computing, and national security. Due to the overwhelming quantities of data collected, the importance of processing and understanding the information has led to a new field of study within computer science called knowledge discovery in databases (KDD). KDD, as defined in [56], is "...the non-trivial process of identifying valid, novel, and potentially useful, and ultimately understandable patterns in data." With the emergence of KDD, many new research initiatives have commenced to develop and optimize algorithms and applications that analyze data residing in large data repositories while other projects aim to design powerful computational systems to effectively sift through the large datasets to discover useful information. The Data-Intensive Computing Initiative (DICI) at Pacific Northwest National Laboratory is one such research project that is investigating scalable solutions in both software and hardware to support timely, effective analyses of large quantities of data [57]. However, in order to analyze the capabilities of new hardware designs, the cost of building prototype solutions can be prohibitive. As a result, tools such as discrete-event simulators are imperative to perform extensive, yet cost-effective evaluations of a number of design options.

Traditional, high-fidelity modeling approaches dictate that the behavior of a modeled entity should be mimicked identically to ensure correct and accurate results. As a result, much simulation time is used to fragment each transaction into smaller chunks that are transferred and processed by the corresponding functional models. This modeling approach is very accurate, but its scalability suffers due to large simulation times as the dataset grows in size. Numerous solutions have been investigated in order to remedy this

PAGE 103

scalability problem through both general and specific methods that attempt to speed up simulations. One methodology, called staged simulation, targets the reduction of discrete events in wireless network simulations. Staged simulation uses various techniques, such as function caching and incremental computations, to decrease the amount of redundant processing conducted, thus supporting faster simulations of larger systems [58]. According to case studies, the method achieves a 30× improvement in simulation speed for a 1,500-node system; however, many of the proposed techniques are uniquely designed for wireless network simulations.

A general approach that has been investigated to improve simulation time is parallel discrete-event simulation (PDES). PDES frameworks attempt to reduce simulation time by distributing the workload among multiple processors [59]. PDES frameworks typically fall into one of two categories: conservative or optimistic. In conservative PDES environments, each participating node processes an event only when all pending events anywhere in the simulation that may affect the considered event have been completed [60]. By using a conservative approach, processors may remain idle for significant periods of time waiting on other processors to signal that they can proceed with their next scheduled event. CPSim [61] and Parsec [62] are two tools that support conservative PDES. Conversely, optimistic PDES avoids idle cycles by allowing events to be processed regardless of their affinities with concurrent or previous events residing in other nodes. To support this capability and maintain correctness, detection and rollback mechanisms are incorporated into the design [63]. The detection mechanism introduces overhead due to the additional processing required to identify dependencies between events while rollbacks delay simulations by repeating computations optimistically completed. Examples of tools supporting optimistic PDES include SPaDES [64], WARPED [65], and ModSim [66]. While parallel simulators have been proven effective for certain applications, they often provide very poor parallel efficiency due to the high levels of interdependencies and

PAGE 104

synchronizations inherent in discrete-event simulations. As a result, we look to alternative methods for improving the efficiency of simulating data-intensive applications.

Fluid-based modeling provides a particularly successful approach to greatly reduce the time needed to analyze large-scale systems processing sizeable datasets. Fluid-based models attempt to abstract an entire data transaction as a single fluid representation, called a flow. Typically, analytical models are used to define a flow's behavior as well as the interactions between multiple flows contending for a shared resource or channel. Conversely, the behavior of a function-level transaction is captured by the collective behavior of each individual fragment that composes the transaction as dictated by the state of the system. By using a fluid-based model, an extremely large data transaction can be modeled as a continuous event, without generating a large number of discrete events that would otherwise slow the simulation. In cases where the simulation time is dominated by the modeling of these large data transactions, such a technique has the potential to significantly reduce simulation time. For these reasons, the use of fluid-based models has become popular for applications such as large-scale network simulations [67]. Unfortunately, coarser-grained models are typically less accurate than function-level models [68], and at times less efficient due to ripple effects caused by the interaction of competing flows [69]. The ripple effect, discussed in [70], refers to the phenomenon where the rate change of a single flow causes rate changes to other flows that propagate throughout the system. In simulations handling a large number of flows, these ripples can create enough extraneous events to significantly decrease the gains in simulation efficiency obtained from using fluid-based models.

Several projects have proposed designs of hybrid simulation environments that combine functional-level and fluid-based, analytical models for accurate and efficient network simulations [71], [72], [73]. For each of these simulators, users typically decide whether a network transaction is modeled as multiple, function-level fragments or as a single, fluid-based flow. While the frameworks provide the flexibility of supporting

PAGE 105

both modeling approaches, they suffer from a few shortcomings. First, fluid streams are modeled solely using analytical models that consider traffic rates and buffer capacities. Thus, the accuracy of the fluid transaction is based entirely on the analytical model's ability to approximate contention and fluctuating network behavior without the use of detailed, functional-level models. Furthermore, each simulation environment is designed to analyze only the network-specific characteristics of a system. Therefore, the frameworks are unable to correctly model the complete behavior of a data-intensive application and system. Finally, the capabilities of each simulator are demonstrated by focusing on a single data stream in the presence of randomly generated background traffic. The demonstrations suggest that difficulties could arise when attempting to model real traffic generated by the entire system in order to mimic the behavior of an application.

Other research projects have looked beyond fluid-based modeling to reduce the complexity of simulating data-intensive applications. A framework that combines application emulators with a set of simulation models for dealing with large-scale, parallel applications is described in [74]. The application emulators dynamically feed stimuli to the simulator in the form of epochs, which represent groups of events and their high-level data dependencies processed by a processor model. While the coarse-grained processor models and the application emulators abstract away much of the work performed for a given application, the framework still relies on high-fidelity bus and disk models, which can hinder the simulation during very large data transactions.

Since the FASE framework places an emphasis on accurate and detailed modeling of data transactions, it stands to benefit greatly from the hybrid modeling methodology. For this work, the discrete-event models developed in Chapters 3 and 4 have been extended to support the hybrid modeling methodologies proposed in this chapter. The remainder of this chapter presents the hybrid-modeling extensions incorporated into FASE and results from case studies that showcase the new capabilities of the extended framework.

5.3 Hybrid Simulation Approach

In this section, we introduce a hybrid simulation approach that can be applied to a number of component types to produce fast and accurate simulation results. Before providing the specific details on the methodology, we first define some basic terminology used throughout the chapter as well as a few basic modeling concepts. In this chapter, we refer to a generic data operation as a transaction, while a fragment and a flow correspond to function-level and fluid-based modeling, respectively.

Figure 5-1a illustrates a generic hybrid system model consisting of three model types: (1) source, (2) path component, and (3) sink. Each model type consists of both functional and fluid-based models and participates in the modeling of the transactions within the system. The components labeled as a source begin new data transactions based on operations created by the corresponding component or received from an upper-layer component. The path components are intermediate models that receive both fragments and flows and propagate each data type to the next component in the path. The sink models signify the destinations for the transactions and ultimately receive the actual data sent by the corresponding sources for further processing. Also, logical channels connect the origin source, path components, and the sink model participating in each data transaction.

In order to relate these generic terms to a real-world system, we illustrate a system (see Figure 5-1b) employing three transactions. Transaction T1 corresponds to a remote disk write in which Server A (source) transmits data to a shared, remote disk (sink) using two shared switches (path components). Transaction T2 is a data transfer in which the workstation receives a message from Server B, and Transaction T3 represents a write access to the remote disk by Server B. These transactions are used throughout this section to exemplify the techniques employed in the proposed approach.

Our hybrid modeling approach incorporates three main steps in order to support quick yet accurate results: (1) function-level training, (2) analytical modeling, and (3) micro-simulation. As illustrated in Figure 5-2, the first step of a hybrid simulation
(i.e., function-level training) employs the functional models at all hybrid model types participating in the transaction to collect performance measurements that characterize the current state of the system. These measurements are then used in the analytical modeling stage to configure the corresponding model and calculate the length of time the source is busy processing the current transaction. The third step, micro-simulation, uses the metrics collected by path components and sinks during function-level training to compute delays experienced by each flow due to internal mechanisms and contention.

Figure 5-1. High-level example systems employing hybrid modeling: (a) generic system, (b) real-world system.

Data flows through our hybrid systems as follows. First, the source model receives the data and transmits F fragments using its function-level model. As these fragments traverse the system, each model collects statistics on the behavior of the transaction. After F measurements have been made by the source, one more fragment, called the head fragment, is transmitted and the statistics gathered by the source are fed to the analytical model. The analytical model calculates a time based on the statistics and effectively delays the component for the computed time. While delayed, the device can still respond to remote requests from other components, though no further discrete events are created by the current data transaction. After the calculated time has elapsed, the
source sends one final fragment, deemed the tail fragment, using its function-level models. This fragment can be delayed at each path-component and sink model according to the corresponding micro-simulations that occur at each model type. The data transaction is complete when the tail fragment reaches the sink model. The following subsections present in-depth details on the three steps employed in our hybrid simulation approach and discuss the interdependencies between each stage.

Figure 5-2. High-level diagram of hybrid simulation approach

5.3.1 Function-Level Training

Data-intensive applications typically perform numerous operations that manipulate large quantities of data. In order to effectively model these operations, we consider them as a two-stage process. The first stage, named the transitional period, represents the period in which the system begins a new operation. During the transitional period, the system experiences an increase in data traversing between components while these devices take the necessary actions to adjust to the new influx of data. The most common impacts observed in each affected component are buffer growth and increased contention. Once the system entities have adjusted to the changes, the operation reaches its steady-state period. In this stage, the system is effectively in equilibrium and the output rate of each component is highly predictable when considering identical inputs. Our hybrid approach uses these stages by applying the best suited models for each in order to ensure
accuracy while providing opportunities to reduce simulation time. Specifically, we employ function-level models during the transitional period and then switch to the fluid-based models when a steady state has been reached. The use of high-fidelity models during the transitional period allows us to accurately capture fine-grained events that have the potential to greatly impact the system's overall performance. However, once the operation has reached its steady-state period, an analytical, fluid-based model can be employed. By using the analytical model, numerous redundant computations can be replaced by a single calculation, thus saving a significant amount of time proportional to the size of the data corresponding to the current transaction.

One difficulty with this approach is configuring the analytical model to correctly represent the system's current state. To overcome this challenge, our hybrid simulation approach collects statistics on specific attributes of the system as experienced by the corresponding component while using the function-level models. By collecting these measurements, we can effectively "train" our analytical model to capture the behavior of the component based on the state of the system. This process of collecting performance metrics and other statistics during the transitional period is called function-level training.

Function-level training is conducted at the beginning of each data transaction to ensure the accuracy of each modeled transaction. As shown in Figure 5-3, the process begins with the source model transmitting the first F fragments of the data transaction using its functional model. While these fragments traverse the system, the source keeps track of key statistics such as the departure rate of fragments as dictated by the system. Meanwhile, the path-component and sink models monitor the stream of fragments received from each source in order to calculate internal statistics used within the micro-simulation stage. Once F measurements have been collected by the source, it transmits a final fragment, called the head fragment, that contains information, such as the transaction's remaining data size, that is used during micro-simulations within the path-component and sink models. This event also marks the time when the source switches from its
function-level implementation to its fluid-based design. Sections 5.3.2 and 5.3.3 provide in-depth details on the fluid-based, analytical model and micro-simulations, respectively.

Figure 5-3. Function-level training procedure

The source model can be configured to switch between its function-level and fluid-based, analytical models according to two user-definable parameters. The first parameter allows users to specify the number of measurements, F, that should be taken using the function-level model before switching to the fluid model. The second defines a specific data size under which the source is forced to use only its functional model.

The first parameter is ideally set to a value that supports the use of the function-level model during the entire transitional period of a transaction while also capturing the steady-state statistics for the various component models. If the training period is too short, the component models could potentially capture metrics that misrepresent the steady-state conditions for the given transaction. However, if the training period is too long, simulation time is wasted due to the scheduling and processing of superfluous discrete events.

The second parameter allows the user to control the minimum data size that can use the hybrid approach. This parameter is important for two reasons. First, each transaction
incurs overhead when employing the hybrid approach due to the extra computation required at each component to calculate and log statistics. As a result, this overhead has a greater impact on smaller transactions and can potentially increase simulation time as compared to functional modeling. More importantly, the hybrid approach was designed for large transactions and, therefore, can suffer from inaccuracies when considering smaller transactions. The following section provides details on the issues regarding the accuracy of the analytical model.

In order to illustrate function-level training, we examine the three transactions introduced in the previous section and displayed in Figure 5-1b. Consider data sizes of 10 KB, 8 KB, and 10 KB for Transactions T1, T2, and T3, respectively. Also, the maximum fragment size is 1 KB and all sources' training periods are configured to take four measurements (i.e., F = 4). In this example, each source transmits 4 KB of data during its training period. The source for T1 (i.e., Server A) determines it can send 0.5 maximum-sized fragments every second, while Server B can transmit 0.25 and 1.0 maximum-sized fragments every second for Transactions T2 and T3, respectively. Furthermore, the path-component and sink models calculate similar arrival rates for the transactions during the function-level training periods. After the four measurements are collected, each source transmits a head fragment containing the necessary information to be used by the path-component and sink models during their micro-simulations. Also, the sources use the calculated fragment output rates to calibrate their analytical models to determine the times the sources are busy processing their current transactions. The following sections continue this example by outlining how the analytical models and micro-simulations are used to model the behavior of the three transactions traversing the system.

5.3.2 Analytical Modeling

After the training period completes, the resulting performance metrics collected at the source model are used by a fluid-based, analytical model (see Figure 5-2) to calculate
the time in which the current transaction is expected to complete. A timer is set within the source model based on the calculated time. When the timer expires, the source model outputs one last fragment, the tail fragment, which signifies the end of the transaction. The tail fragment is used by the path-component and sink models to calculate final delays incurred by each flow. These delays are computed using micro-simulations, which are described in more detail in the following section.

In order to ensure accuracy within the analytical model, three requirements must be satisfied. First, the model must be capable of capturing the steady-state behavior of the component that it represents. This requirement can be fulfilled by employing validated derivations that are either custom-built or defined in literature. Second, the model should use parameters that correspond to the current state of the system. We address this requirement through function-level training as described in the previous section. Finally, since the analytical model is invoked only one time to calculate the time required to complete each transaction, it inherently assumes the state of the system does not change within the calculated time period. However, complex systems typically have numerous transactions interacting and causing changes within the system. As a result, the last constraint requires a means to recalibrate the analytical model when a system update potentially affects the behavior of the modeled component. Instead of fully satisfying this requirement, we allow the analytical model to suffer from any inaccuracies that may occur due to system updates. However, we introduce a mechanism described in the subsequent section to compensate for these inaccuracies.

The speedup attained using the analytical model depends on the complexity of the mechanisms replaced and the number of discrete events created using the equivalent function-level model. The greater the model's complexity and the larger the number of discrete events created by the functional model, the more speedup is achieved using the hybrid approach. Furthermore, the complexity of the path-component and sink models also comes into play since discrete events created at the source normally lead to
nontrivial computations at lower-layer components as well as the potential to create more events that further increase simulation times. Finally, the reduction of discrete events enabled by analytical models has the potential to greatly trim the memory requirements of the simulation. As a result, the employment of the hybrid approach allows users to more efficiently analyze larger systems or attain even greater speedups, especially when replacing function-level systems that require the use of disk swap space. The combination of all these factors imbues the analytical model with the potential to greatly reduce the simulation times observed in pure function-level simulations.

During each function-level training period for the three transactions considered in our example system, 4 KB of data was sent. Also, Server A and Server B calculated fragment output rates of 0.5, 0.25, and 1.0 fragments per second for Transactions T1, T2, and T3, respectively. For this example, we assume both sources use the simple analytical model expressed as

\[ \mathit{SourceDelay} = \left\lceil \frac{T}{\mathit{Frag}_{max}} \right\rceil \cdot \frac{1}{O} \tag{5-1} \]

where O is the fragment output rate, T is the remaining size of the transaction, and Frag_max is the maximum fragment size. Using this analytical model, the sources delay for 10, 12, and 5 seconds before sending the tail fragments for Transactions T1, T2, and T3, respectively. After delaying for the specified amount of time, the corresponding source transmits the current transaction's tail fragment, signifying the end of the transfer. The next section details the role of the micro-simulation technique within the example system.

5.3.3 Micro-Simulation

Thus far, our hybrid simulation approach has addressed source-based delays as configured according to a steady-state view of the system. However, the state of a system frequently changes, resulting in potential impacts that propagate back to the source models due to feedback mechanisms. There are two options available to handle this situation. First, the feedback mechanisms can be incorporated into the fluid models such
that changes in the system cause source models to recalibrate their analytical equations according to the new state of the system. However, this approach can greatly suffer from an effect similar to the "ripple effect" reported in [70]. That is, changes can continuously cause source models to recalibrate, thus reverting the hybrid simulation approach into nothing more than a function-level approach with additional overhead. The second alternative attempts to take advantage of a simple observation: the most likely system changes to have substantial effects on the performance of a transaction are the additions or removals of other transactions competing for the same resource. By using this observation, the second option allows the source to complete as originally scheduled while the path-component and sink models account for delays caused by the additional transactions creating contention within the system. This approach, deemed micro-simulation, uses FIFO queues and the performance statistics collected during function-level training to calculate device and contention delays incurred by each transaction.

Micro-simulation effectively reduces the complexity of the modeled component into a simple queuing system that approximates the delays experienced by each flow. In order to simplify the system, the device's behavior is characterized by a single parameter, service rate, while each flow is represented with three parameters: (1) start time, (2) arrival rate, and (3) number of fragments. The service rate parameter specifies the rate at which the device can complete a virtual fragment. This parameter is typically calculated based on the component's performance attributes such as latency and throughput as well as the fragment size. The start time specifies the time at which a flow first arrives at the component, the arrival rate denotes the rate at which a virtual fragment arrives at the device, and the number of fragments parameter indicates the number of virtual fragments represented in a flow. The arrival rate for each flow is calculated at the path-component and sink models during the flow's function-level training, while the start time and number of fragments are defined in the head fragment. Due to queuing complexities, computer simulation is used to calculate the delays for each flow in
a path-component or sink model.

Micro-simulations are conducted only when a new flow begins, as signified by a head fragment, or an existing flow completes, as indicated by a tail fragment. For both events, the micro-simulation's state is updated to the current simulation time; however, predictions of a flow's completion time are only made when its tail fragment is received before the micro-simulation has progressed to a state in which the flow has completed. Since the addition of new transactions cannot be forecasted, yet will likely affect the delays experienced by existing flows, speculative calculations of the completion times can lead to wasted computation. As a result, micro-simulations perform the minimum amount of computations necessary to calculate the delays experienced by a flow, thus minimizing the simulation time required by this technique.

Micro-simulation improves simulation time for two reasons. First, by reducing the system into a queuing problem, micro-simulation abstracts away the complexities involved with modeling large transactions employing potentially complicated components. The queuing system can be quickly processed using computer simulation, thus decreasing the computation time required to explicitly simulate the system. Second, micro-simulations do not create discrete events and, thus, do not suffer from scheduling or other processing delays external to the model that have the potential to drastically increase the overall simulation time.

Figure 5-4 provides the micro-simulation that occurs at Switch B illustrated in Figure 5-1b. For this example, we assume that Switch B uses a bus-based backplane to route fragments and, therefore, its micro-simulations model any contention between transactions sharing this resource. Also, Flow #1, Flow #2, and Flow #3 in Figure 5-4 correspond to Transactions T1, T2, and T3 in Figure 5-1b, respectively. The characterization parameters of the switch component and three flows are provided at the top, the state of the micro-simulation's queue with respect to time is displayed in the middle, and the micro-simulation updates and key events that trigger them are shown at the bottom. The
micro-simulation's queue state illustrates the order and number of virtual fragments from each flow residing in the queue at a given point in time. In order to illustrate the three flow types that can occur in our hybrid simulations, we assume that the transactions' start times are interleaved according to the values displayed in Figure 5-4 and each tail fragment sent by the corresponding source experiences a two-second routing delay before it is received by Switch B.

Flow #1 illustrates the case in which the component receives the flow's tail fragment and then calculates the flow's completion time to be the current simulation time. In this case, the tail fragment is immediately processed by the component since no contention delays occurred. Flow #2 represents a flow that encounters source-based delays. That is, the component receives the flow's tail fragment after the flow's completion time as calculated by the micro-simulation. Similar to Flow #1, the tail fragment is processed immediately in this case. Finally, Flow #3 demonstrates the process followed when the tail fragment is received by the component but the micro-simulation determines that the flow is not complete. In this case, the device schedules the processing of the tail fragment to the corresponding flow's predicted completion time. Note that the micro-simulation's state never progresses further than the current simulation time. However, predictions are made to calculate the completion time of Flow #3 when the tail fragment arrives before the micro-simulation's state has progressed to the point in which the flow is determined to be complete.

While the micro-simulation technique eliminates potential slowdowns due to the ripple effect phenomenon, it does not compensate for all source-based inaccuracies that may occur when modeling a transaction. Consider the following scenario. Transaction A begins its training period and causes contention at a shared resource already in use by Transaction B. Assuming the shared resource can handle only one transaction at a time without causing delays, the sources of both transactions observe slower rates at which their data traverses the system. Transaction A finishes its training period just as
Transaction B completes, thus configuring its analytical source model to represent the component's behavior in the presence of contention. However, the contention no longer exists since Transaction B completed. The resulting scenario cannot be resolved through micro-simulations and requires a feedback mechanism that can notify the source of the change in the system so it can retrain. However, the feedback loop leads back to the ripple effect that micro-simulations are used to avoid. Section 5.5 discusses proposed future work that investigates techniques that incorporate a feedback mechanism to remedy these inaccuracies while mitigating the ripple effect.

Figure 5-4. Example three-flow micro-simulation

Throughout this study, we consider only FIFO queues, although our design can be easily extended to support priority-based queues. Furthermore, we assume infinite queue capacities in order to simplify the modeling effort. However, the necessary mechanisms are in place to support finite queue sizes. Finally, when using this approach, the source cannot proceed with the next transaction until the tail fragment has been fully completed. As a result, mechanisms that may allow transactions to complete early (e.g., non-blocking) should be disabled or carefully controlled using this hybrid approach.
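The source-delay calculation of Section 5.3.2 and the queue-based micro-simulation of Section 5.3.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the FASE implementation: the function and class names are hypothetical, the queue is evaluated in one batch rather than incrementally at head and tail events, virtual fragments are assumed to arrive evenly spaced, and the two-flow input is invented for illustration. The remaining sizes of 5 KB, 3 KB, and 5 KB are also assumed; they are the values for which Equation 5-1 yields the 10, 12, and 5 second delays quoted for T1, T2, and T3.

```python
import math
from dataclasses import dataclass


def source_delay(remaining_kb, frag_max_kb, output_rate):
    """Equation 5-1: ceil(T / Frag_max) * (1 / O), in seconds."""
    return math.ceil(remaining_kb / frag_max_kb) / output_rate


@dataclass
class Flow:
    name: str
    start_time: float     # time the flow first arrives at the component (s)
    arrival_rate: float   # virtual fragments per second
    num_fragments: int    # virtual fragments represented by the flow


def micro_simulate(service_rate, flows):
    """Return each flow's completion time at one FIFO-served device.

    Assumptions (not from the source): fragments of a flow arrive evenly
    spaced at 1 / arrival_rate from start_time; queue capacity is infinite.
    """
    service_time = 1.0 / service_rate
    arrivals = []
    for order, flow in enumerate(flows):
        for k in range(flow.num_fragments):
            arrival = flow.start_time + k / flow.arrival_rate
            arrivals.append((arrival, order, flow.name))
    arrivals.sort()  # FIFO: serve in arrival order, ties broken by flow order

    device_free_at = 0.0
    completion = {}
    for arrival, _, name in arrivals:
        begin = max(arrival, device_free_at)   # wait if the device is busy
        device_free_at = begin + service_time
        completion[name] = device_free_at      # last fragment's finish time
    return completion


def tail_release_time(tail_arrival, predicted_completion):
    """Process the tail on arrival unless the flow is still queued."""
    return max(tail_arrival, predicted_completion)


if __name__ == "__main__":
    for name, t_kb, rate in [("T1", 5, 0.5), ("T2", 3, 0.25), ("T3", 5, 1.0)]:
        print(name, source_delay(t_kb, 1, rate), "s")
    flows = [Flow("A", 0.0, 1.0, 3), Flow("B", 0.0, 1.0, 3)]
    print(micro_simulate(2.0, flows))
```

The `tail_release_time` helper covers the three cases described for Flows #1 to #3: a tail fragment that arrives at or after the predicted completion time is processed immediately, while one that arrives earlier is held until the predicted completion time.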


5.4 Results and Analysis

In this section we discuss the results of the analyses conducted to show the capabilities of our hybrid simulation approach as it is applied to the NASA Dependable Multiprocessor (see Chapter 4 for details). The first section presents the simulation setup followed by validation results using low-level benchmarks to verify the accuracy and showcase the speed of the hybrid models. Next, we evaluate the speed and accuracy of the hybrid models when considering contention at shared resources. Contention stresses the proposed hybrid methodology due to the lack of feedback loops that adjust the analytical models within the source components. After the speed and accuracy of our approach have been showcased, we apply the technique to analyze the performance of the DM system executing a data-intensive, hyperspectral imaging (HSI) application.

5.4.1 Simulation Setup

The key simulation models employed for the following experiments are listed in Table 5-1. Similar to the models developed in the previous chapters, these models were also created using MLD. Each model captures the functionality and behavior of the corresponding technology while adhering to the FASE methodology of balancing speed and accuracy. From these core component models, node and system models were developed. Finally, Table 5-2 lists the key system parameters configured to best match the performance of the prototype DM system.

The DM system incorporates two main subsystems that can benefit from hybrid simulation models: the network and mass data store. Both subsystems use the HAM for inter-node communication, which in turn employs TCP/IP as its primary communication protocol over a Gigabit Ethernet link. Therefore, we retrofitted the HAM model in the DM model library and the TCP, IP, and Ethernet models found in the pre-built FASE library with the appropriate hybrid simulation mechanisms. The MDS Server model was also modified to incorporate hybrid mechanisms to model large data accesses to the MDS device.

Table 5-1. Summary of relevant simulation models

Library  Component                     Description
DM       High-Availability Middleware  Provides reliable communication between nodes in system.
DM       MDS Server                    Handles data access requests to the mass data store.
FASE     TCP Layer                     Provides TCP protocol for reliable communication between nodes.
FASE     IP Layer                      Provides IP protocol for all network transfers.
FASE     Ethernet NIC                  Provides Ethernet protocol for all network transfers. Supports multiple ports.
FASE     Ethernet Switch               Provides Ethernet connectivity between nodes. Supports variety of backplane and routing options.

Table 5-2. Key system parameters

Parameter name              Value
Processor power             1200 MIPS, 600 MFLOPS
MPI maximum throughput      57 MB/s
MPI message latency         13.6 ms
HAM buffer size             2000000 bytes
Network bandwidth           Non-blocking 1000 Mb/s
Network switch latency      5 µs
MDS bandwidth (write/read)  60/40 MB/s
MDS latency (write/read)    300/500 µs
MDS open file overhead      8 ms

The HAM model acts as a source model for most data transfers between nodes. The model receives data transactions from the application layer and fragments each transaction into messages with a maximum size of 14000 bytes when using its function-level implementation. During the training period, the round-trip time of each message is calculated in order to calibrate the analytical model used in the fluid-based implementation. The HAM model supports non-blocking communications via buffering techniques.

The TCP, IP, Ethernet NIC, and Ethernet switch models are retrofitted with hybrid mechanisms. Each of these components acts as a path component and therefore collects
statistics on the flows traversing through them in order to calculate delays incurred at the device via micro-simulations. The TCP model also acts as a source model since it has the capability to fragment the messages received from the HAM into TCP segments with a maximum size of 1460 bytes. The TCP model is outfitted with an analytical model that uses the command window, maximum window size, and acknowledgement rate to calculate the time needed to transmit N bytes of data. Currently, our fluid model does not account for TCP segments dropped in transit; however, the model can be extended to collect the necessary metrics during training and the analytical model modified to account for the effects of dropped segments.

Finally, the MDS model is also retrofitted with hybrid simulation mechanisms to provide sequential accesses for submitted disk I/O jobs. The MDS model represents a sink model and therefore uses micro-simulations to calculate delays incurred while accessing the mass storage device. More specifically, the queue service rate of the MDS's micro-simulations is configured according to the bandwidth and latency of the storage device, while the arrival rates of each flow are calculated based on the performance metrics gathered during each transaction's training period. Due to the features of the MDS server, read accesses retain exclusive ownership of the MDS storage device throughout the duration of the access, while write accesses can be interleaved in 14000-byte messages, the maximum message size of the HAM.

For this study, the HAM and TCP source models are configured as shown in Table 5-3. The training values were chosen to best balance the accuracy and speed of the hybrid simulations used to analyze the DM system. The following sections show that these values provide sufficient training periods to produce fast yet accurate results from each transaction. The minimum hybrid message size for each source model is initially set to zero in order to evaluate the hybrid methodology for all message sizes. It should be noted that each simulation presents a unique case for which these parameters should
be calibrated to best accommodate the characteristics of the systems and applications of interest.

Table 5-3. Hybrid source model parameters

Model  Parameter name                        Initial value
HAM    Function-level training measurements  10
HAM    Minimum hybrid message size           0 MB
TCP    Function-level training measurements  100
TCP    Minimum hybrid message size           0 MB

All simulations are conducted on dedicated compute nodes with a nominal number of processes executing in the background to minimize the noise experienced during the experiments. Each node is configured with a quad-core, 2.4 GHz Xeon processor with 2 GB of main memory running a 64-bit Linux variant using kernel 2.6.9-55.

5.4.2 Performance Modeling

The first study conducted uses two simple benchmarks, PingPong and MDSTest, to show the accuracy and speed of the hybrid simulation approach under ideal conditions (i.e., no resource contention and minimal system state changes). The PingPong benchmark transfers data between two data nodes while the MDSTest benchmark writes and reads data to and from the MDS node. For both programs, the data transfers range from one byte to four gigabytes. The small-scale DM testbed system is used to collect experimental measurements to which results from both the function-level and hybrid models are compared. Consequently, the results not only showcase the capabilities of our modeling approach, but also validate the DM model's accuracy.

From Figure 5-5 and Figure 5-6, one can observe that both modeling approaches closely reproduce the throughputs observed in the testbed system. For the PingPong benchmark, we calculate nearly identical mean relative errors of 1.24% for the functional and hybrid models when compared to the experimental measurements. For this particular benchmark, the error found between the two modeling approaches was negligible. The MDS benchmark tests produced similar results with mean relative errors of 1.50% for
writes and 3.01% for reads when using the hybrid approach as compared to experimentally collected measurements. Again, the MDSTest benchmark showed negligible errors between the two modeling approaches.

Figure 5-5. PingPong accuracy results

Speedup results for the PingPong benchmark are shown in Figure 5-7. From the figure, we see that employing the hybrid simulation approach in systems executing applications that transfer large data sizes can greatly improve simulation times. In fact, we observe an order of magnitude speedup for data sets as small as 4 MB, while 4 GB data transfers observe nearly an 1850× speedup. The large speedups observed at the larger data sizes are directly proportional to the number of discrete events eliminated through the use of the hybrid simulation approach. For example, a 4 GB function-level transfer using TCP requires the creation, scheduling, and processing of approximately 2.9 million discrete events, while the currently configured hybrid approach uses no more than 1000 discrete events to simulate the same transaction. By simply dividing the number of
Figure5-6.MDSTestaccuracyresults discreteeventsgeneratedusingbothapproaches,wecalculateanidealspeedupofaround2940thusverifyingsuchlargegainsusingourmethod.TheMDSTestspeedupresultsseeFigure 5-8 showsimilartrendstothoseobservedinthePingPongbenchmark.Over10speedupsareachievedat4MBdatasetsandmaximumspeedupsofover1700and1800areobservedat4GBdatasetsforwritesandreads,respectively.TheMDSTestshowedlargerspeedupsforreadoperationsduetoaslightlysmalleramountofcomputationrequiredtoconductmicro-simulationsattheMDSsincereadsaresequentiallyexecutedwithexclusiveaccesstothedevice.Bothbenchmarksshowspeedupsthatbegintolevelopastthe1GBmessagesize.Thisbehaviorisduetoa1GBmessagesizelimitationplacedondatatransactionsthatusetheTCPmodelinordertoavoidproblemsassociatedwithits32-bitvariablesthatmaintainitscurrentstateformechanismssuchaswindowingandacknowledgements.Itshouldbenotedthattheserestrictionsareonlyplacedoncomponentmodelsthathaveinherentlimitationsthatcanpotentiallycauseproblemswhenconsideringverylargedatasizes.Also,bothbenchmarksshowincreasingslowdownsbetween1KBand256KB 123

PAGE 124

Figure5-7.PingPongspeedupresults Figure5-8.MDSTestspeedupresults 124

PAGE 125

messages,whichsignifyadditionaloverheadwhenusingtheproposedhybridsimulationapproach.Thecauseofthisaddedoverheadisduetothegatheringofstatisticsontheincreasingnumberoffragmentsusedineachtransaction.However,onceamessagesizeof512KBisreached,ourhybridapproachovercomesthisloggingpenaltyandshowspositivespeedup.Thebenchmarksusedinthisstudyrepresentbest-casescenariosforusingthehybridsimulationtechnique.Thisquickstudyshowsthatourhybridapproachcanprovidesubstantialreductionsinsimulationtimewhilehavinglittleimpactontheaccuracyoftheresults.However,neitherbenchmarkexperiencedconditionsthatcouldpotentiallycauseinaccuraciesinthehybridmodelsandthusrepresentthebest-casenumbersachievablebythetechniqueasitiscurrentlycongured.Morespecically,neitherbenchmarkcausessystemstatechangeswhilepreviouslyconguredtransactionsareexecuting.Theproceedingsectioninvestigatestheaccuracyandspeedupattainablebythehybridmethodwhenthesystemisexposedtocontentionandthusnumerousstateupdates.5.4.3ContentionModelingNowthatwehaveshowntheaccuracyandspeedupsachievedusingourhybridmodelsunderrelativelyidealconditionsi.e.,littletonoexternaleectsinuencingatransaction,weinvestigatetheimpactsofusingthetechniqueundermoreextremeconditions.Inthisstudy,weintroducecontentionintothesystemtoobservehowthehybridmodelsreactwithregardstosimulationspeedandaccuracy.Forthistest,weusetheMDSTestbenchmarkwhileincreasingthenumberofnodesthatsimultaneouslyaccesstheMDSfromtwotothirty-two.Thisscenariocreatescontentionattwosharedresourceswithinthesystem{theoutputportintheEthernetswitchattachingtheMDSnodetothenetworkandtheMDSstoragedevice.Forthisbenchmark,eachnodeinvolvedrstwritesNbytesofdatatotheMDS,whereNrangesfrom1Bto128MB,andthenreadsthedatabackwithasynchronizationpointbetweeneachoperationtomaximizetheamountofcontentionandthusthestressonthehybridsimulationapproach. 125

Figure 5-9 illustrates the relative errors and speedups observed when comparing the hybrid write and read access times to the results obtained via the function-level models. Data sizes less than 64 KB are not displayed since the hybrid models use only their function-level implementations, consequently producing identical results between the two simulation approaches. From Figure 5-9a, one can see that the hybrid write accesses show relatively large deviations in accuracy at smaller data sizes. In fact, a maximum error of just over 46% was calculated for a 256 KB data set regardless of the number of flows tested. Furthermore, as the data size increases, the general trend observed is a decrease in error. These observations suggest that although the fluid-based models do not adequately represent the behavior of the source models at smaller data sizes, they are much more accurate at larger data sets. Meanwhile, for larger flow counts, we observe a minor increase in error when transitioning from 1 MB to 2 MB data transactions. This increase is likely due to abstractions made within the hybrid HAM model with respect to its buffer size (recall from Table 5-2 that the HAM's buffer size was set to 2 MB) and the non-blocking functionality provided by this buffer.

Figure 5-9b shows that the hybrid read accesses perform nearly identically to the function-level operations, with a maximum observed error of 0.26% occurring at 64 KB. From the figure, we also observe that the number of nodes participating in the read portion of the MDSTest benchmark has a minimal effect on the accuracy of the hybrid simulation approach due to the serialization of accesses conducted by the MDS Server.

Figure 5-9c and Figure 5-9d show the speedups achieved for the hybrid write and read accesses over the various message sizes and flow counts. From the figures, one can see that both operations show significant gains in speedup as the data size increases. The maximum speedups, 595 and 760, for the write and read operations, respectively, are observed at 128 MB. A minimum speedup of 0.75 (i.e., slowdown of 25%) is observed for both writes and reads, which represents the overhead associated with the extra computation required to calculate and log statistics when using the hybrid approach.
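The two quantities plotted in Figure 5-9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not give explicit formulas, so the relative error is assumed to be the absolute difference between the hybrid and function-level access times divided by the function-level time, and the speedup is assumed to be the ratio of function-level to hybrid simulation wall-clock time. The function names are hypothetical and are not part of FASE.

```python
def relative_error_pct(hybrid_time, functional_time):
    """Relative error (in percent) of a hybrid result against the
    function-level result. Assumed definition, not taken from FASE."""
    if functional_time <= 0:
        raise ValueError("functional_time must be positive")
    return abs(hybrid_time - functional_time) / functional_time * 100.0


def speedup(functional_wall, hybrid_wall):
    """Speedup of the hybrid simulation over the function-level one,
    computed from simulation wall-clock times (assumed definition)."""
    if hybrid_wall <= 0:
        raise ValueError("hybrid_wall must be positive")
    return functional_wall / hybrid_wall


def slowdown_pct(speedup_value):
    """Express a speedup below one as a percentage slowdown, following the
    text's convention that a speedup of 0.75 is a 25% slowdown."""
    return max(0.0, (1.0 - speedup_value) * 100.0)


if __name__ == "__main__":
    s = speedup(3.0, 4.0)  # hypothetical wall-clock times in seconds
    print(s, slowdown_pct(s))
```

Under these assumptions, a hybrid run that takes 4 s where the function-level run takes 3 s yields a speedup of 0.75, which the text reports as a 25% slowdown.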

(a) Write errors  (b) Read errors  (c) Write speedups  (d) Read speedups

Figure 5-9. MDSTest accuracy and speedup results using hybrid modeling approach

While large errors are observed in the hybrid models when processing smaller write operations, we must remember that the proposed approach is designed to model large data transactions. As a result, we identify a cross-over point for which data transactions within the DM system use only the function-level models versus the hybrid models in order to remedy the large errors at small data sizes while still achieving speedups for large data sizes. To pinpoint this cross-over point, we calculate the speedup-to-error ratios for the write accesses (see Figure 5-9a and Figure 5-9c) at each data size and select the value that sustains a ratio greater than one for all flow counts. When this ratio is greater than one, the speedup is larger than the error, thus suggesting that the benefit of the technique outweighs its inaccuracies. It should be noted that while this ratio is useful to quickly identify a cross-over point, it can provide values that result in potentially large

inaccuracies in the cases where both the speedup and error values are large. For this study, we find that the 4 MB data size produced ratios of one or greater for all flow counts. However, we select 8 MB as the minimum message size to use the hybrid models since we desire single-digit errors for all flow counts as well. In the next section, we explore the impact of configuring the HAM and TCP models to use this value as their minimum hybrid message size on a real data-intensive application.

5.4.4 Case Study

Hyperspectral imaging (HSI) is a technique that combines conventional imaging and spectroscopy to identify and classify various objects within a 3D image. HSI is used in applications that include mapping, reconnaissance and surveillance, and environmental monitoring. Similar to other remote-sensing techniques, HSI typically deals with large amounts of data that in some applications must be processed in real-time to provide immediate assessment of potentially threatening scenarios. In this study, we apply our hybrid simulation approach to an HSI application based on the algorithm presented in [75] in order to showcase its capabilities when analyzing a real scientific application executing on the DM system.

Figure 5-10 illustrates the dataflow diagram of the HSI application. Each participating node acquires a slab of the image, calculates the autocorrelation sample matrix (ACSM), and transmits the results to a single root node. The root node processes the data collected from each node in the weight calculation stage and broadcasts C classification constraints to each node. The nodes then classify the original image data based on the constraints and save the resulting data to construct an output image. Table 5-4 displays the number of 4-byte elements transferred in each stage in terms of pixels per row/column (N), spectral bands (L), number of processors (P), and number of classification constraints (C).

For this case study, we explore the accuracy and speed of the hybrid version of the DM flight system composed of twenty data nodes [53] as compared to its standard,

Figure 5-10. The HSI data decomposition and dataflow diagram

Table 5-4. Data set sizes for each HSI data transaction

Transaction    Data set size (elements)
Get Data       N^2 L / P
Reduce         L^2
Broadcast      C L
Save Data      N^2 C / P

function-level counterpart. The study analyzes a total of ten image sizes listed in Table 5-5. The first five data sets represent the images processed using current and emerging implementations of the HSI application. The last five image sizes represent data sets that may be analyzed in future versions of HSI and showcase the capabilities of the hybrid technique when dealing with very large data sets. The triple line denotes the boundary in which extrapolation is used to approximate the simulation times required by the standard, function-level approach to complete a single HSI iteration. Extrapolation is employed as a means to quickly estimate the full functional simulation time rather than occupy resources for such large periods of time.

The study also considers two configurations of the hybrid HAM and TCP models. The configuration labeled Hybrid-0MB uses the default values listed in Table 5-3, while the Hybrid-8MB configuration sets the minimum hybrid message size for both the HAM and TCP models to 8 MB. Recall from Section 5.4.3 that this

data size was determined to balance the trade-off of using the hybrid method's speed and accuracy for smaller data transactions.

Table 5-5. Simulation times for various HSI image sizes

Image dimensions         Raw data    Simulation times
(elements)               size        Functional    Hybrid-0MB    Hybrid-8MB
1024 x 1024 x 64         256 MB      2.18 min      7.56 s        39.40 s
1024 x 1024 x 128        512 MB      3.92 min      8.08 s        40.25 s
1024 x 1024 x 256        1 GB        7.37 min      9.55 s        42.48 s
1024 x 1024 x 512        2 GB        14.31 min     15.34 s       50.88 s
1024 x 1024 x 1024       4 GB        28.33 min     20.91 s       80.33 s
2048 x 2048 x 1024       16 GB       1.88 hours    23.44 s       50.17 s
4096 x 4096 x 1024       64 GB       7.51 hours    34.07 s       1.03 min
8192 x 8192 x 1024       256 GB      1.25 days     1.25 min      1.74 min
16384 x 16384 x 1024     1 TB        5.01 days     4.08 min      4.66 min
32768 x 32768 x 1024     4 TB        20.03 days    15.53 min     16.29 min

Table 5-5 shows the simulation times required to complete a single iteration of HSI processing the corresponding image, while Figure 5-11 displays the error and speedup of the two hybrid configurations versus the standard, function-level approach. The results show that both hybrid configurations provide very accurate results as well as improved simulation times for all image sizes. In fact, the maximum errors observed in the Hybrid-0MB and Hybrid-8MB setups are 0.77% and 0.0032%, respectively. Similar to the previous studies, we find that speedup increases with data set size, though it begins to level as data sizes become very large. Maximum speedups of 1858 and 1771 are observed in the Hybrid-0MB and Hybrid-8MB configurations, respectively. Note that Hybrid-8MB reports less speedup for all data sizes due to the increased amount of fragmentation occurring for small to medium data transactions. However, the accuracy of this configuration for smaller image sizes is significantly improved, though the accuracy of the Hybrid-0MB configuration is very acceptable.
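The speedups quoted above can be approximately re-derived from the rounded entries of Table 5-5. The short Python sketch below does this, and also checks the raw data size column from the image dimensions and the 4-byte element size stated earlier. The helper names are hypothetical, and binary units (1 MB = 2^20 bytes) are an assumption that happens to reproduce the table's size column. Because the table entries are rounded, the recomputed values only approximate the reported maxima of 1858 and 1771.

```python
# Seconds per time unit used in Table 5-5.
UNIT_SECONDS = {"s": 1.0, "min": 60.0, "hours": 3600.0, "days": 86400.0}


def to_seconds(value, unit):
    """Convert a (value, unit) table entry to seconds."""
    return value * UNIT_SECONDS[unit]


def raw_size_bytes(n, bands, element_bytes=4):
    """Raw image size for an n x n x bands cube of 4-byte elements."""
    return n * n * bands * element_bytes


# (raw size label, functional, Hybrid-0MB, Hybrid-8MB), copied from Table 5-5.
TABLE_5_5 = [
    ("256 MB", (2.18, "min"), (7.56, "s"), (39.40, "s")),
    ("512 MB", (3.92, "min"), (8.08, "s"), (40.25, "s")),
    ("1 GB", (7.37, "min"), (9.55, "s"), (42.48, "s")),
    ("2 GB", (14.31, "min"), (15.34, "s"), (50.88, "s")),
    ("4 GB", (28.33, "min"), (20.91, "s"), (80.33, "s")),
    ("16 GB", (1.88, "hours"), (23.44, "s"), (50.17, "s")),
    ("64 GB", (7.51, "hours"), (34.07, "s"), (1.03, "min")),
    ("256 GB", (1.25, "days"), (1.25, "min"), (1.74, "min")),
    ("1 TB", (5.01, "days"), (4.08, "min"), (4.66, "min")),
    ("4 TB", (20.03, "days"), (15.53, "min"), (16.29, "min")),
]


def speedups(rows):
    """Return (label, Hybrid-0MB speedup, Hybrid-8MB speedup) per row."""
    out = []
    for label, functional, hybrid0, hybrid8 in rows:
        f = to_seconds(*functional)
        out.append((label, f / to_seconds(*hybrid0), f / to_seconds(*hybrid8)))
    return out


if __name__ == "__main__":
    for label, s0, s8 in speedups(TABLE_5_5):
        print(f"{label:>7}: Hybrid-0MB {s0:8.1f}  Hybrid-8MB {s8:8.1f}")
```

With the rounded table values, the largest speedups occur for the 4 TB image and come out near 1857 (Hybrid-0MB) and 1771 (Hybrid-8MB).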

(a) Error  (b) Speedup

Figure 5-11. The HSI accuracy and speedup results for two hybrid configurations

5.5 Conclusions

As data-intensive applications become increasingly prevalent, more efficient systems must be designed to accommodate their special demands. In order to facilitate the design of these systems, discrete-event simulation is often used to virtually prototype candidate systems. However, lengthy analysis times of complex systems via simulation are further hindered when evaluating data-intensive applications due to the sheer volume of data created, processed, and scheduled by the simulation environment. In this chapter, we presented a novel approach for hybrid simulation to speed the analysis of applications processing large data sets while retaining a high degree of accuracy. Our approach featured two techniques, function-level training and micro-simulations, to calibrate analytical models that depict the long-term, steady-state behaviors of the corresponding components and account for changes in the system's performance without the use of feedback mechanisms. Details on our hybrid simulation approach were outlined and the various implications of each technique used were discussed.

To showcase the capabilities of the proposed approach, we applied the techniques to the NASA Dependable Multiprocessor. First, we observed the accuracy and speedup achieved by the DM system models using the proposed techniques as compared to a pure functional model while employing two low-level benchmarks. The PingPong benchmark reported a mean relative error of 1.24% when using the hybrid simulation approach

while the MDSTest benchmark showed 1.64% and 3.01% errors for writes and reads, respectively. Furthermore, our approach showed speedups up to 1850 in the PingPong benchmark and over 1700 in the MDSTest. These large speedups were a result of the drastic reduction of discrete events processed by the hybrid approach. However, the outcomes observed from the initial tests represented best-case numbers. As a result, we analyzed the hybrid DM system model while executing the MDSTest benchmark on two to thirty-two nodes. This scenario caused contention at the MDS node and therefore investigated the capabilities of our proposed methodology in more stressing conditions. The results from this study showed errors up to 46%, though these larger errors occurred for smaller data sizes. As the transaction size increased, the errors decreased to more reasonable percentages and eventually to values less than 1%. Maximum speedups of the hybrid approach were observed to be 595 and 790 for writes and reads, respectively, at the maximum message size observed (i.e., 128 MB). While the proposed hybrid simulation approach reported large errors in this study, they were observed at smaller transaction sizes that displayed smaller speedups when employing the hybrid models. As a result, we identified a cross-over point at 8 MB that supported the use of the function-level models to ensure accurate results for small data sizes while transitioning to the hybrid models for data sizes larger than 8 MB in order to support the speedy simulations of large transactions.

Once our approach was validated and its potential demonstrated using low-level benchmarks, we evaluated its accuracy and speed using a hyperspectral imaging (HSI) application executing on the DM flight system. By analyzing various image sizes using both the standard function-level and proposed hybrid simulations, we found that our approach produced a maximum error of 0.7% while displaying a maximum speedup of 290. The trends observed from the study showed larger errors for smaller data sets due to inaccuracies in the analytical model, while the observed speedup increased with larger data sets. The analysis concluded by analyzing much larger data sets using the hybrid

simulation approach with extrapolations used to estimate the amount of time required by the function-level models. The results showed a projected maximum speedup of 1858. Speedups of this magnitude indicate that our hybrid approach has the potential to complete month-long simulations in mere minutes.

The work conducted during this final phase of research produced a hybrid simulation approach that employed two novel techniques, function-level training and micro-simulations, to reduce the analysis times of simulations considering data-intensive applications. These hybrid simulation mechanisms were incorporated into the FASE framework, and numerous pre-existing models within the FASE and DM model libraries were retrofitted to accommodate these speedy features. Case studies demonstrated the capabilities of the proposed approach as applied to the DM system executing a data-intensive, hyperspectral imaging application. The contributions and accomplishments of this work have been compiled into a manuscript that was submitted to ACM Transactions on Modeling and Computer Simulation [76].
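The cross-over selection recapped in this chapter (Section 5.4.3) can be sketched as a small search over per-size results. The Python below is a minimal illustration, not the FASE implementation. It assumes the error is expressed in percent when forming the speedup-to-error ratio, it reads "sustains" as requiring the criteria to hold at the chosen size and every larger size, and it uses made-up sample numbers (the measured values exist only in Figure 5-9). All names are hypothetical.

```python
MB = 2**20


def find_crossover(results, max_error_pct=10.0):
    """Smallest data size from which, for every flow count, the
    speedup-to-error ratio stays >= 1 and the error stays below
    max_error_pct, for that size and all larger sizes.

    results maps size in bytes -> list of (speedup, error_pct) pairs,
    one pair per flow count. Returns None if no size qualifies.
    """
    crossover = None
    for size in sorted(results, reverse=True):
        ok = all(
            err < max_error_pct and (err == 0 or sp / err >= 1.0)
            for sp, err in results[size]
        )
        if not ok:
            break
        crossover = size
    return crossover


def choose_model(size_bytes, crossover):
    """Pick the model family for one transaction given the cross-over."""
    if crossover is not None and size_bytes >= crossover:
        return "hybrid"
    return "function-level"


# Hypothetical (speedup, error %) pairs for two flow counts per data size.
SAMPLE = {
    1 * MB: [(2.0, 30.0), (1.8, 35.0)],
    2 * MB: [(5.0, 20.0), (4.5, 22.0)],
    4 * MB: [(12.0, 11.0), (13.0, 12.0)],
    8 * MB: [(25.0, 6.0), (24.0, 8.0)],
    16 * MB: [(50.0, 3.0), (48.0, 4.0)],
}

if __name__ == "__main__":
    ratio_only = find_crossover(SAMPLE, max_error_pct=float("inf"))
    with_error_cap = find_crossover(SAMPLE)
    print(ratio_only // MB, with_error_cap // MB)
    print(choose_model(64 * 1024, with_error_cap))
```

The sample data is constructed so that the ratio criterion alone selects 4 MB, while adding the single-digit error requirement moves the cross-over to 8 MB, mirroring the reasoning in the text rather than reproducing its measurements.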

CHAPTER 6
CONCLUSIONS

This document presented three phases of research to show the wide range of research topics addressed by the FASE framework. The first phase analyzed the various aspects involved with the design and development of a performance prediction infrastructure using application characterization and discrete-event simulation in order to balance speed and accuracy. The work laid out a generalized methodology to predict the performance of applications executing on virtually prototyped systems. The methodology was then realized through the use of an application characterization tool called Sequoia, which traced MPI function calls and measured computation time between communication events, and a pre-built model library created in MLDesigner, a hierarchical, discrete-event simulation tool. Case studies were then conducted to observe simulation accuracy and speed when compared to experimental measurements. The results showed accuracy errors within an acceptable threshold (within 25%) and simulation speeds no greater than three orders of magnitude slower than experimental processing times. After validation, the potential of FASE was showcased with an in-depth study of the Sweep3D algorithm executing on virtual systems composed of various network types, middleware implementations, processing capabilities, and other degrees of freedom in the systems' hardware and software configurations.

The framework developed in the first phase of this research created an ideal environment to evaluate high-performance, embedded space systems. Consequently, a NASA-sponsored project was used to explore the flexibility of FASE and extend the framework to support not only scalability studies of the proposed space system, but also availability studies. The work conducted in phase two expanded the FASE pre-built models to include reliable middleware technologies that monitor system health, schedule and deploy jobs, and react and recover from faults. A fault model library was also developed to inject faults into the system in order to perform availability

studies. After the necessary toolkit was in place, a three-stage analysis procedure was formulated for performance and availability evaluations. This approach allows users to calibrate component models using experimental measurements, quickly identify workloads and fault rates supported by a management software via Markov-reward models, and thoroughly investigate specific applications executing on virtually prototyped systems through discrete-event simulation. The novel analysis methodology and simulation models were then applied to explore the scalability of the 2D FFT kernel executing on the DM system. The scalability study revealed the prime bottleneck of the system was the centralized memory, and algorithmic and architectural variations were analyzed to alleviate the problem. After the scalability analysis, the SAR application was used to study the performability of the virtual flight system consisting of twenty data nodes. The results showed good system throughput (i.e., between 300 and 600 images per orbit) and performability (i.e., over 99.5% in low radiation environments and 54.0% in extreme conditions) when the system was enhanced and augmented with improved data processors and MDS storage devices as well as extra MDS nodes to mitigate the contention point discovered in the 2D FFT case study and further substantiated in the SAR study.

The final phase of this research considered extensions to the existing FASE framework in order to overcome scalability issues with simulation time when analyzing data-intensive applications. To overcome these issues, we proposed a novel hybrid simulation approach that employs two unique techniques, function-level training and micro-simulations, to reduce the amount of time required to simulate systems executing applications with large data sets. The proposed approach combines the accuracy of function-level models, via function-level training, with the speed of analytical models and micro-simulations in order to quickly and accurately approximate the time needed to complete a data transaction. This combination drastically reduces the number of discrete events processed and scheduled by the simulator, thus resulting in simulation speedups over the pure function-level approach. The approach was then applied to the DM system executing

low-level benchmarks and a hyperspectral imaging application. The low-level benchmarks showed relatively accurate results (less than 7%) and order-of-magnitude speedups when considering data transactions as small as 8 MB. Furthermore, the accuracy and speedup of the hybrid simulation approach improved as the transaction size increased. In fact, the simulations reported minimum errors of less than 1% and speedups over 1700 for the low-level benchmarks and 1500 for the HSI application.

APPENDIX A
EXPERIMENTAL AND SIMULATIVE SETUP

This research project incorporates two realms to explore various aspects and proposed techniques for fast and accurate performance prediction. These realms are experimental and simulative. The experimental realm deals with physical hardware and software used to construct a compute system to collect "real-world" values, which are compared to the results gathered in the second realm, simulation. The simulation realm provides an environment to explore computational systems unavailable to the researcher due to limited funds, non-existent components, or future generations of components. More details on the interactions between these realms as applied to performance modeling and prediction will be discussed in proceeding sections.

A.1 Experimental Setup

The work conducted during the course of this study employs equipment from the High-performance Computing and Simulation (HCS) Lab at the University of Florida. The HCS lab consists of 9 compute clusters, each with various resources regarding the processor, interconnect, main memory, hard disk, and software modules. Table A-1 lists the subset of clusters used for this study and their resource types, capabilities, and capacities.

Table A-1. Computation systems at the HCS Lab at UF

Cluster   CPU       CPU       CPU     Node    Memory        Special
name      type      speed     count   count                 features
Alpha     Xeon      3.2 GHz   128     32      2 GB DDR667   EMT64, Quad-core
Delta     Xeon      3.2 GHz   16      16      2 GB DDR333   EMT64, PCI-Express
Mu        Opteron   2.0 GHz   32      32      1 GB DDR400   PCI-Express, QsNetII
Lambda    Opteron   1.4 GHz   32      16      1 GB DDR333   10 Gb/s InfiniBand
Kappa     Xeon      2.4 GHz   70      35      1 GB DDR266

A.2 Simulation Setup

The modeling tool employed for this project was Mission-Level Designer (MLD) developed by MLDesign Technologies Inc [27]. MLD is a block-oriented, discrete-event simulation environment that supports modular and hierarchical designs. At its core,

MLD uses primitives, C++ code that provides some specific function such as arithmetic, data flow switching, or data queuing. Larger modules and systems are constructed by connecting two or more primitives and/or other modules via a graphical interface. In order to further facilitate user design, MLD supplies numerous libraries with pre-built primitives and modules. Figure A-1 shows the development environment of MLD.

Figure A-1. The MLD development environment

REFERENCES

[1] O. Lubeck, Y. Luo, H. Wasserman, and F. Bassetti, "An Empirical Hierarchical Memory Model Based on Hardware Performance Counters," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, Las Vegas, NV, July 13-16, 1998.
[2] D. Kerbyson, H. Wasserman, and A. Hoisie, "Exploring Advanced Architectures Using Performance Prediction," Proc. Int'l Workshop on Innovative Architecture, Kohala Coast, Big Island, HI, Jan. 10-11, 2002.
[3] M. Salsburg, "A Statistical Approach to Computer Performance Modeling," ACM SIGMETRICS Performance Evaluation Review, vol. 15, no. 1, pp. 155-162, May 1987.
[4] E. Strohmaier, "Statistical Performance Modeling: Case Study of the NPB 2.1 Results," Proc. Third Int'l Euro-Par Conf. Parallel Processing, Passau, Germany, Aug. 26-29, 1997.
[5] R. Jain, The Art of Computer Systems Performance Analysis. John Wiley and Sons, 1991.
[6] A. Sampogna, D. Kaeli, D. Green, M. Silva, and C. Sniezek, "Performance Modeling Using Object-Oriented Execution-Driven Simulation," Proc. 29th Simulation Symp., New Orleans, LA, Apr. 8-11, 1996.
[7] S. Dwarkadas, J. Jump, and J. Sinclair, "Execution-Driven Simulation of Multiprocessors: Address and Timing Analysis," ACM Trans. Modeling and Computer Simulation, vol. 4, no. 4, pp. 314-338, Oct. 1994.
[8] R. Uhlig and T. Mudge, "Trace-driven Memory Simulation: A Survey," ACM Computing Surveys, vol. 29, no. 2, pp. 128-170, June 1997.
[9] J. Flanagan, B. Nelson, J. Archibald, and G. Thompson, "The Inaccuracy of Trace-Driven Simulation Using Incomplete Multiprogramming Trace Data," Proc. Fourth Int'l Workshop Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, San Jose, CA, Feb. 1-3, 1996.
[10] S. Moore, F. Wolf, J. Dongarra, S. Shende, P. Teller, and B. Mohr, "A Scalable Approach to MPI Application Analysis," Proc. 12th European PVM/MPI Users' Group Meeting, Sorrento, Italy, Sept. 18-21, 2005.
[11] D. Culler, J. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.
[12] MPI Forum, "MPI: A Message-Passing Interface Standard," University of Tennessee, Version 1.1, June 1995.

[13] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.
[14] A. George, R. Fogarty, J. Markwell, and M. Miars, "An Integrated Simulation Environment for Parallel and Distributed System Prototyping," Simulation, vol. 72, no. 5, pp. 283-294, May 1999.
[15] E. Deelman, A. Dube, A. Hoisie, Y. Luo, R. Oliver, D. Sundaram-Stukel, H. Wasserman, V. Adve, R. Bagrodia, J. Browne, E. Houstis, O. Lubeck, J. Rice, P. Teller, and M. Vernon, "POEMS: End-to-End Performance Design of Large Parallel Adaptive Computational Systems," IEEE Trans. Software Engineering, vol. 26, no. 11, pp. 1027-1048, Nov. 2000.
[16] R. Bagrodia, E. Deelman, S. Docy, and T. Phan, "Performance Prediction of Large Parallel Applications Using Parallel Simulations," Proc. Seventh ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 151-161, Atlanta, GA, May 1999.
[17] M. Uysal, T. Kurc, A. Sussman, and J. Saltz, "A Performance Prediction Framework for Data Intensive Applications on Large Scale Parallel Machines," Technical Report CS-TR-3918 and UMIACS-TR-98-39, University of Maryland, Department of Computer Science and UMIACS, July 1998.
[18] J. Cao, D. Kerbyson, E. Papaefstathiou, and G. Nudd, "Performance Modeling of Parallel and Distributed Computing Using PACE," Proc. 19th IEEE Int'l Performance, Computing, and Communications Conf., pp. 485-492, Phoenix, AZ, Feb. 20-22, 2000.
[19] S. Pllana and T. Fahringer, "Performance Prophet: A Performance Modeling and Prediction Tool for Parallel and Distributed Programs," Proc. Int'l Conf. Parallel Processing, Oslo, Norway, June 14-17, 2005.
[20] A. Snavely, L. Carrington, and N. Wolter, "A Framework for Performance Modeling and Prediction," Proc. 15th Supercomputing Conf., Baltimore, MD, Nov. 16-22, 2002.
[21] D. Bailey and A. Snavely, "Performance Modeling: Understanding the Present and Predicting the Future," Proc. Euro-Par Conf., Lisbon, Portugal, Aug. 30-Sept. 2, 2005.
[22] R. Badia, J. Labarta, J. Gimenez, and F. Escale, "DIMEMAS: Predicting MPI Applications Behavior in Grid Environments," Proc. Workshop on Grid Applications and Programming Tools, Seattle, WA, June 25, 2003.
[23] S. Moore, D. Cronk, K. London, and J. Dongarra, "Review of Performance Analysis Tools for MPI Parallel Programs," Technical Report, University of Tennessee, Computer Science Department, 1998.

[24] L. DeRose and D. Reed, "SvPablo: A Multi-Language Architecture-Independent Performance Analysis System," Proc. Int'l Conf. Parallel Processing, Fukushima, Japan, Sept. 1999.
[25] B. Miller, M. Callaghan, J. Cargille, J. Hollingsworth, R. Irvin, K. Karavanic, K. Kunchithapadam, and T. Newhall, "The Paradyn Parallel Performance Measurement Tool," IEEE Computer, vol. 28, no. 11, pp. 37-46, Nov. 1995.
[26] S. Shende and A. Malony, "The TAU Parallel Performance System," Int'l Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287-331, Summer 2006.
[27] G. Schorcht, I. Troxel, K. Farhangian, P. Unger, D. Zinn, C. Mick, A. George, and H. Salzwedel, "System-Level Simulation Modeling with MLDesigner," Proc. 11th Int'l Symp. Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 207-212, Orlando, FL, Oct. 12-15, 2003.
[28] J. Vetter, N. Bhatia, E. Grobelny, and P. Roth, "Capturing Petascale Application Characteristics with the Sequoia Toolkit," Proc. Parallel Computing, Malaga, Spain, Sept. 13-16, 2005.
[29] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci, "A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters," Proc. 13th Supercomputing Conf., Dallas, TX, Nov. 4-10, 2000.
[30] E. Grobelny and J. Vetter, "Extrapolating Communication Patterns of Large-scale Scientific Applications," Technical Report, University of Florida and Oak Ridge National Laboratory, 2006.
[31] O. Zaki, E. Lusk, W. Gropp, and D. Swider, "Toward Scalable Performance Visualization with Jumpshot," The Int'l Journal of High Performance Computing Applications, vol. 13, no. 2, pp. 277-288, Fall 1999.
[32] D. Gustavson and Q. Li, "The Scalable Coherent Interface (SCI)," IEEE Communications Magazine, vol. 34, no. 8, pp. 52-63, Aug. 1996.
[33] K. Koch, R. Baker, and R. Alcouffe, "Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor," Trans. of the American Nuclear Society, vol. 65, no. 198, 1992.
[34] J. Vetter and A. Yoo, "An Empirical Performance Evaluation of Scalable Scientific Applications," Proc. 15th Supercomputing Conf., Baltimore, MD, Nov. 16-22, 2002.
[35] E. Grobelny, D. Bueno, I. Troxel, A. George, and J. Vetter, "FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications," Simulation: Transactions of The Society for Modeling and Simulation International, vol. 83, no. 10, pp. 721-745, Oct. 2007.

[36] M. Griffin, "NASA 2006 Strategic Plan," National Aeronautics and Space Administration, NP-2006-02-423-HQ, Washington DC, Feb. 2006.
[37] J. Ramos, J. Samson, D. Lupia, I. Troxel, R. Subramaniyan, A. Jacobs, J. Greco, G. Cieslewski, J. Curreri, M. Fischer, E. Grobelny, A. George, V. Aggarwal, M. Patel, and R. Some, "High-Performance, Dependable Multiprocessor," Proc. IEEE/AIAA Aerospace Conf., Big Sky, MT, Mar. 4-11, 2006.
[38] D. Dechant, "The Advanced Onboard Signal Processor (AOSP)," Advances in VLSI and Computer Systems, vol. 2, no. 2, pp. 69-78, Oct. 1990.
[39] M. Iacoponi and D. Vail, "The Fault Tolerance Approach of the Advanced Architecture On-Board Processor," Proc. Symp. Fault-Tolerant Computing, Chicago, IL, June 21-23, 1989.
[40] F. Chen, L. Craymer, J. Deifik, A. Fogel, D. Katz, A. Silliman Jr., R. Some, S. Upchurch, and K. Whisnant, "Demonstration of the Remote Exploration and Experimentation (REE) Fault-Tolerant Parallel-Processing Supercomputer for Spacecraft Onboard Scientific Data Processing," Proc. Int'l Conf. Dependable Systems and Networks, New York, NY, June 25-28, 2000.
[41] E. Prado, P. Prewitt, and E. Ille, "A Standard Approach to Spaceborne Payload Data Processing," Proc. IEEE Aerospace Conf., Big Sky, MT, March 10-17, 2001.
[42] S. Fuller, RapidIO - The Embedded System Interconnect. John Wiley & Sons, 2005.
[43] J. Meyer, "On Evaluating the Performability of Degradable Computing Systems," IEEE Trans. Computers, vol. C-29, no. 8, pp. 720-731, Aug. 1980.
[44] R. Subramaniyan, V. Aggarwal, A. Jacobs, and A. George, "FEMPI: A Lightweight Fault-tolerant MPI for Embedded Cluster Systems," Proc. Int'l Conf. Embedded Systems and Applications, Las Vegas, NV, June 26-29, 2006.
[45] R. Smith, K. Trivedi, and A. Ramesh, "Performability Analysis: Measures, an Algorithm and a Case Study," IEEE Trans. Computers, vol. 37, no. 4, pp. 406-417, Apr. 1988.
[46] B. Haverkort, R. Marie, G. Rubino, and K. Trivedi (editors), Performability Modeling: Techniques and Tools. Wiley, 2001.
[47] C. Hirel, R. Sahner, X. Zang, and K. Trivedi, "Reliability and Performability Modeling Using SHARPE 2000," Proc. Int'l Conf. Computer Performance Evaluation: Modeling Techniques and Tools, Schaumburg, IL, Mar. 27-31, 2000.
[48] I. Troxel, E. Grobelny, and A. George, "System Management Services for High-Performance In-situ Aerospace Computing," AIAA Journal of Aerospace Computing, Information, and Communication, vol. 4, no. 2, pp. 636-656, Feb. 2007.

[49] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, "RapidIO-based Space System Architectures for Synthetic Aperture Radar and Ground Moving Target Indicator," High-Performance Embedded Computing Workshop, MIT Lincoln Lab, Lexington, MA, Sept. 20-22, 2005.
[50] P. Meisl, M. Ito, and I. Cumming, "Parallel Synthetic Aperture Radar Processing on Workstation Networks," Proc. 10th Int'l Parallel Processing Symp., pp. 716-723, Honolulu, HI, Apr. 15-19, 1996.
[51] C. Miller, D. Payne, T. Phung, H. Siegel, and R. Williams, "Parallel Processing of Spaceborne Imaging Radar Data," Proc. Eighth Supercomputing Conf., San Diego, CA, Dec. 4-8, 1995.
[52] D. Sandwell, "SAR Image Formation: ERS SAR Processor Coded in MATLAB," http://www.geo.uzh.ch/rsl/research/SARLab/GMTILiterature/PDF/San02d.pdf, 2002.
[53] J. Samson, G. Gardner, D. Lupia, M. Patel, P. Davis, V. Aggarwal, A. George, Z. Kalbarczyk, and R. Some, "Technology Validation: NMP ST8 Dependable Multiprocessor Project II," Proc. IEEE Aerospace Conf., Big Sky, MT, Mar. 3-10, 2007.
[54] E. Grobelny, G. Cieslewski, I. Troxel, and A. George, "Predicting the Performance of Radiation-Susceptible Aerospace Computing Systems and Applications," submitted to ACM Trans. Embedded Computing Systems.
[55] M. Cannataro, D. Talia, and P. Srimani, "Parallel Data Intensive Computing in Scientific and Commercial Applications," Parallel Computing, vol. 28, no. 5, pp. 673-704, May 2002.
[56] U. Fayyad, "Data Mining and Knowledge Discovery: Making Sense Out Of Data," IEEE Expert: Intelligent Systems and Their Applications, vol. 11, no. 5, pp. 20-25, Oct. 1996.
[57] Data-Intensive Computing Initiative (DICI), Pacific Northwest National Laboratory, http://dicomputing.pnl.gov/
[58] K. Walsh and E. Sirer, "Staged Simulation: A General Technique For Improving Simulation Scale and Performance," ACM Trans. Modeling and Computer Simulation, vol. 14, no. 2, pp. 170-195, Apr. 2004.
[59] R. Fujimoto, "Parallel Simulation: Parallel and Distributed Simulation Systems," Proc. Winter Simulation Conf., pp. 147-157, Arlington, VA, Dec. 9-12, 2001.
[60] D. Nicol, "Principles of Conservative Parallel Simulation," Proc. Winter Simulation Conf., pp. 128-135, Coronado, CA, Dec. 8-11, 1996.
[61] B. Groselj, "CPSim: A Tool for Creating Scalable Discrete-Event Simulations," Proc. Winter Simulation Conference, pp. 579-583, Arlington, VA, Dec. 3-6, 1995.

[62] R. Bagrodia, R. Meyer, M. Takai, Y. Chen, X. Zeng, J. Martin, B. Park, and H. Song, "Parsec: A Parallel Simulation Environment for Complex Systems," IEEE Computer, vol. 31, no. 10, pp. 77-85, Oct. 1998.
[63] R. Fujimoto, "Optimistic Approaches to Parallel Discrete-event Simulation," Trans. of the Society for Computer Simulation International, vol. 7, no. 2, pp. 153-191, June 1990.
[64] Y. Teo, S. Tay, and S. Kong, "SPaDES: An Environment for Structured Parallel Simulation," Technical Report, TR20/96, Department of Information Systems and Computer Science, National University of Singapore, Singapore, Oct. 1996.
[65] D. Martin, T. McBrayer, and P. Wilsey, "WARPED: A Time Warp Simulation Kernel for Analysis and Application Development," Proc. 29th Hawaii Int'l Conf. System Sciences Volume 1: Software Technology and Architecture, Jan. 3-6, 1996, pp. 383-386.
[66] J. West and A. Mullarney, "ModSim: A Language for Distributed Simulation," Proc. SCS Multiconf. Distributed Simulation, pp. 155-159, San Diego, CA, Feb. 3-5, 1988.
[67] Y. Liu, F. Presti, V. Misra, D. Towsley, and Y. Gu, "Fluid Models and Solutions for Large-scale IP Networks," Proc. ACM SIGMETRICS Int'l Conf. Measurement and Modeling Computer Systems, pp. 91-101, San Diego, CA, June 10-14, 2003.
[68] A. Yan and W. Gong, "Time-driven Fluid Simulation for High Speed Networks," IEEE Trans. Information Theory, vol. 45, no. 5, pp. 1588-1599, June 1999.
[69] B. Liu, Y. Gao, J. Kurose, D. Towsley, and W. Gong, "Fluid Simulation of Large-scale Networks: Issues and Tradeoffs," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 2136-2142, Las Vegas, NV, June 28-July 1, 1999.
[70] G. Kesidis, A. Singh, D. Cheung, and W. Kwok, "Feasibility of Fluid-Driven Simulation for ATM Network," Proc. IEEE Global Telecommunications Conf., pp. 2013-2017, London, England, Nov. 18-22, 1996.
[71] B. Melamed and S. Pan, "HNS: A Streamlined Hybrid Network Simulator," ACM Trans. Modeling and Computer Simulation, vol. 14, no. 3, pp. 251-277, July 2004.
[72] G. Riley, T. Jafaar, and R. Fujimoto, "Integrated Fluid and Packet Network Simulations," Proc. IEEE Int'l Symp. Modeling, Analysis and Simulation of Computer and Telecommunications Systems, pp. 511-518, Fort Worth, TX, Oct. 11-16, 2002.
[73] C. Kiddle, R. Simmonds, C. Williamson, and B. Unger, "Hybrid Packet/Fluid Flow Network Simulation," Proc. 17th Workshop Parallel and Distributed Simulation, pp. 143-152, San Diego, CA, June 10-13, 2003.

[74] M. Uysal, T. Kurc, A. Sussman, and J. Saltz, "A Performance Prediction Framework for Data Intensive Applications on Large-Scale Parallel Machines," Proc. Fourth Int'l Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, pp. 243-258, Pittsburgh, PA, May 28-30, 1998.
[75] C. Chang, H. Ren, and S. Chiang, "Real-time Processing Algorithms for Target Detection and Classification in Hyperspectral Imagery," IEEE Trans. Geoscience and Remote Sensing, vol. 39, no. 4, pp. 760-768, Apr. 2001.
[76] E. Grobelny, C. Reardon, and A. George, "A Hybrid Simulation Approach to Reduce Analysis Time of Data-Intensive Applications," submitted to ACM Trans. Modeling and Computer Simulation.

BIOGRAPHICAL SKETCH

Eric Grobelny began attending the University of Florida in Fall 1998 and received his B.S. in 2002 and M.E. in 2004. He conducted research at the High-performance Computing and Simulation (HCS) Laboratory under the supervision of Dr. Alan George for six years, focusing on performance analysis and prediction for high-performance computing systems and applications. His other interests include high-performance embedded computing for aerospace systems and applications, simulation-based fault injection, and high-performance interconnect technologies. As a member of the HCS lab, he also worked on numerous side projects including developing an MPI communication layer for satellite systems, investigating performance enhancements for low-locality applications, and exploring techniques and best practices for disaster recovery and mission assurance in dynamic, high-performance environments. Eric has accepted a job at Honeywell Space Systems in Clearwater, Florida.