FAST AND ACCURATE SIMULATION ENVIRONMENT (FASE) FOR HIGH-PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS

By

ERIC M. GROBELNY

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008
© 2008 Eric M. Grobelny
To my family and friends
ACKNOWLEDGMENTS

I would like to thank, first and foremost, my parents, brother, and sister for their love and guidance. I would also like to express my utmost gratitude to my advisor, Dr. Alan George, for supporting me through graduate school and teaching me the necessary skills to become a researcher and innovator. Another crucial person who taught me much about research in computer engineering is Dr. Jeff Vetter. Furthermore, I wish to express my appreciation to my sponsors, the Department of Defense, Honeywell, and the University of Florida, for their financial aid. Without it I would be in extreme debt. Finally, I would like to thank Mr. Robert Henuber for planting the seed that inspired and motivated me to become a doctor of philosophy in computer engineering.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
2 BACKGROUND AND RELATED RESEARCH
3 FAST AND ACCURATE SIMULATION ENVIRONMENT (PHASE 1)
  3.1 Application Domain
    3.1.1 Application Characterization
    3.1.2 Stimulus Development
  3.2 Simulation Domain
    3.2.1 Component Design
    3.2.2 System Development
    3.2.3 System Analysis
  3.3 Results and Analysis
    3.3.1 Model Validation
    3.3.2 System Validation
    3.3.3 Case Study: Sweep3D
      3.3.3.1 Experiment 1: Accuracy
      3.3.3.2 Experiment 2: Speed
      3.3.3.3 Experiment 3: Virtual system prototyping
  3.4 Conclusions
4 PERFORMANCE AND AVAILABILITY PREDICTIONS OF VIRTUALLY PROTOTYPED SYSTEMS FOR SPACE-BASED APPLICATIONS (PHASE 2)
  4.1 Background
    4.1.1 Project Overview
    4.1.2 DM System Architecture
    4.1.3 DM Middleware Architecture
  4.2 Approach
    4.2.1 Physical Prototype
    4.2.2 Markov-Reward Modeling
      4.2.2.1 Data node model
      4.2.2.2 System model
    4.2.3 Discrete-Event Simulation Modeling
    4.2.4 Fault Model Library
  4.3 Results and Analysis
    4.3.1 Model Calibration
      4.3.1.1 Component model calibration and validation
      4.3.1.2 System performability model
    4.3.2 Case Study: Fast Fourier Transform
    4.3.3 Case Study: Synthetic Aperture Radar
      4.3.3.1 Amenability study
      4.3.3.2 In-depth application analysis
      4.3.3.3 Flight system
  4.4 Conclusions
5 HYBRID SIMULATIONS TO IMPROVE THE ANALYSIS TIME OF DATA-INTENSIVE APPLICATIONS (PHASE 3)
  5.1 Introduction
  5.2 Background and Related Research
  5.3 Hybrid Simulation Approach
    5.3.1 Function-Level Training
    5.3.2 Analytical Modeling
    5.3.3 Micro-Simulation
  5.4 Results and Analysis
    5.4.1 Simulation Setup
    5.4.2 Performance Modeling
    5.4.3 Contention Modeling
    5.4.4 Case Study
  5.5 Conclusions
6 CONCLUSIONS

APPENDIX

A EXPERIMENTAL AND SIMULATIVE SETUP
  A.1 Experimental Setup
  A.2 Simulation Setup

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

Table
3-1 The FASE component library
3-2 Experimental versus simulation execution times for matrix multiply
3-3 Ratio of simulation to experimental wall-clock execution time
3-4 Compute node specifications for each cluster in heterogeneous system
3-5 Experimental versus simulation errors for Sweep3D
4-1 The DM middleware components
4-2 Data node model states
4-3 Failure and recovery rates of the node model
4-4 Summary of DM component models
4-5 Summary of fault models
4-6 Baseline system parameters
4-7 The FFT algorithmic variations and system enhancements
4-8 Checkpoint options explored using patch-based SAR application
4-9 Architectural enhancements explored for flight system
5-1 Summary of relevant simulation models
5-2 Key system parameters
5-3 Hybrid source model parameters
5-4 Data set sizes for each HSI data transaction
5-5 Simulation times for various HSI image sizes
A-1 Computation systems at the HCS Lab at UF
LIST OF FIGURES

Figure
3-1 High-level data-flow diagram of FASE framework
3-2 The FASE application characterization process
3-3 InfiniBand model latency validation
3-4 InfiniBand model throughput validation
3-5 The TCP/IP/Ethernet model latency validation
3-6 The TCP/IP/Ethernet model throughput validation
3-7 The SCI model latency validation
3-8 The SCI model throughput validation
3-9 Sweep3D algorithm
3-10 Experimental versus simulative execution times for Sweep3D
3-11 Ratios of simulation to experimental wall-clock completion time for varying system and data set sizes
3-12 Execution times for Sweep3D running on various system configurations
3-13 Maximum speedups for Sweep3D running on various network configurations
3-14 Speedups for Sweep3D running on 8192-node InfiniBand system
4-1 System hardware architecture of the dependable multiprocessor
4-2 System software architecture of the dependable multiprocessor
4-3 Logical diagram and photograph of DM testbed
4-4 Markov-reward data node model
4-5 Markov-reward system model
4-6 The DM node models
4-7 The DM flight system model
4-8 Example fault-enabled system
4-9 Throughput validations for network and MDS subsystem models
4-10 Markov versus simulation DM system performability comparison
4-11 Dataflow diagram of parallel 2D FFT
4-12 Execution time per image for baseline and enhanced systems
4-13 Parallel 2D FFT execution times per image for various performance-enhancing techniques
4-14 Distributed 2D FFT execution times per image for various performance-enhancing techniques
4-15 SAR dataflow with optional checkpoint stages and patched data decomposition
4-16 Amenability results via Markov model for patch-based SAR application
4-17 System performability percentages and throughputs for patch-based SAR
4-18 System performability and throughput for 8192-element patch-based SAR executing on various system sizes
4-19 Speedups of architectural enhancements for patch-based SAR
4-20 System performability and throughput of 20-node DM flight system executing patch-based SAR
5-1 High-level example systems employing hybrid modeling
5-2 High-level diagram of hybrid simulation approach
5-3 Function-level training procedure
5-4 Example three-flow micro-simulation
5-5 PingPong accuracy results
5-6 MDSTest accuracy results
5-7 PingPong speedup results
5-8 MDSTest speedup results
5-9 MDSTest accuracy and speedup results using hybrid modeling approach
5-10 The HSI data decomposition and dataflow diagram
5-11 The HSI accuracy and speedup results for two hybrid configurations
A-1 The MLD development environment
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

FAST AND ACCURATE SIMULATION ENVIRONMENT (FASE) FOR HIGH-PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS

By

Eric M. Grobelny

August 2008

Chair: Alan George
Major: Electrical and Computer Engineering

As systems of computers become more complex in terms of their architecture, interconnect, and heterogeneity, the optimum configuration and use of these machines becomes a major challenge. To reduce the penalties caused by poorly configured systems, simulation is often used to predict the performance of key applications to be executed on the new systems. Simulation provides the capability to observe component and system characteristics (e.g., performance and power) in order to make vital design decisions. However, simulating high-fidelity models can be very time consuming and even prohibitive when evaluating large-scale systems. The Fast and Accurate Simulation Environment (FASE) framework seeks to support large-scale system simulation by using high-fidelity models to capture the behavior of only the performance-critical components while employing abstraction techniques to capture the effects of those components with little impact on the system. To achieve this balance of accuracy and simulation speed, FASE provides a methodology and associated toolset to evaluate numerous architectural options. This approach allows users to make system design decisions based on quantifiable demands of their key applications rather than using manual analysis, which can be error prone and impractical for large systems. The framework accomplishes this evaluation through a novel approach of combining discrete-event simulation with an application characterization scheme in order to remove unnecessary details while focusing on components critical to the performance of the
application. In addition, FASE is extended to support in-depth availability analyses and quick evaluations of data-intensive applications. In this document, we present the methodology and techniques behind FASE and include several case studies validating systems constructed using various applications and interconnects. The studies show that FASE produces results with acceptable accuracy (i.e., maximum error of 23.3% and under 6% in most cases) when predicting the performance of complex applications executing on HPC systems. Furthermore, when using FASE to analyze data-intensive applications, the framework achieves over 1500× speedup with less than 1% error when compared to the traditional, function-level modeling approach.
CHAPTER 1
INTRODUCTION

Substantial capital resources are invested annually to expand the computational capacity and improve the quality of tools scientists have at their disposal to solve grand-challenge problems in physics, life sciences and other disciplines. Typically each new large-scale, high-performance computing (HPC) system deployed at a national lab, industry research center, or other site exemplifies the latest in technology and frequently outperforms its predecessors as measured by the execution of generic benchmark suites. While a supercomputer's raw computational potential can be readily predicted and demonstrated for these generic benchmarks, an application scientist's ability to harness the new system's potential for their specific application is not guaranteed. Few applications stress systems in exactly the same manner, especially at large system sizes, and therefore predicting how best to allocate limited funds to build an optimally configured supercomputer is challenging. Lacking quantitative data and a clear methodology to understand and explore the design space, developers typically rely on intuition, or instead simply use manual analysis to identify the best available option for each system component, which may often lead to inefficiencies in the system architecture. A more structured methodology is required to provide developers the means to perform an accurate cost-benefit analysis to ensure resources are efficiently allocated.

The Fast and Accurate Simulation Environment (FASE) has been developed to address this critical need. FASE is a comprehensive methodology and associated toolset for performance prediction and system-design exploration that encompasses a means to characterize applications, design and build virtual prototypes of candidate systems, and study the performance of applications in a quick, accurate, and cost-effective manner. FASE approaches the grand challenge that is performance prediction of HPC applications on candidate systems by splitting the problem into two domains: the application and the
simulation domains. Though some interdependencies must exist between the two realms, this split isolates the work conducted in either domain so that application analysis data and system models can be reused with little effort when exploring other applications or system designs.

Unlike other performance prediction environments, FASE provides the unique capability of virtually prototyping candidate systems via a graphical user interface. This feature not only provides substantial time and cost savings as compared to developing an experimental prototype, but also captures structural dependencies (e.g., network contention) within the computational subsystems, allowing users to explore decomposition and load balancing options. Furthermore, virtual prototyping can help forecast the technological advances in specific components that would be most critical to improving the performance of select applications. More importantly, cross-over points in key metrics, such as network latency, can be identified by quantitatively assessing where to apply Amdahl's law for a particular application and system pair. In order to ensure all options are examined, analysis can also include the reuse potential of currently deployed systems in order to determine if upgrading or otherwise augmenting those existing systems will provide a better return on investment compared to building an entirely new system.

Another unique feature of FASE is its iterative design and analysis procedures in which results from one or more initial runs are used to refine the application characterization process as well as dictate the fidelity of the component models employed in candidate architectures. Iterations during these stages can result in highly targeted performance data that drives simulated systems optimized for speed and accuracy. To accommodate optimized system models, the framework supports a combination of analytical and simulation model types. This combination allows users to effectively adjust the focal points of the simulations to the components with the greatest impact on system performance through the use of the simulation models while still accounting for the less influential components by using faster analytical models. As a result, the complexity of the overall system design is reduced, thus decreasing simulation time. In summary,
FASE allows designers to evaluate the performance of a specific set of applications while exploring a rich set of system design options available for deployment via an upgradeable, modular component library.

The list below details the main contributions of the FASE framework.

1. A systematic, iterative methodology that describes the various options available at each step of the design and analysis process and illuminates the implications and issues with each option.
2. The FASE toolset that provides an application characterization tool supporting MPI-based C and Fortran applications to collect performance data and a graphical, object-oriented (using C++) simulation environment to virtually prototype and evaluate candidate systems.
3. A pre-built model library that contains a variety of HPC architectural components facilitating rapid prototyping and evaluation of systems with varying degrees of detail at all the key subsystems.

This study consists of three phases of research. The first phase focuses on designing and developing a robust and comprehensive methodology and toolkit that users can employ to quickly and accurately predict the performance of interesting HPC systems and applications. More specifically, the work conducted in the first phase provides a detailed procedure to characterize an application, design and build component and system models, and analyze the application's performance on various system configurations via simulation. A basic toolkit, consisting of an application analysis tool, a simulator, and pre-built simulation models of key components and sample systems, facilitates the prediction and analysis procedure. After creating the foundation of FASE, we perform case studies using a matrix multiply benchmark and a real scientific application in order to validate the speed and accuracy of FASE.

The second phase of the study explores the use of the FASE framework to perform an in-depth performability analysis of space-based systems. This work consists of the design and development of component and system models to represent the candidate space system as well as a simulation-based fault injection framework to conduct scalability
and performability analyses of different system configurations and application variations. The scalability analysis employs the 2D Fast Fourier Transform (2D FFT) as the key benchmark kernel due to its computational relevance in many space-based applications. After identifying the strengths and weaknesses of the base architecture, we conduct a performability study of the system under different environmental conditions. The study uses the Synthetic Aperture Radar (SAR) application, which incorporates the 2D FFT kernel in order to perform image processing. The study reports insight on key architectural and algorithmic options that provide performance and availability enhancements for future space systems.

Phase three expands the foundation of FASE by incorporating hybrid simulation techniques to address scalability issues when analyzing data-intensive applications. The research conducted within this phase focuses on the design and development of techniques that combine the strengths of the function-level and analytical modeling approaches in order to reduce simulation times in applications that process very large data sets. The proposed approach is validated using simple benchmark programs executing on an emerging space-based platform while its capabilities are demonstrated by analyzing a data-intensive, remote-sensing application called hyperspectral imaging (HSI).

The remainder of this document consists of the technical details of the study. Chapter 2 provides background information on the basic concepts involved in the performance prediction process through simulation. The chapter also presents brief overviews of previous research that shares similar methods and goals with FASE. Appendix A describes the various facilities and tools used to conduct the research. Chapter 3 details the FASE framework while Chapter 4 applies and extends the framework to perform scalability and performability analyses on a space-based system for a thorough architectural and algorithmic evaluation. Chapter 5 extends FASE by enhancing pre-built models with the necessary mechanisms needed to support hybrid simulations for the
analysis of data-intensive applications. Finally, Chapter 6 summarizes the work presented in this document.
CHAPTER 2
BACKGROUND AND RELATED RESEARCH

Building large-scale systems to determine how an application will perform is simply not a viable option due to time and cost constraints. Therefore, other methods of precursory investigation are needed, and several different types of modeling techniques exist to aid in this process. Analytical [1], [2] and statistical [3], [4] modeling are two such options, and both methods involve a representation of the system using mathematical equations or Markov models in order to gain insight on how a particular system will perform based on certain parameters. These models can become very complex, especially when considering the large number of configuration parameters of large-scale HPC systems, and still remain inaccurate due to the overall complexity of the systems. In addition, it is difficult for analytical modeling to address the higher-order transient features of an application, such as network contention, and often over-simplifications are employed to make the equations solvable [5].

Computer simulation is an alternative that brings accuracy and flexibility to this challenge. Real hardware can be modeled at any degree of fidelity required by the user and dictated by the application, allowing the system to be tailored to the application and vice versa. A simulation-based approach provides the user with the flexibility to model important components very accurately while sacrificing the accuracy of less vital components by modeling them at a lower fidelity. In addition to these benefits, computer simulation supports the scaling of specific component parameters, allowing for the modeling of next-generation technologies which may not be currently available for experimental study. Analysis based on such models can also provide concrete evidence that may influence the future roadmap of system component manufacturers.

Classically, there are two types of computer simulation environments: execution-driven and trace-driven. Execution-driven simulations often use a program's machine code as the input to the simulation models and also have near clock-cycle fidelity, producing accurate results at the cost of slow simulation speeds [6], [7]. Though very
useful for detailed studies of small-scale systems, execution-driven simulations tend to become impractical with regard to time when used to simulate large, complex systems. Trace-driven simulations employ higher-level representations of an application's execution to drive the simulations [8], [9]. These representations are generally captured using an instrumented version of the application under study running on a given hardware configuration. In essence, the trace-driven simulations require extra time during the "characterization" stage to thoroughly understand and capture relevant information about the application. This additional time spent during characterization is typically amortized during the simulation stage by having the input available for numerous simulated system configurations. The accuracy of trace-driven simulations not only depends on the fidelity of the models, but also on the detail of the information obtained corresponding to the application.

A non-traditional simulation type is model-driven. Model-driven simulations use formal models developed within the simulation environment in order to emulate the behavior of an application. Essentially, these models produce output data that stimulates the components of the simulated system as if the real application were running. Though the development of the models can be very time consuming, depending on the complexity of the modeled application, once the model is developed it can be used within any system without any extra work.

In order to perform a trace-driven simulation, a representation of the application is necessary. Traces and profiles are two types of application representations that can be used to portray the behavior of a program [10]. Traces gather a chronology of events, such as computation or communication, as they occur in time, allowing a user to observe what a program is doing during a specific time period. Because traces are dependent on the execution time of a program, long-running programs can produce extremely large trace logs. These large logs can be impractical or even impossible to store depending on how much detail is recorded by the trace program. Profiles, by contrast, do not record events in time, but rather tally the events in order to provide useful statistics to the end user.
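To make the distinction concrete, the following minimal Python sketch reduces a small, made-up trace to a profile. The event names, record fields, and timings are illustrative assumptions; they are not output from, or the format of, any tool discussed in this document.

```python
from collections import defaultdict

# Hypothetical trace: one record per event, in chronological order.
# Each record is (start_time_s, event_name, duration_s); values are made up.
trace = [
    (0.00, "compute", 0.40),
    (0.40, "MPI_Send", 0.05),
    (0.45, "compute", 0.30),
    (0.75, "MPI_Recv", 0.10),
    (0.85, "MPI_Send", 0.05),
]

def profile_from_trace(events):
    """Tally a trace into a profile: per-event call count and total time."""
    profile = defaultdict(lambda: {"calls": 0, "total_s": 0.0})
    for _start, name, duration in events:
        profile[name]["calls"] += 1
        profile[name]["total_s"] += duration
    return dict(profile)

profile = profile_from_trace(trace)
for name, stats in sorted(profile.items()):
    print(f"{name}: {stats['calls']} calls, {stats['total_s']:.2f} s")
```

In this sketch the trace grows with the number of events, whereas the profile grows only with the number of distinct event types. The ordering of events is lost in the reduction, which is the information a trace preserves and a profile does not.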
The overhead incurred from the execution of extra code used to collect the profile or trace data can be quite similar, but is ultimately dependent on the level of detail profiled or traced as well as the application under study. While trace generation may impose penalties associated with the creation of large trace files depending on the frequency of disk access, it has the advantage that very little additional processing is needed as intermediate results are calculated. By contrast, profiling typically requires very little file access, but may demand the frequent calculation of intermediate results for the application profile. In essence, a trace provides raw data describing the execution of an application and a profile outputs a processed form of this data. Both profiles and traces can be useful tools in trace-driven simulation, depending on the type of simulation models used for the study and the amount of information that is desired. Further discussion on how trace and profile tools are leveraged within the FASE framework is provided in Chapter 3.

Traces and profiles can be collected for both sequential and parallel programs, though this study focuses on the latter. Various parallel programming models exist in order to facilitate computation across multi-process, multi-node systems. These models include shared address, message passing, and data parallel [11]. Due to its proliferation within the scientific community, this study focuses on the message passing paradigm. More specifically, we consider the Message Passing Interface (MPI) as the de facto standard for inter-process communication via message passing [12], [13]. The MPI standard defines a number of functions that allow application developers to pass information from a source process to one or more destination processes. The destination processes can be local to the sending process or can be located on a remote node requiring transactions across some interconnect. More details on the MPI standard and its implementation can be found in [12].

Predicting the performance of an arbitrary application executing on a specific system architecture has been a long-sought goal for application and system designers alike. Throughout the years, researchers have designed a number of prediction frameworks that
employ varying techniques and methodologies in order to produce accurate and meaningful results. The approach taken by each framework allows it to overcome certain obstacles or focus on specific application or system types while inevitably sacrificing some key feature such as accuracy, speed, or scalability. The following paragraphs briefly describe existing prediction environments similar to FASE and the methodologies and general limitations of each.

The Integrated Simulation Environment (ISE) [14], a precursor to FASE, employs hardware-in-the-loop (HWIL) simulation in which actual MPI programs are executed using real hardware coupled with high-fidelity simulation models. The computation time of each simulated node is calculated via a process spawned on a "process machine," while network models calculate the delays for the communication events. Though the ISE shows reasonable simulation time and accuracy, the scalability of the framework is limited due to the large number of processes needed to represent large-scale systems. The POEMS project [15] adheres to an approach similar to HWIL simulation in that systems can be evaluated based on results from actual hardware and simulation models as well as performance data collected from actual software and analytical models. The simulation environment used by POEMS integrates SimpleScalar with the COMPASS simulator [16], enabling direct execution-driven, parallel simulation of MPI programs. These high-fidelity simulators can provide very accurate and insightful results, though the simulations can require large amounts of time, especially when dealing with very large systems.

While ISE and POEMS use execution-driven simulations, other projects employ a model-driven approach. CHAOS [17] creates application emulators that reproduce the computation, communication, and I/O characteristics of the program under study. The emulators are then used to pass their information to a suite of simulators that, in turn, determine the performance of the modeled application on the target simulated system. The PACE [18] framework defines a custom language, called CHIP3S, that is used to generate a model describing a parallel program's flow control and workload.
Similarly, Performance Prophet [19] uses the Unified Modeling Language (UML) to model parallel and distributed applications. Both PACE and Performance Prophet use these models to drive systems composed of both analytical and simulation models in order to balance speed and accuracy. Due to the inherent difficulty of automatically creating an accurate application model, all three projects require intimate knowledge of the program being modeled. As a result, the source code must be analyzed and profiled to collect the necessary information to reconstruct the program's execution.

Work conducted by the Performance Evaluation Research Center (PERC) focuses on performance prediction of parallel systems under the assumption that an application's performance on a single processor is memory bound and the interconnect dictates the scalability of the program under study [20], [21]. The framework uses a three-step process:

1. Collect memory performance parameters of the considered machine
2. Collect memory access patterns independent of the underlying architecture on which the application is executed
3. Algebraically convolve the results from steps one and two, then feed the results to the DIMEMAS [22] network simulator developed at the European Center for Parallelism of Barcelona

The researchers of PERC have reported accurate performance predictions using a wide range of applications. However, collecting memory access patterns can be a very time consuming task that results in large amounts of data for each application considered. Also, the DIMEMAS network simulator is relatively simplistic and has the potential for large inaccuracies when analyzing communication-intensive applications with potential contention issues.

Performance prediction is the main theme of FASE and the frameworks mentioned above. However, performance analysis is an integral technique used to characterize the applications under study prior to simulation. The accuracy and detail of this characterization data can greatly influence the accuracy and speed of the simulation
frameworks that use it. While a great number of performance analysis tools exist for various purposes [23], SvPablo [24], Paradyn [25], and TAU [26] are the tools most applicable to the FASE simulation framework. SvPablo may be used to analyze the performance of MPI and OpenMP parallel programs; it allows for interactive instrumentation and correlates performance data with the original source code. Paradyn is also compatible with MPI programs and offers the advantage that no modifications to the source code are needed due to dynamic instrumentation at the binary level. Paradyn's focus is on the explicit identification of performance bottlenecks. TAU is an MPI performance analysis tool that can provide performance data on a thread-level basis, and provides the user with a choice of three different instrumentations. Many performance analysis tools, including SvPablo and TAU, are based on or support the popular Performance Application Programming Interface (PAPI), which is also supported by the Sequoia tool used by FASE. Each of these analysis tools can be incorporated into the pre-simulation stage of FASE in conjunction with or as a replacement for Sequoia to provide additional information on an application's behavior.

FASE combines many of these general techniques and methods in order to provide a robust, flexible performance prediction framework. In addition, FASE features an environment that allows users to build virtual representations of candidate architectures. These virtual prototypes capture structural dependencies such as network congestion and workload distribution that can greatly impact an application's performance. Many of the frameworks described in this section use simplistic communication models that have difficulties capturing such issues. Furthermore, FASE employs a systematic, iterative methodology that produces highly modular application characterization data and component models. The framework illustrates the relationships between the application performance information and the architectural models such that features and mechanisms within the models can be identified and altered to improve prediction accuracy and simulation speed. The other frameworks propose using models of various
fidelities in order to speed up the simulations; however, they do not explicitly describe the decision-making process of choosing or designing a model's fidelity. Finally, FASE provides a fully extensible pre-built model library with components ranging from the application layer down to the hardware layer. Unlike many of the frameworks described above, FASE includes a number of detailed middleware models that can have significant impacts on the performance of the overall system. Many of the pre-built models are highly tunable, thus allowing a single, generic model to represent many different implementations based on the parameter values set by the user. The combination of these features makes FASE a powerful, flexible environment for rapidly evaluating applications executing on a variety of candidate systems.
CHAPTER 3
FAST AND ACCURATE SIMULATION ENVIRONMENT (PHASE 1)

The Fast and Accurate Simulation Environment (FASE) is a framework designed to facilitate system design according to the needs of one or more key applications. FASE provides a methodology and corresponding toolkit to evaluate the performance of virtually-prototyped systems to determine, in a timely and cost-effective manner, the ideal system configuration for a specific set of applications. In order to promote quick and modular predictions, the FASE framework is broken into two primary domains: the application and simulation domains.

In the application domain, FASE employs various tools and techniques to characterize the behavior of an application in order to create accurate representations of its execution. The information gathered is used not only to identify and understand the characteristics of the various phases inherent to the application, but also to generate the stimulus data to drive the simulation models. The characterization data can be collected using one or more tools depending on the application, the capabilities of the employed tools, and the simulation models used during simulation. Once the data is collected, it can be used in numerous simulations without any modifications, thus facilitating the exploration of various system configurations. More details on the application domain are provided in Section 3.1.

The simulation domain incorporates the design, development, and analysis of the virtually-prototyped systems to be studied. In this domain, component models are designed and validated in order to create systems that incorporate new or emerging technologies. To ease system development, FASE provides a library of pre-constructed models tailored to accommodate the design of HPC environments. Once a system has been constructed, characterization data from any number of applications can be used as stimulus to the simulation, thus allowing rapid analyses of the system under varying workloads. More information on the various aspects of the simulation domain is provided in Section 3.2.
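The reuse of characterization data across candidate systems can be illustrated with a deliberately simplified Python sketch. The stimulus format, the three system configurations, and the contention-free latency-plus-bandwidth communication cost are all assumptions chosen for illustration; they are not FASE's actual data format or component models.

```python
# Hypothetical stimulus for one process: an ordered list of events.
# ("compute", seconds measured on the instrumentation platform) or
# ("send", message size in bytes). All values are made up.
stimulus = [
    ("compute", 2.0),
    ("send", 1_000_000),
    ("compute", 1.0),
    ("send", 4_000_000),
]

# Hypothetical candidate systems; every parameter value is made up.
systems = {
    "baseline": {"cpu_speedup": 1.0, "latency_s": 50e-6, "bandwidth_Bps": 100e6},
    "fast_net": {"cpu_speedup": 1.0, "latency_s": 5e-6, "bandwidth_Bps": 1e9},
    "fast_cpu": {"cpu_speedup": 2.0, "latency_s": 50e-6, "bandwidth_Bps": 100e6},
}

def predict_time(events, system):
    """Replay the same stimulus on one system and return predicted seconds."""
    total = 0.0
    for kind, value in events:
        if kind == "compute":
            # Scale measured compute time by the relative processor speed.
            total += value / system["cpu_speedup"]
        elif kind == "send":
            # Contention-free cost: latency plus size over bandwidth.
            total += system["latency_s"] + value / system["bandwidth_Bps"]
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return total

predictions = {name: predict_time(stimulus, cfg) for name, cfg in systems.items()}
for name, seconds in predictions.items():
    print(f"{name}: {seconds:.4f} s")
```

The stimulus is collected once and never modified; only the system description changes between runs. A real study would replace the contention-free cost used here with simulation models that capture effects such as network contention.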
Figure 3-1 illustrates a high-level representation of the process associated with the FASE framework. The dark gray blocks represent steps in the application domain while the simulation steps are denoted by white blocks. Notice that a user can work in both domains concurrently. Also note that the framework incorporates multiple feedback paths that allow the user to follow an iterative process by which insight is gained through application characterization and simulation, and used to refine the models and application analysis data employed for future iterations. Section 3.2.3 contains further details on how an iterative methodology may be employed in FASE.

Figure 3-1. High-level data-flow diagram of FASE framework

3.1 Application Domain

The application domain is a critical part of the FASE framework. In this domain, important information is gathered that provides insight on the behaviors of an application during its execution. The main goal within the application realm is to gather enough information about an application so that systems in the simulation environment are stimulated as if they were really running the code. As such, this domain is decomposed into two main stages: (1) application characterization and (2) stimulus development. The
application characterization stage employs analysis tools to collect pertinent performance data that illustrates how the application exercises the computational subsystems. The data that can be collected includes communication information, computation information, memory accesses, and disk I/O. This data can then be used directly or processed and analyzed during the stimulus development stage. In the stimulus development stage, raw data gathered during characterization is used to provide valid input to the simulation models such that the simulated system's components are exercised as if the real program were executing on it. More details on both stages, as well as the various options available in each, are provided in the following sections.

3.1.1 Application Characterization

Application characterization is a vital step in FASE that enables accurate predictions of the performance of an application executing on a target system. The goal of characterization is to identify and track the performance-critical attributes that dictate an application's performance based on both application and target-system parameters. FASE provides a framework in which users can analyze their applications using existing systems (also known as instrumentation platforms) in order to prepare for simulation. The basic methodology by which the user analyzes the application is shown in Figure 3-2. The tools employed in each iteration initially depend on the user's experience with the application and can then change based on the results from the previous iteration of analysis. The selected tools should be capable of capturing the inherent qualities of the application while minimizing the collection of information resulting from dependencies on the underlying architecture of the instrumentation platform. Perturbation (i.e., the additional overhead imposed on the system due to instrumentation) should also be considered to ensure data accuracy. Though characterization is part of the application domain, the simulation models should also be considered during tool selection. For example, if the processor model to be used in simulation supports only instruction-based information, then the analysis tools selected should provide at least that information for that particular model.
Multiple tools can be used based on the details they provide, but output data must be converted to a common format in order to drive the simulation models.

Figure 3-2. The FASE application characterization process

FASE incorporates a feedback loop (Loop 1 in Figure 3-1) at the characterization stage such that multiple characterizations may be performed to first understand the main bottleneck of the program (e.g., processor, network, memory, disk I/O, etc.), and then focus on the collection of information that characterizes the main bottleneck while abstracting the components of lesser impact. This data can then be fed into the simulation environment and analyzed until the system exposes a different component as the bottleneck. If desired, the characterization data and models can then be switched or adjusted to incorporate the appropriate information and components to capture the nature of the new bottleneck. In fact, the performance data can provide all the
necessary information for applications with an arbitrary bottleneck, while the simulation models incorporate abstraction by only using the data that corresponds to capabilities supported in their designs. Further details on the simulation phase and how application characterization influences design decisions are described in the next section.

The initial deployment of FASE employs a single analysis tool called the Sequoia Toolkit, developed at Oak Ridge National Laboratory. Sequoia is a trace-based tool that supports the analysis of C and FORTRAN applications using the MPI parallel programming model, the current de facto standard for large-scale message-passing platforms [28]. Instrumentation is conducted at link time by using the profiling MPI (PMPI) wrapper functions. PMPI is defined in the MPI standard and provides an easy interface for profiling tools to analyze MPI programs [12]. Therefore, a Sequoia user need only rebuild the code by linking the Sequoia library, which simplifies the data collection process. Though not required, Sequoia also supports additional functions that can be manually inserted into the application to start or stop data collection as well as denote various phases within the code to facilitate analysis.

The Sequoia Toolkit explicitly supports the logging of communication and computation events. A communication event in Sequoia is defined as any MPI function encountered during the execution of the code. The tool collects relevant information such as source, destination, and transfer size for all key MPI communication functions and also logs important non-communication functions (e.g., MPI_Topology, MPI_Comm_create). Collecting communication events at the MPI level inherently isolates the characterization data from the underlying network of the instrumentation platform, thus allowing the data to be used on a variety of simulated systems that employ different interconnect technologies. Network topology dependencies are also removed during characterization since network transfers between machines are captured as high-level semantics representing process-to-process communications rather than architecturally-dependent characteristics such as latency and bandwidth.
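To make the idea of a network-independent communication event concrete, the following Python sketch shows one way such a record might be represented and summarized. The record layout, the field names, and the helper function are assumptions made for illustration; they are not the actual Sequoia trace format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CommEvent:
    """One logged MPI call. Hypothetical layout, not the real Sequoia format."""
    mpi_function: str   # name of the MPI function that was intercepted
    source: int         # rank that issued the transfer
    destination: int    # rank that receives the transfer
    size_bytes: int     # transfer size in bytes


def bytes_sent_per_rank(events: List[CommEvent]) -> Dict[int, int]:
    """Sum the bytes each source rank sends across all logged events."""
    totals: Dict[int, int] = {}
    for event in events:
        totals[event.source] = totals.get(event.source, 0) + event.size_bytes
    return totals


# Illustrative trace: rank 0 sends two messages, rank 1 sends one.
example_trace = [
    CommEvent("MPI_Send", 0, 1, 4096),
    CommEvent("MPI_Send", 0, 2, 8192),
    CommEvent("MPI_Send", 1, 0, 1024),
]
```

Note that the record carries only process-to-process semantics (who sent how much to whom). Nothing in it refers to latency, bandwidth, or topology, which is why the same trace could in principle be replayed against different simulated interconnects.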
Computation events occur between communication events. Sequoia supports two mechanisms that measure computation statistics during an application's execution: timing functions and the Performance Application Programming Interface (PAPI). PAPI is an API from the University of Tennessee that provides access to hardware counters on a variety of platforms that can be used to track a wide range of low-level performance statistics such as number of instructions issued, L1 cache misses, and data TLB misses [29]. Every logged event, whether computation or communication, includes both wall-clock and CPU-time measurements. With PAPI enabled, computation events include additional performance information on clock cycles, instruction count, and the number of loads, stores, and floating-point operations executed. Sequoia does not explicitly support the collection of I/O data. However, rough estimates can be calculated by comparing wall-clock and CPU times for each computation event.

The characterization stage does suffer from one inherent problem that must be addressed in order to provide an environment to predict the performance of large-scale systems. This issue arises due to the restrictions placed on the user by the physical testbed used to collect characterization data. For example, to collect accurate information about an application running on a 1024-node system, a 1024-node testbed must be available to the user. Of course, not all users have access to systems with thousands of nodes for evaluation purposes, though they may be interested in observing the execution of their applications on these larger systems. In order to overcome this limitation, FASE incorporates two techniques: the process method and the extrapolation method.

The process method allows each physical node in the instrumentation platform to run multiple processes and thus gather characterization information for multiple simulated nodes. The downside to this approach is resource contention at shared components that can lead to inaccurate representations of the application's execution. Also, a single node can only support a limited number of processes. This limitation is encountered when OS or middleware restrictions are met or when the node becomes so bogged down that the
application cannot finish within a reasonable amount of time. Initial tests show that the process method produces communication events identical to those found using the traditional approach. However, computation events can suffer from large inaccuracies due to memory and cache contention issues, especially in tests using many processes per node with larger data sets. The experiments conducted in Section 3.3 employ traces collected using the traditional approach; however, research is currently underway to remedy the inaccuracies of the process method in order to facilitate large-scale system evaluation.

The extrapolation method observes trends in the application's behavior while changing system size and data set size, and then formulates a rough model of the application based on the findings. The model describes the communication, computation, and other behaviors of the application using a high-level language. The language can then be read by an extrapolation program to produce traces for an arbitrary system size and application data set size. Details on extrapolating communication patterns of large-scale scientific applications can be found in [30]. This approach supports accurate generation of traces and does not suffer from the limitations of the process method, though it can be quite difficult to determine the trends of an application, especially when dealing with applications that behave dynamically based on measured values. Although many more issues can arise when using the extrapolation-based approach, this topic is out of the scope of this research.

3.1.2 Stimulus Development

After an application has been characterized, the information collected is used to develop stimulus data used as input to the simulation stage. FASE supports three methods of providing input to simulation models. These methods include a trace-based approach, a model-driven approach, and a hybrid approach. The exact method employed is left to the user, though the selection should depend on the type of application under study, the amount of effort that can be afforded, and the amount of knowledge gathered on the internals of the application. Details on each method are provided below.
The trace-based approach is the quickest and most automated method available to the FASE user. The method can use either raw or processed performance data collected during the characterization stage according to the type of information required by the simulation models. However, the trace-based approach does place some restrictions on the user. First, the simulation environment must have a trace reader that is capable of translating the performance information into data structures native to the simulation environment. This restriction requires a common format to which all performance data must conform. Therefore, if multiple tools are employed to gather characterization data, their outputs must be merged and converted to a common format that is supported by the trace reader. The second issue is that trace data must be collected for each system and data set size under consideration. As system and data set sizes increase, the trace data from complex applications could potentially require extremely large amounts of storage space, and thus care must be taken to keep trace files manageable. In the current version of FASE, this limitation can be alleviated by collecting data for only certain regions of code or a limited number of iterations through the use of specific instrumentation constructs supported by Sequoia.

The model-driven approach requires much more manual effort by the user than the trace-based approach. This method uses a formal model of the application's behavior based either on a thorough analysis of characterization data collected while varying system and data set size or on source code analysis. The developed models have the capability of reproducing the behaviors of complex, adaptive applications that cannot be captured using the trace-based approach. In general, this approach begins by identifying key application parameters that affect its performance. The next step is to ascertain the parameters having the greatest impact on performance and then determine the various component models the application will exercise during execution. Once these steps are complete, the actual model is developed such that it executes the correct computation, communication, and other events based on the behaviors discovered during
characterization. The actual type of model employed in this approach is not limited by the FASE framework. Markov chains, stochastic and analytical models, and explicit simulative models are a few model types that can be used within FASE as long as they can interface with the simulation environment.

The last approach supported by FASE is the hybrid approach. In this approach, a mix of trace and model-driven stimulus is used in order to combine the accuracy and ease of use of trace-based simulations with the flexibility and dynamism of the model-driven approach. In this method, the application is characterized at a very high level to identify structured and dynamic areas of code. The structured areas are used to generate trace data while small-scale formal models are employed to represent the dynamic areas. This mixture of techniques decreases the amount of trace data needed, reduces the amount of effort required to formulate formal models, and maintains relatively accurate representations of the application's behavior.

The initial deployment of FASE uses the trace-based approach as its primary stimulus. The pre-built FASE model library includes a Sequoia trace reader that translates Sequoia data into the necessary data structure in the simulation environment. Though both model-driven and hybrid approaches are defined within the FASE framework, this phase of research focuses on simulations conducted using only the trace-based approach.

3.2 Simulation Domain

The simulation domain consists of three stages: (1) component design, (2) system development, and (3) system analysis. The first of the three stages involves the creation of the necessary components used to build the systems under study. This stage can be particularly time-consuming depending on the complexity of the component as well as the level of fidelity; however, it is a one-time penalty that must be paid to gain the benefits of simulation. The initial release of FASE includes several pre-built models of common components (detailed in Section 3.2.1) to aid users in this process, and more will be added
in the future. The next step in the simulation domain is the development of the candidate systems. The process of constructing virtual systems typically requires less time than component design, although construction time normally increases with system size and complexity. Similar to stage one, the overhead of building a system need only be paid once, after which numerous applications and configurations can be analyzed using the system. Finally, the third stage allows us to reap the benefits of the FASE framework and process. System analysis uses the components and systems constructed in stages one and two, and the application stimulus data from the application domain, in order to predict the performance of an application on a configured system. Since many variations of systems are likely to be analyzed, this stage is assumed to be the most time-sensitive. In the following subsections, we discuss each of the three simulation stages in more detail.

3.2.1 Component Design

Each component that is of interest to the user must first be designed and developed in the simulation environment of choice. The models must not only represent the behavior of the component, but also correspond to the level of detail provided by the characterization tools and the abstraction level to which the application lends itself. In most cases, certain parts of a component will be abstracted, while other parts that are known to affect performance will be modeled in more detail. The decision of where to add fidelity to components and where to abstract to save development and simulation time should be based on trends and application attributes discovered during characterization.

The components designed should incorporate a variety of key parameters that dictate their behavior and performance. An important step in the component design is tweaking these parameters to accurately portray the target hardware. The actual values supplied to the models should be based on empirical data collected using similar hardware or on the predicted performance if the components are future technologies. For example, the network and middleware models shown in Table 3-1 were validated according to real hardware in the High-performance Computing and Simulation (HCS) Research Lab at the
University of Florida. The experimental setup and results for these validation tests are presented in Section 3.3.

The FASE development environment uses a graphical, discrete-event simulation tool called Mission Level Designer (MLD) from MLDesign Technologies [27]. Its core components, called primitives, perform basic functions such as arithmetic and data flow control. The behaviors exemplified by each primitive are described using the object-oriented C++ language to promote modularity. More complex functions are commonly realized by connecting multiple primitives such that data is manipulated as it flows from one primitive to another. Alternatively, users may write customized primitives to provide the equivalent functionality. MLD was selected as the simulation tool for FASE for three main reasons. First, it is a full-featured tool that supports component and system design as well as the capabilities to simulate the developed systems, all through a GUI. Second, MLD supports various design features that facilitate quick design times even for very complex systems. Finally, the authors have much experience using the tool and many models have been created outside of FASE that can be imported with little or no modification. Although FASE currently uses MLD as its simulation environment of choice, it may be adapted to support additional simulation environments in the future.

A wide range of pre-constructed models populate the initial FASE library in order to provide a starting point for users. Each model was designed and developed based on the hardware or software it represents through the use of technical details provided by corresponding standards and other literature. The fidelity of each model in the pre-built library corresponds to the current HPC focus of the initial deployment of FASE as well as the capabilities of Sequoia. As a result, the library incorporates high-fidelity network and communication middleware models to capture scalability characteristics while providing lower-fidelity models for components such as the CPU, memory, and disk. Table 3-1 highlights the more important component-level models currently populating the pre-built FASE library. It is worth noting that a variety of components not listed in Table
3-1 can be developed using those that are listed. For example, an explicit multicore CPU model does not currently exist in the FASE library. However, by combining two CPU models, a shared memory model, and two trace files, one can analyze the performance of an application running on a multicore machine with little effort.

The network models listed in Table 3-1 share similar characteristics. Each model receives high-level data structures that define the various parameters required to create and output one or more network transactions between multiple nodes. Each network model also has numerous user-definable parameters such as link rate, maximum data size, and buffer size that dictate the performance of communication events. Furthermore, the models include a number of parameters that define the capabilities of the subsystems that supply the network interfaces with the necessary data to be transferred. For example, the InfiniBand model incorporates the parameters LocalInterconnectLatency and LocalInterconnectBW to define the latency and bandwidth of the interconnect between host memory and the InfiniBand host channel adapter (HCA). These parameters are used to calculate the performance penalties incurred from transferring data from memory to the HCA. These calculations effectively abstract away the complex behaviors of the underlying transfer mechanisms while still accounting for their performance impacts.

The middleware models in Table 3-1 provide the performance-critical capabilities of the protocol each represents. The TCP model is a single, generic model with a variety of parameters that enable a user to configure it as a specific implementation. Similarly, the MPI model also incorporates many parameters so that particular implementations can be represented using a single model. The MPI layer is modeled using two layers such that the general, high-level functionality of the MPI protocol forms the network-independent layer while the second layer employs interface models that translate MPI data into network-specific data structures. This layered approach allows a common interface to be used in all systems featuring MPI while providing "plug-and-play" capabilities to support MPI transfers over various interconnects.
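As an illustration of how parameters such as LocalInterconnectLatency and LocalInterconnectBW could abstract the memory-to-HCA transfer, the Python sketch below uses the simplest plausible cost model: a fixed latency plus a size-proportional term. The text does not give the exact formula used inside the FASE InfiniBand model, so this formula, the function name, and the sample values are assumptions.

```python
def local_transfer_penalty(size_bytes: float,
                           latency_s: float,
                           bandwidth_bytes_per_s: float) -> float:
    """Estimate the time to move size_bytes from host memory to the HCA.

    Assumed model (not taken from the FASE source): a fixed startup
    latency plus the payload size divided by the interconnect bandwidth.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical values: 1 MB payload, 1 microsecond latency, 1 GB/s link.
penalty = local_transfer_penalty(1_000_000, 1e-6, 1e9)
```

A two-parameter model of this kind hides the details of the underlying transfer mechanism while still charging each communication event a cost that grows with message size, which matches the abstraction described above.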
Table 3-1. The FASE component library

Class              Type                   Model name                    Fidelity  Description
Networks           InfiniBand             Host Channel Adapter (HCA)    High      Conducts IB protocol processing on incoming and outgoing packets for IB compute nodes
                                          Switch                        High      Device supporting cut-through and store-and-forward routing using crossbar backplane
                                          Channel Interface             Medium    Dynamic buffering mechanism
                   Ethernet               Network Interface Card (NIC)  Medium    Conducts Ethernet protocol processing on incoming and outgoing frames for Ethernet compute nodes
                                          Switch                        High      Device supporting cut-through and store-and-forward routing using crossbar or bus backplane
                   SCI                    Link Controller               High      Conducts SCI protocol processing on incoming and outgoing packets for SCI compute nodes
Middleware         IP                     IP Interface                  Low       Handles IP address resolution
                   TCP                    TCP Connection                High      Provides reliable sockets between two devices
                                          TCP Manager                   High      Manages TCP connections to ensure that the correct socket receives its corresponding segments
                   MPI                    MPICH2                        High      Provides MPI interface using TCP as the transport layer
                                          MVAPICH                       High      Provides MPI interface for InfiniBand
                                          MP-MPICH                      High      Provides MPI interface for SCI
Processors         Generic Processor      Generic Processor             Low       Supports timing information to model computation
Operating Systems  Generic OS             Generic OS                    Low       Supports some memory management capabilities of the OS
Memories           Generic Memory         Generic Memory                Low       Models read and write accesses based on data size
Disks              Generic Disk           Generic Disk                  Low       Models read and write accesses based on data size
Exotics            Reconfigurable Device  Reconfigurable Device         Medium    Models a specialized coprocessor (e.g., FPGA) that computes application kernels
3.2.2 System Development

After component development, systems must be created to analyze potential configurations. The systems developed should correspond to the demands of the applications as discovered via characterization. FASE provides the capability not only to change the system size and the components in the system, but also to tweak component parameters such as the network's latency and bandwidth, middleware delays, and processing capabilities. This feature allows the user to scrutinize the effects of configuration changes ranging from minor system upgrades to complete system redesign using exotic hardware.

Scalability issues in this stage are dependent on the simulation environment rather than the application. Timely and efficient development of a massively parallel system in a given simulation environment can quickly become an issue as system sizes scale to very high levels, and the creation of systems with thousands of nodes can become an almost unwieldy task. Since FASE is focused on rapid analysis of arbitrary systems, it must address this issue. FASE supports the creation of large systems in several ways. First, the MLD simulation tool supports hierarchical, XML-based designs such that a single module can encapsulate multiple compute nodes, yet the simulator still maintains the fidelity of the underlying components and includes their effect in the analysis. Systems are created using a graphical interface, which is automatically translated into XML code describing system-level details such as the number and type of each component and how they are interconnected. In addition, MLD supports dynamic instances: if a model is created according to certain guidelines, a single block can represent a user-specified number of identical components. Finally, a more advanced method of large-scale system creation can use hierarchical simulations rather than hierarchical models within a single simulation, where small-scale systems are built and analyzed with the intent of producing intermediate traces to stimulate a higher-level simulation in which multiple nodes from each small-scale system then act as a single node in the large-scale system. This method has the potential not only to speed up development times for these systems, but also to
reduce simulation runtimes. While this technique has not yet been employed, we plan to examine its potential benefits in future research to improve the scalability of the FASE simulation environment.

3.2.3 System Analysis

After the application has been thoroughly analyzed and the components and initial systems have been developed, the user can begin analyzing the performance of various applications executing on the target systems. The stimulus data from the application domain is used as input to the simulation models in order to induce network traffic and processing. One powerful feature of FASE is its ability to carry out multiple simulation runs using different system configurations, all based on the same set of stimulus data. Therefore, the additional time spent in application characterization allows the system analysis to proceed much more quickly.

During simulation, statistics can be gathered on numerous aspects of the system such as application runtime, network bandwidth, average network latency, and average processing time. In addition to the "profiling"-type data that is collected, it may also be desirable to collect "traces" from the simulation. These traces differ from the stimulus traces in that they are more architecture-dependent and less generic, but they serve a common purpose: giving the user further insight into the performance of the application under study. The breadth and depth of performance statistics and results collected during simulation will determine the level of insight available post-simulation, and the results collected should be tailored towards the needs of the user. However, the type and fidelity of results collected may also negatively impact simulation runtime, so the user should be careful not to collect an excessive amount of unnecessary result data. The simulation environment may also be tailored to output results in a common format, such that they may be viewed in a performance visualization tool such as Jumpshot [31].

In some cases, results obtained during system analysis can lead to additional insight in terms of the application bottleneck, requiring re-characterization of the application as
shown in feedback loop 4 in Figure 3-1. In this situation, the steps from the application domain are repeated, followed by possible additional component and system designs, and repeated system analysis. Traces collected during simulation may also potentially be used to drive future simulations in this iterative process to solve an optimization problem and determine the ideal system configuration.

3.3 Results and Analysis

This section presents results and analysis for experiments conducted using FASE. In each of the following subsections, we introduce the experimental setup used to collect stimulus data (i.e., traces) and experimental numbers against which the simulation results are compared. The first subsection presents the calibration procedures followed to validate the three main network models in the current library: InfiniBand, TCP over IP over Ethernet, and the direct-connect network based on the Scalable Coherent Interface (SCI) protocol [32]. In each case, the appropriate MPI middleware layer is also incorporated on each interconnect in the modeling environment. Section 3.3.2 presents a simple scalability study for a matrix multiply benchmark using the aforementioned interconnects. Finally, Section 3.3.3 showcases the features and capabilities of FASE through a comprehensive scalability analysis of the Sweep3D application from the ASCI Blue Benchmark suite.

3.3.1 Model Validation

In order to test and validate the FASE construction set, the MLD models were first calibrated to accurately represent some prevalent systems available in our lab. Validation of the network and middleware models was conducted using a testbed consisting of 16 dual-processor 1.4 GHz Opteron nodes, each having 1 GB of main memory and running the 64-bit CentOS Linux variant with kernel version 2.6.9-22. The nodes employed the Voltaire HCA 400 attached to the Voltaire ISR 9024 switch for 10 Gbps InfiniBand connectivity, while Gigabit Ethernet was provided using integrated Broadcom BCM5404C LAN controllers connected via a Force10 S50 switch. The direct-connect network model was calibrated to represent SCI hardware supplied by Dolphin Inc. A simple PingPong
MPI program that measures low-level network performance was used to calibrate the models to best represent the 16-node cluster's performance over a specific interconnect. Three MPI middleware layers were modeled: MVAPICH-0.9.5, MPICH2-1.0.5, and MP-MPICH-1.3.0 for InfiniBand, TCP, and SCI, respectively. Figures 3-3 through 3-8 show the experimentally gathered network performance values of the InfiniBand, TCP/IP/Ethernet, and SCI testbeds compared to those produced by the simulation model. The performance of each configuration closely matches that of the testbed, and the average error between the experimental tests and simulative models was 5% for InfiniBand, 3.6% for TCP/IP/Ethernet, and 2.7% for SCI. Throughput calculated from the PingPong benchmark latencies for message sizes up to 32 MB shows that the simulative bandwidths closely follow the measured bandwidths and have an average error roughly equal to that found in the latency experiments. These results show that the component models are highly accurate when compared to the real systems, but it is worth noting that dips in the measured throughput are readily apparent at 4 MB and 256 KB for the InfiniBand and TCP/IP/Ethernet networks, respectively. The decreases in bandwidth are due to overheads incurred in the software layers as the software employs different mechanisms to accommodate the larger-sized transfers, and the corresponding models abstract these throttling points with the goal of best matching the overall trend.

(a) Small message sizes   (b) Large message sizes
Figure 3-3. InfiniBand model latency validation
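The two quantities reported in the validation above, throughput derived from PingPong latency and the average error between measured and simulated values, can be computed along the following lines. This Python sketch assumes throughput is the message size divided by the one-way latency and that average error is the mean absolute percentage error relative to the measured value; neither formula is spelled out in the text, and the function names are illustrative.

```python
from typing import Sequence


def throughput_mbps(size_bytes: float, one_way_latency_s: float) -> float:
    """Throughput in megabits per second for one message (assumed formula)."""
    if one_way_latency_s <= 0:
        raise ValueError("latency must be positive")
    return size_bytes * 8 / one_way_latency_s / 1e6


def mean_percent_error(measured: Sequence[float],
                       simulated: Sequence[float]) -> float:
    """Mean absolute error of simulated vs. measured values, in percent."""
    if len(measured) != len(simulated) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(m - s) / m * 100.0 for m, s in zip(measured, simulated)]
    return sum(errors) / len(errors)
```

With made-up inputs, a 1,000,000-byte message delivered in 10 ms corresponds to roughly 800 Mbps, and measured latencies of 10 and 20 units against simulated latencies of 9 and 21 units give a mean error of about 7.5%.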
Figure 3-4. InfiniBand model throughput validation

(a) Small message sizes   (b) Large message sizes
Figure 3-5. The TCP/IP/Ethernet model latency validation

Figure 3-6. The TCP/IP/Ethernet model throughput validation

3.3.2 System Validation

After validating the network and middleware models, we proceeded to examine the accuracy and speed of FASE using a simple benchmark: matrix multiply. The selected
implementation took a master/worker approach with the master node transmitting a segment of matrix A and the entire matrix B to the corresponding workers in sequential order and then receiving the results from each node in the same order. Measurements were collected using the InfiniBand and Gigabit Ethernet models for system sizes of 2, 4, and 8 nodes and data set sizes of 500 × 500, 1000 × 1000, 1500 × 1500, and 2000 × 2000. Each data element is a 64-bit, double-precision, floating-point value. The following paragraph outlines the procedure taken to conduct the analysis of the matrix multiply using the FASE methodology and corresponding toolkit.

(a) Small message sizes   (b) Large message sizes
Figure 3-7. The SCI model latency validation

Figure 3-8. The SCI model throughput validation

First, the matrix multiply code was instrumented by linking the Sequoia library, and the resulting binary was executed for each combination of system and data set size. Trace files for each combination were automatically generated by the Sequoia instrumentation code during execution and served as the stimulus to the
simulation models. The component models from the FASE library were used to create six systems, one for each combination of system size (three) and network (two) analyzed. This step was conducted while collecting characterization data from the matrix multiply. After the trace files were collected and the systems built, each system was simulated using the corresponding trace files for each system size. It should be noted that for this particular experiment, the Sequoia traces were collected using the testbed's InfiniBand network, though simulations were run using both InfiniBand and Gigabit Ethernet networks in order to show the portability of the traces and highlight the flexibility of FASE. Finally, experimental runtimes were measured for each system and data set size running on both networks in order to determine the errors associated with the simulated systems. Table 3-2 presents the experimental and simulative results for the various networks.

Table 3-2. Experimental versus simulation execution times for matrix multiply

                     InfiniBand                   Gigabit Ethernet
System  Data    Exp.     Sim.     Error      Exp.     Sim.     Error
size    size    (sec)    (sec)               (sec)    (sec)
2       500     3.32     3.33     0.21%      3.42     3.38     1.24%
        1000    49.45    49.40    0.09%      48.81    49.58    1.58%
        1500    187.40   187.03   0.20%      187.50   187.43   0.04%
        2000    459.36   458.80   0.12%      460.48   459.50   0.21%
4       500     1.07     1.06     1.48%      1.16     1.12     3.12%
        1000    16.69    16.54    0.93%      16.63    16.76    0.79%
        1500    62.87    62.49    0.61%      63.29    63.06    0.37%
        2000    153.82   153.19   0.41%      154.43   154.02   0.26%
8       500     0.52     0.46     12.70%     0.62     0.58     7.60%
        1000    7.23     7.09     2.66%      7.71     7.50     2.69%
        1500    27.51    26.93    2.12%      28.24    27.94    1.06%
        2000    66.91    65.90    1.52%      68.18    67.60    0.86%

From Table 3-2, one can see that the simulations closely matched the experimental execution times of the matrix multiply. The maximum error, 12.7%, occurred at smaller data set sizes with larger systems due to shorter runtimes that are more strongly affected
by any timing deviations from various anomalies such as OS task management, dynamic middleware techniques, etc. The effects of such anomalies are normally amortized when analyzing the typical, long-running HPC application.

The simulation times for each run of the matrix multiply were also collected in order to quantify the slowdown of using simulation versus real hardware, thus highlighting the "fast" portion of FASE. Table 3-3 shows that the ratios of simulation to experimental wall-clock times are very low, and in some cases (e.g., small systems with large data sizes) the simulation actually completes faster than the hardware, as indicated by a ratio less than one. The ratios less than one are directly related to the amount of characterization information collected as well as the high level of abstraction of computation events in the simulation models. In the case of the matrix multiply, computation was abstracted through the use of timing and this information was fed into a low-fidelity processor model, thus accommodating short simulation times. As system size and problem size scale, more time is spent triggering high-fidelity network models, thus slowing the simulations. However, the wall-clock simulation times observed within FASE are orders of magnitude faster than cycle-accurate simulators, where a 1000× or greater slowdown in wall-clock execution time is common.

3.3.3 Case Study: Sweep3D

Now that the system models have been validated using the matrix multiply benchmark, they can be used to predict the performance of any application executing on them, provided that the proper characterization and stimulus development steps have been conducted. In order to display the full capabilities and features of FASE, a more complex application was selected. The Sweep3D algorithm forms the foundation of a real Accelerated Strategic Computing Initiative (ASCI) application and solves a 1-group, time-independent, discrete ordinate, 3D Cartesian geometry neutron transport problem [33], [34]. As shown in Figure 3-9, each iteration of the algorithm involves two main steps. The first step solves the streaming operator by "sweeping" each angle of the Cartesian
geometry using blocking, point-to-point communication functions, while the second step uses an iterative process to solve the scattering operator employing collective functions. There are numerous input parameters that may be set, including the number of processing elements in the X and Y dimensions of the logical system as well as the number of grid points (i.e., double-precision, floating-point values) assigned to the X, Y, and Z dimensions of the data cube. The default data set sizes supplied with Sweep3D are 50×50×50 and 150×150×150, though the experiments in this section will also explore an intermediate data set, 100×100×100.

Table 3-3. Ratio of simulation to experimental wall-clock execution time

                       InfiniBand                    Gigabit Ethernet
System  Data     Exp.     Sim.     Ratio       Exp.     Sim.     Ratio
size    size     (sec)    (sec)                (sec)    (sec)
2       500      3.32     5.58     1.68        3.42     6.21     1.82
        1000     49.45    19.42    0.39        48.81    24.40    0.50
        1500     187.40   42.71    0.23        187.50   54.60    0.29
        2000     459.36   75.10    0.16        460.48   96.40    0.21
4       500      1.07     9.62     8.96        1.16     11.20    9.64
        1000     16.69    33.44    2.00        16.63    44.10    2.65
        1500     62.87    73.28    1.17        63.29    99.10    1.57
        2000     153.82   131.05   0.85        154.43   174.00   1.13
8       500      0.52     17.92    34.22       0.62     22.60    36.17
        1000     7.29     60.25    8.27        7.71     88.70    11.52
        1500     27.51    129.06   4.69        28.24    195.00   6.92
        2000     66.91    230.99   3.45        68.18    357.00   5.24

In this section, we present a set of experiments designed to quantify the accuracy and speed of the FASE simulation environment with respect to the Sweep3D application. The first experiment illustrates the accuracy of FASE by comparing experimentally measured execution times of Sweep3D versus the times produced by corresponding simulated systems. Experiment 2 analyzes the speed of simulations conducted using FASE and demonstrates how the speed scales with system size. The section concludes with a final experiment that showcases the full potential of the FASE
framework by providing a detailed simulative analysis of the Sweep3D application running on systems with various sizes, interconnect technologies, topologies, and middleware attributes.

Figure 3-9. Sweep3D algorithm

3.3.3.1 Experiment 1: Accuracy

The first set of experiments performed using Sweep3D was very similar to those from the last section. The main difference, however, is that the system under study for these experiments leverages the abilities of FASE to simulate heterogeneous components. This system is a heterogeneous, 64-node Linux cluster that features four types of computational nodes as listed in Table 3-4.

Table 3-4. Compute node specifications for each cluster in heterogeneous system

           Node count  Processor                 Memory        OS        Kernel
Cluster 1  10          1.4 GHz Opteron           1 GB DDR333   CentOS    2.6.9-22
Cluster 2  14          3.2 GHz Xeon with EM64T   2 GB DDR333   CentOS    2.6.9-22
Cluster 3  30          2.0 GHz Opteron           1 GB DDR400   CentOS    2.6.9-22
Cluster 4  10          2.4 GHz Xeon              1 GB DDR266   Redhat 9  2.4.20-8

All measurements and Sequoia traces were gathered over the Gigabit Ethernet interconnect. Figure 3-10 shows a comparison of the execution times from the physical and simulated systems while increasing the system and data set sizes. Table 3-5 displays
the errors between experimental and simulative execution times. In this experiment, we observed slightly higher error rates than in the matrix multiply benchmark and network validation tests, but this trend is to be expected considering the increased complexity of the Sweep3D application and the heterogeneous system used for the study. In all but five cases, error rates were below 10%, with many cases showing errors around 1%. In the cases with 10% error or greater, we can largely attribute the higher values to extraneous data traffic and spurious OS activity, among other effects due to non-dedicated resources. Once again, the increased error rates occurred in cases where data set sizes were small, system sizes were relatively large, or both. The maximum error observed is 23.28%, which is under the acceptable threshold for predicting performance of simulated systems, since real-world implementations of the hardware and software devices will have a great effect on the actual performance of the final system.

Figure 3-10. Experimental versus simulative execution times for Sweep3D

3.3.3.2 Experiment 2: Speed

For each experiment, the testbed execution time of the application was compared to the simulation time of the virtually-prototyped system. As seen in Figure 3-11, the simulation time increases with both data set size and system size. Increases in either characteristic raise the amount of network traffic, thus causing more computation time to
be spent processing interactions between the higher-fidelity models. In fact, the longest simulation time, one hour, occurred for the 64-node system using the 150×150×150 data set.

Table 3-5. Experimental versus simulation errors for Sweep3D

               Data set size
System size    50×50×50    100×100×100    150×150×150
2              0.69%       0.13%          0.21%
4              0.90%       5.38%          9.76%
8              0.60%       0.25%          2.84%
16             8.36%       10.11%         15.31%
32             5.95%       23.28%         1.03%
64             14.66%      17.36%         1.02%

Figure 3-11. Ratios of simulation to experimental wall-clock completion time for varying system and data set sizes

While a one-hour simulation time is within an acceptable tolerance, if we extrapolate the timing results to even larger system and data set sizes, we find that a system of 1024 nodes computing a 250×250×250 grid-point data set will take approximately 70 hours. In order to cut the time when simulating the computation of very large data sets on large-scale systems, the stimulus development and simulation techniques previously described in this chapter can be employed. In this case, knowledge of the application's execution characteristics can help to speed up simulations. The Sweep3D application
performs twelve iterations of its core code, with each compute node having identical communication blocks and very similar computation blocks per iteration. The affinity across iterations allows us to use the performance data collected during a single iteration to extrapolate the total time taken to execute all twelve iterations. This process can lead to decreased simulation times with little effect on simulation accuracy. In fact, removing all but a single iteration from the Sweep3D traces resulted in a simulation speedup of 9.6× (a total of 7 minutes rather than 65 minutes) while impacting the accuracy of the model by less than 1%.

3.3.3.3 Experiment 3: Virtual system prototyping

Now that we have a fast and accurate baseline system for the Sweep3D application, we can explore the effects of changing the system configuration. This experiment explores the performance impact of increasing the number of nodes in the system by scaling the processing power of each node. This scaling effectively extrapolates the performance of the baseline 64-node system to represent the performance of Sweep3D executing on systems of up to 8192 nodes. The results provided in this experiment present the best-case values of each system size, since network issues such as switch congestion and multi-hop transfers that arise from adding nodes to a system are not considered. Though the actual network pressures of these systems are not fully represented, the results do provide an upper bound on the performance of the Sweep3D application running on the corresponding system. This upper bound can be used to quickly identify whether or not a particular system is suitable to run a particular application, thus facilitating the evaluation of numerous configurations with the intent of pinpointing a small subset of the original candidate systems to simulate in more detail. Future work will incorporate the stimulus development techniques discussed in Section 3.1.2 so that network contention is considered to improve the accuracy of the performance predictions. The various networks that we examine include standard Gigabit Ethernet (GigE), an enhanced version of GigE, 10 GigE, InfiniBand, a 2D 8×8 direct-connect SCI network,
and a 3D 4×4×4 SCI network. Each configuration included in this study attempts to shed light on how changes in system size, network bandwidth, network latency, middleware characteristics, and topology affect the overall performance of Sweep3D, each of which is easily configurable via the FASE framework. The results from this experiment are displayed in Figure 3-12, with the 64-node Gigabit Ethernet system used as the baseline.

Figure 3-12. Execution times for Sweep3D running on various system configurations

Figure 3-12 shows that the Gigabit Ethernet system experienced nearly linear speedups until the system size reached 1024 nodes. At this point, the communication of Sweep3D begins to dominate execution time. In general, the trends displayed in Figure 3-12 were surprising, given that an initial timing analysis of the algorithm showed that half the execution time was spent in communication blocks. Upon further analysis, we determined that these trends are due to the "Late Sender" problem, where processes post their MPI_Recvs before the matching MPI_Send is executed by the corresponding process, causing the receiving process to become idle. In the case with 8192 nodes, the application becomes network-bound, even with the Late Sender problem. Therefore, we must change the focus to other network technologies to alleviate the communication bottleneck.
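The Late Sender effect described above can be illustrated with a small timing sketch. The function names and timestamps below are hypothetical and are not part of FASE, Sequoia, or any MPI library; the sketch only assumes that a blocking receive posted at one time cannot complete before the matching send is posted, and it ignores the transfer time itself.

```python
def late_sender_idle(recv_post_time: float, send_post_time: float) -> float:
    """Idle time (seconds) a receiver spends waiting on a late sender.

    Assumption for this sketch: the receive blocks from the moment it is
    posted until the matching send is posted; transfer time is ignored.
    """
    return max(0.0, send_post_time - recv_post_time)


def total_idle(events) -> float:
    """Sum the idle time over (recv_post_time, send_post_time) pairs."""
    return sum(late_sender_idle(recv, send) for recv, send in events)


# Made-up timestamps in seconds, for illustration only.
events = [
    (1.0, 3.5),  # send posted 2.5 s after the receive: receiver idles
    (4.0, 4.0),  # both posted together: no idle time
    (6.0, 5.2),  # send posted first: no Late Sender penalty
]
idle = total_idle(events)
```

Idle time of this kind is attributed to communication blocks in a timing profile even though the network itself is not the limiting resource, which is one way to read the gap between the initial timing analysis and the observed speedups.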
Figure 3-13. Maximum speedups for Sweep3D running on various network configurations

The first attempt to remedy the communication bottleneck for larger system sizes employed optimized versions of TCP and MPI. Specifically, we increased key parameters such as TCP window size and TCP maximum transfer unit (effectively enabling the use of jumbo frames) and reduced MPI overhead. This case, labeled Enhanced GigE, showed little improvement in total execution time, leading us to conclude that the bandwidth and latency of the network dictate the performance of communication events rather than the middleware. The next tests replaced the Gigabit Ethernet interconnect with high-performance networks such as 10 GigE, InfiniBand, and SCI. Figure 3-12 shows that in smaller systems, the interconnect has little effect on the performance of Sweep3D. However, it becomes apparent that beyond 1024 nodes a faster interconnect provides better speedups. For example, the 10 GigE and InfiniBand interconnects offer speedups of 4.1× and 5.31×, respectively, over the baseline GigE system. Figure 3-13 illustrates the maximum speedups for the various high-performance networks that were tested. Not only do these tests analyze the effects of adding bandwidth and lowering latency, but the SCI cases demonstrate the power of virtual system prototyping by exploring the impact of mapping the Sweep3D algorithm to drastically different topologies,
specifically a 2D and 3D torus. At first glance, the Sweep3D algorithm seems to map well to direct-connect network topologies based on its nearest-neighbor communication pattern; however, from Figure 3-13 it appears that this is not exactly the case. Further analysis of the application and underlying architecture shows that three fundamental characteristics of the SCI network hinder further performance improvements. First, the SCI protocol uses small payload sizes of 128 bytes, which cannot effectively amortize communication overhead. The second and third characteristics are the one-way packet flow of a dimensional ring and the packet forwarding or dimension switching delays that occur at each intermediate node while routing packets to the destination. Under certain circumstances, as dictated by the Sweep3D algorithm, few packets experience more than one forwarding delay due to the target node being the next neighboring node in the dimensional ring's packet flow. This case provides optimal mapping between the algorithm and the network architecture, resulting in excellent performance. However, this scenario is only one of four communication flows used by Sweep3D. The remaining communications result in numerous transactions that must travel almost the entire length of a dimensional ring in order to reach their destinations, resulting in many multi-hop transfers, higher latencies, and lower overall speedups. These negative effects could easily be lost when using low-fidelity, analytical network models due to their inability to capture structural characteristics that can greatly impact overall system performance.

Not only can FASE be used to analyze the effects of interchanging network technologies, but it can also provide a highly detailed analysis of a specific technology. We chose the InfiniBand network for the in-depth study since it showed the best performance of the six configurations. The InfiniBand model, as well as the MPI middleware model, has numerous user-definable parameters that can be changed to correspond to the current and future versions of the technologies. These parameters provide a mechanism to perform fine-grained analyses to squeeze as much performance as possible from a specific technology while also providing valuable insight on any new bottlenecks that may arise
during future upgrades. From the results in Figure 3-14, one can see that even at 8192 nodes, network enhancements have little effect on the InfiniBand system's performance. The greatest speedup from changes in the communication layer comes from middleware enhancements that achieve a 1.21× performance boost over the baseline. These results indicate that further improvements of the InfiniBand system's computation capabilities are needed in order to show any significant speedups when tweaking the middleware and network layers. The enhancement that increases network and processing performance as well as decreases the overhead associated with the middleware (see Figure 3-14, Middleware + Network + Processor enhancement) reinforces this claim by providing a 2.44× speedup. This study provides but a brief summary of the modifications that can be made to test the effectiveness of key technologies when analyzing an application using FASE.

Figure 3-14. Speedups for Sweep3D running on 8192-node InfiniBand system

3.4 Conclusions

The task of designing powerful yet cost-efficient HPC systems is becoming extremely daunting due to not only the increasing complexity of individual computation and I/O components but also the effective mapping of grand-challenge applications to the underlying architecture. In this first phase of research, we presented a framework called the Fast and Accurate Simulation Environment (FASE) that aids system engineers
in overcoming the challenges of designing systems that target specific applications through the use of analysis tools in conjunction with discrete-event simulation. Within the application domain, FASE provides a methodology to analyze and extract an application's performance-critical data, which is then used to discover trends and limitations as well as provide stimuli for simulation models for virtual prototyping. We also provided background on various options for performance prediction of HPC systems through modeling and simulation, and outlined the need for a solution that can provide fast simulation times with accurate results. The FASE framework was then outlined and its different components and features were described. To showcase the capabilities of FASE, we gathered a variety of results showing the performance of various system configurations. We first provided validation results for our InfiniBand, TCP/IP over Ethernet, and SCI models, showing that the network models that serve as the backbone of the case studies in this paper have been carefully tuned to accurately match real hardware. We then showed the results of a matrix multiply case study where we compared experimental to simulated execution times for a parallel MPI-based matrix multiply benchmark. In most cases, errors in the models were less than 1%, with the maximum error of 12.7% occurring in a case with a small data set size and a large system size. These conditions result in short experimental runtimes of less than one second, where transient effects such as OS task management and page faults can cause unpredictable deviations in application execution time. In terms of simulation speed, the slowdowns observed by simulating the parallel matrix multiply were very low, and in some cases, the simulation actually completed before the actual system finished executing the code. The final case study presented a series of experiments using the Sweep3D benchmark, which is the main kernel of a real Accelerated Strategic Computing Initiative application. We performed simulation and hardware experiments over a range of data set sizes using a Gigabit Ethernet-based system, and again found errors to be very low in most cases.
A maximum error of 23.28% was observed, which is considered acceptable when predicting the performance of a complex application running on an HPC system. Again, the cases with high errors correspond to non-typical HPC scenarios with larger systems working on small data sets, resulting in conditions that amplify deviations in experimentally measured runtimes. We also proposed and employed tactics to speed up simulation by approximately 10× while sacrificing less than 1% accuracy. With a fast and accurate baseline system established for Sweep3D, we proceeded to use the FASE methodology to predict the performance of the application for systems with various sizes and network technologies. We found that the application was actually more processor-bound than initially anticipated due to the MPI "Late Sender" problem, where a process posts an MPI_Recv before the corresponding MPI_Send is executed, causing the receiving process to become idle. However, as the systems increased in size, Sweep3D did become network-bound, with 10 Gigabit Ethernet and InfiniBand both providing significant performance improvements over the Gigabit Ethernet baseline, mainly due to their increased bandwidth. The analysis of Sweep3D concluded with an in-depth look at its execution on an InfiniBand system while varying fine-grained parameters such as network bandwidth, packet size, and middleware overhead. These modifications provided minimal performance improvements because the algorithm's bottleneck changed from communication to computation when network backpressure was reduced by choosing an improved network technology (i.e., InfiniBand).

The work conducted during this phase of research produced a flexible and comprehensive framework for performance modeling and prediction. This framework provides a generalized methodology for application characterization, design and development of component and system models, and analysis of applications running on the virtual systems under consideration. The work also produced a set of tools and a model library to facilitate performance prediction. The case studies validated the usefulness of FASE by displaying both fast and accurate results when comparing the observed
experimental and simulative values. The studies also illustrated the capabilities of FASE for analyzing the effects of architectural variations in order to improve the scalability of applications. The contributions and accomplishments of this work have been compiled into a manuscript and published in Simulation: Transactions of The Society for Modeling and Simulation International [35].
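As a minimal sketch of how the accuracy and speed figures reported in this chapter relate to the underlying timings, the following Python snippet computes a percent error and a simulation-to-experiment wall-clock ratio. The function names are hypothetical, and the error definition (absolute difference relative to the experimental time) is an assumption, since the text does not state the formula. Because the tables list rounded times, not every table entry is recovered exactly from the printed values.

```python
def percent_error(exp_s: float, sim_s: float) -> float:
    """Percent error of a simulated execution time against the measured one.

    Assumed definition: |sim - exp| / exp * 100.
    """
    return abs(sim_s - exp_s) / exp_s * 100.0


def wall_clock_ratio(exp_s: float, sim_wall_s: float) -> float:
    """Ratio of simulation wall-clock time to experimental execution time.

    Values below one mean the simulation finished before the hardware run.
    """
    return sim_wall_s / exp_s


# Two-node InfiniBand system, data size 2000 (Tables 3-2 and 3-3).
err = percent_error(459.36, 458.80)
ratio = wall_clock_ratio(459.36, 75.10)
print(f"error = {err:.2f}%, ratio = {ratio:.2f}")  # error = 0.12%, ratio = 0.16
```

The same two helpers can be applied row by row to compare configurations, keeping in mind that the simulated execution time (Table 3-2) and the simulator's own wall-clock time (Table 3-3) are different quantities.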
CHAPTER 4
PERFORMANCE AND AVAILABILITY PREDICTIONS OF VIRTUALLY PROTOTYPED SYSTEMS FOR SPACE-BASED APPLICATIONS (PHASE 2)

This chapter presents two detailed case studies of the FASE framework applied to analyze the effects of various configuration and algorithmic changes to a space-based system. The first case study looks at performance and scalability issues of the NASA Dependable Multiprocessor (DM) executing a key application kernel, the Fast Fourier Transform. The second evaluates the performance and availability of the Synthetic Aperture Radar (SAR) application running on the DM system in a fault-prone environment such as space. The following sections provide the details and results of these case studies as well as a novel analysis approach to accurately predict system performability. The first section presents background information on the motivations and initial design of the space system. The next section supplies details on the approach taken to design and develop the necessary models to virtually explore the performance, scalability, and availability trade-offs of the candidate system. It also describes the unique analysis approach used to study SAR executing on the DM system. Finally, the experiments and results from both case studies are presented, followed by the conclusions drawn from the analyses.

4.1 Background

4.1.1 Project Overview

NASA and other space agencies have had a long and relatively productive history of space exploration, as exemplified by recent rover missions to Mars. Traditionally, space exploration missions have essentially been remote-control platforms with all major decisions made by operators located in control centers on Earth. The onboard computers in these remote systems have contained minimal functionality, partially in order to satisfy design size and power constraints, but also to reduce complexity and therefore minimize the cost of developing components that can endure the harsh environment of space. Hence, these traditional space computers have been capable of doing little more than executing small sets of real-time spacecraft control procedures, with little or no processing features
remaining for instrument data processing. This approach has proven to be an effective means of meeting tight budget constraints because most missions to date have generated a manageable volume of data that can be compressed and post-processed by ground stations. However, as outlined in NASA's latest strategic plan and other sources, the demand for onboard processing is predicted to increase substantially due to several factors [36]. As the capabilities of instruments on exploration platforms increase in terms of the number, type, and quality of images produced in a given time period, additional processing capability will be required to cope with limited downlink bandwidth and line-of-sight challenges. Substantial bandwidth savings can be achieved by performing preprocessing and, if possible, knowledge extraction on raw data in situ. Beyond simple data collection, the ability for space probes to autonomously self-manage will be a critical feature to successfully execute planned space-exploration missions. Autonomous spacecraft have the potential to substantially increase their return on investment through opportunistic explorations conducted outside the Earth-bound operator control loop. To achieve this goal, the required processing capability becomes even more demanding when decisions must be made quickly for applications with real-time deadlines. However, providing the required level of onboard processing capability for such advanced features and simultaneously meeting tight budget requirements is a challenging problem that must be addressed.

In response, NASA has initiated several projects to develop technologies that address the onboard processing gap. One such program, NASA's New Millennium Program (NMP), provides a venue to test emergent technology for space. The Dependable Multiprocessor (DM) is one of the four experiments on the upcoming NMP Space Technology 8 (ST8) mission, to be launched in 2009, and the experiment seeks to deploy Commercial-Off-The-Shelf (COTS) technology to boost onboard processing performance per watt [37]. The DM system combines COTS processors and networking
components (e.g., Ethernet) with a novel and robust middleware system that provides a means to customize application deployment and recovery features, and thereby maximize system efficiency while maintaining the required level of reliability by adapting to the harsh environment of space. In addition, the DM system middleware provides a parallel processing environment comparable to that found in high-performance COTS clusters with which application scientists are familiar. By adopting a standard development strategy and runtime environment, the additional expense and time loss associated with porting applications from the laboratory to the spacecraft payload can be significantly reduced.

4.1.2 DM System Architecture

Building upon the strengths of past research efforts [38], [39], [40], the DM system provides a cost-effective, standard processing platform with a seamless transition from ground-based computational clusters to space systems. By providing development and runtime environments familiar to earth and space science application developers, project development time, risk, and cost can be substantially reduced. The DM hardware architecture (see Figure 4-1) follows an integrated-payload concept whereby components can be incrementally added to a standard system infrastructure inexpensively [41]. The DM platform is composed of a collection of COTS data processors augmented with runtime-reconfigurable COTS FPGAs, interconnected by redundant COTS packet-switched networks such as Ethernet or RapidIO [42]. To guard against unrecoverable component failures, COTS components can be deployed with redundancy, and the choice of whether redundant components are used as cold or hot spares is mission-specific. The scalable nature of non-blocking switches provides distinct performance advantages over traditional bus-based architectures and also allows network-level redundancy to be added on a per-component basis. Additional peripherals or custom modules may be added to the network to extend the system's capability; however, these peripherals are outside of the scope of the base architecture.
Figure 4-1. System hardware architecture of the dependable multiprocessor

Future versions of the DM system may be deployed with a full complement of COTS components but, in order to reduce project risk for the DM experiment, components that provide critical control functionality are radiation-hardened in the baseline system configuration. The DM is controlled by one or more System Controllers, each a radiation-hardened single-board computer, which monitor and maintain the health of the system. Also, the system controller is responsible for interacting with the main controller for the entire spacecraft. Although system controllers are highly reliable components, they can be deployed in a redundant fashion for highly critical or long-term missions with cold or hot sparing. A radiation-hardened Mass Data Store (MDS) with onboard data handling and processing capabilities provides a common interface for sensors, downlink systems, and other peripherals to attach to the DM system. Furthermore, the MDS provides a globally accessible and secure location for storing checkpoints, I/O, and other system data. The primary data flow in the system is from instrument to Mass Data Store, through the cluster, back to the Mass Data Store, and finally to the ground via the spacecraft's Communication Subsystem. Because the MDS is a highly reliable component, it will likely have an adequate level of reliability for most missions and therefore need not be replicated. However, redundant spares or a fully distributed memory approach may be required for some missions. In fact, results from an investigation of the system performance suggest that a monolithic and centralized MDS may limit the scalability of certain applications, and these results are presented in Section 4.3.
4.1.3 DM Middleware Architecture

The DM middleware has been designed with the resource-limited environment typical of embedded space systems in mind and yet is meant to scale up to hundreds of data processors per the goals for future generations of the technology. A top-level overview of the DM software architecture is illustrated in Figure 4-2. A key feature of this architecture is the integration of generic job management and software fault-tolerance techniques implemented in the middleware framework. The DM middleware is independent of and transparent to both the specific mission application and the underlying platform. This transparency is achieved for mission applications through well-defined, high-level Application Programming Interfaces (APIs) and policy definitions, and at the platform layer through abstract interfaces and library calls that isolate the middleware from the underlying platform. This method of isolation and encapsulation makes the middleware services portable to new platforms.

Figure 4-2. System software architecture of the dependable multiprocessor

To achieve a standard runtime environment to which science application designers are accustomed, a commodity operating system such as a Linux variant forms the basis for the software platform on each system node, including the control processor and mass data store (i.e., the Hardened Processor seen in Figure 4-2). Providing a COTS runtime system allows space scientists to develop their applications on inexpensive ground-based clusters and transfer their applications to the flight system with minimal effort. Such an
easy path to flight deployment will reduce project costs and development time, ultimately leading to more science missions deployed over a given period of time. Table 4-1 provides descriptions of the other DM middleware components.

Table 4-1. The DM middleware components

Component                                                        Description
High-Availability Middleware (HAM)                               Provides standard communication interface between all software components including user applications. Guarantees in-order delivery of all messages and supports seamless switching between redundant networks.
Fault-Tolerance Manager (FTM)                                    Central fault recovery agent for DM system. Monitors status of software agents and reliable messaging middleware. Updates JM tables upon resource changes affecting application scheduling.
Job Manager (JM)                                                 Centralized component that schedules jobs, allocates resources, dispatches processes, and directs application recovery.
Job Manager Agents (JMA)                                         Distributed software agents that fork the execution of jobs and manage required runtime job information on the local host.
Fault-tolerant Embedded Message Passing Interface (FEMPI)        Application-independent, fault-tolerant, message passing interface adhering to the MPI standards. Provides a subset of the MPI API and supports various fault recovery modes.
MDS Server                                                       Services all data operations between applications and mass memory.

4.2 Approach

The FASE framework presented in Chapter 3 provides an ideal environment for exploring the design options involved with system configuration of the DM system. The models in the pre-built library were designed so that components could be configured for embedded or traditional HPC systems through simple parameter tweaks representing the capacity or capability of a specific resource. Therefore, various design trade-offs can be explored using a variety of hardware and software models in order to analyze their effects on system performance and scalability. In order to apply the FASE framework to study COTS components in space, the original design was extended to support fault injection capabilities. These additional features allow users to explore not only performance-oriented issues, but also those dealing
with fault tolerance and availability. In order to facilitate the use of these new capabilities, the researcher introduces a novel approach to predict the performability, a metric that combines performance and availability to describe degradable systems [43], of COTS-based payload processing systems. This approach analyzes systems in three complementary domains: (1) physical prototype, (2) Markov-reward model, and (3) discrete-event simulation model. Techniques from each domain represent cornerstones in the analysis process, though each has its strengths and weaknesses. Physical prototypes offer validity in measured values but provide limited scalability and adaptability. Markov-reward models allow for quick performability measurements for specific failure and recovery rates, but are not suitable when modeling complex systems due to the high dimensionality required for high-fidelity models. Finally, simulation provides a free-form environment to evaluate systems with arbitrary configurations and workloads, but often suffers from increased development time and lengthy analyses. By intelligently leveraging the strengths in each domain, a quick and precise analysis of various system configurations and applications can be achieved that includes a variety of arbitrary workloads and fault-injection campaigns.

The process begins with the evaluation of the prototype system, where real-world performance values such as network latency and component recovery times are measured and used to calibrate the Markov-reward and discrete-event simulation models that would otherwise lack validity. Next, quick performability evaluations of the system's fault-tolerant software architecture are conducted using Markov modeling techniques to identify efficient designs and workloads. Thus, the Markov models trim an otherwise large design space to eliminate the time spent analyzing poor designs. The final step uses pre-built or customized simulation models to analyze architectural enhancements and dependencies within the selected systems and applications at a level of detail that cannot be achieved in the previous domains. The resulting methodology allows candidate systems to be thoroughly and accurately analyzed for both performance and availability,
thus allowing designers to compare alternate fault-tolerant architectures for aerospace applications. We apply the three-stage methodology to analyze and quantify the performance and fault-tolerant characteristics of the DM management software and proposed flight system. The following subsections provide more details on the modeling efforts involved with this work.

4.2.1 Physical Prototype

The first stage of the analysis approach involves the development and testing of a prototype system that represents a scaled-down version of the proposed system. The prototype DM system was designed and developed to mirror when possible and emulate when necessary the features of a typical satellite system. As shown in Figure 4-3, the prototype hardware consists of a collection of COTS Single-Board Computers (SBCs) running a Linux-based operating system, interconnected with redundant Gigabit Ethernet networks. One SBC is augmented with an FPGA coprocessor, and a reset controller and power supply are incorporated for power-off resets. Six SBCs are used to mirror four data processor boards and emulate the functionality of the two radiation-hardened control and MDS nodes. Each SBC comprises a 1 GHz PowerPC processor, 1 GB of main memory, and dual Gigabit Ethernet NICs. A Linux workstation emulates the role of the Spacecraft Command and Control Processor, which is responsible for communication with and external control of the DM system but is outside the scope of this paper. The MPI middleware layer used on the testbed is FEMPI 1.0, a custom, fault-tolerant implementation of a selected subset of the MPI standard [44]. GoAhead's SelfReliant 4.1 is used as the high-availability middleware, which provides network communication, liveliness information, and network failover. Finally, the MDS storage device was emulated via a 5400 RPM hard drive. The prototype is used to measure the achievable performance of the system executing microbenchmarks that exercise its network and MDS subsystems. It is also employed
to gather the response times necessary to detect failed components within the system. Further details on how the prototype is used to validate the DM models are presented in Section 4.3.

Figure 4-3. Logical diagram and photograph of DM testbed: (a) logical diagram, (b) photograph

4.2.2 Markov-Reward Modeling

After the prototype system has been developed and performance and other key metrics have been measured, the analysis transitions to the Markov-reward modeling domain. Within this domain, quick evaluations are conducted to explore various fault and recovery rates in order to identify the workloads and system configurations that are most interesting for further study. In these studies, steady-state performability (SSP) is the common fitness metric used to describe the performance of degradable multiprocessor computer systems [45]. The SSP allows users to predict a mean computational performance of the system which takes into account both short- and long-term effects which could otherwise cause skew in experimental measurements of system performance. A typical method used to estimate the SSP involves using Markov-reward models (MRMs), constructs based on continuous-time Markov chains (CTMCs). MRMs combine state probabilities, obtained from steady-state analyses of CTMCs, and reward rates based on computational performance to calculate the SSP of a system. Formally, the SSP is defined as the expected asymptotic reward or
$$\mathrm{SSP} = \sum_{i \in S} \pi_i r_i \qquad (4\text{-}1)$$

where $S$ represents the set of all possible states that the given system can occupy, $\pi_i$ denotes the steady-state probability of the system occupying state $i$, and $r_i$ stands for the reward rate for the $i$th state [46].

4.2.2.1 Data node model

The data node model focuses on calculating the performability of the node under faulty conditions. To simplify the node model, we assumed that jobs are continuously scheduled on the node by the JM. Also, the delay between the completion of one job and start of another is considered to be negligible since the runtime of each job is significantly larger than the scheduling time. These assumptions allow the model to be realized as a six-state CTMC as shown in Figure 4-4. Each state corresponds to a particular condition of the data processing node where the three primary components, the application, JMA, and system (i.e., HAM and operating system), are either operational or not.

Figure 4-4. Markov-reward data node model

The S_APP state occurs when the application is executing on the node and all other services are running correctly. This state is the only node configuration that has an
associated non-zero reward rate value equal to 1, which makes the SSP equivalent to the availability for this model. In order for the model to transition out of the S_APP state, an SEE-related error must occur causing a hang or crash of the application, JMA, or system, governed by the fault rates λ_AF, λ_JF, and λ_SF, respectively. Since the recovery policy (node reset) for node-wide errors and HAM errors is identical, the failure rates are combined to simplify the model. Each failure rate is inversely proportional to the independent variable, MTBF_NODE (mean time between faults for the node), whose reciprocal is the SEE rate experienced by the node. The majority of SEEs are expected to impact the CPU; therefore, each of the aforementioned fault rates is obtained by scaling the node SEE rate, 1/MTBF_NODE, by the CPU utilization of the given software component (CPU%_APP, CPU%_JMA, CPU%_SYS). The CPU utilization is defined by the following equation.

\[ \mathrm{CPU\%}_{APP} + \mathrm{CPU\%}_{JMA} + \mathrm{CPU\%}_{SYS} = 100\% \tag{4-2} \]

The S_DET state denotes a detection delay when the error has occurred in the application, causing it to abort or crash. This delay is associated with the heartbeat interval of the running application to the JMA. The S_JMA state denotes a configuration in which there is no application running but the rest of the system is functioning properly. To transition to the S_REC state, the JMA must start an application with rate μ_FR, which is inversely proportional to the time required by the system to start the process. The S_REC state symbolizes the application recovering from a crash. The rate μ_RC, at which the system can transition back to S_APP, is dependent on the checkpointing interval of the application as well as the size of the checkpoint and the transfer time from the MDS. When the JMA fails, all running applications are terminated and the model enters the S_SYS state. Upon entering the S_SYS state, the HAM will immediately attempt to start a JMA with a start rate of μ_FR. If the operating system or HAM fails before the JMA starts up, the node model will switch to the S_DOWN state and the node will be rebooted. The reboot rate μ_RB dictates the time required for the system to cycle power to the node,
start the operating system and HAM, and reconnect to the system. Tables 4-2 and 4-3 summarize the states and parameters incorporated into the node model, respectively.

Table 4-2. Data node model states

| Symbol | Running Components | Description |
|---|---|---|
| S_APP | SYS, JMA, Application | System is functioning correctly. |
| S_DET | SYS, JMA | Application has crashed or hung. |
| S_JMA | SYS, JMA | The JMA is ready to start or restart the application. |
| S_REC | SYS, JMA, Application | The application is recovering from the crash. |
| S_SYS | SYS | The JMA has crashed or hung; the application is automatically killed. |
| S_DOWN | None | The system has crashed and requires reboot. |

Table 4-3. Failure and recovery rates of the node model

| Symbol | Rate [1/s] or Value | Description | Type |
|---|---|---|---|
| MTBF_NODE | variable | Mean time between faults for a node. | Input |
| λ_AF | CPU%_APP / MTBF_NODE | Application fault rate. | Derived |
| λ_JF | CPU%_JMA / MTBF_NODE | JMA fault rate. | Derived |
| λ_SF | CPU%_SYS / MTBF_NODE | System fault rate. | Derived |
| μ_RB | 0.0333 | System reboot rate (HAM and OS). | Measured |
| μ_RC | 0.069061 | Application recovery rate. | Measured |
| μ_FR | 14.27 | System fork rate. | Measured |
| μ_DT | 0.8333 | Failed application detection rate. | Measured |
| CPU%_APP | 70% | Portion of CPU used by Application. | Estimated |
| CPU%_JMA | 5% | Portion of CPU used by JMA. | Estimated |
| CPU%_SYS | 25% | Portion of CPU used by OS and HAM. | Estimated |

The rates specified in Table 4-3 are divided into three categories: derived, measured, and estimated. The derived rates are calculated based on equations using the SEE rate while the measured rates were obtained by experimental measurements on the DM
prototype system. The CPU utilization values chosen reflect the estimated workload of typical applications running on the DM system.

4.2.2.2 System model

The goal of the Markov model representing the DM system is to approximate the SSP of the system with an arbitrary number of nodes. For this model, we assume each node executes completely independent workloads and, as a result, the model represents a best-case approximation and sets an upper bound on the SSP of the system. The fact that the radiation-hardened control and MDS nodes in the DM system are not susceptible to SEEs further simplifies the model. To predict the SSP of such a system, we developed a Markov-reward model with N+1 states, where N is the number of compute nodes in the cluster (see Figure 4-5).

Figure 4-5. Markov-reward system model

Each state in the system model denotes the number of nodes that are currently in the S_APP state at a given time. Most commonly, the reward rate associated with each state is simply set to the number of nodes in the S_APP state. However, in systems such as those with hot spares or those that incur overhead penalties, each node's reward rate can be modified accordingly. The node failure rate λ_ND, defined in Equation 4-3, is equivalent to the aggregate rate of all transitions from the S_APP state, while the recovery rate of a node, μ_ND, is the rate at which the node model can transition back to the S_APP state. Equation 4-4 provides a formal definition of this recovery rate.

\[ \lambda_{ND} = \lambda_{AF} + \lambda_{JF} + \lambda_{SF} = \frac{1}{\mathrm{MTBF}_{NODE}} \tag{4-3} \]
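As an illustration only, the system model can be sketched in a few lines of Python. The sketch assumes that, with independent nodes, state i (i nodes in S_APP) is left at rate i·λ_ND toward state i−1 and at rate (N−i)·μ_ND toward state i+1; this transition structure is our reading of an (N+1)-state model and may differ in detail from the figure. The function names and the numeric value chosen for μ_ND are hypothetical, and the reward of state i is simply i.

```python
from math import isclose


def derived_fault_rates(mtbf_node, cpu_app=0.70, cpu_jma=0.05, cpu_sys=0.25):
    """Split the node fault rate 1/MTBF_NODE by CPU utilization (Table 4-3)."""
    node_rate = 1.0 / mtbf_node
    return {
        "AF": cpu_app * node_rate,  # application fault rate
        "JF": cpu_jma * node_rate,  # JMA fault rate
        "SF": cpu_sys * node_rate,  # system (OS and HAM) fault rate
    }


def system_ssp(n_nodes, lam_nd, mu_nd):
    """SSP of an (N+1)-state birth-death chain with reward r_i = i.

    Assumed transitions (not taken from the figure):
      i -> i-1 at rate i * lam_nd, and i -> i+1 at rate (N - i) * mu_nd.
    """
    # Unnormalized steady-state probabilities from detailed balance,
    # starting at state 0 (no node in S_APP).
    weights = [1.0]
    for i in range(n_nodes):
        up = (n_nodes - i) * mu_nd   # rate from state i to i+1
        down = (i + 1) * lam_nd      # rate from state i+1 to i
        weights.append(weights[-1] * up / down)
    total = sum(weights)
    pi = [w / total for w in weights]
    # Equation 4-1: sum of steady-state probability times reward.
    return sum(i * p for i, p in enumerate(pi))


mtbf_node = 28800.0                   # seconds (three faults per day)
rates = derived_fault_rates(mtbf_node)
lam_nd = sum(rates.values())          # Equation 4-3
mu_nd = 0.05                          # hypothetical node recovery rate [1/s]
print(system_ssp(8, lam_nd, mu_nd))
```

Under these assumptions the number of nodes in S_APP is binomially distributed, so the SSP reduces to N·μ_ND/(λ_ND + μ_ND), which gives a quick sanity check on any numerical solution.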
\[ \mu_{ND} = \frac{P_{S_{APP}} \, \lambda_{ND}}{1 - P_{S_{APP}}} \tag{4-4} \]

where P_SAPP is the steady-state probability that the node model occupies the S_APP state.

To simulate the data node and system models we use the SHARPE tool, which is commonly used to simulate Markov chains, Petri nets, and hierarchical models for availability, reliability, and dependability calculations. The tool is actively developed at Duke University [47]. In order to coarsely evaluate the performance of the DM system, we developed a hierarchical Markov-reward model that allows for rapid evaluation of potential computational rates achievable for a range of applications under varying fault conditions. Unfortunately, such a basic model lacks the fidelity and precision to explore the effects of network, CPU, MDS, or scheduling performance, which in conjunction with fault conditions can significantly affect the SSP. The quality of the SSP obtained from the Markov-reward model is further evaluated and compared to the simulative model in Section 4.3.

4.2.3 Discrete-Event Simulation Modeling

The final step in the three-stage analysis involves in-depth evaluations of virtually prototyped systems using discrete-event simulation models. In the simulation domain, analyses of application and architectural configurations are conducted with the intent to identify the settings that produce the highest performance and availability. Based on the node and system architectures from Sections 4.1.2 and 4.1.3, discrete-event simulation models of key components were designed and developed. Each model adheres to the FASE methodology of balancing speed and accuracy, and some models actually extend or enhance the pre-existing models in the FASE library. The components listed in Table 4-4 were formally modeled not only to capture the correct functionality of the corresponding technology but also to incorporate their impacts on system performance and fault tolerance. From these core component models, node and system models were developed. Figure 4-6 illustrates the middleware models that compose
a data processing node, system control node, and MDS node. Finally, the virtual flight system was developed using the final design architecture (see Figure 4-7).

Table 4-4. Summary of DM component models

| Component | Library | Description |
|---|---|---|
| Fault Tolerance Manager | DM | Detects component failures, notifies Job Manager, and takes necessary recovery procedures. |
| Job Manager | DM | Schedules and manages jobs. Handles task restarts based on available resources. |
| Job Manager Agent | DM | Starts and monitors applications on data node. Notifies Job Manager when application failure detected. |
| High-Availability Middleware | DM | Provides reliable communication between nodes in system. Monitors JM, JMA, and other nodes for failures. Notifies FTM of failures. |
| MDS Server | DM | Handles data access requests to the mass data store. |
| TCP Layer | FASE | Provides TCP protocol for reliable communication between nodes. |
| IP Layer | FASE | Provides IP protocol for all network transfers. |
| Ethernet NIC | FASE | Provides Ethernet protocol for all network transfers. Supports multiple ports. |
| Ethernet Switch | FASE | Provides Ethernet connectivity between nodes. Supports variety of backplane and routing options. |

4.2.4 Fault Model Library

The models described in the previous section capture the functionality and performance-based characteristics as seen in their real-world counterparts. However, they do not include fault detection and recovery mechanisms needed to function properly when exposed to a fault. As a result, a fault model library was designed to integrate key features that enhance the models to react appropriately under various fault campaigns. In addition, the fault model library provides the necessary components to generate and inject faults into an arbitrary system. The models in the library were specifically designed so that new and pre-existing models could be "fault-enabled" with few additions or modifications. The fault models were also designed to create a fault hierarchy such that a single, high-level component could be affected by a fault and the mechanisms would
Figure 4-6. The DM node models: (a) compute node, (b) control node, (c) MDS node.

Figure 4-7. The DM flight system model
automatically propagate the fault to all lower-level entities. This hierarchical design not only captures the area of influence of a particular fault type, but it also provides an infrastructure to define interdependencies between various components. Table 4-5 lists the fault model components accompanied with brief descriptions of their functionality.

Table 4-5. Summary of fault models

| Component | Description |
|---|---|
| Fault Generator | Controls when faults are generated. Generation times based on random distributions or user defined. |
| Fault Controller | Injects faults into system. Selects target component randomly or based on user-defined susceptibility matrix. Monitors the status of all fault managers and modules in the system. |
| Fault Manager | Propagates injected faults from the fault controller to all lower-level faulty devices. Aids in the recovery process of managed components when necessary. |
| Fault Module | Provides high-level fault mechanisms such as detection and recovery to integrate into new and pre-existing models. Inherits Fault Base mechanisms and data structures. |
| Fault Base | Provides low-level fault data structures and mechanisms to integrate into new and pre-existing models. These data structures and mechanisms deal with scheduling events, managing faulty events, and managing memory (e.g., preventing memory leaks). |

Figure 4-8 illustrates an example of a fault-enabled system based on the DM architecture. The figure shows the various hardware and software components that can be affected by faults as well as some of the fault models that manage the injection, detection, and reaction to faults in the system. Fault Modules have been integrated into each of the hardware and software components in the system and Fault Managers are used to manage groups of modules. The actual groupings of faulty components are based on either the physical proximity of the components in a device or the management system that controls the liveliness of the component. Another factor to consider when creating these groups is how each component reacts to specific faults. For example, the Data Node Fault Manager in Figure 4-8 manages how faults are injected at the node level such that if the corresponding data node becomes faulty, it will pass the fault to the lower-level fault
managers within the NIC, Middleware, and Applications blocks so that they can disable their corresponding components (e.g., Port, JMA, HAM, or SAR).

Figure 4-8. Example fault-enabled system

Faults are injected into the system by the Fault Control Block, which is composed of the Fault Generator and Fault Controller. The Fault Generator creates time-based fault events as dictated by either a random distribution or the user. The Fault Controller receives events from the Fault Generator and provides the injection capabilities necessary to stimulate the virtual system with failures. The Fault Controller targets specific components in the system according to a susceptibility matrix that defines the probability each listed component will experience a fault. The actual percentage values supplied within the susceptibility matrix are user-defined, and both the utilization and physical size of each component should be considered when setting the values. Once the Fault Controller determines its target component, it injects the fault into the system via Fault Managers. The Fault Managers relay the fault to the target component so the
target can react according to a particular policy defined by the user. The Fault Module is incorporated into each fault-enabled component and provides virtual detection and recovery functions that can be redefined to allow the user to configure the device to take the necessary actions as dictated by its fault-tolerant policies. The models within the fault library were programmed primarily using the object-oriented C++ language so that other systems and models can be easily retrofitted with fault injection capabilities. Also, the models were designed with extensibility in mind to support a wide range of detection and recovery methods for many types of faults. All the components described in the previous section have been retrofitted with the necessary fault models and each has been configured to react to and recover from faults as dictated by the policies set up for the DM system. Details on these policies can be found in [48].

4.3 Results and Analysis

This section describes the methods used to analyze and identify performance and availability issues in the DM system architecture. The section presents experiments used to validate the Markov and simulation models through the use of experimentally gathered measurements from the prototype system. These measurements are used to calibrate the models and validation results are presented where applicable. After the models are calibrated, a scalability study of an important application kernel, the 2D FFT, is presented. The goal of this investigation is to find bottlenecks that exist in the proposed design and explore the effects of changing key system features regarding both architectural and algorithmic variations to better map the application to the underlying architecture. The scalability study is followed by an in-depth performability analysis of the DM system using the SAR application in order to evaluate the trade-offs between performance, scalability, and availability. The section concludes with an evaluation of the proposed 20-node flight system incorporating the optimal configurations to maximize performability.
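Because the fault-injected simulation models are central to the results that follow, the fault-injection flow of Section 4.2.4 is sketched here in Python. This is a minimal illustration under stated assumptions, not the C++ library itself: the class and method names (on_fault, recover, inject, fault_times) are hypothetical, the exponential inter-arrival distribution is just one possible "random distribution", and the susceptibility values are example numbers.

```python
import random


class FaultModule:
    """Stand-in for a fault-enabled component; hooks can be overridden."""

    def __init__(self, name):
        self.name = name
        self.faulty = False

    def on_fault(self):
        # Default reaction: mark the component as failed.
        self.faulty = True

    def recover(self):
        self.faulty = False


class FaultManager:
    """Relays a fault to every module and sub-manager beneath it."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def on_fault(self):
        for child in self.children:
            child.on_fault()

    def modules(self):
        # Walk the hierarchy and yield the leaf FaultModule objects.
        for child in self.children:
            if isinstance(child, FaultManager):
                yield from child.modules()
            else:
                yield child


class FaultController:
    """Chooses a target from a susceptibility matrix and injects the fault."""

    def __init__(self, targets, susceptibility, seed=None):
        if abs(sum(susceptibility.values()) - 1.0) > 1e-9:
            raise ValueError("susceptibility values must sum to 1")
        self.targets = targets                # name -> manager or module
        self.susceptibility = susceptibility  # name -> probability
        self.rng = random.Random(seed)

    def inject(self):
        names = list(self.susceptibility)
        weights = [self.susceptibility[n] for n in names]
        name = self.rng.choices(names, weights=weights, k=1)[0]
        self.targets[name].on_fault()
        return name


def fault_times(mean_interval, horizon, rng):
    """Fault generator: exponentially distributed gaps up to a time horizon."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mean_interval)
        if t > horizon:
            return times
        times.append(t)


# Example hierarchy loosely following the data node in Figure 4-8.
port, jma, ham, sar = (FaultModule(n) for n in ("Port", "JMA", "HAM", "SAR"))
node = FaultManager("DataNode", [
    FaultManager("NIC", [port]),
    FaultManager("Middleware", [jma, ham]),
    FaultManager("Applications", [sar]),
])
controller = FaultController(
    {"SAR": sar, "HAM": ham, "JMA": jma},
    {"SAR": 0.70, "HAM": 0.25, "JMA": 0.05},  # example susceptibility values
    seed=1,
)
for when in fault_times(600.0, 3600.0, random.Random(2)):
    print(round(when, 1), controller.inject())
```

Injecting at the node-level manager instead (node.on_fault()) marks every module beneath it as faulty, which is the propagation behavior the fault hierarchy is meant to provide.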
4.3.1 Model Calibration

4.3.1.1 Component model calibration and validation

Model validation is a critical step in any modeling effort that attempts to provide accurate results comparable to those produced by real systems. Validation of complex systems such as the DM system is difficult to accomplish; therefore, we attempt to overcome this challenge by decomposing and validating its subsystems: the network subsystem and the MDS subsystem. The network subsystem encompasses all software and hardware layers employed during a data transfer, which correspond to the HAM, TCP, IP, and Ethernet models. In order to validate this subsystem, a simple PingPong MPI program that measures low-level network performance between two nodes was executed on the prototype system described in Section 4.2.1 and the results were used to calibrate the models to best represent the testbed's network and middleware performance. Figure 4-9a illustrates the experimentally gathered throughput values as compared to those produced by the simulated system. The figure shows the simulation model closely matches the performance measured on the testbed with a mean relative error of 1.27%. A similar mean relative error was observed when comparing the experimental and simulative latency measurements across the studied message sizes.

Figure 4-9. Throughput validations for network and MDS subsystem models: (a) network subsystem, (b) MDS subsystem.

Once the network subsystem was validated, the performance of another key subsystem, the MDS, was calibrated according to experimental measurements. Again,
a simple benchmark was developed that transfers data of varying sizes to and from the MDS node. The validation results are shown in Figure 4-9b, with the MDS subsystem model producing mean relative errors of 1.58% for writes and 2.03% for reads. From the validation process and documentation, the DM system's main component parameters were calibrated in order to most accurately represent the testbed system. The values for key parameters are listed in Table 4-6 and this configuration corresponds to the baseline system used in the following experiments.

Table 4-6. Baseline system parameters

| Parameter Name | Value |
|---|---|
| Processor power | 1200 MIPS, 600 MFLOPS |
| MPI maximum throughput | 57 MB/s |
| MPI message latency | 13.6 ms |
| HAM buffer size | 2000000 bytes |
| Network bandwidth | Non-blocking 1000 Mb/s |
| Network switch latency | 5 μs |
| MDS bandwidth (write/read) | 60/40 MB/s |
| MDS latency (write/read) | 300/500 μs |
| MDS open file overhead | 8 ms |

4.3.1.2 System performability model

In addition to component calibration, the simulation system models were validated with regard to their near-idealistic performability as compared to those produced by the Markov-reward models presented in Section 4.2.2. For this experiment, a serial version of an LU decomposition kernel was scheduled on each node in the tested systems. Each LU job processed matrices of 1000×1000 elements with each element being an 8-byte double. The experiment varied the system size from four to thirty-two nodes by powers of two while exploring ten MTBF_NODE values. The minimum fault rate expected for each DM data node was estimated to be three faults per day, which corresponds to the maximum MTBF_NODE value analyzed (28800 seconds = 8 hours) and a relatively hospitable
environment. The remaining rates were selected to analyze the system's performability in harsher conditions and the results from this study are shown in Figure 4-10.

Figure 4-10. Markov versus simulation DM system performability comparison: (a) performability, (b) error.

Figure 4-10a shows a comparison between the performability numbers collected for the Markov and simulation models while Figure 4-10b illustrates the relative errors of the simulation results as compared to those produced by the Markov model. One can see that for larger MTBF_NODE values, the two analysis techniques yield near-identical results. However, deviations between the approaches become apparent when analyzing systems exposed to more faults (i.e., small MTBF_NODE values). These deviations result from the varying levels of detail captured by each modeling approach. In this instance, the simulation model captures extra performance penalties such as network and scheduling delays that affect the performance of the system. In high fault conditions, these penalties begin to accumulate due to numerous job restarts, thus negatively impacting the system's overall performability. In addition, the deviations at the smaller MTBF_NODE values increase with the system size due to scheduling overhead as well as resource contention in the network and MDS node not modeled in the Markov-reward models.

4.3.2 Case Study: Fast Fourier Transform

The experiments conducted for this portion of the study explore the performance and scalability of the 2D FFT kernel executing on the DM system. A fault-tolerant, parallel
2D FFT serves as the baseline algorithm, which distributes an image evenly over N processing nodes and performs a logical transpose of the data via a corner turn. A single iteration of the FFT, illustrated in Figure 4-11, includes several stages of computation, inter-processor communication (i.e., corner turn), and several MDS accesses (i.e., image read and write and checkpoint operations).

Figure 4-11. Dataflow diagram of parallel 2D FFT

The results of the baseline simulation (see Figure 4-12) show that the performance of the FFT slightly worsens as the number of data nodes increases. In order to pinpoint the cause of the performance decrease of the FFT, the processor, network, and MDS characteristics were greatly enhanced (i.e., up to 1000-fold). The results in Figure 4-12 show that enhancing the processor and network has little effect on the performance of the FFT, while MDS improvements greatly decrease execution time and enhance scalability. The FFT application performance is so directly tied to MDS performance because of the high number of accesses to the MDS, the large MDS access latencies, and the serialization of accesses to the MDS.

After the MDS was verified as the bottleneck for the 2D FFT, several options were explored in order to mitigate the negative effects of the central memory. The options included algorithmic variations, enhancing the performance of the MDS, and combinations of these techniques. Table 4-7 lists the different variations. Each technique offers performance enhancements over the baseline algorithm (i.e., P-FFT). Figure 4-13 shows that the parallel FFT with distributed checkpointing and
distributed data provides the best speedup (up to 740×) over the baseline because it eliminates all MDS accesses. Individually, the distributed checkpointing and distributed data techniques result in only a minimal performance increase since the time taken to access the MDS still dominates the total execution time. MDS performance enhancements reduce the execution time of the parallel FFT by a factor of 5. Switching the FFT algorithm (see Figure 4-14) to the distributed version achieves a 2.5× speedup over the baseline, which can then be further increased to 14× and 100× by employing MDS improvements and distributed data, respectively. It is noteworthy that the distributed FFT is well suited for larger system sizes since the number of MDS accesses remains constant as system size increases.

Figure 4-12. Execution time per image for baseline and enhanced systems

Results for the parallel 2D FFT (Figure 4-13) magnify the effects of the MDS on the system's performance. Though the parallel FFT's general trend shows worse performance as system size scales, the top four lines show numerous anomalies where the performance of the FFT actually improves as the number of nodes in the system increases. These anomalies arise from the total number of MDS accesses needed to compute a single image for the entire system. For example, a dip in execution time occurs in the baseline parallel FFT algorithm when moving from 18 to 19 nodes. The total number of MDS accesses of the parallel FFT using 18 nodes is 90 while the number of accesses decreases to 76 for
the 19-node case. Since the MDS is the system's bottleneck, the execution time of the algorithm benefits from the reduction of MDS accesses. Only in the parallel FFT with distributed data and distributed checkpointing option do we see the "zig-zags" disappear due to no data transfers occurring between the nodes and the MDS. The distributed FFT (see Figure 4-14) also does not show any performance anomalies due to the nature of the algorithm. That is, the number of MDS accesses remains constant per image since only one node is responsible for computing that image.

Table 4-7. The FFT algorithmic variations and system enhancements

| Algorithm/Technique | Description | Label |
|---|---|---|
| Parallel FFT | Baseline parallel 2D FFT. | P-FFT (Baseline) |
| Parallel FFT with distributed checkpointing | Parallel 2D FFT with "nearest neighbor" checkpointing: data node i saves checkpoint data to data node (i+1) mod N, where i is a unique integer (0 ≤ i ≤ N−1) and N is the number of tasks in a specific job. | P-FFT-DCP |
| Parallel FFT with distributed data | Parallel 2D FFT with each node collecting a portion of an image for processing, thus eliminating the data retrieval and data save stages. | P-FFT-DD |
| Parallel FFT with distributed checkpointing and distributed data | Combination of both distribution techniques described above. | P-FFT-DCP-DD |
| Parallel FFT with MDS enhancements | Parallel 2D FFT using a performance-enhanced MDS. The MDS bandwidth is improved 100-fold and the access latency is reduced by a factor of 50. | P-FFT-MDSe |
| Distributed FFT | A variation of the 2D FFT that has each node process an entire image rather than a part of the image. | D-FFT |
| Distributed FFT with distributed data | Distributed 2D FFT algorithm with each node collecting an entire image to process. | D-FFT-DD |
| Distributed FFT with MDS enhancements | Distributed 2D FFT algorithm using a performance-enhanced MDS. | D-FFT-MDSe |
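The "nearest neighbor" checkpointing rule listed in Table 4-7 for P-FFT-DCP can be written down directly. The following Python fragment is only a sketch of that index mapping; the function names are hypothetical and nothing here models the checkpoint transfer itself.

```python
def checkpoint_target(i, n_tasks):
    """Return the data node that stores node i's checkpoint: (i + 1) mod N."""
    if not 0 <= i < n_tasks:
        raise ValueError("node index must satisfy 0 <= i <= N-1")
    return (i + 1) % n_tasks


def checkpoint_map(n_tasks):
    """Map every task index to the index of its checkpoint holder."""
    return {i: checkpoint_target(i, n_tasks) for i in range(n_tasks)}


print(checkpoint_map(4))  # {0: 1, 1: 2, 2: 3, 3: 0}
```

The mapping forms a ring, so each data node holds exactly one peer's checkpoint and no checkpoint traffic is directed at the MDS.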
The results in Figures 4-13 and 4-14 correspond to 1 MB images; thus, we conducted simulations to analyze the effects of larger image sizes. Our results showed that, for larger images, the algorithms and enhancements reversed the trend for the parallel FFT. That is, the execution times improved as the system size grew, though the improvements were minimal. Also, the sporadic performance jumps were amortized due to the large number of MDS accesses as compared to the variance in the number of accesses. The distributed FFT with distributed data was the only option that showed a large improvement because more processing could occur when data was more readily available for the processors. The results demonstrate that a realistic application can be effectively executed by the DM system if the mass memory subsystem is improved to allow for parallel memory accesses and distributed checkpoints.

Figure 4-13. Parallel 2D FFT execution times per image for various performance-enhancing techniques

4.3.3 Case Study: Synthetic Aperture Radar

For the next case study, we evaluate the performance and availability of a more complex application, SAR. Synthetic Aperture Radar (SAR) is a high-resolution,
broad-area imaging process used for reconnaissance, surveillance, targeting, navigation, and other operations requiring highly detailed, terrain-structural information [49]. It uses a two-dimensional, space-variant convolution that can be decomposed into two domains of processing: range and azimuth. In order to correctly transition between the range and azimuth domains, the data must be reordered via a transpose operation [50]. This case study analyzes a fault-tolerant version of SAR that incorporates an optional checkpointing stage (striped block in Figure 4-15a) to save and recover rollback points in the event of a failed job. Figure 4-15a illustrates the dataflow of the fault-tolerant SAR application.

Figure 4-14. Distributed 2D FFT execution times per image for various performance-enhancing techniques

There are various implementations of the SAR application, each differing based on the data decomposition across participating processing nodes and, thus, the amount of communication and computation conducted by each node [51]. For this study, we consider the patch-based approach, which splits each SAR image into "patches" along the azimuth dimension and distributes each patch to an available compute node. This patched version does not require communication between participating nodes although each node must
fetch and process additional data to ensure correct results. Figure 4-15b illustrates the data decomposition of the patch-based implementation.

Figure 4-15. SAR dataflow with optional checkpoint stages and patched data decomposition: (a) dataflow diagram, (b) data decomposition.

The baseline SAR application used throughout this study processes 28000×5616-element images, which is the approximate size of the images collected by the European Remote-Sensing (ERS) satellites [52]. Each element is stored within the mass data store as a complex pair of 8-bit integers (2 bytes total), the typical format used for raw SAR data. When the data is imported by the SAR application, each element is expanded to a complex pair of 32-bit floating-point numbers (8 bytes per pair) in order to improve precision and reduce the potential for round-off errors, and the range dimension is padded to 8192 elements to increase the efficiency of the FFT calculations. When SAR is complete, the padded elements in the range dimension are removed from the processed image and the remaining elements are converted to complex pairs of 16-bit short integers (4 bytes per pair), thus reducing the amount of storage needed to store the data back to the MDS. The patch data size is P×5616 elements, where P is the patch size, and the overhead data size is 1296×5616 elements per patch. For each simulation run, the DM
system is observed over ten 100-minute orbits and the radiation-hardened control and MDS nodes are assumed to experience no failures. For the fault-injection experiments, the fault control block described in Section 4.2.4 creates and inserts faults into the system using an exponential distribution with a mean, MTBF_SYSTEM, defined as:

\[ \mathrm{MTBF}_{SYSTEM} = \frac{\mathrm{MTBF}_{NODE}}{N} \tag{4-5} \]

MTBF_NODE is the mean time between faults per node and N is the number of nodes in the system. The MTBF_NODE values considered in the fault experiments are identical to those investigated in Section 4.3.1.2 and represent radiation conditions ranging from minimal to extreme. Faults are injected into a particular node based on a uniform distribution and the specific component to target on the selected node is dictated by the following percentages: SAR Application = 70%, HAM/System = 25%, and JMA = 5%. These percentage values were estimated based on the anticipated behavior of the SAR application executing on a DM data node. Also, faults can be injected into recovering components, thus restarting the recovery process.

The following sections describe the techniques and capabilities within the two modeling domains of the three-stage analysis as they are applied to evaluate the performance and availability of SAR executing on the DM system. Section 4.3.3.1 presents the amenability study of SAR using the Markov-reward model to determine if the application maps well to the DM architecture. After the study, we enter the discrete-event simulation domain in order to explore the various application and architectural options available to improve performance and availability. Section 4.3.3.2 reports the performability and system throughput of the patch-based SAR application while considering various patch sizes. It then evaluates the different checkpoint storage options with the intent of improving the fault tolerance of the application. Finally, we
investigate numerous architectural enhancements to support efficient computing on a twenty-node system and conclude with a final analysis of the proposed flight system.

4.3.3.1 Amenability study

This section begins with a preliminary analysis of the SAR application using the Markov-reward model to determine whether the workload characteristics of this particular application are appropriate for the DM system. The study presents best-case performability numbers of the patch-based SAR application employing 2048-, 4096-, and 8192-element patches executing on systems using 4, 8, 16, and 32 data nodes. The fault injection rates explored are identical to those used in Section 4.3.1.2.

Figure 4-16. Amenability results via Markov model for patch-based SAR application

From Figure 4-16, we can see that the patched SAR shows promising performability numbers for each system size and patch size while the system experiences relatively benign conditions. However, as the fault rate increases, we observe decreasing performability in all cases. Furthermore, larger patch sizes have negative effects on the performability of the system due to the increased amount of time required to complete the processing of each SAR job. In fact, the Markov model reported a difference of 32.9% in performability
between the 2048-element patch and the 8192-element patch at the highest fault rate. From these results, we observe that the patch-based SAR application produces good performability numbers for the fault rates targeted for the DM system, thus making it a good candidate for further investigation using the discrete-event simulation approach.

4.3.3.2 In-depth application analysis

For the next step in the case study, we transition into the final stage of our proposed methodology: the discrete-event simulation domain. Within this stage, we have the capability to analyze many interesting application and system options while exposing each configuration to various fault conditions. This section focuses on the various options available for the patch-based SAR application. Similar to the amenability study, we explore the impact of patch size on the system's performability while considering patches with 2048, 4096, and 8192 elements in the azimuth dimension. The objective of this study is to determine which patch size achieves the best performance.

Figure 4-17 illustrates the results collected from the simulations. The left column of charts shows the performability percentages of each patch size and the right column displays the corresponding throughputs while varying system size and MTBF_NODE. From the figures in the left column, we see that the 2048-element patch size has the highest performability in all system sizes and fault rates due to the application's short execution time. As the patch size grows, the execution time for each SAR job lengthens, thus increasing the probability of a fault interrupting the completion of a job. These results concur with those reported by the Markov-reward model in the previous section. However, the simulations produced lower performability percentages than those observed using the Markov model. In fact, a maximum reduction of 43.0% in the performability of larger systems experiencing high fault rates was reported by the discrete-event simulation models.

Another key observation is that the performability of each system decreases as the system size enlarges, thus indicating a potential bottleneck within the system. Our
Figure 4-17. System performability percentages and throughputs for patch-based SAR: (a) 2048-element performability, (b) 2048-element throughput, (c) 4096-element performability, (d) 4096-element throughput, (e) 8192-element performability, (f) 8192-element throughput.
Markov-reward model fails to show this trend due to its inability to capture architectural dependencies.

Despite observing decreased performability when using larger patches, the system throughputs for these jobs can be much higher than the 2048-element case. In fact, the figures in the right column show improved throughputs up to 1.86× and 2.29× for the 4096- and 8192-element cases, respectively, in low-fault environments. However, the 2048-element case does outperform the 8192-element patch size when the MTBF_NODE value is less than 200 seconds. In these configurations, the mean time between faults is shorter than the average execution times of the applications using larger patch sizes, thus increasing the probability of a fault causing a SAR job to fail. Finally, for each patch size and fault rate, the throughput reaches its peak value in systems with eight data nodes. Although the exact reason for this peak in performance is not fully clear from this particular study, we hypothesize that the centralized MDS is the main cause. The study conducted in the next section verifies this hypothesis by observing the effects of scaling the performance of key system components.

For this study, we are most concerned with light to moderate fault rates (i.e., MTBF_NODE values between 28000 and 3600 seconds) where the performability percentages for all patch and system sizes are very similar. Therefore, we focus on the 8192-element patch configuration for further analysis of SAR and the DM system in order to maximize the system's throughput.

Now that SAR has been configured properly with regard to performance and availability, we can evaluate various checkpointing options with the goal of improving application fault tolerance. Table 4-8 provides a list of the checkpoint options along with brief descriptions of each. For this evaluation, we observe the performability and throughput of four system sizes (4, 8, 16, and 32 data nodes) while incorporating the optional checkpoint stage for the SAR application (see Figure 4-15a). The objective of this study is to observe the impact on performance and availability of checkpointing within the
SAR application. Furthermore, we wish to measure and compare the overheads associated with storing the checkpoint data to a reliable, centralized location (MDS node) versus checkpointing to unreliable, distributed data nodes.

Table 4-8. Checkpoint options explored using patch-based SAR application

| Checkpoint option | Description | Label |
|---|---|---|
| No checkpointing | No checkpointing conducted. | NoCP |
| MDS checkpointing | Checkpoint data is stored on the MDS node. | MDSCP |
| Data node checkpointing | Checkpoint data is stored on the nearest neighbor data node. | DataCP |

Figure 4-18 shows the performability and throughput observed for each checkpoint option while varying system size and MTBF_NODE. The performability is reported by the solid lines in the charts while the throughput is represented by dotted lines. When comparing the results from all four figures, we see that as system size increases, performability decreases for all MTBF_NODE values while throughput increases. The MDSCP checkpoint option reports both the lowest performability and throughput for all cases. Again, the likely cause of this poor performance is the increased pressure placed on the centralized MDS node. In the smaller system sizes, the DataCP option reports an approximate 11% drop in performability. Also, using the DataCP option lowers throughputs by 33.5% and 27.5% for systems consisting of two and four data nodes, respectively. However, this option does show slight benefits in the 16- and 32-node systems. For both system sizes, nearly equivalent performability percentages were reported and a maximum speedup of 1.08× in throughput was measured when compared to the NoCP case.

Another important observation from this study is that performability does not always translate into raw performance (i.e., throughput). Figure 4-18b shows nearly equivalent performability percentages between the three checkpoint options at low fault rates. Conversely, the throughput of the NoCP option is 2.5× and 1.3× greater than the MDSCP and DataCP options, respectively. This observation suggests that while performability is a useful metric to evaluate the overall utilization of a degradable
PAGE 91
system in a faulty environment, it does not represent the true performance of a specific application since it does not differentiate between meaningful computation and the processing conducted due to extra mechanisms such as checkpointing.

Figure 4-18. System performability and throughput for 8192-element patch-based SAR executing on various system sizes: (a) 4 data nodes, (b) 8 data nodes, (c) 16 data nodes, (d) 32 data nodes

The results attained in this study suggest that the patch-based SAR application is not well suited for checkpointing. Although the DataCP option achieved improved throughput over the NoCP option in large system sizes, the improvement was minimal and the additional network transactions and demands on the data nodes could easily negate any gains if multiple jobs were executed on each node. In the final study of the flight system, we focus on the 8192-element patch-based SAR application with checkpointing disabled.
PAGE 92
4.3.3.3 Flight system

In this section, we explore the performability of the patch-based SAR application with 8192-element patches executing on the DM flight system composed of twenty data nodes [53]. From the previous section, we have shown that the performance of the DM system beyond eight nodes suffers due to an unidentified bottleneck within the system. In this study, we enhance and modify the system architecture to identify and remedy this bottleneck so that the system can efficiently support twenty data nodes processing SAR. Table 4-9 lists the various architectural enhancements investigated. The objective of this study is to modify the current DM system design to support twenty data nodes and to improve the performability and throughput of the system with realistic upgrades and augmentations.

Table 4-9. Architectural enhancements explored for flight system
Enhancement | Description | Label
Processor | Increases processing power of floating-point unit by 2× and increases throughput and decreases latency of middleware by 2×. | Proc
MDS storage device | Increases bandwidth and decreases latency of MDS storage device by 2×. | MDSe
MDS nodes | Incorporates N MDS nodes within system. | NMDS

Before the enhancements were studied, we measured the performability and throughput of a 20-node system using default settings that served as the baseline configuration. To identify the system bottleneck, we target two main components of the system: the data node processor and the MDS storage device. By accelerating the data processor by 2×, we assume that the floating-point unit and middleware layer attain equivalent boosts in performance. Therefore, this enhancement improves the performance of both floating-point computations and network transfers. Figure 4-19a shows that upgrading the processor provides no performance gains for large MTBF_NODE values; however, a 1.58× speedup is attained for high fault rates due to reduced execution times as compared to the baseline. When we improve the performance of the MDS storage device, we observe speedups ranging from 1.7× to 2.1× over the baseline. These
PAGE 93
speedups suggest that the MDS is the main bottleneck of the system for the SAR application. To further substantiate this claim, we augment the current system design with additional MDS nodes in order to reduce contention. Figure 4-19a illustrates that the DM system employing one extra MDS node doubles the performance observed from the baseline system for low fault rates and more than triples it in extreme faulty conditions. Employing three MDS nodes in the 20-node DM system further improves the performance of the system by 2.5× in light and moderate faulty conditions and 4.4× in high fault rates. An interesting observation from Figure 4-19 is that the speedup reported for each enhancement increases with the fault rate. This increase is caused by the reduction in execution time of the SAR application thus allowing the job to complete more frequently without experiencing a failure.

Figure 4-19. Speedups of architectural enhancements for patch-based SAR: (a) component enhancements, (b) combination enhancements

Next, we investigate the impact of combining the individual component enhancements in order to maximize system performance. From Figure 4-19b one can see that significant speedups are achieved from combining an upgraded processor and MDS storage device with a system using multiple MDS nodes. The system using two MDS nodes showed speedups ranging from 12.8× to 4× while the system incorporating three MDS nodes was found to have a maximum speedup of 17.5× in high fault conditions and 5.1× in the environments in which data nodes experience three faults per day.
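The observation that the speedup of each enhancement grows with the fault rate can be illustrated with a deliberately simple restart model. This is an assumption made purely for illustration and is not the Markov-reward or discrete-event model used in the study: faults are taken to arrive as a Poisson process with rate 1/MTBF, a job of fault-free length T restarts from scratch on any fault, and recovery time is ignored. Under those assumptions the expected time to finish one job is MTBF × (e^(T/MTBF) − 1). The job lengths below (100 and 50 seconds) are hypothetical values, not measurements from the DM system.

```python
import math


def expected_completion_time(job_time, mtbf):
    """Expected wall-clock time to finish a job that restarts on every fault.

    Assumes exponentially distributed times between faults (mean = mtbf)
    and no recovery cost; both are simplifying assumptions.
    """
    return mtbf * math.expm1(job_time / mtbf)


def speedup(base_time, fast_time, mtbf):
    """Throughput speedup from shortening a job from base_time to fast_time."""
    return (expected_completion_time(base_time, mtbf)
            / expected_completion_time(fast_time, mtbf))


if __name__ == "__main__":
    # Hypothetical job lengths: the enhancement halves the fault-free run time.
    for mtbf in (28800, 3600, 600, 60):
        print(f"MTBF = {mtbf:>5} s  speedup = {speedup(100.0, 50.0, mtbf):.2f}")
```

With rare faults the speedup approaches the plain ratio of execution times, while under frequent faults the shorter job also fails less often per attempt, so the speedup rises above that ratio. This matches the qualitative trend reported in Figure 4-19.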
PAGE 94
From the results in the architectural study, the MDS was deemed the main bottleneck of the system. In order to remedy this contention point for the 20-node flight system, we propose using three MDS nodes with enhanced storage devices. We also include upgraded data-node processors in order to accelerate data computations and middleware processing. Using this upgraded design, we conduct the final performability study of the SAR application. Figure 4-20 provides the results.

Figure 4-20. System performability and throughput of 20-node DM flight system executing patch-based SAR

The results in Figure 4-20 show that the proposed flight system performs well in most radiation conditions. In fact, the performability of the system is predicted to exceed 99.5% in light to moderate fault conditions (i.e., MTBF_NODE > 7200 seconds). The minimum performability of the system executing the patch-based SAR application is 54.0%. The throughputs achieved by the flight system were greatly improved even at the highest fault rates. The minimum throughput was measured to be 316 images per orbit and the maximum is 587 images per orbit. Assuming 100% of the system's data nodes are dedicated to processing SAR images, the throughput of the proposed flight system dictates that it should be able to support a sustainable input rate of 29.35 MB/sec of raw SAR data from the sensors. However, the expected input rate currently used by the ERS satellites was calculated to be approximately 11.90 MB/sec. This difference translates
PAGE 95
into a situation where only 9 data nodes are required to compute SAR jobs while the remaining nodes are free to perform other compute jobs, conduct test diagnostics, or simply remain idle to conserve power. This final observation tells us that the proposed DM flight system architecture executing the patch-based SAR application is more than suitable to handle the large demands for computation seen in the ERS satellites.

4.4 Conclusions

This phase of research presented an approach that combines analysis techniques using small-scale prototype systems, Markov-reward models, and discrete-event simulation models in order to quickly and accurately evaluate the performance and availability (i.e., performability) of aerospace systems and applications. The combination of these techniques allowed us to calibrate component models using experimental measurements, quickly pinpoint workloads and fault rates supported by the management software via Markov-reward models, and thoroughly investigate specific applications executing on virtually prototyped systems through discrete-event simulation. Details of each analysis technique and the extensions and enhancements to the FASE framework were outlined. Next, models were calibrated to reflect the performance of the physical testbed system in order to ensure accurate results. Finally, two case studies reported performance, scalability, and availability predictions of the 2D FFT kernel and SAR application executing on the NASA Dependable Multiprocessor to showcase the capabilities of the presented approach.

For model calibration, we used a small-scale prototype system consisting of four data nodes, one MDS node, and one control node to collect performance measurements. These measured values were then used to calibrate the Markov and simulation models, and the MDS and network subsystem models were validated using simple benchmarks to confirm the accuracy of the simulation models. A performability validation experiment was then conducted using an LU decomposition kernel to compare results between the Markov-reward models and simulation models while each was subjected to various fault
PAGE 96
rates. The results showed differences between the modeling approaches of less than 1% for low fault rates though larger errors were observed for higher fault rates due to shortcomings (e.g., modeling architectural dependencies) of the Markov model.

After validation, the DM system models were used to evaluate the performance and scalability of the system executing the 2D FFT kernel. This study exposed the centralized MDS as a potential performance bottleneck in jobs that frequently access the MDS. Various techniques were explored to mitigate the MDS bottleneck including distributed checkpointing, distributing interconnections between sensors and data processors (i.e., distributed data), algorithm variations, and improving the performance of the MDS. The study showed that eliminating extraneous MDS accesses was the best option though enhancing the MDS memory was also a good option for increasing performance. Regarding scalability, changing the algorithm from a parallel to a distributed approach and including distributed checkpointing provides the best performance improvement of all the options analyzed. For large image sizes (i.e., 64 MB), the distributed FFT with distributed data was the only option that showed a large improvement because more processing could occur when data was more readily available for the processors.

The second case study used the patch-based SAR application to study its performance and availability while running on the DM system. The Markov model was initially employed to quickly determine that the performability of the application using various patch sizes was suitable for further evaluation on the DM system. Once this preliminary analysis was successful, we continued our three-stage analysis of SAR using the discrete-event simulation models. Simulation results were similar to those produced by the Markov-reward model but with one key difference. The performability values predicted by the simulation model were much lower than those produced by the Markov model when considering high fault rates and large systems. For example, the simulation-based performability of a 32-node system experiencing an MTBF_NODE rate of 60 seconds was found to be 6.1% compared to the 49.1% reported by the Markov model. Again, the
PAGE 97
Markov model did not capture the architectural dependencies of the SAR application on the MDS node and thus the results overlooked its impact on the performability of the larger systems. The simulative model was also used to gauge the overall system throughput while executing SAR. The application produced its highest throughput (16 images per orbit) when executing on an 8-node system; however, in some cases, systems beyond this size reported reductions in throughput due to contention at the MDS. Finally, checkpointing was employed in an attempt to improve the fault tolerance of the SAR application. Two storage options were explored such that checkpoint data was stored either on the MDS node or on a neighboring data node. The results showed that for all system sizes and fault rates, checkpointing to the MDS node reduced performability and system throughput. Checkpointing to a neighboring data node was found to have slight benefits in larger system sizes; however, the gains were much too small and checkpointing was deemed unnecessary for the patch-based SAR application executing on the DM system. During the study, certain system configurations resulted in each checkpoint option producing similar performability values yet drastically different throughputs. This observation suggested that while performability is a useful metric to gauge the robustness of a degradable system under varying fault environments, it does not accurately represent the efficiency with which an application executes on a system. Thus, the results from this study reinforce the need to couple Markov and discrete-event simulation modeling for comprehensive analyses of aerospace systems and applications.

After the SAR application was configured for optimal performability and throughput, we explored architectural enhancements in order to identify and alleviate the bottlenecks within the system. The results found that the MDS was the main throttling point of the system due to the cumulative effects of each data node accessing and transmitting data during various phases of SAR. To alleviate this contention point, we enhanced the capabilities of the MDS storage device, which allowed the system to nearly double its throughput. We also observed similar speedups in system throughput by incorporating
PAGE 98
additional MDS nodes into the system. The DM system using two MDS nodes received a 2.0× boost in throughput while a design using three MDS nodes achieved a 2.5× improvement under moderate fault conditions. As the fault rate increased, greater speedups were observed for each enhancement over the baseline case. Shortened execution times reduced the effects of the higher fault rates thus increasing the overall efficiency and throughput of the system. The case study concluded with a performability evaluation of the final flight system that incorporated twenty enhanced data nodes and three enhanced MDS nodes. The proposed flight system was exposed to various fault rates and its maximum throughput was observed to be approximately 587 images per orbit in relatively light fault conditions and 316 images per orbit in the worst conditions studied. The DM's performability was observed to be over 99.5% when considering light to moderate radiation conditions (i.e., less than one fault every two hours per data node).

The work conducted during this phase of research produced a novel, 3-stage analysis process for predicting the performance and availability of high-performance, embedded systems and applications. The FASE framework was extended by incorporating a fault model library. This additional library allows FASE users to inject faults into arbitrary systems in order to conduct in-depth availability and performability analyses. The case studies demonstrated the capabilities of the enhanced framework as applied to the DM system.

The contributions and accomplishments of this work have been compiled into two manuscripts. The first manuscript presents the simulation work involved with the 2D FFT study and was published in [48]. The second paper introduces the three-stage analysis approach for predicting the performance and availability of radiation-susceptible systems and applications. This paper was submitted to ACM Transactions on Embedded Computing Systems [54].
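The sizing argument for the flight system in Section 4.3.3.3 (a sustainable 29.35 MB/sec across twenty data nodes against an expected ERS input rate of roughly 11.90 MB/sec, leaving only 9 nodes needed for SAR) can be reproduced with a short calculation. The sketch below assumes that the sustainable input rate scales linearly with the number of data nodes dedicated to SAR, which is a simplification introduced here. The function name and parameters are illustrative and are not part of FASE.

```python
import math


def nodes_required(total_nodes, sustained_rate_mb_s, required_rate_mb_s):
    """Smallest number of data nodes able to absorb the required input rate.

    Assumes (hypothetically) that each node contributes an equal share of
    the sustained rate measured with all nodes dedicated to SAR.
    """
    per_node_rate = sustained_rate_mb_s / total_nodes
    return math.ceil(required_rate_mb_s / per_node_rate)


# Figures quoted in the text: 20 data nodes, 29.35 MB/sec, 11.90 MB/sec.
sar_nodes = nodes_required(20, 29.35, 11.90)
print(sar_nodes)        # 9 under the linear-scaling assumption
print(20 - sar_nodes)   # nodes left over for other work or for idling
```

The result agrees with the 9-node figure stated in the text. A real deployment would also have to account for the MDS contention effects that the simulation study shows are not linear in the number of nodes.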
PAGE 99
CHAPTER 5
HYBRID SIMULATIONS TO IMPROVE THE ANALYSIS TIME OF DATA-INTENSIVE APPLICATIONS (PHASE 3)

The FASE framework presented in Chapter 3 laid the foundation for fast and accurate performance predictions of arbitrary applications executing on a wide range of systems. The information presented encompasses strategies to address the general issues of characterizing applications, designing and developing components and systems, and analyzing the performance of the virtual systems for the applications under study. However, in practice, the techniques and tools developed succumb to lengthy simulation times when studying data-intensive applications. These long analysis times are exacerbated when considering large-scale, parallel systems to the point at which simulation becomes prohibitive.

This section presents the work conducted for the third and final phase of the dissertation. This research focuses on reducing the time needed to simulate data-intensive systems and applications via a novel, hybrid simulation approach. The section starts by discussing the motivations followed by the presentation of background information and related research. A detailed description of the hybrid modeling approach illustrates the techniques employed to enhance and extend the FASE framework to speed up simulations evaluating data-intensive applications. Experiments and results follow, validating the successes of the proposed technique. This section also presents a case study that illustrates the accuracy and speed of the hybrid approach as it is applied to the DM system described in Chapter 4. Finally, conclusions are drawn and key insight is offered.

5.1 Introduction

As scientists discover new methods for exploration and discovery of both earth- and space-bound phenomena, the amount of data collected by the accompanying equipment can become staggering. With this increase in data, new applications are needed to analyze and interpret the information in order to identify the areas of interest. As a result, key areas of science such as the geosciences, remote sensing, and systems biology
PAGE 100
are employing applications that process typical datasets in the multi-gigabyte range and beyond. For example, NASA satellite systems performing remote sensing tasks generate 50 gigabytes of images per hour while multiple terabytes of data on the human genetic code were collected for the Human Genome project [55]. In order to efficiently process these large quantities of data, new systems must be designed that push the limits of processing and I/O technologies. However, designing the most effective system for a specific application set is a nontrivial task due to the vast number of technologies and techniques available to designers.

Simulation is often used to facilitate the design process with discrete-event simulation being a particularly effective method to verify and analyze specific characteristics of an architecture or protocol before deployment. This capability allows developers to save vast amounts of time and money by circumventing the development and evaluation of premature prototypes. It also expands the design space to include architectures that use both new and emerging technologies thus allowing system designers to select the best configuration for a particular set of applications.

Although discrete-event simulation provides many benefits, one major shortcoming is the time required to analyze complex systems executing data-intensive applications. Generally speaking, the time required to analyze a system using discrete-event simulation is highly dependent on the number of discrete events generated by the model. In high-fidelity simulation models, these large datasets are typically split into a considerable number of fragments with each fragment processed individually in order to mimic the behavior of the data transaction. This fragmentation generates a proportional number of discrete events that must be scheduled and processed throughout the simulated system, thus dramatically lengthening simulation time. Simulation times are exacerbated by the fact that the majority of the candidate systems employ numerous distributed components working in parallel to complete a given task. The combination of large, complex systems processing sizeable datasets can cause simulations to run for days, if not longer. Such lengthy analyses are often prohibitive due to unacceptable increases in design and
PAGE 101
development times. Thus, methods for improving the efficiency and speed of discrete-event simulations, while sacrificing as little accuracy as possible, are essential if simulation is to remain an effective tool for prototyping and evaluating architectures executing data-intensive applications.

In this chapter, we present a novel approach to hybrid simulation modeling that aims to reduce simulation time for data-intensive applications while retaining a high degree of accuracy. Our approach combines the accuracy of function-level models with the speed of analytical models and "micro-simulations" to achieve fast and accurate results. The modeling procedure uses a technique called function-level training to collect performance measurements from the simulated system. These measurements essentially take a snapshot of the current state of the system and are used to calibrate the analytical models. The calibrated analytical model is then employed to calculate the time required to complete the current transaction assuming that the state of the model remains relatively unchanged throughout its execution. Micro-simulations also use measurements gathered during the training period, though they are employed to account for device and contention delays at components actively participating in the data transaction.

Each hybrid transaction begins with the function-level training procedure, in which functional models are employed to collect statistics that characterize the current status of the system. When the training period is complete, the analytical model calibrates itself using the collected statistics, calculates the time required to complete the current transaction, and schedules the transmission of a final data structure using the function-level model. Finally, this last data structure traverses the system invoking micro-simulations at each component it encounters until it reaches the destination device model. With this approach, the redundant processing that typically occurs during large data transfers is replaced by a single calculation (i.e., analytical model) and a fixed number of micro-simulations that collectively approximate the ultimate outcome of these repetitious computations. The
PAGE 102
resulting hybrid methodology supports the timely evaluation and analysis of a class of applications that has traditionally stressed discrete-event simulators.

5.2 Background and Related Research

Within the past decade, technology has advanced in leaps and bounds to support both the acquisition and storage of large quantities of data relating to a wide range of fields including finance, commerce, scientific computing, and national security. Due to the overwhelming quantities of data collected, the importance of processing and understanding the information has led to a new field of study within computer science called knowledge discovery in databases (KDD). KDD, as defined in [56], is "...the non-trivial process of identifying valid, novel, and potentially useful, and ultimately understandable patterns in data." With the emergence of KDD, many new research initiatives have commenced to develop and optimize algorithms and applications that analyze data residing in large data repositories while other projects aim to design powerful computational systems to effectively sift through the large datasets to discover useful information. The Data-Intensive Computing Initiative (DICI) at Pacific Northwest National Laboratory is one such research project that is investigating scalable solutions in both software and hardware to support timely, effective analyses of large quantities of data [57]. However, in order to analyze the capabilities of new hardware designs, the cost of building prototype solutions can be prohibitive. As a result, tools such as discrete-event simulators are imperative to perform extensive, yet cost-effective evaluations of a number of design options.

Traditional, high-fidelity modeling approaches dictate that the behavior of a modeled entity should be mimicked identically to ensure correct and accurate results. As a result, much simulation time is used to fragment each transaction into smaller chunks that are transferred and processed by the corresponding functional models. This modeling approach is very accurate, but its scalability suffers due to large simulation times as the dataset grows in size. Numerous solutions have been investigated in order to remedy this
PAGE 103
scalability problem through both general and specific methods that attempt to speed up simulations. One methodology, called staged simulation, targets the reduction of discrete events in wireless network simulations. Staged simulation uses various techniques, such as function caching and incremental computations, to decrease the amount of redundant processing conducted thus supporting faster simulations of larger systems [58]. According to case studies, the method achieves a 30× improvement in simulation speed for a 1,500-node system; however, many of the proposed techniques are uniquely designed for wireless network simulations.

A general approach that has been investigated to improve simulation time is parallel discrete-event simulation (PDES). PDES frameworks attempt to reduce simulation time by distributing the workload among multiple processors [59]. PDES frameworks typically fall into one of two categories: conservative or optimistic. In conservative PDES environments, each participating node processes an event only when all pending events anywhere in the simulation that may affect the considered event have been completed [60]. By using a conservative approach, processors may remain idle for significant periods of time waiting on other processors to signal that they can proceed with their next scheduled event. CPSim [61] and Parsec [62] are two tools that support conservative PDES. Conversely, optimistic PDES avoids idle cycles by allowing events to be processed regardless of their affinities with concurrent or previous events residing in other nodes. To support this capability and maintain correctness, detection and rollback mechanisms are incorporated into the design [63]. The detection mechanism introduces overhead due to the additional processing required to identify dependencies between events while rollbacks delay simulations by repeating computations optimistically completed. Examples of tools supporting optimistic PDES include SPaDES [64], WARPED [65], and ModSim [66].

While parallel simulators have been proven effective for certain applications, they often provide very poor parallel efficiency due to the high levels of interdependencies and
PAGE 104
synchronizations inherent in discrete-event simulations. As a result, we look to alternative methods for improving the efficiency of simulating data-intensive applications.

Fluid-based modeling provides a particularly successful approach to greatly reduce the time needed to analyze large-scale systems processing sizeable datasets. Fluid-based models attempt to abstract an entire data transaction as a single fluid representation, called a flow. Typically, analytical models are used to define a flow's behavior as well as the interactions between multiple flows contending for a shared resource or channel. Conversely, the behavior of a function-level transaction is captured by the collective behavior of each individual fragment that composes the transaction as dictated by the state of the system. By using a fluid-based model, an extremely large data transaction can be modeled as a continuous event, without generating a large number of discrete events that would otherwise slow the simulation. In cases where the simulation time is dominated by the modeling of these large data transactions, such a technique has the potential to significantly reduce simulation time. For these reasons, the use of fluid-based models has become popular for applications such as large-scale network simulations [67]. Unfortunately, coarser-grained models are typically less accurate than function-level models [68], and at times less efficient due to ripple effects caused by the interaction of competing flows [69]. The ripple effect, discussed in [70], refers to the phenomenon where the rate change of a single flow causes rate changes to other flows that propagate throughout the system. In simulations handling a large number of flows, these ripples can create enough extraneous events to significantly decrease the gains in simulation efficiency obtained from using fluid-based models.

Several projects have proposed designs of hybrid simulation environments that combine functional-level and fluid-based, analytical models for accurate and efficient network simulations [71], [72], [73]. For each of these simulators, users typically decide whether a network transaction is modeled as multiple, function-level fragments or as a single, fluid-based flow. While the frameworks provide the flexibility of supporting
PAGE 105
both modeling approaches, they suffer from a few shortcomings. First, fluid streams are modeled solely using analytical models that consider traffic rates and buffer capacities. Thus, the accuracy of the fluid transaction is based entirely on the analytical model's ability to approximate contention and fluctuating network behavior without the use of detailed, functional-level models. Furthermore, each simulation environment is designed to analyze only the network-specific characteristics of a system. Therefore, the frameworks are unable to correctly model the complete behavior of a data-intensive application and system. Finally, the capabilities of each simulator are demonstrated by focusing on a single data stream in the presence of randomly generated background traffic. The demonstrations suggest that difficulties could arise when attempting to model real traffic generated by the entire system in order to mimic the behavior of an application.

Other research projects have looked beyond fluid-based modeling to reduce the complexity of simulating data-intensive applications. A framework that combines application emulators with a set of simulation models for dealing with large-scale, parallel applications is described in [74]. The application emulators dynamically feed stimuli to the simulator in the form of epochs, which represent groups of events and their high-level data dependencies processed by a processor model. While the coarse-grained processor models and the application emulators abstract away much of the work performed for a given application, the framework still relies on high-fidelity bus and disk models, which can hinder the simulation during very large data transactions.

Since the FASE framework places an emphasis on accurate and detailed modeling of data transactions, it stands to benefit greatly from the hybrid modeling methodology. For this work, the discrete-event models developed in Chapters 3 and 4 have been extended to support the hybrid modeling methodologies proposed in this chapter. The remainder of this chapter presents the hybrid-modeling extensions incorporated into FASE and results from case studies that showcase the new capabilities of the extended framework.
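Because simulation time depends heavily on the number of discrete events, a rough event count shows why fragment-level modeling of large transfers is costly and what a hybrid scheme stands to save. The sketch below is a back-of-the-envelope estimate and not FASE code. It assumes one discrete event per fragment per component traversed, and it models a hybrid transaction as F training fragments plus two extra fragments (the head and tail fragments described in Section 5.3). All function names, the component count, and the sizes are hypothetical.

```python
import math


def function_level_events(data_bytes, fragment_bytes, components):
    """Events when every fragment is simulated at every component."""
    fragments = math.ceil(data_bytes / fragment_bytes)
    return fragments * components


def hybrid_events(data_bytes, fragment_bytes, components,
                  training_fragments, extra_fragments=2):
    """Events when only training, head, and tail fragments are simulated.

    Transactions too small to leave anything for the analytical model
    fall back to the function-level count.
    """
    fragments = math.ceil(data_bytes / fragment_bytes)
    simulated = training_fragments + extra_fragments
    if fragments <= simulated:
        return fragments * components
    return simulated * components


if __name__ == "__main__":
    size = 64 * 2**20          # 64 MB transaction (illustrative)
    frag = 1024                # 1 KB fragments (illustrative)
    hops = 4                   # source, two switches, sink (illustrative)
    print(function_level_events(size, frag, hops))
    print(hybrid_events(size, frag, hops, training_fragments=4))
```

Under these assumptions the function-level count grows linearly with transaction size while the hybrid count stays fixed, which is the property the hybrid methodology is designed to exploit. The estimate ignores the cost of the analytical calculation and of the micro-simulations themselves.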
PAGE 106
5.3 Hybrid Simulation Approach

In this section, we introduce a hybrid simulation approach that can be applied to a number of component types to produce fast and accurate simulation results. Before providing the specific details on the methodology, we first define some basic terminology used throughout the chapter as well as a few basic modeling concepts. In this chapter, we refer to a generic data operation as a transaction, while a fragment and a flow correspond to function-level and fluid-based modeling, respectively. Figure 5-1a illustrates a generic hybrid system model consisting of three model types: (1) source, (2) path component, and (3) sink. Each model type consists of both functional and fluid-based models and participates in the modeling of the transactions within the system. The components labeled as a source begin new data transactions based on operations created by the corresponding component or received from an upper-layer component. The path components are intermediate models that receive both fragments and flows and propagate each data type to the next component in the path. The sink models signify the destinations for the transactions and ultimately receive the actual data sent by the corresponding sources for further processing. Also, logical channels connect the origin source, path components, and the sink model participating in each data transaction.

In order to relate these generic terms to a real-world system, we illustrate a system (see Figure 5-1b) employing three transactions. Transaction T1 corresponds to a remote disk write in which Server A (source) transmits data to a shared, remote disk (sink) using two shared switches (path components). Transaction T2 is a data transfer in which the workstation receives a message from Server B, and Transaction T3 represents a write access to the remote disk by Server B. These transactions are used throughout this section to exemplify the techniques employed in the proposed approach.

Our hybrid modeling approach incorporates three main steps in order to support quick yet accurate results: (1) function-level training, (2) analytical modeling, and (3) micro-simulation. As illustrated in Figure 5-2, the first step of a hybrid simulation
PAGE 107
(i.e., function-level training) employs the functional models at all hybrid model types participating in the transaction to collect performance measurements that characterize the current state of the system. These measurements are then used in the analytical modeling stage to configure the corresponding model and calculate the length of time the source is busy processing the current transaction. The third step, micro-simulation, uses the metrics collected by path components and sinks during function-level training to compute delays experienced by each flow due to internal mechanisms and contention.

Figure 5-1. High-level example systems employing hybrid modeling: (a) generic system, (b) real-world system

Data flows through our hybrid systems as follows. First, the source model receives the data and uses its function-level model to transmit F fragments. As these fragments traverse the system, each model collects statistics on the behavior of the transaction. After F measurements have been made by the source, one more fragment, called the head fragment, is transmitted and the statistics gathered by the source are fed to the analytical model. The analytical model calculates a time based on the statistics and effectively delays the component for the computed time. While delayed, the device can still respond to remote requests from other components though no further discrete events are created by the current data transaction. After the calculated time has elapsed, the
PAGE 108
source sends one final fragment, deemed the tail fragment, using its function-level models. This fragment can be delayed at each path-component and sink model according to the corresponding micro-simulations that occur at each model type. The data transaction is complete when the tail fragment reaches the sink model. The following subsections present in-depth details on the three steps employed in our hybrid simulation approach and discuss the interdependencies between each stage.

Figure 5-2. High-level diagram of hybrid simulation approach

5.3.1 Function-Level Training

Data-intensive applications typically perform numerous operations that manipulate large quantities of data. In order to effectively model these operations, we consider them as a two-stage process. The first stage, named the transitional period, represents the period in which the system begins a new operation. During the transitional period, the system experiences an increase in data traversing between components while these devices take the necessary actions to adjust to the new influx of data. The most common impacts observed in each affected component are buffer growth and increased contention. Once the system entities have adjusted to the changes, the operation reaches its steady-state period. In this stage, the system is effectively in equilibrium and the output rate of each component is highly predictable when considering identical inputs. Our hybrid approach uses these stages by applying the best suited models for each in order to ensure
PAGE 109
accuracy while providing opportunities to reduce simulation time. Specifically, we employ function-level models during the transitional period and then switch to the fluid-based models when a steady state has been reached. The use of high-fidelity models during the transitional period allows us to accurately capture fine-grained events that have the potential to greatly impact the system's overall performance. However, once the operation has reached its steady-state period, an analytical, fluid-based model can be employed. By using the analytical model, numerous redundant computations can be replaced by a single calculation thus saving a significant amount of time proportional to the size of the data corresponding to the current transaction.

One difficulty with this approach is configuring the analytical model to correctly represent the system's current state. To overcome this challenge, our hybrid simulation approach collects statistics on specific attributes of the system as experienced by the corresponding component while using the function-level models. By collecting these measurements, we can effectively "train" our analytical model to capture the behavior of the component based on the state of the system. This process of collecting performance metrics and other statistics during the transitional period is called function-level training.

Function-level training is conducted at the beginning of each data transaction to ensure the accuracy of each modeled transaction. As shown in Figure 5-3, the process begins with the source model transmitting the first F fragments of the data transaction using its functional model. While these fragments traverse the system, the source keeps track of key statistics such as the departure rate of fragments as dictated by the system. Meanwhile, the path-component and sink models monitor the stream of fragments received from each source in order to calculate internal statistics used within the micro-simulation stage. Once F measurements have been collected by the source, it transmits a final fragment, called the head fragment, that contains information, such as the transaction's remaining data size, that is used during micro-simulations within the path-component and sink models. This event also marks the time when the source switches from its
PAGE 110
function-level implementation to its fluid-based design. Sections 5.3.2 and 5.3.3 provide in-depth details on the fluid-based, analytical model and micro-simulations, respectively.

Figure 5-3. Function-level training procedure

The source model can be configured to switch between its function-level and fluid-based, analytical models according to two user-definable parameters. The first parameter allows users to specify the number of measurements, F, that should be taken using the function-level model before switching to the fluid model. The second defines a specific data size under which the source is forced to use only its functional model.

The first parameter is ideally set to a value that supports the use of the function-level model during the entire transitional period of a transaction while also capturing the steady-state statistics for the various component models. If the training period is too short, the component models could potentially capture metrics that misrepresent the steady-state conditions for the given transaction. However, if the training period is too long, simulation time is wasted due to the scheduling and processing of superfluous discrete events.

The second parameter allows the user to control the minimum data size that can use the hybrid approach. This parameter is important for two reasons. First, each transaction
PAGE 111
incurs overhead when employing the hybrid approach due to the extra computation required at each component to calculate and log statistics. As a result, this overhead has a greater impact on smaller transactions and can potentially increase simulation time as compared to functional modeling. More importantly, the hybrid approach was designed for large transactions and therefore can suffer from inaccuracies when considering smaller transactions. The following section provides details on the issues regarding the accuracy of the analytical model.

In order to illustrate function-level training, we examine the three transactions introduced in the previous section and displayed in Figure 5-1b. Consider data sizes of 10 KB, 8 KB, and 10 KB for Transactions T1, T2, and T3, respectively. Also, the maximum fragment size is 1 KB and all sources' training periods are configured to take four measurements (i.e., F = 4). In this example, each source transmits 4 KB of data during its training period. The source for T1 (i.e., Server A) determines it can send 0.5 maximum-sized fragments every second, while Server B can transmit 0.25 and 1.0 maximum-sized fragments every second for Transactions T2 and T3, respectively. Furthermore, the path-component and sink models calculate similar arrival rates for the transactions during the function-level training periods. After the four measurements are collected, each source transmits a head fragment containing the necessary information to be used by the path-component and sink models during their micro-simulations. Also, the sources use the calculated fragment output rates to calibrate their analytical models to determine the times the sources are busy processing their current transactions. The following sections continue this example by outlining how the analytical models and micro-simulations are used to model the behavior of the three transactions traversing the system.

5.3.2 Analytical Modeling

After the training period completes, the resulting performance metrics collected at the source model are used by a fluid-based, analytical model (see Figure 5-2) to calculate
PAGE 112
the time in which the current transaction is expected to complete. A timer is set within the source model based on the calculated time. When the timer expires, the source model outputs one last fragment, the tail fragment, which signifies the end of the transaction. The tail fragment is used by the path-component and sink models to calculate final delays incurred by each flow. These delays are computed using micro-simulations, which are described in more detail in the following section.

In order to ensure accuracy within the analytical model, three requirements must be satisfied. First, the model must be capable of capturing the steady-state behavior of the component that it represents. This requirement can be fulfilled by employing validated derivations that are either custom-built or defined in literature. Second, the model should use parameters that correspond to the current state of the system. We address this requirement through function-level training as described in the previous section. Finally, since the analytical model is invoked only one time to calculate the time required to complete each transaction, it inherently assumes the state of the system does not change within the calculated time period. However, complex systems typically have numerous transactions interacting and causing changes within the system. As a result, the last constraint requires a means to recalibrate the analytical model when a system update potentially affects the behavior of the modeled component. Instead of fully satisfying this requirement, we allow the analytical model to suffer from any inaccuracies that may occur due to system updates. However, we introduce a mechanism (described in the subsequent section) to compensate for these inaccuracies.

The speedup attained using the analytical model depends on the complexity of the mechanisms replaced and the number of discrete events created using the equivalent function-level model. The greater the model's complexity and the larger the number of discrete events created by the functional model, the more speedup is achieved using the hybrid approach. Furthermore, the complexity of the path-component and sink models also comes into play since discrete events created at the source normally lead to
PAGE 113
nontrivial computations at lower-layer components as well as the potential to create more events that further increase simulation times. Finally, the reduction of discrete events enabled by analytical models has the potential to greatly trim the memory requirements of the simulation. As a result, the employment of the hybrid approach allows users to more efficiently analyze larger systems or attain even greater speedups, especially when replacing function-level systems that require the use of disk swap space. The combination of all these factors imbues the analytical model with the potential to greatly reduce the simulation times observed in pure function-level simulations.

During each function-level training period for the three transactions considered in our example system, 4 KB of data was sent. Also, Server A and Server B calculated fragment output rates of 0.5, 0.25, and 1.0 fragments per second for Transactions T1, T2, and T3, respectively. For this example, we assume both sources use the simple analytical model expressed as

\[ \mathrm{SourceDelay} = \frac{\left\lceil T / \mathrm{Frag}_{max} \right\rceil - 1}{O} \tag{5-1} \]

where $O$ is the fragment output rate, $T$ is the remaining size of the transaction, and $\mathrm{Frag}_{max}$ is the maximum fragment size. Using this analytical model, the sources delay for 10, 12, and 5 seconds before sending the tail fragments for Transactions T1, T2, and T3, respectively. After delaying for the specified amount of time, the corresponding source transmits the current transaction's tail fragment, signifying the end of the transfer. The next section details the role of the micro-simulation technique within the example system.

5.3.3 Micro-Simulation

Thus far, our hybrid simulation approach has addressed source-based delays as configured according to a steady-state view of the system. However, the state of a system frequently changes, resulting in potential impacts that propagate back to the source models due to feedback mechanisms. There are two options available to handle this situation. First, the feedback mechanisms can be incorporated into the fluid models such
PAGE 114
that changes in the system cause source models to recalibrate their analytical equations according to the new state of the system. However, this approach can greatly suffer from an effect similar to the "ripple effect" reported in [70]. That is, changes can continuously cause source models to recalibrate, thus reverting the hybrid simulation approach into nothing more than a function-level approach with additional overhead. The second alternative attempts to take advantage of a simple observation: the most likely system changes to have substantial effects on the performance of a transaction are the additions or removals of other transactions competing for the same resource. By using this observation, the second option allows the source to complete as originally scheduled while the path-component and sink models account for delays caused by the additional transactions creating contention within the system. This approach, deemed micro-simulation, uses FIFO queues and the performance statistics collected during function-level training to calculate device and contention delays incurred by each transaction.

Micro-simulation effectively reduces the complexity of the modeled component into a simple queuing system that approximates the delays experienced by each flow. In order to simplify the system, the device's behavior is characterized by a single parameter, service rate, while each flow is represented with three parameters: (1) start time, (2) arrival rate, and (3) number of fragments. The service rate parameter specifies the rate at which the device can complete a virtual fragment. This parameter is typically calculated based on the component's performance attributes such as latency and throughput as well as the fragment size. The start time specifies the time at which a flow first arrives at the component, the arrival rate denotes the rate at which a virtual fragment arrives at the device, and the number of fragments parameter indicates the number of virtual fragments represented in a flow. The arrival rate for each flow is calculated at the path-component and sink models during the flow's function-level training while the start time and number of fragments are defined in the head fragment. Due to queuing complexities, computer simulation is used to calculate the delays for each flow in
PAGE 115
a path-component or sink model. Micro-simulations are conducted only when a new flow begins (as signified by a head fragment) or an existing flow completes (as indicated by a tail fragment). For both events, the micro-simulation's state is updated to the current simulation time; however, predictions of a flow's completion time are only made when its tail fragment is received before the micro-simulation has progressed to a state in which the flow has completed. Since the addition of new transactions cannot be forecast, yet will likely affect the delays experienced by existing flows, speculative calculations of the completion times can lead to wasted computation. As a result, micro-simulations perform the minimum amount of computation necessary to calculate the delays experienced by a flow, thus minimizing the simulation time required by this technique.

Micro-simulation improves simulation time for two reasons. First, by reducing the system into a queuing problem, micro-simulation abstracts away the complexities involved with modeling large transactions employing potentially complicated components. The queuing system can be quickly processed using computer simulation, thus requiring less computation time than explicitly simulating the system. Second, micro-simulations do not create discrete events and, thus, do not suffer from scheduling or other processing delays external to the model that have the potential to drastically increase the overall simulation time.

Figure 5-4 provides the micro-simulation that occurs at Switch B illustrated in Figure 5-1b. For this example, we assume that Switch B uses a bus-based backplane to route fragments, and therefore its micro-simulations model any contention between transactions sharing this resource. Also, Flow #1, Flow #2, and Flow #3 in Figure 5-4 correspond to Transactions T1, T2, and T3 in Figure 5-1b, respectively. The characterization parameters of the switch component and three flows are provided at the top, the state of the micro-simulation's queue with respect to time is displayed in the middle, and the micro-simulation updates and key events that trigger them are shown at the bottom. The
PAGE 116
micro-simulation's queue state illustrates the order and number of virtual fragments from each flow residing in the queue at a given point in time.

In order to illustrate the three flow types that can occur in our hybrid simulations, we assume that the transactions' start times are interleaved according to the values displayed in Figure 5-4 and each tail fragment sent by the corresponding source experiences a two-second routing delay before it is received by Switch B. Flow #1 illustrates the case in which the component receives the flow's tail fragment and then calculates the flow's completion time to be the current simulation time. In this case, the tail fragment is immediately processed by the component since no contention delays occurred. Flow #2 represents a flow that encounters source-based delays. That is, the component receives the flow's tail fragment after the flow's completion time as calculated by the micro-simulation. Similar to Flow #1, the tail fragment is processed immediately in this case. Finally, Flow #3 demonstrates the process followed when the tail fragment is received by the component but the micro-simulation determines that the flow is not complete. In this case, the device schedules the processing of the tail fragment at the corresponding flow's predicted completion time. Note that the micro-simulation's state never progresses further than the current simulation time. However, predictions are made to calculate the completion time of Flow #3 when the tail fragment arrives before the micro-simulation's state has progressed to the point at which the flow is determined to be complete.

While the micro-simulation technique eliminates potential slowdowns due to the ripple effect phenomenon, it does not compensate for all source-based inaccuracies that may occur when modeling a transaction. Consider the following scenario. Transaction A begins its training period and causes contention at a shared resource already in use by Transaction B. Assuming the shared resource can handle only one transaction at a time without causing delays, the sources of both transactions observe slower rates at which their data traverses the system. Transaction A finishes its training period just as
PAGE 117
Figure 5-4. Example three-flow micro-simulation

Transaction B completes, thus configuring its analytical source model to represent the component's behavior in the presence of contention. However, the contention no longer exists since Transaction B completed. The resulting scenario cannot be resolved through micro-simulations and requires a feedback mechanism that can notify the source of the change in the system so it can retrain. However, the feedback loop leads back to the ripple effect that micro-simulations are used to avoid. Section 5.5 discusses proposed future work that investigates techniques that incorporate a feedback mechanism to remedy these inaccuracies while mitigating the ripple effect.

Throughout this study, we consider only FIFO queues, although our design can be easily extended to support priority-based queues. Furthermore, we assume infinite queue capacities in order to simplify the modeling effort. However, the necessary mechanisms are in place to support finite queue sizes. Finally, when using this approach, the source cannot proceed with the next transaction until the tail fragment has been fully completed. As a result, mechanisms that may allow transactions to complete early (e.g., non-blocking) should be disabled or carefully controlled using this hybrid approach.
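The source delay of Equation 5-1 and the queue-based delay calculation of this section can be tied together in a short sketch. The Python below is illustrative only and is not the MLD implementation used in FASE: the names `source_delay`, `Flow`, `micro_simulate`, and `tail_processing_time` are hypothetical, virtual fragment k of a flow is assumed to arrive at start time + k / arrival rate, and the whole queue is replayed in one pass rather than being advanced only to the current simulation time as described above.

```python
import math
from dataclasses import dataclass


def source_delay(remaining_bytes, frag_max_bytes, output_rate):
    """Equation 5-1: seconds a source waits before sending its tail fragment."""
    return (math.ceil(remaining_bytes / frag_max_bytes) - 1) / output_rate


@dataclass
class Flow:
    name: str
    start_time: float    # time the flow first reaches the component (s)
    arrival_rate: float  # virtual fragments per second, measured during training
    num_fragments: int   # virtual fragments in the flow, from the head fragment


def micro_simulate(flows, service_rate):
    """Return each flow's predicted completion time at one FIFO device.

    Assumption: the queue has infinite capacity and is replayed in a single
    pass; ties in arrival time are broken by flow name.
    """
    service_time = 1.0 / service_rate
    # Assumption: fragment k of a flow arrives at start_time + k / arrival_rate.
    arrivals = sorted(
        (flow.start_time + k / flow.arrival_rate, flow.name)
        for flow in flows
        for k in range(flow.num_fragments)
    )
    device_free_at = 0.0
    completion = {}
    for arrival_time, name in arrivals:
        begin = max(arrival_time, device_free_at)  # wait while the device is busy
        device_free_at = begin + service_time
        completion[name] = device_free_at  # last update is the flow's final fragment
    return completion


def tail_processing_time(tail_arrival, predicted_completion):
    """Process the tail at once if the flow is done, else at its predicted completion."""
    return max(tail_arrival, predicted_completion)


KB = 1024
# Remaining sizes after the 4 KB training periods of T1, T2, and T3.
delays = [
    source_delay(6 * KB, KB, 0.5),
    source_delay(4 * KB, KB, 0.25),
    source_delay(6 * KB, KB, 1.0),
]
print(delays)  # [10.0, 12.0, 5.0]
```

The `max` in `tail_processing_time` covers all three flow types of Figure 5-4: a tail that arrives at or after the predicted completion time is processed immediately, while an early tail is deferred to the predicted completion time.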
PAGE 118
5.4 Results and Analysis

In this section we discuss the results of the analyses conducted to show the capabilities of our hybrid simulation approach as it is applied to the NASA Dependable Multiprocessor (see Chapter 4 for details). The first section presents the simulation setup followed by validation results using low-level benchmarks to verify the accuracy and showcase the speed of the hybrid models. Next, we evaluate the speed and accuracy of the hybrid models when considering contention at shared resources. Contention stresses the proposed hybrid methodology due to the lack of feedback loops that adjust the analytical models within the source components. After the speed and accuracy of our approach have been showcased, we apply the technique to analyze the performance of the DM system executing a data-intensive, hyperspectral imaging (HSI) application.

5.4.1 Simulation Setup

The key simulation models employed for the following experiments are listed in Table 5-1. Similar to the models developed in the previous chapters, these models were also created using MLD. Each model captures the functionality and behavior of the corresponding technology while adhering to the FASE methodology of balancing speed and accuracy. From these core component models, node and system models were developed. Finally, Table 5-2 lists the key system parameters configured to best match the performance of the prototype DM system.

The DM system incorporates two main subsystems that can benefit from hybrid simulation models: the network and mass data store. Both subsystems use the HAM for inter-node communication, which in turn employs TCP/IP as its primary communication protocol over a Gigabit Ethernet link. Therefore, we retrofitted the HAM model in the DM model library and the TCP, IP, and Ethernet models found in the pre-built FASE library with the appropriate hybrid simulation mechanisms. The MDS Server model was also modified to incorporate hybrid mechanisms to model large data accesses to the MDS device.
PAGE 119
Table 5-1. Summary of relevant simulation models

Library  Component                     Description
DM       High-Availability Middleware  Provides reliable communication between nodes in system.
DM       MDS Server                    Handles data access requests to the mass data store.
FASE     TCP Layer                     Provides TCP protocol for reliable communication between nodes.
FASE     IP Layer                      Provides IP protocol for all network transfers.
FASE     Ethernet NIC                  Provides Ethernet protocol for all network transfers. Supports multiple ports.
FASE     Ethernet Switch               Provides Ethernet connectivity between nodes. Supports variety of backplane and routing options.

Table 5-2. Key system parameters

Parameter name              Value
Processor power             1200 MIPS, 600 MFLOPS
MPI maximum throughput      57 MB/s
MPI message latency         13.6 ms
HAM buffer size             2000000 bytes
Network bandwidth           Non-blocking 1000 Mb/s
Network switch latency      5 µs
MDS bandwidth (write/read)  60/40 MB/s
MDS latency (write/read)    300/500 µs
MDS open file overhead      8 ms

The HAM model acts as a source model for most data transfers between nodes. The model receives data transactions from the application layer and fragments each transaction into messages with a maximum size of 14000 bytes when using its function-level implementation. During the training period, the round-trip time of each message is calculated in order to calibrate the analytical model used in the fluid-based implementation. The HAM model supports non-blocking communications via buffering techniques.

The TCP, IP, Ethernet NIC, and Ethernet switch models are retrofitted with hybrid mechanisms. Each of these components acts as a path component and therefore collects
PAGE 120
statistics on the flows traversing them in order to calculate delays incurred at the device via micro-simulations. The TCP model also acts as a source model since it has the capability to fragment the messages received from the HAM into TCP segments with a maximum size of 1460 bytes. The TCP model is outfitted with an analytical model that uses the congestion window, maximum window size, and acknowledgement rate to calculate the time needed to transmit N bytes of data. Currently, our fluid model does not account for TCP segments dropped in transit; however, the model can be extended to collect the necessary metrics during training and the analytical model modified to account for the effects of dropped segments.

Finally, the MDS model is also retrofitted with hybrid simulation mechanisms to provide sequential accesses for submitted disk I/O jobs. The MDS model represents a sink model and therefore uses micro-simulations to calculate delays incurred while accessing the mass storage device. More specifically, the queue service rate of the MDS's micro-simulations is configured according to the bandwidth and latency of the storage device while the arrival rates of each flow are calculated based on the performance metrics gathered during each transaction's training period. Due to the features of the MDS server, read accesses retain exclusive ownership of the MDS storage device throughout the duration of the access while write accesses can be interleaved in 14000-byte messages, the maximum message size of the HAM.

For this study, the HAM and TCP source models are configured as shown in Table 5-3. The training values were chosen to best balance the accuracy and speed of the hybrid simulations used to analyze the DM system. The following sections show that these values provide sufficient training periods to produce fast yet accurate results from each transaction. The minimum hybrid message size for each source model is initially set to zero in order to evaluate the hybrid methodology for all message sizes. It should be noted that each simulation presents a unique case for which these parameters should
PAGE 121
be calibrated to best accommodate the characteristics of the systems and applications of interest.

Table 5-3. Hybrid source model parameters

Model  Parameter name                        Initial value
HAM    Function-level training measurements  10
HAM    Minimum hybrid message size           0 MB
TCP    Function-level training measurements  100
TCP    Minimum hybrid message size           0 MB

All simulations are conducted on dedicated compute nodes with a nominal number of processes executing in the background to minimize the noise experienced during the experiments. Each node is configured with a quad-core, 2.4 GHz Xeon processor with 2 GB of main memory running a 64-bit Linux variant using kernel 2.6.9-55.

5.4.2 Performance Modeling

The first study conducted uses two simple benchmarks, PingPong and MDSTest, to show the accuracy and speed of the hybrid simulation approach under ideal conditions (i.e., no resource contention and minimal system state changes). The PingPong benchmark transfers data between two data nodes while the MDSTest benchmark writes and reads data to and from the MDS node. For both programs, the data transfers range from one byte to four gigabytes. The small-scale DM testbed system is used to collect experimental measurements to which results from both the function-level and hybrid models are compared. Consequently, the results not only showcase the capabilities of our modeling approach, but also validate the DM model's accuracy.

From Figure 5-5 and Figure 5-6, one can observe that both modeling approaches closely reproduce the throughputs observed in the testbed system. For the PingPong benchmark, we calculate nearly identical mean relative errors of 1.24% for the functional and hybrid models when compared to the experimental measurements. For this particular benchmark, the error found between the two modeling approaches was negligible. The MDS benchmark tests produced similar results with mean relative errors of 1.50% for
PAGE 122
writes and 3.01% for reads when using the hybrid approach as compared to experimentally collected measurements. Again, the MDSTest benchmark showed negligible errors between the two modeling approaches.

Figure 5-5. PingPong accuracy results

Speedup results for the PingPong benchmark are shown in Figure 5-7. From the figure, we see that employing the hybrid simulation approach in systems executing applications that transfer large data sizes can greatly improve simulation times. In fact, we observe an order of magnitude speedup for datasets as small as 4 MB, while 4 GB data transfers observe nearly an 1850× speedup. The large speedups observed at the larger data sizes are directly proportional to the number of discrete events eliminated through the use of the hybrid simulation approach. For example, a 4 GB function-level transfer using TCP requires the creation, scheduling, and processing of approximately 2.9 million discrete events while the currently configured hybrid approach uses no more than 1000 discrete events to simulate the same transaction. By simply dividing the number of
PAGE 123
Figure 5-6. MDSTest accuracy results

discrete events generated using both approaches, we calculate an ideal speedup of around 2940, thus verifying such large gains using our method.

The MDSTest speedup results (see Figure 5-8) show similar trends to those observed in the PingPong benchmark. Over 10× speedups are achieved at 4 MB datasets and maximum speedups of over 1700 and 1800 are observed at 4 GB datasets for writes and reads, respectively. The MDSTest showed larger speedups for read operations due to a slightly smaller amount of computation required to conduct micro-simulations at the MDS since reads are sequentially executed with exclusive access to the device.

Both benchmarks show speedups that begin to level off past the 1 GB message size. This behavior is due to a 1 GB message size limitation placed on data transactions that use the TCP model in order to avoid problems associated with its 32-bit variables that maintain its current state for mechanisms such as windowing and acknowledgements. It should be noted that these restrictions are only placed on component models that have inherent limitations that can potentially cause problems when considering very large data sizes. Also, both benchmarks show increasing slowdowns between 1 KB and 256 KB
PAGE 124
Figure 5-7. PingPong speedup results

Figure 5-8. MDSTest speedup results
PAGE 125
messages, which signify additional overhead when using the proposed hybrid simulation approach. This added overhead is caused by the gathering of statistics on the increasing number of fragments used in each transaction. However, once a message size of 512 KB is reached, our hybrid approach overcomes this logging penalty and shows positive speedup.

The benchmarks used in this study represent best-case scenarios for using the hybrid simulation technique. This quick study shows that our hybrid approach can provide substantial reductions in simulation time while having little impact on the accuracy of the results. However, neither benchmark experienced conditions that could potentially cause inaccuracies in the hybrid models, and thus the results represent the best-case numbers achievable by the technique as it is currently configured. More specifically, neither benchmark causes system state changes while previously configured transactions are executing. The following section investigates the accuracy and speedup attainable by the hybrid method when the system is exposed to contention and thus numerous state updates.

5.4.3 Contention Modeling

Now that we have shown the accuracy and speedups achieved using our hybrid models under relatively ideal conditions (i.e., little to no external effects influencing a transaction), we investigate the impacts of using the technique under more extreme conditions. In this study, we introduce contention into the system to observe how the hybrid models react with regard to simulation speed and accuracy. For this test, we use the MDSTest benchmark while increasing the number of nodes that simultaneously access the MDS from two to thirty-two. This scenario creates contention at two shared resources within the system: the output port in the Ethernet switch attaching the MDS node to the network and the MDS storage device. For this benchmark, each node involved first writes N bytes of data to the MDS, where N ranges from 1 B to 128 MB, and then reads the data back with a synchronization point between each operation to maximize the amount of contention and thus the stress on the hybrid simulation approach.
PAGE 126
Figure 5-9 illustrates the relative errors and speedups observed when comparing the hybrid write and read access times to the results obtained via the function-level models. Data sizes less than 64 KB are not displayed since the hybrid models use only their function-level implementations, consequently producing identical results between the two simulation approaches.

From Figure 5-9a, one can see that the hybrid write accesses show relatively large deviations in accuracy at smaller data sizes. In fact, a maximum error of just over 46% was calculated for a 256 KB dataset regardless of the number of flows tested. Furthermore, as the data size increases, the general trend observed is a decrease in error. These observations suggest that although the fluid-based models do not adequately represent the behavior of the source models at smaller data sizes, they are much more accurate at larger datasets. Meanwhile, for larger flow counts, we observe a minor increase in error when transitioning from 1 MB to 2 MB data transactions. This increase is likely due to abstractions made within the hybrid HAM model with respect to its buffer size (recall from Table 5-2 that the HAM's buffer size was set to 2 MB) and the non-blocking functionality provided by this buffer.

Figure 5-9b shows that the hybrid read accesses perform nearly identically to the function-level operations with a maximum observed error of 0.26% occurring at 64 KB. From the figure, we also observe that the number of nodes participating in the read portion of the MDSTest benchmark has a minimal effect on the accuracy of the hybrid simulation approach due to the serialization of accesses conducted by the MDS Server.

Figure 5-9c and Figure 5-9d show the speedups achieved for the hybrid write and read accesses over the various message sizes and flow counts. From the figures, one can see that both operations show significant gains in speedup as the data size increases. The maximum speedups, 595 and 760, for the write and read operations, respectively, are observed at 128 MB. A minimum speedup of 0.75 (i.e., slowdown of 25%) is observed for both writes and reads, which represents the overhead associated with the extra computation required to calculate and log statistics when using the hybrid approach.
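The two metrics discussed here can be written down explicitly. The sketch below assumes the conventional definitions, which the text does not spell out: relative error is taken against the function-level result, speedup is the ratio of function-level to hybrid wall-clock simulation time, and a speedup below one is reported as a slowdown of (1 - speedup), matching the 0.75 and 25% pairing above. All function names are hypothetical.

```python
def relative_error_percent(hybrid_result, functional_result):
    """Deviation of the hybrid result from the function-level result, in percent."""
    return abs(hybrid_result - functional_result) / functional_result * 100.0


def speedup(functional_wall_time, hybrid_wall_time):
    """Ratio of function-level to hybrid simulation (wall-clock) time."""
    return functional_wall_time / hybrid_wall_time


def slowdown_percent(speedup_value):
    """Slowdown implied by a speedup below one, following the text's convention."""
    return (1.0 - speedup_value) * 100.0


print(slowdown_percent(0.75))  # 25.0
```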
PAGE 127
(a) Write errors  (b) Read errors  (c) Write speedups  (d) Read speedups

Figure 5-9. MDSTest accuracy and speedup results using hybrid modeling approach

While large errors are observed in the hybrid models when processing smaller write operations, we must remember that the proposed approach is designed to model large data transactions. As a result, we identify a cross-over point below which data transactions within the DM system use only the function-level models and above which they use the hybrid models, in order to remedy the large errors at small data sizes while still achieving speedups for large data sizes. To pinpoint this cross-over point, we calculate the speedup-to-error ratios for the write accesses (see Figure 5-9a and Figure 5-9c) at each data size and select the value that sustains a ratio greater than one for all flow counts. When this ratio is greater than one, the speedup is larger than the error, thus suggesting that the benefit of the technique outweighs its inaccuracies. It should be noted that while this ratio is useful to quickly identify a cross-over point, it can provide values that result in potentially large
PAGE 128
inaccuracies in the cases where both the speedup and error values are large. For this study we find that the 4 MB data size produced ratios of one or greater for all flow counts. However, we select 8 MB as the minimum message size to use the hybrid models since we desire single-digit errors for all flow counts as well. In the next section, we explore the impact of configuring the HAM and TCP models to use this value as their minimum hybrid message size on a real data-intensive application.

5.4.4 Case Study

Hyperspectral imaging is a technique that combines conventional imaging and spectroscopy to identify and classify various objects within a 3D image. HSI is used in applications that include mapping, reconnaissance and surveillance, and environmental monitoring. Similar to other remote-sensing techniques, HSI typically deals with large amounts of data that in some applications must be processed in real time to provide immediate assessment of potentially threatening scenarios. In this study, we apply our hybrid simulation approach to an HSI application based on the algorithm presented in [75] in order to showcase its capabilities when analyzing a real scientific application executing on the DM system.

Figure 5-10 illustrates the dataflow diagram of the HSI application. Each participating node acquires a slab of the image, calculates the autocorrelation sample matrix (ACSM), and transmits the results to a single root node. The root node processes the data collected from each node in the weight calculation stage and broadcasts C classification constraints to each node. The nodes then classify the original image data based on the constraints and save the resulting data to construct an output image. Table 5-4 displays the number of 4-byte elements transferred in each stage in terms of pixels per row/column (N), spectral bands (L), number of processors (P), and number of classification constraints (C).

For this case study, we explore the accuracy and speed of the hybrid version of the DM flight system composed of twenty data nodes [53] as compared to its standard,
PAGE 129
Figure 5-10. The HSI data decomposition and dataflow diagram

Table 5-4. Dataset sizes for each HSI data transaction

Transaction  Dataset size (elements)
Get Data     $N^2 L / P$
Reduce       $L^2$
Broadcast    $C L$
Save Data    $N^2 C / P$

function-level counterpart. The study analyzes a total of ten image sizes listed in Table 5-5. The first five datasets represent the images processed using current and emerging implementations of the HSI application. The last five image sizes represent datasets that may be analyzed in future versions of HSI and showcase the capabilities of the hybrid technique when dealing with very large datasets. The triple line denotes the boundary beyond which extrapolation is used to approximate the simulation times required by the standard, function-level approach to complete a single HSI iteration. Extrapolation is employed as a means to quickly estimate the full functional simulation time rather than occupy resources for such large periods of time. The study also considers two configurations of the hybrid HAM and TCP models. The configuration labeled Hybrid-0MB uses the default values listed in Table 5-3 while the Hybrid-8MB configuration sets the minimum hybrid message size for both the HAM and TCP models to 8 MB. Recall from Section 5.4.3 that this
PAGE 130
data size was determined to balance the trade-off between the hybrid method's speed and accuracy for smaller data transactions.

Table 5-5. Simulation times for various HSI image sizes

                                             Simulation times
Image dimensions (elements)  Raw data size   Functional   Hybrid-0MB  Hybrid-8MB
1024×1024×64                 256 MB          2.18 min     7.56 s      39.40 s
1024×1024×128                512 MB          3.92 min     8.08 s      40.25 s
1024×1024×256                1 GB            7.37 min     9.55 s      42.48 s
1024×1024×512                2 GB            14.31 min    15.34 s     50.88 s
1024×1024×1024               4 GB            28.33 min    20.91 s     80.33 s
2048×2048×1024               16 GB           1.88 hours   23.44 s     50.17 s
4096×4096×1024               64 GB           7.51 hours   34.07 s     1.03 min
8192×8192×1024               256 GB          1.25 days    1.25 min    1.74 min
16384×16384×1024             1 TB            5.01 days    4.08 min    4.66 min
32768×32768×1024             4 TB            20.03 days   15.53 min   16.29 min

Table 5-5 shows simulation times required to complete a single iteration of HSI processing the corresponding image while Figure 5-11 displays the error and speedup of the two hybrid configurations versus the standard, function-level approach. The results show that both hybrid configurations provide very accurate results as well as improved simulation times for all image sizes. In fact, the maximum errors observed in the Hybrid-0MB and Hybrid-8MB setups are 0.77% and 0.0032%, respectively. Similar to the previous studies, we find that speedup increases with dataset size though it begins to level off as data sizes become very large. Maximum speedups of 1858 and 1771 are observed in the Hybrid-0MB and Hybrid-8MB configurations, respectively. Note that Hybrid-8MB reports less speedup for all data sizes due to the increased amount of fragmentation occurring for small to medium data transactions. However, the accuracy of this configuration for smaller image sizes is significantly improved, though the accuracy of the Hybrid-0MB configuration is very acceptable.
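The speedups quoted above can be approximately recomputed from Table 5-5 by converting each time to seconds and dividing. The sketch below is only a sanity check on three rows copied from the table; the helper names are hypothetical, and because the table entries are rounded, the ratios differ slightly from the reported values (for the 4 TB row the ratios come out near 1857 and 1771).

```python
SECONDS_PER_UNIT = {"s": 1.0, "min": 60.0, "hours": 3600.0, "days": 86400.0}


def to_seconds(value, unit):
    """Convert a (value, unit) pair from Table 5-5 into seconds."""
    return value * SECONDS_PER_UNIT[unit]


# (Functional, Hybrid-0MB, Hybrid-8MB) times copied from three rows of Table 5-5.
TABLE_5_5 = {
    "256 MB": ((2.18, "min"), (7.56, "s"), (39.40, "s")),
    "16 GB": ((1.88, "hours"), (23.44, "s"), (50.17, "s")),
    "4 TB": ((20.03, "days"), (15.53, "min"), (16.29, "min")),
}


def speedups(row):
    """Return (Hybrid-0MB speedup, Hybrid-8MB speedup) for one table row."""
    functional, hybrid_0mb, hybrid_8mb = (to_seconds(*entry) for entry in row)
    return functional / hybrid_0mb, functional / hybrid_8mb


for size, row in TABLE_5_5.items():
    s0, s8 = speedups(row)
    print(f"{size:>6}: Hybrid-0MB {s0:7.1f}x, Hybrid-8MB {s8:7.1f}x")
```

Note that the functional times for the largest images are themselves extrapolated, so the corresponding speedups are projections rather than measurements.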
PAGE 131
(a) Error  (b) Speedup

Figure 5-11. The HSI accuracy and speedup results for two hybrid configurations

5.5 Conclusions

As data-intensive applications become increasingly prevalent, more efficient systems must be designed to accommodate their special demands. In order to facilitate the design of these systems, discrete-event simulation is often used to virtually prototype candidate systems. However, the already lengthy analysis of complex systems via simulation is further hindered when evaluating data-intensive applications due to the sheer volume of data created, processed, and scheduled by the simulation environment. In this chapter, we presented a novel approach for hybrid simulation to speed the analysis of applications processing large datasets while retaining a high degree of accuracy. Our approach featured two techniques, function-level training and micro-simulations, to calibrate analytical models that depict the long-term, steady-state behaviors of the corresponding components and account for changes in the system's performance without the use of feedback mechanisms. Details on our hybrid simulation approach were outlined and the various implications of each technique used were discussed.

To showcase the capabilities of the proposed approach, we applied the techniques to the NASA Dependable Multiprocessor. First, we observed the accuracy and speedup achieved by the DM system models using the proposed techniques as compared to a pure functional model while employing two low-level benchmarks. The PingPong benchmark reported a mean relative error of 1.24% when using the hybrid simulation approach
PAGE 132
whiletheMDSTestbenchmarkshowed1.64%and3.01%errorsforwritesandreads,respectively.Furthermore,ourapproachshowedspeedupsupto1850inthePingPongbenchmarkandover1700intheMDSTest.Theselargespeedupswerearesultofthedrasticreductionofdiscreteeventsprocessedbythehybridapproach.However,theoutcomesobservedfromtheinitialtestsrepresentedbest-casenumbers.Asaresult,weanalyzedthehybridDMsystemmodelwhileexecutingtheMDSTestbenchmarkontwotothirty-twonodes.ThisscenariocausedcontentionattheMDSnodeandthereforeinvestigatedthecapabilitiesofourproposedmethodologyinmorestressingconditions.Theresultsfromthisstudyshowederrorsupto46%,thoughtheselargererrorsoccurredforsmallerdatasizes.Asthetransactionsizeincreased,theerrorsdecreasedtomorereasonablepercentagesandeventuallytovalueslessthan1%.Maximumspeedupsofthehybridapproachwereobservedtobe595and790forwritesandreads,respectively,atthemaximummessagesizeobservedi.e.,128MB.Whiletheproposedhybridsimulationapproachreportedlargeerrorsinthisstudy,theywereobservedatsmallertransactionsizesthatdisplayedsmallerspeedupswhenemployingthehybridmodels.Asaresult,weidentiedacross-overpointat8MBthatsupportedtheuseofthefunction-levelmodelstoensureaccurateresultsforsmalldatasizeswhiletransitioningtothehybridmodelsfordatasizeslargerthan8MBinordertosupportthespeedysimulationsoflargetransactions.Onceourapproachwasvalidatedanditspotentialdemonstratedusinglow-levelbenchmarks,weevaluateditsaccuracyandspeedusingahyperspectralimagingHSIapplicationexecutingontheDMightsystem.Byanalyzingvariousimagesizesusingboththestandardfunction-levelandproposedhybridsimulations,wefoundthatourapproachproducedamaximumerrorof0.7%whiledisplayingamaximumspeedupof290.Thetrendsobservedfromthestudyshowedlargererrorsforsmallerdatasetsduetoinaccuraciesintheanalyticalmodelwhiletheobservedspeedupincreasedwithlargerdatasets.Theanalysisconcludedbyanalyzingmuchlargerdatasetsusingthehybrid 132
PAGE 133
simulationapproachwithextrapolationsusedtoestimatetheamountoftimerequiredbythefunction-levelmodels.Theresultsshowedaprojected,maximumspeedupof1858.Speedupsofthismagnitudedictatethatourhybridapproachhasthepotentialtocompletemonth-longsimulationsinmereminutes.Theworkconductedduringthisnalphaseofresearchproducedahybridsimulationapproachthatemployedtwonoveltechniques,function-leveltrainingandmicro-simulations,toreducetheanalysistimesofsimulationsconsideringdata-intensiveapplications.ThesehybridsimulationmechanismswereincorporatedintotheFASEframeworkandnumerouspre-existingmodelswithintheFASEandDMmodellibrarieswereretrottedtoaccommodatethesespeedyfeatures.CasestudiesdemonstratedthecapabilitiesoftheproposedapproachasappliedtotheDMsystemexecutingadata-intensive,hyperspectralimagingapplication.ThecontributionsandaccomplishmentsofthisworkhavebeencompiledintoamanuscriptthatwassubmittedtoACMTransac-tionsonModelingandComputerSimulation[ 76 ]. 133
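The 8 MB cross-over rule summarized above amounts to a simple per-transaction model selection. The following Python sketch illustrates that rule only; it is not the FASE implementation. The function name, the returned labels, and the interpretation of 8 MB as 8 x 2^20 bytes are assumptions made for illustration.

```python
# Assumption: "8 MB" is interpreted as 8 * 2**20 bytes.
CROSSOVER_BYTES = 8 * 2**20


def select_model(transaction_bytes, crossover_bytes=CROSSOVER_BYTES):
    """Pick a model type for one data transaction.

    Transactions larger than the cross-over point use the hybrid
    (analytical) model for speed; all others use the function-level
    model for accuracy.
    """
    if transaction_bytes > crossover_bytes:
        return "hybrid"
    return "function-level"


# Hypothetical transaction sizes, for illustration only.
for size in (64 * 2**10, 8 * 2**20, 128 * 2**20):
    print(f"{size:>12d} bytes -> {select_model(size)}")
```

Setting the cross-over to zero reproduces the behavior of the Hybrid-0MB configuration for any non-empty transaction, since every such transaction is then routed to the hybrid model.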
CHAPTER 6
CONCLUSIONS

This document presented three phases of research to show the wide range of research topics addressed by the FASE framework. The first phase analyzed the various aspects involved with the design and development of a performance prediction infrastructure using application characterization and discrete-event simulation in order to balance speed and accuracy. The work laid out a generalized methodology to predict the performance of applications executing on virtually prototyped systems. The methodology was then realized through the use of an application characterization tool called Sequoia, which traced MPI function calls and measured computation time between communication events, and a pre-built model library created in MLDesigner, a hierarchical, discrete-event simulation tool. Case studies were then conducted to observe simulation accuracy and speed when compared to experimental measurements. The results showed accuracy errors within an acceptable threshold (within 25%) and simulation speeds no greater than three orders of magnitude slower than experimental processing times. After validation, the potential of FASE was showcased with an in-depth study of the Sweep3D algorithm executing on virtual systems composed of various network types, middleware implementations, processing capabilities, and other degrees of freedom in the systems' hardware and software configurations.

The framework developed in the first phase of this research created an ideal environment to evaluate high-performance, embedded space systems. Consequently, a NASA-sponsored project was used to explore the flexibility of FASE and extend the framework to support not only scalability studies of the proposed space system, but also availability studies. The work conducted in phase two expanded the FASE pre-built models to include reliable middleware technologies that monitor system health, schedule and deploy jobs, and react to and recover from faults. A fault model library was also developed to inject faults into the system in order to perform availability studies. After the necessary toolkit was in place, a three-stage analysis procedure was formulated for performance and availability evaluations. This approach allows users to calibrate component models using experimental measurements, quickly identify workloads and fault rates supported by a management software via Markov-reward models, and thoroughly investigate specific applications executing on virtually prototyped systems through discrete-event simulation. The novel analysis methodology and simulation models were then applied to explore the scalability of the 2D FFT kernel executing on the DM system. The scalability study revealed that the prime bottleneck of the system was the centralized memory, and algorithmic and architectural variations were analyzed to alleviate the problem. After the scalability analysis, the SAR application was used to study the performability of the virtual flight system consisting of twenty data nodes. The results showed good system throughput (i.e., between 300 and 600 images per orbit) and performability (i.e., over 99.5% in low radiation environments and 54.0% in extreme conditions) when the system was enhanced and augmented with improved data processors and MDS storage devices as well as extra MDS nodes to mitigate the contention point discovered in the 2D FFT case study and further substantiated in the SAR study.

The final phase of this research considered extensions to the existing FASE framework in order to overcome scalability issues with simulation time when analyzing data-intensive applications. To overcome these issues, we proposed a novel hybrid simulation approach that employs two unique techniques, function-level training and micro-simulations, to reduce the amount of time required to simulate systems executing applications with large datasets. The proposed approach combines the accuracy of function-level models, via function-level training, with the speed of analytical models and micro-simulations in order to quickly and accurately approximate the time needed to complete a data transaction. This combination drastically reduces the number of discrete events processed and scheduled by the simulator, thus resulting in simulation speedups over the pure function-level approach. The approach was then applied to the DM system executing low-level benchmarks and a hyperspectral imaging application. The low-level benchmarks showed relatively accurate results (errors of less than 7%) and order-of-magnitude speedups when considering data transactions as small as 8 MB. Furthermore, the accuracy and speedup of the hybrid simulation approach improved as the transaction size increased. In fact, the simulations reported minimum errors of less than 1% and speedups over 1700 for the low-level benchmarks and 1500 for the HSI application.
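The link between event count and simulation speed can be illustrated with a toy discrete-event loop. The Python sketch below is an assumption-laden illustration, not the FASE or MLDesigner implementation: it models a single contention-free link with a fixed bandwidth, and the packet size, bandwidth, and all function names are hypothetical. The function-level variant schedules one event per packet, whereas the analytical variant schedules a single completion event for the whole transaction.

```python
import heapq
import math


def run(events):
    """Drain a time-ordered event queue; return (final_time, events_processed)."""
    heapq.heapify(events)
    now, processed = 0.0, 0
    while events:
        now, _ = heapq.heappop(events)
        processed += 1
    return now, processed


def function_level_events(size_bytes, packet_bytes, bandwidth):
    """One arrival event per packet (the last packet may be partial)."""
    n_packets = math.ceil(size_bytes / packet_bytes)
    return [
        (min((i + 1) * packet_bytes, size_bytes) / bandwidth, i)
        for i in range(n_packets)
    ]


def analytical_events(size_bytes, bandwidth):
    """A single completion event computed from a closed-form transfer time."""
    return [(size_bytes / bandwidth, 0)]


# Hypothetical parameters: 8 MiB transaction, 1 KiB packets, 1 GB/s link.
SIZE, PACKET, BANDWIDTH = 8 * 2**20, 1024, 1e9

t_func, n_func = run(function_level_events(SIZE, PACKET, BANDWIDTH))
t_hyb, n_hyb = run(analytical_events(SIZE, BANDWIDTH))

print(f"function-level: {n_func} events, completion at {t_func:.6f} s")
print(f"analytical:     {n_hyb} event, completion at {t_hyb:.6f} s")
```

In this idealized setting both variants predict the same completion time, so the entire benefit comes from processing fewer events. With contention, as in the multi-node MDSTest study, a closed-form estimate can diverge from the packet-level result, which is the situation the calibration techniques described above are intended to address.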
APPENDIX A
EXPERIMENTAL AND SIMULATIVE SETUP

This research project incorporates two realms to explore various aspects and proposed techniques for fast and accurate performance prediction. These realms are experimental and simulative. The experimental realm deals with physical hardware and software used to construct a compute system to collect "real-world" values, which are compared to the results gathered in the second realm, simulation. The simulation realm provides an environment to explore computational systems unavailable to the researcher due to limited funds, non-existent components, or future generations of components. More details on the interactions between these realms as applied to performance modeling and prediction will be discussed in proceeding sections.

A.1 Experimental Setup

The work conducted during the course of this study employs equipment from the High-performance Computing and Simulation (HCS) Lab at the University of Florida. The HCS lab consists of 9 compute clusters, each with various resources regarding the processor, interconnect, main memory, hard disk, and software modules. Table A-1 lists the subset of clusters used for this study and their resource types, capabilities, and capacities.

Table A-1. Computation systems at the HCS Lab at UF

Cluster   CPU       CPU       CPU     Node    Memory        Special
name      type      speed     count   count                 features
Alpha     Xeon      3.2 GHz   128     32      2 GB DDR667   EMT64, Quad-core
Delta     Xeon      3.2 GHz   16      16      2 GB DDR333   EMT64, PCI-Express
Mu        Opteron   2.0 GHz   32      32      1 GB DDR400   PCI-Express, QsNetII
Lambda    Opteron   1.4 GHz   32      16      1 GB DDR333   10 Gb/s InfiniBand
Kappa     Xeon      2.4 GHz   70      35      1 GB DDR266

A.2 Simulation Setup

The modeling tool employed for this project was Mission-Level Designer (MLD) developed by MLDesign Technologies Inc. [27]. MLD is a block-oriented, discrete-event simulation environment that supports modular and hierarchical designs. At its core, MLD uses primitives, C++ code that provides some specific function such as arithmetic, dataflow switching, or data queuing. Larger modules and systems are constructed by connecting two or more primitives and/or other modules via a graphical interface. In order to further facilitate user design, MLD supplies numerous libraries with pre-built primitives and modules. Figure A-1 shows the development environment of MLD.

Figure A-1. The MLD development environment
REFERENCES

[1] O. Lubeck, Y. Luo, H. Wasserman, and F. Bassetti, "An Empirical Hierarchical Memory Model Based on Hardware Performance Counters," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, Las Vegas, NV, July 13-16, 1998.
[2] D. Kerbyson, H. Wasserman, and A. Hoisie, "Exploring Advanced Architectures Using Performance Prediction," Proc. Int'l Workshop on Innovative Architecture, Kohala Coast, Big Island, HI, Jan. 10-11, 2002.
[3] M. Salsburg, "A Statistical Approach to Computer Performance Modeling," ACM SIGMETRICS Performance Evaluation Review, vol. 15, no. 1, pp. 155-162, May 1987.
[4] E. Strohmaier, "Statistical Performance Modeling: Case Study of the NPB 2.1 Results," Proc. Third Int'l Euro-Par Conf. Parallel Processing, Passau, Germany, Aug. 26-29, 1997.
[5] R. Jain, The Art of Computer Systems Performance Analysis. John Wiley and Sons, 1991.
[6] A. Sampogna, D. Kaeli, D. Green, M. Silva, and C. Sniezek, "Performance Modeling Using Object-Oriented Execution-Driven Simulation," Proc. 29th Simulation Symp., New Orleans, LA, Apr. 8-11, 1996.
[7] S. Dwarkadas, J. Jump, and J. Sinclair, "Execution-Driven Simulation of Multiprocessors: Address and Timing Analysis," ACM Trans. Modeling and Computer Simulation, vol. 4, no. 4, pp. 314-338, Oct. 1994.
[8] R. Uhlig and T. Mudge, "Trace-driven Memory Simulation: A Survey," ACM Computing Surveys, vol. 29, no. 2, pp. 128-170, June 1997.
[9] J. Flanagan, B. Nelson, J. Archibald, and G. Thompson, "The Inaccuracy of Trace-Driven Simulation Using Incomplete Multiprogramming Trace Data," Proc. Fourth Int'l Workshop Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, San Jose, CA, Feb. 1-3, 1996.
[10] S. Moore, F. Wolf, J. Dongarra, S. Shende, P. Teller, and B. Mohr, "A Scalable Approach to MPI Application Analysis," Proc. 12th European PVM/MPI Users' Group Meeting, Sorrento, Italy, Sept. 18-21, 2005.
[11] D. Culler, J. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.
[12] MPI Forum, "MPI: A Message-Passing Interface Standard," University of Tennessee, Version 1.1, June 1995.
[13] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.
[14] A. George, R. Fogarty, J. Markwell, and M. Miars, "An Integrated Simulation Environment for Parallel and Distributed System Prototyping," Simulation, vol. 72, no. 5, pp. 283-294, May 1999.
[15] E. Deelman, A. Dube, A. Hoisie, Y. Luo, R. Oliver, D. Sundaram-Stukel, H. Wasserman, V. Adve, R. Bagrodia, J. Browne, E. Houstis, O. Lubeck, J. Rice, P. Teller, and M. Vernon, "POEMS: End-to-End Performance Design of Large Parallel Adaptive Computational Systems," IEEE Trans. Software Engineering, vol. 26, no. 11, pp. 1027-1048, Nov. 2000.
[16] R. Bagrodia, E. Deelman, S. Docy, and T. Phan, "Performance Prediction of Large Parallel Applications Using Parallel Simulations," Proc. Seventh ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 151-161, Atlanta, GA, May 1999.
[17] M. Uysal, T. Kurc, A. Sussman, and J. Saltz, "A Performance Prediction Framework for Data Intensive Applications on Large Scale Parallel Machines," Technical Report CS-TR-3918 and UMIACS-TR-98-39, University of Maryland, Department of Computer Science and UMIACS, July 1998.
[18] J. Cao, D. Kerbyson, E. Papaefstathiou, and G. Nudd, "Performance Modeling of Parallel and Distributed Computing Using PACE," Proc. 19th IEEE Int'l Performance, Computing, and Communications Conf., pp. 485-492, Phoenix, AZ, Feb. 20-22, 2000.
[19] S. Pllana and T. Fahringer, "Performance Prophet: A Performance Modeling and Prediction Tool for Parallel and Distributed Programs," Proc. Int'l Conf. Parallel Processing, Oslo, Norway, June 14-17, 2005.
[20] A. Snavely, L. Carrington, and N. Wolter, "A Framework for Performance Modeling and Prediction," Proc. 15th Supercomputing Conf., Baltimore, MD, Nov. 16-22, 2002.
[21] D. Bailey and A. Snavely, "Performance Modeling: Understanding the Present and Predicting the Future," Proc. Euro-Par Conf., Lisbon, Portugal, Aug. 30-Sept. 2, 2005.
[22] R. Badia, J. Labarta, J. Gimenez, and F. Escale, "DIMEMAS: Predicting MPI Applications Behavior in Grid Environments," Proc. Workshop on Grid Applications and Programming Tools, Seattle, WA, June 25, 2003.
[23] S. Moore, D. Cronk, K. London, and J. Dongarra, "Review of Performance Analysis Tools for MPI Parallel Programs," Technical Report, University of Tennessee, Computer Science Department, 1998.
[24] L. DeRose and D. Reed, "SvPablo: A Multi-Language Architecture-Independent Performance Analysis System," Proc. Int'l Conf. Parallel Processing, Fukushima, Japan, Sept. 1999.
[25] B. Miller, M. Callaghan, J. Cargille, J. Hollingsworth, R. Irvin, K. Karavanic, K. Kunchithapadam, and T. Newhall, "The Paradyn Parallel Performance Measurement Tool," IEEE Computer, vol. 28, no. 11, pp. 37-46, Nov. 1995.
[26] S. Shende and A. Malony, "The TAU Parallel Performance System," Int'l Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287-331, Summer 2006.
[27] G. Schorcht, I. Troxel, K. Farhangian, P. Unger, D. Zinn, C. Mick, A. George, and H. Salzwedel, "System-Level Simulation Modeling with MLDesigner," Proc. 11th Int'l Symp. Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 207-212, Orlando, FL, Oct. 12-15, 2003.
[28] J. Vetter, N. Bhatia, E. Grobelny, and P. Roth, "Capturing Petascale Application Characteristics with the Sequoia Toolkit," Proc. Parallel Computing, Malaga, Spain, Sept. 13-16, 2005.
[29] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci, "A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters," Proc. 13th Supercomputing Conf., Dallas, TX, Nov. 4-10, 2000.
[30] E. Grobelny and J. Vetter, "Extrapolating Communication Patterns of Large-scale Scientific Applications," Technical Report, University of Florida and Oak Ridge National Laboratory, 2006.
[31] O. Zaki, E. Lusk, W. Gropp, and D. Swider, "Toward Scalable Performance Visualization with Jumpshot," The Int'l Journal of High Performance Computing Applications, vol. 13, no. 2, pp. 277-288, Fall 1999.
[32] D. Gustavson and Q. Li, "The Scalable Coherent Interface (SCI)," IEEE Communications Magazine, vol. 34, no. 8, pp. 52-63, Aug. 1996.
[33] K. Koch, R. Baker, and R. Alcouffe, "Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor," Trans. of the American Nuclear Society, vol. 65, no. 198, 1992.
[34] J. Vetter and A. Yoo, "An Empirical Performance Evaluation of Scalable Scientific Applications," Proc. 15th Supercomputing Conf., Baltimore, MD, Nov. 16-22, 2002.
[35] E. Grobelny, D. Bueno, I. Troxel, A. George, and J. Vetter, "FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications," Simulation: Transactions of The Society for Modeling and Simulation International, vol. 83, no. 10, pp. 721-745, Oct. 2007.
[36] M. Griffin, "NASA 2006 Strategic Plan," National Aeronautics and Space Administration, NP-2006-02-423-HQ, Washington DC, Feb. 2006.
[37] J. Ramos, J. Samson, D. Lupia, I. Troxel, R. Subramaniyan, A. Jacobs, J. Greco, G. Cieslewski, J. Curreri, M. Fischer, E. Grobelny, A. George, V. Aggarwal, M. Patel and R. Some, "High-Performance, Dependable Multiprocessor," Proc. IEEE/AIAA Aerospace Conf., Big Sky, MT, Mar. 4-11, 2006.
[38] D. Dechant, "The Advanced Onboard Signal Processor (AOSP)," Advances in VLSI and Computer Systems, vol. 2, no. 2, pp. 69-78, Oct. 1990.
[39] M. Iacoponi and D. Vail, "The Fault Tolerance Approach of the Advanced Architecture On-Board Processor," Proc. Symp. Fault-Tolerant Computing, Chicago, IL, June 21-23, 1989.
[40] F. Chen, L. Craymer, J. Deifik, A. Fogel, D. Katz, A. Silliman Jr., R. Some, S. Upchurch and K. Whisnant, "Demonstration of the Remote Exploration and Experimentation (REE) Fault-Tolerant Parallel-Processing Supercomputer for Spacecraft Onboard Scientific Data Processing," Proc. Int'l Conf. Dependable Systems and Networks, New York, NY, June 25-28, 2000.
[41] E. Prado, P. Prewitt and E. Ille, "A Standard Approach to Spaceborne Payload Data Processing," Proc. IEEE Aerospace Conf., Big Sky, MT, March 10-17, 2001.
[42] S. Fuller, RapidIO - The Embedded System Interconnect. John Wiley & Sons, 2005.
[43] J. Meyer, "On Evaluating the Performability of Degradable Computing Systems," IEEE Trans. Computers, vol. C-29, no. 8, pp. 720-731, Aug. 1980.
[44] R. Subramaniyan, V. Aggarwal, A. Jacobs, and A. George, "FEMPI: A Lightweight Fault-tolerant MPI for Embedded Cluster Systems," Proc. Int'l Conf. Embedded Systems and Applications, Las Vegas, NV, June 26-29, 2006.
[45] R. Smith, K. Trivedi and A. Ramesh, "Performability Analysis: Measures, an Algorithm and a Case Study," IEEE Trans. Computers, vol. 37, no. 4, pp. 406-417, Apr. 1988.
[46] B. Haverkort, R. Marie, G. Rubino, and K. Trivedi (editors), Performability Modeling: Techniques and Tools. Wiley, 2001.
[47] C. Hirel, R. Sahner, X. Zang, and K. Trivedi, "Reliability and Performability Modeling Using SHARPE 2000," Proc. Int'l Conf. Computer Performance Evaluation: Modeling Techniques and Tools, Schaumburg, IL, Mar. 27-31, 2000.
[48] I. Troxel, E. Grobelny and A. George, "System Management Services for High-Performance In-situ Aerospace Computing," AIAA Journal of Aerospace Computing, Information, and Communication, vol. 4, no. 2, pp. 636-656, Feb. 2007.
[49] D. Bueno, C. Conger, A. Leko, I. Troxel and A. George, "RapidIO-based Space System Architectures for Synthetic Aperture Radar and Ground Moving Target Indicator," High-Performance Embedded Computing Workshop, MIT Lincoln Lab, Lexington, MA, Sept. 20-22, 2005.
[50] P. Meisl, M. Ito, and I. Cumming, "Parallel Synthetic Aperture Radar Processing on Workstation Networks," Proc. 10th Int'l Parallel Processing Symp., pp. 716-723, Honolulu, HI, Apr. 15-19, 1996.
[51] C. Miller, D. Payne, T. Phung, H. Siegel, and R. Williams, "Parallel Processing of Spaceborne Imaging Radar Data," Proc. Eighth Supercomputing Conf., San Diego, CA, Dec. 4-8, 1995.
[52] D. Sandwell, "SAR Image Formation: ERS SAR Processor Coded in MATLAB," http://www.geo.uzh.ch/rsl/research/SARLab/GMTILiterature/PDF/San02d.pdf, 2002.
[53] J. Samson, G. Gardner, D. Lupia, M. Patel, P. Davis, V. Aggarwal, A. George, Z. Kalbarcyzk, and R. Some, "Technology Validation: NMP ST8 Dependable Multiprocessor Project II," Proc. IEEE Aerospace Conf., Big Sky, MT, Mar. 3-10, 2007.
[54] E. Grobelny, G. Cieslewski, I. Troxel, and A. George, "Predicting the Performance of Radiation-Susceptible Aerospace Computing Systems and Applications," submitted to ACM Trans. Embedded Computing Systems.
[55] M. Cannataro, D. Talia, and P. Srimani, "Parallel Data Intensive Computing in Scientific and Commercial Applications," Parallel Computing, vol. 28, no. 5, pp. 673-704, May 2002.
[56] U. Fayyad, "Data Mining and Knowledge Discovery: Making Sense Out Of Data," IEEE Expert: Intelligent Systems and Their Applications, vol. 11, no. 5, pp. 20-25, Oct. 1996.
[57] Data-Intensive Computing Initiative (DICI), Pacific Northwest National Laboratory, http://dicomputing.pnl.gov/
[58] K. Walsh and E. Sirer, "Staged Simulation: A General Technique For Improving Simulation Scale and Performance," ACM Trans. Modeling and Computer Simulations, vol. 14, no. 2, pp. 170-195, Apr. 2004.
[59] R. Fujimoto, "Parallel Simulation: Parallel and Distributed Simulation Systems," Proc. Winter Simulation Conf., pp. 147-157, Arlington, VA, Dec. 9-12, 2001.
[60] D. Nicol, "Principles of Conservative Parallel Simulation," Proc. Winter Simulation Conf., pp. 128-135, Coronado, CA, Dec. 8-11, 1996.
[61] B. Groselj, "CPSim: A Tool for Creating Scalable Discrete-Event Simulations," Proc. Winter Simulation Conference, pp. 579-583, Arlington, VA, Dec. 3-6, 1995.
[62] R. Bagrodia, R. Meyer, M. Takai, Y. Chen, X. Zeng, J. Martin, B. Park and H. Song, "Parsec: A Parallel Simulation Environment for Complex Systems," IEEE Computer, vol. 31, no. 10, pp. 77-85, Oct. 1998.
[63] R. Fujimoto, "Optimistic Approaches to Parallel Discrete-event Simulation," Trans. of the Society for Computer Simulation International, vol. 7, no. 2, pp. 153-191, June 1990.
[64] Y. Teo, S. Tay and S. Kong, "SPaDES: An Environment for Structured Parallel Simulation," Technical Report, TR20/96, Department of Information Systems and Computer Science, National University of Singapore, Singapore, Oct. 1996.
[65] D. Martin, T. McBrayer, and P. Wilsey, "WARPED: A Time Warp Simulation Kernel for Analysis and Application Development," Proc. 29th Hawaii Int'l Conf. System Sciences Volume 1: Software Technology and Architecture, Jan. 3-6, 1996, pp. 383-386.
[66] J. West and A. Mullarney, "ModSim: A Language for Distributed Simulation," Proc. SCS Multiconf. Distributed Simulation, pp. 155-159, San Diego, CA, Feb. 3-5, 1988.
[67] Y. Liu, F. Presti, V. Misra, D. Towsley and Y. Gu, "Fluid Models and Solutions for Large-scale IP Networks," Proc. ACM SIGMETRICS Int'l Conf. Measurement and Modeling Computer Systems, pp. 91-101, San Diego, CA, June 10-14, 2003.
[68] A. Yan and W. Gong, "Time-driven Fluid Simulation for High Speed Networks," IEEE Trans. Information Theory, vol. 45, no. 5, pp. 1588-1599, June 1999.
[69] B. Liu, Y. Gao, J. Kurose, D. Towsley and W. Gong, "Fluid Simulation of Large-scale Networks: Issues and Tradeoffs," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 2136-2142, Las Vegas, NV, June 28-July 1, 1999.
[70] G. Kesidis, A. Singh, D. Cheung and W. Kwok, "Feasibility of Fluid-Driven Simulation for ATM Network," Proc. IEEE Global Telecommunications Conf., pp. 2013-2017, London, England, Nov. 18-22, 1996.
[71] B. Melamed and S. Pan, "HNS: A Streamlined Hybrid Network Simulator," ACM Trans. Modeling and Computer Simulation, vol. 14, no. 3, pp. 251-277, July 2004.
[72] G. Riley, T. Jafaar and R. Fujimoto, "Integrated Fluid and Packet Network Simulations," Proc. IEEE Int'l Symp. Modeling, Analysis and Simulation of Computer and Telecommunications Systems, pp. 511-518, Fort Worth, TX, Oct. 11-16, 2002.
[73] C. Kiddle, R. Simmonds, C. Williamson and B. Unger, "Hybrid Packet/Fluid Flow Network Simulation," Proc. 17th Workshop Parallel and Distributed Simulation, pp. 143-152, San Diego, CA, June 10-13, 2003.
[74] M. Uysal, T. Kurc, A. Sussman and J. Saltz, "A Performance Prediction Framework for Data Intensive Applications on Large-Scale Parallel Machines," Proc. Fourth Int'l Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, pp. 243-258, Pittsburgh, PA, May 28-30, 1998.
[75] C. Chang, H. Ren, and S. Chiang, "Real-time Processing Algorithms for Target Detection and Classification in Hyperspectral Imagery," IEEE Trans. Geoscience and Remote Sensing, vol. 39, no. 4, pp. 760-768, Apr. 2001.
[76] E. Grobelny, C. Reardon, and A. George, "A Hybrid Simulation Approach to Reduce Analysis Time of Data-Intensive Applications," submitted to ACM Trans. Modeling and Computer Simulation.
BIOGRAPHICAL SKETCH

Eric Grobelny began attending the University of Florida in Fall 1998 and received his B.S. in 2002 and M.E. in 2004. He conducted research at the High-performance Computing and Simulation (HCS) Laboratory under the supervision of Dr. Alan George for six years, focusing on performance analysis and prediction for high-performance computing systems and applications. His other interests include high-performance embedded computing for aerospace systems and applications, simulation-based fault injection, and high-performance interconnect technologies. As a member of the HCS lab, he also worked on numerous side projects including developing an MPI communication layer for satellite systems, investigating performance enhancements for low-locality applications, and exploring techniques and best practices for disaster recovery and mission assurance in dynamic, high-performance environments. Eric has accepted a job at Honeywell Space Systems in Clearwater, Florida.