Citation
Techniques and Tools for an Autonomic Approach to Fault and Performance Management in Map-Reduce

Material Information

Title:
Techniques and Tools for an Autonomic Approach to Fault and Performance Management in Map-Reduce
Creator:
Kadirvel, Selvi
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
1 online resource (196 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
FORTES,JOSE A
Committee Co-Chair:
KHARGONEKAR,PRAMOD P
Committee Members:
FIGUEIREDO,RENATO JANSEN
CHEN,SHIGANG
Graduation Date:
5/3/2014

Subjects

Subjects / Keywords:
Datasets ( jstor )
Maps ( jstor )
Modeling ( jstor )
Parametric models ( jstor )
Performance prediction ( jstor )
Petri nets ( jstor )
Simulations ( jstor )
Slavery ( jstor )
Uniform Resource Locators ( jstor )
Workloads ( jstor )
Electrical and Computer Engineering -- Dissertations, Academic -- UF
autonomic-computing -- distributed-systems -- fault-tolerance -- machine-learning -- map-reduce -- performance
Genre:
Electronic Thesis or Dissertation
born-digital ( sobekcm )
Electrical and Computer Engineering thesis, Ph.D.

Notes

Abstract:
Map-Reduce is a programming paradigm and software implementation for executing data-intensive applications on a cluster of computers. Computational environments such as data centers, grids and clouds that execute Map-Reduce applications experience faults, and hence performance degradations, because of a combination of factors that includes scale, complex interdependencies, use of commodity servers, sharing of resources, heterogeneity and geographical distribution of constituent components. This dissertation proposes various techniques and tools that help facilitate an autonomic approach to fault and performance management in Map-Reduce systems. Fault-managed Map-Reduce (FMR) handles faults in an online, on-demand and closed-loop manner through the use of performance prediction, anomaly detection, dynamic resource scaling and other built-in features of Hadoop. Performance prediction uses machine learning based regression methods that have short prediction computation times and high prediction accuracy. Anomaly detection for proactively identifying a faulty node is performed using a sparse-coding based technique. FMR reduces runtime penalties from as high as 180% to an average of 14%. This dissertation also presents two tools, MRNets and FaultPlay, to facilitate performance and fault management studies in Map-Reduce. Challenges in fault studies on a Map-Reduce platform include lack of access to real-world failure datasets and the inability to precisely evaluate and compare different fault management solutions. This is because the testbed, workload and faultload used in different solutions are difficult to recreate exactly, owing to insufficient information, different available resources or the need for many hours of replicated effort. FaultPlay overcomes these challenges through software-defined and easily reproducible fault studies on Map-Reduce platforms.
FaultPlay enables a variety of characterization and management studies to be conducted by providing modules for job creation, fault injection, distributed monitoring, log parsing and deploying recovery-based management solutions. For long-running jobs and workloads of multiple Map-Reduce jobs, empirical studies and the generation of representative training data for machine-learning models can become time-consuming. This dissertation presents an orthogonal approach to fast and accurate performance modeling, called MRNets, that is based on Petri nets, a discrete event modeling methodology. Petri nets provide a formal method grounded in well-founded mathematical properties to capture both system structure and behavior. Models are executable and used to simulate system behavior. MRNets facilitate various performance analyses and provide a key advantage through their graphical representation, which makes it easy to design, use and extend these models for further performance studies. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2014.
Local:
Adviser: FORTES,JOSE A.
Local:
Co-adviser: KHARGONEKAR,PRAMOD P.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2015-05-31
Statement of Responsibility:
by Selvi Kadirvel.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright by Selvi Kadirvel. Permission granted to University of Florida to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
5/31/2015
Resource Identifier:
907379381 ( OCLC )
Classification:
LD1780 2014 ( lcc )

Full Text



TECHNIQUES AND TOOLS FOR AN AUTONOMIC APPROACH TO FAULT AND PERFORMANCE MANAGEMENT IN MAP-REDUCE

By

SELVI KADIRVEL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2014

© 2014 Selvi Kadirvel

To my dearest father and mother

ACKNOWLEDGMENTS

First and foremost, I would like to express my heartfelt thanks to Dr. Fortes for giving me the opportunity to pursue a PhD and for making the process a very valuable and fulfilling learning experience. Dr. Fortes has encouraged and guided me in becoming an independent researcher. His timely words of advice and his continued support have been invaluable to the successful completion of this dissertation.

I would like to thank my committee members, Dr. Renato Figueiredo, Dr. Pramod Khargonekar and Dr. Shigang Chen, for their time, effort and feedback in supporting and helping to improve this dissertation.

I am deeply thankful to Dr. Mark Sheplak, Dr. Lou Cattafesta and Dr. Toshikazu Nishida for introducing me to the joys of research and systematic experimentation at the Interdisciplinary Microsystems group, which rooted in me the motivation to pursue a PhD.

I also want to thank Dr. Figueiredo for introducing me to the Advanced Computing and Information Systems Lab. His course on virtual computing and his kind words at critical junctures have helped lay the foundation for a career in systems.

Dr. Khargonekar set the spark for my fascination for machine-learning through his Systems Theory class. His inspiring research anecdotes in each class enabled me to aim higher than I thought was possible. His passion for research rekindled my spirits and helped me pursue new avenues for research.

Dr. Anand Rangarajan, Dr. Arunava Bannerjee and Dr. Jeffrey Ho played a critical role in helping me navigate the landscape of machine-learning. They have provided me with copious amounts of time and help and amazing mentorship and I will always be grateful for their kind concern and generosity.

Dr. Charlie Taylor (High Performance Computing Center) and Dr. Bill Farmerie (Interdisciplinary Center for Biotechnology Research) have been wonderful mentors from the initial years of my graduate program. I am truly thankful for their words of encouragement and the many lessons I have learnt working with them.

I would like to thank ACIS lab staff members Cathy Sembajwe-Reeves, Dina Trenkamp, Julie Walters and Matthew Collins, who have been helpful over the years and helped make the lives of students in the lab so much better. Their smiles and willingness to chat have made my lab experiences lovely.

I would like to especially thank my friends for being by my side and encouraging me through the highs and lows. Vineet Chadha, Girish Venkatasubramanian, Prapaporn Rattanatamrong, Pierre St. Juste, Sridhar Srinivasan and Surendra Boppana have made graduate school a lot more fun.

My sincere gratitude and thanks go to Debdeep for being my constant source of support and encouragement and most importantly for helping keep the conversations with my conscience alight. He has given me the strength and inspiration to overcome every hurdle along the way. I would like to thank Kalyani Aunty for her prayers and eloquent support of the pursuit of the PhD that gave me much-needed bursts of strength.

My heartfelt thanks go out to my mother, for without the life skills she taught me and her persistent prayers, I would not be the person I am today. Her warmth, love and affection and the spirituality she helped me imbibe and value have been and will be my guides throughout. I want to thank my father who has always been my role model. His dedication, love and dogged pursuit of perfection at home and at work, which I have had the privilege of observing from my formative years, have been and will be my infinite source of energy. I especially want to thank my parents for their patience despite all odds and the many opportunities and privileges they have endowed my life with.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Motivation
  1.2 Challenges in Fault and Performance Management
  1.3 Autonomic Approach to Fault and Performance Management
    1.3.1 Overview of Proposed Approach
    1.3.2 Key Features
    1.3.3 Non-Goals
  1.4 Tools for Fault and Performance Studies in Map-Reduce
    1.4.1 FaultPlay: A fault-injection and performance management testbed
    1.4.2 MRNets: Performance modeling using Petri nets
  1.5 Dissertation Outline

2 PERFORMANCE AND FAULT CHARACTERIZATION
  2.1 Background and Introduction
  2.2 Fault and Performance Characterization: Simulation Results
    2.2.1 Hadoop Evaluation
    2.2.2 Improved Fault Tolerance
  2.3 Fault and Performance Characterization: Empirical Results
    2.3.1 Hadoop Evaluation
    2.3.2 Improved Fault Tolerance
  2.4 Summary and Conclusions

3 PERFORMANCE PREDICTION IN MAP-REDUCE
  3.1 Background
  3.2 Related work
  3.3 Characteristics of a Grey-Box Approach
  3.4 Experimental Setup
  3.5 Empirical Motivation
    3.5.1 Influencing Factors and Non-linearities
    3.5.2 Heterogeneity
  3.6 Performance Prediction
    3.6.1 Desired Characteristics of the Prediction Technique
    3.6.2 Feature Processing
    3.6.3 Prediction Techniques
  3.7 Evaluation Results
  3.8 Summary and Contributions

4 FAULT MANAGEMENT IN MAP-REDUCE
  4.1 Introduction
  4.2 Background and Related Work
  4.3 Design of Fault-Managed Map-Reduce
    4.3.1 Anomaly Detection
    4.3.2 Remediation through Dynamic Resource Scaling
  4.4 Implementation of Fault-Managed Map-Reduce
  4.5 Experimental Evaluation
    4.5.1 Anomaly Detection
    4.5.2 Dynamic Resource Scaling
    4.5.3 Discussion
  4.6 Summary and Contributions

5 FAULT-PLAY: A FAULT INJECTION AND MANAGEMENT TESTBED
  5.1 Background and Motivation
    5.1.1 Real-world failure datasets
    5.1.2 Fault-injection based studies
  5.2 A Typical Fault Management Study
    5.2.1 Infrastructure setup and deployment
    5.2.2 Selection of workload and faultload
    5.2.3 Characterization of system behavior
    5.2.4 Implementation of fault management solutions
  5.3 Design and Implementation
    5.3.1 Experiment engine
    5.3.2 Job engine
    5.3.3 Fault engine
    5.3.4 Remedy manager
  5.4 Evaluation on a virtualized cluster
    5.4.1 Description of Infrastructure
    5.4.2 Performance Characterization Studies
    5.4.3 Fault Characterization Studies
    5.4.4 Fault Management Studies
  5.5 Related Work
  5.6 Summary and Conclusions

6 MRNETS: PERFORMANCE MODELING OF MAP-REDUCE USING PETRI NETS
  6.1 Background
  6.2 Related Work
  6.3 Overview of Petri nets
  6.4 Modeling Map-Reduce
    6.4.1 Modeling a single Map-Reduce job
      6.4.1.1 Places and Transitions
      6.4.1.2 Arc inscriptions
      6.4.1.3 Model parameterization
      6.4.1.4 Model simulation
    6.4.2 Modeling node faults
    6.4.3 Modeling a workload of Map-Reduce jobs
  6.5 Model Evaluation
    6.5.1 Single job model accuracy
    6.5.2 Workload model accuracy
    6.5.3 Model accuracy of single job with node faults
    6.5.4 What-if analysis
  6.6 Conclusions

7 CONCLUSIONS AND FUTURE DIRECTIONS

APPENDIX: PETRI NETS FOR SYSTEM CONTROL IN HEALTH MANAGEMENT
  A.1 Background
  A.2 Health Management
    A.2.1 Health Indicators
    A.2.2 Monitoring
    A.2.3 Diagnosis
    A.2.4 Prognosis
    A.2.5 Planning
    A.2.6 Remaining Useful Life Management
    A.2.7 Remediation
  A.3 Modeling Framework
    A.3.1 Requirements of a Suitable Model
    A.3.2 Petri Net Basics
    A.3.3 Petri Net Extensions of Interest
    A.3.4 Hierarchy Modeling
  A.4 Health Management Modeling Methodology
    A.4.1 Step I - Modeling System Structure and Dependencies
    A.4.2 Step II - Modeling Useful Life Management
    A.4.3 Step III - Modeling Health Recovery
    A.4.4 Step IV - Health Management Policies
  A.5 Proof-Of-Concept Implementation
    A.5.1 System Overview
    A.5.2 Petri Net Model
    A.5.3 Evaluation
  A.6 Related Work
  A.7 Summary and Contributions
  A.8 Benefits of Petri Nets
  A.9 Petri Net Properties of Interest

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
3-1 Factors influencing performance
3-2 Search algorithms used for feature selection
3-3 Prediction accuracy evaluation
4-1 Training and test durations for anomaly detection
4-2 Comparison of anomaly detection techniques
5-1 List of parameters in a job configuration file
5-2 List of faults and their implementation
6-1 Places in the Petri net model
6-2 Transitions in the Petri net model
6-3 Model parameters
6-4 Augmented places in the Petri net model to capture node faults
6-5 Augmented transitions in the Petri net model to capture node faults
6-6 Additional model parameters to capture node faults
6-7 Model accuracy - Single job
6-8 Model Accuracy - Single job using best fit distribution
6-9 Workload execution time - Two jobs in a workload
6-10 Workload execution time - Four jobs in a workload
6-11 Accuracy of Petri net model incorporating faults
6-12 Model accuracy - Dataset size variations
6-13 Model accuracy - Cluster size variations
A-1 Activities and resource dependencies
A-2 List of remediation actions

LIST OF FIGURES

Figure
1-1 MAPEK control loop
1-2 Fault management control loop
2-1 Overview of Hadoop showing interactions between the JobTracker, NameNode, TaskTracker, DataNode hosted on the master and slave nodes.
2-2 Overview of Hadoop fault-tolerance showing increased job execution time needed for re-executing map tasks that were run on the failed node.
2-3 Effect of number of nodes in a Hadoop job on the job completion-time penalty. Dataset size is doubled when number of nodes is doubled to allow for convenient comparison.
2-4 Effect of fault-occurrence time on job completion-time penalty using Hadoop's built-in fault tolerance.
2-5 Effect of node fault detection timeout on job completion-time penalty using Hadoop's built-in fault tolerance.
2-6 Effect of resource scaling on jobs with single node faults
2-7 Multiple node faults: Effect of resource scaling on jobs with two and four-node faults
2-8 Components of the experimental testbed
2-9 Execution-time comparison for different cluster sizes
2-10 Comparison between execution times for the no-fault, early-fault, mid-fault and late-fault cases
2-11 Execution-time comparison for different fault-occurrence times
2-12 Execution-time comparison for Multi-node Faults
2-13 Comparison between execution times in the Terasort application for three different chunk sizes.
2-14 Dynamic resource scaling intensity
3-1 A typical virtualized environment hosting a Map-Reduce platform
3-2 Map-Reduce platform viewed as a black-box
3-3 Differences in job completion time
3-4 Job runtime variations for different choices of configuration parameters
3-5 Task runtime distribution in the Map and Reduce phase of a Wordcount application run
3-6 Task runtime distribution in the Map and Reduce phase of a Terasort application run
3-7 Difference in correlation coefficients by the inclusion of features in the 'Fault' category and 'Configuration' category for different prediction techniques
3-8 Regression Error Characteristic (REC) curves comparing different techniques for predicting job runtime for the WordCount application
3-9 Comparative importance of different factors in the 'Fault' Category
4-1 Fault management control loop
4-2 Overview of Hadoop showing interactions between the JobTracker, NameNode, TaskTracker, DataNode hosted on the master and slave nodes.
4-3 Fault detection in Hadoop and improved fault detection by FMR
4-4 Execution time increase for Hadoop wordcount jobs
4-5 A simplified diagram to illustrate the anomaly detection approach.
4-6 Errors in the sparse representation of training and test feature vectors for three Map-Reduce application datasets.
4-7 Pseudocode of the scaling heuristic in FMR
4-8 Fault detection in Hadoop
4-9 Application heartbeat waves
4-10 Evaluation of sparse coding based anomaly detection
4-11 Sensitivity analysis
4-12 ROC curves showing performance of anomaly detection
4-13 Prediction accuracy of the model tree algorithm
4-14 Evaluation of FMR scaling heuristic
4-15 Swimlane plots of three Map-Reduce jobs
4-16 Comparison of FMR with speculative execution
4-17 Comparison of job execution times of a PiEstimation job
5-1 Fault detection in Hadoop
5-2 Job completion time for PiEstimation application when executed on 12, 24, 36 and 48 nodes.
5-3 Job completion time for Terasort application when executed on dataset sizes of 1GB, 5GB, 10GB and 15GB
5-4 Job completion time when multiple faults are injected into a 24-node PiEstimation job.
5-5 Job completion time when fault intensity is varied for a CPU hog fault injected into a 24 node PiEstimation job.
5-6 Time series of CPU usage values of a map task on 2 healthy nodes
5-7 Time series of CPU usage values of a map task on a node with an injected CPU hog process
5-8 Improvement in performance of a Map-reduce job with the inclusion of a management policy
5-9 Sensitivity analysis of a management policy: Job completion time improvement for different values of CPU usage threshold.
6-1 A simple Petri net model showing a transition firing
6-2 Overview of Hadoop showing interactions between the JobTracker, NameNode, TaskTracker, DataNode hosted on the master and slave nodes.
6-3 A simple Petri net model of a Map-Reduce job
6-4 Augmented Petri net model of a Map-Reduce job
6-5 Comparison of Map-Reduce job execution time between testbed and model
6-6 Modeling a workload with 2 Map-Reduce jobs
6-7 Performance modeling of jobs with injected node faults
6-8 Difference in execution time observed empirically and through model simulations for different dataset sizes
6-9 Difference in execution time observed empirically and through model simulations for different cluster sizes
A-1 Recovery actions corresponding to different steps of the progression from fault precursors to component faults to subsystem faults to system failure. Larger circles represent larger recovery costs.
A-2 Health Management Architecture
A-3 Diagram depicting system health states, actions and events that determine state transitions.
A-4 Prolonging system RUL by extending the RUL of subsystem labelled COMP1 to match that of subsystem labelled COMP2
A-5 Overview of target system
A-6 Feedback control loop of target system. Reference input is the desired rate of Remaining Useful Life (RUL) change and measured output is the measured rate of RUL change.
A-7 Graph showing measured output from target system for different controller gains and reference input signal to closed loop system.
A-8 Simple Petri net before and after firing.
A-9 Hierarchical Petri Net Model
A-10 Partial Petri net model showing activity dependencies on other activities and resource availability
A-11 Partial Petri net model of queued and execution stages of a queue-based system showing interaction between stages
A-12 Partial Petri net model showing added places and transitions
A-13 Block diagram of testbed and global system manager used for proof-of-concept implementation.
A-14 Petri net model of the IT system under study
A-15 Feedback controller loop using adaptive control theory. Model parameter estimation and controller design are performed online.
A-16 Average resource availability for different job sizes obtained using 'average number of tokens' metric through steady state analysis of the Petri net model
A-17 Average resource availability for different job arrival rates obtained using 'average number of tokens' metric through steady state analysis of the Petri net model
A-18 Graph showing operation of the proportional integral controller used to perform RUL extension in the job execution stage.
A-19 Graph showing resource consumption on a compute node
A-20 Graph showing distributions of time duration spent in the three functional states for different intensity workloads

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TECHNIQUES AND TOOLS FOR AN AUTONOMIC APPROACH TO FAULT AND PERFORMANCE MANAGEMENT IN MAP-REDUCE

By

Selvi Kadirvel

May 2014

Chair: José A. B. Fortes
Major: Electrical and Computer Engineering

Map-Reduce is a programming paradigm and software implementation for executing data-intensive applications on a cluster of computers. Computational environments such as data centers, grids and clouds that execute Map-Reduce applications experience various faults and performance degradations. Some factors that lead to an increasing number of faults are scale, complex interdependencies, use of commodity servers and resource sharing. This dissertation proposes various techniques and tools that help facilitate an autonomic solution to fault and performance management in Map-Reduce systems.

Fault-managed Map-Reduce (FMR), presented in the first part of this dissertation, handles faults through an online, on-demand and closed-loop solution. FMR uses novel methods for performance prediction and anomaly detection to achieve this. Performance is predicted through machine learning based regression methods that have short prediction computation times and high prediction accuracy. Anomaly detection for proactively identifying a faulty node is performed using a novel, sparse-coding based technique. FMR uses these components along with a simple scaling heuristic to successfully mitigate runtime penalties.

The second part of this dissertation presents two tools, namely FaultPlay and MRNets, which facilitate performance and fault management studies in Map-Reduce.

FaultPlay is a software framework and implementation to set up testbeds for failure studies and to perform experiments that deploy, evaluate, and compare different fault management solutions. Current challenges in fault studies in a Map-Reduce platform include lack of access to real-world failure datasets and the inability to evaluate and compare different fault management solutions. FaultPlay overcomes these challenges through software-defined and easily reproducible fault experiments. A variety of characterization and management studies are enabled through modules for job creation, fault injection, distributed monitoring, log parsing and deployment of recovery-based management solutions.

The MRNets tool is motivated by the need to extend autonomic management techniques to a workload suite of multiple Map-Reduce jobs. As a critical component towards this goal, MRNets is a modeling methodology based on Petri nets that enables performance modeling of a workload of jobs. The alternative of using machine-learning models for performance prediction is time-consuming and even infeasible for the case of a workload of jobs.

Through Fault-managed Map-Reduce, FaultPlay and MRNets, this dissertation provides a suite of techniques and tools for autonomic performance and fault management in Map-Reduce platforms.
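As a rough illustration of the Petri net formalism that MRNets builds on, the sketch below implements an untimed place/transition net and uses it to model a toy job in which map tasks compete for execution slots before a single reduce step. It is not the MRNets model from Chapter 6: the place names, the transition names and the `toy_job_net` helper are hypothetical, and no timing or fault behavior is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    name: str
    inputs: dict   # place name -> tokens consumed when firing
    outputs: dict  # place name -> tokens produced when firing


@dataclass
class PetriNet:
    marking: dict                      # place name -> current token count
    transitions: list = field(default_factory=list)

    def enabled(self, t):
        """A transition is enabled when every input place holds enough tokens."""
        return all(self.marking.get(p, 0) >= n for p, n in t.inputs.items())

    def fire(self, t):
        """Consume tokens from input places and produce tokens in output places."""
        if not self.enabled(t):
            raise ValueError(f"transition {t.name} is not enabled")
        for p, n in t.inputs.items():
            self.marking[p] -= n
        for p, n in t.outputs.items():
            self.marking[p] = self.marking.get(p, 0) + n

    def run(self):
        """Fire enabled transitions until none remain; return the firing sequence.

        Real Petri nets choose among enabled transitions nondeterministically;
        this sketch always picks the first one in list order for repeatability.
        """
        fired = []
        while True:
            ready = [t for t in self.transitions if self.enabled(t)]
            if not ready:
                return fired
            self.fire(ready[0])
            fired.append(ready[0].name)


def toy_job_net(map_tasks, slots):
    """Hypothetical toy model: map tasks share slots, then one reduce step runs."""
    marking = {"map_pending": map_tasks, "free_slot": slots, "map_running": 0,
               "map_done": 0, "job_pending": 1, "job_done": 0}
    transitions = [
        Transition("start_map", {"map_pending": 1, "free_slot": 1}, {"map_running": 1}),
        Transition("finish_map", {"map_running": 1}, {"map_done": 1, "free_slot": 1}),
        # The reduce step needs every map output plus the single job token.
        Transition("reduce", {"map_done": map_tasks, "job_pending": 1}, {"job_done": 1}),
    ]
    return PetriNet(marking, transitions)


if __name__ == "__main__":
    net = toy_job_net(map_tasks=3, slots=2)
    print(net.run())
    print(net.marking)
```

Changing `slots` or `map_tasks` and re-running the net is a small-scale analogue of the what-if analyses listed for the model evaluation in the table of contents.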

CHAPTER 1
INTRODUCTION

Innovations in infrastructure, middleware and applications have made "big data" analytics possible and economically viable in a wide range of fields such as bioinformatics, data mining, web indexing, document classification, recommendation systems, etc. The Map-Reduce (MR) programming paradigm (Dean and Ghemawat 2004) along with the free and widely supported open-source implementation framework, Hadoop (Hadoop 2004), has been a prime choice for incorporating data analytics in industry, government as well as academic domains. One of the important benefits of this choice is that parallelization, distribution and fault-tolerance are facilitated and provided by the framework itself rather than requiring systems designers and developers to be responsible for these aspects. This model also allows the use of Infrastructure-as-a-Service offerings or a cloud-computing environment to process big data. This makes it possible for small and medium-sized companies, organizations and research labs to utilize large-scale computational resources on a per-demand basis, which would otherwise not have been possible because of the time, cost and effort involved in provisioning, deploying and maintaining large-scale infrastructures.

Map-Reduce refers to both the programming paradigm and the software framework that is necessary to provide the underlying job parallelization and data distribution. The terms "Map" and "Reduce" correspond to programming primitives from functional programming languages such as Lisp. Map refers to the application of a Map function to a partitioned subset of data usually structured in the form of (key, value) pairs. All map outputs corresponding to a specific key are then processed by a reduce function which is usually a type of aggregation operation. A Map-Reduce job consists of many Map tasks executing in parallel on a cluster of worker nodes, followed by a set of parallel Reduce tasks. The resurgence and widespread adoption of the Map-Reduce paradigm is due in part to its successful application for web-index search by Google (Dean and Ghemawat 2004).

1.1 Motivation

Enterprise data centers and cloud computing environments experience many faults and failures as shown in recent studies by Pinheiro et al. (2007), Schroeder and Gibson (2007a), Schroeder and Gibson (2010) and Dean (2008). The causes for these faults include scale, heterogeneity, geographical distribution, configuration management over a large set of inter-dependent services and human error as illustrated by Barroso and Hölzle (2009).

These faults adversely affect applications running in these environments, resulting in job performance degradations, failed jobs, increased costs for users and loss of revenue for the provider when Service Level Objectives are violated. Wang et al. (2009b) show through simulation studies that a single node fault can result in up to 139% performance slowdown in Map-Reduce. Dinu and Ng (2012) record penalties of up to 350% in job runtime due to TaskTracker failures. Ananthanarayanan et al. (2010) show that job completion times in Dryad (an implementation of the Map-Reduce paradigm) can be inflated by 34% because of outliers and that faster completion times (by reducing the effect of outliers) also provide a competitive advantage to service providers.

These performance variabilities and penalties make it challenging to use Map-Reduce for applications where response time is important, such as in user-facing social networking applications at Facebook (Borthakur et al. 2011), user-customization applications at LinkedIn (Kreps 2009) and user click-stream processing, web-index generation and advertisement selection applications at Microsoft (Ferguson et al. 2012a).

The body of work in this dissertation presents techniques and tools that aim at modeling performance and mitigating performance penalties experienced by Map-Reduce jobs in enterprise data centers and cloud computing environments. The rest of this chapter provides an overview of the challenges that exist towards this goal and the key features of the proposed techniques and tools.

1.2 Challenges in Fault and Performance Management

This section details the challenges involved in designing and developing a fault and performance management solution for Map-Reduce based platforms.

(a) Lack of real-world fault datasets. Public fault dataset repositories such as the Computer Failure Data Repository (CFDR 2008) and the Failure Trace Archive (FTA 2009) do not yet contain Map-Reduce data. Furthermore, a static fault dataset can be helpful in characterizing performance in the presence of faults but cannot be used to verify or validate a fault management solution.

(b) Inapplicability of classic fault tolerance techniques such as node replication and checkpointing. Active node replication would not be suitable for handling faults experienced by a Map-Reduce job because of the prohibitive cost of maintaining a hot spare for each of the large number of worker nodes in a Map-Reduce job. Checkpointing, which is typically used in High Performance Computing applications, is limited by its scalability. Recent work shows that the time spent on checkpointing would become excessive when compared to the amount of time spent performing useful work in large-scale parallel systems (Cappello 2009).

(c) Large number of factors influence performance in complex ways. Map-Reduce platforms consist of a variety of applications, middleware and system software components. Each of these components, along with their inter-dependencies and runtime configurations, affects job completion time. The ability to estimate performance of a Map-Reduce job both in the presence and absence of faults is an important pre-requisite to fault management. Performance prediction through analytical, simulation and data-driven models is an active area of research and is especially challenging in distributed environments.
(d) Real-timeanalysisoflarge,distributedlogles.Informationaboutthebehaviorof slavenodesispresentinanumberofloglesstoredlocallyoneachnode.Relevant informationfrommultipleloglesneedstobeidentied,parsedefciently,correlated andretrievedbythefaultmanagementmodulebeforeinformationgleanedfromit canbeusedforfaultdetection,diagnosisandrecoverydecisionmaking. (e) Inter-dependentphasesofaMap-Reducejobthatresultincascadedlossof computationandintermediateresults.ThestructureofaMap-Reducejobis suchthatresultsfromtheMapphasebecomeinputstotheReducephase.This intermediatedataisstoredonlocaldiskandisnotreplicatedbytheunderlying distributedlesystem.Asaresult,lossofevenonecomputenodecouldpotentially causemultiplemapandreducetaskstobere-computedresultinginsevere penalties( DinuandNg 2012 ). 19

(f) Distributed and possibly heterogeneous environment. Performance debugging is challenging in these environments because each compute node can have different behavioral characteristics due to variations in its static hardware specifications and dynamic workload conditions.
(g) Feedback about job performance being available only at the end of job completion. The runtime of typical Map-Reduce jobs is on the order of a few tens of seconds to a few tens of minutes (Chen et al. 2011; Kavulya et al. 2010) and hence job performance feedback from the system is available only at the end of this period. This is unlike n-tier web architectures, in which feedback is available in the form of the time taken to process HTTP requests or database queries, which is on the order of a few seconds or less.

1.3 Autonomic Approach to Fault and Performance Management

The requirements and challenges present in a Map-Reduce environment preclude the use of manual techniques by an IT expert for fault detection and suitable remediation. An autonomic approach refers to the use of online, on-demand and closed-loop techniques for managing faults experienced by an executing Map-Reduce job and hence is suitable for the goal of performance penalty mitigation. An autonomic approach usually views the target system in the framework of a Monitor-Analyze-Plan-Execute-Knowledge (MAPEK) loop, as shown in Figure 1-1.

Figure 1-1. A Monitor-Analyze-Plan-Execute-Knowledge (MAPEK) loop for system management
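The MAPEK framing can be made concrete with a small skeleton. The sketch below is illustrative only and is not the FMR implementation: the `Knowledge` class, the heartbeat-rate metric, the fixed cut-off used in `analyze` and the one-replacement-per-faulty-node policy in `plan` are all assumptions chosen to keep the example short, and the cluster is a plain in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Shared state that every stage of the loop can read or update."""
    min_heartbeat_rate: float = 0.5  # hypothetical cut-off, beats per second
    actions_taken: list = field(default_factory=list)


def monitor(cluster):
    """Monitor: collect one heartbeat-rate sample per slave node."""
    return dict(cluster["heartbeat_rates"])


def analyze(samples, knowledge):
    """Analyze: flag nodes whose heartbeat rate is below the cut-off.

    The fixed cut-off is only a placeholder for a real anomaly detector.
    """
    return sorted(node for node, rate in samples.items()
                  if rate < knowledge.min_heartbeat_rate)


def plan(anomalous_nodes):
    """Plan: request one replacement per anomalous node (placeholder policy)."""
    return {"exclude": anomalous_nodes, "add_nodes": len(anomalous_nodes)}


def execute(action, cluster, knowledge):
    """Execute: apply the plan to the in-memory cluster and record it."""
    for node in action["exclude"]:
        cluster["heartbeat_rates"].pop(node)
    for i in range(action["add_nodes"]):
        name = f"spare-{len(knowledge.actions_taken)}-{i}"
        cluster["heartbeat_rates"][name] = 1.0
    knowledge.actions_taken.append(action)


def mapek_iteration(cluster, knowledge):
    """Run one pass of Monitor -> Analyze -> Plan -> Execute."""
    samples = monitor(cluster)
    anomalous = analyze(samples, knowledge)
    action = plan(anomalous)
    if action["add_nodes"]:
        execute(action, cluster, knowledge)
    return action


cluster = {"heartbeat_rates": {"slave-1": 1.0, "slave-2": 0.1, "slave-3": 0.9}}
knowledge = Knowledge()
print(mapek_iteration(cluster, knowledge))
```

In a closed-loop deployment `mapek_iteration` would be called repeatedly while the job runs, so that each pass sees the effect of the previous action.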

1.3.1 Overview of Proposed Approach

Fault-managed Map-Reduce (FMR) is a closed-loop solution to detect faults and remedy the effect of faults using dynamic resource scaling. The key idea behind FMR is to overcome delayed fault detection through machine-learning based anomaly detection techniques. A built-in feature of Hadoop called the "node health script" is utilized for augmented monitoring of a slave node. In addition, the resource monitoring tool Ganglia has been customized through the addition of two metric modules to enable capturing of application heartbeats and for anomaly detection computation. Anomalous slave nodes are detected through single-class classification based on sparse coding. The main advantage provided by single-class classification is that, at the time of training, the model requires training data only from the normal (or non-anomalous) class. This is a critical benefit because it relieves a system administrator of the need to have representative anomalous data for all possible types of faults. In a dynamic data center environment new types of faults can occur at runtime and it would be unreasonable to expect model training using anomalous data.

Once an anomalous node is detected, horizontal resource scaling is chosen as the remediation technique. Horizontal scaling is widely available in most Infrastructure-as-a-Service and Platform-as-a-Service offerings and therefore leads to a recovery solution that can be easily adopted in practice. The intensity of scaling that is to be invoked depends on an estimation of the job execution time. A machine-learning based regression method is used for performance prediction both in the presence and absence of a fault. Such an approach has three main advantages: (1) it facilitates inclusion of a large number of factors that determine performance, (2) it allows models to be incrementally updated, and (3) it facilitates capturing characteristics of a heterogeneous environment. A scaling heuristic is proposed in order to determine the intensity of scaling, which also depends on factors such as the time of occurrence of the fault, the size of the dataset to be processed, the number of slave nodes available and the time taken to provision extra resources.

A pictorial representation of the autonomic control loop of FMR is shown in Figure 1-2.

Figure 1-2. Autonomic Control Loop for Fault-managed Map-Reduce (FMR) showing details of each of the MAPEK loop components

1.3.2 Key Features

This subsection lists the key features of an autonomic approach to fault and performance management in Map-Reduce based platforms.

• FMR does not require any modification to the Hadoop source code. Features provided by Hadoop and the virtualized infrastructure are leveraged to achieve desired goals. This facilitates practical adoption and provides the ability to leverage the active open-source community that is involved with Hadoop development and maintenance.
• FMR eliminates the need to continually over-provision for worst-case conditions. Organizations that may not be able to afford the costs involved in over-provisioning thus benefit from the on-demand nature of the proposed solution, which achieves the desired performance even in the presence of faults.

• Execution time of a job is estimated through the use of analytical and simulation based models in current state-of-the-art methods for performance management in Map-Reduce. These models provide a good first step towards performance prediction. However, when Map-Reduce implementations and middleware configuration parameters change or new performance-influencing factors become involved, these models become less accurate and need to be redesigned from scratch. In order to overcome this limitation and also to enable capturing a large number of features that affect job execution time, FMR uses machine-learning based regression techniques for performance prediction.
• Current state-of-the-art methods use static threshold-based techniques to distinguish faulty behavior from normal operating conditions in Map-Reduce based systems. Thresholds are difficult to determine and harder to tune in today's dynamic computing environment. The anomaly detection module in FMR overcomes this limitation through the use of a sparse-coding based model for identifying anomalous nodes.
• Fault detection in FMR is performed through decentralized models local to each worker node. This reduces communication and synchronization overhead amongst nodes and facilitates scalability of the proposed solution.

1.3.3 Non-Goals

In order to clarify the goals of the proposed work, this subsection lists allied research questions that form active areas of research but currently fall outside the scope of this work.

• Diagnosis of faults: Fault diagnosis refers to determining the type of a detected fault and identifying its root cause. Fault diagnosis can help in making better remediation decisions. However, in this proposed work, the focus is on the effect or manifestation of a fault and hence we do not attempt to perform fault diagnosis.
• Software bugs in the Map and Reduce task: Software bugs in certain execution paths of a Map or Reduce task can cause some tasks to fail. If these faults are activated by input data to these tasks, then techniques from the field of software testing and verification must be used to rectify or recover from these faults.
• Rack faults and Power Distribution Unit (PDU) faults: Map-Reduce jobs can tolerate only n - 1 faults, where n is the replication factor chosen for the underlying distributed file system. Rack faults and PDU faults that cause more than n - 1 slave node faults cannot be recovered from.
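The n - 1 bound in the last item can be written down directly. The following is a minimal sketch of that rule as stated above; the function names are illustrative and do not come from Hadoop.

```python
def max_tolerable_node_faults(replication_factor):
    """Return n - 1, the number of slave node faults a job can tolerate,
    where n is the replication factor of the distributed file system."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


def is_recoverable(failed_nodes, replication_factor):
    """True if the number of failed slave nodes stays within the n - 1 bound."""
    return failed_nodes <= max_tolerable_node_faults(replication_factor)


# With a replication factor of 3, two node faults are within the bound
# and three are not.
print(max_tolerable_node_faults(3))
print(is_recoverable(2, 3), is_recoverable(3, 3))
```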

1.4 Tools for Fault and Performance Studies in Map-Reduce

In the second part of this dissertation, two tools, namely FaultPlay and MRNets, are designed and developed in order to facilitate further studies in performance and fault management of MapReduce.

1.4.1 FaultPlay: A fault-injection and performance management testbed

Currently, difficulties involved in failure studies on a Map-Reduce platform include lack of access to real-world failure datasets and the inherent static nature of datasets. Furthermore, when fault and performance management solutions are proposed through a fault-injection based study, it is difficult to evaluate, compare and improve upon state-of-the-art solutions. This is because the testbed, workload and faultload used are difficult to replicate exactly due to the complex system configuration details necessary for this. Even when sufficient detail is available, a considerable amount of redundant effort needs to be expended for setup of the infrastructure, system software, middleware and application components needed for such a study. In order to overcome these challenges, FaultPlay facilitates clearly-defined and reproducible fault studies on Map-Reduce platforms. FaultPlay helps in reducing time and effort expended on systems deployment, installation and configuration by providing the tool as a virtual appliance image that comes with all necessary dependencies already installed. FaultPlay consists of modules for job processing, fault injection, distributed monitoring, log parsing and for deploying recovery-based management solutions. These modules together enable a variety of characterization, performance and fault-management studies to be conducted easily.

1.4.2 MRNets: Performance modeling using Petri nets

The empirical and machine-learning based performance models proposed as parts of FMR and FaultPlay fulfill the needs for accurate and realistic studies. However, for the case of long-running jobs or a workload suite of multiple Map-Reduce jobs, empirical studies and the generation of representative training data for machine-learning
models can be time-consuming. In order to facilitate an orthogonal approach to fast and accurate performance modeling, we propose a modeling approach based on Petri nets, a discrete event modeling methodology. Petri nets provide a formal method grounded in well-founded mathematical properties to capture both system structure and behavior. Petri net models are executable and are used to simulate system behavior. These Petri net models of Map-Reduce, termed MRNets, facilitate various performance analyses on both single jobs as well as a workload of jobs. Furthermore, they provide a key advantage through their graphical representation, which makes it easy to design, use and extend these models for further performance studies.

1.5 Dissertation Outline

Chapter 2 describes the MapReduce implementation Hadoop and the performance penalties suffered by a Hadoop job when faults are experienced under different system configurations. The feasibility of using dynamic resource scaling as a technique to mitigate these performance penalties is illustrated through both an empirical and a simulation study. Chapter 3 proposes the use of machine-learning based regression techniques for predicting the performance of Hadoop jobs. A comparative evaluation of fifteen different regression methods is performed and reported. Chapter 4 builds on insights and performance models to construct an autonomic control loop for online fault management. An anomaly detection technique based on sparse coding is proposed and incorporated within the autonomic control loop. Chapter 5 describes FaultPlay, an empirical testbed to conduct fault and performance studies in an easy, repeatable and extensible manner. Chapter 6 describes the use of Petri nets for modeling performance of MapReduce jobs and workloads. Chapter 7 summarizes the contributions of this dissertation and concludes with directions for future research.

CHAPTER 2
PERFORMANCE AND FAULT CHARACTERIZATION

The first step towards the design of Fault-Managed Map-Reduce (FMR) consists of detailed fault and performance characterization studies in order to identify the factors that affect job execution time. In this chapter, we begin with an introduction to Hadoop, a MapReduce implementation we use throughout this dissertation, and its built-in fault-tolerance mechanism. An empirical testbed and a simulation tool are used to study the effects of various factors related to the resource, dataset, framework configuration and different types of faults. Observed performance trends in both the empirical and simulation study enable identification of the key factors that affect performance penalty in Hadoop. We then propose the use of dynamic resource scaling as a potential remediation technique. We implement a prototype that adds nodes to an executing job in a seamless fashion. We then conduct experiments to determine the extent of scaling necessary to compensate for the performance penalties experienced by a job for different types of faults.

In Section 2.1, we describe Hadoop and its fault tolerance mechanism. Section 2.2 illustrates the factors that affect a job's performance through a detailed simulation study, while Section 2.3 does so through experiments on a 74-node Hadoop testbed. We summarize and conclude in Section 2.4.

2.1 Background and Introduction

Hadoop (Hadoop 2004), a free and open-source Map-Reduce implementation, consists of the following main components: (1) JobTracker and TaskTracker daemons that manage scheduling and coordination of map and reduce tasks, and (2) NameNode and DataNode daemons that manage the Hadoop Distributed File System (HDFS). The JobTracker and NameNode daemons run on the Hadoop master node, while the TaskTracker and DataNode daemons run on the slave nodes. Figure 2-1 shows a simplified overview of Hadoop.

Figure 2-1. Overview of Hadoop showing interactions between the JobTracker, NameNode, TaskTracker and DataNode hosted on the master and slave nodes.

Figure 2-2. Overview of Hadoop fault-tolerance showing increased job execution time needed for re-executing map tasks that were run on the failed node.

In a Map-Reduce job, when a node fails, all map tasks that were executed on this node (for this job) have to be re-executed on other healthy nodes. This is because map outputs are stored locally at each slave node (rather than being stored on the replicated HDFS). Map tasks whose outputs have already been read by corresponding reduce tasks need not be re-executed. The master node detects a slave node fault after a static timeout interval and then initiates re-execution. This mechanism to handle faults is illustrated in Figure 2-2.
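The re-execution rule described above can be sketched in a few lines. This is an illustration of the rule as stated in the text, not Hadoop's scheduler code: the `MapTask` record and its `output_consumed` flag are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapTask:
    task_id: str
    node: str               # slave node on which the task ran
    output_consumed: bool   # True if the reduce tasks have already read its output


def tasks_to_reexecute(tasks, failed_node):
    """Return the ids of map tasks that must run again after failed_node is lost.

    A task is re-executed only if it ran on the failed node and its locally
    stored output has not yet been read by the corresponding reduce tasks.
    """
    return [t.task_id for t in tasks
            if t.node == failed_node and not t.output_consumed]


tasks = [
    MapTask("map-1", "slave-1", output_consumed=True),
    MapTask("map-2", "slave-1", output_consumed=False),
    MapTask("map-3", "slave-2", output_consumed=False),
]
print(tasks_to_reexecute(tasks, "slave-1"))
```

Only `map-2` is selected here: `map-1` ran on the failed node but its output was already consumed, and `map-3` ran on a healthy node.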

2.2 Fault and Performance Characterization: Simulation Results

Our first goal is to perform an in-depth investigation of the design space of the Map-Reduce environment in the presence of faults. We chose to begin with a simulation framework since this easily facilitates performing a large number of experiments with sweeps across numerous parameters. We use the MRPerf (Wang et al. 2009a) simulator developed by researchers from Virginia Tech and IBM Almaden Research Center, which was created to experiment with the design space of Map-Reduce applications. It is based on the Network Simulator, NS2. MRPerf allows us to study the effects of different inter-connect topologies, data locality, and software and hardware faults on overall application performance. MRPerf allows quantification of the effect of these factors, and thus can serve as a tool for optimizing existing Map-Reduce setups as well as designing new ones. MRPerf has been validated (Wang et al. 2009a) against actual execution of Hadoop jobs on a physical cluster consisting of 16 to 128 cores.

In the MRPerf simulator, application characteristics are captured by the number of cycles required for processing each byte as well as the ratio between the input and output datasets. There are some simplifying assumptions in MRPerf. For example, the computation time needed to process a chunk of data depends only on its size and not on its content. In addition, speculative execution, which is a technique used in Hadoop to mitigate the effect of straggler nodes, is not implemented. Despite these factors, it has been found that MRPerf can estimate job completion time of Map-Reduce jobs within 3% and 19% of the measured values for the map and reduce phase respectively. We also considered two other simulators for our work: Mumak (Foundation 2009) (Tang 2009), designed and developed by Yahoo!, and MRSim (Hammoud et al. 2010) from Brunel University. However, we could not use these simulators since node faults were not captured in them.

2.2.1 Hadoop Evaluation

Our first goal is to validate the claim that node faults result in increased execution times and to understand the effect of increased job size on this penalty. We simulate a Map-Reduce job on 8, 16, 32 and 64 nodes. We used two network configuration parameters provided by MRPerf: (1) choice of the number of racks and (2) choice of interconnectivity between nodes within a rack. For all the cases studied in this section, we chose a single-rack, star topology.

We capture the worst-case performance penalty for each job by injecting a node fault at the end of the map phase. Such a node fault results in a time-out period that does not overlap with the map phase and hence results in maximum penalty. In Figure 2-3 we show that a penalty of up to 300% is incurred by jobs for the default time-out value of 600 seconds. We see that the performance penalty is higher for jobs running on a smaller number of nodes. This is because when a single node fails, the amount of compute power that is lost (as a percentage of the total amount available) is larger for jobs running on fewer nodes. For example, in an 8-node job, loss of a single node is a loss of 12.5% of total compute power, whereas in a 64-node job, a single node fault corresponds to a 1.6% loss of the total compute power available.

Figure 2-3. Effect of number of nodes in a Hadoop job on the job completion-time penalty. Dataset size is doubled when number of nodes is doubled to allow for convenient comparison. (Plot axes: cluster size in number of nodes, 8 to 64; execution time in seconds. Series: no faults, single node fault.)
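The two percentages used in this discussion are simple ratios. The sketch below assumes homogeneous nodes for the compute-share calculation and assumes that the penalty is the extra runtime expressed as a percentage of the fault-free runtime, which is how the term is used in this chapter; the function names are illustrative.

```python
def compute_share_lost(total_nodes, failed_nodes=1):
    """Percentage of total compute power lost when failed_nodes of
    total_nodes identical nodes fail."""
    if total_nodes <= 0 or not 0 <= failed_nodes <= total_nodes:
        raise ValueError("need 0 <= failed_nodes <= total_nodes and total_nodes > 0")
    return 100.0 * failed_nodes / total_nodes


def runtime_penalty(faulty_runtime, fault_free_runtime):
    """Extra runtime caused by a fault, as a percentage of the fault-free runtime."""
    return 100.0 * (faulty_runtime - fault_free_runtime) / fault_free_runtime


for nodes in (8, 16, 32, 64):
    print(f"{nodes:2d} nodes: {compute_share_lost(nodes):.4f}% of compute power lost")
```

For 8 and 64 nodes this gives 12.5% and about 1.6%, matching the values quoted above.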

In the second set of experiments, shown in Figure 2-4, we study the effect of the occurrence time of faults during job execution on the completion time penalty. We choose a 16-node Map-Reduce job whose fault-free job completion time is 524 seconds on average. The node fault occurrence time was varied between 50 seconds and 600 seconds in 100 second increments. For the first two cases (fault at the 50th and 150th second), the timeout period overlaps with the Map phase and hence the runtime penalty is almost constant. Beyond this, the penalty increases as the fault occurrence time is further along the job's execution. In these cases, the timeout period for node fault detection extends beyond the map phase and hence severely affects job completion time. In the last case, the node fault does not affect job completion time much because the map tasks that were completed on the failed node did not need to be re-executed, since these map task outputs were already consumed by the corresponding reduce tasks.

Figure 2-4. Effect of fault-occurrence time on job completion-time penalty using Hadoop's built-in fault tolerance.

In the third experiment, we study the effect of the timeout period on job completion times in the presence of a node fault. We begin with a time-out value of 1 minute and increase this by 100 seconds for each run. As shown in Figure 2-5, for increasing values of timeout we see that initially the job completion times remain constant. This is once
again because the timeout period overlaps with the map phase. For larger time-out values, the time-out period extends beyond the map phase and results in severe performance penalties.

Figure 2-5. Effect of node fault detection timeout on job completion-time penalty using Hadoop's built-in fault tolerance.

2.2.2 Improved Fault Tolerance

In this section, the question that we answer is: To what extent does adding resources to a running Hadoop job improve its running time in the presence of a node fault?

In the first set of experiments, we introduced one node fault in 4 different Hadoop jobs that executed on 16 nodes. In Figure 2-6, for each job the first bar represents the penalty of built-in fault tolerance. The next three bars show the decreased penalty by adding one, two and four nodes respectively. We use our resource scaling approach to bring the penalty to less than 5%. For the first job, built-in fault tolerance results in a 15% penalty. Four extra nodes are needed to bring the performance penalty to less than 5%. For the second, third and fourth jobs, we need two, one and one nodes respectively to bring down the penalty to the acceptable percentage. This difference in the number of nodes needed is dependent on resource utilization in the last wave of the map task. If
resources are sparsely utilized in the last map wave, then fewer nodes would be needed to effectively counteract node faults (such as jobs 2, 3 and 4 in the previous example). If all the nodes are fully utilized during the last wave (such as in job 1), then more nodes will be needed to compensate for the lost nodes within a short period of time.

Figure 2-6. Effect of resource scaling on jobs with single node faults. The arrow in each grouped set of bars indicates the optimal number of nodes needed to bring execution-time penalty within 5% of the fault-free job execution time.

Next, we considered the case of multiple node faults. Figure 2-7 shows the effect of resource scaling for a 16-node Map-Reduce job when 2 nodes and 4 nodes are lost. For both these cases, the arrow marks the extent of resource scaling that is needed to bring down the penalty to 5%. For the 2-node fault case, when resource scaling is increased beyond this optimal value, we see that there is not much benefit. This is because adding more nodes than needed simply results in underutilized resources (rather than performance improvement). In all our resource scaling experiments, we chose the worst-case situation of having only one remaining map wave during which resources can be scaled. This allows us to identify whether dynamic resource scaling is helpful for recovery under extreme conditions. In cases when fault detection is earlier in job execution, the extent of resource scaling needed would be smaller.
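The selection marked by the arrows in these figures amounts to picking the smallest scaling level whose penalty falls under the target. The sketch below shows that selection on made-up runtimes; the numbers are hypothetical and are not read from Figure 2-6 or Figure 2-7, and `min_nodes_for_target` is an illustrative name.

```python
def min_nodes_for_target(fault_free_runtime, runtime_by_added_nodes, target_penalty=5.0):
    """Return the smallest number of added nodes whose runtime penalty is
    below target_penalty percent, or None if no tested level achieves it.

    runtime_by_added_nodes maps "nodes added" -> measured job runtime (seconds).
    """
    for added in sorted(runtime_by_added_nodes):
        runtime = runtime_by_added_nodes[added]
        penalty = 100.0 * (runtime - fault_free_runtime) / fault_free_runtime
        if penalty < target_penalty:
            return added
    return None


# Hypothetical measurements for one job with a single node fault:
# 0 added nodes corresponds to Hadoop's built-in fault tolerance alone.
fault_free = 500.0
measured = {0: 575.0, 1: 550.0, 2: 535.0, 4: 520.0}
print(min_nodes_for_target(fault_free, measured))
```

With these made-up runtimes the penalties are 15%, 10%, 7% and 4%, so four added nodes is the first level under the 5% target. Larger scaling levels would also meet the target but, as noted above, only at the cost of underutilized resources.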

An important tradeoff that needs to be considered in this approach is that scaling too much can result in better execution time but at excessive resource cost, while conservative scaling can lead to unacceptable execution time penalties. In the worst case, aggressive scaling could even lead to an increase in completion times due to worker node provisioning overheads.

Figure 2-7. Multiple node faults: Effect of resource scaling on jobs with two and four-node faults. The arrow in each grouped set of bars indicates the optimal number of nodes needed to bring execution-time penalty within 5% of the fault-free job execution time.

2.3 Fault and Performance Characterization: Empirical Results

After the initial study on a simulation platform, we use an experimental testbed to perform further evaluations. The testbed consists of 24 Xen guest domains, each with a 1.1 GHz CPU, 36 GB of hard disk space and 1 GB of memory. These domains were hosted on 4 IBM blade servers (HS22 7870) with Intel Xeon processors consisting of a total of 64 cores, 96 GB of memory and 920 GB of hard disk. The Xen guest domains were running the Ubuntu 10.04.2 LTS (Kernel: 2.6.32-30-generic) operating system and the Xen host domains were running CentOS Release 5.5 (Kernel:
2.6.18-194.32.1.el5xen). Hadoop version 0.21.0 was used for all the Wordcount experiments and Hadoop version 0.20.203.0 was used for all the Terasort experiments.

2.3.1 Hadoop Evaluation

The goal of the evaluation experiments was to study in detail the effect of various system, application, dataset and fault characteristics on execution time penalty. An overview of the steps involved in each experiment is shown in Figure 2-8, with most steps being automated using Linux scripting tools. The first step is to set up the Hadoop Distributed File System (HDFS) and the local file system with the necessary job directory structure and required input data. Between any two job runs, if the number of slave nodes or the input dataset has changed, the entire HDFS is deleted, formatted and re-populated. This was done to ensure that the distribution of data to the slave nodes would be fresh and representative of a system without any previous faults. A node fault causes data block replicas to be lost, and thus causes HDFS to initiate data replications. This results in changes to the data distribution, which may then become unbalanced across the slaves. After this step, the slave node configuration file is created based on the configuration parameter values provided. All slaves are then tested to ensure that the TaskTracker and DataNode daemons are running on them, which is then followed by job submission. During execution, three main steps are relevant: node status monitoring, fault injection at configured points in time and node scaling when suitable. After the job completes, log files (from each slave node and corresponding to each daemon), history files as well as HDFS data distribution reports are retrieved. Parsing scripts then extract useful information from these files, such as task start and end times, data input and output sizes for each phase and failed task identifiers.

Experiments were run on small to medium-sized clusters and have execution times on the order of a few minutes to a few tens of minutes. This matches common jobs in real workloads as observed from production traces of Yahoo and Facebook used in previous studies (Chen et al. 2010). These traces consisted of 30,000 and 1 million
jobs respectively, and the majority (> 98%) of jobs had a mean runtime varying between a few tens of seconds and a few tens of minutes. Our evaluations can be grouped into three different categories described in the subsections below.

Figure 2-8. Components of the experimental testbed that include compute, storage and network infrastructure, scripts and tools to control and automate job execution, job monitoring, fault injection and node scaling and configuration parameters to set up the experiment.

Infrastructure-based parameters

In this subsection, we study Hadoop performance by varying parameters associated with either the hardware or the Hadoop middleware. In the first set of experiments, in Figure 2-9A, our goal was to study how the fault penalty changes as the size of the Hadoop cluster is increased. We run the wordcount application provided with Hadoop on 8 nodes to 24 nodes on 36 GB of text data (triple replicated as per Hadoop's default) derived from Project Gutenberg (Gutenberg 2009). Execution times are recorded for a fault-free run as well as for a run in which a node fault is injected at approximately 50% of map phase progress. The loss of a single node in each case corresponds to a loss (in compute and storage capacity) of between 12% and 4%, while the penalty incurred ranges from 155% to 26% respectively. This penalty includes the time taken to re-run the in-progress and previously completed map tasks of the failed node as well as the node expiry timeout period. Figure 2-9B shows similar trends for the Terasort benchmark application.

'Expiry timeout' is a configurable parameter provided by Hadoop and is the time interval after which the JobTracker concludes that a node is dead if it has not received any heartbeat messages from it during this period. In order to understand the dependence of runtime penalty on the expiry timeout parameter, we study its influence in Figure 2-10. For each value of timeout, we compare penalties for different points of occurrence of the fault. We group faults occurring before 30% (of map phase progress) as early faults, those after 30% and before 70% as mid faults and those after 70% as late faults. We see that for small and most medium values of timeout, the penalties observed are negligible. But in an actual cluster, timeout values need to be large enough to account for factors such as task length for all possible jobs, network delays, I/O delays, data chunk processing time differences, heterogeneous slaves and transient excessive loads on slaves. Hence configuring low timeout values could be a possible technique to reduce penalties in some clusters, but would be unsuitable in most cases.

Types of Faults

The occurrence of a fault at different points during the execution of a job affects the penalty, as shown in Figures 2-11A and 2-11B for the Wordcount and Terasort applications respectively. For the node scaling technique, this is an important parameter for two reasons: (1) the time of fault occurrence determines the duration for which a slave node has been a part of the executing job, which in turn determines the amount of work lost when the slave fails, and (2) it also determines the remaining number of map waves available during which the scaled nodes can contribute compute resources. For the case shown, the penalty varies between 52% and 110% of fault-free execution time for Wordcount. For Terasort the penalties do not follow an increasing trend and vary between 42% and 77%. In Figures 2-12A and 2-12B, we inject two node faults to observe penalties. We see that the trend of penalties for early, mid and late faults is similar to that of the single fault case. We use Hadoop's default replication factor of 3 for the input dataset, and hence the job can tolerate only up to two node faults.
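The early, mid and late grouping used in these experiments can be expressed as a small classifier. This is a sketch of the grouping described above; the text does not say how a fault at exactly 30% or 70% is labelled, so assigning 30% to "mid" and 70% to "late" is an assumption, and `classify_fault` is an illustrative name.

```python
def classify_fault(map_progress_percent):
    """Label a fault by the map-phase progress (0 to 100) at which it occurs.

    Faults before 30% are "early", those between 30% and 70% are "mid" and
    those after 70% are "late". The handling of the exact boundary values
    is an assumption made for this sketch.
    """
    if not 0 <= map_progress_percent <= 100:
        raise ValueError("map-phase progress must be between 0 and 100")
    if map_progress_percent < 30:
        return "early"
    if map_progress_percent < 70:
        return "mid"
    return "late"


for progress in (10, 50, 90):
    print(progress, classify_fault(progress))
```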

Figure 2-9. Comparison of execution time for different cluster sizes for the case of no faults and single node faults. Performance penalty is indicated as a percentage on top of each dark grey bar. (Panel A: Wordcount Map-Reduce Application. Panel B: Terasort Map-Reduce Application.)

Application and Dataset based parameters

Since Map-Reduce is a general programming paradigm, a very diverse set of applications can be constructed using the basic map, shuffle/copy and reduce phases. There can be applications which consist of a mostly map phase and an identity reduce phase or those that contain no map phase and only a reduce phase. Applications can also be categorized into data transformations, data expansions, data summarizations and data aggregations based on the ratio of the datasets produced by each of the
phases (Chen et al. 2010). Since Map-Reduce is typically used for processing large datasets, the characteristics of the input dataset strongly influence application performance. In this section, we use the Terasort benchmark application with an input dataset of 8 GB generated with the associated Teragen application. For this application the output data is a sorted version of the input and hence it belongs to the data transformation category with a data ratio equal to 1. The Wordcount application used in the other sections is a data summary application since the output data consists of a count value for every distinct word in the input text data (and hence the output is very small compared to the input). In the experiment recorded in Figure 2-13, we show how, for the same dataset size, varying the chunk size (and hence number of chunks) affects fault penalty differently. We see that the smallest chunk size of 64 MB performs best for this application, though the penalty for this size is almost double that for a chunk size of 256 MB. Hence in selecting chunk size, the user has to trade off between performance and fault-tolerance.

Figure 2-10. Comparison between execution times for the no-fault, early-fault, mid-fault and late-fault cases for different expiry timeout values. Cluster Size = 16 nodes, Program = Wordcount.

2.3.2 Improved Fault Tolerance

In this section, we show how fault penalties introduced by many of the conditions seen in the previous subsections can be mitigated using various intensities of node
scaling. In Figure 2-14A, we see how in the case of early faults and mid faults, scaling by one node is sufficient to bring down the penalty to less than 5%, while in the case of late faults, 4 nodes are needed to compensate for a single-node fault and to bring down the penalty to 5%.

Figure 2-11. Comparison between execution times for the no-fault and single-node fault case as the point of occurrence of the node fault is varied between 10% and 90% of map-phase progress. Cluster Size = 16, Program = Wordcount, Timeout = 600 seconds. (Panel A: Wordcount Map-Reduce Application. Panel B: Terasort Map-Reduce Application.)

Also in the early and mid-fault cases, it is seen that scaling by 2 and 4 nodes is unnecessary as it brings down the execution time to values slightly lower
than fault-free execution time. Figure 2-14B shows the effect of scaling for the Terasort application.

Figure 2-12. Comparison between execution times for 2-node faults injected during the early, mid and late stages of map-phase progress. (Panel A: Wordcount Map-Reduce Application. Panel B: Terasort Map-Reduce Application. Plot axes: occurrence time of fault (early, mid, late); execution time in seconds. Series: no faults, 2 node faults.)

Figure 2-13. Comparison between execution times in the Terasort application for three different chunk sizes.

2.4 Summary and Conclusions

In this chapter, we use a variety of experiments to evaluate the performance penalty experienced by a job under different conditions. We see that factors such as cluster size, time of occurrence of a fault, failure detection timeout and number of faults result in widely varying performance penalties. These penalties also vary between different
benchmark applications that were studied. Performance penalties are severe (as high as 300%) in many cases, thereby motivating the need for a fault-management solution.

[Figure 2-14 plots omitted. Panel B: Terasort Map-Reduce application.]
Figure 2-14. Comparison between different intensities of scaling for early, mid and late faults.

We also show that dynamic resource scaling is a feasible and effective remediation technique to mitigate execution time penalties due to these node faults. Although the possibility of the addition of slave nodes to a Hadoop cluster is common knowledge, we show that scaling is possible during the execution of a single job, thereby allowing a job to recover from a potential SLO violation. We also show that compensating for a
single node fault with just a single new node is not always sufficient. In particular, the amount of node scaling required depends on factors such as time of occurrence of the fault, length and number of map and reduce waves, cluster size of job, chunk size and node expiry timeout value. This motivates the need for a scaling heuristic that incorporates these factors; the heuristic is described as part of the autonomic control loop in Chapter 4.

In summary, the detailed simulation and empirical study described in this chapter sets the stage for the design, development and implementation of Fault-Managed Map-Reduce described in the following chapters.
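
The penalties quoted in this chapter can be read as the relative increase in job execution time over the fault-free run. A minimal Python sketch of that calculation follows. The function name, the exact definition and the sample timings are illustrative assumptions, not code or measurements from this study.

```python
def penalty_percent(fault_free_seconds: float, faulty_seconds: float) -> float:
    """Relative increase in execution time over the fault-free run, in percent.

    Assumed definition: 100 * (T_fault - T_no_fault) / T_no_fault.
    """
    if fault_free_seconds <= 0:
        raise ValueError("fault-free execution time must be positive")
    return 100.0 * (faulty_seconds - fault_free_seconds) / fault_free_seconds


# Hypothetical timings, not taken from the experiments in this chapter.
baseline = 800.0      # fault-free run, in seconds
with_fault = 1200.0   # same job with an injected node fault, in seconds
print(f"penalty = {penalty_percent(baseline, with_fault):.1f}%")  # penalty = 50.0%
```

Under this reading, a scaling action is successful when the penalty of the scaled run falls back below the tolerance allowed by the job's SLO.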

CHAPTER 3
PERFORMANCE PREDICTION IN MAP-REDUCE

A key functionality necessary in Fault-Managed MapReduce is the ability to predict the performance of a job given various factors that influence its performance. This ability to estimate Map-Reduce execution time is also useful for general resource scheduling and provisioning purposes. Current state-of-the-art techniques for performance prediction of Map-Reduce applications use analytical and simulation-based models. In this chapter, we make the case for performance prediction using regression techniques based on machine learning and illustrate its advantages. Through modeling the Map-Reduce environment as a grey-box, we leverage a combination of externally observed system features and information about sub-system internals. We identify four learning techniques with high prediction accuracy through a detailed comparative study of twenty methods. We show that the identified techniques can effectively predict performance with a mean prediction error of < 12%. Incorporation of these prediction models into the MAPE-K loop for fault management is discussed in the following chapter.

3.1 Background

Modern data centers as well as private and public clouds have many layers of hardware, middleware and software components integrated and working in unison to deliver services for customers. Figure 3-1 shows important components in each of these layers in a typical virtualized environment hosting a Map-Reduce (MR) platform. In the case of a leased environment, a Map-Reduce application and the necessary middleware are set up dynamically only for the period during which the services are needed.

Typically, application performance is determined by various characteristics: (1) code being executed, (2) input tuning parameters, (3) dataset and (4) underlying resources. In modern data centers and public cloud environments many new dimensions of these factors come into play, such as heterogeneous infrastructure, shared resource usage, middleware virtualization, collocation and imperfect isolation between virtual machines,
scales of operation and the lack of QoS guarantees from constituent sub-systems. It is also well documented that in MR applications job completion times can be quite variable (Kadirvel and Fortes 2011b; Wang et al. 2009c).

[Figure 3-1 diagram omitted.]
Figure 3-1. A typical virtualized environment hosting a Map-Reduce platform showing the various hardware, middleware and software components working in unison. Diverse characteristics of many of these components influence performance of Map-Reduce applications.

Current state of the art in performance prediction in MR applications is based on analytical, white-box and simulation-based models (Cardosa et al. 2011; Herodotou and Babu 2011; Verma et al. 2011). As an alternative, this work presents a grey-box approach based on machine-learning (ML) regression techniques for performance prediction. Though existing models have their own benefits in terms of simplicity and accuracy, we make the case for the use of ML-based techniques because of the variety of the benefits they can provide: (1) Ability to capture highly non-linear
relationships amongst the contributing factors. (2) Ability to capture the large feature space associated with complex models. (3) Independence from exact details about system internals and hence easy adaptation as versions and policies of the underlying framework change. (4) No need to make simplifying assumptions (which is common when constructing white-box models for complex systems). These benefits become increasingly attractive as data analytics systems scale and become more complex, and hence we believe ML-based techniques are a viable and compelling option for performance prediction.

The main contributions of the work described in this chapter are: (1) a grey-box approach for characterizing and modeling MR-based platforms, (2) a detailed empirical evaluation of suitable supervised learning regression techniques (where the evaluation includes a systematic approach to selecting and preprocessing the most relevant system features to predict performance) and (3) performance prediction under degraded system operating conditions that occur due to component faults.

The remainder of this chapter is organized as follows. In Section 3.2 an overview of current state-of-the-art techniques used for application performance prediction is provided. In Section 3.3 grey-box systems are introduced and defined. In Section 3.5 empirical evidence is provided for the non-linear dependence of performance of MR applications on various factors. Feature preprocessing and suitable prediction techniques are outlined in Section 3.6 and a detailed evaluation is provided in Section 3.7. The chapter ends with a summary of contributions in Section 3.8.

3.2 Related Work

Performance Variabilities and Failure Effects. Many recent works illustrate performance variability in clouds (Iosup et al. 2011; Schad et al. 2010), virtualized environments (Ballani et al. 2011) and specifically in MR applications (Ananthanarayanan et al. 2010; Kale et al. 2010; Zaharia et al. 2008). In Kadirvel and Fortes (2011b), we quantitatively evaluated the performance penalty associated with Hadoop's built-in
fault-tolerance approach using both a simulation framework and an experimental test bed. Job completion time penalties in MR are also documented by Wang et al. (2009c) and Ko et al. (2009).

Map-Reduce and Performance Prediction. Performance prediction has played a crucial role in simplifying system tuning and optimizing MR programs (Herodotou and Babu 2011), resource provisioning (Cardosa et al. 2011; Verma et al. 2011), parallel query progress estimation (Morton et al. 2010), adaptive scheduling (Polo et al. 2011) and development of service-models (Sharma 2010). In these works performance prediction has been based on analytical, white-box and simulation-based models. We build upon and extend these works by showing that supervised learning techniques can help capture the wide spectrum of factors that affect MR applications. In addition, a grey-box approach based on ML enables relaxation of homogeneity assumptions, allows easier incorporation of further factors and is easier to adapt as underlying policies change with newer versions of Hadoop.

Ganapathi (2009) uses a multivariate kernel-based regression technique for predicting the performance of MR jobs. Their goal is to show the applicability of a specific technique, namely Kernel Canonical Correlation Analysis, in the MR context. Our work goes beyond this in two ways: (1) We aim to provide an empirical evaluation and comparative study of a wide range of suitable regression techniques: (a) Rule-based techniques, (b) Tree-based techniques, (c) Meta-learning techniques, (d) Neural networks and (e) Gaussian-Process techniques. Along with the application of these techniques, we provide details of how features were systematically selected based on incrementally viewing the system as a grey-box. In each case, the direct relationship between the number of dimensions in the problem space that were perturbed and the subset of selected features is illustrated. (2) Our prediction model also captures the influence of node faults on performance, and this is enabled by the identification and addition of a new category of attributes as inputs to the prediction techniques.

In Montes (2010), a MapReduce environment is modeled using the Global Behavior Model (GloBeM). Classification and clustering techniques are first used to determine a Finite State Machine model of the system. Future system states and state transitions are then predicted using a combination of ML techniques. Our work differs from this in the level of granularity of the prediction involved. We predict a numerical value of system performance (namely, execution time of the MapReduce job), rather than a higher-level system state.

Machine-learning for performance prediction. Machine learning has been explored for performance prediction in single compute nodes in Kapadia et al. (1999) and Matsunaga et al. (2010) and in distributed systems such as grids (Duan et al. 2009; Guim et al. 2007; Smith 2007; Smith et al. 2004). The goal of this work differs from those listed in that we focus on Map-Reduce applications.

Factors that influence the performance of MR include the following: scale at which MR applications operate, the inherent complex dependencies and differences between the Map and Reduce phase, the large dataset sizes and unstructured nature of data that these applications process, the large design space of configuration parameters that influence performance as well as the wide variation in task runtimes, along with factors introduced by virtualization such as sharing, heterogeneity, overheads and lack of perfect isolation. Conceivably, supervised machine learning techniques could use information about these factors to predict MR performance. To the best of our knowledge, this work is the first to provide a detailed quantitative and qualitative evaluation of various ML techniques for performance prediction in MR-based platforms. Furthermore, this is the first work to consider performance prediction under degraded modes of operation that result from fault-tolerant implementations. The increasing prevalence of MR applications and virtualized environments makes a compelling case to overcome these performance modeling challenges.

3.3 Characteristics of a Grey-Box Approach

Black-box and white-box approaches to system modeling are used for various purposes such as performance evaluation, testing, system simulation and design. Black-box approaches are used in situations where systems are too complex to be modeled in detail and mathematically. In addition, these are systems for which the input-output relationships can be characterized sufficiently well by using externally observed system behavior. In direct contrast are white-box approaches, in which detailed information about the internal workings of a system is utilized for modeling. A classic example of white-box modeling would be the use of partial differential equations to describe system operation. These models are usually built up from first principles.

[Figure 3-2 diagram omitted.]
Figure 3-2. Map-Reduce platform viewed as a black-box (represented by left-most black arrow) as well as different grey-boxes (represented by the three grey arrows). Moving to the right and hence to the inner layers of a system allows for more information to become usable for management and control. Below each arrow is a sample list of parameters that can be leveraged at that level of transparency.

In this proposed work, the term grey-box is used to refer to methods in which a combination of internal parameters and behavioral metrics of the system are used for modeling. This definition is similar in spirit to the one provided by Arpaci-Dusseau and Arpaci-Dusseau (2001). Broadly speaking, a grey-box approach entails two steps: (1)
The first and most important step in prediction is to determine all the factors that could have an influence on performance. This may include identification of features from the right components, filtering out superfluous features, augmenting features through different combinations and leveraging historical information about applications. A sufficient understanding of system operation is thus necessary and leads to viewing the system at hand at different levels of being grey. (2) Secondly, system internals need to be probed in order to gather the raw data needed by the performance models. This can be done through log analysis or execution of subroutines to micro-benchmark time-specific characteristics.

In Figure 3-2, the MR platform is viewed both as a black-box and at different levels of greyness, as incrementally more information about system internals is visible. In addition, more information being available paves the way for increasingly more powerful control operations to be performed on the system being managed.

3.4 Experimental Setup

The testbed used in all the experiments in this chapter consists of IBM HS22 blade servers mounted on two different racks. Each physical node has 16 cores and 24 GB of RAM. The two racks are linked together by a Gigabit Ethernet network. Xen is the Virtual Machine Monitor chosen to host Hadoop master and slave virtual machines. The operating system on the host platform is CentOS 5.5, and the guest virtual machines (or domU in Xen parlance) run Ubuntu 10.04.2. Each slave node has a single core with 2 GB of RAM. The Hadoop version used is 0.20. The applications that were extensively tested are the Terasort and Wordcount benchmark programs. Weka (Hall et al. 2009), a Java machine-learning tool suite, is used for data analysis.

3.5 Empirical Motivation

A large feature space of factors affects performance of MR jobs. Complex interactions between these factors result in different trends when the same parameter is varied under different conditions. Furthermore, non-linearities in performance as factors
are scaled and heterogeneous behavior of slaves (because of differences in scheduling, location of data, resource sharing within a node) substantiate the need for supervised learning techniques that can help capture this concept space. Empirical observations that support these statements are described in the following subsections.

[Figure 3-3 plots omitted. Each panel plots job completion time (seconds). A: Cluster Size Scaling in Wordcount, against number of slaves, with injected faults and with no node faults. B: Dataset Size Scaling in Terasort, against dataset size (GB). C: Effect of Injected Faults in Wordcount, against fault occurrence time (early, mid, late) for timeouts of 120, 300 and 600 seconds. D: Effect of Configuration Features in Wordcount, against number of reduce tasks.]
Figure 3-3. Plots of Map-Reduce job completion time when different subsets of features are varied.

3.5.1 Influencing Factors and Non-linearities

This subsection details the different categories of factors that affect performance and provides empirical examples for the degree and nature of influence. Table 3-1 lists the factors organized into six categories: Resource (R), Data (D), Program (P),
Configuration (C), Faults (F), and Environment (E), along with representative examples for each. Performance is thus a function of all these factors.

Performance = f(R, D, P, C, F, E)

In each category, the effect of some factors on performance is obvious. Examples include number of slave nodes, input data size, CPU or memory or I/O intensive nature of the program and disturbances internal to slave node. These factors are known and visible even when the underlying computational platform is a black-box, i.e. we do not need to know about under-the-hood details of MR. On the other hand, knowledge about MR operation (viewing the system as a grey-box) reveals that certain factors play a vital role in performance. Examples include number of slots per slave node, replication factor, data movement patterns between the map and reduce phases (which is a consequence of the nature of data processing, i.e. aggregation or summarization or transformation) and Hadoop configuration factors such as io.sort.factor, io.sort.mb and io.sort.record.percent.

The fault-tolerance mechanism in Hadoop does not require any user intervention and jobs run to completion through failed task re-execution. However, the performance penalty associated with this can be very high (Kadirvel and Fortes 2011b; Wang et al. 2009c). Based on our performance and fault characterization experiments in Hadoop, we identify factors associated with compute node faults that severely affect application performance. These include number of node faults, time of fault occurrence and timeout values. In Section 3.7, we quantitatively evaluate the effect of these identified factors. In the spectrum of black to increasingly grey views of the modeled system, depicted in Figure 3-2, we note that these fault-related factors fall further towards the right since the identification of these factors needs in-depth understanding of fault-tolerance mechanisms deployed in Hadoop.

In the plots of Figure 3-3, we record the effect on performance of varying one factor while keeping all others constant. In Figure 3-3A, we see that the effect of node scaling in
the presence and absence of faults follows very different trends. In Figure 3-3B, it is observed that for Terasort doubling the dataset size from 5 GB to 10 GB quadruples the runtime, while doubling the dataset from 10 GB to 20 GB less than doubles the runtime. In Figure 3-3C, a single node fault is injected into a running job at different points of job progress. 'Early' refers to ~15%, 'Mid' refers to ~30% and 'Late' refers to ~80% of job progress. The experiment is repeated for different values of the 'timeout' parameter provided in Hadoop. Timeout determines the amount of time for which the master node waits before it declares a slave node as dead, if it hasn't received a heartbeat message from it. We see that this parameter can have a significant effect on job completion time, resulting in penalties as high as twice the job runtime. In Figure 3-3D, another configuration parameter, 'Number of Reduce tasks', is varied. We see that this can lead to a difference of up to 44% in job runtime. In Figure 3-4, we exemplify the impact of 4 different configuration parameters and sample different values over the permissible range. We see that the job runtimes varied widely, between 1200 and 2200 seconds, for the same Terasort job executed on 10 GB of data.

In all these job runs, we see that performance can vary significantly by varying different factors. Furthermore, there is a non-linear dependence of performance on a complex array of factors, making performance a candidate for prediction using ML techniques.

3.5.2 Heterogeneity

Figures 3-5A, 3-5B, 3-6A and 3-6B illustrate differences in task runtime observed in our testbed for both Wordcount and Terasort. In Figure 3-6A, we see how the runtime of tasks in the map phase of a Terasort application varies both between different slave nodes and within a slave node. In the cluster of bars associated with each slave node, each bar corresponds to the runtime in a map wave. Figure 3-6B shows the reduce task runtime distributions, all of which are comparable to each other and within 20% of each other. The reverse trend is observed for the case of the
Wordcount application; that is, task runtimes are similar in the map phase, shown in Figure 3-5A, while stark differences are seen within a node in the reduce phase, as shown in Figure 3-5B.

Table 3-1. Factors influencing performance
Category        Examples of factors
Resource        Number of slave nodes, Number of map and reduce slots per slave node, Number of cores per slave, Memory per slave, Network and I/O capability of slave
Data            Total input data size, HDFS Block size, Replication, Content
Program         CPU-intensive, memory-intensive, or I/O intensive; Nature of data processing: Aggregation, Summarization, Transformation
Configuration   Map-Reduce factors: Number of reduce waves, io.sort.mb, io.sort.record.percent, io.sort.factor; Java Virtual Machine (JVM) factors: mapred.child.java.opts
Faults          Number of slave nodes that crashed, Time (during job progress) at which nodes fail, Task failure, TaskTracker/DataNode daemon failure
Environment     Disturbances internal to slave virtual machines, Disturbance from co-located virtual machines

It is to be noted that these differences in task runtime occur despite the fact that our testbed is homogeneous in terms of hardware, virtualization middleware, operating system and MR middleware. This is because resources are shared within a slave node between map and reduce tasks and the actual scheduling of operations on a particular node is not deterministic. Different interleavings of operations, such as the interspersed execution of a computation-light map step and an I/O heavy shuffle step, can result in increased map times. This was the specific reason for increased task times as observed in the sample job run shown in the previous figures.

Prediction results shown in Section 3.7 suggest that ML techniques are capable of capturing the dynamics due to task scheduling. In a real cloud environment, this heterogeneous nature may be more intense due to resource contention both within a virtual machine and its physical host. This makes the case for inclusion of node-specific factors to the performance model and further corroborates the usefulness of supervised learning techniques since they can accommodate very large feature spaces.
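
One way to picture the relationship Performance = f(R, D, P, C, F, E) together with the categories of Table 3-1 is as a flat record of features that is handed to a regression technique. The Python sketch below is only illustrative: the class, its field names and the sample values are hypothetical, the Environment category is left out for brevity, and it is not the feature set used in the experiments.

```python
from dataclasses import dataclass, asdict


@dataclass
class JobFeatures:
    """Hypothetical per-job feature record grouped by the Table 3-1 categories."""
    # Resource (R)
    num_slaves: int
    # Data (D)
    input_size_gb: float
    # Program (P): a label, not used directly as a numeric input here
    program: str
    # Configuration (C)
    num_reduce_tasks: int
    io_sort_mb: int
    # Faults (F)
    num_node_faults: int
    fault_time_fraction: float  # fraction of job progress at which the fault occurs
    timeout_seconds: int

    def numeric_vector(self) -> list:
        """Return the numeric features in declaration order, skipping string fields."""
        return [float(v) for v in asdict(self).values() if not isinstance(v, str)]


# Hypothetical job description, for illustration only.
job = JobFeatures(16, 10.0, "terasort", 64, 100, 1, 0.3, 600)
print(job.numeric_vector())
```

Adding node-specific factors, as argued above, would simply widen this record; the learning technique is then responsible for coping with the larger feature space.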

[Figure 3-4 plot omitted. Job runtime (seconds) is plotted against job configuration values C1 to C10.]
Figure 3-4. Job runtime variations for different choices of configuration parameters. Each job executes Terasort on 10 GB of data. Each bar corresponds to a different combination of {IOSortRecordPercent, Number of Reduce tasks, IOSortFactor, IOSortSpillPercent} configuration values.
C1: {0.15, 127, 100, 0.80}   C2: {0.50, 127, 100, 0.20}   C3: {0.40, 127, 100, 0.80}
C4: {0.05, 127, 100, 0.80}   C5: {0.50, 127, 100, 0.80}   C6: {0.05, 63, 100, 0.20}
C7: {0.05, 63, 100, 0.80}    C8: {0.05, 64, 10, 0.80}     C9: {0.05, 63, 10, 0.20}
C10: {0.05, 65, 10, 0.80}

3.6 Performance Prediction

3.6.1 Desired Characteristics of the Prediction Technique

Based on empirical evidence presented in the previous section and the needs of the use cases of a prediction model, the desiderata of a suitable prediction technique can be summarized as follows: (1) Good prediction accuracy (since expensive actions may have to be taken on the basis of its results), (2) Capability to handle a large number of features (in order to accommodate all the categories of factors and if necessary node-specific attributes too), (3) Fast prediction computation time (this will allow for the predictor to be incorporated in online feedback control loops for the purposes of resource provisioning, scheduling and proactive fault management), (4) Simplicity in tuning the learning algorithm's parameters, and (5) Human understanding of model (for improved validation of the final model structure and for confidence in incorporating it into a critical system component).

3.6.2 Feature Processing

Application of ML solutions to real-world problems typically requires training data set generation, feature identification, feature collection, and feature pre-processing (Witten and Frank 2005). Sources of information in a MR platform include JobTracker
logs, TaskTracker logs, job configuration XML file and a history log file that is generated for each job. In addition, slave node configuration files, Hadoop metrics, Ganglia metrics also contain critical information. In the present work, focus has been on utilizing information available in the log files. We use suggestions from the Apache Hadoop project resources and from Herodotou and Babu (2011) to choose the initial set of configuration parameters that affect runtime for the applications Terasort and Wordcount.

[Figure 3-5 plots omitted. Both panels plot execution time (seconds) against virtual machine / Hadoop slave ID (d26, d27, d28, d30, d31, d32, d33, d35). A: Wordcount application, Map phase. B: Wordcount application, Reduce phase.]
Figure 3-5. Task runtime distribution in the Map and Reduce phase of a Wordcount application run. Each cluster of bars for a Hadoop slave node on the X-axis corresponds to subsequent waves within the Map phase. For the map phase limited variability is observed. For the reduce phase, high variability is observed within a node, while the trend over all slave nodes is similar.

Feature selection: A feature selection technique allows for determining a subset of the available features that has maximum predictive capability. The presence of irrelevant attributes can cause the performance of prediction techniques such as decision trees and rules, linear regression, instance-based learners to be lowered (Witten and
Frank 2005). A selection technique requires the use of two main components: an attribute evaluator and a search algorithm for exploring the space of possible feature subsets. We choose subsets with features that have high correlation with the target value to be predicted but have low intercorrelation between each other. This is implemented using Weka's CfsSubsetEval evaluation method. We use five different search techniques as listed in Table 3-2.

[Figure 3-6 plots omitted. Both panels plot execution time (seconds) against virtual machine / Hadoop slave ID (d2, d3, d4, d7, d8, d9, d13, d14). A: Terasort application, Map phase. B: Terasort application, Reduce phase.]
Figure 3-6. Task runtime distribution in the Map and Reduce phase of a Terasort application run. Each cluster of bars for a Hadoop slave node on the X-axis corresponds to subsequent waves within the Map phase. In the map phase high variability is observed, while in the reduce phase much less variability is observed. In B, slave with ID "ubuntudom61" did not have a reduce wave scheduled on it, based on the recommended practice of leaving a small percentage of reduce slots vacant for possible task slow-downs or failures.

3.6.3 Prediction Techniques

A wide array of regression techniques of different types were tested. We report results only for those techniques that achieved good performance, which we define as those for which the correlation coefficient (between the actual and predicted values of
job completion time) is greater than 0.9. We identify four best algorithms that ought to be seen as candidates for performance prediction in Hadoop MR jobs. These are (1) Gaussian Process regression (GPR), (2) Multilayer Perceptron (MLP), (3) Regression by Discretization (RBD) and (4) M5 Pruned Model Tree (M5P).

Table 3-2. Search algorithms used for feature selection
Name                            Description
Best first                      Greedy hill climbing with backtracking
Exhaustive search               Exhaustive search through all possible feature subsets
Genetic search                  A simple genetic algorithm based search (Goldberg 1989)
Linear forward selection        Extension of Best first aimed at reducing number of evaluations performed
Subset size forward selection   Extension of Best first aimed at a compact final subset of features

[Figure 3-7 plots omitted. Both panels plot correlation coefficient against regression technique (GPR, MLP, RBD, M5P, SLR, REP). A: Improvement in prediction by inclusion of 'Fault' category features (without and with fault features). B: Improvement in prediction by inclusion of 'Configuration' category features (without and with configuration features).]
Figure 3-7. Difference in correlation coefficients by the inclusion of features in the 'Fault' category shown in A and 'Configuration' category shown in B for different prediction techniques. Addition of fault and configuration features improves the correlation coefficient by an average of 20% as seen in A and 14.9% as seen in B.

In order to provide a base reference, results are also reported for two other techniques: (1) Simple Linear
Regression (SLR) based on the single most important attribute and (2) a decision tree learner for regression that uses reduced-error pruning (REP).

[Figure 3-8 plots omitted. Each panel plots percent of data within tolerance against percentage error tolerance. A: Variations in cluster size and dataset size. B: Variations in cluster, dataset size and injected faults. C: Variations in dataset size and configuration parameters. Legend: Gaussian Process Reg., Multilayer Perceptron, Reg. By Discretization, M5 Model Tree, Simple Linear Reg., Reduced Error Pruning Tree.]
Figure 3-8. Regression Error Characteristic (REC) curves comparing different techniques for predicting job runtime for the WordCount application when A number of slaves and dataset size is varied, B node faults are injected and C dataset size and configuration parameters are varied. Solid curves represent techniques with high correlation coefficient (0.95); dashed curves represent techniques with low correlation coefficient. To reduce clutter, the common legend is shown only for the center graph.

The other twelve techniques used in our comparison were isotonic regression, least median squares, ridge regression, pace regression, radial basis function networks, support vector regression, additive regression, bagging, ensemble selection, random subspace
decision tree, conjunctive rule learner, decision table and decision stump. The M5 technique when represented as rules instead of a tree (M5Rules) also performed well but we do not report its results separately since it is very similar to M5P.

Gaussian Process Regression is a non-linear kernel-based method for which we use a Radial Basis Function kernel in our experiments. The Multilayer Perceptron is a back-propagation based neural network technique. In Regression-By-Discretization, the target value to be predicted is discretized into multiple bins and then a classification technique (J48) is used to predict the bin or class label. The M5P technique uses the M5 model tree induction algorithm. A model tree divides up the input space into sub-spaces, each corresponding to a leaf node. At the leaf, a linear regression model is used for prediction. We refer the reader to Witten and Frank (2005) for further details about these techniques. While it is not possible to guarantee that one of these is the best for different use-case scenarios, our evaluation results in the following section illustrate each method quantitatively and qualitatively.

3.7 Evaluation Results

Cluster and dataset size: In the first set of experiments, we vary cluster size and dataset sizes and record job runtime values for each. Initially six different attributes were chosen as features. Using the five feature selection techniques listed in Table 3-2, the two most important features were identified. Best-First, Exhaustive-Search and Genetic-Search narrowed down these factors to {Number of Slaves, Number of chunks}, while Linear Forward Selection and Subset Size Forward Selection chose {Number of Slaves, Map Input Data}. In both cases, the former feature corresponds to the 'Resource' category and the second corresponds to the 'Data' category. In fact, the number of chunks is directly related to the size of input data and hence captures the same information.

All the eight prediction techniques described in Section 3.6.3 were used to predict job runtime. Ten-fold cross validation was used to determine the accuracy of each
model. Regression Error Characteristic (REC) curves (Bi and Bennett 2003) were used to compare the techniques in Figure 3-8A. This curve shows the percentage of data points in the dataset for which the error percentage is below a threshold value and it is plotted as a function of this threshold value. 'Error percentage' is the ratio of the difference between the actual value and predicted value to the actual value. These curves help visually compare multiple techniques, with a curve that is closer to the upper left corner being better than one that is farther away. Such a curve would correspond to a larger number of data points having low error.

Node faults: In the second set of experiments, we inject node faults into a running job at different points in its progress, and record job runtimes. Performance of different prediction techniques is compared in Figure 3-8B. All five feature selection search algorithms consistently narrowed down the features to this subset: {Number of Faults, Time of Faults, Timeout, Number of Slaves, Number of Reduce Waves}. We see that these features belong to three categories: 'Resource', 'Faults' and 'Configuration'. Though intuitively we would expect 'Resource' and 'Data' to play a stronger role, we see that a 'Configuration' parameter has been chosen. On further analysis, it was found that this was an artifact of the training dataset. This specific dataset had far fewer variations in the dataset size than in the number of reduce waves. In order to verify this, we excluded the dataset size parameter from the first set of experiments and reran all the feature selection algorithms. It was found that {Number Of Reduce Waves} was consistently chosen, showing that it is the feature of maximum merit when input data set size is not considered.

Addition of 'Fault' category features resulted in improvement in prediction accuracy. We show this improvement by plotting the correlation coefficient values (between predicted and actual job completion times) obtained both with and without the inclusion of 'Fault' features in Figure 3-7A. In addition, among the fault features, the Timeout parameter is the most important. This can be seen by comparing the correlation
coefficient of a prediction model including all three features with another instance of the same model leaving one feature out (one-at-a-time). This is shown in Figure 3-9. We note that our work is the first to take the Timeout feature into consideration in performance modeling of MR jobs in the presence of failures and we have shown empirical evidence of the significance of this feature.

[Figure 3-9 plot omitted. Correlation coefficient is plotted for the MLP and GPR regression techniques using four feature sets: {Num. Of Faults, Time Of Faults, Timeout}, {Num. Of Faults, Timeout}, {Time Of Faults, Timeout} and {Num. Of Faults, Time Of Faults}.]
Figure 3-9. Comparative importance of different factors in the 'Fault' category. In each case one factor is left out and the correlation coefficient is measured using the remaining pair. It is seen that removal of Timeout results in maximum reduction. This shows that it is of critical importance in performance prediction under degraded operating conditions induced by node faults.

Configuration parameters: In the next set of experiments we perturb Hadoop configuration parameters for the Wordcount application. We compare prediction accuracy when using the same set of features from the 'Resource' and 'Data' categories with addition of 'Configuration' features. In addition to using REC curves to show comparative performance in Figure 3-8C, we show the difference in prediction accuracy through the correlation coefficient values obtained both with and without the inclusion of 'Configuration' features in Figure 3-7B. We also run the feature selection search algorithms and find the largest chosen subset to be {Map Input Size, Number of Reduces, I/O Sort Record Percent, Map-Reduce Parallel Copies}. We see that all the perturbed 'Configuration' features are selected. For simple linear regression, there is no change in prediction performance because prediction depends on the single most important attribute in both cases.
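
The two accuracy measures used throughout this section, the REC curve and the correlation coefficient between actual and predicted runtimes, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the Weka-based tooling used in the experiments: the function names and sample runtimes are hypothetical, the error is taken as an absolute value, and a point counts as within tolerance when its error is less than or equal to the threshold.

```python
import math


def percentage_errors(actual, predicted):
    """Absolute prediction error of each point as a percentage of the actual value."""
    return [100.0 * abs(a - p) / abs(a) for a, p in zip(actual, predicted)]


def rec_curve(actual, predicted, tolerances):
    """For each error tolerance, the percent of data points whose error is within it."""
    errors = percentage_errors(actual, predicted)
    n = len(errors)
    return [100.0 * sum(e <= t for e in errors) / n for t in tolerances]


def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical job runtimes in seconds, not taken from the experiments.
actual = [600.0, 800.0, 1000.0, 1200.0]
predicted = [630.0, 760.0, 1100.0, 1200.0]

print(rec_curve(actual, predicted, [0, 5, 10, 20]))  # [25.0, 75.0, 100.0, 100.0]
print(round(correlation(actual, predicted), 3))
```

A leave-one-feature-out comparison such as the one in Figure 3-9 would call `correlation` once per trained model variant and compare the resulting values.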


Table 3-3. Prediction accuracy evaluation. Each entry is the mean percentage error (maximum, standard deviation).

Prediction   Variations in          Variations in                 Variations in
technique    {Resource, Dataset}    {Resource, Dataset, Fault}    {Configuration, Dataset}
GPR          8.6 (30.7, 6.4)        9.6 (49.7, 10.9)              7.7 (34.4, 7.8)
MLP          4.4 (29.4, 4.5)        7.7 (43.4, 8.1)               5.9 (57.1, 7.3)
RBD          8.9 (32.3, 9.3)        8.3 (50.2, 9.3)               4.9 (23.7, 4.5)
M5P          11.9 (48.5, 11.2)      9.2 (48.7, 9.8)               6.9 (42.1, 7.1)

For each of the datasets studied above, we list the prediction error in the form of its mean, maximum and standard deviation in Table 3-3. We see that in all cases the mean prediction error is below 12%. Reported errors in related work using analytical models are < 10% in Verma et al. (2011) and range from < 2% to < 20% in Cardosa et al. (2011). In both these works, reported results consider a subset of the factors included in our reported results. Specifically, in the former, configuration parameters are not considered and in the latter, fault conditions are not considered. So we note that these are only ballpark comparisons.

The model training process in our experiments took less than one second for all four best-performing methods. This makes them suitable for incorporation within production systems. A key advantage of M5P over the others is that the model consists of a tree with nodes representing conditions using the input features as variables. This provides transparency to the prediction model and may help practical adoption when compared to the more opaque and hence harder-to-interpret structure of the Multilayer Perceptron. Gaussian Process Regression is a non-parametric technique and hence does not impose structure on the data. However, this makes it suitable only when a sufficient amount of training data is available.

We note that we have not taken into consideration instance-based learning methods because in these methods, the entire training dataset needs to be retained as part of the final prediction model. This would become unwieldy for our intended resource management application. Other difficulties include the need for training data


that is uniformly spread over the input feature space and higher prediction computation times. However, many optimizations exist that can minimize these overheads and we encourage exploration in this direction.

3.8 Summary and Contributions

In this chapter, a grey-box technique is proposed for performance prediction in Map-Reduce based platforms. This technique strikes a good balance: it leverages a general understanding of system operation while avoiding the need for in-depth mathematical representations of all constituent components. This is beneficial when compared to white-box models, which can easily grow to be very complex or might have to make common-case assumptions and simplifications to become practical. Black-box models resolve this issue by abstracting system behavior through externally observable measures and work well in many cases. In the proposed models, we show that when a system becomes more complex, viewing it as a grey box allows us to utilize lower levels of system information, providing for richer models and hence more accurate performance modeling.

Experimental evaluations using supervised learning regression techniques show good prediction accuracy and prediction computation time for the conditions studied. Importantly, we were also able to make accurate predictions of degraded job performance in the presence of node faults that were injected at different times during a job's progress.

Supervised learning techniques thus show promising results for the purpose of performance prediction of MR jobs and can easily incorporate the large spectrum of factors that influence performance. Furthermore, the use of ML techniques opens up the possibility of including a large number of node-specific factors that affect performance, thereby facilitating the capture of resource heterogeneity and contention, which are becoming increasingly common in modern data centers.


The grey-box prediction models presented in this chapter are incorporated into the autonomic control loop of FMR presented in Chapter 4.


CHAPTER 4
FAULT MANAGEMENT IN MAP-REDUCE

This chapter presents the design and implementation of an autonomic management loop to monitor, detect and handle faults at runtime in a Map-Reduce based platform. The key idea that is proposed to mitigate performance penalties in the presence of fail-stop and fail-stutter faults is early fault detection and a closed-loop recovery mechanism based on dynamic resource scaling. Fault detection at the slave node is non-trivial because it is challenging to estimate whether an executing map or reduce task is making sufficient progress to meet an expected performance goal. The proposed solution overcomes this difficulty through the use of statistical machine-learning techniques for identifying anomalous behavior in a node. Map-Reduce applications are instrumented to emit application heartbeats. The heartbeat signal from each node is used to characterize the rate at which data is being processed by a node. A deviation from previously observed application behavior on each node is detected using a single-class classification technique based on sparse coding. Anomaly detection using single-class classification enables the creation of prediction models using only normal (or non-anomalous) training data. This is beneficial because obtaining representative anomalous training data for the diverse set of possible anomalies in a production environment is very difficult. Once the fault is detected, horizontal resource scaling is chosen as the remediation technique. Horizontal scaling is widely available in most Infrastructure-as-a-Service and Platform-as-a-Service offerings and therefore leads to a recovery solution that can be easily adopted in practice. A scaling heuristic is proposed in order to determine the intensity of scaling that is necessary. This intensity is determined by factors such as the time of occurrence of the fault, the size of the dataset to be processed by the Map-Reduce job, the number of slave nodes available and the time taken to provision extra resources. FMR facilitates practical adoption by being implemented as a set of libraries and scripts that require no changes to the


underlying source code of Hadoop. A set of realistic Map-Reduce applications was studied through a few thousand job executions on a 72-node Hadoop testbed. Detailed empirical evaluation shows that FMR successfully mitigates performance penalties from 119% down to 14%, averaged across experiments.

A pictorial representation of the autonomic control loop of FMR is shown in Figure 4-1.

Figure 4-1. Autonomic control loop for Fault-managed Map-Reduce (FMR)

4.1 Introduction

Fault-managed Map-Reduce (FMR) aims at mitigating performance penalties experienced by Map-Reduce jobs. FMR uses a Monitor-Analyse-Plan-Execute (MAPE) control loop to provide an online, on-demand and closed-loop solution to fault management. In FMR, faults are anticipated through the detection of anomalous conditions that are indicative of an impending fault (Gabel et al. 2012; Pinheiro et al. 2007).


For anomaly detection in this context, we propose the use of a simple machine-learning technique based on sparse coding. This technique satisfies the following two requirements: (1) model training using only normal-class data (as opposed to the use of both normal-class and anomaly-class data) and (2) fast prediction time. Normal-class data captures the runtime behavior of a job that has not experienced a performance fault. Training using only normal-class data is necessary because anomaly-class data that is representative of all (or most) possible types of faults is difficult to obtain in a production environment. Prediction computation time using the proposed sparse-coding technique is less than a second. This allows the anomaly detection module to be incorporated in an online fashion within the MAPE loop for handling faults during job execution. In addition to these essential requirements, sparse coding based anomaly detection has two other benefits. First, the sparse coding model is deployed locally on each slave node and does not need to communicate or synchronize with models on other nodes to make a prediction. This makes FMR applicable to both homogeneous and heterogeneous Map-Reduce environments. Second, the time taken to train a sparse coding model is in the order of a few seconds. This makes it possible to quickly create models for a new Map-Reduce application and also to quickly re-train models when system characteristics change. Map-Reduce applications need to be instrumented to emit heartbeats, which are further processed to construct feature vectors that then serve as input to the anomaly detection module.

After an anomaly is detected, FMR uses dynamic resource scaling to reduce the performance penalty due to an impending fault. A scaling heuristic is used to determine the extent of scaling necessary. This heuristic uses performance prediction models derived from our previous work (Kadirvel and Fortes 2012) to estimate Map-Reduce job execution times both in fault-free and fault-present conditions. The cost due to increased execution time is compared with the cost for additional resources and then a suitable scaling decision is taken.
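The heartbeat processing mentioned in this section, turning raw heartbeat timestamps into a feature vector, can be pictured with a short sketch. It is not the FMR implementation: the function names, the window size and the choice of "beats per second over a sliding window of consecutive timestamps" are assumptions made for illustration.

```python
def heartbeat_rates(timestamps, window):
    """Heartbeat rate (beats per second) over each sliding window of timestamps."""
    if window < 2:
        raise ValueError("window must span at least two heartbeats")
    rates = []
    for i in range(len(timestamps) - window + 1):
        span = timestamps[i + window - 1] - timestamps[i]
        # (window - 1) intervals separate the first and last beat of a window.
        rates.append((window - 1) / span if span > 0 else float("inf"))
    return rates


def heartbeat_wave(timestamps, window, length):
    """Feature vector: the most recent `length` heartbeat-rate values."""
    rates = heartbeat_rates(timestamps, window)
    if len(rates) < length:
        raise ValueError("not enough heartbeats for a wave of this length")
    return rates[-length:]


if __name__ == "__main__":
    # Hypothetical timestamps (seconds); the widening gaps mimic a slowing node.
    stamps = [0.0, 1.0, 2.0, 4.0, 8.0]
    print(heartbeat_wave(stamps, window=3, length=3))
```

A falling sequence of rate values, as in the example timestamps, is the kind of change in behavior that the anomaly detection module is meant to pick up.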


FMR leverages built-in features of Hadoop in order to implement its control loop. These include (1) the provision for seamless dynamic addition of slave nodes to an executing job, (2) blacklisting of slave nodes to stop assignment of new tasks to a slave node, and (3) the node health script feature to periodically monitor a user-defined set of conditions on the slave. FMR has been designed to require no changes to the underlying Hadoop codebase, thereby facilitating practical adoption.

The increasing prevalence of Map-Reduce applications along with increasingly fault-prone, large-scale computing environments makes FMR a timely and critical component to improve Map-Reduce performance in the presence of faults. The main contributions proposed in this chapter are as follows:

(1) Fault anticipation and early detection through a sparse-coding based anomaly detection method. The proposed technique has a high true positive rate of 0.95 and a high true negative rate of 0.93, averaged across experiments. Additionally, it provides the benefits of short training and testing times, and requires only normal-class data for training.

(2) A closed-loop, online dynamic resource scaling approach to reduce fault-induced performance penalties. Observed performance penalties (without FMR) range from 18% up to 210%. Using FMR, penalties were brought down to values ranging from 5% to 46%. FMR has been thoroughly evaluated using a few thousand experiments on a 72-node in-house cluster. Injected faults include CPU, memory and disk hog processes as well as node crashes. Benchmark applications from the domains of text mining and machine learning were used for the evaluation of FMR.

In Section 4.2, background to the problem and related work are discussed. In Section 4.3, the Fault-managed Map-Reduce approach is introduced. In Section 4.4, implementation details of FMR are described. Section 4.5 consists of experimental validation of FMR and a discussion of the results.


4.2 Background and Related Work

In this section, we summarize Map-Reduce research related to fault management and bring out the need for FMR.

Effect of faults in Hadoop: Dinu and Ng (2012) evaluate the behavior of Hadoop in the presence of fail-stop faults of an entire compute node as well as Hadoop components such as the TaskTracker and DataNode daemons. The authors show that TaskTracker failures can result in up to 350% penalty while DataNode failures can lead to 218% penalty. Wang et al. (2009b) present a Map-Reduce simulator, MRPerf, and show that it can capture fault effects. Their simulation experiments show penalties up to 186% for various injected faults. Our work (Kadirvel and Fortes 2011b) illustrates the effect of various factors on performance penalty such as number of slave nodes, time of fault injection and fault-detection timeout interval. These research results, along with the increasing importance of Map-Reduce, motivate our goal of improving fault management in Hadoop.

Fault diagnosis: The Fingerpointing project, which includes works such as Pan et al. (2010), Tan et al. (2010), and Bare et al. (2010), focuses on fault diagnosis in Map-Reduce environments. Our approach focuses on fault detection and fault recovery through an online, closed-loop approach. However, diagnosis is important and our choice of sparse coding for anomaly detection is motivated by the need to extend detection to diagnosis in order to facilitate more targeted recovery actions.

Fault handling: In Mantri (Ananthanarayanan et al. 2010), outliers in an executing Dryad Map-Reduce job are identified through the use of static thresholds determined from application history. The determination of the correct threshold to use is challenging and a pre-set threshold can often drift to become incorrect in dynamic environments. Hadoop provides a built-in feature called speculative execution in which slow tasks are chosen to be executed through duplicate task instances. The deficiency of speculative execution in heterogeneous environments has been addressed by the LATE algorithm


proposed by Zaharia et al. (2008). FMR applies both to performance faults and to performance faults that lead to crash faults. However, the latter condition cannot be handled by speculative execution and LATE, and this is empirically illustrated in Section 4.5. Speculative execution also uses resources inefficiently through the execution of many duplicate tasks (e.g., in Zaharia et al. (2008) it was observed that as many as 80% of tasks were speculatively executed). In contrast to speculative execution and LATE, which use progress-based analytical models for detecting a slow task, FMR uses decentralized and local machine-learning models on each node for detecting anomalies.

Performance prediction: Predicting the completion time of a Map-Reduce job is done through analytical models in Verma et al. (2011) and through simulation models in Herodotou et al. (2011). In Bortnikov et al. (2012), the authors predict map and reduce task slowdown using the gradient boosted decision tree model. However, prediction is based on offline analysis. The anomaly detection method proposed in this paper can be used in an online fashion and hence enables incorporation into the MAPE control loop.

Anomaly detection: Tan et al. (2012) propose anomaly prevention schemes through the use of Markov chain models for general virtualized cloud computing systems. However, it is a supervised learning technique, which means that representative normal and anomaly class data is needed. Daniel Dean (2012) and Gabel et al. (2012) propose unsupervised techniques for anomaly detection. FMR's anomaly detection technique is similar in goal to these works.

Performance management: Herodotou et al. (2011) (Starfish project) and Verma et al. (2011) (ARIA project) propose the use of dynamic resource scaling for performance management of Hadoop jobs. The Starfish project does not handle faults and the ARIA project handles fail-stop faults. FMR's focus is on performance faults that result in degraded job execution times. Lama and Zhou (2012) (AROMA) use


machine-learning techniques for resource allocation and configuration in Hadoop; however, it does not handle performance deviations introduced by faults.

Our work is most similar to Ferguson et al. (2012b) (Jockey), in which resource allocation is used to guarantee job latencies for data parallel jobs. Jockey depends on an offline job profile simulator for completion time prediction, while FMR uses an online, machine-learning based model for prediction.

Figure 4-2. Overview of Hadoop showing interactions between the JobTracker, NameNode, TaskTracker and DataNode hosted on the master and slave nodes.

4.3 Design of Fault-Managed Map-Reduce

This paper focuses on the open-source Map-Reduce implementation, Hadoop (Hadoop 2004). Hadoop consists of the following main components: (1) JobTracker and TaskTracker daemons that manage scheduling and coordination of map and reduce tasks, and (2) NameNode and DataNode daemons that manage the Hadoop Distributed File System (HDFS). The JobTracker and NameNode daemons run on the Hadoop master node, while the TaskTracker and DataNode daemons run on the slave nodes. Figure 4-2 shows a simplified overview of Hadoop.

In a Map-Reduce job, when a node fails, all map tasks that were executed on this node (for this job) have to be re-executed on other healthy nodes. This is because map outputs are stored locally at each slave node (rather than being stored on the replicated HDFS). Map tasks whose outputs have already been read by corresponding reduce


tasks need not be re-executed. The master node detects a slave node fault after a static timeout interval (as shown in Figure 4-3A) and then initiates re-execution. The performance penalty due to a single node fault is illustrated for Hadoop clusters of different sizes in Figure 4-4A and for the case when node faults occur at different points during a job's runtime in Figure 4-4B. These penalties (ranging up to 155%) motivate the need for FMR.

One of the main contributors to the performance penalty experienced in the presence of faults is the timeout interval between fault occurrence and detection by the master. Therefore, in order to detect faults sooner, the key idea in FMR is the anticipation of a fault through anomaly detection. Figure 4-3B shows the period during which FMR attempts to detect faults. An anomaly refers to a condition that is indicative of an impending performance fault. In the context of this paper, a performance fault refers to a Map-Reduce job's execution time exceeding a pre-specified Service Level Objective (SLO). By default, this SLO is the fault-free execution time. Several studies have shown that node crash faults are preceded by anomalous conditions (Gabel et al. 2012; Pinheiro et al. 2007).

We use a machine-learning technique to identify whether a node is behaving in a manner that is unusual based on its own history for a specific type of application. After the detection of a node anomaly, recovery is initiated through dynamic resource scaling. Anomaly detection and dynamic resource scaling are described in the following subsections.

4.3.1 Anomaly Detection

Current anomaly detection techniques used in systems management depend on identifying various static thresholds as part of the control policy. When system metrics exceed these pre-determined thresholds, either alarms or suitable recovery actions are invoked. Although this simplifies the process of anomaly detection, these thresholds are difficult to determine and need to be customized as system conditions change.

[Figure 4-3, panel A: Fault detection in Hadoop. Panel B: Improved fault detection in Hadoop with FMR.]
Figure 4-3. The shaded (blue) region represents map and reduce tasks running on a slave. After the fault, shown by a cross, tasks stop running on this slave. The Hadoop master detects the fault after a static timeout value. The lightly shaded (green) region introduced into the node timeline in (B) corresponds to the period leveraged by FMR for early fault detection.

[Figure 4-4, panel A: execution time (in seconds) versus cluster size (number of nodes), for no fault and a single fault. Panel B: execution time (in seconds) versus fault occurrence point (10% to 90%), for no fault and a single fault.]
Figure 4-4. Execution time increase for Hadoop wordcount jobs (A) of different cluster sizes and (B) for node crash faults injected at different points in job progress.

In the context of machine learning, anomaly detection can be viewed as a classification problem. Given a feature vector describing recent conditions on a compute node, we want to be able to predict whether or not this corresponds to an anomalous condition. In our work, an anomalous condition on a node could lead to a performance fault of the executing Map-Reduce job. Training data from compute nodes


that are operating normally will be referred to as normal-class data, while those from a potentially faulty node will be referred to as anomaly-class data.

Application heartbeats: The Map-Reduce application is instrumented to emit heartbeats to indicate the rate of progress in processing input data. Heartbeat timestamps are recorded locally on each slave node in a heartbeat file. A sliding window of timestamps is processed to determine the heartbeat rate. A sequence of heartbeat rate values (referred to as a heartbeat wave) captures application behavior and is used as the input feature vector to the anomaly detection module. Application heartbeats have been used for autonomic management goals in Buneci and Reed (2008) and Hoffmann et al. (2010).

An implicit requirement for the binary or multi-class formulation is the need for balanced and representative training data from each of the classes. In a production computational infrastructure, it is easy to obtain representative normal training data. However, requiring a system designer or administrator to provide sufficient and representative examples of anomalous data (such as from all possible performance anomalies) would strongly restrict the applicability of our approach. In order to overcome this limitation, we propose an anomaly detection method that can be trained using only normal-class data.

Sparse representation has received a great amount of attention in the signal processing community recently, e.g., Elad (2010), Candes and Tao (2004) and Eldar and Mishali (2009), and it readily provides a principled and flexible framework for the feature-based anomaly detection needed in FMR. We note that there are several recent works in image processing and computer vision applying similar ideas to anomalous event detection (e.g., Cong et al. (2011) and Zhao et al. (2011)). Although originating from different application domains, these problems can all be considered as anomaly detection given only normal features. That is, anomalies are not explicitly defined based on input features but only relative to the training normal features, and this apparent


asymmetry in training features is the main source of difficulty. Therefore, an algorithmic solution would require a suitable generative model for the normal features that can be used for identifying anomalies, and the sparse representation (Elad 2010) offers such a model that is known for its simplicity, generalizability, and computational efficiency. Formally, in sparse representation, a feature (considered as a vector) $x$ is represented as a linear combination of a small number of basis features chosen from a dictionary $D$ (of basis features). In the following discussion, we will assume that $x$ is a column vector of dimension $d$ and the dictionary $D$ is given as a $d \times l$ matrix such that $l > d$ ($D$ has more columns than rows). The columns of the dictionary $D$ are the basis features, and the main assumption in sparse signal representation is that a relevant feature $x$ can be reconstructed by a small number $k$ of columns. Mathematically, this can be written as

$$x = D c_x,$$

where $c_x$ is the sparse coefficient for $x$ with respect to the dictionary $D$, and all but $k$ components of $c_x$ are zero. The integer $k$ is the sparsity level of the feature $x$ and it tells us that the feature $x$ can be reconstructed by taking a linear combination of $k$ columns of $D$. In other words, $x$ is in the linear span of these $k$ columns. Given the dictionary $D$, the sparsity level $k$ is the parameter that controls the generalizability (or expressiveness) of the model. For example, when the sparsity level is set to $k = 1$, each feature vector $x$ is just a column of $D$ with a scaling, since we are working with $x$ such that $c_x$ only has one nonzero component in the equation above. For other values of $k$, the features are assumed to be those in the subspaces spanned by no more than $k$ columns of $D$. For this generative model, which is linear in nature, the dictionary and the sparsity level are the only two parameters used for specifying the normal features, and in particular, training of the model is exceptionally easy: simply take the normal training features as the dictionary columns. For anomaly detection, the main assumption we make in regard to the essential difference between feature vectors originating from normal operating


states and anomalous states is that normal features can be sparsely approximated well using only a small number of normal features, while anomaly features are expected not to enjoy this property. Therefore, given a dictionary $D$ and a sparsity level $k$, we will consider any feature vector as a normal feature if it belongs to a subspace spanned by $k$ columns of $D$; otherwise, it will be considered as an anomaly. Figure 4-5 illustrates this through a simplified diagram.

Figure 4-5. A simplified diagram to illustrate the anomaly detection approach. Normal training instances (or feature vectors) are similar because they are produced by the same underlying process and hence, with a high probability, lie within a confined subspace. A normal test instance is also produced by the same process and hence can be reconstructed well by other normal training instances (i.e., dictionary atoms), and as a result its sparse representation has low error values. On the other hand, an anomaly test instance is generated by a different underlying process and lies in a different subspace. Hence, when reconstructed using normal training instances, the sparse representation has larger errors.

More precisely, given a dictionary $D$ of normal features and a sparsity level $k$, we expect that for a normal feature vector $x$, its sparse-coding error $\mathbf{e}$,

$$\mathbf{e} = x - D c_x,$$

should be a vector with small components and its magnitude follows some multivariate normal distribution (and the squared error norm $e = \|\mathbf{e}\|_2^2$ can be modeled by a $\chi^2$-distribution $\chi(e)$). On the other hand, for an anomaly feature vector, its sparse-coding error $\mathbf{e}$ is expected to be large and its squared error norm does not follow the distribution

$\chi$. Therefore, by estimating the background distribution $\chi(e)$ during training, the squared error norm $e$ for an unknown feature vector $x$ can be compared against $\chi$ to determine its classification and the associated confidence level. We remark that the validity of using a sparse model for anomaly detection can only be supported empirically, and Figure 4-6 displays the results of several experiments that confirm our expectation that anomalous features incur large errors when sparsely coded with respect to the dictionary whose columns are normal features, providing strong support for the sparse model. Furthermore, these results also suggest that the error $e$ can be a useful feature for identifying the anomalies.

More specifically, the training component of our method consists of two steps: forming the dictionary $D$ and estimating the distribution $\chi$. We randomly divide the training (normal) feature vectors into two groups. Feature vectors in the first group form the dictionary $D$ and those in the second group are used to empirically estimate $\chi$ and its cumulative distribution function $\mathrm{CDF}(e)$. The user specifies two parameters, $0 < \alpha < \mathit{fnr} < \beta < 1$, which are used to bound the false negative rate $\mathit{fnr}$ as follows. Let $e_n = \mathrm{CDF}^{-1}(1 - \beta)$ and $e_a = \mathrm{CDF}^{-1}(1 - \alpha)$. Any feature vector $x$ with squared sparse-coding error norm $e$ is classified as normal if $e \le e_n$ or as an anomaly if $e \ge e_a$. For the "gray area" between $e_n$ and $e_a$, we define the confidence level $\gamma(e)$ of declaring $x$ as an anomaly according to the formula

$$\gamma(e) = \frac{\mathrm{CDF}(e) - (1 - \beta)}{\beta - \alpha}.$$

Note that $0 \le \gamma(e) \le 1$ and that, to provide the confidence level, $\gamma(e)$ simply scales the probability mass of $\chi$ between $e_n$ and $e_a$ linearly to zero and one so that $\gamma(e_n) = 0$ and $\gamma(e_a) = 1$. We also note that because we are declaring any feature vector $x$ with error $e > e_a$ to be an anomaly, this gives $\alpha$ as a lower bound on $\mathit{fnr}$, the false negative rate (the proportion of (training) normal feature vectors classified as anomalies). Similarly, we also have $\beta$ as an upper bound for $\mathit{fnr}$.

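The training and classification steps described above can be sketched in Python with NumPy. This is an illustrative reconstruction rather than the Matlab implementation used in FMR: the function names are hypothetical, the sparse coder is a plain greedy orthogonal matching pursuit, the dictionary columns are normalized (an added assumption), the background distribution is the empirical distribution of the calibration errors, and the two user parameters are called `alpha` and `beta` with `0 < alpha < beta < 1`.

```python
import numpy as np


def omp(D, x, k):
    """Greedy orthogonal matching pursuit: code x with k columns of D (1 <= k <= l)."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []
    sol = np.zeros(0)
    for _ in range(k):
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -1.0  # never pick the same column twice
        support.append(int(np.argmax(scores)))
        # Least-squares fit of x on the columns selected so far.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coef = np.zeros(D.shape[1])
    coef[support] = sol
    return coef


def squared_error(D, x, k):
    """Squared norm of the sparse-coding error x - D c_x."""
    r = np.asarray(x, dtype=float) - D @ omp(D, x, k)
    return float(r @ r)


def train(normal_vectors, k, alpha, beta, seed=0):
    """Split normal data: one half forms D, the other calibrates the error distribution."""
    data = np.asarray(normal_vectors, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(data))
    half = len(idx) // 2
    D = data[idx[:half]].T                        # d x l dictionary
    D = D / np.linalg.norm(D, axis=0, keepdims=True)
    errs = np.sort(np.array([squared_error(D, x, k) for x in data[idx[half:]]]))
    return {"D": D, "k": k, "errs": errs, "alpha": alpha, "beta": beta,
            "e_n": float(np.quantile(errs, 1.0 - beta)),
            "e_a": float(np.quantile(errs, 1.0 - alpha))}


def classify(model, x):
    """Return (label, confidence of declaring x an anomaly)."""
    e = squared_error(model["D"], x, model["k"])
    if e <= model["e_n"]:
        return "normal", 0.0
    if e >= model["e_a"]:
        return "anomaly", 1.0
    # Gray area: rescale the empirical CDF between e_n and e_a to [0, 1].
    cdf = np.searchsorted(model["errs"], e, side="right") / len(model["errs"])
    gamma = (cdf - (1.0 - model["beta"])) / (model["beta"] - model["alpha"])
    return "uncertain", float(np.clip(gamma, 0.0, 1.0))
```

With normal vectors confined to a low-dimensional subspace, a vector pointing away from that subspace cannot be reconstructed by a few dictionary columns and should therefore land above `e_a`.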

[Figure 4-6, panels A (Wordcount), B (Pi estimation) and C (Grep): error magnitude (y-axis) versus training and test instances (x-axis) for normal training data, normal test data and anomaly test data.]
Figure 4-6. Errors in the sparse representation of training and test feature vectors for three Map-Reduce application datasets. The plots illustrate a significant difference in the magnitudes between normal-class and anomaly-class instances.

We remark that the key point in our method described above is the sparsity requirement, since without it, any anomaly feature vector can be approximated well using sufficiently many normal feature vectors in the dictionary $D$. Only by imposing sparsity is it possible to use the error $e$ as a meaningful value for classifying the feature vector $x$. The sparsity requirement can further be justified using our qualitative understanding of the normal states and anomalies. In most applications, the normal features are comparatively more homogeneous than the anomalies, which, due to their diverse origin and sporadic nature, are difficult to model consistently. Computationally,


this homogeneity can be modeled by a dictionary $D$ that captures (most of) the variability of the normal features such that each normal feature can be represented as a linear combination of only a small number ($k$) of basis features in $D$. Therefore, this expected regularity of normal features provides the motivation and rationale for using sparse representation for their modeling. On the other hand, the heterogeneity of the anomalies precludes such modeling, and in general, an anomaly feature is not expected to be well approximated by a few basis features in $D$. Therefore, using $\chi(e)$ as the background distribution, the sparse-coding error $e$ provides a discriminative and useful quantity for classifying the feature vectors. In Figure 4-6, we plot these errors for three different Map-Reduce application datasets. We can observe the significant difference between errors for the normal and anomaly class instances.

The proposed anomaly detection framework is conceptually simple and its implementation is straightforward. An important computational issue is to determine the sparse coefficients $c_x$ given a test feature $x$. Fortunately, there are efficient algorithms such as orthogonal matching pursuit (OMP) (Donoho et al. 2006) and LASSO (Tibshirani 1994) that compute sparse signal decomposition, given the signal $x$ and dictionary $D$. Using these efficient sparse coding algorithms, the running time of our method, both in training and testing, is short and makes real-time anomaly detection feasible. Furthermore, the simplicity of our method allows various generalizations and extensions, such as incorporating incremental updates of the dictionary $D$ and background distribution $\chi$ for anomaly detection in dynamic and complex environments, a topic we will pursue in the future.

4.3.2 Remediation through Dynamic Resource Scaling

Dynamic resource scaling refers to the addition of Map-Reduce slave nodes to an executing job. This is a feasible solution to improve execution time in the presence of faults for two reasons: (1) Hadoop allows for seamless addition of slave nodes (without restart of the master node daemons or changes to configuration files) and


(2) horizontal scaling is provided through a programming API in most virtualized and cloud environments. For a new node to be included in an executing job, TaskTracker and DataNode daemons must be started on it and the master node IP address must be provided to it. These newly started daemons will make a request for work to the master node and are then assigned data blocks to process. We note that the master node need not be aware of a slave node that may potentially be added in the future. This provides the necessary flexibility to add as many nodes as needed for handling different faulty conditions. Additionally, an important design goal in our work has been to keep the underlying Hadoop framework unmodified in order to ensure that our solution can be easily adopted in practice. Choosing dynamic resource scaling as the remediation technique facilitates this design goal.

The number of nodes to be added depends on the time at which the fault is detected, the expected completion time of the job, the number of chunks yet to be processed and the number of nodes involved in the job. When the number of map tasks to be executed is more than the number of slave nodes available, multiple map waves are executed. Dynamic resource scaling can help as a recovery technique for an executing job only if at least one map and reduce wave is yet to be started.

The scaling heuristic that is a part of FMR needs to add a sufficient number of nodes to reduce the execution time penalty. Both tasks that have already completed on the faulty node and future tasks that would have executed on that node need to be executed on newly added nodes. This condition is expressed in Equation (4-1), where MapProgPercentage is retrieved from the Hadoop runtime using a built-in API.

$$N_{\mathit{nodes\_added}} = \mathrm{ceil}\left(\frac{1}{1 - \mathit{MapProgPercentage}} + 1\right) \qquad \text{(4-1)}$$

After determining the optimal number of nodes to be used for scaling (using Eq. 4-1), we determine the associated cost of these resources (costOfScaling). In order


to determine whether the cost of scaling would be justified, we use Map-Reduce execution time prediction models to estimate job duration in the presence of a node fault (execTime_fault).

In Chapter 3, we have shown that Map-Reduce execution times can be estimated using machine-learning based regression models (PerfModel). We showed that four techniques, namely Gaussian process regression, regression by discretization, multilayer perceptron and model trees, achieved the best performance for predicting Map-Reduce job execution time. Average prediction errors obtained were less than 12%. Out of these models, model trees were chosen for use in the experiments in this paper.

Using this execution time, we calculate the potential cost (delayPenalty) associated with exceeding the job deadline. Any user-defined cost function (CostModel) can be used here. The cost for the execution time penalty is compared with the cost for resource scaling, and scaling is invoked if it provides a cost benefit. This functionality of FMR is described as pseudocode in Figure 4-7.

    execTime_nofault = PerfModel(NumFaults = 0)
    execTime_fault = PerfModel(NumFaults = 1)
    delayPenalty = CostModel(execTime_nofault - execTime_fault)
    N_nodes_added = ceil(1 / (1 - MapProgPercentage) + 1)
    costOfScaling = N_nodes_added × costPerNode
    if costOfScaling < delayPenalty then
        Invoke scaling operation
    end if

Figure 4-7. Pseudocode of the scaling heuristic in FMR

4.4 Implementation of Fault-Managed Map-Reduce

The various components of FMR that together constitute the MAPE control loop are illustrated in Figure 4-8 and described in this section.

Monitoring using Ganglia: Ganglia is an open-source project (Ganglia 2012) that provides a flexible monitoring framework for distributed systems. In FMR, customized metrics are added to Ganglia for calculating the heartbeat rate and for performing anomaly detection.
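The scaling heuristic of Figure 4-7 can be restated as a small Python sketch. It is only an illustration under stated assumptions: `perf_model` and `cost_model` are hypothetical callables standing in for PerfModel and CostModel, map progress is taken as a fraction in [0, 1) rather than a percentage, the delay is passed to the cost model as the non-negative difference between the faulty and fault-free execution times, and a small rounding step (not in the original) guards the ceiling against floating-point noise.

```python
import math


def nodes_to_add(map_progress):
    """Nodes to add: ceil(1 / (1 - map_progress) + 1), map_progress in [0, 1)."""
    if not 0.0 <= map_progress < 1.0:
        raise ValueError("map_progress must be a fraction in [0, 1)")
    # Rounding avoids results such as 10.000000000000002 being pushed up to 11.
    return math.ceil(round(1.0 / (1.0 - map_progress), 9)) + 1


def scaling_decision(perf_model, cost_model, map_progress, cost_per_node):
    """Return (should_scale, number_of_nodes) following the Figure 4-7 heuristic."""
    exec_nofault = perf_model(num_faults=0)
    exec_fault = perf_model(num_faults=1)
    # Assumed sign convention: the cost model receives the extra run time.
    delay_penalty = cost_model(exec_fault - exec_nofault)
    n_nodes = nodes_to_add(map_progress)
    cost_of_scaling = n_nodes * cost_per_node
    return cost_of_scaling < delay_penalty, n_nodes


if __name__ == "__main__":
    # Hypothetical models: 1000 s fault-free, 1500 s with one fault, $0.01 per second late.
    perf = lambda num_faults: 1000.0 if num_faults == 0 else 1500.0
    cost = lambda delay_seconds: 0.01 * delay_seconds
    print(scaling_decision(perf, cost, map_progress=0.5, cost_per_node=1.0))
```

In this example the predicted delay costs 5.0 while three extra nodes cost 3.0, so scaling is invoked; doubling the per-node cost reverses the decision.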


[Figure 4-8 diagram. Block labels, in the order extracted: Map-Reduce system; Map-Reduce application; Map-Reduce framework; system software; infrastructure; monitoring module; Ganglia-based monitoring; node health script; planning module; scaling heuristic (master); anomaly detection (slave); analysis module; heart beat processing (using Ganglia metric modules); precursor detection (using Hadoop node health script); execution module; resource scaling; black-listing; prediction models (master); cost models (master).]

Figure 4-8. Autonomic control loop of Fault-managed Map-Reduce showing a high-level overview of the monitoring, analysis, planning and execution modules. Contributions described in this paper are shown within solid-outline blocks. Contributions from prior work that are used in FMR are shown within dashed-outline blocks.

Node Health Script: The node health script is a feature provided in Hadoop that allows for a pre-defined health script to be periodically executed on each slave node. As soon as a node anomaly is conveyed to the master through Ganglia, the node is black-listed by FMR. Black-listing immediately prevents any more tasks from being scheduled on that node. Typically, slave node faults are detected by the master only after a timeout interval. The advantage of blacklisting is that the master is made aware of a slave's degrading health status immediately. This is beneficial since a recovery action can be invoked without any delay. We also configure the node health script to check for other fault precursors such as task and daemon crash faults. This


precursor detection functionality helps detect some crash faults that are not preceded by anomalous conditions (that could be detected by the sparse coding technique).

Anomaly Detection: The anomaly detection module is invoked at the end of each task. The input feature vector to the anomaly detection module is a heart beat wave that corresponds to the last-completed task. The sparse coding technique is implemented in Matlab and converted to a stand-alone executable, which is then executed on the slaves using the Matlab Compiler Runtime (MCR) environment.

Recovery through Dynamic Resource Scaling: Once an anomaly is detected, the anomalous node is forcefully black-listed. Then the scaling heuristic is executed to determine the number of nodes to be added. We use a cost model in which dollar costs are associated with different penalty ranges. Virtual machine images for the new slaves are pre-set with the master IP address. TaskTracker and DataNode daemons are started upon the new nodes, which then become a part of the executing Map-Reduce job.

4.5 Experimental Evaluation

Experimental Testbed: The testbed used to evaluate the FMR approach consists of 16 IBM blade servers (HS22) mounted on two different racks. Each physical node has an 8-core Xeon 2.4 GHz CPU and 24 GB of RAM and runs CentOS 5.5 with Xen 3.4.3. The two racks are linked together by a Gigabit Ethernet network. Each physical node hosts five guest virtual machines. Each guest VM (which forms a Hadoop slave node) runs Ubuntu 10.04.2 and is configured with a single core and 2 GB of RAM. Hadoop version 0.20.203 is used.

Map-Reduce applications: Applications from the Hadoop distribution and the PUMA benchmark suite (PUMA) were used and are described below:

1. Wordcount (WC): Map outputs a (word, 1) key-value pair for each word in a document. Reduce combines the count for each word, producing a (word, wordcount) pair.


2. Grep (GR): Map searches for a pattern in the input documents and produces (pattern, 1). Reduce combines the count for each pattern, producing (pattern, patterncount).
3. Pi estimation (PI): Estimates the value of Pi using a quasi-Monte Carlo method.
4. Inverted index (II): Map generates the document index for each word as (word, documentindex). Reduce combines all occurrences of a word to produce (word, list of document indices).
5. Term vector per host (TV): Determines frequently occurring words in a document. Map produces (host, termvector) for each host. Reduce combines term vectors for each host and outputs (host, list(termvector)).
6. Histogram ratings (HR): Generates a histogram of movie ratings from a dataset of user reviews. Map produces (rating, 1) for each user review. Reduce combines the count to produce (rating, count).

Input dataset: The dataset used for WC, GR, II and TV consists of books from Gutenberg (2009), with size varying between 5 GB and 20 GB. PI does not require any input data. Input for HR is generated using scripts from PUMA.

Job duration: Performance penalties are low for long-running jobs that execute on a large number of nodes. However, long-running jobs are not the common case for Hadoop, as seen from two production traces that were analyzed in Chen et al. (2011) and Kavulya et al. (2010). In these studies, the average length of a job varies between a few tens of seconds and a few tens of minutes. The average Map-Reduce job size at Google (Dean and Ghemawat 2008) varied between 395 and 874 seconds over a period of three years between 2004 and 2007. FMR and its evaluation experiments thus focus on short jobs with runtimes ranging between 300 and 600 seconds, which correspond to the majority workload in production environments.

Faultload: The following fault conditions were injected into slave nodes:

1. CPU hog: A CPU-intensive sequence of matrix multiplication operations.
2. Memory hog: A sequence of memory leaks programmed into an executing matrix multiplication process.


3. Disk hog: The Linux dd command used to copy large chunks of data between two disk partitions.
4. Node crash fault: The Linux kill command used to terminate the TaskTracker and DataNode daemon processes running on the node.

Each fault experiment consists of loading HDFS with the input, starting FMR scripts and the Map-Reduce job and then injecting faults at pre-specified time instances. Node crash faults are preceded by performance faults. After each fault experiment, HDFS is reformatted and reloaded with input data. This ensures that any non-uniformity in data distribution and replication is eliminated for each new experiment. A set of 3000 job executions was performed for validating the anomaly detection, performance prediction and resource scaling components of FMR; these are described in the following subsections.

4.5.1 Anomaly Detection

The experiment shown in Figure 4-9 is used to illustrate the operation of the anomaly detection module. A Wordcount Map-Reduce job is executed with a CPU hog injected into one node. We note that the anomalous slave node 'Dom-13' (in the fourth plot from the top) was correctly identified. In accordance with the goal of early fault detection, the fault was detected at the end of the first application heart beat wave and is marked using an arrowhead.

Figure 4-10A shows the fraction of correct predictions for the normal and anomaly test data for six datasets corresponding to six different Map-Reduce applications. The fraction of correct predictions in the anomaly dataset is the True Positive Rate, \(TPR = \frac{TP}{FN + TP}\), while the fraction of correct predictions in the normal dataset is the True Negative Rate, \(TNR = \frac{TN}{TN + FP}\). Here TP, TN, FP and FN stand for True Positives, True Negatives, False Positives and False Negatives respectively. In this context, correctly detecting an anomaly is termed a True Positive. We see that for all the datasets the TNR is greater than the theoretical bound of 0.8 that was chosen for the percentile parameter. This corresponds to a maximum False Positive Rate of 0.2.
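The rate definitions above can be checked with a few lines of Python. The sketch below is illustrative only: the function name and the confusion counts are hypothetical and are not results from the experiments. It also shows why a TNR bound of 0.8 corresponds to a maximum False Positive Rate of 0.2, since FPR = 1 - TNR.

```python
from typing import Tuple


def detection_rates(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (TPR, TNR, FPR) from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one anomaly sample and one normal sample")
    tpr = tp / (tp + fn)  # fraction of anomaly samples flagged as anomalous
    tnr = tn / (tn + fp)  # fraction of normal samples accepted as normal
    fpr = fp / (fp + tn)  # equals 1 - TNR
    return tpr, tnr, fpr


# Hypothetical counts: 100 anomaly samples and 100 normal samples.
tpr, tnr, fpr = detection_rates(tp=90, tn=85, fp=15, fn=10)
print(tpr, tnr, fpr)
```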


[Figure 4-9 plot: six stacked panels, one per slave node (Dom 7, Dom 8, Dom 9, Dom 13 with injected performance fault, Dom 14, Dom 15). x-axis: time. y-axis: application heart beat rate (beats per second).]

Figure 4-9. Application heart beat waves for a 6-node Hadoop job in which a CPU hog process was injected into one slave node, 'Dom 13'. The anomalous node is shown in the fourth plot (from the top). An arrowhead marks the time of detection of the anomaly on this node.

In Figure 4-10B, we plot the TPR and TNR for the inverted index application for different injected faults. In order to identify the cause for variation in performance, we compare the intensity of the effect (performance penalty) of each of these faults. A CPU hog, memory hog and disk hog cause 22%, 13% and 11% increases in average execution time, respectively. The CPU hog, with the maximum penalty, is a more severe fault and hence can be detected with better performance. The memory hog and disk hog effects are more subtle and hence appear to result in slightly lower detection performance.

The time taken for prediction computation and model training is shown in Table 4-1 for 3 application datasets. We see that for the largest dataset of 1400 instances, training


takes 1 sec and testing takes only 0.008 secs. This ensures minimal overhead when our anomaly detection model is used within FMR.

[Figure 4-10 plots: three panels, each showing the fraction of correct predictions for normal test data (true negative rate) and anomaly test data (true positive rate). A: Different Map-Reduce applications (x-axis: application datasets PE, GR, WC, II, TV, HR). B: Different types of faults (x-axis: CPU hog, memory hog, disk hog). C: Different virtual machine instances (x-axis: small, medium, large).]

Figure 4-10. Evaluation of sparse coding based anomaly detection for (a) different Map-Reduce applications, (b) different faulty conditions, and (c) different VM instance sizes in a heterogeneous environment.

Table 4-1. Training and test durations for anomaly detection

Dataset  Application    Total instances  Training duration  Testing duration
1        Pi Estimation  1497             1.06 secs          0.008 secs
2        Grep           202              0.13 secs          0.008 secs
3        Wordcount      400              0.22 secs          0.008 secs

Heterogeneity: The use of decentralized, local models for anomaly detection enables us to extend FMR to work in a heterogeneous environment. A heterogeneous


testbed was configured consisting of three different virtual machine instance types: 'small' VMs with 1 CPU and 2 GB of RAM, 'medium' VMs with 2 CPUs and 4 GB of RAM and 'large' VMs with 4 CPUs and 6 GB of RAM. Performance of anomaly detection for each VM instance type in this environment is shown in Figure 4-10C.

[Figure 4-11 plots: two panels showing the fraction of correct predictions for normal test data and anomaly test data. A: x-axis: percentile. B: x-axis: sparsity level.]

Figure 4-11. Sensitivity analysis: Variation of the fraction of correct predictions for normal test data (true negative rate) and anomaly test data (true positive rate) using the sparse coding anomaly detection technique for different values of the (a) 'Percentile' parameter and (b) 'Sparsity Level' parameter. Application: Term vector per host. Injected fault: CPU hog.

Sensitivity Analysis: In order to determine the effect of choosing different parameters, we perform a sensitivity analysis of two parameters, namely the percentile value in Figure 4-11A and the sparsity level in Figure 4-11B. Sparsity level is varied between 1 and 6 and the percentile parameter (which is related to 1 # ) is varied between 0.8 and 0.96. The effect of these variations on TPR and TNR is plotted. We see that anomaly detection performance is quite stable within these ranges, thereby providing sufficient leeway in choosing good parameter values. We note that although anomaly data is not needed for training, it can be leveraged when available for parameter tuning.

Receiver Operating Characteristic curves: We plot ROC curves for 6 applications in Figure 4-12 by varying the confidence threshold between 0 and 1. All curves are close to the upper-left corner, where TPR is high and FPR is low. In addition, most of the curves provide a number of possible values of confidence threshold (i.e. points on the


curve with markers) in the upper-left corner region, indicating that good performance is possible for many confidence threshold values.

Table 4-2. Comparison of four anomaly detection techniques: Multilayer Perceptron (MLP), K-means Clustering (KMC), Support vector machines (SVM) and Sparse-coding (SPC), showing their true positive rates and true negative rates

Application  MLP        KMC         SVM        SPC
PI           1.0/0.96   0.99/0.88   0.76/0.3   1.0/0.93
GR           1.0/0.94   0.7/0.31    1.0/0.65   1.0/1.0
WC           0.98/0.88  0.96/1      1.0/0.69   1.0/0.92
II           0.99/0.95  0.8/0.66    1.0/0.49   0.92/0.95
TV           0.95/0.76  0.93/0.69   0.89/0.49  0.82/0.82
HR           0.99/0.98  0.975/0.99  1.0/0.51   0.99/0.96

Comparison of anomaly detection techniques: We compare sparse coding with 3 classification techniques in Table 4-2. Multilayer perceptron provides the best performance. However, it needs anomaly data for training and takes ten to a hundred seconds for training, thus making it unsuitable for FMR. K-means clustering also needs anomaly data for training and does not achieve very good TPR and TNR values. Single-class SVM does not need anomaly data for training, making it a viable candidate. However, sparse coding models achieve much better performance for all 6 benchmark applications.

[Figure 4-12 plots: two ROC panels. x-axis: false positive rate. y-axis: true positive rate. A: Pi estimation, Grep, Wordcount. B: Inverted index, Term vector per host, Histogram ratings.]

Figure 4-12. ROC curves showing performance of anomaly detection as the confidence threshold is varied. Six benchmark applications are shown in 2 separate plots (a) and (b) for clarity.


[Figure 4-13 plot: x-axis: job ID (1 to 6). y-axis: execution time (in seconds). Series: actual execution time, predicted execution time.]

Figure 4-13. Prediction accuracy of the model tree algorithm used for Map-Reduce job execution time prediction. Application: Wordcount. Jobs 1, 2 and 3 have one fault injected, while jobs 4, 5 and 6 are fault-free.

4.5.2 Dynamic Resource Scaling

We first evaluate the accuracy of model tree prediction, which is a critical component of the FMR control loop. Prediction accuracy for 6 test jobs is shown in Figure 4-13; the average prediction error was 11.1%. Features used include number of slaves, dataset size, time of fault, number of faults, timeout and Hadoop framework configuration parameters. The model tree implementation in the Weka tool suite was used for training and testing.

Next we evaluate the scaling heuristic in Figure 4-14 to determine whether a sufficient number of nodes is chosen for scaling. For the job shown, the heuristic scales by 7 nodes based on the map progress percentage of 0.83 in Eq. 4-1. We manually scale by 1 to 6 nodes and 8 to 10 nodes to determine if 7 is the right choice. We see that for \(N_{nodes\_added} < 7\) the penalty is > 5%, and for \(N_{nodes\_added} > 7\) there is no additional benefit.

The penalty reduction through dynamic resource scaling is illustrated using a swimlane plot in Figure 4-15. In this plot, each y-axis coordinate corresponds to the execution of a map task in a single map-slot. Our experiments use Hadoop's default


setting of two slots per node. So a pair of consecutive lines (parallel to the x-axis) corresponds to two map tasks running simultaneously on a node.

[Figure 4-14 plot: x-axis: number of nodes scaled (1 to 10). y-axis: execution time (in seconds). Series: no faults, node fault (Fault-managed Map-Reduce).]

Figure 4-14. Evaluation of FMR scaling heuristic. Application: Pi estimation. Fault: Node crash fault at 520 seconds. FMR scaling heuristic scales by 7 nodes (marked with an arrow in the plot).

Figure 4-15(a) shows a Map-Reduce job that consists of four map waves. The job did not experience any faults. In Figure 4-15(b), the same job is rerun with a CPU performance fault injected into one of the nodes. The presence of a fault results in an execution time penalty of 18.5%. In the next execution of the same job, FMR scripts are enabled. Figure 4-15(c) shows the addition of two nodes to the running job after detection of the anomalous node. We note that with the help of resource scaling, the performance penalty is reduced to 4.6%.

In Figure 4-16, FMR is compared with Hadoop's built-in speculative execution. After a job begins execution, a CPU hog process is injected and is followed by a node crash fault after 30 seconds. We see that with speculative execution, the penalty is not reduced. However, using the FMR approach, through scaling by 4 nodes, the penalty is decreased to < 5% of the fault-free execution time.
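The penalties quoted above are stated relative to the fault-free execution time. A minimal sketch of that calculation is given below; the function name and the timings are hypothetical, chosen only so that the arithmetic reproduces the reported 18.5% and 4.6% figures, and are not measurements from the experiments.

```python
def penalty_percent(exec_time: float, exec_time_nofault: float) -> float:
    """Execution time penalty as a percentage of the fault-free execution time."""
    if exec_time_nofault <= 0:
        raise ValueError("fault-free execution time must be positive")
    return 100.0 * (exec_time - exec_time_nofault) / exec_time_nofault


# Hypothetical timings in seconds (not measured values).
baseline = 400.0
with_fault = 474.0
with_fault_and_scaling = 418.4

print(round(penalty_percent(with_fault, baseline), 1))
print(round(penalty_percent(with_fault_and_scaling, baseline), 1))
```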


[Figure 4-15 plots: three swimlane panels sharing a time x-axis: (a) job with no faults, (b) job with performance fault, (c) job with performance fault and dynamic scaling.]

Figure 4-15. Swimlane plots of three Map-Reduce jobs. (a) Job with no faults, (b) job with injected CPU performance fault (execution time penalty of 18.5%) and (c) job with injected performance fault and dynamic resource scaling enabled (execution time penalty reduced to 4.6%).

In the next set of experiments, node crash faults are injected into a running job executing on different numbers of nodes. As shown in Figures 4-17A to 4-17F, FMR consistently helps in mitigating the performance penalty. FMR helped decrease the performance penalty from an average of 119% to 14% across these 6 sets of experiments.

4.5.3 Discussion

Virtualized environment: FMR has been implemented on a virtualized environment, which easily provides the actuators needed for dynamic resource scaling. However, a non-virtualized environment can also use FMR by provisioning extra resources that can be added to the Hadoop cluster on demand. These extra resources can be utilized for


executing preemptable jobs during those periods when they are not utilized as part of recovery. However, a virtualized environment provides the capability to extend recovery to other actions such as migration (to handle hardware faults) and scaling up (to handle resource exhaustion faults).

[Figure 4-16 plot: x-axis: scenarios (1 to 4). y-axis: time (in seconds).]

Figure 4-16. Comparison of FMR with speculative execution. (1) Job with no faults. (2) Job with fault and speculative execution turned off. (3) Job with fault and speculative execution turned on. (4) Job with faults managed using FMR approach. Application: Inverted Index. Fault: CPU hog process + node crash fault after 30 seconds.

Application-specific anomaly detection models: The anomaly detection model developed is specific to a Map-Reduce application because each application has different heart beat characteristics. We believe that it is reasonable to manage application-specific models because typical Map-Reduce workloads involve the execution of the same job on gradually evolving datasets. Recent literature shows that 80% of jobs in a workload from Yahoo! were repeated at least 50 times (Bortnikov et al. 2012). The feature vectors needed for training the sparse coding model are derived from one application heart beat wave that corresponds to the processing of one data chunk by a map task. Hence, even a single Map-Reduce job can provide a few tens to a few hundred feature vectors for training.

Scalability of FMR: Since anomaly detection is performed by local, decentralized models at a node, the associated overheads are local to a node. Therefore, the overhead does not increase adversely as the number of slaves is increased. The computation performed at the master (by FMR) for each slave is limited and only consists of evaluating two binary metrics for each slave (namely the presence/absence of an anomaly and the result of the node-health script). The latest version of Hadoop, called 'NextGen' Hadoop, uses distributed masters and will further help reduce the time taken for this evaluation. Furthermore, FMR aims at achieving soft deadlines for Map-Reduce jobs. In a typical shared Map-Reduce cluster only a subset of the jobs would have these soft-deadline requirements. FMR therefore needs to be enabled only for these jobs, avoiding the need to monitor and manage all jobs.

[Figure 4-17 plots: six panels, each with x-axis: occurrence time of fault (Initial, Early, Mid, Late) and y-axis: execution time (in seconds). Series: no faults, node fault (built-in fault tolerance), node fault (Fault-managed Map-Reduce). A: 12-node cluster + 10 nodes for scaling. B: 22-node cluster + 10 nodes for scaling. C: 32-node cluster + 10 nodes for scaling. D: 42-node cluster + 10 nodes for scaling. E: 52-node cluster + 10 nodes for scaling. F: 62-node cluster + 10 nodes for scaling.]

Figure 4-17. Comparison of job execution times of a Pi Estimation job in the presence of a node crash fault through the use of Hadoop's built-in fault tolerance and FMR. x-axis labels 'Initial', 'Early', 'Mid' and 'Late' correspond to a node crash fault injected at 1 sec, 120 sec, 220 sec and 320 sec from job start.

4.6 Summary and Contributions

The Fault-managed Map-Reduce technique proposed in this chapter presents these unique contributions:

• Current fault-management solutions in Map-Reduce are limited by the use of static thresholds to identify faults. Thresholds are difficult to determine, especially in a dynamically changing environment. FMR overcomes this limitation through the use of machine-learning based models.
• A novel sparse-coding based anomaly detection method is proposed and evaluated on a variety of Map-Reduce applications. This method achieves high true positive and true negative rates and can be trained using only normal class (or anomaly-free) data.
• The decentralized nature of the proposed anomaly detection method ensures minimal computation overhead and enables it to be included within the MAPEK loop with negligible overhead.
• The decentralized method also allows FMR to be used in both homogeneous and heterogeneous cluster environments.
• Although the study and evaluation in this chapter deal with Map-Reduce applications, the FMR technique can be applied to other parallel applications that allow for seamless resource scaling.

Fault-Managed Map-Reduce thus provides an online, on-demand and closed-loop solution to fault management in Map-Reduce environments.


CHAPTER 5
FAULT-PLAY: A FAULT INJECTION AND MANAGEMENT TESTBED

In the second part of this dissertation, we focus on tools that facilitate fault and performance management studies in a Map-Reduce environment.

This chapter presents a tool, named FaultPlay, to overcome existing conditions that make failure studies challenging. For example, real-world failure datasets are difficult to access since they represent highly sensitive and proprietary information. Even the few publicly available datasets are static, i.e., they have a limited set of metrics that were predetermined and collected during system operation. These collected metrics may be inadequate and not contain specific information that is needed for a novel solution. Furthermore, it is not possible to test different fault management solutions on the same or even similar infrastructure from which the dataset was generated. In an alternative approach to failure studies, faults are identified and injected into a local testbed. Failure conditions are thus recreated and this enables generation of failure data. Characterization and analysis performed on these datasets enables testing of a variety of solutions. However, faults are identified and defined differently in different research projects. This makes it hard to effectively reproduce failure data and effects and even harder to compare different fault management solutions. Without ways to recreate faults in the same environment, proposed management solutions become specific to a custom testbed and it becomes difficult to build on and extend existing research results.

FaultPlay addresses these challenges through a number of different features. It provides a facility for the definition and injection of faults, thus allowing fault characterization results to be recreated between research groups. The effect of faults on a job is recorded using a distributed monitoring tool (during execution) and log parsing scripts (after execution). Monitored metrics are aggregated in the form of feature vectors to facilitate their use as inputs to decision-making algorithms. Since a typical study will require repeating experiments with changes to various configuration parameters,


FaultPlay provides a programmatic way to automate experimentation. Although fault-injection tools provide some of these features, FaultPlay provides another unique feature that goes beyond fault characterization, namely recovery-based fault management. A variety of library suites are provided, allowing for optimization and machine-learning techniques to be easily incorporated into a management solution. Detailed case studies are provided in Section 6.5 to illustrate the capabilities of FaultPlay.

FaultPlay helps reduce the many person-months of effort necessary for setup, configuration and deployment of infrastructure and for the design and implementation of a software framework to conduct failure studies in Map-Reduce. Although FaultPlay has been designed and developed for Map-Reduce, its basic concept informs the design of similar tools for other distributed computing paradigms. The FaultPlay tool is publicly available both in the form of source code and ready-to-deploy virtual machine images. The tool has been evaluated extensively on an in-house cluster of 64 nodes.

5.1 Background and Motivation

In this section, we discuss two typical approaches currently followed in failure studies. We show how constraints posed by these two approaches motivated the development of FaultPlay.

5.1.1 Real-world failure datasets

The first type of failure study consists of failure characterization and analysis of real-world datasets and traces. Some of the challenges in this type of study are described below.

1. Few publicly available datasets: Failure datasets are currently publicly available through two repositories. The Computer Failure Data Repository (CFDR 2008) was the first effort to make failure data publicly available and consists of 13 datasets from HPC clusters, Cray systems, Blue Gene systems, and Internet service clusters. The Failure Trace Archive (FTA 2009) focused on providing raw failure data in a standardized format along with post-processing tools for analysis and conversion. FTA consists of 25 datasets from peer-to-peer systems, grids, web servers, DNS servers and HPC clusters. Both CFDR and FTA currently do


not have Map-Reduce datasets and thus do not help with failure research on Map-Reduce platforms.
2. Limited access to datasets: In certain cases failure data is released in a limited way for research purposes (Bare et al. 2010). These efforts are highly beneficial to the research community since they provide key insights about failures in large-scale production systems. However, the key constraint here is that it is not possible for corporations to easily share these failure datasets for public use because of the sensitive nature of the data. This restricts comparable research across the community and the ability to improve upon proposed analyses and solutions.
3. Static nature of datasets: Even when datasets are accessible, they limit what can be done with them. We refer to this as the inherent static nature of datasets. These datasets allow for characterization studies to be performed but do not allow for fault management solutions to be implemented and accurately validated in the same environments from which these datasets were produced. For example, if an unobserved type of fault needs to be evaluated, its performance impact on the system cannot be determined (since the source of the data may not be accessible). Similarly, if system characteristics or design were to be adapted (to better handle faults), the effect of this cannot be validated on the source system. Furthermore, only a limited set of details may have been collected as part of the trace. Recently, Yahoo! and Google have publicly shared some Map-Reduce datasets. The HDFS grid log dataset from Yahoo! (Yahoo 2013) contains Hadoop file system audit logs, but does not capture other essential information about the environment that would be necessary for fault-management studies. The Google cluster dataset (Wilkes 2013) consists of traces of a mix of Map-Reduce jobs, high performance computing jobs and long running services such as web servers. However, jobs are not identifiable by their type, thereby preventing an analysis of only Map-Reduce jobs.
5.1.2 Fault-injection based studies

The second type of failure study uses fault injection on testbeds to perform analysis and characterization, followed by the deployment and evaluation of solutions for improvement in fault-tolerance, reliability, availability, etc. FaultPlay overcomes the limitations of using real-world datasets by adopting this second approach. However, fault-injection based studies also have certain limitations. These are listed below along with a description of how FaultPlay overcomes them.

1. Research contributions on specialized environments: Fault management solutions are implemented and deployed on custom testbeds. These testbeds consist


of many layers of system software, middleware and application components interacting with one another. Each component can be configured in a variety of ways, resulting in a testbed that is very difficult to replicate exactly. This makes it difficult to recreate similar conditions and propose solutions that enhance prior work. As a result, new fault studies tend to define their own specialized environments and present contributions for these. FaultPlay is provided to the research community as a virtual appliance image, thereby enabling researchers to use identical environmental conditions for proposing improved solutions, e.g., through the use of public clouds.
2. Replication of effort: Even when it is possible to recreate similar environments, this would require a lot of redundant effort. Efforts to provision, configure, deploy, and implement ideas in distributed systems require specialized system administration skills and can be very time-intensive. Furthermore, many performance improvement and fault-tolerance techniques use a similar set of monitoring and control actions. Examples of often-used monitored values are task progress, job progress and resource utilization, while common control actions are task duplication and resource scaling. In Section 6.2, we list examples of state-of-the-art management solutions that possess many of these commonalities. The software development effort required solely for these common features, leaving aside the intelligent action or idea being proposed, is non-trivial. FaultPlay leverages the premise that for different performance and fault management solutions, many components such as job construction and monitoring, fault definitions and injections and invocations of remedial actions can be reused. FaultPlay thereby considerably reduces the effort of re-implementing these aspects. FaultPlay's software framework captures these common functionalities through four main modules: JobEngine, FaultEngine, RemedyManager and ExperimentEngine.
3. Differences in chosen workload and faultload: Evaluation of a fault management solution requires the use of workloads and faultloads. Since Map-Reduce programs are free-form, it is important to compare solutions through representative Map-Reduce applications. Since these applications are typically data intensive, the choice of input data and its characteristics will also play a role in correctly evaluating the benefits of a solution. Furthermore, a fault management research contribution is characterized by the subset of fault conditions that were targeted. Even though research contributions list the type of faults studied, there are a variety of ways in which each fault can be defined and implemented in the underlying testbed. For example, a CPU hog process can be implemented through any CPU-intensive process. Its characteristics such as intensity and duration significantly affect fault detection, diagnosis and recovery actions. However, this level of detail cannot be captured within the limits of a research publication. The resulting variations in chosen workload and faultload thus make it difficult to evaluate, compare, contrast and improve upon existing contributions. FaultPlay defines workloads and faultloads associated with each experiment through a job configuration file and a wrapper script for different types of faults. These workload


and faultload definitions and templates are provided along with the tool to aid research projects. This provides a step towards a reproducible and consistent way to perform failure studies and to compare (and hence benchmark) different research contributions.

5.2 A Typical Fault Management Study

In the context of this chapter, a fault-management study refers to research studies that are related to understanding and/or improving system properties such as fault-tolerance, dependability, availability, reliability and performance (in the presence of faults). In this section, we describe the four main stages in such a fault management study.

These stages determine the requirements for FaultPlay, and hence for each stage we indicate how FaultPlay provides the necessary functionality.

5.2.1 Infrastructure setup and deployment

Hardware is obtained and provisioned with appropriate system software. In-house physical clusters can be used or virtual machines can be rented from Infrastructure-as-a-Service providers. Depending on the choice, a suitable virtualization layer and/or operating system layer are selected and installed. Map-Reduce dependencies are then installed, followed by the installation of Map-Reduce middleware. FaultPlay facilitates this by providing virtual machine appliance images that have all the above-mentioned components already installed and configured to work together correctly.

5.2.2 Selection of workload and faultload

Current and past fault literature is explored to determine appropriate types of faults that are likely to affect the system of interest. There are no standard faults that are universally accepted for a chosen environment and hence some expertise and exploration is necessary for this choice. These faults are then implemented in software or hardware depending on what is feasible. Workloads are then selected to exercise the system over a typical range of operation. The choice of workload would need to take into consideration qualitative metrics that are of interest. For example, for job


completion time studies, one would use a variety of applications with different map and reduce characteristics. And for job throughput studies, different workload mixes would need to be determined and used. Choosing applications for a workload also necessitates choosing suitable input data and arguments that are representative of real use case scenarios. This requires understanding the domain of chosen applications and identifying publicly available datasets that are suitable for the type of processing performed by each application. Common datasets for Map-Reduce include Wikipedia, Twitter and Gutenberg electronic books. Map-Reduce also provides benchmarks such as GridMix (2013). Although Map-Reduce is widely used in many organizations (Hadoop 2012), there is limited information on the actual Map-Reduce programs used by these organizations. So these choices need to be made on a case-by-case basis by researchers after taking into consideration the target applicability of their contributions. FaultPlay has pre-installed representative faultload and workload examples. These can be easily selected and varied through the use of provided configuration parameters and wrapper scripts.

5.2.3 Characterization of system behavior

Using the chosen workloads and faults, system behavior is characterized for different fault and workload combinations. Job performance is also affected by middleware and infrastructure configuration parameters (such as JVM characteristics, map and reduce slots per node, application command-line arguments, etc.). A variety of experiments would need to be repeated to systematically perturb the system along each of these dimensions. Additionally, between each experiment, the system would need to be brought to its initial state to ensure that effects observed are associated with only the current experiment and are not side effects of past experiments. Some examples of these side effects are: (1) distribution of data among slave nodes that becomes unbalanced, (2) slave node daemons that continue to execute on extra slave nodes when a new job is started on a smaller number of nodes, and (3) injected faults

PAGE 102

continuing to exist past job completion. In FaultPlay, this functionality has been provided through the concept of a managed job class that implements all the pre-processing and post-processing steps necessary for complete job execution. An experiment refers to a managed job along with injected faults and a fault management policy in place.

In order to characterize system operation for different combinations of workload, faults and configuration, the system needs to be monitored and optionally instrumented. Monitoring during job execution requires a distributed monitoring tool running on all the slaves. A suitable distributed monitoring tool needs to be chosen based on the requirements and scale of the cluster that it is intended for. Since fault-management solutions may use custom metrics that are not recorded by default by the monitoring tool, the ability to define and deploy custom monitors would be useful. At the end of job execution, log files generated by the various master and slave node daemons contain valuable raw data about job execution. These log files need to be copied from each slave node back to the controller node. Parsing scripts need to be implemented to retrieve useful information from these log files. FaultPlay comes with Ganglia, a distributed monitoring tool, pre-installed. Log parsing scripts have also been provided for retrieving many common metrics.

5.2.4 Implementation of fault management solutions

After system behavior is characterized under different fault conditions, different techniques can be proposed and implemented to improve various system properties. These solutions can be open-loop or closed-loop in nature. An open-loop solution is one in which feedback from system operation is not incorporated as part of the effected policy. A closed-loop solution is in the form of a Monitor-Analyze-Plan-Execute-Knowledge (MAPEK) control loop. For both these types of solutions, common monitoring and control actions need to be provided. In the case of closed-loop solutions, the analysis and planning module could be aided with optimization and machine-learning libraries.
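The closed-loop structure described above can be sketched in a few lines of Python. This is only an illustrative sketch and not FaultPlay's code: the names `Knowledge`, `monitor`, `analyze`, `plan`, `execute` and `mapek_iteration` are hypothetical, and the rule used in the analysis step (a node whose CPU usage falls below a threshold is treated as faulty) is an assumption chosen to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, object]  # (hypothetical action name, argument)


@dataclass
class Knowledge:
    """Shared state of the loop (the K in MAPE-K)."""
    cpu_threshold: float  # assumed: minimum CPU usage (%) of a healthy node
    history: List[Dict[str, float]] = field(default_factory=list)


def monitor(read_metrics: Callable[[], Dict[str, float]]) -> Dict[str, float]:
    # Monitor: collect one sample of per-node metrics.
    return read_metrics()


def analyze(metrics: Dict[str, float], knowledge: Knowledge) -> List[str]:
    # Analyze: flag nodes whose metric violates the stored threshold.
    return sorted(n for n, cpu in metrics.items() if cpu < knowledge.cpu_threshold)


def plan(faulty: List[str]) -> List[Action]:
    # Plan: remove each faulty node and add the same number of new ones.
    actions: List[Action] = [("blacklist", node) for node in faulty]
    if faulty:
        actions.append(("scale_out", len(faulty)))
    return actions


def execute(actions: List[Action], actuator: Callable[[str, object], None]) -> None:
    # Execute: hand every planned action to the control interface.
    for name, arg in actions:
        actuator(name, arg)


def mapek_iteration(read_metrics, actuator, knowledge: Knowledge) -> List[Action]:
    """Run one pass of the loop and return the actions that were taken."""
    metrics = monitor(read_metrics)
    knowledge.history.append(metrics)
    actions = plan(analyze(metrics, knowledge))
    execute(actions, actuator)
    return actions
```

An open-loop solution, by contrast, would call the actuator on a fixed schedule without the `monitor` and `analyze` steps feeding into `plan`.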

FaultPlay consists of a component called Remedy Manager that provides the building blocks needed to implement management policies.

Each of the stages described above requires a unique set of skills that includes infrastructure installation, system administration, fault modeling, expertise in online management solutions, etc. One of the advantages of the FaultPlay tool is that a researcher need not be trained in all these skills to be able to conduct a comprehensive study. For example, a researcher interested in fault-injection can use FaultPlay's appliance images and built-in faults and then run a variety of characterization experiments. The researcher can focus on customization and/or extension of the fault-injection module alone to explore specific research questions.

5.3 Design and Implementation

Figure 5-1. FaultPlay Architecture. (Diagram labels: experiment engine; job config file; fault engine, inject fault; job engine, invoke job: 1. filesystem setup, 2. input data setup, 3. job submission, 4. post job cleanup, and 1. job tracking, 2. parse and retrieve job stats; remedy manager, invoke remedy: job and resource control actions, monitor job and resources; map reduce system: master node with job tracker and name node, slave nodes with task tracker and data node.)

FaultPlay's software framework consists of four main modules whose high-level interactions are shown in Figure 5-1. These modules work together to help conduct a fault-management study that is easy to deploy and is reproducible.

5.3.1 Experiment engine

This is the high-level component that is responsible for coordinating an experiment that would result in the generation of a performance or fault data point. First, a managed Map-Reduce job is started using the Job Engine. Then a fault is constructed and injected using the Fault Engine. And finally a suitable component of the Remedy Manager that corresponds to the chosen fault management solution is invoked.

5.3.2 Job engine

This module is responsible for all aspects of a managed job such as construction, submission and tracking, as well as all housekeeping activities necessary before and after the job executes. The job configuration file is a text file consisting of parameter-value pairs that define various details about a job and serves as an input to the Job Engine. An example of this configuration file is shown in Table 5-1.

The Job Engine first performs a set of preprocessing tasks. This includes actions to set up a local job directory and an HDFS job directory for storing job logs, status and results. The job configuration file specifies input data to be used for the job and the Job Engine transfers this data from the local file system to HDFS. In the case of jobs like Terasort, input data are generated on the fly by another Map-Reduce job. The job is then submitted for execution. During execution, job status is monitored periodically. When the job is complete, statistics are gathered by parsing through the master and slave log files. Statistics associated with each job and task are collected as a feature vector. Feature vectors can be aggregated for a set of experiments and used for different types of analysis.

5.3.3 Fault engine

This component facilitates the injection of different types of faults during the execution of a job. Implemented faults are categorized into crash faults and performance faults. A crash fault refers to a fault that occurs suddenly without any detectable precursors. Examples include map and reduce task processes and TaskTracker and
DataNode daemon process being killed. A performance fault refers to degradation in different aspects of system performance. The manifestation of a performance fault has been implemented through CPU, memory, disk and network hog processes. Selection of these faults was informed by Map-Reduce failure research performed as part of the Fingerpointing project (Fingerpointing 2013), the AnarchyApe project (Roberts 2013) and studies in Kadirvel and Fortes (2011b) and Kadirvel and Fortes (2012).

Table 5-1. List of parameters in a job configuration file

Category                     Configuration parameter
Job Identification           jobDirName="job pi 1"
                             jobLoc=/usr/local/jobs/
Application                  jarFile=/path/hbpiestimator.jar
                             programName=HBPiEstimator
Hadoop Middleware            numOfReduceTasks=23
                             sortFactor=10
                             sortMb=100
                             sortRecordPercent=0.05
                             childJavaOpts=-Xmx200m
                             sortSpillPercent=0.80
                             reduceParallelCopies=5
Input Data and Filesystem    dataDirLoc=/extra
                             dataDir=PiEstimator TMP
                             clearHdfs=False
                             dataDirToBeDel=unused
                             loadHdfs=False
                             copyLocalJobResult=False
Infrastructure               slavesList="7,8,9,13,14,15"
                             numOfSlaves=6
                             constructSlavesList=False
Application Arguments        terasortDataSize=100000000
                             mahoutJob=False
                             numOfMapTasks=120
                             numOfSamples=64000000

Table 5-2 lists the types of faults and how they are implemented in the testbed. Faults are defined by their type, time of occurrence and other parameters specific to
each fault type. Multiple faults are collectively termed a fault-suite and are injected within the span of a single experiment in order to study their cumulative effect. These fault suites can be used across research teams for sharing information about fault coverage that is associated with a fault management solution. They also help facilitate comparisons among different solutions.

Table 5-2. List of faults and their implementation

Fault                  Implementation                               Customizable parameters
Node Crash             Linux kill command to kill TaskTracker       Time of injection
                       and DataNode daemon processes
CPU Hog                Matrix multiplication C program              Time of injection, Duration, Priority
Network Packet Loss    Linux kernel component 'netem' for           Time of injection, Loss percent, Duration
                       emulating network properties
Network Slowdown       Linux kernel component 'netem' for           Time of injection, Delay, Duration
                       emulating network properties
Memory Hog             Matrix multiplication C program with a       Time of injection, Rate
                       memory leak
Disk Hog               Linux dd command                             Time of injection, Duration

5.3.4 Remedy manager

The Remedy Manager module consists of implementations of control actions to manage the Map-Reduce job, its middleware and the underlying infrastructure. Control actions that have been implemented include horizontal scaling-out and scaling-in, blacklisting, greylisting and whitelisting of slave nodes. Scaling-out and scaling-in refer to seamless addition and removal of slave nodes without disrupting currently executing jobs. In the case of scaling-out, Map-Reduce daemons are started up on new slave nodes. These daemons can connect to the master node without the need to restart the master node daemons or make changes to configuration files. In the case of scaling-in, data present in the nodes to be shut down must first be transferred and replicated on the remaining nodes. We compose three built-in features provided in Hadoop, namely the
node health script, blacklisting and decommissioning, in order to implement scale-out and scale-in. The Remedy Manager API provides two methods through which the user can easily invoke scale-in and scale-out simply by specifying the number of nodes to be added or removed.

The Experiment Engine together with the Job Engine can be used for performance characterization experiments. These components together with the Fault Engine can be used for fault characterization experiments and testing of open-loop fault management solutions. And finally, when these components are combined with the Remedy Manager, various closed-loop fault management solutions can be deployed and evaluated.

All four modules have been implemented using the Python and Bash scripting languages. Configuration files have been used in order to enable execution of versatile experiments by simply changing values in these configuration files (thus requiring no source code changes). For implementing new management policies, methods provided via the Remedy Manager can be combined or new methods can be defined. FaultPlay's codebase consists of approximately 3000 lines of code excluding external libraries.

5.4 Evaluation on a virtualized cluster

5.4.1 Description of Infrastructure

In the set of experiments described in this section, 16 IBM blade servers (HS22) mounted on two different racks were used. Each physical node has 16 cores and 24 GB of RAM. The two racks are linked together by a Gigabit Ethernet network. Xen is the Virtual Machine Monitor chosen to host Hadoop master and slave virtual machines. The operating system on the host platform is CentOS 5.5, and the guest virtual machines (or domU in Xen parlance) run Ubuntu 10.04.2. Each slave node is configured with a single core and 2 GB of RAM. The Hadoop version used is 0.20.203.

5.4.2 Performance Characterization Studies

Job performance depends on five categories of factors: resources, datasets, workload, faults and configuration parameters. When characterizing job performance,
we show the ease with which FaultPlay enables each of these factors to be modified.

Resource: In Figure 5-2, we plot job performance for the Pi-Estimation application executed on 12, 24, 36 and 48 nodes. This set of jobs was performed simply by changing the slavesList parameter in the job configuration file. Without FaultPlay, this would require a sequence of steps that needs to be manually invoked by the user for each job:

1. Stop all the daemons running on the current master and slave nodes.
2. Clear all contents in the local file system directory that has been reserved for HDFS.
3. Use the HDFS format command to cleanly format the file system.
4. Set up the Hadoop configuration file with hostnames of all the new number of slaves.
5. Start all the master and slave node daemons.
6. Set up the local job directory where job details, commands, log files and job statistics will be stored.
7. Depending on the type of application, load necessary input data into HDFS.
8. Construct the command line for the chosen application and start the job.

Figure 5-2. Job completion time for Pi Estimation application when executed on 12, 24, 36 and 48 nodes. (Axes: number of slaves; job completion time in secs.)

Dataset: In Figure 5-3, for the case of the Terasort application, we plot job performance as the input dataset is varied between 1 GB and 15 GB. Again this was
put into effect by simply changing the terasortDataSize parameter in the configuration file. FaultPlay invokes the teragen application to generate the specified amount of input data.

Similar to resource and dataset, configuration and workload aspects can also be varied easily through the configuration file.

Figure 5-3. Job completion time for Terasort application when executed on dataset sizes of 1 GB, 5 GB, 10 GB and 15 GB. (Axes: dataset size in GB; job completion time in secs.)

Figure 5-4. Job completion time when multiple faults are injected into a 24-node Pi Estimation job. (Axes: number of faults; job completion time in secs. Series: With Fault and No Management; With Fault and Management.)

5.4.3 Fault Characterization Studies

1. Multiple Faults: Figure 5-4 shows the effect of multiple CPU hog faults on a Pi-Estimation job executed on 24 nodes. Hogs were injected into 1, 3, 6, 12 and 18 nodes. We see that the effects do not cumulatively increase. This is because the
increased wait times (or penalties) overlap in time. This can be verified by viewing a swimlane plot of the job. A swimlane plot presents the start and end time of each map and reduce task in a job, thereby bringing out relative task durations and their overlap.

2. Intensity of Fault: For some performance faults, intensity of the fault can be varied. For example, in the case of a CPU hog process, its priority will determine the portion of total CPU usage that is used by the hog process. Thus, indirectly it also determines the portion of CPU usage that is remaining for map and reduce task processes. Figure 5-5 shows the effect of intensity of a fault on the Pi-Estimation application. Process priority is controlled using the Linux renice command and is incremented by values between 0 and 20.

Figure 5-5. Job completion time when fault intensity is varied for a CPU hog fault injected into a 24 node Pi Estimation job. (Axes: fault intensity; job completion time in secs. Series: With Fault and No Management; With Fault and Management.)

Figure 5-6. Time series of CPU usage values of a map task on 2 healthy nodes. (Panels: A, Healthy node-1; B, Healthy node-2.)
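A fault suite of the kind used in these two studies (the same CPU hog injected into several nodes, at a chosen intensity) could be described with a small data structure such as the one below. This is a hedged sketch rather than FaultPlay's actual fault-definition format: the `Fault` class, the function names and the field names are hypothetical, and only the parameter kinds (time of injection, duration, priority) and the 0 to 20 priority range are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Fault:
    kind: str         # e.g. "cpu_hog" or "node_crash" (labels are assumptions)
    node: int         # identifier of the slave node that receives the fault
    inject_at: float  # time of injection, in seconds after job start
    params: Dict[str, float] = field(default_factory=dict)


def cpu_hog_suite(nodes: List[int], inject_at: float,
                  duration: float, priority: int) -> List[Fault]:
    """Build one CPU hog fault per node, all with the same intensity."""
    if not 0 <= priority <= 20:
        raise ValueError("priority increment must be between 0 and 20")
    return [
        Fault("cpu_hog", node, inject_at,
              {"duration": duration, "priority": priority})
        for node in nodes
    ]


def injection_order(suite: List[Fault]) -> List[Fault]:
    """Order the faults of a suite by injection time, then by node."""
    return sorted(suite, key=lambda f: (f.inject_at, f.node))


if __name__ == "__main__":
    suite = cpu_hog_suite([7, 8, 9], inject_at=30.0, duration=60.0, priority=10)
    suite.append(Fault("node_crash", 13, inject_at=10.0))
    for fault in injection_order(suite):
        print(fault.inject_at, fault.kind, fault.node, fault.params)
```

Keeping the suite as plain data is one way to make it easy to share between research teams, since the same list can be replayed against different management policies.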

Figure 5-7. Time series of CPU usage values of a map task on a node with an injected CPU hog process.

Figure 5-8. Improvement in performance of a Map-Reduce job with the inclusion of a management policy. (Series: Fault Free; With Fault No Management; With Fault and Management.)

5.4.4 Fault Management Studies

We illustrate incorporation of a management solution through the case study of a simple, rule-based management policy. In this policy, CPU usage of map tasks is periodically monitored by Hadoop's node health script feature. When CPU usage goes below a pre-determined threshold (that is indicative of normal operation), it denotes that the node is experiencing the effect of a CPU hog process. As soon as this is
detected, the faulty node is black-listed and removed from further scheduling. In order to compensate for the lost compute power and thereby prevent a penalty in job completion time, the number of slaves in the existing cluster is scaled up. In order to implement this management policy in FaultPlay, the following steps need to be performed: In the node health script, an existing template for observing the metric of choice and comparing it with a threshold is modified. The Linux ps command is used to monitor CPU usage per process. The CPU usage threshold is then set to an application-specific value that can be determined by executing fault-free jobs. After a node is blacklisted, the functionality necessary to detect this condition and invoke a suitable remedial action is already built into the Remedy Manager. So the experiment invocation script simply chooses the required scaling-up method.

Figure 5-9. Sensitivity analysis of a management policy: Job completion time improvement for different values of CPU usage threshold. (Axes: CPU usage threshold; job completion time in secs. Series: With Fault and No Management; With Fault and Management.)

As an example, we implement this policy and execute it for a Pi-Estimation job. In Figure 5-6A, Figure 5-6B and Figure 5-7, we plot the time series of CPU usage values as monitored by the node-health script on healthy and faulty nodes respectively. In the case of the faulty node, CPU usage falls below the pre-set threshold of 10% and is detected after two monitoring intervals. Figure 5-8 shows job execution time under three
conditions: (1) with no injected faults, (2) with an injected fault and no management policy and (3) with an injected fault and a management policy. The injected fault is a CPU hog process affecting one slave node.

We then perform a sensitivity analysis of the chosen management policy to the CPU usage threshold value. In Figure 5-9, the average job completion time is plotted for threshold values between 5 and 25. We see that for this management policy, for values between 10 and 15, we obtain the best performance improvement. The empirical results shown in this section illustrate a few examples of the different types of studies that can be performed using FaultPlay.

5.5 Related Work

Synthetic workloads (Chen et al. 2011) provide a viable alternative to sharing real-world datasets with a large group of researchers. They can help alleviate the problem of limited access. However, these workload suites, just like the datasets from which they are derived, will continue to have some limitations as stated in Section A.1, namely, the static nature of the dataset and redundancy in efforts for implementing management solutions. Research contributions that aim at improving Map-Reduce performance include ARIA (Verma et al. 2011), Mantri (Ananthanarayanan et al. 2010), Jockey (Ferguson et al. 2012a), Starfish (Herodotou and Babu 2011), LATE (Zaharia et al. 2008) and SteamEngine (Cardosa et al. 2011). FaultPlay aims at making contributions such as these easier to implement, evaluate and share. Among these projects, Starfish and LATE have been released publicly. Starfish aims at performance management and LATE improves task scheduling in Hadoop, while our work focuses on the ability to inject and manage faults. The Failure-Scenario-As-A-Service (Faghri et al. 2012) and AnarchyApe (Roberts 2013) projects provide failure injection functionality for Map-Reduce. Our work goes beyond these two contributions, i.e. in addition to providing ways to inject faults, FaultPlay provides features to deploy and automate experiments, characterize the effect of faults and incorporate fault and performance management solutions into Hadoop.

5.6 Summary and Conclusions

In this chapter, we bring out the difficulties involved in failure studies on a Map-Reduce platform. These include lack of access to real-world failure datasets and the inherent static nature of datasets. When fault and performance management solutions are proposed through a fault-injection based study, it is difficult to evaluate, compare and improve upon prior solutions. This is because the testbed, workload and faultload used are difficult to replicate exactly due to the detailed systems configuration information necessary for this. Even when sufficient detail is available, a considerable amount of redundant effort needs to be expended for setup of the infrastructure, system software, middleware and application components necessary for such a study.

In order to overcome these challenges, we present FaultPlay, a tool to facilitate clearly defined and reproducible failure research on Map-Reduce platforms. FaultPlay helps in reducing time and effort expended on systems deployment, installation and configuration by providing the tool as a virtual machine appliance image that comes with all necessary dependencies already installed. FaultPlay consists of modules for job processing, fault injection, distributed monitoring, log parsing and for deploying recovery-based management solutions. These modules together enable a variety of characterization, performance and fault-management studies to be conducted. FaultPlay has been deployed both on an in-house cluster and on public cloud resources and has been used for evaluating a number of fault-management policies.

CHAPTER 6
MRNETS: PERFORMANCE MODELING OF MAP-REDUCE USING PETRI NETS

In the last part of this dissertation, our work is motivated by the need to extend techniques and tools presented in the prior chapters for fault and performance management of individual Map-Reduce jobs to a workload suite of Map-Reduce jobs. We specifically address the question of performance modeling of Map-Reduce workloads, which is a critical component of resource provisioning and performance management.

The empirical and machine-learning based performance models proposed in the prior chapters fulfill the needs for accurate and realistic studies. However, for the case of long-running jobs or a workload suite of multiple Map-Reduce jobs, empirical studies and the generation of representative training data for machine-learning models can be time-consuming or even infeasible. In order to facilitate an orthogonal approach to fast and accurate performance modeling, we propose a modeling approach based on Petri nets, a discrete event modeling methodology. Petri nets provide a formal method grounded in well-founded mathematical properties to capture both system structure and behavior. Petri net models are executable and are used to simulate system behavior. These Petri net models of Map-Reduce, termed MRnets, facilitate various performance analyses on both single jobs as well as a workload of jobs and are described in detail in the following sections.

6.1 Background

Currently performance in Map-Reduce is studied and/or modeled primarily through the following methods: (1) empirical studies (Ananthanarayanan et al. 2010; Zaharia et al. 2008), (2) machine-learning based models (Kadirvel and Fortes 2012; Lama and Zhou 2012), (3) analytical models (Herodotou et al. 2011; Verma et al. 2011) and (4) simulation models (Ferguson et al. 2012b; Wang et al. 2009b). These methods have worked successfully as part of several specific optimizations
to Map-Reduce implementations and Map-Reduce management modules. However, these models focus on specific aspects of the environments they are intended for. For example, some models are customized for homogeneous clusters, some models do not capture performance penalties due to faults, while some others focus on single Map-Reduce jobs and not a workload suite of jobs. The goal of this work is to add a general and comprehensive performance model to this arsenal of existing techniques.

MRnets proposes a model based on Petri nets that is capable of capturing performance of a single job, a workload of jobs, heterogeneous infrastructures as well as node faults. A Petri net is a discrete-event modeling technique that is capable of capturing key properties of a distributed system, namely, concurrency, synchronization and resource sharing. Models constructed in this work capture sufficient structure and behavior of map and reduce tasks in order to model performance. As a result, complex system details that may be unnecessary for this goal are abstracted away, resulting in simple, easy-to-understand models. Once the model is constructed and parameterized, performance analyses through simulations can be completed in the order of a few seconds. Models of a single Map-Reduce job are extended to capture performance of a workload of Map-Reduce jobs that are executed using a FIFO scheduler.

In the context of the current modeling landscape of Map-Reduce, Petri nets provide unique properties that make them an attractive choice. Petri nets are a graphical model, making them easy to interpret and use for beginners, and serve as an excellent pedagogical tool. We leverage a feature that extends the basic Petri net model, namely hierarchy, that further helps interpretability at different levels of model abstraction. Petri net models allow us to parameterize map and reduce task execution time distributions. This allows the models presented in this chapter to be extended to capture hardware heterogeneity and skew in input and intermediate datasets. MRnets can also capture performance penalties introduced into a job's execution time due to the occurrence of node faults.
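The synchronization and resource-sharing properties mentioned above can be illustrated with a minimal, untimed and uncolored Petri net written in Python. This is only a sketch of the basic firing rule, not the CPN Tools models used in this work: the `Transition` class and the `enabled` and `fire` helpers are hypothetical, while the place and transition names (MapTasks, MapSlots, Map, MapRdy, StartMap, EndMap) follow those used for the single-job model later in this chapter.

```python
from dataclasses import dataclass
from typing import Dict

Marking = Dict[str, int]  # place name -> number of tokens


@dataclass
class Transition:
    name: str
    inputs: Dict[str, int]   # place -> tokens consumed when firing
    outputs: Dict[str, int]  # place -> tokens produced when firing


def enabled(t: Transition, marking: Marking) -> bool:
    # A transition is enabled when every input place holds enough tokens.
    return all(marking.get(p, 0) >= w for p, w in t.inputs.items())


def fire(t: Transition, marking: Marking) -> Marking:
    # Firing removes tokens from input places and adds them to output places.
    if not enabled(t, marking):
        raise ValueError(f"transition {t.name} is not enabled")
    new = dict(marking)
    for p, w in t.inputs.items():
        new[p] = new.get(p, 0) - w
    for p, w in t.outputs.items():
        new[p] = new.get(p, 0) + w
    return new


# A map task can start only when a task token AND a slot token are present
# (synchronization); finishing a task returns the slot (resource sharing).
start_map = Transition("StartMap", {"MapTasks": 1, "MapSlots": 1}, {"Map": 1})
end_map = Transition("EndMap", {"Map": 1}, {"MapRdy": 1, "MapSlots": 1})

if __name__ == "__main__":
    m = {"MapTasks": 3, "MapSlots": 2, "Map": 0, "MapRdy": 0}
    m = fire(start_map, m)
    m = fire(start_map, m)        # both slots are now in use
    print(enabled(start_map, m))  # the third task has to wait for a slot
    m = fire(end_map, m)          # one task finishes and frees its slot
    print(enabled(start_map, m), m)
```

With three map tasks and two slots, two tasks run concurrently and the third is blocked until a slot token returns, which is the behavior the colored, timed models refine with task identities and execution times.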

In addition, Petri nets can be extended to be used as executable models as illustrated in an autonomic controller for IT systems (Graupner et al. 2007) and for high-performance computing systems (Kadirvel and Fortes 2010, 2011a). The latter work is reproduced in the Appendix to describe the methodology using which Petri nets can be used as system controllers.

Contributions in this chapter are as follows:

(a) Identification and proposal of Petri nets as a viable and effective modeling method to capture a Map-Reduce system's structure and behavior. A detailed mapping between Map-Reduce structure and behavior to Petri net components and features such as hierarchy, color and time.

(b) Validation of constructed models through comparisons with empirical experiments on a physical Hadoop cluster of 64 nodes.

(c) Performance analyses of Map-Reduce using Petri net models to determine metrics such as job and workload execution time in the presence of node faults.

Related work is presented in Section 6.2. In Section 6.3, characteristics of Petri nets and how they can fulfill model requirements are discussed. The proposed models are introduced in Section 6.4. Constructed models are evaluated in Section 6.5 and conclusions are presented in Section 6.6.

6.2 Related Work

Performance modeling of Map-Reduce has recently received wide attention from the research community, resulting in a variety of performance models.

Simulation models: In the Starfish project (Herodotou et al. 2011), a combination of analytical models and simulations is used for modeling performance (Herodotou 2011) and answering what-if questions. In Jockey (Ferguson et al. 2012b), an analytical model inspired by Amdahl's law and an event-based simulator are used for performance prediction of data parallel jobs. These jobs are written in SCOPE (Chaiken et al. 2008). SCOPE queries are similar to Pig and Hive queries and result in the generation of
a DAG of Map-Reduce jobs. The average error was 9.8% for the simulator and 11.8% for the Amdahl's law based method. For the goal of meeting SLOs of data parallel jobs, the authors choose to use the simulator-based method since it can capture effects of outliers and failures. MRPerf (Wang et al. 2009c) proposes a simulation approach to modeling performance under various network topologies, data localities and failures. The underlying network is modeled using the Network Simulator, ns-2 (ns2 2013).

Analytical models: In the ARIA project (Verma et al. 2011), an analytical model is proposed based on the Makespan theorem to estimate execution time of a Map-Reduce job. This model is extended and applied to a directed acyclic graph of Map-Reduce jobs produced during the execution of Pig queries in Zhang et al. (2012). This model's applicability to heterogeneous environments is shown in Zhang et al. (2013). In Paratimer (Morton et al. 2010), a progress estimator is proposed for a DAG of Map-Reduce jobs under conditions of failure and data skew. A set of estimates is provided that corresponds to execution times in different scenarios. In Castiglione et al. (2013), the Mean Value Analysis algorithm is extended to estimate average job execution time. In Mantri (Ananthanarayanan et al. 2010), simple analytical models are proposed for predicting the execution time distribution of map and reduce tasks executing on Cosmos clusters (Chaiken et al. 2008; Isard et al. 2007). Model error is reported to be within 2.9% of actual completion time. In Lim et al. (2012), the authors propose a performance model for Map-Reduce jobs by calculating the dilation factor of jobs running on multi-resource shared systems. The model applies to heterogeneous Map-Reduce environments and has an error of less than 15.8% for different benchmark applications.

Machine-learning based models: In AROMA (Lama and Zhou 2012), a support vector machine (SVM) regression model is used to predict execution time for different input data sizes, resource allocations and configuration parameters. Model error is reported to be within 12%. In Kadirvel and Fortes (2012), a comparative study of various
machine-learning regression techniques for performance predictions is presented. Four techniques, namely Gaussian process regression, multi-layer perceptron, regression by discretization and model trees, were shown to have good prediction accuracy and fast prediction computation times. Average model prediction error is reported to be within 12%.

Petri net models proposed in this work are analytical in nature and in particular serve multiple purposes, i.e. modeling a single job, a workload of jobs, a heterogeneous environment and node faults. Existing modeling techniques cater to a few of these characteristics but not all, hence necessitating a comprehensive model.

6.3 Overview of Petri nets

Informally, Petri nets consist of a graph of two types of nodes called places and transitions connected together by arcs. Places are used to represent different system conditions or states, while transitions are used to represent events that take the system from one state to another. Places contain tokens which can correspond to any system component depending on the modeling domain. For example, a place named "Job Execution" could represent the condition in the system when jobs are in execution. Tokens in this place would correspond to each executing job. The occurrence of an event that is represented by a transition is captured by the firing of a transition. A transition firing causes tokens to be removed from input places and tokens to be added to output places as shown in Figure 6-1.

For capturing Map-Reduce structure and behavior, we use three extensions to the basic Petri net, namely, color, time and hierarchy. Color refers to the data value associated with tokens in a place. All tokens within a place will be of the same color (and hence data type). A marking refers to the number and type of tokens in all the places of a model. The concept of time is captured by associating an integral value (representing simulation time) with tokens. A token becomes active and usable only when simulation
time is equal to or greater than the time associated with a token. Hierarchy allows large Petri net models to be represented as a set of smaller interconnected models.

Figure 6-1. A simple Petri net model showing a transition firing. On the left is the Petri net model before the transition fires and on the right is the model after the transition has fired. The input arc to the transition is labeled with a "2" to indicate that two tokens are needed for the transition to fire. The output arc does not have a label and this corresponds to a default value of 1. Firing results in two tokens being removed from the input place and a single token being produced in the output place.

Petri nets can be classified as low-level Petri nets and high-level Petri nets. Colored Petri nets used in this work fall under the latter category and are characterized by the use of a high-level programming language to capture many system characteristics. High-level Petri nets are also an ISO/IEC standard and CPN Tools (Ratzer et al. 2003) (used to construct and simulate models in this work) conforms to this standard.

A basic Petri net along with these extensions can be formally defined as a nine-tuple (Jensen 1995), shown below:

$CPN = (P, T, A, \Sigma, V, C, G, E, I)$, where:

1. $P$ is a finite set of places.
2. $T$ is a finite set of transitions such that $P \cap T = \emptyset$.
3. $A \subseteq (P \times T) \cup (T \times P)$ is a set of directed arcs.
4. $\Sigma$ is a finite set of non-empty color sets.
5. $V$ is a finite set of typed variables such that $Type[v] \in \Sigma$ for all variables $v \in V$.
6. $C : P \to \Sigma$ is a color set function that assigns a color set to each place.
7. $G : T \to EXPR_V$ is a guard function that assigns a guard to each transition $t$ such that $Type[G(t)] = Bool$.
8. $E : A \to EXPR_V$ is an arc expression function that assigns an arc expression to each arc $a$ such that $Type[E(a)] = C(p)_{MS}$, where $p$ is the place connected to the arc $a$.
9. $I : P \to EXPR_{\emptyset}$ is an initialization function that assigns an initialization expression to each place $p$ such that $Type[I(p)] = C(p)_{MS}$.

6.4 Modeling Map-Reduce

In this section, we describe how various aspects of a Map-Reduce job are mapped into Petri net components and concepts.

MRnets focuses on a specific open-source Map-Reduce implementation, namely Hadoop (Hadoop 2004), which has been widely adopted in many domains. Hadoop consists of the following main components: (1) JobTracker and TaskTracker daemons that manage scheduling and coordination of map and reduce tasks, and (2) NameNode and DataNode daemons that manage the Hadoop Distributed File System (HDFS). The JobTracker and NameNode daemons run on the Hadoop master node, while the TaskTracker and DataNode daemons run on the slave nodes. Figure 6-2 shows a simplified overview of Hadoop.

Figure 6-2. Overview of Hadoop showing interactions between the JobTracker, NameNode, TaskTracker, DataNode hosted on the master and slave nodes.

In a Map-Reduce job, when a node fails, all map tasks that were executed on this node (for this job) have to be re-executed on other healthy nodes. This is because map outputs are stored locally at each slave node (rather than being stored on the replicated HDFS). Map tasks whose outputs have already been read by corresponding reduce
tasks need not be re-executed. The master node detects a slave node fault after a static timeout interval and then initiates re-execution.

6.4.1 Modeling a single Map-Reduce job

The Petri net model for a single Map-Reduce job is shown in Figure 6-3.

Figure 6-3. A simple Petri net model of a Map-Reduce job

6.4.1.1 Places and Transitions

We use two types of places: one type of place (similar to typical usage) represents various conditions in the process of a Map-Reduce job execution. Place MapTasks represents map tasks that are available and ready to begin execution. Place Map corresponds to map tasks that are currently executing on slave nodes. Place MapRdy represents those map tasks that have completed execution. For the case of reduce, analogous places are Red.Avlbl., Reduce and Red.Rdy. (Table 6-1). Tokens within these places correspond to either a map or reduce task.

A second type of place is used to represent a map slot resource, called MapSlots, and a reduce slot resource, called Red.Slots, so a token within this type of place represents an instance of this resource. For example, consider a Map-Reduce cluster with 4 servers where each server is configured to have 2 map slots and 1 reduce slot. This is
capturedintheMap-Reducemodelthrough 4 x 2=8 mapslottokensinthe MapSlots placeand 4 x 1=4 reduceslottokensinthe Red.Slots place. Transitionscorrespondstoeventsthatcausethesystemtomovefromonestate tothenextandaredescribedindetailinTable 6-2 .Transitions StartMap and EndMap representthebeginningandendofamaptask.Amaptaskbeginswhenamaptaskis readyforexecutionandamapslotresourceisavailable.Andsothe StartMap transition isreadytorewhenatokenofcolorordatatype MapTask isavailableinit'sinput place MapTasks andatokenofcolor MapSlotResource isavailableinit'sinputplace MapSlots .Attheendofamaptaskexecutionrepresentedbytheringofthetransition EndMap ,amapslotresourceresourceisreturnedtotheavailableofpoolofmapslot resources.Thisisrepresentedbythereturnofatokenthroughtheringoftransition EndMap toit'soutputplace MapSlots Transition GroupKeys reswhenallmaptasksbelongingtoajobhavecompleted execution.Soifajobhas16maptasks,then16tokenswiththatsamejobIDneedtobe presentwithintheplace MapRdy for GroupKeys tore. Table6-1. PlacesinthePetrinetmodel PlaceIDDescription Systementityrepresentedby token MapTasksMaptasksthatarereadyforexecutionOnemaptask Map Maptaskisbeingexecuted Onemaptask MapSlotsMapslotresourceavailable Onemapslotresource MapRdy.Maptasksthathavecompleted execution Onemaptask Red.Avlbl.Reducestasksthatarereadytobegin execution Onereducetask ReduceReducetaskisbeingexecuted Onereducetask Red.SlotsReduceslotresourceavailable Onereduceslotresource Red.Rdy.Reducetasksthathavecompleted execution Onereducetask 6.4.1.2Arcinscriptions Arcinscriptionsarecodesegmentsinahigh-levellanguagethatareassociatedwith theinputandoutputarcsofatransition.Inputarcinscriptionsdeterminetheconditions 123


that must hold for a transition to be able to fire. Output arc inscriptions determine the type and number of tokens that are produced as a result of transition firing.

Table 6-2. Transitions in the Petri net model

StartMap
  Description: Beginning of a map task execution
  Input arc inscription significance: One map task token and one map slot resource for execution
  Output arc inscription significance: Map task tokens become ready for EndMap after a period of time determined by the distribution

EndMap
  Description: End of a map task execution
  Input arc inscription significance: A map task that has completed execution
  Output arc inscription significance: A map task whose output is ready for reduce; the map slot resource is returned after use

GrpKeys
  Description: End of execution of all map tasks of a job
  Input arc inscription significance: All map tokens belonging to a job are waited for
  Output arc inscription significance: Reduce tokens corresponding to each job are created after the corresponding map phase is completed

StartRed
  Description: Start of reduce task execution
  Input arc inscription significance: One reduce task token and one reduce slot resource for execution
  Output arc inscription significance: Reduce task tokens become ready for EndRed after a period of time determined by the distribution

EndRed
  Description: End of reduce task execution
  Input arc inscription significance: A reduce task that has completed execution
  Output arc inscription significance: A reduce task whose output is ready; the reduce slot resource is returned after use

Arc inscriptions used in the proposed Map-Reduce model are described below:

Map task execution time: The execution time of a map task is captured through a function MapExecTime on the arc between the StartMap transition and the Map place. This function makes the tokens produced by the firing of transition StartMap become available after a duration of time that is determined by the chosen distribution. This distribution can be as simple as a constant value, in which case it represents the condition when all map tasks take a similar amount of time to execute. It could also be a distribution such as Poisson, Gamma or Exponential, or even a user-defined one.

Reduce task execution time: The execution time of a reduce task is captured through the function RedExecTime on the arc between the StartRed transition and the Reduce place.

Barrier between Map and Reduce phase: In a Map-Reduce job, all map tasks must complete execution before the first reduce task can begin to execute. This is captured through the MergeMaps function used on the arc between the MapRdy


place and the GroupKeys transition. This function uses a job ID to determine the number of tokens (and hence map tasks) of that specific job that must be available in MapRdy before GroupKeys can fire.

User-defined number of Reduce tasks: The number of map tasks in a job is determined by the input dataset size and the file system block size. However, the number of reduce tasks in a job is defined by the user. In order to capture this, we use the function CreateReduces on the arc between GroupKeys and RedAvlbl. This function produces a user-defined number of reduce task tokens into the place RedAvlbl.

6.4.1.3 Model parameterization

For this Petri net model to be used to estimate the execution time of a job, certain model parameters need to be defined. These are listed and defined in Table 6-3. We will now describe how each parameter is mapped to a specific aspect of the Petri net model. The product of the number of servers or slave nodes, s, and the number of map slots per slave node, mn, determines the number of tokens that will be in MapSlots. The number of map tasks for each job, m, is used in the function MergeMaps to determine how many tokens will cause GroupKeys to fire. The number of reduce tasks, r, is used in the function CreateReduces to determine how many reduce task tokens will be produced by the firing of GroupKeys. The execution time distribution parameters, d1 and d2, are used in the functions MapExecTime and RedExecTime respectively to parameterize the chosen task distribution.

6.4.1.4 Model simulation

Once the parameters have been defined, the model can be simulated. The end result of a simulation is the collection of all reduce task tokens of a job in the place RedAvlbl. This end of the simulation corresponds to a marking of the model wherein no other transition firing is possible. The estimated execution time of the simulated job is obtained from the most recent time value associated with all the tokens belonging to that job collected in RedAvlbl. For example, if a job consisted of four reduce tasks, then at the end of simulation, four reduce task tokens, each having a possibly different time


value, are collected in RedAvlbl. The estimated execution time for this job is the largest time value associated with the four tokens.

Table 6-3. Model parameters

Parameter   Description
s           Number of servers or slave nodes in the Map-Reduce cluster
mn          Number of map slots per slave node
rn          Number of reduce slots per slave node
m           Number of map tasks for each job
r           Number of reduce tasks for each job
d1          Map task execution time distribution
d2          Reduce task execution time distribution

6.4.2 Modeling node faults

We capture the mechanism of node faults and its effect on a job's execution time through the addition of a few more places and transitions to the basic model, as shown in Figure 6-4. In Hadoop, when a slave node fails, the master node re-executes all the map tasks and reduce tasks that were executed on this node. This is because map task outputs are stored locally on each slave node (and not in the underlying replicated distributed file system) and these map outputs are no longer accessible to the reduce tasks when a slave fails. Some map tasks whose outputs have already been read by all the reduce tasks do not need to be re-executed. The master node detects a slave node fault when it stops receiving heartbeat messages from the slave for a pre-determined timeout interval. The additional places and transitions, listed in Tables 6-4 and 6-5, extend the basic Petri net model, enabling us to capture the following: (1) the number of node faults, (2) the time of occurrence of a node fault, and (3) the static timeout interval after which a master node concludes that a slave node has failed. These new model parameters are listed in Table 6-6.

The new place NumOfFaults stores one token whose color is an integer along with a time value. The integer represents the number of faults that are to be injected and the time corresponds to the time at which the fault is to be injected. When a map task token in the Map place is ready and EndMap can fire, then the token in NumOfFaults


place determines whether a fault remains to be injected. If the integer value of this token is greater than zero, then a task token is created in the FailedMaps place and the integer is decremented by one, as shown on the arc between NumOfFaults and EndMap. If the integer value of this token is zero, then a token is created in the MapRdy place, indicating successful map task completion. The newly added place FailedMaps contains tokens representing failed map tasks. Firing of the StartTimeout transition corresponds to the start of the period when the master node does not receive heartbeat messages from the failed slave. The new place Timeout represents all map tasks on the failed node that are in the timeout period and hence stalled from executing. When the new transition EndFail fires, the time value associated with each failed map task token is incremented by the static failure timeout value, because this is the time after which a map task is available for re-execution. The map task token with this updated time is then made available in the MapTasks place, where it will be rescheduled for execution on one of the remaining slave nodes.

[Figure 6-4. Augmented Petri net model of a Map-Reduce job showing places and transitions capturing behavior and structure of node faults. Places and transitions corresponding to the reduce phase have been omitted for clarity.]
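The token flow of Sections 6.4.1 and 6.4.2 can be approximated outside a Petri net tool with a small event-driven sketch. The code below is an illustration under stated assumptions, not the MRnets implementation: the function names `run_phase` and `simulate_job` are hypothetical, task times are constant, the fault injection time ft is ignored (the first nft map tasks to reach EndMap are the ones that fail), and a slot token is returned even when its task fails.

```python
import heapq

def run_phase(ready_times, slots, duration, faults=0, timeout=0.0):
    """Fire Start/End transitions for one phase; return task completion times."""
    pending = list(ready_times)      # task tokens, ordered by time stamp
    heapq.heapify(pending)
    free_at = [0.0] * slots          # slot tokens, ordered by time stamp
    heapq.heapify(free_at)
    done = []
    while pending:
        ready = heapq.heappop(pending)
        slot_free = heapq.heappop(free_at)
        # Start* fires once both a task token and a slot token are available.
        end = max(ready, slot_free) + duration
        heapq.heappush(free_at, end)             # End* returns the slot token
        if faults > 0:
            # Assumption: this task is lost to a node fault. Its token moves
            # FailedMaps -> Timeout -> MapTasks with time stamp + timeout.
            faults -= 1
            heapq.heappush(pending, end + timeout)
        else:
            done.append(end)                     # token reaches MapRdy / Red.Rdy.
    return done

def simulate_job(m, r, map_slots, red_slots, map_time, red_time,
                 nft=0, timeout=0.0):
    """Estimated job execution time for m map and r reduce tasks."""
    map_done = run_phase([0.0] * m, map_slots, map_time, nft, timeout)
    barrier = max(map_done)          # GroupKeys waits for all m map tokens
    red_done = run_phase([barrier] * r, red_slots, red_time)
    return max(red_done)             # largest time stamp among reduce tokens
```

For the 4-server example above (8 map slot tokens, 4 reduce slot tokens) and a job with 16 map tasks, `simulate_job(16, 4, 8, 4, 42, 5)` runs two map waves followed by one reduce wave; adding `nft=1, timeout=600` delays the re-executed map task, and therefore the barrier, by the timeout.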


Table 6-4. Augmented places in the Petri net model to capture node faults

Place ID      Description                                   System entity represented by token
NumOfFaults   Number of faults to be injected               Remaining number of faults to be injected
FailedMaps    Map tasks that have failed                    One map task
Timeout       Map tasks that are in the timeout period      One map task

Table 6-5. Augmented transitions in the Petri net model to capture node faults

StartTimeout
  Description: Timeout period begins for map task
  Input arc inscription significance: Represents any map task t
  Output arc inscription significance: Represents any map task t

EndFail
  Description: Failed task detected by master
  Input arc inscription significance: Represents any map task t
  Output arc inscription significance: Time associated with all map task tokens is incremented by timeout

Table 6-6. Additional model parameters to capture node faults

Parameter   Description
nft         Number of node faults to be injected
ft          Time at which faults are to be injected
timeout     Timeout period after which master concludes slave node has failed (during which master has not received heartbeats from the slave)

6.4.3 Modeling a workload of Map-Reduce jobs

When a number of Map-Reduce jobs need to be executed on a Map-Reduce cluster, we refer to such a group of jobs as a workload. When jobs arrive at the cluster, Hadoop uses one of several schedulers to determine how tasks from different jobs are scheduled to execute on the cluster. These schedulers are the FIFO, Fair and Capacity schedulers. In MRnets, the FIFO scheduler is captured through token color. The model is initialized with tokens of map tasks belonging to each job. Each map task token has a color that consists of the job ID it belongs to as well as the time of arrival of that job. Thus a map task token becomes active only after its arrival time.
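The role of token color in Section 6.4.3 can be illustrated with a short sketch. It is a simplification that covers only the map phase with a constant task time; the function name `fifo_map_phase` and the job IDs are hypothetical. Each token carries (arrival time, job ID), tokens are served in arrival order, and a token cannot start before its arrival time.

```python
import heapq

def fifo_map_phase(jobs, map_slots, map_time):
    """jobs maps job ID -> (arrival_time, number_of_map_tasks).

    Returns job ID -> time at which the job's last map task completes.
    """
    # One colored token per map task: (arrival time, job ID).
    tokens = sorted(
        (arrival, job_id)
        for job_id, (arrival, count) in jobs.items()
        for _ in range(count)
    )
    free_at = [0.0] * map_slots      # one slot token per map slot
    heapq.heapify(free_at)
    finish = {}
    for arrival, job_id in tokens:   # FIFO: earliest-arriving job first
        slot_free = heapq.heappop(free_at)
        # A token becomes active only after its job's arrival time.
        end = max(arrival, slot_free) + map_time
        heapq.heappush(free_at, end)
        finish[job_id] = max(finish.get(job_id, 0.0), end)
    return finish
```

Note that adding a job adds tokens only; the loop and the slot pool are unchanged. This mirrors the fixed number of places and transitions in the model.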


As the number of jobs in a workload increases, the number of places and transitions in the model does not increase. However, the number of map and reduce task tokens does increase. This aspect of the proposed models (i.e. no increase in the number of places and transitions) helps mitigate the state space explosion problem.

6.5 Model Evaluation

We evaluate the constructed Petri net models by comparing execution time estimates generated from the model with the actual job execution time observed on a Hadoop cluster for a variety of benchmark applications.

Experimental Testbed: The testbed used in the evaluation consists of 16 IBM blade servers (HS22) mounted on two different racks. Each physical node has an 8-core Xeon 2.4 GHz CPU and 24 GB of RAM and runs CentOS 5.5 with Xen 3.4.3. The two racks are linked together by a Gigabit Ethernet network. Each physical node hosts five guest virtual machines. Each guest VM (which forms a Hadoop slave node) runs Ubuntu 10.04.2 and is configured with a single core and 2 GB of RAM. Hadoop version 0.20.203 is used.

Map-Reduce applications: We use five benchmark applications to evaluate the accuracy of the constructed models. These were chosen from the Hadoop distribution and the PUMA benchmark suite (PUMA) and are described below:

1. Word count (WC): Map outputs a (word, 1) key-value pair for each word in a document. Reduce combines the count for each word, producing a (word, word count) pair.

2. Pi estimation (PI): Estimates the value of Pi using a quasi-Monte Carlo method.

3. Inverted index (II): Map generates the document index for each word as (word, document index). Reduce combines all occurrences of a word to produce (word, list of document indices).

4. Term vector per host (TV): Determines frequently occurring words in a document. Map produces (host, term vector) for each host. Reduce combines term vectors for each host and outputs (host, list(term vector)).


5. Sequence count (SC): Map outputs a (sequence, 1) key-value pair for each sequence in a document. Reduce combines the count for each sequence, producing a (sequence, sequence count) pair.

Input dataset: The dataset used for WC, II, TV and SC consists of books from Gutenberg (Gutenberg 2009), with size varying from 1 GB to 20 GB. PI does not require any input data.

6.5.1 Single job model accuracy

We compare the execution time estimated by the constructed Petri net model with the actual execution time observed on the deployed testbed in Table 6-7 and Figure 6-5. We initially choose constant distributions for both map and reduce task execution time and obtain an average model error of 14.8%. We then try to find the best fit distribution for representing task execution times and determine that the Poisson distribution is a much better fit for map tasks. We re-evaluate model estimations using the Poisson distribution and illustrate the results in Table 6-8. We see that model accuracy improves and the average model error is 7.4%. The improvement in model accuracy through the use of the best-fit Poisson distribution is shown graphically in Figure 6-5.

Table 6-7. Model accuracy - Single job

App   Map dist.   Map   Red. dist.   Red.   Actual (sec)   Model (sec)   Error (%)
WC    Constant    42    Constant     5      207            173           16.4
PI    Constant    25    Constant     5      134            105           21.6
TV    Constant    56    Constant     7      274            231           15.7
II    Constant    55    Constant     7      259            227           12.4
SC    Constant    280   Constant     40     1227           1125          8.3

Table 6-8. Model accuracy - Single job using best fit distribution

App   Map dist.   Map   Red. dist.   Red.   Actual (sec)   Model (sec)   Error (%)
WC    Poisson     42    Constant     5      207            199           3.9
PI    Poisson     25    Constant     5      134            110           17.9
TV    Poisson     56    Constant     7      274            242           11.7
II    Poisson     55    Constant     7      259            252           2.7
SC    Poisson     280   Constant     40     1227           1219          0.7
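When both task-time distributions are constant and all map tasks are available at time zero, the model's estimate reduces to counting scheduling waves, which gives a quick sanity check on Table 6-7. The sketch below is illustrative only. The function name `constant_model_time` is hypothetical, and the task and slot counts behind Table 6-7 are not stated in this section: the values used here (32 map tasks on 8 map slots, 4 reduce tasks on 4 reduce slots, i.e. four map waves and one reduce wave) are assumptions chosen because they reproduce the constant-model times for WC, PI, TV and II. They do not reproduce the SC row.

```python
import math

def constant_model_time(m, r, map_slots, red_slots, map_time, red_time):
    """Job time when every map task takes map_time and every reduce red_time."""
    map_waves = math.ceil(m / map_slots)     # rounds of StartMap/EndMap firings
    red_waves = math.ceil(r / red_slots)     # rounds of StartRed/EndRed firings
    # The barrier (GroupKeys) makes the two phases strictly sequential.
    return map_waves * map_time + red_waves * red_time

# (map time, reduce time, model time in sec) copied from Table 6-7.
TABLE_6_7_CONSTANT = {
    "WC": (42, 5, 173),
    "PI": (25, 5, 105),
    "TV": (56, 7, 231),
    "II": (55, 7, 227),
}

for app, (map_time, red_time, model) in TABLE_6_7_CONSTANT.items():
    # Assumed counts: 32 maps / 8 slots and 4 reduces / 4 slots (hypothetical).
    estimate = constant_model_time(32, 4, 8, 4, map_time, red_time)
    print(app, estimate, model)
```

The Poisson results in Table 6-8 cannot be reproduced this way, since each task's duration is then sampled rather than fixed.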


[Figure 6-5. Comparison of Map-Reduce job execution time observed on the testbed with model estimated values for five different Map-Reduce applications. Map execution time represented in the model either as a constant or as a Poisson distribution is also compared. Axes: Map-Reduce application (WC, PI, TV, II, SC) versus execution time (secs). Series: empirical execution time, model execution time (constant), model execution time (Poisson).]

6.5.2 Workload model accuracy

When a Map-Reduce job is part of a workload, its execution time depends on its arrival time with respect to other jobs in the workload. We illustrate variations that can occur in a job's execution time through a simple example. We consider two Word count jobs with four different arrival times corresponding to different percentages of overlap in the execution period of consecutive jobs. Figures 6-6A, 6-6B, 6-6C and 6-6D illustrate four cases of how two jobs in a workload can arrive in relation to each other. Table 6-9 lists the execution time observed in the testbed and the execution time estimated by the Petri net model for the case of two consecutive Word count jobs with different arrival times. Average model error is 2.4%.

We then consider the case of four different jobs forming a workload. We choose four jobs, namely Word count, Pi estimation, Term vector per host and Inverted index, each arriving 60 seconds after the previous one. A comparison between testbed execution times and model estimations is shown in Table 6-10. Average model error is 19.6%.


[Figure 6-6. A workload with 2 Map-Reduce jobs. The workload's and the jobs' execution times differ depending on the jobs' arrival times and the consequent overlap in their execution periods. Panels: A. No overlap, B. Full overlap, C. Partial overlap-1, D. Partial overlap-2.]

Table 6-9. Workload execution time - Two jobs in a workload

Overlap          Trial   Actual time (sec)   Model time (sec)   Error (%)
Partial (60s)    1       419                 399                4.8
Partial (60s)    2       436                 424                2.8
Partial (60s)    3       441                 425                3.6
Partial (120s)   1       431                 421                2.3
Partial (120s)   2       419                 403                3.8
Partial (120s)   3       427                 429                -0.5
Full             1       452                 451                0.2
Full             2       421                 417                1.0
Full             3       435                 418                3.9

Table 6-10. Workload execution time - Four jobs in a workload

Overlap          Trial   Actual time (sec)   Model time (sec)   Error (%)
Partial (60s)    1       627                 763                17.8
Partial (60s)    2       609                 774                21.3
Partial (60s)    3       609                 758                19.7

6.5.3 Model accuracy of single job with node faults

We test the accuracy of the failure-augmented Petri net model by injecting a number of different types of faults during job execution and comparing the observed execution times with those derived from the model. We choose different numbers of faults, different failure detection timeout values and different fault occurrence times, as shown in Table 6-11. In Figures 6-7A and 6-7B we plot the results of these comparisons. Average model error is 15.8%.
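The Error (%) columns can be recomputed from the actual and model times, which also shows how the error appears to be normalized. The sketch below is a reader's check rather than the author's procedure, and the helper names `pct_error` and `matches` are hypothetical. On the tabulated values, Table 6-9 is consistent with a signed error relative to the actual time, whereas Table 6-10 is consistent with an absolute error relative to the model time.

```python
def pct_error(actual, model, relative_to="actual"):
    """Signed percentage difference between measured and modeled time."""
    base = actual if relative_to == "actual" else model
    return 100.0 * (actual - model) / base

# (actual sec, model sec, reported error %) copied from Tables 6-9 and 6-10.
TABLE_6_9 = [(419, 399, 4.8), (436, 424, 2.8), (441, 425, 3.6),
             (431, 421, 2.3), (419, 403, 3.8), (427, 429, -0.5),
             (452, 451, 0.2), (421, 417, 1.0), (435, 418, 3.9)]
TABLE_6_10 = [(627, 763, 17.8), (609, 774, 21.3), (609, 758, 19.7)]

def matches(rows, relative_to, absolute=False):
    """True if every recomputed error rounds to the reported value."""
    for actual, model, reported in rows:
        err = pct_error(actual, model, relative_to)
        if absolute:
            err = abs(err)
        if round(err, 1) != reported:
            return False
    return True

print(matches(TABLE_6_9, "actual"))                   # signed, actual-relative
print(matches(TABLE_6_10, "model", absolute=True))    # absolute, model-relative
```

Under the actual-relative convention, the Table 6-10 errors would be larger in magnitude (for example, about 21.7% instead of 17.8% for trial 1), so the normalization matters when comparing the averages reported for different experiments.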


Table 6-11. Accuracy of Petri net model incorporating faults

No. of faults   Time   Timeout (secs)   Actual (secs)   Model (secs)   Error (%)
0               0      600              283             256            9.5
1               1      600              793             701            11.6
1               1      300              533             394            26.1
2               1      600              822             691            15.9

[Figure 6-7. Performance modeling of jobs with injected node faults. Different injected node fault scenarios are studied: A. number of faults injected (0, 1, 2), B. timeout interval value (600, 300 secs). Each panel plots empirical and model execution time (secs).]

6.5.4 What-if analysis

We vary two model parameters, namely the number of servers and the dataset size, to determine whether the model is capable of estimating execution times well when these characteristics of the Map-Reduce cluster change. Comparisons are shown in Tables 6-12 and 6-13 and Figures 6-8 and 6-9. Average model error is 7.7% for dataset size variations and 17.9% for cluster size variations.

Table 6-12. Model accuracy - Dataset size variations

Data   Map dist.   Map   Red. dist.   Red.   Actual (sec)   Model (sec)   Error (%)
1GB    Poisson     51    Constant     5      138            124           10.1
2GB    Poisson     42    Constant     5      207            199           3.9
4GB    Poisson     48    Constant     5      442            415           6.1
8GB    Poisson     49    Constant     7      911            813           10.8


[Figure 6-8. Difference in execution time observed empirically and through simulations of the Petri net model for different input dataset sizes. Axes: dataset size (1GB, 2GB, 4GB, 8GB) versus execution time (secs).]

[Figure 6-9. Difference in execution time observed empirically and through simulations of the Petri net model for different cluster sizes. Axes: number of nodes (4, 8, 16, 32, 64) versus execution time (secs).]

Table 6-13. Model accuracy - Cluster size variations

Cluster   Map dist.   Map   Red. dist.   Red.   Actual (sec)   Model (sec)   Error (%)
4         Poisson     49    Constant     7      911            813           10.8
8         Poisson     52    Constant     5      475            469           1.3
16        Poisson     44    Constant     6      241            204           15.4
32        Poisson     41    Constant     5      164            117           28.7
64        Poisson     49    Constant     5      114            76            33.3


6.6 Conclusions

In this chapter we proposed the use of Petri nets to model both single Map-Reduce jobs as well as a workload of Map-Reduce jobs. We design, develop and implement Petri net models that can additionally capture the occurrence of node faults. We show that Petri nets can also be used for what-if analysis of Map-Reduce systems that in turn enables critical resource provisioning decisions to be made.

MRnets were evaluated through detailed empirical comparisons with job executions on a 64-node Hadoop cluster. We see that Petri nets facilitate easy and accurate performance modeling with an average model error of 14.5% in a wide variety of use-case scenarios.


CHAPTER 7
CONCLUSIONS AND FUTURE DIRECTIONS

Computational environments such as enterprise data centers, grids and Infrastructure as a Service clouds operate at large scale and consist of numerous layers of middleware and software components working in unison to provide IT services. The scale at which these systems operate, along with complex interdependencies, heterogeneity and geographical distribution of constituent components, results in frequent faults. Faults that occur in these environments include fail-stop faults such as hard-disk crashes, PDU failures and rack failures, as well as performance faults that occur due to surges in workload and network overload.

This dissertation proposes techniques and tools for an autonomic approach to manage faults dynamically in Map-Reduce based systems. An autonomic approach consists of monitoring the system of interest, analyzing system behavior, planning for suitable actions to manage faults and invocation of remediation actions. This work addresses these aspects using a combination of machine-learning techniques for performance prediction and anomaly detection, dynamic resource scaling and the extension of built-in features in an open-source implementation of Map-Reduce, namely Hadoop.

An empirical testbed and a simulation tool are first used to facilitate fault injection and to perform detailed fault and performance characterization experiments on Hadoop. Both fail-stop faults such as process and node crashes as well as fail-stutter faults such as CPU, disk and network hog processes are studied. Fail-stop faults are detected by leveraging Hadoop's node health script feature and customized monitoring modules. Fail-stutter faults are detected early through the use of a single-class classification technique based on sparse coding for anomaly detection. After fault detection, recovery is performed using dynamic resource scaling, where the intensity of scaling is determined through a scaling heuristic. Scaling leverages the ability to predict job


performance through the use of machine-learning based regression methods. Through performance characterization studies, job features relating to resource allocation, dataset and middleware configuration are selected for use as inputs to regression. Through a detailed comparative study of several regression techniques, four techniques with good prediction accuracy (with an average error of 12% or less) are identified. FMR incorporates these anomaly detection and performance prediction models in order to successfully mitigate runtime penalties as high as 180% to an average of 14%. FMR thus enables Map-Reduce applications to meet Service Level Objectives in the presence of different types of faults.

In the second part of this dissertation, two tools, namely FaultPlay and MRnets, were designed, developed and implemented in order to facilitate further studies in performance and fault management of MapReduce.

FaultPlay addresses difficulties involved in failure studies on a Map-Reduce platform. FaultPlay facilitates software-defined and reproducible fault studies on Map-Reduce platforms. FaultPlay helps in reducing the time and effort expended on systems deployment, installation and configuration by providing the tool as a virtual appliance image that comes with all necessary dependencies already installed. FaultPlay consists of modules for job processing, fault injection, distributed monitoring, log parsing and the deployment of recovery-based management solutions. MRnets enable resource provisioning and performance management studies on a workload of Map-Reduce jobs through a modeling approach based on Petri nets. Model execution enables us to simulate multiple Map-Reduce jobs as well as jobs that have experienced node faults. These tools together enable a variety of characterization, performance and fault-management studies to be conducted easily.

These techniques and tools pave the way for a number of future research avenues, such as the following:


1. Grey box models for map and reduce tasks: This dissertation presents the use of machine-learning regression techniques for predicting the execution time of a Map-Reduce job. Similar models can be used to predict the execution time of individual map and reduce tasks. This will allow more accurate models and enable the design of sophisticated task schedulers for Hadoop.

2. FaultPlay for higher level Map-Reduce abstractions: FaultPlay can be extended to incorporate the ability to define and invoke Pig and Hive jobs. Fault-injection, job progress monitoring and job recovery modules will also need to be extended to accommodate these new types of jobs. FaultPlay is implemented through an object-oriented Python code base that facilitates such extensions.

3. MRnets for other Hadoop schedulers: MRnets currently model the FIFO scheduler in Hadoop. New models can be developed for other Hadoop schedulers such as the Fair and Capacity schedulers, thereby increasing the applicability of MRnets to different production environments.

4. Fault management for Map-Reduce workloads: Performance models proposed for Map-Reduce workloads through MRnets can be utilized in designing and developing end-to-end fault and performance management solutions for workloads.

5. Additional recovery actions: FMR as proposed in this dissertation considers only dynamic resource scaling as a remediation action. Other recovery actions such as vertical scaling, migration and rejuvenation could also be incorporated. The decision to choose the right recovery action and invoke it with suitable parameters would also be an interesting research problem.


APPENDIX
PETRI NETS FOR SYSTEM CONTROL IN HEALTH MANAGEMENT

Autonomic goals such as fault and performance management, described in this dissertation, are implemented as middleware components. In any IT infrastructure, multiple such middleware components need to be put together for various reasons: (1) to meet business objectives, (2) to share hardware, software or data resources amongst different teams, (3) to construct workflows between different teams and projects, or (4) to package different products into a deliverable service. Since middleware components are complex both in their functionality (as supervisors and controllers of other vital IT components) and in their interdependencies, it becomes important to seamlessly integrate these components. In order to do so, this chapter proposes the use of Petri nets, a graphical discrete-event modeling tool, for capturing relevant properties of the system of interest. The developed models are then used for analysis, simulation and control of the executing system. The modeling methodology is proposed in the context of self-caring IT systems in which a system's health is managed through a combination of diagnosis and prognosis techniques. Health management is similar to fault management with an emphasis on proactive recovery. A proof-of-concept prototype for a batch-based job submission system is presented in detail.

A.1 Background

Self-caring IT systems are defined as those that are capable of monitoring and managing their own health. Unlike self-healing systems, which are reactive to faults and failures, self-caring systems are aware of their health and hence can potentially circumvent and adapt to impending faults, or recover from them quicker and more effectively.

An increasingly common practice in IT design is that of re-using or combining existing systems to quickly and effectively build larger and more complex systems. The component systems can be software, hardware, data, services, or combinations


thereof. They could be passive (e.g. a software module) or active (e.g. a running service), static (e.g. a software library) or dynamic (e.g. an evolving service), and they could be owned by different entities. Of particular interest to this paper are systems built out of active dynamic subsystems controlled by different entities. These systems are becoming common with the emergence of Web services, hosting services, and cloud-provided infrastructure, platforms, and applications that can be combined and composed for different purposes.

The field of autonomic computing has emerged to deal with increased IT system complexity by achieving goals such as self-configuration, self-healing, self-optimization and self-protection. The desirable ability of an IT system to manage its own health has some overlap with the self-healing property in autonomic computing. However, the self-healing property refers to the ability of a system to recover from events that cause system failure or operational malfunctions (Murch 2004), whereas health management refers to the system's ability to detect, isolate and identify behaviors leading to faults (as part of diagnosis) and predict impending failures (as part of prognosis) so that corrective action can be taken before the faults occur or progress to a system failure.

The prerequisite to achieve any self-* property is self-knowledge. This is where the field of modeling comes in, by providing a system with a representation of itself. This field, with its vast and rich literature, allows for various types of models such as functional models, reliability models, performance models, cost models and security models, each to capture the behavior that is of interest in the system under study (Marinescu 2002). In the area of modeling of compute infrastructures, extensive research results exist for the modeling of web servers, web applications and multiprocessors, as well as these components put together (termed multi-tier architectures) (Conallen 1999; Kahkipuro 2001; Stewart and Shen 2005; Urgaonkar et al. 2005; Vaidyanathan and Trivedi 1999; van der Mei et al. 2001).


In the context of self-caring systems, the models of interest require the identification of health indicators, the mechanisms available to control these indicators and their dependencies on time, environment and operational parameters. An additional challenge is the representation and modeling of interactions among components that capture how a component's health impacts the health of other components and the whole system. The focus of this work is not on modeling the internals of individual subsystems. Instead the goal is to capture a subsystem's susceptibility to faults as a black box that performs an activity, and the interactions between subsystems in a typical usage environment. Our approach views any IT system of systems as a discrete-event system whose logic and interactions are captured by Petri nets, a graph-based mathematical formalism that has been successfully applied in, among others, the areas of communication protocols, manufacturing systems (Zhou and Kurapati 1999), and multiprocessor systems.

Health management refers to taking appropriate actions based on current system health and estimated future system state with the goal of keeping the system operational. Health management requires six capabilities: monitoring, diagnosis, prognosis, planning, remaining-useful-life management and remediation (Salfner et al. 2010; Vachtsevanos et al. 2006). The important benefit of health management is that by detecting an incipient fault before it progresses into an error or failure, this unhealthy condition can be treated by a range of recovery actions while possibly avoiding interruption of service. When compared with self-healing and fault-tolerant designs, these actions are typically cheaper to perform, result in shorter system downtimes and thus improve system availability and performance. As illustrated by Figure A-1, by detecting the fault (or its precursors) in a component before it affects the subsystem or system that it is a part of, we can replace the component with a non-faulty one or even repair the faulty component. The cost of replacing a component or eliminating the causes of its failure is more affordable than the cost of replacing a subsystem or system


and recovering from its failure. Prognostic Health Management (PHM) technologies have been successfully deployed in other contexts including mechanical, structural and electronic systems (Engel et al. 2000; Kalgren et al. 2007; Pecht and Jaai 2010; Roemer M.J. and Bloor 2001; Tang et al. 2008; Urmanov 2007; Vachtsevanos et al. 2006). Where applicable, concepts and techniques used in these implementations can be leveraged for IT infrastructures.

[Figure A-1. Recovery actions corresponding to different steps of the progression from fault precursors to component faults to subsystem faults to system failure. Larger circles represent larger recovery costs.]

This chapter is organized as follows. The concept of health management as it applies to IT systems is presented in Section A.2. Section A.3 motivates the need for modeling and brings out the suitability and benefits of the chosen model, introducing various aspects of Petri nets that are used in the proposed framework. The health management modeling methodology is presented in Section A.4. Section A.6 outlines related work in the field, and Section A.7 summarizes the contributions of the presented work.

A.2 Health Management

Health management consists of six main actions: monitoring, diagnosis, prognosis, planning, remaining-useful-life (RUL) management, and remediation. Figure A-2 is a high-level view of the health management architecture envisioned by our work. The subsections below provide an overview of system operation and the constituents of this architecture. The focus of our contribution is on RUL management and hence this subsection is described in detail with an example scenario.


A.2.1 Health Indicators

A health indicator is defined as a system attribute that is indicative of possibly degrading health. Tracking of health indicators can be used to detect fault precursors.

[Figure A-2. Health Management Architecture]

The choice of health indicators depends on the types of faults to be anticipated and the subsystems that make up a system. In this paper, we focus on a large and important class of faults, namely faults caused by exhaustion of resources. In this context, a resource is broadly defined to mean any type of entity or attribute that is consumed by the system and is typically available in finite supply. Resources can correspond to physical hardware, software or middleware objects or attributes. The availability of minimum required amounts of resources is essential for the proper operation of IT subsystems. Typical hardware subsystems include computational subsystems (compute nodes), storage subsystems (I/O servers, RAID arrays) and networking subsystems (switches, NIC cards). Software includes the application software and system software that execute on the hardware resources. Middleware includes services such as resource managers, job schedulers, job queues and application servers that enable services to interoperate, enabling the IT deployment to provide intended services to its users. The health of any of these subsystems is reflected by their ability to perform their operations in the present as well as until an intended point of time in the future. Resource exhaustion


can have many different causes. These include improperly implemented functionality in software, such as not releasing all dynamically allocated memory, not ensuring termination of all created child processes, or not releasing numbered resources such as socket descriptors and file descriptors. Resource exhaustion can also result from software aging, unanticipated workloads or malicious code invocations. Furthermore, hardware faults can also affect the availability of resources. A database of software vulnerabilities collected and organized by the US Department of Homeland Security (Mitre 2010) shows that an important class of system failures is those that result from resource-exhaustion faults. Numerous production software systems, such as operating systems, DNS servers and web servers, have recorded occurrences of this class of failures.

For example, a failure can occur in a compute node when an application cannot allocate more memory (memory is exhausted due to memory leaks), create new threads (the process limit for the number of threads has been exceeded), open new files (the process limit for the number of file descriptors has been exceeded) or add content to an existing file (the process limit for the maximum size of the file has been exceeded). In a denial-of-service type of attack, resource exhaustion typically occurs because too many packets were received, too many sessions were initiated or too much 'malicious' workload was received, which in turn prevents legitimate packets, sessions or workload from being processed.

For the purposes of health management, we classify system operation into two regions: a normal region in which no resource is near exhaustion and a stressed region in which at least one type of resource is being used beyond a predetermined limit but prior to exhaustion. The difference between the safe limit and the hard limit is the remaining useful resource gap; if resources are used at a known rate (or based on a known pattern), one can determine the remaining useful life from the resource gap. The two operating regions can be statically or dynamically determined. A static region

PAGE 145

can be determined by parameters set as environment variables, virtual machine (VM) parameters, etc. This can be controlled by the designer of the system of systems. For example, the designer can request a VM with some resources (CPU, memory, storage) whose values determine the hard limits. Then the designer can determine safe limits beyond which RUL management is needed. Another example is at the operating system level, where for each process, hard limits can be set through operating system hooks for each of a process' resources such as number of threads, number of file descriptors or maximum size of the process' stack or data segment. Other hardware resources such as storage servers or switches have hard limits to their data access rates as well as capacities.

An important benefit of the demarcation between normal and stressed regions of operation is that the frequency and overhead of monitoring can be controlled based on the current operating region of the systems of interest. The experience of an administrator or application expert can be leveraged to estimate safe limits. This may not always be possible, in which case a learning method must be put in place (offline or online). A dynamic demarcation between the normal and stressed region can be determined by detecting behaviors which are deemed anomalous. This is a larger problem that is beyond the scope of this paper and left for future work. Using this perspective, the static approach corresponds to the case where anomalous behavior corresponds to resource usage exceeding a safe limit.

Figure A-3 attempts to demarcate the difference in system states as considered by health management and failure management. In the normal functional state, the system is working within its typical/nominal operating region. The system is in a stressed mode of operation in the stressed functional state. For example, in the case of resource-exhaustion faults, the system moves to this state if the consumption of any resource exceeds the safe limit. This could happen because of an anticipated fault (such as a growing memory leak). The third state is a degraded functional state in which the

PAGE 146

system is still functioning but at reduced performance. RUL extension techniques can be used to prolong a system's operation in either the stressed or degraded functional states, thereby delaying a possible move into the fault-present state. The lifetime gained using RUL extension techniques can then be effectively used to evaluate various health recovery techniques (planning) and initiate an appropriate fault handling technique (remediation). This fault avoidance action will move the system back to its normal functional state. RUL extension techniques could be applied at the subsystem level but interactions between multiple subsystems could cause system performance degradation. This is indicated by the stress accommodation transition between the stressed and degraded functional states.

Figure A-3. Diagram depicting system health states, actions and events that determine state transitions.

When in the fault-present state, fault-healing techniques (such as fault-tolerance and self-healing) can be utilized to bring the system back to the normal functional state. If such recovery is not possible, then the system may progress to system failure. The

PAGE 147

goal of health management is thus to keep the system functioning in the three functional states described above.

A.2.2 Monitoring

Monitoring refers to the collection of relevant data about operating conditions of different subsystems that constitute the system under study. Health monitoring mechanisms depend on the choice of health indicators and the type of systems to be monitored. In mechanical and electronic components this is performed using sensors (temperature, pressure, vibration). In software components this is in the form of log files and software probes that could measure application performance, resource consumption, environmental constructs or internal state of the application. In systems of considerable size, the raw data collected is massive and needs to first be reduced by filtering. Feature selection, the process of choosing those data items that are possible indicators of health, needs to be done, either automatically or using the experience of human designers. Feature extraction is then executed in a computationally efficient manner. The extracted features produced by these data preprocessing steps serve as the input for the following diagnostic and prognostic steps. Consumption of resources by a subsystem can often be tracked by readily available sensors, e.g. via operating system level commands.

A.2.3 Diagnosis

In the above context, health monitoring takes one of two forms: direct detection of resource usage beyond safe limits, and indirect detection via monitoring of performance deterioration attributable to overuse or unavailability of a certain type of resource. In the case of a web server, an example of an indirect measure is the response time. An unusual response time (either too high or too low) may indicate a possible health decline. For a database server, query-processing time is a similar indicator. Diagnosis is the process of using extracted features to detect fault precursors, fault conditions and fault locations, and to characterize the nature and extent of the fault incipiency. In some

PAGE 148

cases (including resource exhaustion) the causal relationship between a health indicator and a potential fault is either known or easy to establish. In other cases diagnostic relationships may have to be learned either offline or online from data that includes values of health indicators and fault occurrences. Fault clustering, classification and decision-making can be performed on this data using statistical and machine learning techniques.

A.2.4 Prognosis

Prognosis is the process of predicting the time-evolution of a fault or the remaining useful life of a predicted-to-fail component, after an incipient fault has been detected. Prognostic approaches can be data-driven, probability-based or model-based. With the recent development of control-theoretic approaches to modeling and controlling computing systems, model-based techniques can be employed for prognosis in IT systems. Useful life predictions from the prognostics step are then used to initiate preventive maintenance or schedule future maintenance activities.

In the case of impending resource exhaustion failures, a model of resource consumption as a function of time and/or activity (process invocation, user requests, etc.) is required. The model can be learned from observed operation in the normal region or in the stressed region or both.

A.2.5 Planning

Based on the results of the prognostic and diagnostic modules, the HM system needs to initiate appropriate health remediation actions. In the case when an incipient fault is discovered, it may be possible to slow down or reverse fault progression by mitigating or eliminating its causes. If the fault cannot be delayed or avoided, recovery of the component, subsystem or system might have to be initiated. The choice of the remediation action will depend on the useful life available, cost of remedial actions and duration of downtime that can be tolerated by the IT subsystem being managed. When it is found that the estimated RUL available may not be sufficient for invoking

PAGE 149

planning algorithms or the remediation operation, then RUL management techniques are employed to attempt to gain valuable useful life.

A.2.6 Remaining Useful Life Management

Remaining Useful Life (RUL) is defined as an estimate of the time after which a component will fail with high probability. The factors that affect the RUL include workload stresses and operating conditions. The latter includes interactions with other subsystems or components, wear and tear of components due to age, environmental changes, etc. RUL management refers to the process by which the useful life of a system is controlled using some of the factors that affect it, thereby allowing for the extended life to be used effectively. The RUL of a system is determined by the RUL of each of its subsystems. In the case when fault progression in each of these components is independent of progression of faults in other components, then system RUL can be estimated as the shortest RUL of all the components. When fault progression is not independent, explicit models need to be constructed to capture these correlations. Figure A-4 shows how system RUL is extended by prolonging useful life of one of its components.

Figure A-4. Prolonging system RUL by extending the RUL of subsystem labelled COMP 1 to match that of subsystem labelled COMP 2

Applying Classical Control Theory

This section will detail an approach to RUL management through the application of the principles of feedback control theory. The system of interest, shown in Figure A-5, is

PAGE 150

referred to as the target system. It consists of a number of subsystems or components that are essential for its operation. The measured output of the target system will be the rate at which RUL decreases for its different subsystems. As in previous sections, our scope is restricted to resource-exhaustion failures. The control input consists of those factors of the target system that can be controlled in order to affect the subsystem RUL rates. The disturbances consist of those factors that will affect RUL rates but cannot be controlled and need to be compensated for by the system controller. The noise inputs correspond to measurement noise that gets added to the measured system output. Figure A-6 shows the feedback control loop of the target system. The goal of the feedback controller is to control the rate at which component RUL decreases with the ultimate goal of controlling and prolonging system RUL. It is important to note that for any target system it is a challenging task to estimate component RUL and system RUL. As a first step towards solving this challenging problem, we focus our attention on the rate at which RUL decreases in the case of resource-exhaustion faults.

Figure A-5. Overview of target system. In control systems, the target system is viewed to have a finite set of control inputs that can modify system behavior, disturbance and noise inputs that can affect system behavior and a set of system outputs that are indicative of system operation.

Figure A-6. Feedback control loop of target system. Reference input is the desired rate of Remaining Useful Life (RUL) change and measured output is the measured rate of RUL change.
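As an illustration of the feedback loop in Figure A-6, the following minimal Python sketch closes an integral controller around a first-order linear plant. The plant coefficients a and b, the gain values and the function name are illustrative assumptions and are not taken from the experiments; the output y stands for the measured resource consumption rate and the input u for the workload.

```python
def simulate_integral_control(a, b, gain, reference, steps):
    """Simulate a hypothetical first-order plant under integral control.

    Plant (assumed form):  y(k+1) = a * y(k) + b * u(k)
    Controller:            u(k)   = u(k-1) + gain * (reference - y(k))
    Returns the list of measured outputs y(1), ..., y(steps).
    """
    y = 0.0  # measured output, e.g. resource consumption rate
    u = 0.0  # control input, e.g. workload offered to the application
    outputs = []
    for _ in range(steps):
        error = reference - y      # deviation from the desired rate
        u = u + gain * error       # integral action accumulates the error
        y = a * y + b * u          # plant responds to the new input
        outputs.append(y)
    return outputs
```

With the hypothetical values a = 0.5 and b = 1.0, a gain of 0.2 keeps both closed-loop poles inside the unit circle, so the output settles at the reference. A much larger gain such as 4.0 places a pole outside the unit circle and the output diverges, which is why the admissible gain range has to be determined before tuning.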

PAGE 151

Example Scenario

Let us take the example of a server in a virtualized environment, which is a common subsystem in many IT systems. Since we are interested in resource exhaustion failures, the HM module will check whether resource consumption is within the normal operating region. When the operation enters a stressed region, the HM feedback controller will be invoked to keep the resource depletion in check. We built a synthetic testbed to create a proof of concept implementation of a feedback controller.

The target subsystem of interest in our experiment consists of a virtual server running an application. We consider a CPU-intensive workload that is similar to a workload running on a compute node in an HPC environment or a cloud infrastructure. In this experiment we chose to study memory exhaustion failures and so a memory leak was introduced into the application code. The output of the target system is the rate at which memory is depleted, which indirectly provides a measure of the RUL. The control input is the workload that is to be processed by the application code. This target system is modeled as a first-order linear system.

System Identification

System identification is the process of determining the relationship between the control input and measured output of the target system. The application was executed for different workload values and the memory leak rate was measured for each. Linear regression was used to find values of the model parameters.

Controller Design

For the determined model parameters, the closed loop transfer function of the target system is determined. In order to design an integral controller, the range of possible controller gain (K) values is determined such that the stability condition (pole magnitude is less than one) of the system is always satisfied. Since model parameters only approximate system characteristics, empirical studies need to be performed to determine good controller gain values. Experiments were conducted for

PAGE 152

different controller gain values. The graph in Figure A-7 shows the measured output and reference input for K values of 0.13, 0.20 and 0.25. The overshoot and settling times are determined by the dominant pole in the closed loop system. The best controller gain is chosen based on the needs of the system.

Figure A-7. Graph showing measured output from target system for different controller gains and reference input signal to closed loop system.

In these experiments, we consider one measured output from the target system, namely memory leak rate, and one control input, namely workload. Future work will extend this Single-Input Single-Output (SISO) target system model to a Multiple-Input Multiple-Output (MIMO) model in which the measured outputs will include different subsystem RUL rates.

A.2.7 Remediation

Remediation refers to the actions that are invoked for fault masking, mitigation, prevention or recovery. In the case of a possible resource exhaustion fault, the useful life of a system operating in the stressed region can be extended either by adding more resources to the system or by decreasing the workload experienced by it. For example, when we have a group of web servers processing client requests, the number of servers can be expanded seamlessly to reduce overloading of any specific stressed server. The throughput of the system as a whole is maintained (at the cost of additional

PAGE 153

resources). In the case of virtualized environments, a stressed server's resources can be transparently increased while the server continues to remain online.

Given the above principles of health management, our approach to systematically incorporate health management into IT systems consists of using a modeling framework to capture the IT system of interest and then using this model as a system manager to control and operate the system in such a way as to keep its health managed. The following section details the proposed modeling framework and methodology.

A.3 Modeling Framework

A.3.1 Requirements of a Suitable Model

In a typical IT system, process invocation and completion events, management events and failure events all occur in an asynchronous and concurrent manner. Thus, in general, an IT system is not amenable to description through discrete or continuous variables modeled by difference or differential equations; it is best viewed as a Discrete Event Dynamic System (DES) (Zhou and Dicesare 1993). State changes are initiated by the occurrence of the mentioned events in distributed components that are linked through a network. So in addition to asynchrony and concurrency, the model should capture functional dependencies and sequential ordering between interacting components. The model should also provide a way to represent quantitative parameters that define these interactions so that analysis of performance, reliability and QoS measures can be performed. A Petri net is one candidate model that meets these requirements.

A.3.2 Petri Net Basics

A Petri net is a bipartite graph where the two types of nodes are places and transitions. A Petri net is defined by the quintuple $(P, T, I, O, M_0)$ where $P = \{p_1, p_2, \ldots, p_n\}$ is a finite non-empty set of places and $T = \{t_1, t_2, \ldots, t_m\}$ is a finite non-empty set of transitions. $I$ is the set of input arcs connecting places to transitions and $O$ is the set

PAGE 154

of output arcs connecting transitions to places. A marking is defined as a mapping of tokens to places in the net. $M_0$ is the initial marking. In the graphical notation for Petri nets, places are represented as circles, transitions as rectangles or simply lines and tokens as dots inside places.

Figure A-8. Simple Petri net before and after firing.

Petri net execution consists of the 'firing' of transitions. A transition may fire when it is enabled. A transition is considered enabled when the number of tokens in each of its input places is larger than or equal to the number (or cardinality or weight) of arcs connecting that input place to the transition. The firing of the transition results in the production of tokens in all the output places of the transition, the number of which depends on the cardinality of the output arcs. Figure A-8 shows a simple Petri net with two places and one transition, both before and after firing. The various benefits of Petri nets as well as the properties of interest to our application are outlined in the Appendix.

A.3.3 Petri Net Extensions of Interest

Numerous extensions to the basic Petri net model have been developed over the years to apply them for different purposes and types of systems. The following are the extensions that we found useful for representing autonomic properties and health management in IT systems:

Colored Petri Nets (CPN) - Tokens are associated with values belonging to different types (also called colors). This allows for more compact representation of complex nets (Jensen et al. 2007).

Stochastic Petri Nets (SPN) - Random firing delays are associated with transitions whose firing is atomic (Marsan 1990). Random delays are exponentially distributed.
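The enabling and firing rule of Section A.3.2 can be sketched in a few lines of Python. This is a minimal illustration for the basic (uncolored, untimed) Petri net only; the dictionary-based representation and the function names are assumptions made for this sketch. Firing removes tokens from the input places according to the input arc weights, as in the standard definition, and adds tokens to the output places.

```python
def is_enabled(marking, inputs):
    """A transition is enabled when every input place holds at least
    as many tokens as the weight of the arc from that place."""
    return all(marking.get(place, 0) >= weight
               for place, weight in inputs.items())


def fire(marking, inputs, outputs):
    """Return the marking reached by firing one transition.

    marking: dict mapping place name -> token count
    inputs:  dict mapping input place  -> input arc weight
    outputs: dict mapping output place -> output arc weight
    The original marking is left unchanged.
    """
    if not is_enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, weight in inputs.items():
        new_marking[place] = new_marking.get(place, 0) - weight   # consume
    for place, weight in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + weight   # produce
    return new_marking
```

For a net like the one in Figure A-8, with two places and one transition, firing from the marking {"p1": 1, "p2": 0} with inputs {"p1": 1} and outputs {"p2": 1} moves the single token from p1 to p2, after which the transition is no longer enabled.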

PAGE 155

Stochastic Reward Nets (SRN) - A reward rate is associated with each marking to enable evaluation of the reliability of complex systems (Muppala et al. 1994).

A.3.4 Hierarchy Modeling

IT health management needs to take into consideration numerous subsystems, their structural connections and functional interdependencies and the processes that use different types of resources. One of the most intuitive ways to handle this complexity is through hierarchy in modeling and Petri nets provide a well-established paradigm to represent hierarchy.

A Petri net place is made to represent a subsystem. This means a token entering that place would subsequently enter a lower level Petri net that is embedded in that place. Let us consider the case of a business environment. A job represents a task that needs to be performed. At the highest level, a resource needs to be selected that is capable of performing this task. Each resource is represented by two places, one in the available condition (represented as OK in the figure) and the other in the failed condition (represented as Not OK). Each resource place consists of a Petri net that models the functioning of that resource. Figure A-9 shows the described scenario.

A.4 Health Management Modeling Methodology

We introduce a modeling methodology to transform knowledge about how a system operates into a Petri net model and incorporate health management techniques into it. Our initial focus is on systems that consist of the execution of a sequence of activities, each with its own set of resource requirements. In this context we define a resource to be any hardware, software or middleware entity that is consumed and is available in finite supply.

A.4.1 Step I - Modeling System Structure and Dependencies

In this step, the system is modeled as a sequence of activities or stages. Each stage or activity is associated with an n-dimensional hypercube of resources that is required to perform this activity. This step thus captures both ordering of activities as

PAGE 156

well as the dependencies between resources and activities. It is assumed that resource exhaustion failures do not occur and the system is not yet being controlled or managed.

Figure A-9. Hierarchical Petri Net Model. At the resource selection level, each place such as Resource 1-OK or Resource 3-OK can be expanded into its own Petri net model (shown in boxes titled RESOURCE 1-OK and RESOURCE 3-OK).

The system model consists of places that are of two types: activity places and resource places. Existence of a token in an activity place indicates that the activity is being performed on an object within that place. This object could represent a job or client request. The existence of a token in a resource place indicates that a unit of that resource is available for consumption. Since we are considering an n-dimensional resource space, each resource type corresponds to a specific colored token. Transitions represent events such as the start and end of an activity. The firing of a transition indicates the occurrence of this event. The pre-resource place of a transition corresponds to the conditions that are required for an event to take place or the transition to fire. The output arcs connecting a transition to a resource place indicate the return of a resource instance back to the pool of available resources.

PAGE 157

Figure A-10. Partial Petri net model showing activity dependencies on other activities and resource availability. Transition Begin Act. n fires after a token is available in Act. n-1 and Res. n (indicating that a task that has undergone activity n-1 can move on to activity n only after activity n-1 is complete and when resource n is available).

For modeling purposes we use a generalized stochastic Petri net (GSPN) model, which is an extension to SPN. GSPNs allow two types of transitions, one whose firing delay is zero and the other whose firing delay is exponentially distributed. This provides flexibility in capturing different transition firing times. Figure A-10 shows the model of a part of a system that consists of three activity places {Act. n-1, Act. n, Act. n+1} and two resource places {Res. n-1, Res. n}. In this figure, each resource place is one-dimensional (in general, by using colored Petri nets, multiple resource dimensions can be captured by a single place). A transition between two activities represents both the end of the previous activity as well as the beginning of the next activity. Activity 'n' can begin only if a token exists in activity place Act. n-1 and in the resource place Res. n. After activity Act. n completes, a token (corresponding to one unit of resource Res. n) is returned to the pool. The bound on a resource place will correspond to the maximum number of available resources.

A.4.2 Step II - Modeling Useful Life Management

In Step II, each stage is augmented with a health management controller. In essence, these controllers detect fault precursors that may indicate resource exhaustion vulnerabilities and use this to make prognostic decisions on the health of the system being controlled. Control parameters appropriate for each system are then modulated in order to prolong useful life.

PAGE 158

The model in Figure A-11 shows a specific activity augmented with four places. The normal condition place P-Normal is active if all resources needed by this system are above pre-determined safe limits. The stressed condition place P-Stressed is active if any of the resources needed by this system are below pre-determined safe limits. The controller place P-Throttle indicates the condition when the controller is throttling the arrival of tokens to be processed by the place Act. n. The place P-Enable is the place that determines whether firing of transition Begin Execute is enabled or not. As can be seen from the model, the place P-Throttle is active only in the stressed operating region.

This step also aims at capturing the implications of controlling the useful life of one activity as it reflects on other stages in the system. This impact on other stages will be determined by the nature of dependencies between activities. Let us look at an example scenario to describe such an interaction. The model in Figure A-11 shows two stages in a queue-based load balancer controlled system. The first stage shows client requests in the queued state Act. Queue and the second stage Act. Execute represents these requests executing on a server. Associated with each stage is the set of n resources that are needed for the activity. Res. Execute would consist of CPU, memory and storage while Res. Queue would consist of parameters such as storage, queue length, etc. The second stage is augmented with an HM controller. The monitoring component of the HM controller will monitor the rate at which resources are being depleted on the server. When the server enters the stressed region, a prognostic decision is taken and the controller will throttle the rate at which client requests are being provided to the server in order to control the rate of resource depletion. In the Petri net model, this is captured by controlling the firing rate of the transition Begin Execute. This will control the rate of resource consumption of the server and in turn prolong the remaining useful life of this subsystem.

An important effect of this controller action is that the number of client requests in the queued stage may build up. In the Petri net model this is reflected by an increase

PAGE 159

in the number of tokens in the Act. Queue place (a natural consequence of decreased firing rate of transition Begin Execute) as well as an increase in the resource parameter 'queue length' in resource Res. Queue. The local HM controller of the activity Act. Queue will take appropriate action when and if this queue length crosses a safe limit. Thus we can see that controlling the rate of consumption of one resource affects the consumption of another resource in a different stage. Coordination between local controllers is needed to ensure smooth operation of the system of systems. We would like to emphasize that the goal of the health management controller is to prolong useful life, and this alone may not prevent the component from failing. So recovery needs to be initiated in the stage under control or in other parts of the system in anticipation of what might happen in the future.

Figure A-11. Partial Petri net model of queued and execution stages of a queue-based system showing interaction between stages. Throttling the firing of transition Begin Execute results in potential accumulation of tokens in place Act. Queue and increased usage of resources in Res. Queue.
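The interaction between the two stages can be illustrated with a small deterministic simulation. This is only a sketch under simplifying assumptions: time advances in fixed steps, arrivals and firing rates are constant integers rather than exponentially distributed delays, and all names and numbers are hypothetical. It shows how throttling the firing of Begin Execute makes tokens accumulate in Act. Queue until the queue length itself crosses a safe limit.

```python
def simulate_throttling(arrivals_per_step, normal_rate, throttled_rate,
                        throttle_from, steps):
    """Return the queue length after each step.

    arrivals_per_step: requests entering Act. Queue per step
    normal_rate:       Begin Execute firings per step in the normal region
    throttled_rate:    Begin Execute firings per step while throttled
    throttle_from:     first step at which the controller throttles
    """
    queue = 0
    history = []
    for step in range(steps):
        queue += arrivals_per_step
        rate = throttled_rate if step >= throttle_from else normal_rate
        queue -= min(queue, rate)   # cannot dispatch more than is queued
        history.append(queue)
    return history


def first_step_over_limit(history, safe_limit):
    """Return the first step at which the queue exceeds its safe limit,
    or None if the limit is never crossed."""
    for step, length in enumerate(history):
        if length > safe_limit:
            return step
    return None
```

In a run with three arrivals per step, a normal rate of three and a throttled rate of one from step 5 onward, the queue stays empty until throttling begins and then grows by two requests per step, so a queue safe limit of five is crossed a few steps later. This is the point at which the local HM controller of Act. Queue would have to act.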

PAGE 160

A.4.3 Step III - Modeling Health Recovery

In step III, the goal is to enable the initiation of health recovery measures. When the system is operating in the stressed region, diagnostic and prognostic results might indicate a need for a health recovery. In this case, possible health recovery actions are evaluated by taking into consideration the costs and time overhead involved with the execution of each action as well as the nature of service interruptions that can be tolerated by those activities that utilize the resources involved. The useful life gained by the techniques employed in the previous step can be effectively utilized to perform this evaluation and to initiate and execute these recovery actions. The goal of these health recovery techniques is to eliminate or mitigate the effects of the observed fault precursors. These techniques can be placed in the broad categories of masking, mitigation, repair, reconfiguration and replacement.

Figure A-12. Partial Petri net model showing places and transitions added in step III augmentation, conveying when and how remedial actions are invoked. Increased resources are made available in Res. n in the case of remediation for potential resource exhaustion faults.

PAGE 161

In the previously considered example, a sequence of client requests is being processed by a server. A resource exhaustion failure could occur because of two conditions: 1. too few resources and 2. heavy consumption. The former case could occur because the compute node did not have the appropriate set of resources for its expected load. The latter case could have occurred because of various reasons: load that was unanticipated at design time, malicious usage with the intention of causing a resource exhaustion failure, resource leaks with the system finally running out of resources or a hardware fault affecting resource availability. Health recovery actions to deal with this condition can belong to one of two different classes. Firstly, a failure can be averted or delayed by the addition of resources to the current server. For example, if the server is running out of disk space, a new storage location could be remotely mounted. In a virtualized environment, the server could be live-migrated to another physical host with more resources of the required type, be it memory, CPU or network bandwidth. Secondly, a failure can also be averted by rejuvenation of the server. This is a proven technique that can handle software ageing-related resource-exhaustion failures (Kolettis and Fulton 1995; Vaidyanathan and Trivedi 2005).

A.4.4 Step IV - Health Management Policies

In this step, policies that determine when and how a remedial action needs to be performed are identified. For example, in the case of resource exhaustion faults, policies determine the resource threshold levels below which RUL management is invoked or when remediation is invoked. These levels can be identified based on the knowledge of a system expert. These policies can also be derived based on simulation and analytical studies of a production system. Since Petri net models allow for both simulation and analysis, these can be used to determine average resource consumption estimates for typical workloads. Significant deviations from these expected resource consumption values can then be used as potential indicators of anomalous conditions.
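The two kinds of policy described above can be sketched as follows. The threshold comparison follows the text directly; the deviation test uses a k-standard-deviation rule, which is an assumption of this sketch (the text does not say how a 'significant' deviation is quantified), and the function names and default value of k are likewise hypothetical.

```python
from statistics import mean, stdev


def operating_region(available, safe_limit):
    """Static policy: the system is stressed once the available amount
    of a resource falls below its safe limit."""
    return "stressed" if available < safe_limit else "normal"


def is_anomalous(baseline, observed, k=3.0):
    """Deviation policy (assumed rule): flag an observed consumption
    value that lies more than k sample standard deviations away from
    the mean of the baseline estimates, e.g. values obtained from
    simulating the Petri net model under typical workloads."""
    center = mean(baseline)
    spread = stdev(baseline)   # requires at least two baseline values
    return abs(observed - center) > k * spread
```

A system expert would supply safe_limit in the first function, while baseline in the second would come from simulation or analysis of the model. The two functions are independent and either policy can be used on its own.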

PAGE 162

The above four steps are used to systematically model a given IT system and then augment it with health management capabilities. The final model will then serve as the basis to build the global system manager. The functionality associated with each Petri net place is converted to code that is then input to a generic Petri net execution engine.

A.5 Proof-Of-Concept Implementation

The modeling methodology described in the previous section and the RUL management technique described in Section A.2 have been applied to a batch-based job submission system in a virtualized environment. A Petri net model was designed and implemented to capture properties of interest and to serve as a global system manager of the job submission system. We show that our RUL extension technique effectively prevents job failures due to memory resource-exhaustion while incurring low overhead.

A.5.1 System Overview

With the emergence of clouds and the question of the suitability of clouds for High Performance Computing, proposals have been made for the deployment of batch-based submission systems on the cloud (Marshall et al. 2010). The job scheduling and resource management facilities provided by these systems are thus combined with benefits such as flexibility in management and configurability provided by the virtualized platform. Health management as needed and applicable to such an IT system is thus the focus of this paper.

A job submission system typically consists of a group of compute nodes and a head node. The head node serves as the single access point to the virtual cluster. It is responsible for receiving user job descriptions and data, for allowing users to check on job progress, for executing job scheduler daemons, etc. The compute nodes are responsible for executing user jobs. In addition, the infrastructure also consists of storage and networking subsystems as well as the middleware that enables the overall system to function in unison. A block diagram of our testbed as well as the global system manager developed using our approach are shown together in Figure A-13.

PAGE 163

Figure A-13. Block diagram of testbed and global system manager used for proof-of-concept implementation.

A.5.2 Petri Net Model

Figure A-14 shows the Petri net model for the virtual cluster testbed shown in Figure A-13. This model captures the order of activities as well as dependencies between activities and resources. It was created and analyzed using the PIPE2 tool (Bonet et al. 2008a). Different aspects of the model will be described in further detail to illustrate how system operation and health management are captured in the model.

Figure A-14. Petri net model of the IT system under study. Each dashed box represents a stage that consists of an activity and a resource needed for the activity to proceed. The shaded circles are resource places and the unfilled circles are activity places. Transitions represent events and arcs capture ordering of activities and resource dependencies. Arcs from a resource place to a transition represent resource consumption while arcs from a transition to a resource place represent release of resources back to the system pool.

Activity and Resource Identification

The first step in the methodology is the identification of stages that constitute the chain of activities in the IT system as well as the set of resources required for each

PAGE 164

activity. Each activity and resource is mapped to a Petri net place. Events lead from one activity to another and are mapped to Petri net transitions. This approach is inspired by the application of Petri nets to the modeling of flexible manufacturing systems as described in (Zhou and Dicesare 1993; Zhou and Kurapati 1999).

The IT chain of activities of interest in the system includes 1. job construction, 2. transfer of data and submission scripts required for the job, 3. job being queued by the job scheduler till sufficient resources become available and 4. job executing on the compute nodes. Each of these activities is associated with a set of resources that need to be functioning correctly as well as have sufficient capacity for the corresponding activity to complete successfully. For simplicity, we choose one resource per stage in the model. This choice of resource in each stage is based on our experience in implementing a production bioinformatics job parallelization and management portal for execution of BLAST jobs on University of Florida's High Performance Computing resources (Kadirvel et al. 2009). The list of activities and resources is shown in Table A-1.

Table A-1. Activities and resource dependencies

Activity         Resource
Job Creation     Shadow Account - Reusable user account used to access remote resources on behalf of users
Data Transfer    Storage Space - Disk space needed for job data, submission scripts and results
Job Queued       Buffer Slot - Slot in the job scheduler queue where jobs wait before being scheduled
Job Execution    Memory - Memory consumed by job during execution

Remaining Useful Life Extension

When a system operates in the normal state, its performance is as expected. When the available resource level falls below a threshold, the system operation is said to move into a stressed state. When its performance is affected by the low resource levels, the system moves into a degraded state. In this state, the health of the system is deteriorating and the possibility of an impending fault, followed by system failure, is

PAGE 165

high. After a health deterioration is detected, the duration of time for which the system will continue to operate is defined as its remaining useful life (RUL).

When the system begins to operate in the stressed state, RUL extension is invoked if the estimated RUL is less than the duration needed for planning and remediation. The useful life gained as a result is crucial to perform planning and remediation. In the proof-of-concept implementation, a feedback controller is used to extend useful life (and hence delay an impending resource exhaustion fault) by controlling the workload of the executing job.

Useful life extension is implemented by means of a health management wrapper. This wrapper consists of code that is responsible for observing the resource consumption rate of the executing application and adjusting the workload of the application. The wrapper thus includes implementation of the logic of the entire feedback control loop. In the case of applications where the workload rate cannot be directly controlled, resource replication is a useful mechanism for health management. The elastic capacity provided by a virtualized cloud platform facilitates on-demand creation of new virtual machines to handle resource depletion faults.

The target system of interest, namely the job execution stage, is modeled as a single-input-single-output system. Model parameters are identified using linear regression of experimental data, obtained from multiple job executions. A proportional integral controller was designed and tuned empirically. The desired resource consumption rate, which is an input to the controller, is determined based on the minimum useful life that is needed for performing planning and remediation. Offline model identification and controller design has proven to be applicable for a number of applications (Abdelzaher et al. 2002; Hellerstein et al. 2004).

In the case of applications whose behavior is time varying, we can use online model parameter estimation techniques along with online controller design algorithms for effective feedback control (Astrom 1995) as shown in Figure A-15. Adaptive control

techniques are also useful when the stochastics of the disturbances affecting the target system change with time.

Planning and Remediation

Planning consists of two main constituents: isolating the cause of the health deterioration and determining a suitable remediation. Root cause analysis or diagnosis can be performed using dependability analysis (Bagchi et al. 2001; Brown et al. 2001), decision trees (Chen et al. 2004), Bayesian networks (Rish et al. 2004) etc. Since there can be a number of remediation techniques that may be applicable, a suitable one has to be chosen based on factors such as severity of the health deterioration, the nature of the application, root cause of unhealthy state, RUL available and cost of recovery action.

Figure A-15. Feedback controller loop using adaptive control theory. Model parameter estimation and controller design are performed online.

The remediation techniques that are applicable in the management of potential resource exhaustion faults include node or process rejuvenation (Vaidyanathan and Trivedi 2005), job resubmission, job migration, virtual machine migration and dynamic addition of resources to the virtual machine. Since the goal of this study is not dependent on the technique used to perform planning, we use a simple condition-based approach to determine the remediation action of choice. In the experiments performed for this study, one of three different remediations (process rejuvenation, VM migration and dynamic increase of resource allocation) is chosen based on the output of the planning module.
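As a rough illustration of what a condition-based choice among the three remediations could look like, consider the sketch below. It is not the planning module used in the experiments: the `Diagnosis` fields, the rule order and the default action are assumptions made for this example, loosely following the selection examples given later in this appendix (rejuvenation for applications that can be taken offline and checkpointed, VM migration when another VM on the same host interferes).

```python
from dataclasses import dataclass
from enum import Enum


class Remediation(Enum):
    REJUVENATION = "process rejuvenation"
    VM_MIGRATION = "VM migration"
    RESOURCE_INCREASE = "dynamic increase of resource allocation"


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical planning inputs; field names are illustrative only."""
    can_go_offline: bool         # application tolerates a short outage
    state_can_be_saved: bool     # application state can be checkpointed
    external_interference: bool  # depletion caused by another VM on the host


def choose_remediation(d: Diagnosis) -> Remediation:
    # Rule 1 (assumed): restartable applications are rejuvenated.
    if d.can_go_offline and d.state_can_be_saved:
        return Remediation.REJUVENATION
    # Rule 2 (assumed): critical applications disturbed by a co-located VM
    # are moved to another physical host.
    if d.external_interference:
        return Remediation.VM_MIGRATION
    # Default (assumed): give the VM more of the depleted resource.
    return Remediation.RESOURCE_INCREASE
```

For instance, `choose_remediation(Diagnosis(False, False, True))` selects VM migration under these assumed rules. A real planner would also weigh the available RUL and the cost of each action, as noted above.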

Global System Manager

The Petri net model developed using the above steps serves as an executable model. It is first converted into an XML representation. Java code is then associated with each activity and resource place. In an activity place, the Java code is responsible for initiating the corresponding activity, whereas in a resource place it is responsible for monitoring the managed resource. A generic Petri net execution engine (Graupner et al. 2007) is used to manage the IT system using the developed Petri net model.

A.5.3 Evaluation

The benefits of our approach are evaluated in this section. In subsection 'Structural Analysis', system properties are validated using structural analysis of the model. In subsection 'Performance Analysis', system performance is evaluated. In subsection 'Health Management', the virtual cluster testbed is described and the health of job execution is managed using a feedback controller.

Structural Analysis

Based on its structural characteristics, the constructed Petri net model is found to be deadlock free (or live), reversible, and bounded. The model net also falls under the category of free choice nets and simple (otherwise called asymmetric) nets. This classification of the model into established categories allows us to leverage efficient analytical algorithms and results that have been published for these Petri net types (Desel and Esparza 1995).

For large models, when state space explosion becomes a constraint, state space reduction techniques can be used (Valmari 1991). In addition, composition rules have been defined that will allow for properties of individual sub-systems to be preserved in a fully-constructed system of systems (Zhou and Dicesare 1989).

System properties inferred using structural analysis allow us to verify and validate both the soundness of the health management policies of the system modeled as well as the model developed. Deficiencies of the system modeler may be mitigated.
In addition, this analysis of the correctness properties of the IT system enables improvement of non-optimal design choices.

Performance Analysis

The PIPE2 simulator was used to evaluate system performance. The exponential firing delay of an output transition corresponds to the time spent by a token (or job in our case) in each stage. Since our study focuses on resource exhaustion faults, we chose to understand the effect of different job arrival rates and job sizes on resource availability. For this purpose, we use a common performance metric obtained through steady state analysis of Petri nets, namely the average number of tokens in a place. As an example, Figures A-16 and A-17 show the average availability of the shadow account resource (as a percentage of total resource capacity), for different job sizes and job arrival rates respectively. From Figure A-17 we can see that for a job arrival rate of 1 per unit time, 50% of the resources are available for newly arriving jobs while the remaining 50% of the resources are being actively consumed by jobs that have already entered the system.

Figure A-16. Average resource availability for different job sizes obtained using 'average number of tokens' metric through steady state analysis of the Petri net model

Knowledge of average values of resource consumption for typical workloads, obtained through performance analysis, is useful in setting resource consumption
thresholds for different health states of the IT system being managed. For example, one may decide to consider resource usage that exceeds 1.5 times the steady-state resource consumption as anomalous.

Figure A-17. Average resource availability for different job arrival rates obtained using 'average number of tokens' metric through steady state analysis of the Petri net model

Health Management applied to Job Execution stage

In order to study the health management capability of our approach, the Petri net model developed was used to manage execution of a sequence of jobs on a virtual cluster testbed. Each job consists of a sequence of equal-sized matrix multiplications as they are commonly present in scientific codes. For the purpose of this work, we chose the job execution stage for in-depth analysis, because jobs typically spend a significant portion of their lifetime in this stage and most faults tend to occur during this stage. The faults of interest in our study are those caused by resource exhaustion, typically due to 1. unanticipated workloads, 2. hardware errors that affect functions that impact resource consumption or 3. incorrect design or implementation of the software. In our testbed, we capture the last of these by introducing resource leaks during job execution.

Virtual Cluster Test Bed

The testbed consists of a cluster of VMware ESX (ESX 2010) virtual machines (VMs). We use one VM as the head node and three VMs as worker nodes. The open
source resource manager Torque and job scheduler Maui (Torque 2010) serve as the middleware components. These components are representative of real world deployments and are deployed in hundreds of academic and business institutions. VMware provides a Perl API to control virtual machines programmatically that is used for invoking remediation actions.

The goal of the experiments conducted on this testbed is to show that the health of the application is successfully managed by using a feedback controller to extend RUL of the application. We use three workloads of different intensity characterized as light, medium and heavy. In each workload, the size of a job, i.e. the number of matrix multiplications, is generated using an exponential distribution. Other distributions suggested in the workload literature such as the Erlang, Pareto and hyperexponential distributions are part of our ongoing study. All workloads were executed and the health state of the application was monitored for each case. First, we look at the operation of a single job on a compute node and illustrate the operation of the feedback controller.

Feedback controller for job execution

When resource consumption levels are less than a predefined threshold, job execution proceeds normally. When resource consumption increases beyond this threshold, the system is in a stressed state of operation and the controller is invoked into action. The controller, when in operation, will modulate the workload that is input to the application in order to control its resource consumption. Thresholds are a specification parameter to the algorithm and can be set based on the nature of the application, steady-state performance analysis of the system model and the desired safe utilization level of servers in the testbed.

For a specific job executing on the compute node, controller operation is shown in Figure A-18. In this example job, we chose to control and hence plot the rate of resource leak. This is because the resource of interest, namely available memory, is observed before and after each sub-task in a job. The difference between these two values is
indicative of the memory leak, i.e. the amount of memory that has not been returned to the system resource pool. Techniques proposed in this paper also apply to managing jobs in which we monitor and control resource consumption rate rather than resource leak rate.

In Figure A-18, we can see that initially there is a high observed leak rate. The controller is invoked around the 3rd time step (when resource level goes below a predefined threshold). After the controller begins to operate, the leak rate decreases and stabilizes to a value of 45 Mb per unit time. For a desired RUL of 40 time units, the reference leak rate (an input parameter to the controller) was set at 50 Mb per unit time. From the graph, we see that the controller has a settling time of 4 time units and a steady state error of 5 Mb per unit time. Offline empirical studies were performed to choose the proportional and integral gains of the controller to obtain minimal values of settling time and steady state error. Non-optimal values of controller gain resulted in oscillations in the monitored signal and values of steady-state error that were larger than those shown in Figure A-18.

Figure A-18. Graph showing operation of the proportional integral controller used to perform RUL extension in the job execution stage.

To show how remaining useful life is extended by the controller, in Figure A-19, we show how resource levels vary as the controller operates. At approximately the 90th time instant, the system enters the stressed state and the controller is put into action. The
controller tries to slow down the resource consumption rate, so that the useful life of the system can be extended. We can observe this effect produced by the controller through the difference in slopes of the dashed and dotted lines in Figure A-19. The dashed line is an extrapolation that predicts the RUL of the system if the controller were not invoked. The dotted line with its decreased slope shows the controlled rate of resource depletion. We can see that the RUL has increased by about 40 units of time. This gain in useful life allows for planning and a suitable remediation action to be initiated. In this implementation, we incorporated four different remediation techniques as listed in Table A-2. The choice among these remediations was made based on the nature of the executing application and the inputs received from the planning module that provides an indication about the most likely cause of resource depletion. For example, in the case of an application that can be taken offline for a short duration and whose state can be saved, rejuvenation is chosen. In the case of a critical application that cannot be taken offline and whose resource is being depleted due to interference from another VM on the same physical host, VM migration is chosen instead.

Figure A-19. Graph showing resource consumption on a compute node. Dashed line shows rate of resource depletion when controller has not been put into operation. Dotted line shows actual rate of resource depletion after being slowed down by controller.
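The relationship between RUL, depletion rate and the controller's reference input can be sketched as follows. This is a toy simulation under stated assumptions, not the controller used in the experiments: the static plant (`mb_per_task`), the gains `kp` and `ki`, the base workload and the 2000 Mb of remaining memory are all hypothetical values, chosen only so that a desired RUL of 40 time units yields the 50 Mb per unit time reference mentioned above. RUL is taken to be the remaining resource divided by the depletion rate, which is one simple way to estimate it.

```python
from dataclasses import dataclass


def remaining_useful_life(available: float, depletion_rate: float) -> float:
    """Time left before the resource is exhausted at the current rate."""
    if depletion_rate <= 0:
        return float("inf")
    return available / depletion_rate


def reference_rate(available: float, desired_rul: float) -> float:
    """Depletion rate at which the resource lasts for desired_rul."""
    return available / desired_rul


@dataclass
class PIController:
    kp: float               # proportional gain (assumed value below)
    ki: float               # integral gain (assumed value below)
    integral: float = 0.0   # running sum of the error

    def update(self, reference: float, measured: float) -> float:
        error = reference - measured
        self.integral += error
        return self.kp * error + self.ki * self.integral


def simulate(steps: int = 60) -> list[float]:
    """Depletion rate per time step while the workload is throttled."""
    mb_per_task = 10.0     # assumed plant: leak rate = mb_per_task * workload
    base_workload = 10.0   # tasks per unit time before throttling
    available_mb = 2000.0  # memory left when the stressed state is entered
    # The reference is fixed at the moment the controller is invoked.
    target = reference_rate(available_mb, desired_rul=40.0)
    controller = PIController(kp=0.02, ki=0.03)
    workload = base_workload
    rates = []
    for _ in range(steps):
        rate = mb_per_task * workload
        rates.append(rate)
        adjustment = controller.update(target, rate)
        workload = max(0.0, base_workload + adjustment)
    return rates
```

In this idealized setting the leak rate starts at 100 Mb per unit time and settles close to the 50 Mb reference, so the estimated RUL for the same 2000 Mb roughly doubles. The real system, with noise and a plant identified by regression, showed a small steady-state error instead.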

Table A-2. List of remediation actions
Remediation | Possible source of resource depletion | Effect of remediation
Rejuvenation | Software aging | Leaked resources are returned to system resource pool
Workload throttling | Unanticipated heavy load | Rate of resource depletion is limited
Dynamic resource increase | Unanticipated heavy load; interference (i.e. resource consumption) by internal and external disturbances to VM | Increased resource capacity for use by excess load and disturbances
VM migration | Interference (i.e. resource consumption) by disturbances external to VM | Avoid external disturbances in new physical machine host for VM

In the above experiments, resource level information is gathered using the Linux /proc filesystem. The overhead due to periodic reading of /proc is low in these examples since we are considering only one resource, memory. Monitoring overhead can become considerable when a large number of features need to be recorded, in which case the current monitoring framework can be extended with a differential scheme, i.e. the sampling frequency is adapted to current system load.

The percentage of time spent for execution of the health wrapper code (as compared to the total time taken for the execution of the health managed application) was evaluated to be less than 0.5%. The performance degradation of the application due to workload throttling is dependent on the resource depletion rate, which is set as the reference input to the controller. Our implementation allows for setting a configurable lower SLA for application performance to ensure that performance of the application is not reduced below this acceptable threshold.

Extended useful life for different workloads

The time spent by the workloads in different health states was recorded and is summarized in Figure A-20. Two important observations are in order here: 1. The time spent in the normal functional state is almost the same irrespective of the intensity of the workload. This is because in the initial stages of application execution,
resource consumption levels do not (as yet) exceed safe thresholds. 2. Duration of the extended useful life period is significantly higher for both the medium and heavy workloads because the controller comes into play for these heavier workloads, slowing down resource depletion rates and keeping the job executing to ensure workload completion. The graph in Figure A-20 clearly shows that the health management framework successfully extends the RUL when necessary (i.e. for medium and heavy workloads) to keep resource exhaustion in check and ensure successful job completion.

In the next experiment, we characterize the percentage of jobs in a typical workload that will benefit from RUL management. Fifty jobs with resource exhaustion faults were started. When a job enters the stressed state, the planning module is called to determine the duration of useful life that would be needed for diagnosis and pre-remedial actions. The desired RUL is drawn from a uniform distribution between 10 and 600 seconds. The controller is invoked if the desired RUL is larger than the current RUL estimate. The desired RUL is also used to determine the reference input to the controller. We found that for 82% of the jobs, useful life could be extended sufficiently by the designed feedback controller. In 12% of the jobs, the desired RUL was more than what could be made available by throttling workload and hence these jobs suffered from resource exhaustion failures. In the remaining 6% of the jobs, despite indications of health deterioration, RUL extension was found to not be necessary and hence remediation could be invoked immediately.

A.6 Related Work

Our goals are mostly similar to the work in Urmanov (2007), Gross et al. (2006) and Hamerly and Elkan (2001), in which the authors propose the use of pattern recognition, multivariate state estimation and Markov chain based modeling techniques for prognostics in different components (e.g. main memory module, power supply) of a compute server. Though our goals are similar, the contribution of our work is a systematic framework and methodology for health management at the system of
systems level. At the subsystem level, we use a control theoretic formulation for useful life management.

Figure A-20. Graph showing distributions of time duration spent in the three functional states for different intensity workloads

Model Driven Approach to Autonomic Systems

Dobson (2008) and Dobson (2007) have shown the need and requirements of models to capture self-* properties in autonomic systems. Our work has been influenced by Graupner et al. (2007)'s IT management controller that uses High Level Petri Nets to model IT tasks. Our work differs in that the scope of our work includes health management of system level models, rather than error recovery in individual IT components. Petri nets have also been used by Salfner and Wolter (2009) to determine service availability improvement achieved through the use of redundant servers. Dai et al. (2006) have proposed a model-driven approach to autonomic computing. But this proposal has no specifics and has not been applied to a real world autonomic solution. Bellur (2006) briefly outlines the use of Hierarchical Queuing Petri nets (HQPN) with the ultimate goal of modeling autonomic computing properties. This project's most recent status was the modeling of a Tomcat server using HQPN. The field of workflows has used Petri nets extensively by extending the basic Petri net model into
Workflow nets (Marinescu 2002; van der Aalst and van Hee 1996). Domain specific modeling languages have been used to model autonomic systems. Dubey et al. (2006) have used timed automaton models to verify timing behavior of autonomic systems. Architecture models have been used for capturing autonomic properties in IT systems in Cheng et al. (2006).

Failure, Monitoring, Detection and Prediction Studies

Though the following list of works mostly deals with failures rather than health (which is our focus), techniques from these fields can be leveraged in health management. Salfner et al. (2010) have provided a comprehensive survey of online failure prediction in computing systems, principles of which form a part of prognosis. Williams et al. (2007) propose a black box approach to failure prediction in distributed systems using performance metrics. OVIS (Brandt et al. 2008, 2009) is a tool for statistical modeling and analysis of monitored data in computational clusters, with the goal of determining advance indicators of failures. Ren et al. (2006) use semi-Markov processes to predict resource failures in fine-grained cycle sharing systems. Laguna et al. (2009) have developed an error detection system for multi-tier applications using a Hidden Markov Model (HMM) based algorithm that is capable of handling the high volume of messages in a distributed system (thereby providing efficient monitoring). Schroeder and Gibson (2010) have studied failures in large-scale systems with conclusions on failure rate, repair time and time between failure distributions. Bayesian estimation and Markov decision theory have been used in Joshi et al. (2005) to choose optimal recovery actions in a distributed system. Large-scale studies of failures in HPC and petascale systems in Schroeder and Gibson (2007b), Schroeder and Gibson (2007a) and Schroeder and Gibson (2006) motivate the importance of health management in these infrastructures.

Resource Exhaustion Studies

These works characterize resource exhaustion faults, while our goal is to use this category of faults as an example for our proposed health management solution. In Vaidyanathan and Trivedi (1999), semi-Markov reward models are used to estimate resource exhaustion rates as a function of workload. Antunes et al. (2008) have proposed a technique to predict resource exhaustion vulnerabilities in both synthetic as well as real-world DNS servers using resource usage modeling and a fault injection framework.

Control Theoretic Approaches

The application of feedback control theory to computing systems has been introduced in Hellerstein et al. (2004). The control of memory and CPU utilization in web servers has been studied in Gandhi et al. (2002) and Diao et al. (2009).

A.7 Summary and Contributions

The goal of the proposed work has been to recommend the use of a modeling framework and methodology to incorporate health management in an IT system. The use of control theory to manage useful life extension of subsystems is also proposed for the case of a possible resource exhaustion fault. As a simple illustration of the concept, a controller was built for useful life management in the application execution stage (containing a potential memory exhaustion fault) of an IT system. The gained useful time makes it possible to evaluate and initiate health remediation actions.

In order to show applicability, benefits and ease of use of the proposed methodology in a real-world IT system, a prototype global manager for a batch-based job submission system on a virtualized platform was designed and deployed.

A.8 Benefits of Petri Nets

Petri nets are a mathematical as well as graphical tool for modeling DES. A general introduction to Petri net modeling can be found in Peterson (1981). The advantages of using Petri nets include:

1. System validation - Various system properties can be determined to validate the correctness of the system.
2. System monitoring - When interfaced with a production system, Petri nets allow for real time and visual monitoring of the system status.
3. System simulation - A generic Petri-net execution engine can be used for simulating different classes of Petri nets. This is also called a token game because a simulation involves changing the state of the system by moving tokens between places.
4. System performance evaluation - The base Petri net model can be augmented with time to enable measurement of the various performance metrics of systems.
5. System composition - In the case of a large system, it would be necessary for different experts to model various parts of the system. Petri nets possess the composition property such that under certain predefined conditions, it is guaranteed that system properties are maintained when independent models are combined together to a single one (Jensen et al. 2007; Muppala et al. 1994).
6. System control - Petri nets provide the ability to generate a supervisory controller directly from the system model.

A.9 Petri Net Properties of Interest

The following is a listing of certain Petri net properties and how they apply to the modeling of an IT system.

1. Reachability: In a Petri net $C$, a marking $\mu'$ is reachable from marking $\mu$ if there exists a sequence of transition firings that can transform the state from $\mu$ to $\mu'$. The reachability set $R(C, \mu)$ is the set of all reachable markings from marking $\mu$. In a system model certain markings can represent the system operating in a faulty mode. In such a situation, it is useful to know if the system can reach (or be steered to reach) a non-faulty marking.
2. Boundedness: A place $p_i$ of a Petri net with an initial marking $\mu$ is $k$-bounded if for all $\mu' \in R(C, \mu)$, $\mu'(p_i) \le k$. A Petri net is bounded if all its places are bounded. If $k = 1$, then the Petri net is said to be safe. In a system model, boundedness can represent the capacity of a resource. If access to a data center is through a job queue, then the bound on the place representing the queue can be used to ensure that the queue neither overflows nor is underutilized.
3. Liveness: A Petri net is live with respect to a marking $m_0$ if, for any marking in its reachability set $R(m_0)$, it is possible to ultimately fire any transition in the net. Liveness guarantees the absence of deadlocks (Tang et al. 2008). In a system
model this would mean that a system could continue to operate without getting stuck in a deadlock situation arising due to resource dependencies.
4. Reversibility: A Petri net is reversible if, for each reachable marking $m \in R(m_0)$, the initial marking $m_0$ belongs to the reachability set of $m$, i.e. $m_0 \in R(m)$. A system that possesses this property will be able to reinitialize itself if a failure occurs at any stage.

Other properties of Petri nets can be analyzed by using a reachability tree or by matrix equation techniques. All the Petri net models in this work were constructed using the PIPE tool (Bonet et al. 2008b).
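For a small bounded net, the four properties listed above can be checked by brute-force enumeration of the reachability set. The sketch below is only meant to make the definitions concrete; it is not how PIPE performs its analysis. The three-place net (queued jobs, running jobs, one free scheduler slot) and all function names are hypothetical, and the `limit` guard is a crude stand-in for a proper unboundedness test.

```python
from collections import deque

# A marking is a tuple of token counts, one entry per place.
# A transition is a pair (consume, produce) of per-place token counts.


def enabled(marking, transition):
    consume, _ = transition
    return all(m >= c for m, c in zip(marking, consume))


def fire(marking, transition):
    consume, produce = transition
    return tuple(m - c + p for m, c, p in zip(marking, consume, produce))


def reachability_set(m0, transitions, limit=10_000):
    """Breadth-first enumeration of all markings reachable from m0."""
    seen = {m0}
    frontier = deque([m0])
    while frontier:
        marking = frontier.popleft()
        for t in transitions.values():
            if enabled(marking, t):
                nxt = fire(marking, t)
                if nxt not in seen:
                    if len(seen) >= limit:
                        raise RuntimeError("state space too large or unbounded")
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen


def bound(m0, transitions, place):
    """Smallest k such that the given place is k-bounded."""
    return max(m[place] for m in reachability_set(m0, transitions))


def is_live(m0, transitions):
    """Every transition can eventually fire from every reachable marking."""
    for m in reachability_set(m0, transitions):
        future = reachability_set(m, transitions)
        for t in transitions.values():
            if not any(enabled(f, t) for f in future):
                return False
    return True


def is_reversible(m0, transitions):
    """The initial marking is reachable from every reachable marking."""
    return all(m0 in reachability_set(m, transitions)
               for m in reachability_set(m0, transitions))


# Hypothetical net; places are (queued, running, slot_free).
NET = {
    "start": ((1, 0, 1), (0, 1, 0)),   # a queued job takes the free slot
    "finish": ((0, 1, 0), (1, 0, 1)),  # the job completes and is resubmitted
}
M0 = (2, 0, 1)
```

With this initial marking only two markings are reachable, the slot place is 1-bounded (safe), and the net is both live and reversible. Enumeration does not scale to large models, which is where the structural results and state space reduction techniques cited earlier become necessary.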

REFERENCES

Abdelzaher, Tarek F., Shin, Kang G., and Bhatti, Nina. "Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach." IEEE Trans. Parallel Distrib. Syst. 13 (2002).1: 80-96.
URL http://dx.doi.org/10.1109/71.980028

Ananthanarayanan, Ganesh, Kandula, Srikanth, Greenberg, Albert, Stoica, Ion, Lu, Yi, Saha, Bikas, and Harris, Edward. "Reining in the outliers in map-reduce clusters using Mantri." Proceedings of the 9th USENIX conference on Operating systems design and implementation. OSDI'10. Berkeley, CA, USA: USENIX Association, 2010, 1-16.
URL http://dl.acm.org/citation.cfm?id=1924943.1924962

Antunes, J., Neves, N.F., and Verissimo, P.J. "Detection and Prediction of Resource-Exhaustion Vulnerabilities." Software Reliability Engineering, 2008. ISSRE 2008. 19th International Symposium on. 2008, 87-96.

Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H. "Information and control in gray-box systems." SIGOPS Oper. Syst. Rev. 35 (2001).5: 43-56.
URL http://doi.acm.org/10.1145/502059.502040

Astrom, K.J. Adaptive Control. 1995.

Bagchi, Saurabh, Kar, Gautam, and Hellerstein, Joe. "Dependency analysis in distributed systems using fault injection: Application to problem determination in an e-commerce environment." In Proc. 12th Intl. Workshop on Distributed Systems: Operations and Management. 2001.

Ballani, Hitesh, Costa, Paolo, Karagiannis, Thomas, and Rowstron, Ant. "Towards predictable datacenter networks." Proceedings of the ACM SIGCOMM 2011 conference. SIGCOMM'11. New York, NY, USA: ACM, 2011, 242-253.
URL http://doi.acm.org/10.1145/2018436.2018465

Bare, Keith, Kavulya, Soila P., Tan, Jiaqi, Pan, Xinghao, Marinelli, Eugene, Kasick, Michael, Gandhi, Rajeev, and Narasimhan, Priya. "Architecting dependable systems VII." chap. ASDF: an automated, online framework for diagnosing performance problems. Berlin, Heidelberg: Springer-Verlag, 2010. 201-226.
URL http://dl.acm.org/citation.cfm?id=1985596.1985608

Barroso, Luiz André and Hölzle, Urs. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. 2009.

Bellur, Umesh. "Automating Applications Management in the Enterprise using DMTF Information Models." www.dmtf.org/education/academicalliance, 2006. [Online; accessed 16-March-2010].

Bi, Jinbo and Bennett, Kristin P. "Regression Error Characteristic Curves." Proceedings of the 20th International Conference on Machine Learning. 2003, 43-50.

Bonet, Pere, Llado, Catalina, Puijaner, Ramon, and Knottenbelt, William. "PIPE 2.5 - A Petri net tool for performance modeling." Latin American Conference on Informatics, 2007. 2008a.

Bonet, Pere, Llado, Catalina, Puijaner, Ramon, and Knottenbelt, William. "PIPE 2.5 - A Petri net tool for performance modeling." Latin American Conference on Informatics, 2007. 2008b.

Borthakur, Dhruba, Gray, Jonathan, Sarma, Joydeep Sen, Muthukkaruppan, Kannan, Spiegelberg, Nicolas, Kuang, Hairong, Ranganathan, Karthik, Molkov, Dmytro, Menon, Aravind, Rash, Samuel, Schmidt, Rodrigo, and Aiyer, Amitanand. "Apache hadoop goes realtime at Facebook." Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. SIGMOD'11. New York, NY, USA: ACM, 2011, 1071-1080.
URL http://doi.acm.org/10.1145/1989323.1989438

Bortnikov, Edward, Frank, Ari, Hillel, Eshcar, and Rao, Sriram. "Predicting Execution Bottlenecks in Map-Reduce Clusters." Proceeding of the HotCloud 2012. 2012.

Brandt, Jim, Debusschere, Bert, Gentile, Ann, Mayo, Jackson, Pébay, Philippe, Thompson, David, and Wong, Matthew. "Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems." Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid. CCGRID'08. Washington, DC, USA: IEEE Computer Society, 2008, 759-764.
URL http://dx.doi.org/10.1109/CCGRID.2008.124

Brandt, Jim, Gentile, Ann, Mayo, Jackson, Pébay, Philippe, Roe, Diana, Thompson, David, and Wong, Matthew. "Methodologies for advance warning of compute cluster problems via statistical analysis: a case study." Proceedings of the 2009 workshop on Resiliency in high performance. Resilience'09. New York, NY, USA: ACM, 2009, 7-14.
URL http://doi.acm.org/10.1145/1552526.1552528

Brown, A., Kar, G., and Keller, A. "An active approach to characterizing dynamic dependencies for problem determination in a distributed environment." Integrated Network Management Proceedings, 2001 IEEE/IFIP International Symposium on. 2001, 377-390.

Buneci, Emma S. and Reed, Daniel A. "Analysis of application heartbeats: learning structural and temporal features in time series data for identification of performance problems." Proceedings of the 2008 ACM/IEEE conference on Supercomputing. SC'08. Piscataway, NJ, USA: IEEE Press, 2008, 52:1-52:12.
URL http://dl.acm.org/citation.cfm?id=1413370.1413423

Candes, Emmanuel and Tao, Terence. "Decoding by Linear Programming." 2004.

Cappello, Franck. "Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities." Int. J. High Perform. Comput. Appl. 23 (2009).3: 212-226.
URL http://dx.doi.org/10.1177/1094342009106189

Cardosa, M., Narang, P., Chandra, A., Pucha, H., and Singh, A. "STEAMEngine: Driving MapReduce provisioning in the cloud." High Performance Computing (HiPC), 2011 18th International Conference on. 2011, 1-10.

Castiglione, Aniello, Gribaudo, Marco, Iacono, Mauro, and Palmieri, Francesco. "Exploiting mean field analysis to model performances of big data architectures." Future Generation Computer Systems (2013).
URL http://www.sciencedirect.com/science/article/pii/S0167739X13001611

CFDR. "Computer Failure Data Repository." http://cfdr.usenix.org/, 2008. [Online; accessed 12-July-2012].

Chaiken, Ronnie, Jenkins, Bob, Larson, Per-Åke, Ramsey, Bill, Shakib, Darren, Weaver, Simon, and Zhou, Jingren. "SCOPE: easy and efficient parallel processing of massive data sets." Proc. VLDB Endow. 1 (2008).2: 1265-1276.
URL http://dl.acm.org/citation.cfm?id=1454159.1454166

Chen, Mike Y., Accardi, Anthony, Kiciman, Emre, Lloyd, Jim, Patterson, Dave, Fox, Armando, and Brewer, Eric. "Path-based failure and evolution management." Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1. NSDI'04. Berkeley, CA, USA: USENIX Association, 2004, 23-23.
URL http://dl.acm.org/citation.cfm?id=1251175.1251198

Chen, Yanpei, Ganapathi, Archana, Griffith, Rean, and Katz, Randy. "The Case for Evaluating MapReduce Performance Using Workload Suites." Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems. MASCOTS'11. Washington, DC, USA: IEEE Computer Society, 2011, 390-399.
URL http://dx.doi.org/10.1109/MASCOTS.2011.12

Chen, Yanpei, Ganapathi, Archana Sulochana, Griffith, Rean, and Katz, Randy H. "A Methodology for Understanding MapReduce Performance Under Diverse Workloads." Tech. Rep. EECS-2010-135, University of California, Berkeley, 2010.

Cheng, Shang-Wen, Garlan, David, and Schmerl, Bradley. "Architecture-based self-adaptation in the presence of multiple objectives." Proceedings of the 2006 international workshop on Self-adaptation and self-managing systems. SEAMS'06. New York, NY, USA: ACM, 2006, 2-8.
URL http://doi.acm.org/10.1145/1137677.1137679

Conallen, Jim. "Modeling Web application architectures with UML." Commun. ACM 42 (1999).10: 63-70.
URL http://doi.acm.org/10.1145/317665.317677

Cong, Yang, Yuan, Junsong, and Liu, Ji. "Sparse reconstruction cost for abnormal event detection." Proc. of CVPR. 2011.
URL http://dx.doi.org/10.1109/CVPR.2011.5995434

Dai, Yuan, Marshall, Tom, and Guan, Xiaohong. "Autonomic and Dependable Computing: Moving Towards a Model-Driven Approach." Journal of Computer Science. vol. 2. 2006, 496-504.

Daniel Dean, Xiaohui Gu, Hiep Nguyen. "UBL: Unsupervised Behavior Learning for Predicting Performance Anomalies in Virtualized Cloud Systems." Proc. of ICAC. 2012.

Dean, Jeff. "Software Engineering Advice from Building Large-Scale Distributed Systems." 2008. [Online; accessed 1-July-2012].

Dean, Jeffrey and Ghemawat, Sanjay. "MapReduce: simplified data processing on large clusters." Proceedings of the 6th conference on Symposium on Operating Systems Design and Implementation - Volume 6. OSDI'04. Berkeley, CA, USA: USENIX Association, 2004, 10-10.
URL http://dl.acm.org/citation.cfm?id=1251254.1251264

Dean, Jeffrey and Ghemawat, Sanjay. "MapReduce: simplified data processing on large clusters." Commun. ACM 51 (2008).1: 107-113.
URL http://doi.acm.org/10.1145/1327452.1327492

Desel, J. and Esparza, J. Free Choice Petri Nets. 1995.

Diao, Yixin, Hu, Xiaolei, Tantawi, Asser, and Wu, Haishan. "An adaptive feedback controller for SIP server memory overload protection." Proceedings of the 6th international conference on Autonomic computing. ICAC'09. New York, NY, USA: ACM, 2009, 23-32.
URL http://doi.acm.org/10.1145/1555228.1555234

Dinu, Florin and Ng, T.S. Eugene. "Understanding the effects and implications of compute node related failures in hadoop." Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing. HPDC'12. New York, NY, USA: ACM, 2012, 187-198.
URL http://doi.acm.org/10.1145/2287076.2287108

Dobson, Simon. "Achieving an acceptable design model for autonomic systems." Proceedings of the Fourth IEEE International Workshop on Engineering of Autonomic and Autonomous Systems. EASE'07. Washington, DC, USA: IEEE Computer Society, 2007, 196-202.
URL http://dx.doi.org/10.1109/EASE.2007.4

Dobson, Simon. "Facilitating a Well-Founded Approach to Autonomic Systems." Engineering of Autonomic and Autonomous Systems, 2008. EASE 2008. Fifth IEEE Workshop on. 2008, 204-208.

Donoho, David L., Tsaig, Yaakov, Drori, Iddo, and Starck, Jean-Luc. "Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit." Tech. rep., 2006.

Duan, Rubing, Nadeem, Farrukh, Wang, Jie, Zhang, Yun, Prodan, Radu, and Fahringer, Thomas. "A Hybrid Intelligent Method for Performance Modeling and Prediction of Workflow Activities in Grids." Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. CCGRID'09. Washington, DC, USA: IEEE Computer Society, 2009, 339-347.
URL http://dx.doi.org/10.1109/CCGRID.2009.58

Dubey, Abhishek, Nordstrom, Steve, Keskinpala, Turker, Neema, Sandeep, and Bapty, Ted. "Verifying Autonomic Fault Mitigation Strategies in Large Scale Real-Time Systems." Proceedings of the Third IEEE International Workshop on Engineering of Autonomic & Autonomous Systems. EASE'06. Washington, DC, USA: IEEE Computer Society, 2006, 129-140.
URL http://dx.doi.org/10.1109/EASE.2006.24

Elad, Michael. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

Eldar, Yonina C. and Mishali, Moshe. "Robust recovery of signals from a structured union of subspaces." IEEE Trans. Inf. Theor. 55 (2009).11.
URL http://dx.doi.org/10.1109/TIT.2009.2030471

Engel, S.J., Gilmartin, B.J., Bongort, K., and Hess, A. "Prognostics, the real issues involved with predicting life remaining." Aerospace Conference Proceedings, 2000 IEEE. vol. 6. 2000, 457-469 vol. 6.

ESX, VMWare. "Building a terabyte-scale data cycle at LinkedIn with Hadoop and Project Voldemort." http://www.vmware.com/products/esx/, 2010. [Online; accessed 30-July-2012].

Faghri, Faraz, Bazarbayev, Sobir, Overholt, Mark, Farivar, Reza, Campbell, Roy H., and Sanders, William H. "Failure scenario as a service (FSaaS) for Hadoop clusters." Proceedings of the Workshop on Secure and Dependable Middleware for Cloud Monitoring and Management. SDMCMM'12. New York, NY, USA: ACM, 2012, 5:1-5:6.
URL http://doi.acm.org/10.1145/2405186.2405191

Ferguson, Andrew D., Bodik, Peter, Kandula, Srikanth, Boutin, Eric, and Fonseca, Rodrigo. "Jockey: guaranteed job latency in data parallel clusters." Proceedings of the 7th ACM european conference on Computer Systems. EuroSys'12. New York, NY, USA: ACM, 2012a, 99-112.
URL http://doi.acm.org/10.1145/2168836.2168847

Ferguson, Andrew D., Bodik, Peter, Kandula, Srikanth, Boutin, Eric, and Fonseca, Rodrigo. "Jockey: guaranteed job latency in data parallel clusters." EuroSys. 2012b.
URL http://doi.acm.org/10.1145/2168836.2168847

Fingerpointing. "Fingerpointing: Problem Diagnosis in Distributed Systems, Parallel Data Lab, Carnegie Mellon University." http://www.pdl.cmu.edu/Fingerpointing/index.shtml, 2013. [Online; April 2013].

Foundation, Apache Software. "Apache Software Foundation JIRA Issue Tracker." https://issues.apache.org/jira/browse/MAPREDUCE-728, 2009. [Online; accessed 1-July-2012].

FTA. "Failure Trace Archive." http://fta.inria.fr/apache2-default/pmwiki/index.php, 2009. [Online; accessed 12-July-2012].

Gabel, M., Schuster, A., Bachrach, R.-G., and Bjorner, N. "Latent fault detection in large scale services." Proc. of DSN. 2012.

Ganapathi, Archana. "Predicting and Optimizing System Utilization and Performance via Statistical Machine Learning." 2009.






















BIOGRAPHICAL SKETCH

Selvi Kadirvel completed her Bachelor of Engineering degree in Computer Science and Engineering at the University of Madras in Tamil Nadu, India, and her Master of Science degree in Computer Engineering at the University of Florida in Gainesville, Florida. During her master's program she was a Research Assistant building self-powered wireless systems for controlling aircraft acoustic liners at the Interdisciplinary Microsystems Group at UF. She was also a Graduate Assistant at UF's High Performance Computing Center, where she worked on performance benchmarking of parallel file systems and performance optimization of image processing applications for HPC environments. During her PhD program she has worked on projects collaborating with UF's Whitney Lab for Marine Biosciences, the Interdisciplinary Center for Biotechnology Research, and the National Science Foundation's cyberinfrastructure project, FutureGrid. Her research interests span the fields of distributed systems, machine learning, autonomic computing and fault tolerance.