Using Hadoop to accelerate the analysis of semiconductor-manufacturing monitoring data


Material Information

Title:
Using Hadoop to accelerate the analysis of semiconductor-manufacturing monitoring data
Physical Description:
1 online resource (117 p.)
Language:
english
Creator:
Zhang, Wenjie
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:
2014
Thesis/Dissertation Information

Degree:
Master's (M.S.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
FORTES,JOSE A
Committee Co-Chair:
FIGUEIREDO,RENATO JANSEN
Committee Members:
LI,TAO

Subjects

Subjects / Keywords:
data-analysis -- hadoop -- mapreduce -- scalability
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre:
Electrical and Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
Large-scale data sets are generated every day by the sensors used to monitor semiconductor manufacturing processes. To effectively process these large data sets, a scalable, efficient and fault-tolerant system must be designed. This thesis investigates the suitability of Hadoop as the core processing engine of such a system. Hadoop has been used in many fields for large-scale data processing, including web indexing, bioinformatics and data mining, where it has proved to be an effective data processing platform because it provides parallelization and distribution for applications built on it. It also offers scalability and fault tolerance using only commodity machines. In this thesis, we propose a system for statistics-based analysis of manufacturing monitoring data using Hadoop. We first present a data-oriented approach through an in-depth analysis of the data to be processed. Then, different Hadoop configuration parameters are tuned in order to improve the performance of the system. In addition, the statistics-based computation algorithm is modified so that it can reuse previous results. By establishing a mapping from the data to the Hadoop MapReduce programming paradigm, the implemented system outperforms the legacy analysis system by up to 82.7%. Experiments have been done on different mapping strategies to study the effect of configuration parameters on performance. The algorithmic optimization further improves computation time by 50%. However, the non-computation time, i.e. I/O operation time, dominates execution time in our experiments.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Wenjie Zhang.
Thesis:
Thesis (M.S.)--University of Florida, 2014.
Local:
Adviser: FORTES,JOSE A.
Local:
Co-adviser: FIGUEIREDO,RENATO JANSEN.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2014
System ID:
UFE0046807:00001




Full Text

USING HADOOP TO ACCELERATE THE ANALYSIS OF SEMICONDUCTOR-MANUFACTURING MONITORING DATA

By
WENJIE ZHANG

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA
2014

© 2014 Wenjie Zhang

To my advisor, Dr. Jose A. B. Fortes
To my parents, Peisong Zhang and Jinzhi Qu

ACKNOWLEDGMENTS

First, I would like to express special appreciation and thanks to my advisor, Dr. Jose A. B. Fortes, for his support of my study and research towards the master's degree, and for his patience, encouragement and immense knowledge. His endless support and guidance enlighten my way of study, research and career. Without his help, this thesis could not have been finished. Second, I would like to thank the rest of my committee members, Dr. Renato J. Figueiredo and Dr. Tao Li, for their patience, support, insightful comments and suggestions. Third, I wish to acknowledge the help provided by Mr. Youngsun Yu from Samsung for the insightful discussion of the problem. Fourth, my sincere thanks to Dr. Andrea Matsunaga. Her enthusiasm, patience and encouragement of research and study guided me to the door of research. In addition, a special thanks to my family and my friends for their support, understanding and encouragement. Words cannot express how grateful I am to my parents for all of the sacrifices that you've made on my behalf. At the end I would like to express appreciation to Zhuoyuan Song for his support and help in finishing the thesis.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 MapReduce Programming Model and Hadoop Workflow
    1.1.1 MapReduce Programming Model
    1.1.2 Hadoop Platform
  1.2 Motivation and Problem Statement
    1.2.1 Data Mapping and Computation Distribution
    1.2.2 System Optimization
  1.3 System Architecture
  1.4 Structure of Thesis

2 DESIGN AND IMPLEMENTATION
  2.1 Data Description
  2.2 Algorithm Analysis
    2.2.1 Percentile and Outlier
    2.2.2 Process Capability
    2.2.3 Euclidean Distance
    2.2.4 Equivalence Test
  2.3 Data Mapping
    2.3.1 Data Mapping: Database to Files
    2.3.2 Data Mapping: Files to Mappers
  2.4 System Implementation
    2.4.1 M-Heavy Implementation
      2.4.1.1 Map implementation
      2.4.1.2 Reduce implementation
    2.4.2 R-Heavy Implementation
      2.4.2.1 Map implementation
      2.4.2.2 Reduce implementation
    2.4.3 MR-Balanced Implementation
      2.4.3.1 Map implementation
      2.4.3.2 Reduce implementation
    2.4.4 Relationships with Data Mapping

3 SYSTEM OPTIMIZATION
  3.1 Hadoop Configuration Parameters Tuning
    3.1.1 Parameters of the MapReduce Programming Model
    3.1.2 Parameters of Hadoop Workflow
    3.1.3 Other Hadoop Features
  3.2 Algorithmic Optimization
    3.2.1 Data Movement
    3.2.2 Algorithmic Optimization
      3.2.2.1 The sliding-window algorithm
      3.2.2.2 Algorithm analysis

4 EXPERIMENTAL EVALUATION AND ANALYSIS
  4.1 Data Mapping and MapReduce Implementation Experiments
    4.1.1 Data Mapping
      4.1.1.1 Experiments setup
      4.1.1.2 Results analysis
    4.1.2 Large Scale Data Sets
      4.1.2.1 Experiments setup
      4.1.2.2 Results analysis
  4.2 Hadoop Configuration-Parameters Tuning Experiments
    4.2.1 Parameters of the MapReduce Programming Model
    4.2.2 Parameters of Hadoop Workflow
    4.2.3 Other Hadoop Features
    4.2.4 Put It All Together
  4.3 Algorithmic Optimization Experiments
    4.3.1 Numerical Analysis of Standard Deviation Computation
      4.3.1.1 Experiments setup
      4.3.1.2 Results analysis
    4.3.2 Sliding-Window Algorithm Performance Analysis
      4.3.2.1 Experiments setup
      4.3.2.2 Results analysis

5 CONCLUSIONS
  5.1 Summary
    5.1.1 Data Mapping
    5.1.2 Computation Distribution
    5.1.3 System Optimization
    5.1.4 Experimental Results Analysis
  5.2 Towards a Better System
    5.2.1 Automatically Tuning Hadoop Configuration Parameters
    5.2.2 Using Solid-State Drives

APPENDIX

A SETTING UP A HADOOP CLUSTER USING CLOUDERA MANAGER
  A.1 Introduction
  A.2 VM Creation and Operating System Installation
    A.2.1 Download CentOS
    A.2.2 Install CentOS 6.4
    A.2.3 Configure CentOS
    A.2.4 Replicate VMs
  A.3 Cloudera Manager and CDH Distribution

B A PARALLEL ALGORITHM FOR STATISTICS-BASED ANALYSIS

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 A vertical view of a sample record
2-2 How different MapReduce implementations constrain data mappings into the map/reduce phase
2-3 Relationships between different implementations and data mapping strategies
3-1 A subset of configuration parameters tuned for system optimization - part 1 [1]
3-2 A subset of configuration parameters tuned for system optimization - part 2
4-1 The default values of configuration parameters in Tables 3-1 and 3-2
4-2 Experimental results of compress and speculative execution parameters
4-3 The values of configuration parameters after tuning
4-4 Comparison of averages of records calculated using the two-pass algorithm and the sliding-window approach for different numbers of records
4-5 Comparison of standard deviations of records calculated using the two-pass algorithm and the sliding-window approach for different numbers of records
5-1 The advantages and disadvantages of different implementations

LIST OF FIGURES

1-1 Legacy system architecture for the semiconductor manufacturing monitoring analysis system. The figure shows the architecture of a legacy analysis system. Data collected from monitoring sensors are stored in a Microsoft database. The analysis system retrieves data from the database and analyzes the data using multiple machines. Results of analysis are sent to an Oracle database for field engineers.
1-2 The MapReduce programming model
1-3 Hadoop MapReduce workflow
1-4 An overview of the system architecture for the analysis system on Hadoop
1-5 A high-level view of the statistics-based monitoring system and the static optimization strategy
2-1 Data sets collected during several consecutive days are used for nightly statistical analysis.
2-2 Database-to-files mapping strategies - arbitrary mapping
2-3 Database-to-files mapping strategies - per-iCU mapping
2-4 Database-to-files mapping strategies - per-dCU mapping
2-5 Files-to-Mappers mapping strategies
2-6 M-Heavy implementation: Map phase
2-7 R-Heavy implementation: Map phase
2-8 R-Heavy implementation: Reduce phase
2-9 MR-Balanced implementation: Map phase
2-10 MR-Balanced implementation: Reduce phase
3-1 A shuffling phase of a MapReduce job. Each map task contains a map function. The buffer associated with the map function is the place that stores the output from the map function. When the size of the data in the buffer exceeds some threshold, the data are spilled to disks. The blocks following the buffer in each map task represent the blocks written on the disk. Each reduce task contains a reduce function. Before the reduce function starts, the reduce task retrieves the output from map tasks and merges the outputs as the input to the reduce function, represented by the arrow and the merge streams in each reduce task.
3-2 The sliding window of statistical analysis. The window size is 14 days.

3-3 Data mapping strategy with regard to the timestamp of each record
4-1 Different data mapping strategies for each implementation
4-2 Different implementations for each data mapping strategy
4-3 Time measurement for current data simulation
4-4 Influence of the number of map tasks launched. The dots in the figure show the experimental results. Each dot contains three dimensions: the number of map tasks, the HDFS block size and the execution time. The execution time is shown using different shades of gray. The darker the area is, the less time the job takes to run.
4-5 Influence of the number of map tasks (even data distribution)
4-6 Influence of the number of reduce tasks on execution time
4-7 Influence of the buffer size of the output from the map function
4-8 Influence of the parameter io.sort.factor on the execution time
4-9 Influence of the parameter mapred.reduce.parallel.copies on the execution time
4-10 Influence of the reduce phase buffer size (percent). The stars in the figure indicate that the experiments could not be finished due to the lack of memory space for the reduce task computation.
4-11 Influence of the reduce phase buffer size (records)
4-12 Influence of the parameter mapred.job.reuse.jvm.num.tasks on execution time
4-13 Influence of the parameter mapred.reduce.slowstart.completed.maps on execution time
4-14 Comparison of the execution time before and after tuning of configuration parameters for the R-Heavy implementation
4-15 The execution times of the two-pass algorithm and the sliding-window approach to compute averages and standard deviations
4-16 The global-order strategy and the time-stamp-order strategy
4-17 Sorting and the standard deviation computation time
4-18 Computation time and I/O time measurement
4-19 The total execution time for four cases
4-20 Execution time for different cases
5-1 Different system architectures

A-1 The welcome page from the CentOS image
A-2 Successfully set up the network connection
A-3 The content of the file 70-persistent-net.rules before updating
A-4 The content of the file 70-persistent-net.rules after updating
A-5 The login web interface of the Cloudera Manager
A-6 The web interface for the user to specify the nodes on which Hadoop will be installed
A-7 The web interface for the user to specify the installation package and other options
A-8 The web interface for the user to specify the appropriate credential method to access the cluster
A-9 The web interface when the installation fails
A-10 The web interface when the installation process goes well
A-11 The web interface when installation of the package is successful
A-12 The web interface when installation of the package is successful
A-13 The web interface when all services have been installed and the cluster is ready to work
A-14 The web interface showing the status of services available

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

USING HADOOP TO ACCELERATE THE ANALYSIS OF SEMICONDUCTOR-MANUFACTURING MONITORING DATA

By
Wenjie Zhang
May 2014
Chair: Jose A. B. Fortes
Major: Electrical and Computer Engineering

Large-scale data sets are generated every day by the sensors used to monitor semiconductor manufacturing processes. To effectively process these large data sets, a scalable, efficient and fault-tolerant system must be designed. This thesis investigates the suitability of Hadoop as the core processing engine of such a system. Hadoop has been used in many fields for large-scale data processing, including web indexing, bioinformatics and data mining, where it has proved to be an effective data processing platform because it provides parallelization and distribution for applications built on it. It also offers scalability and fault tolerance using only commodity machines.

In this thesis, we propose a system for statistics-based analysis of manufacturing monitoring data using Hadoop. We first present a data-oriented approach through an in-depth analysis of the data to be processed. Then, different Hadoop configuration parameters are tuned in order to improve the performance of the system. In addition, the statistics-based computation algorithm is modified so that it can reuse previous results. By establishing a mapping from the data to the Hadoop MapReduce programming paradigm, the implemented system outperforms the legacy analysis system by up to 82.7%. Experiments have been done on different mapping strategies to study the effect of configuration parameters on performance. The algorithmic optimization further improves computation time by 50%. However, the non-computation time, i.e. I/O operation time, dominates execution time in our experiments.

CHAPTER 1
INTRODUCTION

Massive amounts of data are collected daily in manufacturing plants where sensors are used to track many physical parameters that characterize the operation of a large number of equipment units. The data need to be processed, analyzed, stored and organized in a way that makes it easy for engineers to use them. Processing large-scale data sets usually takes a long time. Technologies such as parallelization and distribution are used for solving the "big-data analytics" problem. One such technology, Hadoop [2], is designed for large-scale data processing on clusters of computers. In this research, we present a statistics-based analysis system for semiconductor manufacturing monitoring data built on Hadoop. The choice of Hadoop is due to its scalability, efficiency, and transparent support of distributed, parallel and fault-tolerant computational tasks.

A semiconductor manufacturing plant consists of many Equipment Units (EUs) which can be used to produce a variety of semiconductor products. Different products are produced in different production lines, each production line consisting of multiple steps, each using one or more EUs. Sensors embedded in each EU are used to monitor multiple environment parameters of relevance to the quality of the final product. Monitoring every production line generates a large set of data each day. As the scale of semiconductor manufacturing plants grows, so does the scale of the monitoring data. As a consequence, a scalable system for analysis of monitoring data is essential to enable the optimization of semiconductor manufacturing processes.

As a representative example, this thesis considers a statistics-based analysis system that categorizes data collected from monitoring sensors according to different production lines and different manufacturing steps. In each category, collected data are further divided into different data sets based on EUs. The Student's T-test [3] is used to determine if two data sets are significantly different from each other. The T-test, or one of the other related statistical hypothesis tests, has been widely used for statistical process control in manufacturing procedures [4, 5].

Results of the analysis offer field engineers information about malfunctioning EUs such that repairing those EUs takes place in a timely manner.

A common Semiconductor Manufacturing Monitoring Data Analysis System (SMMDAS) is presented in Figure 1-1 [6]. Sensors are embedded in each EU for monitoring environment parameters. Data collected from sensors are preprocessed and stored in a database. The legacy statistics-based analysis system consists of multiple computers, and each computer contains multiple cores. Every night, the system retrieves the data to be analyzed from the database and assigns jobs to multiple computers. Each computer may receive a single job, or multiple jobs. Each core of a computer is responsible for processing an independent job at a time. No correlated processing is required between multiple cores, or multiple jobs. After processing jobs, computers send results to a different database.

Several problems exist in the current parallel system. First, there is no management block responsible for workload balancing, which results in resource underutilization. The parallel system consists of multiple homogeneous computers. However, different computers receive different workloads. As an example shown in Figure 1-1, the job list for Fab-line 2 is empty, making the Data-Analytics-Server 2 idle while other servers are busy with a long list of jobs from other Fab-lines. Second, the analysis results are required by field engineers every morning. However, the current system cannot always meet the deadline, resulting in delays in repairing problematic EUs.

In addition, other considerations are also required for an efficient and effective analysis system. As the scale of the manufacturing and monitoring procedures grows, the size of the data to be analyzed increases as well. Due to the problems of the current system and the growing scale of data, scalability is a requirement for a new system design. In addition, the scale of both the data and the system is increasing such that failures appear more and more frequently. It is necessary to design a system with fault tolerance so that failures don't influence the correctness of results and the performance of the system.

Figure 1-1. Legacy system architecture for the semiconductor manufacturing monitoring analysis system. The figure shows the architecture of a legacy analysis system. Data collected from monitoring sensors are stored in a Microsoft database. The analysis system retrieves data from the database and analyzes the data using multiple machines. Results of analysis are sent to an Oracle database for field engineers.

In this chapter, an introduction of the MapReduce programming model and its open-source implementation Hadoop is presented along with their corresponding advantages. Second, aiming to address the limitations exposed by the legacy SMMDAS, a system built on Hadoop is introduced to achieve scalability, performance and other desirable features. Third, optimization strategies, including tuning Hadoop configuration parameters and re-using previous computation results, are discussed.

1.1 MapReduce Programming Model and Hadoop Workflow

A data processing system can be viewed as having three tiers: application, platform and storage. The storage tier handles data storage, replication, access, etc. The application tier supports multiple jobs, users and applications. The platform tier enables the use of multiple distributed resources and provides a programming and unified environment that hides from programmers how programs and data are parallelized, allocated and coordinated. Virtualization technologies enable the use of multiple homogeneous, or even heterogeneous machines, under a unified computing environment. The platform and storage tiers are tightly coupled, hiding the complexity of the underlying management and distribution. The Google MapReduce platform [7], with its corresponding storage system Google File System (GFS) [8], are representative examples of the platform and storage tiers.

MapReduce is a programming model inspired by functional programming. Google adopted the programming model and its associated implementation, inheriting the name of MapReduce, for large-scale web index service [7]. With the support of its underlying storage system GFS, MapReduce is used for massive-scale data processing in a distributed and parallel manner. Hadoop [2] is an open-source platform consisting of two services: the MapReduce service and the Hadoop Distributed File System (HDFS) [9]. Hadoop was originally developed by Yahoo! and once it became open-source, it has been applied, analyzed and improved by many engineers and researchers. Many applications have been developed following the MapReduce programming model [1, 10] on Hadoop, in fields such as bio-medical informatics [11, 12], data mining [13, 14] and manufacturing data management [15].

Hadoop has also been widely used in scientific analytics for large-scale data processing [11, 12, 16]. Taylor [17] summarized current bioinformatics applications that use Hadoop. Papadimitriou and Sun [13] developed a distributed co-clustering framework based on Hadoop, and showed the scalability and efficiency provided by its underlying Hadoop platform. Cardona et al. [18] applied the MapReduce programming model to a Grid-based data mining system, in which two improved schedulers are proposed and simulated.

In this section, we present background on the MapReduce programming model and the workflow of a Hadoop job in order to explain how Hadoop achieves parallelization and distribution.

1.1.1 MapReduce Programming Model

The MapReduce programming model consists of two functions, map and reduce. The input to a MapReduce program is a set of <key, value> pairs, and the output of the program usually is another set of <key, value> pairs. Figure 1-2 shows a diagram of the MapReduce programming model. The map phase consists of multiple Mappers; the same map function is carried out on each Mapper. Mappers are executed independently, meaning that there are no interactions among them. The map function takes a <key, value> pair as an input, and generates a set of intermediate <key, value> pairs as the output. There is no relationship between the input and output types. As shown in Figure 1-2, the input to the map phase is abstracted as <key1, value1>, and the output of the map phase is abstracted as <key2, value2> (subscripts are used to identify pairs where keys and/or values are identical).

The MapReduce programming model usually has a shuffling phase, in which the intermediate pairs with the same key are sent to the same reduce function. Similar to the map phase, the reduce phase consists of multiple Reducers, and the same reduce function is carried out on each Reducer. The reduce function takes a list of values associated with the same key as the input, generates a set of <key, value> pairs as the output, and writes them into the file system. Similar to the map phase, the type of the inputs can be different from the type of the outputs.

Since both map and reduce functions are executed independently, an application following the MapReduce programming paradigm can be naturally distributed and parallelized across a cluster of computers.

Figure 1-2. The MapReduce programming model

Hadoop is a software system that provides a programming platform and run-time environment for MapReduce jobs. HDFS is the storage system associated with Hadoop; it provides distributed file system capabilities for large-scale data sets. The MapReduce and HDFS services are tightly coupled such that users only need to program map/reduce functions as discussed above and Hadoop takes care of the parallelization and distribution of computations and input data. A typical Hadoop MapReduce job workflow is discussed in the next section.

1.1.2 Hadoop Platform

In a typical Hadoop cluster, the MapReduce service relies on one master node and multiple slave nodes. The master node is responsible for scheduling, assigning map/reduce tasks to slave nodes, management of the cluster, etc. Slave nodes are machines that execute map/reduce tasks. Correspondingly, HDFS relies on one namenode and multiple datanodes. The namenode stores metadata, and application data are distributed among the datanodes [9]. Both the namenode of HDFS and the master node of the MapReduce service use dedicated machines.

A Hadoop MapReduce job workflow is described below (HDFS is considered as a transparent storage system that provides metadata information to the MapReduce service). A JobTracker is a software component running on the master node and responsible for coordinating a job run. Correspondingly, a TaskTracker is a software component running on each slave node and responsible for running tasks of the job. The client program is written in Java in this example (otherwise, the Hadoop streaming utility is used for programs written in other programming languages) [1].

The client, or user program, submits a MapReduce job to a Hadoop cluster. The JobTracker receives the job information, assigns a unique job id to the submitted job, retrieves input file information from the underlying distributed storage system (for example, HDFS), and assigns map/reduce tasks to slave nodes according to the current status of the slave nodes. Each TaskTracker communicates with the JobTracker and executes the map/reduce tasks assigned to it. Each TaskTracker sends a "heartbeat" periodically to the JobTracker, as an indication of the health of the slave node. The heartbeat channel is also used for sending messages to the JobTracker. When a map/reduce slot is available in a slave node, the JobTracker assigns a map/reduce task to the slave node through the TaskTracker. The TaskTracker retrieves the input from the distributed file system, and launches a child Java program to execute the task. After the task has finished, the TaskTracker collects the task information and sends it back to the JobTracker. The JobTracker notifies the client when the job has finished.

Figure 1-3 shows the workflow of a MapReduce job in Hadoop. The numbers 1 to 6 indicate the steps of a job execution. First, the client program submits the job to a Hadoop cluster. Second, the JobTracker retrieves the file split information, e.g. the location of files and the number of splits, from the distributed storage. For the third step, the heartbeat information is sent by the TaskTracker to the JobTracker to notify the JobTracker that there are available slots in a slave node. Fourth, the JobTracker assigns a task to the TaskTracker based on the node information that the TaskTracker sent.

Usually, the JobTracker assigns a map task to the slave node taking into account the location of the input files. Fifth, the TaskTracker retrieves input information from the distributed storage. If it is assigned a map task, the TaskTracker retrieves the input file split information. If it is assigned a reduce task, the TaskTracker retrieves the intermediate data generated by the map phase according to information provided by the JobTracker. Finally, in step six, the TaskTracker launches a new Java program to execute the assigned task. After the task has finished, the information is sent to the JobTracker through the heartbeat message channel.

An important property of Hadoop is fault tolerance [1]. There are three kinds of failures in the workflow shown in Figure 1-3. One kind of failure is a task failure, i.e. the task executed in a child JVM of a slave node fails. If the TaskTracker detects the failure, it will free the slot and execute another task. The failure information is reported to the JobTracker through the heartbeat message channel. When the JobTracker receives the failure information, it will try to reschedule the task on a different node. The second kind of failure is a TaskTracker failure, i.e. when the JobTracker detects the failure of a TaskTracker (by not receiving heartbeat information or receiving the information too slowly). The JobTracker stops scheduling tasks to that node and reschedules tasks on that node to other healthy nodes. The third kind of failure is a JobTracker failure. There is no mechanism to tolerate the JobTracker failure. YARN (or Hadoop 2.0) [19] introduces a new datacenter management structure that provides a mechanism to tolerate such a failure in a datacenter.

As discussed above, the Hadoop MapReduce service and its associated storage system HDFS enable parallelization and distribution among available resources. Multiple advantages are provided by the software platform. First, as the scale of the input increases, the Hadoop platform can easily scale up such that additional resources can be added for larger-scale data processing.

Figure 1-3. Hadoop MapReduce workflow. 1. Run job. 2. Retrieve input information from distributed storage. 3. Heartbeat and execution results. 4. Assign job to slave node. 5. Retrieve input information for assigned jobs. 6. Launch a map/reduce job.

As the resources in the cluster become underutilized, the platform can also easily scale down. Second, with regard to fault tolerance, HDFS enables data durability through replication on different nodes, and the MapReduce service tolerates failures through speculative execution. Third, without needing to consider the complexity of the underlying distributed system, Hadoop provides to users a simple programming interface. Fourth, different configuration parameters can be tuned for different application requirements.

Fifth, users inherit the benefits of using open-source software, such as low cost and flexibility. In summary, the Hadoop platform provides properties including scalability, fault tolerance, an easy-to-program interface, tunable configurations, benefits from open-source software, etc. It has been applied to run parallel algorithms in many fields for massive data processing [1, 10]. We next introduce the proposed SMMDAS and discuss how the system benefits from the Hadoop platform.

1.2 Motivation and Problem Statement

Motivated by the previously discussed problems and requirements, we propose to build the analysis system on Hadoop. As discussed in Section 1.1, the Hadoop MapReduce service and HDFS provide management of the underlying resources and hide the complexity of the distributed system from applications built on it. The properties of the Hadoop platform can be inherited by the analysis system so that it is capable of processing large-scale data sets. With respect to scalability, the system can easily scale up by adding more resources when the size of the input data increases or the performance of applications decreases. With regard to fault tolerance, through MapReduce speculative execution and HDFS replication, the system can obtain correct results even when there are failures in the cluster. The system will also rely on the open-source nature of Hadoop, thus benefiting from low cost in both acquisition and upgrades. Thus, the system is proposed to be built using the Hadoop platform.

The goal of this research is to design, implement, optimize and verify an effective and efficient distributed system for analysis of monitoring data. To this end, the system architecture will replace the current analysis system by a Hadoop system, as shown in Figure 1-4. Data collected from monitoring sensors are still stored in a database. The analysis system retrieves the data from the database to HDFS, which is responsible for distributing the data among multiple computers. The statistics-based analysis system is built on the Hadoop MapReduce service. After the analysis, the data are sent back to a database.

We propose a new system architecture without involving a database in Chapter 5.

Figure 1-4. An overview of the system architecture for the analysis system on Hadoop

For purposes of building the statistics-based analysis system, the following problems must be considered.

1.2.1 Data Mapping and Computation Distribution

The purpose of the SMMDAS is to analyze the "data". Thus, the input data are required to be studied in depth so that they can be appropriately distributed among multiple machines. A thoughtless data mapping strategy can lead to an unbalanced workload in a Hadoop cluster, which results in performance degradation and under-utilization of resources. To this end, the question of how to map the data into HDFS must be carefully answered.

Correspondingly, the computation must be mapped to the Hadoop MapReduce service as well. The computation can be finished in either a one-step MapReduce workflow or a multi-step MapReduce workflow (or iterative MapReduce [20]). Different mapping strategies have different advantages and disadvantages. For example, iterative applications contain multiple steps of computation and each step is parallelized such that the data processing time required by each step is reduced.

Iterative applications on Hadoop, however, generate new map/reduce tasks in each iteration, resulting in overhead for creating new tasks for each iteration. Ekanayake et al. [20] improve the solution by introducing Twister, an extension of the Hadoop platform, for iterative scientific applications. Thus, it is important to design an appropriate computation mapping strategy for the SMMDAS. Other extensions of the Hadoop software platform may also be adopted for performance purposes.

1.2.2 System Optimization

After the system is deployed, optimization of the system will benefit both the performance of the statistics-based analysis and the utilization of the whole computation cluster. Both external and internal improvements are required. From the external perspective, optimization can be achieved by tuning different configuration parameters. Applications are significantly influenced by different configuration settings [21]. Changing parameters to appropriate values is shown to significantly improve the performance of applications. From the internal perspective, optimization can be achieved by changing the algorithm of the analysis. Through algorithmic optimization, previous computation results can be used for future data analysis. According to the problems discussed in this section, we present this research in a high-level view in the following section.

1.3 System Architecture

Figure 1-5 shows a high-level view of the statistics-based analysis system using Hadoop (bottom) and strategies to statically optimize the system (top). The bottom of Figure 1-5 solves the problem of data mapping and computation distribution. Since data are a critical design consideration, we propose a data-oriented mapping from the database to the Hadoop MapReduce paradigm. This mapping consists of two steps, mapping the database to files and mapping files to Hadoop. A one-step MapReduce workflow is proposed for mapping the computation to the MapReduce service.

Based on different mapping strategies, we deploy the statistics-based analysis system in three different ways: Map-Heavy, MapReduce-Balanced and Reduce-Heavy.

The top of Figure 1-5 solves the problem of system optimization. MAPE-K (Monitoring, Analysis, Planning, Execution and Knowledge) can be used as a self-management architecture in autonomic computing [22]. However, in this research, we do not consider dynamic optimization but follow the basic idea of the MAPE-K approach to statically optimize the analysis system. As shown in Figure 1-5, the Monitoring block collects metrics of the system, such as the CPU/memory utilization, input data and the execution time; the Analysis block provides adjustment suggestions by analyzing different performance metrics with regard to the knowledge of the system execution history or new policies; the Planning block produces a series of changes based on the suggestions from the Analysis process; the Execution block carries out adjustments to the system.

Two methodologies are used in this research for system optimization. From an external point of view, different Hadoop configuration parameters are tuned to decrease the execution time. The Monitoring block collects the current execution time of the analysis. The Analysis block compares the current execution time with historic execution times and some well-known tuning directions. It suggests the tuning directions of the configuration parameters. The Planning block generates target values of the configuration parameters to be tuned. The Execution block carries out the tuning actions. From an internal point of view, the algorithm described in Chapter 2 is optimized to decrease the execution time. A dynamic procedure that follows MAPE-K would be similar to tuning the configuration parameters as discussed above.

Work has been done on applying Hadoop in manufacturing procedures. Bao et al. [15] applied Hadoop to sensor data management in manufacturing. They focused on the data management perspective and proposed to replace the RDBMS with HBase [23]. Most computations used in their system are simple queries or statistics operations. Our work focuses on a complex statistics-based computation, which cannot be finished within a database.

Figure 1-5. A high-level view of the statistics-based monitoring system and the static optimization strategy

The use of HBase or Hive may be an interesting design for the proposed architecture without involving the database in Chapter 5. To serve the increasing scale of the data, Goss and Veeramuthu [24] surveyed current technologies towards the "Big Data" problem for a new semiconductor manufacturing company, including technologies such as Appliance RDBMS, Hadoop and in-memory DB, as well as new architectures, for example, the General Engineering and Manufacturing Data Warehouse (GEM-D) solution. They compared advantages versus disadvantages of possible technologies without implementing them. We have taken a further step in this research by designing and implementing the system using Hadoop.

Experimental evaluation results, as well as two optimization methodologies, are also presented in this thesis.

Most Hadoop applications are I/O bounded [1]. Hadoop tries to achieve "data locality", i.e. to assign tasks to where the data are located. Multiple studies have researched this data-locality property and suggested methods to manage task assignments. Liu et al. [25] studied the performance of HDFS, especially for processing small files. For improving the I/O performance of small files, they proposed a grouping methodology to combine small files. Our experimental results verify the observation and take advantage of combining small files for performance improvement. Xie et al. [26] discussed the effect of data placement in a heterogeneous Hadoop cluster. An imbalanced data placement in a heterogeneous environment causes severe performance degradation in data-intensive Hadoop applications. By implementing a data placement mechanism in HDFS, input data can be distributed based on the computing capacity of each node. We use the balancer provided by Hadoop. Since our cluster is built on a homogeneous environment as a private cloud, the balancer works well in the current setup. Also, we found out that increasing the number of replications for each file in HDFS results in increasing the possibility of execution on local data.

For purposes of saving energy and I/O efficiency, the intermediate data generated by the map phase can be compressed. Chen et al. [27] proposed an algorithm to measure data characteristics so that users of Hadoop can make the decision of whether to compress the intermediate data or not. It is better to compress the map output in our system. Data locality is used in Hadoop to avoid slow tasks by reducing network transmission [28]. However, scheduling tasks with regard to the data location cannot necessarily improve the performance of a Hadoop job. If map tasks are scheduled on slow nodes on the basis of data locality, the performance of a job is possibly degraded. Again, the current system is built on a homogeneous environment. Such works may be useful when using a heterogeneous environment.

1.4 Structure of Thesis

This thesis is structured as follows: Chapter 2 describes the input data and a data-oriented design by introducing different data mapping strategies and different implementations. Chapter 3 proposes different strategies for system optimization, including tuning Hadoop configuration parameters and algorithmic optimization. Chapter 4 discusses experimental results. Chapter 5 concludes the thesis and recommends future research directions.

CHAPTER 2
DESIGN AND IMPLEMENTATION

The system studied in this thesis for statistical analysis of manufacturing sensor measurements is implemented by using Hadoop to do all the data processing. The design of this system requires a good understanding of what kind of data is to be analyzed, how they are to be analyzed and where they should be located. This chapter answers these issues and describes three implementations considering different data mapping scenarios. A data-oriented design approach is used to map data from the database to the Hadoop storage. This chapter characterizes the data that are to be analyzed, the algorithm used to analyze them, the mapping strategies from the database to HDFS, as well as three implementations of the mapping of computations to the Hadoop map/reduce phases.

2.1 Data Description

The input data can be considered as a large table stored in a database, where each row is a measurement record that includes when and where the measurement took place as well as the measured "item", product and equipment identifications. Table 2-1 is a vertical view of an example of a record from a sample of a typical data set. Useful information for the statistical analysis includes ITEM (environment parameter, such as temperature, pressure and gas flow rate), RECIPE (product and manufacturing step information), EQP_ID (EU information), TIME (start and end time of the measurement), SPEC (upper and lower specifications of the measured record) and AVG (the average of the measured value from start to end time, i.e. the information to be analyzed).

The monitoring system collects data during the manufacturing process. Every night, the analysis system retrieves data from 14 days (the current day and the previous 13 days), based on the timestamp associated with each record, as shown in Figure 2-1. Assume the current date is 03/27/14; the statistics-based analysis is performed on the data from 03/14/14 to 03/27/14. On the second day, 03/28/14, the statistics-based analysis is performed on the data from 03/15/14 to 03/28/14. We discuss possibilities to use timestamp information and re-use of results for optimizing the analysis procedure in Chapter 3.

Table 2-1. A vertical view of a sample record

  Column Name    Content          Detail Information
  LOT_ID         LOT0759.1        Record identifier
  WAFER_ID       01               Wafer ID
  PRODUCT_ID     A5               Product ID
  STEP_SEQ       A5_001           Sequence step
  EQP_MODEL      DIFFUSION
  EQP_ID         DIFF01           Equipment unit ID
  CHAMBER_ID     A
  RECIPE         A5_001_R01       Information of wafer, product and an ID
  START_TIME     22:26.5
  END_TIME       47:26.5
  ITEM           Gas_Flow_Rate    The environment variable to be measured
  MIN            100.4286
  MAX            105.7051
  AVG            106.7414         The average value to be analyzed
  STDEV          0.56929
  UPPER_SPEC     109              The upper specification
  LOWER_SPEC     101              The lower specification
  TARGET         105

Figure 2-1. Data sets collected during several consecutive days are used for nightly statistical analysis.

For purposes of the statistical analysis, the input data table can be partitioned into independent subtables, one for each recipe-and-item combination. These subtables are independent in the sense that the statistical analysis is done individually and independently for each subtable. Each subtable is referred to as an independent Computation Unit (iCU).

Each subtable is further partitioned into dependent subtables, one for each EU. Since the subtables belonging to the same iCU have to be grouped together to finish the computation, each dependent subtable is referred to as a dependent Computation Unit (dCU). Let Kr denote a combination of recipe and item, Ke an equipment id, and Kt a combination of recipe, item and equipment id. Thus, the whole statistical analysis is independently performed on each iCU and it consists of partial related analysis done on each of the dCUs that make up an entire iCU.

2.2 Algorithm Analysis

The statistical analysis process is referred to as the "equivalence test". The purpose of this analysis is to characterize the difference between two sets of data, a reference data set and a test data set, using Student's T-test [3]. Algorithm 1 describes the analysis procedure. The input of the procedure is a linked list (inputList). Since the analysis is performed independently on each iCU, the data in the same inputList share the same Kr. The i-th element in the list is a 1D double-precision array, Ai, representing a dCU. Each value in Ai corresponds to an AVG value extracted from the records that share the same Kt.

For purposes of mapping the analysis algorithm to the MapReduce programming model later, the algorithm can be considered as consisting of two stages. The first stage of the procedure sorts an array Ai and then computes several percentile [29] values from Ai. The lower outlier limit (LOL) and upper outlier limit (UOL) are computed using the percentile values computed in the previous step. Outliers [30], either extremely small or extremely large values, are eliminated from Ai. Then, the procedure uses the average μi and the standard deviation σi of Ai to compute the process capability index [31].

Algorithm 1. Procedure for equivalence test
Input: LinkedList inputList

  for each DoubleArray Ai in inputList do                          (Stage 1)
      Sort Ai
      Compute the percentile values of Ai
      Compute the lower and upper outlier limits LOL, UOL using the percentile values
      Eliminate all Ai[k] that are not in the range [LOL, UOL]
      Compute the average μi and standard deviation σi of the entries of Ai
      Compute the process capability index Cpk[i]
  end for

  Pick the Ai with the largest Cpk as the reference set R          (Stage 2)
  Compute the Euclidean distances $W_R[k] = \sqrt{(R[k] - \mu_R)^2}$
  Compute the Euclidean distances $B_R[k] = \sqrt{(R[k] - \mu)^2}$
  Compute $\mu_{W_R}, \sigma_{W_R}$ and $\mu_{B_R}, \sigma_{B_R}$
  for each DoubleArray Ai in inputList do
      Compute the Euclidean distances $W_i[k] = \sqrt{(A_i[k] - \mu_i)^2}$
      Compute the Euclidean distances $B_i[k] = \sqrt{(A_i[k] - \mu_R)^2}$
      Compute $\mu_{W_i}, \sigma_{W_i}$ and $\mu_{B_i}, \sigma_{B_i}$
      Do a pairwise Student's t-test between WR (or BR) and Wi (or Bi)
      Final decision based on the result of the t-test
  end for

This procedure is done for every Ai in the input list independently. The second stage of the procedure picks the Ai whose Cpk is the largest from the inputList as the reference data set, denoted as R. Other data sets are referred to as testing data sets. Each testing data set in the inputList is preprocessed by computing the Euclidean distance between the reference data set and itself. Then, a pair-wise Student's T-test is carried out between the reference data set and each of the testing data sets. Final decisions (on whether the testing data set is equivalent to the reference data set) are determined by the results of the T-test [4]. In summary, the first-stage computation is performed on each dCU independently, while the second-stage computation is performed on each iCU independently.

The following sections describe the functions in Algorithm 1 in detail, providing information useful for the algorithmic optimization of the statistical analysis in Chapter 3. (The description doesn't reflect the specific implementation used in our experimental system. Instead, it presents general approaches described in common engineering statistics books and materials, of which the implementation is a representative example.)

2.2.1 Percentile and Outlier

In statistics, an outlier is referred to as a data element that is extremely distinct from the other data in the sample [30]. Outliers usually indicate problems affecting data reliability, for example, an error during measurement. To identify an outlier requires an upper limit and a lower limit, i.e. boundaries that determine excluded data. To compute the outlier limits, a percentile function is computed on the list of sample data. A percentile is a value below which a proportion of the values in a group falls [29]. For example, the p-th percentile is a value, Vp. There are at most p% of the data less than Vp, and at most (100 - p)% of the data greater than Vp. The 50th percentile is the median value of a data set. The percentile value can be estimated using the function Percentile as follows [32]. Note that L is a sorted 1D list (or an array).

  function Percentile(data L, double p)
      L: sorted 1D list (or an array)
      N: the size of the list L
      p: the percentile
      r = (p / 100) * (N + 1)
      k <- integer part of r
      d <- fractional part of r
      V = L_k + d * (L_{k+1} - L_k)
  end function
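A direct Java transcription of this estimator might look like the following sketch. It assumes the input array is already sorted in ascending order, uses the same rank convention r = (p / 100)(N + 1), and adds simple boundary handling for ranks that fall outside the array; it is illustrative code, not the thesis's actual implementation.

    import java.util.Arrays;

    // Sketch of the percentile estimator described above (illustrative only).
    public class PercentileSketch {

        /** Returns the p-th percentile of a sorted array (1-based rank convention). */
        public static double percentile(double[] sorted, double p) {
            int n = sorted.length;
            double r = (p / 100.0) * (n + 1);
            int k = (int) Math.floor(r);      // integer part of the rank
            double d = r - k;                 // fractional part of the rank
            if (k <= 0) {
                return sorted[0];             // rank below the first element
            }
            if (k >= n) {
                return sorted[n - 1];         // rank beyond the last element
            }
            // Interpolate between the k-th and (k+1)-th entries (1-based indexing).
            return sorted[k - 1] + d * (sorted[k] - sorted[k - 1]);
        }

        public static void main(String[] args) {
            double[] data = {97.5395, 97.5483, 97.5485, 97.6012, 97.6130};
            Arrays.sort(data);
            System.out.println("25th percentile: " + percentile(data, 25));
            System.out.println("75th percentile: " + percentile(data, 75));
        }
    }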

The Percentile function first computes the rank, r, of the input percentile p. Then, it uses the entries with ranks (i.e. indexes) k and k + 1, where k is the integer part of the computed rank r, to compute the percentile value, V. The result, V, is computed by adding to the k-th entry the product of the fractional part of the computed rank and the difference between the (k+1)-th and k-th entries.

The 25th percentile is referred to as the first quartile, denoted as Q1, and the 75th percentile is referred to as the third quartile, denoted as Q3. The interquartile range, IQR, is the difference between Q3 and Q1,

  $IQR = Q_3 - Q_1$.    (2-1)

The outlier limits are estimated as

  $UOL$ (Upper Outlier Limit) $= Q_3 + c \cdot IQR$,    (2-2)
  $LOL$ (Lower Outlier Limit) $= Q_1 - c \cdot IQR$,    (2-3)

where c is a constant. A typical value of c is 1.5. Any data outside this range are considered as outliers. Outliers are eliminated in the first stage of Algorithm 1.
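Combined with the percentile sketch above, the outlier limits of Equations (2-1) to (2-3) could be computed with a small helper like the following; the constant c = 1.5 follows the typical value mentioned above, and the code is illustrative rather than the thesis's implementation.

    // Sketch of the outlier-limit computation from Equations (2-1) to (2-3),
    // reusing the percentile sketch shown earlier. Illustrative only.
    public class OutlierLimitsSketch {

        public static double[] outlierLimits(double[] sorted, double c) {
            double q1 = PercentileSketch.percentile(sorted, 25);   // first quartile
            double q3 = PercentileSketch.percentile(sorted, 75);   // third quartile
            double iqr = q3 - q1;                                  // interquartile range
            double lol = q1 - c * iqr;                             // lower outlier limit
            double uol = q3 + c * iqr;                             // upper outlier limit
            return new double[] {lol, uol};
        }

        /** Keeps only the values that fall inside [LOL, UOL]. */
        public static double[] removeOutliers(double[] sorted, double c) {
            double[] limits = outlierLimits(sorted, c);
            return java.util.Arrays.stream(sorted)
                    .filter(v -> v >= limits[0] && v <= limits[1])
                    .toArray();
        }
    }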

2.2.2 Process Capability

"A process is a combination of tools, materials, methods and people engaged in producing a measurable output; for example a manufacturing line for machine parts." [31] Process capability is a measurable property of a process, which serves for measuring the variability of the output and comparing the variability with certain specifications. "The process capability index is a statistical measurement of the process capability: the ability of a process to produce output within specification limits." [31] The process performance index is one of the commonly accepted process capability indices, which is defined and computed as Cpk in Algorithm 1 and Equation (2-4) below. Given a data set (the output from the system that monitors the process) and its corresponding upper specification (USL) and lower specification (LSL), the process performance index is defined and computed as

  $C_{pk} = \min\left[\frac{USL - \mu}{3\sigma}, \frac{\mu - LSL}{3\sigma}\right]$,    (2-4)

where μ is the average of the data set and σ is the standard deviation of the data set. Another metric that is used to estimate the process capability is defined as

  $\hat{C} = \frac{USL - LSL}{6\sigma}$.    (2-5)

This metric is not used in our prototypes, but they are easily modifiable to use such a metric if necessary.
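A minimal Java sketch of the process performance index of Equation (2-4) is shown below. It assumes the caller supplies the record's UPPER_SPEC and LOWER_SPEC values and uses the sample standard deviation; both choices are assumptions made for illustration.

    // Sketch of the process performance index Cpk from Equation (2-4).
    // usl and lsl correspond to the record's UPPER_SPEC and LOWER_SPEC values.
    public class ProcessCapabilitySketch {

        public static double cpk(double[] values, double usl, double lsl) {
            double mean = 0.0;
            for (double v : values) {
                mean += v;
            }
            mean /= values.length;

            double sumSq = 0.0;
            for (double v : values) {
                sumSq += (v - mean) * (v - mean);
            }
            // Sample standard deviation (an assumption; a population estimate would divide by n).
            double stdev = Math.sqrt(sumSq / (values.length - 1));

            // Cpk = min[(USL - mu) / (3 sigma), (mu - LSL) / (3 sigma)]
            return Math.min((usl - mean) / (3 * stdev), (mean - lsl) / (3 * stdev));
        }
    }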

2.2.3 Euclidean Distance

After the elimination of outliers from the sample data, data sets that come from different EUs but share the same Kr are collected together. The data set with the maximum process performance index is picked as the reference data set, R. Each of the data sets of the other equipment units, Ai, is paired with the reference data set. For each element in each data set, the Within-Distance, Wi[k], is defined as the Euclidean distance between a single record of the data set and the average of the data set. For each element in the reference data set, the Between-Distance, BR[k], is defined as the Euclidean distance between a single record of the reference data set and the average of all records in all data sets except the reference one. For each element in each data set except the reference data set, the Between-Distance, Bi[k], is defined as the Euclidean distance between a single record of the data set and the average of all records in the reference data set. The equations to compute the Euclidean distances are listed below. For the reference data set,

  $W_R[k] = \sqrt{(R[k] - \mu_R)^2}$,    (2-6)
  $B_R[k] = \sqrt{(R[k] - \mu)^2}$,    (2-7)

where R[k] refers to the k-th element in the reference data set, μR refers to the average of the reference data set, and μ refers to the average of all records except those from the reference data set. For the other testing data sets,

  $W_i[k] = \sqrt{(A_i[k] - \mu_i)^2}$,    (2-8)
  $B_i[k] = \sqrt{(A_i[k] - \mu_R)^2}$,    (2-9)

where Ai[k] refers to the k-th element in the i-th data set and μi refers to the average of the i-th data set.

2.2.4 Equivalence Test

The equivalence test is performed on the previously computed Euclidean distances of each of the data sets and the reference data set. The core of the equivalence test is Student's T-test [4], which is used to determine whether two sets of data are significantly different from each other. A numerical algorithm to compute the p-value [4], a measurable result of the T-test, can be found in [33].
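As an illustration of Equations (2-6) to (2-9) and the pairwise test that follows them, the sketch below computes the within-distance arrays for the reference and one testing data set and then runs a two-sample t-test on them using Apache Commons Math. The use of Commons Math is an assumption made for illustration; the thesis's own t-test implementation follows [33] and may differ. The between-distances BR and Bi would be computed the same way, using the other centers defined above.

    import org.apache.commons.math3.stat.inference.TTest;

    // Sketch of the within-distance computation (Equations 2-6 and 2-8)
    // followed by a pairwise Student's t-test. Illustrative only.
    public class EquivalenceTestSketch {

        static double mean(double[] v) {
            double s = 0.0;
            for (double x : v) s += x;
            return s / v.length;
        }

        /** Distance of every element of data from the given center: sqrt((data[k] - center)^2). */
        static double[] distances(double[] data, double center) {
            double[] d = new double[data.length];
            for (int k = 0; k < data.length; k++) {
                d[k] = Math.sqrt((data[k] - center) * (data[k] - center));
            }
            return d;
        }

        /** Returns the p-value of the t-test between W_R and W_i (within-distances). */
        public static double withinPValue(double[] reference, double[] testing) {
            double[] wR = distances(reference, mean(reference));  // Equation (2-6)
            double[] wI = distances(testing, mean(testing));      // Equation (2-8)
            return new TTest().tTest(wR, wI);
        }
    }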

2.3 Data Mapping

To minimize the overhead of data communication among nodes during the map phase, Hadoop is designed so that it assigns map tasks to the nodes where the input data are located. Thus, appropriately mapping data from the database to Hadoop has a significant influence on the system performance. Hadoop applications require the support of the underlying HDFS. Data are stored as files in HDFS. Thus, mapping data to Hadoop consists of two stages: the database to files and files to Mappers.

2.3.1 Data Mapping: Database to Files

According to the data and the algorithm description, the whole data set can be partitioned into iCUs or dCUs. Based on how the data are partitioned (i.e. assigned to HDFS blocks), we propose three strategies: arbitrary mapping, per-iCU mapping and per-dCU mapping.

Arbitrary mapping maps the data to files arbitrarily (i.e. in the order in which they are read from the database), as shown in Figure 2-2. Therefore, records belonging to different iCUs and dCUs may be stored in the same file. The advantage of this strategy is that data can be distributed evenly across the cluster.

Figure 2-2. Database-to-files mapping strategies - arbitrary mapping

Figure 2-3. Database-to-files mapping strategies - per-iCU mapping

Figure 2-4. Database-to-files mapping strategies - per-dCU mapping

Per-iCU mapping maps the data in the same iCU to the same file, as shown in Figure 2-3. Records that share the same Kr are mapped to the same file. Since all records in the same iCU are stored in a file, a single file can be analyzed independently without interference with other files.

Per-dCU mapping maps the data in the same dCU to the same file, as shown in Figure 2-4. Records that share the same Kt (Kr and Ke) are mapped to the same file. All records in the same dCU are stored in a file, and part of the analysis (Stage 1 of Algorithm 1) can be performed independently on a file without interfering with other files.

2.3.2 Data Mapping: Files to Mappers

The data to be analyzed are stored as files in HDFS. As introduced in Chapter 1, the Hadoop MapReduce service provides parallelization and distribution to applications running on it. Files are distributed among the nodes in a cluster. The Hadoop platform provides users the flexibility to split files, so that users can decide whether to split a file and how to split a file. Based on how many files a Mapper processes, we also propose three strategies to map files to Mappers: one-to-one mapping, many-to-one mapping and one-to-many mapping.

One-to-one mapping maps one file to one Mapper, as shown in Figure 2-5-a. Each Mapper handles a file independently. This can be controlled by overriding the isSplitable() method to let it return false, as sketched below. In a situation where a file contains indivisible data, in other words, the data have to be processed by a single Mapper independently, this mapping strategy works properly. On the other hand, the disadvantage of the strategy is that when a file is stored on different nodes, processing the file as a whole brings data transmission overhead in the map phase execution.
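A minimal way to enforce the one-to-one mapping is an input format that refuses to split its files; the sketch below extends Hadoop's TextInputFormat for this purpose (illustrative code, not necessarily the thesis's exact class). It would be registered on a job with job.setInputFormatClass(WholeFileTextInputFormat.class).

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Input format that prevents file splitting, so each file is handled by
    // exactly one Mapper (the one-to-one mapping described above).
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;   // never split: one file, one Mapper
        }
    }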

Figure 2-5. Files-to-Mappers mapping strategies. A) One-to-one mapping. B) Many-to-one mapping. C) One-to-many mapping.

Many-to-one mapping maps multiple files to one Mapper, as shown in Figure 2-5-b. Each Mapper handles multiple files. Hadoop provides a class, CombineInputFormat, that combines several small files to be processed by a map task. We consider an alternative approach: a preprocessing procedure combines multiple files into one file, and each Mapper takes the combined file as the input. The preprocessing approach alleviates HDFS metadata storage and also reduces file processing time during a MapReduce job execution. In the case that the size of each file is relatively small, assigning each file to one Mapper creates much overhead. Thus, assigning multiple files to a Mapper is a proper strategy.

One-to-many mapping maps one file to multiple Mappers, as shown in Figure 2-5-c. Each Mapper handles part of a file. This is the most common situation in a Hadoop application. When a file is too large, HDFS splits the file into relatively smaller chunks and Hadoop assigns each chunk to a Mapper during the job execution. The split size can also be set larger than the block size of HDFS [1]. However, this is not recommended since it decreases the possibility of processing data locally. The one-to-many strategy doesn't work in the scenario where there are dependencies between computations performed on different chunks.

2.4 System Implementation

As illustrated in Chapter 1, the structure of a Hadoop application follows the MapReduce paradigm. To distribute the computation, the statistics-based analysis is mapped to the MapReduce programming model. In the current SMMDAS, the size of the data belonging to the same iCU is relatively small, so that each iCU can be processed using one computing node in a Hadoop cluster. Thus, we propose to map Algorithm 1 to a single-step MapReduce job. We discuss the scenario where the size of each dCU and each iCU increases in Chapter 4.

Algorithm 1 consists of two main stages. The first stage can be computed for each dCU independently. The second stage processes data belonging to the same iCU. Based on how much computation is carried out in the map/reduce phase, we propose three implementations: the Map-Heavy (M-Heavy) implementation, the Reduce-Heavy (R-Heavy) implementation and the MapReduce-Balanced (MR-Balanced) implementation. Table 2-2 summarizes how the different MapReduce implementations constrain data mappings into the map/reduce phase (a blank entry means no specific constraints).

Table 2-2. How different MapReduce implementations constrain data mappings into the map/reduce phase

  phase           M-Heavy    R-Heavy    MR-Balanced
  map phase       iCU                   dCU
  reduce phase               iCU        iCU

2.4.1 M-Heavy Implementation

The M-Heavy implementation maps the whole analysis procedure to the map phase. Thus, data belonging to the same iCU have to be collected and sent to the same Mapper. This benefits the case when the size of each iCU is relatively small.

2.4.1.1 Map implementation

In the M-Heavy implementation, the map phase is responsible for the whole analysis procedure. Data belonging to the same iCU have to be read by the same Mapper. The file split flag is set to false so that a file cannot be split between multiple Mappers. Figure 2-6 shows the computation procedure and the data format/definition for each step in the map phase implementation. The input to a Mapper is a <key, value> pair, where the key is the file name and the value is the whole content of the file. The Mapper extracts the useful information from each record and stores it in a nested hash map, m. In the case that data belonging to multiple iCUs are contained in the same file (many-to-one data mapping), each entry in the outer-layer hash map is an iCU, and each entry in the inner-layer hash map is a dCU. Thus, Stage 1 of Algorithm 1 is performed on each entry of the inner-layer hash map independently, and Stage 2 of Algorithm 1 is performed on each entry of the outer-layer hash map independently.

2.4.1.2 Reduce implementation

Since the whole analysis computation is done in the map phase, the reduce phase is unnecessary and deprecated.
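The nested grouping structure used inside an M-Heavy Mapper could be sketched as follows, with the outer key identifying the iCU (recipe + item) and the inner key identifying the dCU (equipment id). The comma-separated record layout and field positions are illustrative assumptions, not the thesis's actual parsing code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the nested hash map used to group records inside an M-Heavy
    // Mapper: outer key = iCU (RECIPE + ITEM), inner key = EQP_ID (dCU).
    public class NestedGroupingSketch {

        public static Map<String, Map<String, List<Double>>> group(List<String> records) {
            Map<String, Map<String, List<Double>>> m = new HashMap<>();
            for (String record : records) {
                String[] f = record.split(",");
                String iCuKey = f[0];                      // assumed: RECIPE + ITEM
                String dCuKey = f[1];                      // assumed: EQP_ID
                double avg = Double.parseDouble(f[2]);     // assumed: AVG value
                m.computeIfAbsent(iCuKey, k -> new HashMap<>())
                 .computeIfAbsent(dCuKey, k -> new ArrayList<>())
                 .add(avg);
            }
            return m;
        }
    }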


Figure 2-6. M-Heavy implementation: Map phase


2.4.2 R-Heavy Implementation

The R-Heavy implementation maps the whole analysis procedure to the reduce phase, and Mappers filter the data by their different Krs. In this implementation, there is no limitation on the input data.

2.4.2.1 Map implementation

Like a typical Hadoop application, the R-Heavy implementation filters (or categorizes) the data during the map phase. Each Mapper reads a block of data, a whole file or part of a file, depending on the block size of HDFS and the file-splitting configuration settings. The input of the map function is a pair, where the key is the offset of a record and the value is the record. Figure 2-7 shows the computation procedure and the data format/definition of the map phase. The Mapper takes a record and extracts the useful information from the record, especially the iCU information. The output of the Mapper is a pair, where the key is the iCU information and the value is the useful record information, including the data to be analyzed.

Figure 2-7. R-Heavy implementation: Map phase
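A minimal sketch of such a filtering map function follows; the record layout and field positions are illustrative assumptions:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // R-Heavy: the map phase only filters and categorizes; keying by iCU lets the
    // shuffle phase deliver all records of one iCU to the same Reducer.
    public class RHeavyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] fields = record.toString().split(",");  // assumed record layout
            String iCUKey = fields[0];                       // Kr, e.g. RECIPE + ITEM
            String usefulInfo = fields[1] + "," + fields[2]; // e.g. EQP_ID and the measurement
            context.write(new Text(iCUKey), new Text(usefulInfo));
        }
    }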


2.4.2.2 Reduce implementation

After the map phase has finished, pairs whose keys are the same are sent to the same Reducer. In this case, records from the same iCU are sent to the same Reducer. The Reducer takes as input the list of values generated by the map phase that share the same iCU, and performs Algorithm 1 on the values of each iCU. Figure 2-8 shows the computation procedure and the data format/definition of the reduce phase for the R-Heavy implementation.

Figure 2-8. R-Heavy implementation: Reduce phase
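A minimal skeleton of the corresponding reduce function is shown below; runAlgorithm1() is a placeholder for the equivalence test, not the thesis implementation:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // R-Heavy: all of Algorithm 1 runs here, once per iCU (i.e. once per key).
    public class RHeavyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text iCUKey, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            List<String> values = new ArrayList<String>();
            for (Text r : records) {
                values.add(r.toString());          // materialize this iCU's records in memory
            }
            String decision = runAlgorithm1(values);
            context.write(iCUKey, new Text(decision));
        }
        private String runAlgorithm1(List<String> records) { return "Equivalent"; }
    }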


2.4.3 MR-Balanced Implementation

The MR-Balanced implementation maps the first-stage computation to the map phase and the second-stage computation to the reduce phase. Since the first-stage computation requires that the data are grouped by dCUs, data belonging to the same dCU have to be collected and sent to the same Mapper before the computation starts. Mappers send results belonging to the same iCU to the same Reducer. This implementation benefits the case in which the size of each dCU is relatively large.

2.4.3.1 Map implementation

Since the first-stage computation of Algorithm 1 is performed in the map phase, records belonging to the same dCU have to be sent to the same Mapper. Thus, each Mapper has to read the whole file, in case the file is split. The input to the Mapper is a pair where the key is a file's name and the value is the content of the file. Figure 2-9 shows the implementation of the map function in each Mapper. The Mapper pulls the contents of a file into memory and organizes the individual records by their different dCUs. Then the Mapper computes percentiles, eliminates outliers and calculates the process capability for each list of values that belong to the same dCU. The result of the computations is stored in an object, referred to as InterData, which can be transferred between Mappers and Reducers. InterData is an object representing a list of AVG values belonging to the same dCU. It implements the Serializable interface so that it can be transmitted between the map and reduce phases, which is needed for the MR-Balanced implementation. An example of an instance of InterData is shown as follows.

    23976                            # table size
    97.5395 97.5483 97.5485 ...      # the data table
    0.3850                           # process performance index, P
    0.4121                           # process capability index, P-hat
    A5_001_R01Gas_Flow_Rate,DIFF06   # RECIPE+ITEM, EQP_ID
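The field names in the following Java sketch of InterData are assumptions inferred from the example instance above, not the exact SMMDAS definition:

    import java.io.Serializable;

    // Illustrative shape of the InterData object exchanged between map and reduce phases.
    public class InterData implements Serializable {
        private static final long serialVersionUID = 1L;
        int tableSize;             // e.g. 23976
        double[] dataTable;        // the AVG values of one dCU, e.g. 97.5395, 97.5483, ...
        double performanceIndex;   // process performance index, e.g. 0.3850
        double capabilityIndex;    // process capability index, e.g. 0.4121
        String recipeItem;         // RECIPE + ITEM, e.g. "A5_001_R01Gas_Flow_Rate"
        String eqpId;              // EQP_ID, e.g. "DIFF06"
    }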


The output of each Mapper is a set of pairs, where the key is the Kr of the iCU and the value is an InterData object. The data belonging to the same iCU are grouped together and sent to the Reducers. The default Partitioner, HashPartitioner, is used to distribute the outputs of the Mappers to the Reducers based on the hash value of the keys.

Figure 2-9. MR-Balanced implementation: Map phase

2.4.3.2 Reduce implementation

After the map phase has finished, the Reducer receives the outputs from the Mappers. Each Reducer receives a list of InterData objects with the same Kr (data belonging to the same iCU). The input to the Reducer is a value list associated with the same key, where each value corresponds to an InterData object. Figure 2-10 shows the implementation of the Reducer in the MR-Balanced implementation.


The Reducer loops over the value list and picks the InterData that contains the maximum process performance index, Cpk, as the reference data set. For the other InterData objects, the Reducer computes the Euclidean distances of the data, both the Within-Distance and the Between-Distance, performs an equivalence test using the Euclidean distances, and generates the final decision. The outputs of the Reducer are pairs where the key consists of RECIPE, ITEM and EQP_ID (Kt) and the value is the final decision, e.g. Equivalent, meaning that the data set Ai is comparable to the reference data set and its associated EU is healthy.

Figure 2-10. MR-Balanced implementation: Reduce phase

2.4.4 Relationships with Data Mapping

Table 2-3 shows the relationship between the different implementations and the different data mapping strategies. The R-Heavy implementation has no limitation on the data mapping strategy, while the other two implementations both impose requirements on the data mapping strategy.


Thus, the R-Heavy implementation works for all data-mapping strategies. As discussed in Chapter 1, for experimental purposes, the database is kept functioning. If a future architecture removes the database, the R-Heavy implementation is the only solution that is capable of functioning without assistance from the database or preprocessing of the data. The map phase of the R-Heavy implementation groups the data according to the iCU or dCU information of each record, so that no grouping by the database and no preprocessing of the data is required prior to the computation. The MR-Balanced implementation works with the per-iCU one-to-one and many-to-one strategies as well as with the per-dCU one-to-one and many-to-one strategies. The M-Heavy implementation only works with the per-iCU one-to-one and many-to-one strategies. Both the M-Heavy and the MR-Balanced implementations are constrained by specific data mapping strategies. The input data to these two implementations require either assistance from the database or a preprocessing block that categorizes the data according to the different Krs and Kts.

Table 2-3. Relationships between the different implementations and the data mapping strategies

    Data-to-files mapping   One-to-one                      Many-to-one                     One-to-many
    Arbitrary               R-Heavy                         R-Heavy                         R-Heavy
    Per-iCU                 R-Heavy, M-Heavy, MR-Balanced   R-Heavy, M-Heavy, MR-Balanced   R-Heavy
    Per-dCU                 R-Heavy, MR-Balanced            R-Heavy, MR-Balanced            R-Heavy


CHAPTER 3
SYSTEM OPTIMIZATION

We have conducted an in-depth study of the data to be analyzed and proposed different data mapping strategies. Based on the data mapping strategies and the computations performed on the data, three implementations were introduced to map the statistics-based analysis to the Hadoop platform. In this chapter, we try to optimize the system from two perspectives: tuning Hadoop configuration parameters, and optimizing the analysis algorithm by factoring computations when possible so that results can be re-used and computations avoided.

3.1 Hadoop Configuration Parameter Tuning

To achieve timely and scalable computation, Hadoop configuration parameters need to be tuned for each specific application. Studies [21, 34] show that the performance of a Hadoop application suffers severely if inappropriate configuration parameters have been set. Much in-depth work [21, 35, 36] has been done on improving performance by tuning configuration parameters. We adopt a static approach: we evaluate the application on a small amount of data in order to decide on appropriate configuration parameter settings. A subset of the parameters is summarized in Table 3-1 and Table 3-2. The target of the tuning is the R-Heavy implementation, since it is the implementation that works for every data mapping strategy. Also, it takes advantage of the data shuffling phase performed by the Hadoop platform, which collects the values generated in the map phase into groups of values, each group sharing the same key. We introduce these configuration parameters in this chapter and evaluate their effect on system performance in Chapter 4.

3.1.1 Parameters of the MapReduce Programming Model

As discussed in Chapter 1, applications following the MapReduce programming model are naturally parallelized and distributed among clusters of computers. The number of map tasks and the number of reduce tasks determine the degree of parallelization.


Table 3-1. A subset of the configuration parameters tuned for system optimization, part 1 [1]

    mapred.map.tasks: Number of map tasks in a job execution
    dfs.block.size: HDFS block size
    mapred.reduce.tasks: Number of reduce tasks in a job execution
    io.sort.mb: The size (in MB) of the buffer that is used for writing map task output
    io.sort.spill.percent: The threshold of the map output buffer. If the size of the output in the buffer exceeds the threshold, a background thread starts to spill the content to disk.
    io.sort.factor: The maximum number of streams used for merging map-phase output or reduce-phase input
    mapred.compress.map.output: Flag indicating whether or not to compress the output of map tasks
    mapred.reduce.parallel.copies: The number of threads used in a reduce task to copy intermediate results from map tasks to the reduce task
    mapred.job.shuffle.input.buffer.percent: The proportion of the memory of a reduce task that is used to hold output from map tasks
    mapred.job.shuffle.merge.percent: The threshold (as a proportion of the buffer used to hold output from map tasks) of the buffer that is used to hold input to a reduce task
    mapred.inmem.merge.threshold: The threshold (as a number of records from map tasks) of the buffer that is used to hold input to a reduce task
    mapred.map.tasks.speculative.execution: Flag that enables map-phase speculative execution
    mapred.reduce.tasks.speculative.execution: Flag that enables reduce-phase speculative execution


Table 3-2. A subset of the configuration parameters tuned for system optimization, part 2

    mapred.job.reuse.jvm.num.tasks: The number of tasks that reuse the same JVM. Note that they do not share the JVM; tasks that use the same JVM are executed sequentially.
    io.file.buffer.size: The buffer size used by Hadoop IO
    mapred.reduce.slowstart.completed.maps: The proportion of finished map tasks when the reduce tasks start
    mapred.tasktracker.map.tasks.maximum: The maximum number of map tasks executed on the same TaskTracker
    mapred.tasktracker.reduce.tasks.maximum: The maximum number of reduce tasks executed on the same TaskTracker
    mapred.child.java.opts: Java heap size for map/reduce tasks

The number of map tasks is determined by the total number of splits of the input data to a MapReduce job. (It is assumed that the file split flag is true so that a large file can be split across multiple storage nodes. If this flag is false, a large file is still stored as multiple blocks; the map task, however, will read the whole file, no matter how many blocks the file is split into. In this case, the number of map tasks is determined by the number of input files.) Using the one-to-one data mapping strategy, each split refers to a file. Using the one-to-many data mapping strategy, each split refers to a block of a file. The block size of HDFS determines the number of blocks into which a file is split in the storage system. If the file size is less than the block size, the file cannot be split. Otherwise, the number of splits is computed as the ceiling of Fs/Bs, where Fs is the size of a file and Bs is the block size of HDFS. To decrease the number of map tasks, the block size of HDFS should be increased and the number of files should be decreased, given that the total amount of data is unchanged.
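As a concrete illustration, both factors can be influenced from the job driver. The following is a minimal sketch using parameter names from Tables 3-1 and 3-2; the job name and the chosen values are examples only, and dfs.block.size only affects files written after the setting takes effect:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JobDriverSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Larger HDFS blocks mean fewer splits, hence fewer map tasks, for the same input size.
            conf.setLong("dfs.block.size", 1024L * 1024L * 1024L);   // 1 GB
            Job job = Job.getInstance(conf, "smmdas-r-heavy");
            job.setNumReduceTasks(8);                                // mapred.reduce.tasks
            // ... input/output paths, mapper and reducer classes would be set here ...
        }
    }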


Thenumberofreducetasksreferstohowmanyreducetasksarelaunchedduringajobexecution.Theparameteriscontrolledbyusers.Usually,thenumberofreducetasksissetaccordingtoavailableresources[ 1 ].FortheR-Heavyimplementation,themapphaseisresponsibleforlteringuselessinformationandcategorizingdatabyiCUswhilethereducephaseconductstheanalysisdescribedinAlgorithm 1 .ThenumberoflesandtheblocksizeofHDFSdeterminethenumberofmaptasksandthenumberofreducetasksiscontrolledbyparametermapred.reduce.tasks.Chapter4showsevaluationresultsoftheeectofthesetwofactors. 3.1.2ParametersofHadoopWorkowExtendingtheoverviewofHadoopinChapter1,thissectionpresentsanin-depthviewoftheHadoopworkow.Wethenanalyzedierentcongurationparametersinvolvedintheworkowthatmayinuencetheperformanceofapplications.AsshowninFigure 1-3 ,whenthemasternodeassignsamap/reducetasktoaslavenode,theTaskTrackerontheslavenoderunsthemap/reducetask.Figure 3-1 showsthemap/reducetaskexecutionworkow,includingtheshuingphase,indetail.InaHadoopMapReducejob,theinputdatatothereducetaskaresortedbythekeys.TheshuingphaseinFigure 1-3 showstheprocessofsortingandtransformingintermediatedatafromthemapphasetothereducephase.Figure 3-1 detailstheshuingphaseexecution.Eachmaptaskisallocatedamemoryblockusedforbueringtheoutputofthetask.Duringthemaptaskexecution,outputsofthemaptaskarewrittentothebuerrst.Whenthesizeofthedatainthebuerexceedsaspecicthreshold,abackgroundthreadstartstospillthedatatothelocaldisk.Everytimethesizeofthedatainthebuerexceedsthethreshold,thereisalegenerated.Attheendofthemaptask,theremaybemultipleles,whicharemergedintothreelesusually.Beforethereducetaskstarts,severalthreadsretrievetheoutputsfrommaptasks.Sincethereducetaskmayretrieveoutputsfromdierentmaptasks,theremaybemultiplepiecesofdatabeforetheexecutionofthereducetask.Abackgroundthreadmergesmultiplepieces 54


Figure3-1. AshuingphaseofaMapReducejob.Eachmaptaskcontainsamapfunction.Thebuerassociatedwiththemapfunctionistheplacethatstorestheoutputfromthemapfunction.Whenthesizeofthedatainthebuerexceedssomethreshold,thedataaresplitintodisks.Theblocksfollowingthebuerineachmaptaskrepresenttheblockswrittenonthedisk.Eachreducetaskcontainsareducefunction.Beforethereducefunctionstarts,thereducetaskretrievestheoutputfrommaptasksandmergestheoutputsastheinputtothereducefunction,representedbythearrowandthemergestreamsineachreducetask. ofdataintoonepiece.Thereducetaskthenstartswiththeone-pieceinputfromtheshuingphase.Afterthereducefunctionexecution,outputsarewrittenintoHDFS.Intermediateoutputsgeneratedduringthemapphasearestoredinabuerwhosesizeisspeciedbytheio.sort.mbparameter.Iftheamountofintermediatedataexceedsathreshold,abackgroundthreadspillstheintermediatedatatothedisksothatadditionaloutputsfrommaptaskscanbewrittentothebuer.Thethresholdisdeterminedbyboththesizeofthebuerandapercentofthatsizeio.sort.spill.percentabovewhichdataarespilled.Beforetheintermediateoutputsarewrittentothedisk,thebackgroundthreadpartitionsthedataaccordingtocorrespondingreducetasks.Thedatatobesenttothesamereducetaskareinasamepartitionandsortedaccordingtointermediatekeys.If 55


io.sort.spill.percentpercentofthebuerisfull,thebackgroundthreadstartstospillthecontenttothediskwithregardstothepartitioninthebuer.Duetotheamountofintermediatedatathatmaptasksgenerate,theremaybemultiplespilledles.Beforethemapphaseends,multiplespilledlesaremergedintoasinglele.Thecongurationparameterio.sort.factordeterminesthenumberofstreamsusedtomergeatonce.Itisalsopossibletocompresstheintermediatedatabeforethedataarewrittenintothedisk.Ifthedataarecompressedatthemapphase,theyhavetobedecompressedatthereducephase.mapred.compress.map.outputistheagdecidingwhetherornottocompresstheoutputofthemapphase.mapred.map.output.compression.codectellstheHadoopplatformthespeciccompressionlibrary[ 1 ].FortheR-Heavyimplementation,themapphaseextractsusefulinformationandcomposespairs.Beforethemapphaseends,theintermediatepairsarepartitionedandsortedwithregardstodierentKrs.Thecongurationparametersinvolvedinthisphasemayhavesignicanteectontheexecutiontimeoftheapplication.Duringthereducephase,thereducetaskretrievesthepartitionofthedatathataresupposedtobesenttoit.Sincemultiplemaptasksmaynishatdierenttime,thereducetaskstartstoretrievethepartitionaslongasthemaptaskthatgeneratesthepartitionisnished.ThelocationsofpartitionsarecommunicatedtotheJobTrackerthroughtheheartbeatprocessbyeachTaskTracker.Thenumberofthreadsthatareusedtocopyoutputsofmaptaskstoareducetaskisdeterminedbytheparametermapred.reduce.parallel.copies.Theintermediatedatageneratedbymaptasksarecopiedintoabueratareducetask,whosesizeisdeterminedbymapred.job.shue.input.buer.percent,thepercentoftheheapofthereducetasktobeallocatedforthebuer.Similartothemapphase,whentheamountofdatainthebuerexceedsathreshold,thedatainmemoryarespilledintothedisk.Twoparameterscontrolthethreshold,mapred.job.shue.merge.percentandmapred.inmem.merge.threshold. 56


Afterallrequiredinputsforareducetaskareretrievedfrommaptasks,abackgroundthreadisusedformergingtheinputsintoonesortedle.Theio.sort.factorintroducedpreviouslyalsodeterminesthenumberofstreamsusedformerging.Then,thereducephasecomputationstartsandtheoutputsarewrittenintoHDFS.ThereducephaseoftheR-Heavyimplementationretrievesthepartition|asetofpairs|belongingtothesameiCUpriortotheexecutionofthereducefunction.ThedatabelongingtothesameiCUmaybegeneratedbydierentmaptasks.Thereducetasksretrievethedatafrommultiplelocationsandmergethedataintoonesorteddataset.Afterthedatahavebeenmerged,thereducefunctionconductscomputationsdescribedinAlgorithm 1 oneachiCU.EachreducetaskmayprocessoneormultipleiCUs.CongurationparametersinvolvedinaMapReduceworkowsignicantlyinuencethewaytoutilizeavailableresources.Abetterparametersettingmaybringaperformanceboost. 3.1.3OtherHadoopFeaturesWehaveintroducedparametersinvolvedintheMapReduceprogrammingmodelandaHadoopjobworkow.Otherparametersrelatedtotaskexecution,I/O,andenvironmentsettingsmayinuenceperformanceofapplicationsaswell.TheMapReduceprogrammingmodelallowsajobtobedistributedamongmultiplenodestoachievespace-parallelexecution.Duringajobexecution,onetaskmayslowdownthetotalexecutiontimeofanapplicationwhenothertaskshavealreadynishedandtheapplicationisstillwaitingfortheresultsfromtheslowtask.Iftheexecutiontimeofataskissignicantlylongerthanotherparalleltasks,Hadooptriestoschedulethesametasktobeexecutedonadierentnodeandacceptstheresultsreturnedbytherstnodethatcompletesthetask.Thisprocedureisreferredtoasspeculativeexecution.ThepropertysignicantlyimprovestheperformanceofaHadoopjobsothatasingleslowtaskdoesnotaectthetotaljobexecutiontime.Thespeculativeexecutionis 57


not about launching two tasks to process the same data block at the same time. Instead, Hadoop computes the task execution rate and determines which task needs to be scheduled on a different node. The parameters mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution are the two flags enabling speculative execution in the map and reduce phases, respectively.

When the JobTracker assigns a task to a TaskTracker, the TaskTracker executes the task in an individual Java Virtual Machine. If the TaskTracker creates one JVM for each task assigned to it, the total JVM creation time is significant when there are large numbers of short-lived tasks. The parameter mapred.job.reuse.jvm.num.tasks can make the TaskTracker reuse a JVM for other tasks. However, multiple tasks do not share the same JVM at the same time; instead, they are executed sequentially by the same JVM.

Another property of the Hadoop platform is lazy initialization of reduce tasks. The parameter mapred.reduce.slowstart.completed.maps decides what fraction of the map phase must have finished before the reduce tasks are initialized. If reduce tasks are initialized at an early stage of a long-running job, the performance of the map phase may be degraded. The reason is that reduce tasks will share resources with map tasks, yet reduce tasks cannot start until all map tasks have finished. Thus, lazy initialization can delay the initialization of the reduce phase so that the map phase can utilize all available resources.
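The following minimal sketch shows how these features could be enabled programmatically; the property keys come from Tables 3-1 and 3-2, and the values are illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class TuningSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
            conf.setInt("mapred.job.reuse.jvm.num.tasks", 5);              // tasks reuse a JVM sequentially
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.8f); // delay reduce initialization
        }
    }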


After a Hadoop cluster has been set up, multiple environment parameters need to be determined. Appropriately setting these parameters benefits both the applications executed on the cluster and the resource utilization of the cluster. As shown in Figure 1-3, a TaskTracker runs on each slave node. The number of tasks that a TaskTracker is allowed to run is configured by the parameters mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Given the available resources, the principle for setting the number of tasks for each TaskTracker is to achieve optimal utilization. Usually, the TaskTracker and the DataNode both run as individual processes. To utilize the available resources, the parameter can be estimated as N_processor * w - 2, where N_processor is the number of processors available on the node, w is the number of processes running on the same processor, and the 2 accounts for the TaskTracker process and the DataNode process. We discuss our cluster settings in Chapter 4.

Table 3-1 and Table 3-2 summarize all configuration parameters introduced in this section. Studies show that tuning these parameters brings significant performance improvements to applications. The experimental evaluation of this section is presented in Chapter 4.

3.2 Algorithmic Optimization

We have introduced the characterization of the data to be analyzed in Section 2.1. One piece of information that each record carries is the time stamp indicating when the record was collected. Based on the time stamp information, the system retrieves the data from the database. New data are generated every day and part of the old data are deleted. A typical statistical computation uses data collected during a "window" of several days (e.g. the last 14 days). Figure 3-2 shows that the statistical analysis is performed on the data from 14 days, Day-1 to Day-14. At the end of Day-15, the data from Day-1 are deleted and the Day-15 data are added. Then the computation is performed on the data from Day-2 to Day-15.

Figure 3-2. The sliding window of statistical analysis. The window size is 14 days.


In this section, we take the time stamp information into consideration and propose a method that takes advantage of previous data and computation results to algorithmically optimize the performance of the system.

3.2.1 Data Movement

Three data mapping strategies have been discussed in Section 2.3: arbitrary, per-iCU and per-dCU mapping. We propose a new data mapping strategy that considers the time stamp information of each record, referred to as per-window mapping. Per-window mapping maps the data belonging to the same time period to the same file. For example, the data collected on Day-1 are mapped to the same file, as shown in Figure 3-3. Since massive amounts of data can be collected each day, per-window mapping can be combined with other mapping strategies to decrease the file size.

Figure 3-3. Data mapping strategy with regard to the time stamp of each record


Using this approach, the analysis system only retrieves the newly generated data each day, instead of retrieving all of the data that belong to the same computation window. Prior to the analysis, the system removes the outdated data and adds the new data to HDFS through a relatively short preprocessing procedure. The amount of data to be moved decreases to approximately 1/14 of the previous amount for the case in which the window size is 14 days.

3.2.2 Algorithmic Optimization

If we keep some intermediate analysis results from previous computations, we can take advantage of the precomputed results to decrease the computation time for the data window of interest. In this section, Stage 1 of Algorithm 1, discussed in Section 2.2, is optimized by introducing a new algorithm.

3.2.2.1 The sliding-window algorithm

Stage 1 of Algorithm 1 consists of multiple steps. Among those steps, sorting and computing the average and the standard deviation of each dCU are the most time-consuming parts. In each analysis run, only a relatively small amount of data is added to the whole data set. Taking Figure 3-2 as an example, at Day-15 the new data generated on Day-15 are added to the Day-15 analysis run, and the outdated data from Day-1 are deleted. If it is assumed that the amount of data generated each day is the same, the newly added data set is 1/14 of the total data set to be analyzed. Thus, if the computation is performed only on the newly generated data, the total computation time can be decreased.

We propose a sliding-window algorithm to optimize the computation. It is assumed that the previous records are sorted and that the average and the standard deviation of each dCU from the previous computation are known. The sorting can be done by first sorting the small, newly generated data set, and then using the merge sort algorithm [37] to merge the small data set into the previously sorted data set. Simultaneously, the algorithm checks the time stamp information of each record and deletes outdated data when the record lies outside the computation window. SlidingWindowSort is the pseudo-code for the sorting algorithm.


    procedure SlidingWindowSort(Array A (sorted array), Array U, Window w)
        Sort U using any sorting algorithm
        Allocate output array B
        while i < |A| or j < |U| do
            if i < |A| and the time stamp of A[i] lies outside w then
                i = i + 1                          (discard the outdated record)
            else if j >= |U| or (i < |A| and A[i] <= U[j]) then
                append A[i] to B; i = i + 1
            else
                append U[j] to B; j = j + 1
        end while
        return B
    end procedure
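A minimal Java sketch of this merge step follows. It assumes that each record exposes a timestamp and a value and that both inputs are already sorted by value; the class and method names are illustrative, not the thesis implementation:

    import java.util.ArrayList;
    import java.util.List;

    public class SlidingWindowMerge {
        static class Record {
            final long timestamp;
            final double value;
            Record(long timestamp, double value) { this.timestamp = timestamp; this.value = value; }
        }

        // Merge the sorted window A with the sorted batch of new records U,
        // dropping records of A whose timestamp falls before windowStart.
        static List<Record> merge(List<Record> a, List<Record> u, long windowStart) {
            List<Record> b = new ArrayList<Record>(a.size() + u.size());
            int i = 0, j = 0;
            while (i < a.size() || j < u.size()) {
                if (i < a.size() && a.get(i).timestamp < windowStart) {
                    i++;                                            // outdated record: skip it
                } else if (j >= u.size()
                        || (i < a.size() && a.get(i).value <= u.get(j).value)) {
                    b.add(a.get(i++));                              // take the smaller head from A
                } else {
                    b.add(u.get(j++));                              // take the smaller head from U
                }
            }
            return b;
        }
    }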

The previously computed results are then reused to obtain the average and the standard deviation of the updated data. Again, Figure 3-2 serves as an example of the computation procedure. We derive the updated results mathematically as follows.

    \mu = \frac{1}{N} \sum_{i=1}^{N} x_i    (3-1)

    \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 }    (3-2)

    \sigma = \sqrt{ \frac{1}{N} \left( \sum_{i=1}^{N} x_i^2 - \frac{1}{N} \Big( \sum_{i=1}^{N} x_i \Big)^2 \right) }    (3-3)

Let \mu_1 denote the average of Day-1, \mu_{15} the average of Day-15, \mu the average from Day-1 to Day-14 and \mu' the average from Day-2 to Day-15. Let N_1, N_{15} and N be the numbers of records in Day-1, in Day-15 and in Day-1 to Day-14, respectively. Then \mu'(N - N_1 + N_{15}) = \mu N - \mu_1 N_1 + \mu_{15} N_{15}, and \mu' is computed through

    \mu' = \frac{\mu N - \mu_1 N_1 + \mu_{15} N_{15}}{N - N_1 + N_{15}}    (3-4)

It is not obvious how to use previous results to compute the standard deviation of each dCU. We can start by computing the variance of the data set from Day-2 to Day-15. Let v' be the variance from Day-2 to Day-15. Then

    v' = \frac{ \sum_{\text{Day-2}}^{\text{Day-15}} (x_i - \mu')^2 }{ N - N_1 + N_{15} },

where x_i is an element of the data from Day-2 to Day-15. The equation can be expanded as

    v' = \frac{ \sum_{\text{Day-2}}^{\text{Day-15}} (x_i^2 - 2 x_i \mu' + \mu'^2) }{ N - N_1 + N_{15} }.

We define S_2 as the summation of the squares of the elements in a given data set and S as the summation of all elements in the data set, that is,

    S_2 = \sum x_i^2, \qquad S = \sum x_i.


The variance can then be computed through

    v' = \frac{ S_2 - S_{2,1} + S_{2,15} - 2\mu'(S - S_1 + S_{15}) + \mu'^2 N' }{ N' },

where N' = N - N_1 + N_{15}, S_1 and S_{2,1} are the summation and the summation of squares over the Day-1 data, and S_{15} and S_{2,15} are the corresponding quantities over the Day-15 data. The standard deviation is then derived through

    \sigma' = \sqrt{v'} = \sqrt{ \frac{ S_2 - S_{2,1} + S_{2,15} - 2\mu'(S - S_1 + S_{15}) + \mu'^2 N' }{ N' } }    (3-5)

So far, we have used the average, the summation and the summation of squares of the previous data to compute the average and the standard deviation of the updated data. Algorithm 2, shown below, uses the sliding-window approach to compute the average and the standard deviation from previous results. We assume that each element of the double-precision array carries a time stamp. When iterating over the array, we can extract the outdated data by checking the time stamp of each element. Ai represents the i-th 1-D double-precision array in the linked list inputList, and Ai[k] represents an element of the array Ai. It is assumed that Ui is the i-th 1-D double-precision array in the linked list newData and that it contains the newly added data for the array Ai. S_X is the summation of the elements of array X; S2_X is the summation of the squares of the elements of X; mu_X is the average of the elements of X; N_X is the number of elements of X. Each array Ai is sorted, and S_Ai, S2_Ai, mu_Ai and N_Ai are known.

The first stage of Algorithm 2 starts by sorting and computing the average, the summation and the summation of squares of the newly added data. Then the algorithm merges the newly added data into the previous data and removes the outdated data by checking the time stamp of each record. After merging, outlier limits are computed to identify outliers. Finally, the proposed approach, Equation 3-4 and Equation 3-5, is used to compute the average and the standard deviation of the updated data. The second stage remains unchanged.

3.2.2.2 Algorithm analysis

The first stage of Algorithm 2 is optimized using the SlidingWindowSort algorithm together with Equation 3-4 and Equation 3-5.


Algorithm 2. Sliding-window approach to the equivalence test

    Input: LinkedList inputList, LinkedList newData
    for each DoubleArray Ai in inputList and each DoubleArray Ui in newData do
        Sort Ui
        Compute S_Ui, S2_Ui, mu_Ui, N_Ui
        Merge Ui into Ai, extract the outdated elements and store them in Xi
        Compute S_Xi, S2_Xi, mu_Xi, N_Xi
        Compute the percentile values using the Percentile function of Xi
        Compute the lower and upper outlier limits LOL and UOL using the percentile values
        Eliminate every Xi[k] that is not in the range [LOL, UOL]
        N_i = N_Ai - N_Xi + N_Ui
        Compute mu_i = (mu_Ai * N_Ai - mu_Xi * N_Xi + mu_Ui * N_Ui) / N_i
        Compute sigma_i = sqrt( (S2_Ai - S2_Xi + S2_Ui - 2 * mu_i * (S_Ai - S_Xi + S_Ui) + mu_i^2 * N_i) / N_i )
        Compute the process capability index Cpk[i]
    end for
    Pick the Xi with the largest Cpk as the reference set R
    Compute the Euclidean distance W_R[k] = sqrt( (R[k] - mu_R)^2 )
    Compute the Euclidean distance B_R[k] = sqrt( (R[k] - mu)^2 )
    Compute mu_WR, sigma_WR and mu_BR, sigma_BR
    for each DoubleArray Xi in inputList do
        Compute the Euclidean distance W_i[k] = sqrt( (Xi[k] - mu_i)^2 )
        Compute the Euclidean distance B_i[k] = sqrt( (Xi[k] - mu_R)^2 )
        Compute mu_Wi, sigma_Wi and mu_Bi, sigma_Bi
        Do a pairwise Student's t-test between W_R (or B_R) and W_i (or B_i)
        Make the final decision based on the result of the t-test
    end for

After sorting, the percentile and outlier-limit computations are the only ones left to be done, and these are not time-consuming. The standard deviation used to be computed with a two-pass algorithm [38], which requires accessing each record twice. In our approach, we only need to access each record once. An in-memory sorting algorithm has a time complexity of O(n log n), where n is the number of records to be sorted. Using the SlidingWindowSort algorithm, we take advantage of the merge sort algorithm. The new approach results in a time complexity of O(n + k log k), where n is the total number of records to be sorted and k is the number of newly generated records.


The two-pass algorithm for computing the standard deviation accesses each record twice: once to compute the average and once to compute the standard deviation. The sliding-window algorithm, like an online algorithm, accesses each record once. In a situation where the memory space is limited, this algorithm can decrease the memory access time by 50%. If the previous results are known, computing the standard deviation only requires computing the average and the summation of squares of the newly generated data set. The execution time of the sliding-window algorithm for computing the standard deviation is 1/14 of that of the original algorithm (assuming that the amount of data generated each day is the same).
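The following small Java sketch works through Equations 3-4 and 3-5: given the count, sum and sum of squares of the old window, of the expired day and of the newly added day, the updated mean and standard deviation are obtained without revisiting the old records. The class and field names are illustrative assumptions:

    public class SlidingWindowStats {
        // Summary statistics of a data set: count, sum, and sum of squares.
        static class Summary {
            final long n;
            final double sum;
            final double sumSq;
            Summary(long n, double sum, double sumSq) { this.n = n; this.sum = sum; this.sumSq = sumSq; }
        }

        // Mean of the updated window, Equation 3-4.
        static double updatedMean(Summary old, Summary expired, Summary added) {
            long n = old.n - expired.n + added.n;
            return (old.sum - expired.sum + added.sum) / n;
        }

        // Standard deviation of the updated window, Equation 3-5.
        static double updatedStd(Summary old, Summary expired, Summary added) {
            long n = old.n - expired.n + added.n;
            double mu = updatedMean(old, expired, added);
            double sumSq = old.sumSq - expired.sumSq + added.sumSq;
            double sum = old.sum - expired.sum + added.sum;
            double variance = (sumSq - 2.0 * mu * sum + mu * mu * n) / n;
            return Math.sqrt(variance);
        }
    }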


CHAPTER 4
EXPERIMENTAL EVALUATION AND ANALYSIS

This chapter summarizes the experimental results for the data analysis approaches discussed in the previous chapters. Experiments were done to evaluate i) the different data mapping strategies corresponding to the different implementations; ii) the scalability of the different implementations with regard to large-scale data sets; iii) the effect of the different Hadoop configuration parameters on performance; and iv) the performance of the sliding-window algorithm.

4.1 Data Mapping and MapReduce Implementation Experiments

The data collected from monitoring sensors can grow in multiple dimensions. The data size (in terms of records), S, depends on the number of iCUs, I, the number of dCUs in each iCU, D, and the size of each dCU, R:

    S = F(I, D, R)    (4-1)

In the following experiments, it is assumed that the size of each dCU and the number of dCUs of each iCU are the same. Therefore, Equation 4-1 becomes

    S = I \cdot D \cdot R    (4-2)

The experiments are set up on a local cluster consisting of 9 physical nodes: one master node and eight slave nodes. One node is dedicated to being the master node. Each physical node contains 8 Intel Xeon 2.33 GHz processors and 16 GB of memory. Cloudera's Distribution of Hadoop (CDH) 4.2.1 is installed on the local cluster.

4.1.1 Data Mapping

The different data mapping strategies were described in Chapter 2. Experiments are set up to compare the different strategies corresponding to the different implementations.


4.1.1.1ExperimentssetupWeassumethenumberofdCUsineachiCU,D,isaconstant,6(thereareapproximate6EUsforeachrecipeanditem).WeconductexperimentsfollowingTable 2-3 byincreasingthesizeofeachdCU,R,startingfrom1MBto24MB.Forthemany-to-onescenario,thefactorischosentobe2,meaningthateachmaptaskhandlestwoles.Thesizeofthedata,S,is30GB.AsRincreases,thenumberofiCUs,I,decreases.SincetheR-HeavyimplementationcanbeconsideredastheM-HeavyimplementationandanextrafunctionthatgroupsthedatabelongingtothesameiCUtogether,itistheonlyimplementationthatcanbeusedinthearbitrarilymappingstrategy.Intuitively,theexecutiontimeofprogramsusingarbitrarilymappingstrategyislongerthanusingothermappingstrategies.Thus,weonlyconsidertheper-iCUandper-dCUmappingstrategies. 4.1.1.2ResultsanalysisForcomparisonpurposes,weshowexperimentalresultsintwodierentgures.Figure 4-1 comparestheexecutiontimeofdierentdatamappingstrategiesforeachimplementation.FortheR-Heavyimplementation,whenthesizeofdCUislessthan16MB,theper-iCUmany-to-onemappingstrategyoutperformsothermappingstrategies.TheMR-Balancedimplementationshowssimilarbehavior.FortheM-Heavyimplementation,theexecutiontimechangesslightlyasthesizeofdCUchanges.AllthreeimplementationsshowanveryslightlyincreasingtrendafterthesizeofdCUexceeds12MB.Sinceaper-dCUone-to-onemappingwillgeneratemanymorelesthanaper-iCUmany-to-onemapping,theHadoopplatformgeneratesmoremaptasksintheper-dCUone-to-onescenariothanitdoesinthelatter.Thus,theexecutiontimefortheper-dCUone-to-onemappingstrategyislongerthanthetimetakenfortheper-iCUcase.However,asthesizeofeachdCUincreases,thetaskexecutiontime,includingcomputationanddatamovementtime,dominatesthetotalexecutiontime,whichresultsinsimilarexecutiontimeforallmappingstrategies. 68


Figure4-1. Dierentdatamappingstrategiesforeachimplementations Figure 4-2 comparestheexecutiontimeofdierentimplementationsforeachdatamappingstrategy.Dierentimplementationsshowsimilarbehaviorsforthesamedatamappingstrategy.Theper-iCUmany-to-onestrategyshowsaveryslightlyincreasingtrendwhenthesizeofdCUexceeds8MB.AsthesizeofdCUgrows,eachmaptaskhandleslargerandlargerles.Sincethecomputationismemory-intensive,thedatasizeofeachtaskwillbeboundedbymemory.Thedatasizeforeachtaskintheper-iCUmappingstrategyislargerthanothermappingstrategies.Thus,itistherstcasethatshowsuptheincreasingtrendoftime-consumptionasthesizeofdCUincreases.IfthesizeofdCUistoolargesuchthatasingleiCUcannotbecomputedinasinglenode,aparallelalgorithmmustbedesignedtoparallelizetheexecution.WehaveproposedaniterativeMapReduceparallelalgorithminAppendix B 4.1.2LargeScaleDataSetsInordertocapturethescaleofmonitoringdatainthereal-worldmanufacturingsystem,theexperimentsassumedthat1TBdataaretobeanalyzedeachday. 4.1.2.1ExperimentssetupMonitoringdatafor25iCUsweregeneratedbymodifyingactualvaluesfromrecordscollectedfromasemiconductormanufacturingsystem[ 6 ].ItisassumedthatRandDareconstant.Fromthisinitialdataset,thedatasizeisgrowingbyincreasingthenumberofiCUsbasedonEquation 4{2 .TheIisintherangefrom625to2,000,000.Thedatasizeisincreasedfrom0.32GBto1TBcorrespondingly. 69


Figure4-2. Dierentimplementationsforeachdatamappingstrategy. SincethesizeofeachdCUislessthan1MB,weuseper-iCUmany-to-onedatamappingstrategywithamany-to-onefactorof125.Thelesizerangesfrom54MBto73MB.TheDFSblocksizeis128MB.Thenumberofreducetasksissetto8. 4.1.2.2ResultsanalysisFigure 4-3 showsseveralexperimentalresults.Asthedatasizeincreases,thecomputationtimeincreaseslinearlyaswell.TheM-HeavyimplementationoutperformstheR-HeavyimplementationandtheMR-Balancedimplementation.ThemapphaseexecutiontimeoftheMR-BalancedandtheM-Heavyimplementationareapproximatelythesame.ThemapphaseexecutiontimeoftheR-Heavyimplementationisshorterthanotherimplementations,becausethemapphaseofR-Heavyimplementationonlyltersandcategorizesrecordswithoutanycomputation.ThereducephaseexecutiontimeoftheM-Heavyimplementationisshorterthanthatofotherimplementations,sincethereducephaseoftheM-Heavyimplementationis 70


Figure4-3. Timemeasurementforcurrentdatasimulation absent.ThereducephaseexecutiontimeoftheR-Heavyimplementationisthelongestbecauseallcomputationsaredoneinthisphase. 4.2HadoopConguration-ParametersTuningExperimentsWeevaluatetheeectofdierentcongurationparameters,showninTable 3-1 and 3-2 ,ontheperformanceoftheR-Heavyimplementation.AsdiscussedinChapter 3 ,wetunedtheR-HeavyimplementationforreasonsthattheR-HeavyimplementationworksforanymappingstrategyandittakesadvantageoftheshuingphaseofatypicalHadoopjob.Table 4-1 showsdefaultvaluesofcongurationparametersofanR-Heavyjob.ThedefaultvaluesetisatypicalsettingforaHadoopcluster.Wehaveusedthissettingforpreviousexperiments.Inourexperiments,ifaspecicparameteristuned,otherparametersusethevaluesinTable 4-1 4.2.1ParametersoftheMapReduceProgrammingModelTheMapReduceprogrammingmodelhastwomainphases,themapphaseandthereducephase.Thus,thenumberofmaptasksandthenumberofreducetasksaretwoimportantfactorsinuencingtheperformanceofaMapReduceprogram.Inthissection,thesetwofactorsaretunedanddiscussed.A100GBdatasetisgeneratedfortestingpurposes.EachlecontainsdatafromdierentiCUsanddCUs.Thenumberofmaptasks,representingbymapred.map.tasks,isdeterminedbyboththeparameterdfs.block.sizeandthelesizeofeachle.Itisassumed 71


Table 4-1. The default values of the configuration parameters in Tables 3-1 and 3-2

    mapred.map.tasks: 800
    dfs.block.size: 128 MB
    mapred.reduce.tasks: 16
    io.sort.mb: 200 MB
    io.sort.spill.percent: 0.8
    io.sort.factor: 64
    mapred.compress.map.output: true
    mapred.reduce.parallel.copies: 10
    mapred.job.shuffle.input.buffer.percent: 0.7
    mapred.job.shuffle.merge.percent: 0.66
    mapred.inmem.merge.threshold: 1000
    mapred.map.tasks.speculative.execution: false
    mapred.reduce.tasks.speculative.execution: false
    mapred.job.reuse.jvm.num.tasks: 1
    io.file.buffer.size: 64 KB
    mapred.reduce.slowstart.completed.maps: 0.8
    mapred.tasktracker.map.tasks.maximum: 8
    mapred.tasktracker.reduce.tasks.maximum: 4
    mapred.child.java.opts: 1536 MB

that the file is splittable and that the split size is the same as the block size of HDFS. The relationship can be represented by

    M = \sum_{i=1}^{N} \lceil S_i / B \rceil    (4-3)

where M denotes the number of map tasks, N the total number of files, S_i the file size of the i-th file and B the block size of HDFS. The R-Heavy implementation, like many other Hadoop applications, will assign each map task to an independent block of data. For example, if a file size is 10 GB and the block size is 1.5 GB, the number of blocks is 7; thus, the number of map tasks launched for the file is 7. In this experiment, N is 10, and each file is 10 GB. By changing the block size, the number of map tasks launched is changed as well. Figure 4-4 shows the experimental results of the influence of the number of map tasks on the execution time of the program. As the block size increases, the number of map


Figure4-4. Inuenceofthenumberofmaptaskslaunched.Thedotsinthegureshowtheexperimentalresults.Eachdotcontainsthreedimensions,thenumberofmaptasks,theHDFSblocksizeandtheexecutiontime.Theexecutiontimeisshownusingdierentshadesofgray.Thedarkertheareais,thelesstimethejobtakestorun. taskslauncheddecreases.AsrepresentedbyEquation 4{3 ,thenumberofmaptasksforeachleistheceilingofthelesizedividedbytheblocksizeofHDFS.Thus,insomecases,thedataarenotevenlydistributedamongmaptasks.Thedotsinthegureshowtheexperimentalresults.Eachdotcontainsthreedimensions,thenumberofmaptasks, 73


theHDFSblocksizeandtheexecutiontime.Theverticalaxisisthenumberofmaptaskslaunched,andthehorizontalaxisistheblocksizeofHDFS.Theshadesofgrayrepresenttheamountoftimeittakesforthejobtorun.Thedarkertheareais,theshortertheamountoftimethejobtakes.ThoughthenumberofmaptasksdecreasesasthesizeofHDFSblockincreases,theexecutiontimedoesnotshowanytrend.Becauseofunevenlydistributeddata,someofmaptaskstakelongertimethanothermaptasksinajob.Ifthecasewhendataareevenlydistributedisconsidered,however,theexecutiontimedecreasesasthenumberofmaptasksdecreases,asshowninFigure 4-5 .Inthiscase,weconsidertheblocksize64MB,128MB,256MB,512MB,1024MB,2048MBand3072MB,whichareusedinordertoevenlydistributethedatafromthesameleamongmultipleblocks.Correspondingly,thenumberofmaptasksdecreasesfrom1600to40.Theminimalexecutiontimeisachievedwhenthenumberofmaptasksis50withthecorrespondingblocksize1024MB.Asthenumberofmaptaskscontinuesdecreasing,theexecutiontimeincreases,incontrary,whichisshownatthebeginningofthehorizontalaxisinFigure 4-5 .Thisincreasingtrendindicatestheproblemoftoolargegranularityoftheinputdata.Asdiscussedpreviously,theJobTrackertriestoassignthetaskwheretheinputdataarelocated.If,ontheotherhand,thenodethatstorestheinputisbusy,theinputhastobetransferredtoanothernode.Thepenaltyofdatatransmission1islowwhenthesizeoftransmitteddataissmall.Astheblocksizeincreases,thetimeittakestotransmitthedataforthissituationismuchlonger.Thus,thetimeforinputdatatransmissionlagsthetotalexecutiontimeofmaptasks.Theparametermapred.reduce.tasksdecideshowmanyreducetasksarelaunchedforajob.Inourexperiments,thisparameterisvariedfrom1to128.Figure 4-6 showstheexecutiontimefordierentnumbersofreducetasks.Theexecutiontimeofthemapphase 1Thetimeittakestotransfertheinputlefromthesourcenodetothetargetnodeismuchlongerthanthetimeittakestoperformcomputationslocally. 74


Figure4-5. Inuenceofthenumberofmaptasks(evenlydatadistribution) doesnotchangemuchduringexperiments.Whenthenumberofreducetasksisrelativelysmall,theexecutiontimeofthereducephasedecreasesasthenumberofreducetasksincreases,whichcontributestodecreasingofthetotalexecutiontimeofajob.Asthenumberofreducetasksincreases,theutilizationoftheresourceoftheclusterincreases.Whenthenumberofreducetasksexceeds20,theexecutiontimeofthereducephaseremainsunchanged.Whenthenumberofreducetasksisrelativelylarge,theexecutiontimeofthereducephaseshowsaveryslightlyincreasingtrend,becauseincreasingthenumberofreducetasksresultsinincreasingthesystemoverhead.Whenthesystemresourceisfullyutilized,increasingthenumberofreducetaskswillbringslightlymoreoverhead. 4.2.2ParametersofHadoopWorkowTheshuingphaseplaysanimportantroleinaMapReducejob.SeveralcongurationparametersinvolvedareproventosignicantlyinuencetheperformanceofaMapReducejob.io.sort.mbandio.sort.spill.percentdecidethebuersizeoftheoutputofamaptask.Theprincipleforsettingthesetwoparametersisthataslongasthememoryisavailable,thebuersizeshouldbesetaslargeaspossible.Figure 4-7 showstheexperimental 75


Figure4-6. Inuenceofthenumberofreducetasksonexecutiontime resultsoftheinuenceofthesetwoparametersontheperformanceoftheR-Heavyimplementation.Theresultgoesagainsttheexpectationsothatasbothparametersincrease,theexecutiontimeoftheprogramincreasesaswell.Whentheio.sort.mbissetto50MBandtheio.sort.spill.percentissetintherangebetween0.3and0.9,theexecutiontimeisdecreasedby13%oftheworstcasescenariointhisexperiment. Figure4-7. Inuenceofthebuersizeoftheoutputfromthemapfunction io.sort.factorrepresentsthenumberofstreamsthatareusedformergingtheoutputofamaptaskortheinputofareducetask.Theoretically,thelargertheparameteris,thelesstimeittakestomergetheoutputofmaptasksandtheinputofreducetasks.Figure 4-8 showstheexperimentalresults.Localminimalpointsarelocatedat16,64,128...Thetotalresourcesavailableintheclusterare64cores.Thus,theoptimalperformanceoftheprogramisachievedatthepointwherethemergingphaseutilizesallavailableresources 76


andtherearenoextrastreamsallocatedforthisphase.Inthisexperiment,thevalueisintherange[60,75]. Figure4-8. Inuenceoftheparameterio.sort.factorontheexecutiontime Afterthemapphase,theoutputsofmaptasksarecopiedtothecorrespondingreducetasks.Thenumberofthreadsthatareusedforcopyingintermediatedataisdeterminedbytheparametermapred.reduce.parallel.copies.Figure 4-9 showsthatthemapphaseexecutiontimeisnotinuencedbythisparameter.Thereducephaseexecutiontimeisminimalwhenthenumberofthreadsis10.Figure 4-9 suggeststhatchoosinganumberofthreadsintherange[5,20]willyieldbetterexecutiontime.Astheparameterkeepsincreasing,theexecutiontimeincreasesaswell.Whentheresourceisfully-utilized,additionalthreadsbringonlyoverheadratherthanperformanceimprovement.Theresultsofthemapphasearerstcopiedtoabuer,ablockoftheheapdeployedineachreducetaskprocessofareducetask.Whenthebuerisfull,outputsfrommaptasksarespilledintodisk.Threeparameters,mapred.job.shue.input.buer.percent,mapred.job.shue.merge.percentandmapred.inmem.merge.threshold,inuencethesizeofthebuer.FortheR-Heavyimplementation,thecomputationtakesplaceinthereducephasewhichrequiressucientmemoryspace.Thus,thesizeofthebuerusedtotemporarilystoretheoutputofeachmaptaskislimitedsothattherestofthememoryissucientforcomputations.Figure 4-10 and 4-11 showtwogroupsof 77


Figure4-9. Inuenceoftheparametermapred.reduce.parallel.copiesontheexecutiontime experimentalresults.Oneshowstheeectofmapred.job.shue.input.buer.percentandmapred.job.shue.merge.percentontheexecutiontime,andtheothershowstheeectofmapred.job.shue.input.buer.percentandmapred.inmem.merge.thresholdontheexecutiontime.Ifanyofthreeparametersissettoolarge,thecomputationcannotnishduetothelackofmemoryspace.Thestarsinthegureindicatethattheexperimentscannotbenishedduetothelackofmemoryspaceforthereducetaskcomputation. Figure4-10. Inuenceofthereducephasebuersize(percent).Thestarsinthegureindicatethattheexperimentscannotbenishedduetothelackofmemoryspaceforthereducetaskcomputation. 4.2.3OtherHadoopFeaturesmapred.job.reuse.jvm.num.tasksistheparameterthatdecideshowmanytasksshareaJavavirtualmachine.SinceeachtaskusesanindependentJVM,taskswithinthesamenodeshareaJVMsequentially.Whentheparameterislessthan10,theexecutiontime 78


Figure4-11. Inuenceofthereducephasebuersize(records) ofthemapphasedecreasesasthenumberoftaskssharingaJVMincreaseswhereastheexecutiontimeofthereducephaseincreases.Whentheparametervalueisaround5,thetotalexecutiontimeisminimal. Figure4-12. Inuenceoftheparametermapred.job.reuse.jvm.num.tasksonexecutiontime mapred.reduce.slowstart.completed.mapsistheparameterwhosevalueistheproportionofthemapphasethathasnishedwhenthereducephasestarts.Typically,thisissettoalargevaluesuchthatallavailableresourcesarededicatedtothemapphaseforalargejob.Figure 4-13 showstheexperimentalresultsofmeasuringtheinuenceofthisparameterontheexecutiontime.Astheparameterincreases,theexecutiontimeincreasesaswell.Aftertheparameterexceeds0.3,theexecutiontimestartsdecreasing.Whentheparameterisintherange[0.7,0.8],theexecutiontimeisminimal.Thenitincreasesagain. 79


Figure 4-13. Influence of the parameter mapred.reduce.slowstart.completed.maps on execution time

The parameter mapred.compress.map.output is a boolean value deciding whether to compress the outputs of map tasks, and the parameters mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution enable speculative execution in the corresponding map or reduce phase. The experimental results showing the impact of those parameters on the execution time are summarized in Table 4-2. From the table, when the intermediate data generated by a Hadoop job are compressed and speculative execution is enabled, the job performs better than in the scenario where all three parameters are set to false.

Table 4-2. Experimental results for the compression and speculative-execution parameters

    Configuration parameter                      Execution time (true)   Execution time (false)
    mapred.compress.map.output                   433.168                 497.326
    mapred.map.tasks.speculative.execution       418.156                 433.168
    mapred.reduce.tasks.speculative.execution    417.247                 433.168


4.2.4PutItAllTogetherTherearemorethan190dierentcongurationparametersforeachHadoopjob[ 21 ].Amongallparameters,thoseinTable 3-1 andTable 3-2 usuallyhavesignicantinuenceontheperformanceofajob.WehaveevaluatedtheinuenceofthoseparametersontheexecutiontimeoftheR-Heavyimplementation.Experimentalresultsshowimprovementswhentuningsomeofparameters.AsshowninSection4.2.2,parametersinterferewitheachothersuchthatwhenaparameterischangedtoadierentvalue,theoptimalvalueofanotherparametermaychangeaswell.Forexample,whentheio.sort.mbissetto50,theoptimalperformanceoftheR-Heavyimplementationisatthetimethatio.sort.spill.percentis0.6.Whentheio.sort.mbissetto300,theoptimalperformanceoftheR-Heavyimplementationisatthetimethatio.sort.spill.percentis0.1.Wetrytominimizetheinterferencebytuningpossiblerelatedparameterstogether,forexample,io.sort.mbandio.sort.spill.percent.However,otherparametersmaystillinuenceeachother.Bettermethodologiesormoreexperimentsarerequiredinthefutureresearch.Inthissection,acomparisoniscarriedoutontheR-Heavyimplementationbetweentheexecutiontimepriortotuningthecongurationparametersandtheexecutiontimeaftertuningtheparameters.Accordingtotheexperimentsabove,thevaluesofcongurationparametersaftertunedaresummarizedinTable 4-3 .Whenaparameterhasveryslightinuence,ornoinuence,ontheperformanceofajob,thedefaultvalueinTable 4-1 isused.Otherwise,thevalueischosenbasedontheexperimentalresultspresentedintheprevioussections.Figure 4-14 showsexperimentalresultsofthecomparison.Aftertuningcongurationparameters,thetotalexecutiontimetoprocess1TBdatadecreasesbyapproximately38%.Boththemapphaseandthereducephaseexecutiontimedecreaseduetochangingthecongurationparameters.Althoughcongurationparametersinuenceeachother,thestatictuningmethodologyusedintheexperimentstillbringsanimprovementaround38%. 81


Table 4-3. The values of the configuration parameters after tuning

    mapred.map.tasks: 100-1100
    dfs.block.size: 1 GB
    mapred.reduce.tasks: 16
    io.sort.mb: 50 MB
    io.sort.spill.percent: 0.6
    io.sort.factor: 64
    mapred.compress.map.output: true
    mapred.reduce.parallel.copies: 10
    mapred.job.shuffle.input.buffer.percent: 0.7
    mapred.job.shuffle.merge.percent: 0.66
    mapred.inmem.merge.threshold: 1000
    mapred.map.tasks.speculative.execution: true
    mapred.reduce.tasks.speculative.execution: true
    mapred.job.reuse.jvm.num.tasks: 5
    io.file.buffer.size: 64 KB
    mapred.reduce.slowstart.completed.maps: 0.8
    mapred.tasktracker.map.tasks.maximum: 8
    mapred.tasktracker.reduce.tasks.maximum: 4
    mapred.child.java.opts: 1536 MB

Figure 4-14. Comparison of the execution time of the R-Heavy implementation before and after tuning the configuration parameters

4.3 Algorithmic Optimization Experiments

A sliding-window algorithm was proposed in Section 3.2 to optimize the first stage of Algorithm 1. Since both algorithms require the data to be analyzed to be collected on one node, the experiments were conducted on a single machine. First, we evaluate the correctness of our approach for computing the average and the standard deviation of a set of records.


Second, we compare the execution time of the first stage of Algorithm 1 with that of the sliding-window algorithm.

4.3.1 Numerical Analysis of the Standard Deviation Computation

Mathematically, the results of the new approach should be the same as those of the two-pass algorithm. However, they may vary numerically when the computations are performed on computers, because of rounding errors. We evaluate the approach by comparing the results of the two-pass algorithm with those obtained using the sliding-window approach.

4.3.1.1 Experiments setup

The machine used for testing is one of the machines in the local cluster, containing 8 Intel Xeon 2.33 GHz processors and 16 GB of memory, with Java 1.7 installed. We use a random generator from the Java standard library to generate data within the ranges [1,10] and [1,100]. Both the computation results and the execution time are compared between the two-pass algorithm and the sliding-window approach.

4.3.1.2 Results analysis

Tables 4-4 and 4-5 show the average and the standard deviation computation results for numbers of records ranging from 15 to 150,000,000. The records are randomly generated. The input consists of 15 units and each unit contains the same number of records. The two-pass algorithm computes the average and the standard deviation for units 2 through 15 using Equation 3-1 and Equation 3-2, while the sliding-window approach assumes that the results of the 1st unit and of units 1 through 14 are known and uses Equation 3-4 and Equation 3-5 to compute the results for units 2 through 15. Both the average and the standard deviation results show rounding errors that increase with the number of records. However, the total number of records for each measurement unit (or dCU) is usually much less than 150,000,000. The maximum error of 10^-8 is acceptable for the analysis system. If the number of records for each dCU increases dramatically, known numerical techniques can be used to decrease the rounding errors [38].


Table 4-4. Comparison of the averages of records calculated using the two-pass algorithm and the sliding-window approach for different numbers of records

    No. Records    Average (two-pass)     Average (sliding window)
    15             4.220502472468472      4.220502472468472
    150            4.928343844303417      4.928343844303417
    1500           5.138686016771322      5.138686016771323
    15,000         4.988890048565302      4.988890048565292
    150,000        50.00357768201342      50.00357768201384
    1,500,000      50.00160774701869      50.00160774702093
    15,000,000     50.006454004308544     50.00645400430462
    150,000,000    49.99716727378571      49.99716727378661

Table 4-5. Comparison of the standard deviations of records calculated using the two-pass algorithm and the sliding-window approach for different numbers of records

    No. Records    Standard deviation (two-pass)    Standard deviation (sliding window)
    15             2.553344106769994                2.553344106769994
    150            3.006359919401285                3.006359919401288
    1500           2.8565035556747227               2.8565035556747183
    15,000         2.891133980641171                2.891133980641179
    150,000        28.856446505767977               28.85644650576638
    1,500,000      28.854900694855814               28.854900694855992
    15,000,000     28.86462866500038                28.864628665002964
    150,000,000    28.865893630970138               28.865893631013957

Figure 4-15 shows the execution time of both the two-pass algorithm and the sliding-window approach for computing averages and standard deviations. I/O is not considered in this experiment. The execution times of four cases are shown in the figure. First, we measure the execution time of the two-pass algorithm (referred to as Two-pass total). Second, the total execution time of the sliding-window approach, including both the time for computing the previous results and the time for computing the new average and standard deviation, is measured (referred to as Sliding-window total). Third, it is assumed that the computation results from Day-1 to Day-14 are known (referred to as Sliding-window std). Fourth, it is assumed that the computation results from Day-1 to Day-14 and for each individual day are both known (referred to as Sliding-window daily). As shown in the figure, if the previous computation results are known, the sliding-window algorithm outperforms the two-pass algorithm significantly. The case Sliding-window std


Figure4-15. Theexecutiontimesofthetwo-passalgorithmandtheslidingwindowapproachtocomputeaveragesandstandarddeviations assumesthatN,,SandS2inEquation 3{5 areknown.ThecaseSliding-windowdailyfurtherassumesthatN1,1,S1andS21inEquation 3{5 areknown.SincecaseSliding-windowdailyhastheresultsforDay-1,thecomputationtimeincreasesalittleslowerthanthecaseSliding-windowstdwhenthenumberofrecordsincreases.Sincetheslidingwindowapproachcomplicatesthealgorithm,theexecutiontimeofthecaseSliding-windowtotalislongerthanthecaseTwo-passtotal.Asthenumberofrecordsincreases,thetotalexecutiontimeoftwo-passalgorithmisabout14timeslongerthanthesliding-windowapproachwiththeassumptionthatpreviouscomputationresultsareknown. 4.3.2Sliding-WindowAlgorithmPerformanceAnalysisInthissection,weevaluatetherststageoftheslidingwindowalgorithmbycomparingtheexecutiontimeoftherststageofAlgorithm 1 andthesliding-windowalgorithm. 4.3.2.1ExperimentssetupSeveralassumptionsaremadewhenevaluatingthesliding-windowalgorithm.Forsimplicity,eachrecordgeneratedinthissectioncontainsanintegerasatimestampintherangefrom1to15.Itisassumedthatthecomputationforrecordswhosetime 85


stampsareintherange[1,14]isnishedandresultsarestoredinles.Dierentorderingstrategiesareusedtostoresortedrecords,aglobal-orderandatime-stamp-order.Intheglobal-orderstrategy,recordsarestoredinsortedorderwithoutconsiderationofthetimestamp.Inthetime-stamp-orderstrategy,recordsarerstorderedbythetimestamp,thenorderedbythevaluetobeanalyzed.Figure 4-16 showstwoorderingstrategies.R0,R1,R2,...aresortedrecords.Eachrecordisassociatedwithatime-stampT1,T2,T3,...IntheleftleinFigure 4-16 ,recordsareorderedaccordingtovaluewithoutconsiderationoftheirtime-stamps.Intherightle,recordsarerstsortedbytime-stamps,thensortedbyvalues. Figure4-16. Theglobal-orderstrategyandthetime-stamp-orderstrategy Accordingtodierentdataorderingstrategiesanddierentalgorithms,therearefourcasestobeconsidered,Algorithm 1 withtheglobal-orderstrategy(orAlg.-1Sort),Algorithm 1 withthetime-stamp-orderstrategy(orAlg.-1Time),Algorithm 2 withtheglobal-orderstrategy(orAlg.-2Sort),andAlgorithm 2 withthetime-stamp-orderstrategy(orAlg.-2Time). 4.3.2.2ResultsanalysisAccordingtotheanalysisinSection 3.2.2.2 ,sortingandthestandarddeviationcomputationaremosttime-consumingcomputationsintherststageofAlgorithm 1 and 2 .Thus,thesortingandstandarddeviationcomputationtimearemeasuredseparately 86


inFigure 4-17 .Wealsoconsiderthecomputationtimeversustheinput/outputtime,asshowninFigure 4-18 .ThetotalexecutiontimeofdierentcasesarecomparedinFigure 4-19 .Forcomparisonpurposes,resultsarealsoshownbydierentcasesinFigure 4-20 ASorttime BStandarddeviationcomputationtimeFigure4-17. Sortingandthestandarddeviationcomputationtime BasedonSlidingWindowSortalgorithmdiscussedinSection 3.2.2.1 ,weimplementedmergesortinthecaseAlg.-2Sortandmulti-waymergesortinthecaseAlg.-2Time.ForcasesAlg.-1SortandAlg.-1Time,thesortmethodinJavaisused.AsshowninFigure 4-17 -A,ThesortingtimeforcasesAlg.-2SortandAlg.-1Sortareshortest,becausemostoftherecordsaresorted.ThecaseAlg.-2Timeperformsworst.Multi-waymergesortthatweimplementedcannotbeatthesortingalgorithmimplementedinJava.TheresultsofthestandarddeviationcomputationtimearesimilartotheresultsinFigure 4-15 .InSection 4.3.1.2 ,onlythecomputationofthestandarddeviationandtheaverageareperformedonrecordswhileinthissection,thestage1ofcomputationalgorithmisperformed.FromFigure 4-17 -B,whenmostoftherecordsaresorted,thecomputationtimefortheaverageandthestandarddeviationareshorter.Inaddition,thesortingtakeslongertimethancomputationofthestandarddeviation.Thetotalcomputationtimeisthecombinationofthesortingtimeandthestandarddeviationcomputationtime,asshowninFigure 4-18 -A.ThecaseAlg.-2Sortrunsfaster 87


ATotalcomputationtime BIOtimeFigure4-18. ComputationtimeandI/Otimemeasurement thananyothercases.Asthenumberofrecordsincreases,Alg.-1SortrunsapproximatelytwotimesslowerthanAlg.-2Sort.SincethesortingtimeofAlg.-2Timeislongerthanothercases,itcausesthetotalcomputationtimeofAlg.-2Timetobetheslowestcase.Withregardstotheinput/outputtimeconsumedbyprograms,theyareapproximatelythesameforallcases.Comparingthescalesofy-axisinFigure 4-18 ,theI/Otimedominatestheprogram'sexecutiontime. Figure4-19. Thetotalexecutiontimeforfourcases 88


ThetotalexecutiontimeoffourcasesarepresentedinFigure 4-19 .SinceI/Odominatestheexecutiontime,thetotalexecutiontimeoffourcasesareveryclosetoeachother.ThetotalexecutiontimeofthecaseAlg.-2SortandAlg.-2TimeareslightlyshorterthanthecaseAlg.-1SortandAlg.-1Time. ACaseAlg.-1Sort BCaseAlg.-1Time CCaseAlg.-2Sort DCaseAlg.-2TimeFigure4-20. Executiontimefordierentcases InordertocomparedierentpartsofcomputationtimeandI/Otime,allresultsarepresentedfromadierentperspectiveinFigure 4-20 .Allexecutiontimeforthesamecasearesummarizedinthesamegure.Figure 4-20 -AshowsthecaseAlg.-1Sort,Figure 4-20 -BshowsthecaseAlg.-1Time,Figure 4-20 -CshowsthecaseAlg.-2Sortand 89


Figure 4-20 -DshowsthecaseAlg.-2Time.TheexecutiontimeismainlydecidedbyI/Oprocessingtime.Asshowninfourcases,I/Otimeisatleastvetimeslongerthanthetotalcomputationtime.Thesortingtimedominatesthecomputationtime,whichisclearlyvisibleinFigure 4-20 -D.IncasethatexperimentsareinuencedbyCPUcacheormemory,anewleisgeneratedforeachexperiment.Thus,everyexperimentcanbeconsideredasanindependentevent.Inaddition,allresultsareaveragesofveexperimentalresults. 90



CHAPTER 5
CONCLUSIONS

The purpose of this research work is to build a better statistics-based system that allows for large-scale data analysis. We have implemented the system on the Hadoop platform, and the implemented system shows better performance when compared to the legacy system. The system is optimized from two perspectives: tuning different Hadoop configuration parameters and optimizing the computation algorithm. In this chapter, we conclude the research work with a summary of the thesis, a discussion of the implementations and experiments, and possible directions for future research.

5.1 Summary

In this thesis, we present the SMMDAS on Hadoop. Chapter 1 introduces the legacy SMMDAS for the current semiconductor manufacturing process in order to show the requirements for a scalable, efficient, fault-tolerant distributed system for massive data analysis during the semiconductor manufacturing procedure. Hadoop, a software platform that allows for massive data analysis across clusters of computers, is chosen as the underlying platform. The background on the MapReduce programming model and the Hadoop job workflow is discussed in Chapter 1 as well. Different data mapping and implementation strategies are presented in Chapter 2. Chapter 3 discusses optimization of the system. Chapter 4 presents all experimental evaluation results.

5.1.1 Data Mapping

The SMMDAS is designed and implemented using a data-oriented approach. We propose three different methods of mapping data from the database to files (arbitrarily, per-iCU, and per-dCU) and three different methods of mapping data from files to Hadoop (one-to-one, many-to-one, and one-to-many). When the size of each iCU is small, experimental results show that the per-iCU many-to-one data mapping strategy outperforms the other data mapping strategies. Since the size of each iCU is small, all
computations in each iCU can be finished in memory. The per-iCU many-to-one data mapping strategy reduces the number of map tasks launched during a job execution, so the performance of the system is improved. As the size of each dCU increases, we have observed that the computations in each iCU can no longer be finished in memory. Thus, an iterative MapReduce algorithm, consisting of multiple MapReduce jobs in each analysis job, may need to be considered for the system, as discussed in Appendix B.

5.1.2 Computation Distribution

To map the computation to the Hadoop platform, we propose three different implementations: M-Heavy, R-Heavy and MR-Balanced. For the M-Heavy implementation, all computations of Algorithm 1 for each iCU are performed in the map phase, while for the R-Heavy implementation, all computations for each iCU are performed in the reduce phase. In the MR-Balanced implementation, the first stage of Algorithm 1 is performed in the map phase and the second stage is performed in the reduce phase (see the sketch below). Different implementations benefit different situations. Table 2-3 in Chapter 2 summarizes the limitations that the different data mapping strategies impose on the implementations. Table 5-1 compares the implementations by summarizing their advantages and disadvantages. When the size of each iCU is small, the computation can be performed in memory in the map phase. The M-Heavy implementation decreases the intermediate data transmission between the map and reduce phases and thus outperforms the other implementations. When the size of each dCU increases, the computation no longer fits in memory. The MR-Balanced implementation distributes the computations between the map and reduce phases so that the computation can still be finished in one MapReduce job. However, the M-Heavy and the MR-Balanced implementations both require specific data mapping strategies. The R-Heavy implementation places no restriction on the data mapping. Thus, it is useful when designing a Hadoop-only analysis system in which the database is removed.
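
To make the mapping concrete, the following is a minimal structural sketch of the MR-Balanced implementation using the Hadoop Java API. The class names, the key/value choices, and the assumption that each map input value carries the records of one data group are illustrative only, and the bodies of the two stages are elided; this is not the thesis code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: stage 1 of Algorithm 1 on one group of records, keyed by the
// iCU identifier so that stage 2 sees all partial results of one iCU together.
class Stage1Mapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text group, Context ctx)
            throws IOException, InterruptedException {
        String iCUKey = extractICUKey(group);   // the Kr shared by the group
        String partial = runStage1(group);      // sort, outlier limits, Cpk, ...
        ctx.write(new Text(iCUKey), new Text(partial));
    }
    private String extractICUKey(Text group) { return ""; /* parse Kr */ }
    private String runStage1(Text group) { return ""; /* stage 1 of Algorithm 1 */ }
}

// Reduce phase: stage 2 of Algorithm 1 over all partial results of one iCU
// (distance computation and the equivalence test).
class Stage2Reducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text iCUKey, Iterable<Text> partials, Context ctx)
            throws IOException, InterruptedException {
        String decision = runStage2(partials);  // t-test based final decision
        ctx.write(iCUKey, new Text(decision));
    }
    private String runStage2(Iterable<Text> partials) { return ""; }
}

The same skeleton describes the M-Heavy and R-Heavy variants by moving both stages into the map side (with an identity reducer) or into the reduce side (with a mapper that only extracts and emits records), respectively.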

Table 5-1. The advantages and disadvantages of the different implementations

  Implementation   Advantages                                   Disadvantages
  M-Heavy          Shortest execution time                      Needs help from the database to group the data
  MR-Balanced      Requires less memory space for each task     Needs help from the database to group the data
  R-Heavy          Does not need support from the database      Longest execution time

5.1.3 System Optimization

The system is optimized from two perspectives. First, different Hadoop configuration parameters are tuned for the R-Heavy implementation. Second, an algorithmic optimization is carried out for the first stage of the computation. On the one hand, there are many configuration parameters for each job executed on the Hadoop platform. We have analyzed the workflow of a Hadoop job in detail so that the parameters that may have a significant influence on the performance of a Hadoop job are tuned. On the other hand, in the first stage of the computation, the most time-consuming operations are sorting and computing the standard deviation. We consider re-using the results of the previous computation and propose a sliding-window algorithm to optimize Algorithm 1.

5.1.4 Experimental Results Analysis

In the legacy system, the computation performed on a 1 TB data set takes around 8 to 10 hours [6]. In our experiments, the M-Heavy implementation using the per-iCU many-to-one mapping strategy runs for 1 hour and 40 minutes when analyzing a 1 TB data set. Using the static approach to tune the Hadoop configuration parameters, the R-Heavy implementation takes 1 hour and 13 minutes to analyze a 1 TB data set. Thus, the performance of the system is improved by up to 82.7% in the current experiments. In addition, the results in Chapter 4 show that the execution time of the first-stage computation is improved by 50% thanks to the algorithmic optimization. The experiments, however,
reveal that I/O time dominates the execution time of the whole program. Future research is needed to improve the I/O time.

5.2 Towards a Better System

In the legacy SMMDAS, the data collected from the manufacturing process are first stored in a file system. Then, the data are transferred from the file system to a database. Before the analysis, the data are moved from the database to another file system, and the computing nodes retrieve the data from that file system. After the analysis, the data are transferred to another database. The legacy system architecture is shown in Figure 5-1-A. The current system replaces the legacy computing nodes with a Hadoop cluster, as shown in Figure 5-1-B. However, the data to be analyzed are still transferred from the sensors to HDFS through several steps. In this section, we propose an architecture that is completely different from the legacy architecture in that the database is removed from the system design. As shown in Figure 5-1-C, the proposed system can be considered a Hadoop cluster without any involvement of the database. The data collected from the manufacturing process are stored directly in HDFS (a sketch of this ingestion path is given below). As discussed in previous chapters, the R-Heavy implementation extracts the useful information during the map phase and analyzes the data during the reduce phase, so the SMMDAS can be built without multi-step data transmission. A background script may be required to periodically remove useless data. If the functionality of a database is required, a high-level software component, Hive, can be installed on the Hadoop cluster. Hive is a data warehouse solution for large-scale data management on the distributed file system [40]. It is a high-level tool that supports a SQL-like programming language. Applications built on Hive can rely on the SQL-like interface, and the low-level map/reduce tasks are automatically generated by the software component. The proposed architecture reduces the data transmission between file systems and databases. In addition, we envision two other possible improvements.
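
As an illustration of the direct ingestion path, the following minimal Java sketch writes one batch of raw sensor records straight to HDFS with the Hadoop FileSystem API. The NameNode address, directory layout and record format are assumptions made for the example and are not part of the proposed design.

import java.net.URI;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SensorIngest {
    // Writes one batch of raw sensor records as a new file under /sensor-data.
    public static void writeBatch(List<String> records, String batchId) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:8020"), conf);
        Path file = new Path("/sensor-data/batch-" + batchId + ".txt");
        try (FSDataOutputStream out = fs.create(file)) {
            for (String record : records) {
                out.writeBytes(record);
                out.writeBytes("\n");
            }
        }
    }
}

With the data already in HDFS, the R-Heavy job can read these files directly, and the periodic clean-up script mentioned above only has to delete old files from the same directory.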

Figure 5-1. Different system architectures. A) Legacy system architecture. B) Experimental system architecture. C) Proposed system architecture.

5.2.1 Automatically Tuning Hadoop Configuration Parameters

The experiments in Chapter 4 show that the Hadoop configuration parameters significantly influence the performance of the system. There are 190 or more different parameters for each Hadoop job, so manual adjustment for every job is impractical. Thus, it is necessary to investigate automatic tuning. Related work on the automatic optimization of Hadoop applications [34, 41] may inspire future efforts to achieve optimal configuration parameter settings.
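
For reference, per-job parameters of this kind are set programmatically on the job's Configuration object. The snippet below is a minimal sketch; the parameter names shown are MRv1-style names whose exact spelling depends on the Hadoop version, and the values are illustrative, not the tuned settings used in Chapter 4.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJob {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        // Examples of per-job knobs; the values here are illustrative only.
        conf.setInt("mapred.reduce.tasks", 8);                // number of reduce tasks
        conf.setInt("io.sort.mb", 200);                       // map-side sort buffer (MB)
        conf.setBoolean("mapred.compress.map.output", true);  // compress intermediate data
        Job job = Job.getInstance(conf, "smmdas-analysis");
        // ... set the mapper, reducer, input and output paths as usual ...
        return job;
    }
}

An automatic tuner in the spirit of Starfish [34] would choose these values per job instead of relying on a fixed, manually selected configuration.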

5.2.2 Using Solid-State Drives

In Chapter 4, our experimental results show that the disk I/O time dominates the total execution time of the SMMDAS. Hadoop tries to assign tasks to the nodes where the data are located so that the data transmission time between nodes is minimized. However, the disk I/O time itself cannot be decreased when using Hard Disk Drives (HDDs). Solid-State Drives (SSDs) are proven to offer substantially better performance than HDDs. Work has been done [42] to compare the performance and the price of SSDs and HDDs. For sequential-read applications, the performance ratio of SSD to HDD is around 2. For random-access applications, the performance ratio of SSD to HDD is around 60. In most Hadoop applications, reading and writing take place sequentially. Other works [43, 44] also compare the performance of a Hadoop cluster using SSDs versus HDDs. With appropriate settings, [43] reports that the execution time of I/O-bound applications is improved by a factor of 3 using SSDs. For the SMMDAS, the disk I/O time is around 10 times larger than the computation time, as shown in Chapter 4. The current cluster is built with HDDs. Thus, if we replace the HDDs with SSDs and set the configuration parameters of the applications appropriately, the total execution time is predicted to improve by roughly a factor of 3 (a rough estimate is given at the end of this section). State-of-the-art technologies [44] have shown that the speedup of Violin's scalable flash memory array over HDD is 7 for sequential-read applications, while the write speedup is 2.5. Since there are many more read operations than write operations in the SMMDAS, the speedup with the scalable flash memory array storage is predicted to be around 5. In summary, many Hadoop applications are I/O bound [1]. Our implementations, especially when optimizing the system towards real time, also reveal that I/O time dominates the total execution time of the analysis system. SSDs are proven to deliver shorter execution times than HDDs [42]. Studies have applied SSDs in a Hadoop cluster [43, 44] and shown a performance improvement from the decreased I/O time. Thus,
upgrading the hardware is another possible optimization strategy to reduce the execution time of the SMMDAS.
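
A rough Amdahl-style estimate supports these predictions; it assumes, as measured in Chapter 4, that I/O accounts for about 10/11 of the run time and that the non-I/O part is unchanged. If the I/O portion is sped up by a factor of 3, the overall speedup is bounded by

\[ S = \frac{1}{\frac{1}{11} + \frac{10/11}{3}} = \frac{33}{13} \approx 2.5, \]

which is broadly consistent with, though slightly below, the factor-of-3 prediction. The same formula with an I/O speedup of 7 gives 77/17, or about 4.5, close to the factor-of-5 figure quoted for the flash memory array.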

APPENDIX A
SETTING UP A HADOOP CLUSTER USING CLOUDERA MANAGER

This appendix is a basic tutorial on how to set up a Hadoop cluster using the Cloudera Manager. For testing purposes, the experimental installation takes place on a VMware ESXi server on a local cluster. It starts from creating a Virtual Machine (VM) and installing the operating system, and then sets up the cluster. The VM creation and replication steps can be skipped if physical machines are used to set up the cluster.

A.1 Introduction

A typical Hadoop cluster consists of one master node and multiple slave nodes. There are multiple ways to set up a Hadoop cluster. Cloudera Manager [45] provides users an easy way to set up, manage and maintain a Hadoop cluster. In this tutorial, we set up an experimental Hadoop cluster on a VMware ESXi server. The cluster consists of one master node and two slave nodes. The master node is a dedicated VM.

A.2 VM Creation and Operating System Installation

In this section, we use the vSphere Client to create VMs on the VMware ESXi server [46]. The details of how to create a VM on the VMware ESXi server are skipped. After creating a VM, an operating system needs to be installed. We choose CentOS 6.4. The following subsections detail the installation of the operating system and the environment setup.

A.2.1 Download CentOS

Choose a distribution image from the CentOS website. We downloaded CentOS-6.4-x86_64-minimal.iso. The minimal installation is appropriate for a server, since it does not contain a GUI.

A.2.2 Install CentOS 6.4

First, power on the VM created on the cluster and then open the console. Second, in the rightmost toolbar of the console, choose "connect/disconnect the CD/DVD devices of the virtual machine".

Figure A-1. The welcome page from the CentOS image

Third, choose an appropriate way to connect to the datastore where the image of the OS is stored. In our experimental setup, we downloaded the OS to the local machine; thus, "connect to ISO image on local disk..." is chosen to connect the VM to the local storage. Fourth, press any key in the console and the system will boot from the image, as shown in Figure A-1. Fifth, choose "Install or upgrade an existing system". Sixth, follow the instructions to install CentOS on the VM.

A.2.3 Configure CentOS

After the installation of CentOS, reboot the system and then configure the system environment. To be able to change system configurations conveniently, log in as the root user.

First, configure the network. Edit the script file /etc/sysconfig/network-scripts/ifcfg-eth0 and replace the contents of the file with the following:

DEVICE=eth0
HWADDR=xx:xx:xx:xx:xx:xx (use the default value)
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=dhcp

In this experimental setup, we used dynamic IP assignment from DHCP. The IP address can also be set statically. After setting up the IP address, restart the network service with the command

$ service network restart

Test the network connection with the command

$ ping www.google.com

Figure A-2. Successful setup of the network connection

Figure A-2 shows the command interface when the network is set up successfully. Second, update the installed operating system with the command

$ yum update

Third, install Perl with the command

$ yum install perl

Fourth, install NTP (the Network Time Protocol is used to synchronize all hosts in the cluster) with the command

$ yum install ntp

Fifth, edit the file /etc/ntp.conf, replacing the existing NTP servers with the local NTP servers. In this experimental setup, the NTP servers used at the University of Florida are the following:

server ntps2-3.server.ufl.edu
server ntps2-2.server.ufl.edu
server ntps2-1.server.ufl.edu

Sixth, disable SELinux by editing the file /etc/selinux/config. Seventh, in order to allow the nodes to communicate, disable the firewall with

$ service iptables save
$ service iptables stop
$ chkconfig iptables off

Eighth, set the hostname and FQDN. Set the hostname by editing the file /etc/sysconfig/network:

HOSTNAME=master

Then, edit the file /etc/hosts. The IP address 127.0.0.1 corresponds to the local IP, while "masterIP" and "slaveIP" should be set to the IP addresses that are used for accessing the master node and the slave nodes. If the hosts cannot be resolved by DNS, every host has to be specified in the file /etc/hosts:

127.0.0.1 localhost.localdomain localhost
"masterIP" hello.world.com master
"slaveIP" hello1.world.com slave1

Ninth, reboot the system with the command

$ reboot now

Tenth, install the SSH server and client. In this case, the OS distribution contains the OpenSSH server and client. If the installed distribution does not contain the SSH package, refer to CentOS SSH Installation and Configuration for the installation of the SSH server and client. Eleventh, create an SSH private/public key pair. This step can be skipped if you would like to SSH with a password. Create an SSH private/public key pair with the command

$ ssh-keygen

Copy the public key to every remote host with

$ ssh-copy-id -i /root/.ssh/id_rsa.pub remote-host

Test the SSH connection without a password with

$ ssh remote-host

A.2.4 Replicate VMs

In order to preserve the initial system settings, create a snapshot of the VM that has been configured so far. Copy two VM instances onto the same cluster. One problem that we encountered when copying the VM instances is that eth0 is automatically mapped to the old MAC address, so the corresponding files have to be edited manually. The file /etc/udev/rules.d/70-persistent-net.rules should be edited from the state in Figure A-3 to the state in Figure A-4.

Figure A-3. The content of the file 70-persistent-net.rules before updating

Figure A-4. The content of the file 70-persistent-net.rules after updating

Then, the file /etc/sysconfig/network-scripts/ifcfg-eth0 should be updated so that HWADDR is consistent with the value in the file /etc/udev/rules.d/70-persistent-net.rules. Reboot the system with the command

$ reboot now

Then, set the hostname of each node correspondingly by editing the files /etc/sysconfig/network and /etc/hosts.

A.3 Cloudera Manager and CDH Distribution

First, download the Cloudera Manager installer from its website. Second, make the installer executable with

$ chmod u+x cloudera-manager-installer.bin

Third, run the installation package with

$ ./cloudera-manager-installer.bin

Fourth, access the Cloudera Manager using any web browser through the address http://ClouderaManagerNode:7180. The default username and password are both "admin". Figure A-5 shows the login web interface of the Cloudera Manager.

Figure A-5. The login web interface of the Cloudera Manager

Fifth, continue to the next page and specify the hosts for the CDH cluster installation. Type the IP addresses or hostnames of the nodes for the Hadoop cluster. Figure A-6 shows the web interface where nodes are added to the cluster. In this experiment, we added one master node and two slave nodes to the cluster. After the initial cluster setup, new nodes can still be added to (or removed from) the cluster through the Cloudera Manager.

Figure A-6. The web interface for the user to specify the nodes on which Hadoop will be installed

Continue with the default settings: install using Parcels, and install both CDH and IMPALA. Figure A-7 shows the web interface on which users can specify the installation package and other appropriate installation options.

Figure A-7. The web interface for the user to specify the installation package and other options

Since root access is required to install CDH, specify the SSH credentials by typing the root password, or upload the private key of root. Note that we tried to use a sudo user other than root with SSH private/public key access. The Cloudera Manager,
however, cannot finish the credential verification in that case. Figure A-8 shows the web interface for users to specify the appropriate credential method to access the cluster.

Figure A-8. The web interface for the user to specify the appropriate credential method to access the cluster

There may be failed nodes when installing CDH. Figure A-9 is an example of a failed installation. In this particular case, we had not installed Perl on CentOS; after installing Perl, the failed node worked well. The process will continue installing with Parcels. If the installation process goes well, it should look like the interface shown in Figure A-10. After the installation of the package, continue with the Host Inspector, as shown in Figure A-11. This process will fail if the slave nodes cannot find the way back to the master node, so make sure to specify the IP address of each host if they cannot be resolved by DNS. Choose the services to be installed on the cluster and run the Inspect Role Assignments step, as shown in Figure A-12. Continue with "Use Embedded Database" and wait until all selected services have been started. After all services are installed successfully, the cluster can be started following the instructions on the web interface, shown in Figure A-13.

Figure A-9. The web interface when the installation fails

Figure A-10. The web interface when the installation process goes well

Figure A-11. The web interface when the installation of the package is successful

Figure A-12. The web interface when the installation of the package is successful

Figure A-13. The web interface when all services have been installed and the cluster is ready to work

Then, the CDH Hadoop cluster is ready to be used. Figure A-14 shows the web interface when all started services are working appropriately.

Figure A-14. The web interface showing the status of the available services

APPENDIX B
A PARALLEL ALGORITHM FOR STATISTICS-BASED ANALYSIS

If the size of each iCU is small enough, the sequential algorithm (Algorithm 1) can be finished on a single node. If the size of each iCU is very large, however, the sequential algorithm cannot be finished on a single node, and parallelizing the computation is required to finish the analysis job. Many statistics functions can be parallelized. The following is the proposed iterative MapReduce parallel algorithm.

Similar to Algorithm 1, the input of the procedure is a linked list (inputList). Since the analysis is performed independently on each iCU, the data in the same inputList share the same Kr. The ith element in the list is a 1D double-precision array, Ai, representing a dCU. Each value in Ai corresponds to an AVG value extracted from the records that share the same Kt. In this parallel algorithm, each Ai is further partitioned into m blocks, and Aij denotes the jth block of Ai.

Since it is assumed that the size of each dCU is too large to be computed on a single node, the algorithm first partitions Ai into several small blocks so that each block of Ai can be processed on a single node. For each Aij, the algorithm sorts the data set locally and computes the average, the summation, and the summation of squares of Aij. This procedure is done independently on each Aij, which can be mapped to the map-phase execution of a MapReduce job. Second, after all blocks in Ai are sorted, the average and the standard deviation of Ai can be easily derived from the per-block results by Equation 3-4 and Equation 3-5, and a multi-way merge sort can be used for sorting Ai. Then, the algorithm computes the percentile values of Ai and computes the outlier limits using the percentile values. This procedure can be mapped to the reduce-phase execution of the MapReduce job mentioned previously. Third, the algorithm partitions Ai into small blocks and eliminates outliers from each Aij, which can be mapped to the map phase of a new MapReduce job. At this point, Stage 1 of Algorithm 1 has been completed by the iterative MapReduce algorithm, Algorithm 3.
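
The per-block quantities emitted by the map phase (count, sum, and sum of squares) are sufficient to recover the global average and standard deviation in the reduce phase. The following minimal Java sketch shows that combination step; it uses the plain pooled-sums form and the population standard deviation, so it may differ in detail from Equations 3-4 and 3-5, and the class name is ours.

public class PartialStats {
    long n;        // number of values in the block
    double sum;    // sum of the values
    double sumSq;  // sum of the squared values

    PartialStats(long n, double sum, double sumSq) {
        this.n = n; this.sum = sum; this.sumSq = sumSq;
    }

    // Accumulated in the map phase for one block Aij.
    static PartialStats ofBlock(double[] block) {
        double s = 0, s2 = 0;
        for (double v : block) { s += v; s2 += v * v; }
        return new PartialStats(block.length, s, s2);
    }

    // Merged in the reduce phase over all blocks of one Ai.
    PartialStats merge(PartialStats other) {
        return new PartialStats(n + other.n, sum + other.sum, sumSq + other.sumSq);
    }

    double mean() { return sum / n; }

    // Population standard deviation from the pooled sums; a numerically
    // stabler pairwise update (e.g., Chan et al. [38]) can be substituted.
    double stdDev() {
        double m = mean();
        return Math.sqrt(sumSq / n - m * m);
    }
}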

Algorithm 4 starts by picking the data set whose Cpk is largest as the reference data set. Then, it uses the average values from the previous stage to compute the average of all records belonging to the same iCU, excluding the reference data set. Second, the algorithm partitions each Ai into several small blocks and computes the Within-Distance and Between-Distance of each record. This procedure can be mapped to the map phase of a new MapReduce job. After computing the Euclidean distances of the records, the algorithm computes the average and the standard deviation of the Euclidean distances based on Equation 3-4 and Equation 3-5. Finally, the algorithm performs a pairwise Student's t-test using the previously computed results and makes a final decision (a sketch of the test statistic is given after Algorithm 4). The proposed parallel algorithm contains three MapReduce jobs, and each of the MapReduce jobs is executed iteratively on the cluster. Frequently creating multiple MapReduce jobs may bring significant overhead. Twister, proposed by Ekanayake et al. [20], one of the optimized MapReduce extensions, can be used to address this problem.

Algorithm 3 Procedure of the equivalence test (iterative MapReduce) - Part 1

Input: LinkedList inputList
  Ai denotes the ith 1D double-precision array in inputList
  Ai[k] denotes an element of Ai
  Aij denotes the jth block of Ai

for each DoubleArray Ai in inputList do
    Divide Ai into m blocks
    for each block Aij in Ai do
        Sort Aij
        Compute μij, Sij and S²ij
    end for
    Compute the average μi = (Σ_{j=1..m} Nij μij) / (Σ_{j=1..m} Nij)
    Compute the standard deviation σi based on Equation 3-5
    Merge-sort Ai
    Compute the percentile values of Ai
    Compute the lower and upper outlier limits LOL, UOL using the percentile values
    Divide the sorted Ai into m blocks
    for each block Aij in Ai do
        Eliminate all Aij[k] that are not in the range [LOL, UOL]
        Compute the new average μ'ij
    end for
    Compute the process capability index Cpk[i]
end for

Algorithm 4 Procedure of the equivalence test (iterative MapReduce) - Part 2

Pick the Ai with the largest Cpk as the reference set R
Compute μ' = (Σ N'ij μ'ij) / (Σ N'ij) over all Ai ≠ R
for each DoubleArray Ai in inputList do
    Divide Ai into m blocks
    for each block Aij in Ai do
        if (Ai == R) then
            Compute the Euclidean distance WRj[k] = sqrt((Rj[k] - μ'R)²)
            Compute the Euclidean distance BRj[k] = sqrt((Rj[k] - μ')²)
            Compute μ''ij, Sij and S²ij
        else
            Compute the Euclidean distance Wij[k] = sqrt((Aij[k] - μ'ij)²)
            Compute the Euclidean distance Bij[k] = sqrt((Aij[k] - μ'R)²)
            Compute μ''ij, Sij and S²ij
        end if
    end for
    Compute the average μ'' based on Equation 3-4
    Compute the standard deviation σ'' based on Equation 3-5
    Do a pairwise Student's t-test between WR (or BR) and Wi (or Bi)
    Make the final decision based on the result of the t-test
end for
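
Appendix B does not spell out the exact form of the test statistic; the following Java sketch assumes the standard pooled two-sample Student's t-test computed from the summary statistics produced above, and the class and method names are illustrative.

public class EquivalenceTest {
    // Pooled two-sample Student's t statistic from summary statistics.
    // (mean, sd, n) describe the distance values of the reference set
    // and of one tested data set, respectively.
    static double pooledT(double mean1, double sd1, long n1,
                          double mean2, double sd2, long n2) {
        double pooledVar = ((n1 - 1) * sd1 * sd1 + (n2 - 1) * sd2 * sd2)
                           / (double) (n1 + n2 - 2);
        return (mean1 - mean2) / Math.sqrt(pooledVar * (1.0 / n1 + 1.0 / n2));
    }
}

The statistic would then be compared against the critical value for n1 + n2 - 2 degrees of freedom at the chosen significance level to make the final decision.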

REFERENCES

[1] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.

[2] Apache. Welcome to Apache Hadoop!, 2013. Available from: http://hadoop.apache.org

[3] John A. Rice. Mathematical Statistics and Data Analysis. Duxbury Advanced. Cengage Learning, 3rd edition, April 2006.

[4] Statistical hypothesis testing, 1996. Available from: http://www.ganesha.org/spc/hyptest.html

[5] William A. Levinson and Frank Tumbelty. SPC Essentials and Productivity Improvement: A Manufacturing Approach. ASQC Quality Press, October 1996.

[6] Youngsun Yu. Personal communication, January 2013.

[7] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. pages 10-10, 2004.

[8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 29-43, New York, NY, USA, 2003. ACM.

[9] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST '10, pages 1-10, Washington, DC, USA, 2010. IEEE Computer Society.

[10] A. Rajaraman and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2012.

[11] Michael C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363-1369, 2009.

[12] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, and Mark A. DePristo. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297-1303, 2010.

[13] S. Papadimitriou and Jimeng Sun. DisCo: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining. In Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, pages 512-521, 2008.

[14] J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for data intensive scientific analyses. In eScience, 2008. eScience '08. IEEE Fourth International Conference on, pages 277-284, 2008.

[15] Yuan Bao, Lei Ren, Lin Zhang, Xuesong Zhang, and Yongliang Luo. Massive sensor data management framework in cloud manufacturing based on Hadoop. In Industrial Informatics (INDIN), 2012 10th IEEE International Conference on, pages 397-401, 2012.

[16] Chen Zhang, Hans Sterck, Ashraf Aboulnaga, Haig Djambazian, and Rob Sladek. Case study of scientific data processing on a cloud using Hadoop. In Douglas J. K. Mewhort, Natalie M. Cann, Gary W. Slater, and Thomas J. Naughton, editors, High Performance Computing Systems and Applications, volume 5976 of Lecture Notes in Computer Science, pages 400-415. Springer Berlin Heidelberg, 2010.

[17] Ronald C. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010.

[18] Kelvin Cardona, Jimmy Secretan, Michael Georgiopoulos, and Georgios Anagnostopoulos. A grid based system for data mining using MapReduce. Technical report, 2007.

[19] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 5:1-5:16, New York, NY, USA, 2013. ACM.

[20] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. Twister: A runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 810-818, New York, NY, USA, 2010. ACM.

[21] Shivnath Babu. Towards automatic optimization of MapReduce programs. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 137-142, New York, NY, USA, 2010. ACM.

[22] IBM Corp. An architectural blueprint for autonomic computing. IBM Corp., USA, October 2004. Available from: www-3.ibm.com/autonomic/pdfs/ACBP2_2004-10-04.pdf

[23] Welcome to Apache HBase, 2014. Available from: http://hbase.apache.org

[24] R. G. Goss and K. Veeramuthu. Heading towards big data building a better data warehouse for more data, more speed, and more users. In Advanced Semiconductor Manufacturing Conference (ASMC), 2013 24th Annual SEMI, pages 220-225, 2013.

[25] Xuhui Liu, Jizhong Han, Yunqin Zhong, Chengde Han, and Xubin He. Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS. In Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on, pages 1-8, Aug 2009.

[26] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, J. Majors, A. Manzanares, and Xiao Qin. Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1-9, 2010.

[27] Yanpei Chen, Archana Ganapathi, and Randy H. Katz. To compress or not to compress - compute vs. IO tradeoffs for MapReduce energy efficiency. In Proceedings of the First ACM SIGCOMM Workshop on Green Networking, Green Networking '10, pages 23-28, New York, NY, USA, 2010. ACM.

[28] Edward Bortnikov, Ari Frank, Eshcar Hillel, and Sriram Rao. Predicting execution bottlenecks in map-reduce clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, HotCloud '12, pages 18-18, Berkeley, CA, USA, 2012. USENIX Association.

[29] NIST/SEMATECH e-Handbook of Statistical Methods: Percentiles, December 2013. Available from: http://www.itl.nist.gov/div898/handbook/prc/section2/prc252.htm

[30] John Renze. Outlier. MathWorld - A Wolfram Web Resource, created by Eric W. Weisstein. Available from: http://mathworld.wolfram.com/Outlier.html

[31] J. B. Keats and D. C. Montgomery. Statistical Applications in Process Control. Quality and Reliability. Taylor & Francis, 1996.

[32] David Lane. Percentile. Available from: http://cnx.org/content/m10805/latest

[33] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986. Available from: http://books.google.fr/books?id=1aAOdzK3FegC

[34] Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu. Starfish: A self-tuning system for big data analytics. In CIDR, volume 11, pages 261-272, 2011.

[35] Shrinivas B. Joshi. Apache Hadoop performance-tuning methodologies and best practices. In Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, ICPE '12, pages 241-242, New York, NY, USA, 2012. ACM.

[36] Dawei Jiang, Beng Chin Ooi, Lei Shi, and Sai Wu. The performance of MapReduce: An in-depth study. Proc. VLDB Endow., 3(1-2):472-483, September 2010.

[37] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.

[38] Tony F. Chan, Gene H. Golub, and Randall J. LeVeque. Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician, 37(3):242-247, 1983.

[39] B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419-420, 1962.

[40] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endow., 2(2):1626-1629, August 2009.

[41] Duke University. Starfish: Self-tuning analytics system, 2013. Available from: http://www.cs.duke.edu/starfish/

[42] Vamsee Kasavajhala. Solid state drive vs. hard disk drive price and performance study, 2011.

[43] Kang Seok-Hoon, Koo Dong-Hyun, Kang Woon-Hak, and Lee Sang-Won. A case for flash memory SSD in Hadoop applications. International Journal of Control and Automation, 6(1):201-209, 2013.

[44] Violin Memory. Benchmarking Hadoop & HBase on Violin: Harnessing big data analytics at the speed of memory, 2013. Available from: http://pull.vmem.com/wp-content/uploads/hadoop-benchmark.pdf?d=1

[45] Cloudera. Cloudera Manager: end-to-end administration for Hadoop, 2014. Available from: http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html

[46] VMware. Managing the ESX host, 2014. Available from: http://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp#getting_started/c_managing_your_first_esx_server_3i_host.html

BIOGRAPHICAL SKETCH

Wenjie Zhang was born in Dalian, China. She obtained her bachelor's degree from Liaoning University (China). She joined the University of Florida, Department of Electrical and Computer Engineering, in 2011, with an emphasis in computer engineering. In the summer of 2012, she joined the Advanced Computing and Information Systems Laboratory. She has worked on IT aspects of the DARPA/REPAIR project, a project focusing on research into mechanisms for the repair of brain injuries. She is currently involved in research on parallel and distributed semiconductor data analytics, a project of the NSF Center for Cloud and Autonomic Computing (CAC) supported by Samsung Inc. Her interests include cloud computing and large-scale data analysis in distributed environments.