Citation
Using Hadoop to accelerate the analysis of semiconductor-manufacturing monitoring data

Material Information

Title:
Using Hadoop to accelerate the analysis of semiconductor-manufacturing monitoring data
Creator:
Zhang, Wenjie
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
1 online resource (117 p.)

Thesis/Dissertation Information

Degree:
Master's ( M.S.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
FORTES,JOSE A
Committee Co-Chair:
FIGUEIREDO,RENATO JANSEN
Committee Members:
LI,TAO
Graduation Date:
5/3/2014

Subjects

Subjects / Keywords:
Algorithms ( jstor )
Databases ( jstor )
Datasets ( jstor )
Genetic mapping ( jstor )
Input output ( jstor )
Outliers ( jstor )
Percentiles ( jstor )
Programming models ( jstor )
Recordings ( jstor )
Standard deviation ( jstor )
Electrical and Computer Engineering -- Dissertations, Academic -- UF
data-analysis -- hadoop -- mapreduce -- scalability
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Electrical and Computer Engineering thesis, M.S.

Notes

Abstract:
Large-scale data sets are generated every day by the sensors used to monitor semiconductor manufacturing processes. To effectively process these large data sets, a scalable, efficient and fault-tolerant system must be designed. This thesis investigates the suitability of Hadoop as the core processing engine of such a system. Hadoop has been used in many fields for large-scale data processing, including web indexing, bioinformatics and data mining, where it has proven to be an effective data processing platform because it provides parallelization and distribution for applications built on it. It also offers scalability and fault tolerance using only commodity machines. In this thesis, we propose a system for statistics-based analysis of manufacturing monitoring data using Hadoop. We first present a data-oriented approach through an in-depth analysis of the data to be processed. Then, different Hadoop configuration parameters are tuned in order to improve the performance of the system. In addition, the statistics-based computation algorithm is modified so that it can reuse previous results. By establishing a mapping from the data to the Hadoop MapReduce programming paradigm, the implemented system outperforms the legacy analysis system by up to 82.7%. Experiments have been done on different mapping strategies to study the effect of configuration parameters on performance. The algorithmic optimization further reduces computation time by 50%. However, the non-computation time, i.e., I/O operation time, dominates execution time in our experiments. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (M.S.)--University of Florida, 2014.
Local:
Adviser: FORTES,JOSE A.
Local:
Co-adviser: FIGUEIREDO,RENATO JANSEN.
Statement of Responsibility:
by Wenjie Zhang.

Record Information

Source Institution:
UFRGP
Rights Management:
Copyright Zhang, Wenjie. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
908645728 ( OCLC )
Classification:
LD1780 2014 ( lcc )

Full Text

USING HADOOP TO ACCELERATE THE ANALYSIS OF SEMICONDUCTOR-MANUFACTURING MONITORING DATA

By

WENJIE ZHANG

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2014

© 2014 Wenjie Zhang

To my advisor, Dr. Jose A. B. Fortes
To my parents, Peisong Zhang and Jinzhi Qu

ACKNOWLEDGMENTS

First, I would like to express special appreciation and thanks to my advisor, Dr. Jose A. B. Fortes, for his support of my study and research towards the master's degree, and for his patience, encouragement and immense knowledge. His endless support and guidance enlightened my way through study, research and career. Without his help, this thesis could not have been finished. Second, I would like to thank the rest of my committee members, Dr. Renato J. Figueiredo and Dr. Tao Li, for their patience, support, and insightful comments and suggestions. Third, I wish to acknowledge the help provided by Mr. Youngsun Yu from Samsung through insightful discussions of the problem. Fourth, my sincere thanks to Dr. Andrea Matsunaga; her enthusiasm, patience and encouragement of research and study guided me to the door of research. In addition, special thanks to my family and my friends for their support, understanding and encouragement. Words cannot express how grateful I am to my parents for all of the sacrifices they have made on my behalf. Finally, I would like to express appreciation to Zhuoyuan Song for his support and help in finishing this thesis.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ..... 4
LIST OF TABLES ..... 8
LIST OF FIGURES ..... 9
ABSTRACT ..... 12

CHAPTER

1 INTRODUCTION ..... 13
  1.1 MapReduce Programming Model and Hadoop Workflow ..... 16
    1.1.1 MapReduce Programming Model ..... 17
    1.1.2 Hadoop Platform ..... 18
  1.2 Motivation and Problem Statement ..... 22
    1.2.1 Data Mapping and Computation Distribution ..... 23
    1.2.2 System Optimization ..... 24
  1.3 System Architecture ..... 24
  1.4 Structure of Thesis ..... 28

2 DESIGN AND IMPLEMENTATION ..... 29
  2.1 Data Description ..... 29
  2.2 Algorithm Analysis ..... 31
    2.2.1 Percentile and Outlier ..... 33
    2.2.2 Process Capability ..... 34
    2.2.3 Euclidean Distance ..... 35
    2.2.4 Equivalence Test ..... 36
  2.3 Data Mapping ..... 36
    2.3.1 Data Mapping: Database to Files ..... 36
    2.3.2 Data Mapping: Files to Mappers ..... 40
  2.4 System Implementation ..... 42
    2.4.1 M-Heavy Implementation ..... 43
      2.4.1.1 Map implementation ..... 43
      2.4.1.2 Reduce implementation ..... 43
    2.4.2 R-Heavy Implementation ..... 45
      2.4.2.1 Map implementation ..... 45
      2.4.2.2 Reduce implementation ..... 46
    2.4.3 MR-Balanced Implementation ..... 47
      2.4.3.1 Map implementation ..... 47
      2.4.3.2 Reduce implementation ..... 48
    2.4.4 Relationships with Data Mapping ..... 49

3 SYSTEM OPTIMIZATION ..... 51
  3.1 Hadoop Configuration Parameters Tuning ..... 51
    3.1.1 Parameters of the MapReduce Programming Model ..... 51
    3.1.2 Parameters of Hadoop Workflow ..... 54
    3.1.3 Other Hadoop Features ..... 57
  3.2 Algorithmic Optimization ..... 59
    3.2.1 Data Movement ..... 60
    3.2.2 Algorithmic Optimization ..... 61
      3.2.2.1 The sliding-window algorithm ..... 61
      3.2.2.2 Algorithm analysis ..... 64

4 EXPERIMENTAL EVALUATION AND ANALYSIS ..... 67
  4.1 Data Mapping and MapReduce Implementation Experiments ..... 67
    4.1.1 Data Mapping ..... 67
      4.1.1.1 Experiments setup ..... 68
      4.1.1.2 Results analysis ..... 68
    4.1.2 Large Scale Data Sets ..... 69
      4.1.2.1 Experiments setup ..... 69
      4.1.2.2 Results analysis ..... 70
  4.2 Hadoop Configuration-Parameters Tuning Experiments ..... 71
    4.2.1 Parameters of the MapReduce Programming Model ..... 71
    4.2.2 Parameters of Hadoop Workflow ..... 75
    4.2.3 Other Hadoop Features ..... 78
    4.2.4 Put It All Together ..... 81
  4.3 Algorithmic Optimization Experiments ..... 82
    4.3.1 Numerical Analysis of Standard Deviation Computation ..... 83
      4.3.1.1 Experiments setup ..... 83
      4.3.1.2 Results analysis ..... 83
    4.3.2 Sliding-Window Algorithm Performance Analysis ..... 85
      4.3.2.1 Experiments setup ..... 85
      4.3.2.2 Results analysis ..... 86

5 CONCLUSIONS ..... 91
  5.1 Summary ..... 91
    5.1.1 Data Mapping ..... 91
    5.1.2 Computation Distribution ..... 92
    5.1.3 System Optimization ..... 93
    5.1.4 Experimental Results Analysis ..... 93
  5.2 Towards a Better System ..... 94
    5.2.1 Automatically Tuning Hadoop Configuration Parameters ..... 95
    5.2.2 Using Solid-State Drives ..... 96

APPENDIX

A SETTING UP A HADOOP CLUSTER USING CLOUDERA MANAGER ..... 98
  A.1 Introduction ..... 98
  A.2 VM Creation and Operating System Installation ..... 98
    A.2.1 Download CentOS ..... 98
    A.2.2 Install CentOS 6.4 ..... 98
    A.2.3 Configure CentOS ..... 99
    A.2.4 Replicate VMs ..... 102
  A.3 Cloudera Manager and CDH Distribution ..... 103

B A PARALLEL ALGORITHM FOR STATISTICS-BASED ANALYSIS ..... 109

REFERENCES ..... 113

BIOGRAPHICAL SKETCH ..... 117

LIST OF TABLES

Table page

2-1 A vertical view of a sample record ..... 30
2-2 How different MapReduce implementations constrain data mappings into the map/reduce phase ..... 43
2-3 Relationships between different implementations and data mapping strategies ..... 50
3-1 A subset of configuration parameters tuned for system optimization - part 1 [1] ..... 52
3-2 A subset of configuration parameters tuned for system optimization - part 2 ..... 53
4-1 The default values of configuration parameters in Tables 3-1 and 3-2 ..... 72
4-2 Experimental results of compression and speculative execution parameters ..... 80
4-3 The values of configuration parameters after tuning ..... 82
4-4 Comparison of averages of records calculated using the two-pass algorithm and the sliding-window approach for different numbers of records ..... 84
4-5 Comparison of standard deviations of records calculated using the two-pass algorithm and the sliding-window approach for different numbers of records ..... 84
5-1 The advantages and disadvantages of different implementations ..... 93

LIST OF FIGURES

Figure page

1-1 Legacy system architecture for the semiconductor manufacturing monitoring analysis system. The figure shows the architecture of a legacy analysis system. Data collected from monitoring sensors are stored in a Microsoft database. The analysis system retrieves data from the database and analyzes the data using multiple machines. Results of analysis are sent to an Oracle database for field engineers. ..... 15
1-2 The MapReduce programming model ..... 18
1-3 Hadoop MapReduce workflow ..... 21
1-4 An overview of the system architecture for the analysis system on Hadoop ..... 23
1-5 A high-level view of the statistics-based monitoring system and the static optimization strategy ..... 26
2-1 Data sets collected during several consecutive days are used for nightly statistical analysis. ..... 30
2-2 Database-to-files mapping strategies - arbitrary mapping ..... 37
2-3 Database-to-files mapping strategies - per-iCU mapping ..... 38
2-4 Database-to-files mapping strategies - per-dCU mapping ..... 39
2-5 Files-to-Mappers mapping strategies ..... 41
2-6 M-Heavy implementation: Map phase ..... 44
2-7 R-Heavy implementation: Map phase ..... 45
2-8 R-Heavy implementation: Reduce phase ..... 46
2-9 MR-Balanced implementation: Map phase ..... 48
2-10 MR-Balanced implementation: Reduce phase ..... 49
3-1 The shuffling phase of a MapReduce job. Each map task contains a map function. The buffer associated with the map function is the place that stores the output from the map function. When the size of the data in the buffer exceeds some threshold, the data are spilled to disk. The blocks following the buffer in each map task represent the blocks written on the disk. Each reduce task contains a reduce function. Before the reduce function starts, the reduce task retrieves the output from map tasks and merges the outputs as the input to the reduce function, represented by the arrow and the merge streams in each reduce task. ..... 55
3-2 The sliding window of statistical analysis. The window size is 14 days. ..... 59

3-3 Data mapping strategy with regard to the timestamp of each record ..... 60
4-1 Different data mapping strategies for each implementation ..... 69
4-2 Different implementations for each data mapping strategy ..... 70
4-3 Time measurement for current data simulation ..... 71
4-4 Influence of the number of map tasks launched. The dots in the figure show the experimental results. Each dot encodes three dimensions: the number of map tasks, the HDFS block size and the execution time. The execution time is shown using different shades of gray. The darker the area is, the less time the job takes to run. ..... 73
4-5 Influence of the number of map tasks (even data distribution) ..... 75
4-6 Influence of the number of reduce tasks on execution time ..... 76
4-7 Influence of the buffer size of the output from the map function ..... 76
4-8 Influence of the parameter io.sort.factor on the execution time ..... 77
4-9 Influence of the parameter mapred.reduce.parallel.copies on the execution time ..... 78
4-10 Influence of the reduce phase buffer size (percent). The stars in the figure indicate that the experiments could not be finished due to the lack of memory space for the reduce task computation. ..... 78
4-11 Influence of the reduce phase buffer size (records) ..... 79
4-12 Influence of the parameter mapred.job.reuse.jvm.num.tasks on execution time ..... 79
4-13 Influence of the parameter mapred.reduce.slowstart.completed.maps on execution time ..... 80
4-14 Comparison of the execution time before and after tuning of configuration parameters for the R-Heavy implementation ..... 82
4-15 The execution times of the two-pass algorithm and the sliding-window approach to compute averages and standard deviations ..... 85
4-16 The global-order strategy and the time-stamp-order strategy ..... 86
4-17 Sorting and the standard deviation computation time ..... 87
4-18 Computation time and I/O time measurement ..... 88
4-19 The total execution time for four cases ..... 88
4-20 Execution time for different cases ..... 89
5-1 Different system architectures ..... 95

A-1 The welcome page from the CentOS image ..... 99
A-2 Successfully set up the network connection ..... 100
A-3 The content of the file 70-persistent-net.rules before updating ..... 102
A-4 The content of the file 70-persistent-net.rules after updating ..... 102
A-5 The login web interface of the Cloudera Manager ..... 103
A-6 The web interface for the user to specify the nodes on which Hadoop will be installed ..... 104
A-7 The web interface for the user to specify the installation package and other options ..... 104
A-8 The web interface for the user to specify the appropriate credential method to access the cluster ..... 105
A-9 The web interface when the installation fails ..... 106
A-10 The web interface when the installation process goes well ..... 106
A-11 The web interface when installation of the package is successful ..... 106
A-12 The web interface when installation of the package is successful ..... 107
A-13 The web interface when all services have been installed and the cluster is ready to work ..... 107
A-14 The web interface showing the status of services available ..... 108

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

USING HADOOP TO ACCELERATE THE ANALYSIS OF SEMICONDUCTOR-MANUFACTURING MONITORING DATA

By

Wenjie Zhang

May 2014

Chair: Jose A. B. Fortes
Major: Electrical and Computer Engineering

Large-scale data sets are generated every day by the sensors used to monitor semiconductor manufacturing processes. To effectively process these large data sets, a scalable, efficient and fault-tolerant system must be designed. This thesis investigates the suitability of Hadoop as the core processing engine of such a system. Hadoop has been used in many fields for large-scale data processing, including web indexing, bioinformatics and data mining, where it has proven to be an effective data processing platform because it provides parallelization and distribution for applications built on it. It also offers scalability and fault tolerance using only commodity machines. In this thesis, we propose a system for statistics-based analysis of manufacturing monitoring data using Hadoop. We first present a data-oriented approach through an in-depth analysis of the data to be processed. Then, different Hadoop configuration parameters are tuned in order to improve the performance of the system. In addition, the statistics-based computation algorithm is modified so that it can reuse previous results. By establishing a mapping from the data to the Hadoop MapReduce programming paradigm, the implemented system outperforms the legacy analysis system by up to 82.7%. Experiments have been done on different mapping strategies to study the effect of configuration parameters on performance. The algorithmic optimization further reduces computation time by 50%. However, the non-computation time, i.e., I/O operation time, dominates execution time in our experiments.
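The map/shuffle/reduce decomposition that the abstract refers to, and that Chapter 1 develops, can be illustrated with a toy single-process sketch. Plain Python stands in for Hadoop's Java API here; the record layout (equipment-unit id, sensor value) and the per-equipment-unit statistics are illustrative assumptions, not the thesis's actual schema:

```python
from collections import defaultdict
from math import sqrt

# Toy stand-in for a statistics-based MapReduce job. Hadoop would run the map
# and reduce functions in parallel across a cluster; here the three phases run
# sequentially in one process to show the data flow.

def map_fn(record):
    """Emit an intermediate (key, value) pair keyed by equipment unit."""
    eu_id, value = record
    yield eu_id, value

def shuffle(pairs):
    """Group intermediate values by key, as Hadoop's shuffle phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Compute per-equipment-unit statistics: count, mean, population std."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return key, (n, mean, sqrt(var))

def run_job(records):
    intermediate = (pair for r in records for pair in map_fn(r))
    return dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate).items())

records = [("EU-1", 2.0), ("EU-1", 4.0), ("EU-2", 10.0), ("EU-2", 10.0)]
print(run_job(records))
# {'EU-1': (2, 3.0, 1.0), 'EU-2': (2, 10.0, 0.0)}
```

Because each map call and each reduce call is independent, Hadoop can distribute them freely across machines; only the shuffle step requires moving data between nodes.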

CHAPTER 1
INTRODUCTION

Massive amounts of data are collected daily in manufacturing plants where sensors are used to track many physical parameters that characterize the operation of a large number of equipment units. The data need to be processed, analyzed, stored and organized in a way that makes them easy for engineers to use. Processing large-scale data sets usually takes a long time. Technologies such as parallelization and distribution are used for solving this "big-data analytics" problem. One such technology, Hadoop [2], is designed for large-scale data processing on clusters of computers. In this research, we present a statistics-based analysis system for semiconductor manufacturing monitoring data built on Hadoop. The choice of Hadoop is due to its scalability, efficiency, and transparent support of distributed, parallel and fault-tolerant computational tasks.

A semiconductor manufacturing plant consists of many Equipment Units (EUs) which can be used to produce a variety of semiconductor products. Different products are produced in different production lines, each production line consisting of multiple steps, each using one or more EUs. Sensors embedded in each EU are used to monitor multiple environment parameters of relevance to the quality of the final product. Monitoring every production line generates a large set of data each day. As the scale of semiconductor manufacturing plants grows, so does the scale of the monitoring data. As a consequence, a scalable system for analysis of monitoring data is essential to enable the optimization of semiconductor manufacturing processes.

As a representative example, this thesis considers a statistics-based analysis system that categorizes data collected from monitoring sensors according to different production lines and different manufacturing steps. In each category, collected data are further divided into different data sets based on EUs. The Student's t-test [3] is used to determine if two data sets are significantly different from each other. The t-test, or one of the other related statistical hypothesis tests, has been widely used for statistical process

control in manufacturing procedures [4, 5]. Results of the analysis offer field engineers information about malfunctioning EUs such that repairing those EUs takes place in a timely manner.

A common Semiconductor Manufacturing Monitoring Data Analysis System (SMMDAS) is presented in Figure 1-1 [6]. Sensors are embedded in each EU for monitoring environment parameters. Data collected from sensors are preprocessed and stored in a database. The legacy statistics-based analysis system consists of multiple computers, and each computer contains multiple cores. Every night, the system retrieves the data to be analyzed from the database and assigns jobs to multiple computers. Each computer may receive a single job, or multiple jobs. Each core of a computer is responsible for processing one independent job at a time. No correlated processing is required between multiple cores, or multiple jobs. After processing jobs, the computers send results to a different database.

Several problems exist in the current parallel system. First, there is no management block responsible for workload balancing, which results in resource underutilization. The parallel system consists of multiple homogeneous computers; however, different computers receive different workloads. In the example shown in Figure 1-1, the job list for Fab-line 2 is empty, making Data-Analytics-Server 2 idle while other servers are busy with a long list of jobs from other Fab-lines. Second, the analysis results are required every morning by field engineers; however, the current system cannot always meet this deadline, resulting in delays in repairing problematic EUs.

In addition, other considerations are also required for an efficient and effective analysis system. As the scale of the manufacturing and monitoring procedure grows, the size of data to be analyzed increases as well. Due to the problems of the current system and the growing scale of data, scalability is a requirement for a new system design. In addition, the scale of both data and the system is increasing such that failures appear more and more frequently. It is

Figure 1-1. Legacy system architecture for the semiconductor manufacturing monitoring analysis system. The figure shows the architecture of a legacy analysis system. Data collected from monitoring sensors are stored in a Microsoft database. The analysis system retrieves data from the database and analyzes the data using multiple machines. Results of analysis are sent to an Oracle database for field engineers.

necessary to design a system with fault tolerance so that failures don't influence the correctness of results and the performance of the system.

In this chapter, an introduction to the MapReduce programming model and its open-source implementation Hadoop is presented along with their corresponding advantages. Second, aiming to address the limitations exposed by the legacy SMMDAS, a system built on Hadoop is introduced to achieve scalability, performance and other desirable features. Third, optimization strategies, including tuning Hadoop configuration parameters and re-using previous computation results, are discussed.

1.1 MapReduce Programming Model and Hadoop Workflow

A data processing system can be viewed as having three tiers: application, platform and storage. The storage tier handles data storage, replication, access, etc. The application tier supports multiple jobs, users and applications. The platform tier enables the use of multiple distributed resources and provides a programming and unified environment that hides from programmers how programs and data are parallelized, allocated and coordinated. Virtualization technologies enable the use of multiple homogeneous, or even heterogeneous, machines under a unified computing environment. The platform and storage tiers are tightly coupled, hiding the complexity of the underlying management and distribution. The Google MapReduce platform [7], with its corresponding storage system, the Google File System (GFS) [8], are representative examples of the platform and storage tiers.

MapReduce is a programming model inspired by functional programming. Google adopted the programming model and its associated implementation, inheriting the name MapReduce, for its large-scale web index service [7]. With the support of its underlying storage system GFS, MapReduce is used for massive-scale data processing in a distributed and parallel manner. Hadoop [2] is an open-source platform consisting of two services: the MapReduce service and the Hadoop Distributed File System (HDFS) [9]. Hadoop was originally developed by Yahoo! and, once it became open-source, it has been applied, analyzed and improved by many engineers and researchers. Many applications have been developed following the MapReduce programming model [1, 10] on Hadoop, in fields such as bio-medical informatics [11, 12], data mining [13, 14] and manufacturing data management [15]. Hadoop has also been widely used in scientific analytics for large-scale data processing [11, 12, 16]. Taylor [17] summarized current bioinformatics applications that use Hadoop. Papadimitriou and Sun [13] developed a distributed co-clustering framework based on Hadoop, and showed the scalability and efficiency provided by its underlying Hadoop platform. Cardona et al. [18] applied the MapReduce programming model to

a Grid-based data mining system, in which two improved schedulers are proposed and simulated. In this section, we present background on the MapReduce programming model and the workflow of a Hadoop job in order to explain how Hadoop achieves parallelization and distribution.

1.1.1 MapReduce Programming Model

The MapReduce programming model consists of two functions, map and reduce. The input to a MapReduce program is a set of <key, value> pairs, and the output of the program usually is another set of <key, value> pairs. Figure 1-2 shows a diagram of the MapReduce programming model. The map phase consists of multiple Mappers; the same map function is carried out on each Mapper. Mappers are executed independently, meaning that there are no interactions among them. The map function takes a <key, value> pair as an input, and generates a set of intermediate <key, value> pairs as the output. There is no relationship between the input and output types. As shown in Figure 1-2, the input to the map phase is abstracted as <k1, v1>, and the output of the map phase is abstracted as <k2, v2>.¹ The MapReduce programming model usually has a shuffling phase, in which the intermediate pairs with the same key are sent to the same reduce function. Similar to the map phase, the reduce phase consists of multiple Reducers, and the same reduce function is carried out on each Reducer. The reduce function takes a list of values associated with the same key as the input, generates a set of <key, value> pairs as the output, and writes them into the file system. Similar to the map phase, the type of the inputs can be different from the type of the outputs.

Since both map and reduce functions are executed independently, an application following the MapReduce programming paradigm can be naturally distributed and

¹ Subscripts are used to identify pairs where keys and/or values are identical.

Figure 1-2. The MapReduce programming model

parallelized across a cluster of computers. Hadoop is a software system that provides a programming platform and run-time environment for MapReduce jobs. HDFS is the storage system associated with Hadoop; it provides distributed file system capabilities for large-scale data sets. The MapReduce and HDFS services are tightly coupled such that users only need to program the map/reduce functions as discussed above, and Hadoop takes care of parallelization and distribution of computations and input data. A typical Hadoop MapReduce job workflow is discussed in the next section.

1.1.2 Hadoop Platform

In a typical Hadoop cluster, the MapReduce service relies on one master node and multiple slave nodes. The master node is responsible for scheduling, assigning map/reduce tasks to slave nodes, management of the cluster, etc. Slave nodes are machines that execute map/reduce tasks. Correspondingly, HDFS relies on one namenode and multiple datanodes. The namenode stores metadata, and application data are distributed among datanodes [9]. Both the namenode of HDFS and the master node of the MapReduce

service use dedicated machines. A Hadoop MapReduce job workflow is described below (HDFS is considered as a transparent storage system that provides metadata information to the MapReduce service).

A JobTracker is a software component running on the master node and responsible for coordinating a job run. Correspondingly, a TaskTracker is a software component running on each slave node and responsible for running tasks of the job. The client program is written in Java in this example (otherwise, the Hadoop streaming utility is used for programs written in other programming languages) [1]. The client, or user program, submits a MapReduce job to a Hadoop cluster. The JobTracker receives the job information, assigns a unique job id to the submitted job, retrieves input file information from the underlying distributed storage system (for example, HDFS), and assigns map/reduce tasks to slave nodes according to the current status of the slave nodes. Each TaskTracker communicates with the JobTracker and executes the map/reduce tasks assigned to it. Each TaskTracker sends a "heartbeat" periodically to the JobTracker, as an indication of the health of the slave node. The heartbeat channel is also used for sending messages to the JobTracker. When a map/reduce slot is available in a slave node, the JobTracker assigns a map/reduce task to the slave node through the TaskTracker. The TaskTracker retrieves the input from the distributed file system, and launches a child Java program to execute the task. After the task has finished, the TaskTracker collects the task information and sends it back to the JobTracker. The JobTracker notifies the client when the job has finished.

Figure 1-3 shows the workflow of a MapReduce job in Hadoop. The numbers 1 to 6 indicate the steps of a job execution. First, the client program submits the job to a Hadoop cluster. Second, the JobTracker retrieves file splits information, e.g. the location of files and the number of splits, from the distributed storage. For the third step, the heartbeat information is sent by the TaskTracker to the JobTracker to notify the JobTracker that there are available slots in a slave node. Fourth, the JobTracker

assigns a task to the TaskTracker based on the node information that the TaskTracker sent. Usually, the JobTracker assigns a map task to the slave node taking into account the location of the input files. Fifth, the TaskTracker retrieves input information from the distributed storage. If it is assigned a map task, the TaskTracker retrieves the input file split information. If it is assigned a reduce task, the TaskTracker retrieves the intermediate data generated by the map phase according to information provided by the JobTracker. Finally, in step six, the TaskTracker launches a new Java program to execute the assigned task. After the task has finished, the information is sent to the JobTracker through the heartbeat message channel.

An important property of Hadoop is fault tolerance [1]. There are three kinds of failures in the workflow shown in Figure 1-3. One kind of failure is a task failure, i.e. the task executed in a child JVM of a slave node fails. If the TaskTracker detects the failure, it will free the slot and execute another task. The failure information is reported to the JobTracker through the heartbeat message channel. When the JobTracker receives the failure information, it will try to reschedule the task on a different node. The second kind of failure is a TaskTracker failure, i.e. when the JobTracker detects the failure of a TaskTracker (by not receiving heartbeat information, or receiving the information too slowly). The JobTracker stops scheduling tasks to that node and reschedules the tasks on that node to other healthy nodes. The third kind of failure is a JobTracker failure. There is no mechanism to tolerate a JobTracker failure; YARN (or Hadoop 2.0) [19] introduces a new datacenter management structure that provides a mechanism to tolerate such a failure in a datacenter.

As discussed above, the Hadoop MapReduce service and its associated storage system HDFS enable parallelization and distribution among available resources. Multiple advantages are provided by the software platform. First, as the scale of the input increases, the Hadoop platform can easily scale up such that additional resources can

Figure 1-3. Hadoop MapReduce workflow. 1) Run job; 2) Retrieve input information from distributed storage; 3) Heartbeat and execution results; 4) Assign job to slave node; 5) Retrieve input information for assigned jobs; 6) Launch a map/reduce job.

be added for larger-scale data processing. When the resources in the cluster are underutilized, the platform can also easily scale down. Second, with regard to fault tolerance, HDFS enables data durability through replication on different nodes, and the MapReduce service tolerates failures through speculative execution. Third, without needing to consider the complexity of the underlying distributed system, Hadoop provides users a simple programming interface. Fourth, different configuration parameters can be tuned for

different application requirements. Fifth, users can inherit the benefits of using open-source software, such as low cost and flexibility. In summary, the Hadoop platform provides properties including scalability, fault tolerance, an easy-to-program interface, tunable configurations, and the benefits of open-source software. It has been applied to run parallel algorithms in many fields for massive data processing [1, 10]. We next introduce the proposed SMMDAS and discuss how the system benefits from the Hadoop platform.

1.2 Motivation and Problem Statement

Motivated by the previously discussed problems and requirements, we propose to build the analysis system on Hadoop. As discussed in Section 1.1, the Hadoop MapReduce service and HDFS provide management of the underlying resources and hide the complexity of the distributed system from the applications built on them. The properties of the Hadoop platform can be inherited by the analysis system so that it is capable of processing large-scale data sets. With respect to scalability, the system can easily scale up by adding more resources when the size of the input data increases or the performance of applications decreases. With regard to fault tolerance, through MapReduce speculative execution and HDFS replication, the system can obtain correct results even when there are failures in the cluster. The system will also rely on the open-source nature of Hadoop, thus benefiting from low cost in both acquisition and upgrades. Thus, the system is proposed to be built using the Hadoop platform.

The goal of this research is to design, implement, optimize and verify an effective and efficient distributed system for analysis of monitoring data. To this end, the system architecture will replace the current analysis system with a Hadoop system, as shown in Figure 1-4. Data collected from monitoring sensors are still stored in a database. The analysis system retrieves the data from the database to HDFS, which is responsible for distributing the data among multiple computers. The statistics-based analysis system is built on the Hadoop MapReduce service. After the analysis, the data are sent

PAGE 23

back to a database. We propose a new system architecture that does not involve a database in Chapter 5.

Figure 1-4. An overview of the system architecture for the analysis system on Hadoop

For purposes of building the statistics-based analysis system, the following problems must be considered.

1.2.1 Data Mapping and Computation Distribution

The purpose of the SMMDAS is to analyze the "data". Thus, the input data must be studied in depth so that they can be appropriately distributed among multiple machines. A thoughtless data mapping strategy can lead to an unbalanced workload in a Hadoop cluster, which results in performance degradation and under-utilization of resources. To this end, the question of how to map the data into HDFS must be carefully answered. Correspondingly, the computation must be mapped to the Hadoop MapReduce service as well. The computation can be finished in either a one-step MapReduce workflow or a multi-step MapReduce workflow (or iterative MapReduce [20]). Different mapping strategies have different advantages and disadvantages. For example, iterative applications

contain multiple steps of computation, and each step is parallelized so that the data processing time required by each step is reduced. Iterative applications on Hadoop, however, generate new map/reduce tasks in each iteration, resulting in overhead for creating new tasks for each iteration. Ekanayake et al. [20] improve the solution by introducing Twister, an extension of the Hadoop platform for iterative scientific applications. Thus, it is important to design an appropriate computation mapping strategy for the SMMDAS. Other extensions of the Hadoop software platform may also be adopted for performance purposes.

1.2.2 System Optimization

After the system is deployed, optimization of the system will benefit both the performance of the statistics-based analysis and the utilization of the whole computation cluster. Both external and internal improvements are required. From the external perspective, optimization can be achieved by tuning different configuration parameters. Applications are significantly influenced by different configuration settings [21]; changing parameters to appropriate values has been shown to significantly improve application performance. From the internal perspective, optimization can be achieved by changing the analysis algorithm. Through algorithmic optimization, previous computation results can be used for future data analysis. Based on the problems discussed in this section, we present a high-level view of this research in the following section.

1.3 System Architecture

Figure 1-5 shows a high-level view of the statistics-based analysis system using Hadoop (bottom) and strategies to statically optimize the system (top). The bottom of Figure 1-5 solves the problem of data mapping and computation distribution. Since data are a critical design consideration, we propose a data-oriented mapping from the database to the Hadoop MapReduce paradigm. This mapping consists of two steps: mapping the database to files and mapping files to Hadoop. A one-step

MapReduce workflow is proposed for mapping the computation to the MapReduce service. Based on different mapping strategies, we deploy the statistics-based analysis system in three different ways: Map-Heavy, MapReduce-Balanced and Reduce-Heavy.

The top of Figure 1-5 solves the problem of system optimization. MAPE-K (Monitoring, Analysis, Planning, Execution and Knowledge) can be used as a self-management architecture in autonomic computing [22]. In this research, however, we do not consider dynamic optimization but follow the basic idea of the MAPE-K approach to statically optimize the analysis system. As shown in Figure 1-5, the Monitoring block collects metrics of the system, such as CPU/memory utilization, input data and execution time; the Analysis block provides adjustment suggestions by analyzing different performance metrics with regard to the knowledge of the system execution history or new policies; the Planning block produces a series of changes based on the suggestions from the Analysis process; and the Execution block carries out adjustments to the system.

Two methodologies are used in this research for system optimization. From an external point of view, different Hadoop configuration parameters are tuned to decrease the execution time. The Monitoring block collects the current execution time of the analysis. The Analysis block compares the current execution time with historic execution times and some well-known tuning directions, and suggests tuning directions for the configuration parameters. The Planning block generates target values of the configuration parameters to be tuned. The Execution block carries out the tuning actions. From an internal point of view, the algorithm described in Chapter 2 is optimized to decrease the execution time. A dynamic procedure that follows MAPE-K would be similar to tuning the configuration parameters as discussed above.

Work has been done on applying Hadoop in manufacturing procedures. Bao et al. [15] applied Hadoop to sensor data management in manufacturing. They focused on the data management perspective and proposed to replace the RDBMS with HBase [23]. Most computations used in their system are simple queries or statistics operations. Our

Figure 1-5. A high-level view of the statistics-based monitoring system and the static optimization strategy

work focuses on a complex statistics-based computation, which cannot be finished within a database. The use of HBase or Hive may be an interesting design for the proposed database-free architecture in Chapter 5. To serve the increasing scale of the data, Goss and Veeramuthu [24] surveyed current technologies addressing the "Big Data" problem for a new semiconductor manufacturing company, including technologies such as Appliance RDBMS, Hadoop and in-memory DBs, as well as new architectures, for example, the General Engineering and Manufacturing Data Warehouse (GEM-D) solution. They compared advantages and disadvantages of possible technologies without implementing them. We have taken a further step in this research by designing and implementing

the system using Hadoop. Experimental evaluation results, as well as two optimization methodologies, are also presented in this thesis.

Most Hadoop applications are I/O bound [1]. Hadoop tries to achieve "data locality", i.e., to assign tasks to where the data are located. Multiple studies have researched this data-locality property and suggested methods to manage task assignments. Liu et al. [25] studied the performance of HDFS, especially for processing small files. To improve the I/O performance of small files, they proposed a grouping methodology that combines small files. Our experimental results verify this observation and take advantage of combining small files for performance improvement. Xie et al. [26] discussed the effect of data placement in a heterogeneous Hadoop cluster. An imbalanced data placement in a heterogeneous environment causes severe performance degradation in data-intensive Hadoop applications. By implementing a data placement mechanism in HDFS, input data can be distributed based on the computing capacity of each node. We use the balancer provided by Hadoop; since our cluster is built on a homogeneous environment as a private cloud, the balancer works well in the current setup. We also found that increasing the number of replicas of each file in HDFS increases the probability of execution on local data. For purposes of saving energy and I/O efficiency, the intermediate data generated by the map phase can be compressed. Chen et al. [27] proposed an algorithm to measure data characteristics so that Hadoop users can decide whether or not to compress the intermediate data; in our system, it is better to compress the map output. Data locality is used in Hadoop to avoid slow tasks by reducing network transmission [28]. However, scheduling tasks with regard to the data location does not necessarily improve the performance of a Hadoop job: if map tasks are scheduled on slow nodes on the basis of data locality, the performance of a job may be degraded. Again, the current system is built on a homogeneous environment; such works may be useful when using a heterogeneous environment.

1.4 Structure of Thesis

This thesis is structured as follows: Chapter 2 describes the input data and a data-oriented design by introducing different data mapping strategies and different implementations. Chapter 3 proposes different strategies for system optimization, including tuning Hadoop configuration parameters and algorithmic optimization. Chapter 4 discusses experimental results. Chapter 5 concludes the thesis and recommends future research directions.

CHAPTER 2
DESIGN AND IMPLEMENTATION

The system studied in this thesis for statistical analysis of manufacturing sensor measurements uses Hadoop for all of its data processing. The design of this system requires a good understanding of what kind of data are to be analyzed, how they are to be analyzed, and where they should be located. This chapter answers these questions and describes three implementations considering different data mapping scenarios. A data-oriented design approach is used to map data from the database to the Hadoop storage. This chapter characterizes the data that are to be analyzed, the algorithm used to analyze them, and the mapping strategies from the database to HDFS, as well as three implementations of the mapping of computations to the Hadoop map/reduce phases.

2.1 Data Description

The input data can be considered as a large table stored in a database, where each row is a measurement record that includes when and where the measurement took place, as well as the measured "item" and product and equipment identifications. Table 2-1 is a vertical view of an example record from a sample of a typical data set. Useful information for statistical analysis includes ITEM (environment parameter, such as temperature, pressure and gas flow rate), RECIPE (product and manufacturing step information), EQP_ID (EU information), TIME (start and end time of the measurement), SPEC (upper and lower specifications of the measured record) and AVG (the average of the measured value from start to end time, i.e., the information to be analyzed).

The monitoring system collects data during the manufacturing process. Every night, the analysis system retrieves data from 14 days (the current day and the previous 13 days), based on the timestamp associated with each record, as shown in Figure 2-1. Assuming the current date is 03/27/14, the statistics-based analysis is performed on the data from 03/14/14 to 03/27/14. On the second day, 03/28/14, the statistics-based analysis is performed on the data from 03/15/14 to 03/28/14. We discuss possibilities to

Table 2-1. A vertical view of a sample record

Column Name   Content         Detail Information
LOT_ID        LOT0759.1       Record identifier
WAFER_ID      01              Wafer ID
PRODUCT_ID    A5              Product ID
STEP_SEQ      A5_001          Sequence step
EQP_MODEL     DIFFUSION
EQP_ID        DIFF01          Equipment unit ID
CHAMBER_ID    A
RECIPE        A5_001_R01      Information of wafer, product and an ID
START_TIME    22:26.5
END_TIME      47:26.5
ITEM          Gas_Flow_Rate   The environment variable to be measured
MIN           100.4286
MAX           105.7051
AVG           106.7414        The average value to be analyzed
STDEV         0.56929
UPPER_SPEC    109             The upper specification
LOWER_SPEC    101             The lower specification
TARGET        105

use timestamp information and re-use of results to optimize the analysis procedure in Chapter 3.

Figure 2-1. Data sets collected during several consecutive days are used for nightly statistical analysis.

For purposes of statistical analysis, the input data table can be partitioned into independent subtables, one for each recipe-and-item combination. These subtables are independent in the sense that the statistical analysis is done individually and independently for each subtable. Each subtable is referred to as an independent

Computation Unit (iCU). Each subtable is further partitioned into dependent subtables, one for each EU. Since subtables belonging to the same iCU have to be grouped together to finish the computation, each dependent subtable is referred to as a dependent Computation Unit (dCU). Let Kr denote a combination of recipe and item, Ke an equipment id, and Kt a combination of recipe, item and equipment id. Thus, the whole statistical analysis is independently performed on each iCU, and it consists of partially related analyses done on each of the dCUs that make up an entire iCU.

2.2 Algorithm Analysis

The statistical analysis process is referred to as the "equivalence test". The purpose of this analysis is to characterize the difference between two sets of data, a reference data set and a test data set, using Student's t-test [3]. Algorithm 1 describes the analysis procedure. The input of the procedure is a linked list (inputList). Since the analysis is performed independently on each iCU, the data in the same inputList share the same Kr. The i-th element in the list is a 1D double-precision array, Ai, representing a dCU. Each value in Ai corresponds to an AVG value extracted from the records that share the same Kt.

For purposes of mapping the analysis algorithm to the MapReduce programming model later, the algorithm can be considered as consisting of two stages. The first stage of the procedure sorts an array Ai and then computes several percentile [29] values from Ai. The lower outlier limit (LOL) and upper outlier limit (UOL) are computed using the percentile values computed in the previous step. Outliers [30], either extremely small or extremely large values, are eliminated from Ai. Then, the procedure uses the average μ_i

Algorithm 1 Procedure for equivalence test

Input: LinkedList inputList
-- Stage 1 --
for each DoubleArray Ai in inputList do
    Sort Ai
    Compute the percentile values of Ai
    Compute the lower and upper outlier limits LOL, UOL using the percentile values
    Eliminate all Ai[k] that are not in the range [LOL, UOL]
    Compute the average μ_i and standard deviation σ_i of the entries of Ai
    Compute the process capability index Cpk[i]
end for
-- Stage 2 --
Pick the Ai with the largest Cpk as the reference set R
Compute Euclidean distance W_R[k] = sqrt((R[k] - μ_R)^2)
Compute Euclidean distance B_R[k] = sqrt((R[k] - μ')^2)
Compute μ_WR, σ_WR and μ_BR, σ_BR
for each DoubleArray Ai in inputList do
    Compute Euclidean distance W_i[k] = sqrt((Ai[k] - μ_i)^2)
    Compute Euclidean distance B_i[k] = sqrt((Ai[k] - μ_R)^2)
    Compute μ_Wi, σ_Wi and μ_Bi, σ_Bi
    Do a pairwise Student's t-test between W_R (or B_R) and W_i (or B_i)
    Final decision based on the result of the t-test
end for

and the standard deviation σ_i of Ai to compute the process capability index [31]. This procedure is done for every Ai in the input list independently.

The second stage of the procedure picks the Ai whose Cpk is the largest from the inputList as the reference data set, denoted as R. The other data sets are referred to as testing data sets. Each testing data set in the inputList is preprocessed by computing the Euclidean distance between the reference data set and itself. Then, a pairwise Student's t-test is carried out between the reference data set and each of the testing data sets. Final decisions (on whether a testing data set is equivalent to the reference data set) are determined by the results of the t-test [4]. In summary, the first-stage computation is performed on each dCU independently, while the second-stage computation is performed on each iCU independently.

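Stage 2 of Algorithm 1 can be illustrated with a small, self-contained sketch. This is an illustrative Python stand-in, not the system's Java implementation: the function names are ours, the B_R distances and μ' are omitted for brevity, and only the t statistic is computed (the p-value lookup of [33] used for the final decision is left out).

```python
import math

def welch_t(a, b):
    # Two-sample t statistic with unequal variances; the p-value lookup
    # used for the final equivalence decision is omitted in this sketch.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def stage2_equivalence(dcus):
    """dcus: list of (values, cpk) pairs, one per dCU of the same iCU.
    Picks the dCU with the largest Cpk as the reference set R, then
    compares every other dCU against it via within/between distances."""
    ref, _ = max(dcus, key=lambda d: d[1])
    mu_R = sum(ref) / len(ref)
    W_R = [math.sqrt((x - mu_R) ** 2) for x in ref]   # within-distances of R
    results = []
    for values, _ in dcus:
        if values is ref:
            continue
        mu_i = sum(values) / len(values)
        W_i = [math.sqrt((x - mu_i) ** 2) for x in values]  # eq. (2-8)
        B_i = [math.sqrt((x - mu_R) ** 2) for x in values]  # eq. (2-9)
        results.append((mu_i, welch_t(W_R, W_i)))
    return results
```

A degenerate input (all within-distances identical) would make the variance terms zero; a production version would guard against that, which this sketch does not.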
The following sections describe the functions in Algorithm 1 in detail¹, providing information useful for the algorithmic optimization of the statistical analysis in Chapter 3.

2.2.1 Percentile and Outlier

In statistics, an outlier is a data element that is extremely distinct from the other data in the sample [30]. Outliers usually indicate problems affecting data reliability, for example, an error during measurement. Identifying an outlier requires an upper limit and a lower limit, i.e., boundaries that determine excluded data. To compute the outlier limits, a percentile function is computed on the list of sample data. A percentile is a value below which a given proportion of the values in a group falls [29]. For example, the p-th percentile is a value, Vp, such that at most p% of the data are less than Vp and at most (100 - p)% of the data are greater than Vp. The 50th percentile is the median value of a data set. The percentile value can be estimated using the function Percentile as follows [32]. Note that L is a 1D sorted list (or an array).

function Percentile(data L, double p)
    L: sorted 1D list (or an array)
    N: the size of the list L
    p: the percentile
    r = (p / 100) * (N + 1)
    k <- integer part of r
    d <- fractional part of r
    V = L_k + d * (L_{k+1} - L_k)
end function

¹ The description does not reflect the specific implementation used in our experimental system. Instead, it presents general approaches described in common engineering statistics books and materials, of which the implementation is a representative example.

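The Percentile pseudocode above can be sketched as runnable Python. The clamping at the two ends of the list (when the rank falls outside 1..N) is our assumption, as the pseudocode does not specify that case; the outlier limits follow equations (2-2) and (2-3) with the typical c = 1.5.

```python
def percentile(L, p):
    """Percentile estimator: L is a sorted list, p in [0, 100].
    Ranks are 1-based, matching the pseudocode's L_k notation."""
    n = len(L)
    r = p / 100.0 * (n + 1)
    k = int(r)        # integer part of the rank
    d = r - k         # fractional part of the rank
    if k <= 0:        # rank below the first entry (our assumption)
        return L[0]
    if k >= n:        # rank above the last entry (our assumption)
        return L[-1]
    return L[k - 1] + d * (L[k] - L[k - 1])

def outlier_limits(L, c=1.5):
    """Q1, Q3, IQR and the limits LOL/UOL of equations (2-2)/(2-3)."""
    q1, q3 = percentile(L, 25), percentile(L, 75)
    iqr = q3 - q1
    return q1 - c * iqr, q3 + c * iqr   # (LOL, UOL)
```

For a sorted list of 11 values, the 25th/50th/75th percentiles land exactly on the 3rd, 6th and 9th entries, since r = (p/100)(N+1) is then an integer.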
The Percentile function first computes the rank, r, of the input percentile p. It then uses the entries with ranks (i.e., indexes) k and k+1, where k is the integer part of the computed rank r, to compute the percentile value, V. The result, V, is computed by adding to L_k the product of the fractional part of the computed rank and the difference between the (k+1)-th and k-th entries.

The 25th percentile is referred to as the first quartile, denoted Q1, and the 75th percentile is referred to as the third quartile, denoted Q3. The interquartile range, IQR, is the difference between Q3 and Q1,

    IQR = Q3 - Q1.    (2-1)

The outlier limits are estimated as

    UOL (Upper Outlier Limit) = Q3 + c * IQR,    (2-2)
    LOL (Lower Outlier Limit) = Q1 - c * IQR,    (2-3)

where c is a constant; a typical value of c is 1.5. Any data outside this range are considered outliers. Outliers are eliminated in the first stage of Algorithm 1.

2.2.2 Process Capability

"A process is a combination of tools, materials, methods and people engaged in producing a measurable output; for example a manufacturing line for machine parts." [31] Process capability is a measurable property of a process, which serves for measuring the variability of the output and comparing that variability with certain specifications. "The process capability index is a statistical measurement of the process capability: the ability of a process to produce output within specification limits." [31] The process performance index is one of the commonly accepted process capability indices; it is defined and computed as Cpk in Algorithm 1 and Equation (2-4) below. Given a data set (the output from the system that monitors the process) and its corresponding upper specification (USL) and lower specification (LSL), the process

performance index is defined and computed as

    Cpk = min[ (USL - μ) / (3σ), (μ - LSL) / (3σ) ],    (2-4)

where μ is the average of the data set and σ is the standard deviation of the data set. Another metric that is used to estimate the process capability is defined as

    Ĉ = (USL - LSL) / (6σ).    (2-5)

This metric is not used in our prototypes, but they are easily modifiable to use such a metric if necessary.

2.2.3 Euclidean Distance

After the elimination of outliers from the sample data, data sets that come from different EUs but share the same Kr are collected together. The data set with the maximum process performance index is picked as the reference data set, R. Each of the data sets of the other equipment units, Ai, is paired with the reference data set. For each element in each data set, the Within-Distance, W_i[k], is defined as the Euclidean distance between a single record of the data set and the average of that data set. For each element in the reference data set, the Between-Distance, B_R[k], is defined as the Euclidean distance between a single record of the reference data set and the average of all records in all data sets except the reference one. For each element in each data set except the reference data set, the Between-Distance, B_i[k], is defined as the Euclidean distance between a single record of the data set and the average of all records in the reference data set. The equations to compute the Euclidean distances are listed below. For the reference data set,

    W_R[k] = sqrt((R[k] - μ_R)^2),    (2-6)
    B_R[k] = sqrt((R[k] - μ')^2),    (2-7)

where R[k] refers to the k-th element in the reference data set, μ_R refers to the average of the reference data set and μ' refers to the average of all records except those from the reference data set. For the other testing data sets,

    W_i[k] = sqrt((Ai[k] - μ_i)^2),    (2-8)
    B_i[k] = sqrt((Ai[k] - μ_R)^2),    (2-9)

where Ai[k] refers to the k-th element in the i-th data set and μ_i refers to the average of the i-th data set.

2.2.4 Equivalence Test

The equivalence test is performed on the previously computed Euclidean distances of each of the data sets and the reference data set. The core of the equivalence test is Student's t-test [4], which is used to determine whether two sets of data are significantly different from each other. A numerical algorithm to compute the p-value [4], a measurable result of the t-test, can be found in [33].

2.3 Data Mapping

To minimize the overhead of data communication among nodes during the map phase, Hadoop is designed to assign map tasks to the nodes where the input data are located. Thus, appropriately mapping data from the database to Hadoop has a significant influence on system performance. Hadoop applications require support from the underlying HDFS, where data are stored as files. Thus, mapping data to Hadoop consists of two stages: the database to files, and files to Mappers.

2.3.1 Data Mapping: Database to Files

According to the data and algorithm descriptions, the whole data set can be partitioned into iCUs or dCUs. Based on how the data are partitioned (i.e., assigned to

HDFS blocks), we propose three strategies: arbitrary mapping, per-iCU mapping and per-dCU mapping.

Arbitrary mapping maps the data to files arbitrarily (i.e., in the order in which they are read from the database), as shown in Figure 2-2. Therefore, records belonging to different iCUs and dCUs may be stored in the same file. The advantage of this strategy is that data can be distributed evenly across the cluster.

Figure 2-2. Database-to-files mapping strategies: arbitrary mapping

Figure 2-3. Database-to-files mapping strategies: per-iCU mapping

Figure 2-4. Database-to-files mapping strategies: per-dCU mapping

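The three database-to-files strategies depicted in Figures 2-2 through 2-4 amount to a keying rule that decides which file a record is written to. The following is an illustrative Python stand-in (the function names are ours; the field names follow Table 2-1):

```python
def file_key(record, strategy, chunk_index=0):
    """Return the file bucket a record is written to.
    'arbitrary': whatever chunk the reader is currently filling;
    'per-iCU':   one file per Kr = (RECIPE, ITEM);
    'per-dCU':   one file per Kt = (RECIPE, ITEM, EQP_ID)."""
    if strategy == "arbitrary":
        return "chunk-%d" % chunk_index
    if strategy == "per-iCU":
        return (record["RECIPE"], record["ITEM"])
    if strategy == "per-dCU":
        return (record["RECIPE"], record["ITEM"], record["EQP_ID"])
    raise ValueError(strategy)

def partition(records, strategy):
    # Group all records into their file buckets.
    files = {}
    for rec in records:
        files.setdefault(file_key(rec, strategy), []).append(rec)
    return files
```

Per-iCU mapping thus yields one file per Kr, per-dCU mapping one file per Kt, and arbitrary mapping as many files as there are reader chunks, regardless of keys.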
Per-iCU mapping maps the data in the same iCU to the same file, as shown in Figure 2-3. Records that share the same Kr are mapped to the same file. Since all records in the same iCU are stored in one file, a single file can be analyzed independently, without interference with other files.

Per-dCU mapping maps the data in the same dCU to the same file, as shown in Figure 2-4. Records that share the same Kt (Kr and Ke) are mapped to the same file. All records in the same dCU are stored in one file, and part of the analysis (Stage 1 of Algorithm 1) can be performed independently on a file without interfering with other files.

2.3.2 Data Mapping: Files to Mappers

The data to be analyzed are stored as files in HDFS. As introduced in Chapter 1, the Hadoop MapReduce service provides parallelization and distribution to applications running on it. Files are distributed among nodes in a cluster. The Hadoop platform gives users the flexibility to decide whether to split a file and how to split it. Based on how many files a Mapper processes, we also propose three strategies to map files to Mappers: one-to-one mapping, many-to-one mapping and one-to-many mapping.

One-to-one mapping maps one file to one Mapper, as shown in Figure 2-5-a. Each Mapper handles a file independently. This can be controlled by overriding the isSplitable() method to return false. In a situation where a file contains indivisible data, in other words, when the data have to be processed by a single Mapper independently, this mapping strategy works properly. On the other hand, the disadvantage of the strategy is that when a file is stored on different nodes, processing the file as a whole brings data transmission overhead to the map-phase execution.

Many-to-one mapping maps multiple files to one Mapper, as shown in Figure 2-5-b. Each Mapper handles multiple files. Hadoop provides a class, CombineFileInputFormat, that combines several small files to be processed by one map task. We consider an alternative approach: a preprocessing procedure that combines multiple files into one file, where each

A) One-to-one mapping  B) Many-to-one mapping  C) One-to-many mapping

Figure 2-5. Files-to-Mappers mapping strategies

Mapper takes the combined file as the input. The preprocessing approach alleviates HDFS meta-data storage and also reduces file processing time during a MapReduce job execution. In the case that the size of each file is relatively small, assigning each file to one Mapper creates much overhead; thus, assigning multiple files to a Mapper is a proper strategy.

One-to-many mapping maps one file to multiple Mappers, as shown in Figure 2-5-c. Each Mapper handles part of a file. This is the most common situation in a Hadoop application. When a file is too large, HDFS splits the file into relatively smaller chunks, and Hadoop assigns each chunk to a Mapper during the job execution. The split size can also be set larger than the block size of HDFS [1]. However, this is not recommended, since it decreases the possibility of processing data locally. The one-to-many strategy does not work in scenarios where there are dependencies between computations performed on different chunks.

2.4 System Implementation

As illustrated in Chapter 1, the structure of a Hadoop application follows the MapReduce paradigm. To distribute the computation, the statistics-based analysis is mapped to the MapReduce programming model. In the current SMMDAS, the size of the data belonging to the same iCU is relatively small, so each iCU can be processed using one computing node in a Hadoop cluster. Thus, we propose to map Algorithm 1 to a single-step MapReduce job. We discuss the scenario in which the size of each dCU and each iCU increases in Chapter 4.

Algorithm 1 consists of two main stages. The first stage can be computed for each dCU independently. The second stage processes data belonging to the same iCU. Based on how much computation is carried out in the map/reduce phase, we propose three implementations: the Map-Heavy (M-Heavy) implementation, the Reduce-Heavy (R-Heavy) implementation and the MapReduce-Balanced (MR-Balanced) implementation. Table 2-2 summarizes how the different MapReduce implementations constrain data mappings in the map/reduce phase (a blank entry means no specific constraints).

Table 2-2. How different MapReduce implementations constrain data mappings in the map/reduce phase

phase          M-Heavy   R-Heavy   MR-Balanced
map phase      iCU                 dCU
reduce phase             iCU       iCU

2.4.1 M-Heavy Implementation

The M-Heavy implementation maps the whole analysis procedure to the map phase. Thus, data belonging to the same iCU have to be collected and sent to the same Mapper. This benefits the case when the size of each iCU is relatively small.

2.4.1.1 Map implementation

In the M-Heavy implementation, the map phase is responsible for the whole analysis procedure. Data belonging to the same iCU have to be read by the same Mapper. The file split flag is set to false so that a file cannot be split between multiple Mappers. Figure 2-6 shows the computation procedure and the data format/definition for each step in the map-phase implementation. The input to a Mapper is a pair, where the key is the file name and the value is the whole content of the file. The Mapper extracts useful information from each record and stores it in a nested hashmap, m. In the case that data belonging to multiple iCUs are contained in the same file (many-to-one data mapping), each entry in the outer-layer hashmap is an iCU, and each entry in the inner-layer hashmap is a dCU. Thus, Stage 1 of Algorithm 1 is performed on each entry of the inner-layer hashmap independently, and Stage 2 of Algorithm 1 is performed on each entry of the outer-layer hashmap independently.

2.4.1.2 Reduce implementation

Since the whole analysis computation is done in the map phase, the reduce phase is unnecessary and deprecated.

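The nested grouping inside the M-Heavy Mapper can be sketched as follows. This is a minimal Python illustration rather than the system's Java code; the record fields are those of Table 2-1 and the function name is ours.

```python
def group_records(records):
    """Build the nested map m used by the M-Heavy Mapper:
    the outer key Kr = (RECIPE, ITEM) selects an iCU, the inner key
    Ke = EQP_ID selects a dCU, and values are the AVG readings."""
    m = {}
    for rec in records:
        kr = (rec["RECIPE"], rec["ITEM"])
        m.setdefault(kr, {}).setdefault(rec["EQP_ID"], []).append(rec["AVG"])
    return m
```

Stage 1 of Algorithm 1 then runs once per inner-layer entry (dCU), and Stage 2 once per outer-layer entry (iCU).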
Figure 2-6. M-Heavy implementation: Map phase

2.4.2 R-Heavy Implementation

The R-Heavy implementation maps the whole analysis procedure to the reduce phase, and Mappers filter the data by the different Krs. In this implementation, there is no limitation on the input data.

2.4.2.1 Map implementation

Like a typical Hadoop application, the R-Heavy implementation filters (or categorizes) the data during the map phase. Each Mapper reads a block of data, a whole file or part of a file, depending on the block size of HDFS and the file-splitting configuration settings. The input of the map function is a pair, where the key is the offset of a record and the value is the record. Figure 2-7 shows the computation procedure and the data format/definition of the map phase. The Mapper takes a record and extracts useful information from it, especially the iCU information. The output of the Mapper is a pair, where the key is the iCU information and the value is the useful record information, including the data to be analyzed.

Figure 2-7. R-Heavy implementation: Map phase

2.4.2.2 Reduce implementation

After the map phase has finished, pairs whose keys are the same are sent to the same Reducer. In this case, records from the same iCU are sent to the same Reducer. The Reducer takes as input the list of values generated by the map phase that share the same iCU, and performs Algorithm 1 on the values of that iCU. Figure 2-8 shows the computation procedure and the data format/definition of the reduce phase for the R-Heavy implementation.

Figure 2-8. R-Heavy implementation: Reduce phase

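The R-Heavy flow can be illustrated end to end with a small, self-contained simulation. These are Python stand-ins for the system's Java Mapper and Reducer; the shuffle is modeled by grouping emitted pairs on their key, and the reduce body is a placeholder for Algorithm 1 (it only counts records per dCU).

```python
def r_heavy_map(record):
    # Emit (Kr, payload): the key is the iCU, the value keeps what the
    # reduce-side analysis needs (equipment id and the AVG reading).
    kr = (record["RECIPE"], record["ITEM"])
    return kr, (record["EQP_ID"], record["AVG"])

def shuffle(pairs):
    # Stand-in for Hadoop's shuffle phase: group values by key.
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups

def r_heavy_reduce(kr, values):
    # Placeholder for Algorithm 1: group the iCU's values into dCUs
    # (the real Reducer would run Stages 1 and 2 here).
    per_dcu = {}
    for eqp, avg in values:
        per_dcu.setdefault(eqp, []).append(avg)
    return kr, {eqp: len(v) for eqp, v in per_dcu.items()}
```

Because the grouping happens in the shuffle, the Mapper never needs pre-grouped input, which is why the R-Heavy implementation works for every data mapping strategy.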
2.4.3 MR-Balanced Implementation

The MR-Balanced implementation maps the first-stage computation to the map phase and the second-stage computation to the reduce phase. Since the first-stage computation requires that the data are grouped by dCUs, data belonging to the same dCU have to be collected and sent to the same Mapper before the computation starts. Mappers send results belonging to the same iCU to the same Reducer. This implementation benefits the case in which the size of each dCU is relatively large.

2.4.3.1 Map implementation

Since the first-stage computation of Algorithm 1 is performed in the map phase, records belonging to the same dCU have to be sent to the same Mapper. Thus, each Mapper has to read the whole file in case the file is split. The input to the Mapper is a pair where the key is a file's name and the value is the content of the file. Figure 2-9 shows the implementation of the map function in each Mapper. The Mapper pulls the contents of a file into memory and organizes single records by their dCUs. Then the Mapper computes percentiles, eliminates outliers and calculates the process capability for each list of values that belong to the same dCU. The result of these computations is stored in an object, referred to as InterData, which can be transferred between Mappers and Reducers. InterData is an object representing a list of AVG values belonging to the same dCU; it implements the Serializable interface so that it can be transmitted between the map and reduce phases, which is necessary for the MR-Balanced implementation. The following is an example of an instance of InterData.

23976                               # table size
97.5395 97.5483 97.5485 ...         # the data table
0.3850                              # process performance index, Cpk
0.4121                              # process capability index, Ĉ
A5_001_R01 Gas_Flow_Rate, DIFF06    # RECIPE+ITEM, EQP_ID

The output of each Mapper is a set of pairs where the key is the Kr of the iCU and the value is an InterData object. The data belonging to the same iCU are grouped together and sent to Reducers. The default Partitioner, HashPartitioner, is used to distribute the outputs of Mappers to the Reducers based on the hash values of the keys.

Figure 2-9. MR-Balanced implementation: Map phase

2.4.3.2 Reduce implementation

After the map phase has finished, the Reducer receives outputs from Mappers. Each Reducer receives a list of InterData objects with the same Kr (data belonging to the same iCU). The input to the Reducer is a value list associated with the same key, where each value corresponds to an InterData object. Figure 2-10 shows the implementation of the

Reducer in the MR-Balanced implementation. The Reducer loops over the value list and picks the InterData that contains the maximum process performance index, Cpk, as the reference data set. For the other InterData objects, the Reducer computes the Euclidean distances of the data, both Within-Distance and Between-Distance, does an equivalence test using the Euclidean distances, and generates the final decision. The outputs of the Reducer are pairs where the key consists of RECIPE, ITEM and EQP_ID (Kt) and the value is the final decision, e.g., Equivalent, meaning that the data set Ai is comparable to the reference data set and its associated EU is healthy.

Figure 2-10. MR-Balanced implementation: Reduce phase

2.4.4 Relationships with Data Mapping

Table 2-3 shows the relationship between the different implementations and the different data mapping strategies. The R-Heavy implementation has no limitation on the data

mapping strategy, while the other two implementations both have requirements on the data mapping strategies. Thus, the R-Heavy implementation works for all data-mapping strategies. As discussed in Chapter 1, for experimental purposes, the database is kept functioning. If the future architecture removes the database, the R-Heavy implementation is the only solution that is capable of functioning without assistance from the database or preprocessing of the data: the map phase of the R-Heavy implementation groups the data according to the iCU or dCU information of each record, so that no grouping by the database or preprocessing of the data is required prior to the computation. The MR-Balanced implementation works with the per-iCU one-to-one and many-to-one strategies as well as the per-dCU one-to-one and many-to-one strategies. The M-Heavy implementation only works with the per-iCU one-to-one and many-to-one strategies. Both the M-Heavy and MR-Balanced implementations are constrained to specific data mapping strategies; the input data to these two implementations require either assistance from the database or a preprocessing block that categorizes the data according to the different Krs and Kts.

Table 2-3. Relationships between different implementations and data mapping strategies (cells list the implementations that work for each combination)

            One-to-one                      Many-to-one                     One-to-many
Arbitrary   R-Heavy                         R-Heavy                         R-Heavy
Per-iCU     R-Heavy, M-Heavy, MR-Balanced   R-Heavy, M-Heavy, MR-Balanced   R-Heavy
Per-dCU     R-Heavy, MR-Balanced            R-Heavy, MR-Balanced            R-Heavy

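Table 2-3 can be encoded directly as a lookup. This helper is illustrative only (not part of the system); it simply restates the table's rows and columns as conditions.

```python
def valid_implementations(db_mapping, file_mapping):
    """Implementations that work for a given (database-to-files,
    files-to-Mappers) strategy pair, per Table 2-3."""
    # One-to-many file mapping or arbitrary database mapping
    # leaves only R-Heavy.
    if file_mapping == "one-to-many" or db_mapping == "arbitrary":
        return ["R-Heavy"]
    if db_mapping == "per-iCU":
        return ["R-Heavy", "M-Heavy", "MR-Balanced"]
    if db_mapping == "per-dCU":
        return ["R-Heavy", "MR-Balanced"]
    raise ValueError((db_mapping, file_mapping))
```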
CHAPTER 3
SYSTEM OPTIMIZATION

We have conducted an in-depth study of the data to be analyzed and proposed different data mapping strategies. Based on the data mapping strategies and the computations performed on the data, three implementations were introduced to map the statistics-based analysis to the Hadoop platform. In this chapter, we optimize the system from two perspectives: tuning Hadoop configuration parameters, and optimizing the analysis algorithm by factoring computations when possible so that results can be re-used and computations avoided.

3.1 Hadoop Configuration Parameter Tuning

To achieve timely and scalable computation, Hadoop configuration parameters need to be tuned for each specific application. Studies [21, 34] show that the performance of a Hadoop application suffers severely if inappropriate configuration parameters have been set. Much in-depth work [21, 35, 36] has been done on improving performance by tuning configuration parameters. We adopt a static approach: we evaluate the application on a small amount of data to decide on appropriate configuration parameter settings. A subset of the parameters is summarized in Table 3-1 and Table 3-2. The target of the tuning is the R-Heavy implementation, since it is the implementation that works for every data mapping strategy. It also takes advantage of the data shuffling phase performed by the Hadoop platform, which collects the values generated in the map phase into groups of values, each group sharing the same key. We introduce these configuration parameters in this chapter and evaluate their effect on system performance in Chapter 4.

3.1.1 Parameters of the MapReduce Programming Model

As discussed in Chapter 1, applications following the MapReduce programming model are naturally parallelized and distributed among clusters of computers. The number of map tasks and the number of reduce tasks determine the degree of parallelization.

Table3-1. Asubsetofcongurationparameterstunedforsystemoptimization-part1[ 1 ] NameFunctionality mapred.map.tasksNumberofmaptasksinajobexecutiondfs.block.sizeHDFSblocksizemapred.reduce.tasksNumberofreducetasksinajobexecutionio.sort.mbThesize(inMB)ofthebuerthatisusedforwritingamaptaskoutputio.sort.spill.percentThethresholdofthemapoutputbuer.Ifthesizeofoutputinthebuerexceedsthethreshold,thebackgroundthreadstartstospillcontentintodisks.io.sort.factorThemaximumnumberofstreamsusedformergingmap-phaseoutputorreduce-phaseinputmapred.compress.map.outputThisagindicateswhetherornottocompresstheoutputofmaptasksmapred.reduce.parallel.copiesThenumberofthreadsusedinreducetaskstocopyintermediateresultsfrommaptaskstothereducetask. mapred.job.shue.input.buer.percentTheproportionofthememoryofareducetaskthatisusedtoholdoutputfrommaptasks.mapred.job.shue.merge.percentThethreshold(theproportionofthebuerusedtoholdoutputfrommaptasks)ofthebuerthatisusedtoholdinputtoareducetask.mapred.inmem.merge.thresholdThethreshold(thenumberofrecordsfrommaptasks)ofthebuerthatisusedtoholdinputtoreducetask. mapred.map.tasks.speculative.executionTheagenablesthemapphasespeculativeexecution. mapred.reduce.tasks.speculative.executionTheagenablesthereducephasespeculativeexecution. 52


Table 3-2. A subset of configuration parameters tuned for system optimization, part 2

Name: Functionality
mapred.job.reuse.jvm.num.tasks: The number of tasks that reuse the same JVM. Note that they do not share the JVM; tasks that use the same JVM are executed sequentially.
io.file.buffer.size: The buffer size used by Hadoop I/O
mapred.reduce.slowstart.completed.maps: The proportion of map tasks that must have finished before the reduce tasks start
mapred.tasktracker.map.tasks.maximum: The maximum number of map tasks executed on the same TaskTracker
mapred.tasktracker.reduce.tasks.maximum: The maximum number of reduce tasks executed on the same TaskTracker
mapred.child.java.opts: Java heap size for map/reduce tasks

The number of map tasks is determined by the total number of splits of the input data to a MapReduce job. (It is assumed that the file-split flag is true so that a large file can be split across multiple storage nodes. If this flag is false, a large file is still split into blocks; the map task, however, reads a whole file, no matter how many blocks the file is split into, so in that case the number of map tasks is determined by the number of input files.) Using the one-to-one data mapping strategy, each split refers to a file. Using the one-to-many data mapping strategy, each split refers to a block of a file. The block size of HDFS determines the number of blocks into which a file is split in the storage system. If the file size is less than the block size, the file cannot be split. Otherwise, the number of splits is computed as \lceil F_s / B_s \rceil, where F_s is the size of the file, B_s is the block size of HDFS, and \lceil \cdot \rceil denotes the ceiling function. To decrease the number of map tasks, the block size of HDFS should be increased and the number of files decreased, with the total amount of data unchanged.
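As a concrete illustration of the split arithmetic above, the following Python sketch (the helper name and example sizes are ours, not part of the system) computes the number of splits, and therefore map tasks, generated for one splittable file:

```python
import math

def num_splits(file_size, block_size):
    """Number of input splits for one splittable file.

    A file smaller than one HDFS block yields a single split; otherwise
    the split count is ceil(Fs / Bs), as described in the text.
    """
    if file_size <= block_size:
        return 1
    return math.ceil(file_size / block_size)

MB = 1024 ** 2
# A 300 MB file with a 128 MB HDFS block size yields 3 splits, hence
# 3 map tasks; raising the block size to 512 MB collapses it to 1.
print(num_splits(300 * MB, 128 * MB))  # 3
print(num_splits(300 * MB, 512 * MB))  # 1
```

Summing num_splits over all input files gives the total number of map tasks launched for the job.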


The number of reduce tasks refers to how many reduce tasks are launched during a job execution. This parameter is controlled by the user. Usually, the number of reduce tasks is set according to the available resources [1]. For the R-Heavy implementation, the map phase is responsible for filtering out useless information and categorizing data by iCUs, while the reduce phase conducts the analysis described in Algorithm 1. The number of files and the block size of HDFS determine the number of map tasks, and the number of reduce tasks is controlled by the parameter mapred.reduce.tasks. Chapter 4 shows evaluation results for the effect of these two factors.

3.1.2 Parameters of Hadoop Workflow

Extending the overview of Hadoop in Chapter 1, this section presents an in-depth view of the Hadoop workflow. We then analyze the different configuration parameters involved in the workflow that may influence the performance of applications. As shown in Figure 1-3, when the master node assigns a map/reduce task to a slave node, the TaskTracker on the slave node runs the map/reduce task. Figure 3-1 shows the map/reduce task execution workflow, including the shuffling phase, in detail.

In a Hadoop MapReduce job, the input data to the reduce tasks are sorted by their keys. The shuffling phase in Figure 1-3 shows the process of sorting and transferring intermediate data from the map phase to the reduce phase. Figure 3-1 details the shuffling phase execution. Each map task is allocated a memory block used for buffering the output of the task. During map task execution, outputs of the map task are written to this buffer first. When the size of the data in the buffer exceeds a specific threshold, a background thread starts to spill the data to the local disk. Every time the size of the data in the buffer exceeds the threshold, a file is generated. At the end of the map task there may be multiple files, which are usually merged into three files. Before the reduce task starts, several threads retrieve the outputs from map tasks. Since the reduce task may retrieve outputs from different map tasks, there may be multiple pieces of data before the execution of the reduce task. A background thread merges the multiple pieces


Figure 3-1. A shuffling phase of a MapReduce job. Each map task contains a map function. The buffer associated with the map function stores the output from the map function; when the size of the data in the buffer exceeds some threshold, the data are spilled to disk. The blocks following the buffer in each map task represent the blocks written on the disk. Each reduce task contains a reduce function. Before the reduce function starts, the reduce task retrieves the output from map tasks and merges the outputs into the input to the reduce function, represented by the arrow and the merge streams in each reduce task.

of data into one piece. The reduce task then starts with the one-piece input from the shuffling phase. After the reduce function executes, outputs are written into HDFS.

Intermediate outputs generated during the map phase are stored in a buffer whose size is specified by the io.sort.mb parameter. If the amount of intermediate data exceeds a threshold, a background thread spills the intermediate data to the disk so that additional outputs from map tasks can be written to the buffer. The threshold is determined by both the size of the buffer and a percentage of that size, io.sort.spill.percent, above which data are spilled. Before the intermediate outputs are written to the disk, the background thread partitions the data according to the corresponding reduce tasks. The data to be sent to the same reduce task are placed in the same partition and sorted according to intermediate keys. If


io.sort.spill.percent percent of the buffer is full, the background thread starts to spill the contents to the disk with regard to the partitions in the buffer. Because of the amount of intermediate data that map tasks generate, there may be multiple spilled files. Before the map phase ends, the multiple spilled files are merged into a single file. The configuration parameter io.sort.factor determines the number of streams used for merging at once. It is also possible to compress the intermediate data before the data are written to the disk. If the data are compressed in the map phase, they have to be decompressed in the reduce phase. mapred.compress.map.output is the flag deciding whether or not to compress the output of the map phase; mapred.map.output.compression.codec tells the Hadoop platform which specific compression library to use [1]. For the R-Heavy implementation, the map phase extracts the useful information and composes intermediate key-value pairs. Before the map phase ends, the intermediate pairs are partitioned and sorted with regard to the different intermediate keys Kr. The configuration parameters involved in this phase may have a significant effect on the execution time of the application.

During the reduce phase, the reduce task retrieves the partition of the data that is supposed to be sent to it. Since multiple map tasks may finish at different times, the reduce task starts to retrieve a partition as soon as the map task that generates the partition has finished. The locations of partitions are communicated to the JobTracker through the heartbeat process by each TaskTracker. The number of threads that are used to copy outputs of map tasks to a reduce task is determined by the parameter mapred.reduce.parallel.copies. The intermediate data generated by map tasks are copied into a buffer at the reduce task, whose size is determined by mapred.job.shuffle.input.buffer.percent, the percentage of the heap of the reduce task to be allocated for the buffer. Similar to the map phase, when the amount of data in the buffer exceeds a threshold, the data in memory are spilled to the disk. Two parameters control this threshold: mapred.job.shuffle.merge.percent and mapred.inmem.merge.threshold.
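The map-side buffering and spilling behavior described above can be modeled with a small sketch. This is a deliberately simplified model we wrote for illustration (real Hadoop also accounts for serialized record boundaries and partition metadata); it only shows how io.sort.mb and io.sort.spill.percent together determine how many spill files a map task produces:

```python
def simulate_map_spills(record_sizes_mb, io_sort_mb=200, spill_percent=0.8):
    """Toy model of the map-side output buffer.

    Map outputs accumulate in a buffer of io_sort_mb megabytes; whenever
    the occupied fraction reaches spill_percent, a background thread
    spills the buffered data to a local file. Returns the number of
    spill files the map task produces, counting the final flush.
    """
    threshold = io_sort_mb * spill_percent
    buffered = 0.0
    spills = 0
    for size in record_sizes_mb:
        buffered += size
        if buffered >= threshold:   # buffer crossed the spill threshold
            spills += 1
            buffered = 0.0          # simplification: a spill empties the buffer
    if buffered > 0:                # remaining data flushed when the task ends
        spills += 1
    return spills
```

With the defaults above (a 200 MB buffer, 0.8 spill fraction), ten 50 MB output chunks produce three spill files, which are then merged io.sort.factor streams at a time before the map phase ends.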


After all required inputs for a reduce task have been retrieved from map tasks, a background thread merges the inputs into one sorted file. The io.sort.factor parameter introduced previously also determines the number of streams used for this merging. Then the reduce phase computation starts, and the outputs are written into HDFS.

The reduce phase of the R-Heavy implementation retrieves the partition, a set of intermediate key-value pairs belonging to the same iCU, prior to the execution of the reduce function. The data belonging to the same iCU may be generated by different map tasks. The reduce tasks retrieve the data from multiple locations and merge the data into one sorted data set. After the data have been merged, the reduce function conducts the computations described in Algorithm 1 on each iCU. Each reduce task may process one or multiple iCUs. The configuration parameters involved in a MapReduce workflow significantly influence how available resources are utilized; a better parameter setting may bring a performance boost.

3.1.3 Other Hadoop Features

We have introduced the parameters involved in the MapReduce programming model and in a Hadoop job workflow. Other parameters related to task execution, I/O, and environment settings may influence the performance of applications as well.

The MapReduce programming model allows a job to be distributed among multiple nodes to achieve space-parallel execution. During a job execution, one task may slow down the total execution time of an application when other tasks have already finished and the application is still waiting for the results from the slow task. If the execution time of a task is significantly longer than that of other parallel tasks, Hadoop tries to schedule the same task to be executed on a different node and accepts the results returned by the first node that completes the task. This procedure is referred to as speculative execution. This property significantly improves the performance of a Hadoop job, so that a single slow task does not affect the total job execution time. Speculative execution is


not about launching two tasks to process the same data block at the same time. Instead, Hadoop computes the task execution rate and determines which task needs to be scheduled on a different node. The parameters mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution are two flags enabling speculative execution in the map and reduce phases, respectively.

When the JobTracker assigns a task to a TaskTracker, the TaskTracker assigns the task to be executed in an individual Java Virtual Machine. If the TaskTracker creates one JVM for each task assigned to it, the total JVM creation time is significant for a large number of short-lived tasks. The parameter mapred.job.reuse.jvm.num.tasks can enable the TaskTracker to reuse a JVM for other tasks. However, multiple tasks do not share the same JVM at the same time; instead, they are executed sequentially by the same JVM.

Another property of the Hadoop platform is lazy initialization of reduce tasks. The parameter mapred.reduce.slowstart.completed.maps decides the percentage of the map phase that must have finished before the reduce tasks are initialized. If reduce tasks are initialized at an early stage of a long-running job, the performance of the map phase may be degraded: reduce tasks will share resources with map tasks, yet reduce tasks cannot start before all map tasks have finished. Thus, lazy initialization can delay the start of the reduce phase so that the map phase can utilize all available resources.

After a Hadoop cluster has been set up, multiple environment parameters need to be determined. Appropriately setting these parameters benefits both the applications executed on the cluster and the resource utilization of the cluster. As shown in Figure 1-3, a TaskTracker works on each slave node. The number of tasks that a TaskTracker allows to run is configured by the parameters mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Given the available resources, the principle for setting the number of tasks for each TaskTracker is to achieve optimal utilization. Usually, the TaskTracker and the DataNode both run as individual processes. To utilize


the available resources, the parameter can be estimated as N_processor * w - 2, where N_processor is the number of processors available on the node, w is the number of processes running on the same processor, and the 2 accounts for the TaskTracker process and the DataNode process. We discuss our cluster settings in Chapter 4.

Table 3-1 and Table 3-2 summarize all configuration parameters introduced in this section. Studies show that tuning these parameters brings significant performance improvements to applications. The experimental evaluation of this section is shown in Chapter 4.

3.2 Algorithmic Optimization

We introduced the characterizations of the data to be analyzed in Section 2.1. One piece of information that each record carries is the timestamp indicating when the record was collected. Based on the timestamp information, the system retrieves the data from the database. New data are generated every day, and part of the old data is deleted. A typical statistical computation uses data collected during a "window" of several days (e.g., the last 14 days). Figure 3-2 shows a statistical analysis performed on the data from 14 days, Day-1 to Day-14. At the end of Day-15, the data from Day-1 are deleted and the Day-15 data are added; the computation is then performed on the data from Day-2 to Day-15.

Figure 3-2. The sliding window of statistical analysis. The window size is 14 days.


In this section, we take the timestamp information into consideration and propose a method that takes advantage of previous data and computation results to algorithmically optimize the performance of the system.

3.2.1 Data Movement

Three data mapping strategies were discussed in Section 2.3: arbitrary, per-iCU, and per-dCU mapping. We propose a new data mapping strategy that considers the timestamp information of each record, referred to as per-window mapping. Per-window mapping maps the data belonging to the same time period to the same file. For example, the data collected on Day-1 are mapped to the same file, as shown in Figure 3-3. Since massive amounts of data can be collected each day, per-window mapping can be combined with the other mapping strategies to decrease the file size.

Figure 3-3. Data mapping strategy with regard to the timestamp of each record.
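A minimal sketch of the per-window mapping idea, assuming records are (timestamp, value) pairs and that a caller-supplied day_of function extracts the day from a timestamp (both names are ours, for illustration only):

```python
from collections import defaultdict

def per_window_mapping(records, day_of):
    """Group records into per-day buckets; under per-window mapping each
    bucket would be written to its own HDFS file, so a daily run only
    uploads the newest bucket and deletes the oldest one."""
    buckets = defaultdict(list)
    for record in records:
        timestamp = record[0]
        buckets[day_of(timestamp)].append(record)
    return dict(buckets)
```

With hour-resolution integer timestamps, for instance, day_of can simply be lambda t: t // 24.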


Using this approach, the analysis system only retrieves the newly generated data each day instead of retrieving all data that belong to the same computation window. Prior to the analysis, the system removes the outdated data and adds the new data to HDFS in a relatively short preprocessing procedure. The amount of data to be moved decreases to approximately 1/14 of the original amount when the window size is 14 days.

3.2.2 Algorithmic Optimization

If we keep some intermediate analysis results from previous computations, we can take advantage of the precomputed results to decrease the computation time for the data window of interest. In this section, Stage 1 of Algorithm 1, discussed in Section 2.2, is optimized by introducing a new algorithm.

3.2.2.1 The sliding-window algorithm

Stage 1 of Algorithm 1 consists of multiple steps. Among those steps, sorting and computing the average and the standard deviation of each dCU are the most time-consuming parts. In each analysis run, only a relatively small amount of data is added to the whole data set. Taking Figure 3-2 as an example: at Day-15, the new data generated on Day-15 are added to the Day-15 analysis run, and the outdated data from Day-1 are deleted. If it is assumed that the amount of data generated each day is the same, the newly added data set is 1/14 of the total data set to be analyzed. Thus, if the computation is performed only on the newly generated data, the total computation time can be decreased.

We propose a sliding window algorithm to optimize the computation. It is assumed that the previous records are sorted and that the average and the standard deviation of each dCU are known from previous computation results. The sorting can be done by first sorting the small newly generated data set, and then using the merge sort algorithm [37] to merge the small data set into the previous sorted data set. Simultaneously, the algorithm checks the timestamp information of each record and deletes outdated data when the record lies outside the computation window. SlidingWindowSort is the pseudo-code for the sorting algorithm.


procedure SlidingWindowSort(Array A (sorted array), Array U, Window w)
    Sort U using any sorting algorithm
    Allocate output array B
    while i < length(A) or j < length(U) do
        select the smaller of A[i] and U[j] as the next record
        if the record's timestamp lies inside window w then
            append the record to B
        end if
        advance the index of the array from which the record was taken
    end while
    return B
end procedure
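The merge step of SlidingWindowSort can be sketched in Python as follows; records are assumed to be (timestamp, value) tuples and the window is represented by its oldest admissible timestamp (both conventions are ours):

```python
def sliding_window_sort(sorted_old, new_records, window_start):
    """Merge already-sorted window records with a batch of new records,
    dropping everything older than window_start along the way.

    sorted_old is assumed sorted by timestamp; new_records may arrive
    unsorted. The cost is O(n + k log k) for n old and k new records.
    """
    new_records = sorted(new_records)        # sort only the small new batch
    merged = []
    i = j = 0
    while i < len(sorted_old) or j < len(new_records):
        # Classic merge step: take the smaller head of the two streams.
        if j == len(new_records) or (i < len(sorted_old)
                                     and sorted_old[i] <= new_records[j]):
            record = sorted_old[i]
            i += 1
        else:
            record = new_records[j]
            j += 1
        if record[0] >= window_start:        # keep only in-window records
            merged.append(record)
    return merged
```

Because only the new batch is sorted from scratch, the dominant cost for a large window is the single linear merge pass.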

The next step is to compute the average and the standard deviation of the updated data. Again, Figure 3-2 serves as an example to show the computation procedure. We derive the computed results mathematically as follows.

\mu = \frac{1}{N}\sum_{i=1}^{N} x_i    (3-1)

\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}    (3-2)

\sigma = \sqrt{\frac{1}{N}\Big(\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\big(\sum_{i=1}^{N} x_i\big)^2\Big)}    (3-3)

Let \mu_1 denote the average of Day-1, \mu_{15} the average of Day-15, \mu the average from Day-1 to Day-14, and \mu' the average from Day-2 to Day-15. Let N_1, N_{15}, and N be the numbers of records in Day-1, Day-15, and Day-1 to Day-14, respectively. Then \mu'(N - N_1 + N_{15}) = \mu N - \mu_1 N_1 + \mu_{15} N_{15}, and \mu' is computed through

\mu' = \frac{\mu N - \mu_1 N_1 + \mu_{15} N_{15}}{N - N_1 + N_{15}}.    (3-4)

It is not obvious how to use previous results to compute the standard deviation of each dCU. We can start by computing the variance of the data set from Day-2 to Day-15. Let v' be the variance from Day-2 to Day-15. Then

v' = \frac{\sum_{\text{Day-2}}^{\text{Day-15}} (x_i - \mu')^2}{N - N_1 + N_{15}},

where x_i is an element of the data from Day-2 to Day-15. The equation can be expanded as

v' = \frac{\sum_{\text{Day-2}}^{\text{Day-15}} (x_i^2 - 2 x_i \mu' + \mu'^2)}{N - N_1 + N_{15}}.

We define S_2 as the summation of squares of the elements in a certain data set and S as the summation of all elements in the data set; then

S_2 = \sum x_i^2, \qquad S = \sum x_i.


The variance can then be computed through

v' = \frac{S_2 - S_{2,1} + S_{2,15} - 2\mu'(S - S_1 + S_{15}) + (\mu' N')^2 / N'}{N'},

where N' = N - N_1 + N_{15}, the subscripts 1 and 15 denote the quantities computed over Day-1 and Day-15, and the unsubscripted S and S_2 are computed over Day-1 to Day-14. The standard deviation, then, is derived through

\sigma' = \sqrt{v'} = \sqrt{\frac{S_2 - S_{2,1} + S_{2,15} - 2\mu'(S - S_1 + S_{15}) + (\mu' N')^2 / N'}{N'}}.    (3-5)

So far, we have used the average, the summation, and the summation of squares of the previous data to compute the average and the standard deviation of the updated data. Algorithm 2, shown below, uses the sliding window approach to compute the average and the standard deviation from previous results. We assume that each element of a double-precision array carries a timestamp; when iterating over the array, we can extract the outdated data by checking the timestamp of each element. A_i represents the i-th 1D double-precision array in the linked list inputList, and A_i[k] represents an element of the array A_i. U_i is the i-th 1D double-precision array in the linked list newData; it holds the newly added data for the array A_i. S_X is the summation of the elements in array X; S_{2,X} is the summation of squares of the elements in X; \mu_X is the average of the elements in X; N_X is the number of elements in X. Each array A_i is sorted, and S_{A_i}, S_{2,A_i}, \mu_{A_i}, and N_{A_i} are known.

The first stage of Algorithm 2 starts by sorting the newly added data and computing their average, summation, and summation of squares. Then the algorithm merges the newly added data into the previous data and removes the outdated data by checking the timestamp of each record. After merging, outlier limits are computed to identify outliers. Finally, the proposed equations, Equation 3-4 and Equation 3-5, are used to compute the average and the standard deviation of the updated data. The second stage remains unchanged.

3.2.2.2 Algorithm analysis

The first stage of Algorithm 2 is optimized using the SlidingWindowSort algorithm and Equations 3-4 and 3-5. After sorting, the percentile and outlier limit


Algorithm 2 Sliding window approach to the equivalence test
Input: LinkedList inputList, LinkedList newData
for each DoubleArray A_i in inputList, each DoubleArray U_i in newData do
    Sort U_i
    Compute S_{U_i}, S_{2,U_i}, \mu_{U_i}, N_{U_i}
    Merge U_i into A_i; extract the outdated elements and store them in X_i
    Compute S_{X_i}, S_{2,X_i}, \mu_{X_i}, N_{X_i}
    Compute the percentile values using the Percentile function on X_i
    Compute the lower and upper outlier limits LOL, UOL using the percentile values
    Eliminate every X_i[k] that is not in the range [LOL, UOL]
    N_i = N_{A_i} - N_{X_i} + N_{U_i}
    Compute \mu_i = (\mu_{A_i} N_{A_i} - \mu_{X_i} N_{X_i} + \mu_{U_i} N_{U_i}) / N_i
    Compute \sigma_i = \sqrt{(S_{2,A_i} - S_{2,X_i} + S_{2,U_i} - 2\mu_i(S_{A_i} - S_{X_i} + S_{U_i}) + (\mu_i N_i)^2 / N_i) / N_i}
    Compute the process capability index Cpk[i]
end for
Pick the X_i with the largest Cpk as the reference set R
Compute the Euclidean distance W_R[k] = \sqrt{(R[k] - \mu_R)^2}
Compute the Euclidean distance B_R[k] = \sqrt{(R[k] - \mu)^2}
Compute \mu_{W_R}, \sigma_{W_R} and \mu_{B_R}, \sigma_{B_R}
for each DoubleArray X_i in inputList do
    Compute the Euclidean distance W_i[k] = \sqrt{(X_i[k] - \mu_i)^2}
    Compute the Euclidean distance B_i[k] = \sqrt{(X_i[k] - \mu_R)^2}
    Compute \mu_{W_i}, \sigma_{W_i} and \mu_{B_i}, \sigma_{B_i}
    Do a pairwise Student's t-test between W_R (or B_R) and W_i (or B_i)
    Make the final decision based on the result of the t-test
end for

computations are the only ones that remain to be done, and they are not time-consuming. The standard deviation used to be computed with a two-pass algorithm [38], which requires accessing each record twice; in our approach, we only need to access each record once. An in-memory sorting algorithm has a time complexity of O(n log n), where n is the number of records to be sorted. Using the SlidingWindowSort algorithm, we take advantage of the merge sort algorithm. The new approach results in a time complexity of O(n + k log k), where n is the total number of records to be sorted and k is the number of newly generated records.


The two-pass algorithm for computing the standard deviation accesses each record twice: once to compute the average and once to compute the standard deviation. The sliding window algorithm, like an online algorithm, accesses each record only once. When memory space is limited, this can decrease memory access time by 50%. If the previous results are known, computing the standard deviation only requires computing the average and the summation of squares of the newly generated data set. The execution time of the sliding window algorithm for computing the standard deviation is therefore 1/14 that of the original algorithm (assuming that the amount of data generated each day is the same).
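To make the update concrete, here is a small Python sketch of the sliding-window statistics update. The function and variable names are ours, and it uses the algebraically equivalent closed form var = S2'/N' - mean'^2 rather than the expanded Equation 3-5:

```python
import math

def window_stats_update(S, S2, N, dropped_day, added_day):
    """Shift the statistics window by one day (Equations 3-4 and 3-5).

    S, S2, N are the sum, sum of squares, and record count over the
    previous window; dropped_day and added_day are the value lists for
    the day leaving and the day entering the window. Returns the new
    mean and standard deviation plus the carried sums for the next run.
    """
    S_new = S - sum(dropped_day) + sum(added_day)
    S2_new = S2 - sum(x * x for x in dropped_day) + sum(x * x for x in added_day)
    N_new = N - len(dropped_day) + len(added_day)
    mean = S_new / N_new
    variance = S2_new / N_new - mean * mean
    # max() guards against tiny negative values from floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0)), S_new, S2_new, N_new
```

For a 14-day window this touches only the two boundary days, which is the source of the roughly 1/14 cost reduction claimed above.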


CHAPTER 4
EXPERIMENTAL EVALUATION AND ANALYSIS

This chapter summarizes the experimental results for the data analysis approaches discussed in the previous chapters. Experiments were done to evaluate i) the different data mapping strategies corresponding to the different implementations; ii) the scalability of the different implementations with regard to large-scale data sets; iii) the effect of different Hadoop configuration parameters on performance; and iv) the performance of the sliding window algorithm.

4.1 Data Mapping and MapReduce Implementation Experiments

The data collected from monitoring sensors can grow in multiple dimensions. The data size (in terms of records), S, depends on the number of iCUs, I, the number of dCUs in each iCU, D, and the size of each dCU, R:

S = F(I, D, R).    (4-1)

In the following experiments, it is assumed that the size of each dCU and the number of dCUs of each iCU are the same. Therefore, Equation 4-1 becomes

S = I * D * R.    (4-2)

The experiments are set up on a local cluster consisting of 9 physical nodes: one dedicated master node and eight slave nodes. Each physical node contains 8 Intel Xeon 2.33 GHz processors and 16 GB of memory. Cloudera's Distribution of Hadoop (CDH) 4.2.1 is installed on the local cluster.

4.1.1 Data Mapping

The different data mapping strategies were described in Chapter 2. Experiments are set up to compare the different strategies corresponding to the different implementations.


4.1.1.1 Experiment setup

We assume the number of dCUs in each iCU, D, is a constant, 6 (there are approximately 6 EUs for each recipe and item). We conduct experiments following Table 2-3 by increasing the size of each dCU, R, from 1 MB to 24 MB. For the many-to-one scenario, the factor is chosen to be 2, meaning that each map task handles two files. The size of the data, S, is 30 GB. As R increases, the number of iCUs, I, decreases. Since the R-Heavy implementation can be considered as the M-Heavy implementation plus an extra function that groups the data belonging to the same iCU together, it is the only implementation that can be used with the arbitrary mapping strategy. Intuitively, the execution time of programs using the arbitrary mapping strategy is longer than with the other mapping strategies. Thus, we only consider the per-iCU and per-dCU mapping strategies.

4.1.1.2 Results analysis

For comparison purposes, we show the experimental results in two different figures. Figure 4-1 compares the execution time of the different data mapping strategies for each implementation. For the R-Heavy implementation, when the size of a dCU is less than 16 MB, the per-iCU many-to-one mapping strategy outperforms the other mapping strategies. The MR-Balanced implementation shows similar behavior. For the M-Heavy implementation, the execution time changes only slightly as the size of a dCU changes. All three implementations show a slightly increasing trend after the size of a dCU exceeds 12 MB. Since a per-dCU one-to-one mapping generates many more files than a per-iCU many-to-one mapping, the Hadoop platform generates more map tasks in the per-dCU one-to-one scenario than in the latter. Thus, the execution time for the per-dCU one-to-one mapping strategy is longer than the time taken for the per-iCU case. However, as the size of each dCU increases, the task execution time, including computation and data movement time, dominates the total execution time, which results in similar execution times for all mapping strategies.


Figure 4-1. Different data mapping strategies for each implementation.

Figure 4-2 compares the execution time of the different implementations for each data mapping strategy. The different implementations show similar behavior for the same data mapping strategy. The per-iCU many-to-one strategy shows a slightly increasing trend when the size of a dCU exceeds 8 MB. As the size of a dCU grows, each map task handles larger and larger files. Since the computation is memory-intensive, the data size of each task is bounded by memory. The data size of each task in the per-iCU mapping strategy is larger than in the other mapping strategies; thus, it is the first case to show an increasing trend in time consumption as the size of a dCU increases. If the size of a dCU is so large that a single iCU cannot be computed on a single node, a parallel algorithm must be designed to parallelize the execution. We have proposed an iterative MapReduce parallel algorithm in Appendix B.

4.1.2 Large Scale Data Sets

In order to capture the scale of monitoring data in a real-world manufacturing system, the experiments assume that 1 TB of data is to be analyzed each day.

4.1.2.1 Experiment setup

Monitoring data for 25 iCUs were generated by modifying actual values from records collected from a semiconductor manufacturing system [6]. It is assumed that R and D are constant. From this initial data set, the data size is grown by increasing the number of iCUs based on Equation 4-2. I is in the range from 625 to 2,000,000; the data size is increased from 0.32 GB to 1 TB correspondingly.


Figure 4-2. Different implementations for each data mapping strategy.

Since the size of each dCU is less than 1 MB, we use the per-iCU many-to-one data mapping strategy with a many-to-one factor of 125. The file size ranges from 54 MB to 73 MB. The DFS block size is 128 MB. The number of reduce tasks is set to 8.

4.1.2.2 Results analysis

Figure 4-3 shows several experimental results. As the data size increases, the computation time increases linearly as well. The M-Heavy implementation outperforms the R-Heavy implementation and the MR-Balanced implementation. The map phase execution times of the MR-Balanced and M-Heavy implementations are approximately the same. The map phase execution time of the R-Heavy implementation is shorter than that of the other implementations, because the map phase of the R-Heavy implementation only filters and categorizes records without any computation. The reduce phase execution time of the M-Heavy implementation is shorter than that of the other implementations, since the reduce phase of the M-Heavy implementation is


Figure 4-3. Time measurement for current data simulation.

absent. The reduce phase execution time of the R-Heavy implementation is the longest because all computations are done in this phase.

4.2 Hadoop Configuration-Parameter Tuning Experiments

We evaluate the effect of the different configuration parameters, shown in Table 3-1 and Table 3-2, on the performance of the R-Heavy implementation. As discussed in Chapter 3, we tuned the R-Heavy implementation because it works for any mapping strategy and takes advantage of the shuffling phase of a typical Hadoop job. Table 4-1 shows the default values of the configuration parameters for an R-Heavy job. This default value set is a typical setting for a Hadoop cluster, and we used it for the previous experiments. In the experiments below, when a specific parameter is tuned, all other parameters use the values in Table 4-1.

4.2.1 Parameters of the MapReduce Programming Model

The MapReduce programming model has two main phases, the map phase and the reduce phase. Thus, the number of map tasks and the number of reduce tasks are two important factors influencing the performance of a MapReduce program. In this section, these two factors are tuned and discussed. A 100 GB data set is generated for testing purposes. Each file contains data from different iCUs and dCUs. The number of map tasks, represented by mapred.map.tasks, is determined by both the parameter dfs.block.size and the size of each file. It is assumed


Table 4-1. The default values of the configuration parameters in Table 3-1 and Table 3-2

Name: Default value
mapred.map.tasks: 800
dfs.block.size: 128 MB
mapred.reduce.tasks: 16
io.sort.mb: 200 MB
io.sort.spill.percent: 0.8
io.sort.factor: 64
mapred.compress.map.output: true
mapred.reduce.parallel.copies: 10
mapred.job.shuffle.input.buffer.percent: 0.7
mapred.job.shuffle.merge.percent: 0.66
mapred.inmem.merge.threshold: 1000
mapred.map.tasks.speculative.execution: false
mapred.reduce.tasks.speculative.execution: false
mapred.job.reuse.jvm.num.tasks: 1
io.file.buffer.size: 64 KB
mapred.reduce.slowstart.completed.maps: 0.8
mapred.tasktracker.map.tasks.maximum: 8
mapred.tasktracker.reduce.tasks.maximum: 4
mapred.child.java.opts: 1536 MB

that the file is splittable and that the split size is the same as the block size of HDFS. The relationship can be represented by

M = \sum_{i=1}^{N} \lceil S_i / B \rceil    (4-3)

where M denotes the number of map tasks, N the total number of files, S_i the file size of the i-th file, and B the block size of HDFS. The R-Heavy implementation, like many other Hadoop applications, assigns each map task an independent block of data. For example, if a file size is 10 GB and the block size is 1.5 GB, the number of blocks is 7; thus, the number of map tasks launched for that file is 7. In this experiment, N is 10, and each file is 10 GB. By changing the block size, the number of map tasks launched is changed as well. Figure 4-4 shows the experimental results of the influence of the number of map tasks on the execution time of the program. As the block size increases, the number of map


Figure 4-4. Influence of the number of map tasks launched. The dots in the figure show the experimental results. Each dot has three dimensions: the number of map tasks, the HDFS block size, and the execution time. The execution time is shown using different shades of gray; the darker the area, the less time the job takes to run.

tasks launched decreases. As represented by Equation 4-3, the number of map tasks for each file is the ceiling of the file size divided by the block size of HDFS. Thus, in some cases, the data are not evenly distributed among map tasks. The dots in the figure show the experimental results. Each dot has three dimensions: the number of map tasks,


the HDFS block size, and the execution time. The vertical axis is the number of map tasks launched, and the horizontal axis is the block size of HDFS. The shades of gray represent the amount of time it takes for the job to run: the darker the area, the shorter the job's execution time. Though the number of map tasks decreases as the HDFS block size increases, the execution time does not show any clear trend. Because of unevenly distributed data, some map tasks take longer than the other map tasks in a job. If only the cases in which the data are evenly distributed are considered, however, the execution time decreases as the number of map tasks decreases, as shown in Figure 4-5. In this case, we consider block sizes of 64 MB, 128 MB, 256 MB, 512 MB, 1024 MB, 2048 MB, and 3072 MB, which evenly distribute the data from the same file among multiple blocks. Correspondingly, the number of map tasks decreases from 1600 to 40. The minimal execution time is achieved when the number of map tasks is 50, with the corresponding block size of 1024 MB. As the number of map tasks continues decreasing, the execution time increases instead, as shown at the beginning of the horizontal axis in Figure 4-5. This increasing trend indicates the problem of too large a granularity of the input data. As discussed previously, the JobTracker tries to assign a task to the node where the input data are located. If, on the other hand, the node that stores the input is busy, the input has to be transferred to another node. The penalty of data transmission¹ is low when the size of the transmitted data is small. As the block size increases, the time it takes to transmit the data in this situation becomes much longer. Thus, the time for input data transmission drags out the total execution time of the map tasks.

The parameter mapred.reduce.tasks decides how many reduce tasks are launched for a job. In our experiments, this parameter is varied from 1 to 128. Figure 4-6 shows the execution time for different numbers of reduce tasks. The execution time of the map phase

¹ The time it takes to transfer the input file from the source node to the target node is much longer than the time it takes to perform the computations locally.


Figure 4-5. Influence of the number of map tasks (evenly distributed data).

does not change much during these experiments. When the number of reduce tasks is relatively small, the execution time of the reduce phase decreases as the number of reduce tasks increases, which contributes to decreasing the total execution time of the job. As the number of reduce tasks increases, the utilization of the cluster's resources increases. When the number of reduce tasks exceeds 20, the execution time of the reduce phase remains unchanged. When the number of reduce tasks is relatively large, the execution time of the reduce phase shows a slightly increasing trend, because increasing the number of reduce tasks increases the system overhead. Once the system resources are fully utilized, increasing the number of reduce tasks only brings slightly more overhead.

4.2.2 Parameters of Hadoop Workflow

The shuffling phase plays an important role in a MapReduce job. Several of the configuration parameters involved are proven to significantly influence the performance of a MapReduce job. io.sort.mb and io.sort.spill.percent decide the buffer size for the output of a map task. The principle for setting these two parameters is that, as long as memory is available, the buffer size should be set as large as possible. Figure 4-7 shows the experimental


Figure 4-6. Influence of the number of reduce tasks on execution time.

results of the influence of these two parameters on the performance of the R-Heavy implementation. The result goes against the expectation: as both parameters increase, the execution time of the program increases as well. When io.sort.mb is set to 50 MB and io.sort.spill.percent is set in the range between 0.3 and 0.9, the execution time is decreased by 13% relative to the worst-case scenario in this experiment.

Figure 4-7. Influence of the buffer size for the output of the map function.

io.sort.factor represents the number of streams that are used for merging the output of a map task or the input of a reduce task. Theoretically, the larger the parameter, the less time it takes to merge the output of map tasks and the input of reduce tasks. Figure 4-8 shows the experimental results. Local minimum points are located at 16, 64, 128, etc. The total resources available in the cluster are 64 cores. Thus, the optimal performance of the program is achieved at the point where the merging phase utilizes all available resources


and no extra streams are allocated for this phase. In this experiment, that value is in the range [60, 75].

Figure 4-8. Influence of the parameter io.sort.factor on the execution time.

After the map phase, the outputs of map tasks are copied to the corresponding reduce tasks. The number of threads that are used for copying intermediate data is determined by the parameter mapred.reduce.parallel.copies. Figure 4-9 shows that the map phase execution time is not influenced by this parameter. The reduce phase execution time is minimal when the number of threads is 10, and Figure 4-9 suggests that choosing a number of threads in the range [5, 20] yields better execution times. As the parameter keeps increasing, the execution time increases as well: once the resources are fully utilized, additional threads bring only overhead rather than performance improvement.

The results of the map phase are first copied to a buffer, a block of the heap of the reduce task process. When the buffer is full, outputs from map tasks are spilled to disk. Three parameters, mapred.job.shuffle.input.buffer.percent, mapred.job.shuffle.merge.percent, and mapred.inmem.merge.threshold, influence the size of this buffer. For the R-Heavy implementation, the computation takes place in the reduce phase, which requires sufficient memory space. Thus, the size of the buffer used to temporarily store the output of each map task is limited so that the rest of the memory is sufficient for the computations. Figure 4-10 and Figure 4-11 show two groups of


Figure 4-9. Influence of the parameter mapred.reduce.parallel.copies on the execution time

experimental results. One shows the effect of mapred.job.shuffle.input.buffer.percent and mapred.job.shuffle.merge.percent on the execution time, and the other shows the effect of mapred.job.shuffle.input.buffer.percent and mapred.inmem.merge.threshold on the execution time. If any of the three parameters is set too large, the computation cannot finish due to the lack of memory space.

Figure 4-10. Influence of the reduce phase buffer size (percent). The stars in the figure indicate that the experiments cannot be finished due to the lack of memory space for the reduce task computation.

4.2.3 Other Hadoop Features

mapred.job.reuse.jvm.num.tasks is the parameter that decides how many tasks share a Java virtual machine (JVM). By default, each task uses an independent JVM; with reuse enabled, tasks within the same node share a JVM sequentially. When the parameter is less than 10, the execution time


Figure 4-11. Influence of the reduce phase buffer size (records)

of the map phase decreases as the number of tasks sharing a JVM increases, whereas the execution time of the reduce phase increases. When the parameter value is around 5, the total execution time is minimal.

Figure 4-12. Influence of the parameter mapred.job.reuse.jvm.num.tasks on execution time

mapred.reduce.slowstart.completed.maps is the parameter whose value is the proportion of the map phase that must have finished before the reduce phase starts. Typically, this is set to a large value so that all available resources are dedicated to the map phase for a large job. Figure 4-13 shows the experimental results of measuring the influence of this parameter on the execution time. As the parameter increases, the execution time increases as well. After the parameter exceeds 0.3, the execution time starts decreasing. When the parameter is in the range [0.7, 0.8], the execution time is minimal. Then it increases again.


Figure 4-13. Influence of the parameter mapred.reduce.slowstart.completed.maps on execution time

The parameter mapred.compress.map.output is a boolean value deciding whether to compress the outputs of map tasks, and the parameters mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution enable speculative execution in the corresponding map or reduce phase. Experimental results showing the impact of those parameters on the execution time are summarized in Table 4-2. As the table shows, when the intermediate data generated by a Hadoop job are compressed and speculative execution is enabled, the job performs better than in the scenario where all three parameters are set to false.

Table 4-2. Experimental results of compress and speculative execution parameters

Configuration parameter                      Execution time (true)   Execution time (false)
mapred.compress.map.output                   433.168                 497.326
mapred.map.tasks.speculative.execution       418.156                 433.168
mapred.reduce.tasks.speculative.execution    417.247                 433.168


4.2.4 Put It All Together

There are more than 190 different configuration parameters for each Hadoop job [21]. Among all parameters, those in Table 3-1 and Table 3-2 usually have a significant influence on the performance of a job. We have evaluated the influence of those parameters on the execution time of the R-Heavy implementation. Experimental results show improvements when tuning some of the parameters. As shown in Section 4.2.2, parameters interfere with each other, such that when a parameter is changed to a different value, the optimal value of another parameter may change as well. For example, when io.sort.mb is set to 50, the optimal performance of the R-Heavy implementation is achieved when io.sort.spill.percent is 0.6; when io.sort.mb is set to 300, the optimal performance is achieved when io.sort.spill.percent is 0.1. We try to minimize the interference by tuning possibly related parameters together, for example, io.sort.mb and io.sort.spill.percent. However, other parameters may still influence each other. Better methodologies or more experiments are required in future research.

In this section, a comparison is carried out on the R-Heavy implementation between the execution time prior to tuning the configuration parameters and the execution time after tuning the parameters. Based on the experiments above, the values of the configuration parameters after tuning are summarized in Table 4-3. When a parameter has very slight influence, or no influence, on the performance of a job, the default value in Table 4-1 is used. Otherwise, the value is chosen based on the experimental results presented in the previous sections. Figure 4-14 shows the experimental results of the comparison. After tuning the configuration parameters, the total execution time to process 1 TB of data decreases by approximately 38%. Both the map phase and the reduce phase execution times decrease due to changing the configuration parameters. Although configuration parameters influence each other, the static tuning methodology used in the experiment still brings an improvement of around 38%.
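Tuned values such as these are set per job or cluster-wide; a subset of them could be expressed in mapred-site.xml using the Hadoop 1.x parameter names from this thesis, as in the illustrative fragment below. This is a sketch of the configuration format only, not a recommended setting for other clusters or workloads.

```xml
<configuration>
  <!-- Map output sort buffer: 50 MB, spill when 60% full, 64-way merges -->
  <property><name>io.sort.mb</name><value>50</value></property>
  <property><name>io.sort.spill.percent</name><value>0.60</value></property>
  <property><name>io.sort.factor</name><value>64</value></property>
  <!-- Compress intermediate map output -->
  <property><name>mapred.compress.map.output</name><value>true</value></property>
  <!-- Shuffle-side buffer on the reducers -->
  <property><name>mapred.job.shuffle.input.buffer.percent</name><value>0.70</value></property>
  <!-- Reuse each JVM for 5 tasks -->
  <property><name>mapred.job.reuse.jvm.num.tasks</name><value>5</value></property>
</configuration>
```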


Table 4-3. The values of the configuration parameters after tuning

Name                                          Tuned value
mapred.map.tasks                              100-1100
dfs.block.size                                1 GB
mapred.reduce.tasks                           16
io.sort.mb                                    50 MB
io.sort.spill.percent                         0.6
io.sort.factor                                64
mapred.compress.map.output                    true
mapred.reduce.parallel.copies                 10
mapred.job.shuffle.input.buffer.percent       0.7
mapred.job.shuffle.merge.percent              0.66
mapred.inmem.merge.threshold                  1000
mapred.map.tasks.speculative.execution        true
mapred.reduce.tasks.speculative.execution     true
mapred.job.reuse.jvm.num.tasks                5
io.file.buffer.size                           64 KB
mapred.reduce.slowstart.completed.maps        0.8
mapred.tasktracker.map.tasks.maximum          8
mapred.tasktracker.reduce.tasks.maximum       4
mapred.child.java.opts                        1536 MB

Figure 4-14. Comparison of the execution time before and after tuning of configuration parameters for the R-Heavy implementation

4.3 Algorithmic Optimization Experiments

A sliding-window algorithm was proposed in Section 3.2 to optimize the first stage of Algorithm 1. Since both algorithms require the data to be analyzed to be collected on one node, experiments were conducted on a single machine. First, we evaluate the correctness of our approach to compute the average and the standard deviation of a set of


records. Second, we compare the execution time of the first stage of Algorithm 1 and the sliding-window algorithm.

4.3.1 Numerical Analysis of Standard Deviation Computation

Mathematically, the results of the new approach should be the same as those of the two-pass algorithm. However, they may vary numerically when the computations are performed on computers because of rounding errors. We evaluate the approach by comparing the results of the two-pass algorithm with those of the sliding-window approach.

4.3.1.1 Experiments setup

The machine used for testing is one of the machines in the local cluster, containing 8 Intel Xeon 2.33 GHz processors and 16 GB of memory, with Java 1.7 installed. We use a random generator from the Java standard library to generate data within the ranges [1, 10] and [1, 100]. Both the computation results and the execution time are compared between the two-pass algorithm and the sliding-window approach.

4.3.1.2 Results analysis

Tables 4-4 and 4-5 show the average and the standard deviation computation results for numbers of records ranging from 15 to 150,000,000. Records are randomly generated. The input consists of 15 units, and each unit contains the same number of records. The two-pass algorithm computes the average and the standard deviation for units 2 through 15 using Equation 3-1 and Equation 3-2, while the sliding-window approach assumes that the results for the 1st unit and for units 1 through 14 are known and uses Equation 3-4 and Equation 3-5 to compute the results for units 2 through 15. Both the average and the standard deviation computation results show rounding errors that increase with the number of records. However, the total number of records for each measurement unit (or dCU) is usually much less than 150,000,000. The maximum error of 10^-8 is acceptable for the analysis system. If the number of records for each dCU increases dramatically, known numerical techniques can be used to decrease the rounding errors [38].
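The comparison between the two-pass algorithm and the sliding-window update can be sketched in Java as follows. Following the description above, the sliding-window side keeps the count, sum, and sum of squares of the current window (the N, mu, S, and S^2 of Equations 3-4 and 3-5), subtracts the contribution of the unit leaving the window, and adds the unit entering it. The class and method names are illustrative, not those of the thesis's actual implementation.

```java
public class WindowStats {
    // Two-pass algorithm: first pass computes the mean,
    // second pass the (population) standard deviation.
    static double[] twoPass(double[] x, int from, int to) {
        int n = to - from;
        double mean = 0;
        for (int i = from; i < to; i++) mean += x[i];
        mean /= n;
        double ss = 0;
        for (int i = from; i < to; i++) ss += (x[i] - mean) * (x[i] - mean);
        return new double[] { mean, Math.sqrt(ss / n) };
    }

    // Sliding-window update: given count/sum/sum-of-squares of the current
    // window, drop the oldest unit and add the newest, then derive mean and
    // standard deviation from sigma^2 = E[x^2] - E[x]^2.
    static double[] slide(double n, double sum, double sumSq,
                          double nOld, double sumOld, double sumSqOld,
                          double nNew, double sumNew, double sumSqNew) {
        double n2 = n - nOld + nNew;
        double s2 = sum - sumOld + sumNew;
        double q2 = sumSq - sumSqOld + sumSqNew;
        double mean = s2 / n2;
        return new double[] { mean, Math.sqrt(q2 / n2 - mean * mean) };
    }
}
```

With 15 units of equal size, sliding from units 1-14 to units 2-15 reuses the stored sums of the old window and of unit 1, touching only the records of unit 15; the two results agree up to rounding error of the kind reported in Tables 4-4 and 4-5.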


Table 4-4. Comparison of averages of records calculated using the two-pass algorithm and the sliding-window approach for different numbers of records

No. Records    Average (two-pass)      Average (sliding window)
15             4.220502472468472       4.220502472468472
150            4.928343844303417       4.928343844303417
1500           5.138686016771322       5.138686016771323
15,000         4.988890048565302       4.988890048565292
150,000        50.00357768201342       50.00357768201384
1,500,000      50.00160774701869       50.00160774702093
15,000,000     50.006454004308544      50.00645400430462
150,000,000    49.99716727378571       49.99716727378661

Table 4-5. Comparison of standard deviations of records calculated using the two-pass algorithm and the sliding-window approach for different numbers of records

No. Records    Standard deviation (two-pass)   Standard deviation (sliding window)
15             2.553344106769994               2.553344106769994
150            3.006359919401285               3.006359919401288
1500           2.8565035556747227              2.8565035556747183
15,000         2.891133980641171               2.891133980641179
150,000        28.856446505767977              28.85644650576638
1,500,000      28.854900694855814              28.854900694855992
15,000,000     28.86462866500038               28.864628665002964
150,000,000    28.865893630970138              28.865893631013957

Figure 4-15 shows the execution time of both the two-pass algorithm and the sliding-window approach to compute averages and standard deviations. I/O is not considered in this experiment. The execution times of four cases are shown in the figure. First, we measure the execution time of the two-pass algorithm (referred to as Two-pass total). Second, the total execution time of the sliding-window approach, including both the time for computing the previous results and the time for computing the new average and standard deviation, is measured (referred to as Sliding-window total). Third, it is assumed that the computation results from Day-1 to Day-14 are known (referred to as Sliding-window std). Fourth, it is assumed that the computation results from Day-1 to Day-14 and for each day are both known (referred to as Sliding-window daily). As shown in the figure, if the previous computation results are known, the sliding-window algorithm outperforms the two-pass algorithm significantly. The case Sliding-window std


Figure 4-15. The execution times of the two-pass algorithm and the sliding-window approach to compute averages and standard deviations

assumes that N, μ, S and S² in Equation 3-5 are known. The case Sliding-window daily further assumes that N₁, μ₁, S₁ and S²₁ in Equation 3-5 are known. Since the case Sliding-window daily has the results for Day-1, its computation time increases a little more slowly than that of the case Sliding-window std when the number of records increases. Since the sliding-window approach complicates the algorithm, the execution time of the case Sliding-window total is longer than that of the case Two-pass total. As the number of records increases, the total execution time of the two-pass algorithm becomes about 14 times longer than that of the sliding-window approach under the assumption that previous computation results are known.

4.3.2 Sliding-Window Algorithm Performance Analysis

In this section, we evaluate the first stage of the sliding-window algorithm by comparing the execution time of the first stage of Algorithm 1 and the sliding-window algorithm.

4.3.2.1 Experiments setup

Several assumptions are made when evaluating the sliding-window algorithm. For simplicity, each record generated in this section contains an integer as a time stamp in the range from 1 to 15. It is assumed that the computation for records whose time


stamps are in the range [1, 14] is finished and the results are stored in files. Two ordering strategies are used to store the sorted records: a global-order and a time-stamp-order. In the global-order strategy, records are stored in sorted order without consideration of the time stamp. In the time-stamp-order strategy, records are first ordered by the time stamp, then ordered by the value to be analyzed. Figure 4-16 shows the two ordering strategies. R0, R1, R2, ... are sorted records. Each record is associated with a time stamp T1, T2, T3, ... In the left file in Figure 4-16, records are ordered according to value without consideration of their time stamps. In the right file, records are first sorted by time stamps, then sorted by values.

Figure 4-16. The global-order strategy and the time-stamp-order strategy

According to the different data ordering strategies and different algorithms, there are four cases to be considered: Algorithm 1 with the global-order strategy (or Alg.-1 Sort), Algorithm 1 with the time-stamp-order strategy (or Alg.-1 Time), Algorithm 2 with the global-order strategy (or Alg.-2 Sort), and Algorithm 2 with the time-stamp-order strategy (or Alg.-2 Time).

4.3.2.2 Results analysis

According to the analysis in Section 3.2.2.2, sorting and the standard deviation computation are the most time-consuming computations in the first stage of Algorithms 1 and 2. Thus, the sorting and standard deviation computation times are measured separately


in Figure 4-17. We also consider the computation time versus the input/output time, as shown in Figure 4-18. The total execution times of the different cases are compared in Figure 4-19. For comparison purposes, results are also shown by case in Figure 4-20.

Figure 4-17. Sorting and the standard deviation computation time. A) Sort time. B) Standard deviation computation time.

Based on the Sliding Window Sort algorithm discussed in Section 3.2.2.1, we implemented merge sort in the case Alg.-2 Sort and multi-way merge sort in the case Alg.-2 Time. For the cases Alg.-1 Sort and Alg.-1 Time, the sort method in Java is used. As shown in Figure 4-17-A, the sorting times for the cases Alg.-2 Sort and Alg.-1 Sort are the shortest, because most of the records are already sorted. The case Alg.-2 Time performs worst: the multi-way merge sort that we implemented cannot beat the sorting algorithm implemented in Java. The results of the standard deviation computation time are similar to the results in Figure 4-15. In Section 4.3.1.2, only the computation of the standard deviation and the average is performed on the records, while in this section, stage 1 of the computation algorithm is performed. From Figure 4-17-B, when most of the records are sorted, the computation times for the average and the standard deviation are shorter. In addition, the sorting takes a longer time than the computation of the standard deviation.

The total computation time is the combination of the sorting time and the standard deviation computation time, as shown in Figure 4-18-A. The case Alg.-2 Sort runs faster


Figure 4-18. Computation time and I/O time measurement. A) Total computation time. B) I/O time.

than any other case. As the number of records increases, Alg.-1 Sort runs approximately two times slower than Alg.-2 Sort. Since the sorting time of Alg.-2 Time is longer than that of the other cases, the total computation time of Alg.-2 Time is the slowest. With regard to the input/output time consumed by the programs, it is approximately the same for all cases. Comparing the scales of the y-axes in Figure 4-18, the I/O time dominates the program's execution time.

Figure 4-19. The total execution time for the four cases


The total execution times of the four cases are presented in Figure 4-19. Since I/O dominates the execution time, the total execution times of the four cases are very close to each other. The total execution times of the cases Alg.-2 Sort and Alg.-2 Time are slightly shorter than those of the cases Alg.-1 Sort and Alg.-1 Time.

Figure 4-20. Execution time for the different cases. A) Case Alg.-1 Sort. B) Case Alg.-1 Time. C) Case Alg.-2 Sort. D) Case Alg.-2 Time.

In order to compare the different parts of the computation time and the I/O time, all results are presented from a different perspective in Figure 4-20. All execution times for the same case are summarized in the same figure. Figure 4-20-A shows the case Alg.-1 Sort, Figure 4-20-B shows the case Alg.-1 Time, Figure 4-20-C shows the case Alg.-2 Sort and


Figure 4-20-D shows the case Alg.-2 Time. The execution time is mainly determined by the I/O processing time. As shown in the four cases, the I/O time is at least five times longer than the total computation time. The sorting time dominates the computation time, which is clearly visible in Figure 4-20-D. In case the experiments are influenced by the CPU cache or memory, a new file is generated for each experiment. Thus, every experiment can be considered an independent event. In addition, all results are averages of five experimental results.
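The multi-way merge used in the case Alg.-2 Time combines several individually sorted runs (one per time stamp) into a single sorted sequence. A minimal sketch of such a k-way merge using a priority queue is shown below; it illustrates the technique, not the thesis's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {
    // Merge k individually sorted runs into one sorted list. The heap
    // holds one entry per run: double[]{currentValue, runIndex, position},
    // ordered by the current value, so the smallest unconsumed element
    // across all runs is always at the top.
    public static List<Double> merge(List<List<Double>> runs) {
        PriorityQueue<double[]> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++)
            if (!runs.get(r).isEmpty())
                heap.add(new double[] { runs.get(r).get(0), r, 0 });
        List<Double> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            double[] top = heap.poll();
            int r = (int) top[1], pos = (int) top[2];
            out.add(top[0]);
            // Advance the run the element came from, if it has more elements.
            if (pos + 1 < runs.get(r).size())
                heap.add(new double[] { runs.get(r).get(pos + 1), r, pos + 1 });
        }
        return out;
    }
}
```

Each element is pushed and popped once, giving O(n log k) time for n records and k runs, which is why the number of merge streams (cf. io.sort.factor in Section 4.2.2) matters for merge performance.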


CHAPTER 5
CONCLUSIONS

The purpose of this research work is to build a better statistics-based system that allows for large-scale data analysis. We have implemented the system on the Hadoop platform, and the implemented system shows better performance when compared to the legacy system. Optimization of the system is approached from two perspectives: tuning different Hadoop configuration parameters and optimizing the computation algorithm. In this chapter, we conclude the research work with a summary of the thesis, a discussion of the implementations and experiments, and possible directions for future research.

5.1 Summary

In this thesis, we present SMMDAS on Hadoop. Chapter 1 introduces the legacy SMMDAS for the current semiconductor manufacturing process in order to show the requirements of a scalable, efficient, fault-tolerant distributed system for massive data analysis during the semiconductor manufacturing procedure. Hadoop, as a software platform that allows for massive data analysis across clusters of computers, is chosen to be the underlying platform. The background on the MapReduce programming model and the Hadoop job workflow is discussed in Chapter 1 as well. Different data mapping and implementation strategies are presented in Chapter 2. Chapter 3 discusses optimization of the system. Chapter 4 presents all experimental evaluation results.

5.1.1 Data Mapping

The SMMDAS is designed and implemented using a data-oriented approach. We propose three different methods of mapping data from the database to files: arbitrarily, per-iCU and per-dCU, and three different methods of mapping data from files to Hadoop: one-to-one, many-to-one, and one-to-many. When the size of each iCU is small, experimental results show that the per-iCU many-to-one data mapping strategy outperforms the other data mapping strategies. Since the size of each iCU is small, all


computations in each iCU can be finished in memory. The per-iCU many-to-one data mapping strategy can reduce the number of map tasks launched during a job execution, so that the performance of the system is improved. As the size of each dCU increases, we have observed that computations in each iCU cannot be finished in memory. Thus, an iterative MapReduce algorithm, consisting of multiple MapReduce jobs in each analysis job, may need to be considered for the system, as discussed in Appendix B.

5.1.2 Computation Distribution

To map the computation to the Hadoop platform, we propose three different implementations: M-Heavy, R-Heavy and MR-Balanced. For the M-Heavy implementation, all computations of Algorithm 1 for each iCU are performed in the map phase, while for the R-Heavy implementation, all computations for each iCU are performed in the reduce phase. In the MR-Balanced implementation, the first stage of Algorithm 1 is performed in the map phase and the second stage is performed in the reduce phase. Different implementations benefit different situations. Table 2-3 in Chapter 2 summarizes the limitations of the different data mapping strategies on the different implementations. Table 5-1 compares the different implementations with a summary of their advantages and disadvantages. When the size of each iCU is small, the computation can be performed in memory in the map phase. The M-Heavy implementation decreases the intermediate data transmission between the map and reduce phases, and thus outperforms the other implementations. When the size of each dCU increases, the computation cannot fit in memory anymore. The MR-Balanced implementation distributes computations between the map and reduce phases so that the computations can be finished in one MapReduce job. However, the M-Heavy and the MR-Balanced implementations both require specific data mapping strategies. The R-Heavy implementation does not place limits on the data mapping. Thus, it is useful when designing a Hadoop-only analysis system in which the database is removed.


Table 5-1. The advantages and disadvantages of the different implementations

Implementation                Advantages                                  Disadvantages
M-Heavy implementation        The execution time is the shortest          Needs help from the database for grouping the data
MR-Balanced implementation    Requires less memory space for each task    Needs help from the database for grouping the data
R-Heavy implementation        Does not need support from the database     The execution time is the longest

5.1.3 System Optimization

The system is optimized from two perspectives. First, different Hadoop configuration parameters are tuned for the R-Heavy implementation. Second, an algorithmic optimization is carried out for the first stage of the computation. On the one hand, there are a lot of configuration parameters for each job executed on the Hadoop platform. We have analyzed the workflow of a Hadoop job in detail so that the parameters that may have a significant influence on the performance of a Hadoop job are tuned. On the other hand, in the first stage of the computation, the most time-consuming computations are the sorting and the computation of the standard deviation. We consider reusing the results of previous computations and propose a sliding-window algorithm to optimize Algorithm 1.

5.1.4 Experimental Results Analysis

In the legacy system, the computation performed on a 1 TB data set takes around 8 to 10 hours [6]. In our experiments, the M-Heavy implementation using the per-iCU many-to-one mapping strategy runs for 1 hour and 40 minutes when analyzing a 1 TB data set. Using the static approach to tune the Hadoop configuration parameters, the R-Heavy implementation takes 1 hour and 13 minutes to analyze a 1 TB data set. Thus, the performance of the system is improved by up to 82.7% in the current experiments. In addition, the results in Chapter 4 show that the execution time of the first-stage computation is improved by 50% thanks to the algorithmic optimization. The experiments, however,


reveal the fact that I/O time dominates the execution time of the whole program. Future research needs to be done to improve the I/O execution time.

5.2 Towards a Better System

In the legacy SMMDAS, the data collected from the manufacturing process are first stored in a file system. Then, the data are transferred from the file system to a database. Before the analysis, the data are moved from the database to another file system, and the computing nodes retrieve the data from that file system. After the analysis, the data are transferred to another database. The legacy system architecture is shown in Figure 5-1-A. The current system replaces the legacy computing nodes with a Hadoop cluster, as shown in Figure 5-1-B. However, the data to be analyzed are still transferred from the sensors to HDFS through several steps. In this section, we propose an architecture that is completely different from the legacy system architecture, in which the database is removed from the system design. As shown in Figure 5-1-C, the proposed system can be considered a Hadoop cluster without involvement of the database. The data collected from the manufacturing process are stored directly in HDFS. As discussed in previous chapters, the R-Heavy implementation extracts useful information during the map phase and analyzes the data during the reduce phase. The SMMDAS can thus be built without multi-step data transmission. A background script may be required to periodically remove useless data. If functionalities of the database are required, a high-level software component, Hive, can be installed on the Hadoop cluster. Hive is a data warehouse solution for large-scale data management on the distributed file system [40]. It is a high-level tool that supports a SQL-like programming language. Applications built on Hive can rely on the SQL-like interface such that the low-level map/reduce tasks are automatically generated by the software component. The proposed architecture reduces the data transmission between file systems and databases. In addition, we envision two other possible improvements.


Figure 5-1. The different system architectures. A) Legacy system architecture. B) Experimental system architecture. C) Proposed system architecture.

5.2.1 Automatically Tuning Hadoop Configuration Parameters

Experiments in Chapter 4 show that Hadoop configuration parameters significantly influence the performance of the system. There are 190 or more different parameters for each Hadoop job. Manual adjustment is impractical across different jobs. Thus, it is necessary to investigate an automatic analysis system. Related work has been done towards automatic optimization of Hadoop applications [34, 41], which may inspire future work to achieve optimal configuration parameter settings.


5.2.2 Using Solid-State Drives

In Chapter 4, our experimental results show that the disk I/O time dominates the total execution time of the SMMDAS. Hadoop tries to assign tasks to where the data are located so that the data transmission time between nodes is minimized. However, the disk I/O time cannot be decreased using a Hard Disk Drive (HDD). The Solid-State Drive (SSD) is proven to offer better performance than the HDD. Work has been done [42] to compare the performance and the price of SSDs and HDDs. For sequential-read applications, the performance ratio of SSD to HDD is around 2. For random-access applications, the performance ratio of SSD to HDD is around 60. In most Hadoop applications, reading and writing take place sequentially. Other works [43, 44] also compare the performance of a Hadoop cluster using SSDs versus HDDs. With appropriate settings, in [43] the execution time of I/O-bounded applications is improved by a factor of 3 using SSDs. For the SMMDAS, the disk I/O takes around 10 times longer than the computation, as shown in Chapter 4. The current cluster is built with HDDs. Thus, if we replace the HDDs with SSDs and set the configuration parameters of the applications appropriately, the total execution time is predicted to be improved by a factor of 3. State-of-the-art technologies [44] have shown a speedup of Violin's scalable flash memory array over HDD of 7 for sequential-read applications; the write speedup is 2.5. Since the amount of read operations is much larger than that of write operations in the SMMDAS, the speedup with the scalable flash memory array storage is predicted to be around 5.

In summary, many Hadoop applications are I/O bounded [1]. Our implementations, especially when optimizing the system towards real-time operation, also reveal the fact that I/O time dominates the total execution time of the analysis system. SSDs are proven to achieve faster execution times than HDDs [42]. Studies have applied SSDs in a Hadoop cluster [43, 44] and shown a performance improvement because of the decreased I/O time. Thus,


upgrading the hardware is another possible optimization strategy to reduce the execution time of the SMMDAS.


APPENDIX A
SETTING UP A HADOOP CLUSTER USING CLOUDERA MANAGER

This appendix gives a basic tutorial on how to set up a Hadoop cluster using Cloudera Manager. For testing purposes, the experimental installation takes place on a VMware ESXi server on a local cluster. It starts with creating a Virtual Machine (VM) and installing the operating system, and then sets up the cluster. The VM creation and replication steps can be skipped if physical machines are used for setting up the cluster.

A.1 Introduction

A typical Hadoop cluster consists of one master node and multiple slave nodes. There are multiple ways to set up a Hadoop cluster. Cloudera Manager [45] provides the user an easy way to set up, manage and maintain a Hadoop cluster. In this tutorial, we set up an experimental Hadoop cluster on a VMware ESXi server. The cluster consists of one master node and two slave nodes. The master node is a dedicated VM.

A.2 VM Creation and Operating System Installation

In this section, we use the vSphere Client to create VMs on the VMware ESXi server [46]. The details of how to create a VM on the VMware ESXi server are skipped. After creating a VM, an operating system needs to be installed. We choose CentOS 6.4. The following sections detail the installation of the operating system and the environment setup.

A.2.1 Download CentOS

Choose a distribution image from the CentOS website. We downloaded CentOS-6.4-x86_64-minimal.iso. The minimal installation is appropriate for a server, as it does not contain a GUI.

A.2.2 Install CentOS 6.4

First, power on the VM created on the cluster and then open the console. Second, in the rightmost toolbar of the console, choose "connect/disconnect the CD/DVD devices of the virtual machine".


Figure A-1. The welcome page from the CentOS image

Third, choose an appropriate way to connect to the datastore where the image of the OS is stored. In the experimental setup, we downloaded the OS to the local machine. Thus, "connect to ISO image on local disk..." is chosen to connect the VM to the local storage. Fourth, press any key in the console, and the system will boot from the image, as shown in Figure A-1. Fifth, choose "Install or upgrade an existing system". Sixth, follow the instructions to install CentOS on the VM.

A.2.3 Configure CentOS

After the installation of CentOS, reboot the system. Then configure the system environment. To be able to make changes to system configurations conveniently, log in as the root user.

First, configure the network. Edit the script file /etc/sysconfig/network-scripts/ifcfg-eth0. Replace the contents of the file with the following:

DEVICE=eth0
HWADDR=xx:xx:xx:xx:xx:xx (Use default value)
TYPE=Ethernet


ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=dhcp

In this experimental setup, we used dynamic IP assignment from DHCP. The IP address can also be set statically. After setting up the IP address, restart the network service with the command

$ service network restart

Test the network connection with the command

$ ping www.google.com

Figure A-2. Successfully set up the network connection

Figure A-2 shows the command interface if the network is set up successfully. Second, update the installed operating system with the command

$ yum update

Third, install Perl with the command

$ yum install perl

Fourth, install NTP (the Network Time Protocol is used to synchronize all hosts in the cluster) with the command

$ yum install ntp

Fifth, edit the file /etc/ntp.conf by replacing the existing NTP servers with the local NTP servers. In this experimental setup, the NTP servers used at the University of Florida are listed as follows.

server ntps2-3.server.ufl.edu
server ntps2-2.server.ufl.edu


server ntps2-1.server.ufl.edu

Sixth, disable selinux by editing the file /etc/selinux/config. Seventh, in order to allow communication between nodes, disable the firewall with

$ service iptables save
$ service iptables stop
$ chkconfig iptables off

Eighth, set the hostname and FQDN. Set the hostname by editing the file /etc/sysconfig/network.

HOSTNAME=master

Then, edit the file /etc/hosts. The IP address 127.0.0.1 corresponds to the local IP, while "masterIP" and "slaveIP" should be set to the IP addresses that are used for accessing the master node and the slave nodes. If hosts cannot be resolved by the DNS, we have to specify every host in the file /etc/hosts.

127.0.0.1 localhost.localdomain localhost
"masterIP" hello.world.com master
"slaveIP" hello1.world.com slave1

Ninth, reboot the system with the command

$ reboot now

Tenth, install the SSH server and client. In this case, the OS distribution contains the openssh server and client. If the installed distribution does not contain the SSH package, refer to the CentOS SSH Installation and Configuration guide for the installation of the SSH server and client. Eleventh, create an SSH private/public key pair. This step can be skipped if you would like to SSH with a password. Create an SSH private/public key pair with the command

$ ssh-keygen

Copy the public key to every remote host with

$ ssh-copy-id -i /root/.ssh/id_rsa.pub remote-host


Test the SSH connection without a password with

$ ssh remote-host

A.2.4 Replicate VMs

In order to preserve the initial system settings, create a snapshot of the VM that we have configured so far. Copy two VM instances onto the same cluster. One problem that we encountered when copying the VM instances is that eth0 is automatically mapped to the old MAC address. We need to manually edit the corresponding files. The file /etc/udev/rules.d/70-persistent-net.rules should be edited from Figure A-3 to Figure A-4.

Figure A-3. The content of the file 70-persistent-net.rules before updating

Figure A-4. The content of the file 70-persistent-net.rules after updating

Then, the file /etc/sysconfig/network-scripts/ifcfg-eth0 should be updated so that HWADDR is consistent with the one in the file /etc/udev/rules.d/70-persistent-net.rules. Reboot the system with the command

$ reboot now


Then edit the hostname of each node correspondingly by editing the files /etc/sysconfig/network and /etc/hosts.

A.3 Cloudera Manager and CDH Distribution

First, download Cloudera Manager from its website. Second, change the permissions of the package with

$ chmod u+x cloudera-manager-installer.bin

Third, run the installation package with

$ ./cloudera-manager-installer.bin

Fourth, access the Cloudera Manager using any web browser through the address http://ClouderaManagerNode:7180. The default Username and Password are both "admin". Figure A-5 shows the login web interface of the Cloudera Manager.

Figure A-5. The login web interface of the Cloudera Manager

Fifth, continue to the next page and specify the hosts for the CDH cluster installation. Type the IP addresses or hostnames of the nodes for the Hadoop cluster. Figure A-6 shows the web interface where nodes are added into the cluster. In this experiment, we added one master node and two slave nodes to the cluster. After the initial cluster setup, new nodes can still be added to (or removed from) the cluster through the Cloudera Manager.


Figure A-6. The web interface for the user to specify the nodes on which Hadoop will be installed

Continue with the default settings. Install using Parcels, and install both CDH and IMPALA. Figure A-7 shows the web interface on which users can specify the installation package and other appropriate installation options.

Figure A-7. The web interface for the user to specify the installation package and other options

Since root access is required to install CDH, specify the SSH credentials by typing the password of the root user, or upload the private key of the root user. Note that we tried to use a sudo user other than root with SSH private/public key access. The Cloudera Manager,


however, could not finish the credential verification in that case. Figure A-8 shows the web interface for users to specify the appropriate credential method to access the cluster.

Figure A-8. The web interface for the user to specify the appropriate credential method to access the cluster

There may be failed nodes when installing CDH. Figure A-9 is an example of a failed installation. In this particular case, we had not installed Perl on CentOS. After installing Perl, the failed node works well. The process will continue installing with Parcels. If the installation process goes well, it should look like the interface shown in Figure A-10. After the installation of the package, continue with the Host inspectors, as shown in Figure A-11. This process will fail if slave nodes cannot find their way back to the master node. Make sure to specify the IP address of each host if they cannot be resolved by DNS. Choose the services to be installed on the cluster. Run the Inspect role assignments, as shown in Figure A-12. Continue with the "Use Embedded Database" option. Wait until all selected services have been started. After all services are installed successfully, the cluster can be started following the instructions on the web interface, shown in Figure A-13.


Figure A-9. The web interface when the installation fails

Figure A-10. The web interface when the installation process goes well

Figure A-11. The web interface when the installation of the package is successful


Figure A-12. The web interface for inspecting role assignments

Figure A-13. The web interface when all services have been installed and the cluster is ready to work

Then, the CDH Hadoop cluster is ready to be used. Figure A-14 shows the web interface when all started services are working appropriately.


Figure A-14. The web interface showing the status of the available services


APPENDIX B
A PARALLEL ALGORITHM FOR STATISTICS-BASED ANALYSIS

If the size of each iCU is small enough, the sequential algorithm (Algorithm 1) can be finished on a single node. If the size of each iCU is very large, however, the sequential algorithm cannot be finished on a single node. Then parallelization of the computation is required to finish the analysis job. Many statistics functions can be parallelized. The following is the proposed iterative MapReduce parallelized algorithm.

Similar to Algorithm 1, the input of the procedure is a linked list (inputList). Since the analysis is performed independently on each iCU, the data in the same inputList share the same Kr. The i-th element in the list is a 1D double-precision array, Ai, representing a dCU. Each value in Ai corresponds to an AVG value extracted from the records that share the same Kt. In this parallelization algorithm, each Ai is further partitioned into m blocks, and Aij denotes the j-th block of Ai.

Since it is assumed that the size of each dCU is too large to be computed on a single node, the algorithm first partitions Ai into several small blocks so that each block of Ai can be processed on a single node. For each Aij, the algorithm sorts the data set locally and computes the average, summation, and summation of squares of Aij. This procedure is done independently on each Aij, which can be mapped to the map phase execution of a MapReduce job. Second, after all blocks in Ai are sorted, the computation of the average and the standard deviation of Ai can easily be derived by Equation 3-4 and Equation 3-5. Also, a multi-way merge sort can be used for sorting Ai. Then, the algorithm computes the percentile values of Ai and computes the outlier limits using the percentile values. This procedure can be mapped to the reduce phase execution of the MapReduce job mentioned previously. Third, the algorithm partitions Ai into small blocks and eliminates outliers from each Aij, which can be mapped to the map phase of a new MapReduce job. So far, Stage 1 of Algorithm 1 is finished using the iterative MapReduce algorithm, Algorithm 3, accordingly.


Algorithm 4 starts with picking the data set whose Cpk is largest as the reference data set. Then, it uses the average values from the previous stage to compute the average of all records belonging to the same iCU except for the reference data set. Second, the algorithm partitions each Ai into several small blocks, and computes the Within-Distance and Between-Distance of each record. This procedure can be mapped to the map phase of a new MapReduce job. After computing the Euclidean distances of the records, the algorithm computes the average and the standard deviation of the Euclidean distances based on Equation 3-4 and Equation 3-5. Finally, the algorithm starts a pairwise Student's t-test using the previously computed results and makes the final decision.

The proposed parallel algorithm contains three MapReduce jobs, and each of the MapReduce jobs is iteratively executed on the cluster. Frequently creating multiple MapReduce jobs may bring significant overhead. Twister, proposed by Ekanayake et al. [20], one of the optimized MapReduce extensions, can be used to solve this problem.


Algorithm 3. Procedure of equivalence test (iterative MapReduce) - Part 1

Input: LinkedList inputList
    A_i represents the ith 1D double-precision array in inputList
    A_i[k] represents an element in A_i
    A_ij represents the jth block of A_i

for each DoubleArray A_i in inputList do
    Divide A_i into m blocks
    for each block A_ij in A_i do
        Sort A_ij
        Compute mu_ij, S_ij and S2_ij
    end for
    Compute the average mu_i = (sum_{j=1}^{m} mu_ij * N_ij) / (sum_{j=1}^{m} N_ij)
    Compute the standard deviation sigma_i based on Equation 3-5
    Merge sort A_i
    Compute the percentile values of A_i
    Compute the lower and upper outlier limits LOL, UOL using the percentile values
    Divide the sorted A_i into m blocks
    for each block A_ij in A_i do
        Eliminate all A_ij[k] that are not in the range [LOL, UOL]
        Compute the new average mu'_ij
    end for
    Compute the process capability index Cpk[i]
end for
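The reduce-phase steps of Algorithm 3 (combining the per-block statistics into the global average and standard deviation, then deriving outlier limits from percentile values) can be sketched as below. Equations 3-4 and 3-5 are not reproduced in this appendix, so the variance formula shown is the standard aggregated-sums form, and the Tukey-style 1.5 x IQR fences are an assumed outlier rule; the thesis' exact definitions may differ.

```python
import math

def combine_block_stats(blocks):
    """Reduce phase: combine per-block (N_ij, S_ij, S2_ij) tuples
    into the global average and standard deviation of A_i."""
    n = sum(b[0] for b in blocks)
    s = sum(b[1] for b in blocks)
    s2 = sum(b[2] for b in blocks)
    mean = s / n
    # Population variance from aggregated sums; Equation 3-5 in the
    # thesis may use the sample (n - 1) form instead.
    var = s2 / n - mean * mean
    return mean, math.sqrt(max(var, 0.0))

def outlier_limits(sorted_values, k=1.5):
    """Lower/upper outlier limits (LOL, UOL) from percentile values.
    Uses crude nearest-rank quartiles and assumed Tukey fences."""
    n = len(sorted_values)
    q1 = sorted_values[int(0.25 * (n - 1))]
    q3 = sorted_values[int(0.75 * (n - 1))]
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

Aggregating sums and sums of squares, rather than re-reading raw values, is what lets a single reducer finish this step even when A_i itself is too large for one node.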


Algorithm 4. Procedure of equivalence test (iterative MapReduce) - Part 2

Pick A_i with the largest Cpk as the reference set R
Compute the overall average mu' = (sum mu'_ij * N'_ij) / (sum N'_ij), over all A_i != R
for each DoubleArray A_i in inputList do
    Divide A_i into m blocks
    for each block A_ij in A_i do
        if (A_i == R) then
            Compute Euclidean distance W_Rj[k] = sqrt((R_j[k] - mu'_R)^2)
            Compute Euclidean distance B_Rj[k] = sqrt((R_j[k] - mu')^2)
            Compute mu''_ij, S_ij and S2_ij
        else
            Compute Euclidean distance W_ij[k] = sqrt((A_ij[k] - mu'_ij)^2)
            Compute Euclidean distance B_ij[k] = sqrt((A_ij[k] - mu'_R)^2)
            Compute mu''_ij, S_ij and S2_ij
        end if
    end for
    Compute the average mu'' based on Equation 3-4
    Compute the standard deviation sigma'' based on Equation 3-5
    Do a pairwise Student's t-test on W_R (or B_R) and W_i (or B_i)
    Make the final decision based on the result of the t-test
end for
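The distance computations and the pairwise t-test of Algorithm 4 can be sketched as follows. This is an illustrative Python sketch of the map-phase distances and of a two-sample t statistic built from the aggregated means and variances; the Welch form shown is an assumption, since the thesis' exact t-test variant is not given in this appendix.

```python
import math

def distances(values, own_mean, ref_mean):
    """Map phase of Algorithm 4 for one block: per-record
    Within-Distance (to the set's own mean) and Between-Distance
    (to the reference set's mean). In one dimension,
    sqrt((x - mu)^2) reduces to |x - mu|."""
    within = [math.sqrt((x - own_mean) ** 2) for x in values]
    between = [math.sqrt((x - ref_mean) ** 2) for x in values]
    return within, between

def t_statistic(mean1, var1, n1, mean2, var2, n2):
    """Welch's two-sample t statistic, computed from the aggregated
    means/variances produced by the previous reduce stage."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
```

Because the t statistic needs only means, variances, and counts, the final decision step touches no raw records, which keeps the last job cheap relative to the distance-computation map phase.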


REFERENCES

[1] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.

[2] Apache. Welcome to Apache Hadoop!, 2013. Available from: http://hadoop.apache.org

[3] John A. Rice. Mathematical Statistics and Data Analysis. Duxbury Advanced. Cengage Learning, 3rd edition, April 2006.

[4] Statistical hypothesis testing, 1996. Available from: http://www.ganesha.org/spc/hyptest.html

[5] William A. Levinson and Frank Tumbelty. SPC Essentials and Productivity Improvement: A Manufacturing Approach. ASQC Quality Press, October 1996.

[6] Youngsun Yu. Personal communication, January 2013.

[7] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04, 2004.

[8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 29-43, New York, NY, USA, 2003. ACM.

[9] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST '10, pages 1-10, Washington, DC, USA, 2010. IEEE Computer Society.

[10] A. Rajaraman and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2012.

[11] Michael C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363-1369, 2009.

[12] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, and Mark A. DePristo. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297-1303, 2010.

[13] S. Papadimitriou and Jimeng Sun. DisCo: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining. In Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, pages 512-521, 2008.

[14] J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for data intensive scientific analyses. In eScience, 2008. eScience '08. IEEE Fourth International Conference on, pages 277-284, 2008.


[15] Yuan Bao, Lei Ren, Lin Zhang, Xuesong Zhang, and Yongliang Luo. Massive sensor data management framework in cloud manufacturing based on Hadoop. In Industrial Informatics (INDIN), 2012 10th IEEE International Conference on, pages 397-401, 2012.

[16] Chen Zhang, Hans De Sterck, Ashraf Aboulnaga, Haig Djambazian, and Rob Sladek. Case study of scientific data processing on a cloud using Hadoop. In Douglas J. K. Mewhort, Natalie M. Cann, Gary W. Slater, and Thomas J. Naughton, editors, High Performance Computing Systems and Applications, volume 5976 of Lecture Notes in Computer Science, pages 400-415. Springer Berlin Heidelberg, 2010.

[17] Ronald C. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010.

[18] Kelvin Cardona, Jimmy Secretan, Michael Georgiopoulos, and Georgios Anagnostopoulos. A grid based system for data mining using MapReduce. Technical report, 2007.

[19] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 5:1-5:16, New York, NY, USA, 2013. ACM.

[20] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. Twister: A runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 810-818, New York, NY, USA, 2010. ACM.

[21] Shivnath Babu. Towards automatic optimization of MapReduce programs. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 137-142, New York, NY, USA, 2010. ACM.
[22] IBM Corp. An architectural blueprint for autonomic computing. IBM Corp., USA, October 2004. Available from: www-3.ibm.com/autonomic/pdfs/ACBP2_2004-10-04.pdf

[23] Welcome to Apache HBase, 2014. Available from: http://hbase.apache.org

[24] R. G. Goss and K. Veeramuthu. Heading towards big data building a better data warehouse for more data, more speed, and more users. In Advanced Semiconductor Manufacturing Conference (ASMC), 2013 24th Annual SEMI, pages 220-225, 2013.

[25] Xuhui Liu, Jizhong Han, Yunqin Zhong, Chengde Han, and Xubin He. Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS. In Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on, pages 1-8, August 2009.


[26] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, J. Majors, A. Manzanares, and Xiao Qin. Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In Parallel Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1-9, 2010.

[27] Yanpei Chen, Archana Ganapathi, and Randy H. Katz. To compress or not to compress - compute vs. IO tradeoffs for MapReduce energy efficiency. In Proceedings of the First ACM SIGCOMM Workshop on Green Networking, Green Networking '10, pages 23-28, New York, NY, USA, 2010. ACM.

[28] Edward Bortnikov, Ari Frank, Eshcar Hillel, and Sriram Rao. Predicting execution bottlenecks in map-reduce clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, HotCloud '12, Berkeley, CA, USA, 2012. USENIX Association.

[29] NIST/SEMATECH e-Handbook of Statistical Methods: Percentiles, December 2013. Available from: http://www.itl.nist.gov/div898/handbook/prc/section2/prc252.htm

[30] John Renze. Outlier. MathWorld - A Wolfram Web Resource, created by Eric W. Weisstein. Available from: http://mathworld.wolfram.com/Outlier.html

[31] J. B. Keats and D. C. Montgomery. Statistical Applications in Process Control. Quality and Reliability. Taylor & Francis, 1996.

[32] David Lane. Percentile. Available from: http://cnx.org/content/m10805/latest

[33] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986. Available from: http://books.google.fr/books?id=1aAOdzK3FegC

[34] Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu. Starfish: A self-tuning system for big data analytics. In CIDR, volume 11, pages 261-272, 2011.

[35] Shrinivas B. Joshi. Apache Hadoop performance-tuning methodologies and best practices. In Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, ICPE '12, pages 241-242, New York, NY, USA, 2012. ACM.

[36] Dawei Jiang, Beng Chin Ooi, Lei Shi, and Sai Wu. The performance of MapReduce: An in-depth study. Proc. VLDB Endow., 3(1-2):472-483, September 2010.
[37] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.


[38] Tony F. Chan, Gene H. Golub, and Randall J. LeVeque. Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician, 37(3):242-247, 1983.

[39] B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419-420, 1962.

[40] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endow., 2(2):1626-1629, August 2009.

[41] Duke University. Starfish: Self-tuning analytics system, 2013. Available from: http://www.cs.duke.edu/starfish/

[42] Vamsee Kasavajhala. Solid state drive vs. hard disk drive price and performance study, 2011.

[43] Kang Seok-Hoon, Koo Dong-Hyun, Kang Woon-Hak, and Lee Sang-Won. A case for flash memory SSD in Hadoop applications. International Journal of Control and Automation, 6(1):201-209, 2013.

[44] Violin Memory. Benchmarking Hadoop & HBase on Violin: Harnessing big data analytics at the speed of memory, 2013. Available from: http://pull.vmem.com/wp-content/uploads/hadoop-benchmark.pdf?d=1

[45] Cloudera. Cloudera Manager end-to-end administration for Hadoop, 2014. Available from: http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html

[46] VMware. Managing the ESX host, 2014. Available from: http://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp#getting_started/c_managing_your_first_esx_server_3i_host.html


BIOGRAPHICAL SKETCH

Wenjie Zhang was born in Dalian, China. She obtained her bachelor's degree from Liaoning University (China). She joined the University of Florida, Department of Electrical and Computer Engineering, in 2011, with an emphasis in computer engineering. In the summer of 2012, she joined the Advanced Computing and Information Systems Laboratory. She has worked on IT aspects of the DARPA/REPAIR project, a project focusing on research of mechanisms for repair of brain injuries. She is currently involved in research on parallel and distributed semiconductor data analytics, a project of the NSF Center for Cloud and Autonomic Computing (CAC) supported by Samsung Inc. Her interests include cloud computing and large-scale data analysis in distributed environments.

