
Enhanced Glade and Its Impact on Computational Data Analytics

Permanent Link: http://ufdc.ufl.edu/UFE0044817/00001

Material Information

Title: Enhanced Glade and Its Impact on Computational Data Analytics
Physical Description: 1 online resource (53 p.)
Language: English
Creator: Alex, Daley
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2012

Subjects

Subjects / Keywords: datapath -- glade -- mahout -- mining
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The management and analysis of large and constantly increasing amounts of data is required to facilitate better knowledge and understanding. Such analysis extracts less apparent information and forms fodder for the enhancement of products and services by business institutions and in a variety of other domains. Two techniques available for managing large amounts of data while performing complex computations are MapReduce and User Defined Aggregates (UDA). However, the flexibility with which computations can be expressed in either model requires enhancement. This thesis focuses on enhancing a computational model called GLADE, which is already efficient at managing computations on large data but falls short in its flexibility to express computations. This work transforms GLADE into a comprehensive format for large data management and for carrying out complex analytic computations, with improved flexibility and performance compared to other existing computational models. The first part of the thesis proposes the required enhancements in detail; in the second part, this flexibility is exhibited by encoding some of the classical data mining algorithms in this format. Comparisons are also drawn, wherever possible, with existing implementations in Mahout, a MapReduce-based machine learning library.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Daley Alex.
Thesis: Thesis (M.S.)--University of Florida, 2012.
Local: Adviser: Dobra, Alin.
Local: Co-adviser: Ranka, Sanjay.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2012
System ID: UFE0044817:00001



ENHANCED GLADE AND ITS IMPACT ON COMPUTATIONAL DATA ANALYTICS

By

DALEY ALEX

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2012

© 2012 Daley Alex

To my mother, who is my greatest critic and confidante
To my father, for his toils in making my life a possibility
To my sisters, for their ceaseless love and affection
To my closest friends, without whom I will never be myself
Last but not the least, to all the people I have ever had a chance to meet

ACKNOWLEDGMENTS

Foremost, I would like to express my gratitude to my advisor, Dr. Alin Dobra, who has been a mentor to me throughout my Master's studies and without whose patience and guidance I would never have learned the technicalities of conducting research. I would like to thank my co-advisor, Dr. Sanjay Ranka, for his invaluable insights and advice. My sincere thanks also go to Dr. Daisy Zhe Wang for taking out valuable time to serve on my committee. In addition, I thank my colleague Sabari Ajay Kumar for his support and assistance throughout the course of my research. Last but not the least, I would like to express my gratitude to my parents Alex George and Soosan Alex, my sisters Anju and Amritha Alex, and also to the closest of my friends.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
2 ORIGINAL SYSTEM
3 SYSTEM ENHANCEMENTS
    3.1 Iteration
    3.2 State Passing
    3.3 State Passing and Iteration
4 ALGORITHMS
    4.1 Item-Based Collaborative Filtering
    4.2 K-Means Clustering
    4.3 Decision Tree Classifier
    4.4 Apriori
    4.5 FP Growth
        4.5.1 FP Tree Merge
        4.5.2 Proof of Correctness
        4.5.3 Time Complexity
5 EMPIRICAL EVALUATION
    5.1 Item-Based Collaborative Filtering
        5.1.1 Book Crossing Dataset
        5.1.2 Movielens
    5.2 K-Means Clustering
    5.3 Decision Tree Classifier
    5.4 Apriori
    5.5 FP Growth
6 DISTRIBUTED STATE
    6.1 FP-Growth Distributed
    6.2 Empirical Evaluation
7 RELATED WORK IN MAPREDUCE
8 CONCLUSION

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

5-1 Hadoop Configurations
5-2 Mahout Configurations

LIST OF FIGURES

3-1 Iterative Mechanism in GLADE
4-1 FP Tree Merge
5-1 GLADE Item Based Collaborative Filtering
5-2 Item Based Collaborative Filtering: GLADE vs Mahout [log scale]
5-3 Per Iteration Speedup, KMeans: GLADE vs Mahout [log scale]
5-4 Scalability of KMeans (GLADE) Algorithm
5-5 GLADE Decision Tree
5-6 GLADE Apriori vs GLADE FP Growth
5-7 FP Growth GLADE vs FP Growth Mahout [in log scale]
6-1 The Distributed State
6-2 The Replicated State
6-3 Replicated vs Distributed State

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

ENHANCED GLADE AND ITS IMPACT ON COMPUTATIONAL DATA ANALYTICS

By

Daley Alex

August 2012

Chair: Alin Dobra
Cochair: Sanjay Ranka
Major: Computer Engineering

The management and analysis of large and constantly increasing amounts of data is required to facilitate better knowledge and understanding. Such analysis extracts less apparent information and forms fodder for the enhancement of products and services by business institutions and in a variety of other domains. Two techniques available for managing large amounts of data while performing complex computations are MapReduce and User Defined Aggregates (UDA). However, the flexibility with which computations can be expressed in either model requires enhancement. This thesis focuses on enhancing a computational model called GLADE, which is already efficient at managing computations on large data but falls short in its flexibility to express computations. This work transforms GLADE into a comprehensive format for large data management and for carrying out complex analytic computations, with improved flexibility and performance compared to other existing computational models. The first part of the thesis proposes the required enhancements in detail; in the second part, this flexibility is exhibited by encoding some of the classical data mining algorithms in this format. Comparisons are also drawn, wherever possible, with existing implementations in Mahout, a MapReduce-based machine learning library.

CHAPTER 1
INTRODUCTION

As more and more data becomes available, the demand for large-scale data analytics is constantly increasing. For performing complex analytics on large data, the mechanisms provided in traditional databases are mostly unsuitable. Therefore, one of two approaches is adopted for performing such analytics: either a MapReduce-like framework that does not fit into the notion of a traditional data management system, or User Defined Aggregates (UDAs) used in conjunction with an existing database system.

UDAs have existed in traditional database systems for a long time; they have been used to perform simple analytical tasks on relational data. However, the advent of MapReduce [1] introduced a new format for encoding computations and became popular for computations involving large data. MapReduce introduced a mechanism to perform complex analytics on large amounts of data and, moreover, has an inherent ability to scale to large clusters. However, it also has disadvantages: a rigid structure that demands the encoding of data as key-value pairs, a hash-based load distribution that depends on the key values, and no internal iterative mechanism.

The database community either adopted the MapReduce framework, which focused mainly on unstructured data, or chose to enhance UDAs so that they could keep up with the growing demand to carry out complex computations on large amounts of data. The latter approach has the advantage that it can very effectively leverage already advanced database technologies. Rusu et al. [2] introduced a new form of User Defined Aggregates called Generalized Linear Aggregates, or GLAs, and implemented it as a framework called GLADE on top of a state-of-the-art database system called Datapath [3]. GLAs along with Datapath provided many new possibilities for computational data analytics, as explained in Chapter 2.

However, UDAs, including the original GLAs, have some shortcomings that affected their flexibility to encode complex computations:

- data could only be moved in the form of relations
- like MapReduce, they did not have an internal mechanism to carry out iterative computations

Our focus on iterative mechanisms led to finding a more natural method of storing intermediate data, in the form of Constant States (Chapter 3). We were also able to use the Constant States for non-relational communication of data, a mechanism known as State Passing (Chapter 3). State Passing along with Iteration opened up new avenues in developing algorithms for computations, especially with regard to parallelization prospects.

Our intention was to take the GLADE framework to the next level, wherein it can encode any given analytics with improved flexibility while retaining the performance it initially had. Our starting point was to develop and adapt some of the classical data mining algorithms. Mahout [4] seemed a good starting point, as it is one of the most popular and comprehensive libraries of machine learning algorithms. We adopted the policy of changing the system only when a real-world task demanded such changes, rather than building functionality that may never be used.

This thesis enhances the work done in [2].

- We build on the original system and introduce two further enhancements to the GLADE framework and the Datapath system, Iterative GLAs and State Passing, which allow us a flexible mode for expressing computations.
- We focus on developing GLADE-based versions of Item-Based Collaborative Filtering, frequent itemset mining (both FP-Growth and Apriori), Decision Tree Classification and K-Means Clustering.
- Wherever possible, we compare and contrast our implementations with the corresponding implementations in Mahout.

Thesis Organization: The rest of the thesis is organized as follows. Chapter 2 explains the original system of GLADE and Datapath in detail; Chapter 3 details the enhancements that we propose and our motivation for making these changes; Chapter 4 focuses on the data mining algorithms mentioned earlier, with detailed explanations of our proposed algorithms; Chapter 5 reports our speedups and, wherever possible, draws a comparison with Mahout; Chapter 6 introduces a new form of state aggregation in GLAs for large intermediate states; Chapter 7 explores some of the iterative mechanisms proposed for MapReduce and compares them to our mechanism; and Chapter 8 concludes the thesis.

CHAPTER 2
ORIGINAL SYSTEM

Generalized Linear Aggregates (GLAs) are at the core of the GLADE framework. A GLA is an Abstract Data Type that has a state and some functions to modify that state. The state can be anything, ranging from a simple integer to a complex hash map. The functions used to examine and modify the state, as introduced by Rusu et al. [2], are as follows:

- Initialize(): sets up the state as required for the computation
- AddItem(Tuple t): adds t to the current state
- AddState(GLA& other): combines the state present in other with the state of the current GLA. AddState() is associative in nature.
- Finalize(): extracts the required information from the state grown by the GLA

Let us now look at a simple example that calculates the average of a large set of numbers. The information that needs to be maintained as the state is the sum of all numbers and their count.

State: Sum, Count

Functions:

- Initialize(): initializes Sum and Count to 0.
- AddItem(Tuple t): t is one of the numbers whose average is to be found. AddItem() adds the number to Sum and increments Count.
    Sum += t; ++Count;
- AddState(GLA& other): combines the Sum and Count of the two GLAs, respectively, thereby aggregating the two states. This AddState() is associative at no additional cost.
    Sum += other.Sum; Count += other.Count;
- Finalize(): infers the average from the maintained statistics, which can be done by a single division.
    Average = Sum / Count;

The inherent properties of GLAs provide some useful features at the core:

- there is no restriction as to the type of data taken as input
- there is no requirement for a key-value pair as enforced by MapReduce
- there are no constraints as to how the GLAs can be aggregated, as AddState() is associative

The associative nature of GLAs also leads to a natural ability to load balance, which can be ensured by dividing the input dataset into smaller parts and building GLAs from each of these parts. The GLAs grown separately can be combined in any sequence because of the associative property of AddState(). MapReduce, on the other hand, uses a hash distribution to load balance based on the values of the keys in its key-value pairs. This may be inefficient in situations where the keys are not good indicators of the distribution of the dataset. This property also leads us to Datapath.

Datapath [3] is a state-of-the-art data processing system. It is push based and data centric, which means that data circulates through the system and is utilized if and only if there exists a demand for it somewhere in the system. Datapath arranges its computation in the form of Workers and Waypoints. Waypoints are the entities in the system that manage computation, and Workers are the entities that perform the actual computation. Waypoints are as generic as possible for a particular type of computation, for example Scan, Print, Join, GLA etc. Workers are scheduled from the waypoints; each uses one of the available threads and performs the computation allocated to it. The computations to be performed by the workers are as specific as possible for a particular query. The code executed by the Workers is generated by a combination of user code and M4 templates. An incoming query Q1 is divided into as many waypoints as specified by the user, and they are represented as a graph in the system.

Datapath has column-based storage, and the data circulates through the system as chunks. The chunks store the data in columns, and a generalized schema is adopted so that the same chunk can carry multiple kinds of tuples. The schema has provisions to store all the different kinds of tuples present in the system at any time. This imposes the constraint that data generated by a particular waypoint must be encoded in the form of tuples in order to convey it to another waypoint. To overcome this restriction we introduce State Passing (Chapter 3).

A second constraint in the original GLADE framework was its inability to perform internal iterations; that is, there was no seamless way for a particular query to make multiple passes over the input dataset. This affected the performance and flexibility of the original system, as the iterations had to be performed externally. As a solution, we introduce Iterations (Chapter 3).
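The averaging GLA described earlier in this chapter can be sketched in Python as follows. This is a minimal illustration, not GLADE code: the class name AverageGLA, the helper build_partial and the snake_case method names are assumptions made for the example, and the constructor stands in for Initialize(). The last lines merge partial states in two different orders to show why associativity of AddState() lets separately grown GLAs be combined freely.

```python
from functools import reduce


class AverageGLA:
    """GLA-style aggregate computing a mean; the state is (total, count)."""

    def __init__(self):
        # Plays the role of Initialize(): start from an empty state.
        self.total = 0.0
        self.count = 0

    def add_item(self, value):
        # AddItem(Tuple t): fold one input value into the state.
        self.total += value
        self.count += 1

    def add_state(self, other):
        # AddState(GLA& other): merge another partial state into this one.
        self.total += other.total
        self.count += other.count
        return self

    def finalize(self):
        # Finalize(): derive the answer from the accumulated state.
        if self.count == 0:
            raise ValueError("no items were added")
        return self.total / self.count


def build_partial(chunk):
    """Grow one GLA from one part of the input (hypothetical helper)."""
    gla = AverageGLA()
    for value in chunk:
        gla.add_item(value)
    return gla


chunks = [[1, 2, 3], [4, 5], [6]]

# Merge the partial states in two different orders.
left_to_right = reduce(lambda a, b: a.add_state(b),
                       [build_partial(c) for c in chunks])
right_to_left = reduce(lambda a, b: a.add_state(b),
                       [build_partial(c) for c in reversed(chunks)])
```

Both merge orders yield the same average, which is the property that allows the input to be split into arbitrary parts for load balancing.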

CHAPTER 3
SYSTEM ENHANCEMENTS

In order to describe Iteration and State Passing, we explore the notion of a Constant State. A Constant State acts as an entity that can store and transmit information. In the case of iteration, the constant state carries information within the same waypoint; in state passing, it carries information between waypoints. We introduce some changes to the original format of GLAs:

- GLA Name() or GLA Name(const ConstState*): Initialize() is absent and instead the arguments are passed to the constructor. The second variation applies when a constant state is defined for that particular query in that waypoint.
- AddItem(Tuple t) and AddState(GLA& other): exist unchanged.
- bool Iterate(ConstState*&, int& numfrags): is used to pass the Constant State as a user-modifiable parameter and also to get the number of fragments produced by the current iteration of the query. It returns a boolean that says whether to iterate again or not. Iterate() is optional and should only be defined if a constant state is declared.
- Finalize(): is used as a precursor to GetResult() when the results are extracted in multiple parts. The Finalize() call sets up the division of the state into multiple parts so that whenever GetResult() asks for a particular division of the state, the GLA is able to provide it. This function is absent in the case of state passing.
- GetResult(int& output1, int& output2, ...): is used to extract the tuples from the current GLA for the next waypoint. This function is also absent in the case of state passing.

3.1 Iteration

Iteration is required in cases where a computation requires multiple passes over the whole input data, as in K-Means clustering, where the centers are gradually refined over multiple passes over the data. In Datapath, the production of chunks is initiated by the final waypoint in the graph. It sends a message downstream which, when received by a table scanner, causes the scanner to start producing chunks. The chunks routed through the waypoints are utilized only by the waypoints that require them. For each chunk processed by a waypoint for a particular query, the waypoint sends an acknowledgement, which is propagated downwards. This lets any producer waypoint know that its data has been received. Since the system is data driven, chunks could arrive at a waypoint when there are not enough resources to process them. In such cases the system utilizes a form of load shedding and drops the newly arrived chunk. Since no acknowledgement is received for that particular chunk, it is routed again to this waypoint at a later time.

So once AddItem() finishes the first pass over the entire dataset, all acknowledgements have been sent out. However, we are able to determine whether another iteration is required only when Iterate() completes execution. One way to re-route the chunks is to propagate as many drop messages downstream as there are chunks. This is a highly wasteful mode of transmission. Therefore, instead of dropping the chunks, we initiate the production of the chunks once again, as if for a new query. In case an iteration is required, the chunks are reinitiated and the waypoint is set up so that only the constant state remains from the previous iteration in the waypoint for that particular query.

If a constant state is specified by the query for that waypoint, then the constructor arguments are first used by the system to initialize the constant state. It is then passed as a constant, read-only state to the constructor of the GLA. The GLA uses the constant state to initialize itself and set up its state. The state is again sent to the GLA as a modifiable object in Iterate(), wherein the GLA can store more information or modify the current information for the next iteration, if any.

In K-Means clustering, the constant state initially stores the K random cluster centers for starting the first iteration. After a single iteration, in PreFinalize, the newly generated centers are injected into the constant state, and it carries the centers to the next iteration. When the query finishes, the result is present in the constant state as well. The constant state thereby maintains the volatile information between iterations.

We can infer that this is an efficient mechanism to incorporate iteration, as it works orthogonally with the Datapath system as a whole. The mechanism of reinitiating chunks is essentially the same as the mechanism of initiating chunks for a new query. Also, the table scanner can choose to use a caching mechanism for chunks if it suspects a pending iteration, which will further speed up the system.

Figure 3-1. Iterative Mechanism in GLADE

Figure 3-1 shows the architecture. The chunk is sent by the producer waypoint and, after processing it by calling AddItem() in user space, the GLA Waypoint (the consumer here) sends an acknowledgement back. Whether an iteration is required is determined in Iterate() in user space. By the time the user code conveys the message to the GLA Waypoint, all acknowledgements have already been sent. The GLA Waypoint immediately sends a reinitiate message that prompts the producer to produce all the chunks again.

3.2 State Passing

Another advantage that came out of building the iterative mechanism was state passing. The only mode of communication between waypoints in the older system was through chunks that carried tuples. The mechanism was ineffective in situations where the state produced by one waypoint could be consumed by its successor. In such cases it was required to extract the tuples from the state in order to pass it to the next waypoint. Therefore, we needed a technique that could carry any non-relational state as is between waypoints. These states eliminated the generation of unnecessary artifacts wherever possible and also produced more efficient code. For example, the FP-Growth algorithm requires an initial scan of the dataset to determine the frequency of each element. The main algorithm uses this frequency to construct the FP-Tree. In such a situation, it would be too tedious to pass each element and its corresponding frequency as tuples. Therefore, the constant states proved effective again. The chunks are modified so that they can carry constant states just as they would have carried tuples, and the earlier mechanism of routing chunks is applicable in this case as well.

The only other problem with passing states in this manner is the need for a mechanism to destruct them. For this, we included an acknowledgement counter in the chunk. The counter is initialized to the number of waypoints that need to receive the constant state. Whenever the chunk is received by a waypoint, the counter is decremented by one, and if it is zero when it reaches a particular waypoint, the constant state is passed into user space to be destructed.

This opened up further opportunities in some algorithms, especially in FP-Growth. We were able to analyze a part of the input dataset and produce an FP-Tree corresponding to that part. Such disparately produced partial trees could be sent to another waypoint using state passing, the state essentially being the FP-Tree. When this form of parallelism is coupled with an FP-Tree merge algorithm, as detailed in Section 4, we can produce a complete FP-Tree for the whole dataset.

3.3 State Passing and Iteration

State Passing and Iteration together produced even more powerful mechanisms for encoding computations. They allowed for dividing computations and merging them at a different waypoint in whatever format required. For example, in Collaborative Filtering, the computation of the values in an output matrix is divided into blocks. These blocks could be assimilated in another waypoint without any additional cost due to State Passing.
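The interplay between the constant state and Iterate() can be sketched as a small Python driver loop. Everything here is an assumption made for illustration: the names ConstState, TrimmedMeanGLA and run are hypothetical, the numfrags output of Iterate() is omitted, and the toy computation (a mean over values within a fixed radius of the current estimate) merely stands in for a real multi-pass analytic. The loop re-produces all chunks on every pass, mirroring the reinitiation of chunks described in Section 3.1, and only the constant state survives between passes.

```python
class ConstState:
    """Hypothetical constant state: the only data kept between passes."""

    def __init__(self, center, radius, eps):
        self.center = center
        self.radius = radius
        self.eps = eps


class TrimmedMeanGLA:
    def __init__(self, const_state):
        # The constructor only reads the constant state.
        self.center = const_state.center
        self.radius = const_state.radius
        self.total = 0.0
        self.count = 0

    def add_item(self, x):
        # Only values close to the current estimate contribute.
        if abs(x - self.center) <= self.radius:
            self.total += x
            self.count += 1

    def add_state(self, other):
        self.total += other.total
        self.count += other.count

    def iterate(self, const_state):
        # May modify the constant state; True requests another pass.
        if self.count == 0:
            return False
        new_center = self.total / self.count
        moved = abs(new_center - const_state.center)
        const_state.center = new_center
        return moved > const_state.eps


def run(chunks, const_state, max_passes=100):
    """Drive passes over a non-empty list of chunks until iterate() says stop."""
    passes = 0
    while passes < max_passes:
        passes += 1
        glas = []
        for chunk in chunks:  # chunks are produced again for every pass
            gla = TrimmedMeanGLA(const_state)
            for x in chunk:
                gla.add_item(x)
            glas.append(gla)
        merged = glas[0]
        for g in glas[1:]:
            merged.add_state(g)
        if not merged.iterate(const_state):
            break
    return passes


state = ConstState(center=0.0, radius=10.0, eps=1e-9)
passes = run([[1.0, 2.0], [3.0, 100.0]], state)
```

In this run the first pass moves the estimate from 0.0 to 2.0 and the second pass leaves it unchanged, so the loop stops after two passes with the result held in the constant state.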

CHAPTER 4
ALGORITHMS

We examine several data mining algorithms adapted for the GLADE environment, starting with Item-Based Collaborative Filtering, followed by K-Means Clustering, the Decision Tree Classifier and Frequent Itemset Mining algorithms, including both Apriori and FP-Growth.

4.1 Item-Based Collaborative Filtering

Item-Based Collaborative Filtering examines the relationships between items in the input data and, based on these findings, suggests recommendations for each of the items. Item-Based Collaborative Filtering involves the construction of a matrix of item-item pairs. The matrix represents the relationship of each item to every other item. This matrix is constructed using a similarity measure such as Pearson correlation, cosine or log-likelihood. For an incoming item A, the similarity is updated in the matrix between A and all other items linked to the same user. Once the matrix is built, recommendations for an item can be made based on its strongest relationships to other items. A user who has bought several items is provided with the top k of the recommendations found for each item from the matrix.

Proposed Algorithm: The construction of the item-item matrix is the most computationally intensive part of item-based collaborative filtering. The input is of the form

UserId1, ItemId1, PreferenceValue1
UserId2, ItemId2, PreferenceValue2
UserId3, ItemId3, PreferenceValue3

The item-item matrix, the central matrix that associates an item with all other items and their similarity measure, could be as large as the square of the number of items. Therefore, it is not possible to maintain the whole matrix in memory for large datasets. In order to produce a scalable algorithm, the algorithm should focus on building only a part of the matrix at any point in time. It is also enough to consider either the upper triangular matrix or the lower triangular matrix, as the matrix is symmetric. In our proposed implementation, we construct the matrix not on the ItemID values but on the hashed values of the ItemIDs. Therefore each row and column index ranges from 0 to 2^64 - 1 for a 64-bit hash. This gives us uniformity for dividing the computations.

Algorithm 4.1.0.1 State

- user2ItemPrefMap: an unordered map from UserIDs to a vector of ItemPref structures. ItemPref carries an ItemID and its preference value.
- item2PrefMap: an unordered map from (Item1, Item2) pairs to their statistics, with an object of the Statistics class as the corresponding value.
- Statistics: maintains the number of elements n, the sum of XY, the sum of X, the sum of Y, the sum of X^2 and the sum of Y^2.
- ranges: denotes the block of the matrix where computation is currently occurring. It has upperRow, lowerRow, upperColumn and lowerColumn values. This mechanism ensures that the intermediate data produced depends on the block size we specify using the range values.

The algorithm is as shown in 4.1.0.2. In each iteration, the GLAs are initialized by the constant state CFConst with a range. Each GLA only considers those tuples whose hashed value belongs to the ranges. For each item that the current user has in the user2ItemPrefMap, statistics are updated for that particular item and the current item in the itemItem2PrefMap. Once the GLAs are accumulated using AddState(), Pearson Correlation is applied to each of the maintained statistics in Iterate(). State Passing occurs by embedding the user2ItemPrefMap, itemItem2PrefMap and ranges in CFConst. Also, CFConst produces a new set of ranges for the next iteration. The iterations finish when all the ranges have been exhausted. The Statistics class maintains enough information for applying Pearson Correlation.
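The idea of processing the item-item matrix one block of hashed ranges at a time can be sketched in Python. This is an illustrative assumption, not the thesis implementation: the function names upper_triangular_blocks and in_block are hypothetical, the block tuple layout is invented for the example, and a tiny hash space of 8 values replaces the 64-bit hash so the coverage can be checked by hand.

```python
def upper_triangular_blocks(hash_space, block):
    """Yield (row_lo, row_hi, col_lo, col_hi) blocks covering all row <= col.

    Bounds are half-open: row_lo <= row < row_hi, col_lo <= col < col_hi.
    """
    for row_lo in range(0, hash_space, block):
        # Starting columns at row_lo skips blocks strictly below the diagonal.
        for col_lo in range(row_lo, hash_space, block):
            yield (row_lo, min(row_lo + block, hash_space),
                   col_lo, min(col_lo + block, hash_space))


def in_block(h_row, h_col, blk):
    """True if the hashed pair (h_row, h_col) falls inside the block."""
    row_lo, row_hi, col_lo, col_hi = blk
    return row_lo <= h_row < row_hi and col_lo <= h_col < col_hi


# Hypothetical tiny hash space; a real run would use the full hash range.
blocks = list(upper_triangular_blocks(hash_space=8, block=4))
```

Each iteration would hand one such block to the GLAs as their ranges, so only the statistics for pairs inside that block are held in memory at once; a smaller block size trades more passes over the data for less intermediate state.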

Algorithm 4.1.0.2 CFItemGLA

CFItemGLA(const ConstantState* CFConst)
    ranges ← getNextRangesFrom(CFConst)

AddItem(int UserID, int ItemID, int PrefValue)
    hashedItemId ← hash(ItemID)
    if hashedItemId ∉ ranges then
        return
    end if
    if UserID ∈ user2ItemPrefMap then
        itemIDs ← user2ItemPrefMap.Find(UserID)
        for each item in itemIDs do
            Statistics ← item2PrefMap.Find(item.ID, ItemID)
            Statistics.Add(PrefValue, item.PrefValue)
        end for
        itemIDs.Add(ItemID, PrefValue)
    else
        user2ItemPrefMap.Add(UserID, ItemID)
    end if

AddState(GLA& other)
    Merge(ranges, other.ranges)
    Merge(user2ItemPrefMap, other.user2ItemPrefMap)
    Merge(itemItem2PrefMap, other.itemItem2PrefMap)

bool Iterate(ConstantState*& CFConst, int& numFrags)
    numFrags ← 1    // to perform State Passing
    for each Statistics in item2PrefMap do
        ApplyPearsonCorrelation(Statistics)
    end for
    CFConst.currentRanges = ranges    // for State Passing
    CFConst.itemItemMap = itemItem2PrefMap    // for State Passing
    CFConst.userItemMap = user2ItemPrefMap    // for State Passing
    if CFConst.allRangesExhausted then
        return false
    else
        return true
    end if
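The ApplyPearsonCorrelation step in the listing above relies on the running sums kept by the Statistics object. A minimal Python sketch of such an object follows; the class name PairStatistics and its method names are assumptions for illustration, and returning None for a zero-variance pair is a choice made here, not something the thesis specifies.

```python
import math


class PairStatistics:
    """Running sums for one (item, item) pair; mergeable like a GLA state."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xy = 0.0
        self.sum_x2 = 0.0
        self.sum_y2 = 0.0

    def add(self, x, y):
        # Fold in one pair of preference values given by the same user.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y
        self.sum_x2 += x * x
        self.sum_y2 += y * y

    def merge(self, other):
        # Sums are additive, so partial statistics combine in any order.
        self.n += other.n
        self.sum_x += other.sum_x
        self.sum_y += other.sum_y
        self.sum_xy += other.sum_xy
        self.sum_x2 += other.sum_x2
        self.sum_y2 += other.sum_y2

    def pearson(self):
        # Sample Pearson correlation computed from the sums alone.
        num = self.n * self.sum_xy - self.sum_x * self.sum_y
        den_x = self.n * self.sum_x2 - self.sum_x ** 2
        den_y = self.n * self.sum_y2 - self.sum_y ** 2
        if den_x <= 0 or den_y <= 0:
            return None  # assumption: undefined when either side is constant
        return num / (math.sqrt(den_x) * math.sqrt(den_y))


# Two partial states built separately, then merged.
a, b = PairStatistics(), PairStatistics()
a.add(1.0, 2.0)
a.add(2.0, 4.0)
b.add(3.0, 6.0)
a.merge(b)
r_positive = a.pearson()

c = PairStatistics()
for x, y in [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]:
    c.add(x, y)
r_negative = c.pearson()
```

Because every field is a plain sum, the correlation never needs the raw preference values again after AddItem(), which is what allows the per-block statistics to be merged through AddState().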

Sample Pearson correlation coefficient:

\[
r = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{n\sum X_i^2 - \left(\sum X_i\right)^2}\;\sqrt{n\sum Y_i^2 - \left(\sum Y_i\right)^2}}
\]

The Constant State here, CFConst, provides the new ranges for each iteration. Each iteration computes a part of the matrix, and the output can be accumulated as a state and sent to another waypoint that can either do post-processing or serialize it to a file. The iteration ends when computations have occurred in each part of the matrix. If the right block size is selected, the size of the intermediate data never gets too large.

4.2 K-Means Clustering

In K-Means, k random points are chosen initially from the whole dataset as centres. Each point is assigned to one of the k groups based on the smallest distance between the point and the chosen centres. The distance can be measured using a distance measure such as Euclidean distance. Once a pass over the dataset is completed, the new centres of the newly formed k groups are found. The algorithm terminates if the new centres and the previous centres differ by no more than a threshold, called the convergence delta. Otherwise, it makes a new pass over the dataset with the new centres.

Proposed Algorithm: K-Means clustering uses multiple passes over the data to refine the cluster centers. The most computationally intensive part of the algorithm is the number of scans that have to be performed. We also eliminate a further overhead by delaying the storing of the association of points to cluster centers until after all the iterations complete.

Algorithm 4.2.0.3 State

- centres: stores the k centres
- statVector: an array of k Statistics objects, each index of which corresponds to one of the k centres. A Statistics object stores int sum1, sum2, ... (one for each dimension of the point) and int count.
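The state listed above can be turned into a small runnable sketch of iterative K-Means in the GLA style. The Python below is an assumption-laden illustration rather than the GLADE implementation: the names KMeansGLA and kmeans are hypothetical, a plain dict stands in for the constant state, the numFrags output is omitted, and an empty cluster simply keeps its old centre (a choice made here, not taken from the thesis). It uses math.dist, available in Python 3.8 and later.

```python
import math


class KMeansGLA:
    def __init__(self, centres):
        # Centres come from the constant state and are only read here.
        self.centres = [tuple(c) for c in centres]
        dims = len(self.centres[0])
        self.sums = [[0.0] * dims for _ in self.centres]
        self.counts = [0] * len(self.centres)

    def add_item(self, point):
        # Assign the point to its closest centre and update that statistic.
        closest = min(range(len(self.centres)),
                      key=lambda i: math.dist(self.centres[i], point))
        for d, v in enumerate(point):
            self.sums[closest][d] += v
        self.counts[closest] += 1

    def add_state(self, other):
        for i in range(len(self.centres)):
            for d in range(len(self.sums[i])):
                self.sums[i][d] += other.sums[i][d]
            self.counts[i] += other.counts[i]

    def iterate(self, const_state, convergence_delta):
        # Compute new centres; request another pass if any centre moved.
        new_centres, again = [], False
        for i, old in enumerate(self.centres):
            if self.counts[i] == 0:
                new = old  # assumption: an empty cluster keeps its centre
            else:
                new = tuple(s / self.counts[i] for s in self.sums[i])
            if math.dist(old, new) > convergence_delta:
                again = True
            new_centres.append(new)
        const_state["centres"] = new_centres
        return again


def kmeans(chunks, initial_centres, convergence_delta=1e-9, max_passes=100):
    """Run passes over a non-empty list of chunks until convergence."""
    const_state = {"centres": list(initial_centres)}
    for _ in range(max_passes):
        merged = None
        for chunk in chunks:
            gla = KMeansGLA(const_state["centres"])
            for point in chunk:
                gla.add_item(point)
            if merged is None:
                merged = gla
            else:
                merged.add_state(gla)
        if not merged.iterate(const_state, convergence_delta):
            break
    return const_state["centres"]


centres = kmeans([[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]],
                 initial_centres=[(0.0, 0.0), (10.0, 0.0)])
```

Note that the point-to-cluster assignment is never stored; only per-centre sums and counts are kept, in line with the proposal to defer storing assignments until the iterations complete.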

PAGE 25

Algorithm4.2.0.4StatisticsObjectforKMeans AddPoint(intX,intY,intZ)//AddingaPointtoStatisticsObject sum1+=X;sum2+=Y;sum3+=Z;++count; Merge(Statisticsother) sum1+=other.sum1;sum2+=other.sum2;sum3+=other.sum3;count+=other.count; Finalize() sum1==count;sum2==count;sum3==count; convergenceDeltaisthetolerabledifferenceindistancebetweenthecurrentcentresandthenewlycalculatedcentres,ameasureofconvergence.Thealgorithmstopsiteratingwhenthedifferenceislessthanorequaltoconvergencedelta.InitiallyKMeansConstisinitializedwithKrandomcentres.Thealgorithmisasshownin 4.2.0.5 .Theconstantstate,KMeansConststoresthekrandomcentresintherstiterationandstoresthenewlycalculatedcentresinsubsequentiterations.ItinitializestheGLAwiththecentres.ForeachincomingtupleinAddITem(),theGLAdeterminesthecentretowhichthetuplehastheminimumdistance.Thetupleisaddedtothestatisticscorrespondingtothiscentre.Thestatisticsmaintainedarethesumofeachco-ordinatesinthepointandalsothenumberofpoints.InIterate(),oncenewcentresarefound,theirdifferenceindistancefromtheoriginalcentresarecalculatedandiftheydifferbymorethantheconvergencedelta,thenanewiterationisperformed.EuclideanDist(point1,point2)calculatestheEuclideandistancebetweentwopoints.Here,onceiterationsarecomplete,anotherfunctioncouldrunthatassociateseachpointtooneofthecentres. 4.3DecisionTreeClassierThetrainingpartofthedecisiontreeclassierbuildsatreewhoseleafnodescontainoneoftheoutputclassesandeachoftheintermediatenodescontainsadecision.Adecisiondenotesaparticularattributeandoneofitsvalues.Thetreesare 25

Algorithm 4.2.0.5 KMeansGLA
KMeansGLA(const ConstantState* KMeansConst)
  kCentres ← KMeansConst.getCentres
AddItem(inTuple)
  closestCentre ← 0
  minDistance ← EuclideanDist(kCentres[0], inTuple)
  for i ← 1; i < [...] > convergenceDelta then
      iterateFlag ← true
      numFrags ← 0 // no state is passed out
    end if
    newCentres.Insert(newCentre)
  end for
  KMeansConst.centres ← newCentres
  return iterateFlag
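
The AddItem / AddState / Iterate structure described for K-Means can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the GLADE implementation: the class and function names, the sequential loop over chunks that stands in for parallel GLA instances, the max_iters guard, and the rule that an empty cluster keeps its previous centre are all assumptions introduced here.

```python
import math


def euclidean_dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


class Statistics:
    """Running per-dimension sums and a point count for one centre."""

    def __init__(self, dims):
        self.sums = [0.0] * dims
        self.count = 0

    def add_point(self, point):
        for d, x in enumerate(point):
            self.sums[d] += x
        self.count += 1

    def merge(self, other):
        for d, s in enumerate(other.sums):
            self.sums[d] += s
        self.count += other.count

    def finalize(self, fallback):
        # Assumption of this sketch: an empty cluster keeps its old centre.
        if self.count == 0:
            return list(fallback)
        return [s / self.count for s in self.sums]


class KMeansGLA:
    def __init__(self, centres):
        self.centres = centres
        self.stats = [Statistics(len(centres[0])) for _ in centres]

    def add_item(self, point):
        # Add the point to the statistics of its closest centre.
        closest = min(range(len(self.centres)),
                      key=lambda i: euclidean_dist(self.centres[i], point))
        self.stats[closest].add_point(point)

    def add_state(self, other):
        for mine, theirs in zip(self.stats, other.stats):
            mine.merge(theirs)

    def iterate(self, convergence_delta):
        """Return (new_centres, iterate_flag)."""
        new_centres = [s.finalize(c) for s, c in zip(self.stats, self.centres)]
        moved = any(euclidean_dist(old, new) > convergence_delta
                    for old, new in zip(self.centres, new_centres))
        return new_centres, moved


def kmeans(chunks, centres, convergence_delta, max_iters=100):
    """One GLA per chunk, merged pairwise into the first, once per pass."""
    for _ in range(max_iters):
        glas = []
        for chunk in chunks:
            gla = KMeansGLA(centres)
            for point in chunk:
                gla.add_item(point)
            glas.append(gla)
        merged = glas[0]
        for gla in glas[1:]:
            merged.add_state(gla)
        centres, again = merged.iterate(convergence_delta)
        if not again:
            break
    return centres
```

Because only sums and counts are kept, the state merged in add_state has a size that depends on k and the number of dimensions, not on the number of points, which is why no point-to-centre association needs to be stored during the iterations.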

built so that, during the testing part of the algorithm, each incoming tuple traverses the tree according to its value for each decision and ends up in one of the leaf nodes, which determines the class to which the tuple belongs. For categorical attributes, if attribute A takes the values a, b, c and the decision at a node is taken on A, then three splits occur: A = a, A = b and A = c. The same approach does not work for real-valued attributes, since an attribute with n distinct values would produce n splits. Therefore, a single value is chosen that maximizes the efficiency of the split. For an attribute B = 1, 2, 3, 4, ..., x, ..., the decision takes the form B <= x and B > x, resulting in a binary split. Here, x is called the split value. Different methods can be employed to choose an attribute at each node, such as Information Gain or Gini Index. We focus on Information Gain for our calculations. Information Gain chooses the attribute that yields the largest reduction in entropy for the whole dataset.

The classic algorithm splits the dataset according to each decision taken and then repeats the process at each new node. The computation is highly intensive because calculating the information gain requires one pass through the subset of the dataset at each node. For real-valued attributes the problem becomes more intensive, as the split value must also be found. This is done by treating each value as a candidate split value and computing the information gain for it. The value with the maximum information gain is chosen as the split value for that attribute. We focus our algorithm on real-valued attributes.

Proposed Algorithm: The proposed implementation builds decision trees level by level, with each level requiring one pass over the data. As mentioned earlier, the classical approach divides the data at each node to this effect, but our implementation does not perform partitioning. Instead, we run a new iteration of the data through the partially grown tree. The tuples propagate through the tree and are added to the statistics of the node at which they end up. Therefore the focus is on a set

of nodes at a particular iteration, on their children (if any) in the next iteration, and so on. Let curNodes contain the current nodes.

Decision Tree Node
  decisionAtt - stores the decisive attribute
  decisionVal - stores its corresponding splitting value
  leaf - a flag which shows whether the node is a leaf or not
  left and right node - pointers to children

State
  root - stores the root of the Decision Tree
  node2StatMap - maps each of the current nodes to a Statistics object. For each attribute, the Statistics objects maintain the count of each element.

Statistics object
  outputClass2CountMap - maps the output class of the tuple to its count
  attributesStatVector - holds an element2StatMap for each attribute
  element2CountMap - maps each element in a particular attribute to a Count object
  Count object - stores both the count of the element and outputClassCounter, a map from outputClass to its count

Algorithm 4.3.0.6 Decision Tree Statistics Object
AddTuple(Tuple t)
  For each element of t, the count is incremented after reaching the element using the required indirection

The algorithm is shown in Algorithm 4.3.0.7.

4.4 Apriori

The algorithm was introduced in [5] and is based on the property that every subset of a frequent itemset must itself be frequent. This property allows the algorithm to incrementally build longer candidates from shorter ones. In the algorithm, first all the candidates

Algorithm 4.3.0.7 DTreeGLA
DTreeGLA(const ConstantState DTreeConst)
  curNodes ← DTreeConst.getCurNodes
  root ← DTreeConst.getRoot
AddItem(inTuple)
  node ← root
  while node != leaf || node ∉ curNodes do
    if inTuple[node.decisionAtt]

  if 0 = newNode2StatMap.size() then
    return false // iterations have finished
  end if
  DTreeConst.node2StatMap ← newNodes // store the new nodes in the constant state
  return true // iterate

are generated that have size 1, the 1-item candidates. The support of each of these is counted, and the ones that fall below the minimum support are eliminated. Then the 2-item candidates are generated from the 1-item candidates, and the process continues. The algorithm finishes when there are no more candidates to be generated or when the support count of none of the candidates is above the threshold. Some of the better-known implementations of Apriori are Ferenc Bodon's implementation using a trie-based data structure and candidate hashing [6], Christian Borgelt's implementation [7] using recursion pruning, and Bart Goethals' implementation using Agrawal's algorithm [8]. Bit vectors have also been used for Apriori implementations [9], [10], where the transactions in the dataset are represented as bitmaps: a 1 if an item is present and a 0 if not.

Both Apriori and FP-Growth require a list of the unique elements in the dataset and their number of occurrences. We call this the F-List (Frequency List). One simple GLA waypoint computes the F-List using one pass over the whole dataset. The F-List is sent to the main computational GLA waypoint via state passing.

Proposed Algorithm: We combine the trie-based Apriori with the bit vector implementation. The bit vector is used for support counting and the trie is used for candidate generation. The F-List is fed to two GLAs. One of them feeds it to a Trie, which is then passed as APConst to APGLA. The other uses it to make a bitMapper that maps each of the elements in the F-List to one of the 64 or 32 bit indices of the bitmap. Using the bitMapper, it converts the incoming dataset to bit vectors, and these vectors are fed to APGLA. In each iteration the GLAs count the occurrences of each pattern, as shown in AddItem(). In the first iteration they count the occurrences of 1-item candidates, in the second iteration 2-item candidates, and so on. Once the

support count is found, the patterns that are below the minimum support are eliminated. The surviving candidates are fed to the Trie in APConst, which generates the new set of candidates. A new iteration begins with the new set of candidates. Iterations finish when there are no more candidates left after support counting and elimination, or when APConst fails to generate new candidates.

State
  Candidates2CountMap - maps the candidates to their counts

The algorithm is shown in Algorithm 4.4.0.8.

Algorithm 4.4.0.8 APGLA
APGLA(const ConstantState APConst)
  Trie ← APConst.getTrie
  Candidates2CountMap ← Trie.getNewCandidates()
  bitMapper ← APConst.getBitMapper
AddItem(inTuple)
  for candidate in Candidates2CountMap do
    if all bits in candidate are set in inTuple then
      Increment its count
    end if
  end for
AddState(GLA& other)
  Merge(Candidates2CountMap, other.Candidates2CountMap)
bool Iterate(ConstantState*& APConst, int& numFrags)
  Trie ← APConst.getTrie
  for element in Candidates2CountMap do
    if element.count ≥ minSupport then
      Trie.Add(element.Candidate)
    end if
  end for
  Trie.generateNewCandidates
  if Trie.getNewCandidates.size = 0 then
    return false // no iteration
  end if
  return true // iterate
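
The bit-vector support counting in AddItem() and the iterate-until-no-candidates loop can be sketched as follows. This is a hedged Python illustration, not the thesis code: all function names are hypothetical, Python's unbounded integers replace the 32 or 64 bit bitmaps, and candidate generation uses a plain pairwise join with subset pruning in place of the trie, purely to keep the example short.

```python
from collections import Counter
from itertools import combinations


def build_bit_mapper(transactions, min_support):
    """F-List pass: count each item once per transaction and assign a bit
    index to every item that meets the minimum support."""
    f_list = Counter(item for t in transactions for item in set(t))
    frequent = sorted(i for i, c in f_list.items() if c >= min_support)
    return {item: bit for bit, item in enumerate(frequent)}


def to_bits(items, bit_mapper):
    """Encode one transaction as a bit vector; infrequent items are dropped."""
    mask = 0
    for item in items:
        if item in bit_mapper:
            mask |= 1 << bit_mapper[item]
    return mask


def count_support(candidates, tuples):
    """AddItem(): a candidate matches when all of its bits are set in the tuple."""
    counts = dict.fromkeys(candidates, 0)
    for t in tuples:
        for c in candidates:
            if c & t == c:
                counts[c] += 1
    return counts


def next_candidates(frequent, size):
    """Join frequent itemsets pairwise (not the trie of the thesis); keep the
    unions of `size` items whose (size - 1)-item subsets are all frequent."""
    frequent = set(frequent)
    out = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if bin(union).count("1") != size:
            continue
        bits = [1 << i for i in range(union.bit_length()) if (union >> i) & 1]
        if all((union & ~bit) in frequent for bit in bits):
            out.add(union)
    return out


def apriori(transactions, min_support):
    bit_mapper = build_bit_mapper(transactions, min_support)
    tuples = [to_bits(t, bit_mapper) for t in transactions]
    candidates = {1 << bit for bit in bit_mapper.values()}
    result, size = {}, 1
    while candidates:  # one pass over the data per iteration
        counts = count_support(candidates, tuples)
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        size += 1
        candidates = next_candidates(frequent, size)
    items = {bit: item for item, bit in bit_mapper.items()}
    return {
        frozenset(items[i] for i in range(m.bit_length()) if (m >> i) & 1): n
        for m, n in result.items()
    }
```

The containment test `c & t == c` is a single machine operation when the F-List fits in one word, which is the appeal of the bitmap representation for support counting.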

Candidate Generation in Trie: The candidates are generated using a trie. The Trie contains only elements that are above minimum support, so no pruning is required. The candidates at level l+1 are generated in the following manner:
  - all nodes in level l-1 are added to a vector
  - for each node in the vector, take each child as curChild
  - for every curChild, add all the siblings to its right as its children
  - generate all the words of length l+1, which will be our new candidates

4.5 FP Growth

The significance of FP-Growth, as introduced in [11], is that it reduces the number of scans of the dataset to two and eliminates candidate generation. This is an improvement over algorithms like Apriori, which scans the database as many times as the depth of the mining and also generates candidates in each step. FP-Growth uses a compact structure called the FPTree at its core. An FPTree is self-sufficient in that it contains all the information required for the mining tasks. The FPTree is a prefix tree in which the nodes holding the same element are linked; once built, it is recursively mined to generate the frequent itemsets.

The initial focus in improving the algorithm was on bettering the performance of single-threaded implementations, as in [12] and [13]. Pramudiono et al. [14] proposed an implementation for a shared-nothing architecture. After determining the frequency of each of the single items present in the database and forming an F-List by retaining the items that occur at least as often as the minimum support, each of the generated pattern bases is mapped to and processed by one of the shared-nothing nodes. With the advent of MapReduce, [15] designed a parallel FP-Growth algorithm that utilized that computational model. Here, each node grew its own FPTree in parallel. The F-List was partitioned into various lists called g-lists, and the g-lists were distributed to different nodes. During the mining, a node added to its FPTree only those patterns that were present in its g-list. Zhou et al. [16] introduced a form of load balancing to the parallel FP-Growth implementation introduced in [15]. They made

estimations based on the initial F-List, assuming that the number of occurrences of an element corresponds directly to the length of the longest frequent path in the conditional pattern base. The division of the F-List into g-lists is influenced by this notion of load.

Proposed Algorithm: The FP-Growth implementation in Mahout uses the mechanism introduced in [15]. It grows the FPTrees in parallel, mines them separately and then compiles the top-k items produced by each of them to obtain the global list. Our implementation focuses on an FPTree merge algorithm (Section 4.5.1). This enables AddState() to be associative. We build the FPTrees in parallel in AddItem(). The individually built FPTrees can be accumulated into a single FPTree in AddState() using FPTree merge. Also, with the use of State Passing, this tree can be passed to a different waypoint which can mine it. State Passing and Iteration also give further opportunities for parallelism here. The initial dataset can be divided and each of the divisions can be mined separately using individual instances of FPGLAs. The F-List is used to initialize FPGConst, but it is finalized before initialization by eliminating elements below the minimum support, minSupport.

State
  FList - holds the F-List
  FPTree - holds the FPTree

Algorithm 4.5.0.9 FPGLA Statistics Object
Merge(FPTree other)
  Refer to Section 4.5.1

The algorithm is shown in Algorithm 4.5.0.10.

4.5.1 FPTree Merge

The algorithm for Merge is as follows. Let node be a data structure that contains a data element, the count of its occurrences and the links to its child nodes.

Merge(FPTree other)

Algorithm 4.5.0.10 FPGLA
FPGLA(const ConstantState FPGConst)
  FList ← FPGConst.getFList
AddItem(inTuple)
  sort(inTuple, FList) // sorts in decreasing order according to the frequencies in FList
  seenElements // elements in t that have already been seen
  for i ← t.size() - 1; i ≥ 0; --i do
    if !seenElements.Find(t[i]) then
      Tuple tup ← Copy elements from t[0] to t[i]
      FPTree.AddTuple(tup)
    end if
  end for
AddState(GLA& other)
  FPTree.Merge(other.FPTree)
bool Iterate(ConstantState& FPTreeConst, int& numFrags)
  encode FPTree in FPTreeConst
  numFrags ← 1 // for State Passing
  return false; // no iteration required

1. Initialize Node oRoot as other.root.
2. Scan the child nodes of oRoot, storing the reference to each in oNode, and for each one perform Step 3.
3. Call MergeRecursive(curNode, oNode).

MergeRecursive(curNode, oNode)
1. Check whether the data of oNode is present among curNode's children. If it is, let the node that was found be foundNode and go to Step 2. If it is not found, go to Step 3.
2. Add the count of oNode to the count of foundNode, set curNode to foundNode, and go to Step 4.
3. Create a new node and add oNode's data and count to it. Add the new node as a child of curNode. Set curNode to the new node.
4. Scan the children of oNode and for each one call MergeRecursive(curNode, oNode.child).

Merge thus combines the nodes of one FPTree with those of the other level by level.

4.5.2 Proof of Correctness

For the proof of correctness of the entire algorithm, two results are proved.

Figure 4-1. FPTree Merge

Insertion in FPTree is order independent

For this it is enough to prove that ((S1 + T) + S2) = (S1 + (T + S2)), where S1 is Sequence 1, S2 is Sequence 2 and T is an FPTree. Sequences 1 and 2 can be considered as patterns mined from an incoming tuple. Two statements are considered for the proof:

Statement 1: An FPTree is a prefix tree, and in a prefix tree two sequences that have the same prefix follow the same path in the tree until the end of the common prefix.

Statement 2: Two different sequences, when added at a particular node of an FPTree, diverge, and the structure of the node and the added sequences does not depend on the order in which the sequences are added.

Assumption: Assume that ((S1 + T) + S2) != (S1 + (T + S2)). Three scenarios are possible here:

1. S1 = S2. In this case all the prefixes are the same. So, according to Statement 1, both sequences must follow the same path from root to leaf.
2. S1 ≠ S2 and the sequences have no common prefix. In this case they diverge at the root node itself. Therefore, by Statement 2, the order of addition of S1 and S2 does not change the structure of the tree.
3. S1 ≠ S2 but they have a common prefix. Here, the two sequences follow the same path until the common prefix ends, and then diverge from the last common element until their corresponding leaf nodes. This is a combination of Scenario (1) and Scenario (2), so the trees formed cannot be different.

Every scenario contradicts the assumption. Therefore, insertion in an FPTree is order independent.

Correctness of FPTree merge for a single path FPTree

Figure 4-1 shows this scenario, that is, adding a sequence to an FPTree that has only a single path. Consider a single path FPTree to be merged into a larger FPTree. Merging a single path is equivalent to insertion. So if insertion as a process is correct, merging a single path is also correct.

Therefore, the same can be said for a multipath FPTree as well, since merging it amounts to calling the single path FPTree merge algorithm recursively. From the two proven results, the correctness of the FPTree merge algorithm is established.

4.5.3 Time Complexity

Merging two trees T1 and T2 requires traversing T2 once using DFS and following the same path in T1. If nodes are not found, they are created in T1, which is done in constant time. Therefore, if T2 has V vertices and E edges, the time complexity is 2 * O(V + E), that is, O(V + E), the same as a DFS traversal.
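
The merge procedure of Section 4.5.1 and the order-independence property of Section 4.5.2 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the Node and FPTree classes and the as_dict helper are hypothetical names, children are kept in a dictionary so that the child lookup is constant time on average, and the node links between equal elements that a full FPTree maintains are omitted.

```python
class Node:
    """A data element, its occurrence count and links to child nodes."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


class FPTree:
    def __init__(self):
        self.root = Node()

    def insert(self, sequence, count=1):
        """Insert one (already F-List-sorted) sequence along its prefix path."""
        node = self.root
        for item in sequence:
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += count

    def merge(self, other):
        """Merge `other` into this tree, visiting each node of `other` once."""
        for o_child in other.root.children.values():
            self._merge_recursive(self.root, o_child)

    def _merge_recursive(self, cur, o_node):
        found = cur.children.get(o_node.item)
        if found is None:
            # Step 3: create the missing node under cur.
            found = Node(o_node.item)
            cur.children[o_node.item] = found
        # Steps 2 and 3 both end with o_node's count added to the target node.
        found.count += o_node.count
        # Step 4: recurse into the children of o_node.
        for o_child in o_node.children.values():
            self._merge_recursive(found, o_child)

    def as_dict(self):
        """Nested {item: (count, children)} view, used to compare trees."""
        def walk(node):
            return {item: (child.count, walk(child))
                    for item, child in node.children.items()}
        return walk(self.root)
```

Building two trees from disjoint halves of the input and merging them should give the same structure as inserting every sequence into one tree, in any order; comparing the as_dict() views of the trees is one way to check this on small inputs.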

CHAPTER 5
EMPIRICAL EVALUATION

All experiments were conducted on an AMD Opteron(TM) Processor 6128 machine with 64GB RAM and 16 cores. Our goal is first of all to compare the performance of GLADE with the corresponding Mahout implementation under similar parametric conditions, whenever such an implementation exists in Mahout. We also show the scalability of GLADE for varying chunk sizes and numbers of cores in some of the algorithms. For fixed-core comparisons, the algorithm was run on all 16 cores of the system. Hadoop, the MapReduce implementation on which Mahout runs, was set up in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process, thereby giving the impression of a pseudo cluster. It was also configured to run 8 Map and 8 Reduce tasks. The complete sets of Hadoop and Mahout configurations used for our experiments are shown in Table 5-1 and Table 5-2 respectively.

Table 5-1. Hadoop Configurations

Property                         Value        File Used
dfs.block.size (bytes)           1048576      hdfs-site.xml
mapred.map.tasks                 8            mapred-site.xml
mapred.reduce.tasks              8            mapred-site.xml
mapred.map.child.java.opts       -Xmx20000M   mapred-site.xml
mapred.reduce.child.java.opts    -Xmx20000M   mapred-site.xml
mapred.compress.map.output       false        mapred-site.xml
fs.inmemory.size.mb (MB)         2048         core-site.xml
io.sort.mb (MB)                  2048         core-site.xml
HADOOP_HEAPSIZE (MB)             60000        hadoop-env.sh

Table 5-2. Mahout Configurations

Property                 Value    File Used
MAHOUT_HEAPSIZE (MB)     60000    bin/mahout

5.1 Item-Based Collaborative Filtering

Two datasets are used to evaluate the performance of the algorithm. The first is the Book-Crossing Dataset^1 mined by Cai-Nicolas Ziegler, DBIS Freiburg, and the second is the MovieLens dataset^2.

5.1.1 Book Crossing Dataset

The non-numerical data were discarded from processing here. As shown in Figure 5-1, experimental results show that the algorithm is fastest when the chunk size is in a medium range. The performance loss at lower chunk sizes can be attributed to contention among the threads for the lock that allocates the chunks, whereas the lower speed at higher chunk sizes can be attributed to the memory bandwidth consumed by the movement of large states. Also, some of the GLAs become idle towards the end, as all the chunks have already been allocated.

5.1.2 Movielens

The Movielens data contained three datasets: 100K, consisting of 100K ratings (amounting to 956KB); 1M, consisting of 1 million ratings (amounting to 11.0MB); and 10M, consisting of 10 million ratings (amounting to 123.4MB). We used SIMILARITY_PEARSON_CORRELATION as the similarity class in both the GLADE and the Mahout implementations. In the GLA format, the item-item similarity matrix was constructed taking into consideration all relevant users, items and preference values

1 http://www.informatik.uni-freiburg.de/cziegler/BX/
2 http://www.grouplens.org/node/12

and no filtering was employed. The chunk size for the GLA was kept constant at one thousandth of the input size. Experimental results showed that the GLA based implementations were almost 3x faster on the 10M dataset, almost 9x faster on the 1M dataset and almost 100x faster on the 100K dataset, as shown in Figure 5-2. The startup and teardown of the Hadoop mappers and reducers could be the reason for this vast difference on the smaller computations. The overhead becomes tolerable on larger inputs, where the computation time exceeds it by a large margin.

Figure 5-1. GLADE Item Based Collaborative Filtering

5.2 K-Means Clustering

For running the KMeans algorithm we used the 281.6KB synthetic control dataset^3. We synthesized larger datasets of 141MB, 1.2GB and 5GB from the same dataset. KMeans was run with a convergence delta of 0.5 in all cases, and we report values averaged over 10 runs of each experiment to even out the time differences that might be caused by the initial random seeding of the centres.

3 http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data

Figure 5-2. Item Based Collaborative Filtering: GLADE vs Mahout [log scale]

Figure 5-3 shows the comparison of GLADE with the Mahout implementation. We used the default script provided, build-cluster-syntheticcontrol.sh, and changed the dataset accordingly. The Mahout implementation had a maximum number of iterations parameter, which was set to 10, whereas GLADE ran as many iterations as were required for convergence. Therefore, we decided that per-iteration time would be a better basis for comparison. We obtained a speedup of 2332x for the 281KB dataset, 246x for the 141MB dataset and 441x for the 1.2GB dataset. On the 1.2GB dataset, Mahout ran out of memory in the 10th iteration. For the 281KB dataset, the time taken in initializing and terminating the map and reduce phases dominated the computation time, an effect which evens out at the 141MB size. The reduced performance of Mahout in the 1.2GB region is due to the large amount of intermediate data that it generates.

Figure 5-4 shows the scalability of the GLADE implementation of KMeans. The experiment was run on the 5GB dataset. We used a convergence delta of 0.5 for the algorithm, and the graph shows the variation in time with k. Also, we have used

the computational time for the algorithms to converge rather than the per-iteration time in this graph.

Figure 5-3. Per Iteration Speedup, KMeans: GLADE vs Mahout [log scale]

Figure 5-4. Scalability of KMeans (GLADE) Algorithm

5.3 Decision Tree Classifier

Figure 5-5 shows the comparison of the implementation for various numbers of chunks. The algorithm was run on a synthesized dataset in which every attribute was real valued and whose total file size amounted to 3.4GB. Experimental results are shown for 16 cores, 8 cores and 2 cores. They show the scalability of the implementation on the shared memory architecture.

Figure 5-5. GLADE Decision Tree

5.4 Apriori

Our Apriori implementation was run on the Google ngrams dataset^4, with all the 1-grams (the one-word dataset) combined to form a 9GB file. 16 cores were used for the comparison with the FPGrowth algorithm, as shown in Figure 5-6. The comparison was made on the same dataset and the same number of cores. It can be observed that the algorithm was slower at lower values of minSupport. This can be attributed to the fact that the depth of the mining is higher at this point. As minSupport increases, the depth

4 http://books.google.com/ngrams/datasets

of mining decreases, giving Apriori better performance. FPGrowth maintains a relatively constant performance throughout, as it does not depend on minSupport as much as the Apriori algorithm does.

Figure 5-6. GLADE Apriori vs GLADE FPGrowth

5.5 FP Growth

The experiments were carried out on the Traffic Accidents Data Set provided by Karolien Geurts^5. The data was 33.9MB in size, containing 11500870 transactions. The comparison of the GLADE implementation with Mahout can be found in Figure 5-7. In Mahout, the whole computation was divided into five MapReduce tasks, of which the most important was the Parallel FPGrowth task, which was responsible for generating the required patterns and growing the FPTrees. The Mahout implementation then mined each of the FPTrees thus generated and produced a local list of top-k items. These items were compiled at a later stage. Since our algorithm focused on

5 http://fimi.ua.ac.be/data/

generating a final FP-Tree, the tasks after the Parallel FPGrowth task were ignored for the comparison, although the GLA version has the extra task of combining the separate trees into a single one. Experimental results showed a speedup of 17x over the Mahout implementation. The time for the computationally intensive part, as well as the total time including preprocessing, is shown in the graph. Eight mappers and eight reducers were used, and the number of GLAs used was also 8. We decreased the block size from a default of 30MB to 1MB, which should amount to around 30 map tasks and 30 reduce tasks for each MapReduce job, as the input size is close to 30MB.

Figure 5-7. FPGrowth GLADE vs FPGrowth Mahout [log scale]

CHAPTER 6
DISTRIBUTED STATE

The state that a GLA handles may grow very large. In such cases, it is difficult to manage the state, and there is a performance overhead as well. In this chapter, we propose a new mechanism for managing large states in GLAs, known as the distributed state, in contrast to the already established replicated state. In the replicated state, the GLAs are combined pairwise until a single GLA remains. Therefore, for large states, the states get even larger as aggregation proceeds. Such a technique is fast but loses some of its efficiency when the state becomes large and has to be moved around, and the movement of data becomes more difficult as we go up the aggregation tree. A successful substitute should be able to keep its state from getting too large and should also manage the transmission of the states for aggregation efficiently.

The distributed state does not use an aggregation tree for combining the GLAs. It maintains a global state, into which all GLAs combine. The global state ensures that the amount of data that has to be moved is less than in an aggregation tree. However, a locking mechanism is required for access to the global state, which might degrade performance. Therefore, we partition the global state and lock each of the partitions separately. The local states maintain the same partitions, and each local partition combines with its corresponding global partition. We use a hash based dividing scheme, as shown in Section 6.1. This scheme reduces contention on the locking mechanism while still enabling better management of states.

However, there is still no guarantee as to how large the local states will grow for a particular GLA. For this we incorporated a flushing mechanism. As soon as a new chunk arrives for a GLA, it checks whether any of its partitions exceeds a predetermined size. If one does, the GLA combines that partition with the corresponding global partition then and there and carries on with the computation. If the right flush size is specified, this mechanism would make

sure that local states never grow uncontrollably large and that only data of the flush size is transported through the system.

6.1 FP-Growth Distributed

The division of the global state is done by partitioning the F-List, which acts as the key to the hash. Each division of the F-List corresponds to a bucket of the hash that grows an FPTree. An instance of the same hash is passed to all the GLAs as local state. Whenever a pattern is produced while scanning a tuple, its corresponding partition is found in the local hash and the pattern is added to it. In this fashion the GLAs build trees on each partition. Aggregation occurs during flushing as well as when a GLA completes its computation. The algorithm for producing the FPTree itself is unchanged in the distributed implementation.

Figure 6-1. The Distributed State

6.2 Empirical Evaluation

The distributed state was evaluated on the same 16 core AMD Opteron(TM) machine as mentioned earlier. The mining was performed on the same 9GB Google ngrams dataset^1 as in Section 5.4. The comparison of the replicated state with the distributed state can be seen in Figure 6-3. The mining was for the most frequent letters in the 1-grams, so the maximum length of the F-List was limited to 26. The number of GLAs used was 16, and the variable used for the comparison was the

1 http://books.google.com/ngrams/datasets

chunk size. The number of locks was kept at a 1:1 ratio with the number of partitions. The distributed state achieved a speedup of almost 3x over the replicated state.

Figure 6-2. The Replicated State

Figure 6-3. Replicated vs Distributed State
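
The partition-and-flush mechanism of this chapter can be sketched as follows. This is a hedged Python illustration, not the GLADE implementation: all class, function and parameter names are hypothetical, a Counter per partition stands in for the per-partition FPTree, Python's built-in hash stands in for the F-List based hash, and threads stand in for GLA instances.

```python
import threading
from collections import Counter


class DistributedState:
    """Global state split into partitions, with one lock per partition."""

    def __init__(self, num_partitions):
        self.partitions = [Counter() for _ in range(num_partitions)]
        self.locks = [threading.Lock() for _ in range(num_partitions)]

    def partition_of(self, key):
        return hash(key) % len(self.partitions)

    def combine(self, index, local_partition):
        # Only the lock of the affected partition is taken.
        with self.locks[index]:
            self.partitions[index].update(local_partition)


class LocalState:
    """Per-GLA state that mirrors the partitions of the global state."""

    def __init__(self, global_state, flush_size):
        self.global_state = global_state
        self.flush_size = flush_size
        self.partitions = [Counter() for _ in global_state.partitions]

    def flush(self, force=False):
        for i, part in enumerate(self.partitions):
            if part and (force or len(part) > self.flush_size):
                self.global_state.combine(i, part)
                part.clear()

    def process_chunk(self, chunk):
        # On arrival of a new chunk, flush any partition that grew too large.
        self.flush()
        for key in chunk:
            self.partitions[self.global_state.partition_of(key)][key] += 1


def run(chunks, num_workers=4, num_partitions=4, flush_size=2):
    state = DistributedState(num_partitions)

    def worker(my_chunks):
        local = LocalState(state, flush_size)
        for chunk in my_chunks:
            local.process_chunk(chunk)
        local.flush(force=True)  # final aggregation when the GLA completes

    threads = [threading.Thread(target=worker, args=(chunks[i::num_workers],))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    total = Counter()
    for part in state.partitions:
        total.update(part)
    return total
```

With one lock per partition, two workers block each other only when they flush the same partition at the same time, which is the contention reduction that the 1:1 lock-to-partition ratio aims for.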

CHAPTER 7
RELATED WORK IN MAPREDUCE

Iterative processes cannot be encoded efficiently in the original MapReduce system. Iterations can be performed by using an individual MapReduce phase for each iteration and chaining the phases one after the other. However, in each iteration the map and reduce jobs have to be restarted. This wastes resources, as the same MapReduce tasks are used in successive iterations. Also, the input data have to be processed and shuffled in every iteration, which further affects performance. MapReduce also lacks a mechanism for checking the terminating condition efficiently, and it may require another MapReduce task to do that.

Different approaches have been proposed for facilitating iterative computation in MapReduce. HaLoop [17] introduced a caching solution that cached the loop-invariant data and also the output of the reduce tasks, but it did not address the inherent problem of restarting tasks. Twister [18] introduced a form of long running MapReduce tasks that were reused for the next iteration. However, the solution depended on the assumption that the input dataset and the intermediate data fit into the distributed memory of the cluster, which is not true in many situations. iMapReduce [19] also provided an iterative model, but it assumed that the same key is used by both the mappers and the reducers. It also used a form of task reuse known as persistent tasks, but resources were still wasted, as the persistent tasks remained dormant in memory until a decision was taken on whether to reuse or terminate them. The implementation also used a form of caching in the reduce phase to allow termination. iHadoop [20] introduced an asynchronous implementation of MapReduce in which iterations executed asynchronously whenever their input data was available. The implementation still had no improvements in the reuse of tasks.

All the implementations mentioned above suffer from performance loss in carrying out iterative computations, either due to shortcomings in the proposed model itself or, mainly, due to the inherent shortcomings of the MapReduce framework. Iterative GLAs in GLADE can be encoded easily, as shown by the algorithms already presented. The overhead is also minimal, as it only involves cleaning up some memory and propagating a message down the graph. We reuse the same waypoint and the generated code for iterative computations, and this is not wasteful, as no task remains idle or dormant due to task reuse. Also, the reinitiation of data for a new iteration is orthogonal to the Datapath framework, and since the framework is push based and data centric, no overhead is incurred for this either. The termination condition is also encoded in a natural format, as it can be specified in the GLA itself. Therefore, the iterative mechanism in GLADE is tightly coupled to the mechanisms already present in Datapath as well as to the format of the GLAs.
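
How a termination condition can live inside the aggregate itself can be sketched with a toy example. The following Python sketch is an illustration only: MedianGLA is a hypothetical aggregate that does not appear in this thesis, the ConstantState fields and the run_iterative driver are assumed names, and the sequential loop over chunks stands in for the parallel execution that GLADE provides. Each pass counts the items at or below a pivot, and iterate() narrows the range held in the constant state until it is smaller than a tolerance.

```python
class ConstantState:
    """State carried from one iteration to the next."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi


class MedianGLA:
    """Counts the items at or below the midpoint of the current range."""

    def __init__(self, const):
        self.pivot = (const.lo + const.hi) / 2
        self.below = 0
        self.total = 0

    def add_item(self, x):
        self.total += 1
        if x <= self.pivot:
            self.below += 1

    def add_state(self, other):
        self.below += other.below
        self.total += other.total

    def iterate(self, const, tolerance):
        """Update the constant state; return True if another pass is needed."""
        if 2 * self.below >= self.total:
            const.hi = self.pivot  # at least half the items are <= pivot
        else:
            const.lo = self.pivot
        return const.hi - const.lo > tolerance


def run_iterative(chunks, const, tolerance):
    """Re-run the same aggregate over the same chunks until iterate() says stop."""
    passes = 0
    while True:
        glas = []
        for chunk in chunks:
            gla = MedianGLA(const)
            for x in chunk:
                gla.add_item(x)
            glas.append(gla)
        merged = glas[0]
        for gla in glas[1:]:
            merged.add_state(gla)
        passes += 1
        if not merged.iterate(const, tolerance):
            return const, passes
```

The driver knows nothing about medians: the decision to continue and the data needed by the next pass both come from the aggregate, which mirrors the role that Iterate() and the constant state play in the algorithms of Chapter 4.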

CHAPTER 8
CONCLUSION

This thesis introduced enhancements to the GLADE framework in the form of Iterative Computations and State Passing. Iterations enabled us to perform internal iterative computations, and State Passing allowed communication of non-relational data. With the help of these enhancements, we efficiently adapted several machine learning algorithms, such as FP-Growth, Apriori, Item Based Collaborative Filtering, Decision Tree Classification and KMeans Clustering, to the GLA based format. Experimental results showed up to 441x speedup for large datasets and 2332x speedup for small datasets when compared to their counterparts in Hadoop. We believe we have established in this thesis a natural method of encoding complex computations in the form of GLAs.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Commun. ACM, vol. 53, no. 1, pp. 72, 2010.
[2] F. Rusu and A. Dobra, "GLADE: a scalable framework for efficient analytics," Operating Systems Review, vol. 46, no. 1, pp. 12, 2012.
[3] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. L. Perez, "The DataPath system: a data-centric analytic processing engine for large data warehouses," in SIGMOD Conference, 2010, pp. 519.
[4] Apache, I. Drost, T. Dunning, J. Eastman, O. Gospodnetic, G. Ingersoll, J. Mannix, S. Owen, and K. Wettin, "Apache Mahout," 2010, http://mloss.org/software/view/144/
[5] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, "Fast discovery of association rules," in Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996, pp. 307.
[6] F. Bodon, "A trie-based apriori implementation for mining frequent item sequences," in Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ser. OSDM '05. New York, NY, USA: ACM, 2005, pp. 56. [Online]. Available: http://doi.acm.org/10.1145/1133905.1133913
[7] C. Borgelt. (2003) Efficient implementations of apriori and eclat. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.7817
[8] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in VLDB, 1994, pp. 487.
[9] T. Le, T. Nguyen, and T. C. Chung, "Bitapriori: An apriori-based frequent itemsets mining using bit streams," in Information Science and Applications (ICISA), 2010 International Conference on, 2010, pp. 16.
[10] J. Pei, J. Han, and R. Mao, "Closet: An efficient algorithm for mining frequent closed itemsets," in ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000, pp. 21.
[11] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent-pattern tree approach," Data Min. Knowl. Discov., vol. 8, no. 1, pp. 53, 2004.
[12] E. Ozkural and C. Aykanat, "A space optimization for fp-growth," in FIMI, 2004.
[13] B. Racz, "nonordfp: An fp-growth variation without rebuilding the fp-tree," in FIMI, 2004.

[14] I. Pramudiono and M. Kitsuregawa, "Parallel fp-growth on pc cluster," in PAKDD, 2003, pp. 467.
[15] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, "Pfp: parallel fp-growth for query recommendation," in RecSys '08: Proceedings of the 2008 ACM conference on Recommender systems, 2008, pp. 107.
[16] L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng, "Balanced parallel fp-growth with mapreduce," in Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference on, 2010, pp. 243-246.
[17] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "The HaLoop approach to large-scale iterative data analysis," 2012, pp. 169.
[18] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. hee Bae, J. Qiu, and G. Fox, "Twister: A runtime for iterative mapreduce," in The First International Workshop on MapReduce and its Applications, 2010.
[19] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "imapreduce: A distributed computing framework for iterative computation," in Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, ser. IPDPSW '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 1112. [Online]. Available: http://dx.doi.org/10.1109/IPDPS.2011.260
[20] E. Elnikety, T. Elsayed, and H. E. Ramadan, "ihadoop: Asynchronous iterations for mapreduce," in Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science, ser. CLOUDCOM '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 81. [Online]. Available: http://dx.doi.org/10.1109/CloudCom.2011.21

BIOGRAPHICAL SKETCH

Daley Alex was born in Kerala, India. He completed the major part of his schooling in Cochin and went on to earn his Bachelor's degree in Information Technology at Mahatma Gandhi University, Kerala, in the same city. Shortly thereafter he joined the Department of Computer and Information Science and Engineering at the University of Florida, Gainesville, to pursue his master's in Computer Engineering, from which he graduated in the Summer of 2012. His research interests are in Data Intensive Computing, Database Systems and Technology, and Parallel Computing.