
Record for a UF thesis. Title & abstract won't display until thesis is accessible after 2013-04-30.

Permanent Link: http://ufdc.ufl.edu/UFE0042915/00001

Material Information

Title: Record for a UF thesis. Title & abstract won't display until thesis is accessible after 2013-04-30.
Physical Description: Book
Language: English
Creator: RAVUNNIKUTTY, GIRISH
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2011

Subjects

Subjects / Keywords: Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Statement of Responsibility: by GIRISH RAVUNNIKUTTY.
Thesis: Thesis (M.S.)--University of Florida, 2011.
Local: Adviser: Ranka, Sanjay.
Electronic Access: INACCESSIBLE UNTIL 2013-04-30

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2011
System ID: UFE0042915:00001



Full Text

ENSEMBLE K-MEANS ON MODERN MANYCORE HARDWARE

By

GIRISH RAVUNNIKUTTY

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2011

© 2011 Girish Ravunnikutty

I dedicate this to my parents and my friends.

ACKNOWLEDGMENTS

I would like to express my gratitude to my advisor, Dr. Sanjay Ranka, for all the guidance and encouragement provided throughout my master's program. I also thank my thesis committee members, Dr. Alin Dobra and Dr. Jih-Kwon Peir, for their help and suggestions. I would like to thank Dr. Gordon Erlebacher and the IT support staff of Florida State University for providing access to the machines used for benchmarking. I thank my parents, Dr. C P Ravunnikutty and Dr. Geethadevi Ravunnikutty, and my brother Mahesh Ravunnikutty for their immense care and support that motivated me during my studies. Finally, I thank all of my friends who helped me with the thesis work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
2 PRELIMINARIES
   2.1 Clustering Problem
   2.2 Ensemble K-means Algorithm
   2.3 Index Based Ensemble K-means Algorithm
   2.4 Related Work
3 OPENCL DEVICE ARCHITECTURE
4 PARALLEL ENSEMBLE K-MEANS ALGORITHM
   4.1 Parallelization Issues
      4.1.1 Concurrency
      4.1.2 Memory Constraints
      4.1.3 Synchronization and Atomic Operations
      4.1.4 Data Transfer Between Host and Device
5 OPENCL IMPLEMENTATION
   5.1 Task Parallelism
   5.2 Data Parallelism
   5.3 Concatenated Parallelism
6 PARALLEL K-MEANS ALGORITHM USING KD-TREE
   6.1 Concatenated Parallelism Using KD-tree
      6.1.1 Direct Access
      6.1.2 Distributed Access
      6.1.3 Eliminating Memory Conflicts Using Replication
7 EMPIRICAL EVALUATION
   7.1 Datasets
   7.2 Benchmark Machines
   7.3 Comparison of Three Clustering Approaches on a 32-core CPU
   7.4 Comparison of Three Clustering Approaches on FERMI GPU
   7.5 Performance Comparison of Different Hardware
   7.6 Comparison of the Three KD-tree Algorithms
   7.7 KD-tree vs Direct Implementation of Concatenated Parallelism
   7.8 KD-tree vs Direct Implementation for Varying Dimensions
   7.9 Discussion
8 CONCLUSION

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table 4-1: Parallelization Methodologies

LIST OF FIGURES

Figure 3-1: CPU vs GPU Architecture
Figure 3-2: Coalesced Global Memory Access
Figure 3-3: Non-Coalesced Global Memory Access
Figure 3-4: Bank Conflicts in Local Memory
Figure 3-5: OpenCL Compute Device
Figure 5-1: Task Parallelism
Figure 5-2: Data Parallelism
Figure 5-3: Concatenated Parallelism
Figure 5-4: High Contention Local Memory Atomic Update
Figure 5-5: Reduced Contention Local Memory Atomic Update
Figure 6-1: KD-tree with Data Points and Mapped Centroids on Leaf Nodes
Figure 6-2: KD Tree: Direct Access
Figure 6-3: KD Tree: Distributed Mapping
Figure 7-1: Comparison of Three Clustering Approaches on a 32-core CPU
Figure 7-2: Comparison of Three Clustering Approaches on FERMI
Figure 7-3: Performance Comparison of Different Hardware
Figure 7-4: Comparison of the Three KD-tree Algorithms
Figure 7-5: KD-tree vs Direct Implementation of Concatenated Parallelism
Figure 7-6: KD-tree vs Direct Implementation for Varying Dimensions

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

ENSEMBLE K-MEANS ON MODERN MANYCORE HARDWARE

By Girish Ravunnikutty

May 2011
Chair: Sanjay Ranka
Major: Computer Engineering

Clustering involves partitioning a set of objects into subsets called clusters so that objects in the same cluster are similar according to some metric. Clustering is widely used in many fields like machine learning, data mining, pattern recognition, and bioinformatics. The K-means algorithm, which uses distance as the similarity metric, is the most popular algorithm for clustering. K-means works by choosing a set of K random objects (a model) from the dataset and performing distance computations against it. Ensemble K-means chooses multiple models with the goal of refinement and faster convergence. The focus of the thesis is to investigate different approaches to parallelizing ensemble based K-means algorithms on modern manycore hardware: GPUs and CPUs backed by the OpenCL software stack. The thesis also presents novel implementations that exploit the immense computing power of the hardware by minimizing the amount of data access.

CHAPTER 1
INTRODUCTION

A number of traditional machine learning and data mining algorithms generate multiple models using the same dataset. These models are generated by choosing different initializations that are produced randomly or using a systematic method. The models are then used to derive a single model or a committee of models. For example, in most clustering algorithms, multiple runs with different randomly generated initializations are processed, and the one with the minimum mean square error, the best number of clusters, or a combination of both may be chosen. Similarly, for classification algorithms, multiple models may be built, where each model predicts the class of a given input; a majority-based output is then used to predict the class of a new sample. In these scenarios, the key challenge is building the multiple models. The computational intensiveness of the algorithms makes this a significant challenge, especially for large datasets.

The thesis explores the use of multicore hardware for performing such computations to generate multiple models in an ensemble. We assume that the models in the ensemble are independently generated, although our methods can be easily extended to the case where a sequence of ensembles is generated, each consisting of multiple models. The latter is typical of genetic algorithms and other evolutionary approaches. There are two broad ways of using a multicore processor for this purpose:

1. Task parallelism: Generate each model separately on each of the cores of the processor. This limits the amount of communication between the cores, but has the potential for load imbalance if different models need different amounts of time.

2. Data parallelism: Use multiple (or all) cores to exploit the data parallelism present in many of these model generation algorithms. This has the advantage of achieving good balance, but may generate communication between the cores.

We show that both these approaches suffer from the limitation that they require multiple passes over the entire dataset. This requires extensive data traffic from the

global memory to the local memory of the cores. Given the two to three orders of magnitude performance difference between global and local memory, and the limited overall bandwidth of the bus, the additional overhead generated by multiple passes can deteriorate performance significantly. We present a novel approach called concatenated parallelism that effectively utilizes task as well as data parallelism to reduce the number of passes of data between local memory and global memory.

We demonstrate the effectiveness of our approach using K-means clustering. K-means clustering partitions data records into related groups based on a model of randomly chosen features, without prior knowledge of the group definitions. It is a classic example of unsupervised machine learning. Multiple models are typically generated by using different random initializations or by varying the number of clusters. For generation of a single model using the K-means algorithm, a random set of K records is chosen from the given records (data points). These K records form the initial cluster centroids. Each data point is compared against the K cluster centroids and is assigned to the closest centroid. In ensemble clustering, instead of a single prototype of K centroids, M models or prototypes, each of K centroids, are chosen. Thus data points are compared with K*M cluster centroids, and the best model or prototype is selected.

We consider two approaches for building the models. In a simple approach, in each iteration, all the centroids for each model are updated using the entire dataset. In a more complex approach, the dataset is stored as a KD-tree, where the leaf nodes correspond to subsets of records. A preprocessing step determines which centroids can be potential nearest neighbors for all the records corresponding to a given leaf node. This approach has the benefit of reducing the total number of comparisons needed to determine the nearest neighbor, especially when the number of attributes is small. We benchmark our code on multicore CPU and GPU hardware from multiple vendors. Our experimental results show that our approach has several important benefits:

1. It significantly reduces data access costs by limiting the number and amount of data transfers from the global memory to any of the cores.

2. It effectively utilizes the SIMT (Single Instruction Multiple Thread) nature of the underlying hardware by exploiting the data parallelism.

Additionally, we propose several approaches to minimize the update contention that is generated by the SIMT nature of each core. We reduce the global memory access latency by using local memory and memory coalescing. Our experiments also incorporate performance enhancing techniques like loop unrolling, reducing local memory bank conflicts, optimized register usage, and using constant memory. The key outcome is that the resultant code amortizes a larger number of computations per data access and is an order of magnitude faster than straightforward parallelization techniques. We believe that this Multiple Instruction Single Data (MISD) style of execution is extremely important for deriving performance on GPUs for a large class of data intensive and data mining operations. Although our techniques have been described in the context of ensemble clustering, they are quite general and should be applicable to a variety of ensemble based applications.

The thesis is organized as follows: Chapter 2 gives a background on clustering and ensemble clustering problems, and discusses in detail the K-means algorithm used for clustering. Chapter 3 describes the OpenCL device architecture, along with the various performance tweaks available when writing OpenCL code. Chapter 4 starts by describing the issues to be addressed when parallelizing any algorithm on OpenCL hardware, and then discusses how these issues are handled in the parallelization of the K-means algorithm. Chapter 5 presents three methodologies for parallelizing the K-means algorithm on OpenCL hardware, together with the asymptotic computation and data access complexities of each. Chapter 6 discusses an enhancement of the K-means algorithm using a KD-tree, and presents three methodologies for K-means clustering using a KD-tree. Chapter 7 presents the experimental results, comparing the performance of the methodologies described in

Chapter 5 and Chapter 6. Chapter 8 summarizes the thesis and discusses directions for future work.

CHAPTER 2
PRELIMINARIES

2.1 Clustering Problem

The clustering or cluster analysis problem involves finding groups of records from a huge collection such that each record in a group is more similar to the other records in the group than to the records in other groups. Clustering uses the distance between records. The distance is typically calculated as a linear combination of the distances along each of the many attributes. Thus, finding an appropriate distance measure becomes an important step in clustering. The distance measure decides the shape of the clusters, because two records can be close to one another along one dimension while far apart along another. We choose the Euclidean distance metric for our work. We also assume that all attributes or dimensions are equally weighted. Our algorithms and approaches can be suitably generalized to other distance measures.

Lloyd's algorithm, also known as the K-means algorithm, is one of the most popular algorithms used in clustering. The algorithm partitions N records into K clusters. In K-means, a random set of K records is chosen from the given records (data points). These K records form the initial cluster centroids. Each data point is compared against the K cluster centroids and is assigned to the closest centroid. Given a set of N records (n_1, n_2, ..., n_N), where each record is a d-dimensional vector, K-means clustering partitions the N records into K clusters S_1, S_2, ..., S_K, where K < N, so as to minimize the within-cluster sum of squared distances

    J = \sum_{j=1}^{K} \sum_{n_i \in C_j} \| n_i - \mu_j \|^2
where C_j is the jth cluster, a disjoint subset of the input patterns, and \mu_j is the mean of the records in C_j. The K-means algorithm works iteratively on a given set of K clusters. Each iteration consists of two steps:

1. Every data record is compared with the K centroids and is associated with the closest centroid, creating K clusters.

2. The new set of centroids is determined as the mean of the points in the clusters created in the previous step.

The algorithm repeats until the centroids do not change or until the error reaches a threshold value. The computational complexity of the algorithm is O(NKdI), where I is the number of iterations.

2.2 Ensemble K-means Algorithm

The ensemble K-means algorithm is an extension of the K-means algorithm with the goal of refining the results and providing faster convergence. Instead of a single model with K centroids, we choose M models of K cluster centroids each. Each iteration of K-means operates on K*M centroids. Each data point, compared with the centroids of each model, is assigned to the closest centroid in each of the M models. The computational complexity of the algorithm is O(NKdMI).

2.3 Index Based Ensemble K-means Algorithm

In the direct ensemble K-means clustering algorithm, each centroid in a model is compared with all the data points. This requires time proportional to the product of the number of models and the number of clusters. As an enhancement, we can use index based data structures to reduce the computation performed. The data structure can partition the data records so that centroids can identify a subset of records to which they are close, instead of scanning the whole set of data records. In the thesis, we use a KD-tree as the indexing data structure. A KD-tree is created for the data points, with each leaf node containing a set of data points. In a KD-tree, each internal node is an axis-orthogonal hyperplane which divides the set of points in the space into two hyperrectangles. Once the KD-tree creation is done, all the data points are in the leaf nodes. Each centroid in a model traverses the KD-tree and gets mapped to the data points in the leaf nodes.
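Before moving on, the direct (non-indexed) assignment step of Section 2.2 is worth writing out, since it is the O(NKdM) baseline that the index based variant prunes. The following is a minimal sequential sketch in C, assuming two-dimensional points, row-major float arrays, and hypothetical names (data, models, closest); it illustrates the loop structure, not the thesis's actual code.

    #include <float.h>

    /* One assignment pass of ensemble K-means: for each of N 2-D points,
       find the closest of K centroids in each of M models.
       data:    N*2 floats (x, y per point)
       models:  M*K*2 floats (x, y per centroid)
       closest: N*M ints; closest[i*M + m] = index of nearest centroid */
    void assign_points(int N, int M, int K,
                       const float *data, const float *models, int *closest)
    {
        for (int i = 0; i < N; i++) {
            float px = data[2*i], py = data[2*i + 1];
            for (int m = 0; m < M; m++) {
                float best = FLT_MAX;
                int bestK = 0;
                for (int k = 0; k < K; k++) {
                    const float *c = models + 2*(m*K + k);
                    float dx = px - c[0], dy = py - c[1];
                    float dist = dx*dx + dy*dy;  /* squared Euclidean distance */
                    if (dist < best) { best = dist; bestK = k; }
                }
                closest[i*M + m] = bestK;
            }
        }
    }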

KD-tree based ensemble clustering is explained in detail in Chapter 6.

2.4 Related Work

The K-means clustering method has been implemented on GPUs before. Prior to the introduction of the NVIDIA CUDA framework, the clustering problem was implemented on GPUs by mapping it to the textures and shaders used for graphics computation [2, 8, 14]; these implementations use the OpenGL APIs. After the introduction of the NVIDIA CUDA framework, several GPU implementations of the K-means clustering problem appeared. Chang et al. [3, 4] computed pairwise Euclidean distance, Manhattan distance, and the Pearson correlation coefficient with speed-up over the CPU. This was extended to the first complete hierarchical clustering on the GPU [5]. Cao et al. [2] describe a GPU clustering method for datasets that are too large to fit in the GPU's global memory. In the CUDA clustering implementation by Che et al. [6], the CPU computes the new cluster centroids, which necessitates the transfer of data between CPU and GPU in each clustering iteration. Li et al. [10] describe a CUDA implementation of clustering that uses registers for low dimensional data and shared memory for higher dimensions. Fang et al. [7] propose a GPU based clustering approach using a bitmap to speed up the counting in each iteration; this approach does not scale well for ensemble clustering, as the bitmap increases the memory footprint. The algorithms described in the thesis do not require data transfers between successive clustering iterations. The implementations are done in OpenCL, which ensures portability to any heterogeneous hardware backed by an OpenCL device driver.

CHAPTER 3
OPENCL DEVICE ARCHITECTURE

In the thesis, we use GPUs and multicore CPUs as OpenCL devices. A GPU is a dedicated computing device for problems that have high arithmetic intensity, i.e., a high ratio of arithmetic operations to memory operations. Each of the GPU cores has a SIMT (Single Instruction Multiple Thread) architecture. The core of a GPU has a small cache and limited flow of control; both of these take up most of the transistors in a CPU, whereas in a GPU more transistors are devoted to computing cores. Figure 3-1 shows a high level architectural view of CPU and GPU. Parallel portions of an application are expressed as device kernels which run on many threads. GPU threads are extremely lightweight, and the creation overhead is very small. Scheduling is done by the hardware, unlike by the operating system on a CPU. A typical GPU needs hundreds of threads for full utilization of the hardware, whereas a CPU can be saturated with only a few threads.

ATI (owned by AMD) and NVIDIA are the leading vendors that provide GPUs with general purpose computing capabilities. NVIDIA provides the Compute Unified Device Architecture (CUDA) framework [11], with a parallel programming model and instruction set architecture, to leverage the capabilities of the massively parallel computing hardware in NVIDIA GPUs. The CUDA framework provides an extended C application programming interface and a runtime library which enable the access and control of devices from the host.

Figure 3-1. CPU vs GPU Architecture

Figure 3-2. Coalesced Global Memory Access

Similarly, AMD provides ATI Stream technology, which enables AMD CPUs and GPUs to accelerate applications. OpenCL (Open Computing Language) [9] is an open, royalty-free standard for writing parallel programs that execute across heterogeneous computing environments consisting of CPUs and GPUs. The OpenCL framework provides an ISO C99 based language for writing portable code that executes on heterogeneous platforms. NVIDIA provides OpenCL drivers for their GPUs, and AMD provides them for its CPUs and ATI GPUs.

An OpenCL device is identified as a collection of compute units. Each compute unit can contain many Processing Elements (PEs). An OpenCL program executes in two parts: kernels that execute on OpenCL devices, and a host program that executes on the host. An instance of a kernel executing on a processing element is called a work-item. Work-items are organized into work-groups. Work-items in a work-group execute the same code in SIMD fashion on all the processing elements of a compute unit. Each work-item in a work-group has a unique ID, and each work-group has a globally unique ID. In the NVIDIA CUDA framework, work-items are identified as threads and work-groups as blocks. The application running on the host uses OpenCL APIs to control the execution of kernels on devices. OpenCL provides a command queue interface to coordinate the execution of kernels on the devices from the host. The host places kernel execution, memory transfer, and synchronization commands into the command queue; these are then scheduled to execute on the devices.
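As an illustration of this host-side flow, the following is a minimal sketch in C against the OpenCL 1.1 API: it creates a context and command queue, builds a kernel from source, and enqueues it. Error handling is simplified, the kernel name run_kernel is hypothetical, and n_items is assumed to be a multiple of the work-group size.

    #include <CL/cl.h>

    /* Minimal OpenCL host-side setup: one platform, one device, one queue. */
    int run_once(const char *src, size_t n_items)
    {
        cl_platform_id plat;
        cl_device_id dev;
        cl_int err;

        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        /* Build the kernel from source and look it up by name. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "run_kernel", &err);

        /* Enqueue: n_items work-items in total, 256 per work-group. */
        size_t local = 256, global = n_items;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
        clFinish(q);   /* block until all queued commands complete */

        clReleaseKernel(k);
        clReleaseProgram(prog);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return (int)err;
    }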

Figure 3-3. Non-Coalesced Global Memory Access

Figure 3-4. Bank Conflicts in Local Memory

Figure 3-5. OpenCL Compute Device

Work-items executing an OpenCL kernel have access to four distinct memory hierarchies [12, 13]:

Global Memory: All the work-items in all work-groups have read and write access to this memory. The host can read and write global memory through memory commands placed on the command queue. In a GPU, global memory is implemented as DRAM, hence the access latency is high. In NVIDIA GPUs, peak read performance occurs when all the work-items access contiguous global memory locations. This is known as coalesced memory access and is shown in Figure 3-2; non-coalesced access is shown in Figure 3-3.

Constant Memory: A part of global memory that is read-only to all work-items in all work-groups. Some devices provide fast cached access to constant memory. The memory is allocated, and the data copied, by the host.

Local Memory: The memory region local to a work-group, shared by all the work-items in that work-group. OpenCL devices like NVIDIA and ATI GPUs provide dedicated local memory on the device, which is as fast as registers; on some other devices, local memory is mapped to global memory. In NVIDIA GPUs, local memory is referred to as shared memory. Shared memory is implemented as banks. When multiple work-items (threads) in a work-group (block) access the same bank, bank conflicts occur, which result in serialization of the accesses. Figure 3-4 shows bank conflicts in banks 0, 2, and 5. Local memory provides a very fast broadcast when all the work-items read the same location.

Private Memory: The lowest level in the memory hierarchy, which stores the local variables of a work-item. These are mostly hardware registers. The host cannot read or write local or private memories. OpenCL hardware dynamically partitions the register file among all the work-groups scheduled to run on a compute unit. Once assigned to a work-group, a register is not accessible by work-items in other work-groups. If a work-group overuses the registers, other work-groups will starve; this limits the number of work-groups that can run in parallel. Hence, overuse of local variables may reduce the number of work-groups simultaneously scheduled for processing. Moreover, if a work-group runs out of registers, the extra local variables are stored in global memory, which slows down execution. Local arrays in a kernel are stored in global memory, resulting in more cycles per access.

A high level view of the OpenCL device architecture is shown in Figure 3-5. OpenCL provides two levels of synchronization:

1. Synchronization of work-items in a work-group: This is a barrier synchronization which ensures that all the work-items execute the barrier before proceeding to execution beyond the barrier.

2. Synchronization between commands enqueued in a command queue: This ensures that all the commands queued before the barrier have finished execution, and that the resulting memory is visible to the commands after the barrier before they begin execution.
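The kernels in this thesis use the first level in a recurring idiom, together with coalesced global reads: work-items cooperatively stage a shared array into local memory, then wait at a barrier before using it. A minimal hedged sketch of that idiom follows; the buffer names are illustrative, and MODEL_SIZE is assumed to be defined at compile time (e.g., via a -D build option).

    // Hedged sketch: stage a model from global to local memory, then use it.
    // gModel, gData, and MODEL_SIZE are illustrative, not the thesis's code.
    __kernel void stage_and_use(__global const float2 *gModel,
                                __global const float2 *gData,
                                __local float2 *lModel)
    {
        size_t tx = get_local_id(0);
        size_t sz = get_local_size(0);

        // Coalesced, cooperative load: consecutive work-items read
        // consecutive global locations.
        for (size_t i = tx; i < MODEL_SIZE; i += sz)
            lModel[i] = gModel[i];

        // Level-1 synchronization: no work-item proceeds until the
        // whole model is resident in local memory.
        barrier(CLK_LOCAL_MEM_FENCE);

        float2 p = gData[get_global_id(0)];   // one coalesced data read
        // ... distance computations against lModel would go here ...
    }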

CHAPTER 4
PARALLEL ENSEMBLE K-MEANS ALGORITHM

As mentioned before, the K-means algorithm works in iterations. Each iteration depends on the results (new cluster centroids) from the previous iteration, hence the iterations themselves cannot be parallelized. What we can parallelize is the computation done within an iteration. Algorithm 1 shows the sequential K-means clustering method.

Algorithm 1 General Algorithm
1: dataPoints = read_data_points()
2: models = random(dataPoints)
3: for i = 0 to Number of Clustering iterations do
4:   centroidStats = compute_statistics(dataPoints, models)
5:   models = compute_new_models(centroidStats)
6: end for

The initial set of centroids is randomly chosen from the data points. The compute_statistics() function iterates over all the data points and finds the closest centroid in each model. compute_new_models() aggregates the statistics from the compute_statistics() function to find the new models for the next iteration. compute_statistics() is the main workhorse, as it does all the computation; in the thesis, we target this function for parallelization on an OpenCL device. We also implemented a kernel for the aggregation step so that we do not have to transfer data between the CPU and the device across clustering iterations.
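The aggregation step is simple enough to state directly. A hedged sequential sketch, assuming two-dimensional points and the count/sum statistics layout used later in Chapter 5 (one count and one coordinate sum per centroid per model):

    /* Sketch of compute_new_models(): each new centroid is the mean of the
       points assigned to it; centroids with no points keep their old value.
       count: M*K ints; sum: M*K (x, y) pairs; models: M*K*2 floats. */
    void compute_new_models(int M, int K,
                            const int *count, const float *sum, float *models)
    {
        for (int c = 0; c < M * K; c++) {
            if (count[c] == 0) continue;          /* empty cluster: unchanged */
            models[2*c]     = sum[2*c]     / count[c];
            models[2*c + 1] = sum[2*c + 1] / count[c];
        }
    }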

4.1 Parallelization Issues

For parallelizing an algorithm on an OpenCL device, we need to address the following issues:

4.1.1 Concurrency

The two main entities that can be partitioned in a K-means clustering problem are the data points and the models. Each model can be mapped to a problem instance which operates on the data. The OpenCL standard provides two levels of parallelism:

1. Parallelism among work-groups
2. Parallelism among work-items within a work-group

As shown in Table 4-1, we devised three parallelization methodologies based on how the data and the models use these two levels of parallelism. These methodologies are for the implementation of the compute_statistics() function; the compute_new_models() function remains the same for all three strategies.

Table 4-1. Parallelization Methodologies

    Methodology                Models    Data
    Task Parallelism           1         2
    Data Parallelism           -         1, 2
    Concatenated Parallelism   2         1, 2

4.1.2 Memory Constraints

As mentioned in Chapter 3, an OpenCL device provides a hierarchical memory structure. Before deciding in which hierarchy to place a data structure, we need to carefully identify the memory access pattern. Objects that are accessed over and over again incur a performance penalty if stored in global memory, while the fast local memory is limited in size. The options for storing the input (data records and models) are as follows:

- Keep both the models and the data points in global memory.
- Keep the data points in global memory and load the models into local memory. Each data point is read from global memory into a private register, and the computation is performed against all the models.
- Keep the models in global memory and load the data points into local memory. For each data point in local memory, perform the computations against the models.

We also need to keep the data structures that store the centroid statistics after each computation. The options for storing the statistics data structures are:

- Update the statistics in global memory.
- Update the statistics in local memory, and after partial execution update the results to global memory.

4.1.3 Synchronization and Atomic Operations

Our implementation of clustering has two OpenCL kernels. Even though kernel calls are asynchronous with respect to the host, the kernel invocations in a particular stream are serialized, so there is no need for explicit synchronization between the two kernel invocations. We need to identify synchronization points within a work-group and use barrier synchronization for synchronizing the work-items. In the parallel implementation of the K-means algorithm, the work-items compute the distances in parallel. This would create memory inconsistency, because many work-items can concurrently update the statistics of the centroids; we use the atomic operations provided by the OpenCL standard to achieve memory consistency. Excessive use of atomic operations can slow down execution. Since we have two options for storing the statistics data structures, we can have atomic operations in:

- Global memory: This requires a large number of cycles.
- Local memory: This is faster than global memory atomic updates. However, it can result in contention when many work-items in a work-group update the same location in local memory; a sketch of this local-then-global update pattern follows the list.
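The following hedged OpenCL sketch illustrates the second option: it accumulates a per-work-group histogram of closest-centroid counts in local memory and flushes it to global memory once per work-group. The names are illustrative, and on OpenCL 1.0-class devices the atomic calls would require the cl_khr_*_int32_base_atomics extensions.

    // Illustrative only: per-work-group accumulation in local memory,
    // then one global atomic per centroid instead of one per data point.
    __kernel void count_closest(__global const int *gClosest, // per-point result
                                __global int *gCount,         // K global counters
                                __local  int *lCount,         // K local counters
                                const int K, const int N)
    {
        size_t tx  = get_local_id(0);
        size_t gid = get_global_id(0);

        for (size_t k = tx; k < (size_t)K; k += get_local_size(0))
            lCount[k] = 0;                       // zero the local histogram
        barrier(CLK_LOCAL_MEM_FENCE);

        if (gid < (size_t)N)
            atomic_inc(&lCount[gClosest[gid]]);  // contended only within the group
        barrier(CLK_LOCAL_MEM_FENCE);

        for (size_t k = tx; k < (size_t)K; k += get_local_size(0))
            if (lCount[k] != 0)
                atomic_add(&gCount[k], lCount[k]); // one global update per group
    }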

4.1.4 Data Transfer Between Host and Device

Before launching a kernel, all the required data has to be copied to device global memory. After execution, the results are copied back to the host. If there are multiple kernels, there can be frequent transfers of data between host and device, which reduces performance.

CHAPTER 5
OPENCL IMPLEMENTATION

In this chapter, we present the design and implementation of the three parallelization strategies listed in Table 4-1. For efficient indexing, we store the data points and models in linear arrays. The two arrays are copied to the OpenCL device global memory as part of a preprocessing step. To keep track of the statistics produced by the compute_statistics() kernel, we keep two additional arrays in global memory: one array tracks the number of closest points for each centroid in the models, and the second holds the sum of the points (Px, Py, Pz, etc.) closest to each centroid in the models. The compute_new_models() kernel uses these two arrays to compute the new set of models for use in the next iteration. There is no data transfer between the OpenCL device and the host during the clustering iterations.

For each methodology, we analyze the computational and data access complexities of one iteration. We use the following conventions for our analysis: the dataset has N records, where each record has dimension d, and we choose M models, each with K cluster centroids randomly chosen from the N records. We also discuss the algorithm and OpenCL code for each methodology. For simplicity of description, we present the OpenCL code for two-dimensional data points. All of our kernels use 256 work-items in a work-group; this is denoted by the macro WORK_ITEMS_PER_WORK_GROUP, abbreviated as WIPWG in the code listings.

5.1 Task Parallelism

In task parallelism, the tasks are partitioned among the compute units. Each model is mapped to a work-group. Each work-group reads the whole data record array from the device global memory and computes the closest centroid of each data point in its respective model. Figure 5-1 depicts task parallelism.

Within a work-group, the computation is partitioned among the work-items. Each work-item reads a chunk of the data records and finds the closest centroid in the model. In

Figure 5-1. Task Parallelism

this way, all the work-items in a work-group read the mapped model for computation over all the data points. Reading the model every time from global memory affects performance; this makes the model an ideal candidate for keeping in local memory. At the start of the kernel, the work-items load the model into local memory. All the work-items have to wait until the model is completely loaded in local memory, which requires the use of a barrier synchronization. Each work-group executes the algorithm depicted in Algorithm 2.

Algorithm 2 Task Parallelism
1: Load the model m corresponding to this work-group to local memory
2: barrier_synchronize()
3: for each data point p assigned to a work-item do
4:   statistics = find_closest_centroid(p, m)
5:   atom_update(statistics) in local memory
6: end for
7: barrier_synchronize()
8: write back statistics from local to global

We now discuss in detail the OpenCL implementation of Algorithm 2 for two-dimensional records. The work-items in a work-group load the points of a particular model from global memory to local memory. The work-group barrier synchronization function ensures that the required model is loaded before proceeding to the computation phase. We keep two additional linear arrays (sCount and sSum) in local memory to store the computational result. sCount[i] stores the number of data points closest to the ith

centroid in the model. sSum[i] stores the cumulative sum (Px, Py for two dimensions) of each data point closest to the ith centroid in the model. Listing 5-1 shows the distance computation phase of the kernel.

    // dData in global memory
    // sModels, sCount, sSum in local memory
    size_t tx = get_local_id(0);              // work-item id
    unsigned pointsPerWorkItem = NO_DATA / TPWG;
    for (unsigned t = 0; t < ...
The barrier synchronization after the computation loop ensures all the computation within this work-group is completed before we proceed to the writeback stage.

    unsigned index = bx * MODEL_SIZE + tx;
    if (tx < ...
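The writeback step itself (Algorithm 2, line 8) is fully described above: the first MODEL_SIZE work-items copy their centroid's statistics from local to global memory, and because each model is owned by exactly one work-group, plain stores suffice. A hedged sketch, with dCount and dSum as illustrative global buffer names rather than the thesis's exact listing:

    // Hedged sketch of the task-parallelism writeback (names illustrative).
    // Work-group bx owns model bx; work-item tx handles centroid tx.
    unsigned index = bx * MODEL_SIZE + tx;
    if (tx < MODEL_SIZE) {
        dCount[index] = sCount[tx];     // no atomics needed: single writer
        dSum[index].x = sSum[tx].x;
        dSum[index].y = sSum[tx].y;
    }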
5.2 Data Parallelism

Figure 5-2. Data Parallelism

As in task parallelism, the model array is accessed by all work-items, and hence is stored in local memory. All the work-groups in a kernel load the same model to local memory. The barrier synchronization step remains the same as in task parallelism. Each work-group executes the algorithm depicted in Algorithm 3.

Algorithm 3 Data Parallelism
1: for every model m do
2:   Load the model m to local memory
3:   barrier_synchronize()
4:   for each data point p assigned to a work-item do
5:     statistics = find_closest_centroid(p, m)
6:     atom_update(statistics) in local memory
7:   end for
8:   barrier_synchronize()
9:   atom_update(statistics) from local to global
10: end for

The OpenCL implementation of Algorithm 3 is as follows. Listing 5-3 shows the computational phase. Each work-group reads a chunk of the data array; the chunk is split among the work-items in the work-group, and the work-items access the data array in a coalesced pattern. The number of sub-chunks read by a work-item is controlled by the macro POINTS_PER_WORK_ITEM, abbreviated as PPWI in the code listing. PPWI controls the number of work-groups: when PPWI is larger, the number of work-groups

is smaller, and vice versa. The local memory access broadcast and the atomic updates of the sSum and sCount arrays remain the same as in the computation phase of task parallelism (Listing 5-1).

    // dData in global memory
    size_t bx = get_group_id(0);              // work-group id
    size_t tx = get_local_id(0);              // work-item id
    unsigned bIndex = WIPWG * PPWI * bx + tx;
    for (unsigned p = 0; p < ...
5.3 Concatenated Parallelism

Figure 5-3. Concatenated Parallelism

The task and data parallelism strategies read the dataset once per model; we devised concatenated parallelism with the primary aim of reducing this data access. In concatenated parallelism, data is partitioned across work-groups and work-items, and computation is partitioned across the work-items in a work-group. All the work-groups work on all the models. Concatenated parallelism is shown in Figure 5-3. Like data parallelism, each work-item within a work-group reads a sub-chunk of the data assigned to its work-group. Each data point from the sub-chunk is compared against all the models, and the closest centroid in each model is identified. Thus each work-item in a work-group accesses all the models for each data point; keeping all the models in local memory ensures fast access by all the threads. Each work-group executes the algorithm shown in Algorithm 4.

Algorithm 4 Concatenated Parallelism
1: Load all the models to local memory
2: barrier_synchronize()
3: for each data point p assigned to a work-item do
4:   for each model m in local memory do
5:     statistics = find_closest_centroid(p, m)
6:     atom_update(statistics) in local memory
7:   end for
8: end for
9: barrier_synchronize()
10: atom_update(statistics) from local to global

Listing 5-5 shows the OpenCL implementation of the computational phase of concatenated parallelism. The sCount and sSum arrays now store the count and sum, respectively, for all

Figure 5-4. High Contention Local Memory Atomic Update

Figure 5-5. Reduced Contention Local Memory Atomic Update

the models. Like data parallelism, the work-items read sub-chunks of data in a coalesced pattern, and the local memory stores all the models. A local memory access pattern like that of task and data parallelism results in each work-item starting its computation with the centroids of the first model. This gives a fast broadcast during the read, but during the atomic update of sCount and sSum all the work-items update the locations corresponding to the same model. This results in high contention, causing serialized access, which in turn degrades performance. The high contention atomic update is shown in Figure 5-4: the work-items (W0, W1, W2, W3, W4) in the work-group update the same model in a particular time frame.

We devised a reduced contention methodology using a distributed access pattern. In the distributed access pattern, the ith work-item in a work-group starts its computation with the model having index i mod NO_MODELS; in this way, the ith work-item starts by updating the locations corresponding to model i mod NO_MODELS in the sCount and sSum arrays. This reduces the contention due to conflicting accesses.
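A hedged sketch of the distributed access pattern inside the per-point loop follows; the names echo the listings above, the rotation arithmetic is illustrative rather than the thesis's exact code, and closest_centroid is a hypothetical helper.

    // Illustrative: rotate the model visiting order per work-item so that,
    // in any given step, different work-items touch different models.
    for (unsigned j = 0; j < NO_MODELS; j++) {
        unsigned m = (tx + j) % NO_MODELS;   // work-item tx starts at model tx % NO_MODELS
        unsigned k = closest_centroid(p, &sModels[m * MODEL_SIZE]); // hypothetical helper
        atomic_inc(&sCount[m * MODEL_SIZE + k]);  // contention spread across models
        // sSum updates for the chosen centroid would follow the same indexing.
    }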

One drawback of this method is that we cannot utilize the broadcast feature provided by the local memory hardware. Our experiments show that contention can be a major bottleneck compared to the performance improvement from broadcast. The reduced contention atomic update is shown in Figure 5-5: work-items access different models in any given time frame.

    // dData in global memory
    // sModels, sCount, sSum in local (shared) memory
    size_t bx = get_group_id(0);              // work-group id
    size_t tx = get_local_id(0);              // work-item id
    unsigned bIndex = TPWG * PPWI * bx + tx;
    for (unsigned p = 0; p < ...
    for (unsigned i = 0; i < ...) {
        ...
        if (cIndex >= NO_CENT) break;
        atom_add(dCount + cIndex, sCount[cIndex]);
        atom_add(&(dSum[cIndex].x), sSum[cIndex].x);
        atom_add(&(dSum[cIndex].y), sSum[cIndex].y);
    }

    Listing 5-6. Writeback phase of concatenated parallelism

One of the major improvements in concatenated parallelism is the number of times the data is read: the data is read only once in each clustering iteration, so the data access complexity is O(N), unlike task and data parallelism, where the complexity is O(MN). The computation remains the same as in task and data parallelism. Concatenated parallelism does not involve the multiple kernel invocation overhead of data parallelism, and the overhead due to atomic updates in global memory is much less than the time spent reading the data multiple times as in task parallelism.

CHAPTER 6
PARALLEL K-MEANS ALGORITHM USING KD-TREE

The algorithm described in Chapter 4 requires a distance computation between each data point and all the models. As introduced in Section 2.3, we use a KD-tree to reduce this computation by ordering the data points. A KD-tree is created from the data points; the creation of a KD-tree for multidimensional data is described in detail in [1]. The KD-tree creation is done only once in the whole clustering process. Once the KD-tree is created, the centroids in each model traverse the KD-tree and get mapped to the data points in the leaf nodes. Algorithm 5 shows the modified version of Algorithm 1, using the KD-tree for partitioning the data points. Figure 6-1 shows the KD-tree with data points and the centroids mapped from the models.

The creation and traversal of the KD-tree are performed by the functions create_kd_tree() and create_leaf_node_to_centroid_mapping(), respectively. These two steps are done on the CPU, because the irregular structure of the KD-tree is not well suited to the SIMT architecture of the OpenCL device. Once the KD-tree is created and the mapping of centroids to data points is built, the mapping information is copied to the device. After computation, the new centroids are copied to the host machine to create the mapping information for the next iteration. The copying of the mapping information to the device, and of the new centroids from device to host, is performed for each clustering iteration and creates additional overhead.

Figure 6-1. KD-tree with Data Points and Mapped Centroids on Leaf Nodes

Algorithm 5 Pruning using KD-tree
1: dataPoints = read_data_points()
2: models = random(dataPoints)
3: kd_tree = create_kd_tree(dataPoints)
4: for i = 0 to Number of Clustering iterations do
5:   mapping = create_leaf_node_to_centroid_mapping(kd_tree, models)
6:   copy_to_device(mapping)
7:   centroidStats = compute_statistics(mapping)
8:   models = compute_new_models(centroidStats)
9:   copy_from_device(models)
10: end for

Similar to the parallel implementations described in Chapter 5, we can have task, data, and concatenated parallelism for the KD-tree based implementation as well.

1. Task Parallelism: Each device work-group performs the computations for one model. The work-items in a work-group read the data points, and the computation is performed for the mapped centroids of this model. The centroids in different models get mapped to the same data points, hence each work-group must read the entire data.

2. Data Parallelism: Each kernel invocation performs the computation for one model. The data points are partitioned among the work-groups, i.e., each leaf node of the KD-tree gets mapped to a work-group. The work-items in a work-group read the data points, and the computation is performed for the mapped centroids of the model. As in task parallelism, the data is read multiple times.

3. Concatenated Parallelism: As in data parallelism, each leaf node of the KD-tree is mapped to a work-group. The work-items in a work-group read the data points and perform the computations for the mapped centroids of all the models. In this approach, the data is read only once, and there is no overhead of multiple kernel calls for each model as in data parallelism.

Let α be the pruning factor, the fraction to which the computation is reduced after pruning by traversing the KD-tree. We implemented three different approaches for the concatenated parallelism discussed in Section 5.3. For concatenated parallelism:

    Data access time = O(N)
    Computation time = O(MKNdα)
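For concreteness, here is a hedged C sketch of the leaf-to-centroid mapping that Algorithm 5 ships to the device each iteration. The struct layout and field names are assumptions, not the thesis's actual data structures, but they capture the information Figure 6-1 depicts: each leaf holds a contiguous range of reordered points plus the list of candidate centroids mapped to it.

    /* Hypothetical host-side layout of the KD-tree leaf mapping. */
    typedef struct {
        int firstPoint;     /* index of the leaf's first point in the reordered data array */
        int numPoints;      /* number of data points stored in this leaf */
        int firstCentroid;  /* offset into the candidate-centroid index array */
        int numCentroids;   /* candidate centroids (over all M models) mapped to this leaf */
    } LeafMapping;

    /* The device then needs only two flat arrays per iteration:
         LeafMapping leaves[NUM_LEAVES];
         int candidateCentroids[...];  // global centroid indices, grouped by leaf
       so one work-group can process leaf b by scanning points
       [firstPoint, firstPoint + numPoints) against its candidate list. */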

Figure 6-2. KD Tree: Direct Access

6.1 Concatenated Parallelism Using KD-tree

In this section, we discuss three different implementations of concatenated parallelism using a KD-tree. The implementations differ in their local/global memory access patterns.

6.1.1 Direct Access

Each leaf node is mapped to a device work-group. The work-group performs the computations for all the data points in this leaf node and the corresponding set of mapped centroids from the models. Figure 6-2 shows the mapping from leaf nodes to work-items. The centroid data structures (the sum and count arrays) are stored in local memory. As in normal K-means, we use the atomic update functions provided by the device hardware to achieve memory consistency when work-items in a work-group update the same location. If the number of centroids mapped to a particular leaf node of the KD-tree is small, all the work-items update (atomically) the same locations of the sum and count arrays in local memory. This increases the contention and results in serialized access, degrading performance. The next two approaches describe methods to reduce the contention.

Figure 6-3. KD Tree: Distributed Mapping

6.1.2 Distributed Access

In this approach, we reduce the local memory update contention by using a distributed access pattern. Instead of mapping a leaf node to a work-group, the work-group spans many leaf nodes. In this way, the work-items in a work-group update different locations in local memory, which reduces the contention in updating the local memory. Figure 6-3 shows the mapping from leaf nodes to work-items. One drawback of this approach is that the number of global memory accesses for data increases, because the computation in a leaf node is performed by multiple work-groups.

6.1.3 Eliminating Memory Conflicts Using Replication

In this approach, we replicate the statistics data structures for each work-group and work-item in the device global memory. Like the direct mapped approach, each leaf in the KD-tree is mapped to a work-group. Each work-item operates on its own memory locations, so there are no global memory conflicts. After computation, another device kernel aggregates the results from all the replicated models. One drawback of this approach is that replication increases the memory footprint.
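A hedged sketch of the aggregation kernel this last approach needs; one replicated copy of the statistics per work-group is assumed here, and the thesis's replication granularity and buffer names may differ.

    // Illustrative: fold R replicated statistics copies into copy 0.
    // dCountRep: R * NO_CENT ints; dSumRep: R * NO_CENT float2s (copy-major).
    __kernel void aggregate_replicas(__global int *dCountRep,
                                     __global float2 *dSumRep,
                                     const int R, const int NO_CENT)
    {
        size_t c = get_global_id(0);          // one work-item per centroid slot
        if (c >= (size_t)NO_CENT) return;

        int count = 0;
        float2 sum = (float2)(0.0f, 0.0f);
        for (int r = 0; r < R; r++) {         // no atomics: each slot has one reader
            count += dCountRep[r * NO_CENT + c];
            sum   += dSumRep[r * NO_CENT + c];
        }
        dCountRep[c] = count;                 // result left in replica 0
        dSumRep[c]   = sum;
    }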

CHAPTER 7
EMPIRICAL EVALUATION

We analyze the three clustering approaches: Task Parallelism (TP), Data Parallelism (DP), and Concatenated Parallelism (CP). We determine the best parallelization strategy on multicore hardware, and we also study the performance impact of using index based data structures.

7.1 Datasets

We used a randomly generated dataset of one million points, with data points of varying dimensions. The initial centroids for the models are chosen randomly from the dataset. We fix the number of cluster centroids at 10, and the number of clustering iterations is fixed at 10.

7.2 Benchmark Machines

We conducted our experiments on three different hardware configurations. All the machines run a 64-bit Linux operating system.

1. A machine with an NVIDIA FERMI generation GeForce GTX 480 GPU clocked at 1.4 GHz, and a four-core Intel Xeon processor clocked at 2.8 GHz. The GPU has 1.6 GB global memory and 48 KB local memory. It has 15 compute units, and each compute unit has 32 processing elements, for a total of 480 streaming processor cores.

2. A machine with an ATI FirePro V7800 GPU, and a four-core Intel Xeon processor clocked at 2.8 GHz. The GPU has 1 GB global memory and 32 KB local memory. It has 18 compute units, each with 80 processing elements, for a total of 1440 streaming processor cores.

3. A machine with 8 quad-core AMD Opteron processors clocked at 3.6 GHz.

7.3 Comparison of Three Clustering Approaches on a 32-core CPU

Figure 7-1 shows the comparison of TP, DP, and CP on the 32-core CPU machine. We used two-dimensional data and study the results while varying the number of models. The graph clearly shows that CP gives much better performance than DP and TP. DP is faster than TP initially because in TP, when there are few models, the number of work-groups is small; hence not all the compute units of the hardware are used

Figure 7-1. Comparison of Three Clustering Approaches on a 32-core CPU

effectively. As the number of models increases, the number of work-groups increases, utilizing all the compute units of the hardware. In DP, as the number of models increases, so does the number of kernel calls; this causes additional overhead that reduces the performance.

7.4 Comparison of Three Clustering Approaches on FERMI GPU

Figure 7-2 shows the comparison of TP, DP, and CP on the NVIDIA FERMI GPU. The dataset and model configuration remain the same as in Section 7.3. The results are similar to the 32-core CPU, with CP giving the best performance and TP performing better than DP as the number of models increases.

7.5 Performance Comparison of Different Hardware

From Sections 7.3 and 7.4, we conclude that concatenated parallelism gives the best performance on manycore hardware. We now determine which of the three machines gives the best performance for concatenated parallelism. Figure 7-3 shows the comparison of CP on the 32-core CPU, the NVIDIA FERMI GPU, and the ATI GPU. The GPUs provide close to a 10X performance improvement over the 32-core CPU. The ATI and FERMI GPUs give comparable performance: even though ATI has more execution cores,

Figure 7-2. Comparison of Three Clustering Approaches on FERMI

Figure 7-3. Performance Comparison of Different Hardware

FERMI has a wider memory bus. In the rest of the experiments, we use the FERMI GPU as the OpenCL hardware.

7.6 Comparison of the Three KD-tree Algorithms

Figure 7-4 shows the comparison of the three KD-tree implementations of concatenated parallelism described in Section 6.1. The distributed mapped implementation gives the best performance due to reduced contention. The implementation with no

Figure 7-4. Comparison of the Three KD-tree Algorithms

conflicts is slower than the distributed method due to a larger number of global memory accesses.

7.7 KD-tree vs Direct Implementation of Concatenated Parallelism

In this section, we compare the best KD-tree implementation with the direct implementation of concatenated parallelism. From Section 7.6, we conclude that the distributed mapped implementation of the KD-tree gives the best performance. As shown in Figure 7-5, the KD-tree implementation gives an average of 2X performance improvement over the direct implementation.

7.8 KD-tree vs Direct Implementation for Varying Dimensions

In this section, we compare the performance of the KD-tree and direct implementations as the dimensionality of the data changes. The number of models is fixed at 10. The results are shown in Figure 7-6: the performance of the KD-tree implementation decreases as the dimensionality of the data increases. This is because, as the dimensionality of the data increases, the pruning obtained from the KD-tree decreases; moreover, there is additional overhead from the KD-tree traversal and the data transfer between the host and the device.

Figure 7-5. KD-tree vs Direct Implementation of Concatenated Parallelism

Figure 7-6. KD-tree vs Direct Implementation for Varying Dimensions

7.9 Discussion

We conclude that concatenated parallelism gives better performance for data intensive applications. Our results show that the memory access pattern is one of the key factors that determine the overall performance of any algorithm running on manycore hardware. Our description assumes that all the models have an equal number of

centroids, but with a small change in the algorithm we would be able to support models with different numbers of centroids as well.

CHAPTER 8
CONCLUSION

As processors evolve, memory traffic is becoming a major bottleneck. In the thesis, we presented an approach for reducing the memory traffic of ensemble clustering. Our experimental results clearly show that reducing memory traffic can lead to significant performance gains. Using complex data structures to reduce the total amount of computation has benefits as well, but the overall reduction in time is much smaller than the advantage concatenated parallelism gains from its substantial reduction in traffic. We believe that the concatenated parallelism approach presented in this thesis is relatively general and can be used to develop a framework for automatic parallelization of ensemble computing applications.

REFERENCES

[1] K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. First Workshop on High Performance Data Mining, March 1998.

[2] F. Cao, A. K. Tung, and A. Zhou. Scalable clustering using graphics processors. WAIM 2006, LNCS, vol. 4016. Springer, 2006.

[3] D. Chang, A. H. Desoky, M. Ouyang, and E. C. Rouchka. Compute pairwise Manhattan distance and Pearson correlation coefficient of data points with GPU. Proceedings of the 10th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2009.

[4] D. Chang, N. A. Jones, D. Li, M. Ouyang, and R. K. Ragade. Compute pairwise Euclidean distances of data points with GPUs. Proceedings of the IASTED International Symposium on Computational Biology and Bioinformatics, 2008.

[5] D. Chang, M. Kantardzic, and M. Ouyang. Hierarchical clustering with CUDA/GPU. Technical Report HKUST-CS08-07, 2008.

[6] S. Che, J. Meng, J. Sheaffer, and K. Skadron. A performance study of general purpose applications on graphics processors. First Workshop on General Purpose Processing on Graphics Processing Units, 2007.

[7] W. Fang, K. K. Lau, M. Lu, X. Xiao, C. K. Lam, P. Y. Yang, B. He, Q. Luo, P. V. Sander, and K. Yang. Parallel data mining on graphics processors. Technical Report HKUST-CS08-07, 2008.

[8] J. D. Hall and J. C. Hart. GPU acceleration of iterative clustering. ACM Workshop on General-Purpose Computing on Graphics Processors, 2004.

[9] Khronos Group. The OpenCL Specification, 1.1 edition, 2010.

[10] Y. Li, K. Zhao, X. Chu, and J. Liu. Speeding up k-means algorithm by GPUs. IEEE International Conference on Computer and Information Technology, 2010.

[11] NVIDIA. NVIDIA CUDA Reference Manual, 3.2 edition, 2010.

[12] NVIDIA. OpenCL Best Practices Guide, 2010.

[13] NVIDIA. OpenCL Programming Guide for the CUDA Architecture, 3.2 edition, 2010.

[14] S. A. Shalom, M. Dash, and M. Tue. Efficient k-means clustering using accelerated graphics processors. Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008.

BIOGRAPHICAL SKETCH

Girish Ravunnikutty received his Bachelor of Computer Engineering from the Department of Computer Science and Engineering, National Institute of Technology, Calicut, India in 2005. After graduation, he worked in the software industry for four years. He graduated with his Master of Science in computer engineering from the Department of Computer & Information Science and Engineering, University of Florida, Gainesville in 2011. He served as a research assistant in the department during his master's program. His research interests include high performance computing and autotuning for heterogeneous systems.