
Record for a UF thesis. Title & abstract won't display until thesis is accessible after 2013-04-30.

Permanent Link: http://ufdc.ufl.edu/UFE0042999/00001

Material Information

Title: Record for a UF thesis. Title & abstract won't display until thesis is accessible after 2013-04-30.
Physical Description: Book
Language: english
Creator: JOSEPH,REJITH G
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2011

Subjects

Subjects / Keywords: Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Statement of Responsibility: by REJITH G JOSEPH.
Thesis: Thesis (M.S.)--University of Florida, 2011.
Local: Adviser: Ranka, Sanjay.
Electronic Access: INACCESSIBLE UNTIL 2013-04-30

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2011
System ID: UFE0042999:00001

This item has the following downloads:


Full Text

PAGE 1

EFFICIENT IMPLEMENTATION OF PARTICLE IN CELL ALGORITHM ON GPUS

By

REJITH GEORGE JOSEPH

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2011

PAGE 2

© 2011 Rejith George Joseph

PAGE 3

To my parents, brother and my beloved

PAGE 4

ACKNOWLEDGMENTS

I have to acknowledge the effort Dr. Sanjay Ranka has put in to guide and encourage me to do this work, as well as the belief he has placed in me for doing this research. I would also like to acknowledge Girish Ravunnikutty for the help and effort he has put in to make this work a success. I would also like to thank Dr. Alin Dobra and Dr. Jih-Kwon Peir for teaching me to explore the world of multi-core machines and GPUs. I would also like to thank my parents and my brother, whose belief and encouragement made me finish this work. I would also like to dedicate this to Sarukutty, who was always there for me. Last but not the least, I would like to thank all the members who are and were part of the Y243 Organization for giving me a nice, homely atmosphere that kept the work in good spirits.

PAGE 5

TABLE OF CONTENTS

                                                                      page
ACKNOWLEDGMENTS ..................................................       4
LIST OF TABLES ....................................................       6
LIST OF FIGURES ...................................................       7
ABSTRACT ..........................................................       8

CHAPTER

1  INTRODUCTION ...................................................       9
2  GPU ARCHITECTURE ...............................................      14
3  PARALLELIZATION ISSUES .........................................      19
4  GPU PARALLELIZATION ............................................      21
   4.1  Partitioning Data Structures ..............................      23
   4.2  Mesh Decomposition ........................................      34
        4.2.1  Density Based Partitioning .........................      34
        4.2.2  Uniform Partitioning ...............................      36
        4.2.3  Non-Uniform Partitioning with Level 2 Grid .........      38
5  EMPIRICAL EVALUATION ...........................................      41
   5.1  Benchmark Machine .........................................      41
   5.2  Comparison of Different Particle Sorting Schemes ..........      41
   5.3  Choosing the Best Workload Per Thread .....................      43
   5.4  Comparison of Triangle Search Times .......................      43
6  CONCLUSION .....................................................      45

REFERENCES ........................................................      46
BIOGRAPHICAL SKETCH ...............................................      48

PAGE 6

LIST OF TABLES

Table                                                                 page
4-1  Determining GPU blocks .......................................      25
5-1  Comparison of particle sorting schemes .......................      42
5-2  Comparison of sorting time with variable workload per thread .      43
5-3  Triangle search time for non-uniform partitioning ............      44
5-4  Triangle search time for uniform partitioning ................      44

PAGE 7

LIST OF FIGURES

Figure                                                                page
1-1  Interaction between particle and mesh grid array .............      10
1-2  Terminologies used ...........................................      13
2-1  CPU vs GPU ...................................................      14
2-2  GPU architecture .............................................      16
2-3  Shared memory bank conflicts .................................      17
2-4  Coalesced memory access ......................................      18
2-5  Noncoalesced memory access ...................................      18
4-1  Shadow triangles and shadow vertices .........................      22
4-2  Vertices array ...............................................      23
4-3  Atomic updates in global memory ..............................      25
4-4  Atomic updates in shared memory ..............................      26
4-5  Mesh coloring and mapping to GPU blocks ......................      26
4-6  Region mapped to multiple GPU blocks .........................      30
4-7  Region mapped to single GPU block ............................      31
4-8  Comparison of blocking strategies ............................      33
4-9  Partitioning based on triangle density .......................      35
4-10 Partitioning using a rectangular mesh ........................      36
4-11 Uniform partitioning using a rectangular grid ................      37
4-12 Partitioning regions between shared and global memory for incremental sorting  39
4-13 Nonuniform partitioning using a rectangular grid .............      40

PAGE 8

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

EFFICIENT IMPLEMENTATION OF PARTICLE IN CELL ALGORITHM ON GPUS

By

Rejith George Joseph

May 2011

Chair: Dr. Sanjay Ranka
Major: Computer Engineering

The particle in cell (PIC) algorithm is a widely used method in plasma physics to study the trajectories of charged particles under electromagnetic fields. The PIC algorithm is computationally intensive, and its time requirements are proportional to the number of charged particles involved in the simulation. The focus of this thesis is to parallelize the PIC algorithm on the Graphics Processing Unit (GPU). We present several performance trade-offs related to small shared memory and atomic operations on the GPU to achieve high performance. We also discuss a better way to implement bucket sort on the GPU.

PAGE 9

CHAPTER 1
INTRODUCTION

The particle-in-cell algorithm is a method widely used to simulate plasmas and hydrodynamics. The movement of particles is computed by the interaction between each pair of particles in a self-consistent system. Instead of directly calculating the interaction of particles, the PIC (Particle in Cell) algorithm employs a regular computational mesh on the particle domain and assigns the particle attributes to nearby grid points of the mesh. The field equations are then solved on the mesh, and the force on each particle is obtained by gathering the attributes of nearby grid points from the resultant fields on the mesh. This force is then used to move the particle for the next step of the simulation.

The PIC algorithm has two different data structures, the particle array and the mesh grid array. In each iteration both arrays are updated based on the values of the other. The interaction between the two arrays can be modeled as a bipartite graph as depicted in Figure 1-1. The elements of each array form the nodes of the graph and the edges represent the communication between both arrays. The mesh grid array is typically spatially homogeneous. The particle array can be nonhomogeneous. The edges of the graph are dynamic and change with each simulation iteration. An efficient implementation of the PIC algorithm on distributed memory MIMD computers requires distributing these two data structures over the processors such that off-processor accesses are minimized. A good load balance in parallelization should also ensure that each data structure is nearly equally distributed among the processors. Further, the data access patterns between the two arrays may change dynamically (albeit incrementally) and therefore the particles may need to be redistributed frequently to reduce communication cost.

There are many implementations of the PIC algorithm on a CPU. The first fully toroidal edge PIC algorithm, XGC1, describes an OpenMP implementation of the particle in cell simulation using a triangular mesh [1]. Kraeva and Malyshkin use the Assembly

PAGE 10

Figure 1-1. Interaction between particle and mesh grid array

Technology [3] to develop a dynamic load balancing strategy for the parallel PIC algorithm [2]. Liao, Ou et al. describe a scalable domain partitioning strategy for particle and mesh arrays that reduces the communication overhead in a distributed computing framework [4]. Madduri, Williams et al. describe 13 different approaches to parallelize the PIC problem on multi-core processors [5]. This implementation limits the amount of particle movement in an iteration and also expects an initial ordering of particles. Stantchev, Dorland et al. describe a GPU implementation of the PIC algorithm using a rectangular mesh [6]. The algorithm uses binning, in which the cell clusters in a mesh are grouped as bins. During execution, each bin is mapped to a GPU block. The particles are kept in bin-sorted order with an exchange mechanism using shared memory and registers. This is based on the assumption that particles can move only to neighboring bins. Viktor K. Decyk, Tajendra V. Singh et al. developed a GPU global memory based approach for the PIC algorithm that is not efficient because the fast local memory is not used at all [9]. Our approach differs from existing implementations in the following ways:

PAGE 11

1. The XGC1 [1] code requires a triangular mesh for simulation on the GPU. Unlike a rectangular mesh, which is symmetric and can easily be partitioned and mapped to GPU parallel hardware, a triangular mesh is asymmetric. This makes the partitioning complex. A typical triangular mesh contains several hundred thousand triangles and millions of particles. We introduce hierarchical mesh partitioning strategies for effective mapping to the GPU hardware. The hierarchical partitioning can be symmetric or asymmetric.

2. The use of a triangular mesh requires an explicit search for an enclosing triangle, for every particle, to update the mesh. We describe several novel approaches for accelerating this search.

3. Our algorithms work for random ordering of particles without imposing any limit on the particle movement. To order particles and triangles, we describe a novel parallel hierarchical bucket sort algorithm. In this approach, the triangles are ordered only once whereas the particles are incrementally ordered after each iteration.

Our implementation addresses the limited memory bandwidth between the GPU and the host as well as the limited amount of shared memory available in each GPU vector core. We explore various mechanisms to reduce the performance penalty due to memory conflicts while using atomic operations. We also study the performance of the various levels of the memory hierarchy in the GPU (global, shared and registers) and the performance effects of atomic operations on the global and shared memories. Our experiments are based on the CUDA [8] framework provided by NVIDIA, though our algorithms can be ported to any GPU computing framework like OpenCL. Our GPU kernel algorithms leverage maximum utilization of the GPU shared memory. The GPU kernels are optimized to reduce the communication with the host machine. In this paper, we study and parallelize the computationally intensive portions of the PIC algorithm. The rest of the algorithm can easily be incorporated into our implementations.

Particle in Cell problem. The PIC algorithm using a triangular mesh follows the evolution of the particles in a series of time steps, each of which consists of the following five phases: (1) Search phase: Each particle identifies the triangle (cell) in which it belongs,

PAGE 12

(2) Scatter phase: Using a linear interpolation scheme, each particle scatters its contributions to the current mesh grid points at the vertices of the cell in which it lies. A particle makes no contributions to other grid points. The total particle contributions at each grid point are summed to form the current local density, (3) Field solve phase: Solve Maxwell's equations on the mesh to determine the electric (E) and magnetic (B) fields. Each grid point on the mesh needs data from its neighboring grid points to calculate new values for E and B, (4) Gather phase: The electric and magnetic fields at particles are obtained from their vertex grid points by a linear interpolation of the particles' relative positions in the cell. Particles sum these contributions from the vertex grid points to generate the force, and (5) Push phase: The force obtained from the gather phase moves particles to their new positions.

Figure 1-2 shows the semantics used for the PIC problem in the rest of the paper. We use the term mesh entity to refer to particles, triangles, and vertices in a triangular mesh.

PAGE 13

Figure 1-2. Terminologies used
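To make the five phases above concrete, the following is a minimal host-side sketch of the simulation loop. Every type and function name here is a hypothetical placeholder for illustration; this is not the thesis code, whose GPU kernels are developed in Chapter 4.

    // Illustrative placeholders: in practice these hold the triangular mesh,
    // the field values on its vertices, and the particle positions.
    struct Mesh      { };
    struct Particles { };

    // Stub phases; each maps to one of the five steps described above.
    void search_phase (Mesh &m, Particles &p) { /* find enclosing triangle per particle   */ }
    void scatter_phase(Mesh &m, Particles &p) { /* deposit contributions on cell vertices */ }
    void field_solve  (Mesh &m)               { /* solve Maxwell's equations on the mesh  */ }
    void gather_phase (Mesh &m, Particles &p) { /* interpolate E and B to particle sites  */ }
    void push_phase   (Particles &p)          { /* move particles under the gathered force */ }

    void simulate(Mesh &mesh, Particles &particles, int iterations) {
        for (int i = 0; i < iterations; ++i) {
            search_phase(mesh, particles);
            scatter_phase(mesh, particles);
            field_solve(mesh);
            gather_phase(mesh, particles);
            push_phase(particles);
        }
    }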

PAGE 14

CHAPTER 2
GPU ARCHITECTURE

The GPU acts as a co-processor device to the CPU (host). The GPU is a dedicated computing device to address problems that have high arithmetic intensity, i.e. a high ratio of arithmetic operations to memory operations. Each of the GPU cores has a SIMT (Single Instruction Multiple Thread) architecture. The core of a GPU has a small cache and limited flow of control; both of these take up most of the transistors in a CPU, whereas in a GPU more transistors are used for computing cores. Parallel portions of an

Figure 2-1. CPU vs GPU

application are expressed as device kernels which run on many threads. GPU threads are extremely lightweight and the creation overhead is very little. The scheduling is done by the hardware, unlike by the operating system on a CPU. A typical GPU needs hundreds of threads for full utilization of the hardware, whereas a CPU can be saturated with only a few threads.

NVIDIA™ introduced the Compute Unified Device Architecture (CUDA) framework with a new parallel programming model and ISA (Instruction Set Architecture) to leverage the capabilities of the massively parallel computing hardware in NVIDIA GPUs. The CUDA framework consists of:

- An extended C application programming interface

PAGE 15

- A runtime library which enables the access and control of devices from the host

A CUDA kernel is executed by an array of threads executing the same code in SIMD (Single Instruction Multiple Data) fashion. The thread array can be represented as a 1D (one dimensional), 2D or 3D array. Each thread has a unique ID that can be used to access memory and for making control decisions. Threads are grouped into blocks. To simplify accessing and addressing of multidimensional data, the threads are divided into monolithic blocks. Like threads, each block has a unique ID. Blocks can be 1D or 2D. Threads within a block co-operate via shared memory, atomic operations and barrier synchronization. Threads in different blocks cannot co-operate.

Blocks, shared memory and barrier synchronization form the three core abstractions which provide fine-grained data and thread parallelism. These abstractions enable the programmer to partition the problem into coarse sub-problems that can be solved independently by a thread block, and each sub-problem into finer parts that can be solved co-operatively in parallel using the threads in a block. The hardware automatically does the scheduling and load balancing. These abstractions also enable automatic scalability of the CUDA program. Since there is no synchronization between blocks, the hardware is free to assign blocks to any execution unit.

In a GPU, the processing cores are grouped into streaming multiprocessors (SMs). Each streaming multiprocessor has multiple scalar processors (SPs). The SM maintains thread IDs and block IDs and schedules thread execution. The warp is the basic scheduling unit in an SM. A warp is a group of 32 threads in a block. Warps are not part of the CUDA programming model and are invisible to a programmer. At any given time the SM schedules only one of the warps for execution. All threads in a warp operate on the same instruction during the kernel execution. Figure 2-2 shows a high level view of the GPU architecture.

Each SM has a shared memory that is visible to all threads in a block executing in the SM. Shared memory provides extremely fast read and write access. Shared memory

PAGE 16

Figure 2-2. GPU architecture

is implemented as multiple memory banks, and it is as fast as registers when there are no bank conflicts. Bank conflicts arise when multiple threads in a half warp access the same bank, which results in the serialization of access. Figure 2-3 depicts conflicts on banks 0, 1 and 2. Our algorithms use the access pattern that minimizes these bank conflicts.

Each SM has a register file that stores all the local variables of a kernel. The register file is dynamically partitioned across all the blocks executing in an SM. If a block uses more registers, it limits the registers available to other blocks. This reduces the number of blocks that can be concurrently executed in an SM and reduces performance. In our implementation, register usage is within limits so that an SM can execute the maximum number of blocks permitted by the hardware.

The GPU has its own DRAM called global memory. Global memory is the main means of communication between the host and the device. The contents of global memory are visible to all threads. Since it is implemented as DRAM, access latency is also high. When accessing global memory, peak performance occurs when all the

PAGE 17

Figure 2-3. Shared memory bank conflicts

threads in a half warp access contiguous memory locations. As shown in Figure 2-4, this type of access is called coalesced memory access. Noncoalesced access is shown in Figure 2-5. To improve performance, we use coalesced global memory accesses in our algorithms.

PAGE 18

Figure 2-4. Coalesced memory access

Figure 2-5. Noncoalesced memory access
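To make the coalescing rule concrete, here is a small illustrative CUDA fragment, not taken from the thesis code (the array names and the stride parameter are assumptions), contrasting an access pattern that coalesces with one that does not.

    // Consecutive threads of a half warp read consecutive addresses:
    // the hardware combines them into a few memory transactions.
    __global__ void coalesced_read(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // The same threads read addresses 'stride' elements apart:
    // the accesses fall in different segments and are serviced separately.
    __global__ void strided_read(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        long long j = (long long)i * stride;      // avoid 32-bit overflow in the index
        if (j < n)
            out[i] = in[j];
    }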

PAGE 19

CHAPTER 3
PARALLELIZATION ISSUES

Algorithm 1 Sequential Algorithm
1: for i = 0 to Number of Simulation iterations do
2:   compute_force()
3:   move_particles()
4: end for

As depicted in Algorithm 1, the simulation works in iterations. The compute_force() function encapsulates the scatter, field solve, and gather phases. The move_particles() function takes care of the particle push phase. Each iteration depends on the results (new particle positions) from the previous iteration. Hence the successive iterations cannot be parallelized. However, we can parallelize the computations done within an iteration.

In a parallel PIC algorithm, the distribution of particles and mesh grid points over processors should be made such that the overhead generated by the parallelization is carefully controlled to obtain a reasonable speedup. The PIC problem's parallelization strategies can be divided into two broad categories:

1. DIRECT EULERIAN METHOD: The particle domain is divided spatially into several non-overlapping regions (or sub-domains) and each processor is assigned one of these sub-domains. Particles are then assigned to processors according to their positions and may migrate between processors' sub-domains as the system evolves.

2. DIRECT LAGRANGIAN METHOD: After particles are assigned to processors, this assignment remains fixed throughout the simulation, i.e. particles will not move from one processor to another.

Instead of replicating the mesh grids in each processor, the other option is to partition the mesh array into rectangular submeshes and assign a different sub-mesh to each processor. In this case, the contributions of particles to their grid points are sent directly to the owner processors. The main issue is to decompose the two arrays to reduce the amount of off-processor data access. Efficient parallelization of each iteration of the PIC algorithm during the system evolution requires that:

PAGE 20

(1) The number of particles assigned to each processor be nearly equal, (2) The number of grid points assigned to each processor be nearly equal, and (3) The communication between the two data structures be minimized.

The first two conditions are required to maintain good processor utilization for the computations in each of the five phases mentioned in Chapter 1, while the third condition is required to reduce the communication overhead. Achieving all three conflicting goals together is difficult. We describe and analyze three potential ways to achieve the above goals:

1. MESH PARTITIONING: Partition the mesh triangles such that each processor gets an equal number of triangles. The particles are then assigned based on the grid cells to which they belong.

2. PARTICLE PARTITIONING: Partition the particle domain such that each sub-domain has an approximately equal number of particles. The grid cells are then assigned based on the corresponding sub-domain to which they belong.

3. INDEPENDENT PARTITIONING: Partition the particle domain such that each sub-domain has an approximately equal number of particles. Also, partition the grid cells such that each processor gets an equal number of grid cells.

PAGE 21

CHAPTER 4
GPU PARALLELIZATION

The direct Eulerian method results in local interprocessor communication, but places no constraints on the computational load balance. The direct Lagrangian method imposes strict load balance, but makes no attempt to reduce communication. As the mesh grid data distribution is spatially homogeneous, while the particle data can be nonhomogeneous, the decision to choose a proper strategy for particle movement among processors represents a trade-off between communication cost and computational load balance.

Scalable implementation of the PIC problem for a large number of processors requires load balancing in all phases. The computational load in each phase is so large that a load imbalance in any phase could become a bottleneck. For this reason, we believe that a scalable implementation of the PIC algorithm on a GPU would require using the direct Lagrangian method and an independent partitioning strategy that keeps the computational load, particles, and grid points strictly balanced. We devised methodologies to reduce the communication cost.

A triangular mesh requires an additional step to determine the enclosing triangle for each particle. To find this triangle efficiently, we need data structures that require additional memory. The type of partitioning used can have an indirect effect on the storage requirements and computational overhead for this computation.

The PIC problem's mesh decomposition is static. The particles move because of the forces acting on them, making their decomposition dynamic. Hence we do the decomposition and optimizations on the mesh once. However, for particles the decomposition may change from iteration to iteration. We partition the mesh into rectangular regions. Vertices and particles associated with triangles in a region are mapped to data structures on the GPU to facilitate parallel computation. In this way we can effectively use the smaller but faster shared memory.

PAGE 22

For a triangular mesh, the entities are particles and triangles. The vertices of a triangle store the forces from the contained particles. These force values are updated by the particles and, in turn, make the particles move. Particles, triangles, and vertices are separately stored as linear arrays for faster access. Some triangles in the unstructured mesh can cross region boundaries. We refer to such triangles as shadow triangles and the corresponding vertices as shadow vertices. Shadow triangles and vertices are depicted in Figure 4-1.

Figure 4-1. Shadow triangles and shadow vertices

Shadow triangles and vertices will be replicated in all the regions they belong to. Since the regions are executed in parallel, after each iteration we aggregate the forces on shadow vertices from each region.
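A minimal sketch of the shadow-vertex force aggregation just described is given below. The array names and the owner-index mapping are hypothetical, and the floating-point atomicAdd used here assumes a device of compute capability 2.0 or later; on the Tesla T10 targeted by the thesis, which exposes only integer atomics, the same reduction would have to use integer atomics or a per-region summation instead.

    // Fold each replicated shadow-vertex force into its owner vertex.
    // shadowForce[i]  : force accumulated on the i-th shadow copy in some region
    // ownerVertex[i]  : index of the original vertex that copy duplicates
    // vertexForce[v]  : canonical force array, indexed by vertex
    __global__ void aggregate_shadow_forces(const float *shadowForce,
                                            const int   *ownerVertex,
                                            float       *vertexForce,
                                            int numShadowCopies) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numShadowCopies)
            atomicAdd(&vertexForce[ownerVertex[i]], shadowForce[i]);  // one atomic per replica
    }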

PAGE 23

4.1 Partitioning Data Structures

As the mesh is partitioned into regions for mapping to GPU blocks, we need to sort the mesh entities (particles, triangles and vertices) in region order. Sorting enables the use of indexing to access the entities. We use linear arrays to store the entities. Each GPU block accesses the mapped region entities by index, i.e. the ith index in the array corresponds to the entities of region i. A detailed description of the data structures follows.

Vertices: Each vertex is a tuple of X, Y co-ordinates and the force. Vertices are stored in a one dimensional array based on the order of regions. The shadow vertices in each region are stored next to the vertices in the region. Figure 4-2 shows the data structure for vertices.

Figure 4-2. Vertices array

Triangles: Each triangle is a tuple of 3 vertices (V1, V2, V3) where V1, V2, V3 are indexes into the vertices array. The triangles are stored in the order of regions. Like shadow vertices, shadow triangles in a region are stored next to the triangles in the region. The data structure is similar to Figure 4-2.

Particles: Each particle is a tuple of X, Y co-ordinates and the triangle index. The triangle index is an index into the triangle array. This index stores the triangle in which the particle resides. Particles are stored in the order of regions.

The triangle array and vertices array are initialized only once in the pre-processing step and will not be changed during the simulation step, except while updating force in the vertex array. Particle movement changes the particle array during each simulation iteration. Hence the particle array needs to be re-sorted after each simulation iteration.
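The entity layout described above can be summarized by the following C-style sketch. The field names and types are assumptions for illustration, not the thesis data structures.

    // Hypothetical field names; 2-D coordinates plus the per-entity links described above.
    struct Vertex   { float x, y; float force; };   // force accumulated in the scatter phase
    struct Triangle { int v1, v2, v3; };            // indexes into the vertices array
    struct Particle { float x, y; int triangle; };  // index of the enclosing triangle

    // All three live as flat, region-sorted linear arrays in GPU global memory,
    // with each region's shadow entities stored immediately after its own entities.
    Vertex   *vertices;    // sorted by region, shadow vertices appended per region
    Triangle *triangles;   // sorted by region, shadow triangles appended per region
    Particle *particles;   // re-sorted into region order after every iteration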

PAGE 24

We map each step in the iteration of Algorithm 1 to a kernel that executes on the GPU. As mentioned before, a kernel is a group of GPU blocks that execute in parallel. The parallel algorithm is depicted in Algorithm 2.

Algorithm 2 Parallel Algorithm
1: sort_triangles()
2: sort_particles()
3: for i = 0 to Number of Simulation iterations do
4:   for all GPU blocks in parallel do
5:     for all threads in a block in parallel do
6:       compute_force()
7:       aggregate_force(shadowVertices)
8:       move_particles()
9:       incremental_sort(particles)
10:    end for
11:  end for
12: end for

The first two steps of Algorithm 2 sort triangles and particles respectively and create the data structures shown in Figure 4-2. Triangle sorting addresses the vertices' sorting as well. For sorting, we create GPU blocks based on the entities (particles or triangles) to be sorted and the number of threads within a GPU block. The number of GPU blocks will be equal to the number of entities divided by the number of threads in a GPU block. NVIDIA™ Tesla hardware limits the number of blocks in a dimension to 65535. If the number of blocks exceeds this value, we reduce it by increasing the workload of each thread, i.e., instead of working on a single entity, each thread in a block will be responsible for multiple entities. The number of blocks in this case will be the number of entities divided by the product of the number of threads in a block and the number of entities per thread. The GPU blocks used for our kernels are shown in Table 4-1.

Initially the mesh entities are randomly distributed. Hence the concurrent threads update all the data structures in the PIC problem. This necessitates the need for locks, and we use the atomic locks provided by the GPU hardware. The atomic locks are for

PAGE 25

Table 4-1. Determining GPU blocks

GPU kernel       Number of blocks                                       Load on GPU
sort_particles   N_PARTICLES / (THREADS_PER_BLOCK * LOAD_PER_THREAD)    Uniform
compute_force    N_SUB_REGIONS                                          Non-Uniform

integer operations and can operate on global and shared memories. Atomic operations in global memory have two major considerations: (1) High access latency for accessing global memory, and (2) The number of atomic locks can increase due to requests from all the running blocks, resulting in more serialized execution of the code.

Figure 4-3. Atomic updates in global memory

Atomic updates in global memory are depicted in Figure 4-3. We solve this by concentrating the atomic operations on shared memory. Threads in each block perform all atomic operations in the shared memory. After all the processing is completed for a particular block, we perform a single atomic update in the global memory as shown in Figure 4-4.

Another approach to reduce the number of atomic updates is to color the triangles. The triangles in the mesh are considered as nodes in a graph and are colored such that triangles with at least one common vertex are of different colors. A GPU kernel works on only a particular color at a time. As seen from Figure 4-5, kernel 1

PAGE 26

Figure 4-4. Atomic updates in shared memory

processes the triangles and associated particles with color 1, kernel 2 processes those with color 2, etc.

Figure 4-5. Mesh coloring and mapping to GPU blocks

Coloring eliminates the need for atomic updates as no two adjacent triangles are updated concurrently. This approach involves additional overhead of multiple kernel invocations, one for each color. For efficient indexing on a GPU, we also need to maintain a color-sorted order of triangles and containing particles, which involves additional computation. Due to coloring, some kernels can get less workload, which results in underutilization of the GPU hardware.
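The shared-memory strategy of Figures 4-3 and 4-4 can be sketched as the following CUDA fragment. It is illustrative only (the names and the fixed region count are assumptions): each block accumulates per-region integer counts in fast shared memory and then issues a single atomic update per region to global memory.

    #define NUM_REGIONS 1024   // assumed Level 1 region count (a 32 x 32 grid)

    // Count entities per region: accumulate in shared memory, then one global
    // atomic per region per block (compare Figures 4-3 and 4-4).
    __global__ void count_per_region(const int *entityRegion, int numEntities,
                                     int *globalCount) {
        __shared__ int localCount[NUM_REGIONS];           // 4 KB, fits in 16 KB shared memory
        for (int r = threadIdx.x; r < NUM_REGIONS; r += blockDim.x)
            localCount[r] = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numEntities)
            atomicAdd(&localCount[entityRegion[i]], 1);   // cheap shared-memory atomic
        __syncthreads();

        for (int r = threadIdx.x; r < NUM_REGIONS; r += blockDim.x)
            if (localCount[r] > 0)
                atomicAdd(&globalCount[r], localCount[r]); // single global atomic per region
    }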

PAGE 27

As a result, for better performance, instead of coloring we use the shadow triangle replication approach.

We store the regions in shared memory for updating the entity count. Each region is represented by an integer value in the shared memory which stores the number of entities in that region. Every thread in a block computes the region for the assigned entities and updates (atomically) the integer counter in the shared memory. NVIDIA™ Tesla provides 16 KB of shared memory per block. Shared memory also stores the arguments to kernel calls. Hence this approach is constrained by the amount of shared memory available for storing region counters. For the rest of the discussion, we assume that the shared memory can hold integer counters for 1024 regions (using a 32X32 rectangular grid). Clearly this number will be larger if the next generation machines have larger shared memory, and smaller if they have smaller shared memory.

The mesh is partitioned into regions using a Level 1 rectangular grid of fixed dimensions. There are two broad scenarios:

1. If the number of regions after partitioning of the triangular mesh is within 1024, we proceed with the sorting, keeping all regions in the shared memory.

2. If the number of regions after partitioning the triangular mesh (because of its size) is larger than 1024, we use a two level grid for regions. We choose the Level 1 grid such that the partitioning creates only 1024 regions and proceed with the sorting. Then we use a Level 2 grid to partition each of the regions into subregions and perform a bucket sort so that entities will be in subregion sorted order. The number of GPU blocks that operate on each region is determined by the entity density of the region.

We use parallel bucket sort for ordering particles. The buckets in this case are the regions created by the rectangular grid during partitioning of the triangular mesh. As depicted in Algorithms 3 and 4, we use two GPU kernels to accomplish the parallel bucket sort of particles:

1. Algorithm 3 outlines the kernel that finds the unique position of all the particles in the buckets (steps 1-20) and a GPU kernel for doing parallel prefix sum (step 21). Steps 7-9 compute the region containing a particle and increment (atomically) the counter corresponding to that region in shared memory. Steps 11-14 write back

PAGE 28

Algorithm 3 Parallel Bucket Sort Kernel 1
1: for all GPU blocks in parallel do
2:   regions array is in Shared memory
3:   particles array in the Global memory
4:   particleCount array in the Global Memory
5:   thread = blockID * NO_THREADS_PER_BLOCK + threadID
6:   for all threads in a block in parallel do
7:     particle = particles[thread]
8:     regionIndex = particle.grid
9:     sharedAtomicAdd(regions[regionIndex], 1)
10:    barrier_synchronize()
11:    for i = 0 to total number of regions do
12:      globalIndex = threadIdx.x + i * NO_THREADS_PER_BLOCK
13:      newStart = globalAtomicAdd(particleCount[globalIndex], regions[globalIndex])
14:      regions[globalIndex] = newStart
15:    end for
16:    barrier_synchronize()
17:    particle.position = sharedAtomicAdd(regions[regionIndex], 1)
18:    particles[thread] = particle
19:  end for
20: end for
21: prefix_sum_gpu()

Algorithm 4 Parallel Bucket Sort Kernel 2
1: for all GPU blocks in parallel do
2:   for all threads in a block in parallel do
3:     thread = blockID * NO_THREADS_PER_BLOCK + threadID
4:     particle = particles[thread]
5:     regionStart = get_region_start(particle.region)
6:     sortedParticles[particle.position + index] = particle
7:   end for
8: end for
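The kernels above hinge on the property, explained below for steps 13-14, that atomicAdd returns the value the location held before the addition, which lets every thread reserve a unique output slot. A minimal, self-contained CUDA sketch of that slot-reservation idea follows; the names are illustrative and this is not the thesis kernel.

    // Each thread claims a unique slot in its particle's bucket by using the
    // old value returned by atomicAdd as its write offset.
    // bucketCursor is assumed to be pre-filled with each region's start offset
    // (the result of a prefix sum over the per-region counts).
    __global__ void scatter_to_buckets(const int *particleRegion,  // region id per particle
                                       int       *bucketCursor,    // per-region write cursor
                                       int       *sortedIndex,     // output: particle ids in region order
                                       int numParticles) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p < numParticles) {
            int slot = atomicAdd(&bucketCursor[particleRegion[p]], 1); // returns old cursor value
            sortedIndex[slot] = p;                                     // unique position per particle
        }
    }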

PAGE 29

Algorithm 5 compute_force
1: Load vertices and triangles in that region to shared memory
2: barrier_synchronize()
3: for all threads in this block in parallel do
4:   for all particles in the region assigned to this thread do
5:     triangle = find_triangle(particle)
6:     atomic_force_update(vertices in shared memory)
7:   end for
8:   barrier_synchronize()
9:   atomic write back of vertices (force) to global memory
10: end for

the particle count to global memory. In step 13, we utilize an important property of atomic add on the GPU: the return value of an atomic add is the value of the memory location before the atomic add. In step 14, we store that value in the shared memory location corresponding to that region. Steps 17-18 use this information to identify the unique bucket position. Step 21 uses the standard parallel prefix sum algorithm for the GPU to aggregate the particle counters of each region and compute the exact location to put the particle in the sorted array.

2. The kernel in Algorithm 4 puts the particle in the exact location computed from Algorithm 3. After this kernel the particles will be in sorted order according to the regions.

Algorithm 6 move_particles
1: Load vertices and triangles in that region to shared memory
2: barrier_synchronize()
3: for all threads in this block in parallel do
4:   for all particles in the region assigned to this thread do
5:     triangle = get_triangle()
6:     get_force(triangle)
7:     move_particle()
8:   end for
9: end for

Algorithms 5 and 6 depict the GPU kernels for compute_force() and move_particles() respectively, which form the most important steps in a simulation iteration. aggregate_force() performs a parallel atomic summation of the forces on shadow vertices. The incremental_sort() function for particles is explained later in this paper. There are two approaches to map the mesh regions to GPU blocks:

PAGE 30

1. THE FIRST APPROACH: Each region is mapped to multiple GPU blocks as depicted in Figure 4-6. The number of GPU blocks is determined by the particle density of the region. This ensures that all GPU blocks will be equally loaded. The disadvantage is that all GPU blocks in a region will have to load the triangles and vertices in that region to shared memory and do an atomic write back to the global memory.

Figure 4-6. Region mapped to multiple GPU blocks

2. THE SECOND APPROACH: Each region is mapped to a single GPU block as depicted in Figure 4-7. In this approach, each block loads the region entities to the shared memory and writes back only once. The disadvantage is that the particle dense blocks will have more workload. This approach will be better when there is a uniform distribution of particles and triangles across the regions. Otherwise the runtime always depends on the densest mesh.

PAGE 31

Figure 4-7. Region mapped to single GPU block

Choosing the number of blocks can depend on the type of memory access and the amount of contention between the accesses. There can be two access patterns based on the kernel algorithm. (1) INDEPENDENT ACCESS: The memory access pattern is such that each block accesses independent global memory locations. A GPU kernel can be configured in two ways:

1. Kernel with more workload per block and fewer blocks.

2. Kernel with more blocks and less workload per block.

The total number of global memory accesses is the same in both approaches. The second approach (as shown in Figure 4-8) is faster compared to the first as long as the number

PAGE 32

of blocks is within the limits specified by the hardware. The reason is that when the memory accesses remain constant, an increase in the number of blocks will keep all the execution units in the hardware busy. The access pattern is demonstrated in Algorithm 7, and (2) CONFLICTING ACCESS: A memory access pattern in which many blocks access the same global memory location. This access can lead to more contention.

Algorithm 7 Scenario 1
1: for all GPU blocks in parallel do
2:   for all threads in a block in parallel do
3:     start = (blockIdx.x * NO_THREADS_PER_BLOCK * DATA_PER_THREAD)
4:     for i = 0 to DATA_PER_THREAD do
5:       index = start + i * NO_THREADS_PER_BLOCK + threadIdx.x
6:     end for
7:   end for
8: end for

Algorithm 8 Scenario 2
1: for all GPU blocks in parallel do
2:   for all threads in a block in parallel do
3:     start = get_start()
4:     data_per_thread = n_elements_block / NO_THREADS_PER_BLOCK
5:     for i = 0 to data_per_thread do
6:       index = start + i * NO_THREADS_PER_BLOCK + threadIdx.x
7:     end for
8:   end for
9: end for

Our experiment (as shown in Figure 4-8) shows that increasing the workload per block gives better performance in this case. The reason is that when the workload per block increases, the number of blocks decreases, which in turn reduces contention from concurrent global accesses. The access pattern is demonstrated in Algorithm 8.

Another advantage of the replication of shadow triangles and vertices is the reduction of atomic updates in the global memory. As mentioned earlier, atomic updates in shared memory are very fast compared to atomic updates in global memory. Replication ensures each region can operate independently of the others. Each GPU block has its own copy of the triangles and vertices of the mesh that it needs for

PAGE 33

Figure 4-8. Comparison of blocking strategies

completing its computations. If replication were not used, the atomic_force_update() depicted in Algorithm 5 would be partly on shared memory and partly on global memory. This in turn may degrade the performance.

Before updating the force, each particle needs to identify the triangle in which it belongs. This is accomplished by the find_triangle(particle) step in Algorithm 5. This step can be a major bottleneck in the PIC algorithm as it involves a linear search over the triangles for each particle in a region. We use two strategies to reduce the search time:

1. Each GPU block stores the triangles associated with its region in the shared memory for faster access. All the threads in the block perform a linear search over these triangles. We use the shared memory access pattern with minimal bank conflicts.

2. Partitioning the triangular mesh into regions using a rectangular grid. Finding the region containing a particle is straightforward and takes only a constant number of operations. This limits the triangle search to only triangles in that region and reduces the search time.

As mentioned earlier, we are mapping a region to a GPU block, and the block loads the vertices and triangles in the region to shared memory. We have to ensure that the region is small enough so that the shared memory can accommodate the vertices and

PAGE 34

triangles. This creates another challenge: as the region gets smaller, the number of regions increases. The GPU hardware imposes a limit on the number of blocks per kernel invocation. In the case of find_triangle(particle), we have to linearly scan through the list of the triangles in a given region. This function benefits if the region size is smaller, as the number of triangles to scan reduces. To solve this, we further partition the regions into subregions. Subregions are created based on the triangle density in a region. This will be explained in Section 4.2. By doing so we can keep all the triangles to be linearly scanned in the shared memory. This ensures effective sharing of the mesh entities by all threads in a GPU block. As the subregions increase, the number of GPU blocks will also increase. We have to find an optimal partitioning such that the triangles can fit into the shared memory and the number of GPU blocks is within the limits imposed by the hardware. With these constraints in place, we explore different mesh partitioning strategies as described in Section 4.2; a brief sketch of the region lookup and the per-region triangle scan is given after Section 4.2.1 below.

4.2 Mesh Decomposition

In this paper, we explore three different mesh partitioning strategies. The three approaches only differ in the mesh partitioning strategy used. The rest of the PIC algorithm remains the same.

4.2.1 Density Based Partitioning

In this approach, we partition the mesh into regions such that each region has approximately the same triangle density. This is depicted in Figure 4-9. Uniform triangle density ensures effective load balancing across the regions. This approach requires a slightly expensive pre-processing step to find the triangle density for creating regions. A KD tree [13] is created on all triangles in the mesh. The KD tree prunes the search space for triangles. Each particle traverses down the KD tree and finds the triangle in which it belongs. This approach requires an expensive pre-processing step for creating a KD tree. Moreover, the KD tree is an asymmetric data structure that cannot be mapped very well to the SIMT architecture of the GPU.
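The following device-side sketch, built on the hypothetical Vertex and Triangle layouts from Section 4.1, illustrates the two search ingredients described above: a constant-time lookup of the rectangular-grid region containing a particle, followed by a linear scan over that region's triangles (assumed to be staged in shared memory) with a cross-product containment test. It is an assumption-laden illustration, not the thesis code.

    // Constant-time region lookup on the superimposed rectangular grid.
    // x0, y0, cellW, cellH and gridX are hypothetical grid parameters.
    __device__ int region_of(float x, float y, float x0, float y0,
                             float cellW, float cellH, int gridX) {
        int cx = (int)((x - x0) / cellW);
        int cy = (int)((y - y0) / cellH);
        return cy * gridX + cx;
    }

    // Linear scan over the region's triangles using a same-side (sign of
    // cross product) containment test; returns the local triangle index.
    __device__ int find_triangle(float px, float py, const Triangle *tri,
                                 const Vertex *verts, int numTriangles) {
        for (int t = 0; t < numTriangles; ++t) {
            Vertex a = verts[tri[t].v1], b = verts[tri[t].v2], c = verts[tri[t].v3];
            float d1 = (px - b.x) * (a.y - b.y) - (a.x - b.x) * (py - b.y);
            float d2 = (px - c.x) * (b.y - c.y) - (b.x - c.x) * (py - c.y);
            float d3 = (px - a.x) * (c.y - a.y) - (c.x - a.x) * (py - a.y);
            bool hasNeg = (d1 < 0.0f) || (d2 < 0.0f) || (d3 < 0.0f);
            bool hasPos = (d1 > 0.0f) || (d2 > 0.0f) || (d3 > 0.0f);
            if (!(hasNeg && hasPos))
                return t;      // point lies inside (or on an edge of) triangle t
        }
        return -1;             // not found in this region
    }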

PAGE 35

Figure 4-9. Partitioning based on triangle density

In the next two approaches, we superimpose a rectangular grid over the triangular mesh as shown in Figure 4-10. The rectangular grid partitions the triangular mesh into regions which can be solved independently. The pre-processing step is very fast compared to the KD tree approach. One disadvantage with this approach is that the triangle density varies between regions. For example, Region 1 is less dense than Region 2. This results in some regions having more workload compared to other regions. We reduce the load imbalance by superimposing a second level grid over the rectangular

PAGE 36

Figure 4-10. Partitioning using a rectangular mesh

grid which partitions the region into subregions. We use similar data structures as shown in Figure 4-2 to store the mesh entities. Each of these data structures will be sorted according to the order of subregions.

4.2.2 Uniform Partitioning

The Level 2 grid partitions the Level 1 regions into subregions. The structure of the Level 2 grid is uniform across the regions as shown in Figure 4-11.

PAGE 37

Figure 4-11. Uniform partitioning using a rectangular grid

This approach is equivalent to the scenario in which we have a larger number of Level 1 regions. The additional mesh granularity reduces the time for the linear search to identify which triangle a particle belongs to. One disadvantage with this method is the upper limit on the number of blocks a kernel call can handle. Since each subregion maps to a block, we cannot increase the number of subregions beyond a certain limit.

Due to the symmetric nature of the partitioning grid, to overcome the shared memory limitation, we use heuristics to incrementally sort the particles after each simulation

PAGE 38

iteration. In reality, most of the particles will move only to adjacent regions. Hence we do not have to keep all the regions in the shared memory. The regions that are in a threshold neighborhood are kept in the shared memory and other regions are kept in the global memory. In this way, most of the updates will be in the shared memory and only a few updates go to the global memory. As shown in Figure 4-12, while processing the particles in region X, we keep the neighboring regions marked as Y in the shared memory. All the other regions in the mesh will be in the global memory. The algorithm is depicted in Algorithm 9.

Algorithm 9 incremental_sort
1: for all blocks in parallel do
2:   for all threads in this block in parallel do
3:     shared_memory = load_threshold_regions(global_memory)
4:     for all particles in the region assigned to this thread do
5:       newRegion = get_region(particle.x, particle.y)
6:       if newRegion != region then
7:         if newRegion is within threshold then
8:           atomic_update(newRegion in shared_memory)
9:         else
10:          atomic_update(newRegion in global_memory)
11:        end if
12:      end if
13:    end for
14:    barrier_synchronize()
15:    Atomic write back of the shared_memory regions to global_memory
16:  end for
17: end for

4.2.3 Non-Uniform Partitioning with Level 2 Grid

In this approach, the Level 2 grid is not uniform across the regions. Figure 4-13 shows nonuniform partitioning of the mesh. The Level 2 grid is dense in the regions where triangle density is high and coarse in the regions where density is low. For non-uniform partitioning, we define a threshold value for triangles. The threshold value is the maximum number of triangles that can be part of a subregion. Non-uniform partitioning reduces the effective number of subregions and therefore GPU blocks.

PAGE 39

Figure 4-12. Partitioning regions between shared and global memory for incremental sorting

Another advantage is that we can partition the regions into more subregions, thereby reducing the time taken by the find_triangle(particle) function. Non-uniformity creates asymmetry and requires more complex pre-processing and indexing. Due to the asymmetric nature of the mesh, applying the heuristics for incremental sort as in the uniform case increases the complexity of the code and can introduce conditionals, which we believe will reduce the performance.

PAGE 40

Figure 4-13. Nonuniform partitioning using a rectangular grid

PAGE 41

CHAPTER 5
EMPIRICAL EVALUATION

We used the triangular mesh from Oak Ridge National Laboratory, which is used for XGC1 [1] benchmarks. The mesh has 1.8 million triangles and represents a tokamak geometry with effects of the separatrix and x-point visible. We randomly distributed 18 million particles on this mesh. For Level 1 partitioning of the mesh, we used a 32X32 rectangular grid (1024 regions). The rest of the experimental section is organized as follows. In Section 5.1, we outline the hardware and software configuration of the benchmark machine. Section 5.2 compares different particle sorting approaches. In Section 5.3, we study the best workload per thread for the particle sorting kernel. In Section 5.4, we compare the triangle search times of the two mesh partitioning (uniform and non-uniform) approaches.

5.1 Benchmark Machine

The experiments are conducted with the following hardware and software configurations. We used an NVIDIA™ Tesla machine for benchmarking. The machine has a four core Intel Xeon processor clocked at 2.8 GHz and two NVIDIA™ Tesla T10 GPUs with a clock rate of 1.3 GHz. The machine runs a 64 bit Linux operating system. The GPU has 4 GB global memory and 30 Streaming Multiprocessors (SMs). Each SM has 8 Scalar Processors (SPs), resulting in 240 computing cores. The NVIDIA software stack uses version 3.0 of the CUDA Driver and Runtime. The experiments used CUDA compute capability mode 1.3. The compilation is done using the NVIDIA™ nvcc compiler with optimization level 3 (-O3).

5.2 Comparison of Different Particle Sorting Schemes

As mentioned before, using bucket sort, we re-arrange particles into sub-region sorted order. In Algorithm 3, we use an integer counter to count the particles in a region. In this experiment, we assume that the number of subregions is so large that counters for all sub-regions cannot be kept in the shared memory. We use the

PAGE 42

Table 5-1. Comparison of particle sorting schemes

Maximum number of triangles   Sorting using shared memory   Sorting without using shared memory
in a subregion                (Time in ms)                  (Time in ms)
10                            185.68                        1492.55
200                           183.88                        1881.56
1000                          183.55                        2441.08
5000                          183.99                        3773.92

non-uniform partitioning strategy (Section 4.2.3, Figure 4-13) for the Level 2 grid. We compare two sorting strategies:

- Sorting without using shared memory: The integer counters corresponding to all the subregions are kept in global memory. We do not use shared memory.

- Sorting using shared memory: In this approach sorting is done in two steps. The first step orders the particles based on Level 1 grid regions. The number of Level 1 regions is small and they can be stored in shared memory. The second step works on each of these regions and sorts into subregion sorted order.

Experimental results are depicted in Table 5-1. The maximum number of triangles in a subregion is the threshold for Level 2 partitioning, i.e. the Level 2 grid is placed in such a way that each subregion has at most the threshold number of triangles. When shared memory is not used, the performance degrades as we increase the triangle threshold. When the triangle threshold value increases, the number of Level 2 grid cells decreases. The number of particles in a grid cell increases, resulting in more global memory contention while updating the region counter. When we use fast shared memory, there is a 20X speedup (for a threshold value of 5000) in the sorting time. The sorting time remains constant for the threshold values 10-5000. The reason is that the majority of atomic operations are on shared memory and are confined within a block. Hence the execution of other blocks won't be affected because of atomic locks. We cannot increase the threshold beyond a certain limit because the

PAGE 43

Table 5-2. Comparison of sorting time with variable workload per thread

Number of GPU blocks   Particles Per Thread   Time (ms)
35157                  2                      61.49
17579                  4                      60.49
4395                   16                     60.73
1099                   64                     61.13
550                    128                    65.79
314                    224                    63.89
276                    255                    65.07
275                    256                    95

time to search during the find_triangle(particle) step will increase proportionally. Also, we will not be able to keep all triangles in shared memory for faster searching and force updates.

5.3 Choosing the Best Workload Per Thread

In this experiment, we study the effect of increasing the workload of threads in a block. We use the parallel bucket sort algorithm (Algorithm 3) for this study and increase the number of particles assigned to a thread. As shown in Table 5-2, when the workload of threads increases, the number of GPU blocks reduces. When the GPU blocks are too few, we do not have enough blocks to keep the computing units in the GPU busy, and this leads to relatively poor performance. It is also worth noting that when the number of particles per thread reaches 256, there is a huge deterioration in performance. The reason is that an access width of 256 between threads can result in a large number of cache misses, which results in the performance penalty. For parallel bucket sort, the best performance is achieved when each thread processes 4 particles.

5.4 Comparison of Triangle Search Times

The experimental results for non-uniform partitioning and uniform partitioning are shown in Table 5-3 and Table 5-4 respectively. Both tables show the time for ten iterations. The non-uniform partitioning scheme with a smaller number of blocks has comparable

PAGE 44

Table 5-3. Triangle search time for non-uniform partitioning

Number of GPU blocks   Time (ms)
33464                  428.51
22471                  989.88
2779                   7235.16
1024                   12561.06

Table 5-4. Triangle search time for uniform partitioning

Number of GPU blocks   Time (ms)
4096                   3111.11
9216                   1366.21
16384                  877.23
25600                  609.00
36864                  500.92
50176                  427.00

performance with uniform partitioning with a larger number of blocks. The reason is that, even though there can be some blocks having less workload in uniform partitioning, the automatic load balancing provided by the GPU hardware takes care of the load imbalance. Hence the simpler uniform partitioning would be a better choice, as it can also support the heuristics for incremental sorting. In general, uniform partitioning will be a better methodology for automatic code generators even when the underlying computational grid is highly asymmetric.

PAGE 45

CHAPTER 6
CONCLUSION

The parallelization of PIC problems requires careful partitioning of the particle and mesh domains to balance the computational load and minimize communication cost. We explored the parallelization of the PIC algorithm on an asymmetric structure like a triangular mesh. We discussed various methodologies to exploit the capabilities of the GPU hardware for the PIC algorithm. The paper also provides insight into the decisions to be made before parallelizing an application on the GPU.

Our experimental results show that the use of shared memory is critical for optimal performance of any algorithm on the GPU. We introduced the concept of replication, which is a simpler solution than graph coloring, which involves the overhead of multiple GPU kernel invocations for each color and ordering based on color. The use of replication for shadow triangles and vertices does not deteriorate performance during the force aggregation step. In our experiment, each iteration of the simulation takes 82 ms. The aggregation of force from the vertices and shadow vertices takes 3 ms, which contributes only 3.5% to the total simulation time. The algorithms we discuss are scalable with the size of the mesh and the number of particles, and can easily be ported to execute on multiple GPUs.

PAGE 46

REFERENCES

[1] Adams M. F., Ku S., Worley P., et al., Scaling to 150K cores: Recent algorithm and performance engineering developments enabling XGC1 to run at scale, Journal of Physics: Conference Series, 2009.

[2] Kraeva M. A., Malyshkin V. E., Algorithms of Parallel Realisation of the PIC Method with Assembly Technology, Lecture Notes in Computer Science, 1999, Volume 1593/1999, 329-338.

[3] Kraeva M. A., Malyshkin V. E., Implementation of PIC Method on MIMD Multicomputers with Assembly Technology, in: Proceedings of the (HPCN) High Performance Computing and Networking International Conference, Europe, Lecture Notes in Computer Science, Vol. 1255, Springer, Berlin, 1997, pp. 541-549.

[4] Liao W., Ou C., Ranka S., Dynamic Alignment and Distribution of Irregularly Coupled Data Arrays for Scalable Parallelization of Particle-in-Cell Problems, ipps, pp. 57, 10th International Parallel Processing Symposium (IPPS '96), 1996.

[5] Madduri K., Williams S., Ethier S., et al., Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP10), 2010.

[6] Stantchev G., Dorland W., Gumerov, Fast parallel Particle-To-Grid interpolation for plasma PIC simulations on the GPU, Journal of Parallel and Distributed Computing, Volume 68, Issue 10, October 2008, Pages 1339-1349.

[7] Ryoo S., Rodrigues C. I., Baghsorkhi S. S., et al., Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, in: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08.

[8] Ji Qiang, Robert D. Ryne, Salman Habib and Viktor Decyk, Object-Oriented Parallel Particle-in-Cell Code for Beam Dynamics Simulation in Linear Accelerators, sc, pp. 55, Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, 1999.

[9] Viktor K. Decyk, Tajendra V. Singh and Scott A. Friedman, Graphical Processing Unit-Based Particle-in-Cell Simulations, Proc. Intl. Computational Accelerator Physics Conf. (ICAP 2009), San Francisco, CA, Sept. 2009.

[10] Viktor K. Decyk, UPIC: A framework for massively parallel particle-in-cell codes, Computer Physics Communications, Volume 177, Issues 1-2, July 2007, Pages 95-97, Proceedings of the Conference on Computational Physics 2006 (CCP 2006).

[11] NVIDIA, CUDA C Programming Guide, Version 3.1.

[12] NVIDIA, CUDA C Best Practices Guide, Version 3.1.

PAGE 47

[13] KD tree, http://en.wikipedia.org/wiki/Kd_tree

PAGE 48

BIOGRAPHICAL SKETCH

Rejith George Joseph received his Bachelor of Technology in Computer Science from the National Institute of Technology, Calicut, Kerala, India in 2005. After graduation he worked as a Software Engineer at IBM, Synopsys and Axiom Design Automation over a span of 4 years. He graduated with his Masters in Computer Science from the Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida in 2011. His research interests include data mining on GPUs, extending existing algorithms to GPUs, and algorithms for data streams.