
Parallel Discrete Event Simulation of Queuing Networks Using GPU-Based Hardware Acceleration

Permanent Link: http://ufdc.ufl.edu/UFE0041160/00001

Material Information

Title: Parallel Discrete Event Simulation of Queuing Networks Using GPU-Based Hardware Acceleration
Physical Description: 1 online resource (105 p.)
Language: english
Creator: Park, Hyung
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: cuda, discrete, event, gpu, parallel, queuing, simulation
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Queuing networks are used widely in computer simulation studies. Examples of queuing networks can be found in areas such as supply chains, manufacturing workflow, and Internet routing. If the networks are fairly small in size and complexity, it is possible to create discrete event simulations of the networks without incurring significant delays in analyzing the system. However, as the networks grow in size, such analysis can be time consuming and thus require more expensive parallel processing computers or clusters. The trend in computing architectures has been toward multicore central processing units (CPUs) and graphics processing units (GPUs). A GPU is fairly inexpensive hardware, found in most recent computing platforms, and a practical example of single instruction, multiple data (SIMD) architectures. The majority of studies using the GPU within the graphics and simulation communities have focused on the use of the GPU for models that are traditionally simulated using regular time increments, whether these increments are accomplished through the addition of a time delta (i.e., numerical integration) or event scheduling using the delta (i.e., discrete event approximations of continuous-time systems). These types of models have the property of being decomposable over a variable or parameter space. In prior studies, discrete event simulation, such as a queuing network simulation, has been characterized as being an inefficient application for the GPU, primarily due to the inherent synchronicity of the GPU organization and an apparent mismatch between the classic event scheduling cycle and the GPU's basic functionality. However, we have found that irregular time advances of the sort common in discrete event models can be successfully mapped to a GPU, thus making it possible to execute discrete event systems on an inexpensive personal computer platform.
This dissertation introduces a set of tools that allows the analyst to simulate queuing networks in parallel using a GPU. We then present an analysis of a GPU-based algorithm, describing benefits and issues with the GPU approach. The algorithm clusters events, achieving speedup at the expense of an approximation error that grows as the cluster size increases. We were able to achieve a 10x speedup using our approach, with a small error in the output statistics of the general network topology. Based on error analysis trends, this error can be mitigated to obtain reasonably accurate output statistics.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Hyung Park.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Fishwick, Paul A.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0041160:00001




PARALLEL DISCRETE EVENT SIMULATION OF QUEUING NETWORKS USING GPU-BASED HARDWARE ACCELERATION

By

HYUNGWOOK PARK

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2009

© 2009 Hyungwook Park

To my family

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Dr. Paul A. Fishwick, for his excellent inspiration and guidance throughout my Ph.D. studies at the University of Florida. I would also like to thank my Ph.D. committee members, Dr. Jih-Kwon Peir, Dr. Shigang Chen, Dr. Benjamin C. Lok, and Dr. Howard W. Beck, for their precious time and advice for my research. Also, I am grateful to the Korean Army, which gave me a chance to study in the United States of America with financial support. I would like to thank my parents, Hyunkoo Park and Oksoon Jung, who encouraged me throughout my studies. I would especially like to thank my wife, Jisuk Han, and my sons, Kyungeon and Sangeon Park. They have been very supportive and patient throughout my studies. I would never have finished my study without them.

TABLE OF CONTENTS

                                                                          page

ACKNOWLEDGMENTS .................................................. 4
LIST OF TABLES ..................................................... 8
LIST OF FIGURES .................................................... 9
ABSTRACT .......................................................... 11

CHAPTER

1 INTRODUCTION ................................................... 13
  1.1 Motivations and Challenges ..................................... 13
  1.2 Contributions to Knowledge ..................................... 16
      1.2.1 A GPU-Based Toolkit for Discrete Event Simulation Based on
            Parallel Event Scheduling ................................ 16
      1.2.2 Mutual Exclusion Mechanism for GPU ....................... 16
      1.2.3 Event Clustering Algorithm on SIMD Hardware ............... 17
      1.2.4 Error Analysis and Correction ............................. 18
  1.3 Organization of the Dissertation ................................. 18

2 BACKGROUND .................................................... 20
  2.1 Queuing Model ................................................ 20
  2.2 Discrete Event Simulation ...................................... 23
      2.2.1 Event Scheduling Method .................................. 23
      2.2.2 Parallel Discrete Event Simulation ......................... 25
            2.2.2.1 Conservative synchronization ...................... 26
            2.2.2.2 Optimistic synchronization ........................ 28
            2.2.2.3 A comparison of two methods ...................... 30
  2.3 GPU and CUDA ................................................ 30
      2.3.1 GPU as a Coprocessor ..................................... 30
      2.3.2 Stream Processing ........................................ 32
      2.3.3 GeForce 8800 GTX ........................................ 33
      2.3.4 CUDA ................................................... 35

3 RELATED WORK .................................................. 38
  3.1 Discrete Event Simulation on SIMD Hardware ..................... 38
  3.2 Tradeoff between Accuracy and Performance ...................... 40
  3.3 Concurrent Priority Queue ...................................... 41
  3.4 Parallel Simulation Problem Space ............................... 41

4 A GPU-BASED APPLICATION FRAMEWORK SUPPORTING FAST DISCRETE
  EVENT SIMULATION ............................................... 43
  4.1 Parallel Event Scheduling ....................................... 43
  4.2 Issues in a Queuing Model Simulation ............................ 45
      4.2.1 Mutual Exclusion ......................................... 45
      4.2.2 Selective Update ......................................... 49
      4.2.3 Synchronization .......................................... 49
  4.3 Data Structures and Functions ................................... 50
      4.3.1 Event Scheduling Method .................................. 50
      4.3.2 Functions for a Queuing Model ............................. 54
      4.3.3 Random Number Generation ................................ 58
  4.4 Steps for Building a Queuing Model .............................. 58
  4.5 Experimental Results ........................................... 62
      4.5.1 Simulation Environment ................................... 62
      4.5.2 Simulation Model ......................................... 62
      4.5.3 Parallel Simulation with a Sequential Event Scheduling Method .. 63
      4.5.4 Parallel Simulation with a Parallel Event Scheduling Method .... 64
      4.5.5 Cluster Experiment ....................................... 65

5 AN ANALYSIS OF QUEUE NETWORK SIMULATION USING GPU-BASED
  HARDWARE ACCELERATION ....................................... 67
  5.1 Parallel Discrete Event Simulation of Queuing Networks on the GPU ... 67
      5.1.1 A Time-Synchronous/Event Algorithm ....................... 67
      5.1.2 Timestamp Ordering ...................................... 69
  5.2 Implementation and Analysis of Queuing Network Simulation ........ 70
      5.2.1 Closed and Open Queuing Networks ......................... 70
      5.2.2 Computer Network Model .................................. 72
      5.2.3 CUDA Implementation ..................................... 74
  5.3 Experimental Results ........................................... 76
      5.3.1 Simulation Model: Closed and Open Queuing Networks ........ 76
            5.3.1.1 Accuracy: closed vs. open queuing network ........... 77
            5.3.1.2 Accuracy: effects of parameter settings on accuracy ... 79
            5.3.1.3 Performance ...................................... 79
      5.3.2 Computer Network Model: a Mobile Ad Hoc Network ........... 83
            5.3.2.1 Simulation model .................................. 83
            5.3.2.2 Accuracy and performance .......................... 86
  5.4 Error Analysis ................................................. 88

6 CONCLUSION ..................................................... 95
  6.1 Summary ..................................................... 95
  6.2 Future Research ............................................... 96

REFERENCES ........................................................ 98

BIOGRAPHICAL SKETCH ............................................... 105

LIST OF TABLES

Table                                                                     page

2-1 Notations for queuing model statistics ............................... 22
2-2 Equations for key queuing model statistics ........................... 23
3-1 Classification of parallel simulation examples ........................ 42
4-1 The future event list and its attributes ............................... 51
4-2 The service facility and its attributes ................................ 55
5-1 Simulation scenarios of MANET ..................................... 86
5-2 Utilization and sojourn time (Soj. time) for different values of time
    intervals (Δt) and mean service times (s̄) ............................. 91

LIST OF FIGURES

Figure                                                                    page

2-1 Components of a single server queuing model ........................ 21
2-2 Cycle used for event scheduling ..................................... 24
2-3 Stream and kernel .................................................. 33
2-4 Traditional vs. GeForce 8 series GPU pipeline ........................ 34
2-5 GeForce 8800 GTX architecture ..................................... 35
2-6 Execution between the host and the device ........................... 37
3-1 Diagram of parallel simulation problem space ......................... 42
4-1 The algorithm for parallel event scheduling ........................... 44
4-2 The result of a concurrent request from two threads without a mutual
    exclusion algorithm ................................................. 46
4-3 A mutual exclusion algorithm with clustering events ................... 48
4-4 Pseudocode for NextEventTime ...................................... 52
4-5 Pseudocode for NextEvent ........................................... 53
4-6 Pseudocode for Schedule ............................................ 54
4-7 Pseudocode for Request ............................................. 55
4-8 Pseudocode for Release ............................................. 56
4-9 Pseudocode for ScheduleServer ...................................... 57
4-10 First step in parallel reduction ...................................... 59
4-11 Steps in parallel reduction .......................................... 59
4-12 Step 3: Event extraction and departure event ......................... 60
4-13 Step 4: Update of service facility .................................... 61
4-14 Step 5: New event scheduling ....................................... 61
4-15 3×3 toroidal queuing network ....................................... 63
4-16 Performance improvement by using a GPU as coprocessor ............. 64
4-17 Performance improvement from parallel event scheduling .............. 65

5-1 Pseudocode for a hybrid time-synchronous/event algorithm with parallel
    event scheduling ................................................... 68
5-2 Queuing delay in the computer network model ........................ 73
5-3 3 linear queuing networks with 3 servers ............................. 76
5-4 Summary statistics of closed and open queuing network simulations .... 78
5-5 Summary statistics with varying parameter settings .................... 80
5-6 Performance improvement with varying time intervals (Δt) .............. 82
5-7 Comparison between wireless and mobile ad hoc networks ............. 84
5-8 Average end-to-end delay with varying time intervals (Δt) .............. 87
5-9 Average hop counts and packet delivery ratio with varying time
    intervals (Δt) ...................................................... 89
5-10 Performance improvement in MANET simulation with varying time
    intervals (Δt) ...................................................... 90
5-11 3-dimensional representation of utilization for varying time intervals and
    mean service times ................................................. 91
5-12 Comparison between experimental and estimation results .............. 93
5-13 Result of error correction ........................................... 93

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PARALLEL DISCRETE EVENT SIMULATION OF QUEUING NETWORKS USING GPU-BASED HARDWARE ACCELERATION

By

Hyungwook Park

December 2009

Chair: Paul A. Fishwick
Major: Computer Engineering

Queuing networks are used widely in computer simulation studies. Examples of queuing networks can be found in areas such as supply chains, manufacturing workflow, and Internet routing. If the networks are fairly small in size and complexity, it is possible to create discrete event simulations of the networks without incurring significant delays in analyzing the system. However, as the networks grow in size, such analysis can be time consuming and thus require more expensive parallel processing computers or clusters.

The trend in computing architectures has been toward multicore central processing units (CPUs) and graphics processing units (GPUs). A GPU is fairly inexpensive hardware, found in most recent computing platforms, and a practical example of single instruction, multiple data (SIMD) architectures. The majority of studies using the GPU within the graphics and simulation communities have focused on the use of the GPU for models that are traditionally simulated using regular time increments, whether these increments are accomplished through the addition of a time delta (i.e., numerical integration) or event scheduling using the delta (i.e., discrete event approximations of continuous-time systems). These types of models have the property of being decomposable over a variable or parameter space. In prior studies, discrete event simulation, such as a queuing network simulation, has been characterized as being an inefficient application for the GPU, primarily due to the inherent synchronicity of

the GPU organization and an apparent mismatch between the classic event scheduling cycle and the GPU's basic functionality. However, we have found that irregular time advances of the sort common in discrete event models can be successfully mapped to a GPU, thus making it possible to execute discrete event systems on an inexpensive personal computer platform.

This dissertation introduces a set of tools that allows the analyst to simulate queuing networks in parallel using a GPU. We then present an analysis of a GPU-based algorithm, describing benefits and issues with the GPU approach. The algorithm clusters events, achieving speedup at the expense of an approximation error that grows as the cluster size increases. We were able to achieve a 10x speedup using our approach, with a small error in the output statistics of the general network topology. Based on error analysis trends, this error can be mitigated to obtain reasonably accurate output statistics.

CHAPTER 1
INTRODUCTION

1.1 Motivations and Challenges

Queuing models [1-4] are constructed to analyze humanly engineered systems where jobs, parts, or people flow through a network of nodes (i.e., resources). The study of queuing models, their simulation, and their analysis is one of the primary research topics studied within the discrete event simulation community [5]. There are two approaches to estimating the performance of queuing systems: analytical modeling and simulation [3, 5, 6]. An analytical model is the abstraction of a system based on probability theory, representing the description of a formal system consisting of equations used to estimate the performance of the system. However, it is difficult to represent all situations in the real world using an analytical model, because that requires a restricted set of assumptions, such as an infinite queue capacity and no bounds on the inter-arrival and service times, which do not often occur in the real world. A simulation is often used to analyze a queuing system when a theory for the system equations is unknown or the algorithm for the equations is too complicated to be solved in closed form. Computer simulation involves the formulation of a mathematical model, often including a diagram. This model is then translated into computer code, which is then executed and compared against a physical, or real-world, system's behavior under a variety of conditions.

Queuing model simulations can be expensive in terms of time and resources in cases where the models are composed of multiple resource nodes and tokens that flow through the system. Therefore, there is a need to find ways to speed up queuing model simulations so that analyses can be obtained more quickly. Past approaches to speeding up queuing model simulations have used asynchronous message-passing, with special emphasis on two approaches: the conservative and the optimistic approaches [7]. Both approaches have been used to synchronize the asynchronous

logical processors (LPs), preserving causal relationships across LPs so that the results obtained are exactly the same as those produced by sequential simulation. Most studies of parallel simulation have been performed on multiple instruction, multiple data (MIMD) machines, or related networks, to execute the parts of a simulation model or LPs. The parallel simulation approaches that partition the simulation model into several LPs could easily be employed with a queuing model simulation, since the start of each execution need not be explicitly synchronized with other LPs.

A graphics processing unit (GPU) is a processor that renders 3D graphics in real time, and which contains several sub-processing units. Recently, the GPU has become an increasingly attractive architecture for solving compute-intensive problems for general purpose computation, which is called general-purpose computation on GPUs (GPGPU) [8-11]. Availability as a commodity and increased computational power make the GPU a substitute for expensive clusters of workstations in a parallel simulation, at a relatively low cost. For much of the history of GPU development, there has been a need to map the model into the graphics application programming interface (API), which limited the availability of the GPU to those experts who had GPU- and graphics-specific knowledge. This drawback has been resolved with the advent of the GeForce 8 series GPUs [12] and the compute unified device architecture (CUDA) [13, 14]. The control of the unified stream processors on the GeForce 8 series GPUs is transparent to the programmer, and CUDA provides an efficient environment for developing parallel code in the high-level language C without the need for graphics-specific knowledge.

In contrast to the previously ubiquitous MIMD approach to parallel computation within the context of simulation research, the GPU is single instruction, multiple data (SIMD)-based hardware that is oriented toward stream processing. SIMD hardware is a relatively simple, inexpensive, and highly parallel architecture; however, there are limits to developing an asynchronous model due to its synchronous operation. Stream processing [15, 16] is the basic programming model of SIMD architecture. The

stream processing approach exploits data and task parallelism by mapping data flow to processors, and provides efficient communication by accessing memory in a predictable pattern using producer-consumer locality as well. For these reasons, most simulation models on the GPU are time-synchronous, compute-intensive models with stream memory access.

However, queuing models are a typical asynchronous model, and their temporal events are relatively fine-grained. Queuing models are usually simulated based on event scheduling with manipulation of the future event list (FEL). Event scheduling tends to be a sequential operation, which often overwhelms the execution times of events in queuing model simulations. Another problem lies in the dynamic data structure for the event scheduling method in discrete event simulations. Dynamic data structures cannot be directly used on the GPU because dynamic memory allocation is not supported during kernel execution. Moreover, the randomized memory access for individual data cannot take advantage of the massive parallelism on the GPU.

Nonetheless, the GPU can become useful hardware for facilitating fine-grained discrete event simulations, especially for large-scale models, with the concurrent utilization of a number of threads and fast data transfer between processors. The execution time of each event can be very small, but a higher data parallelism with clustering of the events can be achieved for a large-scale model.

The objective of this dissertation is to simulate asynchronous queuing networks using GPU-based hardware acceleration. Two main issues related to this study are: (1) how can we simulate asynchronous models on SIMD hardware? And (2) how can we achieve a higher degree of parallelism? Investigations of these two main issues reveal that further attention must be paid to the following related issues: (a) parallel event scheduling, (b) data consistency without explicit support for mutual exclusion, (c) event clustering, and (d) error estimation and correction. This dissertation presents an approach to resolving these challenges.
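As a concrete illustration of the event clustering mentioned above, the following Python sketch (our own toy example; the dissertation's actual implementation is in CUDA, and all names here are ours) groups pending event timestamps into windows of width delta_t. Events inside one window can then be executed concurrently by SIMD threads, with the window width controlling the tradeoff between parallelism and timestamp approximation error.

```python
def cluster_events(timestamps, delta_t):
    """Group event timestamps into clusters of width delta_t.

    Each cluster starts at its earliest event; every event whose time
    falls within [start, start + delta_t) joins that cluster and would
    be executed at the cluster's start time, which is the source of the
    approximation error discussed later in the dissertation.
    """
    clusters = []
    current = []
    start = None
    for t in sorted(timestamps):
        if start is None or t - start >= delta_t:
            if current:
                clusters.append(current)
            current, start = [], t
        current.append(t)
    if current:
        clusters.append(current)
    return clusters

events = [0.1, 0.15, 0.9, 1.0, 1.05, 2.4]
# With delta_t = 0.5 the six events collapse into three clusters,
# so only three clock advances are needed instead of six.
groups = cluster_events(events, 0.5)
```

A larger delta_t yields fewer, larger clusters (more parallel work per step, more error); delta_t = 0 degenerates to ordinary one-event-at-a-time scheduling.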

1.2 Contributions to Knowledge

1.2.1 A GPU-Based Toolkit for Discrete Event Simulation Based on Parallel Event Scheduling

We have developed GPU-based simulation libraries for CUDA so that the GPU can easily be used for discrete event simulation, especially for a queuing network simulation. A GPU is designed to process array-based data structures for the purpose of processing pixel images in real time. The framework includes the functions for event scheduling and queuing models that have been developed using arrays on the GPU.

In discrete event simulation, the event scheduling method occupies a large portion of the overall simulation time. The FEL implementation, therefore, needs to be parallelized in order to take full advantage of the GPU architecture. A concurrent priority queue approach [17, 18] allows each processor to access the global FEL in parallel on shared-memory multiprocessors. The concurrent priority queue approach, however, cannot be directly applied to SIMD-based hardware, since concurrent insertion and deletion in the priority queue usually involve mutual exclusion, which is not natively supported by the GeForce 8800 GTX GPU [13].

Parallel event scheduling allows us to achieve significant speedup in queuing model simulations on the GPU. A GPU has many threads executed in parallel, and each thread can concurrently access the FEL. If the FEL is decomposed into many sub-FELs, and each sub-FEL is exclusively accessed by one thread, the access to one element in the FEL is guaranteed to be isolated from other threads. Exclusive access to each element allows event insertion and deletion to be concurrently executed.

1.2.2 Mutual Exclusion Mechanism for GPU

We have reorganized the processing steps in a queuing model simulation by employing alternate updates between the FEL and service facilities so that they can be updated in SIMD fashion. The new procedure enables us to prevent multiple threads

from simultaneously accessing the same element, without having explicit support for mutual exclusion on the GPU.

An alternate update is a lock-free method for mutual exclusion on the GPU, used to update two interacting arrays at the same time. Only one array can be exclusively accessed by a thread index if the indexes of the two arrays are not inter-related. If one array needs to update the other array, the element in the other array is arbitrarily accessed by the thread. Data consistency cannot be maintained if two or more threads concurrently access the same element in the other array. The other array must be updated after the thread index is switched to exclusively access itself. The updated array, however, has to search all of the elements in the request array to find the requesting elements. If the updated array knows in advance which elements in the request array are likely to request the update, the number of searches will be limited. Each node in a queuing network usually knows its incoming edges, which makes it possible to reduce the number of searches during an alternate update, mitigating the overall execution time.

1.2.3 Event Clustering Algorithm on SIMD Hardware

SIMD-based simulation is useful when a lot of computation is required by a single instruction with different data. However, its potential problems include the bottleneck in the control processor and load imbalance among processors. The bottleneck problem should not be significant when applying the CPU/GPU approach, since the CPU is designed to process heavyweight threads, whereas the GPU is designed to process lightweight threads and to execute arithmetic equations quickly [16].

The load imbalance problem can be resolved by employing a time-synchronous/event algorithm in order to achieve a higher degree of parallelism. A single timestamp cannot execute many events in parallel, since events in queuing models are irregularly spaced. Thus, event times need to be modified so that they can be clustered and synchronized. A time-synchronous/event algorithm is a SIMD-based hybrid approach to two common types of discrete simulation: discrete event and time-stepped. The algorithm adopts the

advantages of both methods to utilize the GPU. The simulation clock advances when the event occurs, but the events in the middle of the time interval are executed concurrently. A time-synchronous/event algorithm naturally leads to approximation errors in the summary statistics yielded from the simulation, because the events are not executed at their precise timestamps.

We investigated three different types of queuing models to observe the effects of our simulation method, including an implementation of a real-world application (a mobile ad hoc network model). The experimental results of our investigation show that our algorithm has different impacts on the statistical results and performance of the three types of queuing models.

1.2.4 Error Analysis and Correction

The error in our simulation is a numerical error, since we preserve timestamp ordering and causal relationships of events, and the result is approximate in terms of gathered summary statistics. The error may be acceptable for those modeled applications where the analyst is more concerned with speed, and can accept relatively small inaccuracies in summary statistics. In some cases, the error can be approximated and potentially corrected to yield more accurate statistics. We present a method for estimating the potential error incurred through event clustering by combining queuing theory and simulation results. This method can be used to obtain a closer approximation to the summary statistics through partially correcting the error.

1.3 Organization of the Dissertation

This dissertation is organized into six chapters. Chapter 2 reviews background information, including the queuing model, sequential and parallel discrete event simulation, the GPU, and CUDA. Chapter 3 describes related work. We discuss other studies of discrete event simulation on SIMD hardware, and the tradeoff between accuracy and performance. Chapter 4 describes a GPU-based library and applications framework for discrete event simulation. We introduce the routines that support parallel

event scheduling with mutual exclusion and queuing model simulations. Chapter 5 discusses a theoretical methodology and its performance analysis, including the tradeoffs between numerical errors and performance gain, as well as the approaches for error estimation and correction. Chapter 6 provides a summary of our findings and introduces areas for future research.
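The alternate update described in Section 1.2.2 can be mimicked sequentially in two passes: phase 1 is indexed by event, phase 2 by facility, so no two "threads" ever write to the same element. The following Python sketch is our own simplified illustration (each loop iteration stands in for one GPU thread, and the function and variable names are ours, not the dissertation's):

```python
def alternate_update(events, n_facilities):
    """Two-phase (alternate) update over two arrays.

    events: list of (facility_index, service_time) requests.
    Phase 1 writes are indexed by event; phase 2 writes are indexed by
    facility, so within each phase every writer owns a distinct slot
    and no locking is needed.
    """
    # Phase 1: "one thread per event" -- event i writes requests[i] only.
    requests = [None] * len(events)
    for i, (facility, svc) in enumerate(events):
        requests[i] = (facility, svc)

    # Phase 2: "one thread per facility" -- facility f reads the request
    # array and updates only its own accumulated busy time.
    busy = [0.0] * n_facilities
    for f in range(n_facilities):
        for req in requests:
            if req is not None and req[0] == f:
                busy[f] += req[1]
    return busy

# Two requests target facility 0 and one targets facility 1; even so,
# each busy[f] slot is written by exactly one phase-2 "thread".
busy = alternate_update([(0, 1.5), (1, 2.0), (0, 0.5)], n_facilities=2)
```

The inner scan in phase 2 is where the dissertation's incoming-edge knowledge would prune the search: a facility that knows its incoming edges need only inspect those requests rather than the whole array.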

CHAPTER 2
BACKGROUND

2.1 Queuing Model

Queues are commonly found in most human-engineered systems where there exist one or more shared resources. Any system where a customer requests a service from a finite-capacity resource may be considered to be a queuing system [1]. Grocery stores, theme parks, and fast-food restaurants are well-known examples of queuing systems. A queuing system can also be referred to as a system of flow. A new customer enters the queuing system and joins the queue (i.e., line) of customers unless there is no queue, and another customer who completes his service may exit the system at the same time. During the execution, a waiting line forms in a system because the arrival time of each customer is not predictable, and the service time often exceeds customer inter-arrival times. A significant number of arrivals makes each customer wait in line longer than usual. Queuing models are constructed by a scientist or engineer to analyze the performance of a dynamic system where waiting can occur. In general, the goals of a queuing model are to minimize the average number of waiting customers in a queue and to predict the estimated number of facilities in a queuing system. The performance results of a queuing model simulation are produced at the end of a simulation in the form of aggregate statistics.

A queuing model is described by its attributes [2, 6]: customer population, arrival and service pattern, queue discipline, queue capacity, and the number of servers. A new customer from the calling population enters into the queuing model and waits for service in the queue. If the queue is empty and the server is idle, a new customer is immediately sent to the server for service; otherwise the customer remains in the queue, joining the waiting line until the queue is empty and the server becomes idle. When a customer enters into the server, the status of the server becomes busy, not allowing any more

Figure 2-1. Components of a single server queuing model

arrivals to gain access to the server. After being served, a customer exits the system. Figure 2-1 illustrates a single server queue with its attributes.

The calling population, which can be either finite or infinite, is defined as the pool of customers who may request service in the near future. If the size of the calling population is infinite, the arrival rate is not affected by others. But the arrival rate varies according to the number of customers who have already arrived if the size of the calling population is finite and small. Arrival and service patterns are the two most important factors determining the behavior of queuing models. A queuing model may be deterministic or stochastic. For the stochastic case, new arrivals occur in a random pattern and their service time is obtained from a probability distribution. The arrival and service rates, based on observation, are provided as the values of parameters for stochastic queuing models. The arrival rate is defined as the mean number of customers per unit time, and the service rate is defined by the capacity of the server in the queuing model. If the service rate is less than the arrival rate, the size of the queue will grow infinitely. The arrival rate must be less than the service rate in order to maintain a stable queuing system [1, 6].
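The stability condition just stated (arrival rate strictly less than service rate) can be demonstrated with a small deterministic sketch of a single-server queue. This toy Python example, with fixed inter-arrival and service times and names of our own choosing, is our illustration rather than anything from the dissertation:

```python
def backlog(arrival_rate, service_rate, n_arrivals):
    """Deterministic single-server queue: customer i arrives at
    i / arrival_rate and each service takes 1 / service_rate.
    Returns how many customers are still in the system (waiting or
    in service) at the moment the last customer arrives."""
    inter = 1.0 / arrival_rate
    svc = 1.0 / service_rate
    depart = 0.0
    departures = []
    for i in range(n_arrivals):
        arrive = i * inter
        # service starts when both the customer and the server are ready
        depart = max(arrive, depart) + svc
        departures.append(depart)
    last_arrival = (n_arrivals - 1) * inter
    return sum(1 for d in departures if d > last_arrival)

# Stable case (arrival rate < service rate): the backlog stays tiny.
stable = backlog(arrival_rate=0.5, service_rate=1.0, n_arrivals=1000)
# Overloaded case (arrival rate > service rate): the backlog grows
# roughly in proportion to the number of arrivals.
overloaded = backlog(arrival_rate=2.0, service_rate=1.0, n_arrivals=1000)
```

With random (stochastic) arrivals and services the same threshold governs long-run behavior, which is why the arrival rate must stay below the service rate for the statistics in the next tables to converge.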


Table 2-1. Notations for queuing model statistics

  Notation   Description
  ar_i       Arrival time for customer i
  a_i        Inter-arrival time for customer i
  a          Average inter-arrival time
  λ          Arrival rate
  T          Total simulation time
  n          Number of arrived customers
  s_i        Service time of the ith customer
  μ          Service rate
  ss_i       Service start time of the ith customer
  d_i        Departure time of the ith customer
  q          Mean wait time
  w          Mean residence time
  ρ          Utilization
  B          System busy time
  I          System idle time

The randomness of arrival and service patterns causes the length of the waiting line in the queue to vary.

When a server becomes idle, the next customer is selected among candidates from the queue. The selection strategy for the queue is called the queue discipline. A queue discipline [6, 19] is a scheduling algorithm to select the next customer from the queue. The common queue disciplines are first-in first-out (FIFO), last-in first-out (LIFO), service in random order (SIRO), and the priority queue. The earliest-arrived customer is usually selected from a queue in the real world; thus the most common queue discipline is FIFO. In a priority queue discipline, each arrival has its own priority. The arrival that has the highest priority is chosen from the queue among the waiting customers.

The purpose of building a queuing model and running a simulation is to obtain meaningful statistics such as the server performance. The notations used for statistics are listed in Table 2-1, and the equations for key statistics are summarized in Table 2-2.


Table 2-2. Equations for key queuing model statistics

  Name                     Equation                 Description
  Inter-arrival time       a_i = ar_i - ar_(i-1)    Interval between two consecutive arrivals
  Mean inter-arrival time  a = (Σ a_i) / n          Average inter-arrival time
  Arrival rate             λ = n / T                The number of arrivals per unit time
                           λ = 1 / a                Long run average
  Mean service time        s = (Σ s_i) / n          Average time for each customer to be served
  Service rate             μ = 1 / s                Server capability per unit time
  Mean wait time           q = Σ(ss_i - ar_i) / n   Average time each customer spends waiting in a queue
  Mean residence time      w = Σ(d_i - ar_i) / n    Average time each customer stays in the system
  System busy time         B = Σ s_i                Total service time of the server
  System idle time         I = T - B                Total idle time of the server
  System utilization       ρ = B / T                The proportion of the time in which the server is busy

2.2 Discrete Event Simulation

2.2.1 Event Scheduling Method

Discrete event simulation changes the state variables at a discrete time when an event occurs. An event scheduling method [20] is the basic paradigm for discrete event simulation and is used along with a time-advance algorithm. The simulation clock indicates the current simulated time, the event time of the last event occurrence. The unprocessed, or future, events are stored in a data structure called the future event list (FEL). Events in the FEL are usually sorted in non-decreasing timestamp order. When the simulation starts, the head of the FEL is extracted from the FEL, updating the simulation clock. The extracted event is then sent to an event routine, where it reproduces a new event after


Figure 2-2. Cycle used for event scheduling

its execution. The new event is inserted into the FEL, sorting the FEL in non-decreasing timestamp order. This step is iterated until the simulation ends.

Figure 2-2 illustrates the basic cycle for event scheduling [20]. Three future events are stored in the FEL. When NEXT_EVENT is called, token ID #5 with timestamp 12 is extracted from the head of the FEL. The simulation clock then advances from 10 to 12. The event is executed at event routine 2, which creates a new future event, event #3. Token ID #5 with event #3 is scheduled and inserted into the FEL. Token ID #5 is placed between token ID #6 and token ID #3 after comparing their timestamps. The event loop iterates, calling NEXT_EVENT until the simulation ends.

The priority queue is the abstract data structure for an FEL. The priority queue involves two operations for processing and maintaining the FEL: insert and delete-min. The simplest way to implement the priority queue is to use an array or a linked list. These data structures store events in a linear order by event time but are inefficient


for large-scale models, since a newly inserted event compares its event time with all others in the sequence. An array and a linked list take O(N) time for insertion, and O(1) time for deletion on average, where N is the number of elements in these data structures. When an event is inserted, an array can be accessed faster than a linked list on the disk, since the elements of an array are stored contiguously. On the other hand, an FEL using an array requires its own dynamic storage management [20].

The heap and splay tree [21] are data structures typically used for an FEL. They are tree-based data structures and can execute operations faster than a linear data structure, such as an array. A min heap implemented as a height-balanced binary search tree takes O(log N) time for both insertion and deletion. A splay tree is a self-balancing binary tree in which accessing an element rearranges the tree, placing that element at the root. This makes recently accessed elements able to be quickly referenced again. The splay tree performs both operations in O(log N) amortized time. The heap and splay tree are therefore suitable data structures for a priority queue for a large-scale model.

Calendar queues [22] are operated by a hash function, which performs both operations in O(1), on average. Each bucket is a day that has a specific range, and each has a specific data structure for storing events in timestamp order. Enqueue and dequeue functions are operated by hash functions according to event time. The number of buckets and the range of a day are adjusted to operate the hash function efficiently. Calendar queues are efficient when events are equally distributed across the buckets, which minimizes the adjustment of bucket size.

2.2.2 Parallel Discrete Event Simulation

In traditional parallel discrete event simulation (PDES) [7, 23, 24], the model is decomposed into several logical processes (LPs), and each LP is assigned to a processor used for parallel simulation. Each LP runs its own independent part of the simulation with a local clock and state variables. When LPs need to communicate with each other, they send timestamped messages to each other over a system bus or via a networking system.


Each local clock advances at a different pace because the interval between consecutive events on an LP is irregular. For this reason, the timestamp of incoming events from other LPs can be earlier than that of the currently executed event. It is called a causality error if the incoming events are supposed to change a state variable to which the current event is referring. A causality violation can produce different results. As a result, a synchronization method is needed to process events in non-decreasing timestamp order and to preserve causal relationships across processors. The performance gains are not proportional to the increased number of processors due to the synchronization overhead. Conservative and optimistic approaches are the two main categories of synchronization.

2.2.2.1 Conservative synchronization

In conservative synchronization methods, each processor executes events only when it can guarantee that other processors will not send events with a smaller timestamp than that of the current event. Conservative methods can cause a deadlock situation between LPs because every LP can block an event if it is considered unsafe to process. Deadlock avoidance, and deadlock detection and recovery, are the two major challenges of conservative synchronization methods.

Chandy and Misra [25] and Bryant [26] developed a deadlock avoidance algorithm. The necessary and sufficient condition is that messages are sent to other LPs over the links in non-decreasing timestamp order, which guarantees that the processor will not receive an event with a lower timestamp than the previous one. A null message is sent to avoid the deadlock, indicating that the processor will not send a timestamped message smaller than the null message. The timestamp of a null message is determined by each incoming link, which provides the lower bound of the timestamp at which the next event can occur. The lower bound is determined by knowledge of the simulation such as lookahead, the minimum timestamp increment for a message passing between LPs. Variations of the null message method have tried to reduce the number of null


messages based on demand, since the amount of null message traffic can degrade performance [27].

The deadlock detection and recovery approach proposed by Chandy and Misra [28] tried to eliminate the use of null messages. The deadlock recovery approach allows the processors to become deadlocked. When the deadlock is detected, the recovery function is called. A controller, used to break the deadlock, identifies the event containing the smallest timestamp among the processors, and sends messages to that LP indicating that the event is safe to process.

Barrier synchronization is one of the conservative synchronization approaches. The lower bound on the timestamp (LBTS) is calculated based on the time of the next event, and lookahead determines the time when all processors stop execution to safely process the event. The events are executed only if their timestamps are less than the LBTS. The distance between LPs is often used to determine the LBTS since it implies the minimum time to transmit an event from one LP to another, as in air traffic simulation.

Conservative approaches are easy to implement, but performance relies on lookahead. Lookahead is the minimum time increment when a new event is scheduled; thus lookahead (L) guarantees that no other events containing a smaller timestamp are generated until the current clock plus L. Lookahead is used to predict the next incoming events from other processors when the processor determines whether the current event is safe. If the lookahead is too small or zero, the currently executed event can cause all events on the other LPs to wait. In this case, the events are executed nearly sequentially.

(LBTS is defined as the "lower bound on the timestamp" of any message an LP can receive in the future in [7], p. 77.)


2.2.2.2 Optimistic synchronization

In optimistic methods, each processor executes its own events regardless of those received from other processors. However, each processor has to roll back the simulation when it detects a causality error from event execution in order to recover the system. Rollback in a parallel computing environment is a complicated process because some of the messages sent to other LPs also need to be canceled.

Time Warp [29] is the most well-known scheme in optimistic synchronization. Time Warp has two major parts: the local and global control mechanisms. The local control mechanism assumes that each local processor executes the events in timestamp order using its own local virtual clock. When an LP sends a message to others, an identical message, except for one field, is created. The original message sent from the LP has a positive sign, and its corresponding copy, called an antimessage, has a negative one. Each LP maintains three queues. The state queue contains snapshots of the recent states at an instant in time in the LP. The state is changed whenever an event occurs, and enqueued at the state queue. Received messages from other LPs are stored in an input queue in timestamp order. The antimessages, produced by the LP itself, are stored in the output queue. When the timestamp of an arriving event is earlier than the local virtual time of the LP, the LP encounters a causality error. The state is restored from the state queue to a point prior to the timestamp of the current arrival message. Antimessages are dequeued from the output queue and sent to other LPs if their timestamps are between the arrival event and the local virtual time. When an LP receives an antimessage, the pair annihilate each other to cancel the future event if the input queue contains the corresponding positive message. The LP is rolled back by an antimessage if the corresponding positive messages have already been executed.

Global virtual time (GVT) gives a way to solve some problems of the Time Warp local control mechanism, such as memory management, global control of rollback, and the safe commitment time. The GVT is defined by the minimum of


the local virtual times among LPs and the timestamps of messages in transit, and serves as a lower bound for the virtual times of the LPs. GVT allows efficient memory management because an LP does not need to maintain previous states whose execution times are earlier than the GVT. Duplicate antimessages are often produced while the LP reevaluates the antimessages, causing a performance problem. Lazy cancellation waits to send the antimessage until the LP checks whether the re-execution produces the same messages, whereas lazy reevaluation uses state vectors, instead of messages, to solve this problem [7].

In the optimistic approach, past states are saved for recovery, but this carries one of the most significant drawbacks regarding memory management. State saving [30] makes copies of the past states during simulation. Copy state saving (CSS) copies the entire state of the simulation before each event occurs. CSS is the easiest method of state saving, but its two drawbacks are the huge memory consumption needed to save the entire state and the performance overhead during rollback. Periodic state saving (PSS) sets a checkpoint at intervals, skipping a few events. The performance is improved with PSS, but all state values still have to be saved at each checkpoint. Incremental state saving (ISS) is a method based on backtracking. Only the values and addresses of modified variables are stored before the events execute. The old values are written back to the variables in reverse order when the states need to be restored. ISS reduces the memory consumption and execution overheads, but the programmer has to add the modules to handle each variable.

Reverse computation (RC) [31] was proposed to overcome the limitation of the state saving method for forward computation. RC does not save the values of state variables during simulation. Computation is performed in reverse order to recover the values of state variables until it reaches the checkpoint at which the rollback was initiated. RC uses bit variables to track the changes; thus it can drastically reduce the memory consumption during simulation, especially for fine-grained models.


2.2.2.3 A comparison of the two methods

Each synchronization approach has a drawback [32]. It takes considerable time to run a simulation with zero lookahead in the conservative method. It is also very difficult to roll back a simulation system to a previous state without error if we run the simulation with a complicated model using the optimistic method. In general, the optimistic method has an advantage over the conservative one in that execution is allowed where a causality error is possible but does not actually occur. In addition, the conservative method often needs application-specific information to determine when it is safe to process the events, whereas this is not very relevant to an optimistic approach [23]. In some cases, a very small lookahead makes it impossible to continue the simulation in parallel, though it remains possible sequentially. Finding the lookahead, and its size, can be the critical factors determining the performance gains of the conservative method [24]. However, the optimistic mechanism is much more complex to implement, and frequent rollback causes more computation overhead for a compute-intensive system. If the model is too complex to apply the optimistic method, the conservative method is a better choice. On the other hand, if a very small lookahead is expected, the optimistic method has to be applied.

2.3 GPU and CUDA

2.3.1 GPU as a Coprocessor

A GPU is a dedicated graphics processor that renders 3D graphics in real time, which requires tremendous computational power. The computation speed of the GeForce 8800 GTX is approximately four times faster than that of an Intel Core 2 Quad processor at 3.0 GHz, which is approximately twice as expensive as the GeForce 8800 GTX [13]. The increase in CPU clock speed has slowed since 2003 due to physical limitations, so Intel and AMD turned their attention to multi-core architectures [33]. On the other hand, GPU speed is still growing because more transistors can be used for parallel data processing than for data caching and flow control on the GPU. Programmability is another reason that the GPU has


become attractive. The vertex and fragment processors can be customized with the user's own program.

The GPU has different features compared to the CPU [16]. The CPU is designed to process general purpose programs. For this reason, CPU programming models and their processes are generally serial, and the CPU enables complex branch controls. The GPU, however, is dedicated to processing pixel images in real time; thus it has much more parallelism than the CPU does. The CPU returns memory references quickly to process as many jobs as possible, maximizing its throughput and minimizing the memory latency. As a result, a single thread on a CPU can produce higher performance compared to one on a GPU. On the other hand, the GPU maximizes parallelism through threads. The performance of a single thread on a GPU is not as good compared to that on a CPU, but the execution of threads in a massively parallel fashion hides the memory latency, producing high throughput from parallel tasks. In addition, more transistors on the GPU are dedicated to data computation rather than data caching and flow control. The GPU can take a great advantage over a CPU when a cache miss occurs [34].

Despite many advantages, harnessing the power of the GPU has been considered difficult because GPU-specific knowledge, such as graphics APIs and hardware, is needed to deal with the programmable GPU. Traditional GPUs have two types of programmable processors: vertex and fragment [35]. Vertex processors transform streams of vertices, which are defined by positions, colors, textures and lighting. The transformed vertices are converted into fragments by the rasterizer. Fragment processors compute the color of each pixel to render the image. Graphics shader programming languages, such as Cg [36] and HLSL [37], allow the programmer to write the code for the vertex and fragment processors in a high-level programming language. Those languages are easy to learn, compared to assembly language, but are still graphics-specific, assuming that the user has basic knowledge of interactive graphics


programming. The program, therefore, needs to be written in a graphics fashion using textures and pixels, by mapping the computational variables to graphics primitives using a graphics API [38], such as DirectX or OpenGL, even for general purpose computations.

Another problem was the constrained memory layout and access. The indirect write, or scatter, operation was not possible because there is no write instruction in the fragment processor [39]. As a result, the implementation of sparse data structures, such as lists and trees, where scattering is required, is problematic, removing flexibility in programming. The CPU can handle memory easily because it has a unified memory model, but this is not trivial on the GPU because memory cannot be written anywhere [35]. Finally, the advent of the GeForce 8800 GTX GPU and CUDA eliminated these limitations and provided an easier solution for the programmer.

2.3.2 Stream Processing

Stream processing [15, 16] is the basis of the GPU programming model today. An application for stream processing is divided into several parts for parallel processing. Each part is referred to as a kernel, which is a programmed function to process the stream and is independent of the incoming stream. The stream is a sequence of elements of the same type, and it requires the same instruction for computation. Figure 2-3 shows the relationship between the stream and the kernel. The stream processing model can process the input stream on each ALU at the same kernel in parallel, since each element of the input stream is independent of the others. Also, stream processing allows many streams to be processed concurrently at different kernels, which hides the memory latency and communication delay. However, the stream processing model is less flexible and not suitable for general purpose programs with randomized data access, because the stream is directly passed to other kernels connected in sequence after it is processed. Stream processing can consist of several stages, each of which has several kernels. Data parallelism is exploited by processing


Figure 2-3. Stream and kernel

many streams in parallel at each stage, and task parallelism is exploited by running several stages concurrently.

Many cores can be utilized concurrently with a stream programming model. For example, the GeForce 8800 GTX has 16 multiprocessors, and each can have a maximum of 768 threads. Theoretically, approximately ten thousand threads can be executed in parallel, yielding high-performance parallelism.

2.3.3 GeForce 8800 GTX

The GeForce 8800 GTX [12, 13] GPU is the first GPU model unifying vertex, geometry and fragment shaders into 128 individual stream processors. Previous GPUs have the classic pipeline model with a number of stages to render the image from the vertices. Many passes inside the GPU consume the bandwidth. Moreover, some stages are not required to process general purpose computations, which degrades the performance of processing general purpose workloads on the GPU. Figure 2-4 [40] illustrates the difference in pipeline stages between the traditional and GeForce 8 series GPUs. In the GeForce 8800 GTX GPU, the shaders have been unified into the stream processors, which reduces the number of pipeline stages and changes the sequential processing into loop-oriented processing. Unified stream processors help to improve load balancing. Any graphical data can be assigned to any available


Figure 2-4. Traditional vs. GeForce 8 series GPU pipeline

stream processor, and its output stream can be used as an input stream of other stream processors.

Figure 2-5 [41] shows the GeForce 8800 GTX architecture. The GPU consists of 16 stream multiprocessors (SMs). Each SM has 8 stream processors (SPs), which makes a total of 128. Each SP contains a single arithmetic unit that supports IEEE 754 single-precision floating-point arithmetic and 32-bit integer operations, and can process instructions in SIMD fashion. Each SM can take up to 8 blocks or 768 threads, which makes for a total of 12,288 threads, and the 8,192 registers on each SM can be dynamically allocated to the threads running on it.


Figure 2-5. GeForce 8800 GTX architecture

2.3.4 CUDA

CUDA [13] is a C-language API for utilizing the NVIDIA class of GPUs. CUDA, therefore, does not impose a tough learning curve, and provides a simplified solution for those who are not familiar with graphics hardware and APIs. The user can focus on the algorithm itself rather than on its implementation with CUDA. When a program is written in CUDA, the CPU is a host that runs the C program, and the GPU is a device that operates as a co-processor to the CPU. The application is programmed into a C function, called a kernel, and downloaded to the GPU when compiled. Since the kernel uses memory on the GPU, memory allocation and data transfer from the CPU to the GPU need to be done before the kernel invocation.

CUDA exploits data parallelism by utilizing a massive number of threads simultaneously after partitioning larger problems into smaller elements. A thread is the basic unit of execution; it uses its unique identification to exclusively access its part of the elements in the data. The much smaller cost of creating and switching threads (as compared to the higher costs associated with the CPU) makes the GPU more efficient when running in parallel. The programmer organizes the threads in a two-level hierarchy.


A kernel invocation creates a grid (the unit of execution of a kernel). A grid consists of a group of thread blocks that executes a single kernel with the same instruction and different data. Each thread block consists of a batch of threads that can share data with other threads through a low-latency shared memory. Moreover, their executions are synchronized within a thread block to coordinate memory accesses by barrier synchronization using the __syncthreads() function. Threads in the same block need to reside on the same SM for efficient operation, which restricts the number of threads in a single block.

In the GeForce 8800 GTX, each block can take up to 512 threads. The programmer determines the degree of parallelism by assigning the number of threads and blocks for executing a kernel. The execution configuration has to be specified when invoking the kernel on the GPU, by defining the number of grids, blocks and bytes in shared memory per block, in an expression of the following form, where the memory size is optional:

KernelFunc<<<dimGrid, dimBlock, sharedMemBytes>>>(parameters);

The corresponding function is defined by __global__ void KernelFunc(parameters) on the GPU, where __global__ denotes the computing device, or GPU. Data are copied from the host, or CPU, to global memory on the GPU and are loaded into shared memory. After performing the computation, the results are copied back to the host via PCI-Express.

Each SM processes a grid by scheduling batches of thread blocks, one after another, but block ordering is not guaranteed. The number of thread blocks in one batch depends upon the degree to which the shared memory and registers are assigned, per block and per thread, respectively. The currently executed blocks are referred to as active blocks, and each one is split into groups of threads called warps. The number of threads in a warp is called the warp size, and it is set to 32 on the GeForce 8 series. At each clock cycle, the threads in a warp are physically executed in parallel. The warps are executed alternately by time-slicing scheduling, which hides the memory access


Figure 2-6. Execution between the host and the device

latency. The number of thread blocks can increase if we can decrease the size of shared memory per block and the number of registers per thread. However, the kernel fails to launch if the shared memory available per thread block is insufficient.

The overall performance depends on how effectively the programmer assigns those threads and blocks, keeping as many threads busy as possible. Each SM can usually be composed of 3 thread blocks with 256 threads, or 6 blocks with 128 threads. The 16 KB of shared memory on each SM is divided among its thread blocks, which can limit the number of threads in a thread block and the number of elements for which each thread is responsible. Figure 2-6 shows the interaction between the host and the device. The host executes the C program in sequence before invoking kernel 1. A kernel invocation creates a grid, which includes a number of blocks and threads, and maps one or more blocks onto one SM. After executing kernel 2 in parallel on the device, the host continues to execute the program.


CHAPTER 3
RELATED WORK

3.1 Discrete Event Simulation on SIMD Hardware

In the 1990s, efforts were made to parallelize discrete event simulations using a SIMD approach. Given a balanced workload, SIMD had the potential to significantly speed up simulations. The research performed in this area was focused on replication. The processors were used to parallelize the choice of parameters by implementing a standard clock algorithm [42, 43]. Ayani and Berkman [44] used SIMD for parallelizing simultaneous event executions, but SIMD was determined to be a poor choice because of the uneven distribution of timed events. There was a need to fill the gap between asynchronous applications and synchronous machines so that the SIMD machine could be utilized for asynchronous applications [45].

Recently, the computer graphics community has widely published on the use of the GPU for physical and geometric problem solving, and for visualization. These types of models have the property of being decomposable over a variable or parameter space, such as cellular automata [46] for discrete spaces and partial differential equations (PDEs) [47, 48] for continuous spaces. Queuing models, however, do not strictly adhere to the decomposability property.

Perumalla [49] has performed a discrete event simulation on a GPU by running a diffusion simulation. Perumalla's algorithm selects the minimum event time from the list of update times, and uses it as a time-step to synchronously update all elements on a given space throughout the simulation period. This approach is useful if a single event in the simulation model causes large amounts of computation, where the event occurrences are not so frequent. Queuing models, in contrast, have many events, but each event does not require significant computation. A number of events with different timestamps in queuing model simulations could make the execution nearly sequential with this algorithm.


Xu and Bagrodia [50] proposed a discrete event simulation framework for network simulations. They used the GPU as a co-processor to distribute compute-intensive workloads for high-fidelity network simulations. Other parallel computing architectures are combined to perform the computation in parallel. A field programmable gate array (FPGA) and a Cell processor are included for task-parallel computation, and a GPU is used for data-parallel computation. A fluid-flow-based TCP model and a high-fidelity physical layer model are exploited to utilize the GPU. The former is modeled with driven differential equations, and the latter uses an adaptive antenna algorithm which recursively updates the weights of the beamformers using least squares estimation. The event scheduling method on the CPU sends those compute-intensive events to the GPU whenever the events occur.

These two examples showed the methodology of running a discrete event simulation on the GPU, but neither method is applicable for the purpose of improving the performance of queuing model simulations on the GPU. In these GPU simulations, 2D or 3D spaces represent the simulation results, and the spaces are implemented in arrays on the GPU. Their models are easily adapted to the GPU by partitioning the result array and computing each partition in parallel, since a single event in their simulation models updates all elements in the result array at once. However, an individual event in a queuing model makes changes only to a single element (e.g., a service facility) in the result array, which makes it difficult to parallelize queuing model simulations. Queuing model simulations need to have many concurrent events to benefit from the GPU.

Lysenko and D'Souza [51] proposed a GPU-based framework for large scale agent-based model (ABM) simulations. In ABM simulation, sequential execution using discrete event simulation techniques makes the performance too inefficient for large scale ABMs. Data-parallel algorithms for environment updates, and for agent interaction, death, and birth, were therefore presented for GPU-based ABM simulation. This study used an iterative


randomized scheme so that agent replication could be executed in O(1) average time in parallel on the GPU.

3.2 Tradeoff between Accuracy and Performance

Some studies of parallel simulation have focused on enhancing performance at the expense of accuracy, while others have focused on accuracy with a view to improving performance. Tolerant synchronization [52] uses the lock-step method to process the simulation conservatively, but it allows the processor to execute an event optimistically if the timestamp is less than the tolerance point in the synchronization. The recovery procedure is not called, even if a causality error occurs, until the timestamp reaches the tolerance point.

Synchronization with a fixed quantum is a lock-step synchronization [53] that ensures that all events are properly synchronized before advancing to the next quantum. However, a quantum that is too small causes a significant slowdown of the overall execution time. In an adaptive synchronization technique [54], the quantum size is adjusted based on the number of events at the current lock-step. A dynamic lock-step value improves the performance with a larger quantum value, thus reducing the synchronization overhead when the number of events is small and the error rate is low.

State-matching is the most dominant overhead in a time-parallel simulation [7], as is synchronization in a space-parallel simulation. If the initial and final states are not matched at the boundary of a time interval, re-computation of those time intervals degrades simulation performance. Approximation simulations [55, 56] have been used to improve the simulation performance, albeit with a loss of accuracy.

Fujimoto [32] proposed exploitation of temporal uncertainty, which introduces approximate time. Approximate time is a time interval for the execution of the event, rather than a precise timestamp, and it is assigned to each event based on its timestamp. When approximate time is used, the time intervals of events on different LPs can be overlapped on the timeline at one common point. Whereas events on different


LPs have to wait for a synchronization signal with a conservative method when a precise timestamp is assigned, approximate-timed events can be executed concurrently if their time intervals overlap with each other. The performance is improved due to increased concurrency, but at the cost of accuracy in the simulation result. Our approach differs from this method in that we do not assign a time interval to each event: instead, events are clustered at a time interval when they are extracted from the FEL. In addition, approximate time is executed based on a MIMD scheme that partitions the simulation model, whereas our approach is based on a SIMD scheme.

3.3 Concurrent Priority Queue

The priority queue is the abstract data structure that has been widely used as an FEL for discrete event simulation. The global priority queue is commonly used and accessed sequentially for the purpose of ensuring consistency in PDES on shared memory multiprocessors. Concurrent access to the priority queue has been studied because sequential access limits the potential speedup in parallel simulation [17, 18]. Most concurrent priority queue approaches have been based on mutual exclusion, locking part of a heap or tree when inserting or deleting events so that other processors would not access the currently updated element [57, 58]. However, this blocking-based algorithm limits potential performance improvements to a certain degree, since it involves several drawbacks, such as deadlock and starvation, which cause the system to be in idle or wait states. The lock-free approach [59] avoids blocking by using atomic synchronization primitives and guarantees that at least one active operation can be processed. PDES implementations that use a distributed FEL or message queue have improved their performance by optimizing the scheduling algorithm to minimize the synchronization overhead and to hide communication latency [60, 61].

3.4 Parallel Simulation Problem Space

The parallel simulation problem space can be classified using time-space and classes of parallel computers, as shown in Figure 3-1. Parallel simulation models fall into two


Figure 3-1. Diagram of parallel simulation problem space

Table 3-1. Classification of parallel simulation examples

  Index  Examples
  (1)    Ordinary differential equations [62]
  (2)    Reservoir simulation [63]
  (3)    Cloud dynamics [47], N-body simulation [48]
  (4)    Chandy and Misra [25], Time Warp [29]
  (5)    Ayani and Berkman [44], Shu and Wu [45]
  (6)    Partial differential equations [64]
  (7)    Cellular automata [65]
  (8)    Retina simulation [46]
  (9)    Diffusion simulation [49], Xu and Bagrodia [50]
  (10)   Our queuing model simulation

major categories: continuous and discrete. Most physical simulations are continuous simulations (i.e., ordinary and partial differential equations, cellular automata); however, complex human-made systems (i.e., communication networks) tend to have a discrete structure. Discrete models can be categorized into two, in regard to the behavior of the simulation models: asynchronous (discrete-event) and synchronous (time-stepped) models. Asynchronous models can be further classified according to how the partitioning is done. The examples of each branch in Figure 3-1 are summarized in Table 3-1.


CHAPTER 4
A GPU-BASED APPLICATION FRAMEWORK SUPPORTING FAST DISCRETE EVENT SIMULATION

4.1 Parallel Event Scheduling

SIMD-based computation has a bottleneck problem in that some operations, such as instruction fetch, have to be implemented sequentially, which causes many processors to be halted. Event scheduling in SIMD-based simulation can be considered a step of instruction fetch that distributes the workload to each processor. The sequential operations in a shared event list can be crucial to the overall performance of simulation for a large-scale model. Most implementations of the concurrent priority queue have been run on MIMD machines. Their asynchronous operations reduce the number of locks held at any instant of simulated time. However, it is inefficient to implement a concurrent priority queue with a lock-based approach on SIMD hardware, especially a GPU, because the point in time when multiple threads access the priority queue is synchronized. It produces many locks that are involved in mutual exclusion, making their operations almost sequential. Moreover, sparse and dynamic data structures, such as heaps, cannot be directly developed on the GPU, since the GPU is optimized to process dense and static data structures such as linear arrays.

Both insert and delete-min operations re-sort the FEL in timestamp order. Other threads cannot access the FEL during the sort, since all the elements in the FEL are sorted if a linear array is used as the data structure for the FEL. The concept of parallel event scheduling is that the FEL is divided into many sub-FELs, and only one of them is handled by each thread on the GPU. An element index that is used to access an element in the FEL is calculated from the thread ID combined with the block ID, which allows each thread to access its elements in parallel without any interference from other threads. In addition, keeping the global FEL unsorted guarantees that each thread can access its elements, regardless of the operations of other threads. The number of


while (current time is less than simulation time)   // executed by multiple threads
    minimumTimestamp = ParallelReduction(FEL);
    foreach local FEL by each thread in parallel do
        currentEvent = ExtractEvent(minimumTimestamp);
        nextEvent = ExecuteEvent(currentEvent);
        ScheduleEvent(nextEvent);
    endforeach
endwhile

Figure 4-1. The algorithm for parallel event scheduling

elements that each thread is responsible for processing at the current time is calculated by dividing the number of elements in the FEL by the number of threads.

As a result, the heads of the global FEL and of each local FEL accessed by each thread are not the events with the minimum timestamp. Instead, the smallest timestamp is determined by parallel reduction [14, 66], using multiple threads. With this timestamp, each thread compares the minimum timestamp with that of each element in its local FEL to find and extract the currently active events (delete-min). After the current events are executed in parallel, new events are created by the current events. The currently extracted elements in the FEL are re-written by updating their attributes, such as the event and its time (insert). The algorithm for parallel event scheduling on the GPU is summarized in Figure 4-1.

Additional operations are needed for a queuing model simulation. The purpose of discrete event simulation is to analyze the behavior of a system [67]. In a queuing model simulation, a service facility is the system to be analyzed. Service facilities are modeled in arrays as resources that contain information regarding server status, current customers, and their queues. Scheduling an incoming customer to the service facility (Arrival), releasing the customer after its service (Departure), and manipulating the queue when the server is busy are the service facility operations. Queuing model simulations also benefit from the tens of thousands of threads on the GPU. However,


there are some issues to be considered, since the arrays of both the FEL and the service facility reside in global memory, and threads share them.

4.2 Issues in a Queuing Model Simulation

4.2.1 Mutual Exclusion

Most simulations that are run on a GPU use 2D or 3D spaces to represent the simulation results. The spaces, and the variables used for updating those spaces, are implemented as arrays on the GPU. The result array is updated based on the variable arrays throughout the simulation. For example, the velocity array is used to update the result array through a partial differential equation in a fluid simulation. The result array is dependent on the variable arrays, but not vice versa: the changes of velocity make the result different, but the result array does not change the velocity. This kind of update is one-directional. Mutual exclusion is not necessary, since each thread is responsible for a fixed number of elements and does not interfere with other threads.

However, the updates in a queuing model simulation are bi-directional. One event simultaneously updates both the FEL and service facility arrays. Bi-directional updates occurring at the same time may cause their results to be incorrect, because one of the element indexes (either the FEL or the service facility) cannot be accessed by other threads independently. For example, consider a concurrent request to the same service facility that has only one server, as shown in Figure 4-2A. Both threads try to schedule their token to the server because its idle status is read by both threads at the same time. The simultaneous writing to the same location leads to the wrong result in thread #1, as shown in Figure 4-2B. We need a mutual exclusion algorithm because data inconsistency can occur when updating both arrays at the same time. The mutual exclusion involved in this environment is different from the case of the concurrent priority queue, in that two different arrays concurrently attempt to update each other and are accessed by the same element index.


A) A concurrent request from two threads. B) The incorrect results for a concurrent request; the status of token #1 should be Queue.

Figure 4-2. The result of a concurrent request from two threads without a mutual exclusion algorithm

The simplest way to implement mutual exclusion is to separate both updates. Alternate access between the FEL and the service facility can resolve this problem. When updates are happening in terms of the FEL, each extracted token in the FEL stores information about the service facility, indicating that an update is required at the next step. Service facilities are then updated based on these results. Each service facility searches the FEL to find the extracted tokens that are related to itself at the current time.


Then, the extracted tokens are placed into the server or queue at the service facility for an arrival event, or the status of the server turns to idle for a departure event. Finally, the locations of extracted tokens in the FEL are updated using the results of the updated service facility.

One of the biggest problems for discrete event simulation on a GPU is that events are selectively updated. A few events occurring at one event time make it difficult to fully utilize all the threads at once. If the model has as many concurrent events as possible, this approach is more efficient. One approach to improving performance is to cluster events into one event time. If the event time can be rounded to integers or one decimal place, more events can occur concurrently.

However, a causality error can occur because two or more tokens with different timestamps may end up with the same timestamp, due to the rounding of the timestamp. The correct order must be maintained; otherwise the statistical results produced will be different. Wieland [68] proposed a method to treat simultaneous events. The event times of simultaneous events are altered by adding or subtracting a threshold so that each event has a different timestamp. His method deals with originally simultaneous events whose correct order is unknown, but simultaneous events in our method were non-simultaneous events before their timestamps were rounded. We use an original timestamp to maintain the correct event order for simultaneous events. If two tokens arrive at the same service facility with the same timestamp due to the rounding of the timestamp, the token with the smaller original timestamp has priority. An original timestamp is maintained as one of the attributes in the token. For originally simultaneous events, the service facility randomly breaks the tie and determines their order. The pseudocode for the mutual exclusion algorithm with clustering events is summarized in Figure 4-3.


// update the FEL
foreach token in the FEL by each thread in parallel do
    if (Token.Time is less than or equal to the rounded minimum timestamp)
        Token.Extracted = TRUE;
    endif
endforeach

// update the service facility
foreach service facility by each thread in parallel do
    foreach token in the FEL do
        if (Token.Extracted == TRUE && Token.Facility == currentServiceFacility)
            if (Token.Event == DEPARTURE)
                Facility.ServerStatus = IDLE;
            else if (Token.Event == ARRIVAL)
                Add the token into the requestTokenList;
            endif
        endif
    endforeach
    // sort the current token list in original timestamp order
    sortedTokenList = Sort(requestTokenList);
    if (Facility.ServerStatus == BUSY)
        Place all tokens into the queue in sorted order;
    else if (Facility.ServerStatus == IDLE)
        Place the head token from sortedTokenList into the server, and place the others into the queue;
    endif
endforeach

// update the FEL
foreach token in the FEL by each thread in parallel do
    if (Token.Extracted == TRUE)
        Token.Extracted = FALSE;
        Token.Time = nextEventTime;
        Token.Event = nextEvent;
        Token.Status = SERVED or QUEUE;
    endif
endforeach

Figure 4-3. A mutual exclusion algorithm with clustering events
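The rounding and tie-breaking rules used above can be sketched in plain C as a sequential stand-in (the names and the integer-rounding interval are hypothetical; only the rule itself comes from the text):

```c
#include <assert.h>

/* A token carries its unrounded original timestamp as one of its attributes. */
typedef struct { float original_time; int id; } Token;

/* Cluster events onto integer event times (assumes non-negative timestamps). */
static float rounded_time(float t) {
    return (float)(int)(t + 0.5f);
}

/* When two tokens reach the same facility at the same rounded event time,
   the smaller original timestamp has priority; originally simultaneous
   events may be ordered arbitrarily (the facility breaks ties randomly). */
static int winner(const Token *a, const Token *b) {
    if (a->original_time < b->original_time) return a->id;
    if (b->original_time < a->original_time) return b->id;
    return a->id;   /* stand-in for the random tie-break */
}
```

Two tokens with original timestamps 3.4 and 2.6 both round to event time 3, but the second token keeps its priority because its original timestamp is smaller.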


4.2.2 Selective Update

The alternate update that is used for mutual exclusion produces another issue. Each extracted token in the FEL has information about the service facility, whereas each service facility does not know which token has requested service at the current time during the alternate update. Each service facility searches the entire FEL to find the requesting tokens, which is executed in O(N) time. This sequential search significantly degrades performance, especially for a large-scale model simulation. The number of searched tokens for each service facility, therefore, needs to be reduced for the performance of parallel simulations. One of the solutions is to use the incoming edges of each service facility, because a token enters the service facility only from the incoming edges. If we limit the number of searches to the number of incoming edges, the search time is reduced to O(maximum number of edges) time.

A departure event can be executed at the first step of mutual exclusion because it does not cause any type of thread collision between the FEL and the service facility. For a departure event, no other concurrent requests for the same server at the service facility exist, since the number of released tokens from one server is always one. Therefore, each facility can store the just-released token when a departure event is executed. Each service facility refers to its neighbor service facilities to check whether they released a token at the current time during its own update. Performance may depend on the simulation model, because search time depends on the maximum number of edges among service facilities.

4.2.3 Synchronization

Threads in the same thread block can be synchronized with shared local memory, but the executions of threads in different thread blocks are completely independent of each other. This independent execution removes the dependency of assignments between the thread blocks and processors, allowing thread blocks to be scheduled across any processor [14].
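The selective-update idea in Section 4.2.2 can be sketched as follows (the data layout and names are hypothetical; the point is that each facility inspects only the facilities on its incoming edges, not the whole FEL):

```c
#include <assert.h>

#define MAX_EDGES 4   /* maximum number of incoming edges per facility */

typedef struct {
    int incoming[MAX_EDGES];   /* facility IDs on incoming edges, -1 if unused */
    int released_token;        /* token released this step, or -1 for none */
} Facility;

/* Count arrivals at facility f by checking only its incoming edges:
   O(maximum number of edges) instead of an O(N) scan of the FEL. */
static int count_arrivals(const Facility *net, int f) {
    int count = 0;
    for (int e = 0; e < MAX_EDGES; e++) {
        int src = net[f].incoming[e];
        if (src >= 0 && net[src].released_token >= 0)
            count++;
    }
    return count;
}
```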


For a large-scale queuing model, arrays for both the FEL and the service facility reside in global memory. Both arrays are accessed and updated by an element ID in sequence. If these steps are not synchronized, some indexes are used to access the FEL, while others are used to update the service facility. The elements in both arrays may then have incorrect information when updated.

We can expect the same effect as synchronization between blocks if the kernel is decomposed into multiple kernels [66]. Alternate accesses between both arrays need to be developed as multiple kernels, and invoking these kernels in sequence from the CPU explicitly synchronizes the thread blocks. One of the bottlenecks in CUDA implementation is data transfer between the CPU and GPU, but sequential invocations of kernels provide a global synchronization point without transferring any data between them.

4.3 Data Structures and Functions

4.3.1 Event Scheduling Method

FEL: Let a token denote any type of customer that requests service at a service facility. The FEL is therefore a collection of unprocessed tokens, and tokens are identified by their ID without being sorted in non-decreasing timestamp order. Each element in the FEL has its own attributes: token ID, event, time, facility, and so on. An FEL is represented as a two-dimensional array, and each one-dimensional array consists of the attributes of a token. Table 4-1 shows an instantaneous status of the FEL with some of the attributes. For example, token ID #3 will arrive at facility #3 at the simulation time of 2. Status represents the specific location of the token at the service facility. Free is assigned when the token is not associated with any service facility. Token #1, placed in the queue of facility #2, cannot be scheduled for service until the server becomes idle.

Finding the Minimum Timestamp: NextEventTime: The minimum timestamp is calculated by parallel reduction without re-sorting the FEL. Parallel reduction is a tree-based approach, and the number of comparisons is cut in half at each step. Each


Table 4-1. The future event list and its attributes
Token ID   Event       Time   Facility   Status
#1         Arrival     2      #2         Queue
#2         Departure   3      #3         Served
#3         Arrival     2      #3         Free
#4         Departure   4      #1         Served

thread finds the minimum value by comparing a fixed length of input. The number of threads that is used for comparison is also cut in half after each thread completes calculating the minimum value from its input. Finally, the minimum value is stored in thread ID 0. The minimum timestamp is calculated by invoking the NextEventTime function, which returns the minimum timestamp. The CUDA-style pseudocode for NextEventTime is illustrated in Figure 4-4. We have modified the parallel reduction code [66] in the NVIDIA CUDA software development kit to develop the NextEventTime function.

Comparison of elements using global memory is very expensive, and additional memory spaces are required so that the FEL is prevented from being re-sorted. Iterative executions allow the shared memory to be used for a large-scale model, although the shared memory, 16 KB per thread block, is too small to be used for a large-scale model. As an intermediate step, each block produces one minimum timestamp. At the start of the next step, comparisons of the results between the blocks should be synchronized. In addition, the number of threads and blocks used for comparison at the block-level step will be different from those used at the thread-level step, due to the size of the remaining elements. The different number of threads and blocks at the various steps, as well as the need for global synchronization, requires that parallel reduction be invoked from the CPU.

Event Extraction and Approximate Time: NextEvent: When the minimum timestamp is determined, each thread extracts the events with the smallest timestamp by calling the NextEvent function. Figure 4-5 shows the pseudocode for the NextEvent


__global__ void NextEventTime(float *FEL, float *minTime, int ThreadSize)
{
    __shared__ float eTime[BlockSize];
    int tid = threadIdx.x;
    int eid = blockIdx.x * BlockSize + threadIdx.x;
    int m = 0, j = 0, k = 1;

    // copy some parts of the event times from the FEL to shared memory
    for (int i = eid * ThreadSize; i < (eid + 1) * ThreadSize; i++) {
        eTime[tid * ThreadSize + m] = FEL[i * numOfTokenAttr + Time];
        m++;
    }

    for (int i = 1; i < BlockSize; i *= 2) {
        // comparison within a thread
        if (i < ThreadSize) {
            if (eTime[tid * ThreadSize + j * i] > eTime[tid * ThreadSize + k * i]) {
                eTime[tid * ThreadSize + j * i] = eTime[tid * ThreadSize + k * i];
            }
            j = j + 2;
            k = k + 2;
        }
        // comparison between threads
        else {
            if ((tid % ((2 * i) / ThreadSize) == 0) && (eTime[tid] > eTime[tid + i])) {
                eTime[tid] = eTime[tid + i];
            }
        }
        __syncthreads();
    }

    // copy the minimum value to global memory
    if (tid == 0) {
        minTime[blockIdx.x] = eTime[0];
    }
}

Figure 4-4. Pseudocode for NextEventTime
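The two-level structure of Figure 4-4 (each block writes one minimum to `minTime[]`, and a further reduction finds the global minimum) can be modeled sequentially in plain C; the loop over `b` stands in for the thread blocks, and the names are hypothetical:

```c
#include <assert.h>
#include <float.h>

/* Block-level phase: each "block" reduces its slice of the event times and
   writes one minimum per block, as NextEventTime writes minTime[blockIdx.x]. */
static void block_minima(const float *fel_times, int n, int block_size,
                         float *min_time /* one slot per block */) {
    for (int b = 0; b * block_size < n; b++) {
        float m = FLT_MAX;
        for (int i = b * block_size; i < n && i < (b + 1) * block_size; i++)
            if (fel_times[i] < m) m = fel_times[i];
        min_time[b] = m;   /* thread 0 of the block writes eTime[0] */
    }
}

/* Final phase: reduce the per-block results to the global minimum.  On the
   GPU this is a second kernel invocation from the CPU, which also provides
   the global synchronization point between the two phases. */
static float final_min(const float *min_time, int num_blocks) {
    float m = FLT_MAX;
    for (int b = 0; b < num_blocks; b++)
        if (min_time[b] < m) m = min_time[b];
    return m;
}
```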


__device__ int NextEvent(float *FEL, int elementIndex, int interval)
{
    if (FEL[elementIndex * numOfTokenAttr + Time] <= interval &&
        FEL[elementIndex * numOfTokenAttr + Status] != QUEUE)
        return TRUE;
    else
        return FALSE;
}

Figure 4-5. Pseudocode for NextEvent

function. Approximate time is calculated by assigning the interval that the event time can be rounded to, so that more events are clustered into that time. Events are extracted from the FEL unless tokens are in a queue. Each thread executes one of the event routines in parallel after the events are extracted.

Event Insertion and Update: Schedule: New events are scheduled and inserted into the FEL by the currently executed events. Each element that is executed at the current time updates its current status (e.g., next event time, served facility, queue, and so on) by calling the Schedule function. Figure 4-6 illustrates the pseudocode for the Schedule function. The Schedule function is the general function to update an element in the FEL, as well as to schedule new events. In an open queuing network, the number of elements in the FEL varies, due to arrivals from and departures to outside of the simulation model. An index is maintained to put a newly arrived token into the FEL, and the location of an exiting token is marked as being empty. When the index reaches the last location of the FEL, the index goes back to the first location, and keeps increasing by 1 until it finds an empty space. The CPU is responsible for generating new arrivals from outside of the simulation model in the open queuing network, due to the mutual exclusion problem of the index.
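The circular insertion index described above can be sketched as follows (a sequential stand-in with hypothetical names and sizes; on the host this search runs one insertion at a time, which is why the CPU owns it):

```c
#include <assert.h>

#define FEL_CAP 8      /* capacity of the FEL array (hypothetical) */
#define EMPTY   (-1)   /* marker for an empty FEL slot */

/* Advance the shared index by 1, wrapping at the end of the FEL, until an
   empty slot is found; place the new token there.  Returns the slot used,
   or -1 if the FEL is full. */
static int insert_token(int *fel, int *index, int token_id) {
    for (int tried = 0; tried < FEL_CAP; tried++) {
        if (fel[*index] == EMPTY) {
            fel[*index] = token_id;
            return *index;
        }
        *index = (*index + 1) % FEL_CAP;   /* wrap to the first location */
    }
    return -1;
}
```

Because the index is shared state, two threads running this search concurrently could claim the same empty slot; this is the mutual exclusion problem that forces new-arrival generation onto the CPU.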


__device__ void Schedule(float *FEL, int elementIndex, float *currentToken)
{
    for (int i = 0; i < numOfTokenAttr; i++) {
        FEL[elementIndex * numOfTokenAttr + i] = currentToken[i];
    }
}

Figure 4-6. Pseudocode for Schedule

__device__ void Request(float *FEL, int elementIndex)
{
    FEL[elementIndex * numOfTokenAttr + Extracted] = TRUE;
}

Figure 4-7. Pseudocode for Request

simulation time of 4. The queue at service facility #1 has one token (#3), and its total busy time so far is 3.

Table 4-2. The service facility and its attributes
Facility ID   Server Status   Served Token   Busy Time   Service Start Time   Queue Length   Queue
#1            Busy            #2             3           4                    1              #3
#2            Busy            #1             1           2                    0
#3            Idle            -              0           0                    0
#4            Busy            #6             2           5                    2              #4, #5

Arrival: Request: Arrival and departure are the two main events for the queuing model, and both events are executed after being extracted from the FEL. The arrival event updates the element of the requesting token in the FEL and the service facility after checking the status of its server (busy or idle). However, an arrival event, executed by calling the Request function, only updates the status of tokens, due to the mutual exclusion problem. The pseudocode for the Request function is illustrated in Figure 4-7.

Departure: Release: The departure event also needs updates to both the FEL and the service facility. For a departure event, it is possible to update both of them, as specified in the previous section. When the event is a departure event, token information is updated by calling the Schedule function. Then, the Release function is called in order to update the statistics of the service facility for the released token and the status of its server. Figure 4-8 illustrates the pseudocode for the Release function. When the Release function is executed, the index of the updated service facility is determined


__device__ void Release(float *Facility, int released, int tokenId)
{
    Facility[released * numOfFacAttr + BusyTime] +=
        currentTime - Facility[released * numOfFacAttr + ServiceStart];
    Facility[released * numOfFacAttr + NumOfServed]++;
    Facility[released * numOfFacAttr + ServerStatus] = IDLE;
    Facility[released * numOfFacAttr + ReleasedToken] = tokenId;
}

Figure 4-8. Pseudocode for Release

by the released token, not by the element index, as shown in Figure 4-8. The Release function stores the currently released token for selective update.

Scheduling the Service Facility: ScheduleServer: The service facility places the currently requesting tokens into the server or queue, after searching the FEL, by calling the ScheduleServer function. Figure 4-9 illustrates the pseudocode for the ScheduleServer function. The token in the queue has priority, and is placed into the server if the queue is not empty. For two or more tokens, the token with the minimum original timestamp is placed into the server, and the others are placed into the queue if the queue is empty. A token is dropped from the service facility when the queue is full. The head and tail indexes are used to insert a token into (EnQueue), and delete a token from (DeQueue), the queue.

Collecting Statistics: PrintResult: Each service facility has several attributes for summary statistics, including the accumulated busy time, the number of served tokens, and the average length of the queue. When an event occurs at the service facility, these attributes are updated. At the end of the simulation, these attributes are copied to the CPU. The summary statistics are produced by calling the PrintResult function. The PrintResult function is a CPU-side function with no parameters, and it returns summary statistics including utilization, throughput, and mean wait time.
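The busy-time accumulation in Figure 4-8 and the utilization statistic it feeds can be sketched in plain C (a sequential stand-in with hypothetical struct fields; utilization here is simply accumulated busy time divided by simulated time):

```c
#include <assert.h>

typedef struct {
    float busy_time;       /* accumulated at each Release */
    int   num_served;      /* number of served tokens */
    float service_start;   /* when the current service began */
} FacilityStats;

/* Mirrors the statistics updates performed by Release in Figure 4-8. */
static void release(FacilityStats *f, float current_time) {
    f->busy_time += current_time - f->service_start;
    f->num_served++;
}

/* One of the summary statistics produced by PrintResult. */
static float utilization(const FacilityStats *f, float sim_time) {
    return f->busy_time / sim_time;
}
```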


__device__ void ScheduleServer(float *Facility, int elementIndex, int *currentList)
{
    // if the queue is not empty, put the head of the queue into the server
    if (!IsEmpty(Facility, elementIndex)) {
        selectedToken = DeQueue(Facility, elementIndex);
        startIndex = 0;
    }
    // if the queue is empty
    else {
        // put the token with the minimum original timestamp into the server
        selectedToken = MinOriginalTime(currentList);
        startIndex = 1;
    }
    Facility[elementIndex * numOfFacAttr + ServedToken] = selectedToken;
    Facility[elementIndex * numOfFacAttr + ServerStatus] = BUSY;
    Facility[elementIndex * numOfFacAttr + ServiceStart] = currentTime;

    // put the other tokens into the queue
    for (int i = startIndex; i < numOfCurrentTokens; i++) {
        if (Facility[elementIndex * numOfFacAttr + QueueLength] >= queueCapacity) {
            // drop the current token
            break;
        }
        else {
            EnQueue(Facility, elementIndex, currentToken);
        }
    }
}

Figure 4-9. Pseudocode for ScheduleServer
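The head/tail queue manipulation that EnQueue and DeQueue perform can be sketched as a ring buffer in plain C (hypothetical layout; the dissertation stores these fields inside the flat facility array rather than a struct):

```c
#include <assert.h>

#define QCAP 4   /* queue capacity (hypothetical) */

typedef struct { int buf[QCAP]; int head, tail, len; } TokenQueue;

/* Insert at the tail index; a token arriving at a full queue is dropped,
   matching the drop behavior in Figure 4-9.  Returns 1 on success. */
static int enqueue(TokenQueue *q, int token) {
    if (q->len >= QCAP) return 0;
    q->buf[q->tail] = token;
    q->tail = (q->tail + 1) % QCAP;
    q->len++;
    return 1;
}

/* Remove from the head index (caller checks the queue is not empty). */
static int dequeue(TokenQueue *q) {
    int token = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->len--;
    return token;
}
```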


4.3.3 Random Number Generation

In discrete event simulations, the time duration for each state is modeled as a random variable [67]. Inter-arrival and service times in queuing models are the types of variables that are modeled with specified statistical distributions. The Mersenne twister [69] is used to produce the seeds for a pseudo-random number generator, since bitwise arithmetic and an arbitrary amount of memory writes are suitable for the CUDA programming model [70]. Each thread block updates the seed array for the current execution at every simulation step. Those seeds, with statistical distributions such as uniform and exponential distributions, then produce the random numbers for the variables.

4.4 Steps for Building a Queuing Model

This section describes the basic steps in developing the queuing model simulation. Each step represents a kernel invoked from the CPU in sequence to implement mutual exclusion on the GPU. We have assumed that each service facility has only one server for this example.

Step 1: Initialization: The memory spaces are allocated for the FEL and service facilities, and the state variables are defined by the programmer. The number of elements for which each thread is responsible is determined by the problem size, as well as by user selections, such as the number of threads in a thread block and the number of blocks in a grid. Data structures for the FEL and service facility are copied to the GPU, and initial events are generated for the simulation.

Step 2: Minimum Timestamp: The NextEventTime function finds the minimum timestamp in the FEL by utilizing multiple threads. At this step, each thread is responsible for handling a certain number of elements in the FEL. The number of elements each thread is responsible for may be different from that of other steps, if shared memory is used for element comparison. The steps for finding the minimum timestamp are illustrated in Figures 4-10 and 4-11. In Figure 4-10, each thread


Figure 4-10. First step in parallel reduction

Figure 4-11. Steps in parallel reduction

compares two timestamps, and the smaller timestamp is stored at the left location. The timestamps in the FEL are copied to the shared memory when they are compared, so that the FEL will not be re-sorted, as shown in Figure 4-11.

Step 3: Event Extraction and Departure Event: The NextEvent function extracts the events with the minimum timestamp. At this step, each thread is responsible for handling a certain number of elements in the FEL, as illustrated in Figure 4-12. Two main event routines are executed at this step. A Request function executes an arrival event partially, just indicating that these events will be executed at the current iteration. A Release function, on the other hand, executes a departure event entirely at this step, since only one constant index is used to access the service facility for a Release


Figure 4-12. Step 3: Event extraction and departure event

function. In Figure 4-12, tokens #4, #5, and #8 are extracted for future updates, and service facility #6 releases token #5 at this step, updating both the FEL and the service facility at the same time. Token #5 is re-scheduled when the Release function is executed.

Step 4: Update of Service Facility: The ScheduleServer function updates the status of the server and the queue for each facility. At this step, each thread is responsible for processing a certain number of elements in the service facility, as illustrated in Figure 4-13. Each facility finds the newly arrived tokens by checking the incoming edges and the FEL. If there is a newly arrived token at a service facility, the service facilities with an idle server (#2, #3, #5, #6, and #8) will place it into the server, whereas the service facilities with a busy server (#1, #4, and #7) will put it into the queue. Token #8 is placed into the server of service facility #8. Token #4 can be located in the server of service facility #6 because service facility #6 has already released token #5 at the previous step.

Step 5: New Event Scheduling: The Schedule function updates the executed tokens in the FEL. At this step, each thread is responsible for processing a certain number of elements in the FEL, as illustrated in Figure 4-14. All tokens that have


Figure 4-13. Step 4: Update of service facility

Figure 4-14. Step 5: New event scheduling

requested the service at the current time are re-scheduled by updating the attributes of the tokens in the FEL. Then, the control goes back to Step 2, until the simulation ends. The attributes of tokens #4 and #8 in Figure 4-14 are updated based on the results of the previous step, as shown in Figure 4-13.

Step 6: Summary Statistics: When the simulation ends, both arrays are copied to the CPU, and the summary statistics are calculated and generated.
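The host-side control implied by Steps 1-6 can be sketched as follows: each function below is a plain-C stand-in for one kernel launch (the names are hypothetical), and invoking them in sequence from the CPU is what provides the global synchronization point between phases described in Section 4.2.3:

```c
#include <assert.h>
#include <string.h>

static char trace[64];   /* records the phase order for illustration */

static void next_event_time(void)    { strcat(trace, "2"); } /* Step 2 kernel */
static void extract_and_depart(void) { strcat(trace, "3"); } /* Step 3 kernel */
static void schedule_server(void)    { strcat(trace, "4"); } /* Step 4 kernel */
static void schedule_new(void)       { strcat(trace, "5"); } /* Step 5 kernel */

/* Host loop: each call stands in for a kernel launch; no thread from one
   phase can run until the previous phase has completed. */
static void run(int iterations) {
    for (int i = 0; i < iterations; i++) {
        next_event_time();
        extract_and_depart();
        schedule_server();
        schedule_new();
    }
}
```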


4.5 Experimental Results

The experimental results compare two parallel simulations with a sequential simulation: the first is a parallel simulation with a sequential event scheduling method, and the second is a parallel simulation with a parallel event scheduling method.

4.5.1 Simulation Environment

The experiment was conducted on an Intel Core 2 Extreme Quad 2.66 GHz processor with 3 GB of main memory. The Nvidia GeForce 8800 GTX GPU [12] has 768 MB of memory with a memory bandwidth of 86.4 GB/s. The CPU communicates with the GPU via PCI-Express with a maximum of 4 GB/s in each direction. The C version of SimPack [71] with a heap-based FEL was used in two sequential event scheduling methods for comparison with the parallel version. SimPack is a simulation toolkit which supports the construction of various types of models and the execution of the simulation, based on an extension of a general-purpose programming language. C, C++, Java, JavaScript, and Python versions of SimPack have been developed. The results presented in this dissertation are the average value of five runs.

4.5.2 Simulation Model

The toroidal queuing network model was used for the simulation. This application is an example of a closed queuing network interconnected with service facilities. Figure 4-15 shows an example of a 3x3 toroidal queuing network. Each service facility is connected to its four neighbors. When a token arrives at the service facility, the service time is assigned to the token by a random number generator with an exponential distribution. After being served by the service facility, the token moves to one of its four neighbors, selected with uniform distribution. The mean service time of the facility is set to 10 with exponential distribution, and the message population (the number of initially assigned tokens per service facility) is set to 1. Each service time is rounded to an integer so that many events are clustered into one event time. However, this will introduce a numerical error into the simulation results because the execution times are


Figure 4-15. 3x3 toroidal queuing network

different from their original timestamps. The error may be acceptable in some applications, but an error correction method may be required for more accurate results. In Chapter 5, we analyze the error introduced by clustering events, and present methods for error estimation and correction.

4.5.3 Parallel Simulation with a Sequential Event Scheduling Method

In this experiment, the CPU and GPU are combined in a master-slave paradigm. The CPU works as the control unit, and the GPU executes the programmed codes for events. We used a parallel simulation method based on a SIMD scheme so that events with the same timestamp value are executed concurrently. If there are two or more events with the same timestamp, they are clustered into a list, and each event on the list is executed by each thread. During the simulation, the GPU produces two random numbers for each active token: the service time at the current service facility by exponential distribution, and the next service facility by uniform distribution. When the CPU calls the kernel and passes the streams of active tokens, the threads on the GPU generate the results in parallel and return them to the CPU. The CPU schedules the tokens using these results.

Figure 4-16 shows the performance improvement in the GPU experiments. The CPU-based simulation showed better performance on the 16x16 facilities because (1)


Figure 4-16. Performance improvement by using a GPU as coprocessor

the sequential execution time in one time interval on the CPU was not long enough compared to the data transfer time between the CPU and GPU, and (2) the number of events in one time interval was not enough to maximize the number of threads on the GPU. The GPU-based simulation outperforms the sequential simulation when (1) is satisfied, and the performance increases when (2) is satisfied. However, the performance was not good enough when we compare the results with other coarse-grained simulations. In the SIMD execution, some parts of the code are processed in sequence, such as the instruction fetch. The event scheduling method (e.g., the event insertion and extraction) performed in sequence represents over 95% of the overall simulation time, while the event execution time (e.g., random number generation) is reduced by utilizing the GPU.

4.5.4 Parallel Simulation with a Parallel Event Scheduling Method

In the GPU experiment, the number of threads in the thread block is fixed at 128. The number of elements that each thread processes and the number of thread blocks are determined by the size of the simulation model. For example, there are 8 thread blocks, and each thread only processes one element for both arrays, in a 32x32 model.


Figure 4-17. Performance improvement from parallel event scheduling

There are 64 thread blocks, and each thread processes 32 elements for both arrays, in a 512x512 model. Figure 4-17 shows the performance improvement in the GPU experiments compared to sequential simulation on the CPU. The performance graph shows an s-shaped curve. For a small simulation model, the CPU-based simulation shows better performance, since the times to execute the mutual exclusion algorithm and transfer data between the CPU and GPU exceed the sequential execution times. Moreover, the number of concurrently executed events is too small. The GPU-based simulation outperforms the sequential simulation when the number of concurrent events is large enough to overcome the overhead of parallel execution. Finally, the performance gradually increases when the problem size is large enough to fully utilize the threads on the GPU. Compared to Figure 4-16, parallel event scheduling removes the bottleneck of the simulation and significantly improves the performance.

4.5.5 Cluster Experiment

We also ran the simulation over a cluster using a sequential event scheduling method. The clusters used for the simulation are composed of 24 Sun workstations


interconnected by a 100 Mbps Ethernet. Each workstation is a Sun SPARC 1 GHz machine running version 5.8 of the Solaris operating system with 512 MB of main memory. In this experiment, the processors are combined in a master-slave paradigm. One master processor works as the control unit, and several slave processors execute the programmed codes for events. Each event on the list of concurrent events is sent to a processor. The simulation over a cluster did not demonstrate good performance without artificial delay, since the computation time of each event was too short compared to the communication delay between the processors. The communication delay of a null message between master and slave processors was measured as less than 1 millisecond (ms), but it overwhelms the ten microseconds (us) of computation time for each event.

Most traditional parallel discrete event simulations exchange messages between processors in order to send an event to other processors or to use them as a signal of synchronization. Communication delay is a critical factor in the performance of simulation when the computation granularity of events is relatively small [72]. Other experimental results show that modest speedup is obtained from parallel simulation with fine granularity, but the speedup is relatively small compared to coarse granularity [73], or performance is even worse than that of sequential simulation [74]. Communication delay can be relatively negligible in the CPU-GPU simulation, since communications are handled on the same hardware.


CHAPTER 5
AN ANALYSIS OF QUEUE NETWORK SIMULATION USING GPU-BASED HARDWARE ACCELERATION

5.1 Parallel Discrete Event Simulation of Queuing Networks on the GPU

5.1.1 A Time-Synchronous/Event Algorithm

We used a parallel simulation method based on a SIMD scheme so that events with the same timestamp value are executed concurrently. The simulation begins with the extraction of the event with the lowest timestamp from an FEL. Event extraction continues for as long as the next event has the same timestamp. Events with the same timestamp are clustered into the current list of execution, and each event is executed on each thread of the GPU. However, since it is unlikely that several events occur at a single point of simulated time in a discrete event simulation, many threads will be idle, resulting in wasted GPU resources and inefficient performance. We introduce a time-synchronous/event algorithm using a time interval instead of a precise time, in order to have more events occurring concurrently and to reduce the load imbalance on the threads of the GPU. Clustering events within a time interval makes it possible for many more events to be executed at a single point of simulated time, which reduces the number of idle threads and achieves more efficient parallel processing.

A time-synchronous/event algorithm that is used to cluster more events at a single event time is a hybrid algorithm of discrete event simulations and time-stepped simulations. The main difference between the two types of discrete simulation is the method used to advance time. Our approach is similar to a time-stepped simulation in the sense that we execute events at the end of the time interval to improve the degree of parallelism. However, a time-stepped simulation can be inefficient if the state changes in the simulation model occur irregularly, or if event density is low at the time interval. Although there is no event at the next time-step, the clock must advance to the next time-step, which reduces efficiency owing to idle processing time. Our approach is


while (current time is less than simulation time)
    minimumTimeStamp = ParallelReduction(FEL);
    currentStep = the smallest multiple of the time interval greater than or equal to minimumTimeStamp;
    foreach local FEL by each thread (or processor) in parallel do
        if (the timestamp of the event is less than or equal to currentStep)
            CurrentList += ExtractEvent(FEL);
            ExecuteEvent(CurrentList);
            Schedule new events from the results;
        endif
    endforeach
endwhile

Figure 5-1. Pseudocode for a hybrid time-synchronous/event algorithm with parallel event scheduling

based on discrete event simulations in that the clock advances by the next event, rather than by the next time-step.

The pseudocode for a time-synchronous/event algorithm with parallel event scheduling is illustrated in Figure 5-1. At each start of the simulation loop, the lowest timestamp is calculated from the FEL by parallel reduction [66]. The clock is set to the minimum timestamp, and the smallest multiple of the time interval that is greater than or equal to the minimum timestamp is set to the current time-step. All events are extracted from the FEL in parallel by multiple threads on the GPU if their timestamp is less than, or equal to, the current time-step. Each extracted event is exclusively accessed and executed by each thread on the GPU. The time interval in our approach is used to execute events concurrently rather than to advance the clock. After executing the events, the clock advances to the next lowest event time, and not to the next time-step.

However, if events are executed only at the end of the time interval, the results lose accuracy, because each event has to be delayed in its execution compared to its original timestamp. Fortunately, we can approximate the error due to the stochastic nature of queues. For small and non-complex queuing networks, the analytic model can provide the statistics without running a simulation, based on queuing theory, albeit


with assumptions and approximations [1, 3]. We use queuing theory to estimate the total error rate after we obtain the simulation results. The time interval can be another parameter of the queuing model, combined with two time-dependent parameters: arrival and service rate. With the use of the time interval, the error rate caused by the time interval is related to the arrival and service rates, and the amount of error depends on the values of these parameters. The relationships between the time interval and these parameters are described in Sections 5.3 and 5.4.

5.1.2 Timestamp Ordering

In parallel simulation, the purpose of synchronization is to process the events in non-decreasing timestamp order to obtain the same results as those of a sequential simulation. In a traditional parallel discrete event simulation, the event order can be violated by the different speeds of event executions and communication delays between processors, resulting in a causality error. Other simulation methods using the tradeoff between accuracy and speedup allow the timestamp ordering to be violated within certain limits, whereas our approach still keeps the timestamp ordering of events.

We do not need an explicit synchronization method, since all the events are stored in the global event list, and the time for each event execution is determined by the global clock. The synchronous step of the simulation preserves the executions of events in a non-decreasing timestamp order, blocking the event extractions from the FEL before the current events finish scheduling the next events. The error caused by the time interval, therefore, is different from the causality error, because the timestamp ordering is preserved even though events are clustered at the end of the time interval. The error in the result is a statistical error, since each event does not occur at its precise timestamp. However, a causality error can occur for the events with the same timestamp when events are clustered by a time interval. Consider two or more tokens with different timestamps requesting service at the same service facility. Their timestamps are different, but they can be clustered into one time interval. In this case, an original


timestamp is used to determine the correct event order for simultaneous events. For originally simultaneous events, the event order is randomly determined by each service facility, as described in Section 4.2.1.

5.2 Implementation and Analysis of Queuing Network Simulation

5.2.1 Closed and Open Queuing Networks

Queuing networks are classified into two types: closed and open [3]. In an open queuing network, each token arrives at the system according to the arrival rate and leaves the system after being served. In a closed queuing network, on the other hand, a finite number of tokens is assigned, and the tokens circulate through the network as they are executed. Open queuing networks are more realistic queuing models than closed queuing networks, and communication network and traffic flow models [75] are typical examples. However, closed queuing networks are widely used in modeling systems where the number of tokens in the system has an impact on the nature of the arrival process, due to the finite input populations [76]. CPU scheduling, flexible manufacturing systems [77], and truck-shovel systems [78] are examples of closed queuing networks.

The main difference between these two types of queuing networks is that the open queuing network has new arrivals during the simulation. The number of tokens in the open queuing network at any instant of time is always changing due to arrivals and departures. The closed queuing network has a constant number of tokens during the simulation, since there are no new arrivals or departures. The error rate produced by the use of a time interval will differ between the two types of queuing networks, since the number of tokens in the system affects the simulation results.

In the open queuing network, the arrival rate remains constant although the events are only executed at the end of each time interval. A delayed execution time for each event, compared to its precise timestamp, decreases the departure rate of the queuing network, resulting in an increased number of tokens in the queuing network. As the


number of tokens in the queuing network increases, the wait time also increases, since the length of the queue at the service facility increases. In the closed queuing network, we only need to consider the arrival and departure rates between the service facilities, since there is no entry from the outside. The delayed tokens arrive at the next service facility as late as the difference between their original timestamps and actual execution times. The length of the queue at the service facility remains unchanged by the time interval, since all tokens in the system are delayed at the same rate.

The implementation also differs between closed and open queuing networks. It is possible to allocate a fixed-size array for the FEL in the closed queuing network because of the constant number of tokens during the simulation. A static memory allocation with a fixed number of elements allows extraction of events from, and re-scheduling into, the FEL to be performed on the GPU. The data need not be sent back to the CPU in the middle of the simulation. In an open queuing network, the size of the FEL is always changing. For this reason the upper limit of memory for the FEL needs to be estimated, which causes many threads on the GPU to be idle and memory to be wasted. Moreover, the GeForce 8800 GTX GPU (a device of compute capability 1.0) does not support mutual exclusion or atomic functions [13]. A manual locking method for concurrency control cannot be used when the interval between threads that try to access the same element in memory is too short. The assignment of new arrivals from outside of the queuing network to the shared resources of the FEL requires sequential execution so that multiple threads are prevented from concurrently accessing and writing their new arrivals to the same empty location. In this case, newly arrived tokens need to be generated on the CPU, resulting in data transfer between the CPU and GPU. Both sequential execution and data transfer are performance bottlenecks, and data transfer time can have a critical impact on performance in large-scale models. The experimental results in Section 5.3 show these performance differences between closed and open queuing networks.
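To make the closed-network case concrete, here is a minimal sequential sketch of the loop from Figure 5-1 running over such a fixed-size FEL, with one slot per token. This is our own illustrative Python stand-in for the CUDA kernels, not the library's code; the inner loop models what each GPU thread does with the single slot it owns.

```python
import math
import random

def simulate_closed(fel, dt, end_time, num_facilities, mean_service=10.0):
    """fel: list of [timestamp, facility] slots, one fixed slot per token.
    Each while-iteration mirrors one synchronous GPU step of Figure 5-1."""
    clock = 0.0
    events = 0
    while clock < end_time:
        min_ts = min(ts for ts, _ in fel)            # parallel reduction on the GPU
        step = math.ceil(min_ts / dt) * dt if dt > 0 else min_ts
        for slot in fel:                             # one GPU thread per slot
            if slot[0] <= step:
                slot[1] = (slot[1] + 1) % num_facilities           # token moves on
                slot[0] += random.expovariate(1.0 / mean_service)  # reschedule in place
                events += 1
        clock = min(ts for ts, _ in fel)             # advance to the next event time
    return events
```

Because each "thread" touches only its own slot, no locking is needed, which is exactly why the closed-network FEL can stay on the GPU for the whole simulation.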


On the other hand, if memory is separately allocated to each service facility that receives external inputs from outside of the queuing network, then new arrivals can be handled on the GPU. A location that is separate from other threads prevents them from accessing the same location in the FEL. This is a feasible solution if there are few entries from outside of the queuing network in the simulation model. In general, though, this is not a good approach for large-scale models, since the memory allocation grows rapidly as the number of service facilities increases.

5.2.2 Computer Network Model

The queuing model was originally developed to analyze and design telecommunication systems [79], and it has frequently been used to analyze the performance of computer network systems. When a packet is sent from one node to an adjacent node, there are four delays between the two nodes: processing, queuing, transmission, and propagation delays [80]. Among these, the queuing delay is the most studied delay because it is the only delay affected by the traffic load and congestion pattern.

The time a packet must wait in the queue before being transmitted onto the link is the queuing delay. In a computer network, the queuing delay includes the medium access delay, which can increase the queuing delay. The medium access delay is the time a packet must wait in the node until the medium is sensed as idle. If another node connected to the same medium is transmitting packets, then a packet in the first node cannot be transmitted, even if no packets are waiting in that queue. Consequently, the shared medium is regarded as a common resource for the packets, and the service facilities in the same medium are regarded as another queue for the shared medium. Our simulation method causes the error rate to be higher due to these two consecutive queues.

Figure 5-2 illustrates the possible delays caused by the time interval in the computer network simulation when a packet is transmitted to the next node. In the general


Figure 5-2. Queuing delay in the computer network model

queuing model, d1 is the only delay caused by a time interval, but here the packet cannot be transmitted as soon as the backoff time ends and the medium is sensed as idle. The second delay d2 is added on to the medium access delay, and it causes a greater error compared to the general queuing model. The delays of other packets in the same medium also make d2 much longer.

The media access control (MAC) protocol [81] allows several nodes to be connected to the same medium and coordinates their accesses to the shared medium. The implementation of the MAC protocol on the GPU differs with respect to the behavior of the network. It is sufficient for each node only to sense the shared medium in a sequential execution, but exclusive access to the shared medium by each node needs to be guaranteed in a parallel execution. In a wired network, the network topology is usually static, and the nodes connected to the same medium are also static and do not change during the simulation. The MAC protocol in the simulation can be developed to centrally control the nodes, which makes it possible for the MAC protocol to be executed on the GPU using an alternate update.

The implementation of the MAC protocol in a wireless network with an access point (AP) is not much different from that in a wired network. The topology of mobile nodes is dynamic, but that of APs is static. The nodes connected to the same AP are different


at any point in time, but the MAC protocol in the simulation can still be developed to centrally control the nodes after searching all nodes currently connected to the AP.

However, a mobile ad hoc network (MANET) simulation [82] requires a distributed implementation of the MAC protocol. The topology in MANETs changes rapidly, and the shared medium, which is determined by the transmission range of each mobile node without any fixed AP, is completely different for each mobile node. The MAC protocol in the simulation needs to be implemented with respect to each node. This requires the sequential execution of the MAC protocol on the GPU, degrading the performance.

5.2.3 CUDA Implementation

A higher degree of parallelism can be achieved by concurrently utilizing the many stream multiprocessors on the GPU. A GPU implicitly supports data-level parallelism by decomposing the computation into a large number of small tasks and by guaranteeing the threads exclusive access to each element. The FEL and the service facilities are the two main data structures in the queuing model, and these are represented as two-dimensional arrays. One or more elements in both arrays are assigned to a single thread for parallel execution. Each thread executes only the active events at the current time-step. A GPU of this generation can process only one kernel at a time, so task-level parallelism has to be implemented manually by combining two or more tasks in a single kernel and by dividing the thread blocks based on the number of tasks. In our MANET simulation, the event extraction for data packets and the location updates for the mobile nodes can be programmed into a single kernel, since the tasks are independent of each other, which increases the utilization of threads.

Parallel processing is different from sequential processing in that many tasks are concurrently executed, reducing the overall execution time. The problem needs to be safely decomposed into sub-tasks so that concurrently executed tasks do not affect each other and the order of execution is unchanged [83]. The FEL and the service facilities are interdependent in that the two arrays need to be updated at the same time when


a request event is called. Arbitrary access to one array by multiple threads in parallel would allow several threads to concurrently access the same elements in the array. Their executions need to be separated. Alternate updates between the FEL and the service facilities resolve this problem.

We need data transfer between the CPU and GPU to avoid simultaneous access to shared resources, since our GPU does not support mutual exclusion. The fast data transfer between the CPU and GPU via the PCI-Express bus is a significant advantage over clusters with message passing between processors, which makes a CPU with a GPU a more appropriate architecture for the simulation of fine-grained events. However, frequent data transfer between the two devices can become a bottleneck in the simulation. Data transfer time can be reduced by minimizing the size of each transfer, which can be achieved by separating the array into two parts. The essential elements that require sequential execution on the CPU are kept in a separate array that holds indices into the main array.

The size of a data structure needs to be static on the GPU. The number of service facilities is constant during the simulation, whereas the number of elements in the FEL of an open queuing network changes continuously. Concurrent access to the elements in the FEL forces the generation of newly arrived tokens to be executed on the CPU, which, however, makes it possible to dynamically adjust the size of the FEL on the CPU at the start of each simulation loop. The array for the FEL can either be doubled or cut in half, based on the number of tokens in the FEL.

We have made extensive use of the data-parallel algorithms in the NVIDIA CUDA SDK [84] for our parallel queuing model simulation. A parallel reduction is used for finding the minimum timestamp when extracting the events from the FEL, which allows us to maintain the FEL without sorting it. The sequential execution of the MAC protocol on the CPU does not need to search all the elements in the array if the array for the MAC


Figure 5-3. N linear queuing networks with k servers

protocol is sorted in non-decreasing timestamp order. A bitonic sort on the GPU allows us to search only the needed elements within the array.

5.3 Experimental Results

5.3.1 Simulation Model: Closed and Open Queuing Networks

When we ran a simulation using the time interval, we used two kinds of queuing network models (closed and open queuing networks) to identify the differences in statistical results and performance between the two models. We first compared the results of the closed queuing network with those of the open queuing network, and analyzed the accuracy of the closed queuing network.

The first model is the queuing network with the toroidal topology used in Section 4.5.2. The values of various parameters can be important factors affecting accuracy and performance. We ran the simulation with varying values of two different parameters to see the effects of these parameters on the statistical results. The open queuing network consists of N linear queuing networks with k servers, as shown in Figure 5-3. A new token arrives at the queuing network from the calling population according to the arrival rate. The newly arrived token is assigned to one of the linear queuing networks with uniform probability. After being served at the last server in the linear queuing network, the token completes its job and exits the queuing network. The arrival and service times are determined by exponential distributions.
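The arrival process of this open model can be sketched as follows (our illustration, not the library's code; in our implementation this generation runs on the CPU, for the reasons given in Section 5.2.1). The mean inter-arrival time and the number of chains are parameters here, not fixed values from the model.

```python
import random

def generate_arrivals(end_time, mean_interarrival, num_chains):
    """Open-network arrival process: exponentially distributed inter-arrival
    times, each new token assigned to one of the N linear chains with
    uniform probability.  Returns a list of (timestamp, chain) pairs."""
    arrivals = []
    t = random.expovariate(1.0 / mean_interarrival)
    while t < end_time:
        chain = random.randrange(num_chains)   # uniform chain assignment
        arrivals.append((t, chain))
        t += random.expovariate(1.0 / mean_interarrival)
    return arrivals
```

Because the number of arrivals varies from run to run, the FEL built from this stream cannot be statically sized, which is the root of the open-network overheads discussed above.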


5.3.1.1 Accuracy: closed vs. open queuing network

The values of the parameters and the number of service facilities for the closed and open queuing networks are configured to obtain similar results when the time interval is set to zero. The results for various time intervals are compared with those of a zero time interval to determine accuracy. The mean service time of each service facility is set to 10 with an exponential distribution for both queuing networks. In the closed queuing network, the message population (the number of initially assigned tokens per service facility) is set to 1. In the open queuing network, the mean inter-arrival time from the calling population is set to 20. We used the 32×32 topology as a basis for the experiments to determine the accuracy.

Two summary statistics are presented in Figure 5-4 to show the difference caused by using the time interval. Sojourn time is the average time a token stays in one service facility, including the wait time in the queue. Utilization represents the performance of the simulation model. In each subsequent plot, the time interval is on the horizontal axis. A time interval of zero indicates no error in accuracy. As the interval increases, the error also increases for the variable being measured on the vertical axis. Figure 5-4A shows the average sojourn time of the open and closed queuing networks for each time interval. It takes much longer for a token to pass through a service facility in the open queuing network, since the number of tokens in the open queuing network grows, compared to the closed queuing network, as the time interval increases. Figure 5-4B shows the utilization for each time interval. Utilization of the closed queuing network drops, since arrivals at each service facility are delayed by the time interval, whereas utilization of the open queuing network is almost constant, since the arrival rate is constant regardless of the time interval, and the increased number of tokens fills up possible idle time at the service facility.


Figure 5-4. Summary statistics of closed and open queuing network simulations. A) Sojourn time. B) Utilization.


5.3.1.2 Accuracy: effects of parameter settings on accuracy

The time interval becomes one of the parameters in our simulation, and it causes error in combination with the other parameters. The time interval is a time-dependent parameter, and it forces the execution time of each event to be delayed to the end of the time interval. Time-dependent parameters, therefore, are the primary factors affecting the accuracy of a simulation. The closed queuing network was used for the simulation to determine the effects of the parameter settings on accuracy.

Figure 5-5A shows the utilization with variations in the number of service facilities for each time interval. The experimental results clearly show that the error rate is constant regardless of the number of service facilities, which is not a time-dependent parameter. Figure 5-5B shows the utilization of the 32×32 toroidal queuing network, with variation in the mean service time for each time interval. Varying the mean service time (one of the time-dependent parameters) changes the error rate. As the mean service time increases, the ratio of the delay time caused by the time interval to the mean service time drops. The error, therefore, decreases as the mean service time increases for the same time interval. Interestingly, the error rate in Figure 5-5B is determined by the ratio of the mean service time to the time interval. The utilizations are almost the same in the following three cases:

- Mean service time: 5. Time interval: 0.2
- Mean service time: 10. Time interval: 0.4
- Mean service time: 20. Time interval: 0.8

Figure 5-5B implies that the error rate can be estimated, based on the fact that the error rate is regular for the same ratio of a time-dependent parameter to the time interval.

5.3.1.3 Performance

The performance was calculated by comparing the runtime of a parallel simulation with that of a sequential simulation. We can expect better performance as the time


Figure 5-5. Summary statistics with varying parameter settings. A) Utilization with varying numbers of facilities. B) Utilization with varying mean service times.


interval increases, since many events are clustered into one time interval; however, a large time interval also introduces more error into the results.

Figure 5-6A shows the improvement in the performance of closed queuing network simulations for different numbers of service facilities and time intervals, with the same parameter values that were used in Figure 5-4. This graph indicates that the performance improvement depends on the number of events in one time interval. As expected, a larger time interval leads to better performance. For a very small-scale model, especially the 16×16 topology, the number of threads that run concurrently is too small. As a result, the overheads of parallel execution, such as mutual exclusion and CPU-GPU interactions, exceed the sequential execution times. The parallel simulation outperforms the sequential simulation when the number of clustered events in one time interval is large enough to overcome the overheads of parallel execution. Not all participant threads can be fully utilized in a discrete event simulation, since only extracted events are executed at once. A large time interval keeps more participant threads busy, resulting in an increasing number of events in one time interval. The performance, therefore, increases in proportion to the increment of the time interval. Finally, the performance improvement increases only gradually once the number of events in one time interval is large enough to maximize the number of threads executed in parallel on the GPU. In the 512×512 topology, the number of events in the FEL is too large to be loaded into the shared memory on the GPU at one time during the parallel reduction, which limits the performance improvement compared to the 256×256 topology.

Figure 5-6B shows the speedup of open queuing network simulations for different numbers of service facilities and time intervals, with the same parameter values that were used in Figure 5-4. The shapes of the curves are very similar to those of the closed queuing network simulations, except for the magnitude of the speedup. The overheads of sequential execution for newly arrived tokens on the CPU and of data transfer between the CPU and GPU result in a degradation of performance in the


Figure 5-6. Performance improvement with varying time intervals (Δt). A) Closed queuing network. B) Open queuing network.

simulation of an open queuing network. The experimental results indicate that the relationship between the error rate and the performance improvement is model-dependent and implementation-dependent; hence it is not easy to formalize.


The parallel overheads in our experimental results are summarized below.

- Thread synchronization between event times
- Reorganization of simulation steps for mutual exclusion
- Data transfer between the CPU and the GPU
- Sequential execution on the CPU to avoid simultaneous access to shared resources
- Load imbalance between threads at each iteration

5.3.2 Computer Network Model: a Mobile Ad Hoc Network

5.3.2.1 Simulation model

A MANET is a self-configuring network composed of mobile nodes without any centralized infrastructure. Each mobile node directly sends packets to other mobile nodes in a MANET. Each mobile node relays the packet to its neighbor node when the source and destination nodes are not in transmission range of each other. Figure 5-7 illustrates the difference between wireless and mobile ad hoc networks.¹ In a wireless network, each mobile node is connected to an AP and communicates with other mobile nodes via the AP. Figure 5-7A shows that node #1 can communicate with node #3 via two APs in a wireless network. On the other hand, node #1 can communicate with node #3 via nodes #2 and #4 in a MANET, as shown in Figure 5-7B.

When a mobile node sends a packet, it is relayed by intermediate nodes to reach the destination node, using a routing algorithm. The development of an effective routing algorithm can reduce the end-to-end delay as well as the number of hop counts, thus minimizing congestion of the network. For this reason, a MANET is often developed to evaluate the routing algorithm. A MANET simulation requires many more computations than a traditional wired network simulation because of its mobile nature. The locations

¹ Each circle represents the transmission range of each mobile node or AP.


Figure 5-7. Comparison between wireless and mobile ad hoc networks. A) Wireless network. B) Mobile ad hoc network.
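The relaying idea of Figure 5-7B can be illustrated with a small sketch: nodes can communicate directly only when they are within transmission range of each other, so a packet's hop count is a shortest path in the range-induced neighbor graph. The 250 m range matches the scenario parameters used later in this section; the positions and function names here are our own illustration.

```python
import math
from collections import deque

RANGE = 250.0  # transmission range in meters, as in the scenarios below

def hop_count(positions, src, dst):
    """Breadth-first search over the range-based neighbor graph; returns the
    minimum number of hops from src to dst, or None if dst is unreachable."""
    in_range = lambda a, b: math.dist(positions[a], positions[b]) <= RANGE
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for other in positions:
            if other not in seen and in_range(node, other):
                seen.add(other)
                queue.append((other, hops + 1))
    return None
```

With nodes spaced 200 m apart in a line, a packet from the first to the fourth node must be relayed twice, exactly as node #1 reaches node #3 via nodes #2 and #4 in the figure.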


of mobile nodes are always changing, which makes the topology different at any point in time. The routing table in each mobile node, therefore, must be updated frequently. A routing algorithm requires a beacon signal to be transmitted between mobile nodes to update the routing table. A MANET simulation can benefit from a GPU because it requires heavy computation, with frequent location updates for each mobile node and routing table. We have developed the MANET simulation model with a routing algorithm, mobility behavior, and MAC protocol to run the packet-level simulation.

Routing Algorithm: Greedy Perimeter Stateless Routing (GPSR) [85] is used to implement the routing algorithm in a MANET. Each mobile node maintains only its neighbor table. When a mobile node receives a greedy-mode packet for forwarding, it transmits the packet to the neighbor whose location is geographically closest to the destination. If the current node is the closest node to the packet's destination, the packet is switched to a perimeter mode. A packet in the perimeter mode traverses the edges in the planar graph by applying the right-hand rule. The packet returns to the greedy mode when it reaches a node that is geographically closer to the destination than the mobile node that previously set the packet to the perimeter mode. Each mobile node broadcasts the beacon signal periodically to announce its location to its neighbors. The mobile nodes that receive the beacon signal update their neighbor tables. Each mobile node transmits the beacon signal every 0.5 to 1.5 seconds. The detailed algorithm is specified in Karp and Kung [85].

Mobility: The mobility of a mobile node is modeled by the random waypoint mobility model [86]. A mobile node chooses a random destination with a random speed, which is uniformly distributed between 0 and 20 m/s. When the node arrives at its destination, it stays for a certain period of time before selecting a new destination. The pause time is uniformly distributed between 0 and 20 seconds.

MAC Protocol: A mobile node can transmit a packet if none of the mobile nodes within its transmission range is currently transmitting a packet. Each mobile node senses


Table 5-1. Simulation scenarios of MANET

    Number of mobile nodes    50            200           800           3200
    Region (m × m)            1500 × 600    3000 × 1200   6000 × 2400   12000 × 4800
    Node density              1 node/9000 m²
    Packet arrival rate       0.8           0.4           0.2           0.1
    (per node)                packet/sec    packet/sec    packet/sec    packet/sec

the medium before it sends a packet, and transmits the packet only if the medium is sensed as idle. When the medium is sensed as busy, a random backoff time is chosen, and the mobile node waits until the backoff time expires. We assumed that a packet can be transmitted immediately when the medium is sensed as idle and the backoff time has expired, with ideal collision avoidance.

Simulations were performed in four scenarios, as shown in Table 5-1. Each scenario has a different number of nodes and region size, but the node density is identical. At the start of the simulation, mobile nodes are randomly distributed within the area in each scenario. Each node generates a 1024-byte packet at the per-node arrival rate given in Table 5-1, and transmits constant bit-rate (CBR) traffic to a randomly selected destination. The transmission rate of each node is 1 Mbps, and the transmission range of each node is 250 meters. Across the scenarios, when we moved to a larger-scale model, the number of mobile nodes was quadrupled, but the number of packets was only doubled so that the network would not be congested.

5.3.2.2 Accuracy and performance

We produced three statistics for each number of mobile nodes with varying time intervals. Average end-to-end delay is the average transmission time of packets across the network from the source to the destination node. Packet delivery ratio is the ratio of data packets successfully delivered to their destinations. Average hop count is the average number of edges a packet crosses in the network from the source to the destination node.
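The idealized carrier-sense rule above can be sketched as follows (our illustration; the backoff bound is an arbitrary stand-in, not a value from the model):

```python
import random

def transmission_start(now, medium_busy_until, max_backoff=0.01):
    """Idealized carrier-sense rule from the model: if the medium is sensed
    as idle the packet goes out immediately; otherwise a random backoff is
    drawn, and with ideal collision avoidance the packet is transmitted as
    soon as both the backoff has expired and the medium is free."""
    if now >= medium_busy_until:          # medium sensed as idle
        return now
    backoff = random.uniform(0.0, max_backoff)
    return max(now + backoff, medium_busy_until)
```

In the time-interval scheme, this start time would additionally be rounded up to the end of the current interval, which is the extra delay d2 discussed in Section 5.2.2.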


Figure 5-8. Average end-to-end delay with varying time intervals (Δt)

Figures 5-8 and 5-9 show the three statistics produced by our simulation method. Each curve represents the results with a different time interval, and the difference from a time interval of zero represents the error. For the average end-to-end delay, the error rate grows as the time interval increases, especially for the large-scale models, as shown in Figure 5-8. We observed in the previous section that an increase in the number of service facilities did not change the error rate, since the number of service facilities is not a time-dependent parameter. However, the graph of the average end-to-end delay shows that the error rate varies with respect to the number of mobile nodes. This is related to the medium access delay. As mentioned in the previous section, we expected more delay in the computer network simulation when using the time interval, due to the medium access delay. In our simulation model, a large-scale model covers a broader area than a small-scale model. A packet usually passes through a larger number of intermediate nodes to reach its destination in the large-scale model. More medium access delays, therefore, are included in the end-to-end delay, resulting in more error in the results.
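A back-of-the-envelope version of this argument (our illustration, not a result from the dissertation): if each hop contributes on average about Δt/2 of interval-induced delay on top of its medium access delay, the absolute end-to-end error grows roughly linearly with hop count, so longer routes in larger regions accumulate more error.

```python
def extra_end_to_end_delay(hops, dt):
    """Interval-induced delay accumulated over a route: roughly dt/2 per hop
    (an illustrative approximation, ignoring medium contention effects)."""
    return hops * (dt / 2.0)

small_model = extra_end_to_end_delay(3, 0.002)    # short route, e.g. 3 hops
large_model = extra_end_to_end_delay(12, 0.002)   # longer route in a bigger region
```

With the same 2 ms interval, the longer route accumulates four times the extra delay, matching the qualitative trend in Figure 5-8.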


Figures 5-9A and 5-9B respectively show the average hop counts and the packet delivery ratio. All packets are included in the packet delivery ratio, regardless of the existence of paths to their destinations. These two statistics show both the efficiency and the accuracy of a routing algorithm. An error resulting from the time interval would imply that the routing table in each mobile node was not updated correctly. The results appear to be constant regardless of the time interval value. Our time interval (1 ms or 2 ms) is too small to affect the results, compared to the interval of the beacon signal (1 second on average) from each mobile node. Moreover, these two statistics are not time-dependent statistics, and are not determined by time-dependent parameters. The experimental results indicate that we can obtain accurate results if the statistics, as well as the parameters, are not time-dependent.

Figure 5-10 shows the performance improvement for each number of mobile nodes with varying time intervals. The sequential executions of new packet arrivals and MAC protocols were the bottlenecks in performance, but we could achieve speedup by executing the sub-tasks in parallel and minimizing data transfer between the CPU and GPU. In addition, each event in a MANET simulation requires much more computation time compared to the queuing models in the previous section. Two sub-tasks are easily parallelizable: the neighbor update in the routing algorithm, and the location update in the mobility model. A single kernel combines each sub-task with the event routines for data packets, which are independent of those tasks.

5.4 Error Analysis

In this section, we explain how the error equation is derived and how the error is corrected to improve the accuracy of the resulting statistics. The methods for error estimation and correction should be simple, since our objective is to obtain the results from the simulation, not from a complicated analytical method. For error estimation, we first need to capture the characteristics of the simulation model, thereby determining which parameters are sensitive to error. Then the error rate is derived as


Figure 5-9. Average hop counts and packet delivery ratio with varying time intervals (Δt). A) Average hop counts. B) Packet delivery ratio.


Figure 5-10. Performance improvement in MANET simulation with varying time intervals (Δt)

an equation by combining the time interval with the error-sensitive parameters, using queuing theory. In this dissertation, we start with a simple model (the closed queuing network) for the analysis, because there are fewer parameters to consider.

Figure 5-11 and Table 5-2 show the relationship between the time interval and the mean service time in closed queuing network simulations. Figure 5-11 shows a 3-dimensional graph of utilization for varying time intervals and mean service times. When the mean service time is relatively large, or when the time interval is small, the error rate tends to be low. Table 5-2 summarizes two summary statistics for different values of time intervals and mean service times. We can find some regularity in this table. The results imply that the ratio of the mean service time to the time interval is directly related to the error rate. These results indicate that time-dependent parameters are sensitive to error, and that such errors can be estimated.

When a token is clustered at the end of the time interval, the token is delayed by the amount of time between its original and actual execution times. Let d denote the delay time caused by the time interval. When the token moves to the next service facility, the


Figure 5-11. 3-dimensional representation of utilization for varying time intervals and mean service times

Table 5-2. Utilization and sojourn time (Soj. time) for different values of time intervals (Δt) and mean service times (s)

         s = 5                       s = 10                      s = 20
    Δt   Utilization  Soj. time     Δt   Utilization  Soj. time     Δt   Utilization  Soj. time
    0    0.5042       9.98          0    0.5042       19.97         0    0.5043       39.92
    0.5  0.4843       10.50         0.5  0.4938       20.59         0.5  0.4977       40.73
    1    0.4671       10.87         1    0.4840       21.03         1    0.4930       41.22
    2    0.4343       11.65         2    0.4671       21.74         2    0.4840       42.06

inter-arrival time at the next service facility increases by an average of d. The utilization of the M/M/1 queue is defined by ρ = λ/μ, where λ and μ refer to the arrival and service rates, respectively [1]. The equation can also be written as ρ = s/a, where s and a refer to the service time and inter-arrival time, respectively. Consider a linear queuing network with two queues, and take statistics at an instant in time. The utilization ρ2 of the second queue is defined by equation (5-1), since the instantaneous inter-arrival time at the second queue is the sum of the service time at the first queue and the delay time caused by the time interval (d).

    ρ2 = s / (a + d)                                                (5-1)
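The regularity visible in Table 5-2 can be checked directly: parameter pairs sharing the same ratio of time interval to mean service time (here Δt/s = 0.1) show nearly identical measured error rates ρ(Δt)/ρ(0).

```python
# Measured utilizations copied from Table 5-2.
util = {
    (5, 0.0): 0.5042, (10, 0.0): 0.5042, (20, 0.0): 0.5043,
    (5, 0.5): 0.4843, (10, 1.0): 0.4840, (20, 2.0): 0.4840,
}
# Error rate rho(dt)/rho(0) for the three columns at the common ratio dt/s = 0.1.
rates = [util[(s, s / 10)] / util[(s, 0.0)] for s in (5, 10, 20)]
```

All three ratios agree to about three decimal places, which is what makes a single error equation in terms of Δt and s plausible.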


Let the error rate denote the rate of decrease in utilization caused by the time interval. The error rate e can be defined by equation (5-2).

    e = ρ2 / ρ1 = a / (a + d),  where ρ1 = s/a and ρ2 = s/(a + d)   (5-2)

To calculate an average d, we have to consider the probability P0 that the service facility does not contain a token. In the open queuing network, the increased number of tokens due to the time interval causes the probability P0 to drop, and thus d increases exponentially. In the closed queuing network, the probability P0 is not affected by the time interval, since all tokens are delayed, reducing the arrival rate at each service facility. All tokens have to wait until the end of the time interval, so the long-run time-average of d is Δt/2. The decline in utilization is therefore governed by half the time interval. The long-run time-average inter-arrival time a in equation (5-2) approaches s, the long-run time-average service time. When we substitute d = Δt/2 (and a = s) into equation (5-2), the error rate e is defined by

    e = s / (s + Δt/2)                                              (5-3)

The utilization with the time interval, ρ(Δt), is defined by equation (5-4), where ρ(0) refers to the utilization at a zero time interval.

    ρ(Δt) = [s / (s + Δt/2)] ρ(0) = ρ(0) / (1 + Δt/(2s))            (5-4)

Consequently, we can derive the equation to correct the error in utilization. The original value of the utilization in the toroidal queuing network can be approximated by

    ρ(0) = (1 + Δt/(2s)) ρ(Δt)                                      (5-5)

Figure 5-12 shows the comparison of the error rate between the experimental and estimated results for two cases of the mean service time. As the ratio of the mean service time to the time interval increases, the difference between the two results decreases. Figure 5-13 shows the results calculated with the error correction of equation (5-5) together with the experimental results from Figure 5-12. The graph indicates


Figure 5-12. Comparison between experimental and estimated results

Figure 5-13. Result of error correction

that we can significantly reduce the error with the error correction method. For a mean service time of 20, the error rate is only 0.6% at a time interval of 1.
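As a numeric check (ours, using the measured values in Table 5-2): scaling the measured utilization ρ(Δt) by (1 + Δt/(2s)), as in the correction equation above, recovers the zero-interval utilization ρ(0) to within about 1%.

```python
def corrected_utilization(util_dt, dt, s):
    """Error correction from this section: rho(0) ~ (1 + dt/(2s)) * rho(dt)."""
    return (1.0 + dt / (2.0 * s)) * util_dt

# (s, dt, measured rho(dt), measured rho(0)), taken from Table 5-2.
rows = [(10, 1.0, 0.4840, 0.5042),
        (20, 1.0, 0.4930, 0.5043),
        (20, 2.0, 0.4840, 0.5043)]
rel_errors = [abs(corrected_utilization(u, dt, s) - u0) / u0
              for s, dt, u, u0 in rows]
```

The second row (s = 20, Δt = 1) lands within about 0.2% of the measured ρ(0), consistent with the 0.6% figure quoted above.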


The equation of the utilization for error correction is not derived from an analysis of individual nodes. Our intention is to approximate the total error rate when adding one more parameter (the time interval), so that the error can be corrected to yield more accurate results. The equation for the total error rate is derived from the equations of queuing theory. The equation, combined with the results from the simulation, produces more accurate results without building a complicated analytical model for each node.


CHAPTER 6
CONCLUSION

6.1 Summary

We have built a CUDA-based library to support parallel event scheduling and queuing model simulation on a GPU, and introduced a time-synchronous/event approach to achieve a higher degree of parallelism. There has been little research into the use of a SIMD platform for parallelizing the simulation of queuing models. The concerns in the literature regarding event distribution and the seemingly inappropriate application of GPUs to discrete event simulation are addressed (1) by allowing events to occur at approximate boundaries at the expense of accuracy, and (2) by using a detection and compensation approach to minimize error. The tradeoff in our work is that while we get significant speedup, the results are approximate and contain a numerical error. However, in simulations where there is flexibility in the output results, the error may be acceptable.

The event scheduling method occupies a significant portion of the computational time in discrete event simulations. A concurrent priority queue approach allowed each processor to simultaneously access the global FEL on shared-memory multiprocessors. However, an array-based data structure and synchronous execution among threads without explicit support for mutual exclusion prevented the concurrent priority queue approach from being directly applied to the GPU. In our parallel event scheduling method, the FEL is divided into many sub-FELs, which allows threads to process these smaller units in parallel by utilizing a large number of threads on a GPU without invoking sophisticated mutual exclusion methods. Each element in the array holds its position while the FEL remains unsorted, which guarantees that each element is only accessed by one thread. In addition, alternate updates between the FEL and the service facilities in a queuing model simulation allow both shared resources to be updated bi-directionally on the GPU, thereby avoiding simultaneous access to the shared resources.


We have simulated and analyzed three types of queuing models to see what different impacts they have on the statistical results and performance under our simulation method. The experimental results show that we can achieve up to 10 times speedup using our algorithm, although the increased speed comes at the expense of accuracy in the results. The relationship between accuracy and performance, however, is model-dependent, and not easy to define on a more general basis. In addition, the statistical results in the MANET simulations show that our method only causes an error in the time-dependent statistics. Although the improvement in performance introduced an error into the simulation results, the experimental results showed that the error in queuing network simulations is regular enough to be used to estimate more accurate results. The time interval can be treated as one of the parameters used to produce the results, and so the error can be approximated from the values of the parameters and the topology of the queuing network. The error produced by the time interval can then be mitigated using results from queuing theory.

6.2 Future Research

Current GPUs and CUDA provide programmers with an efficient framework of parallel processing for general-purpose computations. A GPU can be more powerful and cost-effective than other parallel computers if it is efficiently programmed. However, parallel programming on GPUs may still be inconvenient for programmers, since not all general algorithms and programming techniques can be directly converted and used.

We can further improve the performance of queuing network simulations by removing more sequential executions from the GPU. The magnitude of the performance gain depends on how much we can reduce sequential execution in the simulation. In this study, we were able to completely remove sequential executions in the simulation of the closed queuing network. However, the synchronous executions of multiple threads require at least some code to be sequential. Thus, removing sequential execution in the program code not only improves performance, but also reduces the error in the


statistical results, since we can then achieve considerable speedup with a small time interval. Some sequential code related to data inconsistency can be converted to parallel code using atomic functions on devices of compute capability 1.1 and above. However, we still need parallel algorithms to process the remaining sequential code (e.g., the MAC protocol in the MANET simulations) in parallel.

Error analysis for real applications is more complex than for the example of the toroidal queuing network, since the service rates of the service facilities differ and many parameters must be considered. For these reasons, it is difficult to capture the characteristics of complex simulation models. Our future research will include further studies of error estimation and correction methods for various applications.
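The atomic functions mentioned above (such as CUDA's `atomicAdd`) serialize conflicting updates in hardware, letting many threads accumulate into one shared statistic without explicit locking. A minimal CPU-side analogue of that pattern, using `std::atomic` in place of a device atomic, is sketched below; `parallelCount` is a hypothetical name, not an API from this work.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// CPU analogue of the atomicAdd pattern: many threads increment one
// shared counter; the atomic fetch_add replaces explicit mutual exclusion.
static long parallelCount(int nThreads, int perThread) {
    std::atomic<long> total{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back([&total, perThread] {
            for (int i = 0; i < perThread; ++i)
                total.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool)
        th.join();
    return total.load();  // deterministic despite the racing updates
}
```

Without the atomic, the interleaved read-modify-write sequences would lose increments; this is exactly the data-inconsistency problem that device atomics would remove from the sequential portion of the simulation.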




BIOGRAPHICAL SKETCH

Hyungwook Park received his B.S. degree in computer science from the Korea Military Academy in 1995 and his M.S. degree in computer and information science and engineering from the University of Florida in 2003. He served as a senior programmer in the Republic of Korea Army Headquarters and Logistics Command until he started his Ph.D. studies at the University of Florida in 2005. His research interests are modeling and parallel simulation.