
Parallel Sorting and Motif Search

Permanent Link: http://ufdc.ufl.edu/UFE0044443/00001

Material Information

Title: Parallel Sorting and Motif Search
Physical Description: 1 online resource (132 p.)
Language: english
Creator: Bandyopadhyay, Shibdas
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2012

Subjects

Subjects / Keywords: motif -- parallel -- sorting
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: With the proliferation of multi-core architectures, it has become increasingly important to design versions of popular algorithms which exploit different micro-architectural  features of these chip multi-processors to gain maximum speed-up compared to a single core processor. In this work, we have developed parallel algorithms for solving two common problems: Sorting of large records and motif search. Most of our work is  done on a Cell processor and on Graphics Processing Units (GPU). We proposed a novel parallel merge sort algorithm on a Cell processor. We then used the parallel merge algorithm to sort a large number of records across different cores of Cell processor. We developed a  parallel radix sort algorithm to be run on a GPU. We also extended this algorithm to  sort a large number of records. Extensive experiments are done to compare the  performance of the proposed sorting algorithms with other contemporary algorithms.  The motif search is an approximate pattern search problem in  computational biology where a common pattern, albeit with a few mismatches, needs to be found from a set of strings. We first proposed a sequential algorithm for motif search which shows an improvement over the recently proposed algorithms. We then proposed a parallel implementation of the sequential  algorithm which exploits the parallelism present in a multi-core system.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Shibdas Bandyopadhyay.
Thesis: Thesis (Ph.D.)--University of Florida, 2012.
Local: Adviser: Sahni, Sartaj.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2012
System ID: UFE0044443:00001


PARALLEL SORTING AND MOTIF SEARCH

By

SHIBDAS BANDYOPADHYAY

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2012

© 2012 Shibdas Bandyopadhyay

I dedicate this to my family.

ACKNOWLEDGMENTS

This work would not have been possible without the continuous guidance of my advisor Dr. Sartaj Sahni. His ideas and pertinent questions inspired me to explore different ways to solve problems. I'm also grateful to my parents for encouraging and supporting me during all these years.

TABLE OF CONTENTS

                                                                    page
ACKNOWLEDGMENTS                                                        4
LIST OF TABLES                                                         8
LIST OF FIGURES                                                        9
ABSTRACT                                                              12

CHAPTER

1 MULTI-CORE ARCHITECTURES                                            13
  1.1 The Cell Broadband Engine                                       13
  1.2 Graphics Processing Units                                       16

2 SORTING ON THE CELL BROADBAND ENGINE                                20
  2.1 High-level Strategies For Sorting                               20
  2.2 SPU Vector and Memory Operations                                21
  2.3 Sorting Numbers                                                 23
    2.3.1 Single SPU Sort                                             23
      2.3.1.1 Shellsort variants                                      24
      2.3.1.2 Merge sort                                              28
      2.3.1.3 Comparison of single SPU sorting algorithms             37
    2.3.2 Hierarchical Sort                                           38
  2.4 Sorting Records                                                 41
    2.4.1 Record Layouts                                              41
    2.4.2 High-level Strategies for Sorting Records                   42
    2.4.3 Single SPU Record Sorting                                   44
    2.4.4 Hierarchical Sorting for Records - 4-way Merge              45
    2.4.5 Comparison of Record Sorting Algorithms                     49
      2.4.5.1 Run times for ByField layout                            50
      2.4.5.2 Run times for ByRecord layout                           52
      2.4.5.3 Cross layout comparison                                 54
  2.5 Summary                                                         55

3 SORTING ON A GRAPHICS PROCESSING UNIT (GPU)                         59
  3.1 Sorting Numbers on GPUs                                         59
    3.1.1 GPU Sample Sort                                             61
      3.1.1.1 Step 1 – Splitter selection                             62
      3.1.1.2 Step 2 – Finding buckets                                63
      3.1.1.3 Step 3 – Prefix sum                                     63
      3.1.1.4 Step 4 – Placing elements into buckets                  63
    3.1.2 Warpsort                                                    64
      3.1.2.1 Step 1 – Bitonic sort by warps                          64
      3.1.2.2 Step 2 – Bitonic merge by warps                         65
      3.1.2.3 Step 3 – Splitting long sequences                       65
      3.1.2.4 Step 4 – Final merge by warps                           66
    3.1.3 SRTS Radix Sort                                             66
      3.1.3.1 Step 1 – Bottom level reduce                            68
      3.1.3.2 Step 2 – Top level scan                                 69
      3.1.3.3 Step 3 – Bottom level scan                              69
    3.1.4 SDK Radix Sort Algorithm                                    70
      3.1.4.1 Step 1 – Sorting tiles                                  71
      3.1.4.2 Step 2 – Calculating histogram                          73
      3.1.4.3 Step 3 – Prefix sum of histogram                        73
      3.1.4.4 Step 4 – Rearrangement                                  74
    3.1.5 GPU Radix Sort (GRS)                                        75
      3.1.5.1 Step 1 – Histogram and Ranks                            76
      3.1.5.2 Step 2 – Prefix sum of tile histograms                  79
      3.1.5.3 Step 3 – Positioning numbers in a tile                  80
    3.1.6 Comparison of Number Sorting Algorithms                     81
  3.2 Sorting Records on GPUs                                         82
    3.2.1 Record Layouts                                              82
    3.2.2 High level Strategies for Sorting Records                   83
    3.2.3 Sample Sort for Sorting Records                             84
    3.2.4 SRTS for Sorting Records                                    87
    3.2.5 GRS for Sorting Records                                     89
  3.3 Experimental Results                                            91
    3.3.1 Run Times for ByField Layout                                92
    3.3.2 Run Times for Hybrid Layout                                 93
  3.4 Summary                                                         95

4 MOTIF SEARCH                                                        97
  4.1 Introduction                                                    97
  4.2 PMS5                                                           100
    4.2.1 Notations and Definitions                                  100
    4.2.2 PMS5 Overview                                              100
    4.2.3 Computing B_d(x, y, z)                                     101
    4.2.4 Intersection of Q's                                        104
  4.3 PMS6                                                           105
    4.3.1 Overview                                                   105
    4.3.2 Computing B_d(C(n_1, ..., n_5))                            105
    4.3.3 The Data Structure Q                                       109
    4.3.4 Complexity                                                 110
  4.4 Experimental Results                                           110
    4.4.1 Preprocessing                                              111
    4.4.2 Computing B_d                                              112
    4.4.3 Total Time                                                 112
    4.4.4 Total Time Plus Preprocessing Time                         113
  4.5 PMS6MC                                                         113
    4.5.1 Overview                                                   113
    4.5.2 Outer-Level Parallelism                                    114
    4.5.3 Inner-Level Parallelization                                114
      4.5.3.1 Finding equivalence classes in parallel                115
      4.5.3.2 Computing B_d(C) in parallel                           116
      4.5.3.3 Processing Q in parallel                               117
      4.5.3.4 The outputMotifs routine                               119
  4.6 Experimental Results                                           120
    4.6.1 PMS6MC Implementation                                      121
    4.6.2 PMS6 and PMS6MC                                            121
    4.6.3 PMS6MC and Other Parallel Algorithms                       122
  4.7 Summary                                                        123

5 CONCLUSION AND FUTURE WORK                                         125

REFERENCES                                                           127

BIOGRAPHICAL SKETCH                                                  132

LIST OF TABLES

Table                                                               page
2-1  Comparison of various SPU sorting algorithms                     39
4-1  Preprocessing times for PMS5 and PMS6                           112
4-2  Total storage for PMS5 and PMS6 in MegaBytes (MB)               112
4-3  Time taken to compute B_d(C(n_1, ..., n_5)) and B_d(x, y, z)    112
4-4  Total run time of PMS5 and PMS6                                 113
4-5  Total + preprocessing time by PMS5 and PMS6                     113
4-6  Degree of inner and outer level parallelism for PMS6MC          121
4-7  Run times for PMS6 and PMS6MC                                   122
4-8  Run times for PMS5 and PMSPrune                                 122
4-9  Total run time of different PMS Algorithms                      123
4-10 Comparing mSPELLER and gSPELLER with PMS6MC                     123

LIST OF FIGURES

Figure                                                              page
1-1  Architecture of the Cell Broadband Engine                        15
1-2  NVIDIA's Tesla GPU                                               16
1-3  Cuda Programming Model                                           17
2-1  Comparisons for mySpu_cmpswap_skew                               22
2-2  Comb sort                                                        26
2-3  Single SPU column-major AA-sort                                  26
2-4  Collecting numbers from a 4 x 4 Matrix                           27
2-5  Column-major to row-major                                        28
2-6  Column-major brick sort                                          29
2-7  Column-major shaker sort                                         29
2-8  Merge sort example                                               31
2-9  Phase 2 of merge sort                                            32
2-10 Phase 3 counters                                                 33
2-11 Phase 3 of merge sort                                            34
2-12 Phase 4 counters                                                 37
2-13 Phase 4 of merge sort with counters in different columns         38
2-14 Plot of average time to sort 4-byte integers                     40
2-15 SIMD odd-even merge of two vectors                               41
2-16 SIMD 2-way merge of 2 vectors v1 and v2                          42
2-17 Plot of average time to sort 1 to 67 million integers            43
2-18 The findmin operation for records                                44
2-19 The fields select operation in ByField layout                    44
2-20 The fields select operation in ByRecord layout                   45
2-21 Shuffling two records in ByField layout                          45
2-22 Shuffling two records in ByRecord layout                         45
2-23 4-way merge                                                      46
2-24 4-way merge                                                      48
2-25 SIMD 2-way merge of 2 vectors v1 and v2                          49
2-26 2-way sorts (ByField)                                            51
2-27 4-way sorts (ByField)                                            52
2-28 2-way and 4-way sorts (ByField), 1M records                      53
2-29 4-way sorts (ByField), 1M records                                54
2-30 2-way sorts (ByRecord)                                           55
2-31 4-way sorts (ByRecord)                                           56
2-32 2-way and 4-way sorts (ByRecord), 1M records                     57
2-33 4-way sorts (ByRecord), 1M records                               58
2-34 4-way sorts using the best algorithm for different layouts       58
3-1  Serial Sample Sort                                               62
3-2  An Iteration of GPU Sample Sort                                  63
3-3  Bitonic Merge Sort of 8 Elements                                 64
3-4  Warpsort Steps                                                   66
3-5  Bottom Level Reduction                                           68
3-6  SRTS Steps                                                       70
3-7  Bit-split Scheme for Sorting Numbers on 4 Bits                   71
3-8  Divergence Free Warp Scan Algorithm                              72
3-9  Warp Scan of 8 Numbers                                           72
3-10 Calculating Histogram Offsets                                    73
3-11 Column Major Histogram                                           74
3-12 Algorithm to Compute the Histograms and Ranks of 64 Tiles        77
3-13 Reading the Numbers from Global Memory                           78
3-14 Processing an Element of sHist4[]                                79
3-15 Writing the Ranks to Global Memory                               79
3-16 Writing the Histograms to Global Memory                          80
3-17 Rearranging Data                                                 81
3-18 Sorting numbers using radix sorts                                82
3-19 Moving records in ByField layout                                 86
3-20 Moving records in ByRecord layout                                86
3-21 Optimized version of moving records in ByRecord layout           87
3-22 SampleSort-direct and SampleSort-indirect for 10M records (ByField)  92
3-23 SRTS-direct and SRTS-indirect for 10M records (ByField)          92
3-24 GRS-direct and GRS-indirect for 10M records (ByField)            93
3-25 Different sorting algorithms for 10M Records (ByField)           94
3-26 SampleSort-direct and SampleSort-indirect for 10M records (Hybrid)   94
3-27 SRTS-direct and SRTS-indirect for 10M records (Hybrid)           95
3-28 GRS-direct and GRS-indirect for 10M records (Hybrid)             95
3-29 Different sorting algorithms for 10M records (Hybrid)            96
4-1  PMS5                                                            101
4-2  2-Neighborhood tree                                             102
4-3  Computing B_d(x, y, z)                                          104
4-4  PMS6                                                            106
4-5  Computing B_d(n_1, ..., n_5)                                    108
4-6  PMS6MC outer level loop                                         114
4-7  Finding motifs in parallel                                      115
4-8  Finding equivalence classes in parallel                         116
4-9  Compute and process B_d(C)                                      117
4-10 Prefix sum of counters                                          119
4-11 Processing Q                                                    120

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PARALLEL SORTING AND MOTIF SEARCH

By

Shibdas Bandyopadhyay

August 2012

Chair: Sartaj Sahni
Major: Computer Engineering

With the proliferation of multi-core architectures, it has become increasingly important to design versions of popular algorithms which exploit different micro-architectural features of these chip multi-processors to gain maximum speed-up compared to a single core processor. In this work, we have developed parallel algorithms for solving two common problems: sorting of large records and motif search. Most of our work is done on a Cell processor and on Graphics Processing Units (GPU). We proposed a novel parallel merge sort algorithm on a Cell processor. We then used the parallel merge algorithm to sort a large number of records across different cores of the Cell processor. We developed a parallel radix sort algorithm to be run on a GPU. We also extended this algorithm to sort a large number of records. Extensive experiments are done to compare the performance of the proposed sorting algorithms with other contemporary algorithms. The motif search is an approximate pattern search problem in computational biology where a common pattern, albeit with a few mismatches, needs to be found from a set of strings. We first proposed a sequential algorithm for motif search which shows an improvement over the recently proposed algorithms. We then proposed a parallel implementation of the sequential algorithm which exploits the parallelism present in a multi-core system.

CHAPTER 1
MULTI-CORE ARCHITECTURES

During recent years, a great number of high performance multi-core processors have been released. Instead of having a single high performance core, the effective performance of a processor is increased by having many cores and designing algorithms to take advantage of those cores. Most of them fall into the category of chip multi-processors (CMP) where all cores are packed onto a single chip with a very high speed bus connecting them. In a homogeneous CMP, all cores are of the same type, while in heterogeneous CMPs not all cores are of the same type. Typically, the cores have their own small local storage and they are connected to the external memory using some kind of memory interface controller. The maximum performance can only be achieved when these cores are carefully programmed to do a lot of the processing locally and then effectively communicate the results among each other. Depending on the architecture of these cores, one might have to use different programming paradigms and techniques to build code which can be run on these systems. We focus on two different CMPs in this work. The first one is a heterogeneous CMP called the Cell Broadband Engine, which is most commonly found in Playstation 3 game consoles. The next one is the widely popular graphics processing unit (GPU), which is present in almost all the personal computers made nowadays. Both of these CMPs are low cost, high performance processors compared to a traditional Intel x86 processor. However, programming the Cell and GPUs is a bit different from programming a typical processor found in a modern PC, but they can be programmed using a high level language like C. In this chapter, we talk about the architecture and programming models for the Cell and GPUs.

1.1 The Cell Broadband Engine

The Cell Broadband Engine (CBE) is a heterogeneous multicore architecture jointly developed by IBM, Sony, and Toshiba. The CBE was originally designed for multimedia and graphics applications. It first appeared in Sony's Playstation 3 game console. Toshiba has also used the CBE in their High Definition Televisions, assisting in decoding multiple high definition streams. The CBE, however, was designed to do any kind of general purpose compute-intensive task. Consequently, IBM built Cell blade servers and also built supercomputers using the CBE with more than a petaflop in peak performance. The CBE came into the picture when traditional CPUs had hit a point of diminishing return with single core designs and multi-core CPUs were being considered to be the next phase of processor evolution. The CBE was a departure from the way CPUs were designed, mainly because of the applications for which it was designed. The design decision was to have a set of very simple and fast cores to process data at very high speeds along with a complicated general purpose control core which is capable of running an operating system. To this end, the CBE is an in-order processor unlike most modern processors. Programmers and the compiler are responsible for any kind of instruction scheduling to optimize the performance. A CBE (Figure 1-1) consists of a PowerPC core (PPE), eight Synergistic Processing Elements (SPEs), an on-board memory controller and a flexible IO controller [1]. All these components are connected in a ring called the Element Interconnect Bus (EIB). Two CBEs can also be connected to each other using the flexible IO connection. The PPE is designed to perform like an in-order CPU capable of fast task switching and dispatching compute intensive work to the SPEs. SPEs are designed to be very high speed in-order SIMD compute cores without any cache and without any kind of branch prediction logic. Because of their minimalist design, SPEs can run as fast as 4 GHz. But an SPE can only run at that speed if it has very fast access to the data needed by the instructions being executed. To assist fast execution, each SPE has a local store of size 256 KB with very fast access time like a traditional cache. The local store along with the execution unit is referred to as the Synergistic Processing Unit (SPU). SPEs also have a Memory Flow Controller (MFC) which can move data from main memory to the local store and vice versa via DMA transfers. The MFC also allows many DMA transfers to be scheduled and it runs independent of the execution engine of the SPE. Hence, the SPE can continue execution while the data is being moved. To achieve peak performance, programmers need to exploit this concurrency by carefully scheduling data transfers interleaved with execution. SPEs have two different execution pipelines. Programmers or the compiler can schedule instructions such that consecutive instructions fall into different pipelines and two instructions are issued in a cycle. As the SPE has a SIMD instruction set, it is also necessary to pack data in vectors to exploit the full potential of an SPE. Along with the aforementioned programming techniques, the absence of branch prediction logic in an SPE poses a challenge when developing high performance CBE algorithms.

Figure 1-1. Architecture of the Cell Broadband Engine [2]
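The concurrency between DMA transfers and SPU execution mentioned above is typically exploited with double buffering. The following is a minimal C-style sketch only, not code from this dissertation: it reuses the dmaIn and dmaOut helpers described later in Section 2.2, while BUF_ELEMS, numChunks, waitForDma, process, inStream, and outStream are hypothetical names introduced just for illustration, and completion checks beyond the single waitForDma call are omitted for brevity.

    /* Double-buffered SPE processing loop (illustrative sketch).
     * While the SPU processes buf[cur], the MFC concurrently streams the
     * next chunk of the input into buf[1 - cur]. */
    vector signed int buf[2][BUF_ELEMS];     /* two local-store buffers */
    int cur = 0;
    dmaIn(buf[0], inStream);                 /* prefetch the first chunk */
    for (int chunk = 0; chunk < numChunks; chunk++) {
        if (chunk + 1 < numChunks)
            dmaIn(buf[1 - cur], inStream);   /* start fetching the next chunk */
        waitForDma(buf[cur]);                /* wait for the current chunk to arrive */
        process(buf[cur]);                   /* compute on local-store data */
        dmaOut(buf[cur], outStream);         /* write results back asynchronously */
        cur = 1 - cur;                       /* swap buffer roles */
    }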

1.2 Graphics Processing Units

Contemporary graphics processing units (GPUs) are massively parallel manycore processors. NVIDIA's Tesla GPUs, for example, have 240 scalar processing cores (SPs) per chip [3]. These cores are partitioned into 30 Streaming Multiprocessors (SMs), with each SM comprising 8 SPs. Each SM shares a 16 KB local memory (called shared memory) and has a total of 16,384 32-bit registers that may be utilized by the threads running on this SM. Besides registers and shared memory, on-chip memory shared by the cores in an SM also includes constant and texture caches. The 240 on-chip cores also share a 4 GB off-chip global (or device) memory. Figure 1-2 shows a schematic of the Tesla architecture. With the introduction of CUDA (Compute Unified Device Architecture) [4], it has become possible to program GPUs using C. This has resulted in an explosion of research directed toward expanding the applicability of GPUs from their native computer graphics applications to a wide variety of high-performance computing applications.

Figure 1-2. NVIDIA's Tesla GPU [5]

GPUs operate under the master-slave computing model (see, e.g., [6]) in which there is a host or master processor to which is attached a collection of slave processors. A possible configuration would have a GPU card attached to the bus of a PC. The PC CPU would be the host or master and the GPU processors would be the slaves. The CUDA programming model requires the user to write a program that runs on the host processor. At present, CUDA supports host programs written in C and C++ only, though there are plans to expand the set of available languages [4]. The host program may invoke kernels, which are C functions, that run on the GPU slaves. A kernel may be instantiated in synchronous (the CPU waits for the kernel to complete before proceeding with other tasks) or asynchronous (the CPU continues with other tasks following the spawning of a kernel) mode. A kernel specifies the computation to be done by a thread. When a kernel is invoked by the host program, the host program specifies the number of threads that are to be created. Each thread is assigned a unique ID and CUDA provides C-language extensions to enable a kernel to determine which thread it is executing. The host program groups threads into blocks, by specifying a block size, at the time a kernel is invoked. Figure 1-3 shows the organization of threads used by CUDA.

Figure 1-3. Cuda Programming Model [4]
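The kernel, block, and thread organization just described can be illustrated with a minimal CUDA C example. The code below is not from the dissertation; the kernel name vecAdd, the device arrays d_a, d_b, d_c, and the block size of 256 are arbitrary choices used only to show how a thread derives its unique ID from the block and thread indices that CUDA provides.

    /* Each thread adds one pair of elements; the host launches enough
     * blocks of 256 threads to cover all n elements. */
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* unique thread ID */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* Host-side launch (d_a, d_b, d_c are device pointers): */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);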

The GPU schedules the threads so that a block of threads runs on the cores of an SM. At any given time, an SM executes the threads of a single block and the threads of a block can execute only on a single SM. Once a block begins to execute on an SM, that SM executes the block to completion. Each SM schedules the threads in its assigned block in groups of 32 threads called a warp. The partitioning into warps is fairly intuitive, with the first 32 threads forming the first warp, the next 32 threads forming the next warp, and so on. A half warp is a group of 16 threads. The first 16 threads in a warp form the first half warp for the warp and the remaining 16 threads form the second half warp. When an SM is ready to execute the next instruction, it selects a warp that is ready (i.e., its threads are not waiting for a memory transaction to complete) and executes the next instruction of every thread in the selected warp. Common instructions are executed in parallel using the 8 SPs in the SM. Non-common instructions are serialized. So, it is important, for performance, to avoid thread divergence within a warp. Some of the other factors important for performance are listed next; a short sketch illustrating items 5 and 6 follows the list.

1. Since access to global memory is about two orders of magnitude more expensive than access to registers and shared memory, data that are to be used several times should be read once from global memory and stored in registers or shared memory for reuse.

2. When the threads of a half-warp access global memory, this access is accomplished via a series of memory transactions. The number of memory transactions equals the number of different 32-byte (64-byte, 128-byte, 128-byte) memory segments that the words to be accessed lie in when each thread accesses an 8-bit (16-bit, 32-bit, 64-bit) word. Given the cost of a global memory transaction, it pays to organize the computation so that the number of global memory transactions made by each half warp is minimized.

3. Shared memory is divided into 16 banks in round robin fashion using words of size 32 bits. When the threads of a half warp access shared memory, the access is accomplished as a series of 1 or more memory transactions. Let S denote the set of addresses to be accessed. Each transaction is built by selecting one of the addresses in S to define the broadcast word. All addresses in S that are included in the broadcast word are removed from S. Next, up to one address from each of the remaining banks is removed from S. The set of removed addresses is serviced by a single memory transaction. Since the user has no way to specify the broadcast word, for maximum parallelism, the computation should be organized so that, at any given time, the threads in a half warp access either words in different banks of shared memory or they access the same word of shared memory.

4. Volkov et al. [7] have observed greater throughput using operands in registers than operands in shared memory. So, data that is to be used often should be stored in registers rather than in shared memory.

5. Loop unrolling often improves performance. However, the #pragma unroll statement unrolls loops only under certain restrictive conditions. Manual loop unrolling by replicating code and changing the loop stride can be employed to overcome these limitations.

6. Arrays declared as register arrays get assigned to global memory when the CUDA compiler is unable to determine at compile time what the value of an array index is. This is typically the case when an array is indexed using a loop variable. Manually unrolling the loop so that all references to the array use indexes known at compile time ensures that the register array is, in fact, stored in registers.
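Items 5 and 6 can be illustrated with the following hypothetical CUDA C sketch; it is not code from the dissertation. The first kernel indexes a small per-thread array with a loop variable, which may force the array out of registers; the second is manually unrolled so that every array reference becomes a scalar with a compile-time-known name and can stay in registers.

    /* Count how many of this thread's keys have low 2-bit digit 0, 1, 2, or 3. */

    /* Version the compiler may spill: acc is indexed by the loop variable i. */
    __global__ void digitCount_loop(const int *keys, int *out, int n) {
        int acc[4] = {0, 0, 0, 0};
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        for (int j = t; j < n; j += blockDim.x * gridDim.x)
            for (int i = 0; i < 4; i++)       /* acc[i]: index not a constant */
                acc[i] += ((keys[j] & 3) == i);
        for (int i = 0; i < 4; i++)
            atomicAdd(&out[i], acc[i]);
    }

    /* Manually unrolled version: the four counters are named scalars. */
    __global__ void digitCount_unrolled(const int *keys, int *out, int n) {
        int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        for (int j = t; j < n; j += blockDim.x * gridDim.x) {
            int d = keys[j] & 3;
            acc0 += (d == 0);
            acc1 += (d == 1);
            acc2 += (d == 2);
            acc3 += (d == 3);
        }
        atomicAdd(&out[0], acc0); atomicAdd(&out[1], acc1);
        atomicAdd(&out[2], acc2); atomicAdd(&out[3], acc3);
    }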

CHAPTER 2
SORTING ON THE CELL BROADBAND ENGINE

2.1 High-level Strategies For Sorting

As noted in [8], a logical way to develop a sorting algorithm for a heterogeneous multicore computer such as the CBE is to (1) begin with a sorting algorithm for a single SPU, then (2) using this as a core, develop a sort algorithm for the case when the data fits in all available cores, then (3) use this multi-SPU algorithm to develop a sort algorithm for the case when the data to be sorted does not fit in the local stores of all available SPEs but fits in main memory. The strategy would be to extend this hierarchical plan to the case where the data to be sorted is so large that it is distributed over the main memories of a cluster of CBEs.

An alternative strategy to sort is to use the master-slave model in which the PPU serves as the master processor and the SPUs are the slave processors. The PPU partitions the data to be sorted and sends each partition to a different SPU; the SPUs sort their partition using a single SPU sort; the PPU merges the sorted data from the SPUs so as to complete the sort of the entire data set. This strategy is used in [9] to sort on the nCube hypercube and in [10] to sort on the CBE.

Regardless of whether we sort large data sets using the hierarchical strategy of [8] or the master-slave strategy of [9, 10], it is important to have a fast algorithm to sort within a single SPU. The absence of any branch prediction capability and the availability of vector instructions that support SIMD parallelism on an SPU make the development of efficient SPU sort algorithms a challenge. SPUs also have two instruction pipelines which make them capable of issuing two instructions in the same cycle if they fall in different pipelines. It pays to hand-tune the generated assembly code so that two instructions are issued in as many of the cycles as possible.

2.2 SPU Vector and Memory Operations

We shall use several SIMD functions that operate on a vector of 4 numbers to describe the SPU adaptation of sorting algorithms. We describe these in this section. In the following, v1, v2, min, max, and temp are vectors, each comprised of 4 numbers, and p, p1, and p2 are bit patterns. Also, dest (destination) and src (source) are addresses in the local store of an SPU, and bufferA is a buffer in local store while streamA is a data stream in main memory. Furthermore, function names that begin with spu_ are standard C/C++ Cell SPU intrinsics while those that begin with mySpu_ are defined by us. Our description of these functions is tailored to the sorting application.

1. spu_shuffle(v1, v2, p): This function returns a vector comprised of a subset of the 8 numbers in v1 and v2. The returned subset is determined by the bit pattern p. Let W, X, Y, and Z denote the 4 numbers (left to right) of v1 and let A, B, C, and D denote those of v2. The bit pattern p = XCCW, for example, returns a vector comprised of the second number in v1 followed by two copies of the third number of v2 followed by the first number in v1. In the following, we assume that constant patterns such as XYZD have been pre-defined.

2. spu_cmpgt(v1, v2): A 128-bit vector representing the pairwise comparison of the 4 numbers of v1 with those of v2 is returned. If an element of v1 is greater than the corresponding element of v2, the corresponding 32 bits of the returned vector are 1; otherwise, these bits are 0.

3. spu_add(v1, v2): Returns the vector obtained by pairwise adding the numbers of v1 with the corresponding numbers of v2.

4. spu_sub(v1, v2): Returns the vector obtained by pairwise subtracting the numbers of v2 from the corresponding numbers of v1.

5. spu_and(p1, p2): Returns the vector obtained by pairwise anding the bits of p1 and p2.

6. mySpu_not(p): Returns the vector obtained by complementing each of the bits of p. Although the CBE does not have a not instruction, we can perform this operation using the nor function that is supported by the CBE and which computes the complement of the bitwise or of two vectors. It is easy to see that spu_nor(p, v0), where v0 is an all zero vector, correctly computes the complement of the bits of p.

7. spu_select(v1, v2, p): Returns a vector whose ith bit comes from v1 (v2) when the ith bit of p is 0 (1).

8. spu_slqwbyte(v1, n): Returns a vector obtained by shifting the bytes of v1 m bytes to the left, where m is the number represented by the 5 least significant bits of n. The left shift is done with zero fill. So, the rightmost m bytes of the returned vector are 0.

9. spu_splats(s): Returns a vector comprised of 4 copies of the number s.

10. mySpu_cmpswap(v1, v2): Pairwise compares the numbers of v1 and v2 and swaps so that v1 has the smaller number of each compare and v2 has the larger number. Specifically, the following instructions are executed:
    p = spu_cmpgt(v1, v2);
    min = spu_select(v1, v2, p);
    v2 = spu_select(v2, v1, p);
    v1 = min;

11. mySpu_cmpswap_skew(v1, v2): Performs the comparisons and swaps shown in Figure 2-1. Specifically, the following instructions are executed:
    temp = spu_slqwbyte(v2, 4);
    p = spu_cmpgt(v1, temp);
    min = spu_select(v1, temp, p);
    v1 = spu_shuffle(min, v1, WXYD);
    max = spu_select(temp, v1, p);
    v2 = spu_shuffle(max, v2, AWXY);

Figure 2-1. Comparisons for mySpu_cmpswap_skew (lanes v10, v11, v12 of v1 are compare-exchanged with lanes v21, v22, v23 of v2; minima go to v1 and maxima to v2)

12. mySpu_gather(vArray, v1): Here vArray is an array of vectors. Let W, X, Y, and Z be the numbers of v1. The function returns a vector whose first number is the first number of vArray[W], its second number is the second number of vArray[X], its third number is the third number of vArray[Y], and its fourth number is the fourth number of vArray[Z]. One implementation of this function first extracts W, X, Y, and Z from v1 using the function spu_extract and then executes the code:
    temp = spu_shuffle(vArray[W], vArray[X], WBWW);
    temp = spu_shuffle(temp, vArray[Y], WXCC);
    return spu_shuffle(temp, vArray[Z], WXYD);

13. mySpu_gather12(vArray, v1): This function, which is similar to mySpu_gather, returns a vector whose first number is the first number of vArray[W] and whose second number is the second number of vArray[X]. The third and fourth numbers of the returned vector are set arbitrarily. Its code is:
    return spu_shuffle(vArray[W], vArray[X], WBWW);

14. mySpu_gather34(vArray, v1): This function, which is similar to mySpu_gather12, returns a vector whose first number is the third number of vArray[W] and whose second number is the fourth number of vArray[X]. The third and fourth numbers of the returned vector are set arbitrarily. Its code is:
    return spu_shuffle(vArray[W], vArray[X], YDYY);

15. mySpu_gatherA(vArray, v1): This function is similar to mySpu_gather and returns a vector whose first number is the first number of vArray[W], its second number is the third number of vArray[X], its third number is the first number of vArray[Y], and its fourth number is the third number of vArray[Z]. The code:
    temp = spu_shuffle(vArray[W], vArray[X], WCWW);
    temp = spu_shuffle(temp, vArray[Y], WXAA);
    return spu_shuffle(temp, vArray[Z], WXYC);

16. mySpu_gatherB(vArray, v1): This too is similar to mySpu_gather. The function returns a vector whose first number is the second number of vArray[W], its second number is the fourth number of vArray[X], its third number is the second number of vArray[Y], and its fourth number is the fourth number of vArray[Z]. The code is:
    temp = spu_shuffle(vArray[W], vArray[X], XDXX);
    temp = spu_shuffle(temp, vArray[Y], WXBB);
    return spu_shuffle(temp, vArray[Z], WXYD);

17. memcpy(dest, src, size): Copies size number of bytes from the local store location beginning at src to dest.

18. dmaIn(bufferA, streamA): This function triggers a DMA transfer of the next buffer load of data from streamA in main memory into bufferA in the local store. This is done asynchronously and concurrently with SPU execution.

19. dmaOut(bufferA, streamA): This function is similar to dmaIn except that a buffer load of data is transferred asynchronously from bufferA in the local store to streamA in main memory.
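The semantics of two of the mySpu_ helpers above can also be expressed as a scalar reference model, shown below. This is illustrative only and is not the SPU implementation; the vec4 struct and the _ref function names are hypothetical stand-ins for the intrinsic-based versions listed above.

    /* Scalar model of a 4-lane integer vector. */
    typedef struct { int lane[4]; } vec4;

    /* mySpu_cmpswap: lane-wise compare-exchange; after the call,
     * v1 holds the lane-wise minima and v2 the lane-wise maxima. */
    void cmpswap_ref(vec4 *v1, vec4 *v2) {
        for (int i = 0; i < 4; i++) {
            if (v1->lane[i] > v2->lane[i]) {
                int t = v1->lane[i];
                v1->lane[i] = v2->lane[i];
                v2->lane[i] = t;
            }
        }
    }

    /* mySpu_gather: lane i of the result comes from lane i of the row
     * of vArray selected by lane i of the index vector. */
    vec4 gather_ref(const vec4 *vArray, vec4 idx) {
        vec4 r;
        for (int i = 0; i < 4; i++)
            r.lane[i] = vArray[idx.lane[i]].lane[i];
        return r;
    }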

2.3 Sorting Numbers

2.3.1 Single SPU Sort

Recently, three sorting algorithms – AA-sort [11], CellSort [8] and merge sort [12] – were proposed for the CBE. AA-sort is an adaptation of comb sort, which was originally proposed by Knuth [13] and rediscovered by Dobosiewicz [14] and Box and Lacey [15]. CellSort is an adaptation of bitonic sort (e.g., [13]). Both AA-sort and CellSort are based on sorting algorithms that are inefficient on a single processor. Hence, parallelizing these algorithms begins with a handicap relative to the fastest serial sorting algorithms: merge sort for worst-case behavior and quick sort for average behavior. Comb sort is known to have a worst-case complexity that is O(n^2) [16]. Although the best upper bound known for its average complexity is also O(n^2), experimental results indicate an average complexity of O(n log n) [15, 16]. On the other hand, the average complexity of quick sort is known to be O(n log n). Since experiments indicate that comb sort runs in about twice as much time on a single processor as does quick sort [16], attempts such as [11] to develop a fast average-case sort for a single SPU of the CBE that begin with comb sort are handicapped by a factor of two compared to attempts that begin with quick sort. This handicap is overcome by the CBE adaptation of the merge sort described in [12]. For integers and floats, the CBE supports 4-way parallelism within a single SPU as 4 integers (floats) may be stored in each of the SPU's 128-bit vector registers. Hence, we expect at best a two-fold speedup over a conventional implementation of quick sort. However, due to possible anomalous behavior resulting from such factors as the absence of branch prediction, we may actually observe a greater speedup [17]. Similarly, attempts such as [8] to develop a fast worst-case sort for a single SPU starting with bitonic sort are handicapped relative to starting with merge sort because the worst-case complexity of bitonic sort is O(n log^2 n) while that of merge sort is O(n log n).

2.3.1.1 Shellsort variants

Shellsort [13] sorts a sequence of n numbers in m passes employing a decreasing increment sequence i_1 > i_2 > ... > i_m = 1. In the jth pass, increment h = i_j is used; the sequence is viewed as comprised of h subsequences with the kth subsequence comprised of the numbers in positions k, k + h, k + 2h, ..., of the overall sequence, 0 <= k < h; and each subsequence is sorted. The sorting of the subsequences done in each pass is called an h-sort. While an h-sort is typically accomplished using insertion sort, other simple sorting algorithms such as bubble sort may also be used. With the proper choice of increments, the complexity of Shellsort is O(n log^2 n) [13]. Shellsort variants replace the h-sort used in each pass of Shellsort with an h-pass that only partially sorts the subsequences. For example, in an h-bubble pass we make only the first pass of bubble sort on each subsequence. Since replacing h-sort by h-pass in Shellsort no longer guarantees a complete sort, we follow with some simple sorting algorithm such as bubble sort to complete the sort. So, the h-passes may be viewed as preprocessing passes done so as to improve the performance of the ensuing sort algorithm. In Shellsort, i_m = 1 is used to assure that the sequence is sorted following the final h-sort. However, in a Shellsort variant, this assurance comes from the sort algorithm run following the preprocessing h-passes. So, the h-pass with h = 1 is typically skipped. The general structure of a Shellsort variant is:

Step 1 [Preprocess] Perform h-passes for h = i_j, 1 <= j < m.
Step 2 [Sort] Sort the preprocessed sequence.

Comb and AA sort: Knuth [13] proposed a Shellsort variant in which each h-pass is a bubble pass (Figure 2-2). This variant was rediscovered later by Dobosiewicz [14], and Box and Lacey [15] named this variant comb sort. The increment sequence used by comb sort is geometric with factor s. Dobosiewicz [14] has shown that the preprocessing step sorts a[0:n-1] with very high probability whenever s < 1.33. As a result, s = 1.3 is recommended in practice (note that a larger s requires a smaller number of h-passes). With this choice, the outer for loop of the second step (bubble sort) is entered only once with high probability and the complexity of comb sort is O(n log n) with high probability. Experiments indicate that the algorithm's average run time is close to that of quick sort [16]. However, the worst-case complexity of comb sort is O(n^2) [16].

Inoue et al. [11] have adapted comb sort to the CBE to obtain the sort method AA-sort, which efficiently sorts numbers using all 8 SPUs of a CBE. The single SPU version begins with a vector array d[0:r-1] of numbers; each vector d[i] has 4 numbers. Hence, d is an r x 4 matrix of numbers. This matrix is first sorted into column-major order and then the numbers are permuted so as to be sorted in row-major order. Figure 2-3 gives the algorithm for the column-major sort and Figure 2-5 gives the column-major to row-major reordering algorithm.

Algorithm combsort(a, n) {
  // sort a[0:n-1]
  // Step 1: Preprocessing
  for (h = n/s; h > 1; h /= s) {
    // h-bubble pass
    for (i = 0; i < n-h; i++)
      if (a[i] > a[i+h]) swap(a[i], a[i+h]);
  }
  sorted = false;
  // Step 2: Bubble sort
  for (pass = 1; pass < n && !sorted; pass++) {
    sorted = true;
    for (i = 0; i < n-pass; i++)
      if (a[i] > a[i+1]) {
        swap(a[i], a[i+1]);
        sorted = false;
      }
  }
}
Figure 2-2. Comb sort

Algorithm AA(d, r) {
  // sort d[0:r-1] into column-major order
  // Step 1: Preprocessing
  for (i = 0; i < r; i++) sort(d[i]);
  for (h = r; h > 1; h /= s) {
    // h-bubble pass
    for (i = 0; i < r-h; i++)
      mySpu_cmpswap(d[i], d[i+h]);
    for (i = r-h; i < r; i++)
      mySpu_cmpswap_skew(d[i], d[i+h-r]);
  }
  sorted = false;
  // Step 2: Bubble sort
  do {
    for (i = 0; i < r-1; i++)
      mySpu_cmpswap(d[i], d[i+1]);
    mySpu_cmpswap_skew(d[r-1], d[0]);
  } while (not sorted)
}
Figure 2-3. Single SPU column-major AA-sort [11]
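As a small worked example of the geometric increment sequence (the snippet below is not from the dissertation and the value n = 64 is arbitrary): with shrink factor s = 1.3, the h-passes for n = 64 elements use h = 49, 37, 28, 21, 16, 12, 9, 6, 4, 3, 2; the h = 1 pass is left to the final bubble sort.

    /* Print the comb sort increment sequence for n elements, shrink factor s. */
    #include <stdio.h>
    int main(void) {
        int n = 64;
        double s = 1.3;
        for (int h = (int)(n / s); h > 1; h = (int)(h / s))
            printf("%d ", h);   /* prints: 49 37 28 21 16 12 9 6 4 3 2 */
        return 0;
    }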

The column-major to row-major reordering is done in two steps. In the first step, the numbers in each 4 x 4 submatrix of the r x 4 matrix of numbers are transposed so that each vector now has the 4 numbers in some row of the result. For simplicity, we assume that r is a multiple of 4. In the second step, the vectors are permuted into the correct order. For the first step, we collect the first and second numbers in rows 0 and 2 of the 4 x 4 matrix being transposed into the vector row02A. The third and fourth numbers of these two rows are collected into row02B. The same is done for rows 1 and 3 using vectors row13A and row13B. Figure 2-4 shows this rearrangement. Then, the transpose is constructed from the just computed 4 vectors.

Figure 2-4. Collecting numbers from a 4 x 4 matrix. A) Initial. B) Rows 02A, 02B, 13A, 13B, top to bottom. C) Transpose.
A) Initial:            B) Collected rows:      C) Transpose:
   a b c d                a i b j                 a e i m
   e f g h                c k d l                 b f j n
   i j k l                e m f n                 c g k o
   m n o p                g o h p                 d h l p

Brick sort: In brick sort, we replace the h-bubble pass of comb sort by an h-brick pass [18, 19] in which we first compare-exchange positions i, i + 2h, i + 4h, ... with positions i + h, i + 3h, i + 5h, ..., 0 <= i < h, and then we compare-exchange positions i + h, i + 3h, i + 5h, ... with positions i + 2h, i + 4h, i + 6h, ..., 0 <= i < h. Figure 2-6 gives our CBE adaptation of the preprocessing step (Step 1) for brick sort; a scalar sketch of a single h-brick pass is also given below. Step 2 is a bubble sort as was the case for AA-sort. The bubble sort needs to be followed by a column-major to row-major reordering step (Figure 2-5). It is known that the preprocessing step of brick sort nearly always does a complete sort when the increment sequence is geometric with shrink factor (i.e., s) less than 1.22 [18, 19]. Hence, when we use s < 1.22, the do-while loop of Step 2 (bubble sort) is entered only once (to verify the data are sorted) with high probability.
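The following plain-C sketch of one h-brick pass is illustrative only; it is not the vectorized CBE code of Figure 2-6 and it ignores the wrap-around handling that the vector version needs.

    /* One h-brick pass over a[0:n-1] (scalar sketch, no wrap-around). */
    void brick_pass(int *a, int n, int h) {
        /* even:odd bricks -- positions i, i+2h, i+4h, ... vs i+h, i+3h, ... */
        for (int i = 0; i < h; i++)
            for (int j = i; j + h < n; j += 2 * h)
                if (a[j] > a[j + h]) { int t = a[j]; a[j] = a[j + h]; a[j + h] = t; }
        /* odd:even bricks -- positions i+h, i+3h, ... vs i+2h, i+4h, ... */
        for (int i = 0; i < h; i++)
            for (int j = i + h; j + h < n; j += 2 * h)
                if (a[j] > a[j + h]) { int t = a[j]; a[j] = a[j + h]; a[j + h] = t; }
    }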

Algorithm transpose(d, r) {
  // Column major to row major reordering
  // Step 1: Transpose 4x4 submatrices
  for (i = 0; i < r; i += 4) {
    // Compute row02A, row02B, row13A, and row13B
    row02A = spu_shuffle(d[i], d[i+2], WAXB);
    row02B = spu_shuffle(d[i], d[i+2], YCZD);
    row13A = spu_shuffle(d[i+1], d[i+3], WAXB);
    row13B = spu_shuffle(d[i+1], d[i+3], YCZD);
    // Complete the transpose
    d[i]   = spu_shuffle(row02A, row13A, WAXB);
    d[i+1] = spu_shuffle(row02A, row13A, YCZD);
    d[i+2] = spu_shuffle(row02B, row13B, WAXB);
    d[i+3] = spu_shuffle(row02B, row13B, YCZD);
  }
  // Step 2: Reorder vectors
  for (i = 0; i < r; i++)
    if (!inPlace[i]) {
      current = i;
      next = i/(r/4) + (i mod (r/4))*4;
      temp = d[i];
      while (next != i) {
        // follow cycle
        d[current] = d[next];
        inPlace[current] = true;
        current = next;
        next = current/(r/4) + (current mod (r/4))*4;
      }
      d[current] = temp;
      inPlace[current] = true;
    }
}
Figure 2-5. Column-major to row-major

Shaker sort: Shaker sort differs from comb sort in that h-bubble passes are replaced by h-shake passes. An h-shake pass is a left-to-right bubble pass as in comb sort followed by a right-to-left bubble pass. Figure 2-7 gives our CBE adaptation of shaker sort. The preprocessing step of shaker sort almost always sorts the data when the shrink factor s is less than 1.7.

Algorithm Brick(d, r) {
  // sort d[0:r-1] into column-major order
  // Step 1: Preprocessing
  for (i = 0; i < r; i++) sort(d[i]);
  for (h = r; h > 1; h /= s) {
    // h-brick pass
    // compare-exchange even:odd bricks
    for (i = 0; i < r-2*h; i += 2*h)
      for (j = i; j < i+h; j++)
        mySpu_cmpswap(d[j], d[j+h]);
    // handle end conditions
    if (j < n-h) {
      // More than 1 brick remains
      end = j+h;
      for (; j < n-h; j++)
        mySpu_cmpswap(d[j], d[j+h]);
    }
    else end = r;
    while (j < end) {
      mySpu_cmpswap_skew(d[j], d[j+h-n]);
      j++;
    }
    // compare-exchange odd:even bricks beginning with i = h
    // similar to even:odd bricks
  }
  // Step 2: Bubble sort
  // same as for AA-sort
}
Figure 2-6. Column-major brick sort

Algorithm Shaker(d, r) {
  // sort d[0:r-1] into column-major order
  // Step 1: Preprocessing
  for (i = 0; i < r; i++) sort(d[i]);
  for (h = r; h > 1; h /= s) {
    // h-shake pass
    // left-to-right bubble pass
    for (i = 0; i < r-h; i++)
      mySpu_cmpswap(d[i], d[i+h]);
    for (i = r-h; i < r; i++)
      mySpu_cmpswap_skew(d[i], d[i+h-r]);
    // right-to-left bubble pass
    for (i = r-h-1; i > 0; i--)
      mySpu_cmpswap(d[i], d[i+h]);
    for (i = r-1; i >= r-h; i--)
      mySpu_cmpswap_skew(d[i], d[i+h-r]);
  }
  // Step 2: Bubble sort
  // Same as for AA-sort
}
Figure 2-7. Column-major shaker sort

2.3.1.2 Merge sort

Unlike the Shellsort variants comb, brick, and shaker sort of Section 2.3.1.1, whose complexity is O(n log n) with high probability, the worst-case complexity of merge sort is O(n log n). Further, merge sort is a stable sort (i.e., the relative order of elements that have the same key is preserved). While this property of merge sort isn't relevant when we are simply sorting numbers (as you can't tell two equal numbers apart), this property is useful in some applications where each element has several fields, only one of which is the sort key.

The Shellsort variants of Section 2.3.1.1 are not stable sorts. On the downside, efficient implementations of merge sort require added space. When sorting numbers in the vector array d[0:r-1], we need an additional vector array t[0:r-1] to support the merge. The CBE merge sort adaptation is presented as a stable sort and later we point out the simplifications that are possible when we wish to sort numbers rather than elements that have multiple fields. We again assume that the numbers are in the vector array d[0:r-1].

There are 4 phases to our stable merge sort adaptation:

Phase 1: Transpose the elements of d[0:r-1], which represents an r x 4 matrix, from row-major to column-major order.
Phase 2: Sort the 4 columns of the r x 4 matrix independently and in parallel.
Phase 3: In parallel, merge the first 2 columns together and the last 2 columns together to get two sorted sequences of length 2r each.
Phase 4: Merge the two sorted sequences of length 2r each into a row-major sorted sequence of length 4r.

Merge sort phase 1 – transpose: We note that Phase 1 is needed only when we desire a stable sort. Figure 2-8 shows an initial 8 x 4 matrix of numbers and the result following each of the 4 phases of our merge sort adaptation.

Figure 2-8. Merge sort example. A) Initial. B) Phase 1. C) Phase 2. D) Phase 3. E) Phase 4.

The Phase 1 transformation is the inverse of the column-major to row-major transformation done in Figure 2-5 and we do not provide its details. Details for the remaining 3 phases are provided in the following subsections.

Merge sort phase 2 – sort columns: Phase 2 operates in log r subphases characterized by the size of the sorted segments being merged. For instance, in the first subphase, we merge together pairs of sorted segments of size 1 each, in the next subphase the segment size is 2, in the third it is 4, and so forth. At any time, the two segments being merged have the same physical locations in all 4 columns. So, for our 8 x 4 example, when merging together segments of size 2, we shall first merge, in parallel, 4 pairs of segments, one pair from each column.

The first segment of a pair is in rows 0 and 1 of the r x 4 matrix and the second in rows 2 and 3. Then, we shall merge together the segments of size 2 that are in rows 4 through 7. Following this, the segment size becomes 4.

To merge 4 pairs of segments in parallel, we employ 8 counters to keep track of where we are in the 8 segments being merged. The counters are called a0, ..., a3, b0, ..., b3. (a_i, b_i) are the counters for the segments of column i, 0 <= i <= 3, that are being merged. When the segment size is s and the segments being merged occupy rows i through i + 2s - 1, the a counters are initialized to i and the b counters to i + s. Although all a counters have the same value initially, as do all b counters, as merging progresses these counters have different values. Figure 2-9 gives the Phase 2 algorithm. For simplicity, we assume that r is a power of 2.
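When reading the SIMD code of Figure 2-9, it may help to keep in mind the scalar merge it performs per column. The sketch below is illustrative only (merge_segments, col, and out are hypothetical names); the SIMD version carries out the same counter updates for all 4 columns in one vector operation, using gathers in place of the scalar indexing, and the explicit bounds checks here stand in for its handling of exhausted segments.

    /* Merge the two sorted segments col[lo:lo+s-1] and col[lo+s:lo+2s-1]
     * of one column into out[lo:lo+2s-1] using an a counter and a b counter. */
    void merge_segments(const int *col, int *out, int lo, int s) {
        int a = lo, b = lo + s;                 /* the a and b counters */
        for (int k = lo; k < lo + 2 * s; k++) {
            if (b >= lo + 2 * s || (a < lo + s && col[a] <= col[b]))
                out[k] = col[a++];              /* take from the first segment */
            else
                out[k] = col[b++];              /* take from the second segment */
        }
    }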

Algorithm Phase2(d, r) {
  // sort the 4 columns of d[0:r-1], use additional array t[0:r-1]
  for (s = 1; s < r; s *= 2)
    for (i = 0; i < r; i += 2*s) {
      A = spu_splats(i);    // initialize a counters
      B = spu_splats(i+s);  // initialize b counters
      for (k = i; k < i+2*s; k++) {
        // merge the segments
        // one round of compares
        aData = mySpu_gather(d, A);
        bData = mySpu_gather(d, B);
        p = spu_cmpgt(aData, bData);
        t[k] = spu_select(aData, bData, p);
        // update counters
        notP = mySpu_not(p);
        A = spu_sub(A, notP);
        B = spu_sub(B, p);
      }
      swap(d, t);  // swap roles
    }
}
Figure 2-9. Phase 2 of merge sort

Merge sort phase 3 – merge pairs of columns: In Phase 3 we merge the first two and last two columns of the r x 4 matrix together to obtain 2 sorted sequences, each of size 2r. The first sequence is in columns 0 and 1 and the second in columns 2 and 3 of an output matrix. We do this merging by using 8 counters. Counters a0, b0, a1, b1 start at the top of the 4 columns of our matrix and move downwards while counters a2, b2, a3, b3 start at the bottom and move up (see Figure 2-10(a)). Let e(c) be the matrix element that counter c is at. The comparisons e(a_i) : e(b_i), 0 <= i <= 3, are done in parallel and depending on the outcome of these comparisons, 4 of the 8 elements compared are moved to the output matrix. When e(a0) <= e(b0) (e(a1) <= e(b1)), e(a0) (e(a1)) is moved to the output and a0 (a1) incremented by 1; otherwise, e(b0) (e(b1)) is moved to the output and b0 (b1) incremented by 1. Similarly, when e(a2) <= e(b2) (e(a3) <= e(b3)), e(b2) (e(b3)) is moved to the output and b2 (b3) decremented by 1; otherwise, e(a2) (e(a3)) is moved to the output and a2 (a3) decremented by 1. The merge is complete when we have done r rounds of comparisons. Figure 2-11 gives the algorithm for Phase 3.

Theorem 2.1. Algorithm Phase3 correctly merges 4 sorted columns into 2.

Proof. To prove the correctness of Algorithm Phase3, we need to show that each element of the first (last) two columns of the input r x 4 matrix is copied into the first (last) two columns of the output matrix exactly once and that the elements of the first (third) output column followed by those of the second (fourth) are in sorted order.

Figure 2-10. Phase 3 counters

It is sufficient to show this for the first two columns of the input and output matrices. First, observe that when a0 <= a2 (b0 <= b2), these counters are at input elements that have yet to be copied to the output. Further, when a0 > a2 (b0 > b2) all elements of the respective column have been copied from the input to the output (note that a counter is updated only when its element has been copied to the output matrix).

Algorithm Phase3(d, r) {
  // merge the 4 sorted columns of d[0:r-1] into 2 sorted sequences
  // use additional array t[0:r-1]
  A = {0, 0, r-1, r-1};  // initialize a counters
  B = {0, 0, r-1, r-1};  // initialize b counters
  for (k = 0; k < r; k++) {
    aData = mySpu_gatherA(d, A);
    bData = mySpu_gatherB(d, B);
    p = spu_cmpgt(aData, bData);
    e = spu_equal(aData, bData);
    e = spu_and(e, vector(0,0,-1,-1));
    p = spu_or(p, e);
    min = spu_select(aData, bData, p);
    max = spu_select(bData, aData, p);
    t[k] = spu_shuffle(min, t[k], WBXD);
    t[r-k-1] = spu_shuffle(max, t[r-k-1], AYCZ);
    // update counters
    notP = mySpu_not(p);
    f1 = spu_and(p, vector(-1,-1,0,0));
    s1 = spu_and(p, vector(0,0,-1,-1));
    f2 = spu_and(notP, vector(-1,-1,0,0));
    s2 = spu_and(notP, vector(0,0,-1,-1));
    A = spu_sub(A, f2);
    A = spu_add(A, s2);
    B = spu_sub(B, f1);
    B = spu_add(B, s1);
  }
}
Figure 2-11. Phase 3 of merge sort

We consider 4 cases: a0 < a2, a0 = a2, a0 = a2 + 1, and a0 > a2 + 1.

Case a0 < a2: When b0 < b2 (Figure 2-10 A), exactly one of e(a0) and e(b0) and one of e(a2) and e(b2) are copied to the output and the corresponding counters are advanced. No element is copied to the output twice.

Next, consider the case b0 = b2 (Figure 2-10 B). If e(a0) <= e(b0), e(a0) and one of e(a2) and e(b2) are copied to the output and the corresponding counters advanced. Again no element is copied to the output twice. If e(a0) > e(b0) = e(b2), then e(b2) < e(a0) <= e(a2), and e(b0) and e(a2) are copied to the output and their counters advanced. Again, no element is copied twice.

The next case we consider has b0 = b2 + 1. Let the values of b0 and b2 be b'0 and b'2 just before the update(s) that resulted in b0 = b2 + 1 and let a'0 and a'2 be the values of the a counters at this time. One of the following must be true: (a) b'2 = b'0 + 1 (both b0 and b2 were advanced, Figure 2-10 C), (b) b'0 = b'2 = b0 (only b2 was advanced, Figure 2-10 D), or (c) b'0 = b'2 = b2 (only b0 was advanced, Figure 2-10 E). In (a), it must be that b2 = b'0 and b0 = b'2. So, e(a0) > e(b'0) and e(a2) <= e(b'2). Hence, e(a0) <= e(a2) <= e(b'2) = e(b0) and e(a2) >= e(a0) > e(b'0) = e(b2). Therefore, e(a0) and e(a2) are copied to the output and a0 and a2 advanced. Again, only previously uncopied elements are copied to the output and each is copied once. For subcase (b), when b'2 was decremented to b2 and a'0 was incremented to a0, e(b'2) >= e(a2) and e(a'0) <= e(b'0). Since b0 > b2, all elements of the second column have been copied to the output. We see that e(a0) <= e(a2) <= e(b'2) = e(b0). So, e(a0) is copied and a0 is advanced. Further, as a result of some previous comparison, b0 was advanced to its current position from the present position of b2. So, there is an a''0 <= a0 such that e(b2) < e(a''0) <= e(a0) <= e(a2). Therefore, e(a2) is copied and a2 advanced. Again, no previously copied element is copied to the output and no element is copied twice. Subcase (c) is symmetric to subcase (b).

The final case has b0 > b2 + 1 (Figure 2-10 F). From the proof of subcases b0 = b2 and b0 = b2 + 1, it follows that this case cannot arise.

Case a0 = a2: There are 4 subcases to consider: (a) b0 < b2, (b) b0 = b2, (c) b0 = b2 + 1, and (d) b0 > b2 + 1 (Figures 2-10 (G-J)). Subcase (a) is symmetric to the case a0 < a2 and b0 = b2 considered earlier. In subcase (b), independent of the outcome of the comparison e(a0) : e(b0), which is the same as the comparison e(a2) : e(b2), e(a0) (equivalently e(a2)) and e(b0) (equivalently e(b2)) are copied to the output. For subcase (c), we notice that when a0 = a2, these two counters have had a cumulative advance of r - 1 from their initial values, and when b0 = b2 + 1 these two counters have together advanced by r. So, the 4 counters together have advanced by 2r - 1 from their initial values. This isn't possible as the 4 counters advance by a total of 2 in each iteration of the for loop. So, subcase (c) cannot arise. Next, consider subcase (d). From the proof for the case a0 < a2, we know that we cannot have b0 > b2 + 1 while a0 < a2. So, we must have got into this state from a state in which a0 = a2 and b0 <= b2. It isn't possible to get into this state from subcase (a) as subcase (a), at worst, increases b0 by 1 and decreases b2 by 1 each time we are in this subcase. So, it is possible to get into this subcase only from subcase (b). However, subcase (b) only arises at the last iteration of the for loop. Even otherwise, subcase (b) either increments b0 by 1 or decrements b2 by 1 and so cannot result in b0 > b2 + 1.

Case a0 > a2 + 1: From the proofs of the remaining cases, it follows that this case cannot arise.

From the proof of Theorem 2.1, it follows that when we are sorting numbers rather than records with numeric keys, algorithm Phase3 works correctly even with the statements
    e = spu_equal(aData, bData);
    e = spu_and(e, vector(0,0,-1,-1));
    p = spu_or(p, e);
omitted.

Merge sort phase 4 – final merge: For the Phase 4 merge, we employ 4 counters. Counters a0 and a1, respectively, begin at the first and last element of the first sorted sequence (i.e., at the top of the first column and bottom of the second column, respectively) while b0 and b1 begin at the first and last elements of the second sequence (Figure 2-12). In each round, the comparisons a0 : b0 and a1 : b1 are done in parallel. e(a0) (e(b1)) is moved to the output if e(a0) <= e(b0) (e(b1) >= e(a1)). Otherwise, e(b0) (e(a1)) is moved to the output.

The sorted output is assembled in row-major order in the vector array t. We use the variables k and pos to keep track of the row and column in t in which to place the output element from the comparison e(a0) : e(b0). The output element from e(a1) : e(b1) goes into row (r - k - 1) and column (3 - pos) of t. Figure 2-13 gives the algorithm for the case when the counters remain within the bounds of their respective columns. mask[pos], 0 <= pos <= 3, is defined so as to change only the number in position pos of a t[] vector.

Figure 2-12. Phase 4 counters

As was the case in Phase 3, the statements
    e = spu_equal(aData, bData);
    e = spu_and(e, vector(0,0,-1,-1));
    p = spu_or(p, e);
may be omitted when we are sorting numbers rather than records with numeric keys.

2.3.1.3 Comparison of single SPU sorting algorithms

We programmed our merge sort, brick sort, and shaker sort adaptations using the CBE SDK Version 3.0. For comparison purposes, we used an AA-sort code developed by us, the Cellsort code of [8], a non-vectorized merge sort code developed by us, and the quick sort routine available in the CBE SDK. The codes were first debugged and optimized using the CBE simulator that is available with the SDK. The optimized codes were run on the Georgia Tech-STI Cellbuzz cluster to obtain actual run times. Table 2-1 gives the average time required to sort n 4-byte integers for various values of n. The average for each n is taken over 5 randomly generated sequences. The variance in the sort time from one sequence to the next is rather small and so the reported average is not much affected by taking the average of a larger number of random input sequences.


AlgorithmPhase4(d,r)f //partialalgorithmtomerge2sortedsequencesofd[0:r-1] //into1sortedsequence//useadditionalarrayt[0:r-1] A= f 0,r-1,0,0 g ;//initializeacounters B= f 0,r-1,0,0 g ;//initializebcounters k=0;pos=0;while(nocolumnisexhausted) f aData=mySpu gather12(d,A); bData=mySpu gather34(d,B); p=spu cmpgt(aData,bData); e=spu equal(aData,bData); e=spu and(e,vector(0,-1,0,0)); p=spu or(p,e); min=spu select(aData,bData,p); max=spu select(bData,aData,p); max=spu slqwbyte(max,4); t[k]=spu shuffle(min,t[k],mask[pos]); t[r-k-1]=spu shuffle(max,t[r-k-1],mask[3-pos]); //updatecountersnotP=mySpu not(p); f1=spu and(p,vector(-1,0,0,0)); s1=spu and(p,vector(0,-1,0,0)); f2=spu and(notP,vector(-1,0,0,0)); s2=spu and(notP,vector(0,-1,0,0)); A=spu sub(A,f2); A=spu add(A,s1); B=spu sub(B,f1); B=spu add(B,s2); k+=(pos+1)/4;pos=(pos+1)%4; g gFigure2-13.Phase4ofmergesortwithcountersindifferent columns sorttimefromonesequencetothenextisrathersmallandsot hereportedaverageis notmuchaffectedbytakingtheaverageofalargernumberofr andominputsequences. Figure 2-14 isaplotoftheaveragetimesreportedinFigure 2-1 .Theshownruntimes includethetimerequiredtofetchthedatatobesortedfromm ainmemoryandtostore thesortedresultsbacktomainmemory.2.3.2HierarchicalSort Forsortingalargesetofnumbers,ahierarchicalapproachs imilartoexternal sortisemployedwherersteachSPUsortsalocalmemoryload ofdatatogenerate sortedsequencescalledruns.Theserunsarethenmergedbyt heSPUstoproduce 38


Table 2-1. Comparison of various SPU sorting algorithms (average time in microseconds to sort n 4-byte integers)

# of Integers   AASort   Shakersort   BrickSort   BitonicSort   MergeSort   MergeSort (Sequential)   QuickSort
128               52.0         53.6        53.0          47.8        50.8                    146.6       145.6
256               62.4         65.4        63.4          65.6        57.6                    178.6       206.8
512               81.8         86.4        81.4          72.6        70.4                    272.2       332.0
1024             123.8        142.2       116.8         125.4        97.0                    315.4       605.6
2048             222.8        262.0       190.2         165.8       142.0                    543.0      1164.0
4096             438.6        494.8       332.6         297.8       268.4                    989.8      2416.6
8192             912.4       1033.6       663.8         609.6       508.0                   2011.2      4686.6
16384           1906.4       2228.0      1361.0        1331.2      1017.0                   4103.0      9485.2

the final sorted sequence. Depending on how many runs are merged at a time, there will be multiple rounds of merging before the final sorted sequence is generated. The PPU dispatches the runs to the SPUs, which do the merging and return the results back to the PPU. In the run merging phase, each SPU independently merges a different set of runs. So, one needs to develop only a run merging algorithm for a single SPU. Inoue et al. [11] propose a single SPU merging algorithm that merges runs in pairs (i.e., a 2-way merge) using an adaptation of odd-even merge. Odd-even merge of two four-number sorted sequences is done in 3 stages. First, the two sorted sequences are concatenated to get an 8-number sequence where each half is in sorted order. During the first stage, numbers that are 4 positions apart are compare-exchanged^1. In the second stage, numbers that are 2 positions apart are compare-exchanged and in the last stage alternate numbers are compare-exchanged if needed. This scheme

1. In a compare-exchange, two numbers are compared and swapped if the first number is larger than the second.


0 2000 4000 6000 8000 10000 12000 14000 16000 18000 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 # of IntegersTime in usec AASort Shakersort Bricksort Bitonicsort Mergesort Mergesort(seq) Quicksort(SDK) Figure2-14.Plotofaveragetimetosort4-byteintegers canbeeffectivelySIMDizedbybeginningwithtwovectors,e achcontainingoneof thetwosorted4-numbersequences.Vectorcompareinstruct ionsareusedsothat4 compare-exchangemaybedoneatatime.Figure 2-15 showstheprocesswithtwo sortedvectors A and B andFigure 2-16 givesthepseudocodeforthisadaptation. AstherunsaretoolongtotinthelocalmemoryofanSPU,buff ersareusedto holdaportionofeachoftherunscurrentlybeingmerged.Mul ti-bufferingtechniquesare employedtooverlapthecomputationwiththedatatransfer. Figure 2-17 istheplotofthe averagetimestakenforsorting1to67Mintegerswithdiffer entSPUsortingalgorithms followedbySIMDodd-even2-waymergetomergetherunsexcep tinthecaseofBitonic Sort[ 8 ]wherebitonicmergeisusedtomergetherunsandinthecaseo fsequential mergesortwhereatexbookmergeisdonetocombinetheruns.S imilartosingleSPU sorting,Bricksortisthefastestamongshellsortlikealgo rithmstaking82%ofthetime takenbyAASortforsorting67MnumberswhileShakersortist heslowestofthebunch running21%slowerthanBricksortforsorting67Mnumbers.M ergesortisthefastestof 40


output A0A1A2A3 B0B1B2 stage 3 < < < B0B1B2 A0A1A2A3 < << << < < Vector Register AVector Register B sortedsorted Vector Register AVector Register B sorted : no operation:comparison MINMAX input stage 1stage 2 B3 B3 Figure2-15.SIMDodd-evenmergeoftwovectors[ 11 ] allalgorithmstestedtaking84%ofthetimetakenbyBrickso rtandruns24%fasterthan Bitonicsortforsorting67Mnumbers.Comparedtosequentia lsorts,theSIMDversionof mergesortruns3timesfasterthanthetextbookmergesortan d24timesfasterthanthe SDKquicksort. 2.4SortingRecords 2.4.1RecordLayouts Arecord R iscomprisedofakey k and m otherelds f 1 f 2 f m .Forsimplicity, weassumethatthekeyandeachothereldoccupies32bits.He nce,a128-bitCBE vectormayholdupto4ofthese32-bitvalues.Althoughthede velopmentinthissection reliesheavilyonstoring4keysinavector(eachkeyoccupyi ng32bits),thesizeofthe othereldsisn'tverysignicant.Let k i bethekeyofrecord R i andlet f ij 1 j m bethisrecord'sotherelds.Withoursimplifyingassumpti onofuniformsizeelds, wemayviewthe n recordstobesortedasatwo-dimensionalarray eldsArray [][] with eldsArray [ i ][0]= k i and eldsArray [ i ][ j ]= f ij 1 j m 1 i n .Whenthisarray 41


AlgorithmoddEvenMerge(v1,v2)f //Mergetwovectorsv1andv2 vectorf1,f2;vetorf3,f4;vectorp;//forstoringpatternp=spu cmpgt(v1,v2); f1=spu select(v1,v2,pattern); f2=spu select(v2,v1,pattern); //Stage2f3=spu rotate(f1,8); f1=spu select(f3,f2,p); f4=spu select(f2,f3,p); f2=spu shuffle(f1,f4,WACY); f3=spu shuffle(f1,f4,ZXBD); //Stage3p=spu cmpgt(f2,f3); p=spu shuffle(p,vZero,WXYA); f1=spu select(f2,f3,p); f4=spu select(f3,f2,p); //Outputv1=spu shuffle(f1,f4ZWAX); v2=spu shuffle(f1,f4,BYCD); gFigure2-16.SIMD2-waymergeof2vectors v 1 and v 2 ismappedtomemoryincolumn-majororder,wegettherstlay outconsideredin[ 11 ]. Wecallthislayoutthe ByField layoutas,inthislayout,the n keyscomerst.Nextwe havethe n valuesforthersteldoftherecordsfollowedbythe n secondelds,andso on.Whentheeldsarrayismappedtomemoryinrow-majororde r,wegetthesecond layoutconsideredin[ 11 ].Thislayout,whichisamorecommonlayoutforrecords,is calledthe ByRecord layoutas,inthislayout,alltheeldsof R 1 comerst,thenwehave alleldsof R 2 andsoon.Whenthesortbeginswithdatainthe ByField ( ByRecord ) layout,theresultofthesortmustalsobeinthe ByField ( ByRecord )layout. 2.4.2High-levelStrategiesforSortingRecords Therearetwohigh-levelstrategiestosortmultieldrecor ds.Intherst,westrip thekeysfromtherecordsandcreate n tuplesoftheform ( k i i ) .Wethensortthetuples bytheirrstcomponent.Thesecondcomponentofthetuplesi nthesortedsequence denesapermutationoftherecordindexesthatcorresponds tothesortedorderforthe initialrecords.Therecordsarerearrangedintothispermu tationbyeithercopyingfrom 42


0 10 20 30 40 50 60 70 0 0.5 1 1.5 2 2.5 3 3.5 4 x 10 7 # of Integers (in millions)Time in usec AASort Shakersort Bricksort Mergesort Mergesort(seq) Bitonic sort Quicksort(SDK) Figure2-17.Plotofaveragetimetosort1to67millioninteg ers eldsArray toanewspaceorinplaceusingacyclechasingalgorithmasde scribedfor atablesortin[ 20 ].Thisstrategyhastheadvantageofrequiringonlyalinear number ofrecordmoves.So,ifthesizeofeachrecordis s andifthetimetosortthetuplesis O ( n log n ) ,theentiresortofthe n recordscanbecompletedin O ( n log n + ns ) time. Thesecondhigh-levelstrategyistomovealltheeldsofare cordeachtimeitskeyis movedbythesortalgorithm.Inthiscase,ifthetimetosortt hekeysaloneis O ( n log n ) thetimetosorttherecordsis O ( ns log n ) .Forrelativelysmall s ,therststrategy outperformsthesecondwhentherecordsarestoredinunifor maccessmemory. However,sincereorderingrecordsaccordingtoaprescribe dpermutationwithalinear numberofmovesmakesrandomaccessestomemory,theseconds chemeoutperforms therst(unless s isverylarge)whentherecordstoberearrangedareinrelati velyslow memorysuchasdiskorthemainmemoryoftheCBE.Forthisreas on,wefocus,inthis 43


chapter,onusingthesecondstrategy.Thatis,thesortalgo rithmmovesalltheeldsof arecordwheneveritskeyismoved.2.4.3SingleSPURecordSorting TwoSIMDvectoroperationsusedfrequentlyinnumbersortin galgorithmsare ndmin and shue .The ndmin operationcomparescorrespondingelementsintwo vectorsandreturnsavector min thatcontains,foreachcomparedpair,thesmaller.For example,whenthetwovectorsbeingcomparedare(4,6,2,9)a nd(1,8,5,3),the min is(1,6,2,3).Supposethat v i and v j arevectorsthat,respectively,containthekeysfor records R i : i +3 and R j : j +3 .Figure 2-18 showshowwemaymovetherecordswiththe smallerkeystoablockofmemorybeginningat minRecords pattern=spu cmpgt( v i v j ); minRecords=fields select( v i v j ,pattern); Figure2-18.The ndmin operationforrecords Whenthe ByField layoutisused, elds select takestheformgiveninFigure 2-19 for(p=1;p < = m;p++) f minRecords[p]=spu select(fieldsArray[i][p],fieldsArray[j][p],pattern) ; gFigure2-19.The elds select operationin ByField layout Noticethatinthe ByField layout,theelements eldsArray [ i : i +3][ p ] arecontiguous anddenea4-elementvector.However,inthe ByRecord layout,theseelements arenotcontiguousandadifferentstrategy(Figure 2-20 )mustbeemployed.The function memcpy ( dest src size ) moves size numberofbytesfromthelocalstorelocation beginningat src tothelocalstorelocationbeginningat dest ; recSize isthelengthofa recordincludingitskey(i.e.,itisthenumberofbytestake nbyarowof eldsArray ). Theshufeoperationdenedby spu shue forthecaseofsortingnumbersmaybe extendedtothecaseofmultieldrecordsusingthecodeofFi gure 2-21 forthe ByField layoutandthatofFigure 2-22 forthe ByRecord layout.Bothcodesareforthecase 44


for(p=0;p < 4;p++) f memcpy(minRecords+p*recSize,(pattern[p]==1)?fieldsA rray[i+p]: fieldsArray[j+p],recSize);gFigure2-20.The elds select operationin ByRecord layout whentheshufepatternis WYAC .Othershufepatternsaredoneinasimilarway.This extensionof spu shue torecordsisreferredas elds shue .Othervectoroperations like spu slqwbyte canbethoughtofasa spu shue operationwithacertainpatternand onecandeneasimilaroperation elds rotate alongthoselinesfortherecordsinboth layoutse.g. elds rotate ( v ,8) isequivalentto elds shue ( v v CDWX ) for(p=0;p < = m;p++) f resultRecords[p]=spu shuffle(fieldsArray[i][p],fieldsArray[j][p],WYAC); gFigure2-21.Shufingtworecordsin ByField layout memcpy(resultRecords,fieldsArray[i],recSize);memcpy(resultRecords+recSize,fieldsArray[i+2],recSi ze); memcpy(resultRecords+2*recSize,fieldsArray[j],recSi ze); memcpy(resultRecords+3*recSize,fieldsArray[j+2],rec Size); Figure2-22.Shufingtworecordsin ByRecord layout Weobservethatwhenthe ByField layoutisusedthe ndMin and shue operations perform O ( m ) vectoroperationsandthatwhenthe ByRecord layoutisused,aconstant number(4)of memcpy operationsaredone.However,thetimeforeach memcpy operationincreaseswith m 2.4.4HierarchicalSortingforRecords-4wayMerge InSection 2.4 ,wesawtheadaptationsneededtotherungenerationalgorit hms of[ 8 11 12 ]sothatthesemaybeusedtogeneraterunsformultieldreco rdsrather thanfornumbers.Intherunmergingphase,eachoftheSPUsin dependentlymerges adifferentsetofruns.Inoueetal.[ 11 ]proposeasingleSPUmergingalgorithmthat mergesrunsinpairs(i.e.,a2-waymerge)usinganadaptatio nofodd-evenmerge.It takes3 spu cmpgt instructions,6 spu shue and6 spu select instructionstomergetwo 45


SPUvectorsusingthisscheme.Therunmergingstrategyof[ 11 ]maybeadaptedto thecaseofmultieldrecordsusingthemethodsofSection 2.4 .Ahigher-ordermergein therunmergingphasereducesIOtimeandhaslittleimpacton computationtime[ 20 ]. Moreover,whensortingmultieldrecords,theIOtimeincre aseswiththesizeofarecord andforsuitablylargerecords,thisIOtimewillexceedthec omputationtimeandso cannotbeeffectivelyhiddenusinga2-waymergeanddoubleb uffering.So,formultield records,thereismerittodevelopingahigher-ordermerge. Correspondingly,two4-way mergealgorithmsareproposedin[ 21 ].Oneisascalaralgorithmandtheotherisa vectorizedSIMDalgorithm.Bothalgorithmsarebasedonthe high-levelstrategyshown inFigure 2-23 2-way Merge MainAlt MainAlt MainMainMain AltAltAlt Additional Buffer Additional Buffer 2-way Merge 2-way Merge Figure2-23.4-waymerge The4-waymergestrategyinvolvesperforming32-waymerges inasingleSPU usingtwobuffers(mainandalt)foreachofthe4inputstream sA,B,C,andDaswell as2buffersfortheoutputstreamO.Anadditionalbufferisu sedfortheoutput(EandF, respectively)ofeachofthetwoleft2-waymergenodesofFig ure 2-23 .So,weemploy atotalof12buffers.RunsAandBarepairwisemergedusingth etopleft2-waymerge nodewhilerunsCandDarepairwisemergedusingthebottomle ft2-waymergenode. Theformer2-waymergegeneratestheintermediaterunEwhil ethelattergenerates theintermediaterunF.TheintermediaterunsEandFaremerg edbytheright2-way mergenodetoproducetheoutputrunO,whichiswrittenbackt omainmemory.Run generationisdoneoneblockorbufferloadatatime.Doubleb ufferingisemployedfor 46


theinputofA,B,C,andDfrommainmemoryandtheoutputofOto mainmemory.By usingdoublebufferingandasynchronousDMAtransferstoan dfrommainmemory,we areabletooverlapmuchoftheIOtimewithcomputationtime. Scalar4-waymerge: Figure 2-24 givesthepseudocodeforthescalar4-way mergealgorithm.Forsimplicity,algorithm 4 wayPipelinedMerge assumesthatwehave anintegralnumberofblocksofdataineachrun.So,ifeachof therunsA,B,C,andD is(say)10blockslong,theoutputrunOwillbe n =40blockslong. 4 wayPipelinedMerge generatestheseoutput n blocksoneblockatatime.Evenblocksareaccumulatedin oneoftheoutputbuffersandoddblocksintheother.Whenano utputbufferbecomes full,wewritetheblocktomemoryusinganasynchronousDMAt ransfer( dmaOut )and continueoutputrungenerationusingtheotheroutbutbuffe r.So,otherthanwhenthe rstoutputblockisbeinggeneratedandthelastbeingwritt entomainmemory,one oftheoutputblocksisbeingwrittentomainmemorywhilethe otheroneisbeinglled withrecordsforthenextblock.Attheendofeachiterationo ftheouter for loop,we switchtherolesofthetwooutputbuffers–theonethatwasbe ingwrittentomainmemory becomesthebuffertoplacerecordsforthenextblockandthe onethatwasbeinglled iswrittenout.Ofcourse,thisswitchmayentailsomedelaya swemustwaitforthe ongoing(ifany) dmaOut tocompletebeforeweusethisbufferfortherecordsofthene xt block.Whengeneratingablockoftheoutputrun,wemergefro mthebuffers buerE and buerF totheoutputbuffer buerO thatiscurrentlydesignatedforthispurpose. Thenumberofrecordsinafullbuffer(i.e.,theblocksize)i s bSize .Incaseeither buerE or buerF isempty,thegenerationoftheoutputblockissuspendedand weproceed tolltheemptybufferusingthemethod mergeEF ,whichmergesfromeitherinput streamsAandBto buerE orfromstreamsCandDto buerF .Thealgorithm mergeEF mergesforeithertheinputstreamsAandBto buerE orfrom E and F to buerF .It usesdoublebufferingonthestreamsA,B,C,andDandensures thatthereisalways anactive dmaIn forthesefourinputstreams.Sincethepseudocodeissimila rtothat 47


for 4 wayPipelinedMerge ,wedonotprovidethispseudocodehere.Recordsaremoved betweenbuffersusingthe memcpy instructionswhenthe ByRecord layoutisusedand movedoneeldatatimewhenthelayoutis ByField Algorithm4wayPipelinedMerge(A,B,C,D,O,n)f //Mergeruns/streamsA,B,C,andDtoproduceO //withnblocksofsizebSize //bufferAisabufferforAinitiatea dmaIn for buerA buerB buerC and buerD ; for(i=0;i < n;i++) f for(j=0;j < bSize;j++) f //doblocki if(bufferEisempty) mergeEF(A,B,E); if(bufferFisempty) mergeEF(C,D,F); movesmallerrecordfromfrontofbufferEandbufferFtobufferO gdmaOut(bufferO,O);switchtherolesoftheoutputbuffers; g gFigure2-24.4-waymerge SIMD4-waymerge: TheSIMDversiondiffersfromthescalarversiononlyinthe wayeachofthethree2-waymergescomprisinga4-waymergewo rks.These2-way mergesmove4recordsatatimefrominputbufferstotheoutpu tbufferusingodd-even mergeschemeonthekeysofthoserecords.Twosortedvectors eachconsistingofkeys from4sortedrecordsaremergedusingodd-evenmerge.Thee ldsarealsomoved correspondinglyusingthe elds operationsintroducedintheprevioussections.The odd-evenmergeoftwovectorsisessentiallythesameproces sasincaseofmerging numbersdescribedinSection 2.3.2 .Figure 2-25 givesthepseudocodeoftheadaptation formergingrecords. InAlgorithm oddEvenMerge v 1 and v 2 aretwovectorseachcontainingthekeysof thenext4recordsintheinputbuffersforthetwostreamsbei ngmerged.Itiseasyto seethatthenextfourrecordsinthemergedoutputareasubse tofthese8recordsand 48


AlgorithmoddEvenMerge(v1,v2)f //Mergerecordswhosekeysareinv1andv2 fieldsf1[],f2[];fieldsf3[],f4[];vectorp;//forstoringpatternp=spu cmpgt(v1,v2); f1=fields select(v1,v2,pattern); f2=fields select(v2,v1,pattern); //Stage2f3=fields rotate(f1,8); f1=fields select(f3,f2,p); f4=fields select(f2,f3,p); f2=fields shuffle(f1,f4,WACY); f3=fields shuffle(f1,f4,ZXBD); //Stage3p=spu cmpgt(f2,f3); p=spu shuffle(p,vZero,WXYA); f1=fields select(f2,f3,p); f4=fields select(f3,f2,p); //Outputv1=fields shuffle(f1,f4ZWAX); v2=fields shuffle(f1,f4,BYCD); gFigure2-25.SIMD2-waymergeof2vectors v 1 and v 2 infactarethe4records(ofthese8)withthesmallestkeys.A lgorithm oddEvenMerge determinesthese4smallestrecordsandmovesthesetotheou tputbuffer. 2.4.5ComparisonofRecordSortingAlgorithms Weprogrammedseveralmultieldrecordsortingalgorithms usingCellBESDK3.1. Specically,thefollowingalgorithmswereimplementedan devaluated: 1. 2-wayAASort...thisisthemultieldrecordsortingalgori thmofInoueetal.[ 11 ]. Thisusesacombsortvariantforrungenerationand2-wayodd -evenmergefor runmerging. 2. 4-wayAASort...thisusesacombsortvariantforrungenerat ionasin[ 11 ]and our4-wayodd-evenmergeforrunmerging(Section 2.4.4 ). 3. 2-wayBitonicSort...thisisanadaptationoftheCellSorta lgorithmofGediket al.[ 8 ]tomultieldrecords(Section 2.4 ).Itusesbitonicsortforrungenerationand 2-waybitonicmergeforrunmerging. 4. 4-wayBitonicSort...thisusesbitonicsortfortungenerat ionasin[ 8 ]andour 4-wayodd-evenmergeforrunmerging(Section 2.4.4 ). 49


5. 2-wayMergeSort...thisusesanadaptationoftheSPUmerges ortalgorithm ofBandyopadhyayandSahni[ 12 ]tomultieldrecords(Section 2.4 )forrun generationandthe2-wayodd-evenmergeof[ 11 ]forrunmerging. 6. 4-wayMergeSort...thisusesanadaptationoftheSPUmerges ortalgorithm ofBandyopadhyayandSahni[ 12 ]tomultieldrecords(Section 2.4 )forrun generationandour4-wayodd-evenmergeforrunmerging(Sec tion 2.4.4 ). 7. 2-wayScalarMergeSort...thisusesanadaptationoftheSPU mergesort algorithmofBandyopadhyayandSahni[ 12 ]tomultieldrecords(Section 2.4 ) forrungeneration.Runmergingisdoneusinga2-wayscalarm ergingalgorithm derivedfromthe4-wayscalarmergingalgorithmofSection 2.4.4 byeliminatingthe bottomleftandtheright2-waymergenodes. 8. 4-wayScalarMergeSort...thisusesanadaptationoftheSPU mergesort algorithmofBandyopadhyayandSahni[ 12 ]tomultieldrecords(Section 2.4 )for rungenerationandour4-wayscalarmergeforrunmerging(Se ction 2.4.4 ). Weexperimentedwiththeabove8multieldsortingalgorith msusingrandomly generatedinputsequences.Inourexperiments,thenumbero f32-biteldsperrecordis variedfrom5to15(inadditiontothekeyeld)andthenumber ofrecordsvariedfrom 4Kto1M.Also,wetriedbothlayouts– ByField and ByRecord .Foreachcombination ofnumberofelds,numberofrecords,andlayouttype,theti metosort10random sequenceswasobtained.Thestandarddeviationintheobser vedruntimeswassmall andwereportonlytheaveragetimes.2.4.5.1Runtimesfor ByField layout Figures 2-26A through 2-27D givetheaverageruntimesforour8sortingalgorithms usingthe ByField layoutandFigures 2-28A through 2-28D comparetheaveragerun timesforthe2-wayand4-wayversionsofeachofoursortinga lgorithmsforthecase whenthenumberofrecordstobesortedis1M.Forallourdata, the4-wayversion outperformedthe2-wayversion.For1Mrecordswith532-bit elds(inadditionto a32-bitkey),the4-wayversionsofAASort,BitonicSort,Me rgeSort,andScalar MergeSort,respectively,took5%,4%,7%,and4%lesstimeth antakenbytheir 2-waycounterpartsandthesepercentagesfor15eldswere9 %,6%,9%,and6% respectively. 50


Figure 2-26. 2-way sorts (ByField). A) 2-way AA sort (ByField). B) 2-way bitonic sort (ByField). C) 2-way merge sort (ByField). D) 2-way scalar merge sort (ByField).

Figure 2-29 shows the run times for the four 4-way sort algorithms for 1M records. As can be seen, 4-way Bitonic Sort is the slowest, followed by 4-way Scalar Merge Sort, followed by 4-way AASort; 4-way Merge Sort was the fastest. In fact, across all our data sets, 4-way Bitonic Sort took between 17% and 23% more time than taken by 4-way Scalar Merge Sort, which in turn took between 18% and 19% more time than taken by 4-way AASort. The fastest 4-way sort algorithm, 4-way Merge Sort, took, respectively, between 40% and 35%, 26% and 25%, and 13% and 10% less time than taken by 4-way Bitonic Sort, 4-way Scalar Merge Sort, and 4-way AASort.


Figure 2-27. 4-way sorts (ByField). A) 4-way AA sort (ByField). B) 4-way bitonic sort (ByField). C) 4-way merge sort (ByField). D) 4-way scalar merge sort (ByField).

2.4.5.2 Run times for ByRecord layout

Figures 2-30A through 2-31D give the average run times for the sorting algorithms using the ByRecord layout and Figures 2-32A through 2-32D present the comparison of average run times for the 2-way and 4-way versions of each sorting algorithm when the number of records to be sorted is 1M. In this layout as well, the 4-way version outperformed the 2-way version for all the data sets. For 1M records with 5 32-bit fields (in addition to a 32-bit key), the 4-way versions of AASort, Bitonic Sort, Merge Sort, and


Figure 2-28. 2-way and 4-way sorts (ByField), 1M records. A) 2-way and 4-way AA sort (ByField). B) 2-way and 4-way bitonic sort (ByField). C) 2-way and 4-way merge sort (ByField). D) 2-way and 4-way scalar merge sort (ByField). (In each panel, the x-axis is the number of fields and the y-axis the execution time in microseconds.)

Scalar Merge Sort, respectively, took 5%, 4%, 7%, and 0.1% less time than taken by their 2-way counterparts and these percentages for 15 fields were 9%, 6%, 9%, and 4% respectively.

Figure 2-33 shows the run times for the four 4-way sort algorithms for 1M records. As we can observe, 4-way Bitonic Sort is the slowest, followed by 4-way AASort, followed by 4-way Merge Sort; 4-way Scalar Merge Sort was the fastest. In fact, across all our data sets, 4-way Bitonic Sort took between 16% and 17% more time than taken by 4-way AASort, which in turn took between 24% and 35% more time than taken by 4-way Merge Sort. The fastest of them in the ByRecord format, 4-way Scalar Merge Sort, took, respectively, 88%, 86%, and between 81% and 88% less time than taken by 4-way Bitonic Sort, 4-way AASort, and 4-way Merge Sort.


Figure 2-29. 4-way sorts (ByField), 1M records (x-axis: # Fields; y-axis: execution time in microseconds; curves: Bitonic, Scalar Merge, AA, Merge).

2.4.5.3 Cross layout comparison

Although in a real application one may not be able to choose the layout format for the data to be sorted, it is worthwhile to compare the relative performance of the 8 sort methods using the better layout for each. This means that we use the ByField layout for AASort and Merge Sort and the ByRecord layout for Merge Sort and Scalar Merge Sort. Figure 2-34 gives the run times for the 4-way versions using these formats for the case of 1M records. Although Figure 2-34 is only for the case of 1M records, 4-way Scalar Merge Sort was the fastest for all of our data sets. For 5 32-bit fields (in addition to the key field), 4-way Scalar Merge Sort (ByRecord) ran 81% faster than 4-way Merge Sort (ByRecord), 30% faster than 4-way AASort (ByField), and 20% faster than 4-way Merge Sort (ByField). When the number of fields was 15, these percentages were 88%, 64%, and 60% respectively.


Figure 2-30. 2-way sorts (ByRecord). A) 2-way AA sort (ByRecord). B) 2-way bitonic sort (ByRecord). C) 2-way merge sort (ByRecord). D) 2-way scalar merge sort (ByRecord).

2.5 Summary

We have developed SPU sorting algorithms based on merge sort, shaker sort, and brick sort. Our merge sort adaptation is also a stable sort. Our experiments reveal that, while sorting numbers, a standard non-vectorized textbook implementation of merge sort takes about 4 times the time taken by the vectorized merge sort adaptation. Further, the quicksort method that is part of the CBE SDK takes about 9 times the time taken by our merge sort adaptation. Brick sort is the fastest of the shellsort-like algorithms (AAsort, shaker sort, and brick sort) considered in this chapter, taking about 71% of the time taken


Figure 2-31. 4-way sorts (ByRecord). A) 4-way AA sort (ByRecord). B) 4-way bitonic sort (ByRecord). C) 4-way merge sort (ByRecord). D) 4-way scalar merge sort (ByRecord).

by AAsort to sort 16384 integers. Although cell (bitonic) sort is slightly faster than brick sort, it takes about 31% more time to sort 16384 integers than taken by merge sort. On the downside, merge sort requires O(n) additional space to sort n numbers while the remaining methods require only O(1) added space.

We have shown how to adapt number sorts to sort multifield records on the Cell Broadband Engine. We have also developed two 4-way merge algorithms for the run merging phase. One of these is a scalar version and the other is an SIMD version. Our


Figure 2-32. 2-way and 4-way sorts (ByRecord), 1M records. A) 2-way and 4-way AA sort (ByRecord). B) 2-way and 4-way bitonic sort (ByRecord). C) 2-way and 4-way merge sort (ByRecord). D) 2-way and 4-way scalar merge sort (ByRecord). (In each panel, the x-axis is the number of fields and the y-axis the execution time in microseconds.)

experiments indicate that the 4-way Scalar Merge Sort developed in this chapter is the fastest method (from among those tested) to sort multifield records on the CBE.


Figure 2-33. 4-way sorts (ByRecord), 1M records (x-axis: # Fields; y-axis: execution time in microseconds; curves: Bitonic, AA, Merge, Scalar Merge).

Figure 2-34. 4-way sorts using the best algorithm for different layouts (x-axis: # Fields; y-axis: execution time in microseconds).


CHAPTER3 SORTINGONAGRAPHICSPROCESSINGUNIT(GPU) 3.1SortingNumbersonGPUs OneoftheveryrstGPUsortingalgorithms,anadaptationof bitonicsort,was developedbyGovindrajuetal.[ 22 ].Sincethisalgorithmwasdevelopedbeforethe adventofCUDA,thealgorithmwasimplementedusingGPUpixe lshaders.Zachmann etal.[ 23 ]improvedonthissortalgorithmbyusing BitonicTrees toreducethenumberof comparisonswhilemergingthebitonicsequences.Cederman etal.[ 24 ]haveadapted quicksortforGPUs.Theiradaptationrstpartitionsthese quencetobesortedinto subsequences,sortsthesesubsequencesinparallel,andth enmergesthesorted subsequencesinparallel.Ahybridsortalgorithmthatspli tsthedatausingbucket sortandthenmergesthedatausingavectorizedversionofme rgesortisproposed bySintronetal.[ 25 ].Satishetal.[ 5 ]havedevelopedanevenfastermergesort. Inthismergesort,twosortedsequencesAandBaremergedbya threadblockto producethesequenceCwhenAandBhavelessthan256elements each.Eachthread readsanelementofAandthendoesabinarysearchontheseque nceBwiththat elementtodeterminewhereitshouldbeplacedinthemergeds equenceC.When thenumberofelementsinasequenceismorethan256,AandBar edividedintoa setofsubsequencesbyusingasetofsplitters.Thesplitter sarechosenfromthetwo sequencesinsuchawaythattheintervalbetweensuccessive splittersissmallenough tobemergedbyathreadblock.ThefastestGPUmergesortalgo rithmknownatthis timeisWarpsort[ 26 ].Warpsortrstcreatessortedsequencesusingbitonicsor t;each sortedsequencebeingcreatedbyathreadwarp.Thesortedse quencesaremerged inpairsuntilonlyasmallnumberofsequencesremain.There mainingsequences arepartitionedintosubsequencesthatcanbepairwisemerg edindependentlyand nallythispairwisemergingisdonewitheachwarpmerginga pairofsubsequences. Experimentalresultsreportedin[ 26 ]indicatethatWarpsortisabout30%fasterthan 59


themergesortalgorithmof[ 5 ].Anothercomparison-basedsortforGPUs–GPUsample sort–wasdevelopedbyLeischneretal.[ 27 ].Samplesortisreportedtobeabout30% fasterthanthemergesortof[ 5 ],onaverage,whenthekeysare32-bitintegers.This wouldmakesamplesortcompetitivewithWarpsortfor32-bit keys.For64-bitkeys, samplesortistwiceasfast,onaverage,asthemergesortof[ 5 ]. [ 5 28 – 32 ]haveadaptedradixsorttoGPUs.Radixsortworksinphasesw here eachphasesortsonadigitofthekeyusing,typically,eithe racountsortorabucket sort.Thecountingtobedoneineachphasemaybecarriedoutu singaprexsum or scan [ 33 ]operationthatisquiteefcientlydoneonaGPU[ 28 ].Harrisetal.'s[ 29 ] adaptationofradixsorttoGPUsusestheradix2(i.e.,eachp hasesortsonabitofthe key)andusesthe bitsplit techniqueof[ 33 ]ineachphaseoftheradixsorttoreorder numbersbythebitbeingconsideredinthatphase.Thisimple mentationofradixsort isavailableintheCUDADataParallelPrimitive(CUDPP)lib rary[ 29 ].For32-bitkeys, thisimplementationofradixsortrequires32phases.Ineac hphase,expensivescatter operationsto/fromtheglobalmemoryaremade.LeGrandetal .[ 30 ]reducethenumber ofphasesandhencethenumberofexpensivescatterstogloba lmemorybyusinga largerradix, 2 b ,for b > 0 .Aradixof16,forexample,reducesthenumberofphases from32to8.Thesortineachphaseisdonebyrstcomputingth ehistogramofthe 2 b possiblevaluesthatadigitwithradix 2 b mayhave.Satishetal.[ 5 ]furtherimprovethe 2 b -radixsortofLeGrandetal.[ 30 ]bysortingblocksofdatainsharedmemorybefore writingtoglobalmemory.Thisreducestherandomnessofthe scattertoglobalmemory, which,inturn,improvesperformance.Theradix-sortimple mentationofSatishetal.[ 5 ] isincludedinNVIDIA'sCUDASDK3.0.BandyopadhyayandSahn i[ 32 ]developed theradixsortalgorithm,GRS,whichissuitableforsorting recordswithmanyelds. GRSoutperformstheSDKradixsortalgorithmwhilesortingn umbersbyreducing thenumberofstepsandusinganadditionalstorageintheglo balmemory.Merrill andGrimshaw[ 31 ]havedevelopedanalternativeradixsort,SRTS,forGPUsth at 60


isbasedonahighlyoptimizedalgorithm,developedbythem, forthescanoperation andco-minglingofseverallogicalstepsofaradixsortsoas toreduceaccessesto device/globalmemory.Presently,SRTSisthefastestGPUra dixsortalgorithmfor32-bit integers. Theresultsof[ 26 27 ]indicatethattheradixsortalgorithmof[ 5 ]outperformsboth Warpsort[ 26 ]andsamplesort[ 27 ]on32-bitkeys.Theseresultstogetherwiththoseof [ 31 ]implythattheradixsortof[ 31 ]isthefastestGPUsortalgorithmfor32-bitinteger keys. MostoftheGPUsortingroutinesperformsomefunctionsrepe titivelyduringtheir operations.Theseprimitivesaredesignedtobeeasilypara llelizedonamany-core architecturelikeaGPU.Twosuchimportantprimitivesthat arefrequentlyusedin GPUsortingare scan and reduce .Scanisafunctionthattakesalistof n elements [ x 0 ... x n 1 ]andabinaryoperator asinput.Theoutputisalsoalistof n elements [ y 0 .... y n 1 ].Therearetwovariantsofscan, exclusive and inclusive .Forexclusivescan, y i := x 0 x 1 x 2 ... x i 1 for i > 0 and y 0 = 0 ( 0 istheidentityelementdenedfor operator )whileinthecaseofinclusivescan, y i := x 0 x 1 x 2 ... x i .Thereduce operation,ontheotherhand,producesasinglevalue y = x 0 x 1 x 2 ... x n 1 .Many usefuloperationscanbedesignedusingtheseprimitives.F orexample,prexsumisan exclusivescanwih + asthebinaryoperator.Findingthemaximumofasequenceisa reducewith asmaximumoperatorthatreturnsthelargeroftwoelements. 3.1.1GPUSampleSort Samplesort[ 27 ]isamulti-waydivideandconquersortingalgorithmthatpe rforms betterwhenthememorybandwidthisanissueasthedatatrans ferredtoandfromthe globalmemoryislessthaninatwo-wayapproach.Theserialv ersionofsamplesort worksbyrstchoosingasetofsplittersrandomlyfromthein putdata.Thesplitters arethensortedandarrangedinincreasingorderoftheirval ues.Theinputdataset isdividedintobucketsdelimitedbysuccessivesplitters. Theelementsinaparticular 61


buckethavevaluesthatareboundedbytheguardingsplitter s.Eachbucketissortedby recursiveapplicationofsamplesort.Thisrecursionconti nuesuntilthesizeofthebucket becomeslessthanacertainthreshold.Atthispointabaseso rtingalgorithmisusedto sortthesmallbucket.Figure 3-1 showsthestepsofserialsamplesort. SampleSort( a [] ) ifsizeof( a ) M //athreshold f Sort( a );//Useasortingmethodtosort a return; gSelect k elementsrandomlyfromaandputthemin samples [] ; Sort( samples [] ); for(eachelement e ina[]) f find i suchthat samples [ i ] e samples [ i +1] ; Put e inbucket b [ i ] ; gfor(eachbucket b [ i ] ) f SampleSort( b [ i ] ); g Figure3-1.SerialSampleSort Toobtainanefcientparallelversionofsamplesort,itisn ecessarytobalance thesizeofbucketsassignedtothreadblocks.Thisisdoneby choosingthesplitters fromalargerandomlyselectedsamplefromtheinput.Onceth esplittersareselected, thenumbersarepartitionedintobucketsbyrstdividingth edataintoequalsizedtiles witheachtilebeingassignedtoablockofthreads.Athreadb lockexaminesitstileof dataandassignsnumbersinthistiletobucketswhosebounda riesarethepreviously chosensplitters.Finallythebucketsproducedforthetile sarecombinedtoobtainglobal buckets.ThestepsinaparticulariterationofGPUsampleso rtof[ 27 ]aredescribed below.3.1.1.1Step1–Splitterselection Duringthisstep,thesplittersarechosen.First,randomsa mplesaretakenout oftheelementsinthebuckets.Asetofsplittersisthenchos enfromtheserandom samples.Finally,splittersaresortedusingodd-evenmerg esort[ 34 ]insharedmemory 62


and a Binary Search Tree of splitters is created to facilitate the process of finding the bucket for an element.

3.1.1.2 Step 2 – Finding buckets

Each thread block is assigned a part of the input data. Threads in a block load the Binary Search Tree into shared memory and then calculate the bucket index for each element in the tile. At the end, threads store the number of elements in each bucket as a k-entry histogram in the global memory.

3.1.1.3 Step 3 – Prefix sum

The per-block k-entry histograms are prefix-summed to obtain the global offset for each bucket.

3.1.1.4 Step 4 – Placing elements into buckets

Each thread block in this phase again calculates the bucket index for all keys in its tile. They also calculate the local offset within the buckets. The local offsets are added to the global offsets from the previous phase to get the final position of the numbers.

Figure 3-2 depicts the steps in an iteration of the GPU sample sort.

Figure 3-2. An iteration of GPU sample sort (thread blocks compute per-tile bucket indices, the k-entry histograms are prefix-summed in column-major order, and the elements are then scattered to the output buckets).
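To make the bucket-assignment logic concrete, the following is a minimal host-side C++ sketch of Steps 2 through 4, under simplifying assumptions: the splitters are kept as a sorted array (rather than the binary-search-tree layout described above), the prefix sum is done serially, and there is a single "block". The function names findBucket and sampleSortPass are illustrative only and are not part of the GPU sample sort code of [27].

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the bucket index of key x given the sorted splitters
// (bucket i holds keys bounded by splitters i-1 and i).
static std::size_t findBucket(int x, const std::vector<int>& splitters) {
    // Binary search plays the role of walking the splitter search tree.
    return std::upper_bound(splitters.begin(), splitters.end(), x) - splitters.begin();
}

// One bucketing pass: histogram, exclusive prefix sum of the histogram,
// then scatter each element to its bucket. Output is bucket-contiguous.
static std::vector<int> sampleSortPass(const std::vector<int>& in,
                                       const std::vector<int>& splitters) {
    const std::size_t k = splitters.size() + 1;            // number of buckets
    std::vector<std::size_t> hist(k, 0), offset(k, 0);
    for (int x : in) ++hist[findBucket(x, splitters)];     // Step 2: histogram
    for (std::size_t i = 1; i < k; ++i)                    // Step 3: prefix sum
        offset[i] = offset[i - 1] + hist[i - 1];
    std::vector<int> out(in.size());
    for (int x : in)                                       // Step 4: scatter
        out[offset[findBucket(x, splitters)]++] = x;
    return out;
}
```

On the GPU, each thread block computes the histogram for its own tile and the per-tile histograms are prefix-summed in column-major order before the scatter, but the arithmetic is the same as in this sketch.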


3.1.2Warpsort Warpsort[ 26 ]isanothercomparisonbasedsortingalgorithmalongtheli nesof mergesort.Itusesbitonicmergetosorttheinputinanumber ofstages.Warpsort consistsofthefollowing4steps.3.1.2.1Step1–Bitonicsortbywarps Theinputisrstdividedintoasetoftilesandeachtileisso rtedbyathreadwarp usingbitonicmergesort.Abitonicnetworkfor n elementscomprisesof log( n ) phases witheachphase i (0 i log( n ) 1) having ( i +1) stages.Figure 3-3 showsthe bitonicsortingnetworkfor8elementswitheacharrowindic atingacompare-exchange operationwheretwoelementsarecomparedandswappedifthe rstoneisgreater(or lessdependingonthedirectionofthearrow)thanthesecond Stages Phase 00 1 012 2 1 0 Figure3-3.BitonicMergeSortof8Elements Allthecompare-exchangesofastageneedstobeperformedbe foremovingtothe nextstage.Thisrequiresaglobalsynchronizationamongth ethreads.So,bitonicsort byathreadblockwillrequirea syncthreads () functioncallaftereachstage.Inwarpsort, asthebitonicsortisdonebyawarpofthreadsthereisnoneed forsynchronizationas threadsinawarpareexecutedinlock-stepfashion.However ,ineashstage,thethreads inawarpwillbedoingcompare-exchangesinascendingordes cendingdirection. 64


Thiswillcauseadivergenceamongthethreadsinawarpandhe ncewillleadtoserial executionofthreads.Itcanbeseenthatforan n elementbitonicnetworkonly n = 2 compare-exchangesareperformedandhalfofthemaretoform ascendingpairswhile theotherhalfaretoformdescendingpairs(exceptduringth elaststagewhenallthe compare-exchangesareinthesameorder).Warpsortusesthi sfacttosetthenumber ofelementshandledbya32-threadwarptobe128.Eachthread thenperformstwo operationsduringeachstageofbitonicsortandoperations performedbytheallthreads inawarpareetherforanascendingpairoradescendingpair. Threaddivergenceis thusavoidedbyeachthreadperformingcompare-exchangest ogenerateresultsina particularsortedorderduringeachstepofitsexecution.3.1.2.2Step2–Bitonicmergebywarps Thesortedsequencesaremergedinthisstepbyawarpusingbi tonicmergeuntil thenumberofsequencesfallsbelowathreshold.Asthethrea dsinawarponlymerge axednumberofelements t inastep,sequencesaredividedintobuffersofsize t = 2 TomergetwosortedsequencesAandB, t = 2 elementsfromeachofthesequences arefetchedintothebuffersinsharedmemoryandthentheyar emergedusingthe bitonicmergealgorithm(bitonicmergeissameasthelastst epofbitonicsortoutlined inFigure 3-3 ).Thesmallest t = 2 elementsofthemergedsequenceareoutputandthe remaining t = 2 elementsarekeptinthebuffer.Thenthelastelementsofthe t = 2 element buffersofAandBarecheckedtondoutfromwhichsequenceth enextbufferloadof t = 2 elementsshouldbefetched. 3.1.2.3Step3–Splittinglongsequences Asthenumberofthesequencesdecreasesgeometricallyduri ngmerging,after somepointtherearenotenoughsequencestobemergedondiff erentSMsoftheGPU. Inthisstep,thelongsequencesaresplitintosmallsequenc eswhichcanbemerged usingalltheSMsintheGPU.Tosplit l sequencesofsize n intosubsequencesofsize s arandomsampleof s k elementsarechosenfromthesequenceswhere k isaconstant 65


whosevaluedependsonthetrade-offbetweenchoosingmorer andomsamplesfora goodsetofsplittersandthetimeneededforchoosingthoser andomsamples.Theset ofrandomsamplesisthensortedandtheelementsinposition sthatareamultipleof k formtheglobalsetofsplitters.These s splittersarethenusedtopartitioneachofthese l sequencesinto s subsequences. 3.1.2.4Step4–Finalmergebywarps ThisstepisessentiallythesameasStep2.TheSmallerseque ncesproducedin Step3cannowbemergedindependentlybythreadwarpstoprod ucethenaloutput. Firgure 3-4 givesallthestepsinWarpsort. Step 4 Merge by a warp Merge by a warp Merge by a warp Merge by a warp Merge by a warp Split into independent subsequences Merge by a warp Split into independent subsequences Merge by a warp Merge by a warp Merge by a warp Merge by a warp Merge by a warp Merge by a warp Bitonic sort by a warp Bitonic sort by a warp InputsStep 1 Step 2 Step 3 Figure3-4.WarpsortSteps 3.1.3SRTSRadixSort SRTSemploysahighlyoptimizedversionofthescankernelde velopedbyMerrill andGrimshaw[ 31 ]toperformradixsort.Thescanprocessconsistsofthreeke rnels, bottomlevelreduce,toplevelscan,andbottomlevelscan. BottomLevelReduce Inthisstep,thethreadsinathreadblockreadinputsand producesanaloutputperthreadblock.Thescandoesnotuse thetraditionalmethod 66


ofassigningauniquethreadtoeachinputelement.Instead, itxesthenumber C of threadblocksperformingthereduction.MerrillandGrimsh aw[ 31 ]experimentedwith differentcongurationsofemptykernelsdoingjustreadin gthenumbers(one,twoor fouratatime)fromtheglobalmemoryandthenwritingoutasi nglevaluetotheglobal memorytondtheoptimumCandthenumberofthreadsinablock .Threadsinablock loopovertheinputelementsinbatchestoperformthereduct ion.Thedependencies betweendifferentbatchesarecarriedoverintheSMregiste rsorlocalsharedmemory. Threadsinablockoperateinthreedifferentphaseswhilere ducingtheentireinput assignedtoit.Firstly,threadsreadmultipleinputelemen tsfromthecurrentbatchanddo aserialreduceusingtheregisters.Thenextphaseistheser ialreductionintheshared memorybygroupsofthreadsuntilthenumberofoutputsissma llenoughtobereduced byasinglewarpofthreads.Inthelastphase,thewarpreduce susingthestrategyof Figure 3-8 toproducethenalvalueforthecurrentbatch.Thevalueisa lsostored inthesharedmemorytobeusedwhenreducingtheelementsfro mthenextbatch. Usingthisstrategyforreducingtheinputincreasesthenum berofarithmeticoperations performedbythethreadsandhenceincreasesthenumberofin structionspermemory transaction.ThishelpsinoffsettingtheidletimeofanSMt hatiswaitingfordatafrom globalmemoryforapredominantlymemory-boundoperationl ikereduction.Figure 3-5 givesaschematicoftheentireprocess.Thecirclesindicat esathreadworkingontwo elements. Attheendofbottomlevelreduce,thereare C partialreductions. TopLevelScan Asinglethreadblockscansthe C valuesfromthepreviousphase alongwithsomeresidualinputelements,ifany,thatarenot reducedincasethenumber ofinputelementsisnotaperfectmultipleofnumberofblock s.Asonlyafewelements arescanned,thisphasedoesnotcontributemuchtotheovera llruntime. BottomLevelScan Thisisessentiallythesameastherstphasebutthescans areseededwiththevaluesobtainedfromTopLevelScan.Alth oughsimilarstepsare 67
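The three-phase reduction just described can be summarized in a few lines of sequential C++. This is only a sketch: the thread and warp counts below are illustrative assumptions (not values from [31]), the function name blockReduceSketch is hypothetical, and the loops stand in for work that the GPU performs in registers, shared memory, and a single warp, respectively.

```cpp
#include <cstddef>
#include <vector>

// Sequential sketch of one thread block reducing a batch of inputs in
// three phases: per-thread serial reduce, group-wise shared-memory
// reduce, and a final warp-style reduce of the remaining partial sums.
static long long blockReduceSketch(const std::vector<int>& batch,
                                   std::size_t threads = 128,
                                   std::size_t warp = 32) {
    // Phase 1: each "thread" serially accumulates every threads-th
    // element (stands in for the per-thread register reduction).
    std::vector<long long> partial(threads, 0);
    for (std::size_t i = 0; i < batch.size(); ++i)
        partial[i % threads] += batch[i];

    // Phase 2: halve the number of partial sums repeatedly (shared
    // memory reduction) until only one warp's worth remains.
    for (std::size_t active = threads / 2; active >= warp; active /= 2)
        for (std::size_t t = 0; t < active; ++t)
            partial[t] += partial[t + active];

    // Phase 3: a single warp reduces the last warp-sized group.
    long long total = 0;
    for (std::size_t t = 0; t < warp; ++t) total += partial[t];
    return total;   // carried over to seed the reduction of the next batch
}
```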


Figure 3-5. Bottom level reduction (inputs are first reduced serially in registers, then serially in shared memory, and finally by a warp scan).

At the end of bottom level reduce, there are C partial reductions.

Top Level Scan: A single thread block scans the C values from the previous phase along with some residual input elements, if any, that are not reduced in case the number of input elements is not a perfect multiple of the number of blocks. As only a few elements are scanned, this phase does not contribute much to the overall run time.

Bottom Level Scan: This is essentially the same as the first phase but the scans are seeded with the values obtained from Top Level Scan. Although similar steps are


i thbitofthe j thvectorissetof1whilethe i thbitofothervectorsaresetto0.Inthe nextphase,bottomlevelprexsumisdoneonthese16vectors seriallyoneatatime. Thetotalforeachdigitiscarriedoverforthenextbatchofi nputtobeprocessedbythe threadblock.3.1.3.2Step2–Toplevelscan Inthisphaseasingleblockofthreadsoperatesoverthepre xsumstocomputethe globalprexsum.Thescanismodiedtohandlethescanningo f16differentsetsof partialsumsoneatatime.3.1.3.3Step3–Bottomlevelscan Bottomlevelscanisaugmentedwithtwodifferentoperation s,oneperformedbefore thescanandoneafterthescan.Therstoneisreadingtheinp utandextractingthe bitsasdoneintherststep.Duringthesecondphase,abotto mlevelscanisperformed seededwiththevaluesfromtoplevelscanofStep2.Asinthe rstphase,16bottom levelscansareperformedeachfor16possibledigitvaluesf or4bits.Afterthescan, thesortedpositionoftheinputelementswiththecurrentba tchisknown.Thisposition isusedinthenextphase,whentheelementsareplacedintoth eirsortedorderinthe naloutput.First,dependingontheprexsumpositionsobt ained,asharedmemory exchangeisperformedtoputtheelementswithinthebatchin sortedorder.Afterthisthe threadsreadconsecutiveelementsfromthesharedmemoryan dwritethemtoglobal memorydependingontheglobaloffsetandthelocalposition .Aswithotherradixsort methodsdescribedearlier,thisstrategygenerateslarger globalmemorytransactions asconsecutiveelementswithinthesharedmemoryarewritte ntonearbypositionsin theglobalmemory.Aswiththerstphase,theradixcounters areaccumulatedandare carriedovertothenextbatchofinputprocessedbythisbloc kofthreads. Figure 3-6 depictsthethreestreamkernelsaugmentedwithextractand scatter operations. 69


Reduce Extract Inputs 0...7 Inputs 8..15 Inputs ... Inputs ... Inputs ... Inputs ... Inputs ... Inputs ... Inputs ... Inputs 0...7 Inputs 8..15 Inputs ... Inputs ... Inputs ... Inputs ... Outputs ... Outputs ... Outputs ... Outputs ... Outputs ... Outputs ... Inputs ... Inputs ... Inputs ... Outputs ... Outputs ... Outputs ... Local scan ScatterScatter ScatterScatter Scatter ScatterScatter Scatter Thread Block 0Thread Block 1Thread Block c Thread Block 0Thread Block 1Thread Block c Step 1 Step 2 Step 3 Local ReduceLocal ReduceLocal Reduce Scatter Scan ScanScanScan Scan Extr. Scan Scan Extr. Scan Extr. Scan Extr. Extr. Extr. Extr. Extr. Extr. ExtractReduceExtractReduce ExtractReduce Reduce ExtractExtractReduce ExtractReduceExtractReduceExtractReduce Thread Block 0 Figure3-6.SRTSSteps 3.1.4SDKRadixSortAlgorithm TheSDKradixsortalgorithm[ 5 ]usesaradixof 2 b with b =4 ( b =4 was determined,experimentally,togivebestresults).With b =4 and32-bitkeys,the radixsortrunsin8phaseswitheachphasesortingon4bitsof thekey.Eachphase oftheradixsortisaccomplishedin4stepsasbelow.Thesest epsassumethedatais partitioned(implicitly)intotilesofsize1024numbersea ch.Thereisaseparatekernel foreachofthese4steps. Step1:Sorteachtileonthe b bitsbeingconsideredinthisphaseusingthebit-split algorithmof[ 33 ]. Step2:Computethehistogramofeachtile.Notethatthehist ogramhas 2 b entrieswith entry i givingthenumberofelementsinthetileforwhichthe b bitsconsideredin thisphaseequal i Step3:Computetheprexsumofthehistogramsofalltiles. 70


Step4:UsetheprexsumscomputedinStep3torearrangethen umbersintosorted orderofthe b bitsbeingconsidered. 3.1.4.1Step1–Sortingtiles ForStep1,eachSMinputsatileintosharedmemoryusing256t hreads(8warps). So,eachthreadreads4consecutivenumbersfacilitatingac oalescedreadfromthe globalmemory.Eachthreadinputsthe4numbersusingavaria bleoftype int4 ,which is128bitslong.Theinputdataarestoredinregisters.Next ,the256threadscollaborate todo b roundsofthebitsplitschemeof[ 33 ].Ineachround,allthenumbershaving 0 in thatbitplacearemovedforwardfollowedbytheoneshaving 1 inthebitpositionbeing consideredasshowninFigure 3-7 bit 3 0011 00000100001101110011 1100 1101 0011 00000100001101110011 11001101 000000110011001101000111 11001101 000000110011001101001100 01111101 00110011 0000011101000011 11001101 bit 1 inputbit 0bit 2 Figure3-7.Bit-splitSchemeforSortingNumberson4Bits[ 33 ] Tomovethenumbershaving 0 saheadoftheoneshaving 1 s,thepositionofeach numberintheoutputisdeterminedandthenthenumberiswrit tenintotheshared memoryinthatposition.Forthenextround,threadsreadthe numbersinthesorted orderofthepreviousbitfromthesharedmemory.Aprexscan basedalgorithmcalled warp-scanisexecutedbyeachthreadtondouttherankofthe 4numbersithasread. First,eachthreadcalculatesthenumberofzeroesinthecur rentbitpositionofthe4 numbersreadbyit.Itthenwritesthevalueinaparticularlo cationinthesharedmemory dependingonitsthreadid( tid )constructingacountvectorperwarp.Additionalstorage isallocatedinthesharedmemoryadjacenttothecountvecto randzeoredoutbefore thecalculationbegins.Anexclusiveprexscanofthecount vectorisperformedbythe 71


32threadsinawarpin log(32)=5 steps.Duringstep i ( i =0...4 ),eachthreadaddsits owncountwiththecountwhichis 2 i awayasshowninthealgorithmofFigure 3-8 and correspondingschematicinFigure 3-9 .Thisalgorithmhasnodivergencewithinawarp. However,spuriousadditionsareperformedbythethreadsdu ringeachstepotherthan therststepwheneachadditionisnecessary. Algorithmwarp-scan()f //Computetheprefixsumofcountvector //warpIdisthewarpId=tid>>5//warpSizeis32idx=2*warpId*warpSize+tid;shared[idx]=0//zerooutthelocationidx+=warpSize;shared[idx]=val;//Copythecounthereshared[idx]+=shared[idx-1];shared[idx]+=shared[idx-2];shared[idx]+=shared[idx-4];shared[idx]+=shared[idx-8];shared[idx]+=shared[idx-16];//Convertinclusivetoexclusivereturn(shared[idx-val); gFigure3-8.DivergenceFreeWarpScanAlgorithm Result 0 00000 4 675654 0 10 21 34 0000 25421440 0000000 23114 00000 4 654 9 1111 InputPadded InputsData[tid] += sData[tid 1]sData[tid] += sData[tid 2]sData[tid] += sData[tid 4] Figure3-9.WarpScanof8Numbers Analscanisperformedoverthelastvaluesof8warpstondo utthenalposition ofthe4numbersreadbyathread.Eachthreadthenwritesthen umbersouttothe sharedmemoryaccordingtothedeterminedposition.Follow ingthese b roundsofthe bit-splitalgorithm,thetileisinsortedorder(ofthe b bitsbeingconsidered)inregisters. Thesortedtileisthenwrittenouttoglobalmemory. 72


3.1.4.2Step2–Calculatinghistogram Inthisstep,thenumbersforeachhalftileareinput(insort edorder)using256 threadsperSMandthehistogramofthe b bitsbeingconsiderediscomputedforthehalf tileinput.Notethat,onecaninsteadinputtheentiretilea ndremainingstepswouldwork nebut,empirically,workingwithhalf-tilesfromthispoi ntonwardsperformsbetter.For thiscomputation,eachthreadinputstwonumbersfromgloba lmemoryandwritesthese tosharedmemory.Thethreadsthendeterminetheupto15plac esintheinputhalftile wherethe4bitsbeingconsideredchange.Forthis,eachthre adsimplychecksitsown numberwiththenextone.Ifthebitsdiffer,itrecordsthepo sitioninasharedmemory array.Onceallthethreadsaredone,allpositionsinthehal f-tilewherethebitschange arerecordedinthearrayasshownbythebrokenarrowsinFigu re 3-10 1111 0000 0000 0000 0000 000100010010001011101111 Figure3-10.CalculatingHistogramOffsets Oncethepositionsaredetermined,thesizeofeachbucket(i .enumberofelements havingsamebitvalues)isdeterminedbyndingouttheinter valsbetweenthepositions bythesamethreadswhichfoundthepositionsduringtheprev iousstep.Thehistogram iswrittenouttoglobalmemoryincolumnmajororder,i.e.,b ucket0ofallhalf-tiles followedbybucket1ofallhalf-tilesallthewayuptobucket 15ofallhalf-tiles.The positionsinahalftilewherethebitsdifferarealsowritte nouttoglobalmemoryasthis informationisneededinthenalstep.3.1.4.3Step3–Prexsumofhistogram InStep3,theprexsumofthehalf-tilehistogramsiscomput edusingtheprex sumcodeavailableinCUDPP[ 29 ]andwrittentoglobalmemoryasshowninthe Figure 3-11 73


p 1 123456789101112131415 0 012 Histogram Tiles Global Prefix Sum Figure3-11.ColumnMajorHistogram 3.1.4.4Step4–Rearrangement InStep4,eachSMinputsahalftileofdata,whichissortedon the b bitsbeing considered,andusesthecomputedprexsumhistogramforth ishalftiletowritethe numbersinthehalftiletotheircorrectpositioninglobalm emory.EachthreadintheSM reads2numbersasbeforecorrespondingtoitsid.Thetileof fsetscalculatedinStep 2andtheprexsumhistogramfromStep3arealsoread.Astheh alftilesarealready sortedfollowingStep1,thethreadidisthesameastheposit ionofthenumberreadin thetile.Ifthevalueofthe b bitsbeingconsideredforthecurrentnumberis r ,theprex sumhistogramgiveshowmanynumberstherearewith b bitvaluelessthan r andalso howmanynumbersinthetilesbeforeithavebitvalueexactly equalto r .However,one alsoneedstondouthowmanyofthenumbersinthecurrenttil ehavebitvalueexactly equalto r andoccurbeforethenumberinthesortedorder.AsperStep2, thetileoffset forbitvalue ( r 1) readfromtheglobalmemoryindicatestheplacewherethebit value changeoccursfrom ( r 1) to r i.e.theplacefromwherenumberswithbitvalue r starts. Hence,thereare ( tid tileOset [ r 1]) numbersbeforethecurrentnumberhavingbit valuesexactlyequalto r .Hence,thenalpositionofthenumberwillbethesumofthis localoffsetandtheprexsumhistogram.Asthenumbersarea lreadyinsortedorder whenreadbythreads,consecutivenumbersinthetilereadby consecutivethreadsin ahalf-warparewrittentoconsecutivelocationsinglobalm emory.Thisreducesthe randomscatteringbyasignicantamountasmostofthewrite sproducecoalesced 74


memorytransactions.Followingthis,allnumbersareinsor tedorder,inglobalmemory, withrespecttothe b bitsbeingconsideredinthisphaseoftheradixsort. 3.1.5GPURadixSort(GRS) LiketheSDKradixsortalgorithmof[ 5 ],GRSaccomplishesaradixsortusinga radixof 2 b with b =4 .Eachphaseoftheradixsortisdonein3stepswitheachstep usingadifferentkernel.Forpurposesofdiscussion,weass umeatilesizeof t ( t =1024 intheSDKimplementation).Wedenethe rank ofnumber i inatiletobethenumberof integersinthetilethatprecedenumber i andhavethesamevalueasofnumber i .Since wecomputeranksineachphaseoftheradixsort,numberequal ity(forrankpurposes) translatestoequalityofthe b bitsofthenumberbeingconsideredinaparticularphase. Notethatwhenthetilesizeis1024,rankslieintherange0th rough1023.Thethree stepsineachphaseofGRSare: Step1:Computethehistogramforforeachtileaswellasther ankofeachnumberin thetile.ThishistogramisthesameasthatcomputedinStep2 oftheSDKradix sort. Step2:Computetheprexsumsofthehistogramsofalltiles.Step3:UsetherankscomputedinStep1tosortthedatawithin atile.Next,usethe prexsumscomputedinStep2torearrangethenumbersintoso rtedorderofthe b bitsbeingconsidered. Step1requiresustoreadthenumbersinatilefromglobalmem ory,compute thehistogramandranks,andthenwritethecomputedhistogr amandrankstoglobal memory.Step2isidenticaltoStep3oftheSDKalgorithm.InS tep3,numbersare againreadfromglobalmemory.Thenumbersinatilearerstr eorderedinshared memorytogetthesortedarrangementofStep1oftheSDKalgor ithmandthenwritten toglobalmemorysoastoobtainthesortedorderfollowingSt ep4oftheSDKalgorithm. Thiswritingofnumbersfromsharedmemorytoglobalmemoryi sidenticaltothatdone inStep4oftheSDKalgorithm.Thefollowingsubsectionspro videimplementation 75


detailsforthe3stepsofGRS.Notethatthesortingdonebyth eSDKradixsortduring theveryrststepisnowdoneusingtherankscalculatedinth erststepandthe sortingismergedintothelaststepofGRSwherethenumbersa renallyputintotheir sortedorder.Astherankcalculationisdonewithoutanyove rheadwhilecalculatingthe histogram,GRSisdoinglessworkcomparedtoSDKradixsort. However,GRSusesan additionalstorageforstoringtheranks.3.1.5.1Step1–HistogramandRanks AnSMcomputesthehistogramsandranksfor64tilesatatimee mploying64 threads.Figure 3-12 givesahigh-leveldescriptionofthealgorithmusedbyusfo r thispurpose.Ouralgorithmprocesses32numbersfromeacht ileinaniterationof the for loop.So,thenumberof for loopiterationsisthetilesize( t )dividedby32. Ineachiterationofthe for loop,the64threadscooperatetoread32numbersfrom eachofthe64tiles.Thisisdoneinsuchaway(describedlate r)thatglobalmemory transactionsare128byteseach.Thedatathatisreadiswrit tentosharedmemory. Next,eachthreadreadsthe32numbersofaparticulartilefr omsharedmemoryand updatesthetilehistogram,whichitselfresidesinsharedm emory.Althoughwehave enoughregisterstoaccommodatethe64histograms,CUDArel egatesaregisterarray toglobalmemoryunlessitisabletodetermine,atcompileti me,thevalueofthearray index.Tomaintainthehistogramsinregisters,weneedanel aboratehistogramupdate schemewhosecomputationtimeexceedsthetimesavedoverma kingaccessesto randompointsinanarraystoredinsharedmemory.Whenanumb erisprocessedby athread,thethreadextractsthe b bitsinuseforthisphaseoftheradixsort.Suppose theextracted b bitshavethevalue12,thenthecurrenthistogramvaluefor1 2isthe rankofthenumber.Thenewhistogramvaluefor12is1moretha nthecurrentvalue. Thedeterminedrankiswrittentosharedmemoryusingthesam elocationusedbythe number(i.e.,therankoverwritesthenumber).Notethatonc eanumberisprocessed, itisnolongerneededbyAlgorithmHR.Oncetheranksforthec urrentbatchof32 76


numberspertilehavebeencomputed,thesearewrittentoglo balmemoryandwe proceedtothenextbatchof32numbers.Towritetherankstog lobalmemory,the64 threadscooperateensuringthateachtransactiontoglobal memoryis128bytes.When the for loopterminates,wehavesuccessfullycomputedthehistogr amsforthe64tiles andthesearewrittentoglobalmemory. AlgorithmHR()f //Computethehistogramsandranksfor64tiles itrs=t/32;//t=tilesizefor(i=0;i
PAGE 78

ofeachtileresidinginthesamebank.Sinceinthenextstep, eachthreadprocessesits tile'snumbersinthesameorder,wewillhavesharedmemoryc onictsthatwillcause thereadsofnumberstobeserializedwithineachhalfwarp.T oavoidthisserialization, weuseacircularshiftpatterntomapnumberstothearray sKeys 4 .TheCUDAkernel codetodothisreadisgiveninFigure 3-13 // tid isthethreadidand bid istheblockid //DeterminethefirsttilehandledbythisthreadstartTile=(bid*64+tid/8)*(t/4);//startingnumberpositioninthetile// keyOset istheoffsetforcurrent32keys keyPos=keyOffset+tid%8;//sharedmemorypositiontowritethekeyswith//circularshiftsKeyPos=(tid/8)*8+(((tid/8)%8)+(tid%8))%8;//someconstantstileSize8=8*(t/4);tid4=tid*4;//Initializethehistogramcountersfor(i=0;i<16;i++)f sHist[tid*16+i]=0; g//Waitforallthreadstofinishsyncthreads();curTileId=startTileId;for(i=0;i<8;i++)f sKeys4[sKeyPos+i*64]=keysIn4[keyPos+startTile];curTileId+=tileSize8; gsyncthreads(); Figure3-13.ReadingtheNumbersfromGlobalMemory Asstatedearlier,tocomputethehistogramsandranks,each threadworksonthe numbersofasingletile.Athreadinputsoneelement(4numbe rs)of sKeys 4 ,updates thehistogramusingthese4numbersandwritesbacktheranko fthese4numbers(as notedearliertheequalsthehistogramvaluejustbeforeiti supdated).Figure 3-14 gives thekernelcodetoupdatethehistogramandcomputetheranks forthe4numbersinone elementof sKeys 4 78

PAGE 79

//Updatethehistogramsandcalculatetherank// startbit isthestartingbitpositionfor //thisphaseint4p4,r4;for(i=0;i<8;i++)f p4=sKeys4[(tid4)+(i+tid)%8];r4.x=sHist[((p4.x>>startbit)&0xF)*64+tid]++;r4.y=sHist[((p4.y>>startbit)&0xF)*64+tid]++;r4.z=sHist[((p4.z>>startbit)&0xF)*64+tid]++;r4.w=sHist[((p4.w>>startbit)&0xF)*64+tid]++;sKeys4[tid4+(i+tid)%8]=r4;g syncthreads(); Figure3-14.ProcessinganElementof sHist4[] Oncetherankshavebeencomputed,theyarewrittentoglobal memoryusinga processsimilartothatusedtoreadinthenumbers.Figure 3-15 givesthekernelcode. Here, ranks 4[] isanarrayinglobalmemory;itsdatatypeis int4 curTileId=startTileId;for(i=0;i<8;i++)f ranks4[keyOffset+keyPos+startTileId]=sKeys4[sKeyPos+i*64];curTileId+=tileSize8; gsyncthreads(); Figure3-15.WritingtheRankstoGlobalMemory WhenanSMcompletesthehistogramcomputationforthe64til esassignedto it,itwritesthecomputed64histogramstothearray counters inglobalmemory.Ifwe viewthe64histogramsasforminga 16 64 array,thenthisarrayismappedtotheone dimensionalarrayincolumn-majororder.Figure 3-16 givesthekernelcodeforthis. 3.1.5.2Step2–Prexsumoftilehistograms AsinStep3oftheSDKalgorithm,theprexsumofthetilehist ogramsiscomputed withaCUDPPfunctioncall.TheSDKradixsortcomputesthepr exsumofhalftiles 79

PAGE 80

//calculateIdofthreadsamongstallthreads// nTiles istotalnumberoftiles globalTid=bid*64+tid;for(i=0;i<16;i++)f counters[i*nTiles+globalTid]=sHist[i*64+tid]; g Figure3-16.WritingtheHistogramstoGlobalMemory whilewedothisforfulltiles.Assumingbothalgorithmsuse thesametilesize,theprex suminStep2ofGRSinvolveshalfasmanyhistogramsasdoesth eprexsuminStep 3oftheSDKalgorithm.Thisdifference,however,resultsin anegligiblereductioninrun timeforStep2ofGRSversusStep3ofSDK.3.1.5.3Step3–Positioningnumbersinatile Tomovethenumbersinatiletotheircorrectoverallsortedp ositionwithrespect tothe b bitsbeingconsideredinaparticularphase,weneedtodeter minethecorrect positionforeachnumberinatile.Thecorrectpositionisob tainedbyrstcomputing theprexsumofthetilehistogram.Thisprexsummaybecomp utedusingthe warpscanalgorithmusedintheSDKradixsortcodecorrespon dingtothealgorithm of[ 5 ].Thecorrectpositionofanumberisitsrankplusthehistog ramprexsum correspondingtothe b bitsofthenumberbeingconsideredinthisphase.Asnoted earlier,movingnumbersdirectlyfromtheircurrentpositi ontotheirnalcorrectpositions isexpensivebecauseofthelargenumberofmemorytransacti onstoglobalmemory. Betterperformanceisobtainedwhenwerstrearrangethenu mbersinatileintosorted orderofthe b bitsbeingconsideredandthenmovethenumberstotheircorr ectposition intheoverallsortedorder.Figure 3-17 givesthekernelcode.OneSMreordersthe numbersinasingletile.Sinceeachthreadhandles4numbers ,thenumberofthreads usedbyanSMforthispurpose(equivalently,thenumberofth readsperthreadblock)is t = 4 80

PAGE 81

  kernel reorderData(keysOut, keysIn4, counters, countersSum, ranks4)
  {
    // Read the numbers from keysIn4 and put them in
    // sorted order in keysOut
    // sTileCnt stores tile histogram
    // sGOffset stores global prefix-summed histogram
    shared sTileCnt[16], sGOffset[16];
    // storage for numbers
    shared int sKeys[t];
    int4 k4, r4;
    // Read the histograms from the global memory
    if (tid < 16) {
      sTileCnt[tid] = counters[tid*nTiles + bid];
      sGOffset[tid] = countersSum[tid*nTiles + bid];
    }
    syncthreads();
    // Perform a warp scan on the tile histogram
    sTileCnt = warp_scan(sTileCnt);
    syncthreads();
    // read the numbers and their ranks
    k4 = keysIn4[bid*(t/4) + tid];
    r4 = ranks4[bid*(t/4) + tid];
    // Find the correct position and write to the shared mem
    r4.x = r4.x + sTileCnt[(k4.x >> startbit) & 0xF];
    sKeys[r4.x] = k4.x;
    // Similar code for y, z and w components of
    // r4 and k4 comes here
    syncthreads();
    // Determine the global rank
    // Each thread places 4 numbers at positions
    // tid, (tid + t/4), (tid + t/2) and (tid + 3t/4)
    radix = (sKeys[tid] >> startbit) & 0xF;
    globalOffset.x = sGOffset[radix] + tid - sTileCnt[radix];
    keysOut[globalOffset.x] = sKeys[tid];
  }
Figure 3-17. Rearranging Data
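To summarize Steps 1 through 3, the following sequential C++ sketch (a simplified reference version written for this discussion, not the CUDA kernels above) shows how a number's final position in one 4-bit phase is obtained from its rank, its tile histogram, and the global column-major prefix sum; the array names are illustrative.

  #include <vector>
  #include <cstdint>

  // One 4-bit phase of a GRS-style radix sort, sequential reference version.
  // in holds n numbers, t is the tile size (assumed to divide n).
  void grsPhaseReference(const std::vector<uint32_t>& in, std::vector<uint32_t>& out,
                         int startbit, int t) {
    int n = (int)in.size(), nTiles = n / t;
    // Step 1: per-tile histograms and per-number ranks.
    std::vector<int> hist(16 * nTiles, 0);   // hist[digit * nTiles + tile]
    std::vector<int> rank(n);
    for (int tile = 0; tile < nTiles; tile++) {
      int cnt[16] = {0};
      for (int i = 0; i < t; i++) {
        int digit = (in[tile * t + i] >> startbit) & 0xF;
        rank[tile * t + i] = cnt[digit]++;   // rank = histogram value before the update
      }
      for (int d = 0; d < 16; d++) hist[d * nTiles + tile] = cnt[d];
    }
    // Step 2: exclusive prefix sum over the digit-major (column-major) histogram array.
    std::vector<int> histSum(16 * nTiles);
    int sum = 0;
    for (int i = 0; i < 16 * nTiles; i++) { histSum[i] = sum; sum += hist[i]; }
    // Step 3: scatter each number to its final position for this phase.
    out.resize(n);
    for (int tile = 0; tile < nTiles; tile++) {
      for (int i = 0; i < t; i++) {
        uint32_t key = in[tile * t + i];
        int digit = (key >> startbit) & 0xF;
        out[histSum[digit * nTiles + tile] + rank[tile * t + i]] = key;
      }
    }
  }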


3.1.6 Comparison of Number Sorting Algorithms

According to the results reported in [27] and [26], radix sort based algorithms perform better than sample sort and warp sort. Hence, here, we compare SDK radix sort, GRS and SRTS.¹ These algorithms were coded and run using NVIDIA CUDA SDK 3.0 on a Tesla C1060 GPU. For each of the experiments, the run time reported is the average run time for 100 randomly generated sequences. While comparing SDK, GRS and SRTS, the same random sequences are used for the 3 algorithms.

¹ We are grateful to the authors of SRTS [31] for making their code available.

For the first set of comparisons, SDK, GRS and SRTS are used to sort from 1M to 10M numbers. As shown in Figure 3-18A, SDK runs 20% to 7% faster than GRS for 1M to 3M numbers, respectively. However, GRS outperforms SDK for sorting 4M numbers and more. It runs 11% to 21% faster than SDK for 4M to 10M numbers, respectively. SRTS is the best performing algorithm among the three, running 53% to 57% faster than GRS for 1M to 10M numbers. This performance differential is also observed while sorting even larger sets of numbers. Figure 3-18B shows the run times of SDK, GRS and SRTS starting from 10M numbers with an increment of 10M. For 100M numbers, GRS runs 21% faster than SDK and 53% slower than SRTS.

Figure 3-18. Sorting numbers using radix sorts. A) Sorting 1M to 10M numbers. B) Sorting 10M to 100M numbers. [Plots of time (ms) vs. number of integers (millions) for SDK, GRS and SRTS.]

3.2 Sorting Records on GPUs

3.2.1 Record Layouts

A record R is comprised of a key k and m other fields f1, f2, ..., fm. For simplicity, we assume that the key and each other field occupies 32 bits. Let ki be the key of record Ri and let fij, 1 ≤ j ≤ m, be this record's other fields.
With our simplifying assumption of uniform size fields, we may view the n records to be sorted as a two-dimensional array fieldsArray[][] with fieldsArray[i][0] = ki and fieldsArray[i][j] = fij, 1 ≤ j ≤ m, 1 ≤ i ≤ n. When this array is mapped to memory in column-major order, we get the ByField layout of [21]. This layout was used also for the AA-sort algorithm developed for the Cell Broadband Engine in [11] and is essentially the same as that used by the GPU radix sort algorithm of [5]. When the fields array is mapped to memory in row-major order, we get the ByRecord layout of [21]. A third layout, Hybrid, is employed in [31]. This is a hybrid between the ByField and ByRecord layouts. The keys are stored in an array and the remaining fields are stored using the ByRecord layout. Essentially then, in the Hybrid layout, we have two arrays. Each element of one array is a key and each element of the other array is a structure that contains all fields associated with an individual record. In this paper, we limit ourselves to the ByField and Hybrid layouts. We do not consider the ByRecord layout as it appears that the most effective way to sort in this layout is to first extract the keys, sort (key, index) pairs and then reorder the records into the obtained sorted permutation. The last two steps are identical to the steps in an optimal sort for the Hybrid layout. So, we expect that good strategies to sort in the Hybrid layout will also be good for the ByRecord layout. When the sort begins with data in a particular layout format, the result of the sort must also be in that layout format.
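As a concrete illustration of the three layouts just described, the following C++ fragment sketches how the same n records with m 32-bit fields might be declared in each layout; the type and member names here are our own and are not part of the sorting codes.

  #include <cstdint>
  #include <vector>

  const int m = 4;                 // number of non-key fields (illustrative)

  // ByField: the conceptual fieldsArray stored column-major; column 0 holds the keys.
  // Element (i, j) of fieldsArray is at data[j * n + i].
  struct ByFieldLayout {
    std::vector<uint32_t> data;    // size (m + 1) * n
    uint32_t& at(int i, int j, int n) { return data[j * n + i]; }
  };

  // ByRecord: row-major, i.e., an array of complete records.
  struct RecordByRecord { uint32_t key; uint32_t f[m]; };
  using ByRecordLayout = std::vector<RecordByRecord>;

  // Hybrid: keys in one array, the remaining fields grouped per record in another.
  struct Satellite { uint32_t f[m]; };
  struct HybridLayout {
    std::vector<uint32_t> keys;    // keys[i] is the key of record i
    std::vector<Satellite> fields; // fields[i] holds the other fields of record i
  };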


3.2.2 High Level Strategies for Sorting Records

At a high level, there are two very distinct approaches to sort multifield records [35]. In the first, we construct a set of tuples (ki, i), where ki is the key of the i-th record. Then, these tuples are sorted by extending a number sort algorithm so that whenever the number sort algorithm moves a key, the extended version moves a tuple. Once the tuples are sorted, the original records are rearranged by copying records from the fieldsArray to a new array, placing the records into their sorted positions in the new array, or in-place using a cycle chasing algorithm as described for a table sort in [20]. The second strategy is to extend a number sort so as to move an entire record every time its key is moved by the number sort. We call the first strategy the indirect strategy and the second the direct strategy for sorting multifield records. There are advantages and disadvantages to each strategy. The indirect strategy seems to perform much less work than the direct strategy during sorting, as the satellite data that needs to be moved with the key is only an integer index while in the direct strategy it is the entire record. On the flip side, the indirect strategy has a very costly random global memory access phase at the end when records are moved to their sorted positions, whereas the direct strategy does not have this phase. A sketch contrasting the two strategies is given below.

Our focus is to extend GPU number sorting algorithms to handle records with multiple fields using different layouts. Specifically, we extend the algorithms which are the fastest comparison and non-comparison based sorting algorithms. Warp sort and sample sort are both comparison based algorithms with similar reported performance. So, we selected one of these, sample sort, to extend to the sorting of multifield records. Among non-comparison based sorting methods, SRTS is the fastest for sorting numbers while GRS is the fastest for sorting large records in the Hybrid layout (Section 3.2) using the direct strategy (Section 3.2). So, we extend SRTS for sorting records and GRS to sort records in other layouts.
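The following minimal host-side C++ sketch (our own; sortDirect/sortIndirect and the record type are illustrative, and std::sort stands in for any of the GPU number sorts) summarizes the two strategies: the indirect version sorts (key, index) pairs and then gathers whole records once, while the direct version moves the full record every time the underlying sort moves a key.

  #include <algorithm>
  #include <numeric>
  #include <vector>
  #include <cstdint>

  struct Record { uint32_t key; uint32_t fields[8]; };   // illustrative record

  // Indirect strategy: sort (key, index) pairs, then one gather pass over the records.
  void sortIndirect(std::vector<Record>& recs) {
    std::vector<uint32_t> idx(recs.size());
    std::iota(idx.begin(), idx.end(), 0u);
    std::sort(idx.begin(), idx.end(),
              [&](uint32_t a, uint32_t b) { return recs[a].key < recs[b].key; });
    std::vector<Record> out(recs.size());
    for (size_t i = 0; i < idx.size(); i++) out[i] = recs[idx[i]];  // costly random gather
    recs.swap(out);
  }

  // Direct strategy: the sort moves the entire record whenever it moves a key.
  void sortDirect(std::vector<Record>& recs) {
    std::sort(recs.begin(), recs.end(),
              [](const Record& a, const Record& b) { return a.key < b.key; });
  }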


3.2.3 Sample Sort for Sorting Records

Sample sort [27] is a multi-way divide and conquer sorting algorithm which performs better when memory bandwidth is an issue, as the data transferred to and from the global memory is less than in a two-way approach. Figure 3-1 shows the steps of serial sample sort.

To obtain an efficient parallel version of sample sort, it is necessary to balance the size of buckets assigned to thread blocks. This is done by choosing the splitters from a large randomly selected sample of keys. Once the splitters are selected, the records are partitioned into buckets by first dividing the data into equal sized tiles with each tile being assigned to a block of threads. A thread block examines its tile of data and assigns records in this tile to buckets whose boundaries are the previously chosen splitters. Finally, the buckets produced for the tiles are combined to obtain global buckets. The steps in the GPU sample sort of [27] are described below.

Phase 1: During this phase, the splitters are chosen. First, a set of random samples is taken out of the elements in the buckets. A set of splitters is then chosen from these random samples. Finally, the splitters are sorted using odd-even merge sort [34] in shared memory and a Binary Search Tree of splitters is created to facilitate the process of finding the bucket for an element.

Phase 2: Each thread block is assigned a part of the input data. Threads in a block load the Binary Search Tree into shared memory and then calculate the bucket index for each element in the tile. At the end, threads store the number of elements in each bucket as a k-entry histogram in the global memory.

Phase 3: The per-block k-entry histograms are prefix-summed to obtain the global offset for each bucket.

Phase 4: Each thread block in this phase again calculates the bucket index for all keys in its tile. They also calculate the local offset within the buckets. The local offsets are added to the global offsets from the previous phase to get the final position of the records.

When sample sorting records using the direct strategy outlined in Section 3.2, the records need to be moved only during the fourth phase, as in all other phases only the keys are required to be moved. This distribution of the records from the large bucket to small buckets is repeated multiple times till the size of the bucket is below a specified threshold. Finally, quick sort is done on the records when the bucket size is small. Records are also moved during the partitioning phase of the quick sort within a small bucket. The fourth phase and the quick sort part of sample sort can be extended to handle records in ByField and ByRecord format.
In the ByField layout, while moving fieldsArray[i] to outfieldsArray[j], threads can move the corresponding fields as shown in Figure 3-19.

  outKey[j] = key[i];
  // Move the fields
  for (f = 1; f <= m; f++) {
    outfieldsArray[j][f] = fieldsArray[i][f];
  }
Figure 3-19. Moving records in ByField layout

Similarly, in ByRecord (Hybrid) format, fields can be moved by a thread while moving the keys. Figure 3-20 shows the code to move records assuming that fieldsArray[i] and outfieldsArray[j] are structures that contain the fields.

  outKey[j] = key[i];
  // Move the fields
  outfieldsArray[j] = fieldsArray[i];
Figure 3-20. Moving records in ByRecord layout

We observe that in the ByField layout, threads in a warp access adjacent elements in global memory resulting in coalescing of memory accesses, while in the ByRecord layout, threads in a warp access words in the global memory that are potentially far apart, generating global memory transactions of size at most 16 bytes. We employ a strategy of grouping the threads together so that we can generate larger memory transactions. Rather than a single thread reading and writing the entire record, we employ a group of threads to read and write the records into the global memory cooperatively. Then this same group of threads iterates to read and write other records cooperatively. This ensures larger global memory transactions. As an example, let's say the record is 64 bytes in length; as each thread can read in 16 bytes of data using an int4 data type, we can group 4 threads together so that they can read the entire record together. Then this thread group iterates over to read other records until all the records assigned to the thread group have been read.
Let numThreads denote the number of threads in a block and suppose that each thread is to read in one record and put it into its proper place in the output array. Assume that records from startOffset to (startOffset + numThreads) are processed by this thread block. For the sake of clarity of the pseudocode, we assume that there is a map mapInToOut which determines the proper position in the output array. In the case of sample sort, it would be the Binary Search Tree constructed out of the splitters which would determine the position of a particular record in the output array. Figure 3-21 outlines the optimized version of moving records using coalesced read and write.

  // Determine the number of threads required to
  // read the entire record
  numThrdsInGrp = sizeof(Rec)/16;
  // Total number of records to be read = number of
  // threads in the group
  numItrs = numThrdsInGrp;
  // Number of records read in a single iteration
  // by all threads
  nRecsPerItr = numThrds/numThrdsInGrp;
  // Convert Record arrays to int4 arrays
  recInt4 = (int4 *)rec;
  outRecInt4 = (int4 *)outRec;
  // Determine the starting record and position in the group for this thread
  startRec = startOffset + threadId/numThrdsInGrp;
  posInGrp = threadId%numThrdsInGrp;
  for (i = 0; i < ...
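Since the remainder of Figure 3-21 did not survive extraction here, the following CUDA-style sketch illustrates the cooperative-copy idea described above under our own assumptions: a 64-byte record, int4-sized chunks, and a hypothetical mapInToOut[] giving each record's destination. It is an illustration of the technique, not the original listing.

  // Hypothetical cooperative record copy: sizeof(Rec)/16 threads move one record,
  // 16 bytes (one int4) per thread, so global transactions stay large and coalesced.
  struct Rec { int4 chunk[4]; };                  // 64-byte record (illustrative)

  __global__ void moveRecordsCoalesced(const Rec* rec, Rec* outRec,
                                       const int* mapInToOut,
                                       int startOffset, int numRecsForBlock) {
    const int numThrdsInGrp = sizeof(Rec) / 16;   // threads cooperating on one record
    const int chunksPerRec  = sizeof(Rec) / 16;   // int4 chunks per record
    const int4* recInt4 = reinterpret_cast<const int4*>(rec);
    int4* outRecInt4 = reinterpret_cast<int4*>(outRec);
    int grpId     = threadIdx.x / numThrdsInGrp;  // which record this thread helps move
    int posInGrp  = threadIdx.x % numThrdsInGrp;  // which 16-byte chunk it moves
    int grpsPerBlock = blockDim.x / numThrdsInGrp;
    for (int r = startOffset + grpId; r < startOffset + numRecsForBlock; r += grpsPerBlock) {
      int dest = mapInToOut[r];                   // destination index of record r
      outRecInt4[dest * chunksPerRec + posInGrp] = recInt4[r * chunksPerRec + posInGrp];
    }
  }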

3.2.4 SRTS for Sorting Records

... in SRTS increases the arithmetic intensity of these memory bound operations and eliminates the need for additional kernels for sorting as performed in SDK radix sort by [5]. SRTS further brings parity between computation and memory access by only having a fixed number of thread blocks in the GPU. Each thread loops over to process the data in batches and hence the amount of computation done per thread increases substantially. SRTS uses a fixed number of thread blocks, enough to occupy all SMs in the GPU. The input data is divided into tiles and a set of tiles is assigned to a thread block. It uses a radix of 2^b with b = 4. With b = 4 and 32-bit keys, the radix sort runs in 8 phases with each phase sorting on 4 bits of the key. Each thread block, while processing a tile, finds out the number of keys with a particular radix value. The radix counters indicating the number of keys with a particular radix value are accumulated over the tiles assigned to that thread block. SRTS consists of the following steps [31].

Phase 1 - Bottom Level Reduction: This phase consists of two sub-phases. During the first phase, each thread reads in an element from the input data, extracts the b bits being considered and increases the histogram counts correspondingly. The threads in the thread block loop over the tiles assigned to the thread block and the radix counters are accumulated in the local registers as there are only 16 different radix values for 4 bits. After the last tile of input data is processed, the threads within a block perform a local prefix sum cooperatively to prefix sum the sequence of counters and the result is written to the global memory as a set of prefix sums.

Phase 2 - Top Level Scan: In this phase, a single block of threads operates over the prefix sums to compute the global prefix sum. The block-level scan is modified to handle a concatenation of partial prefix sums.

Phase 3 - Bottom Level Scan: Lastly, the threads in a block sort the elements by first reading in the prefix sum calculated during the top-level scan phase. The thread block reads in the elements again and extracts the b bits being considered. A local parallel scan is done to find the local prefix sum.
These local offsets are seeded with the global prefix sums calculated earlier to get the final position of the element in the output. The input elements are next scattered in shared memory using the local offsets to put them in sorted order within the tile. Finally, those elements are read in sorted order from the shared memory and written onto the global memory. This strategy ensures better memory coherence and generates larger global memory transactions. As with the first phase, the radix counters are accumulated and are carried over to the next tile of input processed by this block of threads using local registers.

The final scatter of input elements happens during the very last phase. Only the keys of the records are required during the other phases. So, the fields of the record can also be moved during the third phase while scattering the keys. We can use the strategies outlined in Figures 3-19 and 3-20 to scatter the fields in ByField and ByRecord layouts respectively. However, due to the way SRTS is implemented using generic programming, it is difficult to use the optimized version of record moving (Figure 3-21) in ByRecord format. The third phase of record scattering occurs only 8 times for 32-bit keys during the entire sorting process and does not depend on the number of records being sorted. This indicates that, for SRTS, the direct strategy of moving records while sorting might actually perform better than the indirect strategy of sorting (key, index) pairs followed by rearrangement.

3.2.5 GRS for Sorting Records

GRS is developed specifically for sorting records [32] along the lines of the SDK radix sort [5], but the focus is to reduce the number of times a record is read from or written into the global memory. As with the other strategies, the input data is divided into tiles and we define the rank of an element as the number of elements having the same digit value before the element in the input data tile. GRS employs additional memory to store the rank of the elements and eliminates the sorting phase proposed in SDK sort [5]. This helps to find the local offset of records within an input data tile and helps to sort them in the shared memory, much like SRTS, before writing them out to the global memory.
GRS also processes 4 bits per pass and hence has 8 passes in total to sort records with 32-bit keys. The three phases in each pass of GRS are:

Phase 1: Compute the histogram for each tile as well as the rank of each record in the tile. In this phase, a block of 64 threads operates on an input tile to cooperatively read the keys from global memory to the shared memory. The global memory reads are made coalesced by ensuring consecutive threads access consecutive keys in global memory. The writing to the shared memory is performed with an offset to avoid bank conflicts. Threads read the keys from the shared memory with an offset and calculate the histogram and the rank using the digit counters. The rank overwrites the keys in the shared memory as we don't need them after their ranks are calculated. Finally, the histogram and the ranks are written out to the global memory. As the rank cannot exceed the size of a tile, which is typically set to 1024 records, a full integer is not required to store the rank.

Phase 2: The prefix sums of the histograms of all tiles are computed.

Phase 3: Lastly, in this phase the entire record is first read from the global memory to the shared memory. We then use the ranks and prefix-summed local histograms to find the local offset for each record in the tile. We put the records in the shared memory according to the offset so that we get a sorted tile of records. The threads then read the records from shared memory in order and use the global prefix sum to put them in their final place in the output.

Much like SRTS, the final scatter of the records is done in the last phase. As the last phase caters to a very simple implementation, we can efficiently read and write records in this phase. This simplicity enables us to use the algorithms of Figures 3-19 and 3-21 to move records in ByField and ByRecord layouts respectively. As with SRTS, we only move records 8 times during the sorting of records with 32-bit keys. Hence, GRS also has a fair chance of outperforming the first strategy of sorting records by using (key, index) pairs.
3.3 Experimental Results

We implemented and evaluated the record sorting algorithms mentioned in the previous sections using Nvidia CUDA SDK 3.2. Specifically, we evaluated two versions of sample sort, GRS and SRTS, each corresponding to the direct and indirect strategies for sorting records mentioned in Section 3.2. We evaluated the following algorithms:

1. SampleSort-direct ... sample sort algorithm of [27] extended for sorting records.

2. SampleSort-indirect ... We form a (key, index) pair for each record and then sort them using sample sort. Finally, a rearrangement is done to put the entire records in sorted order.

3. SRTS-direct ... SRTS algorithm [31] extended for sorting records.

4. SRTS-indirect ... We form a (key, index) pair for each record and then sort them using SRTS. Finally, a rearrangement is done to put the entire records in sorted order.

5. GRS-direct ... GRS algorithm for sorting records [32].

6. GRS-indirect ... We form a (key, index) pair for each record and then sort them using GRS. Finally, a rearrangement is done to put the entire records in sorted order.

We implemented the above 6 multifield sorting algorithms on an Nvidia Tesla C1060 which has 240 cores and 4GB of global memory.² The algorithms are evaluated using randomly generated input sequences. In our experiments, the number of 32-bit fields per record is varied from 2 to 20 (in addition to the key field) and the number of records was 10 million. Also, the algorithms are implemented for both the ByField and Hybrid layouts. For each combination of number of fields and layout type, the time to sort 10 random sequences was obtained. The standard deviation in the observed run times was small and we report only the average times.

² We are grateful to the authors of sample sort [27] for making their code available.
3.3.1 Run Times for ByField Layout

Figures 3-22 through 3-24 show the comparison of SampleSort, GRS and SRTS using the direct and indirect strategies for sorting 10M records with 2 to 9 fields in the ByField layout. During each run, we have used the same set of records while comparing these algorithms. SampleSort-indirect runs 36% faster than SampleSort-direct when sorting 10M records with 2 fields while it runs 66% faster for records with 9 fields.

Figure 3-22. SampleSort-direct and SampleSort-indirect for 10M records (ByField). [Plot of time (ms) vs. number of fields.]

SRTS-indirect runs 37% slower than SRTS-direct sorting 10M records with 2 fields while it runs 27% slower for records with 9 fields.

Figure 3-23. SRTS-direct and SRTS-indirect for 10M records (ByField). [Plot of time (ms) vs. number of fields.]
GRS-indirect runs 27% slower than GRS-direct when sorting 10M records with 2 fields while it runs 33% slower for records with 9 fields.

Figure 3-24. GRS-direct and GRS-indirect for 10M records (ByField). [Plot of time (ms) vs. number of fields.]

Figure 3-25 shows the comparison between the faster version of each of these three algorithms. SRTS-direct is the fastest algorithm to sort records in the ByField layout when records have between 2 and 11 fields. GRS-direct is the fastest algorithm for sorting records with more than 11 fields. SRTS-direct runs 35% faster than GRS-direct when sorting 10M records with 2 fields while GRS-direct runs 38% faster than SRTS-direct when records have 20 fields. SampleSort-indirect is the slowest, running 63% slower than GRS when sorting records with 2 fields and 48% slower when sorting records with 20 fields.

3.3.2 Run Times for Hybrid Layout

Figures 3-26 through 3-28 show the comparison of SampleSort, GRS and SRTS using the direct and indirect strategies for sorting 10M records with 2 to 9 fields in the Hybrid layout. SampleSort-indirect runs 12% faster than SampleSort-direct when sorting 10M records with 2 fields while it runs 72% faster for records with 9 fields. SRTS-indirect runs 38% faster than SRTS-direct sorting 10M records with 2 fields while it runs 79% faster for records with 9 fields.
Figure 3-25. Different sorting algorithms for 10M records (ByField). [Plot of time (ms) vs. number of fields.]

Figure 3-26. SampleSort-direct and SampleSort-indirect for 10M records (Hybrid). [Plot of time (ms) vs. number of fields.]

GRS-indirect runs 13% slower than GRS-direct sorting 10M records with 2 fields while it runs 41% faster for records with 9 fields.

Figure 3-29 shows the comparison between the faster version of each of these three algorithms. SRTS-indirect is the fastest algorithm to sort records in the Hybrid layout. SRTS-indirect runs 27% faster than GRS-indirect and 71% faster than SampleSort-indirect when sorting records with 2 fields while it runs 3% faster than GRS-indirect and 16% faster than SampleSort-indirect when sorting records with 20 fields.
Figure 3-27. SRTS-direct and SRTS-indirect for 10M records (Hybrid). [Plot of time (ms) vs. number of fields.]

Figure 3-28. GRS-direct and GRS-indirect for 10M records (Hybrid). [Plot of time (ms) vs. number of fields.]

3.4 Summary

We have developed a new radix sort algorithm, GRS, to sort numbers and records in a GPU using different layouts. GRS reads and writes records from/to global memory only once per radix sort phase. We have considered two extensions – direct and indirect – of the GPU number sort algorithms SampleSort, SRTS and GRS. We showed how each extension could be implemented optimally on a GPU, maximizing device memory and SM bandwidth. Experiments conducted on the NVIDIA C1060 Tesla indicate that, for the ByField layout, GRS-direct is the fastest sort algorithm for 32-bit keys when records have at least 12 fields. When records have fewer than 12 fields, SRTS-direct is the fastest.
Figure 3-29. Different sorting algorithms for 10M records (Hybrid). [Plot of time (ms) vs. number of fields.]

This happens due to the better memory coalescing achieved by GRS-direct during the last scatter phase compared to SRTS-direct. In the Hybrid layout, SRTS-indirect is the fastest algorithm, although the performance gap between SRTS-indirect and GRS-indirect narrows once records have more fields. At this point, the last global rearrangement phase in both GRS-indirect and SRTS-indirect dominates over the other phases of the algorithms and both GRS-indirect and SRTS-indirect have similar run times.

Intuitively, one would expect the indirect method to be faster than the direct method for large records. This is because the indirect method moves each record only once while the direct method moves records many times (O(log n) times on average for sample sort and 8 times for GRS and SRTS). This intuition is borne out in all cases other than when the records are in ByField layout and radix sort is used. SRTS-direct and GRS-direct move each field 8 times, affording some opportunity for coalescing of device memory accesses. Although SRTS-indirect and GRS-indirect move each field only once, coalescing isn't possible as the fields of a record in the ByField layout are far apart in device memory.
CHAPTER 4
MOTIF SEARCH

4.1 Introduction

Motifs are approximate patterns found in gene sequences. The discovery of motifs helps in finding transcription factor-binding sites, which are useful for understanding various gene functions, drug design, and so on. Many versions of the motif search problem have been studied extensively. We focus on Planted Motif Search (PMS), also known as (l, d) motif search. PMS takes n sequences and two numbers l and d as input. The problem is to find all sequences M of length l (also called l-mers) that appear in every input sequence with at most d mismatches. More precisely, for an l-mer M, we define its d-neighborhood to be all l-mers that differ from M in at most d positions. M is a motif for the given set of n strings iff each of these n strings has a substring that is in the d-neighborhood of M.

As an example, consider three input sequences CATACGT, ACAAGTC and AATCGTG. Suppose that l = 3 and d = 1. The 3-mer CAT is one of the motifs of the given 3 input sequences as CAT appears in the first sequence at the first position with 0 mismatches, at the second position in the second sequence with 1 mismatch, and at the fourth position in the third sequence with 1 mismatch. Alternatively, CAT is a motif of the given 3 input sequences because the substrings CAT of sequence 1, CAA of sequence 2, and CGT of sequence 3 are in the 1-neighborhood of CAT.
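This definition can be made concrete with a short, self-contained C++ check (our own illustrative helper, not part of any PMS code): isMotif returns true iff every input string contains a length-l substring within Hamming distance d of the candidate l-mer.

  #include <string>
  #include <vector>

  // Hamming distance between two equal-length strings.
  static int hamming(const std::string& a, const std::string& b) {
    int dist = 0;
    for (size_t i = 0; i < a.size(); i++) dist += (a[i] != b[i]);
    return dist;
  }

  // True iff every string in S has a length-l substring within distance d of M.
  bool isMotif(const std::string& M, const std::vector<std::string>& S, int d) {
    const size_t l = M.size();
    for (const std::string& s : S) {
      bool close = false;
      for (size_t p = 0; p + l <= s.size() && !close; p++)
        close = (hamming(s.substr(p, l), M) <= d);
      if (!close) return false;
    }
    return true;
  }

  // Example from the text:
  // isMotif("CAT", {"CATACGT", "ACAAGTC", "AATCGTG"}, 1) evaluates to true.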


PMS, which is well studied in the literature, is known to be NP-hard [36]. Known algorithms for PMS are divided into exact and approximation algorithms depending on whether or not they are guaranteed to always produce every motif. Approximation algorithms for PMS are generally faster than exact algorithms, which have an exponential worst-case complexity. MEME, which is one of the popular approximation algorithms for finding motifs [37], uses the expectation maximization technique to output a set of probabilistic models for each motif indicating the probability of appearance of different characters in each position of the motif. Pevzner and Sze [38] proposed the WINNOWER algorithm, which maps PMS to a problem of finding large cliques in a graph where each node is an l-mer and two nodes are connected iff the number of mismatches between them is less than 2d. Buhler and Tompa [39] used random projections by taking only k positions out of the entire l-mer to group l-mers together based on the similarity of the projections. Groups that have a large number of l-mers have a high probability of having the desired motif as well. Price et al. [40] proposed an algorithm based on performing a local search on the d-neighborhood of some of the l-mers from the input sequences. GibbsDNA [41] employs Gibbs sampling while CONSENSUS [42] uses statistical measures to align sequences and finds potential motifs from the alignment. Two other examples of approximation algorithms for the motif search problem are MULTIPROFILER [43] and ProfileBranching [40].

While exact algorithms for the motif problem take longer to complete than approximation algorithms, they guarantee to find all motifs. Due to their exponential complexity it is impractical to run these algorithms on very large instances. But, for many instances of practical interest, they are able to run within a reasonable amount of time. Many of these algorithms use the suffix tree or other tree data structures to progressively generate motifs one character at a time. MITRA [44] uses a data structure called a Mismatch Tree while SPELLER [45], SMILE [46], RISO [47], and RISOTTO [48] use suffix trees. RISOTTO [48] is the fastest suffix tree based algorithm so far. CENSUS [49] makes a trie of all l-mers from each of the input sequences. The nodes in the trie then store the Hamming distance (i.e., number of mismatches) from the motif as it is being generated, potentially pruning many branches of the trie. Voting [50] uses hashing but needs space to store all possible strings of length l. Hence the space required by this method is too large for large instances. Kauksa and Pavlovic [51] have proposed an algorithm to generate a superset of motifs (i.e., a set of motif stems).
Since they do not provide any data on how difficult it may be to extract the true motifs from this superset, one cannot assess the value of this algorithm. The recently proposed PMS series of algorithms are both fast and relatively economical on space. PMS1, PMS2 and PMS3 [52] are based on radix sorting and then efficiently intersecting the d-neighborhoods of all l-mers in the input sequences (every length-l substring of a string is an l-mer of that string). PMS4 [53] proposes a generic speed-up technique to improve the run time of any exact algorithm. PMSP [54] computes the d-neighborhood of all l-mers of the first input sequence and then performs an exhaustive search with the remaining input sequences to determine which of the l-mers in the computed d-neighborhood are motifs. PMSPrune [54] is a branch-and-bound algorithm, which uses dynamic programming to improve upon PMSP. Pampa [55] further improves PMSPrune by using wildcard characters to find approximate motif patterns and then performing an exhaustive search within possible mappings of the pattern to find the actual motifs. PMS5 [56] is the most recent algorithm in the series. This algorithm, which is described in detail in Section 4.2, efficiently computes the intersection of the d-neighborhoods of l-mers without generating the entire neighborhood. PMS5 is faster than Pampa for challenge instances (15,5) and larger [56]. Although no direct comparison between PMS5 and the stemming algorithm of [51] has been reported in the literature, both have been compared to PMSPrune [51, 56]. From these comparisons, we infer that stemming is faster than PMS5 for challenge instances of size up to (19,7) and slower for larger challenge instances. For example, the run time ratio stemming/PMS5 is approximately 0.12 for (13,4) instances and approximately 0.91 for (19,7) instances. Although run times for stemming have not been reported for larger challenge instances, we can project that PMS5 is faster for larger instances.

We propose a motif algorithm PMS6 [57] that is faster than PMS5. Its relative speed comes from a faster algorithm for d-neighborhood generation and intersection. In Section 4.2 we introduce notations and definitions and also describe the PMS5 algorithm in detail, and in Section 4.3 we describe our proposed algorithm for d-neighborhood intersection.
The performance of PMS5 and PMS6 is compared experimentally in Section 4.4.

4.2 PMS5

We use the same notations and definitions as in [56].

4.2.1 Notations and Definitions

Definition 1. A string s = s[1] ... s[l] of length l is called an l-mer.

Definition 2. The Hamming distance, d_H(s, t), between two strings s and t of equal length is defined to be the number of places where they differ.

Definition 3. The d-neighborhood, B_d(s), of a string s is defined to be {x | d_H(x, s) ≤ d}. Note that the number of strings, N(l, d), in B_d(s) is uniquely determined by the length l of s and d. It is easy to see that N(l, d) = |B_d(s)| = Σ_{i=0}^{d} C(l, i) (|Σ| − 1)^i, where Σ is the alphabet from which the characters in our strings are drawn.

Definition 4. The notation r ∈_l s denotes a substring r of length l of string s. We define the Hamming distance between a length-l string x and a string s of any length ≥ l to be d_H(x, s) = min_{r ∈_l s} d_H(x, r).

Definition 5. Given a set S = {s_1, ..., s_n} of strings and a string x of length l, we define d_H(x, S) = max_{1 ≤ i ≤ n} d_H(x, s_i).

We note that x is a motif of a set S of strings if and only if d_H(x, S) ≤ d. We denote the set of (l, d) motifs of set S by M_{l,d}(S).
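The neighborhood-size formula above is easy to evaluate directly; the following small helper is our own illustrative C++ (not part of PMS5/PMS6). For example, with the DNA alphabet (|Σ| = 4), N(13, 4) = Σ_{i=0}^{4} C(13, i)·3^i = 66379.

  #include <cstdint>

  // Binomial coefficient C(n, k); exact for the small arguments used here.
  static uint64_t choose(int n, int k) {
    if (k < 0 || k > n) return 0;
    uint64_t r = 1;
    for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
    return r;
  }

  // N(l, d) = sum_{i=0}^{d} C(l, i) * (sigma - 1)^i: the size of B_d(s) for a
  // length-l string over an alphabet of size sigma.
  uint64_t neighborhoodSize(int l, int d, int sigma) {
    uint64_t total = 0, pw = 1;            // pw holds (sigma - 1)^i
    for (int i = 0; i <= d; i++) {
      total += choose(l, i) * pw;
      pw *= (sigma - 1);
    }
    return total;
  }
  // Example: neighborhoodSize(13, 4, 4) == 66379.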


4.2.2 PMS5 Overview

PMS5, which is presently the fastest exact algorithm to compute M_{l,d}(S), was proposed by Dinh, Rajasekaran, and Kundeti [56]. They observed that M_{l,d}(S) could be determined by computing the intersection of the d-neighborhoods of the l-mers in one string of S with that of l-mers from the remaining strings. That is, if S′ = S \ {s_1}, then M_{l,d}(S) = ∪_{x ∈_l s_1} [B_d(x) ∩ M_{l,d}(S′)]. For simplicity, assume that |S| = 2p for some integer p (the case when |S| is odd is a simple extension). We may partition the strings in S′ into pairs S_1, ..., S_p, where S_k = {s_{2k}, s_{2k+1}}. The following observations can be made: B_d(x) ∩ M_{l,d}(S′) = ∩_{1 ≤ k ≤ p} [B_d(x) ∩ M_{l,d}(S_k)] and B_d(x) ∩ M_{l,d}(S_k) = ∪_{y ∈_l s_{2k}, z ∈_l s_{2k+1}} [B_d(x) ∩ B_d(y) ∩ B_d(z)]. Hence, the motifs of S may be found by making a series of calls to a function that computes the intersection of d-neighborhoods, B_d(x, y, z) = B_d(x) ∩ B_d(y) ∩ B_d(z), as shown in Figure 4-1.

  PMS5(S, l, d)
    for each x ∈_l s_1 {
      for k = 1 to ⌊(n−1)/2⌋ {
        Q ← ∅
        for each y ∈_l s_{2k} and z ∈_l s_{2k+1}
          Q ← Q ∪ B_d(x, y, z)
        if k = 1
          Q′ = Q
        else
          Q′ = Q′ ∩ Q
        if |Q′| < threshold break;
      }
      M_{l,d}(S) ← ∅
      for each r ∈ Q′
        if d_H(r, S) ≤ d
          Add r to M_{l,d}(S)
    }
Figure 4-1. PMS5 [56]

The time complexity of PMS5 is O(n m^3 l N(l, d)), where m is the length of each input string s_i [56].

4.2.3 Computing B_d(x, y, z)

The basic idea in the algorithm of [56] to compute B_d(x, y, z) is to generate B_d(x) one l-mer at a time and include it in B_d(x, y, z) only if it is in B_d(y) ∩ B_d(z). To facilitate this, B_d(x) is represented as a tree T_d(x). The root, which is at level 0, of T_d(x) is x. The nodes at the next level represent l-mers that are at a Hamming distance of 1 from x. Hence, if the length of x is m and the alphabet is Σ = {0, 1}, the root will have m children and the i-th child will represent the l-mer x′ that differs from x only at position i. That is, x′[i] = 0 if x[i] = 1 and x′[i] = 1 if x[i] = 0.
It is easy to see that l-mers at level k of T_d(x) have a Hamming distance of k from x and the depth of T_d(x) is d. Figure 4-2 shows the tree T_2(1010) with Σ = {0, 1}.

Figure 4-2. 2-Neighborhood tree [56]

The tree T_d(x) is created dynamically in depth-first manner. An l-mer t in a node (t, p) (t is the l-mer represented by the node and p is the position at which this l-mer differs from the l-mer in the parent node) is added to B_d(x, y, z) if t is in B_d(y) ∩ B_d(z). As we know the current distance of t from each of x, y, and z, we can check if there is possibly an l-mer in the as yet ungenerated subtree of (t, p) that is a distance ≤ d from each of x, y, and z. The subtree rooted at (t, p) is pruned if there is no such l-mer. The criterion used in [56] for pruning is as follows. Let t_1 = t[1] ... t[p], t_2 = t[p+1] ... t[l], x_1 = x[1] ... x[p], and x_2 = x[p+1] ... x[l]; y_1, y_2, z_1, z_2 are defined similarly. Due to the way T_d(x) is generated, x_2 = t_2. Also, for any descendent t′ of t, the first p characters are equal. So, t′ can be written as t′ = t_1 w, where w is an (l − p)-mer. For t′ to be in B_d(y) ∩ B_d(z), w has to satisfy the following Hamming distance constraints [56]:

1. d_H(x, t′) = d_H(x_1, t_1) + d_H(x_2, w) ≤ d

2. d_H(y, t′) = d_H(y_1, t_1) + d_H(y_2, w) ≤ d

3. d_H(z, t′) = d_H(z_1, t_1) + d_H(z_2, w) ≤ d

Note that each position i of x_2, y_2, and z_2 is of one of five different types [56]:

Type 1 (aaa): x_2[i] = y_2[i] = z_2[i]

Type 2 (aab): x_2[i] = y_2[i] ≠ z_2[i]

Type 3 (aba): x_2[i] = z_2[i] ≠ y_2[i]

Type 4 (abb): x_2[i] ≠ y_2[i] = z_2[i]
Type 5 (abc): x_2[i] ≠ y_2[i], x_2[i] ≠ z_2[i], y_2[i] ≠ z_2[i]

Let n_1, n_2, n_3, n_4 and n_5 denote the number of positions of Type 1, 2, 3, 4 and 5, respectively. Then, n_1 + n_2 + n_3 + n_4 + n_5 = (l − p). For an (l − p)-mer w, the following variables may be defined [56]:

1. N_{1,a} is the number of positions i of Type 1 for which w[i] = x_2[i]. By definition, N_{1,a} ≤ n_1.

2. N_{2,a} (N_{2,b}) is the number of positions i of Type 2 such that w[i] = x_2[i] (w[i] = z_2[i]). Also, N_{2,a} + N_{2,b} ≤ n_2.

3. N_{3,a} (N_{3,b}) is defined to be the number of positions i of Type 3 such that w[i] = x_2[i] (w[i] = y_2[i]). By definition, N_{3,a} + N_{3,b} ≤ n_3.

4. N_{4,a} (N_{4,b}) is the number of positions i of Type 4 for which w[i] = y_2[i] (w[i] = x_2[i]). By definition, N_{4,a} + N_{4,b} ≤ n_4.

5. N_{5,a} (N_{5,b}, N_{5,c}) is the number of positions i of Type 5 such that w[i] = x_2[i] (w[i] = y_2[i], w[i] = z_2[i]). Also, N_{5,a} + N_{5,b} + N_{5,c} ≤ n_5.

The number of mismatches between x_2 and w in Type 1 (2, 3, 4, 5) positions is n_1 − N_{1,a} (n_2 − N_{2,a}, n_3 − N_{3,a}, n_4 − N_{4,b}, n_5 − N_{5,a}). Hence, d_H(x_2, w) = (n_1 − N_{1,a}) + (n_2 − N_{2,a}) + (n_3 − N_{3,a}) + (n_4 − N_{4,b}) + (n_5 − N_{5,a}). Similarly, d_H(y_2, w) = (n_1 − N_{1,a}) + (n_2 − N_{2,a}) + (n_3 − N_{3,b}) + (n_4 − N_{4,a}) + (n_5 − N_{5,b}) and d_H(z_2, w) = (n_1 − N_{1,a}) + (n_2 − N_{2,b}) + (n_3 − N_{3,a}) + (n_4 − N_{4,a}) + (n_5 − N_{5,c}).

The Hamming distance constraints given above may now be expressed as the following integer linear program (ILP) [56]:

1. (n_1 − N_{1,a}) + (n_2 − N_{2,a}) + (n_3 − N_{3,a}) + (n_4 − N_{4,b}) + (n_5 − N_{5,a}) ≤ d − d_H(x_1, t_1)

2. (n_1 − N_{1,a}) + (n_2 − N_{2,a}) + (n_3 − N_{3,b}) + (n_4 − N_{4,a}) + (n_5 − N_{5,b}) ≤ d − d_H(y_1, t_1)

3. (n_1 − N_{1,a}) + (n_2 − N_{2,b}) + (n_3 − N_{3,a}) + (n_4 − N_{4,a}) + (n_5 − N_{5,c}) ≤ d − d_H(z_1, t_1)

4. N_{1,a} ≤ n_1

5. N_{2,a} + N_{2,b} ≤ n_2

6. N_{3,a} + N_{3,b} ≤ n_3

7. N_{4,a} + N_{4,b} ≤ n_4

8. N_{5,a} + N_{5,b} + N_{5,c} ≤ n_5
9. All variables are non-negative integers.

When traversing T_d(x) in depth-first fashion, we may prune the tree at every node (t, p) for which the above ILP has no solution (note that d_H(x_1, t_1), d_H(y_1, t_1), d_H(z_1, t_1) and n_1, ..., n_5 are readily determined from x, y, z, t, and p).

As the possible values for n_1, ..., n_5 and d_H(x_1, t_1), d_H(y_1, t_1), d_H(z_1, t_1) are in [0, l] and [0, d], respectively, only (l+1)^5 (d+1)^3 distinct ILPs are possible. For a given pair (l, d), we may solve these distinct ILPs in a preprocessing step and store, in an 8-dimensional table, whether each has a solution. Once this preprocessing has been done, we can use the results stored in the 8-dimensional table to find motifs for many (l, d) instances.

Figure 4-3 gives the pseudocode for the algorithm to compute B_d(x, y, z). Its time complexity is O(l + d |B_d(x, y, z)|) [56].

  Bd(x, y, z)
    Compute d_H(x, y) and d_H(x, z)
    Determine n_1, n_2, n_3, n_4, n_5
    for each p = 0, ..., (l − 1)
      Do a depth-first traversal of T_d(x). At each node (t, p) do the following:
        Incrementally compute d_H(x, t), d_H(y, t) and d_H(z, t)
        Incrementally compute d_H(x_1, t_1), d_H(y_1, t_1) and d_H(z_1, t_1) from its parent.
        If d_H(x, t) ≤ d, d_H(y, t) ≤ d, and d_H(z, t) ≤ d
          B_d(x, y, z) ← B_d(x, y, z) ∪ t
        Look up the ILP table with parameters n_1, ..., n_5, d_H(x, t), d_H(y, t), d_H(z, t)
        If d_H(x, t) < d and the ILP has a solution,
          explore the children of (t, p); otherwise prune at (t, p)
Figure 4-3. Computing B_d(x, y, z) [56]
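The five position types (and hence the parameters n_1, ..., n_5 used throughout this chapter) can be computed with a short helper; the following is our own illustrative C++, equivalent to the classification defined in this section for the case p = 0 (so x_2 = x, y_2 = y, z_2 = z).

  #include <array>
  #include <string>

  // Count the Type 1..Type 5 positions of the triple (x, y, z); counts[0..4]
  // correspond to n1..n5.  The three strings are assumed to have equal length.
  std::array<int, 5> countPositionTypes(const std::string& x,
                                        const std::string& y,
                                        const std::string& z) {
    std::array<int, 5> n{0, 0, 0, 0, 0};
    for (size_t i = 0; i < x.size(); i++) {
      if (x[i] == y[i] && y[i] == z[i])      n[0]++;  // Type 1 (aaa)
      else if (x[i] == y[i])                 n[1]++;  // Type 2 (aab): x = y != z
      else if (x[i] == z[i])                 n[2]++;  // Type 3 (aba): x = z != y
      else if (y[i] == z[i])                 n[3]++;  // Type 4 (abb): y = z != x
      else                                   n[4]++;  // Type 5 (abc): all distinct
    }
    return n;
  }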


4.2.4 Intersection of Qs

PMS5 uses several novel techniques to compute a superset of the intersection Q′ of the Qs. Although a superset of Q′ (Figure 4-1) is computed, PMS5 determines the exact set of motifs because the last loop of the algorithm (Figure 4-1) verifies that each member of the final Q′ is, in fact, a motif. One of the novel techniques, applied for challenge instances of size (19,7) and larger, is a Bloom filter [20] with 2 hash functions. The l-mer to be hashed uses ⌈l/4⌉ bytes (recall that the alphabet size is 4). The first hash function is bytes 0-3 of the l-mer and the second is bytes 1-4.

4.3 PMS6

4.3.1 Overview

PMS6 differs from PMS5 only in the way it determines the motifs corresponding to an l-mer x of s_1 and the strings s_{2k} and s_{2k+1} [57]. Recall that PMS5 does this by computing B_d(x, y, z) independently for every pair (y, z) such that y ∈_l s_{2k} and z ∈_l s_{2k+1} (Figure 4-1); the computation of B_d(x, y, z) is done by performing a depth-first search of the tree T_d(x) using an ILP to prune subtrees. In PMS6 we determine the motifs corresponding to an l-mer x of s_1 and the strings s_{2k} and s_{2k+1} using the following 2-step process:

Form Equivalence Classes: In this step, the triples (x, y, z) of l-mers such that y ∈_l s_{2k} and z ∈_l s_{2k+1} are partitioned into classes C(n_1, ..., n_5). For this partitioning, for each triple (x, y, z), we compute n_1, ..., n_5 using the definitions of Section 4.2.3 with p = 0, x_1 = y_1 = z_1 = ε, x_2 = x, y_2 = y, and z_2 = z. Each triple is placed in the class corresponding to its computed n_1, ..., n_5 values.

Compute B_d for all triples by classes: For each class C(n_1, ..., n_5), the union, B_d(C), of B_d(x, y, z) over all triples in that class is computed. We note that the union of all B_d(C)s is the set of all motifs of x, s_{2k}, and s_{2k+1}.

Figure 4-4 gives the pseudocode for PMS6.

4.3.2 Computing B_d(C(n_1, ..., n_5))

Let (x, y, z) be a triple in C(n_1, ..., n_5) and let w be an l-mer in B_d(x, y, z). Let p = 0 and let N_{1,a}, N_{2,a}, N_{2,b}, N_{3,a}, N_{3,b}, N_{4,a}, N_{4,b}, N_{5,a}, N_{5,b}, N_{5,c} be as in Section 4.2.3. We observe that the 10-tuple (N_{1,a}, ..., N_{5,c}) satisfies the ILP of Section 4.2.3 with d_H(x_1, t_1) = d_H(y_1, t_1) = d_H(z_1, t_1) = 0. In fact, every l-mer of B_d(x, y, z) has a 10-tuple (N_{1,a}, ..., N_{5,c}) that is a solution to this ILP with d_H(x_1, t_1) = d_H(y_1, t_1) = d_H(z_1, t_1) = 0.
  PMS6(S, l, d)
    for each x ∈_l s_1 {
      for k = 1 to ⌊(n−1)/2⌋ {
        Q ← ∅
        Classes ← ∅
        for each y ∈_l s_{2k} and z ∈_l s_{2k+1} {
          Compute n_1, ..., n_5 for (x, y, z)
          if C(n_1, ..., n_5) ∉ Classes {
            Create the class C(n_1, ..., n_5) with (x, y, z). Add C(n_1, ..., n_5) to Classes
          }
          else add (x, y, z) to class C(n_1, ..., n_5)
        }
        for each class C(n_1, ..., n_5) in Classes
          Q ← Q ∪ B_d(C(n_1, ..., n_5))
        if k = 1 then Q′ = Q
        else Q′ = Q′ ∩ Q
        if |Q′| < threshold break;
      }
      for each r ∈ Q′
        if d_H(r, S) ≤ d
          Add r to M_{l,d}(S)
    }
Figure 4-4. PMS6

Given a 10-tuple solution to the ILP, we may generate all l-mers w in B_d(x, y, z) as follows:

1. Each of the l positions in w is classified as being of Type 1, 2, 3, 4, or 5 depending on the classification of the corresponding position in the l-mers x, y, and z (see Section 4.2.3).

2. Select N_{1,a} of the n_1 Type 1 positions of w. If i is a selected position, then, from the definition of a Type 1 position, it follows that x[i] = y[i] = z[i]. Also, from the definition of N_{1,a}, this many Type 1 positions have the same character in w as in x, y, and z. So, for each selected Type 1 position i, we set w[i] = x[i]. The remaining Type 1 positions of w must have a character different from x[i] (and hence from y[i] and z[i]). So, for a 4-character alphabet there are 3 choices for each of the non-selected Type 1 positions of w. As there are C(n_1, N_{1,a}) ways to select N_{1,a} positions out of n_1 positions, we have 3^q C(n_1, N_{1,a}) different ways to populate the n_1 Type 1 positions of w, where q = n_1 − N_{1,a}.
3. Select N_{2,a} positions I and N_{2,b} different positions J from the n_2 Type 2 positions of w. For each i ∈ I, set w[i] = x[i] and for each j ∈ J, set w[j] = z[j]. Each of the remaining n_2 − N_{2,a} − N_{2,b} Type 2 positions of w is set to a character different from that in x, y, and z. So, if k is one of these remaining Type 2 positions, x[k] = y[k] ≠ z[k]. We set w[k] to one of the 2 characters of our 4-letter alphabet that are different from x[k] and z[k]. Hence, we have 2^r C(n_2, N_{2,a}) C(n_2 − N_{2,a}, N_{2,b}) ways to populate the n_2 Type 2 positions in w, where r = n_2 − N_{2,a} − N_{2,b}.

4. Type 3 and Type 4 positions are populated using a strategy similar to that used for Type 2 positions. The number of ways to populate the Type 3 positions is 2^s C(n_3, N_{3,a}) C(n_3 − N_{3,a}, N_{3,b}), where s = n_3 − N_{3,a} − N_{3,b}, and that for the Type 4 positions is 2^u C(n_4, N_{4,a}) C(n_4 − N_{4,a}, N_{4,b}), where u = n_4 − N_{4,a} − N_{4,b}.

5. To populate the Type 5 positions of w, we must select the N_{5,a} Type 5 positions, k, that will be set to x[k], the N_{5,b} Type 5 positions, k, that will be set to y[k], and the N_{5,c} Type 5 positions, k, that will be set to z[k]. The remaining n_5 − N_{5,a} − N_{5,b} − N_{5,c} Type 5 positions, k, of w are set to the single character of the 4-letter alphabet that differs from x[k], y[k], and z[k]. We see that the number of ways to populate the n_5 Type 5 positions is C(n_5, N_{5,a}) C(n_5 − N_{5,a}, N_{5,b}) C(n_5 − N_{5,a} − N_{5,b}, N_{5,c}).

The preceding strategy to generate B_d(x, y, z) generates 3^q 2^r 2^s 2^u C(n_1, N_{1,a}) C(n_2, N_{2,a}) C(n_2 − N_{2,a}, N_{2,b}) C(n_3, N_{3,a}) C(n_3 − N_{3,a}, N_{3,b}) C(n_4, N_{4,a}) C(n_4 − N_{4,a}, N_{4,b}) C(n_5, N_{5,a}) C(n_5 − N_{5,a}, N_{5,b}) C(n_5 − N_{5,a} − N_{5,b}, N_{5,c}) l-mers w for each 10-tuple (N_{1,a}, ..., N_{5,c}). While every generated l-mer is in B_d(x, y, z), some l-mers may be the same. Computational efficiency is obtained by computing B_d(x, y, z) for all (x, y, z) in the same class C(n_1, ..., n_5) concurrently, sharing the loop overheads, as the same loops are needed for all (x, y, z) in a class. Figure 4-5 gives the pseudocode for our algorithm to compute B_d(x, y, z) by classes.

As in the case of PMS5, run time may be reduced by precomputing data that do not depend on the string set S. So, for a given pair (l, d), there are O((l+1)^5) 5-tuples (n_1, ..., n_5). For each of the 5-tuples, we can precompute all 10-tuples (N_{1,a}, ..., N_{5,c}) that are solutions to the ILP of Section 4.2.3 with d_H(x_1, t_1) = d_H(y_1, t_1) = d_H(z_1, t_1) = 0. The 10-tuple solutions of the ILP are found using an exhaustive search. For each 10-tuple, we can precompute all combinations (i.e., selections of positions in w). The precomputed 10-tuple solutions for each 5-tuple are stored in a table with (l+1)^5 entries indexed by [n_1, ..., n_5], and the precomputed combinations for the 10-tuple solutions are stored in a separate table.
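For reference, the per-solution count in the product above can be evaluated directly; the following helper is our own illustrative C++ (not part of the PMS6 code) and simply multiplies out the factors for one 10-tuple (N_{1,a}, ..., N_{5,c}) of a class with parameters (n_1, ..., n_5).

  #include <cstdint>

  // Exact binomial coefficient for the small arguments that occur here.
  static uint64_t choose(int n, int k) {
    if (k < 0 || k > n) return 0;
    uint64_t r = 1;
    for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
    return r;
  }

  // Number of l-mers w generated for one ILP solution, per the product formula above.
  uint64_t countForSolution(int n1, int n2, int n3, int n4, int n5,
                            int N1a, int N2a, int N2b, int N3a, int N3b,
                            int N4a, int N4b, int N5a, int N5b, int N5c) {
    int q = n1 - N1a, r = n2 - N2a - N2b, s = n3 - N3a - N3b, u = n4 - N4a - N4b;
    uint64_t cnt = 1;
    for (int i = 0; i < q; i++) cnt *= 3;             // 3^q choices for free Type 1 positions
    for (int i = 0; i < r + s + u; i++) cnt *= 2;     // 2^r * 2^s * 2^u for Types 2-4
    cnt *= choose(n1, N1a);
    cnt *= choose(n2, N2a) * choose(n2 - N2a, N2b);
    cnt *= choose(n3, N3a) * choose(n3 - N3a, N3b);
    cnt *= choose(n4, N4a) * choose(n4 - N4a, N4b);
    cnt *= choose(n5, N5a) * choose(n5 - N5a, N5b) * choose(n5 - N5a - N5b, N5c);
    return cnt;
  }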


By storing the combinations in a separate table, we can ensure that each is stored only once even though the same combination may be needed by many 10-tuple solutions.

  ClassBd(C(n_1, n_2, n_3, n_4, n_5))
    B_d ← ∅
    Find all ILP solutions with parameters n_1, n_2, n_3, n_4, n_5
    for each solution (N_{1,a}, ..., N_{5,c}) {
      curComb ← first combination for this solution
      for i = 0 to (#combinations) {
        for each triplet (x, y, z) in C(n_1, ..., n_5) {
          Generate w's for curComb
          Add these w's to B_d
        }
        curComb ← next combination in Gray code order.
      }
    }
    return B_d
Figure 4-5. Computing B_d(n_1, ..., n_5)

We store precomputed combinations as vectors. For example, a Type 1 combination for n_1 = 3 and N_{1,a} = 1 could be stored as {010}, indicating that the first and third Type 1 positions of w have a character different from what x, y, and z have in that position while the character in the second Type 1 position is the same as in the corresponding position of x, y, and z. A Type 2 combination for n_2 = 4, N_{2,a} = 2 and N_{2,b} = 1 could be stored as {3011}, indicating that the character in the first Type 2 position of w comes from the third l-mer, z, of the triplet, the second Type 2 position of w has a character that is different from any of the characters in the same position of x and z, and the third and fourth Type 2 positions of w have the same character as in the corresponding positions of x. Combinations for the remaining position types are stored similarly. As indicated by our pseudocode of Figure 4-5, combinations are considered in Gray code order so that only two positions in the l-mer being generated change from the previously generated l-mer. Consequently, we need less space to store the combinations in the combination table and less time to generate the new l-mer.
An example of a sequence of combinations in Gray code order for Type 2 positions with n_2 = 4, N_{2,a} = 1, N_{2,b} = 1 is {0012, 0021, 0120, 0102, 0201, 0210, 1200, 1002, 1020, 2010, 2001, 2100}. Note that in going from one combination to the next only two positions are swapped.

4.3.3 The Data Structure Q

We now describe the data structure Q that is used by PMS6. This is a reasonably simple data structure that has efficient mechanisms for storing and intersection. In the PMS6 implementation of [57], there are three arrays in Q: a character array, strs[], for storing all l-mers; an array of pointers, bucketPointers[], which point to locations in the character array; and a bit array, markBuffer[], used for intersection. There is also a parameter bucketIndex which determines how many characters of the l-mers are used for indexing into the bucketPointers[] array. As there are 4 possibilities for a character, for p characters bucketIndex can vary from 0 to (4^p − 1). The number of characters, p, to be used for indexing into bucketPointers[] is set when Q is initialized. During the first iteration of PMS6, for k = 1, l-mers in B_d(C) are stored in strs[]. After all B_d(C)s are computed, strs[] is sorted in-place using Most Significant Digit radix sort. After sorting, duplicate l-mers are adjacent to each other. Also, l-mers that have the same first p characters, and hence are in the same bucket, are adjacent to each other as well in strs[]. By a single scan through strs[], duplicates are removed and the pointers in bucketPointers[] are set to point to the different buckets in strs[]. During the remaining iterations, for k ≥ 2, all B_d(C)s generated are to be intersected with Q. This is done using the bit array markBuffer. First, while computing B_d(C), each l-mer is searched for in Q. The search proceeds by first mapping the first p characters of the l-mer to the corresponding bucket and then doing a binary search inside strs[] within the region pointed to by the bucket pointer. If the l-mer is found, its position is set in markBuffer[]. Once all l-mers are marked in markBuffer[], strs[] is compacted by removing the unmarked l-mers by a single scan through the array. The bucket pointers are also updated during this scan.
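A minimal sketch of Q as just described may help fix the roles of strs[], bucketPointers[] and markBuffer[]. The C++ below is our own simplified version: it uses std::string keys and std::sort in place of the in-place MSD radix sort, a plain binary search in place of the bucketed search, and it omits the Bloom-filter path.

  #include <algorithm>
  #include <string>
  #include <vector>

  // Simplified stand-in for the PMS6 data structure Q.
  struct Q {
    std::vector<std::string> strs;      // sorted, duplicate-free l-mers
    std::vector<size_t> bucketPointers; // start of each bucket of equal first-p-character l-mers
    std::vector<bool> markBuffer;       // marks l-mers found during an intersection pass
    int p = 2;                          // characters used for bucketing

    // k = 1: store the generated l-mers, sort, remove duplicates, set bucket pointers.
    void build(std::vector<std::string>& generated) {
      strs.swap(generated);
      std::sort(strs.begin(), strs.end());            // MSD radix sort in the real code
      strs.erase(std::unique(strs.begin(), strs.end()), strs.end());
      bucketPointers.assign(1, 0);
      for (size_t i = 1; i < strs.size(); i++)
        if (strs[i].compare(0, p, strs[i - 1], 0, p) != 0) bucketPointers.push_back(i);
      markBuffer.assign(strs.size(), false);
    }
    // k >= 2: mark an l-mer if it is present in Q.
    void mark(const std::string& lmer) {
      auto it = std::lower_bound(strs.begin(), strs.end(), lmer);
      if (it != strs.end() && *it == lmer) markBuffer[it - strs.begin()] = true;
    }
    // After all B_d(C)s of an iteration are processed: keep only the marked l-mers.
    void compact() {
      std::vector<std::string> kept;
      for (size_t i = 0; i < strs.size(); i++)
        if (markBuffer[i]) kept.push_back(strs[i]);
      strs.swap(kept);
      markBuffer.assign(strs.size(), false);
      // bucketPointers would be rebuilt here, as in build()
    }
  };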


For larger instances, the size of B_d(C) is such that we don't have sufficient memory to store B_d(C) in Q. For these larger instances, in the k = 1 iteration, we initialize a Bloom filter using the l-mers in B_d(C) rather than storing these l-mers in Q. During the next iteration (k = 2), we store in Q only those l-mers that pass the Bloom filter test (i.e., the Bloom filter's response is "Maybe"). For the remaining iterations, we do the intersection as for the case of small instances. Using a Bloom filter in this way reduces the number of l-mers to be stored in Q at the expense of not doing the intersection for the second iteration. Hence, at the end of the second iteration, we have a superset of the Q′ (Figure 4-4) we would have had using the strategy for small instances. Experimentally, it was determined that the Bloom filter strategy improves performance for challenging instances of size (19,7) and larger. As in [56], PMS6 uses a partitioned Bloom filter of total size 1GB. From Bloom filter theory [58] we can determine the number of hash functions to use to minimize the filter error. However, we need to minimize the run time rather than the filter error. Experimentally, [57] determined that the best performance was achieved using two hash functions, with the first one being bytes 0-3 of the key and the second being the product of bytes 0-3 and the remaining bytes (byte 4 for (19,7) instances and bytes 4 and 5 for (21,8) and (23,9) instances).

4.3.4 Complexity

The asymptotic time complexity of PMS6 is the same as that of PMS5, O(n m^3 l N(l, d)), where m is the length of each input string s_j.

4.4 Experimental Results

We evaluate the performance of PMS6 on the challenge instances described in [56]. For each (l, d) that characterizes a challenge instance, we generated 20 random strings of length 600 each. Next, a random motif of length l was generated and planted at random positions in each of the 20 strings. The planted motif was then randomly mutated in exactly d randomly chosen positions. For each (l, d) value up to (19,7), we generated 20 instances and for larger (l, d) values, we generated 5 instances.
The average run times for each (l, d) value are reported in this section. Since the variation in run times across instances was rather small, we do not report the standard deviation. Even though we test our algorithm using only synthetic data sets, several authors (e.g., [56]) have shown that PMS codes that work well on the kind of synthetic data used by us also work well on real data. As PMS5 is the fastest algorithm [56] for large-instance motif search, we compare the run times of PMS6 with PMS5. For PMS5, we used the C++ code provided by the authors of [56]. PMS6 was coded by us in C++.

The PMS5 and PMS6 codes were compiled using the -O3 flag in gcc under 64-bit Linux and run on an Intel Core i7 system running at 3.3GHz. Our experiments were limited to challenge instances (l, d) [56]. For each challenge instance, multiple datasets of 20 randomly generated strings, each of length 600 characters, were generated. For each (l, d), the average time is reported here. The standard deviation in the run times is very small.

4.4.1 Preprocessing

The preprocessing times of PMS5 and PMS6 for all challenge instances are given in Table 4-1. The space required to save the preprocessed data is given in Table 4-2. The time taken by PMS5 to build its ILP tables for l = 23 and d = 9 is 883 seconds while PMS6 takes 25.5 seconds to build its ILP tables. For the (23,9) case, PMS5 uses roughly 300MB to store its ILP results while PMS6 uses roughly 350MB to store its ILP solutions and combinations. Although, for large instances, PMS6 needs more space for its ILP tables than does PMS5, the space required by other components of the PMS5 and PMS6 algorithms dominates. For example, the Bloom filter used by both algorithms occupies 1GB, and to solve (23,9) instances we need approximately 0.5GB for Q. At present, run time, not memory, limits our ability to solve larger challenge instances than (23,9) using either PMS5 or PMS6.
Table 4-1. Preprocessing times for PMS5 and PMS6
Algorithm     (13,4)   (15,5)   (17,6)   (19,7)   (21,8)   (23,9)
PMS5          24.0s    55.0s    119.0s   245.0s   478.0s   883.0s
PMS6          0.9s     1.0s     2.0s     3.5s     9.5s     25.5s
PMS5/PMS6     26.67    55.00    59.50    70.00    50.31    34.63

Table 4-2. Total storage for PMS5 and PMS6 in MegaBytes (MB)
Algorithm     (13,4)   (15,5)   (17,6)   (19,7)   (21,8)   (23,9)
PMS5          5.00     15.00    35.00    50.00    155.00   300.00
PMS6          1.00     4.00     15.00    45.00    145.00   450.00
PMS5/PMS6     5.00     3.75     2.33     1.11     1.07     0.67

4.4.2 Computing B_d

Next, we compare the time taken to compute all B_d(x, y, z) in PMS5 and the time taken to compute all B_d(C(n_1, ..., n_5)) in PMS6 for different challenge instances. The times are given in Table 4-3. Note that since PMS5 and PMS6 use different Bloom filters to intersect the Qs, the number of iterations needed for Q′ to reach the threshold size (i.e., the number of iterations of the for k loop (Figure 4-1)) may be different in PMS5 and PMS6. If PMS6 is run for the same number of iterations as used by PMS5, the PMS6 times to compute the B_ds go up to 8.89m for the (19,7) instances, to 1.29h for the (21,8) instances, and to 10.83h for the (23,9) instances. The PMS5/PMS6 ratios become 2.5, 2.26 and 1.82, respectively, for these instances.

4.4.3 Total Time

The total time (i.e., the time to compute the motifs) taken by PMS5 and PMS6 for different challenge instances is shown in Table 4-4. The run time ratio PMS5/PMS6 ranges from a high of 2.20 for the (21,8) case to a low of 1.69 for the (17,6) case.

Table 4-3. Time taken to compute B_d(C(n_1, ..., n_5)) and B_d(x, y, z)
Algorithm     (13,4)   (15,5)   (17,6)   (19,7)   (21,8)   (23,9)
PMS5          12.03s   67.64s   6.86m    22.21m   2.91h    19.73h
PMS6          4.20s    21.67s   2.31m    7.75m    1.09h    8.84h
PMS5/PMS6     2.86     3.12     2.97     2.87     2.67     2.23
Table 4-4. Total run time of PMS5 and PMS6
Algorithm     (13,4)   (15,5)   (17,6)   (19,7)   (21,8)   (23,9)
PMS5          39.0s    130.0s   11.35m   40.38m   4.96h    40.99h
PMS6          22.0s    75.0s    6.72m    22.75m   2.25h    19.19h
PMS5/PMS6     1.77     1.73     1.69     1.77     2.20     2.14

Table 4-5. Total + preprocessing time by PMS5 and PMS6
Algorithm     (13,4)   (15,5)   (17,6)   (19,7)   (21,8)   (23,9)
PMS5          63.0s    185.0s   13.33m   44.47m   5.09h    41.23h
PMS6          22.9s    76.0s    6.75m    22.81m   2.25h    19.19h
PMS5/PMS6     2.75     2.43     1.98     1.95     2.26     2.15

4.4.4 Total Time Plus Preprocessing Time

In some application scenarios, the preprocessing time may not be amortized over several instances that have the same (l, d). In these environments, one would need to use the time taken for both the preprocessing step as well as the time to find the motifs using the preprocessed data as the actual cost of the computation. The total time plus the preprocessing time for PMS5 and PMS6 is shown in Table 4-5. When preprocessing time is factored in, the run time ratio PMS5/PMS6 declines from a high of 2.75 for the (13,4) case to a low of 1.95 for the (19,7) case.

4.5 PMS6MC

4.5.1 Overview

PMS6MC exploits the parallelism present in the PMS6 algorithm. First, there is outer-level parallelism where the motif search for many x's from s_1 can be carried out in parallel (i.e., several iterations of the outer for loop of Figure 4-4 are run in parallel). Second, there is inner-level parallelism where the individual steps of the inner for loop of Figure 4-4 are done in parallel using multiple threads. Outer-level parallelism is limited by the amount of memory available. We have designed PMS6MC to be flexible in terms of its memory and thread requirements. The total number of threads can be set depending on the number of cores and available memory of the system. The threads are grouped into thread blocks. Each thread block operates on a different x from s_1.
So, for example, if we use a total of t threads and 4 thread blocks, then our code does 4 iterations of the outer for loop in parallel, with each iteration (or thread block) being assigned t/4 threads. The threads assigned to a thread block cooperate to find the motifs corresponding to a particular x. The threads use the syncthreads() primitive function to synchronize. This function can be implemented using the thread library synchronization mechanism available under different operating systems. We denote thread block i as T[i], while the threads within thread block i are denoted by T[i][j].

4.5.2 Outer-Level Parallelism

In this, each thread block processes a different x from s_1 and calls the function findMotifAtThisX() (Figure 4-6). Once a thread block is done with its assigned x, it moves on to the next x from s_1 which is not processed yet. Threads in a thread block execute the function findMotifAtThisX() to find if there is any motif in the d-neighborhood of x.

  PMS6MC(S, l, d)
    for each idle thread block T[i] in parallel do {
      select an x ∈_l s_1 that hasn't yet been selected;
      if there is no such x the thread block stops;
      findMotifAtThisX(x, i);
    }
Figure 4-6. PMS6MC outer level loop
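The outer loop of Figure 4-6 can be realized with ordinary CPU threads. The following is a minimal C++11 sketch under our own assumptions: a shared atomic index hands out the x positions, each "thread block" is represented here by one controlling thread, and findMotifAtThisX() (assumed, not defined here) does the per-x work of Figure 4-7.

  #include <atomic>
  #include <string>
  #include <thread>
  #include <vector>

  void findMotifAtThisX(const std::string& x, int blockId);  // per-x work (Figure 4-7), assumed

  // Launch numBlocks controlling threads; each repeatedly claims the next
  // unprocessed l-mer x of s1 until none remain.
  void pms6mcOuterLoop(const std::string& s1, int l, int numBlocks) {
    std::atomic<int> nextX(0);
    const int lastStart = static_cast<int>(s1.size()) - l;
    std::vector<std::thread> blocks;
    for (int i = 0; i < numBlocks; i++) {
      blocks.emplace_back([&, i]() {
        while (true) {
          int pos = nextX.fetch_add(1);      // claim an x that hasn't been selected yet
          if (pos > lastStart) break;        // no such x: this thread block stops
          findMotifAtThisX(s1.substr(pos, l), i);
        }
      });
    }
    for (auto& b : blocks) b.join();
  }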


theseequivalenceclassesindecreasingorderof j B d ( C ) j inparallelbythethreadsin thethreadblock.Thisorderhelpsinloadbalancingbetween differentthreadsaseach willbecomputing j B d ( C ) j inparallel.ThisisakintousingtheLPTschedulingruleto minimizenishtime.Thenweneedtostore B d ( C ) sin Q if k =1 ;i.e.,whennding motifsbetween x s 2 and s 3 .Thiscanbedoneduringthepreviousstepwhilecomputing B d ( C ) .For k 2 ,weneedtointersectthesetofall B d ( C ) swith Q .Whenthesizeof Q fallsbelowacertainthreshold,weneedtoexecutethefunct ion outputMotifs tondout which l -mersin Q arevalidmotifs.Thedifferentstepsfor ndMotifAtThisX () aregiven inFigure 4-7 findMotifAtThisX(Stringx,Threadblocki) for k =1 to b n 1 2 c f Classes [] ; Q ; findEquivalenceClasses( x ,T[i], Classes [] ); Sort Classes [] indecreasingorderof j B d ( C ) j ; foreachclass C inorderfrom Classes [] in parallelbythreadsT[i][j];f ComputeProcess B d ( C ); gsyncthreads(T[i]);ProcessQ( Q ,k,T[i]); syncthreads(T[i]);if j Q j < threshold break; goutputMotifs(Q)inparallelbythreadsT[i][j]; Figure4-7.Findingmotifsinparallel ThedatastructureusedinPMS6MCfor Q isverysimilartothatusedinPMS6. However,weusetwocharacterarrays strs 1[] and strs 2[] insteadofone.Thishelpsus toperformmanyoperationson Q inparallelbymultiplethreads.Wenowdescribethe differentstepsusedtondingmotifsemployingthismodie dstructurefor Q 4.5.3.1Findingequivalenceclassesinparallel Eachthreadworksonasegmentofthestring s 2 k inparallel.Forall y thatbelongto thethread'sassignedsegmentof s 2 k andforall z from s 2 k +1 ,thethreadcomputesthe 115


4.5.3.1 Finding Equivalence Classes in Parallel

Each thread works on a segment of the string s2k in parallel. For all y that belong to the thread's assigned segment of s2k and for all z from s2k+1, the thread computes the number of type 1 through type 5 positions for the triple (x, y, z). Based on the number of type 1 through type 5 positions (n1, ..., n5), the triple is put into the corresponding equivalence class. Once all the threads finish, the equivalence classes formed by the different threads need to be merged. As the ni's for i = 1, ..., 5 can only vary from 0 to l, the whole range of (n1, ..., n5) tuples is divided among the threads in the thread block. Each thread then finds the equivalence classes present in its assigned range and gathers all the triples of these equivalence classes in parallel. The pseudocode for this step is given in Figure 4-8.

findEquivalenceClasses(x, i, Classes[])
  for each l-mer y of s2k and each l-mer z of s2k+1, in parallel by threads T[i][j] {
    Compute n1, ..., n5 for (x, y, z);
    if C(n1, ..., n5) ∉ Classes[j] {
      Create the class C(n1, ..., n5) with (x, y, z);
      Add C(n1, ..., n5) to Classes[j];
    }
    else add (x, y, z) to class C(n1, ..., n5);
  }
  syncthreads(T[i]);
  Merge equivalence classes in Classes[] by threads T[i][j] in parallel;
  syncthreads(T[i]);

Figure 4-8. Finding equivalence classes in parallel
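One way to organize the merge in the last step of Figure 4-8 is to give every thread a contiguous share of the space of (n1, ..., n5) tuples, so that no two threads ever gather the same class. The sketch below shows such a partition; the mixed-radix key and the even split of the key range are illustrative assumptions rather than the scheme used by PMS6MC.

#include <array>
#include <cstdint>
#include <utility>

// Encode an equivalence class C(n1, ..., n5) as a single key. Each ni lies in
// 0..l, so a mixed-radix encoding with radix l + 1 is injective.
std::uint64_t classKey(const std::array<int, 5>& n, int l) {
    std::uint64_t key = 0;
    for (int i = 0; i < 5; ++i)
        key = key * static_cast<std::uint64_t>(l + 1) + static_cast<std::uint64_t>(n[i]);
    return key;
}

// Half-open range [lo, hi) of class keys that thread j of t threads merges: it
// gathers, from every per-thread class list, the triples of classes in its range.
std::pair<std::uint64_t, std::uint64_t> mergeRange(int j, int t, int l) {
    std::uint64_t total = 1;
    for (int i = 0; i < 5; ++i) total *= static_cast<std::uint64_t>(l + 1);
    return {total * j / t, total * (j + 1) / t};
}

Since l <= 23 for the instances considered here, there are at most 24^5 (about eight million) possible tuples, so 64-bit keys are ample.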


4.5.3.2 Computing Bd(C) in Parallel

Once the equivalence classes are formed, we determine |Bd(C)| by multiplying the number of triples with the number of solutions for the equivalence class obtained from the pre-computed tables. Once the number of l-mers is known, the offset in strs1[] at which to store the l-mers during the first iteration is also known. Hence, each thread can store l-mers from its designated offset without conflict with other threads. To ensure that each thread in a thread block does roughly the same amount of work, we first order the equivalence classes in decreasing order of |Bd(C)|. This sorting can be done by a single thread, as the number of equivalence classes is typically less than 1000 even for large instances. Thread j of the thread block selects the jth equivalence class to work on; when a thread completes, it selects the next available equivalence class to work on. This strategy is akin to the LPT scheduling strategy and is known to provide good load balance in practice. Each thread computes Bd(C) for the class C it is working on using the same strategy as used by PMS6 (see Section 4.3.2).

For k = 1, we store the l-mers in strs1[] from the designated offset. We also do some additional work that facilitates sorting Q in parallel during the next step. For each thread, we keep track of the number of l-mers having the same first character. This is done by maintaining a 2-D counter array counter[][] indexed by thread number and the first character of the l-mer.

For k >= 2, each l-mer is searched for in Q and marked in markBuffer[] when found (see Section 4.3.3). Although there might be a write conflict while setting a bit in markBuffer[], all threads can carry out this step in parallel because threads that write to the same mark bit write the same value. Figure 4-9 gives the steps used to compute and process Bd(C).

ProcessBd(Bd(C), Q, k, i, j)
  if k = 1 {
    for each l-mer w ∈ Bd(C): counter[j][w[0]]++;
    Copy the l-mers in Bd(C) to strs1[] from the offset for this class;
  }
  if k >= 2 {
    For each l-mer w ∈ Bd(C), set markBuffer[] if w is present in Q;
  }

Figure 4-9. Compute and process Bd(C)
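The class-claiming rule just described (classes pre-sorted by decreasing |Bd(C)|, with each idle thread taking the next remaining class) can be realized with a single shared counter. The following sketch is illustrative only; the names EquivClass and processBd are assumptions, and the per-class work itself is the ProcessBd step of Figure 4-9.

#include <atomic>
#include <cstddef>
#include <vector>

struct EquivClass {
    std::size_t numTriples = 0;       // triples (x, y, z) that fell into this class
    std::size_t perTripleCount = 0;   // l-mers contributed per triple (from the tables)
    std::size_t bdSize() const { return numTriples * perTripleCount; }   // |Bd(C)|
};

// Stand-in for the per-class work of Figure 4-9.
void processBd(EquivClass& /*c*/, int /*k*/, int /*threadId*/) {}

// Run by every thread of the block after a single thread has sorted `classes`
// in decreasing order of bdSize(); fetch_add hands out classes in that order,
// which is the LPT-style rule used for load balancing.
void claimAndProcessClasses(std::vector<EquivClass>& classes,
                            std::atomic<std::size_t>& nextClass,
                            int k, int threadId) {
    for (;;) {
        std::size_t i = nextClass.fetch_add(1);
        if (i >= classes.size()) break;
        processBd(classes[i], k, threadId);
    }
}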


4.5.3.3 Processing Q in Parallel

The processing of Q depends on the iteration number. When k = 1, we sort strs1[], then remove the duplicates and set the bucket pointers. For the remaining iterations, we need to remove all unmarked l-mers from strs1[] and update the bucket pointers.

The k = 1 sort is done by first computing the prefix sum of counters[][] so that the counter for a particular thread and a particular character equals the total number of l-mers processed that have either a smaller first character, or an equal first character but were processed by a thread with a smaller index. Since the number of counters is small (256 different counters for 8-bit characters), we compute the prefix sums using a single thread. The pseudocode is given in Figure 4-10. Next, each thread in the thread block scans through the l-mers in strs1[] that it stored while generating Bd(C). Depending on the first character of the l-mer, the l-mer is moved to strs2[] starting from the offset indicated by the prefix-summed counter. This movement of l-mers is done by the threads in a thread block in parallel. Once the movement is complete, strs2[] is divided into segments such that the first characters of all l-mers within a segment are the same. Following this segmenting, the threads sort the segments of strs2[] in parallel using radix sort. Each thread works on a different segment. Since the first characters of the l-mers in a segment are the same, the radix sort starts from the second character. Once the segments are sorted, we proceed to eliminate duplicates and set the bucket pointers. First, the threads count the number of unique l-mers in each segment in parallel by checking adjacent l-mers. Again, each thread works on a different segment. The determined counts of unique l-mers are prefix summed by a single thread to get the offsets required for moving the unique l-mers to their final positions. Using these offsets, the threads move the unique l-mers, with each thread moving the unique l-mers of a different segment from strs2[] to strs1[] in parallel. While moving an l-mer, the threads also check whether the first p characters of the current l-mer are the same as those of the previous l-mer; if not, the appropriate pointer in bucketPointers[] is set.
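The per-segment sort described above is simple because all l-mers in a segment already agree on their first character, so a stable LSD radix sort over character positions l-1 down to 1 suffices. The sketch below shows one such segment sort that each thread could run on its own segment; the flat character-buffer interface is an assumption made for illustration.

#include <cstddef>
#include <cstring>
#include <vector>

// Sort `count` l-mers of length `l`, stored back to back in `seg`, by characters
// l-1 down to 1 (position 0 is identical within a segment). Each pass is a stable
// counting sort, so after the last pass the l-mers are in lexicographic order.
void sortSegment(char* seg, std::size_t count, int l, std::vector<char>& tmp) {
    if (count == 0) return;
    tmp.resize(count * static_cast<std::size_t>(l));
    for (int pos = l - 1; pos >= 1; --pos) {
        std::size_t cnt[256] = {0};
        for (std::size_t i = 0; i < count; ++i)
            cnt[static_cast<unsigned char>(seg[i * l + pos])]++;
        std::size_t off[256], run = 0;
        for (int c = 0; c < 256; ++c) { off[c] = run; run += cnt[c]; }
        for (std::size_t i = 0; i < count; ++i) {
            unsigned char c = static_cast<unsigned char>(seg[i * l + pos]);
            std::memcpy(&tmp[off[c]++ * l], &seg[i * l], l);   // stable scatter
        }
        std::memcpy(seg, tmp.data(), count * static_cast<std::size_t>(l));
    }
}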


PrefixCounters(counters[][])
  sum = 0;
  for i = 0 to 255 {   // there are 256 possibilities for the first character
    counters[0][i] = counters[0][i] + sum;
    for j = 1 to threads - 1 {
      counters[j][i] = counters[j][i] + counters[j-1][i];
    }
    sum = counters[threads-1][i];
  }

Figure 4-10. Prefix sum of counters

When k >= 2, we need to remove from Q all the l-mers that are not marked in markBuffer[]. This is done in two steps. First, markBuffer[] is divided into segments and each thread does a prefix sum on a different segment in parallel. This gives the number of marked l-mers in each segment. Next, we move the marked l-mers in each segment from strs1[] to strs2[]. For this, a prefix sum is performed by a single thread on the counters holding the number of marked l-mers in the different segments, to get the offsets in strs2[] at which to move the l-mers. With these offsets, the threads then move the marked l-mers from the different segments in parallel. As before, when moving an l-mer, the thread checks whether the first p characters of the current l-mer differ from those of the previous one and, if so, updates the appropriate pointer in bucketPointers[]. There might be a problem in updating the bucket pointers in the boundary regions of the segments, as one bucket can extend across the boundary of two segments and hence two threads might update that bucket pointer. These boundary bucket pointers are fixed by a single thread after all l-mers have been moved. Note that there can only be as many boundary buckets as there are segments, which are very few in number. The pseudocode for processing Q for the different values of k is given in Figure 4-11.

4.5.3.4 The outputMotifs Routine

Once the size of Q drops below a certain threshold, we break out of the loop and call outputMotifs(Q) to determine the set of valid motifs in Q. This step can be done in parallel, since checking whether an l-mer is a motif can be done independently of the other l-mers. So, each thread examines a disjoint set of l-mers from Q, exhaustively checking whether each is a motif as is done in PMS6; the threads operate in parallel.
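As an illustration of this last step, the sketch below shows the kind of exhaustive per-candidate check each thread could perform on its disjoint share of Q: a candidate is kept only if every input string contains an l-mer within Hamming distance d of it. The interleaved chunking and the helper names are illustrative assumptions.

#include <cstddef>
#include <string>
#include <vector>

static int hamming(const char* a, const char* b, int l) {
    int d = 0;
    for (int i = 0; i < l; ++i) d += (a[i] != b[i]);
    return d;
}

// A candidate is a motif if every string of S has some l-mer within distance d of it.
static bool isMotif(const std::string& cand, const std::vector<std::string>& S, int d) {
    const int l = static_cast<int>(cand.size());
    for (const std::string& s : S) {
        bool found = false;
        for (std::size_t p = 0; p + l <= s.size() && !found; ++p)
            found = hamming(cand.c_str(), s.c_str() + p, l) <= d;
        if (!found) return false;
    }
    return true;
}

// Thread j of t threads validates candidates j, j+t, j+2t, ... (a disjoint chunk of Q).
void outputMotifsChunk(const std::vector<std::string>& Q, const std::vector<std::string>& S,
                       int d, int j, int t, std::vector<std::string>& out) {
    for (std::size_t i = j; i < Q.size(); i += t)
        if (isMotif(Q[i], S, d)) out.push_back(Q[i]);
}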


ProcessQ(Q, k, T[i])
  if k = 1 {
    PrefixCounters(counters);
    Move l-mers from strs1[] to strs2[] by threads T[i][j] in parallel;
    syncthreads(T[i]);
    Sort the segments of strs2[] in parallel by threads T[i][j];
    syncthreads(T[i]);
    Count the unique l-mers in each segment in parallel by threads T[i][j];
    syncthreads(T[i]);
    Move the unique l-mers to strs1[] and update the bucket pointers in parallel by threads T[i][j];
  }
  if k >= 2 {
    Divide markBuffer[] into segments;
    Prefix sum markBuffer[] by threads T[i][j] in parallel;
    syncthreads(T[i]);
    Move the marked l-mers from the segments of strs1[] to strs2[] in parallel by threads T[i][j];
    Update the bucket pointers while moving l-mers;
    syncthreads(T[i]);
    Fix the boundary buckets if necessary;
  }

Figure 4-11. Processing Q

4.6 Experimental Results

We evaluated the performance of PMS6MC on the challenging instances described in [56]. For each (l, d) that characterizes a challenging instance, we generated 20 random strings of length 600 each. Next, a random motif of length l was generated and planted at random positions in each of the 20 strings. The planted motif was then randomly mutated in exactly d randomly chosen positions. For each (l, d) value up to (19,7), we generated 20 instances, and for larger (l, d) values we generated 5 instances. The average run times for each (l, d) value are reported in this section. Since the variation in run times across instances was rather small, we do not report the standard deviation. Even though we test our algorithm using only synthetic data sets, several authors (e.g., [56]) have shown that PMS codes that work well on the kind of synthetic data used by us also work well on real data.
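For reference, instances of this kind can be generated with a few lines of code. The sketch below (function name and random-number-generator choice are ours) plants one copy of a random length-l motif, mutated in exactly d distinct positions, at a random position of each of n random strings of length m.

#include <algorithm>
#include <random>
#include <string>
#include <vector>

std::vector<std::string> plantedInstance(int n, int m, int l, int d, std::mt19937& rng) {
    const std::string alpha = "ACGT";
    std::uniform_int_distribution<int> base(0, 3);
    auto randomString = [&](int len) {
        std::string s(len, 'A');
        for (char& c : s) c = alpha[base(rng)];
        return s;
    };
    std::string motif = randomString(l);
    std::vector<std::string> strings;
    for (int i = 0; i < n; ++i) {
        std::string s = randomString(m);
        std::string variant = motif;
        // mutate exactly d distinct positions of this copy of the motif
        std::vector<int> pos(l);
        for (int p = 0; p < l; ++p) pos[p] = p;
        std::shuffle(pos.begin(), pos.end(), rng);
        for (int p = 0; p < d; ++p) {
            char old = variant[pos[p]];
            char sub;
            do { sub = alpha[base(rng)]; } while (sub == old);
            variant[pos[p]] = sub;
        }
        std::uniform_int_distribution<int> at(0, m - l);
        s.replace(at(rng), l, variant);     // plant the mutated copy
        strings.push_back(std::move(s));
    }
    return strings;
}

For example, plantedInstance(20, 600, 13, 4, rng) produces one synthetic (13,4) instance of the kind used in these experiments.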


4.6.1 PMS6MC Implementation

PMS6MC is implemented using the pthreads (POSIX Threads) library under Linux on an Intel 6-core system with each core running at 3.3 GHz. We experimented with different degrees of outer-level (number of thread blocks) and inner-level (number of threads in a thread block) parallelism for the different challenging instances. For smaller instances (e.g., (13,4) and (15,5)), the performance is limited by the memory bandwidth of the system. Hence, increasing the degree of inner- or outer-level parallelism does not have much effect on the run time, as most of the threads stall for memory access. For larger instances, the number of thread blocks is limited by the available memory of the system. Table 4-6 gives the number of thread blocks and the number of threads in a thread block that produce the optimum performance for the different challenging instances.

Table 4-6. Degree of inner and outer level parallelism for PMS6MC
Challenging instance   Outer-level blocks   Threads per block   Total threads
(13,4)                 2                    6                   12
(15,5)                 2                    6                   12
(17,6)                 8                    6                   48
(19,7)                 4                    12                  48
(21,8)                 2                    12                  24
(23,9)                 2                    12                  24

4.6.2 PMS6 and PMS6MC

We compare the run times of PMS6 and PMS6MC on an Intel 6-core system with each core running at 3.3 GHz. PMS6 takes 22 seconds on average to solve (13,4) instances and 19 hours on average to solve (23,9) instances. PMS6MC, on the other hand, takes 8 seconds on average to solve (13,4) instances and 3.5 hours on average to solve (23,9) instances. The speedup achieved by PMS6MC over PMS6 varies from a low of 2.75 for (13,4) instances to a high of 6.62 for (17,6) instances.


For (19,7) and larger instances, PMS6MC achieves a speedup of over 5. The run times for the various challenging instances are given in Table 4-7.

Table 4-7. Run times for PMS6 and PMS6MC
Algorithm      (13,4)   (15,5)   (17,6)   (19,7)   (21,8)   (23,9)
PMS6           22.0s    74.0s    6.82m    22.75m   2.25h    19.19h
PMS6MC         8.0s     21.0s    1.03m    4.45m    25.5m    3.57h
PMS6/PMS6MC    2.75     3.52     6.62     5.11     5.29     5.38

4.6.3 PMS6MC and Other Parallel Algorithms

Dasari, Desh, and Zubair proposed a voting-based parallel algorithm for multi-core architectures using bit arrays [59]. They followed up with an improved algorithm based on suffix trees for GPUs and multi-core CPUs from Intel [60]. We estimate the relative performance of PMS6MC and these parallel algorithms using published run times and performance ratios. Table 4-8 and the first three rows of Table 4-9 give the performance of PMS5 and PMSPrune as reported in [56] and that of PMS5 and PMS6 as reported in [57], respectively.

Table 4-8. Run times for PMS5 and PMSPrune [56]
Algorithm        (13,4)   (15,5)   (17,6)   (19,7)
PMS5             117.0s   4.8m     21.7m    1.7h
PMSPrune         45.0s    10.2m    78.7m    15.2h
PMS5/PMSPrune    2.60     0.47     0.28     0.11

We divide the ratio PMS5/PMSPrune by PMS5/PMS6 to estimate the ratio PMS6/PMSPrune (the PMS6/PMSPrune row of Table 4-9). Next, we divide the ratio PMS6/PMS6MC (the PMS6/PMS6MC row of Table 4-7) by our estimate of PMS6/PMSPrune to get an estimate of PMSPrune/PMS6MC (the PMSPrune/PMS6MC row of Table 4-9).
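For example, for the (19,7) instances, dividing 0.11 (PMS5/PMSPrune, Table 4-8) by 1.77 (PMS5/PMS6) gives PMS6/PMSPrune ≈ 0.06, and dividing 5.11 (PMS6/PMS6MC, Table 4-7) by 0.06 gives PMSPrune/PMS6MC ≈ 85, the (19,7) entries of the last two rows of Table 4-9.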


Table 4-9. Total run time of different PMS algorithms
Algorithm           (13,4)   (15,5)   (17,6)   (19,7)   (21,8)   (23,9)
PMS5                39.0s    130.0s   11.35m   40.38m   4.96h    40.99h
PMS6                22.0s    75.0s    6.72m    22.75m   2.25h    19.19h
PMS5/PMS6           1.77     1.73     1.69     1.77     2.20     2.14
PMS6/PMSPrune       1.46     0.27     0.17     0.06     -        -
PMSPrune/PMS6MC     1.88     13.04    38.94    85.17    -        -

The first three rows of Table 4-10 give the run times of PMSPrune, mSPELLER-x, and gSPELLER-x as reported in [60]. The "x" indicates the number of CPU cores for mSPELLER and the number of GPU devices in the case of gSPELLER. We report the times for mSPELLER-16 and gSPELLER-4, as these were the fastest reported in [60]. From this data and that of Table 4-9, we can estimate the speedups shown in the last four rows of Table 4-10. We estimate that the speedup of PMS6MC using 6 cores compared to mSPELLER-16 using 16 cores varies from a low of 0.07 for (13,4) instances to 3.58 for (19,7) instances, while the speedup of PMS6MC using only one CPU with respect to gSPELLER-4 using 4 GPUs varies from a low of 0.03 for (13,4) instances to a high of 1.97 for (19,7) instances.

Table 4-10. Comparing mSPELLER and gSPELLER with PMS6MC
Algorithm               (13,4)   (15,5)   (17,6)   (19,7)   (21,8)
PMSPrune                53.0s    9.0m     69.0m    9.2h     -
mSPELLER-16             2.0s     16.5s    2.5m     23.6m    3.7h
gSPELLER-4              0.8s     7.2s     1.2m     13.0m    2.2h
PMSPrune/mSPELLER-16    26.50    32.73    27.60    23.38    -
PMSPrune/gSPELLER-4     66.25    75.00    57.50    42.46    -
mSPELLER-16/PMS6MC      0.07     0.40     1.41     3.64     -
gSPELLER-4/PMS6MC       0.03     0.17     0.68     2.00     -

4.7 Summary

We have developed a new algorithm, PMS6, for the motif discovery problem. The run time ratio PMS5/PMS6 ranges from a high of 2.20 for the (21,8) challenge instances to a low of 1.69 for the (17,6) challenge instances. Both PMS5 and PMS6 require some amount of preprocessing. Preprocessing for PMS6 is 34 times faster than for PMS5 on (23,9) instances. When preprocessing time is factored in, the run time ratio PMS5/PMS6 is as high as 2.75 for (13,4) instances and as low as 1.95 for (19,7) instances.


We have further developed a multicore version of PMS6 that achieves a speedup ranging from a low of 2.75 for (13,4) challenging instances to a high of 6.62 for (17,6) challenging instances on a 6-core CPU. Our multicore algorithm is able to solve (23,9) challenging instances in 3.5 hours, while the single-core PMS6 algorithm takes 19 hours. We estimate that our multicore algorithm is faster than other parallel algorithms for the motif search problem on large challenging instances. For example, we estimate that PMS6MC running on our 6-core CPU can solve (19,7) instances 3.6 times faster than mSPELLER-16 using the 16-core CPU of [60] and about 2 times faster than gSPELLER-4 using 4 GPU devices.


CHAPTER 5
CONCLUSION AND FUTURE WORK

In this dissertation, we have developed algorithms for sorting and motif search on different parallel architectures. Specifically, we have developed a parallel merge sort algorithm for the Cell Broadband Engine Architecture, which is the fastest way to sort numbers stored in the local store of an SPE. We then extended this algorithm to sort large records on the Cell Broadband Engine. Next, we developed a fast radix-sort-based algorithm, GRS, for sorting large records on GPUs. We have experimented with different memory layouts for storing records and have shown GRS to be a fast algorithm for sorting records. We then developed a fast sequential algorithm, PMS6, for motif search and implemented it on an Intel CPU. We have shown that PMS6 is, on average, twice as fast as the fastest prior sequential algorithm for motif search. We then developed a parallel version of PMS6, PMS6MC, for Intel multi-core CPUs. On a 6-core Intel Core i7 CPU, it achieves a speedup of as high as 6 for some challenging instances of motif search.

PMS6MC can be extended to other multi-core systems. However, the granularity of parallelism should be varied according to the system on which it is implemented. The natural extension of PMS6MC would be to implement it on GPUs. There are certain differences between CPUs and GPUs that make simply porting the current code to a GPU much less than optimal. On CPUs, it is better to run a few threads with each thread handling multiple data elements at a time. On GPUs, by contrast, it is essential to spawn thousands of threads to hide memory latency and run the GPU at its maximum performance. As an example of applying these principles, consider the step in which Bd(C) is generated in parallel for all equivalence classes. In the CPU implementation, it was enough for each thread to process a different class in parallel, but on GPUs this is not a good design. Firstly, there are not enough classes to exploit the full parallelism of the GPU.


Secondly, each thread processing a different class will lead to divergence. Hence, in a GPU implementation, each thread in a thread block can work on different triples (x, y, z) from the same equivalence class C, as there will be little divergence in the processing of triples from the same class. This also increases the number of threads that can be launched on the GPU. Each thread block can then process a different class. However, this approach increases the register and shared memory pressure, as each thread needs to keep information related to that equivalence class. We intend to analyze the various trade-offs in designing GPU kernels by increasing the level of parallelism while maintaining efficient memory access.


REFERENCES

[1] H. Hofstee, "Power efficient processor architecture and the Cell processor," in 11th International Symposium on High Performance Computer Architecture, 2005.

[2] A. C. Chow, G. C. Fossum, and D. A. Brokenshire, "A Programming Example: Large FFT on the Cell Broadband Engine," IBM Whitepaper, 2005.

[3] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, pp. 39–55, 2008.

[4] NVIDIA Corporation, "NVIDIA CUDA programming guide," CUDA Manual, 2010.

[5] N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2009.

[6] S. Sahni, "Scheduling master-slave multiprocessor systems," IEEE Trans. on Computers, vol. 45, no. 10, pp. 1195–1199, 1996.

[7] V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra," in ACM/IEEE Conference on Supercomputing, 2008.

[8] B. Gedik, R. Bordawekar, and P. Yu, "CellSort: High performance sorting on the Cell processor," in International Conference on Very Large Data Bases, 2007.

[9] Y. Won and S. Sahni, "Hypercube-to-host sorting," Journal of Supercomputing, vol. 3, pp. 41–61, 1989.

[10] D. Sharma, V. Thapar, R. Ammar, S. Rajasekaran, and M. Ahmed, "Efficient sorting algorithms for the Cell Broadband Engine," in IEEE International Symposium on Computers and Communications (ISCC), 2008.

[11] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, "AA-sort: A new parallel algorithm for multi-core SIMD processors," in 16th International Conference on Parallel Architecture and Compilation (PACT), 2007.

[12] S. Bandyopadhyay and S. Sahni, "Sorting on a Cell Broadband Engine SPU," in IEEE International Symposium on Computers and Communications (ISCC), 2009.

[13] D. Knuth, The Art of Computer Programming: Sorting and Searching. Addison-Wesley, 1998, vol. 3.

[14] W. Dobosiewicz, "An efficient variation of bubble sort," Information Processing Letters, vol. 11, pp. 5–6, 1980.

[15] R. Box and S. Lacey, "A fast, easy sort," Byte, vol. 4, pp. 315–318, 1991.

[16] A. Drozdek, "Worst case for Comb Sort," Informatyka Teoretyczna i Stosowana, vol. 9, pp. 23–27, 2005.


[17] S. Lai and S. Sahni, "Anomalies in parallel branch-and-bound algorithms," Comm. of the ACM, vol. 27, no. 6, pp. 594–602, 1984.

[18] P. Lemke, "The performance of randomized Shellsort-like network sorting algorithms," SCAMP working paper P18/94, 1994.

[19] R. Sedgewick, "Analysis of Shellsort and related algorithms," in 4th European Symposium on Algorithms, 1996.

[20] E. Horowitz, S. Sahni, and D. Mehta, Fundamentals of Data Structures in C++. Silicon Press, 2007.

[21] S. Bandyopadhyay and S. Sahni, "Sorting large records on a Cell Broadband Engine," in IEEE International Symposium on Computers and Communications (ISCC), 2010.

[22] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha, "GPUTeraSort: High performance graphics coprocessor sorting for large database management," in ACM SIGMOD International Conference on Management of Data, 2006.

[23] A. Greß and G. Zachmann, "GPU-ABiSort: Optimal parallel sorting on stream architectures," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2006.

[24] D. Cederman and P. Tsigas, "GPU-Quicksort: A practical quicksort algorithm for graphics processors," ACM Journal of Experimental Algorithmics (JEA), vol. 14, 2009.

[25] E. Sintorn and U. Assarsson, "Fast parallel GPU-sorting using a hybrid algorithm," Journal of Parallel and Distributed Computing, vol. 10, pp. 1381–1388, 2008.

[26] X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne, "High performance comparison-based sorting algorithm on many-core GPUs," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010.

[27] N. Leischner, V. Osipov, and P. Sanders, "GPU sample sort," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010.

[28] S. Sengupta, M. Harris, Y. Zhang, and J. Owens, "Scan primitives for GPU computing," in Graphics Hardware 2007, 2007.

[29] NVIDIA Corporation, "CUDPP: CUDA data-parallel primitives library," http://www.gpgpu.org/developer/cudpp/, 2009.

[30] S. Le Grand, "Broad-phase collision detection with CUDA," GPU Gems 3, 2007.

[31] D. Merrill and A. Grimshaw, "High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing," Parallel Processing Letters, vol. 21, no. 2, pp. 245–272, 2011.


[32] S. Bandyopadhyay and S. Sahni, "GRS - GPU radix sort for large multifield records," in International Conference on High Performance Computing (HiPC), 2010.

[33] G. E. Blelloch, Vector Models for Data-Parallel Computing. The MIT Press, MA, USA, 1990.

[34] K. E. Batcher, "Sorting networks and their applications," in Proc. AFIPS Spring Joint Computing Conference, vol. 32, 1968, pp. 307–314.

[35] S. Bandyopadhyay and S. Sahni, "Sorting large multifield records on a GPU," in IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2011.

[36] P. A. Evans, A. Smith, and H. T. Wareham, "On the complexity of finding common approximate substrings," Theoretical Computer Science, vol. 306, pp. 407–430, 2003.

[37] T. L. Bailey, N. Williams, C. Misleh, and W. W. Li, "MEME: Discovering and analyzing DNA and protein sequence motifs," Nucleic Acids Research, vol. 34, pp. 369–373, 2006.

[38] P. Pevzner and S.-H. Sze, "Combinatorial approaches to finding subtle signals in DNA sequences," in Proc. Eighth International Conference on Intelligent Systems for Molecular Biology, vol. 8, 2000, pp. 269–278.

[39] J. Buhler and M. Tompa, "Finding motifs using random projections," in Fifth Annual International Conference on Computational Molecular Biology (RECOMB), 2001.

[40] A. Price, S. Ramabhadran, and P. Pevzner, "Finding subtle motifs by branching from the sample strings," in Bioinformatics supplementary edition, Proceedings of the Second European Conference on Computational Biology (ECCB), 2003.

[41] C. E. Lawrence, S. F. Altschul, M. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton, "Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment," Science, vol. 262, pp. 208–214, 1993.

[42] G. Hertz and G. Stormo, "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences," Bioinformatics, vol. 15, no. 7, pp. 563–577, 1999.

[43] U. Keich and P. Pevzner, "Finding motifs in the twilight zone," Bioinformatics, vol. 18, pp. 1374–1381, 2002.

[44] E. Eskin and P. Pevzner, "Finding composite regulatory patterns in DNA sequences," Bioinformatics, vol. 18, no. 1, pp. 354–363, 2002.

[45] M. Sagot, "Spelling approximate repeated or common motifs using a suffix tree," in Proceedings of the Third Latin American Theoretical Informatics Symposium (LATIN), 1998, pp. 111–127.


[46] L. Marsan and M. F. Sagot, "Extracting structured motifs using a suffix tree - algorithms and application to promoter consensus identification," in Fourth Annual International Conference on Computational Molecular Biology (RECOMB), 2000.

[47] A. M. Carvalho, A. T. Freitas, A. L. Oliveira, and M. F. Sagot, "A highly scalable algorithm for the extraction of cis-regulatory regions," in Third Asia Pacific Bioinformatics Conference (APBC), 2005.

[48] N. Pisanti, A. M. Carvalho, L. Marsan, and M. Sagot, "Finding subtle motifs by branching from the sample strings," in Bioinformatics supplementary edition, Proceedings of the Second European Conference on Computational Biology (ECCB), 2003.

[49] P. Evans and A. Smith, "Toward optimal motif enumeration," in Proceedings of Algorithms and Data Structures, 8th International Workshop (WADS), 2003, pp. 47–58.

[50] F. Y. L. Chin and H. C. M. Leung, "Voting algorithms for discovering long motifs," in Proceedings of the Third Asia-Pacific Bioinformatics Conference (APBC), 2005, pp. 261–271.

[51] P. P. Kuksa and V. Pavlovic, "Efficient motif finding algorithms for large-alphabet inputs," BMC Bioinformatics, vol. 11, 2010.

[52] S. Rajasekaran, S. Balla, and C. H. Huang, "Exact algorithms for planted motif challenge problems," Journal of Computational Biology, vol. 12, no. 8, pp. 1117–1128, 2005.

[53] S. Rajasekaran and H. Dinh, "A speedup technique for (l, d) motif finding algorithms," BMC Research Notes, vol. 4, no. 54, pp. 1–7, 2011.

[54] J. Davila, S. Balla, and S. Rajasekaran, "Fast and practical algorithms for planted (l, d) motif search," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 4, 2007.

[55] J. Davila, S. Balla, and S. Rajasekaran, "Fast and practical algorithms for planted (l, d) motif search," University of Connecticut, Tech. Rep., 2007.

[56] H. Dinh, S. Rajasekaran, and V. Kundeti, "PMS5: An efficient exact algorithm for the (l, d) motif finding problem," BMC Bioinformatics, vol. 12, no. 410, 2011.

[57] S. Bandyopadhyay, S. Sahni, and S. Rajasekaran, "PMS6: A fast algorithm for motif discovery," in Second IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), 2012.

[58] E. Horowitz, S. Sahni, and D. Mehta, Fundamentals of Data Structures in C++. Silicon Press, 2007.


[59] N. S. Dasari, R. Desh, and M. Zubair, "An efficient multicore implementation of planted motif problem," in International Conference on High Performance Computing and Simulation (HPCS), 2010.

[60] N. S. Dasari, R. Desh, and M. Zubair, "High performance implementation of planted motif problem using suffix trees," in International Conference on High Performance Computing and Simulation (HPCS), 2011.


BIOGRAPHICAL SKETCH

Shibdas Bandyopadhyay completed his Ph.D. in computer engineering at the University of Florida in the summer of 2012. His research focus is developing parallel algorithms for many-core architectures. Before starting his Ph.D., he completed his master's in computer science at the Indian Statistical Institute and his bachelor's in information technology at the Indian Institute of Information Technology, Kolkata.