GPU Algorithms for Bioinformatics


Material Information

Title:
GPU Algorithms for Bioinformatics
Physical Description:
1 online resource (164 p.)
Language:
english
Creator:
Li, Junjie
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering, Computer and Information Science and Engineering
Committee Chair:
SAHNI,SARTAJ KUMAR
Committee Co-Chair:
RANKA,SANJAY
Committee Members:
PEIR,JIHKWON
NEWMAN,RICHARD E
LAN,GUANGHUI

Subjects

Subjects / Keywords:
algorithms -- bioinformatics -- gpu
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre:
Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
Graphics Processing Units (GPUs) were developed originally to meet the computational needs of algorithms for rendering computer graphics. Their low cost per gigaflop has motivated significant research into their efficient use for non-graphics applications. In this work, two parallel matrix multiplication algorithms for an NVIDIA GPU using CUDA were first developed to establish guidelines for developing efficient algorithms on a GPU. Following these guidelines, novel bioinformatics algorithms that run on a GPU were developed, including pairwise sequence alignment, syntenic sequence alignment, three-sequence alignment and RNA secondary structure prediction. All of these algorithms achieved significant performance improvements compared to their sequential versions, which run on a CPU.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Junjie Li.
Thesis:
Thesis (Ph.D.)--University of Florida, 2014.
Local:
Adviser: SAHNI,SARTAJ KUMAR.
Local:
Co-adviser: RANKA,SANJAY.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2014
System ID:
UFE0046103:00001




Full Text

GPU ALGORITHMS FOR BIOINFORMATICS

By

JUNJIE LI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2014

© 2014 Junjie Li

To my wife Mingmin and my parents, who offered me unconditional love and supported me each step of the way

ACKNOWLEDGMENTS

This dissertation would not have been possible without the help of so many people in so many ways. I would like to express the deepest appreciation to my committee chair, Professor Sartaj Sahni, for his endless support, patience and encouragement through my PhD study. His scholarly advice, infectious enthusiasm and scientific approach helped me greatly to accomplish this task. I want to thank my committee co-chair, Professor Sanjay Ranka, for his excellent guidance, timely advice and instilled mental stimulation. The discussions with him have inspired me countless times for my research and cleared numerous doubts for me. I also want to thank my committee members, Professor Jih-Kwon Peir, Professor Richard Newman and Professor George Lan for serving as my PhD committee members, discussing my proposed dissertation topic, listening to my defense and reading the draft of my dissertation. They provided many valuable comments for the improvement of this dissertation.

I want to thank the Department of Computer and Information Science and Engineering at University of Florida, which nourished me in the field of computer science. I want to thank the staff of the CISE department as well for creating an orderly and comfortable working environment.

Last but not least, I want to thank my wife Mingmin for her love and understanding in the past few years. I want to thank my parents for supporting me with all the efforts to pursue my academic goal. I also want to thank my friends for the joyful moments we spent together.

My research work was supported, in part, by the National Science Foundation under grants CNS0829916, CNS0905308, CCF0903430, NETS0963812, NETS1115194, CNS0963812, CNS1115184 and the National Institutes of Health under grant R01-LM010101.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 GPU Architecture
   1.2 Programming Model
   1.3 Occupancy
   1.4 GPGPU Computing in Bioinformatics

2 GPU MATRIX MULTIPLICATION
   2.1 Single Core Matrix Multiply
   2.2 Multicore Matrix Multiply
   2.3 GPU Matrix Multiply
      2.3.1 A Thread Computes a 1×1 Sub-matrix of C
      2.3.2 A Thread Computes a 1×2 Sub-matrix of C
      2.3.3 A Thread Computes a 1×4 Sub-matrix of C
      2.3.4 A Thread Computes a 1×1 Sub-matrix of C Using Shared Memory
      2.3.5 A Thread Computes a 16×1 Sub-matrix of C Using Shared Memory
   2.4 A Comparison

3 STRASSEN'S MATRIX MULTIPLICATION
   3.1 Basic GPU Kernels
   3.2 One-Level Adaptation
      3.2.1 One-Level Strassen
      3.2.2 One-Level Winograd
   3.3 Multilevel Recursive GPU Adaptation
      3.3.1 Multilevel Strassen
      3.3.2 Multilevel Winograd
   3.4 Experimental Results
      3.4.1 Single Precision Matrix Multiply
         3.4.1.1 Runtime
         3.4.1.2 Accuracy
         3.4.1.3 Performance by number of levels
      3.4.2 Integer Matrix Multiply
   3.5 Summary

4 PAIRWISE SEQUENCE ALIGNMENT FOR VERY LONG SEQUENCES
   4.1 Smith-Waterman Algorithm
   4.2 Computing the Score of the Best Local Alignment
   4.3 Computing the Best Local Alignment
      4.3.1 Striped Alignment
      4.3.2 ChunkedAlignment1
      4.3.3 ChunkedAlignment2
      4.3.4 Memory Requirements
   4.4 Experimental Results
   4.5 Summary

5 OPTIMAL ALIGNMENT OF THREE SEQUENCES ON A GPU
   5.1 Three-Sequence Alignment Algorithm
   5.2 Computing the Score of the Best Alignment
      5.2.1 Layered Algorithm
      5.2.2 Sloped Algorithm
   5.3 Computing the Best Alignment
      5.3.1 LAYERED-BT1
      5.3.2 LAYERED-BT2
      5.3.3 LAYERED-BT3
   5.4 Experimental Results
      5.4.1 Computing the Score of the Best Alignment
      5.4.2 Computing the Alignment
   5.5 Summary

6 PARALLEL SYNTENIC ALIGNMENT
   6.1 Syntenic Alignment Equations
   6.2 Computing the Score
   6.3 Computing the Alignment
      6.3.1 STRIPED
      6.3.2 CHUNKED1D
      6.3.3 CHUNKED2D
      6.3.4 Memory Requirements
   6.4 Experimental Results
   6.5 Summary

7 MULTICORE AND GPU ALGORITHMS FOR NUSSINOV RNA FOLDING
   7.1 Background
      7.1.1 Nussinov's Dynamic Programming Recurrence
      7.1.2 Simplified Recurrence and GPU Algorithm
   7.2 Methods
      7.2.1 Our GPU Algorithm
         7.2.1.1 Description of maxKernel

   7.3 Results
      7.3.1 Single and Multicore Algorithms
      7.3.2 GPU Algorithms
      7.3.3 Single Core vs Multicore vs GPU
   7.4 Summary

8 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 GPU constraints for compute capabilities 1.3 and 2.0
2-1 Runtimes (s) for the ijk and ikj versions of matrix multiply
2-2 Runtimes (s) for MultiCore
2-3 C1060 runtimes (s) for GPU1 for different block dimensions
2-4 Device-memory transaction statistics for GPU1
2-5 Device-memory transaction statistics for GPU2
2-6 C1060 runtimes (s) for GPU2 for different block dimensions
2-7 C1060 runtimes (s) for GPU3 for different block dimensions
2-8 Device-memory transaction statistics for GPU3
2-9 C1060 runtimes for GPU4 and GPU5
2-10 Device memory statistics for compute capability 1.3
2-11 Number of registers per thread, shared memory (bytes) per block, and occupancy for GPU matrix multiply methods
2-12 Runtime (s) for GPU matrix multiply methods
2-13 GFlops for GPU matrix multiply methods
2-14 Efficiency of GPU matrix multiply methods
3-1 Device-memory transaction statistics for m×m matrices
3-2 Transactions and volume for one-level multiplication of n×n matrices
3-3 Percent reduction relative to GPU8 for one-level multiplication
3-4 Transactions and volume for two-level multiplication of n×n matrices
3-5 Transactions and volume for three-level multiplication of n×n matrices
3-6 Transactions and volume for four-level multiplication of n×n matrices
3-7 Percent reduction relative to GPU8 for two-level multiplication
3-8 Percent reduction relative to GPU8 for three-level multiplication
3-9 Percent reduction relative to GPU8 for four-level multiplication
3-10 Runtime (s) on the Tesla C1060

3-11 Maximum errors
3-12 Average errors
3-13 Maximum errors when n = 16384
3-14 Average errors when n = 16384
3-15 Speedup (%) over sgemm/GPU8 when n = 16384
4-1 Running time (ms) of StripedScore for different s values
4-2 Running time (ms) of scoring algorithms
4-3 Running time (ms) of StripedAlignment for different s values
4-4 Running time (ms) of ChunkedAlignment1 (lenQuery = 10430, lenDB = 14560)
4-5 Running time (ms) of ChunkedAlignment1 (lenQuery = 15533, lenDB = 21728)
4-6 Running time (ms) of ChunkedAlignment1 (lenQuery = 20860, lenDB = 29120)
4-7 Running time (ms) of ChunkedAlignment2 (lenQuery = 10430, lenDB = 14560)
4-8 Running time (ms) of ChunkedAlignment2 (lenQuery = 15533, lenDB = 21728)
4-9 Running time (ms) of ChunkedAlignment2 (lenQuery = 20860, lenDB = 29120)
4-10 Best running time (ms) of ChunkedAlignment1 and ChunkedAlignment2
5-1 Running time (s) for different instances
5-2 Running time (s) of traceback methods for real instances
6-1 Running time of GAP3
6-2 Execution time (ms) of SCORE for different s values
6-3 Running time (s) of SCORE when m <
6-11 Running time (ms) of CHUNKED1D (m = 16978, n = 18317)
6-12 Running time (ms) of CHUNKED2D (m = 6530, n = 7045)
6-13 Running time (ms) of CHUNKED2D (m = 13060, n = 14090)
6-14 Running time (ms) of CHUNKED2D (m = 16978, n = 18317)
7-1 Running time (s) of different CPU algorithms
7-2 Running time (s) of different GPU algorithms
7-3 Speedup of Ours1 relative to other versions

LIST OF FIGURES

1-1 Streaming multiprocessors of NVIDIA's GPU
1-2 NVIDIA's Tesla C1060 GPU
1-3 NVIDIA's Tesla PCIe x16 Card
2-1 Single core matrix multiply
2-2 Single core cache aware matrix multiply
2-3 Multicore matrix multiply using OpenMP
2-4 Tiling a 16×16 matrix using 32 2×4 tiles
2-5 Thread (2,1) of block (1,2) computes C(5,6)
2-6 Thread code to compute one element of c
2-7 Fragment to invoke GPU1
2-8 An 8×8 sub-matrix computed by a (4,8) block of threads
2-9 Computing a 1×2 sub-matrix of C
2-10 A thread computes a 1×2 sub-matrix of c
2-11 A thread computes a 1×4 sub-matrix of c
2-12 A (16,16) thread block computes a 16×16 sub-matrix of c using shared memory
2-13 Bank mapping for bs[16][16]
2-14 Bank mapping for bs[16][17]
2-15 Same as GPU4 except that the transpose of the b sub-matrix is stored in bs
2-16 Updating c values
2-17 A (16,8) thread block computes a 16×128 sub-matrix of C using shared memory (Part a)
2-18 A (16,8) thread block computes a 16×128 sub-matrix of C using shared memory (Part b)
2-19 A (16,8) thread block computes a 16×128 sub-matrix of C using shared memory (Part c)
2-20 Figure 2-18 modified to handle 4 bs in round robin fashion

2-21 Updating c values when as read using float2
2-22 GPU7 modified to handle 4 bs in round robin fashion and read a 16×32 sub-matrix of a (Part a)
2-23 GPU7 modified to handle 4 bs in round robin fashion and read a 16×32 sub-matrix of a (Part b)
2-24 GPU7 modified to handle 4 bs in round robin fashion and read a 16×32 sub-matrix of a (Part c)
2-25 Transactions (in billions) for 16384×16384 matrices
2-26 Volume (in gigabytes) for 16384×16384 matrices
2-27 Runtimes (in seconds) for 16384×16384 matrices
2-28 GFlops for 16384×16384 matrices
2-29 Efficiency for 16384×16384 matrices
2-30 Time versus transactions for 16384×16384 matrices
2-31 Time versus volume for 16384×16384 matrices
2-32 Speedup relative to SingleCoreIJK for 4096×4096 matrices
3-1 Block decomposition of A, B, and C
3-2 Kernel to add two matrices
3-3 GPU kernels in Strassen implementation
3-4 GPU kernels in Douglas et al.'s implementation of Winograd variant
3-5 Strassen's GPU matrix multiply
3-6 Winograd's GPU matrix multiply
3-7 Strassen speedup when 2 = 4096
3-8 Winograd speedup when 2 = 4096
4-1 Data dependency of Smith-Waterman algorithm
4-2 Illustration of BlockedAntidiagonal
4-3 Striped Smith-Waterman algorithm
4-4 Example for StripedAlignment
4-5 Example for ChunkedAlignment

4-6 Plot of running time (ms) of StripedScore for different s values
4-7 Comparison of different scoring algorithms
4-8 Running time (ms) of StripedAlignment when lenQuery = 20860 and lenDB = 29120
4-9 Plot of best running time (ms) of ChunkedAlignment1 and ChunkedAlignment2
5-1 The partitioning of the three-dimensional matrix H
5-2 Some of the chunks traversed by the optimal path
5-3 Plot of running time (s) of traceback methods for real instances
6-1 Syntenic alignment
6-2 Striped syntenic alignment
6-3 Example for STRIPED
6-4 Example for CHUNKED1D
6-5 Plot of running time (ms) of SCORE with different s values
6-6 Speedup of SCORE for variable number of SMs
6-7 Running time (ms) of STRIPED when m and n are both 20000
7-1 Four cases in Nussinov's recurrence
7-2 Initialization of the matrix in Chang's algorithm
7-3 Three stages in Chang's algorithm
7-4 Block partitioning of C matrix
7-5 maxKernel illustration
7-6 Plot of running time of GPU algorithms

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

GPU ALGORITHMS FOR BIOINFORMATICS

By

Junjie Li

May 2014

Chair: Sartaj Sahni
Cochair: Sanjay Ranka
Major: Computer Engineering

Graphics Processing Units (GPUs) were developed originally to meet the computational needs of algorithms for rendering computer graphics. Their low cost per gigaflop has motivated significant research into their efficient use for non-graphics applications. In this work, two parallel matrix multiplication algorithms on an NVIDIA GPU using CUDA were first developed to establish guidelines for developing efficient algorithms on a GPU. Following these guidelines, novel bioinformatics algorithms that run on a GPU were developed, including pairwise sequence alignment, three-sequence alignment, syntenic sequence alignment and RNA secondary structure prediction. All of these algorithms achieved significant performance improvements compared to their sequential versions, which run on a CPU.

CHAPTER 1
INTRODUCTION

Graphics Processing Units (GPUs) were developed originally to meet the computational needs of algorithms for rendering computer graphics. The rapid and enormous growth in sophistication of graphics applications such as computer games has resulted in the availability of GPUs that have thousands of processors and peak performance over a teraflop and that sell for hundreds of dollars to a few thousand dollars. Although GPUs are optimized for graphics calculations, their low cost per gigaflop has motivated significant research into their efficient use for non-graphics applications. The effort being expended in this direction has long-lasting potential because the widespread use of GPUs in the vibrant computer games industry almost ensures the longevity of GPUs. So, unlike traditional multimillion-dollar supercomputers whose development cost had to be borne entirely by a relatively small supercomputing community, GPUs are backed by a very large gaming industry. This makes it more likely that GPU architectures will remain economically viable and will continue to evolve.

Although the cost of a GPU measured as dollars per peak gigaflop is very low, obtaining performance near the peak requires very careful programming of the application code. This programming is complicated by the availability of several different memories (e.g., device memory, shared memory, cache, and registers), each with different latency, and by the partitioning of the available scalar processors or cores into groups called streaming multiprocessors (SMs). A group of N SMs is shown in Figure 1-1, which was taken from [62]. In this chapter, we introduce the architecture of an NVIDIA GPU and its programming model.

1.1 GPU Architecture

NVIDIA's Tesla C1060 GPU, Figure 1-2, is an example of NVIDIA's general purpose parallel computing architecture CUDA (Compute Unified Device Architecture). Figure 1-2 is a simplified version of Figure 1-1 with N = 30 and M = 8. The C1060

Figure 1-1. Streaming multiprocessors of NVIDIA's GPU

comprises 30 SMs and each SM comprises 8 streaming processors (SPs), 16KB of on-chip shared memory, and 16,384 32-bit registers. Each SP has its own integer and single-precision floating point units. Each SM has 1 double-precision floating-point unit and 2 single-precision transcendental function (special function, SF) units that are shared by the 8 SPs in the SM. The 240 SPs of a Tesla C1060 share 4GB of off-chip memory referred to as device or global memory [62]. A C1060 has a peak performance of 933 GFlops of single-precision floating-point operations and 78 GFlops of double-precision operations. The peak of 933 GFlops is for the case when Multiply-Add (MADD) instructions are dual issued with special function (SF) instructions. In the absence of SF instructions, the peak is 622 GFlops (MADDs only). The C1060 consumes 188W of power.

The architecture of the NVIDIA Tesla C2050 (also known as Fermi) corresponds to Figure 1-1 with N = 14 and M = 32. So, a C2050 has 14 SMs and each SM has 32 SPs, giving the C2050 a total of 448 SPs or cores. Although each SP of a C2050 has its own integer, single- and double-precision units, the 32 SPs of an SM share 4 single-precision transcendental function units. An SM has 64KB of on-chip memory that can be configured as 48KB of shared memory with

Figure 1-2. NVIDIA's Tesla C1060 GPU

16KB of L1 cache (default setting) or as 16KB of shared memory with 48KB of L1 cache [62]. Additionally, there are 32K 32-bit registers per SM and 3GB of off-chip device/global memory that is shared by all 14 SMs. The peak performance of a C2050 is 1,288 GFlops (or 1.288 TFlops) of single-precision operations and 515 GFlops of double-precision operations, and the power consumption is 238W [63]. Once again, the peak of 1,288 GFlops requires that MADDs and SF instructions be dual issued. When there are MADDs alone, the peak single-precision rate is 1.03 TFlops.

Notice that the ratio of power consumption to peak single-precision GFlop rate is 0.2 W/GFlop for the C1060 and 0.18 W/GFlop for the C2050. The corresponding ratio for double-precision operations is 2.4 W/GFlop for the C1060 and 0.46 W/GFlop for the C2050. In NVIDIA parlance, the C1060 has compute capability 1.3 while the compute capability of the C2050 is 2.0. Details about how to write high-performance programs utilizing the GPU's architecture will be explored in Chapter 2.

A Tesla GPU is packaged as a double-wide PCIe card (Figure 1-3) and, using an appropriate motherboard and a sufficiently large power supply, one can install up to 4 GPUs on the same motherboard. In this dissertation, we focus on single GPU computation.

1.2 Programming Model

At a high level, a GPU uses the master-slave programming model [91] in which the GPU operates as a slave under the control of a master or host processor. Programming in the master-slave model requires us to write a program that runs on the master

Figure 1-3. NVIDIA's Tesla PCIe x16 Card

processor. This master program sends data to the slave(s), invokes a kernel or function that runs on the slave(s) and processes this sent data, and finally receives the results back from the slave. This process of sending data to the slave, executing a slave kernel, and receiving the computed results may be repeated several times by the master program. In CUDA, the host/master and GPU/slave codes may be written in C. CUDA provides extensions to C to allow for data transfer to/from device memory and for kernel/slave code to access registers, shared memory, and device memory.

At another level, GPUs use the SIMT (single instruction multiple thread) programming model in which the GPU accomplishes a computational task using thousands of lightweight threads. The threads are grouped into blocks and the blocks are organized as a grid. While a block of threads may be 1-, 2-, or 3-dimensional, the grid of blocks may only be 1- or 2-dimensional. Kernel invocation requires the specification of the block and grid dimensions along with any parameters the kernel may have. This is illustrated below for a matrix multiply kernel MatrixMultiply that has the parameters a, b, c, and n, where a, b, and c are pointers to the start of the row-major representation of n×n matrices and the kernel computes c = a×b.

MatrixMultiply<<<grid, block>>>(a, b, c, n)

A GPU has a block scheduler that dynamically assigns thread blocks to SMs. Since all the threads of a thread block are assigned to the same SM, the threads of a block may communicate with one another via the shared memory of an SM. Further, the
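The grid dimensions in such an invocation are normally derived from the problem size and the chosen block shape. The sketch below illustrates this host-side arithmetic for a kernel in which each thread computes one element of the n×n product; the struct and function names are mine (standing in for CUDA's dim3), not from the text.

```c
#include <assert.h>

/* Illustrative stand-in for CUDA's dim3 (x and y dimensions only). */
typedef struct { int x, y; } Dim2;

/* Grid dimensions needed so that a grid of block-shaped tiles covers an
   n x n matrix; rounds up so boundary elements also get a thread. */
Dim2 gridFor(int n, Dim2 block)
{
   Dim2 grid;
   grid.x = (n + block.x - 1) / block.x;
   grid.y = (n + block.y - 1) / block.y;
   return grid;
}
```

For n = 16384 and a 16×16 block, gridFor returns a 1024×1024 grid; the launch would then read MatrixMultiply<<<grid, block>>>(a, b, c, n).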

resources needed by a block of threads (e.g., registers and shared memory) should be sufficiently small that a block can be run on an SM. The block scheduler assigns more than 1 block to run concurrently on an SM when the combined resources needed by the assigned blocks do not exceed the resources available to an SM. However, since CUDA provides no mechanism to specify a subset of blocks that are to be co-scheduled on an SM, threads of different blocks can communicate only via the device memory. Once a block is assigned to an SM, its threads are scheduled to execute on the SM's SPs by the SM's warp scheduler.

The warp scheduler divides the threads of the blocks assigned to an SM into warps of 32 consecutively indexed threads from the same block. Multidimensional thread indexes are serialized in row-major order for partitioning into warps. So, a block that has 128 threads is partitioned into 4 warps. Every thread currently assigned to an SM has its own instruction counter and set of registers. The warp scheduler selects a warp of ready threads for execution. If the instruction counters for all threads in the selected warp are the same, all 32 threads execute in 1 step. On a GPU with compute capability 2.0, each SM has 32 cores and so all 32 threads can perform their common instruction in parallel, provided, of course, this common instruction is an integer or floating point operation. On a GPU with compute capability 1.3, each SM has 8 SPs and so a warp can execute the common instruction for only 8 threads in parallel. Hence, when the compute capability is 1.3, the GPU takes 4 rounds of parallel execution to execute the common instruction for all 32 threads of a warp. When the instruction counters of the threads of a warp are not the same, the GPU executes the different instructions serially. Note that the instruction counters may become different as the result of data dependent conditional branches in a kernel [62].

An SM's warp scheduler is able to hide much of the 400 to 600 cycle latency of a device-memory access by executing warps that are ready to do arithmetic while other warps wait for device-memory accesses to complete. So, the performance of code that makes many accesses to device memory can often be improved by optimizing it to
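The row-major serialization of thread indexes into warps can be sketched as follows; the function names are mine, not CUDA built-ins, and the warp size of 32 is as stated above.

```c
#include <assert.h>

/* Row-major serialization of a 2-dimensional thread index (x, y) in a
   block whose x-dimension is blockDimX; x varies fastest. */
int linearThreadId(int x, int y, int blockDimX)
{
   return y * blockDimX + x;
}

/* Warp (of size 32) that the thread belongs to. */
int warpOf(int x, int y, int blockDimX)
{
   return linearThreadId(x, y, blockDimX) / 32;
}
```

In a (16,8) block of 128 threads (4 warps), thread (x = 3, y = 2) has linear index 35 and therefore lies in warp 1; the last thread, (15, 7), has linear index 127 and lies in warp 3.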

increase the number of warps scheduled on an SM (i.e., increase the multiprocessor occupancy, see Section 1.3). This optimization could involve increasing the number of threads per block and/or reducing the shared memory and register utilization of a block to enable the scheduling of a larger number of blocks on an SM.

1.3 Occupancy

The occupancy of a streaming multiprocessor (SM) is defined to be the ratio of the number of warps co-scheduled on a multiprocessor and the GPU's limit, maxWarpsPerSM, on the number of warps that may be co-scheduled on a multiprocessor. So,

occupancy = (number of warps co-scheduled per SM) / maxWarpsPerSM    (1-1)

In practice, the number of warps co-scheduled on a multiprocessor is determined by several factors. These factors may be broadly classified as those that are characteristics of the GPU and those that specify resource requirements of the application kernel. Therefore, for a given GPU, the occupancy of a multiprocessor varies from one kernel to the next, and even when the GPU and kernel are fixed, the occupancy varies with the block size (i.e., number of threads per block) with which the kernel is invoked. Further, the same kernel and block size may result in different occupancy values on (say) a Tesla C1060 and a Tesla C2050 GPU. Since applications with higher occupancy are better able to hide the latency of the GPU global/device memory, increasing the occupancy of an application is one of the strategies used to improve performance.

In describing the factors that contribute to the occupancy of a multiprocessor, we use some of the variable names used in the CUDA occupancy calculator [64]. The GPU factors depend on the compute capability of the GPU.

1. GPU Factors

   (a) The maximum number, maxThreadsPerWarp, of threads in a warp.

   (b) The maximum number of blocks, maxBlocksPerSM, that may be co-scheduled on a multiprocessor.

   (c) Warp allocation granularity, warpAllocationUnit. For purposes of allocating registers to a block, the number of warps in a block is rounded up (if necessary) to make it a multiple of the warp allocation granularity. So, for example, when warpAllocationUnit = 2 and the number of warps in a block is 7, the number of registers assigned to the block is that for 8 warps.

   (d) Registers are allocated to a warp (compute capability 2.0) or block (compute capability 1.3) in units of size regUnit. So, for example, on a GPU with compute capability 2.0 and regUnit = 64, a warp that requires 96 registers would be allocated ⌈96/64⌉ × 64 = 2 × 64 = 128 registers. A block of 6 such warps would be assigned 6 × 128 = 768 registers. The same block on a GPU with compute capability 1.3 and regUnit = 512 would be allocated ⌈6 × 96/512⌉ × 512 = 1024 registers.

   (e) Shared memory granularity, sharedMemUnit. Shared memory is allocated in units whose size equals sharedMemUnit. So, when sharedMemUnit = 128, a block that requires 900 bytes of shared memory is allocated ⌈900/128⌉ × 128 = 8 × 128 = 1024 bytes of shared memory.

   (f) Number of registers, numReg, available per multiprocessor.

   (g) Shared memory, sharedMemSize, available per multiprocessor.

2. Kernel Factors

   (a) Number, myThreadsPerBlock, of threads in a block.

   (b) Number, myRegPerThread, of registers required per thread.

   (c) Amount, mySharedMemPerBlock, of shared memory required per block.

In addition to the explicit limit maxBlocksPerSM, the number of blocks that may be co-scheduled on a multiprocessor is constrained by the limit, maxWarpsPerSM, on the number of warps as well as by the number of registers and the size of the shared memory. Table 1-1 from [64] gives the values of the GPU specific factors for GPUs with compute capabilities 1.3 (e.g., Tesla C1060) and 2.0 (e.g., Tesla C2050). From these values, we see that at most 8 blocks may be co-scheduled on a multiprocessor of a GPU with compute capability 1.3 or 2.0. Further, if a block comprises 10 warps, the number of blocks that may be co-scheduled on a multiprocessor is reduced from 8 to ⌊32/10⌋ =

3 when the compute capability is 1.3 and to ⌊48/10⌋ = 4 when the compute capability is 2.0.

Table 1-1. GPU constraints for compute capabilities 1.3 and 2.0

variable / compute capability     1.3      2.0
maxBlocksPerSM                      8        8
maxThreadsPerWarp                  32       32
maxWarpsPerSM                      32       48
numReg                         16,384   32,768
regUnit                           512       64
sharedMemUnit                     512      128
sharedMemSize                  16,384   49,152
warpAllocationUnit                  2        1

To determine the occupancy of a kernel, we need to determine the number, WPM, of warps that will be co-scheduled on a multiprocessor (Equation 1-1). This number is the product of the number, BPM, of blocks that get co-scheduled on a multiprocessor and the number, WPB, of warps per block. From the definition of the GPU and kernel factors that contribute to occupancy, we see that:

WPB = ⌈myThreadsPerBlock / maxThreadsPerWarp⌉    (1-2)

BPM = min{maxBlocksPerSM, ⌊maxWarpsPerSM / WPB⌋, ⌊numReg / myRegPerBlock⌋, ⌊sharedMemSize / myMemPerBlock⌋}    (1-3)

where

myRegPerBlock = ⌈GWPB × myRegPerThread × maxThreadsPerWarp / regUnit⌉ × regUnit   (capability 1.3)
myRegPerBlock = ⌈myRegPerThread × maxThreadsPerWarp / regUnit⌉ × regUnit × WPB   (capability 2.0)    (1-4)

GWPB = ⌈WPB / warpAllocationUnit⌉ × warpAllocationUnit    (1-5)

myMemPerBlock = ⌈mySharedMemPerBlock / sharedMemUnit⌉ × sharedMemUnit    (1-6)

As an example, consider a kernel that requires 20 registers per thread and 3000 bytes of shared memory. Assume that the kernel is invoked on a C1060 (compute capability 1.3) with a block size of 128. So, myThreadsPerBlock = 128, myRegPerThread = 20, and mySharedMemPerBlock = 3000. From Equations 1-2 to 1-6, we obtain:

WPB = ⌈128/32⌉ = 4
GWPB = ⌈4/2⌉ × 2 = 4
myRegPerBlock = ⌈4 × 20 × 32/512⌉ × 512 = 2560
myMemPerBlock = ⌈3000/512⌉ × 512 = 3072
BPM = min{8, ⌊32/4⌋, ⌊16384/2560⌋, ⌊16384/3072⌋} = min{8, 8, 6, 5} = 5

For our example kernel, we see that the number of blocks that may be co-scheduled on a multiprocessor is limited by the available shared memory. From Equation 1-1, we get:

occupancy = BPM × WPB / maxWarpsPerSM = 5 × 4/32 = 0.63
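This occupancy computation is mechanical enough to express directly in code. The sketch below (function and macro names are mine, not from the CUDA occupancy calculator) implements the compute-capability-1.3 rules of Section 1.3 using the Table 1-1 constants and reproduces the example just worked.

```c
#include <assert.h>

/* GPU limits for compute capability 1.3 (Table 1-1). */
#define MAX_BLOCKS_PER_SM     8
#define MAX_THREADS_PER_WARP  32
#define MAX_WARPS_PER_SM      32
#define NUM_REG               16384
#define REG_UNIT              512
#define SHARED_MEM_UNIT       512
#define SHARED_MEM_SIZE       16384
#define WARP_ALLOCATION_UNIT  2

static int ceilDiv(int a, int b) { return (a + b - 1) / b; }
static int min2(int a, int b) { return a < b ? a : b; }

/* Occupancy of a kernel on a compute-capability-1.3 SM. */
double occupancy13(int myThreadsPerBlock, int myRegPerThread,
                   int mySharedMemPerBlock)
{
   int WPB = ceilDiv(myThreadsPerBlock, MAX_THREADS_PER_WARP);
   int GWPB = ceilDiv(WPB, WARP_ALLOCATION_UNIT) * WARP_ALLOCATION_UNIT;
   int myRegPerBlock = ceilDiv(GWPB * myRegPerThread * MAX_THREADS_PER_WARP,
                               REG_UNIT) * REG_UNIT;
   int myMemPerBlock = ceilDiv(mySharedMemPerBlock, SHARED_MEM_UNIT)
                       * SHARED_MEM_UNIT;
   int BPM = min2(min2(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM / WPB),
                  min2(NUM_REG / myRegPerBlock,
                       SHARED_MEM_SIZE / myMemPerBlock));
   return (double)(BPM * WPB) / MAX_WARPS_PER_SM;
}
```

For the example kernel (128 threads per block, 20 registers per thread, 3000 bytes of shared memory) this returns 5 × 4/32 = 0.625, which the text reports rounded to 0.63; reducing the shared memory requirement to 1000 bytes raises the result to 0.75.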

Since BPM ≤ 8 for the C1060 (Equation 1-3) and WPB = 4 for our example kernel, optimizing our example kernel to use less shared memory and fewer registers could raise occupancy to 8 × 4/32 = 1.0. So, if we reduced the shared memory requirement of a block to 1000 bytes while keeping register usage the same, BPM = min{8, 8, 6, ⌊16384/1024⌋} = min{8, 8, 6, 16} = 6 and occupancy = 6 × 4/32 = 0.75. Now, the occupancy is limited by the number of registers. Note that if we reduce the number of threads per block to 64, WPB = 2 and occupancy ≤ 8 × 2/32 = 0.5.

1.4 GPGPU Computing in Bioinformatics

Bioinformatics is an interdisciplinary field where biology data are processed using the knowledge of computer science, mathematics and engineering. Unlike traditional disciplines such as mathematics, physics or chemistry, bioinformatics is very new and is still absorbing new techniques and methods from other fields. Bioinformatics algorithms usually have high computational complexities. Besides, the volume of data to be processed can be large. For example, if n stands for the size of a viral RNA sequence, it can be on the scale of several million. Due to this high complexity, it becomes critical to have an efficient way to parallelize bioinformatics algorithms. Although much research has been done in this regard on the IBM Cell BE (e.g., [73][72][78][70][51][89][90][79][94]), FPGAs (e.g., [11][40][31][35][53][1][30]) and CPU clusters (e.g., [67][80][82]), relatively little work has been done on GPUs. In the following chapters, two matrix multiplication algorithms that run on a GPU are first developed to show the guidelines for developing efficient algorithms on a GPU. Following these guidelines, novel bioinformatics algorithms that run on a GPU are developed, including pairwise sequence alignment, syntenic sequence alignment, three-sequence alignment and RNA secondary structure prediction.


CHAPTER 2
GPU MATRIX MULTIPLICATION

2.1 Single Core Matrix Multiply

Figure 2-1 gives the classical textbook algorithm to multiply two n×n matrices A and B stored in row-major order in the one-dimensional arrays a and b. The result matrix C = A×B is returned, also in row-major order, in the one-dimensional array c. This classical algorithm is referred to as the ijk algorithm because the three nested for loops are in this order. It is well known that computing the product of two matrices in ikj order as in Figure 2-2 (see [71], for example) reduces cache misses and so results in better performance when n is large. Table 2-1 gives the run times for the ijk and ikj versions on our 2.8GHz Xeon quad-core processor. As expected, the ijk version is faster for small n because of its smaller overhead. But, for large n, the cache effectiveness of the ikj version enables it to outperform the ijk version. For n = 4096, for example, the ijk version takes almost 7 times as much time as taken by the ikj version! In computing the ratio ijk/ikj, we have used the measured run times with 3 decimal digits of precision. This explains the slight difference between the ratios reported in Table 2-1 and what you would get using the two-decimal-digit times given in this table. In the remainder of this chapter also, we compute derived results such as speedup using data with 3-digit precision even though reported data have 2-digit precision.

Table 2-1. Run times (s) for the ijk and ikj versions of matrix multiply

  n      ijk       ikj      ijk/ikj
  256    0.11      0.14     0.80
  512    0.93      1.12     0.83
  1024   10.41     8.98     1.16
  2048   449.90    72.69    6.19
  4096   4026.17   581.28   6.93


void SingleCoreIJK(float *a, float *b, float *c, int n)
{// Single core algorithm to multiply two
 // n x n row-major matrices a and b.
   for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
      {
         float sum = 0;
         for (int k = 0; k < n; k++)
            sum += a[i*n + k] * b[k*n + j];
         c[i*n + j] = sum;
      }
}

Figure 2-1. Single-core ijk matrix multiply

void SingleCoreIKJ(float *a, float *b, float *c, int n)
{// Single core ikj algorithm to multiply two
 // n x n row-major matrices a and b.
   for (int i = 0; i < n*n; i++)
      c[i] = 0;
   for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++)
      {
         float aik = a[i*n + k];
         for (int j = 0; j < n; j++)
            c[i*n + j] += aik * b[k*n + j];
      }
}

Figure 2-2. Single-core ikj matrix multiply

void MultiCore(float *a, float *b, float *c, int n, int t)
{// Multicore algorithm to multiply two
 // n x n row-major matrices a and b.
 // t is the number of threads.
   #pragma omp parallel shared(a, b, c, n, t)
   {
      int tid = omp_get_thread_num(); // thread ID
      multiply(a, b, c, n, t, tid);
   }
}

void multiply(float *a, float *b, float *c, int n, int t, int tid)
{// compute an (n/t) x n sub-matrix of c; t is the number of threads
 // tid, 0 <= tid < t, is the thread ID
 // uses the cache-efficient ikj order
   for (int i = tid*n/t; i < (tid + 1)*n/t; i++)
   {
      for (int j = 0; j < n; j++)
         c[i*n + j] = 0;
      for (int k = 0; k < n; k++)
      {
         float aik = a[i*n + k];
         for (int j = 0; j < n; j++)
            c[i*n + j] += aik * b[k*n + j];
      }
   }
}

Figure 2-3. Multicore matrix multiply using OpenMP

Table 2-2. Run times (s) for MultiCore
t is the number of threads; Ti is the time using i threads, i ∈ {1, 2, 4, 8}

  t \ n   256    512    1024   2048    4096
  1       0.14   1.12   8.98   72.67   581.24
  2       0.07   0.56   4.50   36.39   291.14
  4       0.04   0.28   2.26   18.21   146.37
  8       0.04   0.28   2.28   18.42   146.98
  T1/T2   2.00   2.00   2.00   2.00    2.00
  T1/T4   4.00   3.97   3.98   3.99    3.97
  T1/T8   3.89   4.00   3.94   3.95    3.95

Using t = 8 threads does not result in a larger speedup as we have only 4 cores to execute these threads on. Actually, when we go from t = 4 to t = 8, there is a very small drop in performance for most values of n because of the overhead of scheduling 8 threads rather than 4.

2.3 GPU Matrix Multiply

Although SingleCoreIKJ (Figure 2-2) is faster than SingleCoreIJK when n is large, we do not expect a massively threaded adaptation of SingleCoreIKJ to outperform a similarly threaded version of SingleCoreIJK because SingleCoreIKJ makes O(n³) accesses to c while SingleCoreIJK makes O(n²) accesses, and c resides in device memory. (Note that both versions make O(n³) accesses to a and b.) So, we focus on obtaining a GPU kernel based on SingleCoreIJK. We begin by tiling an n×n matrix using p×q tiles as in Figure 2-4. For simplicity, we assume that p and q divide n so that an integral number of tiles is needed to cover the matrix. We may index the tiles using a 2-dimensional index (u, v), 0 ≤ u < n/p, 0 ≤ v < n/q.

Figure 2-4. Tiling a 16×16 matrix using 32 2×4 tiles

The block of threads with index (u, v) computes the sub-matrix defined by tile (u, v) of C. In CUDA, a thread may determine the coordinates of the block (u, v) that it is part of using the variables blockIdx.x and blockIdx.y. So, (u, v) = (blockIdx.x, blockIdx.y). Several different GPU adaptations of SingleCoreIJK are possible. These differ in their usage of registers, shared memory, number of C elements computed per thread, and so on. Different adaptations result in different performance. In the following, we examine some of the possible adaptations.

2.3.1 A Thread Computes a 1×1 Sub-matrix of C

Kernel Code

In our first adaptation of SingleCoreIJK, each thread computes exactly one element of a tile. So, the number of threads in a block equals the number of elements p×q in a tile. Threads of a block are indexed using the same convention as used to index blocks. That is, thread (0, 0) computes the top left element of a tile, the first coordinate increases left to right, and the second coordinate increases top to bottom. With this convention, the index of a thread within a block is (threadIdx.x, threadIdx.y). Notice that threadIdx.x is in the range [0, q) and threadIdx.y is in the range [0, p). A CUDA kernel may determine


the dimensions of a thread block using the variables blockDim.x and blockDim.y. For our example, p = blockDim.y and q = blockDim.x.

The convention to index elements of a matrix is different from that used to index blocks in a grid and threads in a block. C(i, j) refers to element (i, j) of C; i increases top to bottom while j increases left to right; C(0, 0) is the top-left element. Because of this difference in indexing convention, overlaying a sub-matrix tile with a thread block results in thread (d, e) corresponding to element (e, d) of the tile/sub-matrix.

Figure 2-5 illustrates the work done by the thread that computes C(5, 6) of an 8×8 matrix C. Matrix C is tiled using 2×4 tiles. The total number of tiles is 8. The sub-matrix defined by a tile is computed by a block of threads. The dimensions of a block are (4, 2). That is, blockDim.x = 4 and blockDim.y = 2. The dimensions of the grid of blocks are (2, 4). C(5, 6) is in tile/block (1, 2) of the grid and it is computed by thread (2, 1) (i.e., threadIdx.x = 2 and threadIdx.y = 1) of the block. When writing the code, we have to do the inverse computation. That is, a thread has to determine which element, C(i, j), of the C matrix it is to compute. A thread does this inverse computation using its own index as well as the index of the block it is part of.

To compute C(5, 6), thread (2, 1) of block (1, 2) reads the pairs (A(5, k), B(k, 6)), one pair at a time, multiplies the two components of the pair, and adds the product to a variable, temp, whose final value will be C(5, 6). A(5, 0) and B(0, 6) are located in their row-major representations a and b using the standard row-major formula (i.e., X(u, v) of an n×n matrix X is at position u×n + v in its row-major representation x).

Figure 2-6 gives the kernel code for our first adaptation, GPU1, of SingleCoreIJK. This code defines the computation done by a single thread. The first two lines determine the matrix element (i, j) to be computed by the thread. The remaining lines of code in this figure compute C(i, j) = c[i*n + j]. The variables i, j, k, and temp are assigned to registers while a, b, and c are arrays that reside in device memory.
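This inverse computation can be checked on the host with a small sketch; the struct Idx2 and the function elementForThread are our illustrative names, not CUDA API.

```cpp
#include <cassert>

// Host-side version of the index computation done by each GPU1 thread:
// given block and thread indices (CUDA convention) and the block
// dimensions, return the (row, col) of the C element the thread computes.
struct Idx2 { int x, y; };

void elementForThread(Idx2 blockIdx, Idx2 threadIdx, Idx2 blockDim,
                      int *i, int *j) {
    *i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    *j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
}
```

For the Figure 2-5 example, thread (2, 1) of block (1, 2) with blockDim (4, 2) maps to element (5, 6), as expected.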


Figure 2-5. Thread (2, 1) of block (1, 2) computes C(5, 6)

__global__ void GPU1(float *a, float *b, float *c, int n)
{// thread code to compute one element of c
 // element (i, j) of the product matrix is to be computed
   int i = blockIdx.y*blockDim.y + threadIdx.y;
   int j = blockIdx.x*blockDim.x + threadIdx.x;
   float temp = 0;
   for (int k = 0; k < n; k++)
      temp += a[i*n + k] * b[k*n + j];
   c[i*n + j] = temp;
}

Figure 2-6. Kernel GPU1

Figure 2-7 gives a fragment of the host code used to invoke the kernel GPU1.

// copy from host to device
cudaMemcpy(d_A, h_A, memSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, memSize, cudaMemcpyHostToDevice);
// define grid and block dimensions
dim3 grid(n/q, n/p);
dim3 block(q, p);
// invoke kernel
GPU1<<<grid, block>>>(d_A, d_B, d_C, n);
// wait for all threads to complete
cudaThreadSynchronize();
// copy result matrix from device to host
cudaMemcpy(h_C, d_C, memSize, cudaMemcpyDeviceToHost);

Figure 2-7. Fragment to invoke GPU1

The fragment of Figure 2-7 first copies A and B from the host to the device. Then, the grid and block dimensions are defined and the kernel that defines a thread computation is invoked. Following the invocation of this kernel, we wait for all threads to complete and then copy the computed matrix product back to the host.

Tile/Block Dimensions

Before we can run our code on a GPU, we must decide the tile dimensions p and q, which in turn determine the block dimensions (q, p) [1]. Since NVIDIA GPUs limit the number of threads in a block to be no more than 512, p×q ≤ 512. Since threads are scheduled in units of a warp and a warp is 32 threads, we would like p×q to be a multiple of 32. So, when using a square block, the only possible choices for the block dimensions are 8×8 and 16×16. When using a rectangular block, additional choices (e.g., 16×4, 128×2, 8×64) become available. Some of these block sizes may be infeasible because they require more resources (e.g., registers and shared memory) than are available on

[1] By convention, the dimensions of a p×q block of threads are specified as (q, p).


an SM. An instrumentation of the kernel GPU1 reveals that it needs 10 registers per thread and 44 bytes of shared memory (this includes registers and shared memory required for any compiler generated temporary variables and shared memory used to store the values of kernel parameters) [2]. So, the maximum block size (i.e., p×q) for GPU1 is not constrained by the number of available registers or by the size of shared memory but only by the NVIDIA imposed limit of 512 threads per block.

Using too small a block size will result in an occupancy less than 1 (Section 1.3). For example, when the block size is 64 (say an 8×8 block is used), the number of blocks co-scheduled on an SM is 8 (this is the value of maxBlocksPerSM, Table 1-1). So, only 16 warps are co-scheduled on an SM and the occupancy is 16/32 = 0.5 for compute capability 1.3 and 16/48 = 0.33 for compute capability 2.0.

Run Time

Table 2-3 gives the time taken by GPU1 for different values of n and different block dimensions. These times were obtained on a C1060, which is of compute capability 1.3. The shown times do not include the time taken to copy matrices A and B to device memory or that needed to copy C from device memory to host memory. This table also gives the multiprocessor occupancy. Recall that when the block dimensions are (x, y), we use a y×x arrangement of threads and a y×x tile. So, each block computes a y×x sub-matrix of the product matrix. Hence, when x = 1, as is the case for the first 5 rows of Table 2-3, each thread block computes y contiguous elements of a column of C and when y = 1 (last 5 rows of Table 2-3), each thread block computes x contiguous elements of a row of C. Observe that when n = 16384 the best ((256, 1)) and worst ((1, 512)) choices for block dimensions result in performances that differ by a factor of

[2] The register and shared memory utilization as well as the occupancy of our GPU codes reported in this chapter were obtained by using the ptx option to the CUDA compiler.


almost 30. This is despite the fact that both choices result in an occupancy of 1.0. Interestingly, when 16×16 thread blocks are used, as is done in [62], the run time for n = 16384 is 28% more than when the thread block dimensions are (256, 1).

Table 2-3. C1060 run times (s) for GPU1 for different block dimensions

  (x, y)    occupancy  n=2048  n=4096  n=8192  n=16384
  (1, 32)   0.25       5.94    44.13   397.46  4396.39
  (1, 64)   0.50       5.62    43.69   441.52  6026.91
  (1, 128)  1.00       6.67    51.15   525.42  7341.49
  (1, 256)  1.00       6.53    51.80   594.71  8612.61
  (1, 512)  1.00       6.97    55.55   662.66  9373.16
  (8, 8)    0.50       1.18    7.95    64.49   540.03
  (16, 16)  1.00       0.76    5.59    46.75   420.82
  (16, 32)  1.00       0.74    5.64    49.37   499.72
  (32, 2)   0.50       0.65    5.25    44.59   372.27
  (32, 16)  1.00       0.74    5.53    43.50   361.24
  (32, 1)   0.25       0.73    5.79    45.09   356.42
  (64, 1)   0.50       0.64    5.16    41.50   360.11
  (128, 1)  1.00       0.72    5.30    40.90   336.87
  (256, 1)  1.00       0.71    5.23    40.86   327.64
  (512, 1)  1.00       0.76    5.51    42.30   327.68

We can explain some of the difference in the observed run times of GPU1 by the number of device-memory accesses that are made and the amount of data transferred between the device memory and the SMs (we shall see shortly that the amount of data transferred may vary from one access to the next) for each chosen (x, y). Recall that the latency of a device-memory access is 400 to 600 cycles [62]. The warp scheduler is often able to hide much or all of this latency by scheduling the execution of warps that are ready to do arithmetics while other warps are stalled waiting for a device-memory access to complete. To succeed with this strategy, there must, of course, be ready warps waiting to execute. So, it often pays to minimize the number of device-memory accesses as well as the amount (or volume) of data transferred.


Number of Device-Memory Accesses

For devices with compute capability 1.3, device-memory accesses are scheduled on a per-half-warp basis [3]. When accessing 4-byte data, a 128-byte segment of device memory may be accessed with a single device-memory transaction. If all data to be accessed by a half warp lie in the same 128-byte segment of device memory, a single memory transaction suffices. If the data to be accessed lie in (say) 3 different segments, then 3 transactions are required and each incurs the 400 to 600 cycle latency. Since, for compute capability 1.3, device-memory accesses are scheduled half-warp at a time, servicing the device-memory accesses of a common instruction over all 32 threads of a warp requires at least 2 transactions.

Devices with compute capability 2.0 schedule device-memory accesses on a per-warp basis. So, if each thread of a warp accesses a 4-byte word and all threads access data that lie in the same 128-byte segment, a single memory transaction suffices. Since devices with compute capability 2.0 cache device-memory data in L1 and L2 cache using cache lines that are 128 bytes wide, the throughput of a memory transaction is much higher whenever there is a cache hit [62].

To analyze the number of device-memory transactions done by GPU1 and subsequent GPU codes, we assume that the arrays a, b, and c are all segment aligned (i.e., each begins at the boundary of a 128-byte segment of device memory). Consider the statement temp += a[i*n+k]*b[k*n+j] for any fixed value of k. When blockDim.x ≥ 16 and a power of 2 (e.g., (32, 1), (256, 1), (16, 16), (32, 16)), in each iteration of the for loop of GPU1, the threads of a half warp access the same value of a and the accessed values of b lie in the same 128-byte segment. So, when the compute

[3] The partitioning of a block of threads into half warps and warps is done by first mapping a thread index to a number. When the block dimensions are (Dx, Dy), the thread (x, y) is mapped to the number x + y×Dx [62]. Suppose that threads t1 and t2 map to the numbers m1 and m2, respectively. t1 and t2 are in the same half warp (warp) iff ⌊m1/16⌋ = ⌊m2/16⌋ (⌊m1/32⌋ = ⌊m2/32⌋).


capability is 1.3, 2 transactions are needed to fetch the a and b values needed by a half warp in each iteration of the for loop. Note that 2 transactions (one for a and the other for b) suffice to fetch the values needed by a full warp when the compute capability is 2.0 and blockDim.x is a power of 2 and ≥ 32. When the compute capability is 2.0 and blockDim.x = 16, 2 transactions are needed to read the values of b needed by a warp (because these lie in two different 128-byte segments) and 2 are needed for the two different values of a that are to be read (these also lie in 2 different segments).

The statement c[i*n+j] = temp does a write access on c. When blockDim.x = 16, a warp needs 2 transactions regardless of the compute capability. When blockDim.x ≥ 32 and a power of 2, 2 transactions are required for compute capability 1.3 and 1 for capability 2.0.

We note that the for loop is iterated n times and that the total number of half warps is n²/16. So, when blockDim.x ≥ 16 and a power of 2, the total number of memory transactions is n³/16 (for a) + n³/16 (for b) + n²/16 (for c) = n³/8 + n²/16, for compute capability 1.3. For compute capability 2.0 and blockDim.x = 16, the number of transactions is also n³/8 + n²/16, because each warp accesses two values of a that lie in different segments, and the b values accessed lie in 2 segments as do the c values. However, when blockDim.x ≥ 32 and a power of 2, the number of transactions is n³/16 + n²/32 for compute capability 2.0. Additionally, several of the memory transactions may be serviced from cache when the compute capability is 2.0.

To simplify the remaining discussion, we limit ourselves to devices with compute capability 1.3. For such devices, GPU1 has the same number, n³/8 + n²/16, of memory transactions whenever x ≥ 16 and a power of 2. This is true for all but 6 rows of Table 2-3. The first 5 rows of this table have x = 1. When x = 1, the threads of a half warp access a values that are on the same column (and so different rows) of A and, for n > 32, the accessed values are in different 128-byte segments. Each half warp, therefore, makes 16 transactions to fetch the needed a values on each iteration of the


for loop. On each iteration of the for loop, the threads of a half warp fetch a common b value using a single transaction. Since the c values written by a half warp are on the same column of C, the half warp makes 16 write transactions. The total number of transactions is n³ (for a) + n³/16 (for b) + n² (for c) = 17n³/16 + n², which is substantially more than when x ≥ 16 and a power of 2.

For the 8×8 case, a half warp fetches, on each iteration of the for loop, 2 a values from different segments and 8 different b's that are in the same segment. So, over all n iterations of the for loop, a half warp makes 2n a transactions and n b transactions. Since there are n²/16 half warps, the number of read transactions is n³/8 (for a) + n³/16 (for b). The c values written out by a half warp fall into 2 segments. So, the number of write transactions is n²/8. The total number of device-memory transactions is 3n³/16 + n²/8.

Performance is determined not just by the number of memory transactions but also by their size. Although a GPU can transfer 128 bytes of data between device memory and an SM in one transaction, the actual size of the transaction (i.e., the amount of data transferred) may be 32, 64, or 128 bytes. A 128-byte segment is naturally decomposed into 4 32-byte subsegments or 2 64-byte subsegments. If the data to be transferred lie in a single 32-byte subsegment, the transaction size is 32 bytes. Otherwise, if the data lie in a single 64-byte subsegment, the transaction size is 64 bytes. Otherwise, the transaction size is 128 bytes. For GPU1, we see that when x ≥ 16 and a power of 2, 32-byte transactions are used for a and 64-byte transactions are used for b and c. For the remaining cases of Table 2-3, 32-byte transactions are used for a, b, and c. Since it is possible to do 4 times as many 32-byte transactions (or twice as many 64-byte transactions) in the same time it takes to do one 128-byte transaction, performance may be improved by reducing the transaction size while keeping the number of transactions steady.
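The transaction counts above follow mechanically from two rules: threads are grouped into half warps by the linear number x + y×Dx, and a half warp needs one transaction per distinct 128-byte segment it touches. A host-side sketch (the helper names are ours, not CUDA API):

```cpp
#include <cassert>
#include <set>

// Linear thread number, per the partitioning rule in the text.
int threadNumber(int x, int y, int Dx) { return x + y * Dx; }

// Count the distinct 128-byte segments touched when each of the 16
// threads of a half warp reads one 4-byte element at array index idx[t]
// of a segment-aligned float array (compute capability 1.3 model).
int segmentsTouched(const long long idx[16]) {
    std::set<long long> segs;
    for (int t = 0; t < 16; t++)
        segs.insert(idx[t] * 4 / 128);  // byte offset / segment size
    return (int)segs.size();
}
```

For n = 4096, a half warp reading 16 consecutive elements of a row touches 1 segment, while a half warp reading one column element per thread (the x = 1 case) touches 16 segments, matching the 1-versus-16 transaction counts derived above.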


Average bandwidth utilization (ABU) and volume are two other important metrics with respect to device-memory transactions. Consider the case when x ≥ 16 and a power of 2. The reads of a are accomplished using 32-byte transactions. However, each such 32-byte payload carries only 4 bytes of useful data (i.e., the solitary a value to be used by all threads of the half warp). So, these transactions effectively utilize only 12.5% of the bandwidth required by a 32-byte transaction. This is the minimum utilization possible for a 32-byte transaction (recall that we are limiting our discussion to 4-byte data). A 64-byte transaction has at least 4 bytes of useful data in each of its 2 32-byte subsegments (otherwise, the transaction size would be 32 bytes). So, its minimum utilization is also 12.5%. A 128-byte transaction, on the other hand, may have only 4 bytes of useful data in each of its 2 64-byte subsegments, yielding a utilization of only 6.25%.

When x ≥ 16 and a power of 2, the n³/16 transactions on a have a utilization of 12.5%; the n³/16 transactions on b have a utilization of 100% as do the n²/16 transactions on c. For large n, we may ignore the transactions on c and compute the average bandwidth utilization of a transaction as (12.5 + 100)/2 = 56.25%. When x = 1, the utilization of each transaction is 12.5%. So, the ABU is 12.5%. For the 8×8 case, the n³/8 transactions on a have a utilization of 12.5% while the utilization of the n³/16 transactions on b is 100%. So, the ABU is (2×12.5 + 100)/3 = 41.7%.

The volume of data transfer between the SMs and device memory is the sum of the number of transactions of each size multiplied by the transaction size. So, when x ≥ 16 and a power of 2, the volume is n³/16 × 32 + n³/16 × 64 + n²/16 × 64 = 6n³ + 4n² bytes. Data volume divided by the bandwidth between device memory and the SMs is a lower bound on the time spent transferring data between device memory and the SMs. This also is a lower bound on the time for the entire computation. So, it often pays to reduce the volume of data transfer.
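The volume arithmetic for the x ≥ 16 case can be sanity-checked numerically; the function below is our back-of-the-envelope sketch using the transaction counts just derived, not a profiler measurement.

```cpp
#include <cassert>

// Data volume (bytes) moved by GPU1 when blockDim.x >= 16 and a power
// of 2: n^3/16 32-byte a-transactions plus n^3/16 64-byte b-transactions
// plus n^2/16 64-byte c-transactions. Should equal 6n^3 + 4n^2.
long long gpu1VolumeWide(long long n) {
    long long aT = n*n*n/16, bT = n*n*n/16, cT = n*n/16;
    return aT*32 + bT*64 + cT*64;
}
```

For any n divisible by 16 the closed form 6n³ + 4n² is reproduced exactly.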


Table 2-4. Device-memory transaction statistics for GPU1

  (x, y)  Total          32-byte        64-byte         Volume       ABU
  (1, ·)  17n³/16 + n²   17n³/16 + n²   0               34n³ + 32n²  12.5%
  (8, 8)  3n³/16 + n²/8  3n³/16 + n²/8  0               6n³ + 4n²    41.7%
  rest    n³/8 + n²/16   n³/16          n³/16 + n²/16   6n³ + 4n²    56.25%

Figure 2-8. An 8×8 sub-matrix computed by a (4, 8) block of threads

Table 2-4 summarizes our analysis of device-memory transactions made by GPU1 for different block dimensions.

2.3.2 A Thread Computes a 1×2 Sub-matrix of C

Kernel Code

We can reduce the number of memory transactions by having each thread read 4 values of a (i.e., a 1×4 sub-matrix) at a time using the data type float4, read 2 values of b (i.e., a 1×2 sub-matrix) at a time using the data type float2, and write 2 values (i.e., a 1×2 sub-matrix) of c at a time also using the data type float2. Now, each thread computes a 1×2 sub-matrix of c. So, for example, we could use a thread block with dimensions (4, 8) to compute an 8×8 sub-matrix of C (this is equivalent to using a tile size of 8×8) as in Figure 2-8. So, for example, thread (0, 0) computes elements (0, 0) and (0, 1) of the sub-matrix while thread (3, 7) computes elements (7, 6) and (7, 7) of the sub-matrix.

The solid circles of Figure 2-9 show the A and B values read from device memory by a thread that computes the 1×2 sub-matrix of C identified by solid circles. The A


Figure 2-9. Computing a 1×2 sub-matrix of C

values are read in units of 4 as shown by the boxes in the figure; the B values are read in units of 2; and the C values are written to device memory in units of 2. Figure 2-10 gives the kernel GPU2 for a thread that computes a 1×2 sub-matrix of C. Suppose we index the 1×2 sub-matrices of C by the tuples (i, j), 0 ≤ i < n, 0 ≤ j < n/2.

__global__ void GPU2(float *a, float *b, float *c, int n)
{// thread code to compute a 1x2 sub-matrix of c
 // cast a, b, and c into types suitable for I/O
   float4 *a4 = (float4 *) a;
   float2 *b2 = (float2 *) b;
   float2 *c2 = (float2 *) c;
   // determine index of 1x2 c sub-matrix to compute
   int i = blockIdx.y*blockDim.y + threadIdx.y;
   int j = blockIdx.x*blockDim.x + threadIdx.x;
   int nDiv2 = n/2;
   int nDiv4 = n/4;
   int aNext = i*nDiv4;
   int bNext = j;
   float2 temp2;
   temp2.x = temp2.y = 0;
   for (int k = 0; k < nDiv4; k++)
   {// process a 1x4 sub-matrix of a and a 4x2 sub-matrix of b
      float4 aIn = a4[aNext++];
      float2 bIn = b2[bNext]; bNext += nDiv2;
      temp2.x += aIn.x*bIn.x; temp2.y += aIn.x*bIn.y;
      bIn = b2[bNext]; bNext += nDiv2;
      temp2.x += aIn.y*bIn.x; temp2.y += aIn.y*bIn.y;
      bIn = b2[bNext]; bNext += nDiv2;
      temp2.x += aIn.z*bIn.x; temp2.y += aIn.z*bIn.y;
      bIn = b2[bNext]; bNext += nDiv2;
      temp2.x += aIn.w*bIn.x; temp2.y += aIn.w*bIn.y;
   }
   c2[i*nDiv2 + j] = temp2;
}

Figure 2-10. Kernel GPU2

Table 2-5. Device-memory transaction statistics for GPU2

  (x, y)  Total            32-byte        64-byte        128-byte       Volume        ABU
  (1, ·)  5n³/32 + n²/2    5n³/32 + n²/2  0              0              5n³ + 16n²    45.0%
  (8, 8)  3n³/64 + n²/16   n³/64          n³/32 + n²/16  0              2.5n³ + 4n²   83.3%
  rest    5n³/128 + n²/32  n³/128         0              n³/32 + n²/32  4.25n³ + 4n²  90.0%

When x = 1, a half warp makes 16 c transactions of size 32 bytes with 25% utilization. The total number of transactions of all types is 5n³/32 + n²/2. When 8×8 thread blocks are used, we need 2 transactions per iteration of the for loop to read the a values and 4 to read the b values. A half warp also needs 2 transactions to write the c values. So, the number of device-memory transactions is n³/64 (for a) + n³/32 (for b) + n²/16 (for c) = 3n³/64 + n²/16. The size of the a transactions is 32 bytes and their utilization is 50%. The b and c transactions are 64 bytes each and have a utilization of 100%. For the remaining block sizes of Table 2-3, the total number of transactions is n³/128 (for a) + n³/32 (for b) + n²/32 (for c). The a transactions are 32 bytes each and have a utilization of 50% while the b and c transactions are 128 bytes and have a utilization of 100%. Table 2-5 summarizes the device-memory transaction properties of GPU2.

Run Time

Table 2-6 gives the time taken by GPU2 for different values of n and different block dimensions. The best performance is obtained using 8×8 blocks. This is not entirely surprising given that the volume of data movement between device memory and SMs is least when the block dimensions are (8, 8) (Table 2-5). Dimensions of the form (1, ·) result in the maximum data volume and this translates into the largest execution times. It is interesting that GPU2 exhibits very little variation in run time over the different dimensions (x, y) with x ≥ 16. There is some correlation between the reduction in data volume when going from GPU1 to GPU2 and the reduction in run time. For example, GPU1 has a volume that is 6.8 times that of GPU2 when block dimensions are of the form (1, ·) and for these dimensions, the run times for the GPU1 code are between


6.6 and 7.1 times that of GPU2 when n = 16384. For the (8, 8) case, the ratio of the volumes is 2.4 and the ratio of the run times for n = 16384 is 2.8. For the remaining cases, the volume ratio is 1.4 and the time ratio is between 1.53 and 1.67 except for 2 cases where the time ratio is 1.97 and 2.3. Though not reported in Table 2-6, the dimensions (4, 16) also did fairly well, computing the product of two 16384×16384 matrices in 223 seconds with an occupancy of 0.50.

Table 2-6. C1060 run times (s) for GPU2 for different block dimensions

  (x, y)    occupancy  n=2048  n=4096  n=8192  n=16384
  (1, 32)   0.25       0.97    6.65    52.35   667.91
  (1, 64)   0.50       0.99    7.43    65.82   885.25
  (1, 128)  1.00       1.07    8.07    77.91   1058.65
  (1, 256)  1.00       1.11    8.79    92.86   1212.03
  (1, 512)  1.00       1.22    10.68   112.78  1378.81
  (8, 8)    0.50       0.26    2.06    17.65   194.04
  (16, 4)   0.50       0.41    3.29    27.23   223.48
  (16, 16)  1.00       0.42    3.30    27.26   213.73
  (16, 32)  1.00       0.42    3.34    27.38   217.04
  (32, 2)   0.50       0.41    3.28    26.69   217.58
  (32, 16)  1.00       0.42    3.33    26.88   210.48
  (32, 1)   0.25       0.41    3.28    26.69   229.14
  (64, 1)   0.50       0.41    3.28    26.72   216.23
  (128, 1)  1.00       0.41    3.29    26.76   216.13
  (256, 1)  1.00       0.42    3.30    26.77   212.96
  (512, 1)  1.00       0.42    3.31    26.75   211.35

2.3.3 A Thread Computes a 1×4 Sub-matrix of C

Kernel Code

Extending the strategy of Section 2.3.2 so that each thread computes 4 elements of c (say, a 1×4 sub-matrix) enables each thread to read 4 elements of b at a time using the data type float4 and also to write 4 elements of c at a time. Figure 2-11 gives the resulting code, GPU3. When, for example, GPU3 is invoked with 16×16 thread blocks, each thread block computes a 16×64 sub-matrix of c. When the block dimensions are (4, 16), a 16×16 sub-matrix of c is computed by each thread block.


__global__ void GPU3(float *a, float *b, float *c, int n)
{// thread code to compute a 1x4 sub-matrix of c
 // cast a, b, and c into types suitable for I/O
   float4 *a4 = (float4 *) a;
   float4 *b4 = (float4 *) b;
   float4 *c4 = (float4 *) c;
   // determine index of 1x4 c sub-matrix to compute
   int i = blockIdx.y*blockDim.y + threadIdx.y;
   int j = blockIdx.x*blockDim.x + threadIdx.x;
   int nDiv4 = n/4;
   int aNext = i*nDiv4;
   int bNext = j;
   float4 temp4;
   temp4.x = temp4.y = temp4.z = temp4.w = 0;
   for (int k = 0; k < nDiv4; k++)
   {// process a 1x4 sub-matrix of a and a 4x4 sub-matrix of b
      float4 aIn = a4[aNext++];
      float4 bIn = b4[bNext]; bNext += nDiv4;
      temp4.x += aIn.x*bIn.x; temp4.y += aIn.x*bIn.y;
      temp4.z += aIn.x*bIn.z; temp4.w += aIn.x*bIn.w;
      bIn = b4[bNext]; bNext += nDiv4;
      temp4.x += aIn.y*bIn.x; temp4.y += aIn.y*bIn.y;
      temp4.z += aIn.y*bIn.z; temp4.w += aIn.y*bIn.w;
      bIn = b4[bNext]; bNext += nDiv4;
      temp4.x += aIn.z*bIn.x; temp4.y += aIn.z*bIn.y;
      temp4.z += aIn.z*bIn.z; temp4.w += aIn.z*bIn.w;
      bIn = b4[bNext]; bNext += nDiv4;
      temp4.x += aIn.w*bIn.x; temp4.y += aIn.w*bIn.y;
      temp4.z += aIn.w*bIn.z; temp4.w += aIn.w*bIn.w;
   }
   c4[i*nDiv4 + j] = temp4;
}

Figure 2-11. Kernel GPU3

Table 2-7. C1060 run times (s) for GPU3 for different block dimensions

  (x, y)    occupancy  n=2048  n=4096  n=8192  n=16384
  (1, 32)   0.25       0.51    3.64    29.11   336.89
  (2, 32)   0.50       0.29    2.14    17.27   199.92
  (4, 16)   0.50       0.27    2.06    16.28   130.28
  (8, 8)    0.50       0.38    3.01    24.10   192.71
  (16, 32)  0.50       0.71    5.63    43.98   351.63
  (64, 1)   0.50       0.69    5.50    43.97   351.67

GPU3 uses 21 registers per thread while GPU2 uses 16. When the compute capability is 1.3, an occupancy of 1.0 cannot be achieved by codes that use more than 16 registers per thread: to achieve an occupancy of 1.0, 32 warps must be co-scheduled on an SM, and 32 warps with 16 registers per thread utilize all 16384 registers available on an SM of a device with compute capability 1.3. So, for example, when GPU3 is invoked with 16×16 thread blocks, the occupancy of GPU3 is 0.5.

Run Time

Table 2-7 gives the time taken by GPU3 for different values of n. Although we experimented with many choices for block dimensions, only the times for the 6 best choices are reported in this table. We remark that the remaining dimensions of the form (·, 1) reported in Tables 2-3 and 2-6 performed almost as well as did (64, 1) and while the remaining dimensions of the form (1, ·) performed better than they did for GPU2, they took between 50% and 110% more time than taken by (1, 32). Of the 6 dimensions reported on in Table 2-7, the performance for the dimensions (64, 1) and (16, 32) was better for GPU2 than GPU3. The other 4 saw a performance improvement using the strategy of GPU3. The best performer for GPU3 is (4, 16).

Number of Device-Memory Accesses

We now derive the device-memory transaction statistics for the top 3 performers ((4, 16), (8, 8), and (2, 32)) of Table 2-7. In all cases, the number of half warps is n²/64. When the dimensions are (4, 16), a thread makes 1 a access and 4 b accesses of device memory in each iteration of the for loop. The a accesses made by a half warp


are to float4s from different rows of A and hence from different 128-byte segments. So, a half warp makes 4 32-byte device-memory transactions per iteration of the for loop. The utilization of each access is 50%. The total number of a transactions is, therefore, 4 × n/4 × n²/64 = n³/64. When the threads of a half warp read a float4 of b, they read 4 float4s that lie in the same 64-byte segment. So, this read is accomplished using a single 64-byte transaction whose utilization is 100%. A thread reads 4 float4s of b in each iteration of the for loop. So, in each iteration of the for loop, a half warp makes 4 64-byte transactions on b and the utilization of these transactions is 100%. Hence, the total number of b transactions is 4 × n/4 × n²/64 = n³/64. For c, a half warp makes 4 64-byte transactions with utilization 100%. The total number of c transactions is n²/16. The volume and ABU may be calculated easily now.

When the dimensions are (8, 8), a half warp reads, in each iteration of the for loop, two float4s of a values that lie in different 128-byte segments. This requires 2 32-byte transactions and their utilization is 50%. So, the total number of a transactions is 2 × n/4 × n²/64 = n³/128. A half warp accesses 32 b values that are in the same 128-byte segment each time its threads read a float4 of b. A thread does 4 such reads in an iteration of the for loop. So, a total of 4 × n/4 × n²/64 = n³/64 128-byte transactions are done for b. The utilization of each of these is 100%. Each half warp uses 2 128-byte transactions to write the computed c values. So, a total of n²/32 128-byte transactions are done for c and the utilization of each of these is 100%.

The final analysis we do is for the dimensions (2, 32). In each iteration of the for loop, a half warp reads 8 float4s of a values that are in different 128-byte segments. So, GPU3 makes a total of 8 × n/4 × n²/64 = n³/32 32-byte transactions on a. The utilization of each of these is 50%. Further, on each iteration of the for loop, a half warp reads 2×4 float4s of b using 4 32-byte transactions whose utilization is 100%. So, a total of 4 × n/4 × n²/64 = n³/64 32-byte transactions with utilization 100% are made for b. To write the c values, a half warp makes 8 32-byte transactions whose


utilization is 100%.

Table 2-8 summarizes the transaction statistics for GPU3. Of the 3 dimensions analyzed, (2, 32) and (4, 16) transfer the smallest volume of data between device memory and the SMs, and (4, 16) does this transfer using a smaller number of transactions. These factors, most likely, played a significant role in causing GPU3 to exhibit its best performance when the dimensions are (4, 16).

Table 2-8. Device-memory transaction statistics for GPU3

  (x, y)   Total            32-byte        64-byte        128-byte       Volume        ABU
  (4, 16)  n³/32 + n²/16    n³/64          n³/64 + n²/16  0              1.5n³ + 4n²   75.0%
  (8, 8)   3n³/128 + n²/32  n³/128         0              n³/64 + n²/32  2.25n³ + 4n²  83.3%
  (2, 32)  3n³/64 + n²/8    3n³/64 + n²/8  0              0              1.5n³ + 4n²   66.7%

2.3.4 A Thread Computes a 1×1 Sub-matrix of C Using Shared Memory

First Kernel Code and Analysis

To improve on the performance of GPU2, we resort to a block matrix multiplication algorithm in which each of A, B, and C is partitioned into n²/s² s×s sub-matrices Aij, Bij, and Cij. We assume that s divides n. The algorithm computes C using the equation:

  Cij = Σ_{0 ≤ k < n/s} Aik × Bkj

In GPU4 (Figure 2-12), s = 16: a block of 16×16 threads computes one 16×16 sub-matrix Cij, reading the needed 16×16 sub-matrices of A and B into shared memory one pair at a time.
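The block equation can be exercised on the host before moving to the GPU. The sketch below (blockedMultiply is our name; it runs entirely on the CPU) uses s = 16 and mirrors the order in which the GPU kernel consumes sub-matrices: for each Cij, loop over the k blocks and accumulate the s×s products.

```cpp
#include <cassert>
#include <vector>

// CPU sketch of C_ij = sum over k of A_ik * B_kj with s x s sub-matrices,
// s = 16. a, b, c are n x n row-major; n must be a multiple of s.
void blockedMultiply(const float *a, const float *b, float *c, int n) {
    const int s = 16;
    for (int i = 0; i < n*n; i++) c[i] = 0;
    for (int bi = 0; bi < n; bi += s)          // block row of C
        for (int bj = 0; bj < n; bj += s)      // block column of C
            for (int bk = 0; bk < n; bk += s)  // C_ij += A_ik * B_kj
                for (int i = bi; i < bi + s; i++)
                    for (int k = bk; k < bk + s; k++) {
                        float aik = a[i*n + k];
                        for (int j = bj; j < bj + s; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
}
```

Comparing its output against the naive triple loop for a small n confirms that the block decomposition computes the same product.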

__global__ void GPU4(float *a, float *b, float *c, int n)
{// thread code to compute an element of a 16x16 sub-matrix of c
 // shared memory arrays to hold a 16x16 sub-matrix of a and b
   __shared__ float as[16][16], bs[16][16];
   int nDiv16 = n/16;
   int nTimes16 = n*16;
   int aNext = (16*blockIdx.y + threadIdx.y)*n + threadIdx.x;
   int bNext = 16*blockIdx.x + threadIdx.y*n + threadIdx.x;
   float temp = 0;
   for (int u = 0; u < nDiv16; u++)
   {// read the next 16x16 sub-matrices of a and b into shared memory
      as[threadIdx.y][threadIdx.x] = a[aNext];
      bs[threadIdx.y][threadIdx.x] = b[bNext];
      __syncthreads(); // wait for all threads of the block
      for (int k = 0; k < 16; k++)
         temp += as[threadIdx.y][k] * bs[k][threadIdx.x];
      __syncthreads();
      aNext += 16;       // next sub-matrix of a is 16 columns to the right
      bNext += nTimes16; // next sub-matrix of b is 16 rows down
   }
   c[(16*blockIdx.y + threadIdx.y)*n + 16*blockIdx.x + threadIdx.x] = temp;
}

Figure 2-12. Kernel GPU4

We now obtain the device-memory access statistics for GPU4. Since each thread computes a single value of c, the number of half warps is n²/16. In each iteration of the for loop, a half warp reads 64 bytes of a values from a single 128-byte segment using a 64-byte transaction and 64 bytes of b values using another 64-byte transaction. Since the for loop is iterated n/16 times, a half warp makes n/8 64-byte transactions on a and b together. A half warp also makes 1 64-byte write transaction on c. So, the total number of device-memory transactions made by GPU4 is n³/128 + n²/16. Each of these is a 64-byte transaction with 100% utilization. The volume is n³/2 + 4n² and the ABU is 100%. GPU4 uses 11 registers per thread and 2092 bytes of shared memory; it achieves an occupancy of 1.0. Compared to the (4, 16) case of GPU3, which results in best performance for GPU3, GPU4 has a lower volume (67% lower), a higher ABU (33% higher), and a higher occupancy (double that of GPU3). So, we expect GPU4 to have a much better performance than GPU3.

Improved Kernel Code

The performance of GPU4 may be improved by tuning shared-memory accesses. This tuning does not affect the volume, ABU, or occupancy of the kernel. A thread accesses a row of as and a column of bs. Although the CUDA manual does not document a cache for shared memory, it appears that it is faster for a thread to access data by rows rather than by columns. We can take advantage of this asymmetry in access time by storing in bs the transpose of the sub-matrix of b that is read from device memory. While this speeds access for a single thread, it results in shared-memory access conflicts for a half warp of threads. To see why this is so, we need to be familiar with the organization of shared memory on a GPU.

On a device with compute capability 1.3, shared memory is divided into 16 banks using a round robin allocation of 4-byte words. So, in a shared-memory array as of type float, as[i] and as[j] are in the same bank iff i mod 16 = j mod 16. (A device of


compute capability 2.0 has 32 banks of shared memory.) Shared-memory reads/writes from different banks can be serviced at the same time. When two or more threads need to access the same bank, there is a shared-memory conflict and the reads/writes get serialized into a number of conflict-free reads/writes. So, performance is maximized when there are no bank conflicts.

When a and b are read from device memory and written to shared memory by a half warp in GPU4, the half warp accesses 16 adjacent elements of as and bs. These lie in different banks of shared memory and so, the shared-memory accesses are conflict free. Similarly, the reads of bs, by a half warp, in the inner for loop are conflict free. Since all threads in a half warp read the same as value in the inner for loop, all threads in the half warp get this as value making a common access to shared memory.

If we modify GPU4 to save the transpose of the b sub-matrix in bs by changing the statement

  bs[threadIdx.y][threadIdx.x] = b[bNext];

to

  bs[threadIdx.x][threadIdx.y] = b[bNext];

and correspondingly change the statement

  temp += as[threadIdx.y][k] * bs[k][threadIdx.x];

to

  temp += as[threadIdx.y][k] * bs[threadIdx.x][k];

then threads in a half warp write and read bs values that are in the same column of bs. Since the values in a column are in the same bank of shared memory (assuming the rows of bs are assigned to a contiguous block of 16×16×4 = 1024 bytes of shared memory), the 16 threads of the half warp need to read 16 different bs values from the same bank (note that elements of a column of bs are 16 4-byte words apart and so are in the same bank). Figure 2-13 shows the shared-memory bank mapping for bs[16][16]. The resulting bank conflict is serialized into 16 shared-memory accesses.
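The round robin bank rule can be checked numerically (bankOf is our helper, not CUDA API): with 16-word rows all elements of a column land in one bank, while 17-word rows spread a column across all 16 banks.

```cpp
#include <cassert>

// Bank of element [row][col] of a shared-memory float array whose rows
// are width 4-byte words long, for a 16-bank device (compute capability
// 1.3). 4-byte words are assigned to banks round robin.
int bankOf(int row, int col, int width) {
    return (row * width + col) % 16;
}
```

This is exactly the contrast between Figure 2-13 (bs[16][16], 16-way conflict on a column) and Figure 2-14 (bs[16][17], conflict free).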


Figure 2-13. Bank mapping for bs[16][16]

Figure 2-14. Bank mapping for bs[16][17]

To avoid bank conflicts in the reading of bs values, we define bs to be a 16×17 array. Now, elements in a column are 17 4-byte words apart and so are in different banks (see Figure 2-14). Modifying GPU4 to store the transpose of sub-matrices of b in bs and defining bs so that elements of a column are in different banks of shared memory gives us GPU5 (Figure 2-15). Table 2-9 gives the time taken by GPU4 and GPU5 for different values of n. We see that for n = 16384, the shared-memory optimization done by GPU5 has reduced run time by a little over 4%. For this value of n, GPU4 achieves a speedup of a little over 2.8 relative to the best case ((4, 16)) for GPU3 while GPU5 achieves a speedup of almost 3.0. Amazingly, an additional speedup of almost 2 is possible!


__global__ void GPU5(float *a, float *b, float *c, int n)
{// thread code to compute an element of a 16x16 sub-matrix of c
 // shared memory arrays to hold a sub-matrix of a and b
   __shared__ float as[16][16], bs[16][17];
   int nDiv16 = n/16;
   int nTimes16 = n*16;
   int aNext = (16*blockIdx.y + threadIdx.y)*n + threadIdx.x;
   int bNext = 16*blockIdx.x + threadIdx.y*n + threadIdx.x;
   float temp = 0;
   for (int u = 0; u < nDiv16; u++)
   {// read the next 16x16 sub-matrices of a and b into shared memory
    // the b sub-matrix is stored in transposed form
      as[threadIdx.y][threadIdx.x] = a[aNext];
      bs[threadIdx.x][threadIdx.y] = b[bNext];
      __syncthreads(); // wait for all threads of the block
      for (int k = 0; k < 16; k++)
         temp += as[threadIdx.y][k] * bs[threadIdx.x][k];
      __syncthreads();
      aNext += 16;       // next sub-matrix of a is 16 columns to the right
      bNext += nTimes16; // next sub-matrix of b is 16 rows down
   }
   c[(16*blockIdx.y + threadIdx.y)*n + 16*blockIdx.x + threadIdx.x] = temp;
}

Figure 2-15. Kernel GPU5

__device__ void update1(float *a, float b, float *c)
{
   for (int i = 0; i < 16; i++)
      c[i] += a[i] * b;
}

Figure 2-16. Updating c values

2.3.5 A Thread Computes a 16×1 Sub-matrix of C Using Shared Memory

First Kernel Code and Analysis

To improve the performance of the matrix multiply kernel further, we increase the computational load per thread to better mask device-memory accesses with arithmetic operations. In particular, each thread will compute 16 values of c, all on the same column. Though we can have a thread compute fewer or more values of c, 16 gave best performance. Since a thread will compute its assigned 16 values of c incrementally, we allocate a thread 16 registers cr[0:15] to store the incremental values computed so far. When done, the thread will write its 16 computed values to device memory.

The incremental computation of the c values is done using an 8×16 block of threads (i.e., the block dimensions are (16, 8)). The 8×16 thread block first reads a 16×32 sub-matrix of a from device memory and stores its transpose into a two-dimensional shared-memory array as[32][17]. To do this, each thread must read in 4 values of a; the 4 values read in are from two adjacent columns of the a sub-matrix. Additionally, a half warp reads 32 adjacent values of a that lie in the same 128-byte segment. Using a 32×17 array rather than a 32×16 array avoids bank conflicts when writing to shared memory, and storing the transpose derives cache-like benefits as a thread will access as by row (see Section 2.3.4).

The code of Figure 2-16 updates the value of each of the 16 c's being computed by a thread by adding in an a[i]×b value. When invoked, a is a pointer to a row of as, which corresponds to a column of the 16×32 sub-matrix of a that was read in. Figures 2-17, 2-18, and 2-19 give the complete code to multiply two n×n matrices using the described strategy.
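The per-thread arithmetic of this scheme can be mimicked on the host: one "thread" owns a 16-element column strip cr[] and folds in one column of A (a row of the transposed tile) per update1 call. stripMultiply and the transposed layout aT are our illustrative framing of that idea, without the tiling or shared memory.

```cpp
#include <cassert>

// Host version of the device function of Figure 2-16: add a[i]*b into
// each of the 16 partial results.
void update1(const float *a, float b, float *c) {
    for (int i = 0; i < 16; i++)
        c[i] += a[i] * b;
}

// Compute one 16x1 strip of C: cr[i] = sum over k of A(i,k) * bCol[k].
// aT stores the 16 x n matrix A transposed, so aT[k*16 + i] = A(i, k)
// and &aT[k*16] is one column of A, as in the GPU6 shared-memory tile.
void stripMultiply(const float *aT, const float *bCol, float *cr, int n) {
    for (int i = 0; i < 16; i++) cr[i] = 0;
    for (int k = 0; k < n; k++)
        update1(&aT[k*16], bCol[k], cr);
}
```

For A(i, k) = i + k and bCol[k] = k + 1 with n = 4, each cr[i] should equal the sum over k of (i + k)(k + 1), i.e., 10i + 20.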


__global__ void GPU6(float *a, float *b, float *c, int n)
{  // thread code to compute one column of a 16x128 sub-matrix of c
   // use shared memory to hold the transpose of a 16x32 sub-matrix of a
   __shared__ float as[32][17];
   // registers for column of c sub-matrix
   float cr[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
   int nDiv32 = n/32;
   int sRow = threadIdx.y;
   int sCol = threadIdx.x;
   int sCol2 = sCol*2;
   int sCol2Plus1 = sCol2 + 1;
   int tid = sRow*16 + sCol;
   int aNext = (16*blockIdx.y + sRow)*n + sCol*2;
   int bNext = 128*blockIdx.x + tid;
   int sRowPlus8 = sRow + 8;
   int nTimes8 = 8*n;
   a += aNext;
   b += bNext;
   int i, j;
   float2 temp;

Figure 2-17. A (16,8) thread block computes a 16×128 sub-matrix of C using shared memory (Part a)

To determine the device-memory statistics, we note that a thread computes 16 c values. So, a half warp computes 256 c values. Therefore, the number of half warps is n²/256. In each iteration of the for i loop, the threads of a half warp use 2 128-byte transactions to read the required a values. The total number of transactions on a is n²/256 · 2 · n/32 = n³/4096. Each of these 128-byte transactions has a utilization of 100%. In each iteration of the for i loop, a half warp makes 32 64-byte transactions on b. The total number of b transactions is therefore n²/256 · 32 · n/32 = n³/256. Each of these has 100% utilization. A half warp makes 16 64-byte device-memory transactions to write out the 256 c values it computes. Therefore, the number of device-memory write transactions for c is n²/16 and each has a utilization of 100%. Combining the


Figure 2-18. A (16,8) thread block computes a 16×128 sub-matrix of C using shared memory (Part b)

// output cr[]
int cNext = 16*blockIdx.y*n + 128*blockIdx.x + tid;
c += cNext;
for (int i = 0; i < 16; i++)
{  c[0] = cr[i];
   c += n;
}
}

Figure 2-19. A (16,8) thread block computes a 16×128 sub-matrix of C using shared memory (Part c)

float br0 = b[0];
float br1 = b[n];
float br2 = b[nTimes2];
float br3 = b[nTimes3];
#pragma unroll
for (j = 0; j < 7; j++)
{  b += nTimes4;
   update1(&as[j*4][0], br0, cr);   br0 = b[0];
   update1(&as[j*4+1][0], br1, cr); br1 = b[n];
   update1(&as[j*4+2][0], br2, cr); br2 = b[nTimes2];
   update1(&as[j*4+3][0], br3, cr); br3 = b[nTimes3];
}
b += nTimes4;
update1(&as[28][0], br0, cr);
update1(&as[29][0], br1, cr);
update1(&as[30][0], br2, cr);
update1(&as[31][0], br3, cr);

Figure 2-20. Figure 2-18 modified to handle 4 bs in round-robin fashion

Among other numbers of bs such as 2 and 8, we found that 4 gives best performance. The code of this figure assumes that nTimes2, nTimes3, and nTimes4 have, respectively, been defined to be 2n, 3n, and 4n. GPU6 modified to use 4 values of b per round as in Figure 2-20 is referred to as GPU7.


__device__ void update2(float *a, float b, float *c)
{
   for (int i = 0; i < 16; i++)
      c[i] += a[i*4]*b;
}

Figure 2-21. Updating c values when as is read using float2

Final Kernel Code

As in GPU6 and GPU7, our final matrix multiply code, GPU8 (Figures 2-22, 2-23, and 2-24), uses (16,8) thread blocks. However, a thread block now reads a 16×64 sub-matrix of a rather than a 16×32 sub-matrix from device memory to shared memory. Each half warp reads the 64 a values in a row of the 16×64 sub-matrix, which lie in two adjacent 128-byte segments of device memory, using two 128-byte transactions. To accomplish this, each thread reads a 1×4 sub-matrix of a using the data type float4. The 16×64 a sub-matrix that is input from device memory may be viewed as a 16×16 matrix in which each element is a 1×4 vector. The transpose of this 16×16 matrix of vectors is stored in the array as[16][65] with each 1×4 vector using four adjacent elements of a row of as. This mapping ensures that the 16 elements in each column of the 16×64 sub-matrix of a that is input from device memory are stored in different banks of shared memory. So, the writes to shared memory done by a half warp of GPU8 are conflict free. Further, by storing the transpose of a 16×16 matrix of 1×4 vectors rather than the transpose of a 16×64 matrix of scalars, we are able to do the writes to shared memory using float4s rather than floats as is done in GPU6 and GPU7. This reduces the time to write to shared memory. The scheme used to map a 16×64 sub-matrix of a into a 16×65 array as necessitates the use of a slightly different update method, update2 (Figure 2-21). The number of half warps is n²/256. In each iteration of the for i loop, a half warp makes 4 128-byte transactions to read in a values and 64 64-byte transactions to read in b values. GPU8 makes n³/4096 128-byte device-memory transactions on a and n³/256


__global__ void GPU8(float *a, float *b, float *c, int n)
{  // thread code to compute one column of a 16x128 sub-matrix of c
   // use shared memory to hold the transpose of a
   // 16x64 sub-matrix of 1x4 sub-vectors of a
   __shared__ float as[16][65];
   // registers for column of c sub-matrix
   float cr[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
   int nDiv64 = n/64;
   int sRow = threadIdx.y;
   int sRow4 = sRow*4;
   int sCol = threadIdx.x;
   int tid = sRow*16 + sCol;
   int aNext = (16*blockIdx.y + sRow)*n + sCol*4;
   int bNext = 128*blockIdx.x + tid;
   int cNext = 16*blockIdx.y*n + 128*blockIdx.x + tid;
   int nTimes2 = 2*n;
   int nTimes3 = 3*n;
   int nTimes4 = 4*n;
   a += aNext;
   b += bNext;
   c += cNext;
   float4 *a4 = (float4 *)a;

Figure 2-22. GPU7 modified to handle 4 bs in round-robin fashion and read a 16×64 sub-matrix of a (Part a)

64-byte transactions on b. Additionally, n²/16 64-byte transactions are made on c. Each transaction has 100% utilization. So, the device memory statistics for GPU8 are the same as for GPU6 and GPU7.

2.4 A Comparison of GPU Kernels

To compare the 8 GPU kernels, GPU1 through GPU8, we use, for each, the block dimensions that yielded best performance. Specifically, for GPU1 we use (256,1) (i.e., blockDim.x = 256 and blockDim.y = 1) thread blocks, for GPU2 we use (8,8) blocks, for GPU3 we use (4,16) blocks, for GPU4 and GPU5 we use (16,16), and for GPU6,


Figure 2-23. GPU7 modified to handle 4 bs in round-robin fashion and read a 16×64 sub-matrix of a (Part b)

update2(&as[15][0], br0, cr);
update2(&as[15][1], br1, cr);
update2(&as[15][2], br2, cr);
update2(&as[15][3], br3, cr);
a4 += 16;
__syncthreads(); // wait for computation to complete
}
for (int j = 0; j < 16; j++)
{  c[0] = cr[j];
   c += n;
}
}

Figure 2-24. GPU7 modified to handle 4 bs in round-robin fashion and read a 16×64 sub-matrix of a (Part c)

Figures 2-25 and 2-26 plot the number of transactions (in billions) and the volume for 16384×16384 matrices. We note that each refinement of the kernel resulted in either a reduction or no change in the total number of transactions and in the volume of data moved between the device memory and the SMs. So, there is a good correlation between performance and number of transactions as well as volume. This isn't too surprising, as when there are more transactions, there are more latencies to hide, and the volume divided by the bandwidth between device memory and the SMs is a lower bound on the time needed to do the data transfer. We note also that GPU6, GPU7, and GPU8 make 15% fewer transactions than made by sgemm and generate 10% lower volume. Table 2-11 gives the block dimensions, number of registers per thread, shared memory used by an SM, and the occupancy for the different kernels. We see that although high occupancy is needed to hide device-memory latency, high occupancy does not necessarily translate into better performance. In particular, our best performers, GPU6, GPU7, GPU8, and sgemm, have the lowest occupancy. The number of transactions and volume are a better indicator of performance than is occupancy.


Table 2-10. Device memory statistics for compute capability 1.3

Version  Transactions           Volume         ABU
GPU1     n³/8 + n²/16           6n³ + 4n²      56%
GPU2     3n³/64 + n²/16         2.5n³ + 4n²    83%
GPU3     n³/32 + n²/16          1.5n³ + 4n²    75%
GPU4     n³/128 + n²/16         0.5n³ + 4n²    100%
GPU5     n³/128 + n²/16         0.5n³ + 4n²    100%
GPU6     17n³/4096 + n²/16      9n³/32 + 4n²   100%
GPU7     17n³/4096 + n²/16      9n³/32 + 4n²   100%
GPU8     17n³/4096 + n²/16      9n³/32 + 4n²   100%
sgemm    5n³/1024 + n²/16       5n³/16 + 4n²   100%

Figure 2-25. Transactions (in billions) for 16384×16384 matrices

Table 2-11. Number of registers per thread, shared memory (bytes) per block, and occupancy for GPU matrix multiply methods

         Block    Registers  Memory  Occupancy
GPU1     (256,1)  10         44      1.00
GPU2     (8,8)    16         44      0.50
GPU3     (4,16)   21         44      0.50
GPU4     (16,16)  11         2092    1.00
GPU5     (16,16)  11         2156    1.00
GPU6     (16,8)   38         2220    0.38
GPU7     (16,8)   38         2220    0.38
GPU8     (16,8)   38         4204    0.38
sgemm    (16,4)   35         1160    0.38


Figure 2-26. Volume (in gigabytes) for 16384×16384 matrices

Tables 2-12 to 2-14, respectively, give the run time, effective gigaflops, and efficiency of our kernels. Figures 2-27 to 2-29 plot these metrics for 16384×16384 matrices. For the gigaflop rate computation, we used 2n³ − n² as the total number of floating point operations (multiplies and adds) done by a kernel and divided this by the run time. The efficiency was measured relative to the manufacturer's stated peak rate of 933 GFlops for the C1060. That is, we divided the gigaflop rate given in Table 2-13 by 933 to come up with the efficiency given in Table 2-14. It is interesting to note that, for n = 16384, there is a speedup of 14 when going from the most simple adaptation of SingleCoreIJK (i.e., GPU1) to the most efficient adaptation, GPU8. Further, GPU6, GPU7, GPU8, sgemm, and CUBLAS all take about the same time, with GPU8 being the fastest and sgemm the slowest of the five. GPU8 is faster than CUBLAS by 2% and faster than sgemm by 3% when n = 16384. This rather small reduction in run time relative to CUBLAS and sgemm, despite the 15% reduction in number of transactions and 10% reduction in volume, suggests that CUBLAS and sgemm do a very good job of masking device-memory latency. These five kernels are able to achieve a gigaflop rate of over 370 GFlops when n = 16384. Since there is no opportunity to use SFs in a matrix


Table 2-12. Run time (s) for GPU matrix multiply methods

         2048   4096   8192    16384
GPU1     0.71   5.24   40.85   327.77
GPU2     0.27   2.06   17.65   193.02
GPU3     0.27   2.05   16.28   130.25
GPU4     0.09   0.69   5.54    46.18
GPU5     0.08   0.60   4.84    44.12
GPU6     0.05   0.37   2.91    23.28
GPU7     0.05   0.37   2.91    23.26
GPU8     0.05   0.36   2.88    22.97
sgemm    0.05   0.37   2.97    23.70
CUBLAS   0.05   0.37   2.95    23.53

Table 2-13. GFlops for GPU matrix multiply methods

         2048  4096  8192  16384
GPU1     24    26    27    27
GPU2     64    67    62    46
GPU3     64    67    68    68
GPU4     197   199   198   190
GPU5     226   229   227   199
GPU6     373   377   377   378
GPU7     373   377   378   378
GPU8     373   381   382   383
sgemm    358   368   371   371
CUBLAS   354   371   373   374

multiply, the theoretical peak is actually only 622 GFlops and not 933 GFlops. Using 622 GFlops as the theoretical maximum rate, the efficiency of the five kernels GPU6, GPU7, GPU8, sgemm, and CUBLAS becomes about 60%. In Figure 2-27, we show also the theoretical minimum time needed to move the required volume of data between device memory and the SMs. This theoretical minimum was computed by dividing the data volume by the theoretical peak transfer rate of 102 GB/sec. Since we do not have the volume data for CUBLAS, the theoretical minimum data transfer time for CUBLAS is not shown. Figures 2-30 and 2-31, respectively, show the run time of our GPU kernels as a function of the number of device memory transactions and the volume of data transferred between device memory and the SMs. CUBLAS is not included in these


Figure 2-27. Run times (in seconds) for 16384×16384 matrices

Figure 2-28. GFlops for 16384×16384 matrices


Table 2-14. Efficiency of GPU matrix multiply methods

         2048  4096  8192  16384
GPU1     0.03  0.03  0.03  0.03
GPU2     0.07  0.07  0.07  0.05
GPU3     0.07  0.07  0.07  0.07
GPU4     0.21  0.21  0.21  0.20
GPU5     0.24  0.25  0.24  0.21
GPU6     0.40  0.40  0.40  0.41
GPU7     0.40  0.40  0.41  0.41
GPU8     0.40  0.41  0.41  0.41
sgemm    0.38  0.39  0.40  0.40
CUBLAS   0.38  0.40  0.40  0.40

Figure 2-29. Efficiency for 16384×16384 matrices

figures as we do not have the transactions and volume data for CUBLAS. These figures show a fairly strong correlation between run time and transactions as well as between run time and volume.

Comparison With Single- and Quadcore Code

Because of the excessive time required to multiply large matrices on our Xeon quadcore host, we obtained run times for the single- and quadcore host codes only for n ≤ 4096 (Tables 2-1 and 2-2). For n = 2048 and n = 4096, our fastest GPU code, GPU8, achieved speedups of 8998 and 11183 relative to SingleCoreIKJ and


Figure 2-30. Time versus transactions for 16384×16384 matrices

Figure 2-31. Time versus volume for 16384×16384 matrices


Figure 2-32. Speedup relative to SingleCoreIJK for 4096×4096 matrices

364 and 407 relative to the 4-thread OpenMP version of SingleCoreIKJ (Figure 2-3). Figure 2-32 shows the speedup achieved by our various matrix multiply codes relative to SingleCoreIJK for n = 4096.


CHAPTER 3
STRASSEN'S MATRIX MULTIPLICATION

Matrix multiplication is an integral component of the CUDA (Compute Unified Device Architecture) BLAS library [60] and much effort has been expended in obtaining an efficient CUDA implementation. The current implementation in the CUDA BLAS library is based on an algorithm due to Volkov and Demmel [87]. A further 3% reduction (on the NVIDIA Tesla C1060) in run time is achieved in the previous chapter. Although significant effort has been expended to obtain efficient GPU algorithms for matrix multiplication based on the classical O(n³) single-core algorithm, there appears to be no work toward obtaining efficient GPU implementations of any of the single-core matrix algorithms whose complexity is less than O(n³). Of these latter lower complexity algorithms, Strassen's original O(n^2.81) algorithm [81] and Winograd's variant [14] of this algorithm, whose asymptotic complexity is also O(n^2.81), are considered the most practical. Hence, we focus on these two algorithms in this chapter. We note that the asymptotically fastest matrix multiplication algorithm at this time has a complexity of O(n^2.38) [10] and it is believed that an optimal algorithm for matrix multiplication will run in essentially O(n²) time [69].

Both Strassen's algorithm and Winograd's variant compute the product C of two matrices A and B by first decomposing each matrix into 4 roughly equal sized blocks as in Figure 3-1. Strassen's algorithm [81] computes C by performing 7 matrix multiplications and 18 add/subtracts using the following equations:

Figure 3-1. Block decomposition of A, B, and C


M1 = (A11 + A22)(B11 + B22)        C11 = M1 + M4 − M5 + M7
M2 = (A21 + A22)B11                C12 = M3 + M5
M3 = A11(B12 − B22)                C21 = M2 + M4
M4 = A22(B21 − B11)                C22 = M1 − M2 + M3 + M6
M5 = (A11 + A12)B22
M6 = (A21 − A11)(B11 + B12)
M7 = (A12 − A22)(B21 + B22)

When this block decomposition is applied recursively until the block dimensions reach (or fall below) a threshold value (say τ), the arithmetic complexity of Strassen's algorithm becomes O(n^2.81). Winograd's variant of Strassen's method uses the following equations to compute C with 7 matrix multiplies and 15 add/subtracts [16]:

S1 = A21 + A22    M1 = S2·S6     V1 = M1 + M2
S2 = S1 − A11     M2 = A11·B11   V2 = V1 + M4
S3 = A11 − A21    M3 = A12·B21   C11 = M2 + M3
S4 = A12 − S2     M4 = S3·S7     C12 = V1 + M5 + M6
S5 = B12 − B11    M5 = S1·S5     C21 = V2 − M7
S6 = B22 − S5     M6 = S4·B22    C22 = V2 + M5
S7 = B22 − B12    M7 = A22·S8
S8 = S6 − B21

Although the recursive application of Winograd's variant also results in an asymptotic complexity of O(n^2.81), the reduction in the number of matrix adds/subtracts from 18 to 15 manifests itself as a slightly smaller measured run time in practice.

Bailey, Lee, and Simon [4] describe an implementation of Strassen's algorithm for the CRAY-2 and CRAY Y-MP. This implementation uses three temporary (scratch) matrices at each level of the recursion. The total space required by these temporary matrices is at most n². However, the computation can be done using 2 temporaries, and the space required by temporary matrices then drops to at most 2n²/3. Karunadasa and


Ranasinghe [36] describe an implementation of Strassen's algorithm on a 2-CPU GPU cluster in which one CPU has 4 cores and the other 2; each node has an NVIDIA GT8800 GPU. In this implementation the work is shared by the 2 CPUs and the 2 GPUs, with the GPUs doing most of the submatrix multiplies and the CPUs doing the submatrix additions and subtractions and some of the multiplications. They do not, however, attempt to develop an efficient implementation in which all of the work is done by a GPU. Douglas et al. [16] provide an implementation of Winograd's variant that uses two temporary matrices at each level of the recursion. So, this implementation uses at most 2n²/3 memory for temporary matrices. Douglas et al. [16] report on the performance of their implementation on a variety of serial and parallel computers. Huss-Lederman et al. [28, 29] describe two implementations of Winograd's variant. The first uses two temporary matrices at each level of the recursion and is identical to the implementation of Douglas et al. [16]. The second implementation uses 3 temporaries at each level of the recursion. This second implementation, however, is recommended only for the case when we are using the Winograd variant to do a multiply-accumulate (i.e., C = AB + C) and not when we are doing a straight multiply (C = AB) as here. So, we do not consider this implementation further. Boyer et al. [7] show how to implement Winograd's variant using no temporary matrix. They provide two implementations. The first does not increase the number of arithmetic operations but overwrites the input matrices A and B. Since we do not permit overwriting of the input matrices, we do not consider this implementation. Although the second, in-place, implementation does not overwrite the input matrices, it increases the number of arithmetics by a constant factor. So, we do not consider this implementation either. There appears to be no GPU implementation of Winograd's algorithm.

The remainder of this chapter is organized as follows. Section 3.1 gives the basic GPU kernels used in our GPU adaptations of Strassen's algorithm and Winograd's variant and also analyzes these kernels for their device-memory transactions and


volume complexity. A one-level GPU implementation of Strassen's algorithm and Winograd's variant (i.e., an implementation that does not apply Strassen's and Winograd's equations recursively) is given in Section 3.2 and the general multilevel recursive implementation is given in Section 3.3. Experimental results for single-precision and integer implementations of Strassen's and Winograd's algorithms are presented in Section 3.4. We summarize in Section 3.5. Throughout this chapter, we assume that n is a power of 2. Adaptations to other values of n may be done using methods such as padding and peeling [28, 29]. While our development is done explicitly for single-precision data, the integer version is obtained by simply changing the data type from float to int.

3.1 Basic GPU Kernels

We use several basic GPU kernels to arrive at our efficient GPU adaptation of Strassen's algorithm and Winograd's variant. These kernels are described below.

1. add(X,Y,Z) computes Z = X + Y using the kernel code of Figure 3-2. Each thread fetches two adjacent values of X and two adjacent values of Y from device memory using the data type float2. Since the 16 pairs of X (Y) fetched from device memory lie in the same 128-byte segment, the fetches of a half warp are coalesced into a single 128-byte memory transaction. The fetched pairs of X and Y are added and the sums written to device memory. This write also requires one memory transaction per half warp. So, two m×m matrices are added using a total of 3m²/32 128-byte transactions that result in a total volume of 12m² bytes.

2. sub(X,Y,Z) computes Z = X − Y using kernel code similar to that of Figure 3-2.

3. mul(X,Y,Z) computes Z = X·Y using GPU8. Let T and V, respectively, denote the number of memory transactions and volume for this code when multiplying two m×m matrices (T = 17m³/4096 + m²/16 and V = 9m³/32 + 4m²).

4.
mulIncInc(W,X,Y,Z) computes (Y+, Z+) = W·X (i.e., Y and Z are both incremented by W·X). This is done by modifying the matrix multiply kernel so that it does not write out the elements of W·X as each is computed. Instead, after an element of W·X has been computed, the corresponding elements of Y and Z are read from device memory, incremented by the computed element of W·X, and the incremented values written back to device memory. Note that each element of W·X is computed exactly once. The modified kernel makes T − m²/16 transactions to multiply W and X as it does not write out W·X.


__global__ void add(float *d_A, float *d_B, float *d_C, int widthA, int widthB, int widthC)
{
   int startA = blockIdx.x*64 + threadIdx.x*2 + (blockIdx.y*8 + threadIdx.y)*widthA;
   int startB = blockIdx.x*64 + threadIdx.x*2 + (blockIdx.y*8 + threadIdx.y)*widthB;
   int startC = blockIdx.x*64 + threadIdx.x*2 + (blockIdx.y*8 + threadIdx.y)*widthC;
   float2 tempA = *(float2 *)(d_A + startA);
   float2 tempB = *(float2 *)(d_B + startB);
   tempA.x += tempB.x;
   tempA.y += tempB.y;
   *(float2 *)(d_C + startC) = tempA;
}

Figure 3-2. Kernel to add two matrices

Additional transactions are made to fetch Y and Z and write the incremented values. A half warp reads/writes Y and Z using coalesced 64-byte transactions. The total number of these transactions is m²/4 (m²/16 transactions are made to read or write each of Y and Z). So, the total number of transactions is T + 3m²/16 and the volume is V + 12m².

5. mulIncDec(W,X,Y,Z) computes (Y+, Z−) = W·X. This is similar to mulIncInc and has the same transaction count and volume.

6. mulStoreDec(W,X,Y,Z) computes (Y, Z−) = W·X. Again, this is done by modifying the matrix multiply kernel so that after it stores a computed element of W·X to the appropriate device memory location for Y, it reads the corresponding element of Z from device memory, decrements this element of Z by the value of the just-computed element of Y, and stores the decremented element of Z in device memory. In addition to the transactions (T) made to compute and store W·X, the modified kernel fetches and writes Z using m²/8 64-byte transactions. So, the modified kernel makes a total of T + m²/8 transactions and generates a volume of V + 8m².

7. mulStoreInc(W,X,Y,Z) computes (Y, Z+) = W·X using a suitable modification of the matrix multiply kernel. This kernel is similar to that for mulStoreDec and has the same number of transactions and volume.


Table 3-1. Device-memory transaction statistics for m×m matrices

Kernel        Transactions             Volume
add           3m²/32                   12m²
sub           3m²/32                   12m²
mul           T = 17m³/4096 + m²/16    V = 9m³/32 + 4m²
mulIncInc     T + 3m²/16               V + 12m²
mulIncDec     T + 3m²/16               V + 12m²
mulStoreDec   T + m²/8                 V + 8m²
mulStoreInc   T + m²/8                 V + 8m²
mulAdd        T + m²/16                V + 4m²
mulIncIncInc  T + 5m²/16               V + 20m²
mulSubInc     T + 3m²/16               V + 12m²

8. mulAdd(W,X,Y,Z) computes Z = W·X + Y. This kernel, in addition to doing all the work done by the matrix multiply kernel, needs to fetch Y from device memory. This fetching is done using m²/16 64-byte transactions. So, the total number of transactions is T + m²/16 and the volume is V + 4m².

9. mulIncIncInc(U,V,W,X,Y,Z) computes W = U·V; Y += W; Z += Y; Y += X (in this order). The modification to the matrix multiply kernel requires that when an element of U·V is computed, it is written to device memory as an element of W; the corresponding element of Y is fetched from device memory and incremented (but not written back to device memory); next the corresponding element of Z is fetched from device memory, incremented by the just-computed Y value, and written to device memory; finally this element of Y is incremented again by fetching the corresponding element of X from device memory and the incremented value written to device memory. mulIncIncInc makes m²/16 transactions to fetch each of X, Y, and Z and to write each of Y and Z (in addition to those made by the matrix multiply kernel). So, an extra 5m²/16 64-byte transactions are made. The total number of transactions is T + 5m²/16 and the volume is V + 20m².

10. mulSubInc(V,W,X,Y,Z) computes Y = X − V·W; Z += X using a modification of the matrix multiply kernel. The total number of transactions is T + 3m²/16 and the volume is V + 12m².

3.2 One-Level Adaptation

3.2.1 One-Level Strassen

In a one-level implementation of Strassen's algorithm and Winograd's variant, the 7 matrix products M1 through M7 are computed by a direct application of GPU8 (i.e., Strassen's and Winograd's equations are not applied recursively). Figure 3-3 gives the sequence of kernel calls in our one-level implementation of Strassen's method. We refer


Step  Computation        GPU Kernel
1     C12 = A21 − A11    sub(A21,A11,C12)
2     C21 = B11 + B12    add(B11,B12,C21)
3     C22 = C12·C21      mul(C12,C21,C22)
4     C12 = A12 − A22    sub(A12,A22,C12)
5     C21 = B21 + B22    add(B21,B22,C21)
6     C11 = C12·C21      mul(C12,C21,C11)
7     C12 = A11 + A22    add(A11,A22,C12)
8     C21 = B11 + B22    add(B11,B22,C21)
9     T1 = C12·C21
10    C11 = T1 + C11
11    C22 = T1 + C22     mulIncInc(C12,C21,C11,C22)
12    T2 = A21 + A22     add(A21,A22,T2)
13    C21 = T2·B11
14    C22 = C22 − C21    mulStoreDec(T2,B11,C21,C22)
15    T1 = B21 − B11     sub(B21,B11,T1)
16    T2 = A22·T1
17    C21 = C21 + T2
18    C11 = C11 + T2     mulIncInc(A22,T1,C21,C11)
19    T1 = B12 − B22     sub(B12,B22,T1)
20    C12 = A11·T1
21    C22 = C22 + C12    mulStoreInc(A11,T1,C12,C22)
22    T2 = A11 + A12     add(A11,A12,T2)
23    T1 = T2·B22
24    C12 = C12 + T1
25    C11 = C11 − T1     mulIncDec(T2,B22,C12,C11)

Figure 3-3. GPU kernels in Strassen implementation

to the resulting program as one-level Strassen. The one-level GPU implementation of Strassen's method invokes the add and sub kernels 10 times, the mul and mulIncInc kernels twice each, and the mulStoreDec, mulStoreInc, and mulIncDec kernels once each. Using the transaction and volume data for each kernel (Table 3-1), we determine the total transaction count to be 7T + 7m²/4 and the total volume to be 7V + 172m², where T = 17m³/4096 + m²/16 and V = 9m³/32 + 4m². When multiplying n×n matrices, the kernels are invoked with m = n/2. So, the total number of transactions is 119n³/32768 + 35n²/64 and the volume is 63n³/256 + 50n².

3.2.2 One-Level Winograd

Our one-level GPU implementation of Winograd's variant is given in Figure 3-4. This is based on the 22-step implementation of Douglas et al. [16]. We refer to this implementation as one-level Winograd. This implementation invokes the add and sub kernels 8 times, the mul kernel 3 times, the mulAdd kernel twice, and the mulIncIncInc and mulSubInc kernels once each. When the kernels are invoked using m×m matrices,


Step  Computation        GPU Kernel
1     T1 = A11 − A21     sub(A11,A21,T1)
2     T2 = B22 − B12     sub(B22,B12,T2)
3     C21 = T1·T2        mul(T1,T2,C21)
4     T1 = A21 + A22     add(A21,A22,T1)
5     T2 = B12 − B11     sub(B12,B11,T2)
6     C22 = T1·T2        mul(T1,T2,C22)
7     T1 = T1 − A11      sub(T1,A11,T1)
8     T2 = B22 − T2      sub(B22,T2,T2)
9     C11 = T1·T2        mul(T1,T2,C11)
10    T1 = A12 − T1      sub(A12,T1,T1)
11    C12 = T1·B22
12    C12 = C22 + C12    mulAdd(T1,B22,C22,C12)
13    T1 = A11·B11
14    C11 = C11 + T1
15    C12 = C11 + C12
16    C11 = C11 + C21    mulIncIncInc(A11,B11,T1,C21,C11,C12)
17    T2 = T2 − B21      sub(T2,B21,T2)
18    C21 = A22·T2
19    C21 = C11 − C21
20    C22 = C11 + C22    mulSubInc(A22,T2,C11,C21,C22)
21    C11 = A12·B21
22    C11 = T1 + C11     mulAdd(A12,B21,T1,C11)

Figure 3-4. GPU kernels in Douglas et al.'s implementation of the Winograd variant

Table 3-2. Transactions and volume for one-level multiplication of n×n matrices

Method    Arithmetics       Transactions             Volume
GPU8      2n³ − n²          17n³/4096 + n²/16        9n³/32 + 4n²
Strassen  7n³/4 + 11n²/4    119n³/32768 + 35n²/64    63n³/256 + 50n²
Winograd  7n³/4 + 2n²       119n³/32768 + 29n²/64    63n³/256 + 41n²

a total of 7T + 11m²/8 transactions are made and the total volume is 7V + 136m². In a one-level implementation, m = n/2, so the total number of transactions becomes 119n³/32768 + 29n²/64 and the volume is 63n³/256 + 41n². Table 3-2 summarizes the number of arithmetic operations and transactions done by GPU8, one-level Strassen, and one-level Winograd as well as the volume of data transfer done by each. Table 3-3 gives the percent reduction in these quantities relative to GPU8. Based on this analysis, we expect the one-level methods to be about 12% faster than GPU8.

3.3 Multilevel Recursive GPU Adaptation

3.3.1 Multilevel Strassen

Figure 3-5 gives the recursive code for our implementation of Strassen's method for n a power of 2. Adaptations to other values of n may be done using methods such as


Table 3-3. Percent reduction relative to GPU8 for one-level multiplication

n      Method    Arithmetics  Transactions  Volume
4096   Strassen  12.5         9.6           8.5
4096   Winograd  12.5         10.2          9.3
8192   Strassen  12.5         11.1          10.5
8192   Winograd  12.5         11.3          10.9
16384  Strassen  12.5         11.8          11.5
16384  Winograd  12.5         11.9          11.7

Strassen(A, B, C, n)
{  if (n <= τ1) compute C = AB using GPU8;
   else if (n <= τ2) compute C = AB using Figure 3-3;
   else
   {  C12 = A21 − A11; C21 = B11 + B12; Strassen(C12, C21, C22, n/2); // M6
      C12 = A12 − A22; C21 = B21 + B22; Strassen(C12, C21, C11, n/2); // M7
      C12 = A11 + A22; C21 = B11 + B22; Strassen(C12, C21, T1, n/2);  // M1
      (C11+, C22+) = T1;
      T2 = A21 + A22; Strassen(T2, B11, C21, n/2); // M2
      C22 −= C21;
      T1 = B21 − B11; Strassen(A22, T1, T2, n/2);  // M4
      (C11+, C21+) = T2;
      T1 = B12 − B22; Strassen(A11, T1, C12, n/2); // M3
      C22 += C12;
      T2 = A11 + A12; Strassen(T2, B22, T1, n/2);  // M5
      (C11−, C12+) = T1;
   }
}

Figure 3-5. Strassen's GPU matrix multiply

padding and peeling [28, 29]. The code uses 2 threshold values, τ1 and τ2. When n ≤ τ1 the matrices are multiplied using GPU8, when τ1 < n ≤ τ2 the one-level code of Figure 3-3 is used, and when n > τ2 Strassen's method is applied recursively. In Figure 3-5, the notation (X+, Y+) = Z refers to a single kernel that increments X and Y by Z. Such a kernel would read X, Y, and Z from device memory, increment X and Y, and then write the incremented X and Y to device memory. Hence the kernel would read Z only once, while incrementing X and Y using the two steps X += Z and Y += Z would read Z twice.



Table 3-4. Transactions and volume for two-level multiplication of n×n matrices

Method    Arithmetics          Transactions               Volume
GPU8      2n³ − n²             17n³/4096 + n²/16          9n³/32 + 4n²
Strassen  49n³/32 + 149n²/16   833n³/262144 + 347n²/256   441n³/2048 + 277n²/2
Winograd  49n³/32 + 29n²/4     833n³/262144 + 285n²/256   441n³/2048 + 451n²/4

Table 3-5. Transactions and volume for three-level multiplication of n×n matrices

Method    Arithmetics             Transactions                   Volume
GPU8      2n³ − n²                17n³/4096 + n²/16              9n³/32 + 4n²
Strassen  343n³/256 + 1331n²/64   5831n³/2097152 + 2837n²/1024   3087n³/16384 + 2347n²/8
Winograd  343n³/256 + 263n²/16    5831n³/2097152 + 2323n²/1024   3087n³/16384 + 3813n²/16

to τ2 = 8192); we expect the three-level Strassen algorithm to run 26% to 33% faster than GPU8 and the 4-level version to run 29% to 41% faster (depending on whether arithmetics, transactions, or volume dominates run time).

3.3.2 Multilevel Winograd

The recursive code for a matrix multiply using Winograd's variant is similar to the multilevel code for Strassen's algorithm and is given in Figure 3-6. This code does 10 standalone adds/subtracts/increments/decrements of n/2 × n/2 matrices at the outermost level, with each reading 2 n/2 × n/2 matrices and writing 1; the assignment to (C12, C11) reads 4 n/2 × n/2 matrices and writes two; and the assignment to (C21, C22) reads 3 n/2 × n/2 matrices and writes 2. The total number of reads/writes at the outermost level is therefore 41. So, for this code, we see that:

A(k,n) = 7A(k−1, n/2) + 15n²/4
T(k,n) = 7T(k−1, n/2) + 41n²/128
V(k,n) = 7V(k−1, n/2) + 41n²

for k > 1, and A(1,n), T(1,n), and V(1,n) are as in Table 3-2.

Table 3-6. Transactions and volume for four-level multiplication of n×n matrices

Method    Arithmetics                 Transactions                      Volume
GPU8      2n³ − n²                    17n³/4096 + n²/16                 9n³/32 + 4n²
Strassen  2401n³/2048 + 10469n²/256   40817n³/16777216 + 21491n²/4096   21609n³/131072 + 18061n²/32
Winograd  2401n³/2048 + 2081n²/64     40817n³/16777216 + 17573n²/4096   21609n³/131072 + 29315n²/64


Table 3-7. Percent reduction relative to GPU8 for two-level multiplication

n      Method    Arithmetics  Transactions  Volume
4096   Strassen  23.3         15.8          11.7
4096   Winograd  23.3         17.2          13.9
8192   Strassen  23.4         19.6          17.6
8192   Winograd  23.4         20.3          18.7
16384  Strassen  23.4         21.5          20.5
16384  Winograd  23.4         21.9          21.1

Table 3-8. Percent reduction relative to GPU8 for three-level multiplication

n      Method    Arithmetics  Transactions  Volume
4096   Strassen  32.7         17.0          7.9
4096   Winograd  32.8         20.0          12.6
8192   Strassen  32.9         25.0          20.4
8192   Winograd  32.9         26.5          22.8
16384  Strassen  32.9         29.0          26.7
16384  Winograd  33.0         29.7          27.9

Tables 3-4 through 3-6 and Tables 3-7 through 3-9, respectively, give the values of A, T, and V and the percent reduction in these quantities relative to GPU8 for k = 2, 3, and 4. The expected speedup of Winograd relative to GPU8 is slightly higher than for Strassen.

3.4 Experimental Results

3.4.1 Single Precision Matrix Multiply

We programmed several versions of GPU8, Strassen, Winograd, and sgemm using CUDA and measured their run time as well as accuracy on a Tesla C1060 GPU. The different versions of each algorithm varied in their use of texture memory for the input matrices A and B. Because of the limited availability of texture memory, GPU8 and

Table 3-9. Percent reduction relative to GPU8 for four-level multiplication

n      Method    Arithmetics  Transactions  Volume
4096   Strassen  40.9         10.8          -7.2
4096   Winograd  41.0         16.5          2.0
8192   Strassen  41.1         26.1          17.0
8192   Winograd  41.2         28.9          21.6
16384  Strassen  41.3         33.7          29.2
16384  Winograd  41.3         35.1          31.5


Winograd(A, B, C, n)
{  if (n <= τ1) compute C = AB using GPU8;
   else if (n <= τ2) compute C = AB using Figure 3-4;
   else
   {  T1 = A11 − A21; T2 = B22 − B12; Winograd(T1, T2, C21, n/2); // M4
      T1 = A21 + A22; T2 = B12 − B11; Winograd(T1, T2, C22, n/2); // M5
      T1 −= A11; T2 = B22 − T2; Winograd(T1, T2, C11, n/2);       // M1
      T1 = A12 − T1; Winograd(T1, B22, C12, n/2);                 // M6
      C12 += C22;
      Winograd(A11, B11, T1, n/2); // M2
      (C12, C11) = (C11 + C12 + T1, C11 + C21 + T1);
      T2 −= B21; Winograd(A22, T2, C21, n/2); // M7
      (C21, C22) = (C11 − C21, C11 + C22);
      Winograd(A12, B21, C11, n/2); // M3
      C11 += T1;
   }
}

Figure 3-6. Winograd's GPU matrix multiply

sgemm can be readily adapted to use texture memory only when n < 16384. For larger values of n, it is necessary to write a blocked version of these algorithms, invoking the blocked version using texture memory for the smaller sized A and B blocks to be multiplied. Our experiments with the blocked version of sgemm, for example, resulted in a very small reduction in run time from the use of texture memory. For example, when n = 16384, our texture memory versions of sgemm yielded best performance using blocks of size 8192×8192 and designating only the blocks of A as texture blocks. The measured reduction in time was about 0.6% relative to the non-blocked sgemm code. Because of this very marginal reduction in run time even for the largest matrix size we used in our experiments, we do not report further on the blocked texture memory versions of GPU8 and sgemm. Strassen and Winograd, on the other hand, are well suited for texture memory as they recursively and naturally decompose matrices to smaller


Table 3-10. Run time (s) on the Tesla C1060

Algorithm   τ2     2048   4096   8192   16384
sgemm       -      0.048  0.373  2.966  23.699
GPU8        -      0.046  0.361  2.875  22.971
Strassen    4096   0.046  0.329  2.344  16.561
tStrassen   2048   0.044  0.320  2.276  16.107
Winograd    4096   0.046  0.328  2.329  16.425
tWinograd   2048   0.044  0.318  2.243  15.846

sized submatrices until the threshold value τ2 is reached. So long as τ2 ≤ 16384, the pairs of matrices to be multiplied by GPU8 using the one-level kernels at the lowest level of recursion may be designated as texture matrices. Again, our experiments showed best performance when only the first matrix in the pair to be multiplied was designated as texture. Hence, in the following, tStrassen and tWinograd refer to versions of Strassen and Winograd in which, when GPU8 is invoked by the one-level code used when the matrix size drops to τ2, the first matrix of each pair to be multiplied by GPU8 is designated as texture (the syntax of the GPU8 code is correspondingly modified to work with its first matrix being texture). For our experiments, we set τ1 = τ2/2.

3.4.1.1 Run time

Table 3-10 gives the time taken by our various matrix multiplication codes. Strassen and Winograd had best performance when τ2 = 4096 while tStrassen and tWinograd performed best when τ2 = 2048. Table 3-10 shows the run time only for these best values of τ2. Note that the n = 2048 times for Strassen and Winograd are the same as for GPU8 because n = τ2/2 = τ1. When n = 16384, the optimal τ2 (4096) for Strassen and Winograd results in 3 levels of recursion while the optimal τ2 (2048) for tStrassen and tWinograd results in 4 levels of recursion. As far as run time goes, when n = 16384, tStrassen takes 2.7% less time than does Strassen and the use of texture reduces the run time of Winograd by 3.5%. Further, Strassen is 30.1% faster than sgemm and 27.9% faster than GPU8. Our fastest code, tWinograd, takes 33.1% less time to multiply two 16384×16384 matrices than sgemm and 31.0% less than taken by GPU8.
Figures 3-7 and 3-8 plot the percent reduction in run time, number of arithmetics, number of transactions, and volume achieved by Strassen and Winograd relative to GPU8 when τ2 = 4096. As can be seen, the speedup (reduction in run time) most closely tracks the reduction in volume.

Figure 3-7. Strassen speedup when τ2 = 4096

Figure 3-8. Winograd speedup when τ2 = 4096

3.4.1.2 Accuracy

Although Strassen's algorithm produces accurate results over an exact domain such as the integers, it is known to be numerically inaccurate over the reals [23]. We assess the numerical accuracy of Strassen's algorithm and Winograd's variant using the single-precision test matrix used in [34]. Table 3-11 gives the maximum absolute
Table 3-11. Maximum errors

  Algorithm       τ2      n=2048   n=4096   n=8192   n=16384
  O(n^3) on CPU   -       7.9e-5   1.6e-4   2.3e-4   3.9e-4
  sgemm           -       8.1e-5   1.6e-4   2.4e-4   3.9e-4
  GPU8            -       8.1e-5   1.6e-4   2.4e-4   3.9e-4
  Strassen        2048    2.4e-4   6.7e-4   8.8e-3   8.3e-2
  Strassen        4096    8.1e-5   3.4e-4   1.5e-3   5.8e-2
  Winograd        2048    2.5e-4   1.9e-3   2.9e-2   6.3e-1
  Winograd        4096    8.1e-5   5.0e-4   3.6e-3   1.6e-1

difference between an element of the product matrix as computed by each of our algorithms and the exact product, and Table 3-12 gives the average of the absolute differences. For comparison purposes, we also include the errors in the results obtained using the classical O(n^3) matrix multiplication algorithm on the host CPU. For the reported errors, we used τ2 = 2048 and 4096. Since the use of texture memory does not impact accuracy, Tables 3-11 and 3-12 do not explicitly show error measurements for tStrassen and tWinograd (the errors, respectively, are the same as for Strassen and Winograd). The maximum and average errors for the classical CPU algorithm, sgemm, and GPU8 are almost the same. However, the errors for Strassen and Winograd are substantially larger than those for the classical algorithm, sgemm, and GPU8 when n > τ1 = τ2/2 (when n <= τ1 = τ2/2, Strassen and Winograd reduce to GPU8). In fact, when n = 16384 and τ2 = 2048, the maximum error for Strassen is 200 times that for the classical algorithm, sgemm, and GPU8 while the average error is 424 times as much. The corresponding ratios for Winograd are 1615 and 5151. We note also that when n = 16384 and τ2 = 2048, the maximum error for Winograd is about 7.6 times that for Strassen and the average error is about 12 times that for Strassen.

3.4.1.3 Performance by number of levels

Because of the large numerical errors resulting from Strassen and Winograd, we decided to determine how the error varied with the number of levels of recursion. Note that in a 1-level execution, τ1 < n <= τ2.
Table 3-12. Average errors

  Algorithm       τ2      n=2048   n=4096   n=8192   n=16384
  O(n^3) on CPU   -       6.6e-8   5.5e-8   4.7e-8   3.3e-8
  sgemm           -       6.6e-8   5.6e-8   4.7e-8   3.3e-8
  GPU8            -       6.6e-8   5.6e-8   4.7e-8   3.3e-8
  Strassen        2048    2.1e-7   4.6e-7   1.4e-6   1.4e-5
  Strassen        4096    6.6e-8   1.7e-7   3.9e-7   2.9e-6
  Winograd        2048    2.8e-7   1.3e-6   1.2e-5   1.7e-4
  Winograd        4096    6.6e-8   2.8e-7   1.2e-6   3.2e-5

Table 3-13. Maximum errors when n = 16384

  Algorithm   0-level   1-level   2-level   3-level   4-level
  Strassen    3.9e-4    3.3e-3    3.1e-2    5.8e-2    8.3e-2
  Winograd    3.9e-4    1.4e-3    9.7e-3    1.6e-1    6.3e-1

Table 3-14. Average errors when n = 16384

  Algorithm   0-level   1-level   2-level   3-level   4-level
  Strassen    6.6e-8    1.1e-7    4.4e-7    2.9e-6    1.4e-5
  Winograd    6.6e-8    1.9e-7    1.6e-6    3.2e-5    1.7e-4

Tables 3-13 and 3-14 give the maximum and average errors as a function of the level of the execution for the case n = 16384, and Table 3-15 gives the reduction in run time relative to sgemm and GPU8. As expected, the errors and speedup (reduction in run time) increase with the number of levels. For example, the 1-level version of Strassen achieves almost a 15% speedup relative to sgemm at the expense of an almost 8-fold increase in the maximum error and an almost 2-fold increase in the average error, while the 4-level version achieves a speedup of almost 29% at a cost of an almost 213-fold increase in the maximum error and an almost 212-fold increase in the average error.

3.4.2 Integer Matrix Multiply

As indicated earlier, the integer implementations of our algorithms were obtained by simply changing the data type float in the single-precision implementation to int. Experiments similar to those reported in Section 3.4.1 were conducted and, for n = 16384, a speedup, relative to the integer version of sgemm, of 32% was observed for Strassen and 35% for tStrassen using τ2 = 2048. The speedup for Winograd and
Table 3-15. Speedup (%) over sgemm/GPU8 when n = 16384

  Algorithm   1-level     2-level     3-level     4-level
  Strassen    14.7/12.0   23.9/21.5   30.1/27.9   28.9/26.6
  tStrassen   14.9/12.2   24.2/21.8   30.9/28.7   32.0/29.9
  Winograd    14.7/12.0   24.2/21.8   30.7/28.5   30.1/27.9
  tWinograd   15.0/12.3   24.5/22.1   31.5/29.3   33.1/31.0

tWinograd was 35% and 36%, respectively, using τ2 = 2048. As expected, the numerical error was zero.

3.5 Summary

We have developed efficient GPU implementations of Strassen's and Winograd's matrix multiplication algorithms. Our experiments indicate that, for single-precision (integer) arithmetic, a speedup of 32% (35%) is achieved by Strassen's algorithm while Winograd's variant achieves a speedup of 33% (36%) relative to the sgemm (integer version of sgemm) code in CUBLAS 3.0 when multiplying 16384 × 16384 matrices. For single-precision arithmetic, these speedups, however, come at a significant cost in the accuracy of the computed result. The maximum numerical errors introduced by Strassen's and Winograd's algorithms are about 2 orders of magnitude higher than those for sgemm when n = 16384. Whether the loss in accuracy is acceptable or not will depend on the application. There is no loss in accuracy when computing over an exact domain such as the integers. We have analyzed the arithmetic, transaction, and volume complexity of the various matrix multiplication algorithms considered in this chapter (single-precision versions). Our experiments indicate that speedup most closely follows volume.
CHAPTER 4
PAIRWISE SEQUENCE ALIGNMENT FOR VERY LONG SEQUENCES

Sequence alignment is a fundamental problem in bioinformatics. In its most elementary form, known as pairwise sequence alignment, we are given two sequences A and B and are to find their best alignment (either global or local). For DNA sequences, the alphabet for A and B is the four-letter set {A, C, G, T} and for protein sequences, the alphabet is the 20-letter set {A, C-I, K-N, P-T, V, W, Y}. The best global and local alignments of the sequences A and B can be found in O(|A||B|) time using the Needleman-Wunsch [58] and Smith-Waterman [77] dynamic programming algorithms, respectively. In this chapter, we consider only the local alignment problem, though our methods are readily extendable to the global alignment problem.

A variant of the pairwise sequence alignment problem asks for the best k, k > 0, alignments. In the database alignment problem, we are to find the best k alignments of a sequence A with the sequences in a database D. The database alignment problem may be solved by solving |D| pairwise alignment problems, with each pair comprised of A and a distinct sequence from D. This requires |D| applications of the Smith-Waterman algorithm.

When the sequences A and B are long or when the number of sequences in the database D is large, computational efficiency is often achieved by replacing the Smith-Waterman algorithm with a heuristic that trades accuracy for computational time. This is done, for example, in the sequence alignment systems BLAST [2], FASTA [46], and Sim2 [9]. However, with the advent of low-cost parallel computers, there is renewed interest in developing computationally practical systems that do not sacrifice accuracy. Toward this end, several researchers have developed parallel versions of the Smith-Waterman algorithm that are suitable for Graphics Processing Units (GPUs) [37, 38, 43, 50, 54, 76]. The work of Khajeh-Saeed, Poole, and Perot [37] and Sriwardena and Ranasinghe [76] is of particular relevance to us as this work
specifically targets the alignment of two very long sequences, while the remaining research on GPU algorithms for sequence alignment focuses on the database alignment problem and the developed algorithms and software are unable to handle very large sequences. For example, CUDASW++ 2.0 [50] cannot handle strings whose length is more than 70000 on an NVIDIA Tesla C2050 GPU. As noted in [37], biological applications often have |A| in the range 10^4 to 10^5 and |B| in the range 10^7 to 10^10. We refer to instances of this size as very large.

Khajeh-Saeed et al. [37] modify the Smith-Waterman dynamic programming equations to obtain a set of equations that are more amenable to parallel implementation. However, this modification introduces significant computational overhead. Despite this overhead, their algorithm is able to achieve a computational rate of up to 0.7 GCUPS (billion cell updates per second) using a single NVIDIA Tesla C2050. The instance sizes they experimented with had |A||B| up to 10^11. Although Sriwardena and Ranasinghe [76] develop their GPU algorithms for pairwise sequence alignment specifically for the global alignment version, their algorithms are easily adapted to the case of local alignment. While their adaptations do not have the overheads of [37] that result from modifying the recurrence equations so as to increase parallelism, their algorithm is slower than that of [37].

In this chapter, we develop single-GPU parallelizations of the unmodified Smith-Waterman algorithm and obtain a speedup of up to 17 relative to the single-GPU algorithm of [37] and a computational rate of 7.1 GCUPS. Our high-level parallelization strategy is similar to that used by Melo et al. [56] and Futamura et al. [18] to arrive at parallel algorithms, on a cluster of workstations, for local alignment and syntenic alignment, respectively. Both divide the scoring matrix into as many strips as there are processors and each processor computes the scoring matrix for its strip row-wise. Melo et al. [56] do the traceback needed to determine the actual alignment serially using a single processor while Futamura et al. [18] do the traceback in parallel using a strategy similar to the
one used by us. The essential differences between our work and that of [56] and [18] are (a) our algorithms are optimized for a GPU rather than for a cluster, (b) we divide the scoring matrix into many more strips than the number of streaming multiprocessors in a GPU, and (c) the computation of a strip is done in parallel using many threads and the CUDA cores of a streaming multiprocessor rather than serially.

The rest of the chapter is organized as follows. In Section 4.1, we describe the Smith-Waterman algorithm for pairwise sequence alignment. In Section 4.2, we describe our GPU adaptation of the Smith-Waterman algorithm for the case when we want to report only the score of the best alignment, and in Section 4.3, we describe our adaptation for the case when the best alignment as well as its score are to be reported. Experimental results comparing the performance of our GPU adaptations with those of [37] and [76] are presented in Section 4.4 and we summarize in Section 4.5.

4.1 Smith-Waterman Algorithm

Let A = a1 a2 ... am and B = b1 b2 ... bn be the two sequences that are to be locally aligned. Let c(ai, bj) be the score for matching or aligning ai and bj, let α be the gap opening penalty, and β the gap extension penalty. So, the penalty for a gap of length k is α + kβ. Gotoh's [19] variant of the Smith-Waterman dynamic programming algorithm with an affine penalty function uses the following three recurrences.
H(i, j) = max{ H(i-1, j-1) + c(ai, bj), E(i, j), F(i, j), 0 }

E(i, j) = max{ E(i-1, j) - β, H(i-1, j) - α - β }

F(i, j) = max{ F(i, j-1) - β, H(i, j-1) - α - β }

where 1 <= i <= m, 1 <= j <= n, and the scoring matrices H, E, and F have the following meaning:

1. H(i, j) is the score of the best local alignment for (a1 ... ai) and (b1 ... bj).
2. E(i, j) is the score of the best local alignment for (a1 ... ai) and (b1 ... bj) under the constraint that ai is aligned to a gap.
3. F(i, j) is the score of the best local alignment for (a1 ... ai) and (b1 ... bj) under the constraint that bj is aligned to a gap.

The initial conditions are: H(0, 0) = H(i, 0) = H(0, j) = 0; E(0, 0) = -∞; E(i, 0) = -α - iβ; E(0, j) = -∞; F(0, 0) = -∞; F(i, 0) = -∞; F(0, j) = -α - jβ; 1 <= i <= m, 1 <= j <= n.

As mentioned in the introduction, the GPU adaptations of Khajeh-Saeed, Poole, and Perot [37] and Sriwardena and Ranasinghe [76] are most suited for the pairwise alignment of very long sequences. Khajeh-Saeed, Poole, and Perot [37] enhance parallelism by rewriting the recurrence equations. This rewrite eliminates the E terms and so their algorithm initially computes H values that differ from those computed by the original set of equations. Let H′ be the computed H values. In a followup step, modified E values, E′, are computed. The correct H values are then computed in a final step
from H′ and E′. Although the resulting 3-step computation increases parallelism, it also increases I/O traffic between device memory and the SMs.

Sriwardena and Ranasinghe [76] propose two GPU algorithms for global alignment using the Needleman-Wunsch dynamic programming algorithm [58]. These strategies can be readily deployed for local alignment using Gotoh's variant of the Smith-Waterman algorithm. Both of the strategies of Sriwardena and Ranasinghe [76] are based on the observation that for any (i, j), the H, E, and F values depend only on values in the positions immediately to the north, northwest, and west of (i, j) (see Figure 4-1). Consequently, it is possible to compute all H, E, and F values on the same antidiagonal, in parallel, once these values have been computed for the preceding two antidiagonals. The first algorithm, Antidiagonal, of Sriwardena and Ranasinghe [76] does precisely this. The GPU kernel computes H, E, and F values on a single antidiagonal using values stored in device/global memory for the preceding two antidiagonals. The host program sends the two strings A and B to device memory and then invokes the GPU kernel once for each of the m + n - 1 antidiagonals. Additional (but minor) speedup can be attained by recognizing that the computation for the first and last few antidiagonals can be done faster on the host CPU and invoking the GPU kernel only for sufficiently large antidiagonals.

When we desire to determine only the score of the best alignment, the device memory needed by Antidiagonal is O(min{m, n}). However, when the best alignment also is to be reported, we need to save, for each (i, j), the direction (north, northwest, west) and the scoring matrix (H, E, or F) that yielded the value for this position. O(mn) memory is required to save this information. Following the computation of the H, E, and F values, a serial traceback is done to determine the best alignment.

The second GPU algorithm, BlockedAntidiagonal, of Sriwardena and Ranasinghe [76] partitions the H, E, and F values into s × s square blocks (see Figure 4-2) and employs a GPU kernel to compute the values for a block. The host program allocates blocks to SMs and each SM computes the H, E, and F values for its assigned block using values
Figure 4-1. Data dependency of the Smith-Waterman algorithm

Figure 4-2. Illustration of BlockedAntidiagonal

computed earlier and stored in device memory for the blocks immediately to its north, northwest, and west. Hence, BlockedAntidiagonal attempts to enhance performance by utilizing both block-level parallelism and parallelism within an antidiagonal of a block. More importantly, it seeks to reduce I/O traffic to device memory and to utilize shared memory. Notice that device I/O is now needed only at the start and end of a block computation. The block assignment strategy of Figure 4-2 does the computation in blocked antidiagonal order with the host invoking the kernel for all blocks on the same antidiagonal. The total number of blocks is O(mn/s^2) and the I/O traffic between global memory and the SMs is O(mn). In contrast, the I/O traffic for Antidiagonal is O(mn). Experimental results reported in [76] demonstrate that BlockedAntidiagonal is roughly
two times faster than Antidiagonal. Their research shows that BlockedAntidiagonal exhibits near-optimal performance when the block size s is 8.

The BlockedAntidiagonal strategy of Figure 4-2 may be enhanced for the case when we are interested only in the score of the best alignment. In this enhancement, we write to global memory only the computed values for the bottom and right boundaries of each block. This reduces the global memory I/O traffic to O(mn/s).

4.2 Computing the Score of the Best Local Alignment

In our GPU adaptation, StripedScore, of the Smith-Waterman algorithm, we assume that m <= n (in case this is not so, simply swap the roles of A and B) and partition the scoring matrices H, E, and F into ⌈n/s⌉ m × s strips (Figure 4-3). Here, s is the strip width. Let p be the number of SMs in the GPU (for the C2050, p = 14). The GPU kernel is written so that SM i computes the H, E, and F values for all strips j such that j mod p = i, 0 <= j < ⌈n/s⌉. Each SM computes the values in its strip a batch of T antidiagonals at a time; before starting a batch, it checks
whether the next batch of H and F values it needs from its left adjacent strip is ready in global memory. If so, it reads this batch and computes the next T antidiagonals for its strip. If not, it waits in an idle state.

Figure 4-3. Striped Smith-Waterman algorithm

Our striped algorithm therefore requires O(min{m, s}) shared memory per SM and O(mn/s) global memory. The I/O traffic between global memory and the SMs is O(mn/s). To derive the computational time requirements (exclusive of the time taken by the global memory I/O traffic), we assume that the threshold value T is O(1). We note that the computation for the kth strip cannot begin until the top right value of strip k - 1 has been computed. An SM with c processors takes Ta = O(s^2/c) time to compute the top right value of the strip assigned to it and O(ms/c) time to complete the computation for the entire strip. So, SM p - 1 cannot start working on the first strip assigned to it until time (p - 1)Ta. When an SM can go from the computation of one strip to the computation of the next strip with no delay, the completion time of SM p - 1, and hence the time taken by the GPU to do all its assigned work (exclusive of the time taken by global memory I/O traffic), is O((p - 1)Ta + (ms/c)(n/(ps))) = O(ps^2/c + mn/(pc)). When an SM takes less time
to complete the computation for a strip than it takes to compute the data needed to commence on the next strip assigned to the SM (approximately, ms/c < pTa), the SM incurs idle time and the completion time is governed by the rate at which boundary data from the preceding strips become available.

The main differences between StripedScore and BlockedAntidiagonal are:

1. StripedScore partitions the scoring matrices into m × s strips while BlockedAntidiagonal partitions them into s × s blocks.
2. In BlockedAntidiagonal the assignment of blocks that are ready for computation to SMs is done by the GPU block scheduler while in StripedScore the assignment of strips to SMs is programmed into the kernel code.

3. The I/O traffic of StripedScore is O(mn/s) while that of BlockedAntidiagonal is O(mn).

4. While for BlockedAntidiagonal near-optimal performance is achieved when s = 8, we envision much larger s values for StripedScore, which can be up to 1900. Consequently, there is greater opportunity for parallelism within a strip than within a block.

The above steps can lead to significant improvement in the overall performance.

4.3 Computing the Best Local Alignment

In this section, we describe three GPU algorithms for the case when we wish to determine both the best alignment and its score.

4.3.1 StripedAlignment

With each position (u, v) of H, E, and F, we associate a start state, which is a triple (i, j, X), where (i, j) are the coordinates of the local start point of the optimal path to (u, v). This local start point is either a position in the current strip or a position on the right boundary of the strip immediately to the left of the current strip. X is one of H, F, and E and identifies whether the optimal path to (say) H(u, v) begins at H(i, j), F(i, j), or E(i, j).

StripedAlignment is a 3-phase algorithm. The first phase is an extension of StripedScore in which each strip stores, in global memory, not only the H and F values needed by the strip to its right but also the local start states of the optimal paths to each boundary cell. For each boundary cell (u, v), three start states (one for each of H(u, v), F(u, v), and E(u, v)) are stored. So, for the 4 strips of Figure 4-4, the boundary cells store, in global memory, the local start states of subpaths that end at the boundary cells (∗, 4), (∗, 8), (∗, 12), and (∗, 16). Additionally, we need to store the local start state and the end state for the overall best alignment. Since the highest H score is at (8, 9) and the local start state for H(8, 9) is (7, 8, H), (7, 8, H) is initially stored in registers and finally written along with (8, 9, H) to global memory. In phase 1, the local start states for
the optimal paths to all boundary cells (not just the boundary cells through which the overall alignment path traverses) are written to global memory.

Figure 4-4. Example for StripedAlignment

In phase 2, we serially determine, for each strip, the start state and end state of the optimal alignment subpath that goes through this strip. Suppose, for our example of Figure 4-4, we determine that the optimal alignment path is comprised of a subpath from (7, 8, H) to (8, 9, H), another subpath from (3, 4, F) to (7, 8, H), and one from (2, 3, H) to (3, 4, F). Finally, in phase 3, the optimal subpath for each strip the optimal path goes through is computed by recomputing the H, E, and F values for the strips the optimal alignment path traverses. Using the saved boundary H and F values, it is possible to compute the subpaths for all strips in parallel.

4.3.2 ChunkedAlignment1

ChunkedAlignment1, like StripedAlignment, is a 3-phase algorithm. In ChunkedAlignment1, each strip is partitioned into chunks of height h (Figure 4-5). For each h × s chunk we store, in global memory, the H, F, and local start states for positions on right vertical
chunk boundaries (i.e., vertical buffers, which are the same as boundary buffers in StripedAlignment) and the H and E values for horizontal buffers. The assignment of strips to SMs is the same as in StripedScore (and StripedAlignment).

Figure 4-5. Example for ChunkedAlignment

In phase 2, we use the data stored in global memory by the phase 1 computation to determine the start and end states of the subpaths of the optimal alignment path within each strip. Finally, in phase 3, the optimal subpaths are constructed by a computation within each strip through which the optimal alignment traverses. However, the computation within a strip can be limited to essential chunks, as shown by the shaded chunks in Figure 4-5. The computation for these (sub)strips can be done in parallel.

There are two major differences between StripedAlignment and ChunkedAlignment1:

1. ChunkedAlignment1 generates more I/O traffic than does StripedAlignment and also requires more global memory on account of storing horizontal buffer data. Assuming that the height and width of a chunk are nearly equal, the I/O traffic and the global memory requirement are roughly twice the amount for StripedAlignment for the same strip size as the width of the chunk.

2. Unlike StripedAlignment, the phase 3 computation begins at the start point of a chunk rather than at the first row of the strip. In practice, this should reduce the amount of computation significantly.
4.3.3 ChunkedAlignment2

ChunkedAlignment2 is a natural extension of ChunkedAlignment1. Phase 1 is modified to additionally store, for the horizontal buffers, the local start state (local to the chunk) of the optimal path that goes through that buffer. Similarly, for vertical buffers, the start state local to the chunk (rather than local to the strip) is stored. In phase 2, we use the data stored in global memory during phase 1 to determine the chunks through which the optimal alignment path traverses as well as the start and end states of the subpaths through these chunks. In phase 3, the subpath within each chunk is computed by using data stored in phase 1 and the knowledge of the subpath end states determined in phase 2. As was the case for StripedAlignment and ChunkedAlignment1, the subpaths for all identified chunks can be computed in parallel.

Unlike in ChunkedAlignment1, additional computations and I/O need to be performed in phase 1. The advantage is that the computation for all the chunks can be performed in parallel in phase 3. So, for a given data set, if a large number of chunks on the optimal path are present in a given strip, this work will be assigned to a single streaming multiprocessor by ChunkedAlignment1 and will effectively be performed sequentially. With ChunkedAlignment2, however, these chunks can potentially be assigned to multiple streaming multiprocessors. For example, in Figure 4-5, strip 2 has 3 chunks. These will be assigned to the same streaming multiprocessor by ChunkedAlignment1, but can be assigned to 3 different streaming multiprocessors by ChunkedAlignment2. Thus, the amount of parallelism available is significantly increased.

4.3.4 Memory Requirements

The code of [37] to find the actual alignment stores 3 m × n matrices in global memory. Since each matrix element is a score, each is a 4-byte int and so the code of [37] needs 12mn bytes of global memory. The code of [76], when extended to affine cost functions, also needs 12mn bytes of global memory. The global memory required by our methods is instance dependent. For each position encountered in phase 3, we
store direction information for all three matrices. For H, 6 possibilities exist for one cell: NEW (this cell starts a new alignment), DIAGONAL (this cell comes from the diagonal direction), E_UP (this cell comes from the above cell in matrix E), H_UP (this cell comes from the above cell in matrix H), F_LEFT (this cell comes from the left cell in matrix F), and H_LEFT (this cell comes from the left cell in matrix H); 3 bits are enough to store this information for each cell. For E, 2 possibilities exist, E_UP and H_UP, so 1 bit is enough. For F, 1 bit is enough to distinguish F_LEFT and H_LEFT. So, in the worst case, StripedAlignment requires 5mn/8 bytes of global memory while ChunkedAlignment1 and ChunkedAlignment2 require 5(ms + nh)/8 bytes. Hence, StripedAlignment, for example, can handle problems with an mn value about 19 times as large as can be handled by [37] and [76]. Besides, when more memory space is required, phase 3 can be split into multiple iterations. The same memory space required for one SM to compute part of the alignment within one strip can be reused in different iterations while computing for different strips.

4.4 Experimental Results

In this section, we present experimental results for scoring and alignment, respectively. All of these experiments were conducted on an NVIDIA Tesla C2050 GPU.

StripedScore. First, we measured the running time of StripedScore as a function of the strip width s (Table 4-1 and Figure 4-6). lenQuery is the length of the query sequence and lenDB is the length of the subject sequence. As predicted by the analysis of Section 4.2, for sufficiently large sequences, the running time decreases as s increases. However, if the sequences are relatively small, as s increases the running time decreases first and then increases. Next, we compared the relative performance of StripedScore with s = 1900, PerotRecurrence (the code of [37] modified to report the best score rather than the best 200 scores), BlockedAntidiagonal [76], EnhancedBA (our enhancement of
BlockedAntidiagonal in which only the values on the right and bottom boundaries of each block are stored in global memory, thus reducing global memory usage significantly), and CUDASW++ 2.0 [50]. Table 4-2 and Figure 4-7 give the run times for these algorithms. As can be seen, PerotRecurrence takes 13 to 17 times the time taken by StripedScore. The speedup ranges of StripedScore relative to BlockedAntidiagonal, EnhancedBA, and CUDASW++ 2.0 are, respectively, 20 to 33, 2.8 to 9.3, and 7.7 to 22.8. BlockedAntidiagonal and CUDASW++ 2.0 were unable to solve large instances because of the excessive memory required by them.

Table 4-1. Running time (ms) of StripedScore for different s values

  lenQuery   5103   10206   20412    30618    51030
  lenDB      7168   14336   28672    43008    71680

  s=64       67.2   260.2   1031.7   2316.7   6427.1
  s=128      36.2   132.9   518.9    1161.4   3216.0
  s=256      23.1   72.6    269.8    597.5    1645.2
  s=512      22.3   50.1    160.9    344.9    932.3
  s=1024     -      59.4    135.5    261.7    664.4
  s=1900     -      -       175.0    279.9    625.7

Figure 4-6. Plot of running time (ms) of StripedScore for different s values
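All of the scoring codes compared above parallelize the same computation: Gotoh's affine-gap recurrences of Section 4.1. As a point of reference, here is a minimal single-core sketch of the score-only computation; the scoring parameters are illustrative defaults, not the ones used in the experiments.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Single-core sketch of Gotoh's affine-gap Smith-Waterman scoring
// (Section 4.1). match/mismatch/alpha/beta are illustrative values.
int gotoh_local_score(const std::string& A, const std::string& B,
                      int match = 2, int mismatch = -1,
                      int alpha = 2, int beta = 1) {
    const int NEG = -1000000000 / 2;          // stands in for -infinity
    std::size_t m = A.size(), n = B.size();
    std::vector<std::vector<int> > H(m + 1, std::vector<int>(n + 1, 0)),
                                   E(m + 1, std::vector<int>(n + 1, NEG)),
                                   F(m + 1, std::vector<int>(n + 1, NEG));
    int best = 0;
    for (std::size_t i = 1; i <= m; ++i)
        for (std::size_t j = 1; j <= n; ++j) {
            int c = (A[i-1] == B[j-1]) ? match : mismatch;
            // E: a_i aligned to a gap; F: b_j aligned to a gap
            E[i][j] = std::max(E[i-1][j] - beta, H[i-1][j] - alpha - beta);
            F[i][j] = std::max(F[i][j-1] - beta, H[i][j-1] - alpha - beta);
            H[i][j] = std::max({H[i-1][j-1] + c, E[i][j], F[i][j], 0});
            best = std::max(best, H[i][j]);
        }
    return best;          // score of the best local alignment
}
```

This sketch keeps the full O(mn) matrices for clarity; the score-only GPU codes discussed above keep only O(min{m, s}) state per SM.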
Table 4-2. Running time (ms) of scoring algorithms

  lenQuery              1x10^4   2x10^4   3x10^4   5x10^4    1x10^5
  lenDB                 1x10^4   2x10^4   3x10^4   5x10^4    1x10^5

  PerotRecurrence       815.3    1917.7   3061.7   7014.9    20437.3
  StripedScore          48.4     113.0    216.3    449.8     1543.1
  BlockedAntidiagonal   957.2    3719.9   -        -         -
  EnhancedBA            137.4    527.0    1185.9   3438.6    14327.4
  CUDASW++ 2.0          374.5    1530.4   3404.0   10259.1   -

Table 4-3. Running time (ms) of StripedAlignment for different s values; each entry gives phase 1 / phase 3 / total times

  lenQuery   10430                15533                20860
  lenDB      14560                21728                29120

  s=64       616.9/440.1/1066.4   1352.9/973.5/2343.5   2402.4/1745.1/4176.5
  s=128      328.6/298.9/633.5    707.4/650.2/1367.2    1245.4/1147.0/2408.1
  s=256      183.9/318.0/505.6    384.0/715.7/1104.5    670.4/1273.5/1952.6
  s=410      136.4/380.9/520.7    264.6/827.3/1096.0    484.2/1531.8/2020.9

StripedAlignment. For ease of coding, our implementation uses a char to store the direction information at each position encountered in phase 3 rather than 3 bits as in the analysis of Section 4.3.4. We tested StripedAlignment with different s values and the results are shown in Table 4-3 and Figure 4-8. Using an analysis similar to that of Section 4.2, we determine the maximum strip size, which is limited by the amount of shared memory per SM, to be 410. The time for phase 2 is negligible and is not reported separately. The times for all three phases generally decrease as s increases. For really large s, the number of strips is comparable to or smaller than the number of streaming multiprocessors, leading to idle time on some processors. Generally, choosing s = 256 gives the best overall performance for sequences of size up to 37000. Choosing a larger s allows for larger sequences to be aligned (as I/O is inversely proportional to s). We do not compare StripedAlignment with the algorithms of [37, 50, 76] for the following reasons: (a) in [76], the traceback is done serially on the host CPU, (b) CUDASW++ 2.0 [50] does not have a traceback capability, and (c) the traceback of [37] is specifically designed for the benchmark suite SSCA#1 [3] and so aligns only multiple but small subsequences of length less than 128.
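The striped decomposition underlying these measurements can be mimicked on a CPU to see why only the H and F values of a strip's rightmost column need to be passed to the strip on its right (the E recurrence moves vertically and never crosses a strip boundary). The sketch below is a serial model of StripedScore, not the CUDA kernel; scoring parameters are illustrative.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Serial model of the StripedScore decomposition (Section 4.2): process
// the scoring matrices in vertical strips of width s; between strips,
// carry only the H and F values of the left strip's last column.
int striped_local_score(const std::string& A, const std::string& B,
                        std::size_t s, int match = 2, int mismatch = -1,
                        int alpha = 2, int beta = 1) {
    const int NEG = -1000000000 / 2;                   // -infinity stand-in
    std::size_t m = A.size(), n = B.size();
    std::vector<int> Hleft(m + 1, 0), Fleft(m + 1, NEG); // column j0 - 1
    int best = 0;
    for (std::size_t j0 = 1; j0 <= n; j0 += s) {         // one strip
        std::size_t j1 = std::min(j0 + s - 1, n), w = j1 - j0 + 1;
        std::vector<int> Hprev(w, 0), Hcur(w), Ecol(w, NEG), Fcur(w);
        std::vector<int> HleftNext(m + 1, 0), FleftNext(m + 1, NEG);
        int HdiagLeft = Hleft[0];                        // H(i-1, j0-1)
        for (std::size_t i = 1; i <= m; ++i) {
            for (std::size_t k = 0; k < w; ++k) {
                int c = (A[i-1] == B[j0+k-1]) ? match : mismatch;
                int Hnw = (k == 0) ? HdiagLeft : Hprev[k-1];
                int Hw  = (k == 0) ? Hleft[i]  : Hcur[k-1];
                int Fw  = (k == 0) ? Fleft[i]  : Fcur[k-1];
                Ecol[k] = std::max(Ecol[k] - beta, Hprev[k] - alpha - beta);
                Fcur[k] = std::max(Fw - beta, Hw - alpha - beta);
                Hcur[k] = std::max({Hnw + c, Ecol[k], Fcur[k], 0});
                best = std::max(best, Hcur[k]);
            }
            HleftNext[i] = Hcur[w-1];    // boundary data for the next strip
            FleftNext[i] = Fcur[w-1];
            HdiagLeft = Hleft[i];
            Hprev.swap(Hcur);
        }
        Hleft.swap(HleftNext);
        Fleft.swap(FleftNext);
    }
    return best;
}
```

In the GPU algorithm the strips run concurrently on different SMs, with each strip polling global memory for batches of these boundary H and F values; the serial model returns the same best score for any strip width s.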
Figure 4-7. Comparison of different scoring algorithms

Figure 4-8. Running time (ms) of StripedAlignment when lenQuery = 20860 and lenDB = 29120
Each entry in Tables 4-4 through 4-9 gives phase 1 / phase 3 / total times.

Table 4-4. Running time (ms) of ChunkedAlignment1 (lenQuery = 10430, lenDB = 14560)

  h     s=64               s=128              s=256              s=410
  64    728.5/8.9/742.3    390.0/10.8/404.7   220.1/11.6/235.3   159.2/21.1/183.7
  128   702.7/10.7/718.0   376.9/12.5/393.1   212.8/13.0/229.2   154.5/23.8/181.5
  256   689.4/14.6/708.7   368.7/15.5/387.9   208.3/15.7/227.4   152.2/30.0/185.4
  512   682.9/20.1/707.7   364.8/19.0/387.6   205.5/20.7/229.7   151.0/38.4/192.5

Table 4-5. Running time (ms) of ChunkedAlignment1 (lenQuery = 15533, lenDB = 21728)

  h     s=64                 s=128              s=256              s=410
  64    1596.6/13.1/1616.7   838.7/16.0/859.9   459.0/16.3/479.9   309.2/31.2/344.5
  128   1540.4/15.9/1562.8   810.1/18.2/833.1   443.5/18.4/465.8   300.0/34.5/338.2
  256   1511.3/21.6/1539.4   792.9/22.8/820.5   434.2/23.5/461.5   295.5/44.8/344.0
  512   1497.0/30.3/1533.9   784.6/28.3/817.6   428.6/30.7/463.1   293.0/56.3/352.9

Table 4-6. Running time (ms) of ChunkedAlignment1 (lenQuery = 20860, lenDB = 29120)

  h     s=64                 s=128                s=256              s=410
  64    2833.3/17.9/2861.2   1475.0/21.4/1503.4   803.8/21.7/830.9   571.2/40.3/616.5
  128   2733.6/21.5/2764.4   1424.0/24.4/1454.7   774.3/24.3/803.4   553.3/45.0/602.7
  256   2683.0/28.4/2720.6   1395.1/29.4/1430.7   756.2/29.1/789.9   544.8/57.2/606.0
  512   2657.5/42.4/2709.2   1381.5/38.3/1425.8   747.1/43.8/795.3   539.1/77.9/621.0

Table 4-7. Running time (ms) of ChunkedAlignment2 (lenQuery = 10430, lenDB = 14560)

  h     s=64               s=128              s=256              s=410
  64    797.3/13.2/817.0   426.4/18.8/451.0   241.7/26.4/273.4   174.2/38.7/218.3
  128   739.8/15.1/760.7   403.4/17.4/425.8   229.2/21.2/255.0   164.1/27.9/196.3
  256   710.2/21.7/737.5   383.1/22.0/409.8   219.8/23.7/247.7   158.6/35.1/197.6
  512   695.8/36.5/737.9   373.6/30.1/408.2   211.8/40.2/255.8   154.7/53.9/212.4

Table 4-8. Running time (ms) of ChunkedAlignment2 (lenQuery = 15533, lenDB = 21728)

  h     s=64                 s=128              s=256              s=410
  64    1745.0/19.6/1774.2   916.8/28.0/952.8   504.2/39.5/550.9   338.4/56.3/401.9
  128   1619.8/22.6/1650.5   866.2/26.2/898.9   477.2/32.3/515.2   318.7/40.5/364.8
  256   1555.6/32.8/1595.9   823.7/32.5/862.0   457.7/34.5/497.2   308.2/51.9/364.8
  512   1524.0/55.8/1587.4   802.9/44.0/852.5   441.2/58.2/504.0   300.2/81.5/386.0

Table 4-9. Running time (ms) of ChunkedAlignment2 (lenQuery = 20860, lenDB = 29120)

  h     s=64                 s=128                s=256              s=410
  64    3105.7/1.6/3119.3    1612.1/37.6/1660.7   880.8/53.2/943.7   629.5/2.6/640.4
  128   2886.5/1.7/2898.0    1521.3/35.0/1564.7   831.1/42.2/880.6   592.6/53.8/653.3
  256   2762.9/43.6/2817.0   1449.4/43.7/1500.5   796.7/45.2/848.1   574.1/65.9/645.7
  512   2706.2/73.6/2790.3   1413.3/58.1/1478.4   771.2/76.5/853.2   557.2/108.7/670.9

Table 4-10. Best running time (ms) of ChunkedAlignment1 and ChunkedAlignment2; each entry gives phase 1 / phase 3 / total times

  lenQuery/lenDB      10430/14560        15533/21728        20860/29120
  ChunkedAlignment1   154.5/23.8/181.5   300.0/34.5/338.2   553.3/45.0/602.7
  ChunkedAlignment2   164.1/27.9/196.3   318.7/40.5/364.8   629.5/2.6/640.4
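The 5mn/8-byte bound of Section 4.3.4 assumes the per-cell traceback information is packed at 5 bits per cell: 3 bits for H's six cases, 1 bit for E (E_UP vs H_UP), and 1 bit for F (F_LEFT vs H_LEFT). The sketch below shows one way such a packed store could look; as noted above, for ease of coding our actual implementation spends a full char per cell, and all names here are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Packed traceback store: 5 bits per cell (3 for H, 1 for E, 1 for F),
// i.e., 5mn/8 bytes in all. Enumerator names follow Section 4.3.4.
enum HDir : std::uint8_t { NEW, DIAGONAL, E_UP, H_UP, F_LEFT, H_LEFT };

struct DirStore {
    std::vector<std::uint8_t> bits;               // 5 bits per cell, packed
    explicit DirStore(std::size_t cells) : bits((cells * 5 + 7) / 8, 0) {}
    void set(std::size_t cell, unsigned h, unsigned e, unsigned f) {
        std::uint8_t v = std::uint8_t((h & 7u) | ((e & 1u) << 3) | ((f & 1u) << 4));
        std::size_t bit = cell * 5, byte = bit / 8, off = bit % 8;
        std::uint16_t w = bits[byte];             // field may straddle a byte
        if (byte + 1 < bits.size()) w |= std::uint16_t(bits[byte + 1]) << 8;
        w = std::uint16_t((w & ~(std::uint16_t(31) << off)) |
                          (std::uint16_t(v) << off));
        bits[byte] = std::uint8_t(w);
        if (byte + 1 < bits.size()) bits[byte + 1] = std::uint8_t(w >> 8);
    }
    unsigned get(std::size_t cell) const {        // packed 5-bit value
        std::size_t bit = cell * 5, byte = bit / 8, off = bit % 8;
        std::uint16_t w = bits[byte];
        if (byte + 1 < bits.size()) w |= std::uint16_t(bits[byte + 1]) << 8;
        return (w >> off) & 31u;
    }
};
```

Bits 0-2 of each field hold the H case, bit 3 E's choice, and bit 4 F's; a traceback decoder reads the field for the current cell and moves north, northwest, or west accordingly.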
Figure 4-9. Plot of best running time (ms) of ChunkedAlignment1 and ChunkedAlignment2

ChunkedAlignment1. There are two parameters in ChunkedAlignment1: s, the width of one strip, and h, the height of one chunk. Effectively, the scoring matrix is divided into blocks of size h × s. The data of Tables 4-4 through 4-6 show that, as was the case for StripedAlignment, performance improves as we increase s. Large values of h and s have the potential to reduce the amount of parallelism in phase 3. A good choice from our experimental results is s = 410 and h = 128. As expected, the phase 3 time for ChunkedAlignment1 is significantly less than for StripedAlignment. Although this reduction comes with additional computational and I/O cost in phase 1, the overall time for ChunkedAlignment1 is much less than for StripedAlignment. For sequences of size 20860 and 29120, the best time for StripedAlignment is 1952.6 ms while that for ChunkedAlignment1 is 602.7 ms (s = 410, h = 128); the ratio is slightly more than 3.

The major advantage of ChunkedAlignment2 is better parallelism in phase 3. However, the additional overheads in phase 1 and phase 3 (in terms of I/O) result
in performance that is slightly worse than that of ChunkedAlignment1, as shown in Tables 4-7 through 4-10 and Figure 4-9. ChunkedAlignment2 will be better than ChunkedAlignment1 only for strings in which the optimal alignment path passes through many chunks within one strip. For the strip sizes and the data sets that we used, we did not find this to be the case. Since StripedScore is an order of magnitude faster than PerotRecurrence and ChunkedAlignment1 is not an order of magnitude slower than StripedScore, we conclude, without experiment, that ChunkedAlignment1 is faster than the code of [37] modified to find the best alignment.

4.5 Summary

In this chapter, we have developed single-GPU parallelizations of the unmodified Smith-Waterman algorithm for sequence alignment. Our scoring algorithm StripedScore achieves a speedup of 13 to 17 relative to the single-GPU algorithm of [37]. The speedup ranges relative to BlockedAntidiagonal [76] and CUDASW++ 2.0 [50] are, respectively, 20 to 33 and 7.7 to 22.8. Our algorithms achieve a computational rate of 7.1 GCUPS on a single GPU. Our algorithms to determine the actual alignment are not only faster than competing algorithms but also require much less memory. For example, StripedAlignment, in the worst case, takes 1/19th the memory required by the algorithms of [76] and [37] with traceback function. ChunkedAlignment1 and ChunkedAlignment2 require even less memory.
CHAPTER 5
OPTIMAL ALIGNMENT OF THREE SEQUENCES ON A GPU

Multiple sequence alignment is a fundamental problem in bioinformatics. In this problem, we are given a set of N sequences Σ = {S0, S1, ..., S(N-1)} where |Si| = li and i ∈ {0, 1, ..., N - 1}. For DNA sequences, the alphabet for Σ is the four-letter set {A, C, G, T} and for protein sequences, the alphabet is the 20-letter set {A, C-I, K-N, P-T, V, W, Y}. We are to insert gaps into the sequences so that the resulting sequences are of the same length; in the resulting sequences, each character is said to be aligned with the gap or character in the corresponding position in each of the other sequences; and the alignment score is maximized. On a single-core CPU, the best multiple sequence alignment (i.e., the one with maximum score) of Σ can be found in O(|S0||S1|...|S(N-1)|) time using dynamic programming [74]. In this chapter, we focus on GPU algorithms for the optimal alignment of 3 sequences (i.e., N = 3).

Several papers have been written on three-sequence alignment. We briefly mention a few here. Murata et al. [57] developed algorithms for the optimal alignment of 3 sequences using a constant gap weight. Gotoh [20] developed algorithms to align three sequences using an affine gap penalty model. Huang [25] reduced the memory requirement of these algorithms to be quadratic in the sequence length. Powell et al. [66] have developed a faster three-sequence alignment algorithm using a speed-up technique based on Ukkonen's greedy algorithm [85]. Hung et al. [27] use a position-specific gap penalty model for three-sequence alignment. Yue et al. [93] propose a divide-and-conquer algorithm for three-sequence alignment and Lin et al. [44] propose a parallel algorithm for three-sequence alignment.

Besides being of interest in its own right, three-sequence alignment has been proposed as a base step in the alignment of a larger number of sequences. For example, Kruspe and Stadler [39] report improved accuracy when three-sequence alignment is used instead of pairwise sequence alignment in their progressive multiple


sequence alignment program aln3nn. Three-sequence alignment also helps in the tree alignment problem [86].

In practice, heuristics that trade optimality for computational efficiency are often used when N > 2 due to the high computational complexity of multiple sequence alignment. However, with the advent of low-cost parallel computers, there is renewed interest in developing computationally practical algorithms that guarantee optimality of the constructed alignment, and several researchers have developed efficient GPU algorithms for the optimal alignment of 2 sequences (e.g., [41], [42]). GPU adaptations for heuristic multiple sequence alignment have also been developed [21][45][47][49][48]. However, it appears that no algorithms for three-sequence alignment on a GPU have yet been developed.

In this chapter, we develop two single-GPU algorithms for the optimal alignment of three sequences. The first of these uses a layering approach while the second uses a sloped approach. Experimental results using an NVIDIA GPU show that the sloped approach results in a faster algorithm and that this algorithm is an order of magnitude faster than a single-core alignment algorithm running on the host computer.

The rest of the chapter is organized as follows. In Section 5.1, we describe the optimal single-core algorithm for multiple sequence alignment for the case N = 3. In Section 5.2, we describe our GPU adaptation of this algorithm for the case when we want to report only the score of the best alignment and in Section 5.3, we describe our adaptation for the case when the best alignment as well as its score are to be reported. Experimental results are presented in Section 5.4 and we summarize in Section 5.5.

5.1 Three-Sequence Alignment Algorithm

The input to the 3-sequence alignment problem is a set Σ = {S0, S1, S2} of 3 sequences and a substitution matrix sub such as BLOSUM [22] or PAM [12], which defines the score for character pairs xy. The output is a 3×l matrix M where l ≥ max(|S0|, |S1|, |S2|). Row i of M is Si, 0 ≤ i < 3, with gaps possibly inserted at various


positions and such that each column of M has at least one non-gap character. M defines an alignment whose score is

score(M) = Σ_{i=0}^{l−1} obj(M[0][i], M[1][i], M[2][i])

In this chapter, we define obj to be the sum-of-pairs function [74]:

obj(M[0][i], M[1][i], M[2][i]) = sub[M[0][i]][M[1][i]] + sub[M[0][i]][M[2][i]] + sub[M[1][i]][M[2][i]]

The alignment M is optimal iff it maximizes score(M). The dynamic programming algorithm to construct M first computes an (|S0|+1)×(|S1|+1)×(|S2|+1) matrix H using the recurrences given below [74].

H[i][j][k] = max of:
  H[i−1][j−1][k−1] + obj(S0[i−1], S1[j−1], S2[k−1])
  H[i−1][j−1][k]   + obj(S0[i−1], S1[j−1], −)
  H[i−1][j][k−1]   + obj(S0[i−1], −, S2[k−1])
  H[i][j−1][k−1]   + obj(−, S1[j−1], S2[k−1])
  H[i−1][j][k]     + obj(S0[i−1], −, −)
  H[i][j−1][k]     + obj(−, S1[j−1], −)
  H[i][j][k−1]     + obj(−, −, S2[k−1])

where − denotes a gap and γ is the gap penalty. The initial conditions are (i > 0, j > 0, k > 0):


H[0][0][0] = 0
H[i][0][0] = 2iγ
H[0][j][0] = 2jγ
H[0][0][k] = 2kγ

H[i][j][0] = max of:
  H[i−1][j−1][0] + obj(S0[i−1], S1[j−1], −)
  H[i−1][j][0]   + obj(S0[i−1], −, −)
  H[i][j−1][0]   + obj(−, S1[j−1], −)

H[i][0][k] = max of:
  H[i−1][0][k−1] + obj(S0[i−1], −, S2[k−1])
  H[i−1][0][k]   + obj(S0[i−1], −, −)
  H[i][0][k−1]   + obj(−, −, S2[k−1])

H[0][j][k] = max of:
  H[0][j−1][k−1] + obj(−, S1[j−1], S2[k−1])
  H[0][j−1][k]   + obj(−, S1[j−1], −)
  H[0][j][k−1]   + obj(−, −, S2[k−1])

The score of the optimal alignment is H[|S0|][|S1|][|S2|] and the corresponding alignment matrix M may be constructed using a backtrace process that starts at H[|S0|][|S1|][|S2|]. The time required to compute the H values is O(|S0||S1||S2|). An additional O(l) time is required to construct the optimal alignment matrix.

5.2 Computing the Score of the Best Alignment

In this section, we describe two GPU algorithms, LAYERED and SLOPED, to compute H.
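Before turning to the GPU strategies, the recurrence and initial conditions of Section 5.1 can be exercised with a minimal sequential sketch (scores only), not the GPU kernel. The substitution scores and the gap penalty GAP used here are illustrative stand-ins, not the dissertation's values.

```python
# A minimal sequential sketch of the three-sequence scoring recurrence above.
# GAP, MATCH, and MISMATCH are assumed toy values for illustration only.
GAP, MATCH, MISMATCH = -2, 2, -1

def sub(x, y):
    # Toy substitution score; a real run would use a BLOSUM or PAM matrix.
    if x == '-' and y == '-':
        return 0
    if x == '-' or y == '-':
        return GAP
    return MATCH if x == y else MISMATCH

def obj(x, y, z):
    # The sum-of-pairs objective defined in the text.
    return sub(x, y) + sub(x, z) + sub(y, z)

def score3(S0, S1, S2):
    n0, n1, n2 = len(S0), len(S1), len(S2)
    H = [[[0] * (n2 + 1) for _ in range(n1 + 1)] for _ in range(n0 + 1)]
    for i in range(n0 + 1):
        for j in range(n1 + 1):
            for k in range(n2 + 1):
                if i == j == k == 0:
                    continue  # H[0][0][0] = 0
                a = S0[i - 1] if i else '-'
                b = S1[j - 1] if j else '-'
                c = S2[k - 1] if k else '-'
                cands = []
                if i and j and k:
                    cands.append(H[i-1][j-1][k-1] + obj(a, b, c))
                if i and j:
                    cands.append(H[i-1][j-1][k] + obj(a, b, '-'))
                if i and k:
                    cands.append(H[i-1][j][k-1] + obj(a, '-', c))
                if j and k:
                    cands.append(H[i][j-1][k-1] + obj('-', b, c))
                if i:
                    cands.append(H[i-1][j][k] + obj(a, '-', '-'))
                if j:
                    cands.append(H[i][j-1][k] + obj('-', b, '-'))
                if k:
                    cands.append(H[i][j][k-1] + obj('-', '-', c))
                H[i][j][k] = max(cands)
    return H[n0][n1][n2]
```

Note that the boundary equations (e.g., H[i][0][0] = 2iγ) fall out of the general recurrence once terms with out-of-range indices are dropped.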
5.2.1 Layered Algorithm

GPU Computational Strategy. In this algorithm, which is called LAYERED, the three-dimensional matrix H is partitioned into s×s×(|S2|+1) chunks (cuboids) as shown in Figure 5-1, where s is an algorithm design parameter.
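The antidiagonal numbering of chunks and the round-robin chunk-to-SM assignment described below can be sketched as follows (illustrative code, not from the dissertation): chunks form an R×C grid, are numbered 0 through d−1 by antidiagonals, and chunk j goes to SM j mod p.

```python
# Sketch of the chunk numbering and round-robin assignment to SMs.
def antidiagonal_order(R, C):
    order = []
    for q in range(R + C - 1):                      # antidiagonal q = row + col
        for row in range(max(0, q - C + 1), min(R, q + 1)):
            order.append((row, q - row))            # top-right to bottom-left
    return order

def assign_to_sms(R, C, p):
    # Chunk number j (in antidiagonal order) is assigned to SM j mod p.
    return {chunk: j % p for j, chunk in enumerate(antidiagonal_order(R, C))}
```

For a 2×2 grid of chunks the numbering is (0,0), (0,1), (1,0), (1,1); with p = 14 SMs, each of these four chunks lands on its own SM.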


Figure 5-1. The partitioning of the three-dimensional matrix H

Let d be the number of chunks in the partitioning of H. These chunks form a (|S1|+1)/s × (|S0|+1)/s matrix whose elements may be numbered from 0 through d−1 in antidiagonal (i.e., top-right to bottom-left) order as in Figure 5-1. Let p be the number of SMs in the GPU (for the C2050, p = 14). SM i of the GPU will compute the H values for all chunks j such that j mod p = i, 0 ≤ j < d.

When all layers have been computed, the SM moves to the next chunk assigned to it. In order to achieve high performance, the computed H values for a layer are stored in the shared memory of the SM until the computation of the next layer is complete, as the current layer's H values are needed to compute those of the next layer. At any time during the computation, only two layers of H values are kept in shared memory and the memory space for these two layers is used in a round-robin fashion to save shared memory.

From the dynamic programming recurrence for H, we see that the H values in a chunk depend only on those on the shared boundary (east vertical face) of the chunks to its west and northwest and those on the shared boundary (south vertical face) of the chunks to its north and northwest. With respect to the (|S1|+1)/s × (|S0|+1)/s matrix of chunks, it is necessary for the SM that computes the H values for chunk (i,j) to communicate the computed H values on its east vertical face to the SM that will compute the chunks (i,j+1) and (i+1,j+1) (the top left corner of this matrix is indexed (0,0)) and the values on its south vertical face to the SM that will compute the chunks (i+1,j) and (i+1,j+1). The s×(|S2|+1) H values on each of these vertical faces are communicated to the appropriate SMs via the GPU's global memory. Each SM writes the H values on its east and south faces to global memory. When an SM has completed the computation for its current layer, it polls the global memory to determine whether the boundary values needed for the next layer are ready. If so, it proceeds to the next layer. If not, it idles. In case there is no next layer in the current chunk, the SM proceeds to its next assigned chunk.

Analysis. To run the layered algorithm, each SM requires O(s²) shared memory to store the values associated with a layer and O(|S0||S1||S2|/s) global memory to communicate the H values on its east and south faces. Since shared memory is very small on current GPUs, the available shared memory constrains the chunk size s. Because of the high


cost of data transfer between SMs and global memory (relative to the cost of arithmetic), the run-time performance of a GPU algorithm is often correlated to the volume of data transferred between the SMs and global memory. The layered algorithm transfers a total of O(|S0||S1||S2|/s) data between the GPU's global memory and the SMs.

The computational time (excluding time taken by the global memory I/O traffic) is computed by first noting that the c cores of an SM can do the computation for one layer in O(s²/c) time, computing the H values on each antidiagonal of a layer in parallel. So, the time to do the computation for one chunk is O(s²|S2|/c). Since the computation for the ith chunk cannot begin until the first layers of its west, north, and northwest neighbor chunks have been computed, each SM, other than SM 0, experiences a startup delay. As the antidiagonal from which the first chunk assigned to the last SM, SM p−1, is drawn is O(√p), the startup delay for SM p−1 is O(√p(s²/c)). When |S2| is sufficiently large (larger than ⌈√p⌉), SMs experience (almost) no further delay in working on their assigned chunks. The number of chunks assigned to an SM is O(|S0||S1|/(ps²)). So, the total computation time (exclusive of global memory I/O time) is O(|S0||S1||S2|/(pc) + √p(s²/c)).

While computation time exclusive of global I/O time increases as s increases (because the startup delay for SMs increases), global I/O time decreases as s increases. Our experiments show that for large |S0|, |S1| and |S2|, the reduction in global memory I/O traffic that comes from increasing s is substantially higher than the increase in time spent on computational tasks. Thus, within limits, choosing a large s will reduce the overall time requirements. As noted earlier, the value of s is, however, upper bounded by the amount of shared memory per SM.

5.2.2 Sloped Algorithm

GPU Computational Strategy. In this algorithm, which is called SLOPED, we partition H into s×s×(|S2|+1) chunks and assign these chunks to SMs as in the layered algorithm. However,


instead of computing the H values in a chunk by horizontal layers and within a layer by antidiagonals, the H values are computed by sloped planes comprised of the H[i][j][k]s for which q = i+j+k is the same. The first sloped plane has q = 0, the next has q = 1, and the last has q = 2s+|S2|−2. The computation for plane q+1 begins after that for plane q completes. Within a plane q, the H values for all i, j, and k for which i+j+k = q can be computed in parallel. The number of parallel steps in the computation of a chunk is therefore 2s+|S2|−1.

In contrast, the layered algorithm cannot compute all H values in a chunk's layer in parallel. The computation of a layer is done by antidiagonals in 2s−1 steps, with each step computing the values on one antidiagonal in parallel. The total number of parallel steps employed by the layered algorithm in the computation of a chunk is (2s−1)(|S2|+1). Hence, when both the layered and sloped algorithms use the same s, the sloped algorithm uses fewer steps, with each step comprised of more work that can be done in parallel. Under these conditions, the sloped algorithm is expected to perform better than the layered algorithm. However, as noted earlier, the value of s is constrained by the amount of local SM memory available. As we shall see below, the sloped algorithm requires more SM memory and so must use a smaller s.

Analysis. To compute the H values on a sloped plane q, we need the H values from the planes q−1, q−2, and q−3. For efficient computation, we must therefore have adequate SM memory for 4 sloped planes. Since a sloped plane may have up to (s+1)² H values, each SM must have sufficient memory to accommodate 4(s+1)² H values, which is twice the number of H values that the layered algorithm stores in the shared memory of an SM. The sloped algorithm generates the same amount of I/O traffic between the SMs and global memory as does the layered algorithm. The two algorithms also require the same amount of global memory.

Proceeding as for LAYERED, we see that the delay time for SM p−1 is O(√p(s³/c)), where O(s³/c) is the time taken to compute the small pyramid. The time to compute


all the chunks assigned to an SM is O(|S0||S1||S2|/(pc)). So, the total time (exclusive of global I/O traffic time) is O(|S0||S1||S2|/(pc) + √p(s³/c)). Although the analysis shows the total time for SLOPED is larger than that for LAYERED because of the larger delay to start SM p−1, when S0, S1, and S2 are large, the greater parallelism afforded by SLOPED, coupled with the need for fewer synchronization steps, dominates and the measured run time is smaller for SLOPED.

5.3 Computing the Best Alignment

The layered and sloped algorithms may be extended to compute not only the score of the best alignment but also the best alignment. We describe three possible extensions for the layered algorithm. These extensions represent a tradeoff among conceptual and implementation simplicity, computational requirements for the traceback done to compute the best alignment from the scores, and the amount of parallelism. Identical extensions may be made to the sloped algorithm.

5.3.1 LAYERED−BT1

To compute the best alignment, we need to maintain additional information that can be used during the traceback. In particular, for each position (i,j,k) of H, we associate the coordinates (is,js,ks) of the local start point of the optimal path to (i,j,k). This local start point is a position on the boundary of the north, northwest, or west neighbor chunk of the chunk that contains the position (i,j,k). LAYERED−BT1 is a 3-phase algorithm:

1. Phase 1: This is an extension of LAYERED in which each chunk stores, in global memory, not only the H values needed by its neighboring chunks but also the local start point of the optimal path to each boundary cell.

2. Phase 2: For each chunk, we sequentially determine the start point and end point of the subpath of the best alignment that goes through this chunk. This is done by tracing the best path backwards from position (|S0|,|S1|,|S2|) to position (0,0,0) using the local start points of boundary cells computed in Phase 1. This traceback goes from the boundary of one chunk to the boundary of a neighbor chunk without actually entering any chunk.

3. Phase 3: The subpath of the best path within each chunk is computed by recomputing the H values for the chunks through which the best alignment path


traverses. Note that Phase 2 determines which chunks the best path goes through. Using the saved boundary H values, it is possible to compute the subpaths for all chunks in parallel.

The global memory required by LAYERED−BT1 is more than that required by LAYERED, as Phase 1 stores a 3D position with each H value saved. Assuming 4 bytes for each coordinate of a 3D position and 4 bytes for an H value, LAYERED−BT1 requires 16 bytes per boundary cell while LAYERED requires only 4. Additionally, in the Phase 3 computation, we need to store with every position in a chunk which of the 7 options on the right-hand side of the dynamic programming recurrence for H resulted in the max value. This can be done using 3 bits (or, more realistically, 1 byte) per position in a chunk.

5.3.2 LAYERED−BT2

Although LAYERED−BT2 is also a three-phase algorithm, LAYERED−BT2 partitions each chunk into subchunks of height h (Figure 5-2). The three phases are as follows:

1. Phase 1: Chunks are assigned to SMs as in LAYERED. In addition to the H values and local start points stored in global memory by LAYERED−BT1, we store the H values of the positions on the bottom face of each subchunk.

2. Phase 2: Use the local start point data stored in global memory by the Phase 1 computation to determine the start and end points of the subpaths of the best alignment path within each chunk. This phase computes the same start and end points as computed in Phase 2 of LAYERED−BT1.

3. Phase 3: The subpath of the best path within each chunk is computed by recomputing the H values for the chunks through which the best alignment path traverses. For each chunk through which this path passes, the computation begins at the first layer of the topmost subchunk through which the path passes (see Figure 5-2; shaded subchunks are the subchunks through which the best alignment path passes). The computation for the different chunks through which the best alignment path passes can be done in parallel as in LAYERED−BT1.

The major differences between LAYERED−BT1 and LAYERED−BT2 are:

1. Since LAYERED−BT2 saves H values on the bottom face of each subchunk in addition to the data saved by LAYERED−BT1, it generates more I/O traffic than


LAYERED−BT1 and also requires more global memory. The amount of additional I/O traffic and global memory required is O(|S0||S1||S2|/h).

2. For each chunk through which the best alignment path passes, LAYERED−BT1 begins the Phase 3 computations at layer 1 of that chunk while LAYERED−BT2 begins this computation at the first layer of the topmost subchunk through which this path passes.

Figure 5-2. Some of the chunks traversed by the optimal path

5.3.3 LAYERED−BT3

For LAYERED−BT3, the local start points are defined to be positions on neighboring subchunks of the subchunk that contains the position (i,j,k) (rather than points on neighboring chunks of the chunk that contains (i,j,k)). The three phases of LAYERED−BT3 are:

1. Phase 1: Chunks are assigned to SMs as in LAYERED. For each subchunk, we store in global memory the H values on its south, east, and bottom faces as well as the local start point of the optimal path to each of the positions on these faces.

2. Phase 2: For each subchunk, we sequentially determine the start point and end point (if any) of the subpath of the best alignment path that goes through this subchunk.

3. Phase 3: The subpath (if any) of the best path within each subchunk is computed by recomputing the H values for those subchunks through which the best


alignment path traverses. Note that Phase 2 determines the subchunks through which the best path goes. Using the saved boundary H values, it is possible to compute the subpaths for all subchunks in parallel.

LAYERED−BT3 has more global I/O traffic than LAYERED−BT2 in Phase 1 and also requires more global memory. Phase 2 of both algorithms does the same amount of work. We expect Phase 3 of LAYERED−BT3 to be faster than that of LAYERED−BT2 because of better load balancing resulting from the finer granularity of the per-subchunk work and the fact that there are more subchunks than chunks, which leads to more parallelism. For example, suppose one chunk has 50 subchunks through which the best alignment path passes and another chunk has only 1 such subchunk. Phase 3 of LAYERED−BT2 can use at most 2 SMs, with 1 working on all 50 subchunks of the first chunk and the other working on the single subchunk of the second chunk. The workload over the two assigned SMs is unbalanced and the degree of parallelism is only 2. Phase 3 of LAYERED−BT3, however, is able to use up to 51 SMs, assigning 1 subchunk to each SM, resulting in better workload balancing and a higher degree of parallelism.

5.4 Experimental Results

In this section, we present experimental results for our scoring and alignment algorithms. All of these experiments were conducted on an NVIDIA Tesla C2050 GPU. The host machine has an Intel i7-x980 3.33GHz CPU and 12GB DDR3 RAM.

5.4.1 Computing the Score of the Best Alignment

We fixed s = 67 for LAYERED and s = 47 for SLOPED and experimented with real instances of different sizes. These two s values were determined from our experimental results as the values that produced the least running time. The sequences used were retrieved from NCBI Entrez Gene [52] and their sizes are represented as a tuple (|S0|, |S1|, |S2|). The running time is shown in Table 5-1. The running times of the single-core scoring method and the single-core traceback method running on our host CPU are referred to as Scoring and Traceback in Table 5-1, respectively, for comparison purposes. As can be seen, SLOPED is about three times as fast as LAYERED for large instances and is up to


Table 5-1. Running time (s) for different instances

            (113,166,364)  (347,349,365)  (267,439,452)  (764,771,773)  (1399,1404,1406)
LAYERED         0.034          0.165          0.162          1.250            6.643
SLOPED          0.044          0.087          0.090          0.416            1.921
Scoring         0.570          3.610          4.270         37.500              -
Traceback       0.630          4.010          4.760         41.600              -

Table 5-2. Running time (s) of traceback methods for real instances
(per instance: Phase 1 / Phase 3 / Total)

        (113,166,364)      (347,349,365)      (267,439,452)      (764,771,773)      (1399,1404,1406)
L-BT1  0.062/0.012/0.076  0.298/0.042/0.342  0.348/0.035/0.386  2.814/0.121/2.941  14.994/0.277/15.300
L-BT2  0.062/0.005/0.069  0.298/0.010/0.310  0.347/0.010/0.359  2.813/0.014/2.834  14.982/0.021/15.036
L-BT3  0.062/0.005/0.069  0.298/0.010/0.310  0.347/0.009/0.360  2.812/0.016/2.836  14.977/0.023/15.040
S-BT1  0.013/0.002/0.030  0.064/0.005/0.111  0.077/0.006/0.125  0.583/0.021/0.736   3.513/0.058/3.850
S-BT2  0.014/0.002/0.032  0.067/0.002/0.114  0.080/0.002/0.127  0.606/0.003/0.748   3.648/0.005/3.946
S-BT3  0.014/0.002/0.032  0.067/0.002/0.114  0.080/0.002/0.127  0.604/0.004/0.747   3.636/0.006/3.939

90 times as fast as the single-core CPU algorithm! In our tests, speedup increases with instance size.
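The ratios cited in the text can be recomputed from the Table 5-1 entries; for instance, SLOPED's 0.416 s on the (764,771,773) instance against the single-core Scoring time of 37.5 s gives the roughly 90x speedup noted above.

```python
# Recomputing the cited speedups from the Table 5-1 entries (times in seconds).
layered = {(764, 771, 773): 1.250, (1399, 1404, 1406): 6.643}
sloped = {(764, 771, 773): 0.416, (1399, 1404, 1406): 1.921}
scoring_cpu = {(764, 771, 773): 37.500}

# SLOPED vs LAYERED on the largest instance: roughly 3.5x.
layered_ratio = layered[(1399, 1404, 1406)] / sloped[(1399, 1404, 1406)]

# SLOPED vs the single-core CPU scoring method on (764,771,773): about 90x.
cpu_speedup = scoring_cpu[(764, 771, 773)] / sloped[(764, 771, 773)]
```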
5.4.2 Computing the Alignment

Because of memory limitations, our scoring algorithms can handle sequences whose size is up to approximately (2500, 2500, 2500) while our alignment algorithms can handle sequences of size up to approximately (1500, 1500, 1500). We used s = 47, h = 40 for the layered algorithms and s = 31, h = 100 for the sloped algorithms, which were determined by experiments to be the optimal values.

We tested our alignment algorithms, with parameters set as above, on the real instances used earlier for the scoring experiments. The measured run times are reported in Table 5-2 and Figure 5-3 (L− for LAYERED, S− for SLOPED). We do not report the time for Phase 2 as this is negligible compared to that for the other phases. Although BT2 and BT3 reduce the run time of the third phase, the overall time is dominated by that for Phase 1. Again, the SLOPED algorithms are about three times as fast as the LAYERED algorithms. The SLOPED algorithms provide a speedup between 21 and 56 relative to the single-core algorithm running on our host CPU.

5.5 Summary

We have developed two GPU algorithms to optimally align three sequences. The sloped scoring algorithm, which is 3 times as fast as the layered algorithm, provides a


Figure 5-3. Plot of running time (s) of traceback methods for real instances

speedup of up to 90 relative to a single-core algorithm running on our host CPU. The sloped alignment algorithm is also 3 times as fast as the layered one and provides a speedup between 21 and 56 relative to the single-core algorithm. The strategies we used in this chapter can also be extended to the affine gap model, though the computation will require more cubic matrices to store temporary values.


CHAPTER 6
PARALLEL SYNTENIC ALIGNMENT

Homologous sequences often preserve highly identical regions while other regions become very different through the evolution process. When evaluating the phylogenetic distance between two homologous sequences, their global alignment score can be quite small if non-preserved regions are longer than preserved regions. So, in the syntenic alignment of homologous sequences, we disregard the non-preserved regions. Syntenic alignment is a variant of global alignment in which long non-matching regions (difference blocks) are ignored. The alignment result is a list of matching regions, which may include gaps, interleaved with difference blocks. In syntenic alignment, we are given two sequences A and B and are to find their best global alignment using the variant cost metric just described. For DNA sequences, the alphabet for A and B is the four-letter set {A, C, G, T} and for protein sequences, the alphabet is the 20-letter set {A, C−I, K−N, P−T, V, W, Y}.

Huang et al. [26] have developed dynamic programming recurrences for syntenic alignment and an associated sequential algorithm whose complexity is O(|A||B|). These dynamic programming recurrences are somewhat more complex than those used for ordinary global alignment in the Needleman-Wunsch [58] algorithm. Huang et al. [26] introduce a new matrix for the case when alignments end with difference blocks. Hirschberg's recursive technique [24] is used to construct the actual alignment. Futamura et al. [18] modified the equations of Huang et al. [26] to arrive at equations more suited to the development of an efficient algorithm for a CPU cluster using parallel prefix operations.

Biological applications often have |A| in the range 10^4 to 10^5 and |B| in the range 10^7 to 10^10 [37]. When the sequences A and B are of this size, the observed run time on modern single-processor computers is quite high because of the quadratic run time and large memory requirement of the associated dynamic programming algorithm. Heuristic techniques such as those in [13][33][5][75] have been proposed for the solution of large


instances. However, with the advent of low-cost parallel computers, there is renewed interest in developing computationally practical algorithms that do not sacrifice accuracy. Toward this end, we develop a novel single-GPU algorithm that is based on the dynamic programming equations of Huang et al. [26]. Our parallel algorithms achieve a speedup of 28 over the sequential algorithm of [26].

Our high-level parallelization strategy is similar to that used by Futamura et al. [18] to arrive at parallel algorithms for syntenic alignment on a cluster of workstations. The method proposed in [18] divides the scoring matrix into as many strips as there are processors; each processor computes the scoring matrix for its strip row-wise, and the traceback is done in parallel using a strategy similar to the one used by us. The main differences between our work and that of [18] are as follows:

1. Our algorithms are optimized for a GPU rather than for a cluster. Unlike a cluster, a GPU has a very small amount of memory per SM. To support this, we divide the scoring matrix into many more strips than the number of streaming multiprocessors in a GPU.

2. The computation of a strip is done in parallel using the SIMD nature of the CUDA cores on a streaming multiprocessor rather than serially.

The rest of the chapter is organized as follows. In Section 6.1, we state the dynamic programming equations of [26] for pairwise syntenic alignment. In Section 6.2, we describe our GPU algorithm for the case when we want to report only the score of the best alignment and in Section 6.3, we describe our GPU algorithm for the case when the best alignment as well as its score are to be reported. Experimental results are presented in Section 6.4 and we summarize in Section 6.5.

6.1 Syntenic Alignment Equations

Syntenic alignment returns an ordered list of matching regions for two sequences as shown in Figure 6-1. Difference blocks are long non-matching regions and, when computing the alignment score for two sequences, each difference block is charged only an opening penalty, which is d.
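The region-level effect of this cost metric can be illustrated with a toy sketch (not code from the dissertation): matching regions contribute their alignment scores, while each difference block, regardless of its length, is charged only the opening penalty d. The value of d here is an assumed illustrative constant.

```python
# Toy illustration of region-level syntenic scoring; d is illustrative only.
d = 10

def syntenic_region_score(regions):
    # regions: ordered list of ('match', score) and ('diff',) entries.
    total = 0
    for r in regions:
        if r[0] == 'match':
            total += r[1]
        else:
            total -= d  # a difference block of any length costs only d
    return total
```

So an alignment with two matching regions scoring 30 and 25 separated by one difference block scores 30 + 25 − d, no matter how long the difference block is.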


Huang et al. [26] developed dynamic programming equations for syntenic alignment. Let A = a1a2...am and B = b1b2...bn be the two sequences that are to be aligned. Let c(ai,bj) be the score for matching or aligning ai and bj, let α be the gap opening penalty, and β the gap extension penalty. So, the penalty for a gap of length k is α + kβ. The syntenic alignment equations of Huang et al. [26] are

S(i,j) = max{ S(i−1,j−1) + c(ai,bj), D(i,j), I(i,j), H(i,j) }

D(i,j) = max{ D(i−1,j) − β, S(i−1,j) − α − β }

I(i,j) = max{ I(i,j−1) − β, S(i,j−1) − α − β }

H(i,j) = max{ H(i,j−1), H(i−1,j), S(i,j−1) − d, S(i−1,j) − d }

where 1 ≤ i ≤ m, 1 ≤ j ≤ n, and the score matrices S, D, I, and H have the following meaning:

1. S(i,j) is the score of the best syntenic alignment for (a1...ai) and (b1...bj).

2. D(i,j) is the score of the best syntenic alignment for (a1...ai) and (b1...bj) under the constraint that ai is aligned to a gap.

3. I(i,j) is the score of the best syntenic alignment for (a1...ai) and (b1...bj) under the constraint that bj is aligned to a gap.


4. H(i,j) is the score of the best syntenic alignment for (a1...ai) and (b1...bj) under the constraint that either ai or bj is aligned to a difference block.

The initial conditions are (1 ≤ i ≤ m, 1 ≤ j ≤ n):

S(0,0) = 0;   S(i,0) = max{D(i,0), H(i,0)};   S(0,j) = max{I(0,j), H(0,j)};
D(0,0) = −α;  D(i,0) = D(i−1,0) − β;          D(0,j) = S(0,j) − α;
I(0,0) = −α;  I(i,0) = S(i,0) − α;            I(0,j) = I(0,j−1) − β;
H(0,0) = −d;  H(i,0) = −d;                    H(0,j) = −d.

Figure 6-1. Syntenic alignment

As indicated in [26], when using the above recurrence equations, the optimal alignment sometimes starts with a small isolated match followed by a difference block or ends with a small isolated match preceded by a difference block. While such an alignment is mathematically optimal, it is not biologically meaningful. Huang et al. [26] proposed two ways to overcome this discrepancy. One is to charge a difference block


penalty d if the alignment does not start with a difference block or does not end with a difference block. The other solution is to charge no penalty for initial or ending difference blocks. In our implementation, we used the first of these strategies. So the value of S(0,0) becomes −d instead of 0 and the best alignment score is max{S(m,n) − d, H(m,n)}.

6.2 Computing the Score

Our algorithm, SCORE, assumes that m ≤ n and partitions the scoring matrices S, D, I, and H into ⌈n/s⌉ m×s strips (Figure 6-2). If m > n, the roles of the two sequences A and B can be swapped. Let s be the strip width and p be the number of SMs in the GPU (for the C2050, p = 14). The GPU kernel is written so that SM i computes the S, D, I, and H values for all strips j such that j mod p = i, 0 ≤ j < ⌈n/s⌉. Each SM computing a

strip accumulates, in a buffer in shared memory, a threshold number T of the S, I and H values needed by its right adjacent strip. When this threshold is reached, the accumulated S, I and H values are written to global memory. The threshold T is chosen to optimize the total time. Each SM polls global memory to determine whether the next batch of S, I and H values it needs from its left adjacent strip is ready in global memory. If so, it reads this batch and computes the next T antidiagonals for its strip. If not, it waits in an idle state.

Figure 6-2. Striped syntenic alignment

The above algorithm requires O(min{m,s}) shared memory per SM and O(mn/s) global memory. An important performance metric for GPUs is the amount of I/O transferred between the global memory and the SMs. Generally, the goal is to minimize this traffic as the limited bandwidth makes it very expensive. For our algorithm, the I/O traffic between global memory and the SMs is O(mn/s).
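The quantity that SCORE computes in parallel can be checked against a minimal sequential sketch of the Section 6.1 recurrences (scores only), including the S(0,0) = −d modification described there. The match scores, the penalties α (gap open), β (gap extend), and d (difference block), and our reading of the garbled Greek symbols in the source are all assumptions.

```python
# Sequential reference sketch of the syntenic-alignment recurrences.
# ALPHA, BETA, and D_PEN are assumed illustrative penalty values.
ALPHA, BETA, D_PEN = 4, 1, 10

def c(x, y):
    return 2 if x == y else -1      # toy substitution score

def syntenic_score(A, B):
    m, n = len(A), len(B)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    I = [[0] * (n + 1) for _ in range(m + 1)]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    # Initial conditions, with S(0,0) = -d per the first strategy of [26].
    S[0][0], D[0][0], I[0][0], H[0][0] = -D_PEN, -ALPHA, -ALPHA, -D_PEN
    for i in range(1, m + 1):       # first column
        D[i][0] = D[i - 1][0] - BETA
        H[i][0] = -D_PEN
        S[i][0] = max(D[i][0], H[i][0])
        I[i][0] = S[i][0] - ALPHA
    for j in range(1, n + 1):       # first row
        I[0][j] = I[0][j - 1] - BETA
        H[0][j] = -D_PEN
        S[0][j] = max(I[0][j], H[0][j])
        D[0][j] = S[0][j] - ALPHA
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] - BETA, S[i - 1][j] - ALPHA - BETA)
            I[i][j] = max(I[i][j - 1] - BETA, S[i][j - 1] - ALPHA - BETA)
            H[i][j] = max(H[i][j - 1], H[i - 1][j],
                          S[i][j - 1] - D_PEN, S[i - 1][j] - D_PEN)
            S[i][j] = max(S[i - 1][j - 1] + c(A[i - 1], B[j - 1]),
                          D[i][j], I[i][j], H[i][j])
    return max(S[m][n] - D_PEN, H[m][n])
```

With these toy penalties, very short matches cannot recoup the two charges of d, which mirrors the discussion in Section 6.1 of small isolated matches not being biologically meaningful.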


The computational time requirement (excluding time taken by the global memory I/O traffic) is computed by assuming that the threshold value T is O(1). The computation for the kth strip cannot begin until the top right value of strip k−1 has been computed. This adds a small delay between the computation of consecutive strips and has to be taken into account to compute the overall time requirement. An SM with c processors takes Ta = O(s²/c) time to compute the top right value of the strip assigned to it and O(ms/c) time to complete the computation for the entire strip. So, SM p−1 cannot start working on the first strip assigned to it until time (p−1)Ta.

There are two broad scenarios that we consider. In the first case, we assume the SM can compute the next strip assigned to it without any delay (i.e., the previous SM has already provided the required boundary data). In the second case, we assume that the SM has to wait as the previous SM has not completed the required computation of the boundary values. When an SM can go from the computation of one strip to the computation of the next strip with no delay, the completion time of SM p−1, and hence the time taken by the GPU to do all its assigned work (exclusive of the time taken by global memory I/O traffic), is O((p−1)Ta + (ms/c)(n/(ps))) = O(ps²/c + mn/(pc)). When an SM takes less time to complete the computation for a strip than it takes to compute the data needed to commence on the next strip assigned to the SM (approximately, ms/c

The substitution matrix required by SCORE is stored in the shared memory of each SM using 23×23×sizeof(int) bytes. Additionally, each SM has three output buffers, each of length 16, for writing values on the boundary of each strip to global memory. These buffers take 16×sizeof(int)×3 bytes. We also use eight arrays of length min{s,m}+2 each to hold the S values on three adjacent antidiagonals, the D values and I values, the new D or I values to be swapped with the old values, the H values, and the new H values to be swapped with the old values. 1456 bytes are reserved by the CUDA compiler to store built-in variables and pass function parameters. Given that the shared memory was configured as 48KB shared memory and 16KB L1 cache, O(min{m,s}) should be less than 1416. Since we are aligning very large sequences, we assume s < m.

cell (u,v), four start states (one for each of S(u,v), D(u,v), I(u,v) and H(u,v)) are stored. So, for the 4 strips of Figure 6-3, the boundary cells store, in global memory, the local start states of subpaths that end at the boundary cells (·,4), (·,8), and (·,12). We also need to store the local start state and the end state for the overall best alignment. Since the highest S score is at (7,16) and the local start state for S(7,16) is (5,12,I), (5,12,I) is initially stored in registers and finally written along with (7,16,S) to global memory. In Phase 1, the local start states for the optimal paths to all boundary cells (not just the boundary cells through which the overall alignment path traverses) are written to global memory.

Figure 6-3. Example for STRIPED

2. Phase 2: For each strip, we sequentially determine the start state and end state of the optimal alignment subpath that goes through this strip. Suppose for our example of Figure 6-3 we determine that the optimal alignment path is comprised of a subpath from (5,12,I) to (7,16,S), another subpath from (3,8,H) to (5,12,I), one from (2,4,S) to (3,8,H), and another one from (0,0,H) to (2,4,S).

3. Phase 3: The subpath of the optimal path within each strip is computed by recomputing the S, D, I, and H values for the strips the optimal alignment path traverses. Using the saved boundary S, I and H values, it is possible to compute the subpaths for all strips in parallel.
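The Phase 2 walk just described can be sketched as follows (illustrative code, not the dissertation's): Phase 1 leaves, for every strip-boundary state, the local start state of the optimal subpath ending there, and Phase 2 follows these saved states from the overall end state back to the origin without recomputing any interior cell. The dictionary below reproduces the Figure 6-3 example from the text.

```python
# Sketch of the Phase 2 boundary-to-boundary traceback walk.
def phase2_subpaths(local_start, end_state, origin):
    subpaths = []
    state = end_state
    while state != origin:
        start = local_start[state]      # hop to the previous strip boundary
        subpaths.append((start, state))
        state = start
    return subpaths

# Local start states saved in Phase 1 for the Figure 6-3 example.
FIG_6_3 = {
    (7, 16, 'S'): (5, 12, 'I'),
    (5, 12, 'I'): (3, 8, 'H'),
    (3, 8, 'H'): (2, 4, 'S'),
    (2, 4, 'S'): (0, 0, 'H'),
}
```

Walking from (7,16,S) back to (0,0,H) yields exactly the four subpath (start, end) pairs listed in the Phase 2 description; these are then handed to Phase 3 for parallel recomputation within each strip.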


6.3.2 CHUNKED1D

CHUNKED1D is also a three-phase algorithm. The main difference with the previous algorithm is that each strip is partitioned into chunks of height h (Figure 6-4). The three phases are as follows:

1. Phase 1: For each h x s chunk we store, in global memory, the S, I, H, and local start states for positions on right vertical chunk boundaries (i.e., vertical buffers) and the S, D and H values for horizontal buffers. The assignment of strips to SMs is the same as in SCORE (and STRIPED).

Figure 6-4. Example for CHUNKED1D

2. Phase 2: We use the data stored in global memory by the phase 1 computation to determine the start and end states of the subpaths of the optimal alignment path within each strip.

3. Phase 3: The optimal subpaths are constructed by a computation within each strip through which the optimal alignment traverses. However, the computation within a strip can be limited to essential chunks, as shown by the shaded chunks in Figure 6-4. The computation for these (sub)strips can be done in parallel.

The major differences between CHUNKED1D and STRIPED are as follows:

1. CHUNKED1D generates more I/O traffic than STRIPED.


2. CHUNKED1D requires more global memory, as the horizontal buffer data needs to be stored. Assuming that the height and width of a chunk are nearly equal, the I/O traffic and the global memory requirement are roughly twice the amount for STRIPED for the same strip size as the width of the chunk.

3. Unlike STRIPED, the computation begins at the start point of a chunk rather than at the first row of the strip. In practice, this reduces the amount of computation significantly.

6.3.3 CHUNKED2D

One of the limitations of CHUNKED1D is that in the worst case the size of the computation for some of the strips can be large (i.e., the shaded area is nearly equal to the entire strip). Algorithm CHUNKED2D addresses this limitation. The three phases are:

1. Phase 1: This phase is modified to additionally store, for the horizontal buffers, the local start state (local to the chunk) of the optimal path that goes through that buffer. Similarly, for vertical buffers the start state local to the chunk (rather than local to the strip) is stored.

2. Phase 2: We use the data stored in global memory during Phase 1 to determine the chunks through which the optimal alignment path traverses as well as the start and end states of the subpath through these chunks.

3. Phase 3: The subpath within each chunk is computed by using data stored in Phase 1 and the knowledge of the subpath end states determined in Phase 2. As was the case for STRIPED and CHUNKED1D, the subpaths for all identified chunks can be computed in parallel.

Unlike CHUNKED1D, additional computations and I/O need to be performed in Phase 1. The advantage is that the computation for all the chunks can be performed in parallel in Phase 3. So, for a given data set, if a large number of chunks corresponding to the shortest path are present in a given strip, this work will be assigned to a single streaming multiprocessor by CHUNKED1D and will effectively be performed sequentially. With CHUNKED2D, however, these chunks can potentially be assigned to multiple streaming multiprocessors. For example, in Figure 6-4, strip 2 has 2 chunks. These will be assigned to the same streaming multiprocessor using CHUNKED1D, but can be assigned to 2 different


streaming multiprocessors using CHUNKED2D. Thus, the amount of parallelism available is significantly increased.

6.3.4 Memory Requirements

The global memory required by our methods is instance dependent. For each position encountered in Phase 3, we store direction information for all four matrices, which essentially comprises nine flags. These flags and their meanings are (d denotes the difference-block penalty, and q and r the gap-open and gap-extension penalties):

For matrix S:
S_DIAGONAL: S(i,j) = S(i-1,j-1) + c(ai,bj)
H_UP:       S(i,j) = H(i,j) = H(i-1,j)
H_LEFT:     S(i,j) = H(i,j) = H(i,j-1)
S_LEFT_H:   S(i,j) = H(i,j) = S(i,j-1) - d
S_UP_H:     S(i,j) = H(i,j) = S(i-1,j) - d
S_LEFT_I:   S(i,j) = I(i,j) = S(i,j-1) - q - r
I_LEFT:     S(i,j) = I(i,j) = I(i,j-1) - r
S_UP_D:     S(i,j) = D(i,j) = S(i-1,j) - q - r
D_UP:       S(i,j) = D(i,j) = D(i-1,j) - r

For matrix D:
S_UP_D: D(i,j) = S(i-1,j) - q - r
D_UP:   D(i,j) = D(i-1,j) - r

For matrix I:
S_LEFT_I: I(i,j) = S(i,j-1) - q - r
I_LEFT:   I(i,j) = I(i,j-1) - r


For matrix H:
H_UP:     H(i,j) = H(i-1,j)
S_UP_H:   H(i,j) = S(i-1,j) - d
S_LEFT_H: H(i,j) = S(i,j-1) - d
H_LEFT:   H(i,j) = H(i,j-1)

For S, 9 possibilities exist for one cell, so 4 bits are enough to store this information for each cell. For D, 2 possibilities exist and 1 bit is enough. For I, 1 bit is enough. And for H, 2 bits are enough. So, in the worst case, STRIPED requires mn bytes of global memory while CHUNKED1D and CHUNKED2D require ms + nh bytes. This is an order of magnitude reduction in memory requirement compared to the traditional method, where we store scores in all four matrices with a memory requirement of roughly 4 x mn x sizeof(int) = 16mn bytes. Hence, STRIPED, for example, can handle problems with an mn value about 16 times as large as can be handled by the traditional method. Besides, when more memory is required, Phase 3 can be split into multiple iterations. The same memory used by one SM to compute part of the alignment within one strip can be reused in different iterations while computing for different strips.

6.4 Experimental Results

All of the experiments reported in this section were conducted on an NVIDIA Tesla C2050 GPU with CUDA Toolkit 4.0. The host machine has an Intel i7-x980 3.33GHz CPU and 12GB DDR3 SDRAM. These experiments were conducted for a variety of string sizes, strip widths, and parameters. Representative results are presented for each of the algorithms due to space limitations. Unless otherwise stated, long sequences were generated by replicating a pair of homologous sequences obtained from BLAST.

SCORE. First, we measured the running time of


Table 6-1. Running time of GAP3

m/n        6530/7045   13060/14090   16978/18317
Time (ms)  1950.0      7710.0        13010.0

Table 6-2. Execution time (ms) of SCORE for different s values

m         26120    39180    52240    65300
n         28180    42270    56360    70450

s=64     1773.5   3975.7   7027.8  10935.1
s=128     891.6   1994.5   3540.6   5542.2
s=256     456.3   1017.9   1799.0   2803.8
s=512     254.0    549.1    963.9   1496.5
s=1024    205.7    408.9    693.6   1059.9
s=1400    209.4    416.1    602.8    949.1

SCORE as a function of the strip width s (Table 6-2 and Figure 6-5). As introduced in Section 6.1, m is the length of the query sequence and n is the length of the subject sequence. As predicted by the analysis of Section 6.2, for sufficiently large sequences the running time decreases as s increases. The running time for the sequential code of [26], which is referred to as GAP3, is shown in Table 6-1. Next, we measured the running time as a function of the number of SMs with m/n = 104480/112720 and s = 1400. The measured times and speedups are shown in Table 6-4 and Figure 6-6. As can be seen, the speedup (relative to 1 SM) is fairly linear. The running time when m is less than n and s is 1024 is shown in Table 6-3. The sequences were generated randomly. As can be seen, we are able to handle sequences for which the product of the lengths is up to 10^11, with the length of the database sequence being up to 10^9.

STRIPED.

Table 6-3. Running time (s) of SCORE when m < n
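The per-cell direction-flag packing described in Section 6.3.4 (4 bits for the nine S flags, 1 bit each for D and I, 2 bits for H, i.e., one byte per cell) can be sketched as follows; the particular field order within the byte is an assumption for illustration:

```python
# Pack one cell's traceback flags into a single byte: 4 bits for the S flag,
# 1 bit each for the D and I flags, 2 bits for the H flag (field order assumed).
S_FLAGS = ["S_DIAGONAL", "H_UP", "H_LEFT", "S_LEFT_H", "S_UP_H",
           "S_LEFT_I", "I_LEFT", "S_UP_D", "D_UP"]  # 9 values -> 4 bits

def pack(s_flag, d_flag, i_flag, h_flag):
    # s_flag in 0..8, d_flag and i_flag in 0..1, h_flag in 0..3
    return s_flag | (d_flag << 4) | (i_flag << 5) | (h_flag << 6)

def unpack(cell):
    return cell & 0xF, (cell >> 4) & 1, (cell >> 5) & 1, (cell >> 6) & 3

cell = pack(S_FLAGS.index("S_LEFT_I"), 1, 0, 2)
assert cell < 256                    # one byte per cell, hence mn bytes total
assert unpack(cell) == (5, 1, 0, 2)
```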

Figure 6-5. Plot of running time (ms) of SCORE with different s values

Table 6-4. Running time (ms) of SCORE with different numbers of SMs

SMs   Time (ms)   Speedup
1      29588.8      1.0
2      14894.8      2.0
3       9823.8      3.0
4       7568.9      3.9
5       6104.0      4.8
6       5077.3      5.8
7       4362.5      6.8
8       3914.1      7.6
9       3305.5      9.0
10      3175.6      9.3
11      2912.9     10.2
12      2583.9     11.5
13      2546.6     11.6
14      2230.7     13.3

Table 6-5. Running time (ms) of STRIPED for different s values

m              6530                  13060                  16978
n              7045                  14090                  18317

         Ph1    Ph3   Total     Ph1    Ph3   Total     Ph1     Ph3   Total
s=64   241.9  153.8   400.5   943.9  587.9  1543.4  1596.2   989.6  2603.7
s=128  129.4  122.4   255.2   486.2  478.1   971.0   844.3   816.3  1670.4
s=256   79.0  143.8   225.4   263.0  573.2   839.9   463.6   969.6  1437.8
s=350   72.8  152.2   227.4   208.9  628.8   840.9   341.4  1046.6  1392.2


Figure 6-6. Speedup of SCORE for variable numbers of SMs

Figure 6-7. Running time (ms) of STRIPED when m and n are both 20000

For ease of coding, our implementation uses a char to store the direction information at each position encountered in Phase 3, rather than the 4 bits of the analysis of Section 6.3.4. We tested STRIPED with different s values and the results are shown in Table 6-5 and Figure 6-7. Using an analysis similar to that of Section 6.2, we determine the maximum strip size, which is limited by the amount of shared memory per SM, to be 350. The time for Phase 2 is negligible and is not reported separately. The time for all


Table 6-6. Running time (s) of STRIPED when m < n

Table 6-7. Running time (s) of CHUNKED1D when m < n

Table 6-10. Running time (ms) of CHUNKED1D (m = 13060, n = 14090)

            s=64                   s=128                 s=256                 s=350
h       Ph1    Ph3   Total     Ph1   Ph3  Total     Ph1   Ph3  Total     Ph1   Ph3  Total
64   1104.0   18.1  1127.8   568.4  22.9  595.6   308.9  34.6  347.2   246.7  44.8  294.9
128  1065.7   20.8  1091.8   550.1  24.9  578.9   299.0  37.1  339.5   239.6  48.7  291.4
256  1046.5   25.7  1077.4   539.7  30.0  573.5   293.6  42.8  339.6   235.2  58.2  296.4
512  1036.5   38.8  1080.6   534.2  42.1  580.1   290.4  57.7  351.4   232.8  74.2  310.0

Table 6-11. Running time (ms) of CHUNKED1D (m = 16978, n = 18317)

            s=64                   s=128                  s=256                 s=350
h       Ph1    Ph3   Total     Ph1   Ph3   Total     Ph1   Ph3  Total     Ph1   Ph3  Total
64   1868.5   22.4  1898.5   994.0  28.2  1027.7   550.6  43.1  598.0   403.9  57.6  465.7
128  1803.4   25.9  1836.5   959.9  31.8   996.6   532.1  46.1  582.1   391.9  61.8  457.4
256  1771.1   34.0  1812.1   940.2  39.4   984.3   521.8  54.5  579.9   384.7  73.5  461.7
512  1754.6   48.4  1810.1   930.6  52.8   988.2   515.8  71.8  591.1   380.9  93.5  477.8

Table 6-12. Running time (ms) of CHUNKED2D (m = 6530, n = 7045)

           s=64                  s=128                s=256                s=350
h       Ph1   Ph3  Total     Ph1   Ph3  Total     Ph1   Ph3  Total    Ph1   Ph3  Total
64    312.6   6.0  323.6   167.4  10.1  181.9   103.0  14.1  121.2   96.2  18.3  118.5
128   290.7   7.1  302.3   157.5   9.6  170.8    97.3  12.5  113.3   90.4  15.8  109.5
256   279.3   9.1  292.7   150.7  11.5  165.8    93.3  12.8  109.3   86.5  19.4  109.0
512   273.5  13.9  291.5   147.1  13.0  163.3    90.4  19.3  112.8   83.8  27.4  114.2

Table 6-13. Running time (ms) of CHUNKED2D (m = 13060, n = 14090)

            s=64                   s=128                 s=256                 s=350
h       Ph1    Ph3   Total     Ph1   Ph3  Total     Ph1   Ph3  Total     Ph1   Ph3  Total
64   1222.8   11.7  1242.8   630.5  18.9  656.4   343.2  28.1  377.7   277.3  35.4  319.0
128  1135.7   13.2  1156.0   592.2  18.4  616.3   323.4  24.9  353.5   259.6  31.1  295.6
256  1091.9   17.2  1115.8   566.7  21.5  593.5   310.9  23.5  338.9   249.0  35.5  288.8
512  1068.8   24.9  1100.3   553.1  24.1  582.2   301.6  34.3  340.1   241.9  52.5  298.2

Table 6-14. Running time (ms) of CHUNKED2D (m = 16978, n = 18317)

            s=64                    s=128                  s=256                 s=350
h       Ph1    Ph3   Total      Ph1   Ph3   Total     Ph1   Ph3  Total     Ph1   Ph3  Total
64   2069.8   15.1  2096.2   1103.8  23.7  1136.8   614.1  35.7  658.1   452.3  45.9  506.5
128  1920.3   17.0  1946.7   1035.8  23.8  1066.9   577.7  32.3  616.5   424.1  38.9  469.3
256  1846.1   21.6  1876.4    989.6  27.6  1023.8   554.2  30.1  589.8   406.6  45.0  456.8
512  1808.7   31.5  1848.9    965.9  30.4  1002.5   536.4  44.1  585.5   395.7  66.6  467.0


CHAPTER 7
MULTICORE AND GPU ALGORITHMS FOR NUSSINOV RNA FOLDING

7.1 Background

An RNA sequence is a chain of nucleotides from the alphabet {G (guanine), A (adenine), U (uracil), C (cytosine)}. One segment of an RNA sequence might be paired with another segment of the same RNA sequence due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary structure. Two nucleotides in an RNA sequence can form Watson-Crick AU and GC base pairs as well as the GU wobble pair. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These algorithms are referred to as RNA folding algorithms. Waterman and Smith [88] and Nussinov et al. [59] made the first attempts in 1978. Zuker et al. [95] refined Nussinov's algorithm by using a thermodynamic energy minimization model, which produces more accurate results at the expense of greater computational complexity. Although our focus in this chapter is the simpler algorithm of Nussinov, our strategies may be applied to Zuker's algorithm as well. The complexity of Nussinov's and Zuker's algorithms is O(n^3), where n is the length of the RNA sequence to be folded. Other RNA folding algorithms with better prediction properties and higher complexity also exist. When folding viral sequences, n ranges from several thousand to several million, and single-core run time becomes excessive; consequently, much effort has gone into developing parallel versions of various RNA folding algorithms. For example, [55, 83] develop a multicore code for an O(n^4) folding algorithm, while [84] does this for Nussinov's algorithm. [17] develops a framework for RNA folding algorithms on a cluster and tests this framework using an O(n^6) (Pknots-RE) and an O(n^4) (Pknots-RG) algorithm for the prediction of RNA secondary structure. FPGA solutions for secondary structure prediction have been proposed in [15, 32, 92] and GPU solutions in [8, 68]. We note that [68] is based on the algorithm of Zuker et al. [95] while [8] is based on that of Nussinov et al. [59].


We start this chapter by describing Nussinov et al.'s [59] dynamic programming recurrence for secondary structure prediction, and we give the modifications of these equations obtained by Chang et al. [8]. These modifications simplify the parallelization of the original equations and compute the same results. We also describe the strategy used in [8] to obtain a GPU algorithm to solve the modified equations. A naive implementation of the modified equations of [8], together with a cache-efficient version and multicore versions, is given. We note that although [84] gives a multicore version of Nussinov's algorithm, our multicore version is much simpler and provides similar speedup. Then our GPU algorithm for the modified equations is described. Experimental and benchmark results are presented after that.

7.1.1 Nussinov's Dynamic Programming Recurrence

Let S = a1 a2 ... an be an RNA sequence, where ai in {A, C, G, U} is a nucleotide. Nussinov's algorithm finds a best secondary structure by maximizing the number of bonded pairs (AU, GC or GU). Let C(i,j) be the maximum number of bonded pairs in the subsequence ai ... aj, 1 ≤ i ≤ j ≤ n. Nussinov's dynamic programming recurrence for C is given below:

C(i, i-1) = 0,  2 ≤ i ≤ n
C(i, i)   = 0,  1 ≤ i ≤ n
C(i, j)   = max { C(i+1, j),
                  C(i, j-1),
                  C(i+1, j-1) + bond(ai, aj),
                  max_{i<k<j} ( C(i, k) + C(k+1, j) ) }

Here, bond(ai, aj) is 1 if (ai, aj) is an AU, GC or GU pair and 0 otherwise, and the third equation applies when i < j.

C(i, i)   = 0,  1 ≤ i ≤ n
C(i, i+1) = 0,  1 ≤ i ≤ n-1
C(i, j)   = max { C(i+1, j-1) + bond(ai, aj),
                  max_{i≤k<j} ( C(i, k) + C(k+1, j) ) }

This changes column reads to row reads and better utilizes the L2 and L1 caches of the target GPU.

Figure 7-2. Initialization of the matrix in Chang's algorithm

Figure 7-3. Three stages in Chang's algorithm
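As a concrete reference for the simplified equations above, here is a direct single-core sketch in Python (illustrative only; the dissertation's CPU and GPU codes are written in C/CUDA):

```python
# Nussinov folding with the simplified recurrence:
# C(i,i) = C(i,i+1) = 0, and
# C(i,j) = max( C(i+1,j-1) + bond(a_i, a_j),
#               max over i <= k < j of C(i,k) + C(k+1,j) ).
def bond(x, y):
    return 1 if {x, y} in ({"A", "U"}, {"G", "C"}, {"G", "U"}) else 0

def nussinov(seq):
    n = len(seq)
    C = [[0] * n for _ in range(n)]
    for span in range(2, n):           # fill by diagonals: j - i = span
        for i in range(n - span):
            j = i + span
            best = C[i + 1][j - 1] + bond(seq[i], seq[j])
            for k in range(i, j):      # bifurcation term
                best = max(best, C[i][k] + C[k + 1][j])
            C[i][j] = best
    return C[0][n - 1]

print(nussinov("GCAU"))       # 1 (the single G-U wobble pair)
print(nussinov("GGGAAACCC"))  # 3 (three nested G-C pairs)
```

That the matrix fills diagonal by diagonal, with all elements on one diagonal independent of each other, is exactly the parallelism the multicore and GPU codes below exploit.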


7.2 Methods

Cache-efficient and multicore algorithms. CPU1 (Algorithm 1) is a naive single-core algorithm to compute C using the simplified recurrence of Chang et al. This algorithm computes M[i][j] = C(i+1, j+1), 0 ≤ i ≤ j < n

7.2.1 Our GPU Algorithm

Unlike the GPU algorithm of [8], which computes C by diagonals, we use a refinement of the block strategy used in [68, 84]. Figure 7-4 shows the 20 x 20 C matrix for the case of RNA sequences whose length is n = 20. To compute the element labeled X, elements a to l are, respectively, added to elements A to L and the maximum of a+A, b+B, ..., l+L is computed. We note that the computation for Y also requires A to L and that X has to be computed before Y and Z can be computed.

Figure 7-4. Block partitioning of the C matrix

In our block strategy, we partition the upper triangle of the C matrix into square blocks, except that adjacent to the main diagonal the partitioning is into triangles and at the right end it is into possibly non-square rectangles. Figure 7-4 shows the partitioning for the case n = 20 using 4 x 4 blocks. Notice the triangles adjacent to the main diagonal and the 4 x 1 non-square partitions at the right end. The blocks (whether triangular, square, or non-square) are indexed left-to-right, top-to-bottom, beginning with (0,0). In keeping with the traditional way to number blocks for GPU computation, the first coordinate increases as you move to the right and the second as you move down.


So X (Figure 7-4) resides in (4,1), K in (4,4), and P in (3,2). Blocks on the main diagonal are indexed (i,i) and are triangular. From the dependencies in the simplified recurrence it follows that blocks that lie on the same diagonal of blocks (i.e., blocks with index (i, i-k) for some fixed k) are independent and so may be computed in parallel, but that elements within a block are to be computed by diagonals. Our 20 x 20 example of Figure 7-4 has 6 diagonals of blocks, and so six iterations of computation are required, with each iteration computing all blocks on the same diagonal in parallel. As noted earlier, the first diagonal of blocks is comprised of triangles. To each triangular block, we assign a thread block. The threads within the assigned thread block compute the elements in the triangular block in diagonal order, with all elements on the same diagonal being computable in parallel. Hence, for this computation, the number of thread blocks equals the number of triangular blocks. Let us turn our attention now to the remaining blocks (i.e., the non-triangular blocks). Notice that when we start the computation of the elements in, say, block (4,1), where X resides, a to j and C to L have already been computed, because they are on preceding block diagonals. But k, l, A, and B have yet to be computed. The computation of the maximum of c+C to j+J can be done using a kernel maxKernel (described later). This kernel uses registers for temporary values and writes these temporary values to shared memory upon completion. The final value for O can be obtained by comparing the temporary maximum value in O with P plus the bond value of the simplified recurrence. Then the maximum of r+O, q plus its bond value, and the temporary maximum value in m is written to m as its final value. Similarly, for M, the maximum of O+R, Q plus its bond value, and the temporary maximum value in M is written to M as its final value. The computations for m and M can be done in parallel. So the computation within element block (4,1) is done in diagonal order. All elements on the same diagonal can be computed in parallel, with all data residing in shared memory. The pseudocode is shown as Algorithm 2.


Algorithm 2: Our GPU algorithm

Require: RNA sequence R, blocked diagonal index D, block size BS
Ensure: C matrix

  Register[16] reg
  Shared_Memory[BS][BS] sA
  Shared_Memory[BS+1][BS+1] sC
  Global_Memory[BS][BS] gB
  Global_Memory[BS][BS] gC
  tx <- threadIdx.x; ty <- threadIdx.y
  bx <- blockIdx.x; by <- blockIdx.y
  aRow <- by x BS; aCol <- aRow - 1
  bRow <- aRow; bCol <- D x BS - 1 + aRow
  for blk <- 1 to D - 1 do
      sA <- the block starting at (aRow, aCol + blk x BS)
      gB <- the block starting at (bRow + blk x BS, bCol)
      maxKernel(sA, gB, reg)
      Syncthreads
  end for
  sC <- reg
  for d <- 1 to 2 x BS - 1 do
      for each element e on diagonal d do
          finish remaining computation
      end for
      Syncthreads
  end for
  gC <- sC

7.2.1.1 Description of maxKernel

The computation of the maximum of c+C to j+J (Figure 7-4) bears some resemblance to the computation of a term in a matrix multiply. So, we can adapt the ideas used in matrix multiply kernels to arrive at an efficient kernel to find the desired maximum of the sum of pairs. In our case (Algorithm 3), we adapt the GPU matrix multiply kernel of Volkov and Demmel [87]. The element block size used in our implementation is 32 x 32 and a thread block is configured as 32 x 2. Each thread computes 16 elements that lie in the same column, as shown in Figure 7-5 (this figure shows only six threads as arrows above block C). The 16 elements computed by one thread are represented as a slim gray bar in block C. The gray area in block A depicts


Algorithm 3: maxKernel

Require: block sA in shared memory, block gB in global memory
Ensure: partial result of the reduction in reg

  r0 <- gB[0][tx]; r1 <- gB[1][tx]
  r2 <- gB[2][tx]; r3 <- gB[3][tx]
  for j <- 0 to 6 do
      for k <- 0 to 15 do
          reg[k] <- max{reg[k], r0 + sA[ty x 16 + k][j x 4]}
      end for
      r0 <- gB[(j+1) x 4][tx]
      // 2 similar for loops for r1 and r2 come here
      for k <- 0 to 15 do
          reg[k] <- max{reg[k], r3 + sA[ty x 16 + k][j x 4 + 3]}
      end for
      r3 <- gB[(j+1) x 4 + 3][tx]
  end for
  for k <- 0 to 15 do
      reg[k] <- max{reg[k], r0 + sA[ty x 16 + k][28]}
  end for
  // 2 similar for loops for r1 and r2 come here
  for k <- 0 to 15 do
      reg[k] <- max{reg[k], r3 + sA[ty x 16 + k][31]}
  end for

the data needed by the first 32 threads. This data will be read into shared memory. To achieve high throughput from/to device memory, we use coalesced memory accesses, in which all data accessed by one warp (the minimum scheduling unit; it contains 32 threads) falls in the same device memory cache line of size 128 bytes. In Figure 7-5, six threads fetch the first row from the gray area of block B. Then each thread uses the value just fetched to add to the first column in the gray area of block A (which has already been read into shared memory). In other words, thread i will add B[0][i] to A[j][0] (0 ≤ j < 16), compare this value with register[j] of thread i, and update register[j] if necessary. Then B[1][i] is added to A[j][1] (0 ≤ j < 16), compared with register[j], and the register is updated. Since threads in the same warp read data in the same row of block B, this reading is coalesced and serviced at the throughput of the L1 or L2 cache in case of a cache hit, or at the throughput of device memory otherwise.
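The operation maxKernel performs is a (max, +) analogue of a matrix-product term: entry k of the result is the maximum over j of A[k][j] + B[j][col]. A plain Python sketch of this reduction (names and the small matrices are illustrative; the CUDA kernel additionally tiles the blocks, coalesces the gB reads, and keeps 16 results per thread in registers):

```python
# (max, +) analogue of one column of a matrix product:
# out[k] = max over j of A[k][j] + B[j][col].
NEG_INF = float("-inf")

def max_plus_column(A, B, col):
    rows, inner = len(A), len(A[0])
    out = [NEG_INF] * rows
    for j in range(inner):      # stream one row of B at a time,
        b = B[j][col]           # as the kernel streams rows of gB
        for k in range(rows):
            out[k] = max(out[k], A[k][j] + b)
    return out

A = [[1, 5, 0], [2, 2, 2]]
B = [[3, 1], [0, 4], [7, 0]]
assert max_plus_column(A, B, 0) == [7, 9]
assert max_plus_column(A, B, 1) == [9, 6]
```

Replacing max with + (and + with x) recovers the ordinary matrix-multiply term, which is why the Volkov-Demmel multiply kernel adapts so directly.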


Besides, all threads in the same warp use the same value from block A, which resides in shared memory. This value can be broadcast to all threads in the same warp.

Figure 7-5. maxKernel illustration

We note that [68] also employs a maxKernel, but their kernel is different from ours.

7.3 Results

We benchmarked our algorithms using a PC with a hyperthreaded 6-core Intel i7-x980 3.33GHz CPU and 12GB DDR3 RAM. The PC also had an NVIDIA Tesla C2050 GPU. Since only two threads may be scheduled per i7 core at any time, a maximum of 12 threads may be gainfully used. We used randomly generated RNA sequences in our experiments. Since the run time of our codes is relatively insensitive to the actual RNA sequence in use, our use of random sequences does not materially impact our conclusions.

7.3.1 Single- and Multicore Algorithms

In both the codes OMP1 and OMP2, the work assigned to the threads is well balanced by OpenMP, and so best performance is expected using either 6 or 12 threads. Our experiments confirmed this expectation, with the use of 6 threads generally being


Table 7-1. Running time (s) of different CPU algorithms

n        CPU1     OMP1   Ratio     CPU2    OMP2   Ratio
3000      35.9      7.1    5.1      22.3     4.8    4.6
4000      98.1     18.6    5.3      52.8    11.3    4.7
5000     208.1     41.6    5.0     102.9    22.2    4.6
6000     363.7     72.2    5.0     177.5    45.3    3.9
7000     646.1    125.2    5.2     281.3    61.0    4.6
8000     924.4    197.8    4.7     419.6    92.5    4.5
9000    1461.5    291.0    5.0     596.6   129.9    4.6
10000   1927.7    395.0    4.9     819.1   176.9    4.6
11000   2800.8    559.2    5.0    1088.4   234.5    4.6
12000   3525.2    741.4    4.8    1413.6   303.3    4.7
13000   4852.3    978.8    5.0    1795.4   388.4    4.6
14000   6008.9   1250.2    4.8    2243.5   485.2    4.6
15000   7930.0   1641.4    4.8    2757.3   594.0    4.6
16000  10120.0   2380.8    4.3    3343.5   725.4    4.6

better than the use of 12 threads. So, for our application, the overhead of context switching between the two threads assigned to a core when a total of 12 threads are used generally exceeded the gains obtainable from having a second thread ready in case the first thread stalls from memory delay. Table 7-1 gives the run times for our algorithms CPU1, CPU2, OMP1, and OMP2 for n values ranging from 3000 to 16000. The columns labeled Ratio give the ratios CPU1/OMP1 and CPU2/OMP2, respectively. Although we have 6 cores on our CPU, we are able to achieve speedups of almost 5 from the multicore versions. By comparison, the far more complex multicore code of [84], which uses a blocking strategy similar to that used by our GPU code, achieves a simulated speedup of 6.3 with 4 threads. However, this simulated speedup ignores several factors, such as synchronization overhead, that will reduce speedup in a real environment. Further, the simulated speedup of 6.3 is relative to the equivalent of the code CPU1. The speedup achieved by OMP2 relative to CPU1 is between 7.5 and 14.0! We note also that the speedup obtained solely from the use of the caching strategy (i.e., the ratio CPU1/CPU2) ranges from 1.6 to 3.0.


7.3.2 GPU Algorithms

We experimented with three versions of our GPU algorithm. The first is called Ours1, which is as described in the section on our GPU algorithm. In the second version, which is called Ours2, device memory usage is reduced by half by storing only the upper triangle of the output matrix. This upper triangle is mapped into a one-dimensional array using the standard row-major mapping. Since this version uses only half the device memory used by the other versions, it may be used on larger instances. In the third version, which is called OursR, we replaced our maxKernel with the kernel described in [68].1 These three codes were benchmarked against each other as well as against the GPU Nussinov code of [8].2 The maximum size of sequence Ours2 can handle is 37000, while the other versions can handle sequences of size up to 26000. Ours2 runs slightly slower than Ours1, as shown in Table 7-2. So, Ours2 is recommended only when the instance size is large enough to make Ours1 infeasible. Table 7-2 and Figure 7-6 show the running times for the four different GPU codes. Ratio1 in Table 7-2 shows the speedup of Ours1 relative to [8] ([8]/Ours1). Ratio2 shows OursR/Ours1. As can be seen, Ours1 is up to 1.9 times as fast as OursR, indicating that a corresponding speedup could be obtained for Zuker's algorithm by replacing the maximum finding kernel used in [68] with our kernel for this operation. Also, Ours1 is between 3.0 and 11.1 times as fast as the GPU algorithm of [8].

7.3.3 Single Core vs Multicore vs GPU

Table 7-3 gives the speedup obtained by Ours1 relative to CPU2 and OMP2. Using a GPU, we can do the Nussinov computations up to 522.6 times as fast as using a cache-efficient single-core code and up to 113.4 times as fast as using a 6-core cache

1 Since we were unable to get the GPU code of [68], the kernel used by us was actually one we wrote based on the description provided in [68].
2 We are grateful to the authors of [8] for making their code available.
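The row-major mapping of the upper triangle used by Ours2 can be sketched as follows (the exact offset formula below is an assumption for illustration; the dissertation states only that the standard row-major mapping is used):

```python
# Row-major packing of the upper triangle of an n x n matrix into a 1-D array:
# row i holds entries j = i .. n-1, so row i starts at offset
# n + (n-1) + ... + (n-i+1) = i*n - i*(i-1)//2.
def tri_index(i, j, n):
    assert 0 <= i <= j < n
    return i * n - i * (i - 1) // 2 + (j - i)

# The packed array has n*(n+1)//2 entries, roughly half of the n*n cells of
# the full matrix; enumerating the upper triangle row by row visits
# consecutive offsets 0, 1, 2, ...
n = 6
offsets = [tri_index(i, j, n) for i in range(n) for j in range(i, n)]
assert offsets == list(range(n * (n + 1) // 2))
```

Halving the stored cells in this way is what lets Ours2 reach n = 37000 within the same device memory that limits the other versions to n = 26000.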


Table 7-2. Running time (s) of different GPU algorithms

n       Ours1   Ours2    [8]   OursR   Ratio1   Ratio2
2000      0.1     0.1    0.3     0.1      3.0      1.0
6000      0.6     0.7    4.0     0.8      6.7      1.3
10000     1.9     2.2   16.4     3.2      8.6      1.7
14000     4.5     5.1   43.0     7.9      9.6      1.8
18000     8.8     9.9   89.5    16.0     10.2      1.8
22000    15.1    16.9  161.7    28.2     10.7      1.9
26000    23.9    26.7  266.3    45.8     11.1      1.9
37000     -      71.5    -       -        -        -

Figure 7-6. Plot of running time of GPU algorithms

efficient code! Compared to the naive single-core code CPU1, our GPU codes provide a speedup of up to 1582!

Table 7-3. Speedup of Ours1 relative to other versions

n      CPU2    OMP2        n       CPU2    OMP2
3000   157.0   33.8      10000    424.4    91.7
4000   224.7   48.1      11000    441.4    95.1
5000   259.8   56.1      12000    465.9   100.0
6000   302.9   77.3      13000    472.3   102.2
7000   341.8   74.1      14000    496.9   107.5
8000   376.0   82.9      15000    503.3   108.4
9000   392.2   85.4      16000    522.6   113.4


7.4 Summary

We have developed simple and efficient single- and multicore algorithms as well as an efficient GPU code for RNA folding based on Nussinov's equations [59]. Our cache-efficient single-core algorithm provides a speedup between 1.6 and 3.0 relative to a naive straightforward single-core code. The multicore version of the cache-efficient single-core algorithm provides a speedup, relative to the naive single-core algorithm, between 7.5 and 14.0 on a 6-core hyperthreaded CPU. Our GPU algorithm, Ours1, for the NVIDIA C2050 is up to 1582 times as fast as the naive single-core algorithm and between 3.0 and 11.1 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. With the available 3GB device memory on an NVIDIA GPU, Ours1 is able to handle sequences of length up to 26000. Sequences of length between 26000 and 37000 may be handled using Ours2, which uses a one-dimensional array mapping of the upper triangle of the output matrix rather than a two-dimensional array that represents the full output matrix. Ours2, however, runs slightly slower than Ours1. Our methods can be used to speed up RNA folding using Zuker's equations as well [68, 95].


CHAPTER 8
CONCLUSIONS

Although GPUs are optimized for graphics calculations, their low cost per gigaflop has motivated significant research into their efficient use for non-graphics applications. However, obtaining performance near the peak requires very careful programming of the application code. This programming is complicated by the availability of several different memories (e.g., device memory, shared memory, cache, and registers), each with different latency, and by the partitioning of the available scalar processors or cores into SMs. In this dissertation, two matrix multiplication algorithms that run on a GPU were first developed to illustrate guidelines for developing efficient algorithms on a GPU. Following these guidelines, novel bioinformatics algorithms that run on a GPU were developed, including pairwise sequence alignment, three-sequence alignment, syntenic sequence alignment, and RNA secondary structure prediction. These parallel algorithms achieved significant performance improvements compared to the sequential algorithms or other GPU implementations. Specifically,

1. The matrix multiplication code GPU8, which runs on the C1060, achieves a gigaflop rate of over 370 GFlops when n = 16384. That is about 60% of the theoretical peak performance of the C1060. Compared to single- and quad-core code, GPU8 achieves up to four orders of magnitude speedup relative to SingleCoreIKJ and two orders of magnitude speedup relative to the 4-thread OpenMP version of SingleCoreIKJ.

2. Our GPU implementations of Strassen's and Winograd's matrix multiplication algorithms achieve speedups of 32% (35%) and 33% (36%), respectively, for single-precision (integer) arithmetic, relative to the sgemm (integer version of sgemm) code in CUBLAS 3.0 when multiplying 16384 x 16384 matrices.

3. The scoring algorithm StripedScore for pairwise sequence alignment achieves a speedup of 13 to 17 relative to the single-GPU algorithm of [37]. The speedup ranges relative to BlockedAntidiagonal [76] and CUDASW++ 2.0 [50] are, respectively, 20 to 33 and 7.7 to 22.8. Our algorithms to determine the actual alignment also require much less memory.

4. The sloped scoring algorithm for three-sequence alignment, which is 3 times as fast as the layered algorithm, provides a speedup of up to 90 relative to a single-core algorithm running on our host CPU. The sloped alignment algorithm is


also 3 times as fast as the layered one and provides a speedup between 21 and 56 relative to the single-core algorithm.

5. Our best alignment algorithm for syntenic alignment achieves a speedup of 28 relative to the single-core CPU algorithm of [26]. Our algorithms to determine the actual alignment also require much less memory than the traditional traceback algorithm.

6. Our cache-efficient single-core algorithm for RNA secondary structure folding provides a speedup between 1.6 and 3.0 relative to a naive straightforward single-core code. The multicore version of the cache-efficient single-core algorithm provides a speedup, relative to the naive single-core algorithm, between 7.5 and 14.0 on a 6-core hyperthreaded CPU. Our GPU algorithm for Nussinov RNA folding, Ours1, for the NVIDIA C2050 is up to 1582 times as fast as the naive single-core algorithm and between 3.0 and 11.1 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding.


REFERENCES

[1] Alachiotis, N., Berger, S.A., and Stamatakis, A. Accelerating phylogeny-aware short DNA read alignment with FPGAs. 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2011, 226.

[2] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410 (1990).

[3] Bader, D.A., Madduri, K., Gilbert, J.R., Shah, V., Kepner, J., Meuse, T., and Krishnamurthy, A. Designing scalable synthetic compact applications for benchmarking high productivity computing systems. CTWatch Quarterly, 2(4B):41-51 (2006).

[4] Bailey, D.H., Lee, K., and Simon, H.D. Using Strassen's algorithm to accelerate the solution of linear systems. J. Supercomput. 4 (1991). 4:357.

[5] Batzoglou, S., Pachter, L., Mesirov, J., Berger, B., and Lander, E.S. Human and mouse gene structure: comparative analysis and application to exon prediction. Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, RECOMB '00. New York, NY, USA: ACM, 2000, 46.

[6] MAGMA BLAS. Website, 2009. http://icl.cs.utk.edu/magma

[7] Boyer, B., Dumas, J.G., Pernet, C., and Zhou, W. Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm. Proceedings of the 2009 International Symposium on Symbolic and Algebraic Computation. ACM, 2009, 55.

[8] Chang, D.-J., Kimmer, C., and Ouyang, M. Accelerating the Nussinov RNA folding algorithm with CUDA/GPU. 2010 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 2010, 120.

[9] Chao, K.-M., Zhang, J., Ostell, J., and Miller, W. A local alignment tool for very long DNA sequences. Comput. Appl. Biosci., 11, 147-153 (1995).

[10] Coppersmith, D. and Winograd, S. Matrix multiplication via arithmetic progressions. Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, STOC '87. New York, NY, USA: ACM, 1987, 1.

[11] Dandass, Y.S., Burgess, S.C., Lawrence, M., and Bridges, S.M. Accelerating string set matching in FPGA hardware for bioinformatics research. BMC Bioinformatics (2008).

[12] Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5 (1978). suppl 3:345.


[13] Delcher,A.L.,Kasif,S.,Fleischmann,R.D.,Peterson,J.,White,O.,andSalzberg,S.L.Alignmentofwholegenomes.NucleicAcidsRes27(1999).11:2369. [14] Division,ResearchandWinograd,S.OnMultiplicationof2x2Matrices.Researchreports//IBM.InternationalBusinessMachinesCorporation,1970. [15] Dou,Yong,Xia,Fei,andJiang,Jingfei.Fine-grainedparallelapplicationspeciccomputingforRNAsecondarystructurepredictionusingSCFGSonFPGA.Proceedingsofthe2009internationalconferenceonCompilers,architecture,andsynthesisforembeddedsystems.CASES'09.NewYork,NY,USA:ACM,2009,107. [16] Douglas,CraigC.,Heroux,Michael,Slishman,Gordon,andSmith,RogerM.GEMMW:aportablelevel3BLASWinogradvariantofStrassen'smatrix-matrixmultiplyalgorithm.J.Comput.Phys.110(1994).1:1. [17] Estrada,Trilce,Licon,Abel,andTaufer,Michela.compPknots:aframeworkforparallelpredictionandcomparisonofRNAsecondarystructureswithpseudoknots.Proceedingsofthe2006internationalconferenceonFrontiersofHighPerformanceComputingandNetworking.ISPA'06.Berlin,Heidelberg:Springer-Verlag,2006,677. [18] Futamura,Natsuhiko,Aluru,Srinivas,andHuang,Xiaoqiu.ParallelSyntenicAlignments.Proceedingsofthe9thInternationalConferenceonHighPerformanceComputing.HiPC'02.London,UK,UK:Springer-Verlag,2002,420. [19] Gotoh,O.Animprovedalgorithmformatchingbiologicalsequences.MolecularBiology,162,705-708(1982). [20] Gotoh,Osamu.Alignmentofthreebiologicalsequenceswithanefcienttracebackprocedure.JournalofTheoreticalBiology121(1986).3:327. [21] Gudy,AdamandDeorowicz,Sebastian.AParallelGPU-DesignedAlgorithmfortheConstrainedMultipleSequenceAlignmentProblem.Man-MachineInteractions2.eds.TadeuszCzachrski,StanislawKozielski,andUrszulaStanczyk,vol.103ofAdvancesinIntelligentandSoftComputing.SpringerBerlin/Heidelberg,2011.361. [22] Henikoff,S.andHenikoff,J.G.Aminoacidsubstitutionmatricesfromproteinblocks.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica89(1992).22:10915. 
[23] Higham, N. J. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Transactions on Mathematical Software (TOMS) 16 (1990), 4: 352.

[24] Hirschberg, D. S. A linear space algorithm for computing maximal common subsequences. Commun. ACM 18 (1975), 6: 341.


[25] Huang, Xiaoqiu. Alignment of three sequences in quadratic space. ACM SIGAPP Applied Computing Review 1 (1993), 2: 7.

[26] Huang, Xiaoqiu and Chao, Kun-Mao. A generalized global alignment algorithm. Bioinformatics 19 (2003), 2: 228.

[27] Hung, Che-Lun, Lin, Chun-Yuan, Chung, Yeh-Ching, and Tang, Chuan Yi. Introducing variable gap penalties into three-sequence alignment for protein sequences. Advanced Information Networking and Applications - Workshops, 2008. AINAW 2008. 22nd International Conference on. IEEE, 2008, 726.

[28] Huss-Lederman, S., Jacobson, E. M., Johnson, J. R., Tsao, A., and Turnbull, T. Implementation of Strassen's algorithm for matrix multiplication. Supercomputing, 1996. Proceedings of the 1996 ACM/IEEE Conference on. IEEE, 1996, 32.

[29] Huss-Lederman, S., Jacobson, E. M., Johnson, J. R., Tsao, A., and Turnbull, T. Strassen's Algorithm for Matrix Multiplication: Modeling, Analysis, and Implementation. Citeseer, 1996.

[30] Hussain, H. M., Benkrid, K., Seker, H., and Erdogan, A. T. FPGA implementation of K-means algorithm for bioinformatics application: An accelerated approach to clustering Microarray data. Adaptive Hardware and Systems (AHS), 2011 NASA/ESA Conference on. 2011, 248.

[31] Jacob, A., Buhler, J., and Chamberlain, R. D. Accelerating Nussinov RNA secondary structure prediction with systolic arrays on FPGAs. Application-Specific Systems, Architectures and Processors, 2008. ASAP 2008. International Conference on. 2008, 191.

[32] Jacob, A., Buhler, J., and Chamberlain, R. D. Accelerating Nussinov RNA secondary structure prediction with systolic arrays on FPGAs. Application-Specific Systems, Architectures and Processors, 2008. ASAP 2008. International Conference on. 2008, 191.

[33] Jareborg, Niclas, Birney, Ewan, and Durbin, Richard. Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Research 9 (1999), 9: 815.

[34] Kaporin, I. A practical algorithm for faster matrix multiplication. Numerical Linear Algebra with Applications 6 (1999), 8: 687.

[35] Karanam, R. K., Ravindran, A., Mukherjee, A., Gibas, C., and Wilkinson, A. B. Using FPGA-based hybrid computers for bioinformatics applications. Xilinx Xcell Journal 58 (2006): 80.

[36] Karunadasa, N. P. and Ranasinghe, D. N. On the comparative performance of parallel algorithms on small GPU/CUDA clusters. Int. Conf. on High Performance Computing, 2009.


[37] Khajeh-Saeed, Ali, Poole, Stephen, and Blair Perot, J. Acceleration of the Smith-Waterman algorithm using single and multiple Graphics Processors. Computational Physics (2010).

[38] Khalafallah, A., Elbabb, H. F., Mahmoud, O., and Elshamy, A. Optimizing Smith-Waterman algorithm on Graphics Processing Unit. Computer Technology and Development (ICCTD), 2010 2nd International Conference on. 2010, 650.

[39] Kruspe, Matthias and Stadler, Peter F. Progressive multiple sequence alignments from triplets. BMC Bioinformatics 8 (2007), 1: 254.

[40] Li, Isaac, Shum, Warren, and Truong, Kevin. 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA). BMC Bioinformatics 8 (2007): 185.

[41] Li, Junjie, Ranka, Sanjay, and Sahni, Sartaj. Pairwise sequence alignment for very long sequences on GPUs. Computational Advances in Bio and Medical Sciences, IEEE International Conference on (2012): 1.

[42] Li, Junjie, Ranka, Sanjay, and Sahni, Sartaj. Parallel Syntenic Alignment on GPUs. Proceedings of the 3rd ACM Conference on Bioinformatics, Computational Biology and Biomedicine. BCB '12. ACM, 2012, 266.

[43] Ligowski, Lukasz and Rudnicki, Witold. An efficient implementation of Smith-Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. IPDPS 2009, 1-8 (2009).

[44] Lin, Chun Yuan, Huang, Chen Tai, Chung, Yeh-Ching, and Tang, Chuan Yi. Efficient Parallel Algorithm for Optimal Three-Sequences Alignment. Parallel Processing, 2007. ICPP 2007. International Conference on. IEEE, 2007, 14.

[45] Ling, Cheng, Benkrid, K., and Erdogan, A. T. High performance intra-task parallelization of Multiple Sequence Alignments on CUDA-compatible GPUs. Adaptive Hardware and Systems (AHS), 2011 NASA/ESA Conference on. 2011, 360.

[46] Lipman, D. J. and Pearson, W. R. Rapid and sensitive protein similarity searches. Science, 227, 1435-1441 (1985).

[47] Liu, W., Schmidt, B., Voss, G., and Muller-Wittig, W. GPU-ClustalW: Using graphics hardware to accelerate multiple sequence alignment. High Performance Computing - HiPC 2006 (2006): 363-374.

[48] Liu, W., Schmidt, B., Voss, G., and Muller-Wittig, W. Streaming Algorithms for Biological Sequence Alignment on GPUs. Parallel and Distributed Systems, IEEE Transactions on 18 (2007), 9: 1270.


[49] Liu, Yongchao, Schmidt, B., and Maskell, D. L. MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA. Application-Specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on. 2009, 121.

[50] Liu, Yongchao, Schmidt, Bertil, and Maskell, Douglas. CUDASW++ 2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Research Notes, 3, 93 (2010).

[51] Lu, Jizhu, Perrone, M., Albayraktaroglu, K., and Franklin, M. HMMer-Cell: High Performance Protein Profile Searching on the Cell/B.E. Processor. Performance Analysis of Systems and Software, 2008. ISPASS 2008. IEEE International Symposium on. 2008, 223.

[52] Maglott, Donna, Ostell, Jim, Pruitt, Kim D., and Tatusova, Tatiana. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research 33 (2005), suppl. 1: D54-D58.

[53] Mak, T. S. T. and Lam, K. P. Embedded computation of maximum-likelihood phylogeny inference using platform FPGA. Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE. 2004, 512-514.

[54] Manavski, Svetlin and Valle, Giorgio. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9, S10 (2008).

[55] Mathuriya, Amrita, Bader, David A., Heitsch, Christine E., and Harvey, Stephen C. GTfold: a scalable multicore code for RNA secondary structure prediction. Proceedings of the 2009 ACM Symposium on Applied Computing. SAC '09. New York, NY, USA: ACM, 2009, 981.

[56] Melo, Alba, Walter, Maria, Melo, Renata, Santana, Marcelo, and Batista, Rodolfo. Local DNA sequence alignment in a cluster of workstations: algorithms and tools. Journal of the Brazilian Computer Society, 10, 81-88 (2004).

[57] Murata, M., Richardson, J. S., and Sussman, Joel L. Simultaneous comparison of three protein sequences. Proceedings of the National Academy of Sciences 82 (1985), 10: 3073.

[58] Needleman, Saul B. and Wunsch, Christian D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Molecular Biology, 48, 443-453 (1970).
[59] Nussinov, Ruth, Pieczenik, George, Griggs, Jerrold R., and Kleitman, Daniel J. Algorithms for Loop Matchings. SIAM Journal on Applied Mathematics 35 (1978), 1: 68.


[60] NVIDIA. CUBLAS. Website, 2010. http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/CUBLAS_Library_3.0.pdf

[61] NVIDIA. CUDA C Best Practices Guide, Version 3.2. Website, 2010. http://developer.nvidia.com/object/gpucomputing.html

[62] NVIDIA. CUDA C Programming Guide, Version 3.0. Website, 2010. http://developer.nvidia.com/object/gpucomputing.html

[63] NVIDIA. TESLA C2050/C2070 GPU Computing Processor. Website, 2010. http://www.nvidia.com/object/product_tesla_C2050_C2070_us.html

[64] NVIDIA. CUDA Occupancy Calculator. Website, 2012. http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

[65] OpenMP. The OpenMP API. Website, 2008. http://www.openmp.org/wp/

[66] Powell, David R., Allison, Lloyd, and Dix, Trevor I. Fast, optimal alignment of three sequences using linear gap costs. Journal of Theoretical Biology 207 (2000), 3: 325.

[67] Qiu, Xiaohong, Ekanayake, Jaliya, Beason, Scott, Gunarathne, Thilina, Fox, Geoffrey, Barga, Roger, and Gannon, Dennis. Cloud technologies for bioinformatics applications. Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers. MTAGS '09. New York, NY, USA: ACM, 2009, 6:1-6:10.

[68] Rizk, Guillaume and Lavenier, Dominique. GPU Accelerated RNA Folding Algorithm. Proceedings of the 9th International Conference on Computational Science: Part I. ICCS '09. Berlin, Heidelberg: Springer-Verlag, 2009, 1004.

[69] Robinson, S. Toward an optimal algorithm for matrix multiplication. SIAM News 38 (2005), 9: 1.

[70] Sachdeva, Vipin, Kistler, Michael, Speight, Evan, and Tzeng, Tzy-Hwa Kathy. Exploring the viability of the Cell Broadband Engine for bioinformatics applications. Parallel Computing 34 (2008), 11: 616-626 (High-Performance Computational Biology).

[71] Sahni, S. Data Structures, Algorithms, and Applications in C++. Silicon Press, NJ, USA, 2005, 2nd ed.

[72] Sarje, A. and Aluru, S. Parallel biological sequence alignments on the Cell Broadband Engine. Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. 2008, 1.

[73] Sarje, A. and Aluru, S. Parallel Genomic Alignments on the Cell Broadband Engine. Parallel and Distributed Systems, IEEE Transactions on 20 (2009), 11: 1600.


[74] Schmidt, Bertil. Bioinformatics: High Performance Parallel Computer Architectures. Boca Raton, FL, USA: CRC Press, Inc., 2010, 1st ed.

[75] Schwartz, Scott, Zhang, Zheng, Frazer, Kelly A., Smit, Arian, Riemer, Cathy, Bouck, John, Gibbs, Richard, Hardison, Ross, and Miller, Webb. PipMaker - A Web Server for Aligning Two Genomic DNA Sequences. Genome Research 10 (2000), 4: 577.

[76] Siriwardena, T. R. P. and Ranasinghe, D. N. Accelerating global sequence alignment using CUDA compatible multi-core GPU. ICIAFs 2010, 201-206 (2010).

[77] Smith, T. F. and Waterman, M. S. Identification of common molecular subsequences. Molecular Biology, 147, 195-197 (1981).

[78] Stamatakis, A., Blagojevic, F., Nikolopoulos, D. S., and Antonopoulos, C. D. Exploring New Search Algorithms and Hardware for Phylogenetics: RAxML Meets the IBM Cell. J. VLSI Signal Process. Syst. 48 (2007), 3: 271.

[79] Steinfadt, S. and Schaffer, K. Parallel Approaches for SWAMP Sequence Alignment. Bioinformatics, 2009. OCCBIO '09. Ohio Collaborative Conference on. 2009, 87.

[80] Stockinger, H., Pagni, M., Cerutti, L., and Falquet, L. Grid Approach to Embarrassingly Parallel CPU-Intensive Bioinformatics Problems. e-Science and Grid Computing, 2006. e-Science '06. Second IEEE International Conference on. 2006, 58.

[81] Strassen, Volker. Gaussian elimination is not optimal. Numerische Mathematik 13 (1969), 4: 354.

[82] Sun, Ying, Zhao, Shuqi, Yu, Huashan, Gao, Ge, and Luo, Jingchu. ABCGrid: Application for Bioinformatics Computing Grid. Bioinformatics 23 (2007), 9: 1175.

[83] Swenson, M. S., Anderson, J., Ash, A., Gaurav, P., Sukosd, Z., Bader, D. A., Harvey, S. C., and Heitsch, C. E. GTfold: Enabling parallel RNA secondary structure prediction on multi-core desktops. BMC Res Notes 5 (2012), 1: 341.

[84] Tan, Guangming, Sun, Ninghui, and Gao, Guang R. A parallel dynamic programming algorithm on a multi-core architecture. Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures. SPAA '07. New York, NY, USA: ACM, 2007, 135.

[85] Ukkonen, Esko. On approximate string matching. Foundations of Computation Theory. Springer, 1983, 487.

[86] Varon, Andres, Wheeler, Ward C., et al. The tree alignment problem. BMC Bioinformatics 13 (2012): 293.


[87] Volkov, Vasily and Demmel, James W. Benchmarking GPUs to tune dense linear algebra. Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, 31:1-31:11.

[88] Waterman, M. S. and Smith, T. F. RNA secondary structure: A complete mathematical analysis. Math. Biosc. 42 (1978): 257.

[89] Wirawan, Adrianto, Keong, Kwoh Chee, and Schmidt, Bertil. Parallel DNA sequence alignment on the Cell Broadband Engine. Proceedings of the 7th International Conference on Parallel Processing and Applied Mathematics. PPAM '07. Berlin, Heidelberg: Springer-Verlag, 2008, 1249.

[90] Wirawan, Adrianto, Schmidt, Bertil, and Kwoh, Chee Keong. Pairwise Distance Matrix Computation for Multiple Sequence Alignment on the Cell Broadband Engine. Proceedings of the 9th International Conference on Computational Science: Part I. ICCS '09. Berlin, Heidelberg: Springer-Verlag, 2009, 954.

[91] Won, Youngju and Sahni, Sartaj. Hypercube-to-host sorting. The Journal of Supercomputing 3 (1989): 41.

[92] Xia, Fei, Dou, Yong, Zhou, Xingming, Yang, Xuejun, Xu, Jiaqing, and Zhang, Yang. Fine-grained parallel RNAalifold algorithm for RNA secondary structure prediction on FPGA. BMC Bioinformatics 10 (2009): 1.

[93] Yue, Feng and Tang, Jijun. A divide-and-conquer implementation of three sequence alignment and ancestor inference. Bioinformatics and Biomedicine, 2007. BIBM 2007. IEEE International Conference on. IEEE, 2007, 143.

[94] Zhang, Huiliang, Schmidt, Bertil, and Muller-Wittig, Wolfgang. Accelerating BLASTP on the Cell Broadband Engine. Proceedings of the Third IAPR International Conference on Pattern Recognition in Bioinformatics. PRIB '08. Berlin, Heidelberg: Springer-Verlag, 2008, 460.

[95] Zuker, Michael and Stiegler, Patrick. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research 9 (1981), 1: 133.


BIOGRAPHICAL SKETCH

Junjie Li was born in 1986 in Rushan, a small town by the seaside in Shandong Province, P.R. China, and was given his name by his grandfather. He attended Yellow Mountain Road Elementary School, FuQian Middle School, and No. 1 High School. After China's National College Entrance Examination, he was admitted to Beijing University of Posts and Telecommunications as an undergraduate student majoring in information security. At the end of his second year there, he was admitted to the Innovation Lab in the College of Information Engineering. He graduated in 2009 with the degree of Bachelor of Engineering. He then came to the University of Florida to pursue the degree of Doctor of Philosophy under the supervision of Professor Sartaj Sahni and Professor Sanjay Ranka in the Department of Computer and Information Science and Engineering. Junjie's research focuses on high-performance computing: he designs high-performance parallel algorithms that run on a GPU to accelerate bioinformatics applications, particularly those requiring dynamic programming solutions.