Multicore and GPU algorithms for Nussinov RNA folding


Material Information

Title: Multicore and GPU algorithms for Nussinov RNA folding
Creators: Li, Junjie; Ranka, Sanjay; Sahni, Sartaj
Language: English
Citation: Li et al. BMC Bioinformatics 2014, 15(Suppl 8):S1; http://www.biomedcentral.com/1471-2105/15/S8/S1; pages 1-9
DOI: 10.1186/1471-2105-15-S8-S1





RESEARCH Open Access

Multicore and GPU algorithms for Nussinov RNA folding

Junjie Li*, Sanjay Ranka, Sartaj Sahni

From Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013), New Orleans, LA, USA, 12-14 June 2013

Abstract

Background: One segment of an RNA sequence might be paired with another segment of the same RNA sequence due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary structure. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These algorithms are referred to as RNA folding algorithms.

Results: We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's algorithm.

Conclusions: Our cache efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive, straightforward single core code. The multicore version of the cache efficient single core algorithm provides a speedup, relative to the naive single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 5.1 and 11.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding.

Background

An RNA sequence is a chain of nucleotides from the alphabet {G (guanine), A (adenine), U (uracil), C (cytosine)}. One segment of an RNA sequence might be paired with another segment of the same RNA sequence due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary structure. Two nucleotides in an RNA sequence can form Watson-Crick AU and GC base pairs as well as the GU wobble pair. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These algorithms are referred to as RNA folding algorithms. Waterman and Smith [1] and Nussinov et al. [2] made the first attempts in 1978. Zuker et al. [3] refined Nussinov's algorithm by using a thermodynamic energy minimization model, which produces more accurate results at the expense of greater computational complexity. Although our focus in this paper is the simpler Nussinov's algorithm, our strategies may be applied to Zuker's algorithm as well.
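The pairing rule just described (Watson-Crick AU and GC pairs plus the GU wobble pair) can be written as a small predicate; this is our own sketch, not code from the paper:

```python
def can_pair(x, y):
    # Watson-Crick AU and GC pairs plus the GU wobble pair, in either order.
    return {x, y} in ({'A', 'U'}, {'G', 'C'}, {'G', 'U'})
```

This predicate is the bond(ai, aj) test used by the recurrences in the following sections.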
The complexity of Nussinov's and Zuker's algorithms is O(n^3), where n is the length of the RNA sequence to be folded. Other RNA folding algorithms with better prediction properties and higher complexity also exist. When folding viral sequences, n ranges from several thousand to several million, and single-core run time becomes excessive; hence much effort has gone into developing parallel versions of various RNA folding algorithms. For example, [4,5] develop a multicore code for an O(n^4) folding algorithm while [6] does this for Nussinov's algorithm. [7] develops a framework for RNA folding algorithms on a cluster and tests this framework using an O(n^6) (Pknots-RE) and an O(n^4) (Pknots-RG) algorithm for the prediction of RNA secondary structure. FPGA solutions for secondary structure prediction have been proposed in [8-10] and GPU solutions in [11,12]. We note that [11] is based on the algorithm of Zuker et al. [3] while [12] is based on that of Nussinov et al. [2].

*Correspondence: jl3@cise.ufl.edu
Department of Computer and Information Science and Engineering, University of Florida, 32611 Gainesville, USA

Li et al. BMC Bioinformatics 2014, 15(Suppl 8):S1
http://www.biomedcentral.com/1471-2105/15/S8/S1

© 2014 Li et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

We start in this paper by describing the GPU architecture and programming model. Then we state Nussinov et al.'s [2] dynamic programming recurrence for secondary structure prediction and give modifications of these


equations as obtained by Chang et al. [12]. These modifications simplify the parallelization of the original equations and compute the same results. We also describe the strategy used in [12] to obtain a GPU algorithm to solve the modified equations. A naive implementation of the modified equations of [12], together with a cache efficient version and multicore versions, are given. We note that although [6] gives a multicore version of Nussinov's algorithm, our multicore version is much simpler and provides similar speedup. Then our GPU algorithm for the modified equations is described. Experimental and benchmark results are presented after that.

GPU architecture and programming model

Our work targets the NVIDIA C2050 GPU. Figure 1 shows the architecture of one streaming multiprocessor (SM) of the NVIDIA Fermi line of GPUs of which the C2050 is a member. The C2050 comprises 448 processor cores grouped into 14 SMs with 32 cores per SM. Each SM has 64KB of shared memory/L1 cache that may be set up as either 48KB of shared memory and 16KB of L1 cache or 16KB of shared memory and 48KB of L1 cache. In addition, each SM has 32K registers. The 14 SMs access a common 3GB of DRAM memory, called device or global memory, via a 768KB L2 cache. A C2050 is capable of performing up to 1.288 TFLOPS of single-precision operations and 515 GFLOPS of double-precision operations. A C2050 connects to the host processor via a PCI-Express bus. The master-slave programming model, in which one writes a program for the host or master computer and this program invokes kernels that execute on the GPU, is supported. GPUs use the SIMT (single instruction multiple thread) programming model in which the GPU accomplishes a computational task using thousands of lightweight threads. The threads are grouped into blocks and the blocks are organized as a grid. While a block of threads may be 1-, 2-, or 3-dimensional, the grid of blocks may only be 1- or 2-dimensional. The key challenge in deriving high performance on this machine is to be able to effectively

Figure 1: Architecture of one SM of the NVIDIA Fermi [14].


minimize the memory traffic between the SMs and the global memory of the GPU. This effectively requires the design of novel algorithmic and implementation approaches and is the main focus of this paper.

Nussinov's dynamic programming recurrence

Let S = a1 a2 ... an be an RNA sequence where ai ∈ {A, C, G, U} is a nucleotide. Nussinov's algorithm finds the most possible secondary structure by maximizing the number of bonded pairs (AU, GC, or GU). Let C(i, j) be the maximum number of bonded pairs in the subsequence ai ... aj, 1 ≤ i ≤ j ≤ n. Nussinov's dynamic programming recurrence for C is given below.

C(i, i-1) = 0, 2 ≤ i ≤ n
C(i, i)   = 0, 1 ≤ i ≤ n
C(i, j)   = max { C(i+1, j),
                  C(i, j-1),
                  C(i+1, j-1) + bond(ai, aj),
                  max_{i<k<j} { C(i, k) + C(k+1, j) } }

Here, bond(ai, aj) is 1 if (ai, aj) is an AU, GC, or GU pair and 0 otherwise, and the third equation applies when i < j. The third equation computes the maximum of four terms that have the following significance:

1. Add unpaired ai to the best structure for subsequence ai+1 ... aj, as shown in Figure 2(a).
2. Add unpaired aj to the best structure for subsequence ai ... aj-1, as shown in Figure 2(b).
3. Add the (ai, aj) pair to the best structure for subsequence ai+1 ... aj-1, as shown in Figure 2(c).
4. Combine the two best structures for ai ... ak and ak+1 ... aj, as shown in Figure 2(d).

Once the C(i, j), 1 ≤ i < j ≤ n, have been computed, a traceback procedure can be used to construct the actual secondary structure, which is represented in the matrix as a path leading to the maximum score. While this traceback procedure takes O(n) time, the actual computation of the C matrix takes O(n^3) time.

Simplified recurrence and GPU algorithm

Chang et al. [12] simplified Nussinov's recurrence to the following.
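The four-term recurrence above can be transcribed directly; the following minimal Python sketch (ours, not code from the paper) fills C by diagonals, exactly as a serial computation would:

```python
def bond(x, y):
    # AU, GC, and GU pairs score 1; everything else 0.
    return 1 if {x, y} in ({'A', 'U'}, {'G', 'C'}, {'G', 'U'}) else 0

def nussinov(a):
    n = len(a)
    C = [[0] * n for _ in range(n)]           # C(i,i) = C(i,i-1) = 0
    for length in range(1, n):                # diagonal order: j - i = length
        for i in range(n - length):
            j = i + length
            C[i][j] = max(C[i + 1][j],                          # ai unpaired
                          C[i][j - 1],                          # aj unpaired
                          C[i + 1][j - 1] + bond(a[i], a[j]),   # (ai, aj) paired
                          max((C[i][k] + C[k + 1][j]            # combine two
                               for k in range(i + 1, j)), default=0))
    return C
```

In the simplified recurrence that follows, the first two max terms are dropped; they are recovered as the k = i and k = j-1 cases of the combine term.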
C(i, i)   = 0  for 1 ≤ i ≤ n                                (1)
C(i, i+1) = 0  for 1 ≤ i ≤ n-1                              (2)
C(i, j)   = max { C(i+1, j-1) + bond(ai, aj),
                  max_{i≤k<j} { C(i, k) + C(j, k+1) } }     (3)

Like Nussinov's original recurrence, the simplified recurrence uses O(n^2) memory and O(n^3) time. However, Chang's formulation is easier to parallelize. In a serial computation of C, we would start by initializing C(i, i) (the main diagonal of C) and C(i, i+1) (the diagonal just above the main one) using Equations 1 and 2, and then use Equation 3 to compute the remaining diagonals in the order C(i, i+2) ... C(i, i+n-1). Figure 3(a) shows how the computation progresses.

In [12], the entire computation is divided into three stages as shown in Figure 3(b), namely the initial stage, the middle stage, and the final stage. In the initial stage, since the computation at each element is shallow (the distance to the central diagonal is short), one GPU thread is assigned to compute one element. No data exchange between threads is needed. All threads synchronize before moving to the next diagonal. In the middle stage, an entire block of threads is assigned to compute one element and the parallel reduction method contained in the CUDA SDK is used. In the final stage, all SMs collaborate to compute one element because the distance from the element to the central diagonal is long and the computation for each element is heavy. Again, parallel reduction is used in this stage. To reduce accesses to

Figure 2: Four cases in Nussinov's recurrence [15].


device memory, the GPU algorithm of [12] stores each C(i, j) value, i < j, in position (i, j) as well as in the otherwise unused position (j, i). When C(k+1, j) is needed on the right side of Equation 3 (the term written there as C(j, k+1)), the read is done from position (j, k+1) rather than (k+1, j). This changes column reads to row reads and better utilizes the L2 and L1 caches of the target GPU.

Methods

Cache efficient and multicore algorithms

CPU1 (Algorithm 1) is a naive single-core algorithm to compute C using the simplified recurrence of Chang et al. This algorithm computes M[i][j] = C(i+1, j+1), 0 ≤ i ≤ j < n, where n is the length of the RNA sequence R.

Algorithm 1 CPU1
Require: RNA sequence R
Ensure: Array M such that M[i][j] = C(i+1, j+1)
 1: for i ← 0 to |R|-1 do
 2:   M[i][i] ← 0
 3: end for
 4: for i ← 0 to |R|-2 do
 5:   M[i][i+1] ← 0
 6: end for
 7: for diag ← 2 to |R|-1 do
 8:   for row ← 0 to |R|-diag-1 do
 9:     col ← diag + row
10:     a ← R[row]; b ← R[col]
11:     max ← M[row+1][col-1] + bond(a, b)
12:     for k ← row to col-1 do
13:       t ← M[row][k] + M[k+1][col]
14:       max ← max{max, t}
15:     end for
16:     M[row][col] ← max
17:   end for
18: end for

By using the lower triangle of M to store the transpose of the computed values in the upper triangle of M, as is done in the GPU implementation of [12], we get a cache efficient version of CPU1. To arrive at CPU2, we change Line 13 of Algorithm 1 to t ← M[row][k] + M[col][k+1], and change Line 16 to M[row][col], M[col][row] ← max. Values of M[k+1][col] lie in the same column but values of M[col][k+1] lie in the same row, for row ≤ k < col.
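Algorithm 1 and the CPU2 change can be transcribed as follows. This is our own minimal Python sketch of both variants, not the authors' code; an interpreted language will not show the cache benefit, so the point here is only the changed access pattern and the mirrored store:

```python
def bond(x, y):
    # AU, GC, and GU pairs score 1; everything else 0.
    return 1 if {x, y} in ({'A', 'U'}, {'G', 'C'}, {'G', 'U'}) else 0

def fold(R, cache_efficient=False):
    n = len(R)
    M = [[0] * n for _ in range(n)]            # base diagonals are 0
    for diag in range(2, n):
        for row in range(n - diag):
            col = diag + row
            best = M[row + 1][col - 1] + bond(R[row], R[col])
            for k in range(row, col):
                if cache_efficient:            # CPU2: Line 13 reads a row of M
                    best = max(best, M[row][k] + M[col][k + 1])
                else:                          # CPU1: Line 13 reads a column of M
                    best = max(best, M[row][k] + M[k + 1][col])
            if cache_efficient:                # CPU2: Line 16 also stores the mirror
                M[col][row] = best
            M[row][col] = best
    return M
```

Both variants fill the same upper triangle; CPU2 additionally maintains the transposed copy in the lower triangle.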

as you move to the right and the second as you move down. So X (Figure 4) resides in (4,1), K in (4,4), and P in (3,2). Blocks on the main diagonal are indexed (i,i) and are triangular. From the dependencies in Equation 3, it follows that blocks that lie on the same diagonal of blocks (i.e., blocks with the index (i, i-k) for some fixed k) are independent and so may be computed in parallel, but elements within a block are to be computed by diagonals. Our 20×20 example of Figure 4 has 6 diagonals of blocks, and so six iterations of computation, with each iteration computing all blocks on the same diagonal in parallel, are required.

As noted earlier, the first diagonal of blocks is comprised of triangles. To each triangular block, we assign a thread block. The threads within the assigned thread block compute the elements in the triangular block in diagonal order, with all elements on the same diagonal being computable in parallel. Hence, for this computation, the number of thread blocks equals the number of triangular blocks.

Let us turn our attention now to the remaining blocks (i.e., the non-triangular blocks). Notice that when we start the computation of the elements in, say, block (4,1), where X resides, a through j and C through L have already been computed, because they are on preceding block diagonals. But k, l, A, and B have yet to be computed. The computation of the maximum of c+C through j+J can be done using a kernel maxKernel (described later). This kernel uses registers for temporary values and writes these temporary values to shared memory upon completion. The final value for O can be obtained by comparing the temporary maximum value in O with P plus the bond value in Equation 3. Then the maximum of r+O, q plus its bond value, and the temporary maximum value in m is written to m as its final value. Similarly, for M, the maximum of O+R, Q plus its bond value, and the temporary maximum value in M is written to M as its final value. The computations for m and M can be done in parallel. So the computation within element block (4,1) is done in diagonal order. All elements on the same diagonal can be computed in parallel with all data residing in shared memory. The pseudocode is shown as Algorithm 2.
Figure 4: Block partitioning of the C matrix.
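The wavefront order over blocks described above can be sketched as a schedule generator. This is our own sketch with our own indexing convention (block (r, c), c ≥ r, covers rows starting at r·BS and columns starting at c·BS; wave d holds the blocks whose offset c - r is d), independent of the paper's block indexing:

```python
def block_waves(nb):
    # For an nb x nb grid of blocks covering the upper triangle of C,
    # return nb waves; all blocks within a wave are independent and may
    # be computed in parallel. Wave 0 holds the triangular blocks.
    return [[(r, r + d) for r in range(nb - d)] for d in range(nb)]
```

A block (r, c) in wave d depends only on blocks (r, c') and (r', c) with smaller offsets, i.e., on earlier waves, which is why each wave can be launched as one parallel step.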


Algorithm 2 Our GPU algorithm
Require: RNA sequence R, blocked diagonal index D, block size BS
Ensure: C matrix
 1: Register[16] reg
 2: Shared_Memory[BS][BS] sA
 3: Shared_Memory[BS+1][BS+1] sC
 4: Global_Memory[BS][BS] gB
 5: Global_Memory[BS][BS] gC
 6: tx ← threadIdx.x; ty ← threadIdx.y
 7: bx ← blockIdx.x; by ← blockIdx.y
 8: aRow ← by·BS; aCol ← aRow - 1
 9: bRow ← aRow; bCol ← D·BS - 1 + aRow
10: for blk ← 1 to D-1 do
11:   sA ← the block starting at (aRow, aCol + blk·BS)
12:   gB ← the block starting at (bRow + blk·BS, bCol)
13:   maxKernel(sA, gB, reg)
14:   Syncthreads
15: end for
16: sC ← reg
17: for d ← 1 to BS-1 do
18:   for each element e on diagonal d do
19:     finish remaining computation
20:   end for
21:   Syncthreads
22: end for
23: gC ← sC

Algorithm 3 maxKernel
Require: Block sA in shared memory, block gB in global memory
Ensure: Partial result of reduction in reg
r0 ← gB[0][tx]; r1 ← gB[1][tx]
r2 ← gB[2][tx]; r3 ← gB[3][tx]
for j ← 0 to 6 do
  for k ← 0 to 15 do
    reg[k] ← max{reg[k], r0 + sA[ty·16+k][j·4]}
  end for
  r0 ← gB[(j+1)·4][tx]
  // 2 similar for loops for r1 and r2 come here
  for k ← 0 to 15 do
    reg[k] ← max{reg[k], r3 + sA[ty·16+k][j·4+3]}
  end for
  r3 ← gB[(j+1)·4+3][tx]
end for
for k ← 0 to 15 do
  reg[k] ← max{reg[k], r0 + sA[ty·16+k][28]}
end for
// 2 similar for loops for r1 and r2 come here
for k ← 0 to 15 do
  reg[k] ← max{reg[k], r3 + sA[ty·16+k][31]}
end for

Description of maxKernel

The computation of the maximum of c+C through j+J (Figure 4) bears some resemblance to the computation of a term in a matrix multiply. So, we can adapt the ideas used in matrix multiply kernels to arrive at an efficient kernel to find the desired maximum of the sum of pairs.
In our case (Algorithm 3), we adapt the GPU matrix multiply kernel of Volkov and Demmel [13]. The element block size used in our implementation is 32×32 and a thread block is configured as 32×2. Each thread computes 16 elements that lie in the same column, as shown in Figure 5 (this figure shows only six threads as arrows above block C). The 16 elements computed by one thread are represented as a slim gray bar in block C. The gray area in block A depicts the data needed by the first 32 threads. This data will be read into shared memory. To achieve high throughput from/to device memory, we use coalesced memory accesses in which all data accessed by one warp (this is the minimum scheduling unit and it contains 32 threads) falls in the same device memory cache line of size 128 bytes. In Figure 5, six threads fetch the first row from the gray area of block B. Then each thread uses the value just fetched to add with the first column in the gray area of block A (which is already read into shared memory). In other words, thread i will add B[0][i] with A[j][0] (0 ≤ j < 16), compare this value with register[j] of thread i, and update register[j] if necessary. Then B[1][i] is added with A[j][1] and the result is compared with register[j]; the register is updated as needed, 0 ≤ j < 16. Since threads in the same warp read data in the same row of block B, this reading is coalesced and serviced at the throughput of L1 or L2 cache in case of a cache hit, or at the throughput of device memory otherwise. Besides, all threads in the same warp use the same value from block A, which resides in shared memory. This value can be broadcast to all threads in the same warp.
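Stripped of the GPU specifics (registers, shared memory, coalescing), the operation maxKernel performs on a pair of blocks is a (max,+) analogue of a matrix product; a minimal sketch of just that operation, in our own notation:

```python
def max_plus(A, B):
    # (max,+) "matrix product" over two square blocks:
    # out[i][j] = max over k of (A[i][k] + B[k][j]).
    BS = len(A)
    return [[max(A[i][k] + B[k][j] for k in range(BS)) for j in range(BS)]
            for i in range(BS)]
```

In Algorithm 3, each thread produces 16 entries of such an output block, holding values of gB in registers (r0 through r3) so each fetched value is reused across all 16 entries.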
We note that [11] also employs a maxKernel, but their kernel is different from ours.

Results

We benchmarked our algorithms using a PC with a hyperthreaded 6-core Intel i7 x980 3.33GHz CPU and 12GB DDR3 RAM. The PC also had an NVIDIA Tesla C2050 GPU. Since only two threads may be scheduled per i7 core at any time, a maximum of 12 threads may be gainfully used. We used randomly generated RNA sequences in our experiments. Since the run time of our codes is relatively insensitive to the actual RNA sequence in use, due to the fact that the entire computation is to fill out an n × n matrix, our use of random sequences does not materially impact our conclusions.

Single and multicore algorithms

In both the codes OMP1 and OMP2, the work assigned to the threads is well balanced by OpenMP and so best


performance is expected using either 6 or 12 threads. Our experiments confirmed this expectation, with the use of 6 threads generally being better than the use of 12 threads. So, for our application, the overhead of context switching between the two threads assigned to a core when a total of 12 threads are used generally exceeded the gains obtainable from having a second thread ready in case the first thread stalls from memory delay. Table 1 gives the run times for our algorithms CPU1, CPU2, OMP1, and OMP2 for n values ranging from 3000 to 16000. The columns labeled Ratio give the ratios CPU1/OMP1 and CPU2/OMP2, respectively. Although we have 6 cores on our CPU, we are able to achieve speedups of almost 5 from the multicore versions. By comparison, the far more complex multicore code of [6], which uses a blocking strategy similar to that used by our GPU code, achieves a simulated speedup of 6.3 with 4 threads. The speedup reported in [6] is referred to as "simulated speedup" because it comes from the use of a multicore simulator rather than from actual speedup measurements on a real multicore computer. However, this simulated speedup ignores several factors such as synchronization overhead that will reduce speedup in a real environment. Further, the simulated speedup of 6.3 is relative to the equivalent of the code CPU1. The speedup achieved by OMP2 relative to CPU1 is between 7.5 and 14.0! We note also that the speedup obtained solely from the use of the caching strategy (i.e., the ratio CPU1/CPU2) ranges from 1.6 to 3.0.

GPU algorithms

We experimented with three versions of our GPU algorithm. The first is called Ours1, which is as described in

Figure 5: maxKernel illustration.
Table 1: Running time (seconds) of different CPU algorithms

n         CPU1     OMP1   Ratio     CPU2    OMP2   Ratio
3000      35.9      7.1     5.1     22.3     4.8     4.6
4000      98.1     18.6     5.3     52.8    11.3     4.7
5000     208.1     41.6     5.0    102.9    22.2     4.6
6000     363.7     72.2     5.0    177.5    45.3     3.9
7000     646.1    125.2     5.2    281.3    61.0     4.6
8000     924.4    197.8     4.7    419.6    92.5     4.5
9000    1461.5    291.0     5.0    596.6   129.9     4.6
10000   1927.7    395.0     4.9    819.1   176.9     4.6
11000   2800.8    559.2     5.0   1088.4   234.5     4.6
12000   3525.2    741.4     4.8   1413.6   303.3     4.7
13000   4852.3    978.8     5.0   1795.4   388.4     4.6
14000   6008.9   1250.2     4.8   2243.5   485.2     4.6
15000   7930.0   1641.4     4.8   2757.3   594.0     4.6
16000  10120.0   2380.8     4.3   3343.5   725.4     4.6


Our GPU algorithm section. In the second version, which is called Ours2, device memory usage is reduced by half by storing only the upper triangle of the output matrix. This upper triangle is mapped into a one-dimensional array using the standard row-major mapping. Since this version uses only half the device memory used by the other versions, it may be used on larger instances. In the third version, which is called OursR, we replaced our maxKernel with the kernel described in [11]. Since we were unable to get the GPU code of [11], the kernel used by us was actually one we wrote based on the description provided in [11]. These three codes were benchmarked against each other as well as against the GPU Nussinov code of [12]. The maximum size of sequence Ours2 can handle is 37000, while the other versions can handle sequences of size up to 26000. Ours2 runs slightly slower than Ours1, as shown in Table 2. So, Ours2 is recommended only when the instance size is large enough to make Ours1 infeasible. Table 2 and Figure 6 show the running times for the four different GPU codes. Ratio1 in Table 2 shows the speedup of Ours1 relative to [12] ([12]/Ours1). Ratio2 shows OursR/Ours1. As can be seen, Ours1 is up to 1.9 times as fast as OursR, indicating that a corresponding speedup could be obtained for Zuker's algorithm by replacing the maximum finding kernel used in [11] with our kernel for this operation. Also, Ours1 is between 3.0 and 11.1 times as fast as the GPU algorithm of [12].

Single core vs multicore vs GPU

Table 3 gives the speedup obtained by Ours1 relative to CPU2 and OMP2. Using a GPU, we can do the Nussinov computations up to 522.6 times as fast as using a cache efficient single core code and up to 113.4 times as fast as using a 6-core cache efficient code! Compared to the naive single-core code CPU1, our GPU code provides a speedup of up to 1582!

Conclusions

We have developed simple and efficient single and multicore algorithms as well as an efficient GPU code for RNA folding based on Nussinov's equations [2]. Our cache efficient single-core algorithm provides a speedup between 1.6 and 3.0 relative to a naive straightforward single core code. The multicore version of the cache efficient single core algorithm provides a speedup, relative to the naive
single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our GPU algorithm, Ours1, for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 3.0 and 11.1 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. With the available 3GB device

Table 2: Running time (seconds) of different GPU algorithms

n       Ours1   Ours2    [12]   OursR   Ratio1   Ratio2
2000      0.1     0.1     0.3     0.1     3.0      1.0
6000      0.6     0.7     4.0     0.8     6.7      1.3
10000     1.9     2.2    16.4     3.2     8.6      1.7
14000     4.5     5.1    43.0     7.9     9.6      1.8
18000     8.8     9.9    89.5    16.0    10.2      1.8
22000    15.1    16.9   161.7    28.2    10.7      1.9
26000    23.9    26.7   266.3    45.8    11.1      1.9
37000      -     71.5      -       -       -        -

Figure 6: Plot of running time of GPU algorithms.


memory on an NVIDIA GPU, Ours1 is able to handle sequences of length up to 26000. Sequences of length between 26000 and 37000 may be handled using Ours2, which uses a one-dimensional array mapping of the upper triangle of the output matrix rather than a two-dimensional array that represents the full output matrix. Ours2, however, runs slightly slower than Ours1. Our methods can be used to speed up RNA folding using Zuker's equations as well [3,11].

List of abbreviations used

RNA: RiboNucleic Acid; GPU: Graphics Processing Unit; PCI-Express: Peripheral Component Interconnect Express; CUDA: Compute Unified Device Architecture; GCUPS: Billion Cell Updates per Second; SM: Streaming Multiprocessor; DRAM: Dynamic Random-Access Memory; TFLOPS: Trillion Floating Point Operations Per Second; GFLOPS: Billion Floating Point Operations Per Second; I/O: Input/Output; CPU: Central Processing Unit.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JL, SR, and SS developed the GPU algorithms, analyzed the experimental results, and wrote the manuscript. JL also programmed and debugged the GPU algorithms and ran the experiments.

Declarations

Publication of this supplement was funded, in part, by the National Science Foundation under grants CNS-0963812, CNS-1115184, CNS-0905308, and the National Institutes of Health under grant R01-LM010101.

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 8, 2014: Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S8.

Published: 14 July 2014

References

1. Waterman MS, Smith TF: RNA secondary structure: A complete mathematical analysis. Math Biosci 1978, 42:257-266.
2. Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ: Algorithms for Loop Matchings. SIAM Journal on Applied Mathematics 1978, 35(1):68-82.
3. Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research 1981, 9(1):133-148.
4. Mathuriya A, Bader DA, Heitsch CE, Harvey SC: GTfold: a scalable multicore code for RNA secondary structure prediction. Proceedings of the 2009 ACM Symposium on Applied Computing (SAC '09), pp 981-988. ACM, New York, NY, USA; 2009.
5. Swenson MS, Anderson J, Ash A, Gaurav P, Sukosd Z, Bader DA, Harvey SC, Heitsch CE: GTfold: Enabling parallel RNA secondary structure prediction on multi-core desktops. BMC Res Notes 2012, 5(1):341.
6. Tan G, Sun N, Gao GR: A parallel dynamic programming algorithm on a multi-core architecture. Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '07), pp 135-144. ACM, New York, NY, USA; 2007.
7. Estrada T, Licon A, Taufer M: CompPknots: a framework for parallel prediction and comparison of RNA secondary structures with pseudoknots. Proceedings of the 2006 International Conference on Frontiers of High Performance Computing and Networking (ISPA '06), pp 677-686. Springer, Berlin, Heidelberg; 2006.
8. Xia F, Dou Y, Zhou X, Yang X, Xu J, Zhang Y: Fine-grained parallel RNAalifold algorithm for RNA secondary structure prediction on FPGA. BMC Bioinformatics 2009, 10:1-14.
9. Jacob A, Buhler J, Chamberlain RD: Accelerating Nussinov RNA secondary structure prediction with systolic arrays on FPGAs. Application-Specific Systems, Architectures and Processors, 2008 (ASAP 2008) International Conference On, pp 191-196; 2008.
10. Dou Y, Xia F, Jiang J: Fine-grained parallel application specific computing for RNA secondary structure prediction using SCFGs on FPGA. Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '09), pp 107-116. ACM, New York, NY, USA; 2009.
11. Lavenier D, Rizk G, Rajopadhye S, et al: GPU accelerated RNA folding algorithm. GPU Computing Gems 2011.
12. Chang DJ, Kimmer C, Ouyang M: Accelerating the Nussinov RNA folding algorithm with CUDA/GPU. Signal Processing and Information Technology (ISSPIT), 2010 IEEE International Symposium On, pp 120-125; 2010.
13. Volkov V, Demmel JW: Benchmarking GPUs to tune dense linear algebra. Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08), pp 31:1-31:11. IEEE Press, Piscataway, NJ, USA; 2008.
14.
NVIDIA: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. NVIDIA, v1.1; 2009. [http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf].
15. Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis. 2006, 92-96, Chap. 4.4.

doi:10.1186/1471-2105-15-S8-S1
Cite this article as: Li et al.: Multicore and GPU algorithms for Nussinov RNA folding. BMC Bioinformatics 2014, 15(Suppl 8):S1.

Table 3: Speedup of Ours1 relative to other versions

n       CPU2    OMP2      n       CPU2    OMP2
3000    157.0    33.8     10000   424.4    91.7
4000    224.7    48.1     11000   441.4    95.1
5000    259.8    56.1     12000   465.9   100.0
6000    302.9    77.3     13000   472.3   102.2
7000    341.8    74.1     14000   496.9   107.5
8000    376.0    82.9     15000   503.3   108.4
9000    392.2    85.4     16000   522.6   113.4