Citation
Libgapmis: extending short-read alignments

Material Information

Title:
Libgapmis: extending short-read alignments
Creator:
Alachiotis, Nikolaos
Berger, Simon
Flouri, Tomas
Pissis, Solon P.
Stamatakis, Alexandros
Publisher:
BioMed Central (BMC Bioinformatics)
Publication Date:
2013
Language:
English

Notes

Abstract:
Background: A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly or not allowing them at all. The seed-and-extend strategy is applied in most short-read alignment programmes. After aligning a substring of the reference sequence against the high-quality prefix of a short read (the seed), an important problem is to find the best possible alignment between a succeeding substring of the reference sequence and the remaining low-quality suffix of the read (extend). The fact that the reads are rather short and that the gap occurrence frequency observed in various studies is rather low suggests that aligning (parts of) those reads with a single gap is in fact desirable.
Results: In this article, we present libgapmis, a library for extending pairwise short-read alignments. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on an algorithm computing a modified version of the traditional dynamic-programming matrix for sequence alignment. Extensive experimental results demonstrate that the functions of the CPU version provided in this library accelerate the computations by a factor of 20 compared to other programmes. The analogous SSE- and GPU-based implementations accelerate the computations by a factor of 6 and 11, respectively, compared to the CPU version. The library also gives the user the flexibility to split the read into fragments, based on the observed gap occurrence frequency and the length of the read, thereby allowing for a variable, but bounded, number of gaps in the alignment.
Conclusions: We present libgapmis, a library for extending pairwise short-read alignments. We show that libgapmis is better-suited and more efficient than existing algorithms for this task. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any short-read alignment pipeline. The open-source code of libgapmis is available at http://www.exelixis-lab.org/gapmis.
General Note:
Alachiotis et al. BMC Bioinformatics 2013, 14(Suppl 11):S4 http://www.biomedcentral.com/1471-2105/14/S11/S4; Pages 1-14
General Note:
doi:10.1186/1471-2105-14-S11-S4 Cite this article as: Alachiotis et al.: libgapmis: extending short-read alignments. BMC Bioinformatics 2013 14(Suppl 11):S4.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All rights reserved by the source institution.

UFDC Membership

Aggregations:
University of Florida Institutional Repository

Full Text

ALGORITHM GapMis(t, n, x, m, β)
    {Initialise matrices G and H}
 1: for i ← 0 to n do
 2:     G[i, 0] ← 0;
 3:     H[i, 0] ← i;
 4: for j ← 0 to m do
 5:     G[0, j] ← 0;
 6:     H[0, j] ← j;
    {Computing matrices G and H}
 7: for i ← 1 to min{n, m + β} do
 8:     for j ← max{1, i − β} to min{m, i + β} do
 9:         if i < j then
            [...]
17:         if i > j then
18:             u ← G[i − 1, j − 1] + δ_H(t[i], x[j]);
19:             v ← G[j, j];
20:             G[i, j] ← min{u, v};
21:             if v < [...]
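The surviving parts of this pseudocode, together with the recurrence for G and the definition of H given in the article text, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the library's C code: the function name `gapmis_matrices`, the use of `None` for cells outside the band, and the unit mismatch cost standing in for δ_H are choices made for the example. H is filled from its textual definition, because the corresponding pseudocode lines did not survive.

```python
def gapmis_matrices(t: str, x: str, beta: int):
    """Fill the banded matrices G and H for text t and pattern x.

    Cells outside the band of width 2*beta + 1 are left as None.
    Indices are 1-based as in the article, so t[i] is t[i - 1] here.
    """
    n, m = len(t), len(x)
    G = [[None] * (m + 1) for _ in range(n + 1)]
    H = [[None] * (m + 1) for _ in range(n + 1)]

    # Initialisation (pseudocode lines 1-6).
    for i in range(n + 1):
        G[i][0], H[i][0] = 0, i
    for j in range(m + 1):
        G[0][j], H[0][j] = 0, j

    # Banded fill (pseudocode lines 7-8 give the loop bounds).
    for i in range(1, min(n, m + beta) + 1):
        for j in range(max(1, i - beta), min(m, i + beta) + 1):
            # Diagonal step: unit cost for a mismatch (assumed delta_H).
            u = G[i - 1][j - 1] + (t[i - 1] != x[j - 1])
            if i == j:
                G[i][j], H[i][j] = u, 0
            else:
                d = min(i, j)          # main-diagonal cell G[d][d]
                v = G[d][d]            # cost if the gap is opened here
                G[i][j] = min(u, v)
                # H stores the gap length when the gap is opened here.
                H[i][j] = abs(i - j) if G[i][j] == v else 0
    return G, H
```

For t = AGGTCAT and x = GGGTA this yields G[6][5] = 1, H[6][5] = 0 and H[5][4] = 1: one mismatch and a gap of length 1 in the pattern, which agrees with Example 4 of the article.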


RESEARCH  Open Access

libgapmis: extending short-read alignments

Nikolaos Alachiotis1, Simon Berger1, Tomáš Flouri1, Solon P Pissis1,2*, Alexandros Stamatakis1

From The Second Workshop on Data Mining of Next-Generation Sequencing in conjunction with the 2012 IEEE International Conference on Bioinformatics and Biomedicine, Philadelphia, PA, USA. 4-7 October 2012

*Correspondence: solon.pissis@h-its.org
1 Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
Full list of author information is available at the end of the article

Alachiotis et al. BMC Bioinformatics 2013, 14(Suppl 11):S4
http://www.biomedcentral.com/1471-2105/14/S11/S4

© 2013 Alachiotis et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background: A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly or not allowing them at all. The seed-and-extend strategy is applied in most short-read alignment programmes. After aligning a substring of the reference sequence against the high-quality prefix of a short read (the seed), an important problem is to find the best possible alignment between a succeeding substring of the reference sequence and the remaining low-quality suffix of the read (extend). The fact that the reads are rather short and that the gap occurrence frequency observed in various studies is rather low suggests that aligning (parts of) those reads with a single gap is in fact desirable.

Results: In this article, we present libgapmis, a library for extending pairwise short-read alignments. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on an algorithm computing a modified version of the traditional dynamic-programming matrix for sequence alignment. Extensive experimental results demonstrate that the functions of the CPU version provided in this library accelerate the computations by a factor of 20 compared to other programmes. The analogous SSE- and GPU-based implementations accelerate the computations by a factor of 6 and 11, respectively, compared to the CPU version. The library also gives the user the flexibility to split the read into fragments, based on the observed gap occurrence frequency and the length of the read, thereby allowing for a variable, but bounded, number of gaps in the alignment.

Conclusions: We present libgapmis, a library for extending pairwise short-read alignments. We show that libgapmis is better-suited and more efficient than existing algorithms for this task. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any short-read alignment pipeline. The open-source code of libgapmis is available at http://www.exelixis-lab.org/gapmis.

Background

The problem of finding substrings of a text similar to a given pattern has been intensively studied over the past decades, and it is a central problem in a wide range of applications, including signal processing [1], information retrieval [2], searching for similarities among biological sequences [3], file comparison [4], spelling correction [5], and music analysis [6]. Some examples are recovering the original signals after their transmission over noisy channels, finding DNA subsequences after possible mutations, and text searching where there are typing or spelling errors.

Approximate string matching, in general, consists in locating all the occurrences of substrings inside a text t that are similar to a pattern x. It consists of producing the positions of the substrings of t that are at distance at most k from x, for a given natural number k. For the rest of this article, we assume that k < |x| ≤ |t|. We focus on online searching: the text cannot be preprocessed to
build an index on it. There exist four main approaches to online approximate string matching: algorithms based on dynamic programming; algorithms based on automata; algorithms based on word-level parallelism; and algorithms based on filtering. We focus on algorithms based on dynamic programming. There mainly exist two different distances for measuring the approximation: the edit distance and the Hamming distance.

The edit distance between two strings, not necessarily of the same length, is the minimum cost of a sequence of elementary edit operations between these two strings. A restricted notion of this distance is obtained by considering the minimum number of edit operations rather than the sum of their costs. The Hamming distance between two strings of the same length is the number of positions where mismatches occur between the two strings.

Alignments are a commonly used technique to compare strings and are based on notions of distance [1] or of similarity between strings; for example, similarities among biological sequences [3]. Alignments are often computed by dynamic programming [2].

A gap is a sequence of consecutive insertions or deletions (indels) of letters in an alignment. The extensive use of alignments on biological sequences has shown that it can be desirable to penalise the formation of long gaps rather than penalising individual insertions or deletions of letters.

The notion of gap in a biological sequence can be described as the absence (respectively, presence) of a fragment, which is (respectively, is not) present in another sequence [7]. Gaps occur naturally in biological sequences as part of the diversity between individuals. In many biological applications, a single mutational event can cause the insertion (or deletion) of a large DNA fragment, so the notion of gap in an alignment is an important one. Moreover, the creation of gaps can occur in a wide, but bounded, range of sizes with almost equal likelihood.
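The two distances defined earlier in this section can be made concrete with a short sketch. It is only an illustration: the function names are chosen for this example, and the edit distance shown is the restricted, unit-cost variant (it counts operations rather than summing arbitrary costs).

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance: minimum number of insertions,
    deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))        # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                          # deleting the first i letters of a
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete ca
                           cur[j - 1] + 1,           # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]
```

Unlike the Hamming distance, the edit distance is defined for strings of different lengths, which is why gapped alignments are built on edit-style recurrences.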
A number of natural processes can cause gaps in DNA sequences: long pieces of DNA can be copied and inserted by a single mutational event; slippage during the replication of DNA may cause the same area to be repeated multiple times as the replication machinery loses its place on the template; an insertion in one sequence paired with a reciprocal deletion in another may be caused by unequal cross-over in meiosis; insertion of transposable elements (jumping genes) into a DNA sequence; insertion of DNA by retroviruses; and translocations of DNA between chromosomes [8]. The accurate identification of gaps is shown to be fundamental in various studies on disorders; for example, on Hajdu-Cheney syndrome [9], a disorder of severe and progressive bone loss.

The focus of this work is directly motivated by the well-known and challenging application of re-sequencing: the assembly of a genome directed by a reference sequence. New developments in sequencing technologies (see [10-12], for example) allow whole-genome sequencing to be turned into a routine procedure, creating sequencing data in massive amounts. Short sequences (reads) are produced in huge amounts (tens of gigabytes), and in order to determine the part of the genome from which a read was derived, it must be mapped (aligned) back to some reference sequence, a few gigabases long.

A wide variety of short-read alignment programmes (e.g. Bowtie [13], SOAP2 [14], REAL [15], BWA [16], Bowtie2 [17]) were published in the past five years to address the challenge of efficiently mapping tens of millions of short reads to a genome, focusing on different aspects of the procedure: speed, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly and others not allowing them at all.
Most short-read alignment programmes apply the well-known scheme of seed-and-extend [18]. After aligning a substring of the reference sequence against the seed (short high-quality prefix of the read; positions 1-3 in square brackets in Figure 1) very fast, a short-read alignment programme must compute the best possible alignment between a succeeding substring of the reference sequence and the remaining suffix of the read (low-quality suffix of the read; positions 4-9). This is achieved by allowing a bounded number of mismatches (position 8) and gaps (positions 5-6).

From Figure 1, we observe that a gap might need to be inserted in the leftmost position of the alignment (position 4). However, we are not able to know the length of the substring of the reference sequence to be aligned beforehand. Due to this observation, it is clear we need an intermediate between the global (Needleman-Wunsch algorithm [19], for example) and the local alignment (Smith-Waterman algorithm [20], for example), known as semi-global alignment, that allows the insertion of a gap at the end of an alignment with no penalty (positions 10-12).

Example 1 ([21]) Let t = CGTCCGAAGT and x = TACGAA. Figures 2a, 2b, and 2c illustrate the global, the local, and the semi-global alignment, respectively.

Although gaps may occur in a range of lengths, the short length of reads means large gaps cannot be confidently detected directly. In Figure 3, the distribution of lengths of gaps in Homo sapiens exome sequencing is demonstrated. The illustrated distribution agrees with the distribution in other studies on gaps (cf. [9,22,23]). Figure 3 represents a gap occurrence frequency of
approximately 5.7 × 10⁻⁶ across the exome. This frequency is analogous to the ones observed in other studies on exome sequencing, plant genomes, and viruses (cf. [9,23,24]).

Moreover, Figure 3 shows an exponential decrease in the occurrence of gaps as the length increases and a preference for lengths which are multiples of 3. The presence of many gaps in short reads in the order of 25-150 base pairs (bp) is rather unlikely due to the low gap occurrence frequency. Hence, applying a traditional dynamic-programming approach, which essentially cannot bound the number of deletions and insertions in the alignment, would greatly affect the mapping confidence.

Figure 1 Seed-and-extend strategy. The alignment between the fragment of the reference sequence, starting at position 1 and ending at position 9, and the read with one mismatch at position 8 and a gap of length two inserted in the read after position 4. This figure was taken from [21].

Figure 2 Global, local, and semi-global alignment. The global, local, and semi-global alignments between t = CGTCCGAAGTG and x = TACGAA. This figure was taken from [21].

Motivated by the aforementioned observations, in [7], the authors presented GapMis, a tool for pairwise global and semi-global sequence alignment with a single gap. In this article, we present libgapmis, the analogous library implementation. libgapmis also includes two
highly optimised versions: one based on Streaming SIMD Extensions (SSE); and one based on Graphics Processing Units (GPU). Proof of concept versions of GapMis and libgapmis were presented in [25] and [21], respectively. Millions of pairwise sequence alignments, performed here under realistic conditions based on the properties of real full-length genomes, demonstrate that libgapmis can increase the accuracy of extending short-read alignments end-to-end compared to more traditional approaches. The importance of our contribution is underlined by the fact that the provided open-source library functions can directly be integrated into any short-read alignment programme.

Definitions and notation

In this section, we give a few definitions, generally following [26] and [7].

An alphabet Σ is a finite non-empty set whose elements are called letters. A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ. The zero-letter sequence is called the empty string, and is denoted by ε. The set of all the strings on the alphabet Σ is denoted by Σ*. The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|. We denote by x[i], for all 1 ≤ i ≤ |x|, the letter at index i of x. Each index i, for all 1 ≤ i ≤ |x|, is a position in x when x ≠ ε. It follows that the i-th letter of x is the letter at position i in x, and that x = x[1..|x|].

A string x is a substring of a string y if there exist two strings u and v, such that y = uxv. Let x, y, u, and v be strings, such that y = uxv holds. If u = ε, then x is a prefix of y. If v = ε, then x is a suffix of y.

Figure 3 Distribution of gap lengths in exome sequencing. The data were generated by the Exome Sequencing Programme at the NIHR Biomedical Research Centre at Guy's and St Thomas' NHS Foundation Trust in partnership with King's College London. This figure was taken from [21].

Let x be a non-empty string and y be a string. We say that there exists an occurrence of x in y, or, more simply, that x occurs in y, when x is a substring of y. Every occurrence of x can be characterised by a position in y. Thus we say that x occurs at the starting position i in y when y[i..i + |x| − 1] = x. It is sometimes more suitable to consider the ending position i + |x| − 1. The Hamming distance δ_H for two strings of the same length, is defined
as the number of positions where the two strings have different letters. A don't care letter is a special letter that does not belong to alphabet Σ, and matches with itself as well as with any letter of Σ. It is denoted by ∗. A gap is a finite sequence of such don't care letters. A gap string is a finite, possibly empty, sequence of elements of the alphabet Σ ∪ {∗}. Two letters a and b of alphabet Σ ∪ {∗} are said to correspond (denoted by a ≈ b) if they are equal, or, if at least one of them is the don't care letter. The G-distance, denoted by δ_G, for two gap strings of the same length is defined as the number of positions in which the two strings possess letters that do not correspond. A gap string x is called single-gap string if there exist two strings u and v on alphabet Σ and a gap g, such that x = ugv. Let conc(y) be an operation that, given a gap string y = y_0 g_0 y_1 g_1 ... y_{n−2} g_{n−2} y_{n−1} where y_i ∈ Σ*, for all 0 ≤ i [...]

\[
G[i,j] =
\begin{cases}
\min\{G[i-1,j-1] + \delta_H(t[i],x[j]),\ G[j,j]\} & i > j \\
\min\{G[i-1,j-1] + \delta_H(t[i],x[j]),\ G[i,i]\} & i < j \\
G[i-1,j-1] + \delta_H(t[i],x[j]) & i = j.
\end{cases}
\]

In order to compute the exact location of the inserted gap, either in the text or in the pattern, we also need to define a matrix H[0..n, 0..m] [7], such that

\[
H[i,j] =
\begin{cases}
j - i & \text{s.t. } G[i,j] = G[i,i] \text{ and } i < j \\
i - j & \text{s.t. } G[i,j] = G[j,j] \text{ and } i > j \\
0 & \text{otherwise.}
\end{cases}
\]

Example 3 ([21]) Let t = AGGTCAT, x = GGGTA, and β = 2. Figure 4a and 4b illustrate matrix G and matrix H, respectively.

Algorithm GapMis

Given the text t of length n, the pattern x of length m, and the threshold β as input, algorithm GAPMIS, first introduced in [7] (see Additional File 1), computes matrices G and H. In fact, we only need to compute a diagonal stripe (a narrow band) of width 2β + 1 in matrix G and in matrix H. As a result, algorithm GAPMIS computes a pruned version of matrices G and H, denoted by GP and HP, respectively (see Figure 4c and 4d).

Proposition 1 ([7]) There exist at most 2β + 1 cells of matrix G that solve Problem 1.

Proposition 2 ([7]) Problem 1 can be solved by algorithm GAPMIS in time O(mβ).

Example 4 ([21]) Let t = AGGTCAT, x = GGGTA, k = 1, α = 1, and β = 1. Starting the trace-back from cell H[6,5] (see Figure 4d) yields a solution since G[6,5] ≤ 1 (see Figure 4c). Trivially, the inserted gap is in the pattern, and its length is 1. Finally, we can find the position of the inserted gap (position 5) using matrix H. Hence, a solution to this problem instance is the ending position 6 (see Figure 5) since there exists a single-gap string x′ = GGGT∗A with a gap g = ∗ such that x = conc(GGGT∗A), δ_G(x′, t[1..6]) = 1, and |g| = 1.

Alternatively, we could compute matrix G and matrix H based on a simple alignment score scheme depending on the application of the algorithm (see the following section or [27], for example), and compute the maximum score in time Θ(β) by Proposition 1.

Library libgapmis

In this section, we give a brief description of the library implementation. libgapmis was implemented in the C programming language. First, we start by describing the standard CPU version of the library. Thereafter, we discuss some technical issues regarding the SSE- and GPU-based implementations. Finally, we describe how the provided functions are extended to accommodate a variable, but bounded, number of gaps in the alignment.

Algorithm GAPMIS was implemented as a function computing matrices G and H based on a simple alignment score scheme. The scheme uses the scoring matrix EDNAFULL [28] (resp. EBLOSUM62 [29]) for DNA
(resp. protein) sequences to assign scores for every possible nucleotide (resp. residue) match or mismatch. Moreover, it uses affine gap penalty to score the insertion of a gap of n > 0 positions as follows:

\[
\text{gap opening penalty} + (n - 1) \times \text{gap extension penalty}
\]

Figure 4 Dynamic-programming matrices. The matrices G, H, GP, and HP for t = AGGTCAT, x = GGGTA, and β = 2. This figure was taken from [21].

Finally, the total score for each alignment is obtained by adding these two scores: scoring matrix and affine
gap penalty scores. The optimal alignment is the alignment with the highest such total score. The same alignment score scheme is applied in package EMBOSS [30].

We implemented the following functions:

• gapmis_one_to_one: this function finds the optimal semi-global alignment between two sequences. It first implements algorithm GAPMIS in time O(mβ); thereafter, it finds the optimal semi-global alignment in time O(β). Finally, gapmis_one_to_one finds the position of the single gap via backtracking in matrix H in time O(m). The user can omit computing the position of the single gap and thereby computing matrix H.
• gapmis_one_to_many: this function uses function gapmis_one_to_one as building block. It computes the optimal semi-global alignments between a query sequence and a set of target sequences.
• gapmis_many_to_many: this function uses function gapmis_one_to_many as building block. It computes the optimal semi-global alignments between a set of query sequences and a set of target sequences.

Finally, we implemented functions results_one_to_one, results_one_to_many, and results_many_to_many for generating the visualisation of the analogous output in a format similar to the one generated by EMBOSS.

SSE-based implementation

The SSE-based implementation is a direct application of the inter-sequence vectorisation scheme. It has been used to accelerate the Smith-Waterman algorithm and analogous dynamic-programming algorithms [31,32]. Algorithm GAPMIS, under this vectorisation scheme, uses SSE instructions to simultaneously compute multiple separate matrices (usually 2, 4, or 8 depending on the vector width and the data type used) corresponding to alignments of one query sequence against multiple other target sequences.
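As a rough illustration of the inter-sequence scheme described above, the sketch below groups targets into batches of w lanes and advances all lanes of a batch one query position at a time. It assumes a 128-bit SSE register, and the per-lane update is a plain mismatch count standing in for the dynamic-programming cell update. The function names are hypothetical and are not part of libgapmis.

```python
from typing import Iterator, List


def lanes_per_register(register_bits: int = 128, element_bits: int = 32) -> int:
    """How many scores fit in one vector register (assumed 128-bit SSE)."""
    return register_bits // element_bits


def batches(targets: List[str], w: int) -> Iterator[List[str]]:
    """Group targets into batches of at most w, one target per lane."""
    for start in range(0, len(targets), w):
        yield targets[start:start + w]


def mismatch_scores_lockstep(query: str, batch: List[str]) -> List[int]:
    """Score one query against every target of a batch in lockstep.

    Each loop iteration updates all lanes for the same query position,
    which is the access pattern a SIMD instruction would execute at once.
    Every target is assumed to be at least as long as the query.
    """
    acc = [0] * len(batch)
    for pos, q in enumerate(query):
        acc = [a + (target[pos] != q) for a, target in zip(acc, batch)]
    return acc
```

With 64-, 32- and 16-bit elements, `lanes_per_register` gives 2, 4 and 8 lanes, matching the counts quoted in the text.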
Currently, the vectorisation uses 32-bit floating-point arithmetic to represent scores, implying that, on CPUs with SSE3 vector units, a vector width w := 4 is used. By restricting scores to integer values and using 16-bit integers, we may increase the vector width to w := 8. For performance-related reasons, the SSE-based version only supports the computation of alignment scores, and, therefore, does not support backtracking. The functions provided are gapmis_one_to_many_opt_sse and gapmis_many_to_many_opt_sse, which make use of the aforementioned vectorisation scheme to compute the scores for each pair of sequences. Finally, we make use of the purely sequential function gapmis_one_to_one to find the position of the single gap via backtracking in matrix H. In order to further accelerate the computations, the user may optionally and transparently execute these functions on multi-core architectures by setting the number of threads. More technical details of the SSE-based implementation can be found in [21].

GPU-based implementation

The function gapmis_one_to_one was ported to GPUs using OpenCL in order to maintain a vendor-independent GPU version. In analogy to the SSE-based implementation, only the computation of alignment scores is offloaded to the GPU. The GPU implementation is also similar to the SSE-based implementation in the sense that multiple dynamic-programming matrices are computed simultaneously.

Figure 5 Single-gap alignment. The single-gap alignment between t = AGGTCAT and x = GGGTA for k = 1, α = 1, and β = 1. This figure was taken from [21].

Aligning a set of query sequences x = {x_1, ..., x_k} against a set of target sequences t = {t_1, ..., t_ℓ} is achieved by launching a total of k × ℓ threads on the GPU to exploit inter-sequence parallelism similar to the aforementioned SSE vectorisation scheme. GPU threads are grouped such that every thread group aligns one query sequence against all target sequences. Each thread in a thread group computes a different dynamic-programming matrix sequentially and independently of all
other threads. Due to the independence between the individual alignment tasks, we refer to this scheme as inter-task parallelisation. In order to prevent memory access conflicts and also maximise memory throughput, an inter-sequence device memory organisation scheme is applied (see Figure 6 in this regard).

Similar to the SSE-based version, the functions provided are gapmis_one_to_many_opt_gpu and gapmis_many_to_many_opt_gpu. Finally, we make use of the purely sequential function gapmis_one_to_one to find the position of the single gap via backtracking in matrix H. More technical details of the GPU-based implementation can be found in [21].

Accommodating multiple gaps

The presence of multiple gaps is unlikely given the observed gap occurrence frequency in real-life applications: 5.7 × 10⁻⁶ in the human exome (see the Background Section), 1.7 × 10⁻⁵ in Beta vulgaris [24], 2.4 × 10⁻⁵ in Arabidopsis thaliana [24], and 3.2 × 10⁻⁶ in bacteriophage PhiX174 [24]. However, in order to increase the flexibility of our library, we implemented two additional functions, gapmis_one_to_one_f and gapmis_one_to_one_onf, to allow for a variable, but bounded, number of gaps in the alignment.

• gapmis_one_to_one_f: this function provides the user the option to split the query sequence into f fragments, based on the observed gap occurrence frequency and the query length, by taking the number of fragments as input argument. It then uses function gapmis_one_to_one to perform a single-gap alignment for each fragment independently. The total score of the alignment is obtained by adding the f individual scores of the fragments. We denote this function by gm-f followed by the number of fragments f used as input argument.
• gapmis_one_to_one_onf: this function computes the alignment by using the optimal number of fragments. First, it takes the maximum number of fragments as input argument, say fmax, and only computes the total score of the alignments, for each different number 1, 2, ..., fmax of fragments. It then uses function gapmis_one_to_one_f to compute the alignment by passing the optimal number of fragments (the one that gives the maximum total score in the previous step) as input argument. We denote this function by gm-onf followed by the maximum number of fragments fmax used as input argument.

Experimental results

The experiments were conducted on a Desktop PC using up to 4 cores of Intel i7 2600 CPU at 3.4 GHz under Linux, and an NVIDIA GeForce 560 GPU with 336 CUDA cores and 1 GB DDR5 device memory. libgapmis is distributed under the GNU General Public License (GPL). The library is available at http://www.exelixis-lab.org/gapmis, which is set up for maintaining the source code and the man page documentation.

Figure 6 Inter-sequence GPU memory organisation. This figure was taken from [21].
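The control flow of the two fragment-based functions described under "Accommodating multiple gaps" can be sketched as follows. This is an illustration only: the even splitting rule, the function names, and the `score_fragment` callback (a stand-in for a single-gap alignment of one fragment) are assumptions for the example, since the text does not specify how the library cuts the query or pairs fragments with the target.

```python
from typing import Callable, List, Tuple


def split_into_fragments(query: str, f: int) -> List[str]:
    """Cut the query into f contiguous fragments of near-equal length
    (an assumed splitting rule; earlier fragments get the extra letters)."""
    if not 1 <= f <= len(query):
        raise ValueError("f must be between 1 and the query length")
    base, extra = divmod(len(query), f)
    fragments, start = [], 0
    for k in range(f):
        size = base + (1 if k < extra else 0)
        fragments.append(query[start:start + size])
        start += size
    return fragments


def total_score(query: str, f: int,
                score_fragment: Callable[[str], float]) -> float:
    """Sum of the f independent fragment scores (the gm-f idea)."""
    return sum(score_fragment(frag) for frag in split_into_fragments(query, f))


def best_fragment_count(query: str, fmax: int,
                        score_fragment: Callable[[str], float]) -> Tuple[int, float]:
    """Try f = 1..fmax and keep the count with the highest total score
    (the gm-onf idea); ties go to the smallest f."""
    scores = {f: total_score(query, f, score_fragment) for f in range(1, fmax + 1)}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Because each fragment may carry one gap, f fragments bound the number of gaps in the whole alignment by f, which is the "variable, but bounded" behaviour the text describes.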

To the best of our knowledge, libgapmis is the first library for extending pairwise short-read alignments. The main design goal of libgapmis is to identify a single gap in the alignment (see the Background Section for the motivation). Therefore, in this section, we focus on comparing the performance of function gapmis_one_to_one to the analogous performance of EMBOSS needle. The latter implements the Needleman-Wunsch algorithm for semi-global alignment. The Needleman-Wunsch algorithm is the traditional approach used for semi-global alignment. needle is, to date, one of the most popular pairwise sequence alignment programmes for global and semi-global alignment.

We generated 100,000 pairs of 100 bp-long sequences on the DNA alphabet. Initially, each pair consisted of two identical sequences. Subsequently, we inserted:

• a single gap with a uniformly random length that ranged between 1 and 30 into one of the two sequences;
• a uniformly random number of mismatches that ranged between 1 and 10.

Since the presence of multiple gaps is unlikely based on the gap occurrence frequency observed in real datasets, this experimental setting aims to demonstrate the suitability of the proposed algorithm compared to more traditional approaches in identifying the simulated inserted gap.

We seamlessly integrated function gapmis_one_to_one into a test programme, denoted by gapmis, for computing the optimal semi-global alignment between a pair of sequences. In each case, for a fair comparison of needle and gapmis, an effort was made to run the programmes under conditions as similar as possible. In gapmis, we additionally used function results_one_to_one to produce the corresponding output. While parsing the output generated by the two programmes, we counted any inserted gap as a gap, excluding, however, a gap inserted at the end of the alignment.

We consider as valid those alignments where the number of inserted gaps is less than or equal to the number originally inserted. Furthermore, we consider as correct those valid alignments with gaps whose total length is smaller than or equal to the length of the ones originally inserted and with a number of mismatches less than or equal to the number originally inserted.
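The two acceptance criteria above translate directly into a pair of predicates. The sketch below is a hypothetical helper for illustration, not the evaluation script used in the article; the `AlignmentEdits` record and its field names are assumptions, and terminal gaps are taken to have been excluded before counting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentEdits:
    """Counts describing either the simulated edits or a reported alignment."""
    gaps: int         # number of gaps (terminal gaps already excluded)
    gap_length: int   # total length of those gaps
    mismatches: int   # number of mismatching columns


def is_valid(found: AlignmentEdits, simulated: AlignmentEdits) -> bool:
    """Valid: no more gaps than were originally inserted."""
    return found.gaps <= simulated.gaps


def is_correct(found: AlignmentEdits, simulated: AlignmentEdits) -> bool:
    """Correct: valid, and neither the total gap length nor the number
    of mismatches exceeds what was originally inserted."""
    return (is_valid(found, simulated)
            and found.gap_length <= simulated.gap_length
            and found.mismatches <= simulated.mismatches)
```

Under these definitions an alignment that splits the simulated gap into two shorter gaps is not valid, even if it has fewer mismatches.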
The above experimental procedure was repeated using different gap opening and gap extension penalties. As corroborated by the results in Tables 1, 2, and 3, gapmis is more suitable for identifying single alignment gaps in all cases. As is also shown in [7] and [21], needle cannot, by design, guarantee the insertion of at most one gap into the alignment, even when setting the gap opening penalty to 12.0 and the gap extension penalty to 0.5. The correct (as per our definition) alignments of Tables 1, 2, and 3 are illustrated in Figure 7.

Furthermore, we compared the processing times of gapmis to those of needle by generating 10,000 pairs of 100, 150, 200, and 250 bp-long DNA sequences in analogy to the aforementioned experiment. We used two different versions of gapmis: one with the modifier -m30 to set b = 30; and one with b = n - 1, where b is the maximum allowed length of the single gap, and n is the length of the longest sequence.

The results in Figure 8 show that gapmis was able to complete the assignment up to 20 times faster than needle. Although the asymptotic complexity of the two algorithms is the same, the number of arithmetic operations required by algorithm GAPMIS is substantially lower. This can be easily explained by examining the recurrence relations of the two algorithms. The version with the modifier -m30 was always the fastest, confirming our theoretical results. Note that it only computes a narrow band in the dynamic-programming matrices (see Figure 4c and 4d).
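To give a feel for how much work the narrow band saves, the following Python sketch counts the matrix cells inside a band of half-width b. It assumes, as a simplification that the text does not spell out, that the band consists of the cells (i, j) with |i - j| <= b and that boundary rows and columns are ignored; band_cells is a hypothetical helper, not a libgapmis function.

```python
def band_cells(n, m, b):
    """Count cells (i, j), 1 <= i <= n, 1 <= j <= m, with |i - j| <= b."""
    total = 0
    for i in range(1, n + 1):
        lo = max(1, i - b)      # leftmost column inside the band on row i
        hi = min(m, i + b)      # rightmost column inside the band on row i
        if hi >= lo:
            total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Sequence lengths used in the timing experiment.
    for n in (100, 150, 200, 250):
        full = band_cells(n, n, n - 1)   # b = n - 1 covers the whole matrix
        banded = band_cells(n, n, 30)    # b = 30, as set by the modifier
        print(n, banded, full, round(banded / full, 3))
```

Under this assumption, two 100 bp-long sequences with b = 30 touch 5,170 of the 10,000 cells, and the fraction shrinks as the sequences get longer, which is in line with the banded version being the fastest in Figure 8.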
Table 1 Valid and correct alignments with gap opening penalty 10.0 and gap extension penalty 0.5

Programme   Valid     Correct
needle      94,552    94,516
gapmis      100,000   99,996

The valid and correct alignments of 100,000 pairs of 100 bp-long generated sequences with gap opening penalty 10.0 and gap extension penalty 0.5.

Table 2 Valid and correct alignments with gap opening penalty 8.0 and gap extension penalty 1.0

Programme   Valid     Correct
needle      76,512    76,501
gapmis      100,000   99,997

The valid and correct alignments of 100,000 pairs of 100 bp-long generated sequences with gap opening penalty 8.0 and gap extension penalty 1.0.

Table 3 Valid and correct alignments with gap opening penalty 12.0 and gap extension penalty 0.5

Programme   Valid     Correct
needle      95,452    95,427
gapmis      100,000   99,999

The valid and correct alignments of 100,000 pairs of 100 bp-long generated sequences with gap opening penalty 12.0 and gap extension penalty 0.5.

Figure 7 Correct alignments. The correct alignments of Tables 1-3.

Figure 8 Processing times of needle and gapmis. The processing times of needle and gapmis for aligning 10,000 pairs of sequences.

We also evaluated the time efficiency of the accelerated SSE- and GPU-based versions of libgapmis by comparing their processing times against those of the standard CPU version. In particular, we generated a 75 bp-long DNA query sequence and 4,639,576 100 bp-long DNA target sequences. This represents a realistic setting for re-sequencing applications because the seed part of a short read usually occurs in thousands or millions of positions along the reference sequence. Hence, an important problem in re-sequencing is the efficient and accurate extension of these thousands to millions of potential alignments. We used the following versions of the function gapmis_one_to_many: the CPU version; the single-core SSE version; the SSE version with 4 threads (-t4); and the GPU version.

The same experiment was repeated with 150 and 200 bp-long sequences. As shown by the results in Figure 9, compared to the CPU version, the single-core SSE version accelerates the computations by a factor of 6, the SSE -t4 version by a factor of 23, and the GPU version by a factor of 4. The cell updates per second (CUP/s) are 290 MCUP/s, 1.6 GCUP/s, 6.5 GCUP/s, and 1.2 GCUP/s for the CPU, the SSE, the SSE -t4, and the GPU versions, respectively.

Figure 9 Processing times of gapmis_one_to_many. The processing times of gapmis_one_to_many for aligning a query sequence and 4,639,576 target sequences.

As a further experiment, we generated 1,000,000 75 bp-long DNA query sequences and 200 100 bp-long DNA target sequences. Similar to the above experiment, the four aforementioned versions of function gapmis_many_to_many were used, and the same experimental procedure was repeated with 150 and 200 bp-long sequences. As shown by the results in Figure 10, compared to the CPU version, the single-core SSE version accelerates the computations by a factor of 6, the SSE -t4 version by a factor of 20, and the GPU version by a factor of 11. The CUP/s are 190 MCUP/s, 1.1 GCUP/s, 4 GCUP/s, and 2.2 GCUP/s for the CPU, the SSE, the SSE -t4, and the GPU versions, respectively.

As a further experiment, in order to evaluate the performance of programme gapmis, function gapmis_one_to_one_f, function gapmis_one_to_one_onf, and needle under real conditions, we simulated 1,000,000 100 bp-long query sequences from the 30 Mbp chromosome 1 of Arabidopsis thaliana (AT) obtained from [33], and inserted mismatches and gaps into the reference sequence; then we aligned them back against the original reference sequence. As mismatch occurrence frequency and gap occurrence frequency we used 1.6 × 10^-3 and 2.4 × 10^-5, respectively, the ones observed in AT [24]. Since, in practice, insertions occur less frequently than deletions, 42% of the inserted gaps were insertions and 58% deletions, as also observed in AT [24]. For the length of the inserted gaps, we used the distribution of gap lengths shown in Figure 3, which is consistent with other studies on gap distributions (cf. [9,22,23]). Since the queries were simulated, we knew the exact location of the fragments of the reference sequence they were derived from (the target sequences). Hence, we were able to classify each generated alignment as valid/invalid and correct/incorrect.

We define accuracy as the proportion of correct alignments in the dataset. Thus, we evaluated the accuracy of the aforementioned programmes in extending an alignment end-to-end, assuming that the seed part of the alignment is already performed by using a conventional indexing scheme, that is, a hash-based index [15] or an FM index [16]. We repeated the same experiment by simulating 150 bp-long query sequences and using other gap occurrence frequencies, observed in Beta vulgaris (BV) [24] and Homo sapiens (HS) exome [9].

The high accuracy of libgapmis is demonstrated by the results shown in Table 4. The results show that function gm-onf3 has the highest accuracy in all cases. It can increase the accuracy of extending short-read alignments end-to-end by up to about 0.01% compared to needle. Given the observed gap occurrence frequencies, the increased accuracy of gap identification is significant. For instance, the proportion of pairs of sequences with gaps in the six datasets of Table 4 ranged from 0.85% to 3.5%.
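The accuracy figures quoted above can be recomputed from the counts reported in Table 4. The short Python sketch below does so; the counts are copied from Table 4, while the names TABLE4, accuracy, and gain_over_needle are introduced here purely for illustration.

```python
TOTAL = 1_000_000  # pairs of sequences per dataset (Table 4)

# Correct alignments per (species, query length), copied from Table 4.
TABLE4 = {
    ("AT", 100): {"gapmis": 999_099, "gm-f2": 998_404, "gm-f3": 997_561,
                  "gm-onf2": 999_207, "gm-onf3": 999_259, "needle": 999_126},
    ("AT", 150): {"gapmis": 998_805, "gm-f2": 998_171, "gm-f3": 997_542,
                  "gm-onf2": 999_024, "gm-onf3": 999_152, "needle": 999_115},
    ("BV", 100): {"gapmis": 999_361, "gm-f2": 998_868, "gm-f3": 998_229,
                  "gm-onf2": 999_432, "gm-onf3": 999_459, "needle": 999_353},
    ("BV", 150): {"gapmis": 999_196, "gm-f2": 998_771, "gm-f3": 998_249,
                  "gm-onf2": 999_347, "gm-onf3": 999_432, "needle": 999_378},
    ("HS", 100): {"gapmis": 999_809, "gm-f2": 999_615, "gm-f3": 999_419,
                  "gm-onf2": 999_822, "gm-onf3": 999_825, "needle": 999_782},
    ("HS", 150): {"gapmis": 999_795, "gm-f2": 999_606, "gm-f3": 999_408,
                  "gm-onf2": 999_817, "gm-onf3": 999_825, "needle": 999_793},
}


def accuracy(correct, total=TOTAL):
    """Proportion of correct alignments in a dataset."""
    return correct / total


def gain_over_needle(row, method="gm-onf3"):
    """Accuracy gain of `method` over needle, in percentage points."""
    return 100.0 * (row[method] - row["needle"]) / TOTAL


if __name__ == "__main__":
    for key, row in TABLE4.items():
        best = max(row, key=row.get)  # method with the most correct alignments
        print(key, best, f"{accuracy(row[best]):.6f}",
              f"{gain_over_needle(row):+.4f}%")
```

On these counts gm-onf3 is the best method in every dataset, and its gain over needle lies between 32 and 133 additional correct alignments per 1,000,000 pairs, which corresponds to the order of 0.01% stated in the text.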
Although the gap opening penalty in needle could be increased by the user, this would have a potentially fatal impact on accuracy because the high number of mismatches opted for would be underestimated [21]. We checked this assumption by conducting the following last experiment. We obtained 100,000 100 bp-long and 100,000 150 bp-long query sequences from the 30 Mbp chromosome 1 of AT, and inserted mismatches and gaps into the reference sequence; then we aligned them back against the original reference sequence using needle, similar to the previous experiments. The gap opening penalty ranged from 10.0 to 20.0, and the gap extension penalty was set to 0.5. Our assumption is confirmed by the results shown in Table 5. Notice that increasing the gap opening penalty increases the number of valid alignments but has a negative impact on the accuracy of needle: the number of correct alignments decreases.

Figure 10 Processing times of gapmis_many_to_many. The processing times of gapmis_many_to_many for aligning 1,000,000 query sequences and 200 target sequences.

Conclusions

In this article, we presented libgapmis, an ultrafast and flexible library for extending pairwise short-read alignments end-to-end. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on GapMis, a tool that computes a different version of the traditional dynamic-programming matrix for sequence alignment. This work is directly motivated by the next-generation re-sequencing application. We demonstrated that libgapmis is more suitable and efficient than more traditional approaches for extending short-read alignments end-to-end. Adding the flexibility of bounding the number of gaps inserted in the alignment strengthens the classical scheme of scoring matrices and affine gap penalty scores. The presented experimental results are very promising, both in terms of identifying gaps and efficiency.

By exploiting the potential of modern CPU and GPU architectures and applying multi-threading, we improved the performance of the purely sequential CPU version by more than one order of magnitude. More importantly, the functions provided in libgapmis can be directly integrated into any short-read alignment programme. Our immediate target is to further optimise the code, and also integrate the functions of this library into a short-read alignment pipeline.

Additional material

Additional file 1: Algorithm GAPMIS. The algorithm GAPMIS computes matrices G and H. It takes as input the text t of length n, the pattern x of length m, and the threshold b. This algorithm was taken from [7].

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SPP and AS designed the study. NA, SB, TF, and SPP developed the library. TF and SPP conducted the experiments. SPP wrote the manuscript with the contribution of all authors. The final version of the manuscript is approved by all authors.
Acknowledgements

The publication costs for this article were funded by the Heidelberg Institute for Theoretical Studies (HITS gGmbH). NA, SB, and TF are supported by funding from the DFG (German Science Foundation, grants STA 860/2 and STA 860/4). SPP is supported by the NSF-funded iPlant Collaborative (NSF grant #DBI-0735191). AS is supported by institutional funding from HITS gGmbH. We thank Rajesh Kumar Gottimukkala from Life Technologies for valuable comments and useful discussions.

This article has been published as part of BMC Bioinformatics Volume 14 Supplement 11, 2013: Selected articles from The Second Workshop on Data Mining of Next-Generation Sequencing in conjunction with the 2012 IEEE International Conference on Bioinformatics and Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S11.

Table 5 Valid and correct alignments using needle

Programme  Species  Length of queries [bp]  Gap occurrence frequency  Gap opening penalty  Gap extension penalty  Valid alignments  Correct alignments
needle     AT       100                     2.4 × 10^-5               10.0                 0.5                    99,988            99,917*
needle     AT       100                     2.4 × 10^-5               15.0                 0.5                    99,992            99,911
needle     AT       100                     2.4 × 10^-5               20.0                 0.5                    99,996*           99,850
needle     AT       150                     2.4 × 10^-5               10.0                 0.5                    99,991            99,919*
needle     AT       150                     2.4 × 10^-5               15.0                 0.5                    99,992            99,901
needle     AT       150                     2.4 × 10^-5               20.0                 0.5                    99,996*           99,834

The valid and correct alignments of 100,000 pairs of simulated sequences with the gap occurrence frequency observed in Arabidopsis thaliana using needle. Each of the datasets consists of 100,000 pairs of sequences; the highest numbers of correct and valid alignments for each dataset are marked with an asterisk.
Table 4 Correct alignments using gapmis, gapmis_one_to_one_f, gapmis_one_to_one_onf, and needle

Species  Length of queries [bp]  Gap occurrence frequency  gapmis   gm-f2    gm-f3    gm-onf2  gm-onf3   needle
AT       100                     2.4 × 10^-5               999,099  998,404  997,561  999,207  999,259*  999,126
AT       150                     2.4 × 10^-5               998,805  998,171  997,542  999,024  999,152*  999,115
BV       100                     1.7 × 10^-5               999,361  998,868  998,229  999,432  999,459*  999,353
BV       150                     1.7 × 10^-5               999,196  998,771  998,249  999,347  999,432*  999,378
HS       100                     5.7 × 10^-6               999,809  999,615  999,419  999,822  999,825*  999,782
HS       150                     5.7 × 10^-6               999,795  999,606  999,408  999,817  999,825*  999,793

The correct alignments of 1,000,000 pairs of simulated sequences with various observed gap occurrence frequencies using gapmis, gapmis_one_to_one_f, gapmis_one_to_one_onf, and needle. Each of the datasets consists of 1,000,000 pairs of sequences; the highest number of correct alignments for each dataset is marked with an asterisk.

Authors' details

1 Heidelberg Institute for Theoretical Studies, Heidelberg, Germany. 2 Florida Museum of Natural History, University of Florida, Gainesville, FL, USA.

Published: 4 November 2013

References

1. Levenshtein VI: Binary codes capable of correcting deletions, insertions, and reversals. Tech Rep 8, Soviet Physics Doklady; 1966.
2. Wagner RA, Fischer MJ: The String-to-String Correction Problem. Journal of the ACM 1974, 21:168-173.
3. Sellers PH: On the Theory and Computation of Evolutionary Distances. SIAM Journal on Applied Mathematics 1974, 26(4):787-793.
4. Heckel P: A technique for isolating differences between files. Communications of the ACM 1978, 21(4):264-268.
5. Peterson JL: Computer programs for detecting and correcting spelling errors. Communications of the ACM 1980, 23(12):676-687.
6. Cambouropoulos E, Crochemore M, Iliopoulos CS, Mouchard L, Pinzon YJ: Algorithms for Computing Approximate Repetitions in Musical Sequences. International Journal of Computational Mathematics 2000, 79(11):1135-1148.
7. Flouri T, Frousios K, Iliopoulos CS, Park K, Pissis SP, Tischler G: GapMis: a tool for pairwise sequence alignment with a single gap. Recent Pat DNA Gene Seq 2013, 7:84-95.
8. Gusfield D: Algorithms on strings, trees, and sequences: computer science and computational biology. USA: Cambridge University Press; 1997.
9. Simpson MA, Irving MD, Asilmaz E, Gray MJ, Dafou D, Elmslie FV, Mansour S, Holder SE, Brain CE, Burton BK, Kim KH, Pauli RM, Aftimos S, Stewart H, Kim CA, Holder-Espinasse M, Robertson SP, Drake WM, Trembath RC: Mutations in NOTCH2 cause Hajdu-Cheney syndrome, a disorder of severe and progressive bone loss. Nature Genetics 2011, 43(4):303-305.
10. Balasubramanian S, Klenerman D, Barnes C, Osborne M: 2007, Patent US20077232656.
11. Ju J, Li Z, Edwards J, Itagaki Y: 2007, Patent EP1790736.
12. Rothberg J, Bader J, Dewell S, McDade K, Simpson J, Berka J, Colangelo C: Founding patent of 454 Life Sciences. 2007, Patent US20077211390.
13. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology 2009, 10(3):R25+.
14. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009, 25(16):1966-1967.
15. Frousios K, Iliopoulos CS, Mouchard L, Pissis SP, Tischler G: REAL: an efficient REad ALigner for next generation sequencing reads. In Proceedings of the first ACM International Conference on Bioinformatics and Computational Biology (BCB 2011). USA: ACM; Zhang A, Borodovsky M, Özsoyoglu G, Mikler AR 2010:154-159.
16. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754-1760.
17. Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods 2012, 9:357-359.
18. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. Journal of Molecular Biology 1990, 215(3):403-410.
19. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 1970, 48(3):443-453.
20. Waterman MS, Smith TF: Identification of common molecular subsequences. Journal of Molecular Biology 1981, 147:195-197.
21. Alachiotis N, Berger S, Flouri T, Pissis SP, Stamatakis A: libgapmis: an ultrafast library for short-read single-gap alignment. Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on: 4-7 October 2012 2012, 688-695.
22. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J: Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009, 461(7261):272-276.
23. Ostergaard P, Simpson MA, Brice G, Mansour S, Connell FC, Onoufriadis A, Child AH, Hwang J, Kalidas K, Mortimer PS, Trembath R, Jeffery S: Rapid identification of mutations in GJC2 in primary lymphoedema using whole exome sequencing combined with linkage analysis with delineation of the phenotype. J Med Genet 2011, 48(4):251-255.
24. Minoche AE, Dohm JC, Himmelbauer H: Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome Biology 2011, 12:R112.
25. Flouri T, Frousios K, Iliopoulos CS, Park K, Pissis SP, Tischler G: Approximate string-matching with a single gap for sequence alignment. In Proceedings of the second ACM International Conference on Bioinformatics and Computational Biology (BCB 2011). USA: ACM; ACM 2011:490-492.
26. Crochemore M, Hancart C, Lecroq T: Algorithms on Strings. USA: Cambridge University Press; 2007.
27. Na JC, Roh K, Apostolico A, Park K: Alignment of biological sequences with quality scores. International Journal of Bioinformatics Research and Applications 2009, 5:97-113.
28. National Center for Biotechnology Information (NCBI): 2013 [ftp://ftp.ncbi.nih.gov/blast/matrices/NUC.4.4].
29. National Center for Biotechnology Information (NCBI): 2013 [ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62].
30. Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics 2000, 16(6):276-277.
31. Alachiotis N, Berger S, Stamatakis A: Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel. BMC Bioinformatics 2012, 13:196.
32. Rognes T: Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics 2011, 12:221.
33. National Center for Biotechnology Information (NCBI): [http://www.ncbi.nlm.nih.gov/].

doi:10.1186/1471-2105-14-S11-S4
Cite this article as: Alachiotis et al.: libgapmis: extending short-read alignments. BMC Bioinformatics 2013 14(Suppl 11):S4.

