Citation
Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence

Material Information

Title:
Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence
Series Title:
BMC Bioinformatics
Creator:
Chaudhary, Ruchi
Burleigh, J. Gordon
Eulenstein, Oliver
Publisher:
BioMed Central
Publication Date:
2012-06-25
Language:
English

Notes

Abstract:
Background: Gene tree - species tree reconciliation problems infer the patterns and processes of gene evolution within a species tree. Gene tree parsimony approaches seek the evolutionary scenario that implies the fewest gene duplications, duplications and losses, or deep coalescence (incomplete lineage sorting) events needed to reconcile a gene tree and a species tree. While a gene tree parsimony approach can be informative about genome evolution and phylogenetics, error in gene trees can profoundly bias the results. Results: We introduce efficient algorithms that rapidly search local Subtree Prune and Regraft (SPR) or Tree Bisection and Reconnection (TBR) neighborhoods of a given gene tree to identify a topology that implies the fewest duplications, duplications and losses, or deep coalescence events. These algorithms improve on the current solutions by a factor of n for searching SPR neighborhoods and n² for searching TBR neighborhoods, where n is the number of taxa in the given gene tree. They provide a fast error correction protocol for ameliorating the effects of gene tree error by allowing small rearrangements in the topology to improve the reconciliation cost. We also demonstrate a simple protocol to use the gene rearrangement algorithm to improve gene tree parsimony phylogenetic analyses. Conclusions: The new gene tree rearrangement algorithms provide a fast method to address gene tree error. They do not make assumptions about the underlying processes of genome evolution, and they are amenable to analyses of large-scale genomic data sets. These algorithms are also easily incorporated into gene tree parsimony phylogenetic analyses, potentially producing more credible estimates of reconciliation cost.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All rights reserved by the source institution.
Resource Identifier:
10.1186/1471-2105-13-S10-S11 ( DOI )


Full Text
Status:
Peer reviewed
Copyright Holder:
Ruchi Chaudhary et al.; licensee BioMed Central Ltd.
License:
http://creativecommons.org/licenses/by/2.0
Access Rights:
Open access
Bibliographic Citation:
BMC Bioinformatics. 2012 Jun 25;13(Suppl 10):S11



PROCEEDINGS Open Access

Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence

Ruchi Chaudhary1, J Gordon Burleigh2, Oliver Eulenstein1*

From 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11), Changsha, China. 27-29 May 2011

* Correspondence: oeulenst@cs.iastate.edu
1 Department of Computer Science, Iowa State University, Ames, IA 50011, USA
Full list of author information is available at the end of the article

Chaudhary et al. BMC Bioinformatics 2012, 13(Suppl 10):S11
http://www.biomedcentral.com/1471-2105-13-S10-S11

© 2012 Chaudhary et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Gene tree - species tree reconciliation problems infer the patterns and processes of gene evolution within a species tree. Gene tree parsimony approaches seek the evolutionary scenario that implies the fewest gene duplications, duplications and losses, or deep coalescence (incomplete lineage sorting) events needed to reconcile a gene tree and a species tree. While a gene tree parsimony approach can be informative about genome evolution and phylogenetics, error in gene trees can profoundly bias the results.

Results: We introduce efficient algorithms that rapidly search local Subtree Prune and Regraft (SPR) or Tree Bisection and Reconnection (TBR) neighborhoods of a given gene tree to identify a topology that implies the fewest duplications, duplications and losses, or deep coalescence events. These algorithms improve on the current solutions by a factor of $n$ for searching SPR neighborhoods and $n^2$ for searching TBR neighborhoods, where $n$ is the number of taxa in the given gene tree. They provide a fast error correction protocol for ameliorating the effects of gene tree error by allowing small rearrangements in the topology to improve the reconciliation cost. We also demonstrate a simple protocol to use the gene rearrangement algorithm to improve gene tree parsimony phylogenetic analyses.

Conclusions: The new gene tree rearrangement algorithms provide a fast method to address gene tree error. They do not make assumptions about the underlying processes of genome evolution, and they are amenable to analyses of large-scale genomic data sets. These algorithms are also easily incorporated into gene tree parsimony phylogenetic analyses, potentially producing more credible estimates of reconciliation cost.

Introduction

The availability of large-scale genomic data from a wide variety of taxa has revealed much incongruence between gene trees and the phylogeny of the species in which the genes evolve. This incongruence may be caused by evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. The variation in gene tree topologies can be used to infer the processes of genome evolution. Gene tree - species tree (GT-ST) reconciliation methods seek to map the history of gene trees into the context of species evolution and thus potentially link processes of gene evolution to phenotypic changes and diversification. Yet these methods can be confounded by error in the gene trees, which also may cause incongruence between the gene and species topologies. We introduce efficient algorithms to correct gene tree topologies based on the gene duplication, duplication and loss, or deep coalescence cost models. The algorithms work by identifying the small rearrangements in the gene trees that reduce the reconciliation cost. They are extremely fast and thus amenable to analyses of enormous genomic data sets.

Perhaps the most commonly used and computationally feasible approach to GT-ST reconciliation is gene tree parsimony, which seeks to infer the fewest evolutionary events (e.g., duplication, loss, coalescence, or lateral gene transfer) needed to reconcile a gene tree and species tree topology [1]. This approach also can be extended to infer species phylogenies, finding the species tree that implies the fewest evolutionary events implied by the gene trees (e.g., [2-4]). However, the gene trees often are estimated using heuristic methods from short sequence alignments, and consequently, there is often much error in the estimated gene tree topologies. Error in the gene trees creates more GT-ST incongruence and can radically affect GT-ST reconciliation analyses, implying far more duplications, duplications and losses, or deep coalescence events than actually exist. For example, Rasmussen and Kellis [5] estimated that error in gene tree reconstruction can lead to 2-3 fold overestimates of gene duplications and losses. Gene tree error also can erroneously imply large numbers of duplications near the root of the species tree [6,7], and it can mislead gene tree parsimony phylogenetic analyses (e.g., [8-10]).

Several approaches have been proposed to address gene tree error in GT-ST reconciliation. First, questionable nodes in a gene tree or nodes with low support may be collapsed prior to gene tree reconciliation, and the resulting non-binary gene trees may be reconciled with species trees [11-13]. Similarly, GT-ST reconciliations can use a distribution of gene tree topologies, such as bootstrap gene trees, rather than a single gene tree estimate [6,14,15]. Both of these approaches may help account for stochastic error and uncertainty in gene tree topologies, but they do not explicitly confront gene tree error. Methods also exist to simultaneously infer the gene tree topology and the gene tree reconciliation with a known species tree [5,16]. While these sophisticated statistical approaches appear very promising, they are computationally intensive, and it is unclear if they will be tractable for large-scale analyses. Another, perhaps more computationally feasible, approach is to allow a limited number of local rearrangements in the gene tree topology if they reduce the reconciliation cost [17,18].
Previous studies [17,18] described a method to allow NNI branch swaps on selected branches of a gene tree to reduce the reconciliation cost. Following [17,18], we address gene tree error in the reconciliation process by assuming that the correct gene tree can be found in a particular neighborhood of the given gene tree. We describe this approach for the gene duplication, duplication and loss, and deep coalescence models, which identify the fewest respective events implied from a given gene tree and given species tree. This neighborhood consists of all trees that are within one edit operation of the gene tree. While [17,18] use Nearest Neighbor Interchange (NNI) edit operations to define the neighborhood, we use the standard tree edit operations SPR [19,20] and TBR [19], which significantly extend upon the search space of the NNI neighborhood. The SPR and TBR local search problems find a tree in the SPR and TBR neighborhood of a given gene tree, respectively, that has the smallest reconciliation cost when reconciled with a given species tree. Using the algorithm from Zhang [21], the best known (naïve) runtimes are $O(n^3)$ for the SPR local search problem and $O(n^4)$ for the TBR local search problem, where $n$ is the number of taxa in the given gene tree. These runtimes typically are prohibitively long for the computation of larger GT-ST reconciliations. We improve on these solutions by a factor of $n$ for the SPR local search problem and a factor of $n^2$ for the TBR local search problem. This makes the local search under the TBR edit operation as efficient as under the SPR edit operation, and it provides a high-speed gene tree error-correction protocol that is computationally feasible for large-scale genomic data sets.
We also evaluated the performance of our algorithms using the implementation of SPR based local search algorithms. Note that the SPR neighborhood is properly contained in the TBR neighborhood for any given tree. Thus the performance of the SPR based program provides a conservative estimate of the performance of the TBR based program. We test our programs on a collection of 106 yeast gene trees, some of which contain hundreds of leaves, and we demonstrate how they can be easily incorporated into large-scale gene tree parsimony phylogenetic analyses.

Basic notation and preliminaries

Throughout this paper, the term tree refers to a rooted full binary (phylogenetic) tree.

Let $T$ be a tree. The leaf set of $T$ is denoted by $Le(T)$. The set of all vertices of $T$ is denoted by $V(T)$ and the set of all edges by $E(T)$. The root of $T$ is denoted by $Ro(T)$. The set of internal vertices of $T$ is $I(T) := V(T) \setminus Le(T)$. Given a vertex $v \in V(T)$, we denote the parent of $v$ by $Pa_T(v)$. Let $u := Pa_T(v)$. The edge that connects $v$ with $u$ is written as $(u,v)$. The first element in the pair is always the parent of the second element. The set of all children of $v$ is denoted by $Ch_T(v)$ and the children are called siblings. For $w \in Ch_T(v)$, the sibling of $w$ is denoted by $Sb_T(w)$. We define $\le_T$ to be the partial order on $V(T)$ where $x \le_T y$ if $y$ is a vertex on the path between $Ro(T)$ and $x$, and write $x$

subtree of $T$ rooted at $u \in V(T)$, denoted by $T_u$, is defined to be $T|_U$, for $U := \{v \in Le(T) : v \le_T u\}$. Two trees $T_1$ and $T_2$ are called isomorphic if there exists a bijection between the vertex sets of $T_1$ and $T_2$ which maps a vertex $u_1$ of $T_1$ to vertex $u_2$ of $T_2$ if the subtree rooted at $u_1$ in $T_1$ has the same leaf set as the subtree rooted at $u_2$ in $T_2$. If an isomorphism exists between $T_1$ and $T_2$, we write $T_1 \simeq T_2$.

Given a function $f : A \to B$, we denote by $f(A')$, where $A' \subseteq A$, the set of images of each element in $A'$ under $f$.

The reconciliation cost models

A species tree is a tree that depicts the branching pattern representing the divergence of a set of species, whereas a gene tree is a tree that depicts the evolutionary history among the sequences encoding one gene (or gene family) for a given set of species. We assume that each leaf of the gene tree is labeled with the species from which that gene was sampled. Let $G$ be a gene tree and $S$ a species tree. In order to compare $G$ with $S$, we require a mapping from each gene $g \in V(G)$ to the most recent species in $S$ that could have contained $g$.

Definition 1 (Mapping). The leaf-mapping $\mathcal{L}_{G,S} : Le(G) \to Le(S)$ is a surjection that maps each leaf $g \in Le(G)$ to that unique leaf $s \in Le(S)$ which has the same label as $g$. The extension $\mathcal{M}_{G,S} : V(G) \to V(S)$ is the mapping defined by $\mathcal{M}_{G,S}(g) := LCA(\mathcal{L}_{G,S}(Le(G_g)))$. For convenience, we write $\mathcal{M}(g)$ instead of $\mathcal{M}_{G,S}(g)$ when $G$ and $S$ are clear from the context.

Definition 2 (Comparability). Given trees $G$ and $S$, we say that $G$ is comparable to $S$ if a leaf-mapping $\mathcal{L}_{G,S}(g)$ is well defined.

Throughout this paper we use the following terminology: $G$ is a gene tree that is comparable to the species tree $S$ through a leaf-mapping $\mathcal{L}_{G,S}$, and $n$ is the number of taxa present in both input trees.

Now we define different reconciliation costs from $G$ to $S$ for a given mapping $\mathcal{M}_{G,S}$ that extends $\mathcal{L}_{G,S}$. The reconciliation costs are based on the models of gene duplication [22,23], duplication-loss [21], and deep coalescence [21].

Definition 3 (Duplication cost). The duplication cost from $g \in V(G)$ to $S$ is

$$C_D(G,S,g) := \begin{cases} 1, & \text{if } \mathcal{M}(g) \in \mathcal{M}(Ch(g)); \\ 0, & \text{otherwise.} \end{cases}$$

The duplication cost from $G$ to $S$ is

$$C_D(G,S) := \sum_{g \in I(G)} C_D(G,S,g).$$

Definition 4 (Duplication-loss cost).
The loss cost from $g \in V(G)$ to $S$ is

$$C_L(G,S,g) := \begin{cases} 0, & \text{if } \forall h \in Ch(g) : \mathcal{M}(g) = \mathcal{M}(h); \\ \sum_{h \in Ch(g)} \left| d_S(\mathcal{M}(g), \mathcal{M}(h)) - 1 \right|, & \text{otherwise.} \end{cases}$$

The duplication-loss cost from $G$ to $S$ is

$$C_{DL}(G,S) := \sum_{g \in I(G)} \left( C_D(G,S,g) + C_L(G,S,g) \right).$$

Definition 5 (Deep coalescence cost). The number of lineages from $g \in V(G)$ to $h \in Ch(g)$ in $S$ is

$$C_{DC}(G,S,g) := \sum_{h \in Ch(g)} d_S(\mathcal{M}(g), \mathcal{M}(h)).$$

The deep coalescence cost from $G$ to $S$ is

$$C_{DC}(G,S) := \sum_{g \in I(G)} C_{DC}(G,S,g) - |E(S)|.$$

The error-correction problems

Here we give definitions for the tree rearrangement operations TBR [19] and SPR [19,20], and then formulate the Error-Correction problems that were motivated in the introduction.

Definition 6 (Tree Bisection and Reconnection (TBR)). Let $T$ be a tree. For this definition, we regard the planted tree $Pl(T)$ as the tree obtained from adding the root edge $\{r, Ro(T)\}$ to $E(T)$, where $r \notin V(T)$. Let $e := (u,v) \in E(T)$, and $X$ and $Y$ be the connected components that are obtained by removing edge $e$ from $T$ such that $v \in X$ and $u \in Y$. We define $TBR_T(v,x,y)$ for $x \in X$ and $y \in Y$ to be the tree that is obtained from $Pl(T)$ by first deleting edge $e$, and then adjoining a new edge $f$ between $X$ and $Y$ as follows:

1. If $x \neq Ro(X)$ then suppress $Ro(X)$ and create a new root by subdividing edge $(Pa(x), x)$.
2. Subdivide edge $(Pa(y), y)$ by introducing a new vertex $y'$.
3. Re-connect components $X$ and $Y$ by adding edge $f = (y', Ro(X))$.
4. Suppress the vertex $u$, and rename vertex $y'$ as $u$.
5. Contract the root edge.

We say that the tree $TBR_T(v,x,y)$ is obtained from $T$ by a tree bisection and reconnection (TBR) operation that bisects the tree $T$ into the components $X$ and $Y$, and reconnects them above the nodes $x$ and $y$. (See Figure 1.) We define the following neighborhoods for the TBR operation:


1. $TBR_G(v,x) := \bigcup_{y \in Y} TBR_G(v,x,y)$
2. $TBR_G(v) := \bigcup_{x \in X} TBR_G(v,x)$
3. $TBR_G := \bigcup_{(u,v) \in E(G)} TBR_G(v)$

Definition 7 (Subtree Prune and Regrafting (SPR)). The SPR operation is defined as a special case of the TBR operation. Let $e := (u,v) \in E(T)$, and $X$ and $Y$ be the connected components that are obtained by removing edge $e$ from $T$ such that $v \in X$ and $u \in Y$. We define $SPR_T(v,y)$ for $y \in Y$ to be $TBR_T(v,v,y)$. We say that the tree $SPR_T(v,y)$ is obtained from $T$ by performing a subtree prune and regraft (SPR) operation that prunes subtree $T_v$ and regrafts it above $y$. (See Figure 2(a).)

We define the following neighborhoods for the SPR operation:

1. $SPR_G(v) := \bigcup_{y \in Y} SPR_G(v,y)$
2. $SPR_G := \bigcup_{(u,v) \in E(G)} SPR_G(v)$

We now state the SPR based error-correction problems for duplication (D), duplication-loss (DL), and deep coalescence (DC). Let $\Gamma \in \{D, DL, DC\}$.

Problem 1 (SPR based error-correction for $\Gamma$ (SEC)).
Instance: A gene tree $G$ and a species tree $S$.
Find: A gene tree $G^* \in SPR_G$ such that $C_\Gamma(G^*, S) = \min_{G' \in SPR_G} C_\Gamma(G', S)$.

Figure 1: A TBR operation. Tree $T' = TBR_T(v,x,y)$ results from $T$ after performing a single TBR operation.

Figure 2: The NNI adjacency graph. (a) The tree $\bar{G}$ is obtained from $G$ by pruning and regrafting subtree $G_v$ to the root of $G$. The vertex $x \in V(G)$ is suppressed, and the new vertex above the root in $\bar{G}$ is named $x$. (b) Two NNI operations $NNI_{G'}(z)$ and $NNI_{G'}(z')$ produce the left-child $G'_l$ and right-child $G'_r$ of $G'$ in $X$.
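The three cost models that Problem 1 minimizes (Definitions 3 to 5, built on the mapping of Definition 1) can be made concrete with a small Python sketch. This is not the authors' implementation. It assumes trees are nested 2-tuples with species names at the leaves, identifies each species-tree vertex by its leaf set, and finds the LCA by a simple scan instead of the constant-time LCA queries the paper relies on. The names `leaves`, `vertex_depths`, and `reconcile` are hypothetical.

```python
def leaves(tree):
    """Return the list of leaf labels below `tree` (a leaf is a plain string)."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def vertex_depths(species_tree, depth=0, table=None):
    """Map every vertex of S, identified by its leaf set, to its depth."""
    table = {} if table is None else table
    table[frozenset(leaves(species_tree))] = depth
    if not isinstance(species_tree, str):
        for child in species_tree:
            vertex_depths(child, depth + 1, table)
    return table


def reconcile(gene_tree, species_tree):
    """Return the D, DL and DC costs of Definitions 3 to 5 as a dict."""
    depth = vertex_depths(species_tree)

    def lca(species):
        # Smallest vertex (leaf set) of S that contains all given species.
        return min((c for c in depth if species <= c), key=len)

    totals = {"dup": 0, "loss": 0, "lineages": 0}

    def visit(node):
        m = lca(frozenset(leaves(node)))            # M(g), Definition 1
        if isinstance(node, str):
            return m
        child_maps = [visit(child) for child in node]
        dists = [depth[h] - depth[m] for h in child_maps]  # d_S(M(g), M(h))
        if m in child_maps:                          # Definition 3
            totals["dup"] += 1
        if any(h != m for h in child_maps):          # Definition 4
            totals["loss"] += sum(abs(d - 1) for d in dists)
        totals["lineages"] += sum(dists)             # Definition 5, unnormalised
        return m

    visit(gene_tree)
    edges_in_s = len(depth) - 1                      # |E(S)| for a rooted tree
    return {"D": totals["dup"],
            "DL": totals["dup"] + totals["loss"],
            "DC": totals["lineages"] - edges_in_s}


S = (("a", "b"), "c")
print(reconcile((("a", "b"), "c"), S))   # {'D': 0, 'DL': 0, 'DC': 0}
print(reconcile((("a", "c"), "b"), S))   # {'D': 1, 'DL': 4, 'DC': 1}
```

The second call shows how a single misplaced leaf already implies one duplication, three losses, and one extra lineage, which is the kind of inflation the error-correction problems aim to remove.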


The TBR based error-correction for $\Gamma$ (TEC) problems are defined analogously to the SPR based error-correction for $\Gamma$ (SEC) problems.

Solving the SEC problems

In this section we study the SPR based error-correction problems, for duplication (D), duplication-loss (DL), and deep coalescence (DC), in more detail. Our efficient solutions for these problems are based on solving restricted versions of these problems efficiently. For each $\Gamma \in \{D, DL, DC\}$ we first define a restricted version of the SEC problem, which we call the restricted SPR based error-correction for $\Gamma$ (R-SEC) problem.

Problem 2 (Restricted SPR based error-correction for $\Gamma$ (R-SEC)).
Instance: A gene tree $G$, a species tree $S$, and $v \in V(G)$.
Find: A gene tree $G^* \in SPR_G(v)$ such that $C_\Gamma(G^*, S) = \min_{G' \in SPR_G(v)} C_\Gamma(G', S)$.

Observation 1. Let $\Gamma \in \{D, DL, DC\}$. Given a gene tree $G$ and a species tree $S$, the SEC problem can be solved as follows: (i) solve the R-SEC problem for every $v \in V(G)$ where $v \neq Ro(G)$, (ii) under all solutions found return a minimum scoring gene tree $G^*$.

Naïvely, the R-SEC problem can be solved in $\Theta(n^2)$ time by computing the cost $C_\Gamma(G', S)$ for each $G' \in SPR_G(v)$. The cost for a given gene and species tree can be computed in $\Theta(n)$ time [21]. We introduce a novel algorithm for the R-SEC problem that improves by a factor of $n$ on the naïve solution. This speedup is achieved by semi-ordering the trees in $SPR_G(v)$, for each $v \in V(G)$, such that the score-difference of any two consecutive trees in this order can be computed in constant time.

Ordering the trees in $SPR_G(v)$

Consider a graph on trees in $SPR_G(v)$, in which every two adjacent trees are one NNI [19] operation apart. We show that such a graph is a rooted full binary tree, after providing necessary definitions.

Definition 8 (Nearest Neighbor Interchange (NNI)). We define the NNI operation as a special case of the SPR operation. Let $e \in E(T)$ where $e := (u,v)$, and $X$ and $Y$ be the connected components that are obtained by removing edge $e$ from $T$ such that $v \in X$ and $u \in Y$. We define $NNI_T(v)$ to be $SPR_T(v,y)$ for $y := Pa(u)$, and say that $NNI_T(v)$ is obtained from $T$ by performing a nearest neighbor interchange (NNI) operation that prunes subtree $T_v$ and regrafts it above the parent of $v$'s parent. (See Figure 2(b).)

Definition 9 (NNI distance).
Let the NNI-distance, denoted as $d_{NNI}(T_1, T_2)$, between two trees $T_1$ and $T_2$ over $n$ taxa be the minimum number of NNI operations required to transform $T_1$ into $T_2$.

Definition 10 (NNI-adjacency graph). The NNI-adjacency graph, denoted as $X = (V, E)$, is the graph where $V = SPR_G(v)$ and $\{T_1, T_2\} \in E$ if $d_{NNI}(T_1, T_2) = 1$.

Lemma 1. $X$ is a tree.

Proof. We prove it by showing that there exists a unique path between every two vertices in $X$. Let $G', G'' \in V(X)$, thus $G', G'' \in SPR_G(v)$. Let $G' := SPR_G(v, x_1)$ and $G'' := SPR_G(v, x_2)$. We use induction on $d_G(x_1, x_2)$. Let $d_G(x_1, x_2) = 1$ and assume without loss of generality that $x_2 = Pa_G(x_1)$. Thus, $G' = NNI_{G''}(Sb(x_1))$. So the hypothesis holds for $d_G(x_1, x_2) = 1$. Assume now that the hypothesis is true for $d_G(x_1, x_2) \le k$ and suppose $d_G(x_1, x_2) = k + 1$. Since $G$ is a tree, there must be a unique path between $x_1$ and $x_2$; let $y$ be a vertex on this path. Let $d_G(y, x_1) = 1$, and $G_n := SPR_G(v, y)$. If $y = Pa_G(x_1)$, then $G_n = NNI_{G'}(v)$; otherwise $G_n = NNI_{G'}(Sb(y))$. Since $d_G(y, x_2) = k$, thus (by induction hypothesis) the hypothesis is valid for $d_G(x_1, x_2) = k + 1$.

Theorem 1. $X$ is a rooted full binary tree.

Proof. In view of Lemma 1, it suffices to show that except a unique vertex of degree 2 all other vertices in $X$ are of degree 1 or 3. Let $G' \in V(X)$, thus $G' = SPR_G(v, y)$ for some $y \in V(G)$. There are three cases:

Case 1: $y$ is a root. Let $y_1 \in Ch_G(y)$. Let $G_1 := SPR_G(v, y_1)$, thus $G' = NNI_{G_1}(v)$. Hence $\{G_1, G'\} \in E(X)$. Since $|Ch_G(y)| = 2$, $G'$ must be a degree 2 vertex in $X$.

Case 2: $y$ is a leaf. Let $y_1 = Pa_G(y)$. Let $G_1 := SPR_G(v, y_1)$, thus $G_1 = NNI_{G'}(v)$. Hence $\{G_1, G'\} \in E(X)$, and consequently, $G'$ is a degree 1 vertex in $X$.

Case 3: $y$ is an internal vertex. Let $y_1 = Pa_G(y)$ and $y_2 \in Ch_G(y)$. Let $G_1 := SPR_G(v, y_1)$, thus $G_1 = NNI_{G'}(v)$. Let $G_2 := SPR_G(v, y_2)$, thus $G' = NNI_{G_2}(v)$. Since $y$ has one parent and two children in $G$, $G'$ is a degree 3 vertex in $X$.

This completes the proof.

The score difference of consecutive trees in $X$

To solve the R-SEC problems we traverse the tree $X$. Two adjacent trees in $V(X)$ are one NNI operation apart. We show that the $C_\Gamma$ score of a tree can be computed in constant time from the LCA computation of its adjacent tree.
Let $e := (G', G'')$ be an edge in $X$. Let $x := Pa(v)$, $y := Sb(v)$, and $z, z' \in Ch(y)$ in $G'$ (see Figure 2(b)). Without loss of generality, let $G'' := NNI_{G'}(z)$. (Observe, $G''$ is similar to $G'_r$ of Figure 2(b).)

Lemma 2. $\mathcal{M}_{G'',S}(y) = \mathcal{M}_{G',S}(x)$.

Proof. From the NNI operation, $v, z' \in Ch_{G''}(x)$ and $z, x \in Ch_{G''}(y)$. Also, $G'_z \simeq G''_z$, $G'_{z'} \simeq G''_{z'}$, $G'_v \simeq G''_v$, so $Le(G''_y) = Le(G'_x)$. Thus,

$$\mathcal{M}_{G',S}(x) = LCA(\mathcal{L}_{G',S}(Le(G'_x))) = LCA(\mathcal{L}_{G'',S}(Le(G''_y))) = \mathcal{M}_{G'',S}(y).$$


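Lemma 3. $\mathcal{M}_{G'',S}(w) = \mathcal{M}_{G',S}(w)$, for all $w \in V(G') \setminus \{x, y\}$.

Proof. For $g \in V(G'_v) \cup V(G'_z) \cup V(G'_{z'})$, since $G'_g \simeq G''_g$, therefore $\mathcal{M}_{G',S}(g) = \mathcal{M}_{G'',S}(g)$. Also, except for subtree $G'_x$, the rest of the tree remains the same in $G''$. Thus by Lemma 2, $\mathcal{M}_{G',S}(Pa_{G'}(x)) = \mathcal{M}_{G'',S}(Pa_{G''}(y))$. Inductively, $\mathcal{M}_{G',S}(g) = \mathcal{M}_{G'',S}(g)$, for all $g \in V(G') \setminus V(G'_x)$.

Lemma 4. $\mathcal{M}_{G'',S}(x) = LCA(\mathcal{M}_{G',S}(v), \mathcal{M}_{G',S}(z'))$.

Proof. From Lemma 3, $\mathcal{M}_{G'',S}(v) = \mathcal{M}_{G',S}(v)$ and $\mathcal{M}_{G'',S}(z') = \mathcal{M}_{G',S}(z')$. Thus, $\mathcal{M}_{G'',S}(x) = LCA(\mathcal{M}_{G'',S}(v), \mathcal{M}_{G'',S}(z')) = LCA(\mathcal{M}_{G',S}(v), \mathcal{M}_{G',S}(z'))$.

Lemma 5. $C_\Gamma(G', S, g) = C_\Gamma(G'', S, g)$ for all $g \in V(G') \setminus \{x, y\}$ and $\Gamma \in \{D, DL, DC\}$.

Proof. The gene duplication and loss status of a vertex, and the number of lineages from a vertex to its children in $G'$, can change in $G''$ only if its mapping or the mapping of any of its children changes in $\mathcal{M}_{G'',S}$. From Lemma 3, and also, since $\mathcal{M}_{G',S}(w) = \mathcal{M}_{G'',S}(w)$ for $w \in Ch(Pa_{G'}(x))$, we must have $C_\Gamma(G', S, Pa_{G'}(x)) = C_\Gamma(G'', S, Pa_{G'}(x))$. Thus the Lemma follows.

Let $e := (G', G'') \in E(X)$ and $\Gamma \in \{D, DL, DC\}$. We define $\Delta_e := C_\Gamma(G'', S) - C_\Gamma(G', S)$ with respect to the given species tree $S$. Observe that this score can be negative too. We study how $\Delta_e$ can be computed efficiently for each edge $e$ in $X$.

Theorem 2. $\Delta_e = \sum_{g \in \{x, y\}} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right)$.

Proof.

$$\Delta_e = C_\Gamma(G'', S) - C_\Gamma(G', S) = \sum_{g \in V(G')} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right)$$
$$= \sum_{g \in V(G') \setminus \{x, y\}} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right) + \sum_{g \in \{x, y\}} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right)$$
$$= \sum_{g \in \{x, y\}} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right),$$

where the last step uses Lemma 5.

Definition 11. Let $\bar{G} := SPR_G(v, Ro(G))$ and let $P_{G'}$ be a path from $\bar{G}$ to $G'$ in $X$. For $G'$, we define the score-difference $\Delta_{\bar{G},G'}$ as $\Delta_{\bar{G},G'} := \sum_{e \in E(P_{G'})} \Delta_e$.

Theorem 3. For given $S$, $G$, and $v \in V(G)$, the tree $G^* \in V(X)$ is the output of an R-SEC problem iff $\Delta_{\bar{G},G^*} = \min_{G' \in V(X)} \Delta_{\bar{G},G'}$.

Proof. Let $\Delta_{\bar{G},G^*} = \min_{G' \in V(X)} \Delta_{\bar{G},G'}$. We prove that $G^*$ is the output of the R-SEC problem. Since $\Delta_{\bar{G},G'} = \sum_{e \in E(P_{G'})} \Delta_e = C_\Gamma(G', S) - C_\Gamma(\bar{G}, S)$, $G^*$ gives the minimum normalized $C_\Gamma$ score over all trees in $V(X)$. Hence, $G^*$ must be the output of the R-SEC problem. The other direction follows similarly.

The algorithm

We describe a general algorithm, called Algo-R-SEC, to solve the R-SEC problem for each $\Gamma \in \{D, DL, DC\}$.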
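Initially Algo-R-SEC computes (i) the root vertex of the NNI-adjacency graph $X$, which we call $\bar{G}$, by regrafting the subtree $G_v$ above the root of $G$, (ii) the LCA mapping from $\bar{G}$ to $S$, and (iii) the $\Gamma$ score from $\bar{G}$ to $S$. Then recursively Algo-R-SEC computes the LCA mapping and $\Gamma$ score for every vertex $G'$ in $X$ when the LCA mapping and $\Gamma$ score of its parent vertex in $X$ is known. Algorithm 1 details Algo-R-SEC.

Algorithm 1 - Algo-R-SEC
Input: A gene tree $G$, a species tree $S$, and $v \in V(G)$
Output: A tree $G^* \in SPR_G(v)$ such that $C_\Gamma(G^*, S) = \min_{G' \in SPR_G(v)} C_\Gamma(G', S)$
01. Compute $\bar{G}$ by pruning $G_v$ and regrafting at $Ro(G)$
02. Compute LCA mapping $\mathcal{M}_{\bar{G},S}$
03. Call $C_\Gamma(\bar{G}, S)$ = Algo-Comp-Score($\bar{G}$, $S$, $\mathcal{M}_{\bar{G},S}$)
04. Set BestTree = $\bar{G}$, BestScore = 0
05. Set $G' = \bar{G}$, $\mathcal{M}_{G',S} = \mathcal{M}_{\bar{G},S}$, $C_\Gamma(G', S) = C_\Gamma(\bar{G}, S)$, $\Delta_{\bar{G},G'} = 0$
06. For each $k \neq Ro(\bar{G}_{Sb(v)})$ in preorder traversal of $\bar{G}_{Sb(v)}$ do
07.   If not backtracking, then
08.     Set $x = Pa_{G'}(v)$, $y = Sb_{G'}(v)$
09.     Set $G'' = NNI_{G'}(Sb_{G'}(k))$
10.     Set $\mathcal{M}_{G'',S} = \mathcal{M}_{G',S}$, $\mathcal{M}_{G'',S}(y) = \mathcal{M}_{G',S}(x)$
11.     $\mathcal{M}_{G'',S}(x) = LCA(\mathcal{M}_{G',S}(k), \mathcal{M}_{G',S}(v))$
12.     Call $\Delta_{\{G',G''\}} = \sum_{h \in \{x,y\}}$ (Algo-G-Score($G''$, $S$, $\mathcal{M}_{G'',S}$, $h$) − Algo-G-Score($G'$, $S$, $\mathcal{M}_{G',S}$, $h$))
13.     $\Delta_{\bar{G},G''} = \Delta_{\bar{G},G'} + \Delta_{\{G',G''\}}$
14.     If $\Delta_{\bar{G},G''} <$ BestScore then
15.       Set BestTree = $G''$, BestScore = $\Delta_{\bar{G},G''}$
16.   Else
17.     Set $x = Pa_{G'}(v)$, $y = Pa_{G'}(x)$
18.     Set $G'' = NNI_{G'}(v)$
19.     Set $\mathcal{M}_{G'',S} = \mathcal{M}_{G',S}$, $\mathcal{M}_{G'',S}(x) = \mathcal{M}_{G',S}(y)$
20.     Set $\mathcal{M}_{G'',S}(y) = LCA(\mathcal{M}_{G',S}(Sb_{G'}(x)), \mathcal{M}_{G',S}(k))$
21.     Call $\Delta_{\{G'',G'\}} = \sum_{h \in \{x,y\}}$ (Algo-G-Score($G'$, $S$, $\mathcal{M}_{G',S}$, $h$) − Algo-G-Score($G''$, $S$, $\mathcal{M}_{G'',S}$, $h$))
22.     Set $\Delta_{\bar{G},G''} = \Delta_{\bar{G},G'} - \Delta_{\{G'',G'\}}$
23.   Set $G' = G''$, $\mathcal{M}_{G',S} = \mathcal{M}_{G'',S}$, $\Delta_{\bar{G},G'} = \Delta_{\bar{G},G''}$
24. Return BestTree

Algorithm 2 - Algo-Comp-Score
Input: A gene tree $G$, a species tree $S$, and LCA mapping $\mathcal{M}_{G,S}$
Output: $C_\Gamma(G, S)$
01. score = 0
02. For each $g \in I(G)$ in preorder traversal of $G$ do
03.   Call score = score + Algo-G-Score($G$, $S$, $\mathcal{M}_{G,S}$, $g$)
04. If $\Gamma$ is DC, then
05.   score = score − $|I(S)|$
06. Return score

Algorithm 3 - Algo-G-Score
Input: A gene tree $G$, a species tree $S$, LCA mapping $\mathcal{M}_{G,S}$, and $g \in I(G)$
Output: $C_\Gamma(G, S, g)$
01. If $\Gamma$ is D, then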


02.   If $\mathcal{M}(g) \in \mathcal{M}(Ch(g))$, then
03.     Return 1
04. Else If $\Gamma$ is DL, then
05.   $ls = \sum_{h \in Ch(g)} |dp(\mathcal{M}(h)) - dp(\mathcal{M}(g)) - 1|$
06.   If $\mathcal{M}(g) \in \mathcal{M}(Ch(g))$, then
07.     Return $ls + 1$
08.   Else
09.     Return $ls$
10. Else // $\Gamma$ is DC
11.   Return $\sum_{h \in Ch(g)} |dp(\mathcal{M}(h)) - dp(\mathcal{M}(g))|$

Lemma 6. The R-SEC problem is correctly solved by Algo-R-SEC.

Proof. Lemmas 1-5 and Theorems 1-3 directly imply that in order to prove the correctness of algorithm Algo-R-SEC, it is sufficient to prove that it correctly returns the $G'$ of minimum $\Delta_{\bar{G},G'}$ among all $G' \in V(X)$. We will show that algorithm Algo-R-SEC accounts for each $G' \in V(X)$, correctly computes $\Delta_{\bar{G},G'}$ for $\Gamma \in \{D, DL, DC\}$, and returns the right $G'$ as output.

From Definition 10, $V(X) = SPR_G(v)$. In Algo-R-SEC, step 1 prunes subtree $G_v$ and regrafts it above the root of $G$ to create $\bar{G}$. Step 5 sets $G'$ to $\bar{G}$. The for-loop in step 6 traverses subtree $\bar{G}_{Sb(v)}$ in preorder. For each traversed vertex $k \neq Ro(\bar{G}_{Sb(v)})$, step 9 builds the tree $G'' := SPR_G(v, k)$ by applying an NNI operation on the last built $G'$. Each for-loop iteration sets $G'$ to the last built $G''$ in step 23. $\bar{G}$ and the trees $G''$ constitute all the trees in $SPR_G(v)$.

For $\bar{G}$, step 2 computes the LCA mapping and step 5 sets $\Delta_{\bar{G},G'}$ to zero. Following Lemmas 2-4 and Theorem 2, steps 10 and 11 update the LCA mapping of $G''$ and step 12 computes $\Delta_{\{G',G''\}}$ by calling algorithm Algo-G-Score. Depending on $\Gamma \in \{D, DL, DC\}$, there are three cases:

Case 1: $\Gamma$ is D. Algo-G-Score returns 1 if the vertex $g \in V(G)$ maps to the same vertex in $S$ as any of its children maps to, otherwise 0.

Case 2: $\Gamma$ is DL. Algo-G-Score computes losses by applying the formula of Definition 4. Further, it adds 1 if there is a duplication.

Case 3: $\Gamma$ is DC. Algo-G-Score returns the number of lineages from $g$ to each of its children $h \in Ch(g)$ in $S$. For each $h \in Ch(g)$, the depth of $\mathcal{M}(g)$ is subtracted from the depth of $\mathcal{M}(h)$ to count the number of edges between $\mathcal{M}(g)$ and $\mathcal{M}(h)$.

In Algo-R-SEC, step 13 computes $\Delta_{\bar{G},G''}$ by adding $\Delta_{\bar{G},G'}$ and $\Delta_{\{G',G''\}}$. When backtracking, steps 17-22 are executed to restore the right $G'$ to compute the next unique $G'' \in Ch_X(G')$. This ensures that the correct $\Delta_{\bar{G},G''}$ is computed for each $G'' \in V(X)$.
In Algo-R-SEC, step 4 sets $\bar{G}$ as the BestTree and $\Delta_{\bar{G},\bar{G}} = 0$ as BestScore. Every time a new $G'' \in Ch_X(G')$ is encountered, step 14 compares $\Delta_{\bar{G},G''}$ with BestScore, and updates BestTree with the $G''$ of minimum $\Delta_{\bar{G},G''}$. After the for-loop, step 24 returns the BestTree.

Lemma 7. The R-SEC and SEC problems can be solved in $\Theta(n)$ and $\Theta(n^2)$ time, respectively.

Proof. We will prove that the algorithm Algo-R-SEC solves the restricted SPR based error-correction problems for each $\Gamma \in \{D, DL, DC\}$ in $\Theta(n)$ time. In Algo-R-SEC, step 1 takes constant time. Step 2 precomputes LCA values for the species tree in $O(n)$ time [24], and so finds the LCA mapping from $\bar{G}$ to $S$ in $O(n)$ time in a bottom-up manner. Step 3 computes the duplication, duplication-loss or deep coalescence score of $\bar{G}$ and $S$ by calling Algo-Comp-Score. In Algo-Comp-Score, step 1 and step 2 run in $O(1)$ and $O(n)$ time, respectively. Step 3 calls Algo-G-Score in each iteration of the for-loop. Algo-G-Score runs in $O(1)$ time for $\Gamma \in \{D, DL, DC\}$.

When $\Gamma$ is DC, steps 4 and 5 are further executed in Algo-Comp-Score in constant time. Thus in Algo-R-SEC, step 3 runs in $O(n)$ time. Further, steps 4 and 5 take constant time. The loop of step 6 runs for $\Theta(n)$ iterations. If the condition of step 7 is true, steps 8-10 execute in constant time. With the precomputed LCA values from step 2, step 11 executes in constant time. Algo-G-Score runs in constant time for $\Gamma \in \{D, DL, DC\}$, which lets step 12 execute in constant time. Further, steps 13-15 execute in constant time too. If the condition in step 7 is false, then steps 17-22 execute in constant time, similarly. Finally, step 23 runs in constant time, and hence, the R-SEC problem can be solved in $\Theta(n)$ time. From Observation 1, Algo-R-SEC is called $\Theta(n)$ times to solve the SEC problem. Thus, the SEC problem can be solved in $\Theta(n^2)$ time.
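For small inputs, an implementation of Algo-R-SEC can be cross-checked against the naïve search it improves on: enumerate every tree in $SPR_G(v)$ and score each one from scratch. The following Python sketch does this for the duplication cost only. It is an illustration under stated assumptions, not the authors' code: trees are nested 2-tuples with species names at the leaves, the pruned vertex $v$ is addressed by a hypothetical path of child indices (0 = left, 1 = right) from the root, and the LCA lookup is a plain scan, so the sketch is slower than the $\Theta(n^2)$ naïve bound quoted above.

```python
def leafset(tree):
    """Set of species labels below `tree` (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leafset(tree[0]) | leafset(tree[1])


def duplication_cost(gene_tree, species_tree):
    """Number of gene-tree vertices mapped to the same species vertex as a child."""
    clusters, stack = [], [species_tree]
    while stack:                       # collect the leaf set of every vertex of S
        node = stack.pop()
        clusters.append(leafset(node))
        if not isinstance(node, str):
            stack.extend(node)

    def lca(node):
        species = leafset(node)
        return min((c for c in clusters if species <= c), key=len)

    def visit(node):
        if isinstance(node, str):
            return 0
        is_dup = lca(node) in (lca(node[0]), lca(node[1]))
        return int(is_dup) + visit(node[0]) + visit(node[1])

    return visit(gene_tree)


def prune(tree, path):
    """Cut the subtree at `path`; return (remaining tree, pruned subtree)."""
    side = path[0]
    if len(path) == 1:
        return tree[1 - side], tree[side]     # the parent vertex is suppressed
    rest, pruned = prune(tree[side], path[1:])
    rebuilt = (rest, tree[1]) if side == 0 else (tree[0], rest)
    return rebuilt, pruned


def regrafts(rest, subtree):
    """Yield the trees obtained by regrafting `subtree` above each vertex of `rest`."""
    yield (rest, subtree)                     # above the root of the remaining tree
    if not isinstance(rest, str):
        left, right = rest
        for t in regrafts(left, subtree):
            yield (t, right)
        for t in regrafts(right, subtree):
            yield (left, t)


def naive_r_sec(gene_tree, species_tree, path):
    """Best tree in the SPR neighbourhood of the vertex at `path` (duplication cost)."""
    rest, pruned = prune(gene_tree, path)
    return min(regrafts(rest, pruned),
               key=lambda t: duplication_cost(t, species_tree))


S = ((("a", "b"), "c"), "d")
G = ((("a", "c"), "b"), "d")                  # leaf "c" is misplaced
best = naive_r_sec(G, S, (0, 0, 1))           # prune the leaf "c"
print(duplication_cost(G, S), duplication_cost(best, S))   # 1 0
```

The generator visits regraft positions in preorder, which mirrors the traversal order used in step 6 of Algorithm 1; the difference is that each candidate is rescored in full instead of being updated from its neighbor.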
Solving the TEC problems

In this section we study the TBR based error-correction problems, for duplication (D), duplication-loss (DL), and deep coalescence (DC). More precisely, we extend our solution for the SEC problems to solve the TEC problems. A TBR operation can be viewed as an SPR operation, except that the pruned subtree can be rerooted before it is regrafted. Our speed-up for the TEC problems is achieved by observing that the scores of any re-rooted pruned subtree and its remaining pruned tree are independent of each other. We define the R-TEC problems for the TEC problems, as we defined the R-SEC problems for the SEC problems. We will show that the R-TEC problems can be solved by solving two smaller problems separately and combining their solutions.

Definition 12. Let $T$ be a tree and $x \in V(T)$. $RR(T, x)$ is defined to be the tree $T$, if $x = Ro(T)$ or $x \in Ch(Ro(T))$. Otherwise, $RR(T, x)$ is the tree obtained by suppressing $Ro(T)$, and subdividing the edge $(Pa(x), x)$ by the new root node.
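The rerooting operation $RR(T, x)$ of Definition 12 can be sketched in Python as follows. The sketch assumes trees are nested 2-tuples, that the vertex $x$ is passed as the subtree it roots, and that leaf labels are unique within the tree (for example gene identifiers rather than species names) so that $x$ is unambiguous. The names `contains` and `reroot` are hypothetical.

```python
def contains(tree, x):
    """True if the subtree `x` occurs in `tree` (including `tree` itself)."""
    if tree == x:
        return True
    return isinstance(tree, tuple) and any(contains(child, x) for child in tree)


def reroot(tree, x):
    """RR(T, x): move the root onto the edge above x, one edge per iteration."""
    if not contains(tree, x):
        raise ValueError("x is not a vertex of the tree")
    # Stop as soon as x is the root or a child of the root (first case of Def. 12).
    while tree != x and x not in tree:
        left, right = tree
        if not contains(left, x):
            left, right = right, left         # make `left` the side that holds x
        near, far = left                      # `left` is internal: x lies below it
        if not contains(near, x):
            near, far = far, near
        # Suppress the old root and place the new root on the edge above `near`.
        tree = (near, (far, right))
    return tree


T = (("a", "b"), ("c", "d"))
print(reroot(T, "a"))            # ('a', ('b', ('c', 'd')))
print(reroot(T, ("a", "b")))     # (('a', 'b'), ('c', 'd'))
```

Each loop iteration moves the root one edge closer to $x$, so the work is proportional to the depth of $x$ apart from the repeated `contains` scans, which a production implementation would replace with parent pointers.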

Lemma 8. Given a tuple ⟨G, S, v⟩ and G' := TBRG(v, x, y), for x ∈ V(Gv) and y ∈ V(G) \ V(Gv). Then $\mathcal{C}_\Gamma(G', S) = \min_{G'' \in TBR_G(v)} \mathcal{C}_\Gamma(G'', S)$ iff $\mathcal{C}_\Gamma(RR(G_v, x), S) = \min_{x' \in V(G_v)} \mathcal{C}_\Gamma(RR(G_v, x'), S)$ and $\mathcal{C}_\Gamma(G', S) = \min_{G'' \in TBR_G(v, x)} \mathcal{C}_\Gamma(G'', S)$.
Proof. (⇒) Let G1 := TBRG(v, x1, y), for x1 ∈ V(Gv) and x1 ≠ x. Now observe that, for all g ∈ V(G) \ V(Gv), $\mathcal{C}_\Gamma(G', S, g) = \mathcal{C}_\Gamma(G_1, S, g)$. Also, let G2 := TBRG(v, x, y1), for y1 ∈ V(G) \ V(Gv) and y1 ≠ y. Observe that, for all g ∈ V(Gv), $\mathcal{C}_\Gamma(G', S, g) = \mathcal{C}_\Gamma(G_2, S, g)$. Thus, if G' gives the minimum duplication, duplication-loss, or deep coalescence score among all trees in TBRG(v), then the score contributions of the vertices in V(Gv) and V(G) \ V(Gv) are independent. Now looking at the vertices of G', the best score is achieved when Gv is rooted at x, i.e., $\mathcal{C}_\Gamma(RR(G_v, x), S) = \min_{x' \in V(G_v)} \mathcal{C}_\Gamma(RR(G_v, x'), S)$; also the best score is achieved when RR(Gv, x) is regrafted at y, i.e., $\mathcal{C}_\Gamma(G', S) = \min_{G'' \in TBR_G(v, x)} \mathcal{C}_\Gamma(G'', S)$. (⇐) This follows similarly.
Lemma 8 implies that a tree in TBRG(v) with the minimum duplication, duplication-loss, or deep coalescence cost can be obtained by optimizing the rooting for the pruned subtree and the regraft location separately. A best rooting for the pruned subtree is linear time computable [17,25], and the solution to the R-SEC-Γ problem identifies a best regraft location in Θ(n) time. This allows us to obtain a tree in TBRG(v) with the minimum duplication, duplication-loss, or deep coalescence cost by evaluating only Θ(n) trees. Thus the R-TEC-Γ problem can be solved in Θ(n) time. The TEC-Γ problem can be solved by calling the solution of the R-TEC-Γ problem Θ(n) times, and Theorem 4 follows.
Theorem 4. The TEC-Γ problem can be solved in Θ(n²) time.
Experimental results
We tested the performance of the gene tree rearrangement algorithms on a set of 106 gene alignments containing sequences from 8 yeast taxa from Rokas et al. [26].
There is a well accepted phylogeny for the yeast species, and the data set has been used to test algorithms for gene tree parsimony based on the deep coalescence problem [27,28]. We constructed maximum likelihood gene trees for each gene using RAxML-VI-HPC version 7.0.4 [29]; the gene trees were rooted with the outgroup Candida albicans. We used the new error correction algorithms to examine how much a single SPR rearrangement in the gene tree reduces the reconciliation cost based on deep coalescence and also gene duplications and losses. Over all genes the SPR error correction reduced the total deep coalescence cost from 151 to 53 (Table 1) and the duplication and loss cost from 481 to 175 (Table 2). Both algorithms took only seconds to run for all 106 genes on a standard laptop.
We also implemented a protocol to use the gene rearrangement algorithm to correct for gene tree error in gene tree parsimony phylogenetic analyses. We first took a collection of input gene trees and performed an SPR species tree search using Duptree [30], which seeks the species tree with the minimum gene duplication cost. We used the duplication only cost (instead of duplications and losses) because when there is no complete sampling of all existing genes, the loss estimates may be inflated by missing sequences. After finding the locally optimal species tree, we used our SPR gene tree rearrangement algorithm to find gene tree topologies with a lower duplication cost. We then performed another SPR species tree search using Duptree, starting from the locally optimal species tree and using the new gene tree topologies. This search strategy is similar to the re-rooting protocol in Duptree, which checks for better gene tree roots after an SPR species tree search [30,31]. We tested this protocol on a data set of 6,084 genes (with a combined 81,525 leaves) from 14 seed plant taxa. This is the same data set used by [31], except that all gene tree clades containing sequences from a single species were collapsed to a single leaf. Our original SPR tree search found a species tree with 23,500 duplications.
The SPR tree search after the gene tree rearrangements identified the same species tree, but the new gene trees had a reconciliation cost of only 18,213. This tree search protocol took just under 4 hours on a Mac Powerbook with a 2 GHz Intel Core 2 Duo processor and 2 GB memory.

Table 1. Error correction based on deep coalescence model
Reconciliation Cost    Original    Post-Correction
0                      45          77
1                      32          15
2                      6           8
3                      9           5
4                      8           0
>4                     6           1
The number of yeast gene trees with different reconciliation costs based on the deep coalescence model both before (Original) and after (Post-Correction) the SPR error correction.

Table 2. Error correction based on duplication and loss model
Reconciliation Cost    Original    Post-Correction
0                      45          77
1-5                    32          15
6-10                   15          13
11-15                  8           0
16-20                  5           1
>20                    1           0
The number of yeast gene trees with different reconciliation costs based on the duplication and loss model both before (Original) and after (Post-Correction) the SPR error correction.

Conclusion
GT-ST reconciliation provides a powerful approach to study the patterns and processes of gene and genome evolution. Yet it can be thwarted by the error that is an inherent part of gene tree inference. Any reliable method for GT-ST reconciliation must account for gene tree error; however, any useful method also must be computationally tractable for large-scale genomic data. We introduce fast and effective algorithms to correct error in the gene trees. These algorithms, based on SPR and TBR rearrangements, greatly extend the range of gene tree errors that can be corrected compared with existing algorithms [17,18], while remaining fast enough to use on data sets with thousands of genes. These algorithms can correct trees based on a broad variety of evolutionary factors that can cause conflict between gene trees and species trees, including gene duplication, duplications and losses, and deep coalescence.
Our analysis on 106 yeast gene trees demonstrates that even a single SPR correction on the gene trees can radically improve upon the reconciliation cost. Our results in the yeast analysis are very similar to the 2-3 fold improvement in implied duplications and losses reported from the parametric gene tree estimation and reconciliation method of Rasmussen and Kellis [5]. However, in contrast to this computationally complex method, the gene tree rearrangement algorithm is extremely fast, does not require assumptions about the rates of duplication and loss, and is also amenable to corrections based on deep coalescence and duplications as well as duplications and losses. We do not claim that the gene correction algorithms produce a more accurate reconciliation than these parametric methods. However, they do present an extremely fast and flexible alternative.
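The size of the reported improvements can be checked with a few lines of arithmetic. The sketch below uses only the totals quoted in the text; the helper names are hypothetical.

```python
def fold_reduction(before, after):
    """How many times larger the original cost is than the corrected cost."""
    return before / after


def percent_reduction(before, after):
    """Relative decrease of the cost, in percent."""
    return 100.0 * (before - after) / before


# Totals as reported in the experimental results (before, after correction).
reported = {
    "deep coalescence, yeast": (151, 53),
    "duplication and loss, yeast": (481, 175),
    "duplication, seed plants": (23500, 18213),
}

for name, (before, after) in reported.items():
    print(f"{name}: {fold_reduction(before, after):.2f}-fold, "
          f"{percent_reduction(before, after):.1f}% lower")
```

Both yeast totals fall within the 2-3 fold range mentioned above, while the seed plant duplication cost drops by a smaller fraction.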
We also demonstrated that this error correction protocol could easily be incorporated into a gene tree parsimony phylogenetic analysis. Previous studies have emphasized that gene tree parsimony is sensitive to the topology of the input trees. For example, the species tree may differ depending on whether the gene trees are made using parsimony or maximum likelihood [8,10]. In our study, although the gene tree rearrangement did not affect the species tree inference, it did greatly reduce the gene duplication reconciliation cost.
While the results of the experiments are promising, they also suggest several directions for future research. First, further investigation is needed to characterize the effects of error on gene tree topologies. For example, it seems likely that gene tree errors may extend beyond a single SPR or TBR neighborhood. Yet, if we allow unlimited rearrangements, the gene trees will simply converge on the species tree topology. One simple improvement may be to weight the possible gene tree rearrangements based on support for different clades in the gene tree. Thus, well-supported clades may rarely or never be subject to rearrangement, while poorly supported clades may be subject to extensive rearrangements. Finally, these approaches implicitly assume that all differences between gene trees and species trees are due to either coalescence, duplications, or duplications and losses. Future work will seek to combine these objectives and also address lateral transfer.
Acknowledgements
The authors thank André Wehe for his generous support with the implementation. This work was conducted in part with support from the Gene Tree Reconciliation Working Group at NIMBioS through NSF award EF-0832858, with additional support from the University of Tennessee. R.C. and O.E. were supported in part by NSF awards #0830012 and #10117189.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.
Author details
1 Department of Computer Science, Iowa State University, Ames, IA 50011, USA.
2 Department of Biology, University of Florida, Gainesville, FL 32611, USA.
Authors' contributions
RC was responsible for algorithm design and program implementation, and wrote major parts of the paper. JGB performed the experimental evaluation and the analysis of the results, and contributed to the writing of the paper. OE supervised the project, and contributed to the algorithmic design and writing of the paper. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Published: 25 June 2012
References
1. Maddison WP: Gene Trees in Species Trees. Systematic Biology 1997, 46:523-536.
2. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the gene lineage into its species lineage. A parsimony strategy illustrated by cladograms constructed from globin sequences. Systematic Zoology 1979, 28:132-163.
3. Guigó R, Muchnik I, Smith TF: Reconstruction of Ancient Molecular Phylogeny. Molecular Phylogenetics and Evolution 1996, 6(2):189-213.
4. Slowinski JB, Knight A, Rooney AP: Inferring Species Trees from Gene Trees: A Phylogenetic Analysis of the Elapidae (Serpentes) Based on the Amino Acid Sequences of Venom Proteins. Molecular Phylogenetics and Evolution 1997, 8:349-362.
5. Rasmussen MD, Kellis M: A Bayesian approach for fast and accurate gene tree reconstruction. Molecular Biology and Evolution 2011, 28:273-290.
6. Burleigh JG, Bansal MS, Wehe A, Eulenstein O: Locating Large-Scale Gene Duplication Events through Reconciled Trees: Implications for Identifying Ancient Polyploidy Events in Plants. Journal of Computational Biology 2009, 16:1071-1083.
7. Hahn MW: Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biology 2007, 8:R141.
8. Burleigh JG, Bansal MS, Eulenstein O, Hartmann S, Wehe A, Vision TJ: Genome-scale phylogenetics: inferring the plant tree of life from 18,896 discordant gene trees. Systematic Biology 2011, 60(2):117-125.

9. Huang H, Knowles LL: What Is the Danger of the Anomaly Zone for Empirical Phylogenetics? Systematic Biology 2009, 58:527-536.
10. Sanderson MJ, McMahon MM: Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evolutionary Biology 2007, 7(Suppl 1):S3.
11. Berglund-Sonnhammer A, Steffansson P, Betts MJ, Liberles DA: Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony. Journal of Molecular Evolution 2006, 63:240-250.
12. Vernot B, Stolzer M, Goldman A, Durand D: Reconciliation with non-binary species trees. Computational Systems Bioinformatics 2007, 53:441-452.
13. Yu Y, Warnow T, Nakhleh L: Algorithms for MDC-Based Multi-locus Phylogeny Inference. In RECOMB, Volume 6577 of Lecture Notes in Computer Science. Springer; Bafna V, Sahinalp SC 2011:531-545.
14. Cotton JA, Page RDM: Going nuclear: gene family evolution and vertebrate phylogeny reconciled. P Roy Soc Lond B Biol 2002, 269:1555-1561.
15. Joly S, Bruneau A: Measuring Branch Support in Species Trees Obtained by Gene Tree Parsimony. Systematic Biology 2009, 58:100-113.
16. Arvestad L, Berglund A, Lagergren J, Sennblad B: Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. RECOMB 2004, 326-335.
17. Chen K, Durand D, Farach-Colton M: Notung: a program for dating gene duplications and optimizing gene family trees. Journal of Computational Biology 2000, 7:429-447.
18. Durand D, Halldórsson BV, Vernot B: A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction. Journal of Computational Biology 2006, 13(2):320-335.
19. Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 2001, 5:1-13.
20. Bordewich M, Semple C: On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics 2004, 8:409-423.
21. Zhang L: On a Mirkin-Muchnik-Smith Conjecture for Comparing Molecular Phylogenies. Journal of Computational Biology 1997, 4(2):177-187.
22. Page RDM: Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Systematic Biology 1994, 43:58-77.
23. Eulenstein O: Predictions of gene-duplications and their phylogenetic development. PhD thesis, University of Bonn, Germany 1998. [GMD Research Series No. 20/1998, ISSN: 1435-2699].
24. Bender MA, Farach-Colton M: The LCA Problem Revisited. LATIN 2000, 88-94.
25. Górecki P, Tiuryn J: Inferring phylogeny from whole genomes. ECCB (Supplement of Bioinformatics) 2006, 116-122.
26. Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 2003, 425(6960):798-804.
27. Than C, Nakhleh L: Species tree inference by minimizing deep coalescences. PLoS Comput Biol 2009, 5(9):e1000501.
28. Bansal MS, Burleigh JG, Eulenstein O: Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models. BMC Bioinformatics 2010, 11(Suppl 1):S42.
29. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 2006, 22(21):2688-2690.
30. Wehe A, Bansal MS, Burleigh JG, Eulenstein O: DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony. Bioinformatics 2008, 24(13).
31. Chang W, Burleigh JG, Fernández-Baca D, Eulenstein O: An ILP solution for the gene duplication problem. BMC Bioinformatics 2011, 12(Suppl 1):S14.
doi:10.1186/1471-2105-13-S10-S11
Cite this article as: Chaudhary et al.: Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence. BMC Bioinformatics 2012, 13(Suppl 10):S11.


Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence
Ruchi Chaudhary (1), ruchic@iastate.edu
J Gordon Burleigh (2), gburleigh@ufl.edu
Oliver Eulenstein (corresponding author), oeulenst@cs.iastate.edu
(1) Department of Computer Science, Iowa State University, Ames, IA 50011, USA
(2) Department of Biology, University of Florida, Gainesville, FL 32611, USA
Article type: Proceedings
Source: BMC Bioinformatics 2012, 13(Suppl 10):S11 (ISSN 1471-2105)
Supplement: Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11). Editors: Jianer Chen, Ion Mandoiu, Raj Sunderraman, Jianxin Wang and Alexander Zelikovsky. Conference: 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11), Changsha, China, 27-29 May 2011, http://www.cs.gsu.edu/isbra11/
URL: http://www.biomedcentral.com/1471-2105-13-S10-S11
DOI: 10.1186/1471-2105-13-S10-S11
Published: 25 June 2012
Copyright 2012 Chaudhary et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background
Gene tree - species tree reconciliation problems infer the patterns and processes of gene evolution within a species tree. Gene tree parsimony approaches seek the evolutionary scenario that implies the fewest gene duplications, duplications and losses, or deep coalescence (incomplete lineage sorting) events needed to reconcile a gene tree and a species tree. While a gene tree parsimony approach can be informative about genome evolution and phylogenetics, error in gene trees can profoundly bias the results.
Results
We introduce efficient algorithms that rapidly search local Subtree Prune and Regraft (SPR) or Tree Bisection and Reconnection (TBR) neighborhoods of a given gene tree to identify a topology that implies the fewest duplications, duplications and losses, or deep coalescence events. These algorithms improve on the current solutions by a factor of n for searching SPR neighborhoods and n² for searching TBR neighborhoods, where n is the number of taxa in the given gene tree. They provide a fast error correction protocol for ameliorating the effects of gene tree error by allowing small rearrangements in the topology to improve the reconciliation cost. We also demonstrate a simple protocol to use the gene rearrangement algorithm to improve gene tree parsimony phylogenetic analyses.
Conclusions
The new gene tree rearrangement algorithms provide a fast method to address gene tree error. They do not make assumptions about the underlying processes of genome evolution, and they are amenable to analyses of large-scale genomic data sets. These algorithms are also easily incorporated into gene tree parsimony phylogenetic analyses, potentially producing more credible estimates of reconciliation cost.
Introduction
The availability of large-scale genomic data from a wide variety of taxa has revealed much incongruence between gene trees and the phylogeny of the species in which the genes evolve. This incongruence may be caused by evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. The variation in gene tree topologies can be used to infer the processes of genome evolution. Gene tree - species tree (GT-ST) reconciliation methods seek to map the history of gene trees into the context of species evolution and thus potentially link processes of gene evolution to phenotypic changes and diversification. Yet these methods can be confounded by error in the gene trees, which also may cause incongruence between the gene and species topologies. We introduce efficient algorithms to correct gene tree topologies based on the gene duplication, duplication and loss, or deep coalescence cost models. The algorithms work by identifying the small rearrangements in the gene trees that reduce the reconciliation cost. They are extremely fast and thus amenable to analyses of enormous genomic data sets.
Perhaps the most commonly used and computationally feasible approach to GT-ST reconciliation is gene tree parsimony, which seeks to infer the fewest evolutionary events (e.g., duplication, loss, coalescence, or lateral gene transfer) needed to reconcile a gene tree and species tree topology [1]. This approach also can be extended to infer species phylogenies, finding the species tree for which the gene trees imply the fewest evolutionary events (e.g., [2-4]). However, the gene trees often are estimated using heuristic methods from short sequence alignments, and consequently, there is often much error in the estimated gene tree topologies. Error in the gene trees creates more GT-ST incongruence and can radically affect GT-ST reconciliation analyses, implying far more duplications, duplications and losses, or deep coalescence events than actually exist. For example, Rasmussen and Kellis [5] estimated that error in gene tree reconstruction can lead to 2-3 fold overestimates of gene duplications and losses. Gene tree error also can erroneously imply large numbers of duplications near the root of the species tree [6,7], and it can mislead gene tree parsimony phylogenetic analyses (e.g., [8-10]).
Several approaches have been proposed to address gene tree error in GT-ST reconciliation. First, questionable nodes in a gene tree or nodes with low support may be collapsed prior to gene tree reconciliation, and the resulting non-binary gene trees may be reconciled with species trees [11-13]. Similarly, GT-ST reconciliations can use a distribution of gene tree topologies, such as bootstrap gene trees, rather than a single gene tree estimate [6,14,15]. Both of these approaches may help account for stochastic error and uncertainty in gene tree topologies, but they do not explicitly confront gene tree error. Methods also exist to simultaneously infer the gene tree topology and the gene tree reconciliation with a known species tree [5,16]. While these sophisticated statistical approaches appear very promising, they are computationally intensive, and it is unclear if they will be tractable for large-scale analyses. Another, perhaps more computationally feasible, approach is to allow a limited number of local rearrangements in the gene tree topology if they reduce the reconciliation cost [17,18].
Previous work [17,18] described a method to allow NNI-branch swaps on selected branches of a gene tree to reduce the reconciliation cost. Following [17,18], we address gene tree error in the reconciliation process by assuming that the correct gene tree can be found in a particular neighborhood of the given gene tree. This neighborhood consists of all trees that are within one edit operation of the gene tree. We describe this approach for the gene duplication, duplication and loss, and deep coalescence models, which identify the fewest respective events implied from a given gene tree and given species tree. While [17,18] use Nearest Neighbor Interchange (NNI) edit operations to define the neighborhood, we use the standard tree edit operations SPR [19,20] and TBR [19], which significantly extend the search space of the NNI neighborhood. The SPR and TBR local search problems find a tree in the SPR and TBR neighborhood of a given gene tree, respectively, that has the smallest reconciliation cost when reconciled with a given species tree. Using the algorithm from Zhang [21], the best known (naïve) runtimes are O(n³) for the SPR local search problem and O(n⁴) for the TBR local search problem, where n is the number of taxa in the given gene tree. These runtimes typically are prohibitively long for the computation of larger GT-ST reconciliations. We improve on these solutions by a factor of n for the SPR local search problem and a factor of n² for the TBR local search problem. This makes the local search under the TBR edit operation as efficient as under the SPR edit operation, and it provides a high-speed gene tree error-correction protocol that is computationally feasible for large-scale genomic data sets.
We also evaluated the performance of our algorithms using the implementation of the SPR based local search algorithms. Note that the SPR neighborhood is properly contained in the TBR neighborhood for any given tree. Thus the performance of the SPR based program provides a conservative estimate of the performance of the TBR based program. We test our programs on a collection of 106 yeast gene trees, some of which contain hundreds of leaves, and we demonstrate how they can be easily incorporated into large-scale gene tree parsimony phylogenetic analyses.
Basic notation and preliminaries
Throughout this paper, the term tree refers to a rooted full binary (phylogenetic) tree.
Let T be a tree. The leaf set of T is denoted by Le(T). The set of all vertices of T is denoted by V(T) and the set of all edges by E(T). The root of T is denoted by Ro(T). The set of internal vertices of T is I(T):= V(T)\Le(T).
Given a vertex v ∈ V(T), we denote the parent of v by PaT(v). Let u := PaT(v). The edge that connects v with u is written as (u, v). The first element in the pair is always the parent of the second element. The set of all children of v is denoted by ChT(v) and the children are called siblings. For w ∈ ChT(v), the sibling of w is denoted by SbT(w).
We define ≤T to be the partial order on V(T) where x ≤T y if y is a vertex on the path between Ro(T) and x, and write x <T y if x ≤T y and x ≠ y. Given U ⊆ V(T), we denote by T(U) the unique rooted subtree of T that spans U with the minimum number of vertices. Furthermore, the restriction of T to U, denoted by T|U, is the rooted tree that is obtained from T(U) by suppressing all non-root vertices of degree two. The subtree of T rooted at u ∈ V(T), denoted by Tu, is defined to be T|U, for U := {v ∈ Le(T): v ≤T u}. Two trees T1 and T2 are called isomorphic if there exists a bijection between the vertex sets of T1 and T2 which maps a vertex u1 of T1 to vertex u2 of T2 if the subtree rooted at u1 in T1 has the same leaf set as the subtree rooted at u2 in T2. If an isomorphism exists between T1 and T2, we write T1 ≃ T2.
Given a function f : A → B, we denote by f(A'), where A' ⊆ A, the set of images of the elements of A' under f.
The reconciliation cost models
A species tree is a tree that depicts the branching pattern representing the divergence of a set of species, whereas a gene tree is a tree that depicts the evolutionary history among the sequences encoding one gene (or gene family) for a given set of species. We assume that each leaf of the gene tree is labeled with the species from which that gene was sampled. Let G be a gene tree and S a species tree. In order to compare G with S, we require a mapping from each gene g ∈ V(G) to the most recent species in S that could have contained g.
Definition 1 (Mapping). The leaf-mapping ℒG,S : Le(G) → Le(S) is a surjection that maps each leaf g ∈ Le(G) to that unique leaf s ∈ Le(S) which has the same label as g. The extension ℳG,S : V(G) → V(S) is the mapping defined by ℳG,S(g) := LCA(ℒG,S(Le(Gg))). For convenience, we write ℳ(g) instead of ℳG,S(g) when G and S are clear from the context.
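As an illustration of Definition 1, the following minimal Python sketch computes ℳ(g) for every vertex of a small gene tree. It assumes, for illustration only, that trees are nested 2-tuples whose leaves are species labels and that each vertex of S is identified by its leaf set. The function names are hypothetical, and the LCA lookup is deliberately naive rather than the constant-time method of [24].

```python
def leaf_set(tree):
    """Leaf labels below a vertex of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def vertices(tree):
    """All vertices (given as subtrees) of a nested-tuple tree, in preorder."""
    yield tree
    if not isinstance(tree, str):
        yield from vertices(tree[0])
        yield from vertices(tree[1])


def lca_mapping(gene_tree, species_tree):
    """Return (gene vertex, M(g)) pairs, where M(g) is given as a leaf set of S."""
    species_clusters = [leaf_set(s) for s in vertices(species_tree)]
    mapping = []
    for g in vertices(gene_tree):
        labels = leaf_set(g)
        # The LCA is the smallest species-tree vertex whose leaf set contains all labels below g.
        m = min((c for c in species_clusters if labels <= c), key=len)
        mapping.append((g, m))
    return mapping


S = (("a", "b"), "c")
G = (("a", "c"), "b")
for g, m in lca_mapping(G, S):
    print(g, "->", sorted(m))
```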
Definition 2 (Comparability). Given trees G and S, we say that G is comparable to S if a leaf-mapping ℒG,S(g) is well defined.
Throughout this paper we use the following terminology: G is a gene tree that is comparable to the species tree S through a leaf-mapping ℒG,S, and n is the number of taxa present in both input trees.
Now we define different reconciliation costs from G to S for a given mapping ℳG,S that extends ℒG,S. The reconciliation costs are based on the models of gene duplication [22,23], duplication-loss [21], and deep coalescence [21].
Definition 3 (Duplication cost).
• The duplication cost from g ∈ V(G) to S,
\[ \mathcal{C}_D(G, S, g) := \begin{cases} 1, & \text{if } \mathcal{M}(g) \in \mathcal{M}(Ch(g)); \\ 0, & \text{otherwise.} \end{cases} \]
• The duplication cost from G to S,
\[ \mathcal{C}_D(G, S) := \sum_{g \in I(G)} \mathcal{C}_D(G, S, g). \]
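Definition 3 can be evaluated directly with a short Python sketch. It assumes, for illustration only, that trees are nested 2-tuples with species labels as leaves and that vertices of S are identified by their leaf sets; all function names are hypothetical and the LCA lookup is naive.

```python
def leaf_set(tree):
    """Leaf labels below a vertex of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def vertices(tree):
    """All vertices (given as subtrees) of a nested-tuple tree, in preorder."""
    yield tree
    if not isinstance(tree, str):
        yield from vertices(tree[0])
        yield from vertices(tree[1])


def duplication_cost(gene_tree, species_tree):
    """Count internal gene vertices g with M(g) in M(Ch(g))."""
    clusters = [leaf_set(s) for s in vertices(species_tree)]

    def M(g):
        labels = leaf_set(g)
        return min((c for c in clusters if labels <= c), key=len)

    cost = 0
    for g in vertices(gene_tree):
        if isinstance(g, str):
            continue  # only internal vertices I(G) contribute
        if M(g) in (M(g[0]), M(g[1])):
            cost += 1  # g maps to the same species vertex as one of its children
    return cost


S = (("a", "b"), "c")
print(duplication_cost(S, S))
print(duplication_cost((("a", "c"), "b"), S))
```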
Definition 4 (Duplication-loss cost).
• The loss cost from g ∈ V(G) to S,
\[ \mathcal{C}_L(G, S, g) := \begin{cases} 0, & \text{if } \forall h \in Ch(g) : \mathcal{M}(g) = \mathcal{M}(h); \\ \sum_{h \in Ch(g)} | d_S(\mathcal{M}(g), \mathcal{M}(h)) - 1 |, & \text{otherwise.} \end{cases} \]
• The duplication-loss cost from G to S,
\[ \mathcal{C}_{DL}(G, S) := \sum_{g \in I(G)} \left( \mathcal{C}_D(G, S, g) + \mathcal{C}_L(G, S, g) \right). \]
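The duplication-loss cost of Definition 4 can be sketched in the same style. The sketch assumes that dS(x, y) denotes the number of edges on the path between x and y in S, that trees are nested 2-tuples with species labels as leaves, and that vertices of S are identified by their leaf sets. These are assumptions of the illustration, and all function names are hypothetical.

```python
def leaf_set(tree):
    """Leaf labels below a vertex of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def depths(species_tree):
    """Map each vertex of S (as a leaf set) to its number of edges below the root."""
    depth = {}

    def walk(s, d):
        depth[leaf_set(s)] = d
        if not isinstance(s, str):
            walk(s[0], d + 1)
            walk(s[1], d + 1)

    walk(species_tree, 0)
    return depth


def dup_loss_cost(gene_tree, species_tree):
    """Sum of duplication and loss costs over the internal vertices of the gene tree."""
    depth = depths(species_tree)

    def M(g):
        labels = leaf_set(g)
        return min((c for c in depth if labels <= c), key=len)

    total = 0
    stack = [gene_tree]
    while stack:
        g = stack.pop()
        if isinstance(g, str):
            continue
        stack.extend(g)
        mg = M(g)
        child_maps = [M(h) for h in g]
        dup = 1 if mg in child_maps else 0
        if all(mh == mg for mh in child_maps):
            loss = 0
        else:
            # M(h) lies at or below M(g), so the path length is a depth difference.
            loss = sum(abs(depth[mh] - depth[mg] - 1) for mh in child_maps)
        total += dup + loss
    return total


S = (("a", "b"), "c")
print(dup_loss_cost((("a", "c"), "b"), S))
```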
Definition 5 (Deep coalescence cost).
• The number of lineages from g ∈ V(G) to h ∈ Ch(g) in S,
\[ \mathcal{C}_{DC}(G, S, g) := \sum_{h \in Ch(g)} d_S(\mathcal{M}(g), \mathcal{M}(h)). \]
• The deep coalescence cost from G to S,
\[ \mathcal{C}_{DC}(G, S) := \sum_{g \in I(G)} \mathcal{C}_{DC}(G, S, g) - |E(S)|. \]
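A corresponding sketch for Definition 5 follows. As before it assumes nested 2-tuple trees, species-tree vertices identified by their leaf sets, and dS read as the number of edges on a path in S; the function names are hypothetical.

```python
def leaf_set(tree):
    """Leaf labels below a vertex of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def depths(species_tree):
    """Map each vertex of S (as a leaf set) to its number of edges below the root."""
    depth = {}

    def walk(s, d):
        depth[leaf_set(s)] = d
        if not isinstance(s, str):
            walk(s[0], d + 1)
            walk(s[1], d + 1)

    walk(species_tree, 0)
    return depth


def deep_coalescence_cost(gene_tree, species_tree):
    """Sum of path lengths d_S(M(g), M(h)) over gene-tree edges, minus |E(S)|."""
    depth = depths(species_tree)

    def M(g):
        labels = leaf_set(g)
        return min((c for c in depth if labels <= c), key=len)

    total = 0
    stack = [gene_tree]
    while stack:
        g = stack.pop()
        if isinstance(g, str):
            continue
        stack.extend(g)
        total += sum(depth[M(h)] - depth[M(g)] for h in g)
    edges_in_s = len(depth) - 1  # a tree has one edge fewer than vertices
    return total - edges_in_s


S = (("a", "b"), "c")
print(deep_coalescence_cost((("a", "c"), "b"), S))
```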
The error-correction problems
Here we give definitions for the tree rearrangement operations TBR [19] and SPR [19,20], and then formulate the Error-Correction problems that were motivated in the introduction.
Definition 6 (Tree Bisection and Reconnection (TBR)). Let T be a tree. For this definition, we regard the planted tree Pl(T) as the tree obtained by adding the root edge {r, Ro(T)} to E(T), where r ∉ V(T).
Let e := (u, v) ∈ E(T), and X and Y be the connected components that are obtained by removing edge e from T such that v ∈ X and u ∈ Y. We define TBRT (v, x, y) for x ∈ X and y ∈ Y to be the tree that is obtained from Pl(T) by first deleting edge e, and then adjoining a new edge f between X and Y as follows:
1. If x ≠ Ro(X) then suppress Ro(X) and create a new root by subdividing edge (Pa(x), x).
2. Subdivide edge (Pa(y), y) by introducing a new vertex y'.
3. Re-connect components X and Y by adding edge f = (y', Ro(X)).
4. Suppress the vertex u, and rename vertex y' as u.
5. Contract the root edge.
We say that the tree TBRT (v, x, y) is obtained from T by a tree bisection and reconnection (TBR) operation that bisects the tree T into the components X and Y and reconnects them above the nodes x and y. (See Figure 1.) We define the following neighborhoods for the TBR operation:
1. TBRG(v, x) := ∪ y∈Y TBRG(v, x, y)
2. TBRG(v) := ∪x∈X TBRG(v, x)
3. TBRG := ∪(u, v)∈E(G) TBRG(v)
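The neighborhood TBRG(v) can be enumerated by brute force for small trees. The sketch below assumes nested 2-tuple trees with distinct leaf labels, so that a subtree can be located by equality; it is an illustration rather than the paper's algorithm, and all function names are hypothetical.

```python
def canon(tree):
    """Order-independent form of a tree, so that equal topologies compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(tree[0]), canon(tree[1])), key=repr))


def prune(tree, v):
    """Remove subtree v from tree and suppress its parent (component Y)."""
    if tree == v:
        return None
    if isinstance(tree, str):
        return tree
    left, right = prune(tree[0], v), prune(tree[1], v)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)


def regrafts(rest, sub):
    """Attach sub above every vertex y of rest, including the root."""
    yield (rest, sub)
    if not isinstance(rest, str):
        left, right = rest
        for t in regrafts(left, sub):
            yield (t, right)
        for t in regrafts(right, sub):
            yield (left, t)


def rerootings(tree):
    """Every distinct rooting of the pruned subtree (component X)."""
    yield tree
    if isinstance(tree, str):
        return

    def down(sub, up):
        yield (sub, up)
        if not isinstance(sub, str):
            yield from down(sub[0], (sub[1], up))
            yield from down(sub[1], (sub[0], up))

    for sub, other in ((tree[0], tree[1]), (tree[1], tree[0])):
        if not isinstance(sub, str):
            yield from down(sub[0], (sub[1], other))
            yield from down(sub[1], (sub[0], other))


def tbr_neighborhood_of(tree, v):
    """TBR_G(v): each rooting x of the pruned subtree regrafted above each y in the remainder."""
    rest = prune(tree, v)
    return {canon(t) for x in rerootings(v) for t in regrafts(rest, x)}


G = ((("a", "b"), "c"), ("d", "e"))
print(len(tbr_neighborhood_of(G, G[0])))
```

Taking the union of `tbr_neighborhood_of(G, v)` over all non-root vertices v would give the full neighborhood TBRG.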
Figure 1. A TBR operation. Tree T' = TBRT(v, x, y) results from T after performing a single TBR operation.
Definition 7 (Subtree Prune and Regrafting (SPR)). The SPR operation is defined as a special case of the TBR operation. Let e := (u, v) ∈ E(T), and X and Y be the connected components that are obtained by removing edge e from T such that v ∈ X and u ∈ Y. We define SPRT (v, y) for y ∈ Y to be TBRT (v, v, y). We say that the tree SPRT (v, y) is obtained from T by performing a subtree prune and regraft (SPR) operation that prunes subtree Tv and regrafts it above y. (See Figure 2(a).)
Figure 2. The NNI adjacency graph. (a) The tree $\bar{G}$ is obtained from G by pruning and regrafting subtree Gv to the root of G. The vertex x ∈ V(G) is suppressed, and the new vertex above the root in $\bar{G}$ is named x. (b) Two NNI operations NNIG'(z') and NNIG'(z) produce the left-child G'l and right-child G'r of G' in $\mathcal{X}$.
We define the following neighborhoods for the SPR operation:
1. SPRG(v) := ∪y∈Y SPRG(v, y)
2. SPRG := ∪(u, v)∈E(G) SPRG(v)
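For small trees the two SPR neighborhoods can be listed explicitly. The following sketch assumes nested 2-tuple trees with distinct leaf labels; the function names are hypothetical and the enumeration is brute force.

```python
def canon(tree):
    """Order-independent form of a tree, so that equal topologies compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(tree[0]), canon(tree[1])), key=repr))


def vertices(tree):
    """All vertices (given as subtrees) of a nested-tuple tree, in preorder."""
    yield tree
    if not isinstance(tree, str):
        yield from vertices(tree[0])
        yield from vertices(tree[1])


def prune(tree, v):
    """Remove subtree v from tree and suppress its parent."""
    if tree == v:
        return None
    if isinstance(tree, str):
        return tree
    left, right = prune(tree[0], v), prune(tree[1], v)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)


def regrafts(rest, sub):
    """Attach sub above every vertex y of rest, including the root."""
    yield (rest, sub)
    if not isinstance(rest, str):
        left, right = rest
        for t in regrafts(left, sub):
            yield (t, right)
        for t in regrafts(right, sub):
            yield (left, t)


def spr_neighborhood_of(tree, v):
    """SPR_G(v): the pruned subtree v regrafted above every y in the remainder."""
    return {canon(t) for t in regrafts(prune(tree, v), v)}


def spr_neighborhood(tree):
    """SPR_G: union of SPR_G(v) over all non-root vertices v."""
    result = set()
    for v in vertices(tree):
        if v != tree:
            result |= spr_neighborhood_of(tree, v)
    return result


G = (("a", "b"), ("c", "d"))
print(len(spr_neighborhood_of(G, "a")), len(spr_neighborhood(G)))
```

Note that SPRG(v) contains the tree G itself, since regrafting above the former sibling of v restores the original topology.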
We now state the SPR based error-correction problems for duplication (D), duplication-loss (DL), and deep coalescence (DC). Let Γ ∈ {D, DL, DC}.
Problem 1 (SPR based error-correction for Γ (SEC-Γ))
Instance: A gene tree G and a species tree S.
Find: A gene tree G* ∈ SPRG such that $\mathcal{C}_\Gamma(G^*, S) = \min_{G' \in SPR_G} \mathcal{C}_\Gamma(G', S)$.
The TBR based error-correction for Γ (TEC-Γ) problems are defined analogously to the SPR based error-correction for Γ (SEC-Γ) problems.
Solving the SEC-Γ problems
In this section we study the SPR based error-correction problems, for duplication (D), duplication-loss (DL), and deep coalescence (DC), in more detail. Our efficient solutions for these problems are based on solving restricted versions of these problems efficiently. For each Γ ∈ {D, DL, DC} we first define a restricted version of the SEC-Γ problem, which we call the restricted SPR based error-correction for Γ (R-SEC-Γ) problem.
Problem 2 (Restricted SPR based error-correction for Γ (R-SEC-Γ))
Instance: A gene tree G, a species tree S, and v ∈ V(G).
Find: A gene tree G* ∈ SPRG(v) such that $\mathcal{C}_\Gamma(G^*, S) = \min_{G' \in SPR_G(v)} \mathcal{C}_\Gamma(G', S)$.
Observation 1. Let Γ ∈ {D, DL, DC}. Given a gene tree G and a species tree S, the SEC-Γ problem can be solved as follows: (i) solve the R-SEC-Γ problem for every v ∈ V(G) where v ≠ Ro(G), (ii) among all solutions found, return a minimum scoring gene tree G*.
Naïvely, the R-SEC-Γ problem can be solved in Θ(n²) time by computing the cost $\mathcal{C}_\Gamma(G', S)$ for each G' ∈ SPRG(v). The cost for a given gene and species tree can be computed in Θ(n) time [21]. We introduce a novel algorithm for the R-SEC-Γ problem that improves by a factor of n on the naïve solution. This speedup is achieved by semi-ordering the trees in SPRG(v), for each v ∈ V(G), such that the score-difference of any two consecutive trees in this order can be computed in constant time.
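The naïve strategy, combined with Observation 1, can be written down as a brute-force baseline. The sketch below is not the paper's algorithm. It assumes nested 2-tuple trees with distinct leaf labels, uses the duplication cost as the example score, and relies on a naive LCA lookup, so it is slower than the Θ(n²) bound per vertex quoted above. All function names are hypothetical.

```python
def leaf_set(tree):
    """Leaf labels below a vertex of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def vertices(tree):
    """All vertices (given as subtrees) of a nested-tuple tree, in preorder."""
    yield tree
    if not isinstance(tree, str):
        yield from vertices(tree[0])
        yield from vertices(tree[1])


def duplication_cost(gene_tree, species_tree):
    """Number of internal gene vertices mapped to the same species vertex as a child."""
    clusters = [leaf_set(s) for s in vertices(species_tree)]

    def M(g):
        labels = leaf_set(g)
        return min((c for c in clusters if labels <= c), key=len)

    return sum(1 for g in vertices(gene_tree)
               if not isinstance(g, str) and M(g) in (M(g[0]), M(g[1])))


def prune(tree, v):
    """Remove subtree v from tree and suppress its parent."""
    if tree == v:
        return None
    if isinstance(tree, str):
        return tree
    left, right = prune(tree[0], v), prune(tree[1], v)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)


def regrafts(rest, sub):
    """Attach sub above every vertex of rest, including the root."""
    yield (rest, sub)
    if not isinstance(rest, str):
        left, right = rest
        for t in regrafts(left, sub):
            yield (t, right)
        for t in regrafts(right, sub):
            yield (left, t)


def naive_r_sec(G, S, v, cost=duplication_cost):
    """R-SEC by exhaustion: score every tree in SPR_G(v) from scratch."""
    return min(regrafts(prune(G, v), v), key=lambda t: cost(t, S))


def naive_sec(G, S, cost=duplication_cost):
    """SEC via Observation 1: solve R-SEC for each non-root v, keep the best tree."""
    best_per_v = [naive_r_sec(G, S, v, cost) for v in vertices(G) if v != G]
    return min(best_per_v, key=lambda t: cost(t, S))


S = (("a", "b"), "c")
G = (("a", "c"), "b")
best = naive_sec(G, S)
print(best, duplication_cost(best, S))
```

The speedup described in the text comes from replacing the from-scratch scoring inside `naive_r_sec` with constant-time score differences between consecutive trees.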
Ordering the trees in SPRG(v)
Consider a graph on trees in SPRG(v), in which every two adjacent trees are one NNI [19] operation apart. We show that such a graph is a rooted full binary tree, after providing necessary definitions.
Definition 8 (Nearest Neighbor Interchange (NNI)). We define the NNI operation as a special case of the SPR operation. Let e ∈ E(T) where e := (u, v), and X and Y be the connected components that are obtained by removing edge e from T such that v ∈ X and u ∈ Y. We define NNIT (v) to be SPRT (v, y) for y:= Pa(u), and say that NNIT (v) is obtained from T by performing nearest neighbor interchange (NNI) operation that prunes subtree Tv and regrafts it above the parent of v's parent. (See Figure 2(b).)
Definition 9 (NNI distance). Let the NNI-distance, denoted as dNNI(T1, T2), between two trees T1 and T2 over n taxa be the minimum number of NNI operations required to transform T1 into T2.
Definition 10 (NNI-adjacency graph). The NNI-adjacency graph, denoted as $X = (V, E)$, is the graph where V = SPRG(v) and {T1, T2} ∈ E if dNNI(T1, T2) = 1.
Lemma 1.  X is a tree.
Proof. We prove it by showing that there exists a unique path between every two vertices in  X . Let G', G'' ∈ V( X ), thus G', G'' ∈ SPRG(v). Let G':= SPRG(v, x1) and G'':= SPRG(v, x2). We use induction on dG(x1, x2). Let dG(x1, x2) = 1 and assume without loss of generality that x2 = PaG(x1). Thus, G'' = NNIG'' (Sb(x1)). So the hypothesis holds for dG(x1, x2) = 1. Assume now that the hypothesis is true for dG(x1, x2) ≤ k and suppose dG(x1, x2) = k + 1. Since G is a tree, there must be a unique path between x1 and x2; let y be a vertex on this path. Let dG(y, x1) = 1, and Gn := SPRG(v, y). If y = PaG(x1), then Gn = NNIG'(v); otherwise Gn = NNIG'(Sb(y)). Since dG(y, x2) = k, thus (by induction hypothesis) the hypothesis is valid for dG(x1, x2) = k + 1. □
Theorem 1.  X is a rooted full binary tree.
Proof. In view of Lemma 1, it suffices to show that except a unique vertex of degree 2 all other vertices in  X  are of degree 1 or 3. Let G' ∈ V( X ), thus G' = SPRG(v, y) for some y ∈ V(G). There are three cases:
Case 1: y is a root. Let y1 ∈ ChG(y). Let G1 := SPRG(v, y1), thus $G' = NNI_{G_1}(v)$. Hence {G1, G'} ∈ E(X). Since |ChG(y)| = 2, G' must be a degree 2 vertex in X.
Case 2: y is a leaf. Let y1 = PaG(y). Let G1 := SPRG(v, y1), thus G1 = NNIG'(v). Hence {G1, G'} ∈ E( X ), and consequently, G' is a degree 1 vertex in  X .
Case 3: y is an internal vertex. Let y1 = PaG(y) and y2 ∈ ChG(y). Let G1 := SPRG(v, y1), thus G1 = NNIG'(v). Let G2 := SPRG(v, y2), thus G' = NNIG2(v). Since y has one parent and two children in G, thus G' is a degree 3 vertex in  X .
This completes the proof.
The score difference of consecutive trees in  X 
To solve the R-SEC-Γ problems we traverse the tree X. Two adjacent trees in V(X) are one NNI operation apart. We show that the $C_\Gamma$ score of a tree can be computed in constant time from the LCA computation of its adjacent tree.
Let e := (G', G'') be an edge in X. Let x := Pa(v), y := Sb(v), and z, z' ∈ Ch(y) in G' (see Figure 2(b)). Without loss of generality, let G'' := NNIG'(z). (Observe, G'' is similar to $G'_r$ of Figure 2(b).)
Lemma 2. ℳG'',S(y) = ℳG',S(x).
Proof. From the NNI operation, v, z' ∈ ChG''(x) and z, x ∈ ChG''(y). Also, $G''_z = G'_z$, $G''_{z'} = G'_{z'}$, and $G''_v = G'_v$, so $Le(G''_y) = Le(G'_x)$. Thus, $\mathcal{M}_{G',S}(x) = LCA(\mathcal{L}_{G',S}(Le(G'_x))) = LCA(\mathcal{L}_{G'',S}(Le(G''_y))) = \mathcal{M}_{G'',S}(y)$. □
Lemma 3. ℳG'',S(w) = ℳG',S(w), for all w ∈ V(G')\{x, y}.
Proof. For $g \in V(G'_v) \cup V(G'_z) \cup V(G'_{z'})$, since $G''_g = G'_g$, therefore ℳG',S(g) = ℳG'',S(g). Also, except for subtree $G'_x$, the rest of the tree remains the same in $G''_x$. Thus by Lemma 2, ℳG',S(PaG'(x)) = ℳG'',S(PaG''(y)). Inductively, ℳG',S(g) = ℳG'',S(g), for all g ∈ V(G')\V(G'x). □
Lemma 4. ℳG'',S(x) = LCA(ℳG',S(v), ℳG', S(z')).
Proof. From Lemma 3, ℳG'',S(v) = ℳG',S(v) and ℳG'',S(z') = ℳG',S(z'). Thus, ℳG'',S(x) = LCA(ℳG'',S(v),ℳG'',S(z')) = LCA(ℳG',S(v),ℳG',S(z')). □
Lemma 5. $C_\Gamma(G'', S, g) = C_\Gamma(G', S, g)$, for all g ∈ V(G'')\{x, y} and Γ ∈ {D, DL, DC}.
Proof. The gene duplication and loss status of a vertex, and the number of lineages from a vertex to its children in G', can change in G'' only if its mapping or the mapping of any of its children changes in ℳG'',S. From Lemma 3, and also, since ℳG'',S(w) = ℳG',S(w) for w ∈ Ch(PaG'(x)), we must have $C_\Gamma(G'', S, Pa_{G'}(x)) = C_\Gamma(G', S, Pa_{G'}(x))$. Thus the Lemma follows. □
Let e := (G', G'') ∈ E(X) and Γ ∈ {D, DL, DC}. We define $\Gamma_e := C_\Gamma(G'', S) - C_\Gamma(G', S)$ with respect to the given species tree S. Observe that this score can be negative too. We study how Γe can be computed efficiently for each edge e in X.
Theorem 2. $\Gamma_e = \sum_{g \in \{x, y\}} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right)$.
Proof. $\Gamma_e = C_\Gamma(G'', S) - C_\Gamma(G', S) = \sum_{g \in V(G')} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right) = \sum_{g \in V(G') \setminus \{x, y\}} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right) + \sum_{g \in \{x, y\}} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right) = \sum_{g \in \{x, y\}} \left( C_\Gamma(G'', S, g) - C_\Gamma(G', S, g) \right)$, where the sum over V(G')\{x, y\} vanishes by Lemma 5. □
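Lemmas 2 and 4 and Theorem 2 can be checked on a toy instance for the duplication cost. The following is a minimal sketch under stated assumptions, not the authors' code: trees are nested tuples with species-name leaves, the LCA mapping is taken as the smallest species cluster containing a node's leaves, and the score difference is oriented as the cost of G'' minus the cost of G'. All function and variable names are hypothetical.

```python
def leaf_set(t):
    return frozenset([t]) if isinstance(t, str) else leaf_set(t[0]) | leaf_set(t[1])

def nodes(t):
    return [t] if isinstance(t, str) else nodes(t[0]) + nodes(t[1]) + [t]

S = ((("a", "b"), "c"), ("d", "e"))
CLUSTERS = [leaf_set(s) for s in nodes(S)]

def lca_map(t):
    """Assumed LCA mapping: smallest species cluster covering the leaves of t."""
    return min((c for c in CLUSTERS if leaf_set(t) <= c), key=len)

def node_dup(t):
    """Duplication cost contributed by the root node of subtree t."""
    if isinstance(t, str):
        return 0
    return int(lca_map(t) in (lca_map(t[0]), lca_map(t[1])))

def total_dup(t):
    """Duplication cost of the whole tree t, recomputed from scratch."""
    return sum(node_dup(g) for g in nodes(t))

# In G', x = Pa(v) has children v and y, and y has children z and z'.
v, z, z_prime, rest = "a", "c", "b", ("d", "e")
y_old = (z, z_prime)
x_old = (v, y_old)
# In G'' = NNI_{G'}(z), y has children z and x, and x has children v and z'.
x_new = (v, z_prime)
y_new = (z, x_new)
G1, G2 = (x_old, rest), (y_new, rest)

# Lemma 2: y in G'' maps where x mapped in G'.
lemma2_holds = lca_map(y_new) == lca_map(x_old)
# Lemma 4: x in G'' maps to the LCA of the images of v and z'.
lemma4_holds = lca_map(x_new) == min(
    (c for c in CLUSTERS if lca_map(v) | lca_map(z_prime) <= c), key=len)
# Theorem 2: only x and y contribute to the score difference.
local_delta = (node_dup(x_new) + node_dup(y_new)) - (node_dup(x_old) + node_dup(y_old))
full_delta = total_dup(G2) - total_dup(G1)
```

In this instance the local difference over {x, y} agrees with the difference of the two full scores, so the update touches two vertices only.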

Definition 11. Let $\bar{G} := SPR_G(v, Ro(G))$, and let $P_{G'}$ be the path from $\bar{G}$ to G' in X. For G', we define the score-difference $\Gamma_{\bar{G},G'}$ as $\Gamma_{\bar{G},G'} := \sum_{e \in E(P_{G'})} \Gamma_e$.
Theorem 3. For given S, G, and v ∈ V(G), the tree G' ∈ V(X) is the output of the R-SEC-Γ problem iff $\Gamma_{\bar{G},G'} = \min_{G'' \in V(X)} \Gamma_{\bar{G},G''}$.
Proof. Let $\Gamma_{\bar{G},G'} = \min_{G'' \in V(X)} \Gamma_{\bar{G},G''}$. We prove that G' is the output of the R-SEC-Γ problem. Since $\Gamma_{\bar{G},G'} = \sum_{e \in E(P_{G'})} \Gamma_e = C_\Gamma(G', S) - C_\Gamma(\bar{G}, S)$, thus G' gives the minimum normalized CΓ score over all trees in V(X). Hence, G' must be the output of the R-SEC-Γ problem. The other direction follows similarly. □
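The identity used in this proof is a telescoping sum. As a worked step, write the path from $\bar{G}$ to G' in X as $\bar{G} = T_0, T_1, \ldots, T_m = G'$ with edges $e_i = (T_{i-1}, T_i)$; the labels $T_i$, $e_i$ and $m$ are introduced here for illustration only, and the derivation assumes each $\Gamma_{e_i}$ is oriented as the cost of the tree farther from $\bar{G}$ minus the cost of the nearer one.

```latex
\begin{align*}
\Gamma_{\bar{G},G'}
  &= \sum_{i=1}^{m} \Gamma_{e_i}
   = \sum_{i=1}^{m} \bigl( C_\Gamma(T_i, S) - C_\Gamma(T_{i-1}, S) \bigr) \\
  &= C_\Gamma(T_m, S) - C_\Gamma(T_0, S)
   = C_\Gamma(G', S) - C_\Gamma(\bar{G}, S).
\end{align*}
```

Since $C_\Gamma(\bar{G}, S)$ is the same constant for every G' ∈ V(X), minimizing the score-difference is equivalent to minimizing the cost itself.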
The algorithm
We describe a general algorithm, called Algo-R-SEC-Γ, to solve the R-SEC-Γ problem for each Γ ∈ {D, DL, DC}. Initially Algo-R-SEC-Γ computes (i) the root vertex of the NNI-adjacency graph X, which we call $\bar{G}$, by regrafting the subtree Gv above the root of G, (ii) the LCA mapping from $\bar{G}$ to S, and (iii) the Γ score from $\bar{G}$ to S. Then recursively Algo-R-SEC-Γ computes the LCA mapping and Γ score for every vertex G' in X when the LCA mapping and Γ score of its parent vertex in X is known. Algorithm 1 details Algo-R-SEC-Γ.
Algorithm 1 Algo-R-SEC-Γ
Input: A gene tree G, a species tree S, and v ∈ V(G)
Output: A tree G* ∈ SPRG(v) such that $C_\Gamma(G^*, S) = \min_{G' \in SPR_G(v)} C_\Gamma(G', S)$
01. Compute $\bar{G}$ by pruning Gv and regrafting at Ro(G)
02. Compute LCA mapping $\mathcal{M}_{\bar{G},S}$
03. Call $C_\Gamma(\bar{G}, S)$ = Algo-Comp-Score($\bar{G}$, S, $\mathcal{M}_{\bar{G},S}$)
04. Set BestTree = $\bar{G}$, BestScore = 0
05. Set G' = $\bar{G}$, $\mathcal{M}_{G',S} = \mathcal{M}_{\bar{G},S}$, $C_\Gamma(G', S) = C_\Gamma(\bar{G}, S)$, $\Gamma_{\bar{G},G'} = 0$
06. For each $k \neq Ro(\bar{G}_{Sb(v)})$ in preorder traversal of $\bar{G}_{Sb(v)}$, do
07.    If not backtracking, then
08.       Set x = PaG'(v), y = SbG'(v)
09.       Set G'' = NNIG'(SbG'(k))
10.       Set ℳG'',S = ℳG',S, ℳG'',S(y) = ℳG',S(x)
11.       ℳG'',S(x) = LCA(ℳG',S(k), ℳG',S(v))
12.       Call $\Gamma_{\{G',G''\}} = \sum_{h \in \{x,y\}}$ (Algo-G-Score(G'', S, ℳG'',S, h) - Algo-G-Score(G', S, ℳG',S, h))
13.       $\Gamma_{\bar{G},G''} = \Gamma_{\bar{G},G'} + \Gamma_{\{G',G''\}}$
14.       If $\Gamma_{\bar{G},G''}$ < BestScore, then
15.          Set BestTree = G'', BestScore = $\Gamma_{\bar{G},G''}$
16.    Else,
17.          Set x = PaG'(v), y = PaG'(x)
18.          Set G'' = NNIG'(v)
19.          Set ℳG'',S = ℳG',S, ℳG'',S(x) = ℳG',S(y)
20.          Set ℳG'',S(y) = LCA(ℳG',S(SbG'(x)), ℳG',S(k))
21.          Call $\Gamma_{\{G',G''\}} = \sum_{h \in \{x,y\}}$ (Algo-G-Score(G', S, ℳG',S, h) - Algo-G-Score(G'', S, ℳG'',S, h))
22.          Set $\Gamma_{\bar{G},G''} = \Gamma_{\bar{G},G'} - \Gamma_{\{G',G''\}}$
23.    Set G' = G'', ℳG',S = ℳG'',S, $\Gamma_{\bar{G},G'} = \Gamma_{\bar{G},G''}$
24. Return BestTree
Algorithm 2 Algo-Comp-Score
Input: A gene tree G, a species tree S, and LCA mapping ℳG,S
Output: $C_\Gamma(G, S)$
01. score = 0
02. For each g ∈ I(G) in preorder traversal of G, do
03.    Call score = score + Algo-G-Score(G, S, ℳG,S, g)
04. If Γ is DC, then
05.    score = score |I(S)|
06. Return score
Algorithm 3 Algo-G-Score
Input: A gene tree G, a species tree S, LCA mapping ℳG,S, and g ∈ I(G)
Output: $C_\Gamma(G, S, g)$
01. If Γ is D, then
02.    If ℳ(g) ∈ ℳ(Ch(g)), then
03.       Return 1
04. ElseIf Γ is DL, then
05.    $ls = \sum_{h \in Ch(g)} |dp(\mathcal{M}(h)) - dp(\mathcal{M}(g)) - 1|$
06.    If ℳ(g) ∈ ℳ(Ch(g)), then
07.       Return ls + 1
08.    Else
09.       Return ls
10. Else // Γ is DC
11.    Return $\sum_{h \in Ch(g)} |dp(\mathcal{M}(h)) - dp(\mathcal{M}(g))|$
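Algorithms 2 and 3 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: trees are assumed to be nested tuples whose leaves are species-name strings, the species tree is indexed with hypothetical helpers `index_species` and `lca`, and the LCA here is found by walking parent pointers rather than with the constant-time structure cited in the paper. The per-vertex formulas follow Algorithm 3 as written; the final adjustment applied in step 05 of Algorithm 2 for the DC model is deliberately left out.

```python
from itertools import count

def index_species(S):
    """Return (parent, depth, leaf_node) dictionaries for species tree S."""
    parent, depth, leaf_node = {}, {}, {}
    ids = count()
    def walk(t, par, d):
        node = next(ids)
        parent[node], depth[node] = par, d
        if isinstance(t, str):
            leaf_node[t] = node
        else:
            for child in t:
                walk(child, node, d + 1)
    walk(S, None, 0)
    return parent, depth, leaf_node

def lca(a, b, parent, depth):
    """Least common ancestor by climbing from the deeper node (not O(1))."""
    while a != b:
        if depth[a] < depth[b]:
            a, b = b, a
        a = parent[a]
    return a

def cost(G, S, model):
    """Sum the per-vertex scores of Algorithm 3 over the internal nodes of G."""
    parent, depth, leaf_node = index_species(S)
    total = 0
    def walk(g):
        nonlocal total
        if isinstance(g, str):
            return leaf_node[g]          # leaf mapping: gene leaf -> species leaf
        maps = [walk(h) for h in g]      # images of the two children
        m = lca(maps[0], maps[1], parent, depth)
        dup = 1 if m in maps else 0      # duplication: maps like one of its children
        if model == "D":
            total += dup
        elif model == "DL":
            total += dup + sum(abs(depth[h] - depth[m] - 1) for h in maps)
        else:                            # "DC": edges between the two images
            total += sum(abs(depth[h] - depth[m]) for h in maps)
        return m
    walk(G)
    return total

S = (("a", "b"), ("c", "d"))
G_same = (("a", "b"), ("c", "d"))
G_diff = (("a", "c"), ("b", "d"))
```

Each internal vertex is visited once, so the traversal itself is linear in the size of G; only the simplified `lca` helper keeps this sketch from matching the Θ(n) bound.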
Lemma 6. The R-SEC-Γ problem is correctly solved by Algo-R-SEC-Γ.
Proof. Lemmas 1-5 and Theorems 1-3 directly imply that in order to prove the correctness of algorithm Algo-R-SEC-Γ, it is sufficient to prove that it correctly returns G' of minimum $\Gamma_{\bar{G},G'}$ among all G' ∈ V(X). We will show that algorithm Algo-R-SEC-Γ accounts for each G' ∈ V(X), correctly computes $\Gamma_{\bar{G},G'}$ for Γ ∈ {D, DL, DC}, and returns the right G' as output.
From Definition 10, V(X) = SPRG(v). In Algo-R-SEC-Γ, step 1 prunes subtree Gv and regrafts it above the root of G to create $\bar{G}$. Step 5 sets G' to $\bar{G}$. The for-loop in step 6 traverses subtree $\bar{G}_{Sb(v)}$ in preorder. For each traversed vertex $k \neq Ro(\bar{G}_{Sb(v)})$, step 9 builds the tree G'' := SPRG(v, k) by applying an NNI operation on the last built G''. Each for-loop iteration sets G' to the last built G'' in step 23. $\bar{G}$ and the trees G'' constitute all the trees in SPRG(v).
For $\bar{G}$, step 2 computes the LCA mapping and step 5 sets $\Gamma_{\bar{G},G'}$ to zero. Following Lemmas 2-4 and Theorem 2, steps 10 and 11 update the LCA mapping of G'' and step 12 computes $\Gamma_{\{G',G''\}}$ by calling algorithm Algo-G-Score. Depending on Γ ∈ {D, DL, DC}, there are three cases:
Case 1: Γ is D. Algo-G-Score returns 1, if the vertex g ∈ V(G'') maps to the same vertex in S as any of its children maps to, otherwise 0.
Case 2: Γ is DL. Algo-G-Score computes losses by applying the formula of Definition 4. Further, it adds 1 if there is a duplication.
Case 3: Γ is DC. Algo-G-Score returns the number of lineages from g to each of its children h ∈ Ch(g) in S. For each h ∈ Ch(g), the depth of ℳ(g) is subtracted from the depth of ℳ(h) to count the number of edges between ℳ(g) and ℳ(h).
In Algo-R-SEC-Γ, step 13 computes $\Gamma_{\bar{G},G''}$ by adding $\Gamma_{\bar{G},G'}$ and $\Gamma_{\{G',G''\}}$. When backtracking, steps 17-22 are executed to restore the right G' to compute the next unique G'' ∈ ChX(G'). This ensures that the correct $\Gamma_{\bar{G},G'}$ is computed for each G' ∈ V(X).
In Algo-R-SEC-Γ, step 4 sets $\bar{G}$ as the BestTree and $\Gamma_{\bar{G},\bar{G}} = 0$ as BestScore. Every time a new G'' ∈ ChX(G') is encountered, step 14 compares $\Gamma_{\bar{G},G''}$ with BestScore, and updates BestTree with G'' of the minimum $\Gamma_{\bar{G},G''}$. After the for-loop, step 24 returns the BestTree. □
Lemma 7. The R-SEC-Γ and SEC-Γ problems can be solved in Θ(n) and Θ(n²) time, respectively.
Proof. We will prove that the algorithm Algo-R-SEC-Γ solves the restricted SPR based error-correction problems for each Γ ∈ {D, DL, DC} in Θ(n) time. In Algo-R-SEC-Γ, step 1 takes constant time. Step 2 precomputes LCA values for the species tree in O(n) time [24], and so finds the LCA mapping from $\bar{G}$ to S in O(n) time in a bottom-up manner. Step 3 computes the duplication, duplication-loss or deep coalescence score of $\bar{G}$ and S by calling Algo-Comp-Score. In Algo-Comp-Score, step 1 and step 2 run for O(1) and O(n) time, respectively. Step 3 calls Algo-G-Score in each iteration of the for-loop. Algo-G-Score runs for O(1) time for Γ ∈ {D, DL, DC}.
When Γ is DC, steps 4 and 5 are further executed in Algo-Comp-Score in constant time. Thus in Algo-R-SEC-Γ, step 3 runs for O(n) time. Further, steps 4 and 5 take constant time. The loop of step 6 runs for Θ(n) iterations. If the condition of step 7 is true, steps 8-10 execute in constant time. With the precomputed LCA values from step 2, step 11 executes in constant time. Algo-G-Score runs in constant time for Γ ∈ {D, DL, DC}, which lets step 12 execute in constant time. Further, steps 13-15 execute in constant time too. If the condition in step 7 is false, then steps 17-22 similarly execute in constant time. Finally, step 23 runs in constant time, and hence the R-SEC-Γ problem can be solved in Θ(n) time. From Observation 1, Algo-R-SEC-Γ is called Θ(n) times to solve the SEC-Γ problem. Thus, the SEC-Γ problem can be solved in Θ(n²) time. □
Solving the TEC-Γ problems
In this section we study the TBR based error-correction problems, for duplication (D), duplication-loss (DL), and deep coalescence (DC). More precisely, we extend our solution for the SEC-Γ problems to solve the TEC-Γ problems. A TBR operation can be viewed as an SPR operation, except that the pruned subtree can be rerooted before it is regrafted. Our speed-up for the TEC-Γ problems is achieved by observing that the Γ scores of any re-rooted pruned subtree and of the remaining pruned tree are independent of each other. We define the R-TEC-Γ problems for the TEC-Γ problems, as we defined the R-SEC-Γ problems for the SEC-Γ problems. We will show that the R-TEC-Γ problems can be solved by solving two smaller problems separately and combining their solutions.
Definition 12. Let T be a tree and x ∈ V(T). RR(T, x) is defined to be the tree T if x = Ro(T) or x ∈ Ch(Ro(T)). Otherwise, RR(T, x) is the tree obtained by suppressing Ro(T), and subdividing the edge (Pa(x), x) by the new root node.
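The rerooting operation RR(T, x) of Definition 12 can be sketched on nested tuples as below. This is one possible reading, not the authors' code: it assumes binary trees written as nested tuples with string leaves, identifies x by equality with a subtree of T (so x must occur exactly once in T), and reaches the new rooting by moving the root one edge at a time towards x. The names `contains` and `reroot` are hypothetical.

```python
def contains(t, x):
    """True if x is t itself or a subtree nested inside t."""
    if t == x:
        return True
    return not isinstance(t, str) and (contains(t[0], x) or contains(t[1], x))

def reroot(T, x):
    """Return RR(T, x): T unchanged if x is the root or a child of the root,
    otherwise T rerooted on the edge between x and its parent."""
    while not (T == x or x in T):
        a, b = T
        if not contains(a, x):
            a, b = b, a              # let a be the child of the root holding x
        a1, a2 = a
        if not contains(a1, x):
            a1, a2 = a2, a1          # let a1 be the child of a holding x
        # Suppress the old root and place the new root on edge (a, a1).
        T = (a1, (a2, b))
    return T

T = ((("a", "b"), "c"), "d")
rerooted = reroot(T, "a")
```

Each pass of the loop moves the root one edge closer to x, so the loop ends with x as a child of the root, which is the tree obtained by subdividing the edge (Pa(x), x).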
Lemma 8. Given a tuple 〈G, S, v〉, and G'' := TBRG(v, x, y), for x ∈ V(Gv), y ∈ V(G)\V(Gv). Then, $C_\Gamma(G'', S) = \min_{G' \in TBR_G(v)} C_\Gamma(G', S)$ iff $C_\Gamma(RR(G_v, x), S) = \min_{x' \in V(G_v)} C_\Gamma(RR(G_v, x'), S)$ and $C_\Gamma(G'', S) = \min_{G' \in TBR_G(v, x)} C_\Gamma(G', S)$.
Proof. (⇒) Let G1 := TBRG(v, x1, y), for x1 ∈ V(Gv), and x1 ≠ x. Now observe that $\forall g \in V(G) \setminus V(G_v)$, $C_\Gamma(G'', S, g) = C_\Gamma(G_1, S, g)$. Also, let G2 := TBRG(v, x, y1), for y1 ∈ V(G)\V(Gv), and y1 ≠ y. Observe that $\forall g \in V(G_v)$, $C_\Gamma(G'', S, g) = C_\Gamma(G_2, S, g)$. Thus, if G'' gives the minimum duplication, duplication-loss, or deep coalescence score among all trees in TBRG(v), then the score contribution of vertices in V(Gv) and V(G)\V(Gv) is independent. Now looking at vertices of G, the best score is achieved when Gv is rooted at x, i.e., $C_\Gamma(RR(G_v, x), S) = \min_{x' \in V(G_v)} C_\Gamma(RR(G_v, x'), S)$; also the best score is achieved when RR(Gv, x) is regrafted at y, i.e., $C_\Gamma(G'', S) = \min_{G' \in TBR_G(v, x)} C_\Gamma(G', S)$.
(⇐) This follows similarly. □
Lemma 8 implies that a tree in TBRG(v) with the minimum duplication, duplication-loss, or deep coalescence cost can be obtained by optimizing the rooting for the pruned subtree, and the regraft location, separately. A best rooting for the pruned subtree is linear time computable [17,25], and the solution to the R-SEC-Γ problem identifies a best regraft location in Θ(n) time. This makes it possible to obtain a tree in TBRG(v) with the minimum duplication, duplication-loss, or deep coalescence cost by evaluating only Θ(n) trees. Thus the R-TEC-Γ problem can be solved in Θ(n) time. The TEC-Γ problem can be solved by calling the solution of the R-TEC-Γ problem Θ(n) times, and Theorem 4 follows.
Theorem 4. The TEC-Γ problem can be solved in Θ(n²) time.
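The separation stated in Lemma 8 can be checked by brute force on a small instance for the duplication cost. The sketch below is illustrative only and makes no attempt at the stated running times: it assumes nested-tuple trees with species-name leaves, a cluster-based LCA mapping, and a pruned subtree P given apart from the remaining tree R. It compares an exhaustive search over all (rooting, regraft) pairs with choosing the rooting first and the regraft position second. All names are hypothetical.

```python
def leaf_set(t):
    return frozenset([t]) if isinstance(t, str) else leaf_set(t[0]) | leaf_set(t[1])

def nodes(t):
    return [t] if isinstance(t, str) else nodes(t[0]) + nodes(t[1]) + [t]

def dup_cost(G, S):
    """Duplication cost with the LCA mapping taken as the smallest species cluster."""
    clusters = [leaf_set(s) for s in nodes(S)]
    def lca_map(g):
        return min((c for c in clusters if leaf_set(g) <= c), key=len)
    return sum(lca_map(g) in (lca_map(g[0]), lca_map(g[1]))
               for g in nodes(G) if not isinstance(g, str))

def contains(t, x):
    return t == x or (not isinstance(t, str)
                      and (contains(t[0], x) or contains(t[1], x)))

def reroot(T, x):
    """Move the root of T onto the edge above x (x must occur once in T)."""
    while not (T == x or x in T):
        a, b = T
        if not contains(a, x):
            a, b = b, a
        a1, a2 = a
        if not contains(a1, x):
            a1, a2 = a2, a1
        T = (a1, (a2, b))
    return T

def regrafts(R, P):
    """Yield every tree obtained by regrafting P above some node of R."""
    yield (R, P)
    if not isinstance(R, str):
        left, right = R
        for t in regrafts(left, P):
            yield (t, right)
        for t in regrafts(right, P):
            yield (left, t)

S = ((("a", "b"), "c"), ("d", "e"))
R = (("a", "c"), "d")            # gene tree after pruning
P = (("b", "d"), "e")            # pruned subtree
rootings = {reroot(P, x) for x in nodes(P)}

# Exhaustive search over every rooting and every regraft position.
joint = min(dup_cost(t, S) for p in rootings for t in regrafts(R, p))
# Separate search: best rooting first, then best regraft position for it.
best_rooting = min(rootings, key=lambda p: dup_cost(p, S))
separate = min(dup_cost(t, S) for t in regrafts(R, best_rooting))
```

The two searches return the same minimum because vertices inside the pruned subtree are scored from its rooting alone, while vertices outside it depend only on its leaf set and the regraft position.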
Experimental results
We tested the performance of the gene tree rearrangement algorithms on a set of 106 gene alignments containing sequences from 8 yeast taxa from Rokas et al. [26]. There is a well accepted phylogeny for the yeast species, and the data set has been used to test algorithms for gene tree parsimony based on the deep coalescence problem [27,28]. We constructed maximum likelihood gene trees for each gene using RAxML-VI-HPC version 7.0.4 [29]; the gene trees were rooted with the outgroup Candida albicans. We used the new error correction algorithms to examine how much a single SPR rearrangement in the gene tree reduces the reconciliation cost based on deep coalescence and also gene duplications and losses. Over all genes the SPR error correction reduced the total deep coalescence cost from 151 to 53 (Table 1) and the duplication and loss cost from 481 to 175 (Table 2). Both algorithms took only seconds to run for all 106 genes on a standard laptop.
Table 1 Error correction based on deep coalescence model
Reconciliation Cost    Original    Post-Correction
0                      45          77
1                      32          15
2                      6           8
3                      9           5
4                      8           0
>4                     6           1
The number of yeast gene trees with different reconciliation costs based on the deep coalescence model both before (Original) and after (Post-Correction) the SPR error correction.
Table 2 Error correction based on duplication and loss model
Reconciliation Cost    Original    Post-Correction
0                      45          77
1-5                    32          15
6-10                   15          13
11-15                  8           0
16-20                  5           1
>20                    1           0
The number of yeast gene trees with different reconciliation costs based on the duplication and loss model both before (Original) and after (Post-Correction) the SPR error correction.
We also implemented a protocol to use the gene rearrangement algorithm to correct for gene tree error in gene tree parsimony phylogenetic analyses. We first took a collection of input gene trees and performed an SPR species tree search using Duptree [30], which seeks the species tree with the minimum gene duplication cost. We used the duplication only cost (instead of duplications and losses) because when there is no complete sampling of all existing genes, the loss estimates may be inflated by missing sequences. After finding the locally optimal species tree, we used our SPR gene tree rearrangement algorithm to find gene tree topologies with a lower duplication cost. We then performed another SPR species tree search using Duptree, starting from the locally optimal species tree and using the new gene tree topologies. This search strategy is similar to the re-rooting protocol in Duptree, which checks for better gene tree roots after an SPR species tree search [30,31]. We tested this protocol on a data set of 6,084 genes (with a combined 81,525 leaves) from 14 seed plant taxa. This is the same data set used by [31], except that all gene tree clades containing sequences from a single species were collapsed to a single leaf. Our original SPR tree search found a species tree with 23,500 duplications. The SPR tree search after the gene tree rearrangements identified the same species tree, but the new gene trees had a reconciliation cost of only 18,213. This tree search protocol took just under 4 hours on a Mac Powerbook with a 2 GHz Intel Core 2 Duo processor and 2 GB memory.
Conclusion
GT-ST reconciliation provides a powerful approach to study the patterns and processes of gene and genome evolution. Yet it can be thwarted by the error that is an inherent part of gene tree inference. Any reliable method for GT-ST reconciliation must account for gene tree error; however, any useful method also must be computationally tractable for large-scale genomic data. We introduce fast and effective algorithms to correct error in the gene trees. These algorithms, based on SPR and TBR rearrangements, greatly extend the range of gene tree errors that can be corrected compared with existing algorithms [17,18], while remaining fast enough to use on data sets with thousands of genes. These algorithms can correct trees based on a broad variety of evolutionary factors that can cause conflict between gene trees and species trees, including gene duplication, duplications and losses, and deep coalescence.
Our analysis on 106 yeast gene trees demonstrates that even a single SPR correction on the gene trees can radically improve upon the reconciliation cost. Our results in the yeast analysis are very similar to the 2-3 fold improvement in implied duplications and losses reported from the parametric gene tree estimation and reconciliation method of Rasmussen and Kellis [5]. However, in contrast to this computationally complex method, the gene tree rearrangement algorithm is extremely fast, does not require assumptions about the rates of duplication and loss, and is also amenable to corrections based on deep coalescence and duplications as well as duplications and losses. We do not claim that the gene correction algorithms produce a more accurate reconciliation than these parametric methods. However, they do present an extremely fast and flexible alternative.
We also demonstrated that this error correction protocol could easily be incorporated into a gene tree parsimony phylogenetic analysis. Previous studies have emphasized that gene tree parsimony is sensitive to the topology of the input trees. For example, the species tree may differ depending on whether the gene trees are made using parsimony or maximum likelihood [8,10]. In our study, although the gene tree rearrangement did not affect the species tree inference, it did greatly reduce the gene duplication reconciliation cost.
While the results of the experiments are promising, they also suggest several directions for future research. First, further investigation is needed to characterize the effects of error on gene tree topologies. For example, it seems likely that gene tree errors may extend beyond a single SPR or TBR neighborhood. Yet, if we allow unlimited rearrangements, the gene trees will simply converge on the species tree topology. One simple improvement may be to weight the possible gene tree rearrangements based on support for different clades in the gene tree. Thus, well-supported clades may rarely or never be subject to rearrangement, while poorly supported clades may be subject to extensive rearrangements. Finally, these approaches implicitly assume that all differences between gene trees and species trees are due to either coalescence, duplications, or duplications and losses. Future work will seek to combine these objectives and also address lateral transfer.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
RC was responsible for algorithm design and program implementation, and wrote major parts of the paper. JGB performed the experimental evaluation and the analysis of the results, and contributed to the writing of the paper. OE supervised the project, contributed to the algorithmic design and writing of the paper.
All authors read and approved the final manuscript.
Acknowledgements
The authors thank André Wehe for his generous support with the implementation. This work was conducted in parts with support from the Gene Tree Reconciliation Working Group at NIMBioS through NSF award EF-0832858, with additional support from the University of Tennessee. R.C. and O.E. were supported in parts by NSF awards #0830012 and #10117189.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: "Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)". The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.
References
1. Maddison WP: Gene Trees in Species Trees. Systematic Biology 1997, 46:523-536. doi:10.1093/sysbio/46.3.523
2. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the gene lineage into its species lineage. A parsimony strategy illustrated by cladograms constructed from globin sequences. Systematic Zoology 1979, 28:132-163. doi:10.2307/2412519
3. Guigó R, Muchnik I, Smith TF: Reconstruction of Ancient Molecular Phylogeny. Molecular Phylogenetics and Evolution 1996, 6(2):189-213. doi:10.1006/mpev.1996.0071. PMID 8899723
4. Slowinski JB, Knight A, Rooney AP: Inferring Species Trees from Gene Trees: A Phylogenetic Analysis of the Elapidae (Serpentes) Based on the Amino Acid Sequences of Venom Proteins. Molecular Phylogenetics and Evolution 1997, 8:349-362. doi:10.1006/mpev.1997.0434. PMID 9417893
5. Rasmussen MD, Kellis M: A Bayesian approach for fast and accurate gene tree reconstruction. Molecular Biology and Evolution 2011, 28:273-290. doi:10.1093/molbev/msq189. PMCID 3002250. PMID 20660489
6. Burleigh JG, Bansal MS, Wehe A, Eulenstein O: Locating Large-Scale Gene Duplication Events through Reconciled Trees: Implications for Identifying Ancient Polyploidy Events in Plants. Journal of Computational Biology 2009, 16:1071-1083. doi:10.1089/cmb.2009.0139. PMID 19689214
7. Hahn MW: Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biology 2007, 8:R141. doi:10.1186/gb-2007-8-7-r141. PMCID 2323230. PMID 17634151
8. Burleigh JG, Bansal MS, Eulenstein O, Hartmann S, Wehe A, Vision TJ: Genome-scale phylogenetics: inferring the plant tree of life from 18,896 discordant gene trees. Systematic Biology 2011, 60(2):117-125. doi:10.1093/sysbio/syq072. PMCID 3038350. PMID 21186249
9. Huang H, Knowles LL: What Is the Danger of the Anomaly Zone for Empirical Phylogenetics? Systematic Biology 2009, 58:527-536. doi:10.1093/sysbio/syp047. PMID 20525606
10. Sanderson MJ, McMahon MM: Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evolutionary Biology 2007, 7(suppl 1):S3
11. Berglund-Sonnhammer A, Steffansson P, Betts MJ, Liberles DA: Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony. Journal of Molecular Evolution 2006, 63:240-250. doi:10.1007/s00239-005-0096-1. PMID 16830091
12. Vernot B, Stolzer M, Goldman A, Durand D: Reconciliation with non-binary species trees. Computational Systems Bioinformatics 2007, 53:441-452
13. Yu Y, Warnow T, Nakhleh L: Algorithms for MDC-Based Multi-locus Phylogeny Inference. In RECOMB, Volume 6577 of Lecture Notes in Computer Science. Edited by Bafna V, Sahinalp SC. Springer; 2011:531-545
14. Cotton JA, Page RDM: Going nuclear: gene family evolution and vertebrate phylogeny reconciled. P Roy Soc Lond B Biol 2002, 269:1555-1561. doi:10.1098/rspb.2002.2074
15. Joly S, Bruneau A: Measuring Branch Support in Species Trees Obtained by Gene Tree Parsimony. Systematic Biology 2009, 58:100-113. doi:10.1093/sysbio/syp013. PMID 20525571
16. Arvestad L, Berglund A, Lagergren J, Sennblad B: Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. RECOMB 2004:326-335
17. Chen K, Durand D, Farach-Colton M: Notung: a program for dating gene duplications and optimizing gene family trees. Journal of Computational Biology 2000, 7:429-447. doi:10.1089/106652700750050871. PMID 11108472
18. Durand D, Halldórsson BV, Vernot B: A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction. Journal of Computational Biology 2006, 13(2):320-335. doi:10.1089/cmb.2006.13.320. PMID 16597243
19. Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 2001, 5:1-13. doi:10.1007/s00026-001-8006-8
20. Bordewich M, Semple C: On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics 2004, 8:409-423
21. Zhang L: On a Mirkin-Muchnik-Smith Conjecture for Comparing Molecular Phylogenies. Journal of Computational Biology 1997, 4(2):177-187. doi:10.1089/cmb.1997.4.177. PMID 9228616
22. Page RDM: Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Systematic Biology 1994, 43:58-77
23. Eulenstein O: Predictions of gene-duplications and their phylogenetic development. PhD thesis. University of Bonn, Germany; 1998. [GMD Research Series No. 20/1998, ISSN: 1435-2699]
24. Bender MA, Farach-Colton M: The LCA Problem Revisited. LATIN 2000:88-94
25. Górecki P, Tiuryn J: Inferring phylogeny from whole genomes. ECCB (Supplement of Bioinformatics) 2006:116-122
26. Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 2003, 425(6960):798-804. doi:10.1038/nature02053. PMID 14574403
27. Than C, Nakhleh L: Species tree inference by minimizing deep coalescences. PLoS Comput Biol 2009, 5(9):e1000501. doi:10.1371/journal.pcbi.1000501. PMCID 2729383. PMID 19749978
28. Bansal MS, Burleigh JG, Eulenstein O: Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models. BMC Bioinformatics 2010, 11(Suppl 1):S42. doi:10.1186/1471-2105-11-S1-S42. PMCID 3009515. PMID 20122216
29. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 2006, 22(21):2688-2690. doi:10.1093/bioinformatics/btl446. PMID 16928733
30. Wehe A, Bansal MS, Burleigh JG, Eulenstein O: DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony. Bioinformatics 2008, 24(13)
31. Chang W, Burleigh JG, Fernández-Baca D, Eulenstein O: An ILP solution for the gene duplication problem. BMC Bioinformatics 2011, 12(Suppl 1):S14. doi:10.1186/1471-2105-12-S1-S14. PMCID 3278830. PMID 22479706