RESEARCH Open Access

Robinson-Foulds Supertrees

Mukul S Bansal 1,2, J Gordon Burleigh 3, Oliver Eulenstein 2, David Fernández-Baca 2*

Abstract

Background: Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees.

Results: We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of $\Theta(n)$ and $\Theta(n^2)$ respectively (where $n$ is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods.

Conclusions: Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule (-) supertrees, which are a generalization of majority-rule consensus trees.
Introduction

Supertree methods provide a formal approach for combining small phylogenetic trees with incomplete species overlap in order to build comprehensive species phylogenies, or supertrees, that contain all species found in the input trees. Supertree analyses have produced the first family-level phylogeny of flowering plants [1] and the first phylogeny of nearly all extant mammal species [2]. They have also enabled phylogenetic analyses using large-scale genomic data sets in bacteria, across eukaryotes, and within plants [3,4] and have helped elucidate the origin of eukaryotic genomes [5]. Furthermore, supertrees have been used to examine rates and patterns of species diversification [1,2], to test hypotheses regarding the structure of ecological communities [6], and to examine extinction risk in current species [7].

Although supertrees can support large-scale evolutionary and ecological analyses, there are still numerous concerns about the performance of existing supertree methods (e.g., [8-14]). In general, an effective supertree method must accurately estimate phylogenies from large data sets in a reasonable amount of time while retaining much of the phylogenetic information from the input trees.
By far the most commonly used supertree method is matrix representation with parsimony (MRP), which works by solving the parsimony problem on a binary matrix representation of the input trees [15,16]. While the parsimony problem is NP-hard, MRP can take advantage of fast and effective hill-climbing heuristics implemented in PAUP* or TNT (e.g., [17-19]). MRP heuristics often perform well in analyses of both simulated and empirical data sets (e.g., [20-22]); however, there are numerous criticisms of MRP. For example, MRP shows evidence of biases based on the shape and size of input trees [8,11], and MRP supertrees may contain relationships that are not supported by any of the input trees [9,12]. Furthermore, it is unclear if or why minimizing the parsimony score of a matrix representation of input trees is a good optimality criterion or should produce accurate supertrees.

*Correspondence: fernande@cs.iastate.edu
2 Department of Computer Science, Iowa State University, Ames, IA 50011, USA

Bansal et al. Algorithms for Molecular Biology 2010, 5:18
http://www.almob.org/content/5/1/18

© 2010 Bansal et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Since evolutionary biologists rarely, if ever, know the true relationships for a group of species, it is difficult to assess the accuracy of supertree, or any phylogenetic, methods. One approach to evaluate the accuracy of supertrees is with simulations (e.g., [20,21]). However, simulations inherently simplify the true processes of evolution, and it is unclear how well the performance of a phylogenetic method in simulations corresponds to its performance with empirical data. Perhaps a more useful way to define the accuracy of a supertree method is to quantify the amount of phylogenetic information from the input trees that is retained in the supertree. Ideally, we want the supertree to reflect the input tree topologies as much as possible. This suggests that the supertree objective should directly evaluate the similarity of the supertree to the input trees (e.g., [11,23,24]).

Numerous metrics exist to measure the similarity of input trees to a supertree, and the Robinson-Foulds (RF) distance metric [25] is among the most widely used. In fact, numerous studies have evaluated the performance of supertree methods, including MRP, by measuring the RF distance between collections of input trees and the resulting supertrees (e.g., [11,20,21]). The RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances between every rooted input tree and the supertree. The intuition behind seeking a binary supertree is that, in this setting, minimizing the RF distance is equivalent to maximizing the number of clusters (or clades) that are shared by the supertree and the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters from the input trees. Unfortunately, as with MRP, computing RF supertrees is NP-hard [26]. In this work, we describe efficient hill-climbing heuristics to estimate RF supertrees. These heuristics allow the first large-scale estimates of RF supertrees and comparisons of the accuracy of RF supertrees to other commonly used supertree methods.
The RF distance metric between two rooted trees is defined to be a normalized count of the symmetric difference between the sets of clusters of the two trees. In the supertree setting, the input trees will often have only a strict subset of the taxa present in the supertree. Thus, a high RF distance between an input tree and a supertree does not necessarily correspond to conflicting evolutionary histories; it can also indicate incomplete phylogenetic information. Consequently, in order to compute the RF distance between an input tree which has only a strict subset of the taxa in the supertree, we first restrict the supertree to only the leaf set of the input tree. This adapted version of the RF distance is not a metric, or even a distance measure (mathematically speaking). However, for convenience, we will refer to this adapted version of the RF distance metric using the same name.

Previous work

Supertree methods are a generalization of consensus methods, in which all the input trees have the same leaf set. The problem of finding an optimal median tree under the RF distance in such a consensus setting is well-studied. In particular, it is known that the majority-rule consensus of the input trees must be a median tree [27], and it can be found in polynomial time. On the other hand, finding the optimum binary median tree, i.e., an RF supertree, in the consensus setting is NP-hard [26]. This implies that computing an RF supertree in general is NP-hard as well.

Our definition of RF distance between two trees where one has only a strict subset of the taxa in the other corresponds to the distance measure used to define "majority-rule (-) supertrees" by Cotton and Wilkinson [28]. This definition restricts the larger tree to only the leaf set of the smaller tree before evaluating the RF distance. Majority-rule (-) supertrees are defined to be the strict consensus of all the optimal median trees under the RF distance. These median trees are defined similarly to RF supertrees, except that RF supertrees must be binary while the median trees can be non-binary. In general, majority-rule supertrees [28], in both their (-) and (+) variants, seek to generalize the majority-rule consensus.
Indeed, majority-rule supertrees have been shown to have several desirable properties reminiscent of majority-rule consensus trees [29]. Although majority-rule supertrees and RF supertrees are both based on minimizing RF distance, they represent two different approaches to supertree construction. In particular, the RF supertree method seeks a supertree that is consistent with the largest number of clusters (clades) from the input trees, while majority-rule supertrees do not. Nevertheless, as we discuss later, RF supertrees could be used as a starting point to estimate majority-rule (-) supertrees.

The RF distance between two trees on the same size $n$ leaf set, with leaves labeled by integers $\{1, \ldots, n\}$, can be computed in $O(n)$ time [30]. In fact, a $(1 + \epsilon)$-approximate value of the RF distance can be computed in sublinear time, with high probability [31].

In the case of unrooted trees, the RF distance metric is sometimes also known as the splits metric (e.g., [32]). The supertree analysis package Clann [23] provides heuristics that operate on unrooted trees and attempt to maximize the number of splits shared between the input trees and the inferred supertree. This method is called the "maximum splits-fit" method.

Local Search

We use a heuristic approach for the RF supertree problem. Local search is the basis of effective heuristics for many phylogenetic problems. These heuristics iteratively search through the space of possible supertrees guided, at each step, by solutions to some local search problem. More formally, in these heuristics, a tree graph (see [32,33]) is defined for the given set of input trees and some fixed tree edit operation. The node set of this tree graph represents the set of all supertrees on the given input trees. An edge is drawn between two nodes exactly if the corresponding trees can be transformed into each other by one tree edit operation. In our setting, the cost of a node in the graph is the RF distance between the supertree represented by that node and the given input trees. Given an initial node in the tree graph, the heuristic's task is to find a maximal-length path of steepest descent in the cost of its nodes and to return the last node on such a path. This path is found by solving the local search problem at every node along the path. The local search problem is to find a node with the minimum cost in the neighborhood of a given node. The neighborhood is defined by some tree edit operation, and hence, the time complexity of the local search problem depends on the tree edit operation used.
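The steepest-descent walk through the tree graph described above can be sketched generically. This is a minimal illustration, not the authors' implementation: the function name `steepest_descent` and the callables `neighbors` and `cost` are assumptions introduced here, and the toy usage below runs on integers rather than on trees.

```python
def steepest_descent(start, neighbors, cost):
    """Follow a path of steepest descent and return its last node.

    `neighbors(node)` yields the nodes one edit operation away, and
    `cost(node)` returns the value to minimize (for RF supertrees this
    would be the total RF distance to the input trees). Both callables
    are hypothetical placeholders supplied by the caller.
    """
    current, current_cost = start, cost(start)
    while True:
        # Local search problem: find a minimum-cost node in the neighborhood.
        best, best_cost = None, current_cost
        for candidate in neighbors(current):
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:
                best, best_cost = candidate, candidate_cost
        if best is None:
            # No neighbor improves the cost: the path is maximal.
            return current, current_cost
        current, current_cost = best, best_cost


# Toy usage on integers: neighbors differ by one, cost is squared distance to 3.
result = steepest_descent(10, lambda x: (x - 1, x + 1), lambda x: (x - 3) ** 2)
```

The cost of each step is dominated by the neighborhood scan, which is why the time complexity of the local search problem depends on the edit operation that defines `neighbors`.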
Two of the most extensively used tree edit operations for supertrees are rooted Subtree Prune and Regraft (SPR) [33-35] and rooted Tree Bisection and Reconnection (TBR) [22,33,34]. The best known (naïve) algorithms for the SPR and TBR local search problems for the RF supertree problem require $O(kn^3)$ and $O(kn^4)$ time respectively, where $k$ is the number of input trees, and $n$ is the number of leaves in the supertree solution.

Our Contribution

We describe efficient hill-climbing heuristics for the RF supertree problem. These heuristics are based on novel non-trivial algorithms that can solve the corresponding local search problems for both SPR and TBR in $O(kn^2)$ time, yielding speed-ups of $\Theta(n)$ and $\Theta(n^2)$ over the best known solutions respectively. These new algorithms are inspired by fast local search algorithms for the gene duplication problem [36,37]. Note that while the supertree itself must be binary, our algorithms work even if the input trees are not. We also examine the performance of the RF supertree method using four published supertree data sets, and compare its performance with MRP and the triplet supertree method [38]. We demonstrate that the new algorithms enable RF supertree analyses on large data sets and that the RF supertree method outperforms other supertree methods in finding supertrees that are most similar to the input trees based on the RF distance metric.

Basic Notation and Preliminaries

A tree $T$ is a connected acyclic graph, consisting of a node set $V(T)$ and an edge set $E(T)$. $T$ is rooted if it has exactly one distinguished node called the root which we denote by $rt(T)$. Throughout this work, the term tree refers to a rooted tree. We define $\le_T$ to be the partial order on $V(T)$ where $x \le_T y$ if $y$ is a node on the path between $rt(T)$ and $x$. The set of minima under $\le_T$ is denoted by $L(T)$ and its elements are called leaves. The set of all non-root internal nodes of $T$, denoted by $I(T)$, is defined to be the set $V(T) \setminus (L(T) \cup \{rt(T)\})$. If $\{x, y\} \in E(T)$ and $x \le_T y$ then we call $y$ the parent of $x$, denoted by $pa_T(x)$, and we call $x$ a child of $y$. The set of all children of $y$ is denoted by $Ch_T(y)$.
$T$ is fully binary if every node has either zero or two children. If two nodes in $T$ have the same parent, they are called siblings. The least common ancestor of a non-empty subset $L \subseteq V(T)$, denoted as $lca(L)$, is the unique smallest upper bound of $L$ under $\le_T$. The subtree of $T$ rooted at node $y \in V(T)$, denoted by $T_y$, is the tree induced by $\{x \in V(T) : x \le_T y\}$. For each node $v \in I(T)$, the cluster $\mathcal{C}_T(v)$ is defined to be the set of all leaf nodes in $T_v$; i.e. $\mathcal{C}_T(v) = L(T_v)$. We denote the set of all clusters of a tree $T$ by $\mathcal{H}(T)$. Given a set $L \subseteq L(T)$, let $T'$ be the minimal rooted subtree of $T$ with leaf set $L$. We define the leaf induced subtree $T[L]$ of $T$ on leaf set $L$ to be the tree obtained from $T'$ by successively removing each non-root node of degree two and adjoining its two neighbors. The symmetric difference of two sets $A$ and $B$, denoted by $A \,\Delta\, B$, is the set $(A \setminus B) \cup (B \setminus A)$. A profile $\mathcal{P}$ is a tuple of trees $(T_1, \ldots, T_k)$.

The RF Supertree Problem

Given a profile $\mathcal{P}$, we define a supertree on $\mathcal{P}$ to be a fully binary tree $T^*$ where $L(T^*) = \bigcup_{i=1}^{k} L(T_i)$.

Definition 1 (RF Distance). Given a profile $\mathcal{P} = (T_1, \ldots, T_k)$ and a supertree $T^*$ on $\mathcal{P}$, we define the RF distance as follows:

1. For any $T_i$, where $1 \le i \le k$, $RF(T_i, T^*) = |\mathcal{H}(T_i) \,\Delta\, \mathcal{H}(T^*[L(T_i)])|$.
2. $RF(\mathcal{P}, T^*) = \sum_{i=1}^{k} RF(T_i, T^*)$.
3. $RF(\mathcal{P}) = \min_{T^*} RF(\mathcal{P}, T^*)$, where the minimum is taken over the set of all supertrees $T^*$ on $\mathcal{P}$.

Remark: Traditionally, the value of the RF distance, as computed above, is normalized by multiplying by 1/2. However, this does not affect the definition or computation of RF supertrees, and therefore, we do not normalize the RF distance.

Problem 1 (RF Supertree).
Instance: A profile $\mathcal{P}$.
Find: A supertree $T_{opt}$ on $\mathcal{P}$ such that $RF(\mathcal{P}, T_{opt}) = RF(\mathcal{P})$.

Recall that the RF Supertree problem is NP-hard [27].

Local Search Problems

Here we first provide definitions for the re-rooting operation (denoted RR) and the TBR [22] and SPR [35] edit operations and then formulate the related local search problems that were motivated in the introduction.

Definition 2 (RR operation). Let $T$ be a tree and $x \in V(T)$. $RR(T, x)$ is defined to be the tree $T$, if $x = rt(T)$ or $x \in Ch(rt(T))$. Otherwise $RR(T, x)$ is the tree that is obtained from $T$ by (i) suppressing $rt(T)$, and (ii) subdividing the edge $\{pa(x), x\}$ by a new root node. We define the following extension: $RR(T) = \bigcup_{x \in V(T)} \{RR(T, x)\}$.

For technical reasons, before we can define the TBR operation, we need the following definition.

Definition 3 (Planted tree). Given a tree $T$, the planted tree $\Phi(T)$ is the tree obtained by adding a root edge $\{p, rt(T)\}$, where $p \notin V(T)$, to $T$.

Definition 4 (TBR operation). (See Fig. 1) Let $T$ be a tree, $e = (u, v) \in E(T)$, where $u = pa(v)$, and $X$, $Y$ be the connected components that are obtained by removing edge $e$ from $T$ where $v \in X$ and $u \in Y$. We define $TBR_T(v, x, y)$ for $x \in X$ and $y \in Y$ to be the tree that is obtained from $\Phi(T)$ by first removing edge $e$, then replacing the component $X$ by $RR(X, x)$, and then adjoining a new edge $f$ between $x' = rt(RR(X, x))$ and $Y$ as follows:

1. Create a new node $y'$ that subdivides the edge $(pa(y), y)$.
2. Adjoin the edge $f$ between nodes $x'$ and $y'$.
3. Suppress the node $u$, and rename $x'$ as $v$ and $y'$ as $u$.
4. Contract the root edge.

Notation. We define the following:

1. $TBR_T(v, x) = \bigcup_{y \in Y} \{TBR_T(v, x, y)\}$
2. $TBR_T(v) = \bigcup_{x \in X} TBR_T(v, x)$
3. $TBR_T = \bigcup_{(u,v) \in E(T)} TBR_T(v)$

Definition 5 (SPR operation).
Let $T$ be a tree, $e = (u, v) \in E(T)$, where $u = pa(v)$, and $X$, $Y$ be the connected components that are obtained by removing edge $e$ from $T$ where $v \in X$ and $u \in Y$. We define $SPR_T(v, y)$, for $y \in Y$, to be the tree $TBR_T(v, v, y)$. We say that the tree $SPR_T(v, y)$ is obtained from $T$ by a subtree prune and regraft (SPR) operation that prunes subtree $T_v$ and regrafts it above node $y$.

Notation. We define the following:

1. $SPR_T(v) = \bigcup_{y \in Y} \{SPR_T(v, y)\}$
2. $SPR_T = \bigcup_{(u,v) \in E(T)} SPR_T(v)$

Note that an SPR operation for a given tree $T$ can be briefly described through the following four steps: (i) prune some subtree $P$ from $T$, (ii) add a root edge to the remaining tree $S$, (iii) regraft $P$ into an edge of the remaining tree $S$, and (iv) contract the root edge.

We now define the relevant local search problems based on the TBR and SPR operations.

Problem 2 (TBR-Scoring (TBR-S)). Given instance $\langle \mathcal{P}, T \rangle$, where $\mathcal{P}$ is the profile $(T_1, \ldots, T_k)$ and $T$ is a supertree on $\mathcal{P}$, find a tree $T^* \in TBR_T$ such that $RF(\mathcal{P}, T^*) = \min_{T' \in TBR_T} RF(\mathcal{P}, T')$.

Problem 3 (TBR-Restricted Scoring (TBR-RS)). Given instance $\langle \mathcal{P}, T, v \rangle$, where $\mathcal{P}$ is the profile $(T_1, \ldots, T_k)$, $T$ is a supertree on $\mathcal{P}$, and $v$ is a non-root node in $V(T)$, find a tree $T^* \in TBR_T(v)$ such that $RF(\mathcal{P}, T^*) = \min_{T' \in TBR_T(v)} RF(\mathcal{P}, T')$.

The problems SPR-Scoring (SPR-S) and SPR-Restricted Scoring (SPR-RS) are defined analogously to the problems TBR-S and TBR-RS respectively.

Figure 1 TBR Operation. Example depicting a TBR operation which transforms tree $S$ into tree $S' = TBR_S(v, x, y)$.


Throughout the remainder of this manuscript, $k$ is the number of trees in the profile $\mathcal{P}$, $T$ denotes a supertree on $\mathcal{P}$, and $n$ is the number of leaves in $T$. The following observation follows from Definition 4.

Observation 1. The TBR-S problem on instance $\langle \mathcal{P}, T \rangle$ can be solved by solving the TBR-RS problem $|E(T)|$ times.

We show how to solve the TBR-S problem on the instance $\langle \mathcal{P}, T \rangle$ in $O(kn^2)$ time. Since $SPR_T \subseteq TBR_T$ this also implies an $O(kn^2)$ solution for the SPR-S problem. This gives a speed-up of $\Theta(n^2)$ and $\Theta(n)$ over the best known (naïve) algorithms for the TBR-S and SPR-S problems respectively.

In particular, we first show that any instance of the TBR-RS problem can be decomposed into an instance of an SPR-RS problem, and an instance of a Rooting problem (defined in the next section). We show how to solve both these problems in $O(kn)$ time, yielding an $O(kn)$ time solution for the TBR-RS problem. This immediately implies an $O(kn^2)$ time algorithm for the TBR-S problem (see Observation 1).

Note that the size of the set $TBR_T$ is $\Theta(n^3)$. Thus, for each tree in the input profile the time complexity of computing and enumerating the RF distances of all trees in $TBR_T$ is $\Omega(n^3)$. However, to solve the TBR-S problem one only needs to find a tree with the minimum RF distance. This lets us solve the TBR-S problem in time that is sub-linear in the size of $TBR_T$. In fact, after the initial $O(kn^2)$ preprocessing step, our algorithm can output the RF distance of any tree in $TBR_T$ in $O(1)$ time.

Structural Properties

Throughout this section, we limit our attention to one tree $S$ from the profile $\mathcal{P}$. We show how to solve the TBR-RS problem for the instance $\langle (S), T, v \rangle$ for some non-root node $v \in V(T)$ in $O(n)$ time. Based on this solution, it is straightforward to solve the TBR-RS problem on the instance $\langle \mathcal{P}, T, v \rangle$ within $O(kn)$ time as well. For clarity, we will also assume that $L(S) = L(T)$. In general, if $L(S) \subset L(T)$ then we can simply set $T$ to be $T[L(S)]$. This takes $O(n)$ time and, consequently, does not affect the time complexity of our algorithm.

Our algorithm makes use of the LCA mapping from $S$ to $T$. This mapping is defined as follows.

Definition 6 (LCA Mapping).
Given two trees $T'$ and $T$ such that $L(T') \subseteq L(T)$, the LCA mapping $\mathcal{M}_{T',T} : V(T') \to V(T)$ is the mapping $\mathcal{M}_{T',T}(u) = lca_T(L(T'_u))$.

Notation. We define a boolean function $f_T : I(S) \to \{0, 1\}$ such that $f_T(u) = 1$ if there exists a node $v \in I(T)$ such that $\mathcal{C}_S(u) = \mathcal{C}_T(v)$, and $f_T(u) = 0$ otherwise. Thus, $f_T(u) = 1$ if and only if the cluster $\mathcal{C}_S(u)$ exists in the tree $T$ as well. Additionally, we define $\mathcal{F}_T = \{u \in I(S) : f_T(u) = 0\}$; that is, $\mathcal{F}_T$ is the set of all nodes $u \in I(S)$ such that the cluster $\mathcal{C}_S(u)$ does not exist in the tree $T$.

The following lemma associates the value $RF(S, T)$ with the cardinality of the set $\mathcal{F}_T$.

Lemma 1. $RF(S, T) = |I(T)| - |I(S)| + 2|\mathcal{F}_T|$.

Proof. Let $\overline{\mathcal{F}}_T$ denote the set $\{u \in I(S) : f_T(u) = 1\}$. By the definition of $RF(S, T)$, we must have $RF(S, T) = |I(T)| + |I(S)| - 2|\overline{\mathcal{F}}_T|$. And hence, since $|\mathcal{F}_T| + |\overline{\mathcal{F}}_T| = |I(S)|$, we get $RF(S, T) = |I(T)| - |I(S)| + 2|\mathcal{F}_T|$.

Lemma 2. For any $u \in I(S)$, $f_T(u) = 1$ if and only if $|\mathcal{C}_S(u)| = |\mathcal{C}_T(\mathcal{M}_{S,T}(u))|$.

Proof. If $|\mathcal{C}_S(u)| = |\mathcal{C}_T(\mathcal{M}_{S,T}(u))|$ then we must have $\mathcal{C}_S(u) = \mathcal{C}_T(\mathcal{M}_{S,T}(u))$ and, consequently, $f_T(u) = 1$. In the other direction, if $|\mathcal{C}_S(u)| \ne |\mathcal{C}_T(\mathcal{M}_{S,T}(u))|$, then we must have $\mathcal{C}_S(u) \subset \mathcal{C}_T(\mathcal{M}_{S,T}(u))$ and, consequently, $f_T(u) = 0$.

The LCA mapping from $S$ to $T$ can be computed in $O(n)$ time [39], and consequently, by Lemmas 1 and 2, we can compute the RF distance between $S$ and $T$ in $O(n)$ time as well (other $O(n)$-time algorithms for calculating the RF distance are presented in [30,31]). Moreover, Lemma 1 implies that in order to find a tree $T^* \in TBR_T(v)$ such that $RF(\mathcal{P}, T^*) = \min_{T' \in TBR_T(v)} RF(\mathcal{P}, T')$, it is sufficient to find a tree $T^* \in TBR_T(v)$ for which $|\mathcal{F}_{T^*}| = \min_{T' \in TBR_T(v)} |\mathcal{F}_{T'}|$.

Remark: An implicit assumption here is that the leaves of both trees are labeled by integers $\{1, \ldots, n\}$. If the leaf labels are arbitrary, then we require an additional $O(kn \log n)$-time preprocessing step to relabel the leaves of the trees in the given profile. Note, however, that this additional step does not add to the overall time complexity of solving the TBR-S or SPR-S problems.

We now show that the TBR-RS problem can be solved by solving two smaller problems separately and combining their solutions.
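The RF distance with restriction, and the identity of Lemma 1, can be checked on a small example. The sketch below is only illustrative: it represents rooted trees as nested Python tuples with string leaves (an assumption made here), all function names are hypothetical, and it uses plain set operations rather than the linear-time LCA-mapping approach described in the text.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clusters(tree):
    """Return the clusters of all non-root internal nodes of the tree."""
    found = set()

    def visit(node, is_root):
        if not isinstance(node, tuple):
            return  # leaves contribute no cluster
        if not is_root:
            found.add(leaf_set(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def restrict(tree, keep):
    """Leaf-induced subtree on `keep`; nodes left with one child are suppressed."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def rf_distance(s, t):
    """Unnormalized RF distance between input tree s and supertree t."""
    return len(clusters(s) ^ clusters(restrict(t, leaf_set(s))))


def rf_via_lemma1(s, t):
    """Same value computed as |I(T)| - |I(S)| + 2 * (clusters of S missing in T)."""
    t_clusters = clusters(restrict(t, leaf_set(s)))
    s_clusters = clusters(s)
    return len(t_clusters) - len(s_clusters) + 2 * len(s_clusters - t_clusters)


def rf_profile(profile, t):
    """Sum of the RF distances from supertree t to every tree in the profile."""
    return sum(rf_distance(s, t) for s in profile)


S = (("a", "b"), ("c", "d"))
T = ((("a", "b"), "c"), ("d", "e"))
```

Here `T` restricted to the leaves of `S` is `((("a", "b"), "c"), "d")`, which shares the cluster {a, b} with `S` but not {c, d}, so both `rf_distance(S, T)` and `rf_via_lemma1(S, T)` evaluate to 2.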
As before, we limit our attention to one tree $S$ from the profile $\mathcal{P}$. Given the TBR-RS instance $\langle (S), T, v \rangle$, we define a bipartition $\{X, \overline{X}\}$ of $I(S)$, where $X = \{u \in I(S) : \mathcal{M}_{S,T}(u) \in V(T_v)\}$.

Lemma 3. If $u \in X$, then $f_{T'}(u) = f_T(u)$ for all $T' \in TBR_T(v, v)$. If $u \in \overline{X}$ and $y$ denotes the sibling of $v$, then $f_{T'}(u) = f_T(u)$, where $T' = TBR_T(v, x, y)$ for any $x \in V(T_v)$.

Proof. Consider the case when $u \in X$. Let $T'$ be any tree in $TBR_T(v, v)$ and let node $y \in V(T)$ be such that $T' = TBR_T(v, v, y)$. Thus, for any node $w \in V(T_v)$, the subtrees $T_w$ and $T'_w$ must be identical. Since $u \in X$, we must have $\mathcal{M}_{S,T}(u) \in V(T_v)$ and, consequently, $\mathcal{C}_T(\mathcal{M}_{S,T}(u)) = \mathcal{C}_{T'}(\mathcal{M}_{S,T'}(u))$. Lemma 2 now implies that $f_{T'}(u) = f_T(u)$.

Now consider the case when $u \in \overline{X}$. Node $y$ denotes the sibling of $v$ in tree $T$ and let $T' = TBR_T(v, x, y)$, for some $x \in V(T_v)$. Thus, for any node $w \in V(T) \setminus V(T_v)$, we must have $L(T_w) = L(T'_w)$. Moreover, the leaf sets of the two subtrees rooted at the children of $w$ in $T$ must be identical to the leaf sets of the two subtrees rooted at the children of $w$ in $T'$. This implies that if $\mathcal{M}_{S,T}(u) = w$, then $\mathcal{M}_{S,T'}(u) = w$ as well. By Lemma 2 we must therefore have $f_{T'}(u) = f_T(u)$.

Lemma 3 implies that a tree in $TBR_T(v)$ with smallest RF distance can be obtained by optimizing the rooting for the pruned subtree, and optimizing the regraft location separately. This allows us to obtain a tree in $TBR_T(v)$ with smallest RF distance by evaluating only $O(n)$ trees. Contrast this with the naïve approach to finding a tree in $TBR_T(v)$ with smallest total distance, which is to evaluate all trees obtained by rerooting the pruned subtree in all possible ways, and, for each rerooting, regrafting the subtree in all possible locations. Since there are $O(n)$ ways to reroot the pruned subtree, and $O(n)$ ways to regraft, this would require evaluating $O(n^2)$ trees. It is interesting to note that this ability to decompose the TBR-RS problem into two simpler problems is not unique to the context of RF supertrees alone. For example, it has been observed that a similar decomposition can be achieved in the context of the gene duplication problem [37].

Thus, to solve the TBR-RS problem, we must find (i) a rerooting $T'$ of the subtree $T_v$ for which $|\mathcal{F}_{T'}|$ is minimized, and (ii) a regraft location $y$ for $T_v$ which minimizes $|\mathcal{F}_{SPR_T(v, y)}|$. Observe that the problem in part (ii) is simply the SPR-RS problem on the input instance $\langle (S), T, v \rangle$. For part (i), consider the following problem statement.

Problem 4 (Rooting). Given instance $\langle \mathcal{P}, T, v \rangle$, where $\mathcal{P}$ is the profile $(T_1, \ldots, T_k)$, $T$ is a supertree on $\mathcal{P}$, and $v$ is a non-root node in $V(T)$, find a node $x \in V(T_v)$ for which $RF(\mathcal{P}, TBR_T(v, x, y))$ is minimum, where $y$ denotes the sibling of $v$ in $T$.

Note that the problem in part (i) is the Rooting problem on the input instance $\langle (S), T, v \rangle$. We show how to solve both the Rooting and the SPR-RS problems in $O(n)$ time on instance $\langle (S), T, v \rangle$. As seen above, based on Lemma 3, this immediately implies that the TBR-RS problem for a profile consisting of a single tree can be solved in $O(n)$ time. To solve the TBR-RS problem on instance $\langle \mathcal{P}, T, v \rangle$, we simply solve the Rooting and SPR-RS problems separately on the input instance $\langle \mathcal{P}, T, v \rangle$, which takes $O(kn)$ time (see Theorems 3 and 4). We thus have the following two theorems.
Theorem 1. The TBR-RS problem can be solved in $O(kn)$ time.

Theorem 2. The TBR-S problem can be solved in $O(kn^2)$ time.

Solving the Rooting Problem

To solve the Rooting problem on instance $\langle (S), T, v \rangle$, we rely on an efficient algorithm for computing the value of $f_{T'}(u)$ for any $T' \in RR(T_v)$ and any $u \in I(S)$. This algorithm relies on the following five lemmas. Let $a$ denote the node $\mathcal{M}_{S,T}(u)$, $y$ denote the sibling of $v$ in $T$, and $T' = TBR_T(v, x, y)$ for $x \in V(T_v)$. Depending on $a$ and $f_T(u)$ there are five possible cases: (i) $a \notin V(T_v)$, (ii) $a = rt(T_v)$ and $f_T(u) = 1$, (iii) $a = rt(T_v)$ and $f_T(u) = 0$, (iv) $a \in V(T_v) \setminus rt(T_v)$ and $f_T(u) = 1$, and (v) $a \in V(T_v) \setminus rt(T_v)$ and $f_T(u) = 0$. Lemmas 4 through 8 characterize the value $f_{T'}(u)$ for each of these five cases respectively.

Lemma 4. If $a \notin V(T_v)$, then $f_{T'}(u) = f_T(u)$ for any $x \in V(T_v)$.

Proof. Follows directly from Lemma 3.

Lemma 5. If $a = rt(T_v)$ and $f_T(u) = 1$, then $f_{T'}(u) = 1$ for all $x \in V(T_v)$.

Proof. Since we have $a = rt(T_v)$ and $f_T(u) = 1$, by Lemma 2 we must have $L(S_u) = L(T_v)$. Thus, for any $x \in V(T_v)$, $\mathcal{M}_{S,T'}(u)$ must be the root of the subtree $RR(T_v, x)$. The lemma follows.

Lemma 6. Let $L$ denote the set $L(T_v) \setminus L(S_u)$, and let $b = lca_{T_v}(L)$. If $a = rt(T_v)$ and $f_T(u) = 0$, then

1. for $x \not\le_{T_v} b$, $f_{T'}(u) = 0$, and
2. for $x \le_{T_v} b$, $f_{T'}(u) = 1$ if and only if $|L| = |L(T_b)|$.

Proof. Since $a = rt(T_v)$ and $f_T(u) = 0$, by Lemma 2 we must have $L(S_u) \subset L(T_v)$. We analyze each part of the lemma separately.

1. $x \not\le_{T_v} b$: For this case to be valid, we must have $b <_{T_v} rt(T_v)$. Therefore, let $b' = pa_{T_v}(b)$. For any $T'$ in this case, $b' = pa_{T'}(b)$. Moreover, L(Tb) L(Su). Therefore, we must have b

Proof. By Lemma 2 we must have $L(S_u) = L(T_a)$. We have two cases:

1. $x \not<_{T_v} a$: In this case we must have $\mathcal{M}_{S,T'}(u) = a$ and $L(T_a) = L(T'_a)$. Thus, $L(S_u) = L(T'_a)$ and hence, $f_{T'}(u) = 1$.
2. $x <_{T_v} a$: In this case, $\mathcal{M}_{S,T'}(u)$ must be the root of the subtree $RR(T_v, x)$. Since $L(RR(T_v, x)) = L(T_v)$, and $L(S_u) \ne L(T_v)$, Lemma 2 implies that $f_{T'}(u) = 0$.

The lemma follows.

Lemma 8. If $a \in V(T_v) \setminus rt(T_v)$ and $f_T(u) = 0$, then $f_{T'}(u) = 0$ for all $x \in V(T_v)$.

Proof. By Lemma 2 we must have $L(S_u) \subset L(T_a)$. We have two possible cases:

1. $x \not<_{T_v} a$: In this case we must have $\mathcal{M}_{S,T'}(u) = a$ and $L(T_a) = L(T'_a)$. Thus, $L(S_u) \subset L(T'_a)$ and hence, $f_{T'}(u) = 0$.
2. $x <_{T_v} a$: In this case, $\mathcal{M}_{S,T'}(u)$ must be the root of the subtree $RR(T_v, x)$. Since $L(RR(T_v, x)) = L(T_v)$, and $L(S_u) \ne L(T_v)$, Lemma 2 implies that $f_{T'}(u) = 0$.

The lemma follows.

The Algorithm. For any $x \in V(T_v)$ let $A(x)$ denote the cardinality of the set $\{u \in I(S) : f_T(u) = 0, \text{ but } f_{T'}(u) = 1\}$, and $B(x)$ the cardinality of the set $\{u \in I(S) : f_T(u) = 1, \text{ but } f_{T'}(u) = 0\}$, where $T' = TBR_T(v, x, y)$. By definition, to solve the Rooting problem we must find a node $x \in V(T_v)$ for which $A(x) - B(x)$ is maximized. Our algorithm computes, at each node $x \in V(T_v)$, the values $A(x)$ and $B(x)$.

In a preprocessing step, our algorithm computes the mapping $\mathcal{M}_{S,T}$ as well as the size of each cluster in $S$ and $T$, and creates and initializes (to 0) two counters $a(x)$ and $b(x)$ at each node $x \in V(T_v)$. This takes $O(n)$ time. When the algorithm terminates, the counters $a(x)$ and $b(x)$ at any $x \in V(T_v)$ will hold the values $A(x)$ and $B(x)$.

Recall that, given $u \in I(S)$, $a$ denotes the node $\mathcal{M}_{S,T}(u)$. Thus, any given $u \in I(S)$ must satisfy the precondition (given in terms of $a$) of exactly one of the Lemmas 4 through 8. Moreover, the precondition of each of these lemmas can be checked in $O(1)$ time.

The algorithm then traverses through $S$ and considers each node $u \in I(S)$. There are three cases:

1. If $u$ satisfies the preconditions of Lemmas 4, 5, or 8 then we must have $f_{T'}(u) = f_T(u)$. Consequently, we do nothing in this case.
2. If $u$ satisfies the precondition of Lemma 7, then we increment the value of $b(x)$ at each node $x \in V(T_a) \setminus \{a\}$ (where $a$ is as in the statement of Lemma 7). To do this efficiently we can simply increment a counter at node $a$ such that, after all $u \in I(S)$ have been considered, a single pre-order traversal of $T_v$ can be used to compute the correct values of $b(x)$ at each $x \in V(T_v)$.

3. If $u$ satisfies the precondition of Lemma 6, then we proceed as follows: Let $a$ and $L$ be as in the statement of Lemma 6. According to the Lemma, if we can find a node $b \in V(T_v)$ such that $b = lca_{T_v}(L)$ and $|L(T_b)| = |L|$, then we increment the value of $a(x)$ at each node $x \in V(T_b)$; otherwise, if such a $b$ does not exist, we do nothing. As before, to do this efficiently, we only increment a single counter at node $b$ such that, after all $u \in I(S)$ have been considered, a pre-order traversal of $T_v$ suffices to compute the correct values of $a(x)$ at each $x \in V(T_v)$. In order to prove the $O(n)$ run-time for this algorithm we will now explain how to precompute such a corresponding node $b$ (if it exists), for each $u \in I(S)$ satisfying the precondition of Lemma 6, within $O(n)$ time. Note that any edge in a tree bi-partitions its leaf set. Construct the tree $S' = S[L(T_v)]$. Observe that, given any candidate $u$, the corresponding node $b$ exists if and only if the partition of $L(S')$ induced by the edge $(u, pa(u)) \in E(S')$ is also induced by some edge, $e$, in the tree $T_v$. If such an $e$ exists, then $b$ must be that node on $e$ which is farther away from the root, i.e. the edge $e$ must be the edge $(b, pa(b))$ in $T_v$. This edge $e$ (or its absence) can be precomputed, for all candidate $u$, as follows: Compute the strict consensus of the unrooted variants of the trees $S'$ and $T_v$. Every edge in this strict consensus corresponds to an edge in $S'$ and an edge in $T_v$ that induce the same bi-partitions in the two trees.
Thus, for all candidate $u$ that lie on such an edge, the corresponding node $b$ can be inferred in $O(1)$ time (by using the association between the edges of the strict consensus and the edges of $S'$ and $T_v$), and for all candidate $u$ that do not lie on such an edge, we know that the corresponding node $b$ does not exist. This strict consensus of the unrooted variants of $S'$ and $T_v$ can be precomputed within $O(n)$ time by using the algorithm of Day [30].

Hence, the Rooting problem for a profile consisting of a single tree can be solved in $O(n)$ time, yielding the following theorem.

Theorem 3. The Rooting problem can be solved in $O(kn)$ time.

Solving the SPR-RS Problem

We will show how to solve the SPR-RS problem on instance $\langle (S), T, v \rangle$ in $O(n)$ time. Consider the tree $R = SPR_T(v, rt(T))$. Observe that, since $SPR_R(v) = SPR_T(v)$, solving the SPR-RS problem on instance $\langle (S), T, v \rangle$ is equivalent to solving it on the instance $\langle (S), R, v \rangle$. Thus, in the remainder of this section, we will work with tree $R$ instead of tree $T$. The following four lemmas let us efficiently infer, for any $u \in I(S)$, whether $f_{T'}(u) = 1$ or $f_{T'}(u) = 0$, for any given $T' \in SPR_R(v)$.

For brevity, let $a$ denote the node $\mathcal{M}_{S,R}(u)$, and let $Q$ denote the set $V(R) \setminus (V(R_v) \cup \{rt(R)\})$. Let $T' = SPR_R(v, x)$, for any $x \in Q$.

Depending on $a$ and $f_R(u)$ there are four possible cases: (i) $a \in V(R_v)$, (ii) $a \in Q$ and $f_R(u) = 1$, (iii) $a \in Q$ and $f_R(u) = 0$, and (iv) $a = rt(R)$. Lemmas 9 through 12 characterize the value $f_{T'}(u)$ for each of these four cases respectively.

Lemma 9. If $a \in V(R_v)$, then $f_{T'}(u) = f_R(u)$ for any $x \in Q$.

Proof. Observe that $TBR_R(v, v) = SPR_R(v)$. Lemma 3 now immediately completes the proof.

Lemma 10. If $a \in Q$ and $f_R(u) = 1$, then

1. $f_{T'}(u) = 0$, for x

10). This can be done efficiently as shown in part (2) of the algorithm for the Rooting problem.

3. If $u$ satisfies the precondition of Lemma 6, and if $|L(R_b)| + |L(R_v)| = |L(S_u)|$, then we increment the value of $a(x)$ at each node $x \in V(T_b) \setminus \{b\}$ (where $a$ and $b$ are as in the statement of Lemma 6). Again, to do this efficiently, we increment a counter at node $b$, and perform a subsequent pre-order traversal. Note also that the mapping $\mathcal{M}_{S',R}$ can be computed in $O(n)$ time in the preprocessing step, and hence the node $b$ can be inferred in $O(1)$ time. The condition $|L(R_b)| + |L(R_v)| = |L(S_u)|$ is also verifiable in $O(1)$ time.

Hence, the SPR-RS problem for a profile consisting of a single tree can be solved in $O(n)$ time, yielding the following theorem.

Theorem 4. The SPR-RS problem can be solved in $O(kn)$ time.

Remark. To improve the performance of local search heuristics in phylogeny construction, the starting tree for the first local search step is often constructed using a greedy 'stepwise addition' procedure. This greedy procedure builds a starting species tree step-by-step by adding one taxon at a time at its locally optimal position. In the context of RF supertrees, our algorithm for the SPR-RS problem also yields a $\Theta(n)$ speed-up over naïve algorithms for this greedy procedure.

Experimental Evaluation

In order to evaluate the performance of the RF supertree method, we implemented an RF heuristic based on the SPR local search algorithm. We focused on the SPR local search because it is faster and simpler to implement than TBR, and in analyses of MRF and triplet supertrees, the performance of SPR and TBR was very similar [22,38]. We compared the performance of the RF supertree heuristic to MRP and the triplet supertree method (which seeks a supertree with the most shared triplets with the collection of input trees) using published supertree data sets from sea birds [40], marsupials [41], placental mammals [42], and legumes [43]. The published data sets contain between 7 and 726 input trees and between 112 and 571 total taxa (Table 1).
There are a number of ways to implement any local search algorithm. Preliminary analyses of the RF heuristic based on the SPR local search indicated that, as with other phylogenetic methods, the starting tree can affect the estimate of the final supertree. Occasionally the SPR searches got caught in local optima with relatively high RF-distance scores. To ameliorate this potential problem, we implemented a ratchet search heuristic for RF supertrees based on the parsimony ratchet [44]. In general, a ratchet search performs a number of iterations (in our case 25) that consist of two local SPR searches: one in which the characters (input trees) are equally weighted, and another in which the set of the characters are re-weighted. We re-weighted the characters by randomly removing approximately two-thirds of the input trees. The goal of re-weighting the characters is to alter the tree space to avoid getting caught in a globally suboptimal part of the tree space. At the end of each iteration, the best tree is taken as the starting point of the next iteration. For each data set, we started RF ratchet searches from 20 random sequence addition starting trees, and we also ran three replicates starting from an optimal MRP supertree. All RF supertree analyses were performed on a 3 GHz Intel Pentium 4 based desktop computer with 1 GB of main memory. The RF-ratchet runs took between 5 seconds (for the Sea Birds data set) and 90 minutes (for the legume data set) when starting from a random sequence addition tree. RF-ratchet runs starting from optimal MRP trees were at least twice as fast because they required fewer search steps.

For our MRP analyses, we also tried two heuristic search methods, both implemented using PAUP* [18].
First, we performed 20 replicates of TBR branch swapping from trees built with random addition sequence starting trees. Next, we performed 20 replicates of a parsimony ratchet search with TBR branch swapping. Based on the results of trial analyses, each ratchet search consisted of 25 iterations, each reweighting 15% of the characters. The PAUP* command block for the parsimony ratchet searches was generated using PAUPRat [45]. For each data set, we performed 20 replicates of a TBR local search heuristic starting with random addition sequence trees. Triplet supertrees were constructed using the program from Lin et al. [38]. We were unable to perform ratchet searches with the existing triplet supertree software, and also, due to memory limitations, we were unable to perform triplet supertree analyses on the legume data set.

Table 1 Experimental Results

Data Set                                   Supertree Method   RF-Distance   Parsimony Score
Marsupial (272 taxa; 158 trees)            RF-Ratchet         1514          2528
                                           RF-MRP             1502*         2513
                                           MRP-TBR            1514          2509*
                                           MRP-Ratchet        1514          2509*
                                           Triplet            1604          2569
Sea Birds (121 taxa; 7 trees)              RF-Ratchet         61*           223
                                           RF-MRP             61*           223
                                           MRP-TBR            63            221*
                                           MRP-Ratchet        63            221*
                                           Triplet            61*           223
Placental Mammals (116 taxa; 726 trees)    RF-Ratchet         5686*         8926
                                           RF-MRP             5690          8890
                                           MRP-TBR            5694          8878*
                                           MRP-Ratchet        5694          8878*
                                           Triplet            6032          9064
Legumes (571 taxa; 22 trees)               RF-Ratchet         1556          965
                                           RF-MRP             1534*         882
                                           MRP-TBR            1554          856
                                           MRP-Ratchet        1552          854*
                                           Triplet            N/A           N/A

Experimental results comparing the performance of the RF supertree method to MRP and triplet supertree methods. We used five different supertree analyses: RF supertrees using our SPR local search algorithm with a ratchet starting from either random addition sequence trees (RF-ratchet) or MRP trees (RF-MRP), MRP with TBR branch swapping with (MRP-ratchet) and without (MRP-TBR) a ratchet search, and triplet supertrees with a TBR local search (Triplet). We measured the RF distance to the collection of input trees (RF distance) and the parsimony score of a best found supertree based on the matrix representation of the input trees. The best RF distance and parsimony scores are marked with an asterisk.

Our analyses demonstrate the effectiveness of our local search heuristics for the RF supertree problem. In all four data sets, RF-ratchet searches found the supertrees with the lowest total RF distance to the input trees (Table 1). MRP also generally performs well, finding supertrees with RF distances between 0.14% (placental mammals) and 3.3% (sea birds) higher than the best score found by the RF supertree heuristics (Table 1). The triplet supertree method performs as well as the RF supertree method on the small sea bird data set; however, the triplet supertrees for the marsupial and placental mammal data sets have a much higher RF distance to the input trees than either the RF or MRP supertrees (Table 1). For all the data sets, the MRP supertrees had the lowest (best) parsimony score based on a binary matrix representation of the input trees (Table 1). Thus, not surprisingly, it appears that optimizing based on the parsimony score or the triplet distance to the input trees does not optimize the similarity of the supertrees to the input trees based on the RF distance metric (see also [11,13]).
All of the data sets used in this analysis are from published studies that used MRP. Therefore, it is not surprising that MRP performed well (but see [46]). Still, our results demonstrate that MRP leaves some room for improvement. If the goal is to find the supertrees that are most similar to the collection of input trees, the RF searches ultimately provide better estimates than MRP (Table 1).
Interestingly, while the MRP trees tend to have relatively low RF-distance scores, in some cases, such as the legume data set, trees with low RF-distance scores have high parsimony scores (Table 1). Thus, parsimony scores are not necessarily indicative of RF score, and MRP and RF supertree optimality criteria are certainly not equivalent. Still, MRP trees appear to be useful as starting points for RF supertree heuristics. Indeed, in three of the four data sets, the best RF trees were found in ratchet searches beginning from MRP trees (Table 1).
Our program for computing RF supertrees is freely available (for Windows, Linux, and Mac OS X) at http://genome.cs.iastate.edu/CBL/RFsupertrees
Discussion and Conclusion
There is a growing interest in using supertrees for large scale evolutionary and ecological analyses. Yet there are many concerns about the performance of existing supertree methods, and the great majority of published supertree analyses have relied on only MRP [47]. Since the goal of a supertree analysis is to synthesize the phylogenetic data from a collection of input trees, it makes sense that an effective supertree method should directly seek the supertree that is most similar to the input trees. Our new algorithms make it possible, for the first time, to estimate large supertrees by directly optimizing the RF distance from the supertree to the input trees.
There are numerous alternate metrics to compare phylogenetic trees besides the RF distance, and any of these can be used for supertree methods (see, for example, [11]). Triplet distance supertrees [11,48], quartet-fit and quartet joining supertrees [11,24], maximum splits-fit supertrees [11], and most similar supertrees [49] are all, like RF supertrees, estimated by comparing input trees to the supertree using tree distance measures. All of these methods may provide different, and perhaps equally valid, perspectives on supertree accuracy. Based on our experimental analyses using the RF and triplet supertree method, optimizing the supertree based on different distance measures can result in very different supertrees (Table 1). In the future, it will be important to characterize and compare the performance of these methods in more detail (see, for example, [11,50]).
The results also suggest several future directions for research. Although heuristics guided by local search problems, especially SPR and TBR, have been very effective for many intrinsically difficult phylogenetic inference problems, our experiments indicate that the tree space for RF supertrees is complex. The ratchet approach and starting from MRP trees both appear to improve the performance in the four examples we tested (Table 1). However, more work is needed to identify the most efficient ways to implement our fast local search heuristics. Also, the use of alternative supertree methods (other than MRP) to generate starting trees might result in a better global strategy to compute RF supertrees, and this should be investigated further. We note that the ideas presented in [51] can be directly used to perform efficient NNI-based local searches for the RF supertree problem. In particular, we can show that heuristic searches for the RF supertree problem, which perform a total of p local search steps based on 1, 2, or 3-NNI neighborhoods (see [51]), can all be executed in O(kn(n + p)) time, yielding speed-ups of Θ(min{n, p}), Θ(n min{n, p}) and Θ(n² min{n, p}) for heuristic searches that are based on naïve algorithms for 1, 2 and 3-NNI local searches respectively. It would also be interesting to see if heuristics based on TBR perform significantly better than those based on SPR in inferring RF supertrees.
In some cases it might be desirable to remove the restriction that the supertree be binary. In the consensus setting, such a median tree can be obtained within polynomial time [27]; however, finding a median RF tree in the supertree setting is NP-hard [52]. One simple way to estimate a non-binary median tree could be to first compute an RF supertree and then to iteratively (and perhaps greedily) contract those edges in the supertree that result in a reduction in the total RF distance. Thus, our algorithms may even be useful for roughly estimating majority-rule(-) supertrees [28], which are essentially the strict consensus of all optimal, not necessarily binary, median RF trees, and have several desirable properties [29]. These majority-rule(-) supertrees are also the strict consensus of all maximum-likelihood supertrees [53]. Also, there are several alternate forms of the RF distance metric that could be incorporated into our local search algorithms. For example, in order to account for biases associated with the different sizes of input trees, we could normalize the RF distance for each input tree, dividing the observed RF distance by the maximum possible RF distance based on the tree size. Similarly, we could incorporate either branch length data or phylogenetic support scores (bootstrap values or posterior probabilities) from the input trees into the RF distance in order to give more weight to partitions that are strongly supported or separated by long branches (e.g., [25,54]). Our current implementation essentially treats all branch lengths as one and all partitions as equal. The addition of branch length or support data may further improve the accuracy of the RF supertree method.
Acknowledgements
We thank Harris Lin for providing software for the triplet supertree analyses. This work was supported in part by NESCent and by NSF grants DEB-0334832 and DEB-0829674. MSB was supported in part by a postdoctoral fellowship from the Edmond J. Safra Bioinformatics program at Tel-Aviv University.
Author details
1 The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel.
2 Department of Computer Science, Iowa State University, Ames, IA 50011, USA.
3 Department of Biology, University of Florida, Gainesville, FL 32611, USA.
Authors' contributions
MSB was responsible for algorithm design and program implementation, contributed to the experimental evaluation, and wrote major parts of the manuscript. JGB performed the experimental evaluation and the analysis of the results, and contributed to the writing of the manuscript. OE and DFB supervised the project and contributed to the writing of the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Received: 27 June 2009 Accepted: 24 February 2010 Published: 24 February 2010
References
1. Davies TJ, Barraclough TG, Chase MW, Soltis PS, Soltis DE, Savolainen V: Darwin's abominable mystery: Insights from a supertree of the angiosperms. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(7):1904-1909.
2. Bininda-Emonds ORP, Cardillo M, Jones KE, Macphee RDE, Beck RMD, Grenyer R, Price SA, Vos RA, Gittleman JL, Purvis A: The delayed rise of present-day mammals. Nature 2007, 446(7135):507-512.
3. Daubin V, Gouy M, Perriere G: A Phylogenomic Approach to Bacterial Phylogeny: Evidence of a Core of Genes Sharing a Common History. Genome Res 2002, 12(7):1080-1090.
4. Burleigh JG, Driskell AC, Sanderson MJ: Supertree bootstrapping methods for assessing phylogenetic variation among genes in genome-scale data sets. Systematic Biology 2006, 55:426-440.
5. Pisani D, Cotton JA, McInerney JO: Supertrees disentangle the chimerical origin of eukaryotic genomes. Mol Biol Evol 2007, 24(8):1752-1760.
6. Webb CO, Ackerly DD, McPeek M, Donoghue MJ: Phylogenies and community ecology. Ann Rev Ecol Syst 2002, 33:475-505.
7. Davies TJ, Fritz SA, Grenyer R, Orme CDL, Bielby J, Bininda-Emonds ORP, Cardillo M, Jones KE, Gittleman JL, Mace GM, Purvis A: Phylogenetic trees and the future of mammalian biodiversity. Proceedings of the National Academy of Sciences 2008, 105(Supplement 1):11556-11563.
8. Purvis A: A modification to Baum and Ragan's method for combining phylogenetic trees. Systematic Biology 1995, 44:251-255.
9. Pisani D, Wilkinson M: Matrix Representation with Parsimony, Taxonomic Congruence, and Total Evidence. Systematic Biology 2002, 51:151-155.
10. Bininda-Emonds ORP, Gittleman JL, Steel MA: The (super)tree of life: procedures, problems, and prospects. Annual Review of Ecology and Systematics 2002, 33:265-289.
11. Wilkinson M, Cotton JA, Creevey C, Eulenstein O, Harris SR, Lapointe FJ, Levasseur C, McInerney JO, Pisani D, Thorley JL: The shape of supertrees to come: Tree shape related properties of fourteen supertree methods. Syst Biol 2005, 54:419-432.
12. Goloboff PA: Minority rule supertrees? MRP, Compatibility, and Minimum Flip may display the least frequent groups. Cladistics 2005, 21(3):282-294.
13. Wilkinson M, Cotton JA, Lapointe FJ, Pisani D: Properties of Supertree Methods in the Consensus Setting. Syst Biol 2007, 56(2):330-337.
14. Day WH, McMorris F, Wilkinson M: Explosions and hot spots in supertree methods. Journal of Theoretical Biology 2008, 253(2):345-348.
15. Baum BR: Combining Trees as a Way of Combining Data Sets for Phylogenetic Inference, and the Desirability of Combining Gene Trees. Taxon 1992, 41:3-10.
16. Ragan MA: Phylogenetic inference based on matrix representation of trees. Molecular Phylogenetics and Evolution 1992, 1:53-58.
17. Goloboff PA: Analyzing Large Data Sets in Reasonable Times: Solutions for Composite Optima. Cladistics 1999, 15(4):415-428.
18. Swofford DL: PAUP*: Phylogenetic analysis using parsimony (*and other methods), Version 4.0b10. 2002.
19. Roshan U, Moret BME, Warnow T, Williams TL: Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees. CSB 2004, 98-109.
20. Bininda-Emonds O, Sanderson M: Assessment of the accuracy of matrix representation with parsimony analysis supertree construction. Systematic Biology 2001, 50:565-579.
21. Eulenstein O, Chen D, Burleigh JG, Fernández-Baca D, Sanderson MJ: Performance of Flip Supertree Construction with a Heuristic Algorithm. Systematic Biology 2003, 53:299-308.
22. Chen D, Eulenstein O, Fernández-Baca D, Burleigh JG: Improved Heuristics for Minimum-Flip Supertree Construction. Evolutionary Bioinformatics 2006, 2.
23. Creevey CJ, McInerney JO: Clann: investigating phylogenetic information through supertree analyses. Bioinformatics 2005, 21(3):390-392.
24. Wilkinson M, Cotton JA: Supertree Methods for Building the Tree of Life: Divide-and-Conquer Approaches to Large Phylogenetic Problems. Reconstructing the Tree of Life: Taxonomy and Systematics of Species Rich Taxa, CRC Press, Hodkinson TR, Parnell JAN 2007, 61-76.
25. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Mathematical Biosciences 1981, 53(1-2):131-147.
26. McMorris FR, Steel MA: The complexity of the median procedure for binary trees. Proceedings of the International Federation of Classification Societies 1993.
27. Barthélemy JP, McMorris FR: The median procedure for n-trees. Journal of Classification 1986, 3:329-334.
28. Cotton JA, Wilkinson M: Majority-Rule Supertrees. Systematic Biology 2007, 56:445-452.
29. Dong J, Fernández-Baca D: Properties of Majority-Rule Supertrees. Syst Biol 2009, 58(3):360-367.
30. Day WHE: Optimal algorithms for comparing trees with labeled leaves. Journal of Classification 1985, 2:7-28.
31. Pattengale ND, Gottlieb EJ, Moret BME: Efficiently Computing the Robinson-Foulds Metric. Journal of Computational Biology 2007, 14(6):724-735, [PMID: 17691890].
32. Semple C, Steel M: Phylogenetics. Oxford University Press 2003.
33. Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 2001, 5:1-13.
34. Swofford DL, Olsen GJ, Waddel PJ, Hillis DM: Phylogenetic inference. Molecular Systematics, Sunderland, Mass: Sinauer Assoc, Hillis DM, Moritz C, Mable BK 1996, 407-509.
35. Bordewich M, Semple C: On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics 2004, 8:409-423.
36. Bansal MS, Burleigh JG, Eulenstein O, Wehe A: Heuristics for the Gene-Duplication Problem: A Θ(n) Speed-Up for the Local Search. RECOMB, Volume 4453 of Lecture Notes in Computer Science, Springer, Speed TP, Huang H 2007, 238-252.
37. Bansal MS, Eulenstein O: An Ω(n²/log n) Speed-Up of TBR Heuristics for the Gene-Duplication Problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2008, 5(4):514-524.
38. Lin H, Burleigh JG, Eulenstein O: Triplet supertree heuristics for the tree of life. BMC Bioinformatics 2009, 10(Suppl 1):S8.
39. Bender MA, Farach-Colton M: The LCA Problem Revisited. LATIN, Volume 1776 of Lecture Notes in Computer Science, Springer, Gonnet GH, Panario D, Viola A 2000, 88-94.
40. Kennedy M, Page R: Seabird supertrees: combining partial estimates of procellariiform phylogeny. The Auk 2002, 119:88-108.
41. Cardillo M, Bininda-Emonds ORP, Boakes E, Purvis A: A species-level phylogenetic supertree of marsupials. Journal of Zoology 2004, 264:11-31.
42. Beck R, Bininda-Emonds O, Cardillo M, Liu FG, Purvis A: A higher-level MRP supertree of placental mammals. BMC Evolutionary Biology 2006, 6:93.
43. Wojciechowski M, Sanderson M, Steele K, Liston A: Molecular phylogeny of the "Temperate Herbaceous Tribes" of Papilionoid legumes: a supertree approach. Advances in Legume Systematics, Kew: Royal Botanic Gardens, Herendeen P, Bruneau A 2000, 9:277-298.
44. Nixon KC: The parsimony ratchet: a new method for rapid parsimony analysis. Cladistics 1999, 15:407-414.
45. Sikes DS, Lewis PO: PAUPRat: PAUP* implementation of the parsimony ratchet. 2001.
46. Wilkinson M, Pisani D, Cotton JA, Corfe I: Measuring Support and Finding Unsupported Relationships in Supertrees. Syst Biol 2005, 54(5):823-831.
47. Bininda-Emonds OR (Ed): Phylogenetic supertrees. Springer Verlag 2004.
48. Snir S, Rao S: Using Max Cut to Enhance Rooted Trees Consistency. IEEE/ACM Trans. Comput. Biology Bioinform 2006, 3(4):323-333.
49. Creevey CJ, Fitzpatrick DA, Gayle K Philip RJK, O'Connell MJ, Pentony MM, Travers SA, Wilkinson M, McInerney JO: Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc Biol Sci 2004, 271(1557):2551-2558.
50. Thorley JL, Wilkinson M: A View of Supertree Methods. Bioconsensus, Volume 61 of DIMACS: Series in Discrete Mathematics and Theoretic Computer Science, Providence, Rhode Island, USA: American Mathematical Society 2003, 185-193.
51. Bansal MS, Eulenstein O, Wehe A: The Gene-Duplication Problem: Near-Linear Time Algorithms for NNI-Based Local Searches. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2009, 6(2):221-231.
52. Bryant D: Building trees, hunting for trees, and comparing trees: Theory and methods in phylogenetic analysis. PhD thesis, Dept. of Mathematics, University of Canterbury 1997.
53. Steel M, Rodrigo A: Maximum likelihood supertrees. Syst. Biol 2008, 57:243-250.
54. Kuhner MK, Felsenstein J: A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates [published erratum appears in Mol Biol Evol 1995 May;12(3):525]. Mol Biol Evol 1994, 11(3):459-468.
doi:10.1186/1748-7188-5-18
Cite this article as: Bansal et al.: Robinson-Foulds Supertrees. Algorithms for Molecular Biology 2010 5:18.


Research
Robinson-Foulds Supertrees
Mukul S Bansal (1,2), bansal@cs.iastate.edu
J Gordon Burleigh (3), gburleigh@ufl.edu
Oliver Eulenstein (2), oeulenst@cs.iastate.edu
David Fernández-Baca (2), corresponding author, fernande@cs.iastate.edu
1 The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
2 Department of Computer Science, Iowa State University, Ames, IA 50011, USA
3 Department of Biology, University of Florida, Gainesville, FL 32611, USA
Algorithms for Molecular Biology 2010, 5(1):18. ISSN 1748-7188.
http://www.almob.org/content/5/1/18
doi:10.1186/1748-7188-5-18, PMID: 20181274
Received: 27 June 2009 Accepted: 24 February 2010 Published: 24 February 2010
Copyright 2010 Bansal et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background
Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees.
Results
We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Θ(n) and Θ(n²) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods.
Conclusions
Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.
Introduction
Supertree methods provide a formal approach for combining small phylogenetic trees with incomplete species overlap in order to build comprehensive species phylogenies, or supertrees, that contain all species found in the input trees. Supertree analyses have produced the first family-level phylogeny of flowering plants [1] and the first phylogeny of nearly all extant mammal species [2]. They have also enabled phylogenetic analyses using large-scale genomic data sets in bacteria, across eukaryotes, and within plants [3,4] and have helped elucidate the origin of eukaryotic genomes [5]. Furthermore, supertrees have been used to examine rates and patterns of species diversification [1,2], to test hypotheses regarding the structure of ecological communities [6], and to examine extinction risk in current species [7].
Although supertrees can support large-scale evolutionary and ecological analyses, there are still numerous concerns about the performance of existing supertree methods (e.g., [8-14]). In general, an effective supertree method must accurately estimate phylogenies from large data sets in a reasonable amount of time while retaining much of the phylogenetic information from the input trees.
By far the most commonly used supertree method is matrix representation with parsimony (MRP), which works by solving the parsimony problem on a binary matrix representation of the input trees [15,16]. While the parsimony problem is NP-hard, MRP can take advantage of fast and effective hill-climbing heuristics implemented in PAUP* or TNT (e.g., [17-19]). MRP heuristics often perform well in analyses of both simulated and empirical data sets (e.g., [20-22]); however, there are numerous criticisms of MRP. For example, MRP shows evidence of biases based on the shape and size of input trees [8,11], and MRP supertrees may contain relationships that are not supported by any of the input trees [9,12]. Furthermore, it is unclear if or why minimizing the parsimony score of a matrix representation of input trees is a good optimality criterion or should produce accurate supertrees.
Since evolutionary biologists rarely, if ever, know the true relationships for a group of species, it is difficult to assess the accuracy of supertree, or any phylogenetic, methods. One approach to evaluate the accuracy of supertrees is with simulations (e.g., [20,21]). However, simulations inherently simplify the true processes of evolution, and it is unclear how well the performance of a phylogenetic method in simulations corresponds to its performance with empirical data. Perhaps a more useful way to define the accuracy of a supertree method is to quantify the amount of phylogenetic information from the input trees that is retained in the supertree. Ideally, we want the supertree to reflect the input tree topologies as much as possible. This suggests that the supertree objective should directly evaluate the similarity of the supertree to the input trees (e.g., [11,23,24]).
Numerous metrics exist to measure the similarity of input trees to a supertree, and the Robinson-Foulds (RF) distance metric [25] is among the most widely used. In fact, numerous studies have evaluated the performance of supertree methods, including MRP, by measuring the RF distance between collections of input trees and the resulting supertrees (e.g., [11,20,21]). The RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances between every rooted input tree and the supertree. The intuition behind seeking a binary supertree is that, in this setting, minimizing the RF distance is equivalent to maximizing the number of clusters (or clades) that are shared by the supertree and the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters from the input trees. Unfortunately, as with MRP, computing RF supertrees is NP-hard [26]. In this work, we describe efficient hill-climbing heuristics to estimate RF supertrees. These heuristics allow the first large-scale estimates of RF supertrees and comparisons of the accuracy of RF supertrees to other commonly used supertree methods.
The RF distance metric between two rooted trees is defined to be a normalized count of the symmetric difference between the sets of clusters of the two trees. In the supertree setting, the input trees will often have only a strict subset of the taxa present in the supertree. Thus, a high RF distance between an input tree and a supertree does not necessarily correspond to conflicting evolutionary histories; it can also indicate incomplete phylogenetic information. Consequently, in order to compute the RF distance between the supertree and an input tree that has only a strict subset of its taxa, we first restrict the supertree to the leaf set of the input tree. This adapted version of the RF distance is not a metric, or even a distance measure (mathematically speaking). However, for convenience, we will refer to this adapted version using the same name.
Previous work
Supertree methods are a generalization of consensus methods, in which all the input trees have the same leaf set. The problem of finding an optimal median tree under the RF distance in such a consensus setting is well-studied. In particular, it is known that the majority-rule consensus of the input trees must be a median tree [27], and it can be found in polynomial time. On the other hand, finding the optimum binary median tree, i.e. an RF supertree, in the consensus setting is NP-hard [26]. This implies that computing an RF supertree in general is NP-hard as well.
Our definition of RF distance between two trees where one has only a strict subset of the taxa in the other corresponds to the distance measure used to define "majority-rule(-) supertrees" by Cotton and Wilkinson [28]. This definition restricts the larger tree to only the leaf set of the smaller tree before evaluating the RF distance. Majority-rule(-) supertrees are defined to be the strict consensus of all the optimal median trees under the RF distance. These median trees are defined similarly to RF supertrees, except that RF supertrees must be binary while the median trees can be non-binary. In general, majority-rule supertrees [28], in both their (-) and (+) variants, seek to generalize the majority-rule consensus. Indeed, majority-rule supertrees have been shown to have several desirable properties reminiscent of majority-rule consensus trees [29]. Although majority-rule supertrees and RF supertrees are both based on minimizing RF distance, they represent two different approaches to supertree construction. In particular, the RF supertree method seeks a supertree that is consistent with the largest number of clusters (clades) from the input trees, while majority-rule supertrees do not. Nevertheless, as we discuss later, RF supertrees could be used as a starting point to estimate majority-rule(-) supertrees.
The RF distance between two trees on the same leaf set of size n, with leaves labeled by integers {1, ..., n}, can be computed in O(n) time [30]. In fact, a (1 + ϵ)-approximate value of the RF distance can be computed in sub-linear time, with high probability [31].
In the case of unrooted trees, the RF distance metric is sometimes also known as the splits metric (e.g., [32]). The supertree analysis package Clann [23] provides heuristics that operate on unrooted trees and attempt to maximize the number of splits shared between the input trees and the inferred supertree. This method is called the "maximum splits-fit" method.
Local Search
We use a heuristic approach for the RF supertree problem. Local search is the basis of effective heuristics for many phylogenetic problems. These heuristics iteratively search through the space of possible supertrees guided, at each step, by solutions to some local search problem. More formally, in these heuristics, a tree graph (see [32,33]) is defined for the given set of input trees and some fixed tree edit operation. The node set of this tree graph represents the set of all supertrees on the given input trees. An edge is drawn between two nodes exactly if the corresponding trees can be transformed into each other by one tree edit operation. In our setting, the cost of a node in the graph is the RF distance between the supertree represented by that node and the given input trees. Given an initial node in the tree graph, the heuristic's task is to find a maximal-length path of steepest descent in the cost of its nodes and to return the last node on such a path. This path is found by solving the local search problem at every node along the path. The local search problem is to find a node with the minimum cost in the neighborhood of a given node. The neighborhood is defined by some tree edit operation, and hence, the time complexity of the local search problem depends on the tree edit operation used.
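The steepest-descent walk through the tree graph described above can be sketched generically as follows. This is an illustrative outline only: `neighbors` and `cost` are hypothetical callables standing in for the chosen tree edit operation and the total RF distance, and the integer "nodes" in the usage lines are a toy stand-in for supertrees.

```python
def hill_climb(start, neighbors, cost):
    """Follow a path of steepest descent until no neighbor improves the cost.

    `neighbors(node)` returns the nodes one edit operation away and
    `cost(node)` returns the value to minimize; both are assumptions of
    this sketch rather than functions from an existing package.
    """
    current, current_cost = start, cost(start)
    while True:
        # Local search problem: find a minimum-cost node in the neighborhood.
        scored = [(cost(node), node) for node in neighbors(current)]
        if not scored:
            return current, current_cost
        best_cost, best = min(scored, key=lambda pair: pair[0])
        if best_cost >= current_cost:
            # No neighbor is strictly better, so the path ends here.
            return current, current_cost
        current, current_cost = best, best_cost


# Toy usage: nodes are integers, neighbors differ by one, cost is (n - 7)^2.
final = hill_climb(0, lambda n: [n - 1, n + 1], lambda n: (n - 7) ** 2)
print(final)  # prints (7, 0)
```

Each pass through the loop solves one local search problem, so the overall running time is the number of steps multiplied by the cost of evaluating a neighborhood, which is why faster SPR and TBR neighborhood evaluation matters.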
Two of the most extensively used tree edit operations for supertrees are rooted Subtree Prune and Regraft (SPR) [33-35] and rooted Tree Bisection and Reconnection (TBR) [22,33,34]. The best known (naïve) algorithms for the SPR and TBR local search problems for the RF supertree problem require O(kn³) and O(kn⁴) time respectively, where k is the number of input trees, and n is the number of leaves in the supertree solution.
Our Contribution
We describe efficient hill-climbing heuristics for the RF supertree problem. These heuristics are based on novel non-trivial algorithms that can solve the corresponding local search problems for both SPR and TBR in O(kn²) time, yielding speed-ups of Θ(n) and Θ(n²) over the best known solutions respectively. These new algorithms are inspired by fast local search algorithms for the gene duplication problem [36,37]. Note that while the supertree itself must be binary, our algorithms work even if the input trees are not. We also examine the performance of the RF supertree method using four published supertree data sets, and compare its performance with MRP and the triplet supertree method [38]. We demonstrate that the new algorithms enable RF supertree analyses on large data sets and that the RF supertree method outperforms other supertree methods in finding supertrees that are most similar to the input trees based on the RF distance metric.
Basic Notation and Preliminaries
A tree T is a connected acyclic graph, consisting of a node set V(T) and an edge set E(T). T is rooted if it has exactly one distinguished node called the root, which we denote by rt(T). Throughout this work, the term tree refers to a rooted tree. We define ≤_T to be the partial order on V(T) where x ≤_T y if y is a node on the path between rt(T) and x. The set of minima under ≤_T is denoted by ℒ(T) and its elements are called leaves. The set of all non-root internal nodes of T, denoted by I(T), is defined to be the set V(T)\(ℒ(T) ∪ {rt(T)}). If {x, y} ∈ E(T) and x ≤_T y then we call y the parent of x, denoted by pa_T(x), and we call x a child of y. The set of all children of y is denoted by Ch_T(y). T is fully binary if every node has either zero or two children. If two nodes in T have the same parent, they are called siblings. The least common ancestor of a non-empty subset L ⊆ V(T), denoted as lca(L), is the unique smallest upper bound of L under ≤_T. The subtree of T rooted at node y ∈ V(T), denoted by T_y, is the tree induced by {x ∈ V(T): x ≤_T y}. For each node v ∈ I(T), the cluster 𝒞_T(v) is defined to be the set of all leaf nodes in T_v; i.e. 𝒞_T(v) = ℒ(T_v). We denote the set of all clusters of a tree T by ℋ(T). Given a set L ⊆ ℒ(T), let T' be the minimal rooted subtree of T with leaf set L. We define the leaf induced subtree T[L] of T on leaf set L to be the tree obtained from T' by successively removing each non-root node of degree two and adjoining its two neighbors. The symmetric difference of two sets A and B, denoted by AΔB, is the set (A\B) ∪ (B\A). A profile 𝒫 is a tuple of trees (T_1, ..., T_k).
The RF Supertree Problem
Given a profile $\mathcal{P}$, we define a supertree on $\mathcal{P}$ to be a fully binary tree $T^*$ where $\mathcal{L}(T^*) = \bigcup_{i=1}^{k} \mathcal{L}(T_i)$.
Definition 1 (RF Distance). Given a profile $\mathcal{P} = (T_1, \ldots, T_k)$ and a supertree $T^*$ on $\mathcal{P}$, we define the RF distance as follows:
1. For any $T_i$, where $1 \le i \le k$, $RF(T_i, T^*) = |\mathcal{H}(T_i) \,\Delta\, \mathcal{H}(T^*[\mathcal{L}(T_i)])|$.
2. $RF(\mathcal{P}, T^*) = \sum_{i=1}^{k} RF(T_i, T^*)$.
3. Let $\mathcal{T}$ be the set of supertrees on $\mathcal{P}$, then $RF(\mathcal{P}) = \min_{T^* \in \mathcal{T}} RF(\mathcal{P}, T^*)$.
Remark: Traditionally, the value of the RF distance, as computed above, is normalized by multiplying by 1/2. However, this does not affect the definition or computation of RF supertrees, and therefore, we do not normalize the RF distance.
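The unnormalized RF distance of Definition 1 can be illustrated with a small Python sketch. This is not the authors' implementation: trees are represented as nested tuples of leaf labels (an assumption made purely for illustration), and the helper names `leaves`, `restrict`, `clusters`, `rf` and `rf_profile` are hypothetical. The sketch runs in quadratic time and is meant only to make the definition concrete.

```python
def leaves(tree):
    """Leaf set of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Leaf-induced subtree T[L]: drop leaves outside `keep`, suppress unary nodes."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)


def clusters(tree):
    """Clusters of all non-root internal nodes, i.e. the set H(T)."""
    found = set()

    def walk(node, is_root):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(walk(child, False) for child in node))
        if not is_root:
            found.add(below)
        return below

    walk(tree, True)
    return found


def rf(input_tree, supertree):
    """RF(T_i, T*): symmetric difference of cluster sets after restricting T*."""
    restricted = restrict(supertree, leaves(input_tree))
    return len(clusters(input_tree) ^ clusters(restricted))


def rf_profile(profile, supertree):
    """RF(P, T*): sum of the RF distances to every input tree."""
    return sum(rf(tree, supertree) for tree in profile)


profile = [(("a", "b"), "c"), (("a", "b"), ("c", "d"))]
print(rf_profile(profile, ((("a", "b"), "c"), "d")))
print(rf_profile(profile, ((("a", "c"), "b"), "d")))
```

Note that the first input tree lacks leaf d, so the supertree is restricted to {a, b, c} before its clusters are compared.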
Problem 1 (RF Supertree).
Instance: A profile $\mathcal{P}$.
Find: A supertree $T_{opt}$ on $\mathcal{P}$ such that $RF(\mathcal{P}, T_{opt}) = RF(\mathcal{P})$.
Recall that the RF Supertree problem is NP-hard [27].
Local Search Problems
Here we first provide definitions for the re-rooting operation (denoted RR) and the TBR [22] and SPR [35] edit operations and then formulate the related local search problems that were motivated in the introduction.
Definition 2 (RR operation). Let $T$ be a tree and $x \in V(T)$. $RR(T, x)$ is defined to be the tree $T$, if $x = rt(T)$ or $x \in Ch(rt(T))$. Otherwise, $RR(T, x)$ is the tree that is obtained from $T$ by (i) suppressing $rt(T)$, and (ii) subdividing the edge $\{pa(x), x\}$ by a new root node. We define the following extension: $RR(T) = \bigcup_{x \in V(T)} \{RR(T, x)\}$.
For technical reasons, before we can define the TBR operation, we need the following definition.
Definition 3 (Planted tree). Given a tree T, the planted tree Φ(T) is the tree obtained by adding a root edge {p, rt(T)}, where p ∉ V (T), to T.
Definition 4 (TBR operation). (See Figure 1.) Let $T$ be a tree, $e = (u, v) \in E(T)$, where $u = pa(v)$, and $X$, $Y$ be the connected components that are obtained by removing edge $e$ from $T$ where $v \in X$ and $u \in Y$. We define $TBR_T(v, x, y)$ for $x \in X$ and $y \in Y$ to be the tree that is obtained from $\Phi(T)$ by first removing edge $e$, then replacing the component $X$ by $RR(X, x)$, and then adjoining a new edge $f$ between $x' = rt(RR(X, x))$ and $Y$ as follows:
1. Create a new node $y'$ that subdivides the edge $(pa(y), y)$.
2. Adjoin the edge $f$ between nodes $x'$ and $y'$.
3. Suppress the node $u$, and rename $x'$ as $v$ and $y'$ as $u$.
4. Contract the root edge.
Figure 1. TBR Operation. Example depicting a TBR operation which transforms tree $S$ into tree $S' = TBR_S(v, x, y)$.
Notation. We define the following:
1. $TBR_T(v, x) = \bigcup_{y \in Y} \{TBR_T(v, x, y)\}$
2. $TBR_T(v) = \bigcup_{x \in X} TBR_T(v, x)$
3. $TBR_T = \bigcup_{(u, v) \in E(T)} TBR_T(v)$
Definition 5 (SPR operation). Let $T$ be a tree, $e = (u, v) \in E(T)$, where $u = pa(v)$, and $X$, $Y$ be the connected components that are obtained by removing edge $e$ from $T$ where $v \in X$ and $u \in Y$. We define $SPR_T(v, y)$, for $y \in Y$, to be the tree $TBR_T(v, v, y)$. We say that the tree $SPR_T(v, y)$ is obtained from $T$ by a subtree prune and regraft (SPR) operation that prunes subtree $T_v$ and regrafts it above node $y$.
Notation. We define the following:
1. $SPR_T(v) = \bigcup_{y \in Y} \{SPR_T(v, y)\}$
2. $SPR_T = \bigcup_{(u, v) \in E(T)} SPR_T(v)$
Note that an SPR operation for a given tree T can be briefly described through the following four steps: (i) prune some subtree P from T, (ii) add a root edge to the remaining tree S, (iii) regraft P into an edge of the remaining tree S, and (iv) contract the root edge.
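The four-step description of an SPR operation can be turned into a brute-force enumeration of the SPR neighbourhood. The Python sketch below is illustrative only and is not the algorithm of this paper: it assumes fully binary trees stored as nested tuples, addresses nodes by their path of child indices from the root, and uses hypothetical helper names (`prune`, `regraft`, `canonical`, `spr_neighbours`). Regrafting with the empty path corresponds to regrafting above the root, which the root edge of the planted tree makes possible.

```python
def subtrees_with_paths(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        for index, child in enumerate(tree):
            yield from subtrees_with_paths(child, path + (index,))


def prune(tree, path):
    """Step (i): remove the subtree at a non-empty `path` and suppress its parent."""
    if len(path) == 1:
        return tree[1 - path[0]]  # binary tree: the sibling replaces the parent
    kids = list(tree)
    kids[path[0]] = prune(tree[path[0]], path[1:])
    return tuple(kids)


def regraft(tree, path, subtree):
    """Steps (ii)-(iv): attach `subtree` on the edge above the node at `path`."""
    if not path:
        return (tree, subtree)
    kids = list(tree)
    kids[path[0]] = regraft(tree[path[0]], path[1:], subtree)
    return tuple(kids)


def canonical(tree):
    """Order children deterministically so equal topologies compare equal."""
    if not isinstance(tree, tuple):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def spr_neighbours(tree):
    """All trees reachable by one SPR operation (the input tree is included)."""
    result = set()
    for v_path, pruned in subtrees_with_paths(tree):
        if not v_path:
            continue  # the root itself cannot be pruned
        rest = prune(tree, v_path)
        for y_path, _ in subtrees_with_paths(rest):
            result.add(canonical(regraft(rest, y_path, pruned)))
    return result


start = (("a", "b"), ("c", "d"))
print(len(spr_neighbours(start)))
```

Scoring every tree produced this way against the profile is the naïve approach whose cost the algorithms in this paper avoid.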
We now define the relevant local search problems based on the TBR and SPR operations.
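Problem 2 (TBR-Scoring (TBR-S)). Given instance $\langle \mathcal{P}, T \rangle$, where $\mathcal{P}$ is the profile $(T_1, \ldots, T_k)$ and $T$ is a supertree on $\mathcal{P}$, find a tree $T^* \in TBR_T$ such that $RF(\mathcal{P}, T^*) = \min_{T' \in TBR_T} RF(\mathcal{P}, T')$.
Problem 3 (TBR-Restricted Scoring (TBR-RS)). Given instance $\langle \mathcal{P}, T, v \rangle$, where $\mathcal{P}$ is the profile $(T_1, \ldots, T_k)$, $T$ is a supertree on $\mathcal{P}$, and $v$ is a non-root node in $V(T)$, find a tree $T^* \in TBR_T(v)$ such that $RF(\mathcal{P}, T^*) = \min_{T' \in TBR_T(v)} RF(\mathcal{P}, T')$.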
The problems SPR-Scoring (SPR-S) and SPR-Restricted Scoring (SPR-RS) are defined analogously to the problems TBR-S and TBR-RS respectively.
Throughout the remainder of this manuscript, $k$ is the number of trees in the profile $\mathcal{P}$, $T$ denotes a supertree on $\mathcal{P}$, and $n$ is the number of leaves in $T$. The following observation follows from Definition 4.
Observation 1. The TBR-S problem on instance $\langle \mathcal{P}, T \rangle$ can be solved by solving the TBR-RS problem $|E(T)|$ times.
We show how to solve the TBR-S problem on the instance $\langle \mathcal{P}, T \rangle$ in $O(kn^2)$ time. Since $SPR_T \subseteq TBR_T$ this also implies an $O(kn^2)$ solution for the SPR-S problem. This gives a speed-up of $\Theta(n^2)$ and $\Theta(n)$ over the best known (naïve) algorithms for the TBR-S and SPR-S problems respectively.
In particular, we first show that any instance of the TBR-RS problem can be decomposed into an instance of an SPR-RS problem, and an instance of a Rooting problem (defined in the next section). We show how to solve both these problems in $O(kn)$ time, yielding an $O(kn)$ time solution for the TBR-RS problem. This immediately implies an $O(kn^2)$ time algorithm for the TBR-S problem (see Observation 1).
Note that the size of the set $TBR_T$ is $\Theta(n^3)$. Thus, for each tree in the input profile the time complexity of computing and enumerating the RF distances of all trees in $TBR_T$ is $\Omega(n^3)$. However, to solve the TBR-S problem one only needs to find a tree with the minimum RF distance. This lets us solve the TBR-S problem in time that is sub-linear in the size of $TBR_T$. In fact, after the initial $O(kn^2)$ preprocessing step, our algorithm can output the RF distance of any tree in $TBR_T$ in $O(1)$ time.
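The stated speed-ups can be recovered with a short count. The derivation below is a sketch under two assumptions that are not spelled out at this point in the text: that a naïve search scores each candidate tree from scratch against all $k$ input trees in $O(kn)$ time, and that $|SPR_T| = \Theta(n^2)$ (there are $O(n)$ subtrees to prune and $O(n)$ regraft locations for each).

```latex
\begin{align*}
\text{naive SPR-S:} \quad & |SPR_T| \cdot O(kn) = \Theta(n^2) \cdot O(kn) = O(kn^3),
  & \frac{kn^3}{kn^2} &= n, \\
\text{naive TBR-S:} \quad & |TBR_T| \cdot O(kn) = \Theta(n^3) \cdot O(kn) = O(kn^4),
  & \frac{kn^4}{kn^2} &= n^2 .
\end{align*}
```

Dividing each naïve bound by the $O(kn^2)$ bound gives the factors $n$ and $n^2$, matching the speed-ups of $\Theta(n)$ for SPR-S and $\Theta(n^2)$ for TBR-S.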
Structural Properties
Throughout this section, we limit our attention to one tree $S$ from the profile $\mathcal{P}$. We show how to solve the TBR-RS problem for the instance $\langle (S), T, v \rangle$ for some non-root node $v \in V(T)$ in $O(n)$ time. Based on this solution, it is straightforward to solve the TBR-RS problem on the instance $\langle \mathcal{P}, T, v \rangle$ within $O(kn)$ time as well. For clarity, we will also assume that $\mathcal{L}(S) = \mathcal{L}(T)$. In general, if $\mathcal{L}(S) \subset \mathcal{L}(T)$ then we can simply set $T$ to be $T[\mathcal{L}(S)]$. This takes $O(n)$ time and, consequently, does not affect the time complexity of our algorithm.
Our algorithm makes use of the LCA mapping from S to T. This mapping is defined as follows.
Definition 6 (LCA Mapping). Given two trees $T'$ and $T$ such that $\mathcal{L}(T') \subseteq \mathcal{L}(T)$, the LCA mapping $\mathcal{M}_{T',T} : V(T') \to V(T)$ is the mapping $\mathcal{M}_{T',T}(u) = lca_T(\mathcal{L}(T'_u))$.
Notation. We define a boolean function $f_T : I(S) \to \{0, 1\}$ such that $f_T(u) = 1$ if there exists a node $v \in I(T)$ such that $\mathcal{C}_S(u) = \mathcal{C}_T(v)$, and $f_T(u) = 0$ otherwise. Thus, $f_T(u) = 1$ if and only if the cluster $\mathcal{C}_S(u)$ exists in the tree $T$ as well. Additionally, we define $\mathcal{F}_T = \{u \in I(S) : f_T(u) = 0\}$; that is, $\mathcal{F}_T$ is the set of all nodes $u \in I(S)$ such that the cluster $\mathcal{C}_S(u)$ does not exist in the tree $T$.
The following lemma associates the value $RF(S, T)$ with the cardinality of the set $\mathcal{F}_T$.
Lemma 1. $RF(S, T) = |I(T)| - |I(S)| + 2 \cdot |\mathcal{F}_T|$.
Proof. Let $\overline{\mathcal{F}}_T$ denote the set $\{u \in I(S) : f_T(u) = 1\}$. By the definition of $RF(S, T)$, we must have $RF(S, T) = |I(T)| + |I(S)| - 2 \cdot |\overline{\mathcal{F}}_T|$. Hence, since $|\overline{\mathcal{F}}_T| + |\mathcal{F}_T| = |I(S)|$, we get $RF(S, T) = |I(T)| - |I(S)| + 2 \cdot |\mathcal{F}_T|$.    □
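As a quick check of Lemma 1, consider a small worked example. The two five-leaf trees below are chosen here purely for illustration and do not come from the data sets analysed in this paper.

```latex
\begin{align*}
S &= (((a,b),c),(d,e)), & \mathcal{H}(S) &= \{\{a,b\},\ \{a,b,c\},\ \{d,e\}\}, \\
T &= (((a,c),b),(d,e)), & \mathcal{H}(T) &= \{\{a,c\},\ \{a,b,c\},\ \{d,e\}\}, \\
\mathcal{H}(S)\,\Delta\,\mathcal{H}(T) &= \{\{a,b\},\ \{a,c\}\}, & RF(S,T) &= 2, \\
|\mathcal{F}_T| &= 1 \quad (\text{only } \{a,b\} \text{ is missing from } T), &
|I(T)| - |I(S)| + 2\,|\mathcal{F}_T| &= 3 - 3 + 2 \cdot 1 = 2 .
\end{align*}
```

Both computations give 2, as the lemma predicts.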
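Lemma 2. For any $u \in I(S)$, $f_T(u) = 1$ if and only if $|\mathcal{C}_S(u)| = |\mathcal{C}_T(\mathcal{M}_{S,T}(u))|$.
Proof. If $|\mathcal{C}_S(u)| = |\mathcal{C}_T(\mathcal{M}_{S,T}(u))|$ then we must have $\mathcal{C}_S(u) = \mathcal{C}_T(\mathcal{M}_{S,T}(u))$ and, consequently, $f_T(u) = 1$. In the other direction, if $|\mathcal{C}_S(u)| \ne |\mathcal{C}_T(\mathcal{M}_{S,T}(u))|$, then we must have $\mathcal{C}_S(u) \subset \mathcal{C}_T(\mathcal{M}_{S,T}(u))$ and, consequently, $f_T(u) = 0$.    □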
The LCA mapping from $S$ to $T$ can be computed in $O(n)$ time [39], and consequently, by Lemmas 1 and 2, we can compute the RF distance between $S$ and $T$ in $O(n)$ time as well (other $O(n)$-time algorithms for calculating the RF distance are presented in [30,31]). Moreover, Lemma 1 implies that in order to find a tree $T^* \in TBR_T(v)$ such that $RF(S, T^*) = \min_{T' \in TBR_T(v)} RF(S, T')$, it is sufficient to find a tree $T^* \in TBR_T(v)$ for which $|\mathcal{F}_{T^*}| = \min_{T' \in TBR_T(v)} |\mathcal{F}_{T'}|$.
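The way Lemmas 1 and 2 combine into an RF computation can be sketched in Python. The sketch makes several simplifying assumptions that are not part of the paper: trees are nested tuples over the same leaf set, nodes of T are numbered in pre-order, and the least common ancestor is found by naïvely walking parent pointers, so the whole routine is quadratic rather than the $O(n)$ achievable with the LCA algorithm of [39]. The names `index_tree`, `lca`, `cluster_flags` and `rf_via_lemmas` are hypothetical.

```python
def index_tree(tree):
    """Number the nodes of a nested-tuple tree; return parent, depth, size, leaf ids."""
    parent, depth, size, leaf_node = {}, {}, {}, {}
    counter = [0]

    def walk(node, par, d):
        nid = counter[0]
        counter[0] += 1
        parent[nid], depth[nid] = par, d
        if isinstance(node, tuple):
            size[nid] = sum(walk(child, nid, d + 1) for child in node)
        else:
            size[nid] = 1
            leaf_node[node] = nid
        return size[nid]

    walk(tree, None, 0)
    return parent, depth, size, leaf_node


def lca(parent, depth, x, y):
    """Naive least common ancestor: repeatedly lift the deeper node."""
    while x != y:
        if depth[x] < depth[y]:
            y = parent[y]
        else:
            x = parent[x]
    return x


def cluster_flags(S, T):
    """For each non-root internal node u of S, return f_T(u), decided as in Lemma 2."""
    parent, depth, size, leaf_node = index_tree(T)
    flags = []

    def walk(node, is_root):
        if not isinstance(node, tuple):
            return leaf_node[node], 1
        images = [walk(child, False) for child in node]
        mapped = images[0][0]
        for image, _ in images[1:]:
            mapped = lca(parent, depth, mapped, image)  # LCA mapping M_{S,T}(u)
        n_leaves = sum(count for _, count in images)
        if not is_root:
            flags.append(int(size[mapped] == n_leaves))  # compare cluster sizes
        return mapped, n_leaves

    walk(S, True)
    return flags


def rf_via_lemmas(S, T):
    """RF(S, T) = |I(T)| - |I(S)| + 2 |F_T| (Lemma 1); assumes L(S) = L(T)."""
    flags = cluster_flags(S, T)
    sizes_T = index_tree(T)[2]
    internal_T = sum(1 for s in sizes_T.values() if s > 1) - 1  # exclude the root
    return internal_T - len(flags) + 2 * flags.count(0)


S = ((("a", "b"), "c"), ("d", "e"))
T = ((("a", "c"), "b"), ("d", "e"))
print(rf_via_lemmas(S, T))
```

The input tree S may be non-binary, in line with the remark that only the supertree needs to be binary.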
Remark: An implicit assumption here is that the leaves of both trees are labeled by integers {1, ..., n}. If the leaf labels are arbitrary, then we require an additional O(kn log n)-time preprocessing step to relabel the leaves of the trees in the given profile. Note, however, that this additional step does not add to the overall time complexity of solving the TBR-S or SPR-S problems.
We now show that the TBR-RS problem can be solved by solving two smaller problems separately and combining their solutions.
As before, we limit our attention to one tree $S$ from the profile $\mathcal{P}$. Given the TBR-RS instance $\langle (S), T, v \rangle$, we define a bipartition $\{X, \overline{X}\}$ of $I(S)$, where $X = \{u \in I(S) : \mathcal{M}_{S,T}(u) \in V(T_v)\}$.
Lemma 3. If $u \in X$, then $f_{T'}(u) = f_T(u)$ for all $T' \in TBR_T(v, v)$. If $u \in \overline{X}$ and $y$ denotes the sibling of $v$, then $f_{T'}(u) = f_T(u)$, where $T' = TBR_T(v, x, y)$ for any $x \in V(T_v)$.
Proof. Consider the case when $u \in X$. Let $T'$ be any tree in $TBR_T(v, v)$ and let node $y \in V(T)$ be such that $T' = TBR_T(v, v, y)$. Thus, for any node $w \in V(T_v)$, the subtrees $T_w$ and $T'_w$ must be identical. Since $u \in X$, we must have $\mathcal{M}_{S,T}(u) \in V(T_v)$ and, consequently, $\mathcal{C}_{T'}(\mathcal{M}_{S,T'}(u)) = \mathcal{C}_T(\mathcal{M}_{S,T}(u))$. Lemma 2 now implies that $f_{T'}(u) = f_T(u)$.
Now consider the case when $u \in \overline{X}$. Let $y$ denote the sibling of $v$ in tree $T$ and let $T' = TBR_T(v, x, y)$, for some $x \in V(T_v)$. Thus, for any node $w \in V(T) \setminus V(T_v)$, we must have $\mathcal{L}(T_w) = \mathcal{L}(T'_w)$. Moreover, the leaf sets of the two subtrees rooted at the children of $w$ in $T$ must be identical to the leaf sets of the two subtrees rooted at the children of $w$ in $T'$. This implies that if $\mathcal{M}_{S,T}(u) = w$, then $\mathcal{M}_{S,T'}(u) = w$ as well. By Lemma 2 we must therefore have $f_{T'}(u) = f_T(u)$.    □
Lemma 3 implies that a tree in $TBR_T(v)$ with smallest RF distance can be obtained by optimizing the rooting for the pruned subtree, and optimizing the regraft location separately. This allows us to obtain a tree in $TBR_T(v)$ with smallest RF distance by evaluating only $O(n)$ trees. Contrast this with the naïve approach to finding a tree in $TBR_T(v)$ with smallest total distance, which is to evaluate all trees obtained by rerooting the pruned subtree in all possible ways, and, for each rerooting, regrafting the subtree in all possible locations. Since there are $O(n)$ ways to reroot the pruned subtree, and $O(n)$ ways to regraft, this would require evaluating $O(n^2)$ trees. It is interesting to note that this ability to decompose the TBR-RS problem into two simpler problems is not unique to the context of RF supertrees. For example, it has been observed that a similar decomposition can be achieved in the context of the gene duplication problem [37].
Thus, to solve the TBR-RS problem, we must find (i) a rerooting $T'$ of the subtree $T_v$ for which $|\mathcal{F}_{T'}|$ is minimized, and (ii) a regraft location $y$ for $T_v$ which minimizes $|\mathcal{F}_{SPR_T(v, y)}|$. Observe that the problem in part (ii) is simply the SPR-RS problem on the input instance $\langle (S), T, v \rangle$. For part (i), consider the following problem statement.
Problem 4 (Rooting). Given instance $\langle \mathcal{P}, T, v \rangle$, where $\mathcal{P}$ is the profile $(T_1, \ldots, T_k)$, $T$ is a supertree on $\mathcal{P}$, and $v$ is a non-root node in $V(T)$, find a node $x \in V(T_v)$ for which $RF(\mathcal{P}, TBR_T(v, x, y))$ is minimum, where $y$ denotes the sibling of $v$ in $T$.
Note that the problem in part (i) is the Rooting problem on the input instance $\langle (S), T, v \rangle$. We show how to solve both the Rooting and the SPR-RS problems in $O(n)$ time on instance $\langle (S), T, v \rangle$. As seen above, based on Lemma 3, this immediately implies that the TBR-RS problem for a profile consisting of a single tree can be solved in $O(n)$ time. To solve the TBR-RS problem on instance $\langle \mathcal{P}, T, v \rangle$, we simply solve the Rooting and SPR-RS problems separately on the input instance $\langle \mathcal{P}, T, v \rangle$, which takes $O(kn)$ time (see Theorems 3 and 4). We thus have the following two theorems.
Theorem 1. The TBR-RS problem can be solved in $O(kn)$ time.
Theorem 2. The TBR-S problem can be solved in $O(kn^2)$ time.
Solving the Rooting Problem
To solve the Rooting problem on instance $\langle (S), T, v \rangle$, we rely on an efficient algorithm for computing the value of $f_{T'}(u)$ for any $T' \in RR(T_v)$ and any $u \in I(S)$. This algorithm relies on the following five lemmas. Let $a$ denote the node $\mathcal{M}_{S,T}(u)$, $y$ denote the sibling of $v$ in $T$, and $T' = TBR_T(v, x, y)$ for $x \in V(T_v)$. Depending on $a$ and $f_T(u)$ there are five possible cases: (i) $a \notin V(T_v)$, (ii) $a = rt(T_v)$ and $f_T(u) = 1$, (iii) $a = rt(T_v)$ and $f_T(u) = 0$, (iv) $a \in V(T_v) \setminus \{rt(T_v)\}$ and $f_T(u) = 1$, and (v) $a \in V(T_v) \setminus \{rt(T_v)\}$ and $f_T(u) = 0$. Lemmas 4 through 8 characterize the value $f_{T'}(u)$ for each of these five cases respectively.
Lemma 4. If $a \notin V(T_v)$, then $f_{T'}(u) = f_T(u)$ for any $x \in V(T_v)$.
Proof. Follows directly from Lemma 3.    □
Lemma 5. If $a = rt(T_v)$ and $f_T(u) = 1$, then $f_{T'}(u) = 1$ for all $x \in V(T_v)$.
Proof. Since we have $a = rt(T_v)$ and $f_T(u) = 1$, by Lemma 2 we must have $\mathcal{L}(S_u) = \mathcal{L}(T_v)$. Thus, for any $x \in V(T_v)$, $\mathcal{M}_{S,T'}(u)$ must be the root of the subtree $RR(T_v, x)$. The lemma follows.    □
Lemma 6. Let $L$ denote the set $\mathcal{L}(T_v) \setminus \mathcal{L}(S_u)$, and let $b = lca_T(L)$. If $a = rt(T_v)$ and $f_T(u) = 0$, then,
1. for $x \not\le_{T_v} b$, $f_{T'}(u) = 0$, and,
2. for $x \le_{T_v} b$, $f_{T'}(u) = 1$ if and only if $|L| = |\mathcal{L}(T_b)|$.
Proof. Since $a = rt(T_v)$ and $f_T(u) = 0$, by Lemma 2 we must have $\mathcal{L}(S_u) \ne \mathcal{L}(T_v)$. We analyze each part of the lemma separately.
1. $x \not\le_{T_v} b$: For this case to be valid, we must have $b \ne rt(T_v)$. Therefore, let $b' = pa_T(b)$. For any $T'$ in this case, $b' = pa_{T'}(b)$. Moreover, $\mathcal{L}(T'_{b'}) \cap \mathcal{L}(S_u) \ne \emptyset$. Therefore, we must have $b <_{T'} \mathcal{M}_{S,T'}(u)$. Hence, $\mathcal{C}_S(u) \ne \mathcal{C}_{T'}(\mathcal{M}_{S,T'}(u))$ in this case, and, consequently, Lemma 2 implies that $f_{T'}(u) = 0$.
2. $x \le_{T_v} b$: We divide our analysis into two cases:
(a) $|L| = |\mathcal{L}(T_b)|$: In this case we must have $b \ne rt(T_v)$. Therefore, let $b'$ denote the parent of $b$ in tree $T_v$. Now consider the tree $T'$. The set $\mathcal{L}(T'_{b'})$ must be identical to $\mathcal{L}(S_u)$. Hence, $f_{T'}(u) = 1$ in this case.
(b) $|L| \ne |\mathcal{L}(T_b)|$: We claim that there does not exist any edge $(pa(w), w) \in E(T_v)$ such that $\mathcal{L}(T_w)$ is either $\mathcal{L}(S_u)$ or $L$. Let us suppose, for the sake of contradiction, that such an edge exists. If $\mathcal{L}(T_w) = \mathcal{L}(S_u)$ then we must have $a = w$, which is a contradiction since $a = rt(T_v)$. If $\mathcal{L}(T_w) = L$ then we must have $b = w$, and, consequently, $|L| = |\mathcal{L}(T_b)|$, which is, again, a contradiction. Thus, such an edge $(pa(w), w) \in E(T_v)$ cannot exist. Hence, we must have $f_{T'}(u) = 0$ for every $x \in V(T_v)$ in this case.
The lemma follows.    □
Lemma 7. If $a \in V(T_v) \setminus \{rt(T_v)\}$ and $f_T(u) = 1$, then $f_{T'}(u) = 0$ if and only if $x <_{T_v} a$.
Proof. By Lemma 2 we must have $\mathcal{L}(S_u) = \mathcal{L}(T_a)$. We have two cases:
1. $x \not<_{T_v} a$: In this case we must have $\mathcal{M}_{S,T'}(u) = a$, and $\mathcal{L}(T_a) = \mathcal{L}(T'_a)$. Thus, $\mathcal{L}(S_u) = \mathcal{L}(T'_a)$ and hence, $f_{T'}(u) = 1$.
2. $x <_{T_v} a$: In this case, $\mathcal{M}_{S,T'}(u)$ must be the root of the subtree $RR(T_v, x)$. Since $\mathcal{L}(RR(T_v, x)) = \mathcal{L}(T_v)$, and $\mathcal{L}(S_u) \ne \mathcal{L}(T_v)$, Lemma 2 implies that $f_{T'}(u) = 0$.
The lemma follows.    □
Lemma 8. If $a \in V(T_v) \setminus \{rt(T_v)\}$ and $f_T(u) = 0$, then $f_{T'}(u) = 0$ for all $x \in V(T_v)$.
Proof. By Lemma 2 we must have $\mathcal{L}(S_u) \ne \mathcal{L}(T_a)$. We have two possible cases:
1. $x \not<_{T_v} a$: In this case we must have $\mathcal{M}_{S,T'}(u) = a$, and $\mathcal{L}(T_a) = \mathcal{L}(T'_a)$. Thus, $\mathcal{L}(S_u) \ne \mathcal{L}(T'_a)$ and hence, $f_{T'}(u) = 0$.
2. $x <_{T_v} a$: In this case, $\mathcal{M}_{S,T'}(u)$ must be the root of the subtree $RR(T_v, x)$. Since $\mathcal{L}(RR(T_v, x)) = \mathcal{L}(T_v)$, and $\mathcal{L}(S_u) \ne \mathcal{L}(T_v)$, Lemma 2 implies that $f_{T'}(u) = 0$.
The lemma follows.    □
The Algorithm. For any $x \in V(T_v)$ let $A(x)$ denote the cardinality of the set $\{u \in I(S) : f_T(u) = 0, \text{ but } f_{T'}(u) = 1\}$, and $B(x)$ the cardinality of the set $\{u \in I(S) : f_T(u) = 1, \text{ but } f_{T'}(u) = 0\}$, where $T' = TBR_T(v, x, y)$.
By definition, to solve the Rooting problem we must find a node $x \in V(T_v)$ for which $A(x) - B(x)$ is maximized. Our algorithm computes, at each node $x \in V(T_v)$, the values $A(x)$ and $B(x)$.
In a preprocessing step, our algorithm computes the mapping $\mathcal{M}_{S,T}$ as well as the size of each cluster in $S$ and $T$, and creates and initializes (to 0) two counters $\alpha(x)$ and $\beta(x)$ at each node $x \in V(T_v)$. This takes $O(n)$ time. When the algorithm terminates, the values $\alpha(x)$ and $\beta(x)$ at any $x \in V(T_v)$ will be the values $A(x)$ and $B(x)$.
Recall that, given $u \in I(S)$, $a$ denotes the node $\mathcal{M}_{S,T}(u)$. Thus, any given $u \in I(S)$ must satisfy the precondition (given in terms of $a$) of exactly one of Lemmas 4 through 8. Moreover, the precondition of each of these lemmas can be checked in $O(1)$ time.
The algorithm then traverses through S and considers each node u ∈ I(S). There are three cases:
1. If $u$ satisfies the preconditions of Lemmas 4, 5, or 8 then we must have $f_{T'}(u) = f_T(u)$. Consequently, we do nothing in this case.
2. If $u$ satisfies the precondition of Lemma 7, then we increment the value of $\beta(x)$ at each node $x \in V(T_a) \setminus \{a\}$ (where $a$ is as in the statement of Lemma 7). To do this efficiently we can simply increment a counter at node $a$ such that, after all $u \in I(S)$ have been considered, a single pre-order traversal of $T_v$ can be used to compute the correct values of $\beta(x)$ at each $x \in V(T_v)$.
3. If $u$ satisfies the precondition of Lemma 6, then we proceed as follows: Let $a$ and $L$ be as in the statement of Lemma 6. According to the lemma, if we can find a node $b \in V(T_v)$ such that $b = lca_T(L)$ and $|\mathcal{L}(T_b)| = |L|$, then we increment the value of $\alpha(x)$ at each node $x \in V(T_b)$; otherwise, if such a $b$ does not exist, we do nothing. As before, to do this efficiently, we only increment a single counter at node $b$ such that, after all $u \in I(S)$ have been considered, a pre-order traversal of $T_v$ suffices to compute the correct values of $\alpha(x)$ at each $x \in V(T_v)$. In order to prove the $O(n)$ run-time for this algorithm we will now explain how to precompute such a corresponding node $b$ (if it exists), for each $u \in I(S)$ satisfying the precondition of Lemma 6, within $O(n)$ time. Note that any edge in a tree bi-partitions its leaf set. Construct the tree $S' = S[\mathcal{L}(T_v)]$. Observe that, given any candidate $u$, the corresponding node $b$ exists if and only if the partition of $\mathcal{L}(S')$ induced by the edge $(u, pa(u)) \in E(S')$ is also induced by some edge, $e$, in the tree $T_v$. If such an $e$ exists, then $b$ must be that node on $e$ which is farther away from the root, i.e. the edge $e$ must be the edge $(b, pa(b))$ in $T_v$. This edge $e$ (or its absence) can be precomputed, for all candidate $u$, as follows: Compute the strict consensus of the unrooted variants of the trees $S'$ and $T_v$. Every edge in this strict consensus corresponds to an edge in $S'$ and an edge in $T_v$ that induce the same bi-partitions in the two trees.
Thus, for all candidate $u$ that lie on such an edge, the corresponding node $b$ can be inferred in $O(1)$ time (by using the association between the edges of the strict consensus and the edges of $S'$ and $T_v$), and for all candidate $u$ that do not lie on such an edge, we know that the corresponding node $b$ does not exist. This strict consensus of the unrooted variants of $S'$ and $T_v$ can be precomputed within $O(n)$ time by using the algorithm of Day [30].
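The counter trick used in cases 2 and 3 (record one increment at a node, then distribute all increments with a single pre-order traversal) can be sketched as follows. The dictionary-based tree encoding and the function name `push_down` are assumptions made for this illustration. The flag `include_self` distinguishes the two variants in the text: increments that apply to every node of a subtree, as for $\alpha$, and increments that apply only to the proper descendants of the marked node, as for $\beta$.

```python
def push_down(children, root, marks, include_self):
    """Distribute per-node marks to subtrees with one pre-order traversal.

    children: dict mapping a node to the list of its children
    marks:    dict mapping a node to the number of increments recorded at it
    Returns totals[x] = sum of marks[w] over all ancestors w of x, where x
    counts as its own ancestor only if include_self is True.
    """
    totals = {}
    stack = [(root, 0)]
    while stack:
        node, inherited = stack.pop()
        own = marks.get(node, 0)
        totals[node] = inherited + (own if include_self else 0)
        for child in children.get(node, []):
            # Everything recorded at or above `node` applies to its descendants.
            stack.append((child, inherited + own))
    return totals


# Toy subtree: r has children a and l3; a has children l1 and l2.
children = {"r": ["a", "l3"], "a": ["l1", "l2"]}
marks = {"a": 2, "r": 1}

beta = push_down(children, "r", marks, include_self=False)
alpha = push_down(children, "r", marks, include_self=True)
print(beta["l1"], alpha["a"])
```

Each node is visited once, so the traversal costs time linear in the size of the subtree regardless of how many increments were recorded.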
Hence, the Rooting problem for a profile consisting of a single tree can be solved in O(n) time; yielding the following theorem.
Theorem 3. The Rooting problem can be solved in O(kn) time.
Solving the SPR-RS Problem
We will show how to solve the SPR-RS problem on instance $\langle (S), T, v \rangle$ in $O(n)$ time. Consider the tree $R = SPR_T(v, rt(T))$. Observe that, since $SPR_R(v) = SPR_T(v)$, solving the SPR-RS problem on instance $\langle (S), T, v \rangle$ is equivalent to solving it on the instance $\langle (S), R, v \rangle$. Thus, in the remainder of this section, we will work with tree $R$ instead of tree $T$. The following four lemmas let us efficiently infer, for any $u \in I(S)$, whether $f_{T'}(u) = 1$ or $f_{T'}(u) = 0$, for any given $T' \in SPR_R(v)$.
For brevity, let $a$ denote the node $\mathcal{M}_{S,R}(u)$, and let $Q$ denote the set $V(R) \setminus (V(R_v) \cup \{rt(R)\})$. Let $T' = SPR_R(v, x)$, for any $x \in Q$.
Depending on $a$ and $f_R(u)$ there are four possible cases: (i) $a \in V(R_v)$, (ii) $a \in Q$ and $f_R(u) = 1$, (iii) $a \in Q$ and $f_R(u) = 0$, and (iv) $a = rt(R)$. Lemmas 9 through 12 characterize the value $f_{T'}(u)$ for each of these four cases respectively.
Lemma 9. If $a \in V(R_v)$, then $f_{T'}(u) = f_R(u)$ for any $x \in Q$.
Proof. Observe that $TBR_R(v, v) = SPR_R(v)$. Lemma 3 now immediately completes the proof.    □
Lemma 10. If $a \in Q$ and $f_R(u) = 1$, then,
1. $f_{T'}(u) = 0$, for $x <_R a$, and
2. $f_{T'}(u) = 1$, otherwise.
Proof. Since $f_R(u) = 1$, Lemma 2 implies that $|\mathcal{C}_S(u)| = |\mathcal{C}_R(a)|$. Let $T' = SPR_R(v, x)$; we now have two cases.
1. $x <_R a$: In this case $\mathcal{M}_{S,T'}(u) = a$, and, since $|\mathcal{C}_S(u)| = |\mathcal{C}_R(a)| < |\mathcal{C}_{T'}(a)|$, we must have $f_{T'}(u) = 0$ (by Lemma 2).
2. $x \not<_R a$: In this case $\mathcal{M}_{S,T'}(u) = a$, and since $|\mathcal{C}_S(u)| = |\mathcal{C}_R(a)| = |\mathcal{C}_{T'}(a)|$, we must have $f_{T'}(u) = 1$ (by Lemma 2).
The lemma follows.    □
Lemma 11. If $a \in Q$ and $f_R(u) = 0$, then $f_{T'}(u) = 0$ for any $x \in Q$.
Proof. Since $f_R(u) = 0$, Lemma 2 implies that $|\mathcal{C}_S(u)| \ne |\mathcal{C}_R(a)|$. Thus, by the definition of LCA mapping, $|\mathcal{C}_S(u)| < |\mathcal{C}_R(a)|$. Let $T' = SPR_R(v, x)$; we now have two cases.
1. $x <_R a$: In this case $\mathcal{M}_{S,T'}(u) = a$, and, since $|\mathcal{C}_S(u)| < |\mathcal{C}_R(a)| < |\mathcal{C}_{T'}(a)|$, we must have $f_{T'}(u) = 0$ (by Lemma 2).
2. $x \not<_R a$: In this case $\mathcal{M}_{S,T'}(u) = a$, and since $|\mathcal{C}_S(u)| < |\mathcal{C}_R(a)| = |\mathcal{C}_{T'}(a)|$, we must have $f_{T'}(u) = 0$ (by Lemma 2).
The lemma follows.    □
For the next lemma, let $S'$ be the tree obtained from $S$ by suppressing all nodes $s$ for which $\mathcal{M}_{S,R}(s) \in V(R_v)$.
Lemma 12. If $a = rt(R)$ and $b = \mathcal{M}_{S',R}(u)$, then $f_{T'}(u) = 1$ if and only if $x <_R b$ and $|\mathcal{L}(R_b)| + |\mathcal{L}(R_v)| = |\mathcal{L}(S_u)|$.
Proof. First, observe that, since $a = rt(R)$, the mapping $\mathcal{M}_{S',R}(u)$ is well defined. Second, since $b = \mathcal{M}_{S',R}(u)$, we must have $\mathcal{L}(S'_u) \subseteq \mathcal{L}(R_b)$, which implies that $\mathcal{L}(S_u) \subseteq \mathcal{L}(R_v) \cup \mathcal{L}(R_b)$. We now have the following three cases:
1. $x \not<_R b$: In this case we must have $\mathcal{M}_{S,T'}(u) = lca_{T'}(x, b)$. By Lemma 2 we know that $f_{T'}(u) = 1$ only if $|\mathcal{C}_S(u)| = |\mathcal{C}_{T'}(\mathcal{M}_{S,T'}(u))|$. However, since we have $\mathcal{L}(S_u) \subseteq \mathcal{L}(R_v) \cup \mathcal{L}(R_b)$, and $x \not<_R b$, we must have $|\mathcal{C}_S(u)| < |\mathcal{C}_{T'}(\mathcal{M}_{S,T'}(u))|$; and hence, $f_{T'}(u) = 0$.
2. $x <_R b$ and $|\mathcal{L}(R_b)| + |\mathcal{L}(R_v)| \ne |\mathcal{L}(S_u)|$: In this case we must have $\mathcal{M}_{S,T'}(u) = b$. Since $|\mathcal{L}(R_b)| + |\mathcal{L}(R_v)| \ne |\mathcal{L}(S_u)|$, we must have $\mathcal{L}(S_u) \subset \mathcal{L}(R_v) \cup \mathcal{L}(R_b)$, which implies that $|\mathcal{C}_S(u)| < |\mathcal{C}_{T'}(\mathcal{M}_{S,T'}(u))|$. Thus, by Lemma 2, we must have $f_{T'}(u) = 0$.
3. $x <_R b$ and $|\mathcal{L}(R_b)| + |\mathcal{L}(R_v)| = |\mathcal{L}(S_u)|$: In this case we must have $\mathcal{M}_{S,T'}(u) = b$. Moreover, since $|\mathcal{L}(R_b)| + |\mathcal{L}(R_v)| = |\mathcal{L}(S_u)|$, we must have $|\mathcal{C}_S(u)| = |\mathcal{C}_{T'}(\mathcal{M}_{S,T'}(u))|$. Thus, by Lemma 2, we must have $f_{T'}(u) = 1$.
The lemma follows.    □
The Algorithm. Note that $SPR_T(v) = SPR_R(v) = \bigcup_{x \in Q} \{SPR_R(v, x)\}$. For any $x \in Q$, let $A(x) = |\{u \in I(S) : f_R(u) = 0, \text{ but } f_{T'}(u) = 1\}|$, and $B(x) = |\{u \in I(S) : f_R(u) = 1, \text{ but } f_{T'}(u) = 0\}|$, where $T' = SPR_R(v, x)$. By definition, to solve the SPR-RS problem on instance $\langle (S), T, v \rangle$ we must find a node $x \in Q$ for which $A(x) - B(x)$ is maximized. Our algorithm computes, at each node $x \in Q$, the values $A(x)$ and $B(x)$.
In a preprocessing step, our algorithm first constructs the tree $R$, computes the mapping $\mathcal{M}_{S,R}$ as well as the size of each cluster in $S$ and $R$, and creates and initializes (to 0) two counters $\alpha(x)$ and $\beta(x)$ at each node $x \in Q$. This takes a total of $O(n)$ time. When the algorithm terminates, the values $\alpha(x)$ and $\beta(x)$ at any $x \in Q$ will be the values $A(x)$ and $B(x)$.
Recall that, given $u \in I(S)$, $a$ denotes the node $\mathcal{M}_{S,R}(u)$. Thus, any given $u \in I(S)$ must satisfy the precondition (given in terms of $a$) of exactly one of Lemmas 9 through 12. Moreover, the precondition of each of these lemmas can be checked in $O(1)$ time.
The algorithm then traverses through S and considers each node u ∈ I(S). There are three cases:
1. If $u$ satisfies the preconditions of Lemmas 9 or 11 then we must have $f_{T'}(u) = f_R(u)$. Consequently, we do nothing in this case.
2. If $u$ satisfies the precondition of Lemma 10, then we increment the value of $\beta(x)$ at each node $x \in V(R_a) \setminus \{a\}$ (where $a$ is as in the statement of Lemma 10). This can be done efficiently as shown in part (2) of the algorithm for the Rooting problem.
3. If $u$ satisfies the precondition of Lemma 12, and if $|\mathcal{L}(R_b)| + |\mathcal{L}(R_v)| = |\mathcal{L}(S_u)|$, then we increment the value of $\alpha(x)$ at each node $x \in V(R_b) \setminus \{b\}$ (where $a$ and $b$ are as in the statement of Lemma 12).
Again, to do this efficiently, we increment a counter at node $b$, and perform a subsequent pre-order traversal. Note also that the mapping $\mathcal{M}_{S',R}$ can be computed in $O(n)$ time in the preprocessing step, and hence the node $b$ can be inferred in $O(1)$ time. The condition $|\mathcal{L}(R_b)| + |\mathcal{L}(R_v)| = |\mathcal{L}(S_u)|$ is also verifiable in $O(1)$ time.
Hence, the SPR-RS problem for a profile consisting of a single tree can be solved in O(n) time; yielding the following theorem.
Theorem 4. The SPR-RS problem can be solved in $O(kn)$ time.
Remark. To improve the performance of local search heuristics in phylogeny construction, the starting tree for the first local search step is often constructed using a greedy 'stepwise addition' procedure. This greedy procedure builds a starting species tree step-by-step by adding one taxon at a time at its locally optimal position. In the context of RF supertrees, our algorithm for the SPR-RS problem also yields a Θ(n) speed-up over naïve algorithms for this greedy procedure.
Experimental Evaluation
In order to evaluate the performance of the RF supertree method, we implemented an RF heuristic based on the SPR local search algorithm. We focused on the SPR local search because it is faster and simpler to implement than TBR, and in analyses of MRF and triplet supertrees, the performance of SPR and TBR was very similar [22,38]. We compared the performance of the RF supertree heuristic to MRP and the triplet supertree method (which seeks a supertree with the most shared triplets with the collection of input trees) using published supertree data sets from sea birds [40], marsupials [41], placental mammals [42], and legumes [43]. The published data sets contain between 7 and 726 input trees and between 112 and 571 total taxa (Table 1).
Table 1. Experimental Results.

Data Set                                  Supertree Method   RF-Distance   Parsimony Score
------------------------------------------------------------------------------------------
Marsupial (272 taxa; 158 trees)           RF-Ratchet         1514          2528
                                          RF-MRP             1502*         2513
                                          MRP-TBR            1514          2509*
                                          MRP-Ratchet        1514          2509*
                                          Triplet            1604          2569
Sea Birds (121 taxa; 7 trees)             RF-Ratchet         61*           223
                                          RF-MRP             61*           223
                                          MRP-TBR            63            221*
                                          MRP-Ratchet        63            221*
                                          Triplet            61*           223
Placental Mammals (116 taxa; 726 trees)   RF-Ratchet         5686*         8926
                                          RF-MRP             5690          8890
                                          MRP-TBR            5694          8878*
                                          MRP-Ratchet        5694          8878*
                                          Triplet            6032          9064
Legumes (571 taxa; 22 trees)              RF-Ratchet         1556          965
                                          RF-MRP             1534*         882
                                          MRP-TBR            1554          856
                                          MRP-Ratchet        1552          854*
                                          Triplet            N/A           N/A

Experimental results comparing the performance of the RF supertree method to MRP and triplet supertree methods. We used five different supertree analyses: RF supertrees using our SPR local search algorithm with a ratchet starting from either random addition sequence trees (RF-ratchet) or MRP trees (RF-MRP), MRP with TBR branch swapping with (MRP-ratchet) and without (MRP-TBR) a ratchet search, and triplet supertrees with a TBR local search (Triplet). We measured the RF distance to the collection of input trees (RF-distance) and the parsimony score of a best found supertree based on the matrix representation of the input trees. The best RF distance and parsimony scores for each data set are marked with an asterisk.
There are a number of ways to implement any local search algorithm. Preliminary analyses of the RF heuristic based on the SPR local search indicated that, as with other phylogenetic methods, the starting tree can affect the estimate of the final supertree. Occasionally the SPR searches got caught in local optima with relatively high RF-distance scores. To ameliorate this potential problem, we implemented a ratchet search heuristic for RF supertrees based on the parsimony ratchet [44]. In general, a ratchet search performs a number of iterations -- in our case 25 -- that consist of two local SPR searches: one in which the characters (input trees) are equally weighted, and another in which the characters are re-weighted. We re-weighted the characters by randomly removing approximately two-thirds of the input trees. The goal of re-weighting the characters is to alter the tree space to avoid getting caught in a globally suboptimal part of the tree space. At the end of each iteration, the best tree is taken as the starting point of the next iteration. For each data set, we started RF ratchet searches from 20 random sequence addition starting trees, and we also ran three replicates starting from an optimal MRP supertree. All RF supertree analyses were performed on a 3 GHz Intel Pentium 4 based desktop computer with 1 GB of main memory. The RF-ratchet runs took between 5 seconds (for the sea birds data set) and 90 minutes (for the legume data set) when starting from a random sequence addition tree. RF-ratchet runs starting from optimal MRP trees were at least twice as fast because they required fewer search steps.
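The ratchet loop described above can be outlined in Python. This is a structural sketch only, not the released program: `local_search` and `score` are hypothetical callables supplied by the caller (for RF supertrees they would be an SPR hill-climber and the total RF distance), and the order of the two searches within an iteration (re-weighted first, then equally weighted) is an assumption borrowed from the usual form of the parsimony ratchet.

```python
import random


def ratchet(start, trees, local_search, score, iterations=25,
            keep_fraction=1 / 3, seed=0):
    """Ratchet search: alternate perturbed and unperturbed local searches.

    local_search(tree, trees) -> locally optimal tree for the given input trees
    score(tree, trees)        -> value to minimize (e.g. total RF distance)
    """
    rng = random.Random(seed)
    best = local_search(start, trees)
    best_score = score(best, trees)
    for _ in range(iterations):
        # Re-weight by randomly dropping roughly two-thirds of the input trees.
        subset = [t for t in trees if rng.random() < keep_fraction] or list(trees)
        perturbed = local_search(best, subset)
        # Search again with all input trees equally weighted.
        candidate = local_search(perturbed, trees)
        candidate_score = score(candidate, trees)
        if candidate_score < best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-in: "trees" are integers and the search space is the integer line.
def toy_score(x, values):
    return sum(abs(x - v) for v in values)


def toy_search(x, values):
    while True:
        nxt = min((x - 1, x + 1), key=lambda y: toy_score(y, values))
        if toy_score(nxt, values) >= toy_score(x, values):
            return x
        x = nxt


print(ratchet(0, [1, 5, 9], toy_search, toy_score))
```

The toy problem is convex, so the ratchet cannot improve on plain hill climbing there; it only demonstrates the control flow.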
For our MRP analyses, we also tried two heuristic search methods, both implemented using PAUP* [18]. First, we performed 20 replicates of TBR branch swapping from trees built with random addition sequence starting trees. Next, we performed 20 replicates of a parsimony ratchet search with TBR branch swapping. Based on the results of trial analyses, each ratchet search consisted of 25 iterations, each reweighting 15% of the characters. The PAUP* command block for the parsimony ratchet searches was generated using PAUPRat [45]. For each data set, we performed 20 replicates of a TBR local search heuristic starting with random addition sequence trees. Triplet supertrees were constructed using the program from Lin et al. [38]. We were unable to perform ratchet searches with the existing triplet supertree software, and also, due to memory limitations, we were unable to perform triplet supertree analyses on the legume data set.
Our analyses demonstrate the effectiveness of our local search heuristics for the RF supertree problem. In all four data sets, RF-ratchet searches found the supertrees with the lowest total RF distance to the input trees (Table 1). MRP also generally performs well, finding supertrees with RF distances between 0.14% (placental mammals) and 3.3% (sea birds) higher than the best score found by the RF supertree heuristics (Table 1). The triplet supertree method performs as well as the RF supertree method on the small sea bird data set; however, the triplet supertrees for the marsupial and placental mammal data sets have a much higher RF distance to the input trees than either the RF or MRP supertrees (Table 1). For all the data sets, the MRP supertrees had the lowest (best) parsimony score based on a binary matrix representation of the input trees (Table 1). Thus, not surprisingly, it appears that optimizing based on the parsimony score or the triplet distance to the input trees does not optimize the similarity of the supertrees to the input trees based on the RF distance metric (see also [11,13]).
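The percentages quoted above can be re-derived from Table 1. The short Python check below copies, for each data set, the best RF distance reached by either RF heuristic and the best RF distance reached by either MRP analysis; the dictionary and function names are illustrative.

```python
# Best RF distance per data set, copied from Table 1.
best_rf = {"marsupial": 1502, "sea birds": 61,
           "placental mammals": 5686, "legumes": 1534}
best_mrp = {"marsupial": 1514, "sea birds": 63,
            "placental mammals": 5694, "legumes": 1552}


def percent_excess(mrp, rf):
    """How much higher (in percent) the MRP score is than the RF score."""
    return 100.0 * (mrp - rf) / rf


excess = {name: percent_excess(best_mrp[name], best_rf[name]) for name in best_rf}
for name, value in sorted(excess.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}%")
```

The smallest gap is for placental mammals and the largest for sea birds, in agreement with the range given in the text.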
All of the data sets used in this analysis are from published studies that used MRP. Therefore, it is not surprising that MRP performed well (but see [46]). Still, our results demonstrate that MRP leaves some room for improvement. If the goal is to find the supertrees that are most similar to the collection of input trees, the RF searches ultimately provide better estimates than MRP (Table 1).
Interestingly, while the MRP trees tend to have relatively low RF-distance scores, in some cases, such as the legume data set, trees with low RF-distance scores have high parsimony scores (Table 1). Thus, parsimony scores are not necessarily indicative of RF score, and MRP and RF supertree optimality criteria are certainly not equivalent. Still, MRP trees appear to be useful as starting points for RF supertree heuristics. Indeed, in three of the four data sets, the best RF trees were found in ratchet searches beginning from MRP trees (Table 1).
Our program for computing RF supertrees is freely available (for Windows, Linux, and Mac OS X) at http://genome.cs.iastate.edu/CBL/RFsupertrees
Discussion and Conclusion
There is a growing interest in using supertrees for large-scale evolutionary and ecological analyses. Yet there are many concerns about the performance of existing supertree methods, and the great majority of published supertree analyses have relied only on MRP [47]. Since the goal of a supertree analysis is to synthesize the phylogenetic data from a collection of input trees, it makes sense that an effective supertree method should directly seek the supertree that is most similar to the input trees. Our new algorithms make it possible, for the first time, to estimate large supertrees by directly optimizing the RF distance from the supertree to the input trees.
There are numerous alternate metrics to compare phylogenetic trees besides the RF distance, and any of these can be used for supertree methods (see, for example, [11]). Triplet distance supertrees [11,48], quartet-fit and quartet joining supertrees [11,24], maximum splits-fit supertrees [11], and most similar supertrees [49] are all, like RF supertrees, estimated by comparing input trees to the supertree using tree distance measures. All of these methods may provide different, and perhaps equally valid, perspectives on supertree accuracy. Based on our experimental analyses using the RF and triplet supertree method, optimizing the supertree based on different distance measures can result in very different supertrees (Table 1). In the future, it will be important to characterize and compare the performance of these methods in more detail (see, for example, [11,50]).
The results also suggest several future directions for research. Although heuristics guided by local search problems, especially SPR and TBR, have been very effective for many intrinsically difficult phylogenetic inference problems, our experiments indicate that the tree space for RF supertrees is complex. Both the ratchet approach and starting from MRP trees appear to improve the performance in the four examples we tested (Table 1). However, more work is needed to identify the most efficient ways to implement our fast local search heuristics. Also, the use of alternative supertree methods (other than MRP) to generate starting trees might result in a better global strategy to compute RF supertrees, and this should be investigated further. We note that the ideas presented in [51] can be directly used to perform efficient NNI-based local searches for the RF supertree problem. In particular, we can show that heuristic searches for the RF supertree problem, which perform a total of p local search steps based on 1, 2, or 3-NNI neighborhoods (see [51]), can all be executed in O(kn(n + p)) time, yielding speed-ups of Θ(min{n, p}), Θ(n·min{n, p}) and Θ(n²·min{n, p}) over heuristic searches that are based on naïve algorithms for 1, 2 and 3-NNI local searches respectively. It would also be interesting to see if heuristics based on TBR perform significantly better than those based on SPR in inferring RF supertrees.
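The NNI speed-up factors quoted above can be checked with a short calculation. The sketch below assumes, as is standard but not stated in this section, that the 1-, 2- and 3-NNI neighborhoods of a tree on n leaves contain Θ(n), Θ(n²) and Θ(n³) trees, and that a naïve search scores each neighbor against all k input trees in O(kn) time.

```latex
\begin{align*}
\text{na\"ive cost of $p$ steps with $j$-NNI}
  &= O\!\left(p \cdot n^{j} \cdot kn\right) = O\!\left(k\,n^{j+1}\,p\right),
  \qquad j \in \{1, 2, 3\},\\[2pt]
\text{speed-up}
  &= \frac{k\,n^{j+1}\,p}{k\,n\,(n+p)} = n^{j-1}\cdot\frac{np}{n+p},\\[2pt]
\tfrac{1}{2}\min\{n,p\} \;\le\; \frac{np}{n+p} &\;\le\; \min\{n,p\}
  \quad\Longrightarrow\quad
  \text{speed-up} = \Theta\!\left(n^{j-1}\min\{n,p\}\right).
\end{align*}
```

Setting j = 1, 2, 3 gives the three factors listed in the text.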
In some cases it might be desirable to remove the restriction that the supertree be binary. In the consensus setting, such a median tree can be obtained within polynomial time [27]; however, finding a median RF tree in the supertree setting is NP-hard [52]. One simple way to estimate a non-binary median tree could be to first compute an RF supertree and then to iteratively (and perhaps greedily) contract those edges in the supertree that result in a reduction in the total RF distance. Thus, our algorithms may even be useful for roughly estimating majority-rule(-) supertrees [28], which are essentially the strict consensus of all optimal, not necessarily binary, median RF trees, and have several desirable properties [29]. These majority-rule(-) supertrees are also the strict consensus of all maximum-likelihood supertrees [53]. Also, there are several alternate forms of the RF distance metric that could be incorporated into our local search algorithms. For example, in order to account for biases associated with the different sizes of input trees, we could normalize the RF distance for each input tree, dividing the observed RF distance by the maximum possible RF distance based on the tree size. Similarly, we could incorporate either branch length data or phylogenetic support scores (bootstrap values or posterior probabilities) from the input trees into the RF distance in order to give more weight to partitions that are strongly supported or separated by long branches (e.g., [25,54]). Our current implementation essentially treats all branch lengths as one and all partitions as equal. The addition of branch length or support data may further improve the accuracy of the RF supertree method.
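The size normalization suggested above could look roughly like the following. This is a minimal sketch, not part of the published implementation: the function names are hypothetical, the raw RF value is assumed to be the unhalved count of clusters in the symmetric difference, and the maximum is taken to be 2(m - 2) for two binary rooted trees on m leaves (non-binary input trees would have a smaller maximum).

```python
def normalized_rf(raw_rf, num_leaves):
    """Divide a raw RF count by its assumed maximum for binary rooted trees.

    Assumption: two binary rooted trees on m leaves each have m - 2 clusters
    besides the leaves and the full leaf set, so the symmetric difference
    holds at most 2 * (m - 2) clusters.
    """
    max_rf = 2 * (num_leaves - 2)
    if max_rf <= 0:
        # Trees with fewer than three leaves cannot disagree.
        return 0.0
    return raw_rf / max_rf


def total_normalized_rf(per_tree):
    """Sum normalized distances over (raw_rf, num_leaves) pairs, one per input tree."""
    return sum(normalized_rf(rf, m) for rf, m in per_tree)


# Hypothetical scores: a 3-leaf input tree at maximal distance, a 10-leaf tree at RF 4.
score = total_normalized_rf([(2, 3), (4, 10)])
```

With this scaling each input tree contributes a value between 0 and 1, so a large input tree no longer dominates the objective simply because it has more clusters.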
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MSB was responsible for algorithm design and program implementation, contributed to the experimental evaluation, and wrote major parts of the manuscript. JGB performed the experimental evaluation and the analysis of the results, and contributed to the writing of the manuscript. OE and DFB supervised the project and contributed to the writing of the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We thank Harris Lin for providing software for the triplet supertree analyses. This work was supported in part by NESCent and by NSF grants DEB-0334832 and DEB-0829674. MSB was supported in part by a postdoctoral fellowship from the Edmond J. Safra Bioinformatics program at Tel-Aviv University.
References
1. Davies TJ, Barraclough TG, Chase MW, Soltis PS, Soltis DE, Savolainen V: Darwin's abominable mystery: Insights from a supertree of the angiosperms. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(7):1904-1909. doi:10.1073/pnas.0308127100
2. Bininda-Emonds ORP, Cardillo M, Jones KE, Macphee RDE, Beck RMD, Grenyer R, Price SA, Vos RA, Gittleman JL, Purvis A: The delayed rise of present-day mammals. Nature 2007, 446(7135):507-512. doi:10.1038/nature05634
3. Daubin V, Gouy M, Perriere G: A Phylogenomic Approach to Bacterial Phylogeny: Evidence of a Core of Genes Sharing a Common History. Genome Res 2002, 12(7):1080-1090. doi:10.1101/gr.187002
4. Burleigh JG, Driskell AC, Sanderson MJ: Supertree bootstrapping methods for assessing phylogenetic variation among genes in genome-scale data sets. Systematic Biology 2006, 55:426-440. doi:10.1080/10635150500541722
5. Pisani D, Cotton JA, McInerney JO: Supertrees disentangle the chimerical origin of eukaryotic genomes. Mol Biol Evol 2007, 24(8):1752-1760. doi:10.1093/molbev/msm095
6. Webb CO, Ackerly DD, McPeek M, Donoghue MJ: Phylogenies and community ecology. Ann Rev Ecol Syst 2002, 33:475-505. doi:10.1146/annurev.ecolsys.33.010802.150448
7. Davies TJ, Fritz SA, Grenyer R, Orme CDL, Bielby J, Bininda-Emonds ORP, Cardillo M, Jones KE, Gittleman JL, Mace GM, Purvis A: Phylogenetic trees and the future of mammalian biodiversity. Proceedings of the National Academy of Sciences 2008, 105(Supplement 1):11556-11563. doi:10.1073/pnas.0801917105
8. Purvis A: A modification to Baum and Ragan's method for combining phylogenetic trees. Systematic Biology 1995, 44:251-255.
9. Pisani D, Wilkinson M: Matrix Representation with Parsimony, Taxonomic Congruence, and Total Evidence. Systematic Biology 2002, 51:151-155. doi:10.1080/106351502753475925
10. Bininda-Emonds ORP, Gittleman JL, Steel MA: The (super) tree of life: procedures, problems, and prospects. Annual Review of Ecology and Systematics 2002, 33:265-289. doi:10.1146/annurev.ecolsys.33.010802.150511
11. Wilkinson M, Cotton JA, Creevey C, Eulenstein O, Harris SR, Lapointe FJ, Levasseur C, McInerney JO, Pisani D, Thorley JL: The shape of supertrees to come: Tree shape related properties of fourteen supertree methods. Syst Biol 2005, 54:419-432. doi:10.1080/10635150590949832
12. Goloboff PA: Minority rule supertrees? MRP, Compatibility, and Minimum Flip may display the least frequent groups. Cladistics 2005, 21(3):282-294. doi:10.1111/j.1096-0031.2005.00064.x
13. Wilkinson M, Cotton JA, Lapointe FJ, Pisani D: Properties of Supertree Methods in the Consensus Setting. Syst Biol 2007, 56(2):330-337. doi:10.1080/10635150701245370
14. Day WH, McMorris F, Wilkinson M: Explosions and hot spots in supertree methods. Journal of Theoretical Biology 2008, 253(2):345-348. doi:10.1016/j.jtbi.2008.03.024
15. Baum BR: Combining Trees as a Way of Combining Data Sets for Phylogenetic Inference, and the Desirability of Combining Gene Trees. Taxon 1992, 41:3-10. doi:10.2307/1222480
16. Ragan MA: Phylogenetic inference based on matrix representation of trees. Molecular Phylogenetics and Evolution 1992, 1:53-58. doi:10.1016/1055-7903(92)90035-F
17. Goloboff PA: Analyzing Large Data Sets in Reasonable Times: Solutions for Composite Optima. Cladistics 1999, 15(4):415-428. doi:10.1111/j.1096-0031.1999.tb00278.x
18. Swofford DL: PAUP*: Phylogenetic analysis using parsimony (*and other methods), Version 4.0b10. 2002.
19. Roshan U, Moret BME, Warnow T, Williams TL: Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees. CSB 2004:98-109.
20. Bininda-Emonds O, Sanderson M: Assessment of the accuracy of matrix representation with parsimony analysis supertree construction. Systematic Biology 2001, 50:565-579. doi:10.1080/106351501750435112
21. Eulenstein O, Chen D, Burleigh JG, Fernández-Baca D, Sanderson MJ: Performance of Flip Supertree Construction with a Heuristic Algorithm. Systematic Biology 2003, 53:299-308. doi:10.1080/10635150490423719
22. Chen D, Eulenstein O, Fernández-Baca D, Burleigh JG: Improved Heuristics for Minimum-Flip Supertree Construction. Evolutionary Bioinformatics 2006, 2.
23. Creevey CJ, McInerney JO: Clann: investigating phylogenetic information through supertree analyses. Bioinformatics 2005, 21(3):390-392. doi:10.1093/bioinformatics/bti020
24. Wilkinson M, Cotton JA: Supertree Methods for Building the Tree of Life: Divide-and-Conquer Approaches to Large Phylogenetic Problems. In Reconstructing the Tree of Life: Taxonomy and Systematics of Species Rich Taxa. Edited by Hodkinson TR, Parnell JAN. CRC Press; 2007:61-76.
25. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Mathematical Biosciences 1981, 53(1-2):131-147. doi:10.1016/0025-5564(81)90043-2
26. McMorris FR, Steel MA: The complexity of the median procedure for binary trees. Proceedings of the International Federation of Classification Societies 1993.
27. Barthélemy JP, McMorris FR: The median procedure for n-trees. Journal of Classification 1986, 3:329-334. doi:10.1007/BF01894194
28. Cotton JA, Wilkinson M: Majority-Rule Supertrees. Systematic Biology 2007, 56:445-452. doi:10.1080/10635150701416682
29. Dong J, Fernández-Baca D: Properties of Majority-Rule Supertrees. Syst Biol 2009, 58(3):360-367. doi:10.1093/sysbio/syp032
30. Day WHE: Optimal algorithms for comparing trees with labeled leaves. Journal of Classification 1985, 2:7-28. doi:10.1007/BF01908061
31. Pattengale ND, Gottlieb EJ, Moret BME: Efficiently Computing the Robinson-Foulds Metric. Journal of Computational Biology 2007, 14(6):724-735. doi:10.1089/cmb.2007.R012
32. Semple C, Steel M: Phylogenetics. Oxford University Press; 2003.
33. Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 2001, 5:1-13. doi:10.1007/s00026-001-8006-8
34. Swofford DL, Olsen GJ, Waddel PJ, Hillis DM: Phylogenetic inference. In Molecular Systematics. Edited by Hillis DM, Moritz C, Mable BK. Sunderland, Mass: Sinauer Assoc; 1996:407-509.
35. Bordewich M, Semple C: On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics 2004, 8:409-423. doi:10.1007/s00026-004-0229-z
36. Bansal MS, Burleigh JG, Eulenstein O, Wehe A: Heuristics for the Gene-Duplication Problem: A Θ(n) Speed-Up for the Local Search. In RECOMB, Volume 4453 of Lecture Notes in Computer Science. Edited by Speed TP, Huang H. Springer; 2007:238-252.
37. Bansal MS, Eulenstein O: An Ω(n²/log n) Speed-Up of TBR Heuristics for the Gene-Duplication Problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2008, 5(4):514-524. doi:10.1109/TCBB.2008.69
38. Lin H, Burleigh JG, Eulenstein O: Triplet supertree heuristics for the tree of life. BMC Bioinformatics 2009, 10(Suppl 1):S8. doi:10.1186/1471-2105-10-S1-S8
39. Bender MA, Farach-Colton M: The LCA Problem Revisited. In LATIN, Volume 1776 of Lecture Notes in Computer Science. Edited by Gonnet GH, Panario D, Viola A. Springer; 2000:88-94.
40. Kennedy M, Page R: Seabird supertrees: combining partial estimates of procellariiform phylogeny. The Auk 2002, 119:88-108. doi:10.1642/0004-8038(2002)119[0088:SSCPEO]2.0.CO;2
41. Cardillo M, Bininda-Emonds ORP, Boakes E, Purvis A: A species-level phylogenetic supertree of marsupials. Journal of Zoology 2004, 264:11-31. doi:10.1017/S0952836904005539
42. Beck R, Bininda-Emonds O, Cardillo M, Liu FG, Purvis A: A higher-level MRP supertree of placental mammals. BMC Evolutionary Biology 2006, 6:93. doi:10.1186/1471-2148-6-93
43. Wojciechowski M, Sanderson M, Steele K, Liston A: Molecular phylogeny of the "Temperate Herbaceous Tribes" of Papilionoid legumes: a supertree approach. In Advances in Legume Systematics. Edited by Herendeen P, Bruneau A. Kew: Royal Botanic Gardens; 2000, 9:277-298.
44. Nixon KC: The parsimony ratchet: a new method for rapid parsimony analysis. Cladistics 1999, 15:407-414. doi:10.1111/j.1096-0031.1999.tb00277.x
45. Sikes DS, Lewis PO: PAUPRat: PAUP* implementation of the parsimony ratchet. 2001.
46. Wilkinson M, Pisani D, Cotton JA, Corfe I: Measuring Support and Finding Unsupported Relationships in Supertrees. Syst Biol 2005, 54(5):823-831. doi:10.1080/10635150590950362
47. Bininda-Emonds OR (Ed): Phylogenetic supertrees. Springer Verlag; 2004.
48. Snir S, Rao S: Using Max Cut to Enhance Rooted Trees Consistency. IEEE/ACM Trans. Comput. Biology Bioinform 2006, 3(4):323-333. doi:10.1109/TCBB.2006.58
49. Creevey CJ, Fitzpatrick DA, Gayle K Philip RJK, O'Connell MJ, Pentony MM, Travers SA, Wilkinson M, McInerney JO: Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc Biol Sci 2004, 271(1557):2551-2558. doi:10.1098/rspb.2004.2864
50. Thorley JL, Wilkinson M: A View of Supertree Methods. In Bioconsensus, Volume 61 of DIMACS: Series in Discrete Mathematics and Theoretic Computer Science. Providence, Rhode Island, USA: American Mathematical Society; 2003:185-193.
51. Bansal MS, Eulenstein O, Wehe A: The Gene-Duplication Problem: Near-Linear Time Algorithms for NNI-Based Local Searches. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2009, 6(2):221-231. doi:10.1109/TCBB.2009.7
52. Bryant D: Building trees, hunting for trees, and comparing trees: Theory and methods in phylogenetic analysis. PhD thesis. Dept. of Mathematics, University of Canterbury; 1997.
53. Steel M, Rodrigo A: Maximum likelihood supertrees. Syst. Biol 2008, 57:243-250. doi:10.1080/10635150802033014
54. Kuhner MK, Felsenstein J: A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates [published erratum appears in Mol Biol Evol 1995 May;12(3):525]. Mol Biol Evol 1994, 11(3):459-468.


Abstract
Background
Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees.
Results
We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Θ(n) and Θ(n²) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods.
Conclusions
Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.
Creator: Bansal, Mukul S
Burleigh, J G
Eulenstein, Oliver
Fernandez-Baca, David
Language: en
Type: Journal Article
Available: 2010-02-24
Publisher: BioMed Central Ltd
Status: Peer Reviewed
Copyright Holder: Bansal et al.; licensee BioMed Central Ltd.
License: http://creativecommons.org/licenses/by/2.0
Access Rights: Open Access
Bibliographic Citation: Algorithms for Molecular Biology. 2010 Feb 24;5(1):18
Identifier: http://dx.doi.org/10.1186/1748-7188-5-18


Robinson-Foulds Supertrees
Permanent Link: http://ufdc.ufl.edu/UF00100271/00001
 Material Information
Title: Robinson-Foulds Supertrees
Series Title: Algorithms for Molecular Biology 2010, 5:18
Physical Description: Archival
Creator: Bansal MS
Burleigh JG
Eulenstein O
Fernández-Baca D
Publication Date: 2010-02-24
 Record Information
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: Open Access: http://www.biomedcentral.com/info/about/openaccess/
Resource Identifier:
System ID: UF00100271:00001



Bansal et al. Algorithms for Molecular Biology 2010, 5:18
http://www.almob.org/content/5/1/18


RESEARCH Open Access


Robinson-Foulds Supertrees

Mukul S Bansal1,2, J Gordon Burleigh3, Oliver Eulenstein2, David Fernández-Baca2*


* Correspondence: fernande@cs.iastate.edu
2 Department of Computer Science, Iowa State University, Ames, IA 50011, USA

© 2010 Bansal et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees.
Results: We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Θ(n) and Θ(n²) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods.
Conclusions: Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.

Introduction
Supertree methods provide a formal approach for combining small phylogenetic trees with incomplete species overlap in order to build comprehensive species phylogenies, or supertrees, that contain all species found in the input trees. Supertree analyses have produced the first family-level phylogeny of flowering plants [1] and the first phylogeny of nearly all extant mammal species [2]. They have also enabled phylogenetic analyses using large-scale genomic data sets in bacteria, across eukaryotes, and within plants [3,4] and have helped elucidate the origin of eukaryotic genomes [5]. Furthermore, supertrees have been used to examine rates and patterns of species diversification [1,2], to test hypotheses regarding the structure of ecological communities [6], and to examine extinction risk in current species [7].
Although supertrees can support large-scale evolutionary and ecological analyses, there are still numerous concerns about the performance of existing supertree methods (e.g., [8-14]). In general, an effective supertree method must accurately estimate phylogenies from large data sets in a reasonable amount of time while retaining much of the phylogenetic information from the input trees.
By far the most commonly used supertree method is matrix representation with parsimony (MRP), which works by solving the parsimony problem on a binary matrix representation of the input trees [15,16]. While the parsimony problem is NP-hard, MRP can take advantage of fast and effective hill-climbing heuristics implemented in PAUP* or TNT (e.g., [17-19]). MRP heuristics often perform well in analyses of both simulated and empirical data sets (e.g., [20-22]); however, there are numerous criticisms of MRP. For example, MRP shows evidence of biases based on the shape and size of input trees [8,11], and MRP supertrees may contain relationships that are not supported by any of the input trees [9,12]. Furthermore, it is unclear if or why minimizing the parsimony score of a matrix representation of input trees is a good optimality criterion or should produce accurate supertrees.
Since evolutionary biologists rarely, if ever, know the true relationships for a group of species, it is difficult to assess the accuracy of supertree, or any phylogenetic, methods. One approach to evaluate the accuracy of supertrees is with simulations (e.g., [20,21]). However, simulations inherently simplify the true processes of evolution, and it is unclear how well the performance of a phylogenetic method in simulations corresponds to its performance with empirical data. Perhaps a more useful way to define the accuracy of a supertree method is to quantify the amount of phylogenetic information from the input trees that is retained in the supertree. Ideally, we want the supertree to reflect the input tree topologies as much as possible. This suggests that the supertree objective should directly evaluate the similarity of the supertree to the input trees (e.g., [11,23,24]).
Numerous metrics exist to measure the similarity of input trees to a supertree, and the Robinson-Foulds (RF) distance metric [25] is among the most widely used. In fact, numerous studies have evaluated the performance of supertree methods, including MRP, by measuring the RF distance between collections of input trees and the resulting supertrees (e.g., [11,20,21]). The RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances between every rooted input tree and the supertree. The intuition behind seeking a binary supertree is that, in this setting, minimizing the RF distance is equivalent to maximizing the number of clusters (or clades) that are shared by the supertree and the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters from the input trees. Unfortunately, as with MRP, computing RF supertrees is NP-hard [26]. In this work, we describe efficient hill-climbing heuristics to estimate RF supertrees. These heuristics allow the first large-scale estimates of RF supertrees and comparisons of the accuracy of RF supertrees to other commonly used supertree methods.
The RF distance metric between two rooted trees is defined to be a normalized count of the symmetric difference between the sets of clusters of the two trees. In the supertree setting, the input trees will often have only a strict subset of the taxa present in the supertree. Thus, a high RF distance between an input tree and a supertree does not necessarily correspond to conflicting evolutionary histories; it can also indicate incomplete phylogenetic information. Consequently, in order to compute the RF distance between an input tree which has only a strict subset of the taxa in the supertree, we first restrict the supertree to only the leaf set of the input tree. This adapted version of the RF distance is not a metric, or even a distance measure (mathematically speaking). However, for convenience, we will refer to this adapted version of the RF distance metric using the same name.
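The restricted RF distance just described can be illustrated with a small sketch. It assumes rooted trees encoded as nested tuples of leaf labels and reports the raw size of the symmetric difference (any normalization constant is left out); the helper names `clusters`, `restrict` and `rf_distance` are hypothetical and are not taken from the authors' program.

```python
def clusters(tree):
    """Return (leaf set, set of clusters) for a rooted tree given as nested tuples."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf contributes no cluster of its own
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)  # one cluster per internal node
        return leaves

    return walk(tree), found


def restrict(tree, keep):
    """Restrict a tree to the leaves in `keep`, suppressing single-child nodes."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def rf_distance(supertree, input_tree):
    """Count clusters found in exactly one of: input tree, restricted supertree."""
    leaves, input_clusters = clusters(input_tree)
    _, super_clusters = clusters(restrict(supertree, leaves))
    return len(input_clusters ^ super_clusters)


supertree = ((("a", "b"), "c"), ("d", "e"))
agreeing = (("a", "c"), "d")      # matches the supertree restricted to {a, c, d}
conflicting = (("a", "d"), "c")   # groups a with d instead of with c
```

Here the first input tree has distance 0 even though it lacks taxa b and e, which shows why restriction is needed before comparing; the second differs in one cluster on each side.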

Previous work
Supertree methods are a generalization of consensus methods, in which all the input trees have the same leaf set. The problem of finding an optimal median tree under the RF distance in such a consensus setting is well-studied. In particular, it is known that the majority-rule consensus of the input trees must be a median tree [27], and it can be found in polynomial time. On the other hand, finding the optimum binary median tree, i.e., an RF supertree, in the consensus setting is NP-hard [26]. This implies that computing an RF supertree in general is NP-hard as well.
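As a concrete illustration of the consensus setting, the clusters of a majority-rule consensus are those found in more than half of the input trees. The sketch below assumes each input tree is already given as a set of clusters (frozensets of leaf labels) over the same leaf set; the function name and the toy data are illustrative only.

```python
from collections import Counter


def majority_rule_clusters(trees_as_clusters):
    """Return the clusters present in strictly more than half of the input trees."""
    counts = Counter(c for tree in trees_as_clusters for c in set(tree))
    threshold = len(trees_as_clusters) / 2
    return {c for c, seen in counts.items() if seen > threshold}


# Three toy input trees on the leaf set {a, b, c, d}, written as cluster sets.
t1 = {frozenset("ab"), frozenset("abc"), frozenset("abcd")}
t2 = {frozenset("ab"), frozenset("abd"), frozenset("abcd")}
t3 = {frozenset("bc"), frozenset("abc"), frozenset("abcd")}
consensus = majority_rule_clusters([t1, t2, t3])
```

Counting clusters in this way takes time polynomial in the input size, which is consistent with the polynomial-time claim above; the hard part of the RF supertree problem comes from the binary requirement and from input trees with differing leaf sets.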
Our definition of RF distance between two trees where one has only a strict subset of the taxa in the other corresponds to the distance measure used to define "majority-rule(-) supertrees" by Cotton and Wilkinson [28]. This definition restricts the larger tree to only the leaf set of the smaller tree before evaluating the RF distance. Majority-rule(-) supertrees are defined to be the strict consensus of all the optimal median trees under the RF distance. These median trees are defined similarly to RF supertrees, except that RF supertrees must be binary while the median trees can be non-binary. In general, majority-rule supertrees [28], in both their (-) and (+) variants, seek to generalize the majority-rule consensus. Indeed, majority-rule supertrees have been shown to have several desirable properties reminiscent of majority-rule consensus trees [29]. Although majority-rule supertrees and RF supertrees are both based on minimizing RF distance, they represent two different approaches to supertree construction. In particular, the RF supertree method seeks a supertree that is consistent with the largest number of clusters (clades) from the input trees, while majority-rule supertrees do not. Nevertheless, as we discuss later, RF supertrees could be used as a starting point to estimate majority-rule(-) supertrees.
The RF distance between two trees on the same leaf set of size n, with leaves labeled by integers {1, ..., n}, can be computed in O(n) time [30]. In fact, a (1 + ε)-approximate value of the RF distance can be computed in sublinear time, with high probability [31].
In the case of unrooted trees, the RF distance metric is sometimes also known as the splits metric (e.g., [32]). The supertree analysis package Clann [23] provides heuristics that operate on unrooted trees and attempt to maximize the number of splits shared between the input trees and the inferred supertree. This method is called the "maximum splits-fit" method.

Local Search
We use a heuristic approach for the RF supertree problem. Local search is the basis of effective heuristics for many phylogenetic problems. These heuristics iteratively search through the space of possible supertrees guided, at each step, by solutions to some local search problem. More formally, in these heuristics, a tree graph (see [32,33]) is defined for the given set of input trees and some fixed tree edit operation. The node set of this tree graph represents the set of all supertrees on the given input trees. An edge is drawn between two nodes exactly if the corresponding trees can be transformed into each other by one tree edit operation. In our setting, the cost of a node in the graph is the RF distance between the supertree represented by that node and the given input trees. Given an initial node in the tree graph, the heuristic's task is to find a maximal-length path of steepest descent in the cost of its nodes and to return the last node on such a path. This path is found by solving the local search problem at every node along the path. The local search problem is to find a node with the minimum cost in the neighborhood of a given node. The neighborhood is defined by some tree edit operation, and hence, the time complexity of the local search problem depends on the tree edit operation used.
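The steepest-descent loop described above can be sketched generically. This is only an illustration of the control flow: `neighbors` and `cost` are hypothetical callables, and the tree graph is replaced by a toy integer line so that the example stays self-contained. In the RF setting, `neighbors` would enumerate the SPR or TBR neighborhood of a supertree and `cost` would return its total RF distance to the input trees.

```python
def steepest_descent(start, neighbors, cost):
    """Follow a path of steepest descent until no neighbor is strictly better."""
    current, current_cost = start, cost(start)
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current, current_cost
        # Local search problem: find a minimum-cost node in the neighborhood.
        best = min(candidates, key=cost)
        best_cost = cost(best)
        if best_cost >= current_cost:
            # Local optimum reached: the last node on the path is returned.
            return current, current_cost
        current, current_cost = best, best_cost


# Toy stand-in for the tree graph: nodes are integers, edges join x and x +/- 1.
result = steepest_descent(0, lambda x: [x - 1, x + 1], lambda x: (x - 7) ** 2)
```

The expensive part in practice is the `min` over the neighborhood, which is exactly the local search problem whose running time the paper improves.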
Two of the most extensively used tree edit operations
for supertrees are rooted Subtree Prune and Regraft
(SPR) [33-35] and rooted Tree Bisection and Reconnection
(TBR) [22,33,34]. The best known (naive) algorithms
for the SPR and TBR local search problems for
the RF supertree problem require O(kn^3) and O(kn^4)
time respectively, where k is the number of input trees,
and n is the number of leaves in the supertree solution.

Our Contribution
We describe efficient hill-climbing heuristics for the RF
supertree problem. These heuristics are based on novel
non-trivial algorithms that can solve the corresponding
local search problems for both SPR and TBR in O(kn^2)
time, yielding speed-ups of Θ(n) and Θ(n^2) over the best
known solutions respectively. These new algorithms are
inspired by fast local search algorithms for the gene
duplication problem [36,37]. Note that while the super-
tree itself must be binary, our algorithms work even if
the input trees are not. We also examine the perfor-
mance of the RF supertree method using four published
supertree data sets, and compare its performance with
MRP and the triplet supertree method [38]. We demon-
strate that the new algorithms enable RF supertree ana-
lyses on large data sets and that the RF supertree


method outperforms other supertree methods in finding
supertrees that are most similar to the input trees based
on the RF distance metric.

Basic Notation and Preliminaries
A tree T is a connected acyclic graph, consisting of a
node set V (T) and an edge set E(T). T is rooted if it has
exactly one distinguished node called the root which we
denote by rt(T). Throughout this work, the term tree
refers to a rooted tree. We define ≤_T to be the partial
order on V(T) where x ≤_T y if y is a node on the path
between rt(T) and x. The set of minima under ≤_T is
denoted by L(T) and its elements are called leaves. The
set of all non-root internal nodes of T, denoted by I(T),
is defined to be the set V(T)\(L(T) ∪ {rt(T)}). If {x, y} ∈
E(T) and x ≤_T y, then we call y the parent of x, denoted
by pa_T(x), and we call x a child of y. The set of all children
of y is denoted by Ch_T(y). T is fully binary if every
node has either zero or two children. If two nodes in T
have the same parent, they are called siblings. The least
common ancestor of a non-empty subset L ⊆ V(T),
denoted as lca(L), is the unique smallest upper bound of
L under ≤_T. The subtree of T rooted at node y ∈ V(T),
denoted by T_y, is the tree induced by {x ∈ V(T): x ≤_T y}.
For each node v ∈ I(T), the cluster C_T(v) is defined to
be the set of all leaf nodes in T_v; i.e. C_T(v) = L(T_v). We
denote the set of all clusters of a tree T by ℋ(T). Given
a set L ⊆ L(T), let T' be the minimal rooted subtree of
T with leaf set L. We define the leaf induced subtree
T[L] of T on leaf set L to be the tree obtained from T' by
successively removing each non-root node of degree two
and adjoining its two neighbors. The symmetric difference
of two sets A and B, denoted by A Δ B, is the set
(A\B) ∪ (B\A). A profile P is a tuple of trees (T_1, ...,
T_k).

The RF Supertree Problem
Given a profile P, we define a supertree on P to be a
fully binary tree T* where L(T*) = ∪_{i=1}^{k} L(T_i).
Definition 1 (RF Distance). Given a profile P = (T_1, ...,
T_k) and a supertree T* on P, we define the RF distance as
follows:

1. For any T_i, where 1 ≤ i ≤ k,
RF(T_i, T*) = |ℋ(T_i) Δ ℋ(T*[L(T_i)])|.
2. RF(P, T*) = Σ_{i=1}^{k} RF(T_i, T*).
3. Let 𝒯 be the set of supertrees on P; then
RF(P) = min_{T* ∈ 𝒯} RF(P, T*).

Remark: Traditionally, the value of the RF distance, as
computed above, is normalized by multiplying by 1/2.
However, this does not affect the definition or
computation of RF supertrees, and therefore, we do not
normalize the RF distance.
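The definition of the RF distance can be illustrated with a small sketch. It assumes a hypothetical representation in which a leaf is a string and an internal node is a tuple of children; the function names are illustrative, and the code favors clarity over the linear running time discussed in the paper. As in the definition, only clusters of non-root internal nodes are counted, and the value is not normalized.

```python
def leaf_set(t):
    """Set of leaf labels of a nested-tuple tree."""
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaf_set(c) for c in t))


def clusters(t):
    """Clusters (leaf sets) of all non-root internal nodes of t."""
    out = set()

    def visit(node, is_root):
        if isinstance(node, str):
            return frozenset([node])
        ls = frozenset().union(*(visit(c, False) for c in node))
        if not is_root:
            out.add(ls)
        return ls

    visit(t, True)
    return out


def restrict(t, keep):
    """Leaf-induced subtree of t on the leaf set `keep` (None if empty)."""
    if isinstance(t, str):
        return t if t in keep else None
    kids = [r for r in (restrict(c, keep) for c in t) if r is not None]
    if not kids:
        return None
    # A node left with a single child is suppressed.
    return kids[0] if len(kids) == 1 else tuple(kids)


def rf(input_tree, supertree):
    """Unnormalized RF distance between an input tree and a supertree."""
    restricted = restrict(supertree, leaf_set(input_tree))
    return len(clusters(input_tree) ^ clusters(restricted))


def rf_profile(profile, supertree):
    """Sum of the RF distances from the supertree to every input tree."""
    return sum(rf(t, supertree) for t in profile)


# Toy data (assumption): one supertree and a profile of two input trees.
supertree = ((("a", "b"), "c"), ("d", "e"))
profile = [(("a", "b"), "c"), (("a", "d"), "e")]
total = rf_profile(profile, supertree)
```

In this toy profile the first input tree agrees with the supertree, while the second contributes one cluster that the restricted supertree lacks and misses one cluster that it has.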
Problem 1 (RF Supertree).

Instance: A profile P.
Find: A supertree T_opt on P such that RF(P, T_opt) =
RF(P).

Recall that the RF Supertree problem is NP-hard [27].

Local Search Problems
Here we first provide definitions for the re-rooting
operation (denoted RR) and the TBR[22] and SPR[35]
edit operations and then formulate the related local
search problems that were motivated in the
introduction.
Definition 2 (RR operation). Let T be a tree and x ∈
V(T). RR(T, x) is defined to be the tree T, if x = rt(T) or
x ∈ Ch(rt(T)). Otherwise, RR(T, x) is the tree that is
obtained from T by (i) suppressing rt(T), and (ii) subdividing
the edge {pa(x), x} by a new root node. We define
the following extension: RR(T) = ∪_{x ∈ V(T)} {RR(T, x)}.
For technical reasons, before we can define the TBR
operation, we need the following definition.
Definition 3 (Planted tree). Given a tree T, the
planted tree Φ(T) is the tree obtained by adding a root
edge {p, rt(T)}, where p ∉ V(T), to T.
Definition 4 (TBR operation). (See Fig. 1) Let T be a
tree, e = (u, v) ∈ E(T), where u = pa(v), and X, Y be the
connected components that are obtained by removing
edge e from T where v ∈ X and u ∈ Y. We define
TBR_T(v, x, y) for x ∈ X and y ∈ Y to be the tree that is
obtained from Φ(T) by first removing edge e, then replacing
the component X by RR(X, x), and then adjoining a
new edge f between x' = rt(RR(X, x)) and Y as follows:

1. Create a new node y' that subdivides the edge (pa
(y), y).
2. Adjoin the edge between nodes x' and y'.
3. Suppress the node u, and rename x' as v and y' as u.


4. Contract the root edge.

Notation. We define the following:

1. TBR_T(v, x) = ∪_{y ∈ Y} {TBR_T(v, x, y)}
2. TBR_T(v) = ∪_{x ∈ X} TBR_T(v, x)
3. TBR_T = ∪_{(u, v) ∈ E(T)} TBR_T(v)

Definition 5 (SPR operation). Let T be a tree, e =
(u, v) ∈ E(T), where u = pa(v), and X, Y be the connected
components that are obtained by removing
edge e from T where v ∈ X and u ∈ Y. We define
SPR_T(v, y), for y ∈ Y, to be the tree TBR_T(v, v, y).
We say that the tree SPR_T(v, y) is obtained from T
by a subtree prune and regraft (SPR) operation that
prunes subtree T_v and regrafts it above node y.

Notation. We define the following:

1. SPR_T(v) = ∪_{y ∈ Y} {SPR_T(v, y)}
2. SPR_T = ∪_{(u, v) ∈ E(T)} SPR_T(v)

Note that an SPR operation for a given tree T can be
briefly described through the following four steps: (i)
prune some subtree P from T, (ii) add a root edge to
the remaining tree S, (iii) regraft P into an edge of the
remaining tree S, and (iv) contract the root edge.
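The four steps can be turned into a small enumeration of the SPR neighborhood. The sketch below is illustrative only: it assumes rooted binary trees stored as nested 2-tuples with string leaves, identifies nodes by hypothetical child-index paths, and enumerates neighbors naively rather than with the fast algorithms developed in this paper. Regrafting at the empty path corresponds to regrafting into the root edge.

```python
def canon(t):
    """Canonical form in which the order of siblings does not matter."""
    if isinstance(t, str):
        return t
    return tuple(sorted((canon(c) for c in t), key=repr))


def paths(t, path=()):
    """Yield the child-index path of every node of t (the root has path ())."""
    yield path
    if isinstance(t, tuple):
        for i, child in enumerate(t):
            yield from paths(child, path + (i,))


def prune(t, path):
    """Cut the subtree at `path`; return (remaining tree, pruned subtree)."""
    i = path[0]
    if len(path) == 1:
        # The sibling takes the place of the suppressed parent.
        return t[1 - i], t[i]
    rest, pruned = prune(t[i], path[1:])
    return ((rest, t[1]) if i == 0 else (t[0], rest)), pruned


def regraft(t, path, pruned):
    """Attach `pruned` on the edge above the node of t found at `path`."""
    if not path:
        return (t, pruned)
    i = path[0]
    sub = regraft(t[i], path[1:], pruned)
    return (sub, t[1]) if i == 0 else (t[0], sub)


def spr_neighbors(t):
    """All trees reachable from t by one SPR operation (t itself included)."""
    out = set()
    for p in paths(t):
        if not p:
            continue  # the root cannot be pruned
        rest, pruned = prune(t, p)
        for q in paths(rest):
            out.add(canon(regraft(rest, q, pruned)))
    return out


# Toy usage (assumption): a three-leaf tree.
tree = (("a", "b"), "c")
neighborhood = spr_neighbors(tree)
```

For three leaves the neighborhood contains every rooted binary topology, including the starting tree, which is recovered whenever the pruned subtree is regrafted next to its former sibling.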
We now define the relevant local search problems
based on the TBR and SPR operations.
Problem 2 (TBR-Scoring (TBR-S)). Given instance
(P, T), where P is the profile (T_1, ..., T_k) and T is a
supertree on P, find a tree T* ∈ TBR_T such that
RF(P, T*) = min_{T' ∈ TBR_T} RF(P, T').
Problem 3 (TBR-Restricted Scoring (TBR-RS)). Given
instance (P, T, v), where P is the profile (T_1, ..., T_k), T
is a supertree on P, and v is a non-root node in
V(T), find a tree T* ∈ TBR_T(v) such that
RF(P, T*) = min_{T' ∈ TBR_T(v)} RF(P, T').
The problems SPR-Scoring (SPR-S) and SPR-
Restricted Scoring (SPR-RS) are defined analogously to
the problems TBR-S and TBR-RS respectively.


Figure 1 TBR Operation. Example depicting a TBR operation which transforms tree S into tree S' = TBR_S(v, x, y).




Throughout the remainder of this manuscript, k is the
number of trees in the profile P, T denotes a supertree
on P, and n is the number of leaves in T. The following
observation follows from Definition 4.
Observation 1. The TBR-S problem on instance (P,
T) can be solved by solving the TBR-RS problem |E(T)|
times.
We show how to solve the TBR-S problem on the
instance (P, T) in O(kn^2) time. Since SPR_T ⊆ TBR_T,
this also implies an O(kn^2) solution for the SPR-S problem.
This gives a speed-up of Θ(n^2) and Θ(n) over the
best known (naive) algorithms for the TBR-S and SPR-S
problems respectively.
In particular, we first show that any instance of the
TBR-RS problem can be decomposed into an instance of
an SPR-RS problem, and an instance of a Rooting pro-
blem (defined in the next section). We show how to
solve both these problems in O(kn) time, yielding an
O(kn) time solution for the TBR-RS problem. This immediately
implies an O(kn^2) time algorithm for the TBR-S
problem (see Observation 1).
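As a worked check of this step, recall that the supertree T is fully binary with n leaves, so it has 2n - 1 nodes and 2n - 2 edges. Combining that count with Observation 1 gives the stated bound:

```latex
|E(T)| = 2n - 2
\quad\Longrightarrow\quad
\underbrace{(2n - 2)}_{\text{TBR-RS instances}}
\cdot
\underbrace{O(kn)}_{\text{cost per instance}}
= O(kn^{2}).
```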
Note that the size of the set TBR_T is Θ(n^3). Thus, for
each tree in the input profile the time complexity of
computing and enumerating the RF distances of all trees
in TBR_T is Ω(n^3). However, to solve the TBR-S problem
one only needs to find a tree with the minimum RF distance.
This lets us solve the TBR-S problem in time that
is sub-linear in the size of TBR_T. In fact, after the initial
O(kn^2) preprocessing step, our algorithm can output the
RF distance of any tree in TBR_T in O(1) time.

Structural Properties
Throughout this section, we limit our attention to one
tree S from the profile P. We show how to solve the
TBR-RS problem for the instance ((S), T, v) for some
non-root node v ∈ V(T) in O(n) time. Based on this
solution, it is straightforward to solve the TBR-RS problem
on the instance (P, T, v) within O(kn) time as
well. For clarity, we will also assume that L(S) = L(T).
In general, if L(S) ⊂ L(T) then we can simply set T to
be T[L(S)]. This takes O(n) time and, consequently,
does not affect the time complexity of our algorithm.
Our algorithm makes use of the LCA mapping from S
to T. This mapping is defined as follows.
Definition 6 (LCA Mapping). Given two trees T' and
T such that L(T') ⊆ L(T), the LCA mapping
M_{T',T}: V(T') → V(T) is the mapping
M_{T',T}(u) = lca_T(L(T'_u)).
Notation. We define a boolean function f_T: I(S) → {0, 1}
such that f_T(u) = 1 if there exists a node v ∈ I(T) such
that C_S(u) = C_T(v), and f_T(u) = 0 otherwise. Thus, f_T(u)
= 1 if and only if the cluster C_S(u) exists in the tree T as
well. Additionally, we define F_T = {u ∈ I(S): f_T(u) = 0};
that is, F_T is the set of all nodes u ∈ I(S) such that the
cluster C_S(u) does not exist in the tree T.


The following lemma associates the value RF(S, T)
with the cardinality of the set F_T.
Lemma 1. RF(S, T) = |I(T)| - |I(S)| + 2·|F_T|.
Proof. Let G_T denote the set {u ∈ I(S): f_T(u) = 1}. By
the definition of RF(S, T), we must have RF(S, T) =
|I(T)| + |I(S)| - 2·|G_T|. And hence, since |F_T| + |G_T| =
|I(S)|, we get RF(S, T) = |I(T)| - |I(S)| + 2·|F_T|. □
Lemma 2. For any u ∈ I(S), f_T(u) = 1 if and only if
|C_S(u)| = |C_T(M_{S,T}(u))|.
Proof. If |C_S(u)| = |C_T(M_{S,T}(u))| then we must have
C_S(u) = C_T(M_{S,T}(u)) and, consequently, f_T(u) = 1. In
the other direction, if |C_S(u)| ≠ |C_T(M_{S,T}(u))|, then
we must have C_S(u) ⊂ C_T(M_{S,T}(u)) and, consequently,
f_T(u) = 0. □
The LCA mapping from S to T can be computed in O(n)
time [39], and consequently, by Lemmas 1 and 2, we
can compute the RF distance between S and T in O(n)
time as well (other O(n)-time algorithms for calculating
the RF distance are presented in [30,31]). Moreover,
Lemma 1 implies that in order to find a tree T* ∈ TBR_T(v)
such that RF(P, T*) = min_{T' ∈ TBR_T(v)} RF(P, T'), it is
sufficient to find a tree T* ∈ TBR_T(v) for which
|F_{T*}| = min_{T' ∈ TBR_T(v)} |F_{T'}|.
Remark: An implicit assumption here is that the leaves
of both trees are labeled by integers {1, ..., n}. If the leaf
labels are arbitrary, then we require an additional O(kn
log n)-time preprocessing step to relabel the leaves of
the trees in the given profile. Note, however, that this
additional step does not add to the overall time com-
plexity of solving the TBR-S or SPR-S problems.
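Lemmas 1 and 2 can be checked on small examples. The sketch below is an illustration under stated assumptions: trees are nested tuples with string leaves and identical leaf sets, the LCA mapping is found by a naive scan (not the linear-time method of [39]), and all function names are hypothetical. It counts the clusters of S that are missing from T using only the cluster-size test of Lemma 2 and then applies the identity of Lemma 1.

```python
def node_leaf_sets(t):
    """List of (leaf set, is_non_root_internal) for every node of t."""
    out = []

    def visit(node, is_root):
        if isinstance(node, str):
            ls, internal = frozenset([node]), False
        else:
            ls = frozenset().union(*(visit(c, False) for c in node))
            internal = not is_root
        out.append((ls, internal))
        return ls

    visit(t, True)
    return out


def rf_via_lemmas(s, t):
    """RF(S, T) computed as |I(T)| - |I(S)| + 2 * (number of missing clusters)."""
    t_nodes = node_leaf_sets(t)
    i_s = [ls for ls, internal in node_leaf_sets(s) if internal]
    i_t = [ls for ls, internal in t_nodes if internal]
    missing = 0
    for cluster in i_s:
        # LCA mapping: the smallest node of T whose leaf set contains the cluster.
        image = min((ls for ls, _ in t_nodes if cluster <= ls), key=len)
        if len(image) != len(cluster):  # size test of Lemma 2
            missing += 1
    return len(i_t) - len(i_s) + 2 * missing


def rf_direct(s, t):
    """RF(S, T) as the size of the symmetric difference of the cluster sets."""
    i_s = {ls for ls, internal in node_leaf_sets(s) if internal}
    i_t = {ls for ls, internal in node_leaf_sets(t) if internal}
    return len(i_s ^ i_t)


# Toy usage (assumption): two trees on the same five leaves.
S = ((("a", "b"), "c"), ("d", "e"))
T = ((("a", "c"), "b"), ("d", "e"))
value = rf_via_lemmas(S, T)
```

The second test case in the accompanying checks uses a non-binary input tree, which the identity also covers because only the supertree is required to be binary.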
We now show that the TBR-RS problem can be solved
by solving two smaller problems separately and combin-
ing their solutions.
As before, we limit our attention to one tree S from
the profile P. Given the TBR-RS instance ((S), T, v), we
define a bipartition {X, X̄} of I(S), where X = {u ∈ I(S):
M_{S,T}(u) ∈ V(T_v)}.
Lemma 3. If u ∈ X, then f_{T'}(u) = f_T(u) for all T' ∈
TBR_T(v, v). If u ∈ X̄ and y denotes the sibling of v, then
f_{T'}(u) = f_T(u), where T' = TBR_T(v, x, y) for any x ∈
V(T_v).
Proof. Consider the case when u ∈ X. Let T' be any
tree in TBR_T(v, v) and let node y ∈ V(T) be such that
T' = TBR_T(v, v, y). Thus, for any node w ∈ V(T_v), the
subtrees T_w and T'_w must be identical. Since u ∈ X, we
must have M_{S,T}(u) ∈ V(T_v) and, consequently,
T_{M_{S,T}(u)} = T'_{M_{S,T'}(u)}. Lemma 2 now implies that
f_{T'}(u) = f_T(u).
Now consider the case when u ∈ X̄. Node y denotes
the sibling of v in tree T and let T' = TBR_T(v, x, y), for
some x ∈ V(T_v). Thus, for any node w ∈ V(T)\V(T_v),
we must have L(T_w) = L(T'_w). Moreover, the leaf sets
of the two subtrees rooted at the children of w in T
must be identical to the leaf sets of the two subtrees
rooted at the children of w in T'. This implies that if
M_{S,T}(u) = w, then M_{S,T'}(u) = w as well. By Lemma 2
we must therefore have f_{T'}(u) = f_T(u). □
Lemma 3 implies that a tree in TBRT (v) with smallest
RF distance can be obtained by optimizing the rooting for
the pruned subtree, and optimizing the regraft location
separately. This allows us to obtain a tree in TBRT (v) with
smallest RF distance by evaluating only O(n) trees. Con-
trast this with the naive approach to finding a tree in
TBRT (v) with smallest total distance, which is to evaluate
all trees obtained by rerooting the pruned subtree in all
possible ways, and, for each rerooting, regrafting the sub-
tree in all possible locations. Since there are O(n) ways to
reroot the pruned subtree, and O(n) ways to regraft, this
would require evaluating O(n2) trees. It is interesting to
note that this ability to decompose the TBR-RS problem
into two simpler problems is not unique to the context of
RF supertrees alone. For example, it has been observed
that a similar decomposition can be achieved in the con-
text of the gene duplication problem [37].
Thus, to solve the TBR-RS problem, we must find (i) a
rerooting T' of the subtree T_v for which |F_{T'}| is minimized,
and (ii) a regraft location y for T_v which minimizes
|F_{SPR_T(v, y)}|. Observe that the problem in part (ii)
is simply the SPR-RS problem on the input instance
((S), T, v). For part (i), consider the following problem
statement.
Problem 4 (Rooting). Given instance (P, T, v), where
P is the profile (T_1, ..., T_k), T is a supertree on P, and
v is a non-root node in V(T), find a node x ∈ V(T_v) for
which RF(P, TBR_T(v, x, y)) is minimum, where y
denotes the sibling of v in T.
Note that the problem in part (i) is the Rooting pro-
blem on the input instance ((S), T, v). We show how to
solve both the Rooting and the SPR-RS problems in O
(n) time on instance ((S), T, v). As seen above, based on
Lemma 3, this immediately implies that the TBR-RS
problem for a profile consisting of a single tree can be
solved in O(n) time. To solve the TBR-RS problem on
instance (P, T, v), we simply solve the Rooting and
SPR-RS problems separately on the input instance (P,
T, v), which takes O(kn) time (see Theorems 3 and 4).
We thus have the following two theorems.
Theorem 1. The TBR-RS problem can be solved in
O(kn) time.
Theorem 2. The TBR-S problem can be solved in
O(kn^2) time.

Solving the Rooting Problem
To solve the Rooting problem on instance ((S), T, v), we
rely on an efficient algorithm for computing the value of
f_{T'}(u) for any T' ∈ RR(T_v) and any u ∈ I(S). This algorithm
relies on the following five lemmas. Let a denote
the node M_{S,T}(u), y denote the sibling of v in T, and
T' = TBR_T(v, x, y) for x ∈ V(T_v). Depending on a and
f_T(u) there are five possible cases: (i) a ∉ V(T_v), (ii) a =
rt(T_v) and f_T(u) = 1, (iii) a = rt(T_v) and f_T(u) = 0, (iv) a
∈ V(T_v)\rt(T_v) and f_T(u) = 1, and (v) a ∈ V(T_v)\rt(T_v)
and f_T(u) = 0. Lemmas 4 through 8 characterize the
value f_{T'}(u) for each of these five cases respectively.
Lemma 4. If a ∉ V(T_v), then f_{T'}(u) = f_T(u) for any x ∈
V(T_v).
Proof. Follows directly from Lemma 3. □
Lemma 5. If a = rt(T_v) and f_T(u) = 1, then f_{T'}(u) = 1
for all x ∈ V(T_v).
Proof. Since we have a = rt(T_v) and f_T(u) = 1, by
Lemma 2 we must have L(S_u) = L(T_v). Thus, for any x
∈ V(T_v), M_{S,T'}(u) must be the root of the subtree
RR(T_v, x). The lemma follows. □
Lemma 6. Let L denote the set L(T_v)\L(S_u), and let
b = lca_T(L). If a = rt(T_v) and f_T(u) = 0, then,

1. for x ≰_{T_v} b, f_{T'}(u) = 0, and,
2. for x ≤_{T_v} b, f_{T'}(u) = 1 if and only if |L| = |L(T_b)|.

Proof. Since a = rt(T_v) and f_T(u) = 0, by Lemma 2 we
must have L(S_u) ⊂ L(T_v). We analyze each part of the
lemma separately.

1. x ≰_{T_v} b: For this case to be valid, we must have
b ≠ rt(T_v). Let b' = pa_{T'}(b) in this case. Moreover,
L(T'_{b'}) ∩ L(S_u) ≠ ∅. Therefore, we must have
L(S_u) ≠ L(T'_{M_{S,T'}(u)}) in this case, and, consequently,
Lemma 2 implies that f_{T'}(u) = 0.
2. x ≤_{T_v} b:
(a) |L| = |L(T_b)|: In this case we must have b ≠
rt(T_v). Therefore, let b' denote the parent of b in
tree T_v. Now consider the tree T'. The set L(T'_{b'})
must be identical to L(S_u). Hence, f_{T'}(u) = 1 in
this case.
(b) |L| ≠ |L(T_b)|: We claim that there does not
exist any edge (pa(w), w) ∈ E(T_v) such that L(T_w)
is either L(S_u) or L. Let us suppose, for the sake
of contradiction, that such an edge exists. If
L(T_w) = L(S_u) then we must have a = w, which is a
contradiction since a = rt(T_v). If L(T_w) = L then
we must have b = w, and, consequently, |L| =
|L(T_b)|, which is, again, a contradiction. Thus, such
an edge (pa(w), w) ∈ E(T_v) cannot exist. Hence,
we must have f_{T'}(u) = 0 for every x ∈ V(T_v) in
this case.

The lemma follows. □
Lemma 7. If a ∈ V(T_v)\rt(T_v) and f_T(u) = 1, then
f_{T'}(u) = 0 if and only if x <_{T_v} a.
Proof. By Lemma 2 we must have L(S_u) = L(T_a). We
have two cases:

1. x ≮_{T_v} a: In this case we must have M_{S,T'}(u) = a,
and L(T_a) = L(T'_a). Thus, L(S_u) = L(T'_a) and
hence, f_{T'}(u) = 1.
2. x <_{T_v} a: In this case M_{S,T'}(u) must be the root
of the subtree RR(T_v, x). Since L(RR(T_v, x)) = L(T_v),
and L(S_u) ≠ L(T_v), Lemma 2 implies that f_{T'}(u) = 0.

The lemma follows. □
Lemma 8. If a ∈ V(T_v)\rt(T_v) and f_T(u) = 0, then
f_{T'}(u) = 0 for all x ∈ V(T_v).
Proof. By Lemma 2 we must have L(S_u) ≠ L(T_a). We
have two possible cases:

1. x ≮_{T_v} a: In this case we must have M_{S,T'}(u) = a,
and L(T_a) = L(T'_a). Thus, L(S_u) ≠ L(T'_a) and
hence, f_{T'}(u) = 0.
2. x <_{T_v} a: In this case M_{S,T'}(u) must be the root
of the subtree RR(T_v, x). Since L(RR(T_v, x)) = L(T_v),
and L(S_u) ≠ L(T_v), Lemma 2 implies that f_{T'}(u) = 0.

The lemma follows. □
The Algorithm. For any x ∈ V(T_v) let A(x) denote
the cardinality of the set
{u ∈ I(S): f_T(u) = 0, but f_{T'}(u) = 1}, and B(x) the
cardinality of the set
{u ∈ I(S): f_T(u) = 1, but f_{T'}(u) = 0}, where T' =
TBR_T(v, x, y).
By definition, to solve the Rooting problem we must
find a node x ∈ V(T_v) for which A(x) - B(x) is maximized.
Our algorithm computes, at each node x ∈
V(T_v), the values A(x) and B(x).
In a preprocessing step, our algorithm computes the
mapping M_{S,T} as well as the size of each cluster in S
and T, and creates and initializes (to 0) two counters
α(x) and β(x) at each node x ∈ V(T_v). This takes O(n)
time. When the algorithm terminates, the values α(x)
and β(x) at any x ∈ V(T_v) will be the values A(x) and
B(x).
Recall that, given u ∈ I(S), a denotes the node
M_{S,T}(u). Thus, any given u ∈ I(S) must satisfy the
precondition (given in terms of a) of exactly one of
Lemmas 4 through 8. Moreover, the precondition of each of
these lemmas can be checked in O(1) time.
The algorithm then traverses through S and considers
each node u e I(S). There are three cases:

1. If u satisfies the preconditions of Lemmas 4, 5, or
8 then we must have f_{T'}(u) = f_T(u). Consequently,
we do nothing in this case.
2. If u satisfies the precondition of Lemma 7, then
we increment the value of β(x) at each node x ∈
V(T_a)\{a} (where a is as in the statement of Lemma
7). To do this efficiently we can simply increment a
counter at node a such that, after all u ∈ I(S) have
been considered, a single pre-order traversal of T_v
can be used to compute the correct values of β(x) at
each x ∈ V(T_v).
3. If u satisfies the precondition of Lemma 6, then we
proceed as follows: Let a and L be as in the statement
of Lemma 6. According to the lemma, if we can find
a node b ∈ V(T_v) such that b = lca_T(L) and |L(T_b)|
= |L|, then we increment the value of α(x) at each
node x ∈ V(T_b); otherwise, if such a b does not exist,
we do nothing. As before, to do this efficiently, we
only increment a single counter at node b such that,
after all u ∈ I(S) have been considered, a pre-order
traversal of T_v suffices to compute the correct values
of α(x) at each x ∈ V(T_v). In order to prove the O(n)
run-time for this algorithm we will now explain how
to precompute such a corresponding node b (if it
exists), for each u ∈ I(S) satisfying the precondition
of Lemma 6, within O(n) time. Note that any edge in
a tree bi-partitions its leaf set. Construct the tree S' =
S[L(T_v)]. Observe that, given any candidate u, the
corresponding node b exists if and only if the partition
of L(S') induced by the edge (u, pa(u)) ∈ E(S') is
also induced by some edge, e, in the tree T_v. If such
an e exists, then b must be that node on e which is
farther away from the root, i.e. the edge e must be the
edge (b, pa(b)) in T_v. This edge e (or its absence) can
be precomputed, for all candidate u, as follows: Compute
the strict consensus of the unrooted variants of
the trees S' and T_v. Every edge in this strict consensus
corresponds to an edge in S' and an edge in T_v that
induce the same bi-partitions in the two trees.

Thus, for all candidate u that lie on such an edge, the
corresponding node b can be inferred in O(1) time (by
using the association between the edges of the strict
consensus and the edges of S' and T_v), and for all candidate
u that do not lie on such an edge, we know that
the corresponding node b does not exist. This strict
consensus of the unrooted variants of S' and T_v can be
precomputed within O(n) time by using the algorithm
of Day [30].
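The counter trick used in steps 2 and 3 (record one increment at a node, then distribute all increments with a single pre-order traversal) can be sketched as follows. The data layout is an assumption made for illustration: the subtree is given as a hypothetical `children` dictionary, `marks[x]` is the number of increments recorded at x, and `include_self` selects between updating all of V(T_x) (as in step 3) and V(T_x)\{x} (as in step 2).

```python
def push_down(children, root, marks, include_self):
    """Distribute per-node increments to whole subtrees in one pre-order pass.

    children     -- dict mapping a node to the list of its children
    root         -- root of the subtree to traverse
    marks        -- dict mapping a node to the increments recorded at it
    include_self -- if True a mark at x counts for x and its descendants,
                    otherwise only for the proper descendants of x
    """
    totals = {}
    stack = [(root, 0)]
    while stack:
        node, inherited = stack.pop()
        own = marks.get(node, 0)
        totals[node] = inherited + (own if include_self else 0)
        for child in children.get(node, ()):
            # Descendants always receive the marks of all their ancestors.
            stack.append((child, inherited + own))
    return totals


# Toy usage (assumption): a five-node subtree rooted at "v".
children = {"v": ["p", "q"], "p": ["r", "s"]}
alpha = push_down(children, "v", {"p": 2}, include_self=True)
beta = push_down(children, "v", {"p": 1, "v": 1}, include_self=False)
gain = {x: alpha[x] - beta[x] for x in alpha}
best = max(gain, key=gain.get)
```

Each node is visited once, so the traversal takes time linear in the size of the subtree no matter how many increments were recorded, which is what keeps the per-tree cost at O(n).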
Hence, the Rooting problem for a profile consisting of
a single tree can be solved in O(n) time; yielding the fol-
lowing theorem.
Theorem 3. The Rooting problem can be solved in
O(kn) time.

Solving the SPR-RS Problem
We will show how to solve the SPR-RS problem on
instance ((S), T, v) in O(n) time. Consider the tree R =
SPR_T(v, rt(T)). Observe that, since SPR_R(v) = SPR_T(v),
solving the SPR-RS problem on instance ((S), T, v) is
equivalent to solving it on the instance ((S), R, v). Thus,
in the remainder of this section, we will work with tree
R instead of tree T. The following four lemmas let us
efficiently infer, for any u ∈ I(S), whether f_{T'}(u) = 1 or
f_{T'}(u) = 0, for any given T' ∈ SPR_R(v).
For brevity, let a denote the node M_{S,R}(u), and let Q
denote the set V(R)\(V(R_v) ∪ {rt(R)}). Let T' = SPR_R(v,
x), for any x ∈ Q.
Depending on a and f_R(u) there are four possible
cases: (i) a ∈ V(R_v), (ii) a ∈ Q and f_R(u) = 1, (iii) a ∈ Q
and f_R(u) = 0, and (iv) a = rt(R). Lemmas 9 through 12
characterize the value f_{T'}(u) for each of these four cases
respectively.
Lemma 9. If a ∈ V(R_v), then f_{T'}(u) = f_R(u) for any
x ∈ Q.
Proof. Observe that TBR_R(v, v) = SPR_R(v). Lemma 3
now immediately completes the proof. □
Lemma 10. If a ∈ Q and f_R(u) = 1, then,

1. f_{T'}(u) = 0, for x <_R a, and
2. f_{T'}(u) = 1, otherwise.

Proof. Since f_R(u) = 1, Lemma 2 implies that |C_S(u)| =
|C_R(a)|. Let T' = SPR_R(v, x); we now have two cases.

1. x <_R a: In this case M_{S,T'}(u) = a, and since
|C_S(u)| = |C_R(a)| < |C_{T'}(a)|, we must have f_{T'}(u) = 0
(by Lemma 2).
2. x ≮_R a: In this case M_{S,T'}(u) = a, and since
|C_S(u)| = |C_R(a)| = |C_{T'}(a)|, we must have f_{T'}(u) = 1
(by Lemma 2).

The lemma follows. □
Lemma 11. If a ∈ Q and f_R(u) = 0, then f_{T'}(u) = 0 for
any x ∈ Q.
Proof. Since f_R(u) = 0, Lemma 2 implies that |C_S(u)| ≠
|C_R(a)|. Thus, by the definition of LCA mapping,
|C_S(u)| < |C_R(a)|. Let T' = SPR_R(v, x); we now have two
cases.

1. x <_R a: In this case M_{S,T'}(u) = a, and since
|C_S(u)| < |C_R(a)| < |C_{T'}(a)|, we must have f_{T'}(u) = 0
(by Lemma 2).
2. x ≮_R a: In this case M_{S,T'}(u) = a, and since
|C_S(u)| < |C_R(a)| = |C_{T'}(a)|, we must have f_{T'}(u) = 0
(by Lemma 2).

The lemma follows. □
For the next lemma, let S' be the tree obtained from S
by suppressing all nodes s for which M_{S,R}(s) ∈ V(R_v).
Lemma 12. If a = rt(R) and b = M_{S',R}(u), then f_{T'}(u) =
1 if and only if x <_R b and |L(R_b)| + |L(R_v)| = |L(S_u)|.

Proof. First, observe that, since a = rt(R), the mapping
M_{S',R}(u) is well defined. Second, since b = M_{S',R}(u), we
must have L(S'_u) ⊆ L(R_b), which implies that L(S_u) ⊆
L(R_v) ∪ L(R_b). We now have the following three cases:

1. x ≮_R b: In this case we must have M_{S,T'}(u) =
lca_{T'}(x, b). By Lemma 2 we know that f_{T'}(u) = 1
only if |C_S(u)| = |C_{T'}(M_{S,T'}(u))|. However, since
we have L(S_u) ⊆ L(R_v) ∪ L(R_b), and x ≮_R b, we
must have |C_S(u)| < |C_{T'}(M_{S,T'}(u))|; and hence,
f_{T'}(u) = 0.
2. x <_R b and |L(R_b)| + |L(R_v)| ≠ |L(S_u)|: In this
case we must have M_{S,T'}(u) = b. Since |L(R_b)| +
|L(R_v)| ≠ |L(S_u)|, we must have L(S_u) ⊂ L(R_v) ∪
L(R_b), which implies that |C_S(u)| < |C_{T'}(M_{S,T'}(u))|.
Thus, by Lemma 2, we must have f_{T'}(u) = 0.
3. x <_R b and |L(R_b)| + |L(R_v)| = |L(S_u)|: In this
case we must have M_{S,T'}(u) = b. Moreover, since
|L(R_b)| + |L(R_v)| = |L(S_u)|, we must have |C_S(u)| =
|C_{T'}(M_{S,T'}(u))|. Thus, by Lemma 2, we must have
f_{T'}(u) = 1.

The lemma follows. □
The Algorithm. Note that SPR_T(v) = SPR_R(v) = ∪_{x ∈ Q}
SPR_R(v, x). For any x ∈ Q, let A(x) = |{u ∈ I(S): f_R(u) = 0,
but f_{T'}(u) = 1}|, and B(x) = |{u ∈ I(S): f_R(u) = 1, but
f_{T'}(u) = 0}|, where T' = SPR_R(v, x). By definition, to solve
the SPR-RS problem on instance ((S), T, v) we must find
a node x ∈ Q for which A(x) - B(x) is maximized. Our
algorithm computes, at each node x ∈ Q, the values A(x)
and B(x).
In a preprocessing step, our algorithm first constructs
the tree R, computes the mapping M_{S,R} as well as the
size of each cluster in S and R, and creates and initializes
(to 0) two counters α(x) and β(x) at each node x ∈
Q. This takes a total of O(n) time. When the algorithm
terminates, the values α(x) and β(x) at any x ∈ Q will
be the values A(x) and B(x).
Recall that, given u ∈ I(S), a denotes the node
M_{S,R}(u). Thus, any given u ∈ I(S) must satisfy the
precondition (given in terms of a) of exactly one of
Lemmas 9 through 12. Moreover, the precondition
of each of these lemmas can be checked in O(1)
time.
The algorithm then traverses through S and considers
each node u e I(S). There are three cases:

1. If u satisfies the preconditions of Lemmas 9 or 11
then we must have f_{T'}(u) = f_R(u). Consequently, we
do nothing in this case.
2. If u satisfies the precondition of Lemma 10, then
we increment the value of β(x) at each node x ∈
V(R_a)\{a} (where a is as in the statement of Lemma
10). This can be done efficiently as shown in part (2)
of the algorithm for the Rooting problem.
3. If u satisfies the precondition of Lemma 12, and if
|L(R_b)| + |L(R_v)| = |L(S_u)|, then we increment the
value of α(x) at each node x ∈ V(R_b)\{b} (where a
and b are as in the statement of Lemma 12).

Again, to do this efficiently, we increment a counter at
node b, and perform a subsequent pre-order traversal.
Note also that the mapping M_{S',R} can be computed in
O(n) time in the preprocessing step, and hence the node
b can be inferred in O(1) time. The condition |L(R_b)| +
|L(R_v)| = |L(S_u)| is also verifiable in O(1) time.
Hence, the SPR-RS problem for a profile consisting of
a single tree can be solved in O(n) time; yielding the fol-
lowing theorem.
Theorem 4. The SPR-RS problem can be solved in
O(kn) time.
Remark. To improve the performance of local search
heuristics in phylogeny construction, the starting tree
for the first local search step is often constructed using
a greedy 'stepwise addition' procedure. This greedy pro-
cedure builds a starting species tree step-by-step by add-
ing one taxon at a time at its locally optimal position. In
the context of RF supertrees, our algorithm for the
SPR-RS problem also yields a Θ(n) speed-up over naive
algorithms for this greedy procedure.
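For illustration, the greedy stepwise addition procedure itself can be sketched in its naive form, in which every insertion position is scored from scratch. The representation (binary nested tuples with string leaves), the function names, and the toy cost function are assumptions made for this example; in the RF setting the cost would be the total RF distance to the input trees restricted to the taxa added so far.

```python
def leaves(tree):
    """Leaf set of a binary nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def clades(tree):
    """Leaf sets of all internal nodes of the tree."""
    if isinstance(tree, str):
        return set()
    return {leaves(tree)} | clades(tree[0]) | clades(tree[1])


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` on one edge of `tree`."""
    yield (tree, leaf)  # attach above the current node
    if isinstance(tree, tuple):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def stepwise_addition(taxa, cost):
    """Add one taxon at a time at its locally optimal position."""
    tree = (taxa[0], taxa[1])
    for leaf in taxa[2:]:
        tree = min(insertions(tree, leaf), key=cost)
    return tree


# Toy cost (assumption): number of target clades, restricted to the taxa
# present so far, that are missing from the candidate tree.
target = [frozenset("ab"), frozenset("cd")]


def toy_cost(tree):
    present, found = leaves(tree), clades(tree)
    return sum(1 for c in target
               if len(c & present) >= 2 and (c & present) not in found)


start_tree = stepwise_addition(["a", "c", "b", "d"], toy_cost)
```

Each addition step scans a linear number of candidate positions, so a scoring routine that handles all regraft positions of one subtree together, as the SPR-RS algorithm does, removes a linear factor from this procedure.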

Experimental Evaluation
In order to evaluate the performance of the RF super-
tree method, we implemented an RF heuristic based on
the SPR local search algorithm. We focused on the SPR
local search because it is faster and simpler to imple-
ment than TBR, and in analyses of MRF and triplet
supertrees, the performance of SPR and TBR was very
similar [22,38]. We compared the performance of the
RF supertree heuristic to MRP and the triplet supertree
method (which seeks a supertree with the most shared
triplets with the collection of input trees) using pub-
lished supertree data sets from sea birds [40], marsupials
[41], placental mammals [42], and legumes [43]. The
published data sets contain between 7 and 726 input
trees and between 112 and 571 total taxa (Table 1).
There are a number of ways to implement any local
search algorithm. Preliminary analyses of the RF heuris-
tic based on the SPR local search indicated that, as with
other phylogenetic methods, the starting tree can affect
the estimate of the final supertree. Occasionally the SPR
searches got caught in local optima with relatively high
RF-distance scores. To ameliorate this potential pro-
blem, we implemented a ratchet search heuristic for RF
supertrees based on the parsimony ratchet [44]. In general,
a ratchet search performs a number of iterations (25 in
our case), each consisting of two local SPR searches:


Table 1 Experimental Results

Data Set                                  Supertree Method   RF-Distance   Parsimony Score
Marsupial (272 taxa; 158 trees)           RF-Ratchet         1514          2528
                                          RF-MRP             1502          2513
                                          MRP-TBR            1514          2509
                                          MRP-Ratchet        1514          2509
                                          Triplet            1604          2569
Sea Birds (121 taxa; 7 trees)             RF-Ratchet         61            223
                                          RF-MRP             61            223
                                          MRP-TBR            63            221
                                          MRP-Ratchet        63            221
                                          Triplet            61            223
Placental Mammals (116 taxa; 726 trees)   RF-Ratchet         5686          8926
                                          RF-MRP             5690          8890
                                          MRP-TBR            5694          8878
                                          MRP-Ratchet        5694          8878
                                          Triplet            6032          9064
Legumes (571 taxa; 22 trees)              RF-Ratchet         1556          965
                                          RF-MRP             1534          882
                                          MRP-TBR            1554          856
                                          MRP-Ratchet        1552          854
                                          Triplet            N/A           N/A

Experimental results comparing the performance of the RF supertree method
to MRP and triplet supertree methods. We used five different supertree
analyses: RF supertrees using our SPR local search algorithm with a ratchet
starting from either random addition sequence trees (RF-ratchet) or MRP trees
(RF-MRP), MRP with TBR branch swapping with (MRP-ratchet) and without
(MRP-TBR) a ratchet search, and triplet supertrees with a TBR local search
(Triplet). We measured the RF distance to the collection of input trees (RF-
distance) and the parsimony score of a best found supertree based on the
matrix representation of the input trees. The best RF distance and parsimony
scores are in bold.

one in which the characters (input trees) are equally
weighted, and another in which the set of the characters
are re-weighted. We re-weighted the characters by ran-
domly removing approximately two-thirds of the input
trees. The goal of re-weighting the characters is to alter
the tree space to avoid getting caught in a globally sub-
optimal part of the tree space. At the end of each itera-
tion, the best tree is taken as the starting point of the
next iteration. For each data set, we started RF ratchet
searches from 20 random sequence addition starting
trees, and we also ran three replicates starting from an
optimal MRP supertree. All RF supertree analyses were
performed on a 3 GHz Intel Pentium 4 based desktop
computer with 1 GB of main memory. The RF-ratchet
runs took between 5 seconds (for the Sea Birds data set)
and 90 minutes (for the legume dataset) when starting
from a random sequence addition tree. RF-ratchet runs
starting from optimal MRP trees were at least twice as
fast because they required fewer search steps.
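The ratchet strategy described above can be sketched generically. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the re-weighting is modeled by keeping a random third of the profile, the order of the two searches within an iteration is a choice made here, and a toy one-dimensional problem replaces the supertree search space.

```python
import random


def local_search(start, neighbors, cost):
    """Steepest descent until no neighbor has a strictly lower cost."""
    current, current_cost = start, cost(start)
    while True:
        best = min(neighbors(current), key=cost, default=None)
        if best is None or cost(best) >= current_cost:
            return current, current_cost
        current, current_cost = best, cost(best)


def ratchet(start, profile, neighbors, cost, iterations=25, keep=1 / 3, seed=0):
    """Alternate searches on a re-weighted profile and on the full profile."""
    rng = random.Random(seed)

    def full(x):
        return cost(x, profile)

    best, best_cost = local_search(start, neighbors, full)
    for _ in range(iterations):
        # Re-weighting: randomly drop about two-thirds of the input trees.
        size = max(1, round(keep * len(profile)))
        subset = rng.sample(profile, size)
        perturbed, _ = local_search(best, neighbors, lambda x: cost(x, subset))
        # Equally weighted search from the perturbed solution.
        candidate, candidate_cost = local_search(perturbed, neighbors, full)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


# Toy usage (assumption): integers as "trees" and numbers as "input trees",
# with the cost being the sum of absolute differences.
toy_profile = [1, 2, 9, 10, 11]
solution = ratchet(0, toy_profile,
                   lambda x: [x - 1, x + 1],
                   lambda x, p: sum(abs(x - t) for t in p))
```

On this convex toy problem the perturbation cannot help, so the result equals that of a plain local search; the benefit of the ratchet shows up only in search spaces with many local optima.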
For our MRP analyses, we also tried two heuristic
search methods, both implemented using PAUP* [18].
First, we performed 20 replicates of TBR branch swap-
ping from trees built with random addition sequence




starting trees. Next, we performed 20 replicates of a par-
simony ratchet search with TBR branch swapping. Based
on the results of trial analyses, each ratchet search con-
sisted of 25 iterations, each reweighting 15% of the char-
acters. The PAUP* command block for the parsimony
ratchet searches was generated using PAUPRat [45]. For
each data set, we performed 20 replicates of a TBR local
search heuristic starting with random addition sequence
trees. Triplet supertrees were constructed using the pro-
gram from Lin et al. [38]. We were unable to perform
ratchet searches with the existing triplet supertree soft-
ware, and also, due to memory limitations, we were
unable to perform triplet supertree analyses on the
legume data set.
Our analyses demonstrate the effectiveness of our
local search heuristics for the RF supertree problem. In
all four data sets, RF-ratchet searches found the super-
trees with the lowest total RF distance to the input trees
(Table 1). MRP also generally performs well, finding
supertrees with RF distances between 0.14% (placental
mammals) and 3.3% (sea birds) higher than the best
score found by the RF supertree heuristics (Table 1).
The triplet supertree method performs as well as the RF
supertree method on the small sea bird data set; how-
ever, the triplet supertrees for the marsupial and placen-
tal mammal data sets have a much higher RF distance
to the input trees than either the RF or MRP supertrees
(Table 1). For all the data sets, the MRP supertrees had
the lowest (best) parsimony score based on a binary
matrix representation of the input trees (Table 1). Thus,
not surprisingly, it appears that optimizing based on the
parsimony score or the triplet distance to the input
trees does not optimize the similarity of the supertrees
to the input trees based on the RF distance metric (see
also [11,13]).
All of the data sets used in this analysis are from pub-
lished studies that used MRP. Therefore, it is not surpris-
ing that MRP performed well (but see [46]). Still, our
results demonstrate that MRP leaves some room for
improvement. If the goal is to find the supertrees that are
most similar to the collection of input trees, the RF
searches ultimately provide better estimates than MRP
(Table 1).
Interestingly, while the MRP trees tend to have rela-
tively low RF-distance scores, in some cases, such as
the legume data set, trees with low RF-distance scores
have high parsimony scores (Table 1). Thus, parsimony
scores are not necessarily indicative of RF score, and
MRP and RF supertree optimality criteria are certainly
not equivalent. Still, MRP trees appear to be useful as
starting points for RF supertree heuristics. Indeed, in
three of the four data sets, the best RF trees were
found in ratchet searches beginning from MRP trees
(Table 1).


Our program for computing RF supertrees is freely
available (for Windows, Linux, and Mac OS X) at
http://genome.cs.iastate.edu/CBL/RFsupertrees

Discussion and Conclusion
There is a growing interest in using supertrees for large-
scale evolutionary and ecological analyses. Yet there are
many concerns about the performance of existing super-
tree methods, and the great majority of published super-
tree analyses have relied on only MRP [47]. Since the
goal of a supertree analysis is to synthesize the phyloge-
netic data from a collection of input trees, it makes
sense that an effective supertree method should directly
seek the supertree that is most similar to the input
trees. Our new algorithms make it possible, for the first
time, to estimate large supertrees by directly optimizing
the RF distance from the supertree to the input trees.
There are numerous alternative metrics for comparing
phylogenetic trees besides the RF distance, and any of
these can be used for supertree methods (see, for exam-
ple, [11]). Triplet distance supertrees [11,48], quartet-fit
and quartet joining supertrees [11,24], maximum splits-
fit supertrees [11], and most similar supertrees [49] are
all, like RF supertrees, estimated by comparing input
trees to the supertree using tree distance measures. All
of these methods may provide different, and perhaps
equally valid, perspectives on supertree accuracy. Based
on our experimental analyses using the RF and triplet
supertree methods, optimizing the supertree based on
different distance measures can result in very different
supertrees (Table 1). In the future, it will be important
to characterize and compare the performance of these
methods in more detail (see, for example, [11,50]).
The results also suggest several future directions for
research. Although heuristics guided by local search pro-
blems, especially SPR and TBR, have been very effective
for many intrinsically difficult phylogenetic inference
problems, our experiments indicate that the tree space
for RF supertrees is complex. Both the ratchet approach and
starting from MRP trees appear to improve
performance in the four examples we tested (Table 1). However,
more work is needed to identify the most efficient
ways to implement our fast local search heuristics. Also,
the use of alternative supertree methods (other than
MRP) to generate starting trees might result in a better
global strategy to compute RF supertrees, and this should
be investigated further. We note that the ideas presented
in [51] can be directly used to perform efficient NNI-
based local searches for the RF supertree problem. In
particular, we can show that heuristic searches for the RF
supertree problem, which perform a total of p local
search steps based on 1, 2, or 3-NNI neighborhoods (see
[51]), can all be executed in O(kn(n + p)) time, yielding
speed-ups of Θ(min{n, p}), Θ(n · min{n, p}) and Θ(n² · min{n, p})
over heuristic searches that are based on naive algorithms
for 1, 2 and 3-NNI local searches, respectively. It
would also be interesting to see if heuristics based on
TBR perform significantly better than those based on
SPR in inferring RF supertrees.
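The stated speed-ups for the NNI-based searches can be checked with a short calculation. The sketch below rests on two assumptions that are not spelled out in this section: a naive search recomputes the total RF distance to the k input trees from scratch, in time proportional to kn, for every neighbor it evaluates, and the j-NNI neighborhood of a tree contains on the order of n^j trees for j = 1, 2, 3. The symbols T_naive and T_fast are introduced here only for illustration.

```latex
\begin{align*}
T_{\text{naive}}(j) &\approx p \cdot n^{j} \cdot kn = k\,n^{j+1}\,p,
  \qquad T_{\text{fast}} \approx k\,n\,(n+p), \\[4pt]
\frac{T_{\text{naive}}(j)}{T_{\text{fast}}}
  &= \frac{k\,n^{j+1}\,p}{k\,n\,(n+p)}
   = n^{j-1}\cdot\frac{np}{n+p}, \\[4pt]
\tfrac{1}{2}\min\{n,p\} \;\le\; \frac{np}{n+p} \;&\le\; \min\{n,p\}
  \quad\Longrightarrow\quad
  \frac{T_{\text{naive}}(j)}{T_{\text{fast}}}
   = \Theta\!\left(n^{j-1}\min\{n,p\}\right).
\end{align*}
```

Setting j = 1, 2, 3 gives ratios of order min{n, p}, n · min{n, p}, and n² · min{n, p}, respectively.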
In some cases it might be desirable to remove the
restriction that the supertree be binary. In the consensus
setting, such a median tree can be obtained within poly-
nomial time [27]; however, finding a median RF tree in
the supertree setting is NP-hard [52]. One simple way to
estimate a non-binary median tree could be to first com-
pute an RF supertree and then to iteratively (and perhaps
greedily) contract those edges in the supertree that result
in a reduction in the total RF distance. Thus, our algo-
rithms may even be useful for roughly estimating major-
ity-rule(-) supertrees [28], which are essentially the strict
consensus of all optimal, not necessarily binary, median
RF trees, and have several desirable properties [29].
These majority-rule(-) supertrees are also the strict con-
sensus of all maximum-likelihood supertrees [53]. Also,
there are several alternate forms of the RF distance
metric that could be incorporated into our local search
algorithms. For example, in order to account for biases
associated with the different sizes of input trees, we
could normalize the RF distance for each input tree,
dividing the observed RF distance by the maximum pos-
sible RF distance based on the tree size. Similarly, we
could incorporate either branch length data or phyloge-
netic support scores (bootstrap values or posterior prob-
abilities) from the input trees into the RF distance in
order to give more weight to partitions that are strongly
supported or separated by long branches (e.g., [25,54]).
Our current implementation essentially treats all branch
lengths as one and all partitions as equal. The addition of
branch length or support data may further improve the
accuracy of the RF supertree method.
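The size normalization suggested above can be illustrated with a small sketch. It is not the released implementation, and it rests on several assumptions: trees are encoded as nested Python tuples with string leaf labels, the RF distance is taken to be the size of the symmetric difference between the non-trivial cluster sets of an input tree and of the supertree restricted to that tree's leaves (other conventions differ by a constant factor), and both trees are binary, so that the maximum distance on m leaves is 2(m - 2). All function names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Restrict tree to the leaves in keep, suppressing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def clusters(tree):
    """Non-trivial clusters: leaf sets below internal nodes other than the root."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child, False) for child in node))
        if not is_root:
            found.add(below)
        return below

    visit(tree, True)
    return found


def rf_distance(supertree, input_tree):
    """RF distance between input_tree and the supertree restricted to its leaves.

    Assumes every leaf of input_tree also occurs in supertree.
    """
    restricted = restrict(supertree, leaves(input_tree))
    return len(clusters(restricted) ^ clusters(input_tree))


def normalized_rf(supertree, input_tree):
    """RF distance divided by its maximum for binary trees of this size."""
    m = len(leaves(input_tree))
    max_rf = 2 * (m - 2)  # assumption: both trees binary on m leaves
    return rf_distance(supertree, input_tree) / max_rf if max_rf > 0 else 0.0


def total_rf(supertree, input_trees, normalize=False):
    """Sum of (optionally normalized) RF distances to all input trees."""
    score = normalized_rf if normalize else rf_distance
    return sum(score(supertree, t) for t in input_trees)


if __name__ == "__main__":
    supertree = ((("a", "b"), "c"), ("d", "e"))
    inputs = [(("a", "b"), "d"), (("a", "d"), "b"), ((("a", "c"), "b"), "e")]
    print(total_rf(supertree, inputs), total_rf(supertree, inputs, normalize=True))
```

With normalization, a conflict in a three-taxon input tree counts as much as a complete disagreement in a larger tree, which is the bias correction the text describes; whether that weighting is desirable would depend on the data set.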


Acknowledgements
We thank Harris Lin for providing software for the triplet supertree analyses.
This work was supported in part by NESCent and by NSF grants DEB-0334832
and DEB-0829674. MSB was supported in part by a postdoctoral
fellowship from the Edmond J. Safra Bioinformatics program at Tel Aviv
University.

Author details
1 The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv
69978, Israel. 2 Department of Computer Science, Iowa State University, Ames,
IA 50011, USA. 3 Department of Biology, University of Florida, Gainesville, FL
32611, USA.

Authors' contributions
MSB was responsible for algorithm design and program implementation,
contributed to the experimental evaluation, and wrote major parts of the
manuscript. JGB performed the experimental evaluation and the analysis of
the results, and contributed to the writing of the manuscript. OE and DFB
supervised the project and contributed to the writing of the manuscript. All
authors read and approved the final manuscript.


Competing interests
The authors declare that they have no competing interests.

Received: 27 June 2009 Accepted: 24 February 2010
Published: 24 February 2010

References
1. Davies TJ, Barraclough TG, Chase MW, Soltis PS, Soltis DE, Savolainen V:
Darwin's abominable mystery: Insights from a supertree of the
angiosperms. Proceedings of the National Academy of Sciences of the United
States of America 2004, 101(7):1904-1909.
2. Bininda-Emonds ORP, Cardillo M, Jones KE, Macphee RDE, Beck RMD,
Grenyer R, Price SA, Vos RA, Gittleman JL, Purvis A: The delayed rise of
present-day mammals. Nature 2007, 446(7135):507-512.
3. Daubin V, Gouy M, Perriere G: A Phylogenomic Approach to Bacterial
Phylogeny: Evidence of a Core of Genes Sharing a Common History.
Genome Res 2002, 12(7):1080-1090.
4. Burleigh JG, Driskell AC, Sanderson MJ: Supertree bootstrapping methods
for assessing phylogenetic variation among genes in genome-scale data
sets. Systematic Biology 2006, 55:426-440.
5. Pisani D, Cotton JA, McInerney JO: Supertrees disentangle the chimerical
origin of eukaryotic genomes. Mol Biol Evol 2007, 24(8):1752-1760.
6. Webb CO, Ackerly DD, McPeek M, Donoghue MJ: Phylogenies and
community ecology. Ann Rev Ecol Syst 2002, 33:475-505.
7. Davies TJ, Fritz SA, Grenyer R, Orme CDL, Bielby J, Bininda-Emonds ORP,
Cardillo M, Jones KE, Gittleman JL, Mace GM, Purvis A: Phylogenetic trees
and the future of mammalian biodiversity. Proceedings of the National
Academy of Sciences 2008, 105(Supplement 1):11556-11563.
8. Purvis A: A modification to Baum and Ragan's method for combining
phylogenetic trees. Systematic Biology 1995, 44:251-255.
9. Pisani D, Wilkinson M: Matrix Representation with Parsimony, Taxonomic
Congruence, and Total Evidence. Systematic Biology 2002, 51:151-155.
10. Bininda-Emonds ORP, Gittleman JL, Steel MA: The (super) tree of life:
procedures, problems, and prospects. Annual Review of Ecology and
Systematics 2002, 33:265-289.
11. Wilkinson M, Cotton JA, Creevey C, Eulenstein O, Harris SR, Lapointe FJ,
Levasseur C, McInerney JO, Pisani D, Thorley JL: The shape of supertrees
to come: Tree shape related properties of fourteen supertree methods.
Syst Biol 2005, 54:419-432.
12. Goloboff PA: Minority rule supertrees? MRP, Compatibility, and Minimum
Flip may display the least frequent groups. Cladistics 2005, 21(3):282-294.
13. Wilkinson M, Cotton JA, Lapointe FJ, Pisani D: Properties of Supertree
Methods in the Consensus Setting. Syst Biol 2007, 56(2):330-337.
14. Day WH, McMorris F, Wilkinson M: Explosions and hot spots in supertree
methods. Journal of Theoretical Biology 2008, 253(2):345-348.
15. Baum BR: Combining Trees as a Way of Combining Data Sets for
Phylogenetic Inference, and the Desirability of Combining Gene Trees.
Taxon 1992, 41:3-10.
16. Ragan MA: Phylogenetic inference based on matrix representation of
trees. Molecular Phylogenetics and Evolution 1992, 1:53-58.
17. Goloboff PA: Analyzing Large Data Sets in Reasonable Times: Solutions
for Composite Optima. Cladistics 1999, 15(4):415-428.
18. Swofford DL: PAUP*: Phylogenetic analysis using parsimony (*and other
methods), Version 4.0b10. 2002.
19. Roshan U, Moret BME, Warnow T, Williams TL: Rec-I-DCM3: A Fast
Algorithmic Technique for Reconstructing Large Phylogenetic Trees. CSB
2004, 98-109.
20. Bininda-Emonds O, Sanderson M: Assessment of the accuracy of matrix
representation with parsimony analysis supertree construction.
Systematic Biology 2001, 50:565-579.
21. Eulenstein O, Chen D, Burleigh JG, Fernández-Baca D, Sanderson MJ:
Performance of Flip Supertree Construction with a Heuristic Algorithm.
Systematic Biology 2003, 53:299-308.
22. Chen D, Eulenstein O, Fernández-Baca D, Burleigh JG: Improved Heuristics
for Minimum-Flip Supertree Construction. Evolutionary Bioinformatics 2006,
2.
23. Creevey CJ, McInerney JO: Clann: investigating phylogenetic information
through supertree analyses. Bioinformatics 2005, 21(3):390-392.
24. Wilkinson M, Cotton JA: Supertree Methods for Building the Tree of Life:
Divide-and-Conquer Approaches to Large Phylogenetic Problems.
In Reconstructing the Tree of Life: Taxonomy and Systematics of Species Rich
Taxa. Edited by Hodkinson TR, Parnell JAN. CRC Press; 2007:61-76.
25. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Mathematical
Biosciences 1981, 53(1-2):131-147.
26. McMorris FR, Steel MA: The complexity of the median procedure for
binary trees. Proceedings of the International Federation of Classification
Societies 1993.
27. Barthelemy JP, McMorris FR: The median procedure for n-trees. Journal of
Classification 1986, 3:329-334.
28. Cotton JA, Wilkinson M: Majority-Rule Supertrees. Systematic Biology 2007,
56:445-452.
29. Dong J, Fernández-Baca D: Properties of Majority-Rule Supertrees. Syst Biol
2009, 58(3):360-367.
30. Day WHE: Optimal algorithms for comparing trees with labeled leaves.
Journal of Classification 1985, 2:7-28.
31. Pattengale ND, Gottlieb EJ, Moret BME: Efficiently Computing the
Robinson-Foulds Metric. Journal of Computational Biology 2007,
14(6):724-735, [PMID: 17691890].
32. Semple C, Steel M: Phylogenetics. Oxford University Press; 2003.
33. Allen BL, Steel M: Subtree transfer operations and their induced metrics
on evolutionary trees. Annals of Combinatorics 2001, 5:1-13.
34. Swofford DL, Olsen GJ, Waddel PJ, Hillis DM: Phylogenetic inference.
In Molecular Systematics. Edited by Hillis DM, Moritz C, Mable BK.
Sunderland, Mass: Sinauer Assoc; 1996:407-509.
35. Bordewich M, Semple C: On the computational complexity of the rooted
subtree prune and regraft distance. Annals of Combinatorics 2004,
8:409-423.
36. Bansal MS, Burleigh JG, Eulenstein O, Wehe A: Heuristics for the
Gene-Duplication Problem: A Θ(n) Speed-Up for the Local Search. In RECOMB,
Volume 4453 of Lecture Notes in Computer Science. Edited by Speed TP,
Huang H. Springer; 2007:238-252.
37. Bansal MS, Eulenstein O: An Ω(n²/log n) Speed-Up of TBR Heuristics for
the Gene-Duplication Problem. IEEE/ACM Transactions on Computational
Biology and Bioinformatics 2008, 5(4):514-524.
38. Lin H, Burleigh JG, Eulenstein O: Triplet supertree heuristics for the tree of
life. BMC Bioinformatics 2009, 10(Suppl 1):S8.
39. Bender MA, Farach-Colton M: The LCA Problem Revisited. In LATIN, Volume
1776 of Lecture Notes in Computer Science. Edited by Gonnet GH, Panario D,
Viola A. Springer; 2000:88-94.
40. Kennedy M, Page R: Seabird supertrees: combining partial estimates of
procellariiform phylogeny. The Auk 2002, 119:88-108.
41. Cardillo M, Bininda-Emonds ORP, Boakes E, Purvis A: A species-level
phylogenetic supertree of marsupials. Journal of Zoology 2004, 264:11-31.
42. Beck R, Bininda-Emonds O, Cardillo M, Liu FG, Purvis A: A higher-level MRP
supertree of placental mammals. BMC Evolutionary Biology 2006, 6:93.
43. Wojciechowski M, Sanderson M, Steele K, Liston A: Molecular phylogeny of
the "Temperate Herbaceous Tribes" of Papilionoid legumes: a supertree
approach. In Advances in Legume Systematics. Edited by Herendeen P,
Bruneau A. Kew: Royal Botanic Gardens; 2000, 9:277-298.
44. Nixon KC: The parsimony ratchet: a new method for rapid parsimony
analysis. Cladistics 1999, 15:407-414.
45. Sikes DS, Lewis PO: PAUPRat: PAUP* implementation of the parsimony
ratchet. 2001.
46. Wilkinson M, Pisani D, Cotton JA, Corfe I: Measuring Support and Finding
Unsupported Relationships in Supertrees. Syst Biol 2005, 54(5):823-831.
47. Bininda-Emonds OR (Ed): Phylogenetic supertrees. Springer Verlag; 2004.
48. Snir S, Rao S: Using Max Cut to Enhance Rooted Trees Consistency. IEEE/
ACM Trans Comput Biol Bioinform 2006, 3(4):323-333.
49. Creevey CJ, Fitzpatrick DA, Gayle K Philip RJK, O'Connell MJ, Pentony MM,
Travers SA, Wilkinson M, McInerney JO: Does a tree-like phylogeny only
exist at the tips in the prokaryotes? Proc Biol Sci 2004,
271(1557):2551-2558.
50. Thorley JL, Wilkinson M: A View of Supertree Methods. In Bioconsensus,
Volume 61 of DIMACS Series in Discrete Mathematics and Theoretical Computer
Science. Providence, Rhode Island, USA: American Mathematical Society; 2003:
185-193.
51. Bansal MS, Eulenstein O, Wehe A: The Gene-Duplication Problem:
Near-Linear Time Algorithms for NNI-Based Local Searches. IEEE/ACM
Transactions on Computational Biology and Bioinformatics 2009, 6(2):221-231.
52. Bryant D: Building trees, hunting for trees, and comparing trees: Theory
and methods in phylogenetic analysis. PhD thesis. Dept of Mathematics,
University of Canterbury; 1997.
53. Steel M, Rodrigo A: Maximum likelihood supertrees. Syst Biol 2008,
57:243-250.
54. Kuhner MK, Felsenstein J: A simulation comparison of phylogeny
algorithms under equal and unequal evolutionary rates [published
erratum appears in Mol Biol Evol 1995 May;12(3):525]. Mol Biol Evol 1994,
11(3):459-468.

doi:10.1186/1748-7188-5-18
Cite this article as: Bansal et al.: Robinson-Foulds Supertrees. Algorithms
for Molecular Biology 2010, 5:18.

