PROCEEDINGS  Open Access

Consensus properties for the deep coalescence problem and their application for scalable tree search

Harris T Lin1, J Gordon Burleigh2, Oliver Eulenstein1*

From 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11), Changsha, China. 27-29 May 2011

Abstract

Background: To infer a species phylogeny from unlinked genes, phylogenetic inference methods must confront the biological processes that create incongruence between gene trees and the species phylogeny. Intraspecific gene variation in ancestral species can result in deep coalescence, also known as incomplete lineage sorting, which creates incongruence between gene trees and the species tree. One approach to account for deep coalescence in phylogenetic analyses is the deep coalescence problem, which takes a collection of gene trees and seeks the species tree that implies the fewest deep coalescence events. Although this approach is promising for phylogenetics, the consensus properties of this problem are mostly unknown and analyses of large data sets may be computationally prohibitive.

Results: We prove that the deep coalescence consensus tree problem satisfies the highly desirable Pareto property for clusters (clades). That is, in all instances, each cluster that is present in all of the input gene trees, called a consensus cluster, will also be found in every optimal solution. Moreover, we introduce a new divide-and-conquer method for the deep coalescence problem based on the Pareto property. This method refines the strict consensus of the input gene trees, thereby, in practice, often greatly reducing the complexity of the tree search and guaranteeing that the estimated species tree will satisfy the Pareto property.
Conclusions: Analyses of both simulated and empirical data sets demonstrate that the divide-and-conquer method can greatly improve upon the speed of heuristics that do not consider the Pareto consensus property, while also guaranteeing that the proposed solution fulfills the Pareto property. The divide-and-conquer method extends the utility of the deep coalescence problem to data sets with enormous numbers of taxa.

*Correspondence: oeulenst@iastate.edu
1Department of Computer Science, Iowa State University, Ames, IA, USA
Full list of author information is available at the end of the article

Lin et al. BMC Bioinformatics 2012, 13(Suppl 10):S12
http://www.biomedcentral.com/1471-2105/13/S10/S12

© 2012 Lin et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

The rapidly growing abundance of genomic sequence data has revealed extensive incongruence among gene trees (e.g., [1,2]) that may be caused by processes such as deep coalescence (incomplete lineage sorting), gene duplication and loss, or lateral gene transfer (see [3-5]). In these cases, phylogenetic methods must account for and explain the patterns of variation among gene tree topologies, rather than simply assuming the gene tree topology reflects the relationships among species. In particular, there has been much recent interest in phylogenetic methods that account for deep coalescence, which may occur in any sexually reproducing organisms (e.g., [6-8]). One such approach is the deep coalescence problem, which, given a collection of gene trees, seeks a species tree that minimizes the number of deep coalescence events [4,9]. Although the deep coalescence problem is NP-hard [10], recent algorithmic advances enable scientists to solve instances with a small number of taxa [11,12] and efficiently compute heuristic solutions for data sets with slightly more species [13]. Still, the heuristics are based on generic local tree search strategies with no performance guarantees, and they cannot handle enormous data sets. In this study, we
prove that the deep coalescence problem satisfies the Pareto consensus property. We then describe a new divide-and-conquer approach, based on the Pareto property, that, in practice, can greatly extend the utility of existing heuristics while guaranteeing that the inferred species tree also has the Pareto property with respect to the input gene trees.

Related work

The deep coalescence problem is an example of a supertree problem, in which input trees with taxonomic overlap are combined to build a species tree that includes all of the taxa found in the input trees (see [14]). In fact, it is among the few supertree methods that use a biologically based optimality criterion. One way of evaluating supertree methods is by characterizing their consensus properties (e.g., [15,16]). The consensus tree problem is the special case of the supertree problem in which all the input trees contain the same taxa. Since all supertree problems generally seek to retain phylogenetic information from the input trees, one of the most desirable consensus properties is the Pareto property. A consensus tree problem satisfies the Pareto property on clusters (or triplets, quartets, etc.) if every cluster (or triplet, quartet, etc.) that is present in every input tree appears in the consensus tree [15-17]. Many supertree problems satisfy the Pareto property for clusters in the consensus setting [15,16]. However, this has not been shown for the deep coalescence problem.

Our contributions

We prove that the deep coalescence consensus tree problem satisfies the Pareto property for clusters. This result provides useful guidance for the species tree search.
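As a rough illustration of what the Pareto property on clusters asks for, the following Python sketch collects the clusters of each input tree, intersects them, and checks whether a candidate species tree contains every shared cluster. The nested-tuple tree encoding and all function names are assumptions made for this example only; they do not come from the paper.

```python
from typing import FrozenSet, Iterable, Set, Tuple, Union

# Hypothetical encoding: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]
Cluster = FrozenSet[str]


def clusters(tree: Tree) -> Set[Cluster]:
    """Return the leaf set of every subtree of `tree` (its clusters)."""
    found: Set[Cluster] = set()

    def visit(node: Tree) -> Cluster:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def consensus_clusters(trees: Iterable[Tree]) -> Set[Cluster]:
    """Clusters that are present in every input tree."""
    return set.intersection(*(clusters(t) for t in trees))


def displays_consensus_clusters(candidate: Tree, trees: Iterable[Tree]) -> bool:
    """True if `candidate` contains every cluster shared by all input trees."""
    return consensus_clusters(trees) <= clusters(candidate)


gene_trees = [((("a", "b"), "c"), "d"), (("a", "b"), ("c", "d"))]
print(displays_consensus_clusters((("a", "b"), ("c", "d")), gene_trees))  # True
print(displays_consensus_clusters((("a", "c"), ("b", "d")), gene_trees))  # False
```

In this toy input the only non-trivial shared cluster is {a, b}, so the second candidate fails the check because it separates a from b.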
Instead of evaluating all possible species trees, to find the optimal solution we need only examine trees that satisfy the Pareto property on clusters. These trees will all be refinements of the strict consensus of the gene trees. Furthermore, the Pareto property allows us to show that the problem can be divided into smaller independent subproblems based on the strict consensus tree. We apply this property and describe a new divide-and-conquer method, and our experiments demonstrate that this method can greatly improve the speed of deep coalescence tree heuristics, potentially enabling efficient and effective estimates from inputs with several thousand taxa. Future work will exploit the independence of the subproblems and solve these on parallel machines, which should result in even larger and more accurate solutions.

Methods

Basic definitions, notations, and preliminaries

In this section we introduce basic definitions and notations and then define preliminaries required for this work. For brevity some proofs are omitted in the text but available in Additional file 1.

A graph $G$ is an ordered pair $(V, E)$ consisting of a non-empty set $V$ of nodes and a set $E$ of edges. We denote the set of nodes and edges of $G$ by $V(G)$ and $E(G)$, respectively. If $e = \{u, v\}$ is an edge of a graph $G$ then $e$ is said to be incident with $u$ and $v$. If $v$ is a node of a graph $G$, then the degree of $v$ in $G$ is the number of edges in $G$ that are incident with $v$.

A tree $T$ is a connected graph with no cycles. $T$ is rooted if it has exactly one distinguished node of degree one, called the root, and we denote it by $Ro(T)$. The unique edge incident with $Ro(T)$ is called the root edge.

Let $T$ be a rooted tree. We define $\le_T$ to be the partial order on $V(T)$ where $x \le_T y$ if $y$ is a node on the path between $Ro(T)$ and $x$. If $x \le_T y$ we call $x$ a descendant of $y$, and $y$ an ancestor of $x$. We also define $x$ [...]

Deep coalescence

We define the deep coalescence cost function as demonstrated in Figure 2. Note that our definition of the deep coalescence cost, given in Def. 3, is somewhat different from, but for our purposes equivalent to, its original definition, also termed extra lineage, given in [4]. The relationship between both definitions is shown in Additional file 1.
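The cost formalized in Definitions 1 to 3, which follow, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: trees are encoded as nested tuples (an encoding chosen only for this example), both trees are assumed to share one leaf set and to have no degree-two internal nodes, and the root edge is omitted because, on our reading of the definition, both of its endpoints map to the same LCA and so it contributes zero.

```python
from typing import Dict, FrozenSet, Tuple, Union

# Hypothetical encoding: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaf_set(tree: Tree) -> FrozenSet[str]:
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def dc_cost(gene_tree: Tree, species_tree: Tree) -> int:
    """Sum, over the edges of the gene tree, of the species-tree path length
    between the LCA images of the two endpoints of the edge."""
    # Depth of every species-tree node, keyed by the node's leaf set.
    depth: Dict[FrozenSet[str], int] = {}

    def index(node: Tree, d: int) -> None:
        depth[leaf_set(node)] = d
        if not isinstance(node, str):
            for child in node:
                index(child, d + 1)

    index(species_tree, 0)

    def lca_depth(leaves: FrozenSet[str]) -> int:
        # The LCA of `leaves` is the deepest species-tree node whose leaf set contains them.
        return max(d for cluster, d in depth.items() if leaves <= cluster)

    def walk(node: Tree) -> int:
        if isinstance(node, str):
            return 0
        parent_depth = lca_depth(leaf_set(node))
        return sum(
            lca_depth(leaf_set(child)) - parent_depth + walk(child)
            for child in node
        )

    return walk(gene_tree)


species = (("a", "b"), ("c", "d"))
print(dc_cost((("a", "b"), ("c", "d")), species))  # 6
print(dc_cost((("a", "c"), ("b", "d")), species))  # 8
```

A gene tree identical to the species tree costs exactly one unit per edge (6 here), while the conflicting gene tree maps both of its cherries to the top node of the species tree and costs 8.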
Throughout this section we assume $T$ and $S$ are trees over the same leaf set.

Definition 1 (Path length). Suppose $x \le_T y$, the path length from $x$ to $y$, denoted $pl_T(x, y)$, is the number of edges in the path from $x$ to $y$. Further, let $X \subseteq Y \subseteq Le(T)$, we extend the path length function by $pl_T(X, Y) := pl_T(lca_T(X), lca_T(Y))$.

Definition 2 (LCA mapping). Let $v \in V(T)$, the LCA mapping of $v$ in $S$, denoted $M_{T,S}(v)$, is defined by $M_{T,S}(v) := lca_S(Le(T_v))$.

Definition 3 (Deep coalescence). The deep coalescence cost from $T$ to $S$, denoted $DC(T, S)$, is

$$DC(T, S) := \sum_{\{u,v\} \in E(T),\ u < v} pl_S(M_{T,S}(u), M_{T,S}(v))$$

Using the extended path lengths, the deep coalescence cost can be equivalently expressed as

$$DC(T, S) = \sum_{\{u,v\} \in E(T),\ u < v} pl_S(Le(T_u), Le(T_v))$$

Figure 1 Examples of tree definitions. (a) A rooted tree $T$ with four leaves $\{a, b, c, d\}$. (b) The subtree of $T$ induced by $X$ where $X = \{a, b\}$. (c) The restricted subtree of $T$ induced by $X$.

Figure 2 Example of deep coalescence cost definition. Example showing the deep coalescence cost from $T$ to $S$. Each edge of $T$ is accompanied by its cost, and its corresponding path is shown on $S$.

Consensus tree

Definition 4 (Consensus tree problem). Let $f : \mathcal{T}_X \times \mathcal{T}_X \to \mathbb{R}$ be a cost function where $X$ is a leaf set and $\mathcal{T}_X$ is the set of all trees over $X$. A consensus tree problem based on $f$ is defined as follows.
Instance: A tuple of $n$ trees $(T_1, \dots, T_n)$ over $X$.
Find: The set of all trees that have the minimum aggregated cost with respect to $f$. Formally,

$$\arg\min_{S \in \mathcal{T}_X} \sum_{i=1}^{n} f(T_i, S)$$

This set is also called the solutions for the consensus tree instance.

Definition 5 (Deep coalescence consensus tree problem). We define the deep coalescence consensus tree problem to be the consensus tree problem based on the deep coalescence cost function.

Cluster and Pareto

Definition 6 (Cluster).
Let $T$ be a tree, the clusters induced by $T$, denoted $Cl(T)$, is $Cl(T) := \{Le(T_v) : v \in V(T)\}$. Further, $X \in Cl(T)$ is called a trivial cluster if $X = Le(T)$ or $|X| = 1$, it is called non-trivial otherwise. Let $Y \subseteq Le(T)$, we say that $T$ contains (cluster) $Y$ if $Y \in Cl(T)$.

Definition 7 (Pareto on clusters). Let $P$ be a consensus tree problem based on some cost function. We say that $P$ is Pareto on clusters if: for all instances $I = (T_1, \dots, T_n)$ of $P$, for all solutions $S$ of $I$, we have $\bigcap_{i=1}^{n} Cl(T_i) \subseteq Cl(S)$.

Theorem overview

We wish to show that the deep coalescence consensus tree problem is Pareto on clusters. We describe a high-level structure of the proof in this section and provide necessary supporting lemmata in the next section.

The proof proceeds by contradiction, assuming that the deep coalescence consensus tree problem is not Pareto on clusters. By Def. 7, the assumption implies that there exists an instance $I = (T_1, \dots, T_n)$, a solution $S$ for $I$, and a cluster $X \subseteq Le(S)$ where $X \in \bigcap_{i=1}^{n} Cl(T_i)$ but $X \notin Cl(S)$. $S$ being a solution for $I$ implies, by Def. 4, that the aggregated deep coalescence cost, i.e. $\sum_{i=1}^{n} DC(T_i, S)$, is minimized. Then, based on the existence of the cluster $X$, we edit $S$ and form a new tree $R$ using a tree edit operation which will be introduced in the next section. The properties of this new operation together with the properties of $X$ (proved in the next section) provide the key ingredients to calculate the changes in deep coalescence costs. With some further arithmetic, this allows us to conclude that $R$ in fact has a smaller aggregated deep coalescence cost, i.e. $\sum_{i=1}^{n} DC(T_i, S) > \sum_{i=1}^{n} DC(T_i, R)$, hence contradicting the assumption that $S$ is a solution for $I$.

Supporting lemmata

Shallowest regrouping operation

In this section we formally define the new tree edit operation that forms the key part of the theorem. We begin with some useful definitions related to the depth of nodes. An example of this operation is shown in Figure 3.

Definition 8 (Node depth). The depth of a node $v \in V(T)$, denoted $dep_T(v)$, is $pl(v, Ro(T))$.

Definition 9 (Shallowest nodes).
Let $T$ be a tree and $X \subseteq V(T)$, the shallowest function, denoted $shallowest_T(X)$, is the set of nodes in $X$ which have the minimum depth among all nodes in $X$. Formally, we define $shallowest_T(X) := \arg\min_{v \in X}(dep_T(v))$.

Now we have the necessary mechanics to define the new tree edit operation. In what follows, we assume $S$ to be a tree, $\emptyset \subset X \subset Le(S)$, and $S' = S(\overline{X})$.

Definition 10 (Regroup). Let $v \in I_2(S')$. The regrouping operation of $S$ by $X$ on $v$, denoted $\Gamma(S, X, v)$, is the tree obtained from $S'$ by
1. (R1) Identify $Ro(S|X)$ and $v$. In other words we adjoin the root of tree $S|X$ onto the node $v$.
2. (R2) Suppress all nodes with degree two.

Definition 11 (Shallowest regroup). The shallowest regrouping operation of $S$ by $X$, denoted $\hat{\Gamma}(S, X)$, defines a set of trees by $\hat{\Gamma}(S, X) := \{\Gamma(S, X, v) : v \in shallowest_{S'}(I_2(S'))\}$.

As Figure 3 shows, the shallowest regrouping operation pulls apart $X$ from $S$ and regroups $X$ back onto each of the shallowest nodes in $S'$.

Counting the number of degree-two nodes

The regrouping operation includes the step of suppressing nodes with degree two. Since this step affects path lengths and ultimately deep coalescence costs, we are required to count carefully the number of degree-two nodes under various conditions. Here we assume that $T$ is a tree and $\{X, Y\}$ is a bipartition of $Le(T)$. We begin with two observations that assert existence of degree-two nodes, and assert existence of leaf sets given a degree-two node.

Observation 1. $I_2(T(X)) \neq \emptyset$ and $I_2(T(Y)) \neq \emptyset$.

Observation 2. If $v \in I_2(T(X))$, then $Le(T_v) \cap X \neq \emptyset$ and $Le(T_v) \cap Y \neq \emptyset$.

The next Lemma says that if the root of $T$ is the parent of $lca(X)$, then the number of degree-two nodes in $T(X)$ is at least the depth of $v$, where $v$ is a shallowest degree-two node of $T(Y)$.

Lemma 1. If $Pa(lca(X)) = Ro(T)$ and $v \in shallowest(I_2(T(Y)))$, then $dep(v) \le |I_2(T(X))|$.
Proof. Assume the premise. Let $n = dep(v)$, we observe that $n \ge 1$ because of the root edge. Figure 4 shows the setup and variable assignments for this proof. Let $v = v_1 < \dots < v_n$, and let $B, A_1, \dots, A_n$ be the leaf sets of the indicated subtrees. We observe the following:
- $v_n = lca(X)$ because $Pa(lca(X)) = Ro(T)$.
- $B \subseteq X$ and $A_1 \cap Y \neq \emptyset$, because $v$ is a degree-two node of $T(Y)$.
- $A_1 \cap Y \neq \emptyset$ implies that $A_2, \dots, A_n$ each contains at least an element of $Y$. For otherwise, each of $v_2, \dots, v_n$ becomes a degree-two node in $T(Y)$, contradicting the assumption that $v = v_1$ is the shallowest degree-two node in $T(Y)$.

In order to obtain $T(X)$, we must prune subtrees in $A_1$ whose leaves are in $Y$ (which could be the entire subtree $A_1$). Thus there must be at least one degree-two node in $A_1$ (or $v_1$ if $A_1$ is pruned). Similarly, for $1 < i \le n$, either $v_i$ has degree two or there exists a degree-two node in $A_i$. Overall $T(X)$ has at least $n$ degree-two nodes, as required.

Figure 3 Example of the shallowest regrouping operation. Example of the shallowest regrouping operation of $S$ by $X$ where $X = \{a, c, d\}$. The intermediate tree $S' = S(\overline{X})$ shows its two shallowest degree-two nodes $v_1$ and $v_2$. $R_1$ and $R_2$ are the resulting trees of this operation. That is, $\hat{\Gamma}(S, X) = \{R_1, R_2\}$ where $R_1 = \Gamma(S, X, v_1)$ and $R_2 = \Gamma(S, X, v_2)$.

Figure 4 Setup and variable assignments for the proof of Lemma 1. Tree showing the variable assignments in the proof of Lemma 1. Dotted lines represent omitted parts of the tree, and triangles represent subtrees.

Properties of the regrouping operation

We examine some properties of the regrouping operation in this section. In general, these properties show that the path lengths defined by LCAs do not increase under several different assumptions. This preservation of path lengths will later assist in the calculation of deep coalescence costs. Throughout this section, we assume $S$ to be a tree, $\emptyset \subset X \subset Le(S)$, and $S' = S(\overline{X})$. Further we let $R = \Gamma(S, X, v)$ where $v \in I_2(S')$.
Lemma 2. If $A \subseteq B \subseteq Le(S)$ and $B \subseteq \overline{X}$, then $pl_S(A, B) = pl_{S'}(A, B)$.

Lemma 3. If $A \subseteq B \subseteq Le(S)$ and $B \subseteq \overline{X}$, then $pl_S(A, B) \ge pl_R(A, B)$.

Lemma 4. If $A \subseteq B \subseteq Le(S)$ and $B \subseteq X$, then $pl_S(A, B) \ge pl_R(A, B)$.

Lemma 5. If $A \subseteq B \subseteq Le(S)$, $A \subseteq \overline{X}$, and $X \subseteq B$, then $pl_S(A, B) \ge pl_R(A, B)$.

Proof. Let $S''$ be the tree obtained from $S'$ by identifying $Ro(S|X)$ and $v$. In other words, $S''$ is the tree after step (R1) of the regroup operation $\Gamma(S, X, v)$. We will show that $pl_S(A, B) \ge pl_{S''}(A, B) \ge pl_R(A, B)$. We begin with the first inequality. First, since $A \subseteq \overline{X}$ we know that $lca_S(A) = lca_{S'}(A) = lca_{S''}(A)$. Let $x = lca_S(X)$ and $b = lca_S(B)$, then the assumption of $X \subseteq B$ implies $x \le_S b$. Since $v$ has degree two in $S'$, we know that $Le(S_v) \cap X \neq \emptyset$ (Observation 2), and so $v \le_S x$. Now let $x'' = lca_{S''}(X)$ and $b'' = lca_{S''}(B)$. By (R1) we have that $x'' \le_{S''} v$, and so $x'' \le_{S''} x$ which implies $b'' \le_{S''} b$. Furthermore, $lca_S(A) = lca_{S''}(A)$ is a descendant of both $b$ and $b''$ because $A \subseteq B$, and hence $b'' \le_{S''} b$ implies that $pl_S(A, B) \ge pl_{S''}(A, B)$.

Next, by (R2) $R$ is obtained from $S''$ by suppressing some nodes, therefore a path in $S''$ can only be made shorter in $R$, hence we have $pl_{S''}(A, B) \ge pl_R(A, B)$.

Finally, combining the above results we have $pl_S(A, B) \ge pl_R(A, B)$.

Main theorem

Theorem 1. The deep coalescence consensus tree problem is Pareto on clusters.

Proof. Assume not for a contradiction, then there exists an instance $I = (T_1, \dots, T_n)$, a solution $S$ for $I$, and a cluster $X \subseteq Le(S)$ where $X \in \bigcap_{i=1}^{n} Cl(T_i)$ but $X \notin Cl(S)$. Since $X \notin Cl(S)$, $X$ must be non-trivial, therefore $\hat{\Gamma}(S, X)$ does not contain $S$ and is not empty. Let $R \in \hat{\Gamma}(S, X)$. We will show that $(\forall\, 1 \le i \le n)(DC(T_i, S) > DC(T_i, R))$, which implies $\sum_{i=1}^{n} DC(T_i, S) > \sum_{i=1}^{n} DC(T_i, R)$, contradicting the assumption that $S$ is a solution for $I$.

Let $T = T_i$ where $1 \le i \le n$, we will show that $DC(T, S) > DC(T, R)$. This requires that $DC(T, S) - DC(T, R) > 0$, in other words

$$\sum_{\{u,v\} \in E(T),\ u < v} \left[ pl_S(Le(T_u), Le(T_v)) - pl_R(Le(T_u), Le(T_v)) \right] > 0 \quad (1)$$

Since (1) sums over all edges in $T$, for convenience we partition the edges of $T$ and compute the differences in path lengths for each partition individually. Figure 5 depicts a running example for $T$, $S$, and $R$ where $X = \{a, b, c\}$.
We identify some specific nodes in order to partition the edges of $T$. Let $S' = S(\overline{X})$, $w \in I_2(S')$ where $R = \Gamma(S, X, w)$. Since $X \notin Cl(S)$, $S'$ contains at least two nodes with degree two. Let $w' \in I_2(S')$ such that $w' \neq w$, then $S_{w'}$ contains some leaf $y \in \overline{X}$ (Observation 2).

Let $x = lca_T(X)$ and $z = lca_T(X \cup \{y\})$, we partition the edges of $T$ into $\{E_1, E_2, E_3, E_4\}$ as follows.
1. $E_1 := \{\{u, v\} \in E(T) : u < v \le x\}$ = All edges under $x$
2. $E_2 := \{\{u, v\} \in E(T) : x \le u < v\}$ = Edges forming the path from $x$ to $Ro(T)$
3. $E_3 := \{\{u, v\} \in E(T) : y \le u < v \le z\}$ = Edges forming the path from $y$ to $z$
4. $E_4 := E(T) \setminus (E_1 \cup E_2 \cup E_3)$

We consider (1) for each of the partitions separately. For clarity, we define the aggregated cost difference $\Delta_i$ for partition $E_i$ as follows.

$$\Delta_i := \sum_{\{u,v\} \in E_i,\ u < v} \left( pl_S(Le(T_u), Le(T_v)) - pl_R(Le(T_u), Le(T_v)) \right)$$

Hence (1) becomes

$$\Delta_1 + \Delta_2 + \Delta_3 + \Delta_4 > 0 \quad (2)$$

Let $x' = lca_S(X)$ and $p = pl_S(w, x') + 1$. For each $i \in \{1, 2, 3, 4\}$, we claim and prove the bound of $\Delta_i$ as follows.

Claim 1. $\Delta_1 \ge p$

Proof. First we observe that the difference for each path length in this partition is $\ge 0$ (Lemma 4), so we have $\Delta_1 \ge 0$. Since $x' = lca_S(X)$, we only need to consider the subtree $S_{x'}$ in computing the path lengths in this partition. Define $U = S_{x'}$. In particular, the number of degree-two nodes in $U(X)$ gives us a lower bound on the total decreases of path lengths, because these nodes are removed to obtain $U|X$ which is a subtree of $R$. That is, $\Delta_1 \ge |I_2(U(X))|$. Lemma 1 applies to $U$ with bipartition $\{X, Le(U) \setminus X\}$ and the node $w$, so we have $|I_2(U(X))| \ge dep_U(w)$. The depth $dep_U(w)$ is with respect to $U$, and we relate it to a path length in $S$ by taking away the root edge, that is $dep_U(w) - 1 = pl_S(w, x')$. Finally, using the definition of $p$ we obtain

$$\Delta_1 \ge |I_2(U(X))| \ge dep_U(w) = pl_S(w, x') + 1 = p$$

Claim 2. $\Delta_2 = -p$

Proof.

$$\begin{aligned}
\Delta_2 &= pl_S(X, Le(T)) - pl_R(X, Le(T)) \\
&= [pl_S(x', Ro(T)) - 1] - [1 + pl_R(w, x') + pl_R(x', Ro(T)) - 1] \\
&= pl_S(x', Ro(T)) - [1 + pl_R(w, x') + pl_R(x', Ro(T))] \\
&= pl_S(x', Ro(T)) - [1 + pl_S(w, x') + pl_S(x', Ro(T))] \\
&= -[pl_S(w, x') + 1] \\
&= -p
\end{aligned}$$

The fourth equality holds because
$w$ is the shallowest degree-two node in $S'$, so that no edges along the path from $w$ to $x'$ are contracted in $R$, hence $pl_R(w, x') = pl_S(w, x')$.

Claim 3. $\Delta_3 \ge 1$

Proof. Let $\{a, b\} \in E_3$ where $a$ [...]

Proof. Let $\{a, b\} \in E_4$ where $a$ [...]

Similar to the proof of Theorem 1, we partition the edges of $T$ into $\{E_{under}, E_{out}, E_{in}\}$ as follows.
1. $E_{under} := \{\{u, v\} \in E(T) : u < v \text{ and } (\exists j)(v \le c'_j)\}$
2. $E_{out} := \{\{u, v\} \in E(T) : u < v \text{ and } v \not\le h'\}$
3. $E_{in} := E(T) \setminus (E_{under} \cup E_{out})$

Recall that the modification of $S$ into $R$ only involves the subtree $S'$, therefore $M_{T,S}(v)$ is unchanged for every $v$ that occurs in $E_{under}$ and $E_{out}$. Hence it suffices to evaluate (3) on $E_{in}$ only. However we have already assumed that $\sum_{i=1}^{n} DC(T'_i, S') > \sum_{i=1}^{n} DC(T'_i, R')$, therefore (3) holds. Overall we have that $R$ has a lower deep coalescence cost, contradicting the assumption that $S$ is a solution for $I$.

Theorem 2 implies that every internal node of the strict consensus tree defines an independent subproblem, and solutions of these subproblems can be combined to give a solution to the original deep coalescence consensus tree problem. This leads to the following general divide-and-conquer method that improves an existing search algorithm.

Method 1 Deep coalescence consensus tree method
procedure DCConsensusTreeMethod($I$)
    Input: A DC consensus tree problem instance $I = (T_1, \dots, T_n)$, an external program DCSOLVER.
    Output: A candidate solution $T$ for $I$
    $H \leftarrow StrictCon(I)$
    for all internal node $h$ of $H$ do
        $I_h \leftarrow Cut^H_h(I)$
        $S_h \leftarrow$ DCSOLVER($I_h$)
        Refine the children of $h$ on $H$ by the tree $S_h$
    end for
    return $H$
end procedure

Results

We used simulation experiments to (i) test if the solutions obtained from efficient heuristics presented in [13]

Figure 6 Running example for the definitions and proof of Theorem 2. A running example for the definitions and proof of Theorem 2. Arrows are marked by numbers 1 to 6, demonstrating the steps of the proof. Each step is explained below: (1) Given an instance $I = (T_1, \dots, T_n)$, let $H$ be the strict consensus tree of $I$. An internal node of $H$ and its four children are shown. (2) Let $S$ be a solution for $I$, having the optimal deep coalescence cost. (3) Cut trees via $H$ and $h$, obtaining $Cut^H_h(I)$ and $Cut^H_h(S) = S'$.
Let $A, B, C, D$ be the leaf sets of each subtree. (4) We assume by contradiction that $S'$ is not a solution for $Cut^H_h(I)$, and so we let $R'$ be a solution for $Cut^H_h(I)$. (5 and 6) Modify $S$ to obtain $R$, by replacing $S'$ with $R'$.

display the Pareto property, and (ii) compare the performance of our new divide-and-conquer approach based on the Pareto property to the generic heuristic in [13].

Experiment results 1

First, to examine if subtree pruning and regrafting (SPR) heuristic solutions from [13] display the Pareto property, we generated a series of four 14-taxon trees that share few clusters. To do this, we first generated random 11-taxon trees. Next, we generated random 4-taxon trees containing the species 11-14. We then replaced one of the leaves in the 11-taxon tree with the random 4-taxon tree. This procedure produces gene trees that share at least a single 4-taxon cluster in common. Although this simulation does not reflect a biological process, it represents extreme cases of error or incongruence among gene trees. In three cases with the 14-taxon gene trees, we found that the SPR heuristic did not return a result that contained the consensus cluster. In these cases, our proof demonstrates that there exists a better solution that also contains the consensus cluster. However, the failure of the SPR heuristic in these cases appears to depend on the starting tree; these data sets did not fail with all starting trees. Thus, the shortcomings of the SPR heuristic may be ameliorated by performing multiple runs from different starting trees.

Experiment results 2

We next evaluated the efficacy and scalability of Method 1 and compared it to the standalone SPR heuristic. We generated sets of gene trees, each with different consensus tree structures (depths and branch factors) as follows. The depth of a tree is the maximum number of edges from the root to a leaf, and the branch factor of a tree is the maximum degree of the nodes. For each depth $d$ and branch factor $b$, we first generated a complete $b$-ary tree of depth $d$, denoted $C_{d,b}$. This tree represents the consensus tree.
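The construction of $C_{d,b}$, together with the random binary refinements used as gene trees in the next step, might be sketched as follows. The text does not say how the random refinements were drawn, so the pairwise-joining scheme below is an assumption made for this example (it is not guaranteed to be uniform over refinements), as are the nested-tuple encoding, the leaf names, and the choice to count depth from the top node rather than from a root edge.

```python
import random
from typing import List, Tuple, Union

# Hypothetical encoding: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def complete_tree(depth: int, branch: int) -> Tree:
    """Complete `branch`-ary tree whose leaves are all `depth` edges below the top node."""
    counter = iter(range(1, branch ** depth + 1))

    def build(levels_left: int) -> Tree:
        if levels_left == 0:
            return f"t{next(counter)}"
        return tuple(build(levels_left - 1) for _ in range(branch))

    return build(depth)


def random_binary_refinement(tree: Tree, rng: random.Random) -> Tree:
    """Resolve every multifurcation by repeatedly joining two random sibling subtrees."""
    if isinstance(tree, str):
        return tree
    parts: List[Tree] = [random_binary_refinement(child, rng) for child in tree]
    while len(parts) > 1:
        i, j = sorted(rng.sample(range(len(parts)), 2))
        right = parts.pop(j)  # pop the larger index first so that `i` stays valid
        left = parts.pop(i)
        parts.append((left, right))
    return parts[0]


rng = random.Random(0)
consensus = complete_tree(depth=2, branch=3)  # 9 leaves
gene_trees = [random_binary_refinement(consensus, rng) for _ in range(20)]
```

Because each multifurcation is resolved only among its own children, every cluster of the consensus tree survives in every generated gene tree, which is the property the experiment relies on.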
We used depths of 2-5, and branch factors of 3-30. For each $C_{d,b}$, we then generated 10 sets of 20 random gene trees, such that each gene tree is a binary refinement of $C_{d,b}$. Each set of input trees was given as input to Method 1, using [13] as the external deep coalescence solver. For comparison, we ran the same data sets using [13] as the standalone deep coalescence solver. We calculated the deep coalescence score for each output species tree, and we report the average score of 10 profiles as the score for each $C_{d,b}$. We also measured and recorded the average runtime of each run. We terminated the execution of the standalone solver if the runtime exceeded two minutes, and in this case the results are not shown. In general, Figure 7 shows that the scores of the trees were very similar from Method 1 and the standalone SPR heuristic. Thus, Method 1 does not appear to improve the quality of the deep coalescence species trees. However, Method 1 shows extreme improvements in the runtime, especially as the branch factors increase.

Experiment results 3

Finally, we examined the performance of Method 1 and compared it to the standalone SPR heuristic using more biologically plausible coalescence simulations. We followed the general structure of the coalescence simulation protocol described by Maddison and Knowles [9]. First, we generated 40 256-taxon species trees based on a Yule pure-birth process using the r8s software package [19]. To transform the branch lengths from the Yule simulation to represent generations, we multiplied them all by 1,000,000. Next, we simulated coalescence within each species tree (assuming no migration or hybridization) using Mesquite [20]. All simulations produced a single gene copy from each species. For each species tree, we simulated 20 gene trees assuming a constant population size. The population size affects the number of deep coalescence events, with larger populations leading to more incomplete lineage sorting and consequently less agreement among the gene trees. Thus, to incorporate different levels of incomplete lineage sorting, for 20 of the species trees, we used a constant population size of 10,000, and for 20 we used a constant population size of 100,000. Thus, in total, we produced 40 sets of 20 gene trees, with each set simulated from a different 256-taxon species tree.
For each data set, we performed phylogenetic analyses using Method 1 and also using only the SPR heuristic from Bansal et al. [13]. In contrast to the simulations in Experiment 1, the standalone SPR heuristic of Bansal et al. [13] always returned species trees with all consensus clusters. Of course, all solutions from Method 1 must display the Pareto property. The deep coalescence reconciliation scores for the best trees were similar with both algorithms. When the population size was 10,000, the average coalescence cost was 279, and all the gene trees shared an average of 29.4 clusters. In 19 out of the 20 of these simulations, both approaches produced the same results, while in one case, Method 1 found a species tree with one fewer implied deep coalescence event. When the population size was 100,000, the average coalescence cost was 2142, and all the gene trees shared an average of 19.1 clusters. Although the reconciliation cost never differed by more than 15, Method 1 had a better score in 6 replicates, and the standalone SPR had a better score in 11 replicates. All analyses finished within 30 seconds on a laptop PC, but Method 1 was always faster than SPR alone.

Discussion

In addition to offering a biologically informed optimality criterion to resolve incongruence among gene trees, we prove that the deep coalescence problem also is guaranteed to retain the phylogenetic clusters for which all gene trees agree. Since the deep coalescence problem is NP-hard [10], most meaningful instances will require heuristics to estimate a solution. We demonstrate that the Pareto property can be leveraged to vastly improve upon the running time of heuristics. Method 1 represents a new general approach to phylogenetic algorithms. In most cases, heuristics to estimate solutions for phylogenetic inference problems are

Figure 7 Deep coalescence score and runtime results for Experiment 2. Legend: blue represents Method 1 (divide and conquer) and orange represents standalone SPR heuristic.
based on a few generic search strategies such as the local search heuristics based on nearest neighbor interchange (NNI), SPR, or tree bisection and reconnection (TBR) branch swapping. Although these search strategies often appear to perform well, they are not connected to any specific phylogenetic problems or optimality criteria. Ideally, however, efficient and effective heuristics should be tailored to the properties of the phylogenetic problem. In the case of the deep coalescence consensus tree problem, the Pareto property provides an informative guiding constraint for the tree search. Specifically, when considering possible solutions, we need only consider solutions that contain all clusters shared by all of the input gene trees, or, in other words, that refine the strict consensus of the input gene trees.

Still, our simulation experiments suggest that, in many cases, the SPR local search heuristic described by Bansal et al. [13] performs well. While we identified cases in which the estimate from the SPR heuristic did not contain the Pareto clusters, in most cases SPR alone found trees as good as, or even slightly better than, those from Method 1. We note that the size of the simulated coalescence data set, 256 taxa, exceeds the size of the largest published analysis of the deep coalescence consensus tree problem and is far beyond the largest instances (8 taxa) from which exact solutions have been calculated [11], and the SPR heuristic found good solutions within 30 seconds. Still, running time for the SPR heuristic does not always scale well, and the results of Experiment 2 suggest that it might not be tractable for extremely large data sets. In these cases, in practice Method 1 may vastly improve upon the running time, while guaranteeing a solution with the Pareto property.
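To make the divide-and-conquer idea behind Method 1 concrete, here is a minimal Python sketch under several assumptions. The cut step is our reading of the Figure 6 caption: each gene tree is restricted to the subtree for a consensus cluster, and each child cluster is collapsed to one placeholder leaf. The exhaustive `brute_force_solver` is a hypothetical stand-in for the external DCSOLVER, usable only for a handful of children. The nested-tuple encoding, the recursive formulation, and all names are illustrative, and all trees are assumed to be binary over one shared leaf set.

```python
from itertools import combinations


def leaf_set(t):
    return frozenset([t]) if isinstance(t, str) else frozenset().union(*map(leaf_set, t))


def clusters(t):
    own = {leaf_set(t)}
    return own if isinstance(t, str) else own.union(*map(clusters, t))


def dc_cost(gene, species):
    """Deep coalescence cost in the path-length form (root edge omitted)."""
    depth = {}

    def index(node, d):
        depth[leaf_set(node)] = d
        if not isinstance(node, str):
            for child in node:
                index(child, d + 1)

    index(species, 0)

    def lca_depth(x):
        return max(d for c, d in depth.items() if x <= c)

    def walk(node):
        if isinstance(node, str):
            return 0
        top = lca_depth(leaf_set(node))
        return sum(lca_depth(leaf_set(c)) - top + walk(c) for c in node)

    return walk(gene)


def all_binary_trees(labels):
    """Enumerate every rooted binary tree over `labels` (tiny inputs only)."""
    first, rest = labels[0], labels[1:]
    if not rest:
        yield first
        return
    for k in range(len(rest)):  # `first` always goes left, which avoids mirror images
        for side in combinations(rest, k):
            others = [x for x in rest if x not in side]
            for left in all_binary_trees([first, *side]):
                for right in all_binary_trees(others):
                    yield (left, right)


def brute_force_solver(instance):
    """Hypothetical stand-in for DCSOLVER: exhaustive search over binary trees."""
    labels = sorted(leaf_set(instance[0]))
    return min(all_binary_trees(labels), key=lambda s: sum(dc_cost(t, s) for t in instance))


def find(tree, h):
    """Subtree of `tree` whose leaf set is exactly the cluster `h`."""
    if leaf_set(tree) == h:
        return tree
    return next(find(c, h) for c in tree if h <= leaf_set(c))


def collapse(tree, names):
    """Replace each subtree whose leaf set is a key of `names` by its placeholder leaf."""
    ls = leaf_set(tree)
    return names[ls] if ls in names else tuple(collapse(c, names) for c in tree)


def expand(tree, solved):
    """Substitute the solved subtrees back for their placeholder leaves."""
    return solved[tree] if isinstance(tree, str) else tuple(expand(c, solved) for c in tree)


def dc_consensus_tree_method(gene_trees, solver=brute_force_solver):
    shared = set.intersection(*(clusters(t) for t in gene_trees))

    def refine(h):
        if len(h) == 1:
            return next(iter(h))
        inside = [c for c in shared if c < h]
        # Children of h in the strict consensus: maximal shared clusters inside h.
        kids = [c for c in inside if not any(c < other for other in inside)]
        names = {kid: f"#{i}" for i, kid in enumerate(kids)}
        local = solver([collapse(find(t, h), names) for t in gene_trees])
        return expand(local, {name: refine(kid) for kid, name in names.items()})

    return refine(leaf_set(gene_trees[0]))


gene_trees = [((("a", "b"), "c"), ("d", "e")), ((("a", "b"), "d"), ("c", "e"))]
print(dc_consensus_tree_method(gene_trees))
```

Each call to `refine` touches only one node of the strict consensus, so the subproblems can be handed to any solver independently; with the exhaustive solver the cost grows super-exponentially in the number of children, which is why a heuristic such as the SPR search of [13] would be substituted in practice.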
Further, Theorem 2 shows that the deep coalescence consensus tree problem exhibits independent optimal substructures. This implies that, once we compute the strict consensus tree of the problem instance, the rest of Method 1 can be directly parallelized, regardless of which external deep coalescence solver is used. In the case where the external solver guarantees exact solutions, our method would also give exact solutions, but can potentially solve instances with a much larger number of taxa compared to running the external solver alone.

Although the Pareto property for the deep coalescence consensus tree problem is desirable, and the divide-and-conquer method is promising for large-scale analyses, there are limitations to their use. First, the Pareto property and Method 1 are limited to the consensus case, or, instances in which all of the input gene trees contain sequences from all of the species. Also, the Pareto property is only useful when all input trees share some clusters in common. If there are no consensus clusters among the input trees, then Method 1 conveys no runtime benefits. While this may seem like an extreme case, it is possible with high levels of incomplete lineage sorting, or, perhaps more likely, much error in the gene tree estimates. Also, as we add more and more gene trees, we would expect more instances of conflict among the gene trees, potentially converging towards the elimination of consensus clusters. Than and Rosenberg [21] recently proved the existence of cases in which the deep coalescence problem is inconsistent, or converges on the wrong species tree estimate with increasing gene tree data. Although inconsistency is concerning, the Pareto property provides some reassurance. Even in a worst-case scenario in which the deep coalescence problem is misled, the optimal solutions will still contain all of the agreed-upon clades from the gene trees. Perhaps the greatest advantage of the deep coalescence problem, especially compared to likelihood and Bayesian approaches that infer species trees based on coalescence models (e.g., [22-24]), is its computational speed and the feasibility of estimating a species tree from large-scale genomic data sets representing hundreds or even thousands of taxa [13]. Not only can our method improve
the performance of any existing heuristic, but the Pareto property also describes a limited subset of possible species trees that must contain the optimal solution.

Conclusions

We prove that the deep coalescence consensus tree problem satisfies the Pareto property for clusters and describe an efficient algorithm that, given a candidate solution that does not display the consensus clusters, transforms the solution so that it includes all the consensus clusters and has a lower deep coalescence cost. We extend the result and prove that the problem exhibits optimal substructures based on the strict consensus tree of the input gene trees. Based on this property, we suggest a new, parallelizable tree search method, in which we refine the strict consensus of the input gene trees. In contrast to previously proposed heuristics, this method guarantees that the proposed solution will contain the Pareto clusters. Also, as our experiments demonstrate, this method can greatly improve the speed of deep coalescence tree heuristics, potentially enabling efficient and effective estimates from inputs with thousands of taxa.

Additional material

Additional file 1: Omitted proofs in the main manuscript.

List of abbreviations used

LCA: least common ancestor; SPR: subtree pruning and regrafting; NNI: nearest neighbor interchange; TBR: tree bisection and reconnection

Acknowledgements

The authors would like to thank our anonymous reviewers who have provided valuable comments, as well as providing a simpler proof of Lemma 1. This work was conducted with support from the Gene Tree Reconciliation Working Group at NIMBioS through NSF award #EF-0832858, with additional support from the University of Tennessee. HTL and OE were supported in part by NSF awards #0830012 and #10117189.

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.
Author details

1Department of Computer Science, Iowa State University, Ames, IA, USA. 2National Evolutionary Synthesis Center, Durham, NC, USA; University of Florida, Gainesville, FL, USA.

Authors' contributions

HTL and OE were responsible for theory development and algorithm design. HTL implemented the programs. HTL and JGB designed and conducted simulation experiments, and JGB led the analysis of the results. All authors contributed to the writing of this manuscript, and have read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Published: 25 June 2012

References

1. Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 2003, 425(6960):798-804.
2. Pollard DA, Iyer VN, Moses AM, Eisen MB: Widespread Discordance of Gene Trees with Species Tree in Drosophila: Evidence for Incomplete Lineage Sorting. PLoS Genet 2006, 2(10):e173.
3. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the Gene Lineage into its Species Lineage, a Parsimony Strategy Illustrated by Cladograms Constructed from Globin Sequences. Systematic Zoology 1979, 28(2):132-163.
4. Maddison WP: Gene Trees in Species Trees. Systematic Biology 1997, 46(3):523-536.
5. Nichols R: Gene trees and species trees are not the same. Trends in Ecology & Evolution 2001, 16(7):358-364.
6. Edwards SV: Is a new and general theory of molecular systematics emerging? Evolution; International Journal of Organic Evolution 2009, 63:1-19.
7. Knowles LL: Estimating Species Trees: Methods of Phylogenetic Analysis When There Is Incongruence across Genes. Systematic Biology 2009, 58(5):463-467.
8. Yu Y, Warnow T, Nakhleh L: Algorithms for MDC-based multi-locus phylogeny inference. Proceedings of the 15th Annual international conference on Research in computational molecular biology, RECOMB, Berlin, Heidelberg: Springer-Verlag; 2011, 531-545.
9. Maddison WP, Knowles LL: Inferring Phylogeny Despite Incomplete Lineage Sorting. Systematic Biology 2006, 55:21-30.
10. Zhang L: From gene trees to species trees II: Species tree inference in the deep coalescence model.
IEEE/ACM Trans Comput Biol Bioinformatics 2011, 8(6):1685-1691.
11. Than C, Nakhleh L: Species Tree Inference by Minimizing Deep Coalescences. PLoS Computational Biology 2009, 5(9):e1000501.
12. Than C, Nakhleh L: Estimating species trees: Practical and Theoretical Aspects. Wiley-VCH, Chichester 2010 chap. Inference of parsimonious species tree phylogenies from multilocus data by minimizing deep coalescences; 79-98.
13. Bansal M, Burleigh JG, Eulenstein O: Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models. BMC Bioinformatics 2010, 11(Suppl 1):S42.
14. Bininda-Emonds ORP: Phylogenetic supertrees: combining information to reveal the Tree of Life. Springer; 2004.
15. Bryant D: A classification of consensus methods for phylogenies. BioConsensus, DIMACS. AMS 2003, 163-184.
16. Wilkinson M, Cotton JA, Lapointe F, Pisani D: Properties of Supertree Methods in the Consensus Setting. Systematic Biology 2007, 56(2):330-337.
17. Wilkinson M, Thorley J, Pisani D, Lapointe FJ, McInerney J: Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Springer, Dordrecht, the Netherlands 2004 chap. Some desiderata for liberal supertrees; 227-246.
18. McMorris FR, Meronk DB, Neumann DA: A view of some consensus methods for trees. Numerical Taxonomy 1983, 122-125.
19. Sanderson MJ: r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics (Oxford, England) 2003, 19(2):301-302.
20. Maddison WP, Maddison D: Mesquite: a modular system for evolutionary analysis. 2001 [http://mesquiteproject.org].
21. Than CV, Rosenberg NA: Consistency properties of species tree inference by minimizing deep coalescences. Journal of Computational Biology 2011, 18:1-15.
22. Liu L: BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics 2008, 24(21):2542-2543.
23. Kubatko LS, Carstens BC, Knowles LL: STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics 2009, 25(7):971-973.
24. Heled J, Drummond AJ: Bayesian Inference of Species Trees from Multilocus Data.
Molecular Biology and Evolution 2010, 27(3):570-580.

doi:10.1186/1471-2105-13-S10-S12
Cite this article as: Lin et al.: Consensus properties for the deep coalescence problem and their application for scalable tree search. BMC Bioinformatics 2012 13(Suppl 10):S12.

Omitted proofs in the main manuscript
Harris T. Lin 1, J. Gordon Burleigh 2 and Oliver Eulenstein 1
1 Department of Computer Science, Iowa State University, Ames, IA, USA
2 National Evolutionary Synthesis Center, Durham, NC, USA; University of Florida, Gainesville, FL, USA
Email: Harris T. Lin htlin@iastate.edu; J. Gordon Burleigh gburleigh@ufl.edu; Oliver Eulenstein oeulenst@iastate.edu (corresponding author)

A Relationship between deep coalescence and extra lineage
We show the differences in our definition of deep coalescence given in Def. 3 and its original definition termed extra lineage given in [1].
Definition 1. The Boolean value of a statement, denoted by enclosing the statement in [[ ]], is 1 if the statement is true, 0 otherwise.
Definition 2 (Edge Coverage). Let {u', v'} ∈ E(S) and u'
Proof. EL(T, S) = Σ over {u', v'} ∈ E(S), u'

D Proof of Lemma 4
Proof. Since A, B ⊆ X, Lemma 2 applies on S with X̄, where S' now becomes S(X). Therefore we have pl_S(A, B) = pl_{S(X)}(A, B). Further, by R1 we know that S|X is a subtree of R, so we have pl_{S|X}(A, B) = pl_R(A, B) because A, B ⊆ X. Lastly by the definition of restriction, S|X is obtained from S(X) by suppressing some nodes, therefore a path in S(X) can only be made shorter in S|X, and so we have pl_{S(X)}(A, B) ≥ pl_{S|X}(A, B). Overall, we have pl_S(A, B) = pl_{S(X)}(A, B) ≥ pl_{S|X}(A, B) = pl_R(A, B).

References
1. Maddison WP: Gene Trees in Species Trees. Systematic Biology 1997, 46:523-536.
Proceedings

Consensus properties for the deep coalescence problem and their application for scalable tree search

Harris T Lin (1), J Gordon Burleigh (2), Oliver Eulenstein (1, corresponding author)
htlin@iastate.edu; gburleigh@ufl.edu; oeulenst@iastate.edu

1 Department of Computer Science, Iowa State University, Ames, IA, USA
2 National Evolutionary Synthesis Center, Durham, NC, USA; University of Florida, Gainesville, FL, USA

BMC Bioinformatics 2012, 13(Suppl 10):S12. ISSN 1471-2105. doi:10.1186/1471-2105-13-S10-S12. http://www.biomedcentral.com/14712105/13/S10/S12. Published 25 June 2012.
Supplement: Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11). Editors: Jianer Chen, Ion Mandoiu, Raj Sunderraman, Jianxin Wang and Alexander Zelikovsky. Conference: 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11), Changsha, China, 27-29 May 2011, http://www.cs.gsu.edu/isbra11/

© 2012 Lin et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background
To infer a species phylogeny from unlinked genes, phylogenetic inference methods must confront the biological processes that create incongruence between gene trees and the species phylogeny. Intraspecific gene variation in ancestral species can result in deep coalescence, also known as incomplete lineage sorting, which creates incongruence between gene trees and the species tree.
One approach to account for deep coalescence in phylogenetic analyses is the deep coalescence problem, which takes a collection of gene trees and seeks the species tree that implies the fewest deep coalescence events. Although this approach is promising for phylogenetics, the consensus properties of this problem are mostly unknown and analyses of large data sets may be computationally prohibitive.

Results
We prove that the deep coalescence consensus tree problem satisfies the highly desirable Pareto property for clusters (clades). That is, in all instances, each cluster that is present in all of the input gene trees, called a consensus cluster, will also be found in every optimal solution. Moreover, we introduce a new divide and conquer method for the deep coalescence problem based on the Pareto property. This method refines the strict consensus of the input gene trees, thereby, in practice, often greatly reducing the complexity of the tree search and guaranteeing that the estimated species tree will satisfy the Pareto property.

Conclusions
Analyses of both simulated and empirical data sets demonstrate that the divide and conquer method can greatly improve upon the speed of heuristics that do not consider the Pareto consensus property, while also guaranteeing that the proposed solution fulfills the Pareto property. The divide and conquer method extends the utility of the deep coalescence problem to data sets with enormous numbers of taxa.

Introduction
The rapidly growing abundance of genomic sequence data has revealed extensive incongruence among gene trees (e.g., [1,2]) that may be caused by processes such as deep coalescence (incomplete lineage sorting), gene duplication and loss, or lateral gene transfer (see [3-5]). In these cases, phylogenetic methods must account for and explain the patterns of variation among gene tree topologies, rather than simply assuming the gene tree topology reflects the relationships among species.
In particular, there has been much recent interest in phylogenetic methods that account for deep coalescence, which may occur in any sexually reproducing organisms (e.g., [6-8]). One such approach is the deep coalescence problem, which, given a collection of gene trees, seeks a species tree that minimizes the number of deep coalescence events [4,9]. Although the deep coalescence problem is NP-hard [10], recent algorithmic advances enable scientists to solve instances with a small number of taxa [11,12] and efficiently compute heuristic solutions for data sets with slightly more species [13]. Still, the heuristics are based on generic local tree search strategies with no performance guarantees, and they cannot handle enormous data sets. In this study, we prove that the deep coalescence problem satisfies the Pareto consensus property. We then describe a new divide and conquer approach, based on the Pareto property, that, in practice, can greatly extend the utility of existing heuristics while guaranteeing that the inferred species tree also has the Pareto property with respect to the input gene trees.

Related work
The deep coalescence problem is an example of a supertree problem, in which input trees with taxonomic overlap are combined to build a species tree that includes all of the taxa found in the input trees (see [14]). In fact, it is among the few supertree methods that use a biologically based optimality criterion. One way of evaluating supertree methods is by characterizing their consensus properties (e.g., [15,16]). The consensus tree problem is the special case of the supertree problem in which all the input trees contain the same taxa. Since all supertree problems generally seek to retain phylogenetic information from the input trees, one of the most desirable consensus properties is the Pareto property. A consensus tree problem satisfies the Pareto property on clusters (or triplets, quartets, etc.)
if every cluster (or triplet, quartet, etc.) that is present in every input tree appears in the consensus tree [15-17]. Many supertree problems satisfy the Pareto property for clusters in the consensus setting [15,16]. However, this has not been shown for the deep coalescence problem.

Our contributions
We prove that the deep coalescence consensus tree problem satisfies the Pareto property for clusters. This result provides useful guidance for the species tree search. Instead of evaluating all possible species trees, to find the optimal solution we need only examine trees that satisfy the Pareto property on clusters. These trees will all be refinements of the strict consensus of the gene trees. Furthermore, the Pareto property allows us to show that the problem can be divided into smaller independent subproblems based on the strict consensus tree. We apply this property and describe a new divide and conquer method, and our experiments demonstrate that this method can greatly improve the speed of deep coalescence tree heuristics, potentially enabling efficient and effective estimates from inputs with several thousands of taxa. Future work will exploit the independence of the subproblems and solve these on parallel machines, which should result in even larger and more accurate solutions.

Methods

Basic definitions, notations, and preliminaries
In this section we introduce basic definitions and notations and then define preliminaries required for this work. For brevity some proofs are omitted in the text but available in Additional file 1 (Omitted proofs in the main manuscript; file name 1471210513S10S12S1.pdf).

A graph G is an ordered pair (V, E) consisting of a nonempty set V of nodes and a set E of edges. We denote the set of nodes and edges of G by V(G) and E(G), respectively. If e = {u, v} is an edge of a graph G, then e is said to be incident with u and v.
If v is a node of a graph G, then the degree of v in G is the number of edges in G that are incident with v. A tree T is a connected graph with no cycles. T is rooted if it has exactly one distinguished node of degree one, called the root, and we denote it by Ro(T). The unique edge incident with Ro(T) is called the root edge.

Let T be a rooted tree. We define ≤_T to be the partial order on V(T) where x ≤_T y if y is a node on the path between Ro(T) and x. If x ≤_T y we call x a descendant of y, and y an ancestor of x. We also define x <_T y to mean x ≤_T y and x ≠ y. The set of minima under ≤_T is denoted by Le(T) and its elements are called leaves. A node is internal if it is not a leaf. The set of all internal nodes of T is denoted by I(T). Further, we will frequently refer to the subset of nodes in I(T) that have degree two, and we denote this by I_2(T).

Let X ⊆ Le(T), we write X̄ to denote the leaf complement of X when the tree T is clear from the context, where X̄ = Le(T) \ X.

If {x, y} ∈ E(T) and x <_T y, then we call y the parent of x, denoted by Pa_T(x), and we call x a child of y. The set of all children of y is denoted by Ch_T(y). If two nodes in T have the same parent, they are called siblings. The least common ancestor (LCA) of a nonempty subset X ⊆ V(T), denoted as lca_T(X), is the unique smallest upper bound of X under ≤_T.

If e ∈ E(T), we define T/e to be the tree obtained from T by identifying the ends of e and then deleting e. T/e is said to be obtained from T by contracting e. If v is a vertex of T with degree one or two, and e is an edge incident with v, the tree T/e is said to be obtained from T by suppressing v.
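The definitions above can be made concrete with a small sketch. It is only an illustration under assumptions that are not in the paper: the tree is stored as a hypothetical child-to-parent dictionary, the node names are arbitrary, and the helper names (`ancestors`, `leaves`, `lca`) are invented here.

```python
# Hypothetical representation (not from the paper): a rooted tree stored as a
# child -> parent map. The root is the only node without an entry.

def ancestors(parent, x):
    """Return [x, Pa(x), Pa(Pa(x)), ..., root]; every y in it satisfies x <=_T y."""
    path = [x]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def leaves(parent):
    """Le(T): the nodes that are not the parent of any node."""
    nodes = set(parent) | set(parent.values())
    return nodes - set(parent.values())

def lca(parent, xs):
    """lca_T(X): the smallest node that is an upper bound of every x in X."""
    xs = list(xs)
    common = set(ancestors(parent, xs[0]))
    for x in xs[1:]:
        common &= set(ancestors(parent, x))
    # Walking up from xs[0], the first common ancestor met is the smallest one.
    for y in ancestors(parent, xs[0]):
        if y in common:
            return y

# The tree ((a,b),(c,d)); r is the degree-one root, {t, r} is the root edge.
parent = {"a": "u", "b": "u", "c": "w", "d": "w", "u": "t", "w": "t", "t": "r"}

print(sorted(leaves(parent)))
print(lca(parent, ["a", "b"]), lca(parent, ["a", "c"]))
```

A parent map keeps the order ≤_T explicit (following parent pointers walks up the order), which is why it was chosen for this sketch; any other tree encoding would serve equally well.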
Examples of the following definitions are shown in Figure 1. Let X ⊆ V(T), the subtree of T induced by X, denoted T(X), is the minimal connected subtree of T that contains Ro(T) and X. The restricted subtree of T induced by X, denoted as T_X, is the tree obtained from T(X) by suppressing all nodes with degree two. The subtree of T rooted above node v ∈ V(T), denoted as T_v, is the restricted subtree induced by {u ∈ V(T) : u ≤_T v}.

Figure 1. Examples of tree definitions. (a) A rooted tree T with four leaves {a, b, c, d}. (b) The subtree of T induced by X where X = {a, b}. (c) The restricted subtree of T induced by X.

T is binary if every node has degree one or three. Throughout this paper, the term tree refers to a rooted binary tree unless otherwise stated. Also, the subscript of a notation may be omitted when it is clear from the context.

Deep coalescence
We define the deep coalescence cost function as demonstrated in Figure 2. Note that our definition of the deep coalescence cost, given in Def. 3, is somewhat different from, but for our purposes equivalent to, its original definition, also termed extra lineage, given in [4]. The relationship between both definitions is shown in Additional file 1.

Figure 2. Example of deep coalescence cost definition. Example showing the deep coalescence cost from T to S. Each edge of T is accompanied by its cost, and its corresponding path is shown on S.

Throughout this section we assume T and S are trees over the same leaf set.

Definition 1 (Path length). Suppose x ≤_T y, the path length from x to y, denoted pl_T(x, y), is the number of edges in the path from x to y. Further, let X ⊆ Y ⊆ Le(T), we extend the path length function by pl_T(X, Y) ≜ pl_T(lca_T(X), lca_T(Y)).

Definition 2 (LCA mapping).
Let v ∈ V(T), the LCA mapping of v in S, denoted M_{T⊳S}(v), is defined by M_{T⊳S}(v) ≜ lca_S(Le(T_v)).

Definition 3 (Deep coalescence). The deep coalescence cost from T to S, denoted DC(T, S), is
\[ DC(T, S) \triangleq \sum_{\{u,v\} \in E(T),\; u <_T v} pl_S\bigl(M_{T \rhd S}(u),\, M_{T \rhd S}(v)\bigr) \]
Using the extended path lengths, the deep coalescence cost can be equivalently expressed as
\[ DC(T, S) = \sum_{\{u,v\} \in E(T),\; u <_T v} pl_S\bigl(Le(T_u),\, Le(T_v)\bigr) \]

Consensus tree
Definition 4 (Consensus tree problem). Let f : 𝒯_X × 𝒯_X → ℝ be a cost function where X is a leaf set and 𝒯_X is the set of all trees over X. A consensus tree problem based on f is defined as follows.
Instance: A tuple of n trees (T_1, ..., T_n) over X
Find: The set of all trees that have the minimum aggregated cost with respect to f. Formally,
\[ \operatorname*{argmin}_{S \in \mathcal{T}_X} \sum_{i=1}^{n} f(T_i, S) \]
This set is also called the solutions for the consensus tree instance.

Definition 5 (Deep coalescence consensus tree problem). We define the deep coalescence consensus tree problem to be the consensus tree problem based on the deep coalescence cost function.

Cluster and Pareto
Definition 6 (Cluster). Let T be a tree, the clusters induced by T, denoted Cl(T), is Cl(T) ≜ {Le(T_v) : v ∈ V(T)}. Further, X ∈ Cl(T) is called a trivial cluster if X = Le(T) or |X| = 1, it is called nontrivial otherwise. Let Y ⊆ Le(T), we say that T contains (cluster) Y if Y ∈ Cl(T).

Definition 7 (Pareto on clusters). Let P be a consensus tree problem based on some cost function.
We say that P is Pareto on clusters if: for all instances I = (T_1, ..., T_n) of P, for all solutions S of I, we have ⋂_{i=1}^{n} Cl(T_i) ⊆ Cl(S).

Theorem overview
We wish to show that the deep coalescence consensus tree problem is Pareto on clusters. We describe a high level structure of the proof in this section and provide necessary supporting lemmata in the next section. The proof proceeds by contradiction, assuming that the deep coalescence consensus tree problem is not Pareto on clusters. By Def. 7, the assumption implies that there exists an instance I = (T_1, ..., T_n), a solution S for I, and a cluster X ⊆ Le(S) where X ∈ ⋂_{i=1}^{n} Cl(T_i) but X ∉ Cl(S). S being a solution for I implies, by Def. 4, that the aggregated deep coalescence cost, i.e. ∑_{i=1}^{n} DC(T_i, S), is minimized. Then, based on the existence of the cluster X, we edit S and form a new tree R using a tree edit operation which will be introduced in the next section. The properties of this new operation, together with the properties of X (proved in the next section), provide the key ingredients to calculate the changes in deep coalescence costs. With some further arithmetic, this allows us to conclude that R in fact has a smaller aggregated deep coalescence cost, i.e. ∑_{i=1}^{n} DC(T_i, S) > ∑_{i=1}^{n} DC(T_i, R), hence contradicting the assumption that S is a solution for I.

Supporting lemmata

Shallowest regrouping operation
In this section we formally define the new tree edit operation that forms the key part of the theorem. We begin with some useful definitions related to the depth of nodes. An example of this operation is shown in Figure 3.

Figure 3. Example of the shallowest regrouping operation of S by X where X = {a, c, d}.
The intermediate tree S' = S(X̄) shows its two shallowest degree-two nodes v_1 and v_2. R_1 and R_2 are the resulting trees of this operation. That is, Γ̂(S, X) = {R_1, R_2} where R_1 = Γ(S, X, v_1) and R_2 = Γ(S, X, v_2).

Definition 8 (Node depth). The depth of a node v ∈ V(T), denoted dep_T(v), is pl_T(v, Ro(T)).

Definition 9 (Shallowest nodes). Let T be a tree and X ⊆ V(T), the shallowest function, denoted shallowest_T(X), is the set of nodes in X which have the minimum depth among all nodes in X. Formally, we define shallowest_T(X) ≜ argmin_{v ∈ X} dep_T(v).

Now we have the necessary mechanics to define the new tree edit operation. In what follows, we assume S to be a tree, ∅ ⊂ X ⊂ Le(S), and S' = S(X̄).

Definition 10 (Regroup). Let v ∈ I_2(S'). The regrouping operation of S by X on v, denoted Γ(S, X, v), is the tree obtained from S' by
1. (R1) Identify Ro(S_X) and v. In other words we adjoin the root of tree S_X onto the node v.
2. (R2) Suppress all nodes with degree two.

Definition 11 (Shallowest regroup). The shallowest regrouping operation of S by X, denoted Γ̂(S, X), defines a set of trees by Γ̂(S, X) ≜ {Γ(S, X, v) : v ∈ shallowest_{S'}(I_2(S'))}.

As Figure 3 shows, the shallowest regrouping operation pulls apart X from S and regroups X back onto each of the shallowest degree-two nodes of S'.

Counting the number of degree-two nodes
The regrouping operation includes the step of suppressing nodes with degree two. Since this step affects path lengths and ultimately deep coalescence costs, we are required to count carefully the number of degree-two nodes under various conditions. Here we assume that T is a tree and {X, Y} is a bipartition of Le(T).
We begin with two observations that assert existence of degree-two nodes, and assert existence of leaf sets given a degree-two node.

Observation 1. I_2(T(X)) ≠ ∅ and I_2(T(Y)) ≠ ∅.

Observation 2. If v ∈ I_2(T(X)), then Le(T_v) ∩ X ≠ ∅ and Le(T_v) ∩ Y ≠ ∅.

The next Lemma says that if the root of T is the parent of lca(X), then the number of degree-two nodes in T(X) is at least the depth of v, where v is a shallowest degree-two node of T(Y).

Lemma 1. If Pa(lca(X)) = Ro(T) and v ∈ shallowest(I_2(T(Y))), then dep(v) ≤ |I_2(T(X))|.

Proof. Assume the premise. Let n = dep(v), we observe that n ≥ 1 because of the root edge. Figure 4 shows the setup and variable assignments for this proof. Let v = v_1 < ...

Figure 4. Setup and variable assignments for the proof of Lemma 1. Tree showing the variable assignments in the proof of Lemma 1. Dotted lines represent omitted parts of the tree, and triangles represent subtrees.

• v_n = lca(X) because Pa(lca(X)) = Ro(T).
• B ⊆ X and A_1 ∩ Y ≠ ∅, because v is a degree-two node of T(Y).
• A_1 ∩ Y ≠ ∅ implies that A_2, ..., A_n each contains at least an element of Y. For otherwise, each of v_2, ..., v_n becomes a degree-two node in T(Y), contradicting the assumption that v = v_1 is the shallowest degree-two node in T(Y).

In order to obtain T(X), we must prune subtrees in A_1 whose leaves are in Y (which could be the entire subtree A_1). Thus there must be at least one degree-two node in A_1 (or v_1 if A_1 is pruned). Similarly, for 1

Properties of the regrouping operation
We examine some properties of the regrouping operation in this section. In general, these properties show that the path lengths defined by LCAs do not increase under several different assumptions. This preservation of path lengths later assists in the calculation of deep coalescence costs.
Throughout this section, we assume S to be a tree, ∅ ⊂ X ⊂ Le(S), and S' = S(X̄). Further we let R = Γ(S, X, v) where v ∈ I_2(S').

Lemma 2. If A ⊆ B ⊆ Le(S) and B ⊆ X̄, then pl_S(A, B) = pl_{S'}(A, B).

Lemma 3. If A ⊆ B ⊆ Le(S) and B ⊆ X̄, then pl_S(A, B) ≥ pl_R(A, B).

Lemma 4. If A ⊆ B ⊆ Le(S) and B ⊆ X, then pl_S(A, B) ≥ pl_R(A, B).

Lemma 5. If A ⊆ B ⊆ Le(S), A ⊆ X̄, and X ⊆ B, then pl_S(A, B) ≥ pl_R(A, B).

Proof. Let S'' be the tree obtained from S' by identifying Ro(S_X) and v. In other words, S'' is the tree after step (R1) of the regroup operation Γ(S, X, v). We will show that pl_S(A, B) ≥ pl_{S''}(A, B) ≥ pl_R(A, B). We begin with the first inequality. First, since A ⊆ X̄ we know that lca_S(A) = lca_{S'}(A) = lca_{S''}(A). Let x = lca_S(X) and b = lca_S(B), then the assumption of X ⊆ B implies x ≤_S b. Since v has degree two in S', we know that Le(S_v) ∩ X ≠ ∅ (Observation 2), and so v ≤_S x. Now let x'' = lca_{S''}(X) and b'' = lca_{S''}(B). By (R1) we have that x'' ≤_{S''} v, and so x'' ≤_{S''} x, which implies b'' ≤_{S''} b. Furthermore, lca_S(A) = lca_{S''}(A) is a descendant of both b and b'' because A ⊆ B, and hence b'' ≤_{S''} b implies that pl_S(A, B) ≥ pl_{S''}(A, B). Next, by (R2) R is obtained from S'' by suppressing some nodes, therefore a path in S'' can only be made shorter in R, hence we have pl_{S''}(A, B) ≥ pl_R(A, B). Finally, combining the above results we have pl_S(A, B) ≥ pl_R(A, B). □

Main theorem
Theorem 1. The deep coalescence consensus tree problem is Pareto on clusters.

Proof. Assume not for a contradiction, then there exists an instance I = (T_1, ..., T_n), a solution S for I, and a cluster X ⊆ Le(S) where X ∈ ⋂_{i=1}^{n} Cl(T_i) but X ∉ Cl(S).
Since X ∉ Cl(S), X must be nontrivial, therefore Γ̂(S, X) does not contain S and is not empty. Let R ∈ Γ̂(S, X). We will show that (∀ 1 ≤ i ≤ n)(DC(T_i, S) > DC(T_i, R)), which implies ∑_{i=1}^{n} DC(T_i, S) > ∑_{i=1}^{n} DC(T_i, R), contradicting the assumption that S is a solution for I. Let T = T_i where 1 ≤ i ≤ n, we will show that DC(T, S) > DC(T, R). This requires that DC(T, S) − DC(T, R) > 0, in other words
\[ \sum_{\{u,v\} \in E(T),\; u <_T v} \Bigl( pl_S\bigl(Le(T_u), Le(T_v)\bigr) - pl_R\bigl(Le(T_u), Le(T_v)\bigr) \Bigr) > 0 \qquad (1) \]
Since (1) sums over all edges in T, for convenience we partition the edges of T and compute the differences in path lengths for each partition individually. Figure 5 depicts a running example for T, S, and R where X = {a, b, c}.

Figure 5. Running example for the proof of Theorem 1. A running example for the proof of Theorem 1 where T is a tree in the instance tuple I, S is an assumed solution for I, X = {a, b, c}, S' = S(X̄), U = S_{x'}, and R = Γ(S, X, w). Highlighted regions in T are the edge partitions E_1, E_2, and E_3. The rest of edges form the partition E_4. By counting the costs for each partition we have Σ_1 = 6 − 4 = 2, Σ_2 = 1 − 3 = −2, Σ_3 = 2 − 1 = 1, and Σ_4 = 3 − 3 = 0. Overall we have DC(T, S) − DC(T, R) = 1.

We identify some specific nodes in order to partition the edges of T. Let S' = S(X̄), w ∈ I_2(S') where R = Γ(S, X, w). Since X ∉ Cl(S), S' contains at least two nodes with degree two. Let w' ∈ I_2(S') such that w' ≠ w, then S_{w'} contains some leaf y ∉ X (Observation 2). Let x = lca_T(X) and z = lca_T(X ∪ {y}), we partition the edges of T into {E_1, E_2, E_3, E_4} as follows.
1. E_1 ≜ {{u, v} ∈ E(T) : u

We consider (1) for each of the partitions separately. For clarity, we define the aggregated cost difference Σ_i for partition E_i as follows.
\[ \Sigma_i \triangleq \sum_{\{u,v\} \in E_i,\; u <_T v} \Bigl( pl_S\bigl(Le(T_u), Le(T_v)\bigr) - pl_R\bigl(Le(T_u), Le(T_v)\bigr) \Bigr) \]
Hence (1) becomes
\[ \Sigma_1 + \Sigma_2 + \Sigma_3 + \Sigma_4 > 0 \qquad (2) \]
Let x' = lca_S(X) and p = pl_S(w, x') + 1. For each i ∈ {1, 2, 3, 4}, we claim and prove the bound of Σ_i as follows.

Claim 1. Σ_1 ≥ p
Proof. First we observe that the difference for each path length in this partition is ≥ 0 (Lemma 4), so we have Σ_1 ≥ 0. Since x' = lca_S(X), we only need to consider the subtree S_{x'} in computing the path lengths in this partition. Define U = S_{x'}. In particular, the number of degree-two nodes in U(X) gives us a lower bound on the total decreases of path lengths, because these nodes are removed to obtain U_X which is a subtree of R. That is, Σ_1 ≥ |I_2(U(X))|. Lemma 1 applies to U with bipartition {X, Le(U) \ X} and the node w, so we have |I_2(U(X))| ≥ dep_U(w). The depth dep_U(w) is with respect to U, and we relate it to a path length in S by taking away the root edge, that is dep_U(w) − 1 = pl_S(w, x'). Finally, using the definition of p we obtain Σ_1 ≥ |I_2(U(X))| ≥ dep_U(w) = pl_S(w, x') + 1 = p.

Claim 2. Σ_2 = −p
Proof.
\[ \begin{aligned} \Sigma_2 &= pl_S(X, Le(T)) - pl_R(X, Le(T)) \\ &= [\,pl_S(x', Ro(T)) - 1\,] - [\,1 + pl_R(w, x') + pl_R(x', Ro(T)) - 1\,] \\ &= pl_S(x', Ro(T)) - [\,1 + pl_R(w, x') + pl_R(x', Ro(T))\,] \\ &= pl_S(x', Ro(T)) - [\,1 + pl_S(w, x') + pl_S(x', Ro(T))\,] \\ &= -[\,pl_S(w, x') + 1\,] = -p \end{aligned} \]
The fourth equality holds because w is the shallowest degree-two node in S', so that no edges along the path from w to x' are contracted in R, hence pl_R(w, x') = pl_S(w, x').

Claim 3. Σ_3 ≥ 1
Proof. Let {a, b} ∈ E_3 where a < b, A = Le(T_a), and B = Le(T_b). We know that A ⊆ X̄ because otherwise this edge should be in E_1 or E_2.
We consider two cases for B.
1. If B ⊆ X̄, then Lemma 3 applies on S, R, A, B, so pl_S(A, B) − pl_R(A, B) ≥ 0.
2. If X ⊆ B, then Lemma 5 applies on S, R, A, B, so pl_S(A, B) − pl_R(A, B) ≥ 0.
In any case, we have pl_S(A, B) − pl_R(A, B) ≥ 0 for each edge {a, b} ∈ E_3. This implies that Σ_3 ≥ 0. Further, since w' ∈ I_2(S') and w' ≠ w, w' does not exist in R. We also know that y <_S w' <_S lca_S(X ∪ {y}) by the definitions of w' and y. Therefore there exists an edge {a, b} ∈ E_3 such that pl_S(A, B) − pl_R(A, B) ≥ 1. Hence we have Σ_3 ≥ 1.

Claim 4. Σ_4 ≥ 0
Proof. Let {a, b} ∈ E_4 where a < b, A = Le(T_a), and B = Le(T_b). The proof follows from the same argument as in Claim 3 where we have pl_S(A, B) − pl_R(A, B) ≥ 0 for each edge {a, b} ∈ E_4, hence Σ_4 ≥ 0.

Finally, we have Σ_1 + Σ_2 + Σ_3 + Σ_4 ≥ p + (−p) + 1 + 0 = 1 > 0. Hence (2) is satisfied, and so is (1). In sum, we have constructed a tree R and showed that ∑_{i=1}^{n} DC(T_i, S) > ∑_{i=1}^{n} DC(T_i, R), which contradicts the assumption that S is a solution for I, in other words the assumption that S has the minimum aggregated cost with respect to the deep coalescence cost function. □

Algorithm for improving a candidate solution
Algorithm 1 takes a consensus tree problem instance and a candidate solution as inputs. If the candidate solution does not display the consensus clusters, it is transformed into one that includes all of the consensus clusters and has a smaller (better) deep coalescence cost.
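To see what "displaying the consensus clusters" and "a smaller deep coalescence cost" mean on a concrete instance, the following minimal sketch may help. It is not the authors' implementation and it does not perform the regrouping operation; it only computes clusters and the cost of Def. 3. The nested-tuple tree encoding, the function names, and the example trees are all assumptions made for this illustration. The root edge is left out of the encoding because both of its ends map to the same node of S and so it contributes 0 to the cost.

```python
# Hypothetical encoding (not from the paper): a leaf is a string, an internal
# node is a tuple of its children.

def leafset(t):
    """Le(T_v) for the node at the top of t."""
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leafset(c) for c in t))

def clusters(t):
    """Cl(T): the leaf sets below every node of t."""
    out = {leafset(t)}
    if not isinstance(t, str):
        for c in t:
            out |= clusters(c)
    return out

def depth_of_lca(s, target, depth=0):
    """Depth in s of lca_S(target), counted from the top node of s."""
    if not isinstance(s, str):
        for c in s:
            if target <= leafset(c):
                return depth_of_lca(c, target, depth + 1)
    return depth

def dc_cost(t, s):
    """DC(T, S) as in Def. 3: sum over edges of T of path lengths in S."""
    if isinstance(t, str):
        return 0
    top = depth_of_lca(s, leafset(t))
    total = 0
    for c in t:
        # Le(T_c) is a subset of Le(T_t), so the path length is a depth difference.
        total += depth_of_lca(s, leafset(c)) - top
        total += dc_cost(c, s)
    return total

def consensus_clusters(trees):
    """Clusters present in every input tree."""
    return set.intersection(*(clusters(t) for t in trees))

def total_cost(s, trees):
    return sum(dc_cost(t, s) for t in trees)

T1 = (("a", "b"), ("c", "d"))
T2 = ((("a", "b"), "c"), "d")
S_bad = ((("a", "c"), "b"), "d")   # does not contain the consensus cluster {a, b}
S_good = ((("a", "b"), "c"), "d")  # contains it

common = consensus_clusters([T1, T2])
print(frozenset("ab") in common, frozenset("ab") in clusters(S_bad))
print(total_cost(S_bad, [T1, T2]), total_cost(S_good, [T1, T2]))
```

Note that, with the cost defined through path lengths as in Def. 3, a tree compared with itself has cost equal to its number of edges below the top node rather than zero; this matches the remark in the text that Def. 3 differs from the original extra lineage count.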
Algorithm 1 Deep coalescence consensus clusters builder
1: procedure DCConsensusClustersBuilder(I, T)
   Input: A consensus tree problem instance I = (T_1, ..., T_n), a candidate solution T for I
   Output: T, or an improved solution R that contains all consensus clusters of I
2:   R ← T
3:   C ← Set of all consensus clusters of I
4:   for all cluster X ∈ C do
5:     if R does not contain X then
6:       v ← A node in shallowest(I_2(R(X̄))) (shallowest degree-two node of R(X̄))
7:       R ← Γ(R, X, v) (regrouping operation of R by X on v)
8:     end if
9:   end for
10:  return R
11: end procedure

The correctness of Algorithm 1 follows from the proof of Theorem 1. We now analyze its time complexity. Let m be the number of taxa present in the input trees. Line 3 takes O(nm) time. Lines 5, 6, and 7 each take O(m) time, and there are O(m) iterations. Overall Algorithm 1 takes O(nm + m^2) time.

General method for improving a search algorithm
In this section we extend the result of Theorem 1 and show that the deep coalescence consensus tree problem exhibits optimal substructures based on the strict consensus tree of the problem instance. This leads to another simple and general method that improves an existing search algorithm. Figure 6 depicts a running example for this section. We now begin with some useful definitions.

Figure 6. Running example for the definitions and proof of Theorem 2. Arrows are marked by numbers 1 to 6, demonstrating the steps of the proof. Each step is explained below: (1) Given an instance I = (T_1, ..., T_n), let H be the strict consensus tree of I. An internal node of H and its four children are shown. (2) Let S be a solution for I, having the optimal deep coalescence cost. (3) Cut trees via H and h, obtaining Cut_{H,h}(I) and Cut_{H,h}(S) = S'.
Let A, B, C, D be the leaf sets of each subtree. (4) We assume by contradiction that S' is not a solution for 1471210513S10S12i95 C u t H , h ( I ) , and so we let R' be a solution for 1471210513S10S12i96 C u t H , h ( I ) . (5 and 6) Modify S to obtain R, by replacing S' with R'. 1471210513S10S126 Definition 12 (Strict consensus tree B18 18). Given a tuple of n trees I = (T1,...,Tn), the strict consensus tree of I, denoted StrictCon(I), is the unique tree that contains those clusters common to all the input trees. Formally, StrictCon(I) is a (possibly nonbinary) tree S such that 1471210513S10S12i68 C l ( S ) = ⋂ i = 1 n C l ( T i ) . Definition 13 (Cut on trees). Let H and T be two trees over the same leaf set, such that H is a nonbinary tree and T is a binary tree that refines H. Given an internal node h in H, a cut on T via H and h, denoted 1471210513S10S12i69 C u t H , h ( T ) , is the minimal connected subtree of T that contains 1471210513S10S12i70 { M H ⊳ T ( c ) : c ∈ C h H ( h ) } , and we rename each leaf x by 1471210513S10S12i71 L e ( T x ) . We further extend this to a tuple of trees I = (T1,...,Tn) by 1471210513S10S12i72 C u t H , h ( I ) ≜ ( C u t H , h ( T 1 ) , … , C u t H , h ( T n ) ) . Theorem 2. Let I =(T1,...,Tn) be an instance of the deep coalescence consensus tree problem, and let S be a solution for I (having the optimal deep coalescence cost). Further suppose H is the strict consensus tree of I, and h is an internal node in H. Then 1471210513S10S12i73 C u t H , h ( S ) is a solution for the instance 1471210513S10S12i74 C u t H , h ( I ) of the deep coalescence consensus tree problem. Proof. Let 1471210513S10S12i75 C u t H , h ( S ) = S ' and 1471210513S10S12i76 C u t H , h ( I ) = ( C u t H , h ( T 1 ) , … , C u t H , h ( T n ) ) = ( T 1 ′ , … , T n ′ ) . First we observe that S must be a refinement of H by Theorem 1, therefore S' is defined. 
We continue the proof by contradiction, assuming the premise holds but S' is not a solution for the instance $\mathrm{Cut}_{H,h}(I)$. Let R' be a solution for the instance $\mathrm{Cut}_{H,h}(I)$; this implies that $\sum_{i=1}^{n} \mathrm{DC}(T_i', S') > \sum_{i=1}^{n} \mathrm{DC}(T_i', R')$. We now modify S by replacing S' with R' as follows:

1. Remove all edges of S', and remove all nodes of S' except the root and the leaves.
2. Identify $\mathrm{Ro}(S')$ with $\mathrm{Ro}(R')$.
3. For each leaf v of S', identify v with the leaf x of R' where $x = \mathrm{Le}(S_v)$.

Let the resulting tree be R. We will show that R has a lower deep coalescence cost, contradicting the assumption that S is a solution for I. Let T = Ti where 1 ≤ i ≤ n. It suffices to show that DC(T, S) > DC(T, R), in other words

$$\sum_{\{u,v\} \in E(T),\; u < v} \left[ \mathrm{pl}_S\big(M_{T \triangleright S}(u), M_{T \triangleright S}(v)\big) - \mathrm{pl}_R\big(M_{T \triangleright R}(u), M_{T \triangleright R}(v)\big) \right] > 0 \qquad (3)$$

For convenience, let $\mathrm{Ch}_H(h) = \{c_1, \ldots, c_m\}$, $h' = M_{H \triangleright T}(h)$, and $c_j' = M_{H \triangleright T}(c_j)$ where 1 ≤ j ≤ m. Similar to the proof of Theorem 1, we partition the edges of T into $\{E_{under}, E_{out}, E_{in}\}$ as follows.

1. $E_{under} \triangleq \{\{u, v\} \in E(T) : u < v \text{ and } (\exists j)(v \le c_j')\}$
2. $E_{out} \triangleq \{\{u, v\} \in E(T) : u < v \text{ and } v \nleq h'\}$
3. $E_{in} \triangleq E(T) \setminus (E_{under} \cup E_{out})$

Recall that the modification of S into R only involves the subtree S'; therefore $M_{T \triangleright S}(v)$ is unchanged for every v that occurs in $E_{under}$ and $E_{out}$. Hence it suffices to evaluate (3) on $E_{in}$ only. However, we have already assumed that $\sum_{i=1}^{n} \mathrm{DC}(T_i', S') > \sum_{i=1}^{n} \mathrm{DC}(T_i', R')$; therefore (3) holds. Overall, R has a lower deep coalescence cost, contradicting the assumption that S is a solution for I.
□

Theorem 2 implies that every internal node of the strict consensus tree defines an independent subproblem, and solutions of these subproblems can be combined to give a solution to the original deep coalescence consensus tree problem. This leads to the following general divide and conquer method that improves an existing search algorithm.

Method 1 Deep coalescence consensus tree method
procedure DCConsensusTreeMethod(I)
  Input: A DC consensus tree problem instance I = (T1,...,Tn), an external program DCSOLVER
  Output: A candidate solution T for I
  H ← StrictCon(I)
  for all internal nodes h of H do
    Ih ← $\mathrm{Cut}_{H,h}(I)$
    Sh ← DCSOLVER(Ih)
    Refine the children of h on H by the tree Sh
  end for
  return H
end procedure

Results

We used simulation experiments to (i) test whether the solutions obtained from the efficient heuristics presented in [13] display the Pareto property, and (ii) compare the performance of our new divide and conquer approach based on the Pareto property to the generic heuristic in [13].

Experiment results 1

First, to examine whether subtree pruning and regrafting (SPR) heuristic solutions from [13] display the Pareto property, we generated a series of four 14-taxon trees that share few clusters. To do this, we first generated random 11-taxon trees. Next, we generated random 4-taxon trees containing the species 11 to 14. We then replaced one of the leaves in the 11-taxon tree with the random 4-taxon tree. This procedure produces gene trees that share at least a single 4-taxon cluster. Although this simulation does not reflect a biological process, it represents extreme cases of error or incongruence among gene trees. In three cases with the 14-taxon gene trees, we found that the SPR heuristic did not return a result that contained the consensus cluster. In these cases, our proof demonstrates that there exists a better solution that also contains the consensus cluster.
However, the failure of the SPR heuristic in these cases appears to depend on the starting tree; these data sets did not fail with all starting trees. Thus, the shortcomings of the SPR heuristic may be ameliorated by performing multiple runs from different starting trees.

Experiment results 2

We next evaluated the efficacy and scalability of Method 1 and compared it to the standalone SPR heuristic. We generated sets of gene trees, each with a different consensus tree structure (depth and branch factor), as follows. The depth of a tree is the maximum number of edges from the root to a leaf, and the branch factor of a tree is the maximum degree of the nodes. For each depth d and branch factor b, we first generated a complete b-ary tree of depth d, denoted $C_{d,b}$. This tree represents the consensus tree. We used depths of 2 to 5 and branch factors of 3 to 30. For each $C_{d,b}$, we then generated 10 sets of 20 random gene trees, such that each gene tree is a binary refinement of $C_{d,b}$. Each set of input trees was given as input to Method 1, using [13] as the external deep coalescence solver. For comparison, we ran the same data sets using [13] as the standalone deep coalescence solver. We calculated the deep coalescence score for each output species tree, and we report the average score over the 10 sets (profiles) as the score for each $C_{d,b}$. We also measured and recorded the average runtime of each run. We terminated the execution of the standalone solver if the runtime exceeded two minutes, and in this case the results are not shown.

In general, Figure 7 shows that the scores of the trees from Method 1 and the standalone SPR heuristic were very similar. Thus, Method 1 does not appear to improve the quality of the deep coalescence species trees. However, Method 1 shows large improvements in runtime, especially as the branch factor increases.

Figure 7. Deep coalescence score and runtime results for Experiment 2.
Legend: blue represents Method 1 (divide and conquer) and orange represents the standalone SPR heuristic.

Experiment results 3

Finally, we examined the performance of Method 1 and compared it to the standalone SPR heuristic using more biologically plausible coalescence simulations. We followed the general structure of the coalescence simulation protocol described by Maddison and Knowles [9]. First, we generated 40 256-taxon species trees based on a Yule pure birth process using the r8s software package [19]. To transform the branch lengths from the Yule simulation to represent generations, we multiplied them all by 1,000,000. Next, we simulated coalescence within each species tree (assuming no migration or hybridization) using Mesquite [20]. All simulations produced a single gene copy from each species. For each species tree, we simulated 20 gene trees assuming a constant population size. The population size affects the number of deep coalescence events, with larger populations leading to more incomplete lineage sorting and consequently less agreement among the gene trees. Thus, to incorporate different levels of incomplete lineage sorting, we used a constant population size of 10,000 for 20 of the species trees and a constant population size of 100,000 for the other 20. In total, we produced 40 sets of 20 gene trees, with each set simulated from a different 256-taxon species tree. For each data set, we performed a phylogenetic analysis using Method 1 and also using only the SPR heuristic from Bansal et al. [13].

In contrast to the simulations in Experiment 1, the standalone SPR heuristic of Bansal et al. [13] always returned species trees with all consensus clusters. Of course, all solutions from Method 1 must display the Pareto property. The deep coalescence reconciliation scores for the best trees were similar with both algorithms.
When the population size was 10,000, the average coalescence cost was 279, and all the gene trees shared an average of 29.4 clusters. In 19 out of the 20 simulations, both approaches produced the same results, while in one case Method 1 found a species tree with one fewer implied deep coalescence event. When the population size was 100,000, the average coalescence cost was 2142, and all the gene trees shared an average of 19.1 clusters. Although the reconciliation cost never differed by more than 15, Method 1 had a better score in 6 replicates, and the standalone SPR heuristic had a better score in 11 replicates. All analyses finished within 30 seconds on a laptop PC, but Method 1 was always faster than SPR alone.

Discussion

In addition to offering a biologically informed optimality criterion to resolve incongruence among gene trees, the deep coalescence problem, as we prove, is guaranteed to retain the phylogenetic clusters on which all gene trees agree. Since the deep coalescence problem is NP-hard [10], most meaningful instances will require heuristics to estimate a solution. We demonstrate that the Pareto property can be leveraged to vastly improve upon the running time of heuristics.

Method 1 represents a new general approach to phylogenetic algorithms. In most cases, heuristics to estimate solutions for phylogenetic inference problems are based on a few generic search strategies, such as local search heuristics based on nearest neighbor interchange (NNI), SPR, or tree bisection and reconnection (TBR) branch swapping. Although these search strategies often appear to perform well, they are not connected to any specific phylogenetic problem or optimality criterion. Ideally, however, efficient and effective heuristics should be tailored to the properties of the phylogenetic problem. In the case of the deep coalescence consensus tree problem, the Pareto property provides an informative guiding constraint for the tree search.
Specifically, when considering possible solutions, we need only consider solutions that contain all clusters shared by all of the input gene trees, or, in other words, that refine the strict consensus of the input gene trees.

Still, our simulation experiments suggest that, in many cases, the SPR local search heuristic described by Bansal et al. [13] performs well. While we identified cases in which the estimate from the SPR heuristic did not contain the Pareto clusters, in most cases SPR alone found trees as good as, or even slightly better than, those from Method 1. We note that the size of the simulated coalescence data sets, 256 taxa, exceeds the size of the largest published analysis of the deep coalescence consensus tree problem and is far beyond the largest instances (8 taxa) for which exact solutions have been calculated [11], and the SPR heuristic found good solutions within 30 seconds. Still, the running time of the SPR heuristic does not always scale well, and the results of Experiment 2 suggest that it might not be tractable for extremely large data sets. In these cases, Method 1 may in practice vastly improve upon the running time, while guaranteeing a solution with the Pareto property.

Further, Theorem 2 shows that the deep coalescence consensus tree problem exhibits independent optimal substructures. This implies that, once we compute the strict consensus tree of the problem instance, the rest of Method 1 can be directly parallelized, regardless of which external deep coalescence solver is used. In the case where the external solver guarantees exact solutions, our method would also give exact solutions, but can potentially solve instances with many more taxa than the external solver alone could handle.

Although the Pareto property for the deep coalescence consensus tree problem is desirable, and the divide and conquer method is promising for large-scale analyses, there are limitations to their use.
First, the Pareto property and Method 1 are limited to the consensus case, that is, instances in which all of the input gene trees contain sequences from all of the species. Also, the Pareto property is only useful when all input trees share some clusters. If there are no consensus clusters among the input trees, then Method 1 conveys no runtime benefit. While this may seem like an extreme case, it is possible with high levels of incomplete lineage sorting, or, perhaps more likely, much error in the gene tree estimates. Also, as we add more and more gene trees, we would expect more instances of conflict among the gene trees, potentially converging towards the elimination of consensus clusters.

Than and Rosenberg [21] recently proved the existence of cases in which the deep coalescence problem is inconsistent, that is, it converges on the wrong species tree estimate with increasing gene tree data. Although inconsistency is concerning, the Pareto property provides some reassurance. Even in a worst-case scenario in which the deep coalescence problem is misled, the optimal solutions will still contain all of the agreed-upon clades from the gene trees.

Perhaps the greatest advantage of the deep coalescence problem, especially compared to likelihood and Bayesian approaches that infer species trees based on coalescence models (e.g., [22-24]), is its computational speed and the feasibility of estimating a species tree from large-scale genomic data sets representing hundreds or even thousands of taxa [13]. Not only can our method improve the performance of any existing heuristic, but the Pareto property also describes a limited subset of possible species trees that must contain the optimal solution.
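To give a sense of how much the Pareto property can narrow the search, the sketch below counts the binary refinements of a strict consensus tree and compares that number with the count of all rooted binary trees on the same leaves. It relies on the standard fact that there are (2k−3)!! rooted binary trees on k labelled leaves. The nested-tuple tree encoding, the function names, and the small complete 3-ary example (in the spirit of the consensus trees used in Experiment 2) are illustrative assumptions, not part of the authors' software.

```python
from math import prod


def rooted_binary_trees(k):
    """Number of rooted binary trees on k labelled leaves: (2k-3)!! for k >= 2."""
    count = 1
    for factor in range(3, 2 * k - 2, 2):
        count *= factor
    return count


def binary_refinements(tree):
    """Number of binary trees that refine a (possibly nonbinary) nested-tuple tree.

    An internal node with k children can be resolved in rooted_binary_trees(k)
    ways, independently of every other node, so the counts multiply.
    """
    if not isinstance(tree, tuple):
        return 1
    return rooted_binary_trees(len(tree)) * prod(
        binary_refinements(child) for child in tree
    )


def complete_tree(depth, branch, prefix="t"):
    """Complete tree with the given depth and branch factor (hypothetical leaf names)."""
    if depth == 0:
        return prefix
    return tuple(
        complete_tree(depth - 1, branch, f"{prefix}.{i}") for i in range(branch)
    )


consensus = complete_tree(2, 3)        # 9 leaves, 4 internal nodes with 3 children each
print(binary_refinements(consensus))   # 81
print(rooted_binary_trees(9))          # 2027025
```

In this toy case only 81 of the 2,027,025 rooted binary trees on nine leaves refine the consensus tree, and because the count factorizes over internal nodes, each node can be resolved as a separate subproblem.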
Conclusions

We prove that the deep coalescence consensus tree problem satisfies the Pareto property for clusters and describe an efficient algorithm that, given a candidate solution that does not display the consensus clusters, transforms the solution so that it includes all the consensus clusters and has a lower deep coalescence cost. We extend the result and prove that the problem exhibits optimal substructures based on the strict consensus tree of the input gene trees. Based on this property, we suggest a new, parallelizable tree search method, in which we refine the strict consensus of the input gene trees. In contrast to previously proposed heuristics, this method guarantees that the proposed solution will contain the Pareto clusters. Also, as our experiments demonstrate, this method can greatly improve the speed of deep coalescence tree heuristics, potentially enabling efficient and effective estimates from input with thousands of taxa.

List of abbreviations used

LCA: least common ancestor; SPR: subtree pruning and regrafting; NNI: nearest neighbor interchange; TBR: tree bisection and reconnection.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HTL and OE were responsible for theory development and algorithm design. HTL implemented the programs. HTL and JGB designed and conducted simulation experiments, and JGB led the analysis of the results. All authors contributed to the writing of this manuscript, and have read and approved the final manuscript.

Acknowledgements

The authors would like to thank our anonymous reviewers, who provided valuable comments as well as a simpler proof of Lemma 1. This work was conducted with support from the Gene Tree Reconciliation Working Group at NIMBioS through NSF award #EF0832858, with additional support from the University of Tennessee. HTL and OE were supported in part by NSF awards #0830012 and #10117189.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: "Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)". The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.

References

1. Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 2003, 425(6960):798-804.
2. Pollard DA, Iyer VN, Moses AM, Eisen MB: Widespread Discordance of Gene Trees with Species Tree in Drosophila: Evidence for Incomplete Lineage Sorting. PLoS Genet 2006, 2(10):e173.
3. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the Gene Lineage into its Species Lineage, a Parsimony Strategy Illustrated by Cladograms Constructed from Globin Sequences. Systematic Zoology 1979, 28(2):132-163.
4. Maddison WP: Gene Trees in Species Trees. Systematic Biology 1997, 46(3):523-536.
5. Nichols R: Gene trees and species trees are not the same. Trends in Ecology & Evolution 2001, 16(7):358-364.
6. Edwards SV: Is a new and general theory of molecular systematics emerging? Evolution; International Journal of Organic Evolution 2009, 63:1-19.
7. Knowles LL: Estimating Species Trees: Methods of Phylogenetic Analysis When There Is Incongruence across Genes. Systematic Biology 2009, 58(5):463-467.
8. Yu Y, Warnow T, Nakhleh L: Algorithms for MDC-based multilocus phylogeny inference. In Proceedings of the 15th Annual international conference on Research in computational molecular biology, RECOMB. Berlin, Heidelberg: Springer-Verlag; 2011:531-545.
9. Maddison WP, Knowles LL: Inferring Phylogeny Despite Incomplete Lineage Sorting. Systematic Biology 2006, 55:21-30.
10. Zhang L: From gene trees to species trees II: Species tree inference in the deep coalescence model. IEEE/ACM Trans Comput Biol Bioinformatics 2011, 8(6):1685-1691.
11. Than C, Nakhleh L: Species Tree Inference by Minimizing Deep Coalescences. PLoS Computational Biology 2009, 5(9):e1000501.
12. Than C, Nakhleh L: Inference of parsimonious species tree phylogenies from multilocus data by minimizing deep coalescences. In Estimating species trees: Practical and Theoretical Aspects. Wiley-VCH, Chichester; 2010:79-98.
13. Bansal M, Burleigh JG, Eulenstein O: Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models. BMC Bioinformatics 2010, 11(Suppl 1):S42.
14. Bininda-Emonds ORP: Phylogenetic supertrees: combining information to reveal the Tree of Life. Springer; 2004.
15. Bryant D: A classification of consensus methods for phylogenies. In BioConsensus, DIMACS. AMS; 2003:163-184.
16. Wilkinson M, Cotton JA, Lapointe F, Pisani D: Properties of Supertree Methods in the Consensus Setting. Systematic Biology 2007, 56(2):330-337.
17. Wilkinson M, Thorley J, Pisani D, Lapointe FJ, McInerney J: Some desiderata for liberal supertrees. In Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Springer, Dordrecht, the Netherlands; 2004:227-246.
18. McMorris FR, Meronk DB, Neumann DA: A view of some consensus methods for trees. Numerical Taxonomy 1983:122-125.
19. Sanderson MJ: r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics (Oxford, England) 2003, 19(2):301-302.
20. Maddison WP, Maddison D: Mesquite: a modular system for evolutionary analysis. 2001. http://mesquiteproject.org
21. Than CV, Rosenberg NA: Consistency properties of species tree inference by minimizing deep coalescences. Journal of Computational Biology 2011, 18:1-15.
22. Liu L: BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics 2008, 24(21):2542-2543.
23. Kubatko LS, Carstens BC, Knowles LL: STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics 2009, 25(7):971-973.
24. Heled J, Drummond AJ: Bayesian Inference of Species Trees from Multilocus Data. Molecular Biology and Evolution 2010, 27(3):570-580.