Group Title: BMC Bioinformatics
Title: Triplet supertree heuristics for the tree of life
Full Citation
Permanent Link: http://ufdc.ufl.edu/UF00099920/00001
 Material Information
Title: Triplet supertree heuristics for the tree of life
Physical Description: Book
Language: English
Creator: Lin, Harris
Burleigh, J. G.
Eulenstein, Oliver
Publisher: BMC Bioinformatics
Publication Date: 2009
 Notes
Abstract: BACKGROUND: There is much interest in developing fast and accurate supertree methods to infer the tree of life. Supertree methods combine smaller input trees with overlapping sets of taxa to make a comprehensive phylogenetic tree that contains all of the taxa in the input trees. The intrinsically hard triplet supertree problem takes a collection of input species trees and seeks a species tree (supertree) that maximizes the number of triplet subtrees that it shares with the input trees. However, the utility of this supertree problem has been limited by a lack of efficient and effective heuristics. RESULTS: We introduce fast hill-climbing heuristics for the triplet supertree problem that perform a step-wise search of the tree space, where each step is guided by an exact solution to an instance of a local search problem. To realize time efficient heuristics we designed the first nontrivial algorithms for two standard search problems, which improve on the time complexity of the best known (naïve) solutions by factors of $n$ and $n^2$, where $n$ is the number of taxa in the supertree. These algorithms enable large-scale supertree analyses based on the triplet supertree problem that were previously not possible. We implemented hill-climbing heuristics that are based on our new algorithms, and in analyses of two published supertree data sets, we demonstrate that our new heuristics outperform other standard supertree methods in maximizing the number of triplets shared with the input trees. CONCLUSION: With our new heuristics, the triplet supertree problem is now computationally more tractable for large-scale supertree analyses, and it provides a potentially more accurate alternative to existing supertree methods.
General Note: Periodical Abbreviation:BMC Bioinformatics
General Note: Start page S8
General Note: M3: 10.1186/1471-2105-10-S1-S8
 Record Information
Bibliographic ID: UF00099920
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: Open Access: http://www.biomedcentral.com/info/about/openaccess/
Resource Identifier: issn - 1471-2105
http://www.biomedcentral.com/1471-2105/10/S1/S8




BMC Bioinformatics


Research

Triplet supertree heuristics for the tree of life
Harris T Lin1, J Gordon Burleigh2 and Oliver Eulenstein*1


Address: 1Department of Computer Science, Iowa State University, Ames, IA, USA and 2National Evolutionary Synthesis Center, Durham, NC,
USA; University of Florida, Gainesville, FL, USA
Email: Harris T Lin htlin@cs.iastate.edu; J Gordon Burleigh gburleigh@ufl.edu; Oliver Eulenstein* oeulenst@cs.iastate.edu
* Corresponding author



from The Seventh Asia Pacific Bioinformatics Conference (APBC 2009)
Beijing, China. 13-16 January 2009

Published: 30 January 2009
BMC Bioinformatics 2009, 10(Suppl 1):S8 doi: 10.1186/1471-2105-10-S1-S8


This article is available from: http://www.biomedcentral.com/1471-2105/10/S1/S8
© 2009 Lin et al; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Abstract
Background: There is much interest in developing fast and accurate supertree methods to infer
the tree of life. Supertree methods combine smaller input trees with overlapping sets of taxa to
make a comprehensive phylogenetic tree that contains all of the taxa in the input trees. The
intrinsically hard triplet supertree problem takes a collection of input species trees and seeks a
species tree (supertree) that maximizes the number of triplet subtrees that it shares with the input
trees. However, the utility of this supertree problem has been limited by a lack of efficient and
effective heuristics.
Results: We introduce fast hill-climbing heuristics for the triplet supertree problem that perform
a step-wise search of the tree space, where each step is guided by an exact solution to an instance
of a local search problem. To realize time efficient heuristics we designed the first nontrivial
algorithms for two standard search problems, which improve on the time complexity of the
best known (naive) solutions by factors of $n$ and $n^2$, where $n$ is the number of taxa in the supertree. These
algorithms enable large-scale supertree analyses based on the triplet supertree problem that were
previously not possible. We implemented hill-climbing heuristics that are based on our new
algorithms, and in analyses of two published supertree data sets, we demonstrate that our new
heuristics outperform other standard supertree methods in maximizing the number of triplets
shared with the input trees.
Conclusion: With our new heuristics, the triplet supertree problem is now computationally more
tractable for large-scale supertree analyses, and it provides a potentially more accurate alternative
to existing supertree methods.


Background
Assembling the tree of life, or the phylogeny of all species,
is one of the grand challenges in evolutionary biology.
Supertree methods take a collection of species trees with
overlapping, but not identical, sets of taxa and return a
"supertree" that contains all taxa found in the input trees
(e.g., [1-4]). Thus, supertrees provide a way to synthesize
small trees into a comprehensive phylogeny representing
large sections of the tree of life. Recent supertree analyses
have produced the first complete family-level phylogeny
of flowering plants [5], and the first phylogeny of nearly
all extant mammals [6]. Since the main objective of most
supertree analyses is to build extremely large phylogenetic
trees by solving intrinsically hard computational prob-
lems, the design of efficient and effective heuristics is a
critically important part of developing any useful super-
tree method.

Ideal supertree methods must combine speed and accu-
racy. By far the most commonly used supertree method is
matrix representation with parsimony (MRP; [7,8]). MRP
converts a collection of input trees into a binary character
matrix, and then performs a parsimony analysis on a
matrix representation of the input trees. Thus, MRP anal-
yses can use efficient parsimony heuristics implemented
in programs such as PAUP* [9] and TNT [10], making
large-scale MRP supertree analyses computationally more
tractable. However, the accuracy and performance of MRP
are frequently criticized. For example, there is evidence of
input tree size and shape biases [11,12], the results can
vary depending on the method of matrix representation
[11], and the accuracy of MRP supertrees is not necessarily
correlated with the parsimony score [13]. Therefore,
there is a need to develop alternate methods that
share the advantages of MRP but produce more accurate
supertrees.
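
As a rough illustration of the matrix representation step, the sketch below applies the standard Baum-Ragan coding to trees written as nested Python tuples. The tuple encoding and the helper names (`leaves`, `clades`, `mrp_matrix`) are assumptions made for this example; they are not taken from PAUP*, TNT, or any published MRP software, and the parsimony search itself is not shown.

```python
def leaves(tree):
    """Leaf set of a rooted binary tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def clades(tree, is_root=True):
    """Leaf sets of all internal nodes except the root."""
    if isinstance(tree, str):
        return []
    here = [] if is_root else [leaves(tree)]
    return here + clades(tree[0], False) + clades(tree[1], False)


def mrp_matrix(profile):
    """Map each taxon to its row of the binary character matrix.

    One column per non-root clade: '1' = taxon is in the clade,
    '0' = taxon is in the tree but outside the clade,
    '?' = taxon is absent from that input tree.
    """
    taxa = sorted(set().union(*(leaves(t) for t in profile)))
    rows = {taxon: "" for taxon in taxa}
    for tree in profile:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in taxa:
                if taxon not in present:
                    rows[taxon] += "?"
                else:
                    rows[taxon] += "1" if taxon in clade else "0"
    return rows


# Hypothetical two-tree profile with overlapping taxon sets.
profile = [(("a", "b"), "c"), (("b", "c"), "d")]
matrix = mrp_matrix(profile)
for taxon, row in matrix.items():
    print(taxon, row)
```

Each input tree contributes one column per non-root clade, so the matrix grows with the total number of internal nodes in the profile.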

Since we rarely know the evolutionary history of a group
of organisms with certainty, it is usually impossible to
assess the accuracy of a supertree based on its similarity to
the true species phylogeny. A more practical way to define
the accuracy of a supertree is based on the overall similar-
ity of the supertree to the collection of input trees. There
are numerous ways to measure the similarity between
input trees and the supertree. The intrinsically hard [14]
triplet supertree problem measures this similarity based
on the common shared triplets, or rooted, binary, 3-taxon
trees that are the irreducible unit of phylogenetic informa-
tion in rooted trees [14]. Specifically, the triplet supertree
problem seeks a supertree that shares the most triplets
with the input trees.

We introduce hill-climbing heuristics for the triplet super-
tree problem that make it feasible for truly large-scale phy-
logenetic analyses. Hill-climbing heuristics have been
effectively applied to other intrinsically difficult supertree
problems [7,13,15]. They search the space of all possible
supertrees guided by a series of exact solutions to
instances of a local search problem. The local search problem
is to find, in the neighborhood of a given tree, a phylogenetic
tree that shares the greatest number of triplets with the input
trees. The neighborhood is the set of all phylogenetic trees into
which the given tree can be transformed by applying a single
tree edit operation. A variety of different tree edit operations
have been proposed [16,17],
and two of them, rooted Subtree Pruning and Regrafting
(SPR) and Tree Bisection and Reconnection (TBR), have
shown much promise for phylogenetic studies [18,19].
However, algorithms for local search problems based on
SPR and TBR operations, especially on rooted trees, are
still in their infancy. To conduct large-scale phylogenetic
analyses, there is much need for effective SPR and TBR
based local search problems that can be solved efficiently.

In this work we improve upon the best known (naive)
solutions for the SPR and TBR local search problems by
factors of $n$ and $n^2$ respectively, where $n$ is the number of
taxa in the supertree. This is especially desirable since standard
local search heuristics for the triplet supertree problem
typically involve solving several thousand instances of the
local search problem. We demonstrate the performance of
our new triplet heuristics in a comparative analysis with
other standard supertree methods.

Related work
Triplet supertree problem
The triplet supertree problem makes use of the fact that
every rooted tree can be equivalently represented by a set
of triplet trees [17]. A triplet tree is a rooted fully binary
tree over three taxa. Thus, a triplet-similarity measure can
be defined between two rooted trees that is the cardinality
of the intersection of their triplet presentations. This
measure can be extended to measure the similarity from a
collection of rooted input trees to a rooted supertree, by
summing up the triplet-similarities for each input tree and
the supertree. The triplet supertree problem is to find a
supertree that maximizes the triplet-similarity for a given
collection of input trees. Figure 1 illustrates the triplet
supertree problem.

Figure 1
Triplet supertree problem. Given an input profile of $n$
species trees $(T_1, \ldots, T_n)$, the triplet supertree problem is to
find a supertree that maximizes the triplet-similarity score.
The score for a supertree is calculated by first decomposing the
trees into their corresponding triplet presentations, then
counting the number of common triplets between the supertree
and each input tree $(S_1, \ldots, S_n)$, and finally aggregating all
the counts. The triplet-similarity score for the candidate
supertree $T$ with respect to the input profile is therefore
$\sum_{i=1}^{n} S_i$.
(Panels: Input Profile; Triplet Presentation; Number of Common Triplets; Candidate Supertree and its Triplet Presentation.)

Hill-climbing heuristics
We introduce hill-climbing heuristics to solve the triplet
supertree problem. Hill-climbing heuristics have been
successfully applied to several intrinsically complex supertree
problems. In these heuristics a tree graph is defined
for the given set of input trees and some, typically symmetric,
tree-edit operation. The nodes in the tree graph are
the phylogenetic trees over the overall taxon set of the
input trees. An edge adjoins two nodes exactly if the
corresponding trees can be transformed into each other by
the tree edit operation. The cost of a node in the graph is
the value that the particular supertree problem's optimization
measure assigns to the tree represented by the node, given the
input trees. For the triplet supertree
problem, the cost of a node in the graph is the triplet-similarity
from the input trees to the tree represented by the
node. Given a starting node in the tree graph, the heuristic's
task is to find a maximal-length path of steepest
ascent in the cost of its nodes and to return the last node
on such a path. This path is found by solving the local
search problem for every node along the path. The local
search problem is to find a node with the maximum cost
in the neighborhood (all adjacent nodes) of a given node.
The neighborhood searched depends on the edit operation.
Edit operations of interest are SPR and TBR [17]. We
defer the definition of these operations to the next section.
The best known run times (naive solutions) for the SPR
and TBR based local search problems under the triplet-similarity
measurement are $O(kn^4)$ and $O(kn^5)$ respectively,
where $k$ is the number of input trees and $n$ is
the number of taxa present in the input trees.
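
The steepest-ascent search over the tree graph can be sketched generically as follows. This is a minimal illustration, not the authors' implementation: `hill_climb`, `neighbors`, and `score` are hypothetical names, and the toy search space uses integers in place of trees so that the example stays self-contained.

```python
def hill_climb(start, neighbors, score):
    """Follow a path of steepest ascent and return its last node and score."""
    current, current_score = start, score(start)
    while True:
        # Local search problem: best-scoring node among all adjacent nodes.
        best, best_score = None, current_score
        for candidate in neighbors(current):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        if best is None:  # no neighbor improves the score: local optimum
            return current, current_score
        current, current_score = best, best_score


# Toy search space: integers stand in for trees, +/-1 steps for edit operations.
result = hill_climb(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
print(result)
```

In a supertree setting, `neighbors` would enumerate an SPR or TBR neighborhood and `score` would be the triplet-similarity; the cost of each iteration is then dominated by how fast the local search problem can be solved.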

Contribution of the manuscript
We introduce algorithms that solve the local SPR and TBR
based search problems for our triplet supertree heuristics
in $O(n^3)$ time each, with an initial preprocessing
time of $O(kn^3)$. These algorithms allow true
large-scale phylogenetic analyses using hill-climbing heu-
ristics for the triplet supertree problem. Finally, we dem-
onstrate the performance of our SPR and TBR based hill-
climbing heuristics in comparative studies on two large
published data sets.





Methods
Initially, for each possible triplet over the set of all taxa, we
count and store the number of input trees that display it, in
$O(kn^3)$ time. Then, for each local search problem,
we use dynamic programming to pre-process the
necessary triplet counts in $O(n^3)$ time. By exploiting the
structural properties of SPR and TBR related to triplet-similarity,
we are able to use these triplet counts to compute
the differences in triplet-similarity for all trees in the SPR and TBR
neighborhoods, each in $O(n^3)$ time.

Basic definitions, notations, and preliminaries
In this section we introduce basic definitions and nota-
tions and then define preliminaries required for this work.
For brevity the proofs of Lemmas 2-6 are omitted, but
available on request.

Basic definitions and notations
A tree T is a connected graph with no cycles, consisting of
a node set V(T) and an edge set E(T). T is rooted if it has
exactly one distinguished node called the root which we
denote by Ro(T).

Let $T$ be a rooted tree. We define $\le_T$ to be the partial order
on $V(T)$ where $x \le_T y$ if $y$ is a node on the path between
$Ro(T)$ and $x$. If $x \le_T y$ we call $x$ a descendant of $y$, and $y$ an
ancestor of $x$. We also define $x <_T y$ if $x \le_T y$ and $x \ne y$; in this
case we call $x$ a proper descendant of $y$, and $y$ a proper ancestor
of $x$.

The set of minima under $\le_T$ is denoted by $Le(T)$ and its elements
are called leaves. If $\{x, y\} \in E(T)$ and $x \le_T y$ then we
call $y$ the parent of $x$, denoted by $Pa_T(x)$, and we call $x$ a child
of $y$. The set of all children of $y$ is denoted by $Ch_T(y)$. If two
nodes in $T$ have the same parent, they are called siblings.
The least common ancestor of a non-empty subset $L \subseteq V(T)$,
denoted as $lca_T(L)$, is the unique smallest upper bound of
$L$ under $\le_T$.

If $e \in E(T)$, we define $T/e$ to be the tree obtained from $T$ by
identifying the ends of $e$ and then deleting $e$. $T/e$ is said to
be obtained from $T$ by contracting $e$. If $v$ is a vertex of $T$ with
degree one or two, and $e$ is an edge incident with $v$, the tree
$T/e$ is said to be obtained from $T$ by suppressing $v$.

The restricted subtree of $T$ induced by a non-empty subset
$L \subseteq V(T)$, denoted as $T|L$, is the tree induced by $L$ where all
internal nodes with degree two are suppressed, with the
exception of the root node. The subtree of $T$ rooted at node
$y \in V(T)$, denoted as $T_y$, is the restricted subtree induced
by $\{x \in V(T) : x \le_T y\}$.

T is fully binary if every node has either zero or two chil-
dren. Throughout this paper, the term tree refers to a
rooted fully binary tree.




The triplet supertree problem
We now introduce necessary definitions to state the triplet
supertree problem. A triplet is a rooted binary tree with
three leaves. A triplet $T$ with leaves $a$, $b$, and $c$ is denoted
$ab|c$ if $lca_T(\{a, b\})$ is a proper descendant of the root. Note
that we do not distinguish between $ab|c$ and $ba|c$. The set
of all triplets of a tree $T$, denoted as $Tr(T)$, is
$\{ab|c : T|\{a, b, c\} = ab|c\}$. The set of common triplets between two trees
$T_1$ and $T_2$, denoted as $S(T_1, T_2)$, is $Tr(T_1) \cap Tr(T_2)$. A profile
$P$ is a tuple of trees $(T_1, \ldots, T_n)$. We extend the definition of
leaf set to profiles as $Le(P) = \bigcup_{i=1}^{n} Le(T_i)$. Let $P$ be a profile;
we call $T^*$ a supertree of $P$ if $Le(T^*) = Le(P)$.

We are now ready to define the triplet supertree problem
(Fig. 1).

Definition 1 (Triplet similarity). Given a profile $P = (T_1, \ldots, T_n)$
and a supertree $T^*$ of $P$, we define the triplet-similarity score

$S(P, T^*) = \sum_{i=1}^{n} |S(T_i, T^*)|$.

Problem 1 (The triplet supertree problem). Given a profile
P, find a supertree T* that maximizes S(P, T*). We call any
such T* a triplet supertree.

Theorem 1 ([14]). The triplet supertree problem is NP-hard.
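
The definitions of $Tr(T)$ and $S(P, T^*)$ can be made concrete with a small sketch. It assumes trees are encoded as nested 2-tuples with string leaves, and it represents a triplet $ab|c$ as the pair `(frozenset({a, b}), c)`; both encodings and all function names are choices made for this example only.

```python
from itertools import combinations


def leaves(tree):
    """Leaf set of a rooted binary tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def triplets(tree):
    """Tr(tree) as a set of (frozenset({a, b}), c) pairs, each meaning ab|c."""
    if isinstance(tree, str):
        return set()
    left, right = tree
    found = triplets(left) | triplets(right)
    # Two leaves on one side of this node and one leaf on the other side
    # form a triplet whose cherry lies on the two-leaf side.
    for near, far in ((left, right), (right, left)):
        for a, b in combinations(sorted(leaves(near)), 2):
            for c in leaves(far):
                found.add((frozenset((a, b)), c))
    return found


def triplet_similarity(profile, supertree):
    """S(P, T*): sum over input trees of |Tr(T_i) & Tr(T*)|."""
    tr_super = triplets(supertree)
    return sum(len(triplets(t) & tr_super) for t in profile)


# Hypothetical profile of two input trees and one candidate supertree.
profile = [(("a", "b"), "c"), (("b", "c"), "d")]
candidate = ((("a", "b"), "c"), "d")
score = triplet_similarity(profile, candidate)
print(score)
```

Here the candidate displays $ab|c$ from the first tree and $bc|d$ from the second, so both input triplets are counted.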

Local search problems
Here we first provide definitions for the re-root (RR), TBR,
and SPR edit operations and then formulate the related
local search problems. Figures 2 and 3 illustrate the RR
and TBR edit operations respectively.

Definition 2 (RR operation). Let $T$ be a tree and $x \in V(T)$.
$RR_T(x)$ is defined to be the tree $T$ if $x = Ro(T)$. Otherwise,
$RR_T(x)$ is the tree that is obtained from $T$ by (i) suppressing
$Ro(T)$, and (ii) subdividing the edge $\{Pa_T(x), x\}$ by a new root
node. We define the following extensions:

$RR_T = \bigcup_{x \in V(T)} \{RR_T(x)\}$.

Let $x \le_T v$; we define $RR_T(v, x)$ to be the tree obtained from $T$ by replacing $T_v$ with $RR_{T_v}(x)$.

Definition 3 (TBR operation). For technical reasons we first
define for a tree $T$ the planted tree $Pl(T)$, which is the tree
obtained by adding an additional edge, called the root edge,
$\{r, Ro(T)\}$ to $E(T)$.

Let $T$ be a tree, $e = (u, v) \in E(T)$, and $X$, $Y$ be the connected
components that are obtained by removing edge $e$ from $T$, where
$v \in X$ and $u \in Y$. We define $TBR_T(v, x, y)$ for $x \in X$ and $y \in Y$
to be the tree that is obtained from $Pl(T)$ by first removing edge
$e$, then replacing the component $X$ by $RR_X(x)$, and then adjoining
a new edge $f$ between $x' = Ro(RR_X(x))$ and $Y$ as follows:

1. Create a new node $y'$ that subdivides the edge $(Pa_T(y), y)$.

2. Adjoin the edge $f$ between nodes $x'$ and $y'$.

3. Suppress the node $u$, and rename $x'$ as $v$ and $y'$ as $u$.

4. Contract the root edge.

We say that the tree $TBR_T(v, x, y)$ is obtained from $T$ by a tree
bisection and reconnection (TBR) operation that bisects the
tree $T$ into the components $X$, $Y$ and reconnects them above the
nodes $x$, $y$.

We define the following extensions for the TBR operation:

1. $TBR_T(v, x) = \bigcup_{y \in Y} \{TBR_T(v, x, y)\}$

2. $TBR_T(v) = \bigcup_{x \in X} TBR_T(v, x)$

3. $TBR_T = \bigcup_{(u, v) \in E(T)} TBR_T(v)$

An SPR operation for a tree T can be briefly described
through the following steps: (i) prune some subtree S
from T, (ii) add a root edge to the remaining tree T', (iii)
regraft S into an edge of the remaining tree T', (iv) contract
the root edge. For our purposes we define the SPR opera-
tion as a special case of the TBR operation.

Definition 4 (SPR operation). Let $T$ be a tree, $e = (u, v) \in E(T)$,
and $X$, $Y$ be the connected components that are obtained
by removing edge $e$ from $T$, where $v \in X$ and $u \in Y$. We define
$SPR_T(v, y)$ for $y \in Y$ to be $TBR_T(v, v, y)$. We say that the tree
$SPR_T(v, y)$ is obtained from $T$ by a subtree prune and regraft
(SPR) operation that prunes subtree $T_v$ and regrafts it above
node $y$.

We define the following extensions of the SPR operation:

1. $SPR_T(v) = \bigcup_{y \in Y} \{SPR_T(v, y)\}$

2. $SPR_T = \bigcup_{(u, v) \in E(T)} SPR_T(v)$
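
To make the prune and regraft steps tangible, the following sketch enumerates an SPR neighborhood by brute force. It assumes trees are nested 2-tuples with unique string leaf labels, treats child order as irrelevant via a canonical form, and excludes the starting tree from the result; these conventions and the helper names are assumptions of the example, and the enumeration is the naive approach rather than the efficient algorithms developed in this paper.

```python
def canon(tree):
    """Canonical form of an unordered rooted binary tree."""
    if isinstance(tree, str):
        return tree
    a, b = canon(tree[0]), canon(tree[1])
    return (a, b) if repr(a) <= repr(b) else (b, a)


def subtrees(tree):
    """Yield every subtree (the tree itself, internal subtrees, and leaves)."""
    yield tree
    if not isinstance(tree, str):
        yield from subtrees(tree[0])
        yield from subtrees(tree[1])


def prune(tree, s):
    """Remove subtree s from tree and suppress the parent of s."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    if left == s:
        return right
    if right == s:
        return left
    return (prune(left, s), prune(right, s))


def regraft(tree, y, s):
    """Attach s above node y, i.e. make s the new sibling of y."""
    if tree == y:
        return (tree, s)
    if isinstance(tree, str):
        return tree
    return (regraft(tree[0], y, s), regraft(tree[1], y, s))


def spr_neighborhood(tree):
    """All trees reachable from tree by one SPR operation, tree excluded."""
    result = set()
    for s in subtrees(tree):
        if s == tree:
            continue  # the whole tree cannot be pruned
        rest = prune(tree, s)
        for y in subtrees(rest):  # includes regrafting above the root of rest
            result.add(canon(regraft(rest, y, s)))
    result.discard(canon(tree))
    return result


start = (("a", "b"), "c")
neighbors = spr_neighborhood(start)
print(len(neighbors))
```

For three taxa there are only three rooted binary topologies, so the neighborhood of any one of them consists of the other two.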

Problem 2 (TBR Scoring (TBR-S)). Given a profile $P$ and a
supertree $T$ of $P$, find a tree $T^* \in TBR_T$ such that
$S(P, T^*) = \max_{T' \in TBR_T} S(P, T')$.




Figure 2
Example of an RR operation. Depicted is an example of an RR operation where $T' = RR_T(d)$. The original tree $T$ is shown in
(a). In (b), we first suppress the root node, and then introduce the new root node $r$ above $d$. Finally we rearrange the tree so
that $r$ is at the root, as in (c).


Problem 3 (TBR-Restricted Scoring (TBR-RS)). Given a
profile $P$, a supertree $T$ of $P$, and $(u, v) \in E(T)$, find a tree
$T^* \in TBR_T(v)$ such that $S(P, T^*) = \max_{T' \in TBR_T(v)} S(P, T')$.

The problems SPR Scoring (SPR-S) and SPR-Restricted
Scoring (SPR-RS) are defined analogously to the problems
TBR-S and TBR-RS respectively.

Further, we observe that to solve any of these four local
search problems, it is sufficient to find a tree within the
neighborhood that gives the maximum increase over $S(P, T)$,
without calculating the value of each $S(P, T')$ itself.
With this observation, it is useful to give the following definition.

Definition 5. Let $P$ be a profile and $T_1$ and $T_2$ be two supertrees of
$P$. We define the score difference function, denoted as $\Delta_P(T_1, T_2)$,
to be $S(P, T_2) - S(P, T_1)$.

Solving the SPR-RS and SPR-S problems
We first show how to solve the SPR-RS problem. Extend-
ing on this solution we introduce a new algorithm for the
SPR-S problem.

Solving the SPR-RS problem
Given a profile $P$, a supertree $T$ of $P$, and $(u, v) \in E(T)$, we
compute $\Delta_P(T, T')$ for each $T' \in SPR_T(v)$ by first pruning
$T_v$ and regrafting it at $Ro(T)$, then computing the score differences
for each "move-down" operation, and finally traversing $T$ in
pre-order to obtain the tree that gives the maximum score
difference. We first give a definition that helps us describe
a single "move-down".

Definition 6 (Immediate Triplet). Let $T$ be a tree and $v \in V(T)$.
An immediate triplet induced by $v$, denoted as $yz \bowtie v$,
is a triplet $yz|v$ where there exist nodes $a, b \in V(T)$ such that
$Pa_T(y) = Pa_T(z) = b$ and $Pa_T(b) = Pa_T(v) = a$.

Algorithm 1 Algorithm for the SPR-RS problem

1: procedure SPR-RS($P$, $T$, $(u, v)$)
   Input: A profile $P = (T_1, \ldots, T_n)$, a supertree $T$ of $P$, and $(u, v) \in E(T)$
   Output: $T^* \in SPR_T(v)$, and $\Delta_P(T, T^*)$
2:   $r \leftarrow Ro(T)$
3:   $\bar{T} \leftarrow SPR_T(v, r)$
4:   Call MovedownAndCompute($P$, $\bar{T}$, $v$)
5:   Traverse the tree $\bar{T}$ in pre-order to compute $\Delta_P(\bar{T}, T')$ for each $T' \in SPR_T(v)$ using the values computed by MovedownAndCompute
6:   $T^* \leftarrow T' \in SPR_T(v)$ such that $\Delta_P(\bar{T}, T') = \max_{T'' \in SPR_T(v)} \Delta_P(\bar{T}, T'')$
7:   $d \leftarrow \Delta_P(\bar{T}, T^*) - \Delta_P(\bar{T}, T)$
8:   return $(T^*, d)$
9: end procedure

10: procedure MOVEDOWNANDCOMPUTE($P$, $T$, $v$)
   Input: A profile $P$, a tree $T$, and $v \in V(T)$
11:   $yz \bowtie v \leftarrow$ the immediate triplet induced by $v$ in $T$
12:   for all $t \in \{y, z\}$ do
13:     $T' \leftarrow SPR_T(v, t)$
14:     Compute and store $\Delta_P(T, T')$
15:     Call MovedownAndCompute($P$, $T'$, $v$)
16:   end for
17: end procedure

It is easy to see that Algorithm 1 correctly solves
the SPR-RS problem.

Figure 3
Example of a TBR operation. Depicted is an example of a TBR operation where $T' = TBR_T(e, h, b)$. The original tree $T$ is
shown in (a). In (b), we first remove the edge above $e$, that is, we prune the subtree $T_e$. Then we introduce a new node above $h$
which will be the new root of the pruned subtree. We also introduce a new node above $b$; this is where we will reconnect
the subtree back to $T$. Finally we rearrange the tree and obtain the resulting tree $T'$ as in (c).

Solving the SPR-S problem
Algorithm 2 Algorithm for the SPR-S problem

1: procedure SPR-S($P$, $T$)
   Input: A profile $P = (T_1, \ldots, T_n)$, a supertree $T$ of $P$
   Output: $T^* \in SPR_T$, and $\Delta_P(T, T^*)$
2:   for all $(u, v) \in E(T)$ do
3:     Store the value of SPR-RS($P$, $T$, $(u, v)$)
4:   end for
5:   $(T^*, d) \leftarrow$ the stored value of the SPR-RS calls that has the maximum score increase, found by traversing the tree $T$ in post-order
6:   return $(T^*, d)$
7: end procedure

Algorithm 2 gives a trivial extension of Algorithm 1 to
solve the SPR-S problem.

Computing $\Delta_P(T, T')$ efficiently
Algorithm 1 assumed the computation of $\Delta_P(T, T')$ for
each move-down operation (Line 14). In this section we
show how to compute each $\Delta_P(T, T')$ efficiently by exploiting
structural properties related to the triplet-similarity.
We begin with some useful definitions.

Definition 7. Let $A, B, C$ be pairwise mutually exclusive leaf
sets; we extend the triplet notation by $AB|C = \{ab|c : a \in A, b \in B, c \in C\}$.
Further, let $u, v, w$ be three nodes in a tree $T$ having
no ancestral relationships; we define $uv|_T w = Le(T_u)\,Le(T_v)\,|\,Le(T_w)$.

Definition 8. The Boolean value of a statement $\varphi$, denoted
as $[\varphi]$, is 1 if $\varphi$ is true, 0 otherwise.

Definition 9. Given a profile $P = (T_1, \ldots, T_n)$ and distinct
$a, b, c \in Le(P)$, we define the triplet summation function by

$\sigma_P(ab|c) = \sum_{i=1}^{n} [ab|c \in Tr(T_i)]$

Let $A, B, C \subseteq Le(P)$ be pairwise mutually exclusive leaf sets; we
extend the triplet summation function by

$\sigma_P(AB|C) = \sum_{a \in A,\, b \in B,\, c \in C} \sigma_P(ab|c)$

Further, let $u, v, w$ be three nodes in a tree $T$ having no ancestral
relationships; we define

$\sigma_{P,T}(uv|w) = \sigma_P(uv|_T w)$

Lemma 1. Let $T$ be a tree and $yz \bowtie v$ be an immediate triplet
induced by a node $v \in V(T)$. If $T' = SPR_T(v, y)$, then

$Tr(T') = (Tr(T) \setminus yz|_T v) \cup vy|_{T'} z$ (1)

Proof. Let $a, b, c \in Le(T)$. We consider the following
cases:

1. If any one of $a, b, c$ is not in the subtree $T_{Pa_T(v)}$, then
$T'|\{a, b, c\} = T|\{a, b, c\}$. Since both $yz|_T v$ and $vy|_{T'} z$ only
contain triplets formed under the subtree $T_{Pa_T(v)}$, $T|\{a, b, c\}$
and only this triplet resolution of $\{a, b, c\}$ is in both
sides of the equality.

2. Consider the three subtrees $T_v$, $T_y$, $T_z$. If $a, b, c$ are all in one
of the subtrees, then since the subtrees $T_v$, $T_y$, $T_z$ do not
change by the SPR operation, $T'|\{a, b, c\} = T|\{a, b, c\}$.
Since both $yz|_T v$ and $vy|_{T'} z$ only contain triplets that are
formed by one leaf from each of the $T_v$, $T_y$, and $T_z$ subtrees,
$T|\{a, b, c\}$ and only this triplet resolution of $\{a, b, c\}$ is in
both sides of the equality.

3. Consider the three subtrees $T_v$, $T_y$, $T_z$. If two leaves of $\{a, b, c\}$
are in one subtree and the other leaf is in another subtree,
and suppose WLOG that $\{a, b\}$ are in one subtree,
then we observe that $lca_T(\{a, b\}) <_T lca_T(\{a, b, c\})$ and
$lca_{T'}(\{a, b\}) <_{T'} lca_{T'}(\{a, b, c\})$, so $T'|\{a, b, c\} = T|\{a, b, c\} = ab|c$.
Also, as in Case 2, neither $yz|_T v$ nor $vy|_{T'} z$ contains a
triplet formed by $\{a, b, c\}$, so $T|\{a, b, c\}$ and only this triplet
resolution of $\{a, b, c\}$ is in both sides of the equality.

4. If each of $T_v$, $T_y$, and $T_z$ contains exactly one leaf in $\{a, b, c\}$,
suppose WLOG that $a \in Le(T_v)$, $b \in Le(T_y)$, and
$c \in Le(T_z)$. Then $T|\{a, b, c\} = bc|a$ and $T'|\{a, b, c\} = ab|c$.
Also we observe that $bc|a \in yz|_T v$ and $ab|c \in vy|_{T'} z$; therefore
the RHS, and hence both sides, contain $ab|c$ and only this
resolution of $\{a, b, c\}$.

Lemma 2. Given a profile $P$ and a supertree $T$ of $P$, let $yz \bowtie v$
be an immediate triplet induced by a node $v \in V(T)$. If $T' = SPR_T(v, y)$, then

$\Delta_P(T, T') = \sigma_{P,T}(vy|z) - \sigma_{P,T}(yz|v)$ (2)

Lemma 3. Given a profile $P$ and a supertree $T$ of $P$, let distinct
$v, b \in V(T)$ such that $v \not\le_T b$, $b \not\le_T v$, and $Ch_T(b) = \{y, z\}$. If $T_1 = SPR_T(v, b)$
and $T_2 = SPR_T(v, y)$, then

$\Delta_P(T_1, T_2) = \sigma_{P,T}(vy|z) - \sigma_{P,T}(yz|v)$ (3)
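
Lemma 2 can be checked numerically on a small instance. The sketch below is only a sanity check under assumed encodings (nested 2-tuples for trees, `(frozenset({a, b}), c)` for $ab|c$); the profile, the subtrees `y`, `z`, `v`, and all function names are invented for the example. It compares the score difference computed directly from Definition 5 with the value $\sigma_{P,T}(vy|z) - \sigma_{P,T}(yz|v)$.

```python
from itertools import combinations, product


def leaves(tree):
    """Leaf set of a rooted binary tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def triplets(tree):
    """Tr(tree) as a set of (frozenset({a, b}), c) pairs, each meaning ab|c."""
    if isinstance(tree, str):
        return set()
    left, right = tree
    found = triplets(left) | triplets(right)
    for near, far in ((left, right), (right, left)):
        for a, b in combinations(sorted(leaves(near)), 2):
            for c in leaves(far):
                found.add((frozenset((a, b)), c))
    return found


def similarity(profile, tree):
    """Triplet-similarity score S(P, tree)."""
    tr = triplets(tree)
    return sum(len(triplets(t) & tr) for t in profile)


def sigma(profile, A, B, C):
    """sigma_P(AB|C) for pairwise disjoint leaf sets A, B, C."""
    wanted = {(frozenset((a, b)), c) for a, b, c in product(A, B, C)}
    return sum(len(triplets(t) & wanted) for t in profile)


# T has the immediate triplet yz induced by v; T_prime = SPR_T(v, y).
y, z, v = ("a", "b"), "c", ("d", "e")
T = ((y, z), v)
T_prime = ((y, v), z)
profile = [(("a", "d"), "c"), (("a", "c"), "d"), ((("b", "e"), "c"), "a")]

direct = similarity(profile, T_prime) - similarity(profile, T)
by_lemma = (sigma(profile, leaves(v), leaves(y), leaves(z))
            - sigma(profile, leaves(y), leaves(z), leaves(v)))
print(direct, by_lemma)
```

Both quantities agree on this instance, which reflects that a single move-down only exchanges the triplets in $yz|v$ for those in $vy|z$.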


Lemma 4. Given a profile $P$ and a supertree $T$ of $P$, let $v, b \in V(T)$
such that $Pa_T(Pa_T(v))$ [...], and WLOG suppose $y$ [...]. If $T_1 = SPR_T(v, b)$ and
$T_2 = SPR_T(v, y)$, then

$\Delta_P(T_1, T_2) = \sum_{x \in C} \left( \sigma_{P,T}(vy|x) - \sigma_{P,T}(yx|v) \right)$ (4)


Equations (3) and (4) provide computations for all SPR-
RS neighborhoods (Line 14 of Algorithm 1). We now
show how to compute them efficiently.

Algorithm 3 Algorithm to compute triplet summation
function

1: procedure PREPROCESSTRIPLETSUM($P$)
   Input: A profile $P = (T_1, \ldots, T_n)$
2:   Initialize all values of $\sigma_P$ to 0
3:   for $i = 1$ to $n$ do
4:     for all $u \in V(T_i)$ in post-order, $u \notin Le(T_i)$ do
5:       $\{v, w\} \leftarrow Ch_{T_i}(u)$
6:       for all $\{x, y\} \subseteq Le((T_i)_v)$, $z \in Le((T_i)_w)$ do
7:         Increment $\sigma_P(xy|z)$
8:       end for
9:       for all $\{x, y\} \subseteq Le((T_i)_w)$, $z \in Le((T_i)_v)$ do
10:        Increment $\sigma_P(xy|z)$
11:      end for
12:    end for
13:  end for
14: end procedure

Given a profile $P = (T_1, \ldots, T_n)$, we start by computing the
triplet summation function $\sigma_P$ for all triplets, as shown by
Algorithm 3.
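
A direct transcription of Algorithm 3 might look like the sketch below. It assumes the nested-tuple tree encoding and the `(frozenset({x, y}), z)` key for $xy|z$, both introduced only for this example; a `Counter` plays the role of the table $\sigma_P$. For brevity the leaf sets are recomputed at every node, which a careful implementation would avoid.

```python
from collections import Counter
from itertools import combinations


def leaf_set(tree):
    """Leaf set of a rooted binary tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return leaf_set(tree[0]) | leaf_set(tree[1])


def preprocess_triplet_sum(profile):
    """Return sigma_P as a Counter keyed by (frozenset({x, y}), z) for xy|z."""
    sigma = Counter()

    def visit(node):
        if isinstance(node, str):
            return  # leaves contribute no triplets
        v, w = node
        visit(v)
        visit(w)  # post-order: children before the node itself
        # Two leaves below one child and one leaf below the other child.
        for near, far in ((v, w), (w, v)):
            for x, y in combinations(sorted(leaf_set(near)), 2):
                for z in leaf_set(far):
                    sigma[(frozenset((x, y)), z)] += 1

    for tree in profile:
        visit(tree)
    return sigma


# Hypothetical profile of three small input trees.
profile = [(("a", "b"), "c"), (("a", "b"), ("c", "d")), (("a", "c"), "b")]
sigma = preprocess_triplet_sum(profile)
print(sigma[(frozenset(("a", "b")), "c")])
```

Every triplet of every input tree is counted exactly once, at the least common ancestor of its three leaves.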

Algorithm 4 Algorithm to compute extended triplet summation
function

1: procedure PREPROCESSEXTENDEDTRIPLETSUM($P$, $T$)
   Input: A profile $P = (T_1, \ldots, T_n)$, a supertree $T$ of $P$
2:   for all $u \in V(T)$ in post-order do
3:     for all $v \in V(T)$ in post-order after $u$, $lca_T(\{u, v\}) \notin \{u, v\}$ do
4:       for all $w \in V(T)$ in post-order after $v$, $lca_T(\{u, v, w\}) \notin \{u, v, w\}$ do
5:         if $w \in Le(T)$ then
6:           if $v \in Le(T)$ then
7:             if $u \in Le(T)$ then
8:               $\sigma_{P,T}(uv|w) \leftarrow \sigma_P(uv|w)$
9:               $\sigma_{P,T}(uw|v) \leftarrow \sigma_P(uw|v)$
10:              $\sigma_{P,T}(vw|u) \leftarrow \sigma_P(vw|u)$
11:            else
12:              $\{u_1, u_2\} \leftarrow Ch_T(u)$
13:              $\sigma_{P,T}(uv|w) \leftarrow \sigma_{P,T}(u_1 v|w) + \sigma_{P,T}(u_2 v|w)$
14:              $\sigma_{P,T}(uw|v) \leftarrow \sigma_{P,T}(u_1 w|v) + \sigma_{P,T}(u_2 w|v)$
15:              $\sigma_{P,T}(vw|u) \leftarrow \sigma_{P,T}(vw|u_1) + \sigma_{P,T}(vw|u_2)$
16:            end if
17:          else
18:            $\{v_1, v_2\} \leftarrow Ch_T(v)$
19:            $\sigma_{P,T}(uv|w) \leftarrow \sigma_{P,T}(u v_1|w) + \sigma_{P,T}(u v_2|w)$
20:            $\sigma_{P,T}(uw|v) \leftarrow \sigma_{P,T}(uw|v_1) + \sigma_{P,T}(uw|v_2)$
21:            $\sigma_{P,T}(vw|u) \leftarrow \sigma_{P,T}(v_1 w|u) + \sigma_{P,T}(v_2 w|u)$
22:          end if
23:        else
24:          $\{w_1, w_2\} \leftarrow Ch_T(w)$
25:          $\sigma_{P,T}(uv|w) \leftarrow \sigma_{P,T}(uv|w_1) + \sigma_{P,T}(uv|w_2)$
26:          $\sigma_{P,T}(uw|v) \leftarrow \sigma_{P,T}(u w_1|v) + \sigma_{P,T}(u w_2|v)$
27:          $\sigma_{P,T}(vw|u) \leftarrow \sigma_{P,T}(v w_1|u) + \sigma_{P,T}(v w_2|u)$
28:        end if
29:      end for
30:    end for
31:  end for
32: end procedure

Next, we compute the extended triplet summation function
$\sigma_{P,T}$ for all nodes $u, v, w$ in $T$ having no ancestral relationships,
as shown by Algorithm 4.
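
The recurrence behind Algorithm 4 splits an internal node into its two children and adds the two partial sums. The sketch below expresses that idea with memoized recursion instead of the explicit post-order loops; it is an illustration under assumed encodings (subtrees as nested 2-tuples, `sigma` as a dictionary keyed by `(frozenset({a, b}), c)`), and the names `make_extended_sum` and `ext` are hypothetical.

```python
from functools import lru_cache


def make_extended_sum(sigma):
    """Build a memoized function ext(u, v, w) returning sigma_{P,T}(uv|w).

    u, v, w are subtrees (nested 2-tuples or leaf strings) that are assumed
    to be pairwise unrelated, so their leaf sets are disjoint.
    """

    @lru_cache(maxsize=None)
    def ext(u, v, w):
        args = [u, v, w]
        for i, node in enumerate(args):
            if not isinstance(node, str):
                # Split the first internal node into its two children
                # and add the two partial sums.
                total = 0
                for child in node:
                    args[i] = child
                    total += ext(*args)
                return total
        # Base case: all three are leaves, look up the stored triplet count.
        return sigma.get((frozenset((u, v)), w), 0)

    return ext


# Hypothetical leaf-level counts sigma_P(ac|d) = 2 and sigma_P(bc|d) = 3.
sigma = {
    (frozenset(("a", "c")), "d"): 2,
    (frozenset(("b", "c")), "d"): 3,
}
ext = make_extended_sum(sigma)
value = ext(("a", "b"), "c", "d")
print(value)
```

Because each node triple is evaluated once and combines at most two stored values, the table for a fixed supertree has cubic size in the number of nodes, in line with the stated pre-processing cost.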

Algorithm 5 Algorithm to compute score difference
function

1: procedure PREPROCESSSCOREDIFFERENCE($P$, $T$)
   Input: A profile $P = (T_1, \ldots, T_n)$, a supertree $T$ of $P$
2:   for all $v \in V(T)$ do
3:     for all $(b, y) \in E(T)$: $b$ [...] $Pa_T(v)$ do
4:       Let $Ch_T(b) = \{y, z\}$
5:       Let $T_1 = SPR_T(v, b)$, $T_2 = SPR_T(v, y)$
6:       if $v$ [...] then
7:         for all $p \in V(T)$: $v$ [...] do
8:           Let $x \in Ch_T(p)$ where [...] $v$ [...] $x$
9:           $\Delta_P(T_1, T_2) \leftarrow \Delta_P(T_1, T_2) + \sigma_{P,T}(vy|x) - \sigma_{P,T}(yx|v)$
10:        end for
11:      else
12:        $\Delta_P(T_1, T_2) \leftarrow \sigma_{P,T}(vy|z) - \sigma_{P,T}(yz|v)$
13:      end if
14:    end for
15:  end for
16: end procedure

Finally, we compute the score difference function $\Delta_P(T_1, T_2)$
for all $v \in V(T)$ and $(b, y) \in E(T)$ where $b$ [...] $Pa_T(v)$, such
that $T_1 = SPR_T(v, b)$ and $T_2 = SPR_T(v, y)$. This is shown by
Algorithm 5.

Time complexity
We describe the time complexity for our solution for the
SPR-S problem. First, we run Algorithm 3 once for the
entire heuristic run, which takes $O(kn^3)$ time, where $k$ is the
number of input trees and $n$ is the number of taxa present
in the input trees. Then, for each SPR-S problem, we begin
by pre-processing the necessary counts using Algorithms 4 and
5, each of which takes $O(n^3)$ time. These pre-processed computations
allow Algorithm 1 to run in $O(n)$ time. Finally, Algorithm 2
issues $O(n)$ calls to Algorithm 1, so overall it takes
$O(n^2)$ time. Including the pre-processing steps, we see that
solving the SPR-S problem takes $O(n^3)$ time, with an
expense of $O(kn^3)$ at the beginning of the entire heuristic
run.
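
The accounting above can be written out as a short worked derivation. The symbol $m$, for the number of local search steps in one heuristic run, is introduced here only for illustration and does not appear in the original analysis.

```latex
\begin{align*}
T_{\text{SPR-S}}(n)
  &= \underbrace{O(n^3)}_{\text{Algorithm 4}}
   + \underbrace{O(n^3)}_{\text{Algorithm 5}}
   + \underbrace{O(n)}_{\text{calls issued by Algorithm 2}}
     \cdot \underbrace{O(n)}_{\text{Algorithm 1}} \\
  &= O(n^3) + O(n^3) + O(n^2) = O(n^3), \\[4pt]
T_{\text{run}}(k, n, m)
  &= \underbrace{O(kn^3)}_{\text{Algorithm 3, once}}
   + m \cdot O(n^3) = O\!\left((k + m)\,n^3\right).
\end{align*}
```

Under this reading, the one-time $O(kn^3)$ cost is amortized over all $m$ steps, and the per-step cost no longer depends on the number of input trees $k$.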

Solving the TBR-RS and TBR-S problems
We extend our solutions of SPR-RS and SPR-S problems to
solve TBR-RS and TBR-S problems.

Solving the TBR-RS problem
We observe that a TBR operation can be viewed as an SPR
operation followed by an RR operation. We exploit this
structural property further and establish some lemmas
which helps us compute the score differences for all TBR
operations in the TBR-RS neighborhood.

Lemma 5. Given a profile $P$, a supertree $T$ of $P$, and a valid
TBR operation on $T$ where $T' = TBR_T(v, x, y)$, then

$\Delta_P(T, T') = \Delta_P(T, SPR_T(v, y)) + \Delta_P(T, RR_T(v, x))$ (5)





Lemma 5 implies that, given a subtree, we can find the best
TBR operation in the TBR-RS neighborhood by finding the best
SPR operation in the SPR-RS neighborhood and applying the best
re-rooting of the subtree, regardless of which SPR operation
was chosen. Further, we note that RR is a special case of the
SPR operation, by the following lemma.

Lemma 6. Given a profile $P$ and a supertree $T$ of $P$, let $xy|z$
be an immediate triplet induced by a node $z \in V(T)$ and
$Pa_T(z) = v$, then


$RR_T(v, x) = SPR_T(z, y)$


This means that we can reuse Lemmas 2, 3, 4 and their
corresponding algorithms to compute the score differ-
ences of all move-downs in terms of re-rooting. Hence,
given a subtree the algorithm would first compute the best
re-rooting and its score difference, then simply join this
RR operation with the best SPR operation in the SPR-RS
neighborhood.
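
Because the score difference in Lemma 5 is a sum of an SPR term and an RR term, the two choices can be optimized independently. The following minimal Python sketch shows only that selection step. The dictionaries `spr_delta` and `rr_delta`, the function names, and the toy numbers are hypothetical; the sketch assumes that the score differences have already been computed and that a larger difference is better.

```python
from itertools import product


def best_tbr(spr_delta, rr_delta):
    """Choose the best (new root, regraft target) pair for one pruned subtree.

    spr_delta: hypothetical map from regraft target y to Delta_P(T, SPR_T(v, y)).
    rr_delta:  hypothetical map from re-rooting choice x to Delta_P(T, RR_T(v, x)).
    Assumes larger score differences are better.
    """
    # By Lemma 5 the total is a sum, so each term can be maximized on its own.
    best_y = max(spr_delta, key=spr_delta.get)
    best_x = max(rr_delta, key=rr_delta.get)
    return best_x, best_y, spr_delta[best_y] + rr_delta[best_x]


def best_tbr_bruteforce(spr_delta, rr_delta):
    """Reference check: try every (x, y) pair and return the best total."""
    return max(spr_delta[y] + rr_delta[x] for x, y in product(rr_delta, spr_delta))


# Toy values, not taken from any data set.
spr_delta = {"y1": 3, "y2": -1, "y3": 5}
rr_delta = {"x1": 0, "x2": 2}
print(best_tbr(spr_delta, rr_delta))
```

The independent selection inspects each candidate once, whereas the brute-force check inspects every pair; both return the same best total.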

Solving the TBR-S problem
Similar to solving the SPR-S problem, we can solve the
TBR-S problem given the solution to the TBR-RS problem
in the previous section.

Time complexity
We perform the same steps as for the SPR-S problem and,
additionally, use Lemmas 5 and 6 to compute
and store the best re-rooting for each subtree, which takes
$O(n^3)$ time. Overall, solving the TBR-S problem still takes
$O(n^3)$ time.

Results and discussion
We examined the performance of our new triplet heuristics
by comparing them with two other supertree methods, MRP
and modified Min-Cut (MMC; [20]), using published
data sets from marsupials [21] and Cetartiodactyla [22].
MRP is the most widely applied supertree method [3].
However, MRP supertrees, like triplet supertrees, are
intrinsically hard to compute. Therefore they are esti-
mated using heuristics, which do not guarantee an opti-
mal solution. In contrast, MMC supertrees can be
computed exactly in polynomial time, and therefore, it
has been suggested that MMC will be useful for building
very large phylogenies [20]. We evaluated each of the
supertree methods using the triplet-similarity and the
maximum agreement subtree (MAST) similarity [23]
between the input trees and the supertrees. Furthermore,
we measured the parsimony score of each computed
supertree based on its binary matrix representation. For
the marsupial data set, we also compared our results to
published results using the max cut (MXC) supertree
algorithm [24]. MXC is a modification of MMC that provides
a heuristic approach based on the triplet supertree
problem. There is currently no publicly available
implementation of MXC (S. Snir, pers. comm.), and therefore, we were
unable to apply it to the other data set.
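
To make the triplet-similarity measure concrete, here is a small Python sketch. It assumes that triplet-similarity is the fraction of resolved triplets in the input trees that the supertree also displays, which may differ in detail from the measure used in the experiments. The nested-tuple tree encoding and all function names are illustrative only, and the brute-force enumeration is not the efficient algorithm developed in this paper.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels of a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def triplets(tree):
    """Return all resolved triplets ab|c of a rooted tree.

    A triplet is stored as (frozenset({a, b}), c). It is recorded at the
    node where a and b fall in one child subtree and c falls in another.
    """
    found = set()
    if not isinstance(tree, tuple):
        return found
    child_leaves = [leaves(child) for child in tree]
    for i, child in enumerate(tree):
        found |= triplets(child)
        others = set()
        for j, labels in enumerate(child_leaves):
            if j != i:
                others |= labels
        for a, b in combinations(child_leaves[i], 2):
            for c in others:
                found.add((frozenset((a, b)), c))
    return found


def triplet_similarity(input_trees, supertree):
    """Fraction of input-tree triplets that the supertree also displays."""
    super_triplets = triplets(supertree)
    shared = total = 0
    for tree in input_trees:
        tree_triplets = triplets(tree)
        total += len(tree_triplets)
        shared += len(tree_triplets & super_triplets)
    return shared / total if total else 1.0


# Toy example: the supertree displays ab|c but not ad|c.
supertree = ((("a", "b"), "c"), "d")
inputs = [(("a", "b"), "c"), (("a", "d"), "c")]
print(triplet_similarity(inputs, supertree))
```

Enumerating triplets this way takes time cubic in the number of taxa per tree, so it is suitable only for small illustrations.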

Triplet supertrees were constructed using the programs
TH(SPR) and TH(TBR) that implement hill-climbing heu-
ristics based on our efficient local search algorithms for
SPR and TBR branch swapping respectively. MRP super-
trees were constructed using the parsimony hill-climbing
heuristic implemented in PAUP* [9] with TBR branch
swapping. We found very little difference in the results of
MRP analyses when we collapsed zero-length branches
and when we forced all MRP trees to be binary. We report
the results from analyses that collapsed zero-length
branches. All hill-climbing analyses were run from 20
initial trees, each built by a random addition sequence
replicate, saving a single best tree per replicate. MMC supertrees
were constructed using a program [20] supplied by Rod Page.
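
The general shape of a hill-climbing search with several starting points can be sketched in a few lines of Python. This is a generic illustration on a toy objective, not the TH(SPR) or TH(TBR) implementation; the function names, the `neighbors` and `score` callbacks, and the toy score list are all hypothetical.

```python
def hill_climb(start, neighbors, score):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current, current_score = start, score(start)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        if best is None:
            # Local optimum: no neighbor is strictly better.
            return current, current_score
        current, current_score = best, best_score


def best_of_replicates(starts, neighbors, score):
    """Run one hill climb per starting point and keep the best result."""
    results = [hill_climb(start, neighbors, score) for start in starts]
    return max(results, key=lambda result: result[1])


# Toy search space: positions in a list, scored by the value stored there.
toy_scores = [1, 3, 2, 0, 5, 9, 4]


def toy_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(toy_scores)]


def toy_score(i):
    return toy_scores[i]


print(hill_climb(0, toy_neighbors, toy_score))
print(best_of_replicates([0, 3], toy_neighbors, toy_score))
```

In the toy example, the climb from position 0 stops at a local optimum, while the climb from position 3 reaches a better one, which is why several replicates are run and the best result is kept.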

Our new triplet heuristics seek the supertrees with the
maximum number of identical triplets to the collection of
input trees, and indeed, we find they both outperform all
other methods based on triplet-similarity in all data sets
(Table 1). Furthermore, all the triplet supertree analyses
were completed within 15 minutes using a Kensington
quad-core 2.66 GHz Linux-based machine, demonstrat-
ing that our heuristics make the triplet supertree problem
extremely tractable for large-scale analyses. Both triplet
heuristics and the MRP heuristic perform much better
than the exact MMC algorithm, based on triplet-similarity
and MAST-similarity. The MXC algorithm suffers from the
lack of an available implementation; however, in our sin-
gle comparison with the published results, this algorithm
does not perform as well as either the MRP heuristic or our
triplet heuristics.

Although the difference in the triplet-similarity score
between the triplet heuristics and the MRP heuristic is
always less than 2% (Table 1), the number of triplets is so
large that even these apparently small differences
represent large differences in tree topologies. For
example, in the marsupial data set, the 0.7% difference in
triplet-similarity represents over 17,400 triplets. The com-
parison of the triplet heuristics and the MRP heuristic also
demonstrates that optimizing the parsimony score of the
matrix representation of input trees is not directly corre-
lated with optimizing the triplet-similarity of input trees
and the supertree; supertrees with smaller (better) parsi-
mony scores have lower (worse) triplet-similarity (Table
1).
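
As a rough check on the marsupial figures, the quoted values can be combined to estimate how many triplets are being compared. The following back-calculation is only illustrative: the total is not reported in the text, and the estimate inherits the rounding of the percentages in Table 1 and of the "over 17,400" count.

```latex
\begin{align*}
98.99\% - 98.29\% &= 0.70\%, \\
N_{\text{triplets}} &\approx \frac{17{,}400}{0.0070} \approx 2.49 \times 10^{6}.
\end{align*}
```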

Our experiments also demonstrate that the supertree with
the best triplet-similarity is not necessarily best in terms of
MAST-similarity. In fact, MRP outperforms the triplet
heuristics in terms of the MAST-similarity to the input trees
(Table 1). It is not intuitive that a parsimony analysis on
a matrix representation of the input trees (MRP) is a valid
or useful approach to infer the most accurate supertrees
(but see [25]). The popularity of MRP is probably based
more on the availability of programs that implement fast
heuristics rather than a belief that parsimony on a matrix
representation of input trees is the ideal supertree
optimality criterion. However, MRP performs well in simple
simulation experiments [26,27] and in analyses of
empirical data (e.g., [13]), and it clearly can be an effective
supertree method. Since it is not obvious whether it is
better to find supertrees that maximize accuracy in terms of
triplet-similarity, MAST-similarity, or some other tree
similarity measure like the Robinson-Foulds distance, we
suggest that the triplet heuristics are an informative
complement to the MRP method. Both approaches can
provide supertrees that represent different, and equally
valid, perspectives on accuracy.

Table 1: Results of supertree analyses of two empirical data sets. Note that a bolded number represents the best found score for each
measurement in a data set.

Data Set              Method    Triplet-similarity  MAST-similarity  Pars. Score
Marsupial [21]        MMC       51.73 %             54.20 %          3901
158 input trees       MXC       S96 %               66 %             N/A
267 taxa              MRP       98.29 %             71.70 %          2274
                      TH(SPR)   98.99 %             70.50 %          2317
                      TH(TBR)   98.99 %             70.70 %          2317

Cetartiodactyla [22]  MMC       70.03 %             54.20 %
201 input trees       MRP       95.84 %             65.40 %          2603
290 taxa              TH(SPR)   97.28 %             63.40 %          2754
                      TH(TBR)   97.28 %             63.50 %

Conclusion
Despite the inherent complexity of the triplet supertree
problem, we have shown that it can be addressed effec-
tively by using hill-climbing heuristics. We introduced
efficient algorithms for standard local search problems
that are solved by these heuristics. Our algorithms greatly
improve on the best known (naive) solutions for these
search problems. This in turn makes hill-climbing heuris-
tics for the triplet supertree problem applicable for large-
scale phylogenetic studies.

We demonstrate the performance of an implementation
of our hill-climbing heuristics. In analyses of two empiri-
cal data sets, our triplet heuristics quickly found super-
trees that contained more triplets in common with the
input trees than supertrees found by MRP, MMC, or MXC.
These results demonstrate that our heuristics for
the triplet supertree problem make it a valuable
alternative to standard supertree methods. They also
demonstrate that developing new supertree heuristics that
directly seek to optimize the accuracy of the supertree with
respect to the input trees can enhance our ability to infer
large sections of the tree of life accurately.

The algorithmic ideas developed in this work might lay the
groundwork for theoretical properties that identify a much
broader class of local search objectives that can be
solved more efficiently. This could lead to other powerful
supertree heuristics. However, it remains an open
problem whether our solutions for the SPR and TBR based local
search problems for the triplet supertree problem can be
improved further.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
HTL designed the triplet heuristics, implemented pro-
grams TH(SPR) and TH(TBR), and carried out the experi-
ments. JGB led the analysis of the experimental results. OE
inspired the triplet heuristics and supervised the project.
All authors contributed to the writing of this manuscript,
and have read and approved the final manuscript.

Acknowledgements
This research is supported in part by National Science Foundation AToL
grant EF-0334832 and NESCent (NSF EF-0423641).

This article has been published as part of BMC Bioinformatics Volume 10
Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics
Conference (APBC) 2009. The full contents of the supplement are available
online at http://www.biomedcentral.com/1471-2105/10?issue=S1

References
1. Gordon AD: Consensus supertrees: The synthesis of rooted
trees containing overlapping sets of labeled leaves. Journal of
Classification 1986, 3(2):335-348.
2. Sanderson MJ, Purvis A, Henze C: Phylogenetic supertrees:
assembling the trees of life. Trends in Ecology & Evolution 1998,
13(3):105-109.
3. Bininda-Emonds ORP, Gittleman JL, Steel MA: The (super) tree of
life: procedures, problems, and prospects. Annual Review of
Ecology and Systematics 2002, 33:265-289.
4. Bininda-Emonds ORP: The evolution of supertrees. Trends in
Ecology and Evolution 2004, 19:315-22.
5. Davies JT, Barraclough TG, Chase MW, Soltis PS, Soltis DE,
Savolainen V: Darwin's abominable mystery: Insights from a
supertree of the angiosperms. PNAS 2004, 101(7):1904-1909.
6. Bininda-Emonds ORP, Cardillo M, Jones KE, MacPhee RDE, Beck
RMD, Grenyer R, Price SA, Vos RA, Gittleman JL, Purvis A: The
delayed rise of present-day mammals. Nature 2007,
446(7135):507-512.
7. Baum BR: Combining trees as a way of combining data sets for
phylogenetic inference, and the desirability of combining
gene trees. Taxon 1992, 41:3-10.
8. Ragan MA: Phylogenetic inference based on matrix
representation of trees. Molecular Phylogenetics and Evolution 1992, 1:53-58.
9. Swofford DL: PAUP*: Phylogenetic Analysis Using Parsimony (*and Other
Methods). Version 4.0 beta Sunderland, Massachusetts, USA: Sinauer
Assoc; 2002.
10. Goloboff PA: Techniques for analysis of large data sets. In
Techniques in Molecular Systematics and Evolution Edited by: DeSalle R,
Wheeler W, Giribet Ge. Birkhauser-Verlag, Basel; 2000:70-9.
11. Purvis A: A modification to Baum and Ragan's method for
combining phylogenetic trees. Systematic Biology 1995, 44:251-5.
12. Wilkinson M, Cotton JA, Creevey C, et al.: The shape of
supertrees to come: tree shape related properties of fourteen
supertree methods. Systematic Biology 2005, 54:419-31.
13. Chen D, Eulenstein O, Fernandez-Baca D, Burleigh JG: Improved
heuristics for minimum-flip supertree construction.
Evolutionary Bioinformatics 2006, 2:401-410.
14. Bryant D: Building Trees, Hunting for Trees, and Comparing
Trees Theory and Methods in Phylogenetic Analysis. In PhD
thesis University of Canterbury; 1997.
15. Bansal M, Eulenstein O: The Gene-Duplication Problem:
Near-Linear Time Algorithms for NNI Based Local Searches.
Bioinformatics Research and Applications 2008:14-25.
16. Page RDM, Holmes EC: Molecular evolution: a phylogenetic approach
Blackwell Science; 1998.
17. Semple C, Steel M: Phylogenetics Oxford University Press; 2003.
18. Guigo R, Muchnik I, Smith TF: Reconstruction of ancient
molecular phylogeny. Mol Phylogenet Evol 1996, 6(2):189-213.
19. Page RDM, Charleston M: From Gene to organismal phylogeny:
reconciled trees and the gene tree/species tree problem. Mol
Phylogenet Evol 1997, 7:231-240.
20. Page RDM: Modified mincut supertrees. In International
Workshop, Algorithms in Bioinformatics (WABI) Volume 2452. Edited by:
Gusfield D, Guigó R. Lecture Notes in Computer Science, Springer Verlag;
2002:300-315.
21. Cardillo M, Bininda-Emonds ORP, Boakes E, Purvis A: A
species-level phylogenetic supertree of marsupials. Journal of Zoology
2004, 264:11-31.
22. Price SA, Bininda-Emonds ORP, Gittleman JL: A complete
phylogeny of the whales, dolphins and even-toed hoofed mammals
(Cetartiodactyla). Biological Reviews 2005, 80(3):445-473.
23. Burleigh JG, Eulenstein O, Fernandez-Baca D, Sanderson MJ: MRF
supertrees. In Phylogenetic supertrees: Combining Information to Reveal
the Tree of Life Edited by: Bininda-Emonds ORP. Dordrecht: Kluwer
Academic; 2004:65-85.
24. Snir S, Rao S: Using Max Cut to Enhance Rooted Trees
Consistency. IEEE/ACM Trans Comput Biol Bioinformatics 2006,
3(4):323-333.
25. Bruen TC, Bryant D: Parsimony via Consensus. Systematic Biology
2008, 57(2):251-256.
26. Bininda-Emonds ORP, Sanderson MJ: Assessment of the Accuracy
of Matrix Representation with Parsimony Analysis
Supertree Construction. Systematic Biology 2001, 50(4):565-579.
27. Eulenstein O, Chen D, Burleigh JG, Fernandez-Baca D, Sanderson M:
Performance of flip supertrees with a heuristic algorithm.
Systematic Biology 2004, 53(2):299-308.

