A scalable method for identifying frequent subtrees in sets of large phylogenetic trees


Material Information

Title:
A scalable method for identifying frequent subtrees in sets of large phylogenetic trees
Physical Description:
Mixed Material
Language:
English
Creator:
Ramu, Avinash
Kahveci, Tamer
Burleigh, J Gordon
Publisher:
Bio-Med Central (BMC Bioinformatics)
Publication Date:

Notes

Abstract:
Background: We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees. Results: We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Conclusions: Although this heuristic does not guarantee to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses. Keywords: Phylogenetic trees, Frequent subtree
General Note:
Publication of this article was funded in part by the University of Florida Open-Access publishing Fund. In addition, requestors receiving funding through the UFOAP project are expected to submit a post-review, final draft of the article to UF's institutional repository, IR@UF, (www.uflib.ufl.edu/UFir) at the time of funding. The institutional Repository at the University of Florida community, with research, news, outreach, and educational materials.
General Note:
Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256; pages 1-15
General Note:
doi:10.1186/1471-2105-13-256 Cite this article as: Ramu et al.: A scalable method for identifying frequent subtrees in sets of large phylogenetic trees. BMC Bioinformatics 2012 13:256.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All rights reserved by the source institution.
System ID:
AA00013516:00001




Ramu et al. BMC Bioinformatics 2012, 13:256
http://www.biomedcentral.com/1471-2105/13/256




A scalable method for identifying frequent subtrees in sets of large phylogenetic trees

Avinash Ramu1, Tamer Kahveci2* and J Gordon Burleigh3


Abstract
Background: We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a
collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around
100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.
Results: We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our
method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees
which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger
candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent
agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic
can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.
Conclusions: Although this heuristic does not guarantee to find all MFASTs or the largest MFAST, it found the MFAST
in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large
empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a
simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
Keywords: Phylogenetic trees, Frequent subtree


Background
Phylogenetic trees represent the evolutionary relationships
of organisms. While recent advances in genomic
sequencing technology and computational methods have
enabled construction of extremely large phylogenetic
trees (e.g., [1-3]), assessing the support for phylogenetic
hypotheses, and ultimately identifying well-supported
relationships, remains a major challenge in phylogenet-
ics. Support for a tree often is determined by methods
such as nonparametric bootstrapping [4], jackknifing [5],
or Bayesian MCMC sampling (e.g., [6]), which generate
a collection of trees with identical taxa representing the
range of possible phylogenetic relationships. These trees
can be summarized in a consensus tree (see [7]). Consen-
sus methods can highlight support for specific nodes in
a tree, but they also may obscure highly supported subtrees.
For example, in Figure 1, the subtree containing taxa
A, B, C, and D is present in all five input trees. However,
due to the uncertain placement of taxon E, the majority
rule consensus tree implies that the clades in the tree have
relatively low (60%) support.

*Correspondence: tamer@cise.ufl.edu
2Computer and Information Science and Engineering, University of Florida,
Gainesville, FL, USA
Full list of author information is available at the end of the article

Alternate approaches have been proposed to reveal
highly supported subtrees. The maximum agreement sub-
tree (MAST) problem seeks the largest subtree that is
present in all members of a given collection of trees [8].
For example, in Figure 1 the MAST includes taxa A, B,
C, and D. Finding the MAST is an NP-hard problem [9],
although efficient algorithms exist to compute the MAST
in some cases (e.g., [9-17]). In practice, since any differ-
ence in any single tree will reduce the size of the MAST,
the MAST is often quite small, limiting its usefulness.
A less restrictive problem is to find frequent agreement
subtrees (FAST), or subtrees that are found in many, but
not necessarily all, of the input trees (see [18]). In this
problem, a subtree is declared as frequent if it is in at
least as many trees as a user supplied frequency thresh-
old. Several algorithmic approaches have been suggested
to identify FASTs, and specifically the maximum FASTs




Figure 1 (a) A collection of five input trees. The same subtree with taxa A, B, C, and D is present in all input trees, and only
the position of taxon E changes. (b) The majority rule consensus and maximum agreement subtrees of the 5 input trees in Figure 1(a).


(MFASTs), or FASTs that contain the largest number of
taxa. A variant of this problem seeks the maximal FASTs,
i.e., FASTs that are not contained in any other FASTs.
Notice that every MFAST is a maximal FAST; however, the
converse is not necessarily true. Zhang and Wang defined
algorithms, implemented in Phylominer, to identify FASTs
from a collection of phylogenetic trees [19,20]. These algo-
rithms are guaranteed to find all FASTs but they may
be prohibitively slow for data sets larger than 20 taxa.
Cranston and Rannala implemented Metropolis-Hastings
and Threshold Accepting searches to identify large FASTs
from a Bayesian posterior distribution of phylogenetic
trees [21]. This approach can handle thousands of input
trees but it may not be feasible if the trees have more than
100 taxa [21].
Another approach to reveal highly supported subtrees
from a collection of trees is to identify and remove rogue
taxa, or taxa whose position in the input trees is least con-
sistent. Recently, several methods have been developed
that can identify and remove rogue taxa from collec-
tions of trees with thousands of taxa [22-24]. However,
unlike MAST or FAST approaches, they do not provide
guarantees about the support for the remaining taxa.
In this paper, we describe a heuristic approach for iden-
tifying MFASTs in collections of trees. Unlike previous
methods, our method easily scales to datasets with over a
thousand taxa and hundreds of trees. Towards this goal,
we develop a heuristic solution that works in multiple
phases. In the first phase, it identifies small candidate sub-
trees from the set of input trees which serve as the seeds
of larger subtrees. In the second phase, it combines these
seeds to build larger candidate MFASTs. In the final phase,
it performs a post processing step. This step ensures that
the size (i.e., number of taxa) of the FAST found cannot be
increased further by adding a new taxon without reducing
its frequency below a user supplied frequency thresh-
old. We demonstrate that this heuristic can easily handle
data sets with 1000 taxa. We test the effectiveness of
these approaches on simulated data sets and then demonstrate
their performance on large, empirical data sets.
Although our heuristic does not guarantee to find all
MFASTs or the largest MFAST in theory, it found the true
MFAST in all of our synthetic datasets where we could
verify the correctness of the result. It also performed well
on the empirical data sets. Its performance is robust with
respect to the number of input trees and the size of the
input trees.

Methods
In this section we describe our method that aims to find
Maximum Frequent Agreement SubTrees (MFASTs) in a
given set of m phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$.
Our method follows from the observation that an MFAST
is present in a large number of trees in $\mathcal{T}$. The method
builds MFASTs bottom up from small subtrees of taxa in
the trees in $\mathcal{T}$. Briefly, it works in three phases.
Phase 1. Seed generation (Section "Phase one: Seed
generation").
In the first phase, we identify small subtrees from the
input trees that have a potential to be a part of an
MFAST. We call each such subtree a seed.
Phase 2. Seed combination (Section "Phase two: Seed
combination").
In the second phase, we construct an initial FAST by
combining the seeds found in the first phase.
Phase 3. Post processing (Section "Phase three:
Post-processing").
In the third phase, we grow the FAST further to
obtain the maximal FAST that contains it by
individually considering the taxa which are not
already in the FAST. We report the resulting
maximal FAST as a possible MFAST.
First, we present the basic definitions needed for this
paper in Section "Preliminaries and notation". We then
discuss each of the three phases above in detail.




Preliminaries and notation
In this section, we present the key definitions and nota-
tions needed to understand the rest of the paper. We
describe our method using rooted and bifurcating phylo-
genetic trees. However, our method and definitions can
easily be applied to unrooted or multifurcating trees with
minor or no modifications. Also, we assume that all the
taxa are placed at the leaf level nodes of the phylogenetic
tree, and all the internal nodes are inferred ancestors.
Figure 2(a) shows a sample phylogenetic tree built on five
taxa. We define the size of a tree as the number of taxa in
that tree. We start by defining key terms.

Definition 1 (Clade). Let T be a phylogenetic tree. Given
an internal node of T, we define the set of all nodes and
edges of T contained under that node as the clade rooted at
that node.

Each internal node of a phylogenetic tree corresponds
to a clade of that tree. Figure 2(b) depicts the clade of the
tree in Figure 2(a) rooted at x1.

Definition 2 (Contraction). Let T be a phylogenetic tree
with n taxa. The contraction operation transforms T into a
tree with n − 1 taxa by removing a given taxon in T along
with the edge that connects that taxon to T.

The contraction operation can extract the clades of a
tree by removing all the taxa that are not a part of that
clade. It can also extract parts of the tree that are not
necessarily clades. We use the term subtree to denote a tree
that is obtained by applying contractions to an arbitrary set
of taxa in a given tree. The formal definition is as follows.

Definition 3 (Subtree). Let T and T' be two phyloge-
netic trees. We say that T' is a subtree of T if T can be
transformed into T' by applying a series of contractions on
T.

If a tree T' is a subtree of another tree T, we say that
T' is present in T. Notice that a clade is always a subtree,



Figure 2 (a) A rooted, bifurcating phylogenetic tree T built on
five taxa labeled a, b, c, d and e. The internal nodes are shown
with x0, x1, x2 and x3. (b) A clade of T rooted at x1. (b) and (c) Two
subtrees of T obtained by contracting the taxa sets {d, e} and {a, e}.


but the converse is not always true. Figures 2(b) and
2(c) illustrate two subtrees of the tree in Figure 2(a). Let us
denote the number of combinations of k taxa from a set of
n taxa with $\binom{n}{k}$. In general, if a tree has n taxa, then that
tree contains $\binom{n}{k}$ subtrees with k taxa. As a consequence,
that tree contains $2^n - 1$ subtrees of any size including
itself.
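
The contraction and subtree definitions can be illustrated with a small sketch. It assumes a representation that is not part of the paper: a tree is a nested tuple whose leaves are taxon labels, and a subtree is returned in an order-independent form built from frozensets. The helper names `leaves` and `restrict`, and the topology chosen for the tree of Figure 2(a), are assumptions made for illustration only.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of taxon labels; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; return None if none remains."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    # An internal node left with a single child is suppressed, because a
    # contraction removes the taxon together with its connecting edge.
    return kept[0] if len(kept) == 1 else frozenset(kept)


# Assumed topology for the five-taxon tree of Figure 2(a).
T = ((("a", "b"), "c"), ("d", "e"))

clade_abc = restrict(T, {"a", "b", "c"})    # contract d and e
subtree_bcd = restrict(T, {"b", "c", "d"})  # contract a and e

taxa = sorted(leaves(T))
# One subtree per non-empty subset of taxa: the sum of C(n, k) over k.
subtree_count = sum(1 for k in range(1, len(taxa) + 1)
                    for _ in combinations(taxa, k))
print(subtree_count == 2 ** len(taxa) - 1)
```

Counting one subtree per non-empty taxon subset reproduces the $2^n - 1$ total stated above for n = 5.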

Definition 4 (Frequency). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be
a set of m phylogenetic trees and T be a phylogenetic tree.
Let us denote the number of trees in $\mathcal{T}$ in which T is present
with the variable m'. We define the frequency of T in $\mathcal{T}$ as

$$\mathrm{freq}(T, \mathcal{T}) = \frac{m'}{m}.$$

Definition 5 (FAST). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set
of m phylogenetic trees and T be a phylogenetic tree. Let y
be a number in the [0, 1] interval that denotes the frequency cutoff.
We say that T is a Frequent Agreement SubTree (FAST) of
$\mathcal{T}$ if its frequency in $\mathcal{T}$ is at least y (i.e., $\mathrm{freq}(T, \mathcal{T}) \geq y$).

We say that a FAST is maximal if there is no other FAST
that contains all the taxa in that FAST. Clearly, larger
FASTs indicate biologically more relevant consensus pat-
terns. The following definition summarizes this.

Definition 6 (MFAST). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a
set of m phylogenetic trees. Let y be a number in the [0, 1]
interval that denotes the frequency cutoff. A FAST T of $\mathcal{T}$ is a
Maximum Frequent Agreement SubTree (MFAST) of $\mathcal{T}$ if
there is no other FAST T' of $\mathcal{T}$ that has a larger size than T.

Formally, given a set of phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$
and a frequency cutoff y, we would like to find
the MFASTs in $\mathcal{T}$. We develop an algorithm
that aims to solve this problem. Table 1 lists the variables
used throughout the rest of this paper.
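
Definitions 4 to 6 can be checked on tiny inputs with an exhaustive search. The sketch below is not the heuristic developed in this paper; it enumerates every taxon subset and is only feasible for a handful of taxa. The nested-tuple tree representation, the helper names, and the five example trees (built in the spirit of Figure 1, with only taxon E changing position) are assumptions made for illustration.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of taxon labels; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; return None if none remains."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else frozenset(kept)


def freq(subtree, trees):
    """Fraction m'/m of the trees in which `subtree` is present."""
    taxa = leaves(subtree)
    return sum(restrict(t, taxa) == subtree for t in trees) / len(trees)


def brute_force_mfasts(trees, y):
    """Return all FASTs of maximum size by exhaustive search (tiny inputs only)."""
    taxa = sorted(leaves(trees[0]))
    for size in range(len(taxa), 0, -1):
        found = set()
        for subset in combinations(taxa, size):
            for t in trees:
                candidate = restrict(t, set(subset))
                if freq(candidate, trees) >= y:
                    found.add(candidate)
        if found:
            return found
    return set()


# Hypothetical input: the subtree on A, B, C, D is shared, E moves around.
trees = [
    ((("A", "E"), "B"), ("C", "D")),
    ((("A", "B"), "E"), ("C", "D")),
    (("A", "B"), (("C", "E"), "D")),
    (("A", "B"), (("C", "D"), "E")),
    ((("A", "B"), ("C", "D")), "E"),
]
mfasts = brute_force_mfasts(trees, y=1.0)
print(len(mfasts), [sorted(leaves(t)) for t in mfasts])
```

With y = 1 the only maximum-size FAST in this example is the four-taxon subtree on A, B, C and D, while each complete five-taxon tree has frequency 1/5.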



Table 1 Commonly used variables and functions in this paper

Symbol                   Description
$\mathcal{T}$            A set of phylogenetic trees
Ti                       ith tree
m                        Number of trees in $\mathcal{T}$
n                        Number of taxa in each input tree
ai                       ith taxon
freq(T, $\mathcal{T}$)   Frequency of the subtree T in $\mathcal{T}$
y                        Frequency cutoff
Si                       ith seed (each seed is a subtree of a tree in $\mathcal{T}$)
k                        Size of a seed
c                        Number of contractions used to create a seed




Phase one: Seed generation
The first phase extracts small subtrees from the given set
of trees. From these subtrees we extract the basic building
blocks which are used to construct MFASTs. We call these
building blocks seeds. Conceptually each seed is a phylo-
genetic tree that contains a small subset of the taxa that
make up the trees in T. We characterize each seed with
three features that are listed below. We elaborate on each
feature later in this section.

1. Seed size (k) is the number of taxa in the seed.
2. Number of contractions (c) is the number of taxa we
prune from a clade taken from an input tree in order
to extract the seed.
3. Frequency (f) is the fraction of input trees in which
the seed is present.

We explain the seed features with the help of Figures 3
and 4. The first two characteristics explain how a seed can
be found in one of the trees in $\mathcal{T}$. They indicate that there
is a clade of a tree in $\mathcal{T}$ such that this clade contains k + c
taxa and can be transformed into that seed by c contractions.
For instance, in Figure 3, when
k = 2 and c = 0, only one seed can be extracted from T1, by
choosing the clade rooted at x2. When k = 2 and c = 1,
seeds S1, S2 and S3 can be obtained using one contraction
(of a3, a2 and a1, respectively) from the clade rooted at x1.
The last feature denotes the fraction of trees in $\mathcal{T}$ in
which the seed is present. For example, in Figure 4, there
are nine seeds S1, S2, ..., S9 extracted from the three input
trees using only one contraction. Among these, the frequency
of S1 is 1 as it is present in all the trees. The frequency
of S2 is about 0.67, since it is present in only two out of three

Figure 3 T1 is an input tree built on four taxa a1, a2, a3 and a4.
The internal nodes of T1 are labeled x0, x1 and x2. A single
seed is obtained from T1 when k = 2 and c = 0; it is identical to
the clade rooted at x2. S1, S2 and S3 are the seeds extracted from T1
when k = 2 and c = 1. They are all extracted from the clade rooted at
x1 by contracting a3, a2 and a1, respectively.

trees (T1 and T2). The frequency of the rest of the seeds
is only about 0.33. Recall that, by definition, an MFAST is
present in at least a fraction y of the trees in $\mathcal{T}$. Therefore,
we consider only the seeds whose frequency values
are equal to or greater than this number (i.e., f ≥ y).
Given the values of k, c and y, we extract all the seeds
which possess the desired feature values from the set of
input trees as follows. In the Newick string representation
of a tree, a pair of matching parentheses corresponds to
an internal node in the tree. The number of taxa in the
clade rooted at this internal node is given by the number
of labels between the two matching parentheses. Following
from this observation, we scan the Newick string of
each tree one by one. For each such tree, we identify the
clades which have k + c taxa. Notice that, if a tree contains
n taxa, then it contains at most $\frac{n}{k+c}$ clades of size
k + c as no two such clades can contain common taxa. We
then extract all combinations of k taxa from each of these
clades by contracting the remaining c taxa. The number
of ways this can be done is $\binom{k+c}{k}$. Notice that all the small
trees extracted this way possess the first two character-
istics explained above. At this point, we however do not
know their frequencies. Therefore, we call them potential
seeds. It is worth mentioning that the same seed might be
extracted from different trees. As we extract a new poten-
tial seed, before storing it in the list of potential seeds, we
check if it is already present there. We include it in the
potential seed list only if it does not exist there yet. Other-
wise, we ignore it. This way, we maintain only one copy of
each seed.
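
The parenthesis scan described above can be sketched as follows. The sketch assumes Newick strings without branch lengths or internal node labels, and it only reports the taxon sets of the potential seeds (the topology of each seed would still have to be obtained by contraction). The function names and the topology used for T1 of Figure 3 are assumptions made for illustration.

```python
import re
from itertools import combinations
from math import comb


def clades_of_size(newick, size):
    """Return the taxon sets of all clades that contain exactly `size` taxa."""
    open_clades = []   # one label list per currently open parenthesis
    found = []
    for token in re.findall(r"[(),;]|[^(),;\s]+", newick):
        if token == "(":
            open_clades.append([])
        elif token == ")":
            labels = open_clades.pop()
            if len(labels) == size:
                found.append(frozenset(labels))
            if open_clades:
                # The labels of a clade also lie inside its parent clade.
                open_clades[-1].extend(labels)
        elif token not in (",", ";") and open_clades:
            open_clades[-1].append(token)
    return found


def potential_seed_taxa(newick, k, c):
    """Taxon sets of potential seeds: k taxa kept out of each (k + c)-clade."""
    return [frozenset(kept)
            for clade in clades_of_size(newick, k + c)
            for kept in combinations(sorted(clade), k)]


newick_t1 = "(((a1,a2),a3),a4);"   # assumed topology for T1 of Figure 3
seeds = potential_seed_taxa(newick_t1, k=2, c=1)
print(len(seeds) == comb(2 + 1, 2))
```

For k = 2 and c = 1 the single three-taxon clade yields $\binom{3}{2} = 3$ taxon sets, in line with the three seeds described for Figure 3.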
Once we build our potential seed list for all the trees in
$\mathcal{T}$, we go over the potential seeds one by one and compute
the frequency of each in $\mathcal{T}$ as the fraction of trees that
contain it. We filter out all the potential seeds whose
frequencies are less than the frequency cutoff. We keep the
remaining ones as the list of seeds along with the frequency
of each seed.
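
A minimal sketch of the whole first phase, under stated assumptions, is given below: trees are nested tuples rather than Newick strings, seeds are stored in an order-independent frozenset form so that duplicates collapse automatically, and the three four-taxon input trees are hypothetical (they are not the trees of Figure 4).

```python
from itertools import combinations


def leaves(tree):
    """Return the set of taxon labels; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; return None if none remains."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else frozenset(kept)


def freq(subtree, trees):
    """Fraction of the trees in which `subtree` is present."""
    taxa = leaves(subtree)
    return sum(restrict(t, taxa) == subtree for t in trees) / len(trees)


def clades(tree):
    """Yield every clade (internal node) of a nested-tuple tree."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree:
        yield from clades(child)


def generate_seeds(trees, k, c, y):
    """Return {seed: frequency} for all seeds with frequency at least y."""
    potential = set()                      # a set keeps one copy of each seed
    for tree in trees:
        for clade in clades(tree):
            taxa = leaves(clade)
            if len(taxa) == k + c:
                for kept in combinations(sorted(taxa), k):
                    potential.add(restrict(clade, set(kept)))
    frequencies = {seed: freq(seed, trees) for seed in potential}
    return {seed: f for seed, f in frequencies.items() if f >= y}


# Hypothetical input trees on four taxa.
trees = [
    ((("a1", "a2"), "a3"), "a4"),
    ((("a1", "a2"), "a4"), "a3"),
    ((("a1", "a3"), "a2"), "a4"),
]
all_potential = generate_seeds(trees, k=3, c=1, y=0.0)
frequent = generate_seeds(trees, k=3, c=1, y=0.6)
print(len(all_potential), len(frequent))
```

Setting y = 0 returns every potential seed, which makes the effect of the frequency filter visible when compared with a stricter cutoff.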
In Figure 4, consider the tree T1 that has four taxa. For
k = 3 and c = 1, there is only one clade of size k + c = 4,
which is the tree T1 itself. We extract four potential seeds,
each having three leaves, from this tree. The potential
seeds in this figure are given by S1, S2, S5 and S7, which
we extract by contracting a4, a3, a2 and a1, respectively,
from T1.


Phase two: Seed combination
At the end of the first phase, we obtain a set of frequent
seeds from the input trees. Notice that each seed is a FAST
as each seed is present in a sufficient number of trees, as
specified by y. These seeds are the basic building blocks of our
method. In the second phase of our method, we combine
subsets of these seeds to construct larger FASTs.
We first define what it means to combine two seeds. In
order to combine two seeds, it is a necessary condition




Figure 4 The set of input trees T1, T2, T3 and the set of all nine potential seeds S1, S2, ..., S9 when the seed characteristics are set to
k = 3 and c = 1. All the potential seeds have three taxa as k = 3. We need one contraction from the input tree to obtain each seed. S5 has frequency
1.0 as it is present in T1, T2 and T3. Seed S2 has frequency ~0.67 as it is present in T1 and T2. The remaining seeds have frequency ~0.33 as each appears
in only one of the three trees.


that both seeds are present in at least one common tree
T in $\mathcal{T}$. We call such a tree T the reference tree. We
combine two seeds with the guidance of a reference tree.
Let S1 and S2 be two seeds and let T be their reference
tree. Let L1, L2 and L be the sets of taxa in S1, S2 and T,
respectively. Combining S1 and S2 results in the tree that
is equivalent to the one obtained by contracting the taxa
in L \ (L1 ∪ L2) from T. For simplicity, we will denote the
combine operation using T as the reference tree with
the $\oplus_T$ symbol. For instance, we denote combining S1 and
S2 with T being the reference tree as S1 $\oplus_T$ S2. To simplify
our notation, whenever the identity of the reference tree
is irrelevant, we will use the symbol $\oplus$ instead of $\oplus_T$.
Figure 5 demonstrates how two seeds S1 and S2 are combined
with the help of the reference tree T. In this figure,
both S1 and S2 are subtrees of T. Thus, it is possible to
use T as the reference tree. We have L1 = {a1, a3, a4} and
L2 = {a1, a2, a5, a7}. Thus, we build C = S1 $\oplus_T$ S2 by
contracting the taxa in L \ (L1 ∪ L2) = {a6, a8} from T.
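
The combine operation can be sketched as a restriction of the reference tree to the union of the two taxon sets. The nested-tuple representation, the helper names, and the balanced topology chosen for the eight-taxon reference tree are assumptions; only the taxon sets L1 and L2 are taken from the example above.

```python
def leaves(tree):
    """Return the set of taxon labels; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; return None if none remains."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else frozenset(kept)


def is_present(subtree, tree):
    """True if `tree` can be contracted into `subtree`."""
    return restrict(tree, leaves(subtree)) == subtree


def combine(s1, s2, reference):
    """Combine two seeds by contracting the taxa of `reference` outside L1 | L2."""
    if not (is_present(s1, reference) and is_present(s2, reference)):
        raise ValueError("both seeds must be present in the reference tree")
    return restrict(reference, leaves(s1) | leaves(s2))


# Assumed topology for the reference tree on a1..a8.
T = ((("a1", "a2"), ("a3", "a4")), (("a5", "a6"), ("a7", "a8")))
S1 = restrict(T, {"a1", "a3", "a4"})
S2 = restrict(T, {"a1", "a2", "a5", "a7"})

C = combine(S1, S2, T)
contracted = leaves(T) - (leaves(S1) | leaves(S2))
print(sorted(contracted), sorted(leaves(C)))
```

Because the result is a contraction of the reference tree, both seeds remain present in it, which is the property used in Proposition 1.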
So far, we have explained how to combine two seeds S1
and S2 using a reference tree. It is possible that many trees
in $\mathcal{T}$ have both seeds present in them. Thus, one question
is which of these trees we should use as the reference
tree to combine the two seeds. The brief answer is that
all such trees need to be considered. However, we make
several observations that help us avoid exhaustively
combining S1 and S2 with each such reference tree one by
one, without ignoring any of these trees. We explain
them next.
Consider two trees T1 and T2 from $\mathcal{T}$ in which both seeds
are present. There are two cases for T1 and T2.


CASE 1: S1 $\oplus_{T_1}$ S2 = S1 $\oplus_{T_2}$ S2. In this case, it does
not matter whether we use T1 or T2 as the reference
tree. They will both lead to the same combined
subtree. Thus, we use only one.
CASE 2: S1 $\oplus_{T_1}$ S2 ≠ S1 $\oplus_{T_2}$ S2. In this case, the trees
T1 and T2 lead to alternative combination topologies.
So, we consider both of them separately.

We utilize the observations above as follows. We start
by picking one reference tree arbitrarily. Once we create a
combined subtree using that tree, we check whether that
subtree is present in the remaining trees in $\mathcal{T}$. We mark
those trees that contain it as considered for reference
and never use them as reference for the same seed pair
again. This is because those trees fall into the first case
described above. This way, we also obtain the frequency of
the combined subtree in $\mathcal{T}$. If the number of unmarked
trees is too small (i.e., less than y × m), then
even if all the remaining trees agree on the same combined
topology for the two seeds under consideration, they are
not sufficient to make it a FAST. Thus, we do not use any
of the remaining trees as reference for those two seeds.
Otherwise, we pick another unmarked tree arbitrarily and
repeat the same process until we run out of reference
trees.
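
The marking procedure above can be sketched as a loop over the trees that contain both seeds. The sketch is an illustration under assumptions: trees are nested tuples, marked trees are simply dropped from a work list, and the function name and the three input trees are hypothetical. It returns every combined topology it builds together with its frequency, leaving the comparison against the cutoff to the caller.

```python
def leaves(tree):
    """Return the set of taxon labels; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; return None if none remains."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else frozenset(kept)


def is_present(subtree, tree):
    """True if `tree` can be contracted into `subtree`."""
    return restrict(tree, leaves(subtree)) == subtree


def combine_with_all_references(s1, s2, trees, y):
    """Return {combined topology: frequency} without reusing agreeing trees."""
    m = len(trees)
    union = leaves(s1) | leaves(s2)
    # Only trees that contain both seeds can serve as reference trees.
    unmarked = [t for t in trees if is_present(s1, t) and is_present(s2, t)]
    results = {}
    # Stop once fewer than y * m unmarked trees remain: they could no
    # longer support a frequent combination even if they all agreed.
    while unmarked and len(unmarked) >= y * m:
        combined = restrict(unmarked[0], union)
        agreeing = [t for t in unmarked if restrict(t, union) == combined]
        results[combined] = len(agreeing) / m
        # Agreeing trees fall into CASE 1 and are marked as considered.
        unmarked = [t for t in unmarked if restrict(t, union) != combined]
    return results


# Hypothetical input trees on four taxa.
trees = [
    ((("a1", "a2"), "a3"), "a4"),
    ((("a1", "a2"), "a4"), "a3"),
    ((("a1", "a3"), "a2"), "a4"),
]
s1 = restrict(trees[0], {"a1", "a2", "a4"})   # present in all three trees
s2 = restrict(trees[0], {"a1", "a3", "a4"})   # present in the first and third
print(len(combine_with_all_references(s1, s2, trees, y=0.3)))
print(len(combine_with_all_references(s1, s2, trees, y=0.6)))
```

With the stricter cutoff the loop stops after the first reference tree, because the single remaining unmarked tree could not reach the cutoff on its own.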
The next question we need to answer is which seed pairs
we should combine. To answer this question we first make
the following proposition.

Proposition 1. Assume that we are given a set of phylogenetic
trees $\mathcal{T}$. Let S1 and S2 be two seeds constructed from



Figure 5 T is the reference tree. S1 and S2 are the seeds to be
combined; both are present in T. C is obtained by pruning the
subtree containing taxa a1, a2, a3, a4, a5 and a7 from T.


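the trees in $\mathcal{T}$. For all trees T ∈ $\mathcal{T}$, we have the following
inequality

$$\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \leq \min\{\mathrm{freq}(S_1, \mathcal{T}), \mathrm{freq}(S_2, \mathcal{T})\}$$

Proof. For any T, both S1 and S2 are subtrees of S1 $\oplus_T$ S2.
Thus if S1 $\oplus_T$ S2 is present in a tree, then both S1 and
S2 are present in that tree. As a result,
freq(S1 $\oplus_T$ S2, $\mathcal{T}$) ≤ freq(S1, $\mathcal{T}$) and
freq(S1 $\oplus_T$ S2, $\mathcal{T}$) ≤ freq(S2, $\mathcal{T}$). Hence,

$$\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \leq \min\{\mathrm{freq}(S_1, \mathcal{T}), \mathrm{freq}(S_2, \mathcal{T})\}$$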
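
The counting step behind the proof can be written out explicitly. The notation P(X), for the set of input trees in which a subtree X is present, is introduced here only for this restatement and does not appear elsewhere in the paper.

```latex
\begin{align*}
P(X) &= \{\, T' \in \mathcal{T} : X \text{ is present in } T' \,\},
  \qquad \mathrm{freq}(X, \mathcal{T}) = \frac{|P(X)|}{m}, \\
P(S_1 \oplus_T S_2) &\subseteq P(S_1) \cap P(S_2), \\
\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T})
  &= \frac{|P(S_1 \oplus_T S_2)|}{m}
  \leq \frac{\min\{|P(S_1)|,\, |P(S_2)|\}}{m}
  = \min\{\mathrm{freq}(S_1, \mathcal{T}),\, \mathrm{freq}(S_2, \mathcal{T})\}.
\end{align*}
```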



Proposition 1 states that as we combine pairs of seeds to
grow them, their frequency never increases. This
suggests that it is desirable to combine two seeds if both
of them have large frequencies. This is because if one of
them has a small frequency, regardless of the frequency of
the other, the combined tree will have a small frequency.
As a result its chance to grow into a larger tree through
additional combine operations gets smaller. Following this


intuition, we develop two approaches for combining the
seeds.

1. In-order Combination (Section "In-order
combination").
2. Minimum Overlap Combination (Section "Minimum
overlap combination").

Both approaches accept the list of seeds computed in
the first phase as input and produce a larger FAST that is a
combination of multiple seeds. Both of them also assume
that the list of input seeds is already sorted in decreasing
order of their frequencies. We discuss these approaches
next.

In-order combination
The in-order combination approach follows from Proposition 1.
It assumes that the seeds with higher frequencies
have greater potential to be a part of an MFAST. It exploits
this assumption as follows. First, it picks a seed as the
starting point to create a FAST. It then grows this seed by
combining it with other seeds, starting from the most
frequent one, as long as the frequency of the resulting tree
remains at least as large as the given cutoff y. It repeats
this process by trying each seed as the starting point.
Algorithm 1 presents this approach.


Algorithm 1 In-order combination
FAST ← ∅
for all seeds Si do
    FAST' ← Si
    Mark Si as considered
    repeat
        Sj ← seed with highest frequency among unconsidered seeds
        Mark Sj as considered
        CUTOFF ← y
        tFAST' ← FAST'
        repeat
            Pick the next unconsidered tree T ∈ $\mathcal{T}$ as reference
            Mark all the trees that contain FAST' $\oplus_T$ Sj as considered
            if freq(FAST' $\oplus_T$ Sj, $\mathcal{T}$) ≥ CUTOFF then
                tFAST' ← FAST' $\oplus_T$ Sj
                CUTOFF ← freq(FAST' $\oplus_T$ Sj, $\mathcal{T}$)
            end if
        until fewer than y × m unmarked reference trees are left in $\mathcal{T}$
        FAST' ← tFAST'
        Unmark all trees in $\mathcal{T}$
    until all seeds are considered
    if size of FAST' > size of FAST then
        FAST ← FAST'
    end if
    Unmark all seeds
end for

In Algorithm 1 we first initialize the FAST as empty. We
then consider each seed one by one. We initialize a tem-
porary subtree denoted by FAST' with the seed Si under
consideration and mark Si as considered. We combine the
FAST' with a seed Sj which has the highest frequency
amongst the seeds that have not been added. If multiple
seeds have the highest frequency, we randomly pick one of
them and mark that seed Sj as added to the FAST'. There
can be alternative ways to combine FAST' with Sj leading
to different topologies. We use the trees in $\mathcal{T}$ that contain
both FAST' and Sj as guides to try only the topologies
that exist in $\mathcal{T}$. We stop constructing alternative topologies
as soon as we ensure that there are not enough
trees left to yield a frequency of y. We set FAST' to
the combined seed if the combined seed has large enough
frequency. We then consider the seed with the next high-
est frequency for addition and repeat this step till all Sj
have been considered. If the resulting temporary FAST is
larger than FAST we replace the smaller FAST with the
larger one. In the next iteration, we initialize the FAST
with the next Si. Using this approach we can initialize the
FAST with every Si. Alternatively, if the user wishes to limit
the running time with a maximum time cutoff,
we stop the outermost loop (i.e., alternative initializations
of FAST') as soon as the allowed running time budget is
reached.
Notice that in Algorithm 1 each seed Si can lead to a different
FAST. We record only the FAST that has the largest
size. However, it is trivial to maintain the top k FASTs
with the largest size instead if the user is looking for k
alternative maximal FASTs.

Minimum overlap combination
The purpose of combining seeds is to construct a FAST
that is large in size. Our in-order combination approach
(Section "In-order combination") aimed to maximize the
frequency of the combined seeds. In this section, we
develop our second approach, named Minimum Overlap
Combination. This approach picks seeds so that their
combination produces as large a subtree as possible. We
elaborate on this approach next.
When we combine two seeds, the size of the resulting
tree becomes at least as big as the size of each of these
seeds. Formally, let S1 and S2 be two seeds (i.e., trees). Let
L1 and L2 be the sets of taxa contained in S1 and S2. We
denote the size of a set, say L1, with |L1|. The size of the
tree resulting from the combination of S1 and S2 is
|L1| + |L2| − |L1 ∩ L2|. For a given fixed seed size, the first two
terms of this formulation remain unchanged regardless of the
seed. The last term determines the growth in the size of


the FAST. Thus, in order to grow the FAST rapidly, it is
desirable to combine two frequent subtrees with a small
number of common taxa.
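
As a worked instance of this size formula, take the taxon sets of the two seeds combined in Figure 5, which share only the taxon a1:

```latex
\begin{align*}
L_1 &= \{a_1, a_3, a_4\}, \qquad
L_2 = \{a_1, a_2, a_5, a_7\}, \qquad
L_1 \cap L_2 = \{a_1\}, \\
|L_1 \cup L_2| &= |L_1| + |L_2| - |L_1 \cap L_2| = 3 + 4 - 1 = 6.
\end{align*}
```

The result agrees with the six taxa of the combined tree C in that figure.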
Our second approach follows from the observation
above. We introduce a criterion called the overlap between
two subtrees as the number of taxa common between
them. Our minimum overlap combination approach
works the same as Algorithm 1 with a minor difference in
selecting the seed Sj that will be combined with the cur-
rent temporary FAST (i.e., FAST'). Rather than choosing
the seed with the largest frequency, this approach chooses
the one that has the least overlap with FAST' among all
the unconsidered and frequent seeds. If multiple seeds
have the same smallest overlap, it considers the frequency
as the tie breaker and chooses the one with the largest
frequency among those.
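
The selection rule of the minimum overlap approach can be sketched in a few lines. The sketch assumes that each unconsidered seed is summarized by a pair of its taxon set and its frequency; the function names and the example values are hypothetical.

```python
def combined_size(taxa_1, taxa_2):
    """Size of the combined tree: |L1| + |L2| - |L1 & L2|."""
    return len(taxa_1) + len(taxa_2) - len(taxa_1 & taxa_2)


def pick_min_overlap(fast_taxa, candidates):
    """Pick the seed with the least overlap; ties go to the highest frequency.

    `candidates` is a list of (taxon set, frequency) pairs describing the
    unconsidered frequent seeds.
    """
    return min(candidates,
               key=lambda seed: (len(fast_taxa & seed[0]), -seed[1]))


fast_taxa = frozenset({"a1", "a2", "a3"})
candidates = [
    (frozenset({"a1", "a2", "a4"}), 1.0),   # overlap 2
    (frozenset({"a3", "a5", "a6"}), 0.7),   # overlap 1
    (frozenset({"a1", "a4", "a5"}), 0.9),   # overlap 1, higher frequency
]
chosen_taxa, chosen_freq = pick_min_overlap(fast_taxa, candidates)
print(sorted(chosen_taxa), combined_size(fast_taxa, chosen_taxa))
```

Sorting on the pair (overlap, negated frequency) encodes the tie-breaking rule in a single key, so no second pass over the candidates is needed.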

Phase three: Post-processing
So far we described how to obtain seeds (Section "Phase
one: Seed generation") and how to combine them to con-
struct FAST (Section "Phase two: Seed combination" ).
The two approaches we developed for combining seeds
aim to maximize the size of FAST. However, they do not
ensure the maximality of the resulting FAST. There are
two main reasons that prevent our seed combining algo-
rithms from constructing maximal FAST. First, some of
the taxa of a maximal FAST may not appear in any seed
(i.e. false negatives). As a result no combination of seeds
will lead to that maximal FAST. Second, even if all the taxa
of a maximal FAST are parts of at least one seed, our algo-
rithms will reject combining that seed with the FAST of
the seeds if those seeds contain other taxa that are not part
of the maximal FAST (i.e. false positives).
In the post-processing phase, we tackle the two problems
described above. Algorithm 2 describes the post
processing phase in detail. We do this by considering all
taxa which are not already present in the FAST one by
one. We iteratively grow the current FAST by including
one more taxon at a time if the frequency of the resulting
FAST remains at least as large as the frequency cutoff
y. We repeat these iterations until no new taxon can
be included in the FAST. Thus the resulting FAST is
guaranteed to be maximal.

Algorithm 2 Post processing
INPUT = FAST from the seed combination phase
INPUT = T
OUTPUT = Maximal FAST

RESULT <- FAST
for all a_i not in FAST do
    CUTOFF <- y
    t-RESULT <- RESULT
    repeat
        Pick the next unconsidered tree in T as reference
        RESULT' <- RESULT ∪ a_i
        Mark all the trees that contain RESULT' as considered
        if frequency of RESULT' > CUTOFF then
            t-RESULT <- RESULT'
            CUTOFF <- frequency of RESULT'
        end if
    until Less than y × m unmarked reference trees are left in T
    RESULT <- t-RESULT
    Unmark all trees in T
end for
return RESULT

We expect the post processing step to quickly identify
the taxa that have the potential to be in an MFAST but
might not have been considered during the seed generation
and seed combination phases. At the end of the post
processing step we obtain an MFAST.
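
The greedy extension described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' Perl code: trees are nested tuples of taxon names, every input tree is assumed to carry the same taxa, a taxon is accepted when the frequency stays at or above the cutoff, and the tree-marking shortcut of the pseudocode is replaced by simply skipping candidate topologies that were already tried. The helper names (`leaves`, `restrict`, `frequency`, `post_process`) are hypothetical.

```python
def leaves(tree):
    """Set of taxon names below a node (a node is a str leaf or a tuple of children)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, taxa):
    """Prune `tree` to `taxa`; return an order-free canonical form (None if empty)."""
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kept = [r for r in (restrict(child, taxa) for child in tree) if r is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else frozenset(kept)


def frequency(topology, taxa, trees):
    """Fraction of trees whose restriction to `taxa` equals `topology`."""
    return sum(restrict(tree, taxa) == topology for tree in trees) / len(trees)


def post_process(fast, trees, cutoff):
    """Grow `fast` one taxon at a time while its frequency stays >= cutoff."""
    taxa = leaves(fast)
    topology = restrict(fast, taxa)
    remaining = leaves(trees[0]) - taxa  # assumes all trees share one taxon set
    grew = True
    while grew:  # repeat until a full pass adds no taxon
        grew = False
        for taxon in sorted(remaining):
            grown = taxa | {taxon}
            best, best_freq, seen = None, 0.0, set()
            for reference in trees:
                candidate = restrict(reference, grown)
                # Skip topologies already tried or not extending the current FAST.
                if candidate in seen or restrict(candidate, taxa) != topology:
                    continue
                seen.add(candidate)
                freq = frequency(candidate, grown, trees)
                if freq >= cutoff and freq > best_freq:
                    best, best_freq = candidate, freq
            if best is not None:
                taxa, topology, grew = grown, best, True
                remaining.discard(taxon)
    return topology, taxa
```

Because each accepted candidate restricts back to the previous FAST, the frequency can only stay the same or drop as taxa are added, which is why the loop stops once no remaining taxon passes the cutoff.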

Complexity analysis of our method
In this section we discuss the complexity of our method in
terms of the three phases involved in it. Let T be a set of
m phylogenetic trees having n leaves each. The complexities
of the different phases of our method are as follows.

Phase one. Finding the seeds involves enumerating all
the subtrees and checking their frequencies. Given seed
size k and number of contractions c, each tree will contain
at most n clades, each leading to $\binom{k+c}{c}$ alternative
subtrees. Thus, in total there can be up to $mn\binom{k+c}{c}$ seeds
(possibly many of them identical) from all the trees in T.
Typically, the values of k and c are fixed and small (in our
experiments we have k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5}),
leading to $O(mn)$ seeds.
The complexity of finding whether a seed is present in
a single tree is $O(n \log n)$. Given that there are m trees in
T, the cost of computing the frequency of a single seed
is $O(mn \log n)$. Thus, the time complexity for finding the
frequency of all the seeds is this expression multiplied by
the number of seeds, which is $O(m^2 n^2 \log n)$.

Phase two. Consider a set of p frequent seeds that will
be considered for combining in this phase. Recall that we
have two approaches to combine them. Below, we focus
on each.
INORDER COMBINATION We try to combine each seed
with every other seed, leading to $O(p^2)$ iterations. The
complexity of checking the frequency of each combined
subtree is $O(mn \log n)$. Also, there can be up to $O(m)$
different reference trees for guiding the combine operation.
Multiplying these terms, we obtain the complexity of
this phase using this approach as $O(p^2 m^2 n \log n)$.


MINIMUM OVERLAP COMBINATION The complexity of
combining the frequent seeds using the minimum overlap
combination approach is very similar to the inorder
approach except for an additional term. The additional
complexity is because we maintain the overlap between
the subtrees. This leads to the complexity
$O(p^2 n^2 + p^2 m^2 n \log n)$.

Phase three. Here, we consider the FAST obtained from
each of the p frequent seeds in phase two. For each FAST,
we go over the taxa one by one, leading to
$O(n)$ iterations. There can be up to $O(y \times m)$ references to
add a taxon. So the cost of extending all p FASTs is
$O(y \times mnp)$.
Notice that each frequent seed has to appear in at least
$y \times m$ trees. Thus, the number of unique frequent seeds
p is bounded by $O\left(\frac{mn}{ym}\right) = O\left(\frac{n}{y}\right)$. Thus, adding the cost
of all the three phases, the overall time complexity of our
method using inorder combination is
$$O\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{y^2} + mn^2\right).$$
That using minimum overlap combination is
$$O\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{y^2} + \frac{n^4 \log n}{y^2} + mn^2\right).$$
In the two summations above, the second term is
asymptotically larger than the first and the last terms.
Thus, we can simplify the asymptotic time complexity of
inorder and minimum overlap combinations as
$$O\left(\frac{m^2 n^3 \log n}{y^2}\right)$$
and
$$O\left(\frac{n^3 \log n \,(m^2 + n)}{y^2}\right)$$
respectively.
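
As a quick arithmetic check of which term dominates, the sketch below evaluates the three in-order terms for concrete parameter values. It reads the per-phase costs as $m^2 n^2 \log n$ (phase one), $m^2 n^3 \log n / y^2$ (phase two, after substituting p = n/y) and $mn^2$ (phase three). Constants hidden by the O-notation are ignored, and the base of the logarithm is an arbitrary choice, so the numbers only indicate relative growth. The function name `cost_terms` is hypothetical.

```python
import math


def cost_terms(m: int, n: int, y: float):
    """Return the (phase one, phase two, phase three) terms of the in-order bound."""
    log_n = math.log2(n)
    phase_one = m ** 2 * n ** 2 * log_n
    phase_two = m ** 2 * n ** 3 * log_n / y ** 2
    phase_three = m * n ** 2
    return phase_one, phase_two, phase_three


# Example: 200 trees of 1000 taxa with a frequency cutoff of 0.7.
one, two, three = cost_terms(m=200, n=1000, y=0.7)
print(f"phase two / phase one = {two / one:.1f}")  # equals n / y**2
```

The ratio of the second to the first term is n / y², which is why the phase-two term dominates as the trees get larger.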

Results and discussion
This section evaluates the performance of our MFAST
algorithm experimentally.

Implementation details. We implemented our MFAST
algorithm using C and Perl. More specifically, we implemented
the first two phases (seed generation and seed
combination) in C and the third phase (post processing)
in Perl. We utilize the functions provided in the Newick
Utilities [25] package by modifying the source code provided
in that package. We use k ∈ {3, 4, 5} and c ∈
{0, 1, 2, 3, 4, 5} in all of our experiments unless otherwise
stated. In our experiments, we observed that the
minimum overlap combination produced larger MFASTs
than the in-order combination approach. Therefore, we
limit our experimental results to the minimum overlap
combination approach.

Methods compared against. We have compared our
method against Phylominer [20] and the MAST command
implemented in PAUP* [26]. Among these, Phylominer
also seeks MFASTs in a collection of trees. However, the
time complexity of this method is exponential in the size
of the input trees, and hence it becomes intractable for
large trees. In our experiments, we observed that it does
not scale beyond 50 taxa. PAUP* is primarily a program
for phylogenetic inference, although it also can compute
MASTs. MASTs have a strict 100% agreement criterion
unlike the arbitrary frequency cutoff values y in our
method.

Evaluation Criteria. We evaluate our algorithm based
on the size of the MFAST found. Larger MFASTs are
preferable. When possible, we report the size of the opti-
mal solution as well.

Test Environment. We ran our experiments on Linux
servers equipped with dual AMD Opteron dual core pro-
cessors running at 2.2 GHz and 3 GB of main memory to
test the performance of our method.

Datasets. We test the performance and verify the results
of our method on synthetic datasets and real datasets.

SYNTHETIC DATASET We built synthetic datasets in
which we embedded an MFAST as described below.
We characterize each synthetic dataset using five
parameters.


Tree size (n).
Number of trees (m).
MFAST frequency (f).
MFAST size (n').
Noise percentage (c).


The first two parameters denote the size and number
of trees in T. MFAST frequency specifies the fraction
of trees in T which contain an MFAST. MFAST size
is the number of taxa in the embedded MFAST. The
noise percentage is the percentage of taxa that is not
a part of the embedded MFAST but is placed on the
branches within the clade that contains the MFAST.
We place all the other taxa on the branches outside
this clade.
Given an instantiation of these parameters, we first
created a tree that has n' taxa. This tree serves as the
MFAST. We then created m × f trees that contain
this MFAST. We built each of these trees by inserting
the remaining n - n' taxa randomly in the MFAST. With probability
c we inserted each taxon within the clade that contains the
MFAST. With probability 1 - c we inserted it outside
that clade. We then created m - (m × f) trees that do
not contain the current MFAST. We simply did this by
inserting all the taxa one by one at a random location.
REAL DATASETS. We use two empirical datasets to
evaluate the performance of our heuristic. The data
sets contain 200 bootstrap trees generated from
phylogenetic analysis of the Gymnosperm [27] and
Saxifragales (Burleigh, unpublished) plant clades. To
make the bootstrap trees, we assembled
super-matrices, matrices of concatenated gene
alignments with partial taxon overlap, from gene
sequence data available in GenBank. We performed a
maximum likelihood bootstrap analysis on each
super-matrix using RAxML v. 7.0.4 [28]. The
Gymnosperm trees each contain 959 taxa, and the
Saxifragales trees each contain 950 taxa.
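
The synthetic dataset construction described above can be sketched as follows. This is an illustrative reading of the procedure, not the generator used in the paper. It makes four assumptions: trees are nested tuples of taxon names; "a random location" means a uniformly chosen node to which the new taxon is attached as sister; taxon names `t0, t1, ...` are invented; and the number of trees containing the MFAST is rounded from m × f. All function names are hypothetical.

```python
import random

CLADE = "__mfast_clade__"  # placeholder leaf standing for the clade that holds the MFAST


def node_paths(tree, path=()):
    """Yield the child-index path of every node, leaves included."""
    yield path
    if not isinstance(tree, str):
        for i, child in enumerate(tree):
            yield from node_paths(child, path + (i,))


def attach(tree, path, leaf):
    """Return a copy of `tree` with `leaf` attached as sister of the node at `path`."""
    if not path:
        return (tree, leaf)
    children = list(tree)
    children[path[0]] = attach(children[path[0]], path[1:], leaf)
    return tuple(children)


def insert_random(tree, leaf, rng):
    """Attach `leaf` next to a uniformly chosen node of `tree`."""
    return attach(tree, rng.choice(list(node_paths(tree))), leaf)


def random_tree(taxa, rng):
    """Build a tree by inserting the taxa one by one at random locations."""
    tree = taxa[0]
    for taxon in taxa[1:]:
        tree = insert_random(tree, taxon, rng)
    return tree


def substitute(tree, clade):
    """Replace the placeholder leaf by the grown MFAST clade."""
    if isinstance(tree, str):
        return clade if tree == CLADE else tree
    return tuple(substitute(child, clade) for child in tree)


def tree_with_mfast(mfast, other_taxa, noise, rng):
    """Insert each extra taxon inside the MFAST clade with probability `noise`."""
    clade, outer = mfast, CLADE
    for taxon in other_taxa:
        if rng.random() < noise:
            clade = insert_random(clade, taxon, rng)
        else:
            outer = insert_random(outer, taxon, rng)
    return substitute(outer, clade)


def make_dataset(n, m, f, n_prime, noise, seed=0):
    """Return (embedded MFAST, list of m trees with n taxa each)."""
    rng = random.Random(seed)
    taxa = [f"t{i}" for i in range(n)]
    mfast = random_tree(taxa[:n_prime], rng)
    with_mfast = round(m * f)
    trees = [tree_with_mfast(mfast, taxa[n_prime:], noise, rng)
             for _ in range(with_mfast)]
    trees += [random_tree(rng.sample(taxa, n), rng) for _ in range(m - with_mfast)]
    return mfast, trees
```

Since taxa are only ever added as new sisters, pruning any of the first m × f trees back to the MFAST taxa returns the embedded topology unchanged.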

Effects of number of input trees
In our first experiment, we analyze how the number of
input trees in T affects the performance of our algorithm.
For this purpose, we created 30 synthetic datasets. The
size of the embedded MFAST in all the datasets was 15.
Among these 30 datasets, 10 contained 50 trees, 10 con-
tained 100 trees and 10 contained 200 trees. We set the
noise percentage to 20% in all the datasets. The frequency
of the embedded MFAST was 0.8. We set the number of
taxa in all the trees in these datasets to 100.
We ran our algorithm on each of these datasets to find
the size of the MFAST for y = 0.7. Table 2 lists the
average MFAST size we found for each of the dataset
sizes before post processing (i.e., at the end of phase two)
and after post processing (i.e., at the end of phase three).
The results demonstrate that our method can identify an
MFAST that is almost as big as the embedded one even
without post processing, regardless of the number of trees
in the dataset. Post processing improves the MFAST size
slightly. On the average, we always find an MFAST that is
as large as or larger than the embedded one. An MFAST
larger than 15 here implies that while randomly inserting
the taxa that are not in the embedded MFAST, at least one
of them was placed under the same clade at least a fraction
y of the time. More importantly, our method successfully
located such taxa along with the rest of the MFAST.

Effects of tree size
Our second experiment considers the impact of the num-
ber of taxa in the input trees contained in T on the success
of our method. To carry out this test, we built datasets
with varying tree sizes (i.e., n). Particularly, we used n =
100, 250, 500 and 1000. For each value of n, we repeated
the experiment 10 times by creating 10 datasets with the
same properties. In all datasets, we set the number of trees
to m = 100, the noise percentage at c = 20%, the size of the
embedded MFAST at 15% of n, and the MFAST frequency
at 0.8.

Table 2 Evaluation of the effect of the number of trees in T
Number of trees    MFAST size before post processing    MFAST size after post processing
The number of trees is set to 50, 100 and 200. For each number of trees we run
our experiments on ten datasets. Each dataset contains trees with 100 taxa and
an embedded MFAST of size 15. We report the average size of the MFAST
obtained by our method across the ten datasets.

Table 3 reports the average MFAST size found by our
method for varying tree sizes. The second column shows the
embedded MFAST size. The last two columns list the average
size of the MFAST found by our method across the
ten datasets. Before going into a detailed discussion of the
results, it is crucial to observe that our method could run
to completion for datasets that have as many as 1000 taxa.
When we tried to run Phylominer, it did not return any
results for datasets that have more than 100 taxa. The
results also demonstrate that our method could successfully
identify the embedded MFAST in all the datasets
regardless of the size of the input trees. In some datasets,
the reported MFAST was slightly larger than the embedded
one. This indicates that, while randomly inserting the
taxa that are not part of the embedded MFAST, a few taxa
were consistently placed under the same clade.
The results also suggest that our method identifies a significant
percentage of the taxa in the embedded MFAST
after the second phase (i.e., before post-processing) when
the tree size is small. As the tree size grows, it starts
missing some taxa at this phase. However, it recovers the
missing taxa during the post-processing phase even for
the largest tree size. This indicates that at the end of phase
two our method could identify a backbone of the actual
MFAST. The unidentified taxa at this phase are scattered
throughout the clades in the input trees. Thus, there is no
clade of size k + c that contains them with c contractions
for small k or c. As evident from Table 3, this however
does not prevent our method from recovering them. This
is because the backbone reported at the end of phase two
is large enough, and thus specific enough, to recover the
missing taxa one by one in the last phase. This is a significant
observation as it demonstrates that our method
works well even with small values of k and c.

Table 3 Evaluation of the effect of the size of the trees in T
                                         Reported MFAST size
Number of taxa    Embedded MFAST size    Before post processing    After post processing
100                                      15.3                      15.8
250                                      32.3                      38.8
500                                      43.7                      76.0
1000                                     69.8                      151.0
The tree size is set to 100, 250, 500 and 1000. For each tree size we run our
experiments on ten datasets. Each dataset contains 100 trees with an embedded
MFAST of size 15% of the input tree size. Second column shows the embedded
MFAST size. Last two columns list the average size of the MFAST found by our
method across the ten datasets.

Effects of noise percentage
Recall that the noise percentage denotes the percentage
of taxa that are added inside the clade that contains the
MFAST. As the noise percentage increases, the pairs of taxa in the
MFAST get farther away from each other in the trees that contain it.
As a result, fewer taxa from the MFAST will be contained in
small clades of size k + c. This raises the question whether
our method works well as the noise increases and thus the MFAST
taxa get scattered around in the trees that contain it.
In this experiment, we answer the question above and
analyze the effect of the noise percentage on the success
of our method. We create synthetic datasets with noise
percentages of 20, 40 and 60%. We set
the size of the embedded MFAST to n' = 15, the tree size
to n = 100, the number of trees to m = 100 and the MFAST
frequency to f = 0.8. We repeat our experiment for each
noise percentage 10 times by recreating the dataset randomly
using the same parameters. We set the frequency cutoff to
y = 0.7. We report the average MFAST size found by our
method in Table 4.
The results suggest that our method can identify the
embedded MFAST successfully even when the noise percentage
is very high. We observe that the size of the
MFAST found by our method before post processing
decreases slowly with increasing amount of noise. This
is not surprising as the taxa contained in the embedded
MFAST get more spread out (and thus farther away from
each other) in the trees in T with increasing noise. As a
result, if there are taxa that are not part of any seed with
the provided values of k and c, they will never be included
in the computed MFAST at the end of phase two. However,
we observe that (i) only a small number of such taxa
exist. For instance, even for the largest noise percentage
(60%), only 2.3 taxa (i.e., 15 - 12.7) are missing on
average. (ii) The missing taxa are recovered during phase
three. This is because the computed MFAST at the end
of phase two is very large, and thus it is specific to the
embedded MFAST.

Table 4 Evaluation of the effect of the noise in the trees in T
              MFAST size
Noise (%)     Before post processing    After post processing
20            15.3                      15.8
40
60            12.7                      15.0
The size of the embedded MFAST in all the experiments is 15. We list the
average size of the MFAST found by our method before and after the post
processing phase.

Impact of seed creation
So far, in our experiments we consistently observed
two major points for all the parameter settings (see
Sections "Effects of number of input trees" to "Effects
of noise percentage"): (i) Our method always finds a
large subtree of the embedded MFAST after phase two.
(ii) Our method always recovers the entire embedded
MFAST after phase three. The second observation follows
from the first: the outcome of phase two
is large enough to build the entire MFAST precisely. The
first observation, however, indicates that the set of seeds
generated in phase one contains a significant percentage of
the taxa in the embedded MFAST. In this section, we take
a closer look at this phenomenon and explain why this
is the case even for small values of seed size k and contraction
amount c, and for a large noise percentage. To do that,
we will compute the probability that a subset of the taxa
of the embedded MFAST appears in at least one seed gen-
erated in phase one. In our computation, we will assume
that the taxa can appear at any location of a given tree
with the same probability. We discuss the implication of
this assumption later in this section.
The number of rooted bifurcating trees for a given set of
n taxa is

$$R(n) = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}.$$

Consider a clade with k + c taxa. The number of trees
with n taxa that contain this clade is $R(n - (k + c) + 1)$,
as the topology of the k + c sized clade is fixed. For a
given subtree with k taxa, let us denote the number of
clade topologies of size k + c that contain that subtree
with NU(k, c). We can compute this function recursively
as NU(k, 0) = 1 and, for c > 0,

$$NU(k, c) = NU(k, c - 1) \times 2 \times (k + c - 2).$$

Let us denote one of these clades by U(k, c). Also, let
us denote the probability that the clade U(k, c) exists in a
random tree topology that contains n taxa with P(n, k, c).
Intuitively, P(n, k, c) is the probability that our method
will extract a specific k taxa subtree from one n taxa tree
after only c contractions. We can compute this probability
as the ratio of the number of tree topologies that satisfy
this constraint to that of all possible tree topologies. We
formulate this as

$$P(n, k, c) = \frac{NU(k, c) \times R(n - (k + c) + 1)}{R(n)}.$$
Recall that it suffices for our algorithm to have a k taxa
subtree of the MFAST in at least one tree in the given set
of m trees. The probability that the clade U(k, c) exists in at
least one of the m random tree topologies each containing
n taxa is

$$P(n, k, c, m) = 1 - (1 - P(n, k, c))^m.$$

Assume that the MFAST size in the given set of trees T
is h. Let us denote the number of k taxa subtrees of the
MFAST as NS(h, k). The probability that at least one of
these subtrees will be found in at least one of the input
trees is then

$$P(n, k, c, m, h) = 1 - (1 - P(n, k, c, m))^{NS(h, k)}.$$

A lower bound to NS(h, k) is h - k + 1, which can be
obtained by picking a contiguous block of k taxa from the
canonical Newick representation of the MFAST by considering
all possible h - k + 1 starting point locations. Notice
that the larger the value of P(n, k, c, m, h), the higher the
chances that our algorithm will construct some part of
the MFAST. Similarly, the larger the value of NS(h, k), the
higher the chances that our algorithm will construct some
part of the MFAST.
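
The chain of formulas above can be evaluated numerically with a short Python sketch. It is a direct transcription of R(n), NU(k, c), P(n, k, c), P(n, k, c, m) and P(n, k, c, m, h), and makes two assumptions that are not in the text: NS(h, k) is replaced by its lower bound h - k + 1, so the returned value is itself a lower bound, and the "at least once" step is computed through `log1p`/`expm1` to stay accurate when the per-tree probability is very small. The function names are hypothetical, and the sketch is not claimed to reproduce the curves of Figure 6.

```python
import math


def rooted_trees(n: int) -> int:
    """R(n) = (2n-3)! / (2^(n-2) (n-2)!), the rooted bifurcating trees on n >= 2 taxa."""
    if n < 2:
        return 1  # a single leaf has exactly one topology
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))


def nu(k: int, c: int) -> int:
    """NU(k, 0) = 1 and NU(k, c) = NU(k, c-1) * 2 * (k + c - 2)."""
    count = 1
    for step in range(1, c + 1):
        count *= 2 * (k + step - 2)
    return count


def p_one_tree(n: int, k: int, c: int) -> float:
    """P(n, k, c): chance that one random n-taxon tree yields the k-taxon subtree."""
    return nu(k, c) * rooted_trees(n - (k + c) + 1) / rooted_trees(n)


def at_least_once(p: float, trials: int) -> float:
    """1 - (1 - p)^trials, evaluated stably for very small p."""
    if p >= 1.0:
        return 1.0
    return -math.expm1(trials * math.log1p(-p))


def p_success(n: int, k: int, c: int, m: int, h: int) -> float:
    """Lower bound on P(n, k, c, m, h) using NS(h, k) >= h - k + 1."""
    p_any_tree = at_least_once(p_one_tree(n, k, c), m)
    return at_least_once(p_any_tree, h - k + 1)
```

For example, `p_one_tree(4, 3, 0)` is 1/15: of the 15 rooted trees on four taxa, exactly one contains a given three-taxon clade intact.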
Figure 6 plots the success probability (i.e., P(n, k, c,
m, h)) of our method for varying parameter values. As
the MFAST size increases, the success probability rapidly
increases. This is because the number of alternative sub-
trees of the MFAST increases with increasing MFAST
size. Thus, the chance of observing at least one increases
as well. We observe that when the size of MFAST is
around 20% of the tree size, for all the parameters reported
our success probability becomes almost 1. As the num-
ber of contractions increases, the probability of success
increases. This is because a large number of contractions
increases the possibility of eliminating false positive taxa
from clades. In other words, it helps glue the taxa that
are normally scattered in the input trees back together
by removing the remaining taxa among them. When c
= 5, our success probability becomes almost one even
for MFASTs that are as small as 4-6% of the tree size.
As the number of trees increases, the success probability
increases as well. This is because we have more alternative
topologies with increasing number of trees. Thus, there
are more chances to have a small clade that contains a part
of the MFAST. Finally, it is worth noting that these results
are computed based on the assumption that the trees in T
are uniformly distributed among all possible topologies. In
practice, we expect that these trees are constructed with
the same or similar objectives (such as maximum parsimony
or maximum likelihood). As a result, they will likely
have a higher chance to contain large MFASTs. The results
we expect in practice will thus be similar or even better
than the theoretical results in Figure 6.

Figure 6 The probability of finding at least one seed which contains a part of an MFAST. The number of contractions c is set to 3, 4 and 5 and
the seed size k is 5, 4 or 3. The x-axis shows the MFAST size in terms of the percentage of the number of taxa in the trees in T. In (a),
we set the total number of trees m = 500. In (b) we set m = 1000.

Overall, we conclude from this experiment that even
small values of k and c suffice to capture a part of the
MFAST in phase two. Therefore, although our algorithm's
complexity increases exponentially with k and c, we do not
need to use large values for k and c. This enables our algorithm
to scale to very large datasets with thousands of taxa
and trees. These results explain the theory behind the practical
results we observed in Sections "Effects of number of
input trees" to "Effects of noise percentage".

Evaluation of state of the art methods
So far, we have shown that our method could success-
fully find the MFASTs contained in sets of trees T for up
to 1000 taxa and 200 trees (i.e., n = 1000 and m = 200).
An obvious question is how well existing methods perform
on the same datasets. Here, we answer this question
for two existing programs, namely PAUP* (version 4.0b10)
and Phylominer.
When we fix the number of trees and the number of
taxa to 100, PAUP* was able to find the MAST for all
datasets. As we grow the number of taxa to 250 or larger
while keeping the number of trees at 100, PAUP* runs out
of memory and fails to return any results. After reducing
the number of trees to 50, PAUP* still runs out of memory
and cannot report any results for more than 100 taxa.
The scalability problem of Phylominer is even more
severe. Phylominer is able to compute the MFASTs on
datasets with up to 20 taxa. However, as we increase
the number of taxa further, its performance deteriorates
quickly. When we set the number of taxa to 100, even with
as few as 100 trees, Phylominer takes more than a week
to report a result. Moreover, in our experiments, the maximum
size of the subtrees it found on average contained
fewer than 7 taxa, even though the size of the true MAST
was 10.
Another interesting question about existing methods
is whether the majority consensus rule can be used
to find MFASTs. To evaluate this, we used the same three
synthetic datasets used in Section "Effects of noise percentage".
Recall that each of these three datasets contains an
MFAST of size 15 which is embedded in 80% of the trees.
The datasets are created with 20%, 40% and 60% noise,
indicating different levels of difficulty in recovering the
embedded MFAST. We computed the 70% majority consensus
tree. Notice that if the majority consensus rule can identify
an MFAST, that would correspond to a bifurcating subtree
topology in the consensus tree. In other words, a subtree
is bifurcating in this experiment only if 70% or more of
the input trees agree on the topology of that subtree. The
resulting tree, however, was multifurcating for all three
datasets. This means that the majority consensus rule could
not recover even a small portion of the embedded tree,
while our method was able to locate the entire MFAST
successfully (see Table 4).
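
A toy example may help show why a majority consensus tree can collapse even when an agreement subtree is present in every input tree. The Python sketch below is an illustration with invented data and hypothetical function names, not the consensus procedure used in the experiments. It counts how often each clade (the leaf set under an internal node) occurs and keeps the clades reaching the threshold. In the four toy trees, the subtree ((a, b), c) is an agreement subtree of all of them, yet the taxon x moves around inside the clade, so no clade smaller than {a, b, c, x} reaches 70% support.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set of `tree`, set of leaf sets of all its internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaf_set, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaf_set |= child_leaves
        found |= child_clades
    found.add(leaf_set)
    return leaf_set, found


def majority_clades(trees, threshold):
    """Clades present in at least `threshold` (a fraction) of the trees."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree)[1])
    return {clade for clade, hits in counts.items() if hits / len(trees) >= threshold}


toy_trees = [
    (((("a", "x"), "b"), "c"), "o"),
    (((("b", "x"), "a"), "c"), "o"),
    ((("a", "b"), ("c", "x")), "o"),
    (((("a", "b"), "c"), "x"), "o"),
]
supported = majority_clades(toy_trees, 0.7)
# Only {a, b, c, x} and the root clade survive: the consensus is unresolved inside.
print(sorted("".join(sorted(clade)) for clade in supported))
```

The consensus therefore shows a four-way multifurcation, while the three-taxon agreement subtree has a frequency of 100%.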
These results demonstrate that neither PAUP* nor Phylominer
is well suited to finding agreement subtrees
in larger datasets; our method scales better in terms of
both the number of taxa and the number of trees. When
PAUP* runs to completion, we observed that it reports
the true results. Recall from previous experiments that
our method always found the true results on the same
datasets as well as larger datasets. This suggests that our
method has the potential to have an impact in large scale
phylogenetic analysis when existing methods fail.

Empirical dataset experiments
To examine the performance of the MFAST method on
real data, we performed experiments using 200 maximum
likelihood bootstrap trees from a phylogenetic analysis
of gymnosperms (959 taxa) and Saxifragales (950 taxa).
Specifically, we evaluated how the performance of the
MFAST algorithm was affected by the number of input
trees and the size of the input trees.

Effects of number of input trees
We first examined the effect of input tree number on
the size of MFAST. For both the gymnosperm and Sax-
ifragales trees, we generated 10 sets of 50 and 100 trees
by randomly sampling from the original 200 trees with-
out replacement. We compared the average size of the
MFAST in the 50 and 100 tree data sets with the size of the
MFAST in the original 200 tree data set. First, in all analyses,
the post-processing step greatly increases the size of
the MFAST, sometimes more than doubling it (Table 5).
This increase is similar to the one observed in the 1000
taxon simulated data sets (Table 3), emphasizing again
the importance of the post-processing step with large tree
data sets. Although the sizes of the MFASTs were similar,
they decreased slightly with the addition of more trees
(Table 5). This may simply be a matter of observing more
conflict with more trees.
The large gap between the MFAST sizes before and after
the post processing suggests that phase three is the main
reason behind the success of our method, and thus, the
costly seed combination phase (i.e., phase two) may be
unnecessary. To test whether this conjecture is correct,
we ran a variant of our method by disabling the second
phase; we only ran the post processing phase starting from
each seed as the initial MFAST one by one. We reported
the largest MFAST found that way as the output of this
variant in Table 5. The results demonstrate that although
phase three can grow a large FAST, phase two is essential
to find the largest frequent agreement subtree. In other
words, post processing finds the true MFAST only if a
large portion of it is already found (which is the role served
by phase two). In conclusion, phase three of our method
cannot replace phase two; both phases are essential for
the success of our method.

Effects of size of input tree
Next, we examined the effect of number of leaves in the
input trees on the size of MFASTs. For both the gym-
nosperm and Saxifragales trees, we generated 10 sets of
200 input trees with 100, 250, and 500 taxa. To make
each set, we randomly selected 100, 250, or 500 taxa,
and we deleted all other taxa from the original sets of
200 trees. Thus, these sets of trees with 100, 250, or 500
taxa are subtrees of the original data sets. The size of
the average MFAST increases with more taxa in the original
trees (Table 6). However, interestingly, the average
size of the MFASTs for the gymnosperm data set with
500 taxa is larger than the MFAST found from the original
gymnosperm trees with all the taxa (Table 6). Since
the MFAST from the 500 taxon data sets should all be
found within the full data set, this indicates that on the
larger trees, our method may not always find the true (i.e.,
largest) MFAST. The full data sets may require a larger
number of contractions to find the true MFASTs.
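
The taxon subsampling used in this experiment (select a random subset of taxa, then delete all other taxa from every tree) can be sketched as follows. This is an illustration under assumptions, not the script used for the paper: trees are nested tuples of taxon names, all trees are assumed to share the same taxa, and the function names are hypothetical. Nodes left with a single child after pruning are suppressed so that the result stays bifurcating.

```python
import random


def leaf_set(tree):
    """Set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def prune(tree, keep):
    """Delete every taxon not in `keep` and suppress single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def subsample_taxa(trees, size, rng):
    """Pick `size` taxa at random and prune every tree down to them."""
    taxa = sorted(leaf_set(trees[0]))  # assumes every tree carries the same taxa
    keep = set(rng.sample(taxa, size))
    return keep, [prune(tree, keep) for tree in trees]
```

Subsampling the trees themselves, as in the previous experiment, needs no pruning; `rng.sample(trees, 50)` draws 50 trees without replacement.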
Similar to the experiments in Section "Effects of number
of input trees", we investigated the gap between the
MFAST sizes before and after the post processing step. We
ran a variant of our method by disabling the second phase;
we only ran the post processing phase starting from each
seed as the initial MFAST one by one. We reported the
largest MFAST found that way as the output of this variant
in Table 6. The results are in parallel with those in Table 5.
Phase three can grow a large frequent agreement subtree,
but not quite as big as that when both phase two and three
are executed.

Table 5 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three)
                   MFAST size
                   Gymnosperms                Saxifragales
Number of trees    Before   After    Only     Before   After    Only
50                 78.5     129.8    99.5     64.7     122.0    84.1
100                68.4     119.2    83.1     55.4     112.8    74.7
200                76.0     118.0    84.0     40.0     105.0    75.0
The size of the MFAST found by running only the post processing step is also
shown. We run our method on the entire dataset that contains 200 trees as well
as randomly selected subsets of 50 and 100 trees. We repeated the 50 and 100
tree experiments 10 times by randomly selecting the trees from the entire
dataset and reported the average value.

Table 6 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three) for different number of taxa
                    MFAST size
                    Gymnosperms                Saxifragales
Number of leaves    Before   After    Only     Before   After    Only
100                 41.2     56.1     43.5     43.5     50.7     38.5
250                 67.2     88.5     63.0     62.3     76.2     54.6
500                 91.6     123.0    74.9     52.0     86.7     62.9
All taxa            76.0     118.0    84.0     40.0     105.0    75.0
The size of the MFAST found by running only the post processing step is also
shown. We run our method on the entire dataset that contains all the taxa (last
row) as well as randomly selected taxa subsets of size 100, 250 and 500. We
repeated the 100, 250 and 500 taxa experiments 10 times by randomly selecting
the taxa from the entire dataset and reported the average value.

Table 7 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets for different random subsamples of the total number of taxa
Sampling percentage    MFAST size (Gymnosperms)
                       87.5
                       88.4
                       87.5
We run our method by randomly picking 2%, 5%, 10%, 25%, 50%, 100% of the
seeds found in phase one for combination in phase two.

Effects of sample size
In our final experiment, we evaluated the effect of the
maximum time cutoff, which we described in Section "In-order
combination", on the accuracy of our method. Recall that
this cutoff limits the number of initial seeds tried in our
algorithm by randomly sampling a small percentage of the
seeds. It only uses the sampled seeds as possible initial
seeds. However, it uses the entire set of seeds while growing
the MFAST determined by the initial seed. As each
initial seed takes roughly the same amount of time to grow
into an MFAST, using x% of the seeds as the sample set
reduces the total running time of our method to roughly x%
of that of our original implementation.
We carried out this experiment as follows. For both
the gymnosperm and Saxifragales trees, we ran 10 sets
of experiments for each sampling percentage of 2, 5, 10,
25, 50 and 100%. Thus, in total we ran 60 (6 x 10)
experiments. Table 7 presents the average MFAST sizes for
varying sample sizes. The results demonstrate that even
for very small sampling percentages, our method finds
an MFAST that is almost as big as the MFAST found by using
the entire dataset (i.e., 100% sampling percentage). This is
very promising as it demonstrates that the running time
of our method can easily be cut to a small fraction
by sampling the starting seeds. The rationale behind this
is that the MFAST contains many seeds. Starting from
any of these seeds, our algorithm has the potential to lead
to that MFAST. The probability that at least one of these
seeds appears in the sample set is large, particularly for large
MFASTs.
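
The sampling argument above can be made concrete with a small calculation.
The sketch below assumes, purely for illustration, that the sample is drawn
uniformly without replacement and that every seed contained in the MFAST
would grow into it. The function name and the example numbers (1000 seeds,
50 of them inside the MFAST) are hypothetical and are not taken from our
experiments.

```python
from math import comb


def prob_sample_hits_mfast(total_seeds, mfast_seeds, sample_size):
    """Probability that a uniform sample (without replacement) of
    `sample_size` seeds contains at least one of the `mfast_seeds`
    seeds that lie inside the MFAST."""
    if not (0 <= mfast_seeds <= total_seeds and 0 < sample_size <= total_seeds):
        raise ValueError("need 0 <= mfast_seeds <= total_seeds "
                         "and 0 < sample_size <= total_seeds")
    # Complement event: every sampled seed lies outside the MFAST.
    misses = comb(total_seeds - mfast_seeds, sample_size)
    return 1.0 - misses / comb(total_seeds, sample_size)


if __name__ == "__main__":
    # Hypothetical setting: 1000 seeds in total, 50 of them inside the MFAST.
    for percent in (2, 5, 10, 25, 50, 100):
        sample = 1000 * percent // 100
        print(percent, round(prob_sample_hits_mfast(1000, 50, sample), 4))
```

Under these assumptions the probability grows quickly with the number of
seeds contained in the MFAST, which is consistent with the intuition that
large MFASTs are rarely missed by a small sample.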

Conclusion
In this paper, we present a heuristic for finding the maximum
frequent agreement subtrees (MFASTs). The heuristic uses a multi-step
approach which first identifies small candidate subtrees
(called seeds) from the set of input trees, combines the
seeds to build larger candidate MFASTs, and then performs
a post-processing step to increase the size of the
candidate MFASTs. We demonstrate that this heuristic
can easily handle data sets with 1000 taxa, greatly extending
the estimation of MFASTs beyond current methods.
Although this heuristic is not guaranteed to find
all MFASTs, it performs well using both simulated and
empirical data sets. Its performance is relatively robust to
the number of input trees and the size of the input trees,
although with the larger data sets, the post-processing
step becomes more important. Overall, this method provides
a simple and fast way to identify strongly supported
subtrees within large phylogenetic hypotheses.
Although the method we developed is described and
implemented for rooted, bifurcating trees, it can
be trivially extended to multifurcating as well as unrooted
trees. The central technical difference in the case of
unrooted trees would be the definition of clade (see
Definition 1), as that definition requires a root. A clade in an
unrooted tree encompasses two sets of nodes: (i) a given
set of taxa X, and (ii) the set of all internal nodes that are on
a path between two taxa in X in the phylogenetic tree.
We expect that this will increase the number of seeds
substantially and thus make the problem more computationally
intensive. The amount of increase will depend
on the tree topology. The theoretical worst case happens
when all the taxa are connected to a single internal node
(i.e., star topology). In that case any subset of taxa can
lead to a potential seed as long as the subset size is equal
to the seed size allowed. One possible way to overcome
this problem would be to exploit randomization or graph
coloring strategies and avoid enumerating the majority of the
possible seeds.
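
To give a sense of the scale involved, the sketch below contrasts two
counts discussed in this paper: the bound on the potential seeds of one
rooted, bifurcating input tree implied by phase one (at most n/(k + c)
clades with k + c taxa, each yielding C(k + c, c) potential seeds) and
the C(n, k) taxon subsets of the star topology. The function names and
the example values of n, k and c are illustrative assumptions.

```python
from math import comb


def rooted_seed_bound(n, k, c):
    """Upper bound on potential seeds from one rooted, bifurcating tree:
    at most n // (k + c) disjoint clades with k + c taxa, each giving
    C(k + c, c) ways to contract c taxa."""
    return (n // (k + c)) * comb(k + c, c)


def star_seed_count(n, k):
    """Number of taxon subsets of size k in a star topology, where any
    subset of the allowed seed size can lead to a potential seed."""
    return comb(n, k)


if __name__ == "__main__":
    # Hypothetical parameter choice: seeds of size k = 3 with c = 1 contraction.
    for n in (100, 1000):
        print(n, rooted_seed_bound(n, 3, 1), star_seed_count(n, 3))
```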


Abbreviations
MAST: Maximum agreement subtree; FAST: Frequent agreement subtree;
MFAST: Maximum frequent agreement subtree; MCMC: Markov chain Monte
Carlo.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
AR participated in algorithm development, implementation, experimental
evaluation and writing of the paper. TK participated in algorithm
development, experiment design and writing of the paper. GB participated in
experiment design, dataset collection and writing of the paper. All authors
read and approved the final manuscript.

Acknowledgments
This work was supported partially by the National Science Foundation (grants
CCF-0829867 and IIS-0845439).

Author details
1Electrical and Computer Engineering, University of Florida, Gainesville, FL,
USA. 2Computer and Information Science and Engineering, University of
Florida, Gainesville, FL, USA. 3Department of Biology, University of Florida,
Gainesville, FL, USA.

Received: 5 February 2012 Accepted: 5 September 2012
Published: 3 October 2012

References
1. Goloboff PA, Catalano SA, Mirande JM, Szumik CA, Arias JS, Kallersjo M,
Farris JS: Phylogenetic analysis of 73 060 taxa corroborates major
eukaryotic groups. Cladistics 2009, 25(3):211-230.
2. Price MN, Dehal PS, Arkin AP: FastTree 2 - approximately
maximum-likelihood trees for large alignments. PLoS ONE 2010,
5(3):e9490.
3. Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ: Understanding
angiosperm diversification using small and large phylogenetic
trees. American Journal of Botany 2011, 98:404-414.
4. Felsenstein J: Confidence limits on phylogenies: an approach using
the bootstrap. Evolution 1985, 39(4):783-791.
5. Farris JS, Albert VA, Kallersjo M, Lipscomb D, Kluge AG: Parsimony
jackknifing outperforms neighbor-joining. Cladistics 1996,
12(2):99-124.
6. Huelsenbeck JP, Rannala B, Masly JP: Accommodating phylogenetic
uncertainty in evolutionary studies. Science 2000,
288(5475):2349-2350. doi:10.1126/science.288.5475.2349.
7. Bryant D: A classification of consensus methods for phylogenetics. In
Bioconsensus (Piscataway, NJ, 2000/2001), volume 61 of DIMACS Ser. Discrete
Math. Theoret. Comput. Sci. Amer. Math. Soc.; 2003:163-183.
8. Finden CR, Gordon AD: Obtaining common pruned trees. J
Classification 1985, 2:255-276.
9. Amir A, Keselman D: Maximum agreement subtree in a set of
evolutionary trees: metrics and efficient algorithms. SIAM J Comput
1997, 26(6):1656-1669.
10. Kubicka E, Kubicki G, McMorris FR: An algorithm to find agreement
subtrees. J Classification 1995, 12(1):91-99.
11. Farach M, Przytycka TM, Thorup M: On the agreement of many trees. Inf
Process Lett 1995, 55(6):297-301.
12. Bryant D: Building trees, hunting for trees and comparing trees. PhD
thesis. University of Canterbury, Dept. Mathematics; 1997.
13. Cole R, Farach-Colton M, Hariharan R, Przytycka TM, Thorup M: An O(n log
n) algorithm for the maximum agreement subtree problem for
binary trees. SIAM J Comput 2000, 30(5):1385-1404.
14. Lee CM, Hung LJ, Chang MS, Shen CB, Tang CY: An improved algorithm
for the maximum agreement subtree problem. Inf Process Lett 2005,
94(5):211-216.
15. Berry V, Nicolas F: Improved parameterized complexity of the
maximum agreement subtree and maximum compatible tree
problems. IEEE/ACM Trans Comput Biology Bioinform 2006, 3(3):
289-302.
16. Guillemot S, Nicolas F: Solving the maximum agreement subtree and
the maximum compatible tree problems on many bounded degree
trees. In Proceedings of the 17th Annual Conference on Combinatorial
Pattern Matching (CPM'06). Edited by Lewenstein M, Valiente G.
Berlin, Heidelberg: Springer-Verlag:165-176. doi:10.1007/11780441_16.
17. Guillemot S, Nicolas F, Berry V, Paul C: On the approximability of the
maximum agreement subtree and maximum compatible tree
problems. Discrete Applied Mathematics 2009, 157(7):1555-1570.
18. Chi Y, Xia Y, Yang Y, Muntz RR: Correction to "Mining closed and
maximal frequent subtrees from databases of labeled rooted trees".
IEEE Trans Knowl Data Eng 2005, 17(12):1737.
19. Zhang S, Wang JTL: Mining frequent agreement subtrees in
phylogenetic databases. In Proceedings of the 6th SIAM International
Conference on Data Mining (SDM 2006). Edited by Ghosh J, Lambert D,
Skillicorn DB, Srivastava J. Bethesda, Maryland; April 2006:222-233.
20. Zhang S, Wang JTL: Discovering frequent agreement subtrees from
phylogenetic data. IEEE Trans Knowl Data Eng 2008, 20(1):68-82.
21. Cranston KA, Rannala B: Summarizing a posterior distribution of trees
using agreement subtrees. Syst Biol 2007, 56(4):578-590.
22. Pattengale ND, Swenson KM, Moret BME: Uncovering hidden
phylogenetic consensus. In Proc. 6th Int'l Symp. Bioinformatics Research &
Apps. (ISBRA'10), in Lecture Notes in Computer Science. Edited by Borodovsky
M, Gogarten JP, Przytycka TM, Rajasekaran S. Springer; 2010:128-139.


23. Pattengale ND, Aberer AJ, Swenson KM, Stamatakis A, Moret BME:
Uncovering hidden phylogenetic consensus in large data sets.
IEEE/ACM Trans Comput Biology Bioinform 2011, 8(4):902-911.
24. Aberer AJ, Stamatakis A: A simple and accurate method for rogue
taxon identification. In Proceedings of the IEEE International Conference
on Bioinformatics and Biomedicine (BIBM '11). Washington, DC, USA:
IEEE Computer Society:118-122. doi:10.1109/BIBM.2011.70.
25. Junier T, Zdobnov EM: The Newick utilities: high-throughput
phylogenetic tree processing in the Unix shell. Bioinformatics 2010,
26(13):1669-1670.
26. Swofford DL: PAUP*: Phylogenetic analysis using parsimony (and other
methods). Version 4.0 beta 10. Sunderland, Massachusetts: Sinauer
Assoc.; 2002.
27. Burleigh JG, Barbazuk WB, Davis JM, Morse AM, Soltis PS: Exploring
diversification and genome size evolution in extant gymnosperms
through phylogenetic synthesis. J Botany 2012, 2012:6. Article ID
292857. doi:10.1155/2012/292857.
28. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based
phylogenetic analyses with thousands of taxa and mixed models.
Bioinformatics 2006, 22(21):2688-2690.

doi:10.1186/1471-2105-13-256
Cite this article as: Ramu et al.: A scalable method for identifying frequent
subtrees in sets of large phylogenetic trees. BMC Bioinformatics 2012 13:256.






Full Text
Research article
A scalable method for identifying frequent subtrees in sets of large phylogenetic trees
Avinash Ramu (1), r.avinash@ufl.edu
Tamer Kahveci (2, corresponding author), tamer@cise.ufl.edu
J Gordon Burleigh (3), gburleigh@ufl.edu
(1) Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA
(2) Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
(3) Department of Biology, University of Florida, Gainesville, FL, USA
Source: BMC Bioinformatics
Section: Comparative genomics
ISSN: 1471-2105
Publication year: 2012
Volume: 13
Issue: 1
First page: 256
URL: http://www.biomedcentral.com/1471-2105/13/256
DOI: 10.1186/1471-2105-13-256
PMID: 23033843
Received: 5 February 2012; Accepted: 5 September 2012; Published: 3 October 2012
Copyright 2012 Ramu et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Keywords: Phylogenetic trees, Frequent subtree
Abstract
Background
We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.
Results
We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.
Conclusions
Although this heuristic does not guarantee to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
Background
Phylogenetic trees represent the evolutionary relationships of organisms. While recent advances in genomic sequencing technology and computational methods have enabled construction of extremely large phylogenetic trees (e.g., [1-3]), assessing the support for phylogenetic hypotheses, and ultimately identifying well-supported relationships, remains a major challenge in phylogenetics. Support for a tree often is determined by methods such as nonparametric bootstrapping [4], jackknifing [5], or Bayesian MCMC sampling (e.g., [6]), which generate a collection of trees with identical taxa representing the range of possible phylogenetic relationships. These trees can be summarized in a consensus tree (see [7]). Consensus methods can highlight support for specific nodes in a tree, but they also may obscure highly supported subtrees. For example, in Figure 1, the subtree containing taxa A, B, C, and D is present in all five input trees. However, due to the uncertain placement of taxon E, the majority rule consensus tree implies that the clades in the tree have relatively low (60%) support.
Figure 1 (a) A collection of five input trees. The same subtree with taxa A, B, C, and D is present in all input trees, and only the position of taxon E changes. (b) The majority rule consensus and maximum agreement subtrees of the five input trees in Figure 1a.
Alternate approaches have been proposed to reveal highly supported subtrees. The maximum agreement subtree (MAST) problem seeks the largest subtree that is present in all members of a given collection of trees [8]. For example, in Figure 1 the MAST includes taxa A, B, C, and D. Finding the MAST is an NP-hard problem [9], although efficient algorithms exist to compute the MAST in some cases (e.g., [9-17]). In practice, since any difference in any single tree will reduce the size of the MAST, the MAST is often quite small, limiting its usefulness.
A less restrictive problem is to find frequent agreement subtrees (FASTs), or subtrees that are found in many, but not necessarily all, of the input trees (see [18]). In this problem, a subtree is declared frequent if it is present in at least as many trees as a user-supplied frequency threshold. Several algorithmic approaches have been suggested to identify FASTs, and specifically the maximum FASTs (MFASTs), or FASTs that contain the largest number of taxa. A variant of this problem seeks the maximal FASTs, i.e., FASTs that are not contained in any other FAST. Notice that an MFAST is a maximal FAST; however, the converse is not necessarily true. Zhang and Wang defined algorithms, implemented in Phylominer, to identify FASTs from a collection of phylogenetic trees [19,20]. These algorithms are guaranteed to find all FASTs but they may be prohibitively slow for data sets larger than 20 taxa. Cranston and Rannala implemented Metropolis-Hastings and Threshold Accepting searches to identify large FASTs from a Bayesian posterior distribution of phylogenetic trees [21]. This approach can handle thousands of input trees but it may not be feasible if the trees have more than 100 taxa [21].
Another approach to reveal highly supported subtrees from a collection of trees is to identify and remove rogue taxa, or taxa whose position in the input trees is least consistent. Recently, several methods have been developed that can identify and remove rogue taxa from collections of trees with thousands of taxa [22-24]. However, unlike MAST or FAST approaches, they do not provide guarantees about the support for the remaining taxa.
In this paper, we describe a heuristic approach for identifying MFASTs in collections of trees. Unlike previous methods, our method easily scales to datasets with over a thousand taxa and hundreds of trees. Towards this goal, we develop a heuristic solution that works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step. This step ensures that the size (i.e., number of taxa) of the FAST found cannot be increased further by adding a new taxon without reducing its frequency below a user-supplied frequency threshold. We demonstrate that this heuristic can easily handle data sets with 1000 taxa. We test the effectiveness of this approach on simulated data sets and then demonstrate its performance on large, empirical data sets. Although our heuristic is not guaranteed to find all MFASTs or the largest MFAST in theory, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on the empirical data sets. Its performance is robust with respect to the number of input trees and the size of the input trees.
Methods
In this section we describe our method that aims to find Maximum Frequent Agreement SubTrees (MFASTs) in a given set of m phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$. Our method follows from the observation that an MFAST is present in a large number of trees in $\mathcal{T}$. The method builds MFASTs bottom up from small subtrees of taxa in the trees in $\mathcal{T}$. Briefly, it works in three phases.
• Phase 1. Seed generation (Section “Phase one: Seed generation”). In the first phase, we identify small subtrees from the input trees that have a potential to be a part of an MFAST. We call each such subtree a seed.
• Phase 2. Seed combination (Section “Phase two: Seed combination”). In the second phase, we construct an initial FAST by combining the seeds found in the first phase.
• Phase 3. Post processing (Section “Phase three: Post-processing”). In the third phase, we grow the FAST further to obtain the maximal FAST that contains it by individually considering the taxa which are not already in the FAST. We report the resulting maximal FAST as a possible MFAST.
First, we present the basic definitions needed for this paper in Section “Preliminaries and notation”. We then discuss each of the three phases above in detail.
Preliminaries and notation
In this section, we present the key definitions and notations needed to understand the rest of the paper. We describe our method using rooted and bifurcating phylogenetic trees. However, our method and definitions can easily be applied to unrooted or multifurcating trees with minor or no modifications. Also, we assume that all the taxa are placed at the leaf level nodes of the phylogenetic tree, and all the internal nodes are inferred ancestors. Figure 2(a) shows a sample phylogenetic tree built on five taxa. We define the size of a tree as the number of taxa in that tree. We start by defining key terms.
Definition 1 (Clade)
Let T be a phylogenetic tree. Given an internal node of T, we define the set of all nodes and edges of T contained under that node as the clade rooted at that node.
Figure 2 (a) A rooted, bifurcating phylogenetic tree T built on five taxa labeled with a, b, c, d and e. The internal nodes are shown with $x_0$, $x_1$, $x_2$ and $x_3$. (b) A clade of T rooted at $x_1$. (b) and (c) Two subtrees of T obtained by contracting the taxa sets {d, e} and {a, e}.
Each internal node of a phylogenetic tree corresponds to a clade of that tree. Figure 2(b) depicts the clade of the tree in Figure 2(a) rooted at $x_1$.
Definition 2 (Contraction)
Let T be a phylogenetic tree with n taxa. The contraction operation transforms T into a tree with n−1 taxa by removing a given taxon in T along with the edge that connects that taxon to T.
The contraction operation can extract the clades of a tree by removing all the taxa that are not a part of that clade. It can also extract parts of the tree that are not necessarily clades. We use the term subtree to denote a tree that is obtained by applying contractions to an arbitrary set of taxa in a given tree. The formal definition is as follows.
Definition 3 (Subtree)
Let T and T’ be two phylogenetic trees. We say that T’ is a subtree of T if T can be transformed into T’ by applying a series of contractions on T.
If a tree T’ is a subtree of another tree T, we say that T’ is present in T. Notice that a clade is always a subtree, but the converse is not always true. Figures 2(b) and 2(c) illustrate two subtrees of the tree in Figure 2(a). Let us denote the number of combinations of k taxa from a set of n taxa with $\binom{n}{k}$. In general, if a tree has n taxa, then that tree contains $\binom{n}{k}$ subtrees with k taxa. As a consequence, that tree contains $2^n - 1$ subtrees of any size including itself.
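The counting argument above can be checked by brute force on a small example. The sketch below assumes only that each non-empty subset of taxa induces exactly one subtree (obtained by contracting all other taxa); the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def count_subtrees_by_size(taxa):
    """Count, for every size k >= 1, the taxon subsets of size k.
    Each subset corresponds to the subtree left after contracting
    all taxa outside the subset."""
    n = len(taxa)
    return {k: sum(1 for _ in combinations(taxa, k)) for k in range(1, n + 1)}


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]  # the five taxa of Figure 2(a)
    counts = count_subtrees_by_size(taxa)
    n = len(taxa)
    # Each count agrees with the binomial coefficient C(n, k) ...
    assert all(counts[k] == comb(n, k) for k in counts)
    # ... and the counts add up to 2^n - 1.
    assert sum(counts.values()) == 2 ** n - 1
    print(counts)
```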
Definition 4 (Frequency)
Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of m phylogenetic trees and T be a phylogenetic tree. Let us denote the number of trees in $\mathcal{T}$ in which T is present with the variable m’. We define the frequency of T in $\mathcal{T}$ as
$$\mathrm{freq}(T, \mathcal{T}) = \frac{m'}{m}.$$
Definition 5 (FAST)
Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of m phylogenetic trees and T be a phylogenetic tree. Let γ be a number in the [0, 1] interval that denotes the frequency cutoff. We say that T is a Frequent Agreement SubTree (FAST) of $\mathcal{T}$ if its frequency in $\mathcal{T}$ is at least γ (i.e., $\mathrm{freq}(T, \mathcal{T}) \geq \gamma$).
We say that a FAST is maximal if there is no other FAST that contains all the taxa in that FAST. Clearly, larger FASTs indicate biologically more relevant consensus patterns. The following definition summarizes this.
Definition 6 (MFAST)
Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of m phylogenetic trees. Let γ be a number in the [0, 1] interval that denotes the frequency cutoff. A FAST T of $\mathcal{T}$ is a Maximum Frequent Agreement SubTree (MFAST) of $\mathcal{T}$ if there is no other FAST T’ of $\mathcal{T}$ that has a larger size than T.
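Definitions 3 to 5 can be illustrated with a short sketch. It assumes that rooted trees are given as nested tuples of taxon names, and it tests presence by comparing clusters (the sets of taxa below each node) after restricting the larger tree to the taxa of the smaller one; this cluster comparison is our illustrative choice and not necessarily how our implementation tests presence. All function names and the three example trees are hypothetical.

```python
def clusters(tree):
    """Return the set of clusters (frozensets of taxa, one per node) of a
    rooted tree given as nested tuples, e.g. (("a1", "a2"), "a3")."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def is_present(sub, tree):
    """True if `sub` can be obtained from `tree` by contracting taxa."""
    sub_clusters = clusters(sub)
    taxa = max(sub_clusters, key=len)  # all taxa of the candidate subtree
    restricted = {c & taxa for c in clusters(tree) if c & taxa}
    return restricted == sub_clusters


def freq(sub, trees):
    """Fraction m'/m of the input trees in which `sub` is present."""
    return sum(is_present(sub, t) for t in trees) / len(trees)


def is_fast(sub, trees, gamma):
    """True if `sub` reaches the frequency cutoff gamma."""
    return freq(sub, trees) >= gamma


if __name__ == "__main__":
    # Three hypothetical input trees on the taxa a1..a4.
    trees = [
        ((("a1", "a2"), "a3"), "a4"),
        ((("a1", "a2"), "a4"), "a3"),
        (("a1", "a2"), ("a3", "a4")),
    ]
    print(freq((("a1", "a2"), "a3"), trees))
    print(freq((("a3", "a4"), "a1"), trees))
```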
Formally, given a set of phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ and a frequency cutoff, γ, we would like to find the MFASTs in $\mathcal{T}$ in this paper. We develop an algorithm that aims to solve this problem. Table 1 lists the variables used throughout the rest of this paper.
Table 1 Commonly used variables and functions in this paper
Symbol                          Description
$\mathcal{T}$                   A set of phylogenetic trees
$T_i$                           ith tree
m                               Number of trees in $\mathcal{T}$
n                               Number of taxa in each input tree
$a_i$                           ith taxon
$\mathrm{freq}(T, \mathcal{T})$ Frequency of the subtree T in $\mathcal{T}$
γ                               Frequency cutoff
$S_i$                           ith seed (each seed is a subtree of a tree in $\mathcal{T}$)
k                               Size of a seed
c                               Number of contractions used to create a seed
Phase one: Seed generation
The first phase extracts small subtrees from the given set of trees. From these subtrees we extract the basic building blocks which are used to construct MFASTs. We call these building blocks seeds. Conceptually each seed is a phylogenetic tree that contains a small subset of the taxa that make up the trees in $\mathcal{T}$. We characterize each seed with three features that are listed below. We elaborate on each feature later in this section.
1. Seed size (k) is the number of taxa in the seed.
2. Number of contractions (c) is the number of taxa we prune from a clade taken from an input tree in order to extract the seed.
3. Frequency (f) is the fraction of input trees in which the seed is present.
We explain the seed features with the help of Figures 3 and 4. The first two characteristics explain how a seed can be found in one of the trees in $\mathcal{T}$. They indicate that there is a clade of a tree in $\mathcal{T}$ such that this clade contains k + c taxa and it can be transformed into that seed after c contractions. For instance in Figure 3, when k = 2 and c = 0, only seed $S_1$ can be extracted from $T_1$ by choosing the clade rooted at $x_2$. When k = 2 and c = 1, seeds $S_1$, $S_2$ and $S_3$ can be obtained using one contraction ($a_3$, $a_2$ and $a_1$ respectively) from the clade rooted at $x_1$.
Figure 3 $T_1$ is an input tree built on four taxa $a_1$, $a_2$, $a_3$ and $a_4$. The internal nodes of $T_1$ are labeled as $x_0$, $x_1$ and $x_2$. $S_1$ is the only seed obtained from $T_1$ when k = 2 and c = 0. That is, $S_1$ is identical to the clade rooted at $x_2$. $S_1$, $S_2$ and $S_3$ are the seeds extracted from $T_1$ when k = 2 and c = 1. They are all extracted from the clade rooted at $x_1$ by contracting $a_3$, $a_2$ and $a_1$ respectively.
Figure 4 The set of input trees $T_1$, $T_2$, $T_3$ and the set of all nine potential seeds $S_1, S_2, \ldots, S_9$ when the seed characteristics are set to k = 3 and c = 1. All the potential seeds have three taxa as k = 3. We need one contraction from the input tree to obtain each seed. $S_1$ has frequency 1.0 as it is present in $T_1$, $T_2$ and $T_3$. Seed $S_2$ has frequency ∼0.67 as it is present in $T_1$ and $T_2$. The remaining seeds have frequency ∼0.33 as each appears in only one of the three trees.
The last feature denotes the fraction of trees in $\mathcal{T}$ in which the seed is present. For example in Figure 4, there are nine seeds $S_1, S_2, \ldots, S_9$ extracted from the three input trees using only one contraction. Among these, the frequency of $S_1$ is 1 as it is present in all the trees. The frequency of $S_2$ is about 0.67 as it is present in only two out of three trees ($T_1$ and $T_2$). The frequency of the rest of the seeds is only about 0.33. Recall that, by definition, an MFAST is present in at least a fraction γ of the trees in $\mathcal{T}$. Therefore, we consider only the seeds whose frequency values are equal to or greater than this number (i.e., f ≥ γ).
Given the values of k, c and γ, we extract all the seeds which possess the desired feature values from the set of input trees as follows. In the newick string representation of a tree, a pair of matching parentheses corresponds to an internal node in the tree. The number of taxa in the clade rooted at this internal node is given by the number of labels between the two matching parentheses. Following from this observation, we scan the newick string of each tree one by one. For each such tree, we identify the clades which have k + c taxa. Notice that, if a tree contains n taxa, then it contains at most $\frac{n}{k+c}$ clades of size k + c as no two such clades can contain common taxa. We then extract all combinations of k taxa from each of these clades by contracting the remaining c taxa. The number of ways this can be done is $\binom{k+c}{c}$. Notice that all the small trees extracted this way possess the first two characteristics explained above. At this point, however, we do not know their frequencies. Therefore, we call them potential seeds. It is worth mentioning that the same seed might be extracted from different trees. As we extract a new potential seed, before storing it in the list of potential seeds, we check if it is already present there. We include it in the potential seed list only if it does not exist there yet. Otherwise, we ignore it. This way, we maintain only one copy of each seed.
Once we build our potential seed list for all the trees in $\mathcal{T}$, we go over them one by one and count their frequency in $\mathcal{T}$ as the fraction of trees that contain them. We filter out all the potential seeds whose frequencies are less than the frequency cutoff. We keep the remaining ones as the list of seeds along with the frequency of each seed.
In Figure 4, consider the tree $T_1$ that has four taxa. For k = 3 and c = 1, there is only one clade of size k + c = 4 which is the tree $T_1$ itself. We extract four potential seeds, each having three leaves from this tree. The potential seeds in this figure are given by $S_1$, $S_2$, $S_5$ and $S_7$ which we extract by contracting $a_4$, $a_3$, $a_2$ and $a_1$ respectively from $T_1$.
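The steps of phase one can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: trees are given as nested tuples of taxon names rather than newick strings, each seed is stored as the set of its clusters (so that identical seeds extracted from different trees coincide), and the function names and the three example trees are hypothetical (they are not the trees of Figure 4).

```python
from itertools import combinations


def clusters(tree):
    """Set of clusters (frozensets of taxa, one per node) of a rooted
    tree given as nested tuples, e.g. (("a1", "a2"), "a3")."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def restrict(cluster_set, taxa):
    """Clusters of the tree obtained by contracting every taxon not in `taxa`."""
    return frozenset(c & taxa for c in cluster_set if c & taxa)


def clades(tree):
    """Yield every internal node (clade) of a nested-tuple tree."""
    if not isinstance(tree, str):
        yield tree
        for child in tree:
            yield from clades(child)


def generate_seeds(trees, k, c, gamma):
    """Return {seed: frequency} for all seeds of size k obtained with c
    contractions from a clade of k + c taxa and with frequency >= gamma."""
    tree_clusters = [clusters(t) for t in trees]
    potential = set()  # a set keeps only one copy of each potential seed
    for tree in trees:
        for clade in clades(tree):
            clade_clusters = clusters(clade)
            taxa = max(clade_clusters, key=len)
            if len(taxa) != k + c:
                continue
            # Keep k taxa, i.e. contract the remaining c taxa of the clade.
            for kept in combinations(sorted(taxa), k):
                potential.add(restrict(clade_clusters, frozenset(kept)))
    seeds = {}
    for seed in potential:
        taxa = max(seed, key=len)
        present = sum(restrict(tc, taxa) == seed for tc in tree_clusters)
        frequency = present / len(trees)
        if frequency >= gamma:
            seeds[seed] = frequency
    return seeds


if __name__ == "__main__":
    trees = [
        ((("a1", "a2"), "a3"), "a4"),
        ((("a1", "a2"), "a4"), "a3"),
        (("a1", "a2"), ("a3", "a4")),
    ]
    for seed, frequency in generate_seeds(trees, k=3, c=1, gamma=0.6).items():
        print(sorted(max(seed, key=len)), frequency)
```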
Phase two: Seed combination
At the end of the first phase, we obtain a set of frequent seeds from the input trees. Notice that each seed is a FAST as each seed is present in a sufficient number of trees specified by γ. These seeds are the basic building blocks of our method. In the second phase of our method, we combine subsets of these seeds to construct larger FASTs.
We first define what it means to combine two seeds. In order to combine two seeds, it is a necessary condition that both seeds are present in at least one common tree T in $\mathcal{T}$. We call such a tree T the reference tree. We combine two seeds with the guidance of a reference tree. Let $S_1$ and $S_2$ be two seeds and let T be their reference tree. Let $L_1$, $L_2$ and L be the set of taxa in $S_1$, $S_2$ and T respectively. Combining $S_1$ and $S_2$ results in the tree that is equivalent to the one obtained by contracting the taxa in $L - (L_1 \cup L_2)$ from T. For simplicity, we will denote the combine operation using T as the reference tree with the $\oplus_T$ symbol. For instance we denote combining $S_1$ and $S_2$ with T being the reference tree as $S_1 \oplus_T S_2$. To simplify our notation, whenever the identity of the reference tree is irrelevant, we will use the symbol $\oplus$ instead of $\oplus_T$.
Figure 5 demonstrates how two seeds $S_1$ and $S_2$ are combined with the help of the reference tree T. In this figure, both $S_1$ and $S_2$ are subtrees of T. Thus, it is possible to use T as the reference tree. We have $L_1 = \{a_1, a_3, a_4\}$ and $L_2 = \{a_1, a_2, a_5, a_7\}$. Thus, we build $C = S_1 \oplus_T S_2$ by contracting the taxa in $L - (L_1 \cup L_2) = \{a_6, a_8\}$ from T.
Figure 5 T is the reference tree. $S_1$ and $S_2$ are the seeds to be combined; both are present in T. C is obtained by pruning the subtree containing taxa $a_1$, $a_2$, $a_3$, $a_4$, $a_5$ and $a_7$ from T.
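The combine operation can be sketched as follows, again assuming nested-tuple trees compared through their clusters. The function names are hypothetical, and the eight-taxon reference tree is an invented example that only reuses the taxon sets $L_1$ and $L_2$ of Figure 5; its topology is not taken from the figure.

```python
def clusters(tree):
    """Set of clusters (frozensets of taxa, one per node) of a rooted
    tree given as nested tuples."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def restrict(cluster_set, taxa):
    """Clusters of the tree obtained by contracting every taxon not in `taxa`."""
    return frozenset(c & taxa for c in cluster_set if c & taxa)


def taxa_of(cluster_set):
    """The full taxon set is the largest cluster (the root)."""
    return max(cluster_set, key=len)


def is_present(sub_clusters, tree_clusters):
    return restrict(tree_clusters, taxa_of(sub_clusters)) == frozenset(sub_clusters)


def combine(seed1, seed2, reference):
    """Combine two seeds using `reference` as the reference tree: contract
    from the reference every taxon in L - (L1 | L2). The result is
    returned as a set of clusters."""
    ref = clusters(reference)
    s1, s2 = clusters(seed1), clusters(seed2)
    if not (is_present(s1, ref) and is_present(s2, ref)):
        raise ValueError("both seeds must be present in the reference tree")
    return restrict(ref, taxa_of(s1) | taxa_of(s2))


if __name__ == "__main__":
    # Hypothetical reference tree on a1..a8.
    T = (((("a1", "a2"), "a3"), "a4"), (("a5", "a6"), ("a7", "a8")))
    S1 = (("a1", "a3"), "a4")
    S2 = (("a1", "a2"), ("a5", "a7"))
    C = combine(S1, S2, T)
    print(sorted(taxa_of(C)))  # taxa of the combined tree; a6 and a8 are contracted
```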
So far, we have explained how to combine two seeds $S_1$ and $S_2$ using a reference tree. It is possible that many trees in $\mathcal{T}$ have both seeds present in them. Thus, one question is which of these trees we should use as the reference tree to combine the two seeds. The brief answer is that all such trees need to be considered. However, we make several observations that help us avoid combining $S_1$ and $S_2$ using each such reference tree one by one exhaustively, without ignoring any such tree. We explain them next.
Consider two trees $T_1$ and $T_2$ from $\mathcal{T}$ in which both seeds are present. There are two cases for $T_1$ and $T_2$.
• Case 1: $S_1 \oplus_{T_1} S_2 = S_1 \oplus_{T_2} S_2$. In this case, it does not matter whether we use $T_1$ or $T_2$ as the reference tree. They will both lead to the same combined subtree. Thus, we use only one.
• Case 2: $S_1 \oplus_{T_1} S_2 \neq S_1 \oplus_{T_2} S_2$. In this case, the trees $T_1$ and $T_2$ lead to alternative combination topologies. So, we consider both of them separately.
We utilize the observations above as follows. We start by picking one reference tree arbitrarily. Once we create a combined subtree using that tree, we check whether that subtree is present in the remaining trees in $\mathcal{T}$. We mark those trees that contain it as considered for reference tree and never use them as reference for the same seed pair again. This is because those trees fall into the first case described above. This way, we also obtain the frequency of the combined subtree in $\mathcal{T}$. If the number of unmarked trees is too small (i.e., less than γ × m), then even if all the remaining trees agree on the same combined topology for the two seeds under consideration, they are not sufficient to make it a FAST. Thus, we do not use any of the remaining trees as reference for those two seeds. Otherwise, we pick another unmarked tree arbitrarily and repeat the same process until we run out of reference trees.
The next question we need to answer is which seed pairs we should combine. To answer this question, we first make the following proposition.
Proposition 1
Assume that we are given a set of phylogenetic trees $\mathcal{T}$. Let $S_1$ and $S_2$ be two seeds constructed from the trees in $\mathcal{T}$. For all trees $T \in \mathcal{T}$, we have the following inequality:
$$\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \min\{\mathrm{freq}(S_1, \mathcal{T}),\ \mathrm{freq}(S_2, \mathcal{T})\}$$
Proof
For any $T$, both $S_1$ and $S_2$ are subtrees of $S_1 \oplus_T S_2$. Thus, if $S_1 \oplus_T S_2$ is present in a tree, then both $S_1$ and $S_2$ are present in that tree. As a result, $\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \mathrm{freq}(S_1, \mathcal{T})$ and $\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \mathrm{freq}(S_2, \mathcal{T})$. Hence,
$$\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \min\{\mathrm{freq}(S_1, \mathcal{T}),\ \mathrm{freq}(S_2, \mathcal{T})\}$$
Proposition 1 states that as we combine pairs of seeds to grow them, their frequency never increases. This suggests that it is desirable to combine two seeds only if both of them have large frequencies. If one of them has a small frequency, then regardless of the frequency of the other, the combined tree will have a small frequency. As a result, its chance to grow into a larger tree through additional combine operations gets smaller. Following this intuition, we develop two approaches for combining the seeds.
1. In-order Combination (Section “In-order combination”).
2. Minimum Overlap Combination (Section “Minimum overlap combination”).
Both approaches accept the list of seeds computed in the first phase as input and produce a larger FAST that is a combination of multiple seeds. Both also assume that the list of input seeds is already sorted in decreasing order of frequency. We discuss these approaches next.
In-order combination
The in-order combination approach follows from Proposition 1. It assumes that the seeds with higher frequencies have greater potential to be a part of an MFAST. It exploits this assumption as follows. First, it picks a seed as the starting point to create a FAST. It then grows this seed by combining it with other seeds, starting from the most frequent one, as long as the frequency of the resulting tree remains at least as large as the given cutoff γ. It repeats this process by trying each seed as the starting point. Algorithm 1 presents this approach.
Algorithm 1 In order combination
FAST ← ∅
for all seeds $S_i$ do
    FAST’ ← $S_i$
    Mark $S_i$ as considered
    repeat
        $S_j$ ← seed with highest frequency among unconsidered seeds
        Mark $S_j$ as considered
        CUTOFF ← γ
        t_FAST’ ← FAST’
        repeat
            Pick the next unconsidered tree $T \in \mathcal{T}$ as reference
            Mark all the trees that contain FAST’ $\oplus_T$ $S_j$ as considered
            if freq(FAST’ $\oplus_T$ $S_j$, $\mathcal{T}$) ≥ CUTOFF then
                t_FAST’ ← FAST’ $\oplus_T$ $S_j$
                CUTOFF ← freq(FAST’ $\oplus_T$ $S_j$, $\mathcal{T}$)
            end if
        until Less than γ × m unmarked reference trees are left in $\mathcal{T}$
        FAST’ ← t_FAST’
        Unmark all trees in $\mathcal{T}$
    until all seeds are considered
    if size of FAST’ ≥ size of FAST then
        FAST ← FAST’
    end if
    Unmark all seeds
end for
In Algorithm 1, we first initialize the FAST as empty. We then consider each seed one by one. We initialize a temporary subtree, denoted by FAST’, with the seed $S_i$ under consideration and mark $S_i$ as considered. We combine FAST’ with a seed $S_j$ that has the highest frequency among the seeds that have not been added. If multiple seeds have the highest frequency, we randomly pick one of them and mark that seed $S_j$ as added to FAST’. There can be alternative ways to combine FAST’ with $S_j$, leading to different topologies. We use the trees in $\mathcal{T}$ that contain both FAST’ and $S_j$ as guides to try only the topologies that exist in $\mathcal{T}$. We stop constructing alternative topologies as soon as too few trees remain to yield a frequency of γ. We set FAST’ to the combined seed if the combined seed has a large enough frequency. We then consider the seed with the next highest frequency for addition and repeat this step until all $S_j$ have been considered. If the resulting temporary FAST is larger than FAST, we replace the smaller FAST with the larger one. In the next iteration, we initialize FAST’ with the next $S_i$. Using this approach we can initialize FAST’ with every $S_i$. Alternatively, if the user wishes to limit the running time with a maximum time cutoff, we stop the outermost loop (i.e., alternative initializations of FAST’) as soon as the allowed running time budget is reached.

Notice that in Algorithm 1 each seed $S_i$ can lead to a different FAST. We record only the FAST that has the largest size. However, it is trivial to maintain the top k FASTs with the largest size instead if the user is looking for k alternative maximal FASTs.
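The greedy loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' C code: trees are nested tuples of taxon names, presence of a subtree is tested by restricting a tree to the subtree's taxa, and the combine operation is taken to be the restriction of the reference tree to the taxa of both inputs. All function names are hypothetical.

```python
def restrict(tree, taxa):
    """Restrict a rooted tree to `taxa`; return an order-free canonical form or None."""
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kids = [r for r in (restrict(child, taxa) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else frozenset(kids)


def leaves(subtree):
    """Set of taxa in a tree (nested tuples or canonical form)."""
    if isinstance(subtree, str):
        return {subtree}
    return set().union(*(leaves(child) for child in subtree))


def freq(subtree, trees):
    """Fraction of trees whose restriction to the subtree's taxa equals the subtree."""
    taxa = leaves(subtree)
    return sum(restrict(t, taxa) == subtree for t in trees) / len(trees)


def grow(fast, seed, trees, gamma):
    """Most frequent combination of `fast` and `seed` above the cutoff, else `fast`."""
    taxa = leaves(fast) | leaves(seed)
    best, cutoff = fast, gamma
    unmarked = [t for t in trees
                if restrict(t, leaves(fast)) == fast and restrict(t, leaves(seed)) == seed]
    while unmarked and len(unmarked) >= gamma * len(trees):
        combined = restrict(unmarked[0], taxa)      # next reference tree
        f = freq(combined, trees)
        if f >= cutoff:
            best, cutoff = combined, f
        # Mark (drop) every tree that contains this combined topology.
        unmarked = [t for t in unmarked if restrict(t, taxa) != combined]
    return best


def in_order_combination(seeds, trees, gamma):
    """Try each seed as the start and add the others in decreasing frequency order."""
    ordered = sorted(seeds, key=lambda s: freq(s, trees), reverse=True)
    result = None
    for i, start in enumerate(ordered):
        fast = start
        for j, seed in enumerate(ordered):
            if j != i:
                fast = grow(fast, seed, trees, gamma)
        if result is None or len(leaves(fast)) >= len(leaves(result)):
            result = fast
    return result
```

Ties in frequency are broken here by sort order rather than randomly, which keeps the sketch deterministic.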
Minimum overlap combination
The purpose of combining seeds is to construct a FAST that is large in size. Our in-order combination approach (Section “In-order combination”) aimed to maximize the frequency of the combined seeds. In this section, we develop our second approach, named Minimum Overlap Combination. This approach picks seeds so that their combination produces as large a subtree as possible. We elaborate on this approach next.

When we combine two seeds, the size of the resulting tree becomes at least as big as the size of each of these seeds. Formally, let $S_1$ and $S_2$ be two seeds (i.e., trees). Let $L_1$ and $L_2$ be the sets of taxa contained in $S_1$ and $S_2$. We denote the size of a set, say $L_1$, with $|L_1|$. The size of the tree resulting from the combination of $S_1$ and $S_2$ is $|L_1| + |L_2| - |L_1 \cap L_2|$. For a given fixed seed size, the first two terms of this formulation remain unchanged regardless of the seed. The last term determines the growth in the size of the FAST. Thus, in order to grow the FAST rapidly, it is desirable to combine two frequent subtrees with a small number of common taxa.

Our second approach follows from the observation above. We introduce a criterion called the overlap between two subtrees, defined as the number of taxa common to them. Our minimum overlap combination approach works the same as Algorithm 1, with a minor difference in selecting the seed $S_j$ that will be combined with the current temporary FAST (i.e., FAST’). Rather than choosing the seed with the largest frequency, this approach chooses the one that has the least overlap with FAST’ among all the unconsidered and frequent seeds. If multiple seeds have the same smallest overlap, it uses the frequency as the tie breaker and chooses the one with the largest frequency among those.
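The selection rule of the minimum overlap approach can be illustrated with a short Python sketch. It assumes each candidate seed is summarized by its set of taxa and its frequency; the names `combined_size` and `pick_min_overlap` are hypothetical and only the selection step is shown.

```python
def combined_size(l1, l2):
    """Size of the combined tree: |L1| + |L2| - |L1 ∩ L2|."""
    return len(l1) + len(l2) - len(l1 & l2)


def pick_min_overlap(fast_taxa, candidates):
    """Pick the (taxa, frequency) candidate with the least overlap with the
    current FAST', breaking ties in favor of the larger frequency."""
    return min(candidates, key=lambda cand: (len(fast_taxa & cand[0]), -cand[1]))


fast_taxa = {"a", "b", "c"}
candidates = [
    ({"b", "c", "d"}, 0.90),  # overlap 2
    ({"c", "d", "e"}, 0.80),  # overlap 1
    ({"a", "e", "f"}, 0.85),  # overlap 1, higher frequency wins the tie
]
chosen_taxa, chosen_freq = pick_min_overlap(fast_taxa, candidates)
```

In this toy input the third candidate is selected, and combining it with the current FAST’ would give a tree of five taxa if the combination is frequent.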
Phase three: Post-processing
So far we described how to obtain seeds (Section “Phase one: Seed generation”) and how to combine them to construct a FAST (Section “Phase two: Seed combination”). The two approaches we developed for combining seeds aim to maximize the size of the FAST. However, they do not ensure the maximality of the resulting FAST. There are two main reasons that prevent our seed combining algorithms from constructing a maximal FAST. First, some of the taxa of a maximal FAST may not appear in any seed (i.e., false negatives). As a result, no combination of seeds will lead to that maximal FAST. Second, even if all the taxa of a maximal FAST are parts of at least one seed, our algorithms will reject combining such a seed with the current FAST if the seed contains other taxa that are not part of the maximal FAST (i.e., false positives).

In the post-processing phase, we tackle the above-mentioned problems. Algorithm 2 describes the post-processing phase in detail. We consider all taxa that are not already present in the FAST one by one. We iteratively grow the current FAST by including one more taxon at a time if the frequency of the resulting FAST remains at least as large as the frequency cutoff γ. We repeat these iterations until no new taxon can be included in the FAST. Thus, the resulting FAST is guaranteed to be maximal.
Algorithm 2 Post processing
INPUT = FAST from the seed combination phase
INPUT = $\mathcal{T}$
OUTPUT = Maximal FAST
RESULT ← FAST
for all $a_i$ not in FAST do
    CUTOFF ← γ
    t_RESULT ← RESULT
    repeat
        Pick the next unconsidered tree $T \in \mathcal{T}$ as reference
        RESULT’ ← RESULT $\oplus_T$ $a_i$
        Mark all the trees that contain RESULT’ as considered
        if frequency of RESULT’ ≥ CUTOFF then
            t_RESULT ← RESULT’
            CUTOFF ← frequency of RESULT’
        end if
    until Less than γ × m unmarked reference trees are left in $\mathcal{T}$
    RESULT ← t_RESULT
    Unmark all trees in $\mathcal{T}$
end for
return RESULT

We expect the post processing step to identify quickly the taxa that have a potential to be in an MFAST but might not have been considered during the seed generation and seed combination phases. At the end of the post processing step we obtain an MFAST.
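A minimal Python sketch of the post-processing idea follows. It is an illustration under assumptions, not the authors' Perl code: trees are nested tuples of taxon names that all share the same taxon set, adding a taxon with reference tree $T$ is modeled as restricting $T$ to the enlarged taxon set, and the outer loop repeats until no taxon can be added, as the text describes. The function names are hypothetical.

```python
def restrict(tree, taxa):
    """Restrict a rooted tree to `taxa`; return an order-free canonical form or None."""
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kids = [r for r in (restrict(child, taxa) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else frozenset(kids)


def leaves(subtree):
    """Set of taxa in a tree (nested tuples or canonical form)."""
    if isinstance(subtree, str):
        return {subtree}
    return set().union(*(leaves(child) for child in subtree))


def post_process(fast, trees, gamma):
    """Grow `fast` one taxon at a time while its frequency stays >= gamma."""
    m = len(trees)
    result = fast
    changed = True
    while changed:  # repeat until no new taxon can be included
        changed = False
        for taxon in sorted(leaves(trees[0]) - leaves(result)):
            taxa = leaves(result) | {taxon}
            best, cutoff = result, gamma
            # Reference trees must already contain the current result.
            unmarked = [t for t in trees if restrict(t, leaves(result)) == result]
            while unmarked and len(unmarked) >= gamma * m:
                extended = restrict(unmarked[0], taxa)
                f = sum(restrict(t, taxa) == extended for t in trees) / m
                if f >= cutoff:
                    best, cutoff = extended, f
                unmarked = [t for t in unmarked if restrict(t, taxa) != extended]
            if best != result:
                result, changed = best, True
    return result
```

Taxa are visited in sorted order only to make the sketch deterministic; the order in which taxa are tried is not specified here.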
Complexity analysis of our method
In this section we discuss the complexity of our method in terms of the three phases involved in it. Let $\mathcal{T}$ be a set of m phylogenetic trees having n leaves each. The complexity of each phase of our method is as follows.
Phase one
Finding the seeds involves enumerating all the subtrees and checking their frequencies. Given seed size k and number of contractions c, each tree will contain at most $\frac{n}{k+c}$ clades, each leading to $\binom{k+c}{c}$ alternative subtrees. Thus, in total there can be up to $\frac{mn}{k+c}\binom{k+c}{c}$ seeds (possibly many of them identical) from all the trees in $\mathcal{T}$. Typically, the values of k and c are fixed and small (in our experiments we have k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5}), leading to $O(mn)$ seeds.

The complexity of finding whether a seed is present in a single tree is $O(n \log n)$. Given that there are m trees in $\mathcal{T}$, the cost of computing the frequency of a single seed is $O(mn \log n)$. Thus, the time complexity for finding the frequency of all the seeds is this expression multiplied by the number of seeds, which is $O(m^2 n^2 \log n)$.
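To get a feel for the constant hidden in the $O(mn)$ seed count, the bound can be evaluated directly. The sketch below assumes the bound is read as $\frac{mn}{k+c}\binom{k+c}{c}$; the function name `max_seed_count` is hypothetical.

```python
from math import comb


def max_seed_count(m, n, k, c):
    """Upper bound (m * n / (k + c)) * C(k + c, c) on the seeds from phase one."""
    return (m * n / (k + c)) * comb(k + c, c)


# Bounds for the parameter ranges used in the experiments, with m = 100, n = 1000.
bounds = {(k, c): max_seed_count(100, 1000, k, c)
          for k in (3, 4, 5) for c in range(6)}
```

For fixed k and c the bound grows linearly in both m and n, which is what the $O(mn)$ estimate expresses.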
Phase two
Consider a set of p frequent seeds that will be considered for combining in this phase. Recall that we have two approaches to combine them. Below, we focus on each.
INORDER COMBINATION We try to combine each seed with every other seed, leading to $O(p^2)$ iterations. The complexity of checking the frequency of each combined subtree is $O(mn \log n)$. Also, there can be up to $O(m)$ different reference trees for guiding the combine operation. Multiplying these terms, we obtain the complexity of phase two using this approach as $O(p^2 m^2 n \log n)$.
MINIMUM OVERLAP COMBINATION The complexity of combining the frequent seeds using the minimum overlap combination approach is very similar to that of the inorder approach except for an additional term. The additional complexity arises because we maintain the overlap between the subtrees. This leads to the complexity $O(p^2 n^2 + p^2 m^2 n \log n)$.
Phase three
Here, we consider the FAST obtained from each of the p frequent seeds in phase two. For each FAST, we go over each taxon one by one, leading to $O(n)$ iterations. There can be up to $O(\gamma \times m)$ references to add a taxon. So the cost of extending all p FASTs is $O(\gamma \times mnp)$.

Notice that each frequent seed has to appear in at least γ × m trees. Thus, the number of unique frequent seeds p is bounded by $O\!\left(\frac{mn}{\gamma \times m}\right) = O\!\left(\frac{n}{\gamma}\right)$. Thus, adding the cost of all three phases, the overall time complexity of our method using inorder combination is
$$O\!\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + m n^2\right).$$
That using minimum overlap combination is
$$O\!\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + \frac{n^4 \log n}{\gamma^2} + m n^2\right).$$
In the two summations above, the second term is asymptotically larger than the first and the last terms. Thus, we can simplify the asymptotic time complexity of inorder and minimum overlap combinations as
$$O\!\left(\frac{m^2 n^3 \log n}{\gamma^2}\right) \quad \text{and} \quad O\!\left(\frac{n^3 \log n}{\gamma^2}\,(m^2 + n)\right)$$
respectively.
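As a check on the totals above, the phase two and phase three terms follow by substituting the bound $p = O(n/\gamma)$ into the per-phase costs. The block below only restates expressions already given, written out step by step.

```latex
\begin{align*}
\text{Phase two (inorder):}\quad
  & O(p^2 m^2 n \log n)
    = O\!\left(\frac{n^2}{\gamma^2}\, m^2 n \log n\right)
    = O\!\left(\frac{m^2 n^3 \log n}{\gamma^2}\right) \\
\text{Phase three:}\quad
  & O(\gamma\, m\, n\, p)
    = O\!\left(\gamma\, m\, n \cdot \frac{n}{\gamma}\right)
    = O(m n^2)
\end{align*}
```

Since $\gamma \le 1$ and $n \ge 1$, the phase two term is at least as large as both $m^2 n^2 \log n$ and $m n^2$, which is why it dominates the inorder total.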
Results and discussion
This section evaluates the performance of our MFAST algorithm experimentally.
Implementation details
We implemented our MFAST algorithm using C and Perl. More specifically, we implemented the first two phases (seed generation and seed combination) in C and the third phase (post processing) in Perl. We utilize the functions provided in the Newick Utilities [25] package by modifying the source code provided in that package. We use k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5} in all of our experiments unless otherwise stated. In our experiments, we observed that the minimum overlap combination produced larger MFASTs than the in-order combination approach. Therefore, we limit our experimental results to the minimum overlap combination approach.
Methods compared against
We have compared our method against Phylominer [20] and the MAST command implemented in PAUP* [26]. Among these, Phylominer also seeks MFASTs in a collection of trees. However, the time complexity of this method is exponential in the size of the input trees, and hence it becomes intractable for large trees. In our experiments, we observed that it does not scale beyond 50 taxa. PAUP* is primarily a program for phylogenetic inference, although it can also compute MASTs. MASTs have a strict 100% agreement criterion, unlike the arbitrary frequency cutoff values γ in our method.
Evaluation Criteria
We evaluate our algorithm based on the size of the MFAST found. Larger MFASTs are preferable. When possible, we report the size of the optimal solution as well.
Test Environment
We ran our experiments on Linux servers equipped with dual AMD Opteron dual core processors running at 2.2 GHz and 3 GB of main memory to test the performance of our method.
Datasets
We test the performance and verify the results of our method on synthetic datasets and real datasets.

• Synthetic datasets. We built synthetic datasets in which we embedded an MFAST as described below. We characterize each synthetic dataset using five parameters:
1. Tree size (n).
2. Number of trees (m).
3. MFAST frequency (f).
4. MFAST size (n’).
5. Noise percentage (∊).
The first two parameters denote the size and number of trees in $\mathcal{T}$. MFAST frequency specifies the fraction of trees in $\mathcal{T}$ that contain an MFAST. MFAST size is the number of taxa in the embedded MFAST. The noise percentage is the percentage of taxa that are not a part of the embedded MFAST but are placed on the branches within the clade that contains the MFAST. We place all the other taxa on the branches outside this clade.

Given an instantiation of these parameters, we first created a tree that has n’ taxa. This tree serves as the MFAST. We then created m × f trees that contain this MFAST. We built each of these trees by inserting the remaining n − n’ taxa randomly into the MFAST. With probability ∊ we inserted each taxon within the clade that contains the MFAST. With probability 1 − ∊ we inserted it outside that clade. We then created m − (m × f) trees that do not contain the current MFAST. We did this simply by inserting all the taxa one by one at random locations.

• Real datasets. We use two empirical datasets to evaluate the performance of our heuristic. The datasets contain 200 bootstrap trees generated from phylogenetic analysis of the Gymnosperm [27] and Saxifragales (Burleigh, unpublished) plant clades. To make the bootstrap trees, we assembled super-matrices (matrices of concatenated gene alignments with partial taxon overlap) from gene sequence data available in GenBank. We performed a maximum likelihood bootstrap analysis on each super-matrix using RAxML v. 7.0.4 [28]. The Gymnosperm trees each contain 959 taxa, and the Saxifragales trees each contain 950 taxa.
Effects of number of input trees
In our first experiment, we analyze how the number of input trees in $\mathcal{T}$ affects the performance of our algorithm. For this purpose, we created 30 synthetic datasets. The size of the embedded MFAST in all the datasets was 15. Among these 30 datasets, 10 contained 50 trees, 10 contained 100 trees and 10 contained 200 trees. We set the noise percentage to 20% in all the datasets. The frequency of the embedded MFAST was 0.8. We set the number of taxa in all the trees in these datasets to 100.

We ran our algorithm on each of these datasets to find the size of the MFAST for γ = 0.7. Table 2 lists the average MFAST size we found for each of the dataset sizes before post processing (i.e., at the end of phase two) and after post processing (i.e., at the end of phase three). The results demonstrate that our method can identify an MFAST that is almost as big as the embedded one even without post processing, regardless of the number of trees in the dataset. Post processing improves the MFAST size slightly. On average, after post processing we always find an MFAST that is as large as or larger than the embedded one. An MFAST larger than 15 here implies that while randomly inserting the taxa that are not in the embedded MFAST, at least one of them was placed under the same clade at least a fraction γ of the time. More importantly, our method successfully located such taxa along with the rest of the MFAST.
Table 2 Evaluation of the effect of the number of trees in $\mathcal{T}$

| Number of trees | MFAST size before post processing | MFAST size after post processing |
|---|---|---|
| 50 | 14.5 | 16.0 |
| 100 | 15.3 | 15.8 |
| 200 | 14.4 | 15.4 |

The number of trees is set to 50, 100 and 200. For each number of trees we run our experiments on ten datasets. Each dataset contains trees with 100 taxa and an embedded MFAST of size 15. We report the average size of the MFAST obtained by our method across the ten datasets.
Effects of tree size
Our second experiment considers the impact of the number of taxa in the input trees contained in $\mathcal{T}$ on the success of our method. To carry out this test, we built datasets with varying tree sizes (i.e., n). In particular, we used n = 100, 250, 500 and 1000. For each value of n, we repeated the experiment 10 times by creating 10 datasets with the same properties. In all datasets, we set the number of trees to m = 100, the noise percentage to ∊ = 20%, the size of the embedded MFAST to 15% of n, and the MFAST frequency to 0.8.

Table 3 reports the average MFAST size found by our method for varying tree sizes. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets. Before going into a detailed discussion of the results, it is crucial to observe that our method could run to completion for datasets that have as many as 1000 taxa. When we tried to run Phylominer, it did not return any results for datasets that have more than 100 taxa. The results also demonstrate that our method could successfully identify the embedded MFAST in all the datasets regardless of the size of the input trees. In some datasets, the reported MFAST was slightly larger than the embedded one. This indicates that while randomly inserting the taxa that are not part of the embedded MFAST, it is possible that a few taxa were consistently placed under the same clade.
Table 3 Evaluation of the effect of the size of the trees in $\mathcal{T}$

| Number of taxa | Embedded MFAST size | Reported MFAST size before post processing | Reported MFAST size after post processing |
|---|---|---|---|
| 100 | 15 | 15.3 | 15.8 |
| 250 | 38 | 32.3 | 38.8 |
| 500 | 75 | 43.7 | 76.0 |
| 1000 | 150 | 69.8 | 151.0 |

The tree size is set to 100, 250, 500 and 1000. For each tree size we run our experiments on ten datasets. Each dataset contains 100 trees with an embedded MFAST of size 15% of the input tree size. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets.
The results also suggest that our method identifies a significant percentage of the taxa in the embedded MFAST after the second phase (i.e., before post-processing) when the tree size is small. As the tree size grows, it starts missing some taxa at this phase. It however recovers the missing taxa during the post-processing phase even for the largest tree size. This indicates that at the end of phase two our method could identify a backbone of the actual MFAST. The unidentified taxa at this phase are scattered throughout the clades in the input trees. Thus, there is no clade of size k + c that contains them with c contractions for small k or c. As evident from Table 3, this however does not prevent our method from recovering them. This is because the backbone reported at the end of phase two is large enough, and thus specific enough, to recover the missing taxa one by one in the last phase. This is a significant observation as it demonstrates that our method works well even with small values of k and c.
Effects of noise percentage
Recall that the noise percentage ∊ denotes the percentage of taxa that are added inside the clade that contains the MFAST. As ∊ increases, the pairs of taxa in the MFAST get farther away from each other in the tree that contains it. As a result, fewer taxa from the MFAST will be contained in small clades of size k + c. This raises the question of whether our method works well as ∊ increases and the MFAST taxa get scattered around in the trees that contain it.

In this experiment, we answer the question above and analyze the effect of the noise percentage on the success of our method. We create synthetic datasets with various ∊ values. In particular, we use ∊ = 20, 40 and 60%. We set the size of the embedded MFAST to n’ = 15, the tree size to n = 100, the number of trees to m = 100 and the MFAST frequency to f = 0.8. We repeat our experiment for each parameter setting 10 times by recreating the dataset randomly using the same parameters. We set the frequency cutoff to γ = 0.7. We report the average MFAST size found by our method in Table 4.
Table 4 Evaluation of the effect of the noise in the trees in $\mathcal{T}$

| Noise (%) | MFAST size before post processing | MFAST size after post processing |
|---|---|---|
| 20 | 15.3 | 15.8 |
| 40 | 13.6 | 15.0 |
| 60 | 12.7 | 15.0 |

The size of the embedded MFAST in all the experiments is 15. We list the average size of the MFAST found by our method before and after the post processing phase.
The results suggest that our method can identify the embedded MFAST successfully even when the noise percentage is very high. We observe that the size of the MFAST found by our method before post processing decreases slowly with increasing amount of noise. This is not surprising, as the taxa contained in the embedded MFAST get more spread out (and thus farther away from each other) in the trees in $\mathcal{T}$ with increasing noise. As a result, if there are taxa that are not part of any seed with the provided values of k and c, they will never be included in the computed MFAST at the end of phase two. We however observe that (i) only a small number of such taxa exist. For instance, even for the largest noise percentage (∊ = 60%), only 2.3 taxa (i.e., 15 − 12.7) are missing on average. (ii) The missing taxa are recovered during phase three. This is because the computed MFAST at the end of phase two is very large, and thus it is specific to the embedded MFAST.
Impact of seed creation
So far, in our experiments we consistently observed two major points for all the parameter settings (see Sections “Effects of number of input trees” to “Effects of noise percentage”): (i) Our method always finds a large subtree of the embedded MFAST after phase two. (ii) Our method always recovers the entire embedded MFAST after phase three. The second observation can be explained from the first one: the outcome of phase two is large enough to build the entire MFAST precisely. The first observation, however, indicates that the set of seeds generated in phase one contains a significant percentage of the taxa in the embedded MFAST. In this section, we take a closer look at this phenomenon and explain why this is the case even for small values of seed size k and contraction amount c, and large noise percentage ∊. To do that, we will compute the probability that a subset of the taxa of the embedded MFAST appears in at least one seed generated in phase one. In our computation, we will assume that the taxa can appear at any location of a given tree with the same probability. We discuss the implication of this assumption later in this section.

The number of rooted bifurcating trees for a given set of n taxa is
$$R(n) = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!}.$$
Consider a clade with k + c taxa. The number of trees with n taxa that contain this clade is R(n − (k + c) + 1), as the topology of the clade of size k + c is fixed. For a given subtree with k taxa, let us denote the number of clade topologies of size k + c that contain that subtree by NU(k, c). We can compute this function recursively as NU(k, 0) = 1 and, for c > 0,
$$\mathrm{NU}(k, c) = \mathrm{NU}(k, c-1) \times 2 \times (k + c - 2).$$
Let us denote one of these clades by U(k, c). Also, let us denote by P(n, k, c) the probability that the clade U(k, c) exists in a random tree topology that contains n taxa. Intuitively, P(n, k, c) is the probability that our method will extract a specific k taxa subtree from one n taxa tree after only c contractions. We can compute this probability as the ratio of the number of tree topologies that satisfy this constraint to the number of all possible tree topologies. We formulate this as
$$P(n, k, c) = \frac{\mathrm{NU}(k, c) \times R(n - (k + c) + 1)}{R(n)}.$$
Recall that it suffices for our algorithm to have a k taxa subtree of the MFAST in at least one tree in the given set of m trees. The probability that the clade U(k, c) exists in at least one of the m random tree topologies, each containing n taxa, is
$$P(n, k, c, m) = 1 - \bigl(1 - P(n, k, c)\bigr)^{m}.$$
Assume that the MFAST size in the given set of trees $\mathcal{T}$ is h. Let us denote the number of k taxa subtrees of the MFAST as NS(h, k). The probability that at least one of these subtrees will be found in at least one of the input trees is then
$$P(n, k, c, m, h) = 1 - \bigl(1 - P(n, k, c, m)\bigr)^{\mathrm{NS}(h, k)}.$$
A lower bound for NS(h, k) is h − k + 1, which can be obtained by picking a contiguous block of k taxa from the canonical Newick representation of the MFAST and considering all possible h − k + 1 starting locations. Notice that the larger the value of P(n, k, c, m, h), the higher the chances that our algorithm will construct some part of the MFAST. Similarly, the larger the value of NS(h, k), the higher the chances that our algorithm will construct some part of the MFAST.

Figure 6 plots the success probability (i.e., P(n, k, c, m, h)) of our method for varying parameter values. As the MFAST size increases, the success probability rapidly increases. This is because the number of alternative subtrees of the MFAST increases with increasing MFAST size. Thus, the chance of observing at least one increases as well. We observe that when the size of the MFAST is around 20% of the tree size, our success probability becomes almost 1 for all the parameters reported. As the number of contractions increases, the probability of success increases. This is because a large number of contractions increases the possibility of eliminating false positive taxa from clades. In other words, it helps glue the taxa that are normally scattered in the input trees back together by removing the remaining taxa among them. When c = 5, our success probability becomes almost one even for MFASTs that are as small as 4-6% of the tree size. As the number of trees increases, the success probability increases as well. This is because we have more alternative topologies with increasing number of trees. Thus, there are more chances to have a small clade that contains a part of the MFAST. Finally, it is worth noting that these results are computed based on the assumption that the trees in $\mathcal{T}$ are uniformly distributed among all possible topologies. In practice, we expect that these trees are constructed with the same or similar objectives (such as maximum parsimony or maximum likelihood). As a result, they will likely have a higher chance to contain large MFASTs. The results we expect in practice will thus be similar to or even better than the theoretical results in Figure 6.
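The formulas of this section can be evaluated numerically with the short Python sketch below. It is a direct transcription of R(n), NU(k, c) and the three probabilities under the stated uniform-topology assumption, using the lower bound h − k + 1 in place of NS(h, k). The function names are hypothetical, and the use of `log1p` and `expm1` is only a numerical convenience for very small probabilities.

```python
from fractions import Fraction
from math import expm1, factorial, log1p


def R(n):
    """Number of rooted bifurcating trees on n taxa: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        return 1  # convention for a single taxon
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def NU(k, c):
    """NU(k, 0) = 1 and NU(k, c) = NU(k, c-1) * 2 * (k + c - 2)."""
    value = 1
    for i in range(1, c + 1):
        value *= 2 * (k + i - 2)
    return value


def p_single(n, k, c):
    """P(n, k, c): the clade U(k, c) exists in one random n-taxon tree."""
    return float(Fraction(NU(k, c) * R(n - (k + c) + 1), R(n)))


def at_least_once(p, trials):
    """1 - (1 - p)^trials, computed stably for small p."""
    if p >= 1.0:
        return 1.0
    return -expm1(trials * log1p(-p))


def p_any_tree(n, k, c, m):
    """P(n, k, c, m): the clade exists in at least one of m random trees."""
    return at_least_once(p_single(n, k, c), m)


def p_success(n, k, c, m, h):
    """P(n, k, c, m, h) with NS(h, k) replaced by its lower bound h - k + 1."""
    return at_least_once(p_any_tree(n, k, c, m), h - k + 1)
```

Because the lower bound on NS(h, k) is used, the value returned by `p_success` is a conservative estimate of the success probability defined above.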
Figure 6 The probability of finding at least one seed which contains a part of an MFAST. The number of contractions c is set to 3, 4 and 5 and the corresponding seed size k is 5, 4 or 3. The x-axis shows the MFAST size in terms of the percentage of the number of taxa in the trees in $\mathcal{T}$. In (a), we set the total number of trees m = 500. In (b) we set m = 1000.

Overall, we conclude from this experiment that even small values of k and c suffice to capture a part of the MFAST in phase two. Therefore, although our algorithm’s complexity increases exponentially with k and c, we do not need to use large values for k and c. This enables our algorithm to scale to very large datasets with thousands of taxa and trees. These results explain the theory behind the practical results we observed in Sections “Effects of number of input trees” to “Effects of noise percentage”.
Evaluation of state of the art methods
So far, we have shown that our method could successfully find the MFASTs contained in sets of trees $\mathcal{T}$ for up to 1000 taxa and 200 trees (i.e., n = 1000 and m = 200). An obvious question is how well existing methods perform on the same datasets. Here, we answer this question for two existing programs, namely PAUP* (version 4.0b10) and Phylominer.

When we fix the number of trees and the number of taxa to 100, PAUP* was able to find the MAST for all datasets. As we grow the number of taxa to 250 or larger while keeping the number of trees at 100, PAUP* runs out of memory and fails to return any results. After reducing the number of trees to 50, PAUP* still runs out of memory and cannot report any results for more than 100 taxa.

The scalability problem of Phylominer is even more severe. Phylominer is able to compute the MFASTs on datasets with up to 20 taxa. However, as we increase the number of taxa further, its performance deteriorates quickly. When we set the number of taxa to 100, even with as few as 100 trees, Phylominer takes more than a week to report a result. Moreover, in our experiments, the maximum size of the subtrees it found on average contained fewer than 7 taxa, even though the size of the true MAST was 10.

Another interesting question about existing methods is whether the majority consensus rule can be used to find MFASTs. To evaluate this, we used the same three synthetic datasets used in Section “Effects of noise percentage”. Recall that each of these three datasets contains an MFAST of size 15 which is embedded in 80% of the trees. The datasets are created with 20%, 40% and 60% noise, indicating different levels of difficulty in recovering the embedded MFAST. We computed the 70% majority consensus tree. Notice that if the majority consensus rule can identify an MFAST, that would correspond to a bifurcating subtree topology in the consensus tree. In other words, a subtree is bifurcating in this experiment only if 70% or more of the input trees agree on the topology of that subtree. The resulting tree, however, was multifurcating for all three datasets. This means that the majority consensus rule could not recover even a smaller portion of the embedded tree, while our method was able to locate the entire MFAST successfully (see Table 4).

These results demonstrate that neither PAUP* nor Phylominer is well suited to finding agreement subtrees in larger datasets, and that our method scales better in terms of both the number of taxa and the number of trees. When PAUP* runs to completion, we observed that it reports the true results. Recall from previous experiments that our method always found the true results on the same datasets as well as on larger datasets. This suggests that our method has the potential to have an impact in large scale phylogenetic analysis when existing methods fail.
Empirical dataset experiments
To examine the performance of the MFAST method on real data, we performed experiments using 200 maximum likelihood bootstrap trees from a phylogenetic analysis of gymnosperms (959 taxa) and Saxifragales (950 taxa). Specifically, we evaluated how the performance of the MFAST algorithm was affected by the number of input trees and the size of the input trees.
Effects of number of input trees
We first examined the effect of input tree number on the size of MFAST. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 50 and 100 trees by randomly sampling from the original 200 trees without replacement. We compared the average size of the MFAST in the 50 and 100 tree data sets with the size of the MFAST in the original 200 tree data set. First, in all analyses, the post-processing step greatly increases the size of the MFAST, sometimes more than doubling it (Table 5). This increase is similar to the one observed in the 1000 taxon simulated data sets (Table 2), emphasizing again the importance of the post-processing step with large tree data sets. Although the sizes of the MFASTs were similar, they decreased slightly with the addition of more trees (Table 5). This may simply be a matter of observing more conflict with more trees.
Table 5
The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three)

                   Gymnosperms (MFAST size)      Saxifragales (MFAST size)
Number of trees    Before    After    Only       Before    After    Only
50                 78.5      129.8    99.5       64.7      122.0    84.1
100                68.4      119.2    83.1       55.4      112.8    74.7
200                76.0      118.0    84.0       40.0      105.0    75.0

The size of the MFAST found by running only the post processing step is also shown. We run our method on the entire dataset that contains 200 trees as well as randomly selected subsets of 50 and 100 trees. We repeated the 50 and 100 tree experiments 10 times by randomly selecting the trees from the entire dataset and reported the average value.
The large gap between the MFAST sizes before and after the post processing suggests that phase three is the main reason behind the success of our method, and thus, the costly seed combination phase (i.e., phase two) may be unnecessary. To answer whether this conjecture is correct, we ran a variant of our method by disabling the second phase; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table 5 (column “Only”). The results demonstrate that although phase three can grow a large FAST, phase two is essential to find the largest frequent agreement subtree. In other words, post processing finds the true MFAST only if a large portion of it is already found (which is the role served by phase two). In conclusion, phase three of our method cannot replace phase two; both phases are essential for the success of our method.
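The post-processing-only variant can be illustrated with a small sketch. It is not the authors' implementation: trees are assumed to be nested Python tuples with taxon names at the leaves, the helper names (`leaves`, `restrict`, `canonical`, `is_present`, `freq`, `grow`) are hypothetical, and the four toy trees are invented for illustration. Starting from a seed, `grow` adds one taxon at a time, placed as in some input tree that contains the current FAST, for as long as the frequency stays at or above the cutoff.

```python
def leaves(tree):
    """Set of taxon labels in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon not in `keep`, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def canonical(tree):
    """Child-order-independent form, so (a, b) and (b, a) compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    """True if `sub` can be obtained from `tree` by contractions alone."""
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


def freq(sub, trees):
    """Fraction of the input trees in which `sub` is present."""
    return sum(is_present(sub, t) for t in trees) / len(trees)


def grow(fast, trees, gamma):
    """Greedily add single taxa while the frequency stays >= gamma."""
    grown = True
    while grown:
        grown = False
        references = [t for t in trees if is_present(fast, t)]
        for ref in references:
            for taxon in sorted(leaves(ref) - leaves(fast)):
                candidate = restrict(ref, leaves(fast) | {taxon})
                if freq(candidate, trees) >= gamma:
                    fast, grown = candidate, True
                    break
            if grown:
                break
    return fast


# Hypothetical input: taxon E moves around, and one tree disagrees on A, B, C.
trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), ("C", "E")), "D"),
    (((("A", "E"), "B"), "C"), "D"),
    ((("A", "C"), "B"), ("D", "E")),
]
result = grow(("A", "B"), trees, gamma=0.75)
print(sorted(leaves(result)), freq(result, trees))
```

The loop stops only when no single taxon can be added under any reference tree, so the returned tree is a maximal FAST; whether it is also the largest one depends on the starting point, which is the effect the “Only” column measures.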
Effects of size of input tree
Next, we examined the effect of the number of leaves in the input trees on the size of MFASTs. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 200 input trees with 100, 250, and 500 taxa. To make each set, we randomly selected 100, 250, or 500 taxa, and we deleted all other taxa from the original sets of 200 trees. Thus, these sets of trees with 100, 250, or 500 taxa are subtrees of the original data sets. The size of the average MFAST increases with more taxa in the original trees (Table 6). However, interestingly, the average size of the MFASTs for the gymnosperm data set with 500 taxa is larger than the MFAST found from the original gymnosperm trees with all the taxa (Table 6). Since the MFASTs from the 500 taxon data sets should all be found within the full data set, this indicates that on the larger trees, our method may not always find the true (i.e., largest) MFAST. The full data sets may require a larger number of contractions to find the true MFASTs.
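The taxon subsampling step, and the argument that anything found in the reduced trees must also be present in the full trees, can be sketched as follows. This is an illustration under assumptions rather than the code used in the experiments: trees are nested Python tuples, the helper names and the three six-taxon trees are hypothetical, and the sample is drawn with Python's `random.Random`.

```python
import random


def leaves(tree):
    """Set of taxon labels in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Delete every taxon not in `keep`, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def canonical(tree):
    """Child-order-independent form of a tree."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    """True if `sub` can be obtained from `tree` by deleting taxa."""
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


def subsample_taxa(trees, size, rng):
    """Pick `size` taxa at random and delete all other taxa from every tree."""
    keep = set(rng.sample(sorted(leaves(trees[0])), size))
    return keep, [restrict(tree, keep) for tree in trees]


# Hypothetical full data set: three trees on the same six taxa.
full = [
    ((("a", "b"), "c"), (("d", "e"), "f")),
    ((("a", "c"), "b"), (("d", "f"), "e")),
    (("a", ("b", "c")), ("d", ("e", "f"))),
]
keep, reduced = subsample_taxa(full, 4, random.Random(0))

# Each reduced tree is a subtree of its original, so any agreement subtree of
# the reduced set is also an agreement subtree of the full set.
contained = all(is_present(small, big) for small, big in zip(reduced, full))
print(sorted(keep), contained)
```

Because of this containment, a heuristic result on a 500 taxon subset that is larger than the result on the full trees can only mean that the search on the full trees stopped short of the largest MFAST.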
Table 6
The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three) for different number of taxa

                    Gymnosperms (MFAST size)      Saxifragales (MFAST size)
Number of leaves    Before    After    Only       Before    After    Only
100                 41.2      56.1     43.5       43.5      50.7     38.5
250                 67.2      88.5     63.0       62.3      76.2     54.6
500                 91.6      123.0    74.9       52.0      86.7     62.9
All                 76.0      118.0    84.0       40.0      105.0    75.0

The size of the MFAST found by running only the post processing step is also shown. We run our method on the entire dataset that contains all the taxa (last row) as well as randomly selected taxa subsets of size 100, 250 and 500. We repeated the 100, 250 and 500 taxa experiments 10 times by randomly selecting the taxa from the entire dataset and reported the average value.
Similar to the experiments in Section “Effects of number of input trees”, we investigated the gap between the MFAST sizes before and after the post processing step. We ran a variant of our method by disabling the second phase; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table 6. The results parallel those in Table 5. Phase three can grow a large frequent agreement subtree, but not quite as big as the one obtained when both phases two and three are executed.
Effects of sample size
In our final experiment, we evaluated the effect of the maximum time cutoff we described in Section “In-order combination” on the accuracy of our method. Recall that this cutoff limits the number of initial seeds tried in our algorithm by randomly sampling a small percentage of the seeds. It only uses the sampled seeds as possible initial seeds. However, it uses the entire set of seeds while growing the MFAST determined by the initial seed. As each initial seed takes roughly the same amount of time to grow into an MFAST, using x% of the seeds as the sample set reduces the total running time of our method to roughly x% of that of our original implementation.
We carried out this experiment as follows. For both the gymnosperm and Saxifragales trees, we ran 10 sets of experiments for each sampling percentage of 2, 5, 10, 25, 50 and 100%. Thus, in total we ran 60 (6 × 10) experiments for each dataset. Table 7 presents the average MFAST sizes for varying sample sizes. The results demonstrate that even for very small sampling percentages, our method finds an MFAST that is almost as big as the MFAST found by using the entire set of seeds (i.e., 100% sampling percentage). This is very promising as it demonstrates that the running time cost of our method can easily be cut to a small fraction by sampling the starting seeds. The rationale behind this is that the MFAST contains many seeds. Starting from any of these seeds, our algorithm has the potential to lead to that MFAST. The probability that at least one of these seeds appears in the sample set is large, particularly for large MFASTs.
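The rationale in the last sentences can be made concrete with a short calculation. The sketch below assumes the sample is drawn uniformly without replacement, so the number of sampled seeds that belong to the MFAST follows a hypergeometric distribution; the function name and the seed counts (10,000 seeds in total, 200 of them contained in the MFAST) are hypothetical and are not taken from the experiments.

```python
from math import comb


def prob_sample_hits_mfast(total_seeds, mfast_seeds, sample_size):
    """P(at least one seed of the MFAST is in the sample), without replacement."""
    # Number of samples that avoid every seed of the MFAST.
    misses = comb(total_seeds - mfast_seeds, sample_size)
    return 1 - misses / comb(total_seeds, sample_size)


# Hypothetical numbers, for illustration only.
TOTAL_SEEDS, MFAST_SEEDS = 10_000, 200
for percentage in (2, 5, 10, 25, 50, 100):
    sample_size = TOTAL_SEEDS * percentage // 100
    p = prob_sample_hits_mfast(TOTAL_SEEDS, MFAST_SEEDS, sample_size)
    print(f"{percentage:3d}% sample: P(hit) = {p:.6f}")
```

With a sampling fraction p and s seeds inside the MFAST, the probability is close to 1 - (1 - p)^s, which approaches 1 quickly as s grows. This matches the observation that larger MFASTs are easier to reach from a small sample.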
Table 7
The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets for different random subsamples of the total number of taxa

                       MFAST size
Sampling percentage    Gymnosperms    Saxifragales
2                      85.9           74.3
5                      87.5           75.6
10                     88.4           75.2
25                     87.5           75.5
50                     88.5           76.2
100                    88.5           76.2

We run our method by randomly picking 2%, 5%, 10%, 25%, 50%, 100% of the seeds found in phase one for combination in phase two.
Conclusion
In this paper, we present a heuristic for finding maximum frequent agreement subtrees (MFASTs). The heuristic uses a multi-step approach which first identifies small candidate subtrees (called seeds) from the set of input trees, combines the seeds to build larger candidate MFASTs, and then performs a post-processing step to increase the size of the candidate MFASTs. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Although this heuristic is not guaranteed to find all MFASTs, it performs well using both simulated and empirical data sets. Its performance is relatively robust to the number of input trees and the size of the input trees, although with the larger data sets, the post processing step becomes more important. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
Although the method we developed is described and implemented for rooted and bifurcating trees, it can be trivially extended to multifurcating as well as unrooted trees. The central technical difference in the case of unrooted trees would be the definition of clade (see Definition 1), as the definition requires a root. A clade in an unrooted tree encompasses two sets of nodes: (i) a given set of taxa X, and (ii) the set of all internal nodes that are on a path between two taxa in X on the phylogenetic tree. We expect that this will increase the number of seeds substantially and thus make the problem more computationally intensive. The amount of increase will depend on the tree topology. The theoretical worst case happens when all the taxa are connected to a single internal node (i.e., star topology). In that case any subset of taxa can lead to a potential seed as long as the subset size is equal to the seed size allowed. One possible way to overcome this problem would be to exploit randomization or graph coloring strategies and avoid enumerating the majority of the possible seeds.
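The size of the worst case can be compared with the rooted case by simple counting. In the sketch below, the star-topology count is the number of k-subsets of n taxa, and the rooted count uses the bound from the seed generation phase (at most n/(k + c) disjoint clades of size k + c per tree, each yielding C(k + c, c) potential seeds). The function names and the example values n = 1000, k = 5 and c = 1 are hypothetical.

```python
from math import comb


def star_seed_candidates(n, k):
    """Potential seeds in a star topology: every k-subset of the n taxa."""
    return comb(n, k)


def rooted_seed_bound(n, k, c):
    """Upper bound on potential seeds from one rooted tree with n taxa."""
    disjoint_clades = n // (k + c)  # clades of size k + c cannot share taxa
    return disjoint_clades * comb(k + c, c)


n, k, c = 1000, 5, 1  # illustrative values only
print("star topology:", star_seed_candidates(n, k))
print("rooted tree bound:", rooted_seed_bound(n, k, c))
```

The gap of many orders of magnitude between the two counts is what motivates avoiding exhaustive enumeration in the unrooted case.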
Abbreviations
MAST: Maximum agreement subtree; FAST: Frequent agreement subtree; MFAST: Maximum frequent agreement subtree.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AR participated in algorithm development, implementation, experimental evaluation and writing of the paper. TK participated in algorithm development, experiment design and writing of the paper. GB participated in experiment design, dataset collection and writing of the paper. All authors read and approved the final manuscript.
Acknowledgments
This work was supported partially by the National Science Foundation (grants CCF-0829867 and IIS-0845439).
References
1. Goloboff PA, Catalano SA, Mirande JM, Szumik CA, Arias JS, Källersjö M, Farris JS: Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics 2009, 25(3):211-230. doi:10.1111/j.1096-0031.2009.00255.x
2. Price MN, Dehal PS, Arkin AP: Fasttree 2 approximately maximum-likelihood trees for large alignments. PLoS ONE 2010, 5(3):e9490. doi:10.1371/journal.pone.0009490 (PMCID 2835736, PMID 20224823)
3. Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ: Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany 2011, 98:404-414. doi:10.3732/ajb.1000481 (PMID 21613134)
4. Felsenstein J: Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evolution 1985, 39(4):783-791. doi:10.2307/2408678
5. Farris JS, Albert VA, Källersjö M, Lipscomb D, Kluge AG: Parsimony jackknifing outperforms neighbor-joining. Cladistics 1996, 12(2):99-124. doi:10.1111/j.1096-0031.1996.tb00196.x
6. Huelsenbeck JP, Rannala B, Masly JP: Accommodating phylogenetic uncertainty in evolutionary studies. Science 2000, 288(5475):2349-2350. doi:10.1126/science.288.5475.2349
7. Bryant D: A classification of consensus methods for phylogenetics. In Bioconsensus (Piscataway, NJ, 2000/2001), volume 61 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci. Amer. Math. Soc.; 2003:163-183.
8. Finden CR, Gordon AD: Obtaining common pruned trees. J Classification 1985, 2:255-276. doi:10.1007/BF01908078
9. Amir A, Keselman D: Maximum agreement subtree in a set of evolutionary trees: Metrics and efficient algorithms. SIAM J Comput 1997, 26(6):1656-1669. doi:10.1137/S0097539794269461
10. Kubicka E, Kubicki G, McMorris FR: An algorithm to find agreement subtrees. J Classification 1995, 12(1):91-99. doi:10.1007/BF01202269
11. Farach M, Przytycka TM, Thorup M: On the agreement of many trees. Inf Process Lett 1995, 55(6):297-301. doi:10.1016/0020-0190(95)00110-X
12. Bryant D: Building trees, hunting for trees and comparing trees. PhD thesis. Dept. Mathematics, University of Canterbury; 1997.
13. Cole R, Farach-Colton M, Hariharan R, Przytycka TM, Thorup M: An o(n log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J Comput 2000, 30(5):1385-1404. doi:10.1137/S0097539796313477
14. Lee CM, Hung LJ, Chang MS, Shen CB, Tang CY: An improved algorithm for the maximum agreement subtree problem. Inf Process Lett 2005, 94(5):211-216. doi:10.1016/j.ipl.2005.02.005
15. Berry V, Nicolas F: Improved parameterized complexity of the maximum agreement subtree and maximum compatible tree problems. IEEE/ACM Trans Comput Biology Bioinform 2006, 3(3):289-302. doi:10.1109/TCBB.2006.39
16. Guillemot S, Nicolas F: Solving the maximum agreement subtree and the maximum compatible tree problems on many bounded degree trees. In Proceedings of the 17th Annual conference on Combinatorial Pattern Matching (CPM’06). Edited by Lewenstein M, Valiente G. Berlin, Heidelberg: Springer-Verlag; 165-176. doi:10.1007/11780441_16
17. Guillemot S, Nicolas F, Berry V, Paul C: On the approximability of the maximum agreement subtree and maximum compatible tree problems. Discrete Applied Mathematics 2009, 157(7):1555-1570. doi:10.1016/j.dam.2008.06.007
18. Chi Y, Xia Y, Yang Y, Muntz: Correction to “mining closed and maximal frequent subtrees from databases of labeled rooted trees”. IEEE Trans Knowl Data Eng 2005, 17(12):1737.
19. Zhang S, Wang JTL: Mining frequent agreement subtrees in phylogenetic databases. In Proceedings of the 6th SIAM International Conference on Data Mining (SDM 2006). Edited by Ghosh J, Lambert D, Skillicorn DB, Srivastava J. Maryland: Bethesda; April 2006:222-233.
20. Zhang S, Wang JTL: Discovering frequent agreement subtrees from phylogenetic data. IEEE Trans Knowl Data Eng 2008, 20(1):68-82.
21. Cranston KA, Rannala B: Summarizing a Posterior Distribution of Trees Using Agreement Subtrees. Syst Biol 2007, 56(4):578-590. doi:10.1080/10635150701485091
22. Pattengale ND, Swenson KM, Moret BME: Uncovering hidden phylogenetic consensus. In Proc. 6th Int’l Symp. Bioinformatics Research & Appls. ISBRA’10, in Lecture Notes in Computer Science. Edited by Borodovsky M, Gogarten JP, Przytycka TM, Rajasekaran S. Springer; 2010:128-139.
23. Pattengale ND, Aberer AJ, Swenson KM, Stamatakis A, Moret BME: Uncovering hidden phylogenetic consensus in large data sets. IEEE/ACM Trans Comput Biology Bioinform 2011, 8(4):902-911.
24. Aberer AJ, Stamatakis A: A simple and accurate method for rogue taxon identification. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM ’11). IEEE Computer Society, Washington, DC, USA; 118-122. doi:10.1109/BIBM.2011.70
25. Junier T, Zdobnov EM: The newick utilities: high-throughput phylogenetic tree processing in the unix shell. Bioinformatics 2010, 26(13):1669-1670. doi:10.1093/bioinformatics/btq243 (PMCID 2887050, PMID 20472542)
26. Swofford DL: PAUP* Phylogenetic analysis using parsimony (and other methods). Version 4.0 beta 10. Sunderland, Massachusetts: Sinauer Assoc; 2002.
27. Burleigh JG, Barbazuk WB, Davis JM, Morse AM, Soltis PS: Exploring diversification and genome size evolution in extant gymnosperms through phylogenetic synthesis. J Botany 2012, 2012:6. Article ID 292857. doi:10.1155/2012/292857
28. Stamatakis A: Raxml-vi-hpc: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 2006, 22(21):2688-2690. doi:10.1093/bioinformatics/btl446 (PMID 16928733)






Ramu et al. BMC Bioinformatics 2012, 13:256
http://www.biomedcentral.com/1471-2105/13/256
RESEARCH ARTICLE Open Access
A scalable method for identifying frequent subtrees in sets of large phylogenetic trees
Avinash Ramu1, Tamer Kahveci2* and J Gordon Burleigh3
Abstract
Background: We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.
Results: We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.
Conclusions: Although this heuristic does not guarantee to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
Keywords: Phylogenetic trees, Frequent subtree
*Correspondence: tamer@cise.ufl.edu
2 Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Full list of author information is available at the end of the article
© 2012 Ramu et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Phylogenetic trees represent the evolutionary relationships of organisms. While recent advances in genomic sequencing technology and computational methods have enabled construction of extremely large phylogenetic trees (e.g., [1-3]), assessing the support for phylogenetic hypotheses, and ultimately identifying well-supported relationships, remains a major challenge in phylogenetics. Support for a tree often is determined by methods such as nonparametric bootstrapping [4], jackknifing [5], or Bayesian MCMC sampling (e.g., [6]), which generate a collection of trees with identical taxa representing the range of possible phylogenetic relationships. These trees can be summarized in a consensus tree (see [7]). Consensus methods can highlight support for specific nodes in a tree, but they also may obscure highly supported subtrees. For example, in Figure 1, the subtree containing taxa A, B, C, and D is present in all five input trees. However, due to the uncertain placement of taxon E, the majority rule consensus tree implies that the clades in the tree have relatively low (60%) support.
Alternate approaches have been proposed to reveal highly supported subtrees. The maximum agreement subtree (MAST) problem seeks the largest subtree that is present in all members of a given collection of trees [8]. For example, in Figure 1 the MAST includes taxa A, B, C, and D. Finding the MAST is an NP-hard problem [9], although efficient algorithms exist to compute the MAST in some cases (e.g., [9-17]). In practice, since any difference in any single tree will reduce the size of the MAST, the MAST is often quite small, limiting its usefulness.
A less restrictive problem is to find frequent agreement subtrees (FAST), or subtrees that are found in many, but not necessarily all, of the input trees (see [18]). In this problem, a subtree is declared as frequent if it is in at least as many trees as a user supplied frequency threshold. Several algorithmic approaches have been suggested to identify FASTs, and specifically the maximum FASTs (MFASTs), or FASTs that contain the largest number of taxa. A variant of this problem seeks the maximal FASTs, i.e., FASTs that are not contained in any other FASTs. Notice that an MFAST is a maximal FAST, however, the inverse is not necessarily true. Zhang and Wang defined algorithms, implemented in Phylominer, to identify FASTs from a collection of phylogenetic trees [19,20]. These algorithms are guaranteed to find all FASTs but they may be prohibitively slow for datasets larger than 20 taxa. Cranston and Rannala implemented Metropolis-Hastings and Threshold Accepting searches to identify large FASTs from a Bayesian posterior distribution of phylogenetic trees [21]. This approach can handle thousands of input trees but it may not be feasible if the trees have more than 100 taxa [21].
Figure 1 (a) A collection of five input trees. The same subtree with taxa A, B, C, and D is present in all input trees, and only the position of taxa E changes. (b) The majority rule consensus and maximum agreement subtrees of the 5 input trees in Figure 1a.
Another approach to reveal highly supported subtrees from a collection of trees is to identify and remove rogue taxa, or taxa whose position in the input trees is least consistent. Recently, several methods have been developed that can identify and remove rogue taxa from collections of trees with thousands of taxa [22-24]. However, unlike MAST or FAST approaches, they do not provide guarantees about the support for the remaining taxa.
In this paper, we describe a heuristic approach for identifying MFASTs in collections of trees. Unlike previous methods, our method easily scales to datasets with over a thousand taxa and hundreds of trees. Towards this goal, we develop a heuristic solution that works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the final phase, it performs a post processing step. This step ensures that the size (i.e., number of taxa) of the FAST found cannot be increased further by adding a new taxon without reducing its frequency below a user supplied frequency threshold. We demonstrate that this heuristic can easily handle data sets with 1000 taxa. We test the effectiveness of these approaches on simulated data sets and then demonstrate its performance on large, empirical data sets. Although our heuristic does not guarantee to find all MFASTs or the largest MFAST in theory, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on the empirical data sets. Its performance is robust with respect to the number of input trees and the size of the input trees.
Methods
In this section we describe our method that aims to find Maximum Frequent Agreement SubTrees (MFASTs) in a given set of m phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$. Our method follows from the observation that an MFAST is present in a large number of trees in $\mathcal{T}$. The method builds MFASTs bottom up from small subtrees of taxa in the trees in $\mathcal{T}$. Briefly, it works in three phases.
• Phase 1. Seed generation (Section “Phase one: Seed generation”). In the first phase, we identify small subtrees from the input trees that have a potential to be a part of an MFAST. We call each such subtree a seed.
• Phase 2. Seed combination (Section “Phase two: Seed combination”). In the second phase, we construct an initial FAST by combining the seeds found in the first phase.
• Phase 3. Post processing (Section “Phase three: Post-processing”). In the third phase, we grow the FAST further to obtain the maximal FAST that contains it by individually considering the taxa which are not already in the FAST. We report the resulting maximal FAST as a possible MFAST.
First, we present the basic definitions needed for this paper in Section “Preliminaries and notation”. We then discuss each of the three phases above in detail.


Preliminaries and notation
In this section, we present the key definitions and notations needed to understand the rest of the paper. We describe our method using rooted and bifurcating phylogenetic trees. However, our method and definitions can easily be applied to unrooted or multifurcating trees with minor or no modifications. Also, we assume that all the taxa are placed at the leaf level nodes of the phylogenetic tree, and all the internal nodes are inferred ancestors. Figure 2(a) shows a sample phylogenetic tree built on five taxa. We define the size of a tree as the number of taxa in that tree. We start by defining key terms.
Definition 1 (Clade) Let $T$ be a phylogenetic tree. Given an internal node of $T$, we define the set of all nodes and edges of $T$ contained under that node as the clade rooted at that node.
Each internal node of a phylogenetic tree corresponds to a clade of that tree. Figure 2(b) depicts the clade of the tree in Figure 2(a) rooted at $x_1$.
Definition 2 (Contraction) Let $T$ be a phylogenetic tree with $n$ taxa. The contraction operation transforms $T$ into a tree with $n - 1$ taxa by removing a given taxon in $T$ along with the edge that connects that taxon to $T$.
The contraction operation can extract the clades of a tree by removing all the taxa that are not a part of that clade. It can also extract parts of the tree that are not necessarily clades. We use the term subtree to denote a tree that is obtained by applying contractions to arbitrary set of taxa in a given tree. Formal definition is as follows.
Definition 3 (Subtree) Let $T$ and $T'$ be two phylogenetic trees. We say that $T'$ is a subtree of $T$ if $T$ can be transformed into $T'$ by applying a series of contractions on $T$.
If a tree $T'$ is a subtree of another tree $T$, we say that $T'$ is present in $T$. Notice that a clade is always a subtree, but the inverse is not true all the time. Figures 2(b) and 2(c) illustrate two subtrees of the tree in Figure 2(a). Let us denote the number of combinations of $k$ taxa from a set of $n$ taxa with $\binom{n}{k}$. In general, if a tree has $n$ taxa, then that tree contains $\binom{n}{k}$ subtrees with $k$ taxa. As a consequence, that tree contains $2^n - 1$ subtrees of any size including itself.
Figure 2 (a) A rooted, bifurcating phylogenetic tree $T$ built on five taxa labeled with a, b, c, d and e. The internal nodes are shown with $x_0$, $x_1$, $x_2$ and $x_3$. (b) A clade of $T$ rooted at $x_1$. (b) and (c) Two subtrees of $T$ by contracting the taxa sets {d, e} and {a, e}.
Definition 4 (Frequency) Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of $m$ phylogenetic trees and $T'$ be a phylogenetic tree. Let us denote the number of trees in $\mathcal{T}$ at which $T'$ is present with the variable $m'$. We define the frequency of $T'$ in $\mathcal{T}$ as $freq(T', \mathcal{T}) = \frac{m'}{m}$.
Definition 5 (FAST) Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of $m$ phylogenetic trees and $T'$ be a phylogenetic tree. Let $\gamma$ be a number in [0, 1] interval that denotes frequency cutoff. We say that $T'$ is a Frequent Agreement SubTree (FAST) of $\mathcal{T}$ if its frequency in $\mathcal{T}$ is at least $\gamma$ (i.e., $freq(T', \mathcal{T}) \geq \gamma$).
We say that a FAST is maximal if there is no other FAST that contains all the taxa in that FAST. Clearly, larger FASTs indicate biologically more relevant consensus patterns. The following definition summarizes this.
Definition 6 (MFAST) Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of $m$ phylogenetic trees. Let $\gamma$ be a number in [0, 1] interval that denotes frequency cutoff. A FAST $T'$ of $\mathcal{T}$ is a Maximum Frequent Agreement SubTree (MFAST) of $\mathcal{T}$ if there is no other FAST $T''$ of $\mathcal{T}$ that has a larger size than $T'$.
Formally, given a set of phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ and a frequency cutoff, $\gamma$, we would like to find the MFASTs in $\mathcal{T}$ in this paper. We develop an algorithm that aims to solve this problem. Table 1 lists the variables used throughout the rest of this paper.
Table 1
Commonly used variables and functions in this paper

$\mathcal{T}$               A set of phylogenetic trees
$T_i$                       i-th tree
$m$                         Number of trees in $\mathcal{T}$
$n$                         Number of taxa in each input tree
$a_i$                       i-th taxa
$freq(T', \mathcal{T})$     Frequency of the subtree $T'$ in $\mathcal{T}$
$\gamma$                    Frequency cutoff
$S_i$                       i-th seed (each seed is a subtree of a tree in $\mathcal{T}$)
$k$                         Size of a seed
$c$                         Number of contractions used to create a seed
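Definitions 2 to 5 translate almost directly into code. The sketch below is an illustration under assumptions, not the authors' implementation: a rooted tree is represented as a nested Python tuple with taxon names at the leaves, the function names are hypothetical, and the five input trees are invented in the spirit of Figure 1 (the subtree on A, B, C and D is shared, and only taxon E moves).

```python
def leaves(tree):
    """Set of taxon labels in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Apply contractions to every taxon outside `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    # An internal node left with a single child is merged into that child.
    return kids[0] if len(kids) == 1 else tuple(kids)


def contract(tree, taxon):
    """Definition 2: remove one taxon and the edge that attaches it."""
    return restrict(tree, leaves(tree) - {taxon})


def canonical(tree):
    """Child-order-independent form, so (a, b) and (b, a) compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    """Definition 3: `sub` is reachable from `tree` by contractions."""
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


def freq(sub, trees):
    """Definition 4: m' / m."""
    return sum(is_present(sub, t) for t in trees) / len(trees)


def is_fast(sub, trees, gamma):
    """Definition 5: frequency at least the cutoff gamma."""
    return freq(sub, trees) >= gamma


# Hypothetical input trees: the position of E differs in every tree.
trees = [
    (((("A", "B"), "C"), "D"), "E"),
    (((("A", "E"), "B"), "C"), "D"),
    ((("A", ("B", "E")), "C"), "D"),
    ((("A", "B"), ("C", "E")), "D"),
    ((("A", "B"), "C"), ("D", "E")),
]
core = ((("A", "B"), "C"), "D")
print(freq(core, trees), freq(trees[0], trees))
```

Testing presence by restricting a tree to the taxa of the candidate subtree avoids enumerating contraction sequences: for rooted trees, the order in which taxa are contracted does not change the result.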


Phase one: Seed generation
The first phase extracts small subtrees from the given set of trees. From these subtrees we extract the basic building blocks which are used to construct MFASTs. We call these building blocks seeds. Conceptually each seed is a phylogenetic tree that contains a small subset of the taxa that make up the trees in $\mathcal{T}$. We characterize each seed with three features that are listed below. We elaborate on each feature later in this section.
1. Seed size ($k$) is the number of taxa in the seed.
2. Number of contractions ($c$) is the number of taxa we prune from a clade taken from an input tree in order to extract the seed.
3. Frequency ($f$) is the fraction of input trees in which the seed is present.
We explain the seed features with the help of Figures 3 and 4. The first two characteristics explain how a seed can be found in one of the trees in $\mathcal{T}$. They indicate that there is a clade of a tree in $\mathcal{T}$ such that this clade contains $k + c$ taxa and it can be transformed into that seed after $c$ contractions from that clade. For instance in Figure 3, when $k = 2$ and $c = 0$, only seed $S_1$ can be extracted from $T_1$ by choosing the clade rooted at $x_2$. When $k = 2$ and $c = 1$, seeds $S_1$, $S_2$ and $S_3$ can be obtained using one contraction ($a_3$, $a_2$ and $a_1$ respectively) from the clade rooted at $x_1$.
Figure 3 $T_1$ is an input tree built on four taxa $a_1$, $a_2$, $a_3$ and $a_4$. The internal nodes of $T_1$ are labeled as $x_0$, $x_1$ and $x_2$. $S_1$ is the only seed obtained from $T_1$ when $k = 2$ and $c = 0$. That is, $S_1$ is identical to the clade rooted at $x_2$. $S_1$, $S_2$ and $S_3$ are the seeds extracted from $T_1$ when $k = 2$ and $c = 1$. They are all extracted from the clade rooted at $x_1$ by contracting $a_3$, $a_2$ and $a_1$ respectively.
The last feature denotes the number of trees in $\mathcal{T}$ in which the seed is present. For example in Figure 4, there are nine seeds $S_1, S_2, \ldots, S_9$ extracted from the three input trees using only one contraction. Among these, the frequency of $S_1$ is 1 as it is present in all the trees. Frequency of $S_2$ is about 0.67 for it is present in only two out of three trees ($T_1$ and $T_2$). The frequency of the rest of the seeds is only about 0.33. Recall that, by definition, an MFAST is present in at least a fraction $\gamma$ of the trees in $\mathcal{T}$. Therefore, we consider only the seeds whose frequency values are equal to or greater than this number (i.e., $f \geq \gamma$).
Given the values of $k$, $c$ and $\gamma$, we extract all the seeds which possess the desired feature values from the set of input trees as follows. In the newick string representation of a tree, a pair of matching parentheses corresponds to an internal node in the tree. The number of taxa in the clade rooted at this internal node is given by the number of labels between the two matching parentheses. Following from this observation, we scan the newick string of each tree one by one. For each such tree, we identify the clades which have $k + c$ taxa. Notice that, if a tree contains $n$ taxa, then it contains at most $\frac{n}{k + c}$ clades of size $k + c$ as no two such clades can contain common taxa. We then extract all combinations of $k$ taxa from each of these clades by contracting the remaining $c$ taxa. The number of ways this can be done is $\binom{k + c}{c}$. Notice that all the small trees extracted this way possess the first two characteristics explained above. At this point, we however do not know their frequencies. Therefore, we call them potential seeds. It is worth mentioning that the same seed might be extracted from different trees. As we extract a new potential seed, before storing it in the list of potential seeds, we check if it is already present there. We include it in the potential seed list only if it does not exist there yet. Otherwise, we ignore it. This way, we maintain only one copy of each seed.
Once we build our potential seed list for all the trees in $\mathcal{T}$, we go over them one by one and count their frequency in $\mathcal{T}$ as the fraction of trees that contain them. We filter all the potential seeds whose frequencies are less than the frequency cutoff. We keep the remaining ones as the list of seeds along with the frequency of each seed.
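The seed generation steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: it works on nested Python tuples instead of scanning newick strings, all function names are hypothetical, and the three four-taxon trees are invented so that they reproduce the counts reported for Figure 4 (nine potential seeds, with frequencies 1.0, 0.67 and 0.33).

```python
from itertools import combinations


def leaves(tree):
    """Set of taxon labels in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon not in `keep`, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def canonical(tree):
    """Child-order-independent, hashable form of a tree."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def freq(sub, trees):
    """Fraction of trees whose restriction to the taxa of `sub` equals `sub`."""
    taxa, target = leaves(sub), canonical(sub)
    hits = sum(taxa <= leaves(t) and canonical(restrict(t, taxa)) == target
               for t in trees)
    return hits / len(trees)


def clades(tree):
    """Yield the clade rooted at every internal node of `tree`."""
    if not isinstance(tree, str):
        yield tree
        for child in tree:
            yield from clades(child)


def potential_seeds(trees, k, c):
    """Contract c taxa, in every possible way, from each clade with k + c taxa."""
    found = set()  # a set keeps a single copy of each potential seed
    for tree in trees:
        for clade in clades(tree):
            taxa = leaves(clade)
            if len(taxa) == k + c:
                for keep in combinations(sorted(taxa), k):
                    found.add(canonical(restrict(clade, set(keep))))
    return found


def frequent_seeds(trees, k, c, gamma):
    """Keep the potential seeds whose frequency is at least gamma."""
    scored = {seed: freq(seed, trees) for seed in potential_seeds(trees, k, c)}
    return {seed: f for seed, f in scored.items() if f >= gamma}


# Hypothetical trees chosen to match the seed counts described for Figure 4.
T1 = ((("a1", "a2"), "a3"), "a4")
T2 = (("a1", "a2"), ("a3", "a4"))
T3 = ((("a1", "a4"), "a2"), "a3")
trees = [T1, T2, T3]
candidates = potential_seeds(trees, k=3, c=1)
seeds = frequent_seeds(trees, k=3, c=1, gamma=0.6)
print(len(candidates), len(seeds))
```

Storing each seed in a canonical, hashable form makes the duplicate check a set lookup instead of a scan of the potential seed list.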
InFigure4,considerthetree T1thathasfourtaxa.For k = 3and c = 1,thereisonlyonecladeofsize k + c = 4 whichisthetree T1itself.Weextractfourpotentialseeds, eachhavingthreeleavesfromthistree.Thepotential seedsinthis“gurearegivenby S1, S2, S5and S7which weextractbycontracting a4, a3, a2and a1respectively from T1.Phasetwo:SeedcombinationAttheendofthe“rstphase,weobtainasetoffrequent seedsfromtheinputtrees.NoticethateachseedisaFAST aseachseedispresentinsucientnumberoftreesspeci“edby .Theseseedsarethebasicbuildingblocksofour method.Inthesecondphaseofourmethod,wecombine subsetsoftheseseedstoconstructlargerFASTs. We“rstde“newhatitmeanstocombinetwoseeds.In ordertocombinetwoseeds,itisanecessarycondition

PAGE 5

Ramu etal.BMCBioinformatics 2012, 13 :256 Page5of15 http://www.biomedcentral.com/1471-2105/13/256 1a a2a3a4 1a a4 1a2aa4 1aa2a3Sa a3 4SSa aSa a4a a3 4a2S S1a a3a4 1a a a2S S2S13 4 3 23 6 5 4 9 8 7a2a4a3T1 1a1aa2a3a4 2T1aa2a4a3T3 Figure4 Thesetofinputtrees T1, T2, T3andthesetofallninepotentialseeds S1, S2 S9whentheseedcharacteristicsaresetto k = 3 and c = 1 Allthepotentialseedshavethreetaxaask=3.Weneedonecontractionfromtheinputtreetoobtaineachseed. S1hasfrequency 1.0asitispresentin T1, T2and T3.Seed S2hasfrequency 0.67asitispresentin T1and T2.Remainingseedshavefrequency 0.33aseachappears inonlyoneofthethreetrees.thatbothseedsarepresentinatleastonecommontree T inT.Wecallsuchatree T asthe referencetree .We combinetwoseedswiththeguidanceofareferencetree. Let S1and S2betwoseedsandlet T betheirreference tree.Let L1, L2and L bethesetoftaxain S1, S2and T respectively.Combining S1and S2resultsinthetreethat isequivalenttotheoneobtainedbycontractingthetaxa in L Š ( L1 L2) from T .Forsimplicity,wewilldenotethe combineoperationusing T asthereferencenetworkwith the Tsymbol.Forinstancewedenotecombining S1and S2with T beingthereferencetreeas S1TS2.Tosimplify ournotation,whenevertheidentityofthereferencetree isirrelevant,wewillusethesymbol insteadof T. Figure5demonstrateshowtwoseeds S1and S2arecombinedwiththehelpofthereferencetree T .Inthis“gure, both S1and S2aresubtreesof T .Thus,itispossibleto use T asthereferencetree.Wehave L1={ a1, a3, a4} L2={ a1, a2, a5, a7} .Thus,webuild C = S1TS2by contractingthetaxain L Š ( L1 L2) ={ a6, a8} from T Sofar,wehaveexplainedhowtocombinetwoseeds S1and S2usingareferencetree.Itispossiblethatmanytrees inThavebothseedspresentinthem.Thus,onequestioniswhichofthesetreesshouldweuseasthereference treetocombinethetwoseeds?Thebriefansweristhat allsuchtreesneedtobeconsidered.However,wemake severalobservationsthathelpsusavoidcombining S1and S2usingeachsuchreferencetreeonebyoneexhaustivelywithoutignoringanyofsuchtrees.Weexplain themnext. 
Consider two trees $T_1$ and $T_2$ from $\mathcal{T}$ in which both seeds are present. There are two cases for $T_1$ and $T_2$.

• CASE 1: $S_1 \oplus_{T_1} S_2 = S_1 \oplus_{T_2} S_2$. In this case, it does not matter whether we use $T_1$ or $T_2$ as the reference tree. They will both lead to the same combined subtree. Thus, we use only one.
• CASE 2: $S_1 \oplus_{T_1} S_2 \neq S_1 \oplus_{T_2} S_2$. In this case, the trees $T_1$ and $T_2$ lead to alternative combination topologies. So, we consider both of them separately.

We utilize the observations above as follows. We start by picking one reference tree arbitrarily. Once we create a combined subtree using that tree, we check whether that subtree is present in the remaining trees in $\mathcal{T}$. We mark those trees that contain it as considered for reference tree and never use them as reference for the same seed pair again. This is because those trees fall into the first case described above. This way, we also store the frequency of the combined subtree in $\mathcal{T}$. If the number of unmarked trees is too small (i.e., less than $\gamma m$, where m is the number of trees in $\mathcal{T}$), then even if all the remaining trees agree on the same combined topology for the two seeds under consideration, they are not sufficient to make it a FAST. Thus, we do not use any of the remaining trees as reference for those two seeds. Otherwise, we pick another unmarked tree arbitrarily and repeat the same process until we run out of reference trees.

The next question we need to answer is which seed pairs we should combine. To answer this question, we first make the following proposition.

Proposition 1. Assume that we are given a set of phylogenetic trees $\mathcal{T}$. Let $S_1$ and $S_2$ be two seeds constructed from the trees in $\mathcal{T}$. For all trees $T \in \mathcal{T}$, we have the following inequality:

$$\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \min\{\mathrm{freq}(S_1, \mathcal{T}), \mathrm{freq}(S_2, \mathcal{T})\}$$

Proof. For any T, both $S_1$ and $S_2$ are subtrees of $S_1 \oplus_T S_2$. Thus, if $S_1 \oplus_T S_2$ is present in a tree, then both $S_1$ and $S_2$ are present in that tree. As a result, $\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \mathrm{freq}(S_1, \mathcal{T})$ and $\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \mathrm{freq}(S_2, \mathcal{T})$. Hence, $\mathrm{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \min\{\mathrm{freq}(S_1, \mathcal{T}), \mathrm{freq}(S_2, \mathcal{T})\}$.

Figure 5 T is the reference tree. $S_1$ and $S_2$ are the seeds to be combined; both are present in T. C is obtained by pruning the subtree containing taxa $a_1$, $a_2$, $a_3$, $a_4$, $a_5$ and $a_7$ from T.

Proposition 1 states that as we combine pairs of seeds to grow them, their frequency monotonically decreases. This suggests that it is desirable to combine two seeds if both of them have large frequencies. This is because if one of them has a small frequency, regardless of the frequency of the other, the combined tree will have a small frequency. As a result, its chance to grow into a larger tree through additional combine operations gets smaller. Following this intuition, we develop two approaches for combining the seeds.

1. In-order Combination (Section "In-order combination").
2. Minimum Overlap Combination (Section "Minimum overlap combination").
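The combine operation and Proposition 1 can be illustrated with a minimal Python sketch. It assumes rooted trees encoded as nested tuples of taxon names; the helper names (`restrict`, `canon`, `contains`, `freq`, `combine`) and the three example trees are hypothetical and are not taken from the authors' C implementation.

```python
from fractions import Fraction

def leaves(tree):
    """Set of taxon labels of a rooted tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def restrict(tree, keep):
    """Contract every taxon outside `keep`; return None if no taxon survives."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    # A node left with a single child is suppressed, as after a contraction.
    return kids[0] if len(kids) == 1 else tuple(kids)

def canon(tree):
    """Child-order-independent form, so that equal topologies compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(child) for child in tree), key=repr))

def contains(tree, sub):
    """True if restricting `tree` to the taxa of `sub` yields the topology of `sub`."""
    restricted = restrict(tree, leaves(sub))
    return restricted is not None and canon(restricted) == canon(sub)

def freq(sub, trees):
    """Fraction of the input trees that contain `sub`."""
    return Fraction(sum(contains(t, sub) for t in trees), len(trees))

def combine(s1, s2, ref):
    """Combine two seeds: contract from `ref` every taxon outside L1 | L2."""
    return restrict(ref, leaves(s1) | leaves(s2))

# Three hypothetical input trees on five taxa.
t1 = ((("a1", "a2"), "a3"), ("a4", "a5"))
t2 = (("a5", "a4"), ("a3", ("a2", "a1")))  # same topology as t1, drawn differently
t3 = ((("a1", "a3"), "a2"), ("a4", "a5"))  # disagrees on a1, a2, a3
trees = [t1, t2, t3]

s1 = (("a1", "a2"), "a3")
s2 = ("a3", ("a4", "a5"))
c = combine(s1, s2, t1)

print(freq(s1, trees), freq(s2, trees), freq(c, trees))  # prints: 2/3 1 2/3
```

In this toy example the combined subtree has frequency 2/3, which does not exceed the smaller of the two seed frequencies, as Proposition 1 requires.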
Both approaches accept the list of seeds computed in the first phase as input and produce a larger FAST that is a combination of multiple seeds. Both of them also assume that the list of input seeds is already sorted in decreasing order of frequency. We discuss these approaches next.

In-order combination

The in-order combination approach follows from Proposition 1. It assumes that the seeds with higher frequencies have greater potential to be a part of an MFAST. It exploits this assumption as follows. First, it picks a seed as the starting point to create a FAST. It then grows this seed by combining it with other seeds, starting from the most frequent one, as long as the frequency of the resulting tree remains at least as large as the given cutoff $\gamma$. It repeats this process by trying each seed as the starting point. Algorithm 1 presents this approach.

Algorithm 1 In-order combination
FAST ← ∅
for all seeds $S_i$ do
    FAST′ ← $S_i$
    Mark $S_i$ as considered
    repeat
        $S_j$ ← seed with highest frequency among unconsidered seeds
        Mark $S_j$ as considered
        CUTOFF ← $\gamma$
        tFAST ← FAST′
        repeat
            Pick the next unconsidered tree $T \in \mathcal{T}$ as reference
            Mark all the trees that contain FAST′ $\oplus_T$ $S_j$ as considered
            if freq(FAST′ $\oplus_T$ $S_j$, $\mathcal{T}$) ≥ CUTOFF then
                tFAST ← FAST′ $\oplus_T$ $S_j$
                CUTOFF ← freq(FAST′ $\oplus_T$ $S_j$, $\mathcal{T}$)
            end if
        until less than $\gamma m$ unmarked reference trees are left in $\mathcal{T}$
        FAST′ ← tFAST
        Unmark all trees in $\mathcal{T}$
    until all seeds are considered
    if size of FAST′ > size of FAST then
        FAST ← FAST′
    end if
    Unmark all seeds
end for

In Algorithm 1 we first initialize the FAST as empty. We then consider each seed one by one. We initialize a temporary subtree, denoted by FAST′, with the seed $S_i$ under consideration and mark $S_i$ as considered. We combine FAST′ with the seed $S_j$ that has the highest frequency among the seeds that have not been added. If multiple seeds have the highest frequency, we randomly pick one of them and mark that seed $S_j$ as added to FAST′. There can be alternative ways to combine FAST′ with $S_j$, leading to different topologies. We use the trees in $\mathcal{T}$ that contain both FAST′ and $S_j$ as guides to try only the topologies that exist in $\mathcal{T}$. We stop constructing alternative topologies as soon as we are sure that too few trees remain to yield a frequency of $\gamma$. We set FAST′ to the combined seed if the combined seed has a large enough frequency. We then consider the seed with the next highest frequency for addition and repeat this step until all $S_j$ have been considered. If the resulting temporary FAST′ is larger than FAST, we replace the smaller FAST with the larger one. In the next iteration, we initialize FAST′ with the next $S_i$. Using this approach we can initialize FAST′ with every $S_i$. Alternatively, if the user wishes to limit the amount of time spent by using a maximum time cutoff, we stop the outermost loop (i.e., alternative initializations of FAST′) as soon as the allowed running time budget is reached.

Notice that in Algorithm 1 each seed $S_i$ can lead to a different FAST. We record only the FAST that has the largest size. However, it is trivial to maintain the top k FASTs with the largest size instead if the user is looking for k alternative maximal FASTs.

Minimum overlap combination

The purpose of combining seeds is to construct a FAST that is large in size. Our in-order combination approach (Section "In-order combination") aimed to maximize the frequency of the combined seeds. In this section, we develop our second approach, named Minimum Overlap Combination. This approach picks seeds so that their combination produces as large a subtree as possible. We elaborate on this approach next.
When we combine two seeds, the size of the resulting tree becomes at least as big as the size of each of these seeds. Formally, let $S_1$ and $S_2$ be two seeds (i.e., trees). Let $L_1$ and $L_2$ be the sets of taxa contained in $S_1$ and $S_2$. We denote the size of a set, say $L_1$, with $|L_1|$. The size of the tree resulting from the combination of $S_1$ and $S_2$ is $|L_1| + |L_2| - |L_1 \cap L_2|$. For a given fixed seed size, the first two terms of this formulation remain unchanged regardless of the seed. The last term determines the growth in the size of the FAST. Thus, in order to grow the FAST rapidly, it is desirable to combine two frequent subtrees with a small number of common taxa.

Our second approach follows from the observation above. We introduce a criterion called the overlap between two subtrees, defined as the number of taxa common between them. Our minimum overlap combination approach works the same as Algorithm 1 with a minor difference in selecting the seed $S_j$ that will be combined with the current temporary FAST (i.e., FAST′). Rather than choosing the seed with the largest frequency, this approach chooses the one that has the least overlap with FAST′ among all the unconsidered and frequent seeds. If multiple seeds have the same smallest overlap, it uses the frequency as the tie breaker and chooses the one with the largest frequency among those.

Phase three: Post-processing

So far we described how to obtain seeds (Section "Phase one: Seed generation") and how to combine them to construct a FAST (Section "Phase two: Seed combination"). The two approaches we developed for combining seeds aim to maximize the size of the FAST. However, they do not ensure the maximality of the resulting FAST. There are two main reasons that prevent our seed combining algorithms from constructing a maximal FAST. First, some of the taxa of a maximal FAST may not appear in any seed (i.e., false negatives). As a result, no combination of seeds will lead to that maximal FAST. Second, even if all the taxa of a maximal FAST are parts of at least one seed, our algorithms will reject combining such seeds with the current FAST if those seeds contain other taxa that are not part of the maximal FAST (i.e., false positives).
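The seed selection rule of the minimum overlap combination approach can be sketched as follows. This is only an illustration under simplifying assumptions: each seed is reduced to its taxon set and its frequency, and the function names and example values are hypothetical.

```python
def combined_size(taxa1, taxa2):
    """Size of the combined tree: |L1| + |L2| - |L1 & L2|."""
    return len(taxa1) + len(taxa2) - len(taxa1 & taxa2)

def pick_min_overlap(current_taxa, candidates):
    """Pick the (taxa, frequency) seed with the least overlap with the current FAST.

    Ties on overlap are broken in favour of the larger frequency.
    """
    return min(candidates,
               key=lambda seed: (len(current_taxa & seed[0]), -seed[1]))

# Taxa of the current temporary FAST and three unconsidered frequent seeds.
current = {"a1", "a2", "a3"}
candidates = [
    ({"a1", "a2", "a4"}, 0.90),  # overlap 2
    ({"a3", "a5", "a6"}, 0.80),  # overlap 1
    ({"a2", "a7", "a8"}, 0.85),  # overlap 1, higher frequency wins the tie
]

taxa, frequency = pick_min_overlap(current, candidates)
print(sorted(taxa), frequency, combined_size(current, taxa))
```

Here the most frequent seed is skipped because it would add only one new taxon, whereas the selected seed grows the current FAST from three to five taxa.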
In the post-processing phase, we tackle the above-mentioned problem. Algorithm 2 describes the post-processing phase in detail. We consider, one by one, all taxa that are not already present in the FAST. We iteratively grow the current FAST by including one more taxon at a time if the frequency of the resulting FAST remains at least as large as the frequency cutoff $\gamma$. We repeat these iterations until no new taxon can be included in the FAST. Thus, the resulting FAST is guaranteed to be maximal.

Algorithm 2 Post-processing
INPUT = FAST from the seed combination phase
INPUT = $\mathcal{T}$
OUTPUT = Maximal FAST
RESULT ← FAST
for all $a_i$ not in FAST do
    CUTOFF ← $\gamma$
    tRESULT ← RESULT
    repeat
        Pick the next unconsidered tree $T \in \mathcal{T}$ as reference
        RESULT′ ← RESULT $\oplus_T$ $a_i$
        Mark all the trees that contain RESULT′ as considered
        if frequency of RESULT′ ≥ CUTOFF then
            tRESULT ← RESULT′
            CUTOFF ← frequency of RESULT′
        end if
    until less than $\gamma m$ unmarked reference trees are left in $\mathcal{T}$
    RESULT ← tRESULT
    Unmark all trees in $\mathcal{T}$
end for
return RESULT

We expect the post-processing step to quickly identify the taxa that have the potential to be in an MFAST but might not have been considered during the seed generation and seed combination phases. At the end of the post-processing step we obtain an MFAST.

Complexity analysis of our method

In this section we discuss the complexity of our method in terms of the three phases involved in it. Let $\mathcal{T}$ be a set of m phylogenetic trees having n leaves each. The complexity of the different phases of our method is as follows.

Phase one. Finding the seeds involves enumerating all the subtrees and checking their frequencies. Given seed size k and number of contractions c, each tree contains at most $\frac{n}{k+c}$ clades, each leading to $\binom{k+c}{c}$ alternative subtrees. Thus, in total there can be up to $m \frac{n}{k+c} \binom{k+c}{c}$ seeds (possibly many of them identical) from all the trees in $\mathcal{T}$. Typically, the values of k and c are fixed and small (in our experiments we have $k \in \{3, 4, 5\}$ and $c \in \{0, 1, 2, 3, 4, 5\}$), leading to $O(mn)$ seeds.

The complexity of finding whether a seed is present in a single tree is $O(n \log n)$. Given that there are m trees in $\mathcal{T}$, the cost of computing the frequency of a single seed is $O(mn \log n)$. Thus, the time complexity for finding the frequency of all the seeds is this expression multiplied by the number of seeds, which is $O(m^2 n^2 \log n)$.

Phase two. Consider a set of p frequent seeds that will be considered for combining in this phase. Recall that we have two approaches to combine them. Below, we focus on each.
IN-ORDER COMBINATION. We try to combine each seed with every other seed, leading to $O(p^2)$ iterations. The complexity of checking the frequency of each combined subtree is $O(mn \log n)$. Also, there can be up to $O(m)$ different reference trees for guiding the combine operation. Multiplying these terms, we obtain the complexity of this phase using this approach as $O(p^2 m^2 n \log n)$.

MINIMUM OVERLAP COMBINATION. The complexity of combining the frequent seeds using the minimum overlap combination approach is very similar to that of the in-order approach except for an additional term. The additional complexity arises because we maintain the overlap between the subtrees. This leads to the complexity $O(p^2 n^2 + p^2 m^2 n \log n)$.

Phase three. Here, we consider the FAST obtained from each of the p frequent seeds in phase two. For each FAST, we sequentially go over the taxa one by one, leading to $O(n)$ iterations. There can be up to $O(m)$ reference trees for adding a taxon. So the cost of extending all p FASTs is $O(mnp)$.

Notice that each frequent seed has to appear in at least $\gamma m$ trees, where $\gamma$ is the frequency cutoff. Thus, the number of unique frequent seeds p is bounded by $O\left(\frac{mn}{\gamma m}\right) = O\left(\frac{n}{\gamma}\right)$. Thus, adding the cost of all the three phases, the overall time complexity of our method using in-order combination is

$$O\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + \frac{m n^2}{\gamma}\right)$$

That using minimum overlap combination is

$$O\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + \frac{n^4 \log n}{\gamma^2} + \frac{m n^2}{\gamma}\right)$$

In the two summations above, the second term is asymptotically larger than the first and the last terms.
Thus, we can simplify the asymptotic time complexity of the in-order and minimum overlap combinations to $O\left(\frac{m^2 n^3 \log n}{\gamma^2}\right)$ and $O\left(\frac{n^3 \log n}{\gamma^2}(m^2 + n)\right)$, respectively, where $\gamma$ is the frequency cutoff.

Results and discussion

This section evaluates the performance of our MFAST algorithm experimentally.

Implementation details. We implemented our MFAST algorithm using C and Perl. More specifically, we implemented the first two phases (seed generation and seed combination) in C and the third phase (post-processing) in Perl. We utilize the functions provided in the Newick Utilities [25] package by modifying the source code provided in that package. We use $k \in \{3, 4, 5\}$ and $c \in \{0, 1, 2, 3, 4, 5\}$ in all of our experiments unless otherwise stated. In our experiments, we observed that the minimum overlap combination produced larger MFASTs than the in-order combination approach. Therefore, we limit our experimental results to the minimum overlap combination approach.

Methods compared against. We have compared our method against Phylominer [20] and the MAST command implemented in PAUP* [26]. Among these, Phylominer also seeks MFASTs in a collection of trees. However, the time complexity of this method is exponential in the size of the input trees, and hence it becomes intractable for large trees. In our experiments, we observed that it does not scale beyond 50 taxa. PAUP* is primarily a program for phylogenetic inference, although it can also compute MASTs. MASTs have a strict 100% agreement criterion, unlike the arbitrary frequency cutoff values $\gamma$ in our method.

Evaluation criteria. We evaluate our algorithm based on the size of the MFAST found. Larger MFASTs are preferable. When possible, we report the size of the optimal solution as well.

Test environment. We ran our experiments on Linux servers equipped with dual AMD Opteron dual-core processors running at 2.2 GHz and 3 GB of main memory to test the performance of our method.

Datasets. We test the performance and verify the results of our method on synthetic datasets and real datasets.

• SYNTHETIC DATASET. We built synthetic datasets in which we embedded an MFAST as described below. We characterize each synthetic dataset using five parameters.
  1. Tree size (n).
  2. Number of trees (m).
  3. MFAST frequency (f).
  4. MFAST size ($n'$).
  5. Noise percentage ($\epsilon$).
  The first two parameters denote the size and number of trees in $\mathcal{T}$. MFAST frequency specifies the fraction of trees in $\mathcal{T}$ which contain an MFAST. MFAST size is the number of taxa in the embedded MFAST. The noise percentage is the percentage of taxa that are not a part of the embedded MFAST but are placed on the branches within the clade that contains the MFAST. We place all the other taxa on the branches outside this clade.
  Given an instantiation of these parameters, we first created a tree that has $n'$ taxa. This tree serves as the MFAST. We then created $m \times f$ trees that contain this MFAST. We build each of these trees by inserting $n - n'$ taxa randomly in the MFAST. With probability $\epsilon$ we insert each taxon within the clade that contains the MFAST. With probability $1 - \epsilon$ we insert it outside that clade. We then created $m - (m \times f)$ trees that do not contain the current MFAST. We simply do this by inserting all the taxa one by one at a random location.

• REAL DATASETS. We use two empirical datasets to evaluate the performance of our heuristic. The datasets contain 200 bootstrap trees generated from phylogenetic analysis of the Gymnosperm [27] and Saxifragales (Burleigh, unpublished) plant clades. To make the bootstrap trees, we assembled super-matrices, matrices of concatenated gene alignments with partial taxon overlap, from gene sequence data available in GenBank. We performed a maximum likelihood bootstrap analysis on each super-matrix using RAxML v.7.0.4 [28]. The Gymnosperm trees each contain 959 taxa, and the Saxifragales trees each contain 950 taxa.

Effects of number of input trees

In our first experiment, we analyze how the number of input trees in $\mathcal{T}$ affects the performance of our algorithm. For this purpose, we created 30 synthetic datasets. The size of the embedded MFAST in all the datasets was 15. Among these 30 datasets, 10 contained 50 trees, 10 contained 100 trees and 10 contained 200 trees. We set the noise percentage to 20% in all the datasets. The frequency of the embedded MFAST was 0.8. We set the number of taxa in all the trees in these datasets to 100.

We ran our algorithm on each of these datasets to find the size of the MFAST for $\gamma = 0.7$. Table 2 lists the average MFAST size we found for each of the dataset sizes before post-processing (i.e., at the end of phase two) and after post-processing (i.e., at the end of phase three).
The results demonstrate that our method can identify an MFAST that is almost as big as the embedded one even without post-processing, regardless of the number of trees in the dataset. Post-processing improves the MFAST size slightly. On average, we always find an MFAST that is as large as or larger than the embedded one. An MFAST larger than 15 here implies that while randomly inserting the taxa that are not in the embedded MFAST, at least one of them was placed under the same clade at least a fraction of the time. More importantly, our method successfully located such taxa along with the rest of the MFAST.

Table 2 Evaluation of the effect of the number of trees in $\mathcal{T}$

Number of trees    MFAST size
                   Before post-processing    After post-processing
50                 14.5                      16.0
100                15.3                      15.8
200                14.4                      15.4

The number of trees is set to 50, 100 and 200. For each number of trees we run our experiments on ten datasets. Each dataset contains trees with 100 taxa and an embedded MFAST of size 15. We report the average size of the MFAST obtained by our method across the ten datasets.

Effects of tree size

Our second experiment considers the impact of the number of taxa in the input trees contained in $\mathcal{T}$ on the success of our method. To carry out this test, we built datasets with varying tree sizes (i.e., n). Particularly, we used n = 100, 250, 500 and 1000. For each value of n, we repeated the experiment 10 times by creating 10 datasets with the same properties. In all datasets, we set the number of trees to m = 100, the noise percentage to $\epsilon$ = 20%, the size of the embedded MFAST to 15% of n, and the MFAST frequency to 0.8.

Table 3 reports the average MFAST size found by our method for varying tree sizes. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets. Before going into a detailed discussion of the results, it is crucial to observe that our method could run to completion for datasets that have as many as 1000 taxa. When we tried to run Phylominer, it did not return any results for datasets that have more than 100 taxa. The results also demonstrate that our method could successfully identify the embedded MFAST in all the datasets regardless of the size of the input trees. In some datasets, the reported MFAST was slightly larger than the embedded one. This indicates that while randomly inserting the taxa that are not part of the embedded MFAST, it is possible that a few taxa were consistently placed under the same clade.
The results also suggest that our method identifies a significant percentage of the taxa in the embedded MFAST after the second phase (i.e., before post-processing) when the tree size is small. As the tree size grows, it starts missing some taxa at this phase. It however recovers the missing taxa during the post-processing phase even for the largest tree size. This indicates that at the end of phase two our method could identify a backbone of the actual MFAST. The unidentified taxa at this phase are scattered throughout the clades in the input trees. Thus, there is no clade of size k + c that contains them with c contractions for small k or c. As evident from Table 3, this however does not prevent our method from recovering them. This is because the backbone reported at the end of phase two is large enough, and thus specific enough, to recover the missing taxa one by one in the last phase. This is a significant observation as it demonstrates that our method works well even with small values of k and c.

Table 3 Evaluation of the effect of the size of the trees in $\mathcal{T}$

Number of taxa    MFAST size
                  Embedded    Reported
                              Before post-processing    After post-processing
100               15          15.3                      15.8
250               38          32.3                      38.8
500               75          43.7                      76.0
1000              150         69.8                      151.0

The tree size is set to 100, 250, 500 and 1000. For each tree size we run our experiments on ten datasets. Each dataset contains 100 trees with an embedded MFAST of size 15% of the input tree size. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets.

Effects of noise percentage

Recall that the noise percentage $\epsilon$ denotes the percentage of taxa that are added inside the clade that contains the MFAST. As $\epsilon$ increases, the pairs of taxa in the MFAST get farther away from each other in the tree that contains it. As a result, fewer taxa from the MFAST will be contained in small clades of size k + c. This raises the question of whether our method works well as $\epsilon$ increases and thus the MFAST taxa get scattered around in the trees that contain it.
In this experiment, we answer the question above and analyze the effect of the noise percentage on the success of our method. We create synthetic datasets with various $\epsilon$ values. Particularly, we use $\epsilon$ = 20, 40 and 60%. We set the size of the embedded MFAST to $n'$ = 15, the tree size to n = 100, the number of trees to m = 100 and the MFAST frequency to f = 0.8. We repeat our experiment for each parameter 10 times by recreating the dataset randomly using the same parameters. We set the frequency cutoff to $\gamma$ = 0.7. We report the average MFAST size found by our method in Table 4.

Table 4 Evaluation of the effect of the noise in the trees in $\mathcal{T}$

Noise (%)    MFAST size
             Before post-processing    After post-processing
20           15.3                      15.8
40           13.6                      15.0
60           12.7                      15.0

The size of the embedded MFAST in all the experiments is 15. We list the average size of the MFAST found by our method before and after the post-processing phase.

The results suggest that our method can identify the embedded MFAST successfully even when the noise percentage is very high. We observe that the size of the MFAST found by our method before post-processing decreases slowly with increasing amount of noise. This is not surprising, as the taxa contained in the embedded MFAST get more spread out (and thus farther away from each other) in the trees in $\mathcal{T}$ with increasing noise. As a result, if there are taxa that are not part of any seed with the provided values of k and c, they will never be included in the computed MFAST at the end of phase two. We however observe that (i) only a small number of such taxa exists. For instance, even for the largest noise percentage ($\epsilon$ = 60%), only 2.3 taxa (i.e., 15 - 12.7) are missing on average. (ii) The missing taxa are recovered during phase three. This is because the computed MFAST at the end of phase two is very large, and thus it is specific to the embedded MFAST.

Impact of seed creation

So far, in our experiments we consistently observed two major points for all the parameter settings (see Sections "Effects of number of input trees" to "Effects of noise percentage"): (i) Our method always finds a large subtree of the embedded MFAST after phase two. (ii) Our method always recovers the entire embedded MFAST after phase three. The second observation can be explained from the first one: the outcome of phase two is large enough to build the entire MFAST precisely. The first observation however indicates that the set of seeds generated in phase one contains a significant percentage of the taxa in the embedded MFAST. In this section, we take a closer look into this phenomenon and explain why this is the case even for small values of seed size k and contraction amount c, and large noise percentage $\epsilon$. To do that, we will compute the probability that a subset of the taxa of the embedded MFAST appears in at least one seed generated in phase one. In our computation, we will assume that the taxa can appear at any location of a given tree with the same probability. We discuss the implication of this assumption later in this section.
The number of rooted bifurcating trees for a given set of n taxa is

$$R(n) = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$

Consider a clade with k + c taxa. The number of trees with n taxa that contain this clade is $R(n - (k + c) + 1)$, as the topology of the clade of size k + c is fixed. For a given subtree with k taxa, let us denote the number of clade topologies of size k + c that contain that subtree with NU(k, c). We can compute this function recursively as NU(k, 0) = 1 and, for c > 0,

$$\mathrm{NU}(k, c) = \mathrm{NU}(k, c-1) \cdot 2(k + c - 2)$$

Let us denote one of these clades by U(k, c). Also, let us denote the probability that the clade U(k, c) exists in a random tree topology that contains n taxa with $P(n, k, c)$. Intuitively, $P(n, k, c)$ is the probability that our method will extract a specific k-taxon subtree from one n-taxon tree after only c contractions. We can compute this probability as the ratio of the number of tree topologies that satisfy this constraint to that of all possible tree topologies. We formulate this as

$$P(n, k, c) = \frac{\mathrm{NU}(k, c)\, R(n - (k + c) + 1)}{R(n)}$$

Recall that it suffices for our algorithm to have a k-taxon subtree of the MFAST in at least one tree in the given set of m trees. The probability that the clade U(k, c) exists in at least one of the m random tree topologies, each containing n taxa, is

$$P(n, k, c, m) = 1 - (1 - P(n, k, c))^m$$

Assume that the MFAST size in the given set of trees $\mathcal{T}$ is h. Let us denote the number of k-taxon subtrees of the MFAST as NS(h, k). The probability that at least one of these subtrees will be found in at least one of the input trees is then

$$P(n, k, c, m, h) = 1 - (1 - P(n, k, c, m))^{\mathrm{NS}(h, k)}$$

A lower bound on NS(h, k) is h - k + 1, which can be obtained by picking a contiguous block of k taxa from the canonical Newick representation of the MFAST by considering all possible h - k + 1 starting point locations. Notice that the larger the value of $P(n, k, c, m, h)$, the higher the chances that our algorithm will construct some part of the MFAST. Similarly, the larger the value of NS(h, k), the higher the chances that our algorithm will construct some part of the MFAST.
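The formulas above can be evaluated numerically with a short Python sketch. The function names are hypothetical. The sketch uses the lower bound h - k + 1 in place of NS(h, k), takes R(1) = 1 as the empty-product convention, and assumes n >= k + c; it is meant only to illustrate the computation.

```python
from fractions import Fraction
from math import expm1, log1p, prod

def rooted_trees(n):
    """R(n) = (2n-3)! / (2^(n-2) (n-2)!), computed as the product 1 * 3 * ... * (2n-3)."""
    return prod(range(1, 2 * n - 2, 2))

def clade_topologies(k, c):
    """NU(k, c) from the recurrence NU(k, 0) = 1, NU(k, c) = NU(k, c-1) * 2(k + c - 2)."""
    count = 1
    for i in range(1, c + 1):
        count *= 2 * (k + i - 2)
    return count

def p_clade(n, k, c):
    """P(n, k, c): exact probability that the clade U(k, c) exists in one random tree."""
    return Fraction(clade_topologies(k, c) * rooted_trees(n - (k + c) + 1),
                    rooted_trees(n))

def p_some_tree(n, k, c, m):
    """P(n, k, c, m) = 1 - (1 - P(n, k, c))^m, evaluated stably for tiny probabilities."""
    p = float(p_clade(n, k, c))
    return 1.0 if p >= 1.0 else -expm1(m * log1p(-p))

def p_success(n, k, c, m, h):
    """P(n, k, c, m, h) with NS(h, k) replaced by its lower bound h - k + 1."""
    q = p_some_tree(n, k, c, m)
    return 1.0 if q >= 1.0 else -expm1((h - k + 1) * log1p(-q))

print(rooted_trees(4), clade_topologies(3, 1), p_clade(4, 3, 1))  # prints: 15 4 4/15
print(p_success(n=100, k=3, c=5, m=500, h=15))
```

Because NS(h, k) is replaced by a lower bound, the value returned by `p_success` is itself a lower bound on the success probability under the uniform-topology assumption.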
Figure 6 plots the success probability (i.e., $P(n, k, c, m, h)$) of our method for varying parameter values. As the MFAST size increases, the success probability rapidly increases. This is because the number of alternative subtrees of the MFAST increases with increasing MFAST size. Thus, the chance of observing at least one increases as well. We observe that when the size of the MFAST is around 20% of the tree size, our success probability becomes almost 1 for all the parameters reported. As the number of contractions increases, the probability of success increases. This is because a large number of contractions increases the possibility of eliminating false positive taxa from clades. In other words, it helps glue the taxa that are normally scattered in the input trees back together by removing the remaining taxa among them. When c = 5, our success probability becomes almost one even for MFASTs that are as small as 4-6% of the tree size. As the number of trees increases, the success probability increases as well. This is because we have more alternative topologies with increasing number of trees. Thus, there are more chances to have a small clade that contains a part of the MFAST. Finally, it is worth noting that these results are computed based on the assumption that the trees in $\mathcal{T}$ are uniformly distributed among all possible topologies. In practice, we expect that these trees are constructed with the same or similar objectives (such as maximum parsimony or maximum likelihood). As a result, they will likely have a higher chance to contain large MFASTs. The results we expect in practice will thus be similar or even better than the theoretical results in Figure 6.

Figure 6 The probability of finding at least one seed which contains a part of an MFAST. The number of contractions c is set to 3, 4 and 5 and the corresponding seed size k is 5, 4 or 3. The x-axis shows the MFAST size in terms of the percentage of the number of taxa in the trees in $\mathcal{T}$ (0 to 20%); the y-axis shows the success probability (0 to 1). In (a), we set the total number of trees to m = 500. In (b), we set m = 1000.

Overall, we conclude from this experiment that even small values of k and c suffice to capture a part of the MFAST in phase two. Therefore, although our algorithm's complexity increases exponentially with k and c, we do not need to use large values for k and c. This enables our algorithm to scale to very large datasets with thousands of taxa and trees. These results explain the theory behind the practical results we observed in Sections "Effects of number of input trees" to "Effects of noise percentage".

Evaluation of state of the art methods

So far, we have shown that our method could successfully find the MFASTs contained in sets of trees $\mathcal{T}$ for up to 1000 taxa and 200 trees (i.e., n = 1000 and m = 200). An obvious question is how well existing methods perform on the same datasets. Here, we answer this question for two existing programs, namely PAUP* (version 4.0b10) and Phylominer.

When we fix the number of trees and the number of taxa to 100, PAUP* was able to find the MAST for all datasets. As we grow the number of taxa to 250 or larger while keeping the number of trees at 100, PAUP* runs out of memory and fails to return any results. After reducing the number of trees to 50, PAUP* still runs out of memory and cannot report any results for more than 100 taxa.
The scalability problem of Phylominer is even more severe. Phylominer is able to compute the MFASTs on datasets with up to 20 taxa. However, as we increase the number of taxa further, its performance deteriorates quickly. When we set the number of taxa to 100, even with as few as 100 trees, Phylominer takes more than a week to report a result. Moreover, in our experiments, the maximum size of the subtrees it found on average contained fewer than 7 taxa, even though the size of the true MAST was 10.

Another interesting question about existing methods would be whether the majority consensus rule can be used to find MFASTs. To evaluate this, we used the same three synthetic datasets used in Section "Effects of noise percentage". Recall that each of these three datasets contains an MFAST of size 15 which is embedded in 80% of the trees. The datasets are created with 20%, 40% and 60% noise, indicating different levels of difficulty in recovering the embedded MFAST. We computed the 70% majority consensus tree. Notice that if the majority consensus rule can identify an MFAST, that would correspond to a bifurcating subtree topology in the consensus tree. In other words, a subtree is bifurcating in this experiment only if 70% or more of the input trees agree on the topology of that subtree. The resulting tree, however, was multifurcating for all the three datasets. This means that the majority consensus rule could not recover even a smaller portion of the embedded tree, while our method was able to locate the entire MFAST successfully (see Table 4).

These results demonstrate that neither PAUP* nor Phylominer is well suited to finding agreement subtrees in larger datasets; our method scales better in terms of both the number of taxa and the number of trees. When PAUP* runs to completion, we observed that it reports the true results. Recall from previous experiments that our method always found the true results on the same datasets as well as on larger datasets. This suggests that our method has the potential to have an impact in large scale phylogenetic analysis when existing methods fail.

Empirical dataset experiments

To examine the performance of the MFAST method on real data, we performed experiments using 200 maximum likelihood bootstrap trees from a phylogenetic analysis of gymnosperms (959 taxa) and Saxifragales (950 taxa). Specifically, we evaluated how the performance of the MFAST algorithm was affected by the number of input trees and the size of the input trees.

Effects of number of input trees

We first examined the effect of input tree number on the size of the MFAST. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 50 and 100 trees by randomly sampling from the original 200 trees without replacement. We compared the average size of the MFAST in the 50 and 100 tree datasets with the size of the MFAST in the original 200 tree dataset. First, in all analyses, the post-processing step greatly increases the size of the MFAST, sometimes more than doubling it (Table 5). This increase is similar to the one observed in the 1000 taxon simulated datasets (Table 3), emphasizing again the importance of the post-processing step with large tree datasets. Although the sizes of the MFASTs were similar, they decreased slightly with the addition of more trees (Table 5). This may simply be a matter of observing more conflict with more trees.
The large gap between the MFAST sizes before and after post-processing suggests that phase three is the main reason behind the success of our method, and thus that the costly seed combination phase (i.e., phase two) may be unnecessary. To test this conjecture, we ran a variant of our method with the second phase disabled; we ran only the post-processing phase, starting from each seed in turn as the initial MFAST. We report the largest MFAST found that way as the output of this variant in Table 5. The results demonstrate that although phase three can grow a large FAST, phase two is essential to find the largest frequent agreement subtree. In other words, post-processing finds the true MFAST only if a large portion of it has already been found (which is the role served by phase two). In conclusion, phase three of our method cannot replace phase two; both phases are essential for the success of our method.

Table 5 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post-processing (phase three)

Number of        Gymnosperms                  Saxifragales
trees        Before   After    Only       Before   After    Only
50            78.5    129.8    99.5        64.7    122.0    84.1
100           68.4    119.2    83.1        55.4    112.8    74.7
200           76.0    118.0    84.0        40.0    105.0    75.0

The size of the MFAST found by running only the post-processing step is also shown (column "Only"). We ran our method on the entire dataset, which contains 200 trees, as well as on randomly selected subsets of 50 and 100 trees. We repeated the 50 and 100 tree experiments 10 times by randomly selecting the trees from the entire dataset and report the average value.

Effects of size of input tree

Next, we examined the effect of the number of leaves in the input trees on the size of MFASTs. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 200 input trees with 100, 250, and 500 taxa. To make each set, we randomly selected 100, 250, or 500 taxa, and we deleted all other taxa from the original sets of 200 trees. Thus, these sets of trees with 100, 250, or 500 taxa are subtrees of the original datasets. The size of the average MFAST increases with more taxa in the original trees (Table 6). However, interestingly, the average size of the MFASTs for the gymnosperm dataset with 500 taxa is larger than the MFAST found from the original gymnosperm trees with all the taxa (Table 6). Since the MFASTs from the 500-taxon datasets should all be found within the full dataset, this indicates that on the larger trees, our method may not always find the true (i.e., largest) MFAST. The full datasets may require a larger number of contractions to find the true MFASTs.

Table 6 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post-processing (phase three) for different numbers of taxa

Number of        Gymnosperms                  Saxifragales
leaves       Before   After    Only       Before   After    Only
100           41.2     56.1    43.5        43.5     50.7    38.5
250           67.2     88.5    63.0        62.3     76.2    54.6
500           91.6    123.0    74.9        52.0     86.7    62.9
All           76.0    118.0    84.0        40.0    105.0    75.0

The size of the MFAST found by running only the post-processing step is also shown (column "Only"). We ran our method on the entire dataset, which contains all the taxa (last row), as well as on randomly selected taxa subsets of size 100, 250 and 500. We repeated the 100, 250 and 500 taxa experiments 10 times by randomly selecting the taxa from the entire dataset and report the average value.

Similar to the experiments in Section "Effects of number of input trees", we investigated the gap between the MFAST sizes before and after the post-processing step. We ran a variant of our method with the second phase disabled; we ran only the post-processing phase, starting from each seed in turn as the initial MFAST. We report the largest MFAST found that way as the output of this variant in Table 6. The results parallel those in Table 5. Phase three can grow a large frequent agreement subtree, but not quite as large as the one found when both phases two and three are executed.

Effects of sample size

In our final experiment, we evaluated the effect of the maximum time cutoff, described in Section "In-order combination", on the accuracy of our method. Recall that this cutoff limits the number of initial seeds tried in our algorithm by randomly sampling a small percentage of the seeds. It uses only the sampled seeds as possible initial seeds. However, it uses the entire set of seeds while growing the MFAST determined by the initial seed. As each initial seed takes roughly the same amount of time to grow into an MFAST, using x% of the seeds as the sample set reduces the total running time of our method to roughly x% of that of our original implementation.

Table 7 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets for different random subsamples of the total number of seeds

Sampling            MFAST size
percentage    Gymnosperms   Saxifragales
2                85.9           74.3
5                87.5           75.6
10               88.4           75.2
25               87.5           75.5
50               88.5           76.2
100              88.5           76.2

We ran our method by randomly picking 2%, 5%, 10%, 25%, 50% and 100% of the seeds found in phase one for combination in phase two.
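The seed sampling described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `grow` callback (standing in for the seed combination and post-processing phases), and the representation of seeds as plain Python objects are all assumptions made for illustration. The last helper computes, under the additional assumption of uniform sampling without replacement, the probability that a sample contains at least one of the seeds belonging to a given MFAST.

```python
import math
import random


def sample_initial_seeds(seeds, percentage, rng=None):
    """Pick about `percentage` percent of the seeds (at least one), without replacement."""
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in (0, 100]")
    rng = rng or random.Random()
    count = max(1, math.ceil(len(seeds) * percentage / 100))
    return rng.sample(list(seeds), count)


def best_candidate(seeds, percentage, grow, rng=None):
    """Grow one candidate per sampled initial seed and return the largest.

    `grow(start, all_seeds)` is a hypothetical callback. Only the starting
    points are sampled; growth still has access to the entire seed set.
    """
    starts = sample_initial_seeds(seeds, percentage, rng)
    candidates = [grow(start, seeds) for start in starts]
    return max(candidates, key=len)


def hit_probability(total_seeds, seeds_in_mfast, sample_size):
    """Probability that a uniform sample contains at least one seed of the MFAST."""
    misses = math.comb(total_seeds - seeds_in_mfast, sample_size)
    return 1 - misses / math.comb(total_seeds, sample_size)
```

Because each sampled start costs about the same to grow, the loop in `best_candidate` runs for roughly x% of the time of a full run when `percentage` is x. Calling `hit_probability` with a larger `seeds_in_mfast` shows how quickly the chance of missing every seed of a large MFAST shrinks.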
We carried out this experiment as follows. For both the gymnosperm and Saxifragales trees, we ran 10 sets of experiments for each sampling percentage of 2, 5, 10, 25, 50 and 100%. Thus, in total we ran 60 (6 × 10) experiments. Table 7 presents the average MFAST sizes for varying sample sizes. The results demonstrate that even for very small sampling percentages, our method finds an MFAST that is almost as big as the MFAST found by using the entire set of seeds (i.e., 100% sampling percentage). This is very promising, as it demonstrates that the running time of our method can easily be cut to a small fraction by sampling the starting seeds. The rationale behind this is that the MFAST contains many seeds. Starting from any of these seeds, our algorithm has the potential to lead to that MFAST. The probability that at least one of these seeds appears in the sample set is large, particularly for large MFASTs.

Conclusion

In this paper, we present a heuristic for finding maximum frequent agreement subtrees (MFASTs). The heuristic uses a multi-step approach which first identifies small candidate subtrees (called seeds) from the set of input trees, then combines the seeds to build larger candidate MFASTs, and finally performs a post-processing step to increase the size of the candidate MFASTs. We demonstrate that this heuristic can easily handle datasets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Although this heuristic is not guaranteed to find all MFASTs, it performs well on both simulated and empirical datasets. Its performance is relatively robust to the number of input trees and the size of the input trees, although with the larger datasets, the post-processing step becomes more important. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
Although the method we developed is described and implemented for rooted and bifurcating trees, it can be trivially extended to multifurcating as well as unrooted trees. The central technical difference in the case of unrooted trees would be the definition of clade (see Definition 1), as the definition requires a root. A clade in an unrooted tree encompasses two sets of nodes: (i) a given set of taxa X, and (ii) the set of all internal nodes that are on a path between two taxa in X on the phylogenetic tree. We expect that this will increase the number of seeds substantially and thus make the problem more computationally intensive. The amount of increase will depend on the tree topology. The theoretical worst case happens when all the taxa are connected to a single internal node (i.e., star topology). In that case, any subset of taxa can lead to a potential seed as long as the subset size is equal to the seed size allowed. One possible way to overcome this problem would be to exploit randomization or graph coloring strategies and avoid enumerating the majority of the possible seeds.

Abbreviations
MAST: Maximum agreement subtree; FAST: Frequent agreement subtree; MFAST: Maximum frequent agreement subtree; MCMC: Markov chain Monte Carlo.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
AR participated in algorithm development, implementation, experimental evaluation and writing of the paper. TK participated in algorithm development, experiment design and writing of the paper. GB participated in experiment design, dataset collection and writing of the paper. All authors read and approved the final manuscript.

Acknowledgments
This work was supported partially by the National Science Foundation (grants CCF-0829867 and IIS-0845439).

Author details
1 Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA. 2 Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA. 3 Department of Biology, University of Florida, Gainesville, FL, USA.

Received: 5 February 2012 Accepted: 5 September 2012 Published: 3 October 2012

doi:10.1186/1471-2105-13-256
Cite this article as: Ramu et al.: A scalable method for identifying frequent subtrees in sets of large phylogenetic trees. BMC Bioinformatics 2012 13:256.


Copyright: Avinash Ramu et al.; licensee BioMed Central Ltd.
License: http://creativecommons.org/licenses/by/2.0