Ramu et al. BMC Bioinformatics 2012, 13:256
http://www.biomedcentral.com/1471-2105/13/256

A scalable method for identifying frequent subtrees in sets of large phylogenetic trees

Avinash Ramu1, Tamer Kahveci2* and J Gordon Burleigh3

*Correspondence: tamer@cise.ufl.edu
2Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Full list of author information is available at the end of the article

Abstract

Background: We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.

Results: We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees, which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.

Conclusions: Although this heuristic is not guaranteed to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.

Keywords: Phylogenetic trees, Frequent subtree

Background

Phylogenetic trees represent the evolutionary relationships of organisms. While recent advances in genomic sequencing technology and computational methods have enabled construction of extremely large phylogenetic trees (e.g., [1-3]), assessing the support for phylogenetic hypotheses, and ultimately identifying well-supported relationships, remains a major challenge in phylogenetics. Support for a tree often is determined by methods such as nonparametric bootstrapping [4], jackknifing [5], or Bayesian MCMC sampling (e.g., [6]), which generate a collection of trees with identical taxa representing the range of possible phylogenetic relationships. These trees can be summarized in a consensus tree (see [7]). Consensus methods can highlight support for specific nodes in a tree, but they also may obscure highly supported subtrees. For example, in Figure 1, the subtree containing taxa A, B, C, and D is present in all five input trees. However, due to the uncertain placement of taxon E, the majority rule consensus tree implies that the clades in the tree have relatively low (60%) support.

Alternate approaches have been proposed to reveal highly supported subtrees. The maximum agreement subtree (MAST) problem seeks the largest subtree that is present in all members of a given collection of trees [8]. For example, in Figure 1 the MAST includes taxa A, B, C, and D. Finding the MAST is an NP-hard problem [9], although efficient algorithms exist to compute the MAST in some cases (e.g., [9-17]). In practice, since a difference in any single tree will reduce the size of the MAST, the MAST is often quite small, limiting its usefulness. A less restrictive problem is to find frequent agreement subtrees (FASTs), or subtrees that are found in many, but not necessarily all, of the input trees (see [18]).
In this problem, a subtree is declared frequent if it is present in at least as many trees as a user-supplied frequency threshold requires. Several algorithmic approaches have been suggested to identify FASTs, and specifically the maximum FASTs (MFASTs), or FASTs that contain the largest number of taxa. A variant of this problem seeks the maximal FASTs, i.e., FASTs that are not contained in any other FAST. Notice that an MFAST is a maximal FAST; however, the converse is not necessarily true. Zhang and Wang defined algorithms, implemented in Phylominer, to identify FASTs from a collection of phylogenetic trees [19,20]. These algorithms are guaranteed to find all FASTs, but they may be prohibitively slow for data sets larger than 20 taxa. Cranston and Rannala implemented Metropolis-Hastings and Threshold Accepting searches to identify large FASTs from a Bayesian posterior distribution of phylogenetic trees [21]. This approach can handle thousands of input trees, but it may not be feasible if the trees have more than 100 taxa [21].

Figure 1 (a) A collection of five input trees. The same subtree with taxa A, B, C, and D is present in all input trees, and only the position of taxon E changes. (b) The majority rule consensus and maximum agreement subtrees of the five input trees in Figure 1(a).

Another approach to reveal highly supported subtrees from a collection of trees is to identify and remove rogue taxa, or taxa whose position in the input trees is least consistent. Recently, several methods have been developed that can identify and remove rogue taxa from collections of trees with thousands of taxa [22-24]. However, unlike MAST or FAST approaches, they do not provide guarantees about the support for the remaining taxa.

In this paper, we describe a heuristic approach for identifying MFASTs in collections of trees. Unlike previous methods, our method easily scales to datasets with over a thousand taxa and hundreds of trees. Towards this goal, we develop a heuristic solution that works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees, which serve as the seeds of larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step. This step ensures that the size (i.e., number of taxa) of the FAST found cannot be increased further by adding a new taxon without reducing its frequency below a user-supplied frequency threshold. We demonstrate that this heuristic can easily handle data sets with 1000 taxa. We test the effectiveness of these approaches on simulated data sets and then demonstrate their performance on large, empirical data sets. Although our heuristic is not guaranteed, in theory, to find all MFASTs or the largest MFAST, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on the empirical data sets. Its performance is robust with respect to the number of input trees and the size of the input trees.

Methods

In this section we describe our method that aims to find Maximum Frequent Agreement SubTrees (MFASTs) in a given set of $m$ phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$. Our method follows from the observation that an MFAST is present in a large number of trees in $\mathcal{T}$. The method builds MFASTs bottom up from small subtrees of taxa in the trees in $\mathcal{T}$. Briefly, it works in three phases.

Phase 1. Seed generation (Section "Phase one: Seed generation"). In the first phase, we identify small subtrees from the input trees that have the potential to be a part of an MFAST. We call each such subtree a seed.

Phase 2.
Seed combination (Section "Phase two: Seed combination"). In the second phase, we construct an initial FAST by combining the seeds found in the first phase.

Phase 3. Post processing (Section "Phase three: Postprocessing"). In the third phase, we grow the FAST further to obtain the maximal FAST that contains it by individually considering the taxa which are not already in the FAST. We report the resulting maximal FAST as a possible MFAST.

First, we present the basic definitions needed for this paper in Section "Preliminaries and notation". We then discuss each of the three phases above in detail.

Preliminaries and notation

In this section, we present the key definitions and notation needed to understand the rest of the paper. We describe our method using rooted and bifurcating phylogenetic trees. However, our method and definitions can easily be applied to unrooted or multifurcating trees with minor or no modifications. Also, we assume that all the taxa are placed at the leaf-level nodes of the phylogenetic tree, and all the internal nodes are inferred ancestors. Figure 2(a) shows a sample phylogenetic tree built on five taxa. We define the size of a tree as the number of taxa in that tree. We start by defining key terms.

Definition 1 (Clade). Let $T$ be a phylogenetic tree. Given an internal node of $T$, we define the set of all nodes and edges of $T$ contained under that node as the clade rooted at that node.

Each internal node of a phylogenetic tree corresponds to a clade of that tree. Figure 2(b) depicts the clade of the tree in Figure 2(a) rooted at $x_1$.

Definition 2 (Contraction). Let $T$ be a phylogenetic tree with $n$ taxa. The contraction operation transforms $T$ into a tree with $n - 1$ taxa by removing a given taxon in $T$ along with the edge that connects that taxon to $T$.
The contraction operation can extract the clades of a tree by removing all the taxa that are not a part of that clade. It can also extract parts of the tree that are not necessarily clades. We use the term subtree to denote a tree that is obtained by applying contractions to an arbitrary set of taxa in a given tree. The formal definition is as follows.

Definition 3 (Subtree). Let $T$ and $T'$ be two phylogenetic trees. We say that $T'$ is a subtree of $T$ if $T$ can be transformed into $T'$ by applying a series of contractions on $T$.

If a tree $T'$ is a subtree of another tree $T$, we say that $T'$ is present in $T$. Notice that a clade is always a subtree, but the converse is not always true. Figures 2(b) and 2(c) illustrate two subtrees of the tree in Figure 2(a).

Figure 2 (a) A rooted, bifurcating phylogenetic tree $T$ built on five taxa labeled with a, b, c, d and e. The internal nodes are shown with $x_0$, $x_1$, $x_2$ and $x_3$. (b) A clade of $T$ rooted at $x_1$. (b) and (c) Two subtrees of $T$ obtained by contracting the taxa sets {d, e} and {a, e}.

Let us denote the number of combinations of $k$ taxa from a set of $n$ taxa with $\binom{n}{k}$. In general, if a tree has $n$ taxa, then that tree contains $\binom{n}{k}$ subtrees with $k$ taxa. As a consequence, that tree contains $2^n - 1$ subtrees of any size, including itself.

Definition 4 (Frequency). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of $m$ phylogenetic trees and $T$ be a phylogenetic tree. Let us denote the number of trees in $\mathcal{T}$ in which $T$ is present with the variable $m'$. We define the frequency of $T$ in $\mathcal{T}$ as $freq(T, \mathcal{T}) = \frac{m'}{m}$.

Definition 5 (FAST). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of $m$ phylogenetic trees and $T$ be a phylogenetic tree. Let $\gamma$ be a number in the $[0, 1]$ interval that denotes the frequency cutoff. We say that $T$ is a Frequent Agreement SubTree (FAST) of $\mathcal{T}$ if its frequency in $\mathcal{T}$ is at least $\gamma$ (i.e., $freq(T, \mathcal{T}) \geq \gamma$).

We say that a FAST is maximal if there is no other FAST that contains all the taxa in that FAST.
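Definitions 2 to 5 can be illustrated with a small Python sketch. It is not the implementation used in this paper; it assumes trees are stored as nested tuples with taxon labels as strings, and the helper names (`leaves`, `restrict`, `canon`, `is_present`, `freq`) as well as the three example trees are hypothetical choices made only for illustration.

```python
def leaves(tree):
    """Return the set of taxon labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon that is not in `keep` (Definition 2, repeated)."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]  # drop the internal node left with a single child
    return tuple(kids)


def canon(tree):
    """Canonical string of a rooted tree, independent of child order."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canon(child) for child in tree)) + ")"


def is_present(sub, tree):
    """Definition 3: `sub` is obtained from `tree` by contractions."""
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canon(restrict(tree, taxa)) == canon(sub)


def freq(sub, trees):
    """Definition 4: fraction of the trees in which `sub` is present."""
    return sum(is_present(sub, t) for t in trees) / len(trees)


# Hypothetical example: three five-taxon trees and one candidate subtree.
t1 = (((("A", "B"), "C"), "D"), "E")
t2 = ((("A", "B"), ("C", "E")), "D")
t3 = (((("A", "C"), "B"), "D"), "E")
trees = [t1, t2, t3]
sub = ((("A", "B"), "C"), "D")

gamma = 0.6  # frequency cutoff of Definition 5
print(freq(sub, trees), freq(sub, trees) >= gamma)
```

In this toy input the candidate is present in two of the three trees, so it is a FAST for a cutoff of 0.6 but not for a cutoff of 0.7.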
Clearly, larger FASTs indicate biologically more relevant consensus patterns. The following definition summarizes this.

Definition 6 (MFAST). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of $m$ phylogenetic trees. Let $\gamma$ be a number in the $[0, 1]$ interval that denotes the frequency cutoff. A FAST $T$ of $\mathcal{T}$ is a Maximum Frequent Agreement SubTree (MFAST) of $\mathcal{T}$ if there is no other FAST $T'$ of $\mathcal{T}$ that has a larger size than $T$.

Formally, given a set of phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ and a frequency cutoff $\gamma$, our goal in this paper is to find the MFASTs in $\mathcal{T}$. We develop an algorithm that aims to solve this problem. Table 1 lists the variables used throughout the rest of this paper.

Table 1 Commonly used variables and functions in this paper
Symbol                   Description
$\mathcal{T}$            A set of phylogenetic trees
$T_i$                    $i$th tree
$m$                      Number of trees in $\mathcal{T}$
$n$                      Number of taxa in each input tree
$a_i$                    $i$th taxon
$freq(T, \mathcal{T})$   Frequency of the subtree $T$ in $\mathcal{T}$
$\gamma$                 Frequency cutoff
$S_i$                    $i$th seed (each seed is a subtree of a tree in $\mathcal{T}$)
$k$                      Size of a seed
$c$                      Number of contractions used to create a seed

Phase one: Seed generation

The first phase extracts small subtrees from the given set of trees. From these subtrees we extract the basic building blocks which are used to construct MFASTs. We call these building blocks seeds. Conceptually, each seed is a phylogenetic tree that contains a small subset of the taxa that make up the trees in $\mathcal{T}$. We characterize each seed with the three features listed below. We elaborate on each feature later in this section.

1. Seed size ($k$) is the number of taxa in the seed.
2. Number of contractions ($c$) is the number of taxa we prune from a clade taken from an input tree in order to extract the seed.
3. Frequency ($f$) is the fraction of input trees in which the seed is present.

We explain the seed features with the help of Figures 3 and 4. The first two characteristics explain how a seed can be found in one of the trees in $\mathcal{T}$.
They indicate that there is a clade of a tree in $\mathcal{T}$ such that this clade contains $k + c$ taxa and can be transformed into that seed by $c$ contractions. For instance, in Figure 3, when $k = 2$ and $c = 0$, only seed $S_1$ can be extracted from $T_1$ by choosing the clade rooted at $x_2$. When $k = 2$ and $c = 1$, seeds $S_1$, $S_2$ and $S_3$ can be obtained using one contraction ($a_3$, $a_2$ and $a_1$ respectively) from the clade rooted at $x_1$. The last feature denotes the fraction of trees in $\mathcal{T}$ in which the seed is present. For example, in Figure 4, there are nine seeds $S_1, S_2, \ldots, S_9$ extracted from the three input trees using only one contraction. Among these, the frequency of $S_1$ is 1 as it is present in all the trees. The frequency of $S_2$ is about 0.67 as it is present in only two out of three trees ($T_1$ and $T_2$). The frequency of the rest of the seeds is only about 0.33. Recall that, by definition, an MFAST is present in at least a fraction $\gamma$ of the trees in $\mathcal{T}$. Therefore, we consider only the seeds whose frequency values are equal to or greater than this number (i.e., $f \geq \gamma$).

Figure 3 $T_1$ is an input tree built on four taxa $a_1$, $a_2$, $a_3$ and $a_4$. The internal nodes of $T_1$ are labeled as $x_0$, $x_1$ and $x_2$. $S_1$ is the only seed obtained from $T_1$ when $k = 2$ and $c = 0$. That is, $S_1$ is identical to the clade rooted at $x_2$. $S_1$, $S_2$ and $S_3$ are the seeds extracted from $T_1$ when $k = 2$ and $c = 1$. They are all extracted from the clade rooted at $x_1$ by contracting $a_3$, $a_2$ and $a_1$ respectively.

Given the values of $k$, $c$ and $\gamma$, we extract all the seeds which possess the desired feature values from the set of input trees as follows. In the Newick string representation of a tree, a pair of matching parentheses corresponds to an internal node in the tree. The number of taxa in the clade rooted at this internal node is given by the number of labels between the two matching parentheses. Following from this observation, we scan the Newick string of each tree one by one.
For each such tree, we identify the clades which have $k + c$ taxa. Notice that, if a tree contains $n$ taxa, then it contains at most $\frac{n}{k+c}$ clades of size $k + c$, as no two such clades can contain common taxa. We then extract all combinations of $k$ taxa from each of these clades by contracting the remaining $c$ taxa. The number of ways this can be done is $\binom{k+c}{k}$. Notice that all the small trees extracted this way possess the first two characteristics explained above. At this point, however, we do not know their frequencies. Therefore, we call them potential seeds.

It is worth mentioning that the same seed might be extracted from different trees. When we extract a new potential seed, we check whether it is already present in the list of potential seeds before storing it. We include it in the list only if it is not there yet. Otherwise, we ignore it. This way, we maintain only one copy of each seed.

Once we build our potential seed list for all the trees in $\mathcal{T}$, we go over the potential seeds one by one and compute the frequency of each in $\mathcal{T}$ as the fraction of trees that contain it. We filter out all the potential seeds whose frequencies are less than the frequency cutoff. We keep the remaining ones as the list of seeds along with the frequency of each seed.

In Figure 4, consider the tree $T_1$ that has four taxa. For $k = 3$ and $c = 1$, there is only one clade of size $k + c = 4$, which is the tree $T_1$ itself. We extract four potential seeds, each having three leaves, from this tree. The potential seeds in this figure are given by $S_1$, $S_2$, $S_5$ and $S_7$, which we extract by contracting $a_4$, $a_3$, $a_2$ and $a_1$ respectively from $T_1$.

Phase two: Seed combination

At the end of the first phase, we obtain a set of frequent seeds from the input trees. Notice that each seed is a FAST, as each seed is present in the number of trees required by $\gamma$. These seeds are the basic building blocks of our method. In the second phase of our method, we combine subsets of these seeds to construct larger FASTs.
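As a recap of phase one before seeds are combined, the seed extraction and filtering steps can be sketched in Python as follows. This is only an illustration under stated assumptions: trees are nested tuples rather than Newick strings, so clades are enumerated by walking the tuples instead of matching parentheses, and every function name and the three four-taxon input trees are hypothetical (they are not the trees of Figure 4).

```python
from itertools import combinations


def leaves(tree):
    """Return the set of taxon labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon that is not in `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def canon(tree):
    """Canonical string of a rooted tree, independent of child order."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canon(child) for child in tree)) + ")"


def clades(tree):
    """Yield the clade rooted at every internal node of `tree`."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree:
        yield from clades(child)


def potential_seeds(trees, k, c):
    """Collect one copy of every k-taxon tree cut from a clade of size k + c."""
    seeds = {}
    for tree in trees:
        for clade in clades(tree):
            taxa = sorted(leaves(clade))
            if len(taxa) != k + c:
                continue
            for kept in combinations(taxa, k):  # contract the other c taxa
                seed = restrict(clade, set(kept))
                seeds.setdefault(canon(seed), seed)  # keep a single copy
    return seeds


def frequent_seeds(trees, k, c, gamma):
    """Return (frequency, seed) pairs with frequency >= gamma, most frequent first."""
    result = []
    for key, seed in potential_seeds(trees, k, c).items():
        taxa = leaves(seed)
        hits = sum(taxa <= leaves(t) and canon(restrict(t, taxa)) == key for t in trees)
        if hits / len(trees) >= gamma:
            result.append((hits / len(trees), key))
    return sorted(result, key=lambda item: (-item[0], item[1]))


# Hypothetical input: three four-taxon trees, seeds with k = 3 and c = 1.
trees = [
    ((("a1", "a2"), "a3"), "a4"),
    ((("a1", "a2"), "a4"), "a3"),
    ((("a1", "a3"), "a2"), "a4"),
]
for f, seed in frequent_seeds(trees, k=3, c=1, gamma=0.6):
    print(round(f, 2), seed)
```

Storing seeds under a canonical string is one simple way to detect that the same seed was extracted from different trees; the sorted output matches the assumption of phase two that seeds are ordered by decreasing frequency.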
We first define what it means to combine two seeds. In order to combine two seeds, it is a necessary condition that both seeds are present in at least one common tree $T$ in $\mathcal{T}$. We call such a tree $T$ the reference tree. We combine two seeds with the guidance of a reference tree. Let $S_1$ and $S_2$ be two seeds and let $T$ be their reference tree. Let $L_1$, $L_2$ and $L$ be the sets of taxa in $S_1$, $S_2$ and $T$ respectively. Combining $S_1$ and $S_2$ results in the tree that is equivalent to the one obtained by contracting the taxa in $L - (L_1 \cup L_2)$ from $T$. For simplicity, we denote the combine operation that uses $T$ as the reference tree with the $\oplus_T$ symbol. For instance, we denote combining $S_1$ and $S_2$ with $T$ being the reference tree as $S_1 \oplus_T S_2$. To simplify our notation, whenever the identity of the reference tree is irrelevant, we use the symbol $\oplus$ instead of $\oplus_T$.

Figure 4 The set of input trees $T_1$, $T_2$, $T_3$ and the set of all nine potential seeds $S_1, S_2, \ldots, S_9$ when the seed characteristics are set to $k = 3$ and $c = 1$. All the potential seeds have three taxa as $k = 3$. We need one contraction from the input tree to obtain each seed. $S_5$ has frequency 1.0 as it is present in $T_1$, $T_2$ and $T_3$. Seed $S_2$ has frequency ~0.67 as it is present in $T_1$ and $T_2$. The remaining seeds have frequency ~0.33 as each appears in only one of the three trees.

Figure 5 demonstrates how two seeds $S_1$ and $S_2$ are combined with the help of the reference tree $T$. In this figure, both $S_1$ and $S_2$ are subtrees of $T$. Thus, it is possible to use $T$ as the reference tree. We have $L_1 = \{a_1, a_3, a_4\}$ and $L_2 = \{a_1, a_2, a_5, a_7\}$. Thus, we build $C = S_1 \oplus_T S_2$ by contracting the taxa in $L - (L_1 \cup L_2) = \{a_6, a_8\}$ from $T$.

So far, we have explained how to combine two seeds $S_1$ and $S_2$ using a reference tree.
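The combine operation can be sketched in a few lines of Python. The sketch assumes nested-tuple trees and uses hypothetical helper names; the taxon sets follow the Figure 5 example, but the topology of the reference tree below is invented for illustration because the figure itself is not reproduced here.

```python
def leaves(tree):
    """Return the set of taxon labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon that is not in `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def canon(tree):
    """Canonical string of a rooted tree, independent of child order."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canon(child) for child in tree)) + ")"


def is_present(sub, tree):
    """True if `sub` can be obtained from `tree` by contractions."""
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canon(restrict(tree, taxa)) == canon(sub)


def combine(s1, s2, ref):
    """Combine two seeds by contracting L - (L1 union L2) from the reference tree."""
    if not (is_present(s1, ref) and is_present(s2, ref)):
        raise ValueError("both seeds must be present in the reference tree")
    return restrict(ref, leaves(s1) | leaves(s2))


# Hypothetical reference tree on a1..a8 (topology invented for this sketch).
ref = ((("a1", "a2"), ("a3", "a4")), (("a5", "a6"), ("a7", "a8")))
s1 = restrict(ref, {"a1", "a3", "a4"})        # L1 as in the Figure 5 example
s2 = restrict(ref, {"a1", "a2", "a5", "a7"})  # L2 as in the Figure 5 example

combined = combine(s1, s2, ref)
print(canon(combined))
print(sorted(leaves(ref) - leaves(combined)))  # the contracted taxa
```

Because the result is cut directly from the reference tree, both seeds are subtrees of the combined tree by construction, which is the property that Proposition 1 relies on.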
It is possible that many trees in $\mathcal{T}$ contain both seeds. Thus, one question is which of these trees we should use as the reference tree to combine the two seeds. The brief answer is that all such trees need to be considered. However, we make several observations that help us avoid combining $S_1$ and $S_2$ exhaustively with every such reference tree, while still not overlooking any of the combined topologies these trees support. We explain them next.

Consider two trees $T_1$ and $T_2$ from $\mathcal{T}$ in which both seeds are present. There are two cases for $T_1$ and $T_2$.

CASE 1: $S_1 \oplus_{T_1} S_2 = S_1 \oplus_{T_2} S_2$. In this case, it does not matter whether we use $T_1$ or $T_2$ as the reference tree. They both lead to the same combined subtree. Thus, we use only one of them.

CASE 2: $S_1 \oplus_{T_1} S_2 \neq S_1 \oplus_{T_2} S_2$. In this case, the trees $T_1$ and $T_2$ lead to alternative combination topologies. So, we consider both of them separately.

We utilize the observations above as follows. We start by picking one reference tree arbitrarily. Once we create a combined subtree using that tree, we check whether that subtree is present in the remaining trees in $\mathcal{T}$. We mark the trees that contain it as considered and never use them as a reference for the same seed pair again, because those trees fall into the first case described above. This step also gives us the frequency of the combined subtree in $\mathcal{T}$. If the number of unmarked trees is too small (i.e., less than $\gamma \times m$), then even if all the remaining trees agree on the same combined topology for the two seeds under consideration, they are not sufficient to make it a FAST. Thus, we do not use any of the remaining trees as a reference for those two seeds. Otherwise, we pick another unmarked tree arbitrarily and repeat the same process until we run out of reference trees.

The next question we need to answer is which seed pairs we should combine. To answer this question, we first make the following proposition.

Proposition 1.
Assume that we are given a set of phylogenetic trees $\mathcal{T}$. Let $S_1$ and $S_2$ be two seeds constructed from the trees in $\mathcal{T}$. For all trees $T \in \mathcal{T}$, we have the following inequality:

$freq(S_1 \oplus_T S_2, \mathcal{T}) \leq \min\{freq(S_1, \mathcal{T}), freq(S_2, \mathcal{T})\}$

Proof. For any $T$, both $S_1$ and $S_2$ are subtrees of $S_1 \oplus_T S_2$. Thus, if $S_1 \oplus_T S_2$ is present in a tree, then both $S_1$ and $S_2$ are present in that tree. As a result, $freq(S_1 \oplus_T S_2, \mathcal{T}) \leq freq(S_1, \mathcal{T})$ and $freq(S_1 \oplus_T S_2, \mathcal{T}) \leq freq(S_2, \mathcal{T})$. Hence, $freq(S_1 \oplus_T S_2, \mathcal{T}) \leq \min\{freq(S_1, \mathcal{T}), freq(S_2, \mathcal{T})\}$.

Figure 5 $T$ is the reference tree. $S_1$ and $S_2$ are the seeds to be combined; both are present in $T$. $C$ is obtained by pruning the subtree containing taxa $a_1$, $a_2$, $a_3$, $a_4$, $a_5$ and $a_7$ from $T$.

Proposition 1 states that as we combine pairs of seeds to grow them, their frequency never increases. This suggests that it is desirable to combine two seeds if both of them have large frequencies. If one of them has a small frequency, the combined tree will have a small frequency regardless of the frequency of the other. As a result, its chance to grow into a larger tree through additional combine operations gets smaller. Following this intuition, we develop two approaches for combining the seeds.

1. Inorder Combination (Section "Inorder combination").
2. Minimum Overlap Combination (Section "Minimum overlap combination").

Both approaches accept the list of seeds computed in the first phase as input and produce a larger FAST that is a combination of multiple seeds. Both of them also assume that the list of input seeds is already sorted in decreasing order of frequency. We discuss these approaches next.

Inorder combination

The inorder combination approach follows from Proposition 1.
It assumes that the seeds with higher frequencies have greater potential to be a part of an MFAST. It exploits this assumption as follows. First, it picks a seed as the starting point to create a FAST. It then grows this seed by combining it with other seeds, starting from the most frequent one, as long as the frequency of the resulting tree remains at least as large as the given cutoff $\gamma$. It repeats this process by trying each seed as the starting point. Algorithm 1 presents this approach.

Algorithm 1 Inorder combination
  FAST $\leftarrow \emptyset$
  for all seeds $S_i$ do
    FAST' $\leftarrow S_i$
    Mark $S_i$ as considered
    repeat
      $S_j \leftarrow$ seed with highest frequency among unconsidered seeds
      Mark $S_j$ as considered
      CUTOFF $\leftarrow \gamma$
      tFAST' $\leftarrow$ FAST'
      repeat
        Pick the next unconsidered tree $T \in \mathcal{T}$ as reference
        Mark all the trees that contain FAST' $\oplus_T S_j$ as considered
        if $freq($FAST' $\oplus_T S_j, \mathcal{T}) \geq$ CUTOFF then
          tFAST' $\leftarrow$ FAST' $\oplus_T S_j$
          CUTOFF $\leftarrow freq($FAST' $\oplus_T S_j, \mathcal{T})$
        end if
      until Less than $\gamma \times m$ unmarked reference trees are left in $\mathcal{T}$
      FAST' $\leftarrow$ tFAST'
      Unmark all trees in $\mathcal{T}$
    until all seeds are considered
    if size of FAST' > size of FAST then
      FAST $\leftarrow$ FAST'
    end if
    Unmark all seeds
  end for

In Algorithm 1 we first initialize the FAST as empty. We then consider each seed one by one. We initialize a temporary subtree, denoted by FAST', with the seed $S_i$ under consideration and mark $S_i$ as considered. We combine FAST' with the seed $S_j$ that has the highest frequency among the seeds that have not been added. If multiple seeds have the highest frequency, we randomly pick one of them and mark that seed $S_j$ as added to FAST'. There can be alternative ways to combine FAST' with $S_j$, leading to different topologies. We use the trees in $\mathcal{T}$ that contain both FAST' and $S_j$ as guides to try only the topologies that exist in $\mathcal{T}$.
We stop constructing alternative topologies as soon as we know that there are not enough trees left to yield a frequency of $\gamma$. We set FAST' to the combined seed if the combined seed has a large enough frequency. We then consider the seed with the next highest frequency for addition and repeat this step until all $S_j$ have been considered. If the resulting temporary FAST is larger than FAST, we replace the smaller FAST with the larger one. In the next iteration, we initialize FAST' with the next $S_i$. Using this approach, we can initialize FAST' with every $S_i$. Alternatively, if the user wishes to limit the running time with a maximum time cutoff, we stop the outermost loop (i.e., alternative initializations of FAST') as soon as the allowed running time budget is reached.

Notice that in Algorithm 1 each seed $S_i$ can lead to a different FAST. We record only the FAST that has the largest size. However, it is trivial to maintain the top $k$ FASTs with the largest size instead, if the user is looking for $k$ alternative maximal FASTs.

Minimum overlap combination

The purpose of combining seeds is to construct a FAST that is large in size. Our inorder combination approach (Section "Inorder combination") aims to maximize the frequency of the combined seeds. In this section, we develop our second approach, named Minimum Overlap Combination. This approach picks seeds so that their combination produces as large a subtree as possible. We elaborate on this approach next.

When we combine two seeds, the size of the resulting tree is at least as big as the size of each of these seeds. Formally, let $S_1$ and $S_2$ be two seeds (i.e., trees). Let $L_1$ and $L_2$ be the sets of taxa contained in $S_1$ and $S_2$. We denote the size of a set, say $L_1$, with $|L_1|$. The size of the tree resulting from the combination of $S_1$ and $S_2$ is $|L_1| + |L_2| - |L_1 \cap L_2|$. For a given fixed seed size, the first two terms of this formulation remain unchanged regardless of the seed.
The last term determines the growth in the size of the FAST. Thus, in order to grow the FAST rapidly, it is desirable to combine two frequent subtrees with a small number of common taxa.

Our second approach follows from the observation above. We introduce a criterion called the overlap between two subtrees, defined as the number of taxa common to both of them. Our minimum overlap combination approach works the same as Algorithm 1 with a minor difference in selecting the seed $S_j$ that will be combined with the current temporary FAST (i.e., FAST'). Rather than choosing the seed with the largest frequency, this approach chooses the one that has the least overlap with FAST' among all the unconsidered and frequent seeds. If multiple seeds have the same smallest overlap, it uses the frequency as the tie breaker and chooses the one with the largest frequency among those.

Phase three: Postprocessing

So far we described how to obtain seeds (Section "Phase one: Seed generation") and how to combine them to construct a FAST (Section "Phase two: Seed combination"). The two approaches we developed for combining seeds aim to maximize the size of the FAST. However, they do not ensure the maximality of the resulting FAST. There are two main reasons that prevent our seed combining algorithms from constructing a maximal FAST. First, some of the taxa of a maximal FAST may not appear in any seed (i.e., false negatives). As a result, no combination of seeds will lead to that maximal FAST. Second, even if every taxon of a maximal FAST is part of at least one seed, our algorithms will refuse to combine such a seed with the current FAST if the seed also contains taxa that are not part of the maximal FAST (i.e., false positives).

In the postprocessing phase, we tackle the above-mentioned problem. Algorithm 2 describes the post processing phase in detail. We do this by considering all taxa which are not already present in the FAST one by one.
We iteratively grow the current FAST by including one more taxon at a time if the frequency of the resulting FAST remains at least as large as the frequency cutoff $\gamma$. We repeat these iterations until no new taxon can be included in the FAST. Thus the resulting FAST is guaranteed to be maximal.

Algorithm 2 Post processing
  INPUT = FAST from the seed combination phase
  INPUT = $\mathcal{T}$
  OUTPUT = Maximal FAST
  RESULT $\leftarrow$ FAST
  for all $a_i$ not in FAST do
    CUTOFF $\leftarrow \gamma$
    tRESULT $\leftarrow$ RESULT
    repeat
      Pick the next unconsidered tree $T \in \mathcal{T}$ as reference
      RESULT' $\leftarrow$ RESULT $\oplus_T a_i$
      Mark all the trees that contain RESULT' as considered
      if frequency of RESULT' $\geq$ CUTOFF then
        tRESULT $\leftarrow$ RESULT'
        CUTOFF $\leftarrow$ frequency of RESULT'
      end if
    until Less than $\gamma \times m$ unmarked reference trees are left in $\mathcal{T}$
    RESULT $\leftarrow$ tRESULT
    Unmark all trees in $\mathcal{T}$
  end for
  return RESULT

We expect the post processing step to quickly identify the taxa that have the potential to be in an MFAST but might not have been considered during the seed generation and seed combination phases. At the end of the post processing step we obtain a maximal FAST, which we report as a possible MFAST.

Complexity analysis of our method

In this section we discuss the complexity of our method in terms of the three phases involved in it. Let $\mathcal{T}$ be a set of $m$ phylogenetic trees having $n$ leaves each. The complexity of the different phases of our method is as follows.

Phase one. Finding the seeds involves enumerating all the subtrees and checking their frequencies. Given seed size $k$ and number of contractions $c$, each tree contains at most $\frac{n}{k+c}$ clades, each leading to $\binom{k+c}{k}$ alternative subtrees. Thus, in total there can be up to $\frac{mn}{k+c}\binom{k+c}{k}$ seeds (possibly many of them identical) from all the trees in $\mathcal{T}$. Typically, the values of $k$ and $c$ are fixed and small (in our experiments we have $k \in \{3, 4, 5\}$ and $c \in \{0, 1, 2, 3, 4, 5\}$), leading to $O(mn)$ seeds.
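To give a feel for the size of this bound, the short Python sketch below evaluates $\frac{mn}{k+c}\binom{k+c}{k}$ for a few parameter settings. The function name and the chosen values of $m$ and $n$ are illustrative assumptions, not figures reported in this paper; the sketch also rounds $\frac{n}{k+c}$ down, since the number of disjoint clades is an integer.

```python
from math import comb


def max_potential_seeds(m, n, k, c):
    """Upper bound on the number of potential seeds (duplicates included)."""
    clades_per_tree = n // (k + c)  # disjoint clades with exactly k + c taxa
    return m * clades_per_tree * comb(k + c, k)


# Hypothetical setting: 100 trees with 1000 taxa each.
for k, c in [(3, 0), (3, 2), (5, 5)]:
    print(k, c, max_potential_seeds(100, 1000, k, c))
```

For fixed small $k$ and $c$ the binomial factor is a constant, which is why the bound grows linearly in $m$ and $n$.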
The complexity of finding whether a seed is present in a single tree is $O(n \log n)$. Given that there are $m$ trees in $\mathcal{T}$, the cost of computing the frequency of a single seed is $O(mn \log n)$. Thus, the time complexity for finding the frequency of all the seeds is this expression multiplied by the number of seeds, which is $O(m^2 n^2 \log n)$.

Phase two. Consider a set of $p$ frequent seeds that will be considered for combining in this phase. Recall that we have two approaches to combine them. Below, we focus on each.

INORDER COMBINATION. We try to combine each seed with every other seed, leading to $O(p^2)$ iterations. The complexity of checking the frequency of each combined subtree is $O(mn \log n)$. Also, there can be up to $O(m)$ different reference trees for guiding the combine operation. Multiplying these terms, we obtain the complexity of phase two using this approach as $O(p^2 m^2 n \log n)$.

MINIMUM OVERLAP COMBINATION. The complexity of combining the frequent seeds using the minimum overlap combination approach is very similar to that of the inorder approach except for an additional term. The additional complexity arises because we maintain the overlap between the subtrees. This leads to the complexity $O(p^2 n^2 + p^2 m^2 n \log n)$.

Phase three. Here, we consider the FAST obtained from each of the $p$ frequent seeds in phase two. For each FAST, we go over the taxa one by one, leading to $O(n)$ iterations. There can be up to $O(\gamma \times m)$ references to add a taxon. So the cost of extending all $p$ FASTs is $O(\gamma \times mnp)$.

Notice that each frequent seed has to appear in at least $\gamma \times m$ trees. Thus, the number of unique frequent seeds $p$ is bounded by $O(\frac{mn}{\gamma m}) = O(\frac{n}{\gamma})$. Substituting this bound for $p$ and adding the cost of all three phases, the overall time complexity of our method using inorder combination is

$O(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + mn^2)$.

That using minimum overlap combination is

$O(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + \frac{n^4}{\gamma^2} + mn^2)$.

In the two summations above, the second term is asymptotically larger than the first and the last terms.
Thus, we can simplify the asymptotic time complexity of inorder and minimum overlap combinations as $O(\frac{m^2 n^3 \log n}{\gamma^2})$ and $O(\frac{n^3 \log n \, (m^2 + n)}{\gamma^2})$ respectively.

Results and discussion
This section evaluates the performance of our MFAST algorithm experimentally.

Implementation details. We implemented our MFAST algorithm using C and Perl. More specifically, we implemented the first two phases (seed generation and seed combination) in C and the third phase (post processing) in Perl. We utilize the functions provided in the Newick Utilities [25] package by modifying the source code provided in that package. We use k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5} in all of our experiments unless otherwise stated. In our experiments, we observed that the minimum overlap combination produced larger MFASTs than the inorder combination approach. Therefore, we limit our experimental results to the minimum overlap combination approach.

Methods compared against. We have compared our method against Phylominer [20] and the MAST command implemented in PAUP* [26]. Among these, Phylominer also seeks MFASTs in a collection of trees. However, the time complexity of this method is exponential in the size of the input trees, and hence it becomes intractable for large trees. In our experiments, we observed that it does not scale beyond 50 taxa. PAUP* is primarily a program for phylogenetic inference, although it can also compute MASTs. MASTs have a strict 100% agreement criterion, unlike the arbitrary frequency cutoff values γ in our method.

Evaluation Criteria. We evaluate our algorithm based on the size of the MFAST found. Larger MFASTs are preferable. When possible, we report the size of the optimal solution as well.

Test Environment.
We ran our experiments on Linux servers equipped with dual AMD Opteron dual core processors running at 2.2 GHz and 3 GB of main memory.

Datasets
We test the performance and verify the results of our method on synthetic datasets and real datasets.

SYNTHETIC DATASET. We built synthetic datasets in which we embedded an MFAST as described below. We characterize each synthetic dataset using five parameters: tree size (n), number of trees (m), MFAST frequency (f), MFAST size (n'), and noise percentage (ε). The first two parameters denote the size and number of trees in T. MFAST frequency specifies the fraction of trees in T which contain an MFAST. MFAST size is the number of taxa in the embedded MFAST. The noise percentage is the percentage of taxa that are not a part of the embedded MFAST but are placed on the branches within the clade that contains the MFAST. We place all the other taxa on the branches outside this clade.

Given an instantiation of these parameters, we first created a tree that has n' taxa. This tree serves as the MFAST. We then created m × f trees that contain this MFAST. We build each of these trees by inserting n − n' taxa randomly in the MFAST. With probability ε we insert each taxon within the clade that contains the MFAST. With probability 1 − ε we insert it outside that clade. We then created m − (m × f) trees that do not contain the current MFAST. We simply do this by inserting all the taxa one by one at a random location.

REAL DATASETS. We use two empirical datasets to evaluate the performance of our heuristic. The data sets contain 200 bootstrap trees generated from phylogenetic analysis of the Gymnosperm [27] and Saxifragales (Burleigh, unpublished) plant clades. To make the bootstrap trees, we assembled supermatrices, matrices of concatenated gene alignments with partial taxon overlap, from gene sequence data available in GenBank. We performed a maximum likelihood bootstrap analysis on each supermatrix using RAxML v.
7.0.4 [28]. The Gymnosperm trees each contain 959 taxa, and the Saxifragales trees each contain 950 taxa.

Effects of number of input trees
In our first experiment, we analyze how the number of input trees in T affects the performance of our algorithm. For this purpose, we created 30 synthetic datasets. The size of the embedded MFAST in all the datasets was 15. Among these 30 datasets, 10 contained 50 trees, 10 contained 100 trees and 10 contained 200 trees. We set the noise percentage to 20% in all the datasets. The frequency of the embedded MFAST was 0.8. We set the number of taxa in all the trees in these datasets to 100. We ran our algorithm on each of these datasets to find the size of the MFAST for γ = 0.7.

Table 2 lists the average MFAST size we found for each of the dataset sizes before post processing (i.e., at the end of phase two) and after post processing (i.e., at the end of phase three). The results demonstrate that our method can identify an MFAST that is almost as big as the embedded one even without post processing, regardless of the number of trees in the dataset. Post processing improves the MFAST size slightly. On average, we always find an MFAST that is as large as or larger than the embedded one. An MFAST larger than 15 here implies that while randomly inserting the taxa that are not in the embedded MFAST, at least one of them was placed under the same clade at least a fraction γ of the time. More importantly, our method successfully located such taxa along with the rest of the MFAST.

Effects of tree size
Our second experiment considers the impact of the number of taxa in the input trees contained in T on the success of our method. To carry out this test, we built datasets with varying tree sizes (i.e., n). Particularly, we used n = 100, 250, 500 and 1000. For each value of n, we repeated the experiment 10 times by creating 10 datasets with the same properties.
In all datasets, we set the number of trees to m = 100, the noise percentage to ε = 20%, the size of the embedded MFAST to 15% of n, and the MFAST frequency to 0.8. Table 3 reports the average MFAST size found by our method for varying tree sizes. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets.

Table 2 Evaluation of the effect of the number of trees in T
Columns: Number of trees; MFAST size before post processing; MFAST size after post processing.
The number of trees is set to 50, 100 and 200. For each number of trees we run our experiments on ten datasets. Each dataset contains trees with 100 taxa and an embedded MFAST of size 15. We report the average size of the MFAST obtained by our method across the ten datasets.

Before going into a detailed discussion of the results, it is crucial to observe that our method could run to completion for datasets that have as many as 1000 taxa. When we tried to run Phylominer, it did not return any results for datasets that have more than 100 taxa. The results also demonstrate that our method could successfully identify the embedded MFAST in all the datasets regardless of the size of the input trees. In some datasets, the reported MFAST was slightly larger than the embedded one. This indicates that while randomly inserting the taxa that are not part of the embedded MFAST, it is possible that a few taxa were consistently placed under the same clade.

The results also suggest that our method identifies a significant percentage of the taxa in the embedded MFAST after the second phase (i.e., before post processing) when the tree size is small. As the tree size grows, it starts missing some taxa at this phase. It however recovers the missing taxa during the post processing phase even for the largest tree size.
This indicates that at the end of phase two, our method could identify a backbone of the actual MFAST. The unidentified taxa at this phase are scattered throughout the clades in the input trees. Thus, there is no clade of size k + c that contains them with c contractions for small k or c. As evident from Table 3, this however does not prevent our method from recovering them. This is because the backbone reported at the end of phase two is large enough, and thus specific enough, to recover the missing taxa one by one in the last phase. This is a significant observation as it demonstrates that our method works well even with small values of k and c.

Table 3 Evaluation of the effect of the size of the trees in T
Columns: Number of taxa; Embedded MFAST size; Reported MFAST size before post processing; Reported MFAST size after post processing.

Effects of noise percentage
Recall that the noise percentage ε denotes the percentage of taxa that are added inside the clade that contains the MFAST. As ε increases, the pairs of taxa in the MFAST get farther away from each other in the tree that contains it. As a result, fewer taxa from the MFAST will be contained in small clades of size k + c. This raises the question whether our method works well as ε increases and thus the MFAST taxa get scattered around in the trees that contain it.

In this experiment, we answer the question above and analyze the effect of the noise percentage on the success of our method. We create synthetic datasets with various ε values. Particularly, we use ε = 20, 40 and 60%. We set the size of the embedded MFAST to n' = 15, the tree size to n = 100, the number of trees to m = 100 and the MFAST frequency to f = 0.8. We repeat our experiment for each parameter 10 times by recreating the dataset randomly using the same parameters. We set the frequency cutoff to γ = 0.7.

We report the average MFAST size found by our method in Table 4. The results suggest that our method can identify the embedded MFAST successfully even when the noise percentage is very high.
We observe that the size of the MFAST found by our method before post processing decreases slowly with increasing amount of noise. This is not surprising as the taxa contained in the embedded MFAST get more spread out (and thus farther away from each other) in the trees in T with increasing noise. As a result, if there are taxa that are not part of any seed with the provided values of k and c, they will never be included in the computed MFAST at the end of phase two. We however observe that (i) only a small number of such taxa exist. For instance, even for the largest noise percentage (ε = 60%), only 2.3 taxa (i.e., 15 − 12.7) are missing on average. (ii) The missing taxa are recovered during phase three. This is because the computed MFAST at the end of phase two is very large, and thus it is specific to the embedded MFAST.

Number of taxa    Before post processing    After post processing
100               15.3                      15.8
250               32.3                      38.8
500               43.7                      76.0
1000              69.8                      151.0
The tree size is set to 100, 250, 500 and 1000. For each tree size we run our experiments on ten datasets. Each dataset contains 100 trees with an embedded MFAST of size 15% of the input tree size. The last two columns list the average size of the MFAST found by our method across the ten datasets.

Table 4 Evaluation of the effect of the noise in the trees in T
Noise (%)    MFAST size before post processing    MFAST size after post processing
20           15.3                                 15.8
40
60           12.7                                 15.0
The size of the embedded MFAST in all the experiments is 15. We list the average size of the MFAST found by our method before and after the post processing phase.

Impact of seed creation
So far, in our experiments we consistently observed two major points for all the parameter settings (see Sections "Effects of number of input trees" to "Effects of noise percentage"): (i) Our method always finds a large subtree of the embedded MFAST after phase two.
(ii) Our method always recovers the entire embedded MFAST after phase three. The second observation follows from the first: the outcome of phase two is large enough to build the entire MFAST precisely. The first observation however indicates that the set of seeds generated in phase one contains a significant percentage of the taxa in the embedded MFAST. In this section, we take a closer look into this phenomenon and explain why this is the case even for small values of seed size k and contraction amount c, and large noise percentage ε. To do that, we will compute the probability that a subset of the taxa of the embedded MFAST appears in at least one seed generated in phase one. In our computation, we will assume that the taxa can appear at any location of a given tree with the same probability. We discuss the implication of this assumption later in this section.

The number of rooted bifurcating trees for a given set of n taxa is
$R(n) = \frac{(2n-3)!}{2^{n-2}(n-2)!}$.
Consider a clade with k + c taxa. The number of trees with n taxa that contain this clade is $R(n - (k + c) + 1)$ as the topology of the k + c sized clade is fixed. For a given subtree with k taxa, let us denote the number of clade topologies of size k + c that contain that subtree with NU(k, c). We can compute this function recursively as NU(k, 0) = 1 and, for c > 0,
$NU(k, c) = NU(k, c - 1) \times 2 \times (k + c - 2)$.
Let us denote one of these clades by U(k, c). Also, let us denote the probability that the clade U(k, c) exists in a random tree topology that contains n taxa with P(n, k, c). Intuitively, P(n, k, c) is the probability that our method will extract a specific k taxa subtree from one n taxa tree after only c contractions. We can compute this probability as the ratio of the number of tree topologies that satisfy this constraint to that of all possible tree topologies.
We formulate this as
$P(n, k, c) = \frac{NU(k, c) \times R(n - (k + c) + 1)}{R(n)}$.
Recall that it suffices for our algorithm to have a k taxa subtree of the MFAST in at least one tree in the given set of m trees. The probability that the clade U(k, c) exists in at least one of the m random tree topologies each containing n taxa is
$P(n, k, c, m) = 1 - (1 - P(n, k, c))^m$.
Assume that the MFAST size in the given set of trees T is h. Let us denote the number of k taxa subtrees of the MFAST as NS(h, k). The probability that at least one of these subtrees will be found in at least one of the input trees is then
$P(n, k, c, m, h) = 1 - (1 - P(n, k, c, m))^{NS(h, k)}$.
A lower bound to NS(h, k) is h − k + 1, which can be obtained by picking a contiguous block of k taxa from the canonical Newick representation of the MFAST by considering all possible h − k + 1 starting point locations. Notice that the larger the value of P(n, k, c, m, h), the higher the chances that our algorithm will construct some part of the MFAST. Similarly, the larger the value of NS(h, k), the higher the chances that our algorithm will construct some part of the MFAST.

Figure 6 plots the success probability (i.e., P(n, k, c, m, h)) of our method for varying parameter values. As the MFAST size increases, the success probability rapidly increases. This is because the number of alternative subtrees of the MFAST increases with increasing MFAST size. Thus, the chance of observing at least one increases as well. We observe that when the size of the MFAST is around 20% of the tree size, for all the parameters reported our success probability becomes almost 1. As the number of contractions increases, the probability of success increases. This is because a large number of contractions increases the possibility of eliminating false positive taxa from clades. In other words, it helps glue the taxa that are normally scattered in the input trees back together by removing the remaining taxa among them.
When c = 5, our success probability becomes almost one even for MFASTs that are as small as 4–6% of the tree size. As the number of trees increases, the success probability increases as well. This is because we have more alternative topologies with increasing number of trees. Thus, there are more chances to have a small clade that contains a part of the MFAST.

Figure 6 The probability of finding at least one seed which contains a part of an MFAST. The number of contractions c is set to 3, 4 and 5 and the seed size k is 5, 4 or 3. The x-axis shows the MFAST size in terms of the percentage of the number of taxa in the trees in T. In (a), we set the total number of trees m = 500. In (b) we set m = 1000.

Finally, it is worth noting that these results are computed based on the assumption that the trees in T are uniformly distributed among all possible topologies. In practice, we expect that these trees are constructed with the same or similar objectives (such as maximum parsimony or maximum likelihood). As a result, they will likely have a higher chance to contain large MFASTs. The results we expect in practice will thus be similar to or even better than the theoretical results in Figure 6.

Overall, we conclude from this experiment that even small values of k and c suffice to capture a part of the MFAST in phase two. Therefore, although our algorithm's complexity increases exponentially with k and c, we do not need to use large values for k and c. This enables our algorithm to scale to very large datasets with thousands of taxa and trees.
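The probability model of this section can be evaluated numerically. The sketch below simply codes the formulas for R(n), NU(k, c), P(n, k, c), P(n, k, c, m) and P(n, k, c, m, h) as they are written in the text, and it uses the lower bound h − k + 1 in place of NS(h, k). The function names and the example parameter values are assumptions made for illustration, and the sketch is not claimed to reproduce the curves of Figure 6, whose exact settings are not recoverable from the text.

```python
import math
from fractions import Fraction


def rooted_trees(n):
    """R(n) = (2n-3)! / (2^(n-2) (n-2)!), with R(1) taken as 1 for a single leaf."""
    if n < 2:
        return 1
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))


def clade_topologies(k, c):
    """NU(k, c) from NU(k, 0) = 1 and NU(k, c) = NU(k, c-1) * 2 * (k + c - 2)."""
    count = 1
    for j in range(1, c + 1):
        count *= 2 * (k + j - 2)
    return count


def p_single_tree(n, k, c):
    """P(n, k, c): exact ratio of tree counts, kept as a Fraction."""
    favourable = clade_topologies(k, c) * rooted_trees(n - (k + c) + 1)
    return Fraction(favourable, rooted_trees(n))


def at_least_once(p, trials):
    """1 - (1 - p)^trials, evaluated with log1p/expm1 to stay accurate for tiny p."""
    p = float(p)
    if p >= 1.0:
        return 1.0
    return -math.expm1(trials * math.log1p(-p))


def success_probability(n, k, c, m, h):
    """P(n, k, c, m, h) with NS(h, k) replaced by its lower bound h - k + 1."""
    p_m = at_least_once(p_single_tree(n, k, c), m)
    return at_least_once(p_m, h - k + 1)


# Illustrative call; the parameter values are arbitrary.
print(success_probability(n=100, k=3, c=5, m=500, h=15))
```

Keeping P(n, k, c) as an exact fraction avoids overflow from the large factorials, and the conversion to floating point happens only when the two "at least once" steps are applied.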
These results explain the theory behind the practical results we observed in Sections "Effects of number of input trees" to "Effects of noise percentage".

Evaluation of state of the art methods
So far, we have shown that our method could successfully find the MFASTs contained in sets of trees T for up to 1000 taxa and 200 trees (i.e., n = 1000 and m = 200). An obvious question is how well existing methods perform on the same datasets. Here, we answer this question for two existing programs, namely PAUP* (version 4.0b10) and Phylominer.

When we fix the number of trees and the number of taxa to 100, PAUP* was able to find the MAST for all datasets. As we grow the number of taxa to 250 or larger while keeping the number of trees at 100, PAUP* runs out of memory and fails to return any results. After reducing the number of trees to 50, PAUP* still runs out of memory and cannot report any results for more than 100 taxa.

The scalability problem of Phylominer is even more severe. Phylominer is able to compute the MFASTs on datasets with up to 20 taxa. However, as we increase the number of taxa further, its performance deteriorates quickly. When we set the number of taxa to 100, even with as few as 100 trees, Phylominer takes more than a week to report a result. Moreover, in our experiments, the largest subtrees it found contained on average fewer than 7 taxa, even though the size of the true MAST was 10.

Another interesting question about existing methods is whether the majority consensus rule can be used to find MFASTs. To evaluate this, we used the same three synthetic datasets used in Section "Effects of noise percentage". Recall that each of these three datasets contains an MFAST of size 15 which is embedded in 80% of the trees. The datasets are created with 20%, 40% and 60% noise, indicating different levels of difficulty in recovering the embedded MFAST. We computed the 70% majority consensus tree.
Notice that if the majority consensus rule can identify an MFAST, that would correspond to a bifurcating subtree topology in the consensus tree. In other words, a subtree is bifurcating in this experiment only if 70% or more of the input trees agree on the topology of that subtree. The resulting tree, however, was multifurcating for all the three datasets. This means that the majority consensus rule could not recover even a smaller portion of the embedded tree while our method was able to locate the entire MFAST successfully (see Table 4).

These results demonstrate that neither PAUP* nor Phylominer is well suited to finding agreement subtrees in larger datasets; our method scales better in terms of both the number of taxa and the number of trees. When PAUP* runs to completion, we observed that it reports the true results. Recall from previous experiments that our method always found the true results on the same datasets as well as larger datasets. This suggests that our method has the potential to have an impact in large scale phylogenetic analysis when existing methods fail.

Empirical dataset experiments
To examine the performance of the MFAST method on real data, we performed experiments using 200 maximum likelihood bootstrap trees from a phylogenetic analysis of gymnosperms (959 taxa) and Saxifragales (950 taxa). Specifically, we evaluated how the performance of the MFAST algorithm was affected by the number of input trees and the size of the input trees.

Effects of number of input trees
We first examined the effect of input tree number on the size of the MFAST. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 50 and 100 trees by randomly sampling from the original 200 trees without replacement.
We compared the average size of the MFAST in the 50 and 100 tree data sets with the size of the MFAST in the original 200 tree data set. First, in all analyses, the post processing step greatly increases the size of the MFAST, sometimes more than doubling it (Table 5). This increase is similar to the one observed in the 1000 taxon simulated data sets (Table 3), emphasizing again the importance of the post processing step with large tree data sets. Although the sizes of the MFASTs were similar, they decreased slightly with the addition of more trees (Table 5). This may simply be a matter of observing more conflict with more trees.

The large gap between the MFAST sizes before and after the post processing suggests that phase three is the main reason behind the success of our method, and thus, the costly seed combination phase (i.e., phase two) may be unnecessary. To test this conjecture, we ran a variant of our method with the second phase disabled; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table 5. The results demonstrate that although phase three can grow a large FAST, phase two is essential to find the largest frequent agreement subtree. In other words, post processing finds the true MFAST only if a large portion of it is already found (which is the role served by phase two). In conclusion, phase three of our method cannot replace phase two; both phases are essential for the success of our method.

Effects of size of input tree
Next, we examined the effect of the number of leaves in the input trees on the size of MFASTs. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 200 input trees with 100, 250, and 500 taxa. To make each set, we randomly selected 100, 250, or 500 taxa, and we deleted all other taxa from the original sets of 200 trees.
Thus, these sets of trees with 100, 250, or 500 taxa are subtrees of the original data sets. The size of the average MFAST increases with more taxa in the original trees (Table 6). However, interestingly, the average size of the MFASTs for the gymnosperm data set with 500 taxa is larger than the MFAST found from the original gymnosperm trees with all the taxa (Table 6). Since the MFASTs from the 500 taxon data sets should all be found within the full data set, this indicates that on the larger trees, our method may not always find the true (i.e., largest) MFAST. The full data sets may require a larger number of contractions to find the true MFASTs.

Table 5 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three)
                   Gymnosperms                Saxifragales
Number of trees    Before   After    Only     Before   After    Only
50                 78.5     129.8    99.5     64.7     122.0    84.1
100                68.4     119.2    83.1     55.4     112.8    74.7
200                76.0     118.0    84.0     40.0     105.0    75.0
The size of the MFAST found by running only the post processing step is also shown. We run our method on the entire dataset that contains 200 trees as well as randomly selected subsets of 50 and 100 trees. We repeated the 50 and 100 tree experiments 10 times by randomly selecting the trees from the entire dataset and reported the average value.

Table 6 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three) for different number of taxa

Similar to the experiments in Section "Effects of number of input trees", we investigated the gap between the MFAST sizes before and after the post processing step. We ran a variant of our method with the second phase disabled; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant
in Table 6. The results are in parallel with those in Table 5. Phase three can grow a large frequent agreement subtree, but not quite as big as that when both phase two and three are executed.

                    Gymnosperms                Saxifragales
Number of leaves    Before   After    Only     Before   After    Only
100                 41.2     56.1     43.5     43.5     50.7     38.5
250                 67.2     88.5     63.0     62.3     76.2     54.6
500                 91.6     123.0    74.9     52.0     86.7     62.9
All taxa            76.0     118.0    84.0     40.0     105.0    75.0
The size of the MFAST found by running only the post processing step is also shown. We run our method on the entire dataset that contains all the taxa (last row) as well as randomly selected taxa subsets of size 100, 250 and 500. We repeated the 100, 250 and 500 taxa experiments 10 times by randomly selecting the taxa from the entire dataset and reported the average value.

Table 7 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets for different random subsamples of the total number of taxa
Columns: Sampling percentage; MFAST size (Gymnosperms). Surviving entries: 87.5, 88.4, 875.
We run our method by randomly picking 2%, 5%, 10%, 25%, 50%, 100% of the seeds found in phase one for combination in phase two.

Effects of sample size
In our final experiment, we evaluated the effect of the maximum time cutoff we described in Section "Inorder combination" on the accuracy of our method. Recall that this cutoff limits the number of initial seeds tried in our algorithm by randomly sampling a small percentage of the seeds. It only uses the sampled seeds as possible initial seeds. However, it uses the entire set of seeds while growing the MFAST determined by the initial seed. As each initial seed takes roughly the same amount of time to grow into an MFAST, using x% of the seeds as the sample set reduces the total running time of our method to roughly x% of that of our original implementation. We carried out this experiment as follows.
For both the gymnosperm and Saxifragales trees, we ran 10 sets of experiments for each sampling percentage of 2, 5, 10, 25, 50 and 100%. Thus, in total we ran 60 (6 × 10) experiments. Table 7 presents the average MFAST sizes for varying sample sizes. The results demonstrate that even for very small sampling percentages, our method finds an MFAST that is almost as big as the MFAST found by using the entire dataset (i.e., 100% sampling percentage). This is very promising as it demonstrates that the running time cost of our method can easily be cut to a small fraction by sampling the starting seeds. The rationale behind this is that the MFAST contains many seeds. Starting from any of these seeds, our algorithm has the potential to lead to that MFAST. The probability that at least one of these seeds appears in the sample set is large, particularly for large MFASTs.

Conclusion
In this paper, we present a heuristic for finding maximum frequent agreement subtrees. The heuristic uses a multistep approach which first identifies small candidate subtrees (called seeds) from the set of input trees, combines the seeds to build larger candidate MFASTs, and then performs a post processing step to increase the size of the candidate MFASTs. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Although this heuristic is not guaranteed to find all MFASTs, it performs well using both simulated and empirical data sets. Its performance is relatively robust to the number of input trees and the size of the input trees, although with the larger data sets, the post processing step becomes more important. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.

Although the method we developed is described and implemented for rooted and bifurcating trees, it can be trivially extended to multifurcating as well as unrooted trees.
The central technical difference in the case of unrooted trees would be the definition of clade (see Definition 1), as the definition requires a root. A clade in an unrooted tree encompasses two sets of nodes: (i) a given set of taxa X, (ii) the set of all internal nodes that are on a path between two taxa in X on the phylogenetic tree. We expect that this will increase the number of seeds substantially and thus make the problem more computationally intensive. The amount of increase will depend on the tree topology. The theoretical worst case happens when all the taxa are connected to a single internal node (i.e., star topology). In that case any subset of taxa can lead to a potential seed as long as the subset size is equal to the seed size allowed. One possible way to overcome this problem would be to exploit randomization or graph coloring strategies and avoid enumerating the majority of the possible seeds.

Abbreviations
MAST: Maximum agreement subtree; FAST: Frequent agreement subtree; MFAST: Maximum frequent agreement subtree; MCMC: Markov chain Monte Carlo.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
AR participated in algorithm development, implementation, experimental evaluation and writing of the paper. TK participated in algorithm development, experiment design and writing of the paper. GB participated in experiment design, dataset collection and writing of the paper. All authors read and approved the final manuscript.

Acknowledgments
This work was supported partially by the National Science Foundation (grants CCF 0829867 and IIS 0845439).

Author details
1 Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA. 2 Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA. 3 Department of Biology, University of Florida, Gainesville, FL, USA.
Received: 5 February 2012 Accepted: 5 September 2012 Published: 3 October 2012

References
1. Goloboff PA, Catalano SA, Mirande JM, Szumik CA, Arias JS, Källersjö M, Farris JS: Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics 2009, 25(3):211–230.
2. Price MN, Dehal PS, Arkin AP: FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 2010, 5(3):e9490.
3. Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ: Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany 2011, 98:404–414.
4. Felsenstein J: Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evolution 1985, 39(4):783–791.
5. Farris JS, Albert VA, Källersjö M, Lipscomb D, Kluge AG: Parsimony jackknifing outperforms neighbor-joining. Cladistics 1996, 12(2):99–124.
6. Huelsenbeck JP, Rannala B, Masly JP: Accommodating phylogenetic uncertainty in evolutionary studies. Science 2000, 288(5475):2349–2350. doi:10.1126/science.288.5475.2349.
7. Bryant D: A classification of consensus methods for phylogenetics. In Bioconsensus (Piscataway, NJ, 2000/2001), volume 61 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci. Amer. Math. Soc.; 2003:163–183.
8. Finden CR, Gordon AD: Obtaining common pruned trees. J Classification 1985, 2:255–276.
9. Amir A, Keselman D: Maximum agreement subtree in a set of evolutionary trees: Metrics and efficient algorithms. SIAM J Comput 1997, 26(6):1656–1669.
10. Kubicka E, Kubicki G, McMorris FR: An algorithm to find agreement subtrees. J Classification 1995, 12(1):91–99.
11. Farach M, Przytycka TM, Thorup M: On the agreement of many trees. Inf Process Lett 1995, 55(6):297–301.
12. Bryant D: Building trees, hunting for trees and comparing trees. PhD thesis. Dept. Mathematics, University of Canterbury; 1997.
13. Cole R, Farach-Colton M, Hariharan R, Przytycka TM, Thorup M: An O(n log n) algorithm for the maximum agreement subtree problem for binary trees.
SIAMJ Comput 2000, 30(5):13851404. 14. Lee CM, Hung LJ, Chang MS, Shen CB, Tang CY: An improved algorithm for the maximum agreement subtree problem. InfProcess Lett 2005, 94(5):211216. 15. Berry V, Nicolas F: Improved parameterized complexity of the maximum agreement subtree and maximum compatible tree problems. EEEACM Trns CoputBioogyBonorm 2006, 3(3): 289302. 16. Guillemot S, Nicolas F: Solving the maximum agreement subtree and the maximum compatible tree problems on many bounded degree trees. In Proceedings of the 17th Annual conference on Combinatorial Pattern Matching (CPMO6). Edited by Lewenstein M, Valiente G(Eds.). Berlin, Heidelberg: SpringerVerlag:165176. doi:10.1007/11780441 16. 17. Guillemot S, Nicolas F, Berry V, Paul C: On the approximability of the maximum agreement subtree and maximum compatible tree problems. Discrete Appied Mathematcs 2009, 157(7):15551570. 18. Chi Y, Xia Y, Yang Y, Muntz, RR: Correction to "mining closed and maximal frequent subtrees from databases of labeled rooted trees". IEEE Trans Know/Da ta Eng 2005, 17(12):1737. 19. Zhang S, Wang JTL: Mining frequent agreement subtrees in phylogenetic databases. In Proceedings of theth 5/AM International Conference on Data Mining (SDM 2006). Edited by Ghosh J, Lambert D, Skillicorn DB, Srivastava J. Maryland: Bethesda; April 2006:222233. 20. Zhang S, Wang JTL: Discovering frequent agreement subtrees from phylogenetic data. EEE Trans Know/ Data Eng 2008, 20(1):6882. 21. Cranston KA, Rannala B: Summarizing a Posterior Distribution of Trees Using Agreement Subtrees. SystB 2007, 56(4):578590. 22. Pattengale ND, Swenson KM, Moret BME: Uncovering hidden phylogenetic consensus. In Proc.6th Int'/ymp Bioinformatcs Research & Apps. SBRA0, in Lecture Notesin omputerScience. Edited by Borodovsky M, Gogarten JP, Przytycka TM, Rajasekaran S:pp 128139. Springer, 2010. 23. Pattengale ND, Aberer AJ, Swenson KM, Stamatakis A, Moret BME: Uncovering hidden phylogenetic consensus in large data sets. 
IEEE/ACM Trans Comput Biology Bioinform 2011,8(4):902911. 24. Aberer AJ, Stamatakis A: A simple and accurate method for rogue taxon identification. In Proceedings of the/EE International conference on, Bioinformaticsand Biomedicine (B/BM 'I): IEEE Computer Society, Washington, DC, USA:118122. doi:10.1109/BIBM.2011.70. 25. JunierT, Zdobnov EM: The newick utilities: highthroughput phylogenetic tree processing in the unix shell. Bioinformatics 2010, 26(13):16691670. 26. Swofford DL: PAUP* Phylogenetic analysis using parsimony (and other methods). Version 4.0 beta 10.,2002. Sunderland, Massachusetts: Sinauer Assoc. 27. Burleigh JG, Barbazuk WB, Davis JM, Morse AM, Soltis PS: Exploring diversification and genome size evolution in extant gymnosperms through phylogenetic synthesis. JBotany 2012, 2012:6. Article ID 292857. doi:10.1155/2012/292857. 28. Stamatakis A: Raxmlvihpc: maximum likelihoodbased phylogenetic analyses with thousands of taxa and mixed models. Bioinormatics2006, 22(21):26882690. doi:10.1186/1471210513256 Cite this article as: Ramu eta: A scalable method for identifying frequent subtrees in sets of large phylogenetic trees. BMCBioinormacs 2012 1325 Page 15 of 15 Submit your next manuscript to BioMed Central and take full advantage of: * Convenient online submission * Thorough peer review * No space constraints or color figure charges * Immediate publication on acceptance * Inclusion in PubMed, CAS, Scopus and Google Scholar * Research which is freely available for redistribution Submit your manuscript at I d Central www.biomedcentra I.com/subm it 0 Eiulid Central 
Full Text 
Research article

A scalable method for identifying frequent subtrees in sets of large phylogenetic trees

Avinash Ramu (1), r.avinash@ufl.edu
Tamer Kahveci (2, corresponding author), tamer@cise.ufl.edu
J Gordon Burleigh (3), gburleigh@ufl.edu

(1) Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA
(2) Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
(3) Department of Biology, University of Florida, Gainesville, FL, USA

Source: BMC Bioinformatics (section: Comparative genomics), ISSN 1471-2105, 2012, volume 13, issue 1, article 256. URL: http://www.biomedcentral.com/14712105/13/256. DOI: 10.1186/1471-2105-13-256. PMID: 23033843.
Received: 5 February 2012. Accepted: 5 September 2012. Published: 3 October 2012.
Copyright 2012 Ramu et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Keywords: Phylogenetic trees, Frequent subtree

Abstract
Background: We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.
Results: We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs.
In the final phase, it performs a postprocessing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.
Conclusions: Although this heuristic does not guarantee to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.

Background
Phylogenetic trees represent the evolutionary relationships of organisms. While recent advances in genomic sequencing technology and computational methods have enabled construction of extremely large phylogenetic trees (e.g., [1-3]), assessing the support for phylogenetic hypotheses, and ultimately identifying well-supported relationships, remains a major challenge in phylogenetics. Support for a tree often is determined by methods such as nonparametric bootstrapping [4], jackknifing [5], or Bayesian MCMC sampling (e.g., [6]), which generate a collection of trees with identical taxa representing the range of possible phylogenetic relationships. These trees can be summarized in a consensus tree (see [7]). Consensus methods can highlight support for specific nodes in a tree, but they also may obscure highly supported subtrees. For example, in Figure 1, the subtree containing taxa A, B, C, and D is present in all five input trees. However, due to the uncertain placement of taxon E, the majority rule consensus tree implies that the clades in the tree have relatively low (60%) support.
Figure 1. (a) A collection of five input trees. The same subtree with taxa A, B, C, and D is present in all input trees, and only the position of taxon E changes. (b) The majority rule consensus and maximum agreement subtrees of the 5 input trees in Figure 1a.

Alternate approaches have been proposed to reveal highly supported subtrees. The maximum agreement subtree (MAST) problem seeks the largest subtree that is present in all members of a given collection of trees [8]. For example, in Figure 1 the MAST includes taxa A, B, C, and D. Finding the MAST is an NP-hard problem [9], although efficient algorithms exist to compute the MAST in some cases (e.g., [9-17]). In practice, since any difference in any single tree will reduce the size of the MAST, the MAST is often quite small, limiting its usefulness.

A less restrictive problem is to find frequent agreement subtrees (FASTs), or subtrees that are found in many, but not necessarily all, of the input trees (see [18]). In this problem, a subtree is declared as frequent if it is in at least as many trees as a user supplied frequency threshold. Several algorithmic approaches have been suggested to identify FASTs, and specifically the maximum FASTs (MFASTs), or FASTs that contain the largest number of taxa. A variant of this problem seeks the maximal FASTs, i.e., FASTs that are not contained in any other FASTs. Notice that an MFAST is a maximal FAST; however, the converse is not necessarily true. Zhang and Wang defined algorithms, implemented in Phylominer, to identify FASTs from a collection of phylogenetic trees [19,20]. These algorithms are guaranteed to find all FASTs but they may be prohibitively slow for data sets larger than 20 taxa.
Cranston and Rannala implemented Metropolis-Hastings and Threshold Accepting searches to identify large FASTs from a Bayesian posterior distribution of phylogenetic trees [21]. This approach can handle thousands of input trees but it may not be feasible if the trees have more than 100 taxa [21].

Another approach to reveal highly supported subtrees from a collection of trees is to identify and remove rogue taxa, or taxa whose position in the input trees is least consistent. Recently, several methods have been developed that can identify and remove rogue taxa from collections of trees with thousands of taxa [22-24]. However, unlike MAST or FAST approaches, they do not provide guarantees about the support for the remaining taxa.

In this paper, we describe a heuristic approach for identifying MFASTs in collections of trees. Unlike previous methods, our method easily scales to datasets with over a thousand taxa and hundreds of trees. Towards this goal, we develop a heuristic solution that works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the final phase, it performs a postprocessing step. This step ensures that the size (i.e., number of taxa) of the FAST found cannot be increased further by adding a new taxon without reducing its frequency below a user supplied frequency threshold. We demonstrate that this heuristic can easily handle data sets with 1000 taxa. We test the effectiveness of this approach on simulated data sets and then demonstrate its performance on large, empirical data sets. Although our heuristic does not guarantee to find all MFASTs or the largest MFAST in theory, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on the empirical data sets.
Its performance is robust with respect to the number of input trees and the size of the input trees.

Methods
In this section we describe our method that aims to find Maximum Frequent Agreement SubTrees (MFASTs) in a given set of m phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$. Our method follows from the observation that an MFAST is present in a large number of trees in $\mathcal{T}$. The method builds MFASTs bottom up from small subtrees of taxa in the trees in $\mathcal{T}$. Briefly, it works in three phases.

• Phase 1. Seed generation (Section “Phase one: Seed generation”). In the first phase, we identify small subtrees from the input trees that have a potential to be a part of an MFAST. We call each such subtree a seed.
• Phase 2. Seed combination (Section “Phase two: Seed combination”). In the second phase, we construct an initial FAST by combining the seeds found in the first phase.
• Phase 3. Postprocessing (Section “Phase three: Postprocessing”). In the third phase, we grow the FAST further to obtain the maximal FAST that contains it by individually considering the taxa which are not already in the FAST. We report the resulting maximal FAST as a possible MFAST.

First, we present the basic definitions needed for this paper in Section “Preliminaries and notation”. We then discuss each of the three phases above in detail.

Preliminaries and notation
In this section, we present the key definitions and notations needed to understand the rest of the paper. We describe our method using rooted and bifurcating phylogenetic trees. However, our method and definitions can easily be applied to unrooted or multifurcating trees with minor or no modifications. Also, we assume that all the taxa are placed at the leaf level nodes of the phylogenetic tree, and all the internal nodes are inferred ancestors.
Figure 2(a) shows a sample phylogenetic tree built on five taxa. We define the size of a tree as the number of taxa in that tree. We start by defining key terms.

Definition 1 (Clade). Let T be a phylogenetic tree. Given an internal node of T, we define the set of all nodes and edges of T contained under that node as the clade rooted at that node.

Figure 2. (a) A rooted, bifurcating phylogenetic tree T built on five taxa labeled with a, b, c, d and e. The internal nodes are shown with $x_0$, $x_1$, $x_2$ and $x_3$. (b) A clade of T rooted at $x_1$. (b) and (c) Two subtrees of T obtained by contracting the taxa sets {d, e} and {a, e}.

Each internal node of a phylogenetic tree corresponds to a clade of that tree. Figure 2(b) depicts the clade of the tree in Figure 2(a) rooted at $x_1$.

Definition 2 (Contraction). Let T be a phylogenetic tree with n taxa. The contraction operation transforms T into a tree with n − 1 taxa by removing a given taxon in T along with the edge that connects that taxon to T.

The contraction operation can extract the clades of a tree by removing all the taxa that are not a part of that clade. It can also extract parts of the tree that are not necessarily clades. We use the term subtree to denote a tree that is obtained by applying contractions to an arbitrary set of taxa in a given tree. The formal definition is as follows.

Definition 3 (Subtree). Let T and T' be two phylogenetic trees. We say that T' is a subtree of T if T can be transformed into T' by applying a series of contractions on T.

If a tree T' is a subtree of another tree T, we say that T' is present in T. Notice that a clade is always a subtree, but the converse is not always true. Figures 2(b) and 2(c) illustrate two subtrees of the tree in Figure 2(a).
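Definitions 2 and 3 can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes trees are stored as nested Python tuples with string leaves, and it assumes that a node left with a single child after a contraction is suppressed. The helper names (`leaves`, `contract`, `canonical`, `is_subtree`) and the example tree are hypothetical; the tree is not the exact tree of Figure 2.

```python
def leaves(tree):
    """Return the set of taxon labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def contract(tree, taxon):
    """Definition 2: remove one taxon together with the edge attaching it."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [c for c in (contract(child, taxon) for child in tree) if c is not None]
    if not kept:
        return None
    # Assumption: a node left with a single child is suppressed.
    return kept[0] if len(kept) == 1 else tuple(kept)

def canonical(tree):
    """Order children deterministically so equal topologies compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def is_subtree(small, big):
    """Definition 3: can `big` be turned into `small` by contractions?"""
    if not leaves(small).issubset(leaves(big)):
        return False
    for taxon in sorted(leaves(big) - leaves(small)):
        big = contract(big, taxon)
    return canonical(big) == canonical(small)

# Hypothetical five-taxon tree used only for illustration.
T = ((("a", "b"), "c"), ("d", "e"))
clade = contract(contract(T, "d"), "e")   # contracting {d, e} leaves a clade
not_a_clade = (("b", "c"), "d")           # obtained by contracting {a, e}
print(clade, is_subtree(not_a_clade, T))
```

In this sketch, contracting {d, e} yields a clade of the example tree, while contracting {a, e} yields a subtree that is not a clade, which mirrors the distinction drawn in the text.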
Let us denote the number of combinations of k taxa from a set of n taxa with $\binom{n}{k}$. In general, if a tree has n taxa, then that tree contains $\binom{n}{k}$ subtrees with k taxa. As a consequence, that tree contains $2^n - 1$ subtrees of any size including itself.

Definition 4 (Frequency). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of m phylogenetic trees and T be a phylogenetic tree. Let us denote the number of trees in $\mathcal{T}$ in which T is present with the variable m'. We define the frequency of T in $\mathcal{T}$ as

$$\mathit{freq}(T, \mathcal{T}) = \frac{m'}{m}.$$

Definition 5 (FAST). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of m phylogenetic trees and T be a phylogenetic tree. Let γ be a number in the interval [0, 1] that denotes the frequency cutoff. We say that T is a Frequent Agreement SubTree (FAST) of $\mathcal{T}$ if its frequency in $\mathcal{T}$ is at least γ (i.e., $\mathit{freq}(T, \mathcal{T}) \ge \gamma$).

We say that a FAST is maximal if there is no other FAST that contains all the taxa in that FAST. Clearly, larger FASTs indicate biologically more relevant consensus patterns. The following definition summarizes this.

Definition 6 (MFAST). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of m phylogenetic trees. Let γ be a number in the interval [0, 1] that denotes the frequency cutoff. A FAST T of $\mathcal{T}$ is a Maximum Frequent Agreement SubTree (MFAST) of $\mathcal{T}$ if there is no other FAST T' of $\mathcal{T}$ that has a larger size than T.

Formally, given a set of phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ and a frequency cutoff γ, we would like to find the MFASTs in $\mathcal{T}$. We develop an algorithm that aims to solve this problem. Table 1 lists the variables used throughout the rest of this paper.
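As a hedged illustration of Definitions 4 and 5, the sketch below computes the frequency of a candidate subtree and tests it against a cutoff. It assumes the same nested-tuple tree encoding as a convenience (the paper itself works with Newick strings), and the function names and the three example trees are hypothetical.

```python
def leaves(tree):
    """Set of taxon labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def restrict(tree, keep):
    """Contract every taxon that is not in `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def canonical(tree):
    """Child-order-independent form, so equal topologies compare equal."""
    if not isinstance(tree, tuple):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def present(sub, tree):
    """True if `sub` is a subtree of `tree` in the sense of Definition 3."""
    return canonical(restrict(tree, leaves(sub))) == canonical(sub)

def freq(sub, trees):
    """Definition 4: m' / m."""
    return sum(present(sub, t) for t in trees) / len(trees)

def is_fast(sub, trees, gamma):
    """Definition 5: frequency at least the cutoff gamma."""
    return freq(sub, trees) >= gamma

# Three hypothetical input trees on the taxa a, b, c, d.
trees = [
    ((("a", "b"), "c"), "d"),
    ((("a", "b"), "d"), "c"),
    ((("a", "c"), "b"), "d"),
]
S = (("a", "b"), "c")
print(freq(S, trees), is_fast(S, trees, 0.6), is_fast(S, trees, 0.7))
```

Here the candidate is present in the first two trees only, so its frequency is 2/3: it is a FAST for γ = 0.6 but not for γ = 0.7.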
Table 1. Commonly used variables and functions in this paper

Variable                          Description
$\mathcal{T}$                     A set of phylogenetic trees
$T_i$                             ith tree
m                                 Number of trees in $\mathcal{T}$
n                                 Number of taxa in each input tree
$a_i$                             ith taxon
$\mathit{freq}(T, \mathcal{T})$   Frequency of the subtree T in $\mathcal{T}$
γ                                 Frequency cutoff
$S_i$                             ith seed (each seed is a subtree of a tree in $\mathcal{T}$)
k                                 Size of a seed
c                                 Number of contractions used to create a seed

Phase one: Seed generation
The first phase extracts small subtrees from the given set of trees. From these subtrees we extract the basic building blocks which are used to construct MFASTs. We call these building blocks seeds. Conceptually each seed is a phylogenetic tree that contains a small subset of the taxa that make up the trees in $\mathcal{T}$. We characterize each seed with three features that are listed below. We elaborate on each feature later in this section.

1. Seed size (k) is the number of taxa in the seed.
2. Number of contractions (c) is the number of taxa we prune from a clade taken from an input tree in order to extract the seed.
3. Frequency (f) is the fraction of input trees in which the seed is present.

We explain the seed features with the help of Figures 3 and 4. The first two characteristics explain how a seed can be found in one of the trees in $\mathcal{T}$. They indicate that there is a clade of a tree in $\mathcal{T}$ such that this clade contains k + c taxa and it can be transformed into that seed after c contractions from that clade. For instance in Figure 3, when k = 2 and c = 0, only seed $S_1$ can be extracted from $T_1$ by choosing the clade rooted at $x_2$. When k = 2 and c = 1, seeds $S_1$, $S_2$ and $S_3$ can be obtained using one contraction ($a_3$, $a_2$ and $a_1$ respectively) from the clade rooted at $x_1$.
Figure 3. $T_1$ is an input tree built on four taxa $a_1$, $a_2$, $a_3$ and $a_4$. The internal nodes of $T_1$ are labeled as $x_0$, $x_1$ and $x_2$. $S_1$ is the only seed obtained from $T_1$ when k = 2 and c = 0. That is, $S_1$ is identical to the clade rooted at $x_2$. $S_1$, $S_2$ and $S_3$ are the seeds extracted from $T_1$ when k = 2 and c = 1. They are all extracted from the clade rooted at $x_1$ by contracting $a_3$, $a_2$ and $a_1$ respectively.

Figure 4. The set of input trees $T_1$, $T_2$, $T_3$ and the set of all nine potential seeds $S_1, S_2, \ldots, S_9$ when the seed characteristics are set to k = 3 and c = 1. All the potential seeds have three taxa as k = 3. We need one contraction from the input tree to obtain each seed. $S_1$ has frequency 1.0 as it is present in $T_1$, $T_2$ and $T_3$. Seed $S_2$ has frequency ∼0.67 as it is present in $T_1$ and $T_2$. The remaining seeds have frequency ∼0.33 as each appears in only one of the three trees.

The last feature denotes the fraction of trees in $\mathcal{T}$ in which the seed is present. For example in Figure 4, there are nine seeds $S_1, S_2, \ldots, S_9$ extracted from the three input trees using only one contraction. Among these, the frequency of $S_1$ is 1 as it is present in all the trees. The frequency of $S_2$ is about 0.67 because it is present in only two out of three trees ($T_1$ and $T_2$). The frequency of the rest of the seeds is only about 0.33. Recall that, by definition, an MFAST is present in at least a fraction γ of the trees in $\mathcal{T}$. Therefore, we consider only the seeds whose frequency values are equal to or greater than this number (i.e., f ≥ γ).

Given the values of k, c and γ, we extract all the seeds which possess the desired feature values from the set of input trees as follows.
In the Newick string representation of a tree, a pair of matching parentheses corresponds to an internal node in the tree. The number of taxa in the clade rooted at this internal node is given by the number of labels between the two matching parentheses. Following from this observation, we scan the Newick string of each tree one by one. For each such tree, we identify the clades which have k + c taxa. Notice that, if a tree contains n taxa, then it contains at most $\frac{n}{k+c}$ clades of size k + c as no two such clades can contain common taxa. We then extract all combinations of k taxa from each of these clades by contracting the remaining c taxa. The number of ways this can be done is $\binom{k+c}{c}$. Notice that all the small trees extracted this way possess the first two characteristics explained above. At this point, however, we do not know their frequencies. Therefore, we call them potential seeds. It is worth mentioning that the same seed might be extracted from different trees. As we extract a new potential seed, before storing it in the list of potential seeds, we check if it is already present there. We include it in the potential seed list only if it does not exist there yet. Otherwise, we ignore it. This way, we maintain only one copy of each seed.

Once we build our potential seed list for all the trees in $\mathcal{T}$, we go over them one by one and count their frequency in $\mathcal{T}$ as the fraction of trees that contain them. We filter out all the potential seeds whose frequencies are less than the frequency cutoff. We keep the remaining ones as the list of seeds along with the frequency of each seed.

In Figure 4, consider the tree $T_1$ that has four taxa. For k = 3 and c = 1, there is only one clade of size k + c = 4, which is the tree $T_1$ itself. We extract four potential seeds, each having three leaves, from this tree.
The potential seeds in this figure are given by $S_1$, $S_2$, $S_5$ and $S_7$, which we extract by contracting $a_4$, $a_3$, $a_2$ and $a_1$ respectively from $T_1$.

Phase two: Seed combination
At the end of the first phase, we obtain a set of frequent seeds from the input trees. Notice that each seed is a FAST as each seed is present in a sufficient number of trees specified by γ. These seeds are the basic building blocks of our method. In the second phase of our method, we combine subsets of these seeds to construct larger FASTs.

We first define what it means to combine two seeds. In order to combine two seeds, it is a necessary condition that both seeds are present in at least one common tree T in $\mathcal{T}$. We call such a tree T the reference tree. We combine two seeds with the guidance of a reference tree. Let $S_1$ and $S_2$ be two seeds and let T be their reference tree. Let $L_1$, $L_2$ and L be the set of taxa in $S_1$, $S_2$ and T respectively. Combining $S_1$ and $S_2$ results in the tree that is equivalent to the one obtained by contracting the taxa in $L - (L_1 \cup L_2)$ from T. For simplicity, we will denote the combine operation using T as the reference tree with the $\oplus_T$ symbol. For instance, we denote combining $S_1$ and $S_2$ with T being the reference tree as $S_1 \oplus_T S_2$. To simplify our notation, whenever the identity of the reference tree is irrelevant, we will use the symbol $\oplus$ instead of $\oplus_T$.

Figure 5 demonstrates how two seeds $S_1$ and $S_2$ are combined with the help of the reference tree T. In this figure, both $S_1$ and $S_2$ are subtrees of T. Thus, it is possible to use T as the reference tree. We have $L_1 = \{a_1, a_3, a_4\}$ and $L_2 = \{a_1, a_2, a_5, a_7\}$. Thus, we build $C = S_1 \oplus_T S_2$ by contracting the taxa in $L - (L_1 \cup L_2) = \{a_6, a_8\}$ from T.

Figure 5. T is the reference tree. $S_1$ and $S_2$ are the seeds to be combined; both are present in T. C is obtained by pruning the subtree containing taxa $a_1$, $a_2$, $a_3$, $a_4$, $a_5$ and $a_7$ from T.
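The combine operation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: trees are nested tuples with string leaves, the reference tree is a hypothetical eight-taxon tree (the topology of the tree in Figure 5 is not reproduced here), and `combine` is an illustrative name rather than a function from the authors' software. Only the taxon sets $L_1$ and $L_2$ are taken from the text.

```python
def leaves(tree):
    """Set of taxon labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def restrict(tree, keep):
    """Contract every taxon of `tree` that is not in `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def combine(s1, s2, reference):
    """S1 (+)_T S2: contract L - (L1 u L2) from the reference tree T."""
    return restrict(reference, leaves(s1) | leaves(s2))

# Hypothetical reference tree on a1..a8.
T = ((("a1", "a2"), ("a3", "a4")), (("a5", "a6"), ("a7", "a8")))
# Seeds built by restriction, so both are present in T by construction.
S1 = restrict(T, {"a1", "a3", "a4"})
S2 = restrict(T, {"a1", "a2", "a5", "a7"})
C = combine(S1, S2, T)
removed = leaves(T) - leaves(C)
print(C, sorted(removed))
```

With the taxon sets from the text, the contracted taxa are $a_6$ and $a_8$, and the combined tree keeps the remaining six taxa in the arrangement dictated by the reference tree.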
So far, we have explained how to combine two seeds $S_1$ and $S_2$ using a reference tree. It is possible that many trees in $\mathcal{T}$ have both seeds present in them. Thus, one question is which of these trees we should use as the reference tree to combine the two seeds. The brief answer is that all such trees need to be considered. However, we make several observations that help us avoid exhaustively combining $S_1$ and $S_2$ using each such reference tree one by one, without ignoring any of these trees. We explain them next.

Consider two trees $T_1$ and $T_2$ from $\mathcal{T}$ in which both seeds are present. There are two cases for $T_1$ and $T_2$.

• Case 1: $S_1 \oplus_{T_1} S_2 = S_1 \oplus_{T_2} S_2$. In this case, it does not matter whether we use $T_1$ or $T_2$ as the reference tree. They will both lead to the same combined subtree. Thus, we use only one.
• Case 2: $S_1 \oplus_{T_1} S_2 \neq S_1 \oplus_{T_2} S_2$. In this case, the trees $T_1$ and $T_2$ lead to alternative combination topologies. So, we consider both of them separately.

We utilize the observations above as follows. We start by picking one reference tree arbitrarily. Once we create a combined subtree using that tree, we check whether that subtree is present in the remaining trees in $\mathcal{T}$. We mark those trees that contain it as considered for reference tree and never use them as reference for the same seed pair again. This is because those trees fall into the first case described above. This way, we also store the frequency of the combined subtree in $\mathcal{T}$. If the number of unmarked trees is too small (i.e., less than γ × m) then it means that even if all the remaining trees agree on the same combined topology for the two seeds under consideration, they are not sufficient to make it a FAST. Thus, we do not use any of the remaining trees as reference for those two seeds.
Otherwise, we pick another unmarked tree arbitrarily and repeat the same process until we run out of reference trees.

The next question we need to answer is which seed pairs we should combine. To answer this question we first make the following proposition.

Proposition 1. Assume that we are given a set of phylogenetic trees $\mathcal{T}$. Let $S_1$ and $S_2$ be two seeds constructed from the trees in $\mathcal{T}$. For all trees $T \in \mathcal{T}$, we have the following inequality:

$$\mathit{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \min\{\mathit{freq}(S_1, \mathcal{T}), \mathit{freq}(S_2, \mathcal{T})\}$$

Proof. For any T, both $S_1$ and $S_2$ are subtrees of $S_1 \oplus_T S_2$. Thus if $S_1 \oplus_T S_2$ is present in a tree, then both $S_1$ and $S_2$ are present in that tree. As a result, $\mathit{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \mathit{freq}(S_1, \mathcal{T})$ and $\mathit{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \mathit{freq}(S_2, \mathcal{T})$. Hence, $\mathit{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \min\{\mathit{freq}(S_1, \mathcal{T}), \mathit{freq}(S_2, \mathcal{T})\}$. □

Proposition 1 states that as we combine pairs of seeds to grow them, their frequency never increases. This suggests that it is desirable to combine two seeds if both of them have large frequencies. This is because if one of them has a small frequency, regardless of the frequency of the other, the combined tree will have a small frequency. As a result its chance to grow into a larger tree through additional combine operations gets smaller. Following this intuition, we develop two approaches for combining the seeds.

1. Inorder Combination (Section “Inorder combination”).
2. Minimum Overlap Combination (Section “Minimum overlap combination”).

Both approaches accept the list of seeds computed in the first phase as input and produce a larger FAST that is a combination of multiple seeds. Both of them also assume that the list of input seeds is already sorted in decreasing order of their frequencies. We discuss these approaches next.
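The reference-tree bookkeeping described above (pick a reference, mark the trees that agree with the combined topology, stop when fewer than γ × m trees remain unmarked) can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the nested-tuple encoding, the function name `frequent_combinations` and the example trees are assumptions. The last lines check the bound of Proposition 1 on the example.

```python
def leaves(tree):
    """Set of taxon labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def restrict(tree, keep):
    """Contract every taxon of `tree` that is not in `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def canonical(tree):
    """Child-order-independent form, so equal topologies compare equal."""
    if not isinstance(tree, tuple):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def present(sub, tree):
    return canonical(restrict(tree, leaves(sub))) == canonical(sub)

def freq(sub, trees):
    return sum(present(sub, t) for t in trees) / len(trees)

def frequent_combinations(s1, s2, trees, gamma):
    """Return (combined tree, frequency) for every frequent way to combine s1 and s2."""
    taxa = leaves(s1) | leaves(s2)
    # Only trees that contain both seeds can serve as reference trees.
    unmarked = [t for t in trees if present(s1, t) and present(s2, t)]
    found = []
    while unmarked and len(unmarked) >= gamma * len(trees):
        combined = restrict(unmarked[0], taxa)          # s1 (+)_T s2
        agreeing = [t for t in unmarked if present(combined, t)]
        # Mark the agreeing trees: they would give the same topology (Case 1).
        unmarked = [t for t in unmarked if not present(combined, t)]
        f = len(agreeing) / len(trees)
        if f >= gamma:
            found.append((combined, f))
    return found

# Three hypothetical input trees on the taxa a..e, and two two-taxon seeds.
trees = [
    ((("a", "b"), ("c", "d")), "e"),
    (("a", "b"), (("c", "d"), "e")),
    ((("a", "b"), "c"), ("d", "e")),
]
s1, s2 = ("a", "b"), ("c", "d")
found = frequent_combinations(s1, s2, trees, gamma=0.6)
combined, f = found[0]
print(combined, f, min(freq(s1, trees), freq(s2, trees)))
```

In this example the first two trees agree on the combined topology, so it is found with frequency 2/3; the single remaining tree cannot reach the cutoff, and the loop stops without using it as a reference.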
Inorder combination
The inorder combination approach follows from Proposition 1. It assumes that the seeds with higher frequencies have greater potential to be a part of an MFAST. It exploits this assumption as follows. First, it picks a seed as the starting point to create a FAST. It then grows this seed by combining it with other seeds, starting from the most frequent one, as long as the frequency of the resulting tree remains at least as large as the given cutoff γ. It repeats this process by trying each seed as the starting point. Algorithm 1 presents this approach.

Algorithm 1 Inorder combination
FAST ← ∅
for all seeds $S_i$ do
    FAST' ← $S_i$
    Mark $S_i$ as considered
    repeat
        $S_j$ ← seed with highest frequency among unconsidered seeds
        Mark $S_j$ as considered
        CUTOFF ← γ
        t_FAST' ← FAST'
        repeat
            Pick the next unconsidered tree $T \in \mathcal{T}$ as reference
            Mark all the trees that contain FAST' $\oplus_T$ $S_j$ as considered
            if freq(FAST' $\oplus_T$ $S_j$, $\mathcal{T}$) ≥ CUTOFF then
                t_FAST' ← FAST' $\oplus_T$ $S_j$
                CUTOFF ← freq(FAST' $\oplus_T$ $S_j$, $\mathcal{T}$)
            end if
        until less than γ × m unmarked reference trees are left in $\mathcal{T}$
        FAST' ← t_FAST'
        Unmark all trees in $\mathcal{T}$
    until all seeds are considered
    if size of FAST' ≥ size of FAST then
        FAST ← FAST'
    end if
    Unmark all seeds
end for

In Algorithm 1 we first initialize the FAST as empty. We then consider each seed one by one. We initialize a temporary subtree denoted by FAST' with the seed $S_i$ under consideration and mark $S_i$ as considered. We combine FAST' with a seed $S_j$ which has the highest frequency amongst the seeds that have not been added. If multiple seeds have the highest frequency, we randomly pick one of them and mark that seed $S_j$ as added to FAST'. There can be alternative ways to combine FAST' with $S_j$ leading to different topologies.
We use the trees in $\mathcal{T}$ that contain both FAST' and $S_j$ as guides to try only the topologies that exist in $\mathcal{T}$. We stop constructing alternative topologies as soon as we ensure that there are not enough trees left to yield a frequency of γ. We set FAST' to the combined seed if the combined seed has a large enough frequency. We then consider the seed with the next highest frequency for addition and repeat this step until all $S_j$ have been considered. If the resulting temporary FAST is larger than FAST we replace the smaller FAST with the larger one. In the next iteration, we initialize FAST' with the next $S_i$. Using this approach we can initialize FAST' with every $S_i$. Alternatively, if the user wishes to limit the amount of time spent using a maximum time cutoff, we stop the outermost loop (i.e., alternative initializations of FAST') as soon as the allowed running time budget is reached.

Notice that in Algorithm 1 each seed $S_i$ can lead to a different FAST. We record only the FAST that has the largest size. However, it is trivial to maintain the top k FASTs with the largest size instead if the user is looking for k alternative maximal FASTs.

Minimum overlap combination
The purpose of combining seeds is to construct a FAST that is large in size. Our inorder combination approach (Section “Inorder combination”) aimed to maximize the frequency of the combined seeds. In this section, we develop our second approach, named Minimum Overlap Combination. This approach picks seeds so that their combination produces as large a subtree as possible. We elaborate on this approach next.

When we combine two seeds, the size of the resulting tree becomes at least as big as the size of each of these seeds. Formally, let $S_1$ and $S_2$ be two seeds (i.e., trees). Let $L_1$ and $L_2$ be the sets of taxa contained in $S_1$ and $S_2$. We denote the size of a set, say $L_1$, with $|L_1|$.
The size of the tree resulting from the combination of S_1 and S_2 is |L_1| + |L_2| − |L_1 ∩ L_2|. For a given fixed seed size, the first two terms of this formulation remain unchanged regardless of the seed. The last term determines the growth in the size of the FAST. Thus, in order to grow the FAST rapidly, it is desirable to combine two frequent subtrees with a small number of common taxa.

Our second approach follows from the observation above. We introduce a criterion called the overlap between two subtrees, defined as the number of taxa common between them. Our minimum overlap combination approach works the same as Algorithm 1 with a minor difference in selecting the seed S_j that will be combined with the current temporary FAST (i.e., FAST′). Rather than choosing the seed with the largest frequency, this approach chooses the one that has the least overlap with FAST′ among all the unconsidered and frequent seeds. If multiple seeds have the same smallest overlap, it uses the frequency as the tie breaker and chooses the one with the largest frequency among those.

Phase three: Postprocessing

So far we described how to obtain seeds (Section “Phase one: Seed generation”) and how to combine them to construct a FAST (Section “Phase two: Seed combination”). The two approaches we developed for combining seeds aim to maximize the size of the FAST. However, they do not ensure the maximality of the resulting FAST. There are two main reasons that prevent our seed combining algorithms from constructing a maximal FAST. First, some of the taxa of a maximal FAST may not appear in any seed (i.e., false negatives). As a result, no combination of seeds will lead to that maximal FAST. Second, even if all the taxa of a maximal FAST are parts of at least one seed, our algorithms will reject combining those seeds with the current FAST if they contain other taxa that are not part of the maximal FAST (i.e., false positives).

In the postprocessing phase, we tackle the abovementioned problems. Algorithm 2 describes the post processing phase in detail. We consider all taxa which are not already present in the FAST one by one. We iteratively grow the current FAST by including one more taxon at a time if the frequency of the resulting FAST remains at least as large as the frequency cutoff γ. We repeat these iterations until no new taxon can be included in the FAST. Thus the resulting FAST is guaranteed to be maximal.

Algorithm 2 Post processing

    INPUT = FAST from the seed combination phase
    INPUT = $\mathcal{T}$
    OUTPUT = Maximal FAST
    RESULT ← FAST
    for all a_i not in FAST do
        CUTOFF ← γ
        t_RESULT ← RESULT
        repeat
            Pick the next unconsidered tree T ∈ $\mathcal{T}$ as reference
            RESULT′ ← RESULT ⊕_T a_i
            Mark all the trees that contain RESULT′ as considered
            if frequency of RESULT′ ≥ CUTOFF then
                t_RESULT ← RESULT′
                CUTOFF ← frequency of RESULT′
            end if
        until Less than γ × m unmarked reference trees are left in $\mathcal{T}$
        RESULT ← t_RESULT
        Unmark all trees in $\mathcal{T}$
    end for
    return RESULT

We expect the post processing step to quickly identify the taxa that have the potential to be in an MFAST but might not have been considered during the seed generation and seed combination phases. At the end of the post processing step we obtain an MFAST.

Complexity analysis of our method

In this section we discuss the complexity of our method in terms of the three phases involved in it. Let $\mathcal{T}$ be a set of m phylogenetic trees having n leaves each. The complexities of the different phases of our method are as follows.

Phase one

Finding the seeds involves enumerating all the subtrees and checking their frequencies. Given seed size k and number of contractions c, each tree will contain at most $\frac{n}{k+c}$ clades, each leading to $\binom{k+c}{c}$ alternative subtrees.
Thus, in total there can be up to $\frac{mn}{k+c}\binom{k+c}{c}$ seeds (possibly many of them identical) from all the trees in $\mathcal{T}$. Typically, the values of k and c are fixed and small (in our experiments we have k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5}), leading to O(mn) seeds.

The complexity of finding whether a seed is present in a single tree is O(n log n). Given that there are m trees in $\mathcal{T}$, the cost of computing the frequency of a single seed is O(mn log n). Thus, the time complexity for finding the frequency of all the seeds is this expression multiplied by the number of seeds, which is O(m² n² log n).

Phase two

Consider a set of p frequent seeds that will be considered for combining in this phase. Recall that we have two approaches to combine them. Below, we focus on each.

INORDER COMBINATION. We try to combine each seed with every other seed, leading to O(p²) iterations. The complexity of checking the frequency of each combined subtree is O(mn log n). Also, there can be up to O(m) different reference trees for guiding the combine operation. Multiplying these terms, we obtain the complexity of this phase using this approach as O(p² m² n log n).

MINIMUM OVERLAP COMBINATION. The complexity of combining the frequent seeds using the minimum overlap combination approach is very similar to that of the inorder approach except for an additional term. The additional complexity arises because we maintain the overlap between the subtrees. This leads to the complexity O(p² n² + p² m² n log n).

Phase three

Here, we consider the FAST obtained from each of the p frequent seeds in phase two. For each FAST, we go over the taxa one by one, leading to O(n) iterations. There can be up to O(γ × m) references to add a taxon. So the cost of extending all p FASTs is O(γ × mnp).

Notice that each frequent seed has to appear in at least γ × m trees.
Thus, the number of unique frequent seeds p is bounded by $O\left(\frac{mn}{\gamma \times m}\right) = O\left(\frac{n}{\gamma}\right)$. Thus, adding the cost of all the three phases, the overall time complexity of our method using inorder combination is $O\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + m n^2\right)$. That using minimum overlap combination is $O\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + \frac{n^4 \log n}{\gamma^2} + m n^2\right)$. In the two summations above, the second term is asymptotically larger than the first and the last terms. Thus, we can simplify the asymptotic time complexity of inorder and minimum overlap combinations as $O\left(\frac{m^2 n^3 \log n}{\gamma^2}\right)$ and $O\left(\frac{n^3 \log n}{\gamma^2}\,(m^2 + n)\right)$, respectively.

Results and discussion

This section evaluates the performance of our MFAST algorithm experimentally.

Implementation details

We implemented our MFAST algorithm using C and Perl. More specifically, we implemented the first two phases (seed generation and seed combination) in C and the third phase (post processing) in Perl. We utilize the functions provided in the Newick Utilities [25] package by modifying the source code provided in that package. We use k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5} in all of our experiments unless otherwise stated. In our experiments, we observed that the minimum overlap combination produced larger MFASTs than the inorder combination approach. Therefore, we limit our experimental results to the minimum overlap combination approach.

Methods compared against

We have compared our method against Phylominer [20] and the MAST command implemented in PAUP* [26]. Among these, Phylominer also seeks MFASTs in a collection of trees. However, the time complexity of this method is exponential in the size of the input trees, and hence it becomes intractable for large trees. In our experiments, we observed that it does not scale beyond 50 taxa.
PAUP* is primarily a program for phylogenetic inference, although it can also compute MASTs. MASTs have a strict 100% agreement criterion, unlike the arbitrary frequency cutoff values γ in our method.

Evaluation Criteria

We evaluate our algorithm based on the size of the MFAST found. Larger MFASTs are preferable. When possible, we report the size of the optimal solution as well.

Test Environment

We ran our experiments on Linux servers equipped with dual AMD Opteron dual core processors running at 2.2 GHz and 3 GB of main memory.

Datasets

We test the performance and verify the results of our method on synthetic datasets and real datasets.

• Synthetic dataset. We built synthetic datasets in which we embedded an MFAST as described below. We characterize each synthetic dataset using five parameters:

1. Tree size (n).
2. Number of trees (m).
3. MFAST frequency (f).
4. MFAST size (n′).
5. Noise percentage (∊).

The first two parameters denote the size and number of trees in $\mathcal{T}$. MFAST frequency specifies the fraction of trees in $\mathcal{T}$ which contain an MFAST. MFAST size is the number of taxa in the embedded MFAST. The noise percentage is the percentage of taxa that are not a part of the embedded MFAST but are placed on the branches within the clade that contains the MFAST. We place all the other taxa on the branches outside this clade.

Given an instantiation of these parameters, we first created a tree that has n′ taxa. This tree serves as the MFAST. We then created m × f trees that contain this MFAST. We build each of these trees by inserting n − n′ taxa randomly in the MFAST. With probability ∊ we insert each taxon within the clade that contains the MFAST. With probability 1 − ∊ we insert it outside that clade. We then created m − (m × f) trees that do not contain the current MFAST. We do this simply by inserting all the taxa one by one at a random location.

• Real datasets.
We use two empirical datasets to evaluate the performance of our heuristic. The data sets contain 200 bootstrap trees generated from phylogenetic analysis of the Gymnosperm [27] and Saxifragales (Burleigh, unpublished) plant clades. To make the bootstrap trees, we assembled supermatrices (matrices of concatenated gene alignments with partial taxon overlap) from gene sequence data available in GenBank. We performed a maximum likelihood bootstrap analysis on each supermatrix using RAxML v. 7.0.4 [28]. The Gymnosperm trees each contain 959 taxa, and the Saxifragales trees each contain 950 taxa.

Effects of number of input trees

In our first experiment, we analyze how the number of input trees in $\mathcal{T}$ affects the performance of our algorithm. For this purpose, we created 30 synthetic datasets. The size of the embedded MFAST in all the datasets was 15. Among these 30 datasets, 10 contained 50 trees, 10 contained 100 trees and 10 contained 200 trees. We set the noise percentage to 20% in all the datasets. The frequency of the embedded MFAST was 0.8. We set the number of taxa in all the trees in these datasets to 100.

We ran our algorithm on each of these datasets to find the size of the MFAST for γ = 0.7. Table 2 lists the average MFAST size we found for each of the dataset sizes before post processing (i.e., at the end of phase two) and after post processing (i.e., at the end of phase three). The results demonstrate that our method can identify an MFAST that is almost as big as the embedded one even without post processing, regardless of the number of trees in the dataset. Post processing improves the MFAST size slightly. On average, after post processing we always find an MFAST that is as large as or larger than the embedded one. An MFAST larger than 15 here implies that while randomly inserting the taxa that are not in the embedded MFAST, at least one of them was placed under the same clade at least a fraction γ of the time.
More importantly, our method successfully located such taxa along with the rest of the MFAST.

Table 2 Evaluation of the effect of the number of trees in $\mathcal{T}$

    Number of trees    MFAST size
                       Before post processing    After post processing
    50                 14.5                      16.0
    100                15.3                      15.8
    200                14.4                      15.4

The number of trees is set to 50, 100 and 200. For each number of trees we run our experiments on ten datasets. Each dataset contains trees with 100 taxa and an embedded MFAST of size 15. We report the average size of the MFAST obtained by our method across the ten datasets.

Effects of tree size

Our second experiment considers the impact of the number of taxa in the input trees contained in $\mathcal{T}$ on the success of our method. To carry out this test, we built datasets with varying tree sizes (i.e., n). In particular, we used n = 100, 250, 500 and 1000. For each value of n, we repeated the experiment 10 times by creating 10 datasets with the same properties. In all datasets, we set the number of trees to m = 100, the noise percentage to ∊ = 20%, the size of the embedded MFAST to 15% of n, and the MFAST frequency to 0.8.

Table 3 reports the average MFAST size found by our method for varying tree sizes. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets. Before going into a detailed discussion of the results, it is crucial to observe that our method could run to completion for datasets that have as many as 1000 taxa. When we tried to run Phylominer, it did not return any results for datasets that have more than 100 taxa. The results also demonstrate that our method could successfully identify the embedded MFAST in all the datasets regardless of the size of the input trees. In some datasets, the reported MFAST was slightly larger than the embedded one.
This indicates that while randomly inserting the taxa that are not part of the embedded MFAST, it is possible that a few taxa were consistently placed under the same clade.

Table 3 Evaluation of the effect of the size of the trees in $\mathcal{T}$

    Number of taxa    MFAST size
                      Embedded    Reported before post processing    Reported after post processing
    100               15          15.3                               15.8
    250               38          32.3                               38.8
    500               75          43.7                               76.0
    1000              150         69.8                               151.0

The tree size is set to 100, 250, 500 and 1000. For each tree size we run our experiments on ten datasets. Each dataset contains 100 trees with an embedded MFAST of size 15% of the input tree size. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets.

The results also suggest that our method identifies a significant percentage of the taxa in the embedded MFAST after the second phase (i.e., before postprocessing) when the tree size is small. As the tree size grows, it starts missing some taxa at this phase. It however recovers the missing taxa during the postprocessing phase even for the largest tree size. This indicates that at the end of phase two our method could identify a backbone of the actual MFAST. The unidentified taxa at this phase are scattered throughout the clades in the input trees. Thus, there is no clade of size k + c that contains them with c contractions for small k or c. As evident from Table 3, this however does not prevent our method from recovering them. This is because the backbone reported at the end of phase two is large enough, and thus specific enough, to recover the missing taxa one by one in the last phase. This is a significant observation as it demonstrates that our method works well even with small values of k and c.

Effects of noise percentage

Recall that the noise percentage ∊ denotes the percentage of taxa that are added inside the clade that contains the MFAST.
As ∊ increases, the pairs of taxa in the MFAST get farther away from each other in the tree that contains it. As a result, fewer taxa from the MFAST will be contained in small clades of size k + c. This raises the question of whether our method works well as ∊ increases and thus the MFAST taxa get scattered around in the trees that contain it.

In this experiment, we answer the question above and analyze the effect of the noise percentage on the success of our method. We create synthetic datasets with various ∊ values. In particular, we use ∊ = 20, 40 and 60%. We set the size of the embedded MFAST to n′ = 15, the tree size to n = 100, the number of trees to m = 100 and the MFAST frequency to f = 0.8. We repeat our experiment for each parameter 10 times by recreating the dataset randomly using the same parameters. We set the frequency cutoff to γ = 0.7. We report the average MFAST size found by our method in Table 4.

Table 4 Evaluation of the effect of the noise in the trees in $\mathcal{T}$

    Noise (%)    MFAST size
                 Before post processing    After post processing
    20           15.3                      15.8
    40           13.6                      15.0
    60           12.7                      15.0

The size of the embedded MFAST in all the experiments is 15. We list the average size of the MFAST found by our method before and after the post processing phase.

The results suggest that our method can identify the embedded MFAST successfully even when the noise percentage is very high. We observe that the size of the MFAST found by our method before post processing decreases slowly with increasing amount of noise. This is not surprising, as the taxa contained in the embedded MFAST get more spread out (and thus farther away from each other) in the trees in $\mathcal{T}$ with increasing noise. As a result, if there are taxa that are not part of any seed with the provided values of k and c, they will never be included in the computed MFAST at the end of phase two. We however observe that (i) only a small number of such taxa exist.
For instance, even for the largest noise percentage (∊ = 60%), only 2.3 taxa (i.e., 15 − 12.7) are missing on average. (ii) The missing taxa are recovered during phase three. This is because the computed MFAST at the end of phase two is very large, and thus it is specific to the embedded MFAST.

Impact of seed creation

So far, in our experiments we consistently observed two major points for all the parameter settings (see Sections “Effects of number of input trees” to “Effects of noise percentage”): (i) Our method always finds a large subtree of the embedded MFAST after phase two. (ii) Our method always recovers the entire embedded MFAST after phase three. The second observation follows from the first: the outcome of phase two is large enough to build the entire MFAST precisely. The first observation, however, indicates that the set of seeds generated in phase one contains a significant percentage of the taxa in the embedded MFAST. In this section, we take a closer look at this phenomenon and explain why this is the case even for small values of seed size k and contraction amount c, and large noise percentage ∊. To do that, we will compute the probability that a subset of the taxa of the embedded MFAST appears in at least one seed generated in phase one. In our computation, we will assume that the taxa can appear at any location of a given tree with the same probability. We discuss the implication of this assumption later in this section.

The number of rooted bifurcating trees for a given set of n taxa is $R(n) = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$. Consider a clade with k + c taxa. The number of trees with n taxa that contain this clade is R(n − (k + c) + 1), as the topology of the k + c sized clade is fixed. For a given subtree with k taxa, let us denote the number of clade topologies of size k + c that contain that subtree with NU(k, c).
We can compute this function recursively as NU(k, 0) = 1 and, for c > 0, $\mathrm{NU}(k, c) = \mathrm{NU}(k, c-1) \times 2 \times (k + c - 2)$. Let us denote one of these clades by U(k, c). Also, let us denote the probability that the clade U(k, c) exists in a random tree topology that contains n taxa with P(n, k, c). Intuitively, P(n, k, c) is the probability that our method will extract a specific k taxa subtree from one n taxa tree after only c contractions. We can compute this probability as the ratio of the number of tree topologies that satisfy this constraint to that of all possible tree topologies. We formulate this as $P(n, k, c) = \frac{\mathrm{NU}(k, c) \times R(n - (k + c) + 1)}{R(n)}$.

Recall that it suffices for our algorithm to have a k taxa subtree of the MFAST in at least one tree in the given set of m trees. The probability that the clade U(k, c) exists in at least one of the m random tree topologies each containing n taxa is $P(n, k, c, m) = 1 - (1 - P(n, k, c))^{m}$.

Assume that the MFAST size in the given set of trees $\mathcal{T}$ is h. Let us denote the number of k taxa subtrees of the MFAST as NS(h, k). The probability that at least one of these subtrees will be found in at least one of the input trees is then $P(n, k, c, m, h) = 1 - (1 - P(n, k, c, m))^{\mathrm{NS}(h, k)}$. A lower bound to NS(h, k) is h − k + 1, which can be obtained by picking a contiguous block of k taxa from the canonical Newick representation of the MFAST by considering all possible h − k + 1 starting point locations. Notice that the larger the value of P(n, k, c, m, h), the higher the chances that our algorithm will construct some part of the MFAST. Similarly, the larger the value of NS(h, k), the higher the chances that our algorithm will construct some part of the MFAST.

Figure 6 plots the success probability (i.e., P(n, k, c, m, h)) of our method for varying parameter values.
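The quantities R(n), NU(k, c), P(n, k, c), P(n, k, c, m) and P(n, k, c, m, h) defined above can be evaluated numerically. The following Python sketch is only an illustration that transcribes the formulas as stated; it is not part of our C and Perl implementation, and all function names are hypothetical. It assumes uniformly random topologies and substitutes the lower bound h − k + 1 for NS(h, k), so the value it returns is a lower bound on the success probability and is not expected to reproduce Figure 6 exactly.

```python
import math
from fractions import Fraction


def num_rooted_trees(n: int) -> int:
    """R(n) = (2n-3)! / (2^(n-2) (n-2)!), i.e. the product 3 * 5 * ... * (2n-3)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1  # R(1) = R(2) = 1
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


def num_clade_topologies(k: int, c: int) -> int:
    """NU(k, c) with NU(k, 0) = 1 and NU(k, c) = NU(k, c-1) * 2 * (k + c - 2)."""
    value = 1
    for step in range(1, c + 1):
        value *= 2 * (k + step - 2)
    return value


def p_single_tree(n: int, k: int, c: int) -> Fraction:
    """P(n, k, c): chance that one random n-taxon tree contains the clade U(k, c)."""
    if k + c > n:
        raise ValueError("clade size k + c cannot exceed n")
    numerator = num_clade_topologies(k, c) * num_rooted_trees(n - (k + c) + 1)
    return Fraction(numerator, num_rooted_trees(n))


def at_least_once(p: float, trials: int) -> float:
    """1 - (1 - p)^trials, computed with log1p/expm1 so tiny p is not rounded away."""
    if trials <= 0 or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return -math.expm1(trials * math.log1p(-p))


def success_probability(n: int, k: int, c: int, m: int, h: int) -> float:
    """Lower bound on P(n, k, c, m, h), using NS(h, k) >= h - k + 1 (assumption)."""
    if h < k:
        return 0.0  # the MFAST has no k-taxon subtree
    p_m = at_least_once(float(p_single_tree(n, k, c)), m)  # P(n, k, c, m)
    return at_least_once(p_m, h - k + 1)


if __name__ == "__main__":
    # Example parameter values chosen only for illustration.
    for c, k in [(3, 5), (4, 4), (5, 3)]:
        print(k, c, success_probability(n=100, k=k, c=c, m=500, h=20))
```

Exact integer arithmetic is used for the tree counts because R(n) grows faster than exponentially; only the final ratio is converted to floating point.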
As the MFAST size increases, the success probability rapidly increases. This is because the number of alternative subtrees of the MFAST increases with increasing MFAST size. Thus, the chance of observing at least one increases as well. We observe that when the size of the MFAST is around 20% of the tree size, our success probability becomes almost 1 for all the parameters reported. As the number of contractions increases, the probability of success increases. This is because a large number of contractions increases the possibility of eliminating false positive taxa from clades. In other words, it helps glue the taxa that are normally scattered in the input trees back together by removing the remaining taxa among them. When c = 5, our success probability becomes almost one even for MFASTs that are as small as 4 to 6% of the tree size. As the number of trees increases, the success probability increases as well. This is because we have more alternative topologies with increasing number of trees. Thus, there are more chances to have a small clade that contains a part of the MFAST. Finally, it is worth noting that these results are computed based on the assumption that the trees in $\mathcal{T}$ are uniformly distributed among all possible topologies. In practice, we expect that these trees are constructed with the same or similar objectives (such as maximum parsimony or maximum likelihood). As a result, they will likely have a higher chance to contain large MFASTs. The results we expect in practice will thus be similar to or even better than the theoretical results in Figure 6.

Figure 6 The probability of finding at least one seed which contains a part of an MFAST. The number of contractions c is set to 3, 4 and 5 and the corresponding seed size k is 5, 4 or 3. The x-axis shows the MFAST size in terms of the percentage of the number of taxa in the trees in $\mathcal{T}$. In (a), we set the total number of trees m = 500. In (b) we set m = 1000.

Overall, we conclude from this experiment that even small values of k and c suffice to capture a part of the MFAST in phase two. Therefore, although our algorithm’s complexity increases exponentially with k and c, we do not need to use large values for k and c. This enables our algorithm to scale to very large datasets with thousands of taxa and trees. These results explain the theory behind the practical results we observed in Sections “Effects of number of input trees” to “Effects of noise percentage”.

Evaluation of state of the art methods

So far, we have shown that our method could successfully find the MFASTs contained in sets of trees $\mathcal{T}$ for up to 1000 taxa and 200 trees (i.e., n = 1000 and m = 200). An obvious question is how well existing methods perform on the same datasets. Here, we answer this question for two existing programs, namely PAUP* (version 4.0b10) and Phylominer.

When we fix the number of trees and the number of taxa to 100, PAUP* was able to find the MAST for all datasets. As we grow the number of taxa to 250 or larger while keeping the number of trees at 100, PAUP* runs out of memory and fails to return any results. After reducing the number of trees to 50, PAUP* still runs out of memory and cannot report any results for more than 100 taxa.

The scalability problem of Phylominer is even more severe. Phylominer is able to compute the MFASTs on datasets with up to 20 taxa. However, as we increase the number of taxa further, its performance deteriorates quickly. When we set the number of taxa to 100, even with as few as 100 trees, Phylominer takes more than a week to report a result.
Moreover, in our experiments, the maximum size of the subtrees it found on average contained fewer than 7 taxa, even though the size of the true MAST was 10.

Another interesting question about existing methods is whether the majority consensus rule can be used to find MFASTs. To evaluate this, we used the same three synthetic datasets used in Section “Effects of noise percentage”. Recall that each of these three datasets contains an MFAST of size 15 which is embedded in 80% of the trees. The datasets are created with 20%, 40% and 60% noise, indicating different levels of difficulty in recovering the embedded MFAST. We computed the 70% majority consensus tree. Notice that if the majority consensus rule can identify an MFAST, that would correspond to a bifurcating subtree topology in the consensus tree. In other words, a subtree is bifurcating in this experiment only if 70% or more of the input trees agree on the topology of that subtree. The resulting tree, however, was multifurcating for all three datasets. This means that the majority consensus rule could not recover even a smaller portion of the embedded tree, while our method was able to locate the entire MFAST successfully (see Table 4).

These results demonstrate that both PAUP* and Phylominer are poorly suited to finding agreement subtrees in larger datasets, whereas our method scales better in terms of both the number of taxa and the number of trees. When PAUP* runs to completion, we observed that it reports the true results. Recall from previous experiments that our method always found the true results on the same datasets as well as on larger datasets. This suggests that our method has the potential to have an impact in large scale phylogenetic analysis when existing methods fail.

Empirical dataset experiments

To examine the performance of the MFAST method on real data, we performed experiments using 200 maximum likelihood bootstrap trees from a phylogenetic analysis of gymnosperms (959 taxa) and Saxifragales (950 taxa).
Specifically, we evaluated how the performance of the MFAST algorithm was affected by the number of input trees and the size of the input trees.

Effects of number of input trees

We first examined the effect of input tree number on the size of the MFAST. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 50 and 100 trees by randomly sampling from the original 200 trees without replacement. We compared the average size of the MFAST in the 50 and 100 tree data sets with the size of the MFAST in the original 200 tree data set. First, in all analyses, the postprocessing step greatly increases the size of the MFAST, sometimes more than doubling it (Table 5). This increase is similar to the one observed in the 1000 taxon simulated data sets (Table 3), emphasizing again the importance of the postprocessing step with large tree data sets. Although the sizes of the MFASTs were similar, they decreased slightly with the addition of more trees (Table 5). This may simply be a matter of observing more conflict with more trees.

Table 5 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three)

    Number of trees    MFAST size
                       Gymnosperms                Saxifragales
                       Before    After    Only    Before    After    Only
    50                 78.5      129.8    99.5    64.7      122.0    84.1
    100                68.4      119.2    83.1    55.4      112.8    74.7
    200                76.0      118.0    84.0    40.0      105.0    75.0

The size of the MFAST found by running only the post processing step is also shown (column “Only”). We run our method on the entire dataset that contains 200 trees as well as on randomly selected subsets of 50 and 100 trees. We repeated the 50 and 100 tree experiments 10 times by randomly selecting the trees from the entire dataset and reported the average value.

The large gap between the MFAST sizes before and after the post processing suggests that phase three is the main reason behind the success of our method, and thus, the costly seed combination phase (i.e., phase two) may be unnecessary.
To test whether this conjecture is correct, we ran a variant of our method with the second phase disabled; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table 5. The results demonstrate that although phase three can grow a large FAST, phase two is essential to find the largest frequent agreement subtree. In other words, post processing finds the true MFAST only if a large portion of it is already found (which is the role served by phase two). In conclusion, phase three of our method cannot replace phase two; both phases are essential for the success of our method.

Effects of size of input tree

Next, we examined the effect of the number of leaves in the input trees on the size of MFASTs. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 200 input trees with 100, 250, and 500 taxa. To make each set, we randomly selected 100, 250, or 500 taxa, and we deleted all other taxa from the original sets of 200 trees. Thus, these sets of trees with 100, 250, or 500 taxa are subtrees of the original data sets. The average size of the MFAST increases with more taxa in the original trees (Table 6). However, interestingly, the average size of the MFASTs for the gymnosperm data set with 500 taxa is larger than the MFAST found from the original gymnosperm trees with all the taxa (Table 6). Since the MFASTs from the 500 taxon data sets should all be found within the full data set, this indicates that on the larger trees, our method may not always find the true (i.e., largest) MFAST. The full data sets may require a larger number of contractions to find the true MFASTs.
Table 6 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three) for different number of taxa

    Number of leaves    MFAST size
                        Gymnosperms                Saxifragales
                        Before    After    Only    Before    After    Only
    100                 41.2      56.1     43.5    43.5      50.7     38.5
    250                 67.2      88.5     63.0    62.3      76.2     54.6
    500                 91.6      123.0    74.9    52.0      86.7     62.9
    All                 76.0      118.0    84.0    40.0      105.0    75.0

The size of the MFAST found by running only the post processing step is also shown (column “Only”). We run our method on the entire dataset that contains all the taxa (last row) as well as on randomly selected taxa subsets of size 100, 250 and 500. We repeated the 100, 250 and 500 taxa experiments 10 times by randomly selecting the taxa from the entire dataset and reported the average value.

Similar to the experiments in Section “Effects of number of input trees”, we investigated the gap between the MFAST sizes before and after the post processing step. We ran a variant of our method with the second phase disabled; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table 6. The results parallel those in Table 5. Phase three can grow a large frequent agreement subtree, but not quite as big as the one obtained when both phases two and three are executed.

Effects of sample size

In our final experiment, we evaluated the effect of the maximum time cutoff we described in Section “Inorder combination” on the accuracy of our method. Recall that this cutoff limits the number of initial seeds tried in our algorithm by randomly sampling a small percentage of the seeds. It only uses the sampled seeds as possible initial seeds. However, it uses the entire set of seeds while growing the MFAST determined by the initial seed.
As each initial seed takes roughly the same amount of time to grow into an MFAST, using x% of the seeds as the sample set reduces the total running time of our method to roughly x% of that of our original implementation.

We carried out this experiment as follows. For both the gymnosperm and Saxifragales trees, we ran 10 sets of experiments for each sampling percentage of 2, 5, 10, 25, 50 and 100%. Thus, in total we ran 60 (6 × 10) experiments. Table 7 presents the average MFAST sizes for varying sample sizes. The results demonstrate that even for very small sampling percentages, our method finds an MFAST that is almost as big as the MFAST found by using the entire dataset (i.e., 100% sampling percentage). This is very promising as it demonstrates that the running time cost of our method can easily be cut to a small fraction by sampling the starting seeds. The rationale behind this is that the MFAST contains many seeds. Starting from any of these seeds, our algorithm has the potential to lead to that MFAST. The probability that at least one of these seeds appears in the sample set is large, particularly for large MFASTs.

Table 7 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets for different random subsamples of the total number of taxa

Sampling            MFAST size
percentage    Gymnosperms   Saxifragales
2                    85.9           74.3
5                    87.5           75.6
10                   88.4           75.2
25                   87.5           75.5
50                   88.5           76.2
100                  88.5           76.2

We ran our method by randomly picking 2%, 5%, 10%, 25%, 50% and 100% of the seeds found in phase one for combination in phase two.

Conclusion
In this paper, we present a heuristic for finding the maximum frequent agreement subtrees (MFASTs). The heuristic uses a multistep approach which first identifies small candidate subtrees (called seeds) from the set of input trees, combines the seeds to build larger candidate MFASTs, and then performs a post processing step to increase the size of the candidate MFASTs.
We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Although this heuristic is not guaranteed to find all MFASTs, it performs well using both simulated and empirical data sets. Its performance is relatively robust to the number of input trees and the size of the input trees, although with the larger data sets, the post processing step becomes more important. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.

Although the method we developed is described and implemented for rooted, bifurcating trees, it can be trivially extended to multifurcating as well as unrooted trees. The central technical difference in the case of unrooted trees would be the definition of clade (see Definition 1), as that definition requires a root. A clade in an unrooted tree encompasses two sets of nodes: (i) a given set of taxa X, and (ii) the set of all internal nodes that are on a path between two taxa in X on the phylogenetic tree. We expect that this will increase the number of seeds substantially and thus make the problem more computationally intensive. The amount of increase will depend on the tree topology. The theoretical worst case happens when all the taxa are connected to a single internal node (i.e., star topology). In that case any subset of taxa can lead to a potential seed as long as the subset size is equal to the seed size allowed. One possible way to overcome this problem would be to exploit randomization or graph coloring strategies and avoid enumerating the majority of the possible seeds.

Abbreviations
MAST: Maximum agreement subtree; FAST: Frequent agreement subtree; MFAST: Maximum frequent agreement subtree.

Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AR participated in algorithm development, implementation, experimental evaluation and writing of the paper. TK participated in algorithm development, experiment design and writing of the paper. GB participated in experiment design, dataset collection and writing of the paper. All authors read and approved the final manuscript.

Acknowledgments
This work was supported partially by the National Science Foundation (grants CCF-0829867 and IIS-0845439).

References
1. Goloboff PA, Catalano SA, Mirande JM, Szumik CA, Arias JS, Källersjö M, Farris JS: Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics 2009, 25(3):211-230. doi:10.1111/j.1096-0031.2009.00255.x
2. Price MN, Dehal PS, Arkin AP: FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS ONE 2010, 5(3):e9490. doi:10.1371/journal.pone.0009490
3. Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ: Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany 2011, 98:404-414. doi:10.3732/ajb.1000481
4. Felsenstein J: Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evol 1985, 39(4):783-791. doi:10.2307/2408678
5. Farris JS, Albert VA, Källersjö M, Lipscomb D, Kluge AG: Parsimony jackknifing outperforms neighbor-joining. Cladistics 1996, 12(2):99-124. doi:10.1111/j.1096-0031.1996.tb00196.x
6. Huelsenbeck JP, Rannala B, Masly JP: Accommodating phylogenetic uncertainty in evolutionary studies. Science 2000, 288(5475):2349-2350. doi:10.1126/science.288.5475.2349
7. Bryant D: A classification of consensus methods for phylogenetics. In Bioconsensus (Piscataway, NJ, 2000/2001), volume 61 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci. Amer. Math. Soc.; 2003:163-183.
8. Finden CR, Gordon AD: Obtaining common pruned trees. J Classification 1985, 2:255-276. doi:10.1007/BF01908078
9. Amir A, Keselman D: Maximum agreement subtree in a set of evolutionary trees: Metrics and efficient algorithms. SIAM J Comput 1997, 26(6):1656-1669. doi:10.1137/S0097539794269461
10. Kubicka E, Kubicki G, McMorris FR: An algorithm to find agreement subtrees. J Classification 1995, 12(1):91-99. doi:10.1007/BF01202269
11. Farach M, Przytycka TM, Thorup M: On the agreement of many trees. Inf Process Lett 1995, 55(6):297-301. doi:10.1016/0020-0190(95)00110-X
12. Bryant D: Building trees, hunting for trees and comparing trees. PhD thesis. Dept. Mathematics, University of Canterbury; 1997.
13. Cole R, Farach-Colton M, Hariharan R, Przytycka TM, Thorup M: An O(n log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J Comput 2000, 30(5):1385-1404. doi:10.1137/S0097539796313477
14. Lee CM, Hung LJ, Chang MS, Shen CB, Tang CY: An improved algorithm for the maximum agreement subtree problem. Inf Process Lett 2005, 94(5):211-216. doi:10.1016/j.ipl.2005.02.005
15. Berry V, Nicolas F: Improved parameterized complexity of the maximum agreement subtree and maximum compatible tree problems. IEEE/ACM Trans Comput Biology Bioinform 2006, 3(3):289-302. doi:10.1109/TCBB.2006.39
16. Guillemot S, Nicolas F: Solving the maximum agreement subtree and the maximum compatible tree problems on many bounded degree trees. In Proceedings of the 17th Annual conference on Combinatorial Pattern Matching (CPM’06). Edited by Lewenstein M, Valiente G. Berlin, Heidelberg: Springer-Verlag; 165-176. doi:10.1007/11780441_16
17. Guillemot S, Nicolas F, Berry V, Paul C: On the approximability of the maximum agreement subtree and maximum compatible tree problems. Discrete Applied Mathematics 2009, 157(7):1555-1570. doi:10.1016/j.dam.2008.06.007
18. Chi Y, Xia Y, Yang Y, Muntz: Correction to "mining closed and maximal frequent subtrees from databases of labeled rooted trees". IEEE Trans Knowl Data Eng 2005, 17(12):1737.
19. Zhang S, Wang JTL: Mining frequent agreement subtrees in phylogenetic databases. In Proceedings of the 6th SIAM International Conference on Data Mining (SDM 2006). Edited by Ghosh J, Lambert D, Skillicorn DB, Srivastava J. Maryland: Bethesda; April 2006:222-233.
20. Zhang S, Wang JTL: Discovering frequent agreement subtrees from phylogenetic data. IEEE Trans Knowl Data Eng 2008, 20(1):68-82.
21. Cranston KA, Rannala B: Summarizing a Posterior Distribution of Trees Using Agreement Subtrees. Syst Biol 2007, 56(4):578-590. doi:10.1080/10635150701485091
22. Pattengale ND, Swenson KM, Moret BME: Uncovering hidden phylogenetic consensus. In Proc. 6th Int’l Symp. Bioinformatics Research & Appls. ISBRA’10, Lecture Notes in Computer Science. Edited by Borodovsky M, Gogarten JP, Przytycka TM, Rajasekaran S. Springer; 2010:128-139.
23. Pattengale ND, Aberer AJ, Swenson KM, Stamatakis A, Moret BME: Uncovering hidden phylogenetic consensus in large data sets. IEEE/ACM Trans Comput Biology Bioinform 2011, 8(4):902-911.
24. Aberer AJ, Stamatakis A: A simple and accurate method for rogue taxon identification. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM ’11). IEEE Computer Society, Washington, DC, USA; 118-122. doi:10.1109/BIBM.2011.70
25. Junier T, Zdobnov EM: The newick utilities: high-throughput phylogenetic tree processing in the unix shell. Bioinformatics 2010, 26(13):1669-1670. doi:10.1093/bioinformatics/btq243
26. Swofford DL: PAUP*: Phylogenetic analysis using parsimony (and other methods). Version 4.0 beta 10. Sunderland, Massachusetts: Sinauer Assoc; 2002.
27. Burleigh JG, Barbazuk WB, Davis JM, Morse AM, Soltis PS: Exploring diversification and genome size evolution in extant gymnosperms through phylogenetic synthesis. J Botany 2012, 2012:6. Article ID 292857. doi:10.1155/2012/292857
28. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 2006, 22(21):2688-2690. doi:10.1093/bioinformatics/btl446

Ramu et al. BMC Bioinformatics 2012, 13:256
http://www.biomedcentral.com/14712105/13/256

RESEARCH ARTICLE (Open Access)

A scalable method for identifying frequent subtrees in sets of large phylogenetic trees
Avinash Ramu1, Tamer Kahveci2* and J Gordon Burleigh3

Abstract
Background: We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.
Results: We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.
Conclusions: Although this heuristic does not guarantee to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
Keywords: Phylogenetic trees, Frequent subtree

*Correspondence: tamer@cise.ufl.edu
2 Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Full list of author information is available at the end of the article

Background
Phylogenetic trees represent the evolutionary relationships of organisms. While recent advances in genomic sequencing technology and computational methods have enabled construction of extremely large phylogenetic trees (e.g., [1-3]), assessing the support for phylogenetic hypotheses, and ultimately identifying well-supported relationships, remains a major challenge in phylogenetics. Support for a tree often is determined by methods such as nonparametric bootstrapping [4], jackknifing [5], or Bayesian MCMC sampling (e.g., [6]), which generate a collection of trees with identical taxa representing the range of possible phylogenetic relationships. These trees can be summarized in a consensus tree (see [7]). Consensus methods can highlight support for specific nodes in a tree, but they also may obscure highly supported subtrees. For example, in Figure 1, the subtree containing taxa A, B, C, and D is present in all five input trees. However, due to the uncertain placement of taxon E, the majority rule consensus tree implies that the clades in the tree have relatively low (60%) support.

Alternate approaches have been proposed to reveal highly supported subtrees. The maximum agreement subtree (MAST) problem seeks the largest subtree that is present in all members of a given collection of trees [8]. For example, in Figure 1 the MAST includes taxa A, B, C, and D. Finding the MAST is an NP-hard problem [9], although efficient algorithms exist to compute the MAST in some cases (e.g., [9-17]). In practice, since any difference in any single tree will reduce the size of the MAST, the MAST is often quite small, limiting its usefulness.
A less restrictive problem is to find frequent agreement subtrees (FAST), or subtrees that are found in many, but not necessarily all, of the input trees (see [18]). In this problem, a subtree is declared as frequent if it is in at least as many trees as a user-supplied frequency threshold. Several algorithmic approaches have been suggested to identify FASTs, and specifically the maximum FASTs (MFASTs), or FASTs that contain the largest number of taxa. A variant of this problem seeks the maximal FASTs, i.e., FASTs that are not contained in any other FASTs. Notice that an MFAST is a maximal FAST; however, the inverse is not necessarily true. Zhang and Wang defined algorithms, implemented in Phylominer, to identify FASTs from a collection of phylogenetic trees [19,20]. These algorithms are guaranteed to find all FASTs but they may be prohibitively slow for data sets larger than 20 taxa. Cranston and Rannala implemented Metropolis-Hastings and Threshold Accepting searches to identify large FASTs from a Bayesian posterior distribution of phylogenetic trees [21]. This approach can handle thousands of input trees but it may not be feasible if the trees have more than 100 taxa [21].

Figure 1 (a) A collection of five input trees. The same subtree with taxa A, B, C, and D is present in all input trees, and only the position of taxon E changes. (b) The majority rule consensus and maximum agreement subtrees of the 5 input trees in Figure 1a.

© 2012 Ramu et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Another approach to reveal highly supported subtrees from a collection of trees is to identify and remove rogue taxa, or taxa whose position in the input trees is least consistent. Recently, several methods have been developed that can identify and remove rogue taxa from collections of trees with thousands of taxa [22-24]. However, unlike MAST or FAST approaches, they do not provide guarantees about the support for the remaining taxa.

In this paper, we describe a heuristic approach for identifying MFASTs in collections of trees. Unlike previous methods, our method easily scales to datasets with over a thousand taxa and hundreds of trees. Towards this goal, we develop a heuristic solution that works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step. This step ensures that the size (i.e., number of taxa) of the FAST found cannot be increased further by adding a new taxon without reducing its frequency below a user-supplied frequency threshold. We demonstrate that this heuristic can easily handle data sets with 1000 taxa. We test the effectiveness of these approaches on simulated data sets and then demonstrate its performance on large, empirical data sets. Although our heuristic does not guarantee to find all MFASTs or the largest MFAST in theory, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on the empirical data sets. Its performance is robust with respect to the number of input trees and the size of the input trees.

Methods
In this section we describe our method that aims to find Maximum Frequent Agreement SubTrees (MFASTs) in a given set of m phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$. Our method follows from the observation that an MFAST is present in a large number of trees in $\mathcal{T}$. The method builds MFASTs bottom-up from small subtrees of taxa in the trees in $\mathcal{T}$. Briefly, it works in three phases.

Phase 1. Seed generation (Section “Phase one: Seed generation”). In the first phase, we identify small subtrees from the input trees that have a potential to be a part of an MFAST. We call each such subtree a seed.
Phase 2.
Seed combination (Section “Phase two: Seed combination”). In the second phase, we construct an initial FAST by combining the seeds found in the first phase.
Phase 3. Post processing (Section “Phase three: Post processing”). In the third phase, we grow the FAST further to obtain the maximal FAST that contains it by individually considering the taxa which are not already in the FAST. We report the resulting maximal FAST as a possible MFAST.

First, we present the basic definitions needed for this paper in Section “Preliminaries and notation”. We then discuss each of the three phases above in detail.

Preliminaries and notation
In this section, we present the key definitions and notations needed to understand the rest of the paper. We describe our method using rooted and bifurcating phylogenetic trees. However, our method and definitions can easily be applied to unrooted or multifurcating trees with minor or no modifications. Also, we assume that all the taxa are placed at the leaf level nodes of the phylogenetic tree, and all the internal nodes are inferred ancestors. Figure 2(a) shows a sample phylogenetic tree built on five taxa. We define the size of a tree as the number of taxa in that tree. We start by defining key terms.

Definition 1 (Clade) Let T be a phylogenetic tree. Given an internal node of T, we define the set of all nodes and edges of T contained under that node as the clade rooted at that node.

Each internal node of a phylogenetic tree corresponds to a clade of that tree. Figure 2(b) depicts the clade of the tree in Figure 2(a) rooted at $x_1$.

Definition 2 (Contraction) Let T be a phylogenetic tree with n taxa. The contraction operation transforms T into a tree with $n - 1$ taxa by removing a given taxon in T along with the edge that connects that taxon to T.

The contraction operation can extract the clades of a tree by removing all the taxa that are not a part of that clade. It can also extract parts of the tree that are not necessarily clades. We use the term subtree to denote a tree that is obtained by applying contractions to an arbitrary set of taxa in a given tree. The formal definition is as follows.
Definition 3 (Subtree) Let $T$ and $T'$ be two phylogenetic trees. We say that $T'$ is a subtree of $T$ if $T$ can be transformed into $T'$ by applying a series of contractions on $T$.

If a tree $T'$ is a subtree of another tree $T$, we say that $T'$ is present in $T$. Notice that a clade is always a subtree, but the inverse is not true all the time. Figures 2(b) and 2(c) illustrate two subtrees of the tree in Figure 2(a). Let us denote the number of combinations of k taxa from a set of n taxa with $\binom{n}{k}$. In general, if a tree has n taxa, then that tree contains $\binom{n}{k}$ subtrees with k taxa. As a consequence, that tree contains $2^n - 1$ subtrees of any size including itself.

Figure 2 (a) A rooted, bifurcating phylogenetic tree T built on five taxa labeled with a, b, c, d and e. The internal nodes are shown with $x_0$, $x_1$, $x_2$ and $x_3$. (b) A clade of T rooted at $x_1$. (b) and (c) Two subtrees of T by contracting the taxa sets {d, e} and {a, e}.

Definition 4 (Frequency) Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of m phylogenetic trees and $T$ be a phylogenetic tree. Let us denote the number of trees in $\mathcal{T}$ at which $T$ is present with the variable $m'$. We define the frequency of $T$ in $\mathcal{T}$ as $freq(T, \mathcal{T}) = \frac{m'}{m}$.

Definition 5 (FAST) Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of m phylogenetic trees and $T$ be a phylogenetic tree. Let $\gamma$ be a number in the [0, 1] interval that denotes the frequency cutoff. We say that $T$ is a Frequent Agreement SubTree (FAST) of $\mathcal{T}$ if its frequency in $\mathcal{T}$ is at least $\gamma$ (i.e., $freq(T, \mathcal{T}) \geq \gamma$).

We say that a FAST is maximal if there is no other FAST that contains all the taxa in that FAST. Clearly, larger FASTs indicate biologically more relevant consensus patterns. The following definition summarizes this.

Definition 6 (MFAST) Let $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ be a set of m phylogenetic trees. Let $\gamma$ be a number in the [0, 1] interval that denotes the frequency cutoff. A FAST $T$ of $\mathcal{T}$ is a Maximum Frequent Agreement SubTree (MFAST) of $\mathcal{T}$ if there is no other FAST $T'$ of $\mathcal{T}$ that has a larger size than $T$.
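The definitions above (contraction, subtree, frequency and FAST) can be illustrated with a small sketch. It is not the authors' implementation: the nested-tuple tree encoding, the function names (`leaves`, `restrict`, `is_present`, `freq`, `is_fast`) and the example trees are all assumptions made here for illustration. A tree is either a taxon label or a tuple of child subtrees, and contracting taxa is done by restricting the tree to the taxa that are kept.

```python
def leaves(tree):
    """Return the set of taxa (leaf labels) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    taxa = frozenset()
    for child in tree:
        taxa |= leaves(child)
    return taxa


def restrict(tree, keep):
    """Contract every taxon not in `keep` and suppress unary nodes.

    Children are sorted, so two trees with the same topology are
    returned in the same canonical form and compare equal with ==.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(sorted(kids, key=repr))


def is_present(sub, tree):
    """True if `sub` can be obtained from `tree` by contractions only."""
    taxa = leaves(sub)
    return restrict(tree, taxa) == restrict(sub, taxa)


def freq(sub, trees):
    """Fraction of the input trees in which `sub` is present."""
    return sum(is_present(sub, t) for t in trees) / len(trees)


def is_fast(sub, trees, cutoff):
    """True if `sub` is a frequent agreement subtree for this cutoff."""
    return freq(sub, trees) >= cutoff


# Hypothetical input: taxon E moves around; the last tree also
# disagrees on the placement of A, B and C.
trees = [
    (((("A", "B"), "C"), "D"), "E"),
    (((("A", "B"), "E"), "C"), "D"),
    ((("A", "B"), ("C", "E")), "D"),
    ((("A", "C"), "B"), ("D", "E")),
]
candidate = ((("A", "B"), "C"), "D")
print(freq(candidate, trees))          # 0.75
print(is_fast(candidate, trees, 0.7))  # True
```

Restricting a tree to a taxon set applies all the needed contractions at once, which keeps the presence test to a single comparison of canonical forms.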
Formally, given a set of phylogenetic trees $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ and a frequency cutoff, $\gamma$, we would like to find the MFASTs in $\mathcal{T}$ in this paper. We develop an algorithm that aims to solve this problem. Table 1 lists the variables used throughout the rest of this paper.

Table 1 Commonly used variables and functions in this paper

$\mathcal{T}$            A set of phylogenetic trees
$T_i$                    i-th tree
m                        Number of trees in $\mathcal{T}$
n                        Number of taxa in each input tree
$a_i$                    i-th taxon
$freq(T, \mathcal{T})$   Frequency of the subtree T in $\mathcal{T}$
$\gamma$                 Frequency cutoff
$S_i$                    i-th seed (each seed is a subtree of a tree in $\mathcal{T}$)
k                        Size of a seed
c                        Number of contractions used to create a seed

Phase one: Seed generation
The first phase extracts small subtrees from the given set of trees. From these subtrees we extract the basic building blocks which are used to construct MFASTs. We call these building blocks seeds. Conceptually each seed is a phylogenetic tree that contains a small subset of the taxa that make up the trees in $\mathcal{T}$. We characterize each seed with three features that are listed below. We elaborate on each feature later in this section.

1. Seed size (k) is the number of taxa in the seed.
2. Number of contractions (c) is the number of taxa we prune from a clade taken from an input tree in order to extract the seed.
3. Frequency (f) is the fraction of input trees in which the seed is present.

We explain the seed features with the help of Figures 3 and 4. The first two characteristics explain how a seed can be found in one of the trees in $\mathcal{T}$. They indicate that there is a clade of a tree in $\mathcal{T}$ such that this clade contains $k + c$ taxa and it can be transformed into that seed after c contractions from that clade. For instance in Figure 3, when k = 2 and c = 0, only seed $S_1$ can be extracted from $T_1$ by choosing the clade rooted at $x_2$. When k = 2 and c = 1, seeds $S_1$, $S_2$ and $S_3$ can be obtained using one contraction ($a_3$, $a_2$ and $a_1$ respectively) from the clade rooted at $x_1$.
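The Figure 3 example above can be reproduced with a short sketch of seed extraction for given k and c. This is only an illustration under stated assumptions: the nested-tuple tree encoding and the function names are hypothetical, and the topology used for the four-taxon tree, (((a1,a2),a3),a4), is inferred from the description (the clade at x2 holds a1 and a2, and the clade at x1 adds a3) rather than taken from the figure itself.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of taxa (leaf labels) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    taxa = frozenset()
    for child in tree:
        taxa |= leaves(child)
    return taxa


def restrict(tree, keep):
    """Contract all taxa outside `keep`; return a canonical nested tuple."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(sorted(kids, key=repr))


def clades(tree):
    """Yield the clade rooted at every internal node, root included."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree:
        yield from clades(child)


def potential_seeds(tree, k, c):
    """Seeds of size k obtained with c contractions from clades of size k + c."""
    seeds = set()
    for clade in clades(tree):
        taxa = leaves(clade)
        if len(taxa) != k + c:
            continue
        # Keeping k taxa is the same as contracting the other c taxa.
        for kept in combinations(sorted(taxa), k):
            seeds.add(restrict(clade, frozenset(kept)))
    return seeds


# Assumed topology for the four-taxon example tree.
T1 = ((("a1", "a2"), "a3"), "a4")
print(potential_seeds(T1, k=2, c=0))       # one seed: ('a1', 'a2')
print(len(potential_seeds(T1, k=2, c=1)))  # 3
```

Storing the seeds in a set of canonical tuples means that a seed extracted twice is kept only once.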
The last feature denotes the number of trees in $\mathcal{T}$ in which the seed is present. For example in Figure 4, there are nine seeds $S_1, S_2, \ldots, S_9$ extracted from the three input trees using only one contraction. Among these, the frequency of $S_1$ is 1 as it is present in all the trees. The frequency of $S_2$ is about 0.67 for it is present in only two out of three trees ($T_1$ and $T_2$). The frequency of the rest of the seeds is only about 0.33. Recall that, by definition, an MFAST is present in at least a fraction $\gamma$ of the trees in $\mathcal{T}$. Therefore, we consider only the seeds whose frequency values are equal to or greater than this number (i.e., $f \geq \gamma$).

Figure 3 $T_1$ is an input tree built on four taxa $a_1$, $a_2$, $a_3$ and $a_4$. The internal nodes of $T_1$ are labeled as $x_0$, $x_1$ and $x_2$. $S_1$ is the only seed obtained from $T_1$ when k = 2 and c = 0. That is, $S_1$ is identical to the clade rooted at $x_2$. $S_1$, $S_2$ and $S_3$ are the seeds extracted from $T_1$ when k = 2 and c = 1. They are all extracted from the clade rooted at $x_1$ by contracting $a_3$, $a_2$ and $a_1$ respectively.

Given the values of k, c and $\gamma$, we extract all the seeds which possess the desired feature values from the set of input trees as follows. In the newick string representation of a tree, a pair of matching parentheses corresponds to an internal node in the tree. The number of taxa in the clade rooted at this internal node is given by the number of labels between the two matching parentheses. Following from this observation, we scan the newick string of each tree one by one. For each such tree, we identify the clades which have $k + c$ taxa. Notice that, if a tree contains n taxa, then it contains at most $\frac{n}{k+c}$ clades of size $k + c$ as no two such clades can contain common taxa. We then extract all combinations of k taxa from each of these clades by contracting the remaining c taxa. The number of ways this can be done is $\binom{k+c}{c}$. Notice that all the small trees extracted this way possess the first two characteristics explained above. At this point, we however do not know their frequencies. Therefore, we call them potential seeds. It is worth mentioning that the same seed might be extracted from different trees. As we extract a new potential seed, before storing it in the list of potential seeds, we check if it is already present there. We include it in the
potential seed list only if it does not exist there yet. Otherwise, we ignore it. This way, we maintain only one copy of each seed.

Once we build our potential seed list for all the trees in $\mathcal{T}$, we go over them one by one and count their frequency in $\mathcal{T}$ as the fraction of trees that contain them. We filter all the potential seeds whose frequencies are less than the frequency cutoff. We keep the remaining ones as the list of seeds along with the frequency of each seed.

In Figure 4, consider the tree $T_1$ that has four taxa. For k = 3 and c = 1, there is only one clade of size $k + c = 4$, which is the tree $T_1$ itself. We extract four potential seeds, each having three leaves, from this tree. The potential seeds in this figure are given by $S_1$, $S_2$, $S_5$ and $S_7$, which we extract by contracting $a_4$, $a_3$, $a_2$ and $a_1$ respectively from $T_1$.

Figure 4 The set of input trees $T_1$, $T_2$, $T_3$ and the set of all nine potential seeds $S_1, S_2, \ldots, S_9$ when the seed characteristics are set to k = 3 and c = 1. All the potential seeds have three taxa as k = 3. We need one contraction from the input tree to obtain each seed. $S_1$ has frequency 1.0 as it is present in $T_1$, $T_2$ and $T_3$. Seed $S_2$ has frequency ≈ 0.67 as it is present in $T_1$ and $T_2$. Remaining seeds have frequency ≈ 0.33 as each appears in only one of the three trees.

Phase two: Seed combination
At the end of the first phase, we obtain a set of frequent seeds from the input trees. Notice that each seed is a FAST as each seed is present in a sufficient number of trees specified by $\gamma$. These seeds are the basic building blocks of our method. In the second phase of our method, we combine subsets of these seeds to construct larger FASTs.

We first define what it means to combine two seeds. In order to combine two seeds, it is a necessary condition that both seeds are present in at least one common tree T in $\mathcal{T}$. We call such a tree T the reference tree. We combine two seeds with the guidance of a reference tree.
Let $S_1$ and $S_2$ be two seeds and let $T$ be their reference tree. Let $L_1$, $L_2$ and $L$ be the set of taxa in $S_1$, $S_2$ and $T$ respectively. Combining $S_1$ and $S_2$ results in the tree that is equivalent to the one obtained by contracting the taxa in $L - (L_1 \cup L_2)$ from $T$. For simplicity, we will denote the combine operation using $T$ as the reference tree with the $\oplus_T$ symbol. For instance we denote combining $S_1$ and $S_2$ with $T$ being the reference tree as $S_1 \oplus_T S_2$. To simplify our notation, whenever the identity of the reference tree is irrelevant, we will use the symbol $\oplus$ instead of $\oplus_T$.

Figure 5 demonstrates how two seeds $S_1$ and $S_2$ are combined with the help of the reference tree $T$. In this figure, both $S_1$ and $S_2$ are subtrees of $T$. Thus, it is possible to use $T$ as the reference tree. We have $L_1 = \{a_1, a_3, a_4\}$ and $L_2 = \{a_1, a_2, a_5, a_7\}$. Thus, we build $C = S_1 \oplus_T S_2$ by contracting the taxa in $L - (L_1 \cup L_2) = \{a_6, a_8\}$ from $T$.

So far, we have explained how to combine two seeds $S_1$ and $S_2$ using a reference tree. It is possible that many trees in $\mathcal{T}$ have both seeds present in them. Thus, one question is which of these trees we should use as the reference tree to combine the two seeds. The brief answer is that all such trees need to be considered. However, we make several observations that help us avoid combining $S_1$ and $S_2$ using each such reference tree one by one exhaustively, without ignoring any of these trees. We explain them next.

Consider two trees $T_1$ and $T_2$ from $\mathcal{T}$ in which both seeds are present. There are two cases for $T_1$ and $T_2$.

CASE 1: $S_1 \oplus_{T_1} S_2 = S_1 \oplus_{T_2} S_2$. In this case, it does not matter whether we use $T_1$ or $T_2$ as the reference tree. They will both lead to the same combined subtree. Thus, we use only one.
CASE 2: $S_1 \oplus_{T_1} S_2 \neq S_1 \oplus_{T_2} S_2$. In this case, the trees $T_1$ and $T_2$ lead to alternative combination topologies. So, we consider both of them separately.
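The combine operation and the two cases above can be sketched as follows. This is a minimal illustration, not the authors' code: the nested-tuple tree encoding, the function names and the eight-taxon example trees are assumptions (the topologies are hypothetical, only the taxon sets follow the Figure 5 example). Instead of marking reference trees one at a time, the sketch groups all trees that contain both seeds by the combined topology they induce, so Case 1 trees end up in the same group and Case 2 trees in different groups.

```python
from collections import Counter


def leaves(tree):
    """Return the set of taxa (leaf labels) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    taxa = frozenset()
    for child in tree:
        taxa |= leaves(child)
    return taxa


def restrict(tree, keep):
    """Contract all taxa outside `keep`; return a canonical nested tuple."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(sorted(kids, key=repr))


def is_present(sub, tree):
    """True if `sub` can be obtained from `tree` by contractions only."""
    taxa = leaves(sub)
    return restrict(tree, taxa) == restrict(sub, taxa)


def combine(s1, s2, reference):
    """Combine two seeds: contract from the reference tree every taxon
    that belongs to neither seed."""
    return restrict(reference, leaves(s1) | leaves(s2))


def combined_topologies(s1, s2, trees):
    """Map each distinct combined topology to its frequency in `trees`."""
    counts = Counter()
    for tree in trees:
        if is_present(s1, tree) and is_present(s2, tree):
            counts[combine(s1, s2, tree)] += 1
    return {topology: n / len(trees) for topology, n in counts.items()}


# Hypothetical reference trees on taxa a1..a8.
T = (((("a1", "a2"), "a3"), "a4"), ((("a5", "a6"), "a7"), "a8"))
T_same = (((("a1", "a2"), "a3"), "a4"), (("a5", "a7"), ("a6", "a8")))  # Case 1 with T
T_alt = (((("a1", "a3"), "a2"), "a4"), ((("a5", "a6"), "a7"), "a8"))   # Case 2 with T

s1 = restrict(T, frozenset(["a1", "a3", "a4"]))
s2 = restrict(T, frozenset(["a1", "a2", "a5", "a7"]))

combined = combine(s1, s2, T)
print(sorted(leaves(combined)))  # a6 and a8 have been contracted
print(sorted(combined_topologies(s1, s2, [T, T_same, T_alt]).values()))
```

Because any tree that contains the combined subtree must also contain both seeds, counting only the trees that pass the two presence tests still gives the exact frequency of each combined topology.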
We utilize the observations above as follows. We start by picking one reference tree arbitrarily. Once we create a combined subtree using that tree, we check whether that subtree is present in the remaining trees in $\mathcal{T}$. We mark those trees that contain it as considered for reference tree and never use them as reference for the same seed pair again. This is because those trees fall into the first case described above. This way, we also store the frequency of the combined subtree in $\mathcal{T}$. If the number of unmarked trees is too small (i.e., less than $\gamma m$) then it means that even if all the remaining trees agree on the same combined topology for the two seeds under consideration, they are not sufficient to make it a FAST. Thus, we do not use any of the remaining trees as reference for those two seeds. Otherwise, we pick another unmarked tree arbitrarily and repeat the same process until we run out of reference trees.

Figure 5 $T$ is the reference tree. $S_1$ and $S_2$ are the seeds to be combined; both are present in $T$. $C$ is obtained by pruning the subtree containing taxa $a_1$, $a_2$, $a_3$, $a_4$, $a_5$ and $a_7$ from $T$.

The next question we need to answer is which seed pairs we should combine. To answer this question we first make the following proposition.

Proposition 1. Assume that we are given a set of phylogenetic trees $\mathcal{T}$. Let $S_1$ and $S_2$ be two seeds constructed from the trees in $\mathcal{T}$. For all trees $T \in \mathcal{T}$, we have the following inequality

$freq(S_1 \oplus_T S_2, \mathcal{T}) \leq \min\{freq(S_1, \mathcal{T}), freq(S_2, \mathcal{T})\}$

Proof. For any $T$, both $S_1$ and $S_2$ are subtrees of $S_1 \oplus_T S_2$. Thus if $S_1 \oplus_T S_2$ is present in a tree, then both $S_1$ and $S_2$ are present in that tree. As a result, $freq(S_1 \oplus_T S_2, \mathcal{T}) \leq freq(S_1, \mathcal{T})$ and $freq(S_1 \oplus_T S_2, \mathcal{T}) \leq freq(S_2, \mathcal{T})$. Hence, $freq(S_1 \oplus_T S_2, \mathcal{T}) \leq \min\{freq(S_1, \mathcal{T}), freq(S_2, \mathcal{T})\}$.

Proposition 1 states that as we combine pairs of seeds to grow them, their frequency monotonically decreases. This suggests that it is desirable to combine two seeds if both of them have large frequencies. This is because if one of them has a small frequency, regardless of the frequency of the other, the combined tree will have a small frequency.
As a result its chance to grow into a larger tree through additional combine operations gets smaller. Following this intuition, we develop two approaches for combining the seeds.

1. Inorder Combination (Section “Inorder combination”).
2. Minimum Overlap Combination (Section “Minimum overlap combination”).

Both approaches accept the list of seeds computed in the first phase as input and produce a larger FAST that is a combination of multiple seeds. Both of them also assume that the list of input seeds is already sorted in decreasing order of their frequencies. We discuss these approaches next.

Inorder combination
The inorder combination approach follows from Proposition 1. It assumes that the seeds with higher frequencies have greater potential to be a part of an MFAST. It exploits this assumption as follows. First, it picks a seed as the starting point to create a FAST. It then grows this seed by combining it with other seeds, starting from the most frequent one, as long as the frequency of the resulting tree remains at least as large as the given cutoff $\gamma$. It repeats this process by trying each seed as the starting point. Algorithm 1 presents this approach.

Algorithm 1 Inorder combination
FAST ← ∅
for all seeds $S_i$ do
    FAST' ← $S_i$
    Mark $S_i$ as considered
    repeat
        $S_j$ ← seed with highest frequency among unconsidered seeds
        Mark $S_j$ as considered
        CUTOFF ← $\gamma$
        tFAST ← FAST'
        repeat
            Pick the next unconsidered tree $T \in \mathcal{T}$ as reference
            Mark all the trees that contain FAST' $\oplus_T$ $S_j$ as considered
            if freq(FAST' $\oplus_T$ $S_j$, $\mathcal{T}$) ≥ CUTOFF then
                tFAST ← FAST' $\oplus_T$ $S_j$
                CUTOFF ← freq(FAST' $\oplus_T$ $S_j$, $\mathcal{T}$)
            end if
        until Less than $\gamma m$ unmarked reference trees are left in $\mathcal{T}$
        FAST' ← tFAST
        Unmark all trees in $\mathcal{T}$
    until all seeds are considered
    if size of FAST' > size of FAST then
        FAST ← FAST'
    end if
    Unmark all seeds
end for

In Algorithm 1 we first initialize the FAST as empty. We then consider each seed one by one. We initialize a temporary subtree denoted by FAST' with the seed $S_i$ under consideration and mark $S_i$ as considered. We combine the FAST' with a seed $S_j$ which has the highest frequency amongst the seeds that have not been added. If multiple seeds have the highest frequency, we randomly pick one of them and mark that seed $S_j$ as added to the FAST'. There
can be alternative ways to combine FAST' with $S_j$ leading to different topologies. We use the trees in $\mathcal{T}$ that contain both FAST' and $S_j$ as guides to try only the topologies that exist in $\mathcal{T}$. We stop constructing alternative topologies as soon as we ensure that there is not a sufficient number of trees to yield a frequency of $\gamma$. We set FAST' to the combined seed if the combined seed has large enough frequency. We then consider the seed with the next highest frequency for addition and repeat this step till all $S_j$ have been considered. If the resulting temporary FAST' is larger than FAST, we replace the smaller FAST with the larger one. In the next iteration, we initialize the FAST' with the next $S_i$. Using this approach we can initialize the FAST' with all $S_i$. Alternatively, if the user wishes to limit the amount of time spent using a maximum time cutoff, we stop the outermost loop (i.e., alternative initializations of FAST') as soon as the allowed running time budget is reached.

Notice that in Algorithm 1 each seed $S_i$ can lead to a different FAST. We record only the FAST that has the largest size. However, it is trivial to maintain the top k FASTs with the largest size instead if the user is looking for k alternative maximal FASTs.

Minimum overlap combination
The purpose of combining seeds is to construct a FAST that is large in size. Our inorder combination approach (Section “Inorder combination”) aimed to maximize the frequency of the combined seeds. In this section, we develop our second approach, named Minimum Overlap Combination. This approach picks seeds so that their combination produces as large a subtree as possible. We elaborate on this approach next.

When we combine two seeds, the size of the resulting tree becomes at least as big as the size of each of these seeds. Formally, let $S_1$ and $S_2$ be two seeds (i.e., trees). Let $L_1$ and $L_2$ be the sets of taxa contained in $S_1$ and $S_2$. We denote the size of a set, say $L_1$, with $|L_1|$. The size of the tree resulting from the combination of $S_1$ and $S_2$ is $|L_1| + |L_2| - |L_1 \cap L_2|$. For a given fixed seed size, the first two terms of this formulation remain unchanged regardless of the seed. The last term determines the growth in the size of the FAST. Thus, in order to grow the FAST rapidly, it is desirable to combine two frequent subtrees with a small number of common taxa.
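The size formula above is easy to check on taxon sets alone. The sketch below is an illustration only; the function names and the example taxon sets are hypothetical, and it ignores frequencies and topologies, looking just at how the number of shared taxa controls the size of the combined tree.

```python
def overlap(taxa_1, taxa_2):
    """Number of taxa shared by two subtrees."""
    return len(taxa_1 & taxa_2)


def combined_size(taxa_1, taxa_2):
    """Size of the combined tree: |L1| + |L2| - |L1 intersect L2|."""
    return len(taxa_1) + len(taxa_2) - overlap(taxa_1, taxa_2)


def least_overlapping(current, candidates):
    """Pick the candidate taxon set that shares the fewest taxa with `current`."""
    return min(candidates, key=lambda taxa: overlap(current, taxa))


# Hypothetical taxon sets for a current FAST and three seeds of size 3.
current = {"a1", "a3", "a4"}
seeds = [{"a1", "a3", "a5"}, {"a1", "a2", "a5"}, {"a6", "a7", "a8"}]

for seed in seeds:
    print(sorted(seed), overlap(current, seed), combined_size(current, seed))

print(sorted(least_overlapping(current, seeds)))  # ['a6', 'a7', 'a8']
```

With a fixed seed size the combined size differs only through the overlap term, so choosing the least overlapping seed is the same as choosing the largest combined tree.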
Our second approach follows from the observation above. We introduce a criterion called the overlap between two subtrees, defined as the number of taxa common between them. Our minimum overlap combination approach works the same as Algorithm 1 with a minor difference in selecting the seed S_j that will be combined with the current temporary FAST (i.e., FAST'). Rather than choosing the seed with the largest frequency, this approach chooses the one that has the least overlap with FAST' among all the unconsidered and frequent seeds. If multiple seeds have the same smallest overlap, it considers the frequency as the tie breaker and chooses the one with the largest frequency among those.

Phase three: Post-processing

So far we described how to obtain seeds (Section Phase one: Seed generation) and how to combine them to construct a FAST (Section Phase two: Seed combination). The two approaches we developed for combining seeds aim to maximize the size of the FAST. However, they do not ensure the maximality of the resulting FAST. There are two main reasons that prevent our seed combining algorithms from constructing a maximal FAST. First, some of the taxa of a maximal FAST may not appear in any seed (i.e., false negatives). As a result, no combination of seeds will lead to that maximal FAST. Second, even if all the taxa of a maximal FAST are parts of at least one seed, our algorithms will reject combining that seed with the FAST of the seeds if those seeds contain other taxa that are not part of the maximal FAST (i.e., false positives).
In the post-processing phase, we tackle the above-mentioned problems. Algorithm 2 describes the post-processing phase in detail. We do this by considering all taxa which are not already present in the FAST one by one. We iteratively grow the current FAST by including one more taxon at a time if the frequency of the resulting FAST remains at least as large as the frequency cutoff γ. We repeat these iterations until no new taxon can be included in the FAST. Thus the resulting FAST is guaranteed to be maximal.

Algorithm 2 Post-processing
INPUT = FAST from the seed combination phase
INPUT = T
OUTPUT = Maximal FAST
RESULT ← FAST
for all a_i not in FAST do
  CUTOFF ← γ
  tRESULT ← RESULT
  repeat
    Pick the next unconsidered tree T_r ∈ T as reference
    RESULT' ← RESULT ⊕_{T_r} a_i
    Mark all the trees that contain RESULT' as considered
    if frequency of RESULT' ≥ CUTOFF then
      tRESULT ← RESULT'
      CUTOFF ← frequency of RESULT'
    end if
  until fewer than γm unmarked reference trees are left in T
  RESULT ← tRESULT
  Unmark all trees in T
end for
return RESULT

We expect the post-processing step to identify quickly the taxa that have a potential to be in an MFAST but might not have been considered during the seed generation and seed combination phases. At the end of the post-processing step we obtain an MFAST.

Complexity analysis of our method

In this section we discuss the complexity of our method in terms of the three phases involved in it. Let T be a set of m phylogenetic trees having n leaves each. The complexity of the different phases of our method is as follows.

Phase one. Finding the seeds involves enumerating all the subtrees and checking their frequencies. Given seed size k and number of contractions c, each tree will contain at most $\frac{n}{k+c}$ clades, each leading to $\binom{k+c}{c}$ alternative subtrees. Thus, in total there can be up to $m \, \frac{n}{k+c} \binom{k+c}{c}$ seeds (possibly many of them identical) from all the trees in T. Typically, the values of k and c are fixed and small (in our experiments we have $k \in \{3, 4, 5\}$ and $c \in \{0, 1, 2, 3, 4, 5\}$), leading to $O(mn)$ seeds.
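As a rough illustration of the seed bound m · n/(k + c) · C(k + c, c), the following Python sketch evaluates it for a few parameter values. The function name is hypothetical, and rounding n/(k + c) down to an integer is an assumption made here so that the result is a whole number of seeds.

```python
from math import comb


def max_seed_count(m, n, k, c):
    """Upper bound on the number of seeds: m * floor(n / (k + c)) * C(k + c, c)."""
    clades_per_tree = n // (k + c)        # assumed: at most this many clades per tree
    subtrees_per_clade = comb(k + c, c)   # ways to contract c of the k + c taxa
    return m * clades_per_tree * subtrees_per_clade


# Example: 100 trees with 1000 taxa each, seed size 3 and 2 contractions.
print(max_seed_count(m=100, n=1000, k=3, c=2))
```

For fixed small k and c the binomial factor is a constant, so the count grows linearly in both m and n.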
The complexity of finding whether a seed is present in a single tree is $O(n \log n)$. Given that there are m trees in T, the cost of computing the frequency of a single seed is $O(mn \log n)$. Thus, the time complexity for finding the frequency of all the seeds is this expression multiplied by the number of seeds, which is $O(m^2 n^2 \log n)$.

Phase two. Consider a set of p frequent seeds that will be considered for combining in this phase. Recall that we have two approaches to combine them. Below, we focus on each.

INORDER COMBINATION. We try to combine each seed with every other seed, leading to $O(p^2)$ iterations. The complexity of checking the frequency of each combined subtree is $O(mn \log n)$. Also, there can be up to $O(m)$ different reference trees for guiding the combine operation. Multiplying these terms, we obtain the complexity of this phase using this approach as $O(p^2 m^2 n \log n)$.

MINIMUM OVERLAP COMBINATION. The complexity of combining the frequent seeds using the minimum overlap combination approach is very similar to the inorder approach except for an additional term. The additional complexity arises because we maintain the overlap between the subtrees. This leads to the complexity $O(p^2 n^2 + p^2 m^2 n \log n)$.

Phase three. Here, we consider the FAST obtained from each of the p frequent seeds in phase two. For each FAST, we sequentially go over each taxon one by one, leading to $O(n)$ iterations. There can be up to $O(m)$ reference trees to add a taxon. So the cost of extending all p FASTs is $O(mnp)$.

Notice that each frequent seed has to appear in at least $\gamma m$ trees. Thus, the number of unique frequent seeds p is bounded by $O\left(\frac{mn}{\gamma m}\right) = O\left(\frac{n}{\gamma}\right)$. Thus, adding the cost of all the three phases, the overall time complexity of our method using inorder combination is

$O\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + \frac{m n^2}{\gamma}\right)$.

That using minimum overlap combination is

$O\left(m^2 n^2 \log n + \frac{m^2 n^3 \log n}{\gamma^2} + \frac{n^4 \log n}{\gamma^2} + \frac{m n^2}{\gamma}\right)$.

In the two summations above, the second term is asymptotically larger than the first and the last terms.
Thus, we can simplify the asymptotic time complexity of inorder and minimum overlap combinations as $O\left(\frac{m^2 n^3 \log n}{\gamma^2}\right)$ and $O\left(\frac{n^3 \log n}{\gamma^2}(m^2 + n)\right)$, respectively.

Results and discussion

This section evaluates the performance of our MFAST algorithm experimentally.

Implementation details. We implemented our MFAST algorithm using C and Perl. More specifically, we implemented the first two phases (seed generation and seed combination) in C and the third phase (post-processing) in Perl. We utilize the functions provided in the Newick Utilities [25] package by modifying the source code provided in that package. We use $k \in \{3, 4, 5\}$ and $c \in \{0, 1, 2, 3, 4, 5\}$ in all of our experiments unless otherwise stated. In our experiments, we observed that the minimum overlap combination produced larger MFASTs than the inorder combination approach. Therefore, we limit our experimental results to the minimum overlap combination approach.

Methods compared against. We have compared our method against Phylominer [20] and the MAST command implemented in PAUP* [26]. Among these, Phylominer also seeks MFASTs in a collection of trees. However, the time complexity of this method is exponential in the size of the input trees, and hence it becomes intractable for large trees. In our experiments, we observed that it does not scale beyond 50 taxa. PAUP* is primarily a program for phylogenetic inference, although it can also compute MASTs. MASTs have a strict 100% agreement criterion, unlike the arbitrary frequency cutoff values γ in our method.

Evaluation criteria. We evaluate our algorithm based on the size of the MFAST found. Larger MFASTs are preferable. When possible, we report the size of the optimal solution as well.

Test environment. We ran our experiments on Linux servers equipped with dual AMD Opteron dual-core processors running at 2.2 GHz and 3 GB of main memory to test the performance of our method.

Datasets. We test the performance and verify the results of our method on synthetic datasets and real datasets.

SYNTHETIC DATASET. We built synthetic datasets in which we embedded an MFAST as described below. We characterize each synthetic dataset using five parameters.

1. Tree size (n).
2. Number of trees (m).
3. MFAST frequency (f).
4. MFAST size.
5. Noise percentage.

The first two parameters denote the size and number of trees in T. MFAST frequency specifies the fraction of trees in T which contain an MFAST. MFAST size is the number of taxa in the embedded MFAST. The noise percentage is the percentage of taxa that are not a part of the embedded MFAST but are placed on the branches within the clade that contains the MFAST. We place all the other taxa on the branches outside this clade.

Given an instantiation of these parameters, we first created a tree whose number of taxa equals the MFAST size. This tree serves as the MFAST. We then created m × f trees that contain this MFAST. We build each of these trees by inserting the remaining taxa (tree size minus MFAST size) randomly in the MFAST. With probability equal to the noise percentage, we insert each taxon within the clade that contains the MFAST. With the complementary probability, we insert it outside that clade. We then created m − (m × f) trees that do not contain the current MFAST. We simply do this by inserting all the taxa one by one at a random location.

REAL DATASETS. We use two empirical datasets to evaluate the performance of our heuristic. The datasets contain 200 bootstrap trees generated from phylogenetic analysis of the Gymnosperm [27] and Saxifragales (Burleigh, unpublished) plant clades. To make the bootstrap trees, we assembled supermatrices, matrices of concatenated gene alignments with partial taxon overlap, from gene sequence data available in GenBank. We performed a maximum likelihood bootstrap analysis on each supermatrix using RAxML v.7.0.4 [28]. The Gymnosperm trees each contain 959 taxa, and the Saxifragales trees each contain 950 taxa.

Effects of number of input trees

In our first experiment, we analyze how the number of input trees in T affects the performance of our algorithm. For this purpose, we created 30 synthetic datasets. The size of the embedded MFAST in all the datasets was 15. Among these 30 datasets, 10 contained 50 trees, 10 contained 100 trees and 10 contained 200 trees. We set the noise percentage to 20% in all the datasets. The frequency of the embedded MFAST was 0.8. We set the number of taxa in all the trees in these datasets to 100.
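To make these settings concrete, the sketch below computes the quantities implied by the synthetic dataset parameters for one configuration of this experiment. It is only a bookkeeping illustration under stated assumptions: no trees are actually built, the function name, dictionary keys and random seed are hypothetical, and the inside/outside split is drawn for a single MFAST-containing tree.

```python
import random


def plan_synthetic_dataset(n, m, f, mfast_size, noise, rng_seed=0):
    """Count trees and taxon placements for one synthetic dataset (no trees built)."""
    rng = random.Random(rng_seed)
    trees_with_mfast = round(m * f)              # trees that contain the MFAST
    trees_without_mfast = m - trees_with_mfast   # fully random trees
    extra_taxa = n - mfast_size                  # taxa inserted around the MFAST
    # Each extra taxon lands inside the MFAST clade with probability `noise`.
    inside = sum(1 for _ in range(extra_taxa) if rng.random() < noise)
    return {
        "trees_with_mfast": trees_with_mfast,
        "trees_without_mfast": trees_without_mfast,
        "inside_clade": inside,
        "outside_clade": extra_taxa - inside,
    }


# One configuration of the first experiment: 200 trees of 100 taxa,
# embedded MFAST of size 15 with frequency 0.8, and 20% noise.
print(plan_synthetic_dataset(n=100, m=200, f=0.8, mfast_size=15, noise=0.20))
```

The number of taxa placed inside the clade varies from tree to tree because each placement is an independent random draw.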
We ran our algorithm on each of these datasets to find the size of the MFAST for γ = 0.7. Table 2 lists the average MFAST size we found for each of the dataset sizes before post-processing (i.e., at the end of phase two) and after post-processing (i.e., at the end of phase three). The results demonstrate that our method can identify an MFAST that is almost as big as the embedded one even without post-processing, regardless of the number of trees in the dataset. Post-processing improves the MFAST size slightly. On the average, we always find an MFAST that is as large as or larger than the embedded one. An MFAST larger than 15 here implies that while randomly inserting the taxa that are not in the embedded MFAST, at least one of them was placed under the same clade at least a fraction γ of the time. More importantly, our method successfully located such taxa along with the rest of the MFAST.

Table 2 Evaluation of the effect of the number of trees in T

Number of trees    MFAST size
                   Before post-processing    After post-processing
50                 14.5                      16.0
100                15.3                      15.8
200                14.4                      15.4

The number of trees is set to 50, 100 and 200. For each number of trees we run our experiments on ten datasets. Each dataset contains trees with 100 taxa and an embedded MFAST of size 15. We report the average size of the MFAST obtained by our method across the ten datasets.

Effects of tree size

Our second experiment considers the impact of the number of taxa in the input trees contained in T on the success of our method. To carry out this test, we built datasets with varying tree sizes (i.e., n). Particularly, we used n = 100, 250, 500 and 1000. For each value of n, we repeated the experiment 10 times by creating 10 datasets with the same properties. In all datasets, we set the number of trees to m = 100, the noise percentage to 20%, the size of the embedded MFAST to 15% of n, and the MFAST frequency to 0.8.
Table 3 reports the average MFAST size found by our method for varying tree sizes. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets. Before going into a detailed discussion of the results, it is crucial to observe that our method could run to completion for datasets that have as many as 1000 taxa. When we tried to run Phylominer, it did not return any results for datasets that have more than 100 taxa. The results also demonstrate that our method could successfully identify the embedded MFAST in all the datasets regardless of the size of the input trees. In some datasets, the reported MFAST was slightly larger than the embedded one. This indicates that while randomly inserting the taxa that are not part of the embedded MFAST, it is possible that a few taxa were consistently placed under the same clade.

Table 3 Evaluation of the effect of the size of the trees in T

Number of taxa    MFAST size
                  Embedded    Reported
                              Before post-processing    After post-processing
100               15          15.3                      15.8
250               38          32.3                      38.8
500               75          43.7                      76.0
1000              150         69.8                      151.0

The tree size is set to 100, 250, 500 and 1000. For each tree size we run our experiments on ten datasets. Each dataset contains 100 trees with an embedded MFAST of size 15% of the input tree size. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets.

The results also suggest that our method identifies a significant percentage of the taxa in the embedded MFAST after the second phase (i.e., before post-processing) when the tree size is small. As the tree size grows, it starts missing some taxa at this phase. It however recovers the missing taxa during the post-processing phase even for the largest tree size. This indicates that at the end of phase two our method could identify a backbone of the actual MFAST. The unidentified taxa at this phase are scattered throughout the clades in the input trees. Thus, there is no clade of size k + c that contains them with c contractions for small k or c. As evident from Table 3, this however does not prevent our method from recovering them. This is because the backbone reported at the end of phase two is large enough, and thus specific enough, to recover the
missing taxa one by one in the last phase. This is a significant observation as it demonstrates that our method works well even with small values of k and c.

Effects of noise percentage

Recall that the noise percentage denotes the percentage of taxa that are added inside the clade that contains the MFAST. As the noise percentage increases, the pairs of taxa in the MFAST get farther away from each other in the tree that contains it. As a result, fewer taxa from the MFAST will be contained in small clades of size k + c. This raises the question of whether our method works well as the noise percentage increases and thus the MFAST taxa get scattered around in the trees that contain it.

In this experiment, we answer the question above and analyze the effect of the noise percentage on the success of our method. We create synthetic datasets with various noise percentages. Particularly, we use 20, 40 and 60%. We set the size of the embedded MFAST to 15, the tree size to n = 100, the number of trees to m = 100 and the MFAST frequency to f = 0.8. We repeat our experiment for each parameter 10 times by recreating the dataset randomly using the same parameters. We set the frequency cutoff to γ = 0.7. We report the average MFAST size found by our method in Table 4.

Table 4 Evaluation of the effect of the noise in the trees in T

Noise (%)    MFAST size
             Before post-processing    After post-processing
20           15.3                      15.8
40           13.6                      15.0
60           12.7                      15.0

The size of the embedded MFAST in all the experiments is 15. We list the average size of the MFAST found by our method before and after the post-processing phase.

The results suggest that our method can identify the embedded MFAST successfully even when the noise percentage is very high. We observe that the size of the MFAST found by our method before post-processing decreases slowly with increasing amount of noise. This is not surprising as the taxa contained in the embedded MFAST get more spread out (and thus farther away from each other) in the trees in T with increasing noise. As a result, if there are taxa that are not part of any seed with the provided values of k and c, they will never be included
in the computed MFAST at the end of phase two. We however observe that (i) only a small number of such taxa exist. For instance, even for the largest noise percentage (60%), only 2.3 taxa (i.e., 15 − 12.7) are missing on the average. (ii) The missing taxa are recovered during phase three. This is because the computed MFAST at the end of phase two is very large, and thus it is specific to the embedded MFAST.

Impact of seed creation

So far, in our experiments we consistently observed two major points for all the parameter settings (see Sections Effects of number of input trees to Effects of noise percentage): (i) Our method always finds a large subtree of the embedded MFAST after phase two. (ii) Our method always recovers the entire embedded MFAST after phase three. The second observation can be explained from the first one: the outcome of phase two is large enough to build the entire MFAST precisely. The first observation however indicates that the set of seeds generated in phase one contains a significant percentage of the taxa in the embedded MFAST. In this section, we take a closer look into this phenomenon and explain why this is the case even for small values of seed size k and contraction amount c, and large noise percentage. To do that, we will compute the probability that a subset of the taxa of the embedded MFAST appears in at least one seed generated in phase one. In our computation, we will assume that the taxa can appear at any location of a given tree with the same probability. We discuss the implication of this assumption later in this section.
The number of rooted bifurcating trees for a given set of n taxa is

$R(n) = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$.

Consider a clade with k + c taxa. The number of trees with n taxa that contain this clade is $R(n - (k + c) + 1)$, as the topology of the (k + c)-sized clade is fixed. For a given subtree with k taxa, let us denote the number of clade topologies of size k + c that contain that subtree with NU(k, c). We can compute this function recursively as NU(k, 0) = 1 and, for c > 0,

$NU(k, c) = NU(k, c-1) \cdot 2(k + c - 2)$.

Let us denote one of these clades by U(k, c). Also, let us denote the probability that the clade U(k, c) exists in a random tree topology that contains n taxa with P(n, k, c). Intuitively, P(n, k, c) is the probability that our method will extract a specific k-taxa subtree from one n-taxa tree after only c contractions. We can compute this probability as the ratio of the number of tree topologies that satisfy this constraint to that of all possible tree topologies. We formulate this as

$P(n, k, c) = \frac{NU(k, c)\; R(n - (k + c) + 1)}{R(n)}$.

Recall that it suffices for our algorithm to have a k-taxa subtree of the MFAST in at least one tree in the given set of m trees. The probability that the clade U(k, c) exists in at least one of the m random tree topologies each containing n taxa is

$P(n, k, c, m) = 1 - (1 - P(n, k, c))^{m}$.

Assume that the MFAST size in the given set of trees T is h. Let us denote the number of k-taxa subtrees of the MFAST as NS(h, k). The probability that at least one of these subtrees will be found in at least one of the input trees is then

$P(n, k, c, m, h) = 1 - (1 - P(n, k, c, m))^{NS(h, k)}$.

A lower bound to NS(h, k) is h − k + 1, which can be obtained by picking a contiguous block of k taxa from the canonical Newick representation of the MFAST by considering all possible h − k + 1 starting point locations. Notice that the larger the value of P(n, k, c, m, h), the higher the chances that our algorithm will construct some part of the MFAST. Similarly, the larger the value of NS(h, k), the higher the chances that our algorithm will construct some part of the MFAST.
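The formulas above can be evaluated numerically with the following Python sketch. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, NS(h, k) is replaced by its lower bound h − k + 1, and R(n) is computed as the equivalent odd-number product 1 · 3 · … · (2n − 3).

```python
import math
from fractions import Fraction


def rooted_trees(n):
    """R(n) = (2n-3)! / (2^(n-2) (n-2)!), computed as 1 * 3 * ... * (2n - 3)."""
    count = 1
    for i in range(2, n + 1):
        count *= 2 * i - 3
    return count


def clade_topologies(k, c):
    """NU(k, c): NU(k, 0) = 1 and NU(k, c) = NU(k, c-1) * 2(k + c - 2)."""
    count = 1
    for j in range(1, c + 1):
        count *= 2 * (k + j - 2)
    return count


def p_single_tree(n, k, c):
    """P(n, k, c) as an exact fraction."""
    return Fraction(clade_topologies(k, c) * rooted_trees(n - (k + c) + 1),
                    rooted_trees(n))


def at_least_once(p, trials):
    """1 - (1 - p)^trials, written with log1p/expm1 to stay accurate for tiny p."""
    p = float(p)
    if p >= 1.0:
        return 1.0
    return -math.expm1(trials * math.log1p(-p))


def success_probability(n, k, c, m, h):
    """P(n, k, c, m, h) with NS(h, k) replaced by its lower bound h - k + 1."""
    p_m = at_least_once(p_single_tree(n, k, c), m)
    return at_least_once(p_m, h - k + 1)


print(p_single_tree(3, 3, 0))
print(success_probability(n=100, k=3, c=5, m=500, h=15))
```

Because the lower bound on NS(h, k) is used, the value returned by `success_probability` is itself a lower bound on the expression defined above.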
Figure 6 plots the success probability (i.e., P(n, k, c, m, h)) of our method for varying parameter values. As the MFAST size increases, the success probability rapidly increases. This is because the number of alternative subtrees of the MFAST increases with increasing MFAST size. Thus, the chance of observing at least one increases as well. We observe that when the size of the MFAST is around 20% of the tree size, for all the parameters reported our success probability becomes almost 1. As the number of contractions increases, the probability of success increases. This is because a large number of contractions increases the possibility of eliminating false positive taxa from clades. In other words, it helps glue the taxa that are normally scattered in the input trees back together by removing the remaining taxa among them. When c = 5, our success probability becomes almost one even for MFASTs that are as small as 4 to 6% of the tree size. As the number of trees increases, the success probability increases as well. This is because we have more alternative topologies with increasing number of trees. Thus, there are more chances to have a small clade that contains a part of the MFAST. Finally, it is worth noting that these results are computed based on the assumption that the trees in T are uniformly distributed among all possible topologies. In practice, we expect that these trees are constructed with the same or similar objectives (such as maximum parsimony or maximum likelihood). As a result, they will likely have a higher chance to contain large MFASTs. The results we expect in practice will thus be similar to or even better than the theoretical results in Figure 6.

Figure 6 The probability of finding at least one seed which contains a part of an MFAST. The number of contractions c is set to 3, 4 and 5 and the corresponding seed size k is 5, 4 or 3. The x-axis shows the MFAST size in terms of the percentage of the number of taxa in the trees in T (0 to 20%); the y-axis shows the success probability (0 to 1). In (a), we set the total number of trees m = 500. In (b) we set m = 1000.
Overall, we conclude from this experiment that even small values of k and c suffice to capture a part of the MFAST in phase two. Therefore, although our algorithm's complexity increases exponentially with k and c, we do not need to use large values for k and c. This enables our algorithm to scale to very large datasets with thousands of taxa and trees. These results explain the theory behind the practical results we observed in Sections Effects of number of input trees to Effects of noise percentage.

Evaluation of state-of-the-art methods

So far, we have shown that our method could successfully find the MFASTs contained in sets of trees T for up to 1000 taxa and 200 trees (i.e., n = 1000 and m = 200). An obvious question is how well existing methods perform on the same datasets. Here, we answer this question for two existing programs, namely PAUP* (version 4.0b10) and Phylominer.

When we fix the number of trees and the number of taxa to 100, PAUP* was able to find the MAST for all datasets. As we grow the number of taxa to 250 or larger while keeping the number of trees at 100, PAUP* runs out of memory and fails to return any results. After reducing the number of trees to 50, PAUP* still runs out of memory and cannot report any results for more than 100 taxa.

The scalability problem of Phylominer is even more severe. Phylominer is able to compute the MFASTs on datasets with up to 20 taxa. However, as we increase the number of taxa further, its performance deteriorates quickly. When we set the number of taxa to 100, even with as few as 100 trees, Phylominer takes more than a week to report a result. Moreover, in our experiments, the maximum size of the subtrees it found on average contained fewer than 7 taxa, even though the size of the true MAST was 10.

Another interesting question about existing methods would be whether the majority consensus rule can be used to find MFASTs. To evaluate this, we used the same three synthetic datasets used in Section Effects of noise percentage. Recall that each of these three datasets contains an MFAST of size 15 which is embedded in 80% of the trees.
The datasets are created with 20%, 40% and 60% noise, indicating different levels of difficulty in recovering the embedded MFAST. We computed the 70% majority consensus tree. Notice that if the majority consensus rule can identify an MFAST, that would correspond to a bifurcating subtree topology in the consensus tree. In other words, a subtree is bifurcating in this experiment only if 70% or more of the input trees agree on the topology of that subtree. The resulting tree, however, was multifurcating for all the three datasets. This means that the majority consensus rule could not recover even a smaller portion of the embedded tree, while our method was able to locate the entire MFAST successfully (see Table 4).

These results demonstrate that both PAUP* and Phylominer are not well suited to finding agreement subtrees in larger datasets; our method scales better in terms of both the number of taxa and the number of trees. When PAUP* runs to completion, we observed that it reports the true results. Recall from previous experiments that our method always found the true results on the same datasets as well as larger datasets. This suggests that our method has the potential to have an impact in large-scale phylogenetic analysis when existing methods fail.

Empirical dataset experiments

To examine the performance of the MFAST method on real data, we performed experiments using 200 maximum likelihood bootstrap trees from a phylogenetic analysis of gymnosperms (959 taxa) and Saxifragales (950 taxa). Specifically, we evaluated how the performance of the MFAST algorithm was affected by the number of input trees and the size of the input trees.

Effects of number of input trees

We first examined the effect of input tree number on the size of the MFAST. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 50 and 100 trees by randomly sampling from the original 200 trees without replacement. We compared the average size of the MFAST in the 50 and 100 tree datasets with the size of the MFAST in the original 200 tree dataset. First, in all analyses, the post-processing step greatly increases the size of the MFAST, sometimes more than doubling it (Table 5).
This increase is similar to the one observed in the 1000-taxon simulated datasets (Table 3), emphasizing again the importance of the post-processing step with large tree datasets. Although the sizes of the MFASTs were similar, they decreased slightly with the addition of more trees (Table 5). This may simply be a matter of observing more conflict with more trees.

Table 5 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post-processing (phase three)

Number of trees    MFAST size
                   Gymnosperms                 Saxifragales
                   Before   After   Only       Before   After   Only
50                 78.5     129.8   99.5       64.7     122.0   84.1
100                68.4     119.2   83.1       55.4     112.8   74.7
200                76.0     118.0   84.0       40.0     105.0   75.0

The size of the MFAST found by running only the post-processing step is also shown. We run our method on the entire dataset that contains 200 trees as well as randomly selected subsets of 50 and 100 trees. We repeated the 50 and 100 tree experiments 10 times by randomly selecting the trees from the entire dataset and reported the average value.

The large gap between the MFAST sizes before and after the post-processing might suggest that phase three is the main reason behind the success of our method, and thus, that the costly seed combination phase (i.e., phase two) may be unnecessary. To answer whether this conjecture is correct, we ran a variant of our method by disabling the second phase; we only ran the post-processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table 5. The results demonstrate that although phase three can grow a large FAST, phase two is essential to find the largest frequent agreement subtree. In other words, post-processing finds the true MFAST only if a large portion of it is already found (which is the role served by phase two). In conclusion, phase three of our method cannot replace phase two; both phases are essential for the success of our method.

Effects of size of input tree

Next, we examined the effect of the number of leaves in the input trees on the size of MFASTs. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 200 input trees with 100, 250, and 500 taxa. To make each set, we randomly selected 100, 250, or 500 taxa, and we deleted all other taxa from the original sets of
200 trees. Thus, these sets of trees with 100, 250, or 500 taxa are subtrees of the original datasets. The size of the average MFAST increases with more taxa in the original trees (Table 6). However, interestingly, the average size of the MFASTs for the gymnosperm dataset with 500 taxa is larger than the MFAST found from the original gymnosperm trees with all the taxa (Table 6). Since the MFASTs from the 500-taxon datasets should all be found within the full dataset, this indicates that on the larger trees, our method may not always find the true (i.e., largest) MFAST. The full datasets may require a larger number of contractions to find the true MFASTs.

Table 6 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post-processing (phase three) for different numbers of taxa

Number of leaves    MFAST size
                    Gymnosperms                 Saxifragales
                    Before   After   Only       Before   After   Only
100                 41.2     56.1    43.5       43.5     50.7    38.5
250                 67.2     88.5    63.0       62.3     76.2    54.6
500                 91.6     123.0   74.9       52.0     86.7    62.9
All                 76.0     118.0   84.0       40.0     105.0   75.0

The size of the MFAST found by running only the post-processing step is also shown. We run our method on the entire dataset that contains all the taxa (last row) as well as randomly selected taxa subsets of size 100, 250 and 500. We repeated the 100, 250 and 500 taxa experiments 10 times by randomly selecting the taxa from the entire dataset and reported the average value.

Similar to the experiments in Section Effects of number of input trees, we investigated the gap between the MFAST sizes before and after the post-processing step. We ran a variant of our method by disabling the second phase; we only ran the post-processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant
in Table 6. The results parallel those in Table 5. Phase three can grow a large frequent agreement subtree, but not quite as big as the one found when both phases two and three are executed.

Table 7 The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets for different random subsamples of the total number of taxa

Sampling percentage    MFAST size
                       Gymnosperms    Saxifragales
2                      85.9           74.3
5                      87.5           75.6
10                     88.4           75.2
25                     87.5           75.5
50                     88.5           76.2
100                    88.5           76.2

We run our method by randomly picking 2%, 5%, 10%, 25%, 50%, 100% of the seeds found in phase one for combination in phase two.

Effects of sample size

In our final experiment, we evaluated the effect of the maximum time cutoff, which we described in Section Inorder combination, on the accuracy of our method. Recall that this cutoff limits the number of initial seeds tried in our algorithm by randomly sampling a small percentage of the seeds. It only uses the sampled seeds as possible initial seeds. However, it uses the entire set of seeds while growing the MFAST determined by the initial seed. As each initial seed roughly takes the same amount of time to grow into an MFAST, using x% of the seeds as the sample set reduces the total running time of our method to roughly x% of that of our original implementation.
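The sampling step described above can be sketched in a few lines of Python. This is a simplified illustration: seeds are stand-in integers, the function name and the fixed random seed are hypothetical, and the growth of each starting seed (which would still draw on the full seed list) is not implemented.

```python
import random


def sample_initial_seeds(seeds, percentage, rng_seed=0):
    """Pick `percentage`% of the seeds, without replacement, as starting points."""
    rng = random.Random(rng_seed)
    count = max(1, round(len(seeds) * percentage / 100))
    return rng.sample(seeds, count)


all_seeds = list(range(200))                  # stand-ins for the phase-one seeds
starting_seeds = sample_initial_seeds(all_seeds, percentage=5)

# Only the sampled seeds start the outer loop; all_seeds would remain
# available as combination partners while each starting seed is grown.
print(len(starting_seeds), "of", len(all_seeds), "seeds used as starting points")
```

Since the outer loop runs once per starting seed, the number of outer iterations, and hence roughly the running time, scales with the sampled fraction.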
We carried out this experiment as follows. For both the gymnosperm and Saxifragales trees, we ran 10 sets of experiments for each sampling percentage of 2, 5, 10, 25, 50 and 100%. Thus, in total we ran 60 (6 × 10) experiments. Table 7 presents the average MFAST sizes for varying sample sizes. The results demonstrate that even for very small sampling percentages, our method finds an MFAST that is almost as big as the MFAST found by using the entire dataset (i.e., 100% sampling percentage). This is very promising as it demonstrates that the running time cost of our method can easily be cut to a small fraction by sampling the starting seeds. The rationale behind this is that the MFAST contains many seeds. Starting from any of these seeds, our algorithm has the potential to lead to that MFAST. The probability that at least one of these seeds appears in the sample set is large, particularly for large MFASTs.

Conclusion

In this paper, we present a heuristic for finding the maximum frequent agreement subtrees (MFASTs). The heuristic uses a multi-step approach which first identifies small candidate subtrees (called seeds) from the set of input trees, combines the seeds to build larger candidate MFASTs, and then performs a post-processing step to increase the size of the candidate MFASTs. We demonstrate that this heuristic can easily handle datasets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Although this heuristic is not guaranteed to find all MFASTs, it performs well using both simulated and empirical datasets. Its performance is relatively robust to the number of input trees and the size of the input trees, although with the larger datasets, the post-processing step becomes more important. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.

Although the method we developed is described and implemented for rooted and bifurcating trees, it can be trivially extended to multifurcating as well as unrooted trees. The central technical difference in the case of unrooted trees would be the definition of clade (see Definition 1), as the definition requires a root. A clade in an unrooted tree encompasses two sets of nodes: (i) a given set of taxa X, and (ii) the set of all internal nodes that are on a path between two taxa in X on the phylogenetic tree.
We expect that this will increase the number of seeds substantially and thus make the problem more computationally intensive. The amount of increase will depend on the tree topology. The theoretical worst case happens when all the taxa are connected to a single internal node (i.e., star topology). In that case any subset of taxa can lead to a potential seed as long as the subset size is equal to the seed size allowed. One possible way to overcome this problem would be to exploit randomization or graph coloring strategies and avoid enumerating the majority of the possible seeds.

Abbreviations
MAST: Maximum agreement subtree; FAST: Frequent agreement subtree; MFAST: Maximum frequent agreement subtree; MCMC: Markov chain Monte Carlo.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
AR participated in algorithm development, implementation, experimental evaluation and writing of the paper. TK participated in algorithm development, experiment design and writing of the paper. GB participated in experiment design, dataset collection and writing of the paper. All authors read and approved the final manuscript.

Acknowledgments
This work was supported partially by the National Science Foundation (grants CCF-0829867 and IIS-0845439).

Author details
1 Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA. 2 Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA. 3 Department of Biology, University of Florida, Gainesville, FL, USA.

Received: 5 February 2012 Accepted: 5 September 2012 Published: 3 October 2012

References
1. Goloboff PA, Catalano SA, Mirande JM, Szumik CA, Arias JS, Källersjö M, Farris JS: Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics 2009, 25(3):211-230.
2. Price MN, Dehal PS, Arkin AP: FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS ONE 2010, 5(3):e9490.
3. Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ: Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany 2011, 98:404-414.
4. Felsenstein J: Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evol 1985, 39(4):783-791.
5. Farris JS, Albert VA, Källersjö M, Lipscomb D, Kluge AG: Parsimony jackknifing outperforms neighbor-joining. Cladistics 1996, 12(2):99-124.
6. Huelsenbeck JP, Rannala B, Masly JP: Accommodating phylogenetic uncertainty in evolutionary studies. Science 2000, 288(5475):2349-2350. doi:10.1126/science.288.5475.2349.
7. Bryant D: A classification of consensus methods for phylogenetics. In Bioconsensus (Piscataway, NJ, 2000/2001), volume 61 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci.; Amer. Math. Soc., 2003:163-183.
8. Finden CR, Gordon AD: Obtaining common pruned trees. J Classification 1985, 2:255-276.
9. Amir A, Keselman D: Maximum agreement subtree in a set of evolutionary trees: Metrics and efficient algorithms. SIAM J Comput 1997, 26(6):1656-1669.
10. Kubicka E, Kubicki G, McMorris FR: An algorithm to find agreement subtrees. J Classification 1995, 12(1):91-99.
11. Farach M, Przytycka TM, Thorup M: On the agreement of many trees. Inf Process Lett 1995, 55(6):297-301.
12. Bryant D: Building trees, hunting for trees and comparing trees. PhD thesis. Dept. Mathematics: University of Canterbury; 1997.
13. Cole R, Farach-Colton M, Hariharan R, Przytycka TM, Thorup M: An O(n log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J Comput 2000, 30(5):1385-1404.
14. Lee CM, Hung LJ, Chang MS, Shen CB, Tang CY: An improved algorithm for the maximum agreement subtree problem. Inf Process Lett 2005, 94(5):211-216.
15. Berry V, Nicolas F: Improved parameterized complexity of the maximum agreement subtree and maximum compatible tree problems. IEEE/ACM Trans Comput Biology Bioinform 2006, 3(3):289-302.
16. Guillemot S, Nicolas F: Solving the maximum agreement subtree and the maximum compatible tree problems on many bounded degree trees. In Proceedings of the 17th Annual conference on Combinatorial Pattern Matching (CPM'06). Edited by Lewenstein M, Valiente G. Berlin, Heidelberg: Springer-Verlag: 165-176. doi:10.1007/11780441_16.
17.GuillemotS,NicolasF,BerryV,PaulC: Ontheapproximabilityofthe maximumagreementsubtreeandmaximumcompatibletree problems. DiscreteAppliedMathematics 2009, 157 (7):1555 1570. 18.ChiY,XiaY,YangY,Muntz,RR: Correctiontominingclosedand maximalfrequentsubtreesfromdatabasesoflabeledrootedtrees. IEEETransKnowlDataEng 2005, 17 (12):1737. 19.ZhangS,WangJTL: Miningfrequentagreementsubtreesin phylogeneticdatabases. In Proceedingsofthe6thSIAMInternational ConferenceonDataMining(SDM2006) .EditedbyGhoshJ,LambertD, SkillicornDB,SrivastavaJ.Maryland:Bethesda;April2006:222233. 20.ZhangS,WangJTL: Discoveringfrequentagreementsubtreesfrom phylogeneticdata. IEEETransKnowlDataEng 2008, 20 (1):68 82. 21.CranstonKA,RannalaB: SummarizingaPosteriorDistributionofTrees UsingAgreementSubtrees. SystB 2007, 56 (4):578 590. 22.PattengaleND,SwensonKM,MoretBME: Uncoveringhidden phylogeneticconsensus. In Proc.6thIntlSymp.BioinformaticsResearch& Appls.ISBRA10,inLectureNotesinComputerScience .EditedbyBorodovsky M,GogartenJP,PrzytyckaTM,RajasekaranS:pp128 139.Springer,2010. 23.PattengaleND,AbererAJ,SwensonKM,StamatakisA,MoretBME: Uncoveringhiddenphylogeneticconsensusinlargedatasets. IEEE/ACMTransComputBiologyBioinform 2011, 8 (4):902 911. 24.AbererAJ,StamatakisA: Asimpleandaccuratemethodforrogue taxonidentication. In ProceedingsoftheIEEEInternationalConference on,BioinformaticsandBiomedicine(BIBM11) :IEEEComputerSociety, Washington,DC,USA:118122.doi:10.1109/BIBM.2011.70. 25.JunierT,ZdobnovEM: Thenewickutilities:highthroughput phylogenetictreeprocessingintheunixshell. Bioinformatics 2010, 26 (13):1669 1670. 26.SwoordDL:PAUPPhylogeneticanalysisusingparsimony(andother methods).Version4.0beta10.,2002.Sunderland,Massachusetts:Sinauer Assoc. 27.BurleighJG,BarbazukWB,DavisJM,MorseAM,SoltisPS: Exploring diversicationandgenomesizeevolutioninextantgymnosperms throughphylogeneticsynthesis. JBotany 2012, 2012: 6.ArticleID 292857.doi:10.1155/2012/292857. 
doi:10.1186/1471-2105-13-256
Cite this article as: Ramu et al.: A scalable method for identifying frequent subtrees in sets of large phylogenetic trees. BMC Bioinformatics 2012 13:256.