Metabolic network alignment in large scale by network compression

Abstract

Metabolic network alignment is a system scale comparative analysis that discovers important similarities and differences across different metabolisms and organisms. Although the problem of aligning metabolic networks has been considered in the past, the computational complexity of the existing solutions has so far limited their use to moderately sized networks. In this paper, we address the problem of aligning two metabolic networks, particularly when both of them are too large to be dealt with using existing methods. We develop a generic framework that can significantly improve the scale of the networks that can be aligned in practical time. Our framework has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we develop an algorithm which transforms the given networks to a compressed domain where they are summarized using fewer nodes, termed supernodes, and interactions.
In the second phase, we carry out the alignment in the compressed domain using an existing network alignment method as our base algorithm. This alignment results in supernode mappings in the compressed domain, each of which is a smaller instance of the network alignment problem. In the third phase, we solve each of these instances using the base alignment algorithm to refine the alignment results. We provide a user defined parameter to control the number of compression levels, which generally determines the tradeoff between the quality of the alignment and how fast the algorithm runs. Our experiments on the networks from the KEGG pathway database demonstrate that the compression method we propose reduces the sizes of metabolic networks by almost half at each compression level, which provides an expected speedup of more than an order of magnitude. We also observe that the alignments obtained by only one level of compression capture the original alignment results with high accuracy. Together, these suggest that our framework produces alignments that are comparable to those of existing algorithms and can do this with practical resource utilization for large scale networks that existing algorithms could not handle. As an example of our method's performance in practice, the alignment of organism-wide metabolic networks of human (1615 reactions) and mouse (1600 reactions) was performed in under three minutes using only a single level of compression.

Ferhat Ay (corresponding author, ferhatay@uw.edu), Michael Dang (dang@cise.ufl.edu), Tamer Kahveci (tamer@cise.ufl.edu)

Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA

BMC Bioinformatics. 2012 Mar 21;13(Suppl 3):S2. ISSN 1471-2105. http://dx.doi.org/10.1186/1471-2105-13-S3-S2. http://www.biomedcentral.com/1471-2105/13/S3/S2

Published: 21 March 2012. Publisher: BioMed Central Ltd. Peer reviewed, open access.

Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011 (ACM-BCB), Chicago, IL, USA, 1-3 August 2011 (http://acmbcb.org/). Supplement editors: Sun Kim and Wei Wang. Publication of this supplement has been supported by NSF support number NSF IIS-1137427: III: Small: Women in Bioinformatics Initiative at ACM-BCB 2011.

Copyright 2012 Ay et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Biological networks provide a compact representation of the roles of different biochemical entities and the interactions between them. Depending on the types of entities and interactions, these networks are segregated into different types, where each network type encompasses a particular set of biological processes. Protein-protein interaction (PPI) networks comprise binding relationships between two or more proteins to carry out specific cellular functions such as signal transduction. Regulatory networks consist of interactions between genes and gene products to control the rates at which genes are transcribed.
Metabolic networks represent sets of chemical reactions that are catalyzed by enzymes to transform a set of metabolites into others to maintain the stability of a cell and to meet its particular needs. Analysis of the connectivity properties of these networks has proven to be crucial in uncovering the details of the cell machinery and in revealing the functional modules and complexes involved in this mechanism [1-4].

An essential type of network analysis is comparative analysis, which aims at identifying functionally similar elements or element sets shared among different organisms, something that would not be possible if these elements were only considered individually. This is often achieved through alignment of the networks of these organisms. Analogous to sequence alignment, which identifies conserved sequences, network alignment reveals connectivity patterns that are conserved among two or more organisms. A number of studies have systematically aligned different types of biological networks [5-21]. For metabolic networks, Pinter et al. [20] devised an algorithm that aligns query networks with specific topologies by using a graph theoretic approach. Recently, some of us developed an algorithm that combines both topological features and homological similarity of pairwise molecules to align metabolic networks [8]. We also proposed a method, SubMAP [9,10], that incorporates subnetwork mappings in metabolic network alignment. A similar method, IsoRank [21], has been applied to find the alignments of PPI networks. IsoRankN [11] extended this algorithm to work for multiple networks and to allow mappings of protein clusters.

Comparative analysis is particularly important for large metabolic networks such as organism-wide networks.
Identification of the conserved patterns among metabolic networks across species provides insights for metabolic reconstruction of a newly sequenced genome [22], orthology detection [21], drug target identification [23] and identification of enzyme clusters and missing enzymes [24,25]. However, aligning large scale networks is a computationally challenging problem due to the underlying subgraph isomorphism problem that has to be solved to find the alignment that maximizes the similarity between the query networks. The methods we mentioned above restrict the query topologies, their sizes, or both. Even under these conditions, the running times and memory utilization of these methods can still be prohibitive for large query networks. For instance, the method of Pinter et al. [20] takes around one minute per alignment on a dataset with only small networks ranging from 2 to 41 nodes. Our earlier method, SubMAP, has no limitations on the query topologies and allows mappings of node sets that are connected (i.e., subnetworks). However, allowing subnetworks comes at the cost of increased running time, because the number of all connected subnetworks up to a given size can be exponential in the size of the network. For a network of size 80 and subnetwork sizes up to 3, SubMAP takes around 6 minutes and 150 MB of memory on average per alignment with a database of networks of average size 50. Therefore, improving the running time and memory utilization of these methods is necessary to make the alignment of larger scale networks feasible, especially when subnetwork mappings are allowed.

In this paper, we develop a framework that significantly improves the scale of the networks that can be aligned using existing algorithms. Our framework has three major phases, namely the compression phase, the alignment phase and the refinement phase.
For the first phase, we develop a compression method that reduces the size of the input metabolic networks by a desired rate. In other words, we transform the query networks from their original domains (see Figure 1(a)) to a compressed domain (see Figure 1(d)). A single node in the compressed domain corresponds to a set of connected nodes and the edges between them in the original domain. We call each such node in the compressed network a supernode. For instance, Figure 1(d) depicts the compressed networks of the two input networks in Figure 1(a) when each supernode is allowed to contain up to two nodes (i.e., only one level of compression is allowed). In the second phase, we carry out the alignment in the compressed domain by using an existing network alignment algorithm, which is SubMAP in this paper, as our base method. Once the compressed networks are aligned, we next consider each mapping of supernodes found by the alignment phase individually. Each such mapping suggests a smaller instance of network alignment. Figure 1(f) demonstrates this where two such instances exist. For each of these mappings, we solve the alignment problem using the base algorithm. At the end of this refinement phase, the final mappings of reactions are extracted (see Figure 1(g)), transforming the alignment back to the original domain.

Figure 1. Aligning two metabolic networks with and without compression. Top figures (a-c) illustrate the steps of alignment without compression. Bottom figures (d-g) demonstrate different phases of alignment with compression using our framework. (a) Two hypothetical metabolic networks with 5 and 4 reactions respectively. Directed edges represent the neighborhood relations between the reactions. (b) Support matrix of size 20×20 needed for the alignment if compression is not used.
We only show the nonzero entries of a single row that corresponds to the topological support given by the b-b' mapping to possible mappings of its backward and forward neighbors. Five such mappings, supported equally, are denoted by entries of $\frac{1}{5}$ in the matrix, namely the a-a' mapping for the backward neighbors and the c-c', c-d', d-c' and d-d' mappings for the forward neighbors. (c) The resulting reaction mappings of alignment without compression. (d) Query networks shown in (a) in the compressed domain after one level of compression. (e) Support matrix of size 6×6 needed for the alignment with compression. We only show the entries for the mappings supported by the (a, b)-(a', b') mapping. (f) The resulting mappings from the alignment in the compressed domain. (g) The resulting reaction mappings after the refinement phase of our framework.

We can best motivate the need for such a framework with an example. Figure 1 illustrates the difference between aligning two metabolic networks in the compressed domain versus aligning them in the original domain without compression. If we use a base alignment algorithm such as SubMAP or IsoRank, the time and space complexity of the algorithm is determined by the size of a data structure named the support matrix [10,21]. Conceptually, this data structure governs the topological similarities between every pair of reaction tuples. Each reaction tuple contains one reaction from each of the two query metabolic networks. A detailed description of this matrix can be found in previous articles describing IsoRank [21] and SubMAP [10]. The size of this support matrix is quadratic in terms of both n and m, the sizes of the two query networks (i.e., $O(n^2 m^2)$), for IsoRank and for SubMAP when only subnetworks of size one are allowed.
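
The quadratic growth can be made concrete with the sizes quoted in the Figure 1 caption. The following is a minimal sketch, not part of the original implementation. The function name `support_matrix_entries` is hypothetical, and the compressed sizes of 3 and 2 supernodes are inferred from the 6×6 matrix in the caption together with the limit of two nodes per supernode.

```python
def support_matrix_entries(n: int, m: int) -> int:
    """Number of entries in an (n*m) x (n*m) support matrix.

    n and m are the node counts of the two query networks; the matrix has
    one row and one column per pair of nodes, one node from each network.
    """
    side = n * m
    return side * side


# Original domain of Figure 1(a): 5 and 4 reactions -> 20 x 20 matrix.
original = support_matrix_entries(5, 4)

# Compressed domain of Figure 1(d): assumed 3 and 2 supernodes -> 6 x 6 matrix.
compressed = support_matrix_entries(3, 2)

print(original, compressed, round(original / compressed, 1))
```

Under these assumptions the matrix shrinks from 400 to 36 entries, a factor of about 11.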
Figures 1(b) and 1(e) illustrate the support matrices required for alignment starting from the networks shown in Figures 1(a) and 1(d) respectively. As a result of compression by only one level, the size of the matrix we need to create drops from 20×20 to 6×6, which translates into more than an order of magnitude improvement in theoretical resource utilization compared to the base method.

Notice that when we compress the network more (i.e., increase the number of compression levels), the compressed network gets smaller in terms of its number of nodes and edges. As a result, we can expect to align the compressed networks faster. However, this comes at the price of two drawbacks, both due to the fact that each supernode contains multiple nodes from the original domain. First, once we find a mapping for the supernodes in the compressed domain, we still need to align the nodes of each supernode pair. For example, after mapping the supernodes (a, b) and (a', b') shown in Figure 1(f), we need to align the two subnetworks induced by these two supernodes. Thus, as the size of the supernodes grows (i.e., as we compress for more levels), the size of the smaller problem instances grows as well, and the resource utilization bottleneck shifts from the alignment phase to the refinement phase. Second, when we use compression the resulting alignment may not be the same as the one found by the original algorithm. For example, one out of four mappings in Figure 1(g) (i.e., e-c') is different from the result of the base algorithm shown in Figure 1(c) (i.e., e-e'). This brings the need to define a measure of consistency between the results of alignments with and without compression, which can be used as an indicator of accuracy for the framework we propose here. We calculate this accuracy as the correlation of the scores calculated for each possible mapping found by our framework in the compressed domain with the scores for these mappings in the original domain found by the base method.
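
The accuracy measure can be sketched as follows, assuming the correlation is Pearson's correlation coefficient, which is the measure named in the results section. The function name `pearson` and the score values are hypothetical and serve only as illustration. They are not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same four candidate mappings, computed once by
# the base method and once by the compression framework.
base_scores = [0.9, 0.4, 0.7, 0.1]
framework_scores = [0.8, 0.5, 0.6, 0.2]
print(f"accuracy = {pearson(framework_scores, base_scores):.3f}")
```

A value close to 1 would indicate that the framework ranks candidate mappings almost exactly as the base method does.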
Bigger compression rates generally mean less similarity between the results of the two methods (i.e., less accuracy).

Several key questions follow from these observations:

1. How does compression affect the alignment accuracy with respect to the base network alignment method?
2. How far is our compression method from an optimal compression that produces the compressed network with the minimum number of nodes?
3. When is it a good idea to do the alignment in the compressed domain, taking into account the overhead of the compression and refinement phases?
4. What is the right amount of compression? That is, when does compression minimize the running time of our overall framework?

In the rest of the paper we address each of these questions in detail. At this point, it is important to notice the potential of the framework we are proposing for making the alignment of larger scale networks feasible. The actual performance gain for an alignment will depend on the level of compression we use, the topologies of the query networks and the complexity of the base alignment method.

Results overview

Our experiments on metabolic networks extracted from the KEGG pathway database [26] demonstrate that our compression method reduces the number of nodes and edges by almost half at each level of compression. As a result of this reduction, we observe a significant improvement in the running time and memory utilization of our earlier alignment algorithm SubMAP. Lastly, we analyze the accuracy of our framework as compared to the base alignment algorithm. The results suggest that the alignment obtained by only one level of compression captures the original alignment results with very high accuracy and that the accuracy decreases with further levels of compression.

Technical contributions

- We devise an efficient framework for the network alignment problem that employs a scalable compression method which shrinks the given networks while respecting their topology.
- We prove the optimality of our compression method under certain conditions and provide a bound on how much our compression results can deviate from the optimal solution in the worst case.
- We provide a mathematical formulation that serves as a guideline to select an optimal number of compression levels depending on the input characteristics of the alignment.
- We characterize the cases for which the proposed framework is expected to provide significant improvement in alignment performance.

In the next section, we report our experimental results on a set of large scale metabolic networks that are constructed by combining networks from the KEGG Pathway database [26]. The details of the network compression method we propose here and the other phases of our framework are described in the methods section.

Results and discussion

In this section, we experimentally evaluate the performance of our framework. First, we measure the compression rates achieved for different levels of compression with the minimum degree selection (MDS) method that we propose here. Next, we analyze the changes in degree distribution and large scale organization of organism-wide metabolic networks with increasing compression levels. We then examine the gain in running time and memory utilization achieved by our framework for different values of the compression level (c) and subnetwork size (k) parameters. Last, we examine the accuracy of the alignments we found, measured as Pearson's correlation coefficient between the scores of mappings calculated by our framework and the ones calculated by the base algorithm we use.

Dataset

We use the metabolic networks from the KEGG pathway database [26]. For our medium scale dataset, we downloaded all metabolic networks with at least 10 reactions for 10 different organisms. This resulted in 620 metabolic networks in total with sizes ranging from 10 to 97.
In order to obtain our large scale dataset, we first combined all the metabolic networks that belong to one of the 9 different metabolism categories in the KEGG database to create a complete metabolism network for each metabolism for 10 selected organisms (Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit fly), Arabidopsis thaliana (thale cress), Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (budding yeast), Staphylococcus aureus COL (MRSA), Escherichia coli K-12 MG1655, Pseudomonas aeruginosa PAO1). We obtain the organism-wide metabolic networks by combining all the listed networks in KEGG for each of these organisms. In total, we have 100 networks with sizes ranging from 5 to 1615 (9 complete metabolism networks plus 1 organism-wide network for each of the 10 organisms). Below is the list of metabolism categories we use.

1. Carbohydrate Metabolism
2. Energy Metabolism
3. Lipid Metabolism
4. Nucleotide Metabolism
5. Amino Acid Metabolism
6. Metabolism of Other Amino Acids
7. Glycan Biosynthesis and Metabolism
8. Metabolism of Cofactors and Vitamins
9. All Amino Acids (Amino Acid + Other Amino Acids)

Implementation and system details

We implemented our compression and alignment algorithms in C++. We ran all the experiments on a desktop computer running Red Hat Enterprise Client 5.7 with 4 GB of RAM and two dual-core 2.40 GHz processors.

Evaluation of compression rates

The efficiency of our alignment framework depends on how much the query metabolic networks can be compressed. For this reason, in this experiment, we measure the number of nodes and edges of the metabolic networks in our large scale dataset before and after compression. The minimum degree selection (MDS) method we describe in this paper compresses the query metabolic networks by selecting the first node among the list of nodes with minimum degree at each intermediate step and by compressing it with one of its neighbors.
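
The selection rule just described can be sketched for a single compression level. This is an illustrative sketch under stated assumptions, not the authors' C++ implementation, whose details belong to the methods section. The function name `compress_one_level` is hypothetical. The sketch assumes an undirected view of the neighborhood relations, counts degree only over nodes that have not been merged yet, and pairs the selected node with its first unmerged neighbor so that each supernode holds at most two nodes.

```python
def compress_one_level(adj):
    """One level of minimum-degree-style compression (illustrative sketch).

    adj maps each node to a list of its neighbors (symmetric, no self loops).
    Returns the supernodes as tuples and the edges between supernode indices.
    """
    remaining = set(adj)   # nodes not yet placed in a supernode
    supernodes = []        # each supernode is a tuple of one or two nodes
    owner = {}             # original node -> index of its supernode

    def live_degree(v):
        # Assumption: degree is counted only over still-unmerged neighbors.
        return sum(1 for u in adj[v] if u in remaining)

    while remaining:
        # min() returns the first minimal element, so ties are broken by
        # the input order of the nodes ("first node with minimum degree").
        v = min((x for x in adj if x in remaining), key=live_degree)
        remaining.discard(v)
        partner = next((u for u in adj[v] if u in remaining), None)
        if partner is None:
            group = (v,)   # no free neighbor: v stays a singleton
        else:
            remaining.discard(partner)
            group = (v, partner)
        for x in group:
            owner[x] = len(supernodes)
        supernodes.append(group)

    # Two supernodes are adjacent if any of their members were adjacent.
    super_edges = set()
    for v, neighbors in adj.items():
        for u in neighbors:
            a, b = owner[v], owner[u]
            if a != b:
                super_edges.add((min(a, b), max(a, b)))
    return supernodes, super_edges


# Toy path network a-b-c-d-e with 5 nodes.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
supernodes, super_edges = compress_one_level(path)
print(supernodes, sorted(super_edges))
```

On this toy path the five nodes collapse into three supernodes. Replacing the tie-breaking order with a random choice among the minimum-degree nodes would give a randomized variant of the same rule.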
In order to evaluate the stability of this compression method, we examined the effect of the node selection strategy on the size of the resulting compressed networks. By randomizing the step at which we select a node among the set of minimum degree nodes, we generated 100 different compressed networks for each of the input metabolic networks. In the following, we examine how much compression we achieve by the MDS method and also analyze its stability with respect to compressions achieved by randomization of the node selection step.

Table 1 summarizes the compression rates achieved by our method for networks of different sizes. We divide all the metabolic networks in our dataset into bins according to the number of their reactions (i.e., network size). The first column in Table 1 lists the network size intervals we used for each group. Notice that the gaps in the size intervals are due to the fact that organism-wide networks are of size 850 and larger whereas the other combined networks for nine different metabolism categories have sizes below 400. Each row of this table shows the number of nodes and edges averaged over all the networks in this group before and after compression. The two columns with c = 0 correspond to the average number of nodes and edges of the networks with no compression respectively. For c ∈ {1, 2, 3}, we split each row corresponding to an interval into two. The upper part denotes the average node and edge numbers for the compressed network if the MDS method is used as originally described (i.e., the first among the list of minimum degree nodes is selected and combined with its first neighbor at each compression step). The lower part in bold represents the numbers gathered when we introduce randomization in this node selection. Each value in bold in Table 1 denotes the average of the corresponding value over these 100 different runs of compression.

Table 1. Summary of compression rates for all the networks in our large scale dataset. For c ≥ 1, each cell lists the MDS value followed by the randomized average (the value shown in bold in the original table).

Average number of nodes
Network size interval   c = 0     c = 1             c = 2             c = 3
[0, 100)                41.5      26.5 / 26.5       19.1 / 19.1       15 / 14.8
[100, 200)              154.8     92.4 / 92.2       61.3 / 61.5       48.6 / 48.6
[200, 300)              240.5     139.1 / 139.4     89.2 / 89.1       69.4 / 69.7
[300, 400]              344.9     207.3 / 207.6     133.1 / 133.8     103 / 104.5
[850, 1250]             1080.5    623.2 / 623.7     406.8 / 407.9     311.3 / 311.9
[1500, 1615]            1576.5    909 / 910         582 / 583         447.8 / 444.6

Average number of edges
Network size interval   c = 0     c = 1             c = 2             c = 3
[0, 100)                83.5      55.2 / 55.5       36.3 / 36.5       23.6 / 23.5
[100, 200)              310.1     174.9 / 174       116.5 / 118.1     96.3 / 94.6
[200, 300)              508.1     296.5 / 298.4     230.5 / 228.4     187.8 / 188.1
[300, 400]              585.7     372.9 / 373.5     302.7 / 300.4     261.6 / 259.9
[850, 1250]             3727      2269 / 2280.6     1732.7 / 1733.8   1584.8 / 1587.5
[1500, 1615]            4740      2955.2 / 2964.3   2283.5 / 2279.3   2128.8 / 2129.6

We create six intervals according to the number of reactions in these networks. Each row, corresponding to one such interval, shows the average number of nodes and edges before compression (i.e., c = 0) and after compression of different levels (i.e., c ∈ {1, 2, 3}). For each row, the first entries correspond to numbers obtained with the MDS method, which selects the first node from the list of nodes with minimum degree at each intermediate step and compresses it with its first neighbor from the list of its neighbors. The second entries (bold in the original) correspond to the averages of 100 different compressions which are gathered by randomizing the step at which a node is selected among the set of minimum degree nodes.

One conclusion that can be drawn from Table 1 is that, independent of the network size, our compression method performs well in practice. On average, with only one level of compression we achieve network sizes that are 57-64%, 64-71% and 77-80% of the network sizes in the previous compression level for c = 1, 2 and 3 respectively.
In other words, our method compresses the entire dataset down to approximately 60%, 40% and 30% of the sizes of the original networks for c = 1, 2 and 3 respectively. These rates suggest that our framework has great potential in scaling network alignment to large metabolic networks by compression. As an example, consider the row corresponding to the interval [850, 1250] in Table 1. We see that instead of aligning networks with 1080 nodes and 3727 edges on average, we can apply two levels of compression first and do the alignment with significantly smaller networks that have only 407 nodes and 1733 edges on average.

Another observation is that most of the reduction in network size is obtained at the first compression level. That is, our method compresses the networks aggressively for c = 1 and reduces them to 57% to 64% of their original size, which is close to half of the size of the networks. As we go up in the levels of compression, the actual rate of compression achieved at one level reduces. Considering the fact that having an input network which can lead to the best possible compression (i.e., reducing its size from n down to $\lceil n/2 \rceil$, that is 50%, at each level of compression) is a rare event, the observed compression rates suggest that our method provides an efficient compression for metabolic networks in practice.

This experimental setup also suggests that the MDS method is stable with respect to the choice of the node to compress as long as that node is selected among the nodes with minimum degree. Among the six rows and three columns (18 entries) of Table 1 for the average number of nodes after compression, only one has a difference larger than two between the original size and the randomized average.
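
The per-level rates and the stability claim can be rechecked directly from the average node counts in Table 1. The short sketch below is an illustrative check added here, with hypothetical variable names. It uses only numbers copied from the table.

```python
# Average node counts from Table 1 with the MDS method: c = 0, 1, 2, 3.
mds_nodes = {
    "[0, 100)": [41.5, 26.5, 19.1, 15.0],
    "[100, 200)": [154.8, 92.4, 61.3, 48.6],
    "[200, 300)": [240.5, 139.1, 89.2, 69.4],
    "[300, 400]": [344.9, 207.3, 133.1, 103.0],
    "[850, 1250]": [1080.5, 623.2, 406.8, 311.3],
    "[1500, 1615]": [1576.5, 909.0, 582.0, 447.8],
}

# Randomized averages from Table 1 (bold in the original): c = 1, 2, 3.
randomized_nodes = {
    "[0, 100)": [26.5, 19.1, 14.8],
    "[100, 200)": [92.2, 61.5, 48.6],
    "[200, 300)": [139.4, 89.1, 69.7],
    "[300, 400]": [207.6, 133.8, 104.5],
    "[850, 1250]": [623.7, 407.9, 311.9],
    "[1500, 1615]": [910.0, 583.0, 444.6],
}

# Size at level c relative to level c - 1, as (min, max) over the six rows.
per_level = []
for level in (1, 2, 3):
    ratios = [row[level] / row[level - 1] for row in mds_nodes.values()]
    per_level.append((min(ratios), max(ratios)))
    print(f"c = {level}: {min(ratios):.1%} to {max(ratios):.1%} of the previous level")

# Entries where MDS and the randomized average differ by more than two nodes.
large_gaps = [
    (interval, level)
    for interval, row in mds_nodes.items()
    for level in (1, 2, 3)
    if abs(row[level] - randomized_nodes[interval][level - 1]) > 2
]
print(large_gaps)
```

The computed ranges land within about one percentage point of the figures quoted in the text, and exactly one of the 18 entries, the largest interval at c = 3, differs by more than two nodes.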

The results of this experiment suggest that our compression method, MDS, serves as an efficient and stable first phase for our alignment framework by achieving good compression rates on a large dataset of metabolic networks.

Changes in degree distributions with compression

Even though the compression rates we achieve with MDS as described above suggest a significant reduction in the problem size, we observe that there is a noticeable difference between the compression rates achieved by going from one compression level to the next. For instance, on average the networks shrink to 57% to 64% of their size going from c = 0 to c = 1, whereas they shrink only to 76% to 80% of their size going from c = 2 to c = 3. This suggests that the large scale organization of the networks changes with increasing levels of compression. Even though a change in the network structure can be expected as a result of our compression, it is not obvious how to quantify this change and whether the change is consistent among different metabolic networks.

In order to understand the reason behind different compression rates for different compression levels, we examined the degree distributions of the ten organism-wide networks we have in our dataset. For each of these networks, we plotted the histogram of out-degree distributions for different levels of compression. Figure 2 plots the frequencies of each out-degree in the range [2,40] for each c ∈ {0, 1, 2, 3, 4} for these networks. We observe that for each of these plots the degree distributions for c = 0 and c = 1 are very similar and that they follow a power-law distribution, which is an indicator of scale-free network topology. This is not surprising since the scale-free topology has been observed in numerous articles in the literature as a common signature for different metabolic networks [27-29].
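
The histogram behind such a plot can be sketched as follows. This is a minimal illustration, assuming the network is available as a node list and a list of directed edges. The function names and the toy network are hypothetical, and the sketch reports raw counts because the exact normalization used for Figure 2 is not stated here.

```python
from collections import Counter


def outdegrees(nodes, edges):
    """Out-degree of every node for a list of directed (source, target) edges."""
    degree = {v: 0 for v in nodes}
    for source, _target in edges:
        degree[source] += 1
    return degree


def outdegree_counts(degree, lo=2, hi=40):
    """How many nodes have each out-degree d with lo <= d <= hi."""
    counts = Counter(d for d in degree.values() if lo <= d <= hi)
    return {d: counts[d] for d in range(lo, hi + 1)}


# Toy directed network: a has out-degree 3, b has 2, c and d have 0.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
toy_degree = outdegrees(toy_nodes, toy_edges)
toy_counts = outdegree_counts(toy_degree)
print({d: k for d, k in toy_counts.items() if k > 0})
```

Running the same counting on the network at each compression level would give one histogram per value of c for comparison.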
The similarity between the degree distributions of the original networks (c = 0) and the networks compressed by only one level (c = 1) signifies that the networks still conserve their scale-freeness after the first level of compression.

Figure 2. Shift of out-degree distributions from power law to uniform. Changes in the out-degree distributions of ten organism-wide metabolic networks with increasing levels of compression. We calculate the frequencies of each out-degree in the range [2,40] for c ∈ {0, 1, 2, 3, 4} and plot them together for each of the ten organisms in our dataset. Out-degree distributions for organism-wide metabolic networks of (a) Arabidopsis thaliana (thale cress), (b) Caenorhabditis elegans (nematode), (c) Drosophila melanogaster (fruit fly), (d) Escherichia coli K-12 MG1655, (e) Homo sapiens (human), (f) Mus musculus (mouse), (g) Pseudomonas aeruginosa PAO1, (h) Rattus norvegicus (rat), (i) Staphylococcus aureus COL (MRSA), (j) Saccharomyces cerevisiae (budding yeast).

A more interesting observation is that there is a consistent shift from the power-law degree distribution towards a uniform distribution with increasing c values for each of the ten networks. It is important to clarify that our claim is not that the degree distribution becomes uniform for large c values, but rather that the degree distributions for large c values are more similar to a uniform distribution (and less similar to a power-law distribution) than the ones obtained with smaller c values. To quantify this on an example, we look at one of the most discernible characteristics of scale-free networks, and hence of the power-law distribution, which is the small number of hub nodes with large degrees.
If we consider the organism-wide network of Homo sapiens (Figure 2(e)), which is the largest network in our dataset, and focus on the percentage of nodes with out-degree greater than 15, we get percentages of 3%, 4%, 6.5%, 11.5% and 12.4% for c values of 0, 1, 2, 3 and 4 respectively. This indicates that the fraction of nodes that can be considered hubs increases significantly with increasing levels of compression. This increase deteriorates the scale-freeness of the Homo sapiens network, which in turn decreases the achieved compression rates. A similar trend is observed for each of the other nine organism-wide networks, which are plotted separately in Figure 2. The results of this experiment show that there is a consistent change in the network topology when multiple levels of compression are used. The difference we observe here between the first level of compression and later levels is likely to be one of the main reasons for the significant differences in both the performance and the accuracy of our framework, which are discussed in the remainder of the results section.

Evaluation of running time and memory utilization

In order to understand the capabilities and limitations of our framework, we examine its performance in terms of running time and memory utilization on a set of large-scale networks that we constructed as described in the dataset section. We have ten networks for each of the ten organisms in our dataset. For each organism, nine of these networks constitute different metabolism categories and the tenth network is the organism-wide metabolic network. In total, we have 100 networks with sizes ranging from 5 to 1615. For each parameter setting (different combinations of k ∈ {1, 2} and c ∈ {0, 1, 2, 3}), we aligned each of these 100 networks with each other network (including itself), resulting in a total of 5500 alignment queries.
When the value of c is equal to zero, the alignment is carried out completely by a single application of SubMAP without any compression. This provides us with a mechanism to measure how much performance gain our compression-based framework achieves with respect to SubMAP. Figure 3(a) illustrates the average query running times in a log-log plot where the x-axis is the size of the query, measured as the product of the numbers of reactions of the metabolic networks that are aligned. We grouped queries into logarithmic bins according to the query sizes. The first bin contains all the queries of size less than or equal to 64. The next bins contain the queries of size in the interval [$2^{i+5}$, $2^{i+6}$] where i = 2, 3, ..., 17. For each parameter setting we display the average running time of all the queries in each bin. For both k = 1 and k = 2, we plot the results for all four compression values and also draw the fitting curves to better illustrate the trend in the increase of running time.

Figure 3. Resource utilization of our framework. The average (a) running time and (b) memory utilization of our framework when each query network in our large-scale dataset is aligned with all the networks (including itself) in the same dataset. The x-axis is the query size, which is calculated as the product of the sizes (i.e., numbers of reactions) of the metabolic networks aligned. c = 0 denotes the alignments performed with no compression. c ∈ {1, 2, 3} denotes the results of our framework, which compresses both of the query networks by c levels before aligning them.

For k = 1, we can immediately observe that each additional compression level improves the running time over the previous one for all query sizes. We obtain the largest fold change in running time at the first level of compression. This is expected considering that the first level of compression achieved the largest compression rate, as shown in Table 1.
The second compression level improves the running time by a smaller factor than the first level and by a larger factor than the third level. For k = 1 we were able to plot all the points for all c values, as the running time for even the largest query (i.e., the human organism-wide network vs itself, which has size 1615 × 1615) with no compression (c = 0) is still practical, around 12 minutes (with c = 3 this drops to under 40 seconds). A similar trend of improved running times with increasing c is also observed for queries up to a certain size for k = 2. With only one level of compression (c = 1) we observe a significant improvement in running times for queries of all sizes. However, starting from the bin [$2^{13}$, $2^{14}$], compressing the networks by more than one level (c > 1) shows a consistent adverse effect on the running time. This implies that when both query networks have sizes around 150 or larger and k > 1 is used, compressing the networks by more than one level and then performing the alignment suffers from the explosion in the number of possible subnetworks of size at most k in the compressed domain. We explore this in more detail later in the paper (see Figure 4 and its discussion).

Figure 4. Gain/loss in running time. Gain/loss in running time of alignment by using our framework with respect to the base alignment method (x-axis) versus the ratio of the number of all possible subnetwork mappings in the compressed domain to this number in the original domain. The blue vertical line marks the points where the two methods take exactly the same amount of time, or where both methods take a very short amount of time in the case of small query networks. Points on the right (left) hand side of this line indicate a gain (loss) in running time. The dashed line is our decision criterion for predicting whether there will be a gain or a loss before doing the alignment.
An important aspect of our framework is that it makes it possible to align networks that could not be aligned with our base method. For k = 2, we observed that in the original domain (c = 0) a significant portion of the large queries did not finish within the cutoff time, which we set to one hour. For instance, among 252 possible queries with sizes in the interval [$2^{17}$, $2^{18}$], 96 did not complete successfully for c = 0, whereas with c = 1 all of them were completed. For the next bin, 45 out of 223 possible queries were completed for c = 0, and for c = 1 this number increased to 185. These results indicate that by using the correct amount of compression, we can align larger networks than the base alignment method SubMAP. We believe this is an important step in leveraging organism-wide network alignments with subnetwork mappings, since they provide a more complete picture of functional similarities and evolutionary differences between the metabolic networks of two or more organisms. Figure 3(b) presents the estimated memory required for the support matrix needed to perform the alignment, which is the memory bottleneck of the algorithm. For this figure, we use the same query set as in Figure 3(a), and hence the same x-axis. On average, the memory required for alignment with c = 1 is around 30% of that needed for alignment with no compression using the SubMAP method, for both k = 1 and k = 2. For k = 1, the memory utilization decreases with each additional compression level (on average, around 45% of the memory required for c = 1 is used when c is increased to 2, and around 65% of the memory required for c = 2 is used when c is increased to 3). For k = 2, in concordance with the running time results, only one level of compression provides better memory utilization for all network sizes, whereas compressing by more than one level has an adverse effect for medium and large-scale queries.
These results suggest that our framework has great potential to provide significant improvement in both the running time and the memory utilization of the base alignment method. This allows us to align, on the same hardware, large networks that could not be aligned by existing methods.

Accuracy of the alignment results

We conclude our experimental results by answering the first question introduced earlier in the paper, that is, "How does compression affect the alignment accuracy?". In order to answer this, we calculate the correlation between the scores of each possible mapping in the compressed domain and the scores that we obtain for these mappings from the original SubMAP method. We consider the scores of each possible subnetwork mapping of compressed nodes found by our framework. Since the mappings found by SubMAP are not of the same form as the mappings in the compressed domain, we calculate a score value for each mapping in the compressed domain by using the scores of the mappings found by SubMAP in the original domain. This way, we get two sets of score values, one from SubMAP and one from our framework, for the same set of mappings. We calculate the Pearson correlation coefficient between these two sets of scores as an indicator of the similarity between the results of the two methods. Before looking at the correlation values we found, it is important to describe how we calculate the score for a mapping in the compressed domain from the mappings of SubMAP. Let $P^1$ and $\bar{P}^1$ denote the one-level compressed forms of two metabolic networks. Let $(v_1, \{\bar{v}_1, \bar{v}_2\})$ denote a mapping in the compressed domain, where $v_1$ is a subnetwork of $P^1$ and $\{\bar{v}_1, \bar{v}_2\}$ is a subnetwork of $\bar{P}^1$. Also, let $v_1 = \{r_1, r_2\}$, $\bar{v}_1 = \{\bar{r}_1, \bar{r}_2\}$ and $\bar{v}_2 = \{\bar{r}_3\}$.
We know that the edge that maps these two subnetworks has a mapping score in the compressed domain; let us denote it by $e^1$ for c = 1. We want to compute a mapping score, say $e$, for $(v_1, \{\bar{v}_1, \bar{v}_2\})$ from the mappings in the original domain that is comparable to $e^1$. This subnetwork mapping in the compressed domain contains six possible mappings in the original domain, namely $(r_1, \bar{r}_1)$, $(r_1, \bar{r}_2)$, $(r_1, \bar{r}_3)$, $(r_2, \bar{r}_1)$, $(r_2, \bar{r}_2)$ and $(r_2, \bar{r}_3)$. Let us denote the scores of these mappings in the original domain by $e_i$ for i = 1, 2, ..., 6, in the order listed. Then, we compute the mapping score $e$ as $e = \frac{1}{6} \sum_{i=1}^{6} e_i$. It is important to note that this score is a conservative choice among the possible scoring options. This is because the average can include mapping scores of subnetworks with very low similarities from the original domain of SubMAP. This can underestimate the correct mapping score $e$ and hence degrade the correlation between the compressed-domain and original-domain mapping scores. Overall, for each mapping in the compressed domain with a score $e^c$, we calculate the corresponding score $e$ in the original domain using this average. Table 2 summarizes the correlation values found from a set of 3600 alignments (400 alignments for each parameter combination of k ∈ {1, 2, 3} and c ∈ {1, 2, 3}). We calculate the correlation of each query with the alignment that has the same k value but is in the original domain (i.e., c = 0). Table 2 shows the average correlation values of these 400 alignments for each combination of k and c. The first column indicates that the alignment found by using only one compression level is highly similar to the alignment found by directly using the base method.
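The averaging rule and the correlation computation described above can be sketched as follows. This is only an illustration: the function names and all score values are hypothetical, and the Pearson coefficient is computed directly from its textbook definition rather than with any particular library.

```python
import math

def averaged_original_score(original_scores):
    """Score e of a compressed-domain mapping: the mean of the scores e_i of the
    original-domain reaction mappings it contains (six of them in the example)."""
    return sum(original_scores) / len(original_scores)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical original-domain scores e_1..e_6 for one compressed mapping.
e = averaged_original_score([0.9, 0.8, 0.1, 0.7, 0.6, 0.5])

# Hypothetical score pairs for three compressed-domain mappings:
# compressed-domain scores versus averaged original-domain scores.
compressed_scores = [0.62, 0.30, 0.85]
averaged_scores = [0.60, 0.35, 0.80]
r = pearson(compressed_scores, averaged_scores)

print(f"e = {e:.2f}, correlation = {r:.3f}")
```

The single low score (0.1) in the first list pulls the mean down, which is the conservative behaviour of the average that the text points out.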
Combining this with the running time gain in Figure 3(a) for c = 1, we can argue that compression by one level not only provides a significant improvement in running time but also accurately captures a very high percentage of the original alignment results, which makes it very useful for practical purposes. The accuracy measured in terms of correlation drops to 0.57 on average when we perform the second level of compression and to 0.51 for the third level.

Table 2. Correlation of the mapping scores found with and without compression

k \ c    1       2       3
1        0.89    0.56    0.53
2        0.85    0.58    0.50
3        0.84    0.57    0.49

We calculate the Pearson correlation coefficient between the two sets of score values, one from SubMAP (without compression) and one from our framework (with compression), and report it as an indicator of the accuracy of the alignment results of our framework for different parameter settings.

These results suggest that we can almost always use one level of compression to benefit from a high performance gain without losing much accuracy in the alignment results. For c = 2 and c = 3, even though the accuracy of the results is significantly better than random, such compression levels should be used with caution if the accuracy of the alignment is the main concern.

Conclusions

In this paper, we considered the problem of aligning two metabolic networks, particularly when both of them are too large to be dealt with using existing methods. To solve this problem, we developed a framework that significantly increases the size of the metabolic networks that existing methods can align. Our framework is generic, as it can be used to improve the scalability of any existing network alignment method. It has three major phases, namely the compression phase, the alignment phase and the refinement phase.
For the first phase, we developed an algorithm that transforms the given metabolic networks to a compressed domain where they are summarized using far fewer nodes, termed supernodes, and interactions. In the second phase, we carried out the alignment in the compressed domain using an existing method, SubMAP, as the base alignment algorithm. In the refinement phase, we considered each individual mapping of supernodes one by one. Each such mapping corresponds to a smaller instance of the network alignment problem. For each of these mappings, we solved the alignment problem using SubMAP as our base method. Our experiments on the metabolic networks extracted from the KEGG pathway database demonstrate that our compression method reduces the number of reactions by almost half at each level of compression. As a result of this compression, we observe that SubMAP coupled with our framework can align networks at least twice as large as its original version can with the same amount of resources. Our results also suggest that the alignment obtained with only one level of compression benefits from a significant performance gain while capturing the original alignment results with very high accuracy. We believe that this paper takes an important step in scaling metabolic network alignment with subnetwork mappings to organism-wide networks, and thus can have a great impact on making existing network alignment methods more useful for domain scientists.

Methods

In this section, we describe the method we developed to compress the query networks and the overall framework for aligning networks in this compressed domain. Before going into detail, it is important to state that we use a reaction-based model for representing metabolic networks throughout this paper. Formally, we represent a metabolic network as $P = (V, E)$, where $V$ is the set of all reactions of the network and $E$ is the set of directed edges between them.
An edge $e_{ij} \in E$ exists if and only if the reaction $v_i$ has at least one output compound that is an input for the reaction $v_j$. In the following, we first describe our compression method. We use the shorthand notation MDS (minimum degree selection) to refer to this method in the rest of the paper. We then prove the optimality of MDS under certain conditions and provide an upper bound on the number of compressions that can be missed by this method with respect to the optimal compression. Next, we give a brief overview of the base alignment method that we use in this paper and explain in detail the two remaining phases of our alignment framework. We provide our analysis of the computational complexity of the overall method and conclude the methods section by answering two questions related to the performance characteristics of this method.

Minimum degree selection (MDS) method

Let $P = (V, E)$ be the reaction-based representation of a metabolic network and let c denote the user-specified parameter for the desired level of compression. For x = 1, ..., c, we denote the compressed form of $P$ after x compression levels by $P^x = (V^x, E^x)$. To simplify our notation, we assume that $P^0 = P$. We construct $P^x$ from $P^{x-1}$ for each x = 1, ..., c. Each $v \in V^x$ is either a node from $V^{x-1}$ or a supernode that contains two nodes of $V^{x-1}$. In summary, we construct $V^x$ from $V^{x-1}$ in a number of consecutive steps. At each step, we choose a pair of connected nodes in $V^{x-1}$ that have not been compressed in earlier steps of the current compression level. We then merge this node pair into a supernode and add it to $V^x$. We repeat these steps until there is no such node pair in $V^{x-1}$. Assume that the number of such steps is t for compression level x. We denote the state of the network after the ith step during the xth level of compression as $P_i^x = (V_i^x, E_i^x)$ (Figure 5(b)).
Note that $V_t^x = V^x$ and $V_i^x \subseteq V^{x-1} \cup V^x$ for each i = 1, ..., t, as the nodes of $V_i^x$ are either singleton nodes from $V^{x-1}$ or supernodes from $V^x$.

Figure 5. One compression step of the MDS method. Small circles represent reactions and big circles represent supernodes that result from earlier steps of compression. A solid arrow represents an edge between two noncompressed nodes in the current compression level. A dashed arrow denotes an edge between a supernode and another node in the network. While calculating the degrees of the noncompressed nodes, only the solid arrows are taken into account. (a) The state of network $P$ during compression level x before the ith intermediate step (i.e., $P_{i-1}^x$). The node with the minimum degree is denoted by $v_a$ and its first neighbor is denoted by $v_b$. (b) The state of this network after the ith compression step (i.e., $P_i^x$). We denote the node resulting from the compression at this step by $v_{ab}$.

We are now ready to discuss how we compress $P^{x-1}$ to get $P^x$. We define the degree of a noncompressed node $v$ in a given network as deg(v) = indeg(v) + outdeg(v), where indeg(v) (outdeg(v)) denotes the number of incoming edges from (outgoing edges to) noncompressed nodes in the network. We say that two nodes in a network are neighbors if they are connected by at least one edge. We denote the set of neighbors of a node $v$ by N(v). We start the compression by initializing $V_0^x = V^{x-1}$ and $E_0^x = E^{x-1}$. Then, while there exists a noncompressed node with degree greater than zero at the current state of the network, say $P_{i-1}^x$, we apply the next step, the ith step, of compression to obtain $P_i^x$ from $P_{i-1}^x$. Figure 5 depicts the states of an example network before (Figure 5(a)) and after (Figure 5(b)) the ith step of compression.
We start the ith step by selecting a node with minimum positive degree among the nodes in $V_{i-1}^x$. If there is more than one such node, we select the first one among them. In our example in Figure 5(a), the node with minimum degree is unique and is shown as $v_a$. We use the term minimum degree as a shorthand for minimum positive degree to exclude singleton nodes. This way we ensure that deg($v_a$) > 0 and that N($v_a$) is nonempty. We select one neighbor from N($v_a$), say $v_b$. The only node in N($v_a$) in Figure 5(a) is denoted by $v_b$. We then merge $v_a$ with $v_b$ to form the supernode $v_{ab} = \{v_a, v_b\}$. Figure 5(b) illustrates this newly created node $v_{ab}$. This is the only compression done at the ith compression step. Next, we create the new node set as $V_i^x = (V_{i-1}^x \cup \{v_{ab}\}) \setminus \{v_a, v_b\}$. To create the edge set $E_i^x$, we initialize it to $E_{i-1}^x$ and remove all the incoming and outgoing edges of $v_a$ and $v_b$ from it. Then, we insert an incoming edge to $v_{ab}$ from each node in $V_{i-1}^x \setminus \{v_a, v_b\}$ that has an outgoing edge to either $v_a$ or $v_b$ in the previous edge set $E_{i-1}^x$. We insert outgoing edges from $v_{ab}$ to other nodes in a similar manner. Figure 5 illustrates the changes in the edge set after creating $v_{ab}$. Notice that for each i = 1, ..., t, the set $V_i^x$ contains a mixture of nodes and supernodes. After each such step, the size of the network decreases by one and the number of edges of the new network decreases by at least one. For instance, in Figure 5 the number of nodes drops from five to four and the number of edges drops from six to five. The compression of $P^{x-1}$ to get $P^x$ continues by applying another compression step until there are no more noncompressed nodes with positive degree. The discussion above describes the intermediate compression steps of the MDS method for performing a single level of compression on a given network.
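The single-level procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mds_compress_level` is hypothetical, ties for the minimum degree and the choice of neighbor are both resolved by input order, self-loops are dropped, and a supernode is represented as a tuple `(va, vb)`.

```python
def mds_compress_level(nodes, edges):
    """One level of minimum degree selection (MDS) compression (a sketch).

    nodes: list of hashable node ids, in the order used to break ties
    edges: iterable of directed edges (u, v)
    Returns (new_nodes, new_edges); supernodes are tuples (va, vb).
    """
    edges = {(u, v) for u, v in edges if u != v}   # assumption: ignore self-loops
    active = list(nodes)                           # noncompressed nodes
    supernodes = []                                # created at this level

    while True:
        active_set = set(active)
        # Degree counts only edges between two noncompressed nodes.
        deg = {v: 0 for v in active}
        for u, v in edges:
            if u in active_set and v in active_set:
                deg[u] += 1
                deg[v] += 1
        candidates = [v for v in active if deg[v] > 0]
        if not candidates:
            break
        # min() returns the first minimal element, i.e. the first such node.
        va = min(candidates, key=lambda v: deg[v])
        # First noncompressed neighbor of va, in input order.
        vb = next(v for v in active
                  if v != va and ((va, v) in edges or (v, va) in edges))
        vab = (va, vb)
        merged = {va, vb}
        # Redirect every edge touching va or vb to the supernode; the edge(s)
        # between va and vb become self-loops and are discarded.
        rewired = set()
        for u, v in edges:
            u2 = vab if u in merged else u
            v2 = vab if v in merged else v
            if u2 != v2:
                rewired.add((u2, v2))
        edges = rewired
        active = [v for v in active if v not in merged]
        supernodes.append(vab)

    return active + supernodes, edges


# Hypothetical five-reaction network (not the one in Figure 5).
example_nodes = ["r1", "r2", "r3", "r4", "r5"]
example_edges = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"),
                 ("r4", "r5"), ("r2", "r4")]
nodes1, edges1 = mds_compress_level(example_nodes, example_edges)
print(nodes1)
print(sorted(edges1, key=str))
```

In this toy network the first step merges r1 (degree one) with its only neighbor r2, the second step merges r3 with r4, and r5 is left as a singleton, so five nodes compress to three.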
Given a compression level c, for each level x = 1, ..., c, we apply the same compression steps to $P^{x-1} = (V^{x-1}, E^{x-1})$ by initially treating $P^{x-1}$ as a noncompressed network with no supernodes. As a result of this process, after finishing the xth level of compression, the actual number of reactions that each node of $V^x$ can contain is guaranteed to be in the interval [1, $2^x$]. The limit on the number of reactions in each node allows the MDS method to respect and largely preserve the initial topology of the query networks. This is very important for the alignment, as it makes significant use of the network topologies. Additionally, the bound on the number of reactions in each supernode translates to a uniform compression for both networks, which limits the sizes of the smaller alignment problems we can encounter in the refinement phase. This allows us to keep the complexity and the running time of the refinement phase of our alignment framework under control.

Optimality analysis for MDS

In the previous section, we described in detail the compression method (MDS) we use in our framework. Ideally, it is preferable to compress the given network as much as possible at each compression level. This is because a smaller network size often implies lower time and memory usage for the alignment. We say that a compression is optimal if the resulting compressed network contains the smallest number of nodes among all possible compressions, with the restriction that each noncompressed node can be merged with at most one other noncompressed node at each compression level. We name the hypothetical optimal compression method that achieves the best possible compression rate OPT. In the rest of this section, we analyze the optimality of our MDS method under different conditions. We first consider each connected component of the input network separately and then integrate the results to generalize our analysis to networks with arbitrary topologies.
We start by introducing the notation we use in this section to handle networks with more than one connected component. Let $P$ be a metabolic network with r connected components. We denote these components by $C_1 = (\hat{V}_1, \hat{E}_1), C_2 = (\hat{V}_2, \hat{E}_2), \ldots, C_r = (\hat{V}_r, \hat{E}_r)$, such that $P = \left(\bigcup_{j=1}^{r} \hat{V}_j, \bigcup_{j=1}^{r} \hat{E}_j\right)$. Let $C = (\hat{V}, \hat{E})$ be an arbitrary component of $P$ and let $*^x$ represent the compressed form of $C$ after x levels of compression using either the MDS method or OPT, which achieves the optimal compression. We use $*$ (star) as a generic symbol to avoid introducing new symbols for each compressed component in places where only their sizes are relevant. We use $\mathrm{MDS}(C, *^x)$ and $\mathrm{OPT}(C, *^x)$ to denote the total number of compression steps performed to transform $C$ into its compressed form after x levels of compression by the corresponding methods. Recall that each compression step reduces the network size by one. Thus, the larger these values ($\mathrm{MDS}(C, *^x)$ and $\mathrm{OPT}(C, *^x)$), the better the compression rate. The first and second arguments in this notation can be any state of a connected component or a network at any point during the compression. For instance, $\mathrm{OPT}(C_i^x, *^x)$ denotes the number of compression steps taken by OPT starting from the (i + 1)th intermediate step of the xth level until the xth level of compression is completed. In the following, we first prove that the MDS method makes an optimal choice of which two nodes to compress at a compression step if there exists a node with degree one in the current state of a given component. We then show that if no node with degree one exists, a compression step taken by MDS can increase the size of the compressed component by at most one compared to the one found by OPT.
Finally, by aggregating the results from each component, for a given metabolic network $P$ and a compression level c, we develop an upper bound on the size of the compressed networks obtained by MDS with respect to the size of the network that can be obtained by the optimal method.

Lemma 1 Let $C = (\hat{V}, \hat{E})$ denote a connected component of a given metabolic network $P$. Let $C_i^x = (\hat{V}_i^x, \hat{E}_i^x)$ denote the state of $C$ after the ith step of the xth compression level. If there exists a node in $\hat{V}_i^x$ with degree one, then the compression step taken by the MDS method to create the next state $C_{i+1}^x$ is optimal. Formally,

$$\mathrm{OPT}(C_i^x, *^x) = 1 + \mathrm{OPT}(C_{i+1}^x, *^x) \qquad (1)$$

Proof 1 We prove (1) by contradiction in two parts:
Part 1. $\mathrm{OPT}(C_i^x, *^x) \nless 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$
Part 2. $\mathrm{OPT}(C_i^x, *^x) \ngtr 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$

The first part (i.e., ≮) is trivial. Applying the step taken by MDS and then compressing $C_{i+1}^x$ optimally is itself a valid compression of $C_i^x$ with $1 + \mathrm{OPT}(C_{i+1}^x, *^x)$ steps. If $\mathrm{OPT}(C_i^x, *^x)$ were smaller than this number, the solution of $\mathrm{OPT}(C_i^x, *^x)$ could not be optimal. This leads to a contradiction and hence proves Part 1. To prove the second part (i.e., ≯), it is important to recall how the MDS method progresses given the state $C_i^x$, at which there exists at least one node $v_a$ with deg($v_a$) = 1. The method picks $v_a$. The node $v_a$ has exactly one noncompressed neighbor, say $v_b$. Thus, MDS merges them to create the supernode $v_{ab}$ (see Figure 5). We complete the proof by considering two cases. In the first case, the OPT method merges $v_a$ and $v_b$ while compressing $C_i^x$. In this case, we can assume that OPT takes this step as its next step in compressing $C_i^x$, since a fixed compressed network can be obtained by arbitrarily shuffling the order of the intermediate steps.
Therefore, if $v_a$ and $v_b$ are compressed at any point in the optimal method, then the optimal solution for $C_{i+1}^x$, which is created by applying the MDS method to $C_i^x$, has exactly $\mathrm{OPT}(C_i^x, *^x) - 1$ compressions. Hence, $\mathrm{OPT}(C_i^x, *^x) = 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$ and thus $\mathrm{OPT}(C_i^x, *^x) \ngtr 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$. In the second case, $v_a$ and $v_b$ are not merged together in the optimal solution. This case implies that $v_a$ is left as a singleton at the end of the xth level, since deg($v_a$) = 1. Then, the network that results after removing $v_a$ and all the edges connected to it can have at most $\mathrm{OPT}(C_i^x, *^x)$ compressions until the end of the xth level, since otherwise it would contradict the optimality of OPT. This shows that the number of compressions that can be achieved when $v_a$ is left as a singleton cannot be greater than one plus $\mathrm{OPT}(C_{i+1}^x, *^x)$. Thus, $\mathrm{OPT}(C_i^x, *^x) \ngtr 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$, and combining it with the first part (i.e., ≮) we get $\mathrm{OPT}(C_i^x, *^x) = 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$. □

Lemma 2 Let $C = (\hat{V}, \hat{E})$ denote a connected component of a given metabolic network $P$. Let $C_i^x = (\hat{V}_i^x, \hat{E}_i^x)$ denote the state of $C$ after the ith step of the xth compression level. If the node with minimum degree in $\hat{V}_i^x$ has degree greater than one, then the compression step taken by MDS to create the next state $C_{i+1}^x$ can lead to a network whose size is at most one larger than that of the compressed network obtained from the state $C_i^x$ by OPT. Formally,

$$\mathrm{OPT}(C_i^x, *^x) \le 2 + \mathrm{OPT}(C_{i+1}^x, *^x) \qquad (2)$$

Proof 2 Let $v_a$ be the first node in the list of minimum degree nodes in $\hat{V}_i^x$. From the assumption we know that deg($v_a$) > 1, and hence it has at least one noncompressed neighbor node $v_b$ that also has deg($v_b$) > 1. Without loss of generality, assume that the MDS method merges $v_a$ and $v_b$ to create the supernode $v_{ab}$ at the compression step from $C_i^x$ to $C_{i+1}^x$.
This step can prevent at most one neighbor of $v_a$, say $v_c$, and at most one neighbor of $v_b$, say $v_d$, from being merged with the corresponding node in later steps. Notice that $v_c$ and $v_d$ are not necessarily distinct. The MDS algorithm can also merge $v_c$ and $v_d$ in later steps if they are neighbors, though we do not know this for sure at this point. This results in either one compression or two compressions using only the four nodes $v_a$, $v_b$, $v_c$ and $v_d$ by the MDS method. Next, we calculate the number of compression steps that the OPT method can take for compressing these four nodes. There are three cases to consider:

Case 1. The OPT method merges $v_a$ with $v_b$ at any point during the xth level of compression. This case is equivalent to merging $v_a$ with $v_b$ in the next step by MDS and then compressing the rest of the network by OPT. In other words, MDS already takes the optimal compression step. Hence, $\mathrm{OPT}(C_i^x, *^x) = 1 + \mathrm{OPT}(C_{i+1}^x, *^x) \le 2 + \mathrm{OPT}(C_{i+1}^x, *^x)$.

Case 2. The OPT method merges $v_a$ with $v_c$ at any point during the xth level of compression. The worst case scenario for the MDS method in this case is when $v_c$ is not connected to $v_d$ and the OPT method merges $v_b$ with $v_d$ in a later step. This way the OPT method optimally compresses the four nodes down to two supernodes, namely $v_{ac}$ and $v_{bd}$. On the other hand, the MDS method creates a single supernode, $v_{ab}$, and the nodes $v_c$ and $v_d$ remain as singletons. However, even in this worst case, the MDS method prevents only one compression step from taking place with respect to OPT. Hence, $\mathrm{OPT}(C_i^x, *^x) \le 2 + \mathrm{OPT}(C_{i+1}^x, *^x)$.

Case 3. The OPT method merges $v_b$ with $v_d$ at any point during the xth level of compression. This case follows from Case 2 by symmetry. □

Using Lemmas 1 and 2, Theorem 1 develops an upper bound on the number of compressions that can be missed by MDS with respect to the optimal compression.
Theorem 1 (Optimality bound for MDS) Let $P$ be a metabolic network with $r$ connected components $C_1 = (\hat{V}_1, \hat{E}_1), \dots, C_r = (\hat{V}_r, \hat{E}_r)$ such that $P = \bigcup_{j=1}^{r} C_j$, and let $c$ be a positive integer given as the desired number of compression levels. Let $C = (\hat{V}, \hat{E})$ denote an arbitrary connected component of $P$. Also, let $s$ represent the number of intermediate steps for which no noncompressed node with degree one is found during the compression from $P$ to $P^c$ by the MDS method. Then each of the following statements holds:
1. $OPT(C^{x-1}, *x) \le 2\,MDS(C^{x-1}, *x)$ for $x = 1, \dots, c$.
2. $OPT(P, *c) \le s + MDS(P, *c)$.
3. $OPT(P, *c) \le \min\{2\,MDS(P, *c),\; s + MDS(P, *c)\}$.

Proof 3
1. This part follows from Lemmas 1 and 2. Lemma 1 states the case when the MDS method is equivalent to OPT. Lemma 2 gives an upper bound on the number of compression steps that MDS can miss. The worst case is when the boundary condition of Lemma 2 holds for each step of the $x$th compression level for $C^{x-1}$. In this case, the number of steps taken by the OPT method while compressing $C^{x-1}$ is two times the number for the MDS method.
2. This part also follows from Lemmas 1 and 2. Throughout the compression of the entire network $P$ by $c$ levels, each step of the MDS method that satisfies the condition in Lemma 2 can decrease the number of possible merge operations by one with respect to OPT. By simply counting these steps, at the end of the execution of the MDS method we can give the upper bound $s + MDS(P, *c)$ on the number of optimal compressions $OPT(P, *c)$.
3. Part 2 shows that $OPT(P, *c) \le s + MDS(P, *c)$. It is only necessary to show that $OPT(P, *c) \le 2\,MDS(P, *c)$. Part 1 proves this result for a single connected component $C$ for the $x$th compression level. $P$ is given as $\bigcup_{j=1}^{r} C_j$ before the first level of compression. We know by Part 1 that $OPT(C, *1) \le 2\,MDS(C, *1)$. Summing this up for all $j$ from 1 to $r$, we get $OPT(P, *1) \le 2\,MDS(P, *1)$.
This inequality holds for each compression level $x$ from 1 to $c$. Summation over $x$ gives $\sum_{x=1}^{c} OPT(P^{x-1}, *x) \le 2 \sum_{x=1}^{c} MDS(P^{x-1}, *x)$. Hence, we prove $OPT(P, *c) \le 2\,MDS(P, *c)$. □

Another way of interpreting Theorem 1 is to transform it into an upper bound on the size of the compressed network generated by MDS in terms of the one that can be obtained by OPT. By carrying out this transformation, we answer the question we posed in the introduction: "How far is our compression method from the optimal compression?". We do this as follows. Let $P$ be a network of size $n$. Given compression level $c$, let $\theta = OPT(P, *c)$ denote the number of compression steps of the OPT method. Also, let $n_{OPT}$ and $n_{MDS}$ denote the sizes of the compressed networks obtained by the OPT and MDS methods respectively. By the bound given in Theorem 1, we know that $MDS(P, *c) \ge \theta / 2$. Therefore, we can write $n_{OPT} = n - \theta$ and $n_{MDS} \le n - \theta / 2$. Also, we know by definition that $\theta \le \sum_{x=1}^{c} \lfloor n / 2^x \rfloor$. Using this inequality, we get:

$$n_{OPT} \ge n - \sum_{x=1}^{c} \frac{n}{2^x}, \qquad n_{MDS} \le n - \sum_{x=1}^{c} \frac{n}{2^{x+1}} \qquad (3)$$

If we examine the ratio $n_{MDS} / n_{OPT}$, for $c = 1$ we get $n_{MDS} / n_{OPT} \le 3/2$ for arbitrary $n$ (details omitted). This demonstrates that after one level of compression, the size of the compressed network found by our method is at most 1.5 times the size of the optimal network. For $x = 1, 2, \dots, c$, this ratio is proportional to $(1.5)^x$. We can also use the bound on the number of compression steps given in the second statement of Theorem 1 to derive a similar upper bound on the size of the compressed network found by MDS.
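The counting argument behind Theorem 1 can be illustrated with a small sketch. The code below is not the authors' implementation: it is a minimal greedy minimum-degree matching for a single compression level, assuming an undirected adjacency dictionary, a lowest-degree tie-break for choosing the neighbor, and hypothetical function names (`mds_level`, `opt_upper_bound`). It counts the steps `s` in which no degree-one node was available and reports the tighter of the two bounds from the third statement of the theorem.

```python
import heapq


def mds_level(adj):
    """Run one compression level of a minimum-degree greedy merge.

    adj maps each node to the set of its neighbours (undirected).
    Returns (pairs, singletons, s), where s counts the merge steps whose
    minimum-degree node had more than one uncompressed neighbour.
    """
    # free[v] holds only the neighbours of v that are still uncompressed.
    free = {v: set(nbrs) for v, nbrs in adj.items()}
    heap = [(len(nbrs), v) for v, nbrs in free.items()]
    heapq.heapify(heap)
    pairs, singletons, s = [], [], 0

    while heap:
        deg, v = heapq.heappop(heap)
        if v not in free or deg != len(free[v]):
            continue  # stale heap entry, a fresher one exists or v is merged
        if deg == 0:
            singletons.append(v)  # nothing left to merge with at this level
            del free[v]
            continue
        if deg > 1:
            s += 1  # the situation covered by Lemma 2
        # Assumption: among the neighbours, pick the one with lowest degree.
        u = min(free[v], key=lambda w: (len(free[w]), w))
        pairs.append((v, u))
        for merged in (v, u):
            for w in free.pop(merged):
                if w in free:
                    free[w].discard(merged)
                    heapq.heappush(heap, (len(free[w]), w))
    return pairs, singletons, s


def opt_upper_bound(num_merges, s):
    """Tighter of the two bounds on the optimal number of compressions."""
    return min(2 * num_merges, s + num_merges)
```

On a four-node path the sketch merges both end pairs without ever hitting the Lemma 2 case, so `s` stays at zero and the reported bound equals the number of merges actually made.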
The tighter of these two upper bounds on the network size can be calculated during the execution of the MDS method and reported as an indicator of how much room is left for improving the compression.

Alignment framework

We described the first phase, namely the compression phase, in detail in the previous sections. Here, we first summarize the base alignment method, SubMAP [10], that we use in our framework. Then, we explain the two remaining phases of our framework, namely the alignment phase and the refinement phase. The alignment phase follows the compression phase and uses the base method to find an alignment in the compressed domain. The refinement phase applies the base method to the mappings found in the previous phase to further refine the alignment results. After describing all the phases, we analyze the complexity of each phase and combine them to obtain the complexity of the entire framework. Then, we examine the characteristics of the queries to determine which are likely to benefit from compression during the alignment, answering the question "When should we compress?" Last, we provide a guideline for selecting the compression level that is expected to give the best performance gain of our framework with respect to the base alignment method.

Overview of SubMAP

Here, we take a small detour and explain SubMAP, a recent method for aligning metabolic networks when they are not compressed. We pick the SubMAP method for its high accuracy and biological relevance, as it considers subnetworks of the given networks during the alignment. A subnetwork of a network is a subset of the reactions of that network such that the induced undirected graph of this subset is connected.
Given two metabolic networks $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ and a positive integer $k$, SubMAP aims to find a set of mappings between the reactions of $P$ and $\bar{P}$ with the largest similarity score, such that: (i) each reaction in $P$ ($\bar{P}$) can map to a subnetwork of $\bar{P}$ ($P$) with at most $k$ reactions; (ii) each reaction of $P$ and $\bar{P}$ can appear in at most one mapping.

The first step of SubMAP is to create the set of all possible subnetworks of size at most $k$ for each query network. We denote the number of these subnetworks for $P$ and $\bar{P}$ by $N_k$ and $M_k$ respectively. The second step of SubMAP is to calculate pairwise similarities between each pair of these subnetworks, one from $P$ and one from $\bar{P}$. Each subnetwork consists of reactions, and each reaction is defined by its input and output compounds (i.e., substrates and products) and the enzymes that catalyze it. Therefore, we measure the pairwise similarities between subnetworks using reaction similarities, which in turn are defined by the similarities of the components of these reactions. For more details of this similarity score we refer the reader to Ay et al. [10].

The step that dominates the time and space complexity of SubMAP is the third step. The aim of this step is to create a similarity score that combines pairwise similarities with the topological similarity of the networks. A data structure named the support matrix is created for this purpose. The size of this matrix is quadratic in the number of subnetworks of each query network. In other words, the support matrix requires $O(N_k^2 M_k^2)$ space. This complexity is very important, as it is the dominating factor in the overall time and space complexity of SubMAP. The next two steps of the algorithm are to combine topological similarity with pairwise node similarities and to extract the alignment as a set of subnetwork mappings of $P$ and $\bar{P}$.
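The first step above, enumerating every connected subnetwork with at most $k$ reactions, can be sketched as follows. This is an illustrative enumeration and not SubMAP's own code; the function name `connected_subnetworks` and the adjacency-dictionary input are assumptions made for the example. It grows connected node sets one neighbor at a time, which reaches every connected set because any connected set of size $j + 1$ contains a connected subset of size $j$.

```python
def connected_subnetworks(adj, k):
    """Return all connected node subsets of size 1..k as frozensets.

    adj maps each node to the set of its neighbours in the induced
    undirected graph of the network.
    """
    current = {frozenset([v]) for v in adj}  # all subnetworks of size 1
    found = set(current)
    for _ in range(k - 1):
        # Extend every subnetwork by one node adjacent to any of its members.
        current = {
            sub | {w}
            for sub in current
            for v in sub
            for w in adj[v]
            if w not in sub
        }
        found |= current
    return found
```

The size of the returned set plays the role of $N_k$ (or $M_k$) in the text, so the sketch also gives a direct way to see how quickly that count grows with $k$ on a given topology.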
Alignment phase

The SubMAP method described above aligns the networks $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ in their original form. Our framework first compresses each of these networks to reduce their sizes and then aligns the compressed networks instead of $P$ and $\bar{P}$. In this section, we explain how we align the compressed networks $P^c$ and $\bar{P}^c$, which are in the compressed domain of level $c$, using SubMAP with a given parameter $k$.

Let us first consider $P^c = (V^c, E^c)$. Each node $v_a$ in $V^c$ is a supernode of the reactions in $V$. Also, by the working of our compression method, we know that each supernode $v_a$ contains at most $2^c$ reactions. An edge from the node $v_a$ to the node $v_b$ exists in $E^c$ if and only if at least one reaction in $v_a$ has an edge to one reaction in $v_b$ in $E$. The same arguments hold for the other network $\bar{P}^c$ as well.

To align these compressed networks, we treat their nodes, which are supernodes of reactions, as if they were the reactions of the metabolic networks $P^c$ and $\bar{P}^c$. This way, we can directly apply SubMAP to align these networks. As far as the operation of the SubMAP method is concerned, this is no different from aligning two networks that are identical to these networks but are in the original domain. The difference is in the interpretation of the intermediate steps and the form of the mappings found by the alignment. For instance, for the first step of SubMAP, we enumerate the reaction subnetworks of size at most $k$ in the original domain, whereas in the compressed domain we enumerate the subnetworks of supernodes, where each supernode can contain more than one reaction and the number of such supernodes in one subnetwork is at most $k$. Similarly, we calculate the pairwise similarity, the support matrix and the conflict graph for the subnetworks of supernodes (i.e., nodes of $V^c$) instead of subnetworks of reactions (i.e., nodes of $V$). The resulting alignment gives us a set of mappings between the subnetworks of $P^c$ and $\bar{P}^c$.
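The edge rule for the compressed network can be made concrete with a short sketch. It assumes directed edges given as pairs and a dictionary that assigns each reaction to its supernode; both the function name `compress_edges` and the choice to drop edges that fall inside a single supernode are assumptions of this example, not details taken from the paper.

```python
def compress_edges(edges, supernode_of):
    """Project the edges of the original network onto supernodes.

    edges is an iterable of directed (u, v) reaction pairs and
    supernode_of maps every reaction to the label of its supernode.
    A compressed edge (A, B) exists when some reaction in A has an
    edge to some reaction in B; edges internal to a supernode are
    dropped (assumption of this sketch).
    """
    return {
        (supernode_of[u], supernode_of[v])
        for u, v in edges
        if supernode_of[u] != supernode_of[v]
    }
```

Because the result is a set, several original edges between the same two supernodes collapse into one compressed edge, which is what lets SubMAP treat supernodes as ordinary nodes.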
We can think of these mappings as a high-level view of the alignment between the networks $P$ and $\bar{P}$. For instance, from Figure 1(f) one can immediately see that the resulting alignment will map node $a$ either to node $a'$ or to node $b'$, and that these are the only options for node $a$, as imposed by the higher-level mapping of supernode $(a, b)$ to supernode $(a', b')$. In the next phase, we consider each of these supernode mappings as a smaller instance of the alignment problem and solve them to obtain a more refined alignment of $P$ and $\bar{P}$.

Refinement phase

Each mapping found by the alignment phase is a subnetwork pair where one subnetwork is from $P^c$ and the other is from $\bar{P}^c$. The mappings found by SubMAP can have up to $k$ nodes in one subnetwork and only one node in the other. If we denote a subnetwork of $P^c$ by $R_i^c$ and a subnetwork of $\bar{P}^c$ by $\bar{R}_j^c$, the resulting mappings of the alignment phase will be of the form $(R_i^c, \bar{R}_j^c)$. We can assume, without loss of generality, for this specific pair that $R_i^c$ contains up to $k$ nodes of $P^c$ and $\bar{R}_j^c$ contains a single node of $\bar{P}^c$. Each node contained in either of these subnetworks is a supernode that contains either one node, or two nodes and an edge between them, from the previous level of compression, namely the $(c-1)$th level. For both $R_i^c$ and $\bar{R}_j^c$, we decompress their nodes by one level by retrieving the connectivity between these nodes in the $(c-1)$th compression level that was encapsulated in the $c$th level. This decompression results in at most $2k$ nodes from the $(c-1)$th level for $R_i^c$ and at most 2 nodes from the $(c-1)$th level for $\bar{R}_j^c$. We then recursively align these smaller networks generated from $R_i^c$ and $\bar{R}_j^c$ by using SubMAP until the original domain (i.e., $c = 0$) is reached. At the $(c-x)$th recursive step, the sizes of the two networks to be aligned can be at most $k\,2^x$ for one network and $2^x$ for the other. Figure 1(f) illustrates this on a concrete example.
The network on the left has two supernodes (i.e., $(a, b)$ and $(e, d)$), each containing two nodes with an edge between them, and one supernode (i.e., $(c)$) which contains only one node from the previous level of compression. The one on the right has two supernodes with two nodes in each. To understand how decompression by one level works, we can focus on the mapping of supernode $(e, d)$ to supernode $(c', d')$, which is found in compression level one. We can think of decompression as removing the circles that surround these supernodes to get back the connectivity within their nodes in the previous compression level. In our case, this leads to the small networks $d \to e$ and $c' \to d'$. We align these small networks recursively using SubMAP and report their final alignment in only one recursive call, since the compression level is only one in this case. Also, since $k = 1$ is used to keep this example simple, the sizes of the networks, in terms of nodes in the original domain, on each side are at most 2 for the recursive call from $c = 1$, as can be seen from Figure 1(f) (i.e., $k\,2^c = 2^c = 2$ for $k = c = 1$).

Complexity analysis

Having discussed all three phases, we can now analyze the overall complexity of our framework. We start from the first phase, which is the compression of the input networks $P$ and $\bar{P}$ by $c$ levels. We first calculate the complexity of the first compression level for the network $P$ with size $n$. At each compression step, MDS first searches for a minimum degree node. Once it finds this node, it picks one of its neighbor nodes and merges these two nodes. After this merging, it updates the degrees of all the neighbors of each of the merged nodes. The first two of these operations take $O(\log n)$ time if proper data structures are used, and the last one can take $O(n)$ in the worst case. Since the size of network $P$ is $n$, there can be at most $n/2$ compression steps during the first level of compression.
Hence, the complexity of the compression for the first level is $O(n^2)$. Since the input size at this level is larger than at all subsequent levels, we can safely assume that each of these subsequent levels also takes $O(n^2)$, and the complexity of compression by $c$ levels is therefore $O(cn^2)$. Even though this is not a tight bound, it is sufficient at this point because the complexity of the next two phases will dominate it. Since we compress both networks, the overall complexity of the compression phase is:

$$O(c\,(n^2 + m^2)) \qquad (4)$$

For the analysis of the next phases, we make two assumptions, both of which are supported by experimental evidence on the topological properties of metabolic networks. Our first assumption is that at each level of compression our method reduces the network size by half. In other words, if the sizes of our query networks are $n$ and $m$, then the sizes of the compressed networks after $c$ levels by the MDS method are $n_{MDS} = n / 2^c$ and $m_{MDS} = m / 2^c$ respectively. This is mainly because metabolic networks contain many nodes with low degrees [27]. Our experiments on a large dataset of networks, summarized in Table 1, support this as well. The second assumption is that the number of subnetworks is a constant multiple of the network size for small $k$ values. In other words, $N_{MDS} = \alpha(k)\, n_{MDS}$ and $M_{MDS} = \beta(k)\, m_{MDS}$, where $\alpha(k)$ and $\beta(k)$ are functions of $k$ but are independent of $n$ and $m$ respectively. Our earlier analysis in Ay et al. [10] demonstrated that the number of subnetworks for $k = 3$, which is the largest $k$ value we use here, is on the order of $5|V|$ for a large set of metabolic networks.

We are now ready to analyze the complexity of the second phase, which is the alignment phase. By the first assumption, we know that the sizes of $P^c$ and $\bar{P}^c$ are $n_{MDS} = n / 2^c$ and $m_{MDS} = m / 2^c$ respectively. By the second, the numbers of subnetworks of these networks are $N_{MDS} = \alpha(k)\, n_{MDS}$ and $M_{MDS} = \beta(k)\, m_{MDS}$ for a given $k$.
Also, we know that the complexity of SubMAP is quadratic in terms of $N_{MDS}$ and $M_{MDS}$. Therefore, the complexity of the second phase is:

$$O\!\left(\frac{\alpha(k)^2\, \beta(k)^2\, n^2 m^2}{2^{4c}}\right) \qquad (5)$$

The complexity of the refinement phase has two factors. The first one is the number of mappings found by the alignment phase. Since SubMAP allows each node of both networks to be reported in at most one mapping, we have a trivial upper bound on the number of possible mappings in terms of $n$ and $m$. The largest number of mappings is reported when all the subnetworks of both networks are singletons. In this case, the number of reported mappings is the minimum of $n$ and $m$. We can assume without loss of generality that $n < m$, and hence this number is $O(n)$. The second factor is the size of each of these $O(n)$ smaller alignment problems, which need to be solved by SubMAP again to refine the mapping results. As we discussed in the refinement phase, the sizes of the networks created by decompressing the mapped subnetworks are at most $k\,2^c$ on one side and at most $2^c$ on the other. The numbers of subnetworks that can be created from these networks are $\alpha(k)\, k\, 2^c$ and $\beta(k)\, 2^c$ for the corresponding sides. Therefore, each mapping can be refined by decompressing and applying SubMAP, which is $O(\alpha(k)^2\, k^2\, 2^{2c}\, \beta(k)^2\, 2^{2c})$. We do this refinement $O(n)$ times in the worst case, hence the complexity of the refinement phase is:

$$O\!\left(\alpha(k)^2\, \beta(k)^2\, n\, k^2\, 2^{4c}\right) \qquad (6)$$

Combining the results of Equations 4, 5 and 6, we can see that the overall complexity of our method is determined by the second or the third phase depending on the value of $c$. For small values of $c$ and $k$, such as 1, 2 and 3, the second phase dominates the overall complexity. Larger values of $c$ result in a costlier refinement phase and a less expensive alignment phase.
Very large values of $k$ imply exponentially many subnetworks, in which case the above complexity analysis would not hold and the alignment problem may become intractable with or without compression.

When should we compress?

We discussed the potential of our framework to improve the scalability of existing network alignment methods. However, there can be cases in which compression results in network topologies that force the alignment method into its worst case performance. In this section, we analyze when performing the alignment in the compressed domain is the better alternative. For this purpose, we devise a criterion inspired by the results of a large number of network alignments carried out with both methods. We find that the gain or loss in running time is highly dependent on the number of all possible subnetworks of the compressed and noncompressed networks. These numbers can be determined before the alignment. By formulating a criterion in terms of these numbers, we can decide between the two algorithms before actually performing an alignment.

Figure 4 illustrates the results for 3600 alignments performed by both methods on a wide range of network sizes with all possible combinations of $k$ and $c$ values. The x-axis shows the running time of SubMAP minus the running time of our framework. The larger this value is, the greater the improvement we get from our framework. The y-axis shows the ratio $y = \frac{N_k^c M_k^c}{N_k M_k}$, where $N_k$, $M_k$ denote the numbers of all subnetworks of $P$ and $\bar{P}$, and $N_k^c$, $M_k^c$ denote the numbers of all subnetworks of the compressed networks $P^c$ and $\bar{P}^c$. The dashed line at $y = 0.5$ visualizes our criterion. If the above ratio is below 0.5, then the number of all possible subnetwork pairs generated by the compressed alignment is less than half of this number for the original alignment.
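Because the four subnetwork counts are available before any alignment is run, the ratio on the y-axis and its comparison against the 0.5 line can be evaluated up front. The helper below is a minimal sketch under that reading; the name `should_compress` and the `threshold` parameter are hypothetical and only restate the criterion described in this section.

```python
def should_compress(n_k, m_k, n_k_c, m_k_c, threshold=0.5):
    """Evaluate the subnetwork-count criterion for using compression.

    n_k, m_k     : numbers of subnetworks of the original networks
    n_k_c, m_k_c : numbers of subnetworks of the compressed networks
    Returns (decision, ratio), where ratio is the y-axis value of Figure 4.
    """
    ratio = (n_k_c * m_k_c) / (n_k * m_k)
    return ratio <= threshold, ratio
```

The counts themselves would come from enumerating the subnetworks of size at most $k$ on both the original and the compressed networks.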
A very large portion of the alignments satisfying this criterion (97%) show an improvement in running time if compression is used. Above the 0.5 line, only a small portion of the alignments (10%) show an improvement. Considering the overhead of the refinement phase and the compression phase, this result is expected. These results strongly suggest that the answer to the question "When should we compress?" is "when $\frac{N_k^c M_k^c}{N_k M_k} \le 0.5$".

How much should we compress?

In this section, we provide a guideline for selecting the compression level $c$ that results in the minimum expected running time of our framework when aligning the query networks with a given $k$. The proof of the theorem below makes extensive use of the computational complexity results discussed above. The theorem formulates the optimal $c$ for a given $k$ value and two query networks with sizes $n$ and $m$, and thereby answers the question "What is the right amount of compression that we need to use in order to minimize the running time of our framework?".

Theorem 2 (Optimal level of compression) Let $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ be two metabolic networks with sizes $n$ and $m$ respectively, and let $k$ be a given positive integer. Assume without loss of generality that $n < m$. Then, the compression level $c$ that gives the optimal compression is:

$$c = \frac{\log_2\!\left(n\, m^2\, k^{-2}\right)}{8} \qquad (7)$$

Proof 4 Given $P$ and $\bar{P}$, we want to find the value of $c$ such that the difference between the complexity of applying SubMAP to align these networks in their original domain for a given $k$ and the complexity of using our framework to align $P$ with $\bar{P}$ in the compressed domain for the same $k$ value is maximum. We omit the constant factors and use the algorithmic complexity as the cost of alignment.
Under this assumption, the cost of aligning two networks with sizes $n$ and $m$ with SubMAP in the original domain for a given $k$ value is:

$$\alpha(k)^2\, \beta(k)^2\, n^2 m^2 \qquad (8)$$

For our framework, this cost can be determined from the complexities of the three phases given by Equations (4), (5) and (6) (see main article for these equations). As discussed, the dominating factors in the complexity are the last two phases (i.e., Equation (5) and Equation (6)). Therefore, we write the total cost of aligning $P$ with $\bar{P}$ in the compressed domain of level $c$, for a given $k$ value, as:

$$\frac{\alpha(k)^2\, \beta(k)^2\, n^2 m^2}{2^{4c}} + \alpha(k)^2\, \beta(k)^2\, n\, k^2\, 2^{4c} \qquad (9)$$

Our aim is to maximize (8) − (9) with respect to $c$. We know that this difference is negative (i.e., alignment in the compressed domain is costlier) when $c \ge n$ (assuming $n < m$ as stated in the theorem) or when $c = 0$, due to the overhead of the compression and/or refinement phases. We also know that for $c = 1$ this difference is positive, as compression by one level always results in less costly alignments compared to no compression. Therefore, if there is an extremum of (8) − (9) with respect to $c$ for $c \in (0, n)$, then this extremum is a maximum, meaning that the difference (8) − (9) is largest at that point. We calculate this maximum by differentiating (8) − (9) with respect to $c$ and setting the derivative to zero:

$$\begin{aligned}
\frac{\partial\,\big((8) - (9)\big)}{\partial c} &= 0 \\
\frac{\partial \left\{ \alpha(k)^2 \beta(k)^2 n^2 m^2 - \alpha(k)^2 \beta(k)^2 n^2 m^2\, 2^{-4c} - \alpha(k)^2 \beta(k)^2 n\, k^2\, 2^{4c} \right\}}{\partial c} &= 0 \\
4 \log(2)\, 2^{-4c}\, \alpha(k)^2 \beta(k)^2 n^2 m^2 - 4 \log(2)\, 2^{4c}\, \alpha(k)^2 \beta(k)^2 n\, k^2 &= 0 \\
2^{-4c}\, n\, m^2 - 2^{4c}\, k^2 &= 0 \\
2^{8c} &= n\, m^2\, k^{-2} \\
c &= \frac{\log_2\!\left(n\, m^2\, k^{-2}\right)}{8}
\end{aligned} \qquad (10)$$

□

The value obtained from the above discussion is not necessarily an integer. We suggest using the nearest integer to this value as the number of compression levels in our alignment.
Next, we give a few examples to see what Theorem 2 implies in practice. Assume we have two networks with sizes $n = 100$ and $m = 100$, and we want to align them using our framework for $k = 2$. Plugging these numbers into Equation 7, we get:

$$c = \frac{\log_2(250000)}{8} = \frac{17.93}{8} \cong 2.24$$

Rounding this to the nearest integer, Equation 7 suggests that we use two levels of compression for this alignment problem to get the largest gain in running time. We can carry out the calculation similarly for a bigger set of inputs, $n = m = 1000$ and $k = 3$, which gives around 3.34, suggesting that three levels of compression are likely to provide the best running time improvement for this instance. However, it is important to note that, depending on the desired tradeoff between the running time gain and the alignment accuracy, the user can always use smaller (or bigger) $c$ values than the ones suggested here. Also, the values calculated above are only expected to provide the best running time improvement with respect to the running time of the original alignment. If the size of the query is orders of magnitude bigger than the original algorithm can handle, then the framework we propose here is also likely to fail to perform the alignment.
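The two worked examples can be reproduced with a few lines of code. The sketch below evaluates Equation 7 and, as a cross-check, the cost expression of Equation 9 with the common factor $\alpha(k)^2 \beta(k)^2$ dropped. The function names are hypothetical, and restricting the search to integer levels 1 through 6 in the usage is an assumption made only for illustration.

```python
import math


def optimal_level(n, m, k):
    """Real-valued compression level suggested by Equation 7."""
    return math.log2(n * m**2 / k**2) / 8


def framework_cost(n, m, k, c):
    """Alignment plus refinement cost of Equation 9, up to the shared
    alpha(k)^2 * beta(k)^2 factor, which does not affect the minimiser."""
    return n**2 * m**2 / 2 ** (4 * c) + n * k**2 * 2 ** (4 * c)


def best_integer_level(n, m, k, levels=range(1, 7)):
    """Integer level with the smallest modelled cost among the candidates."""
    return min(levels, key=lambda c: framework_cost(n, m, k, c))
```

For $n = m = 100$, $k = 2$ the closed form gives about 2.24 and the integer search picks two levels; for $n = m = 1000$, $k = 3$ the values are about 3.34 and three levels, matching the rounding rule suggested in the text.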
List of abbreviations

$P = (V, E)$, $\bar{P} = (\bar{V}, \bar{E})$: Query metabolic networks;
$V$, $\bar{V}$: Sets of all reactions of the query networks;
$r_i \in V$, $\bar{r}_j \in \bar{V}$: Reactions of the query networks;
$n = |V|$, $m = |\bar{V}|$: Sizes of the query networks;
$c$, $2^c$: Compression level and compression rate;
$P^c = (V^c, E^c)$: $P$ after $c$ levels of compression;
$C_i = (\hat{V}_i, \hat{E}_i)$: A connected component of network $P$;
$N(v_a)$, $deg(v_a)$: The set of neighbors and the degree of node $v_a$;
$|v_a|$: Number of reactions that are contained in $v_a$;
$v_{ab}$: A supernode containing the nodes $v_a$ and $v_b$;
$k$: Parameter for the largest subnetwork size;
$\mathcal{R}_k$, $\bar{\mathcal{R}}_k$: Sets of all subnetworks of size at most $k$;
$R_i$, $\bar{R}_j$: Subnetworks of the query networks;
$N_k$, $M_k$: Numbers of all subnetworks of size at most $k$.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FA, TK, and MD developed the method. MD and FA implemented the methods and gathered experimental results. FA and TK wrote the paper.

Acknowledgements and funding

This work was supported partially by NSF under grants IIS0845439 and CCF0829867. FA is partially supported by NSF under grant #1136996 to the Computing Research Association for the CIFellows project. This article has been published as part of BMC Bioinformatics Volume 13 Supplement 3, 2012: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/13/S3.
References

1. Navlakha S, Schatz M, Kingsford C: Revealing biological modules via graph summarization. J Comput Biol 2009, 16(2):253-264. doi:10.1089/cmb.2008.11TT. PMID 19183002.
2. Segal E, Pe'er D, Regev A, Koller D, Friedman N: Learning module networks. Journal of Machine Learning Research 2005, 6:557-588.
3. Ay F, Dinh T, Thai M, Kahveci T: Dynamic modular structure of regulatory networks. IEEE International Conference on Bioinformatics and Bioengineering (BIBE) 2010, 136-143.
4. Dutkowski J, Tiuryn J: Identification of functional modules from conserved ancestral protein-protein interactions. Bioinformatics 2007, 23(13):i149-i158. doi:10.1093/bioinformatics/btm194. PMID 17646291.
5. Dandekar T, Schuster S, Snel B, Huynen M, Bork P: Pathway alignment: application to the comparative analysis of glycolytic enzymes. Biochem J 1999, 343(Pt 1):115-124. doi:10.1042/0264-6021:3430115. PMCID 1220531. PMID 10493919.
6. Dost B, Shlomi T, Gupta N, Ruppin E, Bafna V, Sharan R: QNet: a tool for querying protein interaction networks. International Conference on Research in Computational Molecular Biology (RECOMB) 2007, 1-15.
7. Kuchaiev O, Przulj N: Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics 2011, 27:1390-1396. doi:10.1093/bioinformatics/btr127. PMID 21414992.
8. Ay F, Kahveci T, de Crécy-Lagard V: A fast and accurate algorithm for comparative analysis of metabolic pathways. J Bioinform Comput Biol 2009, 7(3):389-428. doi:10.1142/S0219720009004163. PMID 19507283.
9. Ay F, Kahveci T: SubMAP: aligning metabolic pathways with subnetwork mappings. International Conference on Research in Computational Molecular Biology (RECOMB) 2010, LNCS 6044:15-30.
10. Ay F, Kellis M, Kahveci T: SubMAP: aligning metabolic pathways with subnetwork mappings. J Comput Biol 2011, 18(3):219-235. doi:10.1089/cmb.2010.0280. PMCID 3123932. PMID 21385030.
11. Liao CS, Lu K, Baym M, Singh R, Berger B: IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 2009, 25(12):i253-i258. doi:10.1093/bioinformatics/btp203. PMCID 2687957. PMID 19477996.
12. Koyuturk M, Grama A, Szpankowski W: Pairwise local alignment of protein interaction networks guided by models of evolution. International Conference on Research in Computational Molecular Biology (RECOMB) 2005, 48-65.
13. Cheng Q, Harrison R, Zelikovsky A: MetNetAligner: a web service tool for metabolic network alignments. Bioinformatics 2009, 25(15):1989-90. doi:10.1093/bioinformatics/btp287. PMID 19414533.
14. Kalaev M, Bafna V, Sharan R: Fast and accurate alignment of multiple protein networks. J Comput Biol 2009, 16:989-99. doi:10.1089/cmb.2009.0136. PMID 19624266.
15. Chen M, Hofestadt R: PathAligner: metabolic pathway retrieval and alignment. Appl Bioinformatics 2004, 3(4):241-252. doi:10.2165/00822942-200403040-00006. PMID 15702955.
16. Li Z, Zhang S, Wang Y, Zhang XS, Chen L: Alignment of molecular networks by integer quadratic programming. Bioinformatics 2007, 23(13):1631-1639. doi:10.1093/bioinformatics/btm156. PMID 17468121.
17. Li Y, de Ridder D, de Groot MJL, Reinders MJT: Metabolic pathway alignment between species using a comprehensive and flexible similarity measure. BMC Syst Biol 2008, 2:111. doi:10.1186/1752-0509-2-111. PMCID 2677397. PMID 19108747.
18. Kuchaiev O, Milenkovic T, Memisevic V, Hayes W, Przulj N: Topological network alignment uncovers biological function and phylogeny. J R Soc Interface 2010, 7:1341-1354. doi:10.1098/rsif.2010.0063. PMCID 2894889. PMID 20236959.
19. Chor B, Tuller T: Biological networks: comparison, conservation, and evolution via relative description length. J Comput Biol 2007, 14(6):817-838. doi:10.1089/cmb.2007.R018. PMID 17691896.
20. Pinter RY, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson M: Alignment of metabolic pathways. Bioinformatics 2005, 21(16):3401-3408. doi:10.1093/bioinformatics/bti554. PMID 15985496.
21. Singh R, Xu J, Berger B: Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci USA 2008, 105:12763-12768. doi:10.1073/pnas.0806627105. PMCID 2522262. PMID 18725631.
22. Francke C, Siezen RJ, Teusink B: Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol 2005, 13(11):550-558. doi:10.1016/j.tim.2005.09.001. PMID 16169729.
23. Sridhar P, Kahveci T, Ranka S: An iterative algorithm for metabolic network-based drug target identification. Pac Symp Biocomput 2007, 12:88-99. PMID 17992747.
24. Ogata H, Fujibuchi W, Goto S, Kanehisa M: A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res 2000, 28:4021-4028. doi:10.1093/nar/28.20.4021. PMCID 110779. PMID 11024183.
25. Green ML, Karp PD: A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5:76. doi:10.1186/1471-2105-5-76. PMCID 446185. PMID 15189570.
26. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 1999, 27:29-34. doi:10.1093/nar/27.1.29. PMCID 148090. PMID 9847135.
27. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407(6804):651-654. doi:10.1038/35036627. PMID 11034217.
28. Pfeiffer T, Soyer OS, Bonhoeffer S: The evolution of connectivity in metabolic networks. PLoS Biol 2005, 3(7):e228. doi:10.1371/journal.pbio.0030228. PMCID 1157096. PMID 16000019.
29. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science 2002, 297(5586):1551-1555. doi:10.1126/science.1073374. PMID 12202830.

PROCEEDINGS Open Access

Metabolic network alignment in large scale by network compression

Ferhat Ay1,2*, Michael Dang1, Tamer Kahveci1

From ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011 (ACM BCB), Chicago, IL, USA. 1-3 August 2011

Abstract

Metabolic network alignment is a system scale comparative analysis that discovers important similarities and differences across different metabolisms and organisms. Although the problem of aligning metabolic networks has been considered in the past, the computational complexity of the existing solutions has so far limited their use to moderately sized networks. In this paper, we address the problem of aligning two metabolic networks, particularly when both of them are too large to be dealt with using existing methods. We develop a generic framework that can significantly improve the scale of the networks that can be aligned in practical time. Our framework has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we develop an algorithm which transforms the given networks to a compressed domain where they are summarized using fewer nodes, termed supernodes, and interactions. In the second phase, we carry out the alignment in the compressed domain using an existing network alignment method as our base algorithm. This alignment results in supernode mappings in the compressed domain, each of which are smaller instances of network alignment
problem. In the third phase, we solve each of the instances using the base alignment algorithm to refine the alignment results. We provide a user defined parameter to control the number of compression levels which generally determines the tradeoff between the quality of the alignment versus how fast the algorithm runs. Our experiments on the networks from KEGG pathway database demonstrate that the compression method we propose reduces the sizes of metabolic networks by almost half at each compression level which provides an expected speedup of more than an order of magnitude. We also observe that the alignments obtained by only one level of compression capture the original alignment results with high accuracy. Together, these suggest that our framework results in alignments that are comparable to existing algorithms and can do this with practical resource utilization for large scale networks that existing algorithms could not handle. As an example of our method's performance in practice, the alignment of organism wide metabolic networks of human (1615 reactions) and mouse (1600 reactions) was performed under three minutes by only using a single level of compression.

Background

Biological networks provide a compact representation of the roles of different biochemical entities and the interactions between them. Depending on the types of entities and interactions, these networks are segregated into different types, where each network type encompasses a particular set of biological processes. Protein-protein interaction (PPI) networks comprise binding relationships between two or more proteins to carry out specific cellular functions such as signal transduction. Regulatory networks consist of interactions between genes and gene products to control the rates at which genes are transcribed. Metabolic networks represent sets of chemical reactions that are catalyzed by enzymes to transform a set of metabolites into others to maintain the stability of a cell and to meet its particular needs. Analysis of the connectivity properties of these networks has proven to be crucial in uncovering the details of the cell machinery and in revealing the functional modules and complexes involved in this mechanism [1-4].
*Correspondence: ferhatay@uw.edu
1 Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
Full list of author information is available at the end of the article

Ay et al. BMC Bioinformatics 2012, 13(Suppl 3):S2
http://www.biomedcentral.com/1471-2105/13/S3/S2

© 2012 Ay et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

An essential type of network analysis is the comparative analysis that aims at identifying functionally similar elements or element sets shared among different organisms, which would not be possible if these elements were only considered individually. This is often achieved through alignment of the networks of these organisms. Analogous to sequence alignment, which identifies conserved sequences, network alignment reveals connectivity patterns that are conserved among two or more organisms. A number of studies have been done to systematically align different types of biological networks [5-21]. For metabolic networks, Pinter et al. [20] devised an algorithm that aligns query networks with specific topologies by using a graph-theoretic approach. Recently, some of us developed an algorithm that combines both topological features and homological similarity of pairwise molecules to align metabolic networks [8]. We also proposed a method, SubMAP [9,10], that incorporates subnetwork mappings in metabolic network alignment. A similar method, IsoRank [21], has been applied to find the alignments of PPI networks. IsoRankN [11] extended this algorithm to work for multiple networks and to allow mappings of protein clusters.

Comparative analysis is particularly important for large metabolic networks such as organism-wide networks.
Identification of the conserved patterns among metabolic networks across species provides insights for metabolic reconstruction of a newly sequenced genome [22], orthology detection [21], drug target identification [23] and identification of enzyme clusters and missing enzymes [24,25]. However, aligning large-scale networks is a computationally challenging problem due to the underlying subgraph isomorphism problem that has to be solved to find the alignment that maximizes the similarity between the query networks. The methods we mentioned above restrict the query topologies, their sizes, or both. Even under these conditions, the running times and memory utilization of these methods can still be prohibitive for large query networks. For instance, the method of Pinter et al. [20] takes around one minute per alignment on a dataset with only small networks ranging from 2 to 41 nodes. Our earlier method, SubMAP, has no limitations on the query topologies and allows mappings of node sets that are connected (i.e., subnetworks). However, allowing subnetworks comes at the cost of an increased running time, which is inherent because the number of all connected subnetworks up to a given size can be exponential in the size of the network. For a network of size 80 and subnetwork sizes up to 3, SubMAP takes around 6 minutes and 150 MB of memory on average per alignment with a database of networks of size 50 on average. Therefore, improving the running time and memory utilization of these methods is necessary to leverage the alignment of larger scale networks, especially when subnetwork mappings are allowed.
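To make the growth in the number of connected subnetworks concrete, the following minimal Python sketch enumerates all connected node sets of size at most k. It is not SubMAP's own enumeration: the function name `connected_subnetworks`, the adjacency-dictionary input, and the choice to treat edges as undirected when testing connectivity are all assumptions made for illustration.

```python
def connected_subnetworks(adj, k):
    """Return all connected node sets with at most k nodes.

    adj: dict mapping each node to an iterable of its neighbors
         (illustrative assumption: edges are treated as undirected).
    """
    # Size-1 subnetworks are the individual nodes.
    level = {frozenset([v]) for v in adj}
    found = set(level)
    # Grow every set by one adjacent node; the set type removes duplicates.
    for _ in range(k - 1):
        level = {s | {u} for s in level for v in s for u in adj[v] if u not in s}
        found |= level
    return found


# Toy 4-node path a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
count_k2 = len(connected_subnetworks(path, 2))
count_k3 = len(connected_subnetworks(path, 3))
```

On this sparse path the counts stay small; on denser networks the number of sets grows much faster with k, which is the cost the paragraph above refers to.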
In this paper, we develop a framework that significantly improves the scale of the networks that can be aligned using existing algorithms. Our framework has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we develop a compression method that reduces the size of the input metabolic networks by a desired rate. In other words, we transform the query networks from their original domains (see Figure 1(a)) to a compressed domain (see Figure 1(d)). A single node in the compressed domain corresponds to a set of connected nodes and the edges between them in the original domain. We call each such node in the compressed network a supernode. For instance, Figure 1(d) depicts the compressed networks of the two input networks in Figure 1(a) when each supernode is allowed to contain up to two nodes (i.e., only one level of compression is allowed). In the second phase, we carry out the alignment in the compressed domain by using an existing network alignment algorithm, which is SubMAP in this paper, as our base method. Once the compressed networks are aligned, we next consider each mapping of supernodes found in the alignment phase individually. Each such mapping suggests a smaller instance of network alignment. Figure 1(f) demonstrates this, where two such instances exist. For each of these mappings, we solve the alignment problem using the base algorithm. At the end of this refinement phase, the final mappings of reactions are extracted (see Figure 1(g)), transforming the alignment back to the original domain.
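The control flow of the three phases can be sketched as follows. This is only an outline under stated assumptions, not the authors' C++ implementation: `compress` and `base_align` are hypothetical callables supplied by the caller (the paper uses its MDS compression and SubMAP), a network is represented as a `(nodes, edges)` pair, only a single compression level is shown, and the toy stand-ins at the bottom exist purely to make the sketch runnable.

```python
def induced(net, members):
    """Subnetwork of net = (nodes, edges) induced by the given node set."""
    keep = set(members)
    return keep, {(u, v) for (u, v) in net[1] if u in keep and v in keep}


def align_with_compression(net1, net2, compress, base_align):
    # Phase 1 (compression): each call returns the compressed network and a
    # dict mapping every supernode to the original nodes it contains.
    comp1, members1 = compress(net1)
    comp2, members2 = compress(net2)
    # Phase 2 (alignment): run the base method in the compressed domain.
    supernode_pairs = base_align(comp1, comp2)
    # Phase 3 (refinement): each supernode pair is a smaller alignment
    # instance that is solved with the same base method.
    mappings = []
    for s1, s2 in supernode_pairs:
        mappings.extend(base_align(induced(net1, members1[s1]),
                                   induced(net2, members2[s2])))
    return mappings


# Toy stand-ins (hypothetical, for demonstration only).
def toy_compress(net):
    nodes = sorted(net[0])
    groups = [tuple(nodes[i:i + 2]) for i in range(0, len(nodes), 2)]
    owner = {v: g for g in groups for v in g}
    edges = {(owner[u], owner[v]) for (u, v) in net[1] if owner[u] != owner[v]}
    return (set(groups), edges), {g: set(g) for g in groups}


def toy_align(n1, n2):
    # Pairs nodes by sorted order; a real aligner would score similarity.
    return list(zip(sorted(n1[0]), sorted(n2[0])))


net_a = ({"a", "b", "c", "d"}, {("a", "b"), ("b", "c"), ("c", "d")})
net_b = ({"w", "x", "y", "z"}, {("w", "x"), ("x", "y"), ("y", "z")})
result = align_with_compression(net_a, net_b, toy_compress, toy_align)
```

Passing the base aligner in as a parameter mirrors the paper's claim that the framework is generic with respect to the alignment method.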
We can best motivate the need for such a framework with an example. Figure 1 illustrates the difference between aligning two metabolic networks in the compressed domain versus aligning them in the original domain without compression. If we use a base alignment algorithm such as SubMAP or IsoRank, the time and space complexity of the algorithm is determined by the size of a data structure named the support matrix [10,21]. Conceptually, this data structure governs the topological similarities between every pair of reaction tuples. Each reaction tuple contains one reaction from each of the two query metabolic networks. A detailed description of this matrix can be found in previous articles describing IsoRank [21] and SubMAP [10]. The size of this support matrix is quadratic in terms of both n and m, the numbers of reactions in the two query networks (i.e., $O(n^2 m^2)$), for IsoRank and for SubMAP when only subnetworks of size one are allowed. Figures 1(b) and 1(e) illustrate the support matrices required for alignment starting from the networks shown in Figures 1(a) and 1(d) respectively. As a result of compression by only one level, the size of the matrix we need to create drops to 6 × 6 from 20 × 20 (36 entries instead of 400), which translates into more than an order of magnitude improvement in theoretical resource utilization compared to the base method.

Notice that when we compress the network more (i.e., increase the number of compression levels), the compressed network gets smaller in terms of its number of nodes and edges. As a result, we can expect to align the compressed networks faster. However, this comes at the price of two drawbacks, both due to the fact that each supernode contains multiple nodes from the original domain. First, once we find a mapping for the supernodes in the compressed domain, we still need to align the nodes of each supernode pair. For example, after mapping the supernodes (a, b) and (a′, b′) shown in Figure 1(f), we need to align the two subnetworks induced by these two supernodes. Thus, as the size of the supernodes grows (i.e., as we compress for more levels), the size of the smaller problem instances grows as well, and the resource utilization bottleneck shifts from the alignment phase to the refinement phase. Second, when we use compression the
resulting alignment may not be the same as the one found by the original algorithm. For example, one out of four mappings in Figure 1(g) (i.e., e-c′) is different from the results of the base algorithm shown in Figure 1(c) (i.e., e-e′). This brings the need to define a measure of consistency between the results of alignments with and without compression, which can be used as an indicator of accuracy for the framework we propose here. We calculate this accuracy as the correlation of the scores calculated for each possible mapping found by our framework in the compressed domain with the scores for these mappings in the original domain found by the base method. Bigger compression rates generally mean less similarity between the results of the two methods (i.e., less accuracy).

Several key questions follow from these observations:

1. How does compression affect the alignment accuracy with respect to the base network alignment method?
2. How far is our compression method from an optimal compression that produces the compressed network with the minimum number of nodes?
3. When is it a good idea to do the alignment in the compressed domain, taking into account the overhead of the compression and refinement phases?
4. What is the right amount of compression? That is, when does compression minimize the running time of our overall framework?

In the rest of the paper we address each of these questions in detail. At this point, it is important to notice the potential for leveraging the alignment of larger scale

Figure 1 Aligning two metabolic networks with and without compression.
Top figures (a-c) illustrate the steps of alignment without compression. Bottom figures (d-g) demonstrate different phases of alignment with compression using our framework. (a) Two hypothetical metabolic networks with 5 and 4 reactions respectively. Directed edges represent the neighborhood relations between the reactions. (b) Support matrix of size 20 × 20 needed for the alignment if compression is not used. We only show the nonzero entries of a single row that corresponds to the topological support given by the b-b′ mapping to possible mappings of its backward and forward neighbors. Five such mappings, supported equally, are denoted by 1/5s in the matrix, namely the a-a′ mapping for the backward neighbors and the c-c′, c-d′, d-c′ and d-d′ mappings for the forward neighbors. (c) The resulting reaction mappings of alignment without compression. (d) Query networks shown in (a) in the compressed domain after one level of compression. (e) Support matrix of size 6 × 6 needed for the alignment with compression. We only show the entries for the mappings supported by the (a, b)-(a′, b′) mapping. (f) The resulting mappings from the alignment in the compressed domain. (g) The resulting reaction mappings after the refinement phase of our framework.
networks by the framework we are proposing. The actual performance gain for an alignment will depend on the level of compression we use, the topologies of the query networks and the complexity of the base alignment method.

Results overview

Our experiments on metabolic networks extracted from the KEGG pathway database [26] demonstrate that our compression method reduces the number of nodes and edges by almost half at each level of compression. As a result of this reduction, we observe a significant improvement in the running time and memory utilization of our earlier alignment algorithm SubMAP. Lastly, we analyze the accuracy of our framework as compared to the base alignment algorithm. The results suggest that the alignment obtained by only one level of compression captures the original alignment results with very high accuracy, and that the accuracy decreases with further levels of compression.

Technical contributions

• We devise an efficient framework for the network alignment problem that employs a scalable compression method which shrinks the given networks while respecting their topology.
• We prove the optimality of our compression method under certain conditions and provide a bound on how much our compression results can deviate from the optimal solution in the worst case.
• We provide a mathematical formulation that serves as a guideline to select an optimal number of compression levels depending on the input characteristics of the alignment.
• We characterize the cases for which the proposed framework is expected to provide significant improvement in alignment performance.

In the next section, we report our experimental results on a set of large-scale metabolic networks that are constructed by combining networks from the KEGG Pathway database [26]. The details of the network compression method we propose here and the other phases of our framework are described in the methods section.

Results and discussion

In this section, we experimentally evaluate the performance of our framework. First, we measure the compression rates achieved for different levels of compression with the minimum degree selection (MDS) method that we propose here.
Next, we analyze the changes in degree distribution and large-scale organization of organism-wide metabolic networks with increasing compression levels. We then examine the gain in running time and memory utilization achieved by our framework for different values of the compression level (c) and subnetwork size (k) parameters. Last, we examine the accuracy of the alignments we found by measuring the accuracy as Pearson's correlation coefficient between the scores of mappings calculated by our framework and the ones calculated by the base algorithm we use.

Dataset

We use the metabolic networks from the KEGG pathway database [26]. For our medium-scale dataset, we downloaded all metabolic networks with at least 10 reactions for 10 different organisms. This resulted in 620 metabolic networks in total with sizes ranging from 10 to 97.

In order to obtain our large-scale dataset, we first combined all the metabolic networks that belong to one of the 9 different metabolism categories in the KEGG database to create a complete metabolism network for each metabolism for 10 selected organisms (Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit fly), Arabidopsis thaliana (thale cress), Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (budding yeast), Staphylococcus aureus COL (MRSA), Escherichia coli K-12 MG1655, Pseudomonas aeruginosa PAO1). We obtain the organism-wide metabolic networks by combining all the listed networks in KEGG for each of these organisms. In total, we have 100 networks with sizes ranging from 5 to 1615 (9 complete metabolism networks plus 1 organism-wide network for each of the 10 organisms). Below is the list of metabolism categories we use.
1. Carbohydrate Metabolism
2. Energy Metabolism
3. Lipid Metabolism
4. Nucleotide Metabolism
5. Amino Acid Metabolism
6. Metabolism of Other Amino Acids
7. Glycan Biosynthesis and Metabolism
8. Metabolism of Cofactors and Vitamins
9. All Amino Acids (Amino Acid + Other Amino Acids)

Implementation and system details

We implemented our compression and alignment algorithms in C++. We ran all the experiments on a desktop computer running Red Hat Enterprise Client 5.7 with 4 GB of RAM and two dual-core 2.40 GHz processors.

Evaluation of compression rates

The efficiency of our alignment framework depends on how much the query metabolic networks can be compressed. For this reason, in this experiment, we measure the number of nodes and edges of the metabolic networks in our large-scale dataset before and after compression.

The minimum degree selection (MDS) method we describe in this paper compresses the query metabolic networks by selecting the first node among the list of nodes with minimum degree at each intermediate step and by compressing it with one of its neighbors. In order to evaluate the stability of this compression method, we examined the effect of the node selection strategy on the size of the resulting compressed networks. By randomizing the step at which we select a node among the set of minimum degree nodes, we generated 100 different compressed networks for each of the input metabolic networks. In the following, we examine how much compression we achieve by the MDS method and also analyze its stability with respect to compressions achieved by randomization of the node selection step.
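The selection rule described above can be sketched in Python as one compression level. This is an illustrative reading of the text, not the authors' C++ code: the function name `mds_compress_once`, the use of list order for "first node" and "first neighbor", and the rule that two compressed nodes are connected whenever any of their member reactions are connected are assumptions.

```python
def mds_compress_once(nodes, edges):
    """Apply one level of minimum-degree-selection compression.

    nodes: list of hashable node ids (list order breaks ties).
    edges: iterable of directed (u, v) pairs over those nodes.
    Returns (compressed_nodes, compressed_edges); every compressed node is a
    tuple holding one or two original nodes.
    """
    edges = {(u, v) for (u, v) in edges if u != v}
    alive = set(nodes)   # nodes not yet merged at this level
    owner = {}           # original node -> tuple naming its compressed node

    def degree(v):
        # In-degree plus out-degree, counting only edges whose other end
        # is still uncompressed.
        return sum(1 for (a, b) in edges
                   if (a == v and b in alive) or (b == v and a in alive))

    while True:
        candidates = [(degree(v), i, v) for i, v in enumerate(nodes) if v in alive]
        candidates = [c for c in candidates if c[0] > 0]
        if not candidates:
            break
        # Minimum positive degree; ties resolved by position in the list.
        _, _, va = min(candidates)
        # First uncompressed neighbor of va in list order.
        vb = next(u for u in nodes
                  if u in alive and u != va and ((va, u) in edges or (u, va) in edges))
        alive -= {va, vb}
        owner[va] = owner[vb] = (va, vb)

    for v in nodes:
        owner.setdefault(v, (v,))   # untouched nodes stay as singletons
    compressed_nodes = list(dict.fromkeys(owner[v] for v in nodes))
    # Assumed edge rule: keep an edge between two compressed nodes if any
    # pair of their members was connected.
    compressed_edges = {(owner[u], owner[v]) for (u, v) in edges
                        if owner[u] != owner[v]}
    return compressed_nodes, compressed_edges


path_nodes, path_edges = mds_compress_once(
    ["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
star_nodes, star_edges = mds_compress_once(
    ["h", "x", "y", "z"], [("h", "x"), ("h", "y"), ("h", "z")])
```

The 4-node path halves to two supernodes, while the star only shrinks from four nodes to three because merging a leaf with the hub leaves the other leaves without an uncompressed neighbor. Replacing the list-order tie-break with a random choice among the minimum-degree nodes would give the randomized variant used in the stability experiment.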
Table 1 summarizes the compression rates achieved by our method for networks of different sizes. We divide all the metabolic networks in our dataset into bins according to the number of their reactions (i.e., network size). The first column in Table 1 lists the network size intervals we used for each group. Notice that the gaps in the size intervals are due to the fact that organism-wide networks are of size 850 and larger, whereas the other combined networks for nine different metabolism categories have sizes below 400. Each row of this table shows the number of nodes and edges averaged over all the networks in this group before and after compression. The two columns with c = 0 correspond to the average number of nodes and edges of the networks with no compression respectively. For $c \in \{1, 2, 3\}$, we split each row corresponding to an interval into two. The upper part denotes the average node and edge numbers for the compressed network if the MDS method is used as originally described (i.e., the first among the list of minimum degree nodes is selected and combined with its first neighbor at each compression step). The lower part in bold represents the numbers gathered when we introduce randomization in this node selection. Each value in bold in Table 1 denotes the average of the corresponding value over these 100 different runs of compression.

One conclusion that can be drawn from Table 1 is that, independent of the network size, our compression method performs well in practice. On average, each level of compression yields network sizes that are 57-64%, 64-71% and 77-80% of the network sizes at the previous compression level for c = 1, 2 and 3 respectively. In other words, our method compresses the entire dataset down to approximately 60%, 40% and 30% of the sizes of the original networks for c = 1, 2 and 3 respectively. These rates suggest that our framework has great potential in scaling network alignment to large metabolic networks by compression. As an example, consider the row corresponding to the interval [850,1250] in Table 1.
We see that instead of aligning networks with 1080 nodes and 3727 edges on average, we can apply two levels of compression first and do the alignment with significantly smaller networks that have only 407 nodes and 1733 edges on average.

Table 1 Summary of compression rates for all the networks in our large-scale dataset

Network size    Average number of nodes                              Average number of edges
interval        c=0      c=1            c=2            c=3            c=0      c=1              c=2              c=3
[0,100)         41.5     26.5 / 26.5    19.1 / 19.1    15 / 14.8      83.5     55.2 / 55.5      36.3 / 36.5      23.6 / 23.5
[100,200)       154.8    92.4 / 92.2    61.3 / 61.5    48.6 / 48.6    310.1    174.9 / 174      116.5 / 118.1    96.3 / 94.6
[200,300)       240.5    139.1 / 139.4  89.2 / 89.1    69.4 / 69.7    508.1    296.5 / 298.4    230.5 / 228.4    187.8 / 188.1
[300,400]       344.9    207.3 / 207.6  133.1 / 133.8  103 / 104.5    585.7    372.9 / 373.5    302.7 / 300.4    261.6 / 259.9
[850,1250]      1080.5   623.2 / 623.7  406.8 / 407.9  311.3 / 311.9  3727     2269 / 2280.6    1732.7 / 1733.8  1584.8 / 1587.5
[1500,1615]     1576.5   909 / 910      582 / 583      447.8 / 444.6  4740     2955.2 / 2964.3  2283.5 / 2279.3  2128.8 / 2129.6

We create six intervals according to the number of reactions in these networks. Each row, corresponding to one such interval, shows the average number of nodes and edges before compression (i.e., c = 0) and after compression of different levels (i.e., $c \in \{1, 2, 3\}$). For each row, the first entry of each pair (the top entry in the original layout) corresponds to numbers obtained with the MDS method, which selects the first node from the list of nodes with minimum degree at each intermediate step and compresses it with its first neighbor from the list of its neighbors. The second entry of each pair (the bottom entry, in bold in the original layout) corresponds to the averages of 100 different compressions which are gathered by randomizing the step at which a node is selected among the set of minimum degree nodes.

Another observation is that we get most of the reduction in network size after the first compression level. That is, our method compresses the networks aggressively for c = 1 and achieves a 57% to 64% compression rate, which is close to half of the size of the networks. As we go up in the levels of compression, the actual rate of compression achieved at one level reduces. Considering the fact that having an input network which can lead to the best possible compression
(i.e., reducing its size from n down to size n/2 (i.e., 50%) at each level of compression) is a rare event, the observed compression rates suggest that our method provides an efficient compression for metabolic networks in practice.

This experimental setup also suggests that the MDS method is stable with respect to the choice of the node to compress, as long as that node is selected among the nodes with minimum degree. Among the six rows and three columns (18 entries) of Table 1 for the average number of nodes after the compression, only one has a difference larger than two between the size obtained with the original method and the randomized average.

The results of this experiment suggest that our compression method, MDS, serves as an efficient and stable first phase for our alignment framework by achieving good compression rates on a large dataset of metabolic networks.

Changes in degree distributions with compression

Even though the compression rates we achieve with MDS as described above suggest a significant reduction in the problem size, we observe that there is a noticeable difference between the compression rates achieved by going from one compression level to the next. For instance, on average the networks shrink to 57% to 64% of their size going from c = 0 to c = 1, whereas they only shrink to 76% to 80% of their size going from c = 2 to c = 3. This suggests that the large-scale organization of the networks changes with increasing levels of compression. Even though a change in the network structure can be expected as a result of our compression, it is not obvious how to quantify this change and whether the change is consistent among different metabolic networks.

In order to understand the reason behind different compression rates for different compression levels, we examined the degree distributions of the ten organism-wide networks we have in our dataset. For each of these networks, we plotted the histogram of out-degree distributions for different levels of compression. Figure 2 plots the frequencies of each out-degree in the range [2,40] for each $c \in \{0, 1, 2, 3, 4\}$ for these networks. We observe that for each of these plots the degree distributions for c = 0 and c = 1 are very similar and they follow a power-law distribution, which is an indicator of scale-free network topology.
This is not surprising, since the scale-free topology has been observed in numerous articles in the literature as a common signature for different metabolic networks [27-29]. The similarity between the degree distributions of the original networks (c = 0) and the networks compressed by only one level (c = 1) signifies that the networks still conserve their scale-freeness after the first level of compression.

A more interesting observation is that there is a consistent shift from the power-law degree distribution to a uniform distribution with increasing c values for each of the ten networks we have. It is important to clarify that our claim is not that the degree distribution becomes uniform for large c values, but rather that the degree distributions for large c values are more similar to a uniform distribution (and less similar to a power-law distribution) compared to the ones obtained with smaller c values. To quantify this on an example, we look at one of the most discernible characteristics of scale-free networks, hence of the power-law distribution, which is the small number of hub nodes with large degrees. If we consider the organism-wide network of Homo sapiens (Figure 2(e)), which is the largest network in our dataset, and focus on the percentage of nodes with out-degree greater than 15, we get percentages of 3%, 4%, 6.5%, 11.5% and 12.4% for c values of 0, 1, 2, 3 and 4 respectively. This indicates that the number of nodes that can be considered as hubs increases significantly with increasing levels of compression. This increase deteriorates the scale-freeness of the Homo sapiens network, which in turn decreases the achieved compression rates. A similar trend is observed for each of the other nine organism-wide networks, which are plotted separately in Figure 2.
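The hub statistic used above (the share of nodes whose out-degree exceeds 15) is simple to compute from an edge list. The sketch below is a minimal illustration; the function names, the edge-list representation and the toy network are assumptions, and the toy numbers are not taken from the paper's data.

```python
from collections import Counter


def out_degree_histogram(nodes, edges):
    """Map each out-degree value to the number of nodes having it."""
    out_deg = Counter(u for (u, _) in set(edges))
    return Counter(out_deg[v] for v in nodes)


def hub_fraction(nodes, edges, threshold=15):
    """Fraction of nodes whose out-degree is strictly greater than threshold."""
    out_deg = Counter(u for (u, _) in set(edges))
    return sum(1 for v in nodes if out_deg[v] > threshold) / len(nodes)


# Toy network: node 0 points to nodes 1..16, the remaining nodes have no
# outgoing edges.
toy_nodes = list(range(20))
toy_edges = [(0, j) for j in range(1, 17)]
toy_hubs = hub_fraction(toy_nodes, toy_edges)
toy_hist = out_degree_histogram(toy_nodes, toy_edges)
```

Applying `hub_fraction` to the same network at successive compression levels would reproduce the kind of trend reported for Homo sapiens, where the share rises as hubs absorb their neighbors.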
The results of this experiment show that there is a consistent change in the network topology when multiple levels of compression are used. The difference we observe here between the first level of compression and later levels is likely to be one of the main reasons for the significant differences in both the performance and the accuracy of our framework, which are discussed in the remainder of the results section.

Evaluation of running time and memory utilization

In order to understand the capabilities and limitations of our framework, we examine its performance in terms of its running time and memory utilization on a set of large-scale networks we constructed as described in the dataset section. We have ten networks for each of the ten organisms in our dataset. For each organism, nine of these networks constitute different metabolism categories and the tenth network is the organism-wide metabolic network. In total, we have 100 networks with sizes ranging from 5 to 1615. For each parameter setting (different combinations of $k \in \{1, 2\}$ and $c \in \{0, 1, 2, 3\}$), we aligned each of these 100 networks with each other network (including itself), resulting in a total of 5500 alignment queries. When the value of c is equal to zero, the alignment is carried out completely by a single application of SubMAP without any compression. This provides us a mechanism to measure how much performance gain is achieved by our compression-based framework with respect to SubMAP.

Figure 3(a) illustrates the average query running times in a log-log plot where the x-axis is the size of the query measured as the product of the number of reactions of the metabolic networks that are aligned. We grouped

Figure 2 Shift of out-degree distributions from power-law to uniform.
Changes in the out-degree distributions of ten organism-wide metabolic networks with increasing levels of compression. We calculate the frequencies of each out-degree in the range [2,40] for $c \in \{0, 1, 2, 3, 4\}$ and plot them together for each of the ten organisms in our dataset. Out-degree distributions for organism-wide metabolic networks of (a) Arabidopsis thaliana (thale cress), (b) Caenorhabditis elegans (nematode), (c) Drosophila melanogaster (fruit fly), (d) Escherichia coli K-12 MG1655, (e) Homo sapiens (human), (f) Mus musculus (mouse), (g) Pseudomonas aeruginosa PAO1, (h) Rattus norvegicus (rat), (i) Staphylococcus aureus COL (MRSA), (j) Saccharomyces cerevisiae (budding yeast).

queries into logarithmic bins according to the query sizes. The first bin contains all the queries of size less than or equal to 64. The next bins contain the queries of size in the interval $[2^{i+5}, 2^{i+6}]$ where i = 2, 3, ..., 17. For each parameter setting we display the average running time of all the queries in each bin. For both k = 1 and k = 2, we plot the results for all four different compression values and also draw the fitting curves to better illustrate the trend in the increase of running time.

For k = 1, we can immediately observe that each additional compression level improves the running time over the previous one for all query sizes. We obtain the largest fold change in running time with the first level of compression. This is expected considering that the first level of compression achieved the largest compression rate, as shown in Table 1. The second compression level improves the running time by a smaller factor compared to the first and by a larger factor compared to the third level. For k = 1 we were able to plot all the points for all c values, as the running time for even the largest query (i.e., the human organism-wide network versus itself, which has size 1615 × 1615) with no compression (i.e., c = 0) is still practical, around 12 minutes (with c = 3 this drops to less than 40 seconds).
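The logarithmic binning of queries described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes contiguous power-of-two bins that start right after 64 and are closed on the right, which may not match the exact bin indexing used for Figure 3.

```python
from collections import defaultdict


def query_bin(n_reactions_a, n_reactions_b):
    """Return the (low, high] size bin of a query, where query size is the
    product of the reaction counts of the two aligned networks."""
    size = n_reactions_a * n_reactions_b
    if size <= 64:
        return (0, 64)
    high = 1 << (size - 1).bit_length()   # smallest power of two >= size
    return (high // 2, high)


def mean_time_per_bin(queries):
    """queries: iterable of (n_reactions_a, n_reactions_b, seconds) tuples."""
    buckets = defaultdict(list)
    for n_a, n_b, seconds in queries:
        buckets[query_bin(n_a, n_b)].append(seconds)
    return {b: sum(ts) / len(ts) for b, ts in sorted(buckets.items())}


# Made-up timings, used only to exercise the functions.
averages = mean_time_per_bin([(5, 5, 1.0), (8, 8, 3.0), (100, 100, 10.0)])
largest_bin = query_bin(1615, 1615)
```

Under these assumptions the largest query in the dataset (1615 × 1615) falls into the bin between 2^21 and 2^22.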
A similar trend of improved running times with increasing c is also observed for queries up to a certain size for k = 2. For only one level of compression (c = 1) we observe a significant improvement in running times for queries of all different sizes. However, starting from the bin $[2^{13}, 2^{14}]$, compressing the networks by more than one level ($c > 1$) shows a consistent adverse effect on the running time. This implies that when both query networks have sizes around 150 or larger and $k > 1$ is used, the idea of compressing the networks by more than one level and then performing the alignment suffers from the explosion in the number of possible subnetworks in the compressed domain with size at most k. We explore this in more detail later on in the paper (see Figure 4 and its discussion).

An important aspect of our framework is that it makes it possible to align networks that could not be aligned with our base method. For k = 2, we observed that in the original domain (c = 0) a significant portion of the large queries did not finish in less than the cutoff time, which we set to one hour. For instance, among 252 possible queries with sizes in the interval $[2^{17}, 2^{18}]$, 96 did not complete successfully for c = 0, whereas with c = 1 all of them were completed. For the next bin, 45 out of 223 possible queries were completed for c = 0, and for c = 1 this number increased to 185. These results indicate that by using the correct amount of compression, we can align larger networks than the base alignment method SubMAP. We believe this is an important step in leveraging organism-wide network alignments with subnetwork mappings, for they provide a more complete picture of functional similarities and evolutionary differences between the metabolic networks of two or more organisms.

Figure 3(b) presents results for the estimated memory required for the support matrix, which is the memory bottleneck of the algorithm, that is needed to perform

Figure 3 Resource utilization of our framework. The average (a) running time and (b) memory utilization of our framework when each query network in our large-scale dataset is aligned with all the networks (including itself) in the same dataset. The x-axis is the query size, which is calculated as the product of the sizes (i.e., number of reactions) of the metabolic networks aligned.
c = 0 denotes the alignments performed with no compression. $c \in \{1, 2, 3\}$ denotes the results of our framework, which compresses both of the query networks by c levels before aligning them.

the alignment. For this figure, we use the same query set as Figure 3(a), hence the same x-axis. On average, the memory required for alignment with c = 1 is around 30% of that needed for alignment with no compression using the SubMAP method for both k = 1 and k = 2. For k = 1, the memory utilization decreases with each additional compression level (on average around 45% of the memory required for c = 1 is used when c is increased to 2, and around 65% of the memory required for c = 2 is used when c is increased to 3). For k = 2, in concordance with the running time results, only one level of compression provides better memory utilization for all network sizes, whereas compressing by more than one level has an adverse effect for medium- and large-scale queries.

These results suggest that our framework has great potential to provide significant improvement in both the running time and the memory utilization of the base alignment method. This allows us to align large networks that could not be aligned by existing methods while utilizing the same hardware.

Accuracy of the alignment results

We conclude our experimental results by answering the first question introduced earlier in the paper, that is, "How does compression affect the alignment accuracy?" In order to answer this, we calculate the correlation between the scores of each possible mapping in the compressed domain and the scores that we obtain for these mappings from the original SubMAP method. We consider the scores of each possible subnetwork mapping of compressed nodes found by our framework. Since the mappings found by SubMAP are not of the same form as the mappings in the compressed domain, we calculate

Figure 4 Gain/Loss in running time.
Gain/Loss in running time of alignment by using our framework with respect to the base alignment method (x-axis) versus the ratio of the number of all possible subnetwork mappings in the compressed domain to this number in the original domain. The blue vertical line shows when the two methods take exactly the same amount of time or when both methods take a very short amount of time in the case of small query networks. Points on the right (left) hand side of this line mean gain (loss) in the running time. The dashed line is our decision criterion for predicting whether there will be gain or loss before doing the alignment.

a score value for each mapping in the compressed domain by using the scores of the mappings found by SubMAP in the original domain. This way, we get two sets of score values, one from SubMAP and one from our framework, for the same set of mappings. We calculate Pearson's correlation coefficient between these two sets of scores as an indicator of the similarity between the results of the two methods.

Before looking at the correlation values we found, it is important to describe how we calculate the score for a mapping in the compressed domain from the mappings of SubMAP. Let $P^1$ and $P'^1$ denote the one-level compressed forms of two metabolic networks. Let $(v_1, \{v'_1, v'_2\})$ denote a mapping in the compressed domain, where $v_1$ is a subnetwork of $P^1$ and $\{v'_1, v'_2\}$ is a subnetwork of $P'^1$. Also, let $v_1 = \{r_1, r_2\}$, $v'_1 = \{r'_1, r'_2\}$ and $v'_2 = \{r'_3\}$. We know the edge that maps these two subnetworks has a mapping score in the compressed domain; let us denote it by $e^1$ for c = 1. We want to compute a mapping score, say $\bar{e}$, for $(v_1, \{v'_1, v'_2\})$ from the mappings in the original domain that is comparable to $e^1$. This subnetwork mapping in the compressed domain contains six possible mappings in the original domain, namely $(r_1, r'_1)$, $(r_1, r'_2)$, $(r_1, r'_3)$, $(r_2, r'_1)$, $(r_2, r'_2)$ and $(r_2, r'_3)$. Let us denote the scores of these mappings in the original domain by $e_i$ for i = 1, 2, ..., 6 according to their ordering. Then, we compute the mapping score $\bar{e}$ as $\bar{e} = \frac{1}{6} \sum_{i=1}^{6} e_i$. It is important to note that this score is a conservative choice among other possible scoring
options. This is because the average can include mapping scores of subnetworks with very low similarities from the original domain of SubMAP. This can underestimate the correct mapping score $\bar{e}$ and hence degrade the correlation of compressed domain and original domain mapping scores. Overall, for each mapping in the compressed domain with a score $e^c$, we calculate the corresponding score $\bar{e}$ in the original domain using this average score.

Table 2 summarizes the correlation values found from a set of 3600 alignments (400 alignments for each parameter combination of $k \in \{1, 2, 3\}$ and $c \in \{1, 2, 3\}$). We calculate the correlation of each query with the alignment that has the same k value but is in the original domain (i.e., c = 0). Table 2 shows the average correlation values of these 400 alignments for each combination of k and c values. The first column indicates that the alignment found by using only one compression level is highly similar to the alignment found by directly using the base method. Combining this with the running time gain in Figure 3(a) for c = 1, we can strongly argue that compression by one level not only provides significant improvement in running time but also accurately captures a very high percentage of the original alignment results, which makes it very useful for practical purposes. The accuracy measured in terms of correlation drops to 0.57 on average when we perform the second level of compression and to 0.51 for the third level.
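The averaged original-domain score and the correlation used as the accuracy measure can be sketched as below. The function names, the dictionary of original-domain scores and the numeric values are made up for illustration; only the averaging rule and the use of Pearson's correlation coefficient come from the text.

```python
from itertools import product
from math import sqrt


def averaged_score(left_reactions, right_reactions, original_score):
    """Mean original-domain score over all reaction pairs covered by one
    compressed-domain mapping (the conservative score described above)."""
    pairs = list(product(left_reactions, right_reactions))
    return sum(original_score[p] for p in pairs) / len(pairs)


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long score lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical original-domain scores for the six pairs in the example with
# v1 = {r1, r2} mapped to {r1', r2', r3'}.
scores = {("r1", "r1'"): 0.6, ("r1", "r2'"): 0.3, ("r1", "r3'"): 0.0,
          ("r2", "r1'"): 0.9, ("r2", "r2'"): 0.6, ("r2", "r3'"): 0.0}
e_bar = averaged_score(["r1", "r2"], ["r1'", "r2'", "r3'"], scores)
perfect = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In the experiment, one list passed to `pearson` would hold the compressed-domain scores and the other the corresponding averaged scores, one entry per compressed-domain mapping.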
These results suggest that we can almost always use one level of compression to benefit from a high performance gain without losing much accuracy in terms of the alignment results. For c = 2 and c = 3, even though the accuracy of the results is significantly better than random, such compression levels should be used with caution if the accuracy of the alignment is the main concern.

Conclusions

In this paper, we considered the problem of aligning two metabolic networks, particularly when both of them are too large to be dealt with using existing methods. To solve this problem, we developed a framework that significantly increases the size of the metabolic networks that existing methods can align. Our framework is generic, as it can be used to improve the scalability of any existing network alignment method. It has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we developed an algorithm which transforms the given metabolic networks to a compressed domain where they are summarized using much fewer nodes, termed supernodes, and interactions. In the second phase, we carried out the alignment in the compressed domain using an existing method, SubMAP, as the base alignment algorithm. In the refinement phase, we considered each individual mapping of supernodes one by one. Each such mapping corresponds to a smaller instance of the network alignment problem. For each of these mappings, we solved the alignment problem using SubMAP as our base method. Our experiments on the metabolic networks extracted from the KEGG pathway database demonstrate that our compression method reduces the number of reactions by almost half at each level of compression. As a result of this compression, we observe that SubMAP coupled with our framework can align networks at least twice as large as its original version can with the same amount of resources. Our results also suggested that the alignment obtained by only one level of compression benefits from a significant performance gain while

Table 2 Correlation of the mapping scores found with and without compression

k/c    1       2       3
1      0.89    0.56    0.53
2      0.85    0.58    0.50
3      0.84    0.57    0.49

We calculate Pearson's correlation coefficient between the two sets of
scorevaluesonefromSubMAP(withoutcompression)onefromour framework(withcompression)andreportitasanindicatoroftheaccuracyof alignmentresultsofourframeworkfordifferentparametersettings.Ay etal BMCBioinformatics 2012, 13 (Suppl3):S2 http://www.biomedcentral.com/14712105/13/S3/S2 Page10of19 PAGE 11 capturingtheoriginalalignmentresultswithveryhigh accuracy.Webelievethatthispapertakesanimportant stepinscalingthemetabolicnetworkalignmentwithsubnetworkmappingstoorganismwidenetworks,andthus, canhavegreatimpactonmakingtheexistingnetwork alignmentmethodsmoreusefulfordomainscientists.MethodsInthissection,wedescribethemethodwedevelopto compressthequerynetworksandtheoverallframework foraligningnetworksinthiscompresseddomain.Before goingintodetail,itisimpo rtanttostatethatweare usingareactionbasedmodelforrepresentingmetabolic networksthroughoutthispaper.Formally,werepresent ametabolicnetworkwith P =( V E )where V istheset ofallreactionsofthenetworkand E isthesetofdirectededgesbetweenthem.Anedge eij E existsifand onlyifthereaction vi hasatleastoneoutputcompound whichisaninputforthereaction vj.Inthefollowing, wefirstdescribeourcompressionmethod.Weusethe shorthandnotation MDS (minimumdegreeselection)to refertothismethodintherestofthepaper.We,then, provetheoptimalityof MDS undercertainconditions andprovideanupperboundforthenumberofcompressionsthatcanbemissedbythismethodwith respecttotheoptimalcomp ression.Next,wegivea briefoverviewofthebasealignmentmethodthatwe useinthispaperandexplainindetailthetworemaining phasesofouralignmentframework.Weprovideour analysisonthecomputationalcomplexityoftheoverall methodandconcludethemethodssectionbyanswering twoquestionsrelatedtoperformancecharacteristicsof thismethod.Minimumdegreeselection( MDS)methodLet P =( V E )bethereactionbased representationofa metabolicnetworkand c denotetheuserspecifiedparameterforthedesiredlevelofcompression.For x =1,..., c wedenotethecompressedformof P after x compression levelswith Px=( Vx, Ex).Tosimplifyournotation,we 
assume that $P^0 = P$. We construct $P^x$ from $P^{x-1}$ for each $x = 1, \ldots, c$. Each $v \in V^x$ is either a node from $V^{x-1}$ or a supernode that contains two nodes of $V^{x-1}$. In summary, we construct $V^x$ from $V^{x-1}$ in a number of consecutive steps. At each step, we choose a pair of connected nodes in $V^{x-1}$ that are not compressed in earlier steps of the current compression level. We then merge this node pair into a supernode and add it to $V^x$. We repeat these steps until there is no such node pair in $V^{x-1}$. Assume that the number of such steps is $t$ for compression level $x$. We denote the state of the network after the $i$th step during the $x$th level of compression as $P_i^x = (V_i^x, E_i^x)$ (see Figure 5(b)). Note that $V_t^x = V^x$ and $V_i^x \subseteq V^{x-1} \cup V^x$ for each $i = 1, \ldots, t$, as the nodes of $V_i^x$ are either singleton nodes from $V^{x-1}$ or supernodes from $V^x$.

We are now ready to discuss how we compress $P^{x-1}$ to get $P^x$. We define the degree of a noncompressed node $v$ in a given network as $deg(v) = indeg(v) + outdeg(v)$, where $indeg(v)$ ($outdeg(v)$) denotes the number of incoming edges from (outgoing edges to) noncompressed nodes in the network. We say that two nodes in a network are neighbors if they are connected by at least one edge. We denote the set of neighbors of a node $v$ with $N(v)$. We start the compression by initializing $V_0^x = V^{x-1}$ and $E_0^x = E^{x-1}$. Then, while there exists a noncompressed node with degree greater than zero at the current state of the network, say $P_{i-1}^x$, we apply the next step, the $i$th step, of compression to obtain $P_i^x$ from $P_{i-1}^x$. Figure 5 depicts the states of an example network before (Figure 5(a)) and after (Figure 5(b)) the $i$th step of compression. We start the $i$th step by selecting a node with minimum positive degree among the nodes in $V_{i-1}^x$. If there is more than one such node, we select the first one among them. In our example in Figure 5(a), the node with minimum degree is unique and is shown by $v_a$. We use the term minimum degree as a shorthand for minimum positive degree to exclude singleton nodes. This way we ensure that $deg(v_a) > 0$ and $N(v_a)$ is nonempty. We select one such neighbor from $N(v_a)$, say $v_b$. The only node in $N(v_a)$ in Figure 5(a) is denoted with $v_b$. We then merge $v_a$ with $v_b$ to form the supernode $v_{ab} = \{v_a,$
$v_b\}$. Figure 5(b) illustrates this newly created node $v_{ab}$. This is the only compression to be done at the $i$th compression step. Next, we create the new node set as $V_i^x = (V_{i-1}^x \cup \{v_{ab}\}) \setminus \{v_a, v_b\}$. For creating the edge set $E_i^x$, we initialize it to $E_{i-1}^x$ and remove all the incoming and outgoing edges of $v_a$ and $v_b$ from it. Then, we insert an incoming edge to $v_{ab}$ from each node in $V_{i-1}^x \setminus \{v_a, v_b\}$ which has an outgoing edge to either $v_a$ or $v_b$ in the previous edge set $E_{i-1}^x$. We insert outgoing edges from $v_{ab}$ to other nodes in a similar manner. Figure 5 illustrates the changes in the edge set after creating $v_{ab}$. Notice that for each $i = 1, \ldots, t$, the set $V_i^x$ contains a mixture of nodes and supernodes. After each such step, the size of the network decreases by one and the number of edges of the new network decreases by at least one. For instance in Figure 5, the number of nodes dropped from five to four and the number of edges dropped from six to five. The compression of $P^{x-1}$ to get $P^x$ continues by applying another compression step until there are no more noncompressed nodes with positive degree.

The discussion above describes the intermediate compression steps of the MDS method to perform a single level of compression on a given network. Given a compression level $c$, for each level $x = 1, \ldots, c$, we apply the same compression steps on $P^{x-1} = (V^{x-1}, E^{x-1})$ by initially treating $P^{x-1}$ as a noncompressed network with no supernodes. As a result of this process, after finishing the $x$th level of compression, the actual number of reactions that each node of $V^x$ can contain is guaranteed to be in the interval $[1, 2^x]$. The limitation on the number of reactions in each node allows the MDS method to respect and largely preserve the initial topology of the query networks. This is very important for the alignment as it makes significant use of the network topologies. Additionally, the bound on the number of reactions in each supernode translates to a uniform compression for both networks, which limits the sizes of the smaller alignment problems we can encounter in the refinement phase. This allows us to keep under control the complexity and the running time of the refinement
phase of our alignment framework.

Optimality analysis for MDS
In the previous section, we described in detail the compression method (MDS) we use in our framework. Ideally, it is preferable to compress the given network as much as possible at each compression level. This is because smaller network size often implies smaller time and memory usage for the alignment. We say that a compression is optimal if the resulting compressed network contains the smallest number of nodes among all possible compressions with the restriction that each noncompressed node can be merged with at most one other noncompressed node at each compression level. We name the hypothetical optimal compression method that can achieve the best possible compression rate as OPT. In the rest of this section, we analyze the optimality of our MDS method under different conditions. We first consider each connected component of the input network that will be compressed separately and then integrate their results to generalize our analysis for networks with arbitrary topologies.

We start by introducing the notation we use in this section to handle networks with more than one connected component. Let $P$ be a metabolic network with $r$ connected components. We denote these components by $C_1 = (V_1, E_1)$, $C_2 = (V_2, E_2)$, ..., $C_r = (V_r, E_r)$, such that $P = (\bigcup_{j=1}^{r} V_j, \bigcup_{j=1}^{r} E_j)$. Let $C = (V, E)$ be an arbitrary component of $P$ and $*^x$ represent the compressed form of $C$ after $x$ levels of compression using either the MDS method or OPT that achieves the optimal compression.
We use $*$ (star) as a generic symbol to avoid introducing new symbols for each compressed component in places where only their sizes are of relevance. We use $MDS(C, *^x)$ and $OPT(C, *^x)$ to denote the total number of compression steps performed to transform $C$ into its compressed form after $x$ levels of compression by using the corresponding methods. Recall that each compression step reduces the network size by one. Thus, the bigger these values ($MDS(C, *^x)$ and $OPT(C, *^x)$), the better they are in terms of compression rate. The first and second arguments in this notation can be any state of a connected component or a network at any point during the compression. For instance, $OPT(C_i^x, *^x)$ denotes the number of compression steps taken by OPT starting from the $(i+1)$th intermediate step of the $x$th level until the $x$th level of compression is completed.

Figure 5 One compression step of the MDS method. Small circles represent reactions and big circles represent supernodes that result from earlier steps of compression. A solid arrow represents an edge between two noncompressed nodes in the current compression level. A dashed arrow denotes an edge between a supernode and another node in the network. While calculating the degrees of the noncompressed nodes, only the solid arrows are taken into account. (a) The state of network $P$ during compression level $x$ before the $i$th intermediate step (i.e., $P_{i-1}^x$). The node with the minimum degree is denoted with $v_a$ and its first neighbor is denoted with $v_b$. (b) The state of this network after the $i$th compression step (i.e., $P_i^x$). We denote the node that results from the compression at this step with $v_{ab}$.
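The compression steps described for MDS can be sketched as follows for a single compression level. This is a minimal illustration, not the authors' implementation: the function name `mds_compress_level`, the input representation (a node list plus a set of directed edge pairs) and the use of input order for "the first one" when breaking ties are assumptions. Degrees are recomputed naively at each step, so the sketch does not reach the running time discussed in the complexity analysis.

```python
def mds_compress_level(nodes, edges):
    """One level of minimum degree selection (MDS) compression.

    nodes: list of hashable node ids (their order is used to break ties).
    edges: iterable of directed (u, v) pairs.
    Returns (supernodes, new_edges): a list of frozensets and a set of
    directed pairs of frozensets.
    """
    order = {v: i for i, v in enumerate(nodes)}
    succ = {v: set() for v in order}
    pred = {v: set() for v in order}
    for u, v in edges:
        if u != v:  # self-loops do not contribute a neighbor
            succ[u].add(v)
            pred[v].add(u)

    active = set(order)  # nodes not yet compressed at this level

    def degree(v):
        # Only edges to or from noncompressed nodes are counted.
        return len(succ[v] & active) + len(pred[v] & active)

    merged = []
    while True:
        candidates = [v for v in active if degree(v) > 0]
        if not candidates:
            break
        # Minimum positive degree; ties go to the earliest node (assumption).
        va = min(candidates, key=lambda v: (degree(v), order[v]))
        # First noncompressed neighbor of va (assumption: input order).
        vb = min((succ[va] | pred[va]) & active, key=order.get)
        active -= {va, vb}
        merged.append(frozenset((va, vb)))

    # Nodes left without a partner become singleton nodes of the new level.
    singletons = [frozenset((v,)) for v in sorted(active, key=order.get)]
    supernodes = merged + singletons
    owner = {v: s for s in supernodes for v in s}
    # A supernode edge exists if any member edge crosses between the two.
    new_edges = {
        (owner[u], owner[v])
        for u in succ
        for v in succ[u]
        if owner[u] != owner[v]
    }
    return supernodes, new_edges


# Hypothetical four-reaction chain a -> b -> c -> d.
level1_nodes, level1_edges = mds_compress_level(
    ["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")]
)
```

Applying the function again to `level1_nodes` and `level1_edges` would give the second compression level, since each level treats its input as a noncompressed network.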
In the following, we first prove that the MDS method makes an optimal choice in terms of which two nodes to compress at each compression step if there exists a node with degree one in the current state for a given component. We then show that if no node with degree one exists, a compression step taken by MDS can increase the size of the compressed component by at most one as compared to the one found by OPT. Finally, by aggregating the results from each component, for a given metabolic network $P$ and a compression level $c$, we develop an upper bound on the size of the compressed networks obtained by MDS with respect to the size of the network that can be obtained by the optimal method.

Lemma 1 Let $C = (V, E)$ denote a connected component of a given metabolic network $P$. Let $C_i^x = (V_i^x, E_i^x)$ denote the state of $C$ after the $i$th step of the $x$th compression level. If there exists a node in $V_i^x$ with degree one, then the compression step taken by the MDS method to create the next state $C_{i+1}^x$ is optimal. Formally,

$OPT(C_i^x, *^x) = 1 + OPT(C_{i+1}^x, *^x)$ (1)

Proof 1 We prove (1) by contradiction in two parts:
Part 1. $OPT(C_i^x, *^x) \ge 1 + OPT(C_{i+1}^x, *^x)$
Part 2. $OPT(C_i^x, *^x) \le 1 + OPT(C_{i+1}^x, *^x)$
The first part (i.e., $\ge$) is trivial. The number of compression steps of OPT after performing one step of compression cannot be larger than the number before performing this step, otherwise the solution of $OPT(C_i^x, *^x)$ cannot be optimal. This leads to a contradiction, and hence proves Part 1. To prove the second part (i.e., $\le$
), it is important to recall how the MDS method progresses given the state $C_i^x$ at which there exists at least one node $v_a$ with $deg(v_a) = 1$. This method picks $v_a$. The node $v_a$ has exactly one noncompressed neighbor, say $v_b$. Thus, MDS merges them to create the supernode $v_{ab}$ (see Figure 5). We complete the proof by considering two cases. In the first case the OPT method merges $v_a$ and $v_b$ while compressing $C_i^x$. In this case, we can assume that OPT takes this step as its next step in compressing $C_i^x$, since a fixed compressed network can be obtained by arbitrarily shuffling the order of intermediate steps. Therefore, if $v_a$ and $v_b$ are compressed at any point in the optimal method, then the optimal solution for $C_{i+1}^x$, which is created by applying the MDS method on $C_i^x$, has exactly $OPT(C_i^x, *^x) - 1$ compressions. Hence, $OPT(C_i^x, *^x) = 1 + OPT(C_{i+1}^x, *^x)$ and therefore $OPT(C_i^x, *^x) \le 1 + OPT(C_{i+1}^x, *^x)$.

In the second case $v_a$ and $v_b$ are not merged together in the optimal solution. This case implies that $v_a$ is left as a singleton at the end of the $x$th level, as $deg(v_a) = 1$. Then, the network that results after removing $v_a$ and all the edges connected to it can have at most $OPT(C_i^x, *^x)$ compressions until the end of the $x$th level, since otherwise it contradicts the optimality of OPT. This shows that the number of compressions that can be achieved when $v_a$ is left as a singleton cannot be greater than one plus $OPT(C_{i+1}^x, *^x)$. Thus $OPT(C_i^x, *^x) \le 1 + OPT(C_{i+1}^x, *^x)$, and combining it with the first part (i.e., $\ge$) we get $OPT(C_i^x, *^x) = 1 + OPT(C_{i+1}^x, *^x)$.

Lemma 2 Let $C = (V, E)$ denote a connected component of a given metabolic network $P$. Let $C_i^x = (V_i^x, E_i^x)$ denote the state of $C$ after the $i$th step of the $x$th compression level.
If the node with minimum degree in $V_i^x$ has degree greater than one, then the compression step taken by MDS to create the next state $C_{i+1}^x$ can lead to a network that has size at most one larger than the compressed network that is obtained from the state $C_i^x$ by OPT. Formally,

$OPT(C_i^x, *^x) \le 2 + OPT(C_{i+1}^x, *^x)$ (2)

Proof 2 Let $v_a$ be the first node in the list of minimum degree nodes in $V_i^x$. From the assumption we know $deg(v_a) > 1$ and hence it has at least one noncompressed neighbor node $v_b$ that also has $deg(v_b) > 1$. Without loss of generality assume that the MDS method merges $v_a$ and $v_b$ to create the supernode $v_{ab}$ at the compression step from $C_i^x$ to $C_{i+1}^x$. This step can prevent at most one neighbor of $v_a$, say $v_c$, and at most one neighbor of $v_b$, say $v_d$, from being merged with the corresponding node in later steps. Notice that $v_c$ and $v_d$ are not necessarily distinct. The MDS algorithm can also merge $v_c$ and $v_d$ in the next steps if they are also neighbors, though we do not know it for sure at this point. This results in either one compression or two compressions using only the four nodes $v_a$, $v_b$, $v_c$ and $v_d$ by the MDS method. Next, we calculate the number of compression steps that the OPT method can take for compressing these four nodes. There are three cases to consider:

Case 1. The OPT method merges $v_a$ with $v_b$ at any point during the $x$th level of compression. This case is equivalent to merging $v_a$ with $v_b$ in the next step by MDS and then compressing the rest of the network by OPT. In other words, MDS already takes the optimal compression step. Hence, $OPT(C_i^x, *^x) = 1 + OPT(C_{i+1}^x, *^x) \le 2 + OPT(C_{i+1}^x, *^x)$.

Case 2. The OPT method merges $v_a$ with $v_c$ at any point during the $x$th level of compression.
The worst case scenario for the MDS method in this case is when $v_c$ is not connected to $v_d$ and the OPT method merges $v_b$ with $v_d$ in a later step. This way the OPT method optimally compresses four nodes down to two supernodes, namely $v_{ac}$ and $v_{bd}$. On the other hand, the MDS method creates a single supernode, $v_{ab}$, and the nodes $v_c$ and $v_d$ remain as singletons. However, even for this worst case, the MDS method prevents only one compression step from taking place with respect to OPT. Hence, $OPT(C_i^x, *^x) \le 2 + OPT(C_{i+1}^x, *^x)$.

Case 3. The OPT method merges $v_b$ with $v_d$ at any point during the $x$th level of compression. We can prove this similarly to Case 2 by symmetry.

Using Lemmas 1 and 2, Theorem 1 develops an upper bound on the number of compressions that can be missed by MDS with respect to the optimal compression.

Theorem 1 (OPTIMALITY BOUND FOR MDS) Let $P$ be a metabolic network with $r$ connected components $C_1 = (V_1, E_1)$, ..., $C_r = (V_r, E_r)$ such that $P = \bigcup_{j=1}^{r} C_j$, and $c$ be a positive integer given as the desired number of compression levels. Let $C = (V, E)$ denote an arbitrary connected component of $P$. Also, let $s$ represent the number of intermediate steps for which no noncompressed node with degree one is found during the compression from $P$ to $P^c$ by the MDS method. Then, each of the following statements holds:
1. $OPT(C^{x-1}, *^x) \le 2\,MDS(C^{x-1}, *^x)$ for $x = 1, \ldots, c$
2. $OPT(P, *^c) \le s + MDS(P, *^c)$
3. $OPT(P, *^c) \le \min\{2\,MDS(P, *^c),\ s + MDS(P, *^c)\}$.

Proof 3 1. This part follows from Lemmas 1 and 2.
Lemma 1 states the case when the MDS method is equivalent to OPT. Lemma 2 gives an upper bound on the number of compression steps that MDS can miss. The worst case is when the boundary condition of Lemma 2 holds for each step of the $x$th compression level for $C^{x-1}$. In this case, the number of steps taken by the OPT method while compressing $C^{x-1}$ is two times the number for the MDS method.
2. This part also follows from Lemmas 1 and 2. Throughout the compression of the entire network $P$ by $c$ levels, each step of the MDS method that satisfies the condition in Lemma 2 can decrease the number of possible merge operations by one with respect to OPT. By simply counting these steps, at the end of the execution of the MDS method we can give the upper bound $s + MDS(P, *^c)$ on the number of optimal compressions $OPT(P, *^c)$.
3. Part 2 shows that $OPT(P, *^c) \le s + MDS(P, *^c)$. It is only necessary to show $OPT(P, *^c) \le 2\,MDS(P, *^c)$. Part 1 proves this result for a single connected component $C$ for the $x$th compression level. $P$ is given as $\bigcup_{j=1}^{r} C_j$ before the first level of compression. We know by Part 1 that $OPT(C, *^1) \le 2\,MDS(C, *^1)$. Summing this up for all $j$ from 1 to $r$, we get $OPT(P, *^1) \le 2\,MDS(P, *^1)$. This inequality holds for each compression level $x$ from 1 to $c$. Summation over $x$ gives $\sum_{x=1}^{c} OPT(P^{x-1}, *^x) \le 2 \sum_{x=1}^{c} MDS(P^{x-1}, *^x)$. Hence, we prove $OPT(P, *^c) \le 2\,MDS(P, *^c)$.

Another way of interpreting Theorem 1 is to transform it to an upper bound on the size of the compressed network generated by MDS in terms of the one that can be obtained by OPT. By carrying out this transformation, we answer the question we pointed out in the introduction, which is "How far is our compression method from the optimal compression?"
We do this as follows. Let $P$ be a network of size $n$. Given compression level $c$, let us represent the number of compression steps of the OPT method with $\lambda = OPT(P, *^c)$. Also, let $n_{OPT}$ and $n_{MDS}$ denote the sizes of the compressed networks obtained by the OPT and MDS methods respectively. By the bound given in Theorem 1, we know that $MDS(P, *^c) \ge \lambda / 2$. Therefore, we can write $n_{OPT} = n - \lambda$ and $n_{MDS} \le n - \lambda / 2$. Also, we know by definition that $\lambda \le \sum_{x=1}^{c} \lfloor n / 2^x \rfloor$. Using this inequality, we get:

$n_{OPT} \ge n - \sum_{x=1}^{c} \lfloor n / 2^x \rfloor, \qquad n_{MDS} \le n - \sum_{x=1}^{c} \lfloor n / 2^{x+1} \rfloor$ (3)

If we examine the ratio $n_{MDS} / n_{OPT}$, for $c = 1$ we get $n_{MDS} / n_{OPT} \le 3/2$ for arbitrary $n$ (details omitted). This demonstrates that after one level of compression, the size of the compressed network found by our method is at most 1.5 times the size of the optimal network. For $x = 1, 2, \ldots, c$, this ratio is proportional to $(1.5)^x$. We can also use the bound on the number of compression steps given in the second statement of Theorem 1 to obtain a similar upper bound on the size of the compressed network found by MDS. The tighter of these two upper bounds on the network size can be calculated during the execution of the MDS method and reported as an indicator of how much room is left for improving the compression.

Alignment framework
We described the first phase, namely the compression phase, in detail in previous sections. Here, we first summarize the base alignment method, SubMAP [10], that we use in our framework. Then, we explain the two remaining phases of our framework, namely the alignment phase and the refinement phase. The alignment phase follows the compression phase and utilizes the base method to find an alignment in the compressed domain. The refinement phase applies the base method on the mappings found in the previous phase to further refine the alignment results. After describing all the phases, we analyze the complexity of each phase and combine them to obtain the complexity of the entire framework. Then, we examine the characteristics of the queries to determine which are likely to benefit from compression during the alignment to answer the question of "When should we compress?"
Last, we provide a guideline for selecting the compression level that is expected to give the best performance gain reached by our framework with respect to the base alignment method.

Overview of SubMAP
Here, we take a small detour and explain SubMAP, a recent method for aligning metabolic networks when they are not compressed. We pick the SubMAP method for its high accuracy and biological relevance, as it considers subnetworks of the given networks during the alignment. A subnetwork of a network is a subset of the reactions of that network such that the induced undirected graph of this subset is connected. Given two metabolic networks $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ and a positive integer $k$, SubMAP aims to find a set of mappings between the reactions of $P$ and $\bar{P}$ with the largest similarity score, such that: (i) each reaction in $P$ ($\bar{P}$) can map to a subnetwork of $\bar{P}$ ($P$) with at most $k$ reactions; (ii) each reaction of $P$ and $\bar{P}$ can appear in at most one mapping.

The first step of SubMAP is to create the set of all possible subnetworks of size at most $k$ for each query network. We denote the number of these subnetworks for $P$ and $\bar{P}$ with $N_k$ and $M_k$ respectively. The second step of SubMAP is to calculate pairwise similarities between each pair of these subnetworks, one from $P$ and one from $\bar{P}$. Each subnetwork consists of reactions, and each reaction is defined by its input and output compounds (i.e., substrates and products) and the enzymes that catalyze it. Therefore, we measure the pairwise similarities between subnetworks using reaction similarities, which in turn are defined by the similarities of the components of these reactions. For more details of this similarity score we refer the reader to Ay et al. [10].
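The first step of SubMAP, enumerating every subnetwork of size at most k, can be sketched as below. This is an illustrative enumeration based only on the definition of a subnetwork given above (a reaction subset whose induced undirected graph is connected); it is not SubMAP's actual code, and the function name `connected_subnetworks` and the toy network are hypothetical.

```python
def connected_subnetworks(nodes, edges, k):
    """Return all node subsets of size at most k that induce a connected
    undirected graph, as a set of frozensets."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    neighbors = {v: set() for v in nodes}
    for u, v in edges:
        if u != v:  # edge direction is ignored for connectivity
            neighbors[u].add(v)
            neighbors[v].add(u)

    found = {frozenset((v,)) for v in neighbors}
    frontier = set(found)
    # Grow subsets one node at a time. Every connected subset of size s + 1
    # contains a connected subset of size s, so no subnetwork is missed.
    for _ in range(k - 1):
        grown = set()
        for sub in frontier:
            for v in sub:
                for w in neighbors[v] - sub:
                    grown.add(sub | {w})
        frontier = grown - found
        found |= frontier
    return found


# Hypothetical three-reaction chain r1 -> r2 -> r3 with k = 2.
subnetworks = connected_subnetworks(
    ["r1", "r2", "r3"], [("r1", "r2"), ("r2", "r3")], 2
)
```

The size of the returned set corresponds to the count denoted $N_k$ (or $M_k$) in the text, which is the quantity that drives the cost of the later steps.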
The step that dominates the time and space complexity of SubMAP is the third step. The aim of this step is to create a similarity score that combines pairwise similarities with the topological similarity of the networks. A data structure named the support matrix is created for this purpose. The size of this matrix is quadratic in terms of the number of subnetworks of both query networks. In other words, the support matrix requires $O(N_k^2 M_k^2)$ space. This complexity is very important as it is the dominating factor in the overall time and space complexity of SubMAP. The next two steps of the algorithm are to combine topological similarity with pairwise node similarities and to extract the alignment as a set of subnetwork mappings of $P$ and $\bar{P}$.

Alignment phase
The SubMAP method described above aligns the networks $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ in their original form. Our framework first compresses each of these networks to reduce their sizes and then aligns the compressed networks instead of $P$ and $\bar{P}$. In this section, we explain how we align the compressed networks $P^c$ and $\bar{P}^c$ that are in the compressed domain of level $c$ using SubMAP with a given parameter $k$.

Let us first consider $P^c = (V^c, E^c)$. Each node $v_a$ in $V^c$ is a supernode of the reactions in $V$. Also, by the working of our compression method, we know that each supernode $v_a$ contains at most $2^c$ reactions. An edge from the node $v_a$ to the node $v_b$ exists in $E^c$ if and only if at least one reaction in $v_a$ has an edge to one reaction in $v_b$ in $E$. The same arguments hold for the other network $\bar{P}^c$ as well. To align these compressed networks, we consider their nodes, which are supernodes of reactions, as if they are the reactions of the metabolic networks $P^c$ and $\bar{P}^c$. This way, we can directly apply SubMAP to align these networks. As far as the operation of the SubMAP method is concerned, this is no different than aligning two networks that are identical to these networks but are in the original domain.
The difference is in the interpretation of the intermediate steps and the form of the mappings found by the alignment. For instance, for the first step of SubMAP, we enumerate the reaction subnetworks of size at most $k$ in the original domain, whereas in the compressed domain we enumerate the subnetworks of supernodes, where each supernode can contain more than one reaction and the number of such supernodes in one subnetwork is at most $k$. Similarly, we calculate the pairwise similarity, the support matrix and the conflict graph for the subnetworks of supernodes (i.e., nodes of $V^c$) instead of subnetworks of reactions (i.e., nodes of $V$). The resulting alignment gives us a set of mappings between the subnetworks of $P^c$ and $\bar{P}^c$. We can think of these mappings as a high-level view of the alignment between the networks $P$ and $\bar{P}$. For instance, from Figure 1(f) one can immediately see that the resulting alignment will map node $a$ either to node $a'$ or to node $b'$, and that these are the only options for node $a$, which is imposed by the higher level supernode mapping $((a, b), (a', b'))$. In the next phase, we consider each of these supernode mappings as smaller instances of the alignment problem and solve them to obtain a more refined alignment of $P$ and $\bar{P}$.

Refinement phase
Each mapping found by the alignment phase is a subnetwork pair where one is from $P^c$ and the other is from $\bar{P}^c$. The mappings found by SubMAP can have up to $k$ nodes in one subnetwork and only one node in the other. If we denote a subnetwork of $P^c$ with $R_i^c$ and a subnetwork of $\bar{P}^c$ with $\bar{R}_j^c$, the resulting mappings of the alignment phase will be in the form $(R_i^c, \bar{R}_j^c)$. We can assume, without loss of generality, for this specific pair that $R_i^c$ contains up to $k$ nodes of $P^c$ and $\bar{R}_j^c$ contains a single node of $\bar{P}^c$. Each node contained in either of these subnetworks is a supernode that contains either one node, or two nodes and an edge between them, in the previous level of compression, namely the $(c-1)$th level. For both $R_i^c$ and $\bar{R}_j^c$, we decompress their nodes by one level by retrieving the connectivity between these nodes in the $(c-1)$th compression level that was encapsulated in the $c$th level. This
decompression results in at most $2k$ nodes from the $(c-1)$th level for $R_i^c$ and at most 2 nodes from the $(c-1)$th level for $\bar{R}_j^c$. We then recursively align these smaller networks generated from $R_i^c$ and $\bar{R}_j^c$ by using SubMAP until the original domain (i.e., $c = 0$) is reached. At the $(c-x)$th recursive step, the sizes of the two networks to be aligned can be at most $k 2^x$ for one network and $2^x$ for the other.

Figure 1(f) illustrates this on a concrete example. The network on the left has two supernodes (i.e., $(a, b)$ and $(e, d)$), each containing two nodes with an edge between them, and one supernode (i.e., $(c)$) which contains only one node from the previous level of compression. The one on the right has two supernodes with two nodes in each. To understand how decompression by one level works, we can focus on the supernode mapping $((e, d), (c', d'))$, which is found in compression level one. We can think of decompression as removing the circles that surround these supernodes to get back the connectivity within their nodes in the previous compression level. In our case, this leads to the small networks on the nodes $d$, $e$ and on the nodes $c'$, $d'$. We align these small networks recursively using SubMAP and report their final alignment in only one recursive call, since the compression level is only one for this case. Also, since $k = 1$ is used for the ease of this example, the sizes of the networks, in terms of the nodes in the original domain, on each side are at most 2 for the recursive call from $c = 1$, as can be seen from Figure 1(f) (i.e., $k 2^c = 2^c = 2$ for $k = c = 1$).

Complexity analysis
Having finished the discussion of all three phases, we can now analyze the overall complexity of our framework. We start from the first phase, which is compression of the input networks $P$ and $\bar{P}$ by $c$ levels. We first calculate the complexity of the first compression level for the network $P$ with size $n$. At each compression step, MDS first searches for a minimum degree node.
Once it finds this node, it picks one of its neighbor nodes and merges these two nodes. After this merging, it updates the degrees of all the neighbors of each of the merged nodes. The first two of these operations take $O(\log n)$ time if proper data structures are used, and the last one can take $O(n)$ in the worst case. Since the size of network $P$ is $n$, there can be at most $n/2$ compression steps during the first level of compression. Hence, the complexity of the compression for the first level is $O(n^2)$. Since the input size of this level is larger than that of all the next levels, we can safely assume that each of these next levels also takes $O(n^2)$, and the complexity of compression by $c$ levels is therefore $O(c n^2)$. Even though this is not a tight bound, it is sufficient at this point, for the complexity of the next two phases will dominate it. Since we compress both networks, the overall complexity for the compression phase is:

$O(c\,(n^2 + m^2))$. (4)

For the analysis of the next phases, we make two assumptions, both of which are supported by experimental evidence on the topological properties of metabolic networks. Our first assumption is that at each level of compression our method reduces the network size by half. In other words, if the sizes of our query networks are $n$ and $m$, then the sizes of the compressed networks after $c$ levels by the MDS method are $n_{MDS} = n / 2^c$ and $m_{MDS} = m / 2^c$ respectively. This is mainly because metabolic networks contain many nodes with low degrees [27]. Our experiments on a large dataset of networks summarized in Table 1 support this as well. The second assumption is that the number of subnetworks is a constant multiple of the network size for small $k$ values. In other words, $N_{MDS} = a(k)\,n_{MDS}$ and $M_{MDS} = b(k)\,m_{MDS}$, where $a(k)$ and $b(k)$ are functions of $k$ but are independent of $n$ and $m$ respectively. Our earlier analysis in Ay et al. [10] demonstrated that the number of subnetworks for $k = 3$, which is the largest $k$ value we use here, is in the order of $5|V|$ for a large set of metabolic networks.
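The two assumptions above, combined with the $O(N_k^2 M_k^2)$ support matrix cost stated in the SubMAP overview, can be turned into a rough cost estimate. The sketch below is only a back-of-the-envelope illustration: the function names are hypothetical, constant factors hidden by the O-notation are ignored, and the coefficient 5 is borrowed from the "order of 5|V|" observation for k = 3, applied here to both networks as an assumption.

```python
import math


def compressed_size(size, c):
    """First assumption: each compression level halves the network size."""
    return size / 2 ** c


def subnetwork_count(size, coeff):
    """Second assumption: subnetwork count is a constant multiple of size."""
    return coeff * size


def alignment_cost_estimate(n, m, c, a, b):
    """Support matrix cost (N * M)^2 for networks compressed by c levels."""
    big_n = subnetwork_count(compressed_size(n, c), a)
    big_m = subnetwork_count(compressed_size(m, c), b)
    return (big_n * big_m) ** 2


# Hypothetical query sizes; a = b = 5 is an illustrative choice for k = 3.
cost_original = alignment_cost_estimate(1000, 800, 0, 5, 5)
cost_one_level = alignment_cost_estimate(1000, 800, 1, 5, 5)
ratio = cost_original / cost_one_level
```

Under these assumptions each additional compression level divides the estimated alignment cost by $2^4 = 16$. The estimate covers the alignment phase only and leaves out the compression and refinement phases.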
We are now ready to analyze the complexity of the second phase, which is the alignment phase. By the first assumption, we know that the sizes of $P^c$ and $\bar{P}^c$ are $n_{MDS} = n / 2^c$ and $m_{MDS} = m / 2^c$ respectively. By the second, we have the number of subnetworks of these networks as $N_{MDS} = a(k)\,n_{MDS}$ and $M_{MDS} = b(k)\,m_{MDS}$ for a given $k$. Also, we know that the complexity of SubMAP is quadratic in terms of $N_{MDS}$ and $M_{MDS}$. Therefore, the complexity of the second phase is:

$O\left(\frac{a(k)^2\, b(k)^2\, n^2 m^2}{2^{4c}}\right)$ (5)

The complexity of the refinement phase has two factors in it. The first one is the number of mappings found by the alignment phase. Since we know that SubMAP allows each node of both networks to be reported in at most one mapping, we have a trivial upper bound on the number of possible mappings in terms of $n$ and $m$. The largest number of mappings is reported when all the subnetworks of both networks are singletons. In this case, the number of reported mappings is the minimum of $n$ and $m$. We can assume without loss of generality that $n <$ [...] are at most $k 2^c$ on one side and at most $2^c$ on the other. The numbers of subnetworks that can be created from these networks are $a(k)\,k 2^c$ and $b(k)\,2^c$ for the corresponding sides. Therefore, each mapping can be refined by decompressing and applying SubMAP, which is $O(a(k)^2 k^2 2^{2c}\, b(k)^2 2^{2c})$. We do this refinement $O(n)$ times in the worst case, hence the complexity of the refinement phase is:

$O(a(k)^2\, b(k)^2\, n k^2\, 2^{4c})$. (6)

Combining the results of Equations 4, 5 and 6, we can see that the overall complexity of our method is determined by the second or the third phase depending on the value of $c$. For small values of $c$ and $k$, such as 1, 2 and 3, the second phase dominates the overall complexity. Larger values of $c$ result in a costlier refinement phase and a less expensive alignment phase. Very large values of $k$ imply exponentially many subnetworks, in which case the above complexity analysis would not hold and the alignment problem may become intractable with or without compression.

When should we compress?
We discussed the potential of our framework for improving the scalability of existing network alignment methods.
However, there can be cases when the compression results in network topologies that force the alignment method to reach its worst case performance. In this section, we analyze when performing the alignment in the compressed domain is the better alternative. For this purpose, we devise a criterion that is inspired by the results of a large number of network alignments that are done by both of the methods. We find that the gain or loss in running time is highly dependent on the number of all possible subnetworks of compressed and noncompressed networks. The numbers of these subnetworks can be determined in advance of the alignment. By formulating a criterion in terms of these numbers, we can make a decision between the two algorithms before actually performing an alignment.

Figure 4 illustrates the results for 3600 alignments performed by both of the methods on a wide range of network sizes with all possible combinations of $k$ and $c$ values. The x-axis shows the running time of SubMAP minus the running time of our framework. The bigger this value is, the better the improvement we get from our framework. The y-axis shows the ratio $y = \frac{N_k^c M_k^c}{N_k M_k}$, where $N_k$, $M_k$ denote the numbers of all subnetworks of $P$ and $\bar{P}$, and $N_k^c$, $M_k^c$ denote the numbers of all subnetworks of the compressed networks $P^c$ and $\bar{P}^c$. The dashed line passing through $y = 0.5$ visualizes our criterion. If the above ratio is below 0.5, then the number of all possible subnetworks generated by the compressed alignment is less than half of this number for the original alignment. A very large portion of the alignments (97%) satisfying this criterion shows improvement in running time if compression is used. Above 0.5, only a small portion of the alignments (10%) shows improvement. Considering the overhead of the refinement phase and the compression phase, this result is expected. These results strongly suggest that the answer to the question "When should we compress?" is "when $\frac{N_k^c M_k^c}{N_k M_k} \le 0.5$".

How much should we compress?
In this section, we provide a guideline for selecting a value for compression level $c$ that results in the minimum expected running time, among other possible values, for our framework to align the query networks for a given $k$. We make extensive use of the computational complexity results we discussed before in the proof of the theorem below, which formulates the optimal $c$ for a given $k$ value and the two query networks with sizes $n$ and $m$. This theorem answers the question "What is the right amount of compression that we need to use in order to minimize the running time of our framework?"

Theorem 2 (OPTIMAL LEVEL OF COMPRESSION) Let $P = (V, E)$, $\bar{P} = (\bar{V}, \bar{E})$ be two metabolic networks with sizes $n$ and $m$ respectively, and $k$ be a given positive integer. Assume without loss of generality that $n <$ [...]

$\frac{a(k)^2\, b(k)^2\, n^2 m^2}{2^{4c}} + a(k)^2\, b(k)^2\, n k^2\, 2^{4c}$ (9)

Our aim is to maximize (8) $-$ (9) with respect to $c$. We know that this difference is negative (i.e., alignment in compressed domain is costlier) when $c$ [...] $n$ (assuming $n <$ [...]

13. Cheng Q, Harrison R, Zelikovsky A: MetNetAligner: a web service tool for metabolic network alignments. Bioinformatics 2009, 25(15):1989-90.
14. Kalaev M, Bafna V, Sharan R: Fast and accurate alignment of multiple protein networks. J Comput Biol 2009, 16:989-99.
15. Chen M, Hofestadt R: PathAligner: metabolic pathway retrieval and alignment. Appl Bioinformatics 2004, 3(4):241-252.
16. Li Z, Zhang S, Wang Y, Zhang XS, Chen L: Alignment of molecular networks by integer quadratic programming. Bioinformatics 2007, 23(13):1631-1639.
17. Li Y, de Ridder D, de Groot MJL, Reinders MJT: Metabolic pathway alignment between species using a comprehensive and flexible similarity measure. BMC Syst Biol 2008, 2:111.
18. Kuchaiev O, Milenkovic T, Memisevic V, Hayes W, Przulj N: Topological network alignment uncovers biological function and phylogeny. J R Soc Interface 2010, 7:1341-1354.
19. Chor B, Tuller T: Biological networks: comparison, conservation, and evolution via relative description length. J Comput Biol 2007, 14(6):817-838.
20. Pinter RY, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson M: Alignment of metabolic pathways. Bioinformatics 2005, 21(16):3401-3408.
21. Singh R, Xu J, Berger B: Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci USA 2008, 105:12763-12768.
22. Francke C, Siezen RJ, Teusink B: Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol 2005, 13(11):550-558.
23. Sridhar P, Kahveci T, Ranka S: An iterative algorithm for metabolic network-based drug target identification. Pac Symp Biocomput 2007, 12:88-99.
24. Ogata H, Fujibuchi W, Goto S, Kanehisa M: A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res 2000, 28:4021-4028.
25. Green ML, Karp PD: A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5:76.
26. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 1999, 27:29-34.
27. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407(6804):651-654.
28. Pfeiffer T, Soyer OS, Bonhoeffer S: The evolution of connectivity in metabolic networks. PLoS Biol 2005, 3(7):e228.
29. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science 2002, 297(5586):1551-1555.

doi:10.1186/1471-2105-13-S3-S2
Cite this article as: Ay et al.: Metabolic network alignment in large scale by network compression. BMC Bioinformatics 2012 13(Suppl 3):S2.
Ay et al. BMC Bioinformatics 2012, 13(Suppl 3):S2
http://www.biomedcentral.com/1471-2105/13/S3/S2

PROCEEDINGS Open Access

Metabolic network alignment in large scale by network compression

Ferhat Ay1,2*, Michael Dang1, Tamer Kahveci1

From ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011 (ACM-BCB)
Chicago, IL, USA. 1-3 August 2011

Abstract
Metabolic network alignment is a system scale comparative analysis that discovers important similarities and differences across different metabolisms and organisms. Although the problem of aligning metabolic networks has been considered in the past, the computational complexity of the existing solutions has so far limited their use to moderately sized networks. In this paper, we address the problem of aligning two metabolic networks, particularly when both of them are too large to be dealt with using existing methods. We develop a generic framework that can significantly improve the scale of the networks that can be aligned in practical time. Our framework has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we develop an algorithm which transforms the given networks to a compressed domain where they are summarized using fewer nodes, termed supernodes, and interactions. In the second phase, we carry out the alignment in the compressed domain using an existing network alignment method as our base algorithm. This alignment results in supernode mappings in the compressed domain, each of which is a smaller instance of the network alignment problem. In the third phase, we solve each of the instances using the base alignment algorithm to refine the alignment results. We provide a user defined parameter to control the number of compression levels, which generally determines the tradeoff between the quality of the alignment and how fast the algorithm runs. Our experiments on the networks from the KEGG pathway database demonstrate that the compression method we propose reduces the sizes of metabolic networks by almost half at each compression level, which provides an expected speedup of more than an order of magnitude. We also observe that the alignments obtained by only one level of compression capture the original alignment results with high accuracy. Together, these suggest that our framework results in alignments that are comparable to existing algorithms and can do this with practical resource utilization for large scale networks that existing algorithms could not handle. As an example of our method's performance in practice, the alignment of organism-wide metabolic networks of human (1615 reactions) and mouse (1600 reactions) was performed in under three minutes using only a single level of compression.

Background
Biological networks provide a compact representation of the roles of different biochemical entities and the interactions between them. Depending on the types of entities and interactions, these networks are segregated into different types, where each network type encompasses a particular set of biological processes. Protein-protein interaction (PPI) networks comprise binding relationships between two or more proteins to carry out specific cellular functions such as signal transduction. Regulatory networks consist of interactions between genes and gene products to control the rates at which genes are transcribed. Metabolic networks represent sets of chemical reactions that are catalyzed by enzymes to transform a set of metabolites into others to maintain the stability of a cell and to meet its particular needs. Analysis of the connectivity properties of these networks has proven to be crucial in uncovering the details of the cell machinery and in revealing the functional modules and complexes involved in this mechanism [1-4].
*Correspondence: ferhatay@uw.edu
1 Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
Full list of author information is available at the end of the article
© 2012 Ay et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

An essential type of network analysis is comparative analysis, which aims at identifying functionally similar elements or element sets shared among different organisms; this would not be possible if these elements were only considered individually. It is often achieved through alignment of the networks of these organisms. Analogous to sequence alignment, which identifies conserved sequences, network alignment reveals connectivity patterns that are conserved among two or more organisms. A number of studies have systematically aligned different types of biological networks [5-21]. For metabolic networks, Pinter et al. [20] devised an algorithm that aligns query networks with specific topologies by using a graph theoretic approach. Recently, some of us developed an algorithm that combines both topological features and homological similarity of pairwise molecules to align metabolic networks [8]. We also proposed a method, SubMAP [9,10], that incorporates subnetwork mappings in metabolic network alignment. A similar method, IsoRank [21], has been applied to find the alignments of PPI networks. IsoRankN [11] extended this algorithm to work for multiple networks and to allow mappings of protein clusters.

Comparative analysis is particularly important for large metabolic networks such as organism-wide networks.
Identification of the conserved patterns among metabolic networks across species provides insights for metabolic reconstruction of a newly sequenced genome [22], orthology detection [21], drug target identification [23] and identification of enzyme clusters and missing enzymes [24,25]. However, aligning large scale networks is a computationally challenging problem due to the underlying subgraph isomorphism problem that has to be solved to find the alignment that maximizes the similarity between the query networks. The methods we mentioned above restrict the query topologies and/or their sizes. Even under these conditions, the running times and memory utilization of these methods can still be prohibitive for large query networks. For instance, the method of Pinter et al. [20] takes around one minute per alignment on a dataset with only small networks ranging from 2 to 41 nodes. Our earlier method, SubMAP, has no limitations on the query topologies and allows mappings of node sets that are connected (i.e., subnetworks). However, allowing subnetworks comes at the cost of increased running time, because the number of all connected subnetworks up to a given size can be exponential in the size of the network. For a network of size 80 and subnetwork sizes up to 3, SubMAP takes around 6 minutes and 150 MB of memory on average per alignment with a database of networks of size 50 on average. Therefore, improving the running time and memory utilization of these methods is necessary to leverage the alignment of larger scale networks, especially when subnetwork mappings are allowed.
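To make the growth in the number of connected subnetworks concrete, the following Python sketch enumerates all connected node sets of size at most k by repeatedly extending smaller sets with a neighboring node. It is only an illustration and not SubMAP's implementation: the function name `connected_subnetworks`, the adjacency-dictionary input and the choice to ignore edge directions when testing connectivity are assumptions made here.

```python
def connected_subnetworks(adj, k):
    """Return all connected node sets with at most k nodes.

    adj: dict mapping each node to the set of its neighbors
         (assumed symmetric, i.e. edge directions are ignored).
    """
    # Every single node is a connected subnetwork of size one.
    level = {frozenset([u]) for u in adj}
    found = set(level)
    # Grow sets of size s into sets of size s + 1 by adding one neighbor.
    for _ in range(k - 1):
        next_level = set()
        for subset in level:
            for u in subset:
                for v in adj[u]:
                    if v not in subset:
                        next_level.add(subset | {v})
        level = next_level
        found |= level
    return found


# Hypothetical toy network: a path a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
for k in (1, 2, 3):
    print(k, len(connected_subnetworks(path, k)))
```

On the four-node path the counts are 4, 7 and 9 for k = 1, 2 and 3. On densely connected networks the count grows much faster with k, which is the cost referred to above.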
In this paper, we develop a framework that significantly improves the scale of the networks that can be aligned using existing algorithms. Our framework has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we develop a compression method that reduces the size of the input metabolic networks by a desired rate. In other words, we transform the query networks from their original domains (see Figure 1(a)) to a compressed domain (see Figure 1(d)). A single node in the compressed domain corresponds to a set of connected nodes and the edges between them in the original domain. We call each such node in the compressed network a supernode. For instance, Figure 1(d) depicts the compressed networks of the two input networks in Figure 1(a) when each supernode is allowed to contain up to two nodes (i.e., only one level of compression is allowed). In the second phase, we carry out the alignment in the compressed domain by using an existing network alignment algorithm, which is SubMAP in this paper, as our base method. Once the compressed networks are aligned, we next consider each mapping of supernodes found by the alignment phase individually. Each such mapping suggests a smaller instance of network alignment. Figure 1(f) demonstrates this, where two such instances exist. For each of these mappings, we solve the alignment problem using the base algorithm. At the end of this refinement phase, the final mappings of reactions are extracted (see Figure 1(g)), transforming the alignment back to the original domain.
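The hand-off from the alignment phase to the refinement phase can be sketched as follows. Each mapped pair of supernodes is expanded into the two sets of original reactions it contains, and each such pair of sets is one smaller alignment instance for the base algorithm. This is a minimal sketch under stated assumptions: the function name `refinement_instances`, the dictionary layout and the node labels in the example are hypothetical and are not taken from the framework's C++ implementation.

```python
def refinement_instances(supernode_mapping, members_p, members_q):
    """Expand supernode mappings into smaller alignment instances.

    supernode_mapping: list of (supernode of first network,
                                supernode of second network) pairs.
    members_p, members_q: dicts mapping a supernode to the set of
                          original reactions it contains.
    Returns a list of (reactions of first network,
                       reactions of second network) pairs.
    """
    instances = []
    for sp, sq in supernode_mapping:
        instances.append((sorted(members_p[sp]), sorted(members_q[sq])))
    return instances


# Hypothetical example with one level of compression
# (each supernode holds at most two reactions).
members_p = {"A": {"a", "b"}, "B": {"c", "d"}, "C": {"e"}}
members_q = {"X": {"a2", "b2"}, "Y": {"c2", "d2"}}
mapping = [("A", "X"), ("B", "Y")]

for left, right in refinement_instances(mapping, members_p, members_q):
    # Each pair would be handed to the base alignment algorithm.
    print(left, "<->", right)
```

Because every instance contains only a handful of reactions after one level of compression, the refinement phase stays cheap; the instances grow as more levels of compression are applied.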
We can best motivate the need for such a framework with an example. Figure 1 illustrates the difference between aligning two metabolic networks in the compressed domain and aligning them in the original domain without compression. If we use a base alignment algorithm such as SubMAP or IsoRank, the time and space complexity of the algorithm is determined by the size of a data structure named the support matrix [10,21]. Conceptually, this data structure governs the topological similarities between every pair of reaction tuples. Each reaction tuple contains one reaction from each of the two query metabolic networks. A detailed description of this matrix can be found in previous articles describing IsoRank [21] and SubMAP [10]. The size of this support matrix is quadratic in both n and m (i.e., $O(n^2 m^2)$) for IsoRank and for SubMAP when only subnetworks of size one are allowed. Figures 1(b) and 1(e) illustrate the support matrices required for alignment starting from the networks shown in Figure 1(a) and 1(d) respectively. As a result of compression by only one level, the size of the matrix we need to create drops from 20 × 20 to 6 × 6 (400 versus 36 entries), which translates into more than an order of magnitude improvement in theoretical resource utilization compared to the base method.

Notice that when we compress the network more (i.e., increase the number of compression levels), the compressed network gets smaller in terms of its number of nodes and edges. As a result, we can expect to align the compressed networks faster. However, this comes at the price of two drawbacks, both due to the fact that each supernode contains multiple nodes from the original domain. First, once we find a mapping for the supernodes in the compressed domain, we still need to align the nodes of each supernode pair. For example, after mapping the supernode (a, b) of the first network to the supernode (a, b) of the second network shown in Figure 1(f), we need to align the two subnetworks induced by these two supernodes. Thus, as the size of the supernodes grows (i.e., as we compress for more levels), the size of the smaller problem instances grows as well and the resource utilization bottleneck shifts from the alignment phase to the refinement phase. Second, when we use compression the
resulting alignment may not be the same as the one found by the original algorithm. For example, one out of four mappings in Figure 1(g) (i.e., e mapped to c) is different from the results of the base algorithm shown in Figure 1(c) (i.e., e mapped to e). This brings the need to define a measure of consistency between the results of alignments with and without compression, which can be used as an indicator of accuracy for the framework we propose here. We calculate this accuracy as the correlation of the scores calculated for each possible mapping found by our framework in the compressed domain with the scores for these mappings in the original domain found by the base method. Larger compression rates generally mean less similarity between the results of the two methods (i.e., less accuracy).

Several key questions follow from these observations:
1. How does compression affect the alignment accuracy with respect to the base network alignment method?
2. How far is our compression method from an optimal compression that produces the compressed network with the minimum number of nodes?
3. When is it a good idea to do the alignment in the compressed domain, taking into account the overhead of the compression and refinement phases?
4. What is the right amount of compression? That is, when does compression minimize the running time of our overall framework?

In the rest of the paper we address each of these questions in detail. At this point, it is important to notice the potential for leveraging the alignment of larger scale networks by the framework we are proposing. The actual performance gain for an alignment will depend on the level of compression we use, the topologies of the query networks and the complexity of the base alignment method.

Figure 1 Aligning two metabolic networks with and without compression. Top figures (a-c) illustrate the steps of alignment without compression. Bottom figures (d-g) demonstrate the different phases of alignment with compression using our framework. (a) Two hypothetical metabolic networks with 5 and 4 reactions respectively. Directed edges represent the neighborhood relations between the reactions. (b) Support matrix of size 20 × 20 needed for the alignment if compression is not used. We only show the nonzero entries of a single row that corresponds to the topological support given by the b-to-b mapping to possible mappings of its backward and forward neighbors. Five such mappings supported equally are denoted by 1/5s in the matrix, namely the a-to-a mapping for the backward neighbors and the c-to-c, c-to-d, d-to-c and d-to-d mappings for the forward neighbors. (c) The resulting reaction mappings of alignment without compression. (d) Query networks shown in (a) in the compressed domain after one level of compression. (e) Support matrix of size 6 × 6 needed for the alignment with compression. We only show the entries for the mappings supported by the mapping of supernode (a, b) to supernode (a, b). (f) The resulting mappings from the alignment in the compressed domain. (g) The resulting reaction mappings after the refinement phase of our framework.

Results overview
Our experiments on metabolic networks extracted from the KEGG pathway database [26] demonstrate that our compression method reduces the number of nodes and edges by almost half at each level of compression. As a result of this reduction, we observe a significant improvement in the running time and memory utilization of our earlier alignment algorithm SubMAP. Lastly, we analyze the accuracy of our framework as compared to the base alignment algorithm. The results suggest that the alignment obtained by only one level of compression captures the original alignment results with very high accuracy and that the accuracy decreases with further levels of compression.

Technical contributions
• We devise an efficient framework for the network alignment problem that employs a scalable compression method which shrinks the given networks while respecting their topology.
• We prove the optimality of our compression method under certain conditions and provide a bound on how much our compression results can deviate from the optimal solution in the worst case.
• We provide a mathematical formulation that serves as a guideline to select an optimal number of compression levels depending on the input characteristics of the alignment.
• We characterize the cases for which the proposed framework is expected to provide significant improvement in alignment performance.

In the next section, we report our experimental results on a set of large scale metabolic networks that are constructed by combining networks from the KEGG Pathway database [26]. The details of the network compression method we propose here and the other phases of our framework are described in the methods section.

Results and discussion
In this section, we experimentally evaluate the performance of our framework. First, we measure the compression rates achieved for different levels of compression with the minimum degree selection (MDS) method that we propose here.
Next, we analyze the changes in degree distribution and large scale organization of organism-wide metabolic networks with increasing compression levels. We then examine the gain in running time and memory utilization achieved by our framework for different values of the compression level (c) and subnetwork size (k) parameters. Last, we examine the accuracy of the alignments we found, measured as Pearson's correlation coefficient between the scores of mappings calculated by our framework and the ones calculated by the base algorithm we use.

Dataset
We use the metabolic networks from the KEGG pathway database [26]. For our medium scale dataset, we downloaded all metabolic networks with at least 10 reactions for 10 different organisms. This resulted in 620 metabolic networks in total with sizes ranging from 10 to 97.

In order to obtain our large scale dataset, we first combined all the metabolic networks that belong to one of the 9 different metabolism categories in the KEGG database to create a complete metabolism network for each metabolism for 10 selected organisms (Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit fly), Arabidopsis thaliana (thale cress), Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (budding yeast), Staphylococcus aureus COL (MRSA), Escherichia coli K12 MG1655, Pseudomonas aeruginosa PAO1). We obtain the organism-wide metabolic networks by combining all the listed networks in KEGG for each of these organisms. In total, we have 100 networks with sizes ranging from 5 to 1615 (9 complete metabolism networks plus 1 organism-wide network for each of the 10 organisms). Below is the list of metabolism categories we use.
1. Carbohydrate Metabolism
2. Energy Metabolism
3. Lipid Metabolism
4. Nucleotide Metabolism
5. Amino Acid Metabolism
6. Metabolism of Other Amino Acids
7. Glycan Biosynthesis and Metabolism
8. Metabolism of Cofactors and Vitamins
9. All Amino Acids (Amino Acid + Other Amino Acids)

Implementation and system details
We implemented our compression and alignment algorithms in C++. We ran all the experiments on a desktop computer running Red Hat Enterprise Client 5.7 with 4 GB of RAM and two dual-core 2.40 GHz processors.

Evaluation of compression rates
The efficiency of our alignment framework depends on how much the query metabolic networks can be compressed. For this reason, in this experiment, we measure the number of nodes and edges of the metabolic networks in our large scale dataset before and after compression.

The minimum degree selection (MDS) method we describe in this paper compresses the query metabolic networks by selecting the first node among the list of nodes with minimum degree at each intermediate step and by compressing it with one of its neighbors. In order to evaluate the stability of this compression method, we examined the effect of the node selection strategy on the size of the resulting compressed networks. By randomizing the step at which we select a node among the set of minimum degree nodes, we generated 100 different compressed networks for each of the input metabolic networks. In the following, we examine how much compression we achieve by the MDS method and also analyze its stability with respect to compressions achieved by randomization of the node selection step.
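One level of minimum degree selection can be sketched in Python as below. This is an illustration of the idea stated above (pick the first minimum-degree node, merge it with its first neighbor), not the implementation evaluated in this paper. The function name `compress_one_level`, treating edges as undirected, counting degrees only among nodes that have not been merged yet, and taking the alphabetically first neighbor as the "first neighbor" are all assumptions of this sketch.

```python
def compress_one_level(adj):
    """One level of minimum-degree-selection style compression.

    adj: dict mapping each node to the set of its neighbors
         (assumed symmetric; node labels must be sortable).
    Returns (supernodes, compressed_adj), where supernodes is a list of
    tuples with one or two original nodes, and compressed_adj maps the
    index of each supernode to the indices of its neighboring supernodes.
    """
    # Working copy: nodes not yet merged, with their unmerged neighbors.
    remaining = {u: set(vs) for u, vs in adj.items()}
    supernodes = []
    while remaining:
        # First node with minimum degree (ties broken by insertion order).
        u = min(remaining, key=lambda x: len(remaining[x]))
        members = [u]
        if remaining[u]:
            # Merge with the first neighbor that is still unmerged.
            members.append(sorted(remaining[u])[0])
        for x in members:
            for y in remaining.pop(x):
                if y in remaining:
                    remaining[y].discard(x)
        supernodes.append(tuple(members))

    # Two supernodes are connected if any of their members were connected.
    owner = {x: i for i, sn in enumerate(supernodes) for x in sn}
    compressed = {i: set() for i in range(len(supernodes))}
    for x, neighbors in adj.items():
        for y in neighbors:
            if owner[x] != owner[y]:
                compressed[owner[x]].add(owner[y])
    return supernodes, compressed


# Hypothetical toy network: a path a - b - c - d compresses to two supernodes.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(compress_one_level(path))
```

Applying the function again to the compressed adjacency gives the next compression level. Replacing the deterministic `min` with a random choice among the minimum-degree nodes would correspond to the randomized variant used in the stability experiment.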
Table 1 summarizes the compression rates achieved by our method for networks of different sizes. We divide all the metabolic networks in our dataset into bins according to the number of their reactions (i.e., network size). The first column in Table 1 lists the network size intervals we used for each group. Notice that the gaps in the size intervals are due to the fact that organism-wide networks are of size 850 and larger whereas the other combined networks for nine different metabolism categories have sizes below 400. Each row of this table shows the number of nodes and edges averaged over all the networks in this group before and after compression. The two columns with c = 0 correspond to the average number of nodes and edges of the networks with no compression respectively. For c ∈ {1, 2, 3}, we split each row corresponding to an interval into two. The upper part denotes the average node and edge numbers for the compressed network if the MDS method is used as originally described (i.e., the first among the list of minimum degree nodes is selected and combined with its first neighbor at each compression step). The lower part represents the numbers gathered when we introduce randomization in this node selection. Each value in the lower part denotes the average of the corresponding value over these 100 different runs of compression.

One conclusion that can be drawn from Table 1 is that, independent of the network size, our compression method performs well in practice. On average, each level of compression yields networks whose sizes are 57% to 64%, 64% to 71% and 77% to 80% of the network sizes at the previous compression level for c = 1, 2 and 3 respectively. In other words, our method compresses the entire dataset down to approximately 60%, 40% and 30% of the sizes of the original networks for c = 1, 2 and 3 respectively. These rates suggest that our framework has great potential in scaling network alignment to large metabolic networks by compression. As an example, consider the row corresponding to interval [850,1250] in Table 1.
We see that instead of aligning networks with 1080 nodes and 3727 edges on average, we can apply two levels of compression first and do the alignment with significantly smaller networks that have only 407 nodes and 1733 edges on average. Another observation is that we get most of the reduction in network size after the first compression level. That is, our method compresses the networks aggressively for c = 1 and reduces them to 57% to 64% of their size, which is close to half of the size of the networks. As we go up in the levels of compression, the actual rate of compression achieved at one level reduces. Considering the fact that having an input network which can lead to the best possible compression (i.e., reducing its size from n down to n/2 (i.e., 50%) at each level of compression) is a rare event, the observed compression rates suggest that our method provides an efficient compression for metabolic networks in practice.

Table 1 Summary of compression rates for all the networks in our large scale dataset

Network size   Entry        Average number of nodes            Average number of edges
interval                    c=0      c=1     c=2     c=3       c=0      c=1      c=2      c=3
[0,100)        MDS          41.5     26.5    19.1    15        83.5     55.2     36.3     23.6
               randomized            26.5    19.1    14.8               55.5     36.5     23.5
[100,200)      MDS          154.8    92.4    61.3    48.6      310.1    174.9    116.5    96.3
               randomized            92.2    61.5    48.6               174      118.1    94.6
[200,300)      MDS          240.5    139.1   89.2    69.4      508.1    296.5    230.5    187.8
               randomized            139.4   89.1    69.7               298.4    228.4    188.1
[300,400]      MDS          344.9    207.3   133.1   103       585.7    372.9    302.7    261.6
               randomized            207.6   133.8   104.5              373.5    300.4    259.9
[850,1250]     MDS          1080.5   623.2   406.8   311.3     3727     2269     1732.7   1584.8
               randomized            623.7   407.9   311.9              2280.6   1733.8   1587.5
[1500,1615]    MDS          1576.5   909     582     447.8     4740     2955.2   2283.5   2128.8
               randomized            910     583     444.6              2964.3   2279.3   2129.6

We create six intervals according to the number of reactions in these networks. Each row, corresponding to one such interval, shows the average number of nodes and edges before compression (i.e., c = 0) and after compression of different levels (i.e., c ∈ {1, 2, 3}). For each row, the top (MDS) entries correspond to numbers obtained with the MDS method, which selects the first node from the list of nodes with minimum degree at each intermediate step and compresses it with its first neighbor from the list of its neighbors. The bottom (randomized) entries correspond to the averages of 100 different compressions which are gathered by randomizing the step at which a node is selected among the set of minimum degree nodes.

This experimental setup also suggests that the MDS method is stable with respect to the choice of the node to compress as long as that node is selected among the nodes with minimum degree. Among the six rows and three columns (18 entries) of Table 1 for the average number of nodes after compression, only one has a difference larger than two between the value obtained with the original method and the randomized average.

The results of this experiment suggest that our compression method, MDS, serves as an efficient and stable first phase for our alignment framework by achieving good compression rates on a large dataset of metabolic networks.

Changes in degree distributions with compression
Even though the compression rates we achieve with MDS as described above suggest a significant reduction in the problem size, we observe that there is a noticeable difference between the compression rates achieved by going from one compression level to the next. For instance, on average the networks shrink to 57% to 64% of their size going from c = 0 to c = 1, whereas they only shrink to 76% to 80% of their size going from c = 2 to c = 3. This suggests that the large scale organization of the networks changes with increasing levels of compression. Even though a change in the network structure can be expected as a result of our compression, it is not obvious how to quantify this change and whether the change is consistent among different metabolic networks.

In order to understand the reason behind different compression rates for different compression levels, we examined the degree distributions of the ten organism-wide networks we have in our dataset. For each of these networks, we plotted the histogram of outdegree distributions for different levels of compression. Figure 2 plots the frequencies of each outdegree in the range [2,40] for each c ∈ {0, 1, 2, 3, 4} for these networks. We observe that for each of these plots the degree distributions for c = 0 and c = 1 are very similar and they follow a power law distribution, which is an indicator of scale-free network topology.
This is not surprising since the scale-free topology has been observed in numerous articles in the literature as a common signature for different metabolic networks [27-29]. The similarity between the degree distributions of the original networks (c = 0) and the networks compressed by only one level (c = 1) signifies that the networks still conserve their scale-freeness after the first level of compression.

A more interesting observation is that there is a consistent shift from the power law degree distribution to the uniform distribution with increasing c values for each of the ten networks we have. It is important to clarify that our claim is not that the degree distribution becomes uniform for large c values but rather that the degree distributions for large c values are more similar to the uniform distribution (and less similar to the power law distribution) compared to the ones obtained with smaller c values. To quantify this on an example, we look at one of the most discernible characteristics of scale-free networks, and hence of the power law distribution, which is the small number of hub nodes with large degrees. If we consider the organism-wide network of Homo sapiens (Figure 2(e)), which is the largest network in our dataset, and focus on the percentage of nodes with outdegree greater than 15, we get percentages of 3%, 4%, 6.5%, 11.5% and 12.4% for c values of 0, 1, 2, 3 and 4 respectively. This indicates that the share of nodes that can be considered as hubs increases significantly with increasing levels of compression. This increase deteriorates the scale-freeness of the Homo sapiens network, which in turn decreases the achieved compression rates. A similar trend is observed for each of the other nine organism-wide networks, which are plotted separately in Figure 2.
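The hub statistic used in this example is simple to compute from a directed network. The Python sketch below shows one way to do it; the function names, the successor-dictionary input and the toy network are assumptions made for illustration, and only the threshold of 15 comes from the text.

```python
def out_degrees(successors):
    """Map each node to its outdegree.

    successors: dict mapping each node to the set of nodes it points to.
    """
    return {u: len(vs) for u, vs in successors.items()}


def hub_fraction(successors, threshold=15):
    """Fraction of nodes whose outdegree is greater than threshold."""
    degrees = out_degrees(successors)
    hubs = sum(1 for d in degrees.values() if d > threshold)
    return hubs / len(degrees)


# Hypothetical toy network: one node points to 16 others, which have no
# outgoing edges, so exactly 1 of the 17 nodes counts as a hub.
toy = {"hub": set(range(16))}
toy.update({i: set() for i in range(16)})
print(f"{100 * hub_fraction(toy):.1f}% of nodes have outdegree > 15")
```

Evaluating `hub_fraction` on the same network at successive compression levels gives a series like the one reported above for Homo sapiens.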
The results of this experiment show that there is a consistent change in the network topology when multiple levels of compression are used. The difference we observe here between the first level of compression and later levels of compression is likely to be one of the main reasons for the significant differences in both the performance and the accuracy of our framework, which are discussed next in the remainder of the results section.

Figure 2 Shift of outdegree distributions from power law to uniform. Changes in the outdegree distributions of ten organism-wide metabolic networks with increasing levels of compression. We calculate the frequencies of each outdegree in the range [2,40] for c ∈ {0, 1, 2, 3, 4} and plot them together for each of the ten organisms in our dataset. Outdegree distributions for organism-wide metabolic networks of (a) Arabidopsis thaliana (thale cress), (b) Caenorhabditis elegans (nematode), (c) Drosophila melanogaster (fruit fly), (d) Escherichia coli K12 MG1655, (e) Homo sapiens (human), (f) Mus musculus (mouse), (g) Pseudomonas aeruginosa PAO1, (h) Rattus norvegicus (rat), (i) Staphylococcus aureus COL (MRSA), (j) Saccharomyces cerevisiae (budding yeast).

Evaluation of running time and memory utilization
In order to understand the capabilities and limitations of our framework, we examine its performance in terms of its running time and memory utilization on a set of large scale networks we constructed as described in the dataset section. We have ten networks for each of the ten organisms in our dataset. For each organism, nine of these networks constitute different metabolism categories and the tenth network is the organism-wide metabolic network. In total, we have 100 networks with sizes ranging from 5 to 1615. For each parameter setting (different combinations of k ∈ {1, 2} and c ∈ {0, 1, 2, 3}), we aligned each of these 100 networks with each other network (including itself), resulting in a total of 5500 alignment queries. When the value of c is equal to zero, the alignment is carried out completely by a single application of SubMAP without any compression. This provides us a mechanism to measure how much performance gain is achieved by our compression-based framework with respect to SubMAP.

Figure 3(a) illustrates the average query running times in a log-log plot where the x-axis is the size of the query, measured as the product of the number of reactions of the metabolic networks that are aligned. We grouped queries into logarithmic bins according to the query sizes. The first bin contains all the queries of size less than or equal to 64. The next bins contain the queries of size in the interval $[2^{i+5}, 2^{i+6}]$ where i = 2, 3, ..., 17. For each parameter setting we display the average running time of all the queries in each bin. For both k = 1 and k = 2, we plot all the results for all four different compression values and also draw the fitting curves to better illustrate the trend in the increase of running time.

For k = 1, we can immediately observe that each additional compression level improves the running time over the previous one for all query sizes. We obtain the largest fold change in running time from the first level of compression. This is expected considering that the first level of compression achieved the largest compression rate as shown in Table 1. The second compression level improves the running time by a smaller factor compared to the first and by a larger factor compared to the third level. For k = 1 we were able to plot all the points for all c values, as the running time for even the largest query (i.e., the human organism-wide network versus itself, which has size 1615 × 1615) with no compression (i.e., c = 0) is still practical, around 12 minutes (with c = 3 this drops to < 40 seconds).
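The grouping of queries into logarithmic bins and the per-bin averaging can be sketched in Python as follows. This is a minimal sketch, assuming contiguous bins with power-of-two boundaries above 64 and treating each bin as half-open, (low, high], so that every query falls into exactly one bin; the function names and the example timings are hypothetical.

```python
from collections import defaultdict


def query_size_bin(n, m):
    """Return the (low, high] bin of a query aligning networks of size n and m."""
    size = n * m  # query size is the product of the two network sizes
    if size <= 64:
        return (0, 64)  # first bin: all queries of size at most 64
    exponent = (size - 1).bit_length()  # smallest e with size <= 2**e
    return (2 ** (exponent - 1), 2 ** exponent)


def average_time_per_bin(queries):
    """Average running time per bin.

    queries: iterable of (n, m, seconds) triples.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for n, m, seconds in queries:
        entry = totals[query_size_bin(n, m)]
        entry[0] += seconds
        entry[1] += 1
    return {b: total / count for b, (total, count) in totals.items()}


# Hypothetical timings for three small queries.
timings = [(8, 8, 1.0), (4, 16, 3.0), (10, 10, 5.0)]
print(average_time_per_bin(timings))
# The largest query in the dataset, human versus human:
print(query_size_bin(1615, 1615))
```

With these boundaries the human versus human query (size 1615 × 1615 = 2,608,225) falls between $2^{21}$ and $2^{22}$.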
A similar trend of improved running times with increasing c is also observed for queries up to a certain size for k = 2. For only one level of compression (c = 1) we observe significant improvement in running times for queries of all sizes. However, starting from the bin $[2^{13}, 2^{14}]$, compressing the networks by more than one level (c > 1) shows a consistent adverse effect on the running time. This implies that when both query networks have sizes around 150 or larger and k > 1 is used, compressing the networks by more than one level before performing the alignment suffers from the explosion in the number of possible subnetworks of size at most k in the compressed domain. We explore this in more detail later in the paper (see Figure 4 and its discussion).

An important aspect of our framework is that it makes it possible to align networks that could not be aligned with our base method. For k = 2, we observed that in the original domain (c = 0) a significant portion of the large queries did not finish within the cutoff time, which we set to one hour. For instance, among 252 possible queries with sizes in the interval $[2^{17}, 2^{18}]$, 96 did not complete successfully for c = 0, whereas with c = 1 all of them were completed. For the next bin, 45 out of 223 possible queries were completed for c = 0, and for c = 1 this number increased to 185. These results indicate that by using the correct amount of compression, we can align larger networks than the base alignment method SubMAP. We believe this is an important step in leveraging organism wide network alignments with subnetwork mappings, for they provide a more complete picture of functional similarities and evolutionary differences between the metabolic networks of two or more organisms.

Figure 3(b) presents results for the estimated memory required for the support matrix, which is the memory bottleneck of the algorithm, that is needed to perform the alignment. For this figure, we use the same query set as Figure 3(a), hence the same x axis. On the average the memory required for alignment with c = 1 is around 30% of that needed for alignment with no compression using the SubMAP method for both k = 1 and k = 2. For k = 1, the memory utilization decreases with each additional compression level (on the average around 45% of the memory required for c = 1 is used when c is increased to 2 and around 65% of the memory required for c = 2 is used when c is increased to 3). For k = 2, in concordance with the running time results, only one level of compression provides better memory utilization for all network sizes, whereas compressing by more than one level has an adverse effect for medium and large scale queries.

Figure 3. Resource utilization of our framework. The average (a) running time and (b) memory utilization of our framework when each query network in our large scale dataset is aligned with all the networks (including itself) in the same dataset. The x axis is the query size, which is calculated as the product of the sizes (i.e., number of reactions) of the metabolic networks aligned. c = 0 denotes the alignments performed with no compression. $c \in \{1, 2, 3\}$ denotes the results of our framework that compresses both of the query networks by c levels before aligning them.

Ay et al. BMC Bioinformatics 2012, 13(Suppl 3):S2 http://www.biomedcentral.com/14712105/13/S3/S2

These results suggest that our framework has great potential to provide significant improvement in both the running time and the memory utilization of the base alignment method. This allows us to align large networks that could not be aligned by existing methods on the same hardware.

Accuracy of the alignment results
We conclude our experimental results by answering the first question introduced earlier in the paper, that is "How does compression affect the alignment accuracy?" In order to answer this, we calculate the correlation between the scores of each possible mapping in compressed domain and the scores that we obtain for these mappings from the original SubMAP method. We consider the scores of each possible subnetwork mapping of compressed nodes found by our framework. Since the mappings found by SubMAP are not of the same form as the mappings in compressed domain, we calculate a score value for each mapping in compressed domain by using the scores of the mappings found by SubMAP in the original domain. This way, we get two sets of score values, one from SubMAP and one from our framework, for the same set of mappings. We calculate the Pearson's correlation coefficient between these two sets of scores as an indicator of the similarity between the results of the two methods.

Figure 4. Gain/Loss in running time. Gain/loss in running time of alignment by using our framework with respect to the base alignment method (x axis) versus the ratio of the number of all possible subnetwork mappings in compressed domain to this number in the original domain. The blue vertical line shows when the two methods take the exact same amount of time or when both methods take a very short amount of time in the case of small query networks. Points on the right (left) hand side of this line mean gain (loss) in the running time. The dashed line is our decision criterion for predicting whether there will be gain or loss before doing the alignment.

Before looking at the correlation values we found, it is important to describe how we calculate the score for a mapping in compressed domain from the mappings of SubMAP. Let $P^1$ and $\bar{P}^1$ denote the one level compressed forms of two metabolic networks. Let $(v_1, \{\bar{v}_1, \bar{v}_2\})$ denote a mapping in compressed domain where $v_1$ is a subnetwork of $P^1$ and $\{\bar{v}_1, \bar{v}_2\}$ is a subnetwork of $\bar{P}^1$. Also, let $v_1 = \{r_1, r_2\}$, $\bar{v}_1 = \{\bar{r}_1, \bar{r}_2\}$ and $\bar{v}_2 = \{\bar{r}_3\}$. We know the edge that maps these two subnetworks has a mapping score in the compressed domain, and let us denote it by $e^1$ for c = 1. We want to compute a mapping score, say $e$, for $(v_1, \{\bar{v}_1, \bar{v}_2\})$ from the mappings in original domain that is comparable to $e^1$. This subnetwork mapping in compressed domain contains six possible mappings in the original domain, namely $(r_1, \bar{r}_1)$, $(r_1, \bar{r}_2)$, $(r_1, \bar{r}_3)$, $(r_2, \bar{r}_1)$, $(r_2, \bar{r}_2)$ and $(r_2, \bar{r}_3)$. Let us denote the scores of these mappings in the original domain by $e_i$ for i = 1, 2, ..., 6 respective to their ordering. Then, we compute the mapping score $e$ as $e = \frac{1}{6}\sum_{i=1}^{6} e_i$. It is important to note that this score is a conservative choice among other possible scoring options. This is because the average can include mapping scores of subnetworks with very low similarities from the original domain of SubMAP. This can underestimate the correct mapping score $e$ and hence degrade the correlation of compressed domain and original domain mapping scores. Overall, for each mapping in compressed domain with a score $e^c$, we calculate the corresponding score $e$ in the original domain using this average score.

Table 2 summarizes the correlation values found from a set of 3600 alignments (400 alignments for each parameter combination of $k \in \{1, 2, 3\}$ and $c \in \{1, 2, 3\}$). We calculate the correlation of each query with the alignment that has the same k value but is in the original domain (i.e., c = 0). Table 2 shows the average correlation values of these 400 alignments for each combination of k and c values. The first column indicates that the alignment found by using only one compression level is highly similar to the alignment found by directly using the base method. Combining this with the running time gain in Figure 3(a) for c = 1, we can strongly argue that compression by one level not only provides significant improvement in running time but also accurately captures a very high percentage of the original alignment results, which makes it very useful for practical purposes. The accuracy measured in terms of correlation drops to 0.57 on the average when we perform the second level of compression and to 0.51 for the third level.
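The averaging of original-domain scores and the correlation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-based score lookup, the reaction labels (`q1`, `q2`, `q3` stand in for the reactions of the second network), and all score values are assumptions made for the example.

```python
from itertools import product
from math import sqrt


def averaged_original_score(left_supernodes, right_supernodes, original_scores):
    """Average the original-domain scores of every reaction pair covered by a
    compressed-domain mapping (left_supernodes, right_supernodes)."""
    left = [r for sn in left_supernodes for r in sn]
    right = [r for sn in right_supernodes for r in sn]
    pairs = list(product(left, right))
    return sum(original_scores[p] for p in pairs) / len(pairs)


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical instance of the example in the text: one supernode {r1, r2}
# mapped to two supernodes {q1, q2} and {q3}. Score values are made up.
v1 = ("r1", "r2")
w1, w2 = ("q1", "q2"), ("q3",)
scores = {("r1", "q1"): 0.5, ("r1", "q2"): 0.25, ("r1", "q3"): 0.75,
          ("r2", "q1"): 1.0, ("r2", "q2"): 0.5, ("r2", "q3"): 0.0}
e = averaged_original_score([v1], [w1, w2], scores)  # mean of six scores
```

Collecting one such averaged score per compressed-domain mapping, alongside the score the framework reports for that mapping, yields the two lists that `pearson` compares.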
These results suggest that we can almost always use one level of compression to benefit from a high performance gain without losing much accuracy in terms of the alignment results. For c = 2 and c = 3, even though the accuracy of the results is significantly better than random, such compression levels should be used with caution if the accuracy of the alignment is the main concern.

Table 2. Correlation of the mapping scores found with and without compression

k/c    1     2     3
1      0.89  0.56  0.53
2      0.85  0.58  0.50
3      0.84  0.57  0.49

We calculate the Pearson's correlation coefficient between the two sets of score values, one from SubMAP (without compression) and one from our framework (with compression), and report it as an indicator of the accuracy of alignment results of our framework for different parameter settings.

Conclusions
In this paper, we considered the problem of aligning two metabolic networks, particularly when both of them are too large to be dealt with using existing methods. To solve this problem, we developed a framework that significantly scales the size of the metabolic networks that existing methods can align. Our framework is generic as it can be used to improve the scalability of any existing network alignment method. It has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we developed an algorithm which transforms the given metabolic networks to a compressed domain where they are summarized using much fewer nodes, termed supernodes, and interactions. In the second phase, we carried out the alignment in the compressed domain using an existing method, SubMAP, as the base alignment algorithm. In the refinement phase, we considered each individual mapping of supernodes one by one. Each such mapping corresponds to a smaller instance of the network alignment problem. For each of these mappings, we solved the alignment problem using SubMAP as our base method. Our experiments on the metabolic networks extracted from the KEGG pathway database demonstrate that our compression method reduces the number of reactions by almost half at each level of compression. As a result of this compression, we observe that SubMAP coupled with our framework can align networks that are twice as large or more than those its original version can align with the same amount of resources. Our results also suggested that the alignment obtained by only one level of compression benefits from a significant performance gain while capturing the original alignment results with very high accuracy. We believe that this paper takes an important step in scaling metabolic network alignment with subnetwork mappings to organism wide networks, and thus can have great impact on making the existing network alignment methods more useful for domain scientists.

Methods
In this section, we describe the method we develop to compress the query networks and the overall framework for aligning networks in this compressed domain. Before going into detail, it is important to state that we are using a reaction based model for representing metabolic networks throughout this paper. Formally, we represent a metabolic network with $P = (V, E)$ where $V$ is the set of all reactions of the network and $E$ is the set of directed edges between them. An edge $e_{ij} \in E$ exists if and only if the reaction $v_i$ has at least one output compound which is an input for the reaction $v_j$. In the following, we first describe our compression method. We use the shorthand notation MDS (minimum degree selection) to refer to this method in the rest of the paper. We then prove the optimality of MDS under certain conditions and provide an upper bound for the number of compressions that can be missed by this method with respect to the optimal compression. Next, we give a brief overview of the base alignment method that we use in this paper and explain in detail the two remaining phases of our alignment framework. We provide our analysis of the computational complexity of the overall method and conclude the methods section by answering two questions related to performance characteristics of this method.

Minimum degree selection (MDS) method
Let $P = (V, E)$ be the reaction based representation of a metabolic network and c denote the user specified parameter for the desired level of compression. For x = 1, ..., c we denote the compressed form of $P$ after x compression levels with $P^x = (V^x, E^x)$. To simplify our notation, we
assume that $P^0 = P$. We construct $P^x$ from $P^{x-1}$ for each x = 1, ..., c. Each $v \in V^x$ is either a node from $V^{x-1}$ or a supernode that contains two nodes of $V^{x-1}$. In summary, we construct $V^x$ from $V^{x-1}$ in a number of consecutive steps. At each step, we choose a pair of connected nodes in $V^{x-1}$ that are not compressed in earlier steps of the current compression level. We then merge this node pair into a supernode and add it to $V^x$. We repeat these steps until there is no such node pair in $V^{x-1}$. Assume that the number of such steps is t for compression level x. We denote the state of the network after the ith step during the xth level of compression as $P^x_i = (V^x_i, E^x_i)$ (Figure 5(b)). Note that $V^x_t = V^x$ and $V^x_i \subseteq V^{x-1} \cup V^x$ for each i = 1, ..., t, as the nodes of $V^x_i$ are either singleton nodes from $V^{x-1}$ or supernodes from $V^x$.

We are now ready to discuss how we compress $P^{x-1}$ to get $P^x$. We define the degree of a noncompressed node $v$ in a given network as $deg(v) = indeg(v) + outdeg(v)$, where $indeg(v)$ ($outdeg(v)$) denotes the number of incoming edges from (outgoing edges to) noncompressed nodes in the network. We say that two nodes in a network are neighbors if they are connected by at least one edge. We denote the set of neighbors of a node $v$ with $N(v)$. We start the compression by initializing $V^x_0 = V^{x-1}$, $E^x_0 = E^{x-1}$. Then, while there exists a noncompressed node with degree greater than zero at the current state of the network, say $P^x_{i-1}$, we apply the next step, the ith step, of compression to obtain $P^x_i$ from $P^x_{i-1}$. Figure 5 depicts the states of an example network before (Figure 5(a)) and after (Figure 5(b)) the ith step of compression. We start the ith step by selecting a node with minimum positive degree among the nodes in $V^x_{i-1}$. If there is more than one such node, we select the first one among them. In our example in Figure 5(a), the node with minimum degree is unique and is shown by $v_a$. We use the term minimum degree as a shorthand for minimum positive degree to exclude singleton nodes. This way we ensure that $deg(v_a) > 0$ and $N(v_a)$ is nonempty. We select one such neighbor from $N(v_a)$, say $v_b$. The only node in $N(v_a)$ in Figure 5(a) is denoted with $v_b$. We then merge $v_a$ with $v_b$ to form the supernode $v_{ab} = \{v_a, v_b\}$. Figure 5(b) illustrates this newly created node $v_{ab}$. This is the only compression to be done at the ith compression step. Next, we create the new node set as $V^x_i = (V^x_{i-1} \cup \{v_{ab}\}) \setminus \{v_a, v_b\}$. For creating the edge set $E^x_i$, we initialize it to $E^x_{i-1}$ and remove all the incoming and outgoing edges of $v_a$ and $v_b$ from it. Then, we insert an incoming edge to $v_{ab}$ from each node in $V^x_{i-1} \setminus \{v_a, v_b\}$ which has an outgoing edge to either $v_a$ or $v_b$ in the previous edge set $E^x_{i-1}$. We insert outgoing edges from $v_{ab}$ to other nodes in a similar manner. Figure 5 illustrates the changes in the edge set after creating $v_{ab}$. Notice that for each i = 1, ..., t, the set $V^x_i$ contains a mixture of nodes and supernodes. After each such step, the size of the network decreases by one and the number of edges of the new network decreases by at least one. For instance in Figure 5, the number of nodes dropped from five to four and the number of edges dropped from six to five. The compression of $P^{x-1}$ to get $P^x$ continues by applying another compression step until there are no more noncompressed nodes with positive degree.

The discussion above describes the intermediate compression steps of the MDS method to perform a single level of compression on a given network. Given a compression level c, for each level x = 1, ..., c, we apply the same compression steps on $P^{x-1} = (V^{x-1}, E^{x-1})$ by initially treating $P^{x-1}$ as a noncompressed network with no supernodes. As a result of this process, after finishing the xth level of compression, the actual number of reactions that each node of $V^x$ can contain is assured to be in the interval $[1, 2^x]$. The limitation on the number of reactions in each node allows the MDS method to respect and highly preserve the initial topology of the query networks. This is very important for the alignment as it makes significant use of the network topologies. Additionally, the bound on the number of reactions in each supernode translates to a uniform compression for both networks, which limits the sizes of the smaller alignment problems we can encounter in the refinement phase. This allows us to keep under control the complexity and the running time of the refinement
phase of our alignment framework.

Optimality analysis for MDS
In the previous section, we described in detail the compression method (MDS) we use in our framework. Ideally, it is preferable to compress the given network as much as possible at each compression level. This is because smaller network size often implies smaller time and memory usage for the alignment. We say that a compression is optimal if the resulting compressed network contains the smallest number of nodes among all possible compressions with the restriction that each noncompressed node can be merged with at most one other noncompressed node at each compression level. We name the hypothetical optimal compression method that can achieve the best possible compression rate as OPT. In the rest of this section, we analyze the optimality of our MDS method under different conditions. We first consider each connected component of the input network that will be compressed separately and then integrate their results to generalize our analysis for networks with arbitrary topologies.

We start by introducing the notation we use in this section to handle networks with more than one connected component. Let $P$ be a metabolic network with r connected components. We denote these components by $C_1 = (V_1, E_1)$, $C_2 = (V_2, E_2)$, ..., $C_r = (V_r, E_r)$, such that $P = (\bigcup_{j=1}^{r} V_j, \bigcup_{j=1}^{r} E_j)$. Let $C = (V, E)$ be an arbitrary component of $P$ and $*^x$ represent the compressed form of $C$ after x levels of compression using either the MDS method or OPT that achieves the optimal compression. We use * (star) as a generic symbol to avoid introducing new symbols for each compressed component in places where only their sizes are of relevance. We use $MDS(C, *^x)$ and $OPT(C, *^x)$ to denote the total number of compression steps performed to transform $C$ into its compressed form after x levels of compression by using the corresponding methods. Recall that each compression step reduces the network size by one. Thus, the bigger these values ($MDS(C, *^x)$ and $OPT(C, *^x)$), the better they are in terms of compression rate. The first and second arguments in this notation can be any state of a connected component or a network at any point during the compression. For instance, $OPT(C^x_i, *^x)$ denotes the number of compression steps taken by OPT starting from the (i+1)th intermediate step of the xth level until the xth level of compression is completed.

Figure 5. One compression step of the MDS method. Small circles represent reactions and big circles represent supernodes that result from earlier steps of compression. A solid arrow represents an edge between two noncompressed nodes in the current compression level. A dashed arrow denotes an edge between a supernode and another node in the network. While calculating the degrees of the noncompressed nodes, only the solid arrows are taken into account. (a) The state of network $P$ during compression level x before the ith intermediate step (i.e., $P^x_{i-1}$). The node with the minimum degree is denoted with $v_a$ and its first neighbor is denoted with $v_b$. (b) The state of this network after the ith compression step (i.e., $P^x_i$). We denote the node resulting from the compression at this step with $v_{ab}$.
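One level of the MDS procedure described in the text can be sketched as below. This is an illustrative reading of the method, not the authors' code: the function name, the representation of supernodes as 2-tuples, and the use of list order to pick "the first" minimum degree node and "the first" neighbor are all assumptions. It also recomputes degrees by scanning the edge set, so it does not reflect the data structures assumed in the complexity analysis.

```python
def compress_one_level(nodes, edges):
    """Apply one level of minimum-degree-selection style compression.

    nodes: list of hashable node labels (list order breaks ties).
    edges: set of directed (source, target) pairs.
    Returns (new_nodes, new_edges); a supernode is a 2-tuple (va, vb).
    """
    free = set(nodes)   # noncompressed nodes of the current level
    owner = {}          # merged node -> supernode that now contains it

    def degree(v):
        # indeg(v) + outdeg(v), counting only edges to or from free nodes
        return sum(1 for a, b in edges
                   if a != b and ((a == v and b in free) or (b == v and a in free)))

    while True:
        ranked = [(degree(v), i, v) for i, v in enumerate(nodes) if v in free]
        ranked = [item for item in ranked if item[0] > 0]
        if not ranked:
            break                      # no free node with positive degree
        _, _, va = min(ranked)         # minimum positive degree, first on ties
        vb = next(u for u in nodes
                  if u in free and u != va and ((va, u) in edges or (u, va) in edges))
        free.discard(va)
        free.discard(vb)
        owner[va] = owner[vb] = (va, vb)

    new_nodes = []
    for v in nodes:                    # keep the original ordering
        w = owner.get(v, v)
        if w not in new_nodes:
            new_nodes.append(w)
    new_edges = {(owner.get(a, a), owner.get(b, b)) for a, b in edges
                 if owner.get(a, a) != owner.get(b, b)}
    return new_nodes, new_edges


# Hypothetical five-reaction network (not the network of Figure 5).
example_nodes = ["a", "b", "c", "d", "e"]
example_edges = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("c", "e")}
level1_nodes, level1_edges = compress_one_level(example_nodes, example_edges)
```

Calling the function again on its own output would give the next compression level, since each level treats its input as a noncompressed network.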
In the following, we first prove that the MDS method makes an optimal choice in terms of which two nodes to compress at each compression step if there exists a node with degree one in the current state for a given component. We then show that if no node with degree one exists, a compression step taken by MDS can increase the size of the compressed component by at most one as compared to the one found by OPT. Finally, by aggregating the results from each component, for a given metabolic network $P$ and a compression level c, we develop an upper bound on the size of the compressed networks obtained by MDS with respect to the size of the network that can be obtained by the optimal method.

Lemma 1. Let $C = (V, E)$ denote a connected component of a given metabolic network $P$. Let $C^x_i = (V^x_i, E^x_i)$ denote the state of $C$ after the ith step of the xth compression level. If there exists a node in $V^x_i$ with degree one, then the compression step taken by the MDS method to create the next state $C^x_{i+1}$ is optimal. Formally,

$OPT(C^x_i, *^x) = 1 + OPT(C^x_{i+1}, *^x)$   (1)

Proof 1. We prove (1) by contradiction in two parts:
Part 1. $OPT(C^x_i, *^x) \ge 1 + OPT(C^x_{i+1}, *^x)$
Part 2. $OPT(C^x_i, *^x) \le 1 + OPT(C^x_{i+1}, *^x)$

The first part (i.e., $\ge$) is trivial. The number of compression steps of OPT after performing one step of compression cannot be larger than the number before performing this step, otherwise the solution of $OPT(C^x_i, *^x)$ cannot be optimal. This leads to a contradiction, hence proves Part 1.

To prove the second part (i.e., $\le$), it is important to recall how the MDS method progresses given the state $C^x_i$ at which there exists at least one node $v_a$ with $deg(v_a) = 1$. This method picks $v_a$. The node $v_a$ has exactly one noncompressed neighbor, say $v_b$. Thus, MDS merges them to create the supernode $v_{ab}$ (see Figure 5). We complete the proof by considering two cases. In the first case the OPT method merges $v_a$ and $v_b$ while compressing $C^x_i$. In this case, we can assume that OPT takes this step as its next step in compressing $C^x_i$, since a fixed compressed network can be obtained by arbitrarily shuffling the order of intermediate steps. Therefore, if $v_a$ and $v_b$ are compressed at any point in the optimal method, then the optimal solution for $C^x_{i+1}$, which is created by applying the MDS method on $C^x_i$, has exactly $OPT(C^x_i, *^x) - 1$ compressions. Hence, $OPT(C^x_i, *^x) = 1 + OPT(C^x_{i+1}, *^x)$ and $OPT(C^x_i, *^x) \le 1 + OPT(C^x_{i+1}, *^x)$.

In the second case $v_a$ and $v_b$ are not merged together in the optimal solution. This case implies $v_a$ is left as a singleton at the end of the xth level as $deg(v_a) = 1$. Then, the network that results after removing $v_a$ and all the edges connected to it can have at most $OPT(C^x_i, *^x)$ compressions until the end of the xth level, since otherwise it contradicts the optimality of OPT. This shows that the number of compressions that can be achieved when $v_a$ is left as a singleton cannot be greater than one plus $OPT(C^x_{i+1}, *^x)$. Thus $OPT(C^x_i, *^x) \le 1 + OPT(C^x_{i+1}, *^x)$, and combining it with the first part (i.e., $\ge$) we get $OPT(C^x_i, *^x) = 1 + OPT(C^x_{i+1}, *^x)$.

Lemma 2. Let $C = (V, E)$ denote a connected component of a given metabolic network $P$. Let $C^x_i = (V^x_i, E^x_i)$ denote the state of $C$ after the ith step of the xth compression level. If the node with minimum degree in $V^x_i$ has degree greater than one, then the compression step taken by MDS to create the next state $C^x_{i+1}$ can lead to a network that has size at most one larger than the compressed network that is obtained from the state $C^x_i$ by OPT. Formally,

$OPT(C^x_i, *^x) \le 2 + OPT(C^x_{i+1}, *^x)$   (2)

Proof 2. Let $v_a$ be the first node in the list of minimum degree nodes in $V^x_i$. From the assumption we know $deg(v_a) > 1$ and hence it has at least one noncompressed neighbor node $v_b$ that also has $deg(v_b) > 1$. Without loss of generality assume that the MDS method merges $v_a$ and $v_b$ to create the supernode $v_{ab}$ at the compression step from $C^x_i$ to $C^x_{i+1}$. This step can prevent at most one neighbor of $v_a$, say $v_c$, and at most one neighbor of $v_b$, say $v_d$, from being merged with the corresponding node in later steps. Notice that $v_c$ and $v_d$ are not necessarily distinct. The MDS algorithm can also merge $v_c$ and $v_d$ in the next steps if they are also neighbors, though we do not know it for sure at this point. This results in either one compression or two compressions using only the four nodes $v_a$, $v_b$, $v_c$ and $v_d$ by the MDS method. Next, we calculate the number of compression steps that the OPT method can take for compressing these four nodes. There are three cases to consider:

Case 1. The OPT method merges $v_a$ with $v_b$ at any point during the xth level of compression. This case is equivalent to merging $v_a$ with $v_b$ in the next step by MDS and then compressing the rest of the network by OPT. In other words, MDS already takes the optimal compression step. Hence, $OPT(C^x_i, *^x) = 1 + OPT(C^x_{i+1}, *^x) \le 2 + OPT(C^x_{i+1}, *^x)$.

Case 2. The OPT method merges $v_a$ with $v_c$ at any point during the xth level of compression. The worst case scenario for the MDS method in this case is when $v_c$ is not connected to $v_d$ and the OPT method merges $v_b$ with $v_d$ in a later step. This way the OPT method optimally compresses four nodes down to two supernodes, namely $v_{ac}$ and $v_{bd}$. On the other hand, the MDS method creates a single supernode, $v_{ab}$, and the nodes $v_c$ and $v_d$ remain as singletons. However, even for this worst case, the MDS method prevents only one compression step from taking place with respect to OPT. Hence, $OPT(C^x_i, *^x) \le 2 + OPT(C^x_{i+1}, *^x)$.

Case 3. The OPT method merges $v_b$ with $v_d$ at any point during the xth level of compression. We can prove this similarly to Case 2 by symmetry.

Using Lemmas 1 and 2, Theorem 1 develops an upper bound on the number of compressions that can be missed by MDS with respect to the optimal compression.

Theorem 1 (Optimality bound for MDS). Let $P$ be a metabolic network with r connected components $C_1 = (V_1, E_1)$, ..., $C_r = (V_r, E_r)$ such that $P = \bigcup_{j=1}^{r} C_j$ and c be a positive integer given as the desired number of compression levels. Let $C = (V, E)$ denote an arbitrary connected component of $P$. Also, let s represent the number of intermediate steps for which no noncompressed node with degree one is found during the compression from $P$ to $P^c$ by the MDS method. Then, each of the following statements holds:
1. $OPT(C^{x-1}, *^x) \le 2\,MDS(C^{x-1}, *^x)$ for x = 1, ..., c
2. $OPT(P, *^c) \le s + MDS(P, *^c)$
3. $OPT(P, *^c) \le \min\{2\,MDS(P, *^c),\ s + MDS(P, *^c)\}$.

Proof 3. 1. This part follows from Lemma 1 and 2.
Lemma 1 states the case when the MDS method is equivalent to OPT. Lemma 2 gives an upper bound on the number of compression steps that MDS can miss. The worst case is when the boundary condition of Lemma 2 holds for each step of the xth compression level for $C^{x-1}$. In this case, the number of steps taken by the OPT method while compressing $C^{x-1}$ is two times the number for the MDS method.

2. This part also follows from Lemma 1 and 2. Throughout the compression of the entire network $P$ by c levels, each step of the MDS method that satisfies the condition in Lemma 2 can decrease the number of possible merge operations by one with respect to OPT. By simply counting these steps, at the end of the execution of the MDS method we can give the upper bound $s + MDS(P, *^c)$ on the number of optimal compressions $OPT(P, *^c)$.

3. Part 2 shows that $OPT(P, *^c) \le s + MDS(P, *^c)$. It is only necessary to show $OPT(P, *^c) \le 2\,MDS(P, *^c)$. Part 1 proves this result for a single connected component $C$ for the xth compression level. $P$ is given as $\bigcup_{j=1}^{r} C_j$ before the first level of compression. We know by Part 1 that $OPT(C_j, *^1) \le 2\,MDS(C_j, *^1)$. Summing this up for all j from 1 to r, we get $OPT(P, *^1) \le 2\,MDS(P, *^1)$. This inequality holds for each compression level x from 1 to c. Summation over x gives $\sum_{x=1}^{c} OPT(P^{x-1}, *^x) \le 2 \sum_{x=1}^{c} MDS(P^{x-1}, *^x)$. Hence, we prove $OPT(P, *^c) \le 2\,MDS(P, *^c)$.

Another way of interpreting Theorem 1 is to transform it to an upper bound on the size of the compressed network generated by MDS in terms of the one that can be obtained by OPT. By carrying out this transformation, we answer the question we pointed out in the introduction, which is "How far is our compression method from the optimal compression?". We do this as follows. Let $P$ be a network of size n. Given compression level c, consider the number of compression steps of the OPT method, $OPT(P, *^c)$. Also, let $n_{OPT}$ and $n_{MDS}$ denote the sizes of the compressed networks obtained by the OPT and MDS methods respectively. By the bound given in Theorem 1, we know that $MDS(P, *^c) \ge OPT(P, *^c)/2$. Therefore, we can write $n_{OPT} = n - OPT(P, *^c)$ and $n_{MDS} \le n - OPT(P, *^c)/2$. Also, we know by definition that $OPT(P, *^c) \le \sum_{x=1}^{c} n/2^x$. Using this inequality, we get:

$n_{OPT} \ge n - \sum_{x=1}^{c} \lfloor n/2^x \rfloor, \qquad n_{MDS} \le n - \sum_{x=1}^{c} \lfloor n/2^{x+1} \rfloor$   (3)

If we examine the ratio $n_{MDS}/n_{OPT}$, for c = 1 we get $n_{MDS}/n_{OPT} \le 3/2$ for arbitrary n (details omitted). This demonstrates that after one level of compression, the size of the compressed network found by our method is at most 1.5 times the size of the optimal network. For x = 1, 2, ..., c, this ratio is proportional to $(1.5)^x$. We can also use the bound on the number of compression steps given in the second statement of Theorem 1 to derive a similar upper bound on the size of the compressed network found by MDS. The tighter of these two upper bounds on the network size can be calculated during the execution of the MDS method and reported as an indicator of how much room is left for improving the compression.

Alignment framework
We described the first phase, namely the compression phase, in detail in previous sections. Here, we first summarize the base alignment method, SubMAP [10], that we use in our framework. Then, we explain the two remaining phases of our framework, namely the alignment phase and the refinement phase. The alignment phase follows the compression phase and utilizes the base method to find an alignment in compressed domain. The refinement phase applies the base method on the mappings found in the previous phase to further refine the alignment results. After describing all the phases, we analyze the complexity of each phase and combine them to obtain the complexity of the entire framework. Then, we examine the characteristics of the queries to determine which are likely to benefit from compression during the alignment, to answer the question of "When should we compress?"
Last, we provide a guideline for selecting the compression level that is expected to give the best performance gain reached by our framework with respect to the base alignment method.

Overview of SubMAP
Here, we take a small detour and explain SubMAP, a recent method for aligning metabolic networks when they are not compressed. We pick the SubMAP method for its high accuracy and biological relevance as it considers subnetworks of the given networks during the alignment. A subnetwork of a network is a subset of the reactions of that network such that the induced undirected graph of this subset is connected. Given two metabolic networks $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ and a positive integer k, SubMAP aims to find a set of mappings between the reactions of $P$ and $\bar{P}$ with the largest similarity score, such that: (i) each reaction in $P$ ($\bar{P}$) can map to a subnetwork of $\bar{P}$ ($P$) with at most k reactions, and (ii) each reaction of $P$ and $\bar{P}$ can appear in at most one mapping.

The first step of SubMAP is to create the set of all possible subnetworks of size at most k for each query network. We denote the number of these subnetworks for $P$ and $\bar{P}$ with $N_k$ and $M_k$ respectively. The second step of SubMAP is to calculate pairwise similarities between each pair of these subnetworks, one from $P$ and one from $\bar{P}$. Each subnetwork consists of reactions and each reaction is defined by its input and output compounds (i.e., substrates and products) and the enzymes that catalyze it. Therefore, we measure the pairwise similarities between subnetworks using reaction similarities, which in turn are defined by the similarities of the components of these reactions. For more details of this similarity score we refer the reader to Ay et al. [10].
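The first step above, enumerating every subnetwork of size at most k, can be illustrated with a small sketch. It follows the definition in the text (a reaction subset whose induced undirected graph is connected) but is not SubMAP's own enumeration routine; the function name and the grow-by-one-neighbor strategy are assumptions chosen for brevity.

```python
def connected_subnetworks(nodes, edges, k):
    """Return all node subsets of size at most k whose induced undirected
    graph is connected, as a set of frozensets."""
    # Undirected adjacency: edge direction is ignored for connectivity.
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    found = {frozenset([v]) for v in nodes}   # all subnetworks of size 1
    frontier = set(found)
    for _ in range(k - 1):
        # Every connected set of size s+1 extends some connected set of
        # size s by one adjacent node, so growing the frontier is exhaustive.
        grown = set()
        for sub in frontier:
            for v in sub:
                for u in adj[v]:
                    if u not in sub:
                        grown.add(sub | {u})
        found |= grown
        frontier = grown
    return found


# Hypothetical three-reaction chain a -> b -> c.
chain_nodes = ["a", "b", "c"]
chain_edges = {("a", "b"), ("b", "c")}
n_k2 = len(connected_subnetworks(chain_nodes, chain_edges, 2))
n_k3 = len(connected_subnetworks(chain_nodes, chain_edges, 3))
```

The sizes of the returned sets play the role of $N_k$ and $M_k$ for the two query networks.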
The step that dominates the time and space complexity of SubMAP is the third step. The aim of this step is to create a similarity score that combines pairwise similarities with the topological similarity of the networks. A data structure named the support matrix is created for this purpose. The size of this matrix is quadratic in terms of the number of subnetworks of both query networks. In other words, the support matrix requires $O(N_k^2 M_k^2)$ space. This complexity is very important as it is the dominating factor in the overall time and space complexity of SubMAP. The next two steps of the algorithm are to combine topological similarity with pairwise node similarities and to extract the alignment as a set of subnetwork mappings of $P$ and $\bar{P}$.

Alignment phase
The SubMAP method described above aligns the networks $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ in their original form. Our framework first compresses each of these networks to reduce their sizes and then aligns the compressed networks instead of $P$ and $\bar{P}$. In this section, we explain how we align the compressed networks $P^c$ and $\bar{P}^c$ that are in the compressed domain of level c using SubMAP with a given parameter k.

Let us first consider $P^c = (V^c, E^c)$. Each node $v_a$ in $V^c$ is a supernode of the reactions in $V$. Also, by the working of our compression method, we know that each supernode $v_a$ contains at most $2^c$ reactions. An edge from the node $v_a$ to the node $v_b$ exists in $E^c$ if and only if at least one reaction in $v_a$ has an edge to one reaction in $v_b$ in $E$. The same arguments hold for the other network $\bar{P}^c$ as well. To align these compressed networks, we consider their nodes, which are supernodes of reactions, as if they are the reactions of the metabolic networks $P^c$ and $\bar{P}^c$. This way, we can directly apply SubMAP to align these networks. As far as the operation of the SubMAP method is concerned, this is no different than aligning two networks that are identical to these networks but are in the original domain. The difference is in the interpretation of the intermediate steps and the form of the mappings found by the alignment. For instance, for the first step of SubMAP, we enumerate the reaction subnetworks of size at most k in the original domain, whereas in the compressed domain we enumerate the subnetworks of supernodes where each supernode can contain more than one reaction and the number of such supernodes in one subnetwork is at most k. Similarly, we calculate the pairwise similarity, the support matrix and the conflict graph for the subnetworks of supernodes (i.e., nodes of $V^c$) instead of subnetworks of reactions (i.e., nodes of $V$). The resulting alignment gives us a set of mappings between the subnetworks of $P^c$ and $\bar{P}^c$. We can think of these mappings as a high level view of the alignment between the networks $P$ and $\bar{P}$. For instance, from Figure 1(f) one can immediately see that the resulting alignment will map node a of the first network either to node a or to node b of the second network, and that these are the only options for node a, which is imposed by the higher level mapping between the two (a, b) supernodes. In the next phase, we consider each of these supernode mappings as smaller instances of the alignment problem and solve them to obtain a more refined alignment of $P$ and $\bar{P}$.

Refinement phase
Each mapping found by the alignment phase is a subnetwork pair where one is from $P^c$ and the other is from $\bar{P}^c$. The mappings found by SubMAP can have up to k nodes in one subnetwork and only one node in the other. If we denote a subnetwork of $P^c$ with $R^c_i$ and a subnetwork of $\bar{P}^c$ with $\bar{R}^c_j$, the resulting mappings of the alignment phase will be in the form $(R^c_i, \bar{R}^c_j)$. We can assume, without loss of generality, for this specific pair that $R^c_i$ contains up to k nodes of $P^c$ and $\bar{R}^c_j$ contains a single node of $\bar{P}^c$. Each node contained in either of these subnetworks is a supernode that contains either one node or two nodes and an edge between them in the previous level of compression, namely the (c - 1)th level. For both $R^c_i$ and $\bar{R}^c_j$, we decompress their nodes by one level by retrieving the connectivity between these nodes in the (c - 1)th compression level that was encapsulated in the cth level. This
decompression results in at most 2k nodes from the (c - 1)th level for $R^c_i$ and at most 2 nodes from the (c - 1)th level for $\bar{R}^c_j$. We then recursively align these smaller networks generated from $R^c_i$ and $\bar{R}^c_j$ by using SubMAP until the original domain (i.e., c = 0) is reached. At the (c - x)th recursive step, the sizes of the two networks to be aligned can be at most $k 2^x$ for one network and $2^x$ for the other.

Figure 1(f) illustrates this on a concrete example. The network on the left has two supernodes (i.e., (a, b) and (e, d)), each containing two nodes with an edge between them, and one supernode (i.e., (c)) which contains only one node from the previous level of compression. The one on the right has two supernodes with two nodes in each. To understand how decompression by one level works, we can focus on the supernode mapping of (e, d) to (c, d), which is found in compression level one. We can think of decompression as removing the circles that surround these supernodes to get back the connectivity within their nodes in the previous compression level. In our case, this leads to one small network on the nodes d and e and another on the nodes c and d. We align these small networks recursively using SubMAP and report their final alignment in only one recursive call since the compression level is only one for this case. Also, since k = 1 is used for the ease of this example, the sizes of the networks, in terms of the nodes in original domain, on each side are at most 2 for the recursive call from c = 1, as can be seen from Figure 1(f) (i.e., $k 2^c = 2^c = 2$ for k = c = 1).

Complexity analysis
Having finished the discussion of all three phases, we can now analyze the overall complexity of our framework. We start from the first phase, which is compression of the input networks $P$ and $\bar{P}$ by c levels. We first calculate the complexity of the first compression level for the network $P$ with size n. At each compression step, MDS first searches for a minimum degree node. Once it finds this node, it picks one of its neighbor nodes and merges these two nodes. After this merging, it updates the degrees of all the neighbors of each of the merged nodes. The first two of these operations take $O(\log n)$ time if proper data structures are used and the last one can take $O(n)$ in the worst case. Since the size of network $P$ is n, there can be at most $n/2$ compression steps during the first level of compression. Hence, the complexity of the compression for the first level is $O(n^2)$. Since the input size of this level is larger than that of all the next levels, we can safely assume that each of these next levels also takes $O(n^2)$, and the complexity of compression by c levels is therefore $O(cn^2)$. Even though this is not a tight bound, it is sufficient at this point, for the complexity of the next two phases will dominate it. Since we compress both networks, the overall complexity for the compression phase is:

$O(c(n^2 + m^2))$.   (4)

For the analysis of the next phases, we make two assumptions, both of which are supported by experimental evidence on the topological properties of metabolic networks. Our first assumption is that at each level of compression our method reduces the network size by half. In other words, if the sizes of our query networks are n and m, then the sizes of the compressed networks after c levels by the MDS method are $n_{MDS} = n/2^c$ and $m_{MDS} = m/2^c$ respectively. This is mainly because metabolic networks contain many nodes with low degrees [27]. Our experiments on a large dataset of networks summarized in Table 1 support this as well. The second assumption is that the number of subnetworks is a constant multiple of the network size for small k values. In other words, $N_{MDS} = a(k)\,n_{MDS}$ and $M_{MDS} = b(k)\,m_{MDS}$, where $a(k)$ and $b(k)$ are functions of k but are independent of n and m respectively. Our earlier analysis in Ay et al. [10] demonstrated that the number of subnetworks for k = 3, which is the largest k value we use here, is in the order of $5|V|$ for a large set of metabolic networks.
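To make the two assumptions concrete, the sketch below combines them with the quadratic support matrix size stated in the SubMAP overview to estimate how the number of support matrix entries shrinks with the compression level. It is a back-of-the-envelope illustration only: the function names are hypothetical, the multiplier 5 simply echoes the "order of $5|V|$" figure quoted for k = 3, and the network sizes are made up. The idealized ratio it produces is an asymptotic estimate and differs from the measured memory usage reported in the experiments (around 30% for c = 1).

```python
def compressed_size(n, c):
    """First assumption: each compression level halves the network size."""
    return n / 2 ** c


def subnetwork_count(multiplier, size):
    """Second assumption: the number of subnetworks of size at most k is a
    constant multiple (a(k) or b(k)) of the network size."""
    return multiplier * size


def support_matrix_entries(n, m, c, a_k, b_k):
    """Estimated number of support matrix entries, (N^2)(M^2), when both
    networks are compressed by c levels before alignment."""
    big_n = subnetwork_count(a_k, compressed_size(n, c))
    big_m = subnetwork_count(b_k, compressed_size(m, c))
    return (big_n ** 2) * (big_m ** 2)


# Hypothetical query: networks with 200 and 300 reactions, a(k) = b(k) = 5.
base = support_matrix_entries(200, 300, 0, 5, 5)
one_level = support_matrix_entries(200, 300, 1, 5, 5)
ratio = base / one_level   # idealized reduction factor of 2**(4*c)
```

Under these assumptions each additional level divides the estimate by $2^4$, which is why the alignment phase cost falls quickly with c while the refinement cost grows.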
We are now ready to analyze the complexity of the second phase, which is the alignment phase. By the first assumption, we know that the sizes of $P^c$ and $\bar{P}^c$ are $n_{MDS} = n/2^c$ and $m_{MDS} = m/2^c$ respectively. By the second, we have the number of subnetworks of these networks as $N_{MDS} = a(k)\,n_{MDS}$ and $M_{MDS} = b(k)\,m_{MDS}$ for a given k. Also, we know that the complexity of SubMAP is quadratic in terms of $N_{MDS}$ and $M_{MDS}$. Therefore, the complexity of the second phase is:

$O\left(\frac{a(k)^2\, b(k)^2\, n^2 m^2}{2^{4c}}\right)$   (5)

The complexity of the refinement phase has two factors in it. The first one is the number of mappings found by the alignment phase. Since we know that SubMAP allows each node of both networks to be reported in at most one mapping, we have a trivial upper bound on the number of possible mappings in terms of n and m. The biggest number of mappings is reported when all the subnetworks of both networks are singletons. In this case, the number of reported mappings is the minimum of n and m. We can assume without loss of generality that n < m [...] are at most $k 2^c$ on one side and at most $2^c$ on the other. The number of subnetworks that can be created from these networks are $a(k)\,k 2^c$ and $b(k)\,2^c$ for the corresponding sides. Therefore, each mapping can be refined by decompressing and applying SubMAP, which is $O(a(k)^2 k^2 2^{2c}\, b(k)^2 2^{2c})$. We do this refinement $O(n)$ times in the worst case, hence the complexity of the refinement phase is:

$O(a(k)^2\, b(k)^2\, n k^2 2^{4c})$.   (6)

Combining the results of Equations 4, 5 and 6, we can see that the overall complexity of our method is determined by the second or the third phase depending on the value of c. For small values of c and k such as 1, 2 and 3, the second phase dominates the overall complexity. Larger values of c result in a costlier refinement phase and a less expensive alignment phase. Very large values of k imply exponentially many subnetworks, in which case the above complexity analysis would not hold and the alignment problem may become intractable with or without compression.

When should we compress?
We discussed the potential of our framework for improving the scalability of existing network alignment methods.
However, there can be cases when the compression results in network topologies that force the alignment method to reach its worst case performance. In this section, we analyze when performing the alignment in the compressed domain is the better alternative. For this purpose, we devise a criterion that is inspired by the results of a large number of network alignments that are done by both of the methods. We find that the gain or loss in running time is highly dependent on the number of all possible subnetworks of the compressed and noncompressed networks. The numbers of these subnetworks can be determined in advance of the alignment. By formulating a criterion in terms of these numbers, we can make a decision between the two algorithms before actually performing an alignment.

Figure 4 illustrates the results for 3600 alignments performed by both of the methods on a wide range of network sizes with all possible combinations of $k$ and $c$ values. The x axis shows the running time of SubMAP minus the running time of our framework. The bigger this value is, the better the improvement we get from our framework. The y axis shows the ratio $y = \frac{N_k^c M_k^c}{N_k M_k}$, where $N_k$, $M_k$ denote the numbers of all subnetworks of $P$ and $\bar{P}$, and $N_k^c$, $M_k^c$ denote the numbers of all subnetworks of the compressed networks $P^c$ and $\bar{P}^c$. The dashed line passing through $y = 0.5$ visualizes our criterion. If the above ratio is below 0.5, then the number of all possible subnetworks generated by the compressed alignment is less than half of this number for the original alignment. A very large portion of the alignments (97%) satisfying this criterion show improvement in running time if compression is used. Above 0.5, only a small portion of the alignments (10%) show improvement. Considering the overhead of the refinement phase and the compression phase, this result is expected. These results strongly suggest that the answer to the question "When should we compress?" is "when $\frac{N_k^c M_k^c}{N_k M_k} \le 0.5$".

How much should we compress?

In this section, we provide a guideline for selecting a value for the compression level $c$ that results in the minimum expected running time, among other possible values, for our framework to align the query networks for a given $k$. In the proof of the theorem below, we make extensive use of the computational complexity results we discussed before. The theorem formulates the optimal $c$ for a given $k$ value and the two query networks with sizes $n$ and $m$. It answers the question "What is the right amount of compression that we need to use in order to minimize the running time of our framework?"

Theorem 2 (OPTIMAL LEVEL OF COMPRESSION) Let $P = (V, E)$, $\bar{P} = (\bar{V}, \bar{E})$ be two metabolic networks with sizes $n$ and $m$ respectively, and $k$ be a given positive integer. Assume without loss of generality that $n \le m$ [...]

$$\frac{a(k)^2\, b(k)^2\, n^2 m^2}{2^{4c}} + a(k)^2\, b(k)^2\, n k^2\, 2^{4c} \tag{9}$$

Our aim is to maximize (8) − (9) with respect to $c$. We know that this difference is negative (i.e., alignment in the compressed domain is costlier) when $c$ [...] $n$ (assuming $n$ [...]

13. Cheng Q, Harrison R, Zelikovsky A: MetNetAligner: a web service tool for metabolic network alignments. Bioinformatics 2009, 25(15):1989-90.
14. Kalaev M, Bafna V, Sharan R: Fast and accurate alignment of multiple protein networks. J Comput Biol 2009, 16:989-99.
15. Chen M, Hofestadt R: PathAligner: metabolic pathway retrieval and alignment. Appl Bioinformatics 2004, 3(4):241-252.
16. Li Z, Zhang S, Wang Y, Zhang XS, Chen L: Alignment of molecular networks by integer quadratic programming. Bioinformatics 2007, 23(13):1631-1639.
17. Li Y, de Ridder D, de Groot MJL, Reinders MJT: Metabolic pathway alignment between species using a comprehensive and flexible similarity measure. BMC Syst Biol 2008, 2:111.
18. Kuchaiev O, Milenkovic T, Memisevic V, Hayes W, Przulj N: Topological network alignment uncovers biological function and phylogeny. J R Soc Interface 2010, 7:1341-1354.
19. Chor B, Tuller T: Biological networks: comparison, conservation, and evolution via relative description length. J Comput Biol 2007, 14(6):817-838.
20. Pinter RY, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson M: Alignment of metabolic pathways. Bioinformatics 2005, 21(16):3401-3408.
21. Singh R, Xu J, Berger B: Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci USA 2008, 105:12763-12768.
22. Francke C, Siezen RJ, Teusink B: Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol 2005, 13(11):550-558.
23. Sridhar P, Kahveci T, Ranka S: An iterative algorithm for metabolic network-based drug target identification. Pac Symp Biocomput 2007, 12:88-99.
24. Ogata H, Fujibuchi W, Goto S, Kanehisa M: A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res 2000, 28:4021-4028.
25. Green ML, Karp PD: A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5:76.
26. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 1999, 27:29-34.
27. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407(6804):651-654.
28. Pfeiffer T, Soyer OS, Bonhoeffer S: The evolution of connectivity in metabolic networks. PLoS Biol 2005, 3(7):e228.
29. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science 2002, 297(5586):1551-1555.

doi:10.1186/1471-2105-13-S3-S2
Cite this article as: Ay et al.: Metabolic network alignment in large scale by network compression. BMC Bioinformatics 2012 13(Suppl 3):S2.
Metabolic network alignment in large scale by network compression

Ferhat Ay (1, 2; corresponding author; ferhatay@uw.edu), Michael Dang (dang@cise.ufl.edu), Tamer Kahveci (tamer@cise.ufl.edu)

1. Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
2. Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA

Source: BMC Bioinformatics 2012, 13(Suppl 3):S2 (Proceedings)
Supplement: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011. Editors: Sun Kim and Wei Wang. Publication of this supplement has been supported by NSF support number NSF IIS-1137427: III: Small: Women in Bioinformatics Initiative at ACM-BCB 2011.
Conference: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011 (ACM-BCB), Chicago, IL, USA, 1-3 August 2011, http://acmbcb.org/
ISSN: 1471-2105
Published: 21 March 2012
URL: http://www.biomedcentral.com/14712105/13/S3/S2
DOI: 10.1186/1471-2105-13-S3-S2

Copyright 2012 Ay et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract

Metabolic network alignment is a system scale comparative analysis that discovers important similarities and differences across different metabolisms and organisms. Although the problem of aligning metabolic networks has been considered in the past, the computational complexity of the existing solutions has so far limited their use to moderately sized networks. In this paper, we address the problem of aligning two metabolic networks, particularly when both of them are too large to be dealt with using existing methods. We develop a generic framework that can significantly improve the scale of the networks that can be aligned in practical time. Our framework has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we develop an algorithm which transforms the given networks to a compressed domain where they are summarized using fewer nodes, termed supernodes, and interactions. In the second phase, we carry out the alignment in the compressed domain using an existing network alignment method as our base algorithm. This alignment results in supernode mappings in the compressed domain, each of which is a smaller instance of the network alignment problem. In the third phase, we solve each of these instances using the base alignment algorithm to refine the alignment results. We provide a user defined parameter to control the number of compression levels, which generally determines the tradeoff between the quality of the alignment and how fast the algorithm runs. Our experiments on the networks from the KEGG pathway database demonstrate that the compression method we propose reduces the sizes of metabolic networks by almost half at each compression level, which provides an expected speedup of more than an order of magnitude. We also observe that the alignments obtained by only one level of compression capture the original alignment results with high accuracy.
Together, these suggest that our framework produces alignments that are comparable to those of existing algorithms and can do this with practical resource utilization for large scale networks that existing algorithms could not handle. As an example of our method's performance in practice, the alignment of organism-wide metabolic networks of human (1615 reactions) and mouse (1600 reactions) was performed in under three minutes using only a single level of compression.

Background

Biological networks provide a compact representation of the roles of different biochemical entities and the interactions between them. Depending on the types of entities and interactions, these networks are segregated into different types, where each network type encompasses a particular set of biological processes. Protein-protein interaction (PPI) networks comprise binding relationships between two or more proteins to carry out specific cellular functions such as signal transduction. Regulatory networks consist of interactions between genes and gene products to control the rates at which genes are transcribed. Metabolic networks represent sets of chemical reactions that are catalyzed by enzymes to transform a set of metabolites into others to maintain the stability of a cell and to meet its particular needs. Analysis of the connectivity properties of these networks has proven to be crucial in uncovering the details of the cell machinery and in revealing the functional modules and complexes involved in this mechanism [1-4]. An essential type of network analysis is comparative analysis, which aims at identifying functionally similar elements or element sets shared among different organisms; this would not be possible if these elements were only considered individually. This is often achieved through alignment of the networks of these organisms.
Analogous to sequence alignment, which identifies conserved sequences, network alignment reveals connectivity patterns that are conserved among two or more organisms. A number of studies have been done to systematically align different types of biological networks [5-21]. For metabolic networks, Pinter et al. [20] devised an algorithm that aligns query networks with specific topologies by using a graph theoretic approach. Recently, some of us developed an algorithm that combines both topological features and homological similarity of pairwise molecules to align metabolic networks [8]. We also proposed a method, SubMAP [9,10], that incorporates subnetwork mappings in metabolic network alignment. A similar method, IsoRank [21], has been applied to find the alignments of PPI networks. IsoRankN [11] extended this algorithm to work for multiple networks and to allow mappings of protein clusters. Comparative analysis is important particularly for large metabolic networks such as organism-wide networks. Identification of the conserved patterns among metabolic networks across species provides insights for metabolic reconstruction of a newly sequenced genome [22], orthology detection [21], drug target identification [23] and identification of enzyme clusters and missing enzymes [24,25]. However, aligning large scale networks is a computationally challenging problem due to the underlying subgraph isomorphism problem that has to be solved to find the alignment that maximizes the similarity between the query networks. The methods we mentioned above restrict the query topologies, their sizes, or both. Even under these conditions, the running times and memory utilization of these methods can still be prohibitive for large query networks. For instance, the method of Pinter et al. [20] takes around one minute per alignment on a dataset with only small networks ranging from 2 to 41 nodes.
Our earlier method, SubMAP, has no limitations on the query topologies and allows mappings of node sets that are connected (i.e., subnetworks). However, allowing subnetworks comes at the cost of increased running time, which is inherent because the number of all connected subnetworks up to a given size can be exponential in the size of the network. For a network of size 80 and subnetwork sizes up to 3, SubMAP takes around 6 minutes and 150 MB of memory on average per alignment against a database of networks with an average size of 50. Therefore, improving the running time and memory utilization of these methods is necessary to leverage the alignment of larger scale networks, especially when subnetwork mappings are allowed.

In this paper, we develop a framework that significantly improves the scale of the networks that can be aligned using existing algorithms. Our framework has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we develop a compression method that reduces the size of the input metabolic networks by a desired rate. In other words, we transform the query networks from their original domains (see Figure 1(a)) to a compressed domain (see Figure 1(d)). A single node in the compressed domain corresponds to a set of connected nodes and the edges between them in the original domain. We call each such node in the compressed network a supernode. For instance, Figure 1(d) depicts the compressed networks of the two input networks in Figure 1(a) when each supernode is allowed to contain up to two nodes (i.e., only one level of compression is allowed). In the second phase, we carry out the alignment in the compressed domain by using an existing network alignment algorithm, which is SubMAP in this paper, as our base method. Once the compressed networks are aligned, we next consider each mapping of supernodes found by the alignment phase individually.
Each such mapping suggests a smaller instance of network alignment. Figure 1(f) demonstrates this where two such instances exist. For each of these mappings, we solve the alignment problem using the base algorithm. At the end of this refinement phase, the final mappings of reactions are extracted (see Figure 1(g)), transforming the alignment back to the original domain.

Figure 1: Aligning two metabolic networks with and without compression. Top figures (a-c) illustrate the steps of alignment without compression. Bottom figures (d-g) demonstrate different phases of alignment with compression using our framework. (a) Two hypothetical metabolic networks with 5 and 4 reactions respectively. Directed edges represent the neighborhood relations between the reactions. (b) Support matrix of size 20×20 needed for the alignment if compression is not used. We only show the nonzero entries of a single row that corresponds to the topological support given by the b-b' mapping to possible mappings of its backward and forward neighbors. Five such mappings, supported equally, are denoted by entries of 1/5 in the matrix, namely the a-a' mapping for the backward neighbors and the c-c', c-d', d-c' and d-d' mappings for the forward neighbors. (c) The resulting reaction mappings of alignment without compression. (d) Query networks shown in (a) in the compressed domain after one level of compression. (e) Support matrix of size 6×6 needed for the alignment with compression. We only show the entries for the mappings supported by the (a, b)-(a', b') mapping. (f) The resulting mappings from the alignment in the compressed domain. (g) The resulting reaction mappings after the refinement phase of our framework.

We can best motivate the need for such a framework on an example.
Figure 1 illustrates the difference between aligning two metabolic networks in the compressed domain versus aligning them in the original domain without compression. If we use a base alignment algorithm such as SubMAP or IsoRank, the time and space complexity of the algorithm is determined by the size of a data structure named the support matrix [10,21]. Conceptually, this data structure governs the topological similarities between every pair of reaction tuples. Each reaction tuple contains one reaction from each of the two query metabolic networks. A detailed description of this matrix can be found in previous articles describing IsoRank [21] and SubMAP [10]. The size of this support matrix is quadratic in terms of both $n$ and $m$ (i.e., $O(n^2 m^2)$) for IsoRank, and for SubMAP when only subnetworks of size one are allowed. Figures 1(b) and 1(e) illustrate the support matrices required for alignment starting from the networks shown in Figures 1(a) and 1(d) respectively. As a result of compression by only one level, the size of the matrix we need to create drops from 20×20 (400 entries) to 6×6 (36 entries), which translates into more than an order of magnitude improvement in theoretical resource utilization compared to the base method.

Notice that when we compress the network more (i.e., increase the number of compression levels), the compressed network gets smaller in terms of its number of nodes and edges. As a result, we can expect to align the compressed networks faster. However, this comes at the price of two drawbacks, both due to the fact that each supernode contains multiple nodes from the original domain. First, once we find a mapping for the supernodes in the compressed domain, we still need to align the nodes of each supernode pair. For example, after mapping the supernodes (a, b) and (a', b') shown in Figure 1(f), we need to align the two subnetworks induced by these two supernodes.
Thus, as the size of the supernodes grows (i.e., as we compress for more levels), the size of the smaller problem instances grows as well, and the resource utilization bottleneck shifts from the alignment phase to the refinement phase. Second, when we use compression the resulting alignment may not be the same as the one found by the original algorithm. For example, one out of four mappings in Figure 1(g) (i.e., e-c') is different from the result of the base algorithm shown in Figure 1(c) (i.e., e-e'). This brings the need to define a measure of consistency between the results of alignments with and without compression, which can be used as an indicator of accuracy for the framework we propose here. We calculate this accuracy as the correlation of the scores calculated for each possible mapping found by our framework in the compressed domain with the scores for these mappings in the original domain found by the base method. Bigger compression rates generally mean less similarity between the results of the two methods (i.e., less accuracy). Several key questions that follow from these observations are:

1. How does compression affect the alignment accuracy with respect to the base network alignment method?
2. How far is our compression method from an optimal compression that produces the compressed network with the minimum number of nodes?
3. When is it a good idea to do the alignment in the compressed domain, taking into account the overhead of the compression and refinement phases?
4. What is the right amount of compression? That is, when does compression minimize the running time of our overall framework?

In the rest of the paper we address each of these questions in detail. At this point, it is important to notice the potential for leveraging the alignment of larger scale networks by the framework we are proposing.
The actual performance gain for an alignment will depend on the level of compression we use, the topologies of the query networks and the complexity of the base alignment method.

Results overview

Our experiments on metabolic networks extracted from the KEGG pathway database [26] demonstrate that our compression method reduces the number of nodes and edges by almost half at each level of compression. As a result of this reduction, we observe a significant improvement in the running time and memory utilization of our earlier alignment algorithm SubMAP. Lastly, we analyze the accuracy of our framework as compared to the base alignment algorithm. The results suggest that the alignment obtained by only one level of compression captures the original alignment results with very high accuracy, and the accuracy decreases with further levels of compression.

Technical contributions

- We devise an efficient framework for the network alignment problem that employs a scalable compression method which shrinks the given networks while respecting their topology.
- We prove the optimality of our compression method under certain conditions and provide a bound on how much our compression results can deviate from the optimal solution in the worst case.
- We provide a mathematical formulation that serves as a guideline to select an optimal number of compression levels depending on the input characteristics of the alignment.
- We characterize the cases for which the proposed framework is expected to provide significant improvement in alignment performance.

In the next section, we report our experimental results on a set of large scale metabolic networks that are constructed by combining networks from the KEGG Pathway database [26]. The details of the network compression method we propose here and the other phases of our framework are described in the methods section.

Results and discussion

In this section, we experimentally evaluate the performance of our framework.
First, we measure the compression rates achieved for different levels of compression with the minimum degree selection (MDS) method that we propose here. Next, we analyze the changes in degree distribution and large scale organization of organism-wide metabolic networks with increasing compression levels. We then examine the gain in running time and memory utilization achieved by our framework for different values of the compression level (c) and subnetwork size (k) parameters. Last, we examine the accuracy of the alignments we find, measured as the Pearson correlation coefficient between the scores of mappings calculated by our framework and the ones calculated by the base algorithm we use.

Dataset

We use the metabolic networks from the KEGG pathway database [26]. For our medium scale dataset, we downloaded all metabolic networks with at least 10 reactions for 10 different organisms. This resulted in 620 metabolic networks in total with sizes ranging from 10 to 97. In order to obtain our large scale dataset, we first combined all the metabolic networks that belong to one of the 9 different metabolism categories in the KEGG database to create a complete metabolism network for each metabolism for 10 selected organisms (Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit fly), Arabidopsis thaliana (thale cress), Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (budding yeast), Staphylococcus aureus COL (MRSA), Escherichia coli K12 MG1655, Pseudomonas aeruginosa PAO1). We obtain the organism-wide metabolic networks by combining all the listed networks in KEGG for each of these organisms. In total, we have 100 networks with sizes ranging from 5 to 1615 (9 complete metabolism networks plus 1 organism-wide network for each of the 10 organisms). Below is the list of metabolism categories we use.

1. Carbohydrate Metabolism
2. Energy Metabolism
3. Lipid Metabolism
4. Nucleotide Metabolism
5. Amino Acid Metabolism
6. Metabolism of Other Amino Acids
7. Glycan Biosynthesis and Metabolism
8. Metabolism of Cofactors and Vitamins
9. All Amino Acids (Amino Acid + Other Amino Acids)

Implementation and system details

We implemented our compression and alignment algorithms in C++. We ran all the experiments on a desktop computer running Red Hat Enterprise Client 5.7 with 4 GB of RAM and two dual-core 2.40 GHz processors.

Evaluation of compression rates

The efficiency of our alignment framework depends on how much the query metabolic networks can be compressed. For this reason, in this experiment, we measure the number of nodes and edges of the metabolic networks in our large scale dataset before and after compression. The minimum degree selection (MDS) method we describe in this paper compresses the query metabolic networks by selecting the first node among the list of nodes with minimum degree at each intermediate step and by compressing it with one of its neighbors. In order to evaluate the stability of this compression method, we examined the effect of the node selection strategy on the size of the resulting compressed networks. By randomizing the step at which we select a node among the set of minimum degree nodes, we generated 100 different compressed networks for each of the input metabolic networks. In the following, we examine how much compression we achieve by the MDS method and also analyze its stability with respect to the compressions achieved by randomizing the node selection step.

Table 1 summarizes the compression rates achieved by our method for networks of different sizes. We divide all the metabolic networks in our dataset into bins according to the number of their reactions (i.e., network size). The first column in Table 1 lists the network size intervals we used for each group.
Notice that the gaps in the size intervals are due to the fact that organism-wide networks are of size 850 and larger whereas the other combined networks for nine different metabolism categories have sizes below 400. Each row of this table shows the number of nodes and edges averaged over all the networks in this group before and after compression. The two columns with c = 0 correspond to the average number of nodes and edges of the networks with no compression respectively. For c ∈ {1, 2, 3}, we split each row corresponding to an interval into two. The upper part (MDS) denotes the average node and edge numbers for the compressed network if the MDS method is used as originally described (i.e., the first among the list of minimum degree nodes is selected and combined with its first neighbor at each compression step). The lower part (random) represents the numbers gathered when we introduce randomization in this node selection. Each value in the lower part denotes the average of the corresponding value over these 100 different runs of compression.

Table 1: Summary of compression rates for all the networks in our large scale dataset

Network size               Average number of nodes                 Average number of edges
interval       Method      c = 0    c = 1    c = 2    c = 3      c = 0    c = 1    c = 2    c = 3
[0, 100)       MDS          41.5     26.5     19.1     15         83.5     55.2     36.3     23.6
               random                26.5     19.1     14.8                55.5     36.5     23.5
[100, 200)     MDS         154.8     92.4     61.3     48.6      310.1    174.9    116.5     96.3
               random                92.2     61.5     48.6               174      118.1     94.6
[200, 300)     MDS         240.5    139.1     89.2     69.4      508.1    296.5    230.5    187.8
               random               139.4     89.1     69.7               298.4    228.4    188.1
[300, 400]     MDS         344.9    207.3    133.1    103        585.7    372.9    302.7    261.6
               random               207.6    133.8    104.5               373.5    300.4    259.9
[850, 1250]    MDS        1080.5    623.2    406.8    311.3     3727     2269     1732.7   1584.8
               random               623.7    407.9    311.9              2280.6   1733.8   1587.5
[1500, 1615]   MDS        1576.5    909      582      447.8     4740     2955.2   2283.5   2128.8
               random               910      583      444.6              2964.3   2279.3   2129.6

We create six intervals according to the number of reactions in these networks.
Each row, corresponding to one such interval, shows the average number of nodes and edges before compression (i.e., c = 0) and after compression of different levels (i.e., c ∈ {1, 2, 3}). For each row, top entries correspond to numbers obtained with the MDS method, which selects the first node from the list of nodes with minimum degree at each intermediate step and compresses it with its first neighbor from the list of its neighbors. The bottom entries correspond to the averages of 100 different compressions, which are gathered by randomizing the step at which a node is selected among the set of minimum degree nodes.

One conclusion that can be drawn from Table 1 is that, independent of the network size, our compression method performs well in practice. On average, each level of compression yields network sizes that are 57% to 64%, 64% to 71% and 77% to 80% of the network sizes at the previous compression level for c = 1, 2 and 3 respectively. In other words, our method compresses the entire dataset down to approximately 60%, 40% and 30% of the sizes of the original networks for c = 1, 2 and 3 respectively. These rates suggest that our framework has great potential in scaling network alignment to large metabolic networks by compression. As an example, consider the row corresponding to interval [850, 1250] in Table 1. We see that instead of aligning networks with 1080 nodes and 3727 edges on average, we can apply two levels of compression first and do the alignment with significantly smaller networks that have only 407 nodes and 1733 edges on average. Another observation is that we get most of the reduction in network size after the first compression level. That is, our method compresses the networks aggressively for c = 1 and reduces them to 57% to 64% of their original size, which is close to half of the size of the networks. As we go up in the levels of compression, the actual rate of compression achieved at one level reduces.
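The percentages quoted above can be recomputed directly from the node counts in Table 1. The short Python sketch below does this for the MDS rows; the variable and function names are hypothetical and only the numbers are taken from the table.

```python
# Average node counts copied from the MDS rows of Table 1,
# listed as [c = 0, c = 1, c = 2, c = 3] for each size interval.
TABLE1_NODES = {
    "[0, 100)": [41.5, 26.5, 19.1, 15.0],
    "[100, 200)": [154.8, 92.4, 61.3, 48.6],
    "[200, 300)": [240.5, 139.1, 89.2, 69.4],
    "[300, 400]": [344.9, 207.3, 133.1, 103.0],
    "[850, 1250]": [1080.5, 623.2, 406.8, 311.3],
    "[1500, 1615]": [1576.5, 909.0, 582.0, 447.8],
}


def per_level_ratios(sizes):
    """Size after each level divided by the size at the previous level."""
    return [after / before for before, after in zip(sizes, sizes[1:])]


def cumulative_ratios(sizes):
    """Size after each level divided by the uncompressed size (c = 0)."""
    return [size / sizes[0] for size in sizes[1:]]


for interval, sizes in TABLE1_NODES.items():
    step = ", ".join(f"{r:.1%}" for r in per_level_ratios(sizes))
    total = ", ".join(f"{r:.1%}" for r in cumulative_ratios(sizes))
    print(f"{interval:>13}  per level: {step}  cumulative: {total}")
```

The per-level ratios increase from c = 1 to c = 3 in every row, which is the diminishing return of additional compression levels noted above.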
Considering the fact that having an input network which can lead to the best possible compression (i.e., reducing its size from $n$ down to $\lceil n/2 \rceil$, or 50%, at each level of compression) is a rare event, the observed compression rates suggest that our method provides an efficient compression for metabolic networks in practice.

This experimental setup also suggests that the MDS method is stable with respect to the choice of the node to compress as long as that node is selected among the nodes with minimum degree. Among the six rows and three columns (18 entries) of Table 1 for the average number of nodes after compression, only one has a difference larger than two between the original size and the randomized average. The results of this experiment suggest that our compression method, MDS, serves as an efficient and stable first phase for our alignment framework by achieving good compression rates on a large dataset of metabolic networks.

Changes in degree distributions with compression

Even though the compression rates we achieve with MDS as described above suggest a significant reduction in the problem size, we observe that there is a noticeable difference between the compression rates achieved by going from one compression level to the next. For instance, on average the networks shrink to 57% to 64% of their size going from c = 0 to c = 1, whereas they only shrink to 76% to 80% of their size going from c = 2 to c = 3. This suggests that the large scale organization of the networks changes with increasing levels of compression. Even though a change in the network structure can be expected as a result of our compression, it is not obvious how to quantify this change and whether the change is consistent among different metabolic networks.
In order to understand the reason behind the different compression rates at different compression levels, we examined the degree distributions of the ten organism-wide networks in our dataset. For each of these networks, we plotted the histogram of out-degree distributions for different levels of compression. Figure 2 plots the frequencies of each out-degree in the range [2,40] for each c ∈ {0, 1, 2, 3, 4} for these networks. We observe that in each of these plots the degree distributions for c = 0 and c = 1 are very similar and follow a power-law distribution, which is an indicator of scale-free network topology. This is not surprising, since the scale-free topology has been observed in numerous articles in the literature as a common signature of different metabolic networks [27-29]. The similarity between the degree distributions of the original networks (c = 0) and the networks compressed by only one level (c = 1) signifies that the networks still conserve their scale-freeness after the first level of compression.

Figure 2. Shift of out-degree distributions from power law to uniform. Changes in the out-degree distributions of ten organism-wide metabolic networks with increasing levels of compression. We calculate the frequencies of each out-degree in the range [2,40] for c ∈ {0, 1, 2, 3, 4} and plot them together for each of the ten organisms in our dataset. Out-degree distributions for organism-wide metabolic networks of (a) Arabidopsis thaliana (thale cress), (b) Caenorhabditis elegans (nematode), (c) Drosophila melanogaster (fruit fly), (d) Escherichia coli K12 MG1655, (e) Homo sapiens (human), (f) Mus musculus (mouse), (g) Pseudomonas aeruginosa PAO1, (h) Rattus norvegicus (rat), (i) Staphylococcus aureus COL (MRSA), (j) Saccharomyces cerevisiae (budding yeast).
A more interesting observation is that there is a consistent shift from the power-law degree distribution toward the uniform distribution with increasing c values for each of the ten networks. It is important to clarify that our claim is not that the degree distribution becomes uniform for large c values, but rather that the degree distributions for large c values are more similar to the uniform distribution (and less similar to the power-law distribution) than those obtained with smaller c values. To quantify this on an example, we look at one of the most discernible characteristics of scale-free networks, and hence of the power-law distribution, namely the small number of hub nodes with large degrees. If we consider the organism-wide network of Homo sapiens (Figure 2(e)), which is the largest network in our dataset, and focus on the percentage of nodes with out-degree greater than 15, we get percentages of 3%, 4%, 6.5%, 11.5% and 12.4% for c values of 0, 1, 2, 3 and 4 respectively. This indicates that the proportion of nodes that can be considered hubs increases significantly with increasing levels of compression. This increase deteriorates the scale-freeness of the Homo sapiens network, which in turn decreases the achieved compression rates. A similar trend is observed for each of the other nine organism-wide networks, which are plotted separately in Figure 2. The results of this experiment show that there is a consistent change in the network topology when multiple levels of compression are used. The difference we observe here between the first level of compression and later levels is likely to be one of the main reasons for the significant differences in both the performance and the accuracy of our framework, which are discussed in the remainder of the results section.
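The hub statistic used above (the share of nodes whose out-degree exceeds 15) is simple to compute from an out-degree sequence. The sketch below is only an illustration: the function name `hub_fraction` and the toy degree list are ours, and the numbers are made up rather than taken from the Homo sapiens network.

```python
def hub_fraction(out_degrees, threshold=15):
    """Fraction of nodes whose out-degree is strictly greater than threshold."""
    if not out_degrees:
        return 0.0
    hubs = sum(1 for d in out_degrees if d > threshold)
    return hubs / len(out_degrees)


# Toy out-degree sequence, invented for illustration (not data from the paper).
toy_out_degrees = [1, 2, 2, 3, 16, 20, 4, 1, 30, 2]
toy_fraction = hub_fraction(toy_out_degrees)
print(f"{100 * toy_fraction:.1f}% of the toy nodes have out-degree > 15")
```

Applying the same function to the out-degree sequence of a network at each compression level would give one percentage per value of c, which is the kind of series reported for Homo sapiens.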
Evaluation of running time and memory utilization

In order to understand the capabilities and limitations of our framework, we examine its running time and memory utilization on a set of large scale networks constructed as described in the dataset section. We have ten networks for each of the ten organisms in our dataset. For each organism, nine of these networks constitute different metabolism categories and the tenth network is the organism-wide metabolic network. In total, we have 100 networks with sizes ranging from 5 to 1615. For each parameter setting (different combinations of k ∈ {1, 2} and c ∈ {0, 1, 2, 3}), we aligned each of these 100 networks with each other network (including itself), resulting in a total of 5500 alignment queries. When the value of c is equal to zero, the alignment is carried out completely by a single application of SubMAP without any compression. This provides a mechanism to measure how much performance gain is achieved by our compression based framework with respect to SubMAP. Figure 3(a) illustrates the average query running times in a log-log plot where the x-axis is the size of the query, measured as the product of the numbers of reactions of the metabolic networks that are aligned. We grouped queries into logarithmic bins according to the query sizes. The first bin contains all the queries of size less than or equal to 64. The next bins contain the queries of size in the interval $[2^{i+5}, 2^{i+6}]$ where i = 2, 3, ..., 17. For each parameter setting we display the average running time of all the queries in each bin. For both k = 1 and k = 2, we plot the results for all four compression values and also draw the fitting curves to better illustrate the trend in the increase of running time.

Figure 3. Resource utilization of our framework.
The average (a) running time and (b) memory utilization of our framework when each query network in our large scale dataset is aligned with all the networks (including itself) in the same dataset. The x-axis is the query size, which is calculated as the product of the sizes (i.e., number of reactions) of the metabolic networks aligned. c = 0 denotes the alignments performed with no compression. c ∈ {1, 2, 3} denotes the results of our framework, which compresses both of the query networks by c levels before aligning them.

For k = 1, we can immediately observe that each additional compression level improves the running time over the previous one for all query sizes. We obtain the largest fold change in running time with the first level of compression. This is expected considering that the first level of compression achieves the largest compression rate, as shown in Table 1. The second compression level improves the running time by a smaller factor than the first level and by a larger factor than the third level. For k = 1 we were able to plot all the points for all c values, as the running time for even the largest query (i.e., the human organism-wide network vs itself, which has size 1615*1615) with no compression (i.e., c = 0) is still practical, around 12 minutes (with c = 3 this drops to <40 seconds). A similar trend of improved running times with increasing c is also observed for queries up to a certain size for k = 2. With only one level of compression (c = 1) we observe a significant improvement in running times for queries of all sizes. However, starting from the bin $[2^{13}, 2^{14}]$, compressing the networks by more than one level (c > 1) shows a consistent adverse effect on the running time.
This implies that when both query networks have sizes around 150 or larger and k > 1 is used, compressing the networks by more than one level before performing the alignment suffers from the explosion in the number of possible subnetworks of size at most k in the compressed domain. We explore this in more detail later in the paper (see Figure 4 and its discussion).

Figure 4. Gain/loss in running time. Gain/loss in running time of alignment by using our framework with respect to the base alignment method (x-axis) versus the ratio of the number of all possible subnetwork mappings in the compressed domain to this number in the original domain. The blue vertical line shows when the two methods take exactly the same amount of time, or when both methods take a very short amount of time in the case of small query networks. Points on the right (left) hand side of this line indicate a gain (loss) in running time. The dashed line is our decision criterion for predicting whether there will be a gain or a loss before doing the alignment.

An important aspect of our framework is that it makes it possible to align networks that could not be aligned with our base method. For k = 2, we observed that in the original domain (c = 0) a significant portion of the large queries did not finish within the cutoff time, which we set to one hour. For instance, among 252 possible queries with sizes in the interval $[2^{17}, 2^{18}]$, 96 did not complete successfully for c = 0, whereas with c = 1 all of them completed. For the next bin, 45 out of 223 possible queries completed for c = 0, and for c = 1 this number increased to 185. These results indicate that by using the correct amount of compression, we can align larger networks than the base alignment method SubMAP.
We believe this is an important step in leveraging organism-wide network alignments with subnetwork mappings, for they provide a more complete picture of functional similarities and evolutionary differences between the metabolic networks of two or more organisms. Figure 3(b) presents the estimated memory required for the support matrix, which is the memory bottleneck of the alignment algorithm. For this figure, we use the same query set as in Figure 3(a), and hence the same x-axis. On average, the memory required for alignment with c = 1 is around 30% of that needed for alignment with no compression using the SubMAP method, for both k = 1 and k = 2. For k = 1, the memory utilization decreases with each additional compression level (on average, around 45% of the memory required for c = 1 is used when c is increased to 2, and around 65% of the memory required for c = 2 is used when c is increased to 3). For k = 2, in concordance with the running time results, only one level of compression provides better memory utilization for all network sizes, whereas compressing by more than one level has an adverse effect for medium and large scale queries. These results suggest that our framework has great potential to provide significant improvement in both the running time and the memory utilization of the base alignment method. This allows us to align, on the same hardware, large networks that could not be aligned by existing methods.

Accuracy of the alignment results

We conclude our experimental results by answering the first question introduced earlier in the paper, that is, "How does compression affect the alignment accuracy?". In order to answer this, we calculate the correlation between the scores of each possible mapping in the compressed domain and the scores that we obtain for these mappings from the original SubMAP method. We consider the scores of each possible subnetwork mapping of compressed nodes found by our framework.
Since the mappings found by SubMAP are not of the same form as the mappings in the compressed domain, we calculate a score value for each mapping in the compressed domain by using the scores of the mappings found by SubMAP in the original domain. This way, we get two sets of score values, one from SubMAP and one from our framework, for the same set of mappings. We calculate Pearson's correlation coefficient between these two sets of scores as an indicator of the similarity between the results of the two methods. Before looking at the correlation values we found, it is important to describe how we calculate the score for a mapping in the compressed domain from the mappings of SubMAP. Let $P^1$ and $\bar{P}^1$ denote the one level compressed forms of two metabolic networks. Let $(v_1 - \{\bar{v}_1, \bar{v}_2\})$ denote a mapping in the compressed domain, where $v_1$ is a subnetwork of $P^1$ and $\{\bar{v}_1, \bar{v}_2\}$ is a subnetwork of $\bar{P}^1$. Also, let $v_1 = \{r_1, r_2\}$, $\bar{v}_1 = \{\bar{r}_1, \bar{r}_2\}$ and $\bar{v}_2 = \{\bar{r}_3\}$. The edge that maps these two subnetworks has a mapping score in the compressed domain; let us denote it by $e^1$ for c = 1. We want to compute a mapping score, say $e$, for $(v_1 - \{\bar{v}_1, \bar{v}_2\})$ from the mappings in the original domain that is comparable to $e^1$. This subnetwork mapping in the compressed domain contains six possible mappings in the original domain, namely $(r_1, \bar{r}_1)$, $(r_1, \bar{r}_2)$, $(r_1, \bar{r}_3)$, $(r_2, \bar{r}_1)$, $(r_2, \bar{r}_2)$ and $(r_2, \bar{r}_3)$. Let us denote the scores of these mappings in the original domain by $e_i$ for i = 1, 2, ..., 6, in the order listed.
Then, we compute the mapping score $e$ as $e = \frac{1}{6} \sum_{i=1}^{6} e_i$. It is important to note that this score is a conservative choice among the possible scoring options. This is because the average can include mapping scores of subnetworks with very low similarities from the original domain of SubMAP. This can underestimate the correct mapping score $e$ and hence degrade the correlation between the compressed domain and original domain mapping scores. Overall, for each mapping in the compressed domain with a score $e^c$, we calculate the corresponding score $e$ in the original domain using this average score. Table 2 summarizes the correlation values found from a set of 3600 alignments (400 alignments for each parameter combination of k ∈ {1, 2, 3} and c ∈ {1, 2, 3}). We calculate the correlation of each query with the alignment that has the same k value but is in the original domain (i.e., c = 0). Table 2 shows the average correlation values of these 400 alignments for each combination of k and c values. The first column indicates that the alignment found by using only one compression level is highly similar to the alignment found by directly using the base method. Combining this with the running time gain in Figure 3(a) for c = 1, we can argue that compression by one level not only provides a significant improvement in running time but also accurately captures a very high percentage of the original alignment results, which makes it very useful for practical purposes. The accuracy measured in terms of correlation drops to 0.57 on average when we perform the second level of compression and to 0.51 for the third level.
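The scoring and comparison procedure described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the helper names are hypothetical, the score values are invented, and Pearson's coefficient is written out by hand so that the block has no dependencies.

```python
from math import sqrt


def original_domain_score(pair_scores):
    """Average of the original-domain scores e_i of all reaction pairs
    covered by one compressed-domain mapping (six pairs in the example)."""
    return sum(pair_scores) / len(pair_scores)


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented scores for three compressed-domain mappings (illustration only).
compressed_scores = [0.70, 0.40, 0.90]            # e^c from the framework
pair_scores_per_mapping = [                        # e_i from the base method
    [0.6, 0.4, 0.2, 0.8, 0.5, 0.5],
    [0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
    [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
]
averaged_scores = [original_domain_score(s) for s in pair_scores_per_mapping]
print(pearson(compressed_scores, averaged_scores))
```

Because the average runs over all reaction pairs, a single well-matched pair inside a supernode mapping is diluted by poorly matched ones, which is why the text calls this choice conservative.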
Table 2. Correlation of the mapping scores found with and without compression

k \ c    c = 1    c = 2    c = 3
k = 1    0.89     0.56     0.53
k = 2    0.85     0.58     0.50
k = 3    0.84     0.57     0.49

We calculate Pearson's correlation coefficient between the two sets of score values, one from SubMAP (without compression) and one from our framework (with compression), and report it as an indicator of the accuracy of the alignment results of our framework for different parameter settings.

These results suggest that we can almost always use one level of compression to benefit from a high performance gain without losing much accuracy in the alignment results. For c = 2 and c = 3, even though the accuracy of the results is significantly better than random, such compression levels should be used with caution if the accuracy of the alignment is the main concern.

Conclusions

In this paper, we considered the problem of aligning two metabolic networks, particularly when both of them are too large to be dealt with using existing methods. To solve this problem, we developed a framework that significantly increases the size of the metabolic networks that existing methods can align. Our framework is generic, as it can be used to improve the scalability of any existing network alignment method. It has three major phases, namely the compression phase, the alignment phase and the refinement phase. For the first phase, we developed an algorithm which transforms the given metabolic networks to a compressed domain where they are summarized using far fewer nodes, termed supernodes, and interactions. In the second phase, we carried out the alignment in the compressed domain using an existing method, SubMAP, as the base alignment algorithm. In the refinement phase, we considered each individual mapping of supernodes one by one. Each such mapping corresponds to a smaller instance of the network alignment problem. For each of these mappings, we solved the alignment problem using SubMAP as our base method.
Our experiments on the metabolic networks extracted from the KEGG pathway database demonstrate that our compression method reduces the number of reactions by almost half at each level of compression. As a result of this compression, we observe that SubMAP coupled with our framework can align networks at least twice as large as its original version can with the same amount of resources. Our results also suggest that the alignment obtained with only one level of compression benefits from a significant performance gain while capturing the original alignment results with very high accuracy. We believe that this paper takes an important step in scaling metabolic network alignment with subnetwork mappings to organism-wide networks, and thus can have a great impact on making the existing network alignment methods more useful for domain scientists.

Methods

In this section, we describe the method we develop to compress the query networks and the overall framework for aligning networks in this compressed domain. Before going into detail, it is important to state that we use a reaction-based model for representing metabolic networks throughout this paper. Formally, we represent a metabolic network with P = (V, E), where V is the set of all reactions of the network and E is the set of directed edges between them. An edge $e_{ij} \in E$ exists if and only if the reaction $v_i$ has at least one output compound which is an input for the reaction $v_j$. In the following, we first describe our compression method. We use the shorthand notation MDS (minimum degree selection) to refer to this method in the rest of the paper. We then prove the optimality of MDS under certain conditions and provide an upper bound on the number of compressions that can be missed by this method with respect to the optimal compression. Next, we give a brief overview of the base alignment method that we use in this paper and explain in detail the two remaining phases of our alignment framework.
We provide our analysis of the computational complexity of the overall method and conclude the methods section by answering two questions related to the performance characteristics of this method.

Minimum degree selection (MDS) method

Let P = (V, E) be the reaction-based representation of a metabolic network and let c denote the user specified parameter for the desired level of compression. For x = 1, ..., c, we denote the compressed form of P after x compression levels with $P^x = (V^x, E^x)$. To simplify our notation, we assume that $P^0 = P$. We construct $P^x$ from $P^{x-1}$ for each x = 1, ..., c. Each $v \in V^x$ is either a node from $V^{x-1}$ or a supernode that contains two nodes of $V^{x-1}$. In summary, we construct $V^x$ from $V^{x-1}$ in a number of consecutive steps. At each step, we choose a pair of connected nodes in $V^{x-1}$ that have not been compressed in earlier steps of the current compression level. We then merge this node pair into a supernode and add it to $V^x$. We repeat these steps until there is no such node pair in $V^{x-1}$. Assume that the number of such steps is t for compression level x. We denote the state of the network after the ith step during the xth level of compression as $P_i^x = (V_i^x, E_i^x)$ (Figure 5(b)). Note that $V_t^x = V^x$ and $V_i^x \subseteq V^{x-1} \cup V^x$ for each i = 1, ..., t, as the nodes of $V_i^x$ are either singleton nodes from $V^{x-1}$ or supernodes from $V^x$.

Figure 5. One compression step of the MDS method. Small circles represent reactions and big circles represent supernodes that result from earlier steps of compression. A solid arrow represents an edge between two noncompressed nodes in the current compression level. A dashed arrow denotes an edge between a supernode and another node in the network. While calculating the degrees of the noncompressed nodes, only the solid arrows are taken into account.
(a) The state of network P during compression level x before the ith intermediate step (i.e., $P_{i-1}^x$). The node with the minimum degree is denoted with $v_a$ and its first neighbor is denoted with $v_b$. (b) The state of this network after the ith compression step (i.e., $P_i^x$). We denote the node resulting from the compression at this step with $v_{ab}$.

We are now ready to discuss how we compress $P^{x-1}$ to get $P^x$. We define the degree of a noncompressed node v in a given network as deg(v) = indeg(v) + outdeg(v), where indeg(v) (outdeg(v)) denotes the number of incoming edges from (outgoing edges to) noncompressed nodes in the network. We say that two nodes in a network are neighbors if they are connected by at least one edge. We denote the set of neighbors of a node v with N(v). We start the compression by initializing $V_0^x = V^{x-1}$ and $E_0^x = E^{x-1}$. Then, while there exists a noncompressed node with degree greater than zero in the current state of the network, say $P_{i-1}^x$, we apply the next step, the ith step, of compression to obtain $P_i^x$ from $P_{i-1}^x$. Figure 5 depicts the states of an example network before (Figure 5(a)) and after (Figure 5(b)) the ith step of compression. We start the ith step by selecting a node with minimum positive degree among the nodes in $V_{i-1}^x$. If there is more than one such node, we select the first one among them. In our example in Figure 5(a), the node with minimum degree is unique and is shown by $v_a$. We use the term minimum degree as a shorthand for minimum positive degree to exclude singleton nodes. This way we ensure that $\deg(v_a) > 0$ and $N(v_a)$ is nonempty. We select one neighbor from $N(v_a)$, say $v_b$. The only node in $N(v_a)$ in Figure 5(a) is denoted with $v_b$. We then merge $v_a$ with $v_b$ to form the supernode $v_{ab} = \{v_a, v_b\}$. Figure 5(b) illustrates this newly created node $v_{ab}$.
This is the only compression to be done at the ith compression step. Next, we create the new node set as $V_i^x = (V_{i-1}^x \cup \{v_{ab}\}) \setminus \{v_a, v_b\}$. To create the edge set $E_i^x$, we initialize it to $E_{i-1}^x$ and remove all the incoming and outgoing edges of $v_a$ and $v_b$ from it. Then, we insert an incoming edge to $v_{ab}$ from each node in $V_{i-1}^x \setminus \{v_a, v_b\}$ which has an outgoing edge to either $v_a$ or $v_b$ in the previous edge set $E_{i-1}^x$. We insert outgoing edges from $v_{ab}$ to other nodes in a similar manner. Figure 5 illustrates the changes in the edge set after creating $v_{ab}$. Notice that for each i = 1, ..., t, the set $V_i^x$ contains a mixture of nodes and supernodes. After each such step, the number of nodes of the network decreases by one and the number of edges of the new network decreases by at least one. For instance, in Figure 5 the number of nodes dropped from five to four and the number of edges dropped from six to five. The compression of $P^{x-1}$ to get $P^x$ continues by applying another compression step until there are no more noncompressed nodes with positive degree.

The discussion above describes the intermediate compression steps of the MDS method to perform a single level of compression on a given network. Given a compression level c, for each level x = 1, ..., c, we apply the same compression steps on $P^{x-1} = (V^{x-1}, E^{x-1})$ by initially treating $P^{x-1}$ as a noncompressed network with no supernodes. As a result of this process, after finishing the xth level of compression, the actual number of reactions that each node of $V^x$ can contain is guaranteed to be in the interval $[1, 2^x]$. The limitation on the number of reactions in each node allows the MDS method to respect and largely preserve the initial topology of the query networks. This is very important for the alignment, as it makes significant use of the network topologies.
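The compression steps described above can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions, not the authors' implementation: nodes are arbitrary hashable labels, a supernode is represented as a tuple of the two merged nodes, ties are broken by the order of the input list, self-loops are ignored, and the order in which nodes are returned (supernodes first, then leftover singletons) is a choice of ours that the text does not specify. The function names are hypothetical.

```python
def mds_compress_level(nodes, edges):
    """Apply one level of MDS-style compression.

    nodes: list of hashable node labels (list order breaks ties).
    edges: iterable of directed (source, target) pairs.
    Returns (new_nodes, new_edges); each supernode is a tuple (va, vb).
    """
    edges = {(u, v) for (u, v) in edges if u != v}  # assumption: drop self-loops
    free = list(nodes)      # noncompressed nodes at this level
    supernodes = []         # supernodes created at this level
    while True:
        free_set = set(free)
        # Degree counts only edges between two noncompressed nodes.
        deg = {v: 0 for v in free}
        for u, v in edges:
            if u in free_set and v in free_set:
                deg[u] += 1
                deg[v] += 1
        positive = [v for v in free if deg[v] > 0]
        if not positive:
            break
        # min() returns the first minimum, i.e. the first node in list order.
        va = min(positive, key=lambda v: deg[v])
        # First noncompressed neighbor of va in list order.
        vb = next(v for v in free
                  if v != va and ((va, v) in edges or (v, va) in edges))
        vab = (va, vb)
        free = [v for v in free if v not in (va, vb)]
        supernodes.append(vab)
        # Redirect edges of va and vb to the supernode, then drop the
        # self-loop created by any edge between va and vb.
        edges = {(vab if a in (va, vb) else a, vab if b in (va, vb) else b)
                 for a, b in edges}
        edges = {(a, b) for a, b in edges if a != b}
    return supernodes + free, edges


def mds_compress(nodes, edges, c):
    """Apply c levels of compression, treating each result as a fresh network."""
    for _ in range(c):
        nodes, edges = mds_compress_level(nodes, edges)
    return nodes, edges


# Toy path network a -> b -> c -> d -> e (illustration only).
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
level1_nodes, level1_edges = mds_compress_level(toy_nodes, toy_edges)
print(level1_nodes)
```

On the toy path of five reactions, one level leaves three nodes (two supernodes and one singleton), and a second level leaves two. The sketch recomputes all degrees at every step for clarity; an implementation meant for large networks would presumably maintain them incrementally.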
Additionally, the bound on the number of reactions in each supernode translates to a uniform compression for both networks, which limits the sizes of the smaller alignment problems we can encounter in the refinement phase. This allows us to keep the complexity and the running time of the refinement phase of our alignment framework under control.

Optimality analysis for MDS

In the previous section, we described in detail the compression method (MDS) we use in our framework. Ideally, it is preferable to compress the given network as much as possible at each compression level. This is because a smaller network size often implies lower time and memory usage for the alignment. We say that a compression is optimal if the resulting compressed network contains the smallest number of nodes among all possible compressions, with the restriction that each noncompressed node can be merged with at most one other noncompressed node at each compression level. We name the hypothetical optimal compression method that achieves the best possible compression rate OPT. In the rest of this section, we analyze the optimality of our MDS method under different conditions. We first consider each connected component of the input network separately and then integrate the results to generalize our analysis to networks with arbitrary topologies. We start by introducing the notation we use in this section to handle networks with more than one connected component. Let P be a metabolic network with r connected components. We denote these components by $C_1 = (\hat{V}_1, \hat{E}_1), C_2 = (\hat{V}_2, \hat{E}_2), \ldots, C_r = (\hat{V}_r, \hat{E}_r)$, such that $P = \left(\bigcup_{j=1}^{r} \hat{V}_j, \bigcup_{j=1}^{r} \hat{E}_j\right)$. Let $C = (\hat{V}, \hat{E})$ be an arbitrary component of P and let $*^x$ represent the compressed form of C after x levels of compression using either the MDS method or OPT, which achieves the optimal compression.
We use $*$ (star) as a generic symbol to avoid introducing new symbols for each compressed component in places where only their sizes are of relevance. We use $\mathrm{MDS}(C, *^x)$ and $\mathrm{OPT}(C, *^x)$ to denote the total number of compression steps performed to transform C into its compressed form after x levels of compression by using the corresponding methods. Recall that each compression step reduces the network size by one. Thus, the bigger these values ($\mathrm{MDS}(C, *^x)$ and $\mathrm{OPT}(C, *^x)$), the better they are in terms of compression rate. The first and second arguments in this notation can be any state of a connected component or a network at any point during the compression. For instance, $\mathrm{OPT}(C_i^x, *^x)$ denotes the number of compression steps taken by OPT starting from the (i + 1)th intermediate step of the xth level until the xth level of compression is completed. In the following, we first prove that the MDS method makes an optimal choice in terms of which two nodes to compress at each compression step if there exists a node with degree one in the current state of a given component. We then show that, if no node with degree one exists, a compression step taken by MDS can increase the size of the compressed component by at most one compared with the one found by OPT. Finally, by aggregating the results from each component, for a given metabolic network P and a compression level c, we develop an upper bound on the size of the compressed networks obtained by MDS with respect to the size of the network that can be obtained by the optimal method.

Lemma 1 Let $C = (\hat{V}, \hat{E})$ denote a connected component of a given metabolic network P. Let $C_i^x = (\hat{V}_i^x, \hat{E}_i^x)$ denote the state of C after the ith step of the xth compression level. If there exists a node in $\hat{V}_i^x$ with degree one, then the compression step taken by the MDS method to create the next state $C_{i+1}^x$ is optimal.
Formally,

$$\mathrm{OPT}(C_i^x, *^x) = 1 + \mathrm{OPT}(C_{i+1}^x, *^x) \qquad (1)$$

Proof 1 We prove (1) by contradiction in two parts:

Part 1. $\mathrm{OPT}(C_i^x, *^x) \nless 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$

Part 2. $\mathrm{OPT}(C_i^x, *^x) \ngtr 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$

The first part (i.e., ≮) is trivial. The number of compression steps of OPT after performing one step of compression cannot be larger than the number before performing this step, otherwise the solution of $\mathrm{OPT}(C_i^x, *^x)$ cannot be optimal. This leads to a contradiction and hence proves Part 1.

To prove the second part (i.e., ≯), it is important to recall how the MDS method progresses given the state $C_i^x$ at which there exists at least one node $v_a$ with $\deg(v_a) = 1$. The method picks $v_a$. The node $v_a$ has exactly one noncompressed neighbor, say $v_b$. Thus, MDS merges them to create the supernode $v_{ab}$ (see Figure 5). We complete the proof by considering two cases.

In the first case, the OPT method merges $v_a$ and $v_b$ while compressing $C_i^x$. In this case, we can assume that OPT takes this step as its next step in compressing $C_i^x$, since a fixed compressed network can be obtained by arbitrarily shuffling the order of intermediate steps. Therefore, if $v_a$ and $v_b$ are compressed at any point in the optimal method, then the optimal solution for $C_{i+1}^x$, which is created by applying the MDS method on $C_i^x$, has exactly $\mathrm{OPT}(C_i^x, *^x) - 1$ compressions. Hence, $\mathrm{OPT}(C_i^x, *^x) = 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$ and $\mathrm{OPT}(C_i^x, *^x) \ngtr 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$.

In the second case, $v_a$ and $v_b$ are not merged together in the optimal solution. This case implies that $v_a$ is left as a singleton at the end of the xth level, as $\deg(v_a) = 1$. Then, the network that results after removing $v_a$ and all the edges connected to it can have at most $\mathrm{OPT}(C_i^x, *^x)$ compressions until the end of the xth level, since otherwise it contradicts the optimality of MDS.
This shows that the number of compressions that can be achieved when $v_a$ is left as a singleton cannot be greater than one plus $\mathrm{OPT}(C_{i+1}^x, *^x)$. Thus, $\mathrm{OPT}(C_i^x, *^x) \ngtr 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$, and combining it with the first part (i.e., ≮) we get $\mathrm{OPT}(C_i^x, *^x) = 1 + \mathrm{OPT}(C_{i+1}^x, *^x)$. □

Lemma 2 Let $C = (\hat{V}, \hat{E})$ denote a connected component of a given metabolic network P. Let $C_i^x = (\hat{V}_i^x, \hat{E}_i^x)$ denote the state of C after the ith step of the xth compression level. If the node with minimum degree in $\hat{V}_i^x$ has degree greater than one, then the compression step taken by MDS to create the next state $C_{i+1}^x$ can lead to a network that has size at most one larger than the compressed network that is obtained from the state $C_i^x$ by OPT. Formally,

$$\mathrm{OPT}(C_i^x, *^x) \le 2 + \mathrm{OPT}(C_{i+1}^x, *^x) \qquad (2)$$

Proof 2 Let $v_a$ be the first node in the list of minimum degree nodes in $\hat{V}_i^x$. From the assumption we know that $\deg(v_a) > 1$, and hence it has at least one noncompressed neighbor node, say $v_b$, that also has $\deg(v_b) > 1$. Without loss of generality, assume that the MDS method merges $v_a$ and $v_b$ to create the supernode $v_{ab}$ at the compression step from $C_i^x$ to $C_{i+1}^x$. This step can prevent at most one neighbor of $v_a$, say $v_c$, and at most one neighbor of $v_b$, say $v_d$, from being merged with the corresponding node in later steps. Notice that $v_c$ and $v_d$ are not necessarily distinct. The MDS algorithm can also merge $v_c$ and $v_d$ in the next steps if they are neighbors, though we do not know this for sure at this point. This results in either one compression or two compressions using only the four nodes $v_a$, $v_b$, $v_c$ and $v_d$ by the MDS method. Next, we calculate the number of compression steps that the OPT method can take for compressing these four nodes. There are three cases to consider:

Case 1. The OPT method merges $v_a$ with $v_b$ at any point during the xth level of compression.
This case is equivalent to merging $v_a$ with $v_b$ in the next step by MDS and then compressing the rest of the network by OPT. In other words, MDS already takes the optimal compression step. Hence, $OPT(C_i^x, *^x) = 1 + OPT(C_{i+1}^x, *^x) \le 2 + OPT(C_{i+1}^x, *^x)$.

Case 2. The OPT method merges $v_a$ with $v_c$ at any point during the xth level of compression. The worst case scenario for the MDS method in this case is when $v_c$ is not connected to $v_d$ and the OPT method merges $v_b$ with $v_d$ in a later step. This way the OPT method optimally compresses four nodes down to two supernodes, namely $v_{ac}$ and $v_{bd}$. On the other hand, the MDS method creates a single supernode, $v_{ab}$, and the nodes $v_c$ and $v_d$ remain as singletons. However, even in this worst case, the MDS method prevents only one compression step from taking place with respect to OPT. Hence, $OPT(C_i^x, *^x) \le 2 + OPT(C_{i+1}^x, *^x)$.

Case 3. The OPT method merges $v_b$ with $v_d$ at any point during the xth level of compression. We can prove this case similarly to Case 2 by symmetry. □

Using Lemmas 1 and 2, Theorem 1 develops an upper bound on the number of compressions that can be missed by MDS with respect to the optimal compression.

Theorem 1 (Optimality bound for MDS). Let P be a metabolic network with r connected components $C_1 = (\hat{V}_1, \hat{E}_1), \ldots, C_r = (\hat{V}_r, \hat{E}_r)$ such that $P = \bigcup_{j=1}^{r} C_j$, and let c be a positive integer given as the desired number of compression levels. Let $C = (\hat{V}, \hat{E})$ denote an arbitrary connected component of P. Also, let s represent the number of intermediate steps for which no noncompressed node with degree one is found during the compression from P to $P^c$ by the MDS method. Then, each of the following statements holds:

1. $OPT(C^{x-1}, *^x) \le 2\,MDS(C^{x-1}, *^x)$ for x = 1, ..., c.
2. $OPT(P, *^c) \le s + MDS(P, *^c)$.
3. $OPT(P, *^c) \le \min\{2\,MDS(P, *^c),\ s + MDS(P, *^c)\}$.

Proof 3. 1.
This part follows from Lemmas 1 and 2. Lemma 1 states the case in which the MDS method is equivalent to OPT. Lemma 2 gives an upper bound on the number of compression steps that MDS can miss. The worst case is when the boundary condition of Lemma 2 holds for each step of the xth compression level for $C^{x-1}$. In this case, the number of steps taken by the OPT method while compressing $C^{x-1}$ is two times the number for the MDS method.

2. This part also follows from Lemmas 1 and 2. Throughout the compression of the entire network P by c levels, each step of the MDS method that satisfies the condition in Lemma 2 can decrease the number of possible merge operations by one with respect to OPT. By simply counting these steps, at the end of the execution of the MDS method we can give the upper bound $s + MDS(P, *^c)$ on the number of optimal compressions $OPT(P, *^c)$.

3. Part 2 shows that $OPT(P, *^c) \le s + MDS(P, *^c)$. It is therefore only necessary to show $OPT(P, *^c) \le 2\,MDS(P, *^c)$. Part 1 proves this result for a single connected component C for the xth compression level. P is given as $\bigcup_{j=1}^{r} C_j$ before the first level of compression. We know by Part 1 that $OPT(C, *^1) \le 2\,MDS(C, *^1)$ for each connected component C. Summing this up for all j from 1 to r, we get $OPT(P, *^1) \le 2\,MDS(P, *^1)$. This inequality holds for each compression level x from 1 to c. Summation over x gives

\[ \sum_{x=1}^{c} OPT(P^{x-1}, *^x) \le 2 \sum_{x=1}^{c} MDS(P^{x-1}, *^x). \]

Hence, we prove $OPT(P, *^c) \le 2\,MDS(P, *^c)$. □

Another way of interpreting Theorem 1 is to transform it into an upper bound on the size of the compressed network generated by MDS in terms of the one that can be obtained by OPT. By carrying out this transformation, we answer the question we posed in the introduction: "How far is our compression method from the optimal compression?" We do this as follows. Let P be a network of size n.
Given compression level c, let $\theta = OPT(P, *^c)$ denote the number of compression steps of the OPT method. Also, let $n_{OPT}$ and $n_{MDS}$ denote the sizes of the compressed networks obtained by the OPT and MDS methods respectively. By the bound given in Theorem 1, we know that $MDS(P, *^c) \ge \theta/2$. Therefore, we can write $n_{OPT} = n - \theta$ and $n_{MDS} \le n - \theta/2$. Also, we know by definition that $\theta \le \sum_{x=1}^{c} \lfloor n/2^x \rfloor$. Using this inequality, we get:

\[ n_{OPT} \ge n - \sum_{x=1}^{c} \frac{n}{2^x}, \qquad n_{MDS} \le n - \sum_{x=1}^{c} \frac{n}{2^{x+1}} \tag{3} \]

If we examine the ratio $n_{MDS}/n_{OPT}$, for c = 1 we get $n_{MDS}/n_{OPT} \le 3/2$ for arbitrary n (details omitted). This demonstrates that after one level of compression, the size of the compressed network found by our method is at most 1.5 times the size of the optimal network. For x = 1, 2, ..., c, this ratio is proportional to $(1.5)^x$. We can also use the bound on the number of compression steps given in the second statement of Theorem 1 to derive a similar upper bound on the size of the compressed network found by MDS. The tighter of these two upper bounds on the network size can be calculated during the execution of the MDS method and reported as an indicator of how much room is left for improving the compression.

Alignment framework

We described the first phase, namely the compression phase, in detail in the previous sections. Here, we first summarize the base alignment method that we use in our framework, SubMAP [10]. Then, we explain the two remaining phases of our framework, namely the alignment phase and the refinement phase. The alignment phase follows the compression phase and utilizes the base method to find an alignment in the compressed domain. The refinement phase applies the base method to the mappings found in the previous phase to further refine the alignment results.
After describing all the phases, we analyze the complexity of each phase and combine them to obtain the complexity of the entire framework. Then, we examine the characteristics of the queries to determine which are likely to benefit from compression during the alignment, answering the question "When should we compress?" Last, we provide a guideline for selecting the compression level that is expected to give the best performance gain of our framework with respect to the base alignment method.

Overview of SubMAP

Here, we take a small detour and explain SubMAP, a recent method for aligning metabolic networks when they are not compressed. We pick the SubMAP method for its high accuracy and biological relevance, as it considers subnetworks of the given networks during the alignment. A subnetwork of a network is a subset of the reactions of that network such that the induced undirected graph of this subset is connected. Given two metabolic networks $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ and a positive integer k, SubMAP aims to find a set of mappings between the reactions of P and $\bar{P}$ with the largest similarity score, such that:
(i) Each reaction in P ($\bar{P}$) can map to a subnetwork of $\bar{P}$ (P) with at most k reactions.
(ii) Each reaction of P and $\bar{P}$ can appear in at most one mapping.

The first step of SubMAP is to create the set of all possible subnetworks of size at most k for each query network. We denote the numbers of these subnetworks for P and $\bar{P}$ by $N_k$ and $M_k$ respectively. The second step of SubMAP is to calculate the pairwise similarity between each pair of these subnetworks, one from P and one from $\bar{P}$. Each subnetwork consists of reactions, and each reaction is defined by its input and output compounds (i.e., substrates and products) and the enzymes that catalyze it.
Therefore, we measure the pairwise similarities between subnetworks using reaction similarities, which in turn are defined by the similarities of the components of these reactions. For more details of this similarity score we refer the reader to Ay et al. [10]. The step that dominates the time and space complexity of SubMAP is the third step. The aim of this step is to create a similarity score that combines pairwise similarities with the topological similarity of the networks. A data structure named the support matrix is created for this purpose. The size of this matrix is quadratic in the number of subnetworks of each query network. In other words, the support matrix requires $O(N_k^2 M_k^2)$ space. This term is the dominating factor in the overall time and space complexity of SubMAP. The next two steps of the algorithm are to combine topological similarity with pairwise node similarities and to extract the alignment as a set of subnetwork mappings of P and $\bar{P}$.

Alignment phase

The SubMAP method described above aligns the networks $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ in their original form. Our framework first compresses each of these networks to reduce their sizes and then aligns the compressed networks instead of P and $\bar{P}$. In this section, we explain how we align the compressed networks $P^c$ and $\bar{P}^c$, which are in the compressed domain of level c, using SubMAP with a given parameter k. Let us first consider $P^c = (V^c, E^c)$. Each node $v_a$ in $V^c$ is a supernode of the reactions in V. Also, by construction of our compression method, each supernode $v_a$ contains at most $2^c$ reactions. An edge from the node $v_a$ to the node $v_b$ exists in $E^c$ if and only if at least one reaction in $v_a$ has an edge to one reaction in $v_b$ in E. The same arguments hold for the other network $\bar{P}^c$ as well.
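The edge rule for the compressed network stated above can be illustrated with a minimal Python sketch. The function name `compressed_edges`, the dictionary `supernode_of` and the toy network are hypothetical and are not taken from the article; dropping edges whose endpoints fall in the same supernode is an assumption based on the description that such an edge is encapsulated inside the supernode.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def compressed_edges(edges: Iterable[Edge],
                     supernode_of: Dict[Hashable, Hashable]) -> Set[Edge]:
    """Project reaction-level directed edges E onto supernodes to obtain E^c.

    An edge (s_a, s_b) is produced if and only if at least one reaction in
    supernode s_a has an edge to one reaction in supernode s_b.
    """
    result: Set[Edge] = set()
    for u, v in edges:
        su, sv = supernode_of[u], supernode_of[v]
        # Assumption: an edge between two reactions of the same supernode
        # is encapsulated by that supernode and is not kept as a self-loop.
        if su != sv:
            result.add((su, sv))
    return result


# Hypothetical toy network: a -> b -> c -> d -> e, compressed by one level
# into the supernodes "ab", "c" and "de".
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
toy_supernodes = {"a": "ab", "b": "ab", "c": "c", "d": "de", "e": "de"}
toy_compressed = compressed_edges(toy_edges, toy_supernodes)
```

Using a set means that several reaction-level edges between the same pair of supernodes collapse into a single edge, which matches the "at least one" wording of the rule.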
To align these compressed networks, we treat their nodes, which are supernodes of reactions, as if they were the reactions of the metabolic networks $P^c$ and $\bar{P}^c$. This way, we can directly apply SubMAP to align these networks. As far as the operation of the SubMAP method is concerned, this is no different from aligning two networks that are identical to these networks but are in the original domain. The difference is in the interpretation of the intermediate steps and the form of the mappings found by the alignment. For instance, in the first step of SubMAP we enumerate the reaction subnetworks of size at most k in the original domain, whereas in the compressed domain we enumerate the subnetworks of supernodes, where each supernode can contain more than one reaction and the number of such supernodes in one subnetwork is at most k. Similarly, we calculate the pairwise similarity, the support matrix and the conflict graph for the subnetworks of supernodes (i.e., nodes of $V^c$) instead of subnetworks of reactions (i.e., nodes of V). The resulting alignment gives us a set of mappings between the subnetworks of $P^c$ and $\bar{P}^c$. We can think of these mappings as a high level view of the alignment between the networks P and $\bar{P}$. For instance, from Figure 1(f) one can immediately see that the resulting alignment will map node a either to node a' or to node b', and that these are the only options for node a, as imposed by the higher level supernode mapping ((a, b), (a', b')). In the next phase, we consider each of these supernode mappings as a smaller instance of the alignment problem and solve it to obtain a more refined alignment of P and $\bar{P}$.

Refinement phase

Each mapping found by the alignment phase is a subnetwork pair where one subnetwork is from $P^c$ and the other is from $\bar{P}^c$. The mappings found by SubMAP can have up to k nodes in one subnetwork and only one node in the other.
If we denote a subnetwork of $P^c$ by $R_i^c$ and a subnetwork of $\bar{P}^c$ by $\bar{R}_j^c$, the resulting mappings of the alignment phase will be of the form $(R_i^c, \bar{R}_j^c)$. We can assume, without loss of generality, that for this specific pair $R_i^c$ contains up to k nodes of $P^c$ and $\bar{R}_j^c$ contains a single node of $\bar{P}^c$. Each node contained in either of these subnetworks is a supernode that contains either one node, or two nodes and an edge between them, from the previous level of compression, namely the (c − 1)th level. For both $R_i^c$ and $\bar{R}_j^c$, we decompress their nodes by one level by retrieving the connectivity between these nodes in the (c − 1)th compression level that was encapsulated in the cth level. This decompression results in at most 2k nodes from the (c − 1)th level for $R_i^c$ and at most 2 nodes from the (c − 1)th level for $\bar{R}_j^c$. We then recursively align these smaller networks generated from $R_i^c$ and $\bar{R}_j^c$ by using SubMAP until the original domain (i.e., c = 0) is reached. At the (c − x)th recursive step, the sizes of the two networks to be aligned can be at most $k\,2^x$ for one network and $2^x$ for the other.

Figure 1(f) illustrates this on a concrete example. The network on the left has two supernodes (i.e., (a, b) and (e, d)), each containing two nodes with an edge between them, and one supernode (i.e., (c)) that contains only one node from the previous level of compression. The one on the right has two supernodes with two nodes in each. To understand how decompression by one level works, we can focus on the supernode mapping ((e, d), (c', d')), which is found in compression level one. We can think of decompression as removing the circles that surround these supernodes to get back the connectivity among their nodes in the previous compression level. In our case, this leads to the small networks d → e and c' → d'.
We align these small networks recursively using SubMAP and report their final alignment after only one recursive call, since the compression level is only one in this case. Also, since k = 1 is used for ease of presentation in this example, the sizes of the networks on each side, in terms of nodes in the original domain, are at most 2 for the recursive call from c = 1, as can be seen from Figure 1(f) (i.e., $k\,2^c = 2^c = 2$ for k = c = 1).

Complexity analysis

Having finished the discussion of all three phases, we can now analyze the overall complexity of our framework. We start from the first phase, which is the compression of the input networks P and $\bar{P}$ by c levels. We first calculate the complexity of the first compression level for the network P with size n. At each compression step, MDS first searches for a minimum degree node. Once it finds this node, it picks one of its neighbor nodes and merges these two nodes. After this merging, it updates the degrees of all the neighbors of each of the merged nodes. The first two of these operations take $O(\log n)$ time if proper data structures are used, and the last one can take $O(n)$ in the worst case. Since the size of network P is n, there can be at most $n/2$ compression steps during the first level of compression. Hence, the complexity of the compression for the first level is $O(n^2)$. Since the input size of this level is larger than that of all subsequent levels, we can safely assume that each of the subsequent levels also takes $O(n^2)$, and the complexity of compression by c levels is therefore $O(cn^2)$. Even though this is not a tight bound, it is sufficient at this point because the complexity of the next two phases dominates it. Since we compress both networks, the overall complexity of the compression phase, where m denotes the size of $\bar{P}$, is:

\[ O(c\,(n^2 + m^2)). \tag{4} \]

For the analysis of the next phases, we make two assumptions, both of which are supported by experimental evidence on the topological properties of metabolic networks.
Our first assumption is that at each level of compression our method reduces the network size by half. In other words, if the sizes of our query networks are n and m, then the sizes of the compressed networks after c levels by the MDS method are $n_{MDS} = n/2^c$ and $m_{MDS} = m/2^c$ respectively. This is mainly because metabolic networks contain many nodes with low degrees [27]. Our experiments on a large dataset of networks, summarized in Table 1, support this as well. The second assumption is that the number of subnetworks is a constant multiple of the network size for small k values. In other words, $N_{MDS} = \alpha(k)\,n_{MDS}$ and $M_{MDS} = \beta(k)\,m_{MDS}$, where $\alpha(k)$ and $\beta(k)$ are functions of k but are independent of n and m respectively. Our earlier analysis in Ay et al. [10] demonstrated that the number of subnetworks for k = 3, which is the largest k value we use here, is on the order of $5|V|$ for a large set of metabolic networks.

We are now ready to analyze the complexity of the second phase, which is the alignment phase. By the first assumption, we know that the sizes of $P^c$ and $\bar{P}^c$ are $n_{MDS} = n/2^c$ and $m_{MDS} = m/2^c$ respectively. By the second, the numbers of subnetworks of these networks are $N_{MDS} = \alpha(k)\,n_{MDS}$ and $M_{MDS} = \beta(k)\,m_{MDS}$ for a given k. Also, we know that the complexity of SubMAP is quadratic in $N_{MDS}$ and in $M_{MDS}$. Therefore, the complexity of the second phase is:

\[ O\!\left(\frac{\alpha(k)^2\,\beta(k)^2\,n^2\,m^2}{2^{4c}}\right). \tag{5} \]

The complexity of the refinement phase has two factors. The first one is the number of mappings found by the alignment phase. Since SubMAP allows each node of both networks to be reported in at most one mapping, we have a trivial upper bound on the number of possible mappings in terms of n and m. The largest number of mappings is reported when all the subnetworks of both networks are singletons. In this case, the number of reported mappings is the minimum of n and m.
We can assume without loss of generality that n < m, and hence this number is $O(n)$. The second factor is the size of each of these $O(n)$ smaller alignment problems that need to be solved by SubMAP again to refine the mapping results. As we discussed in the refinement phase, the sizes of the networks created by decompressing the mapped subnetworks are at most $k\,2^c$ on one side and at most $2^c$ on the other. The numbers of subnetworks that can be created from these networks are $\alpha(k)\,k\,2^c$ and $\beta(k)\,2^c$ for the corresponding sides. Therefore, each mapping can be refined by decompressing and applying SubMAP, which takes $O(\alpha(k)^2\,k^2\,2^{2c}\,\beta(k)^2\,2^{2c})$. We do this refinement $O(n)$ times in the worst case; hence the complexity of the refinement phase is:

\[ O\!\left(\alpha(k)^2\,\beta(k)^2\,n\,k^2\,2^{4c}\right). \tag{6} \]

Combining the results of Equations 4, 5 and 6, we can see that the overall complexity of our method is determined by the second or the third phase depending on the value of c. For small values of c and k, such as 1, 2 and 3, the second phase dominates the overall complexity. Larger values of c result in a costlier refinement phase and a less expensive alignment phase. Very large values of k imply exponentially many subnetworks, in which case the above complexity analysis would not hold and the alignment problem may become intractable with or without compression.

When should we compress?

We discussed the potential of our framework for improving the scalability of existing network alignment methods. However, there can be cases in which the compression results in network topologies that force the alignment method to reach its worst case performance. In this section, we analyze when performing the alignment in the compressed domain is the better alternative. For this purpose, we devise a criterion that is inspired by the results of a large number of network alignments performed with both methods.
We find that the gain or loss in running time is highly dependent on the number of all possible subnetworks of the compressed and noncompressed networks. The numbers of these subnetworks can be determined in advance of the alignment. By formulating a criterion in terms of these numbers, we can decide between the two algorithms before actually performing an alignment. Figure 4 illustrates the results for 3600 alignments performed with both methods on a wide range of network sizes with all possible combinations of k and c values. The x-axis shows the running time of SubMAP minus the running time of our framework. The larger this value is, the greater the improvement we get from our framework. The y-axis shows the ratio

\[ y = \frac{N_k^c\,M_k^c}{N_k\,M_k} \]

where $N_k$ and $M_k$ denote the numbers of all subnetworks of P and $\bar{P}$, and $N_k^c$ and $M_k^c$ denote the numbers of all subnetworks of the compressed networks $P^c$ and $\bar{P}^c$. The dashed line at y = 0.5 visualizes our criterion. If the above ratio is below 0.5, then the number of all possible subnetwork pairs generated by the compressed alignment is less than half of this number for the original alignment. A very large portion of the alignments satisfying this criterion (97%) shows improvement in running time if compression is used. For ratios above 0.5, only a small portion of the alignments (10%) shows improvement. Considering the overhead of the refinement phase and the compression phase, this result is expected. These results strongly suggest that the answer to the question "When should we compress?" is "when $\frac{N_k^c\,M_k^c}{N_k\,M_k} \le 0.5$".

How much should we compress?

In this section, we provide a guideline for selecting the compression level c that results in the minimum expected running time, among all possible values, for our framework to align the query networks for a given k.
We make extensive use of the computational complexity results discussed before in the proof of the theorem below, which formulates the optimal c for a given k value and two query networks with sizes n and m. This theorem answers the question "What is the right amount of compression that we need to use in order to minimize the running time of our framework?"

Theorem 2 (Optimal level of compression). Let $P = (V, E)$ and $\bar{P} = (\bar{V}, \bar{E})$ be two metabolic networks with sizes n and m respectively, and let k be a given positive integer. Assume without loss of generality that n < m. Then, the compression level c that gives the optimal compression is:

\[ c = \frac{\log_2(n\,m^2\,k^{-2})}{8}. \tag{7} \]

Proof 4. Given P and $\bar{P}$, we want to find the c value such that the difference between the complexity of applying SubMAP to align these networks in their original domain for a given k and the complexity of using our framework, which aligns P with $\bar{P}$ in the compressed domain for the same k value, is maximum. We omit the constant factors and use the algorithmic complexity as the cost of alignment. Under this assumption, the cost of aligning two networks with sizes n and m with SubMAP in the original domain for a given k value is:

\[ \alpha(k)^2\,\beta(k)^2\,n^2\,m^2 \tag{8} \]

For our framework, this cost can be determined from the complexities of the three phases given by Equations (4), (5) and (6) (see main article for these equations). As discussed, the dominating factors in the complexity are the last two phases (i.e., Equation (5) and Equation (6)). Therefore, we write the total cost of aligning P with $\bar{P}$ in the compressed domain of level c, for a given k value, as:

\[ \frac{\alpha(k)^2\,\beta(k)^2\,n^2\,m^2}{2^{4c}} + \alpha(k)^2\,\beta(k)^2\,n\,k^2\,2^{4c} \tag{9} \]

Our aim is to maximize (8) − (9) with respect to c.
We know that this difference is negative (i.e., alignment in the compressed domain is costlier) when c ≥ n (assuming n < m as stated in the theorem) or when c = 0, due to the overhead of the compression and/or refinement phases. We also know that for c = 1 this difference is positive, as compression by one level always results in less costly alignments compared to no compression. Therefore, if there is an extremum of (8) − (9) with respect to c for c ∈ (0, n), then this extremum is a maximum, meaning that the difference (8) − (9) is largest at that point. We calculate this maximum by differentiating (8) − (9) with respect to c and setting the derivative to zero:

\[
\begin{aligned}
\frac{\partial\,\big((8) - (9)\big)}{\partial c} &= 0 \\
\frac{\partial}{\partial c}\left\{ \alpha(k)^2\beta(k)^2 n^2 m^2 - \alpha(k)^2\beta(k)^2 n^2 m^2\,2^{-4c} - \alpha(k)^2\beta(k)^2 n k^2\,2^{4c} \right\} &= 0 \\
4\log(2)\,2^{-4c}\,\alpha(k)^2\beta(k)^2 n^2 m^2 - 4\log(2)\,2^{4c}\,\alpha(k)^2\beta(k)^2 n k^2 &= 0 \\
2^{-4c}\,n\,m^2 - 2^{4c}\,k^2 &= 0 \\
2^{8c} &= n\,m^2\,k^{-2} \\
c &= \frac{\log_2(n\,m^2\,k^{-2})}{8}
\end{aligned}
\tag{10}
\]

□

The value obtained from the above discussion is not necessarily an integer. We suggest using the nearest integer to this value as the number of compression levels in our alignment. Next, we give a few examples to see what Theorem 2 implies in practice. Assume we have two networks with sizes n = 100 and m = 100, and we want to align them using our framework for k = 2. Plugging these numbers into Equation 7, we get:

\[ c = \frac{\log_2(250000)}{8} = \frac{17.93}{8} \cong 2.24 \]

Rounding this to the nearest integer, Equation 7 suggests that we use two levels of compression for this alignment problem to get the largest gain in running time. We can carry out the calculation similarly for a bigger set of inputs, n = m = 1000 and k = 3, which gives around 3.34, suggesting that three levels of compression are likely to provide the best running time improvement for this instance.
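The two guidelines above, the 0.5 subnetwork-ratio criterion and Equation 7, can be evaluated before running any alignment. The following Python sketch is only an illustration under stated assumptions: the function names `should_compress` and `optimal_compression_level` are hypothetical, the subnetwork counts are assumed to have been computed elsewhere, and the swap of n and m is added here so that the stated assumption n ≤ m holds for any input order.

```python
import math


def should_compress(N_k: int, M_k: int, N_k_c: int, M_k_c: int,
                    threshold: float = 0.5) -> bool:
    """Criterion (N_k^c * M_k^c) / (N_k * M_k) <= 0.5 from the text.

    N_k, M_k: subnetwork counts of the original networks.
    N_k_c, M_k_c: subnetwork counts of the compressed networks.
    """
    return (N_k_c * M_k_c) / (N_k * M_k) <= threshold


def optimal_compression_level(n: int, m: int, k: int) -> float:
    """Real-valued c from Equation 7: log2(n * m^2 * k^-2) / 8."""
    if n > m:
        # Assumption: the theorem takes n as the smaller network size.
        n, m = m, n
    return math.log2(n * m ** 2 / k ** 2) / 8


# The two worked examples from the text.
c_small = optimal_compression_level(100, 100, 2)    # about 2.24
c_large = optimal_compression_level(1000, 1000, 3)  # about 3.34
levels_small = round(c_small)  # nearest integer, as suggested in the text
levels_large = round(c_large)
```

Keeping the real-valued result separate from the rounding step makes it easy to see how close an instance is to the boundary between two compression levels.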
However, it is important to note that, depending on the desired tradeoff between the running time gain and the alignment accuracy, the user can always use smaller (or bigger) c values than the ones suggested here. Also, the values calculated above are only expected to provide the best running time improvement with respect to the running time of the original alignment. If the size of the query is orders of magnitude bigger than what the original algorithm can handle, then the framework we propose here is also likely to fail to perform the alignment.

List of abbreviations

$P = (V, E)$, $\bar{P} = (\bar{V}, \bar{E})$: Query metabolic networks
$V$, $\bar{V}$: Sets of all reactions of the query networks
$r_i \in V$, $\bar{r}_j \in \bar{V}$: Reactions of the query networks
$n = |V|$, $m = |\bar{V}|$: Sizes of the query networks
$c$, $2^c$: Compression level and compression rate
$P^c = (V^c, E^c)$: P after c levels of compression
$C_i = (\hat{V}_i, \hat{E}_i)$: A connected component of network P
$N(v_a)$, $deg(v_a)$: The set of neighbors and the degree of node $v_a$
$|v_a|$: Number of reactions that are contained in $v_a$
$v_{ab}$: A supernode containing the nodes $v_a$ and $v_b$
$k$: Parameter for the largest subnetwork size
$\mathcal{R}_k$, $\bar{\mathcal{R}}_k$: Sets of all subnetworks of size at most k
$R_i$, $\bar{R}_j$: Subnetworks of the query networks
$N_k$, $M_k$: Numbers of all subnetworks of size at most k

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FA, TK, and MD developed the method. MD and FA implemented the methods and gathered experimental results. FA and TK wrote the paper.

Acknowledgements and funding

This work was supported partially by NSF under grants IIS-0845439 and CCF-0829867. FA is partially supported by NSF under grant #1136996 to the Computing Research Association for the CIFellows project.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 3, 2012: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/13/S3.

References

1. Navlakha S, Schatz M, Kingsford C: Revealing biological modules via graph summarization. J Comput Biol 2009, 16(2):253-264. doi:10.1089/cmb.2008.11TT. PMID: 19183002.
2. Segal E, Pe'er D, Regev A, Koller D, Friedman N: Learning module networks. Journal of Machine Learning Research 2005, 6:557-588.
3. Ay F, Dinh T, Thai M, Kahveci T: Dynamic modular structure of regulatory networks. IEEE International Conference on Bioinformatics and Bioengineering (BIBE) 2010, 136-143.
4. Dutkowski J, Tiuryn J: Identification of functional modules from conserved ancestral protein-protein interactions. Bioinformatics 2007, 23(13):i149-i158. doi:10.1093/bioinformatics/btm194. PMID: 17646291.
5. Dandekar T, Schuster S, Snel B, Huynen M, Bork P: Pathway alignment: application to the comparative analysis of glycolytic enzymes. Biochem J 1999, 343(Pt 1):115-124. doi:10.1042/0264-6021:3430115. PMCID: 1220531. PMID: 10493919.
6. Dost B, Shlomi T, Gupta N, Ruppin E, Bafna V, Sharan R: QNet: a tool for querying protein interaction networks. International Conference on Research in Computational Molecular Biology (RECOMB) 2007, 1-15.
7. Kuchaiev O, Przulj N: Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics 2011, 27:1390-1396. doi:10.1093/bioinformatics/btr127. PMID: 21414992.
8. Ay F, Kahveci T, de Crécy-Lagard V: A fast and accurate algorithm for comparative analysis of metabolic pathways. J Bioinform Comput Biol 2009, 7(3):389-428. doi:10.1142/S0219720009004163. PMID: 19507283.
9. Ay F, Kahveci T: SubMAP: aligning metabolic pathways with subnetwork mappings. International Conference on Research in Computational Molecular Biology (RECOMB) 2010, LNCS 6044:15-30.
10. Ay F, Kellis M, Kahveci T: SubMAP: aligning metabolic pathways with subnetwork mappings. J Comput Biol 2011, 18(3):219-235. doi:10.1089/cmb.2010.0280. PMCID: 3123932. PMID: 21385030.
11. Liao CS, Lu K, Baym M, Singh R, Berger B: IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 2009, 25(12):i253-i258. doi:10.1093/bioinformatics/btp203. PMCID: 2687957. PMID: 19477996.
12. Koyuturk M, Grama A, Szpankowski W: Pairwise local alignment of protein interaction networks guided by models of evolution. International Conference on Research in Computational Molecular Biology (RECOMB) 2005, 48-65.
13. Cheng Q, Harrison R, Zelikovsky A: MetNetAligner: a web service tool for metabolic network alignments. Bioinformatics 2009, 25(15):1989-1990. doi:10.1093/bioinformatics/btp287. PMID: 19414533.
14. Kalaev M, Bafna V, Sharan R: Fast and accurate alignment of multiple protein networks. J Comput Biol 2009, 16:989-999. doi:10.1089/cmb.2009.0136. PMID: 19624266.
15. Chen M, Hofestadt R: PathAligner: metabolic pathway retrieval and alignment. Appl Bioinformatics 2004, 3(4):241-252. doi:10.2165/00822942-200403040-00006. PMID: 15702955.
16. Li Z, Zhang S, Wang Y, Zhang XS, Chen L: Alignment of molecular networks by integer quadratic programming. Bioinformatics 2007, 23(13):1631-1639. doi:10.1093/bioinformatics/btm156. PMID: 17468121.
17. Li Y, de Ridder D, de Groot MJL, Reinders MJT: Metabolic pathway alignment between species using a comprehensive and flexible similarity measure. BMC Syst Biol 2008, 2:111. doi:10.1186/1752-0509-2-111. PMCID: 2677397. PMID: 19108747.
18. Kuchaiev O, Milenkovic T, Memisevic V, Hayes W, Przulj N: Topological network alignment uncovers biological function and phylogeny. J R Soc Interface 2010, 7:1341-1354. doi:10.1098/rsif.2010.0063. PMCID: 2894889. PMID: 20236959.
19. Chor B, Tuller T: Biological networks: comparison, conservation, and evolution via relative description length. J Comput Biol 2007, 14(6):817-838. doi:10.1089/cmb.2007.R018. PMID: 17691896.
20. Pinter RY, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson M: Alignment of metabolic pathways. Bioinformatics 2005, 21(16):3401-3408. doi:10.1093/bioinformatics/bti554. PMID: 15985496.
21. Singh R, Xu J, Berger B: Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci USA 2008, 105:12763-12768. doi:10.1073/pnas.0806627105. PMCID: 2522262. PMID: 18725631.
22. Francke C, Siezen RJ, Teusink B: Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol 2005, 13(11):550-558. doi:10.1016/j.tim.2005.09.001. PMID: 16169729.
23. Sridhar P, Kahveci T, Ranka S: An iterative algorithm for metabolic network-based drug target identification. Pac Symp Biocomput 2007, 12:88-99. PMID: 17992747.
24. Ogata H, Fujibuchi W, Goto S, Kanehisa M: A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res 2000, 28:4021-4028. doi:10.1093/nar/28.20.4021. PMCID: 110779. PMID: 11024183.
25. Green ML, Karp PD: A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5:76. doi:10.1186/1471-2105-5-76. PMCID: 446185. PMID: 15189570.
26. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 1999, 27:29-34. doi:10.1093/nar/27.1.29. PMCID: 148090. PMID: 9847135.
27. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407(6804):651-654. doi:10.1038/35036627. PMID: 11034217.
28. Pfeiffer T, Soyer OS, Bonhoeffer S: The evolution of connectivity in metabolic networks. PLoS Biol 2005, 3(7):e228. doi:10.1371/journal.pbio.0030228. PMCID: 1157096. PMID: 16000019.
29. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science 2002, 297(5586):1551-1555. doi:10.1126/science.1073374. PMID: 12202830.