UFDC Home  myUFDC Home  Help 



Full Text  
ALGORITHMS FOR METABOLIC NETWORKBASED DRUG TARGET IDENTIFICATION By PADMAVATI SRIDHAR, A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2006 Copyright 2006 by Padmavati Sridhar I dedicate this thesis to my parents who have shaped me into who I am 4 ~i. ACKENOWLED GMENTS I thank my advisor, Dr. Tamer K~ahveci, for his exceptional guidance during the course of this research. He has been a constant source of motivation and support to me. I am grateful for his valuable technical and professional mentoringf. I thank Dr. Ranka, for his technical guidance during this research. I also thank Dr. Jermaine for his help and guidance. I have been blessed with supportive parents who have ahrl . encouraged me in all my endeavors. They were my first teachers, and they believe that knowledge and education are the best gifts they can give their children. I am grateful to them for their vision and for their constant support and encouragement. My brother, B I1 i 0 is my pillar of support, critic and one of my best friends. His clarity of thought and new perspectives have ahrl . helped me refine my ideas. My fiance, Ram, has stood by me in all my decisions and has helped shape my career. He is my sounding board for ideas and the first person I turn to for guidance. I am grateful to have him by my side .ll. li ;. I also thank all my friends and the people who have helped me in times of need. TABLE OF CONTENTS page ACK(NOWLEDGMENTS .......... . .. .. 4 LIST OF TABLES ......... ..... .. 6 LIST OF FIGURES ......... .... .. 7 ABSTRACT ............ .......... .. 8 CHAPTER 1 INTRODUCTION ......... ... .. 10 2 BACKGROUND AND RELATED WORK( .... .. 14 3 PROBLEM MODELING . ..._.. ... 16 4 OPTIMAL ALGORITHM . ..._. ... 18 4.1 State Space and Basic Strategy . ..... .. 18 4.2 OPMET Prioritization Strategies . ..... 20 4.2.1 Static OPMET ......... .. 20 4.2.2 Dynamic OPMET ......... .. 22 4.3 Filtering Strategies ......... . 24 4.3.1 Target Filter ......... .. 25 4.3.2 Nontarget Filter ......... ... 26 5 ITERATIVE ALGORITHM ......... .. 28 5.1 Initialization ......... ... .. 28 5.2 Iterative Steps ......... . .. 30 5.3 Maximum Number of Iterations . ...... .. 32 6 EXPERIMENTAL RESULTS ......... ... 36 6.1 Evaluation of OPMET Algorithm . .... 36 6.2 Evaluation of Iterative Algorithm . ..... 39 7 CONCLUSION ......... . .. .. 46 REFERENCES ............. ........... 48 BIOGRAPHICAL SK(ETCH ......... . .. 52 LIST OF TABLES Table page 51 Iterative Steps: lo is the initialization step; II and I2 are the iterations. 15 and Vc represent the damage values of reactions and compounds respectively computed at each iteration. 15 = [3, 0, 0] in all iterations. ... .. .. 35 61 1\etabolic networks from K(EGG with identifier (Id). C, R and Ed denote the number of compounds, reactions and edges (interactions) respectively. .. .. 45 62 Comparison of average damage values of solutions determined hv the iterative algorithm versus the optimal algorithm. ...... .. . 45 LIST OF FIGURES Figure page 31 A graph constructed for a metabolic network with three reactions R1, R2, and R3, three enzymes El, E2, and E3, and nine compounds C1, C9. C1TC16S, rectangles, and triangles denote compounds, reactions, and enzymes respectively. Here, C4 (Shown by double circle) is the target compound. Dotted lines indicate the subgraph removed due to inhibition of an enzyme. (a) Effect of inhibiting E2. (b) Effect of inhibiting El. ......... ... 17 41 The basic OPMET strategy for a hypothetical 4enzyme network. Enzymes are ordered as El, E2, E3, E4 nO, n1, n4 are the nodes generated. The initial I1. Ju~l cutoff threshold D = 10 initializedd to the total number of compounds in the network). nl is a true solution (shown by double circle) with damage d = 5. Since d < D, D is updated to 5. nl is saved and the subtree rooted at nl is pruned. The method backtracks tO n0. n82 n3 arT fRISe nodes generated along the search with damage d < D. us is a true node with d = 2. As d < D, D is updated to 2 and the method backtracks to search the unexplored space for solutions with d < D(indicated by the dashed edge). .. .. .. 27 42 Effect of deletion of enzyme El on the weights of reactions and compounds in the network. .. ... . .. 27 43 Dynamic OPMET procedure for enzyme evaluation. .. . .. 27 51 (a)A graph constructed for a metabolic network with four reactions R1, R4, three enzymes E1, E2 and E3, and five compounds C1, Cs. Circles, rectangles, and tri angles denote compounds, reactions, and enzymes respectively. Here, C1 (shown by double circle) is the target compound. (b)Effect of inhibiting E1. Dotted lines indi cate the subgraph removed due to inhibition of enzyme E1. ... .. .. 35 61 Comparison of OPMET ordering strategies ..... .. . 41 62 Comparison of OPMET filtering strategies ..... .... . 42 63 Average number of nodes by generated by Dynamic OPMET with combined filters. ......... ... . 43 64 Average optimal node rank by generated by Dynamic OPMET with combined filters. ......... ... . 43 65 Evaluation of iterative algorithm. (a)Average execution time in milliseconds. (b)Average number of iterations ...... ... .. 44 Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science ALGORITHMS FOR METABOLIC NETWORK(BASED DRITG TARGET IDENTIFICATION By Padnmavati Sridhar December 2006 C'I .Ir~: Tanter K~ahveci Major Department: Computer and Information Science and Engineering Traditional pharmacological drug discovery approaches focused more on the therapeutic effects of drugs than their sideeffects. Recent advances in bioinforniatics have fostered rational drug development methods that aim to reduce serious sideeffects. The first step in this approach is the identification of specific biological drug targets (enzymes or proteins), which can he manipulated to produce the desired effect (of curing a disease) with nxininiun disruptive sideeffects. In this thesis, we study the pharmacological problem of identifying the optimal enzyntecombination (i.e., drug targets) whose inhibition will achieve the required effect of eliminating a given target set of compounds, while incurring nxininial sideeffects. We propose two approaches to solve the problem. In our first approach, we formulate the problem as an optimization problem on metabolic networks. We define a graph hased computational model of the network, that encapsulates the impact of enzymes onto compounds. We propose OPM~ET, an Optimal enzyme drug target identification algorithm hased on Metabolic networks, to solve this problem optimally. It is a branchandhound algorithm to explore the search space. We develop a cost model and two enzyme prioritization strategies, Static OPMET and Dynamic OPMET, based on it. Static OPMET priorities enzymes according to their impacts, such that the most promising enzymes are inspected first, for possible inclusion in the optimal subset. Dynamic OPMET dynamically updates the priorities as the search space is explored. We also develop two filtering strategies to prune the search space while still guaranteeing an optima~l solution. Our experilments on Eischerichia C/ol ~i(.C/oli) metabolic network show that OPMET can reduce the total search time by several orders of magnitude as compared to the exhaustive search. In our second approach, we develop a heuristic solution to the same problem for metabolic networks with a large number of enzymes. We propose a scalable iterative algorithm which computes a suboptimal solution within reasonable timebounds for large metabolic networks. It evaluates immediate precursors of the target compounds and iteratively moves backwards to identify the enzymes whose inhibition will stop the production of the target compounds while incurring minimum sideeffects. We show that this algorithm converges to a suboptimal solution within a finite number of such iterations. Our experiments on the E. Coli metabolic network show that the average accuracy of this method deviates from that of the optimal solution only by 0.02 This iterative algorithm is highly scalable. It can solve the problem for the entire metabolic network of E. Coli in less than 10 seconds. CHAPTER 1 INTRODUCTION Motivation: Traditional drug discovery approaches focus more on the efficacy of drugs than their toxicity (untoward side effects). Lack of predictive models that account for the complexity of the interrelationships between the metabolic processes often leads to drug development failures [1]. Toxicity and/or lack of efficacy can result, if metabolic network components other than the intended target are affected. This is wellillustrated by the example of the recent failure of Tolcap~one (a new drug developed for Parkinson's disease) due to observed hepatic toxicity in some patients [2]. Postgenomic drug research focuses more on the identification of specific biological targets (gene products, such as enzymes or proteins) for drugs, which can he manipulated to produce the desired effect (of curing a disease) with minimum disruptive sideeffects [3, 4]. The main phases in such a drug development strategy are target identification, validation and lead inhibitor identification [5]. Enzymes catalyze reactions, which produce metabolites (compounds) in the metabolic networks of organisms. Enzyme malfunctions can result in the accumulation of certain compounds which may result in diseases. We term such compounds as TI ,11 I C'om pounds and the remaining compounds as NonTIr,1I I compounds. For instance, the malfunction of enzyme Ich. ,..;; l le s:.:.0... h t;, Jr~ ..nl.; . causes buildup of the amino acid, phenylalanine, resulting in phenylketonuria [6], a disease that causes mental retardation. Similarly, mutations in the gene that produces glucokinase results in a form of diabetes called maturityonset diabetes of the young type 2 (110DY2) [7]. This condition causes excessive accumulation of glucose in the blood (hyperglycemia), due to the malfunction of glucokinase. Glucokinase normally phI i a central role in insulin regulation and hence is responsible for maintaining the blood glucose level in the body. These examples underline the importance of identifying the optimal enzyme set which can he manipulated by drugs to prevent the excess production of target compounds, with minimal sideeffects. We term the sideeffects of inhibiting a certain enzyme combination as the IArl,,~ri.: caused to the metabolic network. Problem definition: Let R, C, and E denote the set of reactions, compounds, and enzymes of a metabolic network respectively. Given a large metabolic network and a set of target compounds, we consider the problem of identifying the set of enzymes whose inhibition eliminates all the target compounds and inflicts minimum damage on the rest of the network. Formally, we define the IArl,, .qi: of inhibiting an enzyme as the number of nontarget compounds whose production is stopped by the inhibition of that specific enzyme. Evaluating all enzyme combinations in order to identify the combination with minimum damage is not feasible for metabolic networks with a large number of enzymes. This is because, the number of enzyme combinations, i.e., 2E i, inCTreSeS exponentially with the number of enzymes. Efficient and precise heuristics are needed to tackle this problem. We formally state the optimal enzyme combination identification problem as Given a set of target compounds T (T C C), find the set of enzymes X (X C E) with minimum damage, whose inhibition stops the production of all the compounds in T. Lemke et al. [8, 9] defined the damage of inhibition of an enzyme as the number of compounds whose production stops after the inhibition of that enzyme. Our definition of IArl,, rli: is similar to that of Lemke in principle. We differ by excluding the target compounds from the damage computation. Different enzymes and compounds may have varying levels of importance in the metabolic network. Our model considers all the enzymes and compounds to be of equal importance (similar to Lemke's work). We can extend our model by assigning weights to enzymes and compounds based on their role in the network. However, we do not discuss these extensions in this thesis. For simplicity, we also assume that the input compounds to all reactions are present in the network and that there are no external inputs. Also, we are not incorporating backup enzyme activities [10] in our model. Contribution: In this thesis, we study the pharmacological problem of identifying the optimal enzymecombination (i.e., drug targets) whose inhibition will achieve the required effect of eliminating a given target set of compounds, while incurring minimum damage. In this section, we formally defined the problem. Further, we develop a graph based computational model of the metabolic network and propose two methods to solve the problem, based on this model. We discuss them briefly here. * Graphbased model: We define a graph based computational model of the metabolic network that encapsulates the impact of enzymes onto compounds. We formulate the optimal enzyme combination identification problem as an optimization problem on this graph. We develop a cost model that takes both the observed and potential damage (i.e, sideeffects) resulting from the inhibition of an enzyme set into account, for the cost computation. * Optimal solution: We propose a branchandbound algorithm, OPM~ET, an Optimal enzyme drug target identification algorithm based on Metabolic networks, to explore the search space and produce an optimal solution. This work is published as a technical report [11]. We develop two enzyme prioritization strategies, Static OPM~ET and D include. OPM~ET based on the cost model. Static OPMET priorities enzymes according to the impact of their inhibition on the production of compounds in the metabolic network. It inspects the most promising enzymes first, for possible inclusion in the optimal subset. Dynamic OPMET dynamically updates the priorities as the search space is explored. We develop two filtering approaches, namely in /* A filter and nonr4,,l ai Aflter, which are combined with the OP MET to prune the search space while still guaranteeing an optimal solution. The target filter eliminates a subspace when it is proven that there is no combination of enzymes in this space that can stop the production of all the target compounds (i.e., there is no useful drug target). The nontarget filter prunes subspaces where there is no solution with a damage less than the optimal solution found so far. * Approximate solution: We develop a scalable iterative algorithm as an approximation to the optimal enzyme combination detection problem. This work is published in PSB 2007 Online Proceedings [12]. This algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward from the target compounds. It starts by finding the damage incurred due to the removal of each reaction or compound by evaluating its immediate precursors. It then iteratively improves the damage by considering the damage computed for the immediate precursors. It converges when the damage values cannot he improved any further. We prove that the number of iterations is at most the number of reactions on the longest path from any enzyme to the target compounds in the underlying pathway. To the best of our knowledge, this is the first polynomial time solution for a metabolicnetwork hased drug target identification problem. We extracted the metabolic network data of E. Coli fr~om the KEG 13 4]daabs for our experiments. Our experiments on the E. C'ol mea oi network show that' our first method (optimal solution) reduces the total search time by several orders of magnitude as compared to the exhaustive search, for mediumsized metabolic networks. Dynamic OPMET prunes 91.6 .of the search space. It generates the optimal enzyme combination within the exploration 0.005 of the search space on average. Our. experilments on the Ei. oli metabolic network also show that the average accuracy of our approximate method iterativee solution) deviates from that of the optimal solution only by 0.02 .for mediumsized networks. It is also highly scalable. It can solve the problem for the entire metabolic network of Escherichia Coli in less than 10 seconds. Thesis organization: The rest of the thesis is organized as follows. C'!s Ilter 2 discusses the background and related work. ('! .pter 3 describes our proposed network model. ('!s Ilter 4 presents the proposed OPMET algorithm with static and dynamic prioritization and filtering strategies. ('!s Ilter 5 presents the proposed iterative algorithm for determining the enzyme combination whose inhibition achieves the desired effect of inhibiting the production of target compounds. ('! .pter 6 discusses experimental results of testing our methods on the E. Coli metabolic network. ('! .pter 7 presents the future work and concludes the thesis. CHAPTER 2 BACKGROUND AND RELATED WORK( Classical drug discovery approaches involve incorporating a large number of hypothetical targets into invitro or cellbased .I li and performing automated high throughput screening (HTS) of vast chemical compound libraries [5, 15]. Postgenomic advances in bioinformatics have fostered the development of rational drugdesign methods and reduction of serious sideeffects [1, 1618]. This has engendered the concept of reverse phrI'I''' ..J.''I ,'1i [4], in which, the first step is TI, i. L .J.;./.I.: el..>,n, the step to identify protein targets, that may be critical intervention points in a disease process [:3, 19]. This is followed by TI, i. A validation, the step to demonstrate that an identified drug target is primarily responsible for the therapeutic activity of a proven drug [2022]. The third step is Lead Inhibitor I I. ,./I.:. el..>,n [5]. Since the reverse pharmacological approach is driven by the mechanics of the disease, it is expected to be more efficient than the classical approach [4]. Rapid identification of enzyme (or protein) targets needs a thorough understanding of the underlying metabolic network of the organism affected by a disease. The availability of fully sequenced genomes has enabled researchers to integrate the available genomic information to reconstruct and study metabolic networks [2:325]. These studies have revealed important properties of these networks [2628]. The vital step in understanding the relationship between metabolic networks and drug discovery is to develop an accurate model of the pathway. The accuracy of the model determines how well the candidate targets reflect the real biological process. A good model also has to be flexible to meet the possible future updates on the metabolic networks with minimal changes. A number of models have already been developed, such as graphs,' l.io 1 Io networks [29, :30], boolean networks [:313:3], logical networks [:34, :35], differential equations [:36, :37], and stochastic models [:38, :39]. Finding the right model is a challenging problem. This difficulty is further increased due to inaccuracies in the network such as missing enzymes or reactions. The potential of an enzyme to be an effective drug target is considered to be related to its essentiality in the corresponding metabolic network [40]. Lemke et. al proposed the measure i .;I,,. I/****l'9: as an indicator of enzyme essentiality [8, 9]. Recently, a computational approach for prioritizing potential drug targets for antimalarial drugs has been developed [41]. A chokepoint analysis of P.falciparcum was performed to identify essential enzymes which are potential drug targets. The possibility of using enzyme inhibitors as antiparasitic drugs is being investigated through stoichiometric an~ llh; of the metabolic networks of parasites [42, 43]. These studies show the effectiveness of computational techniques in reverse pharmacological approaches. A combination of microarray timecourse data and geneknockout data was used to study the effects of a chemical compound on a gene network [44]. An investigation of metabolite essentiality is carried out with the help of stoichiometric analysis [45]. These approaches underline the importance of studying the role of compounds metabolitess) during the pursuit of computational solutions to pharmacological problems. CHAPTER 3 PROBLEM MODELING, In this chapter, we describe our model of the metabolic network. We develop a graph based model that captures the interactions between reactions, compounds, and enzymes. Our model is a variation of the boolean network model [31, 32]. Let R, C, and E denote the set of reactions, compounds, and enzymes respectively. The vertex set consists of all the members of R U CU E. A vertex is labeled as reaction, compound, or enzyme based on the entity it refers to. Let VR, VC, and VE denote the set of vertices from R, C, and E. A directed edge from vertex x to vertex y is then drawn if one of the following three conditions holds: 1. x represents an enzyme that catalyzes the reaction represented by y. 2. x corresponds to a substrate for the reaction represented by y. 3. x represents a reaction that produces the compound mapped to y7. Figure 31 illustrates a small hypothetical metabolic network. A directed edge from an enzyme to a reaction implies that the enzyme catalyzes the reaction (i.e., El of Ir l. 0 R1 and E2 CatalyZeS R2 and R3). A directed edge from a compound to a reaction implies that the compound is a reactant. A directed edge from a reaction to a compound implies that the compound is a product. In this figure, Cq is the target compound (i.e., the production of Cq should be stopped). In order to stop the production of Cq, R2 has to be prevented from taking place. This can be achieved in two v .s. One way is by disrupting one of its catalyzing enzymes (E2 in this case). Another is by stopping the production of one of its reactant compounds (C2 Or C3 in this case). If we stop the production of C2, We need to recursively look for the enzyme which is indirectly responsible for its production (El in this case). Thus, the production of the target compound can be stopped by manipulating either El or E2 Figure 31(a) shows the disruption of E2 and its effect on the network. Inhibiting E2 resuts n te kockout f cmponds@, s andl C9 in addition to the target compound, R2R Disrupted pathway  Disrupted Pathway (a) (b) Figure 31. A graph constructed for a metabolic network with three reactions R R2, and R3, three enzymes El, E2, and E3, and nine compounds C Cg* Circles, rectangles, and triangles denote compounds, reactions, and enzymes respectively. Here, Cq (shown by double circle) is the target compound. Dotted lines indicate the subgraph removed due to inhibition of an enzyme. (a) Effect of inhibiting E2. (b) Effect of inhibiting El. C/. Note~ thati the~ productionI of 7 is not stopped since it is produced by R1 even after the inhibition of E2. We term the effect of disrupting nontarget compounds as the side effect of manipulating E2. We define the number of nontarget compounds knocked out as the A~rlsa,,rl the manipulation of an enzyme set causes to the metabolic network. In this case, the damage of inhibiting E2 is 3 (iO., g5 C8 andu C9). Figure: 31(b) shows the inhibition of El and its effect on the same network. In this case, the damage is 2 (i.e., C2 and C5). The important observation is that El and E2 both achieve the effect of disrupting the target compound, CQ. Hence, El and E2 are both potential drug targets. However, El is a better drugtarget than E2 SillCe 11 CauseS leSSer damage. CHAPTER 4 OPTIMAL ALGORITHM The number of possible subsets of enzymes that need to be considered for finding the optimal drug target is exponential in the number of enzymes. This renders exhaustive search infeasible beyond small sized metabolic networks. In this chapter, we propose OPMET, a branch and bound algorithm that considerably reduces the number of possible combinations to be searched while still guaranteeing to find an optimal solution, for mediumsized networks (networks with at most 32 enzymes). Section 4.1 describes the basic branch and bound strategy. As part of any effective branch and bound strategy it is important to find a good solution quickly. This allows for effective filtering (or pruning) of subspaces that can be guaranteed not to have better solution that the best found so far. Our prioritization (Section 4.2) and filtering (Section 4.3) strategies achieve this goal. 4.1 State Space and Basic Strategy In this section we discuss our modeling of the search space and the basic underlying strategy of the OPMET algorithm. We develop a systematic and flexible enumeration of the search space and use it to build our efficient search strategy to find optimal results. Let E = {Ei Vi, 1 < i < m} denote the set of enzymes for a metabolic network. The search space is modeled as a tree structure. Every node of this tree corresponds to a state in the search space and it is represented by a 4tuple ([e,,, e,,, exm], k, d, remove). Here, xtl, xm, is a permutation of 1, 2, m. The first parameter corresponds to the state of all the enzymes (i.e., ex, corresponds to enzyme Ei). ex, = 1 if Ei is inhibited. Otherwise, ex, = 0. The parameter k indicates that the first k enzymes are considered at that search state. The decision to inhibit or not inhibit has been fixed for enzymes from 1 to k 1 and we now set e,, = 1 and ex, = 0, Vi, k < i < m. The damage incurred due to inhibited enzymes at that state is represented by d. The final parameter, remove, is a boolean variable. It takes value True if the inhibited enzymes stop the production of all the target compounds. Otherwise, it is set to False. We call a node with remove = True as a true node, and a node with remove = False as a false node. Figure 41 shows part of the search tree for a hypothetical 4enzyme network. The tuple for node nl indicates that enzyme El is inhibited and it is a true node with damage d = 5. k = 1 since the decision to inhibit or not has been fixed only for one enzyme. OP MET algorithm: We start with the root node ([0, 0, 0], 0, 0, False) indicating that all enzymes are present in the network. As the search space is traversed, we keep the true node with the minimum damage found so far as the current true solution and store the associated damage value as D, the 11g l..1.l cutoff threshold. D is initialized to the number of compounds in the network. At any point, we have an active set of nodes A, stored in a stack structure. A contains the nodes currently being considered. Let node NV = ([exr, exrr,,  exm]~, k, d, remove) be the node on top of this stack (i.e., the node to be evaluated). There are three cases: * Case 1: N has damage d > D. In this case we prune the subtree rooted at NV. We then backtrack. * Case 2: NV is a true node with damage d < D. In this case, we save NV as the current true solution and update D with the damage value of NV. We then backtrack. * Case 8: NV is a false node with damage d < D. In this case, we insert NV in the active set A for backtracking purposes. We then create a new node M~ by setting exey, = 1 in NV (i.e., we inhibit the enzyme E,, ). The resulting node is M~ ([e,,, ex,,, exm], k + 1, d', remove'). The node M~ is evaluated in the next step similarly. Backtrackingf involves following steps. First we pick the top node from the active nodes stack A. Let NV = ([e,,, e,,, exm], k, d, remove) denote this node. We then set e,,, = 0 (indicating the node we are backtracking from) and e,,,, = 1 in NV (i.e., we inhibit the enzyme e,, ). The resulting node becomes the node to be evaluated in the next step. The first two cases above stop expanding the tree at the current node. The former one implies that the current node is a possible solution (node nl in Figure 41). The latter one implies that the current node incurs too much damage to lead to a possible solution. The third case happens when the current node does not stop production of all the target compounds, but the damage is lower than the damage of the current best solution (nodeS n2 and n3 in Figure 41). Such nodes may produce a possible solution (node n4 in FiguTO 41) With the inhibition of more enzymes. Thus, they need to be explored further to ensure that we find an optimal solution. The search terminates when there are no more nodes to explore. At this stage, the current true solution is the optimal solution. 4.2 OPMET Prioritization Strategies In this section we present our cost model as the basis for enzyme evaluation and our resulting enzyme prioritization strategies. The purpose of ordering enzymes by a priority measure is to test the combinations with a high likelihood of deleting the target compounds first. We develop the following two enzyme prioritization strategies which are described in more detail in the rest of the section: * Static OPMET: We preorder the enzymes according to a priority measure and apply the OPMET algorithm. * Dynamic OPMET: We dynamically select the next enzyme to inhibit based on the partial solution. 4.2.1 Static OPMET The simplest way to order the enzymes is to sort them in ascending order of damage. However, this is inaccurate for several reasons. First, since damage is defined in terms of non target compounds, it does not reflect the impact of an enzyme on the target compounds. Second, the inhibition of an enzyme can make a compound more vulnerable even if the compound is not removed entirely. For example, in Figure 31, the production of C7 is not stopped after the inhibition of El or E2 individually. However, C7 is removed if El is inhibited as well as E2. Thus, damage values of individual enzymes are not good indicators of the combined damage of these enzymes. Cost Model: We develop a cost model as the basis for enzyme ordering in OPMET. This cost model takes both the observed and potential damage (sideeffects) resulting from the inhibition of an enzyme set into the cost computation. For each enzyme Ei E E, we compute a weight W(Ei) as W(Ei) = 0 if Ei inhibited and W(Ei) = 1 otherwise. We assign rational weights (fractions between 0 and 1) to the reaction and compound nodes and the edges. Intuitively, the weight of a node or edge denotes the rate at which that node or edge appears in the network. We then use these weights to compute the cost of EG. The weight of each node depends on the weights of the incoming edges and the type of the node (i.e., reaction or compound): * Cost Rule 1: Let Rj be a reaction node. Let I, 1 < i < k, denote the weights of the incoming edges to Rj. We compute the weight of Rj as W(Rj) min ,{te' }. This computation is intuitive since a reaction takes place only if all the inputs are present . * Cost Rule 2: Let Cj be a compound node. Let I, 1 < i < k, denote the weights of the incomin~g edges to Cy. W/ie compute the wevigh~t of C, a~s W(C,) (Ef ,{ reactions that produce it stops. We define the weight of an edge as the weight of the node for which it is the outgoing edge. In order to compute the cost of Ei, we set the weight of Ei to zero (i.e., W(Ei) = 0). The weights of all the reaction and compound nodes are assigned progressively by a breadthfirst search, according to the above scheme. The weights of all the nodes and edges which can be reached from Ei are recomputed to reflect the change. The effect of deleting El is shown in Figure 42. We define an impact vector for each enzyme based on the effects of its inhibition. Definition 1. Given a network with a compounds, Cy .Lt (C)dnt h weight of the node corresponding to Cj after the inhibition of t.'.c;,I,, Ei. We I. I;,..: the impact vector of Ei as I(Ei) =[Wi(Cz), Wi(C2>l '; i(Cn)]. We term Wi(Cj) as theC impact of Ei on Cj, Vj. M The impact vector of an enzyme approximates the amount of each compound that remains after the inhibition of that enzyme. Every entry of the impact vector is a fractional number between 0 and 1, where 0 indicates that the corresponding compound does not exist after inhibition of the corresponding enzyme. We define the cost of an enzyme as follows: Definition 2. Given a network with a compounds, Cj, 1 < j < n. Assume that the compounds Cj, Vj, 1 < j < k < n constitute the set of Iary,ll compounds. Assume that the remaining compounds @,l Vj, k 1 < j < n constitute the nonf ore,li compounds. Let I(Ei) = [Wi(C1), Wi(C,)] denote the impact vector of Ei. We I. I;,:: the cost of Ei as cost(E,)= I(E,) V',: where: V =[vi1, :, v,] is the normalization victor: t=" o 1 < i < k. and I = (n'\ k) for k <( i Each target compound contributes a positive value and each nontarget compound contributes a negative value to the cost of an enzyme. This is justified since the cost promotes removal of target compounds and demotes the removal of nontarget compounds. The magnitude of the contribution of a compound increases linearly with the amount of that compound that diminishes from the system after the inhibition of the corresponding enzyme . In order to benefit from the pruning power of OPMET cases 1 and 2 (see Section 4.1), we need to compute the permutation xxl, xm carefully. The earlier we place the enzymes in the optimal solution in this permutation, the better, as OPMET reaches the optimal solution earlier under such an ordering. Thus, reaching the solution with the smallest possible damage value (i.e., the optimal solution) increases the chances of pruning the remaining nodes of the search tree. We propose to sort the enzymes in ascending order of cost value. This is because a small cost value indicates that the corresponding enzyme has a large impact on target compounds and low impact on nontarget compounds. 4.2.2 Dynamic OPMET Static OPMET priorities enzymes based on their individual potential to be part of the solution. However, this model does not account for the combined damage of a set of enzymes. Finding the damage of a set of enzymes is nontrivial. This is because the combined damage depends on the overlap of the reactions catalyzed by these enzymes as well as their individual damages. Enzymes that catalyze almost the same set of reactions have less combined damage compared to enzymes that catalyze disjoint reaction sets. This is due to the overlap of the damage of the enzymes in the former case. We develop a dynamic strategy, Dynamic OPMET. It sorts the enzymes on the fly. At a high level, this strategy evaluates the current state at every step and picks the enzymetoinhibit for the next step based on this evaluation. We propose an incremental algorithm by using the impact vector of individual enzymes. We predict the remaining fraction of all the compounds after inhibition of each enzyme with the help of its impact vector. This algorithm is described in Figure 43. Let R = [rl, r2, ru] denote the remaining fractions of compounds (step 1). Here, ri E [0, 1] corresponds to compound Ci, Vi. We initialize ri = 1, Vi indicating that all compounds are being produced without any disruption. Let V be the normalization vector as given in Definition 2. Let I(Ei) be the impact vector of enzyme Ei (see Definition 1). At every step of the Dynamic OPMET algorithm, let NV = ([e,,, e,,, exm]~, k, d, remove) be the node currently being evaluated (i.e., the decision to inhibit or not inhibit has been fixed for e,,, e,,, ex,_z) (step 2). We now need to decide which enzyme has to be evaluated next. In Step 3a, for every enzyme in the remaining enzyme set (e,,, Vi, k < i < m), we compute the new remaining fractions of compounds (Ri). This is done by a Vector Direct Product of R and the impact vector of Ei (I(Ei)). Vector direct product is defined as X e Y = [xlyl, x292, n~yn Where X = [xl, x,] and Y = [yl, y,]. The resulting vector Ri is an approximation to the impact of inhibition of the enzyme Ei in addition to already inhibited enzymes. This is justified since the quantity of a compound eliminated by a combination including Ei will be at least as much as the quantity eliminated by Ei alone. A good candidate enzyme at this step is the one that ensures that lesser of the target compounds remain after its inhibition. Also, it should ensure that the non target compounds suffer the minimum possible damage. Our cost model satisfies these requirements. In Step 3b, we compute the cost of each enzyme as the dot product of Ri and V. In step 4, we pick the enzyme (Ej) with the minimum cost. Thus, this strategy chooses the the next best enzyme to inhibit dynamically. The cost of finding the best enzyme at each step takes O(mn), where m and a denote the number of enzymes and compounds in the metabolic network respectively. This is because a vector product costs O(u), and O(m) such products are carried out. 4.3 Filtering Strategies So far, we have described how OPMET traverses the state space. In this section we propose two filtering strategies to eliminate large portions of the search space quickly while still guaranteeing the optimal solution. The following theorem establishes a relationship between the impact of enzymes and their damage. Theorem 1. Let E = {El, E2, Er } be a set of ;..oto Let Cj be a compound in the metabolic network, rk. Le d4(C ) 1 < i < r, denote the impctofEsonCy If the inhibition of all the ,:..ater. in E stops the production of Cj, then CE (1 d,(C,)) '> 1. M Proof of Theorem 1: Let E = {El, E2, Er } be a set of enzymes. Let Cj be a compound in the metabolic network. Let di, 1 < i < r, denote the impact of Ei on Cj. Let di(Rk), 1 < i < r, denote the impact of Ei on Rk. We need to show that the following rules hold: Rule 1: If the inhibition of all the enzymes in E stops the production of Cj, then C=1(1 d,(C,)) > 1. Rule 2: If the inhibition of all the enzymes in E stops a reaction Rk, then CE (1  d (~> > 1. We first consider the case when a reaction RkIS iS topped and prove that Rule 2 holds, given that Rule 1 is correct. Case 1: 3 E E E such that Ej catalyzes Rk. Hence, the inhibition of Ej stops Rk* The impact of Ej on Rk, is 1 dj Rk) = 1. Therefore, CE (1 di(Rk)) > . Rule 2 holds. Case 2: o enzym E eE catalyzes Rkc. This implies that E removes Ci,whcisa input to Rk. From Rule 1, C~(1 d (Ci)) > 1. From the Cost Model (see Cost Rule 1 in Section 4.2), di(Rk) min di(Cl' k / kis the set of all input compounds of Rk. Thus, 1 4(R) =maX 1 di k)}I. Hence, 2=1( iR~ d4R 1 d Rule 2 holds. We now consider the case when the production of ,,, a opud ysos n rv that Rule 1 holds, given that Rule 2 is correct. If E removes Cj, it means that all the reactions that produce Cj have been stopped. Let R1, R2, Rt be the reactions that produce Cj. For each reaction Rk, 1 < k < t, one of the the following two cases has to be true. Case 1: 3 E E E such that Ej catalyzes Rk. Hence, the inhibition of Ej stops Rk. The impact of Ej on Rk, is 1 dj Rk) = 1. From the Cost Model (see Cost Rule 2 in Section 4.2), 1 dy (Cj) > ,. Since all the t input reactions of Cj have been stopped, pr (1 d4C ))> 1xt > 1 Rule 1 holds. Case 2: No enzyme E E E directly catalyzes Rk. But since RkIS iS topped by the inhibition of enzymes in E, C~(1 d (Rk)) > 1 (Rule 2). Also given that t reactions produce Cj, (1~~ d1(Ci) = 1 Rule 1 holds. Next, we describe our filtering strategies. 4.3.1 Target Filter As we traverse the state space considerable effort is spent evaluating combinations which might yield a solution with damage value less than D, the global cutoff threshold (see Section 4.1). If a combination does not delete the target compound set, the effort spent is wasted. Filtering such combinations will improve the performance of the search vastly. This is the motivation behind our T o9. I flter. The target filter eliminates a bulk of the search space when it is proven that there is no combination of enzymes in this space that can stop the production of all the target compounds (i.e., there is no useful drug target). This filtering strategy is based on Theorem 1. Formally, let node NV = ([e,,, e,,, exm]~, k, d, False) be a node in the search space. Let T denote the set of target compounds. Backtrack if i=1 i=k+1 In this inequality, the first term indicates the impact of enzymes, which are currently part of the solution set, on the target compounds. The second term represents the impact of the remaining enzymes on the target compounds. 4.3.2 Nontarget Filter The motivation behind this filtering strategy is to quickly determine if there is any solution in the subtree with a damage d < D, the global cutoff threshold (see Section 4.1). This filter utilizes Theorem 1 similar to the Target Filter. The idea is as follows. At a given node NV, for each target compound, C, we find the minimum number of enzymes, m such that i= 1 This gives us the minimum number of enzymes needed to delete C. Let mmax be the maximum value of m for any target compound (i.e, we will need at least mmax enzymes to delete the entire target compound set). Now, we sort the remaining enzymes (enzymes not considered so far) in the ascending order of their damage values. Let dmax be the damage of the enzyme at index mmax. If dmax in addition to the damage incurred so far is greater than D, we prune the sub tree rooted at NV. n "o[()())() (). () False] E1 Ep() n 2[()100, 2, 1, False] E 1 na ~[(11() 3, 2, Truel Figure 41. Figure 42. 1. Let N 2. Let R */ The basic OPMET strategy for a hypothetical 4enzyme network. Enzymes are ordered as El, E2, E3, E4 no0 n1 4 are the nodes generated. The initial 11l. .l..rl cutoff threshold D = 10 initializedd to the total number of compounds in the network). nl is a true solution (shown by double circle) with damage d = 5. Since d < D, D is updated to 5. nl is saved and the subtree rooted at nl is pruned. The method backtracks tO n0. n2, n3 arT fRISe nodes generated along the search with damage d < D. us is a true node with d = 2. As d < D, D is updated to 2 and the method backtracks to search the unexplored space for solutions with d < D(indicated by the dashed edge). /E1 Disrupted Pathway Effect of deletion of enzyme El on the weights of reactions and compounds in the network. = ([e, r, eX25,.. e 67m], k, d, remove) be the node currently being evaluated. =[rT, rg, r,] /* ri E [0, 1] corresponds to the remaining fraction of compound C,, Vi 3. For every candidate enzyme ei E {e' ex , ,em (a) Ri R 0I(E,) /* Compute new ratio */ (b) Cost(E,) Ri VT /* Cost of inhibiting E, */ 4. Let j = arg min {~Cost(E3)} 5. R = Rj /* Update the ratio after inhibition of the enzyme with least cost */ 6. Select E, as the next enzyme to inhibit. Figure 43. Dynamic OPMET procedure for enzyme evaluation. CHAPTER 5 ITERATIVE ALGORITHM In this chapter, we develop a scalable iterative algorithm that finds a suboptimal solution to the enzymetarget identification problem quickly. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one, by tracing backwards from the target compounds. We evaluate the immediate precursors of the target compounds and iteratively move backwards to identify the enzymes, whose inhibition will stop the production of the target compounds while incurring minimum damage. Our algorithm consists of an initialization step followed by iterations, until some convergence criteria is satisfied. Let E, R and C denote the sets of enzymes, reactions and compounds of the metabolic network respectively. Let E, R and C denote the number of enzymes, reactions and compounds respectively. The primary data structures are three vectors, namely an it .;;;;;. vector VE = 61l, 62, *, 6E], a reaction vector VR = [rl, T2, S, rlR ], and a compound vector Vc = [cl, c2, CC]. Each value, ei, in VE denotes the damage of inhibition of enzyme, Ei E E. Each value, ri, in VR denotes the damage incurred by stopping the reaction Ri E R. Each value, ci, in Vc denotes the damage incurred by stopping the production of the compound Ci e C. 5.1 Initialization Here, we describe the initialization of vectors VE, VR, and Vc. We initialize VE firSt, VR second, and Vc last. Fr.:.;,l,,. vector: The damage ei, Vi, 1 < i < E, is computed as the number of nontarget compounds whose productions stop after inhibiting Ei. We find the number of such compounds by doing a breadthfirst traversal of the metabolic network starting from Ei. We calculate the damage ei associated with every enzyme Ei E E, Vi, 1 < i < E, and store it at position i in the enzyme vector VE. Reaction vector: The damage rj is computed as the minimum of the damages of the enzymes that catalyze Rj, Vj, 1 < j < R. In other words, let E,,, E,,, Ex, be the enzymes that catalyze Rj. We compute the damage of rj as rj = min {e,,). This computation is intuitive since a reaction can be disrupted by inhibiting any of its catalyzers. We calculate rj associated with every reaction Rj E R, Vj, 1 < j < R and store it at position j in the reaction vector VR. Let E(Rj) denote the set of enzymes that produced the damage rj. Along with rj, we also store E(Rj). Note that in our model, we do not consider backup enzyme activities for simplicity. Coemp~ound vector: The damage ck, Vk, 1 < k < C, is computed by considering the re~actionls that produce: Ck. Let R,,, R,,, Rxj be the^ rea^ctions thatC produce. ^ C We first compute a set of enzymes E(Ck~) foT CIk aS E(Ck~) = E(R,,) U E(R,,) U U E(R,,). We then compute the damage value ck aS the number of nontarget compounds that is deleted after ~lthe inibitionI~U of aIllthe enz~ymles in1 E(\Ck). This~ compIutation is based on the observation that a compound disappears from the system only if all the reactions that produce it stop. We calculate Ck aSSociated with every compound Ck, E C, 1 < k < C and store it at position k in the compound vector Vc. Along with ck, We alSO Store E(Ck~)* Consider the network in Figure 51. Column Io in Table 51 shows the initialization of the vectors for this network. In this figure, C1 is the target compound (i.e., the production of Ci should be stopped). In order to stop its production, we have to prevent R1 from taking place. This can be accomplished in two vi~. (1) By disrupting one of its catalyzing enzymes (El in this case). Figure 51(b) shows the effects of disrupting El. The damage el of El is three, as inhibiting El stops the production of three nontarget compounds C2, C3 and C4. (2) By stopping the production of one of its reactant compounds (Cg in this case). To stop the production of Cg, we need to recursively look for the enzyme combination which is indirectly responsible for its production (E2 and E3). Since the disruption of E2 or E3 alone does not stop the production of any nontarget compound, their damage values are zero. Hence, VE = [3, 0, 0]. The damage values for reactions are computed as the minimum of their catalyzers (rl = T2 = 61 and T3 = 4 = 62). Hence, VR = [3, 3, 0, 0]. The damage values for compounds are computed from the reactions that produce them. For instance, R1 and R2 produce C2 E(R1) = E(R2) = {El}. Therefore, c2 = 61. However, the combined damage of E2 and E3 IS 1. Hence, cs is equal to the damage of inhibiting the set E(R3) U E(R4) = {E2, E3) which is 1. Thus, cs = 1. Thus, the production of the target compound can be stopped by manipulating either El or a combination of E2 and E3. The optimal solution is the enzyme combination whose disruption has the minimum damage on the network (E2 and E3 in this case). 5.2 Iterative Steps We iteratively refine the damage values in vectors VR and Vc in a number of steps. At each iteration, the values are updated by considering the damage of the precursor of the precursors. Thus, at nth iteration, the precursors from which a reaction or a compound is reachable on a path of length up to n are considered. We define the length of a path on the graph constructed for a metabolic network as the number of reactions on that path (see Definition 4). There is no need to update VE Since the enzymes are not affected by the reactions or the compounds. Next, we describe the actions taken to update VR and Vc at each iteration. We later discuss the stopping criteria for the iterations. Reaction vector: Let C,, C,,, C,, be the compounds that are input to Rj. We update the damage of rj as rj = min~rj, min z,(cmi)) The first term of the min function denotes the damage value calculated for Rj during the previous iteration. The second term provides the damage of the input compound with the minimum damage found in the previous iteration. This computation is intuitive since a reaction can be disrupted by stopping the production of any of its input compounds. The damage of all the input compounds are already computed in the previous iteration ( wi (n 1)th iteration). Therefore, at iteration n, the second term of the min function considers the impact of the reactions and compounds that are away from Rj by n edges in the graph for the metabolic network. Let E(Rj) denote the set that contains the enzymes that produced the new damage rj. Along with rj, we also store E(Rj). We update all rj E VR using the same strategy. Note that the values rj can be updated in any order, i.e., the result does not depend on the order in which they are updated. Compound vector: The damage ck, Vk, 1 < k < C, is updated by considering the damage computed for Ck, in the previous iteration and the damages of the reactions that produce Ck~. Let R,,, R,,, Rx, be the reactions that produce Ck~. We first compute a set of enzymes as E(R,,) U E(R,,) U U E(R,,). Here, E(R,,), 1 < t < j, is the set of enzymes computed for Rt after the reaction vector is updated in the current iteration. We then update the damage value ck aS Ck, = min Ck, damage(UL=, E(Rxei))}. The first term here denotes the damage value computed for Ck, in the previous iteration. The second term shows the damage computed for all the precursor reactions in the current step. Along with ck, We alSO Store E(Ck~), the set of enzymes which provides the current minimum damage ck. Condition for convergence: At each iteration, each value in VR and Vc either remains the same or decreases by an integer amount. This is because a min function is applied to update each value as the minimum of the current value and a function of its precursors. Therefore, the values of VR and Vc do not increase. Furthermore, a damage value is ahrlw an integer since it denotes the number of deleted nontarget compounds. We stop our iterative refinement steps when the vectors VR and Vc do not change in two consecutive iterations. This is justified, because, if these two vectors remain the same after an iteration, it implies that the damage values in VR and Vc cannot be minimized any more using our refinement strategy. Columns II and I2 in Table 51 show the iterative steps to update the values of the vectors VR and Vc. In lo, we compute the damage rl for R1 as the minimum of its current damage (three) and the damage of its precursor compound, cg = 1. Hence, rl is updated to 1 and its associated enzyme set is changed to {E2, E3}. The other values in VR remain the same. When we compute the values for Vc, cl is updated to 1, as its new associated enzyme set is {E2, E3} and the damage of inhibiting both E2 and E3 together is 1. Hence, VR = [1, 3, 0, 0] and Vc = [1, 3, 3, 3, 1]. In I2, We find that the values in VR and Vc do not change anymore. Hence, we stop our iterative refinement and report the enzyme combination E2, E3 aS the iterative solution for stopping the production of the target compound, C1. Complexity analysis: We now discuss the complexity of our iterative algorithm. Space Complexity: The number of elements in the reaction and compound vectors is (R + C). For each element, we store an associated set of enzymes. Hence, the space complexity is O((R + C)~ I E). Time Complexity: The number of iterations of the algorithm is O(R) (see Section 5.3). The computational time per iteration is O(G (R + C)), where G is the size of the graph. Hence, the time complexity is O(RG o (R + C)). 5.3 Maximum Number of Iterations In this section, we present a theoretical analysis of our proposed algorithm. We show that the number of iterations for the method to converge is finite. This is because the number of iterations is dependent on the length of the longest nonselfintersecting path (see Definitions below) from any enzyme to a reaction or compound. Definition 3. In a given metabolic network, a nonselfintersecting path is a path which traces i t.;, vertex: on the path ~rer. A once. For simplicity, we will use the term path instead of nonselfintersecting path in the rest of this section. Definition 4. In a given metabolic network, the length of a path from an .;..an;. Ei to a reaction Rj or compound Ck, iS I. I;,:. .1 aS the number of unique reactions on that path. Note that the reaction Rj is counted as one of the unique reactions on the path from enzyme Ei to Rj. Definition 5. In a given metabolic network, the preceding path of a reaction Rj (or a compound Ok~ iS I. I;,:.. G S the l6Rgth Of the l089681 path fTOm it,:;; ,:.;;i,,o in that network to Rj (or Ck~). Theorem 2. Let VE = 61,i 621 .. I 6E VR = [li, T21 .. I r FR], and Vc = [cl, c2; ' CC] be the e..u;,I, reaction and compound vectors ,n Ic.~. :.l; ; Let a be the length of the longest path (see D. ~i,!~n il4i and 8) frmJ .; .co;Est eato R o compound Ck ). The Ualue Ty (OT Ck TemtinS CONStatR af1Te at most n ifTertOnS. Proof: We prove this theorem by an induction on the number of reactions on the longest path (see Definitions 4 and 3) from any enzyme in Ei corresponding to ei a VE tO Gk  BASIS: The basis is the case when the longest path from an enzyme Ei is of length 1 (i.e., the path consists of exactly one reaction). Let Rj be such a reaction. This implies that there is no other reaction on a path from any Ei to Rj. As a result, the value rj remains constant after initialization. Let Ck, be a compound such that there is at mos8~Ult: one~U recion fro anly enzl~ymel to G~. Let R,,, R,,, Rx, be the reactions that produce Ck~. Because of our assumption there is no precursor reaction to any of these reactions. Otherwise, the length of the longest path would be greater than one. Therefore, the values r,,, r,,, rx, and the sets E(R,,), E(R,,), E(R,,) do not change after initialization. The value Ck is computed as the damage of E(Ck~) E(R,,) U E(R,,) U U E(R,,). Thus, Ck remains unchanged after initialization and the algorithm terminates after the first iteration. INDUCTIVE STEP: Assume that the theorem is true for reactions and compounds that have a preceding path with at most n 1 reactions. Now, we will prove the theorem for reactions and compounds that have a preceding path with a reactions. Assume that Rj and Ck, denote such a reaction and a compound. We will prove the theorem for each one separately. Proof for Rj: Let C,,, Ox,, Ox, be the compounds that are input to Rj. The preceding path length of each of these input compounds, wi Cx, is at most n. Otherwise, the preceding path length of Rj would be greater than n. C'rse 1: If the preceding path length of C,, is less than n, by our induction hypothesis, c~s would remain constant after (n 1)th iteration. Thus, the input compound C~s will not change the value of rj after oth iteration. C'rse 2: If the preceding path length of C,, is n, then Rj is one of the reactions on this path. In other words, C~s and Rj are on a cycle of length n. Otherwise, the preceding path length of Rj would be greater than n. Recall that at each iteration, the algorithm considers a new reaction or a compound on the preceding path starting front the closest one. Thus, at oth iteration of computation of rj, the algorithm completes the cycle and considers Rj. This however will not modify rj. This is because the value of rj monotonically decreases (or remains the same) at each iteration. Thus, the initial damage value computed front Rj is guaranteed to be no better than r y after n 1 iterations. We conclude that rj will remain unchanged after oth iteration. Proof for Ok~: Let R,,, R,,, R,j he the reactions that produce Ok~. The preceding path length of each of these reactions, ;?i R,, is at most n. Otherwise, the preceding path length of Ok, would be greater than n. C'rse 1: If the preceding path length of R,s is less than n, by our induction hypothesis r,, would remain constant after (n 1)th iteration. Thus, the reaction R,s will not change the value of ck after oth iteration. Case 2: If the preceding path length of R,s is n, then front our earlier discussion for proof of Rj, r,, remains unchanged after oth iteration. Therefore R,, will not change the value of Ck after oth iteration. Hence, by induction, we show that the Theorem 2 holds. Figure 51. (a)A graph constructed for a metabolic network with four reactions R R4, three enzymes E1, E2 and E3, and five compounds C1, Cs. Circles, rectangles, and triangles denote compounds, reactions, and enzymes respectively. Here, C1 (shown by double circle) is the target compound. (b)Effect of inhibiting E1. Dotted lines indicate the subgraph removed due to inhibition of enzyme E1. Table 51. Iterative Steps: lo is the initialization step, II and I2 are the iterations. VR and Vc represent the damage values of reactions and compounds respectively computed at each iteration. VE = [3, 0, 0] in all iterations. Io II I2 VR, VC [3, 3, 0, 0], [3, 3, 3, 3, 1] [1, 3, 0, 0], [1, 3, 3, 3, 1] [1, 3, 0, 0], [1, 3, 3, 3, 1] C5 ,, E1~I R2 C4 () i\ I R3 R1 ,IL I\ Cz, \Cq) (b) CHAPTER 6 EXPERIMENTAL RESULTS We extracted the metabolic network information of Escherichia Coli (E. Coli) from K(EGG [13, 14] (f tp ://f tp .genome .j p/pub/kegg/pathways/eco/) The metabolic network in K(EGG has been hierarchically classified into smaller networks according to their functionality. We performed experiments at different levels of hierarchy of the metabolic network and on the entire metabolic network, that is an .I_ regfation of all the functional subnetworks. We devised a uniform labeling scheme for the networks based on the number of enzymes. According to this scheme, a network label begins with 'N' and is followed by the number of enzymes in the network. For instance, 'N20' indicates a network with 20 enzymes. Table 61 shows the metabolic networks chosen, along with their identifiers and the number of compounds (C), reactions (R) and edges (Ed). The edges represent the interactions in the network. For each network, we constructed query sets of sizes one, two and four target compounds, by randomly choosing compounds from that network. Each query set contains 10 queries each. We ran our experiments on an Intel Pentium 4 processor with 2.8 GHz clock speed and 1GB main memory running Linux operating system. 6.1 Evaluation of OPMET Algorithm We evaluate the performance of our first solution, the OPMET algorithm, for mediumsized networks(at most 32 enzymes) using the following three criteria: 1. Number of nodes generated: It represents the total number of enzyme combinations tested to complete the search. The lesser the number of nodes generated, the better the performance of the method. 2. Optimal node rank: This indicates the number of nodes explored before the method arrives at the optimal solution. A small optimal node rank indicates that the optimal combination is found quickly. 3. Execution time: This indicates the total time taken by the method to finish the search and conclude that it found an optimal solution. We intpleniented the OPMET algorithm with a random enzyme ordering, Static OPMET and Dynamic OPMET (see Section 4.2) as well as Nontarget filter and Target filter (see Section 4.3). We compare the methods we intpleniented to an exhaustive search which enunterates and tests all possible combinations of enzymes (2': where n is the number of enzymes). Evaluation of prioritization strategies: We compare our static and dynamic OPMET algorithms with a random ordering of enzymes and an exhaustive search. We do not include our filtering strategies here as the goal is to focus on the enzyme ordering. We present the results only up to a network of size 20 enzymes. This is because, beyond this, the search space grows rapidly, necessitating the use of filtering strategies. Table 61 shows the average number of nodes generated and the average optimal node rank of the random ordering, static OPMET and dynamic OPMET, as compared to an exhaustive search. The results show that Dynamic OPMET is the best strategy for all the tested networks. It generates the least number of nodes in all the experiments. All the methods generate significantly large number of nodes for N17. This is because the number of reactions and compounds of this network is much larger than the other networks, resulting in more interactions in the network (see Table 61). Both Dynamic and Static OPMET have small Optimal Node Ranks. Dynamic OPMET has the lowest optimal node rank on an average. On an average, it arrives at the optimal solution within the generation of 0.008 .of the number of nodes possible in an exhaustive search. This is significantly better than the random ordering which arrives at the optimal solution within the generation of 11 The difference between the Optimal Node Ranks and the number of nodes generated for Static and Dynamic OPMET shows that they find optimal results quickly but still evaluate many nodes to ensure that that result is optimal. We use Dynamic OPMET in the rest of the experiments as it has the best performance. Evaluation of filtering strategies: We measure how much our filtering strategies reduce the search space. The experiments are performed for Dynamic OPMET without filters, with Target Filter, Nontarget Filter and both of these filters together. Figure 62(a) shows the average number of nodes generated. The combined filters show the best pruning. On an average, the combined filters prune 91.5 of the nodes generated in the method without filters. We also see that most of this benefit is obtained front the Target Filter (it filters 91.4 .~ of the nodes generated by the method without filters). The combined filter generates only 12700 nodes for N24 (0.004 .~ of an exhaustive search). Figure 62(b) shows the average execution time. We see that the average execution time shows a trend similar to Figure 62(a). This establishes that the execution time is proportional to the number of nodes generated. The combined filter reduces the execution time by an order of magnitude as compared to the method without filters on an average. Figure 62(c) shows the average optimal node rank. All the methods have the same optimal node rank for networks except N24. This el__o1; that Dynamic OPMET yielded the optimal solution as early as possible for these networks. For N24, the combined filter shows that filteringf strategies can also lead to advancement in findings the optimal solution. For N24, Target filter arrives at the optimal solution 99 .earlier and the combined filters arrive at the optimal solution 99.9 .earlier than the method without filters (the additional 0.9 .~ intprovenient is obtained front the nontarget filter). We observe that the target filter is more efficient than the nontarget filter and the combined filter has the best performance. We use Dynamic OPMET with combined filters in the rest of the experiments. Scalability in the number of targets: We evaluate how Dynamic OPMET with combined filter scales with increasing target set size. The target sets compared are of sizes one, two and four. Figure 63 shows the average number of nodes generated. There is no clear correlation between the target set size and the number of nodes explored. The network topology determines how the target set size affects the number of nodes generated. The target compound set can contain highly correlated compounds (deleted by the inhibition of almost the same set of enzymes) or unrelated compounds (located in different parts of the network). If the target set consists of correlated compounds, an increase in the target set size decreases the average number of nodes generated. This can he seen for N14, N17, N20 and N24 (from 2 to 4 target compound queries). The number of nodes generated for the twotarget sets is 62 .lesser than single target sets on an average. For the fourtarget sets, this value is 75 lesser. On the other hand, if the target set consists of unrelated compounds, an increase in the target set size increases the average number of nodes generated. The average number of nodes increases by 1.2 times when we go from single target to twotarget sets and by 2.2 times when we go from the twotarget to the fourtarget sets. Figure 64 shows that the average optimal node rank increases sub linearly with the number of target compounds. On an average, The twotarget queries arrive at the optimal solution after generation of 1.8 times more nodes than the single target queries. Similarly, the fourtarget queries generate 3.8 times more nodes than the single target queries before arriving at the optimal solution. This II__ R that as the target set size increases, the number of enzyme combinations that need to be tested before we find the optimal solution mecreases. 6.2 Evaluation of Iterative Algorithm We evaluate our proposed iterative algorithm using the following three criteria: 1. Execution time: The total time (in milliseconds) taken by the method to finish execution and report if a feasible solution is identified or not. 2. Number of iterations: The number of iterations performed hv the method to arrive at a steadystate solution. :3. Average damage: The average number of nontarget compounds that are eliminated when the enzymes in the result set are inhibited. We intpleniented the proposed iterative algorithm and an exhaustive search algorithm which determines the optimal enzyme combination to eliminate the given set of target compounds with nxininiun damage. We intpleniented the algorithms in Java. Evaluation of accuracy: Table 62 shows the comparison of the average damage values of the solutions computed by the iterative algorithm versus the optimal algorithm. We have shown the results only upto N:32, as the optimal algorithm took longer than one dei to finish for networks slightly largest than N:32. We can see that the damage values of our method exactly match the damage values of the optimal algorithm for all the networks except NV24. For NV24, the average damage of the iterative solution differs front that of the optimal solution by only 0.02' This shows that the iterative algorithm is a good approximation of the optimal algorithm. The slight deviation in damage is the tradeoff for achieving the salability of the iterative algorithm (described next). Evaluation of scalability: Figure 65(a) plots the average execution time of our iterative method for increasing sizes of metabolic networks. The running time increases slowly with the network size. As the number of enzymes increases front 8 to 5:37, the running time increases front roughly 1 to 10 seconds. The largest network, N15:37, consists of 5:37 enzymes, and hence, an exhaustive evaluation inspects 2537 1 combinations (which is computationally infeasible). Thus, our results show that the iterative method scales well for networks of increasing sizes. This property makes our method an important tool for identifying the right enzyme combination for eliminating target compounds, especially for those networks for which an exhaustive search is not feasible. Figure 65(b) shows a plot of the average number of iterations for increasing sizes of metabolic networks. The iterative method reaches a steady state within 10 iterations in all cases. The various parameters (see Table 61) that influence the number of iterations are the number of enzymes, compounds, reactions and especially the number of interactions in 10000000 S1000000 S100000 N14 N17 N20 Pathway identifier MExhaustive Search M Random Order 0 Static Priority n Dynamic Priority (a) Average number of nodes generated 1000000 100000 N14 N117 N20 Pathway Identifier M Random Order a Static Priority 0 Dynamic Priority (b) Average execution time in milliseconds Figure 61. Comparison of OPMET ordering strategies the network (represented by edges in the network graph). Larger number of interactions increase the number of iterations considerably, as can be seen for networks NV22, NV48, NV96, N\1537, where the number of iterations is greater than 5. This shows that, in addition to the number of enzymes, the number of compounds and reactions in the network and their interactions also pIli wa significant role in determining the number of iterations. Our results show that the iterative algorithm can reliably reach a steady state and terminate, for networks as large as the entire metabolic network of E. Coli. 10000000 110 000 100 N14 N17 N20 N24 N28 N32 Pathway Identifer HNoFilter HNonTargetFilter OTargetFilter BCombinedFilters (a) Average number of nodes generated N14 N17 N20 N24 N28 N Pathway Identifier aNoFilter a NonTargetFilter 0 Targetfilter a CombinedFilters (b) Average execution time in milliseconds (c) Average optimal node rank Figure 62. Comparison of OPlMET filteringf strategies Figure 63. Average number of nodes by generated by Dynamic OPMET with combined filters. 100000 10000 S1000 o 100 10iii I N14 N17 N20 N24 N28 N32 Pathway identifier 1Target M2Target 04Target Figure 64. Average optimal filters. node rank by generated by Dynamic OPMET with combined 10000 e, 1000 S100 10 1 10 S7 6 S5 4 e,3 S2 Pathway Identifier zzzzzz00 zzzzzzzzzzzi Pathway Identifier (b) Figure 65. Evaluation of iterative algorithm. (a)Average execution time in milliseconds. (b)Average number of iterations Table 61. 1\etabolic networks from K(EGG with identifier (Id). C, R and Ed denote the number of compounds, reactions and edges (interactions) respectively. Id Metabolic Network C R Ed Id Metabolic Network C R Ed NON Polvketide 11 11 33 N42 Other amnito acid 69 63 208 biosynthesis N13 Xenobiotics 47 58 187 N48 Lipid 134 196 654 biodegradation N14 C'itrate or TC'A cycle 21 35 125 N52 Purille 67 128 404 N17 Galactose 38 50 172 N59 Energy 72 8'> '68 N20 Pentose phosphate 26: 37 120 N71 Nucleotide 10'> '17 684 N22 Glveall Biosynthesis 54 51 171 N96 Vitalitils and 145 175 550 Cofactors N24 Glveerolipid 32 49 160 N170 Amnito acid 54 378 1210 N28 Op .1, serille 36 46 151 N180 Carbohydrate 247 501 1650 and threonine N32 Pyrrivate 21 51 163 N537 Entire Network 988 1700 5833 Table 62. Comparison of average damage values of solutions determined by the iterative algorithm versus the optimal algorithm. CHAPTER 7 CONCLUSION Efficient computational methods are required to identify the optimal enzyntecombination (i.e., drug targets) whose inhibition will achieve the required effect of eliminating a given target set of compounds while incurring nxininial sideeffects. In this thesis, we proposed two solutions to this pharmacological problem. In our first work, we proposed an optimal solution for niediumsized metabolic networks. In our second work, we proposed an approximate solution for a large sized metabolic network. In the optimal method, we formulated the optimal enzyntecombination identification problem as an optimization problem on metabolic networks. We defined a graph hased computational model of the network that encapsulates the impact of enzymes onto compounds. We proposed OPMET, a branchandhound algorithm to explore the search space. We developed a cost model and two enzyme prioritization strategies, Static OPMET and Dynamic OPMET based on it. We also developed two filtering strategies to prune the search space while still guaranteeing an optimal solution. The filters compute an upper bound to the number of target compounds deleted and a lower bound to the sideeffect respectively. Our experiments on the E. C'oli metabolic network show that our optimal methods reduced the total search time by several orders of magnitude as compared to the exhaustive search. The optimal solution is reached by Dynamic OPMET within the exploration of 0.005 of the total search space on an average, proving that our methods are effective in approximating the impact of an enzyme on a compound. The Dynamic OPMET with combined filters pruned 91.6 of the search space on average. In our second method, we developed an approximate solution to the optimal enzyntecombination identification problem for largesized metabolic networks, for which an optimal solution is not feasible. We proposed a scalable iterative algorithm which computes a suboptinmal solution to this problem within reasonable tintebounds. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward front the target compounds. We evaluated the ininediate precursors of a target compound and iteratively moved backwards, to identify the enzymes, whose inhibition stopped the production of the target compound while incurring nmininiun damage. We showed that this approximate method converges within a finite number of such iterations. In our experiments on E. C'oli metabolic network, the accuracy of a solution computed by the iterative algorithm deviated front that of the optimal solution only by 0.02 Our iterative algorithm is highly scalable. It solved the problem for even the entire metabolic network of E. C'oli (which has at least 500 enzymes) in less than 10 seconds. In this thesis, we have developed a foundation for developing effective solutions to pharmacological problems, through an~ llh of metabolic networks. We are currently working to improve the accuracy of our network and cost models in modeling the metabolic network's behavior. We are also developing methods which will efficiently provide solutions to the same problem for larger metabolic networks. REFERENCES [1] Capranico G. A rational selection of drug targets needs deeper insights into general regulation mechanisms. Current M~edicinal C'I, I,,.i ry AntiCancer Agents, 4(5):393394, Sep 2004. [2] Deane K(.H.O., Spieker S., and Clarke C.E. CatecholOmethyltransferase inhibitors versus active comparators for levodopainduced complications in Parkinson's disease. Cochrane Database of S;,ich ol..: Reviews, 4, 2004. [3] Smith C. Hitting the target. Nature, 422:341347, Mar 2003. [4] Takenaka T. Classical vs reverse pharmacology in drug discovery. BJU International, 88(2):710, Sep 2001. [5] Drews J. Drug discovery: a historical perspective. Science, 287(5460):19601964, Mar 2000. [6] Surtees R. and Blau N. The neurochemistry of phenylketonuria. European Journal of Pediatrics, 159:10913, 2000. [7] Gloyn A.L. Glucokinase (GCK() mutations in hyper and hypoglycemia: maturityonset diabetes of the young, permanent neonatal diabetes, and hyperinsulinemia of infancy. Human M~utation, 22(5):35362, Nov 2003. [8] Mombach J.C., Lemke N., da Silva N.M., Ferreira R.A., Isaia E., and Barcellos C.K(. Bioinformatics analysis of mycoplasma metabolism: important enzymes, metabolic similarities, and redundancy. Computers in B..~~I J.m;, and M~edicine, 36(5):54252, May 2006. [9] Lemke N., Heriidia F., Barcellos C.K(., dos Reis A. N., and Mombach C.M. Essentiality and damage in metabolic networks. Bioinformatics, 20(1):115119, Jan 2004. [10] Ocampo et. al. Targeted deletion of mNth1 reveals a novel DNA repair enzyme activity. M~ol Cell Biol., 22(17):611121, Sep 2002. [11] Sridhar P., K~ahveci T., and Ranka S. OPMET: A metabolic networkbased algorithm for optimal drug target identification. Technical report, CISE Department, University of Florida, Sep 2006. [12] Sridhar P., K~ahveci T., and Ranka S. An iterative algorithm for metabolic networkbased drug target identification. PSB 2007 Online Proceedings, 2007. [13] K~anehisa M. A database for postgenome analysis. Trends in Genetics, 13(9):3756, 1997. [14] K~anehisa M. and Goto S. K(EGG: K~yoto encyclopedia of genes and genomes. Nucleic Acids Res., 28(1):2730, Jan 2000. [15] Wess G. How to escape the bottleneck of medicinal chemistry. Drug Discovery T c.rt;, 7(10):533535, May 2002. [16] Davidov E.J., Holland J.M., Marple E.W., and Naylor S. Advancing drug discovery through systems biology. Drug Discovery T ../.r ;, 8(4):175183, Feb 2003. [17] Broder S. and Venter J.C. Sequencing the entire genomes of freeliving organisms: the foundation of pharmacology in the new millennium. Annual Review of Ph,~r ,Ient..J..~~I,i and TI.:...J..it ~;, 40:97132, Apr 2000. [18] CI.! .1.01 S.K(. and Caldwell J.S. Fulfilling the promise: drug discovery in the postgenomic era. Drug Discovery T ../r ;, 8(4):168174, Feb 2003. [19] 'Proteome Mining' can zero in on drug targets. Duke University Medical News, Aug 2004. [20] Jackson L.K(. and Phillips M.A. Target validation for drug discovery in parasitic organisms. Current Top~ics in M~edicinal Clo, ne,.n;, 2(5):425438, if li 2002. [21] Nuttall M.E. Drug discovery and target validation. Cells Tissues Ollr ga 169(3):265271, 2001. [22] Schwardt O., K~olb H., and Ernst B. Drug discovery today. Current Top~ics in Medicinal Clo, I,,.iry, 3(1):19, Jan 2003. [23] Jeong H., Tombor B., Albert R., Oltvai Z. N., and Barabasi A.L. The largescale organization of metabolic networks. Letters to NATURE, 407:651654, Oct 2000. [24] Papin J.A., Price N.D., Wiback S.J., Fell D.A., and Palsson B.O. Metabolic pathi .1< in the postgenome era. TRENDS in Biochemical Sciences, 28(5):250258, if li 2003. [25] Teichmann S.A., Rison S., Thornton J.M., Riley M., Gough J., and Chothia C. The evolution and structural .Ilr Irhow: of the small molecule metabolic pathi .1< in Escherichia coli. JM~B, 311:693708, 2001. [26] Ma H., Zhao X., Yuan Y., and Zeng A.P. Decomposition of metabolic network into functional modules based on the global connectivity structure of reaction graph. Bioinformatics, 20(12):18701876, 2004. [27] Arita M. The metabolic world of Escherichia coli is not small. PNAS, 101(6):15431547, Feb 2004. [28] Hatzimanikatis V., Li C., lonita J.A., and Broadbelt L.J. Metabolic networks: enzyme function and metabolite structure. Current Op~inion in Structural B:. J. ~it 14(3):300306, 2004. [29] Pearl J. Probabilistic reasoning in intelligent st;,/. rt, Morgan K~aufmann, San Francisco, 1988. [30] Friedman N., Linial M., Nachman I., and Pe'er D. Using 'l. li ,o ~ networks to analyze expression data. JOB, 7(34):60120, 2000. [31] Somogyi R. and Sniegoski C.A. Modeling the complexity of genetic networks: understanding multigene and pleiotropic regulation. Consp~~il. ;, 1:4563, 1996. [32] K~auffman S.A. The origins of order: selforganization and selection in evolution. Oxford University Press, New York, 1993. [33] Akutsu T., Miyano S., and K~uhara S. Identification of genetic networks from a small number of gene expression patterns under the boolean network model. Pi . Symp~osium on B*'**** uL,,;.:( 4:1728, 1999. [34] Thieffry D., Colet M., and Thomas R. Formalisation of regulatory nets: a logical method and its automatization. M~ath. M~odelling Sci. Comp~uting, 2:144151, 1993. [35] Thomas R., Thieffry D., and K~aufmaun M. Dynamical behaviour of biological regulatory networksI: biological role of feedback loops and practical use of the concept of the loopcharacteristic state. Bulletin of M~athematical B..~~I J..;,i 57:247276, 1995. [36] CornishBowden A. Fundamentals of .;..one. kinetics. Portland Press, London, 1995. [37] Voit E.O. Comp~utational r,:..;&;.: of biochemical s;,;lent a practical guide for biochemists and molecular biologists. Cambridge University Press, Cambridge, 2000. [38] Arkin A., Ross J., and McAdams H.H. Stochastic kinetic analysis of developmental pathway bifurcation in phage lambdainfected Escherichia coli cells. Genetics, 149(4):16331648, 1998. [39] Gillespie D.T. Exact stochastic simulation of coupled chemical reactions. J. Phys. C'I,; I; 81:23402361, 1977. [40] Jeong H., Oltvai Z.N., and Barabasi A.L. Prediction of protein essentiality based on genomic data. ComPlex;Us, 1:1928, 2003. [41] Yeh I., Hanekamp T., Tsoka S., K~arp P., and Altman R.B. Computational analysis of plasmodium falciparum metabolism: organizing genomic information to facilitate drug discovery. Genome Research, 14:917924, 2004. [42] CornishBowden A. Why is uncompetitive inhibition so rare?: a possible explanation, with implications for the design of drugs and pesticides. FEBS Letters, 203(1):36, Jul 1986. [43] CornishBowden A. and Hofmeyr J. S. The role of stoichiometric analysis in studies of metabolism: an example. Journal of Theoretical B..~~I J..;,i 216:179191, May 2002. [44] Imoto et. al. Computational strategy for discovering dri_ 110~ gene networks from genomewide RNA expression profiles. In PSB .'thb:r Online Proceedings, 2006. [45] Imielinski 1\., Belta C., Halasz A., and Rubin H. Investigating metabolite essentiality through genome scale analysis of E. coli production capabilities. Bioinformatics, 21(9):200816, Jan 2005. BIOGRAPHICAL SKETCH Padmavati Sridhar was born on December 6, 1981, in ('11. H!!Isi India. She earned her B.Tech in information technology from Sri Venkateswara College of Engineering, affiliated to the University of 1\adras, in 2003. After graduation, she worked as an Assistant Systems Engineer at Tata Consultancy Services, India, for a year. Padmavati joined the Computer and Information Science and Engineering department at the University of Florida as a graduate student in Fall 2004. Here, she worked in the area of bioinformatics with Dr. Tamer K~ahveci. She earned her 1\aster of Science degree in computer engineering in December 2006. 