Efficient algorithms for querying enzymes to manipulate the steady state of
metabolic pathways
Bin Song, Tamer Kahveci, and Sanjay Ranka
Computer and Information Science and Engineering,
University of Florida, Gainesville, FL, 32611
{bsong,tamer,ranka}@cise.ufl.edu
Abstract. Metabolic pathways show the complex interactions among enzymes that transform chemical compounds.
The state of a metabolic pathway can be expressed as a vector, which denotes the yield of the compounds or the flux
in that pathway at a given time. The steady state is a state that remains unchanged over time. Altering the state of
the metabolism is very important for many applications such as biomedicine, biofuels, food industry and cosmetics.
The goal of the enzymatic target identification problem is to identify the set of enzymes whose inhibitions lead the
metabolism to a state that is close to a given desired or goal state. Given that the size of the solution space is exponential
in the number of enzymes, the target identification problem is very computationally intensive.
Given a metabolic pathway and a desired goal state of that pathway, we develop efficient algorithms to solve the
enzymatic target identification problem in this paper. We develop a measure, State Distance (SD), that evaluates the
affect of the inhibitions of a set of enzymes as a function of the deviation of the steady state of the pathway after their
inhibitions from the goal state. Using this measure, we develop two algorithms. The first one is a traversal approach
that explores possible solutions in a systematic way using a branch and bound method. The second one uses genetic
algorithms to derive good solutions from a set of alternative solutions iteratively. Unlike the former one, this one can
run for very large pathways.
Our experiments show that our algorithms can identify enzyme sets whose inhibitions produce steady states that follow
the results obtained in vitro in the literature from a number of applications. They also show that the traversal method is
a good approximation of the exhaustive search algorithm and it is up to 11 times faster than the exhaustive one. This
algorithm runs efficiently for pathways with up to 30 enzymes. For large pathways, the genetic algorithm can find good
solutions in less than 10 minutes.
Availability: All the datasets used in this paper and the code developed for this paper are available at www. cise.
ufl.edu/~bsong/ResearchProject/Impact.htm.
Key words: Metabolic pathways; metabolic engineering; steady state.
1 Introduction
Metabolic 1p.ilvI ..i\s are one of the most important data resources in biology. A metabolic 1p.,ivI .i\ is a complex
network of chemical reactions occurring within a cell. Enzymes catalyze these reactions. Reactions transform a set of
compounds into another set of compounds. Note that, the term Il v .i k is also used in the literature to denote the union
of all 1p.iihvi .i\ of an organism or large 1p.iiIv, .i\ To keep the notation consistent, we will use the term 1p.iihv .i\ instead
of network regardless of the 1p.illv .i\ size in this paper. The state of a metabolic ip.illvi .i can be expressed as a vector,
which denotes the yield of the compounds [27] or the flux [17] in the p.iihv\ ..i at a given time. Steady state is the state that
remains unchanged over time.
Many applications require altering the steady state of a given p.,ilv, .i\ For example, a healthy metabolism keeps
the state of the 1p.iih\ ..i at desired levels. External factors or genetic mutations can change the production rate of a set
of enzymes. They can even modify the structure of produced enzymes. Such unexpected enzyme behaviors can lead
to aberrations in the metabolism. For instance, low or missing activity of an enzyme may result in the blockage of the
p.,ilv .~.i Furthermore, this can propagate to other parts of the 1p.,ivi .i\ that need the compounds produced in the blocked
part of the 1p.ivI\ .i\ As a result, the production of compounds may increase or decrease. Such aberrations in the state of
the metabolism can lead to severe diseases. Examples include mental retardation, seizures, decreased muscle tone, organ
failure and blindness [22,7]. Thus, changing the state of the metabolism back to a desired level is often needed.
Many applications in fields such as cosmetics and food industry require manipulating the state of 1p.,ihvi..ii For
example, fatty acid hi, ill',i, p1.ilvI .i\ converts fatty acids that are used in the cosmetic industry in creams and lotions.
Butanoate metabolism produces poly/3hydroxybutyrate which is essential for producing plastics [3]. Mevalonic acid
p.ihv ..i\ and MEP/DOXP 1p.ihvi ..i\ produce carotenoid that are often used as antioxidant in food industry [21]. The
metabolisms of many organisms, such as bacteria, algae and plants naturally produce these compounds. A common
practice is to extract them from these organisms. By manipulating the 1p.,ilv ..i s of these organisms, the production of
these compounds can be increased significantly. Currently, this is done through cutting the consumption of the desired
compound by the underlying organism. This strategy, however, is inefficient. There is a great potential to improve the
efficiency through accurate computational methods.
One way to change the state of the p.idv ..i\ back to the desired one is to use chemical compounds (i.e. drugs) to
inhibit a set of enzymes. When an enzyme is inhibited, it cannot catalyze the reactions it is responsible from. As a result,
some entries in the steady state of the p.,ivI. .i\ may increase and some may decrease. The Enzymatic Target Identification
Problem aims to identify the set of enzymes whose inhibitions lead to a steady state of the metabolic p.iliv, .i\ that is as
close to a user supplied goal state as possible. We consider the enzymatic target identification problem in this paper.
The size and the complex structure of the metabolic 1p.,llv ..i\ along with the large size of the solution state space
makes the enzymatic target identification problem computationally challenging. We can compute the steady state of a
metabolic 1p.il1iv .i\ after inhibiting a given set of enzymes in polynomial time. (See Appendix for a discussion on steady
state computation.) However, enzymatic target identification aims to solve the inverse problem of finding the set of en
zymes to achieve a steady state that is close to a given goal state. We have shown that a simplified version of the enzyme
identification problem is an NPcomplete problem [13]. Thus, exhaustive search is impractical for the typical p.Ilv .i. s
that contain hundreds of enzymes, thousands of compounds and reactions. Assuming that finding a steady state of a path
way takes on the order of 10 milliseconds, it will require nearly a year of computational time to test all solutions with up
to four enzyme combinations on a metabolic 1p.ivI ..i\ with 500 enzymes. Existing methodologies restrict the search to
a small number of choking points (i.e., enzymes with many connections) or immediate precursors of the targeted com
pounds. Targeted compounds are the ones whose productions need to be stopped. These approaches, however, can have
Ni nili k.iii deviations from the optimal result. This is because inhibiting a precursor enzyme or a choking point can affect
the production of many other compounds that are not targeted.
Sridhar et. al and Song et. al considered a simplified version of the enzymatic target identification problem [24,25,23].
In their version, each entry of the state denotes whether the corresponding compound is present or not. For this simplified
version, the goal, then, is to identify the set of enzymes whose inhibitions eliminate all the targeted compounds while
incurring minimum damage. They have defined damage as the number of nontargeted compounds that are eliminated
because of inhibitions. Sridhar et al., used a branch and bound strategy to solve this problem for p.ilIvI .i\s with up to 32
enzymes in less than an hour [25]. Sridhar et. al and Song et. al developed an iterative heuristic method to scale their
solutions to large 1p.iv,\ .i\s [24,23]. The binary model described in these works, however, is limited. It cannot address
partial production of the compounds. In addition, it ignores the change in flux or yield due to the inhibitions of a set of
enzymes.
Figure 1 illustrates the limitations of the binary model
used by Sridhar et. al and Song et. al [24,25]. Inhibiting ..
El knocks out compounds C2, C4 and C5. Suppose C4 2
is the targeted compound. Inhibiting E1 stops two non ........ i
targeted compounds (C2 and C5). The damage is two us ....... .
ing the binary model. There are several drawbacks. One
is that the inhibition of El accumulates C1. The binary
model, however, ignores this. Furthermore, although the Legend:
inhibition of El does not fully eliminate C7, it influences Enzyme Reaction
the production of C7. The binary model disregards this in Compound Disrupted Pathway
fluence as well. Co
There are many models to reconstruct and simulate Fig. 1. Graph representation of a metabolic p.il,\; .1 with
the metabolism. On such method called Flux Balance three reactions R1, R2, Rs, two enzymes E1 and E2, and
Analysis, (FBA) [16,2,9], computes the flux values of a nine compounds C1,  , C9. Dashed lines show the impact
given p.iihiv .i\ at the steady state. Extreme R, n1i, ,, Anal of inhibiting enzyme El.
ysis [19] uses such models to find the path in a 1p.iiliv.;
that maximizes or minimizes the production of a given compound. This problem is similar to a special case of the enzyme
target identification problem considered in this paper. De et. al, for example, consider the extreme 1p.iliv .i\ analysis prob
lem [5]. In order to reduce the yield of a compound in a p.iihv\ .i De et. al use FBA to compute the optimal p.iliv .i\ so
that the yield of the target metabolite is minimum. They, then, change the concentration of the enzymes in other paths so
that these paths are inactive except that optimal one. This method has two major drawbacks. First, it requires changing
the concentration for many enzymes. In practice, changing the enzyme concentration is a costly process. Therefore, the
number of enzymes whose concentrations are altered should be kept low. Second, the alterations that change the pro
duction of a compound can affect the production of other compounds in that p.iliv, ..i Thus, the solution found by this
method can have ,sigii .iim sideeffects. In addition to these drawbacks, extreme 1p.'lv' ..i analysis cannot solve the enzy
matic target identification problem considered in this paper. Here, we develop methods to overcome the abovementioned
disadvantages and solve a more generic problem.
Contributions: In this paper, we address the enzymatic target identification problem. We overcome the limitations
of earlier solutions by allowing fractional values for the quantities of compounds. We develop a measure, named State
Distance (SD) that computes the distance between two states of a given 1p.uih .i\ We use SD to measure how much
the steady state of a p.iliv, .i\ deviates from a given goal state. We develop two algorithms to solve the enzymatic target
identification problem based on traversal and genetic algorithms. The former one traverses the search space. Each solution
in the search space is a set of enzymes. This algorithm aims to find the enzyme set whose inhibitions have the least SD to
the given goal state. Evaluating each potential solution requires computing the steady state of the 1p.iiv; .i\ after inhibiting
the enzymes in that solution. To reduce the search cost, it avoids exploring the states that are unlikely to result in a good
solution using a filtering and a prioritization strategy. The genetic algorithm uses crossover and mutation to combine
existing good solutions to find better ones. It employs the traversal algorithm on small portions of the search space during
crossover and a biased randomization technique to improve the quality of the results.
Our experiments using the metabolic p.ivI, .i\s from KEGG [14, 15] validate that the computational results of our
algorithms agree with the results found by other means in the literature for numerous applications. We observe that our
traversal method is a good approximation of the exhaustive search algorithm and it requires significantly less compu
tational time. However, it is only effective for small p.ivI ..i\s (i.e., 3035 enzymes). Our genetic algorithm finds good
solutions for large metabolic p.ilv, .i s of size around 100 enzymes in less than 10 minutes.
The rest of the paper is organized as follows. Section 2.1 formally defines the SD measure. Section 2.2 presents the
proposed traversal method. Section 2.3 explains the genetic algorithm. Section 3 discusses experimental results. Section 4
concludes the paper.
2 Methods
In this section, we present in silico methods for the enzymatic target identification problem. We first provide a mea
sure to compute the distance between the current steady state and the goal state (Section 2.1). We, then, describe two
algorithms, traversal and genetic algorithms (Section 2.2 and 2.3), to solve the enzymatic target identification problem.
2.1 StateDistance
The first task that needs to be addressed is to measure the distance between a given goal state and the steady state of the
p.iliv, .i\ after inhibiting a set of enzymes. In order to achieve this, we first compute the steady state of the 1p.uih. .i\ after a
set of enzymes are inhibited. We then measure the distance between this state and the goal state. We call this measure the
StateDistance (SD).
The state of a metabolic p.ilii .i\ is a vector that indicates its current status. There are alternative ways to define
and compute the state of a given p.iiv ..i\ in the literature. These alternatives are similar in spirit. Each entry of the
state vector denotes the yield of a compound [27] or a flux in the p.iivI\ .i\ [17]. Yield of a compound is its amount in
the metabolism. The flux of a reaction is the speed at which each compound is produced or consumed by that reaction.
We compute the yield of each compound in the steady state using the reaction parameters such as the rate at which it
is consumed or produced by each reaction. We apply FBA [16, 2, 9] to get the flux of the 1p.uih .i\ in the steady state.
Usually, FBA produces a space of steady states that contains infinitely many possible steady states. To select a unique
steady state, FBA enforces optimizing an objective function in the solution space. The objective function of FBA often
maximizes biomass [6] or the production of ATP [20]. Since the literature contains detailed discussion on the steady state
computation, we focus on the distance measure in the rest of this section. We defer the discussion of the steady state
computation to the Appendix. In the rest of this paper, each entry of the steady state denotes the yield of a compound
unless otherwise stated. Thus, in our notation, the steady state vector for a 1p.luiv ..i has as many entries as the number of
compounds in that 1p.idvI .,\
Given a goal state for the underlying 1p.luiv .i\ next, we discuss how we compute the distance, SD, between the
goal state and the current steady state. We, first, present the notation to formally define SD. Assume that the number of
compounds in the 1p.luiv .i\ is m. We denote the goal state as VG = [912,2,  , gm], where gi = ideal value for the ith
entry of the steady state. Let N denote the number of enzymes. The enzyme vector shows the inhibition status of the
enzymes. We denote it with VE [ee, 2, ... eN], where ei {0, 1} (ei = 1 if E, is inhibited, otherwise ei = 0). Let
[rl, r,..., rm] be the steady state of that p.,ilvI .i, based on the enzyme vector VE. Let wi be a real number that shows
the importance of the ith entry of the steady state. We discuss the parameter ci later in this section.
We will use the variable d, to represent the distance contribution of the ith entry of the steady state. We develop two
alternative definitions for di, namely exact and fuzzy distance.
Exact distance: This measure is useful when the exact value of the entry in the goal state is known. For the ith entry,
we want to approach the goal state as close as possible. We compute the distance as d, = wj Qg rj.
Fuzzy distance: This measure is useful for extreme p.,ihv ..i\ analysis or when we do not know the exact values for
some entries in the goal state. In this case, we, however, know a lower or upper bound for such entries. In other words,
we want to increase (or decrease) the value of that entry to at least (or at most) a given value. Thus, we have two
possibilities.
Case 1. (decrease the production of a compound) We want to minimize the ith value of the steady state with a
threshold of gi. In other words, r, should be smaller than gi. The smaller the better:
d, fo if gi < ri
i ci/(gi ri) for gi > ri
Case 2. (increase the production of a compound) We want to maximize the ith value of the steady state with a
threshold of gi. In other words, r, should be bigger than gi. The bigger the better.
d oo if gi > ri
i c /i/(ri gi) for gi < ri
The choice of exact and fuzzy distance depends on the underlying application. Our search algorithm in this paper works
for both of them. We use the exact distance measure to compute d, in the rest of this paper unless otherwise stated.
The weights, ci, show the importance of the ith entry of the state vector for the organism. Large ci indicates that the
ith entry is important. This can be set based on expert input. For simplicity and without loss of generality, we set ci = 1
for all i, for the experiment results that are presented in this paper.
We compute SD as the largest of the distance of all the entries of the state vector, i.e., SD = max {di } I Id I 1. Note
that one can also define it as SD = Zidi. Other combinations of distance measures discussed above are orthogonal with
the rest of this paper. Therefore, we do not discuss them further.
Example 1. Consider the p.idv ..i\ in Figure 1. Assume that the goal state is VG [0, 0, 1, 0, 0, 0, 1, 1, 1]. That is,
we want one unit molecule of each of the compounds C3, C7, C8 and C9, and none of the remaining compounds. Also,
assume that none of the enzymes are inhibited in this 1p.iih\ ..i\ (i.e., VE = [0, 0, 0]). The steady state of this 1p.lihv\ ..i
is so = [0, 0, 0.5, 0.5, 0.5, 0, 3, 1, 1] (See Example II.A of the Appendix). The state distance under this condition is
SD(VE) IIso V ll 2.
Now assume that El is inhibited (i.e., VE [1, 0, 0]). We compute the steady state after inhibiting El as sl [1, 0,
1, 0, 0, 0, 1, 1, 1] (See Example II.B of the Appendix). Applying the state distance, we get SD(VE) = V* VGI 1.
Thus, inhibiting El brings the steady state closer to the goal state. o
Example 2. Consider the p.ilv, ..i\ in Figure 1 when no enzyme is inhibited. Assume that the goal state is the same as that
in Example 1. With the difference that we want to maximize the yield of the compound C7 and obtain a yield of at least
one unit of it. In this case, we use fuzzy distance for C7 with a minimum goal = 1. The steady state of the 1p.luih .i\ is the
same as that in Example 1. Thus, the state distance is max{0, 0, 0.5, 0.5, 0.5, 0, 0.5, 0, 0} = 0.5. O
2.2 Traversal Method
Given a metabolic p.iliv, .i\ and a goal state, we aim to find the set of enzymes whose inhibitions lead to a steady state
with lowest value of SD. One way to solve this problem is to exhaustively examine the SD value after inhibitions of all
possible subsets of enzymes. However, this is not feasible because the number of subsets is exponential in the number of
enzymes. In this section, we develop a traversal algorithm and two optimization strategies to accelerate it.
We traverse the search space using a branchandbound strategy. We consider the search space as a binary tree. Each
node of this tree corresponds to a potential solution. Each node records four items; (i) the set of enzymes that are inhibited,
(ii) the set of enzymes that are not inhibited, (iii) the set of enzymes that have not been considered so far, and (iv) the
SD value of the p.iivI .,is after all the enzymes in the first set are inhibited. In the root node, the first and the second sets
are empty. Therefore, all the enzymes of the p.ivI\; .i\ are in the third set. We recursively visit the nodes using an in order
traversal method [11]. After visiting the current node, we visit the left and right child. Moving from a parent to a child
node means considering a new enzyme on top of the enzymes considered in the parent node. The left child denotes that
the new enzyme is inhibited and the right child denotes that it is not inhibited. We, then, compute the current SD. If the
current SD is less than the best result seen so far, we update the value of best result with the new one. We propose to
improve the performance of this algorithm through two different optimizations:
In many cases, the inhibitions of some uninspected enzymes cannot improve the SD. We set the values in the enzyme
vector for such enzymes to zero (i.e. not inhibited). This process, called ,I l, i can improve the performance of
the algorithm as it skips many levels of the search tree.
The selection of the enzyme when we move from a parent to a child impacts the performance of the algorithm. This is
because if the nodes of the top levels in the search tree have small SD, the chance of filtering the nodes in its subtree
becomes large. We call this the Prioritization strategy.
We discuss these two optimizations later in this section. In order to implement these two optimizations we, first, discuss
how we quickly predict SD incrementally when a new enzyme is inhibited.
Predicting the value of SD: Computing the SD value of a given enzyme vector requires computing the steady state of
the p.lilvI .i\ An effective prediction strategy can help in avoiding computation of the steady states of a large number of
nodes if their SD values are greater than the current best.
We conjecture that the steady state after inhibitions of enzymes E, and Ej simultaneously is close to the average of
that after inhibitions of Ei and Ej separately. This is intuitive, because the influence of inhibitions of two enzymes should
be correlated with the total influence of inhibitions of the individual enzymes. We tested this conjecture by randomly
sampling 50 enzyme pairs in each of ten randomly selected metabolic p.uiv\; .i\, of Homo sapiens (H.sapiens). The average
correlation coefficient [8] of the steady state between actual and predicted values of these 500 random samples was 0.91.
This high correlation supports our conjecture.
Filtering strategy : We filter some uninspected nodes as follows. If the predicted SD values of these nodes are bigger
than the current minimum SD, we filter these nodes. For a given node in the search tree, let A, B and C denote the set
of enzymes that are inhibited, the set of enzymes that are not inhibited and the set of enzymes that are not yet considered
respectively. For each enzyme in set C we predict the SD value after that enzyme is inhibited in addition to the enzymes
in A. If the predicted value for an enzyme in C is worse than the best SD value found so far, we move that enzyme to set
B. Moving h enzymes from set B to set C is equivalent to filtering h levels of the search subtree rooted at the current
node. We predict the steady state for a single enzyme during filtering in time proportional to the size of the state vector by
precomputing the steady state after inhibition of each enzyme alone.
It is worth noting that we are using each enzyme independently to predict the steady state using a combination of
enzymes. This process is error prone and can lead to pruning of potentially useful enzymes when using the filtering
strategy. We address this problem by giving each enzyme several chances before we filter it. Let K denote the number
of chances, where K is a positive integer. We call this Kchance strategy. To incorporate this strategy, we keep a vector,
where each entry of this vector denotes the number of times that an enzyme is tested positively for a filter. We filter an
enzyme only if that enzyme uses all of its K chances.
Prioritization strategy: We select the most promising enzyme in set C to consider for inhibition. An enzyme is promising
if its inhibition in addition to the enzymes in set A has a small SD with high probability. This increases the chance of
filtering the nodes in its subtree. Our method works as follows. For each of the uninspected enzymes, we predict the steady
state after inhibition of that enzyme in addition to already inhibited enzymes. We then compute the SD between that state
and the goal state. We move the enzyme with the smallest predicted SD to set A to create the next child node.
2.3 Genetic Algorithm
The time complexity of the traversal method remains exponential, though the filtering and prioritization strategies
reduce the search space significantly. From experiments, the method described in the previous section is only useful
for 3035 enzymes. In this section, we propose a genetic algorithm to solve the target identification problem for large
p.lhv, ., The genetic algorithm exploits the traversal method as a building block. The main idea of the genetic algorithm
is to generate a population of candidate solutions and improve these solutions through crossover and mutation operations.
The algorithm stops after a predefined number of epochs (or iterations). Next, we describe the genetic algorithm in detail.
The genetic algorithm uses the following data structure.
Population: The population P is a set of candidate solutions Algorithm 1 Genetic Algorithm (Epochs)
{Si, S2, Snum seed}. Here, each solution, S,, is a vector
that shows which enzymes are inhibited and which enzymes are Initialize population.
not. Let N be the number of enzymes in the underlying 1p.hi\; .i\ 2. for i 1 to Epochs do
We represent a solution with S, = [sl, sf, ..., S ], where s 3 = 3 Generate children using crossover.
=1 24. Perform selection.
0 means that the enzyme Ej is not inhibited. Similarly, s = 1 Perform tion.
means that Ej is inhibited. 5. Perform mutation.
6. Shrink each solution to minimal subset.
Algorithm 1 summarizes our solution. We discuss the details .
7. end for
of this algorithm next. f
8. Report the solution with minimal SD.
Step 1. Initialize population. This step generates the initial pop
ulation, P, which contains candidate solutions. Ideally, a good candidate resembles the optimal solution in terms of both
the number and the selection of enzymes that are inhibited by it. However, we do not know the optimal solution at this
step. To address this problem, we need to answer two questions. (i) How many enzymes are inhibited in each candidate?
(ii) How do we decide which enzymes are inhibited in each candidate? To address the first problem, we employ our traver
sal algorithm in Section 2.2 as follows. We estimate the number of inhibited enzymes in good solutions (i.e., solutions
with low SD). Let A denote this estimate. We run the traversal method for the top several levels of the search space. We
search 10 levels to limit this traversal time. We, then, select a set of solutions with smallest SD (say 20 best solutions). We
estimate A as the average number of inhibited enzymes in these solutions. The filtering and prioritization strategies push
the results with small SD to the top levels of the search tree with high probability. Thus, limiting the search to only a few
levels of the tree does not degrade the accuracy of A greatly.
Once we compute A, the next problem is to decide which enzymes will be inhibited. We use binomial distribution to
compute the inhibition probability of each enzyme. Ideally, the probability of inhibiting an enzyme should be high if that
enzyme has a high potential to contribute to a good solution. We predict this potential of each enzyme by considering the
SD value obtained after inhibiting that enzyme. Suppose that SD(Ei) denotes the value of SD when only enzyme Ei is
inhibited. We conjecture that if SD(Ei) is small, then Ei has a high probability to contribute to a good solution. Let pi
denote the probability of inhibiting enzyme Ei in a good solution. There is a negative correlation between pi and SD (Ei).
Suppose 3 is the coefficient of that correlation. We use the following equations to estimate pi.
pi = /(1 + SD(Ei)). (1)
This equation requires the value of 3. We compute 3 from the observation that the expected value of the total number of
inhibited enzymes is A. We do this using the following equation.
S 1 + SD(E) (2)
From (1) and (2), we compute the probability of inhibiting enzyme Ei as:
Pi A/((1 +SD(E)) 1 + SD(E()). (3)
Now, we are ready to generate candidate solutions. We create each candidate by inhibiting enzyme Ei with probability pi,
Vj. In practice, we generate 100 candidate solutions to create the initial population.
Step 3. Generate children using crossover. This step computes a child population from the current population. It aims to
combine two existing solutions to create a better one. Generating a child solution involves the following steps: selection
of two parent solutions from the existing population and crossover of these two parents.
We first discuss how we pick two parent solutions from the current population. We conjecture that good parents
can generate good children with high probabilities. We, thus, randomly choose each parent using a biased distribution
that prefers parents with small SD. Suppose the probability of choosing the solution S, as a parent is xi. Based on our
conjecture, there is an inverse relation between xi and SD(Si). Assume that 7 is the coefficient of that correlation. We
can write xi as:
Xi 7/(1( + SD(Si)) (4)
We need the parameter 7 to compute the probability xi. We can compute 7 from the observation that the selection
probabilities of all the solutions in the current population add up to one. Formally, E xi = 1. Thus, we have
1 + SD(S) (5)
We get the value of 7 from (5) and use it in (4) to compute xiias:
xi 1/((1 + SD(S,)) 1 + SD1(S) (6)
So far, we have discussed the selection of parents. Next, we discuss the creation of a single child solution from two
parent solutions. We denote the parent solutions with F and M (F, M E P), where F = [fl, f2, fN] and M = [ml,
m2,  ., mN]. We denote the child of F and M with Ch [ Ch1,  , ChN]. We use the following crossover method
to produce the child. We first set the enzymes in the child vector for which both parents have the same value. Formally,
if fj = mj, then we set chj = fj = mj. We decide the values of the remaining enzymes using our traversal method in
Section 2.2. This strategy favors children that have small SD values as the traversal strategies seeks solutions with small
SD.
Table 1 demonstrates this process on an example. In this example, f2 Table 1. An example showing the generation
m2 = 1, so ch2 = 1. We set the values chg, ch6 and ch9 similarly. The parents of a child Ch from two h:... parents
do not agree for the values of chl, ch3, ch4, ch7 and chs. We use the traversal F and M for a pathway that contains nine en
method to find their values. To do this, we initialize the root of the search
,. 1 ,zymes.
tree as Iollows; me set ot innmitea enzymes is tE2, Ls5j, me set ot enzymes
that are not inhibited is {E6, E }, rest of the enzymes are undecided. The
traversal algorithm, then, traverses the search space defined by the undecided
enzymes to determine their values that minimizes SD.
We repeatedly select two parents and create a child until the number of
children is equal to the total number solutions in the initial population.
fl 2 3 f4 f5 f6 7 f8 f9
F 0 1 1 0 1 0 0 1 0
m m2 m3 m4 m5 m6 m7 m8 m9
M 1 1 0 1 1 0 1 0 0
chi ch2 ch3 ch4 ch5 ch6 ch7 chs ch9
Ch ? 1 ? ? 1 0 ? ? 0
Step 4. Perform Selection. At the end of Step 3, we have a set of already existing solutions, P, and a set of child
solutions. Thus the total number of solutions is double the size of P (i.e., it is 2num_seed). In this step, we update P as
the num_seed solutions with the smallest SD among the union of P and the child solutions.
Step 5. Perform mutation. Steps 3 to 4 repeatedly improve the solutions in the initial population P. There is however,
the risk that these steps get stuck in a local minima or a plateau. This step aims to avoid such local minima or plateaus. To
do this, we alter the solutions in P by mutation (i.e., changing the inhibition status of some of the enzymes). We mutate
each solution in P except the one with the minimum value of SD. For a given mutation rate 6 (we used a rate of 0.04), we
change the value of each sto 1 s with probability 8.
Step 6. Shrink each solution to minimal subset. In practical applications, solutions that inhibit fewer enzymes are
desirable. This is because drugs have to be used to inhibit them. Smaller number of inhibitors are preferable for this
purpose as inhibiting an enzyme is a costly task. One way to do this is to inhibit fewer enzymes than given in each
solution without changing the SD value of that solution. This step improves the solutions in the population by shrinking
the set of inhibited enzymes in each solution as follows. We iteratively test each inhibited enzyme of a solution. For each
such enzyme, we update its state to uninhibited. If the SD of the solution does not increase after this modification, we
remove this enzyme from the set of inhibited enzymes permanently. Otherwise we keep it. We repeat this iteration until
we test all the enzymes.
Use of traversal algorithm in crossover and performance hiccups. The crossover step (Step 3) of our genetic algorithm
uses our traversal algorithm to create child solutions with small SD values. We, however, advocated the use of our genetic
algorithm over our traversal algorithm for large 1p.luiI .i\ because the traversal algorithm is not scalable. Thus, we need
to show why our genetic algorithm is scalable despite it uses the traversal algorithm for each child at each iteration.
The scalability of the traversal algorithm used in the genetic algorithm follows from the observation that it is used only
for the undecided enzymes (i.e., the enzymes for which the two parents disagree). Using the notation we defined in this
section, we can compute the probability that enzyme E, is undecided as 2pi (1 pi). This is because one parent inhibits
E, while the other does not. Since the inhibition status of each enzyme in a parent is decided independently, the expected
number of undecided enzymes is 2 Ej pi (1 pi), or simply 2A 2 Ej /'. The actual value of this expectation depends
on the probability values pi. Table 2 lists the expected number of undecided enzymes for four p.,iI. .i s of varying sizes.
It shows that the expected value is small, and thus, the traversal algorithm can be used without affecting the scalability of
the genetic algorithm in practice.
One can argue that the 1.,il,\;.,!, in Table 2 might have a
Table 2. The expected number of undecided enzymes for
small expected number of undecided enzymes due to their topol different pathways. E denotes the number of enzymes.
different pathways. #E denotes the number of enzymes.
ogy or size, and thus, the search space may be large for larger Metabolic pathway #E Exp. num.
'p.ilv, .,i s (1 p.,liv, .\ s with different topologies. To complete our Valine, leucine and isoleucine degradation 24 3.40
discussion on scalability, we need to compute the expected num Glycolysis/Gluconeogenesis 27 2.69
ber of undecided enzymes in the worst possible distribution of Glycine, serine and threonine metabolism 32 7.15
probabilities pi. The following theorem computes the worst ex Purine metabolism 52 1.72
pectation and the corresponding standard deviation.
Theorem 1. Let N denote the number of enzymes in a given Ipd,/im,, Let A be the expected number of inhibited en
zymes in a solution of the population of solution generated for that I,' tm.,,, by our genetic algorithm. The expected
number of undecided enzymes in a child solution is at most 2A The standard deviation for the worst scenario is
'/(2A2 2AN + N2)(N2 1)N
N2I
We defer the proof of Theorem 1 to Appendix. Figure 2 i
plots the expected number of undecided enzymes in the
worst case for varying 1p.iI\ ..i\ sizes and A. The vertical
lines show the standard deviation in each direction. We
plot the expectation for up to A 10, since we observed
A < 10 in practice. The figure shows that for practical val
ues of A, the expected number of undecided enzymes re
main small enough to make the genetic algorithm efficient.
Furthermore, as the 1p.iili.. i size grows by eight fold, the
upper bound for the expectation grows by only a few en Fig. 2. Upper bound to the expected number of undecided enzymes
zymes. Thus, we conclude that our algorithm is scalable to for different A and pathway sizes (N= E). The vertical bars show
large l1ili\ s the standard deviation in each direction.
3 Results
In this section, we evaluate our algorithms on real datasets. We evaluate their biological significance on real applica
tions (Section 3.1). We also evaluate their performance quantitatively (Section 3.2) using the following two measures:
1. SD: The SD value is the distance from the goal state. A small SD value indicates a better result.
2. Execution time: This indicates the total time (in seconds) taken by our algorithms.
We use the metabolic p.iIi .i\ information of H.sapiens and E.coli from KEGG as the input dataset (ftp: / /ftp.
genome. jp/pub/kegg/pathway/organisms/). We present results for eight p.iliv\ .,is of varying sizes in our
experiments. Details of these datasets are shown in Table 3.
We implemented the developed algorithms in
We implemented the developed algorithms in Table 3. Metabolic pathways from KEGG that are used in our experi
C++. In our experiments, we set the default values
C++. In our experiments, we set the default values ments in this paper. KEGG Id is the unique identifier of each pathway in
of the parameters as follows. In the computation of
of the parameters as follows. In the computation of the KEGG database. #E, #R and #C denote the number of enzymes, reach
SD, cc, = 1, for all the entries for simplicity. From
tions and compounds respectively in the pathway.
our experience, me itering strategy aoes a gooajob
with two chances. The genetic algorithm works well
when the mutation rate, 6, is 0.04. To estimate A, we
search the first 10 levels of the search space to get
top 20 best solutions for the balance of efficiency
and accuracy. We ran our experiments on a system
with dual 2.2 GHz AMD Opteron Processors, 4 gi
gabytes of RAM, and a Linux operating system.
3.1 Evaluation of the biological significance
KEGG Id Metabolic pathway I#E #R #C
00980 Metabolism of xenobiotics by cytochrome P450 7 49 57
00670 One carbon pool by folate 17 23 9
00220 Urea cycle and metabolism of amino groups 21 22 27
00310 Lysine degradation 21 25 28
00280 Valine, leucine and isoleucine degradation 24 33 32
00010 Glycolysis/Gluconeogenesis 27 31 25
00260 Glycine, serine and threonine metabolism 32 33 37
00230 Purine metabolism 52 92 65
We first evaluate the biological significance of
our algorithms using flux or yield. We do this by comparing our results to known results in the literature.
Metabolic engineering of Glycolysis/Gluconeogenesis pathway The Glycolysis 1p.livi\ ..i of Tpallidum produces Phos
phoenolpyruvate. De et al., [5] studied the path that maximizes the production of this compound. Tpallidum, however,
has a simple 'p.IIdv .i\ that contains only four enzymes. We considered the Glycolysis/Gluconeogenesis 'p.iIvI .i\ of E.coli.
This organism has significantly larger and more complex 1p.iIy, .i\ that contains all the enzymes of Tpallidum and many
more. We ran our algorithm by setting the goal state to maximization of Phosphoenolpyruvate (i.e., goal > 0 for Phos
phoenolpyruvate). Our results suggest inhibiting the set of enzymes phosphoglucomutasee, aldose 1epimerase, fructose
1,6bisphosphatase II, phosphoglycerate kinase, pyruvate kinase}, (see red, crossed out enzymes in Figure 6 of the Ap
pendix). When all these enzymes are inhibited, the p.,ivI .i. from alphaDGlucose 1phosphate to Phosphoenolpyruvate
is, alphaDGlucose 1phosphate alphaDGlucose alphaDGlucose 6phosphate = (betaDGlucose6P =) beta
DFructose6P betaDFructose1,6P2 = (GlyceroneP =) Glyceraldehyde3P = Glycerate1,3P2 Glycerate3P
= Glycerate2P = Phosphoenolpyruvate. This is the same path found for T. pallidum [5]. This means, for the simple life
of bacteria, T. pallidum, the main function of Glycolysis 1p.iy; ..i\ is to generate Phosphoenolpyruvate whereas the same
1p.idvi .i\ for E.coli has additional purposes.
Application on the production of Phenylalanine LPhenylalanine is widely used in the manufacture of aspartame and in
parenteral nutrition [1]. Industrial application need an efficient process to generate LPhenylalanine. Considering the com
mercial application, Backman et. al selected E. coli as the production organism [1]. Thus, we studied the Phenylalanine,
tyrosine and tryptophan bi il'\ Iilldi p.1iiI. .ai of E.coli. We ran our algorithm by setting the goal state to maximization
of Phenylalanine. Our results suggest inhibiting the set of enzymes phenylalaninee translase, chorismate lyase, aspartate
aminotransferase, tyrosyltRNA synthetase}, (see the red, crossed out enzymes in Figure 7 of the Appendix). Usually,
phenylalanine is synthesized from glucose and ammonia in common bacterial systems. Inhibiting phenylalanine translase
stops the PhetRNA \ iwilic, i from phenylalanine. Thus, the concentration of pheylalanine is accumulated. Inhibiting cho
rismate lyase cuts the flux to other hi' s,\ iIillci, It, then, reduces the byproducts and efficiently uses the raw materials.
Increasing the production of cGMP The compound 3',5'Cyclic GMP (cGMP) plays an important role in the heart
failure. One treatment is to increase cGMP [12, 18,4]. cGMP appears in the Purine metabolism. We consider this 1p.illy .i\
in H.sapiens in this experiment. We design the experiment as follows. We first compute the steady state when all the
enzymes are active. We consider this steady state as the goal state except the cGMP entry. For the cGMP entry, we set
our goal to maximize its production. Our algorithms on the 1p.iihy .i\ suggests inhibiting the enzymes PDE and guanase.
The literature supports this result. PDE inhibitor is a kind of drug that blocks the subtypes of PDE, then it increases the
concentration of cAMP or cGMP or both. It can treat heart failure [10, 4]. We predict that the side effect may be less if
PDE inhibitor and guanase inhibitor are used together rather than inhibiting PDE alone.
Acetate reduction in E.coli Yang et. al shows several strategies for acetate reduction in the central metabolic 1p.luivI .,i
of E. coli [28]. One method is to directly influence the formation of acetate. When we run our algorithms on E.coli
Glycolysis/Gluconeogenesis 1up.iv ..i\ by setting the goal to minimization of the yield of acetate, the result enzyme is
acylactivating enzyme. This result is consistent with the biological discovery. Inhibition of acylactivating enzyme cuts
down the acetylCoA acetate 1p.ilv .i\ The destruction of this 1p.ihiv .i\ results in the low level of acetate [28].
We run our algorithms on the computation of the steady state using flux (details in Appendix). We get another strategy
for reduction of Acetate. The results of our algorithms are to inhibit dihydrolipoamide acetyltransferase and pyruvate
decarboxylase, which destroy the p.uiIv .i\ from Pyruvate to AcetylCoA. These results are consistent with the metabolic
engineering methods, e.g., the destruction or complete elimination of the portion of the reaction 1p.iI; ..i\ at the acetylCoA
node [28].
Metabolic engineering of the Butanoate metabolism. Poly3hydroxybutyrate is an essential compound for producing
plastics. Thus, increasing its production is critical for many industrial applications. Butanoate metabolism produces this
compound. We run our algorithm on the Butanoate lp.ivI .i.i of E.coli with the goal of maximizing this compound. Our re
sults predict that the inhibitions of phosphotransbutyrylase and BHBD increase the entry value of poly/3hydroxybutyrate
while incurring minimum damage to the rest of the metabolism. In fact, Vazque et. al shows the evidence of an association
between poly/3hydroxybutyrate and phosphotransbutyrylase [26].
3.2 Quantitative analysis of the proposed methods
In this section we quantitatively evaluate the performance of our traversal and genetic algorithms.
Evaluation of the traversal method The goal of this ex 3
periment is to evaluate the performance of the traversal Exhaustive : search
method. For each 1up.iv ..i\ we construct the goal state as 2.5 Tr
the steady state of that p.iIhvi .i\ when no enzyme is in
hibited. We, then, modify the 1p.iiliv..i by eliminating an 2
enzyme from that p.iili ..i\ and search the resulting path
way to find the solution that is closest to the goal state. For 1.5
each p.ivi ..i\ we create one query for each enzyme. Thus,
we have as many queries as the number of enzymes. We
report the average values over all queries for each p.ilivi .i\
0.5
The experiment described above is based on the fol
lowing biological intuition. An organism is, often, healthy 0
if all of its enzymes function well. This corresponds to the 00980(#E 7) 00
case when no enzymes are inhibited. Thus, the correspond
ing steady state is the goal state. An organism may suffer Fig. 3. Average SD values
from a disease if an enzyme malfunctions. In this case the search for different pathway
query will correspond to a 1p.iIhi .i\ that has malfunction a number shows the number
ing enzymes. We aim to find an enzyme set whose inhibi
tion leads it back to the healthy state as close as possible. 10000.00
Figures 3 and 4 compare traversal method to exhaus Exhaustive s
tive search for 1p.,iIvI..i.s that contain up to 24 enzymes. 1000.00
We did not compare the traversal algorithm to exhaustive
search for bigger 1p.iI\;..i\ as the exhaustive search re 0
quires weeks to months even for a single query. Figure 310
shows the average SD of the solutions computed by the
traversal method and those of the exhaustive search algo 10.00
rithm. The results show that the SD values of the traver
sal method and those of the exhaustive solution are almost 1.00
identical. Thus, the traversal method is a good approxima
tion to the optimal solution. Figures 4 presents the average 0.10
running time of the traversal method as compared to the 00980(#E 7)
exhaustive search algorithm. The running time of exhaus
tive search is 1.2 to 11 times that of the traversal method. Fig. 4. The average runn
However, it is also clear that the running time of the traver method and exhaustive sea
sal method increases exponentially with the number of en pathways. 'E' followed by
zymes. Therefore, it is impractical for very large up.iihv ..i;\ in the pathway.
although it can be used to solve larger sized problems than the exhaustive search.
670(#E 17) 00220(#E 21) 00280(#E 24)
Pathway
of the traversal method and exhaustive
ys over multiple queries. 'E' followed by
r of enzymes in the pathway.
arch
hod
00670(#E 17) 00220(#E 21) 00280(#E 24)
Pathway
ing time (in seconds) of the traversal
irch over multiple queries for different
a number shows the number of enzymes

Evaluation of the genetic algorithm This experiment evaluates the performance of our genetic algorithm. Similar to the
evaluation of the traversal method, we design the experiment with l'p.iII .i\ s up to 84 enzymes. We obtain the p.i1v.I .i\ s that
have more than 52 enzymes by combining multiple 1p.ivi .i s For example, Pathway 00230+00790 denotes the 1p.iiv .'i\
obtained by combining 00230 and 00790. For each of these 1p.ilv. .i s we created one query for each enzyme similar to
the previous section. We report the average values over all queries for each p.ilv, .i\
The traversal algorithm is impractical for such large path
ways due to the exponential time complexity. We, therefore, im
plement a truncated version of the traversal algorithm. This ver
sion truncates all the nodes deeper than L levels, where L is
a given parameter. Choosing an appropriate value for L makes
the execution time of the truncated method suitable fi p.1.1 ,..i s~
with a large number of enzymes. For example, the number of en
zymes in Glycolysis / Gluconeogenesis 1p.iIy, .i\ (00010) is 27.
In the worst case it generates 227 nodes requiring an execution
time of 15+ hours. In order to bound the running time of our
traversal algorithm for large p.ilv, ..i\ we truncated the depth
of the search after a fixed depth L. We chose L to be 20 or 23 as
the running time quickly becomes impractical for larger L.
Table 4 presents the SD value and the running time of the
traversal method and the genetic algorithm for four p.ivI..i s.
that have 24 or more enzymes. The difference between the ini
tial damage Do and Dao shows how much the truncated traversal
algorithm reduces the distance between the initial and the goal
state. The difference between D20 and Da show the amount of
improvement obtained by the genetic algorithm over the trun
Table 4. Comparison of the truncated traversal method
(maximum number of levels = 20) and the genetic algorithm
for different pathways. D2o and To2 denote the average SD
of the best solution and the average time requirement for
multiple queries using the truncated method. Dga and Tga
denote the average SD and the average running time for the
genetic algorithm respectively. All the time is in seconds. Do
denotes the average SD between the steady state of the query
pathway and the goal state.
SD Time
Pathways DolD2olDga T2olTga
00280 (#E 24) 2.611 2.596 2.545 20 2
00010 (#E 27) 2.687 2.556 2.370 166 52
00260 (#E 32) 0.837 0.819 0.797 162 38
00230 (#E 52) 10.865 1.568  1.120 141 150
00230+00790 (#E 61) 6.848 1.497 0.920 58  573
00230+00030 (#E 67) 8.590 3.682 13.159 465 410
00230+00340 (#E 67) 5.504 1.274 0.902 122 168
00230+00260 (#E 84) 6.342 1.322 0.996 121 494
cated traversal method. These results show that the genetic algorithm generates significantly better solutions (lower dam
age values) as compared to the truncated traversal method for all the cases. The genetic algorithm found solutions that have
SD values that are 2% to 39% lower than that found by the truncated traversal algorithm. In addition, the time require
ments of the genetic algorithm were comparable or better than the truncated traversal method. For p.iivI; .i\ 00010, the
genetic algorithm generated on an average 8.06% improvement over the traversal method. The maximum improvement in
these experiments was 39%. We have similar comparative gains for other p.iIhv, .,i
One can argue that the effectiveness of the traversal method
can be limited due to the limited number of levels. To evaluate the
trade off between the accuracy and running time of the truncated
algorithm better, we increase the depth of the traversal algorithm
until it spends at least as much time as the genetic algorithm for all
the 1p.iI.i, .\ For this purpose, we tested the truncated algorithm
for 23 levels. Table 5 shows the comparison between the genetic
algorithm and the traversal method with 23 levels for the three
largest 1p.,ili, .i s From Table 5, the running time of truncated al
gorithm with the additional 3 levels is up to an order of magnitude
higher. However, the accuracy only improved by a small amount
Table 5. The average SD value and the running time of
the truncated traversal algorithm (with maximum number
of levels = 23) and the Genetic Algorithm. The running time
is reported in seconds.
SD Time
Pathways D23 sDga T23 Tgy
00230+00030 (#E 67) 3.552  3.159 3310 410
00230+00340 (#E 67) 1.242  0.902 828  168
00230+00260 (#E 84) 1.271  0.996 974 494
and is much lower than the genetic algorithm. It is worth noting that the genetic algorithm required considerably less time
for all these cases. Therefore, the genetic algorithm is a good choice for the enzymatic target identification problem for
very large p.,ilIv, .is
4 Discussion
The goal of enzymatic target identification problem is to identify the set of enzymes whose inhibitions lead to a steady
state of the metabolic p.ilv, .i\ that is close to a goal or desired state. We develop a novel distance measure, Statedistance
(SD) that measures the damage of the inhibitions of a set of enzymes as a function of the deviation of the entry in the
steady state after their inhibitions from that in the goal state. Using this measure, we develop two algorithms that are based
on search space traversal and genetic algorithms.
Experiments using the metabolic p.hivI..i s of H.sapiens and E.coli from KEGG show that our algorithms can be
useful for numerous application including metabolic engineering and biomedicine. Our traversal method is effective for
1p.ilhv .i.' with up to 3035 enzymes. Our genetic algorithm is effective for arbitrarily large p.iihv, .i\
References
1. K. Backman, M. J. O'Connor, A. Maruya, E. Rudd, D. McKay, R. Balakrishnan, M. Radjai, V. DiPasquantonio, D. Shoda, and
R. Hatch. Genetic engineering of metabolic pathways applied to the production of phenylalanine. Ann N YAcad Sci., 589, 1990.
2. H. P. J. Bonarius, G. Schmid, and J. Tramper. Flux analysis of underdetermined metabolic networks: The quest for the missing
constraints. Trends Biotechnology, 15, 1997.
3. G. Q. Chen and Q. Wu. The application of polyhydroxyalkanoates as tissue engineering materials. Biomaterials, 26, 2005.
4. H. H. Chen. Heart Failure: A State of Brain Natriuretic Peptide Deficiency or Resistance or Both. Journal of the American College
of Cardiology, 49(10), 2007.
5. R. K. De, M. Das, and S. Mukhopadhyay. Incorporation of enzyme concentrations into FBA and identification of optimal metabolic
pathways. BMC Systems Biology, 2(65), 2008.
6. J. S. Edwards and B. O. Palsson. The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and
capabilities. Proc Natl Acad Sci US A, 97, 2000.
7. J. Fernandes, J. M. Saudubray, G. v. d. Berghe, and J. H. Walter. Inborn Metabolic Diseases: Diagnosis and Treatment. 2006.
8. R. A. Fisher. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population.
Biometrika, 10, 1915.
9. J. Forster, I. Famili, P Fu, B. O. Palsson, and J Nielsen. Genomescale reconstruction of the saccharomyces cerevisiae metabolic
network. Genome Research, 13, 2003.
10. S. R. Goldsmith. Type 5 Phosphodiesterase Inhibition in Heart Failure: The Next Step. Journal of the American College of
Cardiology, 50, 2007.
11. E. Horowitz, S. Sahni, and D. Mehta. Fundamentals of data structures in c++. Silicon Press, 2007.
12. John C Burnett Jr. Modulating cGMP in heart failure. BMC Pharmacology, 7(Suppl 1), 2007.
13. T. Kahveci. Np completeness for optimal enzyme combination identification. Technical report, CISE Department, University of
Florida, Jan 2008.
14. M. Kanehisa. A database for postgenome analysis. Trends in Genetics, 13(9):3756, 1997.
15. M. Kanehisa and S. Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28(1):2730, Jan 2000.
16. K. J. Kauffman, P. Prakash, and J. S. Edwards. Advances in flux balance analysis. Current opinion in biotechnology, 14(5).
17. B. O. Palsson. Systems biology: Properties of reconstructed networks. Cambridge University Press, 2006.
18. S. Philipp, J. Monti, I. Pagel, T. Langenickel, T. Notter, F. Ruschitzka, T. Luscher, R. Dietz, and R. Willenbrock. Treatment with
darusentan over 21 days improved cGMP generation in patients with chronic heart failure. Clinical Science, 103 (Suppl. 48), 2002.
19. N. D. Price, J. L. Reed, J. A. Papin, S. L. Wiback, and B. O. Palsson. Networkbased analysis of metabolic regulation in the human
red blood cell. Journal of Theoretical Biology, 225, 2008.
20. R. Ramakrishna, J. S. Edwards, A. McCulloch, and B. O. Palsson. Fluxbalance analysis of mitochondrial energy metabolism:
consequences of systemic stoichiometric constraints. Am J Physiol Regul Integr Comp Physiol, 280, 2007.
21. D. RodriguezAmaya. Food carotenoids: analysis, composition and alterations during storage and processing of foods. Forum
Nutr, 56, 2003.
22. C. Scriver, A. L. Beaudet, D. Valle, W. S. Sly, B. Vagelstein, B. Childs, and K. W. Kinzler. The Online Metabolic and Molecular
Bases of Inherited Disease. New York: McGrawHill, 2007.
23. B. Song, P Sridhar, T. Kahveci, and S. Ranka. Double iterative optimization for metabolic networkbased drug target identification.
accepted to International Journal of Data Mining and Bioinformatics, 2007.
24. P Sridhar, T. Kahveci, and S. Ranka. An iterative algorithm for metabolic networkbased drug target identification. Pacific
Symposium on Biocomputing, 2007.
25. P Sridhar, B. Song, T. Kahveci, and S. Ranka. OPMET: A metabolic networkbased algorithm for optimal drug target identification.
Pacific Symposium on Biocomputing, 2008.
26. G. J. Vazque, M. J. Pettinari, and B. S. Mendez. Evidence of an association between poly(3hydroxybutyrate) accumulation and
phosphotransbutyrylase expression in Bacillus megaterium. Int Microbiol., 6(2), 2003.
27. A. I. Vogel, A. R. Tatchell, B. S. Furnis, A. J. Hannaford, and P W. G. Smith. Vogel's textbook of practical organic chemistry.
Prentice Hall, 5 edition, 1996.
28. Y. T. Yang, G. N. Bennett, and K. Y. San. Genetic and metabolic engineering. Electronic Journal of Biotechnology, 1(3), 1998.
Appendix
The developed algorithms require computing the steady state of the metabolic 1p.idv ..i\ for each enzyme vector (i.e.,
candidate solution) that is considered during the search. Since our algorithms evaluate many enzyme vectors, we need
to answer the following question: Can we compute the steady state of the 1p.ilv ..i\ efficiently? This question has been
studied thoroughly in the literature. Here, we briefly discuss two alternative ways to compute the steady state based on
two alternative, yet similar, steady state definitions (Section I and II).
I. Computing the steady state using flux
The flux of a reaction shows the speed at which each compound is produced or consumed by that reaction. As the
reactions progress, each flux can increase or decrease other fluxes. The 1p.idivi .i\ reaches to a steady state when all the flux
values remain unchanged over time.
Figure 5 shows a hypothetical 1p.idiv .i\ along with fluxes that operate on it. Let vl, v2, vs denote the internal flux for
reaction R1, R2 and Rs. b, b2, ..., b8 are the external flux. The state of the 1p.iivI .i\ shows these current internal and
external flux which relates to the oidll' p.iIvi .i\ or compounds. Assume the reactions in the p.ilivI .i\ are as follows.
R1 : C1 C2 + 2C7
R: 2C2 + C3 0 4 + 5
R : 6 C7 + Cs + C9
One way to compute the steady state of this p.,iIy, .i\ is to employ Flux Balance Analysis (FBA) [16,2,9]. FBA creates
a matrix, A, that shows how each flux operates on each compound. Each row of this matrix corresponds to a compound
and each column corresponds to a flux. For example, for the 1p.iliv ..i in Figure 5, A has nine rows and 11 columns since
there are nine compounds and 11 fluxes. Consider the flux vl which corresponds to reaction R1 above. Assume that the
first column of A corresponds to vl. The values of A in this column are A[1, 1] = 1, A[2, 1] = 1, A[7, 1] = 2 and all
others are zero. This is because vi consumes one unit of C1 to produce one unit of C2 and two units of C7. Let v denote
the flux vector (i.e., each entry of this vector corresponds to a flux). The product Av, then shows the amount of change in
the concentration of each of the compounds. In order to compute the steady state, FBA makes several assumptions. One
of them is that the total influx of each compound is equal to the total out flux of that compound. Thus, FBA computes the
set of all possible steady states as the solution space of v to the equation
dA
SAv = 0.
dt
Typically, the number of variables in this equation is more than the number of constraints. As a result there are infinite
possibilities for x. There are many ways to narrow down the solution space. One of them is that a flux can not be a
negative value (i.e., vi > 0 for all i). Another assumption is that the maximal rate of incoming flux (the flux enters
that the metabolism from external sources) is limited, e.g. < 1. Another way is to include an objective function, such as
maximizing the biomass or the production of a specific compound or energy.
Example I.A Consider the p.,ilv, .i\ in Figure 5. Assume that all the enzymes in this p.idiv ..i\ are present in the metabolism
and they are not inhibited. Also, assume that as the objective function, we maximize the total output flux. In other words,
we want to find the steady state that maximizes b4 + b5 + b6 + b7 + bs.
Thus, the problem of solving Av 0 b3
translates into the following one: Maximize b4 bl
+ b5 + b6 + b7 + b8 subject to the constraints
2v2 = 0 = 0
v2 2Vb = 0
v2b5=0 b6
v2 Vs = 0 v C
2v + V3 6 = 0 b
vs b7 =0
3 b8 = 0 b
bi, b2, b3 < 1 Fig. 5. Flux distribution of a hypothetical 1p.liv\ .i\
Vl, V2, V3, bV bh, b), b4, b5, b6, b7, b8 > 0
The solution is, vl = 1.0, v2 = 0.5, vs = 1.0, bl = 1.0, b2 = 1.0, b3 = 0.5, b4 = 0.5, b5 = 0.5, b6 = 3.0, b7 = 1.0, b8 = 1.0.
We define the steady state as all the flux in the p.iliv ..i\ That is, [vl,v2, v3, bl, b2, b3, b4, b5, b6, b7, b8] = [1, 0.5 1, 1, 1,
0.5, 0.5, 0.5, 3, 1, 1]. o
Example I.B Now assume that El is inhibited in the 1p.iihv .i\ used in Example I.A, Since El catalyzes the reactionRl,
R1 becomes inactive. Then vl become zero in the flux computation of Example 1. Therefore, we add one more limitation
vl = 0 to the above flux computation. The steady state under these constraints is [0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1]. o
II. Theoretical yield computation
FBA computes the flux at the steady state. It, however, does not describe the concentration of each compound when
the metabolism reaches to a steady state. The amount of each compound at this state is a crucial information for many
applications. For example, when dopamine concentration in the brain reduces below a certain level, the motor system
nerves become unable to control movement and coordination.
We can compute the amount of each compound in the metabolism as its theoretical yield. The theoretical yield of a
compound is the amount of that compound produced by the underlying reactions [27]. Compared to the flux computation,
which describes the process of the reactions, the yield computation only depicts the outcomes of the reactions. We can
compute the yield of each compound in a 1p.uih ..i\ for a given initial state using the stoichiometry of that p.iilv .a.i We
use the same matrix A that is used in Section I of the Appendix to denote the stoichiometry. Recall that the stoichiometric
coefficients of a reaction show the rate at which it consumes/produces each compound.
In order to compute the yield of each compound in the steady state, we need the initial state of the 1p.idv ..i\ For
simplicity, we assume that the initial yields of the compounds which are not produced in that p.dilv ..i\ are one unit (e.g.
one mol). For all other compounds, we set them to zero. Note that the process of computing the steady state is orthogonal
to that of the initial state. Thus, one can replace our strategy of selecting the initial state with a different one. We, then,
simulate the reactions in the 1p.ivI; ..i We assume that all the reactions take place simultaneously given that the enzymes
and input compounds required for them are present in the metabolism. For simplicity, we will assume that all reactions
take the same amount of time in our discussion. Computing the yield for varying reaction speeds is similar.
Example II.A Assume the reactions in the 1p.idiv .i\ are the same as that in Section I of the Appendix. Also, assume that
all the enzymes are present in the metabolism, similar to Example I.A. In Figure 5, compounds C1, C3 and Cg are not
produced in the 1p.iihv .i\ They are external inputs to the 1p.iihv .i\ Thus, the initial state of the 1p.iihvI .i is [1, 0, 1, 0, 0, 1,
0, 0, 0 ]. In other words, there is one unit of C1, C3 and C0 and none of the other compounds exist initially.
Initially, the reactions, R1 and Rs, are active. This is because the yields of C1 and C0 are nonzero in the initial state.
However, the reaction R2 is inactive because of the lack of C2. R1 consumes one mol of C1, and generates one mol of
C2 and two mol of C7. Rs consumes one mol of Cg and produces one mol of C7, one mol of Cs and one mol of Cg. The
current state after these reactions take place is [0, 1, 1, 0, 0, 0, 3, 1, 1]. At this moment, R2 is active for both yields of C2
and C3 are nonzero. R1 and R2 are inactive for the yields of C1 and Cg are zero. Then, R2 consumes one mol of C2 and
Smol of C3, and generates 9 mol of C4 and 1 mol of C5. Then, 1 mol of C3 is left. Thus, the state of the 1p.idv; .i\ is [0,
0, 0.5, 0.5, 0.5, 0, 3, 1, 1]. Since the yields of C1, C2 and C6 become zero, R1, R2 and Rs are inactive. A steady state is
reached at this point as the yield does not change. o
Example II.B Assume that El is inhibited in the 1p.iliv .i\ used in Example II.A. This makes R1 inactive as El catalyzes
R1. For consistency, assume the initial state of the 1p.luih ..i is also [1, 0, 1, 0, 0, 1, 0, 0, 0] (i.e., same as that in Example
II.A). Then, only R3 is active for both the input compound Cg and the enzyme E3 are present. R3 consumes one mol of
C6 and generates one mol of C7, C8 and Cg. The current state is [1, 0, 1, 0, 0, 0, 1, 1, 1]. At this moment, all the reactions
become inactive. Therefore, the metabolism comes to a steady state. As compared to the steady state when all the enzymes
are present, the yield of C1 increases and the yield of C7 decreases. o
There are four important observations that follow from the discussion of steady state for the theoretical yield compu
tation.
We assume that all reactions take place simultaneously as long as all the input compounds and the enzymes are present
in the metabolism. We also assume that each reaction takes place until there are no sufficient compounds that it can
consume.
Theoretically, the state transition of a 1p.iyI\; .i may lead to an infinite loop of a sequence of more than one state. For
example, the translations a b c a show a loop of three states a, b and c. Such loops are considered a sequence
of steady states. One can use average, max, min or other functions to summarize such states in a single state. We do
not discuss this in detail as it is beyond the scope of this paper.
The steady state depends on the initial state. One can show that the metabolism is guaranteed to reach to a steady
state (or a sequence of steady states) and that the steady state depends on the initial state. The proof of the former
follows from the synchronous transition model. The state transition moves each state to a unique next state. Thus, a
transition either creates a new state, remains constant or creates one of the previously visited states. The latter two
cases correspond to steady states. The proof of the latter can be done by simply setting the value of C1 to zero in the
initial state of Example II.A. When all the enzymes are present, the steady state after this modification is [0, 0, 1, 0,
0, 0, 1, 1, 1], which is different than the one in that example.
There is a strong correlation between the steady state of the flux and the steady state of the yield. As the flux reaches
to a steady state, the total influx and outflux of each compound remains unchanged. Thus the derivative of the yield
of each compound remains unchanged.
The yield computation is a good approximation of the state of the 1p.idivi .i\ and this value can be measured at lab.
It is computationally desirable since it does not require the functional or kinetic information unlike concentration or
flux.These information is not available for majority of the existing p.idivi ..i databases. We use yield to denote state in this
paper. Furthermore, this paper is orthogonal to the state computation. Thus, one can easily replace flux or concentration
values with yield values.
III. Metabolic engineering of Glycolysis/Gluconeogenesis pathway
Figure 6 shows the Glycolysis/Gluconeogenesis p.iihv\ ..i of E.coli. We ran our algorithm with the goal of maximizing
the production of Phosphoenolpyruvate. Our algorithm predicts that inhibiting the enzymes colored in red and crossed out
is the best way to achieve this goal.
IV. Metabolic engineering of Phenylalanine, tyrosine and tryptophan biosynthesis pathway
Figure 7 shows the Phenylalanine, tyrosine and tryptophan hiil, i\ mllc',i p.1iIi ..\ of E.coli. We ran our algorithm with
the goal of maximizing the production of phenylalanine. Our algorithm computes that inhibiting the enzymes colored and
crossed out in red is the best way to achieve this goal.
V. Proof of Theorem 1:
In this section, we prove Theorem 1. We denote the expected value of a random variable with E[.]. We will use N to
denote the number of enzymes. We will first prove the mean in the first part of the proof. In the second part, we will prove
the standard deviation.
Part 1 of proof: The worst scenario analysis of the expected number of undecided enzymes. Let F = [f, f2, ", fN]
and M = [mi, mn,  , mN]. denote two arbitrary solutions taken from the population of solutions. In order to compute
the expected number of undecided enzymes during the crossover of F and M, we first estimate the number of decided
enzymes, that is, the number of enzymes for which, f, and mi are the same. We define the random variable X, as:
X f1 if fi = mi
{0 otherwise
X, = 1 in two cases: f, = mi = 1 and f, = m = 0. Let pi be the probability that the ith enzyme in a solution is
inhibited. Then, the expected value of Xi is
E[X] =pi pi + (1 i)(1 pi)
2,.2 2pi + 1
We would like to minimize E[E Xi] subject to the constraint that pi = A. The random variables Xi and Xj are
independent for i / j. Therefore, we compute the expected number of decided enzymes as
E[Z X] Z= E[X]
i i
SE(2p2 + 1)
i
=21, 215pi+N
i i
From the definition of the probabilities pi in Equation (2) we have E /' = A.
A E
A Zi pi
N 
A2 2
N < Lp
^ 7
(Follows from power mean inequality)
Using the inequality we obtained above, we get
E[E X] 2 ,?
2X+N
A2
> 2 2A + N
N
By definition of power mean inequality, we minimize the value of the expectation (i.e., equality case) when all pi = pj
for all i, j. Thus, we conclude that pi for all i. Using this value for pi, we compute the minimum value of E[E X,]
E 2A 2
"[/IJ N
2A + N.
The expected number of undecided enzymes is N E[E Xi]. Therefore, the maximum number of undecided enzymes is
2A2
N (
N
2A2
2A + N) = 2A N
N
Part 2 of proof: The standard deviation of the worst scenario of the expected number of undecided enzymes. Here,
we analyze the standard deviation a for the undecided enzymes in the worst scenario for the mean.
Assume that,
X =X1 +X2 +... +X,.
From the definition of the standard deviation, a, is
a = E[(X E(X))2] E[X2] (E[X])2.
We proved in Part 1 of our proof of Theorem 1 that E[X] 22 2A + N in the worst scenario, and that this scenario
happens when, pi = Let q be the probability of X, = 1 in the worst scenario. Then,
q =pi +(1p) (1p)= ( )2 +(1
N
12N+2 NY
N N2
Before we compute E[X2], we consider E[X?]. The values of Xi is either 0 or 1. Therefore,
E[X,] Pr(X, = 1) P(X2 1) E[X].
The multiplication XXj (i i j) evaluates to one only when X, = Xj 1. Assume that k of the Xis have value of one
and the remaining N k of the Xis have value of zero. Then, exactly k(k 1) of the XiXj multiplications evaluate to
one. Let Zk k(k 1) denote this. Now, we are ready to consider E[X2].
E[X] E [(X + X +... + X)2]
E[ X,2 + :X, X3
i i7j
E[ X?] +E[ XiXj]
i i7j
E[:X,] +E[ XiXj] (byE[X,] E[X?])
i i7j
E[j Xj] + Pr(X, 1, X, = 1)
=E[Xi]+ZPr(X =k)Z
E[E X] + k qA (1 q) kZ
i fk>2
i kc>2
=E[ X ] + q1 k!(Nr k)! q (1 q)N k(k 1)
i( N>2
E[ >2+ (k2)!(N )!q(1 q) )
E[ X] + N(N 1)q2 2) q 2(1 q)
i kc>2 k
=E[j X,] + N(N 1)q(q + 1 q)N2
i
E[Z X] + N(N 1)q2
i
2A2 A A2 A A2
2A+ N +N(N 1)(12 .+2.)2 (byq 12. A+2. )
N N N2 N N2
Finally, we compute the standard deviation.
a =E[X2] (E[X])2
2 2A+ N+ N(N )(1 2 + 2 )2( 2A+N)2
V/(2A2 2AN + N2)(N2 1)
N
F1
=Nucleodde sugars
metabolism
Pentose and glucumonate
 interconversons
t starch and sucrose]
GlucoseIP metabolism J
Arbutin
extracellularr)
Salicin
extracellularr)
Pentose
phosphate
pathway
4.2.1.11  Phe,Tyr&Trp
Ami phosphonate L_________________________. bsynthests J
Ametabolsm  I/ b 5
metbos ICitrate cycle   Phoaosynthesis )
Phosphoenol
metablea bpyruvate
Tryptophan  j 2
metabolism    
  Lyse biosyresis ) \ /
S __ , 1.2.1.51 1 1.1.1.27 LLactate
SAcetylCoA ThPP Pynit
degradation 2 3 1 12 1.2"4.1 thyl_hPP 1.2 1 II Propanoa.te metabolism
of ketone bodies Acet .1.1.1 Branchd dibasic acid metabolism)
.] dihydrolipoamideE
62.1 1X 1 8.1 I F Butanoate metabolism I
DihydrolpoamideE LipoamideE 4 .1.1
Ethanol 111 Patothenate and CoA biosynthesis)
4Acetate 1.1.1.2 I Acetadehyde,
i YAcetateo Acetalhdey  Alanine and aspartate metabolism)
SItI DAlanne metabohnI
I 1 .9.8
1215   Tyrosine metabolism
00010 816107
Fig. 6. Glycolysis/Gluconeogenesis for E.coli. The enzymes highlighted in green are the enzymes that exist in E.coli. According to our
algorithm, inhibition of the enzymes marked and crossed out in red maximizes the production of Phosphoenolpyruvate the best.
SPHENYLALANINE, TYROSINE AND TRYPTOPHAN BIOSYNTHESIS
Tyiosine metabolism
Coumarin and.
p yppanidbiosnthesis   Alkaloid biosynthesis I
Thy NA PuHomycin biosynthesis
Coumarin and
phenylpropanoid biosynthesis
Phenylalanine metabolism 
Photosynthesis
r Pense phosphate pathay
PhetRNA Ph
Phosphoenol
D hotse i pyuvate
4phosphate 0<1Gls
Acridone alkaloid biosynthesis
Ubiquinone biosynthesis
K Folate biosynthesis
__ Biosynthesis of siderophore
group nonribosomal peptides
SIndole and ipecac
alkaloid biosynthesis
42.1.20
Benzoxezincne
biosyixtheas
LTryptophan
metabolism
 Cs
C
zo
C
Co
C).
C)
C
ciE
C) 
C)
Fjf~
N C)
C
o
9 C)
0 o
C0
C~a
iZO 0
00400 3/5?07
