Vol. 00 no. 00 2010
Pages 112
Leveraging Gene Networks to Estimate Perturbations on
Gene Expression
Nirmalya Bandyopadhyay, Manas Somaiya, Tamer Kahveci, Sanjay Ranka
Computer and Information Science and Engineering, University of Florida, Gainesville, FL
{nirmalya,mhs, tamer, ranka}@cise.ufl.edu
ABSTRACT
External factors such as radiation, drugs or chemotherapy can
alter the expressions of a set of genes. We call these genes the
primarily affected genes. Primarily affected genes can in time change
the expressions of other genes as they activate/suppress each other
through interactions. Measuring the gene expressions before and
after applying an external factor (i.e., perturbations) in microarray
experiments can reveal how the expression of each gene changes.
However, it cannot tell the cause of the change.
In this paper, we consider the problem of identifying primarily
affected genes given the expression measurements of a set of genes
before and after the application of an external perturbation. We
develop a new probabilistic method to quantify the cause of differential
expression of each gene. Our method considers the possible gene
interactions in regulatory and signaling networks, for a large number
of perturbations. It uses a Bayesian model with the help of Markov
Random Fields to capture the dependency between the genes. It
also provides the underlying distribution of the impact with confidence
interval.
Our experiments on both real and synthetic datasets demonstrate
that our method can find primarily affected genes with high accuracy.
In our experiments, our method was 100% accurate when the
difference between the expected expressions of primarily and secondarily
affected genes was at least half of the standard deviation of the
gene expressions. Our experiments also suggest that our method
is significantly more accurate than SSEM, a recent relevant method,
and Student's t-test.
1 INTRODUCTION
A significant set of microarray experiments measure the expressions
of genes in the presence of external perturbations [8, 20]. In these
experiments, also called perturbation experiments, perturbations
such as radiation [38], drug [28] or other biological conditions are
administered on tissues and their responses are monitored using
microarrays. The expressions of the genes that are not subject to
perturbations are called control data, while the expressions of genes
after perturbations are called noncontrol data [17].
In response to an external perturbation, a fraction of genes can
change their expression values significantly between control and
noncontrol groups. Such genes are called differentially expressed
(DE) genes [3]. All the other genes without noticeable change in
expression are called equally expressed (EE) genes.
Often, some of the DE genes are directly affected by the external
perturbation [13]. We denote them as the primarily affected genes.
The rest of the genes change their expressions due to interactions
with the primarily affected genes and with each other through signaling
and regulatory networks [9, 10, 29, 36, 33]. We call them the
Fig. 1. Illustration of the impact of a hypothetical external perturbation on a
small portion of the Pancreatic Cancer pathway. The pathway is taken from
the KEGG database. The solid rectangles denote the genes affected directly
by perturbation, the dashed rectangles indicate genes secondarily affected
through the networks. The dotted rectangles denote the genes without any
change in expression. '→' denotes activation and '⊣' denotes inhibition. In this
figure, genes KRas, Raf and Cob42Roc are primarily affected and MEK, Ral
and RalGDS are secondarily affected through interactions.
secondarily affected genes. We refer to signaling and regulatory
networks by gene networks in this paper. Figure 1 shows the state
of the genes [2, 35] in the Pancreatic Cancer pathway after a
hypothetical external perturbation is applied. In this figure, genes
KRas, Raf and Cob42Roc are primarily affected and MEK, Ral
and RalGDS are secondarily affected through interactions.
We consider the problem of identifying the primarily affected
genes in a perturbation experiment, where gene expressions before
and after the application of perturbation are provided for a set of
samples.
Existing methods to identify the primarily affected genes using
association analysis techniques [19, 28], haploinsufficiency
profiling [16, 15, 27] and chemical-genetic interaction mapping [31]
are limited to applications where additional information such as
fitness based assays of drug response or a library of genetic
mutants is available. Bernardo et al. [13] suggested a regression
based approach, named MNI, that assumes that the internal genetic
interactions are offset by the external perturbation. It estimates
gene-gene interaction coefficients from the control data. It then uses
those coefficients to predict the target genes in the noncontrol data.
Cosgrove et al. [9] proposed a method named SSEM that is similar
to MNI. SSEM models the effect of perturbation by an explicit
change of gene expression from that of the unperturbed state. These
methods have several limitations.
1. Lack of gene interaction data: They do not employ regulatory or
signaling networks (i.e., gene networks) to model gene-gene interactions.
Since gene networks are manually curated by domain experts,
they are reliable sources of gene interactions. Utilizing them has
the potential to more accurately solve the problem of identifying
primarily affected genes.
© Oxford University Press 2010.
2. Limited perturbations: These methods are suitable when only
a very small number of genes are perturbed, e.g., the genetic
perturbation experiments are often designed for single gene
perturbations [19]. However, external effects such as radiation
can directly alter the expression of many genes, making the
existing methods of limited use.
3. Simplistic models: These methods provide only the set of
genes that are directly affected by the perturbations and do not
specify any error bounds. However, the change of the state of
a gene is a stochastic event and a nonprobabilistic inference
oversimplifies the problem especially in cases when a small
number of gene expression measurements are available. As a
result, these methods can overfit the data, making the solution
unreliable.
The method we propose in this paper addresses these limitations.
Contributions: Let G = {g1, g2, ..., gM} denote the set of all
genes. We assume that a microarray dataset with N samples is
given for the control and noncontrol groups. Let yij and y'ij be the
expression of the ith gene in the jth sample (i ∈ {1, 2, ..., M}, j ∈
{1, 2, ..., N}) in the control and noncontrol groups respectively.
For each gene gi, we would like to estimate the probability that gi
is DE due to perturbation or due to each of the other genes.
In this paper, we propose a new probabilistic method that
addresses the limitations of the existing methods to find the
primarily affected genes in microarray dataset as defined above.
Our method uses gene networks. We consider gene network as a
directed graph where each node represents a gene, and a directed
edge from gene a to gene b represents a genetic interaction (e.g
a activates or inhibits b). We define two genes as neighbors of
each other if they share an edge. For example, in Figure 1, genes
KRas and Raf are neighbors as KRas activates Raf. We also
classify a neighbor as incoming or outgoing if it is at the start or
at the end of the directed edge respectively. In Figure 1, Raf is an
incoming neighbor of MEK and MEK is an outgoing neighbor of
Raf. When the expression level of a gene is altered, it can affect
some of its outgoing neighbors. Thus, the expression of a gene
can change due to external perturbation or because of one or more
of the affected incoming neighbors. We build our solution based
on this observation. We represent the external perturbation by a
hypothetical gene (i.e., metagene) gR in our gene network. An
edge from the metagene to each of the other genes implies that the
external perturbation has the potential to affect all the other genes.
So, gR is an incoming neighbor of all the other genes. We call the
resulting network the extended gene network.
Our method estimates the probability that a gene gj is DE due
to an alteration in the activity of gene gi (∀ gi, gj ∈ G ∪ {gR}) if
there is an edge from gi to gj in the extended network. We use a
Bayesian model in our solution with the help of Markov Random
Field (MRF) to capture the dependency between the genes in the
extended gene network. We optimize the likelihood of the joint
posterior distribution over the random variables in the MRF using
Iterated Conditional Mode (ICM) [6]. The optimization provides us
the state of the genes and the pairwise causality among the genes
including the metagene.
Our experiments on both real and synthetic datasets demonstrate
that our method can find primarily affected genes with high
accuracy. In our experiments, our method attains 100% accuracy when
the difference between expected expressions of primarily and
secondarily affected genes is at least half of the standard deviation
of the gene expressions. We compared our method with SSEM and
Student's t-test and obtained significantly higher accuracy in finding
the differentially expressed genes.
The rest of the paper is organized as follows. In Section 2
we describe our method in detail. In Section 3 we discuss the
experiments and results. Finally, in Section 4 we describe our key
conclusions.
2 METHODS
In this section, we describe our mathematical model and methods.
Section 2.1 presents the notations. Section 2.2 provides an overview
of our solution. Section 2.3 discusses the modeling of the MRF
based prior distribution. Section 2.4 describes how we formulate a
tractable approximate version of the objective function. Section 2.5
discusses how we compute the joint likelihood distribution of the
expression of a gene. Section 2.6 explains how we optimize the
objective function.
2.1 Notations and problem formulation
We start by describing the notation we use in the rest of this paper
and provide a formal definition of the problem. We use two types
of parameters to model this problem, namely observed and hidden.
Observed variables are the ones whose values are available in the
underlying data set. We derive the values of the hidden variables
from the observed data using our method.
Observed variables: There are two sets of observed variables.
* Microarray dataset: We denote the number of microarray
samples and the number of genes by N and M respectively.
We represent the set of all genes in the dataset with G =
{g1, g2, ..., gM}. For each gene gi, the dataset contains the
expressions before and after the perturbation (i.e., control and
noncontrol) respectively. We denote the expressions of gi with
yij and y'ij in the control and noncontrol groups respectively (1 ≤
i ≤ M, 1 ≤ j ≤ N). Let yi = {yi1, yi2, ..., yiN} and
y'i = {y'i1, y'i2, ..., y'iN} denote the expression values of gi in the
control and noncontrol groups respectively. We use Yi to denote
all the data for gene gi in the control and noncontrol groups (i.e.,
Yi = yi ∪ y'i). We denote the collection of the entire gene
expression data by y = ∪i Yi.
* Neighborhood variables: We use W = {Wij} to
represent whether two genes gi and gj are neighbors. We set the
value of Wij to 1 if gi is an incoming
neighbor of gj (i.e., gj has an incoming edge from gi in the
extended gene network) and 0 otherwise.
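To make this bookkeeping concrete, the W variables can be assembled from an edge list as in the following sketch. This is illustrative only; the `build_W` helper and the toy edges are ours, not part of the paper's implementation, and the metagene gR is placed at index 0:

```python
# Sketch: build the 0/1 neighborhood matrix W for the extended gene network.
import numpy as np

def build_W(num_genes, edges):
    """edges: (i, j) pairs of 1-based gene indices meaning gi -> gj.
    Returns the (M+1) x (M+1) matrix W for the extended gene network,
    with the perturbation metagene gR at index 0."""
    W = np.zeros((num_genes + 1, num_genes + 1), dtype=int)
    for i, j in edges:
        W[i, j] = 1          # gi is an incoming neighbor of gj
    W[0, 1:] = 1             # the metagene can affect every other gene
    return W

# Toy network with three genes: g1 -> g2 and g2 -> g3.
W = build_W(3, [(1, 2), (2, 3)])
```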
Hidden Variables: There are two sets of hidden variables.
* State variables: Each gene gi can attain one of two states
(i.e., DE or EE). We introduce the variables S = {Si} to indicate
the states of the genes. Formally, Si is DE if gi is differentially
expressed and EE if gi is equally expressed.
* Interaction variables: We define the set of random variables
X = {Xij} to represent the joint state of genes gi and gj. Formally,

Xij = 1 if Si = DE and Sj = DE;
Xij = 0 if Si = EE and Sj = EE;
Xij = 2 if Si = DE and Sj = EE;
Xij = 3 if Si = EE and Sj = DE.
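The four-valued encoding can be written directly as a small helper (an illustrative sketch, not the authors' code):

```python
# Sketch: map the joint state of genes gi and gj to the interaction
# variable Xij in {0, 1, 2, 3}.
def interaction_state(si, sj):
    """si, sj are 'DE' or 'EE'."""
    if si == 'DE' and sj == 'DE':
        return 1
    if si == 'EE' and sj == 'EE':
        return 0
    if si == 'DE' and sj == 'EE':
        return 2
    return 3  # si == 'EE' and sj == 'DE'
```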
Problem formulation: We have the microarray expression data
y and the gene network {G, W} as input to the problem. We
would like to estimate the conditional probability density function
p(Xij | X − Xij, y) when Wij = 1.
2.2 Overview of our solution
An approach to solve our problem can be to maximize a likelihood
distribution over the gene expression y where X are the parameters
of the distribution. The objective is to obtain the maximum
likelihood estimate (MLE) of X. However, there are two problems
with this approach. First, MLE requires a large number of data
points to accurately estimate the parameters. Second, MLE depends
only on the observed data and cannot utilize domain-specific
knowledge, which leads to overfitting and poor generalization.
We develop a method that uses a Bayesian estimation of X to
address the abovementioned limitations of the existing approaches.
We compute a probability distribution over X. To ameliorate
overfitting, we incorporate the domain knowledge as the prior
distribution. Also, a Bayesian approach can generally estimate the
parameters with fewer data points, which makes our approach more
suitable for perturbation experiments [7].
We estimate the probability of Xij given the other observed and
hidden variables. In this approach, we aim to maximize the joint
probability of the Xij variables given the gene expressions y. Thus,
the objective is to maximize the joint distribution of X given by,
P(X | y, Ox, Oy) = P(y | X, Oy) P(X | Ox) / Σ_X P(y | X, Oy) P(X | Ox)   (1)
Here Oy is the set of parameters for the data likelihood density
function P(y | X, Oy) and Ox is the set of parameters for the density
function of X (i.e., P(X | Ox)). We define the sets of parameters
Ox and Oy in Sections 2.3 and 2.5 respectively.
We obtain an assignment of X, Ox and Oy after
the optimization of Equation 1. Using these hidden variable
assignments and the observed dataset, we calculate the posterior
probability of every Xij variable, given by P(Xij | X − Xij, Ox, Oy).
Using this posterior probability, we quantify the contribution of a
DE gene to its outgoing neighbor being DE.
In order to evaluate the joint distribution of X (i.e., Equation 1),
we have to define the prior density function P(X | Ox) and the
data likelihood function P(y | X, Oy). We also have to define the
formulation of the posterior probability of Xij. We discuss these in
the following sections.
2.3 Computation of the prior density function
In this section, we describe how we build the prior density function
P(XIOx). We build this distribution based on two assumptions that
follow from the underlying biological problem.
1. Each gene can affect the expressions of its outgoing neighbors.
2. The metagene gR (i.e. external perturbation) can affect the
expression of every other gene.
Fig. 2. (a). A small hypothetical gene network with perturbation. The
circle gR represents the abstraction of the external perturbation i.e. gR.
Rectangles denote genes. '→' denotes activation and '⊣' denotes inhibition.
The dotted arrows from gR indicate its potential effect on each gene. The
directly impacted DE genes g1 and g3 are denoted by solid rectangles. Dashed
rectangles g4 and g5 denote secondarily impacted DE genes. The dotted rectangle
is for the EE gene g2. (b). The graph for the Markov random field created from
the hypothetical gene network in (a). For each neighbor pair we create a
circular node. We create three rectangular nodes that do not correspond to
any neighbor pair, however they are part of the MRF graph. Two nodes are
connected with an undirected edge if they share a subscript at same position.
For example, node XR4 and X14 are connected as they share 4 at second
position.
The first assumption follows from the fact that each gene can
activate or inhibit its outgoing neighbors. So, if the activity of a
gene is altered, the effect can propagate to its outgoing neighbors.
The second assumption is evident as the external perturbation such
as radiation can change the activity of any of the genes.
Under these assumptions, we observe that the Xij variables can
depend on each other. To capture this dependency we create a graph
called Markov Random Field (MRF) graph over the X variables
that encompasses the two above-mentioned assumptions. This graph
represents the probabilistic dependency between the genes in the
extended gene network.
The MRF graph is an undirected graph G = (X, E), where the X
= {Xij} variables represent the vertices of the graph. We denote
the set of edges by E = {(Xij, Xkj) | Wki = Wij = 1} ∪
{(Xij, Xip) | Wjp = Wij = 1}. For example, in Figure 2(b) X12
and X13 are neighbors as in Figure 2(a) g1 interacts with g2 and g2
interacts with g3. Notice that, this graph represents the probabilistic
dependency between the genes in the X domain, not in the domain
of genes. So, the dependency of two neighbors are captured by an
edge between two X variables. For example, in Figure 2(b) we draw
the MRF graph corresponding to the hypothetical gene network in
Figure 2(a). In the gene network, there is an edge from g1 to g4.
So, g1 can possibly alter the state of g4. We have an edge between
XR1 and XR4 that corresponds to the edge from g1 to g4. As R is
common to XR1 and XR4, if they attain the same value it implies
that the genes g1 and g4 are in the same state (i.e., DE or EE).
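The neighbor rule (share the second subscript via an incoming neighbor of gi, or the first subscript via an outgoing neighbor of gj) can be sketched as follows; the `mrf_neighbors` helper and its index conventions are assumptions of this illustration, not the paper's code:

```python
# Sketch: enumerate the MRF-graph neighbors of node X_ij.
def mrf_neighbors(W, i, j):
    """Neighbors of X_ij: X_kj for every incoming neighbor gk of gi
    (W[k][i] = 1), and X_ip for every outgoing neighbor gp of gj
    (W[j][p] = 1)."""
    n = len(W)
    nbrs = [(k, j) for k in range(n) if W[k][i] == 1]
    nbrs += [(i, p) for p in range(n) if W[j][p] == 1]
    return nbrs

# Toy W over (g0, g1, g2) with edges g0->g1, g0->g2, g1->g2.
W = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```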
We employ MRF over E to evaluate this dependency among
genes. We first describe how we build MRF over the MRF graph.
Let us denote the neighbors of Xij in the MRF graph as X*ij =
{Xkj | Wki = 1} ∪ {Xip | Wjp = 1}.

p(Xij | X − Xij, Ox) =
exp(γ1 ζ(Xij) + γ2 Σ_{t ∈ {0,2,3}} f(Xij, t) + γ3 Σ_{k: Wki=1} f(Xij, Xkj) + γ4 Σ_{p: Wjp=1} f(Xij, Xip))
/ Σ_{Xij ∈ {0,1,2,3}} exp(γ1 ζ(Xij) + γ2 Σ_{t ∈ {0,2,3}} f(Xij, t) + γ3 Σ_{k: Wki=1} f(Xij, Xkj) + γ4 Σ_{p: Wjp=1} f(Xij, Xip))
(2)

Fig. 3. The equation for the conditional likelihood function.

In a Markov random field, a node is independent of the other nodes
given its neighbors. Formally,
p(Xij | X − {Xij}, Ox) = p(Xij | X*ij, Ox). Using the Hammersley-
Clifford theorem [4, 18], we express the joint distribution of X
in the MRF as a product of potential functions in a way that
represents the conditional independence of the Xij variables. More
specifically, we express p(X | Ox) as the product of the potential
functions defined over the cliques in the graph, divided by a
partition function Z, as p(X | Ox) = (1/Z) Π_{i,j: Wij=1} Φ(Cij). Here,
Cij is a clique in the MRF and Φ(Cij) is a potential function over
Cij. We create a clique Cij over the variable Xij and its
neighbors X*ij when Wij = 1. To limit the complexity of our model,
we consider only cliques of size one and two. We normalize the
probability distribution with Z = Σ_X Π_{i,j: Wij=1} Φ(Cij).
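The resulting conditional distribution of a single Xij (Equation 2 in Figure 3) is a softmax of weighted feature counts over the four candidate values. A minimal sketch, assuming the γ weights and the current values of the neighboring X variables are given; the function name and its inputs are ours:

```python
# Sketch: conditional distribution of one Xij as in Equation 2.
import math

def conditional_likelihood(gammas, in_vals, out_vals):
    """gammas: (g1, g2, g3, g4); in_vals: current values of the Xkj
    neighbors (Wki = 1); out_vals: values of the Xip neighbors
    (Wjp = 1). Returns {v: p(Xij = v | neighbors)}."""
    g1, g2, g3, g4 = gammas
    scores = {}
    for v in (0, 1, 2, 3):
        s = g1 * (v == 1)                     # singleton feature zeta(Xij)
        s += g2 * (v in (0, 2, 3))            # singleton feature for t != 1
        s += g3 * sum(v == x for x in in_vals)
        s += g4 * sum(v == x for x in out_vals)
        scores[v] = math.exp(s)
    z = sum(scores.values())
    return {v: sc / z for v, sc in scores.items()}
```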
The potential functions capture the dependency of the variables
in the MRF graph. So, we incorporate our assumptions about the
genetic network in the potential functions. The potential functions
consist of feature functions. Each feature function takes two integers
as input and produces a binary integer as output. Here, we define
each feature function as f(vl, v2) =1 if v1 = v2 and 0 otherwise.
The optimization of MRF assigns weights to the feature functions
according to their actual prevalence in the MRF graph. In the
following paragraphs, we discuss the proposed feature functions.
1. We define two feature functions for the set of singleton cliques.
These two feature functions capture the two events Xij = 1
and Xij ≠ 1. We write a feature function f(Xij, 1) which
equals 1 when Xij = 1 and 0 otherwise. For notational
convenience, we also denote this feature function by ζ(Xij),
with the same semantics as f(Xij, 1). We define another
feature function f(Xij, t), where t ∈ {0, 2, 3}, which equals
1 when Xij equals t and 0 otherwise. We define two separate
feature functions to capture these two events because their
frequencies are not equal.
2. Let us consider the four genes g1, g2, g3 and g5 in
Figure 2(a). Consider the X23 variable in the MRF graph, which
encodes the joint state of g2 and g3. X13 is a neighbor of X23
in the MRF graph as g1 is an incoming neighbor of g2 in the gene
network. Similarly, X25 is a neighbor of X23 as g5 is an outgoing
neighbor of g3. So, if S1 equals S2 then X23 = X13. Similarly,
if S3 equals S5 then X23 = X25. We capture these events in
two feature functions for Xij based on the incoming neighbors of
gi and the outgoing neighbors of gj.
Incoming neighbors of gi: Let us denote the incoming
neighbors of gi by In(gi). We write a feature function
f(Xkj, Xij), ∀k, gk ∈ In(gi). f(Xkj, Xij) = 1 if Si = Sk
and Wki = Wij = 1. Otherwise, f(Xkj, Xij) = 0.
Outgoing neighbors of gj: Let us denote the outgoing
neighbors of gj by Out(gj). We define a feature function
f(Xip, Xij), ∀p, gp ∈ Out(gj). f(Xip, Xij) = 1 if Sp =
Sj and Wjp = Wij = 1. Otherwise, f(Xip, Xij) = 0.
In the last two feature functions, Xkj or Xip may not represent any
interactions from the extended gene network when Wkj = 0 or Wip
= 0 respectively. We represent them by rectangles in Figure 2(b).
After embedding these feature functions in the potential functions,
the joint prior distribution of X is,

p(X | Ox) = (1/Z) exp( γ1 Σ_{i,j: Wij=1} ζ(Xij)
+ γ2 Σ_{i,j,t: Wij=1, t ∈ {0,2,3}} f(Xij, t)
+ γ3 Σ_{i,j,k: Wij=1, Wki=1} f(Xij, Xkj)
+ γ4 Σ_{i,j,p: Wij=1, Wjp=1} f(Xij, Xip) )
(3)
Here we denote {γ1, γ2, γ3, γ4} as the MRF parameters Ox.
In the next section, we discuss how we define the objective
function with respect to the MRF. We also describe how we formulate
the posterior probability density function for Xij.
2.4 Approximation of the objective function
To optimize the objective function in Equation 1, we need to
evaluate the prior density function p(XIOx). In Equation 3, we
defined this function as a combination of feature functions. This
requires estimation of Ox. Also, a maximum likelihood estimation
of Ox requires evaluation of the partition function Z in Equation 3.
Evaluation of Z is intractable even for a small number of Xij, as
the number of terms in the summation is exponential in the number
of Xij. We use an approximate formula to solve this problem. A
standard approximation scheme is pseudolikelihood [5], where the
objective function is the simple product of the conditional likelihood
function of the Xij variables. Geman et al. proved the consistency
of the maximum pseudolikelihood estimate [14].
From Equation 3 the conditional likelihood of a node Xij is given
by Equation 2 in Figure 3. We refer the reader to the Appendix I for
the derivation of Equation 2.
In this framework of pseudolikelihood, we can approximate the
objective function of Equation 1 as the product of posterior density
function of Xij and optimize it using the ICM [6] algorithm given
Ox. We consider those Xij when Wij =1, as for other Xij variables,
gi and gj are not correlated. Thus, our objective function becomes,
X̂ = argmax_X Π_{i,j: Wij=1} Fij   (4)

We derive the posterior density function Fij of Xij as,

Fij = p(Xij | X − Xij, Ox, Oy)
= p(Yi, Yj | Xij, X*ij, Oy) p(Xij | X − Xij, Ox) / Σ_{Xij ∈ {0,1,2,3}} p(Yi, Yj | Xij, X*ij, Oy) p(Xij | X − Xij, Ox)   (5)
We present the derivation of Fij in Appendix III due to page
limitation.
There are two different terms in the objective function of Equation 4.
We already discussed one, the conditional likelihood density
function in Equation 2. We discuss the other, the data likelihood
density function p(Yi, Yj | Xij, X*ij, Oy), in the next section.
2.5 Calculation of likelihood density function
The next challenge in computing the posterior density function
is to define the data likelihood function (i.e., the function
p(Yi, Yj | Xij, X*ij, Oy) in Equation 5). In this section, we describe
how we derive this function.
We start by making a mild assumption that the expressions of a
gene in its control or noncontrol group follow a normal distribution.
Note that we can rewrite the derivation below with minor changes
when the expressions of genes follow another distribution. To keep
the discussion brief, we will use the normal distribution.
When a gene is equally expressed, all the data points in both
control and noncontrol groups share a latent mean [25]. For a DE
gene, the shared latent means in the two groups are different. So,
for a DE gene the control and noncontrol groups should follow
separate normal distributions. Let us denote a set of measurements
for a gene gi by zi = {zi1, zi2, ..., zin} that follows a single
Gaussian distribution. Let us denote the latent mean of zi by μ and
the standard deviation by σ. As different genes can have different
average expressions, we assume that μ follows a genome-wide
distribution [25] with mean μ0 and standard deviation τ. μ varies
between the control and noncontrol groups of a DE gene. Thus, for
zi, the joint likelihood of the data points in that group is given by,

p(zi | μ0, σ², τ²) = ∫ [ Π_{l=1}^{n} N(zil | μ, σ²) ] N(μ | μ0, τ²) dμ
= [σ / ((√(2π))^n σ^n √(nτ² + σ²))]
× exp( −(Σ_l zil²)/(2σ²) − μ0²/(2τ²)
+ [τ²(Σ_l zil)²/σ² + 2μ0 Σ_l zil + σ²μ0²/τ²] / (2(nτ² + σ²)) )
(6)
The reader can find the derivation of Equation 6 in Demichelis et
al. [12].
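As a sanity check on the closed form, Equation 6 can be coded and compared against direct numerical integration over μ. This is our own sketch, assuming the notation above (n measurements, prior N(μ0, τ²) on the latent mean); it is not the paper's implementation:

```python
# Sketch: closed-form marginal log-likelihood of Equation 6, with the
# latent mean mu integrated out against its N(mu0, tau^2) prior.
import math

def marginal_loglik(z, mu0, sigma, tau):
    n = len(z)
    s, s2 = sum(z), sum(v * v for v in z)
    log_c = (math.log(sigma)
             - n * math.log(math.sqrt(2 * math.pi) * sigma)
             - 0.5 * math.log(n * tau ** 2 + sigma ** 2))
    expo = (-s2 / (2 * sigma ** 2)
            - mu0 ** 2 / (2 * tau ** 2)
            + (tau ** 2 * s ** 2 / sigma ** 2
               + 2 * mu0 * s
               + sigma ** 2 * mu0 ** 2 / tau ** 2)
            / (2 * (n * tau ** 2 + sigma ** 2)))
    return log_c + expo
```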
If a gene is DE, its expression measurements in control and non
control groups follow separate distributions. On the other hand, for
equally expressed genes, all the measurements in both the groups
share the same mean. The joint data likelihood for a DE gene is given
by,

LDE(gi) = p(yi | μ0, σ², τ²) p(y'i | μ0, σ², τ²)   (7)

Similarly, for EE genes it is given by,

LEE(gi) = p(yi ∪ y'i | μ0, σ², τ²)   (8)
Now we are ready to derive the joint likelihood distribution for the
different values of Xij. Let us denote the set of parameters {μ0, σ, τ}
by Oy.
Case 1 (Xij = 1): Both gi and gj are DE. We
define the neighbor states of Si by S*i = {Sk | Xik or Xki ∈ X*ij}. We
substitute Xij with the pair {Si, Sj}, where Si and Sj denote the
states of genes gi and gj respectively, and Xij ∪ X*ij with
{Si, Sj, S*i, S*j}. So, for Xij = 1, p(Yi, Yj | Xij = 1, X*ij, Oy) =
p(Yi | Si = DE, Oy) p(Yj | Sj = DE, Oy) = LDE(gi) LDE(gj).
Case 2 (Xij = 0): Both gi and gj are EE. As earlier,
we substitute {Xij, X*ij} with {Si, Sj, S*i, S*j}. From Equation 8
we obtain the likelihood density function p(Yi, Yj | Xij =
0, X*ij, Oy) = LEE(gi) LEE(gj).
Case 3 (Xij = 2 or Xij = 3): In this case only one
of the genes is DE. From Equations 7 and 8, for Xij = 2
we obtain the likelihood density function p(Yi, Yj | Xij =
2, X*ij, Oy) = LDE(gi) LEE(gj). Similarly, for Xij = 3 we
obtain p(Yi, Yj | Xij = 3, X*ij, Oy) = LEE(gi) LDE(gj). We defer
the detailed derivation to Appendix IV.
A special case arises when gi is the metagene gR. We assume
that LDE(gR) = 1 and LEE(gR) = 0. Thus, the likelihood of the
metagene given that SR is DE equals 1, and 0 otherwise.
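The four cases above reduce to products of the per-gene DE/EE likelihoods. A small sketch (the `pair_likelihood` helper and the `l_de`/`l_ee` callables are our own abstractions of Equations 7 and 8):

```python
# Sketch: joint data likelihood p(Yi, Yj | Xij) for the four values of
# Xij, built from per-gene DE (Eq. 7) and EE (Eq. 8) likelihoods.
def pair_likelihood(xij, l_de, l_ee, gi, gj):
    if xij == 1:                        # both genes DE
        return l_de(gi) * l_de(gj)
    if xij == 0:                        # both genes EE
        return l_ee(gi) * l_ee(gj)
    if xij == 2:                        # gi DE, gj EE
        return l_de(gi) * l_ee(gj)
    return l_ee(gi) * l_de(gj)          # xij == 3: gi EE, gj DE
```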
2.6 Objective function optimization
So far, we have described how we compute the posterior density
function. The final challenge is to find the values of the hidden
variables that maximize the objective function (Equation 4). We
develop an iterative algorithm to address this challenge.
At each iteration, we first estimate the hyperparameters of joint
conditional density function and joint likelihood density function
based on the estimated value of X in the previous iteration. Next,
based on the estimated hyperparameters, we estimate X.
The joint likelihood density function is nonconvex in terms
of the parameters Oy = {μ0, σ, τ}. Also, the joint conditional
density is nonconvex in terms of Ox = {γ1, γ2, γ3, γ4}. We use
a global optimization method called differential evolution [37] to
optimize both of them. To optimize the objective function we
employ the ICM algorithm described by Besag [6]. Briefly, our
iterative algorithm works as follows.
1. Obtain an initial estimate of the S variables. In our implementation, we use
Student's t-test, assuming the data follows a normal distribution. We use a 5%
significance level for this purpose.
2. Estimate the parameters Oy that maximize the joint data likelihood function,

Π_{i,j: Wij=1} p(Yi, Yj | Xij, X*ij, Oy)
= Π_{v ∈ {0,1,2,3}} Π_{i,j: Xij=v, Wij=1} p(Yi, Yj | Xij = v, X*ij, Oy)

We implement this step using Differential Evolution, which is similar to a
genetic algorithm.
3. Calculate an estimate of the parameters Ox that maximizes the conditional
prior density function Π_{i,j: Wij=1} p(Xij | X − {Xij}, Ox). We also
implement this step using Differential Evolution.
4. Carry out a single cycle of ICM using the current estimates of S, Ox and
Oy. For each Si, maximize Π p(Xmn | X − Xmn, y, Ox, Oy) over all Xmn
with m = i or n = i and Wmn = 1.
5. Go to step 2 for a fixed number of cycles or until X converges to a certain
predefined value.
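The control flow of Steps 1-5 can be sketched as follows. The parameter-fitting callables stand in for the Differential Evolution steps and the ICM sweep; all names here are ours, not the paper's code:

```python
# Skeleton of the iterative optimization loop (Steps 1-5).
def icm_loop(init_states, fit_theta_y, fit_theta_x, icm_cycle, max_iters=50):
    """init_states: initial DE/EE assignment (Step 1, e.g. from a t-test).
    fit_theta_y / fit_theta_x: parameter estimation given states
    (Steps 2 and 3). icm_cycle: one ICM sweep returning an updated
    state assignment (Step 4)."""
    states = init_states
    theta_y = theta_x = None
    for _ in range(max_iters):
        theta_y = fit_theta_y(states)                      # Step 2
        theta_x = fit_theta_x(states)                      # Step 3
        new_states = icm_cycle(states, theta_x, theta_y)   # Step 4
        if new_states == states:                           # Step 5: converged
            break
        states = new_states
    return states, theta_x, theta_y
```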
We optimize the objective function in terms of the Si (1 ≤
i ≤ M) variables instead of the Xij variables. So, in step 4, we
maximize the product of the conditional density functions of all
the Xij variables that involve gene gi, given by Π p(Xmn | X − Xmn, y, Ox, Oy),
where Xmn ∈ {Xmn | m = i or n = i} and Wmn = 1.
Table 1. List of the top 25 genes that are most affected by the external
perturbation. The dataset was generated using 10 Gy ionizing radiation over
immortalized B cells obtained from 155 members of 15 Centre d'Étude du
Polymorphisme Humain (CEPH) Utah pedigrees [11]. Genes are tabulated
row-wise, in increasing order of rank.
PGF IL8RB FOSL1 F2R PPM1D
MDM2 CDKN1A TNC PLXNB2 EPHA2
DDB2 TP53I3 PLK1 TNFSF9 ADRB2
MAP3K12 JUN SORBS1 LRDD MDM2
SDC1 MYC PRKAB1 EI24 DDIT4
3 EXPERIMENTS
In this section we discuss the experiments we conducted to
evaluate the quality of our method. We implemented our method
in MATLAB and Java. We obtained the code for Differential
Evolution from http://www.icsi.berkeley.edu/~storn/code.html.
We compared our method with SSEM [9],
as SSEM is one of the most recent methods that can be used to
solve the problem considered in this paper. We obtained SSEM
from http://gardnerlab.bu.edu/SSEMLasso. We ran
our code on a cluster of AMD Opteron 2.4 GHz and
Intel Core 2 Duo 2.4 GHz machines, each with 4 GB of memory.
Dataset: We use the data set collected by Smirnov et al. [34] as
the real microarray data. The dataset was generated using 10 Gy
ionizing radiation over immortalized B cells obtained from 155
members of 15 Centre d'Étude du Polymorphisme Humain (CEPH)
Utah pedigrees [11]. Microarray snapshots were obtained at the 0th
hour (i.e., before the radiation) and 2 and 6 hours after the radiation.
We adapt the time series data to create the control and noncontrol
data for our experiments. We use the data before radiation as the control
data. For the noncontrol data, we calculate the expected expression
of each gene at each time point after the radiation and select the one with the
higher absolute difference from the expected control expression
for that gene.
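The time-point selection described above can be sketched as follows (our own illustration; the array shapes and the helper name are assumptions, not the authors' code):

```python
# Sketch: build the noncontrol data by keeping, for each gene, the
# post-radiation time point whose mean is farthest from the 0-hour mean.
import numpy as np

def make_noncontrol(expr_0h, expr_2h, expr_6h):
    """Each argument is a (genes x samples) array."""
    mean0 = expr_0h.mean(axis=1)
    d2 = np.abs(expr_2h.mean(axis=1) - mean0)
    d6 = np.abs(expr_6h.mean(axis=1) - mean0)
    pick_6h = d6 > d2                   # per-gene time-point choice
    return np.where(pick_6h[:, None], expr_6h, expr_2h)
```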
We also collect 24,663 genetic interactions from the 105
regulatory and signaling pathways of the KEGG database [24]. Overall,
2,335 genes belong to at least one pathway in KEGG. We consider
only the genes that take part in the gene networks in our model.
3.1 Evaluation of biological significance
In this experiment, we investigate the biological significance of the
genes that our method detects as the primarily affected ones. We
train our method on the dataset collected by Smirnov et al. [34].
After optimization, we rank the genes in descending order of the
data likelihood of the DE genes with the perturbation metagene. We
tabulate the top 25 genes from this ranking in Table 1.
Nine out of the ten highest ranked genes have significant
biological evidence that they are impacted by radiation. Imaoka et
al. [21] compared the gene expression of normal mammary
glands to that of spontaneous and γ-radiation-induced cancerous glands
of rats. The PGF (placental growth factor) gene showed differential
expression in both spontaneous and irradiated carcinomas.
Nagtegaal et al. [30] applied radiation to human rectal adenocarcinomas
and compared the gene response to that of normal tissues. The
cytokine receptor IL8RB showed differential expression
between normal and irradiated rectal tissues. Amundson et al. [1]
administered γ-radiation to the p53 wild-type ML-1 human myeloid
cell line. FOSL1 (known as FRA-1 at that time) showed differential
expression as the stress response. Lin et al. [26] applied ionizing
Fig. 4. Frequency of the average distance of rankings over training and testing
data. The figure shows that the difference is very close to zero. This suggests
that our method can rank the probabilistic effect of the incoming neighbors
of the genes with great precision. The average difference between the ranks
obtained in the training and the testing data is less than one position in 92.7%
of the cases.
radiation on human lymphoblastoid cells. F2R, a coagulation
factor II receptor, was upregulated in that experiment. Jen
et al. [23] investigated the effect of ionizing radiation on the
transcriptional response of lymphoblastoid cells in time series
microarray experiments. PPM1D, a gene related to DNA repair,
showed response to both 3 Gy and 10 Gy radiation. Wu et al. [39]
conducted a high-dose UV radiation experiment to observe the
relation between the MDM2 and p53 genes. Their experiment
revealed that initially both the protein and mRNA levels of MDM2
increase in a p53-independent manner, which clearly substantiated
the direct effect of radiation on MDM2. Jakob et al. [22] irradiated
human fibroblasts with accelerated lead ions. Confocal microscopy
revealed a single, bright focus of CDKN1A protein in the nuclei
of human fibroblasts within 2 minutes after radiation. Rieger et
al. [32] applied both ultraviolet and ionizing radiation to fifteen
human cell lines and observed that PLXNB2 was upregulated for
both kinds of radiation. Zhang et al. [40] reported that EPHA2
works as an essential mediator of UV-radiation-induced apoptosis.
The above discussion demonstrates that our method can accurately
find the primarily affected genes from perturbation microarray
data.
3.2 Evaluation of the rankings of neighbor genes
Recall that our goal is to find the primarily affected genes.
We achieve this objective by computing, for each gene including
the metagene, the probability that it contributes to the alterations in the
expression of every other gene. In this experiment, we evaluate
our success in terms of how accurately we rank these contribution
probabilities, as follows.
We divide the dataset of 155 samples into training and testing sets
in a 2:1 ratio. For each DE gene, we sort its incoming DE neighbors
in decreasing order of their data likelihood probability with respect
to the outgoing neighbor. For example, let us assume g_1 is DE. It
has four incoming DE neighbors g_2, g_3, g_4 and g_R, where g_R is
the metagene. Let NL_{ij} denote the normalized likelihood of X_{ij},
\[
NL_{ij} = \frac{p(Y_i, Y_j \mid X_{ij}, X^*_{ij}, \Theta_Y)}{\sum_{X'_{ij} \in \{0,1,2,3\}} p(Y_i, Y_j \mid X'_{ij}, X^*_{ij}, \Theta_Y)}.
\]
For instance, if NL_{R1} > NL_{41} > NL_{21} > NL_{31}, then the sorted list is {g_R, g_4, g_2, g_3}.
We denote the sorted list as a ranking of the incoming DE neighbors.
Let us denote the position of a gene g_i in the ranking of g_j for the
training data by p^{tr}_{g_j}(g_i). We create another set of rankings from the
testing data likelihood probability. Let us denote the position of g_i
in the ranking of g_j from the testing data by p^{te}_{g_j}(g_i). For a gene g_j,
we define the average ranking distance between the training and testing
data as
\[
\delta(g_j) = \frac{\sum_{g_i \in IN(g_j)} \mathrm{abs}\big(p^{tr}_{g_j}(g_i) - p^{te}_{g_j}(g_i)\big)}{|IN(g_j)|},
\]
where IN(g_j) is
the set of incoming DE neighbors of g_j, abs(.) denotes the absolute
value and |IN(g_j)| stands for the cardinality of IN(g_j).
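A minimal sketch of this ranking comparison (our illustration; the gene identifiers are hypothetical):

```python
def avg_ranking_distance(train_rank, test_rank):
    """delta(g_j): mean absolute displacement of each incoming DE neighbor
    between the training ranking and the testing ranking (both best-first,
    over the same set of neighbors)."""
    pos_tr = {g: r for r, g in enumerate(train_rank)}
    pos_te = {g: r for r, g in enumerate(test_rank)}
    return sum(abs(pos_tr[g] - pos_te[g]) for g in train_rank) / len(train_rank)
```

For the example above, `avg_ranking_distance(['gR', 'g4', 'g2', 'g3'], ['gR', 'g2', 'g4', 'g3'])` yields 0.5, since g_4 and g_2 each move one position.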
We calculate the average ranking distance for all the genes that
have incoming neighbors apart from the metagene. We repeat the
experiment three times with different sets of training and testing
data. We create a histogram of the average distances from the
three experiments in Figure 4.
Figure 4 shows that the difference is very close to zero. This
suggests that our method can rank the probabilistic effect of the
incoming neighbors of the genes with great precision. The average
difference between the ranks obtained in the training and the testing
data is less than one position in 92.7% of the cases. This implies that
our method can accurately identify the primary cause of DE genes.
3.3 Evaluation of the accuracy of our method
The experiments over the real dataset suggest the validity of our
model. Two questions however follow from these experiments. (1)
What are the limitations of our method? In other words, when does
our method work accurately? (2) Does our method distinguish the
primarily affected genes from the others?
To answer these questions we conducted experiments on synthetic
datasets to observe the performance of our method in a controlled
manner.
Synthetic data generation We generate the data in the presence
of a simulated perturbation to mimic the real dataset. We first use
the gene network derived from KEGG to select a random gene
from the network and denote it as a primarily affected DE gene.
We traverse its ancestors in a breadth-first manner. For each
ancestor, we make it a secondarily affected DE gene with
probability 1 − (1 − q)^r, where r is the number of its incoming
DE neighbors. Here q is the probability that a gene is DE due
to a DE predecessor. We repeat these steps to create the desired
number of primarily affected genes. After classifying the
genes, we create control and non-control data for each of them
over N patients. We first create the control-group data for a
gene with mean μ_c and standard deviation σ. We sample the mean
μ_c from another Gaussian distribution N(μ_0, τ). For an EE gene, both
groups follow the same distribution. For a DE gene, we separate
the mean μ_c of gene expression between the control and non-control
groups by a fixed amount. We keep the separation for primarily
affected genes higher than that for secondarily affected genes. In
our experiment, q = 0.4, N = 200, μ_0 = 7, τ = 2.9 and σ = 0.87.
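The generation procedure above can be sketched as follows (our illustration, using the stated parameter values; the toy network dictionaries `out_nbrs` and `in_nbrs` are hypothetical stand-ins for the KEGG-derived network):

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
N, mu0, tau, sigma = 200, 7.0, 2.9, 0.87  # parameter values from the text

def mark_affected(out_nbrs, in_nbrs, primary, q=0.4, rng=rng):
    """Spread DE status from one primarily affected gene through the network.
    A traversed gene with r incoming DE neighbors becomes secondarily
    affected with probability 1 - (1 - q)**r."""
    de = {primary}
    queue = deque(out_nbrs.get(primary, []))  # breadth-first traversal
    seen = set(queue)
    while queue:
        g = queue.popleft()
        r = sum(h in de for h in in_nbrs.get(g, []))
        if rng.random() < 1.0 - (1.0 - q) ** r:
            de.add(g)
            for h in out_nbrs.get(g, []):
                if h not in seen:
                    seen.add(h)
                    queue.append(h)
    return de

def sample_expressions(is_de, is_primary, gap_p, gap_s, n=N):
    """Control/non-control samples for one gene: the control mean mu_c is
    drawn from N(mu0, tau); a DE gene's non-control mean is shifted by a
    fixed gap (larger for primarily affected genes)."""
    mu_c = rng.normal(mu0, tau)
    control = rng.normal(mu_c, sigma, n)
    gap = gap_p if is_primary else (gap_s if is_de else 0.0)
    noncontrol = rng.normal(mu_c + gap, sigma, n)
    return control, noncontrol
```

Note that a gene with no incoming DE neighbor (r = 0) is never marked DE, matching the probability formula.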
Detection of primarily affected genes In this experiment our goal
is to detect the genes primarily affected by the perturbation using
the synthetic dataset. After optimization, we rank all the DE genes
in descending order of their normalized data likelihood with respect
to the metagene g_R. Let the set of true primarily affected genes be PA. Let RG
be the set of the first |PA| genes from the ranking, where |PA| is the
cardinality of PA. We define accuracy as |PA ∩ RG| / |PA|.
In our experiments we keep the mean separation of the secondarily
affected genes fixed and vary the gap between the mean separations of
the primarily and secondarily affected genes between 0 and σ to produce
different datasets. We calculate the accuracy for each gap value and plot it.
We repeat the entire experiment twice, with |PA| = 10
and 54. The total number of DE genes is 159 and 400, respectively.
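A minimal sketch of this accuracy measure (our illustration):

```python
def top_k_accuracy(ranked_genes, primary_set):
    """|PA intersect RG| / |PA|, where RG is the first |PA| entries of the
    likelihood-based ranking and PA is the true primarily affected set."""
    top = set(ranked_genes[:len(primary_set)])
    return len(top & primary_set) / len(primary_set)
```

Because |RG| = |PA|, this value is 1.0 exactly when the top of the ranking recovers every primarily affected gene.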
[Plot: accuracy versus the gap between primarily and secondarily affected genes, with one curve for 10 and one for 54 primarily affected genes.]
Fig. 5. Accuracy of our method over the synthetic dataset. The X axis is in
units of σ. For instance, 0.4 means that the difference between the means of
the control and non-control groups of a gene is 0.4 × σ. The figure shows that
we reach 100% accuracy even with a difference of 0.5 × σ. Compared to the
standard deviation of the noise (i.e., σ) of the data, this difference is quite small.
This demonstrates that our model can differentiate between primarily
and secondarily affected genes even when the difference between the two groups is very small.
Figure 5 shows that we reach 100% accuracy even with a
difference of 0.5 × σ. Compared to the standard deviation of the noise
(i.e., σ) of the data, this difference is quite small. This demonstrates that our
model can differentiate between primarily and secondarily
affected genes even when the difference between the two groups is very small. Also,
the total numbers of DE genes were 159 and 400, which
were quite high compared to the numbers of corresponding primarily
affected genes (i.e., 10 and 54, respectively). So, our method is able
to eliminate most of the secondarily impacted genes in most of the
cases. As the number of primarily affected genes decreases from 54
to 10, the accuracy of our method increases.
3.4 Comparison to other methods
In this section, we compare the accuracy of our method to that of
SSEM and a simpler method, the Student's t-test.
Synthetic data generation: We simulated real perturbation events
to prepare synthetic data with known primarily and secondarily
affected genes in a controlled setting. We generated this dataset by
simulating the real dataset to the maximum possible extent. We used the
gene networks from the KEGG database. First, we decide on a random
set of primarily and secondarily affected genes based on the steps
described in Section 3.3. We use the control part of the real dataset
in Smirnov et al. [34] as the control part of our synthetic dataset. To
generate the non-control dataset, we traverse each of the genes that
participate in the gene networks. Suppose, for a gene g_i, the mean
and standard deviation of its expression in the control dataset are
given by μ_i and σ_i, respectively. If the gene is EE, we generate its
non-control data points from the normal distribution with the
parameters (μ_i, σ_i). If the gene is DE, we use the same variance
as that of the control group. However, we use a different mean. For
the primarily and secondarily affected genes we use μ_i + d_p and
μ_i + d_s, respectively, where d_p > d_s.
To summarize, we use the same variance in the non-control group
as in the control group. However, for an affected gene we
shift the mean of the non-control group away from that of
the control group. For a primarily affected gene we apply a higher
deviation of the mean than for a secondarily affected gene.
Experimental setup: Given an input dataset, we ranked all the genes
using each of the three methods. According to each method, highly
ranked genes have a higher chance of being primarily affected.
We explain how we do the ranking in the following.
[Plots: sensitivity versus rank of genes for our method, SSEM-Lasso and the Student's t-test. (a) Gap = 0.2 × σ. (b) Gap = 0.6 × σ.]
Fig. 6. Comparison of our method to SSEM and the t-test. The number of
primarily affected genes is 50. The gap between the means of the primarily
affected and secondarily affected genes is 0.2 × σ in (a) and 0.6 × σ in (b), where σ
is estimated from the real dataset. The figures indicate that our method
outperforms SSEM and the t-test.
* Our method: We sort the genes in decreasing order of joint
likelihood with the metagene. A higher joint likelihood implies a
higher chance of being primarily affected.
* SSEM: We train SSEM on the control dataset, where it learns the
correlation between the genes. We test SSEM on the noncontrol
dataset, where it produces a rank for each single data point.
* Student's t-test: We used the function ttest from
MATLAB. We apply it to every individual gene; it takes the
control and non-control datasets as input and produces a p-value as
output. By default, the null hypothesis is that "the differences of the two
input data sets are a random sample from a normal distribution
with mean 0 and unknown variance, against the alternative that
the mean is not 0". Thus, the null hypothesis corresponds to
the gene being EE, so a substantially lower p-value implies a higher
chance of being primarily affected. We performed the test on
all the genes and ranked them in increasing order of
p-values.
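For readers without MATLAB, a roughly equivalent ranking can be sketched in Python (our illustration; SciPy's paired t-test `scipy.stats.ttest_rel` plays the role of MATLAB's two-input ttest on the sample differences):

```python
import numpy as np
from scipy.stats import ttest_rel

def rank_by_ttest(control, noncontrol, gene_ids):
    """Paired t-test per gene: a small p-value is evidence against the
    null hypothesis that the gene is EE, so rank by ascending p-value."""
    pvals = [ttest_rel(a, b).pvalue for a, b in zip(control, noncontrol)]
    return [gene_ids[i] for i in np.argsort(pvals)]
```

The input here is one pair of equal-length sample arrays per gene; genes with the strongest evidence of differential expression come first.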
Let us denote the set of primarily affected genes by PG and the
first k elements of the ranking by RG_k. We define the sensitivity
of the ranking at position k as η_k = |PG ∩ RG_k| / |PG|. Thus, a higher
value of η_k denotes a higher sensitivity. We prepare a sensitivity
vector {η_1, η_2, ..., η_|R|} by arraying the sensitivity of the ranking
at every position, where |R| denotes the length
of the ranking. For SSEM we obtain a sensitivity vector for every
data point in the non-control dataset and create a consolidated
sensitivity vector by averaging them.
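The sensitivity vector can be sketched as follows (our illustration):

```python
def sensitivity_vector(ranked_genes, primary_set):
    """eta_k = |PG intersect RG_k| / |PG| for every prefix length k
    of the ranking, computed with a single running hit count."""
    hits, eta = 0, []
    for g in ranked_genes:
        hits += g in primary_set
        eta.append(hits / len(primary_set))
    return eta
```

The vector is non-decreasing and reaches 1.0 at the position where the last primarily affected gene appears in the ranking.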
Results: We conducted experiments for (d_p − d_s)/σ ∈ {0.2, 0.4, 0.6,
0.8, 1.0, 1.2, 1.5, 1.75}, number of primarily affected genes ∈ {10,
50} and number of data points ∈ {10, 20, 40, 60, 80, 100, 125, 155}.
Here, σ corresponds to the standard deviation of the expressions of the
genes in the dataset. However, due to space limitations, we discuss
only two of these experiments in this paper (see Figure 6). The results we discuss
correspond to the case with 50 primarily affected genes
and 155 data points. The results of the other experiments are similar
to those in Figure 6(b).
Figures 6(a) and 6(b) show the sensitivity of the three methods
when (d_p − d_s) = 0.2 × σ and 0.6 × σ, respectively. The former
corresponds to the computationally harder case, as the difference
between the control and non-control datasets is small. As the gap
between d_s and d_p increases, identifying primarily affected genes
becomes easier.
From the figures, we observe that our method is consistently and
significantly more accurate than the other two methods for all datasets. It
reaches 100% accuracy (i.e., it finds all 50 primarily affected
genes) within the top 150 ranked genes when the gap is small and
within the top 50 genes as the gap increases to 0.6 × σ. The results
were similar for larger gap values (results not shown). The t-test
reaches only around 40% and 50% sensitivity at ranking position 200,
respectively. SSEM's sensitivity is below 0.25 for all experiments,
even within the top 200 positions.
We believe there are two major factors, among others, behind the success of
our method over the competing methods. First, our
method successfully incorporates the gene interactions using
MRFs, while the others ignore this information. Second, our method is
capable of dealing with both large and small numbers of primarily
affected genes, while the other methods' performance deteriorates as
this number grows. In real perturbation experiments, often multiple
genes are primarily affected. Thus, we conclude that our method is
suitable for real perturbation experiments.
4 CONCLUSION
In this paper, we considered the problem of identifying primarily
affected genes in the presence of an external effect that can
perturb the expressions of genes. We assumed that we were
given the expression measurements of a set of genes before and
after the application of an external perturbation. We developed
a new probabilistic method to quantify the cause of differential
expression of each gene. Our method considers the possible gene
interactions in regulatory and signaling networks, for a large number
of perturbations. It uses a Bayesian model with the help of Markov
Random Fields to capture the dependency between the genes. It also
provides the underlying distribution of the impact with confidence
interval.
Our experiments on both real and synthetic datasets demonstrated
that our method can find primarily affected genes with high
accuracy. Our method was 100% accurate when the difference
between the expected expressions of primarily and secondarily affected
genes was at least half of the standard deviation of the gene
expressions. It achieved significantly better accuracy than two
competing methods, namely SSEM and the Student's t-test.
Our experiments suggest that our method is applicable to real
applications, as it continues to work well as the number of primarily affected
genes grows. SSEM can be more appropriate for small-scale
experiments, where only a few genes are primarily affected. When
confronted with a dataset with a higher number of affected genes and
multiple replications, SSEM is negatively impacted by the variance
of the genes over those replications. Hence, it fails to model the
patterns that exist over the entire dataset. The accuracy of our
method justifies our use of gene networks.
Our method produces a probability distribution rather than a fixed
binary decision. The major advantage of this approach is that it
augments every decision with a range, and hence endows it with a
confidence. A distribution is often more useful, as it
models the inherently stochastic nature of gene interactions.
REFERENCES
[1] SA. Amundson, M. Bittner, and Y. Chen et al. Fluorescent cDNA microarray hybridization reveals complexity and heterogeneity of cellular genotoxic stress responses. Oncogene, 18(24):3666-72, 1999.
[2] Ferhat Ay, Fei Xu, and Tamer Kahveci. Scalable steady state analysis of Boolean biological regulatory networks. PLoS One, 4(12), 2009.
[3] KA. Baggerly, KR. Coombes, and KR. Hess et al. Identifying differentially expressed genes in cDNA microarray experiments. J Comput Biol, 8(6):639-59, 2001.
[4] Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, 36(2):192-236, 1974.
[5] Julian Besag. Efficiency of pseudo-likelihood estimation for simple Gaussian fields. Biometrika, 64(3):616-618, 1977.
[6] Julian Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48(3):259-302, 1986.
[7] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
[8] RY. Cheng, A. Zhao, and WG. Alvord et al. Gene expression dose-response changes in microarrays after exposure of human peripheral lung epithelial cells to nickel(II). Toxicol Appl Pharmacol, 191(1):22-39, 2003.
[9] EJ. Cosgrove, Y. Zhou, TS. Gardner, and ED. Kolaczyk. Predicting gene targets of perturbations via network-based filtering of mRNA expression compendia. Bioinformatics, 24(21):2482-90, 2008.
[10] J. Courcelle, A. Khodursky, and B. Peter et al. Comparative gene expression profiles following UV exposure in wild-type and SOS-deficient Escherichia coli. Genetics, 158(1):41-64, 2001.
[11] J. Dausset and others. Centre d'etude du polymorphisme humain (CEPH): collaborative genetic mapping of the human genome. Genomics, 6:575-577, 1990.
[12] F. Demichelis, P. Magni, and P. Piergiorgi et al. A hierarchical Naive Bayes model for handling sample heterogeneity in classification problems: an application to tissue microarrays. BMC Bioinformatics, 7:514, 2006.
[13] D. di Bernardo, MJ. Thompson, and TS. Gardner et al. Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotechnol, 23(3):377-83, 2005.
[14] S. Geman and C. Graffigne. Markov random field image models and their applications to computer vision. Proceedings of the International Congress of Mathematicians, Berkeley, pages 1496-1517, 1987.
[15] G. Giaever, P. Flaherty, and J. Kumm et al. Chemogenomic profiling: identifying the functional interactions of small molecules in yeast. Proc Natl Acad Sci U S A, 101(3):793-8, 2004.
[16] G. Giaever, DD. Shoemaker, and TW. Jones et al. Genomic profiling of drug sensitivities via induced haploinsufficiency. Nat Genet, 21(3):278-83, 1999.
[17] D. Hamelinck, H. Zhou, and L. Li et al. Optimized normalization for antibody microarrays and application to serum-protein profiling. Mol Cell Proteomics, 4(6):773-84, 2005.
[18] JM. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.
[19] TR. Hughes, MJ. Marton, and AR. Jones et al. Functional discovery via a compendium of expression profiles. Cell, 102(1):109-126, 2000.
[20] T. Ideker, V. Thorsson, and JA. Ranish et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292(5518):929-34, 2001.
[21] T. Imaoka, S. Yamashita, and M. Nishimura et al. Gene expression profiling distinguishes between spontaneous and radiation-induced rat mammary carcinomas. J Radiat Res (Tokyo), 49(4):349-60, 2008.
[22] B. Jakob, M. Scholz, and G. Taucher-Scholz. Immediate localized CDKN1A (p21) radiation response after damage produced by heavy-ion tracks. Radiat Res, 154(4):398-405, 2000.
[23] KY. Jen and VG. Cheung. Transcriptional response of lymphoblastoid cells to ionizing radiation. Genome Res, 13(9):2092-100, 2003.
[24] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res, 28(1):27-30, 2000.
[25] CM. Kendziorski, MA. Newton, H. Lan, and MN. Gould. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med, 22(24):3899-914, 2003.
[26] R. Lin, Y. Sun, and C. Li et al. Identification of differentially expressed genes in human lymphoblastoid cells exposed to irradiation and suppression of radiation-induced apoptosis with antisense oligonucleotides against caspase-4. Oligonucleotides, 17(3):314-26, 2007.
[27] PY. Lum, CD. Armour, and SB. Stepaniants et al. Discovering modes of action for therapeutic compounds using a genome-wide screen of yeast heterozygotes. Cell, 116(1):57, 2004.
[28] MJ. Marton, JL. DeRisi, and HA. Bennett et al. Drug target validation and identification of secondary drug target effects using DNA microarrays. Nat Med, 4(11):1293-301, 1998.
[29] GL. Miklos and R. Maleszka. Microarray reality checks in the context of a complex disease. Nat Biotechnol, 22(5):615-21, 2004.
[30] ID. Nagtegaal, CG. Gaspar, and LT. Peltenburg et al. Radiation induces different changes in expression profiles of normal rectal tissue compared with rectal carcinoma. Virchows Arch, 446(2):127-35, 2005.
[31] AB. Parsons, RL. Brost, and H. Ding et al. Integration of chemical-genetic and genetic interaction data links bioactive compounds to cellular target pathways. Nat Biotechnol, 22(1):62-9, 2004.
[32] KE. Rieger and G. Chu. Portrait of transcriptional responses to ultraviolet and ionizing radiation in human cells. Nucleic Acids Res, 32(16):4786-803, 2004.
[33] Yishai Shimoni, Gilgi Friedlander, and Guy Hetzroni et al. Regulation of gene expression by small non-coding RNAs: a quantitative view. Mol Syst Biol, 3:138, 2007.
[34] DA. Smirnov, M. Morley, and E. Shin et al. Genetic analysis of radiation-induced changes in human gene expression. Nature, 459(7246):587-91, 2009.
[35] Bin Song, I. Esra Buyuktahtakin, Sanjay Ranka, and Tamer Kahveci. Manipulating the steady state of metabolic pathways. IEEE TCBB.
[36] Le Song, Mladen Kolar, and Eric P. Xing. KELLER: estimating time-varying interactions between genes. Bioinformatics, 25(12):i128-136, June 2009.
[37] R. Storn and K. Price. Differential evolution: a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341-359, 1997.
[38] KK. Tsai, EY. Chuang, JB. Little, and ZM. Yuan. Cellular mechanisms for low-dose ionizing radiation-induced perturbation of the breast tissue microenvironment. Cancer Res, 65(15):6734-44, 2005.
[39] L. Wu and AJ. Levine. Differential regulation of the p21/WAF1 and mdm2 genes after high-dose UV irradiation: p53-dependent and p53-independent regulation of the mdm2 gene. Mol Med, 3(7):441-51, 1997.
[40] G. Zhang, CN. Njauw, JM. Park, C. Naruse, M. Asano, and H. Tsao. EphA2 is an essential mediator of UV radiation-induced apoptosis. Cancer Res, 68(6):1691-6, 2008.
APPENDIX
I CONDITIONAL LIKELIHOOD
From Equation 3, the conditional likelihood of a node X_{ij} is
\[
p(X_{ij} \mid X - X_{ij}, \Theta_X) = \frac{\exp\big(\gamma_1 \zeta(X_{ij}) + \gamma_2 \sum_{t \in \{0,2,3\}} f(X_{ij}, t) + \gamma_3 \sum_{k, W_{ki}=1} f(X_{ij}, X_{kj}) + \gamma_4 \sum_{p, W_{jp}=1} f(X_{ij}, X_{ip})\big)}{\sum_{X'_{ij} \in \{0,1,2,3\}} \exp\big(\gamma_1 \zeta(X'_{ij}) + \gamma_2 \sum_{t \in \{0,2,3\}} f(X'_{ij}, t) + \gamma_3 \sum_{k, W_{ki}=1} f(X'_{ij}, X_{kj}) + \gamma_4 \sum_{p, W_{jp}=1} f(X'_{ij}, X_{ip})\big)} \tag{9}
\]
We provide the derivation of Equation 9 in Appendix II. As the number of incoming and outgoing neighbors can vary across genes, we modify Equation 2 to normalize γ_3 and γ_4 as
\[
p(X_{ij} \mid X - X_{ij}, \Theta_X) = \frac{\exp\big(\gamma_1 \zeta(X_{ij}) + \gamma_2 \sum_{t \in \{0,2,3\}} f(X_{ij}, t) + \gamma_3 \frac{\sum_{k, W_{ki}=1} f(X_{ij}, X_{kj})}{|N_1(X_{ij})|} + \gamma_4 \frac{\sum_{p, W_{jp}=1} f(X_{ij}, X_{ip})}{|N_2(X_{ij})|}\big)}{\sum_{X'_{ij} \in \{0,1,2,3\}} \exp\big(\gamma_1 \zeta(X'_{ij}) + \gamma_2 \sum_{t \in \{0,2,3\}} f(X'_{ij}, t) + \gamma_3 \frac{\sum_{k, W_{ki}=1} f(X'_{ij}, X_{kj})}{|N_1(X_{ij})|} + \gamma_4 \frac{\sum_{p, W_{jp}=1} f(X'_{ij}, X_{ip})}{|N_2(X_{ij})|}\big)} \tag{10}
\]
Here, N_1(X_{ij}) = {X_{kj} | W_{ki} = 1} and N_2(X_{ij}) = {X_{ip} | W_{jp} = 1} are the sets of neighbors of X_{ij}, and |N_1(X_{ij})| is the cardinality of N_1(X_{ij}).
II DERIVATION OF p(X_{ij} | X − X_{ij}, Θ_X), W_{ji} = 1:
\[
\begin{aligned}
p(X_{ij} \mid X - X_{ij}, \Theta_X)
&= \frac{p(X, \Theta_X)}{p(X - X_{ij}, \Theta_X)} \\
&= \frac{p(X, \Theta_X)}{\sum_{x'_{ij} \in \{0,1,2,3\}} p(X - X_{ij}, x'_{ij}, \Theta_X)} \\
&= \frac{A(X_{ij}) \cdot B}{(A(0) + A(1) + A(2) + A(3)) \cdot B}
\end{aligned} \tag{11}
\]
Here, A(X_{ij}) = \exp\big(\gamma_1 \zeta(X_{ij}) + \gamma_2 \sum_{t \in \{0,2,3\}} f(X_{ij}, t) + \gamma_3 \sum_{k, W_{ki}=1} f(X_{ij}, X_{kj}) + \gamma_4 \sum_{p, W_{jp}=1} f(X_{ij}, X_{ip})\big), and B collects the corresponding exponential factors over all nodes X_{mn} with mn ≠ ij, none of which involve X_{ij}. We cancel B from the numerator and denominator, and the density function simplifies to
\[
p(X_{ij} \mid X - X_{ij}, \Theta_X) = \frac{\exp\big(\gamma_1 \zeta(X_{ij}) + \gamma_2 \sum_{t \in \{0,2,3\}} f(X_{ij}, t) + \gamma_3 \sum_{k, W_{ki}=1} f(X_{ij}, X_{kj}) + \gamma_4 \sum_{p, W_{jp}=1} f(X_{ij}, X_{ip})\big)}{\sum_{X'_{ij} \in \{0,1,2,3\}} \exp\big(\gamma_1 \zeta(X'_{ij}) + \gamma_2 \sum_{t \in \{0,2,3\}} f(X'_{ij}, t) + \gamma_3 \sum_{k, W_{ki}=1} f(X'_{ij}, X_{kj}) + \gamma_4 \sum_{p, W_{jp}=1} f(X'_{ij}, X_{ip})\big)} \tag{12}
\]
Here, we denote the prior density parameters {γ_1, γ_2, γ_3, γ_4} by Θ_X.
III DERIVATION OF F_{ij}:
\[
\begin{aligned}
F_{ij} &= p(X_{ij} \mid X - X_{ij}, Y, \Theta_X, \Theta_Y) \\
&= p(X_{ij} \mid X - X_{ij}, Y_i, Y_j, \Theta_X, \Theta_Y) \\
&= \frac{p(Y_i, Y_j, X_{ij} \mid X - X_{ij}, \Theta_X, \Theta_Y)}{p(Y_i, Y_j \mid X - X_{ij}, \Theta_X, \Theta_Y)} \\
&= \frac{p(Y_i, Y_j \mid X_{ij}, X - X_{ij}, \Theta_Y)\, p(X_{ij} \mid X - X_{ij}, \Theta_X)}{\sum_{X'_{ij} \in \{0,1,2,3\}} p(Y_i, Y_j \mid X'_{ij}, X - X_{ij}, \Theta_Y)\, p(X'_{ij} \mid X - X_{ij}, \Theta_X)} \\
&= \frac{p(Y_i, Y_j \mid X_{ij}, X^*_{ij}, \Theta_Y)\, p(X_{ij} \mid X - X_{ij}, \Theta_X)}{\sum_{X'_{ij} \in \{0,1,2,3\}} p(Y_i, Y_j \mid X'_{ij}, X^*_{ij}, \Theta_Y)\, p(X'_{ij} \mid X - X_{ij}, \Theta_X)}
\end{aligned} \tag{13}
\]
In step 2 of the derivation, we substitute Y by Y_i and Y_j, as X_{ij} is independent of all Y_k such that k ≠ i and k ≠ j. In the last step, Y_i and Y_j depend on X only through X_{ij} and X^*_{ij} (see Appendix IV).
IV DERIVATION OF LIKELIHOOD DENSITY FUNCTIONS:
In this section we derive the data likelihood density functions. We use Equation 7 and Equation 8 as the data likelihood functions for DE and EE genes, respectively, in these derivations. We start by proving that Y_i and Y_j are independent of S^*_i and S^*_j given S_i and S_j. We represent the parameters {μ_0, σ, τ} by Θ_Y.
\[
\begin{aligned}
p(Y_i, Y_j \mid S_i, S_j, S^*_i, S^*_j, \Theta_Y)
&= \frac{p(Y_i, Y_j, S_i, S_j, S^*_i, S^*_j, \Theta_Y)}{p(S_i, S_j, S^*_i, S^*_j, \Theta_Y)} \\
&= \frac{p(Y_i, Y_j \mid S_i, S_j, \Theta_Y)\, p(S^*_i, S^*_j \mid S_i, S_j, \Theta_Y)\, p(S_i, S_j, \Theta_Y)}{p(S_i, S_j, S^*_i, S^*_j, \Theta_Y)} \\
&= \frac{p(Y_i, Y_j \mid S_i, S_j, \Theta_Y)\, p(S_i, S_j, S^*_i, S^*_j, \Theta_Y)}{p(S_i, S_j, S^*_i, S^*_j, \Theta_Y)} \\
&= p(Y_i, Y_j \mid S_i, S_j, \Theta_Y)
\end{aligned} \tag{14}
\]
We use this conditional independence in the following proofs.
Case 1. X_{ij} = 1: We can replace X_{ij} by the set {S_i, S_j}, where S_i and S_j denote the states of g_i and g_j. We define the neighbors of S_i by S^*_i = {S_k | X_{ik}, X_{ki} ∈ X^*_i}. Thus, we can replace X_{ij} ∪ X^*_{ij} by {S_i, S_j, S^*_i, S^*_j}. So, for X_{ij} = 1, we write
\[
\begin{aligned}
p(Y_i, Y_j \mid X_{ij} = 1, X^*_{ij}, \Theta_Y)
&= p(Y_i, Y_j \mid S_i = DE, S_j = DE, S^*_i, S^*_j, \Theta_Y) \\
&= p(Y_i, Y_j \mid S_i = DE, S_j = DE, \Theta_Y) \\
&= p(Y_i \mid S_i = DE, \Theta_Y)\, p(Y_j \mid S_j = DE, \Theta_Y) \\
&= L_{DE}(g_i)\, L_{DE}(g_j)
\end{aligned} \tag{15}
\]
Case 2. X_{ij} = 0: As earlier, we replace {X_{ij}, X^*_{ij}} by {S_i, S_j, S^*_i, S^*_j}.
\[
\begin{aligned}
p(Y_i, Y_j \mid X_{ij} = 0, X^*_{ij}, \Theta_Y)
&= p(Y_i, Y_j \mid S_i = EE, S_j = EE, S^*_i, S^*_j, \Theta_Y) \\
&= p(Y_i, Y_j \mid S_i = EE, S_j = EE, \Theta_Y) \\
&= p(Y_i \mid S_i = EE, \Theta_Y)\, p(Y_j \mid S_j = EE, \Theta_Y) \\
&= L_{EE}(g_i)\, L_{EE}(g_j)
\end{aligned}
\]
Case 3. X_{ij} = 2: In a similar fashion we can write
\[
\begin{aligned}
p(Y_i, Y_j \mid X_{ij} = 2, X^*_{ij}, \Theta_Y)
&= p(Y_i, Y_j \mid S_i = DE, S_j = EE, S^*_i, S^*_j, \Theta_Y) \\
&= p(Y_i, Y_j \mid S_i = DE, S_j = EE, \Theta_Y) \\
&= p(Y_i \mid S_i = DE, \Theta_Y)\, p(Y_j \mid S_j = EE, \Theta_Y) \\
&= L_{DE}(g_i)\, L_{EE}(g_j)
\end{aligned}
\]
Case 4. X_{ij} = 3: Similarly,
\[
\begin{aligned}
p(Y_i, Y_j \mid X_{ij} = 3, X^*_{ij}, \Theta_Y)
&= p(Y_i, Y_j \mid S_i = EE, S_j = DE, S^*_i, S^*_j, \Theta_Y) \\
&= p(Y_i, Y_j \mid S_i = EE, S_j = DE, \Theta_Y) \\
&= p(Y_i \mid S_i = EE, \Theta_Y)\, p(Y_j \mid S_j = DE, \Theta_Y) \\
&= L_{EE}(g_i)\, L_{DE}(g_j)
\end{aligned}
\]