Group Title: BMC Bioinformatics
Title: An improved procedure for gene selection from microarray experiments using false discovery rate criterion
CITATION PDF VIEWER THUMBNAILS PAGE IMAGE ZOOMABLE
Full Citation
STANDARD VIEW MARC VIEW
Permanent Link: http://ufdc.ufl.edu/UF00100020/00001
 Material Information
Title: An improved procedure for gene selection from microarray experiments using false discovery rate criterion
Physical Description: Book
Language: English
Creator: Yang, James
Yang, Mark
Publisher: BMC Bioinformatics
Publication Date: 2006
 Notes
Abstract: BACKGROUND:A large number of genes usually show differential expressions in a microarray experiment with two types of tissues, and the p-values of a proper statistical test are often used to quantify the significance of these differences. The genes with small p-values are then picked as the genes responsible for the differences in the tissue RNA expressions. One key question is what should be the threshold to consider the p-values small. There is always a trade off between this threshold and the rate of false claims. Recent statistical literature shows that the false discovery rate (FDR) criterion is a powerful and reasonable criterion to pick those genes with differential expression. Moreover, the power of detection can be increased by knowing the number of non-differential expression genes. While this number is unknown in practice, there are methods to estimate it from data. The purpose of this paper is to present a new method of estimating this number and use it for the FDR procedure construction.RESULTS:A combination of test functions is used to estimate the number of differentially expressed genes. Simulation study shows that the proposed method has a higher power to detect these genes than other existing methods, while still keeping the FDR under control. The improvement can be substantial if the proportion of true differentially expressed genes is large. This procedure has also been tested with good results using a real dataset.CONCLUSION:For a given expected FDR, the method proposed in this paper has better power to pick genes that show differentiation in their expression than two other well known methods.
General Note: Periodical Abbreviation:BMC Bioinformatics
General Note: Start page 15
General Note: M3: 10.1186/1471-2105-7-15
 Record Information
Bibliographic ID: UF00100020
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: Open Access: http://www.biomedcentral.com/info/about/openaccess/
Resource Identifier: issn - 1471-2105
http://www.biomedcentral.com/1471-2105/7/15

Downloads

This item has the following downloads:

PDF ( PDF )


Full Text



BMC Bioinformatics


B
BioM.- Central


Research article

An improved procedure for gene selection from microarray
experiments using false discovery rate criterion
James J Yang*1 and Mark CK Yang*2


Address: 1Biostatistics and Research Epidemiology, Henry Ford Health Sciences Center, Detroit, Michigan, USA and 2Department of Statistics,
University of Florida, Gainesville, Florida, USA
Email: James J Yang* jyang2@hfhs.org; Mark CK Yang* yang@stat.ufl.edu
* Corresponding authors


Published: II January 2006
8MC Bioinformatics 2006, 7:15 doi:10.1186/1471-2105-7-15


Received: II July 2005
Accepted: I I January 2006


This article is available from: http://www.biomedcentral.com/1471-2105/7/15
2006 Yang and Yang; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Abstract
Background: A large number of genes usually show differential expressions in a microarray
experiment with two types of tissues, and the p-values of a proper statistical test are often used to
quantify the significance of these differences. The genes with small p-values are then picked as the
genes responsible for the differences in the tissue RNA expressions. One key question is what
should be the threshold to consider the p-values small. There is always a trade off between this
threshold and the rate of false claims. Recent statistical literature shows that the false discovery
rate (FDR) criterion is a powerful and reasonable criterion to pick those genes with differential
expression. Moreover, the power of detection can be increased by knowing the number of non-
differential expression genes. While this number is unknown in practice, there are methods to
estimate it from data. The purpose of this paper is to present a new method of estimating this
number and use it for the FDR procedure construction.
Results: A combination of test functions is used to estimate the number of differentially expressed
genes. Simulation study shows that the proposed method has a higher power to detect these genes
than other existing methods, while still keeping the FDR under control. The improvement can be
substantial if the proportion of true differentially expressed genes is large. This procedure has also
been tested with good results using a real dataset.
Conclusion: For a given expected FDR, the method proposed in this paper has better power to
pick genes that show differentiation in their expression than two other well known methods.


Background
The development of microarray technologies has created
unparalleled opportunities to study the mechanism of
disease, monitor disease expression and evaluate effective
therapies. Because tens of thousands of genes are investi-
gated simultaneously with a technology that has not yet
been perfected, assessing uncertainty in the decision proc-
ess relies on statistical modelling and theory. One key
function of any statistical procedure is to control the rate


of erroneous decisions or, in the microarray case, rate of
false discovery of responsible genes.

The above concern can be illustrated as a multiple com-
parisons problem. Suppose we are interested in testing g
parameters (p/,.... pa) = p. For each individual parameter
pj, the null hypothesis is Hoj: /j = 0 and the alternative
hypothesis is H1j: p/j 0. This p/ can be thought as the dif-
ference in mean expressions of the jth gene under two dif-


Page 1 of 14
(page number not for citation purposes)








http://www.biomedcentral.com/1471-2105/7/15


d = 0.2


d = 0.4


Mixed d


16 32 64 128 1


0.03 -

0.02 -

0.01 -

0.00 -

0.05 -

0.04 -

0.03 -

0.02 -

0.01 -

0.00 -

0.05 -

0.04 -

0.03 -

0.02 -

0.01 -

0.00 -


6


I I I I


32 64 128 16 32 64 128


Figure I
A simulation study to estimate FDR (in y-axis) for various numbers of hypotheses Tested (x-axis), proportion of true null
hypotheses (left labels), and various configurations of true alternative hypotheses (top labels). All the examined methods, BH
procedure (- -); aBH procedure (......); Fisher's combining function (-- --); Liptak's combining function (-- --); proportion of
null known (----), satisfied the required FDR < 0.05.


ferent conditions in a microarray experiment. A
conventional method to test each hypothesis is to take a
sample and then calculate its p-value from a proper statis-
tical test. If the calculated p-value is less than a threshold
determined by a testing significance level, then H0g is
rejected. However, when many hypotheses are simultane-


ously performed, a multiple comparisons procedure
(MCP) has to be used to control the error rate [1].

The traditional MCP controls the probability of making
any error in multiple selections, i.e., controls the family-
wise error rate (FWER). It has been shown, however, that


Page 2 of 14
(page number not for citation purposes)


0.05 -


0.04 -


I I I I


. .-


.........................
- - - - - - -


BMC Bioinformatics 2006, 7:15


..... -. - .........--....


. ,. .. . . . . . . . . ..







http://www.biomedcentral.com/1471-2105/7/15


this type of procedure is extremely conservative when the
number of hypotheses is large. Alternatively, Benjamini
and Hochberg [2] proposed the measure of false discovery
rate (FDR) for which the expected proportion of false dis-
coveries is controlled. This procedure is based on the idea
that one can tolerate more false discoveries if the number
of tests is large. For example, 5 false discoveries out of 10
selections is probably too many while 5 false discoveries
out of 100 selections should be acceptable. This is partic-
ularly useful in microarray analysis since a very large
number of genes usually show differential expressions.
Therefore, controlling the FDR can greatly increase the
power of discovery.

Benjamini and Hochberg [3] proved that Simes' proce-
dure [4] can be used to control expected FDR at a (0 1) when tests of true null hypotheses are either independ-
ent or positively dependent. More specifically, let P(i) <
P(2) < ...< < p(g) be the ordered p-values and I + 1 be the
smallest integer satisfying


P(+I) > ( + 1) X (1)

If/ > 1, then rejecting H(0oj), j = 1,..., I ensures the expected
FDR at a. We call this method the BH procedure.

Suppose the number of true nulls is go, or the proportion
is zr = go/g. Benjamini and Hochberg [3] proved that the
expected FDR of BH procedure is less than or equal to

So a. When go is small compared to g, using the original

a in (1) loses of power because it can be replaced by the
larger value a/iz0 and still control the FDP at a. Here, the
power of a selection procedure is defined as the propor-
tion of the alternative hypotheses that are correctly identi-
fied. Obviously, if we knew zo in advance, we could
replace a in (1) by a/zo to increase power. Since zo0 is
unknown, the key question here is how to estimate the
number of true null hypotheses go.

In an earlier published paper, Benjamini and Hochberg
[5] proposed an adaptive procedure (aBH) which, by simu-
lation, showed a higher power than BH procedure. The
idea of the aBH procedure is to estimate go based on the
fact that the p-value under the alternative hypothesis is
stochastically smaller than that under the null, which is
uniformly distributed on (0, 1). On a quantile plot of p-
values (pQ) versus j), the slope passing through (j, pQ)) and
(g + 1,1) increases just asj increases if the P) s are corre-
sponding to the subset of true alternative hypotheses. The
first time the slope decreases indicates a change point.


Using this stopping rule in conjunction with the Lowest
SLope (LSL) estimator, (1 po))/(g + 1 j), Benjamini and
Hochberg proposed their aBH method to estimate go as
follows:

Alg. I aBH method in go estimation
1. If there is no hypothesis rejected by the BH procedure,
then stop and declare no discovery.

2. Calculate m) = (1 po))/(g + 1 j), Vj = 1,..., g.

3. Let J* be the largest value such that m(j,) Define b as the smallest integer larger than l/mn(j).

4. The number of true null hypotheses is estimated as go
= min(b, g).

Recently, Storey and Tibshirani [6] observed and demon-
strated in their Figure 1 that the density histogram of p-
values based on pl,..., pg from a microarray experiment
looked flat in the region of (0.5, 1). Based on the fact that
the null p-values of genes are uniformly distributed, most
of genes in the region of (0.5, 1) should be from the null.
Therefore, they used a smoothing method to estimate go
based on the flat region of the observed p-values. How-
ever, the implicit requirements of their method are: 1) g
should be large, and 2) the proportion of true null
hypotheses should also be large, such as 0.5, so that0 can
be estimated accurately. The method developed in this
paper tries to bypass these two requirements so that a
broad range of multiple testing problems can be applied.
It is conceivable that when the number of relevant genes
are known for certain target disease, a chip with a small
number of genes will be widely used for diagnosis. The
FDR methods, including both the BH and the aBH ver-
sions, have now been widely accepted for microarray anal-
yses. They have been implemented in the R program [7],
which can be incorporated into the Bioconductor package
[8]. Several microarray analysis computer programs, such
as SAM (Significance Analysis of Microarrays) [9] and
GeneSpring [10] also use the FDR criterion to identify dif-
ferentially expressed genes. Yang et al. [11] have used the
FDR criterion to determine the sample size in microarray
experiments.

Methods
Without loss of generality, we describe the method in a
one-sample testing problem. Its extension to commonly
used paired-t test, two-sample t test, or nonparametric
rank tests in microarray analysis is very straightforward.
Let Yi = (Y,i,..., Yig)', i = 1I..., n be g-variate vector samples
from populations with means p = (p/,..., pg)'. Define YO) =
(Y1j,..., y,1), ,j = 1,..., g as a row vector for the jth compo-
nent. The null hypothesis of the jth component is Ho0 : /j
= 0 and the alternative hypothesis is H1j: p/j # 0. Let the p-


Page 3 of 14
(page number not for citation purposes)


BMC Bioinformatics 2006, 7:15







http://www.biomedcentral.com/1471-2105/7/15


value for rejecting the null H0j be p, derived from a prop-
erly chosen test statistic.

Once the p-values are derived, we first summarize the
information content of p-values using a differentiable
real-valued decreasing function h, i.e., h(pi,..., pg) is
decreasing to all its arguments. We call h the combining
function and its value, 7 = h(pi,..., pg), the global statistic.
Next, define 77(0) to be the observed global statistic of 77
derived directly from the observed data. Note that a small
value of pj indicates Hoj is less likely to be true. Since h is a
decreasing function of its argument pj, a large value of /(0)
indicates a subset of the null hypotheses is likely to be true
alternatives. To determine whether 7/(0) is large enough to
make such a claim, we need to know the distribution of 77
when all the nulls are true. The distribution of 77 depends
on the distribution of its arguments, the correlation
between its arguments, and the combining function h. In
a microarray experiment, there seems no way to model or
estimate so many correlations (see, for example, Chapter
6 of Pesarin [12]). Hence, a reasonable approach which
can tackle the correlation within each experimental unit is
to determine the critical value by a permutation test. More
specifically, we calculate all B = 2" values of 7 based on
(coY1 ,..., co,Y3n), where ), = 1 (The multivariate two-sam-
ple permutation method can be found in Pesarin [12]).
Hence, we have a set of reference values {q0) ,..., (g 0)}
from the B permutations, and can now define the p-value
of the global 7/()-statistic as


(0) b=1 ) (0)
B


where I(x) is the indicator function that takes value 1 if the
statement x is true and 0 if it is not. To determine whether
the global null hypothesis of all p/),j = 1,..., g, is zero is to
evaluate if the global p-value, p(0), is less than or equal to
a significance level. However, this level is part of the esti-
mation process. It will be determined by the data (see later
Alg. 2).

When the global null hypothesis is rejected, it indicates
that not all null hypotheses Ho0s are true. Immediate ques-
tion is, which subset of hypotheses is the true alternative.
To determine whether to reject the hypothesis H(oi) is a
multiple testing problem. We can, however, determine the
size of true alternatives using the following iterated proc-
ess. We regard the gene with the smallest p-value as the
major contributor for the rejection and claim that the null


hypothesis is not true with this gene. When this gene is
removed from the data set, we continue the same process
with the rest g 1 genes and the p(O) computed by (2) is
now denoted by p(1). The whole process is then repeated
to produce a sequence of pseudo-global p-values, p(s), s =
1,..., g 1. We call the later step global p-value pseudo
because it is only based on the subset of the original data.
The detail of how to generate pseudo-global p-values is
given in the Algorithm section. First, we will use these
pseudo-global p-values to estimate the number of true
null genes go.

We observe that, intuitively, p(o) < p(1) < ... < p(9-i), and this
monotone increasing property will be proved in (J.1) of
the Justification section. In addition we will prove in
(J.2) of the Justification section that if r pseudo-global p-
values are less than a given value /(0 tor of the number of null genes is


g0o =g r+ p /(1- )2


and this estimator ensures that E [ go I > go. (The conserv-
ativeness of this estimate under other conditions using
other techniques has also been mentioned by Storey and
Tibshirani [13] and Efron et al. [ 14].) Since the inequality
E [ g0 ] > go holds for any value of t, the best choice of/ is
the one which makes go closest to go. By defining A = 8/
(1 l)2 and p(/) = r A, we observe that p(/j) is an increas-
ing function and then a decreasing function of 8/ for the
following reasons. When 8/ is small, A increases slowly.
However, r is an increasing function in /f and r is always
greater than 1. Therefore, p(/j) is an increasing function
for small value of /. On the other hand, when / -> 1, A
reaches oo. Since r is finite, p(/) becomes a decreasing
function in this range. Since (3) is equivalent to go = g -
p(8), the optimal value of / should be determined as

/ = arg max p(3)
/3

and the method of estimating go is equivalent to finding
the optimal /value.

The estimation of go can thus be summarized as

Alg. 2 global-p method in go estimation
1. List rp as the integer such that p(-1) '
p ) > /for a large number ofs for 0



Page 4 of 14
(page number not for citation purposes)


BMC Bioinformatics 2006, 7:15







http://www.biomedcentral.com/1471-2105/7/15


-10000 -5000 0 5000 10000 15000
























0 0 0 0 0 0 O


Figure 5
The plot of function p(/) in a real data analysis to determine
the arg max p(3) for the tumor data.



2. Find such that p(0J) = r/- 6/(1 f)2 is maximized (see
Figure 5). Let this /fbe /*.

3. Let r,, be the integer such that p( -1) < V s = 1,..., r,

and p ) > ,*


4. Let g = min[g r, + /( *)2, g].

We called this global-p method and denoted it by .

Alg. 3 Differentially expressed gene selection with given
FDR
With go estimated, to control the FDR at a level for identi-
fying differentially expressed genes in microarray data
analysis with the conservative estimate go0, we simply test
each sub-hypothesis based on (1) with the new a- a/ go0,

i.e., rejects all H), j = 1,..., /, if p) < j x a and P(+) > (I
go


+ 1) x We also need to address the choices for the

combining function h. In this paper, we consider only two
commonly used ones. 1) Fisher's sum of logarithm

g
n= h(p ...,pg)= -21log(p,)
j=1

and 2) Liptak's sum of inverse standard Gaussian distribu-
tion functions


n= h(pl,...,pg f 1(I- ps).
j=1

More discussion on the choices of combining functions
can be found in Birnbaum [15] and Folks [16]. The com-
puter program for the proposed method, global-p, has
been implemented in R script and is publicly available
[171.

Results
Simulation
Two simulation studies were conducted. First, when the
number of hypotheses are small and second, when the
total number of hypotheses is extremely large. For small
number of hypotheses, we used the following configura-
tions: numbers of hypotheses are 16, 32, 64, and 128 with
sample size n = 10 and the proportions of go/g being 0,
0.25, 0.5 and 0.75. The true expressions under the alterna-
tive hypotheses are assumed to be variance 1 Gaussian
random variables with non-zero mean values p/ = dj. In
one set of experiments, all djs are set as 0.2 or 0.4; in the
other set the djs are divided into 4 equal size blocks with
values 0.2, 0.4, 0.6 or 0.8 in each block. The number of
permutations is B = 1,000 and the number of simulations
is 20,000. The FDR is set at a = 0.05. Four procedures are
used for testing: BH, aBH, our global-p test one with
Fisher's combining function, and one with Liptak's com-
bining function. For the purpose of comparison, we also
plug in the true value of go into the aBH method. This is
the ideal situation to reach the highest power.


Table I: False discovery rate when all null hypotheses are true.


BH aBH


0.047
0.049
0.049
0.048


0.047
0.049
0.049
0.048


Fisher

0.048
0.049
0.049
0.049


Liptak

0.048
0.050
0.049
0.049


Page 5 of 14
(page number not for citation purposes)


BMC Bioinformatics 2006, 7:15








BMC Bioinformatics 2006, 7:15




Table 1 lists the simulated expected FDR when all null
hypotheses are true. Figures 1 and 2 show the estimated
FDR and power under various gene expression composi-
tions.

Based on this simulation results, all the four methods
have their expected FDR below the nominal FDR a = 0.05.
Both combining functions in our proposed method have
a higher power than the aBH procedure in most situa-
tions, while there is little difference between them. The





d = 0.2


1.0 -
= 0.8 -
--
Z 0.6 -
S0.4 -
0
0.2 -
O.-
0.0 -
- 1.0 -
0.8 -
0.6 -
LO 0.4 -
CN
0.2 -
0.0 -
- 1.0 -
S0.8 -
0.6 -
0-
O 0.4 -
LO
0.2 -
O.-
0.0 -
- 1.0 -
Z 0.8 -
z-
S0.6 -
LO 0.4 -
t-~
0.2 -
0.0 -


16 32 64 128


http://www.biomedcentral.com/1471-2105/7/15




improvement over the aBH methods is substantial when
the proportion of the true null hypotheses is small.

Since most microarray data consist of tens of thousands of
genes simultaneously and many of their expressions are
correlated, our second simulation tried to reflect these
facts. Storey and Tibshirani [13] proposed a clumpyy"
dependence model in which genes are partitioned into
blocks. The gene expressions are assumed independent
between blocks but dependent within the same block.





d = 0.4 Mixed d


------ - -




._


16


32 64 128 16 32 64 128


Figure 2
The same simulation study with estimated average powers, using the same legends as Figure I. The powers of the new method
are clearly higher than the existing two, especially when the null proportion is small.




Page 6 of 14
(page number not for citation purposes)


, , ,


IIII







http://www.biomedcentral.com/1471-2105/7/15


Following the clumpy model, 10 test and 10 control sam-
ples were generated from normal random variables with
mean 0 and standard deviation 1 where each sample con-
tained 10,000 genes. For test samples, genes with expres-
sion difference were added by 3 to represent the
expression differences. The number of differentially
expressed genes go were 100, 2,000, 5,000, 8,000 which
corresponded to the proportion of nulls 0o at 0.99, 0.8,
0.5, 0.2. To simulate intra-block dependency, we gener-
ated one vector of normal random variables with mean 0
and standard deviation 0.2 in each block of size 50 and
add it to every gene in that block. This process creates cor-
relations between genes. Since the true expression differ-
ences for non-null genes are moderate large in this
simulation, we expect that any good method should accu-
rately estimate o0.

In a recent article, Broberg [18] did a comparative review
of various newly developed methods for estimating r0. To
make a comparison, we presented the means and stand-
ard errors of estimates of 0o for all the methods describe in
Broberg's paper in addition to our proposed method (p )
over 1000 simulations. The details of r0estimation meth-
ods (BUM, SPLOSH, QVALUE, Boostrap LSE, SEP, LSL,
mgf, PRE) can be found in Broberg's paper and the pro-
grams for these methods can be found in the add-on pack-
ages of the freely available R software [19]. The simulation
results are shown in Table 2.

As expected, all of them performed very well for large o0.
If both the mean (bias) and standard error (stability) over
the whole range of 0o are considered, LSL and Global-p
stand out and LSL is better in stability. However, as we will
see from the next study, LSL seems to be over-conservative
in real data analysis. The means of both combining func-
tions of our proposed method performed well but Liptak's
function has a higher standard errors. Therefore, Fisher's
function will be used for the real data analysis. During the
simulation, we also noticed that current implementation
of SPLOSH, SEP, mgf, and PRE methods in R programs
failed to work when the number of hypotheses was small.

Real data analysis
We use a publically available experimental data set from
the Stanford microarray database [20] first to compare the
differences between BH, aBH and our method with
Fisher's combining function. The purpose of this experi-
ment was to identify genes that have different expressions
between prostate tumor tissue and matched normal tis-
sue. It consists of a total of 82 microarrays: 41 arrays were
from primary prostate tumors and the other half were
from matched normal prostate tissues. Each array con-
tains 32,152 different human genes. We chose this data


set because of the large number of replicates. With this
amount of replicates, we had a better idea of the ground
truth.

To compare performances of various methods, we split
the whole data set into two groups: a test data set and a
confirmation data set. Eleven pairs of tumor and normal
arrays were chosen for the test data set and the remaining
arrays were used for the confirmation. The number of
eleven test pairs were taken from a systematic sampling to
avoid bias. Patients 1, 5, 9,..., 41 in the original order from
the database were chosen as the test set. Expression infor-
mation is missing if it is labelled as failed, contaminated,
or flagged. Genes were removed from our study if more
than half of their expressions were missing in the original
data set. A total of 24,865 genes was used to identify dif-
ferences in expressions using BH, aBH and our proposed
procedures at 0.01 expected FDR level. Since the experi-
ment was designed with paired tumor-normal arrays, the
paired t-test was used to derive the p-value of each gene in
the test data set. The BH procedure identified 1,254 genes,
the aBH procedure identified 1,523 genes, and our pro-
posed method identified 2,119 genes. Our test is appar-
ently more powerful if it can maintain the required FDR
0.01. Since we did not have the biological information,
another approach was to estimate FDR based on our rejec-
tion set. Specifically, suppose the rejection set contains the
p-values that are less than and the estimated proportion
of null hypotheses is 0o An intuitive estimate for FDR is
[21,221


FDR() = (4)
Pr[P<^ ]

where Pr[P < 1] (= I(i P < )/g. Based on the

confirm data set, our proposed method reported an esti-
mator of r0, z0 = 0.5326. Using equation (4), the esti-
mated FDRs for BH, aBH and our method were 0.0054,
0.0067, 0.0099, respectively.

If we further reject more sub-hypotheses beyond the
number provided by our method, the FDR will exceed the
assigned level. For example, if we reject the next 500 null
sub-hypotheses that have the smallest p-values among
those not previously rejected, the estimated FDR is 0.0137
which is larger than the pre-assigned 0.01 level.

Moreover, we calculated the standard difference (paired t-
statistic) of expressions using the confirmation data set.
The confirmation data set contains 30 pairs of arrays so


Page 7 of 14
(page number not for citation purposes)


BMC Bioinformatics 2006, 7:15








http://www.biomedcentral.com/1471-2105/7/15


Table 2: The means (first row) and standard errors (second row) of estimates 7kO
0.2,0.5,0.8,0.99.


True TO


BUM

SPLOSH

LSL

Bootstrap LSE

QVALUE

SEP

mgf


0.200

0.1124
0.0021
0.8770
0.0737
0.2011
0.0005
0.1938
0.0079
0.1975
0.0107
0.9982
0.0004




0.2005
0.0014
0.1998
0.0017


Global-p
(Fisher)
Global-p
(Liptak)


0.5000

0.3962
0.0024
0.9030
0.0592
0.5015
0.0007
0.4905
0.0135
0.4979
0.0176
0.4978
0.0078
0.4944
0.0044
0.1712
0.0084
0.5002
0.0012
0.4992
0.0038


based on various methods when r0 =


0.8000

0.7288
0.0021
0.9343
0.0482
0.8016
0.0006
0.7863
0.0184
0.7980
0.0242
0.7982
0.0089
0.7973
0.0055
0.6568
0.0107
0.8002
0.0018
0.7978
0.0037


0.9900

1.0000
0.0000
0.9594
0.0271
0.9904
0.0003
0.9787
0.0159
0.9871
0.0154
0.9883
0.0090
0.9905
0.0062
0.9815
0.0066
0.9899
0.0019
0.9884
0.0027


that the statistic is a good estimate for the unknown stand-
ard difference. Figure 3 shows histograms of absolute
standard differences based on genes identified by the BH
procedure, by the aBH procedure, by our method, and the
extra 500 genes beyond our method. From the histo-
grams, several hundred differentially expressed genes that
were not identified by BH or aBH in the test arrays were
identified by our method and most of them have standard
expression differences greater than 2. However, the next
group of 500 genes beyond our method may contain too
many false discoveries to make the FDR acceptable.

Next, we use this data set to compare different estimates
of proportion of null genes, zr. The whole data set was
partitioned into four sub-data sets. We labelled the arrays
from 1 to 41 based on the original order in their database.
Sub-data set 1 contained 11 arrays with labels 1, 5, 9,...,
41; sub-data set 2 contains 10 arrays with labels 2,6,..., 38;
sub-data set 3 contains 10 arrays with labels 3,7,..., 39;
sub-data set 4 contained the remaining arrays. Although
we did not know the true value of %o in real data analysis,
we used the four sub-data sets to compare the variation of
different estimation methods. In addition, we used the
whole data set to check if the estimates are reliable. For
global-p method, we also plot the function of p(/) in Fig-
ure 5 using the whole data set. The maximum of p(/)
occurs at = 0.91.


The estimates 0o were listed in Table 3. All z0 estimates
in the subsets were larger than the ones in the whole data
set. This is expected because genes with small expressional


differences are more likely to be classified as null genes in
a small sample. From the %o estimates from the whole data
set, we may say the LSL gave a large %o value, PRE gave a
small value and all the others gave similar estimates. We
used LSL, PRE and Global-p to draw the p-value density
histogram using the whole data set in Figure 4. The upper
panel is the overall view and the lower panel is the closer
view. The green dash-dot line is the height using LSL esti-
mate; the red dash line PRE estimate; the blue dot line glo-
bal-p estimate. The LSL method, which is also shown in
Broberg's study, is too conservative producing the largest
value. The PRE method underestimated zr. If we used 0o
by PRE method to improve FDR procedure, it is likely to
have a higher FDR than the nominal level. We think our
%o estimate is reliable because it is close but higher than
the height of observed p-values in the region of (0.6, 1) if
we assume genes with expressional differences have rare
chance to have large observed p-values.

Conclusion
Benjamin and Hochberg's FDR procedure [21 opens an
useful approach for multiple comparisons in microarray
analysis. Due to the complexity in microarray experi-
ments, it seems unlikely that one method can dominate
all the other methods in all the cases. In this paper, we
have demonstrated a new method that is intuitive appear-
ing because:




Page 8 of 14
(page number not for citation purposes)


BMC Bioinformatics 2006, 7:15








http://www.biomedcentral.com/1471-2105/7/15


3 4 5 6 7 8 9 10 11 12

Adaptive BH


3 4 5 6 7 8 9 10 11 12

Proposed method


0 1 2 3 4 5 6 7 8 9 10 11 12

500 genes above proposed method







--1 -__fl


0 1


2 3 4 5 6 7 8 9 10 11 12


Absolute standard difference estimate





Figure 3
The histograms of the absolute standard difference estimates (x-axis) in the confirmation data set for the selected genes from
the three procedures based on the test data set for the prostate tumor data. The top is the selected gene histogram by the BH
procedure; the second is that by the aBH procedure; and the third is the one by our method. The similarity of the three histo-
grams means that the additional 596 genes picked by our method should maintain the same FDR (0.01) as the other two. The
bottom panel contains the next 500 genes beyond our proposed method. The left shift of this histogram means that the addi-
tional genes may contain too many false discoveries.




Page 9 of 14
(page number not for citation purposes)


BMC Bioinformatics 2006, 7:15







http://www.biomedcentral.com/1471-2105/7/15


-.- --
-. -- I I I I- 1 .

0.0 0.2 0.4 0.6 0.8 1.0


-II I I


0.0 0.2 0.4 0.6 0.8 1.0

Figure 4
The histogram of p-values in a real data analysis. About half of
the large p-values to the right look uniformly distributed.


1. It uses permutation test which can take care the com-
plex correlation structures in gene expressions.

2. Its global test based on sequentially eliminated signifi-
cant genes should provide a good stopping rule because
all the remaining genes are always considered together.

With the support of simulation and real data studies, the
new method should be a viable alternative to find the dif-
ferentially expressed genes in microarray experiments.

Algorithm
We illustrate the algorithm for calculating pseudo-global
p-values based on marginal p-values, pl,..., Pg Please
note that the procedure described here is the same regard-
less of go. We consider h(pl, ..., pgo ) in the summation
form:


q = h(p1,...,pgo)


go
Sh(Pi),
i=1


where is a differentiable decreasing function in which
dh(t)
h'(p) = dt exists and (p) -> oo as p -> 0. Note that
dt t = P
equation (5) totally meets the requirement that h should
be a differentiable real-valued decreasing function. The
realization of can be, for instance, (p) = -2log(p) if we use


Fisher's sum of logarithm or (p) = (D-1 (1 p) if we use
Liptak's sum of inverse standard Gaussian distribution
function. Recall that pj is the p-value obtained fromjth row
vector (Ylj,..., Yn,) and the ordered marginal p-values are
PP() denoted by subscripts while the pseudo-global p-values in
(2) in are denoted by superscripts. Parentheses are used to
reveal the order property.

In one-sample testing problem, we permute data B times,
where each time we assign wi = 1 with equal probability
to (wY, ..., it Y ) For each permuted data, we use PbN) to
be the p-value from the bth permuted data for the gene
that produced p), j = 1,..., go, b = 1,..., B.

We summarize P(i) and Pb(i) using the following table.


P(1) PI(1) ... Pb(1) ... PB(1)


P(2) Pl1(2)


... Pb(2)


PB(2)


P(go) P1(go) Pb(go) "' PB(go)
Note that the first column is p-values of genes from the
original data while the remaining columns are from the
permuted data.

To simplify the notation, let (()) = (), b(i) = b(i) and 7 is
just the sum of (.). Therefore, we can transform the table
(6) into the following table with the addition of the last
row where its value 7 is the sum of the (.) in its corre-
sponding column.

h(1) h(1) ... hb(1) ... hB(1)


h(2) h1(2)


.. hb(2)


.. hB(2)


h(go) hl(go) ... hb(go) ... hB(go)
4 4 ... 4 ... 4

(0 ... 7 ) ... o 0)
Once o(0) and 0(),..., 7(B) are available, the global p-value,
by definition, is


(o) -1 b )1 b (o)
B
The same process is directly used to calculate the pseudo-
global p-values after removing the genes which have the
smallest p-values. Actually, we can use the available data
illustrated in table (8). By induction, suppose P(I),..., P(s)


Page 10 of 14
(page number not for citation purposes)


BMC Bioinformatics 2006, 7:15








http://www.biomedcentral.com/1471-2105/7/15


Table 3: Estimates of 7Ct based on the prostate tumor data set from the Stanford microarray database. The first four rows are
estimated based on the four partitions of the whole data set while the last row the is based on the data set. The means and standard
errors (5th and 6th rows) are calculated based on the estimates of the four sub-data sets.


BUM SPLOSH


0.4011
0.5211
0.5529
0.6056
0.5202
0.0867
0.3940


0.4844
0.5710
0.6104
0.6627
0.5821
0.0752
0.4706


LSL Bootstrap
LSE


0.8071
0.9106
0.9302
0.9628
0.9027
0.0672
0.6598


0.4802
0.5797
0.6235
0.6709
0.5886
0.0813
0.3949


QVALUE


0.4662
0.5679
0.6070
0.6618
0.5758
0.0826
0.3882


SEP mgf PRE Global-p


0.5792
0.5865
0.6232
0.6760
0.6162
0.0443
0.5979


0.5677
0.6866
0.7120
0.7529
0.6798
0.0796
0.4706


0.4536
0.5727
0.6097
0.6646
0.5751
0.0894
0.2803


0.5583
0.6722
0.7082
0.7493
0.6702
0.0821
0.4721


have been removed, the values of the raw data and their
reference sets are


h(s+1) hl(s+l) ... hb(s+1) ... hB(s+l)

h(s+2) hl(s+2) ... hb(s+2) ..*** hB(s+2)


(8)
h(go) hl(g0) ... b(o) ... hB(go)


(S) () ( (s)
1 1 ... 1,t. ... B

and the pseudo-global p-value after removing the s genes
is


B I(7(s) >(s))
p(s) b=1 b
B
Justification
Our justification is based on extreme cases to make the
proofs tractable. However, this requirement is not very
critical. Our simulation and real data analysis show that
even with moderate go value, the pseudo-global p-value,
p(l), starts to increase when our procedure has removed
most of the significant genes. Besides, it jumps up very fast
to reach the point that p(s) will be larger than threshold /8
and stop removing genes. Let g be the total number of sub-
hypotheses considered. Given g, let d = (d1,..., dg) denote
the vector containing the true values of pj, j = 1,..., g for
each sub-hypothesis; and R(g, d) be the number of YO)
removed using procedure p Without loss of generality,

we assume d( < ... < d(g-g0) < 0 and do) = 0 forj = g go +

1,..., g. We have


and the equality holds when d(gg0) -> The equality

affirms that the sub-sample YO) with extreme small dj < 0
is identified and removed from the component of 77. Actu-
ally, when the gene expression difference is far away from
zero, it will be identified by any reasonable multiple test-
ing procedure. Hence, in this extreme case, we only focus
on EV[R(go,O)] which is based on the remaining go sam-

ples that have null distributions. Our goal is reduced to
finding an upper bond of EV[R(go,O)], so that the

expected value of go can be bounded (below) by go.

J. I Monotone in Pseudo-global p-values
The key feature of our algorithm is that the pseudo-global
p-values are monotone increasing. We can be more precise
by showing that


Pr p() < p(+) (S) 1 ass > 0.
go


(10)


Romano (Section 2)[23] has proved that the distribution

of 7Sb) can be approximated by Gaussian distribution

using the Central Limit theorem. Therefore, if we let the

mean and standard deviation of reference samples 17s) (b

= 1,..., B) be p/(s) and o(s), respectively, and use Z to denote
the standard Gaussian random variable with distribution
function (D, then the pseudo-global p-value p(s) can be
expressed as


E [R(g,d) < g- go + E [R(go,0)],


Page 11 of 14
(page number not for citation purposes)


Sub-data I
Sub-data 2
Sub-data 3
Sub-data 4
Mean
Sd
Whole


BMC Bioinformatics 2006, 7:15







http://www.biomedcentral.com/1471-2105/7/15


B
Pr Z > (s) () 1()
SPr Z > C7()



where C(s) = -1(1 p(s)) is p(s)th upper percentile of the
standard Gaussian distribution. Hence, the relation
between 7l(s) and C(s) is

q(s) = P(s) + oCs) C (s).

Similarly, for the p/s+1) and c(s+1) as the mean and standard
deviation of reference samples fjs+l) (b = 1,..., B), we
repeat the same approximation method to express p(s+1) as


) > 1, n 1(S+))
B
Pr[Z> (s+l) (s+l)


Pr[ Z > (,(s ) + h(s+l)) (,(s+l) + h(s+l))

Pr[Z )> (s) _- (s+) ) h(s+)) )1
Pr Z > (+l)

Pr Z > (s) + (s)dcs) ((s+) + h(s+)

Pr Z > r(s) C(s) +

J(S) (s+1) h(s+l)
a (s+)


Comparing equations (11) and (13), we observe that to
prove p(s) < p(s+i) with probability 1 asymptotically when
p(s) is given is equivalent to prove that

(s) (S) (s+') -
SC() s+ (s+l) < C(s) (14)
O_(s+l) 4_(s+l)
with probability 1 asymptotically when C(s) is given.

As g0 -> oc, the sample deviations o(s) and gcs+1) are almost

identical so that U C(s) -> C(s). Next, using table (8)

again, we can express p/) /(S+i) as


B h(Pb+))

B
The distribution of the reference set P(s+i) ... PB(+I) con-
verges to the uniform (0, 1) distribution since the distri-
bution of the permuted test statistics, which are
corresponding to these p-values, converges to a Normal
distribution using Theorem 2.1 of Romano's work. There-
fore, if we define uniform (0, 1) random variable as U,

-~b=}1h +) B converges in probability to E[(U)] by the
B
Law of Large Number. Hence, to prove equation (14) we
only need to prove that

Pr[(p(,+l)) ->E[(U)]] -> 1. (15)


Assume, without loss of generality, that pl,..., Pgo corre-
(12) spond to the true null. To prove equation (15), we first
observe that, since pl,... Pgo are independent uniform
random variables, P(s+i) is the (s + 1)/g0-th quantile of the
uniform (0, 1) distribution. By defining
Cs+1,go = (s + 1) / go and using the distribution of sample
quantile [24], we have

8o-(p(s+i) ,+1,so) -> N(0,+1,sgo ( + lo0)).
Since the first derivative of exists, we can use the delta
method to have

so (h(p(s+l)-) h(s+l,g) -> N(0,(,2),


where = +1g (1 s+, ) [h'(s+,go )]2 represents
the asymptotical standard error of (p(s, )).

Finally, equation (15) can be proven as,

Pr[h(p(s+l)) < E[h(U)]]

=Pr h(p(s+l)) h(s+,go)
< I
E[h(U) ]-h(4+,g)


E[h(U)]-h(gsz^g)


Page 12 of 14
(page number not for citation purposes)


BMC Bioinformatics 2006, 7:15








http://www.biomedcentral.com/1471-2105/7/15


which converges to 0 because E[(U)] is finite and
h(G+i,go) as ,+,g = (s + 1)/go 0.


J.2 Determining the threshold P3 by (3)
First, we recall that, as p(0) is the global p-value based on
all null data, it is uniformly distributed. That is,

Pr[p(o) < ,61 = P.

We have proved that the pseudo-global p-values, p(s), are
monotone increasing when s/go -> 0. Therefore, there
exists m such that p(0) p() ... < p(m). Furthermore, for all
s & m,

Pr[p(S) <6 fpO) < f, j = 0, ..., s 1]
The purpose here is to derive equation (3) for any /within
(0,1) using the above inequality. We start with the
number of genes removed by our procedure. Observe that

Eo[R(go,O)]
go-1
= kPr[R(go,O) = k]
k=o

= -1 kpr [p(o) < ap) < l ,... p(k-1) < ,
k=i
(k) > ].


In addition,


Pr[p(0) < ,p) < p,..., p(1) < ,p() > /p]

= Pr[p(o)



\Y[Pr I p pS < pl= 0,..., s -


Pr[p(k) > p(i) < ,i= 0,...,k- 1

_< r/min(k-1,m)1

= km

where km = min(k, m + 1). Hence, we can find an upper
bound of the expected number of genes removed,
Ep[R(go,O)], which is


Ep[R(go,O)]


k=l
( (1-_p 1-(m+l)m+2 /(1 /)
(1- _)2
+(go m 2)m"+1
( )as m is large. (16)
(1 )2
Let A = A(/?) = /(1 ,)2. From equations (9) and (16), we
have

E [R(g,d) -Al
so that the number of true null estimate is


0 = g R(g, d) + A

and this estimate ensures that

E[ olg0 go-

References
1. Westfall PH, Young SS: Resampling-based multiple testing New York:
John Wiley & Sons; 1993.
2. Benjamini Y, Hochberg Y: Controlling the false discovery rate: A
Practical and powerful approach to multiple testing. JRSSB
1995, 57:289-300.
3. Benjamini Y, Hochberg Y: The control of the false discovery rate
in multiple testing under dependency. Annals of Statistics
2001:1165-1188.
4. Simes RJ: An improved Bonferroni procedure for multiple
tests of significance. Biometrika 1986, 73:.
5. Benjamini Y, Hochberg Y: On the adaptive control of the false
discovery rate in multiple testing with independent statis-
tics. journal of Education and Behavioral Statistics 2000, 25:60-83.
6. Storey JD, Tibshirani R: Statistical significance for genomewide
studies. PNAS 2003, 100:9440-9445.
7. Reiner A, Yekutieli D, Benjamini Y: Identifying differentially
expressed genes using false discovery rate controlling proce-
dures. Bioinformatics 2003, I 9:368-375.
8. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S,
Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W,
lacus S, Irizarry R, Li FLC, Maechler M, Rossini AJ, Sawitzki G, Smith
C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open
software development for computational biology and bioin-
formatics. 2004 [http://genomebiology.com/2004/5/10/R80].
9. Tusher VG, Tibshirani R, Chu G: Significance analysis of micro-
arrays applied to the ionizing radiation response. PNAS 2001,
98:5116-5121.
10. GeneSpring 7.1. Silicon Genetics [http://www.silicongenet
ics.com]
I I. Yang MCK, Yang JJ, Mclndoe RA, She JX: Microarray experimen-
tal design: power and samples size considerations. Physiol
Genomics 2003, 16:24-28.
12. Pesarin F: Multivariate permutation tests with applications in Biostatistics
West Sussex, England: John Wiley & Sons; 2001.
13. Storey JD, Tibshirani R: Estimating false discovery rates under
dependence, with applications to DNA microarrays. Technical
report, Department of Statistics, Stanford University 2001.
14. Efron B, Storey JD, Tibshirani R: Microarrays, Empirical Bayes
Methods, and False Discovery Rates. Technical Report 217,
Department of Statistics, Stanford University 2001.
15. Birnbaum A: Combining independent tests of significance. JASA
1954, 49:559-574.


Page 13 of 14
(page number not for citation purposes)


BMC Bioinformatics 2006, 7:15








http://www.biomedcentral.com/1471-2105/7/15


16. Folks JL: Combination of independent tests. In Handbook of Sta-
tistics Volume 4. Edited by: Krishnaiah PR, Sen PK. New York: Elsevier
Science Publishers; 1984:113-121.
17. Global-p website [http://www.stat.ufl.edu/-~jyang/global p/glo
bal.p.R
18. Broberg P: A comparative review of estimates of the propor-
tion unchanged genes and the false discovery rate. 8MC Bioin-
formatics 2005, 6(I 99):.
19. R Development Core Team: R: A language and environment for
statistical computing. R Foundation for Statistical Computing, Vienna,
Austria 2005 [http://www.R-project.org]. ISBN 3-900051-07-0
20. Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, Fer-
rari M, Egevad L, Rayford W, Bergerheim U, Ekman P, DeMarzo AM,
Tibshirani R, Botstein D, Brown PO, Brooks JD, Pollack JR: Gene
expression profiling identifies clinically relevant subtypes of
prostate cancer. PNAS 2004, 101:811-816.
21. Storey JD: A direct approach to false discovery rates. JRSS8
2002, 64:479-498.
22. Genovese C, Wasserman L: A stochastic process approach to
false discovery control. The Annals of Statistics 2004,
32:1035-1061.
23. Romano JH: On the behavior of randomization tests without
a group invariance assumption. journal of the American Statistical
Association 1990, 85:.
24. Serfling RJ: Approximation Theorems of Mathematical Statistics New
York: John Wiley & Sons; 1980.


Page 14 of 14
(page number not for citation purposes)


Publish with BioMed Central and every
scientist can read your work free of charge
"BioMed Central will be the most significant development for
disseminating the results of biomedical research in our lifetime."
Sir Paul Nurse, Cancer Research UK
Your research papers will be:
available free of charge to the entire biomedical community
peer reviewed and published immediately upon acceptance
cited in PubMed and archived on PubMed Central
yours you keep the copyright
Submit your manuscript here: BioMedcentral
http://www.biomedcentral.com/info/publishing adv.asp


BMC Bioinformatics 2006, 7:15




University of Florida Home Page
© 2004 - 2010 University of Florida George A. Smathers Libraries.
All rights reserved.

Acceptable Use, Copyright, and Disclaimer Statement
Last updated October 10, 2010 - - mvs