Group Title: BMC Genetics
Title: A hierarchical and modular approach to the discovery of robust associations in genome-wide association studies from pooled DNA samples
Permanent Link: http://ufdc.ufl.edu/UF00099976/00001
 Material Information
Title: A hierarchical and modular approach to the discovery of robust associations in genome-wide association studies from pooled DNA samples
Physical Description: Book
Language: English
Creator: Sebastiani, Paola
Zhao, Zhenming
Abad-Grau, Maria
Riva, Alberto
Hartley, Stephen
Sedgewick, Amanda
Doria, Alessandro
Montano, Monty
Melista, Efthymia
Terry, Dellara
Perls, Thomas
Steinberg, Martin
Baldwin, Clinton
Publisher: BMC Genetics
Publication Date: 2008
 Notes
Abstract: BACKGROUND: One of the challenges of the analysis of pooling-based genome wide association studies is to identify authentic associations among potentially thousands of false positive associations. RESULTS: We present a hierarchical and modular approach to the analysis of genome wide genotype data that incorporates quality control, linkage disequilibrium, physical distance and gene ontology to identify authentic associations among those found by statistical association tests. The method is developed for the allelic association analysis of pooled DNA samples, but it can be easily generalized to the analysis of individually genotyped samples. We evaluate the approach using data sets from diverse genome wide association studies including fetal hemoglobin levels in sickle cell anemia and a sample of centenarians and show that the approach is highly reproducible and allows for discovery at different levels of synthesis. CONCLUSION: Results from the integration of Bayesian tests and other machine learning techniques with linkage disequilibrium data suggest that we do not need to use too stringent thresholds to reduce the number of false positive associations. This method yields increased power even with relatively small samples. In fact, our evaluation shows that the method can reach almost 70% sensitivity with samples of only 100 subjects.
General Note: Start page 6
General Note: M3: 10.1186/1471-2156-9-6
 Record Information
Bibliographic ID: UF00099976
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: Open Access: http://www.biomedcentral.com/info/about/openaccess/
Resource Identifier: issn - 1471-2156
http://www.biomedcentral.com/1471-2156/9/6

BMC Genetics (BioMed Central)



Methodology article
A hierarchical and modular approach to the discovery of robust
associations in genome-wide association studies from pooled DNA
samples
Paola Sebastiani*1, Zhenming Zhao1, Maria M Abad-Grau2, Alberto Riva3,
Stephen W Hartley1, Amanda E Sedgewick4, Alessandro Doria5,
Monty Montano6, Efthymia Melista6, Dellara Terry7, Thomas T Perls7,
Martin H Steinberg6 and Clinton T Baldwin6

Address: 1Department of Biostatistics, Boston University School of Public Health, Boston 02118 MA, USA, 2Department of Software Engineering,
University of Granada, Granada 18071, Spain, 3Department of Molecular Genetics, University of Florida at Gainesville, Gainesville 32611 FL, USA,
4Bioinformatics Program, Boston University School of Engineering, Boston 02116 MA, USA, 5Joslin Diabetes Center, Harvard Medical School,
Boston 02215 MA, USA, 6Department of Medicine, Boston University School of Medicine, Boston 02118 MA, USA and 7Geriatric Section, Boston
Medical Center, Boston 02118 MA, USA
Email: Paola Sebastiani* sebas@bu.edu; Zhenming Zhao zmzhao@bu.edu; Maria M Abad-Grau mabad@ugr.es;
Alberto Riva ariva@ufl.edu; Stephen W Hartley shartley@bu.edu; Amanda E Sedgewick asedge@bu.edu;
Alessandro Doria Alessandro.Doria@joslin.harvard.edu; Monty Montano mmontano@bu.edu; Efthymia Melista emelista@bu.edu;
Dellara Terry laterry@bu.edu; Thomas T Perls thperls@bu.edu; Martin H Steinberg mhsteinb@bu.edu;
Clinton T Baldwin cbaldwin@bu.edu
* Corresponding author



Published: 14 January 2008 Received: 6 September 2007
BMC Genetics 2008, 9:6 doi: 10.1186/1471-2156-9-6 Accepted: 14 January 2008
This article is available from: http://www.biomedcentral.com/1471-2156/9/6
© 2008 Sebastiani et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Abstract
Background: One of the challenges of the analysis of pooling-based genome wide association
studies is to identify authentic associations among potentially thousands of false positive
associations.
Results: We present a hierarchical and modular approach to the analysis of genome wide
genotype data that incorporates quality control, linkage disequilibrium, physical distance and gene
ontology to identify authentic associations among those found by statistical association tests. The
method is developed for the allelic association analysis of pooled DNA samples, but it can be easily
generalized to the analysis of individually genotyped samples. We evaluate the approach using data
sets from diverse genome wide association studies including fetal hemoglobin levels in sickle cell
anemia and a sample of centenarians and show that the approach is highly reproducible and allows
for discovery at different levels of synthesis.
Conclusion: Results from the integration of Bayesian tests and other machine learning techniques
with linkage disequilibrium data suggest that we do not need to use too stringent thresholds to
reduce the number of false positive associations. This method yields increased power even with
relatively small samples. In fact, our evaluation shows that the method can reach almost 70%
sensitivity with samples of only 100 subjects.






Background
The availability of genotyping assays for hundreds of
thousands of single nucleotide polymorphisms (SNP)s is
making genome wide association (GWA) studies more
accessible to a broad range of genotype-phenotype inves-
tigations. The promise of this technology is that it will
accelerate gene discovery for polygenic diseases and com-
plex phenotypes of Mendelian disorders because data for
all genes can be obtained simultaneously [1,2]. At the
same time, the large number of significance tests per-
formed is expected to result in a large number of false pos-
itive association signals. In fact, the number of signals
observed by chance may well be greater than those that
are authentic [3]. Thus, the development of analytic meth-
ods and strategies to distinguish authentic signals from
those due to chance will contribute significantly to dis-
ease-gene association studies.

Here we describe a modular procedure to analyze data
from pooling-based GWA studies that use the Illumina
SNP microarray technology [4]. Rather than genotyping
individual samples, the pooling-based technology types a
carefully constructed pool of DNA samples that can be
used to infer allele frequencies and is an affordable alter-
native to GWA studies that are still a financial burden for
many investigators. Several studies have shown the useful-
ness of pooling-based GWA studies to discover SNPs asso-
ciated with disease [5-9] using well calibrated methods
[7,10-12], and a variety of methods to estimate allele
frequencies from pooling-based studies that use the Affymetrix
microarray technology have been proposed [13,14].
Our objective is twofold: (i) we wish to assess the reproducibility
and accuracy of the algorithm proposed by Illumina
to detect chromosomal aberrations when used to estimate
allele frequencies from pooled DNA samples [15]; and (ii)
we propose a modular approach to the analysis of pool-
ing-based GWA studies that limits the loss of power due
to both the use of pools of DNA samples and the issue of
multiple comparisons.

Several studies apply stringent thresholds on the significance
level that is required to determine significant SNP-phenotype
associations [16-18]. Contrary to this
approach, our method integrates Bayesian tests for general
associations [19] with decision rules based on the struc-
ture of linkage disequilibrium (LD) discovered through
the International HapMap project [20], and other
machine learning techniques to reduce the number of
false positive associations. We also describe a hierarchical
procedure to summarize the findings in terms of genes
that can be further synthesized into gene sets using Gene
Ontology annotations [21], pathways [22,23], or chromo-
somal bands. We evaluate this method using data from
the sixty unrelated CEPH parents used for the Interna-
tional HapMap project [20] and two independent data-
sets. The first is a study of fetal hemoglobin (HbF) levels
in African American subjects with sickle cell anemia and
the objective is to discover genetic modulators of HbF.
The second dataset is a study of exceptional longevity in a
cohort of centenarians. In both datasets, using our novel
analytic approach, we identified association signals in
genes previously known to affect these phenotypes. The
method is implemented as an R package and can be integrated
with other R packages for genetic analysis, or GWA
studies [24,25]. We develop the method for the analysis of
pooled DNA samples [26,27], but the approach can be
easily extended to the analysis of samples that are individ-
ually genotyped.

Results
We ran three sets of experiments to assess the reproduci-
bility and accuracy of the estimates of the allele frequen-
cies derived from pooled DNA samples, as well as the
sensitivity and specificity of our modular procedure.

Experiment I: accuracy and reproducibility
We obtained DNA samples from the 60 unrelated parents
used in the HapMap CEU panel and created 2 duplicated
pools of 30, 45 and 60 samples each (Table 1 provides a
summary). The pooled DNA samples were analyzed in
duplicates with the Illumina Sentrix HumanHap300 Gen-
otyping BeadChip (v. 1) and b-allele frequencies were esti-
mated using the Illumina LOH and Copy Number
analysis tool. The reproducibility was assessed by the


Table 1: Summary of the results of Experiment I.

Number    Sample  Average     Standard                Average  Standard
of pools  size    difference  deviation  Correlation  error    deviation  Correlation
2         30      0.0043      0.0303     0.9940       0.0304   0.0565     0.9860
2         45      0.0011      0.0295     0.9956       0.0331   0.0573     0.9890
2         60      0.0164      0.0498     0.9873       0.0451   0.0668     0.9890

Column 1: pool description; Column 2: number of samples per pool; Column 3: average difference between estimates of allele frequencies in
repeated pools; Column 4: standard deviation of differences; Column 5: correlation between repeated allele frequency estimation; Column 6:
average difference between estimates of allele frequencies from pooled DNA samples and individually genotyped samples; Column 7: standard
deviation of the differences; Column 8: correlation between estimates of allele frequencies from pooled DNA samples and individually genotyped
samples.




agreement between allele frequency estimates in the two
replicate samples for each pool (Table 1). Shown in Figure
1 is the scatter plot of two independent replicates of allele
frequency estimates for the 22842 SNPs tagging chromo-
some 1 (top), and the 5452 SNPs tagging chromosome 22
(bottom) obtained with pools of 60 samples. The plots
show a high degree of agreement that is confirmed for dif-
ferent sample sizes as shown by the results summarized in
Table 1. Plots for other chromosomes are in the supplementary
material [28].
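
The agreement statistics summarized in Table 1 (average difference, standard deviation of the differences, and correlation between replicate estimates) can be illustrated with a short computation. The following is a minimal sketch rather than the authors' code: the function name `replicate_agreement` and the toy allele frequencies are hypothetical, and the use of the sample standard deviation and the Pearson correlation is an assumption, since the text does not state which estimators were used.

```python
import math

def replicate_agreement(run1, run2):
    """Return (mean difference, SD of differences, Pearson correlation)
    for two equally long lists of allele frequency estimates.
    Hypothetical helper; sample SD and Pearson correlation are assumptions."""
    if len(run1) != len(run2) or len(run1) < 2:
        raise ValueError("need two lists of equal length with at least 2 SNPs")
    n = len(run1)
    diffs = [a - b for a, b in zip(run1, run2)]
    mean_diff = sum(diffs) / n
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    m1, m2 = sum(run1) / n, sum(run2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(run1, run2))
    var1 = sum((a - m1) ** 2 for a in run1)
    var2 = sum((b - m2) ** 2 for b in run2)
    corr = cov / math.sqrt(var1 * var2)
    return mean_diff, sd_diff, corr

# Toy allele frequencies for five SNPs in two replicate pools (made-up values).
run1 = [0.10, 0.25, 0.40, 0.55, 0.80]
run2 = [0.12, 0.24, 0.43, 0.52, 0.81]
mean_diff, sd_diff, corr = replicate_agreement(run1, run2)
print(f"mean difference={mean_diff:.4f}, SD={sd_diff:.4f}, correlation={corr:.4f}")
```

The same function applies to the accuracy comparison if the second list holds the frequencies computed from individually genotyped samples.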




[Figure 1, top panels: Chromosome 1 reproducibility; Chromosome 1 accuracy]

We assessed the accuracy of the allele frequency estimates
from pooled DNA samples by comparing the average esti-
mates over the replicated pools with the allele frequencies
computed using individually genotyped DNA samples
that are available from the web site of the HapMap project
[29]. A scatter plot of part of the results is displayed in Fig-
ure 1 for pools of 60 samples. The error analysis summa-
rized in Table 1 suggests that, on average, the allele
frequency based on the analysis of replicated pooled DNA
samples differ from those based on individually geno-
typed data by approximately 0.04 but the error can be as


[Figure 1, bottom panels: Chromosome 22 reproducibility; Chromosome 22 accuracy]

Figure 1
The reproducibility of the allele frequency estimates is shown by the scatter plot of repeated estimates of allele frequency
inferred from pooled DNA samples (left). The labels "run 1" and "run 2" in the x- and y-axis specify each replication. The accuracy
of the allele frequency estimates is shown by the scatter plot of the estimates of allele frequency inferred from pooled
DNA samples (y-axis in the right plots) and those computed from individually genotyped samples (x-axis). The analysis of the
other chromosomes shows similar results.





large as ~0.12 ($0.04 + 2 \times 0.06/\sqrt{2} \approx 0.12$), thus making differences
in allele frequencies smaller than 0.24 difficult to detect
because of technical errors. However, our analysis shows
that less than 5% of the estimates based on pools of DNA
samples differ from those based on individually geno-
typed samples by more than 0.12, and less than 10% dif-
fer by more than 0.08. This suggests reducing the
minimum detectable allele frequency difference to 0.15
with a 10% chance of error. Furthermore, we have
observed that amplifying DNA does not appear to affect
either the reproducibility or the accuracy of the analysis.

To infer the effective sample size to be used in the analysis,
we also looked at the distribution of the ratio between the
two types of allele frequency estimates, $p(S_i) = n_i/n$
and $q(S_i)$, where $n_i$ is the count of the minor allele of
the SNP $S_i$ computed from the samples that were typed
individually, $n$ is the overall sample size, and $q(S_i)$ is the
frequency of the same minor allele computed from the
analysis of the pooled DNA samples, in the different sets.
The analysis showed that $\log(q(S_i)/p(S_i))$ is
approximately normally distributed with mean 0 and
standard deviation 0.35. From these data, we deduced that
about 95% of the allele frequency estimates derived from the
pooled DNA samples can be assumed to lie within the
interval $p(S_i)\exp(\pm 1.95 \times 0.35)$, from which we derive the
empirical relation between $p(S_i)$ and $q(S_i)$:
$0.51\,p(S_i) < q(S_i) < 1.98\,p(S_i)$, with a range of uncertainty of
$1.47\,n_i/n$. The inequality suggests that when we infer allele
frequency from pooled DNA samples, we have a loss of precision
approximately equivalent to using 2/3 ($\approx 1/1.47$) of
the original DNA sample size. We call this the "effective
sample size" used in the calculation of the Bayesian test of
association.
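
The arithmetic behind the effective sample size can be reproduced as follows. This is a minimal sketch under the assumptions stated in the text (a normal log-ratio with standard deviation 0.35 and a multiplier of 1.95); the function names are hypothetical and are not part of the authors' software.

```python
import math

def pooled_frequency_bounds(sd_log_ratio=0.35, z=1.95):
    """Multiplicative bounds on q(S_i)/p(S_i) implied by a normal
    log-ratio with mean 0 and the given standard deviation."""
    lower = math.exp(-z * sd_log_ratio)
    upper = math.exp(z * sd_log_ratio)
    return lower, upper

def effective_sample_size(n, sd_log_ratio=0.35, z=1.95):
    """Scale the nominal sample size n by 1 / (upper - lower),
    following the reasoning in the text (hypothetical helper)."""
    lower, upper = pooled_frequency_bounds(sd_log_ratio, z)
    return n / (upper - lower)

lower, upper = pooled_frequency_bounds()
print(f"{lower:.2f} p(S_i) < q(S_i) < {upper:.2f} p(S_i)")
print(f"range of uncertainty: {upper - lower:.2f}")
print(f"effective size of a pool of 60 samples: {effective_sample_size(60):.1f}")
```

With these inputs the scaling factor is close to 0.68, which the text rounds to 2/3.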

Experiment 2: specificity
To estimate the false positive rate (FPR) we used real data
from pools of DNA samples to create artificial sets of
pools. The original pools are described in Table 2 and
were generated in duplicate to discover genetic variants
associated with exceptional longevity [30], and fetal
hemoglobin expression in subjects with sickle cell anemia
[31]. The Illumina Sentrix HumanHap300 Genotyping
BeadChip was used for all the experiments. We created the
artificial sets of pools by mixing replicates of different
pool sets. For example, we generated a set of two pools by
taking one replicate of the pooled DNA samples from the
female centenarians and one replicate of the pooled DNA
samples from the younger female controls, and we con-
structed a second set of two pools by taking the remaining
replicates from the two sets (See Figure 2 for an example).
Because the two artificial sets of pools are homogeneous
relative to the phenotype, the differences in allele frequen-
cies between the two sets can be attributed to chance, and
the SNPs with significant differences in allele distribution


Table 2: Summary of the pools of DNA samples that were used
for the validation of the analytical method. Each pool was done in
duplicate.

Phenotype             Sample size
Exceptional           130 male centenarians
longevity             130 male controls
                      130 female centenarians
                      100 female controls
Fetal hemoglobin      55 sickle cell anemia subjects with fetal hemoglobin
expression            below 3% of the total hemoglobin
                      54 sickle cell anemia subjects with fetal hemoglobin
                      above 6.5% of the total hemoglobin


are false positives. We repeated this analysis by mixing different
types of pools of DNA samples. Using a BF > 3
together with the LD and regional filters, we observed a
false positive rate ranging between $4 \times 10^{-4}$ and 0.001, with
a mean of 0.001, and an average of 300 SNPs selected by
chance. Note that this number is substantially smaller
than the number of false positive associations that we
would expect by chance using a BF>3. This threshold is
equivalent to accepting an association when the posterior
probability of the association is greater than 0.75, so that
we expect 1 in 4 associations to be false. Also the specifi-
city of the selected and significant genes was very high:
The number of genes that by chance were selected in two
unrelated analyses was 9 and this number was further
reduced to 7 when we limited attention to significant
genes. These numbers should provide a reference when
we examine the reproducibility of findings in different
studies, because we expect that, by chance alone, we
would have an agreement in about 0.1% of findings. We
note that long genes that are tagged by a larger number of
SNPs are more likely to be selected by chance in different
studies. Figure 3 displays the log10 BF in the 1,114 false
positive associations generated in approximately $10^6$
association tests. The plot shows an exponential decay of
the BF, so that the chance of observing a very large BF
decays exponentially: the probability of observing a BF
greater than 10 by chance is $6 \times 10^{-4}$, whereas the probability
of observing a BF greater than 100 is $3 \times 10^{-4}$, and
greater than 1000 is $2 \times 10^{-4}$. This analysis however shows
that trying to reduce the false positive rate by imposing a
stringent threshold on the BF would likely reduce the
power of relatively small association studies and require
unrealistically large sample sizes.
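
The correspondence between the threshold BF > 3 and a posterior probability of 0.75 follows from Bayes' theorem when the association and the null hypothesis are given equal prior probability. The snippet below is a minimal sketch of that conversion; the function name `posterior_probability` is hypothetical, and the default prior odds of 1 is an assumption that matches the 0.75 figure quoted in the text.

```python
def posterior_probability(bayes_factor, prior_odds=1.0):
    """Posterior probability of association given a Bayes factor.
    posterior odds = Bayes factor x prior odds (prior odds of 1 assumed)."""
    if bayes_factor < 0 or prior_odds < 0:
        raise ValueError("Bayes factor and prior odds must be non-negative")
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

for bf in (3, 10, 100, 1000):
    print(f"BF = {bf:>4}: posterior probability = {posterior_probability(bf):.3f}")
```

Under the same assumption, a more skeptical prior (prior odds below 1) would lower the posterior probability attached to any given Bayes factor.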

We also ran experiments to assess the effect of the LD and
regional filters on the specificity. Using the same simulated
sets, we ran the analysis using only a BF > 3 to
select the significant SNPs, and also examined the effect of
adding either the LD filter or the regional filter or both on




[Figure 2 schematic. Original pool set: male centenarians (Pool 1, Pool 2) versus controls for male centenarians (Pool 3, Pool 4), used
for the comparative analysis. Artificial pool sets: Set 1 (Pool 1, Pool 3) versus Set 2 (Pool 2, Pool 4), and Set 3 (Pool 1, Pool 4)
versus Set 4 (Pool 2, Pool 3). In the comparative analysis of the artificial sets, SNPs with different allele frequencies are false positives.]

Figure 2
Example of the artificial pool sets that we created to assess the specificity of the procedure. As an example, the top four pools
were generated to compare the genome of centenarians (pools 1 and 2) with that of younger controls (pools 3 and 4). The
two artificial pool sets are obtained by mixing pools of centenarians DNA with those of controls.


[Figure 3 plot; x-axis: LOG(Bayes Factor)]

Figure 3
Distribution of the log10 Bayes factor in 1,114 false positive
associations generated in approximately $10^6$ association tests
with an estimated false positive rate $5 \times 10^{-6}$. The analysis
shows that the chance to observe a very large Bayes factor
has an exponential decay, and the probability of observing a
Bayes factor greater than 10 by chance is $6 \times 10^{-4}$, the probability
of observing a Bayes factor greater than 100 is $3 \times 10^{-4}$,
greater than 1000 is $2 \times 10^{-4}$.


the false positive rate. Our results suggest that the LD filter
reduces the false positive rate by 43%, while the regional
filter alone increases the false positive rate by approxi-
mately 25%, and both filters decrease the false positive
rate by 20%. These results are consistent with the intuition
that the regional filter increases the power by finding clus-
ters of SNPs that individually have small effects and
would be disregarded by a one-SNP-at-a-time analysis.
However, the effect is to slightly increase the false positive
rate. This conjecture is confirmed in the next experiments
that we conducted to assess the sensitivity.
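
The artificial pool sets used throughout Experiment 2 (Figure 2) are built by pairing one replicate from each phenotype group, so that both sets in a comparison are homogeneous with respect to the phenotype. A minimal sketch of that pairing is given below; the function name and the pool labels are hypothetical and serve only to illustrate the design.

```python
def artificial_pool_sets(case_replicates, control_replicates):
    """Given two replicate pools per group, return the two possible
    comparisons in which each set mixes one case and one control pool."""
    if len(case_replicates) != 2 or len(control_replicates) != 2:
        raise ValueError("this sketch assumes exactly two replicates per group")
    case_a, case_b = case_replicates
    ctrl_a, ctrl_b = control_replicates
    return [
        ((case_a, ctrl_a), (case_b, ctrl_b)),  # Set 1 versus Set 2
        ((case_a, ctrl_b), (case_b, ctrl_a)),  # Set 3 versus Set 4
    ]

comparisons = artificial_pool_sets(("Pool 1", "Pool 2"), ("Pool 3", "Pool 4"))
for left, right in comparisons:
    print(f"{left} versus {right}")
```

Any SNP whose allele frequency differs between the two sets of such a comparison counts as a false positive, which is how the false positive rates above were estimated.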

Experiment 3: sensitivity
In related work we are analyzing pools of DNA samples as
a screening tool to discover genetic variants associated
with exceptional longevity [30], and fetal hemoglobin
(HbF) expression in subjects with sickle cell anemia [31].
As an indication of the sensitivity of the technology and the analytic
method, we searched for SNPs in the Illumina Sentrix
HumanHap300 Genotyping BeadChip that have been
reported associated with either trait in independent stud-
ies, and verified whether an association was found based
on the pooled DNA samples.

HbF experiment
We created two pools using DNA samples from 55
patients in the top and 54 patients in the bottom quartile




of HbF concentrations. These patients were part of a clin-
ical trial described in [32]. The pools were run in dupli-
cates, and the data analyzed using the method proposed
here. We searched the literature and found 36 SNPs with
rs numbers that were reported associated with different
levels of HbF [31,33-35]. Thirteen of these SNPs are in the
Illumina array, and only 3 of these were found associated
in our analysis with a BF greater than 1, and 2 with a BF
greater than 3. The moderate effect of the other 10 SNPs
(odds ratios between 0.55 and 1.76) is consistent with the
weak associations reported by other investigators and
would not be detectable with our sample size of about 60
subjects per group. In fact, a sample size of 60 subjects
would give at most 30% power to detect an odds ratio of


1.75 when the MAF in one group is 0.5. We also found 23
SNPs in the Illumina array that are within 150 kb of the
other 23 reported SNPs, and are associated with HbF lev-
els with a BF greater than 1, and 13 of these had a BF
greater than 3. Fifteen of these SNPs were typed as part of
the HapMap project and ten of these are in strong LD
(Bayes D' > 0.8)[36]. Thus the analysis based on pooled
DNA samples discovered association of 26 SNPs, and for
13 of them the association was strong. This analysis sug-
gests a sensitivity of 72% and if we limit attention to asso-
ciations supported by a BF of at least 3 the sensitivity is
36%. The details of the associations are in Table 3.
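
The two sensitivity figures follow directly from the counts given in the paragraph above. The snippet below is a minimal sketch of that arithmetic; the variable and function names are hypothetical, and the counts are taken as stated in the text.

```python
def sensitivity_percent(n_detected, n_known):
    """Percentage of previously reported SNPs recovered by the analysis."""
    if n_known <= 0:
        raise ValueError("n_known must be positive")
    return 100.0 * n_detected / n_known

n_reported = 36          # SNPs reported in the literature
n_array_detected = 3     # reported SNPs on the array with BF > 1
n_proxy_detected = 23    # nearby SNPs with BF > 1 standing in for the others
n_strong = 13            # associations with BF > 3, as stated in the text

overall = sensitivity_percent(n_array_detected + n_proxy_detected, n_reported)
strong = sensitivity_percent(n_strong, n_reported)
print(f"sensitivity (BF > 1): {overall:.0f}%")
print(f"sensitivity (BF > 3): {strong:.0f}%")
```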


Table 3: List of SNPs that are known to be associated with different levels of HbF and results of the analysis based on pooled DNA
samples.

Number  SNP         Band      Gene      Validated   Distance  BF       OR
1       rs143637    2q13      IL1B      rs12469600  16976     1.39     3.08
2       rs31481     5q23.3    IL3       rs40401     724       2.32     0.48
3       rs271158    6q23.2    -         rs271156    641       22.86    0.35
4       rs454877    6q23.2    EYA4      rs211433    475       2.43     2.09
5       rs212770    6q23.2    EYA4      rs1154727   4274      1.25     3.03
6       rs2295199   6q23.2    -         rs2295199   0         0.51     0.58
7       rs210948    6q23.2    MYB       rs210798    17242     2.17     0.38
8       rs509342    6q23.2    PDE7B     rs560713    23645     1.47     17.80
9       rs2076192   6q23.3    MAP7      rs2076193   44438     1.30     1.88
10      rs997139    6q23.3    MAP7      rs3799419   2100      1.35     0.36
11      rs3778314   6q23.3    MAP7      rs2181096   25746     57.04    0.28
12      rs2237262   6q23.3    MAP3K5    rs3799472   18405     1.03     0.53
13      rs2012700   6q23.3    PEX7      rs2012700   0         1.50     0.52
14      rs717088    6q23.3    PEX7      rs717088    0         0.47     0.65
15      rs3799476   6q23.3    PEX7      rs3799479   44021     1.98     0.42
16      rs1342645   6q23.3    PEX7      rs1342645   0         0.67     0.55
17      rs1342641   6q23.3    -         rs1342642   22583     6.12     4.98
18      rs1322393   6q23.3    IL20RA    rs1322394   661       3.00     0.44
19      rs44450     6q23.3    -         rs276568    3229      47.32    0.34
20      rs1349115   8q12.1    TOX       rs1349115   0         0.69     0.59
21      rs10504269  8q12.1    TOX       rs10504269  0         0.61     0.58
22      rs6997859   8q12.1    TOX       rs6997859   0         0.37     1.45
23      rs12155519  8q12.1    TOX       rs12155519  0         0.94     1.77
24      rs1947178   8q12.1    TOX       rs1947178   0         21.86    0.28
25      rs746867    8q12.1    TOX       rs746867    0         0.29     0.76
26      rs389349    8q12.1    TOX       rs389349    0         0.42     1.43
27      rs851800    8q12.1    TOX       rs396720    23622     9.20     3.24
28      rs380620    8q12.1    TOX       rs2561145   300069    10.67    0.26
29      rs2043190   9q34.11   ASS       rs540140    609       19.56    24.90
30      rs7482144   11p15.4   HBG2      rs3813727   20257     2.52     0.43
31      rs723623    15q13.3   C15orf16  rs6493688   8825      1.31     2.66
32      rs1867380   15q22.31  AQP9      rs1867380   0         0.28     0.85
33      rs4489951   15q22.31  MAP2K1    rs4489951   0         17.66    0.38
34      rs1440372   15q22.31  SMAD6     rs2469141   65753     2181.00  4.24
35      rs8038623   15q22.31  SMAD3     rs6494633   16831     87.23    4.09
36      rs2227319   17q21.1   CSF3      rs2071369   1460      3.42     2.88

[The D', PH and PL columns are not legible in this copy.]

Column 1: row number; Column 2: SNP ID; Column 3: Cytogenetic band; Column 4: Genes tagged by the SNP; Column 5: SNP in the Illumina array that
was used to compare the association. If the SNP to be validated was not in the array, we searched for the closest SNP within 100 kb from that to be
validated with a positive Bayes Factor; Column 6: distance between the two SNPs; Column 7: Bayes D' between the two SNPs, an NA means that the
SNP originally reported as associated with HbF is not in the HapMap data; Column 8: Bayes Factor; Columns 9-10: estimates of allele frequencies in the
pools of DNA from patients with high HbF and low HbF; Column 11: Odds ratio. The SNPs 6, 13, 14, 16, 20-26, 32 and 33 are in the array and SNPs
13, 24 and 33 were found associated with different levels of HbF. Highlighted in bold are the associations confirmed by our analysis.




Longevity experiment
We created pools of DNA samples from unrelated centenarians
and younger controls. Because there is evidence of a
gender effect [37] (85% of centenarians are female), we
created distinct pools for males and females as summarized
in Table 2. We searched the literature and found 36
SNPs with rs numbers reported as associated with longev-
ity [38-40]. Seven of these 36 SNPs are in the Illumina
array, and five of these seven were found associated in our
analysis. For 21 of the remaining 29, we found SNPs
within 100 kb that were associated with longevity in
either the male or the female comparison, or both. These
SNPs are reported in Table 4. The analysis suggests
approximately 67% sensitivity, and this is consistent with
the sensitivity estimated with the HbF experiment. We
also noted that the regional filter helped identify some of
the associations that would be lost with a tight threshold
on the BF. As an example, the SNP rs2227956 on HSPA1L
was found associated with longevity in males only when
the regional filter is used, and the two SNPs in WRN, a
well-known longevity gene in mice [41], were selected by
the regional filter. A similar form of sensitivity analysis is
to see whether the GSEA analysis can lead to the discovery of sets
of functionally similar genes that are known to be associ-
ated with longevity. GSEA analysis of the centenarian
cohort revealed several enriched GO biological categories
(Table 4 and manuscript in preparation). Among the sig-
nificantly enriched categories were genes associated with
immune response (e.g., CSF3) and DNA repair (e.g.,
XRCC4), see Table 4. Intriguingly, CSF3 (also known as
GCSF) is reported to influence migration of stem cells
between the bone marrow and blood [42,43] and appears
to promote regeneration of myocardial tissue [44-46],
which has clear relevance to longevity. The gene XRCC4
has a well established role in DNA repair [47], and unre-
paired DNA has been reported to accelerate ageing, possi-
bly through dysregulating the IGF/growth axis [48].
Therefore, a comprehensive analysis of these and other
genes present in the enriched gene sets that we detected
will be essential to fully appreciate the pathways that
contribute to the longevity phenotype.

Discussion and Conclusion
We have developed a hierarchical and modular approach
to the analysis of genome wide genotype data based on
pooled DNA samples. The method incorporates quality
control data, information about linkage disequilibrium,
Bayesian association tests, physical distance and gene
ontology to identify associations warranting further inves-
tigation. Our evaluation using real data has shown the
accuracy, reproducibility, sensitivity and specificity of the
method.

Compared to other approaches, the integration of Bayesian
tests with information about linkage disequilibrium
and other machine learning techniques implies that we do
not need to use too stringent thresholds to reduce the
number of false positive associations. The implication of
this fact is an increased power even with relatively small
samples. In fact, our estimate of the sensitivity shows that
the method can reach almost 70% sensitivity with sam-
ples of only 100 subjects.

Although we developed the approach to analyze pooled
DNA samples, the method can also be used for the analy-
sis of individually genotyped samples.

Methods
Genotyping
For the HbF study, DNA was obtained from the 60 subjects
with HbF levels below the first quartile of the distribution,
and the 60 subjects with HbF levels above the
third quartile, who were enrolled in the Multicenter Study
of Hydroxyurea in Sickle Cell Anemia (MSH) [49].
DNA samples from 260 centenarians and a control group
of 230 subjects were obtained from the New England Cen-
tenarian Study: a cross-sectional study of individuals aged
97 and older conducted at the Boston Medical Center.
CEPH DNA samples of the sixty unrelated parents used for
the International HapMap project [20] were obtained
from the Coriell Institute, Camden, NJ and used to assess
the accuracy and reproducibility of the estimates of
allele frequency in pooled DNA samples relative to
individually genotyped samples. For DNA pool construction,
and to ensure that each individual contributed
equally to the pool, we first measured DNA stock solutions
using a fluorimetric method (RNAseP) against a
standard curve constructed from known concentrations of
human genomic DNA. We then diluted the stock solutions
to 10 ng/µl and measured the concentrations of
these working solutions by means of PicoGreen. In the
case of samples for which the CV of the three measurements
was greater than 10%, quantification was repeated
in triplicate until the CV was smaller than 10%. Measurements
were highly reproducible, with a correlation coefficient
of 0.97 between the third measurement and the
average of the first two. Based on these concentrations, 50
ng of DNA were added to the pool for each individual.
The pools of DNA were analyzed on the Sentrix
HumanHap300 bead chip (Illumina) according to the
manufacturer's protocol. The data used in the HbF and
longevity studies will be released with companion publications.
We make available the data derived from pools of
CEPH DNA samples from the supplementary web site
[28]. The HbF and longevity studies were approved by the
Institutional Review Boards of Boston University.
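
The quantification check and the pooling volume described above can be sketched in a few lines. This is an illustration only: the function names and the example measurements are hypothetical, and computing the CV as the sample standard deviation divided by the mean is an assumption, since the text does not specify the formula.

```python
import statistics

def coefficient_of_variation(measurements):
    """CV in percent, taken here as sample SD / mean (an assumption)."""
    return 100.0 * statistics.stdev(measurements) / statistics.mean(measurements)

def needs_requantification(measurements, cv_threshold=10.0):
    """True when the triplicate measurements disagree by more than the threshold."""
    return coefficient_of_variation(measurements) > cv_threshold

def volume_for_pool_ul(concentration_ng_per_ul, mass_ng=50.0):
    """Volume (in microliters) that contributes mass_ng of DNA to the pool."""
    if concentration_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return mass_ng / concentration_ng_per_ul

consistent = [9.8, 10.1, 10.3]    # made-up concentrations in ng/µl
discordant = [8.0, 10.0, 12.5]    # made-up concentrations in ng/µl
print(needs_requantification(consistent), needs_requantification(discordant))
print(volume_for_pool_ul(statistics.mean(consistent)))
```

At the nominal working concentration of 10 ng/µl, the 50 ng contributed by each individual corresponds to 5 µl.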

Association test
The overall analytic strategy is shown in Figure 4. The first
module is a statistical procedure to test the allelic associa-












( Table 4: List of SNPs that were found associated with exceptional longevity in different studies, and results based on the analysis of pooled DNA samples.
0)

!P Males Females

Index SNP Band Gene Validated Distance D' BF P L PC OR Validated Distance D' BF P L PC OR

o I rs 1870377 4q 2 KDR rs2305945 1128 1.00 1.5 0.84 0.75 1.71 rs2305945 1128 1.00 1.1 0.77 0.68 1.62
- 2 rs2866164 4q23 MTP rs7693203 9187 NA 3.4 0.82 0.72 1.83 rs1057613 14042 NA 8.6 0.64 0.49 1.90
b 3 rs750032 4q24 PPP3CA rs2850971 3819 NA 41.4 0.87 0.74 2.39 rs2850971 3819 NA 10.8 0.82 0.69 2.12
a 4 rs951085 4q24 rs9999238 29363 0.98 0.2 0.81 0.83 0.85 rs9999238 29363 0.98 10.9 0.81 0.67 2.09
-0 5 rs28360135 5ql4.2 XRCC4 rs1382367 40462 NA 295053.4 0.32 0.59 0.33 rs1382367 40462 NA 2.6 0.50 0.63 0.58
a)
E 6 rs1799945 6p22.2 HFE rs 1572982 3188 0.99 28228.6 0.67 0.42 2.78 rs 1572982 3188 0.99 11.2 0.44 0.61 0.51
o 7 rs9380254 6p21.33 MICA rs I 131896 780 1.00 5744.1 0.67 0.87 0.32 rs 1131896 780 1.00 0.8 0.83 0.84 0.96
8 rs2227956 6p21.33 HSPAIL rs2227956 0 12.6 0.10 0.20 0.42 rs2227956 0 0.3 0.18 0.14 1.31
9 rs1800797 7pl5.3 IL6 rs2056576 5019 0.64 0.2 0.70 0.72 0.91 rs2056576 5019 0.99 1.7 0.75 0.64 1.70
10 rs662 7q21.3 PONI rs662 0 0.2 0.33 0.29 1.21 rs662 0 2.8 0.32 0.21 1.77
II rs 1799983 7q36.1 NOS3 rs2373929 18701 0.16 6.1 0.50 0.64 0.57 rs2373929 18701 0.07 0.6 0.63 0.54 1.45


rs225 1621 8p I 2 PURGK
rs2725362 8pI2 WRN
rs 1346044 8p 2 WRN
rs5744256 I Iq23.1 ILl8
rs675 I I q23.2 APOA4
rs 1467558 II p33 CD44
rs9536314 13 KL
rs861539 14 XRCC3
rs8052394 16ql2.2 MTIA
rs 1800776 16q13 CETP
rs5882 16q13 CETP
rs704 17qI 1.2 VTN
rs4344 17q23.3 ACE
rs2252673 19p 3.2 INSR
rs 1799782 19 XRCCI


rs i3269094
rs3024239
rs 1346044
rs243908
rs58 10 15
rs 1467558
rs95363 14
rs861539
rs7189840
rs3764261 I
rs5882
rs2027993
rs4343
rs2059807
rs939461 1128


818 9 0.66
158 NA
0
16909 0.61
5559 0.28
0
0
0
9798 NA
1910 NA
0
12085 0.99
693 1.00
15691 0.23
11001 NA


1 36O.2 U.U0 U.22 U.26
4942.7 0.34 0.57 0.39
257.9 0.36 0.19 2.44
>10A6 0.63 0.29 4.02
1.3 0.61 0.51 1.55
2.0 0.91 0.83 1.97
5.1 0.24 0..36 0.55
0.15 0.75 0.74 1.05
0.14 0.59 0.57 1.09
1.1 0.59 0.69 0.65
0.19 0.39 0.35 1.19
17.3 0.45 0.61 0.53
1.4 0.69 0.78 0.61
1.5 0.63 0.73 0.63
1.4 0.05 0.10 0.47


rs I 3629I
rs2725362
rs 1346044
rs243908
rs 10502189
rs 1467558
rs9536314
rs861539
rs7189840
rs3764261 I
rs5882
rs2027993
rs4343
rs2059807
rs939461


UI23018 0.4
0
0
16909 0.96
371 1.00
0
0
0
9798 NA
1910 NA
0
12085 0.99
693 1.00
15691 0.31
11001 NA


27.2 0.48 0.66
0.4 0.28 0.21
0.2 0.28 0.28
4.8 0.05 0.00
0.2 0.84 0.86
99.0 0.36 0.18
68.7 0.74 0.55
54.1 0.56 0.37
0.15 0.66 0.67
0.17 0.37 0.40
1.1 0.58 0.47
6.4 0.66 0.51
0.2 0.75 0.72
12.6 0.09 0.02


Column I: row number; Column 2: SNP ID; Column 3: Cytogenic band; Column 4: Genes tagged by the SNP; Column 5: SNP in the Illumina array that was used to compare the association as described in
the caption of Table 3; Column 6: distance between the two SNPs; Column 7: Bayes D' between the two SNPs; Column 8: Bayes Factor; Column 9-10: estimates of allele frequencies in the pools of DNA
from male centenarians and younger controls; Column I 1: Odds ratio. Columns 12-18: as columns 5-1 I but for the pools comparing female centenarians to younger controls. The SNPs 8, 10, 14, 16-19, 22
are in the array and SNPs 8, 14, 17-19 were found associated with longevity in females and/or males. Highlighted in bold are the associations confirmed by our analysis.


aoo
0.
a) C








C71


U.10U
0.48
1.46
1.01
56.22
0.87
2.56
2.27
2.18
0.97
0.88
1.57
1.87
1.16
5.88







http://www.biomedcentral.com/1471-2156/9/6


tion between each individual SNP and the phenotype. The
input data are the allele relative frequencies estimated
with the "b-allele frequency" value provided by the Illu-
mina Beadstudio genotype module. This value represents
the relative proportion of each allele in the DNA sample
and is used by the Illumina loss of heterozygosity (LOH)
and Copy Number analysis tool [15] to detect chromo-
somal aberrations and copy numbers by comparing the
normalized intensity of the test sample (the pooled DNA
samples) to a reference sample. We use the estimate of the
allele frequencies $\theta_{ij}$ in test and control pools to reconstruct
the expected allele counts as $n_{ij} = n_i \theta_{ij}$, where
$n_i$ is the effective sample size in pool i, the index i = 1 for
cases and i = 2 for controls, and the index j = A,B denotes
the A or B allele. We then use a Bayesian test of association
to compare the distributions of allele frequency in the two
different pools. The test is described in [19,50] and
assumes that prior probabilities are available for the
model of allelic association and the model of no association,
say $p(M_a)$ and $p(M_i)$, and then uses the data to update
these prior probabilities into the posterior
probabilities by using Bayes' theorem. The decision rule is
then to select the model of association if its posterior
probability is at least 3 times larger than the posterior
probability of the model of no association (as suggested
in reference [51]). Formally, the ratio of the posterior
probabilities is

$$\frac{p(M_a \mid n_{ij})}{p(M_i \mid n_{ij})} = \frac{p(n_{ij} \mid M_a)\, p(M_a)}{p(n_{ij} \mid M_i)\, p(M_i)}$$

and the ratio of the marginal likelihood functions
$p(n_{ij} \mid M_a)/p(n_{ij} \mid M_i)$ is known as the Bayes factor (BF). When the
prior probabilities of the two models are equal, the BF is
equivalent to the posterior odds. Assuming the conjugate
Beta distribution for the allele frequencies, the BF can be
calculated in closed form and the formula for this calculation
is reported in the appendix. To take into account the
issue of multiple testing, we can use prior information
about the number of SNPs that we expect to be associated
to make the selection more stringent. For example, if we
expect 1,500 SNPs to be associated with the phenotype, then
the prior odds for the alternative hypothesis of association
are 0.005/0.995 when we test 300,000 SNPs, and the decision
rule becomes to accept that a SNP is associated with
the phenotype if the BF is at least 3 x 0.995/0.005 = 597,
so that the posterior odds for the association are
at least 3. Initial experiments
described in the Evaluation section suggest that a robust
choice for the effective sample size is 2/3 of the original
pool, and this is consistent with the larger sample size
needed with the analysis of pooled DNA samples [52]. We
note here that one advantage of this modular procedure is
that the Bayesian test can be replaced by a standard χ² test
for allelic association.

Filtering out false positives
Although we can take into account the issue of multiple
comparisons by choosing appropriate prior odds for an
association, the consequence of this approach is to reduce
power and to require large sample sizes to detect associations
with a small effect. This consequence can be problematic
in studies where cases are relatively rare, such as
the study of exceptional longevity in which cases are subjects
who live to 100 years and older. To fully exploit the
power of small scale studies we developed a series of data
filters that remove unreliable or suspicious associations
(see Figure 4). The first filter is specific for allelic association
analysis using pooled DNA samples and accounts for
the lack of precision of the technology. The other two filters
take into account redundancy as well as reciprocal
information of SNPs based on the LD structure of the
human genome. So, rather than using "SNP pruning" as
in PLINK to remove SNPs that are in LD [53], we leverage
the dependencies determined by LD to improve the detection
of false positives while reducing the false negative rate.

Figure 4
Schematic summary of the modular approach to the analysis
of GWA data.
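
The prior-odds adjustment described in the Association test section can be checked with a small calculation. The sketch below reproduces the value 597 from the worked example (1,500 expected associations among 300,000 SNPs, posterior odds cutoff of 3); the function names are hypothetical, and the 2/3 effective-sample-size fraction is simply the value suggested in the text.

```python
def bayes_factor_cutoff(expected_associated, snps_tested, posterior_odds_cutoff=3.0):
    """Bayes factor a SNP must reach so that its posterior odds meet the cutoff."""
    prior_prob = expected_associated / snps_tested
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds_cutoff / prior_odds


def effective_sample_size(pool_size, fraction=2.0 / 3.0):
    """Down-weighted pool size used when reconstructing allele counts."""
    return fraction * pool_size


# Worked example from the text: 1,500 expected among 300,000 tested SNPs.
print(round(bayes_factor_cutoff(1500, 300000), 6))
print(round(effective_sample_size(150), 6))
```

With equal prior probabilities for the two models the cutoff reduces to the plain posterior-odds value of 3.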

Quality control (QC) filter
The function of this filter is based on an extensive evalua-
tion of the accuracy and reproducibility of the allele fre-
quency estimates that are computed with the Illumina
software. Allele frequencies obtained from genotyping of
pooled DNA samples were compared with those derived
from genotyping of individual samples (detailed in the
Evaluation section below). The results suggest that estimates for SNPs
with minor allele frequency (MAF) < 0.15, as well as differences
in allele frequency of less than 0.15, are not reliable.
We therefore filter out all SNPs with these characteristics,
as well as those SNPs for which repeated estimates of
allele frequencies in replications of the same pool differ
by more than 0.15.
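
A minimal sketch of the three QC rules follows. It assumes that the MAF rule is applied to the estimate from each pool separately and that replicate agreement is measured as the range (maximum minus minimum) of the repeated estimates; neither detail is specified in the text, and the function and parameter names are hypothetical.

```python
def passes_qc(case_freq, control_freq, case_reps=(), control_reps=(),
              min_maf=0.15, min_diff=0.15, max_rep_spread=0.15):
    """Return True when a SNP survives the pooled-DNA quality control rules."""
    # Rule 1: minor allele frequency too low in either pool (assumption: per pool).
    for freq in (case_freq, control_freq):
        if min(freq, 1.0 - freq) < min_maf:
            return False
    # Rule 2: case/control difference too small to be measured reliably.
    if abs(case_freq - control_freq) < min_diff:
        return False
    # Rule 3: replicate estimates of the same pool disagree too much.
    for reps in (case_reps, control_reps):
        if len(reps) > 1 and max(reps) - min(reps) > max_rep_spread:
            return False
    return True


print(passes_qc(0.50, 0.30))                              # large, well-measured difference
print(passes_qc(0.50, 0.40))                              # difference below 0.15
print(passes_qc(0.50, 0.30, case_reps=(0.40, 0.60)))      # unstable replicates
```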

Linkage disequilibrium (LD) filter
SNPs in LD with each other would be expected to show
similar patterns of association if the signal is authentic
whereas a single SNP in an LD block showing association is
more likely to represent a spurious association. Therefore,
our procedure automatically checks this condition and
disregards the associations for SNPs that are not sup-
ported by positive associations with other SNPs in the
same LD block. To this end, we used genotype data col-
lected within the HapMap project to compute pairwise
measures of LD for all consecutive pairs of SNPs in the
HumanHap300 platform. The estimation of LD was based
on a novel Bayesian version of D' that we introduced in
[36]. Like the traditional D', our Bayesian estimator is
defined in the interval [0, 1] regardless of the allele frequency,
so it is easier to interpret than other measures
of correlation such as r², but it is much less biased toward
disequilibrium than the traditional D'. We use a Bayesian D' > 0.7 between pairs of
consecutive SNPs as suggestive of strong LD, and we filter
out the association of any SNP whose adjacent SNPs
in strong LD with it are not associated with the phenotype.
The value 0.7 was chosen based on experiments
reported in [36] showing that the Bayesian D' rarely
exceeds 0.7 under no LD. The Bayesian D' values for each
pair of consecutive SNPs were built for Caucasians using
the DNA samples from unrelated parents of thirty trios of
the CEPH (Utah residents with ancestry from northern
and western Europe, also known as CEU) and similarly
for Africans, using Yoruba in Ibadan Nigeria. These data
are available from the supplementary material web site
[28].
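
The LD filter can be sketched as a pass over SNPs in chromosomal order. The code below rests on two assumptions that go beyond the text: a SNP with no neighbour in strong LD is kept, and a SNP with strong-LD neighbours is kept when at least one of them is also associated. The function name and input layout are hypothetical.

```python
def ld_filter(associated, d_prime, threshold=0.7):
    """Keep associations supported by an adjacent SNP in strong LD.

    associated: list of bool, one per SNP, in chromosomal order.
    d_prime:    list of len(associated) - 1 values; d_prime[i] is the Bayesian D'
                between SNP i and SNP i + 1 (None when it is not available).
    """
    n = len(associated)
    kept = []
    for i in range(n):
        if not associated[i]:
            kept.append(False)
            continue
        support = []  # association status of neighbours in strong LD
        if i > 0 and d_prime[i - 1] is not None and d_prime[i - 1] > threshold:
            support.append(associated[i - 1])
        if i < n - 1 and d_prime[i] is not None and d_prime[i] > threshold:
            support.append(associated[i + 1])
        # Assumption: no strong-LD neighbour means nothing contradicts the signal.
        kept.append(not support or any(support))
    return kept


calls = [True, True, False, True, False]
ld = [0.9, 0.2, 0.1, 0.95]
print(ld_filter(calls, ld))
```

In this toy example the fourth SNP is dropped because its only strong-LD neighbour shows no association.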

Regional association filter
The rationale of this filter is that a region or gene showing
authentic association would be expected to show a greater
number of SNPs associated than would be expected by
chance. In this filter, we analyze the data using a sliding
window of 20 SNPs, and summarize the global measure
of association within the window as the product of the
posterior probabilities of associations of the 20 SNPs.
Here, we assume that the association tests are independ-
ent, so that the product of the posterior probability of
association of the individual SNPs becomes a measure of
the global association of the region tagged by the 20 SNPs.
Windows with a global measure of association exceeding
0.5 [or product of BFs >1] are then selected for further
inspection. Although LD between SNPs in a window may
introduce dependencies, the global measure of associa-
tion does not seem to be affected by this approximation.
Figure 5 shows some examples.
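
A sliding-window version of this filter can be sketched as follows. The sketch works with the product of Bayes factors (accumulated as a sum of logarithms to avoid overflow) and, as an assumption not spelled out in the text, converts that product into a global probability as BF/(1 + BF) under equal prior odds, which makes the two thresholds (0.5 and 1) coincide. Names are hypothetical.

```python
import math


def regional_hits(bayes_factors, window=20):
    """Return (start_index, global_probability) for windows whose BF product exceeds 1."""
    log_bf = [math.log(bf) for bf in bayes_factors]
    hits = []
    for start in range(len(log_bf) - window + 1):
        total = sum(log_bf[start:start + window])   # log of the product of BFs
        if total > 0.0:                             # product of BFs > 1
            probability = 1.0 / (1.0 + math.exp(-total))
            hits.append((start, probability))
    return hits


# Toy region: 20 SNPs with weak evidence against, then 20 with evidence for association.
bfs = [0.5] * 20 + [4.0] * 20
hits = regional_hits(bfs)
print(hits[0], len(hits))
```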

Hierarchical summary
The SNPs that are selected by the association test
and the filters are labeled as "significant SNPs". The list is
annotated by the SNP physical position, the position rel-
ative to known genes, the allele frequencies estimated in
different populations, and the cytogenetic band. This
information is collected through SNPPer [54] that inte-
grates information from the UCSC human genome
browser and dbSNP [55]. As a further level of summary we
use those SNPs that are linked to genes to create a set of
selected genes and a set of significant genes. The first set
consists of genes that are tagged by at least one significant
SNP. The set of significant genes is a subset of the selected
genes and consists of genes in which the global measure
of association given by the product of the posterior prob-
ability of association of the gene tagging SNPs is greater
than 0.5 [or, equivalently, the product of BFs exceeds 1].

Ranking
We rank the significant genes by the global measure of
association. To rank the selected genes we score them by
two further measures that weigh the likelihood of select-
ing a gene by chance. In fact, there are genes that are
tagged by a large number of SNPs: for example CSMD1 in
chromosome 8 is tagged by 614 SNPs in the
HumanHap300 array, and assuming a 5% false positive
rate, we would expect about 30 SNPs to be selected from
this gene by chance in any analysis. To take this issue into
account, each selected gene is assigned 3 scores: the global
measure of association, the ratio of selected SNPs relative
to the number of tagging SNPs, and the probability of
selecting this number by chance using the hyper-geomet-
ric distribution. Each score determines a ranking and then
the sum of the ranks is defined as a final ranking of
selected genes.
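
The rank-sum scoring can be illustrated with a small sketch. For CSMD1, 614 tagging SNPs at a 5% false positive rate give 614 x 0.05 ≈ 30.7 SNPs expected by chance, which is why the chance probability matters. The code assumes an upper-tail hypergeometric probability (at least the observed number of selected SNPs among the gene's tagging SNPs) and breaks ties by input order; both choices, and all names, are illustrative rather than taken from the original implementation.

```python
from math import comb


def hypergeom_tail(k, m, K, N):
    """P(X >= k) when m SNPs are drawn from N, of which K are selected overall."""
    upper = min(m, K)
    num = sum(comb(K, x) * comb(N - K, m - x) for x in range(k, upper + 1))
    return num / comb(N, m)


def ranks(values, descending):
    """Rank 1 = best; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


def rank_genes(genes, total_selected, total_snps):
    """genes maps name -> (global_measure, n_selected_snps, n_tagging_snps)."""
    names = list(genes)
    measure = [genes[g][0] for g in names]
    ratio = [genes[g][1] / genes[g][2] for g in names]
    chance = [hypergeom_tail(genes[g][1], genes[g][2], total_selected, total_snps)
              for g in names]
    rank_sum = [a + b + c for a, b, c in
                zip(ranks(measure, True), ranks(ratio, True), ranks(chance, False))]
    order = sorted(range(len(names)), key=lambda i: rank_sum[i])
    return [names[i] for i in order]


toy = {"G1": (0.9, 3, 5), "G2": (0.6, 2, 50), "G3": (0.7, 1, 4)}
print(rank_genes(toy, total_selected=10, total_snps=100))
```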

Figure 5
Relation between the pattern of LD (x-axis) and the global measure of association (y-axis) in the regional filter. The pattern of
LD is measured by the average of the Bayes D' between consecutive SNPs in the region, and the global measure of association
is the joint probability of association in the region. The two panels in the top half show the relation using data from the study
of fetal hemoglobin in the sickle cell anemia subjects. The two panels in the bottom half show the relation using data from the
longevity study. The different extent of LD reflects the fact that sickle cell anemia subjects are all African American while centenarians
in the longevity study are all Caucasians. The correlations in the four sets are 0.03, 0.18, 0.018, -0.10.

Gene set enrichment analysis
To evaluate selected and/or significant genes for enrichment
of biological categories associated with a variable
phenotype, we implement a stand-alone version of the
EASE statistical software [56]. This program computes a
modified Fisher's exact probability score for observing the
frequency of a biological category associated with a phenotype
(e.g., dementia, sickle phenotype, infection), compared
with the likelihood of identifying that category by
chance given the total number of genes in the data set. An
adjusted score is then reported representing the upper
bound of the distribution of Jackknife Fisher exact probabilities
for observing an enriched biological category.
Enriched categories are then inspected for biological
trends and overlapping or related categories, based on significance
scores or categories with a p-value << 0.05. For
more detail, see Hosack et al. [56].
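
The idea of a jackknifed Fisher exact probability can be illustrated with a one-sided hypergeometric test in which one category gene is removed from the gene list before the probability is computed. This is a hedged sketch of that idea only, not the EASE program itself; the exact table used by EASE should be taken from Hosack et al. [56], and all names and counts below are hypothetical.

```python
from math import comb


def fisher_upper_tail(list_hits, list_total, pop_hits, pop_total):
    """One-sided Fisher exact P(X >= list_hits) via the hypergeometric distribution."""
    upper = min(list_total, pop_hits)
    num = sum(comb(pop_hits, x) * comb(pop_total - pop_hits, list_total - x)
              for x in range(list_hits, upper + 1))
    return num / comb(pop_total, list_total)


def jackknifed_score(list_hits, list_total, pop_hits, pop_total):
    """Fisher probability after removing one category gene from the list.

    Assumption: the removed gene lowers both the hit count and the list size by one.
    """
    if list_hits < 1:
        return 1.0
    return fisher_upper_tail(list_hits - 1, list_total - 1, pop_hits, pop_total)


# Hypothetical counts: 12 of 100 listed genes fall in a category holding 50 of 1000 genes.
plain = fisher_upper_tail(12, 100, 50, 1000)
penalized = jackknifed_score(12, 100, 50, 1000)
print(plain < penalized)
```

The penalized score is never smaller than the plain Fisher probability, so categories supported by a single gene are not reported as enriched.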

Authors' contributions
PS developed and implemented the analytic method, conceived
the evaluation and drafted the manuscript. ZZ contributed
to the implementation and evaluation. MAG, AR,
SH and AS developed the support material for the various filters
and hierarchical summary. AD helped to design and
conduct the experiments with the CEPH samples. MM
contributed to the development of the hierarchical
method and the interpretation of the analysis results. EM
and CB carried out the pooling-based GW genotyping.
DT, TTP and MS contributed to the development and evaluation
of the method using the longevity study and the HbF
study. All authors read and approved the final manuscript.

Table 5: Parameters and allele frequencies from pooled DNA samples

                      Allele A                     Allele B
Cases (one pool)      n1A = p(A|cases)*(2n1)       n1B = p(B|cases)*(2n1)
Controls (one pool)   n2A = p(A|controls)*(2n2)    n2B = p(B|controls)*(2n2)

Appendix
Derivation of the Bayes Factor
This Bayesian association test assumes that the allele counts
$n_{ij}$ follow a binomial distribution with probabilities
$\theta_{ij}$, where the index i = 1 for cases and i = 2 for controls, and
the index j = A,B denotes the A or B allele [see Table 5 for
an example]. Under the hypothesis of general association,
the parameters $\theta_{ij}$ describing the allele distributions in
cases and controls follow different probability distributions,
while the parameters $\theta_{ij}$ follow the same probability
distribution under the hypothesis of no association.
Therefore, the likelihood function under the hypothesis
of general association $M_a$ is

$$M_a: \; p(n_{ij} \mid \theta_{ij}) \propto \theta_{1A}^{n_{1A}} (1 - \theta_{1A})^{n_{1B}} \, \theta_{2A}^{n_{2A}} (1 - \theta_{2A})^{n_{2B}}$$

while the likelihood function under the hypothesis of no
association $M_i$ is

$$M_i: \; p(n_{ij} \mid \theta_{j}) \propto \theta_{A}^{n_{A}} (1 - \theta_{A})^{n_{B}}$$

where $n_A = n_{1A} + n_{2A}$ and $n_B = n_{1B} + n_{2B}$. We assume independent Beta distributions to model the
prior distributions of the parameters, which are defined as

$$M_a: \; p(\theta_{ij}) \propto \theta_{1A}^{a_{1A}-1} (1 - \theta_{1A})^{b_{1B}-1} \, \theta_{2A}^{a_{2A}-1} (1 - \theta_{2A})^{b_{2B}-1}$$

and

$$M_i: \; p(\theta_{j}) \propto \theta_{A}^{a_{A}-1} (1 - \theta_{A})^{b_{B}-1}$$

where the hyper-parameters are chosen as $a_{1A} = a_{2A} = a_A/2 = \alpha/4$ and
$b_{1B} = b_{2B} = b_B/2 = \alpha/4$. The parameter $\alpha$ is the overall prior
precision and can be set based on prior information. The
likelihood function and the prior distribution of the
parameters are used to compute the marginal likelihood
as the expected likelihood function, where the expectation
is taken over the parameter distribution. Formally,

$$p(n_{ij} \mid M) = \int p(n_{ij} \mid \theta_{ij}) \, p(\theta_{ij}) \, d\theta_{ij}$$

Compared to the maximum likelihood, which returns the
likelihood function evaluated at the estimate of the
parameters, the marginal likelihood incorporates the
uncertainty about the parameters by averaging the likelihood
function over different parameter values. This conceptual
difference is fundamental to understanding the
different approaches to model selection: in the classical
framework, model selection is based on the maximized
likelihood and its sampling distribution to take into
account sampling variability, for fixed parameter values.
In the Bayesian framework, model selection is based on
the marginal likelihood, which takes into account the
parameter variability, for fixed sample values. Therefore,
no significance testing is performed when using this
approach to model selection. In our experience, the
Bayesian procedure for model selection is usually
more robust to false associations. The calculation of the
marginal likelihood can be done in closed form and it is
easy to show that

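$$p(n_{ij} \mid M_a) = \frac{\Gamma(a_{1A}+b_{1B})\,\Gamma(a_{2A}+b_{2B})}{\Gamma(a_{1A}+b_{1B}+n_1)\,\Gamma(a_{2A}+b_{2B}+n_2)} \cdot \frac{\Gamma(a_{1A}+n_{1A})\,\Gamma(b_{1B}+n_{1B})\,\Gamma(a_{2A}+n_{2A})\,\Gamma(b_{2B}+n_{2B})}{\Gamma(a_{1A})\,\Gamma(a_{2A})\,\Gamma(b_{1B})\,\Gamma(b_{2B})}$$

$$p(n_{ij} \mid M_i) = \frac{\Gamma(\alpha)}{\Gamma(\alpha+n)} \cdot \frac{\Gamma(a_{1A}+a_{2A}+n_A)\,\Gamma(b_{1B}+b_{2B}+n_B)}{\Gamma(a_{1A}+a_{2A})\,\Gamma(b_{1B}+b_{2B})}$$

where $\Gamma$ is the gamma function, $n_i = n_{iA} + n_{iB}$, $n = n_1 + n_2$,
$n_A = n_{1A} + n_{2A}$ and $n_B = n_{1B} + n_{2B}$, and the ratio of the two
expressions produces the Bayes factor.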
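
The closed-form Bayes factor can be evaluated on the log scale with the log-gamma function, which avoids overflow for realistic pool sizes. The sketch below follows the two marginal likelihoods term by term with hyper-parameters $\alpha/4$; the default $\alpha = 1$ and the function name are assumptions made for illustration, since the text leaves the prior precision to be set from prior information.

```python
from math import lgamma, log


def log_bayes_factor(n1A, n1B, n2A, n2B, alpha=1.0):
    """Log of p(n | M_a) / p(n | M_i) for reconstructed allele counts."""
    a1A = a2A = b1B = b2B = alpha / 4.0
    n1, n2 = n1A + n1B, n2A + n2B          # counts per pool
    nA, nB = n1A + n2A, n1B + n2B          # counts per allele
    log_m_assoc = (lgamma(a1A + b1B) + lgamma(a2A + b2B)
                   - lgamma(a1A + b1B + n1) - lgamma(a2A + b2B + n2)
                   + lgamma(a1A + n1A) + lgamma(b1B + n1B)
                   + lgamma(a2A + n2A) + lgamma(b2B + n2B)
                   - lgamma(a1A) - lgamma(a2A) - lgamma(b1B) - lgamma(b2B))
    log_m_indep = (lgamma(alpha) - lgamma(alpha + n1 + n2)
                   + lgamma(a1A + a2A + nA) + lgamma(b1B + b2B + nB)
                   - lgamma(a1A + a2A) - lgamma(b1B + b2B))
    return log_m_assoc - log_m_indep


# Counts as in Table 5: n_iA = p(A | pool i) * (2 * n_i); values here are hypothetical.
same = log_bayes_factor(50, 50, 50, 50)       # identical frequencies in both pools
different = log_bayes_factor(90, 10, 10, 90)  # very different frequencies
print(same < 0.0, different > log(3.0))
```

Because the reconstructed counts need not be integers, working with `lgamma` rather than factorials is convenient here.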

Acknowledgements
Supported by NHLBI grants R21 HL080463 (PS); R01 HL68970 (MHS);
K24 AG025727 (TP); K23 AG026754 (DT). We thank the anonymous
reviewers and editors for their helpful suggestions.

References
1. Genome-wide association study of 14,000 cases of seven
common diseases and 3,000 shared controls. Nature 2007,
447(7145):661-678.
2. Christensen K, Murray JC: What genome-wide association studies
can do for medicine. N Engl J Med 2007, 356(11):1094-1097.
3. de Bakker PI, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D:
Efficiency and power in genetic association studies. Nat Genet
2005, 37(11):1217-1223.
4. Fan JB, Chee MS, Gunderson KL: Highly parallel genomic assays.
Nat Rev Genet 2006, 7(8):632-644.
5. Craig DW, Huentelman MJ, Hu-Lince D, Zismann VL, Kruer MC, Lee
AM, Puffenberger EG, Pearson JM, Stephan DA: Identification of
disease causing loci using an array-based genotyping
approach on pooled DNA. BMC Genomics 2005, 6:138.
6. Melquist S, Craig DW, Huentelman MJ, Crook R, Pearson JV, Baker
M, Zismann VL, Gass J, Adamson J, Szelinger S, Corneveaux J, Cannon
A, Coon KD, Lincoln S, Adler C, Tuite P, Calne DB, Bigio EH, Uitti RJ,
Wszolek ZK, Golbe LI, Caselli RJ, Graff-Radford N, Litvan I, Farrer
MJ, Dickson DW, Hutton M, Stephan DA: Identification of a novel
risk locus for progressive supranuclear palsy by a pooled
genomewide scan of 500,288 single-nucleotide polymorphisms.
Am J Hum Genet 2007, 80(4):769-778.
7. Hanson RL, Craig DW, Millis MP, Yeatts KA, Kobes S, Pearson JV, Lee
AM, Knowler WC, Nelson RG, Wolford JK: Identification of
PVT1 as a candidate gene for end-stage renal disease in type
2 diabetes using a pooling-based genome-wide single nucle-
otide polymorphism association study. Diabetes 2007,
56(4):975-983.
8. Steer S, Abkevich V, Gutin A, Cordell HJ, Gendall KL, Merriman ME,
Rodger RA, Rowley KA, Chapman P, Gow P, Harrison AA, Highton
J, Jones PB, O'Donnell J, Stamp L, Fitzgerald L, Iliev D, Kouzmine A,
Tran T, Skolnick MH, Timms KM, Lanchbury JS, Merriman TR:
Genomic DNA pooling for whole-genome association scans
in complex disease: empirical demonstration of efficacy in
rheumatoid arthritis. Genes Immun 2007, 8(1):57-68.
9. Meaburn EL, Harlaar N, Craig IW, Schalkwyk LC, Plomin R: Quanti-
tative trait locus association scan of early reading disability
and ability using pooled DNA and 100K SNP microarrays in
a sample of 5760 children. Mol Psychiatry 2007.
10. Lavebratt C, Sengul S: Single nucleotide polymorphism (SNP)
allele frequency estimation in DNA pools using Pyrosequencing.
Nat Protoc 2006, 1(6):2573-2582.
11. Wilkening S, Chen B, Wirtenberger M, Burwinkel B, Forsti A, Hemminki
K, Canzian F: Allelotyping of pooled DNA with 250 K
SNP microarrays. BMC Genomics 2007, 8:77.
12. Barratt BJ, Payne F, Rance HE, Nutland S, Todd JA, Clayton DG: Iden-
tification of the sources of error in allele frequency estima-
tions from pooled DNA indicates an optimal experimental
design. Ann Hum Genet 2002, 66(Pt 5-6):393-405.
13. Docherty SJ, Butcher LM, Schalkwyk LC, Plomin R: Applicability of
DNA pools on 500 K SNP microarrays for cost-effective ini-
tial screens in genomewide association studies. BMC Genomics
2007, 8:214.
14. Meaburn E, Butcher LM, Schalkwyk LC, Plomin R: Genotyping
pooled DNA using 100K SNP microarrays: a step towards
genomewide association scans. Nucleic Acids Res 2006,
34(4):e27.
15. Lips EH, Dierssen JW, van Eijk R, Oosting J, Eilers PH, Tollenaar RA,
de Graaf EJ, van't Slot R, Wijmenga C, Morreau H, van Wezel T: Reli-
able high-throughput genotyping and loss-of-heterozygosity
detection in formalin-fixed, paraffin-embedded tumors using
single nucleotide polymorphism arrays. Cancer Res 2005,
65(22):10188-10191.
16. Benjamini Y, Hochberg Y: Controlling the false discovery rate -
a practical and powerful approach to multiple testing. J Roy
Stat Soc B Met 1995, 57(1):289-300.
17. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays
applied to the ionizing radiation response. Proc Natl
Acad Sci U S A 2001, 98(9):5116-5121.
18. Roeder K, Bacanu SA, Wasserman L, Devlin B: Using linkage
genome scans to improve power of association in genome
scans. Am J Hum Genet 2006, 78(2):243-252.
19. Balding DJ: A tutorial on statistical methods for population
association studies. Nat Rev Genet 2006, 7(10):781-791.
20. International HapMap Consortium: A haplotype map of the
human genome. Nature 2005, 437(7063):1299-1320.
21. The Gene Ontology (GO) project in 2006. Nucleic Acids Res
2006, 34(Database issue):D322-6.
22. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and
genomes. Nucleic Acids Res 2000, 28(1):27-30.
23. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M,
Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics
to chemical genomics: new developments in KEGG. Nucleic
Acids Res 2006, 34(Database issue):D354-7.
24. Aulchenko YS, Ripke S, Isaacs A, van Duijn CM: GenABEL: an R
library for genome-wide association analysis. Bioinformatics
2007, 23(10):1294-1296.
25. Clayton D, Leung HT: An R package for analysis of whole-
genome association studies. Hum Hered 2007, 64(1):45-51.
26. Elston RC, Lin DY, Zheng G: Multistage Sampling for Genetic
Studies. Annu Rev Genomics Hum Genet 2007.
27. Sham P, Bader JS, Craig I, O'Donovan M, Owen M: DNA Pooling: a
tool for large-scale association studies. Nat Rev Genet 2002,
3(11):862-871.
28. Supplementary material [http://www.bu.edu/sicklecell/downloads/Projects/web supplement pooling/index.html]


29. Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F, Yang H, Ch'ang
LY, Huang W, Liu B, Shen Y, Tam PK, Tsui LC, Waye MM, Wong JT,
Zeng C, Zhang Q, Chee MS, Galver LM, Kruglyak S, Murray SS, Oliphant
AR, Montpetit A, Hudson TJ, Chagnon F, Ferretti V, Leboeuf M,
Phillips MS, Verner A, Kwok PY, Duan S, Lind DL, Miller RD, Rice JP,
Saccone NL, Taillon-Miller P, Xiao M, Nakamura Y, Sekine A,
Sorimachi K, Tanaka T, Tanaka Y, Tsunoda T, Yoshino E, Bentley DR,
Deloukas P, Hunt S, Powell D, Altshuler D, Gabriel SB, Zhang H, Mat-
suda I, Fukushima Y, Macer DR, Suda E, Rotimi CN, Adebamowo CA,
Aniagwu T, Marshall PA, Matthew O, Nkwodimmah C, Royal CD,
Leppert MF, Dixon M, Stein LD, Cunningham F, Kanani A, Thorisson
GA, Chakravarti A, Chen PE, Cutler DJ, Kashuk CS, Donnelly P, Mar-
chini J, McVean GA, Myers SR, Cardon LR, Abecasis GR, Morris A,
Weir BS, Mullikin JC, Sherry ST, Feolo M, Daly MJ, Schaffner SF, Qiu
R, Kent A, Dunston GM, Kato K, Niikawa N, Knoppers BM, Foster
MW, Clayton EW, Wang VO, Watkin J, Sodergren E, Weinstock GM,
Wilson RK, Fulton LL, Rogers J, Birren BW, Han H, Wang H, God-
bout M, Wallenburg JC, L'Archeveque P, Bellemare G, Todani K,
Fujita T, Tanaka S, Holden AL, Lai EH, Collins FS, Brooks LD, McEwen
JE, Guyer MS, Jordan E, Peterson JL, Spiegel J, Sung LM, Zacharia LF,
Kennedy K, Dunn MG, Seabrook R, Shillito M, Skene B, Stewart JG,
Valle DL, Jorde LB, Cho MK, Duster T, Jasperse M, Licinio J, Long JC,
Ossorio PN, Spallone P, Terry SF, Lander ES, Nickerson DA, Boehnke
M, Douglas JA, Hudson RR, Kruglyak L, Nussbaum RL: The Interna-
tional HapMap Project. Nature 2003, 426(6968):789-796.
30. Perls T, Terry D: Understanding the determinants of excep-
tional longevity. Ann Intern Med 2003, 139(5 Pt 2):445-449.
31. Steinberg MH: Predicting clinical severity in sickle cell anaemia.
Br J Haematol 2005, 129(4):465-481.
32. Steinberg MH, Barton F, Castro O, Pegelow CH, Ballas SK, Kutlar A,
Orringer E, Bellevue R, Olivieri N, Eckman J, Varma M, Ramirez G,
Adler B, Smith W, Carlos T, Ataga K, DeCastro L, Bigelow C, Saun-
thararajah Y, Telfer M, Vichinsky E, Claster S, Shurin S, Bridges K,
Waclawiw M, Bonds D, Terrin M: Effect of hydroxyurea on mortality
and morbidity in adult sickle cell anemia: risks and benefits
up to 9 years of treatment. JAMA 2003, 289(13):1645-1651.
33. Ma Q, Baldwin CT, Safaya S, Kutlar A, Farrer LA, Steinberg MH: Fetal
hemoglobin in sickle cell anemia: association with single
nucleotide polymorphisms in TOX (8q12). Hum Mol Genet
2007.
34. Wyszynski DF, Baldwin CT, Cleves MA, Amirault Y, Nolan VG, Far-
rell JJ, Bisbee A, Kutlar A, Farrer LA, Steinberg MH: Polymorphisms
near a chromosome 6q QTL area are associated with mod-
ulation of fetal hemoglobin levels in sickle cell anemia. Cell
Mol Biol (Noisy-le-grand) 2004, 50(1):23-33.
35. Sebastiani P, Nolan VG, Baldwin CT, Abad-Grau MM, Wang L, Ade-
woye AH, McMahon LC, Farrer LA, Taylor JG, Kato GJ, Gladwin MT,
Steinberg MH: A network model to predict the risk of death in
sickle cell disease. Blood 2007, 110(7):2727-2735.
36. Sebastiani P, Abad-Grau MM: Bayesian estimates of linkage dis-
equilibrium. BMC Genet 2007, 8(1):36.
37. Arking R, Butler B, Chiko B, Fossel M, Gavrilov LA, Morley JE,
Olshansky SJ, Perls T, Walker RF: Anti-aging teleconference:
what is anti-aging medicine? J Anti Aging Med 2003, 6(2):91-106.
38. Barzilai N, Atzmon G, Schechter C, Schaefer EJ, Cupples AL, Lipton
R, Cheng S, Shuldiner AR: Unique lipoprotein phenotype and
genotype associated with exceptional longevity. Jama 2003,
290(15):2030-2040.
39. Christensen K, Johnson TE, Vaupel JW: The quest for genetic
determinants of human longevity: challenges and insights.
Nat Rev Genet 2006, 7(6):436-448.
40. Human aging genomic resources [http://genomics.senescence.info/genes/longevity.html]
41. Lombard DB, Beard C, Johnson B, Marciniak RA, Dausman J, Bronson
R, Buhlmann JE, Lipman R, Curry R, Sharpe A, Jaenisch R, Guarente L:
Mutations in the WRN gene in mice accelerate mortality in
a p53-null background. Mol Cell Biol 2000, 20(9):3286-3291.
42. Petit I, Szyper-Kravitz M, Nagler A, Lahav M, Peled A, Habler L, Pon-
omaryov T, Taichman RS, Arenzana-Seisdedos F, Fujii N, Sandbank J,
Zipori D, Lapidot T: G-CSF induces stem cell mobilization by
decreasing bone marrow SDF-1 and up-regulating CXCR4.
Nat Immunol 2002, 3(7):687-694.
43. Levesque JP, Hendy J, Takamatsu Y, Simmons PJ, Bendall LJ: Disruption
of the CXCR4/CXCL12 chemotactic interaction during
hematopoietic stem cell mobilization induced by GCSF or
cyclophosphamide. J Clin Invest 2003, 111(2):187-196.




44. Harada M, Qin Y, Takano H, Minamino T, Zou Y, Toko H, Ohtsuka
M, Matsuura K, Sano M, Nishi J, Iwanaga K, Akazawa H, Kunieda T,
Zhu W, Hasegawa H, Kunisada K, Nagai T, Nakaya H, Yamauchi-Takihara
K, Komuro I: G-CSF prevents cardiac remodeling after
myocardial infarction by activating the Jak-Stat pathway in
cardiomyocytes. Nat Med 2005, 11(3):305-311.
45. Leone AM, Rutella S, Bonanno G, Contemi AM, de Ritis DG, Giannico
MB, Rebuzzi AG, Leone G, Crea F: Endogenous G-CSF and
CD34+ cell mobilization after acute myocardial infarction.
Int J Cardiol 2006, 111(2):202-208.
46. Capoccia BJ, Shepherd RM, Link DC: G-CSF and AMD3100 mobi-
lize monocytes into the blood that stimulate angiogenesis in
vivo through a paracrine mechanism. Blood 2006,
108(7):2438-2445.
47. Gao Y, Ferguson DO, Xie W, Manis JP, Sekiguchi J, Frank KM, Chaud-
huri J, Horner J, DePinho RA, Alt FW: Interplay of p53 and DNA-
repair protein XRCC4 in tumorigenesis, genomic stability
and development. Nature 2000, 404(6780):897-900.
48. Niedernhofer LJ, Garinis GA, Raams A, Lalai AS, Robinson AR, Appel-
doorn E, Odijk H, Oostendorp R, Ahmad A, van Leeuwen W, Theil
AF, Vermeulen W, van der Horst GT, Meinecke P, Kleijer WJ, Vijg J,
Jaspers NG, Hoeijmakers JH: A new progeroid syndrome reveals
that genotoxic stress suppresses the somatotroph axis.
Nature 2006, 444(7122):1038-1043.
49. Steinberg MH, Lu ZH, Barton FB, Terrin ML, Charache S, Dover GJ:
Fetal hemoglobin in sickle cell anemia: determinants of
response to hydroxyurea. Multicenter Study of Hydroxyurea.
Blood 1997, 89(3):1078-1088.
50. Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH:
Genetic dissection and prognostic modeling of overt stroke
in sickle cell anemia. Nat Genet 2005, 37(4):435-440.
51. Kass RE, Raftery AE: Bayes factors. J Am Statist Assoc 1995,
90:773-795.
52. Zou G, Zhao H: The impacts of errors in individual genotyping
and DNA pooling on association studies. Genet Epidemiol 2004,
26(1):1-10.
53. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender
D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: A Tool
Set for Whole-Genome Association and Population-Based
Linkage Analyses. Am J Hum Genet 2007, 81(3):559-575.
54. SNPPer [http://snpper.chip.org/]
55. Riva A, Kohane IS: A SNP-centric database for the investiga-
tion of the human genome. BMC Bioinformatics 2004, 5:33.
56. Hosack DA, Dennis G Jr., Sherman BT, Lane HC, Lempicki RA: Iden-
tifying biological themes within lists of genes with EASE.
Genome Biol 2003, 4(10):R70.













