Algorithms for Molecular Biology nioVedCetra
Research
Modeling genetic imprinting effects of DNA sequences with
multilocus polymorphism data
Sheron Wen', Chenguang Wang', Arthur Berg', Yao Li1, Myron M Chang2,
Roger B Fillingim3, Margaret R Wallace4, Roland Staud4, Lee Kaplan4 and
Rongling Wu*1,5,6
Address: 'Department of Statistics, University of Florida, Gainesville, Florida 32611, USA, 2Department of Epidemiology and Health Policy
Research, University of Florida, Gainesville, Florida 32611, USA, 3Department of Community Dentistry and Behavioral Science, University of
Florida, Gainesville, Florida 32611, USA, 4Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, Florida 32611,
USA, 5Department of Public Health Sciences, Pennsylvania State College of Medicine, Hershey, Pennsylvania 17033, USA and 6Department of
Statistics, Pennsylvania State University, University Park, Pennsylvania 16802, USA
Email: Sheron Wen xwen4@ufl.edu; Chenguang Wang cgwang@cog.ufl.edu; Arthur Berg berg@ufl.edu; Yao Li li@ufl.edu;
Myron M Chang mchang@cog.ufl.edu; Roger B Fillingim rfilling@ufl.edu; Margaret R Wallace peggyw@ufl.edu;
Roland Staud staudr@ufl.edu; Lee Kaplan lee@ufl.edu; Rongling Wu* rwu@hes.hmc.psu.edu
* Corresponding author
Published: I I August 2009
Algorithms for Molecular Biology 2009, 4:11 doi:10.1 186/1748718841 I
Received: 4 February 2009
Accepted: II August 2009
This article is available from: http://www.almob.org/content/4/I/I I
2009 Wen et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Single nucleotide polymorphisms (SNPs) represent the most widespread type of DNA sequence
variation in the human genome and they have recently emerged as valuable genetic markers for
revealing the genetic architecture of complex traits in terms of nucleotide combination and
sequence. Here, we extend an algorithmic model for the haplotype analysis of SNPs to estimate
the effects of genetic imprinting expressed at the DNA sequence level. The model provides a
general procedure for identifying the number and types of optimal DNA sequence variants that are
expressed differently due to their parental origin. The model is used to analyze a genetic data set
collected from a pain genetics project. We find that DNA haplotype GAC from three SNPs,
OPRKG36T (with two alleles G and T), OPRKA843G (with alleles A and G), and OPRKC846T
(with alleles C and T), at the kappaopioid receptor, triggers a significant effect on pain sensitivity,
but with expression significantly depending on the parent from which it is inherited (p = 0.008).
With a tremendous advance in SNP identification and automated screening, the model founded on
haplotype discovery and statistical inference may provide a useful tool for genetic analysis of any
quantitative trait with complex inheritance.
Background
In diploid organisms, there are two copies at every auto
somal gene, one inherited from the maternal parent and
the other from the paternal parent. Both copies are
expressed to affect a trait for a majority of these genes. Yet,
there is also a small subset of genes for which one copy
from a particular parent is turned off. These genes, whose
expression depends on the parent of origin due to the epi
genetic or imprinted mark of one copy in either the egg or
the sperm, have been thought to play an important role in
complex diseases and traits, although imprinted expres
sion can also vary between tissues, developmental stages,
and species [1]. Anomalies derived from imprinted genes
are often manifested as developmental and neurological
Page 1 of 11
(page number not for citation purposes)
Algorithms for Molecular Biology 2009, 4:11
disorders during early development and as cancer later in
life [25].
The mechanisms for genetic imprinting are still incom
pletely known, but they involve epigenetic modifications
that are erased and then reset during the formation of eggs
and sperm. Recent research shows that the parentofori
gin dependent expression of imprinted genes is related
with environmental interactions with the genome [5]. The
phenomenon ofgenomic imprinting is explained from an
evolutionary perspective [6]. Genomic imprinting evolves
in mammals with the advent of live birth through a paren
tal battle between the sexes to control the maternal
expenditure of resources to the offspring [7].
Paternally expressed imprinted genes tend to promote off
spring growth by extracting nutrients from the mother
during pregnancy while it is suppressed by those genes
that are maternally expressed. This genetic battle between
the paternal and maternal parents appears to continue
even after birth [8,9].
The genetic mechanisms for imprinting can be made clear
if the genomic distribution of imprinted genes and their
actions and interactions are studied. Genetic mapping
with molecular markers and linkage maps has been used
to map quantitative trait loci (QTLs) that show parentof
origin effects [1012]. Using an outbred strategy appropri
ate for plants and animals, significant imprinting QTLs
were detected for body composition and body weight in
pigs [13,14], chickens [15] and sheep [16]. Cui et al. [12]
proposed an F2based strategy to map imprinting QTLs by
capitalizing on the difference in the recombination frac
tion between different sexes. More explorations on the
development of imprinting models are given in Cui and
others [12,17]. Liu et al. [18] developed a randomeffect
model for estimating the parentdependent genetic vari
ance of complex traits at imprinting QTLs.
We will propose a statistical model for estimating the
imprinting effects of DNA sequence variants that encode
a complex trait. This model uses widely available single
nucleotide polymorphisms (SNPs) that reside within a
coding sequence of the human genome. The central idea
of this model is to separate maternally and paternally
derived haplotypes (i.e., a linear combination of alleles at
different SNPs on a single chromosome) from observed
genotypes. By specifying one risk haplotype, i.e., one that
operates differently from the rest of haplotypes (called
nonrisk haplotypes), Liu et al. [19] proposed a statistical
method for detecting risk haplotypes for a complex trait
with a random sample drawn from a natural population.
Liu et al.'s approach can be used to characterize DNA
sequence variants that encode the phenotypic value of a
trait. Wu et al. [20] constructed a general multiallelic
model in which any number of risk haplotypes can be
assumed. The best number and combination of risk hap
lotypes can be estimated by using the likelihood and AIC
or BIC values. We will derive a computational algorithm
for estimating the imprinting effects of SNPconstructed
haplotypes with multilocus genetic data based on these
previous workings. The new algorithmic model provides a
general framework for hypothesis tests on the pattern of
genetic imprinting expressed by haplotypes. A real exam
ple from a pain genetic study is used to demonstrate the
application of the model.
Results
The new imprinting model was used to detect the differ
ence of gene expression between the maternal and pater
nal parents at the haplotype level. Genetic and phenotypic
data were from a pain genetics project in which 237 sub
jects (including 143 men and 94 women) from five differ
ent races were sampled. All these subjects were genotyped
for three SNPs, OPRKG36T (rs1051160), with two alleles
G and T, OPRKA843G (rs702764), with alleles A and G,
and OPRKC846T (rs16918875), with alleles C and T, at
the kappaopioid receptor [21]. Three traits of pain sensi
tivity to heat were tested with a procedure described by
Fillingim [22], and they are the average pain rating for
heat stimuli at 49C (PreInt49tot), the increase in heat
pain threshold following administration of 0.5 mg/kg of
pentazocine (an mixed action opioid agonistantagonist
with activity at the kappa receptor) (Hpthpent), and the
increase in heat pain tolerance following administration
of 0.5 mg/kg of pentazocine (an mixed action opioid ago
nistantagonist with activity at the kappa receptor) (Hpto
pent). Prelnt49tot is a baseline pain measure before any
drug administration. Although the model allows the esti
mation of any covariate effect, we will remove the effect
on the pain traits due to different races because of too few
samples for some races. Our final imprinting analysis was
based on 237 subjects for Prelnt49tot but on 85 subjects
for Hpthpent and Hptopent. Sexspecific haplotype fre
quencies for the three SNPs were estimated for males and
females, from which linkage disequilibria were estimated
and tested with results given in Table 1. Of the eight hap
lotypes, haplotype GAC occupies an overwhelming pro
portion in both sexes (>75%). Some haplotypes, like GAT,
TAT, and TGC, are very rare, with population frequencies
tending to be zero. Linkage disequilibria between these
SNPs are generally strong: those between OPRKA843G
and OPRKC846T and between OPRKG36T and
OPRKC846T are highly significant (p < 0.01), although
that between OPRKG36T and OPRKA843G is much less
significant. There is a highly significant highorder linkage
disequilibrium among these three SNPs. It is interesting to
see significant difference in haplotype distribution
between the two sexes (Table 1). This sexspecific differ
ence is due to the difference in linkage disequilibria
Page 2 of 11
(page number not for citation purposes)
hftp://www.almob.org/content/4/1/1 1
Algorithms for Molecular Biology 2009, 4:11
Table I: Sexspecific differences observed in haplotype frequencies and higherorder linkage disequilibria estimates
Female
Genetic Parameter
pvalue
Sexspecific
pvalue
Sexspecific
pvalue
Haplotype Frequency
pGAC 0.780
p GAT 0.000
pGGC 0.124
PGGT 0.011
p TAC 0.049
p TAT 0.000
p TGC 0.008
p TGT 0.030
Allele Frequency and Linkage Disequilibrium
G (OPRKG36T) 0.914
pA (OPRKA843G) 0.828
SA (OPRKC846T) 0.960
D 0.034
D 12
0.026
0.023
0.023
0.021
D 123
Estimates and tests of population genetic structure for three SNPs, OPRKG36T (with two alleles G and T),
and OPRKC846T (with alleles C and T), at the kappaopioid receptor in males and females.
because allele frequencies seem to be mostly sexinvariant
(Table 1).
A "biallelic" model assuming that there is only one haplo
type is used to estimate haplotype effects at the kappaopi
oid receptor on pain traits.
Significant haplotype effects were observed on the three
pain sensitivity traits studied. By assuming a risk haplo
type from all possible haplotypes, we calculated the
resultant likelihood of haplotype effects, which are given
in Table 2.
OPRKA843G (with alleles A and G),
It can be seen that an optimal risk haplotype is GAC for
preint49tot and TAC for Hpthpent and Hptopent. Statisti
cal tests indicate that these risk haplotypes trigger a signif
icant effect on PreInt49tot (p = 0.004) and Hpthpent and
Hptopent (p = 0.025), respectively (Table 3). Risk haplo
type GAC displays a significant additive effect (p = 0.005)
on preint49tot, but there is no dominant effect due to its
interaction with nonrisk haplotypes. It is interesting to
find that risk haplotype GAC contributes to variation in
preint49tot in a parentoforigin manner. The subjects
with risk haplotype GAC inherited from the maternal par
ent will be significantly different in preint49tot than those
with the risk haplotype inherited from the paternal par
Table 2: Haplotype effects are estimated over three pain sensitivity traits
GAC
Prelnt49tot
Hpthpent
Hptopent
764.663
195.107
165.867
GAT
773.196
199.832
170.494
GGC
771.972
198.640
170.468
GGT
770.903
199.460
170.216
TAC
769.645
193.569
157.053
TAT
773.196
199.832
170.494
TGC
772.842
195.345
168.324
TGT
772.677
199.709
170.414
Page 3 of 11
(page number not for citation purposes)
0.764
0.000
0.081
0.023
0.104
0.000
0.000
0.029
0.868
0.868
0.949
0.045
0.022
0.011
0.018
0.0459
4.98e6
8.87es
2.53e4
6.166
3.501
2.943
5.77
42.28
22.216
21.088
0.1812
3.61e6
0.0089
0.0055
0.0130
0.0613
0.0862
0.0163
7.9 1 e I
2.44e6
4.39e6
http://www.almob.org/content/4/1/11
Algorithms for Molecular Biology 2009, 4:11
Table 3: Additive, dominant, imprinting, and overall effects at three SNPs
Risk Haplotype
Prelnt49tot
Hpthpent
Hptopent
Effect
pvalue
Effect
pvalue
Effect
pvalue
GAC
Overall
13.47
0.005
3.06
0.002
2.15
0.003
6.22
0.237
3.08
0.003
2.33
0.003
19.02
0.008
0.26
0.089
0.28
0.621
0.004
0.023
0.025
Estimates of additive (a), dominant (d), and imprinting effects (i) of haplotypes at SNPs, OPRKG36T (with two alleles G and T), OPRKA843G (with
alleles A and G), and OPRKC846T (with alleles C and T), at the kappaopioid receptor, on pain sensitivity traits.
ent. No significant imprinting effects are detected on traits
Hpthpent and Hptopent, although risk haplotype TAC
displays significant additive and dominant effects on
these two traits.
The statistical properties of the imprinting model are
investigated through simulation studies. The first simula
tion mimics the data structure of the real example (with
237 subjects) above based on its estimates of sexspecific
allele frequencies and linkage disequilibria among three
SNPs (Table 1) and of the additive, dominant, and
imprinting effects arising from different haplotypes (for
Prelnt49tot, Table 3). The second simulation includes
sample size from 237 to 400, 800, and 2000, keeping the
other parameters unchanged. Each simulation scheme is
repeated for 1000 runs.
Results from simulation studies are summarized as fol
lows:
(1) Population genetic parameters including three
SNP haplotype frequencies, allele frequencies, and
linkage disequilibria of different orders can be pre
cisely estimated even when a smaller sample size
(237) is used. As expected, the estimation precision
can be improved consistently when the sample size
increases from 237 to 2000.
(2) Quantitative genetic parameters can also be well
estimated, but the better estimation of the dominant
and imprinting effects needs a larger sample size (400
or more). With a sample size of 2000, the precision of
all parameter estimates are remarkably high, with
sampling errors of each estimate being low than 10%
of the estimate.
(3) The power to detect imprinting effects reaches 75%
for a sample size of 237, but it can increase dramati
cally when increasing the sample size to 400.
(4) The type I error rate of detecting the imprinting
effect, i.e, a probability for a false positive discovery of
that effect, is quite low (< 10%) for a small sample size
and can be lowered when sample size increases.
The simulation for testing the type I error rate in (4) was
based on the same parameters as used in (1)(3), except
that no imprinting effect is assumed. Because we have
derived a series of closed forms for the estimation of
parameters within the EM framework, parameter estima
tion is very efficient. For a single simulation run, it will
take a few dozen of seconds to obtain all estimates on a
PC laptop. Also, estimates do not depend heavily on ini
tial values of parameters, showing that the estimates
achieve a global maxima for the likelihood.
Discussion
Although a traditional view assumes that the maternally
and paternally derived alleles of each gene are expressed
simultaneously at a similar level, there are many excep
tions where alleles are expressed from only one of the two
parental chromosomes [1,23]. This socalled genetic
imprinting or parentoforigin effect has been thought to
play a pivotal role in regulating the phenotypic variation
of a complex trait [2427]. With the discovery of more
imprinting genes involved in trait control through molec
ular and bioinformatics approaches, we will be in a posi
tion to elucidate the genetic architecture of quantitative
variation for various organisms including humans.
Genetic mapping in controlled crosses has opened up a
great opportunity for a genomewide search of imprinting
effects by identifying imprinted quantitative trait loci
(iQTLs). This approach has successfully detected iQTLs
that are responsible for body mass and diseases
[10,11,14,19,28,29]. Cloning of these iQTLs will require
highresolution mapping of genes, which is hardly met for
traditional linkage analysis based on the production of
recombinants in experimental crosses. Single nucleotide
polymorphisms (SNPs) are powerful markers that can
explain interindividual differences. Multiple adjacent
SNPs are especially useful to associate phenotypic varia
Page 4 of 11
(page number not for citation purposes)
hftp://www.almob.org/content/4/1/1 1
Algorithms for Molecular Biology 2009, 4:11
ability with haplotypes [3034]. The quantitative effect of
haplotypes on a complex trait was modeled by Liu et al.
[19] and subsequently refined byWu et al. [20].
In this article, we incorporate genetic imprinting into Wu
et al.'s [20] multiallelic model to estimate the number and
combination of multiple functional haplotypes that are
expressed differently depending on the parental origin of
these haplotypes. Because of the modeling of any possible
distinct haplotypes, the multiallelic model will have more
power for detecting significant haplotypes and their
imprinting effects than biallelic models. The imprinting
model was shown to work well in a wide range of param
eter space for a modest sample size. However, a consider
ably large sample size is needed if there are multiple risk
haplotypes that contribute to trait variation. By analyzing
a real example for pain genetics, the new model detects
significant haplotypes composed of three SNPs within the
kappaopioid receptor, which may play an important role
in affecting pain sensitivity to heat.
Haplotype GAC derived from this gene appears to be
imprinted for Prelnt49tot, a pain sensitivity trait to heat
stimuli at 49 C before drug administration, leading to
different levels of pain sensitivity between the patients
when they inherit this haplotype from maternal and
paternal parents. In this example, no imprinting was
detected for Hpthpent and Hptopent, two pain sensitivity
traits measured after the patients were administrated with
pentazocine. This result should be, however, explained
with caution. First, the risk haplotype, TAG, detected for
Hpthpent and Hptopent is a rare haplotype in the
admixed population studied, although it is quite com
mon in African Americans and Hispanics. The significance
of this rare haplotype detected could be a sample size arti
fact, or it could be indicating a powerful haplotype effect.
Second, in a different analysis, no significant genetic asso
ciation was detected for the same heat pain test at heat
stimuli at 52C (data not shown). Nonetheless, the
method provides a powerful tool for detecting possible
associations and imprinting effects, which provide a start
ing point for future work to pursue the positive results
with larger sample sizes and family studies. There have
not been any previous reports suggesting an imprinting
effect at an opioid receptor locus, or related to pain meas
ures.
In practice, although the human genome contains mil
lions of SNPs, it is not possible and also not necessary to
model and analyze these SNPs simultaneously. These
SNPs are often distributed in different haplotype blocks
[35], within each of which a particular (small) number of
representative SNPs or htSNPs can uniquely explain most
of the haplotype variation. A minimal subset of htSNPs,
identified by several computing algorithms, can be imple
mented into our imprinting model to detect their imprint
ing effects at the haplotype level. In addition, our model
can be extended to model imprinting effects in a network
of interactive architecture, including haplotypehaplotype
interactions from different genomic regions, haplotype
environment interactions, and haplotype effects regulat
ing pharmacodynamic reactions of drugs. It can be
expected that all extensions will require expensive compu
tation, but this computing can be made possible if combi
natorial mathematics, graphical models, and machine
learning are incorporated into closed forms of parameter
estimation.
This imprinting model assumes that if the SNPs constitut
ing haplotypes are tightly linked, haplotype frequencies
estimated from the current generation can be used to
approximate haplotype frequencies in the parental gener
ation. To relax this assumption, a strategy of sampling a
panel of random families from a population is required,
in which a known family structure allows the tracing and
estimation of maternally and paternallyderived haplo
types. Such a strategy will help to precisely estimate and
test imprinting effects of haplotypes, providing a new
gateway for studying the genetic architecture of complex
traits.
Methods
Imprinting Model
Consider a set of three ordered SNPs, each with two alleles
1 and 0, genotyped from a candidate gene. These three
SNPs form eight haplotypes, 111, 110, 101, 100, 011,
010, 001, and 000. A risk haplotype group is defined as a
set of haplotypes that are in manner distinct from the
other haplotypes in affecting a complex trait. For example,
if a risk haplotype group only consists of the haplotype
111, the remaining seven haplotypes form the nonrisk
haplotype group; this means that the diplotypes com
posed of 111 will have different genotypic values from
those composed of 110, 101, 100, 011, 010, 001, or 000.
There may be multiple risk haplotype groups, and we let
Rk denote any risk haplotype from the kth risk haplotype
group (k = 1,...,K; K < 8, where K is the number of risk hap
lotype groups) and R0 denote any of the remaining non
risk haplotypes in the nonrisk haplotype group. The com
binations between the risk and nonrisk haplotypes are
called the composite diplotypes, including RkRk. (k < k' =
1,...,K), RkRo and RoRo (any two nonrisk haplotypes).
Here we do not distinguish between RkRk. and Rk.Rk as we
do not know parental origin of the haplotypes, however
genetic imprinting implies that the same composite diplo
type may function differently, depending on the parental
origin of its underlying haplotypes. To reflect the parental
origin of haplotypes in the composite diplotype, the fol
lowing notation is used: RkRk' (k, k' = 1 .....K), RkRo,
RoRk, and RoRo, where the vertical lines are used to sepa
Page 5 of 11
(page number not for citation purposes)
hftp://www.almob.org/content/4/1/1 1
Algorithms for Molecular Biology 2009, 4:11
rate the haplotypes derived from the maternal parent
(left) and paternal parent (right).
According to traditional quantitative genetic principles,
the genotypic value of a given composite diplotype is par
titioned into different components due to additive and
dominance genetic effects. For an imprinting model, an
additional component is the imprinting genetic effect due
to different contributions of haplotypes from the mater
nal and paternal parents. Mathematically, the genotypic
value of a composite diplotype of known parental origins
is expressed as
uPk =p + ak,
for composite diplotype Rk I Rk
1 k'k
kk' = Pf + (ak + a') + dkk' + ik kk
2 k'k
for composite diplotype Rk I Rk
K
Pko = I ak' + dko + ko
k'#k
for composite diplotype Rk I Ro
K
ok P + dko iko
2 h'z,
k'k
for composite diplotype Ro I Rk
K
Poo =P ak
for composite diplotype Ro I Ro
where p is the overall mean, ak is the additive effect due to
the substitution of risk haplotype k by the nonrisk haplo
type, I is the dominance effect due to the interaction
between risk haplotypes h and k'( = ). ,,, is the dom
inance effect due to the interaction between risk haplo
types k and the nonrisk haplotype ( ,, = 1,,,), ik' is the
imprinting effects due to different parental origins of risk
haplotypes k and k'(ik' = ik'), and i'o is the imprinting
effect due to different parental origins of risk haplotype k
and the nonrisk haplotype (i4o = i). The sizes and signs
of ikk. and i'o determine the extent and direction of
imprinting effects at the haplotype level.
A set of genetic parameters, including the additive, domi
nance, and imprinting effects, define the genetic architec
ture of a quantitative trait. By estimating and testing these
parameters, the genetic architecture of a trait can well be
studied. Specific genetic effects of haplotypes can be esti
mated from genotypic values of the composite diplotypes
with the formulas as follows:
K
ak = K1 Kkk P + Poo
1 ,#
dkk' = IPk k' + Pk'k) (Pkuk + Pk'k')I
1
dko = I[(PUko + PoQk) (Pkk + Poo)1
1
1kk' = ( Pk k' P k'\k)
2
1
ko = (P o 0 Po0k)
Estimating Model
Genetic Structure
Suppose the three SNPs are genotyped from a natural
human population at HardyWeinberg equilibrium
(HWE). Let p and 1 p denote the frequencies of two
alternative alleles r; (r; = 1 or 0) at SNP I in the population
of sex s (s = M for females and P for males). Sexspecific
frequencies of eight haplotypes produced by the three
SNPs are generally expressed as p',, For each sex s, gen
otypes consisting of three SNPs are denoted as
rlr / r2'r3'(r1 > ', r2 r3 > r), totaling 27 distinct
genotypes. Let (r1 > r ,r2 > ,r, 3 > = 1,0) denote the
observation of a threeSNP genotype for sex s, which sums
over the two sexes to nr,,irr ir r, Some genotypes are
consistent with diplotypes, whereas those that are hetero
zygous at two or more SNPs are not. Each double hetero
zygote contains two different diplotypes, and the triple
heterozygote, i.e., 10/10/10, contains four different diplo
types: 111000 (with probability 2p11pioo0 for sex s),
1101001 (with probability 2p loPooI for sex s), 101010
(with probability 2Po iPoo1 for sex s), and 100011 (with
probability 2p0ooP011 for sex s). Note that slashes are used
to separate genotypes at different SNPs and vertical lines
are used to separate haplotypes derived from the maternal
parent (left) and paternal parent (right). The observed
genotypes, the underlying diplotypes, and diplotype fre
quencies are provided in Additional file 1. From the HWE
assumption, diplotype frequencies are simply expressed
as the products of the underlyinghaplotype frequencies
derived from different parents.
For a random sample from a natural population, it is
impossible to estimate the frequencies of maternally and
Page 6 of 11
(page number not for citation purposes)
http://www.almob.org/content/4/1/11
Algorithms for Molecular Biology 2009, 4:11
paternallyderived haplotypes. However, parentspecific
haplotype frequencies can be approximated by sexspe
cific haplotype frequencies in the current generation if the
SNPs studied are highly linked together. For example, we
will argue below that Plo which measures frequency for
the haplotype 100 among females in the current genera
tion can be accurately approximated by the haplotype fre
quency 100 in the "maternal generation" or previous
generation of females. This is proven as follows. Let
pr r2r3 (t) and Prr2r3 (t + 1) be the frequencies of a repre
sentative threeSNP haplotype rr23r, for a monosexual
population in generations t and t + 1, respectively. The
relationship of haplotype frequencies between the two
generations is expressed as
Pr"r, (t + 1) = p, (t + 1)(t+)(t + 1)p, (t + 1)
+(1) +2 p, (t + 1)D12(t +1)
+(1)' (t+l)D13(t+l)
+(1) 2+. p (t + )D23(t + 1)
(1)12+" D123(t +1) (generation t + 1)
= P (t)p ''. (t)
+(1)+"p r12)D12(t)
+(I)"+" p2 (t)(1 13)D13(t)
+(1) .2+" Pp, (t)(1 23)D23t)
(1)'i.2+ (1 r12)(1 13)(1 r23)D23(t) (generation t +1)
SP / (t)P2 )' (t. ) + (1) + p,, (t)D 2(t)
+(1)rl+ P (t)D13(t)
+(1)' 2+ p, (t)D23(t)
(1)'l 2+ D123(t) (generation t + 1)
= p,, (t) (generation t)
where pq (t), p, (t), and pr (t) are the allele frequencies
of three SNPs in generation t, and D12(t), D13(t), and
(D23(t) are the linkage disequilibria between the first and
second SNPs, the first and third SNPs, and the second and
third SNPs in generation t. Under HardyWeinberg equi
librium, the allele frequencies remain constant from gen
eration to generation (p, (t) = p, (t + 1)) and the linkage
disequilibria decay in proportion with the recombination
fractions r1 in that
Dy(t + 1) = (1 r,)D,(t + 1)
and
D123(t +1) = (1 12)( r13)(1 r23)D123(t).
Because the three SNPs are genotyped from the same
region of a candidate gene, their recombination fractions
should be very small and can be thought to be close to
zero. Thus, the frequencies of maternally and paternally
derived haplotypes can be approximated by the estimates
of these haplotype frequencies in the female and male
populations of current generation, respectively.
Likelihoods
With a random sample from a natural population, in
which each genotyped subject is measured for a pheno
typic trait of interest, we will develop a model to estimate
population genetic parameters, including the eight mater
nallyderived haplotype frequencies
(pM = {Pr23 }\,r2,r3=o), the eight paternallyderived hap
lotype frequencies (D, = {p,,,, }\,, ,r and the quan
titative genetic parameters (4q) that include haplotype
effects (({ak,agkk',dkhoik',iOk}hlk'=l)) and the residual
variance of the trait (02). The haplotype effects are derived
uniquely from genotypic values of composite diplotypes
(({Pkk', Pkl,/o, Kolo}k=1) ) as provided in equations
(2)(6).
Given sexspecific genotypic observations (MM and Mp)
and phenotypic data (y), a joint loglikelihood is con
structed as
log L(pM,p q I y MMMP)
log L(M M ) + log L(QP Mp)
+logL(Q, y),MMMP, M ,Q)
Thus, maximizing the likelihood in (1) is equivalent to
individually maximizing the three terms on the right hand
side of(1).
A polynomial likelihood is constructed for the marker
data of a sex (s) to estimate Qp and 4 For notational
convenience, we define a function h(r) on genotypes
r = / r2,' / r3 to be
h(r) = (r r) + (r2 ) + (r3 )
So, for example, h(r) = 2 if r is a double heterozygote. The
first two log likelihood on the righthand side of equation
(7) are then expressed as
Page 7 of 11
(page number not for citation purposes)
hftp://www.almob.org/content/4/1/1 1
Algorithms for Molecular Biology 2009, 4:11
Table 4: Number of choices for several multiallelic models with likelihood and model selection notation
Risk Haplotype
Loglikelihood
AIC/BIC
No. Choice
L(Q Q 2, y,M)
1 8 (one haplotype)
28 (two haplotypes)
56 (three haplotypes)
70 (four haplotypes)
2 28
3 56
4 170
5 56
6 24
7 8
log LB
log L
log LQ
log LP
log LH
logLs,
log Lo
In this table, Q~ is an estimated vector of the genotypic values of different composite diplotypes and the residual variance. The largest log
likelihood and/or the smallest AIC or BIC value calculated is thought to correspond to the most likely risk haplotypes and the optimal number of
risk haplotypes.
log L(Q I M,)
constant +
{r:h(r)=Oj
2nr log(pSr,2)
risk haplotype(s). By assuming that haplotype 111 is only
one risk heplotype, we construct the likelihood as
+ n log(2p p,;)
{r:h(r)=1}
+ nrlog(p,12P ,rp
{r:h(r)=2}
S 5 S 5 ,5
+eP~r P3r^2 + Prr3 Per
+P12r3 1P
+n10/10/10 log(2p11Pspoo
+2psolp0lo + 2p+loP0ol + 2p1oo011)
A closed system of the EM algorithm was derived to esti
mate these haplotype frequencies (see the Text Si). The
estimates of sexspecific haplotype frequencies are then
used to estimate haplotype effects by constructing a mix
ture modelbased likelihood. The construction of this
likelihood requires knowledge of which haplotypes are
Page 8 of 11
(page number not for citation purposes)
Model
Biallelic
Triallelic
Quadriallelic
Pentaallelic
Hexaallelic
Septemallelic
Octoallelic
http://www.almob.org/content/4/1/11
Algorithms for Molecular Biology 2009, 4:11
logL(Qq,c 2 , ,y,M)=
n111/11/11 1111/11/10
X logf11(yi)+ log lPifrl(Yi) +tffoYi)]
i=1 i=1
+ log[t2f10(i) +F 2fo1(i)]
i=1
n11111o1o
+ ,log[ 3 i10(y + 3f 01(yJ +3, fo(yi)
i=1
n 10 1111
+ log[tpy4fl0Yi) + t4fo(Yi)
i=1
no10/111/10
+ log[ io410(y+ J/o0(yi +), foo(Yr)
i=1
1010/11
+ loglt/so(i y + tf/oi((y )+ oo+Y)
i=1
n10io1010
+ log[y / io1(y + tfoi(Yi) ++7foo(Yi)]
i=1
i=i
+ l0ogo fo(y1)
and
m = n11111/oo + n11/10/00 + n11/1oo11
+n11/O/lO + n11/oo/oo + n10/11/oo
+n10O/10/O + l01/oo/11 + 101o0oolo
+n1o/oo/oo + noo/11/11 + noo/11/1o
+naOO/10/ + n0o0/10/0 + n00/10/10
+n010/10/O + no01oo/11 + n000oooo1
+n00/00/oo.
The EM algorithm is derived to estimate quantitative
genetic parameters with a detailed procedure given in
Additional file 2. The estimated genotypic values of com
posite diplotypes are used to estimate the additive, domi
nant and imprinting effects of haplotypes using equations
(2)(6).
Model Selection
For an observed marker (M) and phenotypic (y) data set,
we do not know which are the risk haplotypes nor how
many there are. Standard model selection criteria such as
the AIC and BIC are used to determine the optimal
number and type of risk haplotypes. Among a total of
eight haplotypes formed by three SNPs, up to seven ones
can be risk haplotypes. The modeling of one to seven risk
haplotypes is equivalent to the analysis of the genetic data
by biallelic, triallelic, quadrialleli, pentaallelic, hexaal
lelic, septemallelic and octoallelic models, respectively.
The biallelic model dissolves all the haplotypes into two
distinct groups, a single risk haplotype group and a non
risk haplotype group. The risk haplotype group may be
composed of one (8 choices), two (28 choices), three (56
choices) or four haplotypes (70 choices). The triallelic,
quadrialleli, pentaallelic, hexaallelic, septemallelic and
octoallelic models contains 28, 56, 170, 56, 24 and 8
cases, respectively. The likelihood and model selection
criteria, AIC or BIC, are then calculated and compared
among different models and assumptions. This is summa
rized in Table 4.
Hypothesis Tests
For a given data set, testing the existence of functional
haplotypes is a first step. This can be done by formulating
the following hypotheses:
Ho : P.k =k Pkk' = Pk0 = Po0 = Poo
(k = 1...K)
H1 : At least one of equality in H0 does not hold The log
likelihood ratio (LR) is then calculated by plugging the
estimated parameters into the likelihood under the H0
and H1, respectively. The LR can be viewed as being
asymptotically X2distributed with (k + 1)2 1 degrees of
freedom.
After a significant haplotype effect is detected, a series of
further tests are performed for the significance of additive,
dominance and imprinting effects triggered by haplo
types. The null hypotheses under each of these tests can be
formulated by setting the effect being tested to be equal to
zero. For example, under the triallelic model, the null
hypothesis for testing the imprinting effect of the haplo
typesis expressed as
Ho : i12 i10 i20 = 0.
In practice, it is also interesting to test each of the additive
genetic effects, each of the dominance effects and each of
the imprinting effects for the triand quadriallelic models.
The estimates of the parameters under the null hypotheses
can be obtained with the same EM algorithm derived for
the alternative hypotheses but with a constraint of the
tested effect equal to zero. The loglikelihood ratio test sta
tistics for each hypothesis is thought to asymptotically fol
low a X2distributed with the degree of freedom equal to
the difference of the numbers of the parameters being
tested under the null and alternative hypotheses.
Competing interests
The authors declare that they have no competing interests.
Page 9 of 11
(page number not for citation purposes)
http://www.almob.org/content/4/1/11
Algorithms for Molecular Biology 2009, 4:11
Authors' contributions
SW, CW, and AB contributed equally to this manuscript.
SW, CW, AB, YL derived the algorithm and performed the
data analysis. RBF, MRW, RS, LK designed the experiment
and collected the data. MMC provided statistical advice
and help. RW conceived the model and wrote the paper.
AB provided final modifications to the paper. All authors
have read and approved the final manuscript.
Additional material
Additional file 1
Observed genotypes, underlying diplotypes, and diplotype frequencies
under biallelic and ocoallelic models. Two tables are provided describing
genotypes, diplotypes, and diplotype frequencies for 2 7 genotypes at three
SNPs, and genotypic values of composite diplotypes under the biallelic
(assuming that 111 is the risk haplotype and the others are the nonrisk
haplotype) and ocoallelic models.
Click here for file
[http://www.biomedcentral.com/content/supplementary/1748
7188411Sl.pdf]
Additional file 2
EM algorithm details for calculating parameter estimation. EM Algo
rithms for estimating haplotype frequencies and for estimating quantita
tive genetic parameters.
Click here for file
[http://www.biomedcentral.com/content/supplementary/1748
718841 lS2.pdf]
Acknowledgements
This work is supported by joint grant DMS/NIGMS0540745 and NIH RO I
grant NS41670.
References
I. Reik W, Walter J: Genomic imprinting: parental influence on
the genome. Nature Reviews Genetics 2001, 2:2132.
2. Falls J, Pulford D, Wylie A, Jirtle R: Genomic imprinting: implica
tions for human disease. American journal of Pathology 1999,
154(3):635647.
3. Jirtle R: Genomic imprinting and cancer. Experimental cell
research 1999, 248:1824.
4. Luedi P, Hartemink A, Jirtle R: Genomewide prediction of
imprinted murine genes. Genome Research 2005, 15(6):875884.
5. Waterland R, Lin J, Smith C, Jirtle R: Postweaning diet affects
genomic imprinting at the insulinlike growth factor 2(lgf2)
locus. Human Molecular Genetics 2006, 15(5):705716.
6. Killian J, Byrd J, Jirtle J, Munday B, Stoskopf M, MacDonald R, Jirtle R:
M6P/IGF2R imprinting evolution in mammals. Molecular Cell
2000, 5(4):707716.
7. Haig D: Altercation of generations: genetic conflicts of preg
nancy. American journal of reproductive immunology (New York, NY:
1989) 1996, 35(3):226.
8. Lefebvre L, Viville S, Barton S, Ishino F, Keverne E, Surani M: Abnor
mal maternal behaviour and growth retardation associated
with loss of the imprinted gene Mest. Nature genetics 1998,
20:163169.
9. Li L, Keverne E, Aparicio S, Ishino F, Barton S, Surani M: Regulation
of maternal behavior and offspring growth by paternally
expressed Peg3. Science 1999, 284(5412):330.
10. de Koning D, Rattink A, Harlizius B, Van Arendonk J, Brascamp E,
Groenen M: Genomewide scan for body composition in pigs
reveals important role of imprinting. Proceedings of the National
Academy of Sciences 2000, 97(14):79477950.
II. de Koning D, Bovenhuis H, van Arendonk J: On the detection of
imprinted quantitative trait loci in experimental crosses of
outbred species. Genetics 2002, 161 (2):93 1938.
12. Cui Y, Lu Q, Cheverud J, Littell R, Wu R: Model for mapping
imprinted quantitative trait loci in an inbred F2 design.
Genomics 2006, 87(4):54355 1.
13. Jeon J, Carlborg 0, Tornsten A, Giuffra E, Amarger V, Chardon P,
AndersenEklund L: A paternally expressed QTL affecting skel
etal and cardiac muscle mass in pig maps to the IGF2 locus.
Not Genet 1999, 21:157158.
14. Nezer C, Moreau L, Brouwers B, Coppieters W, Detilleuxj, Hanset
R, Karim L, Kvasz A, Leroy P, Georges M: An imprinted QTL with
major effect on muscle mass and fat deposition maps to the
IGF2 locus in pigs. Nature Genetics 1999, 21(2):155156.
15. TuiskulaHaavisto M, De Koning D, Honkatukia M, Schulman N, Maki
tanila A, Vilkki J: Quantitative trait loci with parentoforigin
effects in chicken. Genetics Research 2004, 84(01):5766.
16. Lewis A, Redrup L: Genetic imprinting: conflict at the Callipyge
locus. Current Biology 2005, 15(8):29 1294.
17. Cui Y: A statistical framework for genomewide scanning and
testing of imprinted quantitative trait loci. journal of theoretical
biology 2007, 244:1 15126.
18. Liu T, Todhunter R, Wu S, Hou W, Mateescu R, Zhang Z, Burton
Wurster N, Acland G, Lust G, Wu R: A random model for map
ping imprinted quantitative trait loci in a structured pedi
gree: An implication for mapping canine hip dysplasia.
Genomics 2007, 90(2):276284.
19. Liu T, Johnson J, Casella G, Wu R: Sequencing complex diseases
with HapMap. Genetics 2004, 168:50351 I.
20. Wu S, Yang J, Wang C, Wu R: A General Quantitative Genetic
Model for Haplotyping a Complex Trait in Humans. Current
Genomics 2007, 8(5):343350.
21. Takasaki I, Suzuki T, Sasaki A, Nakao K, Hirakata M, Okano K, Tanaka
T, Nagase H, Shiraki K, Nojima H, et al.: Suppression of acute her
petic painrelated responses by the kappaopioid receptor
agonist ()17cyclopropylmethyl3, 14betadihydroxy4,
5alphaepoxybeta[nmethyl3trans3(3furyl) acryla
mido] morphinan hydrochloride (TRK820) in mice. The our
nal of pharmacology and experimental therapeutics 2004, 309:36.
22. Fillingim R, Doleys D, Edwards R, Lowery D: Spousal Responses
Are Differentially Associated With Clinical Variables in
Women and Men With Chronic Pain. Clinical journal of Pain
2003, 19(4):217.
23. Wilkins J, Haig D: What good is genomic imprinting: the func
tion of parentspecific gene expression. Nature Reviews Genetics
2003, 4(5):359368.
24. Wood A, Oakey R: Genomic imprinting in mammals: emerg
ing themes and established theories. PLoS genetics 2006, 2(I 1):.
25. Lewis A, Reik W: How imprinting centres work. Cytogenet
Genome Res 2006, I 13(14):8189.
26. Jirtle R, Skinner M: Environmental epigenomics and disease
susceptibility. Nature Reviews Genetics 2007, 8(4):253262.
27. Feil R, Berger F: Convergent evolution of genomic imprinting
in plants and mammals. Trends in Genetics 2007, 23(4):192199.
28. Nezer C, Collette C, Moreau L, Brouwers B, Kim J, Giuffra E, Buys N,
Andersson L, Georges M: Haplotype sharing refines the location
of an imprinted quantitative trait locus with major effect on
muscle mass to a 250kb chromosome segment containing
the porcine IGF2 gene. Genetics 2003, 165:277285.
29. Cheverud J, Hager R, Roseman C, Fawcett G, Wang B, Wolf J:
Genomic imprinting effects on adult body composition in
mice. Proceedings of the National Academy of Sciences 2008,
105(1 1):4253.
30. Judson R, Stephens J, Windemuth A: The predictive power of hap
lotypes in clinical response. Pharmacogenomics 2000, 1:1526.
31. Bader : The relative power of SNPs and haplotype as genetic
markers for association tests. Pharmacogenomics 2001, 2:1 124.
32. Winkelmann B, Hoffmann M, Nauck M, Kumar A, Nandabalan K, Jud
son R, Boehm B, Tall A, Ruano G, Marz W: Haplotypes of the
cholesterol ester transfer protein gene predict lipidmodify
ing response to station therapy. The Pharmacogenomics journal
2003, 3(5):284296.
33. Clark A: The role of haplotypes in candidate gene studies.
Genetic epidemiology 2004, 27(4):321333.
Page 10 of 11
(page number not for citation purposes)
hftp://www.almob.org/content/4/1/1 1
Algorithms for Molecular Biology 2009, 4:11
34. Jin G, Miao R, Deng Y, Hu Z, Zhou Y, Tan Y, Wangj, Hua Z, Ding W,
Wang L, et al.: Variant genotypes and haplotypes of the epider
mal growth factor gene promoter are associated with a
decreased risk of gastric cancer in a highrisk Chinese popu
lation. Cancer science 2007, 98(6):864868.
35. Altshuler D, Brooks L, Chakravarti A, Collins F, Daly M, Donnelly P,
et al.: A haplotype map of the human genome. Nature 2005,
437(7063): 12991320.
Page 11 of 11
(page number not for citation purposes)
Publish with BioMed Central and every
scientist can read your work free of charge
"BioMed Central will be the most significant development for
disseminating the results of biomedical research in our lifetime."
Sir Paul Nurse, Cancer Research UK
Your research papers will be:
available free of charge to the entire biomedical community
peer reviewed and published immediately upon acceptance
cited in PubMed and archived on PubMed Central
yours you keep the copyright
Submit your manuscript here: BioMedcentral
http://www.biomedcentral.com/info/publishingadv.asp
hftp://www.almob.org/content/4/1/1 1
