BMC Genetics
Methodology article
A hierarchical statistical model for estimating population
0
BioMed Central
properties of quantitative genes
Samuel S Wu1, ChangXing Mal,2, Rongling Wu* I and George Casella1
Address: IDepartment of Statistics, University of Florida, Gainesville, FL 32611, USA and 2Department of Statistics, Nankai University, Tianjin
300071, P. R. China
Email: Samuel S Wu samwu@stat.ufl.edu; ChangXing Ma cma@biostat.ufl.edu; Rongling Wu* rwu@stat.ufl.edu;
George Casella casella@stat.ufl.edu
* Corresponding author
Published: 12 June 2002
BMC Genetics 2002, 3:10
Received: 15 March 2002
Accepted: 12June 2002
This article is available from: http://www.biomedcentral.com/14712156/3/10
2002 Wu et al; licensee BioMed Central Ltd. Verbatim copying and redistribution of this article are permitted in any medium for any purpose, provided
this notice is preserved along with the article's original URL.
Abstract
Background: Earlier methods for detecting major genes responsible for a quantitative trait rely
critically upon a wellstructured pedigree in which the segregation pattern of genes exactly follow
Mendelian inheritance laws. However, for many outcrossing species, such pedigrees are not
available and genes also display population properties.
Results: In this paper, a hierarchical statistical model is proposed to monitor the existence of a
major gene based on its segregation and transmission across two successive generations. The
model is implemented with an EM algorithm to provide maximum likelihood estimates for genetic
parameters of the major locus. This new method is successfully applied to identify an additive gene
having a large effect on stem height growth of aspen trees. The estimates of population genetic
parameters for this major gene can be generalized to the original breeding population from which
the parents were sampled. A simulation study is presented to evaluate finite sample properties of
the model.
Conclusions: A hierarchical model was derived for detecting major genes affecting a quantitative
trait based on progeny tests of outcrossing species. The new model takes into account the
population genetic properties of genes and is expected to enhance the accuracy, precision and
power of gene detection.
Background
The identification of individual genes governing pheno
typic variation is crucial to understand the mechanistic
basis of quantitative inheritance and ultimately provide
information about the design of optimal breeding strate
gies in both plants and animals. Quantitative genetics
based on new statistical and computational technologies
has advanced to the point at which individual genes can
be detected and mapped on chromosomes [1,2]. The un
derpinning for the detection and mapping of genes is
founded on their particular segregation pattern. For a ped
igree derived from inbred lines, the segregation pattern of
a gene can be exactly predicted. In this situation, quantita
tive genetic theory can lay a sufficient foundation for gene
detection and, thus, the properties of a detected gene can
be adequately described by its effect and genomic loca
tion. However, for outcrossing populations, such as forest
trees, in which inbred lines are not available, genes also
Page 1 of 7
(page number not for citation purposes)
http://www.biomedcentral.com/14712156/3/10
display population genetic properties [3]. Hence, the
foundation for gene detection in these populations
should be established on a combination of quantitative
genetics and population genetics. The estimation of the
population genetic properties of genes, such as allele fre
quencies and the degree of disequilibrium, can not only
enhance the precision and power of gene detection, but
also is fundamentally important to our understanding of
population differentiation and evolution [4,5].
We present a strategy for investigating the existence of
genes in an arbitrarily complex progeny test based on phe
notype data and estimating the effects of these genes on a
quantitative trait and their population properties under
the maximum likelihood framework. Only a limited
number of statistical studies have been performed to de
tect genes based on the phenotypic data [613], and none
of these studies have considered the population and evo
lutionary genetic properties of genes. Our study popula
tion can be derived from a natural population instead of
inbred lines as used for general agronomic crops. A finite
number of unrelated parents are randomly selected from
a natural population and are mated to produce progeny
tests under a mating design such as factorial or diallel. In
a recent paper, we used such progeny tests to derive a
model for detecting major genes affecting a quantitative
trait [14]. This model has power to estimate the effect of
drift errors on genetic variation during hybridization, thus
providing information about dynamic changes of the ge
netic architecture of a population under artificial hybridi
zation. However, the estimates of population genetic
parameters from the progeny test using the above model
cannot be generalized to the original population from
which the mating design is derived. In this paper, we de
veloped a hierarchical model for analyzing the pattern of
gene segregation at both population and family levels,
which thus can provide estimates of genetic parameters in
two successive generations of the population. Estimates of
population genetic parameters using this hierarchical
model can reasonably infer the genetic structure of the
original population. A forest tree example is used to dem
onstrate the application of the new statistical model for
gene detection.
Model
Population genetic structure
Suppose there is a segregating major gene responsible for
a quantitative trait in a natural or experimental diploid
population. This gene has s different codominant alleles,
A1,...,A,, whose frequencies in the population are denoted
by pl,...,Ps with s1 P, = 1. These s alleles randomly
unite to form s(s + 1)/2 distinguishable genotypes includ
ing s homozygotes and s(s 1)/2 heterozygotes. The pop
ulation frequencies of the genotypes are denoted by
PKt (K, I = 1...,s). If the population is at HardyWeinberg
disequilibrium (HWD), the genotypic frequency is the
product of the two corresponding allelic frequencies, plus
the coefficient of HWD (dK), i.e., pK = pp + dK, where dK
has the restrictions, max{pw (1 pj, p, (1 p)} < d <
pp>, and i,,K K Yidw), pi [15].
Mating design
A finite number of unrelated individuals are randomly se
lected from the population as female and male parents to
generate a I x J factorial mating design. Based on sampling
theory, these selected individuals have genotypes at the
major locus whose counts follow the same multinomial
distribution as those in the original population, with
probability vector p = (PK, K, = 1,...,n). The segregation of
progeny genotypes in the mating design depends on how
the majorlocus genotypes segregate (1) among the select
ed female or male parents and (2) within each of the full
sib families (Table 1). The segregation among the I female
or I male parents having the same pattern as in the origi
nal population is viewed as the populationlevel segrega
tion. The segregation within a fullsib family of size nij
derived from the ith female and jth male parent follows
the Mendelian segregation ratio, viewed as the familylev
el segregation. To understand the genetic structure of the
original population, different genetic information derived
from this twolevel segregation will not be mixed, but
rather combined within a maximum likelihood frame
work.
Estimation method
Let Yijk denote the phenotypic measurement for the kth
offspring from ith female parent and jth male parent and
g4k denote its genotype at the major gene, i = 1,2, ..., I, j =
1, 2, ..., J, k = 1,2, ..., ni. The statistical model describing
the phenotypegenotype relationship can be expressed as
Yi k = ItkI{g g=AKA } + eiJ
where ytK, is the genotypic value of genotype AA1; and eijk
is the error term with the normal distribution N(0,(2).
Let gf, be the genotype for the ith female parent and g,,
for the jth male parent. We denote by y the collection of
the observed data and byz = (g,gpgm) the collection of the
majorlocus genotypes for offspring, female parents and
male parents. We assume the following hierarchical mod
el:
Page 2 of 7
(page number not for citation purposes)
BMC Genetics 2002, 3
http://www.biomedcentral.com/14712156/3/10
Table I: Factorial mating design using female and male parents sampled from a natural population.
AIAI
(p2)
A A2
(2pIP2)
AIA :AIA2
(1/2:1/2)
AIAI :AIA2 :A2A2
(1/4:1/2:1/4)
AA :A,,2: AtA : AA2
(1/4 : 1/4: 1/4 : 1/4)
A As : A2As
(1/2:1/2)
 A As : A2As
 (1/2: 1/2)
I I
AKAs : AtAs
 (1/2: 1/2)
AsA
The segregation patterns of genotypes are indicated at the parental and progeny generation levels. Genotype frequencies in each generation are
given in parentheses.
Y 9l Rikff(Yijk Igik ),
(g  ggm ) ~n (g,) ng gj (V)m,
where f denotes the normal distribution of the phenotype
N([at, y2), h is distributed according to the Mendelian
segregation pattern and V is multinomially distributed
with probability vector p. Note that genotypes for siblings
are dependent and, given genotypes, the traits are inde
pendent of each other. Under the above model (y, z) has
a joint distribution as follows:
f(ylo)=nl,,[f(y,1klgk)h(g lk(g f, ))]l,V(g )nl (., ) (2)
Statistically, equation (2) is a mixture model in which
each component is represented by a genotype within a
fullsib family [16181. The maximum likelihood esti
mates for 0 = ([t, y2, p) can be obtained using the EM al
gorithm [19,201. At the tth step of the EM sequence, the
expected loglikelihood
Q( l 6(),y) = E (,) logf(y,z 0 ) is proportional to
_21 y (yAc +log ,og[P(c)] PIG++ oI (3)
y,,t G 0 o tG G I
where YG sums over all possible genotypes,
PijkG = r(Sijk = G I (),y) are the posterior distribu
tions of offspring genotypes and p iG, pjG are the posteri
or distributions of parent genotypes. Conditional on
((t,y), these posterior probabilities can be evaluated as
follows:
p (g. g') [i G(e" }(G(^ g, )e(, F4 )
( I iV (4)
p g
SP z e yj)
The EM sequence is defined by
Page 3 of 7
(page number not for citation purposes)
Female
AIA2
(2p P2)
AA
(2ppt)
AsAs
(I)
AIAI :AIA2
(1/2: 1/2)
AKAI :AtA
(1/2: 1/2)
A A,
BMC Genetics 2002, 3
http://www.biomedcentral.com/14712156/3/10
ijk ijk
()(G = ( ijkGYij ) ]2)/
2 )(t+1) (PijkC [Yijk G (G) / ijk,
ijkG ijkG
i j
004
001
004
001
004
(5) 001
Suppose the major gene has two different codominant al
leles, the posterior probablities in equation (4) can be ex
actly calculated for a usual mating design. However, for
large designs with many parents or for the case where the
major gene has many different alleles, we have to rely on
a Monte Carlo version of the EM algorithm. For example,
we can sample from the conditional distribution of z =
(g,gf, gm) given ( (),y) using the following Gibbs sam
pler [201:
g8 ImfO (t)Y1 CHik [Yijk gijk (t) jh(gik f('m ))'
gm g, gf ,6(t),Y Hi[Wgf, (t) )Hjkh( gijk (f',m ))
g8 8 Em 6^ty n V Em, 16(t) Fikh gijk If'
Example
We use data from aspen trees to illustrate the application
of our statistical model for detecting a major gene. The ex
ample is derived from a 6 x 6 factorial mating design of
Populus tremuloides and P. tremula. The parents for the
crosses were randomly selected from a mixed breeding
population of two different species established at the Uni
versity of Minnesota. Originally, the P. tremuloides trees
used as the female parents from the Upper Peninsula of
Michigan and northern Wisconsin, and the P. tremula
male parents came from northeastern Poland and Germa
ny [211. The seedlings from the progeny population were
planted in two different locations, one on a cutover forest
site near Grand Rapids, Minnesota, and the other on farm
land near Rhinelander, Wisconsin. Both trials were laid
out in a randomized complete block design with 10 repli
cates and six seedlings per family in each replicate at 2.5 x
2.5 spacing. At the end of the second year in the field, all
trees were measured for stem height.
Figure 1 presents density estimates of the second year
height from each family pooled over the two locations.
Trees from eight families did not have growth data due to
mortality. The phenotypic distributions suggest the exist
ence of a major gene in the progeny test. For example, in
004
001
004
1 20 50 80 20 50 80 20 50 80
a
20 50 80
CLONE5
T158
T5060
T5360
T561
T8057
20 50 80
Figure I
Density estimates of second year stem height (in cm) for all
families. The measurement is the stem height at the end of
second year. The randomly selected parents are labelled at
the top (male) and on the right of plots (female). Trees from
eight families did not have growth data due to mortality.
the family derived from Clone5 x TA168, the phenotyp
ic distribution is a mixture of at least two components,
showing the existence of a segregating gene. Figure 1 also
indicates different segregation patterns among the fami
lies, with some more smooth and others more waved,
thus suggesting that some parents are heterozygous with
their alleles segregating in the progeny.
The effect and genotypic frequencies of the major gene are
estimated using the hierarchical model assuming two alle
les. Since no significant difference is detected between two
locations, we only report the result based on pooled data.
With probability of almost one, parents TA191, TA268,
TA491, XTA36 have genotype AIA1, parents TA168,
TA175, CLONE5, TA561 have genotype AIA2 and the
rest parents have genotype A2A2. The estimates of the gen
otypic values of AIA1, AIA2 and A2A2 are [11
56.7
0.80, p 12 = 40.6 0.32 and [ 22 = 30.0 1.02, respective
^2
ly. The estimate of residual variance is 5 = 218.676
6.09. Here, the estimates of standard errors for the param
eter estimators are obtained based on the Fisher informa
tion matrix [22]. We also fitted the model under the null
hypothesis that there is no major gene with a single nor
mal distribution. The likelihood ratio test has Xdf=4 =150
(p < 0.0001), suggesting the existence of a major gene for
height growth. The ratio of the dominant to additive effect
for this major gene is less than 0.2, which implies that it
affects the stem height growth in an additive manner.
Page 4 of 7
(page number not for citation purposes)
_____ _____ ] _____ I
KU] __ ~LJ
BMC Genetics 2002, 3
TA1RA TA17R TA191
TA1AA TAa91 XTA1A
http://www.biomedcentral.com/14712156/3/10
The population genetic parameters of this major gene are
estimated based on the genotypes of the parents detected.
The genotype frequencies estimated are identical for the
three genotypes, i.e. p 1
P 12 = P22 = 1/3. In conse
quence, the allele frequencies are also same (p = 2 =1/
2). Thus, the mixed breeding population of P. tremulodes
and P. tremula is not at HardyWeinberg equilibrium, be
cause the genotype frequencies are not products of the
corresponding allele frequencies. The coefficient of Har
dyWeinberg disequilibrium is estimated as d = 0.06. The
estimate of allele frequencies and HardyWeinberg dise
quilibrium using our hierarchical model can be well gen
eralized to the original breeding population from which
the parents were sampled. However, one should be cau
tious about the accuracy of the estimator due to the small
sample size.
Simulation
The behavior of the EM estimators is evaluated by two
simulation experiments. In the first simulation, a 6 by 6
mating design same as our real example is assumed. The
major gene is assumed to have two codominant alleles A1
and A2 whose frequencies in the population are pi = 0.6
and P2 = 0.4. The genotypic values are set to be 65, 40 and
25 with standard deviation 15. Based on the hierarchical
model defined in equation (1), we first simulated a set of
the parental genotypes (AIA1,AIA1, AIA2, AIA1, AIA2,
AIA1 for females and AIA2,A2A2, A2A2, AIA1, AIA2, AIA2
for males). Then we randomly generated 50 offspring gen
otypes and observations for each family.
Our EM estimating procedure converged to the correct pa
rental genotypes after only a few iterations (normally 2 or
3). A loglikelihood test is highly significant with loglike
lihood ratio equals 125 (df= 4). The estimates of genotyp
ic values from a single simulation replicate are 64.1 0.74
for AIA1, 39.5 0.59 for AIA2 and and 26.8 1.64 for
A2A2. The estimate of residual variance is 241.6 10.30.
In our second simulation we studied a 5 by 5 mating de
sign. We used 25 replications to investigate the empirical
accuracy of the estimators. We assume the major gene has
2 alleles and randomly generated parental genotypes. The
parental genotypes were randomly generated based on al
lele frequence pi = 0.5. That resulted in female genotypes
(A1A1, A1A2, A2A2, A1A2, A2A2) and male genotypes
(A1A2, A1A1, A2A2, A1A1, A1A2). For each family we simu
lated the 50 genotypes for its offspring and then simulat
ed its trait measurement using a normal and lognormal
distribution.
The results from the second simulation study are summa
rized in Figure 2. The top panel of the plots is for the sim
ulation where trait measurements are simulated from a
normal distribution. The bottom panel corresponds to
lognormal data. In each plot we graphed the ideal esti
mates of the parameter (assuming that we know the gen
otypes of each offspring) versus our estimates based on
the EM algorithm. Mean squared errors are also indicated
in the plots. It is seen from Figure 2 that all four parame
ters can be estimated quite accurately in the normal case
while they are slightly biased for the lognormal data.
Discussion
We have derived a hierarchical model for detecting major
genes affecting a quantitative trait based on progeny tests
of outcrossing species. The new model takes into account
the population genetic properties of genes and is expected
to enhance the accuracy, precision and power of gene de
tection. The information about the population behavior
of genes estimated from this model is fundamentally im
portant to our understanding of the genetic architecture of
a natural or experimental population and is also useful for
the design of sound breeding strategies for agricultural
crops and forest trees.
We used an example from interspecific aspen hybrids to il
lustrate the application of our genedetecting model. The
model has successfully identified a major gene with addi
tive effects on second year stem growth in the hybrid as
pens. The additive effect of genes on height growth in
young poplars was also observed in a molecular marker
experiment [23]. In this marker experiment, an additive
quantitative trait locus was found to explain almost a
quarter of the total phenotypic variation in second year
height growth. The result from the aspen hybrids was con
sistent with that from a simulation experiment based on a
mating scheme analogous to that used in our real exam
ple. The consistency of the results from the simulation
study suggests that the model proposed here can be ade
quately used to detect genes of large effect on the pheno
types.
The model will have immediate applications for many
species in which progeny tests have been established for a
number of years [24]. Most of these tests are maintained
in different locations and measured annually, thus offer
ing desirable opportunities to address two important
questions: (1) How does the genes interact with environ
ment? (2) What is the developmental plasticity of gene ex
pression? In our real example, no evidence has been
obtained for the change of gene expression at the major
locus detected over the two field trials, despite their dra
matic differences in climate, soil properties and silvicul
tural measures [21]. If the stability of the expression of
this gene over a range of environments is confirmed by
Page 5 of 7
(page number not for citation purposes)
BMC Genetics 2002, 3
http://www.biomedcentral.com/14712156/3/10
68

D 66
E
S64
62 Z
62
68
N
S66
E
S64
62
62
38 "
64 66 68 38 40
True p True P2
30
K
M
25
20 Z
42 20
30
K
25
20 Z
42 20
25
True [3
25
True [3
16
MSE=0.13
o
15 o0
oo 0
14
14 15 16
True a
16
MSE=0.12
00
00o
15 0o
0o0
0
14
14 15 16
True a
Figure 2
Summary of a simulation study based on a 5 x 5 mating design. In each plot we graphed the ideal estimates of the parameter
(assuming that we know the genotypes of each offspring) versus the estimates based on the EM algorithm. The top panel cor
responds to normal data and the bottom panel to the lognormal case
more accurate genotype determination methods, e.g.
based on genetic markers. It will have a tremendous appli
cation in breeding for stable cultivars of forest tree species.
Our analysis and simulation is based on diallelic inherit
ance. But multiallelic inheritance can be similarly consid
ered, although more parameters should be introduced.
Also, our model assumes the segregation of a single gene
in progeny tests. The principle behind the model can be
extended to consider two or more genes. Modelling mul
tiple genes may be closer to biological reality, because the
linkage, epistatic interaction and genetic association
among genes are considered. However, the consideration
of these relationships needs to estimate an increased
number of unknown genetic parameters. Finally, our
model is based on a balanced factorial design. For some
populations, like mammals, mating designs are frequent
ly unbalanced because of limited resources or reproduc
tive difficulties. It is very important to extend our analysis
and model to an unbalanced mating design of any com
plexity. For all of these extensions mentioned above, max
imum likelihood approaches may not be sufficient owing
to an increasing dimension of parameter space. A more
powerful computational tool, e.g., Markov chain Monte
Carol methods, may be needed [20] to make our model
more tractable in real applications.
Acknowledgements
We thank Dr. Jim Hobert, Dr. Dudley Huber, Dr. Bailian Li and Dr. Tim
White for helpful discussions and reviews regarding this study. This manu
script was approved as Journal Series No. R08032 by the Florida Agricul
tural Experiment Station.
References
I. Hoeschele I, Uimari P, Grignola FE, Zhang Q, Gage KM: Advances
in statistical methods to map quantitative trait loci in out
bred populations. Genetics 1997, 147:14451457
2. Balding JD, Bishop M, Cannings C: Handbook of statistical genet
ics John Wiley & Sons, New York 2001
3. Wu RL: Mapping quantitative trait loci by genotyping haploid
tissues. Genetics 1999, 152:17411752
4. Wright S: Evolution and the genetic of populations. II. The
theory of gene frequencies. Univ. of Chicago Press, Chicago 1969
5. Elston RC: Segregation analysis. Adv Hum Genet 1981, 11:3723
6. Hoeschele I: Genetic evaluation with data presenting evidence
of mixed major gene and polygenic inheritance. Theoretical and
Applied Genetics 1988, 76:8192
7. Hoeschele I: Statistical techniques for detection of major
genes in animal breeding data. Theoretical and Applied Genetics
1988, 76:311319
8. Le Roy P, Naveau J, Elsen JM, Sellier P: Evidence for a new major
gene influencing meat quality in pigs. Genetical Research 1990,
55:3340
Page 6 of 7
(page number not for citation purposes)
38 "
64 66 68 38 40
True p True P2
BMC Genetics 2002, 3
http://www.biomedcentral.com/14712156/3/10
9. Knott SA, Haley CS, Thompson R: Methods of segregation anal
ysis for animal breeding data: a comparison of power. Heredity
1992, 68:29931 I
10. Le Roy P, Elsen JM: Simple test statistics for major gene detec
tion: a numerical comparison. Theoretical and Applied Genetics
1992, 83:635644
I I. Loisel R, Goffinet B, Monod H, De Oca GM: Detecting a major
gene in an F2 population. Biometrics 1994, 50:512516
12. Janss LL, Thompson R, Van Arendonk JAM: Application of Gibbs
sampling for inference in a mixed major genepolygenic in
heritance model in animal population. Theoretical and Applied
Genetics 1995,91:1137I 147
13. Janss LL, Van Arendonk JAM, Brascamp EW: Bayesian statistical
analysis for presence of single genes affecting meat quality
traits in a crossed pig population. Genetics 1997, 145:395408
14. Wu RL, Li BL, Wu SS, Casella G: A maximum likelihoodbased
method for mining major genes affecting a quantitative
character. Biometrics 200 1, 57:764768
15. Weir BS: Genetic Data Analysis II. Sinauer Associates, Sunderland,
MA 1996
16. Titterington DM, Smith AFM, Makov UE: Statistical Analysis of Fi
nite Mixture Distributions. John Wiley & Sons, NY 1985
17. Feng ZD, McCulloch CE: On the likelihood ratio test statistic
for the number of components in a normal mixture with un
equal variances. Biometrics 1994, 50:1 1581 162
18. Fernando RL, Stricker C, Elston RC: The finite polygenic mixed
model: an alternative formulation for the mixed model of in
heritance. Theoretical and Applied Genetics 1994, 88:573580
19. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from in
complete data via EM algorithm. Journal of Royal Statistical Society
Serial B 1977, 39:138
20. Robert C, Casella G: Monte Carlo Statistical Methods Springer,
NY 1999
21. Li BL, Wu RL: Heterosis and genotype x environment interac
tions of juvenile aspens in two contrasting sites. Can. J. Forest
Res. 1997, 27:15251537
22. Oakes D: Direct calculation of the information matrix via the
EM algorithm. J. R. Statist. Soc B 1999, 61:479482
23. Bradshaw HD, Stettler RF: Moleculargenetics of growth and de
velopment in Populus. 4 Mapping QTLs with large effects on
growth, form, and phenology traits in a forest tree. Genetics
1995, 139:963973
24. McKeand SE, Bridgwater FE: A strategy for the third breeding
cycle of loblolly pine in the southeastern US. Silvae Genetica
1998, 47:223234
Publish with BioMed Central and every
scientist can read your work free of charge
"BioMedcentral will be the most significant development for
disseminating the results of biomedical research in our lifetime."
Paul Nurse, DirectorGeneral, Imperial Cancer Research Fund
Publish with BMC and your research papers will be:
available free of charge to the entire biomedical community
peer reviewed and published immediately upon acceptance
cited in PubMed and archived on PubMed Central
yours you keep the copyright BioMedcentral.com
Submit your manuscript here:
http://www.blomedcentral.com/manuscript/ editorial@blomedcentral.com
Page 7 of 7
(page number not for citation purposes)
BMC Genetics 2002, 3
