BMC Evolutionary Biology .
Research article
Using equilibrium frequencies in models of sequence evolution
Bjarne Knudsen* and Michael M Miyamoto
Address: Department of Zoology, Box 118525, University of Florida, Gainesville, FL 326118525, USA
Email: Bjarne Knudsen* bk@birc.dk; Michael M Miyamoto miyamoto@mail.clas.ufl.edu
* Corresponding author
Published: 02 March 2005
BMC Evolutionary Biology 2005, 5:21 doi: 10.1 186/14712148521
Received: 19 July 2004
Accepted: 02 March 2005
This article is available from: http://www.biomedcentral.com/14712148/5/21
2005 Knudsen and Miyamoto; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: The f factor is a new parameter for accommodating the influence of both the
starting and ending states in the rate matrices of "generalized weighted frequencies" (+gwF) models
for sequence evolution. In this study, we derive an expected value for f, starting from a nearly
neutral model of weak selection, and then assess the biological interpretation of this factor with
evolutionary simulations.
Results: An expected value of f= 0.5 (i.e., equal dependency on the starting and ending states) is
derived for sequences that are evolving under the nearly neutral model of this study. However, this
expectation is sensitive to violations of its underlying assumptions as illustrated with the
evolutionary simulations.
Conclusion: This study illustrates how selection, drift, and mutation at the population level can
be linked to the rate matrices of models for sequence evolution to derive an expected value of f.
However, as f is affected by a number of factors that limit its biological interpretation, this factor
should normally be estimated as a free parameter rather than fixed a priori in a +gwF analysis.
Background
Felsenstein [1] was the first to introduce an evolutionary
model for DNA sequences, which allows for unequal
nucleotide frequencies (see also [2]). His F81 model
allows for substitutions at a rate proportional to the fre
quencies of the ending nucleotides. It is considered the
simplest rate matrix for accommodating variable nucle
otide frequencies and is therefore the starting point for the
consideration of more complex models with frequency
variation (e.g., the HKY model of Hasegawa et al. [3]).
Goldman and Whelan [4] described new variants of these
F81based models (their +gwF (generalized weighted fre
quencies) models; e.g., JC+gwF for Jukes and Cantor [5],
and K2P+gwF for Kimura [6]). At the heart of their +gwF
variants was a new free parameter (f) to accommodate the
frequencies of the starting, as well as ending, nucleotides
in the evolutionary process:
1if
qij = i i
where qj refers to the substitution rate from nucleotide i to
j, T and 7T correspond to their equilibrium base frequen
cies, and sj is the exchangeability between the two. In the
+gwF variants, the substitution rate becomes more
dependent on the ending nucleotide as f decreases from 1
to 0, with f = 0 for the classic F81type models.
This study starts with a population genetics model to
derive equations that link weak selection, genetic drift,
Page 1 of 5
(page number not for citation purposes)
0
Wied Central
http://www.biomedcentral.com/14712148/5/21
and mutation to the f parameter and evolutionary rate
matrices of the +gwF variants. These theoretical deriva
tions lead to an expected value off = 0.5. However, as
illustrated with simulations, the f parameter is complex
and thus its biological interpretation must be considered
with caution.
Results
Derivation of the rate matrix for the weak selection model
The nearly neutral model of molecular evolution states
that most DNA mutations of longerterm evolutionary
consequence are under weak selection and are therefore
prone to drift [7,8]. For a diploid population of size N, a
neutral mutation has a probability of 1/2N of becoming
fixed in the population. However, because of drift, even
slightly deleterious mutations can become fixed, but at a
probability of less than 1/2N. Advantageous mutations
have higher fixation probabilities than neutral mutations.
In the nearly neutral model, the distribution of alleles is
determined by an equilibrium of selection, drift, and
mutation.
Consider a number of sites under identical evolutionary
constraints and with a bias in nucleotide distribution.
Assume that weak selection and drift are the causes of this
bias; e.g., as for the codon usage biases in microorgan
isms and Drosophila [9,10]. In our model, some nucle
otides confer a slightly higher fitness onto the organism
than do others, regardless of their position, and these can
become fixed in the population through drift and/or
selection. Here, we also assume that selective advantages
are additive for the two alleles of the diploid organism
[11,12]. Let the selective advantages of the four nucle
otides be given by sA, sC, sG, and sr. The differences
between these selection coefficients will be very close to
zero, since no strong selection is expected.
Consider a mutation from nucleotide i to j, with a selec
tive advantage of s = s si (a selective disadvantage exists
when s is negative). For a population of size N and an
effective size of Ne, Kimura [11] showed that the fixation
probability in this population is given by:
1 e2Nes/N 2Nes
1 e4Nes N( e 4Nes)
when s # 0. For s = 0, we have P(s) = 1/2N. This approxi
mation is valid for small values of s, which is the case here.
The substitution rate from nucleotide i to j is proportional
to P(sj s,):
qi = 2N iP(s, st), (3)
where /zi is the mutation rate from i to j. For different i and
j, /ij can vary because of unequal transition versus trans
version rates (for example). Furthermore, let us assume
that the mutation rate is the same for either direction of
substitutions between i and j. This assumption is neces
sary to maintain the widely used condition of time revers
ibility in the evolutionary process, which thereby keeps
the following derivations tractable [1,13].
We then have:
qij P(Sj si)
qji P(si sj)
2Ne (sj s) N(1 e 4Ne(s,s)
N( e4Ne(ss) 2Ne (si sj)
1 e4Ne(s s,)
e 4N,(s s ,)
4Ne(sjs,) e 4N
e 4Nes,
Since ',. can be written as a function evaluated at sj
divided by the same function evaluated at s,, evolution is
time reversible according to this model with:
i "= ce4Ns' => si "=  logli + c'. (5)
4Ne
Here, c and c' are constants with c' = 1/4Ne log c, which
will be chosen to make the equilibrium frequencies sum
to one. The substitution rates can now be approximated
as:
q u 4Ne(sj si)
i( i _4Ne(s s,)
e2Ne(s s,)
= 4Y,t, (sj st) e2N(s,s,) 2Ne(s,s,)
log e
i (6)
,r i Zi
Given an exchangeability of si =S ij, this equation reduces
to equation (1) with = 0.5 and an adjustment factor of:
71
log
Ivi
'Vj
Z'v
Page 2 of 5
(page number not for citation purposes)
BMC Evolutionary Biology 2005, 5:21
http://www.biomedcentral.com/14712148/5/21
This adjustment factor is close to one for moderate ratios
of z, with a horizontal tangent around 1/Ti = 1 and a
slight bending downwards when deviating from this value
(Fig. 1). Thus, a value off = 0.5 is suggested for the +gwF
variants according to these derivations of the weak selec
tion model.
Evolutionary simulations
Evolutionary simulations were conducted to examine the
effects of violating certain assumptions in the above
model of weak selection. Unless otherwise noted, these
simulations were based on the K2P+gwF model with f =
0.5 and k = 2 (for the transition/transversion ratio).
Simulations consisted of four sequences of length 10,000
and relied on a symmetric rooted phylogeny with all
branch lengths equal to 0.10 expected substitutions per
site under the model in question [i.e., ((seql:0.10,
seq2:0.10):0.10, (seq3:0.10, seq4:0.10):0.10)]. Violations
of the weak selection model were incorporated in the sim
ulations by: (1) heterogeneous sequences with sites drawn
from different equilibrium base frequencies; (2) popula
tions in disequilibrium due to changing N,; and (3) an
accelerated C to T substitution rate. Estimates off for the
simulated sequences were made with the K2P+gwF
model. Forty simulations were run for each test condition,
with the results for the f estimates summarized as their
0.6 I
0.2 [
0.0 1
0.
0.5 1
7t ratio
2 5 10
Figure I
Adjustment factor as a function of the ratio of Ys.
The adjustment factor is given by log /( I)
(ri (7i ,
(equation (7)).
Table I: Starting equilibrium base frequencies and results for the simulations with either homogeneous or heterogeneous sequences
(i.e., those with sites from single versus multiple categories, respectively).
Categories
A
B
C
D
E
F
A+B
A+C
B+C
A+B+Ce
D+E+Fe
0.154
0.105
0.029
0.078
0.078
0.078
0.074
0.015
0.015
0.016
0.010
0.50 0.01
0.50 0.01
0.50 0.02
0.49 0.01
0.51 0.01
0.50 0.02
0.43 0.01
0.34 0.03
0.24 0.02
0.39 0.04
0.68 0.03
0.00 0.01
0.00 0.01
0.00 0.03
0.01 0.01
0.01 0.02
0.01 0.01
0.11 0.02
0.16 0.03
0.33 0.03
0.13 0.04
0.29 0.04
aExpected nucleotide distribution.
bNucleotide bias, as information content measured in bits: I.i log2
0.25
cMean twice the standard error of the estimate.
df= 0:0 for these simulations with the HKY model. With f= 0:0, the HKY+gwF variant is reduced in these simulations to its more standard F81
based model.
eThe heterogeneous sequences in these simulations were of length 9,999, rather than 10,000, since the latter is not a multiple of 3.
Page 3 of 5
(page number not for citation purposes)
I I_
I I I I I
BMC Evolutionary Biology 2005, 5:21
1
http://www.biomedcentral.com/14712148/5/21
means and twice their standard errors. In the first set of
simulations, six categories of sites with different equilib
rium distributions were considered (Table 1). The f esti
mates for the simulations with each category alone were
not significantly different from 0.5 (i.e., the value under
which the sequences were generated). In contrast, for the
simulated heterogeneous sequences (i.e., those composed
of equal numbers of sites from two or three different cat
egories), their values of f varied significantly in either
direction from 0.5. Analogous results were obtained for
the simulations of homogeneous and heterogeneous
sequences under the HKY model (with f = 0.0 instead of
0.5). Thus, the value off can vary considerably when het
erogeneous sequences are analyzed with a +gwF model.
Here, such deviations are a consequence of using a single
rate matrix to analyze sequences that were derived from
two or three different ones.
Category A
 
!
0.5 1.0 2.0 3.0
Population ratio
Category A
I I I I I
     
1.0 1.5 2.0 3.0 4.0
C to T acceleration factor
In the second set of simulations, Ne was kept constant
until the time of the most recent common ancestor for the
four simulated sequences. Then, Ne was either left
unchanged or was suddenly changed by a certain factor.
The latter was done by replacing the rate matrix derived
from equation (4), resulting in new equilibrium frequen
cies of the nucleotides. When Newas kept constant, the
selective pressures and drift were left unchanged, thereby
maintaining the same starting equilibrium frequencies
throughout the phylogeny. Thus, the corresponding esti
mates did not significantly differ from 0.5 (Fig. 2). In con
trast, increases in N, lowered the value off as the efficiency
of selection was increased relative to drift [4]. Corre
spondingly, the evolutionary process became more dom
inated by the ending nucleotide. This increasing
dominance can be expected to continue until a new equi
librium is restored (which occurs on a longer time scale
than that in these simulations).
Category B
Category C
0.5 1.0 2.0 3.0 0.5 1.0 2.0 3.0
Population ratio Population ratio
Category B
1.0 1.5 2.0 3.0 4.0
C to T acceleration factor
Category C
I I I I I
I I I I I
1.0 1.5 2.0 3.0 4.0
C to T acceleration factor
Figure 2
Two situations where f is affected by deviations from the model. (A) The effect of a change in Ne on the value of f.
This change in Ne occurs in the most recent common ancestor of the four simulated sequences. Population ratio refers to its
Ne after versus before this change. (B) The effect of an increased C to T substitution rate. Categories A, B, and C are defined in
Table I.
Page 4 of 5
(page number not for citation purposes)
0.50
0.45
0.40
0.35
B
1.20
1.00
0.80
0.60
0.40
0.20
BMC Evolutionary Biology 2005, 5:21
http://www.biomedcentral.com/14712148/5/21
In the third set of simulations, an acceleration in the C to
T substitution rate was incorporated, thereby modeling an
increase in their mutation rate due to the deamination of
methylated C's in CpG pairs [14]. The introduction of this
bias resulted in significant deviations off in either direc
tion from 0.5, even though their sequences were simu
lated in equilibrium (Fig. 2). Thus, the value of f can vary
considerably when the rates for reciprocal mutations are
unequal.
Discussion
This study illustrates how selection, drift, and mutation
within a population can be linked to the f parameter and
rate matrices of the +gwF variants for sequence evolution.
Our weak selection model relies on the fixation
probabilities of mutant alleles with additive genie selec
tion and equal mutation rates for reciprocal substitutions.
What is now needed are additional studies that link other
population genetics models to the +gwF variants [9]. For
example, the population genetics models of Li [ 15], which
focus on allele frequency distributions and different
modes of selection and mutation, could be studied for
their connections to the f parameter and +gwF rate
matrices.
Collectively, the three sets of simulations highlight that
the f parameter is complex and can be influenced by a
number of different factors [4]. This complexity limits its
biological interpretation and the use of its expected value
of 0.5 as derived for the weak selection model. Corre
spondingly, in many +gwF analyses, f ill need to be esti
mated as a free parameter rather than fixed beforehand.
Goldman and Whelan [4] focused on amino acid
sequences, where they found that the +gwF models pro
vided better fits to the majority of their protein data sets.
They also analyzed two rather small nucleotide data sets
for which the general reversible model (REV) outper
formed the +gwF variants. As noted by them, the REV
model provides enough free parameters to cover the
effects of a +gwF analysis. Thus, given sufficient data, this
model will consistently outperform the simpler +gwF
variants, since it can always accommodate more of the
evolutionary process by virtue of its extra parameters.
Nevertheless, as widely acknowledged, simpler models
have their place, since they allow one to maximize
analytical power for more limited data, while minimizing
the risk of overparameterization [13,16]. Thus, as for the
JC, K2P, and HKY models, we expect their +gwF variants
to remain of interest as part of the hierarchy of simple to
complex models for sequence evolution.
Authors' contributions
Both authors contributed to the conception and design of
this study and to the writing, reviewing, and final
approval of this article. B.K. performed the simulations
and parameter estimations.
Acknowledgements
B.K. thanks the Carlsberg Foundation, the University of Aarhus, and the
Danish National Science Research Council (grant number 21000283) for
their support. Both authors also thank the Department of Zoology, Univer
sity of Florida for its support.
References
I. Felsenstein J: Evolutionary trees from DNA sequences: a max
imum likelihood approach. JMot Evol 1981, 17:368376.
2. Tajima F, Nei M: Biases of the estimates of DNA divergence
obtained by the restriction enzyme technique. J Mol Evol 1982,
18:115120.
3. Hasegawa M, Kishino H, Yano T: Dating of the humanape split
ting by a molecular clock of mitochondrial DNA. J Mol Evol
1985, 22:160174.
4. Goldman N, Whelan S: A novel use of equilibrium frequencies
in models of sequence evolution. Mol Biol Evol 2002,
19:18211831.
5. Jukes TH, Cantor CR: Evolution of protein molecules. In Mam
malian Protein Metabolism Edited by: Munro HN. New York: Academic
Press; 1969:21132.
6. Kimura M: A simple method for estimating evolutionary rates
of base substitutions through comparative studies of nucle
otide sequences. J Mol Evol 1980, 16:1 I 1120.
7. Ohta T: Slightly deleterious mutant substitutions in
evolution. Nature 1973, 246:9698.
8. Ohta T: The nearly neutral theory of molecular evolution.
Annu Rev Ecol Syst 1992, 23:263286.
9. Buhner M: The selectionmutationdrift theory of synony
mous codon usage. Genetics 1991, I 29:897907.
10. Carlini DB, Stephan W: In vivo introduction of unpreferred syn
onymous codons into the Drosophila Adh gene results in
reduced levels of ADH protein. Genetics 2003, 163:239243.
I I. Kimura M: On the probability of fixation of mutant genes in
populations. Genetics 1962, 47:713719.
12. Kimura M, Ohta T: Population genetics, molecular biometry,
and evolution. Proc Sixth Berkeley Symp Math Stat Prob 1972, 5:4368.
13. Felsenstein J: Inferring phylogenies Sunderland, MA: Sinauer Associates;
2004.
14. Razin A, Riggs AD: DNA methylation and gene function. Science
1980, 210:604610.
15. Li WH: Maintenance of genetic variability under mutation
and selection pressures in a finite population. Proc Natl Acad Sci
U S A 1977, 74:25092513.
16. Posada D, Crandall KA: Selecting models of nucleotide substi
tution: an application to human immunodeficiency virus I
(H IV I). Mol Biol Evol 2001, I 8:897906.
Page 5 of 5
(page number not for citation purposes)
Publish with BioMed Central and every
scientist can read your work free of charge
"BioMed Central will be the most significant development for
disseminating the results of biomedical research in our lifetime."
Sir Paul Nurse, Cancer Research UK
Your research papers will be:
available free of charge to the entire biomedical community
peer reviewed and published immediately upon acceptance
cited in PubMed and archived on PubMed Central
yours you keep the copyright
Submit your manuscript here: BioMedcentral
http://www.biomedcentral.com/info/publishing adv.asp
BMC Evolutionary Biology 2005, 5:21
