|UFDC Home||myUFDC Home | Help|
|Table of Contents|
|List of Tables|
|List of Figures|
|List of abbreviations|
|Exobiology and post-genomic...|
|Extending the PBL method for estimating...|
|The adaptive evolution database...|
|Detecting compensatory covariation...|
|The planetary biology of cytochrome...|
|Appendix A: Names and abbreviations...|
This item has the following downloads:
|Table of Contents|
Table of Contents
List of Tables
List of Figures
List of abbreviations
Exobiology and post-genomic science
Extending the PBL method for estimating numbers of synonymous and nonsynonymous mutations to account for ambiguity in inferred ancestral sequences
The adaptive evolution database (TAED): Application of the extended PBL method to a large dataset
Detecting compensatory covariation signals in protein evolution using reconstructed ancestral sequences
The planetary biology of cytochrome p450 aromatases
Appendix A: Names and abbreviations of nucleic acid bases and amino acids
THE BIOINFORMATICIST'S TOOLBOX IN THE POST-GENOMIC AGE: APPLICATIONS
DAVID R. SCHREIBER
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
David R. Schreiber
Tremendous thanks go to James Deyrup-for the opportunity, the faith, understanding,
generosity, goodwill, humanity and love. Thanks also go to Steve Benner-for the freedom, the
enthusiasm, the privilege, and for an absurd portion of patience. And thanks go to my family-
for being there, wherever "there" chose to be.
TABLE OF CONTENTS
A C K N O W L E D G M E N T S ...............................................................................................................3
L IS T O F T A B L E S ................................................................................................. ..................... 6
LIST OF FIGURES ................................. .................... .7
L IST O F A B B R E V IA TIO N S .............................................. ...............................................8.......
A B S T R A C T ............... ................................................................ .......................................... 10
EXOBIOLOGY AND POST-GENOMIC SCIENCE ..............................................................12
In tro d u c tio n ............................................................................................................................. 1 2
The Conventional Evolutionary Paradigm ........................................................ ................ 13
Post-Genomic Science: Modeling Molecular Evolution.................................................. 17
The M arkov M odel ...................................................... .................. ............... 17
Non-Markovian Protein Evolution as a Post-Genomic Tool for Structure Prediction.... 18
Structure Prediction as a Tool for Identifying Long Distance Homologs.................... 18
Recruitment of Function ................. ........ ............................................ 24
Correlating the Paleontological Record with Episodes of Sequence Evolution..................30
Identification of in vitro Behaviors that Contribute to Physiological Function. ..................32
Structure Prediction and a Rapidly Searchable Database..................................................33
C o n clu sio n s............................................................................................................. ........ .. 3 6
EXTENDING THE PBL METHOD FOR ESTIMATING NUMBERS OF SYNONYMOUS
AND NONSYNONYMOUS MUTATIONS TO ACCOUNT FOR AMBIGUITY IN
INFERRED AN CESTRAL SEQUEN CES ............................................................................41
Introduction ........................ ......................................... ....... ..................... 41
A n O verview of the PB L M ethod.................................................................... ................ 42
Extending the PBL Method to Phylogenetic Trees with Inferred Ancestral Sequences:
Putting A m biguity B ack into the Equations .................................................. ................ 45
THE ADAPTIVE EVOLUTION DATABASE (TAED): APPLICATION OF THE
EXTENDED PBL METHOD TO A LARGE DATASET................................................56
B ack g rou n d ........................................................................................................... ........ .. 56
R results ........................................................................................................... 60
D iscu ssio n .............................................................................................................. ........ .. 6 3
C o n clu sio n s............................................................................................................ ........ .. 6 4
M materials and M methods .............. .............................................................................. 66
DETECTING COMPENSATORY COVARIATION SIGNALS IN PROTEIN
EVOLUTION USING RECONSTRUCTED ANCESTRAL SEQUENCES ........................ 68
In tro d u c tio n ............................................................................................................................. 6 8
R e su lts ................. ....... ............................. .................................................. .......... . ....... 7 3
Charge Compensation and the Surface of the Folded Protein....................................75
Charge Compensation in Both Contiguous and Non-Contiguous Position Pairs............76
Enhancing the Charge Compensation Signal .............. ................................................. 77
Charge Compensation in Specific Secondary Structural Elements...............................78
Charge Compensation in Buried Residues.................................................... 79
D discussion .................... ... ..... .......... ................... .. .... ................... 80
Accounting for the Stronger Signal From Node-Node Comparison............................. 81
A Model-Independent Method to Evaluate an Evolutionary Tree ............................... 82
Darwinian Requirements for Compensatory Covariation.........................................84
M e th o d s ........................................................................................................ .................... 8 7
THE PLANETARY BIOLOGY OF CYTOCHROME P450 AROMATASES ........................... 99
B a c k g ro u n d ............................................................................................................................. 9 9
R results ........................................................................................................... 10 1
D discussion .................................................................................................... 106
C conclusions ................................................................................................. 112
M e th o d s ........................................................................................................ ......... . ....... 1 14
NAMES AND ABBREVIATIONS OF NUCLEIC ACID BASES AND AMINO ACIDS....... 125
AN IMPLEMENTATION OF THE EXTENDED PBL METHOD CODED IN JAVA 1.1 ......126
L IS T O F R E F E R E N C E S ............................................................................................................. 14 1
B IO G R A PH IC A L SK E T C H .................................................... ............................................. 151
LIST OF TABLES
3-1 A sam ple listing from TA E D ........................................... .......................... ................ 67
4-1 Frequencies of the average contiguous position pairs ..................................................96
4-2 List of 71 protein fam ilies used in this analysis............................................ ................ 97
5-1 Frequency distributions of stem pig duplication substitutions ..................................123
5-2 Distributions of stem pig duplication substitutions .......... ....................................124
A-i Names and abbreviations of nucleic acid bases.......... .......................................125
A-2 One and three letter symbols for the amino acids.......... .......................................125
LIST OF FIGURES
1-1 Predicted surface, interior, and secondary structure assignments.................................38
1-2 Evolutionary tree showing the evolutionary history of the leptins...............................39
1-3 Evolutionary tree showing the evolutionary history of the extracellular........................40
2-1 Two possible ways to get from TCA to CAA..................................................... 49
2-2 A hypothetical phylogeny for sequences tl, t2 and t3. ................................ ................ 50
2-3 Parsimony for each of the nucleotides in sequences tl, t2 and t3................................51
2-4 EPSs for ancestral node a2. ........................................................... ................ 52
2-5 Sequences tl, t2 and t3, numbers of degenerate sites, and average degeneracies ............53
2-6 Numbers of transitions and transversions for each pair of sequences .............................54
2-7 Tallies of average degeneracies and mutations for each pair of sequences....................55
4-1 A leaf-leaf comparison (red) traverses more evolutionary distance ................................89
4-2 Attempted detection of charge compensatory covariation signal using leaf-leaf ...........90
4-3 Detecting charge compensatory covariation signal using explicitly reconstructed...........91
4-4 Predicting secondary structure using contiguous pairs of compensatory changes ...........92
4-5 Distribution of distances between charge anti-compensatory pairs...............................93
4-6 Surface accessibility of charged residues ................................................... 94
4-7 A schematic illustration of the use of compensatory covariation to select.....................95
5-1 Reactions catalyzed by aromatases on multiple androgenic substrates .........................117
5-2 D eating the pig duplication events. ........................................................1...... 18
5-3 The amino acid alignment of exon 4 of two aromatase isoforms...............................119
5-4 Cladogram of the order Artiodactyla showing the extant families............................... 120
5-5 Phylogenetic tree for the 18 vertebrate aromatase genes...................... .................. 121
5-6 The distribution of amino acid replacements on the tertiary structure of cytochrome ....122
LIST OF ABBREVIATIONS
ASA accessible surface area
BLAST basic local alignment search tool
CASP Critical Assessment of Structure Prediction
C-beta beta carbon atom
cDNA complementary DNA
CUTG Codon Usage Tabulated from GenBank
DNA deoxyribonucleic acid
DSSP The Dictionary of Protein Secondary Structure
EPS equally parsimonious solution
HSP heat shock protein
Indel insertion (or) deletion
JTT Jones-Taylor-Thornton matrix substitution model
KA number of nonsynonymous substitution per nonsynonymous site
kcat catalytic constant
KM Michaelis constant
Ks number of synonymous substitution per synonymous site
Ma million years ago
MC the MasterCatalog
MHC major histocompatibility complex
mRNA messenger RNA
MSA multiple sequence alignment
NMR nuclear magnetic resonance
ORF open reading frame
PAM point-accepted mutation
PAML Phylogenetic Analysis by Maximum Likelihood
PAUP Phylogenetic Analysis Using Parsimony
PBL Pamilo, Bianchi and Li
PCR polymerase chain reaction
PDB The Protein Data Bank
RNA ribonucleic acid
SH2 Src homology 2 domain
TAED The Adaptive Evolution Database
TREx transition redundant exchange
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
THE BIOINFORMATICIST'S TOOLBOX IN THE POST-GENOMIC AGE: APPLICATIONS
David R. Schreiber
Chair: Steven A. Benner
Major Department: Chemistry
Bioinformaticists of the post-genomic age, having at their disposal the complete sequence
of the human genome, find themselves in possession of a true embarrassment of riches-a
massive collection of data that grows ever larger with each new submission. To effectively mine
this glut of information, the tools of bioinformatics must adapt, and new ones must be developed,
tested and applied. Herein we describe a modification of a method, first developed by Pamilo,
Bianchi, and Li for estimating mutation rates between the protein sequences from two extant
species which allows for comparisons with proteins from ancestral species. In this case, the
ancestral sequences are determined by parsimony, but the method is applicable to any
phylogenetic method used to deduce ancestral states from extant species data. Computationally
cheap, the parsimony method lends itself to the analysis of large datasets. We conducted such an
experiment on a version of the GenBank database. The results were collected in a new database
called The Adaptive Evolutionary Database (TAED), whose purpose was to be a collection of
proteins that have rapidly (and presumably adaptively) evolved at some point in their natural
history. In addition to discovering adaptive episodes, we used this method to search for
compensatory covariation signals within another large database of sequences-signals that
provide clues relating to the proteins' higher-order structures. These methods are used as well in
interdisciplinary investigations that incorporate information from the paleontological and
geological records. This provided high-level functional information relating to cytochrome P450
aromatases and their role in Suoidea evolution.
EXOBIOLOGY AND POST-GENOMIC SCIENCE
The 1990s were bountiful years for genomics. At their midpoint, complete sequences were
available for the genomes of several eubacteria, an archaebacterium, and a eukaryote (yeast).
Before the decade was over, a half dozen additional bacterial genomes would be added to this
list, together with the complete genome of Caenorhabditis elegans, and three years later that of
Homo sapiens. In the 21st century, the first century of the post-genomic era, sequences will be
added from hundreds to thousands of other organisms, collected with varying degrees of
systematic effort [Ben98].
As these data have accumulated, it has become increasingly apparent that new methods are
needed to exploit the information that they (must) contain. Chemistry has always been driven by
the discovery of new natural products, elucidation of their structures, and exploration of their
behaviors. The genome database provides an enormous new collection of natural product
structures to study. These display every behavior of interest to chemists: conformation,
supramolecular organization, combinatorial assembly, photochemistry, and catalysis are just a
few. Organic chemistry should be revolutionized by genomic data. But how?
A similar question can be framed for the biomedical sciences. Pharmaceutical and genome
corporations worldwide are collecting and cataloging sequence data, comparing the expressed
genetic inventory of diseased and normal tissues, and attempting to correlate genomic data with
physiological function. It seems that the treatment of human disease should be revolutionized by
genomic sequence data. But it is not obvious precisely how this will happen.
One consequence of genomic sequences from a variety of organisms is the ready
availability of the evolutionary histories of families of proteins represented within the genome
sequence databases. Sequence data will be organized as a set of independently evolving protein
sequence "modules." For each of these, an evolutionary history can be built that will consist of a
multiple alignment of the sequences of the proteins in the module themselves (as well as their
encoding DNA sequences), an evolutionary tree, and a reconstructed ancestral DNA and protein
sequence for each branch point in the tree [BenOO]. The tools described herein are designed to
exploit evolutionary histories directly, extracting information concerning protein structure,
behavior, and function from a detailed understanding of how protein sequences divergently
evolve under functional constraints. In a post-genomic world, with volumes of sequence data
from an ever growing number of organisms, these tools will be used widely to learn more from
sequence data about living systems, their chemistry and their diseases.
The Conventional Evolutionary Paradigm.
Since the mid 1970s, it has been known that homologous proteins-those related by
common ancestry-have analogous conformations (folds) [Ros76, Cho86], at least in their core
domains. From this observation has emerged many tools, including profile analysis, homology
modeling, and threading [Gri87, May95], that help biological chemists model the conformation
of a protein from the conformation of one of its homologs.
If homologous proteins have analogous conformations, it might be reasoned that
homologous proteins might be analogous in other ways as well. Biomolecular behavior depends
in part on conformation. Thus, two homologous proteins with analogous conformations might
have analogous behaviors. The reasoning can be carried further. As biomolecular behavior is
important for biomolecular function, perhaps two homologous proteins will also have analogous
This train of logic has been viewed as a starting point for analyzing genomic data. Genome
sequencing projects generate sequences of proteins without any supporting experimental data
describing the protein itself, data that accompany most sequences generated in classical
biochemical research. While applications for genomic sequences are many, virtually all require
that a structure, behavior, or function be assigned to proteins known only as sequences. Chemical
theory is insufficient to allow us to assign structure, behavior, or function directly to a sequence.
But bioinformatic theory is adequate to use the genomic sequence to identify homologous
proteins in the database. Some of these homologs may have known structures, behaviors, or
physiological functions. If homology implies structural analogy, and (from there) behavioral
analogy, and (from there) functional analogy, then the task of assigning structure, behavior, or
function to the genomic data should become trivial whenever a homolog with a known structure,
behavior, or function can be identified.
Many tools for analyzing genomic sequence data are based on this logic, which we shall
call the "conventional evolutionary paradigm." The conventional evolutionary paradigm is
implemented throughout bioinformatics research in various simple ways. Consider the following
recipe, used in a variety of commercial software packages to analyze a new sequence for a new
protein collected in a genome project:
* The new sequence is first recorded.
* The sequence is then used as input into a basic local alignment search tool (BLAST) which
finds putative homologs within the existing sequence database.
* Proteins in the database that have sequences similar to the new sequence (by some scoring
criterion) are recorded. These are putative homologs of the new sequence.
* The "functions" of the putative homologs are read from their documentation in the database.
* The function of the new sequence is presumed to be the same as that of the homologs.
This approach has many well known limitations. First, a BLAST search need not find an
analogous sequence in the database whose function is already known. In many cases, a BLAST
search fails to find any sequence in the database with a statistically significant sequence
similarity. If no putative homolog can be identified, then it follows that no homolog with known
structure, behavior or function can be identified.
In other cases, a BLAST search identifies one or more putative homologs, but the
homologs have not been studied sufficiently to assign a structure, behavior, or function. Again,
the paradigm fails to resolve the problem.
More frequently, the BLAST search identifies many possible homologs in the database, all
with statistically marginal sequence similarities. The approach frequently presents the user with
the documentation from several of these, and leaves the user to decide which, if any, of the
putative functions are correct. As the literature shows, weak sequence similarities reflecting
distant homologies have been both profoundly informative and profoundly deceptive [Ben91].
Very frequently, the functions of these putative homologs are widely different, making it
difficult to decide which, if any, function should be imputed to the probe sequence.
These problems are all well recognized as limitations of the conventional evolutionary
paradigm. Less well recognized, however, are problems with the elements of the logic upon
which the conventional evolutionary paradigm is based. Proteins with analogous folds need not
have analogous behaviors. Proteins with analogous behaviors need not have analogous functions.
As has been reviewed in detail [Ben88b, Ben89a, Ben89c, Ben90], evolution is a potent process
for recruiting old protein folds to perform new functions, and many cases are known where the
conventional evolutionary paradigm would generate (indeed, has generated) incorrect
conclusions. For example, fumarase (functioning in the citric acid cycle), adenylosuccinate lyase
(functioning in nucleotide biosynthesis) and aspartate ammonia lyase (functioning in amino acid
metabolism) are all identified as homologs by a BLAST search. Yet their behaviors are
analogous only at the level of organic reaction mechanism, and there only at the most abstract
level. The conventional evolutionary paradigm fails to suggest correct conclusions in this case.
Work by Babbitt, Kenyon, Gerlt, and Petsko has added other fascinating examples of how
microorganisms can recruit a common fold to catalyze a variety of reactions from the
racemization of mandelic acid to the opening/closing of a lactone [Bab95].
In proteins involved in so-called advanced functions (for example, developmental biology)
in more complex organisms, difficulties with the homology-implies-analogous-
structure/behavior/function assumption underlying the conventional evolutionary paradigm
become confounding. For example, protein serine kinases and protein tyrosine kinases are clearly
identifiable homologs at the level of sequence similarity. The chemist would say that both
classes of protein operate using analogous reaction mechanisms, differing only in the source of
the oxygen nucleophile in the phosphoryl transfer reaction. The biologist would note that the
physiological function of the two classes of proteins are greatly different, however. For any
biomedical application, the biologist would be correct. The physiologically relevant differences
in behavior, central to the understanding of biological function phosphorylationn on tyrosine
versus phosphorylation on serine) cannot be inferred for one family from the other using the
conventional evolutionary paradigm.
The deeper the chemistry of developmental biology is probed in metazoa (multicellular
animals), the more it becomes apparent that function in the Darwinian sense can change with
very little change in sequence. For example, a variety of Src homology 2 (SH2) domains all bind
peptide sequences containing phosphotyrosine residues. The binding specificities of the different
SH2 domains are different, however, for the peptide sequences surrounding the phosphotyrosine
residues. It is these specificities that determine which protein binds to each individual SH2
domain. The physiologically relevant function of each SH2 domain centers on this pairwise
interaction. Thus, any statement of function for any particular SH2 domain must at least identify
its phosphotyrosine-containing partner. To assign function at this level, the conventional
evolutionary paradigm has little to say.
Post-Genomic Science: Modeling Molecular Evolution
These considerations suggest that if it is to be used to analyze genomic data, evolutionary
theory must be applied at a more sophisticated level than the level exploited by the conventional
evolutionary paradigm. Much of this sophistication is available at the present time. Let us
examine briefly how biomolecular evolution is modeled as a starting point to designing tools for
interpreting genomic data.
The Markov Model
Virtually all of contemporary genomic science is based in some sense on an analysis where
sequence data are treated simply as strings of characters. This analysis is based on a specific
model of sequence evolution at the level of proteins. Overwhelmingly, these models are in some
sense "Markovian." They assume, implicitly or otherwise, that variation in a protein sequence
occurs independently at each position in the sequence, that future mutations occur independently
of past mutations, and that a single substitution matrix adequately describes the relative
likelihood of each amino acid being replaced by any other amino acid during an episode of
molecular evolution. Further, simplifying assumptions are made when scoring gaps in a pairwise
The primary virtue of this model is its simplicity. Yet the model is certainly false. Proteins
are not linear strings of independent characters. Rather, they are folded structures that have
arisen by divergent evolution under constraints imposed by natural selection seeking adaptive
function. Because real proteins fold in three dimensions, mutations at different positions in the
protein sequence are not independent [Coh94]. Nor are future mutations independent of past
mutations [Coh94]. Insertions and deletions (indels) display complex patterns that reflect
functional constraints on protein evolution [Ben93].
Here, the second virtue of Markovian models, their exactness, becomes useful. Markovian
models generate specific expectations. These can be explicitly violated by the behavior of real
proteins during divergent evolution, and the extent of the violation can frequently be quantitated.
The differences between the expected and actual evolution of proteins contain clues concerning
how proteins fold, function, and create new function. Let us examine some tools to extract these
Non-Markovian Protein Evolution as a Post-Genomic Tool for Structure Prediction
Tools that extract information from the non-Markovian behavior of proteins undergoing
divergent evolution have been exceptionally valuable for predicting the conformation of proteins
from an evolutionary history [Ben97]. These tools have made major strides towards solving a
problem that as recently as a decade ago was considered to be unsolvable. The approach is
described in detail previously [Ben89b, Ben91], and will not be described here. Rather, we shall
assume that reasonably accurate (if not perfect) models of protein secondary structure can be
predicted for a family of homologous protein sequences. We will then ask how these predictions
might be exploited to solve one of the problems noted above with the conventional evolutionary
paradigm: the need to have methods more powerful than simple sequence analysis to detect long
Structure Prediction as a Tool for Identifying Long Distance Homologs
The core of a protein fold is conserved during divergent evolution long after the sequences
within the family have diverged so much that homology is no longer evident by sequence
analysis alone. In the mid 1970s, Rossman and his coworkers suggested that analogous folds in
two proteins might indicate that the proteins are homologs, even when the sequences of the two
proteins bear no statistically significant similarities [Ros76, Cho86]. This observation prompted
many groups to develop methods for building models from protein sequences starting from
marginal sequence similarities [Gri87, May95]. These are known as profile or threading
In practical application, the primary difficulty with these approaches is that they generate
too many "hits," or matching of a probe sequence against a sequence database. The analyses
frequently return many proteins that are possible homologs. As has been shown in many joint
prediction projects (such as the Critical Assessment of Structure Prediction, or CASP projects
[Mou96]), these hits are sometimes correct and informative. Equally likely, however, the
putative homologs identified by a threading or profile analysis are not true homologs, or are
misaligned with true homologs using the analysis. Especially needed, therefore, are tools to
confirm or (especially) deny the possibility that a target identified from a sub-statistical sequence
similarity is a homolog.
In the early 1990s, the first example was presented where a structure prediction was used
to critically evaluate suggestions, based on sub-statistical sequence similarities, that two proteins
were homologous. The case concerned the protein kinases [Ben91]. Protein kinase contains the
sequence motif Gly-Xxx-Gly-Xxx-Xxx-Gly (where Xxx is any amino acid) preceded by a strand
and followed by a helix [Ste84]. A similar motif was found in adenylate kinase, where a crystal
structure was known. Therefore, following the conventional evolutionary paradigm, several
groups proposed that the two structures were homologs. This proposal implied in turn that
protein kinase would adopt the same fold as adenylate kinase. Several groups built models based
on this inference [Tay86, Tay84, Wie86, Sho81].
Contrasting with this view was a model built for the secondary and tertiary structure of the
protein kinase family based on the evolutionary history of the protein. In this model, the Gly-
Xxx-Gly-Xxx-Xxx-Gly motif was predicted to be flanked by beta strand both before and
afterwards. Thus, the predicted secondary structural model of protein kinase was not congruent
with the experimental structure of adenylate kinase. Accordingly, the prediction noted that the
two folds could not be the same. From this, it was inferred that the two proteins could not be
This alternative model based on post-genomic methods was later shown to be correct by a
subsequently determined crystal structure [Kni91]. This was perhaps the first time that a
predicted structure and a motif analysis had been used to infer the absence of homology between
two families catalyzing analogous chemical reactions. The tool used in the protein kinase
prediction proved to be an example of a more general tool to confirm or deny long distance
homology between two protein families.
Using this tool, core secondary structural elements of the protein families are aligned
sequentially. In this process, the secondary structural elements are considered to be congruent
when every core element from one family finds a core element in the other of the same type
(helix or strand), in the same order, where gaps matched against non-core elements (where a
non-core element in one family is not aligned against any element in the other) are allowed in
any number, and a core element in one protein may be missing in the other, but may not be
aligned with a core element in the other of a different type (i.e., helix against strand).
This approach is especially powerful when coupled with motif analysis, as was done for
protein kinase. Congruent secondary structural elements can indicate whether a motif is a
significant indicator of homology or not. Thus, flanking a motif, a secondary structural model
might have one of four forms: helix-motif-helix, helix-motif-strand, strand-motif-helix, and
strand-motif-strand. The secondary structural alignments are congruent if and only if the
secondary structural elements flanking the motifs correspond between the two proteins.
Homology is not denied if and only if the secondary structures are congruent. This method is
preferably applied when each family contains proteins that are at least 120 point accepted
mutations per 100 amino acids (PAM units) divergent.
Important to this tool is a distinction between core and non-core secondary structural
elements. The failure of core secondary structural elements to correspond between two protein
families is a clear contradiction of homology. But non-core elements need not be conserved over
long periods of divergent evolution, and failure to find a correspondence between non-core
elements from one protein family in another does not exclude homology. Several ways of
defining "core" exist in this context. Some involve crystal structure data; others do not.
When an experimental structure is known for a protein (for example, by crystallography or
nmr), core elements are conveniently defined geometrically; a core element is one where a
substantial fraction is buried in the protein fold unexposed to solvent water. Most useful is the
application of this concept to beta strands in a beta sheet. A core strand is one that forms
backbone hydrogen bonding interactions with two other strands on both of its edges. Thus, a core
strand is distinct from an edge strand, which forms backbone hydrogen bonds to only one other
strand on only one if its edges. Core strands are highly conserved during divergent evolution. If
they are lost, the sheet in which they participate cannot be conserved.
A more general definition of a core secondary structural unit focuses on the evolutionary
stability of the secondary structural unit. For the purpose of detecting long distance homologs, a
core secondary structural element, predicted or otherwise, is one that cannot be lost during
divergent evolution without damaging the integrity of the protein fold. This is based on notions
of continuity in protein evolution, most fundamentally on the assumption that a protein that has
one topology of protein fold (e.g., an eight fold alpha-beta barrel) cannot by continuous
evolutionary processes be converted into a protein with another (e.g., an immunoglobulin fold).
It is clear that divergence of biological function can add or subtract peripheral secondary
structural elements to create or remove contact elements, expand or eliminate binding sites, or to
modify the performance of the protein in other fashions.
In this case, non-core segments of a protein family are recognized from a family of
sequences, preferably between 100 and 150 PAM units divergent for the most divergent pairs,
where regions that are deleted. If a segment (including a segment containing a helix or a strand)
is deleted in a protein family built from members all sharing significant sequence similarity, it
cannot be essential for the integrity of the fold in the family. In applying this tool, one must be
concerned about database mistakes; a part of sequence that is deleted because the scientist
providing the entry into the database neglected to collect it, or neglected to enter it, is not a
deletion from the purpose of detecting non-core segments.
A third method for identifying a core segment of a protein sequence is applicable to any
alignment containing three or more sequences. In the tool, a pairwise alignment is constructed
for each pair of sequences in the set using a dynamic programming tool. Consider for example a
set of sequences with three proteins, A, B and C. A core segment of the multiple alignment is
defined as those regions where the alignment of A with B and the alignment of B with C is
consistent with the alignment of A with C.
A final method relates to the reconstructed ancestral sequence of the protein. It has long
been appreciated [Zuc65] that when the sequences of two or more homologous proteins are
available, it is possible to construct a probabilistic model for the sequence of the ancestral
protein. The part of the ancestral sequence that is reconstructed with high probability is the core
of the protein. These reconstructions are done by well-known maximum likelihood tools (for
example, as implemented within Darwin, available via the web at the address
http://cbrg.inf.ethz.ch (see also [Gon91])). A core is defined from the ancestral sequences as a
segment of the multiple alignment where the average probability of the most frequent amino acid
at that position is greater than one standard deviation above the average probability of all of the
reconstructed positions in the multiple alignment. This is a tree-weighted measure of the
divergence in the family as a whole, and correlates with core regions defined in the other ways,
as the region of the ancestral sequence that is reconstructed with high probability is also the one
that has not suffered insertions and deletions, and the one that has seen relatively little sequence
divergence. These segments also correlate with core segments defined geometrically.
Structure predictions made using post-genomic tools have now been applied in several
cases to make statements about long distant homology and, from there, catalytic behavior. One
of the most dramatic was for the heat shock protein 90 (HSP 90) family. The predicted secondary
structural elements were assembled to yield a tertiary fold that resembled closely the fold
determined in the N-terminal fragment of DNA gyrase B (the ATPase fragment) [Wig91]. This
analysis was published before an experimental structure of HSP 90 was known [Ger97], and
after an experimental study had suggested that HSP 90 did not have catalytic activity as an
ATPase. The prediction thus generated a statement about catalytic behavior (and, presumably,
physiological function) in addition to secondary structure.
Post-genomic prediction tools were also applied to the family of ribonucleotide reductases
[Tau97]. Here, proteins with highly divergent sequences were shown to belong to one universal
family using a combination of protein structural analysis and gene sequencing. The tools in this
case confirmed a speculation by Stubbe and coworkers based on a mechanistic analysis that all
ribonucleotide reductases are related by common ancestry [Mao92].
Comparing predicted secondary structure models for a protein family is now a proven
technique with an impressive track record to detect or deny long distance homologs based on
sequence data alone. As genome projects are completed, and evolutionary histories for their
constituent protein families articulated, secondary structural models for those families should
become routinely available. Thus, these tools should solve the first problem with practical
implementation of the conventional evolutionary paradigm- the difficulty of detecting
homologs using standard alignment tools.
Recruitment of Function
Predicted secondary structures of a protein family offer an approach to solve the general
problem of finding distant homologs in a database. They do not, however, address other
problems associated with the conventional evolutionary paradigm for assigning function to
sequences those that arise when the assumptions underlying the conventional evolutionary
paradigm are invalid. This is especially the case when one protein family gives rise to proteins
with different function. This possibility can never be excluded a priori. In the case of the heat
shock protein 90, for example, the post-genomic prediction tool showed that the fold of HSP 90
was analogous to the fold of gyrase. It suggested (under the conventional evolutionary paradigm)
that the behavior and function of HSP 90 was analogous to the behavior and function of gyrase.
Some of these suggestions may in fact be true. But they need not be. The gyrase fold could be
recruited in HSP 90 to perform many other (indeed, any other) function.
The premise for the post-genomic sciences, however, is that the evolutionary histories of
protein families will become generally available as a result of massive genome sequencing
projects. Non-Markovian behavior is not only expected to arise because of folded structure. In
addition, patterns of sequence evolution are expected to violate the Markovian model differently
if the protein in question is undergoing an episode where function is being changed.
A study illustrated the power of this approach to identify the active site residues of
mammalian alcohol dehydrogenase (E.C. 22.214.171.124). Mammalian alcohol dehydrogenases have
undergone a rapid episode of sequence evolution in and around the active site as substrate
specificity has divergently evolved to handle xenobiotic substances in the liver. In contrast, over
a comparable span of evolutionary distance, the active site of yeast alcohol dehydrogenase has
changed very little, corresponding to an apparently constant role of the enzyme to act on the
ethanol-acetaldehyde redox couple. Indeed, by identifying positions in mammalian
dehydrogenases where amino acid variation was observed over a span of evolution where the
same residues were conserved in the yeast dehydrogenases provided a clear map of the active
site of the protein.
A particularly clever use of this approach has been described by Lichtarge et al. These
workers described an evolutionary trace method that defined functionally significant residues as
those that are conserved within a family [Lic96]. They then used this approach to identify
patches on the surface of proteins that contribute to functionality.
The approaches work because the scientist understands something about the function of a
protein family. Thus, the active site of alcohol dehydrogenase could be identified because the
scientist knew that substrate specificity is conserved in one branch of a protein family (yeast
alcohol dehydrogenases) but not in another (liver alcohol dehydrogenases). The evolutionary
trace method works only when the function being traced is conserved within the family. Neither
approach is applicable when the evolutionary status of the protein function (conserved or not
conserved) is not known. In general, this status will not be known to the post-genomic scientists
examining only genomic data. To apply these post-genomic tools, therefore, the post-genomic
scientist needs a tool for learning whether function is changing within the family in an episode of
protein sequence evolution one based on an analysis of the sequence data alone.
One tool is based on the fact that the genetic code is degenerate. More than one triplet
codon encodes the same amino acid. Therefore, a mutation in a gene can be either silent (not
changing the encoded amino acid) or expressed (changing the encoded amino acid). Especially in
multicellular organisms, and most particularly in multicellular animals (metazoa), silent changes
are not under selective pressure. In contrast, expressed changes at the DNA level, by changing
the structure of the protein that the gene encodes, change the property of the protein. This
frequently places these changes under selective behavior.
The outcome of different selection pressures on silent and expressed mutations is non-
Markovian behavior in the evolution of DNA sequences. Consider three cases. In the first, we
examine an episode of protein sequence evolution during a period of evolutionary history where,
at the outset of the period, the behavior of a protein whose sequence has been perfectly
optimized for a specific biological function, and where that function remains constant for the
protein throughout the period being examined. During this period, changes in the DNA sequence
that lead to a change in the sequence of the encoded protein (expressed changes) will diminish
the survival value of the protein [Ben88b] and therefore will be removed by natural selection.
Silent changes will not be removed by natural selection, but will accumulate at an approximately
clock-like rate, as silent changes are approximately neutral, especially in higher organisms. Thus,
the ratio of expressed to silent changes will be low during a period of evolution of a protein
family where the ancestor and its descendants share a common function.
A second case concerns a period of evolution where a protein is acquiring a new derived
function. Its amino acid sequence at the beginning of this episode will be optimized for the
ancestral function, rather than the derived function. To be optimally suited for the derived
function, therefore, amino acids must be changed in the protein. Thus, changes in the gene that
are expressed will have a chance of improving the behavior of the protein vis a vis its new
biological function, and these will be selected for. The ratio of expressed to silent substitutions at
the DNA level will be high, and a high expressed/silent ratio will reveal a period of evolution of
a protein family where the function of the ancestor is changing.
In a third case, consider the evolution of a gene encoding a protein that has no function (a
pseudogene, for example), neutrally drifting without functional constraints. In this case, the
expressed/silent ratio will reflect random introduction of point mutations. Given the genetic code
and a typical distribution of amino acid codons within the gene, a ratio of expressed to silent
changes will be approximately 2.5 during the period of evolution of a protein family where the
ancestor and its descendants have no function.
Stewart and Messier introduced this approach through a clever and systematic analysis of
conservation and variation within the lysozyme family [Mes97]. Episodes of rapid sequence
evolution were identified by high expressed/silent ratios in specific branches of the evolutionary
trees. These were correlated with specific episodes in the evolution of the protein itself and the
physiology of the organisms that contained the protein.
This approach can be illustrated quite easily in a biomedically interesting family of
proteins, using an "off the cuff" method to examine the protein leptin, a protein whose mutation
in mice is evidently correlated with obesity, and was previously known as the "obesity gene
protein." The protein has attracted substantial interest in the pharmaceutical industry, especially
after a human gene encoding a leptin homolog was isolated. According to the conventional
evolutionary paradigm, because it is a homolog of the mouse leptin, the human leptin must also
play a role in obesity, and might be an appropriate target for pharmaceutical companies seeking
human pharmaceuticals to combat this common condition in the first world.
DNA and protein sequences were retrieved for the genes encoding leptins. A multiple
alignment for the protein sequences was constructed for the DNA sequences and the protein
sequences. Congruent tress for both the DNA and protein sequences were then constructed, and
sequences at the nodes of the tree reconstructed using MacClade [Mad92] and the known
relationship between the organisms from which these sequences were derived. For the DNA
sequences, the biologically most plausible tree proved to be the most parsimonious tree as well.
The most parsimonious tree for the protein sequences proved not to be the most plausible tree
(by one change) from a biological perspective. The DNA tree was taken to be definitive because
of its consistency with the biological cladisticc) data.
A secondary structure prediction was made for the protein family. The evolutionary
divergence of the sequences available for the leptin family is small-only 21 PAM units (point
accepted mutations per 100 amino acids)-and predictions were biased to favor surface
assignments [Ben94]. Thus, positions holding conserved KREND were assigned as surface
residues, conserved H and Q were assigned to the surface as well, while positions holding
conserved CST were assigned as uncertain.
Five separate secondary structural elements were identified. The results are summarized in
Figure 1-1. A disulfide bond was presumed to connect positions 96 and 146. These secondary
structural elements can be accommodated by only a small number of overall folds. Interestingly,
the pattern of secondary structure in this prediction is consistent with an overall fold that
resembles that seen in cytokines such as colony stimulating factor [Hil93] and human growth
To decide whether evolutionary function may have changed under selective pressure
during the divergent evolution of the protein family, silent and expressed mutations were
assigned to individual branches on the evolutionary tree. For each branch of the tree, the sum of
the number of silent and expressed changes were tabulated, and the ratio of expressed to silent
changes calculated. These are shown in Figure 1-2.
The branches on the evolutionary tree leading to the primate leptins from their ancestors at
the time that rodents and primates diverged had an extremely high ratio of expressed to silent
changes. From this analysis, it was concluded that the biological function of leptins has changed
significantly in the primates relative to the function of the leptin in the common ancestor of
primates and rodents. This conclusion has several implications of importance, not the least being
for pharmaceutical companies asked whether they should explore leptins as a pharmaceutical
target. At the very least, it suggests that the mouse is not a good pharmacological model for
compounds to be tested for their ability to combat obesity in humans. The post-genomic analysis
suggests that a primate model must be used to test those compounds, with implications for the
cost of developing an anti-obesity drug based on the leptin protein.
Intriguingly, a tree can also be built for the leptin receptor (Figure 1-3). Here, the
evolutionary history is not so complete. In particular, fewer primate sequences are available for
the leptin receptor than for leptin itself. Thus, the reconstructed ancestral sequences are less
precise with the leptin receptor family, and the assignment of expressed and silent mutations to
the tree are less certain. Nevertheless, it appears that the leptin receptor has undergone an
episode of rapid sequence evolution in the primate half of the family as well. The example
illustrates how much sequence data is needed (much) to build reliable models of this nature, as
the ambiguity in the assignment of ancestral sequences makes it possible that the receptor was
evolving rapidly not only in the lineage leading to primates but also in the lineage leading to
Nevertheless, the approximate correlation between the episode of rapid sequence evolution
in the leptin family and in the leptin receptor family suggests a tool that might become useful in
the advanced stages of post-genomic science when evolutionary histories are very well
articulated. Here, it might be possible to detect ligand-receptor relationships between protein
families in the database by a correspondence between their episodes of rapid sequence evolution.
Thus, ligand families should evolve rapidly (in a non-Markovian fashion) at the same time in
geological history as their receptors evolve. It will be interesting to identify more sequences for
primate leptin receptors to see if a more complete evolutionary history allows us to see more
clearly the co-evolution of the leptin receptor and leptin itself.
Correlating the Paleontological Record with Episodes of Sequence Evolution
As discussed above, detailed analyses of evolutionary histories frequently can provide a
solution to the most general problem of the conventional evolutionary paradigm, the difficulty in
routinely identifying a homolog of a target sequence with known function within the database.
By analysis of non-Markovian evolutionary behavior at the level of the protein, a model of
secondary structure can be predicted. This prediction can be used in turn to detect long distance
homologs in some cases and exclude the possibility of distant homology in others. This increases
the likelihood that a homolog will be found with a known structure, behavior, or function for a
new protein sequence. If one is found, then the logic associated with the conventional
evolutionary paradigm can be applied to generate a hypothesis concerning the behavior or
function of the protein.
The value of this post-genomic tool to assign behavior and structure to a target sequence
problem is expected to grow over the near term, as the ratio of sequences supported by
experimental studies to those not supported increases with the conclusion of genome projects,
and as more sequences increase the detail of the evolutionary histories that can be extracted from
the database directly, and therefore the quality of the predicted secondary structural model.
At the next level, analysis of non-Markovian behavior at the level of the gene can alert the
biological chemist that the logic associated with the conventional evolutionary paradigm might
not apply in individual cases. In particular, if an episode of rapid sequence evolution intervenes
in the evolutionary tree between the sequence of interest and the sequence with the known
behavior and function, the biological chemist is alerted to the possibility that the function of the
protein might have changed. This alert is useful even with close homologs, as illustrated in the
example with leptin.
But what if the evolutionary tree contains no protein with a sequence with assigned
function, even one with low sequence similarity? Even with more limited evolutionary histories,
post-genomic tools that analyze non-Markovian evolution at the level of the codon can be useful.
By identifying the organisms that provide the sequences at the leaves, or external nodes, of the
evolutionary tree, it is frequently possible to correlate branches in the evolutionary tree with
episodes in geological history, as determined from the fossil record. Especially in multicellular
animals (metazoa), the fossil record can provide approximate dates for the emergence of new
physiological function. In this case, it is possible to ask whether an episode of rapid sequence
evolution in a protein family (in particular, an episode with a high expressed/silent ratio)
occurred at the same time as a new physiological function emerged on earth. If so, a first level of
hypothesis about physiological function can be proposed, even if no behavior or function of any
kind is known for any of the modern proteins.
Perhaps the most transparent analysis of this type concerns proteins that underwent
massive radiative divergences in metazoa approximately 600 Ma. This is the time of the
Cambrian explosion, an episode in terrestrial history that marks the massive radiative divergence
of multicellular animals, including chordates. Proteins families undergoing rapid evolution at this
time (for example, of protein tyrosine kinases and src homology 2 domains) are almost certainly
involved in the basic processes by which multicellular animals develop from a single fertilized
This type of analysis might be applied in the family of ribonuclease (RNase) A
(E.C.126.96.36.199), a well known family of digestive proteins found in ruminants. The protein
underwent rapid sequence evolution approximately 45 Ma, a time where ruminant digestion
emerged in mammals [Jer95]. Thus, the rapid molecular evolution evident in the reconstructed
evolutionary history of this protein suggests that the protein is important for ruminant digestive
Identification of in vitro Behaviors that Contribute to Physiological Function.
In vitro experiments in biological chemistry extract data on proteins and nucleic acids (for
example) that are removed from their native environment, often in pure or purified states. While
isolation and purification of molecules and molecular aggregates from biological systems is an
essential part of contemporary biological research, the fact that the data are obtained in a non-
native environment raises questions concerning their physiological relevance. Properties of
biological systems determined in vitro need not correspond to those in vivo, and properties
determined in vitro need have no biological relevance in vivo.
To date, there has been no simple way to say whether or not biological behaviors are
important physiologically to a host organism. Even in those cases where a relatively strong case
can be made for physiological relevance (for example, for enzymes that catalyze steps in primary
metabolism), it has proven to be difficult to decide whether individual properties of that enzymes
(kcat, KM, kinetic order, stereospecificity, etc.) have physiological relevance. Especially difficult,
however, is to ascertain which behaviors measured in vitro play roles in "higher" function in
metazoa, including digestion, development, regulation, reproduction, and complex behavior.
Analysis of non-Markovian behavior, as described above, permits the biological chemist to
identify episodes in the history of a protein family where new function is emerging. This
suggests a general method to determine whether a behavior measured in vitro is important to the
evolution of new physiological function. We may take the following steps:
a) Prepare in the laboratory proteins that have the reconstructed sequences corresponding to the
ancestral proteins before, during, and after the evolution of new biological function [Jer95],
as revealed by an episode of high expressed to silent ratio of substitution in a protein. This
high ratio compels the conclusion that the protein itself serves a physiological role, one that
is changing during the period of rapid non-Markovian sequence evolution.
b) Measure in the laboratory the behavior in question in ancestral proteins before, during, and
after the evolution of new biological function, as revealed by an episode of high expressed to
silent ratio of substitution. Those behaviors that increase during this episode are deduced to
be important for physiological function. Those that do not are not.
Structure Prediction and a Rapidly Searchable Database
The overarching problem with genomic sequence databases is their sheer size. This makes
them tedious to search, and nearly impossible to subject to exhaustive self-matching [Gon92].
Predicted structures can be connected with reconstructed evolutionary histories to resolve this
problem, to build a rapidly searchable database. Consider the following steps:
(a) A multiple alignment, an evolutionary tree, and ancestral sequences at nodes in the tree are
constructed for a set of homologous proteins.
(b) A corresponding multiple alignment is constructed by methods well known in the art for the
DNA sequences that encode the proteins in the protein family. This multiple alignment is
constructed in parallel with the protein alignment. In regions of gaps or ambiguities, the amino
acid sequence alignment is adjusted to give the alignment with the most parsimonious DNA tree.
(c) Mutations in the DNA sequences are then assigned to each branch of the DNA evolutionary
tree. These may be fractional mutations to reflect ambiguities in the sequences at the nodes of the
tree. When ambiguities are encountered, alternatives are weighted equally. Mutations along each
branch are then assigned as being "silent," meaning that they do not have an impact on the
encoded protein sequence, and "expressed," meaning that they do have an impact on the encoded
protein sequence. Fractional assignments are made in the case of ambiguities in the reconstructed
sequences at nodes in a tree.
(d) A prediction is then made for each protein family. A secondary structure is predicted for the
family, and this predicted secondary structure is aligned with the ancestral sequence at the root of
the tree. If the root of the tree is unassigned, the predicted secondary structure is aligned with the
ancestral sequence calculated for an arbitrary point near the center of gravity of the tree
(e) An ancestral sequence is then reconstructed at nodes on the tree, and at a point on the tree as
near as possible to its root the point on the tree representing the oldest (geologically) sequence.
Steps (a) through (e) provide a method to organize the protein sequence database in a
rapidly searchable form. The ancestral sequences and the predicted secondary structures
associated with the families defined by steps (a) through (e) are surrogates for the sequences and
structures of the individual proteins that are members of the family. The reconstructed ancestral
sequence represents in a single sequence all of the sequences of the descendent proteins. The
predicted secondary structure associated with the ancestral sequence represents in a single
structural model all of the core secondary structural elements of the descendent proteins. Thus,
the ancestral sequences can replace the descendent sequences, and the corresponding core
secondary structural models can replace the secondary structures of the descendent proteins.
This makes it possible to define two surrogate databases, one for the sequences, the other
for secondary structures. The first surrogate database is the database that collects from each of
the families of proteins in the databases a single ancestral sequence, at the point in the tree that
most accurately approximates the root of the tree. If the root cannot be determined, the ancestral
sequence chosen for the surrogate sequence database is near the center of gravity of the tree. The
second surrogate database is a database of the corresponding secondary structural elements. The
surrogate databases are much smaller than the complete databases that contain the actual
sequences or actual structures for each protein in the family, as each ancestral sequence
represents many descendent proteins. Further, because there is a limited number of protein
families on the planet, there is a limit to the size of the surrogate databases. Based on our work
with partial sequence databases [Gon92], we expect there to be fewer than 10,000 families as
defined by steps (a) through (e).
Searching the surrogate databases for homologs of a probe sequence proceeds in two steps.
In the first, the probe sequence (or structure) is matched against the database of surrogate
sequences (or structures). As there will be on the order of 10,000 families of proteins as defined
by steps (a) through (e) after all the genomes are sequenced for all of the organisms on earth,
there will be only on the order of 10,000 surrogate sequences to search. Thus, this search will be
far more rapid than with the complete databases. A probe protein sequence (or DNA sequence in
translated form) can be exhaustively matched [Gon92] against this surrogate database (that is,
every subsequence of the probe sequence will be matched against every subsequence in the
ancestral proteins) more rapidly than it could be matched against the complete database.
Should the search yield a significant match, the probe sequence is identified as a member
of one of the families already defined. The probe sequence is then matched with the members of
this family to determine where it fits within the evolutionary tree defined by the family. The
multiple alignment, evolutionary tree, predicted secondary structure and reconstructed ancestral
sequences may be different once the new probe sequence is incorporated into the family. If so,
the different multiple alignment, evolutionary tree, and predicted secondary structure are
recorded, and the modified reconstructed ancestral sequence and structure are incorporated into
their respective surrogate databases for future use.
The advantage of this data structure over those presently used is apparent. As presently
organized, sequence and structure databases treat each entry as a distinct sequence. Each new
sequence that is determined increases the size of the database that must be searched. The
database will grow roughly linearly with the number of organismal genomes whose sequences
are completed, and become increasingly more expensive to search.
The surrogate database will not grow linearly. Most of the sequence families are already
represented in the existing database. Addition of more sequences will therefore, in most cases,
simply refine the ancestral sequences and associated structures. In any case, the total number of
sequences and structures in their respective databases will not grow past ca. 10,000-the
estimate for the total number of sequence families that will be identifiable after the genomes of
all organisms on earth are sequenced. If a dramatically new class of organism is identified, this
estimate may grow, but not exponentially (as is the growth of the present database).
The evolutionary histories of protein families are now becoming routinely available
through genome sequencing projects. These histories comprise a multiple alignment for their
protein sequences and the corresponding DNA sequences, an evolutionary tree showing the
pedigree of these sequences, and reconstructed ancestral sequences for each node in the tree. In a
post-genomic world having genomic sequences from an unlimited number of organisms, these
histories will be used to connect structure, chemical reactivity, and physiological function to
Our work consists of the application and development of post-genomic tools that exploit
these evolutionary histories. Analysis of non-Markovian behavior at the level of the gene can
identify proteins within a family that have new functions. Systematic application of such
analyses to sequence databases can be used to create a database of rapidly evolving protein
families which could serve as a source of starting points for deeper study. Further, non-
Markovian patterns of replacement within a family can provide information relating to the form
(secondary and tertiary structure) and function of its constituent proteins. Coupling these
analyses with the paleontological record can suggest hypotheses for physiological function in
families where no function is suggested by any experimental work for any of its members.
Coupling these analyses with experimental work reconstructing ancestral proteins can identify
specific in vitro properties of the protein that are important for its physiological role. Last,
evolution-based data structures can be used to organize large sequence databases in a fashion that
allows them to be searched with extreme efficiency. Together, these post-genomic tools can join
with classical methods whose value is now becoming widely recognized [Hue97].
010 020 030 040 050
I I I I I
VPIQKVQDOTKTLIKTIVTRINDISHTQSVSSKQKVTGLDF ImLHPILT chimp
VPIQKVQSDTKTLIKTIVTRINDISHTQSVSSKQRVTGLDFI ELHPVLT rhesus
VPIQKVQDDTKTLIKTIVTRINDISHTQSVSAKQRVTGLDF IPLHPI LS mouse
VPIRKVQDDTKTLIKTIVTRINDISHTOSVSM KQRVTGLDFIZQLHPLLS sheep
< --------helix-------- > weak parse strong parse predict
< ------ helix 1 ------ > coil 1 expt
MRLFEVQLQSjFVLLALMISLFLLGS MKDIMMNWDDAGCAFVPEAFTFLC bact
060 070 080 090 100
LSKMDQTLAVYQQILTSb IIRNVIQISNDLENLRDLLHVLAFSKSCHLPW orangutan
LSKMDQTLAVYQQILTSIfQ NVLQIAHDLENaLRDLLHLLAFSKSCSLPQ rat
LSKMDQTLAIYQQILASLRVIQI SNDLENLRDLLHLLAASKSCMEQ sheep
LSKMDQTLAIYQQILTSLE RNVVQISNDLENLRDLLHLLAASKSCEiEQ ox
i?Siss?i?iissiissiPPSsIIsiissississiisii? i ?s?cPPPS surf/int
<--- helix 2 ---> 1<-- helix 3 --> expt
110 120 130 140
I I I I
ASGLETLDSI SVLEASGYSTEVVALSRLQGSLQDMLWQLDL4E2C human
ASGLETLDS gVLEASGYSTEVVALSRLGSLQDMLWQLDLt4S-C chimp
ASGLETLDS LQVLEASGYSTEVVALSRLQGSLQDILWQLDLEJ-C gorilla
ASGLETLESLGDVLEASLYSTEVVALSRLQGSLQDMLWQLDLsI C rhesus
TRGLQKPESLDGVL ASLYSTEVVALSRLQGSLQDILQQLDVSPSC ratnor
VRALESLESLGVVLEASLYSTEVVALSRLQGSLQDMLRQLDLS EC ox
ARGLETFESL VLEASLYSTEVVALSRLOAALQDMLRRIDLSI~C dog
|<-helix>| |<----helix---->! predict
<-helix>| |<---- helix 4 ------->| expt
Figure 1-1. Predicted surface, interior, and secondary structure assignments for leptin. S and s, I
and i indicate strong and weak surface and interior assignments respectively. P and p
indicate strong and weak parses respectively. A "?" indicates that no assignment is
made. A "c" indicates that the position is involved in a disulfide bond. Secondary
structure was assigned using the method of Benner and Gerloff where positions
denoted "?" were permitted to fall in either the surface or interior arc of the helix.
Underlined residues are part of parsing strings.
gorilla I 0/0
human 0/2 7.5/2
"I orangutan 1/2
rhesus IIIIIIIIIIIIill IIIUllllli ili 6/10.25
mouse 2/6.54 -
-, IIII-iiin itiuiiu,, 4 33/4.45
rat 1 0/0 22.2/18.04
rat 2 4.5/7.67
ox 1/6.79 1/4.86
0 dog IIIIIIIIIIII IIIIIIIIIIIiiiiii ii iiii it iiiiiiiiiiiuiii r
Figure 1-2. Evolutionary tree showing the evolutionary history of the leptins. Heavy lines show
branches with expressed/silent ratios higher than 2. Hatched lines show branches with
expressed/silent ratios from 1 to 2. Thin, solid lines show branches with
expressed/silent ratios less than 1. Numbers on the lines indicate the ratio of
expressed/silent changes for that branch. The branch lengths do not correspond to
either geological time or evolutionary distance.
gorilla ------ J
E 24/6.33 |
i. orangutan ---------------
mouse IIIIIIIIallilli .L
| rat 1 46.1/11.9
rat 2 ------
Figure 1-3. Evolutionary tree showing the evolutionary history of the extracellular domain of the
leptin receptors. Heavy lines show branches with expressed/silent ratios higher than
3. Hatched lines show branches with expressed/silent ratios from 2 to 3. Thin, solid
lines show branches with expressed/silent ratios less than 2. Notice the greater overall
divergence in the leptin receptor than in leptin itself. Dotted lines show branches with
indeterminate ratios. Numbers on the lines indicate the ratio of expressed/silent
changes for that branch. The branch lengths do not correspond to either geological
time or evolutionary distance.
EXTENDING THE PBL METHOD FOR ESTIMATING NUMBERS OF SYNONYMOUS
AND NONSYNONYMOUS MUTATIONS TO ACCOUNT FOR AMBIGUITY IN
INFERRED ANCESTRAL SEQUENCES
The measurement and estimation of synonymous and nonsynonymous mutations in DNA
sequences that code for proteins has proven to be a useful tool for scientists interested in the
evolution of protein families and their constituent molecules. Specifically, comparisons of rates
of synonymous and nonsynonymous mutation can be used to detect and differentiate episodes of
positive (adaptive) selection within the evolutionary history of a group of proteins. Comparisons
are typically made between pairs of uniquely determined, unambiguous, DNA sequences derived
from extant taxa, and though a phylogenetic tree might be included in an analysis, comparisons
are not made with the ancestral sequences that could be inferred at internal nodes of the tree via
parsimony or another method (e.g., maximum likelihood). The inclusion of these ancestral
sequences can greatly enhance an evolutionary analysis, for it allows the assignment of specific
mutations to the branches of the evolutionary tree, and these branches represent major
evolutionary events (e.g., speciations and gene duplications). By linking these events to episodes
of positive selection we are better enabled to explain the natural history of a protein family. Such
combined analyses have been published by a number of investigators, the first of whom were
Messier and Stewart [Mes97]. They used the PBL method developed by Li et al. [Li85] and
amended by Pamilo and Bianchi [Pam93] to calculate numbers of synonymous and
nonsynonymous mutations between DNA sequences from extant species and intranodal ancestral
DNA sequences inferred by the PAML method (a maximum likelihood method) of Yang et al.
[Yan97]. Though they do not make note of it in their Methods section, they most certainly used
the ancestral sequences that were predicted to have the highest probability; that is, for each
internal node of their tree, they used one unambiguous DNA sequence out of a number of
possible sequences for their comparisons. This approach discards information-information
about DNA and amino acid variability between the pair of sequences being compared and
between each sequence and the sequences at daughter nodes. The consequences of this editing
depend upon the questions being asked and must therefore be considered on a case-by-case basis
but, in general, we would argue that discarding information about the evolution of these proteins
and "unnaturally selecting" one sequence from a population of possible sequences would
introduce a bias that diminishes the confidence in any conclusions drawn. Herein we describe a
method that accounts for the ambiguity in ancestral sequences, providing perhaps a more realistic
measure of synonymous and nonsynonymous mutations and preserving data that pertains to the
natural history of these sequences.
An Overview of the PBL Method
The PBL method involves the comparison of two DNA sequences, codon by codon, and is
performed in the following steps:
* The average numbers of zero-, two- and fourfold degenerate sites are calculated.
* Mutations are counted and classified as transitions or transversions. Each mutation is further
classified as having occurred at a site of i-fold degeneracy.
* KA and Ks are calculated from the degeneracy and mutation calculations.
Figure 2-2 shows five taxa and their DNA sequences which, for this example, consist of
only three nucleotides (one codon) each. The degeneracy of each nucleotide within a codon is
determined as follows. Zerofold degenerate sites are sites where changing the nucleotide to any
of the other three nucleotides changes the amino acid translation of the sequence. For example,
sequence t1 (TCA), translates to the amino acid serine. If we change the second position of t1 to
A, G or T (giving TAA, TGA and TTA), the translation changes to Stop, Stop and leucine,
respectively (when using the nuclear code for translation). Therefore the second nucleotide of
TCA is classified as a zerofold site. Two-fold sites are sites where two of the three possible
changes change the translation. The third position of CAA (glutamine) is twofold degenerate,
because changing to C or T changes the amino acid to histidine, while changing to G does not
change the translation. Four-fold sites are sites where changing the nucleotide to any of the three
other nucleotides does not change the amino acid. The third positions of TCA and TCG are each
fourfold degenerate. After the degeneracy of each of the nucleotides in each sequence being
compared has been tallied, an average is computed.
Next, mutations are counted between the pair of sequences. In general, if a mutation occurs
between two nucleotides of the same chemical class-either the purines or the pyrimidines-the
mutation is called a transition: [A->G, G->A, T->C, C->T]. Mutations from one class to another
[purine->pyrimidine, pyrimidine->purine] are called transversions. In the PBL method these
classifications hold for all mutations in the vertebrate mitochondrial code, but in the nuclear code
and the codes of other organisms some adhoc adjustments are made. The purpose of these
adjustments is to preserve the following rules:
* All changes at zerofold sites, whether transitions or transversions, are nonsynonymous
* All changes at fourfold sites are synonymous (silent).
* Transitions at twofold sites are synonymous and transversions at twofold sites are
If we compare sequence tl with t2 we see that A changes to G at the third position. This is
tallied as a transition. The change is further classified as having occurred at a fourfold degenerate
site (Figure 2-6). Following the above rules, the mutation is ultimately classified as a
Sequences tl and t3 have two mutations between them. Rather than simply counting the
changes, the PBL method considers all possible paths that could be traversed as one codon
mutates into another. Figure 2-1 shows the two paths that can be taken to get from TCA to CAA.
Mutations are counted and classified as described above, but the mutations are then multiplied by
a probability determined for each path. A path's absolute probability (in Figure 2-1, pil x pi2,
where i = path number) is computed by determining the probability of each "leg" of the path, and
then taking their product. The leg probabilities are determined by assessing the probability of the
amino acid change taking place on that leg. The determination of these amino acid transition
probabilities is described in Li et al. [Li85].; briefly, the probabilities of each amino acid
changing into every other amino acid are derived from Grantham's 20x20 mutation matrix and
partitioned into four categories (synonymous, conservative, moderate and radical changes). In
Figure 2-1, TCA changes to CCA serinee to proline) with a probability of 0.767, which is the
value for a conservative change. Synonymy (no change) has a higher probability (2.235) and
moderate and radical changes have lower probabilities.1
Once all paths' absolute probabilities have been determined, each path's relative probability
can be determined by dividing its absolute probability by the sum of all paths' absolute
probabilities (in Figure 2-1, Pi, where i = path number). In Figure 2-1, Path 2 (P2) involves a stop
codon, so this path is given an absolute and relative probability of zero, giving Path 1 (Pi) a
relative probability of one (its absolute probability is 0.588). Finally, the mutations occurring
along a path are multiplied by their path's relative probability, and the sum of mutations is taken
by summing the mutations along each path. This procedure is the same for codons with three
1 These values are for mutations under a moderate rate of mutation model. Li et al. [citation] also provides
values for a high rate of mutation model.
differences between them, except in such cases there are potentially six paths, and each path has
three legs (e.g., t2 vs. t3). Figure 2-6 shows the results of comparing each of the sequences tl, t2
and t3 with each other.
After the average degeneracy of a pair of sequences has been determined, and the number
and type of mutations between them has been counted, KA and Ks can be calculated. KA is
defined as the number of nonsynonymous substitutions per nonsynonymous site in a pair of
sequences. Ks is the number of synonymous substitutions per synonymous site. The formulae
were developed originally by Li et al. and amended by Pamilo and Bianchi, and their derivation
can be found in these references and Kimura [Li85, Pam93, Li93, Kim80, Kim81].
Extending the PBL Method to Phylogenetic Trees with Inferred Ancestral Sequences:
Putting Ambiguity Back into the Equations.
Figure 2-2 shows a hypothetical phylogeny for sequences tl, t2 and t3. This tree contains
two internal nodes (ancestral sequences) al and a2. These sequences are inferred by the
parsimony algorithm of Fitch [Fit71]. Figure 2-3 shows the reconstruction of the ancestral
sequences at each nucleotide position. Below each nucleotide is a number representing the
probability of that nucleotide appearing at its node. These probabilities are computed by a
method developed by Fitch.
Our method is conducted as follows. The tree is traversed node by node, from top to
bottom; beginning with the least ancient nodes and ending with the root node. At each node,
comparisons are made between each node codon(s) and each codon(s) of each descendant. These
comparisons are made by
* Constructing a set of equally parsimonious solutions (EPSs).
* Determining the relative probability of each EPS.
* Calculating the average number of zero-, two- and fourfold degenerate sites for each
* Counting and classifying mutations for each ancestor/descendant pair.
* Calculating KA and Ks.
An EPS is defined as three codons [x,y,z] (x = left descendant, y = ancestor, z= right
descendant) that are permitted by Fitch's parsimony rules. Each codon is constructed by
combining "valid" nucleotides from each of the three parsimonious nucleotide trees (Figure 2-3).
The validity of the presence of a nucleotide in an EPS is determined by Fitch's rules for valid
linkages. In this example, all linkages are permitted between all of the nucleotides at a2 and al
in each of the nucleotide trees, so all possible codons that can be built are permitted. Ancestral
node al has only one EPS, [TCA, TCA, TCG]. Ancestral node a2 has four EPSs, [TCA, TCA,
CAA], [TCA, TAA, CAA], [TCA, CCA, CAA] and [TCA, CAA, CAA]. The determination of
EPSs is required because the identity of a nucleotide at the ancestral node dictates what
nucleotides are allowed at its descendant nodes.
Next, the probabilities for each EPS are determined. Since al has only one EPS, this EPS
is given a probability of 1. The probability of each of the four EPSs at a2 is determined by the
following method (Figure 2-4). Each EPS has two paths, the path between the ancestor and its
left descendant and the path between the ancestor and its right descendant. Each of these paths
has a probability associated with it, PL and PR, respectively. The probability of a path is
determined by calculating the probability of the amino acid change that occurs along that path.
PL for EPSI's left path (TCA->TCA) is 2.235 because there is no amino acid change
(synonymy). PL for EPS3's left path (TCA->CCA) is 0.767 because this amino acid change
(serine->proline) is conservative. For paths involving two or three mutations, a variation of the
paths method depicted in Figure 2-1 is used. For example, the right path of EPS has two
mutations. The absolute probability of Path 1 (p11 x p12) = 0.588, and since Path 2 contains a stop
codon, its absolute probability is zero. We then compute the average probability by summing the
absolute probabilities of all paths and dividing by the number of paths. In this case there is only
one path, so its absolute probability is the average probability: PL = 0.588.
EPS2 has a probability of zero because it contains a stop codon. EPS3 and EPS4
probabilities are calculated as shown in Figure 2-4.
Once the average probabilities of each path in the EPS are calculated, we calculate the
product of the probabilities of all nine nucleotides in the EPS (the nucleotide probability product,
PN) and then compute the absolute probability of the EPS by taking the product of the two
average probabilities and the nucleotide probability product: PLxPRxPN. We can then determine
the EPS's relative probability by dividing its absolute probability by the sum of all EPSs' absolute
probabilities (Figure 2-4). We include PN in the calculation because it is a measure of the
probability of the codons in the EPS. Without PN, an EPS's probability would only be a measure
of the probability of its amino acid changes, so if a particular EPS contained conservative
mutations and/or synonymy, for example, it would be given a high relative probability regardless
of the probability of its individual codons existing.
We then use the EPSs and their probabilities to calculate the average number of zero-, two-
and fourfold degenerate sites between each sequence pair. As an example we will consider the
al/a2 pair (the left path of the EPS set for node a2). For EPS1, we compare TCA with TCA.
Codon TCA has two zerofold sites, zero twofold sites and one fourfold site ([2,0,1]). The
average sites are then (([2,0,1] + [2,0,1])/2) PEPS1 = [0.818, 0, 0.409]. For EPS2 we have [0,0,0]
because of the stop codon. Sites for EPS3 and EPS4 are similarly computed. The sum of sites for
all EPSs is [2, 0.204, 0.796] (Figure 2-7). Counts and classifications of mutations for each
sequence pair are then determined. We will continue using the al/a2 pair as an example. For
EPS1 we tally [[0,0,0],[0,0,0]] for TCA vs. TCA ([[a,b,c],[x,y,z]]: a = zerofold transitions, b =
twofold transitions, c = fourfold transitions, x = zerofold transversions, y = twofold
transversions, z = fourfold transversions). For EPS2 we do not count any mutations because of
the stop codon. For EPS3 we count [[1,0,0],[0,0,0]] PEPS3 = [[0.183,0,0],[0,0,0]] and for EPS4
we count [[1,0,0],[1,0,0]] PEPS4 = [[0.409,0,0],[0.409,0,0]].
These procedures are repeated for each codon in each pair of sequences at each node in the
tree (Figure 2-7). Finally, KA and Ks. are computed.
We wrote a computer program that implements this method. Over a course of years the
program was written and rewritten in a number of different programming languages to suit
whatever task was at hand. In Appendix B we present a version written in Java 1.1, used in the
context of the MasterCatalog-an interactive database containing the contents of GenBank
reorganized as a collection of clades, or families, consisting of independently evolving protein
fragments, or modules [BenOO]. The extended PBL method described herein was engineered to
measure KA and Ks values for pairs of modules within each family of modules, including
reconstructed ancestral modules at nodes within the phylogenetic tree representing the natural
history of the module family.
Prior to that project we coded this method and applied it to all of GenBank. This work is
described in Chapter 3.
Path 1 TCA (S) Path 2
O-fold 2-fold 4-fold
0-fold 2-Fold 4-fold
1 1 0 0 1 0 0
2 0 0 0 0 0 0
total 1 0 0
Pi(relative) = (p, X Pi2J/([p, X P12] + [P21 X P22]) = (.767x.767)/(.588) = 1
P2(relative) =P2 X p/22 Pi, X P12 + [P2, X P2j]) 0/(.588) = 0
Figure 2-1. Two possible ways to get from TCA to CAA. Formulae for path relative
probabilities, and numbers of transitions and transversions adjusted by multiplication
by path probabilities.
TCA TCG CAA
TCA a2 CCA
Figure 2-2. A hypothetical phylogeny for sequences tl, t2 and t3. Ancestral sequences al and a2
inferred by parsimony.
0.5 0.5 1.0
Figure 2-3. Parsimony for each of the nucleotides in sequences tl, t2 and t3. The numbers below
each nucleotide represent the probability of that nucleotide being at its node per
EPS 1: [TCA, TCA, CAA]
1-0 1_0 1-0 1_0 1.0 1.0
0.5 0.5 1.0
P, = P(TCA->TCA) = 2.235
P, = P(TCA->CAA) = .588
P. = t 7 n, = .25
PEPSI = PtxRxP = .409
P pEP absolutec)
EPS 3: [TCA, CCA, CAA]
1.0 1.0 1.0 1.0 1.0 1.0
EPS 2: [TCA, TAA, CAA]
P, = T T n, = .25
PEPS2 = PxP.xP,, = 0
EPS 4: [TCA, CAA, CAA]
1.0 1.0 1.0 1.0 1.0 1.0
0.5 05 1.0
P, = P(TCA->CCA) = .767
PR = P(CCA->CAA) = .767
PN = T 7 n, = .25
i= l j=l
PEPS4, PLXPxP. = .182
PL = P(TCA->CAA) = .588
P, = P(CAA->CAA) = 2.235
PN= TC n,= .25
PEPS4 = PLXPxP, = .409
Figure 2-4. EPSs for ancestral node a2. Each EPS has a probability for its left and right path, PL
and PR. PN is the product of the probabilities of each nucleotide in the EPS (i = 1, 2 or
3 where 1 refers to the left descendant, 2 refers to the ancestor and 3 refers to the right
descendant. j = nucleotide position). The relative probability of a path is equal to the
quotient of its absolute probability and the sum of absolute probabilities for all EPSs
(x = number of EPSs).
H1 TCA 2 0
t2 TCG 2 0
t3 CAA 2 1
Figure 2-5. Sequences tl, t2 and t3,
each pair of sequences.
racy Sequence Pair Averge Degeneracy
4-fowld -fold 2-1old 4-fald
1 tlt2 2 0 1
1 tl/t3 2 0.5 0.5
0 t2/t3 2 0.5 0.5
numbers of degenerate sites, and average degeneracies of
Figure 2-6. Numbers of transitions and transversions for each pair of sequences.
0-fold 2-fold 4-fold
1 0.33 0.67
Ancestral Sequence Pair Average Degeneracy Transitions Transversions
0-fold 24old A-fold 0-fold 2-fod 4-Fold O-fold 2-fold A-fold
2 0 1
2 .204 .796
2 .704 .296
0 0 0 0 0
0 0 1 0 0
0 0 .409 0
0 0 .592 0
Figure 2-7. Tallies of average degeneracies and mutations for each pair of sequences. Fractional
degeneracies and mutations are results of multiplication of observed numbers with
computed probabilities for EPSs.
THE ADAPTIVE EVOLUTION DATABASE (TAED): APPLICATION OF THE EXTENDED
PBL METHOD TO A LARGE DATASET
The growth of gene and genomic databases provides motivation for developing tools to
extract information about the function of a protein from sequence data, with the ultimate goal of
understanding the collection of functions represented in an organism's genome [Lib01l]. Work on
molecular evolution over 30 years has shown that such questions must be phrased carefully, and
always with cognizance of the Darwinian paradigm that insists that the only way of obtaining
functional behavior in living systems is through natural selection superimposed on random
variation in structure [Ben88b]. A behavior is functional if the organism would be less able to
survive and reproduce if that behavior were different. An amino acid residue is functional if,
upon mutation, the organism is less able to survive and reproduce.
A long literature has sought to interpret the evolutionary behavior of protein sequences, in
the hope of drawing inferences about the relationship between fitness and sequence [Kim82].
What has emerged is the recognition that a family of orthologous proteins displays a diversity of
structure and a corresponding diversity in behavior, where some of the behavioral differences
have a strong impact on fitness (are functional), and others are neutral (or nearly so). Without
resolving, in a general way, questions regarding the relationship (neutrality versus selection)
between fitness and protein sequence, we can build interpretive tools that capture information
from patterns of evolution of genomic sequences that is informative about function-in
particular, events that are characterized by the biological scientist as a change in function.
For a protein to change its function it must change its behavior; this in turn requires that it
change its amino acid sequence. A protein being recruited for a different function over a very
short time (geologically speaking) frequently experiences an episode of rapid sequence
evolution, an episode where the number of amino acid substitutions per unit time is large.
Therefore, molecular evolutionists have long been interested in the rates at which substitution
accumulate in protein sequences. These rates are known to vary widely in different protein
Calculating rates in the units of substitutions/time requires knowledge of the geological
dates of divergence of protein sequences. Because geological times are frequently not known
(and almost never known precisely), alternative approaches for identifying episodes of rapid
sequence evolution have been sought. One of these examines nucleotide substitutions. It divides
the number of nucleotide substitutions that change the sequence of the encoded protein
(nonsynonymous substitution) by the number of nucleotide substitutions that do not change the
sequence of the encoded protein (nonsynonymous substitution), and then normalizes these for
the number of nonsynonymous and synonymous sites. This is the KA/Ks ratio [Li85, Pam93,
Li93]. High KA/Ks ratios for reconstructed ancestral episodes of sequence evolution are
hypothesized to be signatures of positive adaptation, and have been associated with significant
changes in function [Tra96, Mes97].
In general, KA/Ks values are low. For example, the average KA/Ks value in proteins
between rodents and primates is 0.2 [Mak98]. This is taken to indicate that most of these
proteins, selected over millions of years, attained an optimum function prior to the divergence of
rodents and primates. This implies that subsequent evolution was conservative; most
nonsynonymous mutations were detrimental to the fitness of the organism.
Functional change can be defined as mutation that alters organismal fitness and is subject
to selective pressure. For an example of intraspecific variation, phosphoglucose isomerase in
montane beetles shows adaptation to local temperature variations [DahOO]. Orthologous proteins
also suffer positive selection. For example, the hemoglobin in the bar-headed goose has
undergone adaptive change relative to the hemoglobin from the closely related greylag goose in
response to a reduced partial pressure of oxygen at high altitudes [Zha96]. Adaptive evolution is
also believed to be displayed by paralogous mammalian MHC class I genes and relate to a birth-
and-death model of gene duplication [Nei97].
Traditionally, positive selection is defined by a KA/Ks rate ratio significantly greater than
unity. While 0.6 < KA/Ks < 1 can occur by relaxation of functional constraint, the theoretical cut-
off of one is well known to miss significant functional changes in proteins for several reasons
[Cra99]. Long branches can dilute an episode of positive adaptation (with KA/Ks >1) with
episodes of conservative evolution. KA/Ks values can miss positive selective pressures on
individual amino acids because they average events over the entire protein sequence. Behavior in
a protein can change significantly if only a few amino acids change while the remainder of the
sequence is conserved in order to retain core behaviors of the old and new functions (e.g., the
protein fold). These adaptive events will only be detected on sufficiently short branches which
pinpoint the adaptive change.
Alternative ways of identifying KA/Ks values below unity that are suggestive of adaptive
evolution involve comparison of these values for an individual branch of a tree with those values
for branches in the tree generally. If one branch has a KA/Ks value far outside of the norm for the
family (but still below one), we can guess that this branch represents an episode of positive
selection. This will work for gene families that generally display conservative evolution (such as
the SH2 (Src homology 2) domains) [Wig98], but not for others. For example, many immune-
system genes show a much more continuous distribution of values, which may indicate that they
are perpetually under different amounts of positive selective pressure [Nei97]. In this case, the
designation of a cutoff value of KA/Ks, below which two homologous genes have the same
function, and above which they have different functions, is arbitrary. Ultimately, this level
should be determined by benchmarking adaptivity with specific functions and specific protein
KA/Ks rate ratios are well known to be useful starting points for generating stories about
the interaction between protein sequences and the Darwinian processes that shape these
sequences. These stories help us understand how these sequences contribute to the fitness of the
host. This means that biologists would find useful a comprehensive database of examples where
KA/Ks values are high. Most useful would be a database that presents families where KA/Ks is
greater than one, and a separate family where KA/Ks is greater than some arbitrary cut-off less
than one, but still relatively high compared to the average value in the average protein.
We report here such a database, The Adaptive Evolution Database (TAED). TAED is
designed to provide, in raw form, evolutionary episodes in specific chordate and embryophyte
(flowering plants, conifers, ferns, mosses and liverworts) protein families that might be
candidates for adaptive evolution. TAED contains a collection of protein families where at least
one branch in the reconstructed molecular record has a KA/Ks value greater than unity, or greater
than 0.6. The second cut-off is arbitrary, chosen to be high relative to the average KA/Ks value
for the average episode of evolution in a protein family. Empirically, the lower cut-off seems to
admit additional examples of gene families that might have undergone adaptive evolution.
TAED should be used as a raw list of potentially adaptively evolving genes for
experimentalists seeking gene families to study in further detail, and for bioinformaticists
interested in studying large datasets of examples of genes with high KA/Ks ratios.
The MasterCatalog [BenOO] is a database of 26,843 families of protein modules generated
from an all-against-all search of GenBank release 113. A protein is broken into independently
evolving modules on the criterion of the presence of a subsection of a gene as a complete open
reading frame (ORF) in another species. Pairs that were within 180 PAM (point accepted
mutation) units with a minimum length requirement were grouped into the same family. Each
family contains an evolutionary tree and a multiple sequence alignment. This database was the
starting point for the exhaustive calculation of KA/Ks ratios.
The MasterCatalog (MC) is different, both in concept and execution, from other resources
(e.g., Hovergen [Dur94], Pfam [BatOO] and COGs) that offer databases of protein families. The
MC incorporates reconstructed ancestral states within its data structure, in addition to multiple
sequence alignments (MSAs) and evolutionary trees. Having these reconstructed ancestral states
provides a value to the database, especially for functional interpretation, that is not offered by
databases that contain only trees, or only MSAs, or only trees and MSAs. Further, because the
MasterCatalog is explicitly developed as a tool for doing functional genomics relying on
reconstructed intermediates, and as the information about function is extracted from analysis of
patterns of variation and conservation in genes and proteins within a family, it emphasizes the
generation of high-quality trees, MSAs and reconstructed ancestral states. For this reason, the
MC does not attempt to build superfamilies (like Pfam does). Instead, it constructs families,
where the trees, MSAs and ancestral states are not compromised by poor gap placement-a
common problem in Clustal-based MSAs of sets of highly divergent protein sequences.
Alternative methods were considered for reconstructing ancestral sequences. Whereas
maximum likelihood methodologies perform better in some situations, they are too
computationally intensive to apply exhaustively. Further, they are based upon an explicit model
of evolution that may not be appropriate along all branches analyzed, a situation where
maximum parsimony may outperform maximum likelihood on some branches [Pag98].
Therefore, to generate the initial version of this database, more computationally simple methods
were used. As improved methodologies are developed, these will undoubtedly be applied to
recalculate this database.
Two issues concerned the scope of the KA/Ks analysis. First, we were concerned that silent
positions would be 'saturated' with substitutions, rendering the Ks measurements meaningless.
Whereas reconstruction back to the last common ancestor of chordates or embryophytes with no
intermediates frequently bears the signature of synonymous position equilibration, synonymous
position saturation can be avoided if individual branches are shorter than the period required for
saturation to occur (tl. to saturation of approximately 120 million years). Saturation was
measured through the examination of the extent to which two-fold redundant codon systems had
reached equilibration [PelOO]. Branches that showed equilibration greater than five half-lives
towards saturation were excluded from TAED on the basis of differences between reconstructed
ancestral sequences at the beginning of branches and sequences at the end.
A second significant problem is that of short branches bearing fractional mutations. These
are known to generate KA/Ks values with large errors. To prevent these errors from biasing the
database, a new simple robustness test was implemented to ensure that an "interesting" KA/Ks
value (one above the cut-off) was not recorded in the database if it became "uninteresting"
(below the cut-off) through the shift of a single mutation reconstructed in the branch. The test
modified the KA/Ks calculation in a simple way, as described below:
Modified KA/Ks = KAmod/Ksmod, where
KAmod = (number of nonsynonymous 1)/total nonsynonymous sites
Ksmod = (number of synonymous + 1)/total synonymous sites
In general, the smaller the difference between KA/Ks and KAmod/Ksmod, the more
significant or robust the branch.
To exclude short branches with fractional mutations (arising through ambiguous ancestral
sequence reconstruction) without excluding other short branches, branches with KAmod/Ksmod
values below 0.5 were excluded from the database.
Of 5,305 families of modules containing chordate proteins, 280 contained at least one
branch with a KA/Ks value greater than one, representing 643 branches emanating from 63
different nodes of the tree of life. Some 778 families had at least one branch with a KA/Ks value
greater than 0.6, totaling 2,232 branches emanating from 92 nodes of the tree of life. Thus 15%
of all families of chordate modules are likely to have modified their function at least once during
the course of evolution.
Of 3,385 families of modules representing embryophyte proteins, 123 have at least one
branch with a KA/Ks value greater than one, representing 227 families emanating from 25 nodes.
Some 407 families had at least one branch with a KA/Ks value greater than 0.6, totaling 1,105
branches from 43 nodes. Here, perhaps 12% of all embryophyte families have modified their
function along at least one branch.
This result based on ancestral sequence reconstruction contrasts greatly with the result of
Endo, Ikeo and Gojobori, where the search for gene families undergoing adaptive evolution
yielded only two families [End96]. They compared extant sequences rather than reconstructed
evolutionary intermediates, counted families only where a majority of the pairs were at high
KA/Ks values, and used a smaller database.
A list of candidate protein module families that have undergone modification of function is
available at [Lib03]. The version described here is designated TAED 2.1 and will remain
available at this site. As more sophisticated methods are developed and applied, as correlations
with functional and structural databases are pursued, and as data from other types of evolution
beyond coding sequence evolution is added, links to these datasets will be provided. TAED 2.1
contains two image-mapped trees (for chordates and embryophytes); where the node that an
adaptive branch emanates from can be clicked on to obtain a list and MasterCatalog reference
number. Multiple sequence alignments and phylogenetic trees corresponding to these entries can
be obtained from EraGen Biosciences WI.
This study represents the first comprehensive analysis of KA/Ks rate ratios throughout the
Chordata and the Embryophyta. Although the methods utilized were rough and designed to give
a quick snapshot into a global picture of evolution, the TAED, as a raw resource, should be
valuable in the analysis of much of chordate evolution. Functional genomics analyses of many of
the protein families that have undergone recruitment and functional change within the past 500
million years will soon emerge. Many of the episodes of functional change recorded in TAED
can be correlated with events in the geological or paleontological record, in response to changing
environments, evolving paleoecology or the development of new physiology.
Gene families may display evolutionary episodes with high KA/Ks values, and therefore
appear within TAED, for several possible reasons. For example, branches resulting from gene
duplication events that give rise to paralogs with very different behaviors will presumably have
high KA/Ks values, as will orthologous pairs from species that place very different demands on
their function. This search was done without distinguishing paralogs from orthologs, and the user
of TAED should be careful in the analysis of specific families in recognition of this fact.
Because there is no reliable true set of protein families known to have undergone
functional adaptation, it is not possible to score the results of this tool. It is important to
remember that a Darwinian definition of function differs from the functional annotation of
genomes, and it is possible for a protein to alter or change its function while retaining the same
annotation. To examine this dataset, specific proteins must be examined individually.
Individual examination is likely to be productive, however. Many protein families already
believed to be candidates for functional recruitment appear on the list. These include
plasminogen activator in vampire bats which is expressed in saliva and involved in blood clotting
[Bod97], phospholipase A2 in snakes which is expressed in venom and involved in tissue damage
[Nak95] and MHC proteins in mammals, which are involved in the immune system as part of the
host-parasite arms race [Hug88], all having obvious explanations of why they may have
undergone functional change. Several families are newly identified as being candidates for
functional change, such as the previously proposed obesity protein leptin in primates.
A third category of discovery in TAED is in the detection of episodes of adaptive change at
new points in the divergent evolution of proteins, for example myostatin in the Bovidae [Lee99].
Table 3-1 is a sample table from TAED representing bovids. These are the candidate genes that
were identified as showing rapid sequence evolution emanating from this node in the tree of life.
They potentially include orthologs between two species of bovids, paralogs, alternatively spliced,
transcripts and intraspecific evolution. The genes on the list have roles in the immune system,
body musculature and reproduction, traits frequently under selective pressure. These examples
and many others are candidates for further experimental study through cloning from additional
species and through functional study in laboratories expert in the particular protein.
From a phylogenetic perspective, the knowledge of candidate genes evolving at the same
time in the same organism can allow one to begin to ask if entire pathways or phenotypic
functions are under selective pressure at particular points in evolutionary history. Where tertiary
structures for the proteins exist, mutations along branches can be mapped onto three-dimensional
structures first to evaluate the validity of specific examples, and second, to understand the nature
of adaptive evolution at a structural level.
One analysis of TAED indicates that among branches with KA/Ks rate ratios > 1, only 3%
of synonymous sites had mutated compared with 10% on the average branch in the database.
This is consistent with the notion that episodes of adaptive evolution can be lost in long
branches, as these are combined with prior and/or subsequent episodes characterized by lower
KA/Ks rate ratios characteristic of functional constancy. As more genes are sequenced from more
species, the greater articulation of trees will not only increase the accuracy of sequence
reconstructions, but will also allow us to detect new examples of functional change that are
buried in long branches.
At a biological level, the database created here can be mined to provide global pictures of
how evolution has occurred. Correlation of data in this database with that in other functional
databases will enable a leap from genotype to organismal phenotype. Further, the dataset
provides a resource for experimentalists interested in specific genes. The high KA/Ks rate ratio in
leptin in a branch connecting primates with rodents may have been a useful predictor of change
of function for pharmaceutical companies interested in the mouse model of leptin for human
obesity. For the experimentalist, mutations occurring along putatively adaptive branches can be
assayed for functional importance in systems of interest.
Finally, this database represents a growing framework for the study of adaptive evolution.
As datasets become available, changes in gene expression, alternative splicing patterns,
imprinting patterns, recombination events and other molecular mechanisms of adaptation will be
added to this database in a phylogenetic perspective. The ultimate goal is a dynamic resource
depicting candidate molecular events that are responsible for phenotypic differences between
closely related species.
Materials and Methods
Starting with the MasterCatalog [BenOO] (version 1.1 derived from GenBank release 113)
KA/Ks rate ratios were reconstructed database-wide for each ancestral branch in every
evolutionary tree containing genes from the Chordata and the Embryophyta. This analysis was
restricted to these organisms because there is less evidence for codon and GC-content biases
which complicate the accurate calculation of Ks. The MC uses multiple sequence alignments
generated from Clustal-W and neighbor-joining trees, both derived from protein sequences.
Because the MC is based on an analysis of nuclear families, rather than extended families, these
inexpensive tools generate acceptable multiple sequence alignments.
KA/Ks values were calculated for branches on an evolutionary tree between nodes using
the method of Li and Pamilo and Bianchi [Li85, Pam93, Li93] modified to allow full treatment
of probabilistic ancestral sequences [Ben98]. Reconstruction of ancestral sequences was done
using the Fitch maximum parsimony methodology [Fit71]. While reconstructed ancestral
sequences contain ambiguities, using probabilistic ancestral sequences takes this into account (by
weighting ambiguous positions according to their probabilities) and allows us to construct a
model of evolutionary history that is robust. Two cut-offs were used to identify interesting values
for the KA/Ks rate ratio, one and 0.6. Separate databases were constructed for each cut-off. The
resulting dataset is freely available for further analysis [Lib03].
Table 3-1. A sample listing from TAED containing candidate adaptively evolving genes
detected that emanated from the Bovidae node.
Genes with KA/Ks > 1.0
1. T-cell receptor CD3 epsilon chain
from MasterCatalog family 9668
2. AF092740 cytotoxic T-lymphocyte-
associated protein 4 precursor from
MasterCatalog family 9698
3. CD5 from MasterCatalog family 9700
4. AF 110984 intercellular adhesion
molecule-1 precursor from
MasterCatalog family 9802
5. Interferon alpha/beta receptor-2 from
MasterCatalog family 9817
6. AF020508 pregnancy-associated
glycoprotein 6 from MasterCatalog
7. MCH OVAR-DQ-ALPHA1 from
MasterCatalog family 15669
8. Major histocompatibility complex
class II from MasterCatalog family
9. T-cell receptor gamma from
MasterCatalog family 21940
Additional Genes with KJA/K> 0.6
10. Interleukin 2 receptor from
MasterCatalog family 974
11. Interleukin-3 from MasterCatalog
12. AF019622 myostatin;
growth/differentiation factor-8; GDF-8
from MasterCatalog family 20325
13. Fas gene product from MasterCatalog
14. Calpastatin from MasterCatalog family
15. Prolactin receptor from MasterCatalog
16. Pre-pro serum albumin from
MasterCatalog family 21864
17. Immunoglobulin gamma-1 chain from
MasterCatalog family 21881
18. AF 110984 intercellular adhesion
molecule-1 precursor from
MasterCatalog family 21997
These examples potentially include orthologs between different species of Bovidae, paralogs,
alternatively spliced cDNAs with potentially different functional effects and intraspecific
DETECTING COMPENSATORY COVARIATION SIGNALS IN PROTEIN EVOLUTION
USING RECONSTRUCTED ANCESTRAL SEQUENCES
The evolution of protein sequences is nearly always described using one of several
stochastic models for the accumulation of amino acid replacements [Ben97, Fuk02]. These are
captured in algorithms known by widely recognized names (e.g., the Needleman-Wunsch
[Nee70], Smith-Waterman [Smi81], and Felsenstein maximum likelihood [Tho92] tools). These
tools have become more sophisticated in recent years as mathematicians have "inched towards
reality" [Tho92] in their mathematical modeling well supported by a rich background in
statistics, theorems, and proofs.
Nevertheless, patterns of replacement predicted by these mathematical models remain
quite different from the patterns that are actually observed in proteins diverging under functional
constraints [Ben97, Gau 01]. The reason for these differences is well understood. Briefly, simple
stochastic models treat proteins as if they were linear strings of letters. In reality, proteins have
three dimensional structures that support behaviors that are important for them to contribute to
the fitness of the host ("function"). These behaviors are not a linear sum of the behaviors of their
parts. Amino acid replacement is therefore constrained in a way unanticipated for a linear string
The differences between mathematically convenient models and the reality of organic
chemistry need not be paralyzing, however. First-order stochastic treatments of protein
sequences can provide "null" hypotheses, statements about how proteins would behave if they
were formless, functionless strings of letters. The difference between how proteins actually
divergently evolve, and how first order treatments model their divergent evolution, therefore
contains a signal about fold and function [Ben97].
Analyses of this signal have been remarkably productive. They provide practical tools for
predicting the folded conformation of proteins from sequences [Ben97, Ros94], and many useful
approaches for extracting functional information from genomic sequence data [Ben98, BenOO,
Lib01]. Accordingly, one goal of contemporary computational biology is to extend the concepts
and tools needed to extract signals concerning structure and function from features of divergent
evolution that do not meet the expectations of simple stochastic models.
Perhaps the most serious approximation made by first order stochastic models is their
treatment of individual positions in a protein sequence as independently evolving entities.
Virtually all of these tools analyze sequence divergence one position at a time under a model
where position i in a sequence suffers replacement independently of position. Even models that
recognize that different sites may have different mutability (gamma models) treat sites as being
independently evolving [Miy95]. This is, of course, extremely convenient for any statistical
model for protein sequence divergence as it enables the probability of a sequence alignment
overall to be calculated as the product of probabilities calculated for individual positions in the
Even casual inspection of a multiple sequence alignment, however, shows that positions in
a protein sequence do not suffer replacement independently. Replacements in positions adjacent
in a sequence are strongly correlated [Coh94]. The correlation almost certainly reflects
functional constraints on replacement set within the context of a folded structure. Therefore, it
has proven to be useful, in particular, for predicting the three dimensional structure of protein
Position pairs distant in the sequence but near in space in the three-dimensional fold might
also be expected to suffer replacement in a correlated fashion [Alt87, Alt88]. Many classes of
these can be envisioned. For example, a "big-for-small" replacement at position i might be
compensated by a coincident "small-for-big" replacement at position j, to conserve overall size in
the packed core of the fold, and therefore conserve a functional behavior related to packing (the
stability of the fold). Alternatively, a "positive-for-negative" charge replacement at position i
might be compensated by a "negative-for-positive" charge replacement at position to conserve
overall charge, and therefore conserve a functional behavior related to net charge.
Behind this expectation stands a model based on the neutral theory of evolution [Kim83].
Under this model, amino acid replacements that dramatically alter the physical property of the
side chain (e.g., its size or charge) will disrupt the performance of a protein already optimized to
contribute to the fitness of the host organism. The fitness value of the protein can be restored
only by a second change that compensates for the first alteration in physical properties. In the
language of neutral theory, we would say that the first replacement was selectively
disadvantageous, the second was positively selected (in the context of the first), and both
together lead to a result that is neutral.
Analyses of compensatory replacement have found practical application. For example, a
pair of compensatory changes in the protein kinase family underlay the successful prediction of
the antiparallel sheet in the first kinase domain [Ben91], contradicting a prediction of a parallel
sheet based on motif analysis [Ste84]. This example represents the first example where
compensatory covariation analysis was a key to a bonafide prediction of protein structure. In
another case, Cohen and his coworkers presented an elegant example of compensatory
replacement in phosphoglycerate kinase [GohOO], supporting the suggestion that correlated
change in the evolutionary history of two protein families might be taken as evidence that the
two proteins operate together. More generally, several laboratories have suggested that
compensatory covariation might be used to detect incorrect folds in a structure prediction
environment [Olm99, Che97, Che98].
Unfortunately, comparison of any two sequences diverging at n sites generates n(n-1)/2
pairs of candidate sites holding coincident replacements. These might be compensatory; they
might be coincidental. Only a fraction of these will be truly compensatory, meaning that either
replacement alone would have been rejected by natural selection without the other. Most
replacements are presumably not compensated, either because they are neutral (implying that no
other changes are needed to prevent their having a negative impact on functional behavior), or
because they have a positive impact on fitness (implying that no other changes are desired to
neutralize their impact on functional behavior).
Thus, while such compensation is easily recognized when analyzed in the context of a
known crystal structure (the sites suffering compensatory replacements presumably are near in
space), it is difficult to identify the pairs of sites that might suffer compensatory changes without
a crystal structure. As a consequence, the compensatory covariation signal, represented by the
number of pairs of replacement that are causally interrelated (where either alone would be
rejected by natural selection) relative to the n(n-1)/2 pairs that arise whenever two protein
sequences differing at n sites are compared, is weak, at least as it has been generally calculated
[Shi94, Gob94, Neh94]. Some have suggested that the compensatory replacement signal may
never be broadly useful for this reason [Tay94]. Others have suggested that the signal might
have value, especially if it can be strengthened.
A variety of laboratories have explored approaches to strengthen the compensatory
replacement signal [Olm99, Che97]. For example, the signal for charge compensation appears to
be stronger than the signal for size compensation, suggesting that it might be useful to analyze
different types of coincident replacement separately. Further, compensatory covariation signals
appear to be strongly dependent on the evolutionary distance between the two sequences,
measured in PAM units (the number of point accepted mutations per 100 amino acids) [Day78].
For example, charge compensatory changes were found to be more prominent at PAM 25 than at
PAM 100 [Che97].
These observations are not surprising given the model. Charge reversal changes might be
expected to be the change most in need of compensation. Likewise, at longer PAM distances,
pairwise compensation might be obscured beneath compensation arising from replacements at
This work suggested a general strategy to strengthen the signal for compensatory
replacement. At the core of this strategy is the recognition that compensatory replacements can
be calculated in two ways. In the first, two extant protein sequences-sequences that are found in
organisms living today-are examined. As extant sequences are at the "leaves" of an
evolutionary tree, we call these "leaf-leaf" comparisons (Figure 4-1). In the past, virtually all
compensatory substitution has been sought via "leaf-leaf' comparisons.
An alternative approach is possible. Given a set of homologous protein sequences, a
multiple sequence alignment, and an evolutionary tree interrelating them, the sequences of
ancestral proteins represented by nodes in the tree can be approximated using well-known
heuristics. Given reconstructed sequences of ancestral proteins at nodes in an evolutionary tree,
compensatory covariation can be sought using "node-leaf" and "node-node" comparisons.
As compensatory signals are stronger when the sequences being compared are separated by
shorter distances [Che97], and as the distance between two nodes in a tree must be shorter than
the distance between the leaves on the tree (Figure 4-1), node-node and node-leaf compensatory
signals must be stronger than the leaf-leaf signal that contains them. Phrased differently, the
number of replacements between average nodes is smaller than the number between average
leaves. This means that the n(n-1)/2 total number of pairs is smaller, implying that the truly
compensatory pairs, those driven by a fitness constraint, will be less obscured by the background
of uncompensated events, when they are identified on individual branches of a tree.
Further, although the reconstructed ancestral sequences are probabilistic, and cannot be
proven to be correct, even crude heuristics for reconstructing ancestral sequences localize
specific changes to specific regions of a tree better than if no reconstructions are done at all.
Thus, even poor reconstruction heuristics permit us to focus on briefer episodes of time during
which two compensatory changes might have occurred than is possible by leaf-leaf comparisons.
Last, node-leaf and node-node comparisons model the actual evolutionary events that
might contain the compensatory signal better than leaf-leaf comparisons. The statement that
"position i and position should suffer replacement in a compensatory way" is equivalent to the
statement that if either position i or position suffer replacement individually, the host organism
is less "fit" (in a Darwinian sense). This, in turn, implies that if position i suffers a replacement,
then position is under "positive selective pressure" to suffer a replacement. Conversely, this
means that a replacement at position will be fixed in a population faster than expected for a
neutrally drifting position. This means that true compensatory changes will normally occur on
the same branch of the tree, even on a very short branch of the tree.
This study examined the feasibility of detecting node-node and node-leaf compensatory
covariation signals by examining reconstructed evolutionary events within families of proteins.
A total of 71 families of proteins were examined in this study. For each family, a multiple
sequence alignment and a phylogenetic tree were constructed using Clustal-W [Tho94].
Reconstructed ancestral sequences at nodes throughout the tree were generated using the Fitch
method as described in Methods [Fit71].
We then sought charge reversal replacements in these families. For each family, positions
were identified where an amino acid replacement caused a charge reversal between two ancestral
sequences (representing a replacement event that occurred on a branch between two nodes) or
between a node sequence and a leaf sequence. Fractional changes were included. Each was
associated with a particular branch of the tree.
From these, pairs of replacements displaying charge compensatory behavior were
collected, as described in Methods. A pair of replacements was defined as being charge
compensatory if they were coincident (both occurring on the same branch of the tree), if they
each individually would reverse a charge, and taken together, if no change in the overall charge
of the protein resulted from the two.
A three dimensional crystal structure for a member of the protein family was then extracted
from the PDB and an estimate was made for the strength of the signal. In making this estimate,
we assumed that only proximal charge compensatory changes, near enough in space in the folded
structure that the two side chains could interact coulombically, could be functionally correlated.
We therefore tabulated the distances between the sites in pairs that suffered charge compensatory
Histograms were then constructed to show the distance distribution of pairs suffering
compensatory replacements. As a standard, a distance distribution for all pairs of sites was
calculated for the proteins. If the number of charge compensatory pairs was located
disproportionately more in proximal positions than the average pair (where the square root of the
number of pairs was a crude measure for significance), a signal was considered significant.
As previous studies predicted, leaf-leaf comparisons found an only barely perceptible
charge compensatory signal (Figure 4-2). That is, pairs of positions suffering charge
compensatory replacement in two extant sequences were not much more likely to be near in
space than the average pair. The analogous analysis for each evolutionary branch in the tree,
however, identified a signal that was clearly perceptible (Figure 4-3), even by eye. This signal
was then analyzed.
Charge Compensation and the Surface of the Folded Protein
Surprisingly, the initial analysis (Figure 4-3a) showed that position pairs suffering node-
node or node-leaf compensatory charge replacement were more likely to lie both proximally and
distally in the fold, when compared with the average distance between all position pairs in the
proteins. That is, pairs of positions near in the fold displayed charge compensatory covariation
more frequently than the average pair, as expected should charge compensatory replacement be
functionally correlated. But pairs of positions distant in the fold also displayed charge
compensatory covariation more than the average pair.
Distal charge compensatory replacement suggested two explanations. First, overall net
charge might be a selected trait in a protein. It is conceivable, for example, that a constant
isoelectric point is desired by natural selection. If so, reversing a charge at one position would
require an adaptive replacement leading to the opposite reversal somewhere (anywhere), even at
a second position distant in the fold from the first.
Alternatively, charge changes are more likely to be tolerated by natural selection, and
therefore are more likely to be observed, if they occur on the surface of the protein. The mean
distance between a pair of surface residues is greater than the mean distance between the average
pair of residues. Therefore, charge changes are more likely by chance to occur in pairs more
distant than the average pair if they occur predominantly in positions on the surface, whether the
pair is under direct selection or not (i.e., if the replacement is neutral).
These considerations suggested that the distance between the average position pair might
not be the most informative reference distribution for these studies. Instead, an alternative
reference distribution was calculated (blue bars, Figure 4-3b) for the distance between pairs of
surface residues in the model proteins. This new distribution fit nicely the distal portion of the
distribution of distances between position pairs displaying charge compensatory replacement.
The result implies that the apparently high occurrence of charge compensatory replacement in
distant pairs of residues reflects simply the greater likelihood that charge reversal replacements
occur on the surface of the protein. This surface pair reference distribution does not, however, fit
the proximal portion of the distribution. The probability that a pair of positions displaying charge
compensatory replacement is near in the folded structure is considerably higher than expected
(Figure 4-3b). This represents the "signal" in the charge compensatory pattern of replacement.
When a side chain changes its charge, that change is more likely to be accompanied by a
compensatory change in the charge of another side chain near in space to the first.
Charge Compensation in Both Contiguous and Non-Contiguous Position Pairs
We then explored several features of this signal. First, we asked whether the signal arises
only in positions that were also nearby in the polypeptide sequence (contiguous pairs), or
whether it was also observed in residue pairs > 4 amino acids distant (those of i, i+q relationship,
where q > 4) in the sequence (non-contiguous pairs). A strong signal was also observed for non-
contiguous pairs (Figure 4-3c,d) as well as for contiguous pairs. This implies that a charge
compensatory replacement signal arises when two residues are near in space as a consequence of
the tertiary fold, as well as if they are near in space because they are near in the sequence.
Enhancing the Charge Compensation Signal
These results showed that node-node and node-leaf analyses generated a more perceptible
charge compensatory covariation signal than leaf-leaf analyses. Nevertheless, the signal
We considered several explanations for the small size of the signal. First, we considered a
case where four sites suffer charge reversal replacements in a single evolutionary episode. Site a
suffers a (+ to -) replacement compensated by a (- to +) replacement at site b. Site c suffers a (+
to -) replacement compensated by a (- to +) replacement at site d. Sites a and b are proximal;
sites c and d are proximal. Compensatory changes at sites b and d are required to maintain a
functional protein given changes at sites a and c, respectively. Two pairs that do not represent
adaptively significant compensation (a,d and b,c) arise in addition to the two pairs that do (a,b
and c,d). This is simply another way of saying that changes that need compensation are more
noticeable when the total number of changes is small.
Therefore, we asked whether the signal was stronger if the only branches examined were
those holding exactly one pair of charge compensatory replacements (3e,f for all pairs and non-
contiguous pairs, respectively). The likelihood that two positions undergoing charge
compensatory replacement are near in space was indeed more perceptible after we excluded all
of the events occurring on branches where more than two positions suffered charge reversal. The
number of cases was small, however (57 for all pairs, and 48 for non-contiguous pairs,
respectively), and the plots accordingly displayed substantial variances. As the number of
sequences in the database grows, trees will become more articulated, individual compensatory
events will be more likely to be isolated from others on a single branch of the tree, and the signal
Charge Compensation in Specific Secondary Structural Elements
We then examined more closely the contiguous position pairs (i, i+4 or nearer) displaying
charge compensatory covariation. The most striking feature of the charge compensatory signal
within contiguous pairs was its dependence on the nature of the secondary structural element that
held those positions (Figure 4-4). In 98% of the cases where the positions showing compensatory
replacement had an i, i+4 relationship, one or both of the residues was found in an alpha helix. In
only 1.6% of these cases was one of the residues found in a beta strand, and in none of the cases
were both found in a strand. This was significantly larger than the 47% of the position pairs with
an i,i+4 relationship having one or both residues in a helix found in the dataset as a whole (Table
4-1). Conversely, in 49% of the cases where the position pair showing compensatory substitution
had an i, i+2 relationship, one or both of the residues was found in a beta strand.
Two alternative explanations can be proposed to account for the secondary structure bias.
First, surface helices present residues to solvent (water) once every 3.6 turns. Surface residues
are expected to be less constrained by function from diverging (that is, single replacements are
more likely to be neutral), and are more likely than the average residue to suffer charge reversal
substitutions. Therefore, the abundance of compensatory changes occurring with i,i+3 or i,i+4
relationship in the sequence might simply reflect unconstrained (i.e., neutral) charge reversal at
Alternatively, loss of a coulombic interaction between residues i and i+3 or i+4 might lead
to a protein less able to contribute to the fitness of the host. This view implies that once a charge
reversal substitution occurs at position i, position i+4 is under sufficient positive selection to
acquire a compensating charge reversal substitution.
To explore these alternatives, we sought examples of anti-compensatory charge reversals,
where a charge reversal (+ to -, for example) was accompanied by another charge reversal
substitution in the same direction (+ to -) (Figure 4-5). The histogram showing the distance
distribution in pairs suffering compensatory covariation is shown in Figure 4-5. Here, the
distribution of distances between position pairs carrying charge anti-compensatory substitution
was not noticeably different from the distribution of distances between all surface position pairs
(Figure 4-3b). Notable is the absence of an increased probability of proximal anti-compensatory
pairs (compare with Figure 4-3).
As a predictive tool, this enhanced charge compensatory signal may prove to be most
valuable in secondary structure prediction. The accuracy of a prediction made on the relative
positions of a compensatory pair is extremely high. The "coverage" is low, however. Only a few
dozen examples were observed; only 6.8% of the helices contained within the 71 test families
have one or more charge compensatory position pair. This number will, of course, grow as the
size of the protein families grows with the increasing size of the genomic database (Table 4-2).
Charge Compensation in Buried Residues
The high dielectric constant of water is known to weaken coulombic interactions between
charged species. We therefore asked whether a stronger signal might be found in position pairs
where one or more of the side chains was buried. Figure 4-6a shows the distribution of surface
accessibility calculated for all charged amino acids (DEKR, or Asp, Glu, Lys, and Arg) (blue),
those that participate in a position pair suffering charge compensatory substitution (red), and
those that participate in a position pair suffering charge anti-compensatory substitution (green).
Not surprisingly, most buried charged residues do not suffer charge reversal within the test set.
Compensatory changes near in space (Figure 4-6b) were slightly more likely to be found in
positions that were more buried, a modest signal that suggests that compensation is more
necessary in partly buried sites shielded from the high dielectric constant presented by the
solvent water. This is the case for those residue pairs in protein kinase [Ben91] and
phosphoglycerate kinase [GohOO] where charge compensatory substitution has had predictive
Explicit reconstruction of ancestral proteins has been shown to provide insight into the
structure and function of protein families, both when done in silico [Fit71, Mes97, Tra96] and
when recombinant DNA technology is used to resurrect ancestral proteins from extinct
organisms so they can be studied in the laboratory [Ben88a, Mal90, Sta90, Jer95].
The work reported here applies ancestral reconstructions towards a new goal. We suggest
several conclusions. First, node-node and node-leaf comparisons between reconstructed ancestral
sequences provide a stronger compensatory covariation signal than the leaf-leaf comparisons that
have been used previously in the search for a compensatory covariation signal.
These results are consistent with the model outlined in the Introduction, which holds that
amino acid replacements that dramatically alter the physical property of the side chain will
disrupt the performance of a protein to an extent that requires a compensatory change elsewhere
in the protein structure a significant number of times. The fitness value of the protein can be
restored only by a second change that compensates for the first alteration in physical properties.
In the language of neutral theory, we would say that the first replacement was selectively
advantageous, the second was positively selected (in the context of the first), and both together
are neutral. Thus, while the two compensatory substitutions taken together might be regarded as
collectively neutral, the second mutation in a truly compensatory pair is viewed as positively
adaptive in this model.
These results also suggest that as the database grows, the charge compensatory signal will
become more perceptible as more sequences are added to each family. More sequences mean
more highly articulated evolutionary trees. This, in turn, means that compensatory events will
become better isolated on specific branches, preventing the "spurious" signals that arise when
more than one pair of compensatory events occurs along a specific branch on a tree.
The stronger signal will undoubtedly find use. Predicting secondary structure using
contiguous pairs of compensatory changes is one. It remains to be seen, however, how much data
in a family are required for the signal to be useful to support de novo assembly of a protein fold
in a prediction setting.
Accounting for the Stronger Signal From Node-Node Comparison
The tools that we have presented make the compensatory covariation signal more
perceptible. Isolation of truly compensatory pairs on shorter branches of a tree away from other
changes is, we believe, sufficient explanation for this effect. A tree with shorter branches implies
a smaller n, the total number of differences between the two protein sequences being compared,
diminishing the number of n(n-1)/2 pairs behind which the compensatory signal might be
While it is possible in principle to construct a statistical model that permits double
replacements to be evaluated, this requires a high degree of empirical parameterization. For
example, nearly a decade ago, investigators collected the parameters needed to build a statistical
model that concerned double replacements at adjacent sites [Coh94]. Some 220 parameters are
formally required in this exercise, a large number by most measures.
For the purpose of this work, a signal is considered to be significant if it lies two standard
deviations outside of the fluctuation expected for n sites at a specified distance. This can be
estimated by the square root of the number of sites. By this measure, all of the results reported
here are strongly significant, with the exception of those reported in Figures 4-3e,f and Figure 4-
A Model-Independent Method to Evaluate an Evolutionary Tree
The strength of the compensatory covariation signal undoubtedly depends on the degree to
which the trees and the reconstructed ancestral sequences accurately reflect the history of the
family. If the branching of the tree or the reconstructed sequences themselves are not correct, a
pair of charge compensatory replacements that are coincident, in fact, may not be assigned to the
same branch of a tree. In this case, the signal from this pair will be lost.
Getting the branching correct in an evolutionary tree is a difficult problem. Part of the
difficulty arises because of the trade-off between the accuracy of the tree and the cost of
generating it. For example, the Clustal-W [Tho94] and Fitch parsimony tools used here are
relatively inexpensive methods for reconstructing trees and ancestral sequences. Clustal-W uses
a neighbor joining tool [Sai87] based on estimates of the distances between sequence pairs
derived from the Kimura empirical formula [Kim83]. Ancestral sequences reconstructed by
parsimony are well known to be sensitive to incorrect branching topology. This is the principal
error associated with the choice of this inexpensive reconstruction tool.
More sophisticated methods, including maximum likelihood methods, are expected to
provide better trees, at least given the first order stochastic models. These are expected to
generate ancestral reconstructions that are more robust to errors in tree topology. They are,
however, more expensive.
Even the more expensive tools do not guarantee a correct tree, of course. In practice, the
approximations made in the model (see Introduction) may create systematic error larger than
fluctuation error. To date, the only way to benchmark a tree requires knowledge of the
evolutionary history of the sequences in question [Hil94], or a reconstruction of a simulated
evolutionary process [TakOO]. The first is difficult to get for sequences emerging from natural
history. The second requires a mathematical model for evolution, which is often the same one
that is used to construct the tree in the first place.
Here, the compensatory covariation signal, extracted from reconstructed ancestral
sequences, may provide a metric for the quality of a tree based on organic chemistry,
independent of any mathematical model for evolution. Hypothetically, the best tree should be the
tree that places compensatory replacements truly driven by natural selection on the same branch.
This requires the construction of a tree that reflects the actual evolutionary history. This, in turn,
implies that the tree has the most compensatory covariation is the tree that is most likely to
reflect the actual history.
To illustrate this application, consider four hypothetical proteins, just four amino acids in
length, having the sequences ALKD, MVKD, ALER, and MVER. Exactly two topologies exist
for unrooted trees that relate these four sequences (Figure 4-7). Both reconstructions have two
ambiguous sites in both ancestors. In Topology I, the first two positions are ambiguous; in
Topology II, the last two positions are ambiguous. Both trees require four "homoplastic" events
(independent mutations that cause sequence convergence). Both trees require exactly six
changes. Classical parsimony therefore ranks these two topologies as equally likely.
The two topologies are different, however, with respect to the extent to which charge
changes are compensated. In Topology I, a charge altering replacement is 100% likely to be
compensated. In Topology II, however, a charge altering replacement is only 50% likely to be
compensated. This is illustrated in Figure 4-7 by writing out four trees, each equally likely, that
carry reconstructions that the ambiguities require. If we postulate that compensatory covariation
is maximized, then Topology I is preferred over Topology II.
Conversely, an analogous logic can be used to assign preferred ancestral states involving
charged residues. For Topology I, the ancestral states involving charged residues are fixed. For
Topology II, the preferred ancestral sequences are in reconstructions IIa and IIb.
This metric can be applied even if no crystal structure is available for a protein family. If,
however, a crystal structure is available, then (as a practical matter) one would maximize the
number of proximal charge compensatory changes when identifying the preferred tree.
It will require much future work with many families to determine how useful the metric
will be. Worth noting at this point, however, is that this metric is rooted in principles of structural
biology (that is, organic chemistry), not in a mathematical formalism. Further, the proposed
metric values changes at position i in light of changes at position. Thus, this metric for
evaluating the quality of a tree is fundamentally different from any metric based on first order
stochastic analyses of protein sequences, which treat replacements at site i and site as
Darwinian Requirements for Compensatory Covariation
Even given these results, and the evidence that the charge compensatory substitution signal
can only become stronger as the database grows, it remains inescapable that the charge
compensatory signal is weak, perhaps even weaker "than expected." What might be the scientific
implications of this observation?
Charge compensatory covariation might be weak because the coulombic interactions being
sought may themselves be largely unimportant to the selective fitness of proteins. Gaining or
losing them, in this view, has insufficient impact on fitness to ensure that natural selection will
require compensation, and thereby prevent uncompensated charge reversals from entering the
global proteome. This implies a limit to the tool generally, one imposed by the physical organic
chemistry of folded protein sequences.
An alternative explanation should be considered, however. Observation of a compensatory
pair of substitutions implies, under the neutral theory, that natural selection preserved some
global feature of a protein during the episode represented by the branch between two nodes.
This, in turn, implies some degree of constancy in the behavior of the protein before and after the
episode where compensatory change has occurred.
In this view, compensatory replacement should be observed only in protein families whose
behavior must remain largely constant during this branch. This, in turn, implies that
compensatory covariation should be observed only during episodes where "function," defined as
the behavior that contributes to fitness, is largely conserved. In the language of the neutral
theory, the demand for compensation arises because the protein is optimized at the beginning of
the episode for fitness, the same behaviors are optimal at the end of the episode, and any
replacements occurring during the episode must have the net (and, if necessary, combined)
impact of being neutral with respect to their impact on selected behavior.
This implies, of course, that when functional behavior is changing, there may be no need to
compensate individual replacements in a sequence, even those that reverse charge. Indeed, an
uncompensated change may be more likely to generate a protein with different behaviors, whose
(now) different behaviors contribute most to the (now different) requirements for fitness. In this
view, compensatory covariation should not be observed, or should be observed less frequently,
whenever functional behavior is changing.
In this view, compensatory covariation is scarce because branches of an evolutionary tree
where functional behavior is rigorously conserved are scarce. This is, of course, a controversial
suggestion. Many computational biologists treat homologous proteins in distinct organisms as if
they were "the same protein," and neutral theory remains the majority view of protein sequence
evolution. In contrast, recent work in these laboratories and elsewhere has suggested that
functionally significant divergence in behavior is frequent, and may be the rule more than the
exception. For example, it is almost certainly observed in elongation factors, regarded as some of
the most functionally conserved proteins in the biosphere [Gau01 ].
Given this observation, compensatory replacements may become a powerful tool in
functional genomics to detect episodes where function is, and is not, conserved. A branch that
has more compensatory replacement is more likely to represent an episode where functional
behavior is constant than one with less compensatory replacement.
This is relevant to the issue of "annotation transfer" in comparative proteomics. Annotation
transfer assigns the function of a new protein by identifying in the database a homologous
protein for which the function is known, and transferring annotation describing that function to
the annotation for the new protein. Annotation transfer assumes that function does not change
within a set of homologous protein sequences as they diverge [Heg01, Fet98]. This assumption
has long been known to be poor in many proteins, including many characterized before the dawn
of the age of the genome [Ben88b].
To date, several tools have been suggested to detect functional change. One of these is to
measure KA/Ks values for branches of an evolutionary tree between reconstructed ancestral
sequences at the nodes of the tree [Mes97, Li85]. Here, compensatory changes would indicate
functional constancy, while uncompensated changes would indicate functional change. Because
compensatory analysis rests on protein sequences, while the KA/Ks value requires measurement
of silent substitution rates, and because silent substitution rates are frequently rather high, this
metric for functional recruitment may ultimately prove to be more valuable than KA/Ks ratios,
especially for deeply branching sequences.
The core of this study exploited the PDB "select 25" subset of proteins [Hob94]. Each
protein in this database was matched against the proteins contained in SWISS-PROT (version
33) [Bai91]. The older sequence dataset was chosen to avoid under-annotated sequences, in
particular, pseudogenes that might be divergently evolving without functional constraints.
Families that contained at least 12 members, where the maximum evolutionary distance between
any pair of sequences in the family was between 50 and 120 PAM units, and where the family
had at least two subfamilies defined at PAM 20 with four or more members, were retained.
These criteria, made to ensure a balanced tree, were satisfied by 71 families.
The sequences within each family were aligned using the MultAlign package [KorOO] with
the option PROB from Darwin system [Gon91, Ben93]. The gap shifting heuristic was applied
iteratively until the overall alignment score ceased to improve. Secondary structure assignments
were extracted from the crystallographic data using DSSP [Kab83].
An evolutionary distance matrix and a phylogenetic tree were computed for each family
using Clustal-W [Tho94], which employs a neighbor joining method using distances derived
from the Kimura empirical formula [Kim83]. Branches with negative lengths were ignored.
Probabilistic reconstructed ancestral sequences at nodes in the tree were then built using
the Fitch parsimony method [Fit71]. For those sites where parsimony did not generate a single
assigned residue in an ancestral sequence, fractional probabilities were assigned to each of the
contender amino acids were assigned by the statistical method described by Fitch [Fit71]. The
statistical method was used to assign probabilities to each possible path between the residue at a
site at an ancestral node's residue (either a single amino acid or a set of possible amino acids
each with an assigned probability) and its two descendant residues (either a residue in an extant
species' sequence (a node-leaf comparison) or a residue (or set of possible amino acids) in
another ancestral species (a node-node comparison)).
Using probabilistic reconstructed ancestral sequences computed in this way, charge
compensation was sought on individual branches of the evolutionary trees. For each branch in
each tree, the sequences at the flanking nodes were compared (residue by residue) to identify
single substitutions that reversed charge. A position was retained if and only if one of the
following transition probabilities (K->E, K->D, R->E, R->D, E->K, E->R, D->K, D->R) was >
0.2. If more than one probability was > 0.2, then both transition probabilities were retained for
Coordinates locating the position of C-beta were then extracted from the reference crystal
structure with the PDB. When the residue was Gly, the position of the beta carbon atom was
inferred from the position of the backbone atoms. A Perl script was written to permit calculation
of the distances between the beta carbon atoms for each pair of amino acids retained. This
generated the distances reported in the Figures. The accessible surface area (ASA) was computed
using the DSSP program [Kab83].
A list was then made of all pairs of positions having compensatory charge reversals for
each branch of the tree. These were defined to be "position pairs" undergoing charge
compensatory covariation. The distance between the beta carbon atoms of the position pairs was
inferred from the positions of those two carbon atoms in the reference crystal structure. The
pairwise distance was then calculated for all amino acids in the reference crystal structure, and
then for the distance between pairs of residues on the surface of the reference structure.
L1 L2 L3 L4L
5 L6 L7 L8
Sequences of extant
proteins at the
leaves of the tree
Figure 4-1. A leaf-leaf comparison (red) traverses more evolutionary distance than a node-node
M compensatory pairs
r[~) all pairs
M- compensatory pairs
r-I all pairs
.fl S C) .4 I~ fl S W .4 1- ,~ 0% fl ~4 C.-
.4 .4 C-. n fl IA ~D 0 1. 1 0
M compensatory pairs
I-1 surface pairs
M compensatory pairs
C- surface pairs
- ta distancllIIIII II II
Figure 4-2. Attempted detection of charge compensatory covariation signal using leaf-leaf
comparisons. (a) Distribution of distances in 71 protein families between position
pairs displaying charge compensatory substitution (a positive-for-negative
substitution at position i, and a negative-for-positive substitution at position j) (red
bars), using as a reference (green bars) the distribution of all pairwise distances in the
proteins. The x-axis is in angstroms; the y-axis is frequency, with each distribution
normalized to unity. Note the absence of considerably taller red bars at short
distances. Total observations in sample: 13,460. (b) As in (a), but using as a reference
curve the distances between all pairs of surface residues (blue bars, > 50% exposure,
calculated by normalizing its accessible surface area (ASA) by the standard ASA of
the residue. ASA was computed with the DSSP program. Total observations in
sample: 13,460. (c) As in (a), but considering only non-contiguous pairs, those where
i andj are separated by more than four positions, with reference to a distribution of
distances between all pairs of residues in the reference crystal structures (green bars).
Total observations in sample: 12,811 .(d) As in (c), but with reference to the
distribution of distance between surface pairs (blue bars). Total observations in
U~ N I 0 tO 1. 14 0 In .-
0 '0 CC 0 0 '0 CC 0 '0 fl -
.-1 .4 N t1 P~ W C- P.
0.10 i .oao ..w ory pai. 0+.1 i 1 0ri 0r*
oV o .1) ol
i 0.10 M 4 0.04
a3. 0. .02
........ .l .... .
- ~ p.fr.
1 V .1 p.0.
-, 0 0- 0- --
C-beta distance C-beta distance
Figure 4-3. Detecting charge compensatory covariation signal using explicitly reconstructed
ancestral sequences. (a) Distribution of distances in 71 protein families between
position pairs displaying charge compensatory substitution (a positive-for-negative
substitution at position i, and a negative-for-positive substitution at position j) (red
bars), using as a reference (green bars) the distribution of all pairwise distances in the
proteins. The x-axis is in angstroms; the y-axis is frequency, with each distribution
normalized to unity. Note the greater height of the red bars at both short distances and
at long distances. Total observations in sample: 803.3 (note fractional number
reflecting fractional character assignments in ancestral states; precision is less than
implied by the decimal). (b) As in (a), but using as a reference curve the distances
between all pairs of surface residues (blue bars, > 50% exposure, calculated by
normalizing its accessible surface area (ASA) by the standard ASA of the residue.
ASA was computed with the DSSP program. Note the greater height of the red bars at
short distances only. This is the compensatory covariation signal. Total observations
in sample: 803.3. (c) As in (a), but considering only non-contiguous pairs, those
where i andj are separated by more than four positions, with reference to a
distribution of distances between all pairs of residues in the reference crystal
structures (green bars). Total observations in sample: 745. (d) As in (c), but with
reference to the distribution of distance between surface pairs (blue bars). Total
observations in sample: 745.5. (e) Distance distribution of charge compensatory pairs
where only one such pair occurs on a specific branch of the evolutionary tree between
reconstructed ancestral nodes. Total observations in sample: 57.2. Note the small
sample size. (f) Same as (e), but where the position pair is separated by more than
four positions in the linear sequence. Total observations in sample: 48.3. Note the
small sample size.
i, i+5 i, i+6
MEE NEC MEH
*HH HCEI CC
Figure 4-4. Predicting secondary structure using contiguous pairs of compensatory changes. The
relative sizes of the pies indicate the relative numbers of examples of each pair.
Where position pairs are separated by 1-5, or 6 positions, the likelihood that residue
pairs are both found in helices (red), both found in strands (dark green), one in a helix
and the other in a coil (pink), one in a strand and the other in a coil (light green), one
in a helix and the other in a strand (violet), and both found in coils (yellow). Note that
if two charge compensatory substitutions are observed with a i, i+4 relationship, one
or both are 98% likely to lie in a helix. The total number of observations; i, i+1 =
12.82; i, i+2 = 6.47; i, i+3 = 11.38; i, i+4 = 23.37; i, i+5 = 4.79; i, i+6 = 6.48. Note
fractional values and the small size of these samples.
H1 H M M 0 H 0 0 H
M anti-conpensatory pairs
=) surface pairs
0 0 N O V 0 4a W c N V 0 V M c V i
H H M M M IV V &A 0 %0 C, 0O
Figure 4-5. Distribution of distances between charge anti-compensatory pairs (red bars) (a)
relative to all pairwise distances (green bars) and (b) relative to pairwise distances
between all pairs of surface residues (blue bars, > 50% exposure). Notable is the
absence of the increased probability of proximal anti-compensatory pairs (compare
with Figure 4-3). The total number of observations in both distributions = 793.3.
M anti-compensatory residues
M compensatory residues
o O DEKR residues
1? T T I I I I IT I I I
M compensatory (near)
41j t I 11 I I I I n f I
o O 0O 0 0 0 0 0 0 0 H t Hl
o H Ci M l i O! 0 H i
O O O O O O O O O O H H H
Figure 4-6. Surface accessibility of charged residues, and charged residues participating in a
charge compensatory event.(a) A frequency versus accessibility distribution for all
Asp, Glu, Lys, and Arg (DEKR) residues (blue), those participating in a charge
compensatory (red) and anti-compensatory (green) events. Bars of each color sum to
unity. Buried DEKR residues are less likely than average to suffer charge reversal
substitution. Compensated charge reversal substitutions are slightly more likely than
anti-compensated charge reversal substitutions. The total number of observations in
sample: anti = 2,812.4; comp = 2,792.2; DEKR = 3,530.0. (b) A frequency versus
accessibility distribution showing the frequency of compensated charge reversals for
proximal pairs near in space (less than 12 A distant, red) in the folded structure, and
distal pairs distant in the folded structure (more than 12 A distant, green). A pair of
charged compensatory substitutions is slightly more likely to be near in space if the
side chains involved are more buried. The total number of observations in sample:
near = 367.1; far = 795.7.
- - charge changes compensated
AAAAA charge changes not compensated
--- no charge changes
M1A < M1A
M K3E M1A
MLKD --_-_ MLER
a L2V < L2V
- - charge changes compensated
4vdvvv charge changes not compensated
--- no charge changes
0.5 M1A V2L
0.5 A1M A1M
D 0.5 L2V
V2L V2L AL
\^ K3E /
AVKD >--D R- AVER
S a MV
AK K3E D
ALKD -D- ALERT
MVKD c MVER MVKD d MVER
Figure 4-7. A schematic illustration of the use of compensatory covariation to select a preferred
tree from two equally parsimonious trees. The two tree topologies relating the four
sequences (ALKD, MVKD, ALER, and MVER) each require six changes. The
changes are marked on individual branches, with fractional changes arising from the
ambiguity in the ancestral sequences. The ancestral sequences are placed at the nodes
in the tree, with ambiguous sites (by parsimony) noted by placing the two possible
residues above and below a horizontal line. For each topology, identical trees holding
all four possible ancestral sequences are shown. Each, by parsimony, has equal
likelihood (0.25 for each).In Topology I, the ancestral sequences are ambiguous at the
first two positions. In Topology II, these are ambiguous at the last two positions.
Both trees require the same amount of homoplasy (convergence). Classical parsimony
analysis is indifferent with respect to the two topologies. In Topology I, however, the
likelihood that a charge reversal is compensated is unity. In Topology II, the
likelihood that a charge altering replacement is compensated is only 0.5. Thus,
Topology I is preferred if compensatory covariation is maximized. This criterion is
independent of mathematical formalisms used to construct the tree. Further, the
metric weights changes at position i depending on events at position j, making this
metric for evaluating a tree fundamentally different from any metric based on a first
order stochastic analysis of protein sequences
MVKD -R. MVER
VK D4 R <\
Table 4-1. Frequencies of the average contiguous position pairs participating in a helix versus
Relationship between the pair in the protein sequence
Pair i,i+1 i,i+2 i,i+3 i,i+4 i,i+5 i,i+6
EE 0.185 0.146 0.119 0.099 0.087 0.079
EC 0.077 0.148 0.192 0.219 0.234 0.239
strand 0.262 0.294 0.311 0.318 0.321 0.318
HE 0.003 0.011 0.022 0.033 0.044 0.055
HH 0.307 0.274 0.243 0.223 0.208 0.194
HC 0.072 0.132 0.184 0.214 0.233 0.249
helix 0.382 0.417 0.449 0.470 0.485 0.498
CC 0.356 0.289 0.241 0.212 0.195 0.185
Total 1.00 1.00 1.00 1.00 1.00 1.00
Total sample 12.8 6.5 11.4 23.4 4.8 6.5
EE, both positions involved in a charge compensatory event found in strands; EC, one in a strand
and the other in a coil; Strand: HE, one in a helix and the other in a strand; HH, both in helices;
HC, one in a helix and the other in a coil; Helix: CC, both in coils. Calculated from the test set of
reference crystal structures using DSSP. Note the small number of observations in each sample,
and the fractional number of changes arising from the parsimony reconstructions.
Table 4-2. List of 71 protein families used in this analysis
ID L N Protein name
121P 166 60 H-Ras P21 Protein
193L 129 43 Lysozyme
1AAF 55 38 Hiv-1 Nucleocapsid Protein
1AAK 152 20 Ubiquitin Conjugating Enzyme
1ARS 396 28 Aspartate Aminotransferase
1ATP(E) 350 20 c-AMP-Dependent Protein Kinase
1BET 107 14 Beta-Nerve Growth Factor
1BP2 123 64 Phospholipase A2
1CCR 112 93 Cytochrome c
1CPC(B) 172 44 C-Phycocyanin
1CYO 93 17 Cytochrome B5 (Oxidized)
1DLH(A) 180 28 Hia-Drl (Dra, Drbl 0101) Human Class II Histocompatibility Protein (Extracellular
1DLH(B) 188 45 HIa-Drl (Dra, Drbl 0101) Human Class II Histocompatibility Protein (Extracellular
lEFT 405 57 Elongation Factor Tu (Ef-Tu)
1FRP(A) 335 18 Fructose- 1,6-Bisphosphatase (D-Fructose- 1,6-Bisphosphate 1-Phosphohydrolase)
1FVL 70 29 Flavoridin
1FXI(A) 96 69 Ferredoxin I
1GDD 353 51 Guanine Nucleotide-Binding Protein G(I)
1GPI(A) 198 16 Glutathione Peroxidase
1HAR 216 36 HIV-1 Reverse Transcriptase (Amino-Terminal Half)
1HCN(A) 92 26 Human Chorionic Gonadotropin
1HDG(O) 332 109 Holo-D-Glyceraldehyde-3 -Phosphate Dehydrogenase
1HGE(A) 328 70 Hemagglutinin
1HLE(A) 345 14 Horse Leukocyte Elastase Inhibitor
1HMT 132 16 Fatty Acid Binding Protein
1HPM 386 128 44K Atpase Fragment (N-Terminal) Of 70kda Heat-Shock Cognate Protein
1HRA 80 42 Retinoic Acid Receptor
1HRY(A) 76 43 Human Sry
1HTB(A) 374 56 Beta-3 Alcohol Dehydrogenase
1HTM(D) 138 92 Hemagglutinin Ectodomain (Soluble Fragment, Tbha2)
1HUR(A) 180 35 Human Adp-Ribosylation Factor 1
1HUW 166 31 Human Growth Hormone
1HVD 319 23 Annexin V
IIRK 306 17 Insulin Receptor (Tyrosine Kinase Domain)
IITG 166 42 Hiv-1 Integrase (Catalytic Domain)
IIVD 388 23 Influenza A Subtype N2 Neuraminidase (Sialidase)
1LDM 329 27 M4 Lactate Dehydrogenase
1MHC(A) 282 143 Mhc Class I Antigen H2-M3
1MLS 154 74 Myoglobin
1NDH 272 27 Cytochrome B5 Reductase
1NHK(L) 144 26 Nucleoside Diphosphate Kinase
1NIP(A) 289 35 Nitrogenase Iron Protein
10OCT(C) 156 33 Oct-1 (Pou Domain)
10OSA 148 64 Calmodulin
1PBX(A) 143 56 Hemoglobin
1PDN(C) 128 28 Prd Paired Domain
1PLQ 258 13 Proliferating Cell Nuclear Antigen
1PVC(2) 271 24 Poliovirus Type 3, Sabin Strain
1REC 201 21 Recoverin
1SXC(A) 151 51 Superoxide Dismutase
1TGX(A) 60 51 Toxin gamma
1TIV 86 24 HIV-1 Transactivator Protein
Table 4-2. Continued
ID L N Protein name
1TPH(1) 247 35 Triosephosphate Isomerase
1YTB(A) 180 18 TATA-Box Binding Protein
1ZAA(C) 87 18 Zif268 Immediate Early Gene
2BTF(A) 375 121 Beta-Actin-Profilin Complex
2CPL 165 35 Cyclophilin A
2GDM 153 23 Leghemoglobin
2HMX 133 40 Human Immunodeficiency Virus Type 1 Matrix Protein
2HPE(A) 99 36 HIV-2 Protease
2REB 352 58 Rec A Protein
2TGI 112 18 Transforming Growth Factor-Beta Two
3RUB(S) 123 79 Ribulose 1,5-Bisphosphate Carboxylase/Oxygenase (Form III)
4ENL 436 35 Enolase
4FGF 146 16 Basic Fibroblast Growth Factor
4GCR 174 31 Gamma-B Crystallin
4MT2 62 50 Metallothionein Isoform II
4RHV(1) 289 24 Rhinovirus 14
4RHV(3) 236 24 Rhinovirus 14
7RSA 124 48 Ribonuclease A
8CAT(A) 506 34 Catalase
ID, Protein Data Bank identifier, protein subunit in parentheses; L, chain length; N, number of
sequences in multiple sequence alignment.
THE PLANETARY BIOLOGY OF CYTOCHROME P450 AROMATASES
The emergence of complete genomes for many organisms, including humans, has created
the need for hypotheses concerning the "function" of specific genes that encode specific proteins
[Gau04]. While "function" is interpreted by different workers in different ways [Bor98],
Darwinian theory (by axiom) requires that the term be connected to fitness; natural selection is
the only mechanism admitted by theory to generate functional behavior in a living system, macro
or molecular. This, in turn, implies that the hypotheses about function have a "systems"
component, including the interaction of the protein with other proteins, their impact on the
physiology (defined broadly) of the cell and organism, and the consequences of physiology in a
changing ecosystem in a planetary context [Ben02].
Systems hypotheses can be supported by information from many areas. Geology,
paleontology, and genomics, for example, provide three records that capture the natural history
of past life on Earth. At the same time, structural biology, genetics, and organic chemistry
describe the structures, behaviors and reactivities of proteins that allow them to support present
life. It has been appreciated that a combination of these six types of analysis provides insights
into functional behavior of proteins that cannot be provided by any of these alone [Ben02]. Over
the long term, we expect that the histories of the geosphere, the biosphere, and the genosphere
will converge to give a coherent picture showing the relationship between life and the planet that
supports it. This picture will be based, however, on individual cases that serve as paradigms for
making the connection.
The aromatase family of proteins offers an interesting system to illustrate the power of this
combination as a way to create hypotheses regarding protein function within a system [Con01a].
These hypotheses are not "proof," of course, but are limiting in genomics-inspired biological
experimentation, now that genomic data themselves are so abundant.
Aromatases are cytochrome P450-dependent enzymes that use dioxygen to catalyze a
multistep transformation of an androgenic steroid (such as testosterone) to an estrogenic steroid
(such as estradiol) (Figure 5-1). The protein plays a key role in normal vertebrate reproductive
biology, a role that appears to have arisen before fish and tetrapods (land vertebrates, including
mammals) diverged some 375 Ma. [Cal84]. Aromatase is important in modern medicine as well,
especially in breast and other hormone-dependent cancers [Wol02].
Different numbers of aromatase genes are found in different vertebrates. Two aromatase
genes are known in teleost fish [Cal97, Cha97]. Only a single gene is known in the horse
[Boe97], rat [Hic90], and mouse [Ter91]. Cattle have both a functional gene and a pseudogene
built from homologs of exons two, three, five, eight and nine of their functional gene; these are
interspersed with a bovine repeat element [Fur95, Hin93]. In several mammalian species,
including humans and rabbits, a single gene yields multiple forms of the mRNA for aromatase in
different tissues via alternative splicing [Har88, Del96, Sim97, Del98].
A still different phenomenology is observed in the pig (Sus scrofa). Three different mRNA
molecules had been reported in different tissues from pig [Cor95, Con97, Cho96, Cho97a,
Cho97b]. Compelling evidence then emerged that the three variants of mRNA identified in
cDNA studies arose from three paralogous genes [GraOO], rather than from a single gene
differentially spliced [Cor04]. This implies that the three aromatase paralogs in pigs arose via
gene duplications relatively recent in geologic time.
Hypotheses relating the function of the three depend in part on when those duplications
took place. If they were very recent, then the three genes might have helped pigs adapt to