<%BANNER%>

The Bioinformaticist's Toolbox in the Post-Genomic Age: Applications and Developments

HIDE
 Title Page
 Acknowledgement
 Table of Contents
 List of Tables
 List of Figures
 List of abbreviations
 Abstract
 Exobiology and post-genomic...
 Extending the PBL method for estimating...
 The adaptive evolution database...
 Detecting compensatory covariation...
 The planetary biology of cytochrome...
 Appendix A: Names and abbreviations...
 References
 Biographical sketch
 

PAGE 1

1 THE BIOINFORMATICISTS TOOLBOX IN THE POST-GENOMIC AGE: APPLICATIONS AND DEVELOPMENTS By DAVID R. SCHREIBER A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLOR IDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2007

PAGE 2

2 Copyright 2007 By David R. Schreiber

PAGE 3

3 ACKNOWLEDGMENTS Tremendous thanks go to James Deyrupfor the opportunity, the fa ith, understanding, generosity, goodwill, humanity and love. Thanks also go to Steve Bennerfor the freedom, the enthusiasm, the privilege, and for an absurd por tion of patience. And th anks go to my family for being there, wherever there chose to be.

PAGE 4

4 TABLE OF CONTENTS page ACKNOWLEDGMENTS...............................................................................................................3 LIST OF TABLES................................................................................................................. ..........6 LIST OF FIGURES................................................................................................................ .........7 LIST OF ABBREVIATIONS..........................................................................................................8 ABSTRACT....................................................................................................................... ............10 EXOBIOLOGY AND POST-GENOMIC SCIENCE...................................................................12 Introduction................................................................................................................... ..........12 The Conventional Evolutionary Paradigm.............................................................................13 Post-Genomic Science: Mode ling Molecular Evolution........................................................17 The Markov Model..........................................................................................................17 Non-Markovian Protein Evolution as a Post -Genomic Tool for Structure Prediction....18 Structure Prediction as a Tool for Identifying Long Distance Homologs.......................18 Recruitment of Function........................................................................................................ .24 Correlating the Paleontological Record w ith Episodes of Sequence Evolution.....................30 Identification of in vitro Behaviors that Contribute to Physiological Function.....................32 Structure Prediction and a Ra pidly Searchable Database.......................................................33 Conclusions.................................................................................................................... .........36 EXTENDING THE PBL METHOD FOR ES TIMATING NUMBERS OF SYNONYMOUS AND NONSYNONYMOUS MUTATIONS TO ACCOUNT FOR AMBIGUITY IN INFERRED ANCESTRAL SEQUENCES............................................................................41 Introduction................................................................................................................... ..........41 An Overview of the PBL Method...........................................................................................42 Extending the PBL Method to Phylogenetic Tr ees with Inferred Ancestral Sequences: Putting Ambiguity Back into the Equations.......................................................................45 THE ADAPTIVE EVOLUTION DATABASE (TAED): APPLICATION OF THE EXTENDED PBL METHOD TO A LARGE DATASET.....................................................56 Background..................................................................................................................... ........56 Results........................................................................................................................ .............60 Discussion..................................................................................................................... ..........63 Conclusions.................................................................................................................... .........64 Materials and Methods.......................................................................................................... .66 DETECTING COMPENSATORY COVA RIATION SIGNALS IN PROTEIN EVOLUTION USING RECONSTRUC TED ANCESTRAL SEQUENCES........................68

PAGE 5

5 Introduction................................................................................................................... ..........68 Results........................................................................................................................ .............73 Charge Compensation and the Surface of the Folded Protein.........................................75 Charge Compensation in Both Contiguous and Non-Contiguous Position Pairs............76 Enhancing the Charge Compensation Signal..................................................................77 Charge Compensation in Specific Secondary Structural Elements.................................78 Charge Compensation in Buried Residues......................................................................79 Discussion..................................................................................................................... ..........80 Accounting for the Stronger Signa l From Node-Node Comparison...............................81 A Model-Independent Method to Ev aluate an Evolutionary Tree..................................82 Darwinian Requirements for Co mpensatory Covariation...............................................84 Methods........................................................................................................................ ..........87 THE PLANETARY BIOLOGY OF CY TOCHROME P450 AROMATASES...........................99 Background..................................................................................................................... ........99 Results........................................................................................................................ ...........101 Discussion..................................................................................................................... ........106 Conclusions.................................................................................................................... .......112 Methods........................................................................................................................ ........114 NAMES AND ABBREVIATIONS OF NUCLEI C ACID BASES AND AMINO ACIDS.......125 AN IMPLEMENTATION OF THE EXTENDED PBL METHOD CODED IN JAVA 1.1......126 LIST OF REFERENCES.............................................................................................................141 BIOGRAPHICAL SKETCH.......................................................................................................151

PAGE 6

6 LIST OF TABLES Table page 3-1 A sample listing from TAED.............................................................................................67 4-1 Frequencies of the averag e contiguous position pairs.......................................................96 4-2 List of 71 protein families used in this analysis.................................................................97 5-1 Frequency distribut ions of stem pig dup lication substitutions........................................123 5-2 Distributions of stem pi g duplication substitutions.........................................................124 A-1 Names and abbreviations of nucleic acid bases...............................................................125 A-2 One and three letter symbols for the amino acids............................................................125

PAGE 7

7 LIST OF FIGURES Figure page 1-1 Predicted surface, interior, and secondary structure assignments......................................38 1-2 Evolutionary tree showing the evol utionary history of the leptins....................................39 1-3 Evolutionary tree showing the evolu tionary history of the extracellular...........................40 2-1 Two possible ways to get from TCA to CAA....................................................................49 2-2 A hypothetical phylogeny for sequences t1 t2 and t3 ......................................................50 2-3 Parsimony for each of the nucleotides in sequences t1 t2 and t3 .....................................51 2-4 EPSs for ancestral node a2 ................................................................................................52 2-5 Sequences t1 t2 and t3 numbers of degenerate site s, and average degeneracies............53 2-6 Numbers of transitions and transv ersions for each pair of sequences...............................54 2-7 Tallies of average degeneracies and mutations for each pair of sequences.......................55 4-1 A leaf-leaf comparison (red) trav erses more evolutionary distance..................................89 4-2 Attempted detection of charge compensa tory covariation signal using leaf-leaf..............90 4-3 Detecting charge compensatory covaria tion signal using explicitly reconstructed...........91 4-4 Predicting secondary structure using c ontiguous pairs of compensatory changes............92 4-5 Distribution of distances between charge anti-compensatory pairs...................................93 4-6 Surface accessibility of charged residues..........................................................................94 4-7 A schematic illustration of the use of compensatory covariation to select........................95 5-1 Reactions catalyzed by aromatases on multiple androgenic substrates...........................117 5-2 Dating the pig duplication events....................................................................................118 5-3 The amino acid alignment of exon 4 of two aromatase isoforms....................................119 5-4 Cladogram of the order Artiodactyla showing the extant families..................................120 5-5 Phylogenetic tree for the 18 vertebrate aromatase genes.................................................121 5-6 The distribution of amino acid replacements on the tertiary structure of cytochrome....122

PAGE 8

8 LIST OF ABBREVIATIONS ASA accessible surface area BLAST basic local alignment search tool CASP Critical Assessment of Structure Prediction C-beta beta carbon atom cDNA complementary DNA CUTG Codon Usage Tabulated from GenBank DNA deoxyribonucleic acid DSSP The Dictionary of Protein Secondary Structure EPS equally parsimonious solution HSP heat shock protein Indel insertion (or) deletion JTT Jones-Taylor-Thornton matrix substitution model KA number of nonsynonymous subs titution per nonsynonymous site kcat catalytic constant KM Michaelis constant KS number of synonymous subs titution per synonymous site Ma million years ago MC the MasterCatalog MHC major histocompatibility complex mRNA messenger RNA MSA multiple sequence alignment NMR nuclear magnetic resonance ORF open reading frame product

PAGE 9

9 PAM point-accepted mutation PAML Phylogenetic Analys is by Maximum Likelihood PAUP Phylogenetic Anal ysis Using Parsimony PBL Pamilo, Bianchi and Li PCR polymerase chain reaction PDB The Protein Data Bank PIE Proto-Indo-European RNA ribonucleic acid RNase ribonuclease sum SH2 Src homology 2 domain Src sarcoma t half-life TAED The Adaptive Evolution Database TREx transition redundant exchange

PAGE 10

10 Abstract of Dissertation Pres ented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy THE BIOINFORMATICISTS TOOLBOX IN THE POST-GENOMIC AGE: APPLICATIONS AND DEVELOPMENTS By David R. Schreiber May 2007 Chair: Steven A. Benner Major Department: Chemistry Bioinformaticists of the post-genomic age, havi ng at their disposal the complete sequence of the human genome, find themselves in posse ssion of a true emba rrassment of richesa massive collection of data that grows ever larg er with each new submission. To effectively mine this glut of information, the tools of bioinforma tics must adapt, and new ones must be developed, tested and applied. Herein we describe a modi fication of a method, firs t developed by Pamilo, Bianchi, and Li for estimating mutation rates be tween the protein sequences from two extant species which allows for comparisons with protei ns from ancestral species. In this case, the ancestral sequences are determined by pars imony, but the method is applicable to any phylogenetic method used to deduce ancestral states from extant species data. Computationally cheap, the parsimony method lends itself to the anal ysis of large datasets. We conducted such an experiment on a version of the GenBank database. The results were collected in a new database called The Adaptive Evolutionary Database (TAE D), whose purpose was to be a collection of proteins that have rapidly (and presumably adap tively) evolved at some point in their natural history. In addition to discove ring adaptive episodes, we used this method to search for compensatory covariation signals within anothe r large database of sequencessignals that provide clues relating to the prot eins higher-order stru ctures. These methods are used as well in

PAGE 11

11 interdisciplinary investigations that incor porate information from the paleontological and geological records. This provided high-level func tional information relating to cytochrome P450 aromatases and their role in Suoidea evolution.

PAGE 12

12 CHAPTER 1 EXOBIOLOGY AND POST-GENOMIC SCIENCE Introduction The 1990s were bountiful years for genomics. At their midpoint, complete sequences were available for the genomes of several eubacteria an archaebacterium, and a eukaryote (yeast). Before the decade was over, a half dozen additio nal bacterial genomes would be added to this list, together with the complete genome of Caenorhabditis elegans and three years later that of Homo sapiens In the 21st century, the first century of the post-genomic era, sequences will be added from hundreds to thousands of other orga nisms, collected with varying degrees of systematic effort [Ben98]. As these data have accumulated, it has become increasingly apparent that new methods are needed to exploit the information that they (mus t) contain. Chemistry has always been driven by the discovery of new natural products, elucidatio n of their structures, a nd exploration of their behaviors. The genome database provides an enormous new collecti on of natural product structures to study. These disp lay every behavior of intere st to chemists: conformation, supramolecular organization, combinatorial assemb ly, photochemistry, and catalysis are just a few. Organic chemistry should be revol utionized by genomic data. But how? A similar question can be framed for the biom edical sciences. Pharmaceutical and genome corporations worldwide are co llecting and cataloging sequence data, comparing the expressed genetic inventory of diseased and normal tissues, and attempting to correlate genomic data with physiological function. It seems that the treatmen t of human disease shou ld be revolutionized by genomic sequence data. But it is not obvi ous precisely how this will happen. One consequence of genomic sequences from a variety of organisms is the ready availability of the evolutionary histories of families of protei ns represented within the genome

PAGE 13

13 sequence databases. Sequence data will be organized as a set of independently evolving protein sequence "modules." For each of these, an evolutiona ry history can be built that will consist of a multiple alignment of the sequences of the protei ns in the module themselves (as well as their encoding DNA sequences), an evolutionary tree, and a reconstructed ancestral DNA and protein sequence for each branch point in the tree [Ben00] The tools described herein are designed to exploit evolutionary histories directly, extrac ting information concerni ng protein structure, behavior, and function from a detailed understa nding of how protein sequences divergently evolve under functional constraint s. In a post-genomic world, with volumes of sequence data from an ever growing number of organisms, these tools will be used widely to learn more from sequence data about living systems, their chemistry and their diseases. The Conventional Evolutionary Paradigm. Since the mid 1970s, it has been known th at homologous proteinsthose related by common ancestryhave analogous conf ormations (folds) [Ros76, Cho86] at least in their core domains. From this observation has emerged ma ny tools, including prof ile analysis, homology modeling, and threading [Gri87, May95], that help biological chemists model the conformation of a protein from the conforma tion of one of its homologs. If homologous proteins have analogous c onformations, it might be reasoned that homologous proteins might be an alogous in other ways as well. Biomolecular behavior depends in part on conformation. Thus, two homologous proteins with analogous conformations might have analogous behaviors. The reasoning can be carried further. As bi omolecular behavior is important for biomolecular function, perhaps two homologous proteins will also have analogous functions. This train of logic has been viewed as a st arting point for analyzing genomic data. Genome sequencing projects generate se quences of proteins without a ny supporting experimental data

PAGE 14

14 describing the protein itself, data that accomp any most sequences generated in classical biochemical research. While applications for ge nomic sequences are man y, virtually all require that a structure, behavior, or function be assigned to proteins known only as sequences. Chemical theory is insufficient to allow us to assign struct ure, behavior, or function directly to a sequence. But bioinformatic theory is ad equate to use the genomic seque nce to identify homologous proteins in the databa se. Some of these homologs may have known structures, behaviors, or physiological functions. If homology implies struct ural analogy, and (fro m there) behavioral analogy, and (from there) functional analogy, then th e task of assigning structure, behavior, or function to the genomic data should become tr ivial whenever a homolog with a known structure, behavior, or function can be identified. Many tools for analyzing genomic sequence data are based on this logic, which we shall call the conventional evolutionary paradigm. The conventional evolutionary paradigm is implemented throughout bioinformatic s research in various simple ways. Consider the following recipe, used in a variety of co mmercial software packages to analyze a new sequence for a new protein collected in a genome project: The new sequence is first recorded. The sequence is then used as input into a ba sic local alignment search tool (BLAST) which finds putative homologs within the existing sequence database. Proteins in the database that have sequences similar to th e new sequence (by some scoring criterion) are recorded. These are pu tative homologs of the new sequence. The "functions" of the putative homologs are read from their documentation in the database. The function of the new sequence is presumed to be the same as that of the homologs. This approach has many well known limitations. First, a BLAST search need not find an analogous sequence in the database whose func tion is already known. In many cases, a BLAST search fails to find any sequence in the database with a statistically significant sequence

PAGE 15

15 similarity. If no putative homolog can be identified, then it follows that no homolog with known structure, behavior or fu nction can be identified. In other cases, a BLAST search identifie s one or more putative homologs, but the homologs have not been studied sufficiently to assign a structur e, behavior, or function. Again, the paradigm fails to resolve the problem. More frequently, the BLAST search identifies many possible homologs in the database, all with statistically marginal sequence similarities. The approach frequently presents the user with the documentation from several of these, and leav es the user to decide which, if any, of the putative functions are correct. As the literature shows, weak se quence similarities reflecting distant homologies have been both profoundly informative and profoundl y deceptive [Ben91]. Very frequently, the functions of these puta tive homologs are widely different, making it difficult to decide which, if any, function should be imputed to the probe sequence. These problems are all well recognized as lim itations of the conven tional evolutionary paradigm. Less well recognized, however, are pr oblems with the elements of the logic upon which the conventional evolutionary paradigm is based. Proteins with analogous folds need not have analogous behaviors. Protei ns with analogous behaviors need not have analogous functions. As has been reviewed in de tail [Ben88b, Ben89a, Ben89c, Ben90], evolution is a potent process for recruiting old protein folds to perform ne w functions, and many cas es are known where the conventional evolutionary paradigm would ge nerate (indeed, has generated) incorrect conclusions. For example, fumarase (functioning in the citric acid cycle), adenylosuccinate lyase (functioning in nucleotide biosynthesis) and as partate ammonia lyase (f unctioning in amino acid metabolism) are all identified as homologs by a BLAST search. Yet their behaviors are analogous only at the level of or ganic reaction mechanism, and there only at the most abstract

PAGE 16

16 level. The conventional evolutiona ry paradigm fails to suggest co rrect conclusions in this case. Work by Babbitt, Kenyon, Gerlt, and Petsko has added other fascinating examples of how microorganisms can recruit a common fold to catalyze a variety of reactions from the racemization of mandelic acid to the opening/closing of a lactone [Bab95]. In proteins involved in so-called advanced f unctions (for example, developmental biology) in more complex organisms, difficu lties with the homology-implies-analogousstructure/behavior/function assu mption underlying the conventi onal evolutionary paradigm become confounding. For example, protein serine ki nases and protein tyrosi ne kinases are clearly identifiable homologs at the level of sequen ce similarity. The chemist would say that both classes of protein operate using analogous reac tion mechanisms, differing only in the source of the oxygen nucleophile in the phosphoryl transfer reaction. The biologist would note that the physiological function of the two classes of proteins are grea tly different, however. For any biomedical application, the biol ogist would be correct. The physio logically relevant differences in behavior, central to the unde rstanding of biological functi on (phosphorylation on tyrosine versus phosphorylation on serine) ca nnot be inferred for one family from the other using the conventional evolutionary paradigm. The deeper the chemistry of developmental biology is probed in metazoa (multicellular animals), the more it becomes apparent that f unction in the Darwinian sense can change with very little change in sequence. For example, a variety of Src homology 2 (SH2) domains all bind peptide sequences containing phosph otyrosine residues. The binding sp ecificities of the different SH2 domains are different, however, for the pe ptide sequences surround ing the phosphotyrosine residues. It is these specificities that determ ine which protein binds to each individual SH2 domain. The physiologically relevant function of each SH2 domain centers on this pairwise

PAGE 17

17 interaction. Thus, any statement of function for a ny particular SH2 domain must at least identify its phosphotyrosine-containing partner. To assign function at this level, the conventional evolutionary paradigm has little to say. Post-Genomic Science: Mode ling Molecular Evolution These considerations suggest that if it is to be used to anal yze genomic data, evolutionary theory must be applied at a more sophisticated level than the level expl oited by the conventional evolutionary paradigm. Much of this sophisticat ion is available at the present time. Let us examine briefly how biomolecular e volution is modeled as a starti ng point to designing tools for interpreting genomic data. The Markov Model Virtually all of contemporary ge nomic science is based in some sense on an analysis where sequence data are treated simply as strings of characters. This analysis is based on a specific model of sequence evolution at the level of prot eins. Overwhelmingly, these models are in some sense "Markovian. They assume, implicitly or otherwise, that variatio n in a protein sequence occurs independently at each position in the seque nce, that future mutations occur independently of past mutations, and that a single substituti on matrix adequately describes the relative likelihood of each amino acid being replaced by any other amino acid during an episode of molecular evolution. Further, simp lifying assumptions are made when scoring gaps in a pairwise alignment The primary virtue of this model is its simplicity. Yet the model is certainly false. Proteins are not linear strings of independe nt characters. Rather, they are folded structures that have arisen by divergent evolution under constraints imposed by natu ral selection seeking adaptive function. Because real proteins fold in three dime nsions, mutations at different positions in the protein sequence are not independe nt [Coh94]. Nor are future mu tations independent of past

PAGE 18

18 mutations [Coh94]. Insertions and deletions (ind els) display complex patterns that reflect functional constraints on protein evolution [Ben93]. Here, the second virtue of Markovian models their exactness, beco mes useful. Markovian models generate specific expecta tions. These can be explicitly vi olated by the behavior of real proteins during divergent evoluti on, and the extent of the violati on can frequently be quantitated. The differences between the expected and actual evolution of proteins contain clues concerning how proteins fold, function, and create new functi on. Let us examine some tools to extract these clues. Non-Markovian Protein Evolution as a Pos t-Genomic Tool for Structure Prediction Tools that extract information from the nonMarkovian behavior of proteins undergoing divergent evolution have been ex ceptionally valuable for predicti ng the conformation of proteins from an evolutionary history [Ben97]. These t ools have made major st rides towards solving a problem that as recently as a decade ago was c onsidered to be unsolvabl e. The approach is described in detail previously [Ben89b, Ben91], a nd will not be described here. Rather, we shall assume that reasonably accurate (if not perfect) models of protein secondary structure can be predicted for a family of homologous protein se quences. We will then ask how these predictions might be exploited to solve one of the problems noted above with the conventional evolutionary paradigm: the need to have methods more powerfu l than simple sequence analysis to detect long distance homologs. Structure Prediction as a Tool for Identifying Long Distance Homologs The core of a protein fold is conserved duri ng divergent evolution l ong after the sequences within the family have diverged so much that homology is no longer evident by sequence analysis alone. In the mid 1970s, Rossman and hi s coworkers suggested that analogous folds in two proteins might indicate that the proteins are homologs, even when the sequences of the two

PAGE 19

19 proteins bear no statistically significant similarities [Ros76, C ho86]. This observation prompted many groups to develop methods for building m odels from protein sequences starting from marginal sequence similarities [Gri87, May95]. These are known as pr ofile or threading approaches. In practical application, the primary difficulty with these a pproaches is that they generate too many "hits, or matchings of a probe seque nce against a sequence database. The analyses frequently return many proteins that are possi ble homologs. As has been shown in many joint prediction projects (such as the Critical Assessm ent of Structure Predic tion, or CASP projects [Mou96]), these hits are sometimes correct a nd informative. Equally likely, however, the putative homologs identified by a threading or profile analysis are not true homologs, or are misaligned with true homologs using the analysis Especially needed, therefore, are tools to confirm or (especially) deny the possi bility that a target identified from a sub-statistical sequence similarity is a homolog. In the early 1990s, the first example was pres ented where a structure prediction was used to critically evaluate suggestions, based on sub-sta tistical sequence similariti es, that two proteins were homologous. The case concerned the protein ki nases [Ben91]. Protein kinase contains the sequence motif Gly-Xxx-Gly-Xxx-Xxx-Gly (where Xxx is any amino acid) preceded by a strand and followed by a helix [Ste84]. A similar motif was found in adenylate kinase, where a crystal structure was known. Therefore, following the conventional evolutiona ry paradigm, several groups proposed that the two structures were homologs. This proposal implied in turn that protein kinase would adopt the same fold as adenylate kinase. Several groups built models based on this inference [Tay86, Tay84, Wie86, Sho81].

PAGE 20

20 Contrasting with this view wa s a model built for the secondary and tertiary structure of the protein kinase family based on the evolutionary history of the protein. In this model, the GlyXxx-Gly-Xxx-Xxx-Gly motif was predicted to be flanked by beta strand both before and afterwards. Thus, the predicted s econdary structural model of pr otein kinase was not congruent with the experimental structure of adenylate ki nase. Accordingly, the pr ediction noted that the two folds could not be the same. From this, it wa s inferred that the two proteins could not be homologs. This alternative model based on post-genomic methods was later shown to be correct by a subsequently determined crystal structure [Kni 91]. This was perhaps the first time that a predicted structure and a motif analysis had been used to infer the absence of homology between two families catalyzing analogous chemical reacti ons. The tool used in the protein kinase prediction proved to be an example of a more general tool to confir m or deny long distance homology between two protein families. Using this tool, core secondary structural elements of the protein families are aligned sequentially. In this process, th e secondary structural elements are considered to be congruent when every core element from one family finds a core element in the other of the same type (helix or strand), in the same order, where ga ps matched against non-core elements (where a non-core element in one family is not aligned ag ainst any element in th e other) are allowed in any number, and a core element in one protein may be missing in the other, but may not be aligned with a core element in the other of a different type (i.e., helix against strand). This approach is especially powerful when coupled with motif analysis, as was done for protein kinase. Congruent seconda ry structural elements can indicate whether a motif is a significant indicator of homology or not. Thus, flanking a motif, a secondary structural model

PAGE 21

21 might have one of four forms: helix-motif-he lix, helix-motif-strand, st rand-motif-helix, and strand-motif-strand. The secondary structural alignments are congruent if and only if the secondary structural elements flanking the motifs correspond between the two proteins. Homology is not denied if and only if the second ary structures are congruent. This method is preferably applied when each family contains pr oteins that are at l east 120 point accepted mutations per 100 amino acids (PAM units) divergent. Important to this tool is a distinction be tween core and non-core secondary structural elements. The failure of core secondary struct ural elements to correspond between two protein families is a clear contradicti on of homology. But non-core elements need not be conserved over long periods of divergent evolution, and failu re to find a correspondence between non-core elements from one protein family in anothe r does not exclude homo logy. Several ways of defining "core" exist in this context. Some involve crystal structure data; others do not. When an experimental structure is known for a protein (for example, by crystallography or nmr), core elements are conveniently defined geometrically; a core el ement is one where a substantial fraction is buried in the protein fold unexposed to solv ent water. Most useful is the application of this concept to beta strands in a beta sheet. A core strand is one that forms backbone hydrogen bonding interactions with two ot her strands on both of its edges. Thus, a core strand is distinct from an edge strand, whic h forms backbone hydrogen bonds to only one other strand on only one if its edges. Core strands are highly conserve d during divergent evolution. If they are lost, the sheet in which th ey participate cannot be conserved. A more general definition of a core secondary structural unit focuses on the evolutionary stability of the secondary struct ural unit. For the purpose of de tecting long distance homologs, a core secondary structural element, predicted or otherwise, is one that cannot be lost during

PAGE 22

22 divergent evolution without damagi ng the integrity of the protein fo ld. This is based on notions of continuity in protein evolut ion, most fundamentally on the assumption that a protein that has one topology of protein fold (e.g., an eight fold alpha-beta barrel) cannot by continuous evolutionary processes be conve rted into a protein with anot her (e.g., an immunoglobulin fold). It is clear that divergence of biological function can add or subtract peripheral secondary structural elements to create or remove contact el ements, expand or eliminate binding sites, or to modify the performance of the protein in other fashions. In this case, non-core segments of a protei n family are recognized from a family of sequences, preferably between 100 and 150 PAM units divergent for the most divergent pairs, where regions that are deleted. If a segment (inc luding a segment containing a helix or a strand) is deleted in a protein family built from memb ers all sharing significant sequence similarity, it cannot be essential for the integrity of the fold in the family. In applying this tool, one must be concerned about database mistakes; a part of sequence that is deleted because the scientist providing the entry into the databa se neglected to collect it, or neglected to enter it, is not a deletion from the purpose of detecting non-core segments. A third method for identifying a core segment of a protein sequence is applicable to any alignment containing three or more sequences. In the tool, a pairwise alignment is constructed for each pair of sequences in the set using a dynamic programming tool. Consider for example a set of sequences with three proteins, A, B and C. A core segment of the multiple alignment is defined as those regions where the alignment of A with B and the alignment of B with C is consistent with the alignment of A with C. A final method relates to the reconstructed ance stral sequence of the protein. It has long been appreciated [Zuc65] that when the seque nces of two or more homologous proteins are

PAGE 23

23 available, it is possible to c onstruct a probabilistic model for the sequence of the ancestral protein. The part of the ancestral sequence that is reconstructed with high probability is the core of the protein. These reconstructions are done by well-known maximum likelihood tools (for example, as implemented within Darwi n, available via the web at the address http://cbrg.inf.ethz.ch (see also [Gon91])). A core is define d from the ancestral sequences as a segment of the multiple alignment where the averag e probability of the most frequent amino acid at that position is greater than one standard devi ation above the average pr obability of all of the reconstructed positions in the multiple alignmen t. This is a tree-weighted measure of the divergence in the family as a whole, and correlat es with core regions defined in the other ways, as the region of the ancestral se quence that is reconstr ucted with high probability is also the one that has not suffered insertions a nd deletions, and the one that has seen relatively little sequence divergence. These segments also correlate with core segments defined geometrically. Structure predictions made using post-genomic tools have now been applied in several cases to make statements about long distant hom ology and, from there, catalytic behavior. One of the most dramatic was for the heat shock pr otein 90 (HSP 90) family. The predicted secondary structural elements were assembled to yield a tertiary fold that resembled closely the fold determined in the N-terminal fragment of DNA gyrase B (the ATPase fragment) [Wig91]. This analysis was published before an experimental structure of HSP 90 was known [Ger97], and after an experimental study had suggested that HSP 90 did not have catalytic activity as an ATPase. The prediction thus gene rated a statement about catalytic behavior (and, presumably, physiological function) in addi tion to secondary structure. Post-genomic prediction tools were also appl ied to the family of ribonucleotide reductases [Tau97]. Here, proteins with hi ghly divergent sequences were s hown to belong to one universal

PAGE 24

24 family using a combination of protein structural analysis and gene sequencing. The tools in this case confirmed a speculation by Stubbe and cowork ers based on a mechanistic analysis that all ribonucleotide reductases are rela ted by common ancestry [Mao92]. Comparing predicted secondary structure mode ls for a protein family is now a proven technique with an impressive track record to detect or deny long dist ance homologs based on sequence data alone. As genome projects are co mpleted, and evolutionary histories for their constituent protein families arti culated, secondary structural m odels for those families should become routinely available. Thus, these tools should solve the first problem with practical implementation of the conventional evolutiona ry paradigm the difficulty of detecting homologs using standard alignment tools. Recruitment of Function Predicted secondary structures of a protein fa mily offer an approach to solve the general problem of finding distant homologs in a da tabase. They do not, however, address other problems associated with the conventional evol utionary paradigm for assigning function to sequences those that arise when the assump tions underlying the conventional evolutionary paradigm are invalid. This is especially the case when one protein family gives rise to proteins with different function. This po ssibility can never be excluded a priori In the case of the heat shock protein 90, for example, the post-genomic pr ediction tool showed th at the fold of HSP 90 was analogous to the fold of gyrase. It suggest ed (under the conventional evolutionary paradigm) that the behavior and function of HSP 90 was an alogous to the behavior and function of gyrase. Some of these suggestions may in fact be true. But they need not be. The gyrase fold could be recruited in HSP 90 to perform many other (indeed, any other) function. The premise for the post-genomic sciences, howev er, is that the evolu tionary histories of protein families will become generally availabl e as a result of massive genome sequencing

PAGE 25

25 projects. Non-Markovian behavior is not only expected to arise be cause of folded structure. In addition, patterns of sequence evol ution are expected to violate the Markovian model differently if the protein in question is undergoing an episode where function is being changed. A study illustrated the power of this appro ach to identify the active site residues of mammalian alcohol dehydrogenase (E.C.1.1.1.1). Mammalian alcohol dehydrogenases have undergone a rapid episode of sequence evolution in and around the active site as substrate specificity has divergently evolve d to handle xenobiotic substances in the liver. In contrast, over a comparable span of evolutionary distance, th e active site of yeast alcohol dehydrogenase has changed very little, corresponding to an apparently constant role of the enzyme to act on the ethanol-acetaldehyde redox couple. Indee d, by identifying positions in mammalian dehydrogenases where amino acid va riation was observed over a sp an of evolution where the same residues were conserved in the yeast de hydrogenases provided a clear map of the active site of the protein. A particularly clever use of this approach has been described by Lichtarge et al. These workers described an evolutionary trace method that defined functionally significant residues as those that are conserved within a family [Lic96] They then used this approach to identify patches on the surface of proteins that contribute to functionality. The approaches work because the scientist understands something about the function of a protein family. Thus, the active site of alcohol dehydrogenase could be identified because the scientist knew that substrate specificity is conser ved in one branch of a protein family (yeast alcohol dehydrogenases) but not in another (liver alcohol deh ydrogenases). The evolutionary trace method works only when the function being traced is conser ved within the family. Neither approach is applicable when th e evolutionary status of the pr otein function (c onserved or not

PAGE 26

26 conserved) is not known. In general, this status will not be known to the post-genomic scientists examining only genomic data. To apply these po st-genomic tools, therefore, the post-genomic scientist needs a tool for learning whether function is changing with in the family in an episode of protein sequence evolution one based on an analysis of the sequence data alone. One tool is based on the fact that the genetic code is degenerate. More than one triplet codon encodes the same amino acid. Therefore, a mu tation in a gene can be either silent (not changing the encoded amino acid) or expressed (c hanging the encoded amino acid). Especially in multicellular organisms, and most particularly in multicellular animals (metazoa), silent changes are not under selective pressure. In contrast, expressed changes at the DNA level, by changing the structure of the protein that the gene enc odes, change the propert y of the protein. This frequently places these cha nges under selective behavior. The outcome of different selection pressure s on silent and expressed mutations is nonMarkovian behavior in the evolut ion of DNA sequences. Consider th ree cases. In the first, we examine an episode of protein sequence evolution during a period of evolutionary history where, at the outset of the period, the behavior of a protein whos e sequence has been perfectly optimized for a specific biologi cal function, and where that func tion remains constant for the protein throughout the period being examined. Du ring this period, changes in the DNA sequence that lead to a change in the sequence of the encoded protein (e xpressed changes) will diminish the survival value of the prot ein [Ben88b] and therefore will be removed by natural selection. Silent changes will not be removed by natural sel ection, but will accumulate at an approximately clock-like rate, as silent changes are approximately neutral, especi ally in higher organisms. Thus, the ratio of expressed to silent changes will be low during a period of evolution of a protein family where the ancestor and its descendants share a common function.

PAGE 27

27 A second case concerns a peri od of evolution where a protei n is acquiring a new derived function. Its amino acid sequence at the beginning of this episode will be optimized for the ancestral function, rather than the derived function. To be optimally suited for the derived function, therefore, amino acids must be changed in the protein. Thus, changes in the gene that are expressed will have a chance of improving the behavior of the protein vis a vis its new biological function, and these will be selected for. The ratio of expr essed to silent substitutions at the DNA level will be high, and a high expressed/sile nt ratio will reveal a period of evolution of a protein family where the function of the ancestor is changing. In a third case, consider the evolution of a gene encoding a protein that has no function (a pseudogene, for example), neutrall y drifting without functional cons traints. In this case, the expressed/silent ratio wi ll reflect random introduction of point mutations. Given the genetic code and a typical distribution of ami no acid codons within the gene, a ratio of expressed to silent changes will be approximately 2.5 during the period of evolution of a protein family where the ancestor and its desce ndants have no function. Stewart and Messier introduced this approach through a clever and syst ematic analysis of conservation and variation within the lysozyme family [Mes97]. Episodes of rapid sequence evolution were identified by high ex pressed/silent ratios in specific branches of the evolutionary trees. These were correlated with specific episode s in the evolution of th e protein itself and the physiology of the organisms th at contained the protein. This approach can be illustrated quite easil y in a biomedically interesting family of proteins, using an off the cuff method to exam ine the protein leptin, a protein whose mutation in mice is evidently correlated with obesity, and was previously known as the "obesity gene protein. The protein has attracted substantial in terest in the pharmaceutical industry, especially

PAGE 28

28 after a human gene encoding a leptin homolog was isolated. According to the conventional evolutionary paradigm, because it is a homolog of the mouse leptin, the human leptin must also play a role in obesity, and might be an appropr iate target for pharmaceutical companies seeking human pharmaceuticals to combat this common condition in the first world. DNA and protein sequences were retrieved fo r the genes encoding leptins. A multiple alignment for the protein sequences was cons tructed for the DNA sequences and the protein sequences. Congruent tress for both the DNA and pr otein sequences were then constructed, and sequences at the nodes of the tree reconstruc ted using MacClade [Mad92] and the known relationship between the organisms from whic h these sequences were derived. For the DNA sequences, the biologically most plausible tree proved to be the most parsimonious tree as well. The most parsimonious tree for the protein sequences proved not to be the most plausible tree (by one change) from a biological perspective. The DNA tree was taken to be definitive because of its consistency with the biological (cladistic) data. A secondary structure predic tion was made for the protei n family. The evolutionary divergence of the sequences ava ilable for the leptin family is smallonly 21 PAM units (point accepted mutations per 100 amino acids)and predictions were biased to favor surface assignments [Ben94]. Thus, positions holding conserved KREND were assigned as surface residues, conserved H and Q were assigned to the surface as well, while positions holding conserved CST were assigned as uncertain. Five separate secondary structur al elements were identified. The results are summarized in Figure 1-1. A disulfide bond was presumed to connect positions 96 and 146. These secondary structural elements can be accommodated by only a small number of overall folds. Interestingly, the pattern of secondary structure in this prediction is consistent with an overall fold that

PAGE 29

29 resembles that seen in cytokine s such as colony stimulating f actor [Hil93] and human growth hormone [Dev92]. To decide whether evolutionary function may have changed under selective pressure during the divergent evolution of the protein family, silent and expressed mutations were assigned to individual branches on the evolutionary tree. For each br anch of the tree, the sum of the number of silent and expressed changes were tabulated, and the ratio of expressed to silent changes calculated. These are shown in Figure 1-2. The branches on the evolutionary tree leading to the primate leptins from their ancestors at the time that rodents and primates diverged had an extremely high ratio of expressed to silent changes. From this analysis, it was concluded th at the biological function of leptins has changed significantly in the primates relative to the function of the leptin in the common ancestor of primates and rodents. This conclusion has several implications of importan ce, not the least being for pharmaceutical companies asked whether th ey should explore leptins as a pharmaceutical target. At the very least, it suggests that th e mouse is not a good pharmacological model for compounds to be tested for their ability to combat obesity in hu mans. The post-genomic analysis suggests that a primate model must be used to test those compounds, with implications for the cost of developing an anti-obesity drug based on the leptin protein. Intriguingly, a tree can also be built for the leptin receptor (Figure 1-3). Here, the evolutionary history is not so complete. In part icular, fewer primate sequences are available for the leptin receptor than for leptin itself. Thus the reconstructed ances tral sequences are less precise with the leptin receptor family, and the assignment of expressed and silent mutations to the tree are less certain. Nevertheless, it appear s that the leptin receptor has undergone an episode of rapid sequence evolution in the prim ate half of the family as well. The example

PAGE 30

30 illustrates how much sequence data is needed (much) to build reliable models of this nature, as the ambiguity in the assignment of ancestral sequences makes it possible that the receptor was evolving rapidly not only in the lineage leading to primates but also in the lineage leading to mouse. Nevertheless, the approximate correlation betw een the episode of rapid sequence evolution in the leptin family and in the leptin receptor family suggests a tool that might become useful in the advanced stages of post-genomic science when evolutionary hist ories are very well articulated. Here, it might be possible to detect ligand-receptor relations hips between protein families in the database by a correspondence betw een their episodes of rapid sequence evolution. Thus, ligand families should evolve rapidly (in a non-Markovian fashion) at the same time in geological history as their receptor s evolve. It will be interesti ng to identify more sequences for primate leptin receptors to see if a more comple te evolutionary history allows us to see more clearly the co-evolution of the le ptin receptor and leptin itself. Correlating the Paleontological Record with Episodes of Sequence Evolution As discussed above, detailed analyses of evol utionary histories frequently can provide a solution to the most general problem of the conve ntional evolutionary paradigm, the difficulty in routinely identifying a homolog of a target sequence with known function within the database. By analysis of non-Markovian e volutionary behavior at the le vel of the protein, a model of secondary structure can be predicted. This predicti on can be used in turn to detect long distance homologs in some cases and excl ude the possibility of distant homology in ot hers. This increases the likelihood that a homolog will be found with a known structure, behavior, or function for a new protein sequence. If one is found, then the logic associated with the conventional evolutionary paradigm can be applied to gene rate a hypothesis concerning the behavior or function of the protein.

PAGE 31

31 The value of this post-genomic tool to assi gn behavior and structur e to a target sequence problem is expected to grow over the near term, as the ratio of sequences supported by experimental studies to those not supported increases with the c onclusion of genome projects, and as more sequences increase the detail of the e volutionary histories that can be extracted from the database directly, and theref ore the quality of the predicted secondary structural model. At the next level, analysis of non-Markovian behavior at the le vel of the gene can alert the biological chemist that the logic associated with the conventional evolutionary paradigm might not apply in individual cases. In particular, if an episode of ra pid sequence evolution intervenes in the evolutionary tree between the sequence of interest and the sequence with the known behavior and function, the biological chemist is al erted to the possibility that the function of the protein might have changed. This alert is useful even with close homologs, as illustrated in the example with leptin. But what if the evolutionary tree contains no protein with a sequence with assigned function, even one with low sequenc e similarity? Even with more limited evolutionary histories, post-genomic tools that analyze non-Markovian evolution at the le vel of the codon can be useful. By identifying the organisms that provide the sequen ces at the leaves, or external nodes, of the evolutionary tree, it is frequen tly possible to correlate branches in the evolutionary tree with episodes in geological history, as determined from the fossil record. Especially in multicellular animals (metazoa), the fossil record can provide approximate dates for the emergence of new physiological function. In this case, it is possibl e to ask whether an ep isode of rapid sequence evolution in a protein family (in particular, an episode with a high expressed/silent ratio) occurred at the same time as a new physiological function emerged on earth. If so, a first level of

PAGE 32

32 hypothesis about physiological functi on can be proposed, even if no behavior or function of any kind is known for any of the modern proteins. Perhaps the most transparent analysis of th is type concerns prot eins that underwent massive radiative divergences in metazoa a pproximately 600 Ma. This is the time of the Cambrian explosion, an episode in terrestrial history that marks the massive radiative divergence of multicellular animals, including chordates. Proteins famili es undergoing rapid evolution at this time (for example, of protein tyrosine kinases and src homology 2 domains) are almost certainly involved in the basic processes by which multice llular animals develop from a single fertilized egg. This type of analysis might be applie d in the family of ribonuclease (RNase) A (E.C.2.7.7.16), a well known family of digestive proteins found in ruminants. The protein underwent rapid sequence evolution approximate ly 45 Ma, a time where ruminant digestion emerged in mammals [Jer95]. Thus, the rapid mol ecular evolution evident in the reconstructed evolutionary history of this prot ein suggests that the protein is important for ruminant digestive function. Identification of in vitro Behaviors that Contribute to Physiological Function. In vitro experiments in biological chemistry extrac t data on proteins a nd nucleic acids (for example) that are removed from their native envi ronment, often in pure or purified states. While isolation and purific ation of molecules and mo lecular aggregates from biological systems is an essential part of contemporary biological research, th e fact that the data are obtained in a nonnative environment raises questions concerning their physiological rele vance. Properties of biological systems determined in vitro need not correspond to those in vivo and properties determined in vitro need have no biological relevance in vivo

PAGE 33

33 To date, there has been no simple way to say whether or not biological behaviors are important physiologically to a host organism. Even in those cases where a relatively strong case can be made for physiological relevance (for exam ple, for enzymes that catalyze steps in primary metabolism), it has proven to be difficult to deci de whether individual prope rties of that enzymes ( kcat, KM, kinetic order, stereospecific ity, etc.) have physiological rele vance. Especially difficult, however, is to ascertain which behaviors measured in vitro play roles in "higher" function in metazoa, including digestion, development, re gulation, reproduction, and complex behavior. Analysis of non-Markovian behavior, as descri bed above, permits the biological chemist to identify episodes in the history of a protei n family where new function is emerging. This suggests a general method to determ ine whether a behavior measured in vitro is important to the evolution of new physiological function. We may take the following steps: a) Prepare in the laboratory protei ns that have the reconstructe d sequences corresponding to the ancestral proteins before, during, and after th e evolution of new biol ogical function [Jer95], as revealed by an episode of hi gh expressed to silent ratio of substitution in a protein. This high ratio compels the conclusion that the protei n itself serves a physiological role, one that is changing during the period of ra pid non-Markovian se quence evolution. b) Measure in the laboratory the behavior in ques tion in ancestral protei ns before, during, and after the evolution of new biologi cal function, as revealed by an episode of high expressed to silent ratio of substitution. Those behaviors that increase during this episode are deduced to be important for physiological func tion. Those that do not are not. Structure Prediction and a Rapidly Searchable Database The overarching problem with genomic sequence databases is their sheer size. This makes them tedious to search, and nearly impossible to subject to exhaustiv e self-matching [Gon92]. Predicted structures can be conn ected with reconstructed evolutio nary histories to resolve this problem, to build a rapidly searchable database. Consider the following steps: (a) A multiple alignment, an evolutionary tree, and ancestral sequences at nodes in the tree are constructed for a set of homologous proteins.

PAGE 34

34 (b) A corresponding multiple alignment is constr ucted by methods well known in the art for the DNA sequences that encode the proteins in th e protein family. This multiple alignment is constructed in parallel with the protein alignment. In regions of gaps or ambiguities, the amino acid sequence alignment is adjusted to give the alignment w ith the most parsimonious DNA tree. (c) Mutations in the DNA sequences are then assi gned to each branch of the DNA evolutionary tree. These may be fractional mutations to reflect ambiguities in the sequences at the nodes of the tree. When ambiguities are encountered, alternativ es are weighted equally. Mutations along each branch are then assigned as being "silent, m eaning that they do not have an impact on the encoded protein sequence, and "expressed, meani ng that they do have an impact on the encoded protein sequence. Fractional assignments are made in the case of ambiguities in the reconstructed sequences at nodes in a tree. (d) A prediction is then made for each protein fa mily. A secondary structure is predicted for the family, and this predicted secondary structure is aligned with the an cestral sequence at the root of the tree. If the root of the tree is unassigned, the predic ted secondary structure is aligned with the ancestral sequence calculated fo r an arbitrary point near the center of gravity of the tree (e) An ancestral sequence is then reconstructed at nodes on the tree, and at a point on the tree as near as possible to its root the point on the tree representing th e oldest (geologically) sequence. Steps (a) through (e) provide a method to or ganize the protein sequence database in a rapidly searchable form. The ancestral sequen ces and the predicted secondary structures associated with the families defined by steps (a) through (e) are surrogates for the sequences and structures of the individual prot eins that are members of the family. The reconstructed ancestral sequence represents in a single sequence all of the sequences of the descendent proteins. The predicted secondary stru cture associated with the ancestral sequence represents in a single structural model all of the core secondary struct ural elements of the descendent proteins. Thus, the ancestral sequences can replace the des cendent sequences, and the corresponding core secondary structural models can replace the seco ndary structures of the descendent proteins. This makes it possible to define two surrogate databases, one for the sequences, the other for secondary structures. The first surrogate databa se is the database that collects from each of the families of proteins in the databases a single ancestral sequence, at the point in the tree that most accurately approximates the root of the tree If the root cannot be determined, the ancestral

PAGE 35

35 sequence chosen for the surrogate sequence database is near the cen ter of gravity of the tree. The second surrogate database is a database of th e corresponding secondary structural elements. The surrogate databases are much smaller than the complete databases that contain the actual sequences or actual structures for each prot ein in the family, as each ancestral sequence represents many descendent proteins. Further, because there is a limited number of protein families on the planet, there is a limit to the size of the surrogate databases. Based on our work with partial sequence database s [Gon92], we expect there to be fewer than 10,000 families as defined by steps (a) through (e). Searching the surrogate databases for homologs of a probe sequence proceeds in two steps. In the first, the probe sequence (or structure) is matched against the database of surrogate sequences (or structures). As there will be on th e order of 10,000 families of proteins as defined by steps (a) through (e) after all the genomes are sequenced for all of the organisms on earth, there will be only on the order of 10,000 surrogate seque nces to search. Thus, this search will be far more rapid than with the complete databa ses. A probe protein sequence (or DNA sequence in translated form) can be exhaustively matched [G on92] against this surrogate database (that is, every subsequence of the probe sequence will be matched against every subsequence in the ancestral proteins) more rapidly than it coul d be matched against the complete database. Should the search yield a significant match, the probe sequence is identified as a member of one of the families already defined. The probe sequence is then matched with the members of this family to determine where it fits within the evolutionary tree defined by the family. The multiple alignment, evolutionary tree, predicted secondary struct ure and reconstructed ancestral sequences may be different once the new probe seque nce is incorporated into the family. If so, the different multiple alignment, evolutionary tree, and predicted secondary structure are

PAGE 36

36 recorded, and the modified reconstructed ancestra l sequence and structure are incorporated into their respective surrogate databases for future use. The advantage of this data structure over thos e presently used is apparent. As presently organized, sequence and structure databases treat each entry as a distinct sequence. Each new sequence that is determined increases the size of the database that must be searched. The database will grow roughly lin early with the number of orga nismal genomes whose sequences are completed, and become increasingly more expensive to search. The surrogate database will not grow linearly. Most of the sequence families are already represented in the existing databa se. Addition of more sequences wi ll therefore, in most cases, simply refine the ancestral sequences and associat ed structures. In any cas e, the total number of sequences and structures in their respectiv e databases will not grow past ca. 10,000the estimate for the total number of sequence families that will be identifiable after the genomes of all organisms on earth are sequenced. If a dramati cally new class of organism is identified, this estimate may grow, but not exponentially (as is the growth of the present database). Conclusions The evolutionary histories of protein fa milies are now becoming routinely available through genome sequencing projects. These histories comprise a multiple alignment for their protein sequences and the co rresponding DNA sequences, an e volutionary tree showing the pedigree of these sequences, and reconstructed ances tral sequences for each node in the tree. In a post-genomic world having genomic sequences fro m an unlimited number of organisms, these histories will be used to conn ect structure, chemical reactiv ity, and physiological function to these families. Our work consists of the application and de velopment of post-genomic tools that exploit these evolutionary histories. Analysis of non-Ma rkovian behavior at the level of the gene can

PAGE 37

37 identify proteins within a family that have new functions. Systematic application of such analyses to sequence databases can be used to create a database of rapidly evolving protein families which could serve as a source of starting points for deeper study. Further, nonMarkovian patterns of replacement within a family can provide in formation relating to the form (secondary and tertiary structur e) and function of its const ituent proteins. Coupling these analyses with the paleontological record can suggest hypotheses for physiological function in families where no function is suggested by any e xperimental work for any of its members. Coupling these analyses with experimental work reconstructing ancestral proteins can identify specific in vitro properties of the protein that are important for its physiological role. Last, evolution-based data structures can be used to organize large seque nce databases in a fashion that allows them to be searched with extreme effici ency. Together, these post-genomic tools can join with classical methods whose value is now becoming widely recognized [Hue97].

PAGE 38

38 Figure 1-1. Predicted surface, interior, and secondar y structure assignments for leptin. S and s, I and i indicate strong and weak surface and in terior assignments respectively. P and p indicate strong and weak parses respectivel y. A "?" indicates that no assignment is made. A "c" indicates that the position is involved in a disulfide bond. Secondary structure was assigned using the method of Benner and Gerl off where positions denoted "?" were permitted to fall in either the surface or interior arc of the helix. Underlined residues are pa rt of parsing strings.

PAGE 39

39 Figure 1-2. Evolutionary tree showi ng the evolutionary history of the leptins. Heavy lines show branches with expressed/silent ratios higher than 2. Hatched lines show branches with expressed/silent ratios from 1 to 2. Thin, solid lines show branches with expressed/silent ratios less than 1. Nu mbers on the lines indicate the ratio of expressed/silent changes for that branc h. The branch lengths do not correspond to either geological time or evolutionary distance.

PAGE 40

40 Figure 1-3. Evolutionary tree showi ng the evolutionary history of the extracellular domain of the leptin receptors. Heavy lines show branches with expressed/silent ratios higher than 3. Hatched lines show branches with expre ssed/silent ratios fro m 2 to 3. Thin, solid lines show branches with expressed/silent ratios less than 2. Notice the greater overall divergence in the leptin receptor than in lep tin itself. Dotted lines show branches with indeterminate ratios. Numbers on the lines indicate the ratio of expressed/silent changes for that branch. The branch lengt hs do not correspond to either geological time or evolutionary distance.

PAGE 41

41 CHAPTER 2 EXTENDING THE PBL METHOD FOR ES TIMATING NUMBERS OF SYNONYMOUS AND NONSYNONYMOUS MUTATIONS TO ACCOUNT FOR AMBIGUITY IN INFERRED ANCESTRAL SEQUENCES Introduction The measurement and estimation of synonymous and nonsynonymous mutations in DNA sequences that code for proteins has proven to be a useful tool for scientists interested in the evolution of protein families and their constituen t molecules. Specifically, comparisons of rates of synonymous and nonsynonymous mutation can be used to detect and differentiate episodes of positive (adaptive) selection within the evolutiona ry history of a group of proteins. Comparisons are typically made between pa irs of uniquely determined, unambiguous, DNA sequences derived from extant taxa, and though a phyl ogenetic tree might be included in an analysis, comparisons are not made with the ancestral sequences that c ould be inferred at internal nodes of the tree via parsimony or another method (e.g., maximum like lihood). The inclusion of these ancestral sequences can greatly enhance an evolutionary an alysis, for it allows the assignment of specific mutations to the branches of the evolutionary tree, and th ese branches represent major evolutionary events (e.g., specia tions and gene duplications). By linking these events to episodes of positive selection we are better enabled to explai n the natural history of a protein family. Such combined analyses have been published by a numbe r of investigators, the first of whom were Messier and Stewart [Mes97]. They used the PBL method developed by Li et al. [Li85] and amended by Pamilo and Bianchi [Pam93] to calculate numbers of synonymous and nonsynonymous mutations between DNA sequences from extant species and intranodal ancestral DNA sequences inferred by the PAML method (a maximum likelihood method) of Yang et al. [Yan97]. Though they do not make note of it in th eir Methods section, they most certainly used the ancestral sequences that were predicted to have the highest probability; that is, for each

PAGE 42

42 internal node of their tree, they used one unambiguous DNA sequence out of a number of possible sequences for their comparisons. This approach discards informationinformation about DNA and amino acid variability between th e pair of sequences being compared and between each sequence and the sequences at daught er nodes. The consequences of this editing depend upon the questions being asked and must th erefore be considered on a case-by-case basis but, in general, we would argue that discarding information about the evolution of these proteins and unnaturally selecting one sequence fr om a population of possible sequences would introduce a bias that diminishes the confidence in any conclusions drawn. Herein we describe a method that accounts for the ambiguity in ancestra l sequences, providing perhaps a more realistic measure of synonymous and nonsynonymous mutations and preserving data that pertains to the natural history of these sequences. An Overview of the PBL Method The PBL method involves the comparison of two DNA sequences, codon by codon, and is performed in the following steps: The average numbers of zero-, twoand f ourfold degenerate sites are calculated. Mutations are counted and classifi ed as transitions or transver sions. Each mutation is further classified as having o ccurred at a site of i -fold degeneracy. KA and KS are calculated from the degene racy and mutation calculations. Figure 2-2 shows five taxa and their DNA seque nces which, for this example, consist of only three nucleotides (one codon) each. The dege neracy of each nucleotide within a codon is determined as follows. Zerofold degenerate site s are sites where changing the nucleotide to any of the other three nucleotides changes the amino acid translation of the sequence. For example, sequence t1 (TCA), translates to the amino acid seri ne. If we change the second position of t1 to A, G or T (giving TAA, TGA and TTA), the tr anslation changes to Stop, Stop and leucine,

PAGE 43

43 respectively (when using the nuclear code for tr anslation). Therefore th e second nucleotide of TCA is classified as a zerofold site. Two-fold sites are sites where two of the three possible changes change the translation. The third positio n of CAA (glutamine) is twofold degenerate, because changing to C or T changes the amino acid to histidine, while changing to G does not change the translation. Four-fold sites are sites where changing the nucleotide to any of the three other nucleotides does not change the amino acid. The third positions of TCA and TCG are each fourfold degenerate. After the degeneracy of each of the nucleotides in each sequence being compared has been tallied, an average is computed. Next, mutations are counted between the pair of sequences. In general, if a mutation occurs between two nucleotides of the sa me chemical classeither the purines or the pyrimidinesthe mutation is called a transition: [A->G, G->A, T>C, C->T]. Mutations from one class to another [purine->pyrimidine, pyrimidine->purine] are called transversions. In the PBL method these classifications hold for all mutations in the vertebrate mitochondria l code, but in the nuclear code and the codes of other organisms some ad hoc adjustments are made. The purpose of these adjustments is to preserve the following rules: All changes at zerofold sites, whether tr ansitions or transversions, are nonsynonymous (expressed) changes. All changes at fourfold site s are synonymous (silent). Transitions at twofold sites are synonymous and transver sions at twofold sites are nonsynonymous. If we compare sequence t1 with t2 we see that A changes to G at the third position. This is tallied as a transition. The change is further classified as having o ccurred at a fourfold degenerate site (Figure 2-6). Following the above rules, the mutation is ultimately classified as a synonymous mutation.

PAGE 44

44 Sequences t1 and t3 have two mutations between them. Rather than simply counting the changes, the PBL method considers all possibl e paths that could be traversed as one codon mutates into another. Figure 2-1 shows the two pa ths that can be taken to get from TCA to CAA. Mutations are counted and classified as descri bed above, but the mutations are then multiplied by a probability determined for each path. A path's absolute probability (in Figure 2-1, pi1 x pi2, where i = path number) is computed by determining the probability of each "leg" of the path, and then taking their product. The leg probabilities ar e determined by assessing the probability of the amino acid change taking place on that leg. The determination of these amino acid transition probabilities is described in Li et al [Li85].; briefly, the probabilities of each amino acid changing into every other amino acid are deri ved from Grantham's 20x20 mutation matrix and partitioned into four categories (synonymous, cons ervative, moderate and radical changes). In Figure 2-1, TCA changes to CCA (serine to prol ine) with a probability of 0.767, which is the value for a conservative change. Synonymy (no change) has a higher probability (2.235) and moderate and radical change s have lower probabilities.1 Once all paths' absolute probabilities have been determined, each path's relative probability can be determined by dividing its absolute pr obability by the sum of all paths' absolute probabilities (in Figure 2-1, Pi, where i = path number). In Figure 2-1, Path 2 (P2) involves a stop codon, so this path is given an absolute and relative probability of zero, giving Path 1 (P1) a relative probability of one (its absolute probability is 0.588). Finally, the mutations occurring along a path are multiplied by their path's relativ e probability, and the sum of mutations is taken by summing the mutations along each path. This pr ocedure is the same for codons with three 1 These values are for mutations under a moderate rate of mutation model. Li et al .[citation] also provides values for a high rate of mutation model.

PAGE 45

45 differences between them, except in such cases th ere are potentially six paths, and each path has three legs (e.g., t2 vs. t3 ). Figure 2-6 shows the results of comparing each of the sequences t1 t2 and t3 with each other. After the average degeneracy of a pair of se quences has been determined, and the number and type of mutations between them has been counted, KA and KS can be calculated. KA is defined as the number of nonsynonymous substitu tions per nonsynonymous s ite in a pair of sequences. KS is the number of synonymous substitutions per synonymous site. The formulae were developed originally by Li et al. and ame nded by Pamilo and Bianchi, and their derivation can be found in these references and Kimura [Li85, Pam93, Li93, Kim80, Kim81]. Extending the PBL Method to Phylogenetic Trees with Inferred Ancestral Sequences: Putting Ambiguity Back into the Equations. Figure 2-2 shows a hypothetical phylogeny for sequences t1 t2 and t3 This tree contains two internal nodes (ancestral sequences) a1 and a2 These sequences are inferred by the parsimony algorithm of Fitch [Fit71]. Figure 23 shows the reconstruc tion of the ancestral sequences at each nucleotide position. Below each nucleotide is a number representing the probability of that nucleotide appearing at its node. These probabilities are computed by a method developed by Fitch. Our method is conducted as follows. The tree is traversed node by node, from top to bottom; beginning with the least ancient nodes an d ending with the root node. At each node, comparisons are made between each node codon(s) and each codon(s) of each descendant. These comparisons are made by Constructing a set of equally parsimonious solutions (EPSs). Determining the relative probability of each EPS. Calculating the average number of zero-, twoand fourfold degenerate sites for each ancestor/descendant pair.

PAGE 46

46 Counting and classifying mutations for each ancestor/descendant pair. Calculating KA and KS. An EPS is defined as three codons [ x y z ] ( x = left descendant, y = ancestor, z = right descendant) that are permitted by Fitch's pa rsimony rules. Each codon is constructed by combining "valid" nucleotides from each of the three parsimonious nucleotide trees (Figure 2-3). The validity of the presence of a nucleotide in an EPS is determined by Fitch's rules for valid linkages. In this example, all linkages are permitted between all of the nucleotides at a2 and a1 in each of the nucleotide trees, so all possible codons that can be built are permitted. Ancestral node a1 has only one EPS, [TCA, TCA, TCG]. Ancestral node a2 has four EPSs, [TCA, TCA, CAA], [TCA, TAA, CAA], [TCA CCA, CAA] and [TCA, CAA, CAA]. The determination of EPSs is required because the identity of a nucleotide at th e ancestral node dictates what nucleotides are allowed at its descendant nodes. Next, the probabilities for each EPS are determined. Since a1 has only one EPS, this EPS is given a probability of 1. The probab ility of each of the four EPSs at a2 is determined by the following method (Figure 2-4). Each EPS has two pa ths, the path between the ancestor and its left descendant and the path be tween the ancestor and its right descendant. Each of these paths has a probability associated with it, PL and PR, respectively. The probability of a path is determined by calculating the proba bility of the amino acid change that occurs along that path. PL for EPS1's left path (TCA->TCA) is 2.235 because there is no amino acid change (synonymy). PL for EPS3's left path (TCA->CCA) is 0.767 because this amino acid change (serine->proline) is conservative. For paths invo lving two or three mutati ons, a variation of the paths method depicted in Figure 2-1 is used. Fo r example, the right path of EPS1 has two mutations. The absolute probability of Path 1 (p11 x p12) = 0.588, and since Path 2 contains a stop codon, its absolute probability is zero. We then compute the av erage probability by summing the

PAGE 47

47 absolute probabilities of all paths and dividing by the number of paths. In this case there is only one path, so its absolute probabi lity is the average probability: PL = 0.588. EPS2 has a probability of zero because it contains a stop codon. EPS3 and EPS4 probabilities are calculated as shown in Figure 2-4. Once the average probabilities of each path in the EPS are calculated, we calculate the product of the probabilities of all nine nucleotides in the EPS (the nucleotide probability product, PN) and then compute the absolu te probability of the EPS by taking the product of the two average probabilities and the nucleotide probability product: PLxPRxPN. We can then determine the EPS's relative probability by dividing its absolu te probability by the sum of all EPSs' absolute probabilities (Figure 2-4). We include PN in the calculation because it is a measure of the probability of the codons in the EPS. Without PN, an EPS's probability would only be a measure of the probability of its amino acid changes, so if a particular EPS contained conservative mutations and/or synonymy, for example, it would be given a high relative probability regardless of the probability of its individual codons existing. We then use the EPSs and their probabilities to calculate the average number of zero-, twoand fourfold degenerate sites between each seque nce pair. As an example we will consider the a1 / a2 pair (the left path of the EPS set for node a2 ). For EPS1, we compare TCA with TCA. Codon TCA has two zerofold sites, zero twofold sites and one fourfold site ([2,0,1]). The average sites are then (( [2,0,1] + [2,0,1])/2) PEPS1 = [0.818, 0, 0.409]. For EPS2 we have [0,0,0] because of the stop codon. Sites for EPS3 and EPS4 are similarly computed. The sum of sites for all EPSs is [2, 0.204, 0.796] (Figure 2-7). Counts and classifications of mutations for each sequence pair are then determined. We will continue using the a1 / a2 pair as an example. For EPS1 we tally [[0,0,0],[0,0,0]] for TCA vs. TCA ( [[a,b,c],[x,y,z]]: a = zer ofold transitions, b =

PAGE 48

48 twofold transitions, c = fourfold transitions x = zerofold transversions, y = twofold transversions, z = fourfold transversions). For EPS2 we do not count any mutations because of the stop codon. For EPS3 we count [[1,0,0],[0,0,0]] PEPS3 = [[0.183,0,0],[0,0,0]] and for EPS4 we count [[1,0,0],[1,0,0]] PEPS4 = [[0.409,0,0],[0.409,0,0]]. These procedures are repeated for each codon in each pair of sequences at each node in the tree (Figure 2-7). Finally, KA and KS. are computed. We wrote a computer program that implement s this method. Over a course of years the program was written and rewritten in a numbe r of different programming languages to suit whatever task was at hand. In Appendix B we pr esent a version written in Java 1.1, used in the context of the MasterCatalogan interactive database containing the contents of GenBank reorganized as a collection of clades, or familie s, consisting of independently evolving protein fragments, or modules [Ben00]. The extended PB L method described herein was engineered to measure KA and KS values for pairs of modules within each family of modules, including reconstructed ancestral modules at nodes within the phylogenetic tree representing the natural history of the module family. Prior to that project we coded this method and applied it to all of GenBank. This work is described in Chapter 3.

PAGE 49

49 Figure 2-1. Two possible ways to get from TCA to CAA. Formulae for path relative probabilities, and numbers of transitions and transversions adjusted by multiplication by path probabilities.

PAGE 50

50 Figure 2-2. A hypothetical phylogeny for sequences t1 t2 and t3 Ancestral sequences a1 and a2 inferred by parsimony.

PAGE 51

51 Figure 2-3. Parsimony for each of the nucleotides in sequences t1 t2 and t3 The numbers below each nucleotide represent the probability of that nucleotide being at its node per Fitch's method.

PAGE 52

52 Figure 2-4. EPSs for ancestral node a2 Each EPS has a probability for its left and right path, PL and PR. PN is the product of the probabilities of each nucleotide in the EPS (i = 1, 2 or 3 where 1 refers to the left descendant, 2 refe rs to the ancestor and 3 refers to the right descendant. j = nucleotide position). The rela tive probability of a path is equal to the quotient of its absolute probability and the su m of absolute probabilities for all EPSs (x = number of EPSs).

PAGE 53

53 Figure 2-5. Sequences t1 t2 and t3 numbers of degenerate sites, and average degeneracies of each pair of sequences.

PAGE 54

54 Figure 2-6. Numbers of transi tions and transversions for each pair of sequences.

PAGE 55

55 Figure 2-7. Tallies of average degeneracies and mutations for each pair of sequences. Fractional degeneracies and mutations are results of multiplication of observed numbers with computed probabilities for EPSs.

PAGE 56

56 CHAPTER 3 THE ADAPTIVE EVOLUTION DATABASE (T AED): APPLICATION OF THE EXTENDED PBL METHOD TO A LARGE DATASET Background The growth of gene and genomic databases provides motivation for developing tools to extract information about the function of a protei n from sequence data, with the ultimate goal of understanding the collection of functions represente d in an organisms genome [Lib01]. Work on molecular evolution over 30 years has shown that such questions must be phrased carefully, and always with cognizance of the Da rwinian paradigm that insists that the only way of obtaining functional behavior in living systems is thr ough natural selection s uperimposed on random variation in structure [Ben88b]. A behavior is functional if the organism would be less able to survive and reproduce if that behavior were di fferent. An amino acid residue is functional if, upon mutation, the organism is less ab le to survive and reproduce. A long literature has sought to interpret the evoluti onary behavior of protein sequences, in the hope of drawing inferences about the relationship betwee n fitness and sequence [Kim82]. What has emerged is the recognition that a family of orthologous proteins displays a diversity of structure and a corresponding dive rsity in behavior, where some of the behavioral differences have a strong impact on fitness (are functional), and others are neutral (o r nearly so). Without resolving, in a general way, quest ions regarding the relationship (neutrality versus selection) between fitness and protein sequence, we can bu ild interpretive tools that capture information from patterns of evolution of genomic sequenc es that is informative about functionin particular, events that are characterized by th e biological scie ntist as a change in function. For a protein to change its function it must cha nge its behavior; this in turn requires that it change its amino acid sequence. A protein being recruited for a different function over a very short time (geologically speaking) frequently experiences an episode of rapid sequence

PAGE 57

57 evolution, an episode where the number of amino acid substitutions per unit time is large. Therefore, molecular evolutionist s have long been interested in the rates at which substitution accumulate in protein sequences. These rates ar e known to vary widely in different protein families. Calculating rates in the units of substitutions /time requires knowledge of the geological dates of divergence of protein sequences. Becau se geological times are frequently not known (and almost never known precisely), alternative approaches for identifying episodes of rapid sequence evolution have been sought. One of thes e examines nucleotide substitutions. It divides the number of nucleotide substitutions that change the sequence of the encoded protein (nonsynonymous substitution) by the number of nuc leotide substitutions that do not change the sequence of the encoded protein (nonsynonymous s ubstitution), and then normalizes these for the number of nonsynonymous and synonymous sites. This is the KA/KS ratio [Li85, Pam93, Li93]. High KA/KS ratios for reconstructed ancestral episodes of sequen ce evolution are hypothesized to be signatures of positive adaptation, and have been associated with significant changes in function [Tra96, Mes97]. In general, KA/KS values are low. For example, the average KA/KS value in proteins between rodents and primates is 0.2 [Mak98]. This is taken to indicate that most of these proteins, selected over millions of years, attained an optimum function prior to the divergence of rodents and primates. This implies that s ubsequent evolution wa s conservative; most nonsynonymous mutations were detrimental to the fitness of the organism. Functional change can be defined as mutation th at alters organismal fitness and is subject to selective pressure. For an example of in traspecific variation, phos phoglucose isomerase in montane beetles shows adaptation to local temper ature variations [Dah00] Orthologous proteins

PAGE 58

58 also suffer positive selection. For example, the hemoglobin in the bar-headed goose has undergone adaptive change relative to the hemogl obin from the closely related greylag goose in response to a reduced partial pressure of oxygen at high altitudes [Zha96] Adaptive evolution is also believed to be displayed by paralogous mamm alian MHC class I genes and relate to a birthand-death model of ge ne duplication [Nei97]. Traditionally, positive selection is defined by a KA/KS rate ratio significantly greater than unity. While 0.6 < KA/KS < 1 can occur by relaxation of functi onal constraint, the theoretical cutoff of one is well known to miss significant functi onal changes in proteins for several reasons [Cra99]. Long branches can dilute an episode of positive adaptation (with KA/KS >1) with episodes of conservative evolution. KA/KS values can miss positive selective pressures on individual amino acids because they average events over the entire protein sequence. Behavior in a protein can change significantly if only a few amino acids change while the remainder of the sequence is conserved in order to retain core behaviors of th e old and new functions (e.g., the protein fold). These adaptive events will only be detected on sufficiently short branches which pinpoint the adaptive change. Alternative ways of identifying KA/KS values below unity that are suggestive of adaptive evolution involve comparison of these values for an individual branch of a tree with those values for branches in the tree gene rally. If one branch has a KA/KS value far outside of the norm for the family (but still below one), we can guess that this branch represents an episode of positive selection. This will work for gene families that generally display conservative evolution (such as the SH2 (Src homology 2) domains) [Wig98], bu t not for others. For example, many immunesystem genes show a much more continuous distri bution of values, which may indicate that they are perpetually under different amounts of positive selective pressure [Nei 97]. In this case, the

PAGE 59

59 designation of a cutoff value of KA/KS, below which two homologous genes have the same function, and above which they have different functions, is arbitrary. Ultimately, this level should be determined by benchmarking adaptivity with specific functions and specific protein folds. KA/KS rate ratios are well known to be useful starting points for gene rating stories about the interaction between protei n sequences and the Darwinian processes that shape these sequences. These stories help us understand how these sequences co ntribute to the fitness of the host. This means that biologists would find useful a comprehensive database of examples where KA/KS values are high. Most useful would be a database that presents families where KA/KS is greater than one, and a separate family where KA/KS is greater than some arbitrary cut-off less than one, but still relatively high compared to the average value in the average protein. We report here such a database, The Adap tive Evolution Database (TAED). TAED is designed to provide, in raw fo rm, evolutionary episodes in specific chordate and embryophyte (flowering plants, conifers, ferns, mosses and liverworts) protein families that might be candidates for adaptive evolution. TAED contains a collection of protein families where at least one branch in the reconstruc ted molecular record has a KA/KS value greater than unity, or greater than 0.6. The second cut-off is arbitrary, chos en to be high relative to the average KA/KS value for the average episode of evolution in a protei n family. Empirically, the lower cut-off seems to admit additional examples of gene families that might have undergone adaptive evolution. TAED should be used as a raw list of potentially adaptively evolving genes for experimentalists seeking gene families to study in further detail, and for bioinformaticists interested in studying large datasets of examples of genes with high KA/KS ratios.

PAGE 60

60 Results The MasterCatalog [Ben00] is a database of 26,843 families of protein modules generated from an all-against-all search of GenBank release 113. A prot ein is broken into independently evolving modules on the criterion of the presence of a subsection of a gene as a complete open reading frame (ORF) in another species. Pair s that were within 180 PAM (point accepted mutation) units with a minimum length requirem ent were grouped into the same family. Each family contains an evolutionary tree and a multip le sequence alignment. This database was the starting point for the ex haustive calculation of KA/KS ratios. The MasterCatalog (MC) is different, both in concept and execution, from other resources (e.g., Hovergen [Dur94], Pfam [Bat 00] and COGs) that offer databases of protein families. The MC incorporates reconstructed ancestral states wi thin its data structure, in addition to multiple sequence alignments (MSAs) and evolutionary tr ees. Having these reconstr ucted ancestral states provides a value to the database, especially for functional interpretation, that is not offered by databases that contain only trees, or only MSAs or only trees and MSAs. Further, because the MasterCatalog is explicitly developed as a tool for doing functional genomics relying on reconstructed intermediates, and as the informati on about function is extracted from analysis of patterns of variation and conservation in genes and proteins within a family, it emphasizes the generation of high-quality trees, MSAs and reconstructed ancestral states. For this reason, the MC does not attempt to build superfamilies (lik e Pfam does). Instead, it constructs families, where the trees, MSAs and ancestral states are not compromised by poor gap placementa common problem in Clustal-based MSAs of se ts of highly divergent protein sequences. Alternative methods were considered for reconstructing ancestral sequences. Whereas maximum likelihood methodologies perform bette r in some situations, they are too computationally intensive to a pply exhaustively. Further, they are based upon an explicit model

PAGE 61

61 of evolution that may not be appropriate al ong all branches analyzed, a situation where maximum parsimony may outperform maximum likelihood on some branches [Pag98]. Therefore, to generate the initial version of th is database, more computationally simple methods were used. As improved methodologies are deve loped, these will undoubtedly be applied to recalculate this database. Two issues concerned the scope of the KA/KS analysis. First, we were concerned that silent positions would be saturated with substitutions, rendering the KS measurements meaningless. Whereas reconstruction back to the last comm on ancestor of chordates or embryophytes with no intermediates frequently bears the signature of synonymous position equ ilibration, synonymous position saturation can be avoided if individual branches are shor ter than the period required for saturation to occur ( t to saturation of approximately 120 million years). Saturation was measured through the examination of the extent to which two-fold redundant codon systems had reached equilibration [Pel00]. Branches that showed equilibration greater than five half-lives towards saturation were excluded from TAED on the basis of di fferences between reconstructed ancestral sequences at the beginning of branches and sequences at the end. A second significant problem is that of short branches bearing fractional mutations. These are known to generate KA/KS values with large errors. To prev ent these errors from biasing the database, a new simple robustness test was impl emented to ensure that an interesting KA/KS value (one above the cut-off) was not recorded in the database if it became uninteresting (below the cut-off) through the shift of a single mutation reconstr ucted in the branch. The test modified the KA/KS calculation in a simple way, as described below: Modified KA/KS = KAmod/KSmod, where KAmod = (number of nonsynonymou s 1)/total nonsynonymous sites KSmod = (number of synonymous + 1)/total synonymous sites

PAGE 62

62 In general, the smaller the difference between KA/KS and KAmod/KSmod, the more significant or robust the branch. To exclude short branches with fractional mutations (arising through ambiguous ancestral sequence reconstruction) without excluding other short branches, branches with KAmod/KSmod values below 0.5 were excluded from the database. Of 5,305 families of modules containing chorda te proteins, 280 contained at least one branch with a KA/KS value greater than one, representi ng 643 branches emanating from 63 different nodes of the tree of life. Some 778 families had at least one branch with a KA/KS value greater than 0.6, totaling 2,232 bran ches emanating from 92 nodes of the tree of life. Thus 15% of all families of chordate modules are likely to have modified their function at least once during the course of evolution. Of 3,385 families of modules representing em bryophyte proteins, 123 have at least one branch with a KA/KS value greater than one, representing 227 families emanating from 25 nodes. Some 407 families had at least one branch with a KA/KS value greater than 0.6, totaling 1,105 branches from 43 nodes. Here, perhaps 12% of all embryophyte families have modified their function along at least one branch. This result based on ancestral sequence reconstr uction contrasts greatly with the result of Endo, Ikeo and Gojobori, where the search fo r gene families undergoing adaptive evolution yielded only two families [End96]. They compared extant sequences rather than reconstructed evolutionary intermediates, counted families onl y where a majority of the pairs were at high KA/KS values, and used a smaller database. A list of candidate protein module families that have undergone modification of function is available at [Lib03]. The version described here is designated TAED 2.1 and will remain

PAGE 63

63 available at this site. As more sophisticated methods are devel oped and applied, as correlations with functional and structural databases are purs ued, and as data from other types of evolution beyond coding sequence evolution is added, links to these datasets will be provided. TAED 2.1 contains two image-mapped trees (for chorda tes and embryophytes); where the node that an adaptive branch emanates from can be clicked on to obtain a list and Ma sterCatalog reference number. Multiple sequence alignments and phylogenetic trees corresponding to these entries can be obtained from EraGen Biosciences WI. Discussion This study represents the first comprehensive analysis of KA/KS rate ratios throughout the Chordata and the Embryophyta. Although the methods utilized were rough and designed to give a quick snapshot into a global picture of evolution, the TAED, as a raw resource, should be valuable in the analysis of much of chordate evolution. Functional genomic s analyses of many of the protein families that have undergone recruitm ent and functional change within the past 500 million years will soon emerge. Many of the episode s of functional change recorded in TAED can be correlated with events in the geological or paleontological record in response to changing environments, evolving paleoecology or the development of new physiology. Gene families may display evolutionary episodes with high KA/KS values, and therefore appear within TAED, for several possible reasons For example, branches resulting from gene duplication events that give rise to paralogs wi th very different behaviors will presumably have high KA/KS values, as will orthologous pairs from species that place very different demands on their function. This search was done without disti nguishing paralogs from orthologs, and the user of TAED should be careful in the analysis of specific families in recognition of this fact. Because there is no reliable true set of protein families known to have undergone functional adaptation, it is not pos sible to score the results of th is tool. It is important to

PAGE 64

64 remember that a Darwinian definition of func tion differs from the f unctional annotation of genomes, and it is possible for a protein to alter or change its function wh ile retaining the same annotation. To examine this da taset, specific proteins must be examined individually. Individual examination is likely to be pr oductive, however. Many protein families already believed to be candidates for functional recruitment appear on the list. These include plasminogen activator in vampire bats which is expressed in saliva and involved in blood clotting [Bod97], phospholipase A2 in snakes which is expressed in venom and involved in tissue damage [Nak95] and MHC proteins in mammals, which are involved in the immune system as part of the host-parasite arms race [Hug88], all having obvi ous explanations of why they may have undergone functional change. Several families ar e newly identified as being candidates for functional change, such as the previously pr oposed obesity protein leptin in primates. A third category of discovery in TAED is in th e detection of episodes of adaptive change at new points in the divergent evolution of proteins, for example myostatin in the Bovidae [Lee99]. Table 3-1 is a sample table from TAED repres enting bovids. These are th e candidate genes that were identified as showing rapid sequence evoluti on emanating from this node in the tree of life. They potentially include orthologs between two species of bovids, pa ralogs, alternatively spliced, transcripts and intraspecific evol ution. The genes on the list have roles in the immune system, body musculature and reproduction, tr aits frequently under selectiv e pressure. These examples and many others are candidates for further expe rimental study through cloning from additional species and through functional st udy in laboratories expert in the particular protein. Conclusions From a phylogenetic perspective, the knowledge of candidate genes evolving at the same time in the same organism can allow one to begin to ask if entire pathways or phenotypic functions are under selective pressu re at particular points in evol utionary history. Where tertiary

PAGE 65

65 structures for the proteins exist, mutations al ong branches can be mapped onto three-dimensional structures first to evaluate the validity of speci fic examples, and second, to understand the nature of adaptive evolution at a structural level. One analysis of TAED indicates that among branches with KA/KS rate ratios > 1, only 3% of synonymous sites had mutated compared with 10% on the average branch in the database. This is consistent with the no tion that episodes of adaptive ev olution can be lost in long branches, as these are combined with prior and/ or subsequent episodes characterized by lower KA/KS rate ratios characteristic of functional constancy. As more genes are sequenced from more species, the greater articulation of trees will not only increase the accuracy of sequence reconstructions, but will also allow us to detect new examples of functional change that are buried in long branches. At a biological level, the data base created here can be mined to provide global pictures of how evolution has occurred. Correlation of data in this database with th at in other functional databases will enable a leap from genotype to organismal phenotype. Further, the dataset provides a resource for experimentalists in terested in specific genes. The high KA/KS rate ratio in leptin in a branch connecting primates with rode nts may have been a useful predictor of change of function for pharmaceutical companies interest ed in the mouse model of leptin for human obesity. For the experimentalist, mutations occu rring along putatively adaptive branches can be assayed for functional importance in systems of interest. Finally, this database repres ents a growing framework for the study of adaptive evolution. As datasets become available, changes in gene expression, alterna tive splicing patterns, imprinting patterns, recombination events and other molecular mechanisms of adaptation will be added to this database in a phylogenetic perspective. The ultim ate goal is a dynamic resource

PAGE 66

66 depicting candidate molecular events that ar e responsible for phenotypic differences between closely related species. Materials and Methods Starting with the MasterCatalog [Ben00] (ver sion 1.1 derived from GenBank release 113) KA/KS rate ratios were reconstructed database -wide for each ancestral branch in every evolutionary tree cont aining genes from the Chordata and the Embryophyta This analysis was restricted to these organisms because there is less evidence for codon and GC-content biases which complicate the accurate calculation of KS. The MC uses multiple sequence alignments generated from Clustal-W and neighbor-joining trees, both deri ved from protein sequences. Because the MC is based on an analysis of nuclear families, rather than extended families, these inexpensive tools generate acceptable multiple sequence alignments. KA/KS values were calculated for branches on an evolutionary tr ee between nodes using the method of Li and Pamilo and Bianchi [Li85, Pam93, Li93] modified to allow full treatment of probabilistic ancestral seque nces [Ben98]. Reconstruction of ancestral sequences was done using the Fitch maximum parsimony methodolo gy [Fit71]. While reconstructed ancestral sequences contain ambiguities, us ing probabilistic ancest ral sequences takes this into account (by weighting ambiguous positions according to their pr obabilities) and allows us to construct a model of evolutionary history that is robust. Two cut-offs were used to identify interesting values for the KA/KS rate ratio, one and 0.6. Separate database s were constructed for each cut-off. The resulting dataset is freely availabl e for further analysis [Lib03].

PAGE 67

67 Table 3-1. A sample listing from TAED cont aining candidate adaptively evolving genes detected that emanated from the Bovidae node. Genes with KA/KS > 1.0 Additional Genes with KA/KS > 0.6 1. T-cell receptor CD3 epsilon chain from MasterCatalog family 9668 10. Interleukin 2 receptor from MasterCatalog family 974 2. AF092740 cytotoxic T-lymphocyteassociated protein 4 precursor from MasterCatalog family 9698 11. Interleukin-3 from MasterCatalog family 9775 3. CD5 from MasterCatalog family 9700 12. AF019622 myostatin; growth/differentiation factor-8; GDF-8 from MasterCatalog family 20325 4. AF110984 intercellular adhesion molecule-1 precursor from MasterCatalog family 9802 13. Fas gene product from MasterCatalog family 21743 5. Interferon alpha/beta receptor-2 from MasterCatalog family 9817 14. Calpastatin from MasterCatalog family 21751 6. AF020508 pregnancy-associated glycoprotein 6 from MasterCatalog family 15612 15. Prolactin receptor from MasterCatalog family 21853 7. MCH OVAR-DQ-ALPHA1 from MasterCatalog family 15669 16. Pre-pro serum albumin from MasterCatalog family 21864 8. Major histocompatibility complex class II from MasterCatalog family 21739 17. Immunoglobulin gamma-1 chain from MasterCatalog family 21881 9. T-cell receptor gamma from MasterCatalog family 21940 18. AF110984 intercellular adhesion molecule-1 precursor from MasterCatalog family 21997 These examples potentially include or thologs between different species of Bovidae paralogs, alternatively spliced cDNAs with potentially di fferent functional effects and intraspecific modifications.

PAGE 68

68 CHAPTER 4 DETECTING COMPENSATORY COVARIATION SIGNALS IN PROTEIN EVOLUTION USING RECONSTRUCTED ANCESTRAL SEQUENCES Introduction The evolution of protein seque nces is nearly always desc ribed using one of several stochastic models for the accumulation of am ino acid replacements [Ben97, Fuk02]. These are captured in algorithms known by widely recognized names (e.g., the Needleman-Wunsch [Nee70], Smith-Waterman [Smi81], and Felsen stein maximum likelihood [Tho92] tools). These tools have become more sophisticated in recent years as mathematicians have "inched towards reality" [Tho92] in their mathematical m odeling well supported by a rich background in statistics, theorems, and proofs. Nevertheless, patterns of replacement predicted by these mathematical models remain quite different from the patterns that are actually obs erved in proteins diverging under functional constraints [Ben97, Gau01]. The reason for these differences is well und erstood. Briefly, simple stochastic models treat proteins as if they were linear strings of letters. In reality, proteins have three dimensional structures that support behavior s that are important for them to contribute to the fitness of the host ("function"). These behavior s are not a linear sum of the behaviors of their parts. Amino acid replacement is therefore constr ained in a way unanticipated for a linear string of letters. The differences between mathematically conve nient models and the reality of organic chemistry need not be paralyzing, however. Fi rst-order stochastic treatments of protein sequences can provide "null" hypo theses, statements about how pr oteins would behave if they were formless, functionless strings of letters. The difference between how proteins actually divergently evolve, and how firs t order treatments model their divergent evolution, therefore contains a signal about fold and function [Ben97].

PAGE 69

69 Analyses of this signal have been remarkably productive. They provide practical tools for predicting the folded conformation of proteins from sequences [Ben97, Ro s94], and many useful approaches for extracting functional informa tion from genomic sequence data [Ben98, Ben00, Lib01]. Accordingly, one goal of contemporary computational biol ogy is to extend the concepts and tools needed to extract signa ls concerning structure and functi on from features of divergent evolution that do not meet the expecta tions of simple stochastic models. Perhaps the most serious approximation made by first order stochastic models is their treatment of individual positions in a protei n sequence as independently evolving entities. Virtually all of these tools analyze sequence divergence one position at a time under a model where position i in a sequence suffers replacement independently of position j Even models that recognize that different sites may have different mutability (gamma models) treat sites as being independently evolving [Miy95]. This is, of c ourse, extremely convenient for any statistical model for protein sequence divergence as it en ables the probability of a sequence alignment overall to be calculated as the product of probabilities calculated for indivi dual positions in the alignment. Even casual inspection of a multiple sequence al ignment, however, shows that positions in a protein sequence do not suffer replacement independently. Replacements in positions adjacent in a sequence are strongly co rrelated [Coh94]. The correlatio n almost certainly reflects functional constraints on replacement set within th e context of a folded structure. Therefore, it has proven to be useful, in particular, for pred icting the three dimensiona l structure of protein folds [Ben97]. Position pairs distant in the sequence but near in space in the three-dimensional fold might also be expected to suffer replacement in a co rrelated fashion [Alt87, A lt88]. Many classes of

PAGE 70

70 these can be envisioned. For example, a "big-for-small" replacement at position i might be compensated by a coincident "small-for-big" replacement at position j to conserve overall size in the packed core of the fold, and therefore conser ve a functional behavior related to packing (the stability of the fold). Alternatively, a "posit ive-for-negative" charge replacement at position i might be compensated by a "negative-for-positive" charge replacement at position j to conserve overall charge, and therefore conserve a f unctional behavior related to net charge. Behind this expectation stands a model based on the neutral theory of evolution [Kim83]. Under this model, amino acid replacements that dramatically alter the ph ysical property of the side chain (e.g., its size or charge) will disrupt th e performance of a protein already optimized to contribute to the fitness of the host organism. The fitness value of the protein can be restored only by a second change that compensates for the first alteration in physi cal properties. In the language of neutral theory, we would say that the first replacement was selectively disadvantageous, the second was positively selected (in the context of the first), and both together lead to a result that is neutral. Analyses of compensatory replacement have found practical application. For example, a pair of compensatory changes in the protein kina se family underlay the successful prediction of the antiparallel sheet in the fi rst kinase domain [Ben91], contradi cting a prediction of a parallel sheet based on motif analysis [Ste84]. This example represents the first example where compensatory covariation analysis was a key to a bona fide prediction of protein structure. In another case, Cohen and his coworkers presen ted an elegant example of compensatory replacement in phosphoglycerate kinase [Goh00] supporting the suggestion that correlated change in the evolutionary history of two protei n families might be taken as evidence that the two proteins operate together. More generall y, several laboratories have suggested that

PAGE 71

71 compensatory covariation might be used to de tect incorrect folds in a structure prediction environment [Olm99, Che97, Che98]. Unfortunately, comparison of a ny two sequences diverging at n sites generates n ( n -1)/2 pairs of candidate sites holding coincident replacements. These might be compensatory; they might be coincidental. Only a fraction of these will be truly compensatory, meaning that either replacement alone would have been rejected by natural selection wit hout the other. Most replacements are presumably not compensated, eith er because they are neutral (implying that no other changes are needed to prevent their havi ng a negative impact on functional behavior), or because they have a positive impact on fitness ( implying that no other changes are desired to neutralize their impact on functional behavior). Thus, while such compensation is easily rec ognized when analyzed in the context of a known crystal structure (the sites suffering compen satory replacements presumably are near in space), it is difficult to identify th e pairs of sites that might suffe r compensatory changes without a crystal structure. As a consequence, the comp ensatory covariation signal, represented by the number of pairs of replacement that are causa lly interrelated (where either alone would be rejected by natural selection) relative to the n ( n -1)/2 pairs that arise whenever two protein sequences differing at n sites are compared, is weak, at leas t as it has been ge nerally calculated [Shi94, Gob94, Neh94]. Some have suggested that the compensatory replacement signal may never be broadly useful for this reason [Tay94] Others have suggested that the signal might have value, especially if it can be strengthened. A variety of laboratories have explored approaches to strengthen the compensatory replacement signal [Olm99, Che97]. For example, the signal for ch arge compensation appears to be stronger than the signal for size compensation, suggesting that it might be useful to analyze

PAGE 72

72 different types of coincident replacement separa tely. Further, compensatory covariation signals appear to be strongly dependent on the evol utionary distance between the two sequences, measured in PAM units (the number of point accepted mutations per 100 amino acids) [Day78]. For example, charge compensatory changes were found to be more prominent at PAM 25 than at PAM 100 [Che97]. These observations are not surprising given th e model. Charge reversal changes might be expected to be the change most in need of compensation. Likewise, at longer PAM distances, pairwise compensation might be obscured beneath compensation arising from replacements at multiple sites. This work suggested a general strategy to strengthen the signal for compensatory replacement. At the core of this strategy is th e recognition that compensatory replacements can be calculated in two ways. In the first, two exta nt protein sequencesseque nces that are found in organisms living todayare examined. As exta nt sequences are at the "leaves" of an evolutionary tree, we call these "leaf-leaf" comp arisons (Figure 4-1). In the past, virtually all compensatory substitution has been sought via "leaf-leaf" comparisons. An alternative approach is possible. Give n a set of homologous protein sequences, a multiple sequence alignment, and an evolutionary tree interrelating them, the sequences of ancestral proteins represented by nodes in the tree can be approxi mated using well-known heuristics. Given reconstructed se quences of ancestral proteins at nodes in an evolutionary tree, compensatory covariation can be sought usi ng "node-leaf" and "node-node" comparisons. As compensatory signals are stronger when th e sequences being compared are separated by shorter distances [Che97], and as the distance be tween two nodes in a tree must be shorter than the distance between the leaves on the tree (Figure 4-1), node-node and node-leaf compensatory

PAGE 73

73 signals must be stronger than th e leaf-leaf signal that contains them. Phrased differently, the number of replacements between average nodes is smaller than the number between average leaves. This means that the n ( n -1)/2 total number of pairs is smaller, implying that the truly compensatory pairs, those driven by a fitness constraint, will be less obscured by the background of uncompensated events, when they are id entified on individual branches of a tree. Further, although the reconstr ucted ancestral sequences ar e probabilistic, and cannot be proven to be correct, even crude heuristics for reconstructing ancestr al sequences localize specific changes to specific regi ons of a tree better than if no reconstructions are done at all. Thus, even poor reconstruction heuristics permit us to focus on briefer episodes of time during which two compensatory changes might have occu rred than is possible by leaf-leaf comparisons. Last, node-leaf and node-node comparisons mode l the actual evoluti onary events that might contain the compensatory signal better than leaf-leaf comparisons. The statement that "position i and position j should suffer replacement in a compensatory way" is equivalent to the statement that if either position i or position j suffer replacement individually, the host organism is less "fit" (in a Darwinian sense). This, in turn, implies that if position i suffers a replacement, then position j is under "positive selective pressure" to suffer a replacement. Conversely, this means that a replacement at position j will be fixed in a population faster than expected for a neutrally drifting position. This means that true compensatory changes will normally occur on the same branch of the tree, even on a very short branch of the tree. This study examined the feasibility of det ecting node-node and node-leaf compensatory covariation signals by examining reconstructed evol utionary events within families of proteins. Results A total of 71 families of proteins were examined in this study. For each family, a multiple sequence alignment and a phylogenetic tree we re constructed using Clustal-W [Tho94].

PAGE 74

74 Reconstructed ancestral sequences at nodes throu ghout the tree were generated using the Fitch method as described in Methods [Fit71]. We then sought charge reversal replacements in these families. For each family, positions were identified where an amino acid replacement cau sed a charge reversal between two ancestral sequences (representing a replacem ent event that occurred on a branch between two nodes) or between a node sequence and a leaf sequence. Fractional changes were included. Each was associated with a particular branch of the tree. From these, pairs of replacements displaying charge compensatory behavior were collected, as described in Methods. A pair of replacements was defined as being charge compensatory if they were coincident (both occu rring on the same branch of the tree), if they each individually would reverse a charge, and take n together, if no change in the overall charge of the protein resulted from the two. A three dimensional crystal structure for a memb er of the protein family was then extracted from the PDB and an estimate was made for the st rength of the signal. In making this estimate, we assumed that only proximal charge compensatory changes, near enough in space in the folded structure that the two side chains could intera ct coulombically, could be functionally correlated. We therefore tabulated the distan ces between the sites in pairs th at suffered charge compensatory replacements. Histograms were then constructed to show the distance distribution of pairs suffering compensatory replacements. As a standard, a di stance distribution for all pairs of sites was calculated for the proteins. If the number of charge compensatory pairs was located disproportionately more in proximal positions than th e average pair (where the square root of the number of pairs was a crude measure for signifi cance), a signal was cons idered significant.

PAGE 75

75 As previous studies predicte d, leaf-leaf comparisons found an only barely perceptible charge compensatory signal (Figure 4-2). That is, pairs of positions suffering charge compensatory replacement in two extant sequences were not much more likely to be near in space than the average pair. The analogous analysis for each evolutionary branch in the tree, however, identified a signal that was clearly perceptible (Figure 4-3), even by eye. This signal was then analyzed. Charge Compensation and the Surface of the Folded Protein Surprisingly, the initial analysis (Figure 43a) showed that position pairs suffering nodenode or node-leaf compensatory charge replaceme nt were more likely to lie both proximally and distally in the fold, when compared with the av erage distance between all position pairs in the proteins. That is, pairs of positi ons near in the fold displayed charge compensatory covariation more frequently than the average pair, as expe cted should charge compensatory replacement be functionally correlated. But pairs of positions di stant in the fold also displayed charge compensatory covariation more than the average pair. Distal charge compensatory replacement s uggested two explanations. First, overall net charge might be a selected trait in a protein. It is conceivable, for example, that a constant isoelectric point is desired by na tural selection. If so, reversing a charge at one position would require an adaptive replacement leading to the o pposite reversal somewhere (anywhere), even at a second position distant in the fold from the first. Alternatively, charge changes are more likely to be tolerated by natural selection, and therefore are more likely to be observed, if they occur on the surface of the protein. The mean distance between a pair of surface residues is grea ter than the mean distance between the average pair of residues. Therefore, charge changes ar e more likely by chance to occur in pairs more

PAGE 76

76 distant than the average pair if they occur pred ominantly in positions on the surface, whether the pair is under direct selection or not (i.e., if the replacement is neutral). These considerations suggested that the dist ance between the averag e position pair might not be the most informative reference distribu tion for these studies. In stead, an alternative reference distribution was calculated (blue bars, Figure 4-3b) for the distance between pairs of surface residues in the model proteins. This new distribution fit nicely th e distal portion of the distribution of distances betw een position pairs displaying charge compensatory replacement. The result implies that the apparently high occu rrence of charge compensatory replacement in distant pairs of residues reflects simply the greater likelihood that charge reversal replacements occur on the surface of the protein. This surface pair reference distribution does not, however, fit the proximal portion of the distribution. The probabi lity that a pair of positions displaying charge compensatory replacement is near in the folded structure is co nsiderably higher than expected (Figure 4-3b). This represents the "signal" in the charge compensatory pattern of replacement. When a side chain changes its charge, that ch ange is more likely to be accompanied by a compensatory change in the charge of anot her side chain near in space to the first. Charge Compensation in Both Contiguous and Non-Contiguous Position Pairs We then explored several features of this si gnal. First, we asked whether the signal arises only in positions that were also nearby in the polypeptide sequence (contiguous pairs), or whether it was also observed in residue pairs > 4 amino acids distant (those of i i + q relationship, where q > 4) in the sequence (non-contiguous pairs) A strong signal was also observed for noncontiguous pairs (Figure 4-3c,d) as well as for contiguous pairs. This implies that a charge compensatory replacement signal ar ises when two residues are near in space as a consequence of the tertiary fold, as well as if they are near in space because they are near in the sequence.

PAGE 77

77 Enhancing the Charge Compensation Signal These results showed that node-node and node-l eaf analyses generated a more perceptible charge compensatory covariation signal than leaf-leaf analyses. Nevertheless, the signal remained small. We considered several explanations for the sma ll size of the signal. First, we considered a case where four sites suffer charge reversal repl acements in a single evol utionary episode. Site a suffers a (+ to -) replacement compensate d by a (to +) replacement at site b Site c suffers a (+ to -) replacement compensated by a (to +) replacement at site d Sites a and b are proximal; sites c and d are proximal. Compensatory changes at sites b and d are required to maintain a functional protein given changes at sites a and c respectively. Two pairs that do not represent adaptively significant compensation ( a d and b c ) arise in addition to the two pairs that do ( a b and c d ). This is simply another way of saying th at changes that need compensation are more noticeable when the total number of changes is small. Therefore, we asked whether the signal was st ronger if the only branches examined were those holding exactly one pair of charge compensatory replacements (3e,f for all pairs and noncontiguous pairs, respectively). The like lihood that two positions undergoing charge compensatory replacement are near in space was indeed more perceptible after we excluded all of the events occurring on branches where more than two positions suffere d charge reversal. The number of cases was small, however (57 fo r all pairs, and 48 for non-contiguous pairs, respectively), and the plots accordingly displa yed substantial variances. As the number of sequences in the database grows, trees will b ecome more articulated, individual compensatory events will be more likely to be isolated from othe rs on a single branch of the tree, and the signal should strengthen.

PAGE 78

78 Charge Compensation in Specific Secondary Structural Elements We then examined more closely the contiguous position pairs ( i i +4 or nearer) displaying charge compensatory covariation. The most stri king feature of the charge compensatory signal within contiguous pairs was its de pendence on the nature of the sec ondary structural element that held those positions (Figure 4-4). In 98% of th e cases where the positions showing compensatory replacement had an i i +4 relationship, one or both of the re sidues was found in an alpha helix. In only 1.6% of these cases was one of the residues found in a beta strand, and in none of the cases were both found in a strand. This was significantly la rger than the 47% of the position pairs with an i i +4 relationship having one or both residues in a helix found in the dataset as a whole (Table 4-1). Conversely, in 49% of the cases where the position pair showing compensatory substitution had an i i +2 relationship, one or both of the residues was found in a beta strand. Two alternative explanations can be proposed to account for the secondary structure bias. First, surface helices present residues to solvent (water) once every 3.6 turns. Surface residues are expected to be less constrai ned by function from diverging (that is, single replacements are more likely to be neutral), and are more likely th an the average residue to suffer charge reversal substitutions. Therefore, the abundance of compensatory changes occurring with i i +3 or i i +4 relationship in the sequence might simply reflect unconstrained (i.e., neutral) charge reversal at surface positions. Alternatively, loss of a coulombi c interaction between residues i and i +3 or i +4 might lead to a protein less able to contribute to the fitness of the host. This view implies that once a charge reversal substitution occurs at position i position i +4 is under sufficient positive selection to acquire a compensating char ge reversal substitution. To explore these alternatives, we sought exam ples of anti-compensatory charge reversals, where a charge reversal (+ to -, for exampl e) was accompanied by another charge reversal

PAGE 79

79 substitution in the same direction (+ to -) (F igure 4-5). The histogram showing the distance distribution in pairs suffering compensatory co variation is shown in Figure 4-5. Here, the distribution of distances between position pairs carrying charge anti-compensatory substitution was not noticeably different from the distributi on of distances between all surface position pairs (Figure 4-3b). Notable is the absence of an in creased probability of pr oximal anti-compensatory pairs (compare with Figure 4-3). As a predictive tool, this e nhanced charge compensatory signal may prove to be most valuable in secondary structure prediction. The accuracy of a predicti on made on the relative positions of a compensatory pair is extremely high. The "coverage" is low, however. Only a few dozen examples were observed; only 6.8% of th e helices contained within the 71 test families have one or more charge compensatory position pa ir. This number will, of course, grow as the size of the protein families grows with the increasing size of the genomic database (Table 4-2). Charge Compensation in Buried Residues The high dielectric constant of water is known to weaken coulombic interactions between charged species. We therefore asked whether a stronger signal might be found in position pairs where one or more of the side chains was burie d. Figure 4-6a shows the distribution of surface accessibility calculated for all charged amino acids (DEKR, or Asp, Glu, Lys, and Arg) (blue), those that participate in a position pair suffe ring charge compensatory substitution (red), and those that participate in a positi on pair suffering charge anti-compen satory substitution (green). Not surprisingly, most buried charged residues do not suffer charge reversal within the test set. Compensatory changes near in space (Figure 4-6b ) were slightly more likely to be found in positions that were more buried, a modest sign al that suggests that compensation is more necessary in partly buried site s shielded from the high dielect ric constant presented by the solvent water. This is the case for those re sidue pairs in protei n kinase [Ben91] and

PAGE 80

80 phosphoglycerate kinase [G oh00] where charge compensatory s ubstitution has had predictive value. Discussion Explicit reconstruction of ances tral proteins has been shown to provide insight into the structure and function of prot ein families, both when done in silico [Fit71, Mes97, Tra96] and when recombinant DNA technology is used to re surrect ancestral proteins from extinct organisms so they can be studied in th e laboratory [Ben88a, Mal90, Sta90, Jer95]. The work reported here applie s ancestral reconstructions towards a new goal. We suggest several conclusions. First, nodenode and node-leaf comparisons be tween reconstructed ancestral sequences provide a stronger compensatory covaria tion signal than the leaf-leaf comparisons that have been used previously in the search for a compensatory covariation signal. These results are consistent with the model outlined in the Introduction, which holds that amino acid replacements that dramatically alte r the physical property of the side chain will disrupt the performance of a prot ein to an extent that requires a compensatory change elsewhere in the protein structure a signifi cant number of times. The fitne ss value of the protein can be restored only by a second change that compensate s for the first alteration in physical properties. In the language of neutral theo ry, we would say that the firs t replacement was selectively advantageous, the second was positively selected (in the context of the fi rst), and both together are neutral. Thus, while the two compensatory su bstitutions taken together might be regarded as collectively neutral, the second mutation in a truly compensato ry pair is viewed as positively adaptive in this model. These results also suggest that as the databa se grows, the charge compensatory signal will become more perceptible as more sequences ar e added to each family. More sequences mean more highly articulated evolutionary trees. This, in turn, means that compensatory events will

PAGE 81

81 become better isolated on specific branches, pr eventing the "spurious" signals that arise when more than one pair of compensatory even ts occurs along a specific branch on a tree. The stronger signal will undoubtedly find use. Predicting secondary structure using contiguous pairs of compensatory changes is one. It remains to be seen, however, how much data in a family are required for the signal to be useful to support de novo assembly of a protein fold in a prediction setting. Accounting for the Stronger Signal From Node-Node Comparison The tools that we have presented make th e compensatory covariation signal more perceptible. Isolation of truly co mpensatory pairs on shorter branch es of a tree away from other changes is, we believe, sufficient explanation for th is effect. A tree with shorter branches implies a smaller n the total number of differences between the two protein sequences being compared, diminishing the number of n ( n -1)/2 pairs behind which the compensatory signal might be obscured. While it is possible in principle to constr uct a statistical model that permits double replacements to be evaluated, this requires a high degree of empirical parameterization. For example, nearly a decade ago, investigators collecte d the parameters needed to build a statistical model that concerned double replacements at adjacent sites [Coh94]. Some 220 parameters are formally required in this exercise a large number by most measures. For the purpose of this work, a signal is consider ed to be significant if it lies two standard deviations outside of the fluctuation expected for n sites at a specified distance. This can be estimated by the square root of the number of s ites. By this measure, all of the results reported here are strongly significant, with the exception of those reported in Figures 4-3e,f and Figure 44.

PAGE 82

82 A Model-Independent Method to E valuate an Evolutionary Tree The strength of the compensa tory covariation signal undoubted ly depends on the degree to which the trees and the reconstructed ancestral se quences accurately reflect the history of the family. If the branching of the tree or the rec onstructed sequences themse lves are not correct, a pair of charge compensatory replacements that ar e coincident, in fact, may not be assigned to the same branch of a tree. In this case, th e signal from this pair will be lost. Getting the branching correct in an evolutiona ry tree is a difficult problem. Part of the difficulty arises because of the trade-off betw een the accuracy of the tree and the cost of generating it. For example, the Clustal-W [T ho94] and Fitch parsimony tools used here are relatively inexpensive methods for reconstructing trees and ancestral sequ ences. Clustal-W uses a neighbor joining tool [Sai87] based on estimat es of the distances between sequence pairs derived from the Kimura empirical formula [Kim83]. Ancestral seque nces reconstructed by parsimony are well known to be sensitive to incorre ct branching topology. This is the principal error associated with the choice of th is inexpensive reconstruction tool. More sophisticated methods, including maxi mum likelihood methods, are expected to provide better trees, at least given the first or der stochastic models. These are expected to generate ancestral reconstructions that are more robust to erro rs in tree topology. They are, however, more expensive. Even the more expensive tools do not guarantee a correct tree, of cour se. In practice, the approximations made in the model (see Introduction) may create systematic error larger than fluctuation error. To date, the only way to benchmark a tree requires knowledge of the evolutionary history of the sequences in quest ion [Hil94], or a reconstruction of a simulated evolutionary process [Tak00]. The first is difficu lt to get for sequences emerging from natural

PAGE 83

83 history. The second requires a mathematical mode l for evolution, which is often the same one that is used to construct the tree in the first place. Here, the compensatory covariation signal, extracted from reconstructed ancestral sequences, may provide a metric for the qual ity of a tree based on organic chemistry, independent of any mathematical model for evolu tion. Hypothetically, the best tree should be the tree that places compensatory replacements truly driven by natural selection on the same branch. This requires the construction of a tree that reflects the actual e volutionary history. This, in turn, implies that the tree has the most compensatory co variation is the tree that is most likely to reflect the actual history. To illustrate this application, consider four hypothetical proteins, just four amino acids in length, having the sequences ALKD, MVKD, ALER and MVER. Exactly two topologies exist for unrooted trees that relate th ese four sequences (Figure 4-7). Both reconstructions have two ambiguous sites in both ancestors. In Topology I, the first two positions are ambiguous; in Topology II, the last two positions are ambiguous. Bo th trees require four "homoplastic" events (independent mutations that cause sequence c onvergence). Both trees require exactly six changes. Classical parsimony therefore ranks these two topologies as equally likely. The two topologies are different however, with respect to the extent to which charge changes are compensated. In Topology I, a char ge altering replacement is 100% likely to be compensated. In Topology II, however, a charge altering replacement is only 50% likely to be compensated. This is illustrated in Figure 4-7 by writing out four trees, each equally likely, that carry reconstructions that the ambi guities require. If we postulate that compensatory covariation is maximized, then Topology I is preferred over Topology II.

PAGE 84

84 Conversely, an analogous logic can be used to assign preferred ances tral states involving charged residues. For Topology I, the ancestral states involving charged residues are fixed. For Topology II, the preferred ancestral sequen ces are in reconstructions IIa and IIb. This metric can be applied even if no crystal structure is available fo r a protein family. If, however, a crystal structure is available, then (as a practical matter) one would maximize the number of proximal charge compensatory ch anges when identifying the preferred tree. It will require much future work with many families to determine how useful the metric will be. Worth noting at this point, however, is that th is metric is rooted in principles of structural biology (that is, organic chemistr y), not in a mathematical form alism. Further, the proposed metric values changes at position i in light of changes at position j Thus, this metric for evaluating the quality of a tree is fundamentally different from any metr ic based on first order stochastic analyses of pr otein sequences, which treat replacements at site i and site j as independent. Darwinian Requirements for Compensatory Covariation Even given these results, and the evidence that the charge compensato ry substitution signal can only become stronger as the database grow s, it remains inescapab le that the charge compensatory signal is weak, perhaps even weaker "than expected. What might be the scientific implications of this observation? Charge compensatory covariation might be w eak because the coulombic interactions being sought may themselves be largely unimportant to the selective fitness of proteins. Gaining or losing them, in this view, has insufficient impact on fitness to ensure that natural selection will require compensation, and thereby prevent uncomp ensated charge reversals from entering the global proteome. This implies a limit to the tool generally, one imposed by the physical organic chemistry of folded protein sequences.

PAGE 85

85 An alternative explanation should be consider ed, however. Observation of a compensatory pair of substitutions implies, under the neutral theory, that natural se lection preserved some global feature of a protein dur ing the episode represented by the branch between two nodes. This, in turn, implies some degree of constancy in the behavior of the protein before and after the episode where compensatory change has occurred. In this view, compensatory replacement should be observed only in protein families whose behavior must remain largely constant duri ng this branch. This, in turn, implies that compensatory covariation should be observed on ly during episodes where "function, defined as the behavior that contributes to fitness, is largely conserved. In the language of the neutral theory, the demand for compensation arises because the protein is optimized at the beginning of the episode for fitness, the same behaviors ar e optimal at the end of the episode, and any replacements occurring during the episode must have the net (and, if necessary, combined) impact of being neutral with respect to their impact on selected behavior. This implies, of course, that when functional behavior is changing, th ere may be no need to compensate individual replacements in a sequence, even those that reverse charge. Indeed, an uncompensated change may be more likely to gene rate a protein with different behaviors, whose (now) different behaviors contribute most to the (now different) requ irements for fitness. In this view, compensatory covariation should not be observed, or should be observed less frequently, whenever functional behavior is changing. In this view, compensatory covariation is scar ce because branches of an evolutionary tree where functional behavior is rigorously conserved are scarce This is, of course, a controversial suggestion. Many computational biol ogists treat homologous proteins in distinct organisms as if they were "the same protein, and neutral theory remains the majority view of protein sequence

PAGE 86

86 evolution. In contrast, recent work in these laboratories and elsewher e has suggested that functionally significant divergence in behavior is frequent, and ma y be the rule more than the exception. For example, it is almost certainly observe d in elongation factors, regarded as some of the most functionally conserved pr oteins in the biosphere [Gau01]. Given this observation, compensatory repl acements may become a powerful tool in functional genomics to detect episodes where f unction is, and is not, c onserved. A branch that has more compensatory replacement is more lik ely to represent an episode where functional behavior is constant than one w ith less compensatory replacement. This is relevant to the issue of "annotation transfer" in comparative proteomics. Annotation transfer assigns the function of a new protein by identifying in the database a homologous protein for which the function is known, and tran sferring annotation descri bing that function to the annotation for the new protein. Annotation tr ansfer assumes that function does not change within a set of homologous pr otein sequences as they diverg e [Heg01, Fet98]. This assumption has long been known to be poor in many proteins, including many characterized before the dawn of the age of the genome [Ben88b]. To date, several tools have been suggested to detect functional change. One of these is to measure KA/KS values for branches of an evolutiona ry tree between rec onstructed ancestral sequences at the nodes of the tree [Mes97, Li85] Here, compensatory changes would indicate functional constancy, while uncompensated cha nges would indicate functional change. Because compensatory analysis rests on protein sequences, while the KA/KS value requires measurement of silent substitution rates, and because silent substitution rates are frequently rather high, this metric for functional recruitment may ultimately prove to be more valuable than KA/KS ratios, especially for deeply branching sequences.

PAGE 87

87 Methods The core of this study exploited the PDB "s elect 25" subset of proteins [Hob94]. Each protein in this database was matched against the proteins contained in SWISS-PROT (version 33) [Bai91]. The older sequence dataset was ch osen to avoid under-annotated sequences, in particular, pseudogenes that might be divergen tly evolving without f unctional constraints. Families that contained at least 12 members, wh ere the maximum evolutionary distance between any pair of sequences in the family was be tween 50 and 120 PAM units, and where the family had at least two subfamilies defined at PAM 20 with four or more members, were retained. These criteria, made to ensure a balanced tree, were satisfied by 71 families. The sequences within each family were ali gned using the MultAlign package [Kor00] with the option PROB from Darwin system [Gon91, Ben93]. The gap sh ifting heuristic was applied iteratively until the overall alignm ent score ceased to improve. Secondary structure assignments were extracted from the crystallo graphic data using DSSP [Kab83]. An evolutionary distance matrix and a phyl ogenetic tree were computed for each family using Clustal-W [Tho94], which employs a neigh bor joining method using distances derived from the Kimura empirical formula [Kim83]. Br anches with negative lengths were ignored. Probabilistic reconstructed ances tral sequences at nodes in th e tree were then built using the Fitch parsimony method [Fit71]. For those si tes where parsimony did not generate a single assigned residue in an ancestral sequence, fracti onal probabilities were assigned to each of the contender amino acids were assigned by the st atistical method described by Fitch [Fit71]. The statistical method was used to assign probabilitie s to each possible path between the residue at a site at an ancestral nodes resi due (either a single amino acid or a set of possible amino acids each with an assigned probability) and its two des cendant residues (either a residue in an extant

PAGE 88

88 species' sequence (a node-leaf comparison) or a residue (or set of possible amino acids) in another ancestral species (a node-node comparison)). Using probabilistic reconstructed ancestral sequences computed in this way, charge compensation was sought on individual branches of the evolutionary trees. For each branch in each tree, the sequences at the flanking nodes we re compared (residue by residue) to identify single substitutions that revers ed charge. A position was retained if and only if one of the following transition probabilities (K->E, K->D R->E, R->D, E->K, E->R, D->K, D->R) was > 0.2. If more than one probability was > 0.2, then both transition probabili ties were retained for that position. Coordinates locating the position of C-beta were then extracte d from the reference crystal structure with the PDB. When the residue wa s Gly, the position of the beta carbon atom was inferred from the position of the backbone atoms. A Perl script was written to permit calculation of the distances between the beta carbon atoms for each pair of amino acids retained. This generated the distances reported in the Figures. The accessible surface area (ASA) was computed using the DSSP program [Kab83]. A list was then made of all pairs of positions having compensatory charge reversals for each branch of the tree. These were define d to be "position pairs" undergoing charge compensatory covariation. The distance between the beta carbon atoms of the position pairs was inferred from the positions of those two carbon at oms in the reference crystal structure. The pairwise distance was then calculated for all ami no acids in the reference crystal structure, and then for the distance between pairs of residue s on the surface of the reference structure.

PAGE 89

89 N1L1L2N2L3L4N5N3L5L6N4L7L8N6N7Sequences of extant proteins at the leaves of the tree leaf-leaf comparison reconstructed sequences at nodes in tree node-node comparison Figure 4-1. A leaf-leaf comparison (red) traverses more evoluti onary distance than a node-node comparison (blue).

PAGE 90

90 Figure 4-2. Attempted detection of charge comp ensatory covariation si gnal using leaf-leaf comparisons. (a) Distribution of distances in 71 protein families between position pairs displaying charge compensatory substitution (a positive-for-negative substitution at position i and a negative-for-positive substitution at position j ) (red bars), using as a reference (green bars) the distribution of all pairwise distances in the proteins. The x -axis is in angstroms; the y -axis is frequency, with each distribution normalized to unity. Note the absence of considerably taller red bars at short distances. Total observations in sample: 13,460. (b) As in (a), but using as a reference curve the distances between all pairs of surface residues (blue bars, > 50% exposure, calculated by normalizing its accessible su rface area (ASA) by the standard ASA of the residue. ASA was computed with th e DSSP program. Total observations in sample: 13,460. (c) As in (a), but consider ing only non-contiguous pairs, those where i and j are separated by more than four positi ons, with reference to a distribution of distances between all pairs of residues in the reference crystal structures (green bars). Total observations in sample: 12,811.(d) As in (c), but with reference to the distribution of distance betw een surface pairs (blue bars ). Total observations in sample: 12,811.

PAGE 91

91 Figure 4-3. Detecting charge compensatory cova riation signal using explicitly reconstructed ancestral sequences. (a) Distribution of distances in 71 protein families between position pairs displaying charge compensatory substitution (a positive-for-negative substitution at position i and a negative-for-positive substitution at position j ) (red bars), using as a reference (green bars) the distribution of all pairwise distances in the proteins. The x -axis is in angstroms; the y -axis is frequency, with each distribution normalized to unity. Note the greater height of the red bars at bot h short distances and at long distances. Total observations in sample: 803.3 (note fractional number reflecting fractional character assignments in ancestral states; precision is less than implied by the decimal). (b) As in (a), but using as a reference curve the distances between all pairs of surf ace residues (blue bars, > 50% exposure, calculated by normalizing its accessible surface area (ASA ) by the standard ASA of the residue. ASA was computed with the DSSP program. Note the greater height of the red bars at short distances only. This is the compensato ry covariation signal Total observations in sample: 803.3. (c) As in (a), but cons idering only non-contiguous pairs, those where i and j are separated by more than four positions, with reference to a distribution of distances between all pair s of residues in the reference crystal structures (green bars). To tal observations in sample: 745. (d) As in (c), but with reference to the distribution of distance between surface pairs (blue bars). Total observations in sample: 745.5. (e) Distance dist ribution of charge compensatory pairs where only one such pair occurs on a specifi c branch of the evol utionary tree between reconstructed ancestral nodes. Total obs ervations in sample: 57.2. Note the small sample size. (f) Same as (e), but where th e position pair is separated by more than four positions in the linear sequence. Total observations in sample: 48.3. Note the small sample size.

PAGE 92

92 Figure 4-4. Predicting secondary st ructure using contiguous pairs of compensatory changes. The relative sizes of the pies indicate the re lative numbers of examples of each pair. Where position pairs are separated by 1-5, or 6 positions, the likelihood that residue pairs are both found in helices (red), both f ound in strands (dark green), one in a helix and the other in a coil (pink), one in a strand and the other in a coil (light green), one in a helix and the other in a strand (violet), and both found in coils (yellow). Note that if two charge compensatory subs titutions are observed with a i i +4 relationship, one or both are 98% likely to lie in a heli x. The total number of observations; i i +1 = 12.82; i i +2 = 6.47; i i +3 = 11.38; i i +4 = 23.37; i i +5 = 4.79; i i +6 = 6.48. Note fractional values and the sm all size of these samples.

PAGE 93

93 Figure 4-5. Distribution of distan ces between charge anti-compe nsatory pairs (red bars) (a) relative to all pairwise distances (green ba rs) and (b) relative to pairwise distances between all pairs of surface residues (blue bars, > 50% exposure). Notable is the absence of the increased probability of pr oximal anti-compensatory pairs (compare with Figure 4-3). The total number of obs ervations in both di stributions = 793.3.

PAGE 94

94 Figure 4-6. Surface accessibility of charged residues, and charged residues participating in a charge compensatory event.(a) A frequency versus accessibility distribution for all Asp, Glu, Lys, and Arg (DEKR) residues (b lue), those participating in a charge compensatory (red) and anti-compensatory (g reen) events. Bars of each color sum to unity. Buried DEKR residues are less likely than average to suffer charge reversal substitution. Compensated char ge reversal substitutions ar e slightly more likely than anti-compensated charge reversal substitutions. The total number of observations in sample: anti = 2,812.4; comp = 2,792.2; DEKR = 3,530.0. (b) A frequency versus accessibility distribution showing the frequenc y of compensated charge reversals for proximal pairs near in space (less than 12 di stant, red) in the folded structure, and distal pairs distant in the folded structure (more than 12 distant, green). A pair of charged compensatory substituti ons is slightly more likely to be near in space if the side chains involved are more buried. The total number of observations in sample: near = 367.1; far = 795.7.

PAGE 95

95 Figure 4-7. A schematic illustration of the use of co mpensatory covariation to select a preferred tree from two equally parsimonious trees. The two tree topologie s relating the four sequences (ALKD, MVKD, ALER, and MV ER) each require six changes. The changes are marked on individual branches, with fractional changes arising from the ambiguity in the ancestral sequences. The ancestral sequences are placed at the nodes in the tree, with ambiguous sites (by pa rsimony) noted by placing the two possible residues above and below a horizontal line. For each topology, iden tical trees holding all four possible ancestral sequences are shown. Each, by parsimony, has equal likelihood (0.25 for each).In Topology I, the ancestral sequences are ambiguous at the first two positions. In Topology II, these are ambiguous at the last two positions. Both trees require the same amount of ho moplasy (convergence). Classical parsimony analysis is indifferent with respect to the two topologies. In Topology I, however, the likelihood that a charge reversal is co mpensated is unity. In Topology II, the likelihood that a charge altering repla cement is compensated is only 0.5. Thus, Topology I is preferred if compensatory c ovariation is maximized. This criterion is independent of mathematical formalisms us ed to construct the tree. Further, the metric weights changes at position i depending on events at position j making this metric for evaluating a tree fundamentally di fferent from any metric based on a first order stochastic analysis of protein sequences ALKD MVKD ALER MVER AVER AVKDK3E D4R V2L A1M V2L A1MALKD MVKD ALER MVER --ER --KDK3E D4R 0.5 V2L 0.5 M1A 0.5 A1M 0.5 L2V V2L A1M AL MV MV ALALKD MVKD ALER MVER MLER MLKDK3E D4RALKD MVKD ALER MVER ALER ALKDK3E D4R L2V A1MA1MALKD MVKD ALER MVER MVER MVKDK3E D4R L2V M1A L2V M1A L2V M1A V2L M1A V2LTopology I a b c d charge changes compensated charge changes not compensated no charge changes ALKD ALER MVER MVKDA1M V2L E3K R4DALER MVERR4DALKD ALER MVER MVKDA1M V2LALKD ALER MVER MVKDA1M V2L K3EALKD MVKDK3E D4RALKD ALER MVER MVKDA1M V2L 0.5 K3EAL-MV--0.5 D4R KD ER 0.5 E3K 0.5 R4D 0.5 E3K 0.5 R4D 0.5 K3E 0.5 D4R KD ERALKD ALER MVER MVKDA1M V2L K3E R4DALKR MVKRK3E R4DMVED ALEDD4R E3K D4R E3K E3K D4Rab c d Topology II charge changes compensated charge changes not compensated no charge changes

PAGE 96

96 Table 4-1. Frequencies of the average conti guous position pairs participating in a helix versus strand. Relationship between the pair in the protein sequence Pair i i +1 i i +2 i i +3 i i +4 i i +5 i i +6 EE 0.185 0.146 0.119 0.099 0.087 0.079 EC 0.077 0.148 0.192 0.219 0.234 0.239 strand 0.262 0.294 0.311 0.318 0.321 0.318 HE 0.003 0.011 0.022 0.033 0.044 0.055 HH 0.307 0.274 0.243 0.223 0.208 0.194 HC 0.072 0.132 0.184 0.214 0.233 0.249 helix 0.382 0.417 0.449 0.470 0.485 0.498 CC 0.356 0.289 0.241 0.212 0.195 0.185 Total 1.00 1.00 1.00 1.00 1.00 1.00 Total sample 12.8 6.5 11.4 23.4 4.8 6.5 EE, both positions involved in a charge compensatory event found in strands; EC, one in a strand and the other in a coil; Strand: HE, one in a helix and the other in a strand; HH, both in helices; HC, one in a helix and the other in a coil; Helix: CC, both in coils. Calculated from the test set of reference crystal structures using DSSP. Note th e small number of observations in each sample, and the fractional number of changes aris ing from the parsimony reconstructions.

PAGE 97

97 Table 4-2. List of 71 protein families used in this analysis ID L N Protein name 121P 166 60 H-Ras P21 Protein 193L 129 43 Lysozyme 1AAF 55 38 Hiv-1 Nucleocapsid Protein 1AAK 152 20 Ubiquitin Conjugating Enzyme 1ARS 396 28 Aspartate Aminotransferase 1ATP(E) 350 20 c-AMP-Dependent Protein Kinase 1BET 107 14 Beta-Nerve Growth Factor 1BP2 123 64 Phospholipase A2 1CCR 112 93 Cytochrome c 1CPC(B) 172 44 C-Phycocyanin 1CYO 93 17 Cytochrome B5 (Oxidized) 1DLH(A) 180 28 Hla-Dr1 (Dra, Drb1 0101) Human Class II Histocompatibility Protein (Extracellular Domain) 1DLH(B) 188 45 Hla-Dr1 (Dra, Drb1 0101) Human Class II Histocompatibility Protein (Extracellular Domain) 1EFT 405 57 Elongation Factor Tu (Ef-Tu) 1FRP(A) 335 18 Fructose-1,6-Bisphosphatase (D-Fructose-1,6-Bisphosphate1-Phosphohydrolase) 1FVL 70 29 Flavoridin 1FXI(A) 96 69 Ferredoxin I 1GDD 353 51 Guanine Nucleotide-Binding Protein G(I) 1GP1(A) 198 16 Glutathione Peroxidase 1HAR 216 36 HIV-1 Reverse Transcriptase (Amino-Terminal Half) 1HCN(A) 92 26 Human Chorionic Gonadotropin 1HDG(O) 332 109 Holo-D-Glyceraldehyde-3-Phosphate Dehydrogenase 1HGE(A) 328 70 Hemagglutinin 1HLE(A) 345 14 Horse Leukocyte Elastase Inhibitor 1HMT 132 16 Fatty Acid Binding Protein 1HPM 386 128 44K Atpase Fragment (N-Terminal) Of 7Okda Heat-Shock Cognate Protein 1HRA 80 42 Retinoic Acid Receptor 1HRY(A) 76 43 Human Sry 1HTB(A) 374 56 Beta-3 Alcohol Dehydrogenase 1HTM(D) 138 92 Hemagglutinin Ectodomain (Soluble Fragment, Tbha2) 1HUR(A) 180 35 Human Adp-Ribosylation Factor 1 1HUW 166 31 Human Growth Hormone 1HVD 319 23 Annexin V 1IRK 306 17 Insulin Receptor (Tyrosine Kinase Domain) 1ITG 166 42 Hiv-1 Integrase (Catalytic Domain) 1IVD 388 23 Influenza A Subtype N2 Neuraminidase (Sialidase) 1LDM 329 27 M4 Lactate Dehydrogenase 1MHC(A) 282 143 Mhc Class I Antigen H2-M3 1MLS 154 74 Myoglobin 1NDH 272 27 Cytochrome B5 Reductase 1NHK(L) 144 26 Nucleoside Diphosphate Kinase 1NIP(A) 289 35 Nitrogenase Iron Protein 1OCT(C) 156 33 Oct-1 (Pou Domain) 1OSA 148 64 Calmodulin 1PBX(A) 143 56 Hemoglobin 1PDN(C) 128 28 Prd Paired Domain 1PLQ 258 13 Proliferating Cell Nuclear Antigen 1PVC(2) 271 24 Poliovirus Type 3, Sabin Strain 1REC 201 21 Recoverin 1SXC(A) 151 51 Superoxide Dismutase 1TGX(A) 60 51 Toxin gamma 1TIV 86 24 HIV-1 Transactivator Protein

PAGE 98

98 Table 4-2. Continued ID L N Protein name 1TPH(1) 247 35 Triosephosphate Isomerase 1YTB(A) 180 18 TATA-Box Binding Protein 1ZAA(C) 87 18 Zif268 Immediate Early Gene 2BTF(A) 375 121 Beta-Actin-Profilin Complex 2CPL 165 35 Cyclophilin A 2GDM 153 23 Leghemoglobin 2HMX 133 40 Human Immunodeficiency Virus Type 1 Matrix Protein 2HPE(A) 99 36 HIV-2 Protease 2REB 352 58 Rec A Protein 2TGI 112 18 Transforming Growth Factor-Beta Two 3RUB(S) 123 79 Ribulose 1,5-Bisphosphate Carboxylase/Oxygenase (Form III) 4ENL 436 35 Enolase 4FGF 146 16 Basic Fibroblast Growth Factor 4GCR 174 31 Gamma-B Crystallin 4MT2 62 50 Metallothionein Isoform II 4RHV(1) 289 24 Rhinovirus 14 4RHV(3) 236 24 Rhinovirus 14 7RSA 124 48 Ribonuclease A 8CAT(A) 506 34 Catalase ID, Protein Data Bank identifier, protein subunit in parentheses; L, chain length; N, number of sequences in multiple sequence alignment. .

PAGE 99

99 CHAPTER 5 THE PLANETARY BIOLOGY OF CY TOCHROME P450 AROMATASES Background The emergence of complete genomes for many organisms, including humans, has created the need for hypotheses concerning the function of specific genes that encode specific proteins [Gau04]. While function is in terpreted by different workers in different ways [Bor98], Darwinian theory (by axiom) requires that the term be connected to fitnes s; natural selection is the only mechanism admitted by theory to generate functional behavior in a living system, macro or molecular. This, in turn, implies that the hypotheses about function have a systems component, including the interac tion of the protein with other proteins, their impact on the physiology (defined broadly) of the cell and orga nism, and the consequences of physiology in a changing ecosystem in a planetary context [Ben02]. Systems hypotheses can be supported by information from many areas. Geology, paleontology, and genomics, for example, provide three records that capture the natural history of past life on Earth. At the same time, stru ctural biology, genetics, and organic chemistry describe the structures, behaviors and reactivities of proteins that allow them to support present life. It has been appreciated that a combination of these six types of an alysis provides insights into functional behavior of prot eins that cannot be provided by any of these alone [Ben02]. Over the long term, we expect that the histories of the geosphere, the biosphere, and the genosphere will converge to give a coherent picture showing the relationship between life and the planet that supports it. This picture will be based, however, on individual cases that serve as paradigms for making the connection. The aromatase family of proteins offers an interesting system to illustrate the power of this combination as a way to create hypotheses regard ing protein function within a system [Con01a].

PAGE 100

100 These hypotheses are not proof, of course, but are limiting in genomics-inspired biological experimentation, now that genomic data themselves are so abundant. Aromatases are cytochrome P450-dependent enzymes that use dioxygen to catalyze a multistep transformation of an androgenic steroid (s uch as testosterone) to an estrogenic steroid (such as estradiol) (Figure 5-1). The protein pl ays a key role in normal vertebrate reproductive biology, a role that appears to ha ve arisen before fish and tetr apods (land vertebrates, including mammals) diverged some 375 Ma. [Cal84]. Aromatase is important in modern medicine as well, especially in breast and other hormone-dependent cancers [Wol02]. Different numbers of aromatase genes are f ound in different vertebrates. Two aromatase genes are known in teleost fish [Cal97, Cha97] Only a single gene is known in the horse [Boe97], rat [Hic90], and mouse [Ter91]. Cattle have both a functional gene and a pseudogene built from homologs of exons two, three, five, ei ght and nine of their f unctional gene; these are interspersed with a bovine repeat element [Fur95, Hin93]. In several mammalian species, including humans and rabbits, a single gene yields multiple forms of the mRNA for aromatase in different tissues via alternative splicing [H ar88, Del96, Sim97, Del98]. A still different phenomenology is observed in the pig ( Sus scrofa ). Three different mRNA molecules had been reported in different tissues from pig [Cor95, Con97, Cho96, Cho97a, Cho97b]. Compelling evidence then emerged that the three variants of mRNA identified in cDNA studies arose from three paralogous gene s [Gra00], rather than from a single gene differentially spliced [Cor04]. This implies that th e three aromatase paralogs in pigs arose via gene duplications relatively recent in geologic time. Hypotheses relating the function of the three depend in part on when those duplications took place. If they were very recent, then the three genes might have helped pigs adapt to

PAGE 101

101 domestication. If they pre-dated the divergence of pigs and fish [Cal97], then they may have different roles that are very fundamental to re productive endocrinology in vertebrates. We apply here a series of tools to generate better hypothe ses concerning the aromatase family of paralogs in swine. Results One strategy useful for understanding the f unction of genes correla tes events in their molecular evolution with events occurring in th e history of other genes in the same and/or neighboring lineages, and w ith events recorded in the geol ogical and paleontological records [Ben02]. We incorporated a tool to date the divergence of two or more genes through an analysis of transitions at synonymous sites of twofold redundant coding systems, where the encoded amino acid has been conserved [Ben03]. This an alysis exploits the approach-to-equilibrium kinetic behavior displayed by these sites. The analysis yields a tran sition redundant exchange (TREx) distance for any gene pair where th e synonymous sites ha ve not equilibrated. To calibrate the silent TREx clock, inter-taxa histograms relating pig ( Sus scrofa ) and ox (Bos taurus ) were constructed for transitions at th e silent sites of twofold redundant codon systems where the encoded amino acid was conser ved between the species [Ben03]. The major peaks associated with the separation of these two lineages was observed at f2 = 0.87, corresponding to a TREx distance of kt = 0.332. As the fossil record constrains the date of divergence of these tw o lineages to be 60 5 Ma [Kum 98, Arn98, Foo99], and the codon biases in modern Sus scrofa and Bos taurus project an equilibrium value for f2 = 0.54 [Ben03], the rate constants for transitions at the TREx sile nt sites was estimated to be ca. 2.8 x 10-9 transitions/silent site/year during the time interval that separates these lineages. Analogous f2 values were then obtained for other vert ebrate aromatase pairs, including fish vs. tetrapods ( f2 = 0.56), birds versus mammals ( f2 = 0.612), primates versus ungulates ( f2 =

PAGE 102

102 0.823), and horses versus artiodactyls ( f2 = 0.828). Assuming a time-i nvariant single lineage first order rate constant of 3.6 x 10-9 changes/site/year and an equilibrium f2 = 0.54, the corresponding dates of divergence are calculated to be 435, 258, 67, and 65 Ma respectively, with the oldest dates being the least precise. The last three of these dates of divergen ce are similar to those suggested by the paleontological record [Car88] within the error of the calculation, which reflects the modest number of char acters used to calculate the f2 values. A tree for the artiodactyl lineage was constructed from th e corresponding TREx distances (F igure 5-2). This was found to be consistent with the tree c onstructed from other metrics. The TREx clock is not widely used. It may, however, provide more accurate dates in regions where synonymous transiti ons have not equilibrated th an conventional clocks that combine data from synonymous transitions a nd synonymous transversions, or from nonsynonymous changes. A comparison of different cloc ks is provided in detail elsewhere (data not shown). Briefly, the rate constants for transitions and transversions are mo re different than the two rate constants for purine-pur ine and pyrimidine-pyrimidine tran sitions. Further, nucleotide frequencies can be used to calibrate the e nd equilibrium points for twofold redundant codon systems directly, and this permits an appro ach to equilibrium formalism, well known in chemical kinetics, to be app lied [Ben03, Ari54, Pol98, Pel00]. From the tree, the TREx distances from the an cestor of fetal and placental aromatase to the modern enzymes are 0.113 0.079 (using an end poi nt of 0.54 to reflect equilibration at the silent sites), corresponding to a range in the time of dive rgence of 26-38 Ma. The TREx distances from the divergence of all of the porci ne aromatases and the modern forms ranges from 0.082-0.116, corresponding to dates of divergence in the range of 27 Ma. This suggests that the three aromatase paralogs diverged in the late Eocene to mid Oligocene.

PAGE 103

103 To further correlate the duplication of the ge nes with the fossil record, genomic DNA was analyzed from relatives of Sus scrofa Both peccary and babirusa seminal plasma ( Tayassu pecari from the Center for Reproduction of Endange red Species, Zoologi cal Society of San Diego; Babyrousa babyrussa from the Bronx Zoo, New York) were probed by PCR (Polymerase Chain Reaction) amplification us ing exon four-specific primers [Gra99]. Bands having the sizes expected for the correspondi ng aromatases were observed by agarose gel electrophoresis. Based on sequence similarity, tw o isoforms of aromatase were obtained from both peccary and babirusa as clones derive d from the PCR products (Figure 5-3). This establishes that at least one of the duplications occurred before the Tayassuidae (the peccaries) diverged from the Suidae (the true pigs) ca. 35 Ma [Coo78, For96]. These data are consistent with an evolutionary model that holds that the ancestor of pig and oxen (approximated in the fossil record by Diacodexis from the early Eocene ca. 55 Ma) [Ros96] contained a single aromatase gene, and that the paralogous genes in pig arose some 20 million years later. This suggests that the paralogs in pig can be explained neither in terms of the fundamentals of vertebrate reproductive endoc rinology, nor as a consequence of swine domestication. This does, however, suggest that the emergence of the aromatase paralogs was approximately contemporaneous with the emergence of a litter in the Suoidea larger than that found in the ancestral artiodactyl condition. While ruminant and camelid artiodactyls have only one or two young per litter, suoi ds in general have at least two young per litter (as seen in peccaries) and most suines (tru e pigs) routinely have three or four young (up to 12 in the domestic pig, Sus ). Note that there has long been the tac it assumption that large litters in suoids represent the primitive artiodactyl condition. Large litters are primitive for mammals in general,

PAGE 104

104 and because suoids are plesiomorphic in so me anatomical conditions relative to other artiodactyls (e.g., short legs, rete ntion of four digits, bunodont ch eek teeth), they have been assumed to be plesiomorphic in other respects. Other data suggest that small litters are in fact the primitive artiodactyl condition. Tragulids (mouse deer or chevrotains) are surviv ing small, primitive ruminants that are not too dissimilar from Diacodexis in body form, but they only have one or two young per litter. Additionally, fossil record data on pregnant oreodonts (an extinct group probably related to the ruminant/camelid artiodactyl lineage, but w ith a suoid-like plesiomorphic postcranial morphology) shows that they also only had one or two young [Fra97, Oha30]. A cladogram of the Artiodactyla (Figure 5-4) illustrates the probable acquisition of multiparous versus uniparous reproductive strategies, and places the character of litters with t ypically more than two members emerging just before the divergence of Tayassuidae and Suidae The approximate correlation in time of the aromatase divergence in Suoidea with the enlargement of litters in Suoidea suggests, as a hypothesis, that the two are functionally related. To expand on this hypothesis, we sought genomic signatures of functional change within the aromatase paralogs. The number of nonsynonymous changes in the gene divided by the number of the synonymous changes, normalized for the number of nonsynonymous and synonymous sites (the KA/KS value), strongly suggests functional ch ange when the value is significantly greater than unity [Li85, Mes97], and is also an indicator of hypot hetical functional change when the value is high on a branch of a tree relativ e to other branches on the same tree [Ben00, Yan00, Lib01, Gau03a]. KA/KS values were reconstructed for indivi dual branches of the evolutionary tree derived from the Darwin bioinformatics wor kbench (see Methods) using a distance matrix and ancestral states constructed by the method of Messier and Stewart [Mes97]. The typical

PAGE 105

105 branch in the aromatase evolutionary tree has a value of KA/KS of 0.35. A higher KA/KS value of 0.85 is found in the episodes of evolution near when the pig aromatases diverged. While a KA/KS value of 0.85 does not require the conclusion that positive selectio n occurred during the emergence of these aromatase paralogs, an inference based on the magnitude of KA/KS in one branch, relative to the KA/KS value for typical branches [Ben00, Yan00, Lib01, Gau03a], suggests that adaptive changes o ccurred during the duplic ations of the aromatase genes in pigs. A complete maximum likelihood analysis of the aromatase gene family was performed using the PAUP and PAML programs. The resulting tree, generated in PAUP is shown in Figure 5-5, with parameters estimated using PAML. Once more, the generation of paralogs in the pig was found to have occurred after the divergence of pigs from oxen. A high KA/KS value (0.93) was again found in the divergence of the swine isoforms on the bran ch leading to the ancestor of the placental and embryonic enzymes following th eir divergence from the pig ovarian enzyme. The distribution of substitutions al ong this branch is consistent w ith altered functional constraints for the placental and embryonic enzymes compared with their extinct and extant counterparts (Table 5-1 and Table 5-2) [Nei00]. We correlated the episode of rapid seque nce change during the emergence of the embryonic and placental paralogs with the struct ural biology of aromatase. A homology model of aromatase was built from progesterone 21-hy droxylase from rabbit liver (coordinates from PDB file 1DT6) [Wil00], a homologous cyto chrome P450-dependent monooxygenase. Residues undergoing replacement during the episodes repres ented by branches in Figure 5-5 (branches 13) are highlighted in color on the 3D model using a program in prototype with HyperChem (Figure 5-6).

PAGE 106

106 Multiple features within the pattern of amino acid replacement were apparent. First, the sites accepting amino acid replacements in the branches with low KA/KS values (as represented by branch 2 in Figure 5-5) were typically scat tered without any obvious pattern over the surface of the protein. This is expected for neutral drif t, although an adaptive role for these replacements is not excluded by this analysis. In contrast, the distribution of sites accep ting amino acid replacements during the episode of rapid sequence evolution of branch 1 (as indicated by a relatively high KA/KS value) involving pig paralogs was not random over the protein su rface. Rather, the sites are clustered near the substrate binding pocket, and in a region of the surface believed to contact the co-reductant protein, as identified by mutagenesis expe riments in the homolog [Leh00, Bri98]. The clustering of amino acid replacements near a substrate binding site during an episode of rapid sequence evolution suggests that the su bstrate specificity of the protein might be changing in correlation with a change in the de tailed physiological role of the protein. Recent reports suggest that the substrate and produc t specificities of the placental and embryonic enzymes are indeed different from those of the ovarian enzyme [C or99, Kao00, Con01b, Cor04]. Further, synthesis of estrogen by the ovarian enzy me is more dependent on the structure of the co-reductant than is the placental enzyme [Cor01]. Our in silico analyses rationalize these experimental observations from a structural pe rspective. The coupling of an evolutionary analysis to a crystallographic analysis suggest s that the amino acid ch anges are functionally significant. Discussion Today, natural history holds some of the mo st intellectually challenging conundrums to ever fascinate the human mind. Further, natural hi story offers biological chemists the opportunity to place broad biological meaning on the detailed analysis of the structur e reactivity of isolated

PAGE 107

107 biological molecules studied in a reductionist setting. To do so, how ever, natural history must be connected to the physical and molecular sciences, both in subject matter and in culture. In part to make this connection, natural hist orians have sought to change the research paradigm in their field to favor quantitative da ta directed towards the proof of hypotheses over story telling. Proving hypotheses is difficult in natural history ( pace the philosophical reality that no significant statement in empirical science can ever be said to be proven). The events of interest (such as the exti nction of dinosaurs) are frequently di stant in time, or require a passing of time (as for speciation), making them difficult to reproduce in a laborato ry. The scale of the concepts involved (species, environments, pl anets) also does not lend these concepts to laboratory models and laboratory-co ntrolled tests. Further, a re ductionist approach, even when available, will not necessarily generate data that are relevant to the big issue that concerns the natural historian. The emphasis on data and pr oof has ameliorated the worst excesses of storytelling in natural history, with enormous positive impact. Just as natural historians were purifying th eir field of storytelling, however, whole genome sequences began to emerge. By dramatically incr easing the quantity of chemical data concerning the molecular structures of proteins, genomics changed the limiting steps in biochemical and biomedical research. No longer was the typica l researcher attempting to solve an organic chemical or biotechnological ques tion (What is the sequence of my protein? How do I express it at high levels to get the sequence?) for a protei n that had been selected for functional reasons. Today, the typical researcher knows the structure of many proteins, and wishes to select one for expression and study based on a hypothe sis about its po tential function. Here, the fact that any definition of func tion, which must make reference to fitness, requires some systems, ecological or planetary context, makes the natural historian a natural

PAGE 108

108 source of hypotheses. Their full reductionist armament arium is available in the laboratory to test and explore any hypothesis that th e natural historian might provide. The biomedical researchers may like some guidance from the natural historia n to narrow the broad selection, or to shorten the random walk, if only slightly. For this purpose, the forswear ing by natural historians of st orytelling has come at a most inopportune time. To the modern natural historian, creating hypotheses can easily be regarded as storytelling. They are reluctant to do so, a nd may criticize as atavistic colleagues who do. This has created a vacuum in the scientific community. Very few laboratories exist that can draw upon an expertise in natural history to ge nerate stories that create hypotheses for the researcher working in experimental biochemistry and molecular biology. This work is designed in part to illustrate how this vacuum might be filled. Here, we do not just tell a story based on natural history, or even a st ory based on natural history supplemented with physiology and molecular sequen ce data. Rather, we show how the addition of other data, including data from X-ray crystall ography, can make a story su fficiently rich that it can be viewed as being internally consistent w ith a wide range of independent data drawn from independent sources. This creates a hypothesis that is more than a story, even if it is less than proven. With aromatase, the congruence of our different analyses makes a compelling suggestion that the three aromatase paralogs in pigs arose by two duplication events in the late Eocene or early Oligocene. The emergence of the aromat ase paralogs corresponded approximately in time to the emergence of larger litter size in suines. This implies th at the two duplication events are functionally related to the larger litter sizes. This inference is consistent with the physiological impact of estrogen synthesis by these paralogs in Sus Steroid production by the porcine embryo

PAGE 109

109 is tightly controlled by the transient expression of aromat ase and 17-hydroxylase (P450C17) between days 10 and 13 [Cho97a, Cho97b, Pop82]. In contrast, estrogen synthesis by the equine embryo begins as early as day six and increa ses with embryo age and diameter [Pop82]. The estrogen produced by the pig embryonic aromatase is believed to have an impact on the mobility, spacing, and implantation of the concepti [P op82, Baz82, Gei97, Val98, Wil02]. Adequate spacing would appear to be require d to manage a larger litter. This is consistent with a stru ctural biological analysis that correlates specific amino acid replacements with specific ch anges in the substrate and produ ct specificity of the protein [Gau02]. Interestingly, the substr ate specificity of human aromat ase is reported to be more similar to that displayed by the pig placental enzyme than the ovarian form [Kao00, Con01b]. This is an unexpected similarity given that ou r evolutionary analysis suggests a change in biochemical function along the fetal/placental branch in the Suidae It should be noted that th e hypothesis is supported by the combination of data that individually would not have stre ngth past storytelling. Thus, the KA/KS ratio of 0.93 would not, by itself, compel any particular interpretation. Its implications ar e greater given the relatively low KA/KS ratios of other branches of the tree. But the addition of crystallographic information, itself not compelling, makes a combin ation that is more compelling. Further, this hypothesis generati on itself generates discoveries th at might lead to their own hypotheses. An analysis of the evolutionary br anches separating pigs and humans suggests an additional episode of adaptive change. The branch leading to the ancest or of human aromatase (branch 3) has a remarkably high KA/KS ratio (13 non-synonymous and no synonymous changes) (Figure 5-5). This is a KA/KS ratio greater than unity, and does (pending evaluation of its statistical significance) compel the inference of an episode of adaptive change. Intriguingly,

PAGE 110

110 these changes are also clustered in the same re gions of the structure as those changing along branch 1 leading to the stem fetal/placental en zyme, near the substrate and co-reductant binding sites. This implies that the substrate/product speci ficity of the ancestral aromatase protein was not like that of either the human or the pig placental forms, but ra ther reflects features that arose convergently in these two species [Ste87]. Notably, four of the sites (positions 47, 153, 219, 269) that undergo replacement during the emergence of pig placental aromatase from the la st common ancestor are the same as four that arose in the emergence of the human aromatase from its last common ancestor. Of these, the amino acid replacements are id entical at tw o sites (Thr Met at site 153, His Arg at site 269). The probability associated with randomly observing this pattern is extremely low (0.000021) [Zha97]. An additional site is disp laced by a single position in the sequence alignment (259/260). We hypothesize that these represent an example of adaptive parallel evolution. It is important to point out that even an anal ysis this broad is likely to cover only a small part of a complicated reproductive endocrinology that must be associ ated with larger litter sizes. For example, the exact nature of the products produced by individual aromatases remains controversial, and may be different in laborat ory studies depending on th e conditions where they are studied [Cor99, Van89, Gar91, Hof91]. This is especially the case with the 19nortestosterone derivatives in Figure 5-1. Further, an elegant recent study by Co rbin et al. [Cor04] identified 1 -hydroxytestosterone as a novel product produced by recombinant pig ova rian aromatase that is absent from the products produced by the porcine placental para log, or by either human or bovine aromatase. This testosterone derivative binds to an androgen receptor, consis tent with physiological activity.

PAGE 111

111 This was unknown until very recently, suggesting that more endocrine novelties remain to be discovered. Any of these may be re levant to a test of this syst em. For example, these hypotheses make predictions about the product specificities of the two peccary aromat ases reported here. In fact, some data suggest that uterine expos ure to androgens severely decreases litter size and embryonic survival during th e time of maternal recognition of pregnancy [Car02]. This is consistent with the hypothesis of Corbin et al. [C or99] that the evolution of the placental paralog is associated with increased efficiency of testoste rone aromatization. This is also consistent with the current data, and the argument presented here. It goes without saying that still more factors might be associated with an increase in litter size from one to two (presumed in Diacodexis see Figure 5-4) to the 12 or more in domestic swine. Most trivially, this increase might be asso ciated with an increase in ovulation rate, and/or an adjustment in the structur es and binding specificities of estrogen receptors [Rot96]. Nevertheless, the first aromatase duplication, sh ared by pigs and peccaries, appears to have happened in the late Eocene (recognizing the erro r associated with thes e dates), around 35 Ma (Figure 5-4). This was a time of great global change, with dramatic cooling in the higher latitudes. More archaic kinds of mammals (e.g., some earlier families of perissodactyls and artiodactyls) became extinct, while many modern families (including the Suidae and Tayassuidae ) became established at this time [Jan97]. Suoids differed from other contemporaneous ungulates in their commitment to omnivory, even though a few forms, such as the modern warthog Phacochoerus aethiopicus are more specialized herbivores. Perhaps the ability to bear a slightly larger litter than other artiodactyls was a dvantageous to them in this time of global ecological transition. Ho wever, it should be noted that larger litters usually mean altricial (i.e., relativ ely underdeveloped) young, a reproductive strategy apparently not available

PAGE 112

112 to larger, cursorial (runningadapted) ungulates, which give birth to precocial (i.e., well developed) young that are fully lo comotory at birth [Eis81]. The second aromatase duplication, with the ensuing capacity to produce multiple young, probably occurred within the family Suidae some time during the Oligocene. The molecular data suggest dates of divergence between porcine feta l and placental aromatases as between 27 38 Ma, and the earliest known suid is of early Oligocene age [Pic93], around 33 Ma (Figure 5-4). Large litters may have characte rized the entire suid famil y. While the extant subfamily Suinae is primarily a Plio-Pleistocene radiation, during th e Oligocene to Pliocene suids were exceedingly diverse taxonomically (with six other subfamilie s known) as well as individually abundant as fossils [Gra99, Coo78, Pic93]. In contrast, the pr edominantly North America tayassuids were never as diverse. It is possible that this tr emendous Old World diversity of suids, which continues to this day, is related to their capacity for the produc tion of large litters. This type of speculation opens questions. For ex ample, the babirusa (an Indonesian pig) is reported to have average litters of one or tw o individuals [Sch89, Mac00]. While it is possible that litters contain three or four individuals, the occurrence is low [Pat90]. If the common ancestor of babirusa with the African/Eurasian Suinae had a larger litter, then the babirusa must be hypothesized to represent a reversion to the more primitive condition. At present, however, relatively little is known of eith er the molecular biol ogy or the natural history of babirusa. The date of divergence from modern swine is placed between 12-26 million years [Tho96, Bos80], while our TREx analysis using cytochrome b places this data at ca. 18 Ma (data not shown). Clearly, further study is warranted. Conclusions The aromatase family offers an example wh ere a combination of phylogenetic analysis, molecular evolutionary analysis, and chemical analysis set within the context of the

PAGE 113

113 paleontological and geological records, and supported by contemporary bioinformatics and molecular modeling tools, permits a higher order level of hypothesis ge neration concerning the function of proteins. Rather than simply an Enzyme Commission number (E.C. 1.14.14.1 for aromatase), a description of catalytic activity (the enzyme oxidizes testoste rone), or a description of the regulatory pattern (the pr otein expressed between day 10 and 13), this type of analysis can generate a truly functional hypothesis: The embryoni c enzyme oxidizes testos terone as a way of managing the larger litter sizes that emerged in the Suoidea during a time of dramatic planetary cooling (ca. 35 Ma). Such hypotheses set a higher bar, and a more useful standard, for the field of systems biology. Evolutionary theory holds that the only mechanism for obt aining functional behavior in a biological system is natural selection. Selection, based on a fr equently poorly defined concept of fitness, is determined by a context that not only includes the cell and tissue, but also the organism, the ecosystem, and a changing planet [Gau03b]. One cannot ex pect a collection of expression data with a mathemati cal model, by themselves, to pr ovide this type of functional information unless it is set in the organismic, ecosystem, and planetary context. The historical view, of the type outlined here, becomes a critical tool for constructing this setting. Humans have evidently exploited the molecular biology of larger litter s to select for pigs that have truly large litters (as many as 14) fo llowing their domestication. Evidence for ancient domestication of pigs comes, inter alia from a study of Indo-European languages. Proto-IndoEuropean (PIE) language had words for "pig" (PIE su -, compared with Tocharian B suwo Latin sus Greek us Sanskrit sukara Church Slavic svinija Old High German swin and English sow ; also compare PIE porko -, with Latin porcus Church Slavic prase Old High German farah etc. [Buc88]), indicating that the pi g has been under human domesticat ion for at least 6,000 years,

PAGE 114

114 enough time to have undergone a significant imp act on its genotype thro ugh husbandry. We are unable, at this time, to exploit complete genome se quences of pigs or other closely related taxa to discuss the impact of domestication on aromatase, steroid receptors, amphiregulins, or other proteins that appear to be associated with uterine capacity and large litter sizes in the domesticated pig [Kim03]. With the complete ge nome sequences of representatives of various mammalian orders, including artioda ctyls, it should be possible to extend this planetary biology approach. Methods Calculations were done under the RedHat Li nux 6.3 operating system on an Intel-Pentium III instruments using Blackdowns Java-SDK 1.1.8. PAML calculations were done on an IBM PC using the Unix operating system. Seque nce analyses were aided by the DARWIN bioinformatics package [Gon91]. The DARWIN pack age can be obtained by emailing a request to cbrg@inf.ethz.ch Initially, pairwise alignments were constr ucted for the aromatase protein sequences available in the database. An e volutionary distance in PAM units was calculated for each pair by applying the PamEstimator-package from DARWIN using an empirical log-odds matrix. From this, a preliminary evolutionary tree was built for the mammalian sequences, with branch lengths along internal nodes calculated to minimize a le ast-squares distance. The sequences of the ancestral genes and proteins at branch points in the tree were then reconstructed. From there, mutations (including fractional mu tations) at both the DNA level a nd protein level were assigned to individual branches in the tree using the method of Fitch [Fit71]. The evolutionary history of the aromatase fa mily was then analyzed using the transition redundant exchange (TREx) metric based on an analysis of twofold redundant codon systems [Ben03, Juk69]. These were obtained for each pair wise comparison of aligned aromatase genes.

PAGE 115

115 The number ( n ) of twofold redundant amino acids (Cys, Asp, Glu, Phe, His, Lys, Asn, Gln, and Tyr) that are conserved in the aligned pairs wa s determined. The number of those amino acids that are encoded by the same codon ( c ) was determined, and the fraction ( f2 = c / n ) of the codons that are the same were then tabulated. Th e TREx distances were calculated from f2 values using the expression kt = -ln(( f2-Equil)/(1-Equil)), where Equil is the f2 value expected after a large number of nucleotide substitutions have occurred at the synonymous sites [Ben03]. The DNA sequences for aromatase were phylogenetically analyzed using a maximum likelihood framework in PAUP 4.0* (beta 10) [Sw o98], with the following parameters: alpha value representing the gamma dist ribution (2.1), the tr ansition-transversion ra tio (1.6), proportion of invariable sites (0.24), and empirical base frequencies. Th e resulting topology of the tree mirrors those based on other molecular studies [Mur01]. For inter-taxon analyses, families in the Ma sterCatalog [Ben00] were identified that contained at least one representative protein from both of the taxa of interest. For these families, all inter-taxa pairs of genes were extracted, together with the pair wise protein sequence alignment. A pairwise alignment of the DNA seque nces was then generated to follow the protein sequence alignment. If a family contained more than one sequence of a species belonging to one of the taxa analyzed, then those sequences we re checked to determine whether they were duplicate entries into the database. If this was the case, only one of the duplicate sequences was retained in the analysis. A histogram of inter-taxa pairs was constructed, and the f2 value characteristic of orthologs dete rmined [Ben03]. This was used to calibrate the TREx clock using the divergence of pigs and oxen, and pigs and humans. Codon biases were obtained from the CUTG (Codon Usage Tabulated from GenBank) made available by the Kazusa DNA Research Institute Foundati on, Japan [Nak05].

PAGE 116

116 Pairwise TREx distances were used to gene rate lengths for the branches connecting the swine paralogs using the minimu m evolution criterion in PAUP. This preliminary analysis was followed by a maximum likelihood analysis for th e complete dataset using the PAML program [Yan97]. This includes the assignment of KA/KS values to individual br anches. Tests of parallel evolution were conducted using Converge [Zha97], implementing the JTT model. Secondary structural data based on homol ogy modeling for aromatases were generated using the DARWIN bioinformatics package, and in agreement with previous studies [Gra95, Lew98]. Renderings of the three di mensional structure of the protei ns were obtained using a beta version of the HyperProtein package (HyperCube, Gainesville FL, USA 32601).

PAGE 117

117 Figure 5-1. Reactions catalyzed by aromat ases on multiple androgenic substrates

PAGE 118

118 Figure 5-2. Dating the pig duplicatio n events. An evolutionary tree, following the topology of Figure 5-5, showing estimated TREx distan ces for individual branches calculated from reconstructed ancestral sequences. The scale corresponds to evolutionary time (in million years) estimated from the TRExs using a first order rate constant for transitions of 3 x 10-9 changes per base per year. Pig Placent al Pig Fetal Pig Ovary Sheep Bovine Horse 0.11250 0.07850 0.00367 0.09383 0.07961 26.237.5 27.438.7 53.965.3 Ma

PAGE 119

119 Figure 5-3. The amino acid alignment of exon 4 of two aromatase isof orms from both peccary and babirusa sequences with exon 4 of pi g aromatase isoforms ovarian, fetal, and placental. Asterisks represent conserved sites.

PAGE 120

120 Figure 5-4. Cladogram of the order Artiodactyla showing the extant families and some selected extinct ones. Ruminantia includes the modern families Tragulidae Giraffidae Bovidae Moschidae and Cervidae plus a number of extinct families. Dichobunidae is a paraphyletic assemblage of primitive taxa considered broadly ancestral to the later families. The interrelationships of the families reflect the "traditional" relationship based on mo rphology. Different arrangements based on molecular information would alter the placement of the Camelidae and Hippopotamidae but would make no difference to the arguments presented here concerning the Suoidea The interrelationships within the Suidae are based on information in several studies. Note that only a couple of extinct suid subfamilies are shown, and that only extant genera of Suinae are shown. Thick, medium-thick and thin lines represent family or above subfamily and genera, respectively. CENOZOIC TERTIARYQUARTERNY MESOZOIC PaleocenePliocenePleistocene Miocene Oligocene Eocene Sus Potamochoerus Hylochoerus Phacochoerus Babyrousa Suinae SUIDAE SUOIDEA SUIFORMES SELENODONTIAHyotheriinae Palaeochoerinae TAYASSUIDAE HIPPOPOTAMIDAE ANTHRACOTHERIIDAE DICHOBUNIDAE CAMELIDAE RUMINATIA 1.7 5.2 23.0 33.4 55.0 65.0 Average >3 young per litter 1-2 young per litter Average >2 young per litter SUOIDEA Aromatase Duplication Events OREODONTOIDEA? Recent 0.01

PAGE 121

121 Figure 5-5. Phylogenetic tree for the 18 vertebrate aromatase gene s. Numbers above the branches represent the KA/KS ratios, while numbers below i ndicate branches highlighted in Figure 5-6. Single and double asterisks re present bootstrap values of 95-99% and 100%, respectively. The following sequences were used: Oncorhynchus mykiss (rainbow trout), gi:1613859, Oryzias latipes (medaka), gi:1786171, Danio rerio (zebrafish), gi:2306966, Carassius auratus (goldfish, ovary), gi:2662330, Ictalurus punctatus (catfish), gi:912802, Carassius auratus (goldfish, brain), gi:2662328, Sus scrofa (pig) placental, isoform 2, gi:1762232, Sus scrofa (pig) embryo, isoform 3, gi:1244543, Sus scrofa (pig) ovary, isoform 1, gi:1928957, Bos taurus (ox), gi:665546, Equus caballus (horse), gi:2921277, Mus musculus (mouse), gi:3046857, Rattus norvegicus (rat), gi:203804, Oryctolagus cuniculus (rabbit), gi:2493381, Homo sapiens (human), gi:28846, Gallus gallus (chicken), gi:211703, Poephila guttata (zebra finch, ovary), gi:926845, Ovis aries (sheep), gi:7673985.

PAGE 122

122 Figure 5-6. The distribution of am ino acid replacements on the ter tiary structure of cytochrome P450 homolog. Amino acid replacements o ccurring along branches highlighted in Figure 5-5 are shown in red. The substrate binding pocket and nicotinamide co-factor are colored yellow and purple, respectively. The sites that bind the co-reductant are highlighted in green for reference.

PAGE 123

123 Table 5-1. Frequency distributions of stem pig duplication substitutions versus substitutions on all other terr estrial vertebrate branches Terrestrial vertebrates Nonsynonymous substitutions Synonymous substitutions Totals Stem Pig duplicates (Branch in Figure 5-5) 23 9 32 Remaining branches 598 1449 2047 Totals 621 1458 2079 Fishers exact test, P = 0.00000094784

PAGE 124

124 Table 5-2. Distributions of stem pig duplic ation substitutions ve rsus substitutions within the Laurasiatheria subtree. Laurasiatheria subtree Nonsynonymous substitutions Synonymous substitutions Totals Stem Pig duplicates (Branch in Figure 5-5) 23 9 32 Remaining branches 232 258 490 Totals 255 267 522 Fishers exact test, P = 0.0056688

PAGE 125

125 APPENDIX A NAMES AND ABBREVIATIONS OF NUCL EIC ACID BASES AND AMINO ACIDS Table A-1. Names and abbrevia tions of nucleic acid bases One Letter Symbol Nucleic Acid Base A Adenine C Cytosine G Guanine T Thymine U Uracil Table A-2. One and three letter symbols for the amino acids One Letter Symbol Three Letter Symbol Amino Acid A Ala Alanine B Asx Asparagine or aspartic acid C Cys Cysteine D Asp Aspartic acid E Glu Glutamic acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine Z Glx Glutamine or glutamic acid

PAGE 126

126 APPENDIX B AN IMPLEMENTATION OF THE EXTENDED PBL METHOD CODED IN JAVA 1.1 /** KaKs.java * Created: Wed Oct 27 21:53:19 1999 * @David Schreiber @1.7 */ public class Info { public Info() { } public Info(Tree t, TreeSequence[] s, int i) { if (t.isLeaf() != true) { setTree(t); setMutationModel(i); setNumberOfTaxa(); setMaxIndex(); setTaxaIndices(); setTaxaLabels(); doTreeNodes(t); setTreeNodes(); setTreeEdges(); nn = s[1].length(); naa = nn/3; try { setTreeSequences(s); for (int cod = 0; cod < nn/3; cod++) { Parsimony(cod); Probabilities(); makeTrips(cod); doEdgeDegeneracy(); doTtnTvn(); } doKaKsComputations(); } catch (Sequen cePosOutOfBoundsException e) {} catch (IllegalCodeException e) {} catch (IllegalCharException e) {} } } public void setTree(Tree t) { tree = t; } public void setMutationModel(int i) { mm = i; } public void setNumberOfTaxa() { ntax = tree.getLeafCount();

PAGE 127

127 } public int getNumberOfTaxa() { return ntax; } public void setMaxIndex() { maxidx = tree.getMaxIndex(); } public void setTaxaIndices() { int[] temp = tree.getLeafIndices(); Sort.sort(temp, ntax); TaxaIndices = temp; } public void setTaxaLabels() { String[] s = new String[ntax]; for (int i = 0; i < ntax; i++) { Tree temp = tree.findLeaf(TaxaIndices[i]); s[i] = temp.label; } TaxaLabels = s; } public int getNumberOfNucleotides() { return nn; } public int getNumberOfAminoAcids() { return naa; } public int doTreeNodes(Tree t) { if (t.isLeaf() == true) { return t.index; } else { int a = doTreeNodes(t.left); int b = doTreeNodes(t.right); Node e = new Node(a, b); nodes.addElement(e); return (nodes.size() + maxidx); } } public void setTreeNodes() { Node[] temp = new Node[nodes.size()]; for (int i = 0; i < nodes.size(); i++) { temp[i] = (Node)nodes.elementAt(i); temp[i].index = i + maxidx + 1; //nodes numbered consecutively, starting with maximum leaf index + 1 } TN = temp; nnode = nodes.size(); }

PAGE 128

128 public Node[] getTreeNodes() { return TN; } public void setTreeEdges() { Edge[] e = new Edge[nnode 2]; Kes[] k = new Kes[nnode 2]; for (int i = 0; i < nnode; i++) { e[i 2] = new Edge(TN[i].index, TN[i].LR[0]); e[i 2 + 1] = new Edge(TN[i].index, TN[i].LR[1]); k[i 2] = new Kes(); k[i 2 + 1] = new Kes(); } TE = e; KaKs = k; } public Edge[] getTreeEdges() { return TE; } public void setTreeSequences(TreeSequen ce[] s) throws SequencePosOutOfBoundsException { tax_d = s; Vector codons = Constants.CODONS; for (int i = 0; i < ntax; i++) { s[i].index = TaxaIndices[i]; //this assumes that order of sequences in Tr eeSequence[] corresponds with //ascending numerical order of tree's leaf indices } Vector v = new Vector(); for (int i = 0; i < ntax; i++) { v.addElement(new Integer(TaxaIndices[i])); } IdxVctr = v; } public Vector getIdxVctr() { return IdxVctr; } public Kes[] getKaKs() { return KaKs; } public void printTreeNodes() { for (int i = 0; i < nnode; i++) { Format.printf("Node %d --> ", TN[i].index); Format.printf("[%d,", TN[i].LR[0]); Format.printf("%d]\n", TN[i].LR[1]); } } public void printTreeEdges() { for (int i = 0; i < nnode *2; i++) {

PAGE 129

129 Format.printf("Edge %d --> ", i); Format.printf("[%d,", TE[i].LR[0]); Format.printf("%d]\n", TE[i].LR[1]); } } public void Parsimony(int cod) throws SequencePosOutOfBoundsException { for (int npos = 0; npos < 3; npos++) { String[][] Preliminary = new String[nnode][2]; String[][] Final = new String[nnode][3]; for (int n = 0; n < nnode; n++) { int[] lr = TN[n].LR; String[] lr_seq = new String[2]; String prelim = ""; String u = ""; for (int i = 0; i < 2; i++) { if (lr[i] <= maxidx) { int j = IdxVctr.indexOf(new Integer(lr[i])); lr_seq[i] = tax_d[j].subsequence((cod 3) + npos, (cod 3) + npos + 1).toString(); } else { lr_seq[i] = Preliminary[lr[i] maxidx 1][0]; } } String isct = SetFx. intersect(lr_seq[0 ], lr_seq[1]); if (isct != "") { prelim = isct; } else { prelim = SetFx.union(lr_seq[0], lr_seq[1]); u = "u"; } Preliminary[n][0] = prelim; Preliminary[n][1] = u; } Final[nnode 1][0] = Preliminary[nnode 1][0]; Final[nnode 1][1] = ""; Final[nnode 1][2] = Preliminary[nnode 1][1]; for (int n = nnode 2; n > -1; n--) { int[] lr = TN[n].LR; int anc = 0; for (int z = n + 1; z < nnode; z++) { int[] lr_temp = TN[z].LR; if ((n + maxidx + 1) == lr_temp[0] || (n + maxidx + 1) == lr_temp[1]) { anc = z; break; } } String anc_fi = Final[anc][0]; String prelim_d = Preliminary[n][0]; String test_i = SetFx.intersect(prelim_d, anc_fi); if (test_i.length() == anc_fi.length()) { Final[n][0] = test_i; Final[n][1] = ""; Final[n][2] = Preliminary[n][1]; }

PAGE 130

130 else { if (Preliminary[n][1] == "u") { String u_iv = SetFx.union(prelim_d, anc_fi); String m_iv = SetFx.minus(u_iv, prelim_d); Final[n][0] = u_iv; Final[n][1] = m_iv; Final[n][2] = "u"; } else { String fi_temp = prelim_d; String m_v = SetFx.minus(anc_fi, fi_temp); String[] prelim = new String[2]; for (int ii = 0; ii < 2; ii++) { if (lr[ii] <=maxidx) { int jj = IdxVctr.indexOf(new Integer(lr[ii])); prelim[ii] = tax_d[jj].subsequence((cod 3) + npos, (cod 3) + npos).toString(); } else { prelim[ii] = Preliminary[lr[ii] maxidx 1][0]; } } for (int j = 0; j < m_v.length(); j++) { String li = SetFx.intersect(m_v.substring(j,j+1), prelim[0]); String ri = SetFx.intersect(m_v.substring(j,j+1), prelim[1]); if (li.length() == 1 || ri.length() == 1) { fi_temp += m_v.substring(j,j+1); } } String add = SetFx.minus(fi_temp, prelim_d); Final[n][0] = fi_temp; Final[n][1] = add; Final[n][2] = ""; } } } for (int n = 0; n < nnode; n++) { TN[n].setParsimony(Final[n], npos); } } } public void Probabilities() throws SequencePosOutOfBoundsException { //set root probabilities Vector v = TN[nnode 1].getParsimony(); for (int npos = 0; npos < 3; npos++) { String s = ((String[])v.elementAt(npos))[0]; int len = s.length(); Float[] f = new Float[len]; for (int i = 0; i < len; i++) { f[i] = new Float(1.00/len); } TN[nnode 1].setProbabilities(f, npos); } for (int npos = 0; npos < 3; npos++) { for (int n = nnode 1; n > -1; n--) {

PAGE 131

131 Vector ap = TN[n].getProbabilities(); Vector as = TN[n].getParsimony(); int[] lr = TN[n].LR; if (lr[0] <= maxidx && lr[1] <= maxidx) { } else { for (int x = 0; x < 2; x++) { if (lr[x] <= maxidx) { } else { Vector des = TN[lr[x] maxidx 1].getParsimony(); Float[] anc_probs = (Float[])ap.elementAt(npos); String anc_fi = ((String[])as.elementAt(npos))[0]; String des_fi = (( String[])des.elementAt(npos))[0]; if (des_fi.length() == 1) { Float[] f = new Float[1]; f[0] = new Float(1.00); TN[lr[x] maxidx 1].setProbabilities(f, npos); } else { float[] probs = new float[des_fi.length()]; for (int p = 0; p < des_fi.length(); p++) { probs[p] = 0; } String des_ad = ((St ring[])des.elementAt(npos))[1]; String des_mn = SetFx.minus(des_fi, des_ad); for (int j = 0; j < anc_fi.length(); j++) { float anc_prob = anc_probs[j].floatValue(); String anc_fi_sub = anc_fi.substring(j, j + 1); boolean flag = false; for (int k = 0; k < des_mn.length(); k++) { String des_mn_sub = des_mn.substring(k, k + 1); if (des_mn_sub.equals(anc_fi_sub)) { int idx = des_fi.indexOf(anc_fi_sub); probs[idx] += anc_prob; flag = true; break; } } if (flag == false) { int m1 = des_mn.length(); int m2 = 0; for (int da = 0; da < des_ad.length(); da++) { String des_ad_sub = des_ad.substring(da, da + 1); if (anc_fi_sub.equals(des_ad_sub)) { m2++; break; } } float p_temp = anc_prob/(m1 + m2); for (int mn = 0; mn < des_mn.length(); mn++) { String des_mn_sub = des_mn.substring(mn, mn + 1); int idx = des_fi.indexOf(des_mn_sub); probs[idx] += p_temp; } for (int ad = 0; ad < des_ad.length(); ad++) {

PAGE 132

132 String des_ad_sub = des_ad.substring(ad, ad + 1); if (anc_fi_sub.equals(des_ad_sub)) { int idx = des_fi.indexOf(des_ad_sub); probs[idx] += p_temp; } } } } Float[] in = new Float[probs.length]; for (int i = 0; i < probs.length; i++) { in[i] = new Float(probs[i]); } TN[lr[x] maxidx 1].setProbabilities(in, npos); } } } } } } } public void makeTrips(int cod) throws IllegalCharException, IllegalCodeException, SequencePosOutOfBoundsException { //a Trip is a.k.a. an EPS (equally parsimonious solution) for (int n = 0; n < nnode; n++) { //node int[] lr = TN[n].LR; int[] l_r = new int[2]; for (int a = 0; a < 2; a ++) { if (lr[a] <= maxidx) { l_r[a] = IdxVctr.indexOf(new Integer(lr[a])); } } int[] gc = new int[3]; //genetic codes for left, ancestor, right of the present node Vector Trips = new Vector(); Vector[] rules = new Vector[3]; for (int i = 0; i < 3; i++) { rules[i] = new Vector(); } Vector anc = TN[n].getParsimony(); Vector anc_probs = TN[n].getProbabilities(); for (int p = 0; p < 3; p++) { //each position in codon int pos = (cod 3) + p; String anc_fi = (( String[])anc.elementAt(p))[0]; String anc_ad = (( String[])anc.elementAt(p))[1]; String[] minz = new String[2]; String[] addz = new String[2]; String[] finz = new String[2]; gc[1] = TN[n].genetic_code; int ct = 0; for (int x = 0; x < 3; x += 2) { if (lr[ct] <= maxidx) { finz[ct] = tax_d[l_r[ct]].subsequence(pos, pos + 1).toString(); addz[ct] = ""; minz[ct] = finz[ct]; gc[x] = tax_d[l_r[ct]].genetic_code; }

PAGE 133

133 else { Vector des = TN[lr[ct] maxidx 1].getParsimony(); finz[ct] = (( String[])des.elem entAt(p))[0]; addz[ct] = ((String[])des.elementAt(p))[1]; minz[ct] = SetFx.minus(finz[ct], addz[ct]); gc[x] = TN[lr[ct] maxidx 1].genetic_code; } ct++; } Vector[] rulz = new Vector[2]; rulz[0] = new Vector(); rulz[1] = new Vector(); for (int x = 0; x < anc_fi.length(); x++) { for (int j = 0; j < 2; j++) { String temp = SetFx.intersect(anc_fi.substring(x, x + 1), minz[j]); if (!(temp.length() == 0)) { rulz[j].addElement(temp + temp); } else { if (!(minz[j].length() == 0)) { for (int l = 0; l < minz[j].length(); l++) { rulz[j].addElement(minz[j] .substring(l, l + 1) + anc_fi.substring(x, x + 1)); } String temp2 = SetFx.intersec t(anc_fi.substring(x, x + 1), addz[j]); if (!(temp2.length() == 0)) { rulz[j].addElement(temp2 + temp2); } } } } } rulz[0].trimToSize(); rulz[1].trimToSize(); for (int j = 0; j < rulz[0].size(); j++) { for (int k = 0; k < rulz[1].size(); k++) { String rulz_l = (String)rulz[0].elementAt(j); String rulz_r = (String)rulz[1].elementAt(k); if (rulz_l.substring(1, 2).equals(rulz_r.substring(1, 2))) { rules[p].addElement(rulz_l.substring(0, 1) + rulz_l.substring(1, 2) + rulz_r.substring(0, 1)); } } } } //close codon position loop rules[0].trimToSize(); rules[1].trimToSize(); rules[2].trimToSize(); Vector p_trips = new Vector(); for (int a = 0; a < rules[0].size(); a++) { for (int b = 0; b < rules[1].size(); b++) { for (int c = 0; c < rules[2].size(); c++) { String[] s = new String[3]; String r0 = (String)rules[0].elementAt(a); String r1 = (String)rules[1].elementAt(b); String r2 = (String)rules[2].elementAt(c);

PAGE 134

134 for (int d = 0; d < 3; d++) { s[d] = r0.substring(d, d + 1) + r1.substring(d, d + 1) + r2.substring(d, d + 1); } p_trips.addElement(s); } } } p_trips.trimToSize(); Vector trips = new Vector(); for (int a = 0; a < p_trips.size(); a++) { int ct = 0; String[] s = (Str ing[])p_trips. elementAt(a); for (int b = 0; b < 3; b++) { DnaSequence d = new DnaSequence(s[b], GeneticCode.getCode(gc[b])); String trans = d.product().toString(); if (!(trans.equals("X"))) { ct++; } } if (ct == 3) { trips.addElement(s); } } trips.trimToSize(); if (trips.size() == 0) { Vector in = new Vector(4); in.addElement(""); in.addElement(""); in.addElement(""); Float in_d = new Float(0); in.addElement(in_d); Trips.addElement(in); Trips.trimToSize(); TN[n].setTrips(Trips); } else { Vector[] nuc_probs = new Vector[3]; String[] lefs = new String[3]; String[] ancs = new String[3]; String[] rits = new String[3]; for (int p = 0; p < 3; p++) { nuc_probs[p] = new Vector(3); nuc_probs[p].setSize(3); int pos = (cod 3) + p; nuc_probs[p].setElementAt((Float[])anc_probs.elementAt(p), 1); //ancestor's nuc probs ancs[p] = ((String[])anc.elementAt(p))[0]; //ancestor's final nuc set int ct = 0; for (int x = 0; x < 3; x += 2) { if (lr[ct] <= maxidx) { Float[] temp = new Float[1]; temp[0] = new Float(1); nuc_probs[p].setElementAt(temp, x); if (ct == 0) { lefs[p] = tax_d[l_r[0]].subsequence(pos, pos + 1).toString(); }

PAGE 135

135 else { rits[p] = tax_d[l_r[1]].subsequence(pos, pos + 1).toString(); } } else { Vector des = TN[lr[ct] maxidx 1].getParsimony(); Vector des_probs = TN[lr[ct] maxidx 1].getProbabilities(); Float[] temp = (Float[])des_probs.elementAt(p); nuc_probs[p].setElementAt(temp, x); if (ct == 0) { lefs[p] = ((String[])des.elementAt(p))[0]; } else { rits[p] = ((String[])des.elementAt(p))[0]; } } ct++; } } float[] prob = new float[trips.size()]; float[] trip_prob = new float[trips.size()]; float tot = 0; Vector codons = Constants.CODONS; for (int a = 0; a < trips.size(); a++) { String[] s = (Str ing[])trips.el ementAt(a); float nuc_probs_product = 1; for (int b = 0; b < 3; b++) { Float[] lef_floats = (Float[])nuc_probs[b].elementAt(0); Float[] anc_floats = (Float[])nuc_probs[b].elementAt(1); Float[] rit_floats = (Float[])nuc_probs[b].elementAt(2); nuc_probs_product *= lef_floats[lefs[b].indexOf(s[0].substring(b, b + 1))].floatValue(); nuc_probs_product *= anc_floats[ancs[b].indexOf(s[1].substring(b, b + 1))].floatValue(); nuc_probs_product *= rit_floats[rits[b].indexOf(s[2].substring(b, b + 1))].floatValue(); } float l_prob = Constants.AA_CHANGE_PROBS[mm][gc[0]][codons.indexOf(s[0])][codons.indexOf(s[1])]; float r_prob = Constants.AA_CHANGE_PROBS[mm][gc[2]][codons.indexOf(s[1])][codons.indexOf(s[2])]; prob[a] = l_prob r_prob nuc_probs_product; tot += l_prob r_prob nuc_probs_product; } for (int a = 0; a < trips.size(); a++) { trip_prob[a] = prob[a]/tot; String[] s = (Str ing[])trips.el ementAt(a); Vector in = new Vector(4); in.addElement(s[0]); in.addElement(s[1]); in.addElement(s[2]); in.addElement(new Float(trip_prob[a])); Trips.addElement(in); } Trips.trimToSize();

PAGE 136

136 TN[n].setTrips(Trips); } } //close node loop } public void doEdgeDegeneracy() { for (int n = 0; n < nnode; n++) { //node by node byte[][] sites = Co nstants.SITES[TN[n].genetic_code]; int[] lr = TN[n].LR; int[] l_r = new int[2]; for (int a = 0; a < 2; a ++) { if (lr[a] <= maxidx) { l_r[a] = IdxVctr.indexOf(new Integer(lr[a])); } } Vector trips = TN[n].getTrips(); float[][] degs_temp = {{0,0,0},{0,0,0},{0,0,0}}; if (trips.size() == 0) { } else { float[] lef_temp = new float[3]; float[] rit_temp = new float[3]; if (trips.size() == 1) { Vector trip = (Vector)trips.elementAt(0); String[] c = new String[3]; c[0] = (String)trip.elementAt(0); c[1] = (String)trip.elementAt(1); c[2] = (String)trip.elementAt(2); if (!(c[0].equals("") || c[1].equals("") || c[2].equals(""))) { //if the trip contains c odons at each node, i.e., //there is one valid trip at this codon for (int a = 0; a < 3; a++) { for (int b = 0; b < 3; b++) { degs_temp[a][b] += sites[Constants.CODONS.indexOf(c[a])][b]; } } for (int a = 0; a < 3; a++) { lef_temp[a] = (degs_temp[0][a] + degs_temp[1][a]) / 2; rit_temp[a] = (degs_temp[1][a] + degs_temp[2][a]) / 2; } //System.out.println("one trip"); TE[n*2].setDegeneracy(lef_temp); TE[(n*2)+1 ].setDegeneracy(rit_temp); } } else { float[][] degs = new float[3][3]; //dim 1 = l, a, r; dim 2 = 0-,2-,4-fold for (int t = 0; t < trips.size(); t++) { //trip by trip Vector trip = (Vector)trips.elementAt(t); String[] cods = new String[3]; cods[0] = (String)trip.elementAt(0); cods[1] = (String)trip.elementAt(1); cods[2] = (String)trip.elementAt(2); float prob = ((Float)trip.elementAt(3)).floatValue(); for (int a = 0; a < 3; a++) {

PAGE 137

137 for (int b = 0; b < 3; b++) { degs_temp[a][b] += sites[Constants.CODONS.indexOf(cods[a])][b] prob; } } } //close trip loop for (int a = 0; a < 3; a++) { for (int b = 0; b < 3; b++) { degs[a][b] += degs_temp[a][b]; } } for (int a = 0; a < 3; a++) { lef_temp[a] += (degs[0][a] + degs[1][a])/2; rit_temp[a] += (degs[1][a] + degs[2][a])/2; } TE[n*2].setDegeneracy(lef_temp); TE[(n*2)+1].setDegeneracy(rit_temp); } } } } public void doTtnTvn() { for (int n = 0; n < nnode; n++) { //node by node float[][][] ttn_tvn = {{{0,0,0},{0,0,0}}, {{0,0,0},{0,0,0}}}; Vector trips = TN[n].getTrips(); if (trips.size() == 0) { } else if (trips.size() == 1) { Vector trip = (Vector)trips.elementAt(0); String[] c = new String[3]; c[0] = (String)trip.elementAt(0); c[1] = (String)trip.elementAt(1); c[2] = (String)trip.elementAt(2); if (!(c[0].equals("") || c[1].equals("") || c[2].equals(""))) { //if the trip contains c odons at each node, i.e., //there is one valid trip at this codon for (byte a = 0; a < 2; a++) { //transitions/transversions for (byte b = 0; b < 3; b++) {//0-,2-,4-fold sites ttn_tvn[0][a][b] = Constants.CHANGES[mm][TN[n].genetic_code][Constants.CODONS.indexOf(c[0])] [Constants.CODONS.indexOf(c[1])][a][b]; //left descendant vs. ancestor ttn_tvn[1][a][b] = Constants.CHANGES[mm][TN[n].genetic_code][Constants.CODONS.indexOf(c[1])] [Constants.CODONS.indexOf(c[2])][a][b]; //right descendant vs. ancestor } } } } else { for (int t = 0; t < trips.size(); t++) { //trip by trip Vector trip = (V ector)trips.elementAt(t); String[] cods = new String[3];

PAGE 138

138 cods[0] = (String)trip.elementAt(0); cods[1] = (String)trip.elementAt(1); cods[2] = (String)trip.elementAt(2); float prob = ((Float)trip.elementAt(3)).floatValue(); for (byte a = 0; a < 2; a++) { for (byte b = 0; b < 3; b++) { ttn_tvn[0][a][b] += Constants.CHANGES[mm][TN[n].genetic_code][Constants.CODONS.indexOf(cods[0])] [Constants.CODONS.indexOf(cods[1])][a][b] prob; ttn_tvn[1][a][b] += Constants.CHANGES[mm][TN[n].genetic_code][Constants.CODONS.indexOf(cods[1])] [Constants.CODONS.indexOf(cods[2])][a][b] prob; } } } } TE[n*2].setTtnTvn(ttn_tvn[0]); TE[(n*2)+1].setTtnTvn(ttn_tvn[1]); } } public void doKaKsComputations() { int ne = nnode 2; double[][] P_Li = new double[ne][3]; double[][] Q_Li = new double[ne][3]; double[][] a_Li = new double[ne][3]; double[][] b_Li = new double[ne][3]; double[][] c_Li = new double[ne][3]; double[][] d_Li = new double[ne][3]; double[][] A_Li = new double[ne][3]; double[][] B_Li = new double[ne][3]; double[][] VA_Li = new double[ne][3]; double[][] VB_Li = new double[ne][3]; double[][] K_Li = new double[ne][3]; double[][] VK_Li = new double[ne][3]; double[] Ka_Li = new double[ne]; double[] VKa_Li = new double[ne]; double[] Ks_Li = new double[ne]; double[] VKs_Li = new double[ne]; boolean[] problem = new boolean[ne]; for (int e = 0; e < ne; e++) { problem[e] = false; float[] deg = TE[e].getDegeneracy(); float[][] ttn_tvn = TE[e].getTtnTvn(); for (int j = 0; j < 3; j++) { if (deg[j] == 0) { P_Li[e][j] = 0; Q_Li[e][j] = 0; } else { P_Li[e][j] = ttn_tvn[0][j] / deg[j]; Q_Li[e][j] = ttn_tvn[1][j] / deg[j]; } } } for (int e = 0; e < ne; e++) {

PAGE 139

139 for (int j = 0; j < 3; j++) { double denom_1 = 1 (2 P_Li[e][j]) Q_Li[e][j]; double denom_2 = 1 (2 Q_Li[e][j]); if (denom_1 == 0) { a_Li[e][j] = 0; } else { a_Li[e][j] = 1/denom_1; } if (denom_2 == 0) { b_Li[e][j] = 0; } else { b_Li[e][j] = 1/denom_2; } } } for (int e = 0; e < ne; e++) { for (int j = 0; j < 3; j++) { float[] deg = TE[e].getDegeneracy(); if (a_Li[e][j] <= 0 || b_Li[e][j] <= 0 || deg[j] == 0) { problem[e] = true; } else { A_Li[e][j] = (0.5 Math.log(a_Li[e][j])) (0.25 Math.log(b_Li[e][j])); B_Li[e][j] = 0.5 Math.log(b_Li[e][j]); VA_Li[e][j] = (((Math.pow(a_Li[e][j], 2) P_Li[e][j]) + (Math.pow(c_Li[e][j], 2) Q_Li[e][j])) Math.pow((a_Li[e][j] P_Li[e][j]) + (c_Li[e][j] Q_Li[e][j]), 2)) / deg[j]; VB_Li[e][j] = ((Math.pow(b_Li[e][j], 2) Q_Li[e][j]) (1 Q_Li[e][j])) / deg[j]; K_Li[e][j] = A_Li[e][j] + B_Li[e][j]; } } } for (int e = 0; e < ne; e++) { float[] deg_old = TE[e].getDegeneracy(); double[] deg = new double[3]; for (int i = 0; i < 3; i++) { deg[i] = 1 deg_old[i]; } if (problem[e] == true) { Ka_Li[e] = -1; Ks_Li[e] = -1; VKa_Li[e] = -1; VKs_Li[e] = -1; } else { Ks_Li[e] = (((deg[1] A_Li[e][1]) + (deg[2] A_Li[e][2])) / (deg[1] + deg[2])) + B_Li[e][2]; VKs_Li[e] = (((Math.pow(deg[1], 2) VA_Li[e][1]) + (Math.pow(deg[2], 2) VA_Li[e][2])) / Math.pow((deg[1] + deg[2]), 2) + VB_Li[e][2]) (((b_Li[e][2] Q_Li[e][2])*(2 a_Li[e][2] P_Li[e][2]c_Li[e][2] (1 Q_Li[e][2]))) / (deg[1] + deg[2])); Ka_Li[e] = (((deg[0] B_Li[e][0]) + (deg[1] B_Li[e][1])) / (deg[0] + deg[1])) + A_Li[e][0]; VKa_Li[e] = (((Math.pow(deg[0], 2) VB_Li[e][0]) + (Math.pow(deg[1], 2) VB_Li[e][1])) / Math.pow((deg[0] + deg[1]), 2) + VA_Li[e][0]) (((b_Li[e][0] Q_Li[e][0]) (2 a_Li[e][0] P_Li[e][0]c_Li[e][0] (1 Q_Li[e][0]))) / (deg[0] + deg[1]));

PAGE 140

140 } KaKs[e].syn = Ks_Li[e]; KaKs[e].nonsyn = Ka_Li[e]; } } private Tree tree = null; private int[] TaxaIndices; private Vector IdxVctr; private String[] TaxaLabels = null; private int maxidx; private int ntax; private int nnode; private int nn; private int mm; private int naa; private Vector nodes = new Vector(); private Node[] TN = null; private Edge[] TE = null; private Kes[] KaKs = null; private TreeSequence[] tax_d = null; }

PAGE 141

141 LIST OF REFERENCES Alt87 Altschuh, D., Lesk, A. M., Bloomer, A. C., and Klug, A. (1987) Prot. Engng. 1 228-236. Jimmy crack corn. Alt88 Altschuh, D., Vernet, T., Berti, P., Moras, D., and Nagai, K. (1988) Prot. Engng. 2 193-199. Ari54 Aris-Brosou, S., and Yang, Z. (2003) Mol. Biol. Evol. 20 1947-1954. Arn98 Arnason, U., Gullberg, A., and Janke, A. (1998) J. Mol. Evol. 47 718-727. Bab95 Babbitt, P.C., Mrachko, G. T., Hasson, M. S., Huisman, G. W., Kolter, R., Ringe, D., Petsko, G. A., and Kenyon, G. L. (1995) Science 267 1159-1161. Bai91 Bairoch, A., and Boeckmann, B. (1991) Nucleic Acids Res. 19 2247-2250. Bat00 Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L., and Sonnhammer, E. L. L. (2000) Nucleic Acids Res. 28 263-266. Baz82 Bazer, F. W., Geisert, R. D., Thatch er, W. W., and Roberts, R. M. (1982) in Control of Reproduction in Pig (Cole, D. I. A., and Foxcroft, G. R., Eds.) pp 227-252, Butterworth Company, London, England. Ben88a Benner, S. A. (1988) in Redesigning the Molecules of Life (Benner, S. A., Ed.) pp 115-175, Springer-Verlag, He idelberg, Germany. Ben88b Benner, S. A., and Ellington, A. D. (1988) CRC Crit. Rev. Biochem. 23 369-426. Ben89a Benner, S. A. (1989) Chem. Rev. 89 789-806. Ben89b Benner, S. A. (1989) Adv. Enzym. Regulation 28 219-236. Ben89c Benner, S. A., Glasfeld, A., and Piccirilli, A. (1989) Topics in Stereochem. 19 127-207. Ben90 Benner, S. A., and Ellington, A. D. (1990) Bioorganic Chem. Frontiers 1 1-70. Ben91 Benner, S. A., and Gerloff, D. L. (1991) Adv. Enz. Regul. 31 121-181. Ben93 Benner, S. A., Cohen, M. A., and Gonnet, G. H. (1993) J. Mol. Biol. 229 10651082. Ben94 Benner, S. A., Badcoe, I., Cohen, M. A., and Gerloff, D. L. (1994) J. Mol. Biol. 235 926-958. Ben97 Benner, S. A., Cannarozzi, G., Chelva nayagam, G., and Turcotte, M. (1997) Chem. Rev. 97 2725-2843.

PAGE 142

142 Ben98 Benner, S. A., Trabesinger, N., and Schreiber, D. (1998) Adv. Enz. Regul. 38 155-180. Ben00 Benner, S. A., Chamberlin, S. G., Libe rles, D. A., Govindarajan, S., and Knecht, L. (2000) Res. Microbiol. 151 97-106. Ben02 Benner, S. A., Caraco, M. D., Thomson, J. M., and Gaucher, E. A. (2002) Science 293 864-868. Ben03 Benner, S. A. (2003) Adv. Enz. Regul. 43 271-359. Bod97 Bode, W., and Renatus, M. (1997) Curr. Opin. Struct. Biol. 7 865-872. Boe97 Boerboom, D., Kerban, A., and Sirois, J. (1997) Biol. Reprod. 56(Suppl. 1) 479479. Bor98 Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., and Yuan, Y. (1998) J. Mol. Biol. 283 707-725. Bos80 Bosma, A. A. (1980) Proc. 4th Eur. C. Cyto. Dom. An. 238-241. Bri98 Bridges, A., Gruenke, L., Chang, Y. T., Vakser, I. A., Loew, G., and Waskell, L. (1998) J. Biol. Chem. 273 17036-17049. Buc88 Buck, C. D. (1988) in A Dictionary of Selected Synonyms in the Principal European Languages University of Chicago Press, Chicago, IL. Cal84 Callard, G. V., Pudney, J. A., Ke ndall, S. L., and Reinboth, R. (1984) Gen. Comp. Endocrinol. 56 53-58. Cal97 Callard, G. V., and Tchoudakova, A. (1997) J. Steroid Biochem. Mol. Biol. 61 387-392. Car02 Cardenas, H., Herrick, J. R., and Pope, W. F. (2002) Reproduction 123 527-533. Car88 Carroll, R. L. (1988) Vertebrate Paleontology and Evolution W. H. Freeman & Co., New York City, NY. Cha97 Chang, X. T., Kobayashi, T., Kajiu ra, H., Nakamura, M., and Nagahama, Y. (1997) J. Mol. Endocrinol. 18 57-66. Che97 Chelvanayagam, G., Egge nschwiler, A., Knecht, L., Gonnet, G. H., and Benner, S. A. (1997) Prot. Engng. 10 307-316. Che98 Chelvanayagam, G., Knecht, L., Jenny, T. F., Benner, S. A., and Gonnet, G. H. (1998) Fold. Design 3 149-160. Cho86 Chothia, C., and Lesk, A. M. (1986) EMBO J. 5 823-826.

PAGE 143

143 Cho96 Choi, I., Simmen, R. C. M., and Simmen, F. A. (1996) Endocrinol. 137 14571467. Cho97a Choi, I., Collante, W. R., Simme n, R. C. M., and Simmen, F. A. (1997) Biol. Reprod. 56 688-696. Cho97b Choi, I. H., Troyer, D. L., Cornwell, D. L., Kirby-Dobbels, K. R., Collante, W. R., and Simmen, F. A. (1997) DNA Cell Biol. 16 769-777. Coh94 Cohen, M. A., Benner, S. A., and Gonnet, G. H. (1994) Biochem. Biophys. Res. Comm. 199 489-496. Con97 Conley, A., Corbin, J., Smith, T., Hinshelwood, M., Liu, Z., and Simpson, E. (1997) J. Steroid Biochem. Mol. Biol. 61 407-413. Con01a Conley, A., and Hinshelwood, M. (2001) Reproduction 121 685-695. Con01b Conley, A., Mapes, S., Corbin, C. J ., Greger, D., Walters, K., Trant, J., and Graham, S. (2001) J. Steroid Biochem. Mol. Biol. 79 289-297. Coo78 Cooke, H. B. S., and Wilkinson, A. F. (1978) in Evolution of African Mammals (Maglio, V. J., and Cooke, H. B. S., Eds.) pp 438-482, Harvard University Press, Cambridge, MA. Cor95 Corbin, C. J., Khalil, M. W., and Conley, A. J. (1995) Mol. Cell Endocrinol. 113 29-37. Cor99 Corbin, C. J., Trant, J. M., Wa lters, K. W., and Conley, A. J. (1999) Endocrinology 140 5202-5210. Cor01 Corbin, C. J., Trant, J. M., and Conley, A. J. (2001) Mol. Cell Endocrinol. 172 115-124. Cor04 Corbin, C. J., Mapes, S. M., Marcos, J., Shackleton, C. H., Morrow, D., Safe, S., Wise, T., Ford, J. J., and Conley, A. J. (2004) Endocrinology 145 21572164. Cra99 Crandall, K. A., Kelsey, C. R., Imam ichi, H., Lane, H. C., and Salzman, N. P. (1999) Mol. Biol. Evol. 16 372-382. Dah00 Dahlhoff, E. P., and Rank, N. E. (2000) Proc. Natl. Acad. Sci. U. S. A. 97 1005610061. Day78 Dayhoff, M. O., Schwartz, R. M., and Orcott, B. C. (1978) Atlas of Protein Sequence and Structure (Dayhoff, M. O., Ed.) Vol 5, suppl. 3., 345, Nat. Biomed. Res. Found., Washington, D. C.

PAGE 144

144 Del96 Delarue, B., Mittre, H., Feral, C ., Benhaim, A., and Leymarie, P. (1996) Comptes Rend. LAcad. Sci. Serie. III Sciences De La Vie-Life Sciences 319 663670. Del98 Delarue, B., Breard, E., Mitt re, H., and Leymarie, P. (1998) J. Steroid Biochem. Mol. Biol. 64 113-119. Dev92 de Vos, A. M., Ultsch, M., and Kossiakoff, A. A. (1992) Science 255 306-312. Dur94 Duret, L., Mouchir oud, D., and Guoy, M. (1994) Nucleic Acids Res. 22 23602365. Eis81 Eisenberg, J. F. (1981) An Analysis of Trends in Evolution, Adaptation, and Behavior University of Chicago Press, Chicago, IL. End96 Endo, T., Ikeo, K., and Gojobori, T. (1996) Mol. Biol. Evol. 13 685-690. Fet98 Fetrow, J. S., and Skolnick, J. (1998) J. Mol. Biol. 281 949-968. Fit71 Fitch, W. M. (1971) Syst. Zool. 20 406-416. Foo99 Foote, M., Hunter, J. P., Janis, C. M., and Sepkoski Jr., J. J. (1999) Science 283 1310-1314. For96 Fortelius, M., van der Made, J., and Bernor, R. L. (1996) in The Evolution of Western Eurasian Neogene Mammal Fanas (Bernor, R. L., Fahlbusch, V., and Mittmann, H. W., Eds.) pp 348-377, Columbia University Press, New York City, NY. Fra97 Franzen, J. L. (1997) Natur und Museum 127 61-62. Fuk02 Fukami-Kobayashi, K., Schreiber, D. R., and Benner, S. A. (2002) J. Mol. Biol. 319 729-743. Fr95 Frbass, R., and Vanselow, J. (1995) Gene 154 287-291. Gar91 Garrett, W. M., Hoover, D. J., Shack leton, C. H., and Anderson, L. D. (1991) Endocrinology 129 2941-2950. Gat99 Gatesy, J., Milinkovitch, M., Waddell, V., and Stanhope, M. (1999) Syst. Biol. 48 6-20. Gau01 Gaucher, E. A., Miyamoto, M. M., and Benner, S. A. (2001) Proc. Natl. Acad. Sci. U. S. A. 98 548-552. Gau02 Gaucher, E. A., Gu, X., Miyamot o, M. M., and Benner, S. A. (2002) Trends Biochem. Sci. 27 315-321.

PAGE 145

145 Gau03a Gaucher, E. A., Miyamoto, M. M., and Benner, S. A. (2003) Genetics 163 15491553. Gau03b Gaucher, E. A., Thomson, J. M., Bu rgan, M. F., and Benner, S. A. (2003) Nature 425 285-288. Gei97 Geisert, R. S., and Yelich, J. V. (1997) J. Reproduct. Fertil. Suppl. 52 133-149. Ger97 Gerloff, D. L., Cohen, F. E., Korost ensky, M., Turcotte, M., Gonnet, G. H., and Benner, S. A. (1997) Proteins:Struct. Funct. Genet. 27 450-458. Gb94 Gbel, U., Sander, C., Schneider, R., and Valencia, A. (1994) Proteins:Struct. Funct. Genet. 18 309-317. Goh00 Goh, C. S., Bogan, A. A., Joachimiak, M., Walther, D., and Cohen, F. E. (2000) J. Mol. Biol. 299 283-293. Gon91 Gonnet, G. H., and Benner, S. A. (1991) in Computational Biochemistry Research at ETH. Technical Report 154 Departement Informatik, Swiss Federal Institute of Technology, Zrich, Switzerland. Gon92 Gonnet, G. H., Cohen, M. A., and Benner, S. A. (1992) Science 256 1443-1445. Gra95 Graham-Lorence, S., Amarneh, B., White R. E., Peterson, J. A., and Simpson, E. R. (1995) Protein Sci. 4 1065-1080. Gra99 Graddy, L. G. (1999) PhD Dissertation, Un iversity of Florida, Gainesville, FL. Gra00 Graddy, L. G., Kowalski, A. A., Simmen, F. A., Davis, S. L. F., Baumgartner, W. W., and Simmen, R. C. M. (2000) J. Steroid Biochem. Mol. Biol. 73 4957. Gri87 Gribskov, M., McLachlan, A. D., and Eisenberg, D. (1987) Proc. Natl. Acad. Sci. 84 4355-4358. Har88 Harada, N. (1988) Biochem. Biophys. Res. Comm. 156 725-732. Heg01 Hegyi, H., and Gerstein, M. (2001) Genome Res. 11 1632-1640. Hic90 Hickey, G. J., Krasnow, J. S., Bea ttie, W. G., and Richards, J. S. (1990) Mol. Endocrinol. 4 3-12. Hil93 Hill, C. P., Osslund, D., and Eisenberg, D. (1993) Proc. Natl. Acad. Sci. 90 51765181. Hil94 Hillis, D. M., Huelsenbeck, J. P., and Cunningham, C. W. (1994) Science 264 671-677.

PAGE 146

146 Hin93 Hinshelwood, M. M., Corbin, C. J., Ts ang, P. C., and Simpson, E. R. (1993) Endocrinology 133 1971-1977. Hob94 Hobohm, U., and Sander, C. (1984) Protein. Sci. 3 522-524. Hof91 Hofig, A., Simmen, F. A., Bazer F. W., and Simmen, R. C. (1991) J. Endocrinol. 130 245-250. Hue97 Huelsenbeck, J., and Rannala, B. (1997) Science 276 227-232. Hug88 Hughes, A. L., and Nei, M. (1988) Nature 335 167-170. Jan97 Janis, C. M. (1997) Zoo-Anal. Complex. Sy. 100 203-220. Jan98 Janis, C. M., Effinger, J. E., Harrison, J. A., Honey, J. G., Kron, D. G., Lander, B., Manning, E., Prothero, D. R., St evens, M. S., et al. (1998) in Evolution of Tertiary Mammals of North America (Janis, C. M., Scott, K. M., and Jacobs, L. L., Eds.) pp 337-357, Cambridge University Press, Cambridge, England. Jer95 Jermann, T. M., Opitz, J. G., St ackhouse, J., and Benner, S. A. (1995) Nature 374 57-59. Juk69 Jukes, T. H., and Cantor, C. R. (1969) in Mammalian Protein Metabolism (Munro, H. N., Ed.) pp 21-123, Academic Press, New York City, NY. Kab83 Kabsch, W., and Sander, C. (1983) Biopolymers 22 2577-2637. Kao00 Kao, Y. C., Higashiyama, T., Sun, X., Okubo, T., Yarborough, C., Choi, I., Osawa, Y., Simmen, F. A., and Chen, S. (2000) Eur. J. Biochem. 267 6134-6139. Kim80 Kimura, M. (1980) J. Mol. Evol. 16 111-120. Kim81 Kimura, M. (1981) Proc. Natl. Acad. Sci. U. S. A. 78 454-458. Kim82 Kimura, M. (1982) Molecular Evolution, Protein Polymorphism and the Neutral Theory Springer-Verlag, Berlin, Germany. Kim83 Kimura, M. (1983) The Neutral Theory of Molecular Evolution Cambridge University Press, New York City, NY. Kim03 Kim, J. G., Vallet, J. L., and Christenson, R. K. (2003) Mol. Reproduct. Devel. 65 366-372. Kni91 Knighton, D. R., Zheng, J., Ten Eyc k, L., Ashford, F. V. A., Xuong, N. H., Taylor, S. S., and Sowadski, J. M. (1991) Science 253 407-414.

PAGE 147

147 Kor00 Korostensky, C. (2000) ETH Dissert ation 13550, Swiss Federal Institute of Technology, Zrich, Switzerland. Kum98 Kumar, S., and Hedges, S. B. (1998) Nature 392 917-920. Lee99 Lee, S. J., and McPherron, A. C. (1999) Curr. Opin. Genet. Dev. 9 604-607. Leh00 Lehnerer, M., Schulze, J., Achterhold, K., Lewis, D. F. V., and Hlavica, P. (2000) J. Biochem. 127 163-169. Lew98 Lewis, D. F., and Lee-Robichaud, P. (1998) J. Steroid Biochem. Mol. Biol. 66 217-233. Li85 Li. W. H., Wu, C. I ., and Luo, C. C. (1985) Mol. Biol. Evol. 2 150-174. Li93 Li. W. H. (1993) J. Mol. Evol. 36 96-99. Lib01 Liberles, D. A., Schreiber, D. R., Govindarajan, S., Chamberlin, S. G., and Benner, S. A. (2001) Genome Biol. 2 0003.1-0003.18. Lib03 Liberles, D. A. (2003) The Adaptive Evolution Database (TAED) Stockholm Bioinformatics Center, http://www.sbc.su.se/~liberles/TAED2002 accessed August 2004. Lic96 Lichtarge, O., Bourne, H. R., and Cohen, F. E. (1996) J. Mol. Biol. 257 342-358. Mac00 MacLaughlin, K., Ostro, L. E. T ., Koontz, C., and Koontz, F. (2000) Zoo. Biol. 19 253-262. Mad92 Maddison, W. P., and Maddison, D. R. (1992) MacClade. Analysis of Phylogeny and Character Evolution Sinauer Associates, Sunderland, MA. Mak98 Makalowski, W., and Boguski, M. S. (1998) Proc. Natl. Acad. Sci. U. S. A. 95 9407-9412. Mal90 Malcolm, B. A., Wilson, K. P., Matthews, B. W., Kirsch, J. F., and Wilson, A. C. (1990) Nature 345 86-89. Mao92 Mao, S. S., Holler, T. P., Yu, G. X., Bollinger Jr., J. M., Booker, S., Johnston, M. I., and Stubbe, J. (1992) Biochemistry 31 9733-9743. May95 May, A. C. W., and Blundell, T. L. (1995) Curr. Opin. Biotech. 5 355-360. Mes97 Messier, W., and Stewart, C.-B. (1997) Nature 385 151-154. Miy95 Miyamoto, M. M., and Fitch, W. M. (1995) Mol. Biol. Evol. 12 503-513. Mou96 Moult, J. (1996) Curr. Opin. Biotech. 7 422-427.

PAGE 148

148 Mur01 Murphy, W. J., Eizirik, E., OBrien, S. J., Madsen, O., Scally, M., Douady, C. J., Teeling, E., Ryder, O. A., St anhope, M. J., et al. (2001) Science 294 2348-2351. Nak95 Nakashima, K. I., Nobuhisa, I., Desh imaru, M., Nakai, M., Ogawa, T., ShimoHigashi, Y., Fukumaki, Y., Hattori M., Sakaki, Y., et al. (1995) Proc. Natl. Acad. Sci. U. S. A. 92 5605-5609. Nak05 Nakamura, Y. (2005) Codon Usage Database Department of Plant Research, Kazusa DNA Research Institute, http://www.kazusa.or.jp/codon/ accessed August 2004. Nee70 Needleman, S. B., and Wunsch, C. D. (1970) J. Mol. Biol. 48 443-453. Neh94 Neher, E. (1994) Proc. Natl. Acad. Sci. U. S. A. 91 98-102. Nei97 Nei, M., Gu, X., and Sitnikova, T. (1997) Proc. Natl. Acad. Sci. U. S. A. 94 77997806. Nei00 Nei, M., and Kumar, S. (2000) Molecular Evolution and Phylogenetics Oxford University Press, New York City, NY. Nik99 Nikaido, A., Rooney, P., and Okada, N. (1999) Proc. Natl. Acad. Sci. U. S. A. 96 10261-10266. Oha30 OHarra, C. C. (1930) Science 71 341. Olm99 Olmea, O., Rost, B., and Valencia, A. (1999) J. Mol. Biol. 293 1221-1239. Pag98 Page, R. D. M., and Holmes, E. C. (1998) Molecular Evolution: A Phylogenetic Approach Blackwell Sciences, Oxford, England. Pam93 Pamilo, P., and Bianchi, N. O. (1993) Mol. Biol. Evol. 10 271-281. Pat90 Patry, M. (1990) Babiroussa: Une Vie jusquau bout du Rve Fixot, Paris, France. Pel00 Peltier, M. R., Raley, L. C., Liberles, D. A., Benner, S. A., and Hansen, P. J. (2000) J. Exp. Zool. 288 165-174. Pic86 Pickford, M. (1986) Ter. Res. Spec. Pap. 7 1-83. Pol98 Pollock, D. D. (1998) Theor. Popul. Biol. 54 78-90. Pop82 Pope, W. F., Maurer, R. R., and Stormshak, F. (1982) Biol. Reprod. 27 575-579. Ran96 Randi, E., Lucchini, V., and Diong, C. H. (1996) J. Mammal. Evol. 3 163-194. Ros76 Rossman, M. G., and Argos, P. (1976) J. Mol. Biol. 105 75-95.

PAGE 149

149 Ros94 Rost, B., Sander, C., and Schneider, R. (1994) CABIOS 10 53-60. Ros96 Rose, K. D. (1996) Proc. Natl. Acad. Sci. U. S. A. 93 1705-1709. Rot96 Rothschild, M., Jacobson, C., Vaske, D., Tuggle, C., Wang, L., Short, T., Sasaki, S., Vincent, A., McLaren, D., et al. (1996) Proc. Natl. Acad. Sci. U. S. A. 93 201-205. Sai87 Saitou, N., and Nei, M. (1987) Mol. Biol. Evol. 4 406-425. Sch89 Schmidt, C. R. (1989) in Grizmeks Encyclopedia of Mammals (Parker, S. P., Ed.) pp 20-47, McGraw Hill, New York City, NY. Shi94 Shindyalov, I. N., Kolchanov, N. A., and Sander, C. (1994) Prot. Engng. 7 349358. Sho81 Shoji, S., Parmelee, D. C., Wade, R. D ., Kumar, S., Ericsson, L. H., Walsh, K. A., Neurath, H., Long, H. L., Demaille, J. G., et al. (1981) Proc. Natl. Acad. Sci. 78 848-851. Sim97 Simpson, E. R., Michael, M. D., Agar wal, V. R., Hinshelwood, M. M., Bulun, S. E., and Zhao, Y. (1997) FASEB J. 11 29-36. Smi81 Smith, T. F., and Waterman, M. S. (1981) J. Mol. Biol. 147 195-197. Sta90 Stackhouse, J., Presnell, S. R., McGeeha n, G. M., Nambiar, K. P., and Benner, S. A. (1990) FEBS Lett. 262 104-106. Ste84 Sternberg, M. J. E., and Taylor, W. R. (1984) FEBS Lett. 175 387-392. Ste87 Stewart, C. B., Schilling, J. W., and Wilson, A. C. (1987) Nature 330 401-404. Swo98 Swofford, D. L. (1998) PAUP 4.0 Phylogenetic Analys is Using Parsimony (and Other Methods) Sinauer Associates, Sunderland, MA. Tak00 Takahashi, K., and Nei, M. (2000) Mol. Biol. Evol. 17 1251-1258. Tau97 Tauer, A., and Benner, S. A. (1997) Proc. Natl. Acad. Sci. 94 53-58. Tay84 Taylor, W. R., and Thornton, J. M. (1984) J. Mol. Biol. 173 487-514. Tay86 Taylor, W. R. (1986) J. Mol. Biol. 188 233-258. Tay94 Taylor, W. R., and Hatrick, K. (1994) Prot. Engng. 7 341-348. Ter91 Terashima, M., Toda, K., Kawamoto, T., Kuribayashi, I., Ogawa, Y., Maeda, T., and Shizuta, Y. (1991) Arch. Biochem. Biophys. 285 231-237. Tho92 Thorne, J. L., Kishino, H., and Felsenstein, J. (1992) J. Mol. Evol. 34 3-16.

PAGE 150

150 Tho94 Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) Nucl. Acids Res. 22 4673-4680. Tho96 Thomsen, P. D., Hoyheim, B ., and Christensen, K. (1996) Cytogenet. Cell Genet. 73 203-208. Tra96 Trabesinger-Ref, N., Jermann, T. M., Zankel, T. R., Durrant, B., Frank, G., and Benner, S. A. (1996) FEBS Lett 382 319-322. Val98 Vallet, J. L., Christenson, R. K., Trout, W. E., and Klemcke, H. G. (1998) J. Animal Sci. 76 2657-2670. Van89 van der Meulen, J., te Kronnie, G., van Deursen, R., and Geelen, J. (1989) J. Reprod. Fertil. 87 783-788. Wie86 Wierenga, R. K., Terpstra, P., and Hol, W. G. J. (1986) J. Mol. Biol. 187 101107. Wig91 Wigley, D. B., Davies, G. J., Dodson, E. J., Maxwell, A., and Dodson, G. (1991) Nature 351 624-629. Wig98 Wigger, M. (1998) ETH Dissertat ion 12929, Swiss Federal Institute of Technology, Zrich, Switzerland. Wil00 Williams, P. A., Cosme, J., Sridhar, V., Johnson, E. F., and McRee, D. E. (2000) Mol. Cell 5 121-131. Wil02 Wilson, M. E. (2002) J. Animal Sci. 80 (E Suppl. 2). Wol02 Wolff, A. C. (2002) Curr. Opin. Oncol. 14 600-608. Yan97 Yang, Z. (1997) CABIOS 13 555-556. Yan00 Yang, Z. H., and Bielawski, J. P. (2000) Trends Ecol. Evol. 15 496-503. Zha96 Zhang, J., Ziqian, H., Tame, J. R. H., Lu, G., Zhang, R., and Gu, X. (1996) J. Mol. Biol. 255 484-493. Zha97 Zhang, J., and Kumar, S. (1997) Mol. Biol. Evol. 14 527-536. Zuc65 Zuckerkandl, E., and Pauling, L. (1965) in Evolving Genes and Proteins (Bryson, V., and Vogel, J., Eds.) pp 97-166, Academic Press, New York City, NY.

PAGE 151

151 BIOGRAPHICAL SKETCH David Schreiber was born in Cleveland, Ohio, December 31, 1969. He attended three different elementary schools, the first in Universi ty Heights and the last two in Shaker Heights, before moving to Florida, in 1980. He completed most of his pre-colleg e education in Coral Springs; the last two years of high school were completed in Miami Beach. He started his undergraduate education at the Un iversity of Chicago but left after having completed only two quarters. He returned to Florida and attended the University of Florida, receiving a Bachelor of Science degree with Honors in December, 1991. Af ter a number of years working at Cornell University Medical College, New York City, New York, in the labs of Hugh Roberston and Andrea Branch, he enrolled in the chemistry docto ral program at the University of Florida.


Permanent Link: http://ufdc.ufl.edu/UFE0016940/00001

Material Information

Title: The Bioinformaticist's Toolbox in the Post-Genomic Age: Applications and Developments
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0016940:00001

Permanent Link: http://ufdc.ufl.edu/UFE0016940/00001

Material Information

Title: The Bioinformaticist's Toolbox in the Post-Genomic Age: Applications and Developments
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0016940:00001


This item has the following downloads:


Table of Contents
    Title Page
        Page 1
        Page 2
    Acknowledgement
        Page 3
    Table of Contents
        Page 4
        Page 5
    List of Tables
        Page 6
    List of Figures
        Page 7
    List of abbreviations
        Page 8
        Page 9
    Abstract
        Page 10
        Page 11
    Exobiology and post-genomic science
        Page 12
        Page 13
        Page 14
        Page 15
        Page 16
        Page 17
        Page 18
        Page 19
        Page 20
        Page 21
        Page 22
        Page 23
        Page 24
        Page 25
        Page 26
        Page 27
        Page 28
        Page 29
        Page 30
        Page 31
        Page 32
        Page 33
        Page 34
        Page 35
        Page 36
        Page 37
        Page 38
        Page 39
        Page 40
    Extending the PBL method for estimating numbers of synonymous and nonsynonymous mutations to account for ambiguity in inferred ancestral sequences
        Page 41
        Page 42
        Page 43
        Page 44
        Page 45
        Page 46
        Page 47
        Page 48
        Page 49
        Page 50
        Page 51
        Page 52
        Page 53
        Page 54
        Page 55
    The adaptive evolution database (TAED): Application of the extended PBL method to a large dataset
        Page 56
        Page 57
        Page 58
        Page 59
        Page 60
        Page 61
        Page 62
        Page 63
        Page 64
        Page 65
        Page 66
        Page 67
    Detecting compensatory covariation signals in protein evolution using reconstructed ancestral sequences
        Page 68
        Page 69
        Page 70
        Page 71
        Page 72
        Page 73
        Page 74
        Page 75
        Page 76
        Page 77
        Page 78
        Page 79
        Page 80
        Page 81
        Page 82
        Page 83
        Page 84
        Page 85
        Page 86
        Page 87
        Page 88
        Page 89
        Page 90
        Page 91
        Page 92
        Page 93
        Page 94
        Page 95
        Page 96
        Page 97
        Page 98
    The planetary biology of cytochrome p450 aromatases
        Page 99
        Page 100
        Page 101
        Page 102
        Page 103
        Page 104
        Page 105
        Page 106
        Page 107
        Page 108
        Page 109
        Page 110
        Page 111
        Page 112
        Page 113
        Page 114
        Page 115
        Page 116
        Page 117
        Page 118
        Page 119
        Page 120
        Page 121
        Page 122
        Page 123
        Page 124
    Appendix A: Names and abbreviations of nucleic acid bases and amino acids
        Page 125
        Page 126
        Page 127
        Page 128
        Page 129
        Page 130
        Page 131
        Page 132
        Page 133
        Page 134
        Page 135
        Page 136
        Page 137
        Page 138
        Page 139
        Page 140
    References
        Page 141
        Page 142
        Page 143
        Page 144
        Page 145
        Page 146
        Page 147
        Page 148
        Page 149
        Page 150
    Biographical sketch
        Page 151
Full Text





THE BIOINFORMATICIST'S TOOLBOX IN THE POST-GENOMIC AGE: APPLICATIONS
AND DEVELOPMENTS


















By

DAVID R. SCHREIBER


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

































Copyright 2007

By

David R. Schreiber









ACKNOWLEDGMENTS

Tremendous thanks go to James Deyrup-for the opportunity, the faith, understanding,

generosity, goodwill, humanity and love. Thanks also go to Steve Benner-for the freedom, the

enthusiasm, the privilege, and for an absurd portion of patience. And thanks go to my family-

for being there, wherever "there" chose to be.









TABLE OF CONTENTS



A C K N O W L E D G M E N T S ...............................................................................................................3

L IS T O F T A B L E S ................................................................................................. ..................... 6

LIST OF FIGURES ................................. .................... .7

L IST O F A B B R E V IA TIO N S .............................................. ...............................................8.......

A B S T R A C T ............... ................................................................ .......................................... 10

EXOBIOLOGY AND POST-GENOMIC SCIENCE ..............................................................12

In tro d u c tio n ............................................................................................................................. 1 2
The Conventional Evolutionary Paradigm ........................................................ ................ 13
Post-Genomic Science: Modeling Molecular Evolution.................................................. 17
The M arkov M odel ...................................................... .................. ............... 17
Non-Markovian Protein Evolution as a Post-Genomic Tool for Structure Prediction.... 18
Structure Prediction as a Tool for Identifying Long Distance Homologs.................... 18
Recruitment of Function ................. ........ ............................................ 24
Correlating the Paleontological Record with Episodes of Sequence Evolution..................30
Identification of in vitro Behaviors that Contribute to Physiological Function. ..................32
Structure Prediction and a Rapidly Searchable Database..................................................33
C o n clu sio n s............................................................................................................. ........ .. 3 6

EXTENDING THE PBL METHOD FOR ESTIMATING NUMBERS OF SYNONYMOUS
AND NONSYNONYMOUS MUTATIONS TO ACCOUNT FOR AMBIGUITY IN
INFERRED AN CESTRAL SEQUEN CES ............................................................................41

Introduction ........................ ......................................... ....... ..................... 41
A n O verview of the PB L M ethod.................................................................... ................ 42
Extending the PBL Method to Phylogenetic Trees with Inferred Ancestral Sequences:
Putting A m biguity B ack into the Equations .................................................. ................ 45

THE ADAPTIVE EVOLUTION DATABASE (TAED): APPLICATION OF THE
EXTENDED PBL METHOD TO A LARGE DATASET................................................56

B ack g rou n d ........................................................................................................... ........ .. 56
R results ........................................................................................................... 60
D iscu ssio n .............................................................................................................. ........ .. 6 3
C o n clu sio n s............................................................................................................ ........ .. 6 4
M materials and M methods .............. .............................................................................. 66

DETECTING COMPENSATORY COVARIATION SIGNALS IN PROTEIN
EVOLUTION USING RECONSTRUCTED ANCESTRAL SEQUENCES ........................ 68









In tro d u c tio n ............................................................................................................................. 6 8
R e su lts ................. ....... ............................. .................................................. .......... . ....... 7 3
Charge Compensation and the Surface of the Folded Protein....................................75
Charge Compensation in Both Contiguous and Non-Contiguous Position Pairs............76
Enhancing the Charge Compensation Signal .............. ................................................. 77
Charge Compensation in Specific Secondary Structural Elements...............................78
Charge Compensation in Buried Residues.................................................... 79
D discussion .................... ... ..... .......... ................... .. .... ................... 80
Accounting for the Stronger Signal From Node-Node Comparison............................. 81
A Model-Independent Method to Evaluate an Evolutionary Tree ............................... 82
Darwinian Requirements for Compensatory Covariation.........................................84
M e th o d s ........................................................................................................ .................... 8 7

THE PLANETARY BIOLOGY OF CYTOCHROME P450 AROMATASES ........................... 99

B a c k g ro u n d ............................................................................................................................. 9 9
R results ........................................................................................................... 10 1
D discussion .................................................................................................... 106
C conclusions ................................................................................................. 112
M e th o d s ........................................................................................................ ......... . ....... 1 14

NAMES AND ABBREVIATIONS OF NUCLEIC ACID BASES AND AMINO ACIDS....... 125

AN IMPLEMENTATION OF THE EXTENDED PBL METHOD CODED IN JAVA 1.1 ......126

L IS T O F R E F E R E N C E S ............................................................................................................. 14 1

B IO G R A PH IC A L SK E T C H .................................................... ............................................. 151









LIST OF TABLES


Table page

3-1 A sam ple listing from TA E D ........................................... .......................... ................ 67

4-1 Frequencies of the average contiguous position pairs ..................................................96

4-2 List of 71 protein fam ilies used in this analysis............................................ ................ 97

5-1 Frequency distributions of stem pig duplication substitutions ..................................123

5-2 Distributions of stem pig duplication substitutions .......... ....................................124

A-i Names and abbreviations of nucleic acid bases.......... .......................................125

A-2 One and three letter symbols for the amino acids.......... .......................................125









LIST OF FIGURES


Figure page

1-1 Predicted surface, interior, and secondary structure assignments.................................38

1-2 Evolutionary tree showing the evolutionary history of the leptins...............................39

1-3 Evolutionary tree showing the evolutionary history of the extracellular........................40

2-1 Two possible ways to get from TCA to CAA..................................................... 49

2-2 A hypothetical phylogeny for sequences tl, t2 and t3. ................................ ................ 50

2-3 Parsimony for each of the nucleotides in sequences tl, t2 and t3................................51

2-4 EPSs for ancestral node a2. ........................................................... ................ 52

2-5 Sequences tl, t2 and t3, numbers of degenerate sites, and average degeneracies ............53

2-6 Numbers of transitions and transversions for each pair of sequences .............................54

2-7 Tallies of average degeneracies and mutations for each pair of sequences....................55

4-1 A leaf-leaf comparison (red) traverses more evolutionary distance ................................89

4-2 Attempted detection of charge compensatory covariation signal using leaf-leaf ...........90

4-3 Detecting charge compensatory covariation signal using explicitly reconstructed...........91

4-4 Predicting secondary structure using contiguous pairs of compensatory changes ...........92

4-5 Distribution of distances between charge anti-compensatory pairs...............................93

4-6 Surface accessibility of charged residues ................................................... 94

4-7 A schematic illustration of the use of compensatory covariation to select.....................95

5-1 Reactions catalyzed by aromatases on multiple androgenic substrates .........................117

5-2 D eating the pig duplication events. ........................................................1...... 18

5-3 The amino acid alignment of exon 4 of two aromatase isoforms...............................119

5-4 Cladogram of the order Artiodactyla showing the extant families............................... 120

5-5 Phylogenetic tree for the 18 vertebrate aromatase genes...................... .................. 121

5-6 The distribution of amino acid replacements on the tertiary structure of cytochrome ....122









LIST OF ABBREVIATIONS

ASA accessible surface area

BLAST basic local alignment search tool

CASP Critical Assessment of Structure Prediction

C-beta beta carbon atom

cDNA complementary DNA

CUTG Codon Usage Tabulated from GenBank

DNA deoxyribonucleic acid

DSSP The Dictionary of Protein Secondary Structure

EPS equally parsimonious solution

HSP heat shock protein

Indel insertion (or) deletion

JTT Jones-Taylor-Thornton matrix substitution model

KA number of nonsynonymous substitution per nonsynonymous site

kcat catalytic constant

KM Michaelis constant

Ks number of synonymous substitution per synonymous site

Ma million years ago

MC the MasterCatalog

MHC major histocompatibility complex

mRNA messenger RNA

MSA multiple sequence alignment

NMR nuclear magnetic resonance

ORF open reading frame

II product









PAM point-accepted mutation

PAML Phylogenetic Analysis by Maximum Likelihood

PAUP Phylogenetic Analysis Using Parsimony

PBL Pamilo, Bianchi and Li

PCR polymerase chain reaction

PDB The Protein Data Bank

PIE Proto-Indo-European

RNA ribonucleic acid

RNase ribonuclease

Y sum

SH2 Src homology 2 domain

Src sarcoma

t2 half-life

TAED The Adaptive Evolution Database

TREx transition redundant exchange









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

THE BIOINFORMATICIST'S TOOLBOX IN THE POST-GENOMIC AGE: APPLICATIONS
AND DEVELOPMENTS

By

David R. Schreiber

May 2007

Chair: Steven A. Benner
Major Department: Chemistry

Bioinformaticists of the post-genomic age, having at their disposal the complete sequence

of the human genome, find themselves in possession of a true embarrassment of riches-a

massive collection of data that grows ever larger with each new submission. To effectively mine

this glut of information, the tools of bioinformatics must adapt, and new ones must be developed,

tested and applied. Herein we describe a modification of a method, first developed by Pamilo,

Bianchi, and Li for estimating mutation rates between the protein sequences from two extant

species which allows for comparisons with proteins from ancestral species. In this case, the

ancestral sequences are determined by parsimony, but the method is applicable to any

phylogenetic method used to deduce ancestral states from extant species data. Computationally

cheap, the parsimony method lends itself to the analysis of large datasets. We conducted such an

experiment on a version of the GenBank database. The results were collected in a new database

called The Adaptive Evolutionary Database (TAED), whose purpose was to be a collection of

proteins that have rapidly (and presumably adaptively) evolved at some point in their natural

history. In addition to discovering adaptive episodes, we used this method to search for

compensatory covariation signals within another large database of sequences-signals that

provide clues relating to the proteins' higher-order structures. These methods are used as well in









interdisciplinary investigations that incorporate information from the paleontological and

geological records. This provided high-level functional information relating to cytochrome P450

aromatases and their role in Suoidea evolution.









CHAPTER 1
EXOBIOLOGY AND POST-GENOMIC SCIENCE

Introduction

The 1990s were bountiful years for genomics. At their midpoint, complete sequences were

available for the genomes of several eubacteria, an archaebacterium, and a eukaryote (yeast).

Before the decade was over, a half dozen additional bacterial genomes would be added to this

list, together with the complete genome of Caenorhabditis elegans, and three years later that of

Homo sapiens. In the 21st century, the first century of the post-genomic era, sequences will be

added from hundreds to thousands of other organisms, collected with varying degrees of

systematic effort [Ben98].

As these data have accumulated, it has become increasingly apparent that new methods are

needed to exploit the information that they (must) contain. Chemistry has always been driven by

the discovery of new natural products, elucidation of their structures, and exploration of their

behaviors. The genome database provides an enormous new collection of natural product

structures to study. These display every behavior of interest to chemists: conformation,

supramolecular organization, combinatorial assembly, photochemistry, and catalysis are just a

few. Organic chemistry should be revolutionized by genomic data. But how?

A similar question can be framed for the biomedical sciences. Pharmaceutical and genome

corporations worldwide are collecting and cataloging sequence data, comparing the expressed

genetic inventory of diseased and normal tissues, and attempting to correlate genomic data with

physiological function. It seems that the treatment of human disease should be revolutionized by

genomic sequence data. But it is not obvious precisely how this will happen.

One consequence of genomic sequences from a variety of organisms is the ready

availability of the evolutionary histories of families of proteins represented within the genome









sequence databases. Sequence data will be organized as a set of independently evolving protein

sequence "modules." For each of these, an evolutionary history can be built that will consist of a

multiple alignment of the sequences of the proteins in the module themselves (as well as their

encoding DNA sequences), an evolutionary tree, and a reconstructed ancestral DNA and protein

sequence for each branch point in the tree [BenOO]. The tools described herein are designed to

exploit evolutionary histories directly, extracting information concerning protein structure,

behavior, and function from a detailed understanding of how protein sequences divergently

evolve under functional constraints. In a post-genomic world, with volumes of sequence data

from an ever growing number of organisms, these tools will be used widely to learn more from

sequence data about living systems, their chemistry and their diseases.

The Conventional Evolutionary Paradigm.

Since the mid 1970s, it has been known that homologous proteins-those related by

common ancestry-have analogous conformations (folds) [Ros76, Cho86], at least in their core

domains. From this observation has emerged many tools, including profile analysis, homology

modeling, and threading [Gri87, May95], that help biological chemists model the conformation

of a protein from the conformation of one of its homologs.

If homologous proteins have analogous conformations, it might be reasoned that

homologous proteins might be analogous in other ways as well. Biomolecular behavior depends

in part on conformation. Thus, two homologous proteins with analogous conformations might

have analogous behaviors. The reasoning can be carried further. As biomolecular behavior is

important for biomolecular function, perhaps two homologous proteins will also have analogous

functions.

This train of logic has been viewed as a starting point for analyzing genomic data. Genome

sequencing projects generate sequences of proteins without any supporting experimental data









describing the protein itself, data that accompany most sequences generated in classical

biochemical research. While applications for genomic sequences are many, virtually all require

that a structure, behavior, or function be assigned to proteins known only as sequences. Chemical

theory is insufficient to allow us to assign structure, behavior, or function directly to a sequence.

But bioinformatic theory is adequate to use the genomic sequence to identify homologous

proteins in the database. Some of these homologs may have known structures, behaviors, or

physiological functions. If homology implies structural analogy, and (from there) behavioral

analogy, and (from there) functional analogy, then the task of assigning structure, behavior, or

function to the genomic data should become trivial whenever a homolog with a known structure,

behavior, or function can be identified.

Many tools for analyzing genomic sequence data are based on this logic, which we shall

call the "conventional evolutionary paradigm." The conventional evolutionary paradigm is

implemented throughout bioinformatics research in various simple ways. Consider the following

recipe, used in a variety of commercial software packages to analyze a new sequence for a new

protein collected in a genome project:

* The new sequence is first recorded.

* The sequence is then used as input into a basic local alignment search tool (BLAST) which
finds putative homologs within the existing sequence database.

* Proteins in the database that have sequences similar to the new sequence (by some scoring
criterion) are recorded. These are putative homologs of the new sequence.

* The "functions" of the putative homologs are read from their documentation in the database.

* The function of the new sequence is presumed to be the same as that of the homologs.

This approach has many well known limitations. First, a BLAST search need not find an

analogous sequence in the database whose function is already known. In many cases, a BLAST

search fails to find any sequence in the database with a statistically significant sequence









similarity. If no putative homolog can be identified, then it follows that no homolog with known

structure, behavior or function can be identified.

In other cases, a BLAST search identifies one or more putative homologs, but the

homologs have not been studied sufficiently to assign a structure, behavior, or function. Again,

the paradigm fails to resolve the problem.

More frequently, the BLAST search identifies many possible homologs in the database, all

with statistically marginal sequence similarities. The approach frequently presents the user with

the documentation from several of these, and leaves the user to decide which, if any, of the

putative functions are correct. As the literature shows, weak sequence similarities reflecting

distant homologies have been both profoundly informative and profoundly deceptive [Ben91].

Very frequently, the functions of these putative homologs are widely different, making it

difficult to decide which, if any, function should be imputed to the probe sequence.

These problems are all well recognized as limitations of the conventional evolutionary

paradigm. Less well recognized, however, are problems with the elements of the logic upon

which the conventional evolutionary paradigm is based. Proteins with analogous folds need not

have analogous behaviors. Proteins with analogous behaviors need not have analogous functions.

As has been reviewed in detail [Ben88b, Ben89a, Ben89c, Ben90], evolution is a potent process

for recruiting old protein folds to perform new functions, and many cases are known where the

conventional evolutionary paradigm would generate (indeed, has generated) incorrect

conclusions. For example, fumarase (functioning in the citric acid cycle), adenylosuccinate lyase

(functioning in nucleotide biosynthesis) and aspartate ammonia lyase (functioning in amino acid

metabolism) are all identified as homologs by a BLAST search. Yet their behaviors are

analogous only at the level of organic reaction mechanism, and there only at the most abstract









level. The conventional evolutionary paradigm fails to suggest correct conclusions in this case.

Work by Babbitt, Kenyon, Gerlt, and Petsko has added other fascinating examples of how

microorganisms can recruit a common fold to catalyze a variety of reactions from the

racemization of mandelic acid to the opening/closing of a lactone [Bab95].

In proteins involved in so-called advanced functions (for example, developmental biology)

in more complex organisms, difficulties with the homology-implies-analogous-

structure/behavior/function assumption underlying the conventional evolutionary paradigm

become confounding. For example, protein serine kinases and protein tyrosine kinases are clearly

identifiable homologs at the level of sequence similarity. The chemist would say that both

classes of protein operate using analogous reaction mechanisms, differing only in the source of

the oxygen nucleophile in the phosphoryl transfer reaction. The biologist would note that the

physiological function of the two classes of proteins are greatly different, however. For any

biomedical application, the biologist would be correct. The physiologically relevant differences

in behavior, central to the understanding of biological function phosphorylationn on tyrosine

versus phosphorylation on serine) cannot be inferred for one family from the other using the

conventional evolutionary paradigm.

The deeper the chemistry of developmental biology is probed in metazoa (multicellular

animals), the more it becomes apparent that function in the Darwinian sense can change with

very little change in sequence. For example, a variety of Src homology 2 (SH2) domains all bind

peptide sequences containing phosphotyrosine residues. The binding specificities of the different

SH2 domains are different, however, for the peptide sequences surrounding the phosphotyrosine

residues. It is these specificities that determine which protein binds to each individual SH2

domain. The physiologically relevant function of each SH2 domain centers on this pairwise









interaction. Thus, any statement of function for any particular SH2 domain must at least identify

its phosphotyrosine-containing partner. To assign function at this level, the conventional

evolutionary paradigm has little to say.

Post-Genomic Science: Modeling Molecular Evolution

These considerations suggest that if it is to be used to analyze genomic data, evolutionary

theory must be applied at a more sophisticated level than the level exploited by the conventional

evolutionary paradigm. Much of this sophistication is available at the present time. Let us

examine briefly how biomolecular evolution is modeled as a starting point to designing tools for

interpreting genomic data.

The Markov Model

Virtually all of contemporary genomic science is based in some sense on an analysis where

sequence data are treated simply as strings of characters. This analysis is based on a specific

model of sequence evolution at the level of proteins. Overwhelmingly, these models are in some

sense "Markovian." They assume, implicitly or otherwise, that variation in a protein sequence

occurs independently at each position in the sequence, that future mutations occur independently

of past mutations, and that a single substitution matrix adequately describes the relative

likelihood of each amino acid being replaced by any other amino acid during an episode of

molecular evolution. Further, simplifying assumptions are made when scoring gaps in a pairwise

alignment.

The primary virtue of this model is its simplicity. Yet the model is certainly false. Proteins

are not linear strings of independent characters. Rather, they are folded structures that have

arisen by divergent evolution under constraints imposed by natural selection seeking adaptive

function. Because real proteins fold in three dimensions, mutations at different positions in the

protein sequence are not independent [Coh94]. Nor are future mutations independent of past









mutations [Coh94]. Insertions and deletions (indels) display complex patterns that reflect

functional constraints on protein evolution [Ben93].

Here, the second virtue of Markovian models, their exactness, becomes useful. Markovian

models generate specific expectations. These can be explicitly violated by the behavior of real

proteins during divergent evolution, and the extent of the violation can frequently be quantitated.

The differences between the expected and actual evolution of proteins contain clues concerning

how proteins fold, function, and create new function. Let us examine some tools to extract these

clues.

Non-Markovian Protein Evolution as a Post-Genomic Tool for Structure Prediction

Tools that extract information from the non-Markovian behavior of proteins undergoing

divergent evolution have been exceptionally valuable for predicting the conformation of proteins

from an evolutionary history [Ben97]. These tools have made major strides towards solving a

problem that as recently as a decade ago was considered to be unsolvable. The approach is

described in detail previously [Ben89b, Ben91], and will not be described here. Rather, we shall

assume that reasonably accurate (if not perfect) models of protein secondary structure can be

predicted for a family of homologous protein sequences. We will then ask how these predictions

might be exploited to solve one of the problems noted above with the conventional evolutionary

paradigm: the need to have methods more powerful than simple sequence analysis to detect long

distance homologs.

Structure Prediction as a Tool for Identifying Long Distance Homologs

The core of a protein fold is conserved during divergent evolution long after the sequences

within the family have diverged so much that homology is no longer evident by sequence

analysis alone. In the mid 1970s, Rossman and his coworkers suggested that analogous folds in

two proteins might indicate that the proteins are homologs, even when the sequences of the two









proteins bear no statistically significant similarities [Ros76, Cho86]. This observation prompted

many groups to develop methods for building models from protein sequences starting from

marginal sequence similarities [Gri87, May95]. These are known as profile or threading

approaches.

In practical application, the primary difficulty with these approaches is that they generate

too many "hits," or matching of a probe sequence against a sequence database. The analyses

frequently return many proteins that are possible homologs. As has been shown in many joint

prediction projects (such as the Critical Assessment of Structure Prediction, or CASP projects

[Mou96]), these hits are sometimes correct and informative. Equally likely, however, the

putative homologs identified by a threading or profile analysis are not true homologs, or are

misaligned with true homologs using the analysis. Especially needed, therefore, are tools to

confirm or (especially) deny the possibility that a target identified from a sub-statistical sequence

similarity is a homolog.

In the early 1990s, the first example was presented where a structure prediction was used

to critically evaluate suggestions, based on sub-statistical sequence similarities, that two proteins

were homologous. The case concerned the protein kinases [Ben91]. Protein kinase contains the

sequence motif Gly-Xxx-Gly-Xxx-Xxx-Gly (where Xxx is any amino acid) preceded by a strand

and followed by a helix [Ste84]. A similar motif was found in adenylate kinase, where a crystal

structure was known. Therefore, following the conventional evolutionary paradigm, several

groups proposed that the two structures were homologs. This proposal implied in turn that

protein kinase would adopt the same fold as adenylate kinase. Several groups built models based

on this inference [Tay86, Tay84, Wie86, Sho81].









Contrasting with this view was a model built for the secondary and tertiary structure of the

protein kinase family based on the evolutionary history of the protein. In this model, the Gly-

Xxx-Gly-Xxx-Xxx-Gly motif was predicted to be flanked by beta strand both before and

afterwards. Thus, the predicted secondary structural model of protein kinase was not congruent

with the experimental structure of adenylate kinase. Accordingly, the prediction noted that the

two folds could not be the same. From this, it was inferred that the two proteins could not be

homologs.

This alternative model based on post-genomic methods was later shown to be correct by a

subsequently determined crystal structure [Kni91]. This was perhaps the first time that a

predicted structure and a motif analysis had been used to infer the absence of homology between

two families catalyzing analogous chemical reactions. The tool used in the protein kinase

prediction proved to be an example of a more general tool to confirm or deny long distance

homology between two protein families.

Using this tool, core secondary structural elements of the protein families are aligned

sequentially. In this process, the secondary structural elements are considered to be congruent

when every core element from one family finds a core element in the other of the same type

(helix or strand), in the same order, where gaps matched against non-core elements (where a

non-core element in one family is not aligned against any element in the other) are allowed in

any number, and a core element in one protein may be missing in the other, but may not be

aligned with a core element in the other of a different type (i.e., helix against strand).

This approach is especially powerful when coupled with motif analysis, as was done for

protein kinase. Congruent secondary structural elements can indicate whether a motif is a

significant indicator of homology or not. Thus, flanking a motif, a secondary structural model









might have one of four forms: helix-motif-helix, helix-motif-strand, strand-motif-helix, and

strand-motif-strand. The secondary structural alignments are congruent if and only if the

secondary structural elements flanking the motifs correspond between the two proteins.

Homology is not denied if and only if the secondary structures are congruent. This method is

preferably applied when each family contains proteins that are at least 120 point accepted

mutations per 100 amino acids (PAM units) divergent.

Important to this tool is a distinction between core and non-core secondary structural

elements. The failure of core secondary structural elements to correspond between two protein

families is a clear contradiction of homology. But non-core elements need not be conserved over

long periods of divergent evolution, and failure to find a correspondence between non-core

elements from one protein family in another does not exclude homology. Several ways of

defining "core" exist in this context. Some involve crystal structure data; others do not.

When an experimental structure is known for a protein (for example, by crystallography or

nmr), core elements are conveniently defined geometrically; a core element is one where a

substantial fraction is buried in the protein fold unexposed to solvent water. Most useful is the

application of this concept to beta strands in a beta sheet. A core strand is one that forms

backbone hydrogen bonding interactions with two other strands on both of its edges. Thus, a core

strand is distinct from an edge strand, which forms backbone hydrogen bonds to only one other

strand on only one if its edges. Core strands are highly conserved during divergent evolution. If

they are lost, the sheet in which they participate cannot be conserved.

A more general definition of a core secondary structural unit focuses on the evolutionary

stability of the secondary structural unit. For the purpose of detecting long distance homologs, a

core secondary structural element, predicted or otherwise, is one that cannot be lost during









divergent evolution without damaging the integrity of the protein fold. This is based on notions

of continuity in protein evolution, most fundamentally on the assumption that a protein that has

one topology of protein fold (e.g., an eight fold alpha-beta barrel) cannot by continuous

evolutionary processes be converted into a protein with another (e.g., an immunoglobulin fold).

It is clear that divergence of biological function can add or subtract peripheral secondary

structural elements to create or remove contact elements, expand or eliminate binding sites, or to

modify the performance of the protein in other fashions.

In this case, non-core segments of a protein family are recognized from a family of

sequences, preferably between 100 and 150 PAM units divergent for the most divergent pairs,

where regions that are deleted. If a segment (including a segment containing a helix or a strand)

is deleted in a protein family built from members all sharing significant sequence similarity, it

cannot be essential for the integrity of the fold in the family. In applying this tool, one must be

concerned about database mistakes; a part of sequence that is deleted because the scientist

providing the entry into the database neglected to collect it, or neglected to enter it, is not a

deletion from the purpose of detecting non-core segments.

A third method for identifying a core segment of a protein sequence is applicable to any

alignment containing three or more sequences. In the tool, a pairwise alignment is constructed

for each pair of sequences in the set using a dynamic programming tool. Consider for example a

set of sequences with three proteins, A, B and C. A core segment of the multiple alignment is

defined as those regions where the alignment of A with B and the alignment of B with C is

consistent with the alignment of A with C.

A final method relates to the reconstructed ancestral sequence of the protein. It has long

been appreciated [Zuc65] that when the sequences of two or more homologous proteins are









available, it is possible to construct a probabilistic model for the sequence of the ancestral

protein. The part of the ancestral sequence that is reconstructed with high probability is the core

of the protein. These reconstructions are done by well-known maximum likelihood tools (for

example, as implemented within Darwin, available via the web at the address

http://cbrg.inf.ethz.ch (see also [Gon91])). A core is defined from the ancestral sequences as a

segment of the multiple alignment where the average probability of the most frequent amino acid

at that position is greater than one standard deviation above the average probability of all of the

reconstructed positions in the multiple alignment. This is a tree-weighted measure of the

divergence in the family as a whole, and correlates with core regions defined in the other ways,

as the region of the ancestral sequence that is reconstructed with high probability is also the one

that has not suffered insertions and deletions, and the one that has seen relatively little sequence

divergence. These segments also correlate with core segments defined geometrically.

Structure predictions made using post-genomic tools have now been applied in several

cases to make statements about long distant homology and, from there, catalytic behavior. One

of the most dramatic was for the heat shock protein 90 (HSP 90) family. The predicted secondary

structural elements were assembled to yield a tertiary fold that resembled closely the fold

determined in the N-terminal fragment of DNA gyrase B (the ATPase fragment) [Wig91]. This

analysis was published before an experimental structure of HSP 90 was known [Ger97], and

after an experimental study had suggested that HSP 90 did not have catalytic activity as an

ATPase. The prediction thus generated a statement about catalytic behavior (and, presumably,

physiological function) in addition to secondary structure.

Post-genomic prediction tools were also applied to the family of ribonucleotide reductases

[Tau97]. Here, proteins with highly divergent sequences were shown to belong to one universal









family using a combination of protein structural analysis and gene sequencing. The tools in this

case confirmed a speculation by Stubbe and coworkers based on a mechanistic analysis that all

ribonucleotide reductases are related by common ancestry [Mao92].

Comparing predicted secondary structure models for a protein family is now a proven

technique with an impressive track record to detect or deny long distance homologs based on

sequence data alone. As genome projects are completed, and evolutionary histories for their

constituent protein families articulated, secondary structural models for those families should

become routinely available. Thus, these tools should solve the first problem with practical

implementation of the conventional evolutionary paradigm- the difficulty of detecting

homologs using standard alignment tools.

Recruitment of Function

Predicted secondary structures of a protein family offer an approach to solve the general

problem of finding distant homologs in a database. They do not, however, address other

problems associated with the conventional evolutionary paradigm for assigning function to

sequences those that arise when the assumptions underlying the conventional evolutionary

paradigm are invalid. This is especially the case when one protein family gives rise to proteins

with different function. This possibility can never be excluded a priori. In the case of the heat

shock protein 90, for example, the post-genomic prediction tool showed that the fold of HSP 90

was analogous to the fold of gyrase. It suggested (under the conventional evolutionary paradigm)

that the behavior and function of HSP 90 was analogous to the behavior and function of gyrase.

Some of these suggestions may in fact be true. But they need not be. The gyrase fold could be

recruited in HSP 90 to perform many other (indeed, any other) function.

The premise for the post-genomic sciences, however, is that the evolutionary histories of

protein families will become generally available as a result of massive genome sequencing









projects. Non-Markovian behavior is not only expected to arise because of folded structure. In

addition, patterns of sequence evolution are expected to violate the Markovian model differently

if the protein in question is undergoing an episode where function is being changed.

A study illustrated the power of this approach to identify the active site residues of

mammalian alcohol dehydrogenase (E.C. 1.1.1.1). Mammalian alcohol dehydrogenases have

undergone a rapid episode of sequence evolution in and around the active site as substrate

specificity has divergently evolved to handle xenobiotic substances in the liver. In contrast, over

a comparable span of evolutionary distance, the active site of yeast alcohol dehydrogenase has

changed very little, corresponding to an apparently constant role of the enzyme to act on the

ethanol-acetaldehyde redox couple. Indeed, by identifying positions in mammalian

dehydrogenases where amino acid variation was observed over a span of evolution where the

same residues were conserved in the yeast dehydrogenases provided a clear map of the active

site of the protein.

A particularly clever use of this approach has been described by Lichtarge et al. These

workers described an evolutionary trace method that defined functionally significant residues as

those that are conserved within a family [Lic96]. They then used this approach to identify

patches on the surface of proteins that contribute to functionality.

The approaches work because the scientist understands something about the function of a

protein family. Thus, the active site of alcohol dehydrogenase could be identified because the

scientist knew that substrate specificity is conserved in one branch of a protein family (yeast

alcohol dehydrogenases) but not in another (liver alcohol dehydrogenases). The evolutionary

trace method works only when the function being traced is conserved within the family. Neither

approach is applicable when the evolutionary status of the protein function (conserved or not









conserved) is not known. In general, this status will not be known to the post-genomic scientists

examining only genomic data. To apply these post-genomic tools, therefore, the post-genomic

scientist needs a tool for learning whether function is changing within the family in an episode of

protein sequence evolution one based on an analysis of the sequence data alone.

One tool is based on the fact that the genetic code is degenerate. More than one triplet

codon encodes the same amino acid. Therefore, a mutation in a gene can be either silent (not

changing the encoded amino acid) or expressed (changing the encoded amino acid). Especially in

multicellular organisms, and most particularly in multicellular animals (metazoa), silent changes

are not under selective pressure. In contrast, expressed changes at the DNA level, by changing

the structure of the protein that the gene encodes, change the property of the protein. This

frequently places these changes under selective behavior.

The outcome of different selection pressures on silent and expressed mutations is non-

Markovian behavior in the evolution of DNA sequences. Consider three cases. In the first, we

examine an episode of protein sequence evolution during a period of evolutionary history where,

at the outset of the period, the behavior of a protein whose sequence has been perfectly

optimized for a specific biological function, and where that function remains constant for the

protein throughout the period being examined. During this period, changes in the DNA sequence

that lead to a change in the sequence of the encoded protein (expressed changes) will diminish

the survival value of the protein [Ben88b] and therefore will be removed by natural selection.

Silent changes will not be removed by natural selection, but will accumulate at an approximately

clock-like rate, as silent changes are approximately neutral, especially in higher organisms. Thus,

the ratio of expressed to silent changes will be low during a period of evolution of a protein

family where the ancestor and its descendants share a common function.









A second case concerns a period of evolution where a protein is acquiring a new derived

function. Its amino acid sequence at the beginning of this episode will be optimized for the

ancestral function, rather than the derived function. To be optimally suited for the derived

function, therefore, amino acids must be changed in the protein. Thus, changes in the gene that

are expressed will have a chance of improving the behavior of the protein vis a vis its new

biological function, and these will be selected for. The ratio of expressed to silent substitutions at

the DNA level will be high, and a high expressed/silent ratio will reveal a period of evolution of

a protein family where the function of the ancestor is changing.

In a third case, consider the evolution of a gene encoding a protein that has no function (a

pseudogene, for example), neutrally drifting without functional constraints. In this case, the

expressed/silent ratio will reflect random introduction of point mutations. Given the genetic code

and a typical distribution of amino acid codons within the gene, a ratio of expressed to silent

changes will be approximately 2.5 during the period of evolution of a protein family where the

ancestor and its descendants have no function.

Stewart and Messier introduced this approach through a clever and systematic analysis of

conservation and variation within the lysozyme family [Mes97]. Episodes of rapid sequence

evolution were identified by high expressed/silent ratios in specific branches of the evolutionary

trees. These were correlated with specific episodes in the evolution of the protein itself and the

physiology of the organisms that contained the protein.

This approach can be illustrated quite easily in a biomedically interesting family of

proteins, using an "off the cuff" method to examine the protein leptin, a protein whose mutation

in mice is evidently correlated with obesity, and was previously known as the "obesity gene

protein." The protein has attracted substantial interest in the pharmaceutical industry, especially









after a human gene encoding a leptin homolog was isolated. According to the conventional

evolutionary paradigm, because it is a homolog of the mouse leptin, the human leptin must also

play a role in obesity, and might be an appropriate target for pharmaceutical companies seeking

human pharmaceuticals to combat this common condition in the first world.

DNA and protein sequences were retrieved for the genes encoding leptins. A multiple

alignment for the protein sequences was constructed for the DNA sequences and the protein

sequences. Congruent tress for both the DNA and protein sequences were then constructed, and

sequences at the nodes of the tree reconstructed using MacClade [Mad92] and the known

relationship between the organisms from which these sequences were derived. For the DNA

sequences, the biologically most plausible tree proved to be the most parsimonious tree as well.

The most parsimonious tree for the protein sequences proved not to be the most plausible tree

(by one change) from a biological perspective. The DNA tree was taken to be definitive because

of its consistency with the biological cladisticc) data.

A secondary structure prediction was made for the protein family. The evolutionary

divergence of the sequences available for the leptin family is small-only 21 PAM units (point

accepted mutations per 100 amino acids)-and predictions were biased to favor surface

assignments [Ben94]. Thus, positions holding conserved KREND were assigned as surface

residues, conserved H and Q were assigned to the surface as well, while positions holding

conserved CST were assigned as uncertain.

Five separate secondary structural elements were identified. The results are summarized in

Figure 1-1. A disulfide bond was presumed to connect positions 96 and 146. These secondary

structural elements can be accommodated by only a small number of overall folds. Interestingly,

the pattern of secondary structure in this prediction is consistent with an overall fold that









resembles that seen in cytokines such as colony stimulating factor [Hil93] and human growth

hormone [Dev92].

To decide whether evolutionary function may have changed under selective pressure

during the divergent evolution of the protein family, silent and expressed mutations were

assigned to individual branches on the evolutionary tree. For each branch of the tree, the sum of

the number of silent and expressed changes were tabulated, and the ratio of expressed to silent

changes calculated. These are shown in Figure 1-2.

The branches on the evolutionary tree leading to the primate leptins from their ancestors at

the time that rodents and primates diverged had an extremely high ratio of expressed to silent

changes. From this analysis, it was concluded that the biological function of leptins has changed

significantly in the primates relative to the function of the leptin in the common ancestor of

primates and rodents. This conclusion has several implications of importance, not the least being

for pharmaceutical companies asked whether they should explore leptins as a pharmaceutical

target. At the very least, it suggests that the mouse is not a good pharmacological model for

compounds to be tested for their ability to combat obesity in humans. The post-genomic analysis

suggests that a primate model must be used to test those compounds, with implications for the

cost of developing an anti-obesity drug based on the leptin protein.

Intriguingly, a tree can also be built for the leptin receptor (Figure 1-3). Here, the

evolutionary history is not so complete. In particular, fewer primate sequences are available for

the leptin receptor than for leptin itself. Thus, the reconstructed ancestral sequences are less

precise with the leptin receptor family, and the assignment of expressed and silent mutations to

the tree are less certain. Nevertheless, it appears that the leptin receptor has undergone an

episode of rapid sequence evolution in the primate half of the family as well. The example









illustrates how much sequence data is needed (much) to build reliable models of this nature, as

the ambiguity in the assignment of ancestral sequences makes it possible that the receptor was

evolving rapidly not only in the lineage leading to primates but also in the lineage leading to

mouse.

Nevertheless, the approximate correlation between the episode of rapid sequence evolution

in the leptin family and in the leptin receptor family suggests a tool that might become useful in

the advanced stages of post-genomic science when evolutionary histories are very well

articulated. Here, it might be possible to detect ligand-receptor relationships between protein

families in the database by a correspondence between their episodes of rapid sequence evolution.

Thus, ligand families should evolve rapidly (in a non-Markovian fashion) at the same time in

geological history as their receptors evolve. It will be interesting to identify more sequences for

primate leptin receptors to see if a more complete evolutionary history allows us to see more

clearly the co-evolution of the leptin receptor and leptin itself.

Correlating the Paleontological Record with Episodes of Sequence Evolution

As discussed above, detailed analyses of evolutionary histories frequently can provide a

solution to the most general problem of the conventional evolutionary paradigm, the difficulty in

routinely identifying a homolog of a target sequence with known function within the database.

By analysis of non-Markovian evolutionary behavior at the level of the protein, a model of

secondary structure can be predicted. This prediction can be used in turn to detect long distance

homologs in some cases and exclude the possibility of distant homology in others. This increases

the likelihood that a homolog will be found with a known structure, behavior, or function for a

new protein sequence. If one is found, then the logic associated with the conventional

evolutionary paradigm can be applied to generate a hypothesis concerning the behavior or

function of the protein.









The value of this post-genomic tool to assign behavior and structure to a target sequence

problem is expected to grow over the near term, as the ratio of sequences supported by

experimental studies to those not supported increases with the conclusion of genome projects,

and as more sequences increase the detail of the evolutionary histories that can be extracted from

the database directly, and therefore the quality of the predicted secondary structural model.

At the next level, analysis of non-Markovian behavior at the level of the gene can alert the

biological chemist that the logic associated with the conventional evolutionary paradigm might

not apply in individual cases. In particular, if an episode of rapid sequence evolution intervenes

in the evolutionary tree between the sequence of interest and the sequence with the known

behavior and function, the biological chemist is alerted to the possibility that the function of the

protein might have changed. This alert is useful even with close homologs, as illustrated in the

example with leptin.

But what if the evolutionary tree contains no protein with a sequence with assigned

function, even one with low sequence similarity? Even with more limited evolutionary histories,

post-genomic tools that analyze non-Markovian evolution at the level of the codon can be useful.

By identifying the organisms that provide the sequences at the leaves, or external nodes, of the

evolutionary tree, it is frequently possible to correlate branches in the evolutionary tree with

episodes in geological history, as determined from the fossil record. Especially in multicellular

animals (metazoa), the fossil record can provide approximate dates for the emergence of new

physiological function. In this case, it is possible to ask whether an episode of rapid sequence

evolution in a protein family (in particular, an episode with a high expressed/silent ratio)

occurred at the same time as a new physiological function emerged on earth. If so, a first level of









hypothesis about physiological function can be proposed, even if no behavior or function of any

kind is known for any of the modern proteins.

Perhaps the most transparent analysis of this type concerns proteins that underwent

massive radiative divergences in metazoa approximately 600 Ma. This is the time of the

Cambrian explosion, an episode in terrestrial history that marks the massive radiative divergence

of multicellular animals, including chordates. Proteins families undergoing rapid evolution at this

time (for example, of protein tyrosine kinases and src homology 2 domains) are almost certainly

involved in the basic processes by which multicellular animals develop from a single fertilized

egg.

This type of analysis might be applied in the family of ribonuclease (RNase) A

(E.C.2.7.7.16), a well known family of digestive proteins found in ruminants. The protein

underwent rapid sequence evolution approximately 45 Ma, a time where ruminant digestion

emerged in mammals [Jer95]. Thus, the rapid molecular evolution evident in the reconstructed

evolutionary history of this protein suggests that the protein is important for ruminant digestive

function.

Identification of in vitro Behaviors that Contribute to Physiological Function.

In vitro experiments in biological chemistry extract data on proteins and nucleic acids (for

example) that are removed from their native environment, often in pure or purified states. While

isolation and purification of molecules and molecular aggregates from biological systems is an

essential part of contemporary biological research, the fact that the data are obtained in a non-

native environment raises questions concerning their physiological relevance. Properties of

biological systems determined in vitro need not correspond to those in vivo, and properties

determined in vitro need have no biological relevance in vivo.









To date, there has been no simple way to say whether or not biological behaviors are

important physiologically to a host organism. Even in those cases where a relatively strong case

can be made for physiological relevance (for example, for enzymes that catalyze steps in primary

metabolism), it has proven to be difficult to decide whether individual properties of that enzymes

(kcat, KM, kinetic order, stereospecificity, etc.) have physiological relevance. Especially difficult,

however, is to ascertain which behaviors measured in vitro play roles in "higher" function in

metazoa, including digestion, development, regulation, reproduction, and complex behavior.

Analysis of non-Markovian behavior, as described above, permits the biological chemist to

identify episodes in the history of a protein family where new function is emerging. This

suggests a general method to determine whether a behavior measured in vitro is important to the

evolution of new physiological function. We may take the following steps:

a) Prepare in the laboratory proteins that have the reconstructed sequences corresponding to the
ancestral proteins before, during, and after the evolution of new biological function [Jer95],
as revealed by an episode of high expressed to silent ratio of substitution in a protein. This
high ratio compels the conclusion that the protein itself serves a physiological role, one that
is changing during the period of rapid non-Markovian sequence evolution.

b) Measure in the laboratory the behavior in question in ancestral proteins before, during, and
after the evolution of new biological function, as revealed by an episode of high expressed to
silent ratio of substitution. Those behaviors that increase during this episode are deduced to
be important for physiological function. Those that do not are not.

Structure Prediction and a Rapidly Searchable Database

The overarching problem with genomic sequence databases is their sheer size. This makes

them tedious to search, and nearly impossible to subject to exhaustive self-matching [Gon92].

Predicted structures can be connected with reconstructed evolutionary histories to resolve this

problem, to build a rapidly searchable database. Consider the following steps:

(a) A multiple alignment, an evolutionary tree, and ancestral sequences at nodes in the tree are
constructed for a set of homologous proteins.









(b) A corresponding multiple alignment is constructed by methods well known in the art for the
DNA sequences that encode the proteins in the protein family. This multiple alignment is
constructed in parallel with the protein alignment. In regions of gaps or ambiguities, the amino
acid sequence alignment is adjusted to give the alignment with the most parsimonious DNA tree.

(c) Mutations in the DNA sequences are then assigned to each branch of the DNA evolutionary
tree. These may be fractional mutations to reflect ambiguities in the sequences at the nodes of the
tree. When ambiguities are encountered, alternatives are weighted equally. Mutations along each
branch are then assigned as being "silent," meaning that they do not have an impact on the
encoded protein sequence, and "expressed," meaning that they do have an impact on the encoded
protein sequence. Fractional assignments are made in the case of ambiguities in the reconstructed
sequences at nodes in a tree.

(d) A prediction is then made for each protein family. A secondary structure is predicted for the
family, and this predicted secondary structure is aligned with the ancestral sequence at the root of
the tree. If the root of the tree is unassigned, the predicted secondary structure is aligned with the
ancestral sequence calculated for an arbitrary point near the center of gravity of the tree

(e) An ancestral sequence is then reconstructed at nodes on the tree, and at a point on the tree as
near as possible to its root the point on the tree representing the oldest (geologically) sequence.

Steps (a) through (e) provide a method to organize the protein sequence database in a

rapidly searchable form. The ancestral sequences and the predicted secondary structures

associated with the families defined by steps (a) through (e) are surrogates for the sequences and

structures of the individual proteins that are members of the family. The reconstructed ancestral

sequence represents in a single sequence all of the sequences of the descendent proteins. The

predicted secondary structure associated with the ancestral sequence represents in a single

structural model all of the core secondary structural elements of the descendent proteins. Thus,

the ancestral sequences can replace the descendent sequences, and the corresponding core

secondary structural models can replace the secondary structures of the descendent proteins.

This makes it possible to define two surrogate databases, one for the sequences, the other

for secondary structures. The first surrogate database is the database that collects from each of

the families of proteins in the databases a single ancestral sequence, at the point in the tree that

most accurately approximates the root of the tree. If the root cannot be determined, the ancestral









sequence chosen for the surrogate sequence database is near the center of gravity of the tree. The

second surrogate database is a database of the corresponding secondary structural elements. The

surrogate databases are much smaller than the complete databases that contain the actual

sequences or actual structures for each protein in the family, as each ancestral sequence

represents many descendent proteins. Further, because there is a limited number of protein

families on the planet, there is a limit to the size of the surrogate databases. Based on our work

with partial sequence databases [Gon92], we expect there to be fewer than 10,000 families as

defined by steps (a) through (e).

Searching the surrogate databases for homologs of a probe sequence proceeds in two steps.

In the first, the probe sequence (or structure) is matched against the database of surrogate

sequences (or structures). As there will be on the order of 10,000 families of proteins as defined

by steps (a) through (e) after all the genomes are sequenced for all of the organisms on earth,

there will be only on the order of 10,000 surrogate sequences to search. Thus, this search will be

far more rapid than with the complete databases. A probe protein sequence (or DNA sequence in

translated form) can be exhaustively matched [Gon92] against this surrogate database (that is,

every subsequence of the probe sequence will be matched against every subsequence in the

ancestral proteins) more rapidly than it could be matched against the complete database.

Should the search yield a significant match, the probe sequence is identified as a member

of one of the families already defined. The probe sequence is then matched with the members of

this family to determine where it fits within the evolutionary tree defined by the family. The

multiple alignment, evolutionary tree, predicted secondary structure and reconstructed ancestral

sequences may be different once the new probe sequence is incorporated into the family. If so,

the different multiple alignment, evolutionary tree, and predicted secondary structure are









recorded, and the modified reconstructed ancestral sequence and structure are incorporated into

their respective surrogate databases for future use.

The advantage of this data structure over those presently used is apparent. As presently

organized, sequence and structure databases treat each entry as a distinct sequence. Each new

sequence that is determined increases the size of the database that must be searched. The

database will grow roughly linearly with the number of organismal genomes whose sequences

are completed, and become increasingly more expensive to search.

The surrogate database will not grow linearly. Most of the sequence families are already

represented in the existing database. Addition of more sequences will therefore, in most cases,

simply refine the ancestral sequences and associated structures. In any case, the total number of

sequences and structures in their respective databases will not grow past ca. 10,000-the

estimate for the total number of sequence families that will be identifiable after the genomes of

all organisms on earth are sequenced. If a dramatically new class of organism is identified, this

estimate may grow, but not exponentially (as is the growth of the present database).

Conclusions

The evolutionary histories of protein families are now becoming routinely available

through genome sequencing projects. These histories comprise a multiple alignment for their

protein sequences and the corresponding DNA sequences, an evolutionary tree showing the

pedigree of these sequences, and reconstructed ancestral sequences for each node in the tree. In a

post-genomic world having genomic sequences from an unlimited number of organisms, these

histories will be used to connect structure, chemical reactivity, and physiological function to

these families.

Our work consists of the application and development of post-genomic tools that exploit

these evolutionary histories. Analysis of non-Markovian behavior at the level of the gene can









identify proteins within a family that have new functions. Systematic application of such

analyses to sequence databases can be used to create a database of rapidly evolving protein

families which could serve as a source of starting points for deeper study. Further, non-

Markovian patterns of replacement within a family can provide information relating to the form

(secondary and tertiary structure) and function of its constituent proteins. Coupling these

analyses with the paleontological record can suggest hypotheses for physiological function in

families where no function is suggested by any experimental work for any of its members.

Coupling these analyses with experimental work reconstructing ancestral proteins can identify

specific in vitro properties of the protein that are important for its physiological role. Last,

evolution-based data structures can be used to organize large sequence databases in a fashion that

allows them to be searched with extreme efficiency. Together, these post-genomic tools can join

with classical methods whose value is now becoming widely recognized [Hue97].
















010 020 030 040 050
I I I I I
VPIQKVQDDTKTLIKTIVTRINDISHTOSVSSKQKVTGLDFIELHPILT human
VPIQKVQDOTKTLIKTIVTRINDISHTQSVSSKQKVTGLDF ImLHPILT chimp
VPIQKVQDDTKTLIKTIVTRISDISHTQSVSSKQKVTGLDFIPaLHPILT gorilla
VPIQKVQDDTKTLIKTVITRINDISHTQSVSSKQKVTGLDFIEfaLHPILT orangutan
VPIQKVQSDTKTLIKTIVTRINDISHTQSVSSKQRVTGLDFI ELHPVLT rhesus
VPIHXCVQDDTKTLIKTIVTRINDISHTOSVSARQRVTGLDFIELHPILS rat
VPIHKVQDDTKTLIKTIVTRINDISHTQSVSARQRVTGLDFIPfLHPILS rat
VPIQKVQDDTKTLIKTIVTRINDISHTQSVSAKQRVTGLDF IPLHPI LS mouse
VPIWRVQDDTKTLIKTIVTRISDISHMQSVSSKQRVTGLFIEiLHPVLS pig
VPIRKVQDDTKTLIKTIVTRINDISHTOSVSM KQRVTGLDFIZQLHPLLS sheep
VPIRKVQDDTKTLIKTIVTRINDISHTQSVSSKQRVTGLDFIfiLHPLLS ox
VPIRKVQDDTKTLIKTIVARINDISHTQSVZ$KQRVAGLDFIELQPVLS dog
ipiSSisss?s?iis?Ii?siSsissisppppssSii?isiiPPispiis surf/int
< --------helix-------- > weak parse strong parse predict
< ------ helix 1 ------ > coil 1 expt
MRLFEVQLQSjFVLLALMISLFLLGS MKDIMMNWDDAGCAFVPEAFTFLC bact


060 070 080 090 100
LSKMDQTLAVYQQILTSbSPRNVIQISNDLENLRDLLHVLAPSKSCHLPW human
LSKMDQTLAVYQQILTSMEERNMIQISNDLENLRDLLHVLAFSKSCHLPW chimp
LSKMDQTLAVYQQILTSbW~tgMIQISNDLENLRDLHUVLAFSKSCHLPW gorilla
LSKMDQTLAVYQQILTSb IIRNVIQISNDLENLRDLLHVLAFSKSCHLPW orangutan
LSQIDQTLAIYQQILINLkLNVIQISNDLENLRDLLHLLAFSKSCHLPL rhesus
LSKMDQTLAVYQQILTSLPgQNVLQIAHDLENLRDLLHLLAPSKSCSLPQ rat
LSKMDQTLAVYQQILTSIfQ NVLQIAHDLENaLRDLLHLLAFSKSCSLPQ rat
LSKMDQTLAVYQQVLTSLflQNVLQIANDLENLRDLLHLLAFSKSCSLPQ mouse
LSKMDQTLAIYQQILTSLESRNVIQISNDLENLRDLLHLASSKSCPLAQ pig
LSKMDQTLAIYQQILASLRVIQI SNDLENLRDLLHLLAASKSCMEQ sheep
LSKMDQTLAIYQQILTSLE RNVVQISNDLENLRDLLHLLAASKSCEiEQ ox
LSRMDQTLAtYQQILNSLHSRWVVQISNDLENLRDLLHLLASSXSCPLR dog
i?Siss?i?iissiissiPPSsIIsiissississiisii? i ?s?cPPPS surf/int
<---helix---> | || predict
<--- helix 2 ---> 1<-- helix 3 --> expt


110 120 130 140
I I I I
ASGLETLDSI SVLEASGYSTEVVALSRLQGSLQDMLWQLDL4E2C human
ASGLETLDS gVLEASGYSTEVVALSRLGSLQDMLWQLDLt4S-C chimp
ASGLETLDS LQVLEASGYSTEVVALSRLQGSLQDILWQLDLEJ-C gorilla
ASGLETLDRLrgVLEASGYSTEVVALSRLQRSLQIMLWQLDLSGC orang
ASGLETLESLGDVLEASLYSTEVVALSRLQGSLQDMLWQLDLsI C rhesus
TRGLQKPESLDGVLEASLYSTEVVALSRLQGSLQDILQQLDLZSEC rat
TRGLQKPESLDGVL ASLYSTEVVALSRLQGSLQDILQQLDVSPSC ratnor
TSGLQKPESLDGVLEASLYSTEVVALSRLQGSLQDILQQLDVSPEC mouse
ARALETLESLGSVLEASLYSTEVVALSRLQGALQDMLRQLDLSP2C pig
VRALESLESLGVVLEASLYSTEVVALSRLQGSLQDMLRQLDLGPSC sheep
VRALESLESLGVVLEASLYSTEVVALSRLQGSLQDMLRQLDLS EC ox
ARGLETFESL VLEASLYSTEVVALSRLOAALQDMLRRIDLSI~C dog
iSiiSSiSSippiis??ii??sii?i?siss?issiissisipppc surf/int
|<-helix>| |<----helix---->! predict
<-helix>| |<---- helix 4 ------->| expt



Figure 1-1. Predicted surface, interior, and secondary structure assignments for leptin. S and s, I
and i indicate strong and weak surface and interior assignments respectively. P and p
indicate strong and weak parses respectively. A "?" indicates that no assignment is
made. A "c" indicates that the position is involved in a disulfide bond. Secondary
structure was assigned using the method of Benner and Gerloff where positions
denoted "?" were permitted to fall in either the surface or interior arc of the helix.
Underlined residues are part of parsing strings.











chimpanzee


gorilla I 0/0
1/0
human 0/2 7.5/2
"I orangutan 1/2
4/1
rhesus IIIIIIIIIIIIill IIIUllllli ili 6/10.25
7.5/4
mouse
mouse 2/6.54 -
-, IIII-iiin itiuiiu,, 4 33/4.45
rat 1 0/0 22.2/18.04

rat 2 4.5/7.67

pg 5.17/6.04
ox 1/6.79 1/4.86

5.33/11.33
sheep
-- 1/2.21

0 dog IIIIIIIIIIII IIIIIIIIIIIiiiiii ii iiii it iiiiiiiiiiiuiii r
6/5.04


Figure 1-2. Evolutionary tree showing the evolutionary history of the leptins. Heavy lines show
branches with expressed/silent ratios higher than 2. Hatched lines show branches with
expressed/silent ratios from 1 to 2. Thin, solid lines show branches with
expressed/silent ratios less than 1. Numbers on the lines indicate the ratio of
expressed/silent changes for that branch. The branch lengths do not correspond to
either geological time or evolutionary distance.











chimpanzee -------
I---,
gorilla ------ J

Human
E 24/6.33 |
i. orangutan ---------------

rhesus ---------------------
---- 12.5/5.17
mouse IIIIIIIIallilli .L
S6.7/10.3 7.6/2.13
| rat 1 46.1/11.9


rat 2 ------
---- 14/5.1

_
= 7.6/2.13

sheep iiiiiiiiiiiir
---- 12.8/7.6

dog --------------------------------




Figure 1-3. Evolutionary tree showing the evolutionary history of the extracellular domain of the
leptin receptors. Heavy lines show branches with expressed/silent ratios higher than
3. Hatched lines show branches with expressed/silent ratios from 2 to 3. Thin, solid
lines show branches with expressed/silent ratios less than 2. Notice the greater overall
divergence in the leptin receptor than in leptin itself. Dotted lines show branches with
indeterminate ratios. Numbers on the lines indicate the ratio of expressed/silent
changes for that branch. The branch lengths do not correspond to either geological
time or evolutionary distance.









CHAPTER 2
EXTENDING THE PBL METHOD FOR ESTIMATING NUMBERS OF SYNONYMOUS
AND NONSYNONYMOUS MUTATIONS TO ACCOUNT FOR AMBIGUITY IN
INFERRED ANCESTRAL SEQUENCES

Introduction

The measurement and estimation of synonymous and nonsynonymous mutations in DNA

sequences that code for proteins has proven to be a useful tool for scientists interested in the

evolution of protein families and their constituent molecules. Specifically, comparisons of rates

of synonymous and nonsynonymous mutation can be used to detect and differentiate episodes of

positive (adaptive) selection within the evolutionary history of a group of proteins. Comparisons

are typically made between pairs of uniquely determined, unambiguous, DNA sequences derived

from extant taxa, and though a phylogenetic tree might be included in an analysis, comparisons

are not made with the ancestral sequences that could be inferred at internal nodes of the tree via

parsimony or another method (e.g., maximum likelihood). The inclusion of these ancestral

sequences can greatly enhance an evolutionary analysis, for it allows the assignment of specific

mutations to the branches of the evolutionary tree, and these branches represent major

evolutionary events (e.g., speciations and gene duplications). By linking these events to episodes

of positive selection we are better enabled to explain the natural history of a protein family. Such

combined analyses have been published by a number of investigators, the first of whom were

Messier and Stewart [Mes97]. They used the PBL method developed by Li et al. [Li85] and

amended by Pamilo and Bianchi [Pam93] to calculate numbers of synonymous and

nonsynonymous mutations between DNA sequences from extant species and intranodal ancestral

DNA sequences inferred by the PAML method (a maximum likelihood method) of Yang et al.

[Yan97]. Though they do not make note of it in their Methods section, they most certainly used

the ancestral sequences that were predicted to have the highest probability; that is, for each









internal node of their tree, they used one unambiguous DNA sequence out of a number of

possible sequences for their comparisons. This approach discards information-information

about DNA and amino acid variability between the pair of sequences being compared and

between each sequence and the sequences at daughter nodes. The consequences of this editing

depend upon the questions being asked and must therefore be considered on a case-by-case basis

but, in general, we would argue that discarding information about the evolution of these proteins

and "unnaturally selecting" one sequence from a population of possible sequences would

introduce a bias that diminishes the confidence in any conclusions drawn. Herein we describe a

method that accounts for the ambiguity in ancestral sequences, providing perhaps a more realistic

measure of synonymous and nonsynonymous mutations and preserving data that pertains to the

natural history of these sequences.

An Overview of the PBL Method

The PBL method involves the comparison of two DNA sequences, codon by codon, and is

performed in the following steps:

* The average numbers of zero-, two- and fourfold degenerate sites are calculated.

* Mutations are counted and classified as transitions or transversions. Each mutation is further
classified as having occurred at a site of i-fold degeneracy.

* KA and Ks are calculated from the degeneracy and mutation calculations.

Figure 2-2 shows five taxa and their DNA sequences which, for this example, consist of

only three nucleotides (one codon) each. The degeneracy of each nucleotide within a codon is

determined as follows. Zerofold degenerate sites are sites where changing the nucleotide to any

of the other three nucleotides changes the amino acid translation of the sequence. For example,

sequence t1 (TCA), translates to the amino acid serine. If we change the second position of t1 to

A, G or T (giving TAA, TGA and TTA), the translation changes to Stop, Stop and leucine,









respectively (when using the nuclear code for translation). Therefore the second nucleotide of

TCA is classified as a zerofold site. Two-fold sites are sites where two of the three possible

changes change the translation. The third position of CAA (glutamine) is twofold degenerate,

because changing to C or T changes the amino acid to histidine, while changing to G does not

change the translation. Four-fold sites are sites where changing the nucleotide to any of the three

other nucleotides does not change the amino acid. The third positions of TCA and TCG are each

fourfold degenerate. After the degeneracy of each of the nucleotides in each sequence being

compared has been tallied, an average is computed.

Next, mutations are counted between the pair of sequences. In general, if a mutation occurs

between two nucleotides of the same chemical class-either the purines or the pyrimidines-the

mutation is called a transition: [A->G, G->A, T->C, C->T]. Mutations from one class to another

[purine->pyrimidine, pyrimidine->purine] are called transversions. In the PBL method these

classifications hold for all mutations in the vertebrate mitochondrial code, but in the nuclear code

and the codes of other organisms some adhoc adjustments are made. The purpose of these

adjustments is to preserve the following rules:

* All changes at zerofold sites, whether transitions or transversions, are nonsynonymous
(expressed) changes.

* All changes at fourfold sites are synonymous (silent).

* Transitions at twofold sites are synonymous and transversions at twofold sites are
nonsynonymous.

If we compare sequence tl with t2 we see that A changes to G at the third position. This is

tallied as a transition. The change is further classified as having occurred at a fourfold degenerate

site (Figure 2-6). Following the above rules, the mutation is ultimately classified as a

synonymous mutation.









Sequences tl and t3 have two mutations between them. Rather than simply counting the

changes, the PBL method considers all possible paths that could be traversed as one codon

mutates into another. Figure 2-1 shows the two paths that can be taken to get from TCA to CAA.

Mutations are counted and classified as described above, but the mutations are then multiplied by

a probability determined for each path. A path's absolute probability (in Figure 2-1, pil x pi2,

where i = path number) is computed by determining the probability of each "leg" of the path, and

then taking their product. The leg probabilities are determined by assessing the probability of the

amino acid change taking place on that leg. The determination of these amino acid transition

probabilities is described in Li et al. [Li85].; briefly, the probabilities of each amino acid

changing into every other amino acid are derived from Grantham's 20x20 mutation matrix and

partitioned into four categories (synonymous, conservative, moderate and radical changes). In

Figure 2-1, TCA changes to CCA serinee to proline) with a probability of 0.767, which is the

value for a conservative change. Synonymy (no change) has a higher probability (2.235) and

moderate and radical changes have lower probabilities.1

Once all paths' absolute probabilities have been determined, each path's relative probability

can be determined by dividing its absolute probability by the sum of all paths' absolute

probabilities (in Figure 2-1, Pi, where i = path number). In Figure 2-1, Path 2 (P2) involves a stop

codon, so this path is given an absolute and relative probability of zero, giving Path 1 (Pi) a

relative probability of one (its absolute probability is 0.588). Finally, the mutations occurring

along a path are multiplied by their path's relative probability, and the sum of mutations is taken

by summing the mutations along each path. This procedure is the same for codons with three


1 These values are for mutations under a moderate rate of mutation model. Li et al. [citation] also provides
values for a high rate of mutation model.









differences between them, except in such cases there are potentially six paths, and each path has

three legs (e.g., t2 vs. t3). Figure 2-6 shows the results of comparing each of the sequences tl, t2

and t3 with each other.

After the average degeneracy of a pair of sequences has been determined, and the number

and type of mutations between them has been counted, KA and Ks can be calculated. KA is

defined as the number of nonsynonymous substitutions per nonsynonymous site in a pair of

sequences. Ks is the number of synonymous substitutions per synonymous site. The formulae

were developed originally by Li et al. and amended by Pamilo and Bianchi, and their derivation

can be found in these references and Kimura [Li85, Pam93, Li93, Kim80, Kim81].

Extending the PBL Method to Phylogenetic Trees with Inferred Ancestral Sequences:
Putting Ambiguity Back into the Equations.

Figure 2-2 shows a hypothetical phylogeny for sequences tl, t2 and t3. This tree contains

two internal nodes (ancestral sequences) al and a2. These sequences are inferred by the

parsimony algorithm of Fitch [Fit71]. Figure 2-3 shows the reconstruction of the ancestral

sequences at each nucleotide position. Below each nucleotide is a number representing the

probability of that nucleotide appearing at its node. These probabilities are computed by a

method developed by Fitch.

Our method is conducted as follows. The tree is traversed node by node, from top to

bottom; beginning with the least ancient nodes and ending with the root node. At each node,

comparisons are made between each node codon(s) and each codon(s) of each descendant. These

comparisons are made by

* Constructing a set of equally parsimonious solutions (EPSs).

* Determining the relative probability of each EPS.

* Calculating the average number of zero-, two- and fourfold degenerate sites for each
ancestor/descendant pair.









* Counting and classifying mutations for each ancestor/descendant pair.

* Calculating KA and Ks.

An EPS is defined as three codons [x,y,z] (x = left descendant, y = ancestor, z= right

descendant) that are permitted by Fitch's parsimony rules. Each codon is constructed by

combining "valid" nucleotides from each of the three parsimonious nucleotide trees (Figure 2-3).

The validity of the presence of a nucleotide in an EPS is determined by Fitch's rules for valid

linkages. In this example, all linkages are permitted between all of the nucleotides at a2 and al

in each of the nucleotide trees, so all possible codons that can be built are permitted. Ancestral

node al has only one EPS, [TCA, TCA, TCG]. Ancestral node a2 has four EPSs, [TCA, TCA,

CAA], [TCA, TAA, CAA], [TCA, CCA, CAA] and [TCA, CAA, CAA]. The determination of

EPSs is required because the identity of a nucleotide at the ancestral node dictates what

nucleotides are allowed at its descendant nodes.

Next, the probabilities for each EPS are determined. Since al has only one EPS, this EPS

is given a probability of 1. The probability of each of the four EPSs at a2 is determined by the

following method (Figure 2-4). Each EPS has two paths, the path between the ancestor and its

left descendant and the path between the ancestor and its right descendant. Each of these paths

has a probability associated with it, PL and PR, respectively. The probability of a path is

determined by calculating the probability of the amino acid change that occurs along that path.

PL for EPSI's left path (TCA->TCA) is 2.235 because there is no amino acid change

(synonymy). PL for EPS3's left path (TCA->CCA) is 0.767 because this amino acid change

(serine->proline) is conservative. For paths involving two or three mutations, a variation of the

paths method depicted in Figure 2-1 is used. For example, the right path of EPS has two

mutations. The absolute probability of Path 1 (p11 x p12) = 0.588, and since Path 2 contains a stop

codon, its absolute probability is zero. We then compute the average probability by summing the









absolute probabilities of all paths and dividing by the number of paths. In this case there is only

one path, so its absolute probability is the average probability: PL = 0.588.

EPS2 has a probability of zero because it contains a stop codon. EPS3 and EPS4

probabilities are calculated as shown in Figure 2-4.

Once the average probabilities of each path in the EPS are calculated, we calculate the

product of the probabilities of all nine nucleotides in the EPS (the nucleotide probability product,

PN) and then compute the absolute probability of the EPS by taking the product of the two

average probabilities and the nucleotide probability product: PLxPRxPN. We can then determine

the EPS's relative probability by dividing its absolute probability by the sum of all EPSs' absolute

probabilities (Figure 2-4). We include PN in the calculation because it is a measure of the

probability of the codons in the EPS. Without PN, an EPS's probability would only be a measure

of the probability of its amino acid changes, so if a particular EPS contained conservative

mutations and/or synonymy, for example, it would be given a high relative probability regardless

of the probability of its individual codons existing.

We then use the EPSs and their probabilities to calculate the average number of zero-, two-

and fourfold degenerate sites between each sequence pair. As an example we will consider the

al/a2 pair (the left path of the EPS set for node a2). For EPS1, we compare TCA with TCA.

Codon TCA has two zerofold sites, zero twofold sites and one fourfold site ([2,0,1]). The

average sites are then (([2,0,1] + [2,0,1])/2) PEPS1 = [0.818, 0, 0.409]. For EPS2 we have [0,0,0]

because of the stop codon. Sites for EPS3 and EPS4 are similarly computed. The sum of sites for

all EPSs is [2, 0.204, 0.796] (Figure 2-7). Counts and classifications of mutations for each

sequence pair are then determined. We will continue using the al/a2 pair as an example. For

EPS1 we tally [[0,0,0],[0,0,0]] for TCA vs. TCA ([[a,b,c],[x,y,z]]: a = zerofold transitions, b =









twofold transitions, c = fourfold transitions, x = zerofold transversions, y = twofold

transversions, z = fourfold transversions). For EPS2 we do not count any mutations because of

the stop codon. For EPS3 we count [[1,0,0],[0,0,0]] PEPS3 = [[0.183,0,0],[0,0,0]] and for EPS4

we count [[1,0,0],[1,0,0]] PEPS4 = [[0.409,0,0],[0.409,0,0]].

These procedures are repeated for each codon in each pair of sequences at each node in the

tree (Figure 2-7). Finally, KA and Ks. are computed.

We wrote a computer program that implements this method. Over a course of years the

program was written and rewritten in a number of different programming languages to suit

whatever task was at hand. In Appendix B we present a version written in Java 1.1, used in the

context of the MasterCatalog-an interactive database containing the contents of GenBank

reorganized as a collection of clades, or families, consisting of independently evolving protein

fragments, or modules [BenOO]. The extended PBL method described herein was engineered to

measure KA and Ks values for pairs of modules within each family of modules, including

reconstructed ancestral modules at nodes within the phylogenetic tree representing the natural

history of the module family.

Prior to that project we coded this method and applied it to all of GenBank. This work is

described in Chapter 3.








Path 1 TCA (S) Path 2
P11 P21

CCA(P) TAA(X)

12CAA (Q)

CAA(Q)


Path


Transitions
O-fold 2-fold 4-fold


Transversions
0-fold 2-Fold 4-fold


1 1 0 0 1 0 0
2 0 0 0 0 0 0


total 1 0 0


0 0


Pi(relative) = (p, X Pi2J/([p, X P12] + [P21 X P22]) = (.767x.767)/(.588) = 1
P2(relative) =P2 X p/22 Pi, X P12 + [P2, X P2j]) 0/(.588) = 0
Figure 2-1. Two possible ways to get from TCA to CAA. Formulae for path relative
probabilities, and numbers of transitions and transversions adjusted by multiplication
by path probabilities.








TCA TCG CAA
t2 t3





al







TCA a2 CCA
TAA CAA
Figure 2-2. A hypothetical phylogeny for sequences tl, t2 and t3. Ancestral sequences al and a2
inferred by parsimony.









Position 1
T
1,0


TC
D.5 0.5


Position 21
C
1.0


Position 3


CA A
0.5 0.5 1.0


Figure 2-3. Parsimony for each of the nucleotides in sequences tl, t2 and t3. The numbers below
each nucleotide represent the probability of that nucleotide being at its node per
Fitch's method.











EPS 1: [TCA, TCA, CAA]
TCA CAA
1-0 1_0 1-0 1_0 1.0 1.0







TCA
0.5 0.5 1.0

P, = P(TCA->TCA) = 2.235

P, = P(TCA->CAA) = .588
3 3
P. = t 7 n, = .25
i= lj=
PEPSI = PtxRxP = .409
P pEP absolutec)



EPS 3: [TCA, CCA, CAA]
TCA CAA
1.0 1.0 1.0 1.0 1.0 1.0


EPS 2: [TCA, TAA, CAA]


PL=

3 3
P, = T T n, = .25
i= j=1
PEPS2 = PxP.xP,, = 0
X PEPSx(absoIute)



EPS 4: [TCA, CAA, CAA]
TCA CAA
1.0 1.0 1.0 1.0 1.0 1.0


CAA
0.5 05 1.0


P, = P(TCA->CCA) = .767
PR = P(CCA->CAA) = .767
3 3
PN = T 7 n, = .25
i= l j=l
PEPS4, PLXPxP. = .182
I


PL = P(TCA->CAA) = .588
P, = P(CAA->CAA) = 2.235
3 3
PN= TC n,= .25
i=lj=l
PEPS4 = PLXPxP, = .409
T PEPs(absoI,,te)


Figure 2-4. EPSs for ancestral node a2. Each EPS has a probability for its left and right path, PL
and PR. PN is the product of the probabilities of each nucleotide in the EPS (i = 1, 2 or
3 where 1 refers to the left descendant, 2 refers to the ancestor and 3 refers to the right
descendant. j = nucleotide position). The relative probability of a path is equal to the
quotient of its absolute probability and the sum of absolute probabilities for all EPSs
(x = number of EPSs).









Nume


Sequence


Degene
0-fold 2-fold


H1 TCA 2 0

t2 TCG 2 0

t3 CAA 2 1


Figure 2-5. Sequences tl, t2 and t3,
each pair of sequences.


racy Sequence Pair Averge Degeneracy
4-fowld -fold 2-1old 4-fald

1 tlt2 2 0 1

1 tl/t3 2 0.5 0.5


0 t2/t3 2 0.5 0.5


numbers of degenerate sites, and average degeneracies of







Sequence Pair


Transitions


4-Fold


0-fold 2-Fold


tl/t2


tl/t3


t2/t3


Figure 2-6. Numbers of transitions and transversions for each pair of sequences.


Transversions
0-fold 2-fold 4-fold


1 0.33 0.67








Ancestral Sequence Pair Average Degeneracy Transitions Transversions
Node
0-fold 24old A-fold 0-fold 2-fod 4-Fold O-fold 2-fold A-fold


2 0


2 0 1

2 .204 .796

2 .704 .296


0 0 0 0 0

0 0 1 0 0


.592

.409


0 0 .409 0

0 0 .592 0


Figure 2-7. Tallies of average degeneracies and mutations for each pair of sequences. Fractional
degeneracies and mutations are results of multiplication of observed numbers with
computed probabilities for EPSs.


al/ti

al/t2

a2/al

a2/t3


a2









CHAPTER 3
THE ADAPTIVE EVOLUTION DATABASE (TAED): APPLICATION OF THE EXTENDED
PBL METHOD TO A LARGE DATASET

Background

The growth of gene and genomic databases provides motivation for developing tools to

extract information about the function of a protein from sequence data, with the ultimate goal of

understanding the collection of functions represented in an organism's genome [Lib01l]. Work on

molecular evolution over 30 years has shown that such questions must be phrased carefully, and

always with cognizance of the Darwinian paradigm that insists that the only way of obtaining

functional behavior in living systems is through natural selection superimposed on random

variation in structure [Ben88b]. A behavior is functional if the organism would be less able to

survive and reproduce if that behavior were different. An amino acid residue is functional if,

upon mutation, the organism is less able to survive and reproduce.

A long literature has sought to interpret the evolutionary behavior of protein sequences, in

the hope of drawing inferences about the relationship between fitness and sequence [Kim82].

What has emerged is the recognition that a family of orthologous proteins displays a diversity of

structure and a corresponding diversity in behavior, where some of the behavioral differences

have a strong impact on fitness (are functional), and others are neutral (or nearly so). Without

resolving, in a general way, questions regarding the relationship (neutrality versus selection)

between fitness and protein sequence, we can build interpretive tools that capture information

from patterns of evolution of genomic sequences that is informative about function-in

particular, events that are characterized by the biological scientist as a change in function.

For a protein to change its function it must change its behavior; this in turn requires that it

change its amino acid sequence. A protein being recruited for a different function over a very

short time (geologically speaking) frequently experiences an episode of rapid sequence









evolution, an episode where the number of amino acid substitutions per unit time is large.

Therefore, molecular evolutionists have long been interested in the rates at which substitution

accumulate in protein sequences. These rates are known to vary widely in different protein

families.

Calculating rates in the units of substitutions/time requires knowledge of the geological

dates of divergence of protein sequences. Because geological times are frequently not known

(and almost never known precisely), alternative approaches for identifying episodes of rapid

sequence evolution have been sought. One of these examines nucleotide substitutions. It divides

the number of nucleotide substitutions that change the sequence of the encoded protein

(nonsynonymous substitution) by the number of nucleotide substitutions that do not change the

sequence of the encoded protein (nonsynonymous substitution), and then normalizes these for

the number of nonsynonymous and synonymous sites. This is the KA/Ks ratio [Li85, Pam93,

Li93]. High KA/Ks ratios for reconstructed ancestral episodes of sequence evolution are

hypothesized to be signatures of positive adaptation, and have been associated with significant

changes in function [Tra96, Mes97].

In general, KA/Ks values are low. For example, the average KA/Ks value in proteins

between rodents and primates is 0.2 [Mak98]. This is taken to indicate that most of these

proteins, selected over millions of years, attained an optimum function prior to the divergence of

rodents and primates. This implies that subsequent evolution was conservative; most

nonsynonymous mutations were detrimental to the fitness of the organism.

Functional change can be defined as mutation that alters organismal fitness and is subject

to selective pressure. For an example of intraspecific variation, phosphoglucose isomerase in

montane beetles shows adaptation to local temperature variations [DahOO]. Orthologous proteins









also suffer positive selection. For example, the hemoglobin in the bar-headed goose has

undergone adaptive change relative to the hemoglobin from the closely related greylag goose in

response to a reduced partial pressure of oxygen at high altitudes [Zha96]. Adaptive evolution is

also believed to be displayed by paralogous mammalian MHC class I genes and relate to a birth-

and-death model of gene duplication [Nei97].

Traditionally, positive selection is defined by a KA/Ks rate ratio significantly greater than

unity. While 0.6 < KA/Ks < 1 can occur by relaxation of functional constraint, the theoretical cut-

off of one is well known to miss significant functional changes in proteins for several reasons

[Cra99]. Long branches can dilute an episode of positive adaptation (with KA/Ks >1) with

episodes of conservative evolution. KA/Ks values can miss positive selective pressures on

individual amino acids because they average events over the entire protein sequence. Behavior in

a protein can change significantly if only a few amino acids change while the remainder of the

sequence is conserved in order to retain core behaviors of the old and new functions (e.g., the

protein fold). These adaptive events will only be detected on sufficiently short branches which

pinpoint the adaptive change.

Alternative ways of identifying KA/Ks values below unity that are suggestive of adaptive

evolution involve comparison of these values for an individual branch of a tree with those values

for branches in the tree generally. If one branch has a KA/Ks value far outside of the norm for the

family (but still below one), we can guess that this branch represents an episode of positive

selection. This will work for gene families that generally display conservative evolution (such as

the SH2 (Src homology 2) domains) [Wig98], but not for others. For example, many immune-

system genes show a much more continuous distribution of values, which may indicate that they

are perpetually under different amounts of positive selective pressure [Nei97]. In this case, the









designation of a cutoff value of KA/Ks, below which two homologous genes have the same

function, and above which they have different functions, is arbitrary. Ultimately, this level

should be determined by benchmarking adaptivity with specific functions and specific protein

folds.

KA/Ks rate ratios are well known to be useful starting points for generating stories about

the interaction between protein sequences and the Darwinian processes that shape these

sequences. These stories help us understand how these sequences contribute to the fitness of the

host. This means that biologists would find useful a comprehensive database of examples where

KA/Ks values are high. Most useful would be a database that presents families where KA/Ks is

greater than one, and a separate family where KA/Ks is greater than some arbitrary cut-off less

than one, but still relatively high compared to the average value in the average protein.

We report here such a database, The Adaptive Evolution Database (TAED). TAED is

designed to provide, in raw form, evolutionary episodes in specific chordate and embryophyte

(flowering plants, conifers, ferns, mosses and liverworts) protein families that might be

candidates for adaptive evolution. TAED contains a collection of protein families where at least

one branch in the reconstructed molecular record has a KA/Ks value greater than unity, or greater

than 0.6. The second cut-off is arbitrary, chosen to be high relative to the average KA/Ks value

for the average episode of evolution in a protein family. Empirically, the lower cut-off seems to

admit additional examples of gene families that might have undergone adaptive evolution.

TAED should be used as a raw list of potentially adaptively evolving genes for

experimentalists seeking gene families to study in further detail, and for bioinformaticists

interested in studying large datasets of examples of genes with high KA/Ks ratios.









Results

The MasterCatalog [BenOO] is a database of 26,843 families of protein modules generated

from an all-against-all search of GenBank release 113. A protein is broken into independently

evolving modules on the criterion of the presence of a subsection of a gene as a complete open

reading frame (ORF) in another species. Pairs that were within 180 PAM (point accepted

mutation) units with a minimum length requirement were grouped into the same family. Each

family contains an evolutionary tree and a multiple sequence alignment. This database was the

starting point for the exhaustive calculation of KA/Ks ratios.

The MasterCatalog (MC) is different, both in concept and execution, from other resources

(e.g., Hovergen [Dur94], Pfam [BatOO] and COGs) that offer databases of protein families. The

MC incorporates reconstructed ancestral states within its data structure, in addition to multiple

sequence alignments (MSAs) and evolutionary trees. Having these reconstructed ancestral states

provides a value to the database, especially for functional interpretation, that is not offered by

databases that contain only trees, or only MSAs, or only trees and MSAs. Further, because the

MasterCatalog is explicitly developed as a tool for doing functional genomics relying on

reconstructed intermediates, and as the information about function is extracted from analysis of

patterns of variation and conservation in genes and proteins within a family, it emphasizes the

generation of high-quality trees, MSAs and reconstructed ancestral states. For this reason, the

MC does not attempt to build superfamilies (like Pfam does). Instead, it constructs families,

where the trees, MSAs and ancestral states are not compromised by poor gap placement-a

common problem in Clustal-based MSAs of sets of highly divergent protein sequences.

Alternative methods were considered for reconstructing ancestral sequences. Whereas

maximum likelihood methodologies perform better in some situations, they are too

computationally intensive to apply exhaustively. Further, they are based upon an explicit model









of evolution that may not be appropriate along all branches analyzed, a situation where

maximum parsimony may outperform maximum likelihood on some branches [Pag98].

Therefore, to generate the initial version of this database, more computationally simple methods

were used. As improved methodologies are developed, these will undoubtedly be applied to

recalculate this database.

Two issues concerned the scope of the KA/Ks analysis. First, we were concerned that silent

positions would be 'saturated' with substitutions, rendering the Ks measurements meaningless.

Whereas reconstruction back to the last common ancestor of chordates or embryophytes with no

intermediates frequently bears the signature of synonymous position equilibration, synonymous

position saturation can be avoided if individual branches are shorter than the period required for

saturation to occur (tl. to saturation of approximately 120 million years). Saturation was

measured through the examination of the extent to which two-fold redundant codon systems had

reached equilibration [PelOO]. Branches that showed equilibration greater than five half-lives

towards saturation were excluded from TAED on the basis of differences between reconstructed

ancestral sequences at the beginning of branches and sequences at the end.

A second significant problem is that of short branches bearing fractional mutations. These

are known to generate KA/Ks values with large errors. To prevent these errors from biasing the

database, a new simple robustness test was implemented to ensure that an "interesting" KA/Ks

value (one above the cut-off) was not recorded in the database if it became "uninteresting"

(below the cut-off) through the shift of a single mutation reconstructed in the branch. The test

modified the KA/Ks calculation in a simple way, as described below:

Modified KA/Ks = KAmod/Ksmod, where

KAmod = (number of nonsynonymous 1)/total nonsynonymous sites

Ksmod = (number of synonymous + 1)/total synonymous sites









In general, the smaller the difference between KA/Ks and KAmod/Ksmod, the more

significant or robust the branch.

To exclude short branches with fractional mutations (arising through ambiguous ancestral

sequence reconstruction) without excluding other short branches, branches with KAmod/Ksmod

values below 0.5 were excluded from the database.

Of 5,305 families of modules containing chordate proteins, 280 contained at least one

branch with a KA/Ks value greater than one, representing 643 branches emanating from 63

different nodes of the tree of life. Some 778 families had at least one branch with a KA/Ks value

greater than 0.6, totaling 2,232 branches emanating from 92 nodes of the tree of life. Thus 15%

of all families of chordate modules are likely to have modified their function at least once during

the course of evolution.

Of 3,385 families of modules representing embryophyte proteins, 123 have at least one

branch with a KA/Ks value greater than one, representing 227 families emanating from 25 nodes.

Some 407 families had at least one branch with a KA/Ks value greater than 0.6, totaling 1,105

branches from 43 nodes. Here, perhaps 12% of all embryophyte families have modified their

function along at least one branch.

This result based on ancestral sequence reconstruction contrasts greatly with the result of

Endo, Ikeo and Gojobori, where the search for gene families undergoing adaptive evolution

yielded only two families [End96]. They compared extant sequences rather than reconstructed

evolutionary intermediates, counted families only where a majority of the pairs were at high

KA/Ks values, and used a smaller database.

A list of candidate protein module families that have undergone modification of function is

available at [Lib03]. The version described here is designated TAED 2.1 and will remain









available at this site. As more sophisticated methods are developed and applied, as correlations

with functional and structural databases are pursued, and as data from other types of evolution

beyond coding sequence evolution is added, links to these datasets will be provided. TAED 2.1

contains two image-mapped trees (for chordates and embryophytes); where the node that an

adaptive branch emanates from can be clicked on to obtain a list and MasterCatalog reference

number. Multiple sequence alignments and phylogenetic trees corresponding to these entries can

be obtained from EraGen Biosciences WI.

Discussion

This study represents the first comprehensive analysis of KA/Ks rate ratios throughout the

Chordata and the Embryophyta. Although the methods utilized were rough and designed to give

a quick snapshot into a global picture of evolution, the TAED, as a raw resource, should be

valuable in the analysis of much of chordate evolution. Functional genomics analyses of many of

the protein families that have undergone recruitment and functional change within the past 500

million years will soon emerge. Many of the episodes of functional change recorded in TAED

can be correlated with events in the geological or paleontological record, in response to changing

environments, evolving paleoecology or the development of new physiology.

Gene families may display evolutionary episodes with high KA/Ks values, and therefore

appear within TAED, for several possible reasons. For example, branches resulting from gene

duplication events that give rise to paralogs with very different behaviors will presumably have

high KA/Ks values, as will orthologous pairs from species that place very different demands on

their function. This search was done without distinguishing paralogs from orthologs, and the user

of TAED should be careful in the analysis of specific families in recognition of this fact.

Because there is no reliable true set of protein families known to have undergone

functional adaptation, it is not possible to score the results of this tool. It is important to









remember that a Darwinian definition of function differs from the functional annotation of

genomes, and it is possible for a protein to alter or change its function while retaining the same

annotation. To examine this dataset, specific proteins must be examined individually.

Individual examination is likely to be productive, however. Many protein families already

believed to be candidates for functional recruitment appear on the list. These include

plasminogen activator in vampire bats which is expressed in saliva and involved in blood clotting

[Bod97], phospholipase A2 in snakes which is expressed in venom and involved in tissue damage

[Nak95] and MHC proteins in mammals, which are involved in the immune system as part of the

host-parasite arms race [Hug88], all having obvious explanations of why they may have

undergone functional change. Several families are newly identified as being candidates for

functional change, such as the previously proposed obesity protein leptin in primates.

A third category of discovery in TAED is in the detection of episodes of adaptive change at

new points in the divergent evolution of proteins, for example myostatin in the Bovidae [Lee99].

Table 3-1 is a sample table from TAED representing bovids. These are the candidate genes that

were identified as showing rapid sequence evolution emanating from this node in the tree of life.

They potentially include orthologs between two species of bovids, paralogs, alternatively spliced,

transcripts and intraspecific evolution. The genes on the list have roles in the immune system,

body musculature and reproduction, traits frequently under selective pressure. These examples

and many others are candidates for further experimental study through cloning from additional

species and through functional study in laboratories expert in the particular protein.

Conclusions

From a phylogenetic perspective, the knowledge of candidate genes evolving at the same

time in the same organism can allow one to begin to ask if entire pathways or phenotypic

functions are under selective pressure at particular points in evolutionary history. Where tertiary









structures for the proteins exist, mutations along branches can be mapped onto three-dimensional

structures first to evaluate the validity of specific examples, and second, to understand the nature

of adaptive evolution at a structural level.

One analysis of TAED indicates that among branches with KA/Ks rate ratios > 1, only 3%

of synonymous sites had mutated compared with 10% on the average branch in the database.

This is consistent with the notion that episodes of adaptive evolution can be lost in long

branches, as these are combined with prior and/or subsequent episodes characterized by lower

KA/Ks rate ratios characteristic of functional constancy. As more genes are sequenced from more

species, the greater articulation of trees will not only increase the accuracy of sequence

reconstructions, but will also allow us to detect new examples of functional change that are

buried in long branches.

At a biological level, the database created here can be mined to provide global pictures of

how evolution has occurred. Correlation of data in this database with that in other functional

databases will enable a leap from genotype to organismal phenotype. Further, the dataset

provides a resource for experimentalists interested in specific genes. The high KA/Ks rate ratio in

leptin in a branch connecting primates with rodents may have been a useful predictor of change

of function for pharmaceutical companies interested in the mouse model of leptin for human

obesity. For the experimentalist, mutations occurring along putatively adaptive branches can be

assayed for functional importance in systems of interest.

Finally, this database represents a growing framework for the study of adaptive evolution.

As datasets become available, changes in gene expression, alternative splicing patterns,

imprinting patterns, recombination events and other molecular mechanisms of adaptation will be

added to this database in a phylogenetic perspective. The ultimate goal is a dynamic resource









depicting candidate molecular events that are responsible for phenotypic differences between

closely related species.

Materials and Methods

Starting with the MasterCatalog [BenOO] (version 1.1 derived from GenBank release 113)

KA/Ks rate ratios were reconstructed database-wide for each ancestral branch in every

evolutionary tree containing genes from the Chordata and the Embryophyta. This analysis was

restricted to these organisms because there is less evidence for codon and GC-content biases

which complicate the accurate calculation of Ks. The MC uses multiple sequence alignments

generated from Clustal-W and neighbor-joining trees, both derived from protein sequences.

Because the MC is based on an analysis of nuclear families, rather than extended families, these

inexpensive tools generate acceptable multiple sequence alignments.

KA/Ks values were calculated for branches on an evolutionary tree between nodes using

the method of Li and Pamilo and Bianchi [Li85, Pam93, Li93] modified to allow full treatment

of probabilistic ancestral sequences [Ben98]. Reconstruction of ancestral sequences was done

using the Fitch maximum parsimony methodology [Fit71]. While reconstructed ancestral

sequences contain ambiguities, using probabilistic ancestral sequences takes this into account (by

weighting ambiguous positions according to their probabilities) and allows us to construct a

model of evolutionary history that is robust. Two cut-offs were used to identify interesting values

for the KA/Ks rate ratio, one and 0.6. Separate databases were constructed for each cut-off. The

resulting dataset is freely available for further analysis [Lib03].










Table 3-1. A sample listing from TAED containing candidate adaptively evolving genes
detected that emanated from the Bovidae node.


Genes with KA/Ks > 1.0
1. T-cell receptor CD3 epsilon chain
from MasterCatalog family 9668
2. AF092740 cytotoxic T-lymphocyte-
associated protein 4 precursor from
MasterCatalog family 9698
3. CD5 from MasterCatalog family 9700


4. AF 110984 intercellular adhesion
molecule-1 precursor from
MasterCatalog family 9802
5. Interferon alpha/beta receptor-2 from
MasterCatalog family 9817
6. AF020508 pregnancy-associated
glycoprotein 6 from MasterCatalog
family 15612
7. MCH OVAR-DQ-ALPHA1 from
MasterCatalog family 15669
8. Major histocompatibility complex
class II from MasterCatalog family
21739
9. T-cell receptor gamma from
MasterCatalog family 21940


Additional Genes with KJA/K> 0.6
10. Interleukin 2 receptor from
MasterCatalog family 974
11. Interleukin-3 from MasterCatalog
family 9775

12. AF019622 myostatin;
growth/differentiation factor-8; GDF-8
from MasterCatalog family 20325
13. Fas gene product from MasterCatalog
family 21743

14. Calpastatin from MasterCatalog family
21751
15. Prolactin receptor from MasterCatalog
family 21853

16. Pre-pro serum albumin from
MasterCatalog family 21864
17. Immunoglobulin gamma-1 chain from
MasterCatalog family 21881

18. AF 110984 intercellular adhesion
molecule-1 precursor from
MasterCatalog family 21997


These examples potentially include orthologs between different species of Bovidae, paralogs,
alternatively spliced cDNAs with potentially different functional effects and intraspecific
modifications.









CHAPTER 4
DETECTING COMPENSATORY COVARIATION SIGNALS IN PROTEIN EVOLUTION
USING RECONSTRUCTED ANCESTRAL SEQUENCES

Introduction

The evolution of protein sequences is nearly always described using one of several

stochastic models for the accumulation of amino acid replacements [Ben97, Fuk02]. These are

captured in algorithms known by widely recognized names (e.g., the Needleman-Wunsch

[Nee70], Smith-Waterman [Smi81], and Felsenstein maximum likelihood [Tho92] tools). These

tools have become more sophisticated in recent years as mathematicians have "inched towards

reality" [Tho92] in their mathematical modeling well supported by a rich background in

statistics, theorems, and proofs.

Nevertheless, patterns of replacement predicted by these mathematical models remain

quite different from the patterns that are actually observed in proteins diverging under functional

constraints [Ben97, Gau 01]. The reason for these differences is well understood. Briefly, simple

stochastic models treat proteins as if they were linear strings of letters. In reality, proteins have

three dimensional structures that support behaviors that are important for them to contribute to

the fitness of the host ("function"). These behaviors are not a linear sum of the behaviors of their

parts. Amino acid replacement is therefore constrained in a way unanticipated for a linear string

of letters.

The differences between mathematically convenient models and the reality of organic

chemistry need not be paralyzing, however. First-order stochastic treatments of protein

sequences can provide "null" hypotheses, statements about how proteins would behave if they

were formless, functionless strings of letters. The difference between how proteins actually

divergently evolve, and how first order treatments model their divergent evolution, therefore

contains a signal about fold and function [Ben97].









Analyses of this signal have been remarkably productive. They provide practical tools for

predicting the folded conformation of proteins from sequences [Ben97, Ros94], and many useful

approaches for extracting functional information from genomic sequence data [Ben98, BenOO,

Lib01]. Accordingly, one goal of contemporary computational biology is to extend the concepts

and tools needed to extract signals concerning structure and function from features of divergent

evolution that do not meet the expectations of simple stochastic models.

Perhaps the most serious approximation made by first order stochastic models is their

treatment of individual positions in a protein sequence as independently evolving entities.

Virtually all of these tools analyze sequence divergence one position at a time under a model

where position i in a sequence suffers replacement independently of position. Even models that

recognize that different sites may have different mutability (gamma models) treat sites as being

independently evolving [Miy95]. This is, of course, extremely convenient for any statistical

model for protein sequence divergence as it enables the probability of a sequence alignment

overall to be calculated as the product of probabilities calculated for individual positions in the

alignment.

Even casual inspection of a multiple sequence alignment, however, shows that positions in

a protein sequence do not suffer replacement independently. Replacements in positions adjacent

in a sequence are strongly correlated [Coh94]. The correlation almost certainly reflects

functional constraints on replacement set within the context of a folded structure. Therefore, it

has proven to be useful, in particular, for predicting the three dimensional structure of protein

folds [Ben97].

Position pairs distant in the sequence but near in space in the three-dimensional fold might

also be expected to suffer replacement in a correlated fashion [Alt87, Alt88]. Many classes of









these can be envisioned. For example, a "big-for-small" replacement at position i might be

compensated by a coincident "small-for-big" replacement at position j, to conserve overall size in

the packed core of the fold, and therefore conserve a functional behavior related to packing (the

stability of the fold). Alternatively, a "positive-for-negative" charge replacement at position i

might be compensated by a "negative-for-positive" charge replacement at position to conserve

overall charge, and therefore conserve a functional behavior related to net charge.

Behind this expectation stands a model based on the neutral theory of evolution [Kim83].

Under this model, amino acid replacements that dramatically alter the physical property of the

side chain (e.g., its size or charge) will disrupt the performance of a protein already optimized to

contribute to the fitness of the host organism. The fitness value of the protein can be restored

only by a second change that compensates for the first alteration in physical properties. In the

language of neutral theory, we would say that the first replacement was selectively

disadvantageous, the second was positively selected (in the context of the first), and both

together lead to a result that is neutral.

Analyses of compensatory replacement have found practical application. For example, a

pair of compensatory changes in the protein kinase family underlay the successful prediction of

the antiparallel sheet in the first kinase domain [Ben91], contradicting a prediction of a parallel

sheet based on motif analysis [Ste84]. This example represents the first example where

compensatory covariation analysis was a key to a bonafide prediction of protein structure. In

another case, Cohen and his coworkers presented an elegant example of compensatory

replacement in phosphoglycerate kinase [GohOO], supporting the suggestion that correlated

change in the evolutionary history of two protein families might be taken as evidence that the

two proteins operate together. More generally, several laboratories have suggested that









compensatory covariation might be used to detect incorrect folds in a structure prediction

environment [Olm99, Che97, Che98].

Unfortunately, comparison of any two sequences diverging at n sites generates n(n-1)/2

pairs of candidate sites holding coincident replacements. These might be compensatory; they

might be coincidental. Only a fraction of these will be truly compensatory, meaning that either

replacement alone would have been rejected by natural selection without the other. Most

replacements are presumably not compensated, either because they are neutral (implying that no

other changes are needed to prevent their having a negative impact on functional behavior), or

because they have a positive impact on fitness (implying that no other changes are desired to

neutralize their impact on functional behavior).

Thus, while such compensation is easily recognized when analyzed in the context of a

known crystal structure (the sites suffering compensatory replacements presumably are near in

space), it is difficult to identify the pairs of sites that might suffer compensatory changes without

a crystal structure. As a consequence, the compensatory covariation signal, represented by the

number of pairs of replacement that are causally interrelated (where either alone would be

rejected by natural selection) relative to the n(n-1)/2 pairs that arise whenever two protein

sequences differing at n sites are compared, is weak, at least as it has been generally calculated

[Shi94, Gob94, Neh94]. Some have suggested that the compensatory replacement signal may

never be broadly useful for this reason [Tay94]. Others have suggested that the signal might

have value, especially if it can be strengthened.

A variety of laboratories have explored approaches to strengthen the compensatory

replacement signal [Olm99, Che97]. For example, the signal for charge compensation appears to

be stronger than the signal for size compensation, suggesting that it might be useful to analyze









different types of coincident replacement separately. Further, compensatory covariation signals

appear to be strongly dependent on the evolutionary distance between the two sequences,

measured in PAM units (the number of point accepted mutations per 100 amino acids) [Day78].

For example, charge compensatory changes were found to be more prominent at PAM 25 than at

PAM 100 [Che97].

These observations are not surprising given the model. Charge reversal changes might be

expected to be the change most in need of compensation. Likewise, at longer PAM distances,

pairwise compensation might be obscured beneath compensation arising from replacements at

multiple sites.

This work suggested a general strategy to strengthen the signal for compensatory

replacement. At the core of this strategy is the recognition that compensatory replacements can

be calculated in two ways. In the first, two extant protein sequences-sequences that are found in

organisms living today-are examined. As extant sequences are at the "leaves" of an

evolutionary tree, we call these "leaf-leaf" comparisons (Figure 4-1). In the past, virtually all

compensatory substitution has been sought via "leaf-leaf' comparisons.

An alternative approach is possible. Given a set of homologous protein sequences, a

multiple sequence alignment, and an evolutionary tree interrelating them, the sequences of

ancestral proteins represented by nodes in the tree can be approximated using well-known

heuristics. Given reconstructed sequences of ancestral proteins at nodes in an evolutionary tree,

compensatory covariation can be sought using "node-leaf" and "node-node" comparisons.

As compensatory signals are stronger when the sequences being compared are separated by

shorter distances [Che97], and as the distance between two nodes in a tree must be shorter than

the distance between the leaves on the tree (Figure 4-1), node-node and node-leaf compensatory









signals must be stronger than the leaf-leaf signal that contains them. Phrased differently, the

number of replacements between average nodes is smaller than the number between average

leaves. This means that the n(n-1)/2 total number of pairs is smaller, implying that the truly

compensatory pairs, those driven by a fitness constraint, will be less obscured by the background

of uncompensated events, when they are identified on individual branches of a tree.

Further, although the reconstructed ancestral sequences are probabilistic, and cannot be

proven to be correct, even crude heuristics for reconstructing ancestral sequences localize

specific changes to specific regions of a tree better than if no reconstructions are done at all.

Thus, even poor reconstruction heuristics permit us to focus on briefer episodes of time during

which two compensatory changes might have occurred than is possible by leaf-leaf comparisons.

Last, node-leaf and node-node comparisons model the actual evolutionary events that

might contain the compensatory signal better than leaf-leaf comparisons. The statement that

"position i and position should suffer replacement in a compensatory way" is equivalent to the

statement that if either position i or position suffer replacement individually, the host organism

is less "fit" (in a Darwinian sense). This, in turn, implies that if position i suffers a replacement,

then position is under "positive selective pressure" to suffer a replacement. Conversely, this

means that a replacement at position will be fixed in a population faster than expected for a

neutrally drifting position. This means that true compensatory changes will normally occur on

the same branch of the tree, even on a very short branch of the tree.

This study examined the feasibility of detecting node-node and node-leaf compensatory

covariation signals by examining reconstructed evolutionary events within families of proteins.

Results

A total of 71 families of proteins were examined in this study. For each family, a multiple

sequence alignment and a phylogenetic tree were constructed using Clustal-W [Tho94].









Reconstructed ancestral sequences at nodes throughout the tree were generated using the Fitch

method as described in Methods [Fit71].

We then sought charge reversal replacements in these families. For each family, positions

were identified where an amino acid replacement caused a charge reversal between two ancestral

sequences (representing a replacement event that occurred on a branch between two nodes) or

between a node sequence and a leaf sequence. Fractional changes were included. Each was

associated with a particular branch of the tree.

From these, pairs of replacements displaying charge compensatory behavior were

collected, as described in Methods. A pair of replacements was defined as being charge

compensatory if they were coincident (both occurring on the same branch of the tree), if they

each individually would reverse a charge, and taken together, if no change in the overall charge

of the protein resulted from the two.

A three dimensional crystal structure for a member of the protein family was then extracted

from the PDB and an estimate was made for the strength of the signal. In making this estimate,

we assumed that only proximal charge compensatory changes, near enough in space in the folded

structure that the two side chains could interact coulombically, could be functionally correlated.

We therefore tabulated the distances between the sites in pairs that suffered charge compensatory

replacements.

Histograms were then constructed to show the distance distribution of pairs suffering

compensatory replacements. As a standard, a distance distribution for all pairs of sites was

calculated for the proteins. If the number of charge compensatory pairs was located

disproportionately more in proximal positions than the average pair (where the square root of the

number of pairs was a crude measure for significance), a signal was considered significant.









As previous studies predicted, leaf-leaf comparisons found an only barely perceptible

charge compensatory signal (Figure 4-2). That is, pairs of positions suffering charge

compensatory replacement in two extant sequences were not much more likely to be near in

space than the average pair. The analogous analysis for each evolutionary branch in the tree,

however, identified a signal that was clearly perceptible (Figure 4-3), even by eye. This signal

was then analyzed.

Charge Compensation and the Surface of the Folded Protein

Surprisingly, the initial analysis (Figure 4-3a) showed that position pairs suffering node-

node or node-leaf compensatory charge replacement were more likely to lie both proximally and

distally in the fold, when compared with the average distance between all position pairs in the

proteins. That is, pairs of positions near in the fold displayed charge compensatory covariation

more frequently than the average pair, as expected should charge compensatory replacement be

functionally correlated. But pairs of positions distant in the fold also displayed charge

compensatory covariation more than the average pair.

Distal charge compensatory replacement suggested two explanations. First, overall net

charge might be a selected trait in a protein. It is conceivable, for example, that a constant

isoelectric point is desired by natural selection. If so, reversing a charge at one position would

require an adaptive replacement leading to the opposite reversal somewhere (anywhere), even at

a second position distant in the fold from the first.

Alternatively, charge changes are more likely to be tolerated by natural selection, and

therefore are more likely to be observed, if they occur on the surface of the protein. The mean

distance between a pair of surface residues is greater than the mean distance between the average

pair of residues. Therefore, charge changes are more likely by chance to occur in pairs more









distant than the average pair if they occur predominantly in positions on the surface, whether the

pair is under direct selection or not (i.e., if the replacement is neutral).

These considerations suggested that the distance between the average position pair might

not be the most informative reference distribution for these studies. Instead, an alternative

reference distribution was calculated (blue bars, Figure 4-3b) for the distance between pairs of

surface residues in the model proteins. This new distribution fit nicely the distal portion of the

distribution of distances between position pairs displaying charge compensatory replacement.

The result implies that the apparently high occurrence of charge compensatory replacement in

distant pairs of residues reflects simply the greater likelihood that charge reversal replacements

occur on the surface of the protein. This surface pair reference distribution does not, however, fit

the proximal portion of the distribution. The probability that a pair of positions displaying charge

compensatory replacement is near in the folded structure is considerably higher than expected

(Figure 4-3b). This represents the "signal" in the charge compensatory pattern of replacement.

When a side chain changes its charge, that change is more likely to be accompanied by a

compensatory change in the charge of another side chain near in space to the first.

Charge Compensation in Both Contiguous and Non-Contiguous Position Pairs

We then explored several features of this signal. First, we asked whether the signal arises

only in positions that were also nearby in the polypeptide sequence (contiguous pairs), or

whether it was also observed in residue pairs > 4 amino acids distant (those of i, i+q relationship,

where q > 4) in the sequence (non-contiguous pairs). A strong signal was also observed for non-

contiguous pairs (Figure 4-3c,d) as well as for contiguous pairs. This implies that a charge

compensatory replacement signal arises when two residues are near in space as a consequence of

the tertiary fold, as well as if they are near in space because they are near in the sequence.









Enhancing the Charge Compensation Signal

These results showed that node-node and node-leaf analyses generated a more perceptible

charge compensatory covariation signal than leaf-leaf analyses. Nevertheless, the signal

remained small.

We considered several explanations for the small size of the signal. First, we considered a

case where four sites suffer charge reversal replacements in a single evolutionary episode. Site a

suffers a (+ to -) replacement compensated by a (- to +) replacement at site b. Site c suffers a (+

to -) replacement compensated by a (- to +) replacement at site d. Sites a and b are proximal;

sites c and d are proximal. Compensatory changes at sites b and d are required to maintain a

functional protein given changes at sites a and c, respectively. Two pairs that do not represent

adaptively significant compensation (a,d and b,c) arise in addition to the two pairs that do (a,b

and c,d). This is simply another way of saying that changes that need compensation are more

noticeable when the total number of changes is small.

Therefore, we asked whether the signal was stronger if the only branches examined were

those holding exactly one pair of charge compensatory replacements (3e,f for all pairs and non-

contiguous pairs, respectively). The likelihood that two positions undergoing charge

compensatory replacement are near in space was indeed more perceptible after we excluded all

of the events occurring on branches where more than two positions suffered charge reversal. The

number of cases was small, however (57 for all pairs, and 48 for non-contiguous pairs,

respectively), and the plots accordingly displayed substantial variances. As the number of

sequences in the database grows, trees will become more articulated, individual compensatory

events will be more likely to be isolated from others on a single branch of the tree, and the signal

should strengthen.









Charge Compensation in Specific Secondary Structural Elements

We then examined more closely the contiguous position pairs (i, i+4 or nearer) displaying

charge compensatory covariation. The most striking feature of the charge compensatory signal

within contiguous pairs was its dependence on the nature of the secondary structural element that

held those positions (Figure 4-4). In 98% of the cases where the positions showing compensatory

replacement had an i, i+4 relationship, one or both of the residues was found in an alpha helix. In

only 1.6% of these cases was one of the residues found in a beta strand, and in none of the cases

were both found in a strand. This was significantly larger than the 47% of the position pairs with

an i,i+4 relationship having one or both residues in a helix found in the dataset as a whole (Table

4-1). Conversely, in 49% of the cases where the position pair showing compensatory substitution

had an i, i+2 relationship, one or both of the residues was found in a beta strand.

Two alternative explanations can be proposed to account for the secondary structure bias.

First, surface helices present residues to solvent (water) once every 3.6 turns. Surface residues

are expected to be less constrained by function from diverging (that is, single replacements are

more likely to be neutral), and are more likely than the average residue to suffer charge reversal

substitutions. Therefore, the abundance of compensatory changes occurring with i,i+3 or i,i+4

relationship in the sequence might simply reflect unconstrained (i.e., neutral) charge reversal at

surface positions.

Alternatively, loss of a coulombic interaction between residues i and i+3 or i+4 might lead

to a protein less able to contribute to the fitness of the host. This view implies that once a charge

reversal substitution occurs at position i, position i+4 is under sufficient positive selection to

acquire a compensating charge reversal substitution.

To explore these alternatives, we sought examples of anti-compensatory charge reversals,

where a charge reversal (+ to -, for example) was accompanied by another charge reversal









substitution in the same direction (+ to -) (Figure 4-5). The histogram showing the distance

distribution in pairs suffering compensatory covariation is shown in Figure 4-5. Here, the

distribution of distances between position pairs carrying charge anti-compensatory substitution

was not noticeably different from the distribution of distances between all surface position pairs

(Figure 4-3b). Notable is the absence of an increased probability of proximal anti-compensatory

pairs (compare with Figure 4-3).

As a predictive tool, this enhanced charge compensatory signal may prove to be most

valuable in secondary structure prediction. The accuracy of a prediction made on the relative

positions of a compensatory pair is extremely high. The "coverage" is low, however. Only a few

dozen examples were observed; only 6.8% of the helices contained within the 71 test families

have one or more charge compensatory position pair. This number will, of course, grow as the

size of the protein families grows with the increasing size of the genomic database (Table 4-2).

Charge Compensation in Buried Residues

The high dielectric constant of water is known to weaken coulombic interactions between

charged species. We therefore asked whether a stronger signal might be found in position pairs

where one or more of the side chains was buried. Figure 4-6a shows the distribution of surface

accessibility calculated for all charged amino acids (DEKR, or Asp, Glu, Lys, and Arg) (blue),

those that participate in a position pair suffering charge compensatory substitution (red), and

those that participate in a position pair suffering charge anti-compensatory substitution (green).

Not surprisingly, most buried charged residues do not suffer charge reversal within the test set.

Compensatory changes near in space (Figure 4-6b) were slightly more likely to be found in

positions that were more buried, a modest signal that suggests that compensation is more

necessary in partly buried sites shielded from the high dielectric constant presented by the

solvent water. This is the case for those residue pairs in protein kinase [Ben91] and









phosphoglycerate kinase [GohOO] where charge compensatory substitution has had predictive

value.

Discussion

Explicit reconstruction of ancestral proteins has been shown to provide insight into the

structure and function of protein families, both when done in silico [Fit71, Mes97, Tra96] and

when recombinant DNA technology is used to resurrect ancestral proteins from extinct

organisms so they can be studied in the laboratory [Ben88a, Mal90, Sta90, Jer95].

The work reported here applies ancestral reconstructions towards a new goal. We suggest

several conclusions. First, node-node and node-leaf comparisons between reconstructed ancestral

sequences provide a stronger compensatory covariation signal than the leaf-leaf comparisons that

have been used previously in the search for a compensatory covariation signal.

These results are consistent with the model outlined in the Introduction, which holds that

amino acid replacements that dramatically alter the physical property of the side chain will

disrupt the performance of a protein to an extent that requires a compensatory change elsewhere

in the protein structure a significant number of times. The fitness value of the protein can be

restored only by a second change that compensates for the first alteration in physical properties.

In the language of neutral theory, we would say that the first replacement was selectively

advantageous, the second was positively selected (in the context of the first), and both together

are neutral. Thus, while the two compensatory substitutions taken together might be regarded as

collectively neutral, the second mutation in a truly compensatory pair is viewed as positively

adaptive in this model.

These results also suggest that as the database grows, the charge compensatory signal will

become more perceptible as more sequences are added to each family. More sequences mean

more highly articulated evolutionary trees. This, in turn, means that compensatory events will









become better isolated on specific branches, preventing the "spurious" signals that arise when

more than one pair of compensatory events occurs along a specific branch on a tree.

The stronger signal will undoubtedly find use. Predicting secondary structure using

contiguous pairs of compensatory changes is one. It remains to be seen, however, how much data

in a family are required for the signal to be useful to support de novo assembly of a protein fold

in a prediction setting.

Accounting for the Stronger Signal From Node-Node Comparison

The tools that we have presented make the compensatory covariation signal more

perceptible. Isolation of truly compensatory pairs on shorter branches of a tree away from other

changes is, we believe, sufficient explanation for this effect. A tree with shorter branches implies

a smaller n, the total number of differences between the two protein sequences being compared,

diminishing the number of n(n-1)/2 pairs behind which the compensatory signal might be

obscured.

While it is possible in principle to construct a statistical model that permits double

replacements to be evaluated, this requires a high degree of empirical parameterization. For

example, nearly a decade ago, investigators collected the parameters needed to build a statistical

model that concerned double replacements at adjacent sites [Coh94]. Some 220 parameters are

formally required in this exercise, a large number by most measures.

For the purpose of this work, a signal is considered to be significant if it lies two standard

deviations outside of the fluctuation expected for n sites at a specified distance. This can be

estimated by the square root of the number of sites. By this measure, all of the results reported

here are strongly significant, with the exception of those reported in Figures 4-3e,f and Figure 4-

4.









A Model-Independent Method to Evaluate an Evolutionary Tree

The strength of the compensatory covariation signal undoubtedly depends on the degree to

which the trees and the reconstructed ancestral sequences accurately reflect the history of the

family. If the branching of the tree or the reconstructed sequences themselves are not correct, a

pair of charge compensatory replacements that are coincident, in fact, may not be assigned to the

same branch of a tree. In this case, the signal from this pair will be lost.

Getting the branching correct in an evolutionary tree is a difficult problem. Part of the

difficulty arises because of the trade-off between the accuracy of the tree and the cost of

generating it. For example, the Clustal-W [Tho94] and Fitch parsimony tools used here are

relatively inexpensive methods for reconstructing trees and ancestral sequences. Clustal-W uses

a neighbor joining tool [Sai87] based on estimates of the distances between sequence pairs

derived from the Kimura empirical formula [Kim83]. Ancestral sequences reconstructed by

parsimony are well known to be sensitive to incorrect branching topology. This is the principal

error associated with the choice of this inexpensive reconstruction tool.

More sophisticated methods, including maximum likelihood methods, are expected to

provide better trees, at least given the first order stochastic models. These are expected to

generate ancestral reconstructions that are more robust to errors in tree topology. They are,

however, more expensive.

Even the more expensive tools do not guarantee a correct tree, of course. In practice, the

approximations made in the model (see Introduction) may create systematic error larger than

fluctuation error. To date, the only way to benchmark a tree requires knowledge of the

evolutionary history of the sequences in question [Hil94], or a reconstruction of a simulated

evolutionary process [TakOO]. The first is difficult to get for sequences emerging from natural









history. The second requires a mathematical model for evolution, which is often the same one

that is used to construct the tree in the first place.

Here, the compensatory covariation signal, extracted from reconstructed ancestral

sequences, may provide a metric for the quality of a tree based on organic chemistry,

independent of any mathematical model for evolution. Hypothetically, the best tree should be the

tree that places compensatory replacements truly driven by natural selection on the same branch.

This requires the construction of a tree that reflects the actual evolutionary history. This, in turn,

implies that the tree has the most compensatory covariation is the tree that is most likely to

reflect the actual history.

To illustrate this application, consider four hypothetical proteins, just four amino acids in

length, having the sequences ALKD, MVKD, ALER, and MVER. Exactly two topologies exist

for unrooted trees that relate these four sequences (Figure 4-7). Both reconstructions have two

ambiguous sites in both ancestors. In Topology I, the first two positions are ambiguous; in

Topology II, the last two positions are ambiguous. Both trees require four "homoplastic" events

(independent mutations that cause sequence convergence). Both trees require exactly six

changes. Classical parsimony therefore ranks these two topologies as equally likely.

The two topologies are different, however, with respect to the extent to which charge

changes are compensated. In Topology I, a charge altering replacement is 100% likely to be

compensated. In Topology II, however, a charge altering replacement is only 50% likely to be

compensated. This is illustrated in Figure 4-7 by writing out four trees, each equally likely, that

carry reconstructions that the ambiguities require. If we postulate that compensatory covariation

is maximized, then Topology I is preferred over Topology II.









Conversely, an analogous logic can be used to assign preferred ancestral states involving

charged residues. For Topology I, the ancestral states involving charged residues are fixed. For

Topology II, the preferred ancestral sequences are in reconstructions IIa and IIb.

This metric can be applied even if no crystal structure is available for a protein family. If,

however, a crystal structure is available, then (as a practical matter) one would maximize the

number of proximal charge compensatory changes when identifying the preferred tree.

It will require much future work with many families to determine how useful the metric

will be. Worth noting at this point, however, is that this metric is rooted in principles of structural

biology (that is, organic chemistry), not in a mathematical formalism. Further, the proposed

metric values changes at position i in light of changes at position. Thus, this metric for

evaluating the quality of a tree is fundamentally different from any metric based on first order

stochastic analyses of protein sequences, which treat replacements at site i and site as

independent.

Darwinian Requirements for Compensatory Covariation

Even given these results, and the evidence that the charge compensatory substitution signal

can only become stronger as the database grows, it remains inescapable that the charge

compensatory signal is weak, perhaps even weaker "than expected." What might be the scientific

implications of this observation?

Charge compensatory covariation might be weak because the coulombic interactions being

sought may themselves be largely unimportant to the selective fitness of proteins. Gaining or

losing them, in this view, has insufficient impact on fitness to ensure that natural selection will

require compensation, and thereby prevent uncompensated charge reversals from entering the

global proteome. This implies a limit to the tool generally, one imposed by the physical organic

chemistry of folded protein sequences.









An alternative explanation should be considered, however. Observation of a compensatory

pair of substitutions implies, under the neutral theory, that natural selection preserved some

global feature of a protein during the episode represented by the branch between two nodes.

This, in turn, implies some degree of constancy in the behavior of the protein before and after the

episode where compensatory change has occurred.

In this view, compensatory replacement should be observed only in protein families whose

behavior must remain largely constant during this branch. This, in turn, implies that

compensatory covariation should be observed only during episodes where "function," defined as

the behavior that contributes to fitness, is largely conserved. In the language of the neutral

theory, the demand for compensation arises because the protein is optimized at the beginning of

the episode for fitness, the same behaviors are optimal at the end of the episode, and any

replacements occurring during the episode must have the net (and, if necessary, combined)

impact of being neutral with respect to their impact on selected behavior.

This implies, of course, that when functional behavior is changing, there may be no need to

compensate individual replacements in a sequence, even those that reverse charge. Indeed, an

uncompensated change may be more likely to generate a protein with different behaviors, whose

(now) different behaviors contribute most to the (now different) requirements for fitness. In this

view, compensatory covariation should not be observed, or should be observed less frequently,

whenever functional behavior is changing.

In this view, compensatory covariation is scarce because branches of an evolutionary tree

where functional behavior is rigorously conserved are scarce. This is, of course, a controversial

suggestion. Many computational biologists treat homologous proteins in distinct organisms as if

they were "the same protein," and neutral theory remains the majority view of protein sequence









evolution. In contrast, recent work in these laboratories and elsewhere has suggested that

functionally significant divergence in behavior is frequent, and may be the rule more than the

exception. For example, it is almost certainly observed in elongation factors, regarded as some of

the most functionally conserved proteins in the biosphere [Gau01 ].

Given this observation, compensatory replacements may become a powerful tool in

functional genomics to detect episodes where function is, and is not, conserved. A branch that

has more compensatory replacement is more likely to represent an episode where functional

behavior is constant than one with less compensatory replacement.

This is relevant to the issue of "annotation transfer" in comparative proteomics. Annotation

transfer assigns the function of a new protein by identifying in the database a homologous

protein for which the function is known, and transferring annotation describing that function to

the annotation for the new protein. Annotation transfer assumes that function does not change

within a set of homologous protein sequences as they diverge [Heg01, Fet98]. This assumption

has long been known to be poor in many proteins, including many characterized before the dawn

of the age of the genome [Ben88b].

To date, several tools have been suggested to detect functional change. One of these is to

measure KA/Ks values for branches of an evolutionary tree between reconstructed ancestral

sequences at the nodes of the tree [Mes97, Li85]. Here, compensatory changes would indicate

functional constancy, while uncompensated changes would indicate functional change. Because

compensatory analysis rests on protein sequences, while the KA/Ks value requires measurement

of silent substitution rates, and because silent substitution rates are frequently rather high, this

metric for functional recruitment may ultimately prove to be more valuable than KA/Ks ratios,

especially for deeply branching sequences.









Methods

The core of this study exploited the PDB "select 25" subset of proteins [Hob94]. Each

protein in this database was matched against the proteins contained in SWISS-PROT (version

33) [Bai91]. The older sequence dataset was chosen to avoid under-annotated sequences, in

particular, pseudogenes that might be divergently evolving without functional constraints.

Families that contained at least 12 members, where the maximum evolutionary distance between

any pair of sequences in the family was between 50 and 120 PAM units, and where the family

had at least two subfamilies defined at PAM 20 with four or more members, were retained.

These criteria, made to ensure a balanced tree, were satisfied by 71 families.

The sequences within each family were aligned using the MultAlign package [KorOO] with

the option PROB from Darwin system [Gon91, Ben93]. The gap shifting heuristic was applied

iteratively until the overall alignment score ceased to improve. Secondary structure assignments

were extracted from the crystallographic data using DSSP [Kab83].

An evolutionary distance matrix and a phylogenetic tree were computed for each family

using Clustal-W [Tho94], which employs a neighbor joining method using distances derived

from the Kimura empirical formula [Kim83]. Branches with negative lengths were ignored.

Probabilistic reconstructed ancestral sequences at nodes in the tree were then built using

the Fitch parsimony method [Fit71]. For those sites where parsimony did not generate a single

assigned residue in an ancestral sequence, fractional probabilities were assigned to each of the

contender amino acids were assigned by the statistical method described by Fitch [Fit71]. The

statistical method was used to assign probabilities to each possible path between the residue at a

site at an ancestral node's residue (either a single amino acid or a set of possible amino acids

each with an assigned probability) and its two descendant residues (either a residue in an extant









species' sequence (a node-leaf comparison) or a residue (or set of possible amino acids) in

another ancestral species (a node-node comparison)).

Using probabilistic reconstructed ancestral sequences computed in this way, charge

compensation was sought on individual branches of the evolutionary trees. For each branch in

each tree, the sequences at the flanking nodes were compared (residue by residue) to identify

single substitutions that reversed charge. A position was retained if and only if one of the

following transition probabilities (K->E, K->D, R->E, R->D, E->K, E->R, D->K, D->R) was >

0.2. If more than one probability was > 0.2, then both transition probabilities were retained for

that position.

Coordinates locating the position of C-beta were then extracted from the reference crystal

structure with the PDB. When the residue was Gly, the position of the beta carbon atom was

inferred from the position of the backbone atoms. A Perl script was written to permit calculation

of the distances between the beta carbon atoms for each pair of amino acids retained. This

generated the distances reported in the Figures. The accessible surface area (ASA) was computed

using the DSSP program [Kab83].

A list was then made of all pairs of positions having compensatory charge reversals for

each branch of the tree. These were defined to be "position pairs" undergoing charge

compensatory covariation. The distance between the beta carbon atoms of the position pairs was

inferred from the positions of those two carbon atoms in the reference crystal structure. The

pairwise distance was then calculated for all amino acids in the reference crystal structure, and

then for the distance between pairs of residues on the surface of the reference structure.








N7



N5 node-
S\ comp


N1 N2



L1 L2 L3 L4L


rec


node N6
arson


N3 N4



5 L6 L7 L8


constructed
sequences
at nodes
in tree


Sequences of extant
proteins at the
leaves of the tree


leaf-leaf
comparison
Figure 4-1. A leaf-leaf comparison (red) traverses more evolutionary distance than a node-node
comparison (blue).


















M compensatory pairs
r[~) all pairs





SILL_


(a)
0.12
0.10
0.08
0.06
0.04
0.02
0.00




(c)
0.12
0.10
0.08
o0.06

0.04
0.02
0.00


M- compensatory pairs
r-I all pairs


.fl S C) .4 I~ fl S W .4 1- ,~ 0% fl ~4 C.-

.4 .4 C-. n fl IA ~D 0 1. 1 0


(b)
0.12
0.10
0.08
0.01
0.04
0.02
0.00




(d)
0.12
0.10
0.08
0.06
0.04
0.02
0.00


C-beta distance


M compensatory pairs
I-1 surface pairs


M compensatory pairs
C- surface pairs


- ta distancllIIIII II II





C-beta distance


Figure 4-2. Attempted detection of charge compensatory covariation signal using leaf-leaf
comparisons. (a) Distribution of distances in 71 protein families between position
pairs displaying charge compensatory substitution (a positive-for-negative
substitution at position i, and a negative-for-positive substitution at position j) (red
bars), using as a reference (green bars) the distribution of all pairwise distances in the
proteins. The x-axis is in angstroms; the y-axis is frequency, with each distribution
normalized to unity. Note the absence of considerably taller red bars at short
distances. Total observations in sample: 13,460. (b) As in (a), but using as a reference
curve the distances between all pairs of surface residues (blue bars, > 50% exposure,
calculated by normalizing its accessible surface area (ASA) by the standard ASA of
the residue. ASA was computed with the DSSP program. Total observations in
sample: 13,460. (c) As in (a), but considering only non-contiguous pairs, those where
i andj are separated by more than four positions, with reference to a distribution of
distances between all pairs of residues in the reference crystal structures (green bars).
Total observations in sample: 12,811 .(d) As in (c), but with reference to the
distribution of distance between surface pairs (blue bars). Total observations in
sample: 12,811.


U~ N I 0 tO 1. 14 0 In .-

0 '0 CC 0 0 '0 CC 0 '0 fl -
.-1 .4 N t1 P~ W C- P.








(a) (b)
0.10 i .oao ..w ory pai. 0+.1 i 1 0ri 0r*
oV o .1) ol

i 0.10 M 4 0.04
a3. 0. .02

........ .l .... .
(C)(d


004



(e)
0.00 r


- ~ p.fr.


1 V .1 p.0.
-, 0 0- 0- --


C-beta distance C-beta distance
Figure 4-3. Detecting charge compensatory covariation signal using explicitly reconstructed
ancestral sequences. (a) Distribution of distances in 71 protein families between
position pairs displaying charge compensatory substitution (a positive-for-negative
substitution at position i, and a negative-for-positive substitution at position j) (red
bars), using as a reference (green bars) the distribution of all pairwise distances in the
proteins. The x-axis is in angstroms; the y-axis is frequency, with each distribution
normalized to unity. Note the greater height of the red bars at both short distances and
at long distances. Total observations in sample: 803.3 (note fractional number
reflecting fractional character assignments in ancestral states; precision is less than
implied by the decimal). (b) As in (a), but using as a reference curve the distances
between all pairs of surface residues (blue bars, > 50% exposure, calculated by
normalizing its accessible surface area (ASA) by the standard ASA of the residue.
ASA was computed with the DSSP program. Note the greater height of the red bars at
short distances only. This is the compensatory covariation signal. Total observations
in sample: 803.3. (c) As in (a), but considering only non-contiguous pairs, those
where i andj are separated by more than four positions, with reference to a
distribution of distances between all pairs of residues in the reference crystal
structures (green bars). Total observations in sample: 745. (d) As in (c), but with
reference to the distribution of distance between surface pairs (blue bars). Total
observations in sample: 745.5. (e) Distance distribution of charge compensatory pairs
where only one such pair occurs on a specific branch of the evolutionary tree between
reconstructed ancestral nodes. Total observations in sample: 57.2. Note the small
sample size. (f) Same as (e), but where the position pair is separated by more than
four positions in the linear sequence. Total observations in sample: 48.3. Note the
small sample size.











i,i+1


i, i+3


i, i+2


i, i+4


i, i+5 i, i+6

MEE NEC MEH
*HH HCEI CC


Figure 4-4. Predicting secondary structure using contiguous pairs of compensatory changes. The
relative sizes of the pies indicate the relative numbers of examples of each pair.
Where position pairs are separated by 1-5, or 6 positions, the likelihood that residue
pairs are both found in helices (red), both found in strands (dark green), one in a helix
and the other in a coil (pink), one in a strand and the other in a coil (light green), one
in a helix and the other in a strand (violet), and both found in coils (yellow). Note that
if two charge compensatory substitutions are observed with a i, i+4 relationship, one
or both are 98% likely to lie in a helix. The total number of observations; i, i+1 =
12.82; i, i+2 = 6.47; i, i+3 = 11.38; i, i+4 = 23.37; i, i+5 = 4.79; i, i+6 = 6.48. Note
fractional values and the small size of these samples.












(b)


H1 H M M 0 H 0 0 H


C-beta distance


M anti-conpensatory pairs
=) surface pairs







0 0 N O V 0 4a W c N V 0 V M c V i
H H M M M IV V &A 0 %0 C, 0O


C-beta distance


Figure 4-5. Distribution of distances between charge anti-compensatory pairs (red bars) (a)
relative to all pairwise distances (green bars) and (b) relative to pairwise distances
between all pairs of surface residues (blue bars, > 50% exposure). Notable is the
absence of the increased probability of proximal anti-compensatory pairs (compare
with Figure 4-3). The total number of observations in both distributions = 793.3.


(a)












M anti-compensatory residues
M compensatory residues
o O DEKR residues






1? T T I I I I IT I I I


M compensatory (near)
compensatory (far)





41j t I 11 I I I I n f I

o O 0O 0 0 0 0 0 0 0 H t Hl
o H Ci M l i O! 0 H i
O O O O O O O O O O H H H


Surface accessibilities


Figure 4-6. Surface accessibility of charged residues, and charged residues participating in a
charge compensatory event.(a) A frequency versus accessibility distribution for all
Asp, Glu, Lys, and Arg (DEKR) residues (blue), those participating in a charge
compensatory (red) and anti-compensatory (green) events. Bars of each color sum to
unity. Buried DEKR residues are less likely than average to suffer charge reversal
substitution. Compensated charge reversal substitutions are slightly more likely than
anti-compensated charge reversal substitutions. The total number of observations in
sample: anti = 2,812.4; comp = 2,792.2; DEKR = 3,530.0. (b) A frequency versus
accessibility distribution showing the frequency of compensated charge reversals for
proximal pairs near in space (less than 12 A distant, red) in the folded structure, and
distal pairs distant in the folded structure (more than 12 A distant, green). A pair of
charged compensatory substitutions is slightly more likely to be near in space if the
side chains involved are more buried. The total number of observations in sample:
near = 367.1; far = 795.7.


(a)


0.14
0.12
0.10
0.08
0.06
0.04
0.02
0.00


(b)


0.14
0.12
0.10
0.08
0.06
0.04
0.02
0.00














Topology I


- - charge changes compensated
AAAAA charge changes not compensated
--- no charge changes


ALKD ALER
M1A < M1A
M K3E M1A
MLKD --_-_ MLER
L2V
a L2V < L2V


MVKD


Topology II
ALKD ALER


- - charge changes compensated

4vdvvv charge changes not compensated

--- no charge changes


MVER


0.5 M1A V2L
ALyd K3E

0.5 A1M A1M
D 0.5 L2V

A NAL
V2L V2L AL
\^ K3E /
AVKD >--D R- AVER
D4R
A1M A1M
S a MV

) AI

AK K3E D
ALKD -D- ALERT
A D4R
/AIM A1M<


ALER


s K
SD4
ALKD
AIM
V2L
IMVKD
a K
Da D


MVKD

ALKD


MVER

ALER


ALKD

E
R







MVKD

ALKD


ALER


3 /
4D
ALER
A1M
V2L
MVER
E3K ,
R4DsS
, b
MVER

ALER


MVKD c MVER MVKD d MVER

Figure 4-7. A schematic illustration of the use of compensatory covariation to select a preferred
tree from two equally parsimonious trees. The two tree topologies relating the four

sequences (ALKD, MVKD, ALER, and MVER) each require six changes. The

changes are marked on individual branches, with fractional changes arising from the

ambiguity in the ancestral sequences. The ancestral sequences are placed at the nodes

in the tree, with ambiguous sites (by parsimony) noted by placing the two possible

residues above and below a horizontal line. For each topology, identical trees holding

all four possible ancestral sequences are shown. Each, by parsimony, has equal

likelihood (0.25 for each).In Topology I, the ancestral sequences are ambiguous at the

first two positions. In Topology II, these are ambiguous at the last two positions.

Both trees require the same amount of homoplasy (convergence). Classical parsimony

analysis is indifferent with respect to the two topologies. In Topology I, however, the

likelihood that a charge reversal is compensated is unity. In Topology II, the

likelihood that a charge altering replacement is compensated is only 0.5. Thus,

Topology I is preferred if compensatory covariation is maximized. This criterion is

independent of mathematical formalisms used to construct the tree. Further, the

metric weights changes at position i depending on events at position j, making this

metric for evaluating a tree fundamentally different from any metric based on a first

order stochastic analysis of protein sequences


ALKD


M1A M1A
V2L
M> K3E
MVKD -R. MVER
VK D4 R <\









Table 4-1. Frequencies of the average contiguous position pairs participating in a helix versus
strand.
Relationship between the pair in the protein sequence
Pair i,i+1 i,i+2 i,i+3 i,i+4 i,i+5 i,i+6
EE 0.185 0.146 0.119 0.099 0.087 0.079
EC 0.077 0.148 0.192 0.219 0.234 0.239
strand 0.262 0.294 0.311 0.318 0.321 0.318
HE 0.003 0.011 0.022 0.033 0.044 0.055
HH 0.307 0.274 0.243 0.223 0.208 0.194
HC 0.072 0.132 0.184 0.214 0.233 0.249
helix 0.382 0.417 0.449 0.470 0.485 0.498
CC 0.356 0.289 0.241 0.212 0.195 0.185
Total 1.00 1.00 1.00 1.00 1.00 1.00
Total sample 12.8 6.5 11.4 23.4 4.8 6.5
EE, both positions involved in a charge compensatory event found in strands; EC, one in a strand
and the other in a coil; Strand: HE, one in a helix and the other in a strand; HH, both in helices;
HC, one in a helix and the other in a coil; Helix: CC, both in coils. Calculated from the test set of
reference crystal structures using DSSP. Note the small number of observations in each sample,
and the fractional number of changes arising from the parsimony reconstructions.










Table 4-2. List of 71 protein families used in this analysis
ID L N Protein name
121P 166 60 H-Ras P21 Protein
193L 129 43 Lysozyme
1AAF 55 38 Hiv-1 Nucleocapsid Protein
1AAK 152 20 Ubiquitin Conjugating Enzyme
1ARS 396 28 Aspartate Aminotransferase
1ATP(E) 350 20 c-AMP-Dependent Protein Kinase
1BET 107 14 Beta-Nerve Growth Factor
1BP2 123 64 Phospholipase A2
1CCR 112 93 Cytochrome c
1CPC(B) 172 44 C-Phycocyanin
1CYO 93 17 Cytochrome B5 (Oxidized)
1DLH(A) 180 28 Hia-Drl (Dra, Drbl 0101) Human Class II Histocompatibility Protein (Extracellular
Domain)
1DLH(B) 188 45 HIa-Drl (Dra, Drbl 0101) Human Class II Histocompatibility Protein (Extracellular
Domain)
lEFT 405 57 Elongation Factor Tu (Ef-Tu)
1FRP(A) 335 18 Fructose- 1,6-Bisphosphatase (D-Fructose- 1,6-Bisphosphate 1-Phosphohydrolase)
1FVL 70 29 Flavoridin
1FXI(A) 96 69 Ferredoxin I
1GDD 353 51 Guanine Nucleotide-Binding Protein G(I)
1GPI(A) 198 16 Glutathione Peroxidase
1HAR 216 36 HIV-1 Reverse Transcriptase (Amino-Terminal Half)
1HCN(A) 92 26 Human Chorionic Gonadotropin
1HDG(O) 332 109 Holo-D-Glyceraldehyde-3 -Phosphate Dehydrogenase
1HGE(A) 328 70 Hemagglutinin
1HLE(A) 345 14 Horse Leukocyte Elastase Inhibitor
1HMT 132 16 Fatty Acid Binding Protein
1HPM 386 128 44K Atpase Fragment (N-Terminal) Of 70kda Heat-Shock Cognate Protein
1HRA 80 42 Retinoic Acid Receptor
1HRY(A) 76 43 Human Sry
1HTB(A) 374 56 Beta-3 Alcohol Dehydrogenase
1HTM(D) 138 92 Hemagglutinin Ectodomain (Soluble Fragment, Tbha2)
1HUR(A) 180 35 Human Adp-Ribosylation Factor 1
1HUW 166 31 Human Growth Hormone
1HVD 319 23 Annexin V
IIRK 306 17 Insulin Receptor (Tyrosine Kinase Domain)
IITG 166 42 Hiv-1 Integrase (Catalytic Domain)
IIVD 388 23 Influenza A Subtype N2 Neuraminidase (Sialidase)
1LDM 329 27 M4 Lactate Dehydrogenase
1MHC(A) 282 143 Mhc Class I Antigen H2-M3
1MLS 154 74 Myoglobin
1NDH 272 27 Cytochrome B5 Reductase
1NHK(L) 144 26 Nucleoside Diphosphate Kinase
1NIP(A) 289 35 Nitrogenase Iron Protein
10OCT(C) 156 33 Oct-1 (Pou Domain)
10OSA 148 64 Calmodulin
1PBX(A) 143 56 Hemoglobin
1PDN(C) 128 28 Prd Paired Domain
1PLQ 258 13 Proliferating Cell Nuclear Antigen
1PVC(2) 271 24 Poliovirus Type 3, Sabin Strain
1REC 201 21 Recoverin
1SXC(A) 151 51 Superoxide Dismutase
1TGX(A) 60 51 Toxin gamma
1TIV 86 24 HIV-1 Transactivator Protein










Table 4-2. Continued
ID L N Protein name
1TPH(1) 247 35 Triosephosphate Isomerase
1YTB(A) 180 18 TATA-Box Binding Protein
1ZAA(C) 87 18 Zif268 Immediate Early Gene
2BTF(A) 375 121 Beta-Actin-Profilin Complex
2CPL 165 35 Cyclophilin A
2GDM 153 23 Leghemoglobin
2HMX 133 40 Human Immunodeficiency Virus Type 1 Matrix Protein
2HPE(A) 99 36 HIV-2 Protease
2REB 352 58 Rec A Protein
2TGI 112 18 Transforming Growth Factor-Beta Two
3RUB(S) 123 79 Ribulose 1,5-Bisphosphate Carboxylase/Oxygenase (Form III)
4ENL 436 35 Enolase
4FGF 146 16 Basic Fibroblast Growth Factor
4GCR 174 31 Gamma-B Crystallin
4MT2 62 50 Metallothionein Isoform II
4RHV(1) 289 24 Rhinovirus 14
4RHV(3) 236 24 Rhinovirus 14
7RSA 124 48 Ribonuclease A
8CAT(A) 506 34 Catalase
ID, Protein Data Bank identifier, protein subunit in parentheses; L, chain length; N, number of
sequences in multiple sequence alignment.









CHAPTER 5
THE PLANETARY BIOLOGY OF CYTOCHROME P450 AROMATASES

Background

The emergence of complete genomes for many organisms, including humans, has created

the need for hypotheses concerning the "function" of specific genes that encode specific proteins

[Gau04]. While "function" is interpreted by different workers in different ways [Bor98],

Darwinian theory (by axiom) requires that the term be connected to fitness; natural selection is

the only mechanism admitted by theory to generate functional behavior in a living system, macro

or molecular. This, in turn, implies that the hypotheses about function have a "systems"

component, including the interaction of the protein with other proteins, their impact on the

physiology (defined broadly) of the cell and organism, and the consequences of physiology in a

changing ecosystem in a planetary context [Ben02].

Systems hypotheses can be supported by information from many areas. Geology,

paleontology, and genomics, for example, provide three records that capture the natural history

of past life on Earth. At the same time, structural biology, genetics, and organic chemistry

describe the structures, behaviors and reactivities of proteins that allow them to support present

life. It has been appreciated that a combination of these six types of analysis provides insights

into functional behavior of proteins that cannot be provided by any of these alone [Ben02]. Over

the long term, we expect that the histories of the geosphere, the biosphere, and the genosphere

will converge to give a coherent picture showing the relationship between life and the planet that

supports it. This picture will be based, however, on individual cases that serve as paradigms for

making the connection.

The aromatase family of proteins offers an interesting system to illustrate the power of this

combination as a way to create hypotheses regarding protein function within a system [Con01a].









These hypotheses are not "proof," of course, but are limiting in genomics-inspired biological

experimentation, now that genomic data themselves are so abundant.

Aromatases are cytochrome P450-dependent enzymes that use dioxygen to catalyze a

multistep transformation of an androgenic steroid (such as testosterone) to an estrogenic steroid

(such as estradiol) (Figure 5-1). The protein plays a key role in normal vertebrate reproductive

biology, a role that appears to have arisen before fish and tetrapods (land vertebrates, including

mammals) diverged some 375 Ma. [Cal84]. Aromatase is important in modern medicine as well,

especially in breast and other hormone-dependent cancers [Wol02].

Different numbers of aromatase genes are found in different vertebrates. Two aromatase

genes are known in teleost fish [Cal97, Cha97]. Only a single gene is known in the horse

[Boe97], rat [Hic90], and mouse [Ter91]. Cattle have both a functional gene and a pseudogene

built from homologs of exons two, three, five, eight and nine of their functional gene; these are

interspersed with a bovine repeat element [Fur95, Hin93]. In several mammalian species,

including humans and rabbits, a single gene yields multiple forms of the mRNA for aromatase in

different tissues via alternative splicing [Har88, Del96, Sim97, Del98].

A still different phenomenology is observed in the pig (Sus scrofa). Three different mRNA

molecules had been reported in different tissues from pig [Cor95, Con97, Cho96, Cho97a,

Cho97b]. Compelling evidence then emerged that the three variants of mRNA identified in

cDNA studies arose from three paralogous genes [GraOO], rather than from a single gene

differentially spliced [Cor04]. This implies that the three aromatase paralogs in pigs arose via

gene duplications relatively recent in geologic time.

Hypotheses relating the function of the three depend in part on when those duplications

took place. If they were very recent, then the three genes might have helped pigs adapt to