Paleogenetics: The Past As a Key to Unlock the Present. The Resurrection of Ancestral Proteins to Elucidate the Function of Seminal Ribonuclease

Material Information

Paleogenetics: The Past As a Key to Unlock the Present. The Resurrection of Ancestral Proteins to Elucidate the Function of Seminal Ribonuclease
SASSI, SLIM ( Author, Primary )
Copyright Date:


Subjects / Keywords:
Ambiguity ( jstor )
Amino acids ( jstor )
Buffaloes ( jstor )
Codons ( jstor )
Enzymes ( jstor )
In vitro fertilization ( jstor )
RNA ( jstor )
Ruminants ( jstor )
Topology ( jstor )
We they distinction ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Slim Sassi. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:






Copyright 2005 by Slim Sassi


To my father, M.L. Sassi; my mother, Samia Sassi; and my grandfather, Abdelrazzak Dhane.


ACKNOWLEDGMENTS

I would like to thank the faculty of the Anatomy and Cell Biology department for their generous help. I would also like to thank my family for their unconditional support, and especially my parents for their sacrifices. Finally, I would like to thank my advisor, Professor Steven A. Benner, for his valuable guidance, endless enthusiasm, and colorful personality.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 MOLECULAR PALEOSCIENCE: SYSTEMS BIOLOGY FROM THE PAST
  Introduction
  A Role for History in Molecular Biology
  Evolutionary Analysis and the "Just so Story"
  Biomolecular Resurrections as a Way of Adding to an Evolutionary Narrative
  Practicing Experimental Paleobiochemistry
  Building a Model for the Evolution of a Protein Family
  Homology, alignments, and matrices
  Trees and outgroups
  Correlating the molecular and paleontological records
  A Hierarchy of Models for Modeling Ancestral Protein Sequences
  Assuming that the historical reality arose from the minimum number of amino acid replacements
  Allowing the possibility that the history actually had more than the minimum number of changes required
  Adding a third sequence
  The relative merits of maximum likelihood versus maximum parsimony methods for inferring ancestral sequences
  Computational Methods
  How not to Draw Inferences About Ancestral States
  Ambiguity in the Historical Models
  Sources of Ambiguity in the Reconstructions
  Managing Ambiguity
  Hierarchical models of inference
  Collecting more sequences
  Select sites considered to be important and ignore ambiguity elsewhere
  Synthesizing multiple candidate ancestral proteins that cover, or sample, the ambiguity
  The Extent to Which Ambiguity Defeats the Paleogenetic Paradigm
  Examples
  Ribonucleases from Mammals: From Ecology to Medicine
  Resurrecting ancestral ribonucleases from artiodactyls
  Understanding the origin of ruminant digestion
  Ribonuclease homologs involved in unexpected biological activities
  Paleobiochemistry with eosinophil RNase homologs
  Paleobiochemistry with ribonuclease homologs in bovine seminal fluid
  Lessons learned from ribonuclease resurrections
  Lysozymes: Testing Neutrality and Parallel evolution
  Transposable Elements and Their Ancestors
  Long interspersed repetitive elements of type 1
  Sleeping Beauty transposon
  Frog Prince
  Biomedical applications of transposons
  Chymase-Angiotensin Converting Enzyme: Understanding Protease Specificity
  Resurrection of Regulatory Systems: The Pax System
  Visual Pigments
  Rhodopsins from archaeosaurs, an ancestor of modern alligators and birds
  The history of short wavelength sensitive type 1 visual pigments
  Green opsin from fish
  Blue opsins
  Planetary biology of the opsins
  At What Temperature Did Early Bacteria Live?
  Elongation factors
  Isopropylmalate and isocitrate dehydrogenases
  Conclusions from these two studies
  Alcohol Dehydrogenase: A Changing Ecosystem in the Cretaceous
  Resurrecting the Ancestral Steroid Receptor and the Origin of Estrogen Signaling
  Ancestral Coral Fluorescent Proteins
  Isopropylmalate Dehydrogenases
  Global Lessons

2 INTRODUCTION AND HYPOTHESIS
  Introduction
  The Phenomenology of Seminal RNase and the Proposed Hypothesis

3 BIOINFORMATIC ANALYSIS OF THE SEMINAL RNASE GENE SEQUENCES
  Building Evolutionary Models
  Maximum Likelihood
  Sources of Data
  Sources of Ambiguity
  Building an Evolutionary Model for the Seminal RNase
  Inferring Trees
  Inferring Ancestral Sequences
  Constructing the Ancestral Sequences
  Empirical Amino Acid Model:
  Codon Models
  Sources of Ambiguity in Ancestral Sequence Inference
  The Ancestral Sequences for Seminal RNase
  Interpreting the Evolutionary Model
  Seminal RNase in Oxen is the Product of a Recent Episode of Adaptive Evolution
  Maximum Likelihood Model for Adaptive Evolution
  Seeking Signatures of Adaptive Evolution in the Seminal Ribonuclease Gene Family
  Assessing the Significance of the Output
  The Only Branch with Significantly High Values Leads to Modern Seminal Bovine RNase
  Gene Duplications
  Tests for Pseudogeneness Based on NonMarkovian Features of Sequence Evolution
  Structural and Evolutionary Analysis Combined: A Test for Pseudogeneness
  Statistical Support for the Distribution of the Ancestrally Replaced Amino Acids on the Structure of RNase
  Conclusion for Derived Pseudogenes:

4 PRODUCTION AND BIOCHEMICAL CHARACTERIZATION OF RESURRECTED RIBONUCLEASES
  Production of Ancestral Proteins
  Site Directed Mutagenesis Method
  Protein Expression
  Ribonuclease toxicity in expression
  Solutions to expression problem
  Purification of Ancestral Ribonucleases
  Method 1
  Method 2
  Ancestral Protein Characterization
  Enzyme Kinetic Activity of Ancestral Ribonucleases
  Assaying RNase catalytic activity using UpA as the substrate
  Assaying RNase catalytic activity using a fluorescent tetranucleotide as a substrate
  Results and interpretations
  Dimerization
  Disulfide Based Dimerization
  Domain Swap with a Divinylsulfone (DVS) Assay
  Materials and Methods
  Site-Directed Mutagenesis
  CE6 Phage Induced Expression
  Induction of Expression With IPTG
  Cell Lysis and Inclusion Bodies Purification
  Solubilization of Inclusion Bodies
  Purification of RNase
  Dimerization
  DVS Swap Crosslinking Assay
  UpA Kinetics Assay
  Fluorescent Tetranucleotide Kinetics Assay

5 SEMINAL RIBONUCLEASE AND THE IMMUNE SYSTEM
  Immune Response in the Reproductive Tract and Sperm
  Immune Suppressive Factors in Mammalian Seminal Plasma
  Immunological Assays
  Mitogen Induced Lymphocyte Proliferation Assay
  Mixed Lymphocyte Assay (MLR)
  Apoptosis Assay Using Flow Cytometry:
  Materials and Methods
  Bovine Peripheral Blood Leukocyte Isolation: Method 1
  Bovine Peripheral Blood Leukocyte Isolation: Method 2
  Mitogen Induced Proliferation Assay:
  Mixed Lymphocyte Proliferation Assay:
  Annexin V/Propidium Iodide Apoptosis Assay

6 CONCLUSIONS

APPENDIX: MUTAGENESIS PRIMERS AND STARTING SEQUENCE
LIST OF REFERENCES
BIOGRAPHICAL SKETCH


LIST OF TABLES

1-1 Examples of molecular resurrections
1-2 Axioms commonly incorporated into theories of protein sequence evolution
1-3 Sequence changes in reconstructed ancient ribonucleases
1-4 Kinetic properties of reconstructed ancestral ribonucleases
1-5 Thermal transition temperatures for reconstructed ancient ribonucleases
1-6 Percent sequence identity between ancestral and modern proteins
1-7 Kinetic properties of ADH 1, ADH 2, and candidate ancestral ADHAs
1-8 Duplication in the Saccharomyces cerevisiae Genome where 0.80 < f2 < 0.86
3-1 Sequence summary for sites influencing seminal RNase tree*
3-2 This table shows the different models used to detect adaptive episodes of evolution for both topologies (Topology 1 and Topology 4) for the pancreatic outgroup
3-3 This table shows the different models used to detect adaptive episodes of evolution for both topologies (Topology 1 and Topology 4) for the brain outgroup
3-4 This table shows the AIC and Akaike weights (wi) given all models and both topologies (Topology 1 and Topology 4) for the pancreatic outgroup
3-5 This table shows the AIC and Akaike weights (wi) given all models and both topologies (Topology 1 and Topology 4) for the brain outgroup
3-6 Buried versus exposed crystallographic distribution of total and ancestrally replaced amino acids in BS-RNase
4-1 UpA assay results


LIST OF FIGURES

1-1 An Old World fossil suid shortly after the emergence of large litter sizes
1-2 A 20 x 20 matrix
1-3 A part of the multiple sequence alignment and tree relating the sequences of the ribonucleases from the eland, the ox, and the swamp and river buffaloes, covering sites 37-43
1-4 A fragmentary specimen of Pachyportax latidens
1-5 The amino acid at site 39 is methionine (Met, or M) in the RNase from both the swamp and the river buffalo
1-6 The inferences drawn from the RNase sequences drawn from the sequences of swamp and river buffalo
1-7 The probabilistic values for the amino acid residue at site 38
1-8 A rooted tree for site 38
1-9 A probabilistic model for site 39 in the last common ancestor of swamp and river buffalo
1-10 In a maximum likelihood model of site 39
1-11 A parsimony analysis infers an Asn at site 38
1-12 A maximum likelihood analysis infers an Asn at site 38
1-13 The ancestral residues inferred for specific sites at internal nodes in the tree also assign changes along individual branches in the tree
1-14 With a different tree, the amino acid inferred at site 41 changes in a parsimony analysis
1-15 The distribution of amino acids at site 168 and site 211
1-16 Fossils of Eotragus
1-17 Fossils of Leptomeryx, a primitive ruminant from the North American Oligocene
1-18 The evolutionary tree used in the analysis of ancestral RNases
1-19 A tree showing the divergence of seminal ribonucleases from ruminants
1-20 Tree showing relationship of a variety of serine proteases
1-21 Eomaia, a primitive placental mammal, and Sinodelphys, a primitive marsupial, at the Jurassic/Cretaceous boundary
1-22 A tree relating the Pax genes
1-23 Visual pigments from vertebrates analyzed by Chang and her coworkers
1-24 A fossil of Confuciusornis sanctus from the Jurassic/Cretaceous boundary
1-25 An evolutionary tree of 21 vertebrate SWS1 pigments
1-26 The phylogenetic tree of the vertebrate RH2 opsins
1-27 The two un-rooted universal trees used to reconstruct ancestral bacterial sequences
1-28 GDP binding assay to test thermostability of ancestral and modern EF proteins
1-29 Evidence of stromatolites in the geologic record
1-30 The pathway by which yeast makes, accumulates, and then consumes, ethanol
1-31 Maximum likelihood trees interrelating sequences determined in this work with sequences in the publicly available database
1-32 A histogram showing all of the pairs of paralogs in the Saccharomyces cerevisiae genome, dated using the TREx tool
1-33 Cretaceous fruit from Patagonia showing insect damage
1-34 Phylogeny of GFP-like proteins from the great star coral M. cavernosa and closely related coral species
3-1 Coin toss example illustrating the principle of Maximum Likelihood
3-2 Three-dimensional plot representing the likelihood distribution for two parameters
3-3 A hypothetical phylogenetic tree relating nucleotide site (A, C, and G)
3-4 The multiple sequence alignment of the protein sequences of the seminal genes
3-5 This figure represents two tree topologies of the seminal RNase gene family (Topology 1 and Topology 4)
3-6 This figure represents two tree topologies (Topology 2 and Topology 3)
3-7 Hypothetical phylogenetic tree used as example to illustrate ancestral states reconstructions
3-8 A distance tree represents the differences between all 17 inferred ancestral candidates for node An19
3-9 Multiple sequence alignment showing the ancestral sequences considered in this study
3-10 The values of for each branch in Topology 1 under the pancreatic and brain outgroups
3-11 The values of for each branch in Topology 4 under the pancreatic and brain outgroups
3-12 The dimeric structure with swapped domains (amino acids 1-20) of BS-RNase
3-13 DSSP solvent accessibility classification of the monomeric (pdb 1N3Z) and dimeric (1BSR) structures of BS-RNase
4-1 Site directed mutagenesis strategic plan to produce the various RNase ancestral candidate genes
4-2 Plasmid stability test for most RNase variants
4-3 Plasmid stability test for AN21 RNase
4-4 A Coomassie stained SDS-PAGE gel separating E. coli cell lysate after inducing expression with IPTG and CE6 phage
4-5 A Coomassie stained SDS-PAGE gel showing equally successful expression in both IPTG and CE6 phage induced expressions
4-6 Coomassie stained SDS-PAGE showing that the expression of RNase is in inclusion bodies
4-7 SDS-PAGE gel separates proteins in several purification fractions from a cation-exchange chromatography of RNase
4-8 Silver stained SDS-PAGE gel and a Western blot of affinity (pUp) purified RNase
4-9 Silver stained SDS-PAGE of purified RNase on the newly developed DNA oligonucleotide column
4-10 The transphosphorylation step in the RNase mechanism of RNA cleavage
4-11 Active site amino acids of RNase are shown in relation to a 3 nucleotide RNA chain
4-12 The progress curves of UpA cleavage by RNase (An28)
4-13 The structure of the fluorescent tetranucleotide used in the RNase kinetic assay
4-14 Emission-scan of the cleavage of the fluorescent tetranucleotide by RNase A
4-15 A typical progress curve of the tetranucleotide cleavage reaction by RNase
4-16 This histogram compares the different kcat/KM of the ancestral RNase candidate proteins, BS-RNase and RNase A
4-17 Non-reducing and reducing gels with the various purified RNases and RNase A
4-18 Divinyl sulfone cross-linking reaction in the active site of RNase
4-19 Three SDS-PAGE reducing gels showing RNases reacted with DVS
5-1 Mitogen (PHA) proliferation assay using proteins purified on a pUp column
5-2 Mitogen (PHA) proliferation assay using proteins purified on an affinity column
5-3 Mixed Lymphocyte Reaction (MLR)
5-4 Percentages of apoptotic, dead and live cells in a mitogen proliferation assay
5-5 Percentages of dead and live cells in a mitogen proliferation assay determined by Trypan blue cell counting
5-6 Cell viability determined by cell counting using a hemocytometer of a mitogen proliferation assay


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PALEOGENETICS: THE PAST AS A KEY TO UNLOCK THE PRESENT. THE RESURRECTION OF ANCESTRAL PROTEINS TO ELUCIDATE THE FUNCTION OF SEMINAL RIBONUCLEASE

By Slim Sassi
December 2005

Chair: Steven A. Benner
Major Department: Anatomy and Cell Biology

One longstanding question in reductionist biology asks whether particular in vitro behaviors of isolated proteins are physiologically relevant in complex living organisms. Natural selection is the only mechanism for obtaining function in biology, making this question equivalent to asking whether an in vitro behavior, if changed, would diminish the fitness of its host. This dissertation develops a strategy to identify in vitro behaviors relevant to new physiological function. It exploits the emerging field of experimental paleogenetics, which resurrects ancestral proteins from extinct organisms for laboratory study.

This strategy is illustrated with the seminal ribonuclease (RNase) family of proteins from artiodactyls, the order that includes oxen, buffalos, and deer. Bovine seminal RNase (BS-RNase) constitutes 2% of the protein in bovine seminal plasma, and has unusual in vitro features, including dimeric structure, catalytic activity against duplex RNA, affinity for duplex DNA, immunosuppressivity, and tumor cytotoxicity. These are absent from pancreatic RNase A. We applied experimental paleogenetics to ask: Which of these behaviors is key to the emerging function of seminal ribonuclease?

Phylogenetic models were constructed for artiodactyl BS-RNase using Bayesian and Maximum Likelihood tools and paleontology data. Ancestral sequences were inferred, and the Akaike information criterion (AIC) was used to demonstrate that an episode of positive selection of BS-RNase occurred in the last 2 million years, after oxen diverged from buffalo; seminal RNase must then have acquired a new function. The inferred ancestral sequences were resurrected and studied to show that immunosuppressivity, measured in vitro using the mixed lymphocyte culture assay, emerged at the same time as the episode of positive selection. This implies that immunosuppressivity is the in vitro behavior relevant to the new function. This may confer fitness by suppressing the female immune response against male sperm, possibly as primitive humans domesticated oxen.

We also developed new tools exploiting homoplasy and crystallography to suggest that BS-RNase was not a pseudogene throughout much of its history, even though it is a pseudogene in most extant non-bovine ruminants. These findings illustrate the utility of experimental paleogenetics to interpret in vitro data from a systems, organismic, and planetary perspective.


1 CHAPTER 1 MOLECULAR PALEOSCIENCE: SYS TEMS BIOLOGY FROM THE PAST Introduction A Role for History in Molecular Biology The physiology of life on Earth reflects natural selecti on acting on random variation as living systems respond to chan ce events in their environments. These responses are constrained by physical and chem ical law, and further by the limitations of the linear process by which Darwinian pro cesses search for solutions to biological problems. Darwinian processes need not deli ver the best response to an environmental challenge; indeed, it may deliver no response that allows an organism to avoid extinction. The resulting physiology therefore refl ects history as much as optimization. It is therefore not surprising that biology finds its deepest root s in natural history, including the fields of systematic zool ogy, botany, paleontology, and planetary science. Here, seemingly trivial details (such as the physiology of the panda's thumb) have enlightened naturalists as they attempt to understand the interplay of chance and necessity in determining the outcome of evol ution(Glenner et al., 2004;Gould, 1980). These roots in natural history are not felt in molecular biology, the field that has emerged from the enormously productive alli ance between biology and chemistry in the 20th century. Today, we have the complete chem ical structures of many biomolecules and their complexes, from systems as small as glucose to systems as large as the human genome(Venter et al., 2001). X-ray crysta llography and NMR spectroscopy locate atoms with precisions of tenths of nanometers within biomol ecular structures. Biophysical


methods time biomolecular phenomena on a microsecond scale (Buck and Rosen, 2001). These and other molecular characterizations, written in the language of chemistry, have supported industries from drug design to food manufacture, all without any apparent need to reference the history of the molecular components that have proven to be so useful.

The success of this reductionist molecular biology has caused many to place a lower priority on historical biology. The archetypal molecular biologist has never had a course in systematics, paleontology, or Earth science, and may resent being compelled to take one. The combination of chemistry and biology has generated so much excitement that history seems to be no longer relevant, and certainly not necessary, to the practice of life science.

Nearly overlooked in the excitement, however, has been the frequent failure of the molecular characterization of living systems, even the most extensive and detailed, to generate a comprehensive type of understanding. The human genome itself provides an example of this. The genome is itself nothing more (and nothing less) than the structures of some large natural products. Each structure indicates how carbon, hydrogen, oxygen, nitrogen and phosphorus atoms are bonded within the molecule being represented. It has long been known to natural product chemists that such structures need not make any statement about the function of a biomolecule in its host living system. Genomic scientists have come to realize this as well, but not without expressions of dismay.

Genomic sequences do have certain advantages over other natural product structures when it comes to understanding their function, however. Gene and protein structures, when compared to each other, provide a model for their histories more transparently than other natural product structures (Hesse, 2002). As was recognized by


Pauling and Zuckerkandl nearly a half century ago (Pauling and Zuckerkandl, 1963), a degree of similarity between two protein sequences indicates, to a degree of certainty, that the two proteins share a common ancestor. Two homologous gene sequences may be aligned to indicate where a nucleotide in one gene shares common ancestry with a nucleotide in the other, both descending from a single nucleotide in an ancestral gene. An evolutionary tree can be built from alignments of multiple gene sequences to show their family relationships. The sequences of ancestral genes represented by points throughout the tree can be inferred, to a degree of certainty, from the descendent sequences at the leaves of the tree.

The history that gene sequences convey can be used to understand their function. Much understanding can come first by analyzing the sets of homologous sequences themselves. Thus, credible models for the folded structure of a protein can be predicted from a detailed analysis of the patterns of variation and conservation of amino acids within an evolutionary family, set within a model of the history of the family (Thornton and DeSalle, 2000). The quality of these predictions has been demonstrated through their application to protein structure prediction contests (Benner et al., 1997; Gerloff et al., 1999; Rost, 2001), and the use of predicted structure to detect distant protein homologs (Benner and Gerloff, 1991; Dietmann and Holm, 2001; Gerloff et al., 1997; Tauer and Benner, 1997). More recently, analysis of patterns of variation and conservation in genes has been used to determine whether the gross function of a protein is changing, and which amino acids are involved in the change (Bielawski and Yang, 2004; Gaucher et al., 2002; Gaucher, Miyamoto and Benner, 2001).


These examples illustrate how sets of protein sequences can be manipulated computationally using an evolutionary perspective. Computational analysis of protein sequences from an evolutionary perspective has emerged as a major area of activity in the past decade.

The purpose of this chapter is to review strategies that go beyond simple computational manipulation of protein sequences. This chapter will explore experiment as a way to exploit the history captured within the chemical structures of DNA and protein molecules. Our focus will be the emerging field known variously as experimental paleogenetics, paleobiochemistry, paleomolecular biology and paleosystems biology. Practitioners of the field resurrect ancient biomolecular systems from now-extinct organisms for study in the laboratory. The field was started 20 years ago (Nambiar et al., 1984; Presnell and Benner, 1988; Stackhouse et al., 1990) for the specific purpose of joining information from natural history, itself undergoing a surge of activity, to the chemical characterization of biomolecules, with the dual intents of helping molecular biologists select interesting research problems and generating hypotheses and models to understand the molecular features of the systems that they study.

The field has now explored approximately a dozen biomolecular systems (Table 1-1). These include digestive proteins (ribonucleases, proteases, and lysozymes) in ruminants to illustrate how digestive function arose from non-digestive function in response to a changing global ecosystem, fermentative enzymes from fungi to illustrate how molecular biology supported the adaptation of life as mammals displaced dinosaurs as the dominant large land animals, pigments in the visual system adapting as the system evolved to manage different environmental demands, evolution of steroid hormone


receptors, and proteins from very ancient bacteria, which is helping to define the environment where the earliest forms of contemporary life lived.

In general, the understanding delivered by experimental paleogenetics was not accessible by standard methods. To date, approximately 20 narratives have emerged where specific molecular systems from extinct organisms have been resurrected for study in the laboratory. After a brief introduction of the strategies and problems in experimental paleoscience, we will review each of these.

The goal of this chapter is also to strengthen the awareness of molecular and biomedical scientists of the ability of experimental paleogenetics to provide meaning to biological data. We believe that current efforts falling under the rubric of "systems biology" have stopped short in this regard. Understanding will not, we suspect, arise from still more, and still more quantitative, molecular, chemical, and geometric characterization of cellular, organ, and organism-defined systems. At the very least, the analysis must go further, to include the organism, the ecosystem, and the physical environment, which extends from the local habitat to the planet and the cosmos (Feder and Mitchell-Olds, 2003). But we suspect that without an understanding of the history, efforts in reductive systems biology are likely to fall short of their promise to deliver understanding.

The same applies to the broader scientific community. Reviewing the first paleomolecular resurrections (Jermann et al., 1995; Stackhouse et al., 1990) a decade ago, Nicholas Wade, writing in the New York Times Magazine (Wade, 1995), expressed displeasure. "The stirring of ancient artiodactyl ribonucleases", he wrote, "is a foretaste of biology's demiurgic powers." He then suggested that biomolecular "resurrection [remain]


an unroutine event." Given the absence of hazard presented by paleomolecular resurrections (pace Jurassic Park), it seems unwise to forgo understanding that comes from bringing again to life biomolecules from times past. As the size of the genome sequence database grows, and the gap between chemical data compilations and biological understanding increases, we suspect that experimental paleogenetics will be the key tool to bridge the gap.

Evolutionary Analysis and the "Just so Story"

In the age of reductionist biology, evolutionary analysis must struggle to enter the mainstream of discussion within the molecular sciences. As Adey noted a decade ago, many scientists view evolutionary hypotheses as being inherently resistant to experimental test, and therefore fundamentally non-scientific (Adey et al., 1994). Certainly, neither the sequence nor the behavior of a protein from an organism that went extinct a billion years ago can be known with the same precision as the sequence or behavior of a descendent in an organism living today. But much in science is useful even though it is not known with certainty. Our job in paleogenetics, as with all science, is to manage the uncertainty.

The arcane nature of the most prominent debates in molecular evolution has also not helped. It is not clear to many biomedical researchers how their view of biology will be different if it turns out that dog diverged from humans and rodents before or after humans and rodents themselves diverged. Likewise, those trained in chemistry know that it is unproductive to ask whether alterations in chemical structure generally have any specific impact on behavior; productive discussions in chemistry focus on specific chemical structures. Yet the neutralist-selectionist debate that consumed molecular evolution for nearly a decade did not incorporate this perspective (Hey, 1999).


Further, the accidents that shape molecular biology cannot be reproduced in the laboratory, and may in fact never be known in detail. Complex biological systems are chaotic. Small differences in input can have large impacts on the output. The notion that biology would be rather different had a fly flapped his wings differently in the Triassic is a compelling reason to marginalize historical narratives (Lorenz, 1963; Lorenz, 1969).

Even when historical narratives are constructed, they are frequently viewed as "just so stories". This epithet is pejorative; it indicates that the narrative is constructed ad hoc to explain a specific fact (How the zebra got his stripes), makes no reference to facts verifiable outside of the fact being explained (we have no way to verify that an ancestral zebra took a nap under a ladder), and could easily be replaced by a different story, just as compelling, explaining a different observation (if modern zebra had spots, the story would be that the ancestral zebra took a nap beneath a Philodendron such as Monstera Friedrichsthalii).

A narrative can be made more compelling by bringing many different types of data to bear on a single system. A recent example of this use of multiple lines of evidence attempted to bring biological meaning to the fact that modern swine have not one, but rather three, different genes for the enzyme "aromatase" (Gaucher et al., 2004). Aromatases use cytochrome P450 and molecular oxygen to convert androgenic steroids into estrogenic steroids. These steroids play roles throughout vertebrate reproductive biology. Most mammals (including humans) have only a single aromatase gene. Pigs, for some reason, have three. Here, a complete molecular characterization of the three pig aromatases (the sequences of all three are known) did not address the question: "Why do


pigs have three genes for an enzyme, all catalyzing approximately the same reaction?" A historical narrative was needed.

To build this narrative, a cladistic analysis was used to suggest that the two duplications that created the three paralogous aromatase genes in swine occurred after the suids diverged from oxen ca. 60 Ma (Arnason, Gullberg and Janke, 1998; Foote et al., 1999; Kumar and Hedges, 1998). A silent transition redundant clock (Benner, 2003) was then applied to date the duplications at 39 Ma, in the late Eocene to mid Oligocene. To help define further the timing of the duplications, gene fragments were sequenced from other close relatives of the pig, including the peccary and the babirusa.

Given the timing of the duplications that generated the three genes, it was concluded that the three aromatases in pig were not needed to manage the fundamental reproductive endocrinology of mammals, which arose very early in vertebrates (perhaps earlier than 400 Ma). Nor did the three aromatase genes arise in response to the domestication of pigs, which occurred only a few thousand years ago.

Rather, the date of the duplication events correlated with the time in history when the size of pig litters increased. This timing was suggested by a cladistic analysis of reproductive physiology and litter size and the fossil record, which includes fossils of pregnant ancestral animals (Franzen, 1997; O'Harra, 1930). Since the period following the Eocene was a time of global cooling, it is possible that an increase in litter size was an adaptive change that contributed to the fitness of the pig lineage under changing climate.

From this analysis, an explanatory narrative emerged. The narrative hypothesized that the gene duplications gave rise to functionally different isozymes of aromatase that played specific roles to establish and maintain large litter sizes in pigs.


At this point, most biologists would regard this as simply a "just so story". To avoid this epithet, structural biology was then recruited. A crystal structure was not available for aromatase. Nevertheless, an approximate homology model could be built from a homologous P450-dependent enzyme. The sites in the aromatase protein that changed immediately after the aromatase genes had duplicated were then mapped onto the homology model. This map showed that many of the amino acid replacements occurring shortly after the duplications were in or near the substrate-binding site. This, in turn, suggested that the substrate/product specificity of the aromatase enzyme changed at this time. This change was confirmed by experiment, where the ability of the different aromatases to catalyze the oxidation of different androgens was shown to differ (Corbin et al., 2004).

The diverging catalytic properties of the aromatase proteins were then observed to be consistent with the different physiologies of the three enzymes. One aromatase is expressed in ovaries, as are aromatases in other mammals (including human). Another of the aromatases was expressed in the pig embryo, between day 11 and day 13 after conception. This is approximately the time of implantation. It was proposed that estrogen created by this aromatase isoform helped space multiple embryos around the uterine wall. The third aromatase duplicate was expressed primarily in the placenta. The estrogen that this aromatase releases was proposed to help the mother determine whether the pregnancy should be terminated, which happens in pigs if too few embryos successfully implant.


Together, the narrative combined sequencing data, molecular evolution and paleontology with the geological records, genomic sequence analysis, structural biology, experimental biochemistry, and reproductive physiology, to create an answer to the question: "Why do pigs have three aromatases?"

Some commentaries view this use of many lines of evidence of many different types as able to elevate this narrative beyond a "just so story". One commentary even called the combination a "tour de force" [Faculty of 1000: evaluations for Gaucher EA et al BMC Biol 2004 Aug 17 2 (1):19 Shelley Copley: Faculty of 1000, 5 Oct 2004]. Nevertheless, the argument embedded in the narrative remains correlative. Correlations do not compel causality. It would therefore be helpful to have an additional tool to expand such narratives further.

Biomolecular Resurrections as a Way of Adding to an Evolutionary Narrative

Experimental paleogenetics provides such a tool. Here, ancestral DNA or protein sequences are inferred from the sequences of their descendents. Then, using molecular biological methods, an ancestral gene sequence is synthesized in the laboratory and delivered to a biological host where its behavior can be studied. If the interesting ancestral biomolecule is a protein, the gene is expressed to deliver the corresponding ancestral protein, which is then isolated for study.

Biomolecular resurrections allow the scientist to partially re-live past events. By doing so, we hope to generate evidence to confirm or deny a hypothesis about past or current function. For example, we might hypothesize that lysozymes or ribonucleases in modern oxen function to support ruminant digestion. This hypothesis might be supported by arguments analogous to those used for aromatase if the ribonucleases and lysozymes


emerged at the time when ruminant digestion arose. But a paleobiochemical experiment with ribonucleases and lysozymes resurrected from an animal just beginning to create ruminant digestion may contribute more. If the experimental properties of these resurrected ancient enzymes show that particular digestive behaviors emerged in the ancient proteins at the time when ruminant digestion arose in the ancestral organisms, then the narrative is strengthened.

Alternatively, we might suspect that ancestral bacteria lived at an elevated temperature. Resurrecting proteins from those ancestors, and determining the temperature at which they function best, might confirm or deny that hypothesis.

Practicing Experimental Paleobiochemistry

We next review the process of drawing inferences about the past and the present from information obtained from contemporary times. More details surrounding the process can be found in the literature (Benner, 2003).

Building a Model for the Evolution of a Protein Family

Homology, alignments, and matrices

The basic element of an evolutionary analysis is the pairwise sequence alignment. Here, two gene or protein sequences are written next to each other so that their similarities are the most perspicuous.

While any two sequences can be aligned, an alignment makes evolutionary sense only if the two sequences themselves share a common ancestor, that is, if the two sequences are homologous. Indeed, the goal of sequence alignment is often to determine whether or not two sequences are homologous. To make this determination, all possible alignments of two sequences are made, each is given a score, and the alignment with the highest score is chosen and examined statistically to determine if the score surpasses a


threshold to answer the question: "Are these sequences related by common ancestry?" (yes or no).

An alignment is "correct" if it accurately represents the history of individual aligned sites. If two sequences are homologous, and are descendents of a common ancestor, it should be possible to map the sites in the sequence of a descendent onto the sites in the sequence of the ancestor. Except for sites that are gained by insertions, each site in a descendent is mappable into a single site in the ancestral sequence. Except for sites that are lost by deletions, each site in an ancestor is mappable into a single site in the descendent sequence. A correct alignment involves a transitive mapping, aligning the sites in one descendent to sites in the other if the sites are mapped onto the same site in the ancestor.

We shall not review here the literature discussing the process of searching possible alignments to find the one with the best score. Calculating a score for a pairwise alignment requires, however, a theory of evolution that describes how biomolecular sequences divergently evolve naturally. The details of the theory often have an impact on the sequences that are inferred for ancestral proteins. It therefore becomes important conceptually to understand how inferences about ancestral sequences are made. Thus, it is useful to summarize the use of scoring functions.

Nearly all strategies for scoring a pairwise alignment incorporate some notion of minimum motion. Thus, identities, where two biomolecules have the same nucleotide or amino acid at an aligned site, are considered to be stronger signals of homology than non-identities.
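The idea of scoring candidate alignments by identities can be made concrete with a short sketch. The Python below is only an illustration under stated assumptions: the function name `score_alignment`, the score values (+1 for an identity, 0 for a non-identity, -1 for a gap), and the two short peptide strings are all hypothetical and are not taken from the ribonuclease data or from any published scoring scheme.

```python
def score_alignment(row_a, row_b, match=1.0, mismatch=0.0, gap=-1.0):
    """Score one candidate pairwise alignment, column by column.

    row_a and row_b are the two aligned sequences, padded with '-' for gaps.
    The default score values are illustrative assumptions, not a standard.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0.0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot hold two gaps")
        if x == "-" or y == "-":
            score += gap        # a site gained by insertion or lost by deletion
        elif x == y:
            score += match      # identity: the stronger signal of homology
        else:
            score += mismatch   # non-identity: a weaker signal
    return score


# Two candidate alignments of the same pair of made-up peptides.
candidates = [
    ("KETAAM", "KESAAM"),     # ungapped: five identities, one non-identity
    ("KETAAM-", "-KESAAM"),   # shifted by one site: one identity, two gaps
]
best = max(candidates, key=lambda rows: score_alignment(*rows))
```

With these assumed values the ungapped candidate scores 5.0 and the shifted candidate scores -1.0, so `best` is the ungapped alignment, the one that requires the least "motion" to relate the two sequences.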


Most theories also include parameters that permit some non-identities to be stronger indicators of homology than others, however. For example, two aligned hydrophobic amino acids (a Leu and a Val, for example) are presumed to be stronger indicators of homology than is a hydrophobic amino acid aligned with a hydrophilic amino acid (a Leu and a Glu, for example). Scores that incorporate such additional parameters are said to measure the "similarity" between two sequences.

A 20 x 20 matrix (such as the matrix in Figure 1-2) is generally used to score identities and non-identities in an alignment. The diagonal elements of such matrices hold the scores for identities in the matched sites; the off-diagonal elements hold the scores for non-identities. The matrix itself is defined for a particular overall divergence of the pair of sequences being aligned. The more divergent the pair of proteins that the matrix is intended to score, the smaller the on-diagonal terms are relative to the off-diagonal terms. Because neither of the sequences in a pair being aligned is distinctive, a Leu matched against a Val is given the same score as a Val matched against a Leu. Thus, the alignment scoring matrix is symmetrical across the diagonal.

A rate matrix, which describes the probability of each of the 20 amino acids being replaced by each of the others, is different from a matrix that is used to score a pairwise alignment. The rate of replacement of one amino acid by another can be described as a pseudo first order process having the units of changes per site per unit time. Alternatively, the rate can be normalized to remove the time dimension. This provides a unitless rate parameter that describes the rate of replacement of one amino acid by another relative to the rate of replacement of all amino acids by all others. In either case, there is no reason for a particular amino acid (Leu, for example) to be replaced


by another (Val, for example) with the same rate constant as a Val is replaced by a Leu. If the rate constant for the conversion of Leu to Val is higher than the rate constant for the conversion of Val to Leu, then the ratio of Val to Leu will tend to be higher in the descendent than in the ancestor.

Trees and outgroups

Given a family of proteins containing more than three members, it is possible to summarize their interrelationship using an evolutionary tree. The tree is a graphical model of the history of a family of proteins. The leaves of the tree represent modern sequences from contemporary organisms. The internal threefold nodes in the tree each represent a duplication in an ancestral gene to give rise to two descendant lineages. The lengths of the edges in the graph represent the distance between nodes, often expressed in the units of changes per site. Thus, the longer the branch, the more the sequences at the ends of the branch differ. This is exemplified in Figure 1-3 using four ribonucleases (RNases) from four closely related bovids, the eland, the ox, the river buffalo, and the swamp buffalo.

The root of the tree is defined as the point on the tree that represents the oldest sequence. If the rate of sequence divergence has been constant through the period of divergent evolution, then the root is represented by a point at the middle of the tree, sometimes referred to as the center of gravity. In general, however, the rates of protein sequence divergence are not constant over time, meaning that the root of the tree cannot be easily placed if sequence data are the only input.

Fortunately, it is possible to use other, non-sequence information to identify the root of the tree. For example, in the tree in Figure 1-3, the root is placed along the branch connecting the eland sequence to the point on the tree representing the sequence of the


protein from the last common ancestor of the oxen and buffaloes. This placement is based on information from cladistics, which suggests that the eland diverged from the lineage leading to oxen and buffaloes before buffaloes diverged from the oxen. Thus, the eland is an outgroup for the tree that contains the oxen, the river buffalo, and the swamp buffalo sequences. An outgroup, however obtained, can be used to root the tree of the "in group."

Correlating the molecular and paleontological records

Just as we might use cladistics to determine that the eland is the appropriate outgroup for a family of proteins obtained from the ox, the swamp buffalo, and the river buffalo, we might refer to the paleontological record to gain some insight into the animal that carried the ribonuclease that is represented by the threefold node in the tree connecting the branch to the ox ribonuclease and the branch to the RNases from the buffaloes.

The paleontological record is incomplete. Therefore, one never can find in the paleontological record a fossil that truly corresponds to an individual, or even the population that held the individual, that generated two descendent populations that eventually evolved to give two descendent species.

Therefore, any assignment of a specific fossil to a specific point in an evolutionary tree can be only an approximation. Such assignments are often useful approximations, however, as they allow us to draw upon the paleontological and geological records to interpret the molecular record.

In the case of the tree modeling the evolutionary history of the ribonucleases from eland, ox, swamp buffalo and river buffalo, the fossil record contains several bovids that might correspond approximately to the last common ancestor of ox, swamp buffalo and river buffalo. Stackhouse et al. chose Pachyportax (Figure 1-4) as the relevant genus


from the paleontological record to represent this node; this genus is known from a fossil record from the Indian subcontinent. Pachyportax is represented on the tree using the scripted letter P.

A Hierarchy of Models for Modeling Ancestral Protein Sequences

Assuming that the historical reality arose from the minimum number of amino acid replacements

In a world without knowledge, we would know nothing about the sequence of any ancestral protein that lived in any ancient organism that went extinct millions of years ago. In particular, we would know nothing about the sequence of the ribonuclease that was biosynthesized by Pachyportax.

But we have some knowledge about the ancient world. In particular, we know the sequences of some of the descendents of the ancient RNase, as well as the sequences of descendents of its relatives. Further, from sequence data generally, we might derive a theory describing the replacement of amino acids generally during the divergent evolution of protein sequences. We might assume that the general pattern of amino acid sequence evolution might be a good approximation of the pattern in the protein family of interest (Table 1-2).

To understand how one can draw inferences about ancestral states from the sequences of descendents, let us consider a hierarchy of models that begins with just two homologous protein sequences, using ribonucleases as an example. We begin with the ribonucleases from the swamp and river buffaloes. These two homologs are represented by the leaves, or end points, of a simple tree consisting of a single line connecting the two end points. The points within the line represent evolutionary intermediates between the two sequences. Evolution is, of course, discrete, meaning that it is not represented


perfectly by a continuous line with an infinite number of points, but the approximation is serviceable.

Consider now just two types of sites in the pairwise alignment of the two ribonucleases. In the first, the amino acids occupying the two aligned sites are identical. Site 39 is an example of such a site in these two ribonucleases. In the ribonuclease from both the swamp buffalo and the river buffalo, Site 39 holds a methionine (abbreviated Met, or the single letter M). Figure 1-5 shows the simple tree with an M at each leaf, represented by the points at the ends of the line.

If our theory of evolution holds that the absence of change is substantially more likely than change (and this site does not contradict that theory), a simple model infers a Met at Site 39 in all proteins that are evolutionary intermediates between the swamp and river buffalo RNases. This inference can be described using the language of probability. We simply say that the probability of finding a Met at Site 39 in every ribonuclease represented by each point in the tree is unity (Figure 1-5).

The same is true if we root the tree. As an approximation, let us place the root midway between the two leaves of the tree, as in Figure 1-6. The oldest sequence is now at the top of the tree, with the direction of time being positive (towards the future) as one proceeds from the node at the top of the tree to the leaves at the bottom of the tree. The inference that is now drawn is that the ribonuclease found in the last common ancestor of swamp and river buffalo had a Met at position 39.

What if the amino acids occupying the two aligned sites are not identical in the two buffalo RNase sequences? This is the case for Site 38 in the pairwise alignment (Figure 1-7). Here, the RNase from swamp buffalo holds a serine (Ser, or S), while the RNase


from river buffalo holds an asparagine (Asn, or N). At this site, no change is not an option. If the history that we are to narrate is to account for the presence of two different amino acids at this site in the contemporary sequences, at least one amino acid replacement must have occurred along the line connecting the two contemporary proteins.

If we assume that one replacement is substantially more likely than two, then a simple model represents the probability of finding each amino acid at Site 38 as a linear function of the distance along the line connecting the two leaves. Thus, an ancestral protein represented by a point on the line near the swamp buffalo leaf is more likely to have a Ser than an Asn at Site 38, while an ancestral protein represented by a point on the line nearer the river buffalo leaf is more likely to have an Asn than a Ser. The amino acid from the swamp buffalo sequence "morphs" into the amino acid from the river buffalo sequence (Figure 1-7).

The same is true for the tree arbitrarily rooted at the midpoint of the line (Figure 1-8). The rooted tree illustrates the inference from the model that the last common ancestor of the RNases from swamp and river buffalo has a 50% chance of holding a Ser at position 38, and a 50% chance of holding an Asn at position 38.

This simple analysis introduces two concepts that are key when using information from modern sequences to make inferences about ancestral sequences. The first is the concept of a probabilistic ancestral sequence. Every site in a sequence at every point along the branch(es) of the tree can be represented as a 20 x 1 probability matrix, where each element is the probability that one of the 20 amino acids was found at that site in the ancestral protein represented by the point. These probabilities sum to unity.
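The linear two-leaf model and the 20 x 1 probability vector can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in this work; the function name `site_profile` and the parameter `t` (the fractional distance along the line, measured from the first leaf) are assumptions introduced here for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 amino acids


def site_profile(leaf_a, leaf_b, t):
    """Probabilistic ancestral state at one site under the simple linear model.

    leaf_a, leaf_b: one-letter residues observed at the two leaves.
    t: fractional distance from leaf_a (0.0) to leaf_b (1.0); hypothetical name.
    Returns a dict holding one probability for each of the 20 amino acids.
    """
    if leaf_a not in AMINO_ACIDS or leaf_b not in AMINO_ACIDS:
        raise ValueError("residues must be standard one-letter codes")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie between 0 and 1")
    profile = {aa: 0.0 for aa in AMINO_ACIDS}
    profile[leaf_a] += 1.0 - t   # weight of the residue at the first leaf
    profile[leaf_b] += t         # weight of the residue at the second leaf
    return profile


# Site 38: Ser in swamp buffalo, Asn in river buffalo, evaluated at the midpoint.
midpoint_38 = site_profile("S", "N", 0.5)   # Ser 0.5, Asn 0.5, all others 0.0

# Site 39: Met at both leaves, so Met has probability 1.0 anywhere on the line.
met_39 = site_profile("M", "M", 0.25)
```

Under this model the only residues with non-zero probability are those seen at the leaves, which is exactly the restriction that the parsimony assumption imposes.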


The example also illustrates the application of the notion of "parsimony", or minimum change, when making inferences about ancestral sequences using information from modern sequences. Central to this model is that the absence of any amino acid replacement is more probable than one replacement, and that one replacement is more probable than two. Thus, if we can account for the contemporary sequences by a historical model that involves no changes, then no changes are inferred. If we can account for the contemporary sequences by a historical model that involves one change, then one change (not two changes) is inferred. These inferences are said to have been made using an evolutionary theory that is known as "maximum parsimony".

Allowing the possibility that the history actually had more than the minimum number of changes required

More sophisticated theories describing the divergence of protein sequences allow for the possibility that more amino acid replacements have occurred in the history of a site than the minimum required to account for the derived sequences. For example, we may incorporate into our theory the possibility that a site contained amino acids that are not found in any of the derived sequences, even though this requires some histories to have more than the minimum number of changes absolutely required to account for the amino acids in the derived sequences.

Consider again Site 39 of the ribonucleases from the swamp and river buffaloes. Even though the sequences of both RNases hold a Met at this site, it is conceivable that another amino acid (let us say Ile) occupied Site 39 in the last common ancestor. If this had been the case, at least two independent events in the history of the protein family would be required to account for the fact that Met occupies Site 39 in both of the derived proteins. One Ile-to-Met replacement must have occurred at Site 39 during the evolution


of the modern swamp buffalo sequence from the common ancestor. Another Ile-to-Met replacement must have occurred at Site 39 during the evolution of the modern river buffalo sequence from the common ancestor.

Because such a history requires two changes at Site 39, and because the probability of a change is assumed to be a small fraction of the probability of no change, the likelihood that this ancestor had an Ile at Site 39 is considerably less than the likelihood that the ancestor had a Met there. But this likelihood is not zero.

Further, we can build a theory that assigns a numerical likelihood based on this model. Specifically, we might consider empirical data or theoretical considerations to weight the probabilities that various other amino acids intruded at Site 39 during this history. For example, Lys is (on various grounds) chemically more dissimilar to Met than Ile is. Thus, the theory might give a lower probability to a Lys-to-Met replacement than to an Ile-to-Met replacement. Accordingly, the chances of Site 39 having suffered two Lys-to-Met replacements are considerably lower than the chances of Site 39 having suffered two Ile-to-Met replacements. This means that the chance of the ancestral sequence having had a Lys is lower than the chance of it having had an Ile, given that the two derived sequences both have Met at Site 39.

Here, a probabilistic ancestor might have non-zero values for all amino acids at Site 39. For example, here are the calculated occupancies for Site 39 for the 20 amino acids at the midpoint of the tree (Figure 1-9). This is illustrated graphically in Figure 1-10. Here, the probability of Met at Site 39 does not remain unity, even though both leaves have Met at this site. Rather, the probability of Met at Site 39 drops linearly until the midpoint of the line, and then rises back to unity as the other leaf is approached. At the same time, the probability of all of


the other amino acids being present at Site 39 increases from zero at the leaves to a maximum number at the midpoint, and then drops back to zero as the other leaf is approached.

This analysis is sometimes called "maximum likelihood", because it integrates features, including the distance of the node from the leaves (where the sequence is perfectly defined), into a probabilistic model. Thus, the probability of two changes occurring along a branch of a tree is lower if the line is shorter (that is, if the distance between the ends of the branch is shorter), and higher if the branch is longer. This follows directly from the fact that the length of the branch is the number of changes per site. For the tree holding two leaves, the ML formalism says that the certainty that a residue found in common at both leaves of a tree was present throughout the history of the site diminishes as the branch length becomes longer.

Adding a third sequence

Adding a third sequence to the original tree creates a trifurcated tree. The third sequence is tied into the line separating the first two sequences by a branch at the point where the probabilistic ancestral sequence along the line resembles the third sequence the most. In the example above, we use the oxen sequence as the third.

We might say that the sequence of the RNase from oxen serves as an outgroup for the two RNase sequences from swamp and river buffalo. Alternatively, if we assume that the ox diverged from the lineage leading to the two buffalos before the buffalos themselves diverged, we may say that the ox sequence roots the swamp buffalo-river buffalo tree. The oldest point on the swamp buffalo-river buffalo tree, given this additional information, is the point where the oxen sequence is tied to that tree. The root


of the trifurcated tree is not known, but lies somewhere along the line between the central threefold node and the ox sequence.

The third sequence contributes more prior information that can be used to infer the amino acids occupying various sites at points internal to the tree. Consider again Site 38 (Figure 1-11), where the RNase from swamp buffalo holds a Ser, while the RNase from river buffalo holds an Asn. With just two sequences, the ancestral sequences morphed smoothly from Ser to Asn from left to right along the line. With the ox sequence, the available information has increased. Now, a simple parsimony analysis assigns an Asn at the central node, with the model for the history of Site 38 incorporating an Asn-to-Ser replacement along the branch leading from the central node to the RNase from swamp buffalo.

The probabilities along the line are also dramatically changed by the introduction of the third sequence. This is illustrated in Figure 1-12.

It requires no conceptual leap to extend this analysis to all of the sites in the threefold alignment of the ox, swamp buffalo, and river buffalo RNase sequences, or to trees that include the sequence of the RNase from eland, added to the tree as an outgroup. The eland sequence roots the (ox, (swamp buffalo, river buffalo)) tree, and allows us to infer the "maximum parsimony" sequence for the RNase from Pachyportax. This sequence was the first for an ancestral biomolecule to have been experimentally resurrected (Stackhouse et al., 1990).

As an exercise, it might be useful to see how a parsimony analysis infers the amino acid replacements that occur along each branch of the tree. For example, in the time during which the RNase from ox evolved from the RNase in Pachyportax, the methionine at Site 39 was replaced by a leucine. In the time during which the RNase from the last


common ancestor of the swamp and river buffalos evolved from the RNase in Pachyportax, the Lys at Site 41 was replaced by a Ser. In the time during which the RNase from swamp buffalo evolved from the last common ancestor of the swamp and river buffaloes, the Asn at Site 38 was replaced by a Ser (Figure 1-13).

Because the maximum likelihood analysis infers fractional residues at nodes, it also assigns fractional changes along branches. These are not different conceptually from integral changes.

The relative merits of maximum likelihood versus maximum parsimony methods for inferring ancestral sequences

In much of the literature on experimental paleogenetics, it is argued that maximum likelihood methods are preferred over maximum parsimony methods for inferring ancestral sequences (Nielsen, 2002; Pagel, 1999a; Yang, Kumar and Nei, 1995; Zhang and Nei, 1997). While this argument is true, especially to the extent that maximum likelihood methods capture more of what we know about protein sequence evolution, it is important to realize that the impact of replacing a maximum parsimony analysis by a maximum likelihood analysis is small when the tree is highly articulated, the branching topology is secure, and the overall extent of sequence divergence is small. This is the case for the ribonucleases that we have just discussed.

Nevertheless, it is worth using this example to illustrate how the nature of an inferred ancestral sequence might change if elements of the evolutionary model were to change. In particular, the ancestral sequences inferred via a parsimony analysis are often extremely sensitive to changes in the topology of the tree. Consider, for example, how the sequence inferred for the ancestor P would change if the position of the eland on the tree were moved so that it was not an outgroup, but rather was a sister group of the oxen. This


change is illustrated in Figure 1-14. Here, the amino acid at Site 41 is no longer defined by a parsimony analysis. Site 41 could hold with equal probability a Lys or a Ser.

Because many trees inferred from sequence data are not known with certainty, where much of the ambiguity involves swapping around short branches, maximum likelihood inferences have significant advantages over maximum parsimony inferences. Here, maximum likelihood considers the lengths of branches when assigning probabilities for different amino acids being present. Changes in tree topology around short branches therefore do not greatly change the probabilities assigned by maximum likelihood tools.

At the same time, because an experimental paleogeneticist cannot construct a protein having fractional occupancies of single sites, the uncertainty in the sequence represented by the probabilities of "all of the other" amino acids is generally ignored if the probability assigned to those other amino acids falls below a threshold. There is no consensus in the community for where that threshold should be. It is clear, however, that if the probability of Site 39 holding a Met is greater than 0.81, while the probability of any of the other amino acids at that site is 0.01, then one can make a convincing (to most) argument that the only ancestral protein that needs to be resurrected is the one holding a Met at Site 39.

Computational Methods

A variety of computer programs are now available to construct multiple sequence alignments, build trees, and infer ancestral sequences. Clustal W is widely used for the first, although nearly all practitioners adjust the multiple sequence alignments that Clustal W produces by hand. Further, most practitioners adjust the trees that are generated by automated computer tools by hand, with the goal of having the tree topology conform to


the topology expected based on information independent of the sequences in the family of interest.

The identification of the optimal (best scoring and, we hope, best corresponding to the historical reality) multiple sequence alignments and evolutionary trees is computationally expensive. The number of possible alternative trees scales severely with increasing numbers of sequences. A large literature, not reviewed here, discusses heuristics that generate trees with many leaves more efficiently (Huelsenbeck et al., 2001).

Once a multiple sequence alignment and a tree are in hand, several computer programs are available to infer ancestral sequences. Parsimony methods are implemented, for example, in programs such as MacClade and PAUP* (Maddison and Maddison, 1989; Swofford, 2001; Swofford, 1996). MacClade allows a user to manually alter the tree interrelating the input sequences to find the most parsimonious tree. This is a powerful tool, as the interaction between the experimentalist and his/her intuition with the computational support provided by the program allows the user to ask "What if?" and "Why not?" questions with ease. PAUP* does not have this interactive feature. It has the advantage, however, of being able to calculate the maximum likelihood tree, as well as the most parsimonious tree.

Maximum likelihood analysis for the inference of ancestral sequences is implemented in programs such as Darwin (Gonnet and Benner, 1991), PHYLIP (Felsenstein, 1989), MOLPHY (Adachi, 1996), PAML (Yang, 1997) and NHML (Galtier and Gouy, 1998). These all make accessible one or more formal models for evolution and


use a likelihood score as an optimality criterion (Felsenstein, 1981). Optimization of the likelihood score can be used to specify topology and parameters such as branch lengths, character state frequencies, and ancestral states (Cai, Pei and Grishin, 2004; Pupko et al., 2000; Thornton, 2004; Zhang and Nei, 1997).

In the examples discussed below, these and other methods are used to infer the sequences of ancient proteins that are to be resurrected. We will comment on the methods used throughout the discussion, to allow the reader to better understand the uncertainties that the methods generated, and the ambiguity of the resurrection achieved.

How not to Draw Inferences About Ancestral States

Each of the methods described above exploits the information in a sequence according to the position of the sequence within a tree, which models the familial relationships of the protein sequences. The tree weights the sequences so that two nearly identical sequences do not contribute twice as much as one distant sequence to the inferences of the ancestral sequences.

For this reason, consensus tools are not preferred as ways to infer ancestral character states. Consensus tools allow each member of the family to "vote", and build a consensus sequence by "majority rule". Thus, if 6 proteins hold a Met at Site 39, and 18 hold a Leu at Site 39, the consensus holds a Leu at Site 39. This approach makes no sense if all of the 18 proteins holding Leu are from the same breed of ox, and the 6 holding Met are from buffalo, eland, deer, sheep, impala and camel.

Several groups, especially in the earliest literature in experimental paleoscience, used a consensus sequence to approximate an ancestral sequence. We include these in this review despite this defect. Even if a tree is not known precisely, an approximate tree


can better weight different sequences to give better inferred ancestral sequences than a consensus tool.

Ambiguity in the Historical Models

Sources of Ambiguity in the Reconstructions

As with any inference about the past, ancestral sequences are not inferred with absolute certainty. Mistakes in the sequence database, failure of approximations built into the evolutionary theory, uncertainty in the multiple sequence alignment, and uncertainty in the tree topology all contribute to the ambiguity, even though these contributions are not captured in the formalism of a maximum likelihood analysis based on a single MSA and a single tree.

In addition to ambiguity in the database, theory, alignment, and tree, a measure of the uncertainty of the inferences is formally offered by a maximum likelihood analysis. The extent of this ambiguity is captured in the fractional probabilities of each of the amino acids inferred for each site. While different ML tools give different levels of ambiguity in any given case (and sometimes infer different preferred amino acids at a site), it is difficult to know which ML tool is most likely to capture the historical reality for any particular protein family. Methods for performing statistical tests on models, such as the likelihood ratio test and Bayes factors, may be used to identify the best model to fit the data (Huelsenbeck, Larget and Alfaro, 2004).

Such tests do not, of course, capture the ambiguity that arises from defects in the underlying data or approximations in the theory. The principal source of formal ambiguity in the inference by ML tools of the amino acid at an ancestral site arises simply because the history of a site contains too many replacements relative to the degree of articulation of the tree. Further ambiguity arises if that history includes cases of


homoplasy (e.g., parallel or convergent evolution) where the same replacement occurs on different branches of a tree with a higher-than-random frequency. Such events are not captured within the formalism of the theory.

These issues are illustrated by two sites, 168 and 211, from the alcohol dehydrogenases from various taxa of yeast (Figure 1-15). In this example, the topology of the tree was relatively secure, as were the sequences at the leaves, which were checked in many ways (Thomson et al., 2005). The sites had suffered too many replacements, however, for any method (parsimony or ML) to strongly infer the presence of one amino acid at a key node in the tree (the node at the right end of the red branch). The corresponding ML analysis gave posterior probabilities substantially less than 0.5 for all 20 amino acids at that node, with no individual amino acid having a clear preference.

Sites such as these are often discussed when different maximum likelihood tools are compared for their ability to infer ancestral sequences. ML tools generate significantly different inferences at such sites compared to parsimony tools. Different ML tools often generate considerably different probabilistic ancestral sequences (e.g., DNA-, codon-, or amino acid-based models). Extensive discussion is emerging as to which tool makes "better" inferences. In such discussions, it is important to recognize that the systematic errors introduced by incomplete evolutionary theories, including those arising from the assumptions that amino acid replacements at individual sites reflect replacements at the average site, might be more significant than the formal uncertainty expressed by ML probabilities.

The experience in experimental paleogenetics over the past 20 years suggests that even with abundant data and highly articulated trees, ambiguity will remain. In those


cases where it remains, the experimental paleogeneticist must consider how to manage it.

Managing Ambiguity

Four ways are commonly used to manage the ambiguity in a set of input sequences. The first relies on statistical models of sequence evolution wherein ‘optimized’ parameters are used as input to estimate ancestral character states. Here, inferred character states are often ‘resolved’ from ambiguity using the branch lengths interconnecting nodes of the phylogenetic tree. The second involves collecting more sequences in the hope of eliminating the ambiguity. The third ignores the ambiguity based on an argument that the ambiguity occurs only at sites that are not critical for the biological interpretation. The fourth involves synthesizing and studying many candidate ancestral sequences to cover all plausible alternative reconstructions, or to sample among the plausible alternative reconstructions.

Hierarchical models of inference

Frequent ambiguity in the estimation of ancestral character states is a weakness of the parsimony approach (vide supra), and can be rampant even when the relationships among the sequences are known. Individual efforts to resolve this weakness have attempted to accommodate branch length information and empirical frequencies associated with amino acid replacements to infer ancestral states (Koshi and Goldstein, 1996; Schluter, 1995; Yang, Kumar and Nei, 1995). A concerted effort to unite tools from various laboratories was made at a symposium entitled “Reconstructing Ancestral Character States” in 1999, which focused on phenotypic and ecological characters (Cunningham, 1999; Martins, 1999; Mooers and Schluter, 1999; Omland, 1999; Pagel, 1999b; Ree and Donoghue, 1999; Schultz and Churchill, 1999).


This was followed up six years later by the inaugural conference on "Ancestral Sequence Reconstruction", which focused on ancestral nucleic and amino acid states. A common discussion presented at both symposia addressed uncertainties in evolutionary models and their impact on inferred ancestral states (Huelsenbeck and Bollback, 2001; Krishnan et al., 2004; Schultz and Churchill, 1999; Schultz, Cocroft and Churchill, 1996).

Central in these discussions are "hierarchical Bayesian" approaches. A Bayesian analysis begins by constructing a formal relationship between an unknown, a data set, and one or more inference rules. Adding inference rules can create a hierarchy of analyses. These attempt to accommodate uncertainty in optimized parameters such as tree topology, branch lengths, and rate heterogeneity, inter alia, by adding stepwise sophistication to the models (Huelsenbeck and Bollback, 2001; Pagel, Meade and Barker, 2004). The phylogenetic accuracy, consistency and congruence of these models remain to be determined.

We expect that the more uncertainties in the estimates of the parameters are accepted by a model, the more ambiguous the ancestral states will appear. Thus, this approach does not solve the key problem in experimental resurrections: One needs to actually resurrect an ancestral sequence, not be paralyzed by the perception of ambiguity.

The challenge will require a balance between acceptable uncertainty and acceptable parameter estimation. To this end, researchers at the Foundation for Applied Molecular Evolution and the University of Florida are currently using computer simulations to evaluate the performance of hierarchical Bayesian approaches. Nevertheless, for


practicing experimental paleogenetics, it is clear that purely mathematical tools may not resolve issues of ambiguity.

Collecting more sequences

The least controversial way to manage ambiguity in an ancient reconstruction is simply to collect more sequences. If strategically chosen, additional sequences can articulate a tree in a way that resolves ambiguity in a parsimony analysis, or alters the posterior probabilities in a maximum likelihood analysis to increase confidence in the inference.

Depending on the taxa involved, collecting more sequences might be a simple task, requiring only that the scientist obtain biological tissue specimens of organisms that branch from the tree at the positions where further articulation of the tree might resolve ambiguities. Extinctions of lineages can obviously prevent this from being done. To the extent that extinctions have removed information from the biosphere, it may be impossible to find an extant organism that branches from the tree at a strategic point useful to resolve an ambiguity in ancestral reconstruction.

Of course, it is always possible that collecting additional sequences may not resolve ambiguities. Indeed, additional sequences might create new ambiguities, especially when long branches are being articulated. This is not bad, as it means that additional sequences have discovered ambiguities that exist, but were not revealed by the previous, smaller dataset. Although it is obvious, a general rule is worth stating: The more sequences in the dataset, and the more broadly the relevant history is sampled, the more reliable the reconstructed ancestral sequences will be.

The biodiversity represented in the microbial world is only beginning to be explored. This suggests that sequencing of the type that Venter performed will


add data, and make paleomolecular reconstructions increasingly more reliable, over the coming decades. The same is true for many metazoan phyla. The diversity of genetic information within, for example, the beetles, is largely unexplored, meaning that we do not know how far back in time we will be able to resurrect proteins from extinct arthropods. Here, the mass extinction at the Permian-Triassic boundary appears to be the first problematic event in Earth history that might have removed sufficient genetic information to cause problems.

Unfortunately, this is not the case with mammals. Reconstructions in mammals can be well supported over the past 100 million years, given the number of radiant mammal orders that continue to leave descendants in the modern world. Even here, the loss of large eutheria remains a problem. Those interested in, for example, the molecular paleontology of the dawn horse will be disappointed in the number of descendant lineages that survive. Here, the experimental paleomolecular biologist will likely be forever constrained by the history of the terran biosphere.

Select sites considered to be important and ignore ambiguity elsewhere

If ambiguity cannot be resolved by additional sequencing, we might simply ignore the ambiguity at some sites by focusing on just a few where specific amino acids are believed to be critical to biological function, and where ambiguity is not observed. The strategy then involves ignoring the remaining ambiguity, in the hope that it does not influence the behavior of the protein that is the object of biological interpretation. A large number of the examples discussed below manage ambiguity in this way.

This strategy is controversial, and for good reasons. The behavior of a protein is generally not a linear function of the amino acids in its sequence. It is impossible,


therefore, to say with certainty which sites are critical (indeed, that is often why the paleomolecular experiment is being done).

Thus, examples are known from protein engineering where the impact of an amino acid replacement at Site i is different depending on the amino acid occupying Site j. This means that an amino acid replacement may have an impact on the behavior of a protein in some contexts that is different from its impact in others. Further, many examples are now known from protein engineering where the behavior of a protein at its active site is influenced by an amino acid replacement far from the active site.

Thus, ignoring ambiguity is recommended only if there is no alternative, or if a relatively comprehensive examination of the sites in single mutagenesis experiments makes compelling the argument that ambiguity at these sites can be ignored. Confidence in this approach could be improved if the examined protein has multiple domains that have been experimentally shown to function independently and the reconstruction ambiguity is not in the domain of interest.

Synthesizing multiple candidate ancestral proteins that cover, or sample, the ambiguity

A relatively non-controversial strategy for managing ambiguity involves the synthesis of all of the candidate ancestral sequences that are plausible given the model. Thus, if the ancestor has one site that is ambiguous, and its ambiguity arises from the failure of the analysis to decisively choose one of two amino acids, then both sequences can be resurrected as candidate ancestral proteins. If the behavior to be interpreted is the same in both candidates, then the ambiguity has no impact on the biological interpretation. The biological interpretation is said to be "robust" with respect to the ambiguity.
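The strategy of covering the ambiguity can be made concrete with a short Python sketch that lists every candidate ancestral sequence implied by a set of ambiguous sites. Everything named here is an assumption made for illustration: the function `candidate_ancestors`, the 0-based site indices, and the six-residue toy sequence are hypothetical and do not come from any reconstruction discussed in the text.

```python
from itertools import product


def candidate_ancestors(base_sequence, ambiguous_sites):
    """List every candidate sequence covering the ambiguous sites.

    base_sequence   : string of one-letter amino acid codes
    ambiguous_sites : dict mapping a 0-based position to the residues
                      that remain plausible at that position
    """
    positions = sorted(ambiguous_sites)
    candidates = []
    # One candidate per combination of plausible residues across the sites.
    for choice in product(*(ambiguous_sites[p] for p in positions)):
        residues = list(base_sequence)
        for pos, aa in zip(positions, choice):
            residues[pos] = aa
        candidates.append("".join(residues))
    return candidates


# Hypothetical toy ancestor with two sites that are each ambiguous between
# two residues; this yields 2 x 2 = 4 candidates.
toy = candidate_ancestors("ACDEFG", {1: "CS", 4: "FY"})
print(toy)
```

With k sites that are each ambiguous between two residues, the list holds 2^k sequences, so the number of proteins to synthesize grows quickly with the number of ambiguous sites.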


If the number of sites holding ambiguities is large, this strategy may require the synthesis of a large number of candidate ancestral sequences. For example, if 10 sites are ambiguous with respect to the choice of two different amino acids at each site, a total of 1024 (= 2^10) different candidate ancestral sequences must be synthesized to cover all combinations of amino acids at the different sites. This can, of course, strain a laboratory budget.

An alternative is to sample among the candidate ancestral sequences. Here, a library is constructed that contains the candidate ancestral sequences, and a sample of these is studied. The library can be biased to reflect the fractional posterior probabilities in the inferred ancestral sequence, so that the sampling captures the Bayesian features of the analysis. If the behavior of all of the candidate ancestral sequences that are sampled is the same with respect to the phenotype that supports the biological interpretation, then it is possible to argue that the interpretation is robust with respect to the ambiguity. The extent to which the argument is persuasive will depend on the size of the sample, the extent of the ambiguity and the taste of the scientist.

The Extent to Which Ambiguity Defeats the Paleogenetic Paradigm

If the hypersurface relating protein behavior to protein sequence were extremely rugged, and if every amino acid replacement caused a significant change in behavior, then ambiguity would defeat the paleogenetic research approach in all but the most ideal cases. Fortunately, biochemical reality is different. For nearly all proteins, some amino acid replacements at some sites have large impacts on functional behaviors, replacements at other sites have modest impact on those behaviors, and replacements at still other sites have even less impact on most behaviors.


This fact tends to ameliorate the extent to which ambiguity compromises work in experimental paleoscience. Ambiguity generally is found at sites that have suffered the most amino acid replacements. Multiple amino acid replacements often (but not always) reflect "neutral drift" at a site. Neutral drift implies that the choice of a residue at the site does not have a significant impact on fitness. This generally (but not always) means that replacement of an amino acid at that site does not have any impact on the behavior of a protein that can be detected by an in vitro experiment.

Stringing these logical premises together, we can expect (but not always) that biologically interpretable behavior will not differ greatly between ancestral sequences that differ only at ambiguous sites. To the extent that the premises are true, ambiguity in general will not limit our ability to draw inferences about the behavior of ancestral proteins by experimental analysis of ancestral sequences, even if our analysis does not capture all of the ambiguity in those sequences. This, in turn, means that we will generally be able to use those behaviors to generate interesting biological interpretations. In fact, this is the case, as is illustrated by approximately 20 examples of experimental paleogenetics to emerge over the past two decades.

Examples

Further discussion of the details of evolutionary theories and tools to build models for the history of protein sequences can be found in the literature. Below, we review the examples of experimental paleogenetics where these theories and tools have been applied. We present these in approximately the order in which they appeared in the literature. The presentation deviates from this order when it makes logical sense to group a series of studies together.


Ribonucleases from Mammals: From Ecology to Medicine

The family of proteins related to bovine pancreatic ribonuclease A (RNase A) provided the first biomolecular system to be analyzed using experimental paleobiochemistry. These studies extended the long history during which RNase contributed to the development of tools in the biomolecular sciences. RNase was also the first protein to be observed by nuclear magnetic resonance methods (Saunders, Wishnia and Kirkwood, 1957), one of the first to be reconstituted from its parts (Richards and Logue, 1962), one of the first to be analyzed by protein sequencing (Moore and Stein, 1973), the very first protein to be synthesized (Denkewalter et al., 1969; Hirschmann et al., 1969; Jenkins et al., 1969; Strachan et al., 1969; Veber et al., 1969), and the first enzyme for which a synthetic gene was prepared (Nambiar et al., 1984).

Members of the RNase family of proteins are typically composed of a signal peptide of about 25 amino acids and a mature peptide of about 130 amino acids. Most members of the RNase family have three catalytic residues (one lysine and two histidines, at positions 41, 12 and 119 in RNase A). These come together in the folded enzyme to form an active site. In addition, RNases generally have six or eight cysteines that form three or four disulfide bonds. Except for these conserved residues, the sequences of RNases have diverged substantially in vertebrates, with sequence identities as low as 20% when comparing oxen and frog homologs (for example).

Before paleobiochemical experiments began, RNase was known simply as a digestive enzyme. In the subsequent 20 years, developments on many fronts have shown that the digestive function in the RNase family is a relatively recent innovation, and is important in only a few mammal orders (Benner, 1988). Behaviors ranging from immunosuppressivity and antitumor activity to duplex DNA binding and antiviral activity


are now known in RNase family members. Today, paleobiochemistry is arguably the most important tool being used to sort out the rich functional diversity in this family of proteins; it is unlikely that this functional understanding of the family could have been so effectively developed without paleobiochemical studies.

Resurrecting ancestral ribonucleases from artiodactyls

In the early 1980's, the RNase A family was a practical choice to begin paleobiochemical work. Jaap Beintema and his coworkers had spent many years sequencing RNase homologs isolated from the pancreases of a variety of mammals (Beintema et al., 1985; Beintema, Gaastra and Munniksma, 1979; Beintema and Gruber, 1967; Beintema and Gruber, 1973; Beintema and Martena, 1982; Beintema et al., 1984; Breukelman et al., 2001; Emmens, Welling and Beintema, 1976; Gaastra et al., 1974; Gaastra, Welling and Beintema, 1978; Groen, Welling and Beintema, 1975; Jekel et al., 1979; Kuper and Beintema, 1976; Lenstra and Beintema, 1979; Muskiet, Welling and Beintema, 1976; Vandenberg, Vandenhendetimmer and Beintema, 1976; Vandijk et al., 1976; Welling, Groen and Beintema, 1975; Welling, Mulder and Beintema, 1976). Done before the "age of the genome", this work exploited classical Edman degradation of peptide fragments derived by selective cleavage of the protein. Such work required substantial amounts of protein, making convenient the large amount of RNase found in the digestive tracts of oxen and their immediate relatives.

As expected for enzymes found in the digestive tract, RNases were themselves robust. For example, the first step in the purification of RNase A involved the treatment of an extract from ox pancreas with 0.25 M sulfuric acid. This procedure precipitates most other proteins and removes the glycosyl groups from RNase, but otherwise leaves the protein intact. Thus, by 1980, ca. 50 different RNase sequences were known.


At that time, few other protein families were so well represented in the protein sequence database. Other families that had been well sequenced included cytochrome C, which had been developed as a paradigm for molecular evolution by Margoliash (Margoliash, 1963; Margoliash, 1964), and hemoglobin, which was studied as a model for biomolecular adaptation (Bonaventura, Bonaventura and Sullivan, 1974; Riggs, 1959).

Each of these alternative families was problematic as a system for developing paleobiochemistry as a field. The cytochromes are themselves substrates for other proteins, the cytochrome C oxidases. This suggested that studies on ancestral cytochromes would need to involve resurrected ancestral oxidases. As no U.S. Federal agency was willing to fund paleobiochemistry in the 1980's, resurrecting ancestral sequences in one family was likely to be difficult; resurrecting two sets of ancestors from two families was considered to be impossible.

Hemoglobins remain a promising family for paleomolecular resurrections. To date, however, only one laboratory has explored them for this purpose (Benner & Schreiber, unpublished).

RNases proved to present several opportunities for biological interpretation and discovery. As digestive enzymes, pancreatic RNases lie at one interface between their host organisms and their changing environments, and are expected to evolve with the environment. Not all mammals, however, have large amounts of pancreatic RNase. In fact, RNase is abundant in the digestive system primarily in ruminants (which include the oxen, antelopes, and other bovids, together with the sheep, the deer, the giraffe, okapi, and pronghorn) and certain other special groups of herbivores (Barnard, 1969).


In 1969, Barnard proposed that pancreatic RNase was abundant primarily in ruminants because ruminant digestion created a special need for an enzyme that digested RNA (Barnard, 1969). Ruminant digestive physiology is considerably different from human digestive physiology (for example). The ruminant foregut serves as a vat to hold fermenting microorganisms. The ox delivers fodder to these microorganisms, which produce digestive enzymes (including cellulases) that the ox cannot. The microorganisms digest the grass, converting its carbon into a variety of products, including low molecular weight fatty acids. The fatty acids then enter the circulation system of the ruminant, providing energy. The ox then eats the microorganisms for further nourishment.

According to the Barnard hypothesis, this digestive physiology creates a need for especially large amounts of intestinal RNase to digest microorganisms. The fermenting microorganisms are packed with ribosomes and ribosomal RNA, transfer RNA, and messenger RNA. Fermenting bacteria therefore deliver large amounts of RNA to the gastric region of the bovine stomach and the small intestine. Barnard estimated that between 10 and 20 percent of the nitrogen in the diet of a typical bovid enters the lower digestive tract in the form of RNA.

Barnard's hypothesis was certainly consistent with the high level of digestive enzymes in the ruminant generally. For example, ruminants have large amounts of lysozyme active against bacterial cell walls in their digestive tracts.

Was the Barnard hypothesis merely a "just so" story, based on correlations that did not require causality or functional necessity? The first experimental paleobiochemistry program set out to test this.


As discussed above, the available sequences were adequate to support the inference, with little ambiguity, of the sequence of the RNase represented (approximately) by the fossil ruminant Pachyportax (Figure 1-4) (Stackhouse et al., 1990). This was also the case for the more ancient Eotragus, which lived in the Miocene (Figure 1-16). The available RNase sequences also permitted the inference, with only modest ambiguity, of sequences for RNases in the first ruminant, approximated in the fossil record by the genus Archaeomeryx (Figure 1-16).

With slightly more ambiguity, the contemporary RNase sequences allowed the inference of the sequences of RNase in the first artiodactyl, the order of mammals having cloven hoofs that includes the true ruminants as well as the camels, the pigs, and the hippos. This ancestor is approximately represented in the fossil record by the genus Diacodexis (Figure 1-15). A collaboration between the Center for Reproduction of Endangered Species at the San Diego Zoo and the Benner laboratory yielded several additional sequences that assisted in these inferences (Trabesinger-Ruef et al., 1996).

Once the ancestral sequences were reconstructed, the Benner group prepared by total synthesis a gene for RNase that was specially designed to support the resurrection of ancient proteins (Nambiar et al., 1984; Stackhouse et al., 1990). From this gene, approximately two dozen candidate ancestral genes for intermediates in the evolution of artiodactyl ribonucleases were synthesized, cloned, and expressed to resurrect the ancestral proteins for laboratory study (Jermann et al., 1995; Stackhouse et al., 1990).

To assess whether reconstructions yielded proteins that were plausible as intermediates in the evolution of the RNase family, the catalytic activities, substrate specificities, and thermal/proteolytic stabilities of the resurrected ancestral RNases were


examined. Most of the resurrected proteins, and all of those corresponding to proteins expected in artiodactyls living after Archaeomeryx, behaved as expected for digestive enzymes. This was especially apparent from their kinetic properties (Table 1-4). Modern digestive RNases are catalytically active against small RNA substrates and single stranded RNA (Blackburn and Moore, 1982). The RNase from Pachyportax was also, as were many of the earlier RNases. Thus, if one assumes that these catalytic properties are indicative of a digestive enzyme, these ancestral proteins were digestive enzymes as well.

This was also true quantitatively. Thus, the kcat/KM values for the putative ancestral RNases with the ribodinucleotide uridylyl 3'->5'-adenosine (UpA) as a substrate (Ipata and Felicioli, 1968) in many ancient artiodactyls proved not to differ by more than 25% from those of contemporary bovine digestive RNase (Table 1-4). With single stranded poly(U) as substrate, the variance in catalytic activity was even smaller (18%).

Modern digestive RNases, like most digestive enzymes, are stable to thermal denaturation and cleavage by proteases. This suggested another metric for determining whether the ancestral proteins acted in the digestive tract. Using a method developed by Lang and Schmid (Lang and Schmid, 1986), the sensitivity of the ancestral RNases to proteolysis as a function of temperature was measured (Table 1-5). Again, little change was observed in thermal stability of the ancestral RNases back to the ancestral artiodactyls approximated by Archaeomeryx in the fossil record. The midpoints in the activity-temperature curves for these ancient proteins varied by only 1.1 °C when compared with RNase A. This can be compared with typical experimental errors of 0.5 °C.
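The quantitative comparison described above comes down to two small calculations: the fractional deviation of an ancestral kcat/KM from the modern value, and the size of a thermal midpoint shift relative to the experimental error. The following is a minimal sketch of that arithmetic. The function names and the kinetic values are hypothetical placeholders chosen for illustration; they are not the measurements reported in Table 1-4 or Table 1-5.

```python
# All kinetic numbers below are hypothetical placeholders for illustration;
# they are NOT the measurements reported in Table 1-4 or Table 1-5.
MODERN_KCAT_KM = 100.0  # arbitrary units for the modern digestive RNase


def relative_deviation(ancestral: float, modern: float) -> float:
    """Fractional difference between an ancestral value and the modern reference."""
    return abs(ancestral - modern) / modern


def resembles_modern(ancestral: float, modern: float, tolerance: float = 0.25) -> bool:
    """True when the ancestral kcat/KM lies within `tolerance` (25% by default) of the modern value."""
    return relative_deviation(ancestral, modern) <= tolerance


def shift_in_error_units(midpoint_shift_c: float, experimental_error_c: float) -> float:
    """Express a thermal midpoint shift as a multiple of the experimental error."""
    return midpoint_shift_c / experimental_error_c


if __name__ == "__main__":
    # Hypothetical ancestors: one close to the modern enzyme, one five-fold less active.
    hypothetical_ancestors = {"recent ancestor": 85.0, "deep ancestor": 20.0}
    for name, value in hypothetical_ancestors.items():
        deviation = relative_deviation(value, MODERN_KCAT_KM)
        print(name, round(deviation, 2), resembles_modern(value, MODERN_KCAT_KM))
    # The text quotes a 1.1 degree spread against a typical error of 0.5 degrees.
    print(round(shift_in_error_units(1.1, 0.5), 1))
```

Under these assumptions, an ancestor whose activity is five-fold lower than the modern enzyme falls well outside the 25% window, which is the kind of contrast the deeper ancestors are reported to show.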


Had all of the ancestral RNases behaved like modern RNases, the resulting evolutionary narrative would have had little interest. The experiments in paleobiochemistry became interesting because RNases resurrected from organisms more ancient than the last common ancestor of the true ruminants (Archaeomeryx and earlier) did not behave like digestive enzymes by these metrics.

These more ancient resurrected ancestral RNases displayed a five fold increase in catalytic activity against double stranded RNA (poly(A)-poly(U)). This is not a digestive substrate. Further, the ancestral RNases showed an increased ability to bind and melt double stranded DNA. Bovine digestive RNase A has only low catalytic activity against duplex RNA under physiological conditions, and does not bind and melt duplex DNA; these activities are presumably not needed for a digestive enzyme. At the same time, the catalytic activity of the candidate ancestral sequences against single stranded RNA and short RNA fragments, the kinds of substrates that are expected in the digestive tract, was substantially lower (by a factor of 5) than in the modern proteins. Proposing that these phenotypes can be used as metrics, Jermann et al. (Jermann et al., 1995) concluded that RNases in artiodactyls that were ancestral to Archaeomeryx were not digestive enzymes.

A similar inference was drawn from stability studies. The more ancient ancestors displayed a modest but significant decrease in thermal-proteolytic stability using the assay of Lang and Schmid. Jermann et al. (Jermann et al., 1995) considered the possibility that the decrease in stability might reflect an incorrect reconstruction. A less stable enzyme, and a lower activity against single stranded RNA, for example, might imply simply that the incorrect amino acid sequence was inferred for the ancestral


protein. The fact that catalytic activity against double stranded RNA, and the ability to melt duplex RNA, was higher in the ancestors argued against this possibility.

The issue was probed further by considering the ambiguity in the tree. The connectivity of deep branches in the artiodactyl evolutionary tree is not fully clarified by either the sequence data or the fossil record (Table 1-3) (Graur, 1993). This created a degree of ambiguity in the ancestral sequences. To manage this ambiguity, Jermann et al. synthesized a variety of alternative candidate ancestral RNase sequences. These effectively covered all of the ambiguity in the tree topology, and the resulting ambiguity in the sequences. The survey showed that the measured phenotype (and the consequent biological interpretation) was robust with respect to the ambiguity.

Site 38 proved to be especially interesting. The variant of h1 (Figure 1-18) that restores Asp at position 38 (as in RNase A) has a catalytic activity against duplex RNA similar to that of RNase A (Jermann et al., 1995; Opitz et al., 1998). Conversely, the variant of RNase A that introduces Gly alone at position 38 has catalytic activity against duplex RNA essentially that of ancestor h. These results show that substitution at a single position, 38, accounts for essentially all of the increased catalytic activity against duplex RNA in ancestor h.

The reconstructed amino acids at position 38 are unambiguous before and after the Archaeomeryx sequence. Thus, it is highly probable that the changes in catalytic activity against duplex RNA in fact occurred in RNases as the ruminant RNases arose. In one interpretation, catalytic activity against duplex RNA was not necessary in the descendant RNases, and therefore was lost. This implies that the replacement of Gly 38 by Asp in the evolution of ancestor g from ancestor h was neutral. Jermann et al. could not, however,


rule out an alternative model, that Asp 38 confers positive selective advantage on RNases found in the ruminants.

Understanding the origin of ruminant digestion

The experimental paleobiochemical data within the pancreatic RNase family suggested a coherent evolutionary narrative consistent with the Barnard hypothesis. RNases with increased stability, decreased catalytic activity against duplex RNA, decreased ability to bind and melt duplex DNA, and increased activity against single stranded RNA and small RNA substrates, emerged near the time when Archaeomeryx lived. The properties that increased are essential for digestive function; the properties that decreased are not. Archaeomeryx was the first artiodactyl to be a true ruminant. This implies that a digestive RNase emerged when ruminant digestion emerged.

This converts the Barnard hypothesis (or, pejoratively, the Barnard "just so" story) into a broader narrative. This narrative became still more compelling when the molecular behavior is joined to the historical record as known from the fossil and geological records. These records suggested that the camels, deer, and bovid artiodactyl genera diverged ca. 40 million years ago (Ma), together with ruminant digestion and the digestive RNases to support it, at the time of global climate change that began at the end of the Eocene.

This climate change eventually involved the lowering of the mean temperature of Earth by ca. 17 °C, and the drying of large parts of the surface (Janis et al., 1998). This, in turn, was almost certainly causally related to the emergence of grasses as a predominant source of vegetable food in many ecosystems. Tropical rain forests receded, grasslands emerged, and the interactions between herbivores and their foliage changed.


Grasses offer poor nutrition compared to many other flora, and ruminant physiology appears to have substantial adaptive value when eating grasses. This, in turn, may help explain why ruminant artiodactyls were enormously successful in competition with the herbivorous perissodactyls (for example, horses, tapirs, and rhinoceroses) as the global climate change proceeded. Today, nearly 200 species of artiodactyls have displaced the ca. 250 species of perissodactyls that were found in the tropical Eocene. Today, only three species groups of perissodactyls survive. This is the principal reason why resurrection of enzymes from the dawn horse will remain outside of the reach of contemporary paleomolecular biologists.

Ribonuclease homologs involved in unexpected biological activities

The paleobiochemical experiments with pancreatic RNases suggested that RNases having digestive function emerged in artiodactyls from a non-digestive precursor about 40 Ma. This implies, in turn, that non-digestive cousins of digestive RNases might remain in the genomes of modern mammals, where they might continue to play a non-digestive role.

This suggestion, generated from the first experiments in paleogenetics, emerged at the same time as researchers were independently discovering non-digestive paralogs of digestive RNase A. These were termed "RIBAses" (ribonucleases with interesting biological activities) by D'Alessio (D'Alessio et al., 1991). They include RNase homologs that display immunosuppressive (Soucek et al., 1986), cytostatic (Matousek, 1973), antitumor (Ardelt, Mikulski and Shogen, 1991), endothelial cell stimulatory (Strydom et al., 1985), and lectin-like activities (Okabe et al., 1991). These proteins all appeared to be extracellular, based on their secretory signal peptides and the presence of disulfide bonds.


Their existence suggested to some that perhaps a functional RNA existed outside of cells (Benner, 1988).

These results suggested that the RNase A superfamily was extremely dynamic in vertebrates, with larger than typical amounts of gene duplication, paralog generation, and gene loss. In humans, for example, prior to the completion of the genome sequence, eight RNases were already known. These included the poorly named human pancreatic ribonuclease (RNase 1, which does not appear to be a protein specific for the pancreas), the equally poorly named eosinophil-derived neurotoxin (EDN, or RNase 2, which does not appear to have a physiological role as a neurotoxin), the eosinophil cationic protein (ECP, or RNase 3, aptly named in the sense that the name captures about all that we know about the protein), RNase 4, angiogenin (also questionably named, RNase 5), RNase 6 (sometimes known as k6), RNase 7 (Harder and Schroder, 2002; Zhang, Dyer and Rosenberg, 2003), and RNase 8 (Zhang, Dyer and Rosenberg, 2002).

The in silico analysis of the human genome showed that the human RNase 1 genes lie on chromosome 14q11.2 as a cluster of ~368 kb. In order from the centromere to the telomere, the genes are angiogenin (RNase 5), RNase 4, RNase 6, RNase 1, ECP (RNase 3), an EDN pseudogene, EDN itself (RNase 2), RNase 7, and RNase 8, separated from each other by 6- to 90-kb intervals. The genome analysis also identified two new human RNase homologs (RNase 9 and RNase 10) in this cluster preceding angiogenin. In addition, three new open reading frames sharing a number of common features with other RNases were found. Beintema therefore proposed to name these RNases 11, 12, and 13. RNase 11 and RNase 12 are located between RNase 9 and angiogenin. RNase 13 lies on


the centromere side of RNase 7, and has a transcriptional direction opposite to that of RNases 7 and 8. The human genome reveals no other ORFs with significant similarity to these RNase genes. Therefore, it is likely that all human RNase A superfamily members have been identified.

As in humans, rat RNase genes are located on one chromosome (15p14) in a single cluster. The cluster in the rat genome contains the RNase family in the same syntenic order and transcriptional direction as in human, with only a few exceptions. The RNase 1 family (RNase1h, RNase1g, and RNase1y), the eosinophil-associated RNase family (EAR) (R15-17, ECP, R-pseudogene, and Ear3), and the angiogenin family (Ang1 and Ang2) have undergone expansion in the rat (Dubois et al., 2002; Singhania et al., 1999; Zhao et al., 1998). Further, orthologs of human RNases 7 and 8 are not present in the rat genome. This permits us to propose a relatively coherent model for the order of gene creation in the time separating primates and rodents, and a listing of the RNase homologs likely to have been present in the last common ancestor of primates and rodents.

The dynamic behavior of this group of genes is shown by the differences separating the rat and mouse groups. In mouse, two RNase gene clusters are found, on mouse chromosome 14qB–qC1 ("cluster A") and chromosome 10qB1 ("cluster B"). Cluster A is syntenic to the human and rat clusters and is essentially identical to the rat cluster in gene content and order except for substantial expansions of the EAR and angiogenin gene subfamilies. Cluster B emerged in mouse after the mouse–rat divergence, and contains only genes and pseudogenes that belong to the EAR and angiogenin subfamilies. It also includes a large number of pseudogenes.


This level of diversity presents many "Why?" questions that might be addressed using molecular paleoscience. To date, two of these have been pursued, one in the Rosenberg laboratory, the second in the Benner laboratory.

Paleobiochemistry with eosinophil RNase homologs

In an effort to understand more about the function of these abundant RNase paralogs, Zhang and Rosenberg examined the eosinophil-derived neurotoxin (EDN) and eosinophil cationic protein (ECP) in primates (Zhang and Rosenberg, 2002). These proteins arose by gene duplication some 30 Ma in an African primate that was ancestral to humans and Old World monkeys.

Zhang and Rosenberg first asked the basic question: Why do eosinophils have two RNase paralogs? Eosinophils are associated with asthma, infective wheezing, and eczema (Onorato et al., 1996); their role in the non-diseased state remains enigmatic. Some textbooks say that eosinophils function to destroy larger parasites and modulate allergic inflammatory responses. Others suggest that eosinophils defend their host from outside agents, with allergic diseases arising as an undesired side effect.

Earlier work by Zhang, Rosenberg and their associates had suggested that ECP and EDN might contribute to organismic defense in other ways. ECP kills bacteria in vitro, while EDN inactivates retroviruses (Rosenberg and Domachowske, 2001). In silico analysis of reconstructed ancestral sequences in primates suggested that the proteins had suffered rapid sequence change near the time of the duplication that generated the paralogs, a change that might account for their differing behaviors in vitro (Zhang, Rosenberg and Nei, 1998). This suggests that in primate evolution, mutations in EDN and ECP may have adapted them for different, specialized roles during the episodes of rapid sequence evolution.


To obtain a more densely articulated tree for the protein family, Zhang and Rosenberg sequenced additional genes from various primates. They used these sequences to better reconstruct ancestral sequences for ancient EDN/ECPs. They estimated the posterior probabilities of these ancestral sequences using Bayesian inference. Then, they resurrected these ancient proteins by cloning and expressing their genes (Zhang and Rosenberg, 2002).

Guiding the experimental work was the hypothesis that the anti-retroviral activity of EDN might be related to the ability of the protein to cleave RNA. Studies of the ancestral proteins allowed Zhang and Rosenberg to retrace the origins of the anti-retroviral and RNA cleaving activities of EDN. Both the ribonuclease and antiviral activities of the last common ancestor of ECP and EDN, which lived ca. 30 Ma, were low. Both activities increased in the EDN lineage after its emergence by duplication.

Zhang and Rosenberg showed that two replacements (at sites 64 and 132) in the sequence were together required to increase the ribonucleolytic activity of the protein; neither alone was sufficient. Zhang and Rosenberg then analyzed the three dimensional crystal structure of EDN to offer possible explanations for the interconnection between sites suffering replacement and the changes in behavior that they created.

Zhang and Rosenberg concluded that in the EDN/ECP family, either of the two replacements at sites 64 and 132 individually had little impact on behavior. Each does, however, provide the context for the other to have an impact on behavior. This provides one example where a "neutral" (perhaps better, behaviorally inconsequential) replacement might have set the stage for a second adaptive replacement.
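The "neither alone was sufficient" pattern can be expressed as a simple test on four activity measurements: the ancestral background, each single replacement, and the double replacement. The sketch below is one way to write that test down. The function names, the two-fold threshold, and the activity values in the usage example are assumptions made for illustration; they are not taken from Zhang and Rosenberg's data.

```python
import math


def fold_change(variant: float, background: float) -> float:
    """Activity of a variant relative to the ancestral background."""
    return variant / background


def jointly_required(background: float, single_a: float, single_b: float,
                     double: float, min_fold: float = 2.0) -> bool:
    """True when neither single replacement raises activity by `min_fold`,
    but the two replacements together do. The threshold is an assumption."""
    singles_inert = (fold_change(single_a, background) < min_fold
                     and fold_change(single_b, background) < min_fold)
    return singles_inert and fold_change(double, background) >= min_fold


def log_epistasis(background: float, single_a: float, single_b: float,
                  double: float) -> float:
    """Deviation of the double variant from the multiplicative expectation.
    Zero means the two sites act independently; positive means synergy."""
    return (math.log(double) - math.log(single_a)
            - math.log(single_b) + math.log(background))


if __name__ == "__main__":
    # Hypothetical relative activities (background set to 1.0).
    print(jointly_required(1.0, 1.1, 1.2, 13.0))
    print(log_epistasis(1.0, 1.1, 1.2, 13.0) > 0)
```

A model that treats sites as independent predicts a log-epistasis value near zero, so a clearly positive value is the numerical signature of the context dependence described above.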


This observation influences how protein engineering is done in general. Virtually all analyses of divergent evolution treat protein sequences as if they were linear strings of letters (Benner, Trabesinger and Schreiber, 1998). With this treatment, each site is modeled to suffer replacement independent of all others, future replacement at a site is viewed as being independent of past replacement, and patterns of replacements are treated as being the same at each site. This has long been known to be an approximation, useful primarily for mathematical analysis (the "spherical cow"). Understanding higher order features of protein sequence divergence has offered in silico approaches to some of the most puzzling conundrums in biological chemistry, including how to predict the folded structure of proteins from sequence data (Benner et al., 1997), and how to assign function to protein sequences (Benner, Trabesinger and Schreiber, 1998). The results of Zhang and Rosenberg provide an experimental case where higher order analysis is necessary to understand a biomolecular phenomenon.

Another interpretive strategy involving resurrected proteins (Benner, 1997) was suggested from the results produced by Zhang and Rosenberg. This strategy identifies physiologically relevant in vitro behaviors for a protein where new biological function has emerged, as indicated by an episode of rapid (and therefore presumably adaptive) sequence evolution. The strategy examines the behavior of proteins resurrected from points in history before and after the episode of adaptive evolution. Those behaviors that are rapidly changing during the episode of adaptive sequence evolution, by hypothesis, confer selective value on the protein in its new function, and therefore are relevant to the change in function, either directly or by close coupling to behaviors that are. The in vitro


properties that are the same at the beginning and end of this episode are not relevant to the change in function.

While the number of amino acids changing is insufficient to make the case statistically compelling, the rate of change in the EDN lineage is strongly suggestive of adaptive evolution (Zhang, Rosenberg and Nei, 1998). The antiviral and ribonucleolytic activities of the proteins before and after the adaptive episode in the EDN lineage are quite different. Benner (Benner, 2002), interpreting the data of Zhang et al., suggested that these activities are important to the emerging physiological role for EDN. This adds support, perhaps only modest, for the notion that the antiviral activity of EDN became important in Old World primates ca. 30 Ma.

The timing of the emergence of the ECP/EDN pair in Old World primates might also contain information. The duplication occurred near the start of a global climatic deterioration that has continued until the present, with the Ice Ages in the past million years being the culmination (we hope) of this deterioration. These are the same changes as those that presumably drove the selection of ruminant digestion. If EDN, ECP, and eosinophils are part of a defensive system, it is appropriate to ask: What happened during the Oligocene that might have encouraged this type of system to be selected? Why might new defenses against retroviruses be needed at this time? If we are able to address these questions we might better understand how to improve our immune defenses against viral infections, an area of biomedical research that is in need of rapid progress.

Paleobiochemistry with ribonuclease homologs in bovine seminal fluid

The rest of the dissertation will discuss this project in much more detail. New biomolecular function is believed to arise, at least in recent times, largely through recruitment of existing proteins having established roles to play new roles following gene


duplication (Benner and Ellington, 1990; Ohno, 1970). Under one model, one copy of a gene continues to divergently evolve under constraints dictated by the ancestral function. The duplicate, meanwhile, is unencumbered by a functional role, and is free to search protein "structure space". It may, eventually, come to encode new behaviors required for a new physiological function, and thereby confer selective advantage.

This model contains a well-recognized paradox. Because duplicate genes are not under selective pressure, they should also accumulate mutations that render them incapable of encoding a protein useful for any function. Most duplicates therefore should become pseudogenes (Lynch and Conery, 2000), inexpressible genetic information ("junk DNA" (Li, Gojobori and Nei, 1981)) in just a few million years (Jukes and Kimura, 1984; Marshall, Raff and Raff, 1994). This limits the evolutionary value of a functionally unconstrained gene duplicate as a tool for exploring protein "structure space" in search of new behaviors that might confer selectable physiological function.

One of the non-digestive RNase subfamilies offered an interesting system to use experimental paleobiochemistry to study how new function arises in proteins. This focused on the seminal RNases, paralogs found in ruminants that arose by duplication of the RNase A gene just as it was becoming a digestive protein. In ox, seminal RNase is 23 amino acids different from pancreatic RNase A. As suggested by its name, the paralog is expressed in the seminal plasma, where it constitutes some 2% of total protein (D'Alessio et al., 1972). Seminal RNase has evolved to become a dimer with composite active sites. It binds tightly to anionic glycolipids (Opitz, 1995), including seminolipid, a fusogenic sulfated galactolipid found in bovine spermatozoa (Vos, Lopes-Cardozo and Gadella,


1994). Further, seminal RNase has immunosuppressive and cytotoxic activities that pancreatic RNase A lacks (Benner and Allemann, 1989; Soucek et al., 1986).

Laboratory reconstructions of ancient RNases (Jermann et al., 1995) suggested that none of these traits was present in the most recent common ancestor of seminal and pancreatic RNase, but rather that each arose in the seminal lineage after the divergence of these two protein families. To learn more about how this remarkable example of evolutionary recruitment occurred, RNase genes were collected from peccary (Tayassu pecari), Eld's deer (Cervus eldi), domestic sheep (Ovis aries), oryx (Oryx leucoryx), saiga (Saiga tatarica), yellow backed duiker (Cephalophus sylvicultor), lesser kudu (Tragelaphus imberbis) and Cape buffalo (Syncerus caffer caffer). These diverged approximately in that order within the mammal order Artiodactyla (Carroll, 1988). These complemented the known genes for various pancreatic RNases (Carsana et al., 1988), and seminal RNases from ox (Bos taurus) (Preuss et al., 1990), giraffe (Giraffa camelopardalis) (Breukelman et al., 1993) and hog deer.

Seminal RNase genes are distinguished from their pancreatic cousins by several "marker" substitutions introduced early after the gene duplication, including Pro 19, Cys 32, and Lys 62. By this standard, the genes from saiga, sheep, duiker, kudu, and the buffaloes were all assigned to the seminal RNase family. No evidence for a seminal-like gene could be found in peccary. Thus, these data are consistent with an analysis of previously published genes that places the gene duplication separating pancreatic and seminal RNases ca. 35 million years before present (Beintema et al., 1988), and the divergence of giraffe preceding the divergence of sheep, saiga, duiker, kudu, Cape


buffalo and ox, in this order, consistent with mitochondrial sequence data (Allard et al., 1992) and global phylogenetic analyses of Ruminantia (Hassanin and Douzery, 2003).

Sequence analysis shows that the seminal RNase gene from kudu almost certainly could not serve a physiological function as a folded stable protein. A single base deletion disrupts codon 114, creating a frame shift. Further, the seminal RNase gene from duiker was found to encode a substitution of the active site lysine by a proline. Thus, this protein was not likely to have catalytic activity. Lesions are also present in the seminal RNase gene from giraffe.

To show that these seminal genes were indeed not expressed in semen, seminal plasmas from 15 artiodactyls were examined (ox, forest buffalo (Syncerus caffer nanus), Cape buffalo, kudu, sitatunga (Tragelaphus spekei), nyala (Tragelaphus angasi), eland (Tragelaphus oryx), Maxwell's duiker (Cephalophus monticola maxwelli), yellow backed duiker, suni (Neotragus moschatus), sable antelope (Hippotragus niger), impala (Aepyceros melampus), saiga (Saiga tatarica), sheep (Ovis aries), and Eld's deer). Catalytically active RNase was not detected in the seminal plasma in significant amounts in any artiodactyl genus diverging before the Cape buffalo, except in Ovis. Independent mutagenesis experiments showed that the proteins encoded by these genes, all carrying a Cys at position 32, should form dimers (Jermann, 1995; Opitz, 1995; Raillard, 1993; Trautwein, 1991). By Western blotting, however, only small amounts of a monomeric, presumably pancreatic RNase, were detected in these seminal plasmas.

In contrast, the seminal plasmas of forest buffalo, Cape buffalo and ox all contained substantial amounts of Western blot-active RNase (Kleineidam et al., 1999). Only in the


seminal plasma of ox, however, is seminal RNase expressed. Even though the gene is intact in water buffalo, no expressed protein could be found in its seminal plasma.

The seminal plasma from the Ovis genus (sheep and goat) was a notable exception. Sheep seminal plasma contained significant amounts of RNase protein and the corresponding ribonucleolytic activity. To learn whether RNases in the Ovis seminal plasma were derived from a seminal RNase gene, the RNase from goat seminal plasma was isolated, purified, and sequenced by tryptic cleavage and Edman degradation. Both Edman degradation (covering 80% of the sequence) and MALDI mass spectroscopy showed that the sequence of the RNase isolated from goat seminal plasma is identical to the sequence of its pancreatic RNase (Beintema et al., 1988; Jermann, 1995). This shows that the RNase in Ovis seminal plasma is not expressed from a seminal RNase gene, but rather from the Ovis pancreatic gene. To confirm this conclusion, a fragment of the seminal RNase gene from sheep was sequenced, and shown to be different in structure from the pancreatic gene.

But what was this new function that was acquired by bovine seminal RNase? What is the molecular basis of the newly acquired function? To address these questions, the Benner group set out to reconstruct and resurrect the ancestral seminal proteins. The tree in Figure 1-19 shows the nodes where sequences were reconstructed using a likelihood method. These nodes include the evolutionary period where the new biological function might be arising. Three different evolutionary models, one amino acid based and two codon based, were used to make the reconstructions. Two outgroups were also considered, those holding the pancreatic RNases and brain RNases, as the data did not unambiguously force the conclusion that one of these two RNase subfamilies was the


closest outgroup. Uncertainty in the topology of the tree holding the seminal RNases (the relative placement of the okapi sequence) was also considered; two topologies based on molecular and paleontological data were chosen for the analysis (Figure 1-19). In an effort to manage ambiguities, all possible sequences were resurrected whenever the reconstructions disagreed.

The distribution of ancestral replacements on the three-dimensional structure of seminal RNase followed a specific pattern. All of the active site residues remained conserved after the gene duplication. Moreover, the RNA binding site was also conserved. Most of the replacements were concentrated on the surface of the protein and away from the RNA binding site. This replacement pattern is consistent with an evolutionary path where the enzymatic function of the protein was conserved; it is not consistent with an inference that the ancestral seminal RNase genes were pseudogenes. Furthermore, the lesions causing the pseudogene formation in the different lineages are different. These two observations taken together imply that the ancestral seminal RNases were enzymatically active, and that independent inactivation events converted active genes in the different lineages into pseudogenes in many of the modern artiodactyls.

Consistent with this model, the resurrected ancestral seminal RNases were all enzymatically active (hydrolyzing UpA and a fluorescent tetranucleotide). The enzymatic activity remained high and comparable to the contemporary activity levels of RNase A and seminal RNase throughout the evolutionary history of the seminal gene. This indicated that enzymatic activity was not the behavior under adaptive positive selection that led to the new function of the seminal gene.


What then are the properties of seminal RNase that were the targets for natural selection over the past 30 million years? As noted above, there are many in vitro behaviors to choose from. For some (such as cytotoxic activity against cancer cells in culture), it is difficult to rationalize how such behaviors might be important for a protein that exists in seminal plasma. But the site of expression of a protein is changeable over short periods of evolutionary time, meaning that we cannot be certain where seminal RNase has been expressed over its history.

We hypothesized that, since this protein is expressed in the seminal fluid and has immunosuppressive capabilities, it could have evolved to confer a selective reproductive advantage to bulls when the female reproductive tract mounts an immune response against the invading sperm. Indeed, it has been shown in reproductive biology that in many species sperm encounters a defensive immune response and that in many cases seminal plasma is capable of repressing this response (James and Hargreave, 1984; Kelly and Critchley, 1997; Schroder et al., 1990).

To test whether this is true, Sassi et al. exploited the paleogenetics strategy to identify the physiologically relevant in vitro behaviors for a newly emerging function. As noted above, the strategy examines the behavior of proteins resurrected from points in history before and after the presumed episode of adaptive evolution. The in vitro behaviors that are rapidly changing during this episode are inferred to be those relevant to adaptive change. The in vitro behaviors that are the same at the beginning and end of this episode are not relevant to the change in function.

To apply this strategy, Sassi examined the immunosuppressivity of the resurrected proteins in lymphocyte proliferation assays. Consistent with the hypothesis,


immunosuppression increased dramatically in bovine seminal RNase compared with its immediate ancestors.

This study presents an example where the evolutionary history of a gene and the physiological function of the protein were both unknown, but the resurrection of the ancestral protein provided evidence for a hypothesis and hints on the evolutionary events shaping this gene's history.

Lessons learned from ribonuclease resurrections

The ribonuclease family contains the best-developed example of the use of paleomolecular resurrections to understand protein function. It also demonstrates most of the key issues that must be addressed when implementing this paradigm.

This includes the management of ambiguities. In all of the cases reviewed here, additional sequences were obtained from additional organisms to increase the articulation of the evolutionary tree, and thereby reduce the ambiguity in the inferred ancestral sequences. When ambiguities remained, multiple candidate ancestral sequences were resurrected to determine that the behavior subject to biological interpretation was robust with respect to the ambiguity.

These examples also show the value of maximum likelihood tools in reconstructing ancestral sequences. The simplest parsimony tools, which minimize the number of changes in a tree, are easily deceived by swaps around short branches. Ancestral character states are less likely to be confused by incorrect detailed topology of a tree when they are constructed using maximum likelihood tools than when they are constructed using maximum parsimony tools.

More important, however, these examples show the potential of molecular paleoscience as a strategy to sort out the complexities of biological function in complex


genome systems. Here, the potential of this strategy has only begun to be explored. In the long term, we expect that paleomolecular resurrections will allow us to understand changing biomolecular function in an ecological and planetary systems context. Margulis and others have referred to this as "planetary biology" (Margulis and Guerrero, 1995; Margulis and West, 1993).

Last, these examples show the value of paleomolecular resurrections in converting "just so" stories into serious scientific narratives that connect phenomenology inferred by correlation into a comprehensive historical-molecular hypothesis that incorporates experimental data and suggests new experiments. Thus, they offer a key example of how paleobiology might enter the mainstream of molecular biology as the number of genome sequences becomes large, and the frustration with their lack of meaning becomes still more widespread.

Lysozymes: Testing Neutrality and Parallel evolution

At approximately the same time as the first ancestral ribonucleases were being resurrected in the Benner laboratory in Zurich, Allan Wilson and his colleagues at the University of California were considering the history of the evolution of lysozymes from bird eggs (Malcolm et al., 1990). Their work focused on three positions (sites 40, 55, and 91). A crystal structure showed that these lie just beneath the active site cleft inside the folded structure in a hinge between the two globular domains of lysozyme.

Malcolm et al. (Malcolm et al., 1990) noticed that the pattern of divergence at these sites was peculiar when compared with the pattern of divergence at other sites in the lysozyme family. These sites show little variation in other bird lysozymes. In one branch leading to the lysozyme in the last common ancestor of the quail (Callipepla californica), northern bobwhite (Colinus virginianus), and guinea fowl (Numida meleagris) from the


next higher node in the tree, three amino acid replacements (T40S, I55V, and S91T) were inferred. The ancestor therefore had TIS at sites 40, 55, and 91 respectively, while the descendent had SVT at those sites.

No other internal sites suffered replacement along this branch. Further, these sites are conserved in a variety of other birds, including the chachalaca, turkey, pheasant, chicken, partridge, and Old World quail.

The evolutionary tree is insufficiently articulated to infer the order in which these three amino acids were replaced. Thus, Malcolm et al. asked if these three changes occurred in a specific sequence. A total of 6 paths (= 3 × 2 × 1, one for each order in which the three sites can change) lead stepwise from the TIS to the SVT trio. Proteins representing all intermediates in each of these paths of amino acids (with reference to these three sites only) were prepared, with the amino acids at these sites varied within the background of the chicken lysozyme. They then asked whether all proteins had similar thermal stability.

In this work, the thermal melting transition temperatures of the modern lysozymes were used to provide the upper and lower limits of "normal" stability. Transitions at temperatures higher or lower than these bounds were interpreted as being functionally significant.

Some of the intermediates constructed in this work had transition temperatures that were outside of these bounds. This led the authors to suggest that these intermediates were less likely to be neutral variants than others. Further, evidence was obtained suggesting that the temperature of unfolding correlated with the total volume of the side chains of the residues at these three sites.
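The stepwise paths between the TIS and SVT trios can be enumerated directly. The following is a minimal sketch, not code from Malcolm et al.; the function and variable names are illustrative assumptions. It lists every order in which the three replacements (T40S, I55V, S91T) can occur and collects the distinct intermediate trios that would need to be prepared.

```python
from itertools import permutations

# Residues at sites 40, 55 and 91 before and after the branch (from the text).
ANCESTOR = {40: "T", 55: "I", 91: "S"}
DESCENDANT = {40: "S", 55: "V", 91: "T"}
SITES = sorted(ANCESTOR)


def trio(state):
    """Write the residues at sites 40, 55, 91 as a three-letter string."""
    return "".join(state[site] for site in SITES)


def stepwise_paths():
    """Return every path that replaces one site at a time (hypothetical helper)."""
    paths = []
    for order in permutations(SITES):  # each ordering of the three replacements
        state = dict(ANCESTOR)
        path = [trio(state)]
        for site in order:
            state[site] = DESCENDANT[site]
            path.append(trio(state))
        paths.append(path)
    return paths


paths = stepwise_paths()
# Intermediates are the trios strictly between the two end points.
intermediates = {variant for path in paths for variant in path[1:-1]}

for path in paths:
    print(" -> ".join(path))
print(len(paths), "paths,", len(intermediates), "distinct intermediates")
```

Because different paths share intermediates, the six paths pass through only six distinct intermediate trios (three single and three double replacements), which together with the two end points gives eight variants in total.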


This experiment had only limited interpretive goals. Egg lysozyme is frequently viewed as functioning to protect the egg from microbial attack. It remains an open question why the North American birds would need to have a different set of residues in the hinge region of their lysozyme to perform a function distinctive with respect to North American microbial agents. The distinctiveness of the changes observed in the branch, and the unlikelihood that such distinctive changes represent neutral drift, suggest that this question might have an interesting answer.

Transposable Elements and Their Ancestors

A substantial fraction of the genomes of many organisms, including mammals, consists of interspersed repetitive DNA sequences. These are primarily degenerate copies of transposable elements, units of DNA that can migrate to different parts of the genome. Transposable elements include retrotransposons, which are derived from RNA molecules that are transposed to DNA via cDNA intermediates, and transposons, which move directly from DNA to DNA. These can be "short interspersed nuclear elements" (SINES, ca. 300 nucleotides long), "long interspersed nuclear elements" (LINES, ca. 6000-8000 nucleotides long), both of which contain internal promoters for RNA polymerase III, transposable elements with long terminal repeats (which can contain a reverse transcriptase or its remnants), or DNA transposons that have an ORF (or its remnants) that encodes a transposase.

Because transposable elements do not serve any known function except perhaps as an evolutionary tool, it is expected that their functional elements will sustain mutations that render them inactive over extended periods of evolutionary time. By going backwards in time starting from inactive transposable elements in contemporary organisms, it should be possible to resurrect active ancestral transposable elements that


delivered those transposable elements to the modern genomes. This has now been done in several laboratories.

Long interspersed repetitive elements of type 1

Long interspersed repetitive elements of type 1 (LINE-1 or L1) are examples of retrotransposons that encode reverse transcriptase, lack long terminal repeats, and appear to transpose via a polyadenylated RNA intermediate. The F-type subfamily of LINE-1 retroposons appears to have begun to be dispersed throughout the mouse genome about 6 million years ago. The resulting paralogs are only ca. 75% sequence identical when compared pairwise. In this way, they differ from the A-subfamily of LINE-1 retrotransposons, which are over 95% identical in pairwise sequence comparisons.

The A and F subfamilies of LINEs also differ in terms of their activity. Members of the A subfamily appear to be fully active as transcriptional promoters. Members of the F subfamily, in contrast, appear to be both transcriptionally and transpositionally inactive.

Adey et al. (Adey et al., 1994) hypothesized that these F-type L1s are "evolutionarily extinct" descendents of an ancestral LINE-1 that was functionally active. The inactivation was caused by accumulation of mutations in the transcription initiation region. To test this idea, they analyzed an alignment of thirty F sequences to generate a consensus sequence of the promoter that approximated the sequence of the ancestral LINE promoter. They then resurrected that sequence, and demonstrated that it was indeed functional in that the resurrected promoter was able to drive transcription in promoter assays.

The reconstruction did not follow a single evolutionary model based on a coherently defined tree, and therefore does not represent an example of precise


resurrection. Rather, a consensus of the F-type sequences was obtained using the program PRETTY from the University of Wisconsin's Genetics Computing Group. This was compared with a consensus of a subset of the sequences of elements that were believed to have diverged more recently. Next, positions that displayed CpG hypermutability within the younger subset of F-type sequences were converted back to CpG in the presumed ancestral sequence. At sites where no consensus was observed for all 30 sequences, the consensus nucleotide within the younger subset was placed at that site in the ancestral sequence.

The authors recognized that this combination of analytical tools did not infer an ancestral sequence in a formally coherent way, but rather approximated the ancestral sequence. The sequence that resulted from this analysis differed at 11 positions from a previously reported consensus sequence that was based on a smaller set of data.

The resulting consensus sequence was then resurrected by chemical synthesis. The consensus sequence was placed in front of a chloramphenicol acetyltransferase reporter gene, as were eight F sequences from modern mammals, which served as controls. Each construct was transfected into undifferentiated mouse F9 teratocarcinoma cells. There, the ability of the ancient and modern promoters to direct the expression of the reporter protein was determined.

The resurrected ancestral promoter generated a high level of reporter expression, a level approximately equal to the level of expression generated by the most active A-type promoter known. In contrast, seven of the eight modern F-type promoters generated no detectable expression of the reporter. Thus, these results supported the hypothesis of Adey et al. (Adey et al., 1994). The currently inactive F-type transposable elements do


appear to be descendents of a promoter that was active approximately 6 million years ago.

Sleeping Beauty transposon

An analogous experiment was done by Ivics et al. (Ivics et al., 1997), who used a "majority rule" consensus strategy to approximate another ancestral transposon sequence. As with the F subfamily of LINEs discussed above, members of the Tc1/mariner superfamily of transposons in fish appear to be transpositionally inactive due to the accumulation of mutations following divergence of an active transposon. Ivics et al. analyzed a dozen sequences to infer a consensus sequence of an ancestral transposon, which they termed Sleeping Beauty. This sequence was then resurrected and studied.

The consensus ancestral transposase was shown to bind to the inverted repeats of salmonid transposons in a substrate-specific manner, and mediate precise cut-and-paste transposition in fish as well as in mouse and human cells. This result suggested that the modern, inactive, transposable elements are descendents of a more ancient transposable element that dispersed itself in fish genomes some time ago, in part by horizontal transmission between species.

Frog Prince

Another member of the Tc1/mariner superfamily from the Northern Leopard Frog (Rana pipiens) was also resurrected in much the same manner as Sleeping Beauty (SB). In fact, SB does not show host dependent restrictions but it does show some transposition efficiency variability depending on the cell line derived from different species (Neidhardt, Ingraham and Schaechter, 1990). Consequently, Ivics and coworkers (Miskey


et al., 2003) set out to resurrect another vertebrate transposon in an effort to have another genomics tool with different characteristics.

The R. pipiens genome was estimated to contain 8000 copies of a transposable element closely related to Txr elements in Xenopus laevis. In order to clone a few uninterrupted transposase open reading frames (ORFs) from this collection, the authors designed a method to trap them. This method selected for the uninterrupted transposase ORFs. They used the cloned sequences to generate a consensus of the transposase gene and, along with the inverted repeats from Rana pipiens, they obtained all the necessary components for the transposon system Frog Prince. Frog Prince (FP) transposons were shown to be phylogenetically closer to the Txr elements than to the Sleeping Beauty/Tdr1 transposons. In order to test its transposition activity and compare it to that of Sleeping Beauty (SB), the same assay developed for SB was used on HeLa cells. Indeed, FP was active in these cells and it was shown to cross-mobilize with a Xenopus laevis transposon but not with Sleeping Beauty. This indicates that the transposon families in Xenopus laevis and Rana pipiens have diverged recently, in contrast with SB transposons where the divergence is more distant. Of course, phylogenetically it is not surprising since FP originated from amphibians and SB from fishes.

FP was tested in different cell lines and compared to SB activity. The transposon systems overlapped and differed in their levels of activity in the tested cell lines, achieving the authors' goal of obtaining a new active transposon system that had some different characteristics from SB, increasing the number and application of genomics tools. In fact, FP was more active than SB in Zebrafish, demonstrating the advantage of using a phylogenetically distant transposon if higher activity was the goal. This is


possibly due to the presence of SB-like transposons (Tc1 transposons) and the corresponding inhibitory mechanisms in Zebrafish. These mechanisms would be ineffective against the phylogenetically distant FP.

In all cases, the study is worth repeating using ancestral sequences inferred in light of specific evolutionary trees. This would provide an example of whether consensus reconstructions differ in outcome when compared to rigorous reconstructions. Of course, using just the salmonid sequences used for the Sleeping Beauty resurrection or just the paralogous sequences used for Frog Prince would not be enough to get a reconstruction where the phylogenetic inference produces a different sequence than the consensus approach. In fact, the SB salmonid sequences and certainly the FP sequences do not have enough phylogenetic variation for the inference to necessarily be any different from the consensus. On the other hand, if one were to include more sequences from more divergent organisms, the phylogenetic inference would most certainly be more useful than a consensus approach.

Biomedical applications of transposons

The active ancestral transposon Sleeping Beauty became an intense focus of research as other research groups succeeded in improving its activity by site directed mutagenesis (Yant et al., 2004). The SB system became a tool in gene therapy, generating a large number of publications and a biotechnology company, Discovery Genomics, Inc., with a goal to develop a gene therapy delivery system based on this resurrected transposon (Hermanson et al., 2004; Ivics et al., 2004). This is yet another example of a novel application of paleobioscience where resurrected biomolecules are being used as biomedical tools to combat human disease.
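The "majority rule" consensus strategy used to approximate the Sleeping Beauty and Frog Prince ancestors can be illustrated with a short sketch. This is not the procedure or software used by Ivics et al. or Miskey et al.; the function name, the threshold parameter, and the toy sequences are all assumptions made for illustration. Each alignment column is resolved to its most common residue, and columns without a clear majority are flagged rather than guessed.

```python
from collections import Counter


def majority_consensus(aligned, min_fraction=0.5, unresolved="N"):
    """Majority-rule consensus of equal-length aligned sequences.

    Hypothetical helper: a column is resolved only when its most common
    residue exceeds `min_fraction` of the non-gap characters.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != "-")
        if not counts:  # column made only of gaps
            consensus.append("-")
            continue
        base, n = counts.most_common(1)[0]
        total = sum(counts.values())
        consensus.append(base if n / total > min_fraction else unresolved)
    return "".join(consensus)


# Toy data (invented): four degenerate copies, each carrying one private change.
copies = ["CTGCA", "ATGTA", "ACGCA", "ATGCG"]
ancestor_estimate = majority_consensus(copies)
print(ancestor_estimate)
```

In this toy case the consensus matches none of the input copies, because each copy carries its own private change. The consensus treats every copy as an independent witness; a tree-based reconstruction would differ only when the copies are unevenly related, which is the situation the text describes for more divergent sequence sets.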


Alternatively, transposons could be used as a tool to identify genes involved in cancer by monitoring their ability to disrupt open-reading frames or regulatory regions within a genome. SB has recently been used as such a tool to identify cancer related genes (Collier et al., 2005; Dupuy et al., 2005; Weiser and Justice, 2005). Identifying cancer genes today often involves random mutagenesis through the use of radiation, chemical agents or viruses. These approaches raise serious concern, as it is difficult to separate the cancer-causing mutations from the benign mutations. On the other hand, Sleeping Beauty presents a way to tag cancer genes because it is a transposon with a known sequence and one that is very divergent from its mouse homolog. As an example, Collier et al. (Collier et al., 2005) designed a way to identify oncogenes using SB (altered regulation by SB when integrated upstream of the gene) or tumor suppressor genes (disrupted by SB when integrated in the middle of the sequence). This resurrected transposon will enable researchers not only to identify new cancer genes but also to dissect the unknown pathways for cancer formation in different types of tumors, tissues, and developmental stages. Oncogenesis is indeed a very complicated process with many possible pathways involving a large number of genes. Paleobioscience presents a powerful tool to shed light on the darkness of this disease.

Chymase-Angiotensin Converting Enzyme: Understanding Protease Specificity

As with the ribonucleases, proteases in modern organisms present a baffling diversity of paralogs that have arisen from gene duplication throughout the history of vertebrates. Many proteases have been studied to determine their substrate specificities, where a parallel diversity in their behavior is also observed.

The serine proteases offer one example of this. As with ribonucleases, the classical serine proteases, such as trypsin and chymotrypsin, are digestive enzymes isolated from


the digestive tract of oxen. These proteases are paralogs of non-digestive proteases that are found in many tissues. For example, chymases form a clade of serine proteases homologous to trypsin and chymotrypsin. These are secreted from mast cells. Once secreted, the chymases help process peptide hormones, are involved in the inflammatory response, and may aid in the expulsion of parasites (Knight et al., 2000; Miller, 1996). Their true physiological function remains unclear, but pathologically these enzymes have been implicated in vascular disease and might be attractive drug targets (Doggrell and Wanstall, 2004).

Proteases carrying the name "chymase" from different species differ in the details of their substrate specificity. For example, human chymase cleaves angiotensin I between the Phe8-His9 residues to give angiotensin II, but does not then further cleave the Tyr4-Ile5 bond in angiotensin II. In contrast, the chymase from rat does degrade the Tyr4-Ile5 bond in angiotensin II, leading to inactivation of the peptide hormone. This has many implications, not the least of which being that the pharmacology of hypertension management targeted at the angiotensin system is expected to differ in rats and humans.

The fact that these homologs in humans and rat display different substrate specificities raises the question: What was the specificity of the chymase in the last common ancestor of humans and rats? To answer this question, Chandrasekharan et al. (Chandrasekharan et al., 1996) applied parsimony analysis (using PAUP) to two dozen homologous serine proteases to reconstruct the last common ancestral chymase of modern chymases in humans and rats. In doing so, they used the kallikrein proteases as an outgroup (Figure 1-20).
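Parsimony reconstruction on a fixed tree can leave some ancestral sites unresolved, and an outgroup can help resolve them. The sketch below illustrates this with the bottom-up pass of Fitch's small-parsimony algorithm. It is a toy illustration only: it is not PAUP and not the analysis of Chandrasekharan et al., and the tree shapes, taxon labels, and three-residue sequences are invented for the example.

```python
def fitch(tree, leaves, site):
    """Bottom-up Fitch pass for one alignment column.

    `tree` is a nested tuple of leaf names (hypothetical encoding).
    Returns (candidate residue set at this node, minimum number of changes).
    """
    if isinstance(tree, str):  # a leaf: its observed residue, no changes
        return {leaves[tree][site]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaves, site)
    right_set, right_cost = fitch(right, leaves, site)
    shared = left_set & right_set
    if shared:  # children agree on at least one residue
        return shared, left_cost + right_cost
    # children disagree: keep both options and count one replacement
    return left_set | right_set, left_cost + right_cost + 1


def root_sets(tree, leaves):
    """Candidate residues at the root for every site."""
    n_sites = len(next(iter(leaves.values())))
    return [fitch(tree, leaves, i)[0] for i in range(n_sites)]


# Invented toy sequences; real chymases are more than 200 residues long.
leaves = {
    "human": "FKA", "baboon": "FKA",
    "rat": "LKS", "mouse": "LRS",
    "outgroup": "FKS",
}
ingroup = (("human", "baboon"), ("rat", "mouse"))

without_outgroup = root_sets(ingroup, leaves)
with_outgroup = root_sets(("outgroup", ingroup), leaves)

print("ingroup only :", without_outgroup)
print("with outgroup:", with_outgroup)
```

Without the outgroup, two of the three toy sites have two equally parsimonious residues at the root; adding the outgroup leaves a single candidate at every site. When ambiguity persists even with an outgroup, the choice among candidates has to be made on other grounds, or every candidate has to be tested.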


The tree that emerged was supported at a 95 percent level by bootstrap analysis. There was, however, considerable ambiguity in the ancestral sequence inferred using parsimony. The analysis did not identify a specific amino acid to occupy 15 sites in the ancestor. To generate the sequence to be resurrected, amino acids at seven of these sites were assigned arbitrarily. Amino acids at the remaining eight sites were chosen to ensure that the net positive charge of the ancestral chymase was +18. The candidate for the ancestral chymase differs from the modern chymases at 52 to 100 sites, depending on the modern descendent, corresponding to a difference of 23 to 34%.

A gene encoding the ancestral protein was then resurrected and expressed, and the ancestral chymase studied in the laboratory. The ancestor was shown to efficiently convert angiotensin I to angiotensin II, with a turnover number of about 700 per second. This kinetic performance was consistent with the hypothesis that the parsimony analysis generated a functionally active ancestor.

Relevant to the hypothesis, the ancestral chymase did not degrade angiotensin II further by cleavage of the Tyr4-Ile5 bond. In this respect, the behavior of the ancestor resembled the behavior of the more specific modern human chymase, not the less specific rat chymase. This provided a case where protease specificity decreased over time. Chandrasekharan et al. (Chandrasekharan et al., 1996) noted that this contrasted with a common view of the evolution of protease specificity, where the protease begins with broad specificity that narrows over time as the role of the protease becomes narrower.

It is not clear exactly how to date the divergence of these enzymes. At the time that the experiments were done, only a single chymase gene was known to be present in humans and baboons, and only a single chymase was known in dogs. In contrast, at least


five alpha and beta chymase isozymes have been identified in mice and rats as of 1995. Correlation with the species tree suggests that the alpha and beta forms of chymases emerged long before mammals branched from the therapsids. This implies that the nonrodent mammals lost some of the extra paralogs.

The narrative for the chymase family provides another interesting example where the biological information generated by an experiment in paleobiochemistry could not have been obtained in any other way. The narrative would, of course, be improved by resurrecting a sample of the alternative ancestral protease sequences to demonstrate that the inference is robust with respect to the reconstruction ambiguity. This would make it unnecessary to assume that the sites whose residues were assigned randomly have no impact on the interpreted phenotype.

Indeed, the abundance of genome sequences available today would make this system, as well as many protease systems, worth revisiting. It would be interesting to follow through paleogenetics the co-evolution of these proteases as well as the angiotensin protein. The number of paralogs of different proteases in mammalian genomes, the diversity of protease types, and the complexity of the biological functions that they perform all suggest that molecular paleoscience should be a key part of any program to understand their biology.

Resurrection of Regulatory Systems: The Pax System

The difference in the morphology of different metazoans arises, in part, not from changes in the sequences of encoded proteins, but rather from changes in the regulation of expression of those proteins. Regulation of expression, in turn, often involves the specific binding of transcription factors to specific target DNA regulatory elements. Many of these factors are homologous. Thus, the divergence of specificity of the DNA


binding protein and the DNA element-binding partner is a historical process that can be examined using paleomolecular resurrections.

To explore one such history, Sun et al. (Sun et al., 2002) examined the Pax system. Pax genes encode a well-conserved DNA-binding domain of 128 amino acids. In mammals, nine Pax genes (Pax-1 to Pax-9) were known in 2002. These all play roles in the embryonic development of tissues and organs. Thus, Pax genes are implicated in human congenital defects (Pax-2, 3, 6, 8, 9), in the development of cancers (Pax-3, 5, 7), and in the development of the central nervous system (Pax-2, 3, 5, 6, 7, 8), the eye (Pax-2, 6), the pancreas (Pax-4, 6), and B-lymphocytes (Pax-5) (Dahl, Koseki and Balling, 1997; Engelkamp and van Heyningen, 1996; Underhill, 2000).

Sequence analysis classifies Pax genes into five groups within two supergroups: Pax-2, Pax-5, Pax-8, Pax-B, poxn/Pax-A, and Pax-6/ey in supergroup I, and Pax-1/Pax-9/poxm and Pax-3/Pax-7/gsb/gsbn in supergroup II (Sun et al., 1997). Pax genes within each group often display similarities in their expression patterns; this may imply analogous roles in development (Chalepakis et al., 1993).

The process by which Pax genes and their binding sites duplicated and functionally diversified presents an example of the "transitional form conundrum" in divergent evolution. Briefly, this conundrum arises because incompletely specialized biomolecules that are intermediates in this evolution are expected to display cross-reactivity that might confuse their functional distinctiveness. Some of the amino acid changes between the duplicates are important to making these distinctions; others, however, reflect instead neutral evolution.


As was shown with ribonucleases, deliberate resurrection of ancestral forms can help distinguish amino acid replacements that are key to functional diversification from those that are neutral. More classical methods, including making hybrid structures that are widely used in molecular biology, do this less efficiently and might miss important functional residues. Two residues that come together in the folded structure of the protein, but are remotely placed in the primary structure, would be difficult to detect by classical molecular biology deletion analysis.

In the Pax family, for example, hybrid constructs suggested that only three of the 30 amino acid differences between Pax-5 and Pax-6 paired domains are important for their observed differences in DNA-binding specificity (Czerny and Busslinger, 1995). These studies do not show, however, how this diversification actually occurred.

To address this issue in the Pax family, Sun et al. (Sun et al., 2002) reconstructed a set of ancestral Pax sequences using the distance-based method developed by Zhang and Nei (Zhang and Nei, 1997). The ancestral proteins were then resurrected by expression of the gene in an in vitro reticulocyte translation system. The expressed proteins were then assayed based on their ability to bind to various DNA sequences (as detected by gel mobility shifts), as well as in a biological assay in Drosophila. For the DNA binding assay, seven sequences were selected that had been identified as Pax-5 binding sites and are representative of the Pax binding domain generally (Czerny, Schaffner and Busslinger, 1993).

Two of the resurrected ancestral sequences stand at the head of the two supergroups (Figure 1-22). Ancestor I (ANI) bound strongly to all test sequences except H2A2.2 (one of the seven binding sequences), which was bound less strongly. Pax-2 and Pax-A, descendents of ANI, showed the same broad specificity as the ancestor, although overall


they bound less strongly than ANI. In contrast, AN6, another descendent of ANI, showed the same narrower binding specificity as its descendent, Pax-6. These experiments suggest that the fundamental binding features of this supergroup were already established in the ancestor ANI.

ANII showed almost no binding with the test sequences. The mouse Pax-1 paired domain (MPD1) showed modest binding with all of the test sequences (except for H2A2.2), in contrast to the mouse Pax-3 paired domain (MPD3), which showed little or no binding to the tested sequences.

Therefore, within the two supergroups, functional diversification during evolution appears to have involved changes from broad to narrow specificities in binding within the test sequences (ANI to AN6 and human Pax-6), and changes in binding affinities either across multiple binding sequences (e.g., ANI to mouse Pax-2 and sea nettle Pax-A, and ANII to mouse Pax-1) or across a few sequences (ANII to mouse Pax-3). Further, changes in binding specificity can occur independent of the changes in binding affinity.

The next phase of the analysis is to sequence a wider diversity of Pax genes to better articulate the tree and to segment the branch leading to Pax-4, so that it is not so long and so that its information can contribute to the resurrection.

Sun et al. (2002) then applied their analysis to identify two amino acid changes that might dominate the differences in the DNA-binding properties of ANI and ANII. As there are 7 amino acid differences between ANI and ANII, and 19 between AN6 and ANI, the authors used the difference in the evolutionary rates of these sites as a guide to select a more manageable subset of sites to test. They calculated the relative rates of amino acid substitution at the differing sites and selected the ones that show a relatively low rate, as this


suggests importance for the binding. Paleobioscience in this case combines ancestral reconstruction and site-specific evolutionary rates to select a functionally important set of sites. Testing the resurrected proteins, together with variants carrying specific replacements at the selected sites, is an efficient way to shed light on the functionally important residues. Of course, one has to be careful when comparing site-specific rates and deciding which are important, as choosing the cutoff between “fast” and “slow” sites can be challenging. However, the synergy between the reconstruction and the relative rates could make this problem manageable. The authors selected three differing sites between ANI and ANII to investigate (sites 20, 22, and 121) and three others to investigate the difference between AN6 and ANI (sites 44, 47, and 66). No single mutation changed the binding properties of ANI significantly, as shown in the binding patterns of ANI-NVN (D20N), ANI-DVS (N121S), and ANI-DIN (V22I). Changes at positions 20 (N20D, ANII-DIS) and 121 (S121N, ANII-NIN) in ANII greatly increased the binding strength of ANII to the test sequences, however. The binding properties of ANI and ANII with combined mutations at positions 20, 22, and 121 were then examined. The hybrid ANI-NVS (D20N and N121S) had substantially decreased binding to test sequences relative to ANI, whereas ANII-DIN (N20D and S121N), unlike ANII, bound efficiently to several sequences. When the binding strengths of ANI-NVS and ANII-DIN to test sequence 5S2A were compared using serially diluted concentrations of 5S2A, the binding of ANI-NVS was visibly weaker than that of ANII-DIN at every concentration tested. When the ratios of the intensity of the shifted band to that of the free band of 5S2A were calculated for ANI-NVS and ANII-DIN, the ratio for ANII-DIN ranged from 2- to 11-fold higher than that for ANI-NVS (five independent replicate


assays). This range is rather larger than that expected from experimental error, suggesting that different preparations of the proteins differ in their specific binding activity. Nevertheless, ANII-DIN clearly had a higher affinity for the test sequences than ANI-NVS. The DNA-binding properties of ANI and ANII are strongly influenced by the amino acids occupying sites 20 and 121, although replacements at all seven sites that separate the ANI and ANII sequences have some effect. Next, they repeated the above approach to evaluate the importance of the three selected sites (sites 44, 47, and 66) between AN6 and ANI. They tested R44Q, H47N, and G66R in ANI and the reciprocal changes in AN6. This series of experiments showed that the amino acid change at site 47 alone is sufficient to cause a nearly complete specificity swap between ANI and AN6, although the other two sites (44 and 66) have minor effects. Next, Sun et al. investigated the importance of site 47 in an in vivo Drosophila assay. Their rationale for this assay was that a difference in affinity caused by a mutation at site 47 would be reflected in the ectopic eye phenotype in Drosophila. This is based on previous studies showing that ectopic expression of Pax-6 homologs and of eyeless (the Drosophila homolog of vertebrate Pax-6) results in the formation of supernumerary eyes in the fly (Brand and Perrimon, 1993; Halder, Callaerts and Gehring, 1995). In addition, another study (Czerny et al., 1999) showed that an ectopic eye phenotype is also observed when the second Pax-6 homolog, twin of eyeless (toy), is ectopically expressed. The two Pax-6 paralogs in Drosophila, eyeless and toy, were also shown to produce proteins with different DNA-binding properties. In the in vivo assay, the size and the frequency of the ectopic eyes were evaluated, as well as the pigment concentration in thoraces with ectopic eyes. These measurements were


used to compare the overexpression of the wild-type Eyeless transgene (EU transgene), Eyeless-N47H (DP6M3 transgene, with a mutation at site 47), Eyeless-Pax2 (M2 transgene, in which the paired domain is replaced with that of mouse Pax2), and Eyeless-Pax2-H47N (M2M3 transgene, with a mutation at site 47 in the introduced mouse Pax2 domain). The in vivo experiments showed that a change toward the Pax-6-specific N at position 47 of the paired domain leads to larger ectopic eyes, more ectopic eyes, or both. Smaller or fewer ectopic eyes were observed when the change was toward the H47 present in Pax-2, 5, 8. An interesting result is that complete replacement of the Eyeless paired domain with the Pax2 paired domain did not abolish ectopic eye induction entirely but led instead to the induction of fewer and smaller ectopic eyes. This result strongly suggests that specificity can be conveyed by a single amino acid change but that the interaction of paired domains with their in vivo binding sites may be more flexible than expected. Sun et al. analyzed these changes in light of the crystal structure of the protein to further make sense of their paleoscience results regarding functional residues (Xu et al., 1999; Xu et al., 1995). As with all experimental resurrections, certain simplifying assumptions were made. For example, some Pax proteins contain a homeodomain, which may interact with the paired domain during DNA binding (Fortin, Underhill and Gros, 1998; Underhill, Vogan and Gros, 1995). By not considering this interaction in detail, this experiment represents a simplified approach to recapitulating the evolution of the DNA-binding properties of regulatory proteins. That this approximation is serviceable is shown by the insights into the evolution of Pax domains that these studies produced. The ancestors of supergroups I and II have very different binding properties toward the panel of test sequences used in this study, and two


amino acid substitutions have dominant effects on swapping their binding properties. A single amino acid change was enough to do the same for AN6 and ANI. Because there is no reliable root to the phylogenetic tree of Pax genes, we cannot predict the sequence of the common ancestor of all Pax genes, or its binding properties, to complete the picture of early Pax evolution. However, it is intriguing to speculate that gene duplication of a common ancestor gave rise to the two ancestors of supergroups I and II, and that these two ancestral genes mutated at positions 20 and 121 and acquired different DNA-binding properties to initiate the differentiation of the two supergroups. Within supergroup I, a similar result was observed at site 47. Although the paired domains of the Pax-2, 5, 8 group and the Pax-6 group differ by 19 amino acids, their distinct DNA-binding properties are determined almost completely by a single amino acid change. Thus, a small number of amino acid changes can account in large part for the divergence in binding properties among the known paired domains. As with ribonuclease, Sun et al. (Sun et al., 2002) proposed that this evolutionary approach is an efficient strategy to select candidate sites responsible for the functional divergence between genes. In this example, they used this approach to identify candidate amino acid changes responsible for the differences in binding properties between different groups of paired domains. The candidate changes were then tested, individually or in combination, by in vitro binding and in vivo functional assays.

Visual Pigments

Vertebrates see light using visual pigments that, upon absorption of light, trigger a biochemical cascade using standard G protein pathways. The pigment itself consists of a protein (an opsin) that binds to a chromophore, generally 11-cis retinal, via a protonated imine linkage. A simple model for the protein-chromophore pair is the protonated imine


formed via reaction of a simple amine with retinal. This imine absorbs light maximally at 440 nm (blue-violet) in organic solvents. This absorption maximum is sensitive to the environment, however. The opsin protein that binds the retinal provides the environment, which it can change by placing different amino acids around the chromophore. Thus, within one homologous family of visual pigments, the absorption maximum can be tuned within the range of 360 to 600 nm, that is, from the ultraviolet to the red. For example, the four opsins in humans absorb at 414 (violet), 497 (blue-green), 530 (green-yellow), and 560 (yellow) nanometers. Vertebrate opsins are classified into five subfamilies, known as RH1 (rod opsin, also known as rhodopsin), RH2 (RH1-like, or green, cone opsin), SWS1 (short wavelength–sensitive type 1, or UV-blue, cone opsin), SWS2 (short wavelength–sensitive type 2, or blue, cone opsin), and M/LWS (middle to long wavelength–sensitive, or red-green, cone opsin). Some species (such as goldfish and chicken) have representatives of each. Humans, with the four mentioned above, lost the SWS2 and RH2 genes but gained an additional member of the M/LWS family. As with the digestive enzymes, visual pigments stand between an organism and its environment. Thus, the wavelength of absorbance is presumably driven by adaptive pressures. The coelacanth, for example, a fish that lives in the deep sea where ultraviolet light is absent, does not have a visual pigment that responds to UV. For scotopic vision (seeing in dim light), pigments need to have a very high quantum efficiency, generate an extremely low level of noise in darkness, and absorb with an absorption maximum at about 500 nm (Menon, Han and Sakmar, 2001).


The demands on vision have changed frequently through the history of vertebrates. It is therefore expected that an evolutionary narrative will help expand our understanding of the evolving function of vision. Not surprisingly in this light, three fascinating paleogenetics studies have been done in this area.

Rhodopsins from archosaurs, an ancestor of modern alligators and birds

Chang et al. (Chang et al., 2002) began work with visual pigments by seeking to understand how vision might have evolved in the ancestors of modern birds. Their focus was the visual pigment of the primitive archosaur, an ancestor that gave rise to the modern alligator, as well as to birds such as the pigeon, chicken, and zebra finch. The sequences of rhodopsins from all of these species were known when their work began, and this clade was rooted by the sequence from the green anole (Figure 1-23). A multiple sequence alignment for these rhodopsin genes, together with genes from two dozen other vertebrates, was built using Clustal W. The alignment was then adjusted by hand to ensure that amino acids believed to be structurally important were aligned, and to remove gaps within codons. The regions at the ends of the rhodopsin genes (encoding the first 21 amino acids and the 25 amino acids following a palmitoylation site) were difficult to align and could not generate an ancestral sequence with a manageable level of ambiguity. These segments were therefore excluded from the alignment, with the exclusion justified by the observation that these segments are believed not to be important in determining the features of photon absorption or activation of transducin. Bovine rhodopsin provided the sequence for these regions of the putative archosaur protein. A phylogenetic tree was chosen to reflect the accepted relationships among the taxa providing the rhodopsin genes, as supported by cladistic and paleontological data. Amino


acids in the ancestral archosaur sequence were then computed using the maximum likelihood methods implemented in PAML (Yang, 1997). This implementation calculated marginal probabilities at each site using an empirical Bayesian approach. Pairwise likelihood ratio tests were also used to select the models with the best fit to the data (Navidi, Churchill and von Haeseler, 1991). A gene for the ancestral archosaur rhodopsin was prepared by chemical synthesis and expressed in COS cells. The protein was regenerated with 11-cis retinal, purified, and assayed for function in the laboratory. Especially interesting in this work was its exploration of different models for reconstructing ancestral rhodopsins. Phylogenetic reconstructions using model-based methods may be sensitive to the assumptions underlying the particular model used in the analysis (Cao et al., 1994; Huelsenbeck, 1997) and may also be problematic when reconstructing ancestral states (Chang and Donoghue, 2000). Chang et al. (Chang et al., 2002) used several different models for nucleotide-based, amino acid–based, and codon-based reconstruction strategies, and compared the results using likelihood ratio tests (see Chang et al., 2002, for the details of the comparison). In particular, for all models tested, eliminating the gamma distribution, which accounts for different rates of change at different sites, resulted in a significantly worse fit to the data. For the three best-fitting models from the three strategies, reconstructions of the ancestral archosaur rhodopsin were in agreement at all but three sites (213, 217, and 218). At these sites, two of the three models agreed with each other, permitting a "majority rule" principle to be applied in inferring the ancestral residue at these sites. Given that these sites lie in a helix facing the lipid bilayer, the behavior of the pigment was not


expected to be influenced by this ambiguity. To demonstrate this, Chang et al. synthesized alternative candidate ancestral rhodopsin variants that contained the various amino acids at the three sites. The ancestral rhodopsin had an absorption maximum in the visible at 508 nm. This maximum is at a longer wavelength than that of the rhodopsins from most mammals and fish; it lies within the higher end of the range of absorption maxima in reptiles and birds. Chang et al. then demonstrated that the ancestral rhodopsin was able to activate the G-protein transducin. Rhodopsin, which is found in rods, is essential for vision at low light intensities. The paleogenetic evidence for an active rhodopsin in the archosaur might therefore also be interpreted as evidence that the archosaur could see in dim light, and therefore may have been nocturnal.

The history of short wavelength sensitive type 1 visual pigments

The detection of light in the violet and near ultraviolet is key to the survival strategy of many vertebrates, which may detect light in these regions while foraging, selecting a mate, and communicating. The extent to which UV vision is distributed across the vertebrate tree (even though it is lacking in humans) suggests that perhaps the ancestral vertebrate also had UV vision. Inferences from such an analysis are notoriously insecure, of course, and even more so given the ability of opsin evolution to generate different absorbance spectra. To understand better the evolution of these pigments, Shi and Yokoyama (Shi and Yokoyama, 2003) inferred the sequences of various ancestral SWS1 pigments throughout the tree interconnecting a variety of mammals (including rodents, artiodactyls, and primates), birds, reptiles, and fish (Figure 1-25). This tree


topology was based on the amino acid sequences of SWS1 pigments as well as some DNA–DNA hybridization data, and represented sequences as old as 400 million years. The sequences of ancestral SWS1 pigments were then inferred at seven nodes in the tree using likelihood-based Bayesian methods implemented in PAML (Yang, 1997; Yang, Kumar and Nei, 1995) and resurrected. Various hybrid pigments were also designed and constructed by swapping different SWS1 cDNAs. Ambiguity in the reconstructions was managed through the resurrection of a variety of hybrid pigments. In addition, for the most ancient pigment, if the probability of a residue at a site was less than 0.9, it was replaced individually by the "second best" residue to emerge from the inference. These studies showed that nearly all of the ambiguity had no impact on the absorption maximum of the candidate ancestral sequences. This type of analysis assumes, of course, that the impacts of amino acid replacements at individual sites are independent of each other. While such an assumption is always open to dispute (see above), it can be evaluated by sequencing more opsin genes to better articulate the tree. This will undoubtedly be done in the future. Shi and Yokoyama (Shi and Yokoyama, 2003) also considered whether alternative tree topologies might compromise biological inferences based on the absorption maxima of candidate ancestral proteins. The three avian sequence groups are represented in Figure 1-25 as branching as a "starburst" from ancestor f. Alternative tree topologies at this node influenced the oldest opsin sequence (represented by a in Figure 1-25) at three of the 20 variant sites, exchanging the order of likelihood of the two most likely residues. For ancestor f (which has the high absorbance maximum), the probability of the preferred


amino acid increased. This included an increase in the probability of Ala at site 118, a site where ambiguity did have an impact on the absorbance maximum. For the remainder of the sites, the changes in the reconstructions created by changing the tree topology did not have an impact on the phenotype of the ancestors that generates the biological interpretation. Two biological interpretations were drawn. First, most vertebrate ancestors could see in the ultraviolet. As the numbers on the trees show, most resurrected ancestral pigments were sensitive to ultraviolet light (absorbance maxima at 360 ± 1 nm). In the emergence of the last common ancestor of the avian sequences, however, the absorbance maximum evolved from 360 nm to 393 nm. Further, in a variety of modern vertebrates, evolution to longer wavelength occurred. The most curious aspect of the resurrected history of SWS1 is that after evolving from a UV-sensitive to a violet-sensitive pigment in the last common ancestor of the birds, it regained UV sensitivity (homoplasy) in some of the lineages leading to modern birds, including the canary, the budgerigar, and the zebra finch. This behavior provides an example showing that the behaviors of ancestral proteins are not necessarily those expected by averaging the behaviors of the descendants. The SWS1 pigment narrative also provides another example illustrating how resurrections lead to the discovery of specific amino acids that have particular impact. Thus, just as the ribonuclease resurrection identified site 38 as being important to substrate specificity, comparison of ancestral pigments e and f led to the discovery of previously unknown sites for tuning the spectrum of SWS1 that were not among the eight known sites (46, 49, 52, 86, 90, 93, 114, and 118). Three of these sites suffered replacement


along this branch (49, 86, and 118), but replacements at these sites proved to be insufficient to account for the evolution of the spectrum. To account for the rest of the shift in the absorbance maximum, a quadruple change that included a replacement at site 116 was required. Thus, site 116 was discovered as a new site involved in the tuning of the spectrum of SWS1. To explain the UV sensitivity in zebra finch, classical studies based on leaf-leaf comparisons focused on a cysteine at position 90 (Wilkie et al., 2000; Yokoyama, Radlwimmer and Blow, 2000). When S90C was introduced into candidate ancestral pigment e, the absorbance maximum of the variant pigment dropped to 360 nm. Thus, the S90C change by itself is sufficient to produce UV sensitivity. However, in the lineage leading from ancestor e to the last common ancestor of finch and canary, the S86C replacement also occurred. Individually, this also shifts the absorbance maximum down to 360 nm. Changing S86C and S90C together does not result in an additive shift; these two changes together also create a pigment having an absorbance maximum at 360 nm. Thus, the S86C and S90C replacements, either separately or jointly, decrease the absorbance maximum by ca. 30 nm, providing a textbook example of non-additivity in phenotype-sequence relationships. The SWS1 case illustrates the use of paleobiochemistry to manage a family that has seen a relatively large amount of sequence divergence, homoplasy, and functional adaptation. Approximately 43 amino acid replacements have occurred in the history at the nine sites now believed to be important in tuning the absorption wavelength of the pigment. Given 21 sequences and a reasonable effort to study hybrid and alternative ancestral candidates, the associated ambiguity can be managed, and interesting information can be extracted.
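The non-additivity of the S86C and S90C replacements described above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original study: it assumes a starting absorbance maximum of 393 nm (the value quoted in the text for the avian ancestor) and uses the 360 nm maxima reported for the single and double variants. The variable and function names are illustrative only.

```python
# Illustrative arithmetic only; names and layout are assumptions, not the authors' code.
BASELINE_NM = 393  # assumed starting absorbance maximum (avian ancestor value quoted in the text)
SINGLE_VARIANT_NM = {"S86C": 360, "S90C": 360}  # maxima reported for each single replacement
DOUBLE_VARIANT_NM = 360  # maximum reported when both replacements are present


def additive_prediction(baseline_nm, single_variants_nm):
    """Absorbance maximum expected if the single-site shifts simply added up."""
    total_shift = sum(value - baseline_nm for value in single_variants_nm.values())
    return baseline_nm + total_shift


predicted_nm = additive_prediction(BASELINE_NM, SINGLE_VARIANT_NM)
deviation_nm = DOUBLE_VARIANT_NM - predicted_nm

print(f"additive model predicts {predicted_nm} nm; observed {DOUBLE_VARIANT_NM} nm")
print(f"deviation from additivity: {deviation_nm} nm")
```

Under these assumptions an additive model would place the double variant far below the value actually observed, which is the sense in which the two replacements are non-additive.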


Shi and Yokoyama then extended the analysis to ask planetary biology questions. Clearly, the lifestyles and the need for UV vision changed dramatically during the divergent evolution of vertebrates. Indeed, lifestyle, including the exposure to ultraviolet light, differs within vertebrate classes and individual orders. In cases such as the coelacanth and dolphin, where ultraviolet light is assumed to be absent in deep ocean water, the SWS1 gene appears to be non-functional. Shi and Yokoyama also noted that ultraviolet light can damage the retina. Yellow pigments in the lenses or corneas of many species, including humans, therefore filter out ultraviolet light to prevent it from reaching the retina. Thus, especially in animals exposed to bright sunlight, the switch from ultraviolet vision to violet vision might be driven adaptively. Other factors might include the use of ultraviolet vision in migratory birds, the detection of rodent tracks, or the selection of food by larval fishes. The planetary biology of visual pigments has only begun to be explored. As it is explored using paleobiochemistry, more and more of these cases will advance from just so stories to serious scientific narratives that join biochemistry to the cellular phenotype, and from there to the ecosystem and the changing planetary environment.

Green opsin from fish

The question of ecological function in the visual pigments has recently been examined further by Chinen et al. (Chinen, Matsumoto and Kawamura, 2005a). These authors analyzed the difference between visual pigments in the goldfish and the zebrafish. Zebrafish have four paralogous green (RH2) opsin genes, designated RH2-1, RH2-2, RH2-3, and RH2-4. These have different absorption maxima when reconstituted with 11-cis retinal, at 467, 476, 488, and 505 nm, respectively. Goldfish have two pigments that diverge


somewhere in the time when the four zebrafish pigments were diverging (Figure 1-26). Both are diurnal freshwater species belonging to the same family, the Cyprinidae. To understand the diversity of paralogs and their different numbers in these two fish, Chinen et al. (Chinen, Matsumoto and Kawamura, 2005a) reconstructed the amino acid sequences of ancestral zebrafish RH2 opsins by likelihood-based Bayesian statistics. They then resurrected the ancestral pigments and measured their photophysical properties. The pigment ancestral to the four zebrafish RH2 pigments (termed A1, see Figure 1-26) and the pigment ancestral to RH2-3 and RH2-4 (A3) both absorbed maximally at 506 nm. In contrast, the pigment ancestral to RH2-1 and RH2-2 (A2) absorbed maximally at 474 nm. This indicates that the RH2-3 and RH2-4 modern pigments display the ancestral photophysics, while the RH2-1 and RH2-2 pigments have the derived phenotype, with the derived phenotype emerging along branch A (Figure 1-26). Hybrid pigments were then constructed to show that a large contribution to the spectral tuning (ca. 15 nm) arose from the replacement of a glutamate by a glutamine at site 122. The remaining spectral differences appeared to arise from complex interactive effects of a number of amino acid replacements, each of which has only a minor impact (1 nm). Thus, the four zebrafish RH2 pigments cover nearly the entire range of absorbance found in vertebrate RH2 pigments.

Blue opsins

Chinen et al. then continued these studies by examining the blue opsins (Chinen, Matsumoto and Kawamura, 2005b). In a separate study, the sequence of the ancestral SWS2 pigment of the two species was inferred by applying likelihood-based Bayesian statistics. The ancestral protein was then resurrected by site-directed mutagenesis on a


cloned gene. The reconstituted ancestral photopigment had a λmax of 430 nm, indicating that zebrafish and goldfish achieved short wavelength (14 nm) and long wavelength (13 nm) spectral shifts, respectively, from the ancestor. Unexpectedly, the S94A mutation resulted in only a 3-nm spectral shift when introduced into the goldfish SWS2 pigment. Nearly half of the long wavelength shift toward the goldfish pigment was achieved instead by T116L (6 nm). The S295C mutation toward zebrafish SWS2 contributed to creating a ridge of absorbance around 400 nm and broadening its spectral sensitivity in the short wavelength direction. These results indicate that the evolutionary engineering approach is very effective in deciphering the process of functional divergence of visual pigments.
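The spectral shifts quoted for the SWS2 pigments can be checked with simple arithmetic. The sketch below is illustrative only: it takes the ancestral λmax of 430 nm and the 6-nm T116L shift from the text, and uses the modern λmax values of 416 nm (zebrafish) and 443 nm (goldfish) reported for the two pigments in this section. The names are hypothetical.

```python
# Illustrative arithmetic only; the names are assumptions, the numbers come from the text.
ANCESTOR_NM = 430
MODERN_NM = {"zebrafish": 416, "goldfish": 443}
T116L_SHIFT_NM = 6  # shift attributed to T116L toward the goldfish pigment


def shift_from_ancestor(modern_nm, ancestor_nm=ANCESTOR_NM):
    """Signed shift in nm; negative means a shift toward shorter wavelengths."""
    return modern_nm - ancestor_nm


shifts = {species: shift_from_ancestor(nm) for species, nm in MODERN_NM.items()}
t116l_fraction = T116L_SHIFT_NM / shifts["goldfish"]

for species, shift in shifts.items():
    print(f"{species}: {shift:+d} nm relative to the ancestor")
print(f"T116L accounts for {t116l_fraction:.0%} of the goldfish shift")
```

The signed shifts reproduce the 14-nm short wavelength and 13-nm long wavelength changes, and the T116L contribution comes out at a little under half of the goldfish shift, in line with the statement above.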


Among the amino acid differences between the two pigments, only one (alanine in zebrafish and serine in goldfish at residue 94) was previously known to cause a difference in absorption spectrum (a 14-nm λmax shift in newt SWS2). Zebrafish and goldfish are both diurnal freshwater fish species belonging to the Cyprinidae family. Their ecological surroundings differ considerably with respect to their visual needs, however. Zebrafish are surface swimmers in conditions of broad and shortwave-dominated background spectra. Goldfish are generalized swimmers whose light environment extends to a depth of elevated short wavelength absorbance with turbidity. The peak absorption wavelength (λmax) of the zebrafish blue (SWS2) visual pigment is consistently shifted to a shorter wavelength (416 nm) compared with that of the goldfish SWS2 pigment (443 nm).

Planetary biology of the opsins

In due course, we will be able to tie the molecular and biophysical history of these visual pigments to historical changes in the environment of the organisms that carried them. The formation of paralogous visual pigments in zebrafish (Danio rerio) occurred well after the divergence of the medaka (Oryzias latipes) from the Otophysi, which includes the Mexican cavefish (Astyanax mexicanus), the goldfish (Carassius auratus), and the zebrafish (Danio rerio). Further, they appear to have diverged after the divergence within the Otophysi that separated the Cypriniphysi (which includes the zebrafish and goldfish) from the Characiphysi (which includes the Mexican cavefish). Rather, the paralogs arose via a duplication that occurred near the time of divergence of the goldfish (the subfamily Cyprininae) from the zebrafish (the subfamily Rasborinae).


Limits have been put on the dates of divergence of various families (Kruiswijk et al., 2002). For example, the subfamily Rasborinae (which includes the zebrafish) is estimated to have diverged 50 million years ago from the lineage leading to the subfamily Cyprininae (Cavender, 1991; Dixon et al., 1996; Ohno et al., 1967; Stroband et al., 1995). These, in turn, are events that occurred in a time when the fossil and geological records are strong. These dates are obtained primarily using molecular clocks. Many of these fish are known from the lakes in the Rift Valley of central Africa, however, the same area where primate evolution was to generate Homo sapiens. The area is associated with multiple lava flows and ash beds, including a lava flow dated at 5 Ma that blocked Lake Tana (in Ethiopia), which is home to Barbus intermedius, believed to have branched from the lineage leading to carp and goldfish at 30 Ma. A coherent narrative will combine the geological and paleontological records to explain what was happening in the environment of these fish that caused them to change their lifestyle, which in turn made the changes in the photophysical properties of the visual pigments adaptive.

At What Temperature Did Early Bacteria Live?

By the end of the last decade, the most ancient paleomolecular resurrections had traveled back in time only ca. 200-300 million years. This has left untouched many of the most widely discussed questions about the nature of early life on Earth. One of these relates to the role of thermophily in the early history of terran life. The issue has been confused somewhat by contradictory analytical strategies. Thus, various authors, observing that thermophilic organisms are placed at deep branches of a tree, suggested that the last common ancestor of all organisms should be a thermophile.
Others inferred the G + C content of an ancestral ribosomal RNA gene, and noted that this was inconsistent with the ancestor being a hyperthermophile (Galtier, Tourasse and


Gouy, 1999). Various models for environments deep in the Archaean suggest that the Earth was hot (Knauth, 2005), or covered with snow (Runnegar, 2000). Other models suggest that early bacteria may have been thermophiles, or possibly extreme thermophiles. Arguments based on indirect evidence, such as the lengths of branches of various trees (Woese, 1987), the G+C content of reconstructed ancestral ribosomal RNA (Galtier, Tourasse and Gouy, 1999), and the distribution of thermophily in contemporary taxa (Hugenholtz, Goebel and Pace, 1998), have generated contradictory inferences.

Elongation factors

Gaucher et al. conjectured that an experiment in paleogenetics might shed light on this question (Gaucher et al., 2003). If ancestral proteins from bacteria that lived in the Archaean could be resurrected and their properties studied over a range of temperatures, we might be able to obtain direct evidence for the temperature(s) at which the ancestral bacteria lived. Elongation factor Tu (from Bacteria) and elongation factor 1A (from Archaea and Eukarya) proved to be suitable for such a study. EFs are G-proteins that present charged aminoacyl-tRNAs to the ribosome during translation. Because of their relatively slow rates of sequence divergence, most character states of ancient EF sequences can be robustly reconstructed for proteins from bacteria deep in the eubacterial tree. Further, the optimal thermal stabilities of EFs correlate with the optimal growth temperature of the host organism. Thus, EFs from mesophiles, thermophiles, and hyperthermophiles, defined as organisms that grow at 20-40, 40-80, and >80 °C, respectively, and represented by species of Escherichia, Thermus, and Thermotoga, have temperature optima in their respective ranges (Arai, Kaziro and Kawakita, 1972; Nock et al., 1995; Sanangelantoni, Cammarano and Tiboni, 1996). This is consistent with a previous


study based on a large set of proteins in which a correlation coefficient of 0.91 was calculated between environmental temperatures of the host organisms and protein melting temperatures (Gromiha, Oobatake and Sarai, 1999).

To infer the sequences of EFs deep within the bacterial lineage, Gaucher et al. collected amino acid sequences of 50 EF-Tu’s from various bacterial lineages. Because saturation at silent sites in the DNA sequence had occurred, amino acid sequences were used to build a multiple sequence alignment and candidate trees. The differences in the rates of amino acid replacement at different sites in the sequence were captured using a gamma distribution (Gaucher, Miyamoto and Benner, 2001).

Ambiguity was immediately encountered in constructing a tree to interrelate these proteins. Thus, two trees were used. The first was constructed from EF-Tu sequences alone using a combination of phylogenetic tools. The second was constructed from the literature, which contains various views of bacterial phylogeny. These trees were largely congruent. Where they differed, a tree was extracted that captured those differences (Hugenholtz, Goebel and Pace, 1998) (Figure 1-27).

Candidate ancestral sequences were then reconstructed at the most basal node of the bacterial domain in each tree, using archaeal and eukaryotic EF sequences as outgroups. Marginal reconstructions, as opposed to joint reconstructions, were calculated; these compare the probabilities of multiple character states at a single interior node and select the character with the highest posterior probability (Yang, Kumar and Nei, 1995). The “most probabilistic ancestral sequence” (MPAS) was then reconstructed by accepting at each site the amino acid with the highest posterior probability.
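
The final selection step described above can be illustrated with a minimal sketch. It assumes that per-site posterior probabilities have already been produced by a reconstruction program; the function name and the three-site example table are hypothetical and are not data from the study.

```python
from typing import Dict, List, Tuple

# One dictionary per alignment site: amino acid -> posterior probability
# at a single interior node (illustrative numbers only, not study data).
SitePosteriors = Dict[str, float]


def most_probable_ancestral_sequence(
    posteriors: List[SitePosteriors],
) -> Tuple[str, List[float]]:
    """Accept, at each site, the amino acid with the highest posterior probability."""
    residues = []
    support = []
    for site in posteriors:
        best = max(site, key=site.get)  # residue with the largest posterior
        residues.append(best)
        support.append(site[best])
    return "".join(residues), support


# Hypothetical three-site example.
example = [
    {"M": 0.97, "L": 0.03},
    {"K": 0.55, "R": 0.40, "T": 0.05},
    {"D": 0.51, "N": 0.49},
]
sequence, support = most_probable_ancestral_sequence(example)
print(sequence, support)
```

Returning the per-site support alongside the sequence makes weakly supported positions easy to flag, which is where alternative candidate ancestors would be worth considering.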


The MPASs for the two trees were found to be surprisingly similar to sequences from modern day Aquifex; only 4-6 amino acid replacements out of ca. 400 residues were inferred to have occurred from the most recent common ancestor of bacteria to the modern Aquifex. The placement of the branch leading to Aquificaceae at the base of the tree appeared, however, to be due to long-branch attraction (Brochier and Philippe, 2002; Cavalier-Smith, 2002). To test this, the 23 sites that displayed no variation within the outgroup subfamily, the Aquificaceae subfamily, and the subfamily containing all other bacteria, but were not conserved between the subfamilies, were removed from the analysis. The resulting analysis no longer placed Aquificaceae at the base of the bacterial lineage. This is consistent with the working of long-branch attraction. To eliminate bias due to this artifact, ancestral sequences were recalculated from a dataset that excluded Aquifex sequences.

Figure 1-27 shows the two topologies used to reconstruct ancestral sequences at the node representing the hypothetical organism lying at the stem of the bacterial tree. The number of sequences in the outgroup, 3-20, did not affect the amino acid reconstructions at these nodes, ML-stem (maximum likelihood stem bacteria) and Alt-stem (Alternative stem bacteria). The ancestral sequence at the node representing the most recent common ancestor of only mesophilic bacterial lineages was reconstructed, and named ML-meso (maximum likelihood mesophiles only). This node captures one feature of models that have concluded that the last common ancestor of Bacteria was mesophilic (Brochier and Philippe, 2002).

In all, these reconstructed ancestral sequences did not appear to be influenced by long-branch attraction or non-homogeneous modes of molecular evolution, such as changes in the mutability of individual sites in different branches of the bacterial


subtree (Gaucher, Miyamoto and Benner, 2001). Table 1-6 shows the sequence identity relating the reconstructed putative ancestral sequences and their descendants. ML-stem and Alt-stem are most similar to the sequences of EFs from Thermoanaerobacter tengcongensis (a thermophile) and Thermotoga maritima (a hyperthermophile), respectively, and differ from each other by 28 amino acids. ML-meso is most similar to the sequence of EF from Neisseria meningitidis (a mesophile).

If we assume that similarity in sequence implies similarity in thermostability, it might have been predicted that the stem bacterium was thermophilic or hyperthermophilic, while the ancestral node constructed without considering thermophiles was a mesophile. To test these predictions based on this assumption, genes encoding the ancestral sequences were synthesized, expressed in an E. coli host, and purified. The thermostabilities of these ancestral EFs, and three representative EFs from contemporary organisms, were then assessed by measuring the ability of each to bind GDP across a range of temperatures.

Each resurrected protein behaved similarly. Both ML-stem and Alt-stem bound GDP with a temperature profile similar to that of the thermophilic EF from modern Thermus aquaticus, with optimal binding at ca. 65 °C. Although the sequence similarity was higher between Alt-stem and the modern hyperthermophilic Thermotoga maritima, the temperature profile of Alt-stem was not similar to that from T. maritima, which is maximally active up to 85 °C. The observation that the amino acid sequences of ML-stem and Alt-stem shared only 93% identity, but displayed the same thermostability profiles, suggests that inferences of this ancestral property are robust with respect to both varying topologies and ancestral character state predictions. This suggests, based on these


given evolutionary models, that the paleoenvironment of the ancient bacterium was approximately 65 °C.

Inferences were then drawn from a resurrected elongation factor whose sequence was reconstructed from the last common ancestral sequences of contemporary organisms that, for the most part, grow optimally at mesophilic temperatures. The temperature profile of the ancestral protein, which displayed a maximum at 55 °C, suggests that the ancestor of modern mesophiles lived at a higher temperature than any of its descendants (Figure 1-28). This result shows that the behavior of an ancestor need not be an average of the behaviors of its descendants.

The observation that tree-based ancestral sequence reconstructions can give results different from consensus sequence reconstructions may be general (Gaschen et al., 2002). It underscores a fact, well known in protein chemistry, that physical behavior in a protein is not a linear sum, or even a simple function, of the behavior of its parts. This, in turn, implies that an experiment in paleobiochemistry can yield information beyond that yielded by analysis of descendant proteins alone.

The resurrection is made still more complicated by the incomplete fossil record of microorganisms (Figure 1-29). As the sequence, geological, and paleontological records improve, we may expect the overall historical model to improve as well.

Isopropylmalate and isocitrate dehydrogenases

A similar approach was exploited by the Yamagishi group (Iwabata et al., 2005; Miyazaki et al., 2001) using the 3-isopropylmalate and isocitrate dehydrogenases as the test systems to address questions regarding the origins of archaebacteria. Although the authors’ methodology raises serious concern, a summary of their conclusions is valuable nonetheless. These two classes of enzyme proteins appear to be related by


common ancestry. In their first study in 2001, ancestral residues were inferred using the PROTPARS program in the PHYLIP package. These were introduced into an enzyme of a strain of extreme thermophiles of the genus Sulfolobus. Of the seven engineered proteins, five displayed a thermal stability higher than was displayed by the modern enzyme. This result was interpreted as being consistent with the hypothesis that the universal ancestor was a hyperthermophile.

The study was very recently extended (Iwabata et al., 2005). Here, a total of 18 sequences of isocitrate dehydrogenases and isopropylmalate dehydrogenases were aligned with CLUSTAL X, and well-aligned regions were selected using the program Gblocks. Composite trees were constructed using a neighbor joining heuristic, and a most probable tree was selected. The PAML program was used to construct ancestral sequences, using a gamma distribution and the WAG amino acid substitution matrix. A parallel analysis permitted these inferences to be compared to those inferred using maximum parsimony. One or two ancestral residues were then introduced into the isocitrate dehydrogenase from Caldococcus noboribetus, and the thermostability of the resulting protein was assayed to discover variants that had altered thermal stabilities.

In the first pass, 4 ‘sets’ of mutations representing ancestral states were introduced individually and assayed for their thermostability (variants Y309I/I310L, I321L, A325P/G326S, and A336F). The half-inactivation temperatures were 88.4, 88.9, 91.3 and 74 °C, respectively, compared to 87.5 °C for the wild-type. The authors then synthesized a construct consisting of mutations Y309I/I310L, I321L, and A325P/G326S to yield a variant with a half-inactivation temperature of 90 °C. Curiously, the authors failed to synthesize (or at least report) a variant consisting of all inferred ancestral states. It would appear that the


authors attempted to accommodate their concepts of ancient life while presenting the data, rather than presenting a priori reasons to dismiss results that do not conform to those concepts.

With that said, the results were interpreted as evidence for the hyperthermophilic hypothesis for the last universal common ancestor (LUCA). However, it is possible that the last common ancestor of Archaea (tested here) did not inhabit an environment similar to the environment that hosted LUCA. It was also noted, however, that the results were not additive. In some cases, an individual change that increased thermostability relative to the wild type enzyme did not further enhance the thermal stability of an enzyme already stabilized by one of the putative ancestral states.

Conclusions from these two studies

These studies push the experimental paleogenetics research strategy back in time 2 to 3 billion years, to the most primitive ancestors from which descent can be traced. Accordingly, the ambiguity encountered is substantial, and available sequence data are not sufficient to manage it convincingly. Here, the ambiguities do not depend primarily on the details of the model used to infer ancestral states. They seem to arise rather from the uncertainty of the phylogenetic tree joining the protein family members.

The fact that reconstructions can be made at all is therefore noteworthy. Further, if the large scale sequencing of random bacterial genomes, as undertaken by Venter and his group (Venter et al., 2004), continues, there is good reason to hope that the reconstructions will become better. Indeed, the temperature history of eubacteria is already beginning to be defined (Gaucher, personal communication) by studies throughout that kingdom.


The preliminary results (Gaucher et al., 2003) suggest that the environmental temperature of more recent bacteria was still higher than that of the descendants that presently are characterized as mesophiles. The work of Lowe, Knauth and others extracting information from the geological record about the temperature history of Earth will soon be tied to this molecular record using experimental paleomolecular biology.

Alcohol Dehydrogenase: A Changing Ecosystem in the Cretaceous

The ultimate goal of molecular paleoscience is to connect the molecular records for all proteins from all organisms in the modern biosphere with the geological, paleontological, and cosmological records to create a broadly based, coherent narrative for life on Earth. Because much of natural selection is driven by species-species interactions, developing this narrative will require tools that broadly connect genomes from different species, as well as interconnect events within a single species. It remains an open question, of course, how much of the record has been lost through extinction, erosion, and poor fossil preservation.

The first paradigms of this broad interconnection are now beginning to emerge. As might be imagined, they feature paleomolecular resurrections. One recently published study concerns the interaction between yeast, fruits, and other forms of life near the age of the dinosaurs. It appears that this was the time when the first yeast developed the metabolic strategy to make and consume ethanol in fleshy fruits.

Today, modern yeast living in modern fleshy fruits rapidly convert sugars into bulk ethanol via pyruvate (Figure 1-30). Pyruvate then loses carbon dioxide to give acetaldehyde, which is reduced by alcohol dehydrogenase 1 (Adh 1) to give ethanol, which accumulates. Yeast later consumes the accumulated ethanol, exploiting Adh 2, a homolog of Adh 1 that differs from it by 24 (of 348) amino acids.


Generating ethanol from glucose in the presence of dioxygen, only to then reoxidize the ethanol, is energetically expensive (Figure 1-30). For each molecule of ethanol converted to acetyl-CoA, a molecule of ATP is used. This ATP would not be “wasted” if the pyruvate that is made initially from glucose were delivered directly to the citric acid cycle. This implies that yeast has a reason, transcending simple energetic efficiency, for rapidly converting available sugar in fruit to give bulk ethanol in the presence of dioxygen.

One "just so" story to explain this inefficiency holds that yeast, which is relatively resistant to ethanol toxicity, may accumulate ethanol to defend resources in the fruit from competing microorganisms (Boulton et al., 1996). While the ecology of wine yeasts is certainly more complex than this simple hypothesis implies (Fleet and Heard, 1993), fleshy fruits offer a large reservoir of carbohydrate, and this resource must have value to competing organisms as well as to yeast. For example, humans have exploited the preservative value of ethanol since prehistory (McGovern, 2004).

The timing of Adh expression in Saccharomyces cerevisiae and the properties of the expressed proteins are both consistent with this story. The yeast genome encodes two major alcohol dehydrogenases (Adhs) that interconvert ethanol and acetaldehyde (Figure 1-30) (Wills, 1976). The first (Adh 1) is expressed at high levels constitutively. Its kinetic properties optimize it as a catalyst to make ethanol from acetaldehyde (Ellington and Benner, 1987; Fersht, 1977). In particular, the Michaelis constant (KM) for ethanol in Adh 1 is high (17000 µM), consistent with ethanol being a product of the reaction. After the sugar concentration drops, the second dehydrogenase (Adh 2) is derepressed. This paralog oxidizes ethanol to acetaldehyde with kinetic parameters suited for this role.
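
One way to see why a high KM for ethanol suits an enzyme that makes ethanol, and a low KM suits one that consumes it, is to compare how far each enzyme is saturated at the same ethanol concentration. The sketch below assumes a single-substrate Michaelis-Menten approximation, which ignores the NAD+/NADH cofactor and is a simplification of the real two-substrate mechanism. The KM values are those quoted in the text (17000 µM for Adh 1; 600 µM for Adh 2); the ethanol concentrations are hypothetical.

```python
def fractional_rate(substrate_uM: float, km_uM: float) -> float:
    """Michaelis-Menten rate as a fraction of Vmax: v / Vmax = [S] / (KM + [S])."""
    if substrate_uM < 0 or km_uM <= 0:
        raise ValueError("concentration must be >= 0 and KM must be > 0")
    return substrate_uM / (km_uM + substrate_uM)


# KM values for ethanol quoted in the text, in micromolar.
KM_ETHANOL_UM = {"Adh 1": 17000.0, "Adh 2": 600.0}

# Hypothetical ethanol concentrations, chosen only for illustration.
for ethanol_uM in (100.0, 1000.0, 10000.0):
    fractions = {
        name: round(fractional_rate(ethanol_uM, km), 3)
        for name, km in KM_ETHANOL_UM.items()
    }
    print(ethanol_uM, fractions)
```

Under this simplification, the enzyme with the lower KM runs closer to its maximal ethanol-oxidizing rate at every concentration tried, which is the sense in which its parameters are "suited" to consuming ethanol.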


The KM for ethanol for Adh 2 is low (600 µM), consistent with ethanol being its substrate.

Adh 1 and Adh 2 are homologs (Ellington and Benner, 1987) differing by 24 of 348 amino acids. Their common ancestor, termed ADHA, had an unknown role. If ADHA existed in a yeast that made, but did not accumulate, ethanol, its physiological role would presumably have been the same as the role of lactate dehydrogenase in mammals during anaerobic glycolysis: to recycle NADH generated by the oxidation of glyceraldehyde-3-phosphate (Figure 1-30) (Stryer, 1995). Lactate, in human muscle, is removed by the bloodstream; ethanol would be lost by the yeast to the environment. If so, ADHA should be optimized for ethanol synthesis, as is modern Adh 1. The kinetic behaviors of ADHA should resemble those of modern Adh 1 more than Adh 2, with a high KM for ethanol.

To add paleobiochemical data to convert this "just so" story into a more compelling scientific narrative, a collection of Adhs from yeasts related to S. cerevisiae was cloned, sequenced, and added to the existing sequences in the database (Thomson et al., 2005). A maximum likelihood evolutionary tree was then constructed using PAUP*4.0 (Figure 1-31) (Swofford, 1998). Maximum likelihood sequences for ADHA were then reconstructed using both codon and amino acid models in PAML (Yang, 1997).

When the posterior probability that a particular amino acid occupied a particular site was > 80%, that amino acid was assigned at that site in ADHA. When the posterior probability was < 80% and/or the most probabilistic ancestral states estimated using the codon and amino acid models were not in agreement, the site was considered ambiguous, and alternative ancestral genes were constructed. For example, the posterior probabilities of two amino acids (methionine and arginine) were


nearly equal at site 168 in ADHA, three amino acids (lysine, arginine, and threonine) were plausibly present at site 211, and two (aspartic acid and asparagine) were plausible for site 236. To handle these ambiguities, all twelve (all 2 x 2 x 3 combinations) candidate ADHAs were resurrected by constructing genes that encoded them, transforming these genes into a strain of S. cerevisiae (Thomson, 2002) from which both Adh 1 and Adh 2 had been deleted, and expressing them from the Adh 1 promoter. All of the ancestral sequences could rescue the double deletion phenotype.

Table 1-7 collects kinetic data from the candidate ancestral ADHAs. To assess the quality of the data, Haldane values (Keq = (Vf × Kiq × Kp) / (Vr × Kia × Kb)) (Segel, 1975) were calculated from the experimental data, where Vf and Vr are forward and reverse maximal velocities, Kia and Kiq are dissociation constants for NAD+ and NADH, and Kb and Kp are Michaelis constants for ethanol and acetaldehyde, respectively. These reproduced the literature equilibrium constant for the reaction to within a factor of two.

One variant, termed MTN, had very low catalytic activity in both directions. This suggested that this particular candidate ancestor was not present in the ancient yeast. Significant to the hypothesis, the kinetic properties of the remaining candidate ancestral ADHAs resembled those of Adh 1 more than Adh 2 (Table 1-7). From this, it was inferred that the ancestral yeast did not have an Adh specialized for the consumption of ethanol, like modern Adh 2, but rather had an Adh specialized for making ethanol, like modern Adh 1. This, in turn, suggests that the ancestral yeast prior to the time of the duplication did not consume ethanol. This implies that the ancestral yeast also did not make and accumulate ethanol under aerobic conditions for future consumption, and that


the make-accumulate-consume strategy emerged after Adh 1 and Adh 2 diverged. These interpretations are robust with respect to the ambiguities in the reconstructions.

Several details are noteworthy. For modern Adh 1, reported KM values for ethanol range from 17000 to 24000 µM, from 170 µM for NAD+, from 1100 µM for acetaldehyde, and from 110 µM for NADH (Ganzhorn et al., 1987). These comparisons, together with the Haldane analysis, provide a view of the experimental error in the kinetic parameters reported here. Thus, the interpretations are based on differences well outside of experimental error.

Further, when paralogs are generated by duplication, the duplicate acquiring the new functional role is often believed to evolve more rapidly than the one retaining the primitive role (Kellis, Birren and Lander, 2004). If this were generally true, one might identify the functionally innovative duplicate by a bioinformatics analysis. While this may be true for many genes, chemical principles do not obligate this outcome, and it is not manifest with these Adh paralogs. Here, the rate of evolution is not markedly faster in the lineage leading to Adh 2 (having the derived function) than in the lineage leading to Adh 1 (having the primitive function). Thus, a paleobiochemistry experiment was necessary to assign the primitive behavior.

Further, the Haldane ratio relates various kinetic parameters (kcat, KM, Kdiss) that can change via a changing amino acid sequence to the overall equilibrium constant, which the enzyme (being a catalyst) cannot change. Thus, if a lower KM for ethanol is selected, other terms in the Haldane ratio must change to keep the ratio the same. This is observed in data for the ancestral proteins prepared here and the natural enzymes.
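
The constraint described above can be made concrete with a small sketch of the Haldane relationship in the form quoted earlier from Segel (1975), Keq = (Vf × Kiq × Kp) / (Vr × Kia × Kb). All numerical values below are hypothetical and chosen only for illustration; they are not taken from Table 1-7. Compensating through Vf is only one of several ways the ratio could be restored.

```python
import math


def haldane_keq(vf: float, vr: float, kia: float, kb: float,
                kp: float, kiq: float) -> float:
    """Haldane value: Keq = (Vf * Kiq * Kp) / (Vr * Kia * Kb)."""
    return (vf * kiq * kp) / (vr * kia * kb)


# Hypothetical "Adh 1-like" parameter set (arbitrary but consistent units).
ancestral = dict(vf=100.0, vr=1000.0, kia=200.0, kb=17000.0, kp=1000.0, kiq=100.0)
keq = haldane_keq(**ancestral)

# Lower the KM for ethanol (Kb) alone: the computed ratio no longer equals
# Keq, which a catalyst cannot do.
uncompensated = dict(ancestral, kb=600.0)

# One hypothetical compensation: scale Vf down by the same factor as Kb.
compensated = dict(uncompensated, vf=ancestral["vf"] * 600.0 / 17000.0)

print(keq, haldane_keq(**uncompensated), haldane_keq(**compensated))
```

The comparison between the uncompensated and compensated parameter sets shows why a selected change in one kinetic term has to be accompanied by changes elsewhere.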


The assignment of a primitive function to ADHA raises a broader historical question: Did the Adh 1/Adh 2 duplication, and the "accumulate-consume" strategy that it presumably enabled, become fixed in response to a particular selective pressure? Connecting molecular change to organismic fitness is always difficult (Kreitman and Akashi, 1995), but is necessary if reductionist biology is to move through systems biology to a planetary biology that answers "Why?" as well as "How?" questions (Benner et al., 2002).

Hypothetically, the emergence of a make-accumulate-consume strategy may have been driven by the domestication of yeast by humans selecting for yeast that accumulate ethanol. Alternatively, the strategy might have been driven by the emergence of fleshy fruits that offered a resource worth defending using ethanol accumulation. We might distinguish between the two by estimating the date when the Adh 1/2 duplication occurred. Even with large errors in the estimate, a distinction should be possible, as human domestication occurred in the past million years, while fleshy fruits arose in the Cretaceous, after the first angiosperms appeared in the fossil record 125 Ma (Sun, 2002), but before the extinction of the dinosaurs 65 Ma (Collinson and Hooker, 1991; Fernandez-Espinar, Barrio and Querol, 2003).

The topology of the evolutionary tree in Figure 1-31 suggests that the Adh 1/2 duplication occurred before the divergence of the sensu stricto species of Saccharomyces (Fernandez-Espinar, Barrio and Querol, 2003), but after the divergence of Saccharomyces and Kluyveromyces. The date of divergence of Saccharomyces and Kluyveromyces is unknown, but might be estimated to have occurred 80 ± 15 million years ago (Berbee and Taylor, 1993). This date is consistent with a transition redundant


exchange (TREx) clock (Benner, 2003), which exploits the fractional identity (f2) of silent sites in conserved two-fold redundant codon systems to estimate the time since the divergence of two genes. Between pairs of presumed orthologs from Saccharomyces and Kluyveromyces, f2 is typically 0.82, not much lower than the f2 value (0.85) separating Adh 1 and Adh 2 (Benner et al., 2002), but much lower than that of paralog pairs within the Saccharomyces genome that appear to have arisen by more recent duplication (ca. 0.98) (Lynch and Conery, 2000).

Interestingly, Adh 1 and Adh 2 are not the only pair of paralogs where 0.80 < f2 < 0.86 (Benner et al., 2002). Analysis of ca. 350 pairs of paralogs contained in the yeast genome (considering pairs that shared at least 100 silent sites, and diverged by less than 120 point accepted replacements per 100 sites) identified 15 pairs having 0.80 < f2 < 0.86 (Figure 1-32). These represent eight duplications that occurred near the time of the Adh 1 and Adh 2 duplication, if f2 values are assumed to support a clock.

These duplications are not randomly distributed within the yeast genome. Rather, six of the eight duplications involve proteins that participate in the conversion of glucose to ethanol (Table 1-8). Further, the enzymes arising from the duplicates are those that appear, from expression analysis, to control flux from hexose to ethanol (Pretorius, 2000; Schaaff, Heinisch and Zimmerman, 1989). These include proteins that import glucose, pyruvate decarboxylases that generate the acetaldehyde from pyruvate, the transporter that imports thiamine for these decarboxylases, and the Adhs (the red proteins in Figure 1-30). If the f2 clock (within its expected variance) is assumed to date paralogs in yeast, this cluster suggests that several genes other than Adh duplicated as part of the


emergence of the new make-accumulate-consume strategy, near the time when fleshy fruit arose.

The six gene duplications proposed to be part of the emergence of the make-accumulate-consume strategy (in the 0.80 < f2 < 0.86 window) are not associated with one of the documented blocks of genes duplicated in ancient fungi, possibly as part of a whole genome duplication (WGD) (Kellis, Birren and Lander, 2004; Wolfe and Shields, 2001). Two duplications in genes that are not associated with fermentation that fall in the 0.80 < f2 < 0.86 window are part of a duplication block (see Table 1-8). The silent sites for most gene pairs associated with blocks are nearly equilibrated (with the prominent exception of ribosomal proteins), and therefore suggest that most blocks arose by duplications more ancient than duplications in the 0.80 < f2 < 0.86 window. Therefore, the hypothesis that a set of six time-correlated duplications (Figure 1-32, Table 1-8) generated the make-accumulate-consume strategy in yeast near the time when fermentable fruit emerged is not inconsistent with the WGD hypothesis.

The ecology of fermenting fruit is complex. In rotting fruits, S. cerevisiae becomes dominant after fermentation begins, while osmotic stress and pH, as well as ethanol, appear to inhibit the growth of competing organisms (Pretorius, 2000). Nevertheless, the emergence of bulk ethanol may not be unrelated to other changes in the ecosystem at the end of the Cretaceous (Figure 1-33), which include the extinction of the dinosaurs and the emergence of mammals and fruit flies (Ashburner, 1998; Barrett and Willis, 2001; Baudin et al., 1993). Thus, this paleogenetics experiment is an interesting step to connect the chemical behavior of individual enzymes operating as part of a multi-enzyme system, via metabolism and physiology, to the ecosystem and the fitness of organisms within it.
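
The f2 statistic that underlies the dating argument in this section can be sketched as follows. This is a simplified illustration, not the published TREx procedure: it assumes two already-aligned, in-frame coding sequences and the standard genetic code, and it counts only codons where both sequences encode the same amino acid from a two-fold redundant codon family. The function names and the default window bounds (taken from the 0.80 < f2 < 0.86 window in the text) are choices made for this example.

```python
# Two-fold redundant codon families in the standard genetic code.
TWO_FOLD_CODONS = {
    "TTT": "F", "TTC": "F", "TAT": "Y", "TAC": "Y",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C",
}


def f2(seq_a: str, seq_b: str) -> float:
    """Fraction of conserved two-fold redundant codons with identical silent sites."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    compared = 0
    identical = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        aa_a = TWO_FOLD_CODONS.get(codon_a)
        aa_b = TWO_FOLD_CODONS.get(codon_b)
        # Skip gaps, other codon families, and sites where the amino acid changed.
        if aa_a is None or aa_a != aa_b:
            continue
        compared += 1
        if codon_a[2] == codon_b[2]:
            identical += 1
    if compared == 0:
        raise ValueError("no conserved two-fold redundant codons to compare")
    return identical / compared


def in_window(value: float, low: float = 0.80, high: float = 0.86) -> bool:
    """True if an f2 value falls inside the open window (low, high)."""
    return low < value < high


# Toy example: three conserved two-fold codons, one with a silent difference.
print(f2("TTTAAAGAT", "TTCAAAGAT"), in_window(0.85), in_window(0.98))
```

A real analysis would also need the filters mentioned in the text, such as a minimum number of shared silent sites per pair, which this sketch omits.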


Resurrecting the Ancestral Steroid Receptor and the Origin of Estrogen Signaling

Another narrative supported by paleomolecular resurrection comes from Thornton and his coworkers, who focused on receptors for steroid hormones (Thornton, Need and Crews, 2003). Steroid receptors are widely distributed throughout the chordates. In the pre-genomic age, steroid receptors were not known in invertebrates, either by analysis of fully sequenced invertebrate genomes or by classical biochemical studies. Because of this, steroid receptors were thought to have arisen early in the divergence of chordates, perhaps some 400 to 500 million years ago, presumably via the duplication of a more ancient receptor gene (Baker, 2003; Escriva et al., 1997; Giguere, 2002).

Vertebrate-type steroids appear to be involved in the reproductive endocrinology of certain mollusks, however (Di Cosmo, Di Cristo and Paolucci, 2001; Di Cosmo, Di Cristo and Paolucci, 2002). Further, arthropods and nematodes, where the most intensive studies had been done to suggest that steroid receptors are absent, both use ecdysone as a molting hormone (Peterson and Eernisse, 2001), making it conceivable that steroid receptors were lost in the lineage leading to ecdysozoa. Further, analysis of receptor sequences, the ancestral sequence of steroid receptor (AncSR1) and structure mapping studies indicated that the ancient progenitor of this protein class was most similar to extant estrogen receptors (Thornton, 2001).

Guided by these hints, Thornton and his coworkers used degenerate PCR and rapid amplification of cDNA ends to isolate a putative estrogen receptor sequence from the mollusk Aplysia californica. The protein sequence of the Aplysia receptor’s DNA-binding domain (DBD) was found to be more similar to that of the vertebrate estrogen receptors than to those of other nuclear receptors. This suggested that Aplysia might have a true estrogen receptor.


To characterize the molecular function of the Aplysia estrogen receptor, Thornton separately analyzed the activity of its DNA binding domain (the DBD) and its ligand binding domain (LBD) by expressing them in fusion constructs in a cell culture system (Green and Chambon, 1987). The Aplysia DBD was fused with a constitutive activation domain (AD) and the construct was cotransfected with an estrogen response element (ERE) luciferase reporter gene into CHO-K1 cells. The Aplysia estrogen receptor-DBD fusion protein activated luciferase expression approximately 10-fold above control levels, which is slightly more than that produced by the analogous construct using the human estrogen receptor DBD.

The identification of a putative estrogen receptor from the mollusk suggested that steroid receptors might be more ancient in metazoa than had been thought. Barring lateral transfer, the presence of an estrogen receptor in both mollusks and chordates suggested that steroid receptors arose prior to the divergence of bilaterally symmetric animals. This gene was, according to this hypothesis, lost in the lineage leading to arthropods and nematodes.

The existence of a receptor family in chordates and mollusks does not, of course, require that the ligand specificity be the same throughout the family history. In order to use a paleobiochemical experiment to determine this, Thornton et al. analyzed 74 steroid and related receptors, including the putative Aplysia estrogen receptor, using maximum parsimony and Bayesian Markov Chain Monte Carlo (BMCMC) techniques. Both methods suggested that the Aplysia sequence is an ortholog of the vertebrate estrogen receptors, with a BMCMC posterior probability of 100%, a bootstrap proportion of 90%, and a decay index of 6. Although BMCMC probabilities can overestimate statistical


confidence (Hillis and Bull, 1993; Suzuki, Glazko and Nei, 2002), the result suggests that the steroid receptors originated ca. 600 to 1200 million years ago as the major metazoan phyla were just beginning to diverge (Benton and Ayala, 2003).

To characterize the functionality of the LBD, a fusion of the Aplysia ER-LBD with a Gal4-DBD was co-transfected with a luciferase reporter driven by an upstream activator sequence, the response element for Gal4-DBD. The Aplysia ER-LBD activates transcription constitutively, and none of a set of vertebrate steroid hormones, including estrogens, androgens, progestins, and corticoids, further activated or repressed this activation, even at micromolar doses.

Thornton et al. then resurrected the conserved functional domains of the ancestral steroid receptor (AncSR1) from which all extant steroid receptors evolved. A maximum likelihood joint reconstruction algorithm in PAML (Yang, 1997) was used to obtain the protein sequences of the ancestral AncSR1 DNA binding domain and ligand binding domains using a sequence matrix (Gonnet, Cohen and Benner, 1992), the Jones model for amino acid replacement, a gamma distribution of evolutionary rates across sites with four rate categories, and a tree determined by maximum parsimony methods. Although support for the AncSR1 node is not strong in a parsimony context, it has 100% posterior probability, and the best tree with this node has a likelihood 4.7 million times greater than the best tree without this node.

The matrix included broadly sampled representatives of both ligand-regulated and ligand-independent receptors. The maximum likelihood sequence inferred for the AncSR1-DBD had a mean probability of 81% per site; the AncSR1-ligand binding domain had a mean probability of 62%.
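
Phylogenetic programs usually report tree support as log-likelihoods rather than raw likelihoods, so the 4.7-million-fold likelihood ratio quoted above is easier to compare with program output once it is expressed as a difference in natural-log likelihood. The short sketch below performs only that conversion; the function name is hypothetical, and the ratio is the only value taken from the text.

```python
import math


def log_likelihood_difference(likelihood_ratio: float) -> float:
    """Convert a likelihood ratio L1 / L2 into ln(L1) - ln(L2)."""
    if likelihood_ratio <= 0:
        raise ValueError("likelihood ratio must be positive")
    return math.log(likelihood_ratio)


# Ratio quoted in the text for the best trees with and without the AncSR1 node.
delta_lnl = log_likelihood_difference(4.7e6)
print(f"{delta_lnl:.2f} log-likelihood units")
```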


The amino acid chosen for each position in the ancestral (AncSR1 DBD and LBD) sequences was the one amino acid preferred at each site. The appropriate gene was synthesized and subcloned into fusion constructs. When the constructs containing AncSR1 domains were expressed in CHO-K1 cells, their activity was consistent with the prediction that the ancestral receptor would function like an estrogen receptor (Thornton, 2001). For example, the AncSR1-DNA binding domain fusion increased transcription from an estrogen response element ca. 4-fold, which is slightly less than the human ER-DNA binding domain but significantly greater than that seen in controls. Other extant steroid receptors do not activate effectively on EREs (Zilliacus et al., 1994).

The AncSR1-ligand binding domain activated transcription in a dose-dependent fashion in the presence of estradiol, estrone, and estriol. The magnitude of hormone-induced activation was smaller than that effected by the human ER-ligand binding domain with estradiol. Similar results were obtained with a radio-ligand binding assay. Here, the concentration to the midpoint was in the 10-100 nM range; the corresponding range in the human ER-LBD construct was in the 1-3 nM range.

To assess specificity, other steroids, including androgens, progestins, and corticoids, were examined. These gave still smaller maximal activation of the AncSR1, ranging from 1 to 45% of that observed with estradiol. Further, dose-response analysis shows that the ancestor was 30,000 fold less sensitive to these ligands than to estradiol. This suggests a high level of estrogen specificity.

It is unlikely that the specificity of estrogen activation is an artifact. Of the 26 sites in the ligand-binding pocket, 22 were identical to a human estrogen receptor. Thornton et al. argued that because random mutation impairs function more frequently than

enhancing it in estrogen receptors, error in the reconstruction is more likely to reduce the efficiency of steroid binding and activation than to create it de novo. This is analogous to the argument used in resurrecting ancestral digestive ribonucleases.

These findings provide empirical support for the hypothesis that the ancient ancestral SR functioned a billion years ago as an estrogen receptor. This result is also consistent with the hypothesis that this ancient steroid receptor gene was lost along the lineage leading to ecdysozoa. Thornton et al. (Thornton, Need and Crews, 2003) suggested that in the lineage leading to the Aplysia ER, ligand regulation was lost.

This narrative indicates how paleomolecular resurrections can open up biological understanding in ways that are difficult to achieve by other means. First, the narrative suggests a previously unrecognized degree of functional and genomic lability in steroid receptors. The existence of a primitive steroid receptor also suggests that many other non-ecdysozoan metazoans, including echinoderms, annelids, and platyhelminthes, will also have steroid receptors. Given limited experimental evidence for a reproductive role of steroid hormones in cephalopod and gastropod mollusks (Di Cosmo, Di Cristo and Paolucci, 2001; Di Cosmo, Di Cristo and Paolucci, 2002; Oberdorster and McClellan-Green, 2002), Thornton et al. suggested that the loss of estrogen-dependent activation in the Aplysia system is recent and unique to the Opisthobranchs.

Thornton et al. did not take the next step, to place their work within a planetary context. It is possible to do so in many ways. For example, the biosynthesis of estrogen requires molecular oxygen. This is not only because of the conversion of squalene to squalene oxide, an early step in lanosterol biosynthesis, or the conversion of lanosterol to cholesterol, all very basic to eukaryotic biology. Estradiol also requires aromatase, the

enzyme discussed in the first section of this review. This enzyme also requires dioxygen, and is part of a cytochrome P450 family of proteins that underwent explosive divergence near the time of the origins of the metazoans, relatively late compared with the dioxygen-utilizing enzymes that make squalene oxide and oxidatively decarboxylate lanosterol.

It would be helpful to revisit the problem of the steroid receptors as more metazoan genomes become available. This would include sampling a larger number of candidate ancestral estrogen receptors to demonstrate the robustness of the inferences with respect to the ambiguity inherent in this analysis. The emergence of the metazoan endocrine system is a fascinating theme in the emergence of multicellularity in animals. Molecular paleoscience will undoubtedly play a key role in the development of our understanding of this important event in the history of life on Earth, and generate a deeper understanding of developmental biology in the process.

Ancestral Coral Fluorescent Proteins

Some experiments in paleobiochemistry are exceptionally elegant in their execution. For example, proteins homologous to the green fluorescent protein (GFP) from Aequorea victoria exploit two consecutive autocatalytic reactions to complete the synthesis of the chromophores that generate the green and cyan emissions. Red fluorescent proteins and purple-blue chromoproteins require a third reaction to prepare the emitter. By this criterion, the red and purple-blue chromoproteins are more "advanced" (Ugalde, Chang and Matz, 2004). Nevertheless, the natural world contains several examples of red/green color diversification within this superfamily (Shagin et al., 2004). This may reflect convergent evolution of molecular complexity.

To examine this issue, Ugalde et al. (Ugalde, Chang and Matz, 2004) studied the historical event that gave rise to the color diversity exhibited

by the great star coral, Montastraea cavernosa. This coral has several genes coding for fluorescent proteins that generate cyan, shortwave green, longwave green, and red emissions (Kelmanson and Matz, 2003).

The sequences were inferred and synthesized for the common ancestor of all M. cavernosa colors ("ALL ancestor"), the common ancestor of red proteins ("Red ancestor"), and two intermediate nodes corresponding to the possible common ancestors of red and longwave green proteins ("Red/Green ancestor" and "pre-Red ancestor") (Figure 1-34). Three models of evolution, based on different types of sequence information (amino acids, codons, and nucleotides) (Yang, 1997), were used. The reconstructions of all four ancestral sequences were largely robust under these models, with average posterior probabilities at a site ranging from 0.96 to 0.99. Still, the models disagreed at between 4 and 8 sites (out of a total of 217).

Ugalde et al. therefore designed the codons corresponding to these sites to be degenerate, which permitted a library to sample the alternative predictions. Bacteria were transformed with plasmids carrying the genes encoding the ancestral proteins. For each type of ancestral gene, the protein products were found to display identical fluorescent phenotypes regardless of the amino acid at the ambiguous sites.

Green color at any of the ancestral nodes would indicate that red proteins from other parts of the phylogeny were the result of convergent evolution. In contrast, if red fluorescence arose only once, then all ancestors would emit red fluorescence. The ALL ancestor turned out to emit in the shortwave green. The two possible common ancestors of red and green proteins (Red/Green and pre-Red) showed an intermediate longwave green/red emission. Although most of the protein remained longwave green, a small

fraction was able to complete the third autocatalytic step in the chromophore synthesis, resulting in a small amount of red emission. Clones of the Red ancestor showed an “imperfect red” phenotype. Although the red emission dominated, the conversion of green to red was still less efficient than in modern reds, resulting in a prominent peak of green fluorescence.

These results suggested that because the ALL ancestor was green and not red, the red emission color within the superfamily of GFP-like proteins has originated more than once. This establishes the convergent evolution of this complex molecular system. The red emitting system evidently evolved from the green emitting system through a stepwise accumulation of amino acid replacements.

The most elegant part of this report was undoubtedly the presentation of the results. Here, a phylogenetic tree was traced onto the growth medium in a Petri dish, with the ancestral and modern sequences appropriately placed. When observed under ultraviolet light, the dish gives a visual image of the evolution of fluorescent colors.

The planetary biology of this system was not discussed by Ugalde et al. (2004). It remains a subject of open discussion why these taxa fluoresce. Clearly, however, the emission of light is a way of interacting with an environment, which is changing. While the visual pigments of the organisms intended to receive the light are not known, to our knowledge, a parallel evolutionary history must exist for these. As the post-genomic age develops, there is little doubt that these will be found and understood.

Isopropylmalate Dehydrogenases

Evolutionary analysis indicates that eubacterial NADP-dependent isocitrate dehydrogenases (IDH) first evolved from an NAD-dependent precursor about 3.5 billion years ago. Various authors have suggested that selection in favor of utilizing NADP was

a result of an expansion of an ecological niche during growth on acetate, where isocitrate dehydrogenase provides 90% of the NADPH necessary for biosynthesis.

Dean and his coworkers (Zhu, Golding and Dean, 2005) used experimental paleobiochemistry to explore the divergence of cofactor specificity in this system, starting from the NADP-dependent IDH (NADP-IDH) (Hurley et al., 1989) and the NAD-dependent isopropylmalate dehydrogenase (NAD-IMDH) (Imada et al., 1991). This example illustrated the limits of an analysis when overall amino acid sequence identity was low. The degree of divergence here is very high; only 17% to 24% of the sites are identical across the system. As a result, the sequence alignments obtained using CLUSTAL W (Thompson, Higgins and Gibson, 1994) misaligned residues that were independently known to be critical to substrate binding and catalysis. The multiple sequence alignment was therefore modified using the crystallographic structures of Escherichia coli NADP-IDH (Hurley et al., 1989) and Thermus thermophilus NAD-IMDH (Imada et al., 1991) as guides.

Three amino acids responsible for differing coenzyme specificities were identified from x-ray crystallographic structures of NADP bound to Escherichia coli isocitrate dehydrogenase and of NAD bound to the distantly related Thermus thermophilus NAD-dependent isopropylmalate dehydrogenase (an enzyme that is homologous to IDHs and uses NAD) (Hurley and Dean, 1994; Hurley et al., 1991; Yasutake et al., 2003). In a previous study (Dean and Golding, 1997), the authors determined, using a maximum likelihood ancestral reconstruction, that the six conserved residues binding NADP in IDH were introduced very early in evolution. The complementation of the structural and the

ancestral reconstruction studies revealed that the three residues responsible for binding the cofactor (NADP) were ancestral. The other residues were not conserved in the NAD binding enzymes and are therefore not involved in binding NAD. Based on these results, site directed mutagenesis targeting these sites in the binding pocket has been used to invert the coenzyme specificity of Escherichia coli IDH from NADP to NAD. The specificity was confirmed with kinetic experiments.

Having determined the ancestral residues responsible for cofactor specificity, the authors set out to test the hypothesis that NADP use is an adaptation to growth on acetate (Dean and Golding, 1997). When bacteria grow on highly reduced (energy-rich) compounds such as glucose, NADPH (the reduced form of NADP) for biosynthesis is obtained by diverting energy-rich carbon from glycolysis into the oxidative branch of the pentose phosphate pathway (Neidhardt, Ingraham and Schaechter, 1990). During growth on acetate, which is a highly oxidized (energy-poor) compound, there is no energy-rich carbon to divert into the oxidative branch, and therefore alternative sources of NADPH are required.

A competition study was undertaken between bacterial strains that were genetically identical except for the gene coding for NADP-dependent IDH. One strain had the wild-type NADP-IDH and the second had the IDH with the ancestral NAD binding residues obtained from the first part of the study. The chemostat competition experiments were conducted with either glucose or acetate as the only limiting factor. The ancestral NAD binding IDH was strongly selected against when grown on acetate, yet it was favored over the NADP binding wild-type when grown on glucose. These results supported the hypothesis.
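Outcomes of competition experiments of this kind are often summarized as a selection coefficient, estimated as the slope of the natural logarithm of the ratio of the two strains plotted against time. The sketch below assumes that convention; whether Zhu, Golding and Dean used exactly this estimator is not stated here. The time points and strain ratios are invented for illustration, and `selection_coefficient` is a hypothetical helper name.

```python
import math
from typing import Sequence


def selection_coefficient(times: Sequence[float], ratios: Sequence[float]) -> float:
    """Least-squares slope of ln(strain ratio) against time.

    A negative slope means the strain in the numerator is being
    selected against under the growth condition used.
    """
    if len(times) != len(ratios) or len(times) < 2:
        raise ValueError("need at least two paired observations")
    log_ratios = [math.log(r) for r in ratios]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(log_ratios) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, log_ratios))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return sxy / sxx


# Hypothetical data: ratio of the NAD-using strain to the NADP-using
# wild type, sampled over time during growth on acetate.
hours = [0.0, 10.0, 20.0, 30.0]
nad_to_nadp_ratio = [math.exp(-0.05 * t) for t in hours]

s_acetate = selection_coefficient(hours, nad_to_nadp_ratio)
print(f"selection coefficient per hour: {s_acetate:.3f}")
```

With real chemostat data the points would scatter about the line, and the same function would return the best-fit slope.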

To further show that selection on acetate is truly acting on cofactor use, and not on some other factor such as altered kinetics or less efficient regulation of the ancestral enzyme by IDH kinase/phosphatase, the authors devised a test to confirm their finding. If the selection at the IDH gene is truly caused by cofactor specificity, then selection for the NADP-IDH gene should increase when other sources of NADPH are removed. Conversely, if this effect is independent of coenzyme use, then no such increase should be observed. Indeed, competition studies on strains carrying deletions in genes contributing NADPH (maeB, pntAB, and udhA, the last encoding a soluble transhydrogenase, UdhA) showed that selection, on acetate, against the ancestral NAD-IDH increases as the sources of NADPH decline. These results further support the hypothesis.

Genomic comparisons also supported the paleoscience results. Isocitrate lyase (ICL), an essential enzyme for growth on acetate, was found in all of the 46 genomes that also encode NADP-dependent IDH. Further, no ICL is found in the 12 genomes encoding NAD-dependent IDH. Members of both groups represent highly diverse phylogenies, including archaea, bacilli, and proteobacteria, with variable metabolic lifestyles and habitats. The genomic studies in isolation would have been merely a hint, but when added to the paleobiochemical experimental results they become part of a sophisticated scientific narrative that not only shows evolutionary adaptation but also clarifies the structural and mechanistic basis for such selection.

Global Lessons

The results of these 20 examples provide the first glimpse of the power of a field that is just beginning. These examples also provide rejoinders to some of the objections that are frequently raised by those who criticize the paradigm.

First, the resurrected sequences provide more information than consensus models, single residue swapping, or simply comparing ancestral sequences. In many cases, the properties of the ancestral proteins are not simply an average of the properties of the descendant proteins (Axe, 2000). This is a consequence of a very fundamental principle of organic chemistry: the whole is not a linear sum of the parts in most molecules, and certainly not in proteins. The implication of this generalization is that selective reconstruction might not be sufficient to draw accurate inferences. The more complete the resurrection, the more likely it is to give a reliable result.

These examples show a wide range of tactics for managing ambiguities. With the exception of those who chose to use consensus sequences to approximate an ancestral sequence, the degree to which ambiguity was managed reflected in large part the extent to which the data allowed the ambiguity to be managed. Sequence data are adequate to ensure that if the target for resurrection lived in the past 100 million years, the ambiguity could be comprehensively sampled. In more ancient resurrections, limitations from the dataset meant that the sampling was less comprehensive.

One outcome of these studies was how infrequently ambiguities were found to compromise the biology. In general, when the ambiguity was comprehensively sampled, a few of the proteins appeared to be defective to an extent that the particular sequence was deemed unlikely to be a true ancestor. Otherwise, it was difficult to find an example where ambiguity caused a biological interpretation to be uncertain.

This observation is, of course, good news for those working in the area. It may, as suggested in the second section of this review, reflect the fact that ambiguity is generally

at sites that have suffered a large amount of neutral amino acid replacement, and that neutral drift generally means that replacement of an amino acid at that site does not have any impact on a functionally significant behavior of a protein. Further work is required to explore the generality of this observation.

It is clear that functionally significant in vitro behaviors for a protein where new biological function has emerged can also be identified by paleomolecular resurrections. Several examples now exist of a strategy that examines the behavior of proteins resurrected from points in history before and after the episode of adaptive evolution, as indicated by a high rate of sequence change. Those behaviors that are rapidly changing during the episode of adaptive sequence evolution, by hypothesis, confer selective value on the protein in its new function, and therefore are relevant to the change in function, either directly or by close coupling to behaviors that are. The in vitro properties that are the same at the beginning and end of this episode are not relevant to the change in function.

Figure 1-1. An Old World fossil suid shortly after the emergence of large litter sizes (from the S. A. Benner collection; photo courtesy of Steven Benner).

C  11.5
S   0.1   2.2
T  -0.5   1.5   2.5
P  -3.1   0.4   0.1   7.6
A   0.5   1.1   0.6   0.3   2.4
G  -2.0   0.4  -1.1  -1.6   0.5   6.6
N  -1.8   0.9   0.5  -0.9  -0.3   0.4   3.8
D  -3.2   0.5   0.0  -0.7  -0.3   0.1   2.2   4.7
E  -3.0   0.2  -0.1  -0.5   0.0  -0.8   0.9   2.7   3.6
Q  -2.4   0.2   0.0  -0.2  -0.2  -1.0   0.7   0.9   1.7   2.7
H  -1.3  -0.2  -0.3  -1.1  -0.8  -1.4   1.2   0.4   0.4   1.2   6.0
R  -2.2  -0.2  -0.2  -0.9  -0.6  -1.0   0.3  -0.3   0.4   1.5   0.6   4.7
K  -2.8   0.1   0.1  -0.6  -0.4  -1.1   0.8   0.5   1.2   1.5   0.6   2.7   3.2
M  -0.9  -1.4  -0.6  -2.4  -0.7  -3.5  -2.2  -3.0  -2.0  -1.0  -1.3  -1.7  -1.4   4.3
I  -1.1  -1.8  -0.6  -2.6  -0.8  -4.5  -2.8  -3.8  -2.7  -1.9  -2.2  -2.4  -2.1   2.5   4.0
L  -1.5  -2.1  -1.3  -2.3  -1.2  -4.4  -3.0  -4.0  -2.8  -1.6  -1.9  -2.2  -2.1   2.8   2.8   4.0
V   0.0  -1.0   0.0  -1.8   0.1  -3.3  -2.2  -2.9  -1.9  -1.5  -2.0  -2.0  -1.7   1.6   3.1   1.8   3.4
F  -0.8  -2.8  -2.2  -3.8  -2.3  -5.2  -3.1  -4.5  -3.9  -2.6  -0.1  -3.2  -3.3   1.6   1.0   2.0   0.1   7.0
Y  -0.5  -1.9  -1.9  -3.1  -2.2  -4.0  -1.4  -2.8  -2.7  -1.7   2.2  -1.8  -2.1  -0.2  -0.7   0.0  -1.1   5.1   7.8
W  -1.0  -3.3  -3.5  -5.0  -3.6  -4.0  -3.6  -5.2  -4.3  -2.7  -0.8  -1.6  -3.5  -1.0  -1.8  -0.7  -2.6   3.6   4.1  14.2
      C     S     T     P     A     G     N     D     E     Q     H     R     K     M     I     L     V     F     Y     W

Figure 1-2. A 20 x 20 matrix whose elements are 10 times the logarithm of the probability of the index amino acids being aligned by reason of common ancestry, divided by the probability of their being aligned by random chance, collected from aligned protein sequence pairs having ca. 0.15 amino acid replacements per site, and extrapolated to protein pairs having 2.5 amino acid replacements per site, based on the assumption that accumulation of accepted point mutation is a Markovian process, where subsequent point mutations are not influenced by predecessor point mutations.

Figure 1-3. (left) A part of the multiple sequence alignment relating the sequences of the ribonucleases from the eland, the ox, and the swamp and river buffaloes, covering sites 37-43. Note that the full sequence alignment has ca. 130 sites. Thus, the topology of the tree (right) is determined by the analysis of all of the sites, not just the seven sites shown here. (right) The evolutionary tree relating the sequences of the ribonucleases from the eland, the ox, and the swamp and river buffaloes. The eland sequence is, from paleontological and cladistic studies, chosen to be the outgroup. As such, it can provide a root for the tree containing just the sequences of the RNases from the ox, the swamp buffalo, and the river buffalo.
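The scoring rule described in the caption of Figure 1-2 can be written out as a short sketch. It assumes the logarithm is base 10, which is the usual convention for matrices of this type but is not stated explicitly in the caption; the function names are hypothetical.

```python
import math


def log_odds_score(p_ancestry: float, p_chance: float) -> float:
    """Score = 10 * log10(P(aligned by common ancestry) / P(aligned by chance))."""
    return 10.0 * math.log10(p_ancestry / p_chance)


def odds_ratio(score: float) -> float:
    """Invert the score to recover the ancestry-to-chance odds ratio."""
    return 10.0 ** (score / 10.0)


# Matrix entries read from Figure 1-2: Trp/Trp (14.2) and Phe/Gly (-5.2).
trp_trp = odds_ratio(14.2)
phe_gly = odds_ratio(-5.2)
print(f"W/W odds ratio: {trp_trp:.1f}")
print(f"F/G odds ratio: {phe_gly:.2f}")
```

Positive entries therefore mark pairings seen more often in related proteins than chance predicts, and negative entries mark pairings seen less often.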

Figure 1-4. A fragmentary specimen of Pachyportax latidens, from (presently) Pakistan, representing part of the horn core and skull of a bovid that approximates the ancestor of the buffaloes and modern oxen (from the British Museum of Natural History; photo courtesy of Steven Benner).

Figure 1-5. The amino acid at site 39 is methionine (Met, or M) in the RNase from both the swamp and the river buffalo. Therefore, a simple theory of evolution that holds that no change is substantially more likely than one change can account for the commonality of the amino acid residues by assuming that site 39 held a Met in all of the evolutionary intermediates between swamp and river buffalo. The probability that site 39 holds a Met is unity at every point along the line.

Figure 1-6. The inferences drawn from the RNase sequences of swamp and river buffalo do not differ if the tree is rooted. The probability of Met being at position 39 in the last common ancestor of the buffaloes is still unity.

Figure 1-7. The probabilistic values for the amino acid residue at site 38 of the RNase system morph smoothly along an evolutionary tree from 1.0 (for Ser) at the swamp buffalo leaf to 1.0 (for Asn, N) at the river buffalo leaf.

Figure 1-8. A rooted tree for site 38 shows probabilistic changes down each branch, with the change in time being positive as one proceeds from nodes higher in the tree to nodes lower in the tree. Thus, in the branch leading from the last common ancestor to the modern swamp buffalo, there is a 0.5 chance of a change from an ancestral Asn to a Ser. In the branch leading from the last common ancestor to the modern river buffalo, there is a 0.5 chance of a change from an ancestral Ser to an Asn.

Figure 1-9. A probabilistic model for site 39 in the last common ancestor of swamp and river buffalo. Note how the values sum to unity. Note also that the likelihood of different amino acids reflects the alignment probabilities of the scoring matrix in Figure 1-2.

Figure 1-10. In a maximum likelihood model, the probability of finding a Met at site 39, which is unity at the ends of the line (as we know that Met occupies site 39 in the sequences from modern swamp and river buffalo), decreases towards the middle, with the probability of all other amino acids being present at that site increasing from zero to a maximum at the midpoint of the tree.
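The behavior described for Figure 1-10 can be reproduced with a deliberately simplified sketch. It assumes an equal-rates Markov model over 20 amino acids (every replacement equally likely), which is an assumption made here for brevity and not the empirical matrix of Figure 1-2; the branch length of 0.5 replacements per site is likewise hypothetical.

```python
import math

N_STATES = 20  # number of amino acids


def p_same(distance: float) -> float:
    """Probability that a site holds the same residue after `distance`
    expected replacements per site, under an equal-rates model."""
    stationary = 1.0 / N_STATES
    decay = math.exp(-distance * N_STATES / (N_STATES - 1))
    return stationary + (1.0 - stationary) * decay


def posterior_met(fraction: float, total_length: float) -> float:
    """Posterior probability of Met at a point a given fraction of the way
    along a line whose two ends are both known to hold Met."""
    d_left = fraction * total_length
    d_right = (1.0 - fraction) * total_length
    # Met -> Met on both segments, normalized by Met -> Met end to end.
    return p_same(d_left) * p_same(d_right) / p_same(total_length)


for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"position {x:.2f}: P(Met) = {posterior_met(x, 0.5):.4f}")
```

The probability is unity at both ends and lowest at the midpoint, matching the shape described in the caption; an empirical replacement matrix would change the numbers but not this qualitative pattern.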

Figure 1-11. A parsimony analysis infers an Asn at site 38, as this inference permits the occupancy of site 38 in all of the proteins to be accounted for by a single change. Any other inference for site 38 would require our evolutionary model to assume more changes than the minimum absolutely required to account for the amino acids observed at site 38.

Figure 1-12. A maximum likelihood analysis infers an Asn at site 38 as the most probable residue. Note how the additional "prior" information alters the probabilities assigned to different amino acids at points throughout the tree. Note that the sum of the probabilities of Ser, Asn, and all other amino acids is unity at all points in the tree.

Figure 1-13. The ancestral residues inferred for specific sites at internal nodes in the tree also assign changes along individual branches in the tree.
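The parsimony inference of Figure 1-11 can be sketched with the standard Fitch algorithm on the four-taxon tree of Figure 1-3. The swamp buffalo (Ser) and river buffalo (Asn) residues are taken from the caption of Figure 1-7; the Asn assigned to the ox and the eland is an assumption made here so that a single change suffices, as the caption of Figure 1-11 describes. All function and variable names are hypothetical.

```python
from typing import Dict, Optional, Set, Union

Node = Union[str, tuple]  # a leaf name, or a pair of child nodes


def fitch_up(node: Node, leaves: Dict[str, str], sets: Dict[Node, Set[str]]) -> int:
    """Bottom-up pass: fill in candidate state sets, return the change count."""
    if isinstance(node, str):
        sets[node] = {leaves[node]}
        return 0
    left, right = node
    changes = fitch_up(left, leaves, sets) + fitch_up(right, leaves, sets)
    common = sets[left] & sets[right]
    if common:
        sets[node] = common
    else:
        sets[node] = sets[left] | sets[right]
        changes += 1  # disjoint child sets imply one replacement
    return changes


def fitch_down(node: Node, parent: Optional[str],
               sets: Dict[Node, Set[str]], assigned: Dict[Node, str]) -> None:
    """Top-down pass: keep the parent's state where the node's set allows it."""
    options = sets[node]
    state = parent if parent in options else sorted(options)[0]
    assigned[node] = state
    if not isinstance(node, str):
        for child in node:
            fitch_down(child, state, sets, assigned)


buffaloes = ("swamp_buffalo", "river_buffalo")
tree = ("eland", ("ox", buffaloes))  # eland as the outgroup
site38 = {
    "swamp_buffalo": "Ser",
    "river_buffalo": "Asn",
    "ox": "Asn",     # assumed for illustration
    "eland": "Asn",  # assumed for illustration
}

state_sets: Dict[Node, Set[str]] = {}
n_changes = fitch_up(tree, site38, state_sets)
ancestors: Dict[Node, str] = {}
fitch_down(tree, None, state_sets, ancestors)
print(n_changes, ancestors[buffaloes])
```

Under these assumed leaf states the buffalo ancestor is ambiguous after the bottom-up pass (Ser or Asn) and is resolved to Asn only once the ox and the outgroup are taken into account.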

Figure 1-14. With a different tree, the amino acid inferred at site 41 changes in a parsimony analysis. The ancestral amino acid at position 41 is a Lys for the ox and the eland, and a Ser for the two buffaloes. The ancestral amino acid at position 41 for ancestor P could be a Lys or Ser in this tree.

Figure 1-15. The distribution of amino acids at site 168 (left) and site 211 (right) in a set of 19 fungal alcohol dehydrogenases. The node of interest is at the right end of the red branch, marked with a *. Note the difficulty in reconstructing the amino acid at these sites at the node at the right end of the red branch.

Figure 1-16. Fossils of Eotragus (from the Musee d'Histoire Naturelle; photograph courtesy of Steven Benner).

Figure 1-17. Fossils of Leptomeryx, a primitive ruminant from the North American Oligocene (from the S. A. Benner collection; photograph courtesy of Steven Benner).

Figure 1-18. The evolutionary tree used in the analysis of ancestral RNases. Lower case letters in the nodes in the graph designate putative intermediates in the evolution of the protein family (see Table 1-3). Upper case letters (D and G) indicate the residue at position 38 in the contemporary and reconstructed RNases. The time scale is approximate. The tree was adapted from Beintema et al. with a single alteration to join the pig and the hippopotamus together in a separate subfamily that branches together from the main line of descent. In the Beintema-Fitch tree, the pig and the hippopotamus diverge from the main line at separate points.

Figure 1-19. A tree showing the divergence of seminal ribonucleases from ruminants. Two topologies are shown. The numbers in red next to the nodes indicate the number of different candidate ancestral proteins prepared to ensure that the ambiguity in the ancestral sequences was covered.

Figure 1-20. Tree showing the relationship of a variety of serine proteases. Chy with the prefix letters h, b, d, and m indicates the human, baboon, dog, rat, and mouse chymases, respectively. The ancestral chymase that was resurrected by Chandrasekharan et al. is also indicated. Bootstrap values are shown at the nodes.

Figure 1-21. Eomaia, a primitive placental mammal, and Sinodelphys, a primitive marsupial, at the Jurassic/Cretaceous boundary, ca. 125 Ma, among the best representative fossils from the time when mammalian protease systems were being developed (from the S. A. Benner collection; photograph courtesy of Steven Benner).

Figure 1-22. A tree relating the Pax genes. Note that Pax-4 is not included in the analysis. The resurrected ancestral sequences are denoted by ANI, ANII, and AN6, at the indicated nodes.

Figure 1-23. Visual pigments from vertebrates analyzed by Chang and her coworkers.

Figure 1-24. A fossil of Confuciusornis sanctus from the Jurassic/Cretaceous boundary, ca. 125 Ma. This represents the earliest bird known with a beak, and a taxon that lived after the archosaur whose visual pigment was resurrected by Chang and her coworkers.

Figure 1-25. An evolutionary tree of 21 vertebrate SWS1 pigments. The tree was inferred from both the sequence data and DNA-DNA hybridization data. The seven ancestral sequences that were resurrected (circled) were inferred by a likelihood-based Bayesian method from Yang. FFTFSTALS refers to the amino acids at the critical sites 46, 49, 52, 86, 90, 93, 114, 116, and 118 for the ancestral pigment. The UV pigments are boxed. The numbers after P and those at the nodes a-g refer to λmax values. The numbers next to nodal arrows indicate the total numbers of amino acid changes introduced in constructing ancestral pigments.

Figure 1-26. The phylogenetic tree of the vertebrate RH2 opsins. In vitro absorption maximum values (in nm) are indicated in parentheses. The values of the ancestral opsins at nodes A1 (Ancestor 1), A2 (Ancestor 2), and A3 (Ancestor 3), constructed using the JTT model, are also indicated. The branches A, B, and C denote the branches A1-A2, A2-zebrafish RH2-1, and A3-zebrafish RH2-3, respectively, and are emphasized with thick lines. The absorbance shifts that occurred along the branches A, B, and C are indicated. Adapted from Chinen et al.

Figure 1-27. The two unrooted universal trees used to reconstruct ancestral bacterial sequences. Archaea and Eukarya serve as outgroups for Bacteria, and thus provide a point at the base of the bacterial subtree from which ancient sequences can be inferred. Thermophilic lineages are highlighted in bold. a, maximum likelihood topology used to reconstruct the stem bacteria (ML-stem), or most recent common ancestor of bacteria, and the ancestral sequence for mesophilic lineages only (ML-meso). b, alternative topology used to reconstruct the stem bacteria (Alt-stem).

Figure 1-28. GDP binding assay to test thermostability of ancestral and modern EF proteins. The amount of tritium-labelled GDP bound at zero degrees was subtracted from all other temperature values for a given protein. Shown is the relative amount of GDP bound compared to the amount bound at the optimal temperature for each protein.

Figure 1-29. Evidence of stromatolites in the geologic record. A, Salt Lake of Uyuni, South America, ca. 2.7 Ga (from the E. A. Gaucher collection; photograph courtesy of Eric Gaucher). B, Dutch Creek dolomite from the Proterozoic Wyloo Group in the Ashburton basin, Western Australia. Age is ca. 2 Ga (from the S. A. Benner collection; photograph courtesy of Steven Benner). The morphology of the structures and the apparent place of deposition suggest that these organisms lived after the eubacteria whose protein was resurrected by Gaucher et al.

Figure 1-30. The pathway by which yeast makes, accumulates, and then consumes ethanol. Enzymes in red are associated with gene duplications that, according to the TREx clock, arose nearly contemporaneously. The make-accumulate-consume pathway is boxed. Note that the shunting of the carbon atoms from pyruvate into (and then out of, arrows in blue) ethanol is energy-expensive, consuming a molecule of ATP (green) for every molecule of ethanol generated. This ATP is not consumed if pyruvate is oxidatively decarboxylated directly to give acetyl-CoA to enter the citric acid cycle directly (dotted arrow to the right). If dioxygen is available, the recycling of NADH does not need the acetaldehyde-to-ethanol reduction.

Figure 1-31. Maximum likelihood trees interrelating sequences determined in this work with sequences in the publicly available database. Shown are the two trees with the best (and nearly equal) ML scores using the following parameters estimated from the data: substitutions A<->C, A<->T, C<->G, G<->T = 1.00, A<->G = 2.92, and C<->T = 5.89, empirical base frequencies, and proportion of invariable sites and the shape parameter of the gamma distribution set to 0.33 and 1.31, respectively. The scale bar represents the number of substitutions/codon/unit evolutionary time. Single, double, and triple asterisks represent bootstrap values greater than 50%, 70%, and 90%, respectively.

Figure 1-32. A histogram showing all of the pairs of paralogs in the Saccharomyces cerevisiae genome, dated using the TREx tool.

Figure 1-33. Right: Cretaceous fruit from Patagonia showing insect damage. Left: a similar example in extant fruit. From Jorge F. Genise (Museo Paleontológico E. Feruglio, Trelew, Argentina) ( 53ichnology.htm).

Figure 1-34. Phylogeny of GFP-like proteins from the great star coral M. cavernosa and closely related coral species. The red and cyan proteins from soft corals (dendRFP and clavGFP) represent an outgroup. The highlighted nodes (All, Red/Green, pre-Red and Red) were resurrected in the study reviewed here.

Table 1-1. Examples of molecular resurrections

Extant genes                   Ancestral gene resurrected                 Approximate age   References
                                                                          (million years)
Digestive ribonucleases        Ancestor of buffalo and ox                 5                 Benner, 1988; Stackhouse et al., 1990
Digestive ribonucleases        Digestive RNases in the first ruminants    40                Jermann et al., 1995
Lysozyme                       Ancestral bird lysozyme                    10                Malcolm et al., 1990
L1 retroposons in mouse        Ancestral rodent retrotransposon           6                 Adey et al., 1994
Chymase proteases              Ancestral ortholog in LCA of mammals       80                Chandrasekaran et al., 1996
Sleeping Beauty transposon     Active ancestral transposon from fish      10                Ivics et al., 1997
Tc1/mariner transposons        Ancestral paralog genomes of 8 salmonids   10                Ivics, Hackett et al., 1997
Immune RNases                  Ancestral ortholog LCA of higher primates  31                Zhang and Rosenberg, 2002
Pax transcription factors      Ancestral paralog                          600               Sun, Merugu et al., 2002
SWS1 visual pigment            Ortholog in LCA of bony vertebrates        400               Shi and Yokoyama, 2003
Vertebrate rhodopsins          Archosaur opsins                           240               Chang, Jonsson et al., 2002
Fish opsins (blue, green)      Fish opsins                                30-50             Chinen, Matsumoto et al., 2005
Steroid hormone receptors      Ancestral paralog                          600               Thornton et al., 2003
Yeast alcohol dehydrogenase    Enzyme at origin of fermentation           80                Thomson et al., 2005
Green fluorescent proteins     Ancient fluorescent proteins               ca. 20?           Ugalde, Chang et al., 2004
Isopropylmalate synthase       Ancestral eubacteria                       2500              Zhu et al., 2005
Isopropylmalate dehydrogenase  Ancestral archaebacteria                   2500              Miyazaki et al., 2001
Isocitrate dehydrogenase       Ancestral archaebacteria                   2500              Iwabata et al., 2005
Elongation factors             LCA of eubacteria                          3500              Gaucher et al., 2003

LCA: Last common ancestor. Ages are approximate, and in some cases conjectural.

Table 1-2. Axioms commonly incorporated into theories of protein sequence evolution

Site i suffers replacement independently of site j
Future replacements at site i are independent of past replacements
The rate of replacement at each site is the same

Table 1-3. Sequence changes in reconstructed ancient ribonucleases

Bovine RNase; Ancestral Sequences a b c d e f g h1 h2 i1 i2 j1 j2

  3  Thr  Thr  Thr  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser* Thr* Ser
  6  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Glu  Glu  Lys
 15  Ser  Ser  Ser  Ser  Ser  Pro  Ser  Ser  Ser  Ser  Ser  Ser  Ser
 16  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Gly* Gly* Gly* Gly* Gly
 17  Thr  Thr  Thr  Thr  Thr  Thr  Thr  Thr  Ser* Thr* Ser  Ser  Ser
 19  Ala  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser
 20  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Ser  Ser  Ser  Ser  Ser
 22  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Asn* Asn* Asn
 31  Lys  Lys  Lys  Lys  Lys  Gln  Lys  Lys  Lys  Lys  Lys  Lys  Lys*
 32  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Ser  Arg  Arg  Arg
 34  Asn  Asn  Asn  Asn  Asn  Asn  Asn  Asn* Lys* Lys* Lys* Lys* Asn
 35  Leu  Met  Met  Leu  Leu  Leu  Leu  Leu* Met  Met  Met  Met  Met
 37  Lys  Lys  Gln  Gln  Gln  Gln  Gln  Gln  Gln  Gln  Gln  Gln  Gln
 38  Asp  Asp  Asp  Asp  Asp  Asp  Asp  Asp  Gly  Gly  Gly  Gly  Gly
 59  Ser  Ser  Ser  Ser  Ser  Phe  Ser  Ser  Ser  Ser  Ser  Ser  Ser
 64  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Thr  Thr  Thr  Thr  Thr
 70  Thr  Thr  Thr  Thr  Thr  Ser  Thr  Thr  Thr  Thr  Thr  Thr  Thr
 76  Tyr  Tyr  Tyr  Tyr  Tyr  Asn  Tyr  Asn  Asn  Asn  Asn  Asn  Asn
 78  Thr  Thr  Thr  Thr  Thr  Ala  Thr  Thr  Thr  Thr  Thr  Thr  Thr
 80  Ser  Ser  Ser  Ser  Ser  His  Ser  Arg* Arg* Arg* His  His  His
 96  Ala  Ala  Ala  Ala  Ala  Val  Ala  Ala  Ala  Ala  Ala  Ala  Ala
100  Thr  Thr  Thr  Thr  Thr  Thr  Thr  Thr  Thr  Thr  Ser  Ser  Ser
102  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Ala  Val  Val  Val* Val* Val*
103  Asn  Lys  Lys  Lys  Glu  Glu  Glu  Glu  Glu  Glu  Gln  Gln  Gln

Reconstructed ancient sequences are designated by lower case bold letters. Amino acids marked with * indicate positions where assignment depends on ambiguous parsimony reconstructions, or might be changed by plausible reorganization of the tree. In several of these cases, multiple sequences were reconstructed; subscripts indicate alternative sequence reconstructions for one node in the tree.

Table 1-4. Kinetic properties of reconstructed ancestral ribonucleases

RNase    Ancestor of                 kcat/Km UpA  kcat/Km as %  poly(U) relative  poly(A)·poly(U) relative
                                     (x 10^6)     of RNase A    to RNase A        to RNase A
RNase A                              5.0          100           100               1.0
a        ox, buffalo, eland          6.1          122           106               1.4
b        ox, buffalo, eland, nilgai  5.9          118           112               1.0
c        b and the gazelles          4.5          91            97                0.8
d        Bovids                      3.9          78            86                0.9
e        Deer                        3.6          73            77                1.0
f        Deer, pronghorn, giraffe    3.3          67            103               1.0
g        Pecora                      4.6          94            87                1.0
h1       Pecora and seminal RNase    5.5          111           106               5.2
h2       Pecora and seminal RNase    6.5          130           106               5.2
i1       Ruminata                    4.5          90            96                5.0
i2       Ruminata                    5.2          104           80                4.3
j1       Artiodactyla                3.7          74            73                4.6
j2       Artiodactyla                3.3          66            51                2.7

RNase names refer to nodes in the evolutionary tree shown in Figure 1-18. All assays were performed at 25 °C.

Table 1-5. Thermal transition temperatures for reconstructed ancient ribonucleases

Enzyme    Tm [°C]  ΔTm [°C]
RNase A*  59.3     0.0
RNase A   59.7     +0.4
a         60.6     +1.3
b         61.0     +1.7
c         60.7     +1.4
d         58.4     -0.9
e         61.1     +1.8
f         58.6     -0.7
g         59.1     -0.2
h1        58.9     -0.5
h2        59.3     0.0
i1        58.2     -1.1
i2        58.7     -0.6
j1        56.5     -2.8
j2        57.1     -2.2

Thermal unfolding-proteolytic digestion temperatures (±0.5 °C) were determined by incubating the RNase ancestor in 100 mM NaOAc (pH 5.0) in the presence of trypsin.
expressed in E. coli
Boehringer Mannheim

Table 1-6. Percent sequence identity between ancestral and modern proteins

          Alt-stem  ML-meso  T.m.  T.a.  E.c.  G.s.
ML-stem   93        87       85    84    79    84
Alt-stem            89       84    80    80    83
ML-meso                      74    78    82    84

T.m. = Thermotoga maritima
T.a. = Thermus aquaticus
E.c. = Escherichia coli
G.s. = Geobacillus stearothermophilus

Table 1-7. Kinetic properties of ADH 1, ADH 2, and candidate ancestral ADHAs

Sample^1          KM EtOH  KM NAD+  KM acet.  KM NADH
                  (µM)     (µM)     (µM)      (µM)
Adh1              20060    218      1492      164
MKD               17280    511      1019      144
MKN               13750    814      1067      1106
MRD               11590    734      1265      287
MRN               10960    554      1163      894
MTD               10740    467      959       190
MTN               N/A      N/A      N/A       N/A
RKD               8497     449      1066      142
RKN               7238     407      1085      735
RRD               7784     400      1074      203
RRN               8403     172      1156      1142
RTD               6639     254      1083      316
RTN               7757     564      1158      477
Adh1^2            24000    240      3400      140
Adh1^3            17000    170      1100      110
Adh2^2            2700     140      45        28
Adh2^3            810      110      90        50
Adh3^3            12000    240      440       70
Adh1^3 (pombe)    14000    160      1600      100
Adh1(M270L)^3     19000    630      1000      80
KlP20369^4        27000    2800     1200      110
KlX64397^4        23000    2200     1700      180
KlX62766^4        2570     310      100       20
KlX62767^4        1560     200      3100      30

1 The three letters designate the amino acids at positions 168, 211, and 236; thus MKD = Met168 Lys211 Asp236. The remaining residues were the same as in Adh 1, except for the following changes (using sequence numbering of Adh1 from S. cerevisiae): Asn15 Pro30 Thr58 Ala74 Glu147 Leu213 Ile232 Cys259 Val265 Leu270 Ser277 Asn324.

Table 1-8. Duplication in the Saccharomyces cerevisiae Genome where 0.80 < f2 < 0.86

SGD name   gi number     Trivial name   Annotation and comments

Inosine-5'-monophosphate dehydrogenase family (3 paralogs, 3 pairs, 2 duplications)^1
f2 = 0.803^3. Pair associated with Wolfe duplication blocks 1 & 44
YAR073W    gi|456156     IMD1           nonfunctional homolog, near telomere, not expressed
YLR432W    gi|665971     IMD3           inosine-5'-monophosphate dehydrogenase
f2 = 0.825^3
YLR432W    gi|665971     IMD3           inosine-5'-monophosphate dehydrogenase
YHR216W    gi|458916     IMD2           inosine-5'-monophosphate dehydrogenase
Subfamily pair: YHR216W:YAR073W f2 = 0.93 (proposed recent duplication creating a pseudogene)

Sugar transporter family A (4 paralogs, 4 pairs, 3 duplications)^2
f2 = 0.805^3. Pair not associated with any duplication block
YJR158W    gi|1015917    HXT16          sugar transporter repressed by high glucose levels
YNR072W    gi|1302608    HXT17          sugar transporter repressed by high glucose levels
f2 = 0.806^3. Pair not associated with any duplication block
YDL245C    gi|1431418    HXT15          sugar transporter induced by low glucose, repressed by high glucose
YNR072W    gi|1302608    HXT17          sugar transporter repressed by high glucose levels
f2 = 0.809^3. Pair not associated with any duplication block
YJR158W    gi|1015917    HXT16          sugar transporter repressed by high glucose levels
YEL069C    gi|603249     HXT13          sugar transporter induced by low glucose, repressed by high glucose
f2 = 0.810^3. Pair not associated with any duplication block
YEL069C    gi|603249     HXT13          sugar transporter induced by low glucose, repressed by high glucose
YDL245C    gi|1431418    HXT15          sugar transporter
Subfamily pair: YEL069C:YNR072W f2 = 0.932 (proposed recent duplication)
Subfamily pair: YJR158W:YDL245C f2 = 1.000 (proposed very recent duplication)

Chaperone family A (2 paralogs, 1 pair, 1 duplication)^1
f2 = 0.81. Pair associated with Wolfe duplication block 48
YMR186W    gi|854456     HSC82          Cytoplasmic chaperone induced 2-3 fold by heat shock
YPL240C    gi|1370495    HSP82          Cytoplasmic chaperone, pheromone signaling, Hsf1p regulation

Phosphatase/thiamine transport family A (2 paralogs, 1 pair, 1 duplication)^2
f2 = 0.818. Pair not associated with any duplication block
YBR092C    gi|536363     PHO3           acid phosphatase implicated in thiamine transport
YBR093C    gi|536365     PHO5           acid phosphatase. One of three repressible phosphatases

Pyruvate decarboxylase family A (2 paralogs, 1 pair, 1 duplication)^2
f2 = 0.835. Pair not associated with any duplication block
YLR044C    gi|1360375    PDC1           pyruvate decarboxylase, major isoform
YLR134W    gi|1360549    PDC5           pyruvate decarboxylase, minor isoform
By ortholog analysis, S. bayanus (gi|515236) diverged from cerevisiae after the f2 = 0.835 duplication; Kluyveromyces diverged before.

Glyceraldehyde-3-phosphate dehydrogenase family (3 paralogs, 3 pairs, 2 duplications)^2
f2 = 0.845^3. Pair not associated with any duplication block
YJL052W    gi|1008189    TDH1           glyceraldehyde-3-phosphate dehydrogenase
YGR192C    gi|1323341    TDH3           glyceraldehyde-3-phosphate dehydrogenase

Table 1-8 (continued). Duplication in the Saccharomyces cerevisiae Genome where 0.80 < f2 < 0.86

SGD name   gi number     Trivial name   Annotation and comments

f2 = 0.845^3. Pair not associated with any duplication block
YJL052W    gi|1008189    TDH1           glyceraldehyde-3-phosphate dehydrogenase
YJR009C    gi|1015636    TDH2           glyceraldehyde-3-phosphate dehydrogenase
Subfamily pair: YJR009C:YGR192C f2 = 0.991 (proposed very recent duplication)

Alcohol dehydrogenase family (2 paralogs, 1 pair, 1 duplication)^2
f2 = 0.848. Pair not associated with any duplication block
YMR303C    gi|798945     ADH2           alcohol dehydrogenase, glucose-repressible
YOL086C    gi|1419926    ADH1           alcohol dehydrogenase, constitutive

Spermine transporter family (2 paralogs, 1 pair, 1 duplication)^1
f2 = 0.86. Pair associated with Wolfe duplication block 34
YGR138C    gi|1323230    TPO2           spermine transporter activity
YPR156C    gi|849164     TPO3           spermine transporter activity

Sugar transporter family B (3 paralogs, 3 pairs, 2 duplications)^2
f2 = 0.847^3. Pair not associated with any duplication block
YDR343C    gi|1230670    HXT6           sugar transporter, high-affinity, high basal levels
YDR345C    gi|1230672    HXT3           sugar transporter, low affinity glucose transporter
f2 = 0.854^3. Pair not associated with any duplication block
YDR342C    gi|1230669    HXT7           sugar transporter, high-affinity, high basal levels
YDR345C    gi|1230672    HXT3           sugar transporter, low affinity
Subfamily pair: YDR342C:YDR343C f2 = 0.994 (proposed very recent duplication)

1. Not associated with fermentation. These are associated with duplication blocks within the yeast genome, where the high value of f2 (typically equilibrated in block paralog pairs) may reflect either variance, or selective pressure to conserve silent sites in individual codons.
2. Associated with the pathway to make-accumulate-consume ethanol. Genes involved in the fermentation pathway that are not rate-limiting generally do not have duplicates in the yeast genome (e.g., hexokinase, glucose-6-phosphate isomerase, phosphofructokinase, aldolase, triose phosphate isomerase and phosphoglycerate kinase are all present in one isoform). Enolase has two paralogs (ENO1 and ENO2), where f2 = 0.946. These are distantly related to a homolog known as ERR1, with the silent sites equilibrated. Phosphoglycerate mutase has three paralogs, GM1, GM2 and GM3, with silent sites that are essentially equilibrated.
3. These pairs represent a family generated with a single duplication with 0.80 < f2 < 0.86, and subsequent duplication(s) in the derived lineages.

Paralog pairs are considered only if they have at least 100 aligned silent sites, and are not separated by more than 120 point accepted mutations per 100 aligned amino acid sites (PAM units). f2 = fraction of nucleotides conserved at two-fold redundant codon sites only, and only at sites where the amino acid is identical.
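
The definition of f2 given in the footnote can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build Table 1-8: the function name `f2`, the use of the standard genetic code, and the reading of "two-fold redundant" as "amino acids encoded by exactly two codons" are all choices made here for the example. The 100-silent-site and 120-PAM filters mentioned above are not applied.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumption: "two-fold redundant" means encoded by exactly two codons
# (Phe, Tyr, Cys, His, Gln, Asn, Lys, Asp, Glu). Stop codons (three) drop out.
TWO_FOLD = {aa for aa, n in Counter(CODON_TABLE.values()).items() if n == 2}


def f2(seq_a, seq_b):
    """Return (f2, number of sites) for two aligned, in-frame coding sequences.

    A codon position is counted only if both sequences encode the same
    two-fold redundant amino acid; it is conserved if the third bases match.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    conserved = sites = 0
    for k in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[k:k + 3].upper(), seq_b[k:k + 3].upper()
        aa_a, aa_b = CODON_TABLE.get(codon_a), CODON_TABLE.get(codon_b)
        # Skip gaps/unknown codons, amino acid replacements, non-two-fold sites.
        if aa_a is None or aa_a != aa_b or aa_a not in TWO_FOLD:
            continue
        sites += 1
        conserved += codon_a[2] == codon_b[2]
    return (conserved / sites if sites else float("nan")), sites


# Phe (TTT/TTC) and Asp (GAT/GAC) differ silently, Lys (AAA/AAA) is conserved,
# Gly is four-fold redundant and therefore ignored.
value, n_sites = f2("TTTAAAGATGGG", "TTCAAAGACGGA")
print(value, n_sites)
```

Under this reading, a pair of genes that duplicated very recently gives f2 close to 1, and the value decays toward its equilibrium as silent substitutions accumulate.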

CHAPTER 2
INTRODUCTION AND HYPOTHESIS

Introduction

A fundamental goal of biologists is to understand the physiological function of their favorite biomolecular systems. Many researchers study their biomolecules, as well as their relationships with other molecules, in vitro in the hope of reaching this goal. However, this goal is generally challenging, as the behaviors of a macromolecule measured in vitro need not be relevant to the physiological function.

In vivo studies could take the researcher one step closer to physiological relevancy. These are not, however, guaranteed to do so. Even if the bar is set lower, in vivo studies are often constructed to focus on a behavior (a mouse jumping off a hot plate) that is easy to measure and has some logical connection to a physiological role. Even here, the behaviors measured are not guaranteed to be physiological.

Connecting biomolecular behaviors that are easily measured in vitro to physiological roles and biological function is still more difficult. Ultimately, "function" is a Darwinian concept, and requires a measure of the fitness of an organism in an environment. A biological function is evolutionarily selected because it confers upon its host organism an advantage that contributes to its reproductive success. Obviously, in vivo experiments rarely place the organism in a natural ecological setting. Further, the connection between biomolecular behavior and fitness is rarely known.

The central point of this dissertation is that a combination of a bioinformatic analysis of the history of a biomolecule with experiments that resurrect ancestral biomolecular forms helps us connect biomolecular behaviors to physiological function, in the Darwinian sense of the term.

Here is the idea. In the course of the divergent evolution of a protein family, many in vitro biomolecular behaviors are gained, lost, or remain unchanged. In general, if biomolecular function is not changing, no change is the norm. If biological function is changing, however, some of the amino acids in an ancestral protein must be replaced by others to create different biomolecular behaviors to support new function. Many of these are, of course, measurable in vitro. The nature, tempo, and position in the three-dimensional structure of these replacements are all signatures that indicate the acquisition of new biomolecular function. Alternatively, the emergence of new physiology is often recorded in the fossil record, or in a cladistic analysis of the physiology of contemporary species. We propose that the new in vitro biomolecular behaviors that emerge at the same time as the acquisition of a new biomolecular function, as indicated by the bioinformatics analysis, are those most closely related to the new physiological function.

In general, we do not have in hand proteins that represent intermediate stages in the history of the evolution of a modern protein. We can, however, exploit recombinant DNA technology to resurrect these, based on a bioinformatics analysis. This allows researchers to get, in the laboratory, historical forms of their favorite molecule as it acquired new function. There, in the laboratory, they can analyze the accumulating changes, distinguishing functionally relevant in vitro biomolecular behaviors from irrelevant behaviors. This idea is the fundamental premise of this project.

The Phenomenology of Seminal RNase and the Proposed Hypothesis

Bovine seminal ribonuclease (BS-RNase) is an archetypal protein in the post-genomic age. We know it exists. It is abundant in a tissue (2% of bovine seminal plasma). We have no idea what it does.

BS-RNase was first isolated as an antispermatogenic factor by D'Alessio and coworkers by grinding up bull testes in an effort to purify such agents (D'Alessio et al., 1972). Matousek independently also isolated this protein (Matousek, 1973). Surprisingly, the sequence analysis showed that it was a homolog of pancreatic RNase (over 80% identical), whose function has been well explored and is believed to be well known.

Up to this time, members of the pancreatic RNase family were thought to play purely digestive roles. The role of seminal ribonuclease could not be digestive, yet it was found that it had a high catalytic activity against RNA substrates. Further, this protein was found to exhibit other in vitro behaviors that were novel.

A gene duplication of an ancestral ribonuclease gene that is thought to have had a digestive role gave rise to seminal ribonuclease (Jermann et al., 1995). This protein exhibits many in vitro behaviors that distinguish it from its closest homolog, pancreatic ribonuclease. Seminal RNase has ribonuclease activity but also exhibits unusual characteristics, like the natural occurrence of dimers. The seminal RNase dimer is joined by two intersubunit disulfide bonds (Figure 3-12), while RNase A is a monomer. In the dimer, seminal RNase exchanges residues 1-20 to form composite active sites, where key residues for catalysis come from both subunits (Ciglic et al., 1998) (Figure 3-12); in pancreatic RNase, the active site is built from residues that are all contained in the same polypeptide chain. Seminal RNase cleaves duplex RNA some 25 fold better than pancreatic RNase. Seminal RNase shows in vitro immunosuppressivity in the mixed lymphocyte culture assay (Filipec et al., 1996; Soucek et al., 1983; Soucek et al., 1996; Tamburrini et al., 1990) and is cytotoxic against many transformed cell lines in culture (Cinatl et al., 2000; Laccetti et al., 1992; Laccetti et al., 1994; Michaelis et al., 2002; Michaelis et al., 2000; Vescia et al., 1980). Pancreatic RNase shows little of these activities.

In the absence of evolutionary insight, seminal RNase is a perplexing collection of in vitro phenomena, similar to those encountered frequently in the biomedical literature. It is not clear which of the in vitro behaviors measured for seminal RNase that differ from those of pancreatic RNase are relevant to its new physiological function.

There are many conceivable hypotheses, however, for the physiological function of these in vitro behaviors. For example, immunosuppressive substances in the seminal plasma might confer selective value by protecting spermatozoa from the female immune surveillance. This protection is expected to confer a reproductive advantage on the males who have this protein, and generally on the whole species.

We used a paleogenetics approach to shed light on this problem. The evolutionary history of seminal ribonuclease shows a rapid phase of evolution, detected by a high ratio of non-synonymous to synonymous mutations. This phase starts at the ancestral node of the Bovidae, including the kudu, buffalo and bovine lineages, and continues specifically down the bovine lineage. Based on the molecular evolutionary analysis, we postulated that the acquisition of a new function happened along this branch.

We used ancestral reconstruction and resurrection of the ancestral ribonucleases that surrounded the episode of rapid sequence evolution to investigate the physiological function of seminal ribonuclease. Different in vitro biomolecular behaviors of seminal RNase were assayed to test the hypothesis that immunosuppression is the physiologically selected new function. These assays included the enzymatic activity, structural features of dimerization, and immunosuppression. From this, we shall conclude that immunosuppressivity directly, and dimeric S-peptide swap indirectly, are the features of the protein important for the new biological activity in bovine seminal plasma.

CHAPTER 3
BIOINFORMATIC ANALYSIS OF THE SEMINAL RNASE GENE SEQUENCES

Building Evolutionary Models

A basic model for the evolutionary history of a family of proteins comprises a tree, a multiple sequence alignment, and reconstructed ancestral sequences at nodes throughout the tree. This is a model about a historical reality that is in principle unknowable, but undoubtedly exists. That is, for mammalian evolution, there is little doubt that each of the species that we consider had a small breeding population as their last common ancestor, that a binary tree is an excellent representation of the actual history, and that subsequent to speciation, the transfer of genetic information rapidly became negligible.

Nevertheless, the historical sequence information is not randomly related to the sequence data that are knowable because they can be extracted from organisms living today. Therefore, given a theory about protein evolution, we can draw inferences about the ancestral states with a level of uncertainty that is in part statistical, and in part systematic, especially to the extent that the theory is wrong.

Inferences captured in the model are constructed under an evolutionary theory. We use the words "evolutionary theory" to comprise all parameters that are used to construct the models. The theory can be simple (for example, assuming that every amino acid is replaced by any other at the same rate) or complex (different substitution frequencies and rates). The theory can choose different numbers of parameters, which it can estimate from the data themselves, or from input from other sources. Each of these is an approximation for an individualized process.
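
To make the "simple" end of this spectrum concrete, the sketch below writes out the replacement probabilities for a theory in which every amino acid is replaced by any other at the same rate. The closed-form expression is the standard equal-rates (Poisson) Markov model; the function name `equal_rates_probability` and the convention that t is measured in expected replacements per site are assumptions made for this illustration, not something specified in the text.

```python
import math


def equal_rates_probability(same, t, n_states=20):
    """Probability that a site shows the same (or one particular different)
    amino acid after a branch of length t, under equal replacement rates.

    t is the expected number of replacements per site (an assumed convention).
    """
    decay = math.exp(-n_states * t / (n_states - 1))
    if same:
        return 1.0 / n_states + (n_states - 1) / n_states * decay
    return (1.0 - decay) / n_states


# On a short branch a site almost certainly keeps its amino acid; on a very
# long branch every amino acid becomes equally likely (1/20 each).
for t in (0.0, 0.1, 1.0, 50.0):
    stay = equal_rates_probability(True, t)
    change = equal_rates_probability(False, t)
    print(t, round(stay, 4), round(change, 4), round(stay + 19 * change, 4))
```

This theory has a single free parameter per branch (its length). Empirical matrices and codon models replace the single shared rate with many pair-specific rates, which is what "complex" means in the paragraph above.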

A particular model, including a particular tree, alignment, or set of ancestral sequences, is chosen over another because it better meets an optimality criterion. A particular theory is likewise chosen because it is frequently observed to generate models that are better, by the optimality criterion, than those generated by other theories. The optimality criterion is itself arbitrary, however. Common criteria based on sequence data alone include parsimony and maximum likelihood. Other criteria incorporate non-sequence data, including cladistic analysis of the contemporary physiology of the extant organisms that provide the modern sequences, and paleontological data, for example.

Maximum likelihood (ML) analyses are central to the work here. It should be noted that many theories with different parameters fall within the ML rubric. These are described briefly here.

Maximum Likelihood

The concept behind a maximum likelihood analysis recognizes that one can calculate the most likely values of parameters in a theory that are not themselves directly accessible, but which generate observable outcomes. ML analyses work backwards from these observables to these parameters.

A simple example, not terribly relevant to sequence analysis (adapted from a Joseph Felsenstein lecture, after minor mistakes were removed), illustrates this using the tossing of a coin with an unknown weighting. The coin has a probability p of coming up heads, and a probability (1 - p) of coming up tails. We do not know the value of p, but it can be estimated from the outcome of a finite number of tosses.

For example, suppose 12 tosses generate the following exact sequence of outcomes: HHTHTTTTHTTH (where H means that the coin comes up "heads", and T means the coin comes up "tails"). The likelihood L of getting this sequence of outcomes, expressed in terms of the unknown parameter p, is:

\[ L = p\,p\,(1-p)\,p\,(1-p)(1-p)(1-p)(1-p)\,p\,(1-p)(1-p)\,p = p^{5}(1-p)^{7} \]

Thus, if p is exactly equal to 0.5, then L = 0.000244. If the parameter p is chosen so that the coin is assumed to be weighted, preferring tails slightly, however, then a slightly higher likelihood is obtained for the outcome. A plot of L against p (Figure 3-1) shows that the maximum L is achieved when p = 5/12 ≈ 0.417. By calculation, if p = 5/12, then 1 - p = 7/12 ≈ 0.583, p^5 ≈ 0.01256, (1 - p)^7 ≈ 0.02298, and L ≈ 0.000289.

Obviously, the same maximum value can be obtained analytically, by differentiating the expression for L with respect to the variable p and equating the derivative to 0:

\[ \frac{dL}{dp} = p^{4}(1-p)^{6}\,\bigl[5(1-p) - 7p\bigr] = 0 \quad\Longrightarrow\quad p = \frac{5}{12} \]

The value of p at the peak is again found to be 5/12 ≈ 0.417, the observed fraction of heads.

This illustrates how to find the probability p of obtaining a head with the highest likelihood based on the data. In other words, given the model (probability of tossing a head is p), the value of p with the highest likelihood based on the data (the results of the coin tosses) is 0.417.

This example can serve as an analogy to the process of finding the parameters of an evolutionary model (here, the parameters are the lengths of branches between nodes and leaves in a tree). If one were to optimize the branch length in a tree, the x axis in Figure 3-1 (coin toss figure) would be replaced with branch length. The branch length with the highest likelihood is picked for the corresponding branch.

Of course, the exercise becomes much more complicated when solving complex problems. Finding the best tree, or detecting adaptive protein evolution, requires the optimization of multiple parameters at the same time. For example, if only two parameters, such as the lengths of two different branches in a tree, were to be optimized, the plot becomes three-dimensional, where the x and z axes are the branch lengths and the y axis is the likelihood (Figure 3-2).

The likelihood of a tree is equal to the product of the likelihoods of all sites in the data, given the ancestral state and the branch length (both components of the evolutionary model) linking the ancestral state to the examined site. Consequently, the likelihood of the tree is as follows:

\[ L = \Pr(D \mid T) = \prod_{i=1}^{\mathrm{sites}} \Pr(D[i] \mid T) \]

L is the likelihood, D represents the sites in the alignment, D[i] represents one site (i), and T is the tree. This formula is rendered into a sum by taking the natural log of all parts of the equation:

\[ \ln L = \ln \Pr(D \mid T) = \sum_{i=1}^{\mathrm{sites}} \ln \Pr(D[i] \mid T) \]

For example, the log likelihood of the tree in Figure 3-3 is calculated by the following formula:

\[ \ln L = \ln \sum_{z} \sum_{w} \Pr(z)\,\Pr(w \mid t_3, z)\,\Pr(A \mid t_1, w)\,\Pr(C \mid t_2, w)\,\Pr(G \mid t_4, z) \]

Sources of Data

The maximum likelihood formalism allows, of course, for only certain kinds of data. Other sources of data are possible. Classical methods for modeling the historical relationship of taxa, for example, combined a cladistic analysis of characters in the physiology of extant organisms, supported by a fossil record that is more or less incomplete. These methods could be used in isolation or in combination with sequence data to try to capture as much signal as possible in order to find the best phylogeny. The seminal ribonuclease family of genes comes from a family of animals with a rich paleontological record and sequence data, allowing us to use the full range of the available approaches.

Sources of Ambiguity

ML tools provide probabilistic ancestral sequences, where each site carries each of the 20 standard amino acids associated with a number, from zero to unity, that represents the likelihood that that amino acid will be found at that site in the particular ancestor.

In an evolutionary analysis where the most representative ancestral protein sequences are sought, one has to consider all of the sources of ambiguity that can bias the results. Building the multiple sequence alignment can be a source of significant ambiguity, especially when the sequences are highly divergent and vary in length, and the evolutionary history contains multiple insertions/deletions that require the placement of gaps in the multiple sequence alignment.

Another source of ambiguity arises from uncertainty in the connectivity of the phylogenetic tree. This ambiguity comes from several sources. First, different evolutionary theories can generate different trees. Thus, maximum parsimony, minimum distance, and maximum likelihood methods need not generate the same trees. In general, the trees differ in the order of branching around short branches in the tree. It is well known that maximum parsimony reconstructions are very sensitive to these different topologies, while ML reconstructions are not and are consequently considered superior.

Building an Evolutionary Model for the Seminal RNase

With these thoughts in mind, we set out to build an evolutionary model for seminal RNase. Here, specific characteristics of the collection of sequences mean that certain problems are more severe, and certain others are less severe.

For example, the multiple sequence alignment itself was not a source of ambiguity in the case of seminal RNase. The seminal RNase genes have equal or very similar lengths, indicating a history that is largely free of events where segments of the gene suffered insertions or deletions (indels). Thus, the multiple sequence alignment (Figure 3-4) has few gaps. This implies that we can ignore a problematic multiple sequence alignment as the source of ambiguity in this work.

The low overall sequence divergence within the seminal RNase family also has a disadvantage. It means that the multiple sequence alignment contains few sites that are informative about the topology, implying that different trees with different topologies have similar scores. To manage this problem, we returned to the fossil record, and to sequences from other proteins that help establish the species tree.

The fossil record of artiodactyls is rather complete starting about 40 million years ago. The most primitive artiodactyl, Diacodexis, was described by Rose, and is believed to antedate the divergence of pigs and hippopotamuses as non-ruminant artiodactyls. By the early Oligocene, however, 38 Ma, the pigs and the ruminants are clearly distinct in the fossil record, with genera such as Archaeomeryx and Mericoidon representing the first and second suborder (respectively) in the Oligocene record from the Nebraska badlands (Schmidt, 1989).

Many years ago, Beintema (Beintema et al., 1988), Miyamoto (Allard et al., 1992) and others combined molecular sequence data for proteins, including RNase genes, to generate a consensus view of the relationship of various artiodactyl genera. This work influenced the initial design of this project. Earlier this year, however, these and other analyses were subsumed within two major publications, both in Biological Reviews, that provide combined (and purportedly the last word) analyses of the divergent evolution of ruminant artiodactyls, and the proposed Cetartiodactyla order that includes the cetaceans, hippopotamuses, the pigs and their relatives, and the ruminant artiodactyls (Hernandez Fernandez and Vrba, 2005; Price, Bininda-Emonds and Gittleman, 2005).

As discussed below, the various trees differ depending on the data available when they were constructed, and four alternative trees that score closely were considered in this analysis.

The seminal RNase is part of a larger superfamily of proteins that has at least 8 paralogs in the bovine genome, and perhaps 11 in the human genome. This requires us to choose an outgroup when building an evolutionary model for the seminal RNase family. Two families of RNases, brain and pancreatic, are the only truly plausible outgroups.

It should be noted that the fossil record does not help here. One possibility is that we can infer the outgroup by examining the order of the paralogs on the relevant chromosomes. Beintema has begun this analysis for humans, but data are at present inadequate for the artiodactyl families. For this reason, we relied on a classical approach to identify the outgroup, and systematically considered both pancreatic RNase (our preference for an outgroup) and brain RNase as alternative outgroups.

Inferring Trees

Four methods (Bayesian analysis as implemented by the program MrBayes, and three versions of ML implemented in PAUP) that were applied to generate trees gave the same three trees as being the highest scoring. Different methods ranked these trees in a different order of preference. Although the precise rankings based on the scores delivered by these methods are different, the scores of the three trees differ among themselves only slightly. We term these the "top three topologies". These were:

1. Topology 1. Preferred by us when the work began, and the highest scoring via a full-blown Bayesian analysis. Topology 1 groups the okapi with the deer, and represents the saiga and duiker as diverging separately from the lineage leading to ox after the divergence of deer (Figure 3-5).

2. Topology 2. This topology is preferred today (Hernandez Fernandez and Vrba, 2005; Price, Bininda-Emonds and Gittleman, 2005) based on a global analysis of all available sequence and paleontological data. It places the okapi as an outgroup separate from the deer and diverging before the deer diverge from oxen. It also groups the saiga and the duiker (Figure 3-6).

3. Topology 3. This topology places the okapi together in a clade with the deer, and places the saiga and duiker together in a clade (Figure 3-6).

In addition, we explored Topology 4 (Figure 3-5), which was considered reasonable when this work began, but less so now in light of data that have subsequently emerged. Here, the okapi is in a clade separate from the deer, but diverges from the lineage leading to oxen after the deer diverge. The duiker and saiga again were represented as diverging separately from the lineage leading to ox after the divergence of deer.

We can view the different reconstructions as the products of 2 trees × 2 outgroups × 3 models of the seminal RNase data. The two topologies in question are Topology 1 and Topology 4.

We then explored the origin of the differences between the trees. First, we examined ten sites that displayed variation within the seminal lineage that is consequential to the branching of the tree. These offer interesting, sometimes textbook, cases of homoplasy, which are summarized briefly in Table 3-1.

Inferring Ancestral Sequences

Given this level of homoplasy, there is no "correct" answer within the standard maximum likelihood methods that examine each site individually. Rather, we must manage this level of ambiguity in the topology of the trees when we construct the ancestral sequences.

As might be expected, the most serious impact of the homoplasy, and the subsequent ambiguity in the precise topology of the trees, on the sequences of ancestral proteins is seen in ancestral proteins that are deep in the tree. In particular, we were concerned about the sequence inferred for the founder seminal ribonuclease (An19), the protein that stands at the head of the first divergence within this family. Different trees represent this ancestor differently: as the last common ancestor of the okapi RNase and all of the other seminal RNases in one case (Topology 2, for example), versus the last common ancestor of the (okapi-deer) seminal RNases and the remaining seminal RNases in another. We therefore expect the ambiguity to be most severe here.

Before we discuss the specifics of the seminal RNase reconstructions, we briefly review alternative approaches to constructing ancestral sequences.

156 Constructing the Ancestral Sequences When constructing ancestral sequences, even when a specific tree has been chosen, different inferences for different ancestra l character states can arise depending on the model that is used to infer the ancestral character state from the derived sequences. In this study, approaches based on ma ximum likelihood tools were used to reconstruct the ancestral sequences to be resurrected. Three different evolutionary theories were used in this exercise, on e based on the amino acid sequence and two on codon models based on the DNA sequence. Chap ter 1 introduced thes e approaches in general. In this section more detail s of the likelihood method is presented. As for ML approaches to estimate br anch lengths (discussed above), ML approaches to infer ancestral character states follow a statistical principle that given the data at a site, the conditional probabilities of different recons tructions can be evaluated. The reconstruction having the highest conditiona l probability is then selected to be the best inference for that site. This likelihood me thod generates a probabili ty associated with each character with the purpose to evaluate th e accuracy of the ancestral inference. This approach was first introduced by Yang and coworkers (Yang, Kumar and Nei, 1995). To illustrate this method we will consider a simple tree as an example shown figure 3-7. Here, x represents the amino acid at the extant sequences and xi represents a specific amino acid at extant sequence i (extant sequences on the tree range from 1 to 4). y represents the ancestral nodes and yi is the specific amino acid at the ith node (the ancestral nodes on the tree are 5 and 6). In a likelihood analysis an evolutionary model is used. For example an empirical model of am ino acid substitutions (Benner, Cohen and Gonnet, 1992;Jones, Taylor and Thornton, 1992; Kosiol and Goldman, 2005) can be used. 
In this case Pi is the amino acid frequency of amino acid i and Pij(t) is the transition

probability from amino acid i to amino acid j during time t. The other parameters in this empirical amino acid model are the branch lengths θ = [t1, t2, t3, t4, t5]. The tree is rooted arbitrarily at interior node 5. The first calculation is the probability of obtaining the data x as a sum over all the possibilities of y:

$$f(x;\theta) = \sum_{y_5}\sum_{y_6} f(y)\, f(x \mid y;\theta) = \sum_{y_5}\sum_{y_6} P_{y_5}\, P_{y_5 y_6}(t_5)\, P_{y_5 x_3}(t_3)\, P_{y_5 x_4}(t_4)\, P_{y_6 x_2}(t_2)\, P_{y_6 x_1}(t_1)$$

f(y) is the prior probability of y and f(x|y;θ) is the conditional probability of observing x given y. f(x;θ) means that f is a function of x with parameters θ.

To estimate y (the ancestral state) we need to calculate the conditional probability of y given the data x (the known sites in the extant sequences):

$$f(y \mid x;\theta) = \frac{f(y)\, f(x \mid y;\theta)}{f(x;\theta)}$$

We can estimate y by maximizing this conditional probability. In order to estimate the accuracy of an inferred ancestral state, the posterior probability f(y|x;θ) of y is calculated at sites with amino acids x. This calculation is as follows:

$$\Pr(y\ \text{correct}) = \sum_{x} f(x)\, f(y \mid x;\theta)$$

In this way we can estimate the ancestral states and calculate a corresponding probability representing the accuracy of the inference. In this example the only parameters are branch lengths, since the model is an empirical amino acid model with replacement probabilities between amino acids. In the case where a codon model is used, other parameters such as transition/transversion rates, equilibrium nucleotide frequencies and relative rate parameters for the different base substitutions are also optimized in the analysis.
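The calculation above can be illustrated with a minimal Python sketch for the four-leaf tree (leaves 1 and 2 attached to node 6, leaves 3 and 4 attached to node 5, branch t5 joining the two interior nodes). This is not the model used in this work: as an assumption made only to keep the example self-contained, an equal-rates (Poisson) amino acid model with uniform frequencies stands in for the empirical JTT matrix, and the function names and example data are hypothetical.

```python
import math
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
K = len(AMINO_ACIDS)  # 20 states


def p_transition(i, j, t):
    """P_ij(t) under an equal-rates (Poisson) model; a stand-in for JTT."""
    decay = math.exp(-K * t / (K - 1))
    if i == j:
        return 1.0 / K + (1.0 - 1.0 / K) * decay
    return (1.0 - decay) / K


def joint_posterior(x, t):
    """Posterior f(y | x; theta) over (y5, y6) at one site.

    x = (x1, x2, x3, x4) are the extant amino acids,
    t = (t1, t2, t3, t4, t5) are the branch lengths (theta).
    """
    x1, x2, x3, x4 = x
    t1, t2, t3, t4, t5 = t
    prior = 1.0 / K  # uniform frequency P_{y5} (assumption of this sketch)
    joint = {}
    for y5, y6 in product(AMINO_ACIDS, repeat=2):
        joint[(y5, y6)] = (
            prior
            * p_transition(y5, y6, t5)
            * p_transition(y5, x3, t3)
            * p_transition(y5, x4, t4)
            * p_transition(y6, x2, t2)
            * p_transition(y6, x1, t1)
        )
    fx = sum(joint.values())  # f(x; theta), the sum over all (y5, y6)
    return {y: value / fx for y, value in joint.items()}


# Hypothetical site: leaves 1 and 2 carry Lys, leaves 3 and 4 carry Gln.
posterior = joint_posterior(("K", "K", "Q", "Q"), (0.1, 0.1, 0.1, 0.1, 0.1))
best = max(posterior, key=posterior.get)  # most probable (y5, y6) pair
```

The reconstruction with the highest posterior probability is taken as the inference for the site, and its posterior value plays the role of the accuracy estimate described above.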

Empirical Amino Acid Model

Chapter 1 described the empirical replacement model. It was used in the reconstruction of seminal ribonuclease. This model is based on a replacement matrix calculated from a large empirical data set. In this case it was the JTT 92 matrix (Jones, Taylor and Thornton, 1992).

Codon Models

The early models in molecular evolution were based on either nucleotides or amino acids as the evolutionary unit, where these units are assumed to evolve independently. There is, of course, more data present at the DNA level. This level of information is most valuable when the level of divergence between sequences is low. When evolutionary distances increase, the information at the DNA level can become noisier as the saturation of silent mutation sites increases. In this case the protein sequence can serve as a better source of data, as the noise is removed in the translation process, but some information is lost.

The obvious compromise treats the codon as the evolutionary unit, so that both sources of information, the nucleotide and the amino acid, are joined to provide a more complete data set. For this purpose the codon evolutionary models were developed by Goldman and Yang (Goldman and Yang, 1994) and implemented in the program package PAML (Yang, 1997), a program used extensively in this part of the project. These models make full use of the information on non-synonymous and synonymous mutations in sequence data. Moreover, codon models take into account the evolutionary dependence among the three positions in a codon. This co-evolutionary side is largely ignored in the pure nucleotide models.
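To make the idea of the codon as the evolutionary unit concrete, the following Python sketch enumerates the codons reachable from a given codon by a single nucleotide change and sorts them into synonymous, non-synonymous and stop outcomes. It uses the standard genetic code; the function names are hypothetical and the sketch is illustrative only, not part of PAML.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def single_step_neighbours(codon):
    """All codons that differ from `codon` at exactly one position."""
    neighbours = []
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                neighbours.append(codon[:pos] + base + codon[pos + 1:])
    return neighbours


def classify(codon):
    """Sort the single-step neighbours by their effect on the protein."""
    aa = CODE[codon]
    result = {"synonymous": [], "nonsynonymous": [], "stop": []}
    for n in single_step_neighbours(codon):
        if CODE[n] == "*":
            result["stop"].append(n)
        elif CODE[n] == aa:
            result["synonymous"].append(n)
        else:
            result["nonsynonymous"].append(n)
    return result


lys = classify("AAG")  # one of the two Lys codons
```

A pure nucleotide model would treat all of these changes alike, whereas a codon model can assign different rates to the synonymous and non-synonymous classes.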

The codon model used in the analysis assumes that only single-nucleotide mutations can occur at a time, as the probabilities of two or more occurring simultaneously are minute and can be ignored. It follows that each codon can change to 9 possible other codons via a single mutation. The rate of change to each of these 9 possible codons is proportional to the calculated equilibrium frequency (Pj) of codon j. This parameter introduces base frequency and codon usage into the evolutionary model.

The codon equilibrium frequencies can be calculated in different ways. In this ancestral reconstruction we used two separate ways, as a variation of the model, to test the robustness of the reconstruction. In the first codon-frequency calculation method (1X4), the frequencies are estimated from the sequence data using the average nucleotide frequencies. In the second method (3X4), the frequencies are estimated from the average nucleotide frequencies at each of the three codon positions: the nucleotide frequencies in the first, second and third positions in a codon are estimated separately, and the frequency of a specific codon becomes the product of three estimated base frequencies.

Another parameter in the model, κ, represents the different mutational rates of transitions versus transversions. This factor accounts for the higher occurrence of mutational transitions versus transversions.

The physico-chemical properties of each amino acid were included in the original version of the model, based on the matrix generated by Grantham (Grantham, 1974), but this apparently positive addition hurt the performance of the model rather than helped it. This is clearly a situation of overparametrization, where adding more parameters to describe reality can enhance the weight of noise in the data to the detriment of the weight of real signal. Accordingly, it was omitted from the applied model, as was suggested by Ziheng Yang in the description

of his software package PAML. The last parameter in the model is ω, which describes the selective constraint on mutations causing non-synonymous changes versus synonymous changes. The ω parameter will be discussed in more detail in the section of this chapter covering detection of adaptive evolution.

The parameters of the codon model are related by the instantaneous rate matrix, Q = {qij}, where qij is the substitution rate from codon i to codon j. The codon substitution model is described by the following conditions, where μ is the rate:

$$q_{ij} = \begin{cases} 0 & \text{if } i \text{ and } j \text{ differ at two or three codon positions} \\ \mu P_j & \text{for a synonymous transversion} \\ \mu \kappa P_j & \text{for a synonymous transition} \\ \mu \omega P_j & \text{for a non-synonymous transversion} \\ \mu \omega \kappa P_j & \text{for a non-synonymous transition} \end{cases}$$

Here Pj is the equilibrium frequency of codon j, κ is the transition/transversion rate ratio, and ω is the non-synonymous/synonymous rate ratio.

Sources of Ambiguity in Ancestral Sequence Inference

Even given a tree with a defined topology, various features of a dataset can generate ambiguity in the amino acids inferred at specific sites. One source of ambiguity is rapid sequence divergence. Here, if the tree is insufficiently articulated with respect to the number of amino acid replacements at a site, the historical information at that site is lost. In seminal RNase, this was a negligible problem relative to the homoplasy discussed above.

This homoplasy also created far greater ambiguity than that arising from different inferences from different theories. Once a tree and an outgroup were chosen, the resulting ambiguities were few. Those that existed were captured within ancestral sequences generated from different trees. This is just another way of saying that ambiguity in the

precise topology of a tree ended up moving an amino acid replacement to an adjacent branch of the tree, not removing it from the tree entirely.

The robustness of ancestral sequences at critical nodes in the seminal RNase family contrasts with other recent examples from this laboratory, in particular the resurrection of ancestral alcohol dehydrogenase by Mike Thomson, where a 3 x 2 x 2 ambiguity created critical issues at one node of interest.

The Ancestral Sequences for Seminal RNase

The ambiguity in the tree topology created by the homoplasy discussed above (Table 3-1) most dramatically shifts the detailed branching of the okapi, the deer, and the subsequently diverging ruminants near the roots of the tree representing the archetypal seminal RNase. Therefore, the ambiguity in the ancestral sequence inferences is most severe at that sequence, which we refer to generically as An19.

To manage this ambiguity in An19, we constructed a list of all possible ancestral sequences for this archetypal seminal RNase drawn from all of the trees (4), with the two outgroups (brain and pancreatic) and both theories (codon and amino acid) for reconstructing ancestral sequences. A neighbor-joining distance tree, generated by the program PAUP* (Swofford, 1998), that interrelates these candidate ancestral sequences is shown in Figure 3-8.

Of the 17 leaves in the tree, a total of 12 were prepared in this work. Further, this tree represents the "sequence space" of this root seminal RNase. Thus, we have sampled a substantial fraction of this space in this work, even if we assume that all of the inferred ancestral sequences are equally probable. Of course, they are not, and our choice of the 12 to sample is biased by what we believe to be the most likely, or what the methods generate as a partial consensus. But as is indicated above, the dynamics of new

data, sequence and paleontological, means that any preference may not be time invariant for some period in the future, until the last genome is sequenced and the last fossil is discovered.

As is discussed in depth in the next chapter, various properties of the ancestors did not differ substantially in many of the in vitro assays applied. This suggests that any interpretation of the behavior of the ancestral protein will be robust with respect to the ambiguity in the ancestral sequences created by ambiguity in the precise branching order and by variation in the evolutionary theory used to infer sequences.

The final sets of sequences actually examined reflect Topology 1 (where all ancestors were specifically resurrected) and Topology 4 (where all ancestors were specifically resurrected). The ancestors (An19) for Topologies 2 and 3 were sampled as a consequence of these explicit resurrections.

For the more derived nodes, none of the ambiguities had significant consequences for the ancestral sequences inferred. Only two candidate ancestral sequences needed to be considered for An24, and only three needed to be considered for An25.

Interpreting the Evolutionary Model

The inference of ancestral sequences was made as the starting point for a study in experimental paleogenetics. The results of these experimental studies are collected in the next chapter. It is well known, however, that certain inferences can be drawn about function in a protein family by examining the details of the molecular evolution itself, without needing to resurrect the ancestral forms for study in the laboratory. We discuss here the inferences that can be drawn from the details of molecular evolution independent of any experimental data collected on ancestral forms.

Seminal RNase in Oxen is the Product of a Recent Episode of Adaptive Evolution

Axiomatic in the Darwinian model for molecular evolution is the notion that changes in the structure of a gene accumulate randomly. The rate at which they are fixed in the population depends, however, on whether they confer selective advantage, or disadvantage, or neither, on the host organism. If the last, then they accumulate in the population in a process known as "drift", at a rate that is (under the neutral theory of evolution) independent of the size of the breeding population. In contrast, mutations that confer positive Darwinian advantage or disadvantage accumulate in a population faster or slower than the rate of drift, in processes known as positive adaptation and negative purifying selection, respectively.

In principle, it is possible to distinguish episodes where a gene was evolving under purifying selection, adaptive positive selection, and neutral drift by reconstructing the historical rates at which mutations accumulated in a lineage. In practice, one cannot determine a rate at a single site or at a single moment in geological time. Therefore, rates are generally analyzed over the entire length of the protein sequence over a branch between two explicitly reconstructed ancestors.

This in itself is problematic, as the signature of a brief episode of adaptive evolution can easily be lost in a long branch of an evolutionary tree. Further, it is clear that many individual amino acids must be conserved, even in a protein that is undergoing functional change, to preserve the core fold and other core functional behaviors. Thus, rapid evolution of a few sites may be lost as a signature for adaptive evolution in a longer protein sequence.

These problems are intrinsic in any analysis based purely on sequence data. Complicating the analysis is the fact that time is rarely known directly from an analysis

of sequence data. Therefore, it is not trivial to determine the rate of protein sequence change in units denominated by geological time.

Fortunately, the structure of the genetic code provides an opportunity to solve the second problem. For example, the third position of a codon can frequently be changed without changing the sequence of the encoded protein. Therefore, substitutions at these silent sites cannot be removed by natural selection, at least if it operates at the level of the protein behavior. Likewise, natural selection cannot favor silent nucleotide substitutions by exerting pressure at the level of the encoded protein. Only if selection pressure operates before the completion of translation will there be any pressure. This is believed to be minimal in higher organisms.

Therefore, silent substitutions are often taken as a proxy for nearly neutral changes. They are generally believed to accumulate at the rate of neutral drift. In genomic analyses, the rate of neutral drift is calculated by reconstructing the rate of nucleotide sequence change at silent sites.

Once the rate of neutral change is known, the rate of non-synonymous change can be compared with it on any particular branch. If it is higher, then adaptive selection is concluded. If it is lower, then this constitutes evidence for purifying selection. A pseudogene will have the same rate at all sites, regardless of whether the site is formally silent or not.

For this reason, we analyzed the rates of synonymous and non-synonymous substitutions separately, and took their ratio, after normalizing for the number of sites of each type. Episodes of evolution with a low number of non-synonymous substitutions compared to the number of synonymous substitutions, normalized for the number of sites

of each kind, are a sign that purifying selection is removing the non-synonymous mutations that, by Darwinian axiom, must be occurring in the non-silent sites randomly.

Conversely, episodes of evolution with a large number of non-synonymous substitutions compared to the number of synonymous substitutions, normalized for the number of sites of each kind, suggest that mutant forms of the protein conferred more fitness than non-mutant forms, and that these are accumulating by positive selection.

Last, episodes of evolution where the number of non-synonymous substitutions is the same as the number of synonymous substitutions, after normalizing for the number of sites of each kind, suggest that the entire gene is drifting with neither positive selection nor purifying selection; this is expected for a pseudogene.

The normalization is managed using simple mathematical transformations. First, one determines the number of synonymous mutations per synonymous site (rS) and the number of non-synonymous mutations per non-synonymous site (rN). The time (t) and number of generations also need to be considered, but since it is generally difficult to have this information it is more useful to write these rates in the following form: dN = rN x t and dS = rS x t, so that the ratio dN/dS eliminates the time parameter. The ratio of dN to dS is a measure of selective pressure (ω = dN/dS).

Maximum Likelihood Model for Adaptive Evolution

When ω is less than unity, non-synonymous mutations are deleterious and the gene is under purifying selection. When ω is equal to unity, both types of mutations are equally likely and neutral evolution is inferred. For the purpose of this discussion, pseudogenes are expected to have ω values of unity. Finally, should ω be higher than unity, we conclude that the gene is under positive (adaptive) selection.
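The three-way reading of ω can be sketched in a few lines of Python. This is a simplified illustration, not the maximum likelihood estimate computed by PAML: it uses raw counts of substitutions per site without any correction for multiple hits, the tolerance band around unity is an arbitrary assumption of the sketch, and all function names and example counts are hypothetical.

```python
import math


def omega(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return dN/dS from raw counts, or None when dS is zero."""
    dn = nonsyn_subs / nonsyn_sites  # non-synonymous changes per non-synonymous site
    ds = syn_subs / syn_sites        # synonymous changes per synonymous site
    if ds == 0:
        return None  # the ratio is undefined when no silent change is observed
    return dn / ds


def interpret(w, tol=0.05):
    """Map an omega value onto the three regimes described in the text."""
    if w is None:
        return "undefined (dS = 0)"
    if w < 1 - tol:
        return "purifying selection"
    if w > 1 + tol:
        return "positive (adaptive) selection"
    return "neutral evolution"


# Hypothetical branch: 6 non-synonymous changes over 300 sites,
# 4 synonymous changes over 100 sites.
w_branch = omega(6, 300, 4, 100)
regime = interpret(w_branch)
```

Returning None when dS is zero is a deliberate choice of this sketch: on very short branches the ratio carries no information, however large it may appear.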

The likelihood model governing this ratio is the same as described earlier, when the codon model for ancestral reconstruction was discussed. Several ways exist to implement the calculation, however. To name a few, the ratio can be calculated separately for each branch of the tree, for one group of branches clustered in one unit, or for each site in the sequence. The choice of the method depends largely on the dataset. A more powerful calculation of a separate ω, for example, demands a large amount of sequence data and a certain level of divergence.

Unfortunately, the trifurcation (purifying, neutral, adaptive) is not clean, for reasons mentioned above. Because some sites are important to intrinsic function, a dN/dS of 0.8 may actually indicate an episode of adaptive positive evolution; it is not higher than unity simply because the sites rapidly suffering amino acid replacement are lost in a large number of sites that are not changing because they are conserved to maintain intrinsic fold or core function.

Seeking Signatures of Adaptive Evolution in the Seminal Ribonuclease Gene Family

To use this metric to detect adaptive change in the seminal RNase family, the phylogenetic data were first analyzed to calculate ω (ω = dN/dS) separately for each branch of the tree, for Topology 1 and Topology 4, with both the pancreatic and brain outgroups. Figures 3-10 and 3-11 show the ω values for each branch in the tree calculated using PAML. In these analyses, κ (the transition to transversion ratio, which is a parameter that can be set) was estimated from the sequence data and incorporated into the theory.

As seen in Figures 3-10 and 3-11, the values of ω for individual branches are generally below unity. However, a few branches (indicated in red and blue) have

ω values greater than unity. This would normally indicate that these branches represent episodes of positive selection.

One difficulty constrains this conclusion. As a ratio, ω is uninformative if dS is a small number, even if dN is also a small number. This is equivalent to saying that the time represented by the branch is short. A small value for dS creates artificially high (and imprecise) ω values that are easily recognized (ω = 999 or 551, for example, in Figures 3-10 and 3-11). In the divergence of ruminants, several branch lengths do indeed reflect short times, making the appearance of such artificial values not surprising.

Using PAML, ω can be different for each separate branch of the tree. PAML uses ω as a parameter, however, meaning that to allow ω to differ on each branch of the tree would involve introducing, in this case, 28 parameters into the theory. This is problematic, as one can easily model noise with this many parameters.

Based on the results obtained in the first analysis, which calculated a separate ω for each branch, a set of simpler models was designed to test the robustness of the episodes of adaptive evolution that appeared in the model due to high ω values. The branches with ω higher than unity were placed in groups. When two adjacent branches qualified, they were placed in the same group (Figures 3-10 and 3-11). These groups are designated G1, T1PG2, T2PG3, T2PG2, and T2PG3.

Depending on the topology and the choice of outgroup, we used four trees derived from Topology 1 and Topology 4. Here, 3 or 4 groups of branches (designated G1, T1PG2, T2PG3, T2PG2, and T2PG3), depending on the choice, had ω values greater than unity for each tree. Simpler models, where the ω values were not allowed to vary freely, were built to include all combinations of the suspected adaptive evolution branches,

again from Topology 1 and Topology 4. These had a single ω imposed on each group of branches. Tables 3-2 and 3-3 illustrate the resulting models. Using PAML, each dataset (specific topology and outgroup) was analyzed by all of the corresponding models. These analyses resulted in ω values and likelihood scores representing the goodness of fit of each model. The ω values are shown in Tables 3-2 and 3-3.

Assessing the Significance of the Output

The Akaike Information Criterion (AIC) was used to evaluate the best model and the robustness of the episodes of adaptive evolution that might be detected in this analysis. The AIC is a formalism to determine which of n models best approximates an unknown truth (Posada and Buckley, 2004). It is an unbiased estimator of the expected relative Kullback-Leibler information distance (K-L distance). This distance represents the amount of information lost when model A is used to approximate a model B. This distance does not tell us where the truth is. Rather, it only evaluates which of the compared models is closer to the truth, and whether one model is better at describing the data than others.

Of course, any model with more parameters will have a better likelihood score (describing how well the model fits the data); merely comparing these scores ignores whether there is merit in including more parameters. The AIC includes a penalty for added parameters, and uses this penalty to evaluate whether the addition of parameters is beneficial or detrimental. It evaluates whether we are getting closer to the truth or getting farther away from the truth as we add parameters. The addition of parameters not warranted by the amount of data can distract from the truth, as noise is modeled and presented as real signal.

The AIC is represented by the following formula:

$$\mathrm{AIC} = -2 \ln L + 2K$$

In this equation ln L is the logarithm of the likelihood, and K is the number of estimated parameters. Another way to look at the AIC is as a measure of the amount of information lost when we use a particular model to describe the data. The preferred model is the one with the smallest AIC. As more parameters are added to the model, the first term (likelihood) becomes smaller, representing an improved fit of the model, while the second term (the number of parameters) increases, acting as the penalty term.

Computing the AIC is not enough to finish the comparison, as it is a relative scale and the AIC differences (between compared models) are more meaningful. The AIC difference is represented by the following formula:

$$\Delta\mathrm{AIC}_i = \mathrm{AIC}_i - \min \mathrm{AIC}$$

In this equation AICi is the AIC for model i, and minAIC is the smallest AIC in the group of models compared. These AIC differences permit quick ranking of the compared models.

Another way to evaluate the difference between the studied models based on the AIC is through the Akaike weights (wi). The weights rescale the AIC differences so that the models can be compared on a scale from 0 to 1. The Akaike weights are represented by the following formula:

$$w_i = \frac{\exp\left(-\tfrac{1}{2}\,\Delta\mathrm{AIC}_i\right)}{\sum_{r=1}^{R} \exp\left(-\tfrac{1}{2}\,\Delta\mathrm{AIC}_r\right)}$$

In this equation wi is the Akaike weight for the ith model, and R is the number of models compared. The AIC, ΔAIC and wi were applied to the compared ω models. Tables 3-4 and 3-5 show the AIC results.
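The three quantities just defined can be computed together, as in the following Python sketch. The model names, log-likelihoods and parameter counts in the example are hypothetical placeholders chosen for illustration; they are not the values reported in Tables 3-4 and 3-5.

```python
import math


def aic(log_likelihood, n_params):
    """AIC = -2 ln L + 2K."""
    return -2.0 * log_likelihood + 2.0 * n_params


def akaike_table(models):
    """Map each model name to (AIC, delta AIC, Akaike weight).

    `models` maps a name to a (ln L, K) pair.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores.values())
    deltas = {name: score - best for name, score in scores.items()}
    relative = {name: math.exp(-0.5 * d) for name, d in deltas.items()}
    total = sum(relative.values())
    return {
        name: (scores[name], deltas[name], relative[name] / total)
        for name in models
    }


# Hypothetical (ln L, K) values for three nested omega models.
hypothetical_models = {
    "one_ratio": (-2500.0, 30),
    "G1+0": (-2494.0, 31),
    "free_ratio": (-2480.0, 57),
}
table = akaike_table(hypothetical_models)
preferred = min(table, key=lambda name: table[name][0])  # smallest AIC
```

In this made-up example the free-ratio model has the best likelihood but pays for its extra parameters, which is the behavior the penalty term is meant to capture.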

As suspected, in all cases the more complex model, which allowed each branch of the tree to have a different ω, performed much worse than all of the simpler models where fewer ω values were allowed than the number of branches on the tree. In the analyses using the pancreatic outgroup, the best model among those compared had a single ω for the two branches from An24 to An26 (G1 representing 24..25..26 branches) and one ω for the remaining branches of the tree (G1+0). The models capturing the G1 branches within a single group, all having the same ω, performed better than if they did not include branch 24-26 (leading to the seminal RNase in the modern ox).

In the analyses with the brain outgroup, the best performing models (G1+T1BG3+0 and G1+T2BG3+0) also consistently included the branch grouping designated as G1. In the analyses with the brain outgroup, G3 (T2BG3 and T1BG3) does have a robust signal as well, without being supported by the analyses where pancreatic RNases serve as an outgroup.

The Only Branch with Significantly High Values Leads to Modern Seminal Bovine RNase

The "take home lesson" from these analyses is that regardless of the model, the outgroup or the tree, the outcome of the AIC analysis is that molecular evolutionary theory requires the conclusion of adaptive evolution in the branch leading to the modern seminal RNase in ox, in its three forms (gaur, Brahman, and the ox), and requires it in no other branch of the tree.

It is important to note that this conclusion is robust with respect to every issue of ambiguity that is discussed above. The model testing exercise, spanning different adaptive evolution models and different phylogenetic models (outgroups and topologies), shows that the conclusion that an episode of adaptive evolution connects ancestor An24 to An26

and to bovine seminal RNase is robust to any variation in all models. In other words, ω > 1, representing a phase of adaptive evolution, proved to be the strongest signal in the analysis. Prior to any experimental data, the evolutionary analysis predicts a change in function that spans this period. If we are successful in selecting the right assayed activity, the one relevant to the physiological activity, we should observe a change in function in the ancestral proteins present in this period.

Gene Duplications

The existence of a branch of positive adaptation in the An24-An26 branch is curious when set within the larger context of the history of the seminal ribonuclease gene. Many studies acknowledge the possibility that gene duplication might be followed by an episode of adaptive evolution, which leaves its signature as an episode with a high omega value. Indeed, starting with Haldane (Haldane, 1932) and Muller (Muller, 1935) and later elaborated by Ohno (Ohno, 1970), it has occasionally been postulated that gene duplication is the only way for new gene function to arise. There might be a few exceptions (e.g. alternative splicing and gene shuffling), but the molecular record provides ample support for the Ohno hypothesis, in metazoans in particular.

Duplication may also serve simply to produce more of a single gene product when needed. The duplication of genes could also be partial, creating several domains within a protein and, in doing so, increasing the functional complexity of a protein. Gene duplications evidently also occur in large blocks (Wolfe and Shields, 2001), and perhaps cover a whole genome. What is clear in any case is that duplications are key to the diversity of functional adaptation within specific gene families, and these adaptations are central to metazoan biology.

What is curious about the episode of rapid sequence evolution, and the inference of the emergence of new function in this case, is that it occurred so long after the gene duplication that separated the seminal lineage from the pancreatic lineage. The Fernandez-Vrba analysis (Hernandez Fernandez and Vrba, 2005) suggested that the divergence of chevrotains (the most primitive ruminants) from the remaining ruminants occurred 50 million years ago. The An24-An26 branch occurred only 1 million years ago. What was seminal RNase as a gene family doing in the meantime?

Every model for Darwinian evolution suggests that this amount of time is more than sufficient for the seminal gene to have rotted into pseudogene status. Benner and Trabesinger-Ruef estimated that this would occur with a half time of 14 million years for a protein the size of RNase, with the residues known to be essential for function occurring. The more standard number, not generated for RNase specifically, is 10 million years (Lynch and Conery, 2000).

Indeed, considerable evidence suggests that many of the seminal RNase genes are pseudogenes in extant ruminants. All of the ruminant artiodactyl genomes that have been examined in the Benner, Beintema, and Breukelman laboratories have contained a seminal RNase gene (Kleineidam et al., 1999; Trabesinger-Ruef et al., 1996). Inspection of seminal plasma reveals, however, no seminal RNase or its associated catalytic activity. The sole exception to this is the seminal plasma from sheep, where classical protein chemistry showed that the RNase present was the digestive RNase, not a protein derived from the seminal RNase gene.

This suggested that the seminal RNase genes are pseudogenes, although no studies have strictly ruled out their expression in another tissue, except in the case of buffalo

(Kleineidam et al., 1999). Consistent with this model are defects in the coding regions of the genes for seminal RNase from various artiodactyls. These include frame-shifts, premature stop codons, and (in general, but not in seminal RNase) lesions affecting the splicing site or regulatory elements. As it is known from wet biochemistry that amino acids 119-124 at the end of the protein are essential for folding, virtually any stop or frame shift at any point in the gene necessarily creates a non-functional protein. Indeed, in this dissertation work, one of these genes was accidentally created by an unintentional frame shift at codon 114; the protein precipitated.

Many defects are found in seminal RNase genes from the extant ruminants studied. In the seminal RNase gene from lesser kudu, a single base deletion in codon 114 creates a frame shift. In Cape buffalo, an eleven base deletion in codons 54 to 57 results in a frame-shift in the rest of the gene. In giraffe, a 16 base deletion from nucleotide 32 to 48 also causes a frame shift, as well as the loss of important amino acids.

In one case, a lesion can be placed on an ancestral branch of the tree. The genes for seminal RNase from both the hog deer and the roe deer have, in nucleotides 83 to 85, a five-nucleotide insertion that creates both a frame shift and a stop codon. The commonality of this lesion suggests that it occurred before the hog deer and roe deer diverged, but after the deer lineage diverged from the bovid lineage.

This observation is quite significant, as the last common ancestor of hog deer and roe deer lies deep in the cervid tree, dated by Fernandez and Vrba to have lived ca. 19.4 million years ago, in the Miocene. We can be confident that the seminal RNase gene in cervids was a pseudogene ever since Hydropotes inermis (Chinese water deer, the most primitive deer) diverged

19.7 Ma. Obviously, inspection of the H. inermis genome would allow us to decide whether the lesion that caused the deer gene to become a pseudogene was older.

Other features of the contemporary seminal RNase genes in ruminants suggest that the encoded protein could not have catalytic or binding activity against RNA. For example, the sequence from giraffe has suffered the replacement of His 12, an active site residue that has been shown in the Benner group to be essential for both binding to RNA and catalysis, by a leucine. In the hog deer encoded sequence, the same histidine is replaced by a tyrosine. The active site Lys 41 is replaced in both hog deer and roe deer seminal RNase with an arginine. Last, Cys 72, which usually forms a disulfide bridge with Cys 65, is replaced in the hog deer protein by a Ser.

Other ruminant seminal RNase sequences (forest buffalo, water buffalo, duiker, saiga and okapi) are free from lesions in the coding region. The protein is not found in their seminal plasmas (Trabesinger-Ruef et al., 1996).

The lack of common lesions in contemporary artiodactyls prevents us from using molecular evolutionary analysis to establish a pseudogene status for the protein along the lineage leading to the modern ox.

If the pseudogenes for seminal RNase proteins from extant ruminants all had the same lesion, then simple rules of molecular analysis would allow us to place that lesion deep within the seminal RNase tree, and drive the inference that through much of the history of the seminal RNase gene family, the gene was a pseudogene. This would then lead to the remarkable conclusion that a gene that went inactive 40 million years ago was resurrected within the past 1 million years to deliver a functioning protein.


This is not the case, however. All of the lesions detected within coding regions of the seminal RNase genes from extant ruminants are different. While the common gene lesion in the hog deer and roe deer allows us to date the lesion event at between 19 and 20 million years ago in that specific lineage, no other lesion discussed above is held in common between the various proteins. Even the loss of histidine 12 in the giraffe and the deer cannot be placed in a last common ancestor of these two taxa, as the amino acid replacement is different in the two species.

This has left open one of the longer standing conundrums in the molecular evolution of the RNase family. The molecular evolutionary analysis makes nearly indisputable the conclusion that the protein presently found in modern oxen has a function different from the protein in the last common ancestor of Bovidae and buffalo. Sequence analysis makes nearly indisputable the conclusion that the seminal RNase gene has no function in any other modern ruminant, and that it did not have any function near the time of divergence of the major species of cervids. Yet the molecular evolutionary analysis implies that no lesions damaged the RNase gene in the 50 million years between the time that it was created and the time that it underwent recruitment for new function in the modern ox.

Tests for Pseudogeneness Based on Non-Markovian Features of Sequence Evolution

Research in various laboratories has suggested more specific tests for gene function than dN/dS. These include higher order analyses of molecular evolution, which relax the first order Markovian features of the standard models for sequence evolution, including the notion that future mutations are independent of past mutations.

The fact that individual sites do not sample randomly from all 20 amino acids is captured in the obvious homoplasy observed at 10 sites in the protein sequence. As noted above, this homoplasy created significant problems in identifying a preferred topology for the evolutionary tree.

While the statistical analysis of homoplasy remains an unsolved problem, such homoplasy is a sign of selection pressure, as it indicates a fixation of amino acid replacements that does not reflect random selection from all 20 possible amino acids, but rather a constraint to consider just two (or a few). For example, the K/Q homoplasy at site 55 involves the interconversion of codons for Lys (AAG and AAA) and Gln (CAA and CAG). Of the 9 single nucleotide substitutions that are possible starting with (for example) the AAG codon for Lys, only one generates a Gln codon (CAG). The remaining 8 generate two Asn codons (AAT, AAC), one stop signal (TAG), one Met codon (ATG), one Thr codon (ACG), one Glu codon (GAG), one Arg codon (AGG), and one codon (AAA) that leaves the encoded amino acid unchanged. For K and Q to have interconverted at least three times (each with probability 1/9, or 0.11, if all nine substitutions are treated as equally likely) without nucleotide substitutions that generate the other amino acids is significantly unlikely by any 95% confidence limit test, and suggests selective pressure operating at that site.

Structural and Evolutionary Analysis Combined: A Test for Pseudogeneness

An alternative approach to analyzing molecular function, developed by Gaucher and Benner for proteins such as elongation factors and alcohol dehydrogenase and modified by Sassi for this project, involves the coupling of information from structural biology and chemistry to molecular evolution.

Pseudogenes, by definition, must accumulate mutations anywhere in the sequence. The lack of selective pressure gives all mutations (with the exception of factors such as differential rates between transitions and transversions) an equal opportunity. A common ancestral pseudogene would therefore be expected to have mutations randomly distributed in the three dimensional crystal structure of the protein. Failure to observe such a distribution during an episode of sequence evolution would be an indicator that a gene family is not a pseudogene during that episode.

To test this idea, the replacements introduced along the lineage connecting An19 to An25 were mapped onto the three dimensional structure of seminal ribonuclease (Figure 3-12). At first look, all of the ancestral replacements appear to be on the surface of the protein. The hydrophobic core of the protein, important to proper folding, is free of mutations. Also, two enzymatically important structural units, the active site (His12, His119 and Lys41) and the RNA binding region (Thr45, Phe120, Ser123, Gln69, Asn71, Lys1, Lys7, Arg10, His12, Lys41, His119, and Lys66), are free of replacements. As noted above, this is not the case in the deer lineages, which are known to be pseudogenes.

Statistical Support for the Distribution of the Ancestrally Replaced Amino Acids on the Structure of RNase

DSSP (Definition of Secondary Structure of Proteins) is a program designed to standardize secondary structure assignment and to extract other data, such as solvent accessibility, from x-ray coordinates generated from the structure of proteins. We applied this program to two different x-ray structures of seminal RNase, the monomer and the dimer. Using another program, written in the Benner group, we used the DSSP results to classify the residues as solvent accessible or inaccessible, using 10% as the cutoff to distinguish between the two classes of residues (buried when accessibility < 10% and exposed when accessibility ≥ 10%). Figure 3-13 shows the DSSP analysis.

Seminal RNase has 124 amino acids, of which 18 are ancestrally replaced. Table 3-6 shows the buried versus exposed structural distribution of all of the amino acids in BS-RNase and of the ancestrally replaced amino acids (residues in play). The classification was obtained from the DSSP output.

To test the statistical significance of the observed surface distribution of the ancestrally replaced residues, a Chi-square test was performed.

H0: The null hypothesis to be rejected by the Chi-square test holds that p1 = p1,0 and p2 = p2,0, where p1,0 and p2,0 are the hypothesized fractions of buried and exposed amino acids, based on the calculated buried and exposed fractions of the amino acids not in play, and p1 and p2 are the calculated fractions of buried and exposed amino acids in play.

Ha: The alternative hypothesis is that at least one of the fractions (amino acids in play that are buried and exposed) does not equal its hypothesized value.

$$X^2 = \frac{[n_1 - E(n_1)]^2}{E(n_1)} + \frac{[n_2 - E(n_2)]^2}{E(n_2)}$$

where $E(n_i) = n\,p_{i,0}$, n is the total sample number, and n1 and n2 are the number of buried and exposed amino acids in play, respectively.

Chi-square calculation based on BS-RNase dimer crystal structure (counts from Table 3-6: n = 36, n1 = 5, n2 = 31):

$$X^2 = 9.6437 > \chi^2_{0.005} = 7.87944 \quad \text{for one degree of freedom}$$

Chi-square calculation based on BS-RNase monomer crystal structure (counts from Table 3-6: n = 18, n1 = 2, n2 = 16):

$$X^2 = 6.47725 > \chi^2_{0.025} = 5.02389 \quad \text{for one degree of freedom}$$

The null hypothesis (H0) can be rejected in both cases, the monomer and the dimer crystal structures. This statistical test supports the conclusion that the distribution of the amino acids in play (ancestrally replaced amino acids) is different from that of the rest of the amino acids in the protein, with 99.5% (dimer structure) and 97.5% (monomer structure) confidence.

Conclusion for Derived Pseudogenes

Taken together, the comparison of pseudogene lesions, the structural mapping of ancestral replacements, and the patterns of nonsynonymous and synonymous mutations build a strong case that the inactivation of the seminal ribonuclease gene occurred recently and separately in the different lineages (with the exception of the deer lineage, where we estimate that the lesions occurred at least 19 million years ago). Further, these results, and more specifically the structural mapping, predict that the ancestral proteins will be enzymatically active. These predictions are tested against the experimental data in Chapter 4.

Figure 3-1. Coin toss example illustrating the principle of Maximum Likelihood. The value of p (the probability of obtaining heads) corresponding to the maximum likelihood of the data (sequence of coin tosses) is 0.454. This figure was taken from the lecture of Joseph Felsenstein (Woods Hole, Summer 2004).
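The maximum likelihood principle illustrated in Figure 3-1 can be sketched numerically. The example below is only an illustration and assumes 5 heads in 11 tosses; these counts are not given in the caption and were chosen because 5/11 is about 0.4545, which agrees with the reported value of 0.454. The function names are hypothetical.

```python
import math


def log_likelihood(p, heads, tails):
    """Log-likelihood of a sequence of independent tosses with P(heads) = p."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


def grid_mle(heads, tails, steps=10000):
    """Scan p over the open interval (0, 1) and return the best grid value."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(p, heads, tails))


# Assumed counts (not taken from the figure): 5 heads and 6 tails.
HEADS, TAILS = 5, 6

p_grid = grid_mle(HEADS, TAILS)       # numerical maximum of the likelihood
p_closed = HEADS / (HEADS + TAILS)    # analytical maximum: heads / total

print(f"grid estimate {p_grid:.4f}, closed form {p_closed:.4f}")
```

The grid scan mirrors what the figure shows graphically: the likelihood is evaluated for many candidate values of p, and the value at the peak is taken as the estimate. For branch lengths in a phylogenetic tree the same idea applies, but the peak is found by numerical optimization over several parameters at once.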


Figure 3-2. Three-dimensional plot representing the likelihood distribution for two parameters (two branch lengths in a phylogenetic tree) in an evolutionary model. The chosen values of the branch lengths correspond to the maximum likelihood value for both parameters.

Figure 3-3. A hypothetical phylogenetic tree relating nucleotide sites (A, C, and G). Branch lengths are represented by the parameters t1, t2, t3, and t4. Ancestral states at the site are represented by w and z.


Figure 3-4. The multiple sequence alignment of the protein sequences of the seminal genes and sequences of the RNase genes used as outgroups in this study (pancreatic and brain RNases). This alignment was generated by the program Clustal X.

Figure 3-5. This figure represents two tree topologies of the seminal RNase gene family (Topology 1 and Topology 4) that were considered in this study to infer ancestral sequences. Topology 1 is the topology preferred by a Bayesian analysis. The ancestral nodes studied in this project are indicated on Topology 1 (An19, An22, An23, An24, An25, An26 and An28). The numbers in red below the ancestral nodes indicate the number of ancestral candidates considered. The outgroup shown includes three RNase sequences.


Figure 3-6. This figure represents two tree topologies (Topology 2 and Topology 3) representing the phylogenetic relationship of the seminal RNase gene family. These topologies were among the three best selected by the bioinformatics analysis. The outgroup shown includes three RNase sequences.

Figure 3-7. Hypothetical phylogenetic tree used as an example to illustrate ancestral state reconstructions. Numbers 1, 2, 3 and 4 represent the extant characters. Numbers 5 and 6 represent the ancestral states. t1, t2, t3, t4 and t5 represent branch lengths.


Figure 3-8. A distance tree representing the differences between all 17 inferred ancestral candidates for node An19. The differences arise from different trees, outgroups and inference models. The leaves boxed in red represent the candidates sampled in this study. T1, T2, T3 and T4 correspond to the 4 topologies discussed earlier in the chapter. The letters B and P in the designations refer to the outgroups (Brain and Pancreatic). The letters AA refer to the amino acid model. The numbers 3 and 1 at the end of the designations refer to the codon models. The length of the branches represents the difference between the ancestral candidate sequences; for example, An19_T4B1 and An19_T4B3 (upper right hand corner) differ by one amino acid.


Figure 3-9. Multiple sequence alignment showing the ancestral sequences considered in this study. The ancestral nodes are An19, An22, An23, An24, An25, An26, and An28. An26 is the ancestral protein of the ox lineages and is incidentally the same as BS-RNase. The ancestral nodes where ambiguity in the reconstruction was found are represented by more than one ancestral candidate sequence.

Figure 3-10. The values of ω for each branch in Topology 1 under the pancreatic and brain outgroups are shown. The branches on the tree highlighted in red and blue are those with ω values higher than unity for the pancreatic (red) and brain (blue) outgroups.


Figure 3-11. The values of ω for each branch in Topology 4 under the pancreatic and brain outgroups are shown. The branches on the tree highlighted in red and blue are those with ω values higher than unity for the pancreatic (red) and brain (blue) outgroups.


Figure 3-12. The dimeric structure with swapped domains (amino acids 1-20) of BS-RNase. Only the side chains of the relevant amino acids are shown. The active site residues are represented in grey, and the ancestrally replaced amino acids are in red. The RNA binding site is shown. The pdb accession code for this structure is 1BSR.


                              10        20        30        40        50        60
BS-RNase sequence     KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAVCSQ
BS-RNase struct(D1)   KESAAAKFERQHMDSGNSPSSSSNYaNLMMc RKMTQGKdKPVNTFVHESLADVKAVeSQ
Secondary struct(D1)  CCCHHHHHHHHHBCCCCCCCCGGGHHHHHHH CCCCCCCCCCEEEEECCCHHHHHGGGGC
Solvent access (D1)   9464234042100254960133530031001 2601676025400001242650430082
BS-RNase struct(D2)   KESAAAKFERQHMDSGNSPSSSSNYgNLMMc RKMTQGKhKPVNTFVHESLADVKAViSQ
Secondary struct(D2)  CCCHHHHHHHHHBCCCCCCCCGGGHHHHHHH CCCCCCCCCCEEEEECCCHHHHHGGGGC
Solvent access (D2)   9465234042100149110635710030001 2302766025300000242650430082
Monomer struct(M)     KESAAAKFERQHMDSG     SSNYaNLMM  RKMTQGK KPVNTFVHESLADVKAVcSQ
Secondary struct(M)   CCCHHHHHHHHHBCCC     CCCHHHHHH  CCCCCCC CCEEEEECCCHHHHHGGGGC
Solvent access (M)    9454234042100259     986103410  2502776 24500001142550420182

                              70        80        90       100       110       120
BS-RNase sequence     KKVTCKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKPSVPVHFDASV
BS-RNase struct(D1)   KKVTfKNGQTNfYQSKSTMRITDaRETGSSKYPNdAYKTTQVEKHIIVAeGGKPSVPVHFDASV
Secondary struct(D1)  EEECCCCCCCCEEECCCCEEEEEEEECCCCBCCBCCEEEEEEEECEEEEEECCCCEEEEEEEEC
Solvent access (D1)   6616086665302207650500004339918467220612426320100028974201612226
BS-RNase struct(D2)   KKVT KNGQTN YQSKSTMRITDgRETGSSKYPNhAYKTTQVEKHIIVAiGGKPSVPVHFDASV
Secondary struct(D2)  EEEC CCCCCC EECCCCEEEEEEEECCCCBCCBCCEEEEEEEECEEEEEECCCCEEEEEEEEC
Solvent access (D2)   6518 669375 2107540200207339918567120624636330000029974201510126
Monomer struct(M)     KKVTdKNGQTNdYQSKSTMRITDaRETGSSKYPN AYKTTQVEKHIIVAcGGKPSVPVHFDASV
Secondary struct(M)   EEECCCCCCCCEEECCCCEEEEEEEECCCCBCCB CEEEEEEEECEEEEEECCCCEEEEEEEEC
Solvent access (M)    6516077665301208640400004339817468 10623737320000037973102411136

Figure 3-13. DSSP solvent accessibility classification of the monomeric (pdb 1N3Z) and dimeric (pdb 1BSR) structures of BS-RNase. The upper sequence is the complete sequence of BS-RNase. Struct(D1), struct(D2), and struct(M) are the crystallographically determined sequences of the first monomer of the dimer (1BSR), the second monomer of the dimer (1BSR) and the structural monomer (1N3Z), respectively. The solvent accessibility is noted below each amino acid with a number between 0 and 9; 0 represents 0% solvent accessibility and 9 represents 90% solvent accessibility.

Table 3-1. Sequence summary for sites influencing seminal RNase tree*

            Brain/Pancreas   Okapi   Deer   Subsequent sequences/homoplasy
Site 9      R/E              E       W      saiga R / bovid E
Site 19     S                P       S      all P
Site 22     N/S              N       S      saiga, kudu, buff N; ox S
Site 53     D/N/S            N       N      D
Site 55     K/Q              Q       K/N    saiga Q / remainder K
Site 64     T/A              T       A      saiga T, duiker A, remainder T
Site 70     P/ST             S       S      T
Site 80     NS/HS            H       H      R
Site 101    Q                R       R      duiker R / rest Q
Site 113    N/N              N       K      saiga N / rest K

*See multiple sequence alignment (Figure 3-4) for primary data
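The codon arithmetic behind the K/Q homoplasy at site 55 (Table 3-1) can be checked by enumeration. The sketch below uses the standard genetic code and lists the products of all nine single nucleotide substitutions of the Lys codon AAG; the function name is hypothetical, and treating all nine substitutions as equally likely is the same simplifying assumption made in the text.

```python
from collections import Counter

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def single_substitution_products(codon):
    """Count the amino acids (or stop, '*') reached by one base change."""
    products = Counter()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                products[CODON_TABLE[mutant]] += 1
    return products


aag = single_substitution_products("AAG")
p_gln = aag["Q"] / sum(aag.values())   # chance that one substitution gives Gln
p_three = p_gln ** 3                   # three independent K -> Q events

for product, count in sorted(aag.items()):
    print(product, count)
print(f"P(Gln) = {p_gln:.3f}; three in a row = {p_three:.5f}")
```

Under this equal-probability assumption, three interconversions that each happen to produce the one permitted amino acid have a joint probability of (1/9)^3, which is well below 0.05.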


Table 3-2. This table shows the different models used to detect adaptive episodes of evolution for both topologies (Topology 1 and Topology 4) for the pancreatic outgroup. The column under 'Models' shows the different combinations of branch groupings used in the respective model. Calculated values of ω are shown for each branch grouping and respective model. Under 'Models', the designation 0 represents the rest of the branches, excluding the branches in the listed branch grouping; one ω value was calculated for all of these remaining branches.


Table 3-3. This table shows the different models used to detect adaptive episodes of evolution for both topologies (Topology 1 and Topology 4) for the brain outgroup. The column under 'Models' shows the different combinations of branch groupings used in the respective model. Calculated values of ω are shown for each branch grouping and respective model. Under 'Models', the designation 0 represents the rest of the branches, excluding the branches in the listed branch grouping; one ω value was calculated for all of these remaining branches.


Table 3-4. This table shows the AIC and Akaike weights (wi) given all models and both topologies (Topology 1 and Topology 4) for the pancreatic outgroup.


Table 3-5. This table shows the AIC and Akaike weights (wi) given all models and both topologies (Topology 1 and Topology 4) for the brain outgroup.
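For readers unfamiliar with the quantities in Tables 3-4 and 3-5, the standard definitions of AIC and Akaike weights can be sketched as follows. The log-likelihoods and parameter counts below are hypothetical placeholders, not values from this study, and the function names are illustrative only.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_values):
    """Convert AIC values into weights that sum to 1 across the model set."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (value - best)) for value in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical (lnL, number of parameters) pairs for three competing models.
MODELS = {"one ratio": (-2105.4, 30), "two ratio": (-2101.9, 31), "free ratio": (-2090.2, 58)}

aic_values = [aic(lnl, k) for lnl, k in MODELS.values()]
weights = akaike_weights(aic_values)

for name, value, weight in zip(MODELS, aic_values, weights):
    print(f"{name:10s} AIC = {value:8.1f}  w = {weight:.3f}")
```

Each weight can be read as the relative support for a model within the set considered; the model with the lowest AIC always receives the largest weight.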


Table 3-6. Buried versus exposed crystallographic distribution of total and ancestrally replaced amino acids in BS-RNase. The data were obtained from the DSSP output describing the monomeric structure (pdb 1N3Z) and the dimeric structure (pdb 1BSR). Residues in play are the ancestrally replaced amino acids. Residues not in play are the rest of the residues in the protein.

Dimer crystal structure
                        Total      Buried         Exposed
All residues            248        88             160
Residues not in play    212        83             129
Residues in play        36 (n)     5 (n1)         31 (n2)

                                   % Buried       % Exposed
All residues                       35.5           64.5
Residues not in play               39.2 (p1,0)    60.8 (p2,0)
Residues in play                   13.9 (p1)      86.1 (p2)

Monomer crystal structure
                        Total      Buried         Exposed
All residues            124        45             79
Residues not in play    106        43             63
Residues in play        18 (n)     2 (n1)         16 (n2)

                                   % Buried       % Exposed
All residues                       36.3           63.7
Residues not in play               40.6 (p1,0)    59.4 (p2,0)
Residues in play                   11.1 (p1)      88.9 (p2)
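The Chi-square statistic described in Chapter 3 can be recomputed directly from the counts in Table 3-6. The sketch below takes the expected fractions p1,0 and p2,0 from the residues not in play, as in the stated null hypothesis; the critical values are the ones quoted in the text for one degree of freedom, and the function name is hypothetical.

```python
# Critical values quoted in the text (one degree of freedom).
CHI2_CRITICAL = {0.005: 7.87944, 0.025: 5.02389}


def chi_square_in_play(n1, n2, buried_rest, exposed_rest):
    """Goodness-of-fit statistic for residues in play.

    n1, n2: buried and exposed residues in play.
    buried_rest, exposed_rest: buried and exposed residues not in play,
    which define the hypothesized fractions p1,0 and p2,0.
    """
    n = n1 + n2
    rest = buried_rest + exposed_rest
    expected_buried = n * buried_rest / rest      # E(n1) = n * p1,0
    expected_exposed = n * exposed_rest / rest    # E(n2) = n * p2,0
    return ((n1 - expected_buried) ** 2 / expected_buried
            + (n2 - expected_exposed) ** 2 / expected_exposed)


# Counts from Table 3-6.
x2_dimer = chi_square_in_play(5, 31, 83, 129)
x2_monomer = chi_square_in_play(2, 16, 43, 63)

print(f"dimer   X2 = {x2_dimer:.4f}")
print(f"monomer X2 = {x2_monomer:.4f}")
```

With these counts the dimer gives a statistic of about 9.64, which exceeds the 0.005 critical value, and the monomer gives about 6.48, which exceeds the 0.025 critical value.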


CHAPTER 4
PRODUCTION AND BIOCHEMICAL CHARACTERIZATION OF RESURRECTED RIBONUCLEASES

The phylogenetic analysis outlined in Chapter 3 generated candidate sequences of the proteins that were the ancestors of contemporary ribonucleases (RNases). These are represented by individual nodes throughout the evolutionary history of the seminal RNase gene (Figure 3-5 in Chapter 3). This chapter describes the production, purification and biochemical characterization of these ancestral proteins.

Production of Ancestral Proteins

Site Directed Mutagenesis Method

Genes encoding various ancestral RNase proteins were obtained from an original ribonuclease gene cloned in a bacterial expression vector (pET23b). This original gene (An35) was derived from the gene synthesized in the Benner laboratory in 1984. It was developed by a variety of workers in the Benner laboratory prior to the start of this project, including Dr. Joseph Stackhouse, Dr. Katrin Trautwein-Fritz, Dr. Mauro Ciglic, Dr. Jochen Opitz, and Dr. Lee Raley. The dissertation of Dr. Ciglic, written in English, was especially helpful.

This original gene was then changed to create genes that encoded individual candidate ancestral RNases. To minimize the number of mutation steps needed to prepare these, it was recognized that the various ancestral genes differed from the original gene at many points in common. The original gene was therefore mutated according to a strategic plan (Figure 4-1).


Site directed mutagenesis was used to introduce the needed mutations into the gene. The starting plasmid had 5'-Gm6ATC-3' sequences throughout. These are susceptible to specific cleavage by Dpn I endonuclease.

Mutations were introduced using forward and reverse primers that are complementary, with one or two nucleotide mismatches, to sites in the starting gene that were to be changed. The mismatches introduce the nucleotide substitutions needed to create the desired amino acid replacement. From 10 to 15 correctly matched nucleotides were placed in the primers on either side of the mismatch that introduces the mutation.

The primers then support a polymerase chain reaction (PCR) that replicates the entire plasmid and delivers a product plasmid that incorporates the desired mutations. The PCR used the thermostable Pfu DNA polymerase. The parental plasmid strands were then digested with Dpn I endonuclease. After digestion, a fraction of the PCR product was used to transform competent E. coli cells (XL-1 Blue). Transformed clones were then grown, and the plasmid DNA was purified and sequenced to confirm the success of the mutagenesis. The sequence of the starting ribonuclease gene and the list of primers are given in the Appendix.

Protein Expression

Ribonuclease toxicity in expression

The ancestral proteins were expressed in a prokaryotic system (E. coli) using the pET23b bacterial expression vector. This system places the RNase gene behind the T7 bacteriophage promoter, and places the T7 RNA polymerase gene behind the lacUV5 promoter, making the latter (and therefore, indirectly, the former) inducible with isopropylthiogalactoside (IPTG). The selection marker is a beta-lactamase gene on the plasmid that renders the host cell resistant to ampicillin and carbenicillin.
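The primer design rule described above for site directed mutagenesis (a mismatched codon flanked on each side by 10 to 15 correctly matched nucleotides, with the reverse primer complementary to the forward primer) can be sketched as follows. This is only an illustration: the template fragment is hypothetical, the function names are invented for this example, and the actual primers used in this work are those listed in the Appendix.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mutagenic_primers(template, codon_start, new_codon, flank=12):
    """Build a complementary primer pair that replaces one codon.

    codon_start is the 0-based index of the first base of the target codon;
    flank is the number of matched bases kept on each side (10 to 15).
    """
    if not 10 <= flank <= 15:
        raise ValueError("flank must be between 10 and 15 nucleotides")
    left, right = codon_start - flank, codon_start + 3 + flank
    if left < 0 or right > len(template):
        raise ValueError("codon is too close to the end of the template")
    forward = template[left:codon_start] + new_codon + template[codon_start + 3:right]
    return forward, reverse_complement(forward)


# Hypothetical template fragment; the Lys codon AAG starts at index 21.
TEMPLATE = "ATGAAAGAATCTGCTGCTGCTAAGTTCGAACGTCAGCACATGGACTCT"

# Replace Lys (AAG) with Arg (AGG): a single nucleotide mismatch.
fwd, rev = mutagenic_primers(TEMPLATE, 21, "AGG", flank=12)
print("forward:", fwd)
print("reverse:", rev)
```

In practice, primer choice would also weigh melting temperature and GC content, which this sketch ignores.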


Unfortunately, in first expression attempts, addition of IPTG did not lead to the expression of cloned ancestral RNase genes in the desired large amounts. To troubleshoot this problem, we considered the possibility that the beta-gal system had a low level of leakiness. We speculated that the seminal RNase might be toxic to the cells. This would not be surprising if the RNase were folded, active and in the cytoplasm. We did not expect this to be the case, however, as heterologously expressed RNases are usually produced as inclusion bodies (aggregates of denatured proteins). Further, in the reducing environment of the cytoplasm, the disulfide bonds important to the proper folding and activity of RNase variants are not formed. However unlikely the scenario, it was necessary to rule it out.

Usually pET vectors are retained for a long period even in the absence of antibiotic selection (Pan and Malcolm, 2000). This, however, can change dramatically if the expressed protein is toxic. As our cells were grown for an extended period of time and were multiply propagated, we hypothesized that the plasmid might have evolved under selection pressure that reduced its ability to deliver an expressible gene.

A plasmid stability test was first performed to determine whether the plasmid itself was lost, whether the RNases were expressed, and whether these proteins were indeed toxic to bacteria. Cells were plated under the following four conditions:

(a) Carbenicillin: Only cells retaining plasmid should grow with this more chemically stable analog of ampicillin.

(b) Media without antibiotic: All cells should grow. Comparing the number of cells here with the number obtained under condition (a) will indicate the proportion of cells retaining the plasmid.

(c) IPTG: Cells with plasmid but without expression and cells without plasmid will grow.

(d) IPTG and Carbenicillin: Cells with plasmid and without expression will grow. Cells that are expressing and that retain plasmid will not grow.

If the same number of colonies were to be observed on plates with and without carbenicillin, then we would infer that the plasmid (at least the part that carries the lactamase) is present and stable. In addition, if no colonies are observed on the IPTG plates (condition (c)), this suggests that the gene is expressible, and that the expression stops further division of the cells. In fact, when this expression strain of E. coli (BL21(DE3)) is induced for expression with IPTG, its rate of division slows dramatically.

Similar numbers of colonies were observed when plating E. coli containing genes encoding RNase variants in the presence and absence of carbenicillin. This suggests that the plasmid is stable. Unfortunately, a comparable number of colonies also grew in the presence of IPTG. This suggested that the gene for the RNase variant was unexpressible (Figure 4-2).

Only one RNase variant gave plasmid stability test results supporting a stable and expressible gene (Figure 4-3). This RNase variant (An21) was expected to be enzymatically inactive because it had an Arg in position 41 instead of a Lys. Lys41 is an important active site amino acid; without it, the degradation of RNA is greatly slowed. This suggests that a low level of toxic protein was being generated from the cloned RNase gene even in the absence of IPTG, and that this was driving the emergence of a non-expressible gene.

We are able to only incompletely reconcile this observation with conventional wisdom, which holds that proteins requiring disulfide bonds to fold cannot fold correctly inside of a cell. Carrio et al., observing similar results with different proteins, reported that disulfide-containing proteins can retain some activity and might be in a dynamic equilibrium between solubility states, even when they are predominantly in inclusion bodies (Carrio, Cubarsi and Villaverde, 2000). Further, the Eisenberg group was able to show that fibrils formed by the apparent precipitation of pancreatic RNase form organized structures in which monomers swap domains to form a large non-soluble aggregate that retains activity (Sambashivan et al., 2005).

Solutions to expression problem

As another approach to deal with the expression difficulties, a phage induction system was tested. An E. coli strain (BL21) lacking the chromosomal copy of the T7 RNA polymerase gene was used. This allows the bacterial clone with the plasmid containing the RNase variant gene to be propagated indefinitely in the absence of RNase toxicity, as this E. coli strain is incapable of expressing genes controlled by the T7 promoter. The cells are grown to the appropriate density, and then infected with a phage (CE6) containing the T7 RNA polymerase gene. Within 3 to 4 hours after infection, the cultures should express large amounts of protein.

This method was quite successful (Figure 4-4). It was not practical, however, as it required preparation of large amounts of phage in high densities to induce expression. We therefore attempted to solve the problem by changing the handling of the original strain. Here, a bacterial colony is picked immediately after transformation, and directly used to inoculate liquid culture for expression, without intermediate propagation. The idea was to reduce the chance of mutations and loss of expression by reducing the culture time and the number of divisions of the expressing clone. Surprisingly, this method worked quite well (Figure 4-5).


198 Purification of Ancestral Ribonucleases Method 1 After induction with IPTG for 3 hours, th e cells were recove red by centrifugation. The cells were then suspended in buffer, a nd lysed. The lysis was best achieved using a French press, but similar results are obtain ed by using a cocktail of non-ionic detergent (BugBuster) supplied by Novagen. With BugBus ter, an enzyme (Benzonaze, Novagen) is added to the lysis cocktail. Benzonase is a genetically engineer ed endonuclease from Serratia marcescens that degrades all forms of DNA and RNA. Since the RNase was present in inclusion bodies (Figure 4-6), the lysis suspension was centrifuged at high speed to recover the insoluble protei n and other cell debris in a pellet. The pellet was washed 2 to 3 times with a wash buffer to reduce contaminating proteins and cell wall debris. The cleaned pellet was solubilized using a saturated or nearly saturated urea solution. This dissolved the RNase. The pr otein solution was centr ifuged at high speeds to remove the insoluble lipophilic debris. The supernatant is filtered and loaded on a strong cation-exchange SP (Sul phopropyl) column. The proteins were eluted with a salt (0-2 M NaCl) gradient (Figur e 4-7). At this point the RN ases were soluble but not properly folded. The RNase variant was folded in a fo lding buffer containing both reduced and oxidized forms of glutathione to stimulat e disulfide bond formation. The solution was incubated at room temperature for 2448 hours. The protein solution was then concentrated, the buffer is changed by dialysis to sodium acetate (50 mM , pH 5), and the dialyzate was loaded on a pUp (5’-(4 -Aminophenyl-phosphoryl)uridine-2(3’)phosphate) affinity column. pUp is a competitiv e inhibitor of the pancreatic ribonuclease

PAGE 214

199 family of proteins. Pure RNas e is eluted with a salt gradient (0-2 M NaCl). Figure 4-8 shows purified RNase by this method on a silver stained SDS-PAGE gel. Method 2 Another method was developed to in crease the purification yield. The solubilization agent was changed from urea to guanidinium hydroc hlride, which proved to be more efficient than urea for solub ilizing the inclusion bodies. Also, the cationexchange step was replaced by a dialysis st ep. The protein solubilized in aqueous GnHCl is dialyzed against aqueous acetic acid ( 20 mM, 4 C). In this step, most of the contaminant proteins precipitate, while RNase variants remain in solution. This dialysis step proved to be superior to the cation-exchange column step in removing impurities. Further, it allows guanidinium, itself a cat ion, to be used as a denaturant. The dialyzed solution is then centrifuged a nd filtered to remove insoluble proteins. RNAse is then folded as described in Met hod 1. The folded protein in sodium acetate buffer (20 mM, pH 5) is ready to load on to an affinity column. The described modifications improved the yield of purified RNAse, but further modifications were required to enhance the performance of the a ffinity chromatography step. The quality of the commercially availa ble pUp affinity resin, used in Method 1, evidently declined as this project progressed. In particular, the agarose provided by the manufacturer (Sigma) after the first year of this project had less than 50% of the pUp attached. This lower amount of RNase ligand caused the affinity column to be ineffective. Ribonucleases could be purif ied by other classical ch romatography methods, of course. An affinity step based on a competitive inhibitor serves a dual role, however, by ensuring that the purified protein is properl y folded. Competitive inhibitor binding to the

PAGE 215

200 enzyme active site is dependent on proper fold ing. Consequently, a new affinity step was developed to complete the purification. Ribonuclease A is well known to pref erably bind to a pyrimidine-purine recognition dinucleotide. Based on this pref erence, a DNA oligonucleotide (5'-amino C6dAdAdUdAdAdAdUdAdA wher e dA is deoxyadenosine an d dU is deoxyuridine) containing this dinucleotide was designed to be the affinity ligand. A 6 carbon spacer and a primary amine was appended at the end of th e tail were added to the 5' end of the DNA oligonucleotide to enable accessibility and c onjugation to chromatography resin. This modified oligonucleotide was pu rchased from Integrated DNA technology (IDT) and an NHSactivated sepharose from Amersham Biosciences. The resin is acidified using an ice cold HCl (1 mM). The oligonucleotides is disso lved in the coupling buffer containing a 0.2 M NaHCO3 and 0.5 M NaCl at pH 8. 3. This solution is then added to the acidified resin re placing the HCl solution and incubated at room temperature for 4 hours. The unreacted NHS groups in the re sin are then blocked with a solution of 0.5 M ethanolamine. As expected, RNase bound to the home-synt hesized column and was successfully eluted with high concentration of salt (1 M NaCl). We noticed that some contaminating proteins also bound, perhaps via coulombic inte ractions with the affinity oligonucleotide. To achieve better purification and ensure that the purified RNase is binding the column through the active site, a different elution scheme was devised. Here, a series of known competitive inhi bitors of RNase A were tested to selectively elute RNase. Competitive inhibitors are thought to compete for the active site with the DNA oligonucleotides, causing the column-bound RNase to elute. Adenosine

PAGE 216

201 diphosphate (ADP), an RNase competitive inhib itor (Leonidas et al., 1997), used at a low concentration (0.5 mM), performed best comp ared to other inhibitors tested. Unfolded RNase bound to the column was eluted only at high salt concentration; it did not, however, not elute with ADP (F igure 4-9). This observation c onfirms that this affinity chromatography method yields not only the RNas e variant, but the vari ant in the correctly folded form. Ancestral Protein Characterization The primary structure of a protein determin es its folded structure, which in turn determines characteristics that confer the ability of a protei n to confer fitness upon a host organism. Not all characterist ics that can be measured in vitro are directly linked to fitness, however. It remains a puzzle as to what sequence changes confer new function. The bioinformatics analysis of the last chapter used a rate criterion to do this, identifying not only an episode where positive selection must (within limits of statistical uncertainty) have operated to create new behaviors in RN ase, but also the indi vidual residues that were selected to do so. Unfortunately, a similar an alysis does not identify in vitro behaviors that are important for function. We describe here a se ries of assays to measure the behaviors of the ancestral proteins. Enzyme Kinetic Activity of Ancestral Ribonucleases The degradation of RNA catalyzed by RN ase A (the closest homologue, with 80% identity to BS-RNase) is a two-step process. In the first step, the enzyme catalyzes the transphosphorylation reaction from the 5’ posi tion of one nucleotide to the 2’ position of the adjacent nucleotide formi ng a 2’,3’-cyclic phosphate (Figure 4-10). In the second step, RNAse catalyzes the hydrol ysis of the cyclic phosphate intermediate. This second


step is essentially the reverse of the first reaction step, with water substituting for the leaving group. A free 3'-phosphate monoester product is formed (Blackburn and Gavilanes, 1982). The second step (hydrolysis of the cyclic phosphate) does not necessarily directly follow the first. RNase A tends to cleave the 3',5'-phosphodiester bonds to create a population of products that are, on average, 6 nucleotides in length, because the hydrolysis of the cyclic intermediate becomes dominant (Raines, 1998).

The transphosphorylation of single-stranded RNA has an acid-base catalytic mechanism. His12 acts as a base and His119 as an acid, directing the intramolecular attack of the 2'-hydroxyl on the phosphorus atom and the subsequent cleavage of the P-5'-O bond (Figure 4-10). Lys41 stabilizes the intermediate. This mechanism is supported by many studies.

RNase A has a degree of specificity caused by the presence of discriminating subsites (Figure 4-11). For example, the base at the 3'-side of the cleavable phosphodiester bond is strongly preferred to be a pyrimidine. A purine does not fit easily in the B1 subsite. In addition to the active site P1 and the subsite B1, a series of other sites are involved in guiding the RNA molecule (Figure 4-11) (Raines, 1998). All of these subsite amino acids, including those of the active site, are conserved in all of the ancestral RNase candidate proteins, indicating that all the elements necessary for a successful RNase reaction are present.

Regardless of whether these proteins have interesting biological activity or not, they are primarily ribonucleases. In fact, it was shown that without the enzymatic activity the otherwise cytotoxic RNases become inert (Jo Chitester and Walz, 2002). For these and other reasons, changing catalytic activity might be an important part of the adaptive


response of seminal RNase proteins, meaning that the level of activity of the purified ancestral RNases must be measured.

Assaying RNase catalytic activity using UpA as the substrate

The UpA (uridylyl-3'->5'-adenosine) assay monitors a two-step reaction. In the first step, the dinucleotide UpA is cleaved by an RNase to yield a uridine 2',3'-cyclic monophosphate and adenosine. In the second step, the enzyme adenosine deaminase (ADA) converts adenosine into inosine. The RNase-catalyzed step itself produces an absorbance shift that is too small to monitor the reaction accurately. The second step in this assay is associated with a large difference in the absorption at 265 nm, however (Warshaw and Tinoco, 1966). ADA is selective for adenosine, leaving oligonucleotides unaffected.

The assay was performed by equilibrating the reaction mixture (25 °C) in a quartz cuvette containing sodium acetate buffer (100 mM, pH 5.0), ADA (units adjusted so that this step is not rate determining), and the substrate in varying concentrations (30, 50, 75 and 100 mM). The reaction was initiated by adding RNase. The absorbance (265 nm) was measured for 90 seconds. The slope of the decrease in absorbance was determined from the first 5 values measured. The measurements were performed four times for each substrate concentration in three experiments. The initial rates, measured as the slope of the linear part of each progress curve (Figure 4-12), were plotted against the corresponding substrate concentration. The resulting curves were fitted to the Michaelis-Menten equation (below) using non-linear regression curve fitting software (PRISM). In this equation, v0 is the initial velocity of the reaction at substrate concentration [S], Vmax is the maximum velocity of the reaction, and KM is the Michaelis-Menten constant.


$$v_0 = \frac{V_{max}[S]}{K_M + [S]}$$

All of the ancestral proteins tested were enzymatically active. This is an important basic observation, as it was predicted by the analysis, described in the previous chapter, which showed that the residues replaced in the lineage leading directly to BS-RNase are not located in the active site. This distribution of ancestral replacements is consistent with the maintenance of enzymatic activity.

The second observation is that KM for the tested proteins is similar to that observed for wild type BS-RNase (KM = 137 mM), determined by Trabesinger-Ruef (Trabesinger-Ruef, 1997). The ancestral proteins An19 (KM = 154 mM), An22E (KM = 282 mM), and An28 (KM = 135 mM) have values that are quite close to this value. The most distant KM is that of An23, roughly 13-fold higher. This difference, if we assume that the value is accurate, is still modest in kinetic terms, where an order of magnitude difference between two similar enzymes is not surprising. However, this result is probably not very accurate, as the 95% confidence interval for this value is quite large (Table 4-1). On the other hand, where this interval is much tighter, suggesting a more robust measurement, the values are much closer to that of BS-RNase (An19EK, An22E, An28) (Table 4-1). This trend suggests that the ribonuclease enzymatic activity did not change extensively over the evolutionary history of this protein. However, because the manufacturer (again Sigma) ceased to provide the UpA substrate, we were unable to collect enough data to complete the kinetic analysis.

Assaying RNase catalytic activity using a fluorescent tetranucleotide as a substrate

Because of the loss of a commercial source of UpA, a second substrate was sought. Several researchers (Kelemen et al., 1999) had developed a more sensitive assay than the


UpA assay. Here, at least one ribonuclease-cleavable bond is present in an oligonucleotide that joins two fluorescent molecules attached at its two ends (5' and 3'). When the oligonucleotide is intact, only a low level of fluorescence is emitted, as the two fluorescent moieties quench each other at short distances. When the substrate is cleaved, however, the quenching is relieved and fluorescence increases upon excitation at the right wavelength.

The most sensitive substrate available is 6-FAM-dArUdAdA-6-TAMRA (Kelemen et al., 1999), where FAM is 6-carboxyfluorescein, TAMRA is 6-carboxytetramethylrhodamine, dA is deoxyadenosine and rU is uridine. This substrate has only one RNase-cleavable bond in the molecule, the bond linking rU and dA. Figure 4-13 shows this substrate and the cleavable bond. The assay is monitored using a high precision spectrofluorometer. This substrate has a maximum emission at a wavelength of 515 nm when excited at 490 nm. Figure 4-14 shows an emission scan at an excitation wavelength of 490 nm. The enzyme concentration ranged from 80 pM to 700 pM. The reaction progress is monitored for 1000 sec in every assay in order to reach or approach maximal substrate cleavage. Figure 4-15 shows progress curves for ribonucleases. To calculate kcat/KM, the progress curve data are fitted to the following equation using the mathematical program PRISM.

$$I = I_f - (I_f - I_0)\,e^{-(k_{cat}/K_M)[E]t}$$


I is the fluorescence intensity at time t, If is the maximum fluorescence at complete cleavage, I0 is the background fluorescence before enzyme addition and [E] is the enzyme concentration. This kinetic equation is derived from a first-order rate equation. Figure 4-16 shows kcat/KM values for the ancestral proteins that were tested, BS-RNase and RNaseA.

Results and interpretations

The catalytic activity of ancestral seminal RNases, as measured by kinetic assays, does not change significantly over their evolutionary history. The active ancestral enzymes reveal that the catalytic activity of seminal RNase is conserved. In particular, the catalytic activity does not change dramatically during the episode when positive selection pressure operates. This implies two conclusions. First, changing catalytic activity is not important to the changing functional role of this protein in oxen over the past 2 million years. Second, the conservation of catalytic activity in the earlier history of seminal RNase suggests that these proteins were in fact expressed, and their sequence drift constrained by purifying selection pressures.

Incidentally, there is not a strong correlation between catalytic activity, as measured in these in vitro assays, and immunosuppressivity or other biological activities. There might be a small increase in activity as An24 evolves to give An25 and An26 (BS-RNase), but this increase is not large enough to correlate with the phase of rapid evolution (An24 to An25 to An26). Indeed, An25A, with the highest catalytic activity, does not have the highest biological activity (see Chapter 5). The catalytic activity might be essential to the new biological function, but it is not the primary selected character during the rapid evolutionary phase.


Dimerization

Seminal ribonuclease is the only known member of the family that naturally occurs as a dimer. Bovine seminal RNase has two intersubunit disulfide bonds (linking Cys31 and Cys32) forming the dimer.

This dimer has another unusual structural feature: residues 1-20 are exchanged between the two monomeric folds, with residues 1-20 from one subunit folding with residues 21-124 of the other. This swap results in the two active sites of the dimer being composite, with the catalytically important His12 from one subunit working with the catalytically important Lys41 from the other subunit (Figure 3-12 from Chapter 3).

This unique structural feature has attracted the attention of many researchers, who have related the swap and dimerization to various behaviors that they measure in vitro and in vivo, including cytotoxicity. For example, Raines and coworkers postulated that proteins that form dimers with residues 1-20 swapped might be cytotoxic because the swap would hold the dimer together even if the intersubunit disulfide bonds are reduced. In this model, RNase needs to be localized in the cytoplasm to kill cells, the presence of ribonuclease inhibitor (a 50 kD protein) might inhibit the RNase activity, and RNase inhibitor binds only to monomeric RNases. As this may be important to the evolution of biological function in the seminal RNase lineage, this structural feature was investigated.

Disulfide Based Dimerization

All of the inferred ancestral seminal RNases lack Cys32, which participates in one of the two intersubunit disulfide bonds in modern BS-RNase. All have Cys31, however, a residue that was introduced immediately after the formation of the seminal RNase lineage. Cys31 also participates in an intersubunit disulfide bond.


The purified ancestral proteins were allowed to dimerize after reduction and removal of the reducing agent. The reduction of the proteins breaks disulfide bonds and removes any glutathione groups from the unreacted cysteine groups (intersubunit cysteine 31). The protein is then left at room temperature in the presence of air, which provides the dioxygen needed to allow the re-formation of the disulfide bonds. The formation of dimers linked by disulfide bonds is verified by SDS-PAGE (Figure 4-17).

All of the ancestral proteins, as well as the recombinant BS-RNase, readily formed dimers in which the monomers are linked by a disulfide bond (presumably a Cys31-Cys31' disulfide bond). Treatment of the proteins with a reducing agent (DTT) was shown to convert these RNase variants to monomers, again as shown by SDS-PAGE in Figure 4-17.

Cys31 is sufficient to ensure the covalent dimerization of seminal ribonucleases. This result does not tell us whether the dimer has residues 1-20 swapped, but it does show that dimerization of this protein is an early ancestral trait that depends only on the presence of cysteine 31.

This dimerization is a character that has existed through most of the evolutionary history of this protein. Because it does not emerge during the episode of rapid sequence evolution, we conclude that dimeric structure is not an in vitro character important for the selection of new biological function. Because this protein character has been shown to be important to its in-cell activities, the evolution of the new function might require its preexistence, but it has not been selected in the adaptive evolutionary phase.

Domain Swap with a Divinylsulfone (DVS) Assay

When the two monomers swap domains, they exchange the peptide (sites 1-20) and form composite active sites. This structural feature of the dimer is part of the Raines


model mentioned above, which requires BS-RNase to be a dimer to evade the cytosolic RNase inhibitor. The reducing environment of the cytosol might break the intersubunit disulfides. The domain swap (by hypothesis) stabilizes the dimeric structure, however, and allows BS-RNase to evade the RNase inhibitor.

The DVS assay is a covalent crosslinking method that exploits a very specific structural feature of the swapped dimer to measure the extent of this swap. Each active site is made of parts of both monomers: His12 is part of the swapped peptide, while His119 is part of the other monomer. These two residues come into close proximity to form an active site. The two histidines can react with cross-linking agents that contain two electrophilic sites. Divinylsulfone (DVS) is ideal for cross-linking the active site histidines because the sulfone unit mimics the reactive phosphate in the RNA backbone. Further, based on the crystal structure of BS-RNase, the electrophilic CH2 groups in DVS are at an appropriate distance to react with the two histidines when they are in the active site (Figure 4-18). When DVS reacts with a dimer containing the composite active site, it crosslinks His12 from one monomer with His119 from the second monomer, yielding a covalently linked dimer. In contrast, DVS reacting with unswapped dimers and monomers will yield free monomers upon reduction. The extent of dimer swap is determined by reducing the samples and separating them by SDS-PAGE.

Figure 4-19 shows three SDS-PAGE gels with ancestral RNases taken at three time points in the DVS cross-linking reaction: 24 hours, 72 hours and 92 hours. A significant amount of recombinant BS-RNase is crosslinked in the dimer form. This result is consistent with what has been observed with wild-type and recombinant BS-RNase. The protein has been


shown to be a dimer in a dynamic equilibrium between the swapped and the unswapped forms. RNaseA, as expected, remained a monomer, since it does not naturally form a swapped dimer.

None of the ancestral proteins formed swapped dimers to the extent observed in BS-RNase. The earliest resurrected seminal ancestral protein (An19) showed a small level of dimeric swap. The following ancestral proteins, An23, An24, An25 and An28, showed no detectable dimeric swap.

If the swap is a biomolecular behavior that contributes to the newly evolving physiological function after An24, then it is expected to increase during the adaptive episode of evolution. It does. Thus, we conclude that the swap is important to the new function. The principal caveat is that the extent of swap may not be independent of the conditions of folding and storage of the protein. Those used in the laboratory cannot reproduce those in the animal.

We doubt that the swap per se is the selected behavior, of course. More likely, the swap contributes to a behavior that is selected, and thereby indirectly contributes to fitness.

Materials and Methods

Site-Directed Mutagenesis

The mutagenesis PCR reaction contains PfuTurbo DNA polymerase buffer, 5 to 50 ng of dsDNA plasmid template, 125 ng of each oligonucleotide (20-25 nucleotides in length), 10 nM of dNTP and 2.5 units of PfuTurbo DNA polymerase per 0.050 mL reaction. At the end of the PCR cycles, 0.001 mL of DpnI (10 U/mL) is added to the reaction and incubated at 37 °C for 4 hours. At the end of the DpnI digestion, 5 mL of the reaction mix is used to transform XL-1 Blue competent cells, which are then plated on carbenicillin


plates. A few colonies are grown and plasmid DNA is isolated. The RNase gene is then sequenced to verify the success of the mutagenesis reaction.

CE6 Phage Induced Expression

A fresh culture of the ED8739 E. coli strain (Novagen) is grown to saturation in LB supplemented with 0.2% maltose and 10 mM MgSO4 (1 mL/mL of LB from 1 M MgSO4 stock and 20% maltose stock). Top agarose (LB media plus 7.2 g agarose per liter) is melted and placed at 45-50 °C in a water bath. 0.1 mL of a 3 x 10^7 plaque forming units per mL (pfu/mL) suspension of CE6 phage is rapidly mixed with 0.3 mL of top agarose. The suspension is rapidly and evenly plated onto LB/agar plates warmed to 37 °C. The plates are incubated at room temperature for 20 min, then transferred to 37 °C for an overnight incubation. By the next day, the confluent lawn of bacteria is completely lysed by the phage. The CE6 phage is harvested from the plates by adding 5 mL of SM (0.01% gelatin, 85 mM NaCl, 8 mM MgSO4·7H2O, 50 mM Tris pH 7.5) and gently swirling on a shaker for 4 hours. The phage-containing SM is collected. The titer of the phage is determined by mixing serial dilutions and plating in the same way as described. The plaques are counted to determine the titer, which should be higher than 7 x 10^9 pfu/mL.

To induce expression, the bacterial clone with the RNase variant gene is grown to an OD600 of 0.6-1. Magnesium sulfate is added to a concentration of 10 mM, then the phage is added to a final concentration of 2-4 x 10^9 pfu/mL. The culture is grown at 37 °C for 3 to 4 hours. The cells are centrifuged and the pellet placed at -80 °C to facilitate lysis.

Induction of Expression With IPTG

The day before expressing RNase proteins, fresh competent E. coli (BL21 DE3 pLys) is transformed with 50-100 ng of plasmid DNA. For transformation, the plasmid DNA is incubated with the competent cells for 30 minutes on ice, 45 seconds at 42 °C, 2


min on ice, then LB is added and the culture is placed in a shaking incubator at 37 °C for 1 hour. Finally, the cells are plated on a carbenicillin plate at 37 °C. A colony from the transformation plate is picked and used to directly inoculate the expression culture. The culture is then grown to an OD600 between 0.4 and 0.7, at which point IPTG is added to a final concentration of 0.4 mM. After 3 to 4 hours the cells are harvested as described before.

Cell Lysis and Inclusion Bodies Purification

To lyse the cells, 10 to 20 mL of non-ionic detergent cocktail (Bugbuster, Novagen) and 3 mL (25 U/mL) of Benzonase (Novagen) are added to the cell pellet and shaken at room temperature until the cell pellet dissolves completely. The suspension is then centrifuged for 25 minutes at 15000 x g. The supernatant is discarded and the pellet is resuspended in freshly made wash solution (100 mM TrisCl, pH 7.0, 5 mM EDTA, 5 mM, 2 M urea, 2% (w/v) Triton X-100) and then centrifuged for 20 minutes at 15000 x g. This wash step is repeated twice. The pellet is then washed once with water.

Solubilization of Inclusion Bodies

In Method 1, the inclusion bodies are solubilized in a freshly made urea based solubilization solution (9-10 M urea, 50 mM NH4OAc pH 6.8, 1 mM PMSF, 10 mM sarcosine, 100 mM DTT). The pellet is vortexed or solubilized using a homogenizer. The solution is then centrifuged for 30 min at 50000 x g to remove any impurities. The supernatant is filtered and is ready to be loaded onto the cation exchange column.

In Method 2, the inclusion bodies are solubilized in a guanidinium hydrochloride based solubilization solution (10 mM Tris-HCl pH 8, 6 M GnHCl, 10 mM sodium EDTA, 100 mM DTT). The pellet is vortexed and incubated for 3 hours in a shaking


incubator at 37 °C. The solution is then centrifuged for 30 min at 15000 x g. The supernatant is filtered and is ready for the dialysis step.

Purification of RNase

In Method 1, the solubilized protein in urea solution is loaded on a cation-exchange column (SP). The resin is then washed with 10 column volumes of buffer (20 mM NH4OAc, pH 7). The bound protein is eluted with a salt gradient (0-2 M NaCl over 30 minutes at 2 mL/min). Protein elution is monitored by UV absorbance at 280 nm. The collected fractions are further checked with SDS-PAGE. The fractions containing RNase are pooled and diluted in the glutathione folding buffer (100 mM Tris pH 8.4, 3 mM GSH/0.3 mM GSSG). The protein solution is incubated at room temperature for 24-48 hours. The protein solution buffer is then exchanged by dialysis against 50 mM sodium acetate, pH 5. The solution is then concentrated and loaded on the pUp affinity column. The resin is washed with 10 column volumes of sodium acetate buffer. The protein is then eluted with a salt gradient (0-2 M NaCl over 30 min at 2 mL/min). Protein purity is verified with a silver stained SDS-PAGE gel.

In Method 2, the solubilized protein in the GnHCl based solution is dialyzed against 20 mM acetic acid at 4 °C. The dialysis buffer is changed twice. The protein solution is then centrifuged to remove the precipitate. The supernatant is filtered and concentrated. The protein is folded in the same way as described in Method 1. The buffer is then exchanged by dialysis against 50 mM sodium acetate, pH 5, and the solution is concentrated. The RNase solution is then loaded onto the oligonucleotide affinity column. The bound protein is washed with 10 column volumes of buffer (50 mM NaOAc, pH 5). Folded RNase is then eluted with 0.5 mM ADP. The column is purged with 2 M NaCl and regenerated with 50 mM NaOAc, pH 5. Purity is verified with silver stained SDS-PAGE of the ADP


eluted protein fractions. The protein is dialyzed again to remove ADP if it is to be used in kinetic assays.

Dimerization

Affinity purified RNase is treated for one hour with 200 mM DTT. The reducing agent is then removed by dialysis or desalting column. The protein solution is then placed at room temperature for 7-10 days. The extent of disulfide based dimerization is then verified on SDS-PAGE with and without reducing agent.

DVS Swap Crosslinking Assay

The RNase variant (13 mg, 1 nM per subunit) in NaOAc buffer (100 mM, pH 5, 100 mL) and DVS (1 mL of a 10% solution) were incubated at 30 °C. Aliquots were taken over a period of 96 hours. The samples are reduced with DTT and loaded on SDS-PAGE to assess the extent of dimeric swap.

UpA Kinetics Assay

Five solutions of UpA with different concentrations (30, 40, 50, 75 and 100 mM) were freshly prepared containing NaOAc (0.1 M) and ADA (5 mL of a 2 mg/mL solution per 10 mL reaction volume). The enzymes to be measured were diluted in NaOAc (100 mM) to a final concentration of 5 ng/mL. An aliquot of the enzyme solution was added and the cuvette quickly inverted 3 times. The absorbance was measured every 6 sec at a wavelength of 265 nm during 90 sec. The slope of the decrease in absorbance was determined from the first 5 values. The measurements were repeated 4 times for each assay.

Fluorescent Tetranucleotide Kinetics Assay

The substrate (IDT, Coralville IA) was prepared in 3 concentrations (30, 50 and 100 nM) in 0.1 M MES/NaOH pH 6 buffer containing 0.1 M NaCl (2 mL). Enzyme concentrations ranging from 80 to 700 pM were used in the various assays.


The progress of the reaction is monitored using a high precision spectrofluorometer with an excitation wavelength of 490 nm and an emission wavelength of 515 nm. The reaction is followed for 1000 sec or until maximum substrate cleavage is reached.

Figure 4-1. Site directed mutagenesis strategic plan to produce the various RNase ancestral candidate genes. The labels that start with AN designate the various genes. The introduced mutations are in rectangles. The starting sequence is An35.


Figure 4-2. Plasmid stability test indicating that the plasmid for expressing RNase was retained but the ability to express was lost. The condition is indicated under the respective plate. A similar number of colonies was observed on the carbenicillin plate and on the plate with only media (0). A similar number of colonies was observed on the IPTG plates (carbenicillin and no carbenicillin).

Figure 4-3. Plasmid stability test indicating that the plasmid for expressing AN21 RNase (enzymatically inactive) was retained. The lack of colonies on the IPTG plates indicates that the gene can be expressed. The condition is indicated under the respective plate. A similar number of colonies was observed on the carbenicillin plate and on the plate with only media (0). No colonies were observed on the IPTG plates (carbenicillin and no carbenicillin).


Figure 4-4. A Coomassie stained SDS-PAGE gel separating E. coli cell lysate after inducing expression with IPTG and CE6 phage. Lane B shows that BL21 cells successfully expressed RNase after infection with the phage CE6, which contains the T7 polymerase gene. Lane A shows very weak or no expression of RNase by BL21(DE3)pLys induced with IPTG.

Figure 4-5. A Coomassie stained SDS-PAGE gel showing equally successful expression in both IPTG and CE6 phage induced expressions. Lane A shows a lysate of BL21(DE3)pLys cells induced to express RNase with IPTG. Lane B shows the lysate of BL21 cells induced to express RNase with CE6 phage.


Figure 4-6. Coomassie stained SDS-PAGE showing that RNase is expressed in inclusion bodies. After lysis and centrifugation of BL21(DE3)pLys expressing RNase, the proteins in the supernatant and the pellet were separated by SDS-PAGE.


Figure 4-7. SDS-PAGE gel separating proteins in several purification fractions from a cation-exchange chromatography of RNase. Lanes 3-6 separate proteins eluted in the wash after loading the impure RNase sample onto the column. Lanes 8-17 separate proteins from fractions collected during the salt elution gradient. The migration level of pure RNaseA in the middle lanes shows where the RNase in the chromatography fractions should migrate. Lane 11 has the highest level of RNase.


Figure 4-8. Silver stained SDS-PAGE gel (right) and a Western blot (left) of affinity (pUp) purified RNase. Lane 1 shows a “pure” RNase A (Sigma). Lanes 2-7 represent affinity chromatography fractions collected during a salt-gradient elution. RNase becomes pure in lanes 5 and 6. A polyclonal antibody against RNase is used in the Western blot.


Figure 4-9. Silver stained SDS-PAGE of purified RNase on the newly developed DNA oligonucleotide column. The gel on the left shows that unfolded RNase does not elute with ADP but elutes with salt. The gel on the right shows that folded RNase eluted with ADP (0.5 mM).


Figure 4-10. The transphosphorylation step in the RNase mechanism of RNA cleavage. His12 and His119 are the residues catalyzing the reaction in the active site. Lys41 stabilizes the intermediate.


Figure 4-11. Active site amino acids of RNase are shown in relation to a 3 nucleotide RNA chain. The indicated sites are involved in the catalysis and RNA binding.


Figure 4-12. The progress curves of UpA cleavage by RNase (An28) are shown. The x axis shows substrate concentrations in mM. The y axis shows the initial kinetic rate. The two progress curves correspond to two enzyme concentrations.

Figure 4-13. The structure of the fluorescent tetranucleotide used in the RNase kinetic assay.
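
The initial rates plotted in Figure 4-12 were taken, according to the text, as the slope of the first five absorbance readings (one reading every 6 sec at 265 nm). A minimal sketch of that step is given here, assuming an ordinary least-squares line through those five points; the function name `initial_rate` and the example readings are hypothetical and are not data from this work.

```python
def initial_rate(times_s, absorbances, n_points=5):
    """Least-squares slope (absorbance units per second) of the first n_points readings.

    Assumption: a straight line through the earliest readings approximates
    the initial velocity, as described for the UpA assay.
    """
    t = list(times_s[:n_points])
    a = list(absorbances[:n_points])
    if len(t) != len(a) or len(t) < 2:
        raise ValueError("need at least two paired readings")
    t_mean = sum(t) / len(t)
    a_mean = sum(a) / len(a)
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (ai - a_mean) for ti, ai in zip(t, a))
    return sxy / sxx


# Hypothetical readings taken every 6 s; the curve flattens after the fifth point,
# so only the early, linear part contributes to the slope.
example_times = [0, 6, 12, 18, 24, 30, 36]
example_a265 = [1.000, 0.994, 0.988, 0.982, 0.976, 0.972, 0.969]
example_rate = initial_rate(example_times, example_a265)
print(f"initial rate: {example_rate:.4f} absorbance units per second")
```

The slope is negative because the absorbance at 265 nm decreases as adenosine is converted to inosine; replicate slopes for one substrate concentration would then be averaged before plotting.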


Figure 4-14. Emission scan of the cleavage of the fluorescent tetranucleotide by RNase A is shown. The flat curve represents an uncleaved substrate. The curve with the highest counts (emitted light) represents complete cleavage.

Figure 4-15. A typical progress curve of the tetranucleotide cleavage reaction by RNase is shown. The first part of the curve is a flat line showing background fluorescence before the addition of the enzyme. After the enzyme is added, the fluorescence increases until maximum cleavage of the substrate. The x axis is time and the y axis is counts per second.
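
A progress curve such as the one in Figure 4-15 follows the first-order expression given in the text, I = If - (If - I0) exp(-(kcat/KM)[E]t). The sketch below estimates kcat/KM from such a curve. It is only an illustration under stated assumptions: the fits in this work were done with PRISM by non-linear regression, whereas this sketch uses a simpler log-linearization with If and I0 taken as known; the function names and all numerical values are hypothetical.

```python
import math


def simulate_intensity(t, i0, i_f, kcat_over_km, enzyme_conc):
    """Fluorescence at time t from the first-order progress-curve equation."""
    return i_f - (i_f - i0) * math.exp(-kcat_over_km * enzyme_conc * t)


def estimate_kcat_over_km(times, intensities, i0, i_f, enzyme_conc, min_fraction=0.05):
    """Estimate kcat/KM by linearizing the progress curve.

    ln((If - I) / (If - I0)) = -(kcat/KM) * [E] * t, so a line through the
    origin has slope -(kcat/KM) * [E]. Points closer to the plateau than
    min_fraction are skipped because the logarithm amplifies their noise.
    """
    num = 0.0
    den = 0.0
    for t, i in zip(times, intensities):
        fraction_uncleaved = (i_f - i) / (i_f - i0)
        if fraction_uncleaved <= min_fraction or fraction_uncleaved > 1.0:
            continue
        num += t * math.log(fraction_uncleaved)
        den += t * t
    if den == 0.0:
        raise ValueError("no usable points after time zero")
    slope = num / den  # least-squares slope of a line through the origin
    return -slope / enzyme_conc


# Hypothetical noise-free curve: 100 pM enzyme, followed for 1000 s.
true_value = 3.6e7          # M^-1 s^-1, an assumed value for illustration only
enzyme = 100e-12            # M, inside the 80 to 700 pM range used in the assays
times = [10.0 * k for k in range(101)]
curve = [simulate_intensity(t, 50.0, 1000.0, true_value, enzyme) for t in times]
estimate = estimate_kcat_over_km(times, curve, 50.0, 1000.0, enzyme)
print(f"estimated kcat/KM: {estimate:.3e} M^-1 s^-1")
```

With real, noisy data a non-linear fit that also refines If and I0 would normally be preferred; the linearized form is shown because it makes the first-order character of the equation explicit.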


Figure 4-16. This histogram compares the different kcat/KM values of the ancestral RNase candidate proteins, BS-RNase and RNaseA.


Figure 4-17. Non-reducing (A) and reducing (B) gels with the various purified RNases and RNaseA are shown. All of the seminal type RNases are dimers in the non-reducing gel and RNaseA is a monomer. The reducing gel shows that all of the RNases are monomers.


Figure 4-18. Divinyl sulfone cross-linking reaction in the active site of RNase.


Figure 4-19. Three SDS-PAGE reducing gels showing RNases reacted with DVS taken at three time points (24, 72 and 92 hours). The arrow shows the level at which the dimer migrates.


Table 4-1. UpA assay results

RNase Variant    KM (M)    95% Confidence interval
An19EK           154       -25.25 to 671.4
An22E            282       175.2 to 388.9
An23             1894      -3676 to 7464
An25A            1150      -5215 to 7515
An28             135.5     54.51 to 216.4
BS-RNase         137*
RNaseA           240*

* These measurements were made in the Benner laboratory by Nathalie Trabesinger-Ruef.
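
The KM values in Table 4-1 come from non-linear regression of initial rates against substrate concentration using the Michaelis-Menten equation. A minimal sketch of such a fit follows, under assumptions that should be stated plainly: it is not the PRISM procedure used in this work, it reports no confidence intervals, and the function names, search bounds and example data are hypothetical. For any trial KM the best Vmax has a closed form, so the sketch scans KM on a logarithmic grid and then refines the best value with a golden-section search.

```python
import math


def best_vmax_and_ssr(km, substrate, rates):
    """For a fixed KM, return the least-squares Vmax and the residual sum of squares."""
    x = [s / (km + s) for s in substrate]
    vmax = sum(v * xi for v, xi in zip(rates, x)) / sum(xi * xi for xi in x)
    ssr = sum((v - vmax * xi) ** 2 for v, xi in zip(rates, x))
    return vmax, ssr


def fit_michaelis_menten(substrate, rates, km_min=1.0, km_max=1.0e4, grid=200, iters=80):
    """Fit v0 = Vmax*[S]/(KM + [S]); KM is in the same units as the substrate values."""
    def ssr_at(km):
        return best_vmax_and_ssr(km, substrate, rates)[1]

    # Coarse scan on a logarithmic grid (the bounds are an assumption of this sketch).
    log_lo, log_hi = math.log(km_min), math.log(km_max)
    kms = [math.exp(log_lo + (log_hi - log_lo) * k / (grid - 1)) for k in range(grid)]
    ssrs = [ssr_at(km) for km in kms]
    best = min(range(grid), key=ssrs.__getitem__)
    lo = kms[max(best - 1, 0)]
    hi = kms[min(best + 1, grid - 1)]

    # Golden-section refinement between the neighbours of the best grid point.
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if ssr_at(c) < ssr_at(d):
            hi = d
        else:
            lo = c
    km = (lo + hi) / 2.0
    vmax, ssr = best_vmax_and_ssr(km, substrate, rates)
    return km, vmax, ssr


# Hypothetical noise-free rates generated with KM = 137 and Vmax = 2.0 (arbitrary units),
# at the four substrate concentrations listed for the UpA assay.
concentrations = [30.0, 50.0, 75.0, 100.0]
example_rates = [2.0 * s / (137.0 + s) for s in concentrations]
km_fit, vmax_fit, ssr_fit = fit_michaelis_menten(concentrations, example_rates)
print(f"KM = {km_fit:.1f}, Vmax = {vmax_fit:.3f}")
```

When every substrate concentration lies below KM, as in this example, the curve is close to linear and KM and Vmax are strongly correlated; with noisy data this is one plausible reason for wide confidence intervals such as those of An23 and An25A.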


CHAPTER 5
SEMINAL RIBONUCLEASE AND THE IMMUNE SYSTEM

Immune Response in the Reproductive Tract and Sperm

The mucous membranes lining the digestive, respiratory, and urogenital systems are the major sites of entry for most pathogens, and the vulnerability of these membranes drove the evolution of the mucosal-associated lymphoid tissue. The functional importance of this lymphoid tissue in the body's defense is affirmed by its large population of antibody-producing plasma cells, which exceeds the combined number of these same cells in the spleen, lymph nodes, and bone marrow. The endometrial mucosa lining the uterus is part of this system and is potently armed with the same immune defenses. This mucosa is supplied with extensive lymphatic drainage and includes a wide array of lymphohaemopoietic cells and molecular regulators responsible for an adaptive immune response. The endometrial mucosa differs from other mucosal tissues in that it lacks highly organized secondary lymphoid nodules (e.g., Peyer's patches of the intestinal mucosa). This is probably due to a lower exposure to pathogens relative to other mucosal membranes. The endometrial humoral (antibody mediated) and cell-mediated immune response remains, however, an important reaction against invading antigens, including sperm.

In a quick response to the entry of sperm, the leukocytes populating the mucosal layers of the cervix and the uterus undergo an important change and redistribution (Robertson, Bromfield and Tremellen, 2003). This response has been observed in all mammals in which this phenomenon was studied (Pandya and Cohen, 1985). The


response is initiated when components of seminal plasma interact with uterine epithelial cells. This immune response induces the production of pro-inflammatory cytokines that include granulocyte-macrophage colony stimulating factor (GM-CSF) (Tremellen, Seamark and Robertson, 1998), interleukin-6 (Robertson et al., 1996) and an array of chemokines. Subsequently, the cytokine cascade recruits and activates antigen-presenting cells (APCs), which include macrophages and dendritic cells, into the endometrium, where they engulf and process male antigens, including sperm. These APCs then travel to the para-aortic lymph nodes and the spleen. At this location, lymphocytes are recruited and activated against paternal antigen. In fact, in mice the para-aortic lymph nodes become enlarged upon insemination, and T lymphocytes start proliferating and expressing cytokines. T-cells reactive to paternal antigens are among the primary responders (Johansson et al., 2004). In mice and rats, it has also been shown that responder lymphocytes show signs of suppression after an initial aggressive activation (Kapovic and Rukavina, 1991; Piazzon et al., 1985). Such experiments are not possible in humans; however, some evidence related to infiltration of the reproductive tract with immune cells (Robertson and Sharkey, 2001) has been supportive of the mouse and rat observations.

Immune Suppressive Factors in Mammalian Seminal Plasma

Seminal plasma constituents have several essential roles in a successful fertilization. The complex range of organic and inorganic constituents (metal and salt ions, sugars, lipids, steroid hormones, enzymes, prostaglandin hormones, amino acids and basic amines) all have one primary goal: to facilitate fertilization. They provide a nutritive and protective medium for the spermatozoa during their journey through the female reproductive tract. These sperm supporting agents are necessary because the

normal environment of the vagina is hostile to sperm cells, as it is very viscous and patrolled by immune cells. To some extent, the components of the seminal plasma compensate for this hostile environment.

Naturally, if the described aggressive immune response were the whole story, reproduction and sustenance of mammalian life would be endangered. It is clear that sperm does not cause a hypersensitive immune reaction that incapacitates fertilization. In fact, immune suppressive and regulatory agents have been found in the seminal plasma of many mammals. Two major candidates in humans and mice that have been the subject of some investigation are prostaglandin E2 (PGE2) and TGFβ, both present in high concentrations in human and mouse semen (Kelly and Critchley, 1997). TGFβ is recognized as an inhibitor of the immune response at other mucosal surfaces (Robertson, Bromfield and Tremellen, 2003).

These two factors are not the only ones identified as immune suppressive; other factors spanning several species (Drosophila, mouse, rat, swine, horse, cow and human) share the same biological function. For example, the Fas system, a mechanism in which cells expressing the ligand (FasL) induce apoptosis in target cells bearing Fas receptors, has been proposed (Riccioli et al., 2003) to play an immune-suppressive role in mice. The FasL/Fas ligand/receptor system is expressed in mouse spermatozoa and mouse lymphocytes. Further, females mated to males with nonfunctional FasL have decreased litter sizes. In another mammalian species, the horse, seminal plasma was also shown to reduce binding of sperm to neutrophils, an important early step in the immune response (Alghamdi, Foster and Troedsson, 2004). In boars, an immune suppressive component (ISF) was

isolated (Dostal et al., 1995) and shown to exhibit a suppressive effect on lymphocytes. Lastly, the central protein of this project, bovine seminal ribonuclease, was shown to have many in vitro biological activities, including immunosuppressivity. This biological function was also demonstrated in in vivo experiments (Filipec et al., 1996).

Immune regulation during reproduction (fertilization and pregnancy) is also regulated by female physiology. The evolutionary pressures for successful reproduction are, obviously, present on both sexes to transmit their genes to the next generation. It is thought that the development of immune tolerance and desensitization against paternal antigen are influenced by the hormonal regulation of the female reproductive system and by the cell type and composition of the local immune network of the reproductive tract. For example, there is evidence that hormonally regulated endometrial cells can exhibit some immunosuppression in their proliferative phase (Prabhala et al., 1998).

In this section, we provide data supporting the hypothesis that immunosuppressivity was the selected biomolecular behavior in seminal ribonuclease during the time that the gene evolved under positive Darwinian selection.

Immunological Assays

Mitogen Induced Lymphocyte Proliferation Assay

Leukocytes isolated from peripheral blood contain T lymphocytes that can be nonspecifically induced to proliferate in response to a mitogen such as plant phytohemagglutinin (PHA). Inhibiting or promoting this lymphocyte response is used as an assay to evaluate immunoregulatory agents. The immunosuppressive activity of ancestral RNase candidates was measured by this method.

The purified recombinant BS-RNase preparations produced by at least two different purification methods were cytotoxic to isolated bovine lymphocytes (Figure 5-1 and Figure 5-2), as was reported in prior literature (Soucek et al., 1981). The patterns of cytotoxicity in the ancestral proteins are similar when different methods of purification are used (Figure 5-1 and Figure 5-2). The method of purification used early in the project was based mainly on a cation exchange chromatography step and a pUp affinity step; pUp is a competitive inhibitor of ribonuclease. The second method of purification, which was predominantly used, is based on dialysis of the solubilized proteins against acetic acid and an affinity step in which ADP is used as a competitive inhibitor to elute the pure protein. The similarity of the cytotoxicity patterns indicates that the purification methods did not influence the behavior of the purified RNases.

In this mitogen-induced proliferation assay, some of the ancestral candidates for node An19 show a moderate level of cytotoxicity towards bovine lymphocytes (Figure 5-2). The rest of the ancestral candidates for node An19 show no inhibitory effect on the proliferating cells. The next ancestral proteins, covering nodes An23 and An24 and chronologically following the earliest ancestor tested (An19), seem to show no effect on lymphocytes. With An25, the ancestral protein of the buffalo and oxen, two of the three possible ancestral candidates exhibit a modest but real level of cytotoxicity towards bovine lymphocytes. This inhibition is accentuated in BS-RNase.

Mixed Lymphocyte Assay (MLR)

The MLR is an in vitro immunological assay that has many applications, including the assessment of AIDS patients’ immune response (Clerici et al., 1989), prediction of transplant rejection (Kerman et al., 1997), and evaluation of immunosuppressive drugs (Dos Reis and Shevach, 1981).

In an MLR, T lymphocyte populations from two different individuals undergo rapid and radical transformation and cell proliferation. The assay measures the degree of proliferation under the tested conditions through the incorporation of 3H-thymidine, added to the culture medium, into the DNA of dividing cells. Both populations of lymphocytes proliferate unless one population is rendered unresponsive to obtain a one-way MLR, in which the immunocompetence of only one individual is evaluated. In this project, both lymphocyte populations are responsive because the goal is to measure the degree of immunosuppression of the tested ancestral proteins, not to investigate the immune reaction of any particular individual.

Within 24 to 48 hours, the responder T cells begin proliferating in response to the alloantigens of the stimulator cells, and by 72 to 96 hours a population of cytotoxic T lymphocytes is produced. T helper lymphocytes (TH) have a significant role in this assay, as proliferation can be blocked if TH cells are inactivated by antibodies specific for the CD4 marker. In this assay, TH cells recognize allogeneic class II MHC (major histocompatibility complex) molecules on the stimulator cells of the second individual and proliferate in response to the differences. In addition to TH cells, accessory cells such as macrophages are necessary for a successful MLR response. When adherent cells (mostly macrophages) are omitted from this reaction, there is no response. The role of macrophages in this reaction is to activate TH cells, without which proliferation does not take place.

Relatively few studies examine the detailed leukocyte population of the endometrium before and after fertilization, with the exception of a few centered on mouse reproduction. In these studies, the cell population of the para-aortic lymph node, shown previously to drain the uterus and respond to antigen in the reproductive tract, was

analyzed (Johansson et al., 2004). Both T and B lymphocytes were found in these nodes. The activated cells appear to be predominantly CD4+ and CD8+ T lymphocytes, in addition to other types of leukocytes. These observations further support the choice of the MLR assay as a method to test the level of immunosuppression associated with the tested ancestral ribonucleases.

RNase A is the closest homolog of BS-RNase (over 80% identity), but does not have a cytotoxic effect on lymphocytes. This protein was used as a control in the MLR assays and, as expected, it does not exhibit immunosuppression (Figure 5-3). All of the ancestral proteins in the MLR reaction show various levels of cytotoxicity. Ancestral candidates for the oldest studied node (An19) reduce the proliferation to about 60% of the control level. The intermediate ancestors An22 and An23 show a smaller effect compared to An19. The level of cytotoxicity then increases incrementally in the subsequent ancestral RNases, starting with An24, then An25, and finally reaching the highest level of inhibition with BS-RNase. The An24 ancestral proteins show proliferation at 60-70% of control, the An25 ancestral candidates at 50-60% of control, and BS-RNase at ~30% of control.

The first basic observation is that different ancestral candidates for an ancestral protein representing one time point (An19, An24 and An25) in the evolutionary history of the seminal gene behave very similarly. This is interpreted to mean that the ancestral reconstruction is robust with respect to the ambiguity in the evolutionary analysis. More importantly, we observe an increase in the level of cytotoxicity, as measured by a reduction in the incorporation of tritiated thymidine into the DNA of the dividing

lymphocytes. This in vitro biomolecular behavior correlates directly with the detected episode of adaptive evolution (see Chapter 3).

Both the mitogen proliferation assay (MPA) and the MLR were consistent with the results of previous studies in demonstrating that BS-RNase exhibits a cytotoxic effect on proliferating lymphocytes. The extent of cytotoxicity is higher in the MLR assay than in the MPA assay. The MLR mimics the physiological conditions more closely than the MPA, however, because it relies on immune cells to recognize foreign antigen on cells from a genetically different individual. Further, the mechanism of the MLR includes communication between phagocytic cells (macrophages and dendritic cells) and T responder cells. The proliferation of these cells is measured by the MLR assay. The same cells and a similar mechanism were found in immunological responses to sperm and seminal plasma in the mouse reproductive tract. This makes the MLR an appropriate in vitro assay to measure immunosuppression by seminal proteins. The combination of the MLR and the use of bovine lymphocytes to evaluate the immunosuppressive effect of the ancestral ribonucleases is an effort to make this series of experiments as physiologically relevant as possible.

The results of the immunological proliferation assays (MLR and MPA) show an incremental increase in cytotoxic effect, interpreted as an immunosuppressive effect, starting with An25 and leading to the contemporary BS-RNase. Interestingly, these proteins represent stages in the evolutionary history of the seminal gene where the evolutionary analysis indicated a rapid phase of evolution (Chapter 3). An24 ancestral candidates are without immunosuppressive activity; this node is followed by a branch of positive Darwinian evolution that leads to An25. An25 has more immunosuppressive

activity than the preceding ancestral nodes. The branch linking An25 and BS-RNase also exhibits adaptive evolution, ending in the contemporary BS-RNase with the highest level of immunosuppression.

The AIC analysis (Chapter 3) gave the strongest support for an episode of rapid adaptive evolution on the branches linking the ancestor of the Bovidae to the contemporary BS-RNase. The paleogenetics approach can help us better confirm the functionally relevant episodes of positive Darwinian selection. More important, it helps us define the characters under selection and better understand physiological function. In this case, it seems that immunosuppression has been positively selected during the evolutionary episode discussed, as an advantageous trait benefiting successful reproduction.

The proteins from the evolutionary period that spans An19, the earliest studied ancestor, and An24, the Kudu ancestor, seem to have no immunosuppressive activity in the MPA and minimal activity in the MLR. These ancestors indeed lead to complete loss of activity in the contemporary animals, in which the seminal ribonuclease genes have become pseudogenes. One can postulate that the minimal activity of these ancestral proteins deteriorated in the lineages including Saiga, Duiker, and Kudu, ending in complete loss and death of the gene. This gene came under strong positive selection for an immunosuppressive character that reached its highest degree in the contemporary bovine species. The remaining question is what this gene was doing during the entire period from the time of the gene duplication until the start of the episode of rapid evolution.

Apoptosis Assay Using Flow Cytometry

It has been reported in some publications that bovine seminal ribonuclease is cytotoxic to lymphocytes via an apoptotic mechanism (Sinatra et al., 2000).

Consequently, we wanted to test this idea in the context of both the ancestral proteins and extant BS-RNase.

Apoptosis is a carefully regulated process of cell death that occurs as a normal part of development. Inappropriately regulated apoptosis is implicated in disease states such as Alzheimer’s disease and cancer. Apoptosis is distinguished from necrosis, or accidental cell death, by characteristic morphological and biochemical changes, including compaction and fragmentation of the nuclear chromatin, shrinkage of the cytoplasm, and loss of membrane asymmetry (Allen, Hunter and Agrawal, 1997; Darzynkiewicz et al., 1997; Lincz, 1998). In normal live cells, phosphatidylserine (PS) is located on the cytoplasmic surface of the cell membrane. In apoptotic cells, however, PS is translocated from the inner to the outer leaflet of the plasma membrane, thus exposing PS to the external cellular environment (van Engeland et al., 1998). In leukocyte apoptosis, PS on the outer surface of the cell marks the cell for recognition and phagocytosis by macrophages (Fadok et al., 1992). The human anticoagulant annexin V is a 35 kD Ca2+-dependent phospholipid-binding protein that has a high affinity for PS (Andree et al., 1990). Annexin V labeled with a fluorophore or biotin can identify apoptotic cells by binding to PS exposed on the outer leaflet (Koopman et al., 1994).

We used a flow cytometry apoptosis assay kit (Invitrogen Molecular Probes) based on annexin V (labeled with the green-fluorescent Alexa Fluor 488) and propidium iodide (PI). The red-fluorescent PI stains nucleic acids and does not permeate the membranes of live cells. When cells die, their membrane integrity is lost, and PI enters and binds tightly to nucleic acids. After staining a cell population with Alexa Fluor 488-annexin V and PI, apoptotic cells show green

fluorescence, dead cells show red and green fluorescence, and live cells show little or no fluorescence.

The mitogen (PHA) proliferation assay was carried out to assess the apoptotic effect of these proteins and relate it to the observed cytotoxic effects. In the same experiments, the cytotoxic effect was also evaluated by cell counting using Trypan blue and a hemocytometer. The annexin V and propidium iodide assay showed no significant difference (Figure 5-4) in the relative percentages of dead, live and apoptotic cells. The apoptotic cells ranged between 10 and 13%, live cells between 50 and 60%, and dead cells between 25 and 35%. These percentages are proportions of a 10,000-cell sample from each single assay and are not a measure of proliferation levels. The 24-hour and 48-hour time points did not show a major difference, with the exception that the percentage of dead cells increased.

Duplicate samples of the flow cytometry assay were counted using a hemocytometer and Trypan blue. The percentages of dead and live cells are shown in Figure 5-5. These numbers agree with the flow cytometry results. If the apoptotic and live cell percentages from flow cytometry are added, they range from 60 to 70%, which is similar to the live-cell range from the Trypan blue counts. However, when the cells are counted using Trypan blue to evaluate the extent of proliferation (Figure 5-6), the pattern of the data resembles that of the mitogen proliferation assay using 3H-thymidine.

This dissertation has chosen not to investigate the mechanism of immunosuppression or the cytotoxic effect of these ribonucleases. Other researchers have discussed this problem generally, reporting that BS-RNase inhibits lymphocyte proliferation via induction of apoptosis in these cells. BS-RNase is described as cytotoxic

to proliferating lymphocytes and cancer cells. The annexin V/propidium iodide apoptosis assay showed no significant effect of the tested RNases on the level of apoptotic cells. This result was surprising, especially because clear differences in cell proliferation were observed between assays (Figure 5-6). Investigation of the proliferation levels by cell counting reproduced the proliferation assay results reported earlier. The percentages of live and dead cells, on the other hand, supported the apoptosis assay results in terms of the ratio of live to dead cells. If some of these proteins reduce the number of dividing cells without shifting the overall ratio of dead to live cells, then these data support a cytostatic effect rather than a cytotoxic one. This suggestion appears to conflict with earlier reports on this protein’s mechanism of action; further investigation into this matter is warranted.

Materials and Methods

Bovine Peripheral Blood Leukocyte Isolation: Method 1

Peripheral bovine blood was collected from the jugular or the coccygeal vein in heparinized Vacutainer tubes. Two or three 10 mL tubes with roughly 7 mL of whole blood in each produce enough leukocytes for a 96-well microtiter plate. The number of leukocytes varies across animals and depends on their physiological state. The blood was centrifuged for 20 min at 400 x g at room temperature to obtain the buffy coat, a whitish interface that lies between the upper plasma and the bottom layer of red blood cells (RBCs) and granulocytes. The buffy coat was removed with a Pasteur pipette and resuspended in 2 mL of RPMI 1640 medium. The cell suspension was then carefully layered on top of 2 mL of Ficoll-Hypaque. The tube was then centrifuged for 30 min at 400 x g.

The leukocytes accumulated at the interface between the upper medium layer and the bottom Ficoll-Hypaque layer, which contained the contaminating red blood cells. The leukocyte layer was aspirated with a Pasteur pipette and transferred to 2 mL of PBS. The cells were then centrifuged for 5 min at 500 x g and the PBS was aspirated. If contaminating RBCs remained, the cell pellet was resuspended in RBC lysis buffer for 5 sec and immediately diluted tenfold with PBS. The cells were then centrifuged again for 5 min at 500 x g, and the pellet was resuspended in complete medium (RPMI 1640 supplemented with glutamine, 20% fetal bovine serum and antibiotics). A small aliquot of the cell suspension was taken and mixed 1:1 with a 0.4% solution of Trypan blue. Live cells were counted using a hemocytometer and the cell concentration was calculated. Cells were then diluted in complete medium to a final concentration of 1 x 10^6 cells/mL for the mitogen proliferation assay or 3 x 10^6 cells/mL for the MLR assay.

Bovine Peripheral Blood Leukocyte Isolation: Method 2

Bovine blood was collected in the manner previously described. Whole blood (15 mL) was pipetted into a 50 mL conical centrifuge tube and 25 mL of room temperature PBS was added. Using a 10 mL pipet, the diluted whole blood was carefully layered on top of 10 mL of room temperature Ficoll-Hypaque solution. The tubes were then centrifuged for 20 min at 800 x g at room temperature with the brake off. Most of the plasma- and platelet-containing supernatant above the interface band was aspirated (granulocytes and erythrocytes are in the red pellet). The interface band (which includes the lymphocytes) was transferred, along with no more than 5 mL of the fluid above the pellet, with a 10 mL pipet to a new 50 mL conical centrifuge tube, combining the bands from 2 to 3 Ficoll-Hypaque gradients into one 50 mL tube. PBS was added to the 50 mL mark.

The tubes were centrifuged for 10 min at 600 x g at room temperature (RT) with the brake on. The supernatants were aspirated and the pellet in each tube was resuspended in 10 mL of room temperature PBS. Resuspended pellets were combined into as few 50 mL tubes as possible, and PBS was added to the 50 mL mark in each tube used. The cells were centrifuged for 15 min at 300 x g at RT with the brake on. The last step was repeated at least once more, but with a centrifugation of 10 min at 100 x g at RT with the brake on. This low-speed centrifugation permits as many platelets as possible to remain above the pellet of lymphocytes. High numbers of platelets might interfere with the successful stimulation and proliferation of lymphocytes in the MLR assay. However, the low-speed centrifugation also causes loss of cells with each repetition. Cells were then suspended in complete medium as described earlier, counted, and diluted to 1 x 10^6 cells/mL for the mitogen proliferation assay or 3 x 10^6 cells/mL for the MLR assay.

Mitogen Induced Proliferation Assay

For each test well in a 96-well plate, 100 µL of the 1 x 10^6 cells/mL cell suspension (1 x 10^5 cells per well) was added. The appropriate amount of the tested ribonuclease was then added in a 50 µL volume of PBS. The concentrations tested ranged from 0 to 100 µg/mL of ribonuclease in a final volume of 250 µL per test well. A volume of 100 µL of complete medium containing the mitogen was added last. Phytohemagglutinin-L (PHA) was used as the mitogen at concentrations ranging from 0.1 µg/mL to 5 µg/mL. Concanavalin A (3 µg/mL) was also used to verify the robustness of the results. In each assay, controls without mitogen, controls with mitogen only and no enzyme, and controls with mitogen and ribonuclease A were also tested. Each assay was done at least in triplicate. The plate was placed on a belly-dancer (shaker) for 3 min and then placed in a 37 °C, 5% CO2 incubator.
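The cell-count and dilution arithmetic behind this plate setup can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in this work: the function names are hypothetical, the conversion factor of 10^4 assumes a standard hemocytometer in which one large square holds 0.1 µL (the chamber type is not specified in the text), and the example counts are invented.

```python
# Illustrative arithmetic only; names, chamber geometry and counts are assumptions.

LARGE_SQUARES_PER_ML = 1e4  # assumes one large square holds 0.1 uL (1e-4 mL)


def cells_per_ml(live_counts, dilution_factor=2.0):
    """Live-cell concentration from hemocytometer counts.

    live_counts: live cells counted in each large square.
    dilution_factor: 2.0 for a 1:1 mix of cell suspension and Trypan blue.
    """
    mean_count = sum(live_counts) / len(live_counts)
    return mean_count * dilution_factor * LARGE_SQUARES_PER_ML


def dilution_volumes(stock_cells_per_ml, target_cells_per_ml, final_volume_ml):
    """Return (mL of cell suspension, mL of medium) to reach the target density."""
    if target_cells_per_ml > stock_cells_per_ml:
        raise ValueError("stock suspension is too dilute for this target")
    cell_volume = final_volume_ml * target_cells_per_ml / stock_cells_per_ml
    return cell_volume, final_volume_ml - cell_volume


def addition_concentration(final_conc, addition_volume_ul=50.0, well_volume_ul=250.0):
    """Concentration needed in the 50 uL addition to reach final_conc in 250 uL."""
    return final_conc * well_volume_ul / addition_volume_ul


# Invented counts from four large squares after the 1:1 Trypan blue mix.
stock = cells_per_ml([48, 52, 50, 50])             # 1e6 cells/mL
cells_in_well = 1e6 * 0.1                          # 100 uL of 1e6 cells/mL
protein_stock = addition_concentration(100.0)      # ug/mL in the 50 uL of PBS
print(stock, cells_in_well, protein_stock)
```

Under these assumptions, reaching the highest final concentration in the text (100 µg/mL in 250 µL) requires a fivefold more concentrated solution in the 50 µL PBS addition.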

The incubation of the assay lasted 72 hours. Sixteen to 18 hours before harvesting the cells, 1 µCi of 3H-thymidine was added to the test wells. The plate was placed on a belly-dancer for 3 min and then returned to the incubator. The cells were harvested onto glass fiber filters using a Skatron/Molecular Devices Combi-12 Cell Harvester. The cells were washed out of the wells and onto the filters with 3 consecutive 10-sec washes of PBS, water, and PBS. The dry filters were then placed in scintillation vials with 4 mL of Scintiverse II or Scintiverse Bio-HP, and the radioactive counts were measured in a Beckman 6400 scintillation counter.

Mixed Lymphocyte Proliferation Assay

In this assay, 3 x 10^5 cells from each of two different animals were used in every test well: 100 µL from each animal’s cell suspension (3 x 10^6 cells/mL) was added. The enzyme to be tested was added in the same way described earlier, in 50 µL of PBS. It is important to use at least this cell number from a cell suspension that has minimal contamination with platelets and no RBCs. These contaminants could inhibit proliferation and waste large amounts of purified proteins. The plate was also mixed in the same way (on a belly-dancer for 3 min). It is also important to use a round-bottom well plate for this assay, as an important part of the signals that cause proliferation comes from allogeneic cell-cell interactions. These are maximized when the cells accumulate at the bottom of the well. It is also important to shake the plate at least once a day on a belly-dancer shaker during the course of the assay to ensure proper mixing and continuous cell-cell contact. The assay lasts 6 days, at the end of which the cells are harvested in the same way described earlier. On the fifth day, roughly 18 hours before harvest, 1 µCi of 3H-thymidine is added to the

test wells and the plate is mixed in the same way. The filters are also counted in the previously described manner.

Annexin V/Propidium Iodide Apoptosis Assay

Cells were harvested and washed in cold PBS, centrifuged, and resuspended in 1X annexin-binding buffer (10 mM HEPES, 140 mM NaCl, 2.5 mM CaCl2, pH 7.4) at a density of ~1 x 10^6 cells/mL. The annexin V solution (5 µL) and PI solution (1 µL, 100 µg/mL) were then added to the cells (100 µL per assay) and incubated at room temperature for 15 min. Annexin-binding buffer ([...] µL) was added to the cell suspension, which was mixed and kept on ice. Samples were promptly analyzed on a flow cytometer; 10,000 cells per sample were counted to satisfy statistical significance. The assay was performed at 24 and 48 hours from the start of the assay.

Figure 5-1. Mitogen (PHA) proliferation assay using proteins purified on a pUp column. The concentrations tested were 50 µg/mL and 100 µg/mL.

Figure 5-2. Mitogen (PHA) proliferation assay using proteins purified on an affinity column with ADP as the competitive inhibitor for elution. The concentration tested was 50 µg/mL.

Figure 5-3. Mixed Lymphocyte Reaction (MLR). The concentration of RNase used is 50 µg/mL. Controls are no treatment (Cont) and RNase A (AA).
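The MLR results in this chapter are quoted as proliferation relative to the untreated control (for example, "about 60% of the control level"). A minimal sketch of that normalization, applied to replicate scintillation counts, might look as follows. The function name is hypothetical, the counts are invented for illustration and are not data from this work, and no background subtraction is assumed.

```python
from statistics import mean


def percent_of_control(treated_cpm, control_cpm):
    """Mean treated counts expressed as a percentage of mean untreated-control counts."""
    control_mean = mean(control_cpm)
    if control_mean <= 0:
        raise ValueError("control counts must be positive")
    return 100.0 * mean(treated_cpm) / control_mean


# Invented triplicate readings (counts per minute), for illustration only.
control = [10000, 11000, 9000]
treated = [3000, 3100, 2900]
print(f"{percent_of_control(treated, control):.1f}% of control")  # prints "30.0% of control"
```

Averaging the replicates before taking the ratio keeps a single noisy well from dominating the reported percentage.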

Figure 5-4. Percentages of apoptotic, dead and live cells in a mitogen proliferation assay. The tested proteins do not seem to have a significant effect on the proportions relating to cell viability.
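The three classes reported in Figure 5-4 follow from the staining rule given in the text: green only (annexin V) marks apoptotic cells, red and green (PI and annexin V) marks dead cells, and little or no fluorescence marks live cells. The sketch below applies that rule to per-cell signals. It is an assumption-laden illustration: the gate values, the function names and the example events are hypothetical, and the PI-only case, which the text does not describe, is simply labeled "unassigned".

```python
from collections import Counter


def classify_event(annexin_signal, pi_signal, annexin_gate, pi_gate):
    """Assign one cell to a class from its green (annexin V) and red (PI) signals."""
    annexin_positive = annexin_signal > annexin_gate
    pi_positive = pi_signal > pi_gate
    if annexin_positive and pi_positive:
        return "dead"
    if annexin_positive:
        return "apoptotic"
    if pi_positive:
        return "unassigned"  # PI-only events are not described in the text
    return "live"


def class_percentages(events, annexin_gate, pi_gate):
    """Percentage of events in each class; events are (annexin, PI) signal pairs."""
    counts = Counter(
        classify_event(annexin, pi, annexin_gate, pi_gate) for annexin, pi in events
    )
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Four invented events with arbitrary gates of 100 signal units.
events = [(5, 3), (900, 4), (850, 700), (7, 2)]
print(class_percentages(events, annexin_gate=100, pi_gate=100))
```

Because the result is a proportion of the counted sample, it describes viability within a well and says nothing about how many cells the well contains, which is why the text treats it separately from the proliferation measurements.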

Figure 5-5. Percentages of dead and live cells in a mitogen proliferation assay determined by Trypan blue cell counting.

Figure 5-6. Cell viability in a mitogen proliferation assay, determined by cell counting using a hemocytometer. This assay was done in parallel with the apoptosis assay.

CHAPTER 6
CONCLUSIONS

1. Bovine seminal RNase underwent an episode of positive adaptive pressure in the last two million years.
2. Along the lineage leading to modern bovine seminal RNase after a gene duplication ca. 40 Ma, the protein was expressed and evolving under functional constraints.
3. This notwithstanding, the seminal RNase gene became a pseudogene in lineages leading to most other ruminants, and did so rapidly in the deer lineage.
4. Immunosuppressivity directly, and dimeric S-peptide swap indirectly, are the features of the protein important for the new biological activity in bovine seminal plasma.
5. Catalytic activity, dimeric structure, and other behaviors measurable in vitro are not important for the new biological activity in bovine seminal plasma, although they may have been important to selected function in the lineage leading to BS-RNase, and may have been prerequisite for physiological function in BS-RNase.
6. Experimental paleogenetics is a useful tool to help connect in vitro behavior to selected physiological function.
7. Various tools were developed here to assess the robustness of reconstructions, to use crystallographic information to understand the relationship between fitness and structure, and
8. Various experimental methods were developed here, including the use of affinity columns.

Regarding future directions, during the course of this project the number of artiodactyl species known to humankind increased by about 10%. We have sampled only ca. 10% of the known biodiversity of artiodactyls. The next step in this project specifically is to increase the sample size. More generally, the next step is to identify paleogenetic research targets more relevant to human medicine.

APPENDIX
MUTAGENESIS PRIMERS AND STARTING SEQUENCE

The site-directed mutagenesis to generate the RNase gene variants was started with the An35 RNase gene. This appendix presents the sequence of An35 and the sequences of the primers used in the mutagenesis.

The DNA and protein sequences of An35 are:

aaagaatctgcagctgccaagttcgagcggcagcacatggactctggcagctcc ccc agc
K E S A A A K F E R Q H M D S G S S P S
agcagctccaactactgcaacctgatgatgttctgccggaagatgacccaggggaaatgc
S S S N Y C N L M M F C R K M T Q G K C
aagccagtgaacacctttgtgcatgagtccctggccgatgttaaggccgtgtgctcccag
K P V N T F V H E S L A D V K A V C S Q
aagaaagtcgcctgcaagaatgggcag acc aactgctaccagagcaactccgccatgcac
K K V A C K N G Q T N C Y Q S N S A M H
atcacagactgccgcgagactggcagctccaagtaccccaactgcgcctacaagaccacc
I T D C R E T G S S K Y P N C A Y K T T
cgggcggagaaacacatcatagtggcttgtgagggtaaaccgtacgtgccagtccacttc
R A E K H I I V A C E G K P Y V P V H F
gatgcttcagtg
D A S V

The primers used in the mutagenesis section of the project are:

A64T/For gctcccagaagaaagtcacgtgcaagaatgggcagacc
A64T/Rev ggtctgcccattcttgcacgtgactttcttctgggagc
R101Q/For gcctacaagaccacccaagcggagaaacacatc
R101Q/Rev gatgtgtttctccgcttgggtggtcttgtaggc
H80R/For gcaactccgccatgcgcatcacagactgccg
H80R/Rev cggcagtctgtgatgcgcatggcggagttgc
A102V/For gcctacaagaccacccaagttgagaaacacatc
A102V/Rev gatgtgtttctcaacttgggtggtcttgtaggc
NA76KT/For ctaccagagcaaatccaccatgcgcatcacagactg
NA76KT/Rev cagtctgtgatgcgcatggtggatttgctctggtag
E111G/For catcatagtggcttgtggtggtaaaccgtacgtgc
E111G/Rev gcacgtacggtttaccaccacaagccactatgatg
S17N/For catggactccggcaactcccccagcagcagctc
S17N/Rev gagctgctgctgggggagttgccggagtccatg
F31C/For caacctgatgatgtgctgccggaagatgaccc

F31C/Rev gggtcatcttccggcagcacatcatcaggttg
Y115S/For gtggtggtaaaccgtcggtgccagtccacttc
Y115S/Rev gaagtggactggcaccgacggtttaccaccac
E111A/For catcatagtggcttgtgcaggtaaaccgtacgtgccag
E111A/Rev ctggcacgtacggtttacctgcacaagccactatgatg
S22N/For ctcccccagcagcaactccaactactgcaac
S22N/Rev gttgcagtagttggagttgctgctgggggag
E9G/For cagctgccaagttcggtcgtcagcacatggac
E9G/Rev gtccatgtgctgacgaccgaacttggcagctg
E9R/For agctgccaagttccgccgtcagcacatgga
E9R/Rev tccatgtgctgacggcggaacttggcagct
D53N/For catgagtccctggcgaacgttaaggcggtgtgctc
D53N/Rev gagcacaccgccttaacgttcgccagggactcatg
T70S/For gcaagaatgggcagtcgaattgctaccagag
T70S/Rev ctctggtagcaattcgactgcccattcttgc
K113N/For gtggcttgtgagggtaacccgtacgtgccag
K113N/Rev ctggcacgtacgggttaccctcacaagccac
E9G/For cagctgccaagttcggtcgccagcacatggac
E9G/Rev gtccatgtgctggcgaccgaacttggcagctg
E9R/For cagctgccaagttccgccgccagcacatggac
E9R/Rev gtccatgtgctggcggcggaacttggcagctg
E9W/For gcggcgaagttctggcggcagcacatggactc
E9W/Rev gagtccatgtgctgccgccagaacttcgccgc
M13T/For ctggcggcagcacacggactctggcagctccc
M13T/Rev gggagctgccagagtccgtgtgctgccgccag
P19S/For gactctggcagctcctcgagcagcaactccaac
P19S/Rev gttggagttgctgctcgaggagctgccagagtc
K41R/For gacccaggggaaatgccgtccagtgaacacctttg
K41R/Rev caaaggtgttcactggacggcatttcccctgggtc
N44D/For gaaatgccgtccagtggatacctttgtgcatg
N44D/Rev catgcacaaaggtatccactggacggcatttc
N71S/For caagaatgggcagtcgtcgtgttaccagagcaac
N71S/Rev gttgctctggtaacacgacgactgcccattcttg
D83E/For gccatgcacatcacagaatgtcgcgagactggcag
D83E/Rev ctgccagtctcgcgacattctgtgatgtgcatggc
D121G/For ccgtacgtgccagtccacttcggtgcttcagtg
D121G/Rev cactgaagcaccgaagtggactggcacgtacgg
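Two properties of the sequences listed in this appendix can be checked mechanically: each reverse primer should be the reverse complement of its forward partner, and the DNA sequence should translate to the protein sequence printed beneath it. The following Python sketch performs both checks on the A64T primer pair and the first 60 nucleotides of An35. It is an illustrative check written for this purpose, not a tool used in the project; the helper names are hypothetical, and the standard genetic code is assumed.

```python
BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(seq):
    """Upper-case a sequence and drop any whitespace."""
    return "".join(seq.split()).upper()


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return clean(seq).translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate DNA in frame 1, ignoring any trailing partial codon."""
    dna = clean(seq)
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Sequences copied from this appendix.
an35_start = "aaagaatctgcagctgccaagttcgagcggcagcacatggactctggcagctcc ccc agc"
a64t_for = "gctcccagaagaaagtcacgtgcaagaatgggcagacc"
a64t_rev = "ggtctgcccattcttgcacgtgactttcttctgggagc"
# An35 template spanning the same region as the A64T forward primer.
template = "gctcccagaagaaagtcgcctgcaagaatgggcagacc"

assert translate(an35_start) == "KESAAAKFERQHMDSGSSPS"
assert reverse_complement(a64t_for) == clean(a64t_rev)

# Positions where the primer departs from the template (0-based).
changed = [i for i, (p, t) in enumerate(zip(a64t_for, template)) if p != t]
print(changed, translate(template[17:20]), "->", translate(a64t_for[17:20]))
```

For this pair the primer differs from the template at two positions within codon 64, changing gcc (Ala) to acg (Thr), as the primer name indicates.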

LIST OF REFERENCES

Adachi, J., and Hasegawa, M. (1996). Comput. Sci. Monogr. 28.
Adey, N. B., Tollefsbol, T. O., Sparks, A. B., Edgell, M. H., and Hutchison, C. A., 3rd (1994). Proc. Natl. Acad. Sci. U. S. A. 91, 1569.
Alghamdi, A. S., Foster, D. N., and Troedsson, M. H. (2004). Reproduction 127, 593.
Allard, M. W., Miyamoto, M. M., Jarecki, L., Kraus, F., and Tennant, M. R. (1992). Proc. Natl. Acad. Sci. U. S. A. 89, 3972.
Allen, R. T., Hunter, W. J., 3rd, and Agrawal, D. K. (1997). J. Pharmacol. Toxicol. Methods 37, 215.
Andree, H. A., Reutelingsperger, C. P., Hauptmann, R., Hemker, H. C., Hermens, W. T., and Willems, G. M. (1990). J. Biol. Chem. 265, 4923.
Arai, K. I., Kaziro, Y., and Kawakita, M. (1972). J. Biol. Chem. 247, 7029.
Ardelt, W., Mikulski, S. M., and Shogen, K. (1991). J. Biol. Chem. 266, 245.
Arnason, U., Gullberg, A., and Janke, A. (1998). J. Mol. Evol. 47, 718.
Ashburner, M. (1998). Bioessays 20, 949.
Axe, D. D. (2000). J. Mol. Biol. 301, 585.
Baker, M. E. (2003). Bioessays 25, 396.
Barnard, E. A. (1969). Nature 221, 340.
Barrett, P. M., and Willis, K. J. (2001). Biological Rev. 76, 411.
Baudin, A., Ozier-Kalogeropoulos, O., Denouel, A., Lacroute, F., and Cullin, C. (1993). Nucleic Acids Res. 21, 3329.
Beintema, J. J., Broos, J., Meulenberg, J., and Schuller, C. (1985). Eur. J. Biochem. 153, 305.
Beintema, J. J., Gaastra, W., and Munniksma, J. (1979). J. Mol. Evol. 13, 305.
Beintema, J. J., and Gruber, M. (1967). Biochim. Biophys. Acta 147, 612.

PAGE 272

257 Beintema, J. J., and Gruber, M. (1973). Biochim. Biophys. Acta. 310, 161. Beintema, J. J., and Martena, B. (1982). Mammalia 46, 253. Beintema, J. J., Schuller, C., Irie, M., and Carsana, A. (1988). Prog. Biophys. Mol. Biol. 51, 165. Beintema, J. J., Wietzes, P., Weickma nn, J. L., and Glitz, D. G. (1984). Anal. Biochem. 136, 48. Benner, S. A. (1988). FEBS Lett. 233, 225. Benner, S. A. (2002). Proc. Natl. Acad. Sci. U. S. A. 99, 4760. Benner, S. A. (2003). Adv. Enzyme Regul. 43, 271. Benner, S. A., and Allemann, R. K. (1989). Trends Biochem. Sci. 14, 396. Benner, S. A., Cannarozzi, G., Gerloff, D., Turcotte, M., and Chelvanayagam, G. (1997). Chem. Rev. 97, 2725. Benner, S. A., Caraco, M. D., Thomson, J. M., and Gaucher, E. A. (2002). Science 296, 864. Benner, S. A., Cohen, M. A., and Gonnet, G. (1992). Abstracts Of Papers Of The American Chemical Society 203, 61. Benner, S. A., and Ellington, A. D. (1990). In "Bioorganic Chemistry Frontiers" (H. Dugas, ed.), p. 1. sp ringer-Verlag, Berlin. Benner, S. A., and Gerloff, D. (1991). Adv. Enzyme Regul. 31, 121. Benner, S. A., Haugg, M., Jermann, T. M., Op itz, J. G., Raillard-Yoon, S.-A., Soucek, J., Stackhouse, J., Trabesinger-Ruef, N., Trau twein-Fritz, K., Zankel, T. R. (1997). In "Ribonucleases" (J. R. a. G. D'Alessio, ed.), p. 214. Academic Press, Inc., New York. Benner, S. A., Trabesinger, N., and Schreiber, D. (1998). Adv. Enzyme Regul. 38, 155. Benton, M. J., and Ayala, F. J. (2003). Science 300, 1698. Berbee, M. L., and Taylor, J. W. (1993). Can. J. Bot. 71, 1114. Bielawski, J. P., and Yang, Z. H. (2004). J. Mol. Evol. 59, 121. Blackburn, P., and Gavilanes, J. G. (1982). J Biol Chem 257, 316. Blackburn, P., and Moore, S. (1982). In "The Enzymes", p. 317. Academic Press, New York.

PAGE 273

258 Bonaventura, J., Bonaventura, C., and Sullivan, B. (1974). Science 186, 57. Boulton, B., Singleton, V. L., Bisson, L. F., and Kunkee, R. E. (1996). In "Principles and Practices of Winemaking", p. 139. Chapman and Hall, New York. Brand, A. H., and Perrimon, N. (1993). Development 118, 401. Breukelman, H. J., Beintema, J. J., Confalone , E., Costanzo, C., Sasso, M. P., Carsana, A., Palmieri, M., and Furia, A. (1993). J. Mol. Evol. 37, 29. Breukelman, H. J., Jekel, P. A., Dubois, J. Y. F., Mulder, P., Warmels, H. W., and Beintema, J. J. (2001). Eur. J. Biochem. 268, 3890. Brochier, C., and Philippe, H. (2002). Nature 417, 244. Buck, M., and Rosen, M. K. (2001). Science 291, 2329. Cai, W., Pei, J., and Grishin, N. V. (2004). BMC Evol. Biol. 4, 33. Cao, Y., Adachi, J., Yano, T. a., and Hasegawa, M. (1994). Mol. Biol. Evol. 11, 593. Carrio, M. M., Cubarsi, R ., and Villaverde, A. (2000). FEBS Lett 471, 7. Carroll, R. L. (1988). "Vertebrate Paleont ology and Evolution ", WH Freeman & Co, New York City. Carsana, A., Confalone, E., Palmieri, M., Libonati, M., and Furia, A. (1988). Nucleic Acids Res. 16, 5491. Cavalier-Smith, T. (2002). Int. J. Syst. Evol. Microbiol. 52, 7. Cavender (1991). In "Cyprinid Fishes: Systematics, Bi ology And Exploitati on" (I. J. W. a. J. S. Nelson, ed.). Chapman and Hall, London. Chalepakis, G., Stoykova, A., Wijnholds, J ., Tremblay, P., and Gruss, P. (1993). J. Neurobiol. 24, 1367. Chandrasekharan, U. M., Sanker, S., Glynias, M. J., Karnik, S. S., and Husain, A. (1996). Science 271, 502. Chang, B. S. W., and Donoghue, M. J. (2000). Trends Ecol. Evol. 15, 109. Chang, B. S. W., Jonsson, K., Kazmi, M. A., Donoghue, M. J., and Sakmar, T. P. (2002). Mol. Biol. Evol. 19, 1483. Chinen, A., Matsumoto, Y., and Kawamura, S. (2005a). Mol. Biol. Evol. 22, 1001. Chinen, A., Matsumoto, Y., and Kawamura, S. (2005b). J. Biol. Chem. 280, 9460.

PAGE 274

259 Ciglic, M. I., Jackson, P. J., Raillard, S. A., Haugg, M., Jermann, T. M., Opitz, J. G., Trabesinger-Ruf, N., and Benner, S. A. (1998). Biochemistry 37, 4008. Cinatl, J., Jr., Cinatl, J., Kotchetkov, R., Matousek, J., Woodcock, B. G., Koehl, U., Vogel, J. U., Kornhuber, B., and Schwabe, D. (2000). Anticancer Res 20, 853. Clerici, M., Stocks, N. I., Zajac, R. A., Bosw ell, R. N., Lucey, D. R., Via, C. S., and Shearer, G. M. (1989). J Clin Invest 84, 1892. Collier, L. S., Carlson, C. M., Ravimohan, S., Dupuy, A. J., and Largaespada, D. A. (2005). Nature 436, 272. Collinson, M. E., and Hooker, J. J. (1991). Philos. Trans. R. Soc. London Ser. B 333, 197. Corbin, C. J., Mapes, S. M., Marcos, J., Sh ackleton, C. H., Morrow, D., Safe, S., Wise, T., Ford, J. J., and Conley, A. J. (2004). Endocrinology 145, 2157 Cunningham, C. W. (1999). Syst. Biol. 48, 665. Czerny, T., and Busslinger, M. (1995). Mol. Cell. Biol. 15, 2858. Czerny, T., Halder, G., Kloter, U., Souabni, A., Gehring, W. J., and Busslinger, M. (1999). Mol. Cell. Biol. 3, 297. Czerny, T., Schaffner, G., and Busslinger, M. (1993). Genes Dev. 7, 2048. D'Alessio, G., Di Donato, A., Pa rente, A., and Piccoli, R. (1991). Trends Biochem. Sci. 16, 104. D'Alessio, G., Floridi, A., De Prisco , R., Pignero, A., and Leone, E. (1972). Eur. J. Biochem. 26, 153. Dahl, E., Koseki, H., and Balling, R. (1997). Bioessays 19, 755. Darzynkiewicz, Z., Juan, G., Li, X., Gorczy ca, W., Murakami, T., and Traganos, F. (1997). Cytometry 27, 1. Dean, A. M., and Golding, G. B. (1997). Proc. Natl. Acad. Sci. U. S. A. 94, 3104. Denkewalter, R. G., Veber, D. F., Holly, F. W., and Hirschma.R (1969). J. Am. Chem. Soc. 91, 502. Di Cosmo, A., Di Cristo, C., and Paolucci, M. (2001). J. Exp. Zool. 289, 33. Di Cosmo, A., Di Cristo, C., and Paolucci, M. (2002). Mol. Reprod. Dev. 61, 367. Dietmann, S., and Holm, L. (2001). Nat. Struct. Biol. 8, 953.

PAGE 275

260 Dixon, B., Nagelkerke, L. A. J., Sibbing, F. A., Egberts, E., and St et, R. J. M. (1996). Immunogenetics 44, 419. Doggrell, S. A., and Wanstall, J. C. (2004). Cardiovasc. Res. 61, 653. Dos Reis, G. A., and Shevach, E. M. (1981). J Immunol 127, 2456. Dostal, J., Veselsky, L., Drahor ad, J., and Jonakova, V. (1995). Biol Reprod 52, 1209. Dubois, J. Y. F., Jekel, P. A., Mulder, P., Bussink, A. P., Catzeflis, F. M., Carsana, A., and Beintema, J. J. (2002). J. Mol. Evol. 55, 522. Dupuy, A. J., Akagi, K., Largaespada, D. A., Copeland, N. G., and Jenkins, N. A. (2005). Nature 436, 221. Ellington, A. D., and Benner, S. A. (1987). J. Theor. Biol. 127, 491. Emmens, M., Welling, G. W., and Beintema, J. J. (1976). Biochem. J. 157, 317. Engelkamp, D., and van Heyningen, V. (1996). Curr. Opin. Genet. Dev. 6, 334. Escriva, H., Safi, R., Hanni, C., Langlois, M.-C., Saumitou-Laprade, P., Stehelin, D., Capron, A., Pierce, R., and Laudet, V. (1997). Proc. Natl. Acad. Sci. U. S. A. 94, 6803. Fadok, V. A., Voelker, D. R., Campbell, P. A., Cohen, J. J ., Bratton, D. L., and Henson, P. M. (1992). J Immunol 148, 2207. Feder, M. E., and Mi tchell-Olds, T. (2003). Nat. Rev. Genet. 4, 651. Felsenstein, J. (1981). J. Mol. Evol. 17, 368. Felsenstein, J. (1989). Cladistics 5, 164. Fernandez-Espinar, M. T., Barri o, E., and Querol, A. (2003). Yeast 20, 1213. Fersht, A. R. (1977). "Enzyme Structure a nd Mechanism", W.H. Freeman, New York. Filipec, M., Haskova, Z., Havrlikova, K., Le tko, E., Holan, V., Matousek, J., and Kalousek, I. (1996). Graefes Arch Clin Exp Ophthalmol 234, 586. Fleet, G. H., and Heard, G. M. (1993). In "Wine Microbiology and Biotechnology", p. 27. Harwood Academic Publisher, Chur, Switzerland. Foote, M., Hunter, J. P., Janis, C. M., and Sepkoski, J. J. (1999). Science 283, 1310. Fortin, A. S., Underhill, D. A., and Gros, P. (1998). Nucleic Acids Res. 26, 4574. Franzen, J. L. (1997). Natur Mus. 127, 61

PAGE 276

261 Gaastra, W., Groen, G., Welling, G. W., and Beintema, J. J. (1974). FEBS Lett. 41, 227. Gaastra, W., Welling, G. W., and Beintema, J. J. (1978). Eur. J. Biochem. 86, 209. Galtier, N., and Gouy, M. (1998). Mol. Biol. Evol. 15, 871. Galtier, N., Tourasse, N., and Gouy, M. (1999). Science 283, 220. Ganzhorn, A. J., Green, D. W., Hershey, A. D., Gould, R. M., and Plapp, B. V. (1987). J. Biol. Chem. 262, 3754. Gaschen, B., Taylor, J., Yusim, K., Foley, B., Gao, F., Lang, D., Novitsky, V., Haynes, B., Hahn, B. H., Bhattacharya, T., and Korber, B. (2002). Science 296, 2354. Gaucher, E., Graddy, L., Li, T., Simmen, R ., Simmen, F., Schreiber, D., Liberles, D., Janis, C., and Benner, S. (2004). BMC Biol. 2, 19. Gaucher, E. A., Gu, X., Miyamoto, M. M., and Benner, S. A. (2002). Trends Biochem. Sci. 27, 315. Gaucher, E. A., Miyamoto, M. M., and Benner, S. A. (2001). Proc. Natl. Acad. Sci. U.S.A. 98, 548. Gaucher, E. A., Thomson, J. M., Burgan, M. F., and Benner, S. A. (2003). Nature 425, 285. Gerloff, D. L., Cannarozzi, G. M., Joachimia k, M., Cohen, F. E., Schreiber, D., and Benner, S. A. (1999). Biochem. Biophys. Res. Commun. 254, 70. Gerloff, D. L., Cohen, F. E., Korostensky, C., Turcotte, M., Gonnet, G. H., and Benner, S. A. (1997). Protein-Struct. Funct. Genet. 27, 450. Giguere, V. (2002). Trends Endocrinol. Metab. 13, 220. Glenner, H., Hansen, A. J., Sorensen, M. V., Ronquist, F., Huelsenbeck, J. P., and Willerslev, E. (2004). Curr. Biol. 14, 1644. Goldman, N., and Yang, Z. (1994). Mol Biol Evol 11, 725. Gonnet, G. H., and Benner, S. A. (1991). Technical Report 154, Departement Informatik. Zurich . Gonnet, G. H., Cohen, M. A., and Benner, S. A. (1992). Science 256, 1443. Gould, S. J. (1980). "The Pandas Thumb: More Reflections in Natural History", New York. Grantham, R. (1974). Science 185, 862.

PAGE 277

262 Graur, D. (1993). FEBS Lett. 325, 152. Green, S., and Chambon, P. (1987). Nature 325, 75. Groen, G., Welling, G. W., and Beintema, J. J. (1975). FEBS Lett. 60, 300. Gromiha, M. M., Oobatake, M., and Sarai, A. (1999). Biophys. Chem. 82, 51. Haldane, J. B. S. (1932). "The Causes of Evolution", Longmans and Green, London. Halder, G., Callaerts, P., a nd Gehring, W. J. (1995). Science 267, 1788. Harder, J., and Schroder, J. M. (2002). J. Biol. Chem. 277, 46779. Hassanin, A., and Douzery, E. J. (2003). Syst. Biol. 52, 206. Hermanson, S., Davidson, A. E., Sivasubbu, S ., Balciunas, D., and Ekker, S. C. (2004). In "Zebrafish:2nd Edition Genetics Ge nomics And Informatics", p. 349. Hernandez Fernandez, M., and Vrba, E. S. (2005). Biol Rev Camb Philos Soc 80, 269. Hesse, M. (2002). "Alkaloids : Nature's Curs e or Blessing?" Wile y-VCH, Weinheim New York. Hey, J. (1999). Trends Ecol. Evol. 14, 35. Hillis, D. M., and Bull, J. J. (1993). Syst. Biol. 42, 182. Hirschmann, R., Nutt, R. F., Veber, D. F., Vita li, R. A., Varga, S. L., Jacob, T. A., Holly, F. W., and Denkewalter, R. G. (1969). J. Am. Chem. Soc. 91, 507. Huelsenbeck, J. P. (1997). Syst. Biol. 46, 69. Huelsenbeck, J. P., and Bollback, J. P. (2001). Syst Biol 50, 351. Huelsenbeck, J. P., Larget, B., and Alfaro, M. E. (2004). Mol. Biol. Evol. 21, 1123. Huelsenbeck, J. P., Ronquist, F., Niel sen, R., and Bollback, J. P. (2001). Science 294, 2310. Hugenholtz, P., Goebel, B. M., and Pace, N. R. (1998). J. Bacteriol. 180, 4765. Hurley, J. H., and Dean, A. M. (1994). Structure 2, 1007. Hurley, J. H., Dean, A. M., Koshland, D. E., and Stroud, R. M. (1991). Biochemistry 30, 8671. Hurley, J. H., Thorsness, P. E., Ramalingam, V., Helmers, N. H., Koshland, D. E., Jr., and Stroud, R. M. (1989). Proc. Natl. Acad. Sci. U. S. A. 86, 8635.

PAGE 278

263 Imada, K., Sato, M., Tanaka, N., Katsube, Y., Matsuura, Y., and Oshima, T. (1991). J. Mol. Biol. 222, 725. Ipata, P. L., and Felicioli, R. A. (1968). FEBS Lett. 1, 29. Ivics, Z., Hackett, P. B., Plaste rk, R. H., and Izsvak, Z. (1997). Cell 91, 501. Ivics, Z., Kaufman, C. D., Zayed, H., Mi skey, C., Walisko, O., and Izsvak, Z. (2004). Curr. Issues Mol. Biol. 6, 43. Iwabata, H., Watanabe, K., Ohkuri, T., Yokobori, S., and Yamagishi, A. (2005). FEMS Microbiol. Lett. 243, 393. James, K., and Hargreave, T. B. (1984). Immunol. Today 5, 357. Janis, C. M., Effinger, J. E., Harrison, J. A., Honey, J. G., Kron, D. G., Lander, B., Manning, E., Prothero, D. R., Stevens, M. S., Stucky, R. K., Webb, S. D., and Wright, D. B. (1998). In "In Evolution of Tertiary Mammals of North America", p. 337 Cambridge Univ Pr, Cambridge. Jekel, P. A., Sips, H. J., Lenstra, J. A., and Beintema, J. J. (1979). Biochimie 61, 827. Jenkins, S. R., Nutt, R. F., Dewey, R. S., Ve ber, D. F., Holly, F. W., Paleveda, W. J., Lanza, T., Strachan, R. G., Schoenew.Ef , Barkemey.H, Dickinso.Mj, Sondey, J., Hirschma.R, and Walton, E. (1969). J. Am. Chem. Soc. 91, 505. Jermann, T. M. (1995). "Der Ursprung und die Evolution der Ribonuklease aus dem Pankreas und aus der Samenfluessigke it", ETH Dissertation No. 11059, Zurich. Jermann, T. M., Opitz, J. G., Stackhouse, J., and Benner, S. A. (1995). Nature 374, 57. Jo Chitester, B., and Walz, F. G., Jr. (2002). Arch Biochem Biophys 406, 73. Johansson, M., Bromfield, J. J., Jasper , M. J., and Robertson, S. A. (2004). Immunology 112, 290. Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). Comput Appl Biosci 8, 275. Jukes, T. H., and Kimura, M. (1984). J. Mol. Evol. 21, 90. Kapovic, M., and Rukavina, D. (1991). J Reprod Immunol 20, 93. Kelemen, B. R., Klink, T. A., Behlke, M. A ., Eubanks, S. R., Leland, P. A., and Raines, R. T. (1999). Nucleic Acids Res 27, 3696. Kellis, M., Birren, B. W., and Lander, E. S. (2004). Nature 428, 617. Kelly, R. 
W., and Critchley, H. O. (1997). Hum. Reprod. 12, 2200.

PAGE 279

264 Kelmanson, I. V., and Matz, M. V. (2003). Mol. Biol. Evol. 20, 1125. Kerman, R. H., Susskind, B., Katz, S. M., Va n Buren, C. T., and Kahan, B. D. (1997). Transplant Proc 29, 1410. Kleineidam, R. G., Jekel, P. A., Beintema, J. J., and Situmorang, P. (1999). Gene 231, 147. Knauth, L. P. (2005). Palaeogeogr. Palaeoclimatol. Palaeoecol. 219, 53. Knight, P. A., Wright, S. H., Lawrence, C. E ., Paterson, Y. Y., and Miller, H. R. (2000). J. Exp. Med. 192, 1849. Koopman, G., Reutelingsperger, C. P., Kuijten, G. A., Keehnen, R. M., Pals, S. T., and van Oers, M. H. (1994). Blood 84, 1415. Koshi, J. M., and Goldstein, R. A. (1996). J. Mol. Evol. 42, 313. Kosiol, C., and Goldman, N. (2005). Molecular Biology And Evolution 22, 193. Kreitman, M., and Akashi, H. (1995). Systematics 26, 403. Krishnan, N. M., Seligmann, H., Stewart, C. B., de Koning, A. P. J., and Pollock, D. D. (2004). Mol. Biol. Evol. 21, 1871. Kruiswijk, C. P., Hermsen, T. T., Westphal, A. H., Savelkoul, H. F. J., and Stet, R. J. M. (2002). J. Immunol. 169, 1936. Kumar, S., and Hedges, S. B. (1998). Nature 392, 917. Kuper, H., and Beintema, J. J. (1976). Biochim. Biophys. Acta. 446, 337. Laccetti, P., Portella, G., Mastro nicola, M. R., Russo, A., Piccoli, R., D'Alessio, G., and Vecchio, G. (1992). Cancer Res 52, 4582. Laccetti, P., Spalletti-Cer nia, D., Portella, G., De Corato , P., D'Alessio, G., and Vecchio, G. (1994). Cancer Res 54, 4253. Lang, K., and Schmid, F. X. (1986). Eur. J. Biochem. 159, 275. Lenstra, J. A., and Beintema, J. J. (1979). Eur. J. Biochem. 98, 399. Leonidas, D. D., Shapiro, R., Irons, L. I., Russo, N., and Acharya, K. R. (1997). Biochemistry 36, 5578. Li, W. H., Gojobori, T ., and Nei, M. (1981). Nature 292, 237. Lincz, L. F. (1998). Immunol Cell Biol 76, 1.

PAGE 280

265 Lorenz, E. N. (1963). J. Atmos. Sci. 20, 130. Lorenz, E. N. (1969). Tellus 21, 289. Lynch, M., and Conery, J. S. (2000). Science 290, 1151. Maddison, W. P., and Ma ddison, D. R. (1989). Folia Primatol. 53, 190. Malcolm, B. A., Wilson, K. P., Matthews, B. W., Kirsch, J. F., and Wilson, A. C. (1990). Nature 345, 86. Margoliash, E. (1963). Proc. Natl. Acad. Sci. U. S. A. 50, 672. Margoliash, E. (1964). Can. J. Biochem. Physiol. 42, 745. Margulis, L., and Guerrero, R. (1995). Microbiologia 11, 173. Margulis, L., and West, O. (1993). GSA Today 3, 277. Marshall, C. R., Raff, E. C., and Raff, R. A. (1994). Proc. Natl. Acad. Sci. U. S. A. 91, 12283. Martins, E. P. (1999). Syst. Biol. 48, 642. Matousek, J. (1973). Experientia 29, 858. McGovern, P. E. (2004). Proc. Natl. Acad. Sci. U.S.A. 101, 17593. Menon, S. T., Han, M., and Sakmar, T. P. (2001). Physiol. Rev. 81, 1659. Michaelis, M., Cinatl, J., Cinatl, J., Pouckova, P., Langer, K., Kreuter, J., and Matousek, J. (2002). Anticancer Drugs 13, 149. Michaelis, M., Matousek, J., Vogel, J. U., Slav ik, T., Langer, K., Cinatl, J., Kreuter, J., Schwabe, D., and Cinatl, J. (2000). Anticancer Drugs 11, 369. Miller, H. R. (1996). Vet. Immunol. Immunopathol. 54, 331. Miskey, C., Izsvak, Z., Plasterk, R. H., and Ivics, Z. (2003). Nucleic Acids Res. 31, 6873. Miyazaki, J., Nakaya, S., Suzuki, T., Tamakoshi, M., Oshima, T., and Yamagishi, A. (2001). J. Biochem. 129, 777. Mooers, A. O., and Schluter, D. (1999). Syst. Biol. 48, 623. Moore, S., and Stein, W. H. (1973). Science 180, 458. Muller, H. J. (1935). Genetics 17, 237.

PAGE 281

266 Muskiet, F. A. J., Welling, G. W., and Beintema, J. J. (1976). Int. J. Pept. Protein Res. 8, 345. Nambiar, K. P., Stackhouse, J., Stauffer, D. M., Kennedy, W. P., Eldredge, J. K., and Benner, S. A. (1984). Science 223, 1299. Navidi, W. C., Churchill, G. A., and von Haeseler, A. (1991). Mol. Biol. Evol. 8, 128. Neidhardt, F. C., Ingraham, J. L., and Schaechter, M. (1990). "Physiology Of The Bacterial Cell : A Molecular Approach", Sinauer Associates, Sunderland, Mass. Nielsen, R. (2002). Syst. Biol. 51, 729. Nock, S., Grillenbeck, N., Ahmadian, M. R., Ri beiro, S., Kreutzer, R., and Sprinzl, M. (1995). Eur. J. Biochem. 234, 132. O'Harra, C. C. (1930). Science 71, 341. Oberdorster, E., and McCl ellan-Green, P. (2002). Mar. Environ. Res. 54, 715. Ohno, S. (1970). "Evolution by gene duplication", Berlin, New York. Ohno, S., Muramoto, J., Christia.L, and Atkin, N. B. (1967). Chromosoma 23, 1. Okabe, Y., Katayama, N., Iwama, M., Watanabe, H., Ohgi, K., Irie, M., Nitta, K., Kawauchi, H., Takayanagi, Y., Oyama, F., Titani, K., Abe, Y., Okazaki, T., Inokuchi, N., and Koyama, T. (1991). J. Biochem. 109, 786. Omland, K. E. (1999). Syst. Biol. 48, 604. Onorato, J., Scovena, E., Airaghi , S., Morandi, B., Morelli, M. , Pizzi, M., and Principi, N. (1996). Riv. Ital. Pediatr. 22, 900. Opitz, J. G. (1995). "Maximum Parsimony: Ei n neuer Ansatz zum besseren Verstaendnis von Protein/Nukleinsaeure-Wechselwirkungen", ETH Dissertation No. 10952. Opitz, J. G., Ciglic, M. I., Haugg, M., Trautw ein-Fritz, K., Raillard, S. A., Jermann, T. M., and Benner, S. A. (1998). Biochemistry 37, 4023. Pagel, M. (1999a). Nature 401, 877. Pagel, M. (1999b). Syst. Biol. 48, 612. Pagel, M., Meade, A., and Barker, D. (2004). Syst. Biol. 53, 673. Pan, S. H., and Malcolm, B. A. (2000). Biotechniques 29, 1234. Pandya, I. J., and Cohen, J. (1985). Fertil Steril 43, 417.

PAGE 282

267 Pauling, L., and Zuckerkandl, E. (1963). Acta. Chem. Scand. 17, S9. Peterson, K. J., and Eernisse, D. J. (2001). Evol. Dev. 3, 170. Piazzon, I., Deroche, A., Lanari, C., Matu sevich, M., and Pasqualini, C. D. (1985). J Reprod Immunol 7, 249. Posada, D., and Buckley, T. R. (2004). Syst Biol 53, 793. Prabhala, R. H., Fahey, J. V., Humphrey, S. L ., Edkins, R. D., Stern, J. E., and Wira, C. R. (1998). J Reprod Immunol 40, 25. Presnell, S. R., and Benner, S. A. (1988). Nucleic Acids Res. 16, 1693. Pretorius, I. S. (2000). Yeast 16, 675. Preuss, K. D., Wagner, S., Freudenstein, J., and Scheit, K. H. (1990). Nucleic Acids Res. 18, 1057. Price, S. A., Bininda-Emonds, O. R ., and Gittleman, J. L. (2005). Biol Rev Camb Philos Soc 80, 445. Pupko, T., Pe'er, I., Shamir, R., and Graur, D. (2000). Mol. Biol. Evol. 17, 890. Raillard, S. A. (1993). "Veraenderung der St ruktur und der biologischen Aktivitaet in RNase A mit Hilfe von gezi elter Mutagenese", ETH Dissertation No. 10022, Zurich. Raines, R. T. (1998). Chem Rev 98, 1045. Ree, R. H., and Donoghue, M. J. (1999). Syst. Biol. 48, 633. Riccioli, A., Salvati, L., D'Alessio, A., St arace, D., Giampietri, C., De Cesaris, P., Filippini, A., and Ziparo, E. (2003). Andrologia 35, 64. Richards, F. M., and Logue, A. D. (1962). J. Biol. Chem. 237, 3693. Riggs, A. (1959). Nature 183, 1037. Robertson, S. A., Bromfield, J. J., and Tremellen, K. P. (2003). J Reprod Immunol 59, 253. Robertson, S. A., Mau, V. J., Tremelle n, K. P., and Seamark, R. F. (1996). J Reprod Fertil 107, 265. Robertson, S. A., and Sharkey, D. J. (2001). Semin Immunol 13, 243. Rosenberg, H. F., and Domachowske, J. B. (2001). J. Leukoc. Biol. 70, 691.

PAGE 283

268 Rost, B. (2001). J. Struct. Biol. 134, 204. Runnegar, B. (2000). Nature 405, 403. Sambashivan, S., Liu, Y., Sawaya, M. R., Gingery, M., and Eisenberg, D. (2005). Nature 437, 266. Sanangelantoni, A. M., Cammara no, P., and Tiboni, O. (1996). Microbiology-Uk 142, 2525. Saunders, M., Wishnia, A., a nd Kirkwood, J. G. (1957). J. Am. Chem. Soc. 79, 3289. Schaaff, I., Heinisch, J., a nd Zimmerman, F. K. (1989). Yeast 5, 285. Schluter, D. (1995). Nature 377, 108. Schmidt, C. R. (1989). In Grzimek's Encyclopedia of Mammals, 20 Schroder, W., Mallmann, P., van der Ven, H., Diedrich, K., and Krebs, D. (1990). Arch. Gynecol. Obstet. 248, 67. Schultz, T. R., and Churchill, G. A. (1999). Syst. Biol. 48, 651. Schultz, T. R., Cocroft, R. B ., and Churchill, G. A. (1996). Evolution 50, 504. Segel, I. H. (1975). "Enzyme Kinetics", John Wiley and Sons, Inc., New York. Shagin, D. A., Barsova, E. V., Yanushevich, Y. G., Fradkov, A. F., Lukyanov, K. A., Labas, Y. A., Semenova, T. N., Ugalde, J. A., Meyers, A., Nunez, J. M., Widder, E. A., Lukyanov, S. A., and Matz, M. V. (2004). Mol. Biol. Evol. 21, 841. Shi, Y., and Yokoyama, S. (2003). Proc. Natl. Acad. Sci. U. S. A. 100, 8308. Sinatra, F., Callari, D., Viola, M., Longomb ardo, M. T., Patania, M., Litrico, L., Emmanuele, G., Lanteri, E., D'Alessa ndro, N., and Travali, S. (2000). Int J Clin Lab Res 30, 191. Singhania, N. A., Dyer, K. D., Zhang, J., Deming, M. S., Bonville, C. A., Domachowske, J. B., and Rosenberg, H. F. (1999). J. Mol. Evol. 49, 721. Soucek, J., Chudomel, V., Potmesilova, I., and Novak, J. T. (1986). Nat. Immun. Cell Growth Regul. 5, 250. Soucek, J., Hruba, A., Paluska, E., Chudomel, V., Dostal, J., and Matousek, J. (1983). Folia Biol (Praha) 29, 250. Soucek, J., Marinov, I., Benes, J., Hilgert, I., Matousek, J., and Raines, R. T. (1996). Immunobiology 195, 271.

PAGE 284

269 Soucek, J., Matousek, J., Chudomel, V., and Lindnerova, G. (1981). Folia Biol (Praha) 27, 334. Stackhouse, J., Presnell, S. R ., McGeehan, G. M., Nambiar, K. P., and Benner, S. A. (1990). FEBS Lett. 262, 104. Strachan, R. G., Paleveda, W. J., Nutt, R. F ., Vitali, R. A., Veber, D. F., Dickinso.Mj, Garsky, V., Deak, J. E., Walton, E., Jenki ns, S. R., Holly, F. W., and Hirschma.R (1969). J. Am. Chem. Soc. 91, 503. Stroband, H. W. J., Stevens, C., Kronnie, G. T., Samallo, J., Schipper, H., Kramer, B., and Timmermans, L. P. M. (1995). Rouxs Arch. Dev. Biol. 204, 369. Strydom, D. J., Fett, J. W., Lobb, R. R., Alde rman, E. M., Bethune, J. L., Riordan, J. F., and Vallee, B. L. (1985). Biochemistry 24, 5486. Stryer, L. (1995). "Biochemistry", W. H. Freeman and Company, New York. Sun, G. (2002). Science 296, 899. Sun, H., Merugu, S., Gu, X., Kang, Y. Y., Dick inson, D. P., Callaerts, P., and Li, W. H. (2002). Mol. Biol. Evol. 19, 1490. Sun, H. M., Rodin, A., Zhou, Y. H., Dickins on, D. P., Harper, D. E., HewettEmmett, D., and Li, W. H. (1997). Proc. Natl. Acad. Sci. U. S. A. 94, 5156. Suzuki, Y., Glazko, G. V., and Nei, M. (2002). Proc. Natl. Acad. Sci. U. S. A. 99, 16138. Swofford, D. L. (1998). "P AUP* Phylogenetic Analysis Using Parsimony Version 4". Swofford, D. L. (2001). "P AUP*.Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4." Sinauer Asso ciates, Sunderland, Massachusetts. Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. (1996). In "Molecular Systematics", p. 407. Sinauer Asso ciates, Sunderland, Massachusetts. Tamburrini, M., Scala, G., Verde, C., Ruo cco, M. R., Parente, A., Venuta, S., and D'Alessio, G. (1990). Eur J Biochem 190, 145. Tauer, A., and Benner, S. A. (1997). Proc. Natl. Acad. Sci. U.S.A. 94, 53. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). Nucleic Acids Res. 22, 4673. Thomson, J. M. (2002). 
"Interpretive Proteomi cs: Experimental Paleogenetics as a Tool to Analyze Function and Discover Pathways in Yeast", University of Florida Dissertation. Thomson, J. M., Gaucher, E. A., Burgan, M. F., De Kee, D. W., Li, T., Aris, J. P., and Benner, S. A. (2005). Nat. Genet. 37, 630.

PAGE 285

270 Thornton, J. W. (2001). Proc. Natl. Acad. Sci. U. S. A. 98, 5671. Thornton, J. W. (2004). Nat. Rev. Genet. 5, 366. Thornton, J. W., and DeSalle, R. (2000). Annu. Rev. Genomics Hum. Genet. 1, 41. Thornton, J. W., Need, E ., and Crews, D. (2003). Science 301, 1714. Trabesinger-Ruef, N. (1997). "Molekularer DarwinismusEin neuer Denkansatz zur strukturellen Aufklarung der Immunosuppr essivitat un Antitumoraktivitat der seminalen Ribonuklease des Rindes", ETH Dissertation No. 12485, Zurich. TrabesingerRuef, N., Jermann, T., Zankel, T ., Durrant, B., Frank, G., and Benner, S. A. (1996). FEBS Lett. 382, 319. Trautwein, K. (1991). "Constr uction of an Improved Expression System for Bovine Pancreatic Ribonuclease A and Construc tion and Characteri zation of RNase A Mutants", ETH Dissertation No. 9613, Zurich. Tremellen, K. P., Seamark, R. F., and Robertson, S. A. (1998). Biol Reprod 58, 1217. Ugalde, J. A., Chang, B. S., and Matz, M. V. (2004). Science 305, 1433. Underhill, D. A. (2000). Biochem. Cell Biol. 78, 629. Underhill, D. a., Vogan, K. J., and Gros, P. (1995). Proc. Natl. Acad. Sci. U. S. A. 92, 3692. van Engeland, M., Nieland, L. J., Ramaekers, F. C., Schutte, B., and Reutelingsperger, C. P. (1998). Cytometry 31, 1. Vandenberg, A., Vandenhendetimmer, L., and Beintema, J. J. (1976). Biochim. Biophys. Acta. 453, 400. Vandijk, H., Sloots, B., Vandenberg, A., G aastra, W., and Beintema, J. J. (1976). Int. J. Pept. Protein Res. 8, 305. Veber, D. F., Varga, S. L., Milkowsk.Jd, Joshua, H., Conn, J. B., Hirschma.R, and Denkewal.Rg (1969). J. Am. Chem. Soc. 91, 506.

PAGE 286

271 Venter, J. C., Adams, M. D., Myers, E. W., Li , P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., Gocayne, J. D., Amanatides, P., Ballew, R. M., Huson, D. H., Wortman, J. R., Zhang, Q., Kodira, C. D., Zheng, X. H., Chen, L., Skupski, M., Subramanian, G., Thomas, P. D., Zhang, J., Gabor Miklos, G. L., Nelson, C., Broder, S., Clar k, A. G., Nadeau, J., McKusick, V. A., Zinder, N., Levine, A. J., Roberts, R. J., Simon, M., Slayman, C., Hunkapiller, M., Bolanos, R., Delcher, A., Dew, I., Fasulo, D., Flanigan, M., Florea, L., Halpern, A., Hannenhalli, S., Kravitz, S., Levy, S., Mobarry, C., Reinert, K., Remington, K., Abu-Threideh, J., Beasley, E., Biddick, K., Bonazzi, V., Brandon, R., Cargill, M., Chandramouliswaran, I., Charlab, R., Chaturvedi, K., Deng, Z., Di Francesco, V., Dunn, P., Eilbeck, K., Evange lista, C., Gabrielian, A. E., Gan, W., Ge, W., Gong, F., Gu, Z., Guan, P., Heiman , T. J., Higgins, M. E., Ji, R. R., Ke, Z., Ketchum, K. A., Lai, Z., Lei, Y., Li, Z., Li, J., Liang, Y., Lin, X., Lu, F., Merkulov, G. V., Milshina, N., Moore, H. M., Naik, A. K., Narayan, V. A., Neelam, B., Nusskern, D., Rusch, D. B., Sa lzberg, S., Shao, W., Shue, B., Sun, J., Wang, Z., Wang, A., Wang, X., Wang, J., We i, M., Wides, R., Xiao, C., Yan, C. (2001). Science 291, 1304. Venter, J. C., Remington, K., Heidelberg, J. F ., Halpern, A. L., Rusch, D., Eisen, J. A., Wu, D., Paulsen, I., Nelson, K. E., Nelson, W., Fouts, D. E., Levy, S., Knap, A. H., Lomas, M. W., Nealson, K., White, O ., Peterson, J., Hoffman, J., Parsons, R., Baden-Tillson, H., Pfannkoch, C., Roge rs, Y. H., and Smith, H. O. (2004). Science 304, 66. Vescia, S., Tramontano, D., Augusti-To cco, G., and D'Alessio, G. (1980). Cancer Res 40, 3740. Vos, J. P., Lopes-Cardozo, M., and Gadella, B. M. (1994). Biochim. Biophys. Acta. 1211, 125. Wade, N. (1995). In "New York Times", p. 35, New York. Warshaw, M. M., and Tinoco, I., Jr. (1966). J Mol Biol 20, 29. Weiser, K. 
C., and Justice, M. J. (2005). Nature 436, 184. Welling, G. W., Groen, G., and Beintema, J. J. (1975). Biochem. J. 147, 505. Welling, G. W., Mulder, H., and Beintema, J. J. (1976). Biochem. Genet. 14, 309. Wilkie, S. E., Robinson, P. R., Cronin, T. W., Poopalasundaram, S., Bowmaker, J. K., and Hunt, D. M. (2000). Biochemistry 39, 7895. Wills, C. (1976). Nature 261, 26. Woese, C. R. (1987). Microbiol. Rev. 51, 221. Wolfe, K. H., and Shields, D. C. (2001). Nature 387, 708.

PAGE 287

272 Xu, H. E., Rould, M. A., Xu, W., Epstein, J. A., Maas, R. L., and Pabo, C. O. (1999). Genes Dev. 13, 1263. Xu, W., Rould, M. A., Jun, S., De splan, C., and Pabo, C. O. (1995). Cell 80, 639. Yang, Z. (1997). Comput. Appl. Biosci. 13, 555. Yang, Z., Kumar, S., and Nei, M. (1995). Genetics 141, 1641. Yant, S. R., Park, J., Huang, Y., Mikkelsen, J. G., and Kay, M. A. (2004). Mol. Cell. Biol. 24, 9239. Yasutake, Y., Watanabe, S., Yao, M., Takada , Y., Fukunaga, N., and Tanaka, I. (2003). J. Biol. Chem. 278, 36897. Yokoyama, S., Radlwimmer, F. B., and Blow, N. S. (2000). Proc. Natl. Acad. Sci. U. S. A. 97, 7366. Zhang, J., Rosenberg, H. F., and Nei, M. (1998). Proc. Natl. Acad. Sci. U. S. A. 95, 3708. Zhang, J. Z., Dyer, K. D., and Rosenberg, H. F. (2002). Nucleic Acids Res. 30, 1169. Zhang, J. Z., Dyer, K. D., and Rosenberg, H. F. (2003). Nucleic Acids Res. 31, 602. Zhang, J. Z., and Nei, M. (1997). J. Mol. Evol. 44, S139. Zhang, J. Z., and Rosenberg, H. F. (2002). Proc. Natl. Acad. Sci. U. S. A. 99, 5486. Zhao, W., Kote-Jarai, Z., van Santen, Y., Ho fsteenge, J., and Beintema, J. J. (1998). Biochim. Biophys. Acta. 1384, 55. Zhu, G., Golding, G. B., and Dean, A. M. (2005). Science 307, 1279. Zilliacus, J., Carlstedtduke, J., Gustafss on, J. A., and Wright, A. P. H. (1994). Proc. Natl. Acad. Sci. U. S. A. 91, 4175.

PAGE 288

BIOGRAPHICAL SKETCH

The author was born and raised in Nabeul, a Mediterranean town in the North African nation of Tunisia. He attended local primary and secondary schools, the school of Ibn Khaldoun and Le Lycee Technique de Nabeul. Before his 18th birthday he graduated with honors with a Baccalaureat in natural sciences (equivalent to a high school diploma). He then moved to the United States to pursue a university education. He majored in biochemistry and molecular biology at the University of Florida under the undergraduate major of Interdisciplinary Studies. As part of this undergraduate major, he conducted research in the laboratory of Professor Robert J. Cohen, Department of Biochemistry and Molecular Biology, and a separate project in the laboratory of Professor Christopher West, Department of Anatomy and Cell Biology in the College of Medicine. During his undergraduate years he also completed the curriculum required by the Department of Biochemistry and Molecular Biology for first-year graduate students and graduated with honors. During his research experience in the laboratory of Christopher West, he was awarded a summer research fellowship from the American Cancer Society and a second research fellowship from the University Scholars Program at the University of Florida. He worked for two semesters after graduation researching the cytoplasmic glycosylation of the protein Skp1 in the Department of Anatomy and Cell Biology. He finally joined the Interdisciplinary Program in Biomedical Sciences at the University of Florida's College of Medicine to pursue a doctoral degree. During his graduate career he worked under the supervision of Distinguished Professor Steven A. Benner.