INFERENCES OF RECENT AND ANCIENT HUMAN POPULATION HISTORY USING GENETIC AND NON-GENETIC DATA

By

ANDREW KITCHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008
© 2008 Andrew Kitchen
To my Parents and Family.
ACKNOWLEDGMENTS

I would first like to thank my Ph.D. advisor, Professor Connie Mulligan, for her guidance as my committee chair and for her superb mentoring. I would also like to thank Prof. Michael Miyamoto for his support, advice, and insight while performing this research; and Profs. John Moore and Marta Wayne for their guiding hands as committee members. I also thank my collaborator on the Semitic language project at the University of California, Los Angeles, Dr. useful discussions about my projects. Finally, I thank my parents for providing constant support and encouragement.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 THREE-STAGE COLONIZATION MODEL FOR THE PEOPLING OF THE AMERICAS
    Introduction
    Results
        Skyline Plot Analyses
        Isolation with Migration Coalescent Analyses
    Discussion
    Materials and Methods
        Datasets
        Bayesian Skyline Plot Analyses
        Isolation with Migration Coalescent Analyses

3 BAYESIAN PHYLOGENETIC ANALYSES OF SEMITIC LANGUAGES IDENTIFY AN EARLY BRONZE AGE ORIGIN OF SEMITIC AND A SINGLE IRON AGE MIGRATION TO AFRICA FOR ETHIOSEMITIC
    Introduction
    Results
        Genealogy of Semitic Languages
        Semitic Language Divergence Dates
        Log Bayes Factor Tests
    Discussion
        Semitic Origins
        Recent Arabic Divergence
        South Semitic and the Origins of Ethiosemitic
        Conclusion
    Methods
        Word Lists and Cognate Coding
        Phylogenetic Analysis and Divergence Date Estimation

4 UTILITY OF DNA VIRUSES FOR STUDYING HUMAN HOST HISTORY: CASE STUDY OF JC VIRUS
    Introduction
    Materials and Methods
        JC Virus and Human Mitochondrial DNA Sequences
        JC Virus Phylogenetic Analysis
        Estimation of for JC Virus
        JC Virus Population Dynamics
        Human mtDNA Population Dynamics
        Bayes Factor Model Comparison
    Results
        ML Phylogeny for JC Virus
        JC Virus Rates and Skyline Plots
        Human mtDNA Skyline Plots
        Log Bayes Factor Model Comparison
    Discussion
        Bayesian Skyline Plots, Human mtDNA, and Historical Demography
        JC Virus Rates and Historical Demography
        Current Support for the Fast Internal Rate of JC Virus
        Utility of JC Virus and Other Fast-Evolving DNA Viruses for Studying the Human Host

5 CONCLUSION

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF FIGURES

Figure

2-1. Bayesian skyline plot for the mtDNA coding genome sequences.
2-2. Bayesian skyline plot for the mtDNA HVR I+II datasets.
2-3. Graph of IM results for the combined nuclear and mitochondrial coding DNA dataset.
2-4. Maps depicting each phase of our three-step colonization model for the peopling of the Americas.
3-1. Map of Semitic languages and inferred dispersals.
3-2. Phylogenetic tree of Semitic languages.
3-S1. Semitic wordlist data for 25 languages.
3-S2. Example of the cognate coding process.
3-S3. Cognate lists for 25 Semitic languages.
4-1. Optimal ML phylogeny for 407 JCV coding genomes.
4-2. Bayesian skyline plots for the four regional groups of JCV generated with the slow external rate of 1.356 x 10^-7.
4-3. Bayesian skyline plots for the four regional groups of JCV generated with the fast internal rate of 3.642 x 10^-5.
4-4. Bayesian skyline plots for the four regional groups of humans as estimated with complete mtDNA coding genomes.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

INFERENCES OF RECENT AND ANCIENT HUMAN POPULATION HISTORY USING GENETIC AND NON-GENETIC DATA

By

Andrew Kitchen

August 2008

Chair: Connie J. Mulligan
Major: Anthropology

I have adopted complementary approaches to inferring human demographic history utilizing human and non-human genetic data as well as cultural data. These complementary approaches form an interdisciplinary perspective that allows one to make inferences of human history at varying timescales, from events that occurred tens of thousands of years ago to ones that occurred in the most recent decades and centuries. I used slow-evolving human DNA to study the peopling of the Americas tens of thousands of years ago; fast-evolving lexical data to address the origin of Semitic several thousand years ago; and fast-evolving JC virus genomes to investigate human demography within the most recent decades and centuries.

In the first study, I used human mtDNA to infer the demographic history of Amerind populations and analyzed a multi-locus human DNA dataset to confirm important parameter estimates of the peopling of the Americas. My analyses produced a three-stage model for the peopling of the Americas that includes a long occupation of Siberia and a rapid expansion ~16,000 years ago into the Americas from a founding population of ~1000 to 5000 individuals. Second, I analyzed lexical data from 25 Semitic languages using computational phylogenetic techniques borrowed from evolutionary biology. Using the sampling dates of extinct languages, I was able to date events in the history of Semitic and place the origin of Semitic ~5900 years ago in what is now Syria. My final project entailed the analysis of a large dataset of JC virus genomes (>400) and their associated sampling dates to infer the mutation rate and demographic history of JC virus. I estimated a surprisingly fast evolutionary rate for JC virus from the viral sampling dates, and confirmed this fast rate with Bayesian model tests. Ultimately, I was able to use the fast JC virus rate to infer a recent expansion of JC virus in regional populations that correlates with events in human population history over the most recent decades and centuries.

Genetic anthropologists are uniquely equipped to appreciate the utility of combining genetic data with non-genetic or non-human data to investigate human history. This perspective allows genetic anthropologists to address questions that cannot be answered by human DNA alone by extracting as much information as possible from genetic analyses. Here, I provide examples of this integrative approach to studying human population history.
CHAPTER 1
INTRODUCTION

Demographic change has been a constant in human evolution since the expansion of modern humans out of Africa approximately 50,000 years ago (Fagundes et al. 2007). Genetic anthropologists have generally attempted to infer events in this history of change by examining the genetic diversity of modern human populations. Early genetic studies of human history used variation in proteins, such as the ABO blood groups (Landsteiner 1901; Cavalli-Sforza and Edwards 1964) and allozymes (Pauling et al. 1949; Cavalli-Sforza and Edwards 1964), as markers of underlying genetic variation at the DNA level. These data proved very useful in initial attempts to reconstruct patterns of population affinities and migration (for a summary see Cavalli-Sforza, Menozzi, and Piazza 1994), but the advent of fast and easy DNA sequencing techniques made it possible for genetic anthropologists to directly assess human genetic diversity, both extant (e.g., Cann, Stoneking, and Wilson 1987; Kolman, Sambuughin, and Bermingham 1996; Underhill et al. 2000; Venter et al. 2001) and ancient (e.g., Paabo 1985; Handt et al. 1994; Stone and Stoneking 1998).

Despite the advance from the study of proteins to the study of DNA sequences, modern human genetic diversity remains a complicated product of mutation, natural selection, genetic drift, and gene flow. The interaction of these processes has important implications for the utility of human DNA as a marker of past human population events. First, the relatively slow rate of mutation for human DNA, which varies from ~1x10^-7 substitutions per nucleotide per year in non-coding mitochondrial DNA (mtDNA; Hasegawa et al. 1993; Ingman et al. 2000; Howell et al. 2003) to ~1x10^-9 substitutions per nucleotide per year in non-coding nuclear DNA (Kaessmann et al. 1999), limits human demographic inference to events that occurred thousands to hundreds of thousands of years ago. Second, natural selection can deterministically shape genetic diversity by increasing (positive selection), decreasing (purifying selection), or stabilizing (balancing selection) allele frequencies according to selective pressures that are, with some important exceptions (Ohta 1992), independent of the effects of genetic drift (Fisher 1930). Finally, genetic drift, which causes random changes in allele frequencies and has a strong effect in small populations but a weak effect in large ones, and gene flow work to stochastically alter the frequencies of selectively neutral alleles through the complex demographic processes of migration, admixture, population bottlenecks, and population expansions (Wright 1931; Kimura 1968). Thus, our ability to make inferences concerning demographic change, which relates directly to the combined effects of genetic drift and gene flow, is limited by the rate of mutation and the effects of selection acting upon human genomic regions (i.e., the other fundamental evolutionary processes).

The use of neutral (or nearly neutral) variation limits the bias of selection upon inferences of demography, but the mutation rate still places a temporal limit on inferring past events. So while neutrally evolving DNA is abundant in the human genome (Payseur, Cutter, and Nachman 2002; Zhang 2007), the mutation rates of human DNA limit the utility of these data to questions about human demography extending from human speciation into the history of anatomically modern humans. For this reason, researchers have begun to use fast-evolving human parasites or pathogens, such as viruses (e.g., Holmes 2004), as complementary markers of the population history of their human host.

In my dissertation, I have adopted complementary approaches to inferring human demographic history utilizing human and non-human genetic data as well as cultural data. These complementary approaches form an interdisciplinary perspective that allows one to make inferences of human history at varying timescales, from events that occurred tens of thousands of years ago to ones that occurred in the most recent decades and centuries. This perspective incorporates data evolving at different rates from diverse disciplines to produce insights about human history that human DNA alone might not be able to provide. It also rests upon the understanding that while human demographic history can be investigated from human DNA data directly, other lines of evolutionary data (e.g., pathogens and languages) can be used to augment or provide independent assessments of human population history at different times in the past. Specifically, I have developed a framework in which I use human DNA sequences to infer events thousands to tens of thousands of years ago, language data to investigate human migrations centuries and millennia in the past, and virus genomes to assess the demographic history of human host populations in the most recent decades and centuries.

Over the past six decades, analysis of human DNA has provided many insights into human population history. The genetic materials most often used for inferring human demographic history are mtDNA (e.g., Sherry et al. 1994; Richards et al. 1998; Macaulay et al. 2005), the male Y chromosome (e.g., Hammer et al. 1997; Underhill et al. 2000; Quintana-Murci et al. 2001), and autosomal short tandem repeats (STRs; e.g., Rosenberg et al. 2002; Wang et al. 2007) and Alu elements (e.g., Hammer 1994; Athanasiadis et al. 2007). Because these are neutral markers that are not affected by selection (Knight et al. 1996; Shen et al. 2000; Rosenberg et al. 2002; Mulligan, Kitchen, and Miyamoto 2006), the genetic variation at these loci is thought to reflect the demographic history of the populations from which they were sampled. The mutation rates of these markers vary from ~1x10^-7 to ~1x10^-9 substitutions per nucleotide per year, so they are useful for inferring events thousands or tens of thousands of years in the past.
Mutation rates in this range are not fast enough to provide a sufficient number of mutation events to infer past events with fine resolution, and thus estimates made using human DNA markers alone have large variances. For example, Tishkoff et al. (2001) were able to estimate that two malaria resistance alleles of the glucose-6-phosphate dehydrogenase gene arose ~3,300 (95% credibility interval: 1,600 to 6,640) and ~6,400 (95% credibility interval: 3,840 to 11,760) years ago, but were unable to determine which of several factors (e.g., wetter climate, increased human population size, or higher population density near water) was the causative agent, as the credibility intervals were too large (from 50% to 200% of mean estimates) to exclude many alternatives.

Newly developed coalescent techniques that employ Bayesian probability allow for the incorporation of independent evidence, such as archaeological or climatological data, into genetic analyses (e.g., Drummond et al. 2005; Hey 2005; Beerli 2006). The Bayesian probability framework weights the likelihood of a model (including parameter values) by a prior probability that the model is correct to produce a posterior probability of the model given the data. This allows researchers to combine information from multiple interdisciplinary lines of evidence as prior probabilities, increasing both the accuracy and the precision with which genetic analyses infer events. For example, archaeological evidence for the first occupation of an island might be used as a prior with zero probability placed on any models in which the colonization occurs more recently than the archaeological evidence. Unfortunately, despite the advantages provided by Bayesian techniques for integrating multiple lines of evidence into genetic analyses, many studies neglect to do so (e.g., Hey 2005; Garrigan et al. 2007). For example, Hey employed a Bayesian coalescent technique without applying informative priors to estimate that the peopling of the New World occurred ~7,000 years ago with a founder effective population size (N_e) of ~70 and large, continuous migration between Asia and the Americas (Hey 2005).
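The role of a hard archaeological bound as a prior can be illustrated with a minimal sketch on a grid of candidate colonization times. This is not the analysis performed by any of the cited studies: the likelihood curve, its centre and spread, and all function names are hypothetical, and the bound of 14,000 years is simply borrowed from the archaeological date quoted in this chapter.

```python
import math

# Assumed hard bound from archaeology (years before present); illustrative only.
ARCHAEOLOGICAL_BOUND = 14_000


def bounded_prior(t):
    """Flat prior with zero weight on colonization times younger than the bound."""
    return 1.0 if t >= ARCHAEOLOGICAL_BOUND else 0.0


def toy_likelihood(t, centre=12_000.0, spread=4_000.0):
    """Hypothetical bell-shaped likelihood of the genetic data given time t."""
    return math.exp(-0.5 * ((t - centre) / spread) ** 2)


def posterior_on_grid(times, likelihood, prior):
    """Posterior weight of each grid point: likelihood x prior, then normalized."""
    unnormalized = [likelihood(t) * prior(t) for t in times]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("prior and likelihood have no overlapping support")
    return [w / total for w in unnormalized]


times = list(range(5_000, 30_001, 1_000))  # candidate times, years before present
posterior = posterior_on_grid(times, toy_likelihood, bounded_prior)
best = times[posterior.index(max(posterior))]
print("most probable grid time under the bounded prior:", best)
```

In this toy setting the genetic data alone would favour a time younger than the bound, but the prior removes all posterior weight from those times, so the posterior mode moves to the oldest value the archaeology still permits.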
The estimates of Hey (2005) are contradicted by the extensive archaeological evidence for human populations in the Americas by ~14,000 years ago (for a review see Goebel, Waters, and O'Rourke 2008).

While human DNA provides the best evidence for ancient events in human history, the focus has expanded to include markers other than human DNA to infer more recent human population history. The two most promising examples are the use of languages (e.g., Gray and Jordan 2000; Holden 2002; Gray and Atkinson 2003; Atkinson and Gray 2005) and the genetic diversity of human pathogens (e.g., Falush et al. 2003; Holmes 2004; Gilbert et al. 2007) as data for understanding demographic events in human populations, specifically migration and changes in population size. These markers (languages and pathogens) typically evolve at faster rates [e.g., ~10^-4 substitutions per word per year for languages (Swadesh 1955) and as fast as ~10^-2 substitutions per nucleotide per year for lentiviruses (Fu 2001)] than human DNA (~10^-7 to 10^-9 substitutions per nucleotide per year) and thus provide greater resolution for inferring population events that have occurred more recently in time, from decades to several thousands of years ago (Campbell 2000; Holmes 2004).

Languages encode genealogical information in their vocabularies (Swadesh 1955) that extends backwards in time for hundreds to thousands of years (Campbell 2000). The statistical analysis of wordlists has progressed from using simple matrices of pairwise differences and principal components to produce hierarchical clustering patterns (e.g., Rabin 1975), to the use of advanced phylogenetic techniques (such as parsimony, maximum likelihood, and Bayesian methods) to produce bifurcating phylogenies with branch lengths representing real time (e.g., Gray and Jordan 2000; Holden 2002; Gray and Atkinson 2003). Accurate estimates of linguistic rates of change are necessary to make inferences in time (e.g., Renfrew, McMahon, and Trask 2000). The linguistic clock has recently been supported with strong empirical evidence (Pagel, Atkinson, and Meade 2007), and recent phylogenetic studies incorporated linguistic clocks calibrated with archaeological and historical information to estimate language divergence dates on the order of hundreds to thousands of years (e.g., Gray and Jordan 2000; Gray and Atkinson 2003). Together, the topologies and divergence dates of the inferred phylogenies have been used to combine genetic and historical evidence with linguistic evidence to provide an interdisciplinary understanding of the Bantu expansion (Holden 2002), the colonization of island Southeast Asia (Gray and Jordan 2000), and the expansion of agriculture into Europe from Anatolia (Gray and Atkinson 2003). The ability of languages to be transferred horizontally, in addition to vertically, between human populations also provides unique opportunities to test hypotheses of gene flow and migration in the presence or absence of substantial cultural connections and exchange.

Human pathogens, and viruses in particular, are considered good markers of recent human history due to their larger population sizes and generally faster mutation rates (Holmes 2004). This combination means that virus populations generate genetic diversity at a far faster rate than humans do, which allows estimates of population events to be made both with higher resolution and for more recent times. A fine example of this is a study of hepatitis C evolution in Egypt, which was able to trace the progress of an epidemic arising from contaminated anti-schistosomiasis treatments first administered in the 1920s and the effect of public health policies in curtailing the epidemic in the following decades (Pybus et al. 2003; Drummond et al. 2005). A complicating factor of using viruses as markers of human populations is that horizontal transmission of viruses could produce geographic patterns of diversity that are the result of recent dispersals instead of ancestral vicariance (e.g., Twiddy, Holmes, and Rambaut 2003; Holmes 2004; Gilbert et al. 2007). Generally speaking, fast-evolving RNA viruses cause acute infections associated with epidemic disease (i.e., a dispersal pattern) and slow-evolving DNA viruses cause persistent infections associated with endemic disease [i.e., a vicariance pattern (Holmes 2004)], though there are exceptions to this general rule [e.g., persistent HIV infections and fast-evolving polyomaviruses (Chen et al. 2004)]. Slow-evolving viruses (DNA or RNA) that cause persistent infections are thus more likely to exhibit vicariance with their human hosts and be useful for inferring ancient human population events, whereas the diversity of fast-evolving viruses is most likely the product of recent dispersal and may be useful in investigating recent migration events or connections between populations.

For my dissertation research, I chose to analyze three different types of data to study human demographic history at three different windows of time in the past. In each case, I applied advanced computational techniques to analyze the data from an evolutionary perspective and used my results to make inferences about aspects of human population history.

In my first study, I analyzed human DNA and incorporated independent information from other disciplines to investigate the peopling of the Americas, which occurred more than ten thousand years ago. The first colonization of the Americas by the Amerind ancestors has been extensively studied from archaeological, climatological, and genetic perspectives (see Goebel, Waters, and O'Rourke 2008 for a review). Despite the number of genetic studies of this major event in human history, there is little consensus as to the timing, the duration, and the size of the initial migration. The dates for the timing of the migration extend from ~40,000 (Bonatto and Salzano 1997) to ~7,000 years ago (Hey 2005), with many dates scattered between these outliers.
These studies also postulate a migration that occurred immediately after the proto-Amerind population separated from the central East Asian gene pool, but recent analysis has suggested that the proto-Amerind population spent a considerable length of time isolated in Beringia (Tamm et al. 2007). There is also great variation in the estimated size of the migrating population, with estimates ranging from ~10,000 (Bonatto and Salzano 1997a) to ~70 (Hey 2005) effective individuals. I endeavored to develop a comprehensive model for the peopling of the Americas that also narrowed the estimates of the important demographic parameters of migration date and founder population size by incorporating the extensive archaeological, climatological, and historical evidence into my analysis of extant human genetic data. This was the first study to explicitly incorporate the extant physical evidence into a genetic analysis of the peopling of the Americas.

For this study, I analyzed a large dataset of mtDNA coding genomes using Bayesian skyline plot analyses (Drummond et al. 2005) to infer historical changes in the effective population size (N_e) of Amerinds and their ancestors. Bayesian skyline plots are a coalescent technique used to estimate historical changes in N_e from the distribution of coalescent events in time without requiring a priori assumptions about how the N_e of a population should change over time (e.g., no growth vs. exponential growth). I then used a Bayesian isolation with migration (IM) model (Hey 2005) to analyze a combined dataset of mtDNA and nuclear sequences to estimate the Amerind founder population size and infer the effect of migration on these estimates. The IM model is a coalescent technique used to estimate the population parameters (i.e., divergence time, population sizes and growth rates, and migration rates) when a single ancestral population diverges into two daughter populations with subsequent population growth and migration between them.
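As a rough illustration of how the spacing of coalescent events carries information about N_e, the sketch below implements the classic (non-Bayesian) skyline estimator, a much simpler relative of the Bayesian skyline plot and not the procedure used in this study. It assumes a haploid population, a fully resolved genealogy with all samples taken at the present, and times measured in generations; the function name and the example coalescent times are hypothetical.

```python
def classic_skyline(coalescent_times):
    """Piecewise-constant N_e estimates from a single genealogy.

    coalescent_times: increasing times (generations before present) of the
    n - 1 coalescent events among n lineages sampled at time 0.

    While k lineages remain, the expected waiting time to the next
    coalescence in a haploid population of size N is N / (k choose 2)
    generations, so each interval gives the estimate
    N_hat = interval_length * k * (k - 1) / 2.
    """
    n = len(coalescent_times) + 1
    estimates = []
    previous = 0.0
    for i, t in enumerate(coalescent_times):
        k = n - i  # lineages present during this interval
        interval = t - previous
        estimates.append((previous, t, interval * k * (k - 1) / 2))
        previous = t
    return estimates


# Hypothetical genealogy of four samples with coalescences 10, 30 and 130
# generations ago.
for start, end, n_hat in classic_skyline([10, 30, 130]):
    print(f"{start:>6.0f} to {end:>6.0f} generations ago: N_e estimate {n_hat:.0f}")
```

Long intervals between coalescent events while many lineages remain point to a large population, and short intervals to a small one. The Bayesian skyline plot builds on the same relationship while averaging over genealogies and grouping intervals, which this sketch does not attempt.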
From these analyses I was able to propose a model for the peopling of the Americas with three distinct stages: (i) a separation of the proto-Amerind gene pool from the Central/East Asian gene pool ~43,000 years ago, followed by a population expansion lasting 6,000 years; (ii) a long period of stable population size in greater Beringia lasting ~20,000 years; and (iii) an expansion into the Americas ~16,000 years ago in which the Americas were successfully and permanently colonized. This three-stage theory for the peopling of the Americas serves as an example of how to incorporate all pertinent information in a genetic analysis to produce a comprehensive hypothesis that researchers can use as a framework to test future data. This strategy appears to have been successful, as my research has been cited in a review article on the peopling of the Americas (Goebel, Waters, and O'Rourke 2008), and I have received requests to use my figures in papers about the extinction of large mammals in Beringia and in a book chapter on studying human history from human DNA.

In my second study, I used wordlist data to investigate the evolution of the Semitic language family and the history of Semitic-speaking populations from a time depth of several centuries to several millennia in the past. The Semitic language family is the only branch of the African language phylum Afroasiatic that is spoken in both Africa and Asia. The distribution of Semitic is the product of migration between Africa and Asia across the Red Sea within the past several millennia. However, despite a strong archaeological record attesting to the oldest Semitic languages in Mesopotamia [Akkadian and Eblaite (Buccellati 1997; Gordon 1997)] and the presence of several Semitic populations in the Levant during the Middle Bronze Age [e.g., Aramaic, Hebrew, and Phoenician (Lloyd 1984; Richard 2003; Nardo 2007)] as well as in southern Arabia (Kogan and Korotayev 1997) and the Horn of Africa (Connah 2001), the genealogical relationships between these populations remain uncertain. The existing genetic evidence for the genealogical relationships of Semitic-speaking populations is not strong, as it was primarily collected to study relationships at different levels of resolution than that between populations speaking different Semitic languages.
Specifically, previous studies primarily focused on comparisons between Jewish populations and their non Jew ish neighbors (Hammer et al. 2000; Nebel et al. 2001; Rosenberg et al. 2001), which is too focused, or the contribution of
19 migration from the Middle East on the genetic diversity of Europe (Barbujani, Bertorelle, and Chikhi 1998; Chikhi et al. 1998; Xiao e t al. 2004), which is both too broad and too ancient Thus understanding the genealogy of the Semitic languages themselves would shed light on the historical relationships of Semitic speaking populations I analyzed wordlists from 25 extant and extinct Se mitic languages using a Bayesian phylogenetic technique to infer the ir genealog ical relationships. I used the extinction dates for five extinct Semitic languages (Akkadian, Aramaic, the divergence dates of the Semitic languages and tested alternative hypotheses of Semitic history using Bayes Factor model comparisons I estimated that Semitic first diversified ~3900 before the common era (B.C.E.) near Mesopotamia, before expanding int o the Levant and Arabia during the Early Bronze Age, and finally migrating from southern Arabia to the Horn of Africa ~1000 B.C.E. More generally, this research demonstrates how computational phylogenetic models [ e.g., relaxed clocks calibrated with sampli ng dates (Rambaut 2000)] and model testing [ i.e., log Bayes factor tests of alternative phylogenies (Suchard, Weiss, and Sinsheimer 2001)] borrowed from evolutionary genetics and molecular evolution can be used to investigate language history. Furthermore, this analysis provides a model of population history in the Middle East and Horn of Africa that incorporates linguistic and archaeological evidence that will be tested in future genetic studies of populations in these regions. For my final study, I analy zed JC virus (JCV) genomes and estimated the JCV mutation rate so JCV evolution could be connected in time to events in the history of its human host. JCV has been often used as a marker of ancient human population history. JCV shows substantial regional s ub structure (Sugimoto et al. 1997) with apparent vicariance with some human populations ( e.g., Sugimoto et al. 2002; Zheng et al. 
2004; Ikegaya et al. 2005), and as a DNA
virus, JCV was considered to have a slowly evolving genome. However, the phylogeny of worldwide JCV populations does not reflect the known relationships of its human hosts (Shackelton et al. 2006), and the mutation rate of JCV had never been independently estimated despite evidence that closely related polyomaviruses evolved much faster than previously thought. Accurate estimates of mutation rates are necessary to discriminate between dispersal in and vicariance with host populations (Holmes 2004) as causes of the geographic distribution of JCV diversity, as well as to correlate events in JCV history with the history of its human host. I performed a maximum likelihood phylogenetic analysis of the complete set of available JCV genomes to confirm the substructure of the worldwide JCV population and then used Bayesian skyline plot analyses (Drummond et al. 2005) to estimate an independent mutation rate for JCV from viral sampling dates. I assessed the support for the independent rate using a Bayesian model testing technique, and used the independently estimated mutation rate to infer human population history from events in the history of JCV. I found that JCV is an example of a fast-evolving DNA virus, evolving ~300 times faster than previously thought (at 3.642 x 10^-5 substitutions per site per year), and that its evolution reflects changes in the demographic history of its human host over the most recent decades and centuries. More broadly, this research not only highlights the utility of fast-evolving DNA viruses as potential markers of recent human history, but also addresses the evolving nature of long-held assumptions about the relative substitution rates of RNA (fast) and DNA (slow) viruses (Duffy, Shackelton, and Holmes 2008).
In sum, this dissertation demonstrates how genetic anthropologists can investigate human demographic history from distinct types of data evolving at different rates.
This study shows that taking an evolutionary approach to the study of linguistic and virus data can provide insights into human population history that are complementary to those made from human genetic diversity.
Specifically, by using data that evolve at different rates, I demonstrated how genetic anthropologists can make inferences about human demography across different timescales, from tens of thousands of years ago to just decades ago. Fundamentally, this approach relies upon the appreciation of evidence from disparate disciplines (human genetics, linguistics, and virology) and embodies the advantage of taking an anthropological perspective to the use of interdisciplinary data. Critically, interdisciplinary research utilizing different forms of data and evidence requires rigorous hypothesis testing to accurately assess the information contained within the data as well as to make results easily accessible to future research endeavors. The natural goal of research programs that synthesize multiple lines of evidence is to produce well-defined hypotheses that encourage the future incorporation of additional lines of evidence and are hopefully robust to inquiry by new datasets.
CHAPTER 2
THREE-STAGE COLONIZATION MODEL FOR THE PEOPLING OF THE AMERICAS (footnote 1)
Footnote 1: Reproduced with permission from Kitchen A, Miyamoto M, Mulligan C. 2008a. A three-stage colonization model for the peopling of the Americas. PLoS ONE. 3:e1596.
Introduction
For decades, intense and interdisciplinary attention has focused on the colonization of the last habitable landmass on the planet: the peopling of the Americas. The first comprehensive, interdisciplinary model for New World colonization incorporated linguistic, paleoanthropological, and genetic data and generated great controversy, which was due, at least in part, to the uniquely broad scope of the research (Greenberg, Turner, and Zegura 1986). Since that time, more focused studies have resulted in agreement on the general parameters of the colonization process, such as a single migration, in contrast to the original three-migration model that distinguished Amerinds, Na-Dene, and Eskimo-Aleuts (Greenberg, Turner, and Zegura 1986). However, a full understanding of the complex and dynamic nature of the timing and magnitude of the colonization process remains elusive.
The majority of the genetic literature supports a single migration of Paleoindians into the New World from an East Asian source population (Schurr 2004). Specifically, the reduced variation and ubiquitous distribution of mitochondrial and Y-chromosome haplogroups and microsatellite diversity throughout the New World relative to Asia argue strongly for a single migration (Mulligan et al. 2004; Wang et al. 2007). However, a great many models have been proposed that differ significantly in the timing and size of this migration event (Schurr 2004; Bonatto and Salzano 1997a; Bonatto and Salzano 1997b; Forster et al. 1996; Santos et al. 1999; Schurr and Sherry 2004; Shields et al. 1993; Silva et al. 2002; Szathmary 1993; Tamm et al. 2007; Torroni et al. 1993a; Torroni et al. 1993b). Different migration dates have been proposed
ranging from 13 thousand years ago (kya) to 30-40kya (Schurr 2004; Bonatto and Salzano 1997a; Bonatto and Salzano 1997b; Forster et al. 1996; Santos et al. 1999; Schurr and Sherry 2004; Shields et al. 1993; Silva et al. 2002; Szathmary 1993; Tamm et al. 2007; Torroni et al. 1993a; Torroni et al. 1993b). Numerical estimates of the founder effective population size (Ne) are infrequent in the literature but vary substantially, from a high of 5,000 (Bonatto and Salzano 1997b) to a low of 70 Paleoindian founders (Hey 2005). These dates and population sizes have been proposed to accommodate a wealth of scenarios including ancient, recent, and/or additional migrations responsible for the peopling of the Americas.
Archaeological data provide clear support for a widespread human presence in the Americas by 13kya (all calendar dates are recalibrated radiocarbon dates as reported in the cited literature), the time by which the Clovis complex was established across the interior of North America (Hamilton and Buchanan 2007; Waters and Stafford 2007). Older archaeological sites, e.g., the Nenana Complex in Alaska (Hamilton and Buchanan 2007), the Monte Verde site in Chile (Dillehay 1997), and the Schaefer, Hebior, and Mud Lake sites in Wisconsin (Joyce 2006; Overstreet 2005), document an earlier chronology, possibly 2,400 years before Clovis (Waters and Stafford 2007; Joyce 2006; Overstreet 2005). Additionally, very old radiocarbon dates have been obtained from sites in Asian Beringia, suggesting that human populations had reached the north of western Beringia by 30kya (Goebel 2007; Pitulko et al. 2004).
The geological and paleoecological records for Beringia and northwestern North America provide further constraints on the timing for the peopling of the Americas. Beringia was a continuous landmass that connected Asia and North America from roughly 60kya until 11-10kya (Pitulko et al. 2004; Elias et al.
1996; Hopkins 1982). However, Beringia was isolated from continental North America until 14kya, when an intracontinental ice-free corridor opened up
between the Laurentide and Cordilleran Ice Sheets (Hoffecker, Powers, and Goebel 1993). Paleoecological data indicate that Beringia was able to sustain at least small human populations. Fossil pollen and plant macrofossils from ancient eastern Beringia are indicative of a productive, dry grassland ecosystem (Zazula et al. 2003), and paleontological evidence from Alaska and Siberia demonstrates that large mammals roamed Beringia (Guthrie 1990). After 11-10kya, Late Pleistocene sea levels rose sufficiently to re-inundate Beringia (Elias et al. 1996; Hopkins 1982), creating the Bering Strait that now separates the New World from Siberia by at least 100 kilometers (km) of open frigid water. Studies of human settlement throughout the Pacific Islands indicate that open water distances of >100 km constitute significant barriers to human migration, possibly because ancient people were unlikely to travel further than one day out of sight of land (Jobling, Hurles, and Tyler-Smith 2004). Similar, if not more severe, constraints would apply to early humans in Alaska and Siberia, thereby severely reducing the migration rate between the New and Old World once Beringia was re-inundated. Reduced migration due to the Bering Strait remains valid even as recent rates of short-range migration have increased between Siberia and Alaska (Tamm et al. 2007). In effect, the two continents were essentially geographically isolated from 11-10kya until modern times.
No detailed, unified theory of New World colonization currently exists that can account for the breadth and complexity of these interdisciplinary data. We analyze Native American mitochondrial DNA (mtDNA) coding genomes plus non-coding control region sequences as well as a combined nuclear and mitochondrial coding DNA dataset from New World and Asian [...] provide the most extensive comparative database for human populations worldwide (Pakendorf and Stoneking 2005). Furthermore, it has been proposed that mtDNA may be more sensitive to
demographic changes, such as population bottlenecks, due to its smaller effective population size (Wilson et al. 1985). The combined nuclear and mtDNA dataset was recently used to propose an unusually small Ne for the Amerind founders (Hey 2005), and thus investigation of this dataset is of much interest when attempting to reconcile the existing genetic evidence.
We use two complementary coalescent methods to develop a comprehensive scenario of New World colonization, with a focus on the timing and scale of the migration process. Bayesian skyline plot analyses use data from a single population to provide an unbiased estimate of changes in Ne through time, and thus are a powerful means for estimating past population growth patterns when the nature of the growth (e.g., exponential or constant) is unknown (Drummond et al. 2005). The isolation with migration (IM) structured coalescent model uses data from sister populations to jointly estimate population divergence time, migration rates, and a founder Ne with an assumption of exponential growth (Hey 2005). Importantly, we explicitly incorporate archaeological, geological, and paleoecological constraints into both analyses. Our goal is to provide a comprehensive model for the initial settlement of the Americas that generates new testable hypotheses and has high predictive power for the inclusion of new datasets. In light of our results, we propose a three-stage model in which a recent, rapid expansion into the Americas was preceded by a long period of population stability in greater Beringia by the Paleoindian population after divergence and expansion from their ancestral Asian population.
Results
Skyline Plot Analyses
Our alignment of 77 full mitochondrial coding genomes is one of the largest published alignments of Native American mtDNA coding genomes (Figure 2-S1, footnote 2). It includes genomes from the four major mtDNA haplogroups in the Americas (haplogroups A, B, C, and D are each represented by 17-31% of the entire sample), as well as the minor haplogroup X (2%). Correspondingly, this set of 77 complete coding mtDNA genomes represents geographically and linguistically diverse populations distributed throughout the New World (Mulligan et al. 2004). Bayesian skyline plots (Drummond et al. 2005) were used to visually illustrate changes in Amerind female effective population size (Nef) over time. Bayesian skyline plots assume a single migration event, which makes the approach ideal for questions concerning the peopling of the Americas since it is generally agreed that there was a single migration (Mulligan et al. 2004). Our skyline plot of the coding genomes describes a three-stage process in which there are two distinct increases in Nef at 40kya and 15kya that are separated by a long period of little to no growth (Figure 2-1). Specifically, Nef increases from 640 [95% credible interval (CI) = 148-9,969] to 4,400 individuals (95% CI = 235-18,708) at the first inflection point, and from 4,000 (95% CI = 911-13,006) to 64,000 individuals (95% CI = 15,871-202,990) at the second inflection point. There is also an apparent decrease in Nef prior to the second inflection point in which median Nef drops to 2,700 (95% CI = 404-36,628). We define a significant change in population size as the occurrence of non-overlapping 95% CIs at the beginning and end of an increase (see shading in Figure 2-1). Thus, we interpret the recent 16-fold increase in Nef over the interval 16-9kya as significant. The earlier 7-fold increase at 43-36kya is suggestive but not significant, although the increase is significant when compared over a much longer time period, e.g., from 25kya to the coalescent. Overall, the recent increase is consistent with a rapid, large-scale expansion into the Americas while the older increase is suggestive of a gradual expansion within Asia or Beringia.
Footnote 2: Figure 2-S1 (available online at PLoS ONE journal website): Multiple sequence alignment for the 77 Amerind mtDNA coding genomes used [...] following Pakendorf and Stoneking (2005) [...] site 546 of the Anderson Reference Sequence (ARS; Anderson et al. 1981). The final position of this alignment [...] the naming conventions of Herrnstadt et al. (2002), Ingman et al. (2000), Kivisild et al. (2006), and Mishmar et al. (2003), respectively.
The dataset of 812 concatenated mtDNA hypervariable region (HVR) I and II sequences is one of the largest published alignments of Native American HVRI+II sequences (Figure 2-S2, footnote 3). It includes all major New World haplogroups, and represents geographically and linguistically diverse populations distributed throughout the Americas. The HVRI+II dataset was randomly divided into ten non-overlapping alignments of 81 HVRI+II sequences, which allowed for ten independent trials for parameter estimation with a sample size similar to the coding genome alignment. The HVRI+II skyline plot analyses (Figure 2-2) produce estimates for median time to coalescence (55.5kya, 95% CI = 33.5-87.2kya) and Nef at coalescence (820, 95% CI = 26-3,979) and the present (66,200, 95% CI = 9,839-346,289) that are similar to the coding genome analyses (Figure 2-1). However, in contrast to the coding genome skyline plot, the HVRI+II skyline plot traces a very gradual increase in Nef over 40,000 years with no clear inflection points. The HVRI+II plot does show a significant increase in Nef but only when measured over the past 35,000 years.
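The significance criterion applied to the coding genome skyline plot (non-overlapping 95% CIs at the beginning and end of an increase) and the reported fold increases can be checked with simple arithmetic. The Python sketch below is illustrative only and is not part of the original analysis: the function names are hypothetical, and the inputs are the median Nef values and 95% CIs quoted above.

```python
def cis_overlap(ci_a, ci_b):
    """Return True if two (low, high) intervals share at least one value."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def summarize_increase(start, end):
    """Return (fold increase, significant?) for one inflection point.

    `start` and `end` are (median Nef, (CI low, CI high)) tuples.
    An increase is called significant only if the two CIs do not overlap.
    """
    (median_start, ci_start), (median_end, ci_end) = start, end
    fold = median_end / median_start
    significant = not cis_overlap(ci_start, ci_end)
    return fold, significant


# Median Nef and 95% CIs as reported in the text for the coding genomes.
first_increase = summarize_increase((640, (148, 9969)), (4400, (235, 18708)))
second_increase = summarize_increase((4000, (911, 13006)), (64000, (15871, 202990)))

print(first_increase)   # fold of about 6.9, CIs overlap
print(second_increase)  # fold of 16, CIs do not overlap
```

Under this reading of the criterion, the earlier increase (about 7-fold) has overlapping CIs and the recent 16-fold increase does not, which matches the interpretation given in the text.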
The fine detail evidenced in the coding genome skyline plot likely reflects the greater phylogenetic signal in the mitochondrial coding genome relative to the HVR (Non, Kitchen, and Mulligan 2007). In general, estimates of the time to the most recent common ancestor are less sensitive to reductions in the historical signal in mtDNA sequence data than is phylogenetic estimation (Non, Kitchen, and Mulligan 2007), a result consistent with our ability to recover similar coalescence times but not the changes in Nef seen when comparing the coding vs. HVRI+II skyline plots.
Footnote 3: Figure 2-S2 (available online at PLoS ONE journal website): Multiple sequence alignments for the ten randomly selected, non-overlapping sets of 81 HVRI+II sequences used in this study. In these alignments, positions 1-403 correspond to HVRI, whereas sites 404-781 refer to HVRII. In turn, these alignment positions correspond to sites 16003-16400 and 30 [...] the naming conventions of HVRbase (Handt, Meyer, and von Haeseler 1998).
Isolation with Migration Coalescent Analyses
Bayesian IM coalescent analyses were performed on a set of nine coding nuclear and mitochondrial loci that had been previously analyzed by Hey (2005) in support of an extremely small New World founder Ne of 70 individuals. Thus, we performed our analysis on the identical dataset and used the same coalescent and substitution models and model parameters, with the exception of new priors on the divergence time and on migration rates between Asian and Amerind populations (m_Asia→NW and m_NW→Asia). The lower bound on divergence time was set to 15kya, which corresponds to the period immediately preceding the earliest archaeological evidence for human habitation in the Americas (Waters and Stafford 2007; Dillehay 1997; Joyce 2006; Overstreet 2005). We also instituted serial constraints on m in order to gauge the effect of changing migration rates on founder Ne estimates. We interpret the various m values in comparison to an empirical estimate of m for modern Europe (m = 4.3; see Materials and Methods). In contrast to modern Europe, migration between the New World and Siberia from 15kya to more recent times would have become increasingly limited as Late Pleistocene sea levels rose sufficiently to inundate the Bering land bridge (Elias et al. 1996; Hopkins 1982). Thus, we expect m for modern Europe to be much higher than ancient migration rates between Asia and the Americas, especially after the inundation of Beringia.
Constraining divergence time by applying a lower bound of 15kya results in an estimate of 200 for the Amerind founding Ne. Serially constraining m_Asia→NW and m_NW→Asia, in conjunction with the constrained divergence time, produces increasingly larger estimates of Ne (Figure 2-3). Specifically, as both m parameters are simultaneously forced to lower and more biologically realistic values, estimates of Ne steadily increase from 200 to 1,200, especially after their priors are constrained to be < 5. Regardless of the specific priors on the m parameters, estimates for the Amerind divergence/expansion event are consistently 15kya (data not shown), which is very close to the lower bound of our prior established with known archaeological sites in the New World. Our results demonstrate that smaller estimates of Ne depend upon a substantial level of migration from Asia to account for present-day levels of Amerind genetic diversity; e.g., Hey's (2005) estimate of 70 founders is associated with an m_Asia→NW of 9.0, which is roughly twice the migration rate for contemporary Europe (m = 4.3). Eliminating all migration between Asia and the New World (m = 0) results in the largest estimate of Ne for the Amerind founding population of 1,200 individuals.
Discussion
When studying complex colonization scenarios, the interpretation of genetic data can benefit substantially from the incorporation of non-genetic material evidence. In our study, we do this in three ways. First, we interpret the skyline plot (see Figure 2-1) to reflect archaeological evidence that places Amerinds in the Americas by 15kya and human populations in Beringia by 30kya, as well as geological and paleoecological evidence that Beringia was habitable yet isolated from the Americas from 30kya to 17kya.
Second, we use archaeological radiocarbon dates to constrain the divergence time prior in our IM analyses to 15kya as the latest possible date for both the divergence of the Amerind and Asian gene pools and the Amerind expansion
into North America (Figure 2-3). Since the IM model assumes that divergence and expansion occur simultaneously, constraining the time of the expansion also requires an identical constraint on the divergence date. Third, in our IM analyses we serially constrain the migration rate parameters to smaller values and deduce likely migration rates between Asia and the New World based on empirical estimates of current migration rates within Europe versus the greatly reduced migration rates of ancient people across the Bering Strait starting 11-10kya.
Based on our results, we propose a three-stage colonization process for the peopling of the New World, with a specific focus on the dating and magnitude of the Amerind population expansions (Figure 2-4). We propose that the first stage was a period of gradual population growth as Amerind ancestors diverged from the central Asian gene pool and moved to the northeast. This was followed by an extended period of population stability in greater Beringia. The final stage was a single, rapid population expansion as Amerinds colonized the New World from Beringia.
The initial stage of the colonization process involved the divergence of Amerind ancestors from the East Central Asian gene pool (Figure 2-4A). Based on previous studies that included Asian mtDNA sequences, this divergence likely occurred prior to 50kya (Bonatto and Salzano 1997a, 1997b). Our coding skyline plot (Figure 2-1) indicates that the divergence was followed by a period of gradual growth, during which the proto-Amerind population experienced a 7-fold increase from 640 to 4,400 females over 7,000 years, from 43 to 36kya. The migrating founder population (Nef of 640) was a small subset of the ancestral Asian population, as evidenced by the low levels of variation in New World populations relative to Asians (e.g., Mulligan et al.
2004) as well as the larger effective size of the ancestral Asian population (Hey 2005). Thus, divergence from the Asian gene pool was the time at which a severe population bottleneck
occurred that reduced the genetic variation in Amerind populations. The lack of archaeological sites in Siberia and Beringia that date to 43-36kya (Kuzmin and Keates 2005) suggests that this was a relatively rapid and continuous movement. Consistent with this hypothesis are the younger coalescent dates for modern Siberian populations relative to modern New World populations (Torroni et al. 1993; Derenko et al. 2007), which indicate that the New World migrants passed through Siberia before other East Central Asian population(s) settled permanently in this region at a later date. Such relatively rapid and continuous movement would leave few archaeological sites, which have not yet been discovered due to the vast expanse and harsh conditions of Siberia and the current inundation of Beringia. Thus, an important prediction of the first stage of our model is that older archaeological sites dating to 43-36kya await discovery in these regions.
The proposed second stage (Figure 2-4B) consisted of an extended period of little change in population size from 36 to 16kya (Figure 2-1). It is difficult to assign a precise geographic location to this population, but it may have occupied the large region from Siberia to Alaska, most of which is currently underwater. Our Nef estimates of 4,000-5,000 (equivalent to Ne of 8,000-10,000, assuming an equal sex ratio) indicate that the proposed human presence would have been minor when compared to the size of greater Beringia. Nevertheless, the presence of this population in Beringia for 20,000 years would have afforded sufficient time for the generation of new mutations. Indeed, the existence of New World-specific variants that are distributed throughout the Americas indicates that substantial genetic diversification occurred during the Beringian occupation (e.g., Tamm et al. 2007; Torroni et al. 1993a; Torroni et al.). The proposed period of Beringian occupation coincides with archaeological evidence for the first Arctic inhabitation of
western Beringia (30kya; Pitulko et al. 2004) and pre-dates archaeological evidence for occupation of the New World (Waters and Stafford 2007; Dillehay 1997; Joyce 2006; Overstreet 2005). This period also coincides with geological evidence for restricted access to North America because of the impenetrability of the Cordilleran and Laurentide ice sheets (17-30kya; Hoffecker and Elias 2003; Mandryk et al. 2001). Botanical remains, such as macrofossils and ancient pollen, indicate that Beringia was a productive grassland ecosystem rather than an exceedingly harsh Arctic desert environment (Zazula et al. 2003). Paleontological evidence from Alaska and Siberia demonstrates that large mammals such as steppe bison, mammoth, horse, lion, musk oxen, sheep, woolly rhinoceros, and caribou inhabited this area (Guthrie 1990). Thus, the paleoecological data are consistent with a human presence in Beringia, although the carrying capacity of Beringia and technological limitations of the human population may have restricted growth until the population could expand into new and fertile lands in the Americas. The rapid expansion of the population only after an ice-free corridor into North America opened (see below) suggests that the population may have departed Beringia as soon as a viable alternative presented itself.
The final colonization stage (Figure 2-4C) was a rapid geographic expansion into the New World resulting in a significant population increase (16-fold; Figure 2-1). The rapid population increase occurred over the period 16-9kya according to the coding skyline plot or over the past 15,000 years based on the IM analyses (the latter results supported only the most recent and largest expansion, most likely because IM analyses assume a single, simultaneous divergence/expansion event).
The geological record indicates that North America became accessible from Beringia between 17 and 14kya, when the ice sheets covering what is now Canada began to retreat (Hoffecker, Powers, and Goebel 1993; Mandryk et al. 2001). The coincident
timing of an ice-free corridor into North America and the rapid expansion of the Amerind population suggests that a land route may have been the preferred entry into the New World. However, the northwest Pacific coast of North America also may have been deglaciated by 17kya, thus presenting a viable coastal route to continental North America (Wang et al. 2007; Mandryk et al. 2001). This period also coincides with the initial inundation of the Bering land bridge, after which migration with Asia would have been severely limited. The first unequivocal evidence for human occupation of the New World occurs in the form of Clovis sites dating to 13kya (Waters and Stafford 2007) and pre-Clovis sites in both North and South America dating to 14-15kya (Dillehay 1997; Joyce 2006; Overstreet 2005).
Our datasets do not include typings from the Na-Dene or Esk-Aleut, so we limit our scope to the largest, initial migration of Amerinds into the New World. However, Na-Dene and Esk-Aleut genetic diversity represents a subset of Amerind diversity (e.g., Kolman et al. 1995; Kolman, Sambuughin, and Bermingham 1996; Merriwether, Rothhammer, and Ferrell 1995), suggesting that Na-Dene and Esk-Aleuts are derived from the same Beringian source population as Amerinds. As stated above, extensive archaeological evidence supports the presence of multiple distinct Native American material cultures by 13kya [e.g., Clovis, Nenana, and pre-Clovis lithic technologies (Waters and Stafford 2007)]. Our results suggest that these distinct cultures derive from a single New World founder population and are most likely the product of an extensive and complex process of post-peopling migrations within the Americas, possibly through a combination of coastal and/or riverine routes (Wang et al. 2007; Fix 2005).
Determination of the size of the Amerind founding population has received considerable attention.
Based on the coding Bayesian skyline plot (Figure 2 1), there is a slight decrease in population size preceding the increase seen at 15kya. This decrease is consistent with a
secondary founder effect in which a subset of the Beringian population seeded the proto-Amerind expansion into the Americas. Assuming the apparent decrease in Nef is the result of such a founder effect, the upper bound on the founder population size is 5,400 individuals (Nef of 2,700). Our IM analyses suggest that the founder population size could be lower depending on prior assumptions about the over-water migration rates between the Americas and Asia (see Figure 2-3). Migration rates (m) within Europe today based on census data have been determined to be 4.3, which can be taken as an extreme upper bound of possible ancient migration rates between the Americas and Asia, especially after the appearance of the Bering Strait 11-10kya. Restricting migration rates to <1 results in founder Ne estimates between 1,000 and 1,200, with 1,200 serving as an asymptotic upper bound (see Figure 2-3). Taken together, our Bayesian skyline plot and IM analyses suggest that a founder population with Ne = 1,000-5,400 colonized the New World in a process characterized by a rapid geographic and population expansion. The range of Ne values can be translated into an approximate census population size by applying a scale factor estimated from large mammal populations (scale factor = 5; Templeton 1998), which suggests that the founder population consisted of 5,000-27,000 people.
Our three-stage model now awaits further critical testing with new datasets of independent nuclear loci and more sophisticated methods of coalescent analysis. The extensive dataset of 700 autosomal microsatellites, compiled by Wang et al. (2007) for both Native American and worldwide populations, offers the opportunity to evaluate critically the size, timing, and duration of each step in our model at essentially a population genomics level.
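The conversion from effective size to approximate census size described above is plain arithmetic, reproduced in the Python sketch below. The constant and function names are hypothetical; the doubling of Nef assumes an equal sex ratio (as stated for the Beringian estimates), and the scale factor of 5 is the value the text attributes to Templeton (1998).

```python
SEX_RATIO_FACTOR = 2  # assumption: equal sex ratio, so Ne = 2 x Nef
CENSUS_SCALE = 5      # scale factor from large mammal populations (per the text)


def nef_to_ne(nef):
    """Convert female effective size to total effective size."""
    return nef * SEX_RATIO_FACTOR


def ne_to_census(ne):
    """Convert effective size to an approximate census size."""
    return ne * CENSUS_SCALE


# Upper bound: skyline plot minimum before the expansion (Nef of 2,700).
upper_ne = nef_to_ne(2700)
# Lower bound: IM estimate of founder Ne with migration rates restricted to < 1.
lower_ne = 1000

census_range = (ne_to_census(lower_ne), ne_to_census(upper_ne))
print(upper_ne)       # 5400
print(census_range)   # (5000, 27000)
```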
Future versions of BEAST will incorporate a structured coalescent in which migration as well as population growth will be allowed to occur among populations from both the New World and Asia (http://evolve.zoo.ox.ac.uk/beast/manual.html). In these BEAST analyses, the microsatellites can
be modeled under a stepwise [...] related according to their repeat lengths. One can then summarize over these microsatellite loci by assuming independence, which thereby allows for the multiplication of their separate posterior distributions and final estimations of their combined Bayesian skyline plot. In these ways, we fully anticipate that such critical testing will lead to many important refinements of our three-stage model, including a further narrowing of our proposed range for the size of the founding population as well as new details about post-peopling expansions within the New World.
Materials and Methods
Datasets
Three datasets were collected for analysis: (i) 77 mtDNA coding genomes; (ii) 812 mtDNA HVRI+II sequences; and (iii) a combined nuclear and mitochondrial coding DNA dataset. The 77 mtDNA coding genomes were collected from publicly available resources (Herrnstadt et al. 2002; Ingman et al. 2000; Kivisild et al. 2006; Mishmar et al. 2003) and aligned using ClustalX (Thompson, Higgins, and Gibson 1994). The resultant 15,500 base pair (bp) multiple alignment was edited by hand to minimize the number of unique gaps and to ensure the integrity of the reading frame (Figure 2-S1). A total of 812 combined HVRI+II sequences were collected from HVRbase (http://www.hvrbase.org; Handt, Meyer, and von Haeseler 1998). These sequences were aligned in the same way as the coding mtDNAs, resulting in a multiple alignment of 771 bps (Figure 2-S2). The complete dataset of 812 HVRI+II sequences was randomly divided into ten non-overlapping alignments of 81 sequences that approximate the sample size for the coding mtDNA dataset. Skyline plot analyses of larger datasets (up to 200 HVRI+II sequences) gave the same results as the 81-sequence datasets (data not shown).
Thus, the smaller datasets of 81 sequences each were emphasized here since they avoided the likelihood rounding errors that can occur when using large, heterogeneous datasets in Bayesian skyline plot analyses.
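The random division of the HVRI+II dataset into ten non-overlapping sets of 81 sequences can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used: the function name, the placeholder sequence identifiers, and the fixed seed are all assumptions. Note that 10 x 81 = 810, so under this reading two of the 812 sequences are left unassigned; the text does not say how they were handled.

```python
import random


def split_non_overlapping(items, n_subsets, subset_size, seed=None):
    """Shuffle `items` and cut the result into disjoint subsets.

    Returns (subsets, leftover), where `leftover` holds any items that
    did not fit into the requested number of equally sized subsets.
    """
    if n_subsets * subset_size > len(items):
        raise ValueError("not enough items for the requested subsets")
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    subsets = [
        shuffled[i * subset_size:(i + 1) * subset_size]
        for i in range(n_subsets)
    ]
    leftover = shuffled[n_subsets * subset_size:]
    return subsets, leftover


# Placeholder identifiers standing in for the 812 HVRI+II sequences.
sequence_ids = [f"seq{i:03d}" for i in range(812)]
subsets, leftover = split_non_overlapping(sequence_ids, 10, 81, seed=1)

print(len(subsets), len(subsets[0]), len(leftover))  # 10 81 2
```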
The coding nuclear and mtDNA dataset from Asian and Native American populations of Hey (available at http://lifesci.rutgers.edu/~heylab/; Hey 2005) consisted of two autosomal coding loci, five X-chromosome coding loci, one Y-chromosome coding locus, and the complete mtDNA coding genome (totaling 28,454 aligned bps). The sample sizes for these nuclear loci and mitochondrial genome varied from 12 to 108 sequences.
Bayesian Skyline Plot Analyses
Bayesian skyline plots (Drummond et al. 2005) were used to estimate changes in Amerind Nef over time by providing highly parametric, piecewise estimates of Nef. This approach produces serial estimates of effective population size from the time intervals between coalescent events in a genealogy of sampled individuals, and utilizes a Markov chain Monte Carlo simulation approach to integrate over all credible genealogies and other model parameters. It thereby differs from previous approaches (e.g., Polanski, Kimmel, and Chakraborty 1998) in that Bayesian skyline plots fully parameterize both the mutation model (including relaxed clock models) and the genealogical process, whereas prior methods relied on generating estimates from summary statistics (e.g., the use of pairwise differences by Polanski, Kimmel, and Chakraborty 1998). In these analyses, estimates of (Nef x generation time) were converted to Nef by dividing by a generation time of 20 years, following convention (Hey 2005).
Skyline plots were generated for the 77 mtDNA coding genome sequences and the ten datasets of HVRI+II sequences using the program BEAST v1.4 (http://beast.bio.ed.ac.uk). These BEAST analyses relied on the same coalescent and substitution models and run conditions as Kitchen, Miyamoto, and Mulligan (2007), except as noted below. Plots were generated using the established mutation rates (μ) for coding mtDNA (μ = 1.7 x 10^-8 substitutions/site/year; Ingman et al. 2000) and HVRI+II
mtDNA (μ = 4.7 × 10^-7; Howell et al. 2003). Markov chains were run for 100,000,000 generations and sampled every 2,500 generations, with the first 10,000,000 generations discarded as burn-in. Three independent runs were performed for all coding and HVRI+II Bayesian skyline plot analyses. Markov chain samples from the three independent mtDNA coding replicates and from the 30 HVRI+II analyses were separately combined using the LogCompiler program (distributed with BEAST) and analyzed using Tracer v1.3 to produce the final Bayesian skyline plots.

Isolation with Migration Coalescent Analyses

Bayesian IM coalescent analyses were performed using the program IM (Hey 2005) to estimate Ne for the Amerind founder population (males + females) and the divergence time for Amerind and Asian populations. We used the same combined nuclear and mtDNA dataset, the same coalescent and substitution models, and the same model parameters as Hey (2005), with the exception of new priors on the divergence time and on the migration rates between Asian and Amerind populations. All IM analyses were performed using a flat uniform prior for the divergence time of Amerind and Asian populations set to the interval 15-40 kya. The lower bound of this prior is based on accepted archaeological and climatological evidence for the first presence of humans in the Americas (Waters and Stafford 2007; Dillehay 1997; Joyce 2006; Overstreet 2005). The upper bound of the flat uniform priors on the migration rates per mutation per generation between the Amerindian and Asian populations (m Asia→NW and m NW→Asia) was set to 12 different values (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 50). To help interpret these results, we relied on an estimate of the migration rate in modern Europe as obtained from census data (Weale et al. 2002). Specifically, we converted their migration rate estimate of 0.0004 migrations per gene copy per generation (recalculated assuming a generation time of 20 years based on Hey
2005) to our units of migrations per mutation per generation (m) by dividing the former by the geometric mean of the mutation rates for the nine loci in this dataset (9.32 × 10^-5 mutations per locus per generation; Hey 2005). These calculations resulted in m = 4.3 for modern Europe. In contrast, the ancient migration rates between the New World and Asia would have been considerably lower, especially after their geographic separation due to the re-inundation of Beringia starting at 11 kya (see Introduction). Ten independent replicates were performed for each of the 12 upper bound values on the migration rates, for a total of 120 IM analyses. All Markov chains were run for 100,000,000 generations without heating.
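The unit conversion described above can be checked with a short calculation. The following is a minimal sketch only: the function and variable names are illustrative assumptions and are not part of the IM program, and only the numerical values are taken from the text.

```python
# Illustrative check of the migration-rate conversion described in the text.
# All names are hypothetical; only the numbers come from the text.

def migrations_per_mutation(m_per_gene_copy, mu_per_locus):
    """Convert migrations per gene copy per generation into migrations
    per mutation per generation by dividing by the per-locus,
    per-generation mutation rate."""
    return m_per_gene_copy / mu_per_locus

europe_rate = 0.0004   # migrations per gene copy per generation (Weale et al. 2002)
geo_mean_mu = 9.32e-5  # geometric mean mutation rate per locus per generation (Hey 2005)

m_europe = migrations_per_mutation(europe_rate, geo_mean_mu)
print(round(m_europe, 1))  # 4.3

# Number of IM analyses: ten replicates for each of the 12 upper bounds.
upper_bounds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
replicates_per_bound = 10
total_runs = len(upper_bounds) * replicates_per_bound
print(total_runs)  # 120
```

The quotient is about 4.29, which rounds to the m = 4.3 reported for modern Europe.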
Figure 2-1. Bayesian skyline plot for the mtDNA coding genome sequences. The curve plots median Nef, with its 95% CI indicated by the light gray lines. The calculated Nef assumes a generation time of 20 years following Hey (2005); alternatively, using a generation time of 25 years (Fenner 2004) would uniformly decrease all estimates of Nef by 20%. The shaded regions highlight two periods of substantial population growth. This skyline plot provides the principal evidence for our three-stage model of New World colonization, i.e., the three stages that are depicted and labeled here.
Figure 2-2. Bayesian skyline plot for the mtDNA HVRI+II datasets. This plot follows the conventions of Figure 2-1. Its estimates of coalescent time and of Nef at the coalescence and today are in agreement with the coding mtDNA skyline plot (Figure 2-1). In contrast, this HVRI+II plot provides little resolution for other population size changes, most likely because of mutational saturation in the non-coding control region (see text).
Figure 2-3. Graph of IM results for the combined nuclear and mitochondrial coding DNA dataset. The plot depicts mean Ne for the Amerind founder population (y axis) as a function of the increasing upper bound of the priors on the migration rates (x axis). In these analyses, the lower bound of the prior on the divergence time was uniformly set to 15 kya on the basis of known archaeological materials for human occupation in the New World (see text). Each point is based on the average of the estimated medians for ten independent replicate analyses, with the bars corresponding to ±1 standard deviation. These standard deviations are often small (with coefficients of variation less than 0.01), since their Markov chains were run for 100 million generations each.
Figure 2-4. Maps depicting each phase of our three-step colonization model for the peopling of the Americas. (A) Divergence, then gradual population expansion of the Amerind ancestors from their East Central Asian gene pool (blue arrow). (B) Proto-Amerind occupation of Beringia with little to no population growth for 20,000 years. (C) Rapid colonization of the New World by a founder group migrating southward through the ice-free, inland corridor between the eastern Laurentide and western Cordilleran Ice Sheets (green arrow) and/or along the Pacific coast (red arrow). In (B), the exposed seafloor is shown at its greatest extent during the last glacial maximum at 20-18 kya (Hopkins 1982). In (A) and (C), the exposed seafloor is depicted at 40 kya and 16 kya, when prehistoric sea levels were comparable (Elias et al. 1996; Hopkins 1982). Because of the earth's curvature, the km scale (which is based on the straight line distance at the equator) provides only an approximation of the same distance between two points on these maps. In addition, a scaled-down version of Beringia today (60% reduction of A-C) is presented in the lower left corner. This smaller map highlights the Bering Strait that has geographically separated the New World from Asia since 11-10 kya.
CHAPTER 3
BAYESIAN PHYLOGENETIC ANALYSES OF SEMITIC LANGUAGES IDENTIFY AN EARLY BRONZE AGE ORIGIN OF SEMITIC AND A SINGLE IRON AGE MIGRATION TO AFRICA FOR ETHIOSEMITIC

Introduction

The Semitic languages comprise one of the most studied language families in the world. Semitic is of particular interest due to its association with the earliest civilizations in Mesopotamia (Lloyd 1984), the Levant (Rendsburg 2003), and the Horn of Africa (Connah 2001), and with literary (e.g., the Akkadian poem The Epic of Gilgamesh) and religious (e.g., Jewish, Christian, and Islamic) traditions. This association dates back at least 4350 years before present (historical sources date events relative to the common era that began 2000 years ago, but for consistency we will use years before present [ybp] as our dating standard) to ancient Sumer in Mesopotamia, where Akkadian replaced Sumerian (the first known written language) and adopted the Sumerian cuneiform script (Buccellati 1997). From this time forward, archaeological evidence for Semitic amongst the Hebrews and Phoenicians in the Levant (Rendsburg 2003) and the Aksumites in the Horn of Africa (Connah 2001) suggests that Semitic and Semitic-speaking populations underwent a complex history of geographic expansion and migration tied to the emergence of the earliest urban civilizations in these regions. However, though Semitic-speaking populations are well represented in the archaeological record (Lloyd 1984; Connah 2001; Richard 2003a; Nardo 2007), their origins and relationships to each other remain uncertain. Without knowledge about the history of Semitic populations, our understanding of the ancient civilizations and lasting cultural traditions of the Middle East and the Horn of Africa remains incomplete.

Despite multiple genetic studies of extant Semitic-speaking populations (e.g., Nebel et al. 2002; Capelli et al. 2006), much is still unknown about the genealogical relationships of these
populations. Most previous genetic studies focused on time frames that are either too recent [e.g., the origin of Jewish communities in the Middle East and Africa (Hammer et al. 2000; Nebel et al. 2001; Rosenberg et al. 2001)] or too ancient [e.g., the original out-of-Africa migration (Passarino et al. 1998; Quintana-Murci et al. 1999)] to provide insight about the origin and dispersal of Semitic and Semitic speakers.

Linguistic studies can aid our understanding of the origin and dispersal of Semitic and Semitic speakers. Previous historical linguistic studies of Semitic languages have utilized the comparative method (e.g., Faber 1997). The comparative method is a technique that uses the pattern of shared, derived changes in language (e.g., vocabulary, syntax, grammar), termed innovations, to assess the relative relatedness of languages but cannot date the divergences between languages (Campbell 2000). Simply stated, two languages that share more innovations are more closely related via a common ancestor than two languages that share fewer innovations, and the pattern of pairwise comparisons of innovation sharing is used to infer relative relationships between many languages. Cognates are words that share a common form and meaning through descent from a common ancestor.

Several alternative sub-groupings of Semitic have been proposed using the comparative method, but the field has generally coalesced around a model that places the ancient Mesopotamian language Akkadian at the root of Semitic (Hetzron 1976; Faber 1980; Faber 1997). This standard model divides Semitic into East Semitic, composed of only the extinct Akkadian and Eblaite languages, and West Semitic, consisting of all remaining Semitic languages, distributed from the Levant to the Horn of Africa. West Semitic is in turn divided into
South [including Ethiosemitic and Modern South Arabian (MSA)] and Central geographic groups, but the genealogical relationships of the languages within these two groups are ill-defined (Huehnergard 1990; Huehnergard 1992; Rodgers 1992; Faber 1997) and require further investigation to clarify their roles in the ancient history of the Middle East and Horn of Africa. Additionally, no consensus exists for placing Arabic in either the Central or South Semitic groups (Hetzron 1976; Blau 1978; Diem 1980), which makes its location simultaneously uncertain and interesting, as Central and South Semitic are geographically and genealogically distinct entities.

Dating language divergences has been controversial, especially when linguistic clocks are involved (see Renfrew, McMahon, and Trask 2000 for discussion). The linguistic clock is controversial because it assumes that languages evolve in a probabilistic manner with a fixed rate (Ehret 2000), whereas there is evidence for variation in rates between languages and words and no reason why languages should evolve with predictable rates (e.g., Blust 2000). However, recent studies show that variation in the rates of linguistic change follows generalized rules that apply across language families (Pagel, Atkinson, and Meade 2007; Atkinson et al. 2008). This suggests that variation in rates between words and languages can be appropriately modeled by applying techniques used in molecular evolution (e.g., gamma distributions of site variation and relaxed clock models) to produce a relaxed linguistic clock that accounts for rate variation. Since the comparative method does not provide time estimates for language divergences, relaxed linguistic clocks coupled with phylogenetic methods also borrowed from evolutionary biology provide a statistical alternative to accurately date language divergences.
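The idea of a relaxed clock can be made concrete with a small simulation. The sketch below is illustrative only and is not the BEAST implementation: it assumes a lognormal distribution of branch rates, and the function name, mean rate, spread, and branch duration are hypothetical values chosen for the example.

```python
import math
import random

def draw_branch_rates(mean_rate, sigma, n_branches, seed=0):
    """Draw one independent rate per branch from a lognormal
    distribution whose real-space mean equals mean_rate.
    sigma is the standard deviation of log(rate)."""
    rng = random.Random(seed)
    # For a lognormal, mean = exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(mean_rate) - 0.5 * sigma ** 2
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]

# Hypothetical values chosen only for illustration.
rates = draw_branch_rates(mean_rate=0.001, sigma=0.5, n_branches=20000)
sample_mean = sum(rates) / len(rates)

# Under any clock, the expected number of changes on a branch is
# rate * duration; a relaxed clock lets the rate differ by branch.
branch_years = 1000
expected_changes = [r * branch_years for r in rates[:3]]

# Setting sigma to zero removes the variation (a strict clock).
strict_rates = draw_branch_rates(mean_rate=0.001, sigma=0.0, n_branches=5)
```

With `sigma` set to zero every branch receives the same rate, which corresponds to the fixed-rate assumption criticized above; larger values of `sigma` allow progressively more variation between lineages around the same mean.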
In this study, we analyzed lexical data for 25 Semitic languages distributed throughout the Middle East and Horn of Africa (see Fig. 3-1 for the geographic distribution of languages) using a Bayesian phylogenetic method to simultaneously infer the genealogical relationships and estimate the divergence dates of the Semitic languages. We used epigraphic and archaeological evidence for the sampling dates of the lexical information (i.e., the time at which the materials were written and the lexical data stopped changing) from extinct Semitic languages (including Akkadian, Aramaic, ancient Hebrew, and Ugaritic) to improve the accuracy of our divergence date estimates. We employed a log Bayes factor (log BF) model testing technique to statistically test alternative Semitic histories and verify the information content of our lexical data. Finally, we combined our divergence date estimates with the epigraphic and archaeological evidence to form an integrated model of Semitic history.

Results

Genealogy of Semitic Languages

The phylogenetic analysis of the Semitic languages produced the phylogenetic tree shown in Fig. 3-2. A brief summary of the tree highlights (1) the greater age of the non-African Semitic languages (5925 ybp vs. 2925 ybp), (2) the presence of Akkadian followed by Central Semitic near the root of Semitic, (3) the relatively poor resolution of non-African languages in comparison to the well-resolved relationships of the African languages, and (4) the well-resolved and recent divergences of the Ethiosemitic languages in a monophyletic (i.e., single origin) clade. Branches with posterior probability estimates (the probability that a group of languages is more closely related to each other than to other languages) less than 0.70, which is generally considered the benchmark for statistically well-supported clades, were considered to be unresolved (i.e., the relative pattern of divergence amongst the taxa could not be ascertained) and collapsed to reflect this uncertainty.
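The collapsing rule can be sketched on a toy tree. This is a minimal illustration and not the procedure used to draw Fig. 3-2: the nested-dictionary tree format, the function name, and the placeholder language labels are all assumptions made for the example; only the 0.70 threshold comes from the text.

```python
def collapse_weak_clades(node, threshold=0.70):
    """Return a copy of the tree in which every internal child clade
    with posterior support below threshold is dissolved, so that its
    children attach directly to its parent (forming a polytomy)."""
    if not node.get("children"):           # leaf: nothing to collapse
        return {"name": node["name"]}
    new_children = []
    for child in node["children"]:
        collapsed = collapse_weak_clades(child, threshold)
        is_internal = "children" in collapsed
        if is_internal and child.get("posterior", 1.0) < threshold:
            # Weakly supported clade: promote its children one level up.
            new_children.extend(collapsed["children"])
        else:
            new_children.append(collapsed)
    result = {"children": new_children}
    if "posterior" in node:
        result["posterior"] = node["posterior"]
    return result

# Hypothetical four-language tree: (A, B) is well supported, (C, D) is not.
toy_tree = {
    "posterior": 1.0,
    "children": [
        {"posterior": 0.95, "children": [{"name": "A"}, {"name": "B"}]},
        {"posterior": 0.55, "children": [{"name": "C"}, {"name": "D"}]},
    ],
}

collapsed_tree = collapse_weak_clades(toy_tree)
```

In the toy result, the clade (A, B) is kept, while C and D attach directly to the root because their grouping had a posterior below 0.70.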
Long branches are indicative of long intervals between divergences or the presence of unsampled languages, such as the long branch leading to the two Arabic languages in our study, whereas short branches indicate bursts of diversification that are a hallmark of linguistic evolution (Nettle 1999; Atkinson et al. 2008).

Semitic Language Divergence Dates

In addition to illustrating the relationships between different Semitic languages, our phylogenetic analysis produced dates for the divergence of related languages. The mean divergence times and 95% credible intervals (CI) of all language divergences are depicted on the tree in Fig. 3-2, with all times in years. The tree displayed a primary division between Akkadian and the remaining Semitic languages, which supported an estimated origin of Semitic ~5900 ybp (CI = 4300-7750 ybp) during the Early Bronze Age (Ehrich 1992). A secondary division occurred between the Central and South Semitic groups (node A in Fig. 3-2) with an estimated divergence time of ~5425 ybp (CI = 3850-7525 ybp). The Central Semitic clade (Arabic, Aramaic, Hebrew, and Ugaritic) had strong posterior support (0.82; node B) and a weakly supported internal divergence of Ugaritic from an unstructured group of Arabic, Aramaic, and Hebrew dated at ~4100 ybp (CI = 3400-5925 ybp). The Arabic clade (node C) had 100% posterior support and an estimated divergence time of Moroccan and Ogaden Arabic of ~540 ybp (CI = 110-1375 ybp). On the other half of the tree, the South Semitic clade showed an ancient divergence (node D) of the Ethiosemitic and MSA language groups dated to ~4525 ybp (CI = 2700-6825 ybp) and overlapping with the transition from the Early to Middle Bronze Ages. The most recent common ancestor of the strongly supported MSA clade (1.0; node E) was estimated to have been extant ~1300 ybp (CI = 475-2550 ybp), and the narrow geographic distribution of MSA along the southern
coast of Arabia facing the Gulf of Aden suggests that the diversification of MSA occurred in this area. The lone, strongly supported branch (posterior = 0.94) leading to Ethiosemitic indicates a single origin for the Semitic languages in the Horn of Africa, with their diversification into North and South clades (node F) occurring ~2975 ybp (CI = 1850-4500 ybp), or during the second period of the Iron Age. The large number of small internal branches in the Ethiosemitic group indicates a rapid diversification of these languages. Our phylogenetic analysis divided the South Ethiosemitic languages into three well-supported clades with estimated divergences occurring approximately a millennium ago: Outer Gurage (node G; divergence = 1175 ybp, CI = 460-2175 ybp), East Gurage (node H; divergence = 1125 ybp, CI = 400-2175 ybp), and Amharic-Argobba-Gafat (node I; divergence = 1000 ybp, CI = 300-2100 ybp).

Log Bayes Factor Tests

The validity and usefulness of our interpretations rest on the accuracy of our phylogeny. Thus, we assessed the robustness of our phylogenetic analysis by statistically testing alternative histories of Semitic. This was done using log Bayes factor (BF) model tests, which compare the probability that each model produced the observed data (i.e., the wordlist data). Log BF values (all values in log base 10 units) in the intervals 0-1, 1-2, and >2 are considered to provide increasing levels of support for the primary model (Kass and Raftery 1995). We tested four alternative Semitic histories in the following three comparisons. The comparison between the Standard model [which constrains Akkadian to the root and constrains the sub-division of West Semitic into Central and South Semitic clades] and an unconstrained model yielded log BF = 1.1, consistent with Semitic genealogies estimated using the comparative method (Faber
1997). We next compared the Standard model to a model of Semitic history that placed Ethiosemitic at the root (i.e., an African origin for Semitic), and the log BF test supported a non-African origin of Semitic. Our final comparison, between the Standard model and the Old Arabic model (a version of the Standard model with the most recent common ancestor of Ogaden and Moroccan Arabic constrained to the period prior to the expansion of Arabic ~1400 ybp), supported the accuracy of our rate estimates and the ability of our analysis to reject unrealistic divergence dates (i.e., a divergence preceding the expansion of Arabic in the 7th century).

Discussion

The Semitic language family is unique in that it is the sole member of the Afroasiatic language phylum to have been historically spoken both outside and within Africa (Hayward 2000). Semitic is associated with some of the oldest urban states in ancient Mesopotamia and the world, such as the first Akkadian Empire of Sargon in the third millennium B.C.E. (Lloyd 1984). Furthermore, Semitic societies in the ancient Levant, particularly the Canaanites, Phoenicians, and Israelites, played important roles in the second and first millennia B.C.E. by developing new kinds of long distance commerce in the Mediterranean and ancient Middle East and creating great literary and religious traditions (Rendsburg 2003). The oldest states in southern Arabia (e.g., von Wissman 1975) and the Ethiopian Highlands [Yeha, Aksum (Connah 2001)] are also associated with Semitic and underscore both the antiquity and geographic extent of the influence and importance of Semitic-speaking populations in the history of the Middle East and Horn of Africa. However, important aspects of the complex history underlying this distribution of Semitic remain unresolved, such as the timing and location of
Semitic origins (e.g., non-African vs. African origin), the age of Semitic languages in the Levant, and the timing and provenience of the Semitic languages in Africa.

Historical linguists have traditionally used the comparative method to produce language trees that depict degrees of relatedness to a common ancestor. The comparative method uses shared linguistic innovations (i.e., derived linguistic features, analogous to DNA mutations) to infer patterns of hierarchical relatedness, but cannot produce estimates of divergence times. Conversely, a branch of historical linguistics called lexicostatistics (or glottochronology) employs the percentage of cognates shared between languages as a measure of evolutionary distance that is used to estimate language divergence dates based on a strict linguistic clock. This method is controversial and seldom used today because many exceptions have been found to its underlying assumption that linguistic change is constant in time and across languages (Ehret 2000). Recent research, however, has identified generalized mechanisms underlying rate variation that act across language groups (Pagel, Atkinson, and Meade 2007; Atkinson et al. 2008), suggesting that variation in rates of linguistic evolution can be accurately modeled and accounted for using relaxed clocks analogous to those used in molecular evolution.

Several recent studies have taken a statistical approach to inferring language trees via the application of model-based, computational phylogenetic methods borrowed from evolutionary biology to the analysis of cognate lists traditionally used in historical linguistics (Gray and Jordan 2000; Holden 2002; Gray and Atkinson 2003).
These analyses employed computational phylogenetic techniques that allowed the evolution of the linguistic data to be modeled, the strength of alternative language trees to be assessed by objective criteria, the confidence of evolutionary models and parameters to be estimated, and alternative linguistic histories to be statistically tested. The first of these studies used maximum parsimony phylogenetic criteria,
which favor trees that minimize the number of evolutionary changes necessary to explain the data, to investigate the Austronesian (Gray and Jordan 2000) and Bantu (Holden 2002) language expansions but did not estimate divergence dates for either language family. A similar study of Indo-European (Gray and Atkinson 2003) employed a Bayesian phylogenetic method (which estimates the probability that a model is correct given the data and some independent, prior knowledge) to infer an Indo-European language tree and then applied a likelihood dating technique (which maximizes the probability that the model produces the data), calibrated with a priori constraints on the times of specific sub-family divergences, to date events in the history of Indo-European.

In contrast to the likelihood method used to estimate divergence times of Indo-European, which does not account for uncertainty in both the tree and the rate, we employ a Bayesian phylogenetic technique that allows for the co-estimation of Semitic language trees and divergence dates while fully accounting for the uncertainty of both (Drummond et al. 2006). We also provide the first use of language sampling dates [as opposed to the constraints on dates of specific nodes within the language tree used by Gray and Atkinson (2003)] drawn from the archaeological record to calibrate the rate of linguistic evolution and date events in Semitic history. Furthermore, we use Bayes factor model tests to provide quantitative support for distinguishing between alternative hypotheses of Semitic language evolution and to confirm the information content of our dataset.

Semitic Origins

Our phylogenetic analysis of Semitic produced a language tree with dates that establish Akkadian as the deepest branch in the Semitic family tree and estimate the origin of Semitic at ~5900 ybp (Fig. 3-2).
This date places the Semitic origin a surprising ~1500 years before the first Akkadian inscriptions, which were written using Sumerian cuneiform script (Daniels 1997) and
appear in the archaeological record of northern Mesopotamia (Buccellati 1997). The city states of Sumer were established and flourishing in Mesopotamia with their own indigenous languages ~5900 ybp (Lloyd 1984), so it is unlikely that Akkadian was spoken in Sumer for the entirety of the 1500 year interval between its divergence from ancestral Semitic and its initial appearance in the archaeological record of Sumer. Furthermore, the closest relative of Akkadian and the only other member of East Semitic, Eblaite (unsampled in our study), was spoken in northwestern Syria, in a region adjacent to where some of the oldest West Semitic languages were spoken. The presence of ancient members of the two oldest Semitic groups (i.e., East and West Semitic) in the same area suggests their divergence from ancestral Semitic occurred there, in what is today Syria. This, combined with the long interval between the origin of East Semitic and the appearance of Akkadian in Sumer, suggests that East Semitic originated in present-day Syria and Akkadian later spread from Syria eastward into Mesopotamia and Sumer (see Fig. 3-1 for a map of Semitic dispersals).

Our Semitic tree indicates that within several centuries of the initial divergence of ancestral Semitic into East Semitic (represented in our study by Akkadian) and West Semitic branches, West Semitic in turn diverged ~5425 ybp (Fig. 3-2, node A) to form Central and South Semitic. The short interval between this divergence and the origin of Semitic in Syria (~500 years) suggests that this divergence occurred in the same region of the interior Levant, and that Central Semitic (found throughout the Levant) spread to the west while South Semitic spread into the southern Levant. Thus the early emergence ~4525 ybp (Fig. 3-2, node D) of a South Arabian lineage of South Semitic may reflect an Early Bronze Age expansion of Semites from the southern Levant southward through the Arabian Peninsula.
The Central Semitic sub-branch of West Semitic was characterized first by a divergence into northern (Ugaritic) and southern
(Arabic, Aramaic, and Hebrew) lineages ~4100 ybp (Fig. 3-2, node B), and later by the divergence of Arabic, Aramaic, and Hebrew from each other ~3200 ybp (Fig. 3-2). The expansion of Central Semitic ~4100 ybp was likely part of the migration process that defined the transition from the Early to the Middle Bronze Age in the Levant (Ehrich 1992; Ilan 2003; Richard 2003b). This period in the Levant involved the devolution of many urban societies at the tail end of the Early Bronze Age (Richard 2003b) and their replacement with new urban societies that were culturally and morphologically distinct at the start of the Middle Bronze Age (Ilan 2003). Our analysis suggests that the shift in urban populations from the Early to Middle Bronze Age is associated with the wider expansion of Semitic in the Levant.

The recurrent spread of early Semitic peoples and their languages, first South Semitic and later Arabic, into the marginal, desert lands of the Arabian Peninsula, combined with the Biblical testimony on early Hebrew subsistence, suggests that the earliest West Semitic society had a largely pastoralist economy particularly adapted to such conditions. Furthermore, this is consistent with the ancestral Semitic society occupying the easternmost Early Bronze Age urban developments in the Levant. This placement provides a plausible setting for the initial movement of Akkadian into Mesopotamia, as well as the subsequent expansion of South Semitic through the Arabian Peninsula and Central Semitic throughout the Levant. The expansion of Arabic into the Arabian Peninsula is the result of a more recent migration of Central Semitic southward from the Levant.

Recent Arabic Divergence

The Arabic languages, or dialects, represent the largest group of extant Central Semitic languages (Gordon 2005). The Arabic languages originated in north Arabia and expanded along with Islam in the 7th century to occupy a geographic range that extends from Morocco to Iran
(Kaye and Rosenhouse 1997). Our phylogenetic analysis indicated that the two studied Arabic languages (Moroccan and Ogaden) diverged ~540 ybp (Fig. 3-2, node C), or ~800 years after the expansion of Arab populations associated with Islam. We were able to employ a log BF test to assess the ability of our linguistic clock to reject unrealistic historical scenarios by comparing the effects of different constraints on the divergence times of these Arabic languages. Specifically, we compared our Standard model (Arabic divergence date estimate = ~540 ybp) with a model that constrained the divergence of Arabic to occur prior to the spread of Islam (i.e., >1400 ybp), and the test favored the model without these constraints. This test confirms the accuracy of our linguistic clock, while suggesting that in some regions, such as Morocco, Arabic languages became fully established as local indigenous tongues, replacing earlier indigenous languages (Berber in this case), only within the past millennium, long after the initial expansion of Islam.

South Semitic and the Origins of Ethiosemitic

We estimate that after the divergence of West Semitic into Central and South Semitic ~5425 ybp, South Semitic continued to expand southward until ~4525 ybp (Fig. 3-2, node D). It was at this time that South Semitic diverged into two lineages to the south of the Levant. One of these lineages was ancestral to the Modern South Arabian languages, and its speakers likely inhabited the southern coasts and coastal hinterlands of the peninsula. The other lineage leads to the language(s) spoken in and around highland Yemen in the later second and the first millennia B.C.E. The founding speakers of Ethiosemitic in the far northern Ethiopian highlands likely came from this Yemeni population. Our estimate of ~2975 ybp for this transfer of Semitic to Eritrea and Ethiopia (which became the Ethiosemitic languages) is contemporaneous with the adoption of Iron Age technologies in the Middle East
(Ehrich 1992; Moorey 1994) but pre-dates the rise of the Aksumite Kingdom in Eritrea and Ethiopia by at least ~600 years (Connah 2001). This migration to Africa most likely reflects an influx of Semitic-speaking migrants of unknown size concurrent with the rise of the first towns and cities in the northern edges of the Ethiopian Highlands during or before the middle of the first millennium B.C.E. (Ehret 1988) and suggests that the introduction of Semitic from Arabia was temporally correlated with the development of the first urbanized states in Eritrea and Ethiopia (Fattovich 1990). Intriguingly, the estimated date for the Ethiosemitic migration is broadly consistent with the Queen of Sheba myth of a 10th century B.C.E. connection of the Ethiopian Highlands and Yemen with the Levant. This myth claims that after visiting King Solomon in Israel, the Queen of Sheba returned home to bear a son who would later found a new state in Eritrea and Ethiopia culturally attached to Semitic populations in the Levant (specifically the Hebrews).

Genetic studies have shown that Ethiosemitic-speaking populations are genetically similar to Cushitic-speaking populations within Eritrea and Ethiopia (Lovell et al. 2005), indicating that the migration of Semitic to the Horn of Africa was accomplished with little gene flow from the Arabian Peninsula. This process of local adoption, in which Ethiosemitic was first introduced to and later adopted by individuals recruited from existing local populations, is also consistent with the word-borrowing evidence (Ehret 1988) that a relatively small number of people from Yemen, holding economically and politically strategic positions, introduced Semitic to populations in the Horn of Africa.

Conclusion

We used Bayesian phylogenetic methods to elucidate the relationships and divergence dates of Semitic languages, which we then related to the archaeological and epigraphic record to
produce a comprehensive hypothesis of Semitic origins and dispersals (Fig. 3-1). Our analysis is the first to use language sampling dates to calibrate the mean rate of linguistic evolution, including variation between lineages. This allowed us to provide dates for important events in Semitic history and place them in context. For example, we estimate that (i) Semitic had an Early Bronze Age origin (~5900 ybp) in the dry interior areas of the Levant, from which Akkadian subsequently expanded into Mesopotamia, (ii) Semitic then dispersed, earlier than previously thought, throughout the Levant as part of the Early to Middle Bronze Age transition in the Eastern Mediterranean, and (iii) Ethiosemitic was the result of a single, early Iron Age (~2975 ybp) migration of Semitic across the Red Sea, consistent with 10th century B.C.E. Queen of Sheba myths connecting Ethiopia to non-African Semitic populations. Furthermore, we made the first use of Bayes factors to statistically test competing language histories and confirm the robustness of our inferences about Semitic history. These inferences shed light upon the complex history of Semitic, answer key questions about Semitic origins and dispersals, and provide important hypotheses to test in future studies with new data.

Methods

Word Lists and Cognate Coding

Wordlists were modified from Swadesh's word list of most conserved words (Swadesh 1955), with the final lists containing 97 words for 25 extant and extinct Semitic languages (Supplementary Figure 3-S1). Wordlists for the Ethiosemitic languages (Amharic, Geto, Harari, Innemor, Mesmes, Mesqan, Soddo, Tigre, Tigrinya, and others) were drawn from The Languages of Ethiopia (Bender 1971). The lists for Moroccan Arabic, the South Arabian languages (Gibbali, Harsusi,
Mehri, Soqotri) and extinct non-African Semitic languages (Akkadian, Aramaic, ancient Hebrew, and Ugaritic) were constructed from previously published lexicons (Rabin 1975). Cognate classes were determined for each of the 97 words using a comparative method that emphasizes the similarity of consonant-consonant-consonant roots and known consonant shifts when comparing two words (see Supplemental Figure 3-S2 for a graphical depiction of the cognate coding and loanword treatment). Loanwords were identified using lexical information from distantly related but geographically close language families (such as Cushitic) as well as comparisons with lexicons of languages within the Semitic family. Loanwords were dealt with in two ways: (i) all loanwords were coded as missing data (i.e., ancestral word in the accepting language(s), and (ii) loanwords shared by descent (i.e., the loan event occurred in a common ancestor of multiple languages) were considered to be an extra meaning in which all languages sharing the loanword via descent are coded as cognate (e.g., while all others are coded as missing data (i.e., a marker of common descent in languages that diverged after the loan event, while no common ancestry for that meaning was supposed between the language(s) that provided the loanword and the language(s) that accepted the loanword. The coded cognate dataset, including the treated loanwords, is available as Supplemental Figure 3-S3.

Phylogenetic Analysis and Divergence Date Estimation

Phylogenetic trees were constructed under a Bayesian framework using BEAST v1.4.6 (Drummond and Rambaut 2007). An unordered model of cognate class evolution was used to allow transitions between any pair of cognate classes. Rate heterogeneity across meanings was modeled by a gamma distribution on meaning-specific rates. This model accommodates variations in the rate of change across meanings, such that conserved meanings (e.g., a single
cognate class for all languages) were assigned a slower rate than the mean, while highly variable meanings (e.g., few shared cognate classes between languages) were assigned a faster rate than the mean. Priors for the unordered rate matrix and gamma shape parameter were flat. Divergence times were estimated using an uncorrelated lognormal relaxed clock model that assumes a single underlying rate for the entire phylogeny but allows for variations in rates across the tree (Drummond et al. 2006). This relaxed clock model accommodated differences in the overall rate of cognate transition between languages by assigning rates drawn from a lognormal distribution to individual branches in the tree. We used heterogeneous tip date information (Rambaut 2000) and constraints on the root of Semitic to calibrate the relaxed clock. Specifically, we included the sampling dates of the five languages in our dataset that are no 2600 ybp, and Ugaritic = 3400 ybp; (Rabin 1975)] and used a flat prior on the root of Semitic set to 4,300 [the age of the first evidence of a Semitic language, Akkadian (Buccellati 1997)] to 8,000 ybp to calibrate the clock. (Although the earliest few documents in Akkadian date to ~4,300 ybp, the great bulk of Akkadian lexical data comes from late Assyrian materials, 2,900-2,700 ybp, hence the choice of 2,800 ybp for the sampling date of Akkadian.) A flat prior of 0.01 to 0.0001 cognate transitions per meaning per year (roughly, a 0.01% to 1% replacement rate per year) was placed on the mean of the lognormally distributed clock.

The robustness of our results was tested by applying log BF model tests to comparisons of alternative Semitic histories. In addition to the analysis with no topological constraints described above (the unconstrained model), we first constrained the tree topology to reflect the accepted major divisions of Semitic: East Semitic (Akkadian) vs. West Semitic, with West Semitic divided into Central (the Arabic languages, Aramaic, ancient Hebrew, and Ugaritic) and
South (the Ethiosemitic and South Arabian languages) Semitic clades (Faber 1997). We called this Semitic history the Standard model, as it represents a general consensus amongst Semitic linguists that the primary division of Semitic is between East and West clades. For our second constrained the divergence of Moroccan and Ogaden Arabic to occur more than 1400 ybp. This comparison was used to test the accuracy of our relaxed clock model, as the Old Arabic model forces the divergence of the two Arabic languages to have occurred unrealistically early, before the expansion of Arabic with Islam starting in the 7th century. Our final test compared the Standard constraining Ethiosemitic as the outgroup. The mean clock rate estimated under the Standard model (4.438 × 10⁻⁴ transitions per meaning per year) was used for all log BF tests, and Markov chains were run for 10⁸ generations. Marginal likelihoods for each model were estimated using the smoothed harmonic mean of the likelihood distribution (Newton et al. 1994; Redelings and Suchard 2005), and all log BF values were calculated by taking the difference in the log of the marginal likelihoods of each model (Kass and Raftery 1995), with log BF values reported in log base 10 units. A log BF > 0 would indicate a preference for model 1, whereas a log BF < 0 would indicate a preference for model 2.

BEAST uses a Markov chain Monte Carlo simulation technique to estimate the posterior distribution of parameters. All Markov chains were run for 20,000,000 generations with samples taken every 1,000 generations (for a total of 20,000 sampled states per run). Burn-in was 1,000,000 generations (1,000 states), and post-run analysis of parameter plots in Tracer v1.4 (Rambaut and Drummond 2007) suggested all chains had reached convergence by the end of the
burn-in period (e.g., all effective sample size values were > 500). The MCMC sampling and run conditions, and all prior distributions, were identical for all analyses unless otherwise stated.
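The log BF comparison described in these Methods can be illustrated numerically. The following is a minimal sketch only: it assumes the plain (unsmoothed) harmonic mean estimator rather than the smoothed variant used in our analyses, and the function names and toy log-likelihood samples are hypothetical and are not BEAST output.

```python
import math

def log_marginal_harmonic(log_likes):
    """Plain harmonic mean estimate of the log marginal likelihood (natural log).

    log p(D) is approximated by log(N) - logsumexp(-log L_i), evaluated stably.
    This is a simplification of the smoothed estimator cited in the text.
    """
    neg = [-x for x in log_likes]
    m = max(neg)
    lse = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(log_likes)) - lse

def log10_bayes_factor(log_likes_1, log_likes_2):
    """log10 BF of model 1 over model 2; a value > 0 favors model 1."""
    diff = log_marginal_harmonic(log_likes_1) - log_marginal_harmonic(log_likes_2)
    return diff / math.log(10)

# Hypothetical post-burn-in log-likelihood samples for two competing models.
model_1 = [-1000.0, -1001.0, -999.5]
model_2 = [-1010.0, -1012.0, -1009.0]
print(log10_bayes_factor(model_1, model_2))
```

Because the harmonic mean is dominated by the lowest sampled likelihoods, a sketch of this kind is only informative when both models are sampled under identical run conditions.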
Figure 3-1. Map of Semitic languages and inferred dispersals. The locations of all languages sampled in this study are depicted. The map also presents the dispersal of Semitic inferred from our study. The origin of Afroasiatic along the African coast of the Red Sea is indicated in red (Ehret 1995; Ehret, Keita, and Newman 2004), while Semitic migrations are depicted by green arrows. The Semitic dispersal follows a radial pattern expanding from a Semitic origin in eastern Syria. The current distributions of all Semitic languages in Eritrea and Ethiopia (Ethiosemitic) follow Bender (1971), and the remaining distributions follow Hetzron (1997). The ancient distributions of the extinct languages are indicated, i.e., Ugaritic (Bender 1971; Hetzron 1997). The West Gurage (Chaha, Geto, Innemor, Mesmes, and Mesqan) and East Gurage (Walani and Zway) Ethiosemitic language groups in central Ethiopia are listed as two combined groups, since the distribution of these languages was too circumscribed to list individually on the map.
Figure 3-2. Phylogenetic tree of Semitic languages. The language tree produced by the phylogenetic analysis of Semitic wordlists placed the origin of Semitic at ~5900 ybp (CI = 4300-7700 ybp). All branch lengths are in years, represented by median estimates from their posterior distributions, and node dates are relative to years before present (ybp). The scale bar along the bottom of the tree presents dates in ybp and relative to the common era (C.E. and B.C.E.). Extinct languages are underlined, and all other languages are considered to evolve to the present. Sub-groups of Semitic are identified by color (East Semitic = purple, Central Semitic = green, Modern South Arabian = red, and Ethiosemitic = blue), with bars to the right of the tree, and by two boxes (Central Semitic and South Semitic), while the geographic distribution (African vs. non-African) is indicated by a bar to the right of the tree. Important groups are indicated by letters A-I. These are: A, West Semitic; B, Central Semitic; C, Arabic; D, South Semitic; E, Modern South Arabian; F, Ethiosemitic; G, Outer Gurage; H, East Gurage; and I, Amharic-Argobba-Gafat. Posterior probabilities of internal branches are printed in italics above each branch, with median divergence dates printed to the right of each node. All branches with posterior support < 0.70 are collapsed and considered unresolved. The topology follows the constraints of the Standard model, which is preferred by our log BF analysis.
Figure 3-S1. Semitic wordlist data for 25 languages. The final Semitic wordlists contained 97 Soddo, Tigre, Tigrinya, Walani, and Zway; (Bender 1971)], Ogaden Arabic (Bender 1971), and 9 non-African Semitic languages [Akkadian, Moroccan Arabic, Aramaic, Gibbali, Harsusi, Hebrew, Mehri, Soqotri, and Ugaritic; (Rabin 1975)] were gathered from previously published wordlists. Gaps in the data ar
Figure 3-S2. Example of the cognate coding process. Cognate coding involved two steps: the identification of loanwords and the determination of cognate classes (depicted in panel A). Step 1: Loanwords shared by descent (identified by a red box placed around them) are placed in a new gloss category (e.g., DOGb) and coded as missing data (i.e., gloss (e.g., DOGa). Step 2: Cognate classes are then identified from the words in each gloss category based on similar morphology (e.g., roots) and given state codes (e.g., treatment on phylogenetic inference. When a language has missing data for a gloss category, the phylogenetic algorithm prunes the tree so that the position of that language is not considered when assessing the fit of the tree to the data for that particular gloss category. This pruning is depicted by dashed lines on the language trees of the DOGa (ancestral gloss As the DOGb language tree shows, the data in this gloss category serve to hold together the four languages sharing the loanword by descent, but do not influence the relationships between any of the other languages.
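The two-step loanword treatment summarized in this caption can be sketched in code. This is an illustrative sketch only, not the procedure actually used to build the dataset: the language names, state codes ("A", "B", "1"), the "?" missing-data symbol, and the `loan_event` labels are all hypothetical, and the sketch assumes that a shared `loan_event` label marks a loan inherited from a common ancestor.

```python
MISSING = "?"  # hypothetical missing-data symbol

def code_gloss(gloss, forms):
    """Code one meaning into an ancestral gloss plus loanword glosses.

    forms maps language -> (cognate_class, loan_event), where loan_event is
    None for inherited words and a label for borrowed words (assumed to
    identify a single loan event).
    """
    languages = sorted(forms)
    ancestral = {}
    loan_members = {}
    for lang in languages:
        cognate_class, loan_event = forms[lang]
        if loan_event is None:
            ancestral[lang] = cognate_class
        else:
            # Every loanword is missing data in the ancestral gloss.
            ancestral[lang] = MISSING
            loan_members.setdefault(loan_event, []).append(lang)
    coded = {gloss + "a": ancestral}
    # Loans shared by two or more languages get an extra gloss category in
    # which only the sharing languages carry a state; all others are missing.
    shared = [m for _, m in sorted(loan_members.items()) if len(m) > 1]
    for idx, members in enumerate(shared):
        name = gloss + chr(ord("b") + idx)
        coded[name] = {l: ("1" if l in members else MISSING) for l in languages}
    return coded

example = {
    "Lang1": ("A", None), "Lang2": ("A", None),
    "Lang3": ("B", "loan1"), "Lang4": ("B", "loan1"),
    "Lang5": ("C", "loan2"),
}
coded = code_gloss("DOG", example)
for name in sorted(coded):
    print(name, coded[name])
```

In this toy case the two languages sharing a loan are linked only through the extra gloss category, while the singly borrowed word contributes nothing beyond missing data in the ancestral gloss.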
Figure 3-S3. Cognate lists for 25 Semitic languages. The final Semitic cognate lists contained 126 words for 25 languages. The 126 words include the 97 meanings (see Figure 3-S1) and 29 loanwords (see Materials and Methods for our loanword identification and
CHAPTER 4
UTILITY OF DNA VIRUSES FOR STUDYING HUMAN HOST HISTORY: CASE STUDY OF JC VIRUS⁴

Introduction

Recent research has focused on the use of microbial pathogens and commensals as complements to traditional genetic markers to investigate the population histories and demographies of their hosts (Ashford 2000; Holmes 2004). Microbial pathogens and commensals generally have faster mutation rates (μ) and shorter generation times than their hosts, and thereby often produce significant population differentiation faster than that observed in any host genetic system. These attributes also offer greater resolution for the estimation of μ and therefore of other population genetic parameters such as effective population size (Ne) and coalescent time.

In turn, the utility of microbial pathogens and commensals to study host history and demography also depends on their mode of transmission [i.e., vertical, horizontal, or some mixture of both (Ashford 2000; Holmes 2004)]. Vertically transmitted pathogens and commensals are passed from parents to offspring within their host populations, thereby closely tying them to their host genealogies and generation times (with the latter often measured in years to decades). Thus, such microbes are expected to show older coalescent times and slower population dynamics that reflect the more ancient historical events within their hosts. In contrast, horizontally transmitted pathogens and commensals are not so constrained, as they can also be passed among unrelated individuals within their host populations by direct or indirect contact with infected non-relatives. Thus, these microbes are expected to exhibit more recent coalescent

⁴ Reproduced with permission from: Kitchen A, Miyamoto M, Mulligan C. 2008b. Utility of DNA viruses for studying human host history: Case study of JC virus. Mol Phylogenet Evol. 46:673-682.
times and faster population dynamics that represent changes in their infection rates due to younger historical events within their hosts.

In particular, viruses have shown great utility as markers that both corroborate and extend the population histories inferred from human DNA (Holmes 2004). Given their fast rates of μ = 0.01 to 3.4 × 10⁻³ substitutions per site per year (Jenkins et al. 2002), RNA viruses have proven their utility for studying recent human events due to societal and epidemiological changes as well as population genetic ones (e.g., migration). For example, Pybus et al. (2003) and Drummond et al. (2005) documented a link between virus population expansion and increased infection rates in the rapidly evolving hepatitis C virus (μ = 7.9 × 10⁻⁴) due to changing public health policies during the 1920s-1950s in Egypt. In contrast to RNA viruses, DNA viruses have presumably more variable μ, thereby making their applications to studies of human population events less clear but potentially broader. Slowly evolving DNA viruses, such as human papillomavirus (HPV-18, μ = 2 × 10⁻⁷), have been used as markers of ancient human population events, including the establishment of host phylogeographic substructure (Ong et al. 1993; Bernard 1994). Conversely, hepatitis B virus (HBV) is a fast-evolving virus (μ = 4.2 × 10⁻⁵) whose population dynamics have been shown to reflect recent human events such as the spread of HBV infections in Japan due to societal changes after World War II (Michitaka et al. 2006). Collectively, DNA viruses have the potential to track major events in human history ranging from recent societal and epidemiological changes to older phylogenetic co-divergences.

JC virus (JCV) is a double-stranded, circular DNA virus of humans that belongs to the polyomavirus family. Although ~70 to 90% of all adults are seropositive for JCV (Padgett and Walker 1973), this virus is not normally associated with any disease, except in
immunocompromised patients (Weber and Major 1997). The normal target organ for JCV is the kidneys, and the virus is thought to be passed among both relatives and non-relatives through their urine. chromosomes (Khalili et al. 2007). Evolutionary studies indicate that the JCV genome has not evolved under widespread positive selection (Pavesi 2005).

JCV has been widely assumed to be a slowly evolving virus, which has co-evolved for at least 100,000 years with its human host (Sugimoto et al. 1997). For these reasons, the virus has been used to infer ancient global (Sugimoto et al. 1997, 2002; Wooding 2001) and regional (Agostini et al. 1997; Ikegaya et al. 2005) human population history. This assumption of slow evolution and ancient co-divergence is largely based on medical, epidemiological, and comparative information for individual patients, ethnic groups, JCV strains, and other polyomaviruses (Agostini et al. 1997; Sugimoto et al. 1997). These results have proven most consistent with the hypothesis of an effectively vertically transmitted and slowly changing virus. However, other such studies have instead argued that horizontal transmission from extra-familial sources occurs in > 50% of infections [Kitamura et al. 1994; Kunitake et al. 1995; see also Chen et al. (2004) for evidence of a rapid polyomavirus rate]. Significant horizontal transmission and rapid change are more characteristic of a fast-evolving virus that could instead be tracking recent events in human history.

Recently, Shackelton et al. (2006) used a very different approach to estimate μ for JCV, a method that did not rely on the a priori assumption of an ancient basal co-divergence between virus and host. Instead, these authors used the different sampling dates for JCV samples to provide a new estimate of its μ, one which was independent of this assumption. Specifically, they used Bayesian, maximum likelihood (ML) and distance phylogenetic methods to show that
there was substantial geographic subdivision among global JCV populations. However, above the subtype level, they then statistically documented that the virus and human phylogenies showed no significant co-divergence, which thereby led to their rejection of the old external calibration for estimating the JCV rate. Rather, using the viral sampling dates, Shackelton et al. produced an independent estimate of μ for global JCV that was two orders of magnitude faster (μ = 1.7 × 10⁻⁵) than those based on the assumption of an old basal co-divergence between virus and host (μ = 1 to 4 × 10⁻⁷; Hatwell and Sharp 2000). Correspondingly, the historical population dynamics of global JCV were now scaled in centuries rather than in tens of thousands of years.

In this study, we evaluate the utility of DNA viruses, as represented by JCV, to investigate recent versus ancient events in the history and demography of their human host. Specifically, we assess the support for a fast versus slow μ for JCV, and we evaluate the ramifications of each when investigating human population dynamics. Our approach relies on a combination of phylogenetic and coalescent methods, which include the first use of Bayesian skyline plots to trace regional changes in Ne over time in both JCV and its human host. Coupled with different lines of evidence, comparisons of these skyline plots for JCV and humans indicate that the virus is evolving on a timescale similar to that for other fast-evolving DNA viruses and RNA viruses. Thus, as for these other fast-evolving viruses, we conclude that JCV can be used to track recent human events, including those that are due to societal and epidemiological changes.
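The effect of the competing rates on timescale can be made concrete with a short calculation. The sketch below assumes a strict molecular clock, under which the time implied by a fixed amount of sequence divergence is inversely proportional to μ; the function name is hypothetical, and the 100,000-year figure is the assumed co-divergence age discussed above.

```python
def rescale_time(years, rate_old, rate_new):
    """Time implied by the same divergence (rate_old * years) under rate_new.

    Assumes a strict clock, so elapsed time scales as 1 / rate.
    """
    return years * rate_old / rate_new

fast_rate = 1.7e-5           # tip-dated estimate (Shackelton et al. 2006)
slow_rates = (1e-7, 4e-7)    # co-divergence-based range (Hatwell and Sharp 2000)

for slow in slow_rates:
    ratio = fast_rate / slow
    years = rescale_time(100_000, slow, fast_rate)
    print(f"slow rate {slow:g}: fast/slow = {ratio:.1f}, "
          f"100,000 years rescales to about {years:.0f} years")
```

Under these assumptions, divergence that is read as 100,000 years at the slow rates corresponds to roughly 600 to 2,400 years at the fast rate, which is the sense in which the dynamics shift from tens of thousands of years to centuries.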
Materials and Methods

JC Virus and Human Mitochondrial DNA Sequences

A dataset of 407 genome sequences for JCV was collected from GenBank (Table 4-S1⁵). coding intergenic region, were removed from each genome prior to their multiple alignment with ClustalX (Thompson et al. 1994). The resultant 4,850 base pair alignment for the full coding genomes was edited by hand to minimize the number of unique gaps and to ensure the integrity of the reading frame (Figure 4-S1⁶).

A subset of 92 of the 231 dated sequences from Shackelton et al. (2006) was assembled for estimation of population genetic parameters, consisting of 11 African, 21 Native American, 25 European and 35 Japanese sequences. These four regions were chosen because each was well represented by dated JCV sequences that were sampled from the major lineages of the longer-term diversity in that region (Figure 4-1). Furthermore, each region was also well represented by complete coding mitochondrial DNA (mtDNA) genomes for its human host population (see below), as well as by associated archaeological, paleoclimatological, and paleogeographic information. The sampling dates for this subset of 92 dated JCV ranged from 1970 to 2003.

A total of 196 genome sequences for human mtDNA was collected from the online Human Mitochondrial Genome Database (mtDB: www.genpat.uu.se/mtDB/; Ingman and Gyllensten

⁵ Table 4-S1 (available online at the Molecular Phylogenetics and Evolution journal website): Directory for the 407 Figure 4-S1). The fourth, fifth, and sixth columns identify the subset of 337 JCV that were considered by Shackelton et al. (2006); the subset of 231 JCV for which known sampling dates were provided by this reference; and the subset of 92 JCV (in boldface) used in our internal and external rate estimations.

⁶ Figure 4-S1 (available online at the Molecular Phylogenetics and Evolution journal website): Multiple sequence alignment for the 407 JCV coding genomes. Each sequence is represented by its unique identifier that is keyed to its GenBank accession number in Table 4 coding IR region that was excluded from this study.
2006). As reviewed in Pakendorf and Stoneking (2005), the non-coding control region sequences of each genome were then removed to limit subsequent analyses to the more conserved full coding regions. These full coding genomes were then aligned as for JCV, resulting in a multiple sequence alignment of 15,465 base pairs (Figure 4-S2⁷). In all, 56 African, 20 Native American, 60 European and 60 Japanese genomes were collected (the 60 European and 60 Japanese genomes were randomly sampled from larger datasets available in mtDB).

JC Virus Phylogenetic Analysis

An ML phylogenetic analysis was performed on the 407 JCV sequence alignment to complement the distance, ML and Bayesian analyses of 337 sequences completed by Shackelton et al. (2006). Modeltest 3.7 (Posada and Crandall 1998) was used to determine that the GTR + Γ + I model of nucleotide substitution was most appropriate for this dataset (using the Akaike Information Criterion). The GTR + Γ + I model allows for unequal rates among the six pairs of reciprocal substitutions (e.g., the probability of change from A to C equals that from C to A, but can vary from that for A to G), while accounting for rate heterogeneity among sites by a gamma distribution and a proportion of invariable positions. This ML analysis was performed using PhyML (Guindon and Gascuel 2003) with branch swapping by tree bisection and reconnection. ML bootstrap analysis was based on 1000 pseudo-replicates to determine group support.

The available genome sequences for three other members of the polyomavirus family [simian virus 40 (SV40), BK virus (BKV), and simian agent 12 (SA12)] were initially

⁷ Figure 4-S2 (available online at the Molecular Phylogenetics and Evolution journal website): Multiple sequence structural RNA genes following Pakendorf and St alignment corresponds to site 577 of the Anderson Reference Sequence (ARS), which represents the first position of the tRNA-Phe gene. The final position of this alignment (15,465) corresponds to site 16,023 of the ARS, which represents the last position of tRNA conventions of Herrnstadt et al. (2002) for the European and American sequences in mtDB (Ingman and Gyllensten 2006), respectively. All other sequences are identified by their GenBank accession numbers.
considered, but later rejected, as outgroups to root the JCV phylogenies. Each of these three other polyomavirus genomes was aligned to the set of 407 JCV sequences and then subjected to a neighbor-joining bootstrap analysis (with 1000 pseudo-replicates) using ML distances under the same GTR + Γ + I model as accepted above (Saitou and Nei 1987). However, none of these three provided a well-supported root for JCV, as judged by their bootstrap scores of < 50% (results not shown). These failures can be attributed to the extensive sequence divergence between JCV and SV40, SA12 and BKV, which all have proportional sequence differences from the former of 20% to 30%. In addition to this concern with ambiguous rooting due to substitution saturation, the two related polyomaviruses, BKV and SA12, may also have undergone horizontal transfers between their human and baboon hosts, thereby calling into question their use as outgroups (Cantalupo et al. 2005). For these reasons, we instead followed convention and rooted the JCV tree at its midpoint, which thereby allowed for direct comparison of this ML phylogeny to those of Shackelton et al. (2006) and others.

Estimation of μ for JC Virus

Two Bayesian skyline analyses (Drummond et al. 2005) were performed on the dataset of 92 dated JCV sequences using the program BEAST v1.3 (http://evolve.zoo.ox.ac.uk) to estimate μ. The first analysis relied on the sampling dates for each dated JCV, whereas the second assumed that these viral sequences were all sampled contemporaneously. Following the terminology introduced by Rambaut (2000) in his ML study, these two approaches for estimating μ are herein referred to as the single rate dated tips (SRDT) and single rate (SR) molecular clock models. The SRDT analysis relied on an uninformative flat prior of μ = 0 to 100, whereas the SR approach used a strong uniform prior where the total tree depth was set to 90-110 thousand years
ago (kya). The latter prior was based on the assumption of a basal co-divergence between JCV and humans calibrated to 100 kya (see Figure 4-1). These two approaches are herein referred to e was fixed at 10. Thus, these two Bayesian analyses were analogous to the generalized ML skyline approach, where the number of stepwise changes in Ne can be less than their maximum (i.e., the number of sequences (n) minus one) (Strimmer and Pybus 2001).

All Bayesian skyline analyses were performed with the GTR + Γ + I substitution model and a relaxed molecular clock with an uncorrelated lognormal rate distribution (Drummond et al. 2006). Markov chains were run for 40,000,000 generations and sampled every 1,000 generations, with the first 4,000 samples discarded as burn-in. Unless otherwise specified (e.g., as above for μ), default priors were used for all parameters. The program Tracer v1.3 (http://evolve.zoo.ox.ac.uk) was used to visually inspect sampled posterior probabilities for Markov chain stationarity and to calculate summary statistics for the population genetic parameters. At least two independent runs were completed for each analysis to corroborate these final results.

JC Virus Population Dynamics

Bayesian skyline plots were generated for the four regional sets of dated JCV using the calculated mean internal and external rates to estimate historical changes in Ne. The SR rate model was used for the regional skyline plots using the external rate of 1.356 × 10⁻⁷ (Figure 4-2). The SRDT rate model was used to analyze the Native American, European and Japanese datasets with the internal rate of 3.642 × 10⁻⁵ (Figure 4-3). The original African dataset, with only 11 dated sequences, was supplemented with 55 undated African JCV. Skyline plots for Africa were then generated using the SR model with both the internal and external rates. In these various
analyses, the number of stepwise changes in Ne was set to their maximum of (n − 1), as done in the classic ML skyline approach (Pybus et al. 2000). All other conditions of these BEAST runs were as above.

Human mtDNA Population Dynamics

Bayesian skyline plots were generated for the mtDNA coding genomic sequences from the same four regions. The SR model was used on the assumption that the sampling interval of the mtDNA sequences was insignificant relative to μ. The Bayesian counterpart to the classic ML skyline model (see above) and the widely accepted μ of 1.7 × 10⁻⁸ for coding mtDNA (Ingman et al. 2000) were used, while all other conditions were as above.

Bayes Factor Model Comparison

Two Bayesian skyline analyses of the 92 dated JCV dataset were performed using the SRDT and SR models with the mean rate fixed at 3.642 × 10⁻⁵ and 10 piecewise estimates of Ne. The marginal likelihood under each model was approximated by the harmonic mean of the sampled likelihoods. Both analyses were performed using identical priors and protocols for their proposals, moves and sampling to minimize known limitations in using the harmonic mean to estimate marginal likelihoods (Lartillot and Philippe 2006). The log Bayes factor (BF) was calculated according to the equation: log BF = log [(marginal likelihood for SRDT model) / (marginal likelihood for SR model)] (Raftery 1996). In this way, we tested the significance of including the sampling dates for the JCV sequences in their SRDT analysis against the exclusion of this information in their SR treatment.
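The classic skyline estimate with its maximum of (n − 1) stepwise changes, referred to in these Methods, can be sketched as follows. This is a simplified illustration of the classic estimator for contemporaneously sampled sequences (the SR case), in which an interval of duration γ with k lineages yields the estimate γ · k(k − 1)/2; it is not the Bayesian skyline implementation in BEAST, and the function name and toy coalescent times are hypothetical.

```python
def classic_skyline(coalescent_times):
    """Classic skyline estimates from the n - 1 coalescent times of n tips.

    coalescent_times: ascending times before present of each coalescent
    event, for n contemporaneously sampled sequences.
    Returns one (start, end, estimate) tuple per inter-coalescent interval,
    where estimate = duration * k * (k - 1) / 2 for k lineages. With times in
    years, each estimate is in units of Ne x generation time.
    """
    k = len(coalescent_times) + 1  # number of lineages at the present
    previous = 0.0
    steps = []
    for t in coalescent_times:
        duration = t - previous
        steps.append((previous, t, duration * k * (k - 1) / 2))
        previous = t
        k -= 1
    return steps

# Hypothetical genealogy of n = 4 tips with coalescences at 10, 30 and 90 years.
for start, end, estimate in classic_skyline([10.0, 30.0, 90.0]):
    print(start, end, estimate)
```

For this toy genealogy the three intervals give the same estimate, as expected when interval durations shrink in proportion to 1 / [k(k − 1)/2] under a constant population size. Dividing such estimates by the generation time converts them to Ne.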
Results

ML Phylogeny for JC Virus

Our ML phylogeny of 407 JCV coding genomes (Figure 4-1) was consistent with the distance, ML and Bayesian phylogenies of Shackelton et al. (2006) for 337 JCV. All 21 previously identified JCV subtypes clustered together, and ML bootstrap analysis demonstrated substantial (83% to 100%) support for 18 of these 21 groups (Sugimoto et al. 2002). The JCV sequences within each subtype cluster corresponded to host populations from distinct geographic regions and historically related ethnicities. For example, subtypes 7B1 and 7B2, which were supported by bootstrap scores of 100%, included only East Asian sequences.

At the inter-subtype level, our ML phylogeny was congruent with the two of Shackelton et al. (2006) for the 13 highlighted subtypes that were used in their six jungle analyses of virus/host co-divergence (Figure 4-1). Correspondingly, our JCV phylogeny also showed no significant association above the subtype level with the three human host phylogenies as used in their jungle analyses. In short, there was substantial phylogeographic structuring of JCV diversity by subtypes in regional human populations, but no significant co-divergence at the higher levels between virus and host phylogenies.

JC Virus Rates and Skyline Plots

The mean estimate of μ for JCV, when calculated under the SR model that assumes a basal co-divergence between virus and human host of 100 kya (i.e., the external rate), was 1.356 × 10⁻⁷ with a 95% credible interval of 1.089-1.563 × 10⁻⁷. We used this external rate estimate to produce four regional skyline plots (Africa, Americas, Europe and Japan) that traced the changes in Ne for JCV over time on a scale of thousands to tens of thousands of years (Figure 4-2). We defined a significant change in population size as the occurrence of non-overlapping 95% credible intervals at the beginning and end of an increase or decrease. Using this criterion, we
interpreted the ~40-fold and ~25-fold increases of JCV population size beginning ~10-12 kya in Africa and Europe, respectively, to be significant, while we considered the largest increases seen in the Americas and Japan to be suggestive but not significant.

The mean estimate of μ for JCV, when calculated using virus sampling dates under the SRDT model (i.e., the internal rate), was 3.642 × 10⁻⁵ with a 95% credible interval of 1.227 to 6.149 × 10⁻⁵. This internal rate estimate, which is close to that of Shackelton et al. (2006) using a different set of dated JCV sequences, was thus more than 100-fold faster than the external rate estimate. Skyline plots generated with this internal rate had timescales on the order of decades and centuries (Figure 4-3). As μ is a scalar of Ne and time (Pybus et al. 2000), the relative magnitudes and statistical significance of the changes in these regional plots were therefore similar to those calculated with the external rate, i.e., only Africa and Europe showed significant increases in population size.

Human mtDNA Skyline Plots

The human skyline analyses, estimated with the widely accepted μ of 1.7 × 10⁻⁸ for coding mtDNA, produced skyline plots that were scaled in thousands to tens of thousands of years, similar to the JCV external rate plots (Figure 4-4). Using our definition of significance, we interpreted the ~30-fold population increase starting at ~14 kya in Europe and the ~65-fold increase ~55 to 25 kya in Japan to be significant, while we considered the largest changes in Africa and the Americas and the recent increase in Japan to be suggestive of population growth.

Log Bayes Factor Model Comparison

Using the full set of dated sequences, the marginal log likelihood of the SRDT model with the internal rate was −12817.73, whereas the marginal log likelihood of the SR model with the internal rate was −12829.22. This produced a log BF of 11.49 for the comparison of the two models.

Discussion

Bayesian Skyline Plots, Human mtDNA, and Historical Demography

Bayesian skyline plots provide a dynamic representation of changes in Ne over time. As such, they provide a moving picture of population change and capture far more information than summary statistics or simpler coalescent models that offer only snapshots of Ne. Our Bayesian skyline plots of mtDNA sequences represent the first application of this powerful method to the extensive comparative database of human mtDNA (Figure 4-4). Our plots reveal coalescences that range from 50 kya in Europe to 140 kya in Africa and, thus, capture the complete time period during which anatomically modern humans lived in these regions. Interestingly, the significant 30-fold increase noted in the European population is consistent with paleoclimatological and archaeological evidence for the retreat of the European ice sheet 13-15 kya and the subsequent spread of agriculture in Neolithic Europe (Gamble et al. 2004; Pinhasi et al. 2005). The significant 65-fold increase between 55 and 25 kya in the Japan skyline plot is consistent with a mid-Pleistocene population expansion in Asia and reflects the ancestral diversity sampled by the relatively large group of immigrants from East Asia that founded the are also associated with major shifts in climate, subsistence practices, or migrations, which are events thought to have the most extreme effects on regional human demography (Figure 4-4).

The mtDNA skyline plots also allow for absolute estimates of the effective number of females (Nef) over time, which are of great interest as alternatives to point estimates of Nef. The values from the skyline plots can be converted to Nef estimates by dividing by the widely accepted value of 25 years per generation for humans (Fenner 2004). In this way, we calculate
the founder N_ef of the Americas to be ~2,000 females, which is consistent with much of the literature (Bonatto and Salzano 1997b) and, therefore, disagrees with the extremely low estimate of ~70 males + females by Hey (2005). Furthermore, we calculate the median estimate of African N_ef at the coalescent as ~1,300, which rises to ~6,000 by ~75 kya and then increases even further during the Pleistocene expansion ~50 kya. These skyline estimates of African N_ef add substantial detail to previously published estimates that indicate an N_ef of ~5,000 averaged over the ~150 kya since the origin of modern humans (Jorde et al. 1998).

JC Virus Rates and Historical Demography

The slow external JCV rate estimate produces skyline plots that are scaled in thousands to tens of thousands of years and are thus comparable to the skyline plots of its human host. We find that there are apparent associations between the JCV skyline plots using the external rate and the major human expansions detailed in the mtDNA plots (Figures 4-2 and 4-4). In particular, the external rate plots show increases in JCV population sizes for the five expansion events described in the mtDNA skyline plots (Figure 4-4). In contrast, JCV dynamics estimated under the fast internal rate occur on the order of decades to centuries (Figure 4-3), which is a product of the internal rate being more than 100 times faster than the external rate. This extreme difference in timescale precludes a comparison of JCV population changes estimated with the fast rate to ancient population dynamics of its human host. The implication of accepting a fast internal rate is that JCV is not a marker of ancient human population dynamics. Rather, the geographic patterning of JCV diversity may reflect recent human events that are behavioral, sociological, or technological in nature.

Current Support for the Fast Internal Rate of JC Virus

Which JCV evolutionary rate, fast internal or slow external, is correct?
In support of the slow rate, there is the apparent similarity between the skyline plots using the external JCV rate
and those for human mtDNA (Figures 4-2 and 4-4). While this convergence supports the regional co-divergence of JCV and its human host, it is the only current result supportive of the slow external rate estimate. In contrast, there are three lines of current evidence that favor a substantially faster rate of JCV evolution.

First, our phylogenetic analysis provides no significant overall support for ancient co-divergence between JCV and humans, thereby contradicting a parallel slow rate in the virus as well as in its host. Specifically, our ML analysis of a JCV dataset that is 22% larger than that of Shackelton et al. (2006) produced the same inter-subtype topology as used in their six jungle tests (Figure 4-1), thus corroborating the lack of significant support for a long-term association between virus and host and a slow rate in the former (see these authors for details, particularly about the grouping of subtypes 2A1 and 2A2 from East Asia and the New World versus the non-basal position of African JCV). Correspondingly, this phylogeny opposes the unique hypothesis of Pavesi (2003) for an ancient African origin of JCV and thereby a similar slow rate for both virus and humans. In contrast to his study of only the slowly evolving and invariant sites for 18 JCV genomes and the highly divergent SV40 as their outgroup, the current phylogenetic conclusions are based on six different jungle analyses of 13 JCV versus human nodes; from ~200-400 viral genomes; and from all available coding positions (including the rapidly evolving ones) with their rate heterogeneity accommodated by a gamma distribution and proportion of invariable sites.

Second, our log BF test now provides a new line of evidence in support of a faster rate for JCV. In the statistics literature, log BF values greater than 5 are generally taken as support for one model over another (Kass and Raftery 1995).
In our log BF test, an indecisive result would indicate that the known sampling dates for the 92 JCV offer no more information
for their skyline analyses than does a comparable set of undated sequences. That is, if JCV is slowly evolving, the yearly differences in sampling dates would have little impact on the study of its population dynamics, which would have occurred on a timescale of thousands to tens of thousands of years. However, our log BF value of 11.49 indicates instead that these sampling dates for JCV do indeed provide critical information for its SRDT rate estimation and skyline analyses. The reason is that JCV is rapidly evolving on a timescale of decades to centuries, where yearly differences in its sampling dates can have a large effect.

Third, acceptance of the fast internal rate of μ = 3.642 x 10^-5 for JCV leads to a novel explanation for virus/host interactions, one that is based on recent societal and epidemiological changes in humans ~50 years ago. The fast internal rate for JCV means that its population dynamics are occurring on a timescale that is obviously too recent to track ancient human evolution. Instead, this fast internal rate is more consistent with the phylogeography of JCV reflecting recent changes in its infection rates due to societal and epidemiological shifts in human behavior or technology. In short, if the fast internal rate for JCV is accurate, we would expect to find associations between its skyline plots and known recent events in modern human history, as were initially noted for its slow external versus host mtDNA graphs (see above). Thus, the third line of current support for the fast internal rate comes from the associations of the major recent expansions for JCV in Africa, Europe, and Japan (but not in the Americas) with known postwar societal changes at the end of World War II in 1945 (Michitaka et al. 2006). The significant 40- and 25-fold increases of the JCV populations in Africa and Europe, respectively, start ~50 years ago (Figure 4-3).
The JCV skyline plot for Japan also shows a suggestive increase that begins at about the same time. In contrast, the largest increase in the Americas begins only ~15 years ago. Taken together these results point to the fact that postwar
societal changes in the human host of JCV were much more extensive in regions near the centers of fighting during World War II than elsewhere (i.e., Europe and Japan versus the Americas) (Weinberg 1995). In turn, while major fighting and destruction did not occur across Africa, postwar political, economic, and technological changes, such as the replacement of colonial rule Saharan Africa having the largest population growth and urbanization rates in the world over the last ~50 years (United Nations 2003, 2004). These obvious ties (or lack thereof) to known postwar changes in its human host offer a new third line of corroboration in support of the fast internal rate for JCV. In the process, they also reinforce the potential utility of JCV to address other, less well documented events in human host history (see below).

In light of these three lines of current support, we accept the fast internal rate for JCV and its associated skyline plots as more accurate reflections of its population dynamics. Correspondingly, we conclude that the apparent similarities between the JCV external rate and human mtDNA skyline plots are coincidental.

Utility of JC Virus and Other Fast Evolving DNA Viruses for Studying the Human Host

Multiple tests are needed to document the fast versus slow evolutionary rates of DNA viruses. Based on such tests, this study shows that the evolutionary rate and population dynamics of JCV are most similar to those of other fast evolving DNA viruses (e.g., Michitaka et al. 2006). Thus, like them, the historical population dynamics of JCV are to a large extent the consequence of rapid horizontal transmissions due to recent societal and epidemiological changes in its human host. Conversely, they are not primarily the result of slow vertical transmissions across human generations, which span ~25 years.
Slowly evolving DNA viruses that exhibit vertical transmission patterns (e.g., HPV) are more appropriate for the study of older evolutionary events within their hosts (Bernard 1994; Holmes 2004).
In light of this need for multiple testing, further follow-up studies are now encouraged to assess critically our arguments for a fast JCV rate. As used in this study, BEAST v1.3 does not account for migration. Thus, the upcoming availability of a newer version that implements a structured coalescent model is particularly welcomed, since it will allow for Bayesian rate and skyline plot estimates for subdivided JCV populations that have experienced/are experiencing migration as well as growth (http://evolve.zoo.ox.ac.uk/beast/manual.html). Along these lines, the SR and SRDT models with their relaxed, uncorrelated, log-normal molecular clocks were chosen for this study on the basis of their nested relationship that allows for straightforward comparisons and their documented successes in the treatment of both real and simulated datasets (Rambaut 2000; Drummond et al. 2005). The coefficients of variation for these two models, given their estimated mean rates of 1.356 x 10^-7 and 3.642 x 10^-5, were 0.127 and 0.146, respectively. An obvious next step relative to the present study is to extend its Bayesian comparisons of these relaxed molecular clock models to include a non-clock model where rates are entirely free to vary from branch to branch (i.e., the unrooted model of phylogeny). Such follow-up tests are crucial to improving our understanding of the history, demography, and epidemiology of JCV both regionally and worldwide (Shackelton et al. 2006).

In conclusion, fast evolving DNA viruses like JCV can complement RNA viruses to study human events that have occurred too recently to be detected by any host genetic system currently in use (Holmes 2004). Such studies of recent human history can include the tracking of political and economic changes, new vaccination programs, and dispersal events, to name just a few of the societal, epidemiological, and population genetic areas that can be addressed with fast evolving viruses such as JCV.
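The SRDT versus SR comparison discussed above reduces to a difference of two marginal log likelihoods. The following is a minimal sketch of that arithmetic, not the analysis pipeline used in this study. It assumes that the two reported marginal log likelihoods are negative (as log likelihoods of sequence data normally are), which is the only reading under which the dated-tip SRDT model is the favored one; the function and variable names are illustrative only.

```python
def log_bayes_factor(log_ml_a: float, log_ml_b: float) -> float:
    """Return the log Bayes factor of model A over model B.

    Both arguments are marginal log likelihoods, so the log BF is
    simply their difference.
    """
    return log_ml_a - log_ml_b


# Assumption: the values reported in the Results are negative log likelihoods
# whose minus signs are restored here.
LOG_ML_SRDT = -12817.73  # dated tips (SRDT), internal rate
LOG_ML_SR = -12829.22    # undated tips (SR), internal rate

# Rule-of-thumb threshold quoted in the Discussion (Kass and Raftery 1995).
LOG_BF_THRESHOLD = 5.0

log_bf = log_bayes_factor(LOG_ML_SRDT, LOG_ML_SR)
srdt_favoured = log_bf > LOG_BF_THRESHOLD

print(f"log BF (SRDT vs SR) = {log_bf:.2f}; SRDT favoured: {srdt_favoured}")
```

The same function could be reused for the proposed comparison with a non-clock model once its marginal log likelihood is available.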
Figure 4-1. Optimal ML phylogeny for 407 JCV coding genomes. Major JCV groups are classified according to their recognized subtypes (1A to 8B; Sugimoto et al. 1997) and the ethnic origins of their human hosts are specified as well. Bootstrap proportions are given for those subtypes and other higher order clusters with scores >50%. Boxed labels identify the 92 JCV from the four regional groups studied here (Africa, Europe, Japan, and the Americas). The bold branches and nodes trace the relationships of the 13 highlighted subtypes used in the jungle analyses by Shackelton et al. (2006). The arrow points to the node that has been widely used to reinterpret the midpoint root of other JCV phylogenies as an unresolved trichotomy of its populations from Europe, Africa, and Africa plus other regions. Correspondingly, this basal trichotomy has been widely related to the initial split within the human phylogeny and has thereby been dated at ~100 kya (Sugimoto et al. 1997; Hatwell and Sharp 2000).
Figure 4-2. Bayesian skyline plots for the four regional groups of JCV generated with the slow external rate of 1.356 x 10^-7. The number of sequences for each region is noted in parentheses. The x axis is time as measured in years before present and the y axis is the scaled population size (which is the product of N_e multiplied by generation time). Each curve is a plot of the median scaled population size with its 95% credible interval indicated by the X regional sample, with its 95% credible interval given in brackets. Note the different scales for both axes across plots. A to D highlight five increases in median scaled population size for JCV that correspond to major events and episodes of growth in human history (see Figure 4-4).
Figure 4-3. Bayesian skyline plots for the four regional groups of JCV generated with the fast internal rate of 3.642 x 10^-5. The x and y axes are in the same units as in Figure 4-2, although both are two to three orders of magnitude smaller. This difference in the dimensions of both axes reflects the fact that the genetic diversity of each population is determined by its product of N_e times μ (Tajima 1983). Correspondingly, given the recent timescales of these plots, their x axes are best interpreted as years before the date of the most recently sampled JCV sequence(s) for that region (e.g., 2003 for Japan; see Table S1). Otherwise, these plots follow the conventions of Figure 4-2.
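The scaling relationship noted in this caption can be illustrated with a short sketch. Because diversity depends on the product of N_e and the rate, holding diversity fixed while increasing the rate shrinks both axes by the same factor. The code below is a first-order illustration under that assumption only; it is not how the plots were produced (each set was estimated separately in BEAST), and the axis values and function name are hypothetical.

```python
EXTERNAL_RATE = 1.356e-7  # slow external rate used for Figure 4-2
INTERNAL_RATE = 3.642e-5  # fast internal rate used for Figure 4-3


def rescale_axis(values, old_rate, new_rate):
    """Rescale times (or scaled population sizes) for a change of rate.

    Assumes genetic diversity is held fixed, so each value shrinks or
    grows by the factor old_rate / new_rate.
    """
    if old_rate <= 0 or new_rate <= 0:
        raise ValueError("rates must be positive")
    factor = old_rate / new_rate
    return [v * factor for v in values]


# How many times faster the internal rate is than the external rate.
fold_difference = INTERNAL_RATE / EXTERNAL_RATE

# Hypothetical x-axis values (years before present) under the external rate.
external_times = [10_000.0, 50_000.0]
internal_times = rescale_axis(external_times, EXTERNAL_RATE, INTERNAL_RATE)

print(f"fold difference in rate: {fold_difference:.1f}")
print("rescaled times (years):", [round(t, 1) for t in internal_times])
```

Under these assumptions, an event placed at 10,000 years with the external rate falls within the last few decades with the internal rate, which matches the two to three orders of magnitude described in the caption.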
Figure 4-4. Bayesian skyline plots for the four regional groups of humans as estimated with complete mtDNA coding genomes. In these plots, the scaled population size is the product of N_ef multiplied by generation time. Otherwise, these plots follow the conventions of Figure 4-2. A to D highlight five episodes of population size change that correspond to known major events in human history. These events include: (A) increasing aridity in Africa coinciding with a reduction in observed archaeological sites (Mitchell 2002; Kuper and Kropelin 2006); (B) the peopling of the Americas after migration across the Bering landbridge (Greenberg et al. 1986); (C1) the retreat of the glaciers in Europe following the last glacial maximum (Gamble et al. 2004); (C2) the rise and spread of agriculture in Neolithic Europe (Pinhasi et al. 2005); and (D) the introduction of rice agriculture and subsequent migration of people from the Korean peninsula to Japan 1995).
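Because the y axis of these plots is N_ef multiplied by generation time, an N_ef estimate follows from dividing a plotted value by the generation time, as described in the Discussion. The sketch below shows only that conversion; the 25-year generation time is the value cited in the text (Fenner 2004), while the input value and the function name are hypothetical and chosen purely for illustration.

```python
GENERATION_TIME_YEARS = 25.0  # value cited in the text (Fenner 2004)


def effective_females(scaled_size: float,
                      generation_time: float = GENERATION_TIME_YEARS) -> float:
    """Convert a skyline y-axis value (N_ef x generation time) to N_ef."""
    if generation_time <= 0:
        raise ValueError("generation time must be positive")
    return scaled_size / generation_time


# Hypothetical y-axis reading, not a value taken from the plots.
example_scaled_size = 50_000.0
example_n_ef = effective_females(example_scaled_size)

print(f"N_ef = {example_n_ef:.0f} females")
```

A different assumed generation time changes the estimate proportionally, which is worth keeping in mind when comparing N_ef values across studies.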
CHAPTER 5
CONCLUSION

The demographic history of human populations has been one of shifting population sizes and complex migrations since the expansion of modern humans out of Africa at least ~50,000 years ago (e.g., Tamm et al. 2007; Atkinson, Gray, and Drummond 2008; Kayser et al. 2008). Changes in human demography have left patterns in human genetic diversity, via the evolutionary processes of genetic drift and gene flow, which can be used to infer the timing and magnitude of events in the population history of humans. However, natural selection and mutation have also shaped variation in the human genome (e.g., Reich et al. 2002; Smith, Webster, and Ellegren 2002; Webster, Smith, and Ellegren 2002; Voight et al. 2006; Nielsen et al. 2007; Sabeti et al. 2007) and limit our ability to infer past demographic events from the genetic diversity of modern human populations. While the effects of natural selection can be minimized by analyzing neutrally evolving DNA sequences, the relatively slow mutation rate of human DNA [from ~10^-7 substitutions per site for mitochondrial DNA (Hasegawa et al. 1993; Ingman et al. 2000; Howell et al. 2003) to ~10^-9 for nuclear DNA (Kaessmann et al. 1999)] makes it extremely difficult to infer events in human demographic history that occurred more recently than thousands of years ago from human DNA alone and also increases the variance of estimates of demographic parameters. Because human DNA alone is useful mainly for investigating ancient population events, complementary approaches that use alternative sources of data are needed to investigate more recent events or to increase the precision of demographic parameter estimates.

My dissertation demonstrates how genetic anthropology may combine complementary forms of data evolving at different rates with existing lines of evidence to investigate human demographic events from decades and centuries to tens of thousands of years in the past. I believe that genetic
anthropologists, who must consider cultural and historical as well as biological processes in human evolution, are uniquely positioned to use complementary forms of data and multiple lines of evidence from diverse disciplines. This interdisciplinary perspective can, at best, instill in genetic anthropologists an appreciation of alternative forms of evidence and an inherent disregard for disciplinary boundaries that aid in the pursuit of understanding human history. My study demonstrates that such an approach can provide significant new insights, such as a surprisingly fast evolutionary rate for a double-stranded DNA virus or a long occupation of Beringia by the Amerind, and new strategies for addressing human demographic history.

Each of the three projects included in my dissertation presented approaches to investigating human demography at different timeframes. First, with regard to the colonization of the New World, I was able to use the mitochondrial genetic diversity of Amerind populations, along with archaeological, climatological, and paleoecological data, to propose a three-stage model for the peopling of the Americas. In this model, the proto-Amerind population first diverged from the Asian gene pool ~43,000 years ago, then underwent a long period (~20,000 years) of stable population size, and ultimately expanded into the Americas ~16,000 years ago. Second, I showed that including independent evidence in an analysis of molecular data dramatically improves inferences about the Amerind expansion into the Americas. By including archaeological and historical evidence in a reanalysis of a dataset previously used to produce an unrealistic model for the migration, I produced estimates of demographic parameters similar to those made from mitochondrial data and consistent with the archaeological record.
Specifically, Hey (2005) estimated that ~70 people colonized the entire New World ~7,000 years ago, whereas I used archaeological evidence and contemporary migration rates to estimate a more plausible migration of ~1,000 Amerind ~15,000 years ago from the same dataset as Hey. Third, I
was able to estimate substantially narrower ranges for the time of the Amerind expansion (~14,000 to 16,000 years ago) into the Americas as well as the effective population size of the Amerind founder population (~1,000 to ~5,000 individuals). These dates and population sizes are consistent with the physical (i.e., archaeological, climatological, and paleoecological) and molecular (maternal, paternal, and autosomal markers) evidence. This study was able to produce a model for the peopling of the Americas that reconciles the existing molecular, archaeological, and climatological evidence, dramatically narrows estimates of important demographic parameters, and is amenable to the inclusion of new data and future hypothesis testing.

In the second project, I investigated the evolution of the Semitic language family to make inferences about the history of human populations in the Middle East and Horn of Africa over the most recent centuries and millennia. I used phylogenetic techniques borrowed from evolutionary biology to estimate a language tree with divergence dates for the Semitic languages, which I then correlated with the archaeological evidence of Semitic populations. This analysis indicated an Early Bronze Age origin of Semitic ~5,900 years ago between Mesopotamia and the Levant, while rejecting an African origin for the Semitic family. It also supported a single Iron Age migration of Semitic to Ethiopia ~3,000 years ago from across the Red Sea, which is consistent with cultural myths linking Ethiopia to Semitic populations in the Levant. These results provide a statistically robust model of Semitic population history that is consistent with linguistic and archaeological data and will provide a hypothesis that will be tested in the future by the inclusion of genetic data.

In the final project, I studied the recent demographic history of JC virus and correlated it to events in human history within the very recent past.
First, I estimated a mutation rate for JC virus that was much faster than previously assumed. This rate was ~300 times faster than
previous estimates, which indicates that JC virus is evolving on the order of decades and centuries and is a poor marker of ancient human history. Second, I used the demographic history of JC virus to infer events in recent human history. By applying the newly estimated fast rate to significant events in human history such as increased urbanization rates and population growth following World War II. Finally, I demonstrated that DNA viruses, even double-stranded DNA viruses, can be used as fast evolving markers of recent human history. The fast rate estimated for JC virus was robustly supported by multiple tests, and the inferences about human history made from JC virus demography are consistent with the known historical record. This invites the possibility that fast evolving DNA viruses can complement slow evolving human DNA to investigate recent human history.

This study has broader implications for genetic anthropology as a whole. First, I implement a strategy for investigating human history across multiple timeframes using data evolving at vastly different rates. This increases the temporal reach of genetic anthropologists, who have traditionally addressed only ancient events that occurred tens to hundreds of thousands of years ago, to include events in the very recent past. This provides unique opportunities for combining historical and archaeological records with genetic or linguistic data into comprehensive analyses of recent population events. Indeed, this strategy is already generating much excitement, and at least one multi-million dollar Mellon grant has been awarded (to UCLA) to facilitate collaborations between geneticists, historians, and linguists to study historical events as recent as the Middle Ages. Second, I demonstrate the importance of including lines of evidence from other disciplines in the analysis of molecular data. Genetic analyses must account for and often benefit from the inclusion of existing non-genetic evidence;
for example, the inclusion of archaeological evidence for the occupation of North America by 14,000 years ago provided a more realistic estimate of the founder population size in my analysis of the peopling of the Americas. Incorporating prior information guides genetic analyses to consider only realistic scenarios and often increases the precision of inferences made from such analyses. Lastly, I show how data that are coevolving with human populations can be used to infer events in human history when the analysis of human DNA alone is not sufficient to do so. Languages and human viruses, though not vertically transmitted like human DNA and studied by fundamentally different disciplines, are intrinsically linked to human populations and can be exploited to investigate aspects of human demography. In sum, these insights demonstrate that genetic anthropologists who are interested in human demography and human history would benefit from an approach that includes biological, cultural, and historical perspectives.

The pursuit of a truly multidisciplinary approach to genetic anthropology fits within a larger perspective that not only encompasses all of biological anthropology, but extends throughout anthropology as a whole. This perspective advocates not only the use of alternative forms of data, but also emphasizes the adoption of epistemological frames and analytical methods across discip disciplinary lines (archaeological, biological, cultural, and linguistic) of anthropology to perform question-based research about human history. I believe the time is ripe for research projects with this outlook to flourish. Previously, such integrative research has been accomplished in stops and starts, yielding interesting results but lacking sustained output, possibly due to historical reasons. For example, a first attempt to unify archaeological, biological, and linguistic evidence for the peopling of the New World
(Greenberg, Turner, and Zegura 1986) resulted in much further research on the subject, but did not produce a sustained interdisciplinary approach to studying the Amerind colonization of the Americas, possibly because the attempt occurred before its time. In a sign that the outlook for such research is improving, my multidisciplinary analysis of the peopling of the New World has been met with much interest, as evidenced by the numerous general science articles, requests to use my map, and citation in a Science review article (Goebel, Waters, and O'Rourke 2008). Another initiative toward an integrative approach is the analysis of individual samples from multiple perspectives, such as the generation of archaeological, stable isotope (also good for tracking human migrations), and molecular genetic data from the same set of samples.

Technological change has also increased the viability of such research, especially with regard to genetics. The ever-decreasing cost of obtaining genetic data, in conjunction with the increasing amount of such data available in public databases, has begun to shift the focus of genetic anthropologists away from simple data collection and toward the development or application of innovative analytical methods. This increased emphasis on method application and development directly led to the incorporation of non-genetic data into genetic analyses as it became clear that some questions could not be answered with sufficient accuracy or precision by human genetic data alone, regardless of sample size. For example, Wang et al. (2007) analyzed a massive microsatellite dataset (>800 chromosomes typed for 678 loci) to investigate the peopling of the Americas from the genetic structure of Amerind populations.
Though they found intriguing patterns of population structure, they were unable to obtain estimates for important demographic parameters (such as the time and size of the migration) that I was able to obtain from the analysis of smaller datasets in my study of the same event. A side effect of the interest in new data has been the emergence of research using human pathogens as markers of human population history.
This research is maturing quickly as large datasets have led to novel insights about human disease dynamics and corrected false assumptions about the utility of specific pathogens as markers (e.g., my research demonstrating JC virus is a marker of recent instead of ancient human history).

The most radical direction of such research is the investigation and identification of patterns that extend from human biological evolution to changes in human culture. This endeavor has an inglorious history rooted in social Darwinism and the misapplication of natural selection to social and cultural outcomes. However, current examples of this research have been more circumspect and careful in their use of evolutionary theory to answer questions about human cultural and linguistic history. This research has had the most impact on the study of language history (e.g., my phylogenetic study of Semitic languages), but has recently spread to include the languages or cultural outcomes under different scenarios. Two examples of this are the melding of game theory and population genetics to understand the evolution of human social behavior (see Boyd 2006 for a review of recent work) and the use of selection and drift to investigate links between linguistic change and human demography (see Nettle 2007 for a discussion of new studies). While such research will certainly experience growing pains characterized by the over-interpretation of cultural change from an evolutionary perspective, it will at the very least provoke discussion about such broad approaches to studying human history. broad view of what constitutes appropriate data and the application of analytical techniques. The integration of multiple sub-disciplines of anthropology, as well as data from other disciplines
(e.g., virology or climatology), into a single, broad-based analysis requires a willingness to look for general patterns and focus on question-based research rather than expertise- or discipline-based research. Research along these lines holds great promise to produce novel insights about human history as well as define the extent to which data and analytical techniques can be combined into a single anthropological research program.
109 LIST OF REFERENCES A gostini H, Yanagihara R, Davis V, Ryschkewitsch C, Stoner G. 1997 Asian genotypes of JC virus in Native Americans and in a Pacific island population: markers of viral evolution and human migration. Proc Natl Acad Sci US A 94: 14542 14546. Anderson S, Banki er A, Barrell B, de Bruijn M, Coulson A, et al. (14 co authors). 1981. Sequence and organization of the human mitochondrial genome. Nature. 290:457 465. Ashford R 2000 Parasites as indicators of human bi ology and evolution. J Med Microbiol 49 : 771 772. A thanasiadis G Esteban E, Via M Dugoujon J Moschonas N Chaabani H Moral P. 2007. The X chromosome Alu insertions as a tool for human popu lation genetics: data from European and African human groups. Eur J Hum Genet 15:578 583. Atkinson Q, Gray R 2005 Curious parallels and curious con nections -phylogenetic thinking in biology and historical linguistics. Syst Biol 54 :513 526. Atkinson Q, Gray R, Drummond A. 2008. mtDNA variation predicts population size in humans and reveals a major Southern Asian cha pter in human prehistory. Mol Biol Evol. 25:468 474. Atkinson Q, Meade A, Venditti C, Greenhill S, Pagel M. 2008. Languages evolve in punctuational bursts. Science. 319:588. Barbujani G, Bertorelle G Chikhi L 1998. Evidence for paleolithic and neolithic gene flow in Europe. Am J Hum Genet. 62 :488 491. Beerli P. 2006. Comparison of Bayes ian and maximum likelihood inference of population genetic parameters. Bioinformatics 22: 341 345. Bender ML. 1971. Languages of Ethiopia New Lexicostatistic Classification and Some Problems of Diffusion. Anthropol Linguist. 13:165 288. Bernard H 1994 Coevolution of papillomaviruses with human populations. Trends Microbiol 2: 140 143. Blau J. 1978. Hebrew and Northwest Semitic: reflections on the classification of the Semi tic languages. Hebrew Annual Review. 2:21 44. Blust R. 2000. Why lexicostatistics doesn't work: the 'universal constant' hypothesis and the Austronesian languages. 
In: Renfrew C, McMahon A, Trask L, editors. Time depth in historical linguistics. Cambridge: The McDonald Institute for Archaeological Research. p. 311 331.
Bonatto S, Salzano F. 1997a. A single and early migration for the peopling of the Americas supported by mitochondrial DNA sequence data. Proc Natl Acad Sci USA. 94:1866-1871.
Bonatto S, Salzano F. 1997b. Diversity and age of four major mtDNA haplogroups, and their implications for the peopling of the New World. Am J Hum Genet. 61:1413-1423.
Boyd R. 2006. Evolution. The puzzle of human sociality. Science. 314:1555-1556.
Buccellati G. 1997. Akkadian. In: Hetzron R, editor. The Semitic Languages. London: Routledge. p. 69-99.
Campbell L. 2000. Time perspective in linguistics. In: Renfrew C, editor. Time depth in historical linguistics. Cambridge: The McDonald Institute for Archaeological Research. p. 3-31.
Cann R, Stoneking M, Wilson A. 1987. Mitochondrial DNA and Human Evolution. Nature. 325:31-36.
Cantalupo P, Doering A, Sullivan C, Pal A, Peden K, Lewis A, Pipas J. 2005. Complete nucleotide sequence of polyomavirus SA12. J Virol. 79:13094-13104.
Capelli C, Redhead N, Romano V, et al. (18 co-authors). 2006. Population structure in the Mediterranean basin: A Y chromosome perspective. Ann Hum Genet. 70:207-225.
Cavalli-Sforza L, Edwards A. 1964. Analysis of human evolution. Proceedings of the 11th International Congress of Genetics. 2:923-933.
Cavalli-Sforza LL, Menozzi P, Piazza A. 1994. The history and geography of human genes. Princeton, New Jersey: Princeton University Press.
Chen Y, Sharp P, Fowkes M, Kocher O, Joseph J, Koralnik I. 2004. Analysis of 15 novel full-length BK virus sequences from three individuals: evidence of a high intra-strain genetic diversity. J Gen Virol. 85:2651-2663.
Chikhi L, Destro-Bisol G, Bertorelle G, Pascali V, Barbujani G. 1998. Clines of nuclear DNA markers suggest a largely neolithic ancestry of the European gene pool. Proc Natl Acad Sci USA. 95:9053-9058.
Connah G. 2001. African Civilizations: An Archaeological Perspective. Cambridge: Cambridge University Press.
G, Yoshizaki M, Kudo T. 1995. Late Jomon cultigens in northeastern Japan. Antiquity. 69:146-152.
Daniels PT. 1997. Scripts of Semitic languages. In: Hetzron R, editor. The Semitic Languages. London: Routledge. p. 16-45.
Derenko M, Malyarchuk B, Grzybowski T, et al. (12 co-authors). 2007. Phylogeographic analysis of mitochondrial DNA in northern Asian Populations. Am J Hum Genet. 81:1025-1041.
Diem W. 1980. Die genealogische Stellung des Arabischen in den semitischen Sprachen: ein eingelostes Problem der Semitistik. In: Diem W, Wild S, editors. Studien aus Arabistik und Semitistik, A. Spitaler zum 70. Wiesbaden: Harrassowitz. p. 65-85.
Dillehay TD, editor. 1997. The archaeological context and interpretation. Washington, DC: Smithsonian Institution Press.
Drummond A, Ho S, Phillips M, Rambaut A. 2006. Relaxed phylogenetics and dating with confidence. PLoS Biology. 4:699-710.
Drummond A, Rambaut A, Shapiro B, Pybus O. 2005. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol Biol Evol. 22:1185-1192.
Drummond A, Rambaut A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 7:214.
Duffy S, Shackelton L, Holmes E. 2008. Rates of evolutionary change in viruses: patterns and determinants. Nat Rev Genet. 9:267-276.
Ehret C. 1995. Reconstructing proto-Afroasiatic (proto-Afrasian): Vowels, tone, consonants, and vocabulary. Berkeley: University of California Press.
Ehret C. 1988. Social transformations in the early history of the Horn of Africa: linguistic clues to developments of the period 500 BC to AD 500. In: Bayene T, editor. Proceedings of the Eighth International Conference of Ethiopian Studies. Addis Ababa: Institute of Ethiopian Studies. p. 639-651.
Ehret C. 2000. Testing the expectations of glottochronology against the correlations of language and archaeology in Africa. In: Renfrew C, McMahon A, Trask L, editors. Time depth in historical linguistics. Cambridge: The McDonald Institute for Archaeological Research. p. 373-399.
Ehret C, Keita S, Newman P. 2004. The origins of Afroasiatic. Science. 306:1680.
Ehrich RH, editor. 1992. Chronologies in Old World archaeology. Chicago: University of Chicago Press.
Elias S, Short S, Nelson C, Birks H. 1996. Life and times of the Bering land bridge. Nature. 382:60-63.
Faber A. 1980. Genetic subgroupings of the Semitic languages. Austin, Texas: Linguistics Department, University of Texas at Austin.
Faber A. 1997. Genetic subgrouping of the Semitic languages. In: Hetzron R, editor. The Semitic Languages. London: Routledge. p. 3-15.
Fagundes N, Ray N, Beaumont M, Neuenschwander S, Salzano F, Bonatto S, Excoffier L. 2007. Statistical evaluation of alternative models of human evolution. Proc Natl Acad Sci USA. 104:17614-17619.
Falush D, Wirth T, Linz B, et al. (18 co-authors). 2003. Traces of human migrations in Helicobacter pylori populations. Science. 299:1582-1585.
Fattovich R. 1990. Remarks on the pre-Aksumite period in northern Ethiopia. Journal of Ethiopian Studies. 23:1-33.
Fenner J. 2004. Cross-cultural estimation of the human generation interval for use in genetics-based population divergence studies. Am J Phys Anthropol. 128:415-423.
Fisher RA. 1930. Genetical theory of natural selection. Oxford: Clarendon Press.
Fix A. 2005. Rapid deployment of the five founding Amerind mtDNA haplogroups via coastal and riverine colonization. Am J Phys Anthropol. 128:430-436.
Forster P, Harding R, Torroni A, Bandelt H. 1996. Origin and evolution of Native American mtDNA variation: a reappraisal. Am J Hum Genet. 59:935-945.
Fu Y. 2001. Estimating mutation rate and generation time from longitudinal samples of DNA sequences. Mol Biol Evol. 18:620-626.
Gamble C, Davies W, Pettitt P, Richards M. 2004. Climate change and evolving human diversity in Europe during the last glacial. Phil Trans R Soc Lond B. 359:243-254.
Garrigan D, Kingan S, Pilkington M, et al. (12 co-authors). 2007. Inferring human population sizes, divergence times and rates of gene flow from mitochondrial, X and Y chromosome resequencing data. Genetics. 177:2195-2207.
Gilbert M, Rambaut A, Wlasiuk G, Spira T, Pitchenik A, Worobey M. 2007. The emergence of HIV/AIDS in the Americas and beyond. Proc Natl Acad Sci USA. 104:18566-18570.
Goebel T. 2007. The missing years for modern humans. Science. 315:194-196.
Goebel T, Waters M, O'Rourke D. 2008. The late Pleistocene dispersal of modern humans in the Americas. Science. 319:1497-1502.
Gordon CH. 1997. Amorite and Eblaite. In: Hetzron R, editor. The Semitic Languages. London: Routledge. p. 100-113.
Gordon RG, editor. 2005. Ethnologue: languages of the world. Dallas: SIL International.
Gray R, Atkinson Q. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature. 426:435-439.
Gray R, Jordan F. 2000. Language trees support the express-train sequence of Austronesian expansion. Nature. 405:1052-1055.
Greenberg J, Turner C, Zegura S. 1986. The settlement of the Americas: a comparison of the linguistic, dental and genetic evidence. Curr Anthropol. 77:477-497.
Guindon S, Gascuel O. 2003. Simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52:696-704.
Guthrie RD. 1990. Frozen fauna of the mammoth steppe. Chicago: University of Chicago Press.
Hamilton M, Buchanan B. 2007. Spatial gradients in Clovis-age radiocarbon dates across North America suggest rapid colonization from the north. Proc Natl Acad Sci USA. 104:15625-15630.
Hammer M. 1994. A recent insertion of an alu element on the Y chromosome is a useful marker for human population studies. Mol Biol Evol. 11:749-761.
Hammer M, Redd A, Wood E, et al. (12 co-authors). 2000. Jewish and Middle Eastern non-Jewish populations share a common pool of Y chromosome biallelic haplotypes. Proc Natl Acad Sci USA. 97:6769-6774.
Hammer M, Spurdle A, Karafet T, et al. (12 co-authors). 1997. The geographic distribution of human Y chromosome variation. Genetics. 145:787-805.
Handt O, Meyer S, von Haeseler A. 1998. Compilation of human mtDNA control region sequences. Nucleic Acids Res. 26:126-129.
Handt O, Richards M, Trommsdorff M, et al. (13 co-authors). 1994. Molecular Genetic Analyses of the Tyrolean Ice Man. Science. 264:1775-1778.
Hasegawa M, Dirienzo A, Kocher T, Wilson A. 1993. Toward a More Accurate Time Scale for the Human Mitochondrial DNA Tree. J Mol Evol. 37:347-354.
Hatwell J, Sharp P. 2000. Evolution of human polyomavirus JC. J Gen Virol. 81:1191-1200.
Hayward RJ. 2000. Afroasiatic. In: Heine B, Nurse D, editors. African languages. Cambridge: Cambridge University Press. p. 74-98.
Herrnstadt C, Elson J, Fahy E, et al. (11 co-authors). 2002. Reduced median network analysis of complete mitochondrial DNA coding region sequences for the major African, Asian, and European haplogroups. Am J Hum Genet. 70:1152-1171.
Hetzron R. 1976. Two principles of genetic reconstruction. Lingua. 38:89-104.
Hetzron R, editor. 1997. The Semitic languages. London: Routledge.
Hey J. 2005. On the number of New World founders: a population genetic portrait of the peopling of the Americas. PLoS Biology. 3:e193.
Hoffecker J, Elias S. 2003. Environment and archaeology in Beringia. Evol Anthropol. 12:34-49.
Hoffecker J, Powers W, Goebel T. 1993. The colonization of Beringia and the peopling of the New World. Science. 259:46-53.
Holden C. 2002. Bantu language trees reflect the spread of farming across sub-Saharan Africa: a maximum parsimony analysis. P Roy Soc B Biol Sci. 269:793-799.
Holmes E. 2004. The phylogeography of human viruses. Mol Ecol. 13:745-756.
Hopkins DM. 1982. Aspects of the paleogeography of Beringia during the Late Pleistocene. In: Hopkins DM, Matthews JV, Schweger CE, Young SB, editors. Paleoecology of Beringia. New York: Academic Press. p. 3-28.
Howell N, Smejkal C, Mackey D, Chinnery P, Turnbull D, Herrnstadt C. 2003. The pedigree rate of sequence divergence in the human mitochondrial genome: There is a difference between phylogenetic and pedigree rates. Am J Hum Genet. 72:659-670.
Huehnergard J. 1992. Languages of the ancient Near East. In: The Anchor Bible Dictionary. p. 155-170.
Huehnergard J. 1990. Remarks on the classification of the Northwest Semitic Languages. In: Hoftizjer J, editor. Deir 'Alla Symposium. Leiden: Brill. p. 282-293.
Ikegaya H, Zheng H, Saukko P, et al. (12 co-authors). 2005. Genetic diversity of JC virus in the Saami and the Finns: implications for their population history. Am J Phys Anthropol. 128:185-193.
Ilan D. 2003. The Middle Bronze Age (circa 2000-1500 B.C.E.). In: Richard S, editor. Near Eastern archaeology: a reader. Winona Lake, Indiana: Eisenbrauns. p. 331-342.
Ingman M, Gyllensten U. 2006. Human Mitochondrial Genome Database, a resource for population genetics and medical sciences. Nucleic Acids Res. 34:D749-D751.
Ingman M, Kaessmann H, Paabo S, Gyllensten U. 2000. Mitochondrial genome variation and the origin of modern humans. Nature. 408:708-713.
Jenkins G, Rambaut A, Pybus O, Holmes E. 2002. Rates of molecular evolution in RNA viruses: a quantitative phylogenetic analysis. J Mol Evol. 54:156-165.
Jobling MA, Hurles ME, Tyler-Smith C. 2004. Into new found lands. In: Human Evolutionary Genetics. New York: Garland Science. p. 339-372.
Jorde L, Bamshad M, Rogers A. 1998. Using mitochondrial DNA and nuclear DNA markers to reconstruct human evolution. Bioessays. 20:126-136.
Joyce D. 2006. Chronology and new research on the Schaefer mammoth (Mammuthus primigenius) site, Kenosha County, Wisconsin, USA. Quat Intl. 142:44-57.
Kaessmann H, Heissig F, von Haeseler A, Paabo S. 1999. DNA sequence variation in a non-coding region of low recombination on the human X chromosome. Nat Genet. 22:78-81.
Kass R, Raftery A. 1995. Bayes factors. J Am Stat Assoc. 90:773-795.
Kaye AS, Rosenhouse J. 1997. Arabic dialects and Maltese. In: Hetzron R, editor. The Semitic Languages. London: Routledge. p. 263-311.
Kayser M, Choi Y, van Oven M, Mona S, Brauer S, Trent R, Suarkia D, Schiefenhovel W, Stoneking M. 2008. The impact of the Austronesian expansion: evidence from mtDNA and Y chromosome diversity in the Admiralty Islands of Melanesia. Mol Biol Evol. 25:1362-1374.
Khalili K, White M, Lublin F, Ferrante P, Berger J. 2007. Reactivation of JC virus and development of PML in patients with multiple sclerosis. Neurology. 68:985-990.
Kimura M. 1968. Evolutionary rate at the molecular level. Nature. 217:614-626.
Kitamura T, Kunitake T, Guo J, Tominaga T, Kawabe K, Yogo Y. 1994. Transmission of the human polyomavirus JC virus occurs with the family and outside the family. J Clin Microbiol. 32:2359-2363.
Kitchen A, Miyamoto M, Mulligan C. 2008a. Three-stage colonization model for the peopling of the Americas. PLoS ONE. 3:e1596.
Kitchen A, Miyamoto M, Mulligan C. 2008b. Utility of DNA viruses for studying human host history: Case study of JC virus. Mol Phylogenet Evol. 46:673-682.
Kivisild T, Shen P, Wall D, et al. (17 co-authors). 2006. The role of selection in the evolution of human mitochondrial genomes. Genetics. 172:373-387.
Knight A, Batzer M, Stoneking M, Tiwari H, Scheer W, Herrera R, Deininger P. 1996. DNA sequences of Alu elements indicate a recent replacement of the human autosomal genetic complement. Proc Natl Acad Sci USA. 93:4360-4364.
Kogan LE, Korotayev AV. 1997. Sayhadic (Epigraphic South Arabian). In: Hetzron R, editor. Semitic Languages. London: Routledge. p. 220-241.
Kolman C, Bermingham E, Cooke R, Ward R, Arias T, Guionneau-Sinclear F. 1995. Reduced mtDNA diversity in the Ngobe Amerinds of Panama. Genetics. 140:273-283.
Kolman C, Sambuughin N, Bermingham E. 1996. Mitochondrial DNA analysis of mongolian populations and implications for the origin of New World founders. Genetics. 142:1321-1334.
Kunitake T, Kitamura T, Guo J, Taguchi F, Kawabe K, Yogo Y. 1995. Parent-to-child transmission is relatively common in the spread of human polyomavirus JC virus. J Clin Microbiol. 33:1448-1451.
Kuper R, Kropelin S. 2006. Climate-controlled Holocene occupation in the Sahara: motor of e 313:803-807.
Kuzmin Y, Keates S. 2005. Dates are not just data: paleolithic settlement patterns in Siberia derived from radiocarbon records. Am Antiquity. 70:773-789.
Landsteiner K. 1901. Uber Agglutinationserscheinungen normalen menschlichen. Weiner Klin. Wochenschr. 14
Lartillot N, Philippe H. 2006. Computing Bayes factors using thermodynamic integration. Syst Biol. 55:195-207.
Lloyd S. 1984. The archaeology of Mesopotamia. New York: Thames and Hudson.
Lovell A, Moreau C, Yotova V, Xiao F, Bourgeois S, Gehl D, Bertranpetit J, Schurr E, Labuda D. 2005. Ethiopia: between Sub-Saharan Africa and Western Eurasia. Ann Hum Genet. 69:275-287.
Macaulay V, Hill C, Achilli A, et al. (21 co-authors). 2005. Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes. Science. 308:1034-1036.
Malhi R, Eshleman J, Greenberg J, Weiss D, Schultz Shook B, Kaestle F, Lorenz J, Kemp B, Johnson J, Smith D. 2002. The structure of diversity within new world mitochondrial DNA haplogroups: Implications for the prehistory of North America. Am J Hum Genet. 70:905-919.
Mandryk C, Josenhans H, Fedje D, Mathewes R. 2001. Late Quaternary paleoenvironments of Northwestern North America: implications for inland versus coastal migration routes. Quat Sci Rev. 20:301-314.
Merriwether D, Rothhamer F, Ferrell R. 1995. Distribution of the 4 founding lineage haplotypes in Native Americans suggests a single wave of migration for the New World. Am J Phys Anthropol. 98:411-430.
Michitaka K, Tanaka Y, Horiike N, Dhoung T, Chen Y, Matsuura K, Hiasa Y, Mizokami M, Onji M. 2006. Tracing the history of hepatitis B virus genotype D in western Japan. J Med Virol. 78:44-52.
Mishmar D, Ruiz-Pesini E, Golik P, et al. (13 co-authors). 2003. Natural selection shaped regional mtDNA variation in humans. Proc Natl Acad Sci USA. 100:171-176.
Mitchell P. 2002. The Archaeology of Southern Africa. Cambridge: Cambridge University Press.
Moorey PRS. 1994. Ancient Mesopotamian materials and industries: the archaeological evidence. Oxford: Clarendon Press.
Mulligan C, Hunley K, Cole S, Long J. 2004. Population genetics, history, and health patterns in native Americans. Annu Rev Genom Hum Genet. 5:295-315.
Mulligan C, Kitchen A, Miyamoto M. 2006. Comment on "Population size does not influence mitochondrial genetic diversity in animals". Science. 314:1390.
Nardo D. 2007. Ancient Mesopotamia. Detroit: Greenhaven Press.
Nebel A, Filon D, Brinkmann B, Majumder P, Faerman M, Oppenheim A. 2001. The Y chromosome pool of Jews as part of the genetic landscape of the Middle East. Am J Hum Genet. 69:1095-1112.
Nebel A, Landau-Tasseron E, Filon D, Oppenheim A, Faerman M. 2002. Genetic evidence for the expansion of Arabian tribes into the Southern Levant and North Africa. Am J Hum Genet. 70:1594-1596.
Nettle D. 1999. Linguistic diversity of the Americas can be reconciled with a recent colonization. Proc Natl Acad Sci USA. 96:3325-3329.
Nettle D. 2007. Language and genes: a new perspective on the origins of human cultural diversity. Proc Natl Acad Sci USA. 104:10755-10756.
Newton M, Raftery A, Davison A, et al. (33 co-authors). 1994. Approximate Bayesian Inference with the Weighted Likelihood Bootstrap. J Roy Stat Soc B Met. 56:3-48.
Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark A. 2007. Recent and ongoing selection in the human genome. Nat Rev Genet. 8:857-868.
Non A, Kitchen A, Mulligan C. 2007. Identification of the most informative regions of the mitochondrial genome for phylogenetic and coalescent analyses. Mol Phylogenet Evol. 44:1164-1171.
O'Rourke D, Hayes M, Carlyle S. 2000. Spatial and temporal stability of mtDNA haplogroup frequencies in native North America. Hum Biol. 72:15-34.
Ohta T. 1992. The Nearly Neutral Theory of Molecular Evolution. Annu Rev Ecol Syst. 23:263-286.
Ong C, Chan S, Campo M, et al. (11 co-authors). 1993. Evolution of human papillomavirus type 18: an ancient phylogenetic root in Africa and intratype diversity reflect coevolution with human ethnic groups. J Virol. 67:6424-6431.
Overstreet DF. 2005. Late-glacial ice-marginal adaptation in southeastern Wisconsin. In: Bonnichsen R, Lepper BT, Stanford D, Waters MR, editors. Paleoamerican origins: beyond Clovis. College Station, TX: Center for the Study of the First Americans. p. 183-195.
Paabo S. 1985. Molecular Cloning of Ancient Egyptian Mummy DNA. Nature. 314:644-645.
Padgett B, Walker D. 1973. Prevalence of antibodies in human sera against JC virus, an isolate from a case of progressive multifocal leukoencephalopathy. J Infect Dis. 127:467-470.
Pagel M, Atkinson Q, Meade A. 2007. Frequency of word use predicts rates of lexical evolution throughout Indo-European history. Nature. 449:717-720.
Pakendorf B, Stoneking M. 2005. Mitochondrial DNA and human evolution. Annu Rev Genomics Hum Genet. 6:165-183.
Passarino G, Semino O, Quintana-Murci L, Excoffier L, Hammer M, Santachiara-Benerecetti A. 1998. Different genetic components in the Ethiopian population, identified by mtDNA and Y chromosome polymorphisms. Am J Hum Genet. 62:420-434.
Pavesi A. 2003. African origin of polyomavirus JC and implications for prehistoric human migrations. J Mol Evol. 56:564-572.
Pavesi A. 2005. Utility of JC polyomavirus in tracing the pattern of human migrations dating to prehistoric times. J Gen Virol. 86:1315-1326.
Pauling L, Itano H, Singer S, Wells I. 1949. Sickle cell anemia, a molecular disease. Science. 110:543-548.
Payseur B, Cutter A, Nachman M. 2002. Searching for evidence of positive selection in the human genome using patterns of microsatellite variability. Mol Biol Evol. 19:1143-1153.
Pinhasi R, Fort J, Ammerman A. 2005. Tracing the origin and spread of agriculture in Europe. PLoS Biology. 12:2220-2228.
Pitulko V, Nikolsky P, Girya E, Basilyan A, Tumskoy V, Koulakov S, Astakhov S, Pavlova E, Anisimov M. 2004. The Yana RHS site: humans in the Arctic before the last glacial maximum. Science. 303:52-56.
Polanski A, Kimmel M, Chakraborty R. 1998. Application of a time-dependent coalescence process for inferring the history of population size changes from DNA sequence data. Proc Natl Acad Sci USA. 95:5456-5461.
Posada D, Crandall K. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics. 14:817-818.
Pybus O, Drummond A, Nakano T, Robertson B, Rambaut A. 2003. The epidemiology and iatrogenic transmission of hepatitis C virus in Egypt: a Bayesian coalescent approach. Mol Biol Evol. 20:381-387.
Pybus O, Rambaut A, Harvey P. 2000. An integrated framework for the inference of viral population history from reconstructed genealogies. Genetics. 155:1429-1437.
Quintana-Murci L, Krausz C, Zerjal T, et al. (13 co-authors). 2001. Y chromosome lineages trace diffusion of people and languages in southwestern Asia. Am J Hum Genet. 68:537-542.
Quintana-Murci L, Semino O, Bandelt H, Passarino G, McElreavey K, Santachiara-Benerecetti A. 1999. Genetic evidence of an early exit of Homo sapiens sapiens from Africa through eastern Africa. Nat Genet. 23:437-441.
Rabin C. 1975. Lexicostatistics and the internal divisions of Semitic. In: Bynon T, Bynon J, editors. Hamito-Semitica. The Hague: Mouton. p. 85-102.
Raftery AE. 1996. Hypothesis testing and model selection. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov chain Monte Carlo in practice. London: Chapman & Hall. p. 163-188.
Rambaut A. 2000. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics. 16:395-399.
Rambaut A, Drummond A. 2007. Tracer v1.4.
Redelings B, Suchard M. 2005. Joint Bayesian estimation of alignment and phylogeny. Syst Biol. 54:401-418.
Reich D, Schaffner S, Daly M, McVean G, Mullikin J, Higgins J, Richter D, Lander E, Altshuler D. 2002. Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet. 32:135-142.
Rendsburg GA. 2003. Semitic languages (with special reference to the Levant). In: Richard S, editor. Near Eastern archaeology: a reader. Winona Lake, Indiana: Eisenbrauns. p. 71-73.
Renfrew C, McMahon A, Trask L, editors. 2000. Time depth in historical linguistics. Cambridge: The McDonald Institute for Archaeological Research.
Richard S, editor. 2003a. Near Eastern Archaeology: a reader. Winona Lake, Indiana: Eisenbrauns.
Richard S. 2003b. The Early Bronze Age in the southern Levant. In: Richard S, editor. Near Eastern archaeology: a reader. Winona Lake, Indiana: Eisenbrauns. p. 286-302.
Richards M, Macaulay V, Bandelt H, Sykes B. 1998. Phylogeography of mitochondrial DNA in western Europe. Ann Hum Genet. 62:241-260.
Rodgers J. 1992. The subgrouping of the South Semitic languages. In: Kaye AS, editor. Semitic studies in honor of Wolf Leslau. Wiesbaden: Harrassowitz. p. 1323-1336.
Rosenberg N, Mahajan S, Ramachandran S, Zhao C, Pritchard J, Feldman M. 2005. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet. 1:e70.
Rosenberg N, Pritchard J, Weber J, Cann H, Kidd K, Zhivotovsky L, Feldman M. 2002. Genetic structure of human populations. Science. 298:2381-2385.
Rosenberg N, Woolf E, Pritchard J, Schaap T, Gefel D, Shpirer I, Lavi U, Bonne-Tamir B, Hillel J, Feldman M. 2001. Distinctive genetic signatures in the Libyan Jews. Proc Natl Acad Sci USA. 98:858-863.
Sabeti P, Varilly P, Fry B, et al. (263 co-authors). 2007. Genome-wide detection and characterization of positive selection in human populations. Nature. 449:913-918.
Saitou N, Nei M. 1987. The neighbor-joining method for reconstructing evolutionary trees. Mol Biol Evol. 4:406-425.
Santos F, Pandya A, Tyler-Smith C, Pena S, Schanfield M, Leonard W, Osipova L, Crawford M, Mitchell R. 1999. The central Siberian origin for Native American Y chromosomes. Am J Hum Genet. 64:619-628.
Schurr T, Sherry S. 2004. Mitochondrial DNA and Y chromosome diversity and the peopling of the Americas: evolutionary and demographic evidence. Am J Hum Biol. 16:420-439.
Schurr T. 2004. The peopling of the New World: perspectives from molecular anthropology. Annu Rev Anthropol. 33:551-583.
Shackelton L, Rambaut A, Pybus O, Holmes E. 2006. JC virus evolution and its association with human populations. J Virol. 80:9928-9933.
Shen P, Wang F, Underhill P, et al. (13 co-authors). 2000. Population genetic implications from sequence variation in four Y chromosome genes. Proc Natl Acad Sci USA. 97:7354-7359.
Sherry S, Rogers A, Harpending H, Soodyall H, Jenkins T, Stoneking M. 1994. Mismatch distributions of mtDNA reveal recent human population expansions. Hum Biol. 66:761-775.
Shields G, Schmiechen A, Frazier B, Redd A, Voevoda M, Reed J, Ward R. 1993. mtDNA sequences suggest a recent evolutionary divergence for Beringian and northern North American populations. Am J Hum Genet. 53:549-562.
Silva W, Bonatto S, Holanda A, et al. (14 co-authors). 2002. Mitochondrial genome diversity of Native Americans supports a single early entry of founder populations into America. Am J Hum Genet. 71:187-192.
Smith N, Webster M, Ellegren H. 2002. Deterministic mutation rate variation in the human genome. Genome Res. 12:1350-1356.
Stone A, Stoneking M. 1998. mtDNA analysis of a prehistoric Oneota population: implications for the peopling of the New World. Am J Hum Genet. 62:1153-1170.
Strimmer K, Pybus O. 2001. Exploring the demographic history of DNA sequences using the generalized skyline plot. Mol Biol Evol. 18:2298-2305.
Suchard M, Weiss R, Sinsheimer J. 2001. Bayesian selection of continuous-time Markov chain evolutionary models. Mol Biol Evol. 18:1001-1013.
Sugimoto C, Hasegawa M, Kato A, Zheng H, Ebihara H, Taguchi F, Kitamura T, Yogo Y. 2002. Evolution of human polyomavirus JC: implications for the population history of humans. J Mol Evol. 54:285-297.
Sugimoto C, Hasegawa M, Zheng H, et al. (17 co-authors). 2002. JC virus strains indigenous to northeastern Siberians and Canadian Inuits are unique but evolutionally related to those distributed throughout Europe and Mediterranean areas. J Mol Evol. 55:322-335.
Sugimoto C, Kitamura T, Guo J, et al. (19 co-authors). 1997. Typing of urinary JC virus DNA offers a novel means of tracing human migrations. Proc Natl Acad Sci USA. 94:9191-9196.
Swadesh M. 1955. Towards greater accuracy in lexicostatistic dating. Int J Am Linguist. 21:121-137.
Szathmary E. 1993. Genetics of aboriginal North Americans. Evol Anthropol. 1:202-220.
Tajima F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics. 105:437-460.
Tamm E, Kivisild T, Reidla M, et al. (21 co-authors). 2007. Beringian standstill and spread of Native American founders. PLoS ONE. 2:e829.
Templeton A. 1998. Human races: a genetic and evolutionary perspective. Am Anthropol. 100:632-650.
Thompson J, Higgins D, Gibson T. 1994. CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Tishkoff S, Varkonyi R, Cahinhinan N, et al. (17 co-authors). 2001. Haplotype diversity and linkage disequilibrium at human G6PD: recent origin of alleles that confer malarial resistance. Science. 293:455-462.
Torroni A, Schurr T, Cabell M, Brown M, Neel J, Larsen M, Smith D, Vullo C, Wallace D. 1993a. Asian affinities and continental radiation of the four founding Native American mtDNAs. Am J Hum Genet. 53:563-590.
Torroni A, Sukernik R, Schurr T, Starikorskaya Y, Cabell M, Crawford M, Comuzzie A, Wallace D. 1993b. mtDNA variation of aboriginal Siberians reveals distinct genetic affinities with Native Americans. Am J Hum Genet. 53:591-608.
Twiddy S, Holmes E, Rambaut A. 2003. Inferring the rate and time scale of dengue virus evolution. Mol Biol Evol. 20:122-129.
Underhill P, Shen P, Lin A, et al. (21 co-authors). 2000. Y chromosome sequence variation and the history of human populations. Nature Genetics. 26:358-361.
United Nations, Department of Economic and Social Affairs, Population Division. 2003. World Urbanization Prospects: The 2003 Revision. www.un.org/esa/population/publications/wup2003/2003wup.htm
United Nations, Department of Economic and Social Affairs, Population Division. 2004. World Population Prospects: The 2004 Revision Analytical Report. www.un.org/esa/population/publications/WPP2004/WPP2004_Volume3.htm
Venter J, Adams M, Myers E, et al. (275 co-authors). 2001. The sequence of the human genome. Science. 291:1304-1351.
Voight B, Kudaravalli S, Wen X, Pritchard J. 2006. A map of recent positive selection in the human genome. PLoS Biology. 4:e72.
von Wissman H. 1975. Uber die fruhe Geschichte Arabiens und das Enstehen des Sabaerreiches. Wien: Austria Academy of Sciences Press.
Wang S, Lewis C, Jakobsson M, et al. (27 co-authors). 2007. Genetic variation and population structure in native Americans. PLoS Genet. 3:e185.
Waters M, Stafford T. 2007. Redefining the age of Clovis: implications for the peopling of the Americas. Science. 315:1122-1126.
Weale M, Weiss D, Jager R, Bradman N, Thomas M. 2002. Y chromosome evidence for Anglo-Saxon mass migration. Mol Biol Evol. 19:1008-1021.
Weber T, Major E. 1997. Progressive multifocal leukoencephalopathy: molecular biology, pathogenesis and clinical impact. Intervirology. 40:98-111.
Webster M, Smith N, Ellegren H. 2002. Microsatellite evolution inferred from human-chimpanzee genomic sequence alignments. Proc Natl Acad Sci USA. 99:8748-8753.
Weinberg GL. 1995. A World at Arms: A Global History of World War II. Cambridge: Cambridge University Press.
Wilson A, Cann R, Carr S, et al. (11 co-authors). 1985. Mitochondrial DNA and two perspectives on evolutionary genetics. Biol J Linn Soc. 26:375-400.
Wooding S. 2001. Do human and JC virus genes show evidence of host-parasite codemography? Infect Genet Evol. 1:3-12.
Wright S. 1931. Evolution in Mendelian populations. Genetics. 16:0097-0159.
Xiao F, Yotova V, Zietkiewicz E, et al. (11 co-authors). 2004. Human X chromosomal lineages in Europe reveal Middle Eastern and Asiatic contacts. Eur J Hum Genet. 12:301-311.
Zazula G, Froese D, Schweger C, Mathewes R, Beaudoin A, Telka A, Harington C, Westgate J. 2003. Ice age steppe vegetation in east Beringia. Nature. 426:603.
Zhang J. 2007. The drifting human genome. Proc Natl Acad Sci USA. 104:20147-20148.
Zheng H, Zhao P, Suganami H, Ohasi Y, Ikegaya H, Kim J, Sugimoto C, Takasaka T, Kitamura T, Yogo Y. 2004. Regional distribution of two related Northeast Asian genotypes of JC virus, CY-a and -b: implications for the dispersal of Northeast Asians. Microbes Infect. 6:596-603.
BIOGRAPHICAL SKETCH
I graduated from Homestead High School in Fort Wayne, Indiana, in spring 1997. I then attended The Johns Hopkins University from fall 1997 to spring 2001, and graduated with a B.S. in biomedical engineering and a concentration in computer science. I began my graduate career in September 2001 when I matriculated at the University of Oxford (Hertford College) and took my M.Sc. in biology in 2002. I then enrolled in the anthropology graduate program at the University of Florida in fall 2002, received my M.A. in May 2004 and my Ph.D. in August 2008.