This item has the following downloads:
1 EXOME SEQUENCING FOR HIGH THROUGHPUT GENOMIC A NALYSIS OF TREES By LEANDRO GOMIDE NEVES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGRE E OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2013
2 2013 Leandro Gomide Neves
3 To my parents, Clara and Ary, to my sister Cludia, to my brother Tlio and to those who have inspired me to get here
4 ACKNOWLEDGMENTS I would like to acknowledge those who helped and contributed to the conclusion of this work. I am very grateful to my advisors, Dr. Matias Kirst and Dr. Brad Barbazuk, for their mentoring through ou t these years and friendship. I thank Dr. J ohn Davis for the discussions and help offered especially when we began the work on sequence capture in conifers. I am grateful to Dr. Lauren McIntyre and Dr. Charlie Baer for their advice and contributions as members of my supervisory committee. I am par ticularly thankful to the Plant Molecular and Cellular Biology program for accepting me for the Ph.D. program and to the University of Florida for the fellowship. I also thank Chris Dervinis for his help while performing the laboratory experiments, greenho use experiments and, most importantly, for keeping the Forest Genomics lab a great place to work and a friendly environment. I am grateful to Ruth Davenport for her help on important steps of the poplar sequence capture experiment. I am extremely grateful for my friends who have shared good moments with me throughout these years of my Ph.D, namely: Thiago Reis, Vincius Fialho, Leonardo Ramos, Douglas Souza, Marcio Resende, Daniele Paiva, Martina Spiess, Theodor Stanley, Virginia Little Cintia Ribeiro, Jua n Acosta, Ann ette Fahrenkrog, Patricio Munoz, Jared Westbrook, Amanda Pendleton and Maribeth Latvis. I am very thankful to my brother Tlio, my sister Cludia and my family. I am especially grateful for my parents, Clara and Ary, for always supporting my g oals to conclude the Ph.D., which is certainly a path that started from their belief that education is essential and is a product of many abdications they had to make in order to provide me with it.
5 TABLE OF CONTENTS page ACKNOWLEDGMENTS ................................ ................................ ................................ .. 4 LIS T OF TABLES ................................ ................................ ................................ ............ 7 LIST OF FIGURES ................................ ................................ ................................ .......... 8 ABSTRACT ................................ ................................ ................................ ..................... 9 CHAPTER 1 INTRODUCTION ................................ ................................ ................................ .... 11 Analysis of Genetic Diversity and its Application ................................ .................... 12 Genome Resequencing and its Applications ................................ .......................... 16 Genome Complexity Reduction ................................ ................................ .............. 19 Genome Structural Variation ................................ ................................ ................... 21 2 WHOLE EXOME TARGETED SEQUE NCING OF THE UNCHARACTERIZED PINE GENOME ................................ ................................ ................................ ...... 25 Background ................................ ................................ ................................ ............. 25 Experimental Procedures ................................ ................................ ........................ 27 Genotypes, mRNA and DNA Extraction ................................ ........................... 27 cDNA Synthesis ................................ ................................ ............................... 28 Library Preparation ................................ ................................ ........................... 28 Probe Design for Target Enrichment ................................ ................................ 29 Sequence Capture Hybridization Reaction and Sequencing ............................ 29 Sanger and P acific Biosciences Data Processing and Analysis ....................... 30 Illumina Data Processing and Analysis Pipeline ................................ ............... 31 De Novo Assembly of Captured Fragments and Gene Annotation .................. 32 Intron Prediction and Depth Analysis ................................ ............................... 33 Results ................................ ................................ ................................ .................... 34 Sequence Capture Probe Design from an Uncharacterized Genome .............. 34 Efficient Sequence Capture of the Loblolly Pine Genome with Different Complexities ................................ ................................ ................................ .. 35 Samples can be Multiplexed and Efficiently Captured ................................ ...... 36 Sequence Capture is Reproducible among Unrelated Samples ....................... 36 Probe Design that Avoids Boundaries Between Exons Leads to Higher Capture Efficiency ................................ ................................ ......................... 37 Sequence Capture Expands Gene Models ................................ ...................... 38 Probes Developed for Loblolly Pine are Efficient for Sequence Capture of a Related Species ................................ ................................ ............................ 40 Discussion ................................ ................................ ................................ .............. 40
6 3 A HIGH RESOLUTION GENE MAP OF LOBLOLLY PINE (PI NUS TAEDA L.) BASED ON EXOME SEQUE NCE CAPTURE ................................ ........................ 54 Background ................................ ................................ ................................ ............. 54 Experimental Procedu res ................................ ................................ ........................ 56 Mapping Population, DNA Extraction and Sequence Capture .......................... 56 Sequencing Data Filtering and Alignment ................................ ........................ 57 Detection of SNP and PAV ................................ ................................ ............... 57 Genetic Map Construction ................................ ................................ ................ 58 Results ................................ ................................ ................................ .................... 59 Genotyping of a Mapping Population with Sequence Capture ......................... 59 A P. taeda G enetic Map with 2,841 Gene Markers ................................ .......... 61 Map Validation ................................ ................................ ................................ .. 62 Discussion ................................ ................................ ................................ .............. 64 4 POPULATION LEVEL GENOTYPING OF COPY NUMBER VARIATION (CNV) IN POPLAR TREES BASED ON EXOME SEQUENCING ................................ ..... 78 Background ................................ ................................ ................................ ............. 78 Experimental Procedures ................................ ................................ ........................ 81 Plant Materials and DNA Extraction ................................ ................................ 81 Probe Design for Target Enrichment ................................ ................................ 81 Library Preparation, Capture Hybridization and Seq uencing ............................ 83 Illumina Data Processing ................................ ................................ .................. 83 CNV Detection ................................ ................................ ................................ .. 84 Functional Annotation and GO Enrichment Tests ................................ ............ 86 Results ................................ ................................ ................................ .................... 87 Copy Number Variation Detection Pipeline ................................ ...................... 87 Gene Copy Number Variation in 113 Poplar Trees ................................ .......... 88 Clustering Analysis for CNV Genotyping ................................ .......................... 89 Functional C haracterization of Genes with CNV ................................ .............. 90 Discussion ................................ ................................ ................................ .............. 92 5 CONCLUSIONS ................................ ................................ ................................ ... 106 LIST OF REFERENCES ................................ ................................ ............................. 111 BIOGRAPHICAL SKETCH ................................ ................................ .......................... 122
7 LIST OF TABLES Table page 2 1 Valida tion of the capture efficiency in pine across different genomic complexity. ................................ ................................ ................................ .......... 45 2 2 Reproducibility of the capture on 24 unrelated haploid samples of P. taeda ...... 46 2 3 Summary of the de novo assembly of the sequence capture data. .................... 47 3 1 Description of markers segregating in the population and mapped. ................... 71 3 2 Description of genetic map provided. ................................ ................................ 72 3 3 Marker validation at the grouping level. ................................ .............................. 73 4 1 Probe design for Populus deltoides ................................ ................................ ... 96 4 2 Distribution of selected gene CNV across scaffolds of poplar. ........................... 97
8 LIST OF FIGURES Figure page 2 1 Efficiency of multiplexed capture in Pinus taeda ................................ ................ 48 2 2 Reproducibility of the multiplexed capture in Pinus taeda ................................ 49 2 3 Effect of the presence of an intron in the probe region on capture efficiency. .... 50 2 4 Reduction of capture efficiency as a function of i ntron position within the probe. ................................ ................................ ................................ ................. 51 2 5 Examples of de novo assembly contig annotation. ................................ ............. 52 2 6 Size distribution of the annotated contigs produced by the Recruiter de novo assembly pipeline. ................................ ................................ .............................. 53 3 1 High density genetic map for loblolly pine. ................................ ......................... 74 3 2 Validation of the order of selected genes on the genetic map. ........................... 77 4 1 Examples of genes with putative CNV in 10 individuals of Populus deltoides ... 98 4 2 Example of genes with a rare putative CNV. ................................ ...................... 99 4 3 Examples of genes with CNV selected by unsupervised clustering algorithm. 100 4 4 GO term enrichment comparing the 300 most significant genes ..................... 102 4 5 GO term enrichment comparing the 300 most significant genes visualized separately for terms related to molecu lar function and biological process ....... 103 4 6 GO term enrichment comparing the 300 genes with CNV selected by unsupervised clustering ................................ ................................ ................... 104 4 7 Manhattan plot showing the distribution of p values for genes across chromosome 19 and chromosome 1. ................................ ............................... 105
9 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy EXOME SEQUENCING FOR HIGH THROUGHPUT GENOMIC A NALYSIS OF TREES By Leandro Gomide Neves August 2013 Chair: Matias Kirst Cochair: William Bradley Barbazuk Major: Plant Molec ular and Cellular Biology There have been extensive improvements in sequencing technologies in recent years. These advances support the genotyping of genetic variants via DNA sequencing, for large populations, for applications in breeding and population g enetic studies. In parallel, new methods for selecting reduced representations of genomes have emerged. One of such method s is sequence capture, where genomic DNA is hybridized to probes that are complementary to regions of interest, which can then be retr ieved and sequenced. The goal of this work was to apply sequence capture to the tree s species loblolly pine ( Pinus taeda L.), slash pine ( Pinus elliottii Engelm.) and poplar ( Populus deltoides Bart. ex Marsh.), to characterize their genomes. Initially, we modified the sequence capture approach for higher throughput and faster sample preparation by multiplexing the assay with several individuals identified by unique DNA barcode adapters. A diversity panel of 24 haploid genomes of loblolly pine was then proc essed with multiplexing sequence capture targeting 14,729 genes. The data obtained was used to identify factors in probe design and data analysis that influence capture efficiency, namely to avoid designing probes spanning exon exon boundaries and to avoid probes capturing paralogs/pseudogenes during polymorphism detection. We
10 created a de novo assembly pipeline to assemble and annotate the captured genes, providing information about their structure (e.g. intron position and promoter regions) and functional annotation of this complex and uncharacterized genome. To evaluate the method for DNA polymorphism detection, we analyzed a mapping population of 72 haploid genomes of loblolly pine and generated a genetic map with 2,841 genes Finally, for the better cha racterized genome of poplar, we used sequence capture to identif y gene CNV in 18,153 genes characterized in 121 unrelated trees from an association population From these results we conclude that multiplexing sequence capture is an efficient way to charact erize the complex genome of plants, providing a method that can be quickly adopted to species with few genomic resources to allow the initial characterization of their genomes and assess the genetic diversity of large populations.
11 CHAPTER 1 INTRODUCTION Trees are an important source of timber and other natural resources, such as resin and oil. The timber can be used for construction, furniture or be processed into paper and bioenergy, all of which are essential products for society and represent an impor tant sector of the U.S. economy ( H OWA RD and M C K EEVER 2012 ) Trees are also essential from an ecological perspective, as native forests are habitat for other species and help regulate the world climate. Trees started being bred and genetically improved in the 1950s ( Z OBEL and T ALBERT 1984 ; W HITE et al. 2007 ) a considerable late start compared to other crop species that can be explained in part by skepticism that similar breeding principles could be applied ( Z OBEL and T ALBERT 1984 ) Many b reeding technologies have improved productivity of planted forests, such as early selection, top grafting and controlled crossing and hybridization. However, tree breeding is still largely done without any information about which genetic polymorphisms regu late traits of value. If available, such information would be valuable to improve breeding efficiency ( A DAMS and J OLY 1980b ; T ANKSLEY 1983 ; G RATTAPAGLIA and K IRST 2008 ; N EALE and I NGVARSSON 2008 ) Tree populations are typically characterized by high genetic diversity because they are in the early stages of domestication have large effective population sizes, and many propagate by outcrossing For instance, a survey of 32 unrelated haploid genomes of loblolly pine ( Pinus taeda L.) identified one single nucle otide polymorphism (SNP) every 63 bases. In poplar ( Populus trichocarpa Torr. & Gray), a recent project that resequenced the genome of approximately 1000 individuals identified one SNP every 5 bp, highlighting the haplotype diversity that exists in natural populations of trees
12 (Gerald A. Tuskan, personal communication). This already high level of genetic variation does not consider other genetic variants, for instance insertions, deletions and copy number variation, which may also be abundant in the genome. Analysis of G enetic D iversity and its A pplication The genetic diversity of trees was first characterized by methods based on allozymes. In loblolly pine, for example, Adams and Joly ( 1980a ) analyzed a panel of 15 allozyme loci in a seed orchard, to characterize the genetic diversity among clones and infer the degree of self pollination. Despite the typical ly low information content of allozyme markers, this early study was able to identify an average of 3.4 alleles per locu s in the population, indicating high levels of genetic diversity and demonstrating the potential use of th is assay for clonal identification. The results also allowed the authors to estimate a low level of self pollination in the progeny (1.2%), demonstrat ing that the management of the orchard to increase pollination between clones was adequate. Thus, although limited with regards to the method used and number of loci analyzed, this study domonstrates the practical potential of analyzing genetic variation i n tree breeding. The advent of DNA based molecular markers, such as restriction fragment length polymorphism (RFLP) ( W YMAN and W HITE 1980 ) randomly amplified polymorphic DNA (RAPD) ( W ILLIAMS et al. 1990 ) and AFLP ( V OS e t al. 1995 ) enabled parallel genotyping of a larger number of markers and, most importantly, the expansion of genetic diversity analysis to other applications, particularly the development of genetic maps and detection of quantitative trail loci (QTL) i n trees. In poplar, for example, Bradshaw and Stettler (1993) published a series of studies utilizing RFLP and RAPD markers genotyped in an imbred F2 mapping population derived from an interspecific cross
13 between Populus trichocarpa and Populus deltoides Analysis of this population identified chromosomal aberrations in the F1 hybrid progeny. Poplars are diploid species with 19 chromosomes, but by analyzing the segregation of RFLP markers the authors detected individuals that were triploid or aneuploid, a r esult that was confirmed by flow cytometry. The triploid trees were shown to have lighter abaxial leaf color that allowed for easy screening, as well as larger wood fiber length and a reduction in fertility. All of these properties can be explored in breed ing programs, as discussed by the authors ( B RADSHAW and S TETTLER 1993 ) The F2 population developed by Bradshaw and Stettler was al so utilized to place QTLs on a genetic map. The principle behind QTL mapping is to detect regions of the genome, characterized by flanking markers on a genetic map, that are associated with the trait phenotypic variation ( L ANDER and B OTSTEIN 1989 ) The requirement for such stud ies is a genetic map generated o n a bi parental population that segregates for the traits of interest. Phenotyping of 24 traits in the F2 progeny of 55 individuals, genotyped with 343 markers, resulted in the detection of several QTLs distributed across the genome ( B RADSHAW and S TETTLER 1995 ) An important contribution of QTL mapping to quantitative genetics is that these studies established the foundatio n for the molecular understanding of the genetic control of phenotypic traits. For example, the study of Bradshaw and Stettler ( 1995 ) identified one QTL for height measured at second year th at explained 31.7% of the genetic variance, and showed that the P. trichocarpa allele had a positive dominant effect over the P. deltoides allele. This level of information about the trait led the authors to conclude that an oligogenic model of inheritance where few alleles explain a large
14 portion of the genetic variance, could be employed for such traits that were traditionally being treated as quantitative in breeding programs. Other QTL studies in trees also observed similar results that suggested the p resence of few loci with large effect on the phenotypic variation ( G RATTAPAGLIA et al. 1995 ; W ILLIAMS e t al. 2007 ) As QTL mapping approaches became more studied, inherent limitations to the method became obvious. Simulation studies performed by Beavis showed that estimates of the phenotypic variance explained by detected QTLs is largely overestimated wh en only a few hundred individuals are analyzed, and that thousands of individuals are needed to achieve good accuracies ( X U 2003 ) This became known as population (up to 4 for outcrossing species). Thus, the transferability of the results to unrela ted populations is limited and has to rely on extensive linkage disequilibrium between markers and QTL for use in marker assisted selection, as proposed previously for rust resistance in eucalyptus ( J UNGHANS et al. 2003 ; M AMANI et al. 2010 ) Ano ther limitation intrinsic to using a small population is that the resolution of the QTL is limited by the number of rec ombination events sampled in the mapping population. Finally, in species with large genome such as those of conifers, confidence intervals as low as 1 cM might still represent several million nucleotides, and encompass hundreds to thousands of genes. There fore, despite the knowledge that QTL mapping can be accurate in identifying regions of interest by fine mapping ( P RICE 2006 ) its broader application has been limited in plants ( B ERNARDO 2008 ) To improve on some limitations of the markers previously described, microsatellites markers were developed for trees ( W EBER and M AY 1989 ; E CHT and
15 M AY M ARQUARDT 1997 ) Because of their high polymorphism information content and transferability across populations or even species, microsatellites developed in pine have been used for identify ing population structure in association studies ( G ONZALEZ M ARTINEZ et al. 2007 ) to compare population parameters between pines species ( S TEWART et al. 2012 ) and to generate moderately dense genetic maps ( E CHT et al. 2011 ) Microsatellites were developed in poplar ( T USKAN et al. 2004 ) and, among other applications, were used to identify regions of the genome that are associated with sex determination. The genus Poplar consists of dioecious species, but little is known about what chromosomes or regions of the chromosomes are responsible for the sex determination. Yin et al. ( 2008 ) generated a consensus genetic map containing 1138 AFLPs and microsatellite markers spaced with an average of 2.1 cM, based on the combined segregation data of five mapping populations. Based on the hypothesis that incipient sex chromosomes would show suppressed recombination, the authors explored the genetic map looking for segments of the genome that had that property. The authors observed that a peritelomeric region of ch romosome 19 had a large disagreement between genetic distance and physical distance, but that microsatellites were evenly distributed in the reference genome for this region, indicating that recombination is suppressed. Finally, using the segregation of on e of the mapping populations that contained flowers, the authors determined that this peritelomeric region of chromosome 19 is associated with sex determination. Despite not finding an obvious gene candidate in this region, there was a high enrichment for genes containing nucleotide binding site/leucine repeat in this region, but its importance on sex
16 determination is unknown. Therefore, this work demonstrated how genetic maps created using transferable and robust microsatellite markers can be useful tools for examining biology but they often lack the resolution required to pinpoint the specific region of the genome responsible for the trait studied. Genome Resequencing and its Applications Another methodology that was proposed to dissect the genetic archit ecture of complex traits is association genetics. Contrary to QTL studies, this methodology is typically based on linkage disequilibrium mapping of large, generally genetically unstructured populations, and seeks to identify markers in linkage disequilibri um with QTLs ( Z HU et al. 2008 ) Because populations are genetically unstructured, and individuals have been recombining for a large (unknown) number of generations, the extent of linkage disequilibrium is typically lower than in bi parental, single generation mapping populations. This approach was considered particularly relevant for tree populations, which are typically unstructured have large effect ive population size s maintain high nucleotide diversity have low linkage disequilibrium (LD) extension, and can be accurately phenotyped by clonal replication ( N EALE and S AVOLAINEN 2004 ) However, the l imited extent of LD requires a very large number of markers for whole genome analysis ( R AFALSKI 2002 ) DNA resequencing has long been considered an effective approach fo r discovery and genotyping of SNPs but, until recently, high costs hindered its application to large populations ( S CHLOTTERER 2004 ) Alternative approaches to large scale genotyping of tree populations h ave been to focus on the resequencing of a few candidate genes, and perform association analysis on the polymorphisms identified in them. This approach
17 has been adopted in the major ity of association studies performed in trees to date and yielded modest results ( G ONZALEZ M ARTINEZ et al. 2007 ; G UERRA et al. 2013 ) For loblolly pine, a major resequencing effort was implemented to identify SNPs in the coding regions of genes ( http://dendrome.ucdavis.edu/NealeLab/ adept2/overview.php ). In this project, 20,500 unigenes were created by assembling expressed sequence tags (ESTs) and these were used for primer design. The primers were tested for amplification of genomic DNA in a diversity panel of 18 trees and a total o f 6,924 unigenes generated valid amplicons that were sequenced using dye terminator sequencing (Sanger sequencing) to identify SNPs. To genotype the SNPs, a single base extension Illumina array was generated from these resources. Eckert et al. ( 2009 ) analyzed the potential of single base extension arrays in loblolly pine by genotyping 384 SNPs in a mapping population, and reported that 257 (67%) loci were successfu lly genotyped in the population, but only 70 SNPs segregated. This relatively low SNP conversion rate is likely due to the presence of polymorphisms adjacent to the SNP being assayed, which inhibit primer annealing and result s missing data. Thus, the devel opment of SNP genotyping based on single based extension of target regions of the genome is relatively inefficient for trees, and a large number of SNPs need to be discovered in order to identify a subset that fulfills the requirements of this method ( G RATTAPAGLIA et al. 2011 ) The resequencing of candidate genes for loblolly pine was also used to estimate population genetic parameters, such as nucleotide diversity and extension of LD. As mentioned earlier, Brown et al. ( 2004 ) observed one SNP every 63 bp on an analysis of 19 genes in 32 haploid genomes of loblolly pine, providing an estimated nucleotide
18 reported significant LD decay (r 2 < 0.2) after 1500 bp in the population. In white spruce ( Pice a glauca ), another conifer for which similar population parameters were estimated, 105 genes were partially resequenced from 48 haploid genomes identifying one S NP every 198 bp, ( P AVY et al. 2012 ) The LD decay in this population was even shorter, with the r 2 decreasing below 0.2 after just 87 bp. The advent of next generation sequencing ( M ARGULIES et al. 2005 ; B ENTLEY et al. 2008 ) considerably improved the resequencing throughput and decreased costs, expanding the application of sequencing to genotyping of large populations ( K AHVEJIAN et al. 2008 ; L ISTER et al. 2009 ) For example, de novo sequencing of whole transcriptome was performed in m any species with the goal of characterizing genes and detecting SNPs ( B ARBAZUK et al. 2007 ; N OVAES et al. 2008 ; P ARCHMAN et al. 2010 ) With recent improvements in the throughput and length of sequencing reads, de novo assembly of conifer large genomes relying entirely on these technologies and whole genome shot gun also became feasible, albeit yielding assemblies that are still largely fragmented and incomplete ( B IROL et al. 2013 ; N YSTEDT et al. 2013 ) On a population level, whole genome resequencing using next generation sequencing is still challenging and limited to only a few examples in forestry. Populus trichocarpa was the first tree species to have its genome seque nced ( T USKAN et al. 2006 ) and is the most advanced reference genome available for trees to date. The latest genome assembly (v 3.0) comprises 423 Mbp (of a 485 10 Mbp genome) represented with in 1,446 scaffolds This assembly is derive d from Sanger sequencing to an average of 9.44 redundancy Thus, this genome provides a good resource for
19 reference guided assembly or alignment of short reads from whole genome shot gun resequencing using next generation technologies. Using this approach, Slavov et al. ( 2012 ) resequenced the genome of 16 trees selected to represent the natur al range of the species, and estimated population parameters from the data. One important result obtained by the authors was the low LD decay observed in this natural population. On average, LD extended for approximately 5 Kbp (r 2 < 0.2). Similar results w ere observed when the authors used a different genotyping method (i.e. Infinium SNPs) to rule out the possibility of added bias due to the selection of SNPs from the resequencing data. This result is contrastingly different from those reported in other pop lar species ( I NGVARSSON 2005 ) and, if corroborated, suggests a good potential for selecting tre es for breeding based on marker trait association, but add challenge for identifying the molecular bases of such associations. Genome C omplexity R eduction One alternative to whole genome resequencing for population studies is to reduce the complexity of th e genome in a systematic way across all individuals, to sequence only an enriched targeted subset. Initially, most methods to reduce genome complexity focused on avoiding repetitive DNA, which constitutes the majority of the genome of plants ( B ARBAZUK et al. 2005 ) For example, sequencing by methylation filtration explores the notion that repetitive regions are highly methylated, and cloning DNA fragments in stra ins of Escherichia coli prior to sequencing favors the selection of unmethylated DNA ( R ABINOWICZ et al. 1999 ) Similarly, high C o t sequencing avoids repetitiv e DNA by first denaturing the genomic DNA completely, followed by slowly allowing the DNA to renature. The repetitive portion of the genome that tends to
20 renature first, due to its relative abundance compared to low copy regions, can then be discarded ( P ETERSON et al. 2002 ) More recently, other methods to reduce the complexity of the genome have been developed, that no longer limit the sequencing to repetitive v ersus non repetitive DNA, but allow specific targets of the genome to be enriched ( M AMAN OVA et al. 2010 ) One method, known as molecular inversion probes (MIP) or capture by circularization, was initially developed to detect specific SNPs in known genomic regions and later modified to accept longer gaps than a single SNP. The main principle of MIP is to design probes molecule to take a circular format. The gap formed betwe en the two ends of the probe is filled with a polymerase and ligase. Probes that did not bind to genomic DNA are removed by an exonuclease treatment. Probes that have formed a circular molecule can then be enriched by PCR and sequenced. The main advantages of MIP are the high specificity to the target, the use of genomic DNA as input, removing the need to perform DNA shearing and library preparation, and the restriction of the sequencing to the gap region, which reduces the cost of sequencing. The main disa dvantages of the method are the low capture uniformity across probes that translate in high sequencing depth variance across target sites ( T URNER et al. 2009 ; M AMANOVA et al. 2010 ) Capture by hybridization, or sequence capture, was first introduced by hybridizing a prepped library of genomic DNA to a microarray containing probes complementary to the targets an d sequencing of the molecules that hybridize to the probes ( A LBERT et al. 2007 ) Shortly after the method was expanded and the capture was performed in solution wi th
21 probes that had been released from the microarray surface, multiplied by PCR and converted to biotinylated RNA. The capture of probe:target hybrids is then performed with streptavidin proteins linked to paramagnetic beads and non targeted DNA that is n ot bound to a probe is washed away ( G NIRKE et al. 2009 ) Contrary to MIP, sequence capture has good capture uniformity across probes, but requires library prepara tion and results in captured fragments that are usually longer than the probes for which capture was developed. The application of sequence capture to trees will be presented and discussed in detail in the next chapters. Genome S tructural V ariation Most ap plied genomic studies that characterized genetic variation in populations have focused on detection of single nucleotide polymorphisms, as described in previous sections. However, other types of DNA polymorphisms are abundant in the genomes and may contrib ute to phenotypic variation. For instance, the genome size of subpopulations of maize was estimated to vary considerably between individuals, with elite inbred lines showing smaller genomes ( D IEZ et al. 2013 ) One of the factors that might contribute to vari ation in genome size among individuals of the same species is copy number variation (CNV), which is defined as a segment of the DNA 1 Kbp or longer that occurs in different number of copies between two or more genomes ( S CHERER et al. 2007 ) Other factors that contribute to variation in genome size include transposable elements, centromeric repeats, heterochromatic knobs, among others ( D IEZ et al. 2013 ) Copy number variation is most studied in human genomes. Current estimates suggest that 13% o f the human genome contains a CNV ( S TANKIEWICZ and L UPSKI 2010 ) and in maize there is an average of 438 CNVs between any individual and the
22 reference genome ( B ELO et al. 2010 ) Major rearrangements at the chromosome level need to take place for a CNV to occur, and, as the frequency of certain CNVs in the population varies from high to rare, it is expected that some rearrangements can be recurrent and others might be the result of unique events. Through the analysis of CNV breakpoints in humans, recurrent CNVs are supposed to arise from non allelic homologous recombination, where the homologous recombination event takes place in a repetitive region at different sites of the genome, resulting in a non allelic match during meiosis and the potential for CNV to occur. Less is known about the causes for less frequent, non recurrent CNVs, as they are usually not associated with repetitive regions of the genome or other sequence signatures, but hypothesis include imperfections that occurred during a non homologous end joining used to repair a double strand break in the germ cells ( H ASTINGS et al. 2009 ; VAN B INSBERGEN 2011 ) The detection of CNVs has been traditionally performed using real t ime PCR (RT PCR), comparative genomic hybridization (CGH) or high density SNP arrays ( G IRIRAJAN et al. 2010 ) RT PCR is a low throughput but highly sensitive method used to analyze few CNV instances and is usually applied at the validation level. Similar to the analysis of gene expression, it is based on faster or slower amplification profiles when the in put DNA target is present in more or less copies, respectively, compared to a control. When high density SNP panels are available for the species, such as the case of humans, CNV can be detected as variations on the expected ratio of the two alleles of a h eterozygous site. Because of the extremely high density of some genotyping arrays (> 1 million SNP sites), this method provides a good resolution for detection of CNV breakpoints. Finally, detection of CNV using CGH arrays consists o f hybridizing a
23 microar ray with genomic DNA and analyzing the relative fluorescence units for signal deviation that might indicate CNV. CGH arrays can be developed for any species through custom arrays, as long as genomic information is available for probe design ( L I and O LIVIER 2013 ) The presence of CNV in humans is usually reported as a detrimental feature that lead s to abnormalities, particularly when whole chromosomes are affected ( T ANG and A MON 2013 ) This result is possibly a consequence of a large number of studies on CNV focus ed on disease phenotypes rather than on adaptive traits. This contrasts to what is usually observed on plants where whole genome polyploidization is often associated wi th adaptive advantages ( F AWCETT et al. 2009 ) The fact that a gene CNV occurs in a population is not a guarantee that it will have implications on phenotype, as gene copies might be silenced or result in truncated transcripts. To analyze the relationship of gene CNV and gene expression Henrichsen et al. ( 2009 ) mapped gene CNV in mice and performed genome wide analysis of gene expression in a subset of the population across several organs. The authors observed a significant linear relationship between gene CNV and expression, with CNV exp laining from 66 to 74% of the expression level of those genes in most organs, except brains which had tighter regulation of gene expression and only 38% was explained by CNV. Similar results had been obtained in humans by Stranger et al. ( 2007 ) where CNV and SNPs were found to explain a significant component of the variation of gene expression among individuals but the subset of genes being explained was different between the two variants, demonstrating the importance of genotyping both markers. In this sense, many studies have sought out and found association between CNV and phenotype in
24 human populations ( T ANG and A MON 2013 ) but there is a debate in the literature about the extent to which CNV will identify variations that were already not id entified by SNPs in LD with the CNV, specially for common disease genotyped with common SNPs and common CNV ( C RADDOCK et al. 2010 ) The detection of CNV in plan ts has been mostly restricted to studies in maize. Belo et al ( 2010 ) characterized CNV between the maize reference genome B73 and 13 inbred line s by hybridizing DNA to a 60 mer custom array initially designed for gene expression. The authors applied a probe level statistical approach to detect CNV that relied on the probe intensity ratios between B73 and the tested inbred line, considering cases w hen three or more adjacent probes deviated from the expected 1:1 ratio as potential sites of CNV Using this approach, a total of 2,109 putative CNVs were detected and distributed throughout the genome. A very similar approach was used by Swanson Wagner et al. ( 2010 ) who identified CNV in 4,538 genes or 13.9% of the genes sampled. Contrary to studies in humans, implications of CNV variation on phenotype are very scarce in plants However, one example published recently identified a locus (MATE1) in maize that, when present in three copies in the genome, provided significant aluminum tolerance and was identified as the causal effect underlying previous QTL studies that had been iden tified in the region ( M ARON et al. 2013 )
25 CHAPTER 2 WHOLE EXOME TA RGETED SEQUENCING OF THE UNCHARACTERIZED PINE GENOME 1 Background Sequencing the large genome of gymnosperms remains a challenge despite the rapid increase in DNA sequencing throughput and cost reductions ( S CHATZ et al. 2010 ; B AKER 2012 ) At 15,480 Mbp, the modal genome size of gymnosperms is an order of magnitude larger than that of most sequenced angiosperms, and 6.7 times larger than the human genome ( L EITCH et al. 2001 ) The large genome of gymnosperms, and of pines in particular, does not appear to be the consequence of recent whole genome duplications, but rather is derived from retrotransposon activity ( M ORSE et al. 2009 ) The transcriptome of conifers has also not expanded significantly compared to other vascular plants such as Arabidopsis ( K IRST et al. 2003 ) Thus, coding and regulatory sequence will likely represent only a minute fraction of a conifer genome much smaller than that of other species sequenced to date. An alternative to complete sequencing of a large and comple x genome is to focus on its coding and regulatory regions, which are of primary interest for population and quantitative genetics studies, and comparative genomics. Reducing genome complexity for sequencing single copy coding and regulatory regions can be achieved with methylation fi ltration ( R ABINOWICZ et al. 1999 ) and C o t filtration ( P ETERSON et al. 2002 ) and the utility of these methods has been well demonstrated in plants such as maize ( W HITELAW et al. 2003 ; B ARBAZUK et al. 2005 ) Bo th methods rely on specific properties of repetitive DNA to exclude it from a genome 1 This chapter has been published in The Plant Journal 2013, 75(1):146 156
26 sample. However, these approaches are labor intensive and relatively inconsistent when common regions of the genome are targeted for sequencing in multiple samples. Alternative methods of genome complexity reduction that can retrieve targeted sequences, including coding regions, were introduced recently. Initially, genomic DNA was hybridized to microarrays containing oligonucleotides complementary to the target sequence s, followed by elution and sequencing of the hybridized DNA ( A LBERT et al. 2007 ) This approach ha s been further modified to use biotin labeled oligonucleotides that can be retrieved by streptavidin beads, thus enabling solution based hybridization ( G NIRKE et al 2009 ) Sequence capture has been used extensively in human genome studies to detect nucleotide and structural variants ( 2009 ; W ALSH et al. 2010 ) and its recent use in maize demonstrated its utility in plants ( F U et al. 2010 ) A targeted sequence capture approach applied to uncharacterized, large genomes such as those of conifers holds the promise of expanding knowledge of coding and regulatory sequences, and support s the identification of sequence variation. A pplying sequence capture to conifers presents challenges that have not been encountered previously. First, it is unclear how efficient sequence capture will be in recovering a much smaller fraction of the genome, compared to previous studies ( 2009 ; Z HOU and H OLLIDAY 2012 ) Second, unknown introns may impact hybridization efficiency and capture success, when positio ned within a probe designed from transcript data. Here we report the development and synthesis of probes for 14,729 loblolly pine ( Pinus taeda L.) unigenes for sequence capture from genomic sequence. We describe the efficiency of capture, and examine succe ss in light of the fact that much of our probe
27 design was done in the absence of known gene structure. We create a de novo reference underlying the genomic region within which the probes reside, which expands the initial reference and annotates the gene sp ace. We also compare the capture efficiency from cDNA and gDNA from both haploid and diploid tissue; hence, we examine the effect of targets with different genetic complexities on capture efficiency. Additionally, to support processing large numbers of sam ples, we determine the effect of multiplexing capture reactions on capture efficiency. Finally, we demonstrate how a probe set developed for P. taeda can be efficiently used to capture the target genes in a related species, slash pine ( Pinus elliottii Enge lm). Experimental Procedures Genotypes, mRNA and DNA E xtraction For sequence capture of haploid cDNA and haploid genomic DNA, megagametophytes were collected from seeds of loblolly pine genotype 17 ( K AYIHAN et al. 2005 ) and from 24 unrelated trees (for genomic DNA) sampled from the natural distribution of each P. taeda and P. elliottii Seeds were stratified at 4 o C. Prior to mRNA extraction, seed germination was s timulated by placing them in a petri dish with moist filter paper for four days, after which seeds were dissected to separate the haploid megagametophyte from the diploid embryo. Whole lyophilized megagametophytes were ground individually in a tissue homog enizer. Messenger RNA and DNA were extracted from the haploid megagametophyte tissue using Dynalbeads mRNA DIRECT kit (Invitrogen) and DNeasy Plant Mini Kit (Qiagen), respectively. For sequence capture of diploid genomic DNA, needles were collected from ge notype 352 ( Q UESADA et al. 2010 ) The needle tissue sample was manually ground in liquid nitrogen and DNA was extract ed as described above.
28 cDNA S ynthesis Messenger RNA was fragmented to an average length of 200 nucleotides with RNA Fragmentation Reagents (Ambion) in a 10 L reaction, followed by incubation for 3 minutes and 50 seconds at 70 o C. Fragmented mRNA was purifi ed using RNA Clean and Concentrator 5 (Zymo Research). First strand cDNA was synthesized using SuperScript II Reverse Transcriptase (Invitrogen) with the following modifications: 3 of Random Primers (Invitrogen) and 20 U of RNasin Ribonuclease Inhibito r (Promega) were used in each reaction. Second strand cDNA synthesis was done using 30mM dNTP mix (Promega), 2 U of RNaseH (Invitrogen) and 50 U of DNA Polymerase I (Invitrogen) in a 100 L reaction volume (buffer: 500 mM Tris HCl pH 7.8, 50 mM MgCl 2 10 m M DTT). Reactions were incubated at 16 o C for 2.5 hours and the product was purified with MinElute Reaction Cleanup (Qiagen). Library P reparation Sequence ready libraries from haploid DNA and cDNA isolated from megagametophyte, and diploid DNA from needles, were prepared according to Bentley et al ( 2008 ) with modifications. DNA was mechanically sheared to a mean fragment length of 200 base pairs using a Covaris E2 recommendations. Fragments were end repaired with the End It DNA End Repair kit (Epicenter Biotechnologies) and purified with a MinElute Reaction Cleanup (Qiagen). A Klenow fragment (New England Biolabs) and 10 mM dATP (Promega). Where samples barcode sequences were added and purified as described in the previous step. Adapter ligated fragments of 250 to 500 bp were excised from an agarose gel and purified with
2 9 MinElute Gel Extraction (Qiagen). Enrichment PCR was performed (up to 18 cycles) using Phusion Polymerase (New England Biolabs) and standard Illumina paired end primers followed by purification with a MinElute PCR cleanup kit (Qiagen). Each library was quantified with a Quant it PicoGreen dsDNA Assay Kit (Invitrogen). Probe D esign f or T arget E nrichment For probe design a custom pipeline was de veloped and 15,983 unigenes assembled from Sanger sequenced loblolly pine ESTs were used as input ( http://dendrome.ucdavis.edu/treegenes/species/species_detail.php?id=93 Supplementary Data S1). For each unigene adjacent non overlapping 120 nt long region s were extracted as putative probes from the inner part of the unigene sequence (or predicted exon) This resulted in a total of 64,032 putative probes that were filtered by comparing each probe against the unigene dataset at decreasing stringenc y (from e value 1x10 20 to 1x10 9 ) using WU Blast BLASTN ( G ISH 1996 2004 ) At each stringency level, p robes matching multiple regions were discarded The process continued until there were fewer unique probes than the probeset space would accommodate (55,000 probes). The final stringency filter (e value 1x10 9 ) resulted in a p ine s equence c apture set of 54,773 prob es representing 14,729 unigenes, and this probe set was used to prepare a custom SureSelect assay by Agilent Technologies. Sequence C apture H ybridization R eaction and S equencing Initially, one sample from each genome complexity (haploid cDNA, ha ploid genomic DNA and diploid genomic DNA) was processed individually and hybridized in individual reactions. The captured fragments for these three libraries were cloned into a vector using the Zero Blunt TOPO PCR cloning kit (Invitrogen) and 96 clones fr om each library were Sanger sequenced. Upon verification of the feasibility of sequence capture
30 in loblolly pine, eight barcoded libraries from haploid DNA, each representing a single individual, were pooled in equimolar concentrations prior to hybridizati on. Each pool was hybridized to capture probes following the Agilent SureSelect protocol, re quantified using Quant iT PicoGreen dsDNA Assay Kit and diluted to 10 nM. Each pool of eight haploid samples was sequenced on one lane of HiSeq 2000 (Illumina) fol lowing the end mode for 100 cycles. The haploid genomic DNA sample and one multiplexed reaction of eight individual diploid genomic samples were sequenced separately in two Pacific Biosciences SMRT Cells in circular consensus sequence (CCS) mode. Sanger and Pa cific Biosciences D ata P rocessing and A nalysis Reads from each of the set of 96 cloned fragments that were Sanger sequenced were converted to FASTA format. A sequence read was discarded if it ( i ) did not contain the full barcode sequence and at least a partial sequence for the correct Illumina ii than 20% of sequence. Barcodes and adapters were removed from the reads that pass ed filtering with custom scripts. The resulting captured fragments were aligned using BLASTN (WU Blast e value 1 10 10 ) to the unigene sequences used for probe design to enable descriptive analysis and identify putative polymorphisms. Nucleotide diversity between the sample and the reference was calculated as the number of divergent bases divided by the length of the captured fragment. Calculations were restricted to the regions of the unigene where probes had been designed. The Pacific Biosciences data wer e processed similarly to the Sanger data. Circular consensus reads for each of the two SMRT cells were created with the Pacific Biosciences default algorithm. The reads obtained from the two 45 minutes movies per SMRT cell were
31 combined into a single fastq file. Pacific Biosciences reads were discarded if they did barcodes. The FASTX Toolkit ( http://hannonlab.cshl.edu/f astx_toolkit ) was used to separate reads by barcode and remove the barcode sequence. The resulting capture fragments were aligned to the reference as described above. Illumina D ata P rocessing and Analysis P ipeline Data filtering: Raw fastq files had the I llumina systematic read identifier renamed to include project specific information ( e.g. sequencing center, project). Based individual were collected into separate files usi ng the barcode splitter function of FASTX Toolkit. The read identifiers were changed again to include the individual they represent and the barcode sequence was removed from the reads. Low quality bases emoved using the fastq quality trimmer function of FASTX Toolkit and only processed reads longer than 50 nucleotides were retained. Alignment: Because these are paired end reads, the quality filtered files from both ends were re synchronized to keep the sa me order of reads in both files. Reads that lost a mate during the data filtering step were treated as single end reads in the alignment. The reference used for the alignment was composed of the sequences used for probe design. Reads were aligned to the re ference using Mosaik 2 ( http://bioinformatics.bc.edu/marthlab/Mosaik ) with the following parameters: maximum percentage of mismatches ( mmp 0.05), unique alignment mode turned on ( m unique), most accurate hashing strategy ( a all) and hash size ( hs 15). Multiple alignments for a individual were merged using BamTools ( B ARNETT et al. 2011 ) Raw se quence data for
32 each genotype was submitted to the Short Read Achieve under study accession SRP018726 De N ovo A ssembly of C aptured F ragments and G ene A nnotation De novo assembly: The Recruiter pipeline was developed to assemble the captured reads on a per gene basis. The data for all 24 haploid samples of P. taeda or P. elliottii were used to create a consensus assembly for each species. For each individual, the bam alignment file was used to identify reads that aligned to the reference, either with one or both ends aligned. The aligned reads, together with their possible unaligned mate, were recruited from the re synchronized quality trimmed fastq file and the recruited data for all samples were combined for each reference unigene. Cap3 ( H UANG and M ADAN 1999 ) is used by Recruiter to assemble the reads, with an assembly task being independently performed for each gene. The recruited fastq file is conve rted to Cap3 format, maintaining the paired end and base quality information, and using the option h 100 to allow for maximum expansion of the contigs. Contig structural annotation: The contigs formed are annotated to exclude low confidence contigs and cr eate a GFF like file with annotation of the contigs relative to the reference. Contigs were annotated when they passed the following empirical criteria: minimum absolute size of 150bp, minimum relative size compared to the reference of 0.25, minimum absolu te number of reads of 20 and minimum relative number of reads relative to the total of 0.15. These values were selected by analyzing the distribution of length and number of reads from the assembly output. For the few cases where more than three contigs re mained after this selection, the top three contigs selected based on coverage relative to the reference were maintained for annotation. The following features were present in the GFF annotation file and their meaning is
33 explained in the text: five_prime_ex pansion, three_prime_expansion, reference_like_fragment, complete_non_reference_fragment and incomplete_non_reference_fragment. The source code for Recruiter is available upon request. Unigene functional annotation: The original unigenes used for probe des ign were functionally annotated using Blast2GO automated pipeline on default parameters ( C ONESA et al. 2005 ) with the only difference that BLASTX results (e valu e threshold 1 10 10 ) against NCBI non redundant database were generated locally and imported to Blast2GO. The annotated results for all unigenes are reported in Supplementary Data S7. A Fisher exact test was performed to compare GO (gene ontology) enrichme nt between the top 100 most captured genes and the remainder using Blast2GO. Intron P rediction and D epth A nalysis The P. taeda draft assembly (v. 0.8) was used to predict introns for an unbiased comparison of the effect of introns in the capture efficiency The large size of the genome is not handled by GMAP, thus we aligned the unigenes used for probe design to the genome using NCBI BLASTN ( e value 1x10 20 and percent identity of 85) and extracted the top hit scaffolds for each unigene as input for intron prediction. The unigenes were aligned to the scaffold top hit subset using GMAP (default parameters) and putative introns were identified from the output result using custom scripts. The positions of the exon intron boundaries were used to identify probes whose sequence spanned these boundaries (referred as with intron ) and those whose sequence do not span a detected intron (referred as without intron ). The relative position of the intron in the probe was obtained adjusted to the center, thus varying from 1 (borders or positions 1 and 120) to 60 (center or positions 60 and 61) to increase the number of observations
34 per class. The sequence depth for each probe was calculated as the average number of reads that aligned to each of the 60 innermost bases of the probe. The center of the probe was used in an attempt to avoid neighboring capture effect from adjacent probes. The final probe depth used for the analysis is the median depth taken across the 24 haploid samples of P. taeda Results Sequence C apture P robe D esign f rom an U ncharacterized G enome The efficiency of sequence capture probes designed from transcriptome assemblies may be affected by their position relative to exon exon boundaries. Expectedly, a probe designed from cDNA sequences that spans two exons will hybridize only partially to genomic DNA. In this study, probes were designed from 15,983 loblolly pine transcript sequences derived from the assembly of expressed sequence tags. For 4,500 transcripts, partial genomic sequences were available. Thus, i ncomplete gene models could be predicted by aligning transcripts to the corresponding genomic sequence with GMAP ( W U and W ATANABE 2005 ) Boundaries between exons were identified for 3,946 transcripts, and probes were designed to avoid these regions. For the remaining 12,037 transcripts, probes were designed to tile the entire transcript sequence, with no overlap between them. A custom pipeline was applied (see Experimental Procedures) to selected probes that minimize hybridization to multiple transcripts, resulting in a final set of 54,773 120 mer oligonucleotides (6.57 Mbp) r epresent ing 14,729 transcr ipts. The average length of each transcript was 660.2 bp. On average, 2.3 probes could be designed for transcripts where exon exon boundaries were predicted (3,615) and 4.2 probes for the remainder (11,114). Avoiding exon exon
35 boundaries resulted in shorte r sections where 120 mer probes could be designed for and, consequently, fewer probes. Efficien t S equence C apture of the L oblolly P ine G enome with D ifferent C omplexities Sequence capture may be less efficient when applied to conifers such as loblolly pine, because of the low target to genome size ratio. To assess the effect of genome complexity on the efficiency of sequence capture, the 54,773 custom designed probes were hybridized to ( i ) complementary DNA (cDNA) synthesized from megagametophyte mRNA, as we ll as ( ii ) haploid genomic DNA from megagametophyte DNA (1N), and ( iii ) diploid genomic DNA obtained from the needles of loblolly pine (2N). Sanger sequencing of the captured fragments and alignment (BLASTN E value = 1 10 10 ) to the reference transcript se quences showed that efficiency was similar regardless of the genome complexity 71, 87 and 83% of the fragments aligned uniquely to the reference for haploid cDNA, haploid DNA and diploid DNA, respectively (Table 2 1). To verify the capture efficiency in a larger sample, the haploid DNA sample was also sequenced in a Pacific Biosciences SMRT cell. A total of 25,912 circular consensus sequences (CCS) were generated, of which 65% aligned to the reference at the same threshold used for the Sanger sequences (T able 2 1) Further analysis suggested that regions of the genome containing paralogous captured from the haploid and diploid DNA samples. The median proportion of nucleo tide difference from the Sanger data to the probes was 0, 6.2 and 8.5% for haploid cDNA, haploid DNA and diploid DNA, respectively (Table 2 1). Pseudogenes are not likely to be represented in cDNA libraries, but genomic DNA populations may
36 contain these se quence relatives. Our results suggest that paralogs are also captured from the genomic DNA fraction Samples can be M ultiplexed and E fficiently C aptured Sequence capture is costly, laborious and time consuming. These obstacles limit its use for quantitativ e and population genetics studies, and other applications that require resequencing large numbers of individuals. To address these limitations, library preparation was customized to allow multiplexing of up to eight haploid samples in each hybridization. I barcodes, and samples with ligated adapters were pooled in equimolar concentrations prior to hybridization to probes. This simple approach reduces cost and increases throughput by almost one order of magnitude relative to the standard single sample hybridization procedure. The products of a multiplexed reaction were sequenced in a Pacific Biosciences SMRT cell and an average of 3,000 captured fragments were characterized from each individ ual (24,002 total CCS reads). Capture was reduced in multiplexed, compared to non multiplexed reactions, from 65% to 50% (Table 2 1), suggesting a trade off between increased throughput and efficiency. However, this trade off is compensated by the reductio n in cost and increase in throughput of sequence capture. Sequence C apture is R eproducible among U nrelated S amples Loblolly pine has very high levels of intra specific genetic diversity ( B ROWN et al. 2004 ) Such high diversity often hampers the development of molecular assays that require sequence conservation, and could result in variable capture efficiency among unrelated individuals. Here, t he reproducibility of multiplex sequence capture was asse ssed b y sequencing libraries from haploid DNA of 24 unrelated loblolly pine
37 samples (Figures 2 1, 2 2). A reproducible capture should result in the gene being sequenced at similar depth, across all samples of a population, for the probes targeting the gene On average, the coefficient of variation for sequencing depth of a transcript was 46.6% across the 24 haploid samples analyzed (Figure 2 2). At the gene level, 9,997 had one or more reads aligned to their sequence in all 24 samples analyzed (Table 2 2). When capture failed for a certain gene, it mostly failed across all the 24 unrelated samples. For 3,034 genes, none of the sequencing reads aligned to any portion of the gene across all the 24 samples (Table 2 2). Altogether, when performing multiplexed ca pture in a large set of haploid samples, the capture efficiency (measured as gene sequencing depth) showed low variation across samples. Probe D esign that A voids B oundaries B etween E xons L eads to H igher C apture E fficiency Initially, sequence capture probes could be designed to avoid boundaries between exons for 3,946 transcripts for which partial genomic sequences were available. After the completion of probe designs and hybridizations, a draft sequence of the P. taeda genome became available (v0.8). As a c onsequence, the position of boundaries between exons could be predicted for 2,749 transcript sequences used for probe design based on the draft genome assembly. For p robes designed in a region of boundary between exons, the m edian sequencing depth was 3.6 across the 24 haploid samples. For the same genes, probes that did not overlap two exons had a median depth of 11.2 ( t test p value 5.9x10 13 ) (Figure 2 3). The group of probes with introns predicted to occur within their sequences (N=4,937) was used to further investigate whether sequencing depth decayed as a result of the position of the intron relative to the probe The 120nt long probe was
38 assumed symmetric for this analysis and the 60 positions were evaluated, 1 being the outside of the probe and 60 being the middle of the probe. Figure 2 4 shows that there is a negative correlation (r 2 = 0.19) between median sequencing depth and position of the intron in the probe, as expected from the notion that an intron in the middle of the probe will hinder hyb ridization the most. The large variation observed in the data is probably a result of other factors not accounted for in this analysis, such as GC content, size of the intron, among others. Sequence C apture E xpands G ene M odels Sequence capture has been bro adly used to characterize genetic variation in exon regions. However, captured sequences typically expand beyond the target region because hybridized captured fragments are longer than probes. For species with an uncharacterized genome, this offers an oppo rtunity to expand gene models beyond the available coding sequence, to also include introns, and untranslated or putative regulatory regions. To develop such expanded models for loblolly pine, a bioinformatics pipeline was developed to utilize sequencing r eads that span the interval between two capture probes, and extend beyond it. Initially all reads were aligned to the reference sequences used for probe design, and those with similarity at both or single end (referred hereafter as orphaned reads) were ret ained to support the de novo assembly of each gene. The longest contigs were then annotated and a new reference was created. To avoid the mis assembly of paralogs and pseudo genes, the alignment step was performed at relatively high stringency (3% mismatch accepted). Furthermore, the data for all the 24 sequenced haploid samples was combined for the de novo assembly. A total of 11,396 expanded gene models were created, generated from 13,705 contigs. For the majority (9,288) of the expanded gene models, read s were assembled into
39 single contigs. However, for 1,907 and 201 expanded gene models, two and three contigs were generated, respectively. The average total contig size was 883 bp, ranging from 74 to 2,418 bp (Table 2 3, Figure 2 6). The goal of the de nov o assembly was to expand the gene models in pine. Four scenarios were considered during the annotation: (i ) expansion of the reference beyond (ii) complete assembly of an intron (complete_non_reference_fragment), (iii) pa rtial assembly of an intron (incomplete_non_reference_fragment), and (iv) end (three_prime_expansion) (Figure 5). On average, we were able to expand 141 the reference for 6,727 genes (Table 2 3). A set of 3,517 introns in 2,746 genes was completely sequenced. These have an average length of 122 bp, and a range of 9 to 425 bp (Table 2 3). Short candidate introns may represent coding sequence insertions abse nt in the reference. Finally, there were 8,609 instances in which introns were partially assembled (average length = 152 bp, Table 2 3) within 5,327 genes. These are likely to represent larger introns, where the sequence captured by the nearest probe was n ot sufficiently long to span the entire intron. For 967 out of 5,327 genes that contained partially assembled introns, we were able to connect adjacent incomplete intron fragments based on the following empirical rules: ( i ) two incomplete introns for the g ene are contained within two contigs; (ii) the first contig ends with an incomplete region and the second starts with an incomplete region; (iii) the two contigs are adjacent to each other relative to the reference (Figure 2 5B). When these connections bet ween incomplete introns were
40 detected, they were characterized as continuous incomplete non reference fragments, and characterized as discrete otherwise (Table 2 3). Probes D eveloped for L oblolly P ine are E fficient for S equence C apture of a R elated S pecies Because the capture between unrelated samples of P. taeda was highly reproducible, we tested the suitability of the probes in capturing homologous regions from the genome of P. elliottii Sequence capture was carried out as described for P. taeda i.e. f rom haploid megagametophyte DNA in three multiplex reactions with eight unrelated samples each, totaling 24 samples. Overall, the metrics of the capture were almost identical to that of P. taeda and, 98.9% of the probes that captured the expected sequence efficiently in P. taeda were also efficient in the capture of the same regions in P. elliottii (data not shown). Capture was also reproducible across the 24 haploid samples sequenced, with a sequencing depth coefficient of variation across samples of 45.3% at the gene level. A total of 11,104 genes were consistently captured in at least 20 of the samples and, similarly to P. taeda 3,138 genes were not captured in any of the haploid samples of P. elliottii A de novo reference was also created using the cap tured data and 13,566 contigs were formed for 11,160 genes. These results validate the flexibility of hybridization based target enrichment as a way to characterize the genome of closely related species using a common set of probes. Discussion Here we char acterized a reduced representation of the complex genome of the conifers loblolly and slash pine, using in solution target enrichment followed by next generation sequencing. We focused on a set of 14,729 genes for which transcript sequences were available for probe design. Considering an estimated haploid genome
41 size of 21.7 Gbp for P. taeda ( W AKAMIYA et al. 1993 ) and a probe space of 6.57 Mb, the target r egion represents only 0.03% of the genome. Sequence capture was evaluated on different genome complexities; including haploid cDNA, haploid genomic DNA and diploid genomic DNA. Overall, more than 70% of captured and sequenced fragments had sequence similar ity to the targets, regardless of the sample complexity. Similar success was reported by Saintenac et al. ( 2011 ) in the capture of 0.035% of the wheat genome, where 60% of the reads aligned to targets. Despite the early success in plant sequence capture ( C RONN et al. 2012 ) the procedure is laborious, limiting its use when the char acterization of a large numbers of individuals is required. Here, multiplexing the reaction to simultaneously capture eight samples minimized this limitation. While this approach seemed to result in some loss in capture efficiency (from 65% to 50%), the si gnificant gains in throughput and cost savings compensated for this drawback. These results are comparable to those reported in human genome studies, where multiplexed capture was also performed ( C UMMINGS et al. 2010 ; B ANSAL et al. 2011 ) Despite the complexity of the multiplex capture, there was good agreement between the fractions of the genome obtained from th e samples analyzed. The variation across genes, however, was high. While the mean sequencing depth of the experiment was 108, the median value was only 10 due to the presence of a few genes sequenced at extremely high depth. While attempts were made duri ng probe design to minimize cross hybridization, the high copy number of certain genes in conifer genome cannot be overlooked as an obstacle for suitable probe design. This limitation was also reported for wheat ( S AINTENAC et al. 2011 ) but l ittle can be done to avoid it when a reference genome is not available. During the unigene
42 annotation step we observed that the 100 most captured genes were enrichment for those annotated as being located in the chloroplast or as retrotransposon related pr oteins, and such genes should be avoided in future studies. Therefore, it is advised that new applications of sequence capture in such complex genomes begin with the analysis of few individuals and, if necessary, probes should then be redesigned to limit t he occurrence of repetitive regions inferred from the depth of coverage. Lack of a reference genome sequence also clearly contributed to lower capture efficiency among probes designed from sequences where the boundaries between exons were unknown. We obser ved that probes spanning these boundaries had lower sequencing depth ( Figure 2 3) and that depth was lower the more central those boundaries were relative to the probe ( Figure 2 4). For example, the sequence of gene 0_10007 (Figure 2 5A) was captured by th ree probes. An intron detected near the middle of the second probe (relative position 54), most likely contributed significantly to a reduction in median sequencing depth to 4.1, compared to 10.5 and 14.5 in the first and third probes, respectively. Alt hough other factors probably contribute to poor hybridization of probes to their targets, such as GC content ( 2009 ; B UNDOCK et al. 2012 ; Z HOU and H OLLIDAY 2012 ) we show that those spanning an intron junction are likely to have less favorable hybridization kinetics when hybridized to genomic DNA. It was noted by Gnirke et al ( 2009 ) that captured fragments can often be larger than the probes designed to recover them and, thus, regions beyond the probe targets will be capt ured and sequenced. In this study, approximately 35% of aligned Illumina paired i.e. only one of two mates aligned to the unigene reference used for probe design. Consequently, we tested if
43 sequence capture da ta could be used to expand gene models. For this, the preferred approach was to align the reads to the unigene reference, identify and recruit the orphaned mates to the aligned ones, and to perform an independent de novo assembly for each unigene reference A bioinformatics pipeline (Recruiter) was developed, that implements this approach in an automated fashion. A similar approach was recently reported by Lemmon et al ( 2012 ) on a comparatively simple genome and augmented with manual intervention to annotate the best contigs. With the fully automated Recruiter pipeline we reconstituted 13,705 high quality contigs for 11,396 genes with no manual in tervention. Newly assembled contigs were annotated in GFF files containing based reference (Figure 2 5). The expansion of gene models into non coding regions generated here also p rovides the first comprehensive assessment of the gene structure of conifers. Although the approach is biased towards assembling smaller introns, it suggests that loblolly pine appears to have most of its intron structures similar to other plants and diffe rent than humans ( H ONG et al. 2006 ) with an average length of 122 and 329 bp, depending on whether they were completely or partially assembled, respectively (Table 2 3) Unexpectedly, a large portion of the introns was only partially assembled (8,609 events for 5,327 genes), which suggests the presence of large introns in the genome of pine. In potentially capturing portions of the regulatory region of some genes. The summary results presented in Table 2 140 bp, respectively; and that complete introns and incomplete but continuous intr ons
44 were 122 and 329 bp on average, respectively. Thus, relative to the average original significant contribution for future assembly and characterization of the genome of conifers. Despite the magnitude and complexity of the conifer genomes, we describe how sequence capture was used to efficiently enrich and sequence the genic portions of th e genome of P. taeda We present a series of variables that affect capture efficiency and modified the approach to allow for higher throughput multiplexing the reaction, efficiently capturing 11,695 genes from a genomic background of eight haploid genomes (about 173.6 Gbp). Finally, using the probes designed for P. taeda we show that the interspecific capture of P. elliottii worked equally well in retrieving the target genes for this closely related species. Coupled with our de novo assembly and capture mu ltiplexing strategies, this provides a proof of concept for how sequence capture can be used to quickly expand genome research in uncharacterized and complex genomes. As further improvements to the method are developed, and novel cost reducing modification s are implemented, such as the method proposed by Querfurth et al. ( 2012 ) sequence capture presents a powerful method for characterizing genetic diversity in thousands of selected regions of complex plant genomes.
45 Table 2 1. Validation of the capture efficiency in pine ac ross different genomic complexity. A sample of the captured fragments (N) was sequenced using Sanger and/or Pacific Bioscience methods. Genomic complexity Sanger PacBio N Aligned* 1 Nucleotide Diversity (%) N Aligned* 1 Mean length (bp) Mean (sd) Media n Haploid cDNA 75 53 (71%) 3.0 (5.5) 0 NA Haploid gDNA 77 67 (87%) 7.7 (7.6) 6.2 25912 16883 (65%) 148 Diploid gDNA 83 69 (83%) 10.3 (8.4) 8.5 NA Haploid gDNA multiplex* 2 NA 24002 1506 (50%) 170 1 WU BLASTN alignment against unigene reference at e 10 stringency. 2 Sequence capture reaction performed on a pool of 8 individually barcoded samples. Aligned results are an average of the 8 samples.
46 Table 2 2 Reproducibility of the capture on 24 unrelated haploid samples of P. taeda Average sequenci ng depth was calculated for each individual and used to generate the results shown. A 100% level means that all 24 samples had at least one read aligning to the target, whereas 0% level indicate the opposite, no reads aligning to the target. Target Percent age of samples with target sequenced Total 100% 99 80% 79 1% 0% Probe level 26341 9678 4922 13832 54773 Gene level 9997 1058 640 3034 14729
47 Table 2 3 Summary of the de novo assembly of the sequence capture data. Contigs formed were annotated and e mpirical features were created to represent each possibility of the alignment of the reference to the contigs. Feature Occurrences Number of Genes Size descriptive statistics (bp) Average Quantiles 0% 25% 50% 75% 100% Original unigene reference 14729 660 121 549 696 804 1567 Total contig size 13705 11396 883 74 668 834 1057 2418 five_prime_expansion 7460 7437 141 1 58 106 218 345 three_prime_expansion 6758 6727 140 1 67 117 183 359 complete_non_reference_fragment 3517 2746 122 9 94 113 145 4 25 incomplete_non_reference_fragment 8609 5327 152 1 107 160 198 1105 continuous* 1 1016 967 329 11 275.75 340 390 941 discrete 6577 4828 148 1 97 155 195 1105 1 Sizes calculated based on the sum of the two adjacent incompete_non_reference_ fragments.
48 Figure 2 1. Efficiency of multiplexed capture in Pinus taeda Shown is the p ercentage of the Illumina sequenced reads that align to the predicted targets (Y axis) when increasing values of mismatch between the reads and the reference is acce pted (X axis). Average and standard deviation shown for 24 haploid samples.
49 Figure 2 2. Reproducibility of the multiplexed capture in Pinus taeda The average sequencing depth for target genes was calculated for each haploid samples. Coefficient of va riation for sequencing depth across samples was calculated (X axis) for the subset of the genes with at least 80% of the samples showing positive depth.
50 Figure 2 3 Effect of the presence of an intron in the probe region on capture efficiency. Box plot graphs show the variation of sequence depth (Y axis ) for probes with or without an intron predicted to exist in their sequence. Data presented on the plot was truncated at 50 (N=4,937 and 8,574 for probes with and without intron, respectively).
51 Figure 2 4 Reduction of capture efficiency as a function of intron position within the probe. The 120nt long probe was considered symmetric, totaling 60 relative positions (X axis). The Y axis represents the median sequencing depth for each position and the ve rtical bars represent median absolute deviations. Each relative position had a minimum of 60 observations.
52 Figure 2 5. Examples of de novo assembly contig annotation. The contigs formed by de novo assembly were annotated relative to the reference to pr oduce a General Feature Format file (GFF), which is shown for two representative genes. Panel A, for gene 0_10007 shows a single contig and is illustrates an gene 0_18295, shows two contigs and an example of two incomplete introns that can be linked based on their relative position to dashed lines, original unigene reference by solid t hin lines and de novo assembled contigs by solid thick line, light gray bars represent expansions of the reference.
53 Figure 2 6. Size distribution of the annotated contigs produced by the Recruiter de novo assembly pipeline. Total contig size (X axis) i s characterized by the sum of each individual contig assembled for a given gene.
54 CHAPTER 3 A HIGH RESOLUTION GENE MAP OF LOBLOLLY PINE (PI NUS TAEDA L.) BASED ON EXOME SEQUENCE CA PTURE 1 Background Loblolly pine ( Pinus taeda L.) covers 11.7 million hectares of natural and planted forests in the Southeastern United States and provides 58% of the U.S. and 16% of the ( W EAR and G REIS 2002 ) Loblolly pine is also an important species for comparative studies between gymnosperms and angiosperms, and genomic resources are becoming increasingly availabl e to enable these studies ( M ACKAY et al. 2012 ) For instance, single nucleotide polymorphisms (SNP) and microsatellites have been identified and applied to genera te genetic maps ( E LSIK and W ILLIAMS 2001 ; E CKERT et al. 2009 ; E CHT et al. 2011 ) identify population genetic parameters and associations to phenotype ( G ONZALEZ M ARTINEZ et al. 2007 ; E CKERT et al. 2010 ; S TEWART et al. 2012 ) and develop genomic selection prediction models ( R ESENDE et al. 2012 ) However, the number of available genetic markers remains small, particularly considering the large size of the loblolly pine genome. Advances in high throughput DNA sequencing and genomic tools are making it possible to genotype large numbers of individuals by sequencing reduced representations of the genome ( D AVEY et al. 2011 ) For example, targeted resequencing after in solution sequenc e capture ( G NIRKE et al. 2009 ) allows for enrichment of pre selected regions of the genome and sequencing of those regions to detect variants. Using this method, probes complementary to the target regions of the genome are designed and hybridized to genomic DNA for sequence capture and subsequent 1 This chapter has been submitted to a peer reviewed journal in the field of genetics and genomics
55 sequencing. Sequence capture is being optimized for an increasing number of plant species ( S AINTENAC et al. 2011 ; B UNDOCK et al. 2012 ; Z HOU and H OLLIDAY 2012 ) including conif ers ( N EVES et al. 2013 ) To evaluate the potential of targeted resequencing for genotyping in loblolly pine, we used sequence capture to detect polymorphisms based on the segregation of alleles in a recombining population. The analysis of a recombining population allows for the construction of genetic maps, where markers are grouped and ordered based on observed recombination frequencies. When the markers are located within genic regions, the genetic map provides the relative position of genes in chromosomes ( D ROST et al. 200 9 ; N EVES et al. 2011 ) which is particularly relevant for loblolly pine or other species without a reference sequence. High density, gene rich genetic maps are useful to examine genome synteny with other species ( K RUTOVSKY et al. 2004 ) ; for detection of quantitative trait loci (QTL) and identification of candidate regulatory genes ( B ERNARDO 2008 ) ; and to support genome assembly ( T USKAN et al. 2006 ; S ATO et al. 2012 ) Current ly, the best resolved loblolly pine genetic map was reported by Eckert et al ( 2010 ) which positioned 1,635 genes, genotyped from a starting SNP panel of 7,535 markers. More recently, Echt et al. ( 2011 ) reported the identification of large numbers of microsatellites and their use to create a moderately dense map with 429 markers. A subset of 311 of these microsatell ites was derived from cDNA and, therefore, reflects the position of the corresponding genes. The goal of this work was to genotype a loblolly pine mapping population using sequence capture to analyze the efficiency of this method for polymorphism detection and to create a high density, gene rich genetic map with the markers obtained. The
56 mapping population consisted of 72 haploid megagametophytes extracted from seeds of a single tree. In pine seeds, the haploid DNA from individual megagametophyte represent s the product of recombination of the maternal genomic DNA. Sequence capture was performed in each haploid sample and 7,842 markers were detected that segregate in the population, representing 4,195 genes. The best marker for each gene was selected and 2,8 43 genes were ordered in a genetic map, with an average of one marker every 0.58 cM(K). Sequencing depth data was utilized to detect 408 genes segregating for presence/absence variation of which 65 were ultimately mapped in the genetic map. The comparison of our genetic map with the previous map of Eckert et al. ( 2010 ) via 397 common genes showed high levels of agreement at the grouping and ordering level between the two maps. Experimental Procedures Mapping Population, DNA E xtraction and S equence C apture Seventy two seeds were collected from the reference loblolly pine genotype 17 ( K AYIHAN et al. 2005 ) and constitute the mapping population. Each seed was manually dissected to extract the haploid, maternally inherited, megagametophyte tissue. Megagametophytes were lyophilized and DNA extracted using the DNeasy Plant Mini ty was analyzed by visual inspection on agarose gel stained with ethidium bromide, and quantified by the Quant it PicoGreen dsDNA Assay Kit (Invitrogen). Libraries for each haploid sample were created as described previously ( N EVES et al. 2013 ) Briefly, DNA was sheared, end repaired and adenylated to allow for ligation of Illumina compatible adapters that conta size selected (250 500bp) on agarose gel and enriched by PCR using universal
57 primers. Libraries from eight barcoded haploid samples were combined in equimolar amounts and used as input for the s equence capture hybridization reactions following the Agilent SureSelect protocol. The probe set used for sequence capture contained 54,773 probes representing 14,729 Pinus taeda unigenes ( N EVES et al. 2013 ) Target enriched libraries were sequenced using a variety of Illumina machines and outputs. Sequencing D ata F iltering and A lignment Raw reads for each seque ncing lane were separated based on their individual barcode using the FASTX Toolkit barcode splitter end of the reads were trimmed to remove low quality bases using t he FASTX Toolkit fastq quality trimmer with parameters t 20 score < 20 and discard read if resulting read is shorter than 50bp) for reads with 100 bp or longer, and t 20 l 30 for reads with 40 bp. For samples sequenced in paired end mode, the order of the reads in each mate file was re synchronized using custom scripts, and reads without an appropriate mate were treated as single end. The filtered reads were aligned to the unigene reference used for probe desi gn with Mosaik Aligner ( http://bioinformatics.bc.edu/marthlab/Mosaik ) with parameters: maximum percentage of mismatches 0.05 ( mmp 0.05), unique alignment mode turned on ( m unique), most accurat e hashing strategy ( a all) and hash size 15 ( hs 15). Multiple alignments for the same sample were combined using BamTools ( B ARNETT et al. 2011 ) into a single alignment. Detection of SNP and PAV Single nucleotide polymorphisms, short insertions and deletions (short indels) and multi nucleotide polymorphisms (MNPs) were detected using all individu als of the
58 mapping population simultaneously with Freebayes v0.8.7 (http://bioinformatics.bc.edu/marthlab/FreeBayes), with parameters: SNP probability 0.75 ( -pvar 0.75), pairwise nucleotide diversity 0.01 ( -theta 0.01), haploid mode ( -ploidy 1), accepti ng indels ( -indels) and multi nucleotide polymorphisms ( -mnps), minimum alternative allele frequency of 0.8 ( -min alternate fraction 0.8), and minimum alternative allele count of 1 ( -min alternate count 1). For a polymorphism to be considered for mappi ng, reference and alternative alleles had to be observed in at least 50 individuals of the population, and the alleles had to segregate 1:1 in the population test p value 0.05). If more than two alleles were observed at a polymorphic position, a proble m that might arise due to contamination with diploid tissue (e.g. embryo), paralog capture and sequencing error, the individuals with the two most frequent alleles in the population were considered in further analysis and the remaining were categorized as missing data. To detect presence/absence variants (PAV), a sequencing depth matrix was calculated with genes as rows, haploid samples as columns and every observation being the average number of times each position of the gene was sequenced. The observatio ns were coded as 1, when at least one read aligned to the gene region, or 0, when no read aligning to the gene region was detected. A test for 1:1 segregation for the presenc e of the allele was performed ( 2 test p value 0.05) and PAV markers passing these criteria were used for mapping. Genetic M ap C onstruction The SNP, short indel, MNP and PAV markers that exhibited segregation in the population were combined, and in case of gene redundancy, the marker wi th the least amount of missing data was used as a representative of the gene. The genetic map was constructed with JoinMap 3.0 ( V AN O OIJEN and V OORRIPS 2001 ) with parameters:
59 =1, using the Kosambi mapping function. JoinMap 3.0 builds a genetic map adding markers in three successive steps, first removing markers that reduce the map goodness of fit or creates negative distances in the first and second steps, and then adding all markers ignoring the thresholds of goodness of fit and negative distances. The complete m ap was generated using all three rounds to order all linked markers to the map. For a conservative map, we stoped the mapping process after round two to provide a better estimation of gene order as this removes markers that contribute to unstable ordering. The group number (1 to 12) and the relative orientation of the markers within the group were defined following the genetic map published by Eckert et al. ( 2010 ) to facilitate comparisons between studies. Genome length and coverage were estimated following Echt et al. ( 2011 ) The genome length was estimated using method 4 of Charkravarti et al. ( 1991 ) where the observed Kosambi genetic distance of each linkage group (LG) was adjusted by multiplying it by (m+1)/(m 1), where m is the number of markers mapped to the linkage group. Genome coverage (c), defined as the proportion of the ge nome contained within a distance d (cM) from a marker was estimated by where, m is the number of markers mapped and L is the total estimated genome length ( L ANGE and B OEHNKE 1982 ; R EMINGTON et al. 1999 ; E CHT et al. 2011 ) Results Genotyping of a M apping P opulation with S equence C apture We used a multiplexed sequence capture method ( N EVES et al. 2013 ) to characterize genetic variants in 14,729 ge nes. Seventy two seeds were collected from a single female tree and the maternally inherited, haploid megagametophyte tissue was
60 used for library preparation. The use of DNA from the seed megagametophyte provides an effective way to assess recombination fr equency between genetic markers, without the need for crosses ( A DAMS and J OLY 1980b ) The 72 haploid samples were captured in nine pools of eight megagametophytes each, and sequenced with a median depth of 2.8 across 54,773 probes (average of 3.7 probes/genes). Of the 14,729 genes targeted, 6,283 (43%) were not captured and/or sequenced in any sample. This occurred either because these probes failed to hybridize to genomic DNA, or because the capt ured fragments were not sampled in the sequencing process. Probes were designed primarily based on a transcriptome assembly, and in a previous analysis of probe performance ( N EVES et al. 2013 ) we discovered that a significant number of them were complementary to the boundary of two exons, justifying the high failure rate and illustrating a limitation of designin g probes for species with uncharacterized genomes. For the remainder 8,446 genes, sequence capture was successful for at least one probe, for at least one individual. Of these, 4,123 were sequenced in 50 or more haploid samples of the population. If sequen cing data complementary to a given probe was available for fewer than 50 samples, the locus was excluded from further analysis. Therefore, from the 14,729 genes initially considered for capture, 28% (4,123) were actually captured in enough individuals, and sufficient sequencing depth was generated to allow identification of segregating sequence variants (i.e., SNPs, short indels and MNPs). The analysis of the sequence data identified 7,434 markers segregating 1:1 in at least 50 haploid samples of the popul ation. These markers represented 3,787 genes (Table 3 1). The majority of these markers were SNPs (6,857), while short deletions,
61 short insertions and multiple nucleotide polymorphisms accounted for 210, 208 and 159 markers, respectively. The average seque ncing depth for each gene was utilized to identify possible PAV ( S WANSON W AGNER et al. 2010 ) The average sequencing depth at the gene level was used to compensate for variations that exist at the probe level and 408 putative PAVs segregating 1:1 in the mapping population were considered for mapping (Table 3 1). Altogether, 7,842 segregating markers were identified for 4,195 genes, with an average of 2 mar kers per gene and a range of 1 to 13 (Table 3 1). A list of the markers and their allele representation within the population is available upon request Because the presence of multiple markers per gene provided redundant data for mapping, the marker with the least missing data was selected for the gene, resulting in a final set of 4,195 markers for mapping. A P. taeda G enetic M ap with 2,841 G ene M arkers The segregation of 4,195 markers in 72 haploid samples was used to construct a high density, gene based genetic map (Figure 3 1). The final map comprises 2,841 genes mapped to the expected 12 linkage groups of the species. The number of genes mapped in each linkage group varied from 193 genes for LG 1, to 291 genes in LG 8, with an average of 237 genes per l inkage group. Out of the 2,841 markers mapped, 2,622 (92%) were SNPs, 63 (2%) were short deletions, 53 (2%) were short deletions, 38 (1%) were multiple nucleotide polymorphisms and 65 (2%) were gene PAVs (Table 3 1). The final map spans 1637.4 cM, which is comparable with previous estimations of map length for the species ( R EMINGTON et al. 1999 ; E CKERT et al. 201 0 ; E CHT et al. 2011 ) LG 11 spanned the longest genetic distance with 190 cM, while LG 5 spanned the shortest, with 105.1 cM. On average, linkage groups spanned 132.9 cM (Table 3 2). The average interval between markers was 0.58 cM, across 12 linkage groups, with
62 95% of the intervals smaller than 1.5 cM, and a maximum interval of 10.9 cM at the end of linkage group 2. Following previous methods ( R EMINGTON et al. 1999 ; E CHT et al. 2011 ) we estimated the genome length at 1651.5 cM, and the map coverage as 99%. For this marker density, the estimated map coverage (c) predicts that any locus on the genome has 99% probability of being within 3 cM of a marker and 82% probability of being within 1 cM of a marker. A map with a more conservative marker order, obtained by stopping the mapping process at Round 2 of mapping on JoinMap, was generated and is available upon request When considering this map, there are still 1,371 genes mapped with an average of one marker every 1.3 cM, providing an estimated map coverage (c) that predicts that any locus on the genome has 98% prob ability of being within 5 cM of a marker. Map V alidation The high density map developed provides an overview of the grouping and orientation of 2,841 genes in each loblolly pine chromosome. However, because the number of markers far exceeds the number of m eiotic recombinations sampled to generate the map, it can be expected that their order might present incongruences with the physical position on the genome. To analyze the consistency of our map, we compared it with the genetic map built by Eckert et al. ( 2010 ) hereafter referred to as Eckert map comparison should be done relative the genome sequence of loblolly pine, but the current assembly (version 0.8) is unor dered and still largely fragmented. Thus, the comparison relative to another gene based genetic map remains the only viable option to validate our results. The Eckert map was generated using SNPs detected using an Illumina Infinium genotyping assay, with m arkers selected from a unigene dataset that
63 overlapped with the one used in our sequence capture probe design. Although the SNPs analyzed within the genes were not the same, a total of 397 genes had markers in common between the two maps, and could be used for a comparative analysis. The analysis occurred in two steps: (1) First, we analyzed whether the common genes mapped to the same linkage group. (2) Next, we evaluated if the order of the genes within the linkage group were similar between the two maps. In the first analysis, out of the 397 genes positioned in both maps, 392 (99%) mapped to the same linkage group as that observed in the Eckert map Thus, there is a nearly perfect agreement between the two maps at the grouping level (Table 3 3). For the c omparison of the genetic position of genes within linkage groups, the data is illustrated by plotting the normalized position of the common genes on our map (X) relative to the normalized position on the Eckert map (Y) (for instance, LG 2 and LG 4 in Figur e 3 2). In the comparative analysis, the position of each gene was normalized to control for the different length of the two maps. For example, a gene mapped to position 61.56 cM on LG 2, which has a length of 123.12 on our map (Table 3 2), would have a no rmalized position of 50%. The same approach was used for genes in the Eckert map Assuming gene synteny between the populations used for mapping, the expected results would be a linear relationship between the markers. For LG 1, 3, 4, 8, 9, 10 and 11 all t he common genes were collinear across the entire e xtension of the group (Figure 3 2 ). For LG 2, 6, 7 and 12, most of the genes were collinear, but a block defined by several genes shows incongruences between the two maps (Figure 3 2). Interestingly, in all cases, these blocks were previously mapped to the beginning or end of the linkage group in the Eckert map and were mapped towards the middle of the linkage group in
64 our map. In only one linkage group, LG5, there is low agreement between the two maps ( Figu re 3 2). Thus, overall the results show a high agreement between the genes mapped in the two maps, both at the grouping level and at the ordering level. Discussion The two main objectives of this study were to evaluate the potential of targeted resequencin g based on sequence capture for polymorphism detection in loblolly pine, and to expand our knowledge about the position of genes in the pine genome. Loblolly pine is one of the most economically and ecologically important species within the conifers and it s genome is currently being sequenced (http://pinegenome.org/pinerefseq). The large and complex pine genome, characterized by large retrotransposons expansions and high nucleotide diversity ( B ROWN et al. 2004 ; B URLEIGH et al. 2012 ) offers challenges for polymorphism genotyping using traditional platforms, resulting in low rates of SNP conversion and high rates o f missing data ( E CKERT et al. 2009 ) From previous work ( N EVES et al. 2013 ) we expected that probes utilized for sequence capture could be capturing non unique regions in the pine genome (e.g. paralogs and pseudogenes), hampering the detection of site specific polymorph isms. The use of a mapping population provides a method to verify detected polymorphisms, because alternative alleles are expected to segregate in equal proportion (1:1) in a sample of megagametophytes. Even adopting a shallow sequencing depth, a total of 7,434 sequence variant markers for 3,787 genes (Table 3 1) were detected segregating in the population. From the 14,729 genes initially considered for capture, we estimated that 4,123 (28%) had sequence properties that allowed us to detect sequence variati on markers, characterized by having at least 50 individuals with each having one or more reads aligned to the gene. Thus, given
65 appropriate sequence capture efficiency and sequencing depth, 92% (3,787 / 4,123) of the genes contained at least one segregatin g marker that was used for mapping in the population. This is a high SNP conversion rate considering that no prior information was available for the population. The megagametophyte used as source of DNA sequence capture is a haploid, maternally inherited t issue that is commonly used in conifer mapping studies ( R EMINGTON et al. 1999 ) Thus, a single haplotype was anticipated for any given locus. The pipeline for detection of sequence polymorphism required that the most common allele detected in a given megagametophyte be pre sent in at least 80% of the reads. In other words, if a particular position was sequenced 5 times, a minimum of 4 out of the 5 bases need to be the same in order to report the position as a putative SNP. This step was implemented to exclude possible errors added during library construction and sequencing, and to avoid cases where unspecific paralogs and pseudogenes were also being captured, which is likely considering their abundance in the pine genome ( K OVACH et al. 2010 ) Although we did not formally quantify instances where the latter happens, we observed it while optimizing the bioinformatics pipeline. For example, only reads aligned to the unigene reference with a maximum of 5% of mismatches were considered for polymorphism detection and mapping. At this threshold, we detected the 7,434 sequence polymorphisms reported. When up to 10% of mismatches to the unigene reference were permitted for use of a sequencing read, there was a significant re duction in the number of polymorphisms detected that segregated in the expected ratio in the population. Accepting 10% of mismatches lead to the detection of only 4,563 polymorphisms, represented within 2,210 genes, that segregate in a 1:1 ratio. This
66 sugg ests that bona fide polymorphisms were confounded by alleles from paralogs and pseudogenes mapped with less stringent alignment criteria. The large size of the pine genome of pine is attributed to retrotransposon activity, rather than whole genome duplicat ions ( M ACKAY et al. 2012 ) Based on the dynamic nature of retrotransposons, we hypothesized that a subset of the pine genes might be present in multiple copies in the genome, as observed in maize and another genomes characterized by retrotransposon genome expansions ( B ELO et al. 2010 ; S WANSON W AGNER et al. 2010 ) To test this hypothesis, we aimed to detect presence/absence variants (PAV). In the specific case of a maternally derived haploid segregating population, these are hemizygous genes that segregate in the pop ulation with 50% of the haploid samples containing the gene in their genome, and 50% with the gene absent in their genome. We focused on PAVs, because these events are easier to be detected than copy number variations ( T EO et al. 2012 ) particularly considering that samples were sequenced at low depth. In our analysis, a gene was considered absent when no reads aligned throughout the entire gene region, and present w hen at least one read was aligned to the gene region. Using this simple analysis criteria allowed us to detect 408 putative PAV. However, only 16% (65) of the PAVs could be mapped, suggesting that many of PAVs were false positives or contained genotyping e rrors that excluded them from mapping. These false positive or genotyping errors were expected as a consequence of the low sequencing depth, and we anticipated that the genetic mapping process would be able to filter those out, providing a more accurate su bset of PAVs in P. taeda
67 The final genetic map was obtained by analyzing the segregation of 4,195 markers in 72 haploid samples, each representing a unique gene of the P. taeda genome. Ultimately, 2,841 (68%) genes were mapped in the expected 12 linkage g roups. To the best of our knowledge, this is the most saturated genetic map in a conifers or trees in general. Most importantly, because this is a gene based marker, it provides links with other published maps for loblolly pine, allowing for the constructi on of a future combined map for the species. The genes mapped were not selected based on any specific criteria. Thus, the map provides an unbiased representation of the loblolly pine gene set that can be used for studies that benefit from a preliminary ord ering of the genes until a reference genome sequence of reasonable quality becomes available. This high density map should also be valuable when anchoring and orienting scaffolds of the reference genome of P. taeda that is being sequenced, as done for othe r plant species ( T USKAN et al. 2006 ; S ATO et al. 2012 ) The comparison to a previous map of P. taeda ( E CKERT et al. 2010 ) with 397 common genes, showed high level of agreement between them at the grouping and ordering level (Table 3). These results provide strong evidence that the bioinformatics selection of markers was able to correctly genotype the mapping population, despite the complexity of the genome of pine. For some linkage groups (2, 6, 7, 8 and 12; Figure 3 2 ), blocks of adjacent markers that were previously assig ned to the edge of the linkage group in the map of Eckert et al. were mapped towards the center of the linkage group in our map. Such blocks of non collinear markers between genetic markers on pinaceae have been observed in other studies ( K RUTOVSKY et al. 2004 ; E CKERT et al. 2009 ; E CHT et al. 2011 ) and were usu ally attributed to genotyping errors. Although we cannot test
68 for genotyping errors, we did not observe higher rates of missing data in the markers that are out of order compared to the collinear ones (data not shown). Also, from simulated experiments, it is expected that genotyping errors would inflate the length of the genetic map ( H ACKETT and B ROADFOOT 2003 ) but this was not observed in our map. Therefore, we are unable to conclude the causes for non collinear block of markers, but attributing it to real genom e rearrangements will require further experiments based on larger number of individuals or whole genome resequencing. Genotyping samples of a large population using sequence capture presents some advantages over conventional genotyping methods. For example sequence capture does not require one to previously characterize the polymorphisms to be detected. We show that this is of greater importance for a segregating population, because even if previously characterized polymorphisms are available, it is not kn own which of those sites will segregate in a single bi parental segregating population and have the potential to be mapped. In this study, the probes used for capture were designed to span the entire unigene length, and we were able to detect an average of two polymorphic sites segregating in the population, for every gene. Thus, the chances of detecting a marker to map the gene increased and we were able to map 68% of the genes containing a segregating marker (2,841/4,195), even adopting an overall shallow depth of 2.8. Sequence capture is also not limited by the presence of polymorphisms closely associated to the site of interest, a common limitation for developing traditional genotyping methods in highly polymorphic species, such as pine ( E CKERT et al. 2009 ; G RATTAPAGLIA et al. 2011 ) For several genes we detected additional polymorphic sites within 60 bp, which would normally not be genotyped and not mapped. Finally, another
69 advantage illustrated is the potential to detect other types of genetic variants than only SNPs, such as the presence/absence markers that we were able to detect and map for 65 genes. Although these markers constituted a small percentage of the mapped markers, we anticipate that sequencing at higher depth would increase this number and enable the detection of genes with higher level of copy number, as it is starting to be noted by some studies ( S AINTENAC et al. 2011 ; V ALDES M AS et al. 2012 ) Plus, there is also the notion that CNVs migh t explain a portion of phenotypic variation not samples by SNPs, justifying its characterization ( M ARON et al. 2013 ) We report the application of exome capture on loblolly pine for polymorphism detection and gene mapping by analyzing a mapping population of 72 haploid samples. An important advantage of this approach for polymorphism detection is that it allowed for the enrichment and sequencing of several segments of the gene being targeted, which increased the chances of detecting a marker segregating in the population. As a result, we were able to detect at least one se gregating marker for 4,195 genes, despite the shallow sequencing depth employed. The markers were used to generate a gene rich genetic map that ultimately positions 2,841 genes with an average intergenic distance of 0.6 cM. Also, because exome capture is a genotyping by sequencing method, we show that other types of markers can be identified segregating in the population, such as gene presence/absence variation. Sixty five genes were mapped using the segregation in the population of their presence/absence i n the genome. This is particularly important, because it highlights the possibility to study gene dosage using exome capture in plants, which traditionally has required the additional efforts of a separate experiment. To our knowledge, this is the most sat urated map in conifers and
70 trees in general. Because this is a gene based map, we expect that it will help in the process of improving the quality of the pine reference genome that is currently being generated, similar to what has happened to other species as well as other applications that benefit from a high density, gene rich genetic map.
71 Table 3 1 Description of markers segregating in the population and mapped. Markers identified based on sequence variation were subdivided in single nucleotide poly morphism (SNP), short insertions and deletions (Short Indels) and multiple nucleotide polymorphism (MNP). # Markers represent the number of markers that segregated 1:1 in the mapping population and # Genes is the number of unique genes represented by these markers. Marker Class Segregating 1:1 Genes Mapped # Marker # Genes Sequence variation 7434 3787 2776 SNP 6857 2622 Short deletions 210 63 Short insertions 208 53 MNP 159 38 Presence/Absence variation 408 408 65 Total 7842 4195 284 1
72 Table 3 2 Description of genetic map provided. A total of 2,841 genes were mapped to the expected 12 linkage groups of loblolly pine based on recombination data from 72 haploid samples. Linkage group No. genes mapped Length (cM) Inter marker distanc e (cM) Mean SD 1 193 144.72 0.75 0.65 2 260 123.12 0.48 0.39 3 205 113.29 0.56 0.44 4 195 124.79 0.64 0.54 5 263 105.07 0.4 0.62 6 230 117.54 0.51 0.83 7 242 125.03 0.52 0.71 8 291 118.53 0.41 0.41 9 223 169.32 0.76 0.91 10 228 165.71 0.73 0. 58 11 271 190 0.7 0.62 12 240 140.29 0.59 0.59 Average 236.75 136.45 0.58 0.61 Total 2841 1637.41
73 Table 3 3 Marker validation at the grouping level. Comparison of the maps from Eckert et al ( 2010 ) and the present map. The fe w genes mapped to different linkage groups between the two studies are provided in the last column. Linkage group No. genes at same group No. genes at different group Genes linked to different groups 1 23 0 2 40 0 3 35 0 4 24 1 Gene=0_9680; Neves=LG 4, Eckert=LG 8 5 33 1 Gene=0_7496; Neves=LG 5, Eckert=LG 11 6 32 0 7 31 1 Gene=0_5740; Neves=LG 7, Eckert=LG 12 8 45 0 9 30 1 Gene=UMN_213; Neves=LG 9, Eckert=LG 7 10 35 1 Gene=0_6106; Neves=LG 10, Eckert=LG 6 11 30 0 12 34 0 Total 392 5
74 Figure 3 1 High density genetic map for loblolly pine. Final map comprises 2,841 genes positioned in the expected 12 linkage groups of the species with an average of a gene every 0.6 cM and a total of 1637.4 cM. Linkage group name and relative order of t he genes follow that published previously by Eckert et al. ( 2009 ) with common genes between the two maps shown in green. Genes for which presence/absence variati on was identified and used to map the gene is show in green and have suffix PAV (e.g. 2_6131_PAV, for positio n 337 of the unigene). The genetic map graphical visualization was prepared using MapChart 2.2 ( V OORRIPS 2002 )
75 Figure 3 1 Continued.
76 Figure 3 1 Continued.
77 Figure 3 2. Validation of the order of selected genes on the genetic map. The normalized order of common genes between our study (X axis) and those mapped by Eckert et al. ( 2009 ) (Y axis) were plot fo r two representative linkage groups, left panel showing linkage group two and right panel showing linkage group four. Assuming genes syntheny between the two populations, a straight line is expected and illustrates perfect agreement at the gene ordering le vel between the two maps. Similar graphs for all 12 linkage groups are provided in Supplementary Figures 1 and 2.
78 CHAPTER 4 POPULATION LEVEL GENOTYPING OF COPY NUMBER VARIATION (CNV) IN POPLAR TREES BASED ON EXOME SEQUENCING Background Although plants are sessile organisms, they present remarkable adaptation to a variety of environments. The diversity of the genome of plants might explain their broad plasticity ( F AWCETT et al. 2009 ) For instance, most flowering plants have undergone one or more whole genome dupli cation (WGD) event s, a nd this increase in gen ome content is often associated with better agricultural performance or fitness ( C OMAI 2005 ; J IAO et al. 2011 ) Even in diploid species, individuals can have large regions of the genome (>1Kbp) present in multiple copies that may regulate phenotypic traits This category of structural variation is referred to as copy number variation (CNV) ( S CHERER et al. 2007 ) and its analysis has been mostly explored in human genetics ( G IRIRAJAN et al. 2010 ) In plants, CNV studies are restricted to few reference lines of Arabidopsis ( D E B OLT 2010 ) maize ( S PRINGER et al. 2009 ; B ELO et al. 2010 ) and rice ( Y U et al. 2011 ) and a population level CNV analysis is stil l missing. Traditionally, CNVs have been detected using real time PCR (RT PCR), comparative genomic hybridization (CGH) and high density SNP arrays ( G IRIRAJAN et al. 2010 ) However, because high density arrays that genotype millions of sites are not available for the majority of plant species whole genome CNV studies have mostly used custom CGH arrays. T he need to set up a dedicated experiment for CNV detection using CGH has limited its full application. One alternative to this limitation is the detection of CNV using massive parallel whole genome sequencing ( M ILLS et al. 2011 ) but performing it in the complex and often large genome of plants is still hindered by the high cost and data analysis challenges. The goal of this research was to utilize exome
79 sequencing to characterize gene CNV in a large population of poplar ( Populus deltoides Bart. ex Marsh.). Exome sequencing is increasingly being applied to plants Thus, using the data generated for CNV detection in plants is particularly advantageous. Exome sequencin g does not require detailed genomic resources to be available (e.g. complete genome sequence), it reduces the complexity of the sequencing and data analysis, and the data obtained from the same experiment can be used for detection of other type of polymorp hisms and genomic characterization ( S AINTENAC et al. 2011 ; C RONN et al. 2012 ; Z HOU and H OLLIDAY 2012 ; N EVES et al. 2013 ) Copy number variation and other forms of structural variation have been implicated in many diseases in humans ( G IRIRAJAN et al. 2010 ; T ANG and A MON 2013 ) It is also known that gene CNV correlates with gene expression ( S TRANGER et al. 2007 ; H ENRICHSEN et al. 2009 ) suggesting a significant role i n phenotyp ic variation Regions of structural variatio n spanning genes are frequent in the genome of plants, with 2,109 to 4,538 genes showing putative CNV in a diversity panel of maize with 14 to 33 genomes ( B ELO et al. 2010 ; S WANSON W AGNER et al. 2010 ) and 500 genes between two contrasting rice genomes ( Y U et al. 2011 ) Although a comprehensive assessment of the extent to which CNV explains phenotypic variation in plants is not known, a few examples indicate its importance in abiotic stress tolerance ( D E B OLT 2010 ; M ARON et al. 2013 ) Recently, different methods have been developed to detect CNV from whole genome sequencing that are based on depth of coverage (DOC), paired end mapping (PEM) and de novo assembly ( A LKAN et al. 2011 ) Methods based on DOC are based on the assumption that a region with multiple co pies will be sequenced more frequently,
80 compared to those with single gene copies. In PEM, paired end reads are mapped to a reference genome and CNV is identified when the distance or orientation between the mates differs from than of the reference ( M EDVEDEV et al. 2010 ) Finally, the detection of CNV directly by d e novo assembly is promising, but limited by challenges associated with assembling short reads ( B AKER 2012 ) The data obtained from exome sequencing has fundamental differences from that of whole genome sequencing. First, because of the more complex library preparation that involves hybridization and more PCR cycles, the signal to noise ratio may be smaller. Second, regions of extreme GC content and sequence complexity add technical variance, because both capture and sequencing are affected by this variation. Lastly, exome capture data does not provide information about intergenic regions usually hindering the detection of CNV breakpo ints. Due to these particularities, detection of CNV from exome sequencing requires specific methods for detection many of which have been reported recently, all based on DOC. The main difference between methods is the pre processing of the sequencing dat a using different quality filter s and alignment algorithm s the normalization of the depth of sequencing dat a, the statistical approach for detection of CNV and, when implemented, the statistical approach for genotyping the CNV events in the population ( L OVE et al. 2011 ; F ROMER et al. 2012 ; K RUMM et al. 2012 ; W U et al. 2012 ) Here we describe another method for detection of CNV from exome data that is based on mixed model analysis of variance that we apply to an unstructured population of 121 individuals of Populus de ltoides The model takes into consideration capture efficiency of different probes for a gene and identifies genes for which at least one
81 individual has normalized DOC different than the population mean. Thus, the approach is capable of detecting rare as w ell as common CNVs. To genotype the individuals within a CNV state, an unsupervised clustering method was employed and a set of 300 high confidence genes with CNV were explored for functional characterization. We observed an enrichment of gene CNV in chrom osomes 17 and 19, as well as in a region of chromosome 19 related to sex determination. Genes with CNV were enriched for GO terms related to biotic and abiotic stress resistance, protein phosphorylation, signaling and cell communication. Altogether, the re sults show the possibility to detect gene CNV from exome resequencing data and a potential biological function of this source of genetic variation Experimental Procedures Plant Materials and DNA Extraction The analysis of CNV was performed in a geneticall y unstructured population of 121 individuals of Populus deltoides Bartr. ex Marsh. The trees were collected from the natural range of the species. For the initial development of the CNV detection pipeline, 10 random individuals were sequenced at higher dep th. DNA extraction for all individuals was performed on 96 well plates using the DNeasy Plant Mini Kit (Qiagen, www.qiagen.com/ ) Probe D esign for T arget E nrichment Sequence capture probes were designed for the regulatory and coding region of 19,964 genes known to be expressed in poplar stem based on a previous microarray study ( Q UESADA et al. 2008 ) Prior to designing the prob es for these targets, their sequence (derived from the reference genome sequence of P. trichocarpa ) was P.
82 deltoides A collection of 103 million 2x75bp quality filtered reads d erived by random shotgun sequencing a single P. deltoides individual were aligned to the P. trichocarpa reference genome Next, SNPs and indels were identified, and the P. trichocarpa reference sequence was modified to approximate as much as possible a P. deltoides reference Alignment of the reads to the genome was performed using Mosaik Aligner ( parameters: mmp 0.05, m all, a all, hs 15, mhp 100, bw 41 act 26 ) and intraspecific markers were detected with Freebayes ( parameters: -pvar 0.75, -indels -min alternate fraction 0.8 -min alternate count 3 ). Based on the corrected targets, region, characterized by 500bp upstream of the transcription start site. In instanc es where exons were smaller than the length of the probe (120nt), the probe was allowed to expand into the intronic region. Besides the gene region, a set of 22,864 probes was designed to capture putative neutral intergenic regions, spaced evenly along the poplar genome. For the design of these neutral probes, the reference genome of P. trichocarpa was divided in windows of 15 Kbp. Next, a random 120 nt region within each window was selected as the representative of that window, and tested for the following criteria: GC content between 0.3 0.65 (lower better) and unique alignment to the genome (blastn, e value 1x10 5 identity 70%, minimum alignment length of 60nt). If failed, a new random 120 nt region was selected until one passed the criteria or all posit ions had been interrogated. The final capture design had an average of 11. 2 probes per gene, with 204,180 probes representing 18,153 genes In addition, 22,864 15 Kbp windows had also one probe rep resented. Finally, 971 neutral probes were obtained from Zh ou
83 et al. ( 2012 ) to allow for comparison between poplar studies. Thus, in total, 227,943 probes were synthesized for sequence capture (Table 1). Library P reparation, C apture Hybridization and S equencing Illumina compatible libraries were prepared from 2ug of genomic DNA following methods described previously ( N EVES et al. 2013 ) Libraries from up to 12 individuals prepared wit h unique barcoded adapters were combined in equimolar amounts for multiplex capture hybridization. Except for multiplexing the sequence capture procedure, the remaining steps of the procedure Each captured pool was r e quantified using Quant iT PicoGreen dsDNA Assay Kit and diluted to 10 nM. The pools were sequenced in a lane of Illumina HiSeq 2000 in paired end mode with 100 cycles. To generate a dataset with higher sequencing depth, a subset of 10 individuals were se quenced across four lanes of HiSeq 2000 in paired end mode with 100 cycles. Illumina Data Processing Data filtering and alignment: end of every sequence, reads from each individual were separated using the barcode splitter function of FASTX Toolkit ( http://hannonlab.cshl.edu/fastx_toolkit ) (parameters: mismatches 0, -bol). Reads were filtered by quality in two subsequent steps. First, low quality bas fastq quality trimmer function of FASTX Toolkit and only processed reads longer than 50 nt were retained (parameters: Q 33, t 20, l 50). Second, the quality trimmed reads were filt ered based on their overall quality, when only reads with 90% or more of the bases with phred score > 20 (parameters: Q 33, q 20, p 90) were retained.
84 Alignment: After quality filtering, some reads from a pair m ay be discarded resulting in loss of syn chrony of the paired end file. Thus, the files were resynchronized using custom scripts to restore the order of the paired end reads. Reads that lost their pairs were treated as single end in the alignment step. The alignment of the reads was performed aga inst the reference P. trichocarpa (version 2) genome that had been corrected for SNP/Indels with P. deltoides on the regions corresponding to the target genes. Mosaik 2 ( http://bioinformatics.bc.edu/marthlab/Mosaik ) was used for the alignment, and up to 5% mismatch was accepted between the reads and the reference genome (parameters: mmp 0.05, m all, a all, hs 15, mhp 100, bw 41 act 26). Reads were allowed to align to multiple regions of the genome, following recommendations from a previous reports by Krumm et al. ( 2012 ) that suggests that this improves detection of CNV. Alignment files from the same individual were merged into a single file, and sorted using ba mtools merge and sort tools, respectively ( B ARNETT et al. 2011 ) CNV D etection Probe sequenci ng depth: The average read depth at the probe level was generated and used as the dependent variable in the modeling step. Sequencing depth is defined as the number of times a given reference base was sequenced. The information was obtained from the alignm ent step using bamtools coverage tool for the whole genome, and parsed for the probe region using custom scripts. To reduce possible capture interference that adjacent probes cause on each other, we calculated the average sequencing depth of the 60bp centr al part of the probes. A sequence depth matrix that describes the sequence depth for each probe, for each individual in the population, was generated and used in subsequent analysis.
85 Depth normalization and mixed model analysis: Due to inherent variation in library preparation, hybridization and sequencing, samples have different sequencing outputs that directly affect sequencing depth at a global level. To control for such effects we normalized the depth matrix using a quantile normalization ( B OLSTAD et al. 2003 ) implemented on R using the function normalize.quantiles from library preprocessCore. Next, the quantile normalized data was log2 transformed, and the resultant data was analyzed a t the gene level using the following statistical model: where yij(g) is the normalized sequencing depth of individual i, for probe j, i is the fixed effect of set); P j is the random effect of probe ij(g) is the 2 ). The model was implemented in SAS using PROC MIXED. The p value associated with the effect of individu al ( ) were exported from SAS to R as a text file, and were corrected for multiple testing using the q value modified FDR correction ( S TOREY and T IBSHIRANI 2003 ) Genes with q value lower than 0.01 were considered putative CNV and used in subsequent analysis. Clustering and detection of CNV: The least square means for each individual ( ) for each gene containing a putive CNV (q value < 0.01) was used for a clustering analysis to identify the classes of CNV segregating in the population. The unsupervised mean shi f t algorithm ( C HENG 1995 ) was used for clustering analysis using the ms function of the LPCM package in R with default parameters and bandwidth chosen as described below. The mean shift algorithm does not require for the number of clusters
86 to be pre specified, but requires the definition of a bandwidth parameter, and different choices of bandwidth usually result in a different clustering of the data. To select the most likely bandwidth parameter, we implemented a two step approach as follows. First, the semi automated algorithm developed by Einbeck ( 2011 ) was used to select a set of putative bandwidths using the function ms.self.covera ge function of the LPCM package of R (parameters: gridsize=25, taumin=0.02, taumax=0.5, plot.type="o" ). Second, we analyzed the clustering results of each bandwidth and chose the bandwidth that provided the highest number of clusters as long as all cluster means were at least 1.5 the previous cluster mean, signifying the expected addition in depth by having an additional copy of the gene. D uring th is second step, each putative bandwidth was used on the mean shift clustering and the resulting data was parse d to (i) remove bandwidths resulting in identical clusters, (ii) obtain the mean and standard deviation of the cluster(s) formed, (iii) when three of more clusters were formed, selected the first non zero cluster mean as a reference, (iv) obtain the increm ental rations between each cluster mean and the reference cluster mean (when three or more clusters were generated), (v) discard bandwidth if any higher cluster mean were not at least 1.5x the reference cluster mean, with increments of 0.5 for each additio nal cluster, and, finally, (vi) select the bandwidth that yielded the highest number of clusters ( Figure 4 3 ). The selected bandwidth was ultimately used to generate the clustering analysis for each gene. For functional annotation, a gene was considered to have a CNV if the clustering generated with the chosen bandwidth resulted in two or more clusters. Functional A nnotation and GO E nrichment T ests The coding region of the 18,153 genes targeted for captured were extracted from the version 2 reference genome of Populus trichocarpa based on the GFF information.
87 In cases where multiple isoforms for a gene were available, the first one was used as a representative of the gene. The captured genes were subdivided in small batches of five genes and aligned in paral lel to the non redundant NCBI database using NCBI blast producing output on xml format (parameters: evalue 1e 10, max_hsps_per_subject 33, max_target_seqs 50, outfmt 5). The blast results were imported to Blast2GO ( C ONESA et al. 2005 ) for automated extraction of GO terms from the blast results and annotation of the genes using default parameters. Fisher exact tests were performed using Blast2GO with two test se ts. The first test set was the 300 genes identified as CNV and for which cluster data resulted in two or more clusters. The second test set was the 300 genes with most significant p values for individual effect, obtained by sorting the results of the ANOVA by p value. For both cases, the reference set was the remainder genes that were not present in the test set. Results Copy Number Variation Detection Pipeline The proposed method for detection of CNV is based on the hypothesis that each probe is behaving i ndependently in the capture reaction and, thus, can be considered an independent replicate to model the presence of copy number variation In that sense, CNV was modeled similarly to microarray data ( W OLFINGER et al. 2001 ) seeking to identify genes where one or more individuals are captured more or less efficiently than the averag e, across all the probes designed for the gene. Capture efficiency was measured as the normalized average sequencing depth of the inner portion of the probe and a mixed model approach was used to test the gene for an ( see Experimental P rocedures ). The normalization of the sequencing depth for variation in number of reads sequenced for each individual was performed using quantile
88 normalization ( B OLSTAD et al. 2003 ) which levels the sequencing depth of all individuals to the same distribution. A probe effect was included in the model, similarly to a randomized complete block design, to compensate for variation in capture efficiency among probes. A subset of the population composed of 10 individuals w as sequenced at a higher depth (median=45 ) and used for pipeline development and initial assessment of CNV. These individuals were captured using a custom sequence capture probe set targeting 18,153 genes with an average of 11 .2 probes per gene (Table 1). After correcting for multiple test s we observed 2,773 genes with a significant individual effect, referred to as putative CNVs. This represents 15% of the genes for which three or more probes were presented (17,940) and CNV c ould be modeled This proportion is similar to that observed in maize, wh ere 10 14% of genes had CNV ( B ELO et al. 2010 ; S WANSON W AGNER et al. 2010 ) Examples of gene CNV from this subset of the population are presented in Figure 4. 1. Gene C opy N umber V ariation in 113 P oplar T rees The method developed using the subset of 10 individuals was expanded to a lar ger population of 113 individuals, which allowed for better discrimination between CNV classes within a gene. Contrary to the initial subset of 10 individuals, the expanded population was sequenced with a lower median depth of 25 . These 113 individuals r epresent the natural diversity of the species and, based on their geographic origin, are expected to be unrelated. The analysis of these individuals provided evidence that both rare and common CNV s can be observed in the population. For example, Figure 4 2 shows an example of two genes for which the frequency of the alleles in the population suggest s the presence of rare gene CNV. Similarly to what we have employed in the
89 subset of 10 individuals, we ranked the p values and applied a correction for multiple tests in an attempt to determine the number of genes showing a putative CNV in the expanded population. However, because of the high number of degrees of freedom, the distribution of p values (and q values) were largely skewed towards the small values and a cut off could not be employed, as 15,513 genes out of 17,938 had q value (corrected p value) smaller than 0.01. Clustering Analysis for CNV Genotyping To select a group of genes with CNV for genotyping and functional characterization, we performed a clu stering of the data for all genes with a q value lower than 0.01 (15,513). As the number of clusters and density of each cluster is not known a priori we employed the unsupervised clustering mean shift algorithm ( C HENG 1995 ) Furthermore, a clustering result was only accepted for a gene if the mean of each cluster was well separated and followed the expectation of depth increment due to an addition of gene cop ies For example, if the first ordered cluster (ref erence cluster) represents individuals with both alleles present a nd mean depth of 20, a cluster of individuals with three alleles of the gene present is expected to have a mean that is 1.5 as much as that of the reference level, or 3 0. I n a different s cenario, if the reference cluster represents individuals that are hemizygous (gene present in only one allele) with a mean of 20, a cluster of individuals with both alleles present for the gene should have a mean twice as much as that of the hemizygous o r 40 After applying such selection, 300 genes were separated into two or more clusters. Examples of four genes with different number of clusters and cluster densities are shown in Figure 4 3. Gene POPTR_0019s01380 (Figure 4 3, first panel) is an example of a putative PAV, where the null allele is the most frequent in the population. Gene POPTR_0019s14730 (Figure
90 4 3, third panel) shows an example of a CNV with three categories (clusters) segregating in the population. The first cluster has a mean depth of 34 and was used as the reference cluster, whereas the second and third clusters have means of 55 and 75, approximately 1.5 and 2 that of th e first cluster. Gene POPTR_0004 s 07860 (Figure 4 3, second panel) is provided as an example of genes for which three classes are present including one where the gene is absent in some individuals of the population The difference in mean depth between the two other clusters is approximately 2 (mean depth 17 and 35, respectively), suggesting that one class is co mposed of hemyzigous individuals and the other has both alleles present in the individuals. Finally, gene POPTR_0003s20540 (Figure 4 4, fourth panel) shows an example of genes where multiple copies appear to be present in different individuals of the popul ation. Functional Characterization of Genes with CNV To obtain functional inferences about the genes with putative CNV we chose two datasets. The first one is composed of the 300 genes for which two or more clusters were obtained during the clustering step and the second was composed of the 300 genes with lowest p value for individual effect. The rationale for using both approaches was to provide a more comprehensive view, as each method has un certainty in identifying bona fide CNV s For both criteria gen e CNV occurs across all poplar scaffolds (Table 4 2). Some scaffolds, however, tend to have more g ene s with putative CNV variation than others after normalizing for gene density Scaffold 17, 19 and the smaller scaffolds (20+), for example, deviated signif icantly from the expected value (Table 4 2) (FDR corrected p value < 0.05 for hypergeometric test).
91 Next, we performed a functional annotation of the 18,153 genes targeted in our probe design using Blast2GO ( C ONESA et al. 2005 ) which extracts gene ontologies for all genes from the NCBI non redundant database. The majority of the target genes were successfully annotated using this approach (87%; 15,802). Using this annotation we tested if genes associated with certain GO terms were more likely to occur as putative CNV. We compared the 300 genes derived from the cluste r analysis as well as the 300 most significant genes for GO enrichment relative to all the remain i ng genes. The results for the 300 most significant genes indicate that some class of genes were significantly enriched in this dataset, particularly GO terms associated with defense, resistance and binding to nucleotide (Figure 4 4). When subdivided betwee n GO terms associated with molecular function and biological process, GO terms related to binding to ADP, to small molecules and kinase activity were most enriched in the dataset in terms of molecular function (Figure 4 5, top panel), whereas GO terms rela ted to response to defense, to stimulus, signaling and phosphorylation were most enriched in the dataset for biological processes (Figure 4 5, bottom panel). These results were corroborated by the analysis of the 300 genes arising from the clustering filte ring (Figure 4 6). A distal region of chromosome 19 is enriched for genes annotated as leucine rich repeats (LRR) and nucleotide binding site (NBS) genes ( Y IN et al. 2008 ) which are related to stress and disease resistance, and seemed enriched for gene CNV (Figure 4 7) T his region is also known to be an incipient sex chromosome in poplar ( Y IN et al. 2008 ) and whose putat ive boundaries seem to overlap with the region enric hed for gene CNV (Figure 4 7, green bar) suggesting a possible relationship between gene CNV and sex determination in poplar.
92 Discussion Gene copy number va riation is a class of structural variation that has been understudied in plants. In humans, it is often included in association studies and associations with phenotype are commonly observed, particularly for complex disease traits ( G IRIRAJAN et al. 2010 ) One factor that has been hindering the broad genotyping of CNV in plants is the need to perform dedic ated experiments for its characterization. To overcome such limitation, the goal of our study was to explore the possibility to genotype CNV using whole exome sequencing, a methodology that is increasingly being applied to plants. Recently, several article s have shown the feasibility of detect ing CNV from exome data in humans and achieved specificities and sensitivity close to that of CGH arrays ( L OVE et al. 2011 ; N ORD et al. 2011 ; S ATHIRAPONGSASUTI et al. 2011 ) To evaluate the potential of detecting CNV from exome capture data in plants, we targeted 18,1 53 genes in a population of 121 unrelated individuals of poplar. Compared to other plant species, the genome of poplar is well characterized making it a good candidate for this evaluation. The method em ployed for detection of CNV is based on the hypothesi s that capture probes could be considered replicates of the gene Some methods recently published for exome CNV detection in humans are based on log ratios between a sample and a reference genome ( V ALDES M AS et al. 2012 ) and, although they can provide probe level CNV resolution, they do not take into consideration the variation in the population for calling the site a CNV as independent pairwise tes ts are performed between the samples and the reference Other recently published methods to detect exome CNV normalize the depth data based on some modification of principle component analysis by removing an arbitrary number of components to clean the data
93 ( F ROMER et al. 2012 ; K RUMM et al. 2012 ) While they are written to be populational methods and might provide p robe level resolution the arbitrary removal of components might exclude bona fide variation that is due to more frequent or common CNVs in the population and seemed more applicable for the detection of rare CNVs. In the method we describe, a quantile norm alization is used to remove experiment level sample to sample variation in read death and it takes into consideration the populati on variance to detect CNV, but, due to the nature o f ANOVA, it is limited to resolution at the gene level. One challenge with the approach proposed is how to determine the threshold to declare a CNV. When a large number of individuals of the population were analyzed, the re were 15,513 genes for which the adjusted p value was lower than 0.01 Based on previous reports that quantif ied CNV in highly diverse, genetically unrelated individuals, it is anticipated that a large proportion of these significant genes are false positives. In an attempt to exclude false positives, we adopted two approaches for filtering the data further prio r to making inferences about functional aspects of gene CNVs. The first was to cluster the depth data for all individuals for each gene and identify well separated clusters with means following a consistent increment in depth. The second and most straightf orward way was to focus on an arbitrary number of genes with smallest p values. The clustering approach is challenging, because the number of clusters and their density is not known E ven though an unsupervised clustering methods previously applied to exom e CNV detection ( K RUMM et al. 2012 ) was employed visual inspection of the data suggests that the frequency of both false positive s and false negative s are high a nd clusters seem to be often merged incorrectly or not detected. Despite this limitation, we generated a set of 300 higher confident genes by clustering For the
94 second filtering approach we focused on the 300 most significant genes based on a ranked list of p value s In the poplar data, CNVs were present across all chromosome s but there was enrichment for CNVs in chromosomes 17 (both filtering criteria) and 19 (mainly among most significant genes) Although we have not applied statistical tests to look fo r CNV enrichments within each chromosome, we observed a higher number of gene CNVs in the beginning of chromosome 19 (Figure 4 7), a region previously identified in poplar to be linked to sexual determination ( Y IN et al. 2008 ) Poplar is a dioecious species, but there is not a clear set of chromosomes that segregate with gender and the molecular and genetic mechanisms controlling such trait are not fully understood. Yin et al. ( 2008 ) observed that in female trees this region of chromosome 1 9 has suppression of recombination, distortion of segregation of the markers analyzed a nd high haplotype divergence and coined this region to be an incipient sex chromosome following a ZW sex determination system Because of the lack of recombination in the region, the authors did not have enough resolution to identify the region responsibl e for the QTL, but they observed that this region was enriched for genes containing leucine rich repeats (LRR) and nucleotide binding site (NBS) motifs, classes of genes that we also observed to be enriched for CNV (Firgure s 4 4 and 4 7 ). Therefore, one po ssibility is that gene CNV contribute s to sex differentiation in poplar An alternative possibility, however, is that the high repeat level of this region might be more prone to CNV formation due to a non allelic match during homologous recombination that is known to cause common CNVs ( H ASTINGS et al. 2009 ) although such possibility conflicts with the the suppression of recombination in the region To test these possibilities will require
95 phenotyping our population for sex when flowering occurs or genotyping an existing segrega ting population for CNV such as the one developed by Yin et al. ( 2008 ) The functional characterization of genes with CNV showed a higher frequency of genes associa ted with resistance to biotic and abiotic stress. At the molecular level, these genes often had NBS and LRR or were involved in phosphorylation and kinase activity (Figure 4 5). These results were also observed in rice ( Y U et al. 2011 ) and might suggest an implication of CNV with biotic and abiotic stress related traits. Such hypothesis was first tested by DeBolt ( 2010 ) by growing a common ancestor of Arabidopsis under high temperature and under simulated biotic stress for five generations with selection for fecundity at each generation and analyzing gene CNV at the end of the process. Th e author observed that up to 1 .4% of the genes presented a CNV event after the five generations compared to the control grown under no stress Among the genes with CNV, there was enrichment for genes related to stress and disease resistance. Our results support the enrichment of resist ance related genes with CNV, but the implication on phenotyp es will require further analysis of the variation in disease traits in the population.
96 Table 4 1. Probe design for Populus deltoides Probes were designed to target genes as well as neutral regi ons of the genome. Probe target Number of loci Number of probes Targeting genes 18153 204108 Targeting neutral regions Every 15Kb window 22864 22864 Neutral from Zhou and Holliday ( 2012 ) 971 971 Total 41988 227943
97 Table 4 2. Distribution of selected gene CN V across scaffolds of poplar. The expected number was obtained assuming equal distribution of the gene CNV represent the 300 gene CNV that were selected based on unsupervised clusteri highest probability of having CNV. P value were calculated assuming a hypergeometric test and were FDR corrected with significant values (<0.05) shown in red italic font The small scaffolds w ere combined into one class for simplification and named 20+. Scaffold Targeted genes Expected Clustering selected Most significant genes observed p value observed p value 1 2112 35 25 0.051 27 0.060 2 1364 23 17 0.082 13 0.030 3 982 16 15 0.121 12 0.097 4 987 17 14 0.121 15 0.123 5 1252 21 10 0.019 17 0.099 6 1365 23 15 0.059 15 0.057 7 678 11 12 0.122 10 0.124 8 897 15 11 0.109 14 0.124 9 851 14 14 0.121 5 0.013 10 1248 21 23 0.119 8 0.004 11 595 10 7 0.121 14 0.097 12 598 10 16 0.059 13 0 .099 13 668 11 16 0.082 15 0.097 14 964 16 10 0.076 10 0.068 15 678 11 15 0.109 12 0.124 16 512 9 6 0.121 7 0.130 17 540 9 21 0.002 16 0.034 18 659 11 10 0.122 14 0.099 19 509 9 15 0.051 30 1.80E 08 20+ 478 8 28 1.45E 07 33 8.76E 11 Total 17937 30 0 300 300
98 Figure 4 1. Examples of genes with putative CNV in 10 individuals of Populus deltoides The mean sequencing depth across all probes (Y axis) are shown for four genes in ten randomly selected individuals ( X axis )
99 Figure 4 2. Examp le of genes with a rare putative CNV. The mean sequencing depth across all probes (Y axis) are shown for two genes in 113 individuals (X axis)
100 Figure 4 3. Examples of genes with CNV selected by unsupervised clustering algorithm. Data for four genes are shown in three different representations. The left panels show t he unsorted mean sequencing depth across all probes in 113 individuals and in the middle panel the individuals were sorted based on their mean sequencing depth The right panels show the resu lts of the clustering method, with each cluster being represented by a color and the cluster mean shown by a black dot. F or the right panel the Y axis represents a normalized sequencing depth
101 Figure 4 3. Continued.
102 Figure 4 4 GO term enrichment c omparing the 300 most significant genes (test set) versus the remainder genes (reference set).
103 Figure 4 5. GO term enrichment comparing the 300 most significant genes (test set) versus the remainder genes (reference set) visualized separately for term s related to molecular function (top) and biological process (bottom). Each color represents a broader level of GO annotation and the box sizes are proportional to the of the enrichment significance ( log 10 ( p value)). Plots were generated using REVIGO ( S UPEK et al. 2011 )
104 Figure 4 6 GO term enrichment comparing the 300 genes with CNV selected by unsupervised clustering (test set) versus the remainde r genes (reference set).
105 Figure 4 7. Manhattan plot showing the distribution of p values for genes across chromosome 19 (top panel ). Red dots represent genes annotated as leucine rich repeats (LRR), nucleotide binding site (NBS) or both. Green bar repr esents the incipient sex chromosome of poplar as describe elsewhere ( Y IN et al. 2008 ) Chromosome 1 (bottom panel ) is shown for relative comparison.
1 06 CHAPTER 5 CON CLUSIONS The goal of this research was to explore exome capture followed by next generation sequencing as a viable solution for high throughput analysis of complex and uncharacterized genomes, as well as its application for comprehensive assessment of the genetic diversity in large populations. To characterize the exome of conifers in Chapter 2, we explored the capture data searching for instances where the paired end reads were partially aligning to the unigene reference. Reads showing this feature were us ed to reconstitute partial or complete gene sequences from the captured reads We developed a de novo strategy to independently assemble the reads of 11,396 captured genes, which partially predicted the structure of these genes, inferring exon intron bound aries, assembling intron sequences and expanding the reference towards regulatory regions (Table 2 3 Figure 2 5 ) Interestingly, we observed a large set of genes for which we could not completely assemble the introns and we suggest that this was due to th e presence of large introns in the conifer genome, as recently demonstrated in the draft genome of spruce ( N YSTEDT et al. 2013 ) We are already making use of thi s expanded reference for polymorphism identification, and we detected twice as many SNPs when using the expanded reference compared to the original unigene references (data not shown). The work provided in Chapter 2 was also important to highlight technica l aspects of exome capture that influence its efficiency in genomes that are not well characterized and for which probe design is based on ESTs. Of major influence is the effect of not knowing the exon intron borders and designing probes that unknowingly fall on such junctions. By de novo predicting the introns in a different dataset we show that
107 these probes have lower capture efficiency and often have the hybridization hindered completely (Figures 2 3 and 2 4) Therefore, a possible guideline for exome c apture in less characterized genomes would be to (i) identify the exon intron borders t o the greatest extent possible prior to probe design by using genomic data available for the species (gene resequencing, short reads, etc) or information from closely re lated species; (ii) design probes in excess avoiding known exon intron boarders and use these probes to capture a diverse subset of the population to be characterized; (iii) analyze the capture data to flag probes that capture less complex regions resultin g in excessive sequencing depth, and using the developed recruiter pipeline or a similar approa ch, further identify introns and create an updated exome reference ; (iv) analyze the capture data to detect polymorphisms, similarly to what was shown in Chapter 3; (v) redesign the probes to target the exons but avoid exon intron boarders, avoid low complexity regions and contain polymorphism closer to the center of the probe ; and (v i ) capture the remainder of the population using the updated probe design and ana lyz e the data using the updated exome reference to maximize polymorphism detection Such guideline s are not much different from those established for the development of SNP genotyping assays. B ut because the costs of synthesizing custom probes for exome ca pture are lower than those associated with SNP arrays, for example, this becomes a more viable alternative that can be employed in every population to maximize the sequencing of polymorphic regions. To illustrate the potential of using sequence capture dat a for genetic characterization of large populations, we sequenced a mapping population of loblolly pine constituted of 72 haploid genomes extracted from megagametophyte tissue. As
108 described in Chapter 3, the haploid tissue of megagametophyte is mater nally inherited and allow s for an efficient way to sample meiotic events in conifers withou t the need to generate crosses. It is common for new genotyping methods to be analyzed in a mapping population, because it allows for the tracking of the alleles in the pe digree (when available), the filtering of the alleles based on the expected segregation ratio and th e optimization of the genotype calls by maximizing the number of loci with alleles segregating correctly ( E CKERT et al. 2009 ; E LSHIRE et al. 2011 ; N EVES et al. 2011 ) We sequenced the captured fragmen ts with a shal low depth of 2.8 and estimated that 4,123 genes were captured with enough sequence depth across a minimum of 50 haploid samples to allow identification of polymorphisms. Based on this dataset, we identified 7, 842 markers segregating in the population for 4,195 genes. These markers included SNPs, short insertions and deletions, multiple nucleotide polymorphisms as well as presence/absence variants (Table 3 1). Thus, this result illustrates the large number of robust markers that can be generated using exome capture in a complex genome. Another goal of Chapter 3 was to show how sequence capture can be an efficient way to generate high density genetic maps that will then be used to link and order the assembled contigs and scaffolds from genome sequencing proje cts W e showed that because a large portion of the gene s are captured and sequenced, the chance s of identifying informative marker s in the population under analyzes are high, resulting in a marker conversion rate of 92%. The markers obtained for the popula tion were used to generate a high density, gene rich genetic map for loblolly pine. In total, we mapped 2,841 markers genes with a genetic distance of 0.58 cM between genes. A subset of
109 these genes were mapped using presence/absent variants, providing init ial evidence that this class of markers can be obtained from exome sequencing data. In C hapter 4 we explored the exome data in Populus deltoides to identify structural variation markers, namely gene copy number variation. The analysis of 18,153 targeted ge nes in 121 unrelated individuals identified gene CNV and genotyped a subset of those putative CNVs by clustering the data. Interestingly, we observed that genes related to biotic and abiotic stress were more likely to occur as CNV in our dataset, suggestin g a potential role of this type of structural variation on resistance traits. A possible relationship between gene CNV and sex determination in poplar was also explored, based on the observation that regions of chromosome 19 of poplar seems more likely to have genes with CNV. In summary, chapters 2 and 3 shows both technical and applied aspects of sequence capture in the complex genome of loblolly pine. We highlight how the method can be high throughput and cost effective by performing custom libraries, mul tiplexing the capture reaction, and optimizing the probe design to sequence polymorphic regions of the genome. Using this approach we characterized the exome of loblolly pine and the closely related slash pine, identified polymorphic regions and ordered a large set of genes on a genetic map. In chapter 4 we expanded the application of the capture data to detect gene CNV, a topic that will probably be more explored in plants as populations have their exome sequenced. With this work we show how sequencing a r educed but targeted region of the genome can be an efficient way to characterize the genetic diversity of large populations of trees. The ability to target specific regions of the genome is an advantage
110 of sequence capture over other methods for complexity reduction that interrogate random parts of the genome. This advantage allows for unique applications, some of which we outlined in this work. With further improvements on the method and automation, sequence capture will likely remain an ideal method for a pplications in large population, for example in genetic association, genomic selection and germplasm characterization.
111 LIST OF REFERENCES Adams, W. T., and R. J. Joly, 1980a Allozyme studies in loblolly pine seed orchards conal varia tion and frequency of progeny due to self fertilization. Silvae Genetica 29 : 1 4. Adams, W. T., and R. J. Joly, 1980b Genetic of allozyme variants in loblolly pine. Journal of Heredity 71 : 33 40. Albert, T. J., M. N. Molla, D. M. Muzny, L. Nazareth, D. Whe eler et al. 2007 Direct selection of human genomic loci by microarray hybridization. Nature methods 4 : 903 905. Alkan, C., B. P. Coe and E. E. Eichler, 2011 Genome structural variation discovery and genotyping. Nature reviews. Genetics 12 : 363 376. Baker, M., 2012 De novo genome assembly: what every biologist should know. Nat Meth 9 : 333 337. Bansal, V., R. Tewhey, E. M. Leproust and N. J. Schork, 2011 Efficient and cost effective population resequencing by pooling and in solution hybridization. PLoS ONE 6 : e18353. Barbazuk, W. B., J. A. Bedell and P. D. Rabinowicz, 2005 Reduced representation sequencing: a success in maize and a promise for other plant genomes. BioEssays : news and reviews in molecular, cellular and developmental biology 27 : 839 848. Barba zuk, W. B., S. J. Emrich, H. D. Chen, L. Li and P. S. Schnable, 2007 SNP discovery via 454 transcriptome sequencing. Plant J 51 : 910 918. Barnett, D., E. Garrison, A. Quinlan, M. Stromberg and G. Marth, 2011 BamTools: a C++ API and toolkit for analyzing an d managing BAM files. Bioinformatics 27 : 1691 1692. Belo, A., M. K. Beatty, D. Hondred, K. A. Fengler, B. Li et al. 2010 Allelic genome structural variations in maize detected by array comparative genome hybridization. TAG. Theoretical and applied genetic s. Theoretische und angewandte Genetik 120 : 355 367. Bentley, D. R., S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton et al. 2008 Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456 : 53 59. Bernardo, R., 2008 Molecular markers and selection for complex traits in plants: Learning from the last 20 years. Crop Science 48 : 1649 1664.
112 Birol, I., A. Raymond, S. D. Jackman, S. Pleasance, R. Coope et al. 2013 Assembling the 20 Gb white spruce (Picea glauca) genome fr om whole genome shotgun sequencing data. Bioinformatics 29 : 1492 1497. Bolstad, B. M., R. A. Irizarry, M. Astrand and T. P. Speed, 2003 A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinforma tics 19 : 185 193. Bradshaw, H. D., Jr., and R. F. Stettler, 1995 Molecular genetics of growth and development in Populus. IV. Mapping QTLs with large effects on growth, form, and phenology traits in a forest tree. Genetics 139 : 963 973. Bradshaw, H. D., an d R. F. Stettler, 1993 Molecular genetics of growth and development in populus. I. Triploidy in hibrid poplars. Theoretical and Applied Genetics 86 : 301 307. Brown, G. R., G. P. Gill, R. J. Kuntz, C. H. Langley and D. B. Neale, 2004 Nucleotide diversity an d linkage disequilibrium in loblolly pine. Proc Natl Acad Sci U S A 101 : 15255 15260. Bundock, P. C., R. E. Casu and R. J. Henry, 2012 Enrichment of genomic DNA for polymorphism detection in a non model highly polyploid crop plant. Plant biotechnology jour nal 10 : 657 667. Burleigh, J. G., W. B. Barbazuk, J. M. Davis, A. M. Morse and P. S. Soltis, 2012 Exploring Diversification and Genome Size Evolution in Extant Gymnosperms through Phylogenetic Synthesis. Journal of Botany 2012. Chakravarti, A., L. K. Lashe r and J. E. Reefer, 1991 A maximum likelihood method for estimating genome length using genetic linkage data. Genetics 128 : 175 182. Cheng, Y. Z., 1995 Mean shift, mode seeking, and clustering. Ieee Transactions on Pattern Analysis and Machine Intelligence 17 : 790 799. Comai, L., 2005 The advantages and disadvantages of being polyploid. Nature reviews. Genetics 6 : 836 846. Conesa, A., S. Gotz, J. M. Garcia Gomez, J. Terol, M. Talon et al. 2005 Blast2GO: a universal tool for annotation, visualization and an alysis in functional genomics research. Bioinformatics 21 : 3674 3676. Craddock, N., M. E. Hurles, N. Cardin, R. D. Pearson, V. Plagnol et al. 2010 Genome wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Na ture 464 : 713 720. Cronn, R., B. J. Knaus, A. Liston, P. J. Maughan, M. Parks et al. 2012 Targeted enrichment strategies for next generation plant biology. American journal of botany 99 : 291 311.
113 Cummings, N., R. King, A. Rickers, A. Kaspi, S. Lunke et al 2010 Combining target enrichment with barcode multiplexing for high throughput SNP discovery. BMC Genomics 11 : 641. Davey, J. W., P. A. Hohenlohe, P. D. Etter, J. Q. Boone, J. M. Catchen et al. 2011 Genome wide genetic marker discovery and genotyping u sing next generation sequencing. Nature reviews. Genetics 12 : 499 510. DeBolt, S., 2010 Copy number variation shapes genome diversity in Arabidopsis over immediate family generational scales. Genome biology and evolution 2 : 441 453. Diez, C. M., B. S. Gaut E. Meca, E. Scheinvar, S. Montes Hernandez et al. 2013 Genome size variation in wild and cultivated maize along altitudinal gradients. The New phytologist 199 : 264 276. Drost, D. R., E. Novaes, C. Boaventura Novaes, C. I. Benedict, R. S. Brown et al. 2 009 A microarray based genotyping and genetic mapping approach for highly heterozygous outcrossing species enables localization of a large fraction of the unassembled Populus trichocarpa genome sequence. Plant J. Echt, C. S., and P. May Marquardt, 1997 Sur vey of microsatellite DNA in pine. Genome / National Research Council Canada = Genome / Conseil national de recherches Canada 40 : 9 17. Echt, C. S., S. Saha, K. V. Krutovsky, K. Wimalanathan, J. E. Erpelding et al. 2011 An annotated genetic map of lobloll y pine based on microsatellite and cDNA markers. BMC Genetics 12 : 17. Eckert, A. J., B. Pande, E. S. Ersoz, M. H. Wright, V. K. Rashbrook et al. 2009 High throughput genotyping and mapping of single nucleotide polymorphisms in loblolly pine (Pinus taeda L .). Tree Genetics and Genomes 5 : 225 234. Eckert, A. J., J. van Heerwaarden, J. L. Wegrzyn, C. D. Nelson, J. Ross Ibarra et al. 2010 Patterns of population structure and environmental associations to aridity across the range of loblolly pine (Pinus taeda L., Pinaceae). Genetics 185 : 969 982. Einbeck, J., 2011 Bandwidth selection for mean shift based unsupervised learning techniques: a unified approach via self coverage. Journal of pattern recognition research. 6 : 175 192. Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto et al. 2011 A Robust, Simple Genotyping by Sequencing (GBS) Approach for High Diversity Species. PLoS ONE 6 : e19379. Elsik, C. G., and C. G. Williams, 2001 Low copy microsatellite recovery from a conifer genome. Theoretical and Applied Genetics 103 : 1189 1195.
114 Fawcett, J. A., S. Maere and Y. Van de Peer, 2009 Plants with double genomes might Proceedings of the National Academy of Sciences 106 : 57 37 5742. Fromer, M., J. L. Moran, K. Chambert, E. Banks, S. E. Bergen et al. 2012 Discovery and statistical genotyping of copy number variation from whole exome sequencing depth. American journal of human genetics 91 : 597 607. Fu, Y., N. M. Springer, D. J Gerhardt, K. Ying, C. T. Yeh et al. 2010 Repeat subtraction mediated sequence capture from a complex genome. The Plant journal : for cell and molecular biology 62 : 898 909. Girirajan, S., C. D. Campbell and E. E. Eichler, 2010 Human Copy Number Variatio n and Complex Genetic Disease. Annual review of genetics. Gish, W., 1996 2004 WU BLAST, pp. Gnirke, A., A. Melnikov, J. Maguire, P. Rogov, E. M. LeProust et al. 2009 Solution hybrid selection with ultra long oligonucleotides for massively parallel targete d sequencing. Nat Biotechnol 27 : 182 189. Gonzalez Martinez, S. C., N. C. Wheeler, E. Ersoz, C. D. Nelson and D. B. Neale, 2007 Association genetics in Pinus taeda L. I. Wood property traits. Genetics 175 : 399 409. Grattapaglia, D., F. L. Bertolucci and R. R. Sederoff, 1995 Genetic mapping and QTLs controlling vegetative propagation in Eucalyptus grandis and E. urophylla using a pseudo testcross strategy and RAPD markers. Theoretical and Applied Genetics 90 : 933 947. Grattapaglia, D., and M. Kirst, 2008 Euc alyptus applied genomics: from gene sequences to breeding tools. New Phytol 179 : 911 929. Grattapaglia, D., O. B. Silva Junior, M. Kirst, B. M. de Lima, D. A. Faria et al. 2011 High throughput SNP genotyping in the highly heterozygous genome of Eucalyptus : assay success, polymorphism and transferability across species. BMC plant biology 11 : 65. Guerra, F. P., J. L. Wegrzyn, R. Sykes, M. F. Davis, B. J. Stanton et al. 2013 Association genetics of chemical wood properties in black poplar (Populus nigra). Ne w Phytologist 197 : 162 176. Hackett, C. A., and L. B. Broadfoot, 2003 Effects of genotyping errors, missing values and segregation distortion in molecular marker data on the construction of linkage maps. Heredity 90 : 33 38. Hastings, P. J., J. R. Lupski, S M. Rosenberg and G. Ira, 2009 Mechanisms of change in gene copy number. Nature reviews. Genetics 10 : 551 564.
115 Henrichsen, C. N., N. Vinckenbosch, S. Zollner, E. Chaignat, S. Pradervand et al. 2009 Segmental copy number variation shapes tissue transcript omes. Nature genetics 41 : 424 429. Hong, X., D. G. Scofield and M. Lynch, 2006 Intron size, abundance, and distribution within untranslated regions of genes. Molecular biology and evolution 23 : 2392 2404. Howard, J. L., and D. B. McKeever, 2012 US forest p roducts annual market review and prospects, 2008 2012. Research Note Forest Products Laboratory, USDA Forest Service. Huang, X., and A. Madan, 1999 CAP3: A DNA sequence assembly program. Genome Research 9 : 868 877. Ingvarsson, P. K., 2005 Nucleotide polymo rphism and linkage disequilbrium within and among natural populations of European Aspen (Populus tremula L., Salicaceae). Genetics 169 : 945 953. Jiao, Y., N. J. Wickett, S. Ayyampalayam, A. S. Chanderbali, L. Landherr et al. 2011 Ancestral polyploidy in s eed plants and angiosperms. Nature 473 : 97 100. Junghans, D. T., A. C. Alfenas, S. H. Brommonschenkel, S. Oda, E. J. Mello et al. 2003 Resistance to rust ( Puccinia psidii Winter) in eucalyptus: mode of inheritance and mapping of a major gene with RAPD ma rkers. Theor Appl Genet 108 : 175 180. Kahvejian, A., J. Quackenbush and J. F. Thompson, 2008 What would you do if you could sequence everything? Nat Biotechnol 26 : 1125 1133. Kayihan, G. C., D. A. Huber, A. M. Morse, T. L. White and J. M. Davis, 2005 Genet ic dissection of fusiform rust and pitch canker disease traits in loblolly pine. TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik 110 : 948 958. Kirst, M., A. F. Johnson, C. Baucom, E. Ulrich, K. Hubbard et al. 2003 Apparent homol ogy of expressed genes from wood forming tissues of loblolly pine (Pinus taeda L.) with Arabidopsis thaliana. Proceedings of the National Academy of Sciences of the United States of America 100 : 7383 7388. Kovach, A., J. L. Wegrzyn, G. Parra, C. Holt, G. E Bruening et al. 2010 The Pinus taeda genome is characterized by diverse and highly diverged repetitive sequences. BMC Genomics 11. Krumm, N., P. H. Sudmant, A. Ko, B. J. O'Roak, M. Malig et al. 2012 Copy number variation detection and genotyping from e xome sequence data. Genome Research 22 : 1525 1532.
116 Krutovsky, K. V., M. Troggio, G. R. Brown, K. D. Jermstad and D. B. Neale, 2004 Comparative mapping in the Pinaceae. Genetics 168 : 447 461. Lander, E. S., and D. Botstein, 1989 Mapping mendelian factors un derlying quantitative traits using RFLP linkage maps. Genetics 121 : 185 199. Lange, K., and M. Boehnke, 1982 How many polymorphic genes will it take to span the human genome? American journal of human genetics 34 : 842 845. Leitch, I. J., L. Hanson, M. Winf ield, J. Parker and M. D. Bennett, 2001 Nuclear DNA C values Complete Familial Representation in Gymnosperms. Annals of botany 88 : 843 849. Lemmon, A. R., S. Emme and E. M. Lemmon, 2012 Anchored hybrid enrichment for massively high throughput phylogenomics Systematic Biology 61 : 727 744. Li, W., and M. Olivier, 2013 Current analysis platforms and methods for detecting copy number variation. Physiological genomics 45 : 1 16. Lister, R., B. D. Gregory and J. R. Ecker, 2009 Next is now: new technologies for se quencing of genomes, transcriptomes, and beyond. Curr Opin Plant Biol 12 : 107 118. Love, M. I., A. Mysickova, R. Sun, V. Kalscheuer, M. Vingron et al. 2011 Modeling read counts for CNV detection in exome sequencing data. Statistical applications in geneti cs and molecular biology 10. Mackay, J., J. F. Dean, C. Plomion, D. G. Peterson, F. M. Canovas et al. 2012 Towards decoding the conifer giga genome. Plant molecular biology 80 : 555 569. Mamani, E. M. C., N. W. Bueno, D. A. Faria, L. M. S. Guimaraes, D. La u et al. 2010 Positioning of the major locus for Puccinia psidii rust resistance (Ppr1) on the Eucalyptus reference map and its validation across unrelated pedigrees. Tree Genetics & Genomes 6 : 953 962. Mamanova, L., A. J. Coffey, C. E. Scott, I. Kozarewa E. H. Turner et al. 2010 Target enrichment strategies for next generation sequencing. Nat Methods 7 : 111 118. Margulies, M., M. Egholm, W. E. Altman, S. Attiya, J. S. Bader et al. 2005 Genome sequencing in microfabricated high density picolitre reactor s. Nature 437 : 376 380. Maron, L. G., C. T. Guimaraes, M. Kirst, P. S. Albert, J. A. Birchler et al. 2013 Aluminum tolerance in maize is associated with higher MATE1 gene copy number. Proceedings of the National Academy of Sciences of the United States of America 110 : 5241 5246.
117 Medvedev, P., M. Fiume, M. Dzamba, T. Smith and M. Brudno, 2010 Detecting copy number variation with mated short reads. Genome Research 20 : 1613 1622. Mills, R. E., K. Walter, C. Stewart, R. E. Handsaker, K. Chen et al. 2011 Mappi ng copy number variation by population scale genome sequencing. Nature 470 : 59 65. Morse, A. M., D. G. Peterson, M. N. Islam Faridi, K. E. Smith, Z. Magbanua et al. 2009 Evolution of genome size and complexity in Pinus. PLoS ONE 4 : e4332. Neale, D. B., an d P. K. Ingvarsson, 2008 Population, quantitative and comparative genomics of adaptation in forest trees. Current Opinion in Plant Biology 11 : 149 155. Neale, D. B., and O. Savolainen, 2004 Association genetics of complex traits in conifers. Trends in Plan t Science 9 : 325 330. Neves, L. G., J. M. Davis, W. B. Barbazuk and M. Kirst, 2013 Whole exome targeted sequencing of the uncharacterized pine genome. The Plant Journal 75 : 146 156. Neves, L. G., E. M. Mamani, A. C. Alfenas, M. Kirst and D. Grattapaglia, 2 011 A high density transcript linkage map with 1,845 expressed genes positioned by microarray based Single Feature Polymorphisms (SFP) in Eucalyptus. BMC Genomics 12 : 189. Nord, A., M. Lee, M. C. King and T. Walsh, 2011 Accurate and exact CNV identificatio n from targeted high throughput sequence data. BMC Genomics 12 : 184. Novaes, E., D. R. Drost, W. G. Farmerie, G. J. Pappas, Jr., D. Grattapaglia et al. 2008 High throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genom ics 9 : 312. Nystedt, B., N. R. Street, A. Wetterbom, A. Zuccolo, Y. C. Lin et al. 2013 The Norway spruce genome sequence and conifer genome evolution. Nature 497 : 579 584. Parchman, T. L., K. S. Geist, J. A. Grahnen, C. W. Benkman and C. A. Buerkle, 2010 Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics 11 : 180. Pavy, N., M. C. Namroud, F. Gagnon, N. Isabel and J. Bousquet, 2012 The heterogeneous levels of linkage disequilibrium in white spruce genes and comparative analysis with other conifers. Heredity 108 : 273 284. Peterson, D. G., S. R. Schulze, E. B. Sciara, S. A. Lee, J. E. Bowers et al. 2002 Integration of Cot analysis, DNA cloning, and high throughput sequencing facilitates genome characterization and gene discovery. Genome Research 12 : 795 807. Price, A. H., 2006 Believe it or not, QTLs are accurate! Trends Plant Sci 11 : 213 216.
118 Querfurth, R., A. Fischer, M. R. Schweiger, H. Lehrach and F. Mertes, 2012 Creation and applicat ion of immortalized bait libraries for targeted enrichment and next generation sequencing. Biotechniques 52 : 375 380. Quesada, T., V. Gopal, W. P. Cumbie, A. J. Eckert, J. L. Wegrzyn et al. 2010 Association Mapping of Quantitative Disease Resistance in a Natural Population of Loblolly Pine (Pinus taeda L.). Genetics 186 : 677 686. Quesada, T., Z. Li, C. Dervinis, Y. Li, P. N. Bocock et al. 2008 Comparative analysis of the transcriptomes of Populus trichocarpa and Arabidopsis thaliana suggests extensive evo lution of gene expression regulation in angiosperms. New Phytol 180 : 408 420. Rabinowicz, P. D., K. Schutz, N. Dedhia, C. Yordan, L. D. Parnell et al. 1999 Differential methylation of genes and retrotransposons facilitates shotgun sequencing of the maize genome. Nature genetics 23 : 305 308. Rafalski, A., 2002 Applications of single nucleotide polymorphisms in crop genetics. Curr Opin Plant Biol 5 : 94 100. Remington, D. L., R. W. Whetten, B. H. Liu and D. M. O'Malley, 1999 Construction of an AFLP genetic ma p with nearly complete genome coverage in Pinus taeda. TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik 98 : 1279 1292. Resende, M. F., Jr., P. Munoz, J. J. Acosta, G. F. Peter, J. M. Davis et al. 2012 Accelerating the domesticati on of trees using genomic selection: accuracy of prediction models across ages and environments. The New phytologist 193 : 617 624. Saintenac, C., D. Jiang and E. D. Akhunov, 2011 Targeted analysis of nucleotide and copy number variation by exon capture in allotetraploid wheat genome. Genome biology 12 : R88. Sathirapongsasuti, J. F., H. Lee, B. A. Horst, G. Brunner, A. J. Cochran et al. 2011 Exome sequencing based copy number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 27 : 2648 2654. Sato, S., S. Tabata, H. Hirakawa, E. Asamizu, K. Shirasawa et al. 2012 The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485 : 635 641. Schatz, M. C., A. L. Delcher and S. L. Salzberg, 2010 Assembly of large genomes usi ng second generation sequencing. Genome Research 20 : 1165 1173. Scherer, S. W., C. Lee, E. Birney, D. M. Altshuler, E. E. Eichler et al. 2007 Challenges and standards in integrating surveys of structural variation. Nature genetics 39 : S7 15.
119 Schlotterer, C., 2004 The evolution of molecular markers -just a matter of fashion? Nat Rev Genet 5 : 63 69. Slavov, G. T., S. P. DiFazio, J. Martin, W. Schackwitz, W. Muchero et al. 2012 Genome resequencing reveals multiscale geographic structure and extensive linkage disequilibrium in the forest tree Populus trichocarpa. The New phytologist 196 : 713 725. Springer, N. M., K. Ying, Y. Fu, T. M. Ji, C. T. Yeh et al. 2009 Maize inbreds exhibit high levels of copy number variation (CNV) and presence/absence variation (PAV ) in genome content. PLoS Genetics 5 : e1000734. Stankiewicz, P., and J. R. Lupski, 2010 Structural variation in the human genome and its role in disease. Annual review of medicine 61 : 437 455. Stewart, J. F., C. G. Tauer and C. D. Nelson, 2012 Bidirectiona l introgression between loblolly pine (Pinus taeda L.) and shortleaf pine (P echinata Mill.) has increased since the 1950s. Tree Genetics & Genomes 8 : 725 735. Storey, J. D., and R. Tibshirani, 2003 Statistical significance for genomewide studies. Proc Nat l Acad Sci U S A 100 : 9440 9445. Stranger, B. E., M. S. Forrest, M. Dunning, C. E. Ingle, C. Beazley et al. 2007 Relative Impact of Nucleotide and Copy Number Variation on Gene Expression Phenotypes. Science 315 : 848 853. Supek, F., M. Bosnjak, N. Skunca and T. Smuc, 2011 REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS ONE 6 : e21800. Swanson Wagner, R. A., S. R. Eichten, S. Kumari, P. Tiffin, J. C. Stein et al. 2010 Pervasive gene content variation and copy number variation in mai ze and its undomesticated progenitor. Genome Research 20 : 1689 1699. Tang, Y. C., and A. Amon, 2013 Gene copy number alterations: a cost benefit analysis. Cell 152 : 394 405. Tanksley, S., 1983 Molecular markers in plant breeding. Plant Molecular Biology Re porter 1 : 3 8. Teo, S. M., Y. Pawitan, C. S. Ku, K. S. Chia and A. Salim, 2012 Statistical challenges associated with detecting copy number variations with next generation sequencing. Bioinformatics 28 : 2711 2718. Turner, E. H., S. B. Ng, D. A. Nickerson a nd J. Shendure, 2009 Methods for genomic partitioning. Annual review of genomics and human genetics 10 : 263 284.
120 Tuskan, G. A., S. Difazio, S. Jansson, J. Bohlmann, I. Grigoriev et al. 2006 The genome of black cottonwood, Populus trichocarpa (Torr. & Gray ). Science 313 : 1596 1604. Tuskan, G. A., L. E. Gunter, Z. M. K. Yang, T. M. Yin, M. M. Sewell et al. 2004 Characterization of microsatellites revealed by genomic sequencing of Populus trichocarpa. Canadian Journal of Forest Research Revue Canadienne De R echerche Forestiere 34 : 85 93. Valdes Mas, R., S. Bea, D. A. Puente, C. Lopez Otin and X. S. Puente, 2012 Estimation of copy number alterations from exome sequencing data. PLoS ONE 7 : e51422. van Binsbergen, E., 2011 Origins and breakpoint analyses of copy number variations: up close and personal. Cytogenetic and Genome Research 135 : 271 276. Van Ooijen, J. W., and R. E. Voorrips, 2001 JoinMap 3.0, Software for the calculation of genetic linkage maps., pp., edited by P. R. International, Wageningen, the Ne therlands. Voorrips, R. E., 2002 MapChart: software for the graphical presentation of linkage maps and QTLs. The Journal of heredity 93 : 77 78. Vos, P., R. Hogers, M. Bleeker, M. Reijans, T. van de Lee et al. 1995 AFLP: a new technique for DNA fingerprint ing. Nucleic Acids Research 23 : 4407 4414. Wakamiya, I., R. J. Newton, J. S. Johnston and H. J. Price, 1993 Genome Size and Environmental Factors in the Genus Pinus. American journal of botany 80 : 1235 1241. Walsh, T., M. K. Lee, S. Casadei, A. M. Thornton S. M. Stray et al. 2010 Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proceedings of the National Academy of Sciences of the United States of America 107 : 12629 12633. Wear, D. N. and J. G. Greis, 2002 Southern Forest Resource Assessment: Summary of Findings. Journal of Forestry 100 : 6 14. Weber, J. L., and P. E. May, 1989 Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. American jo urnal of human genetics 44 : 388 396. White, T. L., W. T. Adams and D. B. Neale, 2007 Forest genetics Cabi. Whitelaw, C. A., W. B. Barbazuk, G. Pertea, A. P. Chan, F. Cheung et al. 2003 Enrichment of gene coding sequences in maize by genome filtration. Sc ience 302 : 2118 2120.
121 Williams, C. G., M. H. Reyes Valdes and D. A. Huber, 2007 Validating a QTL region characterized by multiple haplotypes. Theor Appl Genet 116 : 87 94. Williams, J. G., A. R. Kubelik, K. J. Livak, J. A. Rafalski and S. V. Tingey, 1990 DN A polymorphisms amplified by arbitrary primers are useful as genetic markers. Nucleic Acids Research 18 : 6531 6535. Wolfinger, R. D., G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh et al. 2001 Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 8 : 625 637. Wu, J., K. R. Grzeda, C. Stewart, F. Grubert, A. E. Urban et al. 2012 Copy Number Variation detection from 1000 Genomes project exon captur e sequencing data. BMC Bioinformatics 13 : 305. Wu, T. D., and C. K. Watanabe, 2005 GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21 : 1859 1875. Wyman, A. R., and R. White, 1980 A highly polymorphic locus in human DNA. Proceedings of the National Academy of Sciences of the United States of America 77 : 6754 6758. Xu, S. Z., 2003 Theoretical basis of the Beavis effect. Genetics 165 : 2259 2268. Yin, T., S. P. DiFazio, L. E. Gunter, X. Zhang, M. M. Sewell et al. 2008 G enome structure and emerging evidence of an incipient sex chromosome in Populus. Genome Research 18 : 422 430. Yu, P., C. Wang, Q. Xu, Y. Feng, X. Yuan et al. 2011 Detection of copy number variations in rice using array based comparative genomic hybridizat ion. BMC Genomics 12. Zhou, L., and J. A. Holliday, 2012 Targeted enrichment of the black cottonwood (Populus trichocarpa) gene space using sequence capture. BMC Genomics 13 : 703. Zhu, C., M. Gore, E. S. Buckler and J. Yu, 2008 Status and Prospects of Asso ciation Mapping in Plants. Plant Gen. 1 : 5 20. Zobel, B., and J. Talbert, 1984 Applied forest tree improvement Wiley.
122 BIOGRAPHICAL SKETCH Leandro Gomide Neves, son of Ary Rodrigues das Neves and Clara Maria de Brito Gomide was born in the city of Vi osa, Minas Gerais state, Brazil. Leandro received his initial education from the Federal University of Viosa, where he concluded his high school from COLUNI (Colgio Universitrio), a federal high school managed by the university. After high school he was accepted in the f orest e ngineering program of the same university, where the interest for trees and their importance to society arose. During college, Leandro developed an interest to perform research and interned in some research laboratories to g e t expo sed to basic biology and genetics. Also during his undergraduate, Leandro studied a semester abroad at the University of Minnesota. During this time in Minnesota, Leandro took a crop breeding class that exposed him to the great field of genetics and breedi ng. T he possibility to combine scientific knowledge with practical problems made him decide to pursue further education in this field. When back to Brazil, searching for a place to conclude his final undergraduate thesis in tree genetics, he moved to Brasi lia to work with Dr. Dario Grattapaglia, a senior scientist at the genetics center of EMBRAPA, the government company responsible for agriculture Eucalyptus genomics and decided to pursue a m his Bachelor of Arts in f orest e ngineer ing Leandro started his degree in g enetics and b reeding under supervision of Dr. Dario Grattapaglia and Dr. degree, he was able to improve his understanding of genetics and tree breeding To perform his master s research Leandro traveled to Dr. Matias Kirst lab at the University of Florida, where they had expertise on methodologies needed for the research At the University o f Florida Leandro became aware of the Ph.D. program in Plant Molecular
123 and Cellular Biology and applied to it. At that time, th e research developed during his m with microarray and genetic mapping seemed like a vast amount of data and Leandr o realize d the need to gain computational skills. Also at this time, Leandro got exposed to the revolution that was happening with DNA sequencing and noticed that molecular breeding would certainly move in that direction, with genetic diversity analysis be ing performed by direct sequencing With this in mind, Leandro started the Ph.D. in Plant Molecular and Cellular Biology under supervision of Dr. Matias Kirst and Dr. Brad Barbazuk.