INTEGRATING GENOMICS AND QUANTITATIVE GENETICS FOR DISCOVERY OF GENES THAT REGULATE BIOENERGY TRAITS IN WOODY SPECIES

By

EVANDRO NOVAES

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010
© 2010 Evandro Novaes
To my beloved girls, Carolina and Sarah
ACKNOWLEDGMENTS

I would like to thank my supervisor, Dr. Matias Kirst, for his guidance, trust, friendship and full commitment to the success of my PhD research. I thank my committee members, Dr. Gary Peter, Dr. Karen Koch, Dr. Ron Sederoff and Dr. George Casella, for being readily available to help with my PhD research. I also thank Dr. Dudley Huber and Dr. Luis Osorio for helping with the experimental design and statistical analysis involved in my research. I am grateful to Dr. William Farmerie for his assistance with the assembly of the Eucalyptus sequences. I thank Dr. Mark Davis and Robert Sykes of the National Renewable Energy Laboratory (U.S. Department of Energy, Golden, CO) for helping with the py-MBMS analysis of poplar wood chemistry. Many thanks also to Brent O'Brien, who obtained the Arabidopsis homozygous T-DNA lines for cpg13.

I want to express my appreciation to the Forest Genomics Laboratory manager, Chris Dervinis. Thanks for making sure reagents and supplies were always in the lab when I needed them and, most of all, thanks for helping me with the molecular biology techniques necessary to clone and characterize cpg13. I also want to recognize the contribution of my colleagues from the Forest Genomics Laboratory, especially Brianna Miles, who organized the greenhouse experiment described in Chapter 3, and Dr. Derek Drost, who contributed insightful ideas to many aspects of our collaborative research. I am grateful to the faculty members of the SFRC, Statistics and Botany Departments, and the PMCB Program who taught the courses I took during my PhD.

I also thank my parents, Marcia and Carlos Novaes, for teaching me early in my childhood about the importance of education. I also have to express my gratitude to my former advisor and friend, Dr. Dario Grattapaglia (EMBRAPA CENARGEN, Brazil), who inspired me and opened many doors in my scientific career, including the one that put me in contact with my PhD supervisor, Dr. Matias Kirst.

I am very thankful to my wife, Carolina Novaes, who always stood by me and temporarily put her scientific career aside for the benefit of our family when I decided to accept the challenge of pursuing a higher education in the U.S. I am also grateful for working together with Carolina, who turned out to be an excellent molecular biology technician and was essential to the success of many research projects conducted in the laboratory, including those involved in my dissertation. Finally, I thank my little girl, Sarah Novaes, for changing my life by filling it with joy.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION AND LITERATURE REVIEW
    The Eucalyptus and Populus Genera
        Eucalyptus Biology, Plantation and Breeding
        Populus Biology, Plantation and Breeding
    Genomic Resources Available for Eucalyptus and Populus
        Sequences Available for Eucalyptus and Populus
        Molecular Markers Available for Eucalyptus and Populus
    Utilizing Genomic Resources to Identify Genes Controlling Bioenergy Traits
        Forward and Reverse Genetics in Forest Species
        Association Genetics and Genetical Genomics: New Solutions for Forest Species
    Project Objectives

2 HIGH-THROUGHPUT GENE AND SNP DISCOVERY IN Eucalyptus grandis, AN UNCHARACTERIZED GENOME
    Introduction
    Materials and Methods
        Plant Material and RNA Extraction
        cDNA Synthesis and Normalization
        454 Sequencing and Assembly
        SNP Detection and Validation
        Analysis of Synonymous and Non-Synonymous Mutations
    Results
        454 Pyrosequencing and Assembly of E. grandis ESTs
        Annotation of the E. grandis ESTs
        Efficiency of Gene Discovery: 454 and Sanger Sequencing Comparison
        SNP Detection and Validation in the E. grandis 454-Generated ESTs
        Analysis of Synonymous and Non-Synonymous SNPs in E. grandis Genes
        Nucleotide Diversity among the E. grandis Genes
    Discussion

3 QUANTITATIVE GENETIC ANALYSIS OF BIOMASS AND WOOD CHEMISTRY OF Populus UNDER DIFFERENT NITROGEN LEVELS
    Introduction
    Materials and Methods
        Phenotyping a Poplar Pseudo-Backcross Pedigree
        Statistical Analysis of Phenotypic Data
        Genetic Linkage Map and QTL Analysis
    Results
        Genetic Control of Phenotypic Traits and the Effect of Nitrogen Treatments
        Analysis of Genetic and Phenotypic Correlations among Traits
        Quantitative Trait Loci Mapping in Each Nitrogen Treatment
        Phenotypic Plasticity in Response to Nitrogen Treatments
        Co-Localization of QTL for Different Traits
    Discussion

4 INTEGRATIVE GENOMICS IDENTIFIES CPG13, A CANDIDATE GENE FOR THE COORDINATE REGULATION OF BIOMASS GROWTH AND CARBON PARTITIONING IN Populus
    Introduction
    Materials and Methods
        Plant Material and Microarray Analysis of Gene Expression
        Fine Mapping the Pleiotropic QTL of LGXIII
        Expression QTL Analysis
        Statistical Analysis
        Search for Common Cis-Elements in the Promoter of Lignin Biosynthesis Genes and Cpg13
    Results
        A Pleiotropic QTL for Biomass and Wood Chemistry on LGXIII of Populus
        Genetical Genomics Identifies a Candidate Gene for the Regulation of Biomass and Wood Chemistry
        Promoter of Cpg13 Contains Regulatory Elements Found in Monolignol Biosynthesis Genes
        Cpg13 Protein Sequence Contains a Secretory Signal Peptide and a Domain of Unknown Function Conserved in All Land Plants
    Discussion

CONCLUSIONS
LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

Table

1-1 Frequency of SNPs across mapped and unmapped scaffolds from the genome sequence of Populus trichocarpa.
2-1 Summary of the E. grandis expressed sequences generated with GS 20 and GS FLX pyrosequencing runs, and control ESTs obtained from dideoxy-based sequencing of an analogous Eucalyptus library.
2-2 Length distribution and characteristics of contigs assembled from the two GS 20 runs, one GS FLX run, the three 454 runs combined and from the control Sanger-sequenced ESTs.
2-3 Number and percentage of gene models with matches against the E. grandis 454 unigenes and against the control Sanger-sequenced Eucalyptus unigenes using three different BlastX thresholds.
2-4 Number of detected polymorphisms and affected contigs by variant type.
2-5 Number and percentage of nonvalidated and validated SNPs for each of the 43 amplicons sequenced with the dideoxy-based method.
2-6 Biological process GO categories enriched (p-value < 0.05) in each of the two extremes of the Ka/Ks distribution.
2-7 Distribution summary of three nucleotide diversity parameters estimated for 2,392 E. grandis contigs.
2-8 Biological process GO categories enriched (p-value < 0.05) in each of the two extremes of the πN (nonsynonymous nucleotide diversity) distribution.
3-1 Estimates of clonal repeatability and average trait value for each of 12 phenotypes measured in two nitrogen treatments.
3-2 Estimates of clonal repeatability and average trait value for each of 8 wood chemistry phenotypes estimated in two nitrogen treatments with py-MBMS.
3-3 Pairwise estimates of phenotypic (below diagonal) and genotypic (above diagonal) correlation between traits.
3-4 Linkage group (LG) and flanking marker localization for the 62 QTL identified under the two nitrogen treatments (H = high, D = deficiency).
3-5 Spearman cor[…] trait.
3-6 Comparison of clonal repeatability estimated with combined data from both N treatments and separately from plants grown under high and deficient nitrogen treatments.
4-1 Annotation of the two candidate cis-regulated genes in the pleiotropic QTL, highly correlated with biomass and wood chemistry traits.
4-2 Top 30 expression correlations between the 20 cis-regulated genes and genes from the lignin biosynthesis pathways.
4-3 Motifs enriched (q-value < 0.05) among the promoters of genes involved in lignin biosynthetic and metabolic processes.
4-4 Number of DUF579-containing genes in each of the 17 land plants with genome currently sequenced.
LIST OF FIGURES

Figure

1-1 Number of publications per year referencing the Populus genome sequence publication, since its release in September 2006 (Source: ISI Web of Knowledge).
1-2 Representation of forward and reverse genetics approaches to find the association between the gene 4-coumarate:coenzyme A ligase (4CL) and lignin levels in the wood of poplars.
2-1 Proportion of E. grandis unigenes (contigs + singlets) without ( ) and with homology to the Arabidopsis (A), Populus (P) and Oryza (O) gene models.
2-2 Proportion of categories of each Gene Ontology (GO) sampled by the E. grandis unigene sequences compared with the proportions found in the Arabidopsis genome annotation.
3-1 Effect of nitrogen fertilization on shoot biomass accumulation on one genotype of the Populus pseudo-backcross population.
3-2 Map location of all 62 QTL identified in our experiment.
4-1 Distribution and correlation of biomass weight and lignin content.
4-2 Fine mapping the pleiotropic QTL interval of LGXIII.
4-3 Venn diagram depicting the number of genes located in the QTL interval (blue circle), the number of genes with cis- and trans-regulated eQTL mapped between markers SSR5 and SSR8 (green circle), and genes correlated with levels of cellulose, lignin, as well as total biomass at a false discovery rate (FDR) of 10%.
4-4 Cis-eQTL and expression correlation with lignin and total biomass for the two candidate genes identified with a genetical genomics approach.
4-5 Co-expression of members of the biochemical pathways involved in lignin biosynthesis with the gw1.41.566.1 gene.
4-6 Cpg13 protein structure containing a secretory signal peptide (aa 1-31) and a conserved domain of unknown function (DUF579, aa 162-304).
4-7 Expression of cpg13 is highest in tissues undergoing secondary cell wall formation.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

INTEGRATING GENOMICS AND QUANTITATIVE GENETICS FOR DISCOVERY OF GENES THAT REGULATE BIOENERGY TRAITS IN WOODY SPECIES

By

Evandro Novaes

May 2010

Chair: Matias Kirst
Major: Forest Resources and Conservation

Wood can provide a renewable source of energy to sustain our economic development. To meet the increasing demand for wood and bioenergy without harvesting natural forest ecosystems, it is critical that both the productivity and the wood quality of fast-growing, short-rotation plantation forests be improved. Eucalyptus and Populus are the fastest-growing woody species in tropical and temperate conditions, respectively. To efficiently improve these species, it is important that genetic and genomic resources are accessible to assist breeding and allow identification of genes regulating bioenergy traits. With its genome sequenced and hundreds of thousands of expressed sequence tags (ESTs) available, Populus has more developed genomic resources than Eucalyptus. To address this limitation in Eucalyptus and allow comparative genomics with Populus, we utilized 454 sequencing technology to generate 148 Mbp of expressed sequences and survey the nucleotide diversity in >2,000 genes of Eucalyptus grandis [Chapter 2, published in BMC Genomics, 2008]. This work is contributing to the annotation of the forthcoming Eucalyptus genome sequence.

Working with poplars, we have identified one genomic region (QTL) on chromosome XIII that is significantly associated with biomass growth and composition, based on relative levels of wood cellulose and lignin [Chapter 3, published in New Phytologist, 2009]. This QTL explains 56% of the heritable variation in cellulose-to-lignin ratio, as well as 20-25% of the heritable variation of several productivity traits, including stem diameter and biomass accumulation in root and shoot. By integrating the multiple genomic resources available for Populus, we identified the gene cpg13 (carbon partitioning and growth in LG13) as the likely regulator underlying this QTL [Chapter 4]. Microarray analysis shows that the expression of cpg13 is cis-regulated and highly correlated with levels of wood cellulose and lignin, as well as with the expression of genes in the phenylpropanoid pathway. Cpg13 has highly conserved homologues in all plant species sequenced so far, but their molecular function is currently unknown. Characterization of cpg13 is expected to illuminate the function of these homologues and to uncover the molecular mechanisms involved in the coordinate regulation of wood composition and tree growth.
CHAPTER 1
INTRODUCTION AND LITERATURE REVIEW

Forests have had a significant influence on the evolution and history of mankind. Initially, humans relied heavily on forests to hunt and gather food. When humans learned to domesticate animals and propagate seeds, forests started being replaced by agricultural fields. The reduction of forest land was furthered by the need for energy and wood to sustain economic development in recent centuries. Currently, forests cover 30% of the global land area and still provide significant socioeconomic benefits to humans. Worldwide, the forestry sector employs 15.7 million people and generates US$468 billion in total gross value. In the US, forests cover 303 million hectares (ha), and the US forestry sector directly employs 1.1 million people and generates US$108 billion.

In addition to their irrefutable socioeconomic importance, forests have a pervasive influence on the health and sustainability of terrestrial ecosystems. Forests are the most biodiverse of the land ecosystems. The biogeophysical (albedo and evapotranspiration) and biogeochemical (carbon cycle) processes of forests directly influence planetary energetics, the hydrologic cycle, atmospheric composition and, consequently, climate. Because of their contribution to climate, renewed importance is being attributed to the role of forests in easing the negative effects of the accumulation of greenhouse gases in the atmosphere [4, 5]. Estimates indicate that forests store 45% (283 gigatonnes) of the terrestrial carbon [1, 3]. Furthermore, wood is a renewable source of energy that could partly replace petroleum fuel [6, 7].

To effectively utilize the potential of forests as a reservoir of carbon and energy, it is important to select suitable genetic material, i.e., species that can produce large volumes of biomass with high energy potential and that can grow in marginal soils where they will not compete with agricultural crops [8, 9]. These species are often referred to as bioenergy woody crops and include species of the genera Eucalyptus and Populus [8, 9]. Both genera include fast-growing, short-rotation species with a wide capacity for acclimation. Most Eucalyptus species are adapted to a tropical climate, while poplars grow mostly in temperate conditions. These two genera were genetically studied at different levels in this dissertation. Chapter 1 reviews the biology of Eucalyptus and Populus and the state of the genetic literature on them, as well as the genomic resources available for these species.

The Eucalyptus and Populus Genera

Eucalyptus Biology, Plantation and Breeding

The Eucalyptus genus includes approximately 700 species, several of which are still unnamed. The latest classification divides the species into 13 subgenera, with most included in subgenera Eucalyptus (former subgenus Monocalyptus) and Symphyomyrtus. Eucalypts are long-lived evergreen trees that can reach 400-500 years in age. Eucalyptus species are native to Australia and some of its northern islands [12, 13]. Eucalypts have hermaphrodite flowers that are pollinated by animals, mainly insects and birds. Although self-compatible, eucalypts have a mixed mating system with preferential outcrossing. The outcrossing rate ranges from 0.62 to 0.84 in natural populations and is estimated to be ~0.90 in exotic (i.e., outside the natural range) conditions. Outcrossing is favored by various mechanisms, including protandry (the stigma is receptive only after pollen starts shedding) and reduction of seed yield and seedling vigor when self-pollination occurs [13, 15]. Although not common, many species of Eucalyptus are reproductively compatible and do hybridize in nature.
Hybridizations between species of different subgenera are generally not successful. In their native habitat, eucalypts occupy a wide range of climates and soils, from sea level to the alpine tree line, from high-rainfall to semiarid zones, and from the tropics to latitude 43° south [16, 17]. The remarkable capacity for acclimation, fast growth rate, stem straightness and wood quality have led to the introduction of Eucalyptus species in exotic plantations around the world [13, 17]. Eucalyptus is the most widely planted woody angiosperm (hardwood), occupying 18 million ha, predominantly in India (45%), Brazil (17%) and China (7%). The most important species for commercial plantations belong to the subgenus Symphyomyrtus. Examples include E. grandis, E. globulus, E. urophylla and E. camaldulensis, which together with their hybrids account for about 80% of the global area of eucalypt plantations. There is also great interest in introducing Eucalyptus into subtropical and temperate climates, motivating efforts to improve the freeze resistance of the species [20, 21]. Eucalypt forests contribute to diverse end uses, including solid wood products, energy, pulp and paper, and essential oils used as insect repellents and by the pharmaceutical industry. Because of their fast biomass growth, Eucalyptus species have attracted renewed interest as bioenergy crops [8, 9].

Despite the remarkable growth rate of several Eucalyptus species, which commonly achieve an average increment of 30-50 m³ ha⁻¹ yr⁻¹ in Brazil and South Africa, domestication of Eucalyptus is in its infancy compared to agricultural crop species. Given the inherently long time for trees to mature and flower, eucalypt breeding programs are only one or two generations of selection away from the wild. However, significant genetic gains were obtained in the early stages of domestication through species and provenance selection, followed by the establishment of seed orchards. Currently, the most advanced breeding programs are focused on exploiting the high heterosis found in interspecific hybrids. Interspecific hybridizations were initially utilized to incorporate disease resistance genes into susceptible species, such as canker resistance from E. urophylla into susceptible E. grandis. More recently, hybridizations are focusing on improving wood quality by crossing fast-growing species like E. grandis with those of superior wood composition like E. globulus [16, 24]. Clonal propagation technologies are well established for Eucalyptus species, allowing additive and nonadditive genetic effects from these interspecific crosses to be immortalized and multiplied in clonal forests. Currently, the use of hybrids and clonal propagation has become key for the improvement of cellulose yield. The result has been a reduction of 20-40% in the amount of wood necessary to produce the same amount of pulp. This improvement resulted in gains along the entire production chain because fewer trees have to be harvested and transported, mitigating the expansion of plantations and reducing effluent waste.

Although successful, traditional tree breeding is a slow process. In practice, the progress of forest tree breeding is limited by the long lag time between seed germination and flowering, and because most of the relevant forest traits can only be measured when the tree reaches maturity. For instance, the reported gains in wood-specific consumption for cellulose production are the result of over two decades of selection. Thus, even though traditional breeding strategies have been successfully utilized by the pulp and paper industry, additional technologies will be required to meet the increasing demand for wood and renewable energy.

Populus Biology, Plantation and Breeding

The most accepted classification of the Populus genus recognizes 29 species divided into six sections. However, the number of species described in the literature varies from 20 to 90 depending on authorship, as reviewed by Davis. The controversy arises from the large phenotypic plasticity observed across the widely distributed natural populations of poplars.
Moreover, some species have overlapping ranges where hybridization can occur, confusing the classification of hybrids and pure species. Poplars have a broad distribution in the northern hemisphere, accounting for 70 million ha of natural forests concentrated in Canada (40%), Russia (31%) and the United States (25%). The wide distribution is highlighted by the transcontinental range of P. balsamifera and P. tremuloides. No Populus species occurs naturally in the southern hemisphere.

Like Eucalyptus, poplars are hardwood (angiosperm) tree species. Unlike Eucalyptus, poplars are dioecious (separate male and female trees), wind-pollinated, and deciduous. In addition to effective pollen and seed dispersal by wind, most poplar trees propagate clonally in natural stands by means of root-borne sucker shoots. Similarly to Eucalyptus, poplars can produce interspecific hybrids, especially when crosses are made between species of the same section. Poplars are among the fastest-growing tree species in the temperate zone, a characteristic related to their ecological role as pioneer plants. Because of this fast growth rate, poplars are a dominant component of riparian ecosystems in North America, providing important habitat for fish and wildlife. Studies have demonstrated that the diversity of poplars plays a major role in community and ecosystem processes. For instance, the composition of fungi, arthropods, mammals, and other animals is affected directly or indirectly by interactions with poplar trees.

Natural populations of poplars exhibit extensive genetic diversity, ranging from 0.5 to 2%. The efficient wind dispersal of their genetic material results in little differentiation between natural populations of poplars [36, 37], but some level of structure has also been identified. Populus species also acclimate to a wide range of environmental conditions. This acclimatability has facilitated the planting of poplars in conditions as diverse as the hot Chinese desert and the cold, windy South American Andes. Established plantations of poplars total 6.7 million ha worldwide, concentrated mostly in China (73%) and India (15%).
The majority (56%) of the global planted area is destined for wood production and the rest for environmental purposes. Poplar forests provide many benefits, including sources of fuel, fiber, and lumber. The quality of poplars as a founder species can be harnessed to rehabilitate disturbed and fragile ecosystems. Poplars are also used as windbreaks, to stabilize soil banks and prevent erosion, and in phytoremediation to decontaminate polluted soils [28, 38]. Recently, poplars have received increasing attention as a renewable source of energy biomass, and breeding efforts are being focused on that end use for the genus [27, 39].

As with Eucalyptus, intercrossability among species and capacity for clonal propagation are the foundation for breeding poplars. Interspecific breeding and selection among poplars has been pursued since the 1930s in the Northeastern United States. The main objectives have been to produce clones with improved growth rate, resistance to Septoria stem canker (and other diseases), and a reliable ability to form roots from dormant hardwood cuttings. The new focus on bioenergy has expanded the aim of poplar breeding in many private companies, which now also select trees based on wood density, calorific value, and cellulose and lignin content. As with Eucalyptus and other perennial species, the main drawback of traditional poplar breeding is the long selection cycle. Populus trees usually reach sexual maturity at 5-15 years of age. Therefore, technologies that decrease the amount of effort and time spent in evaluating and recombining the best poplar genotypes are much desired.

Genomic Resources Available for Eucalyptus and Populus

To efficiently exploit Eucalyptus and Populus species for bioenergy and other applications, it is important that the productivity and quality of biomass be improved. Traditional forest breeding has improved wood productivity, chemistry and physical properties. However, the long generation time renders traditional genetic improvement of trees too slow to meet the urgent need for renewable sources of energy.
Recent advances in genomics and biotechnology provide powerful tools to accelerate the selection/recombination cycles in tree breeding [16, 43]. Alternatively, these tools can also be harnessed to engineer the expression of genes that control
traits of importance for bioenergy. To exploit any of these strategies, it is crucial that the necessary genetic and genomic resources are readily accessible. Below I review the current state of the genomic resources available for Eucalyptus and Populus.

Sequences Available for Eucalyptus and Populus

Until recently, sequencing technology was dependent on electrophoresis to resolve the one-base difference between fragments terminated with labeled dideoxynucleotides [45, 46]. As a result of this low-throughput approach, complete genomes were only sequenced in a few model organisms or microorganisms with small genome sizes. For the remaining species, many of them economically important, only the expressed portion of the genome was being sequenced. Sequencing expressed sequence tags (ESTs), or complementary DNA (cDNA) synthesized from mRNA extracted from tissues or organs of interest, is a fast way of reducing the complexity of the genome to only its transcribed portion. ESTs can be used to quantify the nucleotide diversity in coding sequences, to determine the structure of genes, to annotate genomes, and to estimate transcript abundance [47, 48]. Although several private consortia have invested in the development of EST libraries for Eucalyptus, very few sequences are publicly available for the genus. Only ~37,000 eucalypt ESTs are currently deposited in the National Center for Biotechnology Information (NCBI) GenBank. This contrasts with the over 400,000 ESTs available for various species of Pinus and Populus. For Populus there are also 4,664 full-length cDNAs sequenced in a collective effort. The lack of publicly available expressed sequences limits the breadth of genetics and genomics research in Eucalyptus. The design of microarrays that comprehensively represent the transcriptome, for example, relies heavily on the availability of ESTs.
Currently, only 6 microarray experiments are deposited in the NCBI Gene Expression Omnibus (GEO) for Eucalyptus, versus 79 for Populus. Transcript levels constitute an important intermediary step in
the link between nucleotide diversity and phenotypic variation. As a result, analyses of gene expression can be crucial for identification of genes that regulate plant metabolism and traits [51, 52]. Although important, ESTs do not provide the full picture of genes, as introns, promoters and other regulatory loci are not represented among expressed sequences. Nucleotide polymorphisms within regulatory sequences are key sources of the variation in transcript levels [50, 51] and, consequently, phenotypes. In animals, most of the functional variability in DNA is contained in regulatory regions as opposed to coding sequence [57, 58]. Therefore, sequencing the entire genome is an important step towards finding polymorphisms associated with phenotypic variation.

The genome of a P. trichocarpa genotype (female clone Nisqually 1) was sequenced at 7.5x coverage by the Joint Genome Institute of the US Department of Energy (JGI DOE). The sequence has been available to the scientific community since 2004 and is currently in its second version of assembly and annotation. The latest release contains 403 million bases (Mb) of assembled sequence, 92% of which is genetically anchored to the 19 poplar chromosomes. Half of the assembled sequence is contained in the 9 largest scaffolds (N50 = 9), each of which is at least 18.8 Mb long (L50 = 18.8 Mb). The latest annotation predicts 45,778 gene models, 94% of which have start and stop codons. The availability of the poplar genome sequence has fueled genomics research in the species. The article describing the Populus genome has been cited 481 times and has been increasingly cited since its publication in September 2006 (Figure 1-1).

While the genome sequence of Populus is established and being extensively used, the genome sequencing of a Eucalyptus grandis genotype (BRASUZ clone) is still ongoing. The Eucalyptus genome is also being sequenced by the JGI DOE, and the initial assembly (8x coverage) is expected in June
2010. The 4.5x checkpoint assembly is already available to the community as a draft on the EucalyptusDB website (http://eucalyptusdb.bi.up.ac.za/). A genome sequence provides an essential framework in which genes, regulatory elements and molecular markers can be identified. However, a single genome does not display what geneticists and breeders are most interested in, i.e., the genetic diversity on which selection can act. Ultimately, many genotypes have to be sequenced to unveil the nucleotide diversity of a given species. Eighteen genotypes of P. trichocarpa and two of P. deltoides are being sequenced using Illumina technology. The objective is to discover single nucleotide polymorphisms (SNPs) that will be genotyped in an unstructured population to find associations with bioenergy-related traits (Tuskan G.A., personal communication). Similar whole-genome nucleotide diversity surveys will be needed in Eucalyptus to advance genetic studies of the species.

Molecular Markers Available for Eucalyptus and Populus

Resequencing is the ultimate way to comprehensively assess the existing nucleotide diversity across the entire genome. However, the cost associated with sequencing and reassembling a genome is still prohibitive for most applications. In addition, many polymorphisms in a genome are associated (i.e., in linkage disequilibrium), especially in populations with high levels of relatedness. Examples of such populations include some breeding and QTL (quantitative trait loci) pedigrees, where the progeny have only a few generations of recombination after the parents have crossed. These highly related pedigrees contrast with natural populations, where most individuals last shared a common ancestor thousands of generations ago.
Because in QTL pedigrees linkage disequilibrium (LD) may extend several Mb, genotyping a few hundred polymorphisms is enough to survey the nucleotide diversity of the genome. Molecular markers offer an inexpensive way of sampling the genetic variation across the genome, and are still the best choice to genotype crosses designed for QTL analysis.
In genetics, molecular markers are usually DNA fragments that reveal sequence polymorphisms in the genome. Several types of genetic markers are available and have been utilized to assay Eucalyptus and Populus DNA polymorphisms. However, some of these markers are increasingly less utilized due to labor-intensive genotyping (e.g., RFLP) or due to dominance and low transferability among different genotypes and pedigrees (e.g., RAPD and AFLP) [60, 61]. Another reason for replacing RAPD and AFLP markers is the difficulty in anchoring them to the genome sequence, because random fragments are amplified for genotyping. Having markers anchored to the genome sequence is crucial for comparative genomics and for positional cloning of genes affecting Mendelian or quantitative traits.

Codominance refers to the important capacity of markers to discriminate homozygous from heterozygous individuals. Two codominant markers widely used in animal and plant genetics are microsatellites (or simple sequence repeats, SSRs) and single nucleotide polymorphisms (SNPs). The first is highly multiallelic and easily genotyped with PCR (polymerase chain reaction), while the second is the most abundant type of DNA polymorphism. Because the primer or probe sequences used for genotyping are generally specific (20-mers or longer) and known, these markers are easily transferred to different pedigrees, easily shared between laboratories and easily anchored to the genome sequence.

The International Populus Genome Consortium (IPGC) provides a repository with 4,166 microsatellites (http://www.ornl.gov/sci/ipgc/ssr_resource.htm). More recently, 148,428 SSR primer pairs were designed from the genome sequence of Nisqually 1 and became publicly available. The lack of expressed and genomic sequences hinders construction of a similar microsatellite database for Eucalyptus species.
Even though some studies analyzed thousands of SSRs from private EST databases, the primer sequences were not released [64, 65]. As a result, fewer than a thousand
microsatellites are publicly available for Eucalyptus from several smaller-scale efforts [61, 66-74]. Hundreds of microsatellites are more than sufficient for applications such as clonal fingerprinting for breeding management, as well as for the analysis of genetic diversity and structure in natural populations. However, to finely dissect the genomic loci that regulate complex traits and effectively assist breeding, hundreds of thousands to millions of markers will be needed. While thousands of microsatellites will certainly be discovered with the E. grandis genome sequence, as was the case in Populus, a denser type of marker will be required.

SNPs are the most abundant type of polymorphism and are widely distributed in the genome. Because of their high frequency and wide distribution, SNPs are the marker of choice to scan genomes for statistical associations between DNA and phenotypic variation. The largest SNP discovery for any forest species came as a by-product of the Populus trichocarpa genome sequence. Because poplars are obligate outcrossers, the sequencing and assembly of a single heterozygous genotype (Nisqually 1) allowed detection of ~1 million SNPs at a frequency of one every ~386 bp. In transcribed sequences, the frequency of SNPs decreases to one every ~543 bp (Table 1-1). In the Eucalyptus genus, no study had estimated SNP frequency until publication of the research described in Chapter 2 of this dissertation. Using 454 pyrosequencing technology, we identified 23,742 SNPs among expressed sequences of E. 
grandis. In a recent study, 8,631 SNPs were identified by sequencing 23 genes from 1,764 pooled samples of individuals from four different Eucalyptus species. The main limitation of SNPs identified in these studies lies in the fact that forest species generally have an excess of low-frequency alleles [34, 78, 79]. Maintenance of rare alleles occurs frequently in trees because of their long-lived, outcrossing habit and large effective population
sizes. This results in limited loss of new mutations through genetic drift, compared to plants with frequent selfing. Consequently, most SNPs observed in poplars and eucalypts are population specific and are of limited use in different genetic backgrounds. The excess of nucleotide diversity and abundance of low-frequency SNPs hamper the design and operation of several high-throughput genotyping technologies, such as the SNP chips that are commonly used to genotype millions of human polymorphisms [81, 82]. Consequently, the usability of SNPs in forest species might depend on the discovery and selection of a set of common polymorphisms. The resequencing of 20 genotypes will unveil common polymorphisms in poplars (Tuskan G.A., personal communication). Even though sequencing is still an expensive way of genotyping SNPs, new high-throughput technologies might be harnessed in targeted genomic regions, with the advantage of identifying all alleles regardless of their frequency. Sequence capture technologies can genotype SNPs in specific regions such as regulatory and transcribed gene sequences [83, 84]. Capture of precise regions will allow the multiplex sequencing of several poplar and eucalypt individuals, reducing costs associated with SNP genotyping.

Utilizing Genomic Resources to Identify Genes Controlling Bioenergy Traits

Availability of thousands of molecular markers and a well-annotated genome sequence are only the first essential steps towards finding associations between genetic and phenotypic variation. Here I review the strategies that exploit these genomic resources to find genes that control economically important traits, such as biomass productivity and quality in woody species.

Forward and Reverse Genetics in Forest Species

Strategies to relate genetic variation to phenotypic changes can be broadly classified as reverse or forward genetics.
Reverse genetics focuses on the gene sequence and expression, searching for natural or induced (mutagenized) variation to find associated changes in the
physiology and phenotype. Conversely, forward genetics begins with analyses of natural or induced changes in the phenotype to search for the causal DNA polymorphism (Figure 1-2). Reverse genetics approaches to obtain mutants include homologous recombination, gene silencing, ectopic overexpression (e.g., using activation tagging), insertional mutagenesis (e.g., using T-DNA), induced point mutations (e.g., using TILLING), and others. The forward genetics strategy relies on molecular markers to genetically map Mendelian or quantitative trait loci (QTL) at high resolution and positionally clone the underlying gene. Identification of mutations, either before (reverse) or after (forward) identification of the associated phenotype, is greatly facilitated by the availability of a genome sequence [89, 90]. For example, the genome sequence can be used to design gene-specific primers to screen mutants in a reverse genetics strategy, or to identify genes within a QTL in a forward genetics context.

Large reverse genetics resources were generated and are being extensively utilized to identify the function of genes in herbaceous organisms such as Arabidopsis, maize and rice. In Arabidopsis, for example, 90% of the genes annotated in the genome were disrupted with multiple insertions of transfer DNA (T-DNA). This invaluable resource was made possible by sequencing and mapping more than 360,000 insertion sites on the genome [87, 89]. In forest species, a similar genome-wide knockout resource is not available, and reverse genetics has been applied on a gene-by-gene basis, either by down-regulating with RNAi silencing or by overexpressing [100, 101] the gene of interest. The generation of a comprehensive gene knockout resource like that of Arabidopsis is almost unfeasible in forest species. In most cases, T-DNA insertion results in a recessive hemizygous allele, i.e., the other homologue of the disrupted gene remains intact and functional.
Therefore, selfing is necessary to find homozygote knockout mutants. Because of high inbreeding depression and long time to achieve sexual
maturity, selfing is complicated in forest trees. In dioecious species such as poplars, knockout mutants would be obtained only after backcrossing or sibling mating. Because of this limitation, scientists have proposed the use of constitutive promoters, such as CaMV 35S, to produce dominant mutants that ectopically overexpress genes around the insertion site [102, 103]. Following this strategy, more than 2,600 activation tagging lines have been regenerated in Populus, and many associated morphological changes were observed among them [102, 104]. Genes that underlie some of these phenotypic alterations have been identified. However, the strategy is limited by the significant amount of labor and time consumed to regenerate transformed trees. As a result, identification of activation tags in most of the 45,778 genes of poplar would require significantly more resources than were used to generate the T-DNA lines of Arabidopsis, which has a smaller genome and a very straightforward transformation protocol.

Forward genetics has been extensively utilized in forest species for almost two decades. Several studies have mapped QTLs for traits ranging from tree height and trunk diameter to disease resistance and wood chemistry traits in Eucalyptus, Pinus [118-129] and Populus species.
One advantage of these studies over those that utilized reverse genetics comes from the possibility of realizing truly unbiased gene discovery with forward genetics: no preconceived idea about the function of the underlying gene is required. Nevertheless, while these QTL studies identified genomic regions associated with the phenotypic variation in forest trees, given their low resolution, they all failed to define the causative genes. The low resolution has been overcome in herbaceous species by analyzing thousands of individuals to find the necessary recombinants to positionally clone the causative genes. These large-scale experiments are very difficult in trees, given the extensive area required
and consequent difficulties in controlling the environmental variation. Also, QTLs identified in tree species typically do not have the same large effects that facilitated the positional cloning of QTLs in crop species.

In summary, while reverse genetics has been utilized to functionally characterize a few genes of tree species, most of these genes were selected because of prior knowledge of their function. Approximately half of the Populus genes have unknown function and are being largely ignored by those reverse genetics approaches. While forward genetics can identify the link between genetic and phenotypic variation irrespective of prior functional knowledge, the extremely low resolution of these studies hampers map-based cloning of the underlying genes in trees. Clearly, new approaches are needed to find genes that control the variation in economically important forest traits.

Association Genetics and Genetical Genomics: New Solutions for Forest Species

The inherent problem of low resolution of QTL approaches arises from the fact that only one or a few generations separate the parents from the analyzed progeny. As a result, linkage disequilibrium (LD) between linked loci remains very high. Association genetics was proposed to circumvent this problem by analyzing individuals sampled from natural populations that shared a common ancestor many generations ago. Among this sample of unrelated genotypes, historical recombinations have dissociated even closely linked loci that are a few hundred bases or kilobases apart in the genome.
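The contrast between a QTL pedigree and a natural population can be made concrete with the standard expectation for the decay of LD under random mating, D_t = D_0 (1 - r)^t, where r is the recombination fraction between two loci and t is the number of generations of recombination. The Python sketch below is illustrative only: the function name and the values of r and t are assumptions chosen for the example, not estimates from this study.

```python
def ld_remaining(r: float, generations: int) -> float:
    """Fraction of the initial LD (D_t / D_0) expected to remain after
    `generations` of random mating, for recombination fraction `r`."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction must be between 0 and 0.5")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return (1.0 - r) ** generations


# Assumed, illustrative values: a tightly linked pair of loci (r = 0.001)
# in a pedigree (2 generations) versus a natural population (10,000).
r = 0.001
for t in (2, 100, 10_000):
    print(f"t = {t:>6} generations: D_t/D_0 = {ld_remaining(r, t):.6f}")
```

Under these assumed values, nearly all of the initial LD persists after two generations, whereas almost none remains after thousands of generations, which is why mapping resolution differs so much between the two kinds of population.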
Therefore, a statistically significant association between marker and trait generally identifies either the causal polymorphism or the underlying gene [75, 147]. Because of the favorable resolution over traditional QTL studies, association genetics was proposed for forest trees, particularly for species with large and complex genomes such as those of conifers. The strategy has successfully identified genes associated with the phenotypic variation in cell wall microfibril angle in Eucalyptus, timing of bud set in Populus,
cold hardiness in Douglas fir, and drought response and wood property traits in Pinus [151, 152]. The limitation of these studies is that associations were tested only in a few candidate genes with a priori evidence for a role in the studied phenotypes. The restriction on the number of genes tested is largely due to the already mentioned technological difficulties in genotyping SNPs widely distributed in the genome of forest species. For genome-wide association studies (GWAS) in tree species, millions of markers will have to be genotyped to capture the existing nucleotide diversity.

While GWAS awaits the development of cost-effective SNP genotyping platforms for forest species, genetical genomics has been proposed as a way to identify genes underlying complex traits by capitalizing on the whole-genome coverage of QTL studies. The development of several genomics technologies, such as microarrays and mass spectrometry, enabled comprehensive profiling of gene expression, protein levels and metabolites. Jansen and Nap proposed the integration of these multidimensional molecular data in the genetic map of a QTL pedigree to identify the genetic loci that regulate complex variation, and called the strategy genetical genomics. Transcript, protein and metabolite levels are quantitative traits that represent intermediary, hierarchical steps in the link between genetic and phenotypic variation. Because these intermediary steps also respond to the cellular and extracellular environments, they may be better predictors of phenotypes and can more easily unveil causative genes than simply measuring the genetic variation with molecular markers. The ultimate result of a genetical genomics study is the identification of genes whose expression is controlled by the QTL region and is also highly correlated with the phenotypic variation.
Availability of a genome sequence is essential to deconvolute causal from indirect associations, as correlated genes physically located outside the QTL region are not directly controlling the phenotype.
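The selection logic described above can be sketched as a simple filter: a gene is retained as a candidate only if it lies inside the QTL interval on the genome sequence, its transcript level is regulated by that QTL, and its expression is correlated with the trait. The Python sketch below is a hypothetical illustration; the record fields, gene names, positions, and the correlation threshold are all assumptions made up for the example, not data or parameters from this work.

```python
from dataclasses import dataclass


@dataclass
class GeneRecord:
    name: str                  # hypothetical gene identifier
    linkage_group: str         # linkage group where the gene is located
    position_bp: int           # physical position in the genome sequence
    eqtl_overlaps_qtl: bool    # transcript level is regulated by the trait QTL
    trait_correlation: float   # correlation between transcript level and trait


def select_candidates(genes, qtl_group, qtl_start_bp, qtl_end_bp, min_abs_corr=0.5):
    """Keep genes that lie inside the QTL interval, whose expression is
    controlled by the QTL, and whose expression correlates with the trait."""
    return [
        g for g in genes
        if g.linkage_group == qtl_group
        and qtl_start_bp <= g.position_bp <= qtl_end_bp
        and g.eqtl_overlaps_qtl
        and abs(g.trait_correlation) >= min_abs_corr
    ]


# Made-up records for illustration only.
genes = [
    GeneRecord("geneA", "LG_XIII", 2_500_000, True, 0.72),
    GeneRecord("geneB", "LG_II", 9_000_000, True, 0.80),     # correlated, but outside the QTL
    GeneRecord("geneC", "LG_XIII", 2_900_000, False, 0.65),  # inside, expression not QTL-controlled
    GeneRecord("geneD", "LG_XIII", 3_100_000, True, 0.10),   # inside, weakly correlated
]
candidates = select_candidates(genes, "LG_XIII", 2_000_000, 4_000_000)
print([g.name for g in candidates])
```

In this toy example, geneB illustrates the indirect association mentioned above: its expression is correlated with the trait, but it is excluded because it is located outside the QTL region.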
Genetical genomics has been highly successful in medical research by identifying genes involved in predisposition to diabetes, obesity, cardiovascular, and other complex diseases. In plants, genetical genomics was applied in Arabidopsis [154, 155], barley [156, 157], maize [50, 158] and in tree species of the Eucalyptus [51, 159] and Populus genera. These studies identified a manageable number (15) of candidate genes within QTLs, but so far none has confirmed the role of any of these genetic elements with functional characterization, as has been done for mammalian diseases. Even though genetical genomics is only beginning to be applied in plants, it has already contributed to a better understanding of the genetic architecture of gene expression and metabolite synthesis. Most importantly, genetical genomics is starting to unveil how individual biological components (e.g., genes and metabolites) interact and respond to genetic and environmental perturbations. This represents the first step towards the ultimate goal of systems biology, i.e., predicting phenotypes based on the genetic and molecular composition of an organism and the environment/community where it lives and interacts.

Project Objectives

To fully exploit the desirable characteristics of Eucalyptus and Populus for bioenergy, it is imperative that we continuously improve their growth rate and biomass quality for conversion to biofuels. Genomic resources and biotechnologies are essential to circumvent the long generation time required for traditional methods of forest improvement. With hundreds of thousands of ESTs available and a well-annotated, extensively utilized genome sequence, Populus has genomic resources that largely surpass those that exist for Eucalyptus. To begin addressing this limitation, a large collection of expressed sequences from Eucalyptus grandis was generated and analyzed, with results described in Chapter 2.
The sequences were generated in a fast and cost-effective way with a new sequencing technology known as Roche 454. Chapters 3 and 4 demonstrate how the relatively extensive genomic data for Populus can be used
in an integrative fashion to identify genes that control important traits related to the potential of the species for bioenergy. More specifically, in Chapter 3 we performed a quantitative genetic analysis of growth and wood chemistry traits and found significant correlations among these phenotypes in a pseudo-backcross pedigree of Populus trichocarpa x P. deltoides. By genotyping microsatellite markers, the genetic variation could be accessed and linked to the phenotypic differences observed in the pedigree. This resulted in the identification of genomic regions (QTL) that control growth and wood chemistry traits. The objective of Chapter 4 was to identify the gene(s) underlying one major QTL mapped on LGXIII with pleiotropic effects on many of the growth and wood chemistry traits analyzed. By integrating the genetic and phenotypic variation with gene expression data and the genome sequence of Populus, we identified cpg13 as the most likely candidate. Finally, Chapter 5 summarizes the findings of this research and suggests future directions.
Figure 1-1. Number of publications per year referencing the Populus genome sequence publication since its release in September 2006 (Source: ISI Web of Knowledge). [Bar chart: years 2006 to 2009 on the horizontal axis; number of publications, 0 to 200, on the vertical axis.]
Figure 1-2. Representation of forward and reverse genetics approaches to find the association between the gene 4-coumarate:coenzyme A ligase (4CL) and lignin levels in the wood of poplars. The figure is based on a previous study that analyzed this gene with a reverse genetics strategy. [Diagram comparing mutant and wild type along the chain DNA (4CL gene), mRNA, protein (4CL enzyme), metabolite (lignin monomers) and phenotype (lignin levels). Reverse genetics: analyze mutants for the 4CL gene to then find a reduction in lignin levels. Forward genetics: map the variation in lignin levels on the genome to then positionally clone 4CL.]
Table 1-1. Frequency of SNPs across mapped and unmapped scaffolds from the genome sequence of Populus trichocarpa.

                      Genome                            Genes (a)
Scaffolds             Number of SNPs  Diversity (b)     Number of SNPs  Diversity
                                      (bp/SNP)                          (bp/SNP)
LG_I                  85,410          376.53            16,772          560.53
LG_II                 70,007          334.73            13,769          530.36
LG_III                46,293          376.88             9,721          550.07
LG_IV                 39,659          380.25             7,180          563.39
LG_IX                 33,874          366.41             7,980          544.19
LG_V                  51,571          330.33             8,383          569.19
LG_VI                 46,963          375.98             9,816          579.82
LG_VII                32,691          364.15             6,730          543.62
LG_VIII               42,758          360.87            10,362          534.63
LG_X                  47,329          405.96            11,533          585.26
LG_XI                 32,942          399.85             6,502          553.95
LG_XII                31,368          415.23             5,963          590.86
LG_XIII               32,977          348.60             8,179          432.94
LG_XIV                34,953          391.46             7,437          560.33
LG_XIX                24,744          413.30             5,832          477.51
LG_XV                 25,327          402.37             5,418          579.50
LG_XVI                26,454          484.55             6,465          549.56
LG_XVII               16,291          334.18             2,808          472.34
LG_XVIII              29,995          414.67             5,976          583.66
Unmapped scaffolds    327,755         436.27            64,742          499.63
Total or Average      1,079,361       385.63            221,568         543.07

(a) Genes include the 45,555 currently annotated gene models plus 10,288 additional less supported models. The whole gene sequence was considered, including its non-translated portions (introns, 3' UTR and 5' UTR).
(b) Sequence gaps of known size (symbolized in the genome sequence with Ns) were excluded from the calculation of genomic diversity.
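The "Total or Average" row of Table 1-1 can be checked directly from the per-scaffold rows. The minimal Python sketch below, which assumes the values exactly as transcribed in the table, sums the SNP counts and computes the unweighted mean of the per-scaffold diversity values; the variable names are chosen here for illustration.

```python
# (scaffold, genome SNPs, genome bp/SNP, gene SNPs, gene bp/SNP), from Table 1-1
ROWS = [
    ("LG_I", 85410, 376.53, 16772, 560.53),
    ("LG_II", 70007, 334.73, 13769, 530.36),
    ("LG_III", 46293, 376.88, 9721, 550.07),
    ("LG_IV", 39659, 380.25, 7180, 563.39),
    ("LG_IX", 33874, 366.41, 7980, 544.19),
    ("LG_V", 51571, 330.33, 8383, 569.19),
    ("LG_VI", 46963, 375.98, 9816, 579.82),
    ("LG_VII", 32691, 364.15, 6730, 543.62),
    ("LG_VIII", 42758, 360.87, 10362, 534.63),
    ("LG_X", 47329, 405.96, 11533, 585.26),
    ("LG_XI", 32942, 399.85, 6502, 553.95),
    ("LG_XII", 31368, 415.23, 5963, 590.86),
    ("LG_XIII", 32977, 348.60, 8179, 432.94),
    ("LG_XIV", 34953, 391.46, 7437, 560.33),
    ("LG_XIX", 24744, 413.30, 5832, 477.51),
    ("LG_XV", 25327, 402.37, 5418, 579.50),
    ("LG_XVI", 26454, 484.55, 6465, 549.56),
    ("LG_XVII", 16291, 334.18, 2808, 472.34),
    ("LG_XVIII", 29995, 414.67, 5976, 583.66),
    ("Unmapped scaffolds", 327755, 436.27, 64742, 499.63),
]

# Totals are plain sums of the SNP counts.
genome_snps = sum(row[1] for row in ROWS)
gene_snps = sum(row[3] for row in ROWS)

# Averages are unweighted means over the 20 rows (not weighted by length).
genome_mean = sum(row[2] for row in ROWS) / len(ROWS)
gene_mean = sum(row[4] for row in ROWS) / len(ROWS)

print(f"Genome: {genome_snps:,} SNPs, mean {genome_mean:.2f} bp/SNP")
print(f"Genes:  {gene_snps:,} SNPs, mean {gene_mean:.2f} bp/SNP")
```

The sums reproduce the reported totals (1,079,361 and 221,568 SNPs), and the unweighted means reproduce the reported averages (385.63 and 543.07 bp/SNP). The averages are therefore simple means across scaffold groups rather than genome-wide ratios of total length to total SNPs.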
CHAPTER 2
HIGH-THROUGHPUT GENE AND SNP DISCOVERY IN Eucalyptus grandis, AN UNCHARACTERIZED GENOME*

* Reprinted with permission from: Novaes E, Drost DR, Farmerie WG, Pappas GJ Jr, Grattapaglia D, Sederoff RR, Kirst M: High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics 2008, 9:312.

Introduction

DNA pyrosequencing using 454 Life Sciences technology, with its high throughput and cost-effectiveness, has been successfully applied to large-scale EST sequencing in maize, Medicago and Arabidopsis [165, 166], resulting in a significant contribution of additional ESTs for these species. However, these studies were carried out on organisms with extensive transcriptome sequences already available. Pre-existing sequences support the assembly of 454 reads into contigs, thereby minimizing the drawback of the short average read length (100 to 200 bp) produced by the pyrosequencing technology. The benefits from recent improvements in sequencing technologies may be even more valuable for plant species with high economic value but limited genomic resources. Yet it has not been shown whether the primary limitations of the pyrosequencing method (short read lengths and ambiguous homopolymer reads) can be overcome to produce useful information for species with essentially no prior gene sequence information.

The 454 sequencing also provides an opportunity to identify allelic variants by sequencing and aligning ESTs from several haplotypes. Recently, two 454 cDNA sequencing runs, each interrogating a single maize inbred line, were used to identify over 36,000 putative single nucleotide polymorphisms (SNPs). SNP discovery with 454 technology could also be accomplished by simultaneously sequencing multiple genotypes to sample the nucleotide diversity of an organism [168, 169]. However, the assembly of individual haplotypes is not feasible when sequencing a cDNA pool from highly heterozygous individuals. The unavailability
of haplotype sequences hinders the use of most statistics to calculate genetic diversity. Therefore, new models are needed to estimate population genetic parameters.

Here we describe sequencing, assembly, and SNP discovery from 454-derived EST sequences of Eucalyptus grandis, a species with virtually no public genome sequence information. Eucalyptus is the most widely planted woody angiosperm in the world. Because of its value as a renewable source of raw material for wood, paper products, and biofuels, E. grandis has been selected for sequencing by the U.S. Department of Energy/Joint Genome Institute for completion in 2010. Accurate annotation of the E. grandis genome sequence will rely heavily upon a repository of expressed sequences, as was demonstrated during the annotation of the Arabidopsis and Populus genomes. However, only 1,939 E. grandis expressed sequence tags (ESTs) are currently deposited in the National Center for Biotechnology Information (NCBI) database. Thus, rapid development of a large EST sequence collection will be crucial to support the E. grandis genome sequence annotation and for continued advancement of Eucalyptus genomics research.

With the purpose of generating the first broad survey of genes in a Eucalyptus species, we sequenced and assembled 148 Mbp of E. grandis ESTs from two GS 20 and one GS FLX 454 pyrosequencing runs. Assembled sequences (25.4 Mbp) were deposited in GenBank, representing a 37-fold enrichment in publicly available expressed sequences for the species. Sequences were generated from a normalized cDNA pool comprising multiple tissues and genotypes, promoting extensive gene discovery and a comprehensive survey of allelic variation in the transcriptome. We demonstrate that 454 reads can be assembled into contigs without utilizing traditional cDNA sequences as scaffolds. In addition, we show that SNPs detected from pyrosequencing a pool of genotypes are useful to reveal selection signatures among genes.
Pyrosequencing technology can rapidly provide a foundational public genomic resource for an economically important but previously uncharacterized species.

Materials and Methods

Plant Material and RNA Extraction

Three greenhouse-grown Eucalyptus grandis seedlings from each of seven open-pollinated families (21 genotypes in total) were dissected into xylem (Xy), phloem (Ph), roots (R), young leaves (YL), mature leaves (ML) and apical/lateral meristems (M). Tissues were immediately frozen in liquid nitrogen and stored. For each tissue, RNA was extracted by a standard protocol from a pool of equal proportions of plant material from all 21 genotypes. RNA concentration was estimated using an ND-1000 Spectrophotometer (NanoDrop USA, Wilmington, DE) and integrity was evaluated on an agarose gel stained with ethidium bromide.

cDNA Synthesis and Normalization

RNA isolated from each tissue pool was combined in varying proportions (10% YL, 10% ML, 15% R, 15% Ph, 20% M and 30% Xy) into a single pool in an attempt to maximize the diversity of transcriptional units sampled. Pooled RNA was DNase treated and purified using the RNeasy Plant Mini Kit (Qiagen USA, Valencia, CA). Full-length cDNA was synthesized from 2 µg of RNA using the Clontech SMART cDNA Library Construction Kit (Clontech USA, Mountain View, CA) according to the manufacturer's protocol, except that the Clontech CDS III/3' PCR primer was replaced with the Evrogen CDS-3M adaptor (Evrogen, Moscow). cDNA was amplified using PCR Advantage II Polymerase (Clontech USA, Mountain View, CA) in 16 thermal cycles (7 s at 95 °C, 20 s at 66 °C, and 4 min at 72 °C) and was subsequently purified using the QIAquick PCR Purification Kit (Qiagen USA, Valencia, CA). The cDNA was normalized using the Evrogen Trimmer Direct Kit (Evrogen, Moscow) to minimize differences in representation of transcripts. This normalization protocol is based on denaturation and reassociation of
cDNAs, followed by digestion with a duplex-specific nuclease (DSN). The enzymatic degradation occurs primarily on the highly abundant cDNA fraction. The single-stranded cDNA fraction was then amplified twice by sequential PCR reactions according to the manufacturer's protocol. Adaptors incorporated during the first-strand synthesis were partially removed by SfiI digestion (8 U/µg of cDNA). Normalized cDNA was purified using the QIAquick PCR Purification Kit (Qiagen USA, Valencia, CA).

454 Sequencing and Assembly

Approximately 15 µg of normalized cDNA were used for library construction and sequencing at the Interdisciplinary Center for Biotechnology Research (ICBR) at the University of Florida, following the procedures described by Margulies et al. Two sequencing runs were produced on the GS 20 platform, and one on the GS FLX. Bases were called with 454 software by processing the pyroluminescence intensity for each bead-containing well in each nucleotide incorporation.

An initial assembly of the sequences was performed with Newbler version 1.1.02.15 (454 Life Science, Branford, CT). Newbler considers the normalized intensity of each nucleotide flow, instead of individual base calls, which is better suited to assembling reads from sequencing-by-synthesis platforms such as 454. However, Newbler v.1.1.02.15 does not mask sequence repeats such as the adaptors used for the cDNA normalization. Thus, after initial assembly in Newbler of all reads except those containing adaptor sequences, further assembly was performed with Paracel Transcript Assembler (PTA) version 3.0.0 (Paracel Inc., Pasadena, CA). All contigs and singletons resulting from the Newbler assembly were combined with adaptor-containing reads for input into PTA. Using a threshold of 15, all sequences were masked for the presence of oligonucleotide adaptors used during cDNA library preparation and normalization. Low base-call quality data (score 10) were trimmed from the ends of individual sequences. Low
complexity sequence regions (simple sequence repeats) are identified and excluded from consideration during initial pairwise comparison, but are included during final alignment and consensus building. Assembly is performed in two stages. The first stage uses the Haste algorithm to build groups (or clusters) of sequences sharing a minimal amount of identifiable sequence similarity (threshold = 50). The second stage carefully assembles sequences within individual clusters into consensus transcripts using the software defaults, except for the parameters MinCovRep (500), InOverhang (30), EndOverhang (30), RemOverhang (30), QualSumLim (300), MaxInternalGaps (15) and PenalizeN (0). Files containing read sequences and quality scores were deposited in the Short Read Archive of the National Center for Biotechnology Information (NCBI) [accession number SRA001122]. Newbler and PTA assembly files were also submitted to NCBI.

SNP Detection and Validation

To detect SNPs in the cDNA pool, we used the consensus assembly generated from all sequencing runs as a reference sequence to which individual reads were aligned using GS Reference Mapper (454 Life Science, Branford, CT). Each read was aligned to only a single best homologous site in the reference sequence. Reads aligning equally well in more than one location in the reference were discarded. GS Reference Mapper only scores polymorphisms where two or more reads contain the variant allele. Additionally, at least two reads with the alternative allele must include at least 20 nucleotides of sequence on each side of the variable nucleotide and no more than one additional sequence polymorphism in this window. For the analysis reported here, we considered only single nucleotide polymorphisms (SNPs), excluding all indels and variants involving more than one nucleotide. We also imposed the constraint that the variant allele appear in at least 10% of the total number of reads covering the polymorphic site.
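The read-count constraints described above can be sketched as a simple filter. This is only an illustrative sketch, not the GS Reference Mapper implementation: the `SiteCounts` structure, the function name and the parameter names are hypothetical, and the sketch assumes that every read covering the site carries either the consensus or the variant allele.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Read counts at one candidate site (hypothetical structure)."""
    consensus_reads: int  # reads carrying the consensus allele
    variant_reads: int    # reads carrying the alternative allele


def is_high_confidence_snp(site, min_variant_reads=2, min_variant_fraction=0.10):
    """Apply the two read-count rules stated in the text.

    Rule 1: at least two reads must carry the variant allele.
    Rule 2: the variant allele must appear in at least 10% of the
            reads covering the site.
    Flanking-sequence checks are not modeled in this sketch.
    """
    total = site.consensus_reads + site.variant_reads
    if total == 0:
        return False
    if site.variant_reads < min_variant_reads:
        return False
    return site.variant_reads / total >= min_variant_fraction


if __name__ == "__main__":
    # 3 of 20 reads carry the variant: passes both rules.
    print(is_high_confidence_snp(SiteCounts(17, 3)))
    # 2 of 32 reads carry the variant: fails the 10% rule.
    print(is_high_confidence_snp(SiteCounts(30, 2)))
```

Keeping the two thresholds as parameters makes the trade-off explicit: raising the fraction removes more sequencing errors but also discards more rare alleles.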
To validate a sample of detected SNPs, we designed primers to amplify 500-700 bp of transcripts containing a large number of putative SNPs. Using the normalized cDNA as template, fragments were amplified using PCR with 30 thermal cycles of 94 °C for 30 s, 55 °C for 30 s and 72 °C for 35 s. The amplified fragments (amplicons) were purified using the QIAquick PCR Purification Kit (Qiagen USA, Valencia, CA). Each amplicon was sequenced with both forward and reverse primers using standard dideoxy-based technology and analyzed on the ABI 3730 platform (Applied Biosystems). Sequencing chromatograms were visually analyzed with Chromas 2.32 (Technelysium Pty. Ltd.), and SNPs were identified as overlapping nucleotide peaks. For a SNP to be verified, the variant allele had to show a chromatogram peak at least 50% higher than background peaks.

Analysis of Synonymous and Nonsynonymous Mutations

Protein-coding sequence for each consensus was delimited using the best BlastX hit (E-value ≤ 1e-5) against the Arabidopsis thaliana peptides translated from TAIR7 gene models. Codons for each consensus were identified and nucleotide degeneracy determined. The number of synonymous and nonsynonymous sites was calculated for each contig. Next, we determined whether SNPs positioned in consensus coding sequences introduce synonymous or nonsynonymous mutations by comparing the translated amino acids from the reference and variant sequences. We calculated the ratio of nonsynonymous to synonymous mutations (Ka/Ks) for each consensus following Hartl and Clark, with the exception that one unit was added to both the number of synonymous and the number of nonsynonymous substitutions. This was important to allow Ka/Ks estimation in cases where either type of substitution was not found. Had we not included this modification, genes with an excess of synonymous but no observed nonsynonymous substitutions would all have Ka/Ks equal to zero, regardless of their Ks value. On the
other hand, genes without any observed synonymous substitution would have undefined Ka/Ks because of division by zero.

Results

454 Pyrosequencing and Assembly of E. grandis ESTs

An E. grandis normalized cDNA library was synthesized from RNA of vegetative tissues sampled from 21 different genotypes. Two GS 20 and one GS FLX 454 pyrosequencing runs performed on the normalized cDNA pool generated 1,024,251 reads comprising 148.4 Mbp of sequence (Table 2-1). The GS FLX platform produced more reads than the GS 20, with an average read length 2× longer. To compare the length distributions of contiguous sequences (contigs) using the different 454 platforms, we separately assembled the reads from the two GS 20 runs, the GS FLX run alone, and all three sequencing runs combined (Table 2-2). The two GS 20 runs generated contigs measuring 131 bp on average, with 95% of the contigs being shorter than 250 bp. Despite generating only 25% more sequence than the GS 20 runs combined, the longer reads obtained with the GS FLX platform rendered contigs that average almost three times longer (353 bp). The GS FLX run allowed assembly of the E. grandis 454-derived expressed sequences into long contigs that could be more accurately annotated. Assembly of all three pyrosequencing runs generated 71,384 contigs with an average length of 247 bp, of which 5,838 (8%) measure more than 500 bp. The addition of GS 20 reads increased the number of short contigs in the full assembly, resulting in a shorter average contig length when compared to the assembly with GS FLX reads only.

Next we compared the sequences from 454 with 86,328 ESTs obtained from a proprietary database derived from sequencing a broad set of Eucalyptus cDNA libraries using conventional (Sanger) dideoxy-based sequencing (Table 2-1). The comparison shows that although Sanger contigs were built from a dataset with only one third the total base pairs of that derived from 454,
they are on average more than twice as long (Table 2-2). However, the larger number of reads generated by the 454 technology substantially improves the efficiency of gene discovery, as demonstrated below.

Annotation of the E. grandis ESTs

An E. grandis unigene set was generated by combining all 71,384 assembled contigs and 118,722 nonassembled reads (singlets) generated by the three 454 runs. The unigene set was annotated by searching for sequence similarities using BlastX against Arabidopsis (TAIR v. 7.0), Populus (JGI v. 1.1) and Oryza (TIGR v. 5.0) gene models. As expected, the likelihood of finding similarity to previously described gene models is highly dependent on the length of the query sequence (Figure 2-1A). A logistic regression testing the effect of sequence length on whether or not the query sequence has at least one BlastX hit (E-value ≤ 1e-5) was highly significant (p-value < 0.00001). For instance, sequences longer than 1000 bp have significant similarity (E-value ≤ 1e-5) with gene models from all three species in 96% of cases, whereas 88% of sequences shorter than 100 bp have no similarity to any annotated gene model. Among 118,013 unigenes longer than 100 bp, 38% have similarity to at least one gene model at an E-value of 1e-5, 28% at an E-value of 1e-10, and 15% at an E-value of 1e-20 (Figure 2-1B). The low proportion of BlastX hits is mainly due to the high frequency of shorter sequences (75th percentile = 252 bp).

Next, we determined the proportion of annotated gene models for which homology was detected among E. grandis unigenes measuring over 100 bp. Homology was detected to 45% of Arabidopsis, 39% of Populus and 22% of Oryza gene models (E-value ≤ 1e-5). The higher proportion of Arabidopsis genes that are apparent homologues to Eucalyptus is expected, as Eucalyptus is more closely related phylogenetically to Arabidopsis than to Populus or Oryza. Arabidopsis gene models are also more refined than those from the other plant species. Analyzing only the 22,032
Arabidopsis gene models for which there is any detected transcript evidence (TAIR v. 7.0) leads to a higher proportion of homologies: 58% with E-value ≤ 1e-5 and 39% with E-value ≤ 1e-20 (Table 2-3). Finally, we utilized the Gene Ontology (GO) classifications from the Arabidopsis best-hit gene models (E-value ≤ 1e-5). Proportions of best hits in each GO category were generally similar to those found in the Arabidopsis genome annotation (Figure 2-2). The GO annotation analysis reinforces the inference that a broad diversity of genes was sampled in our multi-tissue normalized cDNA pool.

Efficiency of Gene Discovery: 454 and Sanger Sequencing Comparison

To compare the efficiency of gene discovery in the two sequencing platforms (454 and Sanger), we established a unigene set for the Sanger-sequenced ESTs by combining the 21,432 contigs with 17,203 singlets. The Sanger unigene set has a total of 22.05 Mbp, and is therefore comparable to the 25.42 Mbp in the E. grandis 454 unigenes measuring over 100 bp. We first aligned the two unigene sets to each other using BlastN (E-value ≤ 1e-5) and detected 84% of the Sanger unigenes having a match to the 454 dataset but only 41% of the 454 unigenes having homology to the Sanger sequences (data not shown). This is an indication of a greater coverage of distinct cDNAs in the 454-derived sequences. However, it is possible that the higher frequency of shorter sequences found in the 454 dataset contributed to an increased number of no-hits, as shorter sequences are less likely to align with a significant E-value. To rule out this possibility, Sanger unigenes were also used in an analogous BlastX search against the Arabidopsis, Populus and Oryza gene models. For all organisms and all Blast thresholds applied, a larger number of gene models had similarity to the 454 unigenes than to those generated by Sanger sequencing (Table 2-3).
Therefore, the large number of reads generated by 454 pyrosequencing appears to maximize gene discovery in EST sequencing projects. On the other hand, mean Blast alignment lengths using the 454 unigenes are approximately 2× shorter than for the Sanger
unigene set (data not shown). Thus, although the 454 unigene set samples a broader diversity of transcriptional units, this occurs at the cost of decreased sequence length for the individual genes.

SNP Detection and Validation in the E. grandis 454-Generated ESTs

The GS Reference Mapper (454 Life Science) software was used to identify polymorphisms among ESTs by aligning individual reads against contigs from the assembly. For a sequence difference to be declared a true polymorphism, at least two individual reads aligning to the consensus must have the variant allele and at least two others must have the allele of the consensus. By applying this criterion, 30,108 variants were detected in 10,223 contigs (Table 2-4). We analyzed only single nucleotide polymorphisms (SNPs) and excluded all 821 indels and 635 variants involving more than one nucleotide. Also, we considered only the 23,742 higher confidence SNPs for which both alleles were present in at least 10% of the reads aligned at the polymorphic locus. Although this requirement reduces the sensitivity in detecting rare SNPs, it increases the specificity of true SNP detection by lowering the likelihood of including false variants that arise due to sequencing errors. Among both high and low confidence SNPs, the proportion of transition nucleotide substitutions (75 and 85%) was greater than the proportion of transversions (25 and 15%).

To validate SNPs detected by GS Reference Mapper, we PCR-amplified a sample of 43 contig sequences from the normalized cDNA used in the 454 sequencing. Each amplicon was sequenced bidirectionally (forward and reverse) using standard dideoxy-based sequencing in an ABI 3730. Sequencing chromatograms were analyzed with Chromas 2.32 (Technelysium Pty. Ltd.), and SNPs were identified as overlapping nucleotide peaks. The number of putative SNP loci encompassed in the sequences of each amplicon ranged from 3 to 15 (Table 2-5).
Of 337 SNP loci predicted to reside in the amplified sequences, 279 (82.8%) were validated.
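The validation rate can be checked with a few lines of arithmetic. The sketch below also extrapolates the rate to the 23,742 higher confidence SNPs; that extrapolation assumes the 337 resequenced loci are representative of the full SNP set, which is an assumption of this illustration (the amplicons were chosen for containing many putative SNPs). The function names are hypothetical.

```python
def validation_rate(validated, predicted):
    """Fraction of predicted SNP loci confirmed by resequencing."""
    return validated / predicted


def expected_false_positives(total_snps, rate):
    """Rough count of unconfirmed SNPs if the sample rate held overall.

    Assumption: the validated sample is representative of all SNPs.
    """
    return total_snps * (1.0 - rate)


if __name__ == "__main__":
    rate = validation_rate(279, 337)
    print(f"validation rate: {rate:.1%}")
    print(f"expected false positives: {expected_false_positives(23_742, rate):.0f}")
```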
Analysis of Synonymous and Nonsynonymous SNPs in E. grandis Genes

The proportion of mutations that change the amino acid sequence (i.e. nonsynonymous) relative to those that do not (i.e. synonymous) can indicate whether a gene is under purifying, neutral or diversifying selection. The large number of ESTs sequenced from a mixture of genotypes could provide information for a genome-wide analysis of gene evolution based on the ratio of nonsynonymous (Ka) to synonymous (Ks) substitutions. To carry out this analysis, we initially determined whether SNPs introduce synonymous or nonsynonymous mutations by (1) defining the sequence reading frame with a BlastX alignment against Arabidopsis peptides, (2) isolating codons containing SNPs, and (3) comparing the translated amino acids for each allele. Next, the degeneracy of nucleotides in the contig's coding sequence was evaluated to estimate the nonsynonymous to synonymous substitution rate (Ka/Ks).

We estimated Ka/Ks for 2,001 E. grandis contigs that have at least three high confidence SNPs and one positive BlastX hit against Arabidopsis gene models (E-value ≤ 1e-5). The distribution of Ka/Ks among these contigs was right skewed, with estimates ranging from 0.008 to 2.101 and averaging 0.30, suggesting that the majority of the genes are under purifying selection. Gene Ontology (GO) categories associated with contig annotations derived by homology to A. thaliana gene models allowed us to compare the frequency of GO class representation in the extremes of the Ka/Ks distribution. Two binary variables (purifying and diversifying) were created to classify contigs according to their Ka/Ks: contigs with Ka/Ks smaller than 0.15 (643 contigs) were classified as purifying, while those with Ka/Ks greater than 0.5 (273 contigs) were defined as diversifying.
Ka/Ks > 1.0 may be an overly stringent criterion of positive selection when estimated for the whole coding sequence, and therefore we utilized a lower threshold (0.5) for this analysis. A Fisher's exact test for each binary category was used to statistically define GO Biological Process classes enriched in the Ka/Ks distribution extremes.
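The steps described above (Ka/Ks with one unit added to each substitution count, classification by the 0.15 and 0.5 thresholds, and an enrichment test per GO class) can be sketched as follows. This is a minimal illustration under stated assumptions: Ka and Ks are taken as simple proportions of substitutions per site, the enrichment test is written as a one-sided Fisher's exact test, and all function and argument names are hypothetical. The source does not state whether a one-sided or two-sided test was used.

```python
from math import comb


def ka_ks(nonsyn_subs, syn_subs, nonsyn_sites, syn_sites):
    """Ka/Ks with one unit added to both substitution counts.

    The added unit keeps the ratio defined when no synonymous
    substitution is observed and non-zero when no nonsynonymous
    substitution is observed.
    """
    ka = (nonsyn_subs + 1) / nonsyn_sites
    ks = (syn_subs + 1) / syn_sites
    return ka / ks


def classify(ratio, purifying_below=0.15, diversifying_above=0.5):
    """Assign a contig to one extreme of the Ka/Ks distribution."""
    if ratio < purifying_below:
        return "purifying"
    if ratio > diversifying_above:
        return "diversifying"
    return "intermediate"


def fisher_enrichment_p(observed, in_class, in_go, total):
    """One-sided Fisher's exact test (assumed form).

    observed: contigs that are both in the selection class and GO class
    in_class: contigs in the selection class (e.g. purifying)
    in_go:    contigs annotated with the GO class
    total:    all contigs analyzed
    Returns P(X >= observed) under the hypergeometric distribution.
    """
    upper = min(in_class, in_go)
    favorable = sum(
        comb(in_go, k) * comb(total - in_go, in_class - k)
        for k in range(observed, upper + 1)
    )
    return favorable / comb(total, in_class)


if __name__ == "__main__":
    ratio = ka_ks(nonsyn_subs=0, syn_subs=5, nonsyn_sites=300, syn_sites=100)
    print(round(ratio, 4), classify(ratio))
```

Counting sites and substitutions per contig is the expensive part in practice; the functions above only show how the counts are combined once they are available.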
Table 2-6 depicts GO classes enriched (p-value < 0.05) among contigs on the two Ka/Ks extremes. Genes encoding proteins of the ribosome complex and involved in translation are the most significantly enriched (p-value = 0.0006) within the purifying classification. In addition, genes encoding proteins involved in nucleosome assembly, chromosome organization and the proteasome complex also appear to be constrained by purifying selection. Among GO categories of genes undergoing diversifying selection, there is enrichment only for those with unknown function and organismal development.

Nucleotide Diversity among the E. grandis Genes

In addition to the ratio of nonsynonymous to synonymous substitutions (Ka/Ks), a measurement of nucleotide diversity can also reveal differences in selection pressure acting on different genes. However, since our sequencing method generated ESTs from a pool of 21 E. grandis genotypes, identification of the genotypic origin of each read becomes impossible. As a result, the assembly of individual haplotypes is unfeasible and exact estimates of standard nucleotide polymorphism parameters could not be obtained. Thus, we developed a relative measurement of nucleotide diversity ($\hat{\theta}$) that normalizes the number of polymorphisms detected relative to the sequence length and depth of each contig. Sequence length and depth are variables affecting the likelihood of finding a SNP when it is present in a given gene:

$$\hat{\theta} = \frac{S + 1}{L \sum_{i=1}^{D-1} (1/i)}$$

where S is the number of SNPs detected in the contig, L is the contig sequence length and D is the sequencing depth, estimated by the average number of reads aligned to each nucleotide position during assembly of the contig. The constant 1 was added to the number of SNPs in the
numerator to enable comparisons with contigs containing no detected SNP. Contigs with extensive length and depth but for which no SNP is detected represent highly conserved genes. The measure uses the sequencing depth (D) rather than the number of different sampled haplotypes. These two variables are different because sequence reads from 454 represent a random sample from the pool of haplotypes in the cDNA library. Therefore some haplotypes may be sequenced multiple times, while others may not be sampled at all. For this reason, $\hat{\theta}$ should be treated as a relative measurement to compare the nucleotide diversity between contigs generated within this project.

We applied the measure only to contigs meeting a minimum length and an average sequencing depth of at least 10 reads/nt. These thresholds reduce imprecision in the proportion of SNPs per nucleotide as well as the likelihood that all reads aligned to the contig derive from a single haplotype. Three parameters were estimated: $\hat{\theta}_T$, which estimates the diversity on the entire contig, including its noncoding regions; $\hat{\theta}_N$, which estimates the diversity in nonsynonymous sites; and $\hat{\theta}_S$, which estimates the diversity in synonymous sites. As observed for the Ka/Ks ratio, the distributions of all three nucleotide diversity parameters (Table 2-7) appear to be right skewed, suggesting that the majority of E. grandis transcribed regions are constrained by purifying selection. Average values of $\hat{\theta}_S$ were more than 4× larger than the diversity in nonsynonymous sites ($\hat{\theta}_N$), offering further evidence for the action of natural selection on the genetic diversity detected in these genes.
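A relative diversity measure of this kind can be sketched in a few lines. The sketch assumes a Watterson-like form, (S + 1) / (L × sum of 1/i for i from 1 to D - 1), built from the definitions of S, L and D given in the text; the exact form should be checked against the original equation. Because D is an average and may not be an integer, the sketch rounds it, which is a choice made here for illustration only. The function name is hypothetical.

```python
def relative_diversity(num_snps, contig_length, depth):
    """Relative nucleotide diversity for one contig (assumed form).

    num_snps:      S, SNPs detected in the contig
    contig_length: L, contig length in nucleotides
    depth:         D, average number of reads per nucleotide position
    """
    # Assumption: round the average depth and require at least two
    # reads so that the harmonic sum has at least one term.
    n = max(2, round(depth))
    harmonic = sum(1.0 / i for i in range(1, n))
    # One unit is added to S so contigs without SNPs remain comparable.
    return (num_snps + 1) / (contig_length * harmonic)


if __name__ == "__main__":
    # Same length and SNP count: the deeper contig scores lower,
    # because more reads had the chance to reveal polymorphisms.
    print(relative_diversity(2, 500, 10))
    print(relative_diversity(2, 500, 40))
```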
The annotation of each contig from homology to A. thaliana gene models was again used to compare GO class representation in the extremes of the $\hat{\theta}_N$ distribution. Similarly to Ka/Ks, two binary variables were empirically created to identify genes on the conserved and diverse extremes of $\hat{\theta}_N$: conserved contigs were those with $\hat{\theta}_N$ smaller than 1.0 x 10^-3, and diverse contigs those with $\hat{\theta}_N$ greater than 3.5 x 10^-3. Fisher's exact test for each binary category detected GO classes enriched among each of the two $\hat{\theta}_N$ extremes (Table 2-8). As suggested by the Ka/Ks analysis for genes undergoing purifying selection, the conserved extreme of $\hat{\theta}_N$ is also enriched for genes encoding proteasome core factors involved in protein degradation. In addition, in the conserved extreme we also found enrichment for genes involved in malate metabolism. Among contigs classified as diverse, there is enrichment for genes of unknown function and for genes involved in defense response, including response to biotic stimuli.

Discussion

Our analysis of 148.4 Mbp of E. grandis expressed sequences generated with three 454 sequencing runs demonstrates that short reads produced with pyrosequencing technology can be assembled de novo into reasonably long contigs, an advantage for species with limited public genomic resources. The 25.4 Mbp comprised in our unigene set represents a 37-fold enrichment in the amount of publicly available E. grandis expressed sequences, and will provide substantial support for the genome sequencing and annotation that are currently under way. Longer reads from the GS FLX were essential to assemble a reasonable proportion of biologically relevant contig sequences. Although our sequencing effort produced three times more sequence (148 Mbp) than the control dideoxy-based EST library (Sanger, 45 Mbp), the 454 short-read-based contigs average only one third of the size. Despite this limitation, a substantial gain occurs at the level of gene discovery: the large number of reads in our 454 dataset leads to sampling of
sequences from a much broader variety of genes. Therefore, 454-based gene discovery projects represent a viable and perhaps preferable alternative to Sanger-based sequencing of EST libraries when a diverse sampling of genes is more important than obtaining transcript-length contigs. As GS FLX becomes the standard 454 pyrosequencing platform, large-scale EST sequencing and gene discovery projects will be more successful in the assembly of transcript-length contigs.

We also demonstrated that the detection of valid SNPs is possible by sequencing a pooled sample of highly heterozygous genotypes. By aligning the reads derived from cDNA of 21 E. grandis genotypes we were able to detect 23,742 SNPs and validate 83% of a sample of 337. Therefore, approximately 4,000 of the detected SNPs may be false positives, possibly arising from sequencing errors or alignment of paralogs. Paralogs that share high levels of sequence similarity may have been assembled in the same contig because they cannot be distinguished due to the short read length of 454 pyrosequencing. This would lead to the detection of false SNPs. On the other hand, a higher stringency of assembly raises the possibility of the opposite problem: the separate assembly of haplotypes from highly polymorphic genes. However, the assessment of the EST assembly quality in this study is difficult due to the lack of a reference genome sequence.
Nonetheless, the validation rate observed here was similar to the 85% reported for maize, where SNPs were detected by comparing sequences from two separate 454 runs, each interrogating a different inbred homozygous line.

A significant number of polymorphisms in our sample may have been overlooked because of the stringent methodology used to declare SNPs. For contigs with reasonable sequence coverage (length > 200 bp) and depth (> 10 reads on average per nucleotide) we detected one SNP for every 192 bp on average. This rate of SNP discovery appears low compared to previous reports in Eucalyptus [179, 180] and other forest
species [78, 79, 181-184]. There are at least three foreseeable reasons for missing true SNPs that are intrinsic to our experimental methodology and SNP detection approach. First, because 454 was used to randomly sequence cDNAs, not all genotypes were sequenced for every SNP locus; in fact, the average sequencing depth (6.7 reads/bp) in our experiment is far lower than the number of possible haplotypes in our sample. Secondly, the requirement that a SNP be observed in at least 10% of the reads aligned to a polymorphic site compromised the sensitivity required to detect rare alleles. Studies have demonstrated an excess of rare alleles in natural populations of forest tree species [78, 79, 183], which are probably largely discarded by our approach. Additionally, there is a relatively high level of relatedness among the 21 individuals utilized in our study. The sampled genotypes come from seven open-pollinated families, i.e. seven groups of three individuals that share one common maternal parent, limiting the genetic diversity sampled relative to what might be present in a similarly sized sample of unrelated trees. Finally, the detection of polymorphisms was likely hindered by the requirement that at least two reads containing the variant allele have at least 20 nucleotides of conserved sequence upstream and downstream of the locus. This requirement is intended to minimize the discovery of false polymorphisms due to the alignment of paralogs, a potentially significant problem when aligning short sequence reads. Therefore, only nucleotide variants in relatively conserved or recently derived paralogs may have been incorrectly identified as SNPs. The drawback is that true SNPs in hotspots of genetic diversity or in genes under high diversifying selection (e.g. disease-resistance genes) may be discarded. Considering the high diversity found in forest trees [78, 79, 181-184], the requirement for sequence conservation in regions surrounding a SNP may be too conservative.
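The loss of sensitivity for rare alleles can be illustrated with a simple binomial sketch. Several simplifications are assumed here and are not taken from the source: reads at a site are treated as independent random draws from the haplotype pool, all haplotypes are taken to be equally represented in the cDNA, and the pool is taken to hold 42 haplotypes (21 diploid genotypes), ignoring the relatedness discussed above. The function name and parameters are hypothetical.

```python
from math import comb


def detection_probability(allele_freq, depth, min_reads=2, min_percent=10):
    """Probability that a variant allele passes the read-count rules.

    allele_freq: frequency of the variant among haplotypes in the pool
    depth:       number of reads covering the site
    The variant must appear in at least `min_reads` reads and in at
    least `min_percent` percent of the reads (integer ceiling).
    """
    needed_by_fraction = (min_percent * depth + 99) // 100
    needed = max(min_reads, needed_by_fraction)
    return sum(
        comb(depth, k) * allele_freq**k * (1 - allele_freq) ** (depth - k)
        for k in range(needed, depth + 1)
    )


if __name__ == "__main__":
    # An allele carried by one of 42 assumed haplotypes, at a depth of
    # about 7 reads, versus an allele carried by half of the haplotypes.
    print(detection_probability(1 / 42, 7))
    print(detection_probability(0.5, 7))
```

Under these assumptions an allele present on a single haplotype is detected only rarely at the average depth reported, which is consistent with the argument that rare alleles are largely discarded.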
High-throughput DNA sequencing and SNP discovery from a pool of multiple genotypes may be a powerful approach for rapid assessment of genetic diversity and selection on a genomic scale. Genome-wide surveys of genetic diversity have only recently been reported in the model plant Arabidopsis. Here we attempted to generate an approximate estimate of genetic diversity for a broad sample of genes by adapting existing parameters of genetic diversity to our experimental methodology. The most commonly used measure of genetic diversity assumes a random haplotype sample from a population and is independent of the allele frequencies [174, 178]. In our study, the number of independent haplotypes sampled is unknown, as the same haplotype may be sampled repeatedly. Therefore we developed a modified nucleotide diversity measure suited to a multi-genotype cDNA pool. This modified measure is always underestimated relative to the standard one because the number of reads at a given SNP position is always equal or smaller than the effective (true) number of genotypes. The lower sensitivity to detect SNPs discussed previously also contributes to the underestimation. Our estimates are lower than those reported for genes of Populus by 9×, Douglas fir by 3.8×, loblolly pine by 2.2-2.7× [143, 181, 182] and Norway spruce by 1.7×. Although it is not possible to compare our estimates directly with those of other studies, the measure can be used to compare diversity among genes in this project.

As observed in other plant species [59, 177, 187], the ratio of nonsynonymous to synonymous substitutions (Ka/Ks) indicates that most genes of E. grandis are under purifying selection: the Ka/Ks distribution averages 0.30 and is heavily right skewed. Among genes predicted to be under
strong purifying selection based on the Ka/Ks ratio, there is enrichment for GO categories involving essential biological processes conserved across kingdoms, such as translation, ubiquitin-dependent protein degradation and nucleosome assembly by histones. rRNA and ribosomal proteins have been shown to be highly conserved between species and to evolve under strong purifying selection. Several studies also confirm that histone genes evolve constrained by negative selection.

Among the diverse extremes of the $\hat{\theta}_N$ and Ka/Ks distributions, there is enrichment for genes classified as unknown biological process, defense response and response to biotic stress, and multicellular development. Other researchers have already reported an excess of unknown function among genes under positive selection [177, 187, 194]. A possible explanation for this overrepresentation is that the category of uncharacterized genes may be enriched for duplicated genes in which relaxed selection constraints are leading to their diversification and/or eventual silencing. In fact, the category may include pseudogenes and transcribed but untranslated loci. It is also possible that duplicated genes have higher nucleotide diversity due to the assembly of paralogs. However, we do not anticipate false SNPs to bias Ka/Ks estimates because they are not expected to occur more or less frequently in synonymous versus nonsynonymous sites by chance.

Genes acting in defense response/response to biotic stimulus are frequently positively selected for diversification to compete with rapidly evolving avirulence genes of pathogens [185, 196]. We did find extensive diversity in nonsynonymous sites ($\hat{\theta}_N$) of most defense response genes, but the diversity was also high among synonymous sites and, as a result, these genes were not enriched among those under diversifying selection (measured by Ka/Ks). Similar results were reported in another study. One possible explanation is that positive selection
generally only operates in certain domains (i.e. leucine-rich repeat (LRR)) of resistance genes [185, 196], and we estimated Ka/Ks over the entire coding sequence. The enrichment of the multicellular development GO class among genes under diversifying selection is mainly due to the presence of NAC transcription factor genes. Transcription factors have been demonstrated to have an excess of positively selected genes in humans, and specifically one member of the NAC family was shown to be evolving rapidly in Arabidopsis. Finally, the literature accurately depicts relative differences of variability within gene sequences.
Figure 2-1. Proportion of E. grandis unigenes (contigs + singlets) without (-) and with homology to the Arabidopsis (A), Populus (P) and Oryza (O) gene models. Class APO, for example, indicates the proportion of unigenes that match at least one gene model from all three plant genomes, while class --- indicates the proportion of unigenes with no matches in any of the three genomes. A) Effect of sequence length on the proportion of homology to gene models (E-value ≤ 1e-5). B) Proportion of E. grandis unigenes longer than 100 bp with and without homology to gene models at three different E-values (1e-5, 1e-10 and 1e-20).
[Figure: y-axis, proportion of unigenes (0% to 100%). Panel A x-axis, length interval (bp): 101-250, 251-500, 501-750, 751-1000, > 1000. Panel B x-axis, E-value threshold: 1e-5, 1e-10, 1e-20. Legend classes: APO, -PO, A-O, AP-, --O, -P-, A--, ---.]
Figure 2-2. Proportion of categories of each Gene Ontology (GO) sampled by the E. grandis unigene sequences compared with the proportions found in the Arabidopsis genome annotation.
[Figure: proportion (0 to 40) per GO category, grouped into GO Biological Process, GO Cellular Component and GO Molecular Function. Series: E. grandis 454; Arabidopsis annotation.]
Table 2-1. Summary of the E. grandis expressed sequences generated with GS 20 and GS FLX pyrosequencing runs, and control ESTs obtained from dideoxy-based sequencing of an analogous Eucalyptus library

Run number          Method               Platform       Number of reads   Average length   Total bp
1                   454 pyrosequencing   GS 20          328,486           106.28 bp        34.9 Mbp
2                   454 pyrosequencing   GS 20          303,149           102.54 bp        31.1 Mbp
3                   454 pyrosequencing   GS FLX         392,616           209.89 bp        82.4 Mbp
454 all (3 runs)    454 pyrosequencing   GS 20 + FLX    1,024,251         145.24 bp        148.4 Mbp
Sanger (control)    dideoxy sequencing   ABI 3100       86,328            522.18 bp        45.1 Mbp

Table 2-2. Length distribution and characteristics of contigs assembled from the two GS 20 runs, one GS FLX run, the three 454 runs combined and from the control Sanger-sequenced ESTs

                             GS 20           GS FLX          454 all         Sanger (control)
Run(s) in assembly           1 + 2           3               1 + 2 + 3
≤ 100 bp                     18987 (29%)     52 (<1%)        10820 (15%)     0
101-250 bp                   42320 (66%)     17348 (62%)     35958 (50%)     0
251-500 bp                   2476 (4%)       5355 (19%)      18768 (26%)     7564 (35%)
501-750 bp                   535 (<1%)       2463 (9%)       2869 (4%)       9226 (43%)
751-1000 bp                  167 (<1%)       1314 (5%)       1396 (2%)       2830 (13%)
> 1000 bp                    99 (<1%)        1547 (6%)       1573 (2%)       1812 (8%)
Total contigs                64584 (100%)    28079 (100%)    71384 (100%)    21432 (100%)
Average contig length (bp)   130.6           353.22          247.16          623.35
Reads in contigs             80.72%          71.36%          88.41%          84.88%
Average reads/contig         7.89            9.98            12.69           3.42

Table 2-3. Number and percentage of gene models with matches against the E. grandis 454 unigenes and against the control Sanger-sequenced Eucalyptus unigenes using three different BlastX thresholds.
The number of genes in each organism is presented in parenthesis in the first column Organism BlastX threshold Matches against 454 unigenes Matches against Sanger unigenes Arabidopsis (31,921) 1e 5 14250 (45%) 10154 (32%) 1e 10 12347 (39%) 9561 (30%) 1e 20 9077 (28%) 8410 (26%) Arabidopsis with transcript evidence (22,032) 1e 5 12790 (58%) 9542 (43%) 1e 10 11265 (51%) 9029 (41%) 1e 20 8489 (39%) 8003 (36%) Populus (45,555) 1e 5 17724 (39%) 11580 (25%) 1e 10 15383 (34%) 10962 (24%) 1e 20 11190 (25%) 9701 (21%) Oryza (66,710) 1e 5 14510 (22%) 9893 (15%) 1e 10 12139 (18%) 9193 (14%) 1e 20 8393 (13%) 7834 (12%)
Table 2-4. Number of detected polymorphisms and affected contigs by variant type. The higher confidence SNPs were selected for further analysis

Variant type                                         Number of variants  Number of contigs containing SNPs
Indel                                                821                 704
Involving two or more nucleotides                    635                 537
Total SNPs                                           28,652              9,942
  Lower confidence SNPs (freq rare allele <10%)      4,910               1,089
    Transition                                       4,239               1,005
    Transversion                                     671                 405
  Higher confidence SNPs (freq rare allele ≥10%)     23,742              9,845
    Transition                                       17,871              8,394
    Transversion                                     5,871               3,881
TOTAL                                                30,108              10,223
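To make the categories of Table 2-4 concrete, the sketch below classifies a biallelic SNP as a transition or a transversion, and as lower or higher confidence from the frequency of its rare allele. The function names, the read-count inputs and the choice to treat a frequency of exactly 10% as higher confidence are assumptions made for illustration; this is not the SNP-calling pipeline used in this study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(allele_a, allele_b):
    """Return 'transition' or 'transversion' for two different nucleotides."""
    a, b = allele_a.upper(), allele_b.upper()
    valid = PURINES | PYRIMIDINES
    if a == b or a not in valid or b not in valid:
        raise ValueError("expected two different nucleotides from A, C, G, T")
    # Transitions stay within a chemical class (purine<->purine or
    # pyrimidine<->pyrimidine); transversions switch class.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def snp_confidence(rare_allele_reads, total_reads, threshold=0.10):
    """Hypothetical rule: rare-allele frequency below the threshold is 'lower'."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    frequency = rare_allele_reads / total_reads
    return "lower" if frequency < threshold else "higher"
```

For example, an A/G polymorphism supported by 5 of 20 reads would be labeled a higher confidence transition under these assumptions.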
Table 2-5. Number and percentage of non-validated and validated SNPs for each of the 43 amplicons sequenced with the dideoxy-based method.

Amplicon    Amplified contig  Non-validated  Validated   Number of predicted SNPs
amplicon01  KIRST.1015.C2     4 (44%)        5 (56%)     9
amplicon02  KIRST.2351.C2     4 (44%)        5 (56%)     9
amplicon03  KIRST.1461.C1     2 (40%)        3 (60%)     5
amplicon04  KIRST.1992.C1     3 (38%)        5 (63%)     8
amplicon05  KIRST.25.C1       4 (36%)        7 (64%)     11
amplicon06  KIRST.1936.C1     3 (33%)        6 (67%)     9
amplicon07  KIRST.12521.C1    2 (33%)        4 (67%)     6
amplicon08  KIRST.15421.C1    2 (33%)        4 (67%)     6
amplicon09  KIRST.2632.C1     1 (33%)        2 (67%)     3
amplicon10  KIRST.4036.C2     4 (33%)        8 (67%)     12
amplicon11  KIRST.5530.C1     2 (33%)        4 (67%)     6
amplicon12  KIRST.3079.C1     2 (29%)        5 (71%)     7
amplicon13  KIRST.15389.C1    1 (25%)        3 (75%)     4
amplicon14  KIRST.486.C3      2 (25%)        6 (75%)     8
amplicon15  KIRST.823.C1      2 (22%)        7 (78%)     9
amplicon16  KIRST.854.C4      2 (20%)        8 (80%)     10
amplicon17  KIRST.2687.C1     1 (20%)        4 (80%)     5
amplicon18  KIRST.4822.C1     1 (20%)        4 (80%)     5
amplicon19  KIRST.340.C6      2 (18%)        9 (82%)     11
amplicon20  KIRST.11157.C1    1 (17%)        5 (83%)     6
amplicon21  KIRST.152.C3      1 (17%)        5 (83%)     6
amplicon22  KIRST.8182.C6     1 (17%)        5 (83%)     6
amplicon23  KIRST.1003.C1     1 (14%)        6 (86%)     7
amplicon24  KIRST.1268.C4     1 (14%)        6 (86%)     7
amplicon25  KIRST.1975.C1     1 (14%)        6 (86%)     7
amplicon26  KIRST.4785.C1     1 (14%)        6 (86%)     7
amplicon27  KIRST.52.C8       1 (14%)        6 (86%)     7
amplicon28  KIRST.340.C1      2 (13%)        13 (87%)    15
amplicon29  KIRST.17053.C1    1 (13%)        7 (88%)     8
amplicon30  KIRST.52.C1       1 (13%)        7 (88%)     8
amplicon31  KIRST.8655.C1     1 (8%)         11 (92%)    12
amplicon32  KIRST.1285.C3     1 (7%)         14 (93%)    15
amplicon33  KIRST.1441.C5     0 (0%)         5 (100%)    5
amplicon34  KIRST.17202.C1    0 (0%)         8 (100%)    8
amplicon35  KIRST.2273.C1     0 (0%)         7 (100%)    7
amplicon36  KIRST.2790.C1     0 (0%)         11 (100%)   11
amplicon37  KIRST.2900.C3     0 (0%)         4 (100%)    4
amplicon38  KIRST.34.C15      0 (0%)         8 (100%)    8
amplicon39  KIRST.344.C1      0 (0%)         7 (100%)    7
amplicon40  KIRST.4650.C2     0 (0%)         7 (100%)    7
amplicon41  KIRST.5060.C2     0 (0%)         10 (100%)   10
amplicon42  KIRST.5120.C1     0 (0%)         10 (100%)   10
amplicon43  KIRST.6233.C1     0 (0%)         6 (100%)    6
TOTAL       43 contigs        58 (17%)       279 (83%)   337
Table 2-6. Biological process GO categories enriched (p value < 0.05) in each of the two extremes of the Ka/Ks distribution

Ka/Ks extreme (a)  GO category                                   Proportion out extreme  Proportion in extreme  p value
"purifying"        translation                                   0.0400                  0.0769                 0.0006
"purifying"        ubiquitin-dependent protein catabolic process 0.0092                  0.0280                 0.0023
"purifying"        nucleosome assembly                           0.0008                  0.0098                 0.0039
"purifying"        chromosome organization and biogenesis        0.0000                  0.0070                 0.0056
"purifying"        ribosome biogenesis and assembly              0.0108                  0.0252                 0.0158
"purifying"        response to hydrogen peroxide                 0.0031                  0.0126                 0.0169
"purifying"        response to high light intensity              0.0031                  0.0112                 0.0326
"diversifying"     biological_process_unknown                    0.1703                  0.2751                 0.0002
"diversifying"     multicellular organismal development          0.0028                  0.0175                 0.0129

(a) The purifying extreme is composed of contigs with Ka/Ks smaller than 0.15, while the diversifying extreme has contigs with Ka/Ks greater than 0.50

Table 2-7. Distribution summary of three nucleotide diversity parameters estimated for 2,392 E. grandis contigs

Parameter  Mean          Median        Range
T          1.86 x 10^-3  1.65 x 10^-3  1.22 x 10^-4 to 9.11 x 10^-3
N          1.81 x 10^-3  1.49 x 10^-3  1.35 x 10^-4 to 15.14 x 10^-3
S          7.88 x 10^-3  6.19 x 10^-3  6.67 x 10^-4 to 48.15 x 10^-3

Table 2-8. Biological process GO categories enriched (p value < 0.05) in each of the two extremes of the N (nonsynonymous nucleotide diversity) distribution.

N extreme (a)  GO category                                   Proportion out extreme  Proportion in extreme  p value
"conserved"    malate metabolic process                      0.0007                  0.0093                 0.0056
"conserved"    ubiquitin-dependent protein catabolic process 0.0131                  0.0278                 0.0314
"diverse"      defense response                              0.0070                  0.0302                 0.0069
"diverse"      biological_process_unknown                    0.1647                  0.2362                 0.0134
"diverse"      response to biotic stimulus                   0.0032                  0.0151                 0.0480

(a) The conserved tail is composed of contigs with N smaller than 1.0 x 10^-3, while the diverse tail has contigs with N greater than 3.5 x 10^-3.
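The two Ka/Ks extremes defined in the footnote of Table 2-6 amount to a simple threshold rule, sketched below. The function name and the "intermediate" label for contigs between the two cut-offs are assumptions introduced for this example; only the 0.15 and 0.50 thresholds come from the table footnote.

```python
def ka_ks_extreme(ka_ks, purifying_below=0.15, diversifying_above=0.50):
    """Assign a contig to a Ka/Ks extreme using the Table 2-6 cut-offs."""
    if ka_ks < 0:
        raise ValueError("Ka/Ks cannot be negative")
    if ka_ks < purifying_below:
        return "purifying"
    if ka_ks > diversifying_above:
        return "diversifying"
    # Label assumed for this sketch; the source names only the two extremes.
    return "intermediate"


# Hypothetical Ka/Ks values for three contigs
labels = [ka_ks_extreme(value) for value in (0.08, 0.30, 0.75)]
```

GO enrichment is then assessed by comparing category proportions inside and outside each extreme, as reported in the table.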
CHAPTER 3
QUANTITATIVE GENETIC ANALYSIS OF BIOMASS AND WOOD CHEMISTRY OF Populus UNDER DIFFERENT NITROGEN LEVELS*

* Reprinted with permission from: Novaes E, Osorio L, Drost DR, Miles BL, Boaventura Novaes CRD, Benedict C, Dervinis C, Yu Q, Sykes R, Davis M et al: Quantitative genetic analysis of biomass and wood chemistry of Populus under different nitrogen levels. New Phytologist 2009, 182(4):878-890.

Introduction

As a growing body of evidence supports the negative effects of accumulation of carbon dioxide (CO2) and other greenhouse gases in the atmosphere [4, 5], society is increasingly turning to forests and forest management for mitigating atmospheric CO2. Forests store approximately 45% of the terrestrial carbon, and cellulosic ethanol production from wood has great potential to diminish the need for fossil fuels, limiting atmospheric CO2 accumulation. Therefore, increasing the productivity of plantation forests could have a significant impact on society by simultaneously enhancing carbon sequestration and meeting greater demands for renewable wood and bioenergy products.

The chemical composition of wood, namely cellulose (45-50%), hemicellulose (25%) and lignin (25-35%), is important for its conversion into products and for carbon sequestration. Because lignin is richest in carbon and the most recalcitrant component of wood, higher proportions can translate into more carbon stored for longer periods of time. However, higher lignin contents may be undesirable for production of pulp/paper and cellulosic ethanol. For these applications lignin has to be extracted from wood with relatively harsh chemical treatments and high energy inputs [7, 99, 200]. One strategy for increasing carbon sequestration and improving pulp, paper and cellulosic biofuel productivity is to raise the lignin content in nonharvested roots while reducing lignin in the woody stem. Therefore, the development of tree germplasm that is optimal for carbon sequestration and biomass conversion requires an understanding of the
genetic regulation of growth, carbon allocation among plant organs and carbon partitioning into lignin, cellulose and hemicellulose within organs.

In woody plants, studies have identified a consistent, significant correlation between biomass growth and wood composition that is genetically regulated [51, 201-203]. Pleiotropic genetic loci coordinating wood composition and stem growth are important targets for enhancement of wood products. Populus is an excellent model species to identify these genetic elements, given the availability of several segregating pedigrees, easy clonal propagation, well-established transformation protocols and the genome sequence of a P. trichocarpa genotype. Previous studies have identified quantitative trait loci (QTL) for biomass accumulation and allocation in segregating Populus families [134, 138, 204-206]. Above- and below-ground growth QTL have also been mapped in poplar grown under ambient and elevated CO2. However, few studies have attempted to map QTL for wood chemical composition in poplar or, more importantly, dissected the coordinate regulation of stem growth and wood chemistry.

In addition to genetics, environmental cues at the level of nutrient availability also have tremendous impact on tree growth, biomass allocation and wood composition. Nitrogen (N) is generally the most limiting nutrient for tree growth and carbon sequestration [208, 209]. Poplar trees exhibit extensive phenotypic plasticity in response to N. Nitrogen fertilization of young (<1 year old) poplar trees increases shoot biomass and wood cellulose content, while decreasing lignin content and S/G ratio. N supply was also observed to increase photosynthetic rates in mature leaves and to alter xylem fiber anatomy by thickening and shortening. Changes in mRNA abundance were also observed in poplar in response to N treatments. Although these studies have contributed to a better understanding of the effects of N on the tree
anatomy and physiology, little is still known about how genotypes interact with N and how these interactions are regulated in forest species.

The objective of this study was to define genetic loci controlling the phenotypic variation in biomass accumulation, carbon allocation and partitioning in an interspecific Populus pedigree grown under two different N conditions. To identify these regions, a QTL mapping experiment with 396 clonally replicated genotypes was completed. Extensive phenotypic plasticity in response to N availability was observed. The experimental design allowed analysis of genotype by N interaction for the first time in a tree QTL mapping population, resulting in the identification of 51 loci that control traits only under one N condition. Biomass accumulation, cellulose and lignin levels in wood were strongly correlated, genetically and phenotypically. More importantly, some QTL for biomass growth and wood composition mapped to the same genomic region, delineating potential pleiotropic regulators that coordinate these traits.

Materials and Methods

Phenotyping a Poplar Pseudo-Backcross Pedigree

An interspecific pseudo-backcross pedigree (family 52-124) composed of 396 genotypes was created by crossing the hybrid female clone 52-225 (P. trichocarpa [clone 93-968] × P. deltoides [clone ILL-101]) with the male P. deltoides clone D124. Six cuttings of each progeny genotype were planted in 41 cm deep pots (TPOT2, Stuewe & Sons, Inc., Corvallis, Oregon, USA) with Fafard 4MIX soil (Canadian Sphagnum Peat 40%, processed pine bark and vermiculite) in a greenhouse at the University of Florida. Plants were arranged in the greenhouse following a partially balanced incomplete block design with three biological replicates, six incomplete benches per replicate and two nitrogen treatments.
Within each replicate a coordinate system with 30 rows and 26 columns, where each plant was located in a row x column intersection, was utilized to account for possible systematic sources of variation across the greenhouse in our statistical model (see below). The total number of plants in the experiment was 2,376 (i.e. 396 genotypes × 2 N treatments × 3 biological replicates).

After potting, all plants were grown for 6 weeks with 5 mM NH4NO3 supplied with Hocking's complete nutrient solution in a flood irrigation system of ebb-and-flow benches. Benches were flooded twice a day for ~30 min. During the 5th week of growth, initial diameters and heights were measured. At the start of week six, one half of the benches in each replicate were flooded with Hocking's solution supplemented with 25 mM NH4NO3 and the other half was flooded with the same solution, but without any N supplement. After growing for 4 weeks with and without nitrogen, during the late summer and early fall of 2006 (a total of ten weeks after potting), plants were harvested and phenotyped for total height, stem diameter at 5 cm above the bottom of the cutting and number of internodes. After these measurements were taken, plants were dissected into roots, main stem, leaves and sylleptic branches with leaves attached. By peeling off the bark, the main stem was further separated into two tissues: xylem and bark+phloem. Tissue samples were put into barcode-labeled paper envelopes and dried in a 65 °C drying room or a freeze drier (Freezezone 18L Bulk Tray Dryer, Labconco, Kansas City, Missouri) depending on the tissue. Dried tissues were conditioned in the lab for >1 week before being weighed with high-precision balances. Weights of individual organs and tissues were summed to estimate above-ground (shoot) and total biomasses. Biomass allocation into root and shoot was calculated by the ratio of above-ground to root weights.

Wood chemical composition was measured in two biological replicates of the experiment from the bottom 5-10 cm of xylem of each plant's main stem. The xylem was ground in a Wiley mill to pass through a 20-mesh screen and re-dried.
Two subsamples (~4 mg each) of milled wood from each plant were scooped into 80 µl stainless steel sample cups that were subsequently covered with glass fiber paper (type A/D). Cups were automatically loaded into a pyrolyzer coupled to a molecular beam mass spectrometer (py-MBMS) at the National Renewable Energy Laboratory of the U.S. Department of Energy, Golden, CO (NREL-DOE). Samples were separated into two blocks and completely randomized. To evaluate the consistency and accuracy of the instrument, one loblolly pine and three Populus wood samples, previously characterized with wet chemistry, were analyzed in the py-MBMS systematically after every 44 runs with our samples. Each subsample was pyrolyzed for 30 seconds at 500 °C. Vapors from the pyrolysis were rapidly expanded under vacuum through a 0.012 crystal orifice, creating a molecular beam that was directed to the Extrel(TM) Model TQMS C50 mass spectrometer, yielding spectra ranging from 30 to 450 mass-to-charge ratio (m/z). Data were normalized against differences in total ions of each pyrolyzed sample. Peaks associated with each wood chemistry component, based on previous literature, were summed to produce a single estimate as indicated in Table 3-2 and described elsewhere.

Statistical Analysis of Phenotypic Data

Prior to analyses of each phenotypic trait, the data were evaluated for the presence of outliers, and measurements with recording errors were removed or corrected. PROC INSIGHT (SAS Institute Inc 9.1 2004) was used to check the distribution of residuals. Univariate analyses of each trait were performed using the SAS System for Mixed Models to separately account for the different sources of variation from our experiment. This analysis of variance allows accurate estimation of the desired effects (i.e. clone and nitrogen treatment) by controlling the influence of undesired sources of variation (i.e. replicates, bench, row and column within replicate). The mixed model utilized was:
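\[
y_{ijklmno} = \mu + r_i + t_k + rt_{ik} + b(r)_{j(i)} + c_l + rc_{il} + tc_{kl} + p(r)_{m(i)} + q(r)_{n(i)} + e_{ijklmno} \qquad \text{(Equation 1)}
\]

Where $y_{ijklmno}$ is the response of the oth ramet of the lth clone in the kth treatment of the jth bench within the ith replication, $\mu$ is the population mean, $r_i$ is the random effect of replication ~ NID(0, $\sigma^2_r$), $t_k$ is the fixed effect of the nitrogen treatment, $rt_{ik}$ is the effect of replication by treatment ~ NID(0, $\sigma^2_{rt}$), $b(r)_{j(i)}$ is the random effect of bench (incomplete block) within replication ~ NID(0, $\sigma^2_{b(r)}$), $c_l$ is the random effect of clone ~ NID(0, $\sigma^2_c$), $rc_{il}$ is the random effect of replication by clone ~ NID(0, $\sigma^2_{rc}$), $tc_{kl}$ is the random effect of treatment by clone ~ NID(0, $\sigma^2_{tc}$), $p(r)_{m(i)}$ is the random effect of row within replication ~ NID(0, $\sigma^2_p$), $q(r)_{n(i)}$ is the random effect of column within replication ~ NID(0, $\sigma^2_q$), and $e_{ijklmno}$ is the random error effect within the experiment ~ NID(0, R) and

\[
R = \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix}
\]

The genetic analyses were modified to include different error variances for each replication. $\sigma^2_r$, $\sigma^2_{rt}$, $\sigma^2_{b(r)}$, $\sigma^2_c$, $\sigma^2_{rc}$, $\sigma^2_{tc}$, $\sigma^2_p$, $\sigma^2_q$ and $\sigma_1^2$, $\sigma_2^2$, $\sigma_3^2$ are the variance components corresponding to replication, replication by treatment, bench within replication, clone, replication by clone, treatment by clone, row within replication, column within replication and residual effects of each replication, respectively. Variance components and genetic parameters were estimated by restricted maximum likelihood, using ASReml. Least square means used in the QTL analysis were calculated by including both clone effect and its interaction with treatment as fixed effects in the model.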
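One way to picture the replication-specific error variances is at the level of individual plants: every plant receives the residual variance of the replication it grew in, and residuals are otherwise uncorrelated. The sketch below builds such a diagonal matrix for a handful of plants. The function name, the input layout and the numeric variances are hypothetical and only illustrate the structure; the actual estimation in this study was done in ASReml.

```python
def residual_covariance(replication_of_plant, error_variance):
    """Diagonal residual covariance with one error variance per replication.

    replication_of_plant: list giving the replication label of each plant.
    error_variance: dict mapping replication label -> residual variance.
    """
    n = len(replication_of_plant)
    matrix = [[0.0] * n for _ in range(n)]
    for i, rep in enumerate(replication_of_plant):
        # Off-diagonal entries stay zero: residuals are assumed independent.
        matrix[i][i] = float(error_variance[rep])
    return matrix


# Hypothetical example: four plants located in replications 1, 1, 2 and 3
R = residual_covariance([1, 1, 2, 3], {1: 0.5, 2: 1.0, 3: 2.0})
```

Plants from the same replication share a diagonal value, which is what allows a noisier replication to be down-weighted when the clone effects are estimated.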
Clonal repeatability for each trait in the univariate analysis was calculated by ASReml with the estimates of variance components as follows:

\[
H^2 = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_{rc}^2 + \sigma_{tc}^2 + \left( \sum_{i=1}^{3} \sigma_i^2 / 3 \right)}
\]

with $\sigma_c^2$, $\sigma_{rc}^2$, $\sigma_{tc}^2$, $\sigma_1^2$, $\sigma_2^2$, $\sigma_3^2$ as previously defined.

To estimate phenotypic and genetic correlations, pairwise traits were analyzed with the following bivariate model equation:

\[
\mathbf{y} = X\boldsymbol{\beta} + Z_1\mathbf{c} + Z_2\mathbf{rt} + Z_3\mathbf{b} + Z_4\mathbf{rc} + Z_5\mathbf{tc} + Z_6\mathbf{p} + Z_7\mathbf{q} + \mathbf{e} \qquad \text{(Equation 2)}
\]

Where $\mathbf{y}$ is the vector of observations for traits 1 (t1) and 2 (t2), $\boldsymbol{\beta}$ is the vector of fixed effects (i.e. means and treatments) associated with the incidence matrix $X$, $\mathbf{c}$ is the vector of random clonal effects associated with the incidence matrix $Z_1$ and $\mathbf{c} \sim (0, G)$ with

\[
G = \begin{bmatrix} \sigma^2_{c(t1)} & \sigma_{c(t1t2)} \\ \sigma_{c(t1t2)} & \sigma^2_{c(t2)} \end{bmatrix},
\]

$\mathbf{rt}$, $\mathbf{b}$, $\mathbf{rc}$, $\mathbf{tc}$, $\mathbf{p}$ and $\mathbf{q}$ are random effects corresponding to replication by treatment, bench within replication, replication by clone, treatment by clone, row within replication and column within replication effects associated with known incidence matrices $Z_2$, $Z_3$, $Z_4$, $Z_5$, $Z_6$ and $Z_7$, respectively. All random effects have zero mean and variance structure similar to G. $\mathbf{e}$ is the vector of random error effects and $\mathbf{e} \sim (0, R)$ with

\[
R = \begin{bmatrix} R_1 & 0 & 0 \\ 0 & R_2 & 0 \\ 0 & 0 & R_3 \end{bmatrix}
\quad \text{and} \quad
R_i = \begin{bmatrix} \sigma^2_{e_i(t1)} & \sigma_{e_i(t1t2)} \\ \sigma_{e_i(t1t2)} & \sigma^2_{e_i(t2)} \end{bmatrix}, \quad i = 1, 2, 3
\]

Where G and R are covariance-variance matrices corresponding to vectors $\mathbf{c}$ and $\mathbf{e}$ respectively, and the R structure was associated with error effects of each replication. $\sigma^2_{c(t1)}$, $\sigma^2_{c(t2)}$ and $\sigma_{c(t1t2)}$ are clonal variance components for trait 1, trait 2 and the covariance component between trait 1 and trait 2. $\sigma^2_{e_1(t1)}$, $\sigma^2_{e_1(t2)}$ and $\sigma_{e_1(t1t2)}$ are error variance components and the covariance component respectively for trait 1 and trait 2 in replication 1. Replications 2 and 3 have similar description of error variance and covariance components. All traits were analyzed according to the full model (Equation 2), but effects with no variation were dropped to fit the final model.

After estimating the variance and covariance components, genetic and phenotypic correlations were calculated for pairs of traits according to the following equation:

\[
r_{t1t2} = \frac{\sigma_{t1t2}}{\sqrt{\sigma^2_{t1}\,\sigma^2_{t2}}}
\]

where $r_{t1t2}$ is the genetic or phenotypic correlation between trait 1 and trait 2, $\sigma_{t1t2}$ is the clonal genetic covariance or phenotypic covariance between the two traits and $\sigma^2_{t1}$, $\sigma^2_{t2}$ are the clonal genetic or phenotypic variances of trait 1 and trait 2.

Genetic Linkage Map and QTL Analysis

From a dense microsatellite- and microarray-based genetic linkage map, we selected 181 evenly distributed markers that segregate in the hybrid female parent. The selection favored microsatellite (SSR) markers (163 selected) because they were genotyped in all 396 individuals of the progeny, while microarray-based markers were characterized in only 154. Eighteen microarray markers were included to expand map coverage towards the flanks of some linkage groups (LG) and to fill a gap in LG6 where no SSR was genotyped. MapMaker 3.0 was utilized for construction of the linkage map with maximum recombination frequency of 45 cM using Kosambi's map function and a minimum LOD (logarithm of the odds ratio) of 3. Consistent with the haploid number of chromosomes of P. trichocarpa and P. deltoides, our map has 19 linkage groups, with a total length of 2889 cM and an average density of one marker every 16 cM. BLAST alignment of SSR primer and microarray probe sequences to the poplar
genome sequence (JGI v.1.1) demonstrates that our map spans at least 85% of the assembled genome.

The linkage map was used for identification of quantitative trait loci (QTL) with composite interval mapping implemented in Windows QTL Cartographer v.2.5. We utilized the standard model 6 with default settings for selection of cofactor markers to account for background variance not associated with the locus being tested. Least square means estimates for each individual in each nitrogen treatment were used in the analysis. For each phenotypic trait, QTL analyses for 1000 permutations were performed, establishing a null distribution of genome-wide maximum LOD scores where the 95th percentile was defined as the significance threshold.

Results

Genetic Control of Phenotypic Traits and the Effect of Nitrogen Treatments

For brevity, all growth and allocation traits depicted in Table 3-1 are referred to as phenotypic traits, while wood composition and partitioning traits depicted in Table 3-2 will be called wood chemistry traits. Clonal repeatability (or within-family broad-sense heritability) was estimated for each trait across the entire dataset (i.e. all replicates and both nitrogen treatments). Estimates of clonal repeatability were moderate for all 12 phenotypic traits, ranging from 0.25 for sylleptic branch biomass to 0.34 for internode count (Table 3-1). The effect of nitrogen fertilization was highly significant for all traits, increasing diameter, height, number of internodes and above-ground biomass, whereas root biomass was significantly decreased when compared to the nitrogen deficiency treatment (Table 3-1 and Figure 3-1). Carbon allocation favored shoot over root biomass when nitrogen fertilization was applied. The average above- to below-ground biomass ratio significantly increased from 5.07 to 11.90 under 0 and 25 mM NH4NO3 treatments, respectively.
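The permutation procedure used to set the QTL significance threshold can be sketched in a few lines. The example below is a simplified stand-in, not the analysis run in Windows QTL Cartographer: it assumes a toy single-marker scan with two genotype classes coded 0/1 (rather than composite interval mapping), and all function and parameter names are hypothetical. What it shares with the procedure described in the Methods is the logic of shuffling phenotypes, recording the genome-wide maximum LOD of each shuffle and taking the 95th percentile.

```python
import math
import random


def marker_lod(genotypes, phenotypes):
    """LOD of a single-marker regression with two genotype classes (0/1)."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    rss_marker = 0.0
    for genotype_class in (0, 1):
        ys = [y for g, y in zip(genotypes, phenotypes) if g == genotype_class]
        if ys:
            class_mean = sum(ys) / len(ys)
            rss_marker += sum((y - class_mean) ** 2 for y in ys)
    if rss_null == 0.0:
        return 0.0
    if rss_marker == 0.0:
        return math.inf
    return (n / 2) * math.log10(rss_null / rss_marker)


def permutation_threshold(markers, phenotypes, n_perm=1000, quantile=0.95, seed=0):
    """Quantile of the genome-wide maximum LOD under shuffled phenotypes."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    max_lods = []
    for _ in range(n_perm):
        # Shuffling breaks any genotype-phenotype association while keeping
        # the marker correlation structure intact.
        rng.shuffle(shuffled)
        max_lods.append(max(marker_lod(g, shuffled) for g in markers))
    max_lods.sort()
    index = min(len(max_lods) - 1, math.ceil(quantile * len(max_lods)) - 1)
    return max_lods[index]
```

A QTL is then declared significant when its observed LOD exceeds the returned value, which controls the genome-wide (rather than per-marker) false positive rate at 5%.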
Estimates of clonal repeatability for wood chemistry traits ranged from 0.15 for G lignin to 0.38 for partitioning between S and G lignin (S/G ratio). Most of the clonal repeatability estimates for wood chemistry traits were lower than those calculated for phenotypic traits, except for S lignin (0.34) and S/G ratio (0.38). The effect of nitrogen fertilization was highly significant for wood chemistry traits, increasing cellulose and hemicellulose and lowering lignin when compared to plants grown under limiting nitrogen. In response to N, the increase in carbohydrates was greater for cellulose compared with hemicellulose, as observed in the C6/C5 ratio. The decrease of mean lignin content in response to N was 7% greater for S lignin than G lignin (2), as observed in the S/G ratio (Table 3-2).

Analysis of Genetic and Phenotypic Correlations among Traits

Pairwise phenotypic and genotypic correlations were estimated for all traits across the entire experiment, i.e. combining data from the two nitrogen treatments (Table 3-3). Only genetic correlations are depicted between parentheses when a specific pairwise relationship is described in the text. Most of the morphological and biomass traits were positively correlated phenotypically and genetically with each other, e.g. height was strongly and positively correlated with diameter (0.78), stem biomass (0.91), leaf biomass (0.69), above-ground biomass (0.79) and total biomass (0.78). All biomass traits were also positively correlated with each other. For example, root biomass is strongly genetically correlated with leaf (0.86), stem (0.81), above-ground (0.86) and total biomass (0.90). As expected, plants growing faster tended to accumulate biomass at higher rates in all vegetative tissues.
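Both summary statistics reported here, clonal repeatability and the genetic correlation between two traits, are ratios of variance and covariance components. The sketch below shows the arithmetic under the definitions given in the Materials and Methods; the function names are hypothetical and the component values in the example are made up, not estimates from this study.

```python
import math


def clonal_repeatability(var_clone, var_rep_by_clone, var_trt_by_clone, residual_vars):
    """Clone variance over the sum of clone, interaction and mean residual variance."""
    mean_residual = sum(residual_vars) / len(residual_vars)
    total = var_clone + var_rep_by_clone + var_trt_by_clone + mean_residual
    return var_clone / total


def correlation_from_components(covariance, variance_1, variance_2):
    """Correlation as covariance scaled by the two standard deviations."""
    return covariance / math.sqrt(variance_1 * variance_2)


# Made-up components: clone = 1.0, replication x clone = 0.5,
# treatment x clone = 0.5, residual variances of the three replications
repeatability = clonal_repeatability(1.0, 0.5, 0.5, [1.0, 2.0, 3.0])

# Made-up clonal (co)variances for two traits
genetic_r = correlation_from_components(-0.5, 1.0, 1.0)
```

With these inputs the mean residual variance is 2.0, so the repeatability is 1.0 / 4.0. Using clonal components in the second function gives a genetic correlation, while using phenotypic components gives the phenotypic one.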
The ratio between above- and below-ground biomass was the only trait that was not strongly correlated with any other phenotypic trait, except for its auto-correlation with root biomass (-0.63).

Wood chemistry traits were genetically and phenotypically correlated among themselves (Table 3-3). Cellulose (C6) levels were strongly, positively correlated with the amount of
hemicelluloses (C5) (0.95). In contrast, both C5 and C6 amounts were strongly, negatively correlated with total lignin (-0.87 and -0.99), G lignin (-0.77 and -0.71) and S lignin (-0.68 and -0.85) content. The almost perfect negative genetic correlation between cellulose and lignin content (-0.99) decreases to -0.78 at the phenotypic level. The ratio between wood carbohydrates (C6/C5) was negatively correlated with total lignin (-0.95) and S/G ratio (-0.81). This suggests that when more carbon is partitioned into cellulose relative to hemicellulose, the total amount of lignin decreases and a disproportionate reduction in syringyl relative to guaiacyl monomers is observed. Lignin with a higher proportion of syringyl monomers has fewer carbon-carbon cross-links and therefore is more easily extracted than guaiacyl-rich lignin. Thus, in this population, carbon partitioning into carbohydrates (C6 > C5) is associated with higher proportions of a less extractable type of lignin monomer. Variation in these traits was very high in both nitrogen treatments. The proportion of C5 and C6 carbohydrates ranged from 21-34% and 26-48%, respectively. For lignin syringyl monomers, the proportion ranged from 7-19%.

Biomass and wood chemistry were also strongly correlated (Table 3-3). In general, wood carbohydrate contents (C5 and C6) were positively correlated with the biomass of all vegetative tissues, while higher lignin was associated with lower levels of biomass accumulation. For example, total biomass was genetically correlated with C6 (0.57), C5 (0.62) and lignin (-0.48). The ratio between wood carbohydrates (C6/C5) was positively correlated with biomass traits, especially with sylleptic branches (0.63), but not correlated with height and diameter. This result indicates that in the existing experimental conditions, carbon partitioned to cellulose may lead to higher biomass accumulation than carbon partitioned to hemicelluloses.
However, it is difficult to establish any relationships of cause and effect, and it is also possible that the higher cellulose accumulation is the result of superior growth ability. The ratio between lignin monomers (S/G)
was negatively correlated with above-ground biomass (-0.59) but not genetically correlated with below-ground (root) biomass (0.06). Therefore, the higher the above-ground biomass, the lower the extractability of lignin in woody tissues.

Quantitative Trait Loci Mapping in Each Nitrogen Treatment

A total of 62 QTL were identified in the two nitrogen treatments (0 and 25 mM) using composite interval mapping (Table 3-4, Figure 3-2). S lignin is the only trait for which no QTL was mapped in either nitrogen treatment. Under high nitrogen, 30 QTL were identified: 20 for 11 phenotypic traits and 10 for seven wood chemistry traits. Under the nitrogen deficiency treatment, 32 QTL were detected: 19 for 10 phenotypic and 13 for five wood chemistry traits. The origin of the QTL alleles affecting the traits positively is balanced, with 33 coming from the P. trichocarpa and 29 from the P. deltoides grandparents of the pedigree. Under high nitrogen, the most significant QTL (LOD = 6.93) was detected on LG14 for biomass allocation between above and below ground. In the nitrogen-limiting treatment, the most significant QTL (LOD = 6.33) was observed on LG1 for number of internodes (Table 3-4). The percentage of the phenotypic variance explained by the QTL ranged from 3.58% for C6/C5 ratio to 11.36% for the same trait under high nitrogen treatment. Under high nitrogen treatment, LG13 had the most QTL, with 12, whereas under nitrogen deficiency LG1 had the most QTL mapped, with loci controlling 10 different traits. On these two linkage groups, large fractions of the QTL were detected in the same regions (Figure 3-2). Five of the 19 linkage groups had no QTL mapped in our study.

Phenotypic Plasticity in Response to Nitrogen Treatments

Only six QTL for four phenotypic traits were mapped in the same genomic region under both nitrogen treatments (compare QTL on both sides of each LG in Figure 3-2). For number of internodes, two QTL (LG1 and LG6) were co-localized in both treatments. Similarly, for the
ratio between above- and below-ground biomass, QTL were detected on LG3 and LG14. Also mapped consistently under both nitrogen conditions were QTL for height of live crown (LG2) and diameter (LG3). The grandparent origin of the positive allele is the same under both nitrogen treatments for all these QTL, indicating that the same genetic elements are likely controlling the traits independent of nitrogen availability. All other 51 QTL identified in our pedigree were treatment-specific, suggesting that in Populus a large fraction of the interspecific variance in growth, carbon allocation and carbon partitioning traits is highly responsive to the level of nitrogen available in the environment. For wood chemistry traits, none of the QTL was mapped under both nitrogen treatments.

Co-Localization of QTL for Different Traits

In this interspecific Populus pedigree we found eight genomic regions that control part of the variation for at least two different traits under the same nitrogen treatment. For example, under high nitrogen there are two regions with QTL co-localized for several traits: one on LG12 between markers S170 and G2682 and another on LG13 between P2847 and G2218 (Figure 3-2). The traits with QTL co-localized on LG12 are leaf, stem, above-ground and total biomass, as well as diameter. These traits are all positively and highly genetically correlated with each other and, as such, the QTL alleles have the same direction of effect in all traits, i.e. the P. deltoides QTL allele positively affects all traits. On LG13, there is a hot spot of co-localized QTL for both phenotypic and wood chemistry traits. The traits with co-localized QTL on LG13 are above-ground, below-ground, leaf and total biomass, diameter, number of internodes, cellulose, ratio between cellulose and hemicelluloses, lignin and ratio between cellulose and lignin. Consistent with the direction of correlations, the P. trichocarpa allele positively affects total biomass (above and below ground), wood cellulose and hemicellulose, but negatively affects wood lignin
content. Another co-localization of QTL under high nitrogen occurs on LG1 for sylleptic branch biomass and S/G ratio.

Under nitrogen deficiency, five regions have co-localized QTL for different traits (Figure 3-2). Two of the regions occur on LG1. The region between markers G3205 and G834 contains QTL for height of live crown, number of internodes, wood cellulose, hemicelluloses, guaiacyl lignin and ratio between cellulose and lignin, and the region between P575 and P2786b contains QTL for above-ground, leaf and stem biomass. As in the hot spot on LG13, the QTL alleles have opposite effects for wood carbohydrates and lignin, but the same direction of effect for height, number of internodes, biomass traits and wood carbohydrates. Between markers S613 and G2126 on LG6, QTL were identified for the following wood chemistry traits: cellulose, guaiacyl lignin, ratio between cellulose and hemicellulose and ratio between cellulose and lignin. Also on LG6, between markers S18 and E538, there are co-localized QTL for wood cellulose and hemicellulose content. Co-localized QTL for root weight and ratio between above- and below-ground biomass were mapped between markers G2541 and G588 of LG9.

Discussion

Here we report widespread effects of nitrogen fertilization on the genetic regulation of growth and wood chemistry traits in the progeny of an interspecific pseudo-backcross of P. deltoides and P. trichocarpa. N supply affected all growth and wood chemistry traits, confirming the extensive phenotypic plasticity of poplars in response to this essential nutrient. Studies have shown that N positively influences poplar tree height, diameter, number of leaves, number of sylleptic branches and shoot biomass [210, 227-229]. However, contrary to our results, the same studies reported an increase of root biomass in response to N supply. Although statistical significance was either not observed or not tested, this disparity might reflect differences in experimental conditions.
For example, in our trial nutrients were supplied with
a flood irrigation system in ebb-and-flow benches, while in previous work plants were fertilized with irrigation on the top of the pots. Despite differences in total root biomass, N fertilization inarguably causes significant changes in biomass allocation, decreasing the root/shoot biomass ratio in poplars.

Wood chemistry also responded to different nitrogen availability. Confirming previous studies with poplar trees [211, 212], we observed that N fertilization significantly increased wood cellulose content and decreased lignin content and its extractability, measured as S/G ratio. We also found that N supply increased hemicelluloses at a significantly lower rate (5%) when compared to the increase in cellulose sugars (8%).

The extensive phenotypic plasticity in response to N availability observed in Populus may be triggered by changes in gene regulation. Supporting this hypothesis in poplars, nitrogen significantly changed transcript abundance of 52 cDNA clones identified by differential display. Consistent with the N-mediated decrease in wood lignin, two of the differentially expressed cDNA clones were from putative lignin biosynthesis genes, and were down-regulated in response to N. Similarly, studies that assessed the effects of nitrogen fertilization on the whole transcriptome of Arabidopsis also demonstrate that N represses most of the genes from the phenylpropanoid pathway [231, 232]. Some genes involved in cell wall growth and modification, including expansins and xyloglucan:xyloglucosyl transferase, were induced in response to N.
N induction was also detected in most genes involved with photosynthesis, the Calvin cycle and photorespiration. Changes in gene expression in response to N fertilization may be the result of regulatory signals that balance C and N metabolism. The assimilation of N into amino acids depends upon the availability of carbon skeletons, and C is stored in starch if N is deprived. When nitrogen is luxuriant (relatively low C/N), plant metabolism shifts resources toward absorption of carbon through photosynthesis and develops more shoot than root. Conversely, when nitrogen is limited (high C/N), metabolism favors allocation of biomass to root over shoot.

Consistent with the widespread phenotypic and gene expression plasticity of plants in response to N, our study detected extensive changes in the genetic control of carbon allocation and partitioning with N supply. Most of the QTL identified in our study (50 out of 62) are N treatment specific. In other plant species, significant QTL by N interactions have been reported for disease resistance and root traits in rice [237, 238], for carbohydrate content and plant biomass in Arabidopsis [236, 239], and for grain yield and quality in wheat. Significant interaction of QTL by N treatment indicates that fast- and slow-growing genotypes are not the same under limiting and luxuriant nitrogen conditions. Spearman correlation of phenotypic values under the two N treatments indicates extensive changes in rank among genotypes (Table 3-5). The even lower Spearman correlations between N treatments for wood chemistry traits indicate that most of these traits are more responsive to N supply than biomass and growth traits. This agrees with the lower clonal repeatability observed for most wood components when compared to phenotypic traits (Tables 3-1 and 3-2), a difference that was not observed when clonal repeatability was estimated for each N treatment separately (Table 3-6). Supporting the idea that wood chemistry traits are more reactive to nitrogen availability, no QTL for wood traits were consistently mapped in both nitrogen treatments, while six QTL for phenotypic traits were mapped independently of N level.

Our results support the hypothesis that growth co-varies with wood chemistry traits in trees. As previously demonstrated with mutants of Populus and Pinus [202, 203] and in a segregating pedigree of Eucalyptus, we observed that fast-growing poplar trees have higher cellulose and lower lignin content when compared to slow-growing trees. To genetically dissect the association between growth and wood quality traits, we searched for common, pleiotropic loci that regulate variation in both trait groups. We identified two common QTL for growth and wood chemistry. One QTL cluster detected in our study occurs on LG13, between markers G2577 and G2218, and regulates growth, biomass and wood composition traits. Specifically, trees that inherited the P. trichocarpa allele at this locus grow faster and have more cellulose and less lignin when compared to those that inherited the P. deltoides allele. Genomic regions in the vicinity of this LG13 QTL cluster control the abundance of metabolites derived from the phenylpropanoid pathway in an independent pedigree of Populus. The other QTL cluster coordinately controls height of live crown and wood composition under nitrogen deficiency and is located on LG1, between markers G3205 and G834. In this same region of LG1, QTL were previously mapped for height, stem circumference, stem volume, root density and root growth in family 331, which shares a common grandparent (clone 93968) with family 52-124 analyzed here [134, 207].

Other QTL detected in our study were also detected in family 331. For example, both QTL for height of live crown (LG1 and LG2) mapped under limiting N treatment co-localize with QTL for stem height identified by Rae et al. Two QTL for above-ground biomass from our experiment (LG1 under limiting N and LG12 under high N) co-localize with QTL for stem height and diameter in family 331 growing under elevated (LG1) or ambient (LG12) CO2. Two of our QTL for total leaf biomass (LG3 under limiting N and LG12 under luxuriant N) mapped in the same region as previously identified QTL controlling leaf number.

A number of interesting a priori candidate genes are physically positioned within the marker intervals flanking potential pleiotropic quantitative loci on LG1 and LG13. An immediately apparent candidate gene underlying the QTL cluster on LG13 is cinnamate 4-hydroxylase (C4H), encoding an enzyme that regulates the conversion of cinnamate to 4-coumarate in the phenylpropanoid biosynthesis pathway, and a transcript for which we previously identified a strong correlation with growth in the Eucalyptus segregating population. Similarly, the QTL cluster for growth and wood composition on LG1 harbors a cluster of three tandemly duplicated genes encoding putative cinnamoyl-CoA reductases (CCR), which regulate conversion of feruloyl-CoA to coniferaldehyde in the phenylpropanoid pathway. Additional components (chalcone synthase [CS], 4-coumarate:CoA ligase [4CL]) and potential regulators (AtMYB12 homolog and other uncharacterized R2R3-MYB-type TFs) of phenylpropanoid metabolism are also conceivable candidate genes within these intervals. Furthermore, several C5/C6 metabolic enzymes, including xyloglucan fucosyltransferase, glucose-6-phosphate 1-dehydrogenase, xyloglucan endotransglycosylase, and cellulose synthase, are encoded by genes within these intervals. Finally, numerous genes of unknown function or with no homology to known Arabidopsis or Oryza coding sequences are prevalent among genes within the intervals, leaving open the possibility that some of these uncharacterized elements might coordinately regulate growth and biomass composition in woody plants.

The gene models within marker intervals that flank most of our QTL, including those clusters on LG1 and LG13, are available online as Supporting Information (www3.interscience.wiley.com/journal/122246762/suppinfo). Gene models within some QTL intervals (eight out of 29) could not be found because at least one of the flanking markers either did not have a BLAST hit in the genome or hit an unmapped scaffold. The number of candidate genes within these QTL is certainly too large to be tested with functional analysis: our average QTL interval is 31 cM wide and contains ~426 genes, 45% of which have unknown function.
While reducing the size of the candidate gene pool is necessary, the traditional positional cloning-based isolation of regulatory loci would be challenging given the breadth of the QTL intervals, the low heritability of quantitative traits, and the typically low resolution of maps in forest species. Nevertheless, availability of the P. trichocarpa genome sequence, allied to gene expression data that are being assayed with microarrays from three tissues (root, xylem and leaf) of our experiment, will form an excellent foundation to narrow down the candidate genes within QTL. Integration of gene expression with genotypic and phenotypic data in a genetical genomics setting will uncover regions containing regulators of gene expression (eQTL) and of phenotypes. Identification of candidate genes for an eQTL/QTL hot spot can be achieved by elucidating genetic networks controlled by the region, as has been demonstrated in mammals [53, 54]. Pinpointing genetic regulators linking tree growth to wood quality will ultimately enhance our ability to breed for and engineer genotypes improved for pulp, paper, belowground carbon sequestration, and cellulosic ethanol production.
Figure 3-1. Effect of nitrogen fertilization on shoot biomass accumulation in one genotype of the Populus pseudo-backcross population. The plant on the left was grown under nitrogen deficiency (0 mM of NH4NO3), while the plant on the right was supplied with 25 mM of NH4NO3 after establishment. Both pictures were taken 10 weeks after plants were potted. A height scale (in cm) is depicted in each picture. Panel labels: 0 mM NH4NO3 (left), 25 mM NH4NO3 (right).
Figure 3-2. Map location of all 62 QTL identified in our experiment. QTL mapped under the nitrogen deficiency treatment are represented on the left of each linkage group, while QTL mapped under high nitrogen are on the right. Colored bars indicate map regions where the LOD profile is above the statistical threshold, and black arrows indicate the position of the LOD peak for each QTL. QTL spanning less than 10 cM above threshold are represented by 10 cM bars. Microsatellite markers are identified in black and array-based markers in red. Linkage groups where no QTL was mapped are not shown. Scale bar: 50 cM.

Markers labeled on each linkage group:
LG1: G2903 G3205 G124 G834 G1568 G833 G3784 G2837 G947 G3231 G3214 G825 G3017 P2731 P575 G1782 P2786b P2385 G3688 E496 E728
LG2: S95 S96 G734 O461 G876 P422 P2797 G2637 P2709 P456 P2088 P2418 G1376
LG3: G3465 G1688 P2481 G1470 G3886 G3365 G1629 P2611 G3989 G666 P2277
LG5: P2839 P2840 G2497 P2558 G2985 P2156 G1063 G3627 G1255 W15 E529
LG6: G2126 P2221 G3600 G2281 E918 S613 G139 S18 E538 G998 O50
LG8: G2053 O56 O268 O202 G3079 G3858 O386b O414 G2062 G1722 P2598 G3578 G3362
LG9: G3457 G2541 G588 G1623 O21 P2321 G2097 G1949 S626 S611
LG10: G2107 G507 G2020 G2431 P2855 O345 G2122 G1946 P2786a G938
LG12: S170 G2682 P2737 G2643 G2673
LG13: P14 G3990 W22 P2658 G2577 P2847 G2218
LG14: S554 G1866 O59 G1177 G1564 G1292 G2080 O386a G2014 P2515 G674
LG15: G1454 G1245 G4047 P520 P2585 G3333
LG17: E533 E408 G125 G880 G641 P648 G3580
LG18: G1089 G1244 G1743 O534 S89

Trait legend: Height, Ht live crown, Diameter, # Internodes, Stem biomass, Leaf biomass, Sylleptic biomass, Shoot biomass, Root biomass, Total biomass, Shoot/root, Cellulose (C6), Hemicelluloses (C5), C6/C5, lignin, G lignin, S/G, Cellulose/lignin.
Table 3-1. Estimates of clonal repeatability and average trait value for each of 12 phenotypes measured in two nitrogen treatments. Standard error (SE) is depicted for each estimate. The last column contains P values of a t test assessing the effect of nitrogen treatment in each phenotype. H2 = clonal repeatability; means are given for nitrogen deficiency (0 mM) and high nitrogen (25 mM).

Trait                     Acronym  H2     SE     Mean (0 mM)  SE     Mean (25 mM)  SE     Unit  Prob > |t|
Diameter                  diam     0.328  0.025  4.41         0.031  5.087         0.043  cm    0
Height                    ht       0.325  0.027  67.751       0.712  78.9          0.739  cm    0
Height live crown         htlc     0.338  0.026  61.517       0.735  74.958        0.749  cm    0
Internodes count          int      0.344  0.033  29.163       0.219  34.21         0.241        0
Leaf biomass              tleaf    0.305  0.027  4.86         0.092  6.734         0.12   g     0
Stem biomass              tstem    0.298  0.026  3.814        0.08   4.469         0.096  g     1.69E-07
Sylleptic branch biomass  sylwt    0.249  0.04   0.8          0.057  2.053         0.097  g     0
Above ground biomass      abbio    0.314  0.025  9.064        0.186  12.462        0.244  g     0
Root biomass              rootwt   0.303  0.025  2.204        0.057  1.265         0.033  g     4.14E-44
Ratio abbio/rootwt        rabbe    0.312  0.029  5.073        0.076  11.896        0.137        0
Total biomass             tbio     0.31   0.026  11.267       0.241  13.742        0.278  g     2.26E-11
Table 3-2. Estimates of clonal repeatability and average trait value for each of 8 wood chemistry phenotypes estimated in two nitrogen treatments with py-MBMS. Standard error (SE) is depicted for each estimate. The last column contains P values of a t test assessing the effect of nitrogen treatment in each phenotype. H2 = clonal repeatability; means are given for nitrogen deficiency (0 mM) and high nitrogen (25 mM).

Trait                             Acronym   H2     SE     Mean (0 mM)  SE     Mean (25 mM)  SE     Prob > |t|  m/z peaks sum
Five carbons hemicellulose sugar  C5        0.163  0.039  25.678       0.039  27.019        0.078  0           57 + 73 + 85 + 96 + 114
Six carbons cellulose sugar       C6        0.174  0.04   32.888       0.065  35.540        0.137  0           57 + 60 + 73 + 98 + 126 + 144
Ratio C6/C5                       C6/C5     0.170  0.039  1.280        0.001  1.311         0.002  0
Syringyl lignin monomer           S lignin  0.338  0.038  13.03        0.054  10.153        0.048  3.81E-280   154 + 167 + 168 + 182 + 194 + 208 + 210
Guaiacyl lignin monomer           G lignin  0.150  0.039  13.156       0.029  11.160        0.039  9.44E-294   124 + 137 + 138 + 150 + 164 + 178
Ratio S lignin/G lignin           S/G       0.378  0.031  0.993        0.004  0.917         0.004  8.24E-41
Total lignin                      Lignin    0.234  0.039  21.690       0.06   17.382        0.063  0           G lignin + S lignin + 120 + 152 + 180 + 181
Ratio C6/Lignin                   RatioCL   0.182  0.04   1.543        0.007  2.119         0.015  0
Table 3-3. Pair-wise estimates of phenotypic (below diagonal) and genotypic (above diagonal) correlation between traits. Bold type shows pair-wise traits with correlation > |0.60|. See Tables 3-1 and 3-2 for acronym description.

Trait     ht    diam  sylwt  rootwt  htlc  tleaf  tstem  tbio  abbio
ht        X     0.78  0.22   0.70    0.96  0.69   0.91   0.78  0.80
diam      0.75  X     0.12   0.76    0.73  0.74   0.85   0.80  0.79
sylwt     0.15  0.19  X      0.43    0.40  0.35   0.26   0.46  0.45
rootwt    0.70  0.74  0.49   X       0.74  0.86   0.81   0.90  0.86
htlc      0.94  0.70  0.25   0.72    X     0.73   0.89   0.83  0.84
tleaf     0.74  0.75  0.28   0.80    0.75  X      0.90   0.96  0.96
tstem     0.86  0.81  0.23   0.76    0.83  0.90   X      0.93  0.94
tbio      0.76  0.77  0.44   0.84    0.78  0.95   0.92   X     1.00
abbio     0.77  0.77  0.43   0.81    0.79  0.96   0.93   0.99  X
rabbe     0.03  0.22  0.15   0.60    0.05  0.23   0.16   0.29  0.21
int       0.70  0.54  0.32   0.61    0.71  0.62   0.64   0.65  0.66
C6        0.23  0.16  0.31   0.43    0.31  0.42   0.31   0.43  0.43
C5        0.30  0.22  0.32   0.47    0.35  0.44   0.36   0.45  0.45
C6/C5     0.05  0.00  0.28   0.26    0.14  0.28   0.15   0.28  0.29
Lignin    0.23  0.27  0.45   0.49    0.33  0.47   0.37   0.48  0.48
S lignin  0.02  0.07  0.38   0.27    0.13  0.30   0.16   0.29  0.30
G lignin  0.46  0.45  0.29   0.61    0.49  0.55   0.51   0.55  0.55
S/G       0.28  0.21  0.20   0.11    0.18  0.04   0.17   0.55  0.55
RatioCL   0.25  0.24  0.35   0.52    0.37  0.47   0.41   0.48  0.49
Table 3-3. Continued

Trait     rabbe  int   C6    C5    C6/C5  Lignin  S lignin  G lignin  S/G   RatioCL
ht        0.13   0.67  0.26  0.36  0.00   0.13    0.06      0.50      0.27  0.20
diam      0.26   0.50  0.11  0.18  0.07   0.03    0.10      0.34      0.22  0.13
sylwt     0.03   0.38  0.53  0.46  0.63   0.60    0.50      0.45      0.25  0.56
rootwt    0.63   0.59  0.55  0.60  0.36   0.37    0.17      0.61      0.07  0.53
htlc      0.14   0.73  0.35  0.43  0.13   0.23    0.04      0.51      0.18  0.33
tleaf     0.22   0.57  0.50  0.53  0.33   0.45    0.29      0.56      0.03  0.52
tstem     0.19   0.67  0.46  0.52  0.25   0.27    0.08      0.53      0.14  0.45
tbio      0.24   0.67  0.58  0.62  0.38   0.48    0.29      0.61      0.61  0.58
abbio     0.15   0.67  0.53  0.57  0.36   0.47    0.29      0.59      0.59  0.54
rabbe     X      0.05  0.22  0.25  0.04   0.01    0.09      0.30      0.19  0.15
int       0.07   X     0.44  0.48  0.30   0.43    0.30      0.46      0.11  0.45
C6        0.20   0.37  X     0.95  0.89   1.00    0.85      0.72      0.53  0.99
C5        0.24   0.39  0.97  X     0.75   0.87    0.68      0.77      0.35  0.92
C6/C5     0.11   0.24  0.81  0.59  X      0.95    0.93      0.48      0.81  0.97
Lignin    0.19   0.42  0.78  0.71  0.73   X       0.93      0.57      0.69  0.98
S lignin  0.07   0.27  0.63  0.51  0.72   0.90    X         0.23      0.91  0.92
G lignin  0.27   0.46  0.64  0.68  0.40   0.72    0.34      X         0.27  0.70
S/G       0.08   0.01  0.27  0.11  0.51   0.47    0.80      0.19      X     0.63
RatioCL   0.22   0.41  0.96  0.89  0.83   0.89    0.75      0.68      0.35  X
Table 3-4. Linkage group (LG) and flanking marker localization for the 62 QTL identified under the two nitrogen treatments (H = high, D = deficiency). Also depicted for each QTL are the grandparent origin of the allele with positive effect on the trait, LOD score and percentage of the phenotypic variance explained.

QTL  N treatment  Trait     LG  Flanking marker 1  Flanking marker 2  Origin of positive allele  LOD peak  Phenotypic variance explained
1    H            abbio     12  S170               P2737              P. deltoides               3.31      4.36%
2    H            abbio     13  P2847              G2218              P. trichocarpa             3.05      7.04%
3    H            diam      3   G3465              P2481              P. deltoides               3.83      5.32%
4    H            diam      12  S170               P2737              P. deltoides               4.20      5.25%
5    H            diam      13  P2847              G2218              P. trichocarpa             3.18      7.03%
6    H            htlc      2   O461               P422               P. deltoides               3.74      4.85%
7    H            int       1   G3205              G834               P. trichocarpa             4.72      7.26%
8    H            int       6   G3600              O50                P. trichocarpa             5.66      6.69%
9    H            int       13  P2847              G2218              P. trichocarpa             4.71      9.20%
10   H            sylwt     1   G124               G1568              P. trichocarpa             4.98      5.70%
11   H            sylwt     1   P2731              G1782              P. trichocarpa             4.30      4.90%
12   H            sylwt     17  G641               G3580              P. trichocarpa             4.98      6.18%
13   H            rabbe     3   G1470              G3365              P. deltoides               4.97      5.55%
14   H            rabbe     14  G1866              G1177              P. deltoides               6.93      9.75%
15   H            rootwt    13  P2847              G2218              P. trichocarpa             3.11      7.35%
16   H            tbio      12  S170               P2737              P. deltoides               3.42      4.51%
17   H            tbio      13  P2847              G2218              P. trichocarpa             3.18      7.52%
18   H            tleaf     12  S170               P2737              P. deltoides               3.89      5.11%
19   H            tleaf     13  P2847              G2218              P. trichocarpa             3.86      9.52%
20   H            tstem     12  S170               G2682              P. deltoides               3.47      4.60%
21   H            C5        13  P14                P2658              P. trichocarpa             3.17      5.03%
22   H            C6        13  G2577              G2218              P. trichocarpa             4.78      11.24%
23   H            C6/C5     1   P575               G1782              P. trichocarpa             2.97      3.58%
24   H            C6/C5     13  G2577              G2218              P. trichocarpa             5.23      11.36%
25   H            G lignin  13  P2847              G2218              P. deltoides               3.24      6.78%
26   H            Lignin    13  G2577              G2218              P. deltoides               4.51      9.61%
27   H            RatioCL   13  G2577              G2218              P. trichocarpa             4.69      10.17%
28   H            S/G       1   P2731              G1782              P. deltoides               3.11      3.88%
29   H            S/G       10  G1946              G938               P. deltoides               3.63      4.27%
30   H            S/G       15  G4047              P2585              P. trichocarpa             3.39      5.03%
31   D            abbio     1   P575               P2786b             P. trichocarpa             3.52      6.10%
32   D            diam      3   G3465              P2481              P. deltoides               3.36      4.61%
33   D            diam      8   P2598              G3578              P. trichocarpa             3.29      4.42%
34   D            ht        2   G876               P422               P. deltoides               3.21      4.50%
35   D            htlc      1   G3205              G834               P. trichocarpa             4.70      6.49%
36   D            htlc      2   G876               P422               P. deltoides               3.02      3.94%
37   D            int       1   G3205              G834               P. trichocarpa             6.33      9.91%
38   D            int       6   G2281              O50                P. trichocarpa             3.44      3.79%
39   D            sylwt     5   G1063              G3627              P. trichocarpa             3.58      5.10%
40   D            sylwt     15  G4047              P2585              P. deltoides               3.30      4.62%
41   D            rabbe     1   G825               P575               P. trichocarpa             3.30      4.24%
42   D            rabbe     3   G1470              G3365              P. deltoides               3.93      4.38%
43   D            rabbe     9   G3457              G588               P. trichocarpa             3.01      3.64%
44   D            rabbe     14  G1866              G1177              P. deltoides               2.95      3.95%
45   D            rootwt    9   G3457              G588               P. deltoides               3.33      4.34%
46   D            tleaf     1   P575               P2786b             P. trichocarpa             3.37      6.98%
47   D            tleaf     3   G3465              G1688              P. deltoides               3.11      4.90%
48   D            tleaf     13  P2658              G2577              P. trichocarpa             3.07      4.42%
49   D            tstem     1   P575               P2786b             P. trichocarpa             3.10      3.97%
50   D            C5        1   G3205              G834               P. trichocarpa             5.84      8.06%
51   D            C5        6   S18                P2221              P. deltoides               4.09      8.79%
52   D            C5        18  G1089              G1244              P. trichocarpa             3.63      6.85%
53   D            C6        1   G3205              G834               P. trichocarpa             5.80      7.96%
54   D            C6        6   S613               G2126              P. deltoides               2.94      3.85%
55   D            C6        6   S18                E538               P. deltoides               3.28      7.38%
56   D            C6/C5     6   S613               G2126              P. deltoides               3.58      4.73%
57   D            G lignin  1   G3205              G834               P. deltoides               3.90      4.95%
58   D            G lignin  6   E918               G139               P. trichocarpa             3.58      6.97%
59   D            G lignin  13  G2577              G2218              P. deltoides               4.01      7.51%
60   D            G lignin  17  G641               P648               P. deltoides               3.62      6.24%
61   D            RatioCL   1   G3205              G834               P. trichocarpa             4.11      5.30%
62   D            RatioCL   6   S613               G2126              P. deltoides               3.35      4.42%

Table 3-5 trait.

Trait type      Trait under N deficiency  By trait under high N  Spearman correlation  P value
Phenotypic      diam_L                    diam_H                 0.538                 4.88E-31
Phenotypic      ht_L                      ht_H                   0.558                 1.05E-33
Phenotypic      htlc_L                    htlc_H                 0.557                 1.40E-33
Phenotypic      int_L                     int_H                  0.569                 2.53E-35
Phenotypic      tleaf_L                   tleaf_H                0.502                 1.37E-26
Phenotypic      tstem_L                   tstem_H                0.489                 4.36E-25
Phenotypic      sylwt_L                   sylwt_H                0.403                 7.42E-17
Phenotypic      abbio_L                   abbio_H                0.501                 1.91E-26
Phenotypic      rootwt_L                  rootwt_H               0.514                 5.53E-28
Phenotypic      rabbe_L                   rabbe_H                0.402                 1.04E-16
Phenotypic      tbio_L                    tbio_H                 0.491                 2.37E-25
Wood chemistry  C5_L                      C5_H                   0.240                 1.35E-06
Wood chemistry  C6_L                      C6_H                   0.273                 3.45E-08
Wood chemistry  C6/C5_L                   C6/C5_H                0.263                 1.15E-07
Wood chemistry  Lignin_L                  Lignin_H               0.322                 5.60E-11
Wood chemistry  Guaiacol_L                Guaiacol_H             0.279                 1.64E-08
Wood chemistry  Syringol_L                Syringol_H             0.409                 2.28E-17
Wood chemistry  S/G_L                     S/G_H                  0.482                 2.54E-24
Wood chemistry  RatioCL_L                 RatioCL_H              0.330                 1.77E-11
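The rank correlations between nitrogen treatments reported in Table 3-5 were estimated with JMP. As an illustration only, the sketch below computes a Spearman correlation in plain Python by ranking each variable (tied values share the mean rank) and taking the Pearson correlation of the ranks. The function names and the five clonal means are hypothetical and are not data from this study.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical clonal means for height (cm) of five genotypes in each treatment.
height_low_n = [61.0, 70.5, 66.2, 72.8, 58.4]
height_high_n = [80.1, 76.3, 79.0, 84.5, 70.2]
rho = spearman(height_low_n, height_high_n)
print(round(rho, 2))  # 0.6
```

A correlation well below 1, as in this toy case, means the genotypes change rank between treatments, which is the pattern the text interprets as QTL by N interaction.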
Table 3-6. Comparison of clonal repeatability estimated with combined data from both N treatments and separately from plants grown under high and deficient nitrogen treatments.

          Combined         High nitrogen    Deficient nitrogen
Trait     h2     Std err   h2     Std err   h2     Std err
diam      0.328  0.025     0.353  0.034     0.400  0.033
ht        0.325  0.027     0.335  0.034     0.391  0.033
htlc      0.338  0.026     0.343  0.034     0.384  0.033
int       0.344  0.033     0.366  0.034     0.379  0.033
tleaf     0.305  0.027     0.309  0.036     0.359  0.035
tstem     0.298  0.026     0.303  0.036     0.321  0.036
sylwt     0.249  0.040     0.233  0.050     0.337  0.057
abbio     0.314  0.025     0.296  0.037     0.337  0.037
rootwt    0.303  0.025     0.274  0.036     0.325  0.035
rabbe     0.312  0.029     0.400  0.037     0.387  0.037
tbio      0.310  0.026     0.297  0.038     0.324  0.038
C5        0.163  0.039     0.312  0.050     0.372  0.046
C6        0.174  0.040     0.358  0.049     0.375  0.046
C6/C5     0.170  0.039     0.308  0.052     0.316  0.048
S lignin  0.338  0.038     0.395  0.049     0.456  0.042
G lignin  0.150  0.039     0.240  0.054     0.232  0.051
S/G       0.378  0.031     0.359  0.050     0.404  0.045
Lignin    0.234  0.039     0.307  0.052     0.420  0.044
RatioCL   0.182  0.039     0.351  0.050     0.389  0.045
CHAPTER 4
INTEGRATIVE GENOMICS IDENTIFIES CPG13, A CANDIDATE GENE FOR THE COORDINATE REGULATION OF BIOMASS GROWTH AND CARBON PARTITIONING IN Populus

Introduction

Wood is mainly composed of cellulose (~45%) and lignin (~25%), the two most abundant biopolymers on Earth's land [199, 244]. The complex polymeric structure with multiple covalent linkages found in lignocelluloses provides a plentiful source of renewable energy. Estimates indicate that a net energy balance (NEB) of up to 600 GJ/ha/yr can be achieved with lignocellulosic biomass, surpassing the energy gain of sugar-, starch- and oilseed-based biofuels. However, the success of lignocellulosics as a viable alternative to petroleum-based fuels depends on the improvement of biomass productivity and conversion efficiency [25, 44, 245, 246]. The recalcitrance of lignin constitutes a main impediment to the efficient conversion of woody biomass into highly energetic liquid fuels such as ethanol and hydrocarbons [6, 247]. Evidence suggests that the partitioning of carbon to cellulose rather than lignin not only improves biofuel conversion efficiency but also increases growth rates in tree species [51, 98, 202, 203, 248].

A negative correlation between biomass growth and lignin content was first observed in mutants with reduced activity of enzymes in the lignin biosynthesis pathway [98, 202, 203]. Transgenic aspens that downregulate the expression of 4-coumarate:CoA ligase (4CL; EC 6.2.1.12) exhibit a 45% reduction in lignin content, with an associated 15% increase in cellulose and 17-57% increase in hemicelluloses. The reduction in the lignin to carbohydrate ratio was associated with a significant enhancement in the growth rate of all vegetative organs.
Similarly, a rare natural mutant of Pinus taeda with reduced activity of cinnamyl alcohol dehydrogenase (CAD; EC 1.1.1.195), the enzyme that catalyzes the last step of the synthesis of lignin monomers, shows a significant increase in radial and height growth as well as wood density [202, 203]. The association between secondary cell wall composition and biomass growth is not restricted to natural mutants or transgenic lines, but also segregates in hybrid populations of woody species [51, 248]. We detected a negative correlation of transcript abundance of several genes of the lignin biosynthesis pathway, and of lignin content itself, with growth in an interspecific cross of Eucalyptus grandis and E. globulus. Recently, we reported the most comprehensive analyses yet available of phenotypic and genetic correlations between biomass growth and wood chemistry traits in a woody perennial species (Chapter 3). We phenotyped 396 clonally replicated individuals from an interspecific pseudo-backcross of P. deltoides and P. trichocarpa (family 52-124), and observed a highly significant (p-value < 0.0001) negative correlation of lignin with cellulose content and total biomass.

Although the negative relationship between growth and lignin content is useful for the pulp, paper, and biofuels industries, the genetic elements and molecular mechanism that coordinately regulate these traits are unknown. Identification of genes that co-regulate biomass growth and composition can translate into the development of germplasm that is improved simultaneously for higher biomass productivity and reduced lignin. By analyzing biomass growth and wood chemistry in a segregating progeny, we identified one major quantitative trait locus (QTL) on LGXIII of Populus with pleiotropic effects on biomass growth and composition. Specifically, this QTL explains 56% of the heritable variation in cellulose to lignin ratio, as well as 20-25% of the heritable variation of several productivity traits, including stem diameter and biomass accumulation in root and shoot. Identification of the causal polymorphism(s) underlying QTLs is challenging in tree species.
Large-scale experiments to fine map and positionally clone genes are almost unfeasible and have not yet been attempted in forest species. One alternative to positional cloning is to integrate large-scale, high-dimensional genomic and molecular data into the QTL progeny [50, 153, 250]. By analyzing the transcriptome in all individuals of a QTL population, it is possible to find genes whose expression correlates with the phenotypic variation. The expression level can also be treated as a quantitative trait in a QTL analysis to identify genomic regions (often referred to as eQTL) that affect the mRNA levels of each gene. Additionally, co-expressed genes can be identified and their association modeled into networks that deconvolute causal interactions from indirect correlations. These additional layers of information can be harnessed to identify a gene or network of genes involved in the regulation of complex traits.

Herein, the expression of all known Populus genes was characterized in the pedigree (family 52-124) where the pleiotropic QTL for growth and carbon partitioning was identified on LGXIII. The expression data were used in combination with genetic and phenotypic information to identify the gene cpg13 (carbon partitioning and growth in chromosome XIII) as the most likely pleiotropic regulator of biomass growth and lignin content. Because cpg13 and its homologues in other plant species are annotated with unknown function, its functional characterization can indicate new regulatory pathways of cell wall biosynthesis, which may be conserved across woody and non-woody species.

Materials and Methods

Plant Material and Microarray Analysis of Gene Expression

A pseudo-backcross pedigree of Populus trichocarpa × P. deltoides (family 52-124) composed of 396 genotypes was used in this study. Plants were clonally propagated and grown using a partially balanced incomplete block design, with three biological replicates in each of two nitrogen treatments. Ten weeks after rooted cuttings were planted in 1-gallon containers, plants were phenotyped for biomass by dissecting above- and below-ground vegetative organs (roots, leaves and stem). The stem was separated further into xylem and phloem, for phenotyping of cell wall chemistry of developing xylem (wood) with pyrolysis molecular beam mass spectrometry (py-MBMS). In addition, xylem tissues were sampled from 180 individuals from one replicate of the high nitrogen treatment (25 mM of NH4NO3) for transcriptome analysis. Immediately upon harvest, xylem samples were flash frozen in liquid nitrogen and subsequently stored at -80 °C. The tissue was later lyophilized for RNA extraction with a standard protocol. Messenger RNA was converted to cDNA using the SuperScript Double-Stranded cDNA Synthesis Kit (Invitrogen, Carlsbad, CA, USA) with oligo(dT) primers from Promega (Madison, WI, USA). The double-stranded cDNA was labeled using Cy3-tagged random 9-mer primers and Klenow fragment as previously described [252]. The labeled cDNA was then hybridized to an oligonucleotide microarray from Roche NimbleGen (Madison, WI, USA). As previously described, the microarray contained 55,793 unique 60-mer probes specifically designed to assay gene expression of the 45,555 Populus annotated gene models, plus 10,238 less supported models with transcriptional evidence. Probes were selected based on uniqueness of the targeted genomic region and GC content, while avoiding self-complementarity, homopolymer runs, and hybridization interference from polymorphisms in the parents of the pedigree.

Fine Mapping the Pleiotropic QTL of LGXIII

To reduce the QTL interval, the genome sequence within the QTL region was searched for additional microsatellites using MsatFinder (v.2.0). Forty primer pairs were designed and microsatellites were PCR-screened for polymorphisms in the parents and six individuals of the progeny. Polymorphic markers were identified and genotyped in 1% agarose gels (w/v).
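To illustrate what a microsatellite search over a genomic sequence involves, the following is a minimal sketch that scans for perfect tandem repeats with a regular expression. It is not the MsatFinder algorithm; the function name, the unit sizes (2 to 6 bp) and the minimum of five repeats are assumptions chosen for the example, and the test sequence is made up.

```python
import re


def find_microsatellites(sequence, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    Hypothetical helper: thresholds are illustrative, not those of MsatFinder.
    """
    sequence = sequence.upper()
    # Capture a short unit (shortest first), then require it to repeat
    # at least min_repeats - 1 more times immediately afterwards.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        repeat_count = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, repeat_count))
    return hits


# Made-up sequence containing an (AG)8 and a (GAT)5 repeat.
example = "TTGCA" + "AG" * 8 + "CCGTA" + "GAT" * 5 + "TC"
for start, unit, count in find_microsatellites(example):
    print(start, unit, count)
```

In practice, primers would then be designed in the unique sequence flanking each hit, as described above for the forty primer pairs.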
Expression QTL Analysis

Xylem cDNA from the 180 genotypes was individually hybridized to microarrays without replicates. Signal data from the hybridizations were quantile-normalized and log2-transformed before being used in expression QTL (eQTL) analysis. The expression of each gene was analyzed as a quantitative trait with Composite Interval Mapping [223, 255] using Windows QTL Cartographer v2.5, as described previously for the phenotypic traits. The statistical threshold for eQTL detection was determined based on the 95th percentile of the distribution of the genome-wide maximum likelihood ratio (LR) observed in 1000 permutations performed in a sample of 100 randomly selected genes, as previously detailed. Microsatellite markers were anchored with BlastN on the genome sequence to locate the relative position of genes on the genetic map. Based on the location of genes, eQTLs were classified as being the result of cis or trans regulation. Cis-regulated eQTLs were defined as those in which the gene is physically located within the region (eQTL) controlling its expression. Otherwise, when the analyzed gene is annotated in a region different from its eQTLs, the gene is trans-regulated. It is important to note that 37% of genes annotated in the Populus genome (assembly v.1) are localized in scaffolds that were not genetically linked to any linkage group. For genes in those unmapped scaffolds, it is not directly possible to know the type of genetic regulation (cis vs. trans) underlying their eQTLs. Using a dense genetic map previously published, we linked two unmapped scaffolds to the region of the pleiotropic QTL. Scaffold_41 contains 267 genes and was linked to the region by one microsatellite and two microarray-based markers. Scaffold_320 has only seven genes and was mapped with one microarray marker. Genes in scaffold_41 and scaffold_320 had eQTLs classified as cis-regulated when mapped within the pleiotropic QTL.
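The preprocessing step mentioned above (quantile normalization followed by a log2 transformation) can be sketched as follows. This is a simplified illustration in plain Python, assuming one list of positive signal values per microarray with probes in the same order, and ignoring the special handling of tied values that production implementations apply. The function names and the toy intensities are hypothetical.

```python
import math


def quantile_normalize(arrays):
    """Force every array to share the same empirical distribution.

    arrays: list of equal-length lists, one per microarray.
    Simplification (assumption): ties are broken by probe order.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized


def log2_transform(arrays):
    """Apply a base-2 logarithm to every (positive) signal value."""
    return [[math.log2(v) for v in a] for a in arrays]


# Toy signal intensities for three probes on two arrays.
raw = [[2.0, 4.0, 8.0], [16.0, 1.0, 4.0]]
normalized = quantile_normalize(raw)
expression = log2_transform(normalized)
print(normalized)  # [[1.5, 4.0, 12.0], [12.0, 1.5, 4.0]]
```

After this step each array has identical quantiles, so differences between genotypes in a probe's value reflect its rank within the array rather than array-wide intensity shifts.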
Statistical Analysis

Pearson and Spearman (rank) correlations were estimated among phenotypic and gene expression traits with the statistical software JMP 8.0 (SAS Institute Inc., Cary, NC). False discovery rates were estimated using the software QVALUE.

Search for Common Cis Elements in the Promoter of Lignin Biosynthesis Genes and Cpg13

Gene ontology annotation from the Arabidopsis best homologue identified 243 poplar genes related to the lignin biosynthetic and metabolic processes (GO:0009809 and GO:0009808). To find significantly enriched cis elements in the promoter of these genes, the 1,500 bp of sequence upstream of the start codon was extracted from all 55,793 genes represented in the microarray. Uninterrupted 1.5 kb of promoter sequence could be obtained from 49,066 genes, including 209 of the genes annotated as involved in lignin processes. Using Patmatch, these 49,066 promoter sequences were searched for the 469 cis element motifs deposited in the PLACE (Plant Cis-Acting Regulatory Element) database. We detected 360 motifs present in the promoter of at least one of the Populus genes. The binary presence/absence of each of these motifs was used in Fisher's exact tests to detect cis elements enriched among the 209 poplar genes involved in lignin biosynthetic and metabolic processes. Motifs enriched among lignin genes were compared with those present in the promoter of cpg13.

Results

A Pleiotropic QTL for Biomass and Wood Chemistry on LGXIII of Populus

Quantitative analysis of biomass and wood chemistry traits in a pseudo-backcross pedigree of Populus trichocarpa × P. deltoides (family 52-124) identified significant genetic and phenotypic correlations between wood composition and biomass of all vegetative organs (Figure 4-1). Plants that accumulated more biomass had significantly less lignin (r = -0.48; p-value = 1.16 × 10^-23) and more cellulose (r = 0.40; p-value = 4.26 × 10^-16) when compared to those with lower growth rates (Figure 4-1). These correlations were significant under both the low and high nitrogen treatments (p-value < 0.0001). Analysis of quantitative trait loci for biomass and wood chemistry traits identified a QTL cluster on LGXIII of Populus. The pleiotropic QTL controls the levels of cellulose (likelihood ratio [LR] = 22; R2 = 11.2%), lignin (LR = 21; R2 = 9.6%), and the ratio between the two wood components (LR = 22; R2 = 10.2%) in xylem, as well as the diameter (LR = 15; R2 = 7.0%), leaf biomass (LR = 18; R2 = 9.5%), shoot biomass (LR = 14; R2 = 7.0%) and root biomass (LR = 14; R2 = 7.4%) of the plants (Figure 4-2A). The pleiotropic QTL spans a broad region of 57.4 cM between markers G2577 and G2218 on the original LGXIII map, and contains 677 genes annotated in the genome (Figure 4-2A).

To reduce the QTL interval, four microsatellites (SSR5, SSR8, SSR11 and SSR14) were genotyped in the progeny of family 52-124. These microsatellites were chosen based on their distribution in the genome sequence and ease of scoring in agarose gels. Marker P2847 was removed from the original map because of excessive missing data (n=102). The linkage group including the four new markers indicates that the QTL cluster was contained between SSR5 and SSR8, reducing the interval to 53% of its original length (Figure 4-2B). On the new map, the pleiotropic QTL spans 31 cM and includes 356 genes. On the new map, the QTL for diameter moved outside the region flanked by the additional microsatellites. Because this shift is not observed when the QTL is mapped with the interval mapping procedure (data not shown), we believe that this change in position is an artifact of the marker chosen as cofactor in the composite interval mapping model.
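The QTL evidence above is reported as likelihood ratio (LR) statistics, whereas Table 3-4 reports LOD scores for the same loci. The two scales are related by LOD = LR / (2 ln 10), which is roughly LR / 4.605. The short sketch below applies that standard conversion; the function names are illustrative, and small differences from the tabulated LOD values are expected because the LR values in the text are rounded.

```python
import math


def lr_to_lod(lr):
    """Convert a likelihood ratio statistic (-2 ln of the likelihood ratio) to LOD."""
    return lr / (2 * math.log(10))


def lod_to_lr(lod):
    """Convert a LOD score (log10 likelihood ratio) back to the LR scale."""
    return lod * 2 * math.log(10)


# LR values quoted in the text for the LGXIII cluster (rounded in the source).
for trait, lr in [("cellulose", 22), ("lignin", 21), ("diameter", 15)]:
    print(trait, round(lr_to_lod(lr), 2))
```

For example, the cellulose LR of 22 corresponds to a LOD of about 4.78, which matches the LOD peak listed for C6 on LG13 in Table 3-4.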
Genetical Genomics Identifies a Candidate Gene for the Regulation of Biomass and Wood Chemistry

Based on the hypothesis that gene expression variation is a major contributor to differences in biomass growth and wood chemistry traits, we used the genetical genomics strategy to
identify all genes with transcript levels controlled by the pleiotropic QTL. Expression QTL analysis performed with all 55,793 genes probed on the microarray identified 79 genes regulated by the genomic region between markers SSR5 and SSR8 (green circle in Figure 4-3). Among these 79 genes, 20 were cis-regulated (green and blue overlap in Figure 4-3), i.e. they are physically located in, and regulated by, the genomic region between SSR5 and SSR8, while the remaining 59 were trans-regulated by the pleiotropic locus. We focused on the 20 cis-regulated genes to identify the most likely candidates underlying the pleiotropic QTL.

The assumption inherent in the genetical genomics approach is that polymorphisms in regulatory sequences modulate expression of the causative gene, and that this change in transcript levels could be a primary driver of the phenotypic variation detected by the QTL. Under that rationale, the causative gene would have expression highly correlated with cellulose and lignin levels, as well as with biomass traits. Spearman (rank) correlations were estimated between the expression of all genes probed by the microarray and the levels of cellulose and lignin, as well as total biomass. At a false discovery rate (FDR) of 10%, 2,016 genes had transcript levels correlated with the three phenotypic traits controlled by the QTL cluster (red circle in Figure 4-3). Among these, only five genes had expression controlled by the same region where the pleiotropic QTL was mapped (red and green overlap in Figure 4-3). Two of these five genes are physically located between markers SSR5 and SSR8 and, therefore, are cis-regulated and strong candidates for underlying the QTLs (three-way overlap in Figure 4-3; see also Figure 4-4 A). Table 4-1 depicts the annotation of these two genes, as well as their expression correlation with wood chemistry and biomass traits.
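The gene counts quoted for this integration can be cross-checked against the seven region counts printed in the Venn diagram of Figure 4-3. The Python sketch below does only that arithmetic. The assignment of each count to a region is inferred here from the circle totals given in the text, and the labels "located", "eqtl" and "correlated" are illustrative names, not identifiers from the analysis.

```python
# Region counts as printed in the Venn diagram of Figure 4-3.
# Assumed labels: "located"    = gene physically inside the QTL interval (blue),
#                 "eqtl"       = expression QTL mapped between SSR5 and SSR8 (green),
#                 "correlated" = expression correlated with cellulose, lignin
#                                and total biomass at FDR 10% (red).
regions = {
    frozenset({"located"}): 324,
    frozenset({"correlated"}): 1999,
    frozenset({"located", "correlated"}): 12,
    frozenset({"located", "eqtl", "correlated"}): 2,
    frozenset({"located", "eqtl"}): 18,
    frozenset({"eqtl", "correlated"}): 3,
    frozenset({"eqtl"}): 56,
}


def total(*required):
    """Sum the counts of every region lying inside all of the required circles."""
    need = set(required)
    return sum(n for members, n in regions.items() if need <= members)


genes_in_interval = total("located")              # blue circle
genes_with_eqtl = total("eqtl")                   # green circle
genes_correlated = total("correlated")            # red circle
cis_regulated = total("located", "eqtl")          # blue and green overlap
trans_regulated = genes_with_eqtl - cis_regulated
eqtl_and_correlated = total("eqtl", "correlated")  # red and green overlap
candidates = total("located", "eqtl", "correlated")  # three-way overlap

print(genes_in_interval, genes_with_eqtl, genes_correlated)
print(cis_regulated, trans_regulated, eqtl_and_correlated, candidates)
```

Under this reading the circle totals come out as 356 genes in the interval, 79 genes with an eQTL in the region and 2,016 correlated genes, with two genes in the three-way overlap.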
One of these genes (gw1.41.566.1) has the highest correlation with levels of lignin (r = 0.40) and cellulose (r = 0.41) among all 79 genes controlled by the region within the pleiotropic QTL
(Figure 4-4 B). The correlation of gw1.41.566.1 expression with lignin is 41% higher than that of any other cis-regulated gene. Both candidate genes have moderate negative correlations with total biomass (Figure 4-4 B and C).

Next we correlated transcript levels of the 20 cis-regulated genes within the pleiotropic QTL (blue and green overlap in Figure 4-3) with expression of 161 genes annotated as enzymes involved in the biosynthesis of lignin monomers. Several of the cis-regulated genes are highly correlated with lignin biosynthesis genes. For instance, 500 significant pairwise correlations, involving 17 of the 20 genes, were identified after applying a stringent Bonferroni correction (p-value < 3.11 × 10⁻⁶). Table 4-2 depicts the 30 highest correlations. The same gene that showed the highest expression correlation with cellulose and lignin levels, gw1.41.566.1, is also the most highly co-expressed with monolignol biosynthesis genes (Table 4-2). Figure 4-5 shows the extent and metabolic context of the correlations observed between expression of gw1.41.566.1 and of genes for the biosynthesis of monolignols. With the exception of 4-coumarate 3-hydroxylase (C3H), all other families of genes that encode enzymes involved in monolignol biosynthesis have at least one member highly correlated (r > 0.79) with transcript levels of gw1.41.566.1 (Figure 4-5). Because of its strong association with cell wall chemistry in family 52-124 of Populus, we concluded that gene gw1.41.566.1 is the most likely candidate underlying the pleiotropic QTL of LGXIII, and renamed it cpg13 (carbon partitioning and growth of LGXIII).

Promoter of Cpg13 Contains Regulatory Elements Found in Monolignol Biosynthesis Genes

We searched the promoter sequence (1.5 kbp upstream of the first codon) of all genes represented in the microarray for cis-acting motifs deposited in the PLACE database.
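The presence/absence scoring behind this motif search can be illustrated with a minimal Python sketch. It is not Patmatch, the tool used in this study. It only translates a motif written with IUPAC nucleotide codes into a regular expression and asks whether it occurs in a promoter. Searching both strands is an assumption made here, and the function names and the toy promoter sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def motif_to_regex(motif):
    """Compile an IUPAC motif such as GRWAAW into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_motif(promoter, motif):
    """True if the motif occurs on the given strand or on its reverse complement."""
    pattern = motif_to_regex(motif)
    seq = promoter.upper()
    return bool(pattern.search(seq) or pattern.search(reverse_complement(seq)))


# Hypothetical 20 bp sequence standing in for a 1.5 kb promoter.
toy_promoter = "TTGACCGGAAATCCAACATG"
motifs = ["GRWAAW", "CANNTG", "TTGAC", "CCAAT"]
presence = {m: has_motif(toy_promoter, m) for m in motifs}
print(presence)
```

Applied to every promoter, a table of such booleans gives the binary presence/absence matrix on which an enrichment test can operate.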
Fisher's exact tests identified 35 cis-acting elements significantly enriched (FDR < 0.05) among promoters of the 209 genes annotated as involved in lignin biosynthetic and metabolic processes (GO:0009808 and GO:0009809) (Table 4-3). Among these 35 enriched motifs, 30 are present in the 1.5 kb DNA sequence upstream of the cpg13 start codon (motifs in bold type in Table 4-3). The number of shared cis elements is significantly greater than expected by chance (p-value = 5.2 × 10⁻¹⁹ from a right-tailed Fisher's exact test). Some of these shared cis elements were previously found in the promoters of lignin genes in other plant species, such as MYB binding domains, W-boxes related to stress response, and motifs acting in responses to light and salicylic acid. Shared cis-acting regulatory elements offer a preliminary explanation for the high expression correlation of cpg13 with lignin biosynthesis genes and indicate a possible role of cpg13 in secondary cell wall formation. Microarray data from previous studies in poplars and Arabidopsis support this hypothesis, as discussed below (see Discussion section).

Cpg13 Protein Sequence Contains a Secretory Signal Peptide and a Domain of Unknown Function Conserved in All Land Plants

Full-length cDNA of cpg13 was previously generated and is predicted to encode a protein with 304 amino acids (aa) (Figure 4-6). We used different bioinformatics tools to search for conserved domains within the peptide sequence of cpg13. TargetP 1.1 and SignalP 3.0 [264, 265] consistently predict a signal peptide for the secretory pathway in the N-terminus of the cpg13 protein, with the most likely cleavage site between amino acids 31 and 32. The subcellular localization of cpg13 and its homologues is currently unknown. Another feature identified in the protein sequence of cpg13 is a domain of unknown function (DUF579) located in its C-terminus.
DUF579 is conserved (BlastP threshold = 1e-20) in all land plants with currently sequenced genomes. The Phytozome website (DOE JGI) identifies DUF579 in one
gene of moss, two genes of Selaginella (lycophyte) and in several genes of angiosperms, ranging from five in monkey flower (Mimulus guttatus) to 18 in soybean (Glycine max) (Table 4-4). In Populus there are 12 DUF579-containing genes, including cpg13. All these plant genes have unknown function and encode proteins with a structure similar to that of cpg13 (Figure 4-6), i.e. they contain a relatively short N-terminus and DUF579 in the C-terminus. The level of conservation of DUF579 suggests that this domain modulates an important molecular function that has yet to be revealed. Characterization of cpg13 has the potential to uncover the function of DUF579 and illuminate the molecular role of several plant genes currently annotated with unknown function.

Discussion

Here we used the strength of genetical genomics to identify candidate genes contributing to the quantitative variation in critical traits for the pulp and paper, timber and bioenergy industries. By integrating gene expression analysis into an interspecific pseudo-backcross pedigree of Populus, we identified two candidate genes implicated in the phenotypic variation controlled by a cluster of QTLs for wood chemistry and growth traits. Since wood composition and biomass growth were highly correlated at the phenotypic and genotypic levels in the progeny of family 52-124, the co-localization of QTLs indicates the existence of one or more pleiotropic regulators on LGXIII of Populus [248]. The integrative genomics analysis identified the gene cpg13 as the most likely candidate underlying this pleiotropic effect. In the QTL pedigree, transcript levels of cpg13 were highly correlated with xylem lignin content (Figure 4-4 B) and with the expression of genes encoding enzymes of lignin biosynthesis (Figure 4-5). The analysis of 1.5 kb of promoter sequences from genes for lignin biosynthesis indicates enrichment of cis-regulatory motifs also found in the promoter of cpg13 (Table 4-3). Shared
promoter motifs constitute a preliminary molecular explanation for the co-expression of cpg13 with lignin biosynthesis genes.

Previous microarray studies also show that transcript levels of cpg13 are highly associated with secondary cell wall formation. Expression of cpg13 is higher in poplar stems, a tissue with substantial deposition of secondary cell wall, and lower in leaves, which are predominantly composed of cells with primary wall only (Figure 4-7 A). In a high-resolution transcription profile of tangential sections from poplar stem, cpg13 was shown to have its highest expression in the zone of programmed cell death, where lignin is synthesized and deposited in the cell wall (Figure 4-7 B). In Arabidopsis, a meta-analysis of whole-transcriptome studies selected the closest homologue of cpg13 (At1g33800) as a candidate for secondary cell wall formation based on its expression correlation with IRREGULAR XYLEM (IRX) genes. Similarly, another study identified the same At1g33800 gene among 52 members of a core xylem gene set by performing a digital northern analysis with several EST libraries from Arabidopsis. The expression profile of cpg13 and the presence of cis-acting motifs commonly found among genes of the lignin biosynthesis pathway indicate that cpg13 might be a new genetic element involved in the formation of secondary cell wall, more specifically in lignification.

Previous studies have demonstrated a link between downregulation of genes encoding enzymes of lignin biosynthesis and increased biomass in Populus and Pinus [202, 203, 249]. However, the molecular mechanism coordinating lignin biosynthesis and biomass growth remains elusive and, in fact, some poplar mutants with downregulated lignin biosynthesis do not show increased growth rates [99, 270, 271], indicating that the pleiotropic effect on growth is not ubiquitous. Alternatively, it is also conceivable that cpg13 might coordinate wood composition and growth by functioning in the molecular sensing of
nitrogen and/or sugar levels. Sugar and
nitrogen signaling are interconnected [272, 273], since the assimilation of carbon (C) into amino acids, nucleic acids and other organic molecules depends upon the availability of nitrogen (N), and C accumulates in carbohydrates and starch if the plant is deprived of N. When nitrogen is abundant (relatively low C/N), for example after N fertilization, plant metabolism tends to shift resources towards absorption of carbon, developing more shoot than root biomass and inducing expression of genes involved in the primary metabolism of photosynthesis, the Calvin cycle and photorespiration [232, 233]. Changes in the carbon to nitrogen balance can cause a dramatic change not only in biomass allocation but also in cell wall chemistry. We and other researchers demonstrated that wood tissue responds to N fertilization by partitioning more carbon into cellulose than lignin [211, 212, 274]. Consistently, analyses of the effects of nitrogen fertilization on the whole transcriptome of Arabidopsis show that N represses most of the genes from the phenylpropanoid pathway responsible for synthesis of lignin monomers [231, 232]. If cpg13 has a role in sensing sugar status, for example, its overexpression might induce a high C/N signal that triggers more carbon partitioning to lignin relative to cellulose and possibly reduces photosynthetic rate. The latter could contribute to the low biomass accumulation associated with cpg13 upregulation.
Consistent with this possibility are data demonstrating that addition of sugars enhanced lignin deposition in Arabidopsis. The effects of glucose and sucrose were not due solely to the C resources they supplied, since non-metabolized analogues also induced lignification, suggesting that sugars can affect lignin through signaling. Functional analysis of cpg13 can help address whether carbon and nitrogen sensing are molecular mechanisms involved in the coordinated regulation of wood composition and growth that has been observed in different tree species [51, 98, 202, 203, 248, 249].

To begin our characterization of the cpg13 gene, we conducted a preliminary analysis of Arabidopsis mutant
lines with transfer DNA (T-DNA) insertions in two of the closest homologues to cpg13. The mutant lines showed increased shoot growth rate relative to wild-type plants. Replications are currently in progress, as well as additional experiments to test whether the increased growth occurs to a similar extent under altered conditions (e.g. light and nutrient regimes). In addition to the analysis of Arabidopsis T-DNA mutants, we have regenerated transgenic poplars designed to either overexpress or downregulate cpg13. We are currently propagating these transgenic lines to phenotype their biomass and wood chemistry traits. The transcriptome of these transgenic poplars will be assayed with microarrays to identify genes with altered expression and, more importantly, the pathways that respond to changes in the expression of cpg13. Another transgenic poplar that expresses cpg13 fused to green fluorescent protein (GFP) at its carboxy terminus is also being regenerated. Histochemical analysis of these plants will show the subcellular localization of cpg13. The analyses of these transgenic poplars, in addition to the T-DNA lines from Arabidopsis, are expected to uncover the molecular function of cpg13, and possibly identify new regulatory pathways that coordinate biomass growth and carbon partitioning between lignin and cellulose.
Figure 4-1. Distribution and correlation of biomass weight and lignin content. A) Histogram of the distribution of dry weight biomass. B) Histogram of the distribution of lignin content. C) Correlation (r = -0.48) between biomass weight and lignin content. Measurements were obtained from 396 replicated genotypes of family 52-124. Values are least square mean estimates based on three biological replicates of each genotype grown under the high nitrogen treatment.

[Figure axes: histograms show count units against biomass dry weight (g) and against lignin content (%); panel C plots biomass dry weight (g) against lignin content (%).]
Figure 4-2. Fine mapping the pleiotropic QTL interval of LGXIII. A) Likelihood ratio (LR) profiles for cellulose, lignin and biomass traits on the original LGXIII map, showing the pleiotropic QTL contained in a region with 677 genes. B) LR profile for the same traits after addition of 4 microsatellites that reduced the QTL interval to a region containing 356 genes.

[Figure axes: likelihood ratio (LR) with significance threshold against marker position. Marker order on the original map: P14, G3990, W22, P2658, G2577, P2847, G2218 (677 genes). Marker order with the 4 additional SSRs: P14, G3990, W22, P2658, G2577, SSR5, SSR14, SSR11, SSR8, G2218 (356 genes).]
Figure 4-3. Venn diagram depicting the number of genes physically located in the QTL interval (blue circle), the number of genes with cis- and trans-regulated eQTL mapped between markers SSR5 and SSR8 (green circle), and genes with mRNA levels correlated with cellulose and lignin contents in xylem tissue as well as with total biomass at a false discovery rate (FDR) of 10% (red circle). Integration of these three levels of genomic information identifies two candidates strongly associated with the phenotypic variation regulated by the pleiotropic QTL of LGXIII. The three-way overlap shows that these two genes are physically located in the QTL interval, have transcript levels regulated by the same interval (cis regulation) and are highly correlated with the three phenotypic traits of interest (lignin, cellulose and biomass).

[Figure labels: cis eQTLs (regulated by the QTL and physically located there); trans eQTLs (regulated by the QTL but not physically located there); expression correlation at an FDR of 10% with phenotypic traits; genes physically located on the QTL interval. Region counts printed in the diagram: 324; 1,999; 12; 2; 18; 3; 56.]
Figure 4-4. Cis eQTL and expression correlation with lignin and total biomass for the two candidate genes identified with a genetical genomics approach. A) LR profile demonstrating that genes gw220.127.116.11 and gw1.41.566.1 have a cis-regulated eQTL that co-localizes with QTLs for lignin, cellulose and biomass on LGXIII. B) Expression of gw1.41.566.1 is correlated with lignin content (r = 0.40) and total biomass (r = 0.20). C) Expression of gw18.104.22.168 is correlated with lignin content (r = 0.23) and total biomass (r = 0.25).

[Figure axes: panel A shows likelihood ratio (LR) across linkage groups LGI to LGXIX for gw1.41.566.1 expression, gw22.214.171.124 expression, cellulose, lignin and total biomass. Panels B and C each show an expression histogram and plots of microarray signal against lignin content (%) and biomass dry weight (g).]
Figure 4-5. Co-expression of members of the biochemical pathways involved in lignin biosynthesis with the gw1.41.566.1 gene. With the exception of C3H, expression levels of at least one member of each lignin biosynthesis gene family correlate with levels of gw1.41.566.1 at a Pearson correlation threshold of 0.79. Highlights in red, gold and yellow depict the extent of the correlations as shown. Pathway chart modified from Kirst et al.

[Figure labels: C4H = 0.90; 4CL = 0.90; CCoAOMT = 0.87; HMT = 0.87; SAMS = 0.86; PAL = 0.85; F5H = 0.85; CAD = 0.84; CAD = 0.84; OMT = 0.81; CCR = 0.80; SAH = 0.79. Color scale from r = 0.79 to r = 0.90.]
Figure 4-6. Cpg13 protein structure containing a secretory signal peptide (aa 1-31) and a conserved domain of unknown function (DUF579, aa 162-304). Below is an alignment depicting the conservation of DUF579 in 37 genes from Populus (sequences starting with Ptr), Arabidopsis (AT), Oryza (Os), Vitis (GS), Selaginella (Sel) and Physcomitrella (Phy). Cpg13 gene model ID (Populus annotation v.1) is depicted in red. Alignment was performed with ClustalW2 (EMBL-EBI).

[Aligned sequences, with the first and last residue positions printed beside each row of the alignment: Ptr|gw1.I.3855.1 (184-267); Ptr|gw1.IX.2263.1 (184-267); Ptr|eugene3.00050506 (194-277); Ptr|gw1.VII.2881.1 (196-279); AT5G67210.1 (195-278); AT3G50220.1 (202-285); GSVIVP00036632001 (182-265); GSVIVP00019546001 (187-270); AT2G15440.1 (190-275); Os06g47310.1 (188-271); Os02g06380.1 (272-352); Os04g55640.1 (191-275); Sel|gw1.38.480.1 (118-199); Sel|gw1.59.411.1 (123-204); Phy|e_gw1.15.1 (117-201); Ptr|eugene3.00010467 (175-257); Ptr|eugene3.00031418 (175-257); GSVIVP00002806001 (169-251); GSVIVP00037860001 (173-255); AT1G67330.1 (180-262); AT1G27930.1 (176-258); Os11g29780.1 (180-262); Ptr|gw126.96.36.199 (119-200); AT1G09610.1 (169-251); GSVIVP00035310001 (175-256); GSVIVP00035299001 (181-262); Ptr|gw1.41.566.1 (196-277); Ptr|gw1.XIX.1870.1 (126-207); GSVIVP00021863001 (192-273); GSVIVP00021864001 (219-303); AT1G33800.1 (182-263); AT4G09990.1 (174-256); AT1G71690.1 (183-266); Os12g10320.1 (185-268); Os11g13870.1 (207-291); Ptr|fgenesh4_pm.C_LG_XV000296 (204-282); AT4G24910.1 (201-283).]
Figure 4-7. Previous microarray studies indicate that expression of cpg13 is highest in tissues undergoing secondary cell wall formation. A) Relative expression level measured for cpg13 in five vegetative tissues; fold change determined relative to the signal detected in a set of eighty negative control genes (cartoon and data from Quesada et al.). B) Gene expression signal detected for cpg13 in five stem sections, representing different stages of wood formation. Samples were collected from the phloem, cambial zone, expansion zone, zone of secondary wall formation and zone of programmed cell death/lignification. Cartoon from PopGenIE and data from Courtois-Moreau et al.

[Figure axes: panel A shows intensity of signal relative to control for internode, node, root, young leaf and mature leaf (from Quesada et al., New Phytologist 2008). Panel B shows signal for phloem, cambium, fiber extension, secondary cell wall and PCD/lignification sections (from Moreau et al., Plant Journal 2009).]
Table 4-1. Annotation of the two candidate cis-regulated genes in the pleiotropic QTL, highly correlated with biomass and wood chemistry traits.

                                                                       Spearman correlation
Populus gene id (v.1)  Annotation from Arabidopsis closest homologue  Cellulose  Lignin   Shoot biomass  Root biomass  Total biomass
gw1.41.566.1           unknown function protein                       0.408**    0.403**  0.185          0.251*        0.207*
gw188.8.131.52         CPK4 (calcium-dependent protein kinase 4)      0.216*     0.232*   0.239*         0.260*        0.252*

* FDR < 0.10; ** FDR < 0.0001
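The Spearman (rank) correlations reported in Table 4-1 are Pearson correlations computed on ranks. The analysis in this chapter used JMP 8.0; the sketch below is only a plain Python illustration of the statistic, with tied values given the mean of their ranks. The function names and the five toy observations are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical microarray signal and lignin content (%) for five genotypes.
expression = [7.1, 8.4, 6.9, 9.2, 8.0]
lignin = [19.5, 21.0, 18.8, 22.4, 20.1]
rho = spearman(expression, lignin)
print(rho)
```

Because only ranks enter the calculation, the statistic is insensitive to the scale of the microarray signal and to monotone transformations of either variable.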
Table 4-2. Top 30 expression correlations between the 20 cis-regulated genes and genes from the lignin biosynthesis pathways. The gene model ID is shown for the cis-regulated gene, while lignin genes are identified by their annotation. P-value is depicted only for Spearman correlation estimates.

Gene              Member of gene family from lignin biosynthesis pathways                  Pearson r  Spearman r  P-value
gw1.41.566.1      4-coumarate:CoA ligase                                                   0.897      0.810       3.81E-43
gw1.41.566.1      5-methyltetrahydropteroyltriglutamate-homocysteine S-methyltransferase   0.860      0.803       7.54E-42
gw1.41.566.1      5-methyltetrahydropteroyltriglutamate-homocysteine S-methyltransferase   0.872      0.789       1.42E-39
gw1.41.566.1      4-coumarate:CoA ligase                                                   0.843      0.776       1.62E-37
gw1.41.566.1      cytochrome p450                                                          0.876      0.767       3.88E-36
gw1.41.566.1      cinnamyl alcohol dehydrogenase                                           0.840      0.766       5.18E-36
gw1.41.566.1      ferulate 5-hydroxylase (FAH1)                                            0.846      0.759       4.71E-35
gw1.41.566.1      5-methyltetrahydropteroyltriglutamate-homocysteine S-methyltransferase   0.886      0.756       1.57E-34
gw1.41.566.1      5-methyltetrahydropteroyltriglutamate-homocysteine S-methyltransferase   0.877      0.744       4.86E-33
gw1.41.566.1      3-dehydroquinate synthase                                                0.835      0.739       2.52E-32
gw1.41.566.1      4-coumarate:CoA ligase                                                   0.782      0.731       2.05E-31
gw184.108.40.206  cinnamoyl CoA reductase                                                  0.731      0.719       5.69E-30
gw1.41.566.1      cinnamoyl CoA reductase                                                  0.726      0.714       2.29E-29
gw1.41.566.1      5-methyltetrahydropteroyltriglutamate-homocysteine S-methyltransferase   0.792      0.714       2.57E-29
gw1.41.566.1      putative 4-coumarate:CoA ligase                                          0.775      0.710       6.88E-29
gw1.41.566.1      5-enolpyruvylshikimate 3-phosphate (EPSP) synthase                       0.851      0.708       1.07E-28
gw220.127.116.11  dehydroquinate dehydratase/shikimate dehydrogenase                       0.689      0.704       3.34E-28
gw1.41.566.1      putative shikimate kinase precursor                                      0.773      0.703       4.11E-28
gw18.104.22.168   cinnamyl alcohol dehydrogenase                                           0.710      0.699       9.41E-28
gw22.214.171.124  putative shikimate kinase precursor                                      0.758      0.697       1.80E-27
gw1.41.566.1      anthranilate N-benzoyltransferase                                        0.776      0.694       3.07E-27
gw1.41.572.1      5-methyltetrahydropteroyltriglutamate-homocysteine S-methyltransferase   0.779      0.694       3.57E-27
gw1.41.572.1      3-dehydroquinate synthase                                                0.798      0.690       9.56E-27
gw1.41.566.1      ferulate 5-hydroxylase (FAH1)                                            0.799      0.688       1.29E-26
gw126.96.36.199   anthranilate N-benzoyltransferase                                        0.694      0.687       1.65E-26
gw1.41.572.1      2-dehydro-3-deoxyphosphoheptonate aldolase                               0.662      0.685       2.69E-26
gw1.41.566.1      putative 4-coumarate:CoA ligase                                          0.770      0.683       4.24E-26
gw1.41.572.1      putative cinnamyl alcohol dehydrogenase                                  0.746      0.673       4.42E-25
gw1.41.566.1      cinnamoyl CoA reductase                                                  0.806      0.671       6.46E-25
gw1.41.572.1      cinnamoyl CoA reductase                                                  0.599      0.670       7.94E-25
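The Bonferroni threshold applied to these pairwise correlations (p-value < 3.11 × 10⁻⁶) is consistent with dividing a family-wise error rate by the number of gene pairs tested. The short check below assumes a family-wise alpha of 0.01, which is an assumption made here; the text reports only the resulting threshold.

```python
n_cis_genes = 20      # cis-regulated genes in the QTL interval (from the text)
n_lignin_genes = 161  # genes for biosynthesis of lignin monomers (from the text)
alpha = 0.01          # assumed family-wise error rate; not stated in the text

n_tests = n_cis_genes * n_lignin_genes  # one test per gene pair
threshold = alpha / n_tests             # Bonferroni-corrected per-test threshold

print(n_tests, f"{threshold:.2e}")
```

With 3,220 pairwise tests, this assumed alpha reproduces the reported threshold to three significant figures.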
Table 4-3. Motifs enriched (q-value < 0.05) among the promoters of genes involved in lignin biosynthetic and metabolic processes. The number of lignin genes with the motif absent and present is given, as well as their proportion among all genes represented in the microarray. P-value and false discovery rate (q-value) from the Fisher's exact test for enrichment are also shown. Motifs in bold type are also present in the promoter of the cpg13 (gw1.41.566.1) gene.

                Absent        Present
Motif           N    %        N    %        p-value    q-value    Description of the motif's molecular role
AGAAA           4    0.069%   205  0.474%   1.53E-07   2.754E-05  One of two co-dependent regulatory elements responsible for pollen-specific activation of tomato (L.e.) lat52 gene
ATATT           5    0.081%   204  0.476%   2.977E-07  3.109E-05  Motif found both in promoters of rolD
GRWAAW          4    0.070%   205  0.473%   3.454E-07  3.109E-05  Consensus GT-1 binding site in many light-regulated genes, e.g., RBCS from many species, PHYA from oat and rice, spinach RCA and PETA, and bean CHS15
RTTTTTR         9    0.118%   200  0.483%   4.335E-07  3.121E-05  SEF4 binding site; soybean (G.m.) consensus sequence found in 5' upstream region (-199) of beta-conglycinin (7S globulin) gene (Gmg17.1)
AATAAT          12   0.139%   197  0.487%   6.147E-07  3.161E-05  Plant polyA signal; consensus sequence for plant polyadenylation signal
TTATTT          8    0.111%   201  0.480%   6.136E-07  3.161E-05  TATA box; TATA box found in the 5' upstream region of pea (Pisum sativum) glutamine synthetase gene; a functional TATA element by in vivo analysis
GATAA           8    0.117%   201  0.476%   1.843E-06  8.294E-05  I box; conserved sequence upstream of light-regulated genes
TAAAG           8    0.118%   201  0.475%   2.682E-06  0.0001073  TAAAG motif found in promoter of Solanum tuberosum (S.t.) KST1 gene; target site for trans-acting StDof1 protein controlling guard cell-specific gene expression
GAAAAA          10   0.135%   199  0.478%   3.927E-06  0.0001414  GT-1 motif found in the promoter of soybean (Glycine max) CaM isoform, SCaM-4
CANNTG          8    0.126%   201  0.470%   1.121E-05  0.0003669  MYC recognition site found in the promoters of the dehydration-responsive gene rd22 and many other genes in Arabidopsis
TTWTWTTWTT      32   0.230%   177  0.503%   1.394E-05  0.0003877  T-box; motif found in SAR (scaffold attachment region; or matrix attachment region, MAR)
YTCANTYY        13   0.160%   196  0.479%   1.386E-05  0.0003877  Inr (initiator) elements found in the tobacco psaDb gene promoter without TATA boxes; light-responsive transcription of psaDb depends on Inr, but not TATA box
AATAAA          11   0.152%   198  0.473%   2.15E-05   0.0004838  PolyA signal; poly A signal found in legA gene of pea, rice alpha-amylase
TAAAATAT        92   0.319%   117  0.577%   2.132E-05  0.0004838  Core element in LeCp (tomato Cys protease) binding cis element (from -715 to -675) in LeAcs2 gene
TGTCA           16   0.178%   193  0.481%   2.03E-05   0.0004838  Binding site of OsBIHD1, a rice BELL homeodomain transcription factor
ACCWWCC         109  0.335%   100  0.604%   2.644E-05  0.0005599  Consensus of the putative "core" sequences of box-L-like sequences in carrot (D.c.) PAL1 promoter region
YCYYACCWACC     200  0.411%   9    2.479%   2.896E-05  0.00058    One of three putative cis-acting elements (boxes P, A, and L) of phenylalanine ammonia-lyase (PAL; EC 4.3.1.5) genes in parsley (P.c.); appear to be necessary but not sufficient for light-mediated PAL gene activation
TATAAAT         60   0.286%   149  0.531%   3.196E-05  0.0006063  TATA box; TATA box found in the 5' upstream region of pea legA gene; sporamin A of sweet potato
CAACA           11   0.157%   198  0.471%   4.195E-05  0.0006901  Binding consensus sequence of Arabidopsis (A.t.) transcription factor, RAV1
TGACY           11   0.156%   198  0.471%   4.217E-05  0.0006901  W box found in the promoter region of a transcriptional repressor ERF3 gene in tobacco; may be involved in activation of ERF3 gene by wounding
YTYYMMCMAMCMMC  198  0.408%   11   1.923%   3.935E-05  0.0006901  One of three putative cis-acting elements (boxes P, A, and L) of phenylalanine ammonia-lyase (PAL; EC 4.3.1.5) genes in parsley (P.c.); appear to be necessary but not sufficient for light-mediated PAL gene activation
WAACCA          20   0.206%   189  0.480%   8.04E-05   0.0012584  MYB recognition site found in the promoters of the dehydration-responsive gene rd22 and many other genes in Arabidopsis; W=A/T
CTCTT           12   0.173%   197  0.468%   0.000198   0.00297    One of two putative nodulin consensus sequences; see also S000461 (NODCON1GM)
TATTTAA         63   0.306%   146  0.513%   0.000432   0.0062208  Binding site for OsTBP2, found in the promoter of rice PAL gene encoding phenylalanine ammonia-lyase
CCAAT           18   0.214%   191  0.470%   0.0006109  0.00846    Common sequence found in the 5' non-coding regions of eukaryotic genes; "CCAAT box" found in the promoter of heat shock protein genes
TATATAA         77   0.323%   132  0.523%   0.0006547  0.0087333  TATA box; TATA box found in the 5' upstream region of sweet potato sporamin A gene
TGACG           87   0.332%   122  0.533%   0.0008128  0.0104529  Found in many promoters and involved in transcriptional activation of several genes by auxin and/or salicylic acid; may be relevant to light regulation
TATTAAT         78   0.328%   131  0.518%   0.0013754  0.017069   TATA box; TATA box found in the 5' upstream region of sweet potato sporamin A gene
CNGTTR          21   0.235%   188  0.469%   0.0015719  0.018864   Binding site for all animal MYB and at least two plant MYB proteins, ATMYB1 and ATMYB2, both isolated from Arabidopsis
AAAGAT          34   0.273%   175  0.478%   0.0018505  0.0208238  One of two putative nodulin consensus sequences
TTGAC           20   0.233%   189  0.467%   0.0018145  0.0208238  W box found in promoter of Arabidopsis thaliana (A.t.) NPR1 gene; recognized specifically by salicylic acid (SA)-induced WRKY DNA binding proteins
CCWACC          92   0.343%   117  0.527%   0.0021239  0.0231709  Core of consensus maize P (myb homolog) binding site; W=A/T; 6 bp core; maize P gene specifies red pigmentation of kernel pericarp, cob, and other floral organs
TATTCT          51   0.305%   158  0.489%   0.0026729  0.0283024  Involved in the expression of the plastid gene psbD, which encodes a photosystem II reaction center chlorophyll-binding protein that is activated by blue, white or UV-A light
AATTAAA         42   0.291%   167  0.482%   0.0029051  0.02988    PolyA signal; poly A signal found in rice alpha-amylase
TGACT           29   0.270%   180  0.469%   0.0042341  0.04234    SUSIBA2 binds to W box element in barley iso1 (encoding isoamylase1) promoter
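The enrichment test summarized in Table 4-3 can be sketched as a right-tailed Fisher's exact test, which is the upper tail of a hypergeometric distribution. The Python sketch below uses only the standard library and is not the software used in this study. The totals of 49,066 promoters and 209 lignin genes come from the text, while the genome-wide count of promoters carrying the motif (43,000) is a hypothetical value chosen for illustration, as is the function name.

```python
from fractions import Fraction
from math import comb


def fisher_right_tail(k, n_with_motif, n_lignin, n_total):
    """P(X >= k) for X = number of lignin genes carrying the motif.

    Under the null hypothesis the n_lignin genes are a random draw, without
    replacement, from n_total promoters of which n_with_motif carry the motif.
    """
    upper = min(n_with_motif, n_lignin)
    tail = sum(
        comb(n_with_motif, i) * comb(n_total - n_with_motif, n_lignin - i)
        for i in range(k, upper + 1)
    )
    return Fraction(tail, comb(n_total, n_lignin))


n_total = 49066       # promoters with 1.5 kb of uninterrupted sequence (from the text)
n_lignin = 209        # lignin genes among them (from the text)
n_with_motif = 43000  # hypothetical: promoters carrying the motif genome-wide
k_observed = 205      # lignin promoters carrying the motif (AGAAA row of Table 4-3)

p_example = float(fisher_right_tail(k_observed, n_with_motif, n_lignin, n_total))
print(p_example)
```

Exact integer arithmetic with `math.comb` and `Fraction` avoids floating-point underflow in the individual hypergeometric terms; the result is converted to a float only at the end.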
Table 4-4. Number of DUF579-containing genes in each of the 17 land plants with genomes currently sequenced. Data from the Phytozome website (DOE JGI).

Organism                    Common name             DUF579-containing genes  Source
Arabidopsis thaliana        Mouse-ear cress         10                       TAIR release 9 acquired from TAIR
Arabidopsis lyrata          Lyre-leaved rock cress  10                       JGI release 1.0
Carica papaya               Papaya                  8                        ASGPB release of 2007
Populus trichocarpa         Poplar                  12                       JGI v2.0 annotation of the v2 assembly
Medicago truncatula         Barrel medic            4                        Release Mt3.0 from the Medicago genome sequence consortium
Glycine max                 Soybean                 18                       JGI Glyma1.0 assembly
Ricinus communis            Castor bean             7                        TIGR release 0.1
Manihot esculenta           Cassava                 12                       JGI/Roche v1.1 assembly and annotation
Cucumis sativus             Cucumber                6                        Roche 454 XLR assembly and JGI v1 annotation
Vitis vinifera              Grape                   8                        Sept 2007 annotation from Genoscope
Sorghum bicolor             Sweet sorghum           7                        Sbi1.4 models from MIPS/PASA on v1.0 assembly
Zea mays                    Maize                   10                       Protein-coding models from Maizesequence.org release 4a.53
Oryza sativa                Rice                    6                        MSU Release 6.0 of the Rice Genome Annotation
Brachypodium distachyon     Purple false brome      7                        JGI v1.0 8x assembly of strain Bd21
Mimulus guttatus            Monkey flower           5                        JGI 7x assembly of strain IM62, annotation v1.0
Selaginella moellendorffii  Spikemoss               2                        JGI v1.0 assembly and annotation
Physcomitrella patens       Moss                    1                        JGI v1.1 assembly and annotation
CHAPTER 5
CONCLUSIONS

The research presented in this dissertation shows how multiple dimensions of genomics information can be leveraged in an integrative fashion to identify genes controlling quantitative traits in plant species. By merging genotypic, phenotypic and gene expression data measured in a QTL population of Populus trichocarpa × P. deltoides, we identified cpg13 as the most likely candidate underlying a major pleiotropic quantitative trait locus (QTL) (Chapter 4). The QTL mapped in LGXIII regulates the variation of two major bioenergy traits: biomass growth and composition (Chapter 3). The composition, as measured by the relative amount of cellulose and lignin, is a key determinant of the conversion efficiency of wood into biofuels. Previous studies demonstrated in Eucalyptus and Populus that biomass growth and composition are not independent, as trees that grow faster tend to partition more carbon into cellulose than lignin [51, 248]. However, the genetic elements and molecular mechanisms governing this coordinated regulation between growth and carbon partitioning were largely unknown, making it difficult to distinguish cause and effect and limiting exploitation of this relationship. The discovery of cpg13 potentially uncovers key molecular links between biomass growth and cell wall composition.

To follow up the results from this dissertation, several transgenic poplars with altered expression of cpg13 are currently being regenerated and propagated. Trees downregulating and overexpressing cpg13 will be phenotyped for biomass growth, as well as for lignin and cellulose content, to confirm the function of the gene in the regulation of these bioenergy traits. Microarrays will be used to assay the transcriptome and identify the molecular pathways that are responsive to changes in cpg13 expression. Additionally, the subcellular localization of cpg13 will be visualized in transgenic poplars overexpressing the cpg13 protein fused to green
115 fluorescent protein. Analysis of these transgenic poplars are expected to reveal the molecular function of cpg13 and potentially uncover new regulatory mechanisms of secondary cell wall formation conserved across plant species More importantly, functional characterization of cpg13 can potentially unveil key molecular mechanisms coordinating biomass growth and composition observed in woody species [51, 98, 202, 203, 248, 249] Knowledge of these molecular mechanisms will open new avenues, either by using cpg13 or other pleiotropic regulators, to generate tree germplasm that is improved for higher pr oductivity of biomass with reduced lignin. Discovery of cpg13 illustrates the importance of large scale genomic resources, including genomic and expressed sequences, molecular markers, gene expression and metabolite data from different tissues and organs. Even though the Eucalyptus genus includes the fastest growing and most planted hardwood tree species, very few genomic resources are currently available for eucalypts For example, as of February/2010 only ~37,000 expressed sequence tags (ESTs) were publically available for the Eucalyptus genus at the National Center for Biotechnology Information sequence database, as opposed to over 400,000 for Populus and Pinus The lack of expressed sequences limits the breadth of the genetic studies aimed at discovering genes that regulate silvicultural and bioenergy traits, usin g integrative approaches such as the one used to identify cpg13 (Chapters 3 and 4). To start changing the paradigm for Eucalyptus the most comprehensive transcriptome analysis for the genus was generated and is described in Chapter 2. Using 454 technology 148 Mbp of expressed sequences were generated and assembled in 71,384 contigs that are helping annotate the forthcoming Eucalyptus grandis genome sequence. 
Because sequences were generated from a pool of different Eucalyptus genotypes, another important outcome of that study was the discovery of 23,742 single nucleotide polymorphisms (SNPs). The analyses of these SNPs demonstrate how polymorphisms sampled on a large scale using new sequencing technologies can be used to infer the evolutionary forces shaping the variation among gene sequences.

In summary, this dissertation demonstrates how the genomic resources available for model species, such as Populus, can be harnessed to discover genes implicated in the quantitative variation of phenotypic traits. Furthermore, this dissertation demonstrates how new sequencing technologies are extending the benefits of large-scale genomic resources beyond model systems to economically important species, such as Eucalyptus grandis. The recent upsurge of high-throughput sequencing technologies, with outputs of 20-40 Gb in 12 days at a fraction of the cost of traditional Sanger technology [277, 278], is changing the field of genetics and will certainly impact forest species. Soon even the mega-genomes of conifers will be sequenced, and species of Populus and Eucalyptus will have many individuals with their genomes fully characterized. The extensive genome sequence data being generated will allow a comprehensive analysis of genetic diversity in many plant species and will help identify genome-wide associations with the quantitative variation in economically relevant traits.
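The integrative strategy summarized in this chapter can be illustrated with a minimal sketch. The Python example below is not the analysis pipeline used in Chapters 3 and 4; it only assumes a simplified setting in which genes located inside a QTL interval are ranked by the strength of the correlation between their expression and a trait across progeny. The gene identifiers (gene_A, gene_B, gene_C), the expression values, the trait values and the function names are all hypothetical and invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_candidates(expression, trait):
    """Rank genes by |r| between expression and trait, strongest first."""
    scores = {gene: pearson(values, trait) for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical trait values (e.g. biomass) measured in six progeny.
trait = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Hypothetical expression levels of three genes inside the QTL interval.
expression = {
    "gene_A": [2.0, 4.1, 5.9, 8.2, 9.8, 12.1],  # tracks the trait closely
    "gene_B": [5.0, 4.8, 5.1, 5.0, 4.9, 5.2],   # nearly constant expression
    "gene_C": [6.0, 5.0, 4.5, 3.0, 2.5, 1.0],   # negatively associated
}

ranking = rank_candidates(expression, trait)
for gene, r in ranking:
    print(f"{gene}: r = {r:+.2f}")
```

Ranking on the absolute value of the correlation keeps both positively and negatively associated genes near the top of the list. A real analysis would additionally need marker genotypes, significance testing and multiple-testing control, none of which is attempted here.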
LIST OF REFERENCES

1. FAO: Global forest resources assessment 2005: progress towards sustainable forest management. In: Forestry Paper 147. Rome: Food and Agriculture Organization (FAO); 2006.
2. FAO: Contribution of the forestry sector to national economies, 1990-2006. In: Forest Finance FSFM/ACC/08. Rome: Food and Agriculture Organization (FAO), Forest Economics and Policy Division; 2008.
3. Bonan GB: Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 2008, 320(5882):1444-1449.
4. Lenoir J, Gegout JC, Marquet PA, de Ruffray P, Brisse H: A significant upward shift in plant species optimum elevation during the 20th century. Science 2008, 320(5884):1768-1771.
5. Rosenzweig C, Karoly D, Vicarelli M, Neofotis P, Wu Q, Casassa G, Menzel A, Root TL, Estrella N, Seguin B et al: Attributing physical and biological impacts to anthropogenic climate change. Nature 2008, 453(7193):353-357.
6. Carroll A, Somerville C: Cellulosic biofuels. Annu Rev Plant Biol 2009, 60:165-182.
7. Sticklen MB: Plant genetic engineering for biofuel production: towards affordable cellulosic ethanol. Nat Rev Genet 2008, 9(6):433-443.
8. Johnson JM-F, Coleman MD, Gesch R, Jaradat A, Mitchell R, Reicosky D, Wilhelm WW: Biomass-Bioenergy Crops in the United States: A Changing Paradigm. The Americas Journal of Plant Science and Biotechnology 2007, 1(1):1-28.
9. Lemus R, Lal R: Bioenergy crops and carbon sequestration. Crit Rev Plant Sci 2005, 24(1):1-21.
10. Ladiges PY, Udovicic F, Nelson G: Australian biogeographical connections and the phylogeny of large genera in the plant family Myrtaceae. J Biogeogr 2003, 30(7):989-998.
11. Brooker MIH: A new classification of the genus Eucalyptus L'Her. (Myrtaceae). Aust Syst Bot 2000, 13(1):79-148.
12. Williams JE, Brooker MIH: Eucalypts: an introduction. In: Eucalyptus ecology. Edited by Williams JA, Woinarski JCZ. Cambridge: University Press; 1997: 1-15.
13. Eldridge K, Davidson J, Harwood C, van Wyk G: Eucalypt domestication and breeding. Oxford: Oxford University Press; 1994.
14. Gaiotto FA, Bramucci M, Grattapaglia D: Estimation of outcrossing rate in a breeding population of Eucalyptus urophylla with dominant RAPD and AFLP markers. Theoretical and Applied Genetics 1997, 95(5-6):842-849.
15. House SM: Reproductive biology of eucalypts. In: Eucalyptus ecology. Edited by Williams JA, Woinarski JCZ. Cambridge: University Press; 1997: 1-15.
16. Grattapaglia D, Kirst M: Eucalyptus applied genomics: from gene sequences to breeding tools. New Phytologist 2008, 179(4):911-929.
17. Poke FS, Vaillancourt RE, Potts BM, Reid JB: Genomic research in Eucalyptus. Genetica 2005, 125(1):79-101.
18. FAO: Global forest resources assessment 2000: main report. In: Forestry Paper 140. Rome: Food and Agriculture Organization (FAO); 2001.
19. Grattapaglia D: Genomics of Eucalyptus, a global tree for energy, paper, and wood. In: Plant genetics and genomics: crops and models. vol. 1. New York: Springer; 2008: 259-298.
20. Meskimen GF, Rockwood DL, Reddy KV: Development of Eucalyptus clones for a summer rainfall environment with periodic severe frosts. New Forests 1987, 3:197-205.
21. Rockwood DL: Freeze resilient E. grandis clones for Florida, USA. In: IUFRO Symposium on Intensive Forestry: The Role of Eucalypts: 1991; Durban, South Africa; 1991: 455-466.
22. FAO: Mean annual volume increment of selected industrial forest plantation species by L Ugalde & O Pérez. In: Forest Plantation Thematic Papers, Working Paper 1. Rome: Food and Agriculture Organization (FAO), Forest Resources Division; 2001.
23. Potts BM, Dungey HS: Interspecific hybridization of Eucalyptus: key issues for breeders and geneticists. New Forests 2004, 27:115-138.
24. de Assis TF, Rezende GDSP, Aguiar AM: Current status of breeding and deployment for clonal forestry with tropical eucalypt hybrids in Brazil. In: XXII IUFRO World Congress Forests in the Balance: Linking Tradition and Technology: 2005; Brisbane, Australia: Intl Forestry Rev, 7:61; 2005.
25. Tzfira T, Zuker A, Altman A: Forest tree biotechnology: genetic transformation and its application to future forests. Trends in Biotechnology 1998, 16(10):439-446.
26. Eckenwalder JE: Systematics and evolution of Populus. In: Biology of Populus and its implications for management and conservation. Edited by Stettler RF, Bradshaw HD, Heilman PE, Hinckley TM. Ottawa, Canada: NRC Research Press; 1996: 7-32.
27. Davis JM: Genetic improvement of poplar (Populus spp.) as a bioenergy crop. In: Genetic improvement of bioenergy crops. Edited by Vermerris W. New York: Springer; 2008: 377-396.
28. Ball J, Carle J, Del Lungo A: Contributions of poplars and willows to sustainable forestry and rural development. Unasylva 221 2005, 56:3-9.
29. Riemenschneider DE: Breeding and nursery propagation of cottonwood and hybrid poplars for use in intensively cultured plantations. In: The Northeastern Forest and Conservation Nursery Associations Conference: 1997; Portland, OR: U.S. Department of Agriculture, Forest Service; 1997: 38-42.
30. Bailey JK, Deckert R, Schweitzer JA, Rehill BJ, Lindroth RL, Gehring C, Whitham TG: Host plant genetics affect hidden ecological players: links among Populus, condensed tannins, and fungal endophyte infection. Canadian Journal of Botany-Revue Canadienne De Botanique 2005, 83(4):356-361.
31. Bailey JK, Wooley SC, Lindroth RL, Whitham TG: Importance of species interactions to community heritability: a genetic basis to trophic level interactions. Ecol Lett 2006, 9(1):78-85.
32. LeRoy CJ, Whitham TG, Keim P, Marks JC: Plant genes link forests and streams. Ecology 2006, 87(1):255-261.
33. Wimp GM, Martinsen GD, Floate KD, Bangert RK, Whitham TG: Plant genetic determinants of arthropod community structure and diversity. Evolution 2005, 59(1):61-69.
34. Ingvarsson PK: Nucleotide polymorphism and linkage disequilibrium within and among natural populations of European aspen (Populus tremula L., Salicaceae). Genetics 2005, 169(2):945-953.
35. Savolainen O, Pyhajarvi T: Genomic diversity in forest trees. Curr Opin Plant Biol 2007, 10(2):162-167.
36. Ingvarsson PK: Multilocus patterns of nucleotide polymorphism and the demographic history of Populus tremula. Genetics 2008, 180(1):329-340.
37. Ingvarsson PK, Garcia MV, Luquez V, Hall D, Jansson S: Nucleotide polymorphism and phenotypic associations within and around the phytochrome B2 Locus in European aspen (Populus tremula, Salicaceae). Genetics 2008, 178(4):2217-2226.
38. Stettler RF, Bradshaw HD, Heilman PE, Hinckley TM: Biology of Populus and its implications for management and conservation. Ottawa, Canada: NRC Research Press; 1996.
39. Dinus RJ: Genetic improvement of poplar feedstock quality for ethanol production. Appl Biochem Biotechnol 2001, 91-93:23-34.
40. Bisoffi S, Gullberg U: Poplar breeding and selection strategies. In: Biology of Populus and its implications for management and conservation. Edited by Stettler RF, Bradshaw HD, Heilman PE, Hinckley TM. Ottawa, Canada: NRC Research Press; 1996: 139-158.
41. Stanton BJ, Johnson JD, Neale DB: Genetic improvement of hybrid poplar for the renewable fuels industry: a Pacific Northwest perspective. In: Biofuels, bioenergy, and bioproducts from sustainable agricultural and forest crops: proceedings of the short rotation crops international conference: 2008; Bloomington, MN: U.S. Department of Agriculture, Forest Service. Gen. Tech. Rep. NRS-P-31; 2008: 56.
42. Karp A, Shield I: Bioenergy from plants and the sustainable yield challenge. New Phytol 2008, 179(1):15-32.
43. Grattapaglia D, Plomion C, Kirst M, Sederoff RR: Genomics of growth traits in forest trees. Curr Opin Plant Biol 2009, 12(2):148-156.
44. Rubin EM: Genomics of cellulosic biofuels. Nature 2008, 454(7206):841-845.
45. Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 1977, 74(12):5463-5467.
46. Green ED: Strategies for the systematic sequencing of complex genomes. Nat Rev Genet 2001, 2(8):573-583.
47. Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF et al: Complementary DNA sequencing: expressed sequence tags and human genome project. Science 1991, 252(5013):1651-1656.
48. Martin KJ, Pardee AB: Identifying expressed genes. Proc Natl Acad Sci U S A 2000, 97(8):3789-3791.
49. Ralph SG, Chun HJ, Cooper D, Kirkpatrick R, Kolosova N, Gunter L, Tuskan GA, Douglas CJ, Holt RA, Jones SJ et al: Analysis of 4,664 high-quality sequence-finished poplar full-length cDNA clones and their utility for the discovery of genes responding to insect feeding. BMC Genomics 2008, 9:57.
50. Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V, Ruff TG, Milligan SB, Lamb JR, Cavet G et al: Genetics of gene expression surveyed in maize, mouse and man. Nature 2003, 422(6929):297-302.
51. Kirst M, Myburg AA, De Leon JPG, Kirst ME, Scott J, Sederoff R: Coordinated genetic regulation of growth and lignin revealed by quantitative trait locus analysis of cDNA microarray data in an interspecific backcross of Eucalyptus. Plant Physiol 2004, 135:2368-2378.
52. Morreel K, Goeminne G, Storme V, Sterck L, Ralph J, Coppieters W, Breyne P, Steenackers M, Georges M, Messens E et al: Genetical metabolomics of flavonoid biosynthesis in Populus: a case study. Plant J 2006, 47(2):224-237.
53. Mehrabian M, Allayee H, Stockton J, Lum PY, Drake TA, Castellani LW, Suh M, Armour C, Edwards S, Lamb J et al: Integrating genotypic and expression data in a segregating mouse population to identify 5-lipoxygenase as a susceptibility gene for obesity and bone traits. Nature Genetics 2005, 37(11):1224-1233.
54. Meng H, Vera I, Che N, Wang X, Wang SS, Ingram-Drake L, Schadt EE, Drake TA, Lusis AJ: Identification of Abcc6 as the major causal gene for dystrophic cardiac calcification in mice through integrative genomics. Proc Natl Acad Sci U S A 2007, 104(11):4530-4535.
55. Hagg S, Skogsberg J, Lundstrom J, Noori P, Nilsson R, Zhong H, Maleki S, Shang MM, Brinne B, Bradshaw M et al: Multi-organ expression profiling uncovers a gene module in coronary artery disease involving transendothelial migration of leukocytes and LIM domain binding 2: the Stockholm Atherosclerosis Gene Expression (STAGE) study. PLoS Genet 2009, 5(12):e1000754.
56. Yang X, Deignan JL, Qi H, Zhu J, Qian S, Zhong J, Torosyan G, Majid S, Falkard B, Kleinhanz RR et al: Validation of candidate causal genes for obesity that affect shared metabolic pathways and networks. Nat Genet 2009, 41(4):415-423.
57. Coller HA, Kruglyak L: Genetics. It's the sequence, stupid! Science 2008, 322(5900):380-381.
58. Wilson MD, Barbosa-Morais NL, Schmidt D, Conboy CM, Vanes L, Tybulewicz VL, Fisher EM, Tavare S, Odom DT: Species-specific transcription in mice carrying human chromosome 21. Science 2008, 322(5900):434-438.
59. Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al: The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 2006, 313(5793):1596-1604.
60. Kumar LS: DNA markers in plant improvement: an overview. Biotechnol Adv 1999, 17(2-3):143-182.
61. Brondani RPV, Brondani C, Grattapaglia D: Towards a genus-wide reference linkage map for Eucalyptus based exclusively on highly informative microsatellite markers. Molecular Genetics and Genomics 2002, 267(3):338-347.
62. Gupta PK, Rustgi S: Molecular markers from the transcribed/expressed region of the genome in higher plants. Funct Integr Genomics 2004, 4(3):139-162.
63. Yin TM, Zhang XY, Gunter LE, Li SX, Wullschleger SD, Huang MR, Tuskan GA: Microsatellite primer resource for Populus developed from the mapped sequence scaffolds of the Nisqually-1 genome. New Phytol 2009, 181(2):498-503.
64. Ceresini PC, Silva C, Missio RF, Souza EC, Fischer CN, Guillherme IR, Gregorio I, da Silva EHT, Cicarelli RMB, da Silva MTA et al: Satellyptus: Analysis and database of microsatellites from ESTs of Eucalyptus. Genetics and Molecular Biology 2005, 28(3):589-600.
65. Rabello E, de Souza AN, Saito D, Tsai SM: In silico characterization of microsatellites in Eucalyptus spp.: Abundance, length variation and transposon associations. Genetics and Molecular Biology 2005, 28(3):582-588.
66. Brondani RPV, Williams ER, Brondani C, Grattapaglia D: A microsatellite-based consensus linkage map for species of Eucalyptus and a novel set of 230 microsatellite markers for the genus. BMC Plant Biology 2006, 6.
67. Brondani RPV, Brondani C, Tarchini R, Grattapaglia D: Development, characterization and mapping of microsatellite markers in Eucalyptus grandis and E. urophylla. Theoretical and Applied Genetics 1998, 97(5-6):816-827.
68. Steane DA, Jones RC, Vaillancourt RE: A set of chloroplast microsatellite primers for Eucalyptus (Myrtaceae). Molecular Ecology Notes 2005, 5(3):538-541.
69. Ottewell KM, Donnellan SC, Moran GF, Paton DC: Multiplexed microsatellite markers for the genetic analysis of Eucalyptus leucoxylon (Myrtaceae) and their utility for ecological and breeding studies in other Eucalyptus species. Journal of Heredity 2005, 96(4):445-451.
70. Glaubitz JC, Emebiri LC, Moran GF: Dinucleotide microsatellites from Eucalyptus sieberi: inheritance, diversity, and improved scoring of single-base differences. Genome 2001, 44(6):1041-1045.
71. Steane DA, Vaillancourt RE, Russell J, Powell W, Marshall D, Potts BM: Development and characterisation of microsatellite loci in Eucalyptus globulus (Myrtaceae). Silvae Genetica 2001, 50(2):89-91.
72. Van Der Nest MA, Steenkamp ET, Wingfield BD, Wingfield MJ: Development of simple sequence repeat (SSR) markers in Eucalyptus from amplified inter-simple sequence repeats (ISSR). Plant Breeding 2000, 119(5):433-436.
73. Byrne M, Marquez-Garcia MI, Uren T, Smith DS, Moran GF: Conservation and genetic diversity of microsatellite loci in the genus Eucalyptus. Australian Journal of Botany 1996, 44(3):331-341.
74. Yasodha R, Sumathi R, Chezhian P, Kavitha S, Ghosh M: Eucalyptus microsatellites mined in silico: survey and evaluation. Journal of Genetics 2008, 87(1):21-25.
75. Rafalski A: Applications of single nucleotide polymorphisms in crop genetics. Curr Opin Plant Biol 2002, 5(2):94-100.
76. Novaes E, Drost DR, Farmerie WG, Pappas GJ, Jr., Grattapaglia D, Sederoff RR, Kirst M: High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics 2008, 9:312.
77. Kulheim C, Yeoh SH, Maintz J, Foley WJ, Moran GF: Comparative SNP diversity among four Eucalyptus species for genes from secondary metabolite biosynthetic pathways. BMC Genomics 2009, 10:452.
78. Heuertz M, De Paoli E, Kallman T, Larsson H, Jurman I, Morgante M, Lascoux M, Gyllenstrand N: Multilocus patterns of nucleotide diversity, linkage disequilibrium and demographic history of Norway spruce [Picea abies (L.) Karst]. Genetics 2006, 174(4):2095-2105.
79. Krutovsky KV, Neale DB: Nucleotide diversity and linkage disequilibrium in cold-hardiness and wood quality-related candidate genes in Douglas fir. Genetics 2005, 171(4):2029-2041.
80. Hamrick JL, Godt MJW: Effects of life history traits on genetic diversity in plant species. Philosophical Transactions of the Royal Society of London Series B Biological Sciences 1996, 351(1345):1291-1298.
81. Perkel J: SNP genotyping: six technologies that keyed a revolution. Nature Methods 2008, 5(5):447-453.
82. Matsuzaki H, Dong S, Loi H, Di X, Liu G, Hubbell E, Law J, Berntsen T, Chadha M, Hui H et al: Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods 2004, 1(2):109-111.
83. Blow N: Genomics: catch me if you can. Nature Methods 2009, 6(7):539-542.
84. Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C et al: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 2009, 27(2):182-189.
85. Peters JL, Cnudde F, Gerats T: Forward genetics and map-based cloning approaches. Trends Plant Sci 2003, 8(10):484-491.
86. Weigel D, Ahn JH, Blazquez MA, Borevitz JO, Christensen SK, Fankhauser C, Ferrandiz C, Kardailsky I, Malancharuvil EJ, Neff MM et al: Activation tagging in Arabidopsis. Plant Physiol 2000, 122(4):1003-1013.
87. Alonso JM, Stepanova AN, Leisse TJ, Kim CJ, Chen H, Shinn P, Stevenson DK, Zimmerman J, Barajas P, Cheuk R et al: Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 2003, 301(5633):653-657.
88. Till BJ, Reynolds SH, Greene EA, Codomo CA, Enns LC, Johnson JE, Burtner C, Odden AR, Young K, Taylor NE et al: Large-scale discovery of induced point mutations with high-throughput TILLING. Genome Res 2003, 13(3):524-530.
89. Alonso JM, Ecker JR: Moving forward in reverse: genetic technologies to enable genome-wide phenomic screens in Arabidopsis. Nat Rev Genet 2006, 7(7):524-536.
90. Jander G, Norris SR, Rounsley SD, Bush DF, Levin IM, Last RL: Arabidopsis map-based cloning in the post-genome era. Plant Physiol 2002, 129(2):440-450.
91. McCarty DR, Settles AM, Suzuki M, Tan BC, Latshaw S, Porch T, Robin K, Baier J, Avigne W, Lai J et al: Steady-state transposon mutagenesis in inbred maize. Plant J 2005, 44(1):52-61.
92. Droc G, Ruiz M, Larmande P, Pereira A, Piffanelli P, Morel JB, Dievart A, Courtois B, Guiderdoni E, Perin C: OryGenesDB: a database for rice reverse genetics. Nucleic Acids Res 2006, 34(Database issue):D736-740.
93. An X, Wang D, Zhang Z, Li S, He C: Cloning and RNAi construction of a LEAFY homologous gene from Populus tomentosa and preliminary study in tobacco. Forestry Studies in China 2005, 7(3):15-21.
94. Behnke K, Kleist E, Uerlings R, Wildt J, Rennenberg H, Schnitzler JP: RNAi-mediated suppression of isoprene biosynthesis in hybrid poplar impacts ozone tolerance. Tree Physiology 2009, 29(5):725-736.
95. Coleman HD, Park JY, Nair R, Chapple C, Mansfield SD: RNAi-mediated suppression of p-coumaroyl-CoA 3'-hydroxylase in hybrid poplar impacts lignin deposition and soluble secondary metabolism. Proceedings of the National Academy of Sciences of the United States of America 2008, 105(11):4501-4506.
96. Li JY, Brunner AM, Shevchenko O, Meilan R, Ma C, Skinner JS, Strauss SH: Efficient and stable transgene suppression via RNAi in field-grown poplars. Transgenic Research 2008, 17(4):679-694.
97. Meyer S, Nowak K, Sharma VK, Schulze J, Mendel RR, Hansch R: Vectors for RNAi technology in poplar. Plant Biology 2004, 6(1):100-103.
98. Hu WJ, Harding SA, Lung J, Popko JL, Ralph J, Stokke DD, Tsai CJ, Chiang VL: Repression of lignin biosynthesis promotes cellulose accumulation and growth in transgenic trees. Nature Biotechnol 1999, 17(8):808-812.
99. Li L, Zhou Y, Cheng X, Sun J, Marita JM, Ralph J, Chiang VL: Combinatorial modification of multiple lignin traits in trees through multigene cotransformation. Proc Natl Acad Sci U S A 2003, 100(8):4939-4944.
100. Busov V, Meilan R, Pearce DW, Rood SB, Ma C, Tschaplinski TJ, Strauss SH: Transgenic modification of gai or rgl1 causes dwarfing and alters gibberellins, root growth, and metabolite profiles in Populus. Planta 2006, 224(2):288-299.
101. Stewart JJ, Akiyama T, Chapple C, Ralph J, Mansfield SD: The effects on lignin structure of overexpression of ferulate 5-hydroxylase in hybrid poplar. Plant Physiol 2009, 150(2):621-635.
102. Busov V, Fladung M, Groover A, Strauss S: Insertional mutagenesis in Populus: relevance and feasibility. Tree Genet Gen 2005, 1(4):135-142.
103. Busov V, Strauss SH: Gene discovery in Populus using activation tagging. In Vitro Cell Dev-An 2007, 43:S22-S22.
104. Busov VB, Brunner AM, Meilan R, Filichkin S, Ganio L, Gandhi S, Strauss SH: Genetic transformation: a powerful tool for dissection of adaptive traits in trees. New Phytologist 2005, 167(1):9-18.
105. Busov VB, Meilan R, Pearce DW, Ma C, Rood SB, Strauss SH: Activation tagging of a dominant gibberellin catabolism gene (GA 2-oxidase) from poplar that regulates tree stature. Plant Physiol 2003, 132(3):1283-1291.
106. Clough SJ, Bent AF: Floral dip: a simplified method for Agrobacterium-mediated transformation of Arabidopsis thaliana. Plant J 1998, 16(6):735-743.
107. Byrne M, Murrell JC, Owen JV, Kriedemann P, Williams ER, Moran GF: Identification and mode of action of quantitative trait loci affecting seedling height and leaf area in Eucalyptus nitens. Theoretical and Applied Genetics 1997, 94(5):674-681.
108. Byrne M, Murrell JC, Owen JV, Williams ER, Moran GF: Mapping of quantitative trait loci influencing frost tolerance in Eucalyptus nitens. Theoretical and Applied Genetics 1997, 95(5-6):975-979.
109. Freeman JS, Whittock SP, Potts BM, Vaillancourt RE: QTL influencing growth and wood properties in Eucalyptus globulus. Tree Genetics and Genomes 2009, 5(4):713-722.
110. Grattapaglia D, Bertolucci FL, Sederoff RR: Genetic mapping of QTLs controlling vegetative propagation in Eucalyptus grandis and E. urophylla using a pseudo-testcross strategy and RAPD markers. Theoretical and Applied Genetics 1995, 90(7-8):933-947.
111. Grattapaglia D, Bertolucci FLG, Penchel R, Sederoff RR: Genetic mapping of quantitative trait loci controlling growth and wood quality traits in Eucalyptus grandis using a maternal half-sib family and RAPD markers. Genetics 1996, 144(3):1205-1214.
112. Li Y, Chaparro J, Russell H, Gibbings M, Smail G, Duong H, Jones M, Mitchelson K, Teasdale R: Detection of wood density QTL [quantitative trait loci] in eucalyptus hybrids. Proceedings of the Second International Wood Biotechnology Symposium, Canberra, Australia, 10-12 March, 1997 1999:219-224.
113. Missiaggia AA, Piacezzi AL, Grattapaglia D: Genetic mapping of Eef1, a major effect QTL for early flowering in Eucalyptus grandis. Tree Genetics & Genomes 2005, 1(2):79-84.
114. Rocha RB, Barros EG, Cruz CD, Rosado AM, Araujo EFd: Mapping of QTLs related with wood quality and developmental characteristics in hybrids (Eucalyptus grandis x Eucalyptus urophylla). Revista Arvore 2007, 31(1):13-24.
115. Thamarus K, Groom K, Bradley A, Raymond CA, Schimleck LR, Williams ER, Moran GF: Identification of quantitative trait loci for wood and fibre properties in two full-sib pedigrees of Eucalyptus globulus. Theoretical and Applied Genetics 2004, 109(4):856-864.
116. Thumma BR, Southerton SG, Bell JC, Owen JV, Henery ML, Moran GF: Quantitative trait locus (QTL) analysis of wood quality traits in Eucalyptus nitens. Tree Genetics & Genomes 2010, 6(2):305-317.
117. Verhaegen D, Plomion C, Gion JM, Poitel M, Costa P, Kremer A: Quantitative trait dissection analysis in Eucalyptus using RAPD markers. 1. Detection of QTL in interspecific hybrid progeny, stability of QTL expression across different ages. Theoretical and Applied Genetics 1997, 95(4):597-608.
118. Brendel O, Pot D, Plomion C, Rozenberg P, Guehl JM: Genetic parameters and QTL analysis of delta C-13 and ring width in maritime pine. Plant Cell and Environment 2002, 25(8):945-953.
119. Brown GR, Bassoni DL, Gill GP, Fontana JR, Wheeler NC, Megraw RA, Davis MF, Sewell MM, Tuskan GA, Neale DB: Identification of quantitative trait loci influencing wood property traits in loblolly pine (Pinus taeda L.). III. QTL verification and candidate gene mapping. Genetics 2003, 164(4):1537-1546.
120. Devey ME, Carson SD, Nolan MF, Matheson AC, Riini CT, Hohepa J: QTL associations for density and diameter in Pinus radiata and the potential for marker-aided selection. Theoretical and Applied Genetics 2004, 108(3):516-524.
121. Devey ME, Groom KA, Nolan MF, Bell JC, Dudzinski MJ, Old KM, Matheson AC, Moran GF: Detection and verification of quantitative trait loci for resistance to Dothistroma needle blight in Pinus radiata. Theor Appl Genet 2004, 108(6):1056-1063.
122. Emebiri LC, Devey ME, Matheson AC, Slee MU: Age-related changes in the expression of QTLs for growth in radiata pine seedlings. Theoretical and Applied Genetics 1998, 97(7):1053-1061.
123. Kaya Z, Sewell MM, Neale DB: Identification of quantitative trait loci influencing annual height and diameter increment growth in loblolly pine (Pinus taeda L.). Theoretical and Applied Genetics 1999, 98(3-4):586-592.
124. Li C, Yeh FC: QTLs for western gall rust (Endocronartium harknessii) resistance in lodgepole pine (P. contorta spp. latifolia). Forest Genetics 2002, 9(2):137-144.
125. Markussen T, Fladung M, Achere V, Favre JM, Faivre-Rampant P, Aragones A, Perez DD, Harvengt L, Espinel S, Ritter E: Identification of QTLs controlling growth, chemical and physical wood property traits in Pinus pinaster (Ait.). Silvae Genetica 2003, 52(1):8-15.
126. Pot D, Rodrigues JC, Rozenberg P, Chantre G, Tibbits J, Cahalan C, Pichavant F, Plomion C: QTLs and candidate genes for wood properties in maritime pine (Pinus pinaster Ait.). Tree Genetics & Genomes 2006, 2(1):10-24.
127. Sewell MM, Bassoni DL, Megraw RA, Wheeler NC, Neale DB: Identification of QTLs influencing wood property traits in loblolly pine (Pinus taeda L.). I. Physical wood properties. Theoretical and Applied Genetics 2000, 101(8):1273-1281.
128. Sewell MM, Davis MF, Tuskan GA, Wheeler NC, Elam CC, Bassoni DL, Neale DB: Identification of QTLs influencing wood property traits in loblolly pine (Pinus taeda L.). II. Chemical wood properties. Theor Appl Genet 2002, 104(2-3):214-222.
129. Shepherd M, Cross M, Dieters MJ, Henry R: Branch architecture QTL for Pinus elliottii var. elliottii x Pinus caribaea var. hondurensis hybrids. Annals of Forest Science 2002, 59(5-6):617-625.
130. Bradshaw HD, Stettler RF: Molecular genetics of growth and development in Populus. 4. Mapping QTLs with large effects on growth, form, and phenology traits in a forest tree. Genetics 1995, 139(2):963-973.
131. Ferris R, Long L, Bunn SM, Robinson KM, Bradshaw HD, Rae AM, Taylor G: Leaf stomatal and epidermal cell development: identification of putative quantitative trait loci in relation to elevated carbon dioxide concentration in poplar. Tree Physiology 2002, 22(9):633-640.
132. Frewen BE, Chen THH, Howe GT, Davis J, Rohde A, Boerjan W, Bradshaw HD: Quantitative trait loci and candidate gene mapping of bud set and bud flush in Populus. Genetics 2000, 154(2):837-845.
133. Jorge V, Dowkiw A, Faivre-Rampant P, Bastien C: Genetic architecture of qualitative and quantitative Melampsora larici-populina leaf rust resistance in hybrid poplar: genetic mapping and QTL detection. New Phytologist 2005, 167(1):113-127.
134. Rae AM, Pinel MPC, Bastien C, Sabatti M, Street NR, Tucker J, Dixon C, Marron N, Dillen SY, Taylor G: QTL for yield in bioenergy Populus: identifying GxE interactions from growth at three contrasting sites. Tree Genetics & Genomes 2008, 4(1):97-112.
135. Rae AM, Street NR, Robinson KM, Harris N, Taylor G: Five QTL hotspots for yield in short rotation coppice bioenergy poplar: the Poplar Biomass Loci. BMC Plant Biology 2009, 9(23):(26 February 2009).
136. Rae AM, Tricker PJ, Bunn SM, Taylor G: Adaptation of tree growth to elevated CO2: quantitative trait loci for biomass in Populus. New Phytologist 2007, 175(1):59-69.
137. Wu R, Bradshaw HD, Stettler RF: Molecular genetics of growth and development in Populus (Salicaceae). 5. Mapping quantitative trait loci affecting leaf variation. American Journal of Botany 1997, 84(2):143-153.
138. Wu RL: Genetic mapping of QTLs affecting tree growth and architecture in Populus: implication for ideotype breeding. Theoretical and Applied Genetics 1998, 96(3-4):447-457.
139. Wu RL, Ma CX, Yang MCK, Chang M, Littell RC, Santra U, Wu SS, Yin TM, Huang MR, Wang MX et al: Quantitative trait loci for growth trajectories in Populus. Genetical Research 2003, 81(1):51-64.
140. Zhang B, Tong C, Yin T, Zhang X, Zhuge Q, Huang M, Wang M, Wu R: Detection of quantitative trait loci influencing growth trajectories of adventitious roots in Populus using functional mapping. Tree Genetics & Genomes 2009, 5(3).
141. Zhang D, Zhang Z, Yang K: QTL analysis of growth and wood chemical content traits in an interspecific backcross family of white poplar (Populus tomentosa x P. bolleana) x P. tomentosa. Canadian Journal of Forest Research-Revue Canadienne De Recherche Forestiere 2006, 36(8):2015-2023.
142. Zhang D, Zhang Z, Yang K, Li B: QTL analysis of leaf morphology and spring bud flush in (Populus tomentosa x P. bolleana) x P. tomentosa. Scientia Silvae Sinicae 2005, 41(1):42-48.
143. Neale DB, Savolainen O: Association genetics of complex traits in conifers. Trends Plant Sci 2004, 9(7):325-330.
144. Doebley JF, Gaut BS, Smith BD: The molecular genetics of crop domestication. Cell 2006, 127(7):1309-1321.
145. Morgante M, Salamini F: From plant genomics to breeding practice. Current Opinion in Biotechnology 2003, 14(2):214-219.
146. Paran I, Zamir D: Quantitative traits in plants: beyond the QTL. Trends Genet 2003, 19(6):303-306.
147. Yu JM, Buckler ES: Genetic association mapping and genome organization of maize. Current Opinion in Biotechnology 2006, 17(2):155-160.
148. Thumma BR, Nolan MR, Evans R, Moran GF: Polymorphisms in cinnamoyl CoA reductase (CCR) are associated with variation in microfibril angle in Eucalyptus spp. Genetics 2005, 171(3):1257-1265.
129 149. Ingvarsson PK, Garcia MV, Luquez V, Hall D, Jansson S: Nucleotide polymoirphism and phenotypic associations within and around the phytochrome B2 locus in European aspen (Populus tremula, Salicaceae) Genetics 2008, 178(4) :22172226. 150. Eckert AJ, Bower AD, Wegrzyn JL, Pande B, Jermstad KD, Krutovsky KV, St Clair JB, Neale DB: Association ge netics of coastal Douglas fir (Pseudotsuga menziesii var. menziesii, Pinaceae). I. Cold hardiness related traits Genetics 2009, 182(4) :12891302. 151. Gonzalez Martinez SC, Huber D, Ersoz E, Davis JM, Neale DB: Association genetics in Pinus taeda L. II. C arbon isotope discrimination Heredity 2008, 101(1) :1926. 152. Gonzalez Martinez SC, Wheeler NC, Ersoz E, Nelson CD, Neale DB: Association genetics in Pinus taeda L. I. Wood property traits Genetics 2007, 175(1) :399409. 153. Jansen RC, Nap JP: Genetical genomics: the added value from segregation Trends Genet 2001, 17(7) :388 391. 154. Keurentjes JJB, Fu JY, Terpstra IR, Garcia JM, van den Ackerveken G, Snoek LB, Peeters AJM, Vreugdenhil D, Koornneef M, Jansen RC: Regulatory network construction in Arabidopsis by using genome wide gene expression quantitative trait loci Proc Natl Acad Sci USA 2007, 104(5) :17081713. 155. Keurentjes JJB, Fu JY, de Vos CHR, Lommen A, Hall RD, Bino RJ, van der Plas LHW, Jansen RC, Vreugdenhil D, Koornneef M: The geneti cs of plant metabolism Nature Genetics 2006, 38(7) :842849. 156. Chen X, Hackett CA, Niks RE, Hedley PE, Booth C, Druka A, Marcel TC, Vels A, Bayer M, Milne I et al : An eQTL Analysis of Partial Resistance to Puccinia hordei in Barley PLoS One 2010, 5(1) : Article No.: e8598. 157. Luo ZW, Potokina E, Druka A, Wise R, Waugh R, Kearsey MJ: SFP genotyping from Affymetrix arrays is robust but largely detects cis acting expression regulators Genetics 2007, 176(2) :789800. 158. 
Shi C, Uzarowska A, Ouzunova M, Landbeck M, Wenzel G, Lubberstedt T: Identification of candidate genes associated with cell wall digestibility and eQTL (expression quantitative trait loci) analysis in a Flint Flint maize recombinant inbred line population. BMC Genomics 2007, 8(22) :(18 January 2007). 159. Kirst M, Basten CJ, Myburg AA, Zeng ZB, Sederoff RR: Genetic architecture of transcript level variation in differentiating xylem of a eucalyptus hybrid Genetics 2005, 169(4) :22952303. 160. Street NR, Skogstrom O, Sjodin A, Tucker J, Rodr iguez Acosta M, Nilsson P, Jansson S, Taylor G: The genetics and genomics of the drought response in Populus Plant J 2006, 48(3) :321341.
130 161. Keurentjes JJ: Genetical metabolomics: closing in on phenotypes Curr Opin Plant Biol 2009, 12(2) :223230. 162. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z et al : Genome sequencing in microfabricated high density picolitre reactors Nature 2005, 437(7057) :376380. 163. Ohtsu K, Smith MB, Emrich SJ, Borsuk LA, Zhou R, Chen T, Zhang X, Timmermans MC, Beck J, Buckner B et al : Global gene expression analysis of the shoot apical meristem of maize (Zea mays L.) Plant J 2007, 52(3) :391404. 164. Cheung F, Haas BJ, Goldberg SM, May GD, Xiao Y, Town CD: Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology BMC Genomics 2006, 7:272. 165. Jones Rhoades MW, Borevitz JO, Preuss D: Genome wide expression profiling of the Arabidopsis female gametophyte identifies families of small, secreted proteins PLoS Genet 2007, 3(10) :1848 1861. 166. Weber AP, Weber KL, Carr K, Wilkerson C, Ohlrogge JB: Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing Plant Physiol 2007, 144(1) :3242. 167. Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS: SNP discovery via 454 transcriptome sequencing Plant J 2007, 51(5) :910918. 168. Meyer M, Stenzel U, Myles S, Prufer K, Hofreiter M: Targeted high throughput sequencing of tagged nucleic acid samples Nucleic Acids Res 2007, 35(1 5) :e97. 169. Parameswaran P, Jalili R, Tao L, Shokralla S, Gharizadeh B, Ronaghi M, Fire AZ: A pyrosequencing tailored nucleotide barcode design unveils opportunities for largescale sample multiplexing Nucleic Acids Research 2007, 35(19) 170. FAO: Globa l forest resources assessment 2000 Main report In: FAO Forestry paper 140. 2000. 171. DOE Joint Genome Institute Announces 2008 Genome Sequencing Targets. [ http://www.jgi.doe.gov/News/news_6_8_07.html ] 172. 
The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana Nature 2000, 408(6814) :796815. 173. Chang S, Puryear J, Cairney J: A simple and efficient method for isolating RNA from pine trees Plant Mol Biol Rep 1993, 11:117121. 174. Hartl DL, Clark AG: Molecular population genetics. In: Principles of population genetics. 4th edn. SunderlandMA: Sinauer Associates, Inc.; 2007: 338342.
131 175. Moore MJ, Bell CD, Soltis PS, Soltis DE: Using plast id genome scale data to resolve enigmatic relationships among basal angiosperms Proc Natl Acad Sci U S A 2007, 104(49) :1936319368. 176. Nei M: Selectionism and neutralism in molecular evolution Mol Biol Evol 2005, 22(12) :23182342. 177. Roth C, Liberles DA: A systematic search for positive selection in higher plants (Embryophytes) BMC Plant Biol 2006, 6:12. 178. Watterson GA: On the number of segregating sites in genetical models without recombination Theor Popul Biol 1975, 7(2) :256276. 179. Kirst M, Marques CM, Sederoff RR: Nucleotide diversity and linkage disequilibrium in three Eucalyptus globulus genes. In: IUFRO Tree Biotechnology Conference: 2005; Pretoria, South Africa; 2005: Section 5, P. 28. 180. Santos SN: Genes de lignifica o em Eucalyptus: estrutura e diversidade gentica dos genes 4cl e ccoaomt. Braslia: Universidade Catlica de Braslia; 2005. 181. Brown GR, Gill GP, Kuntz RJ, Langley CH, Neale DB: Nucleotide diversity and linkage disequilibrium in loblolly pine Proc Nat l Acad Sci USA 2004, 101(42) :1525515260. 182. Gonzalez Martinez SC, Ersoz E, Brown GR, Wheeler NC, Neale DB: DNA sequence variation and selection of tag SNPs at candidate genes for drought stress response in Pinus taeda L Genetics 2006, 172:19151926. 183. Ingvarsson PK: Nucleotide polymorphism and linkage disequilbrium within and among natural populations of European Aspen (Populus tremula L., Salicaceae) Genetics 2005, 169(2) :945953. 184. Ma XF, Szmidt AE, Wang XR: Genetic structure and evolutionary h istory of a diploid hybrid pine Pinus densata inferred from the nucleotide variation at seven gene loci Mol Biol Evol 2006, 23(4) :807816. 185. Bergelson J, Kreitman M, Stahl EA, Tian D: Evolutionary dynamics of plant R genes Science 2001, 292(5525) :22812285. 186. 
Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA et al : Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana Science 2007, 317(5836) :338342. 187. Barrier M, Bustamante CD, Yu J, Purugganan MD: Selection on rapidly evolving proteins in the Arabidopsis genome Genetics 2003, 163(2) :723733.
132 188. McIntosh KB, Bonham Smith PC: Establishment of Arabidopsis thaliana ribosomal protein RPL23A 1 as a functional homologue of Saccharomyces cerevisiae ribosomal protein L25. Plant Mol Biol 2001, 46(6) :673682. 189. Rooney AP, Ward TJ: Evolution of a large ribosomal RNA multigene family in filamentous fungi: birth and death of a concerted evolution paradigm Proc Natl Acad Sci U S A 2005, 102(14) :50845089. 190. Stage DE, Eickbush TH: Sequence variation within the rRNA gene loci of 12 Drosophila species Genome Res 2007, 17(12) :18881897. 191. Eirin Lopez JM, Gonzalez Tizon AM, Martinez A, Mendez J: Birth and death evolution with strong purifying selection in the histone H1 multigene family and the origin of orphon H1 genes Mol Biol Evol 2004, 21(10) :19922003. 192. Matsuo Y, Yamazaki T: Nucleotide variation and divergence in the histone multigene family in Drosoph ila melanogaster Genetics 1989, 122(1) :8797. 193. Rooney AP, Piontkivska H, Nei M: Molecular evolution of the nontandemly repeated genes of the histone 3 multigene family Mol Biol Evol 2002, 19(1) :6875. 194. Cork JM, Purugganan MD: High diversity genes in the Arabidopsis genome Genetics 2005, 170(4) :18971911. 195. Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes Science 2000, 290(5494) :11511155. 196. Fluhr R: Sentinels of disease. Plant resistance genes Plant Physiol 2001, 127(4) :13671374. 197. Bustamante CD, Fledel Alon A, Williamson S, Nielsen R, Hubisz MT, Glanowski S, Tanenbaum DM, White TJ, Sninsky JJ, Hernandez RD et al : Natural selection on protein coding genes in the human genome Nature 2005, 437(7062) :11531157. 198. Canadell JG, Raupach MR: Managing forests for climate change mitigation Science 2008, 320(5882) :14561457. 199. Plomion C, Leprovost G, Stokes A: Wood formation in trees Plant Physiol 2001, 127(4) :15131523. 200. Chen F, Dixon RA: Lignin modification improves fermentable sugar yields for biofuel production. Nat Biotechnol 2007, 25(7) :759761. 
201. Hu WJ, Harding SA, Lung J, Popko JL, Ralph J, Stokke DD, Tsai CJ, Chiang VL: Repression of lignin biosynthesis promotes cellulose accumulation and growth in transgenic trees Nat Biotechnol 1999, 17(8) :808 812.
133 202. Wu RL, Remington DL, MacKay JJ, McKeand SE, O'Malley DM: Average effect of a mutation in lignin biosynthesis in loblolly pine Theoretical and Applied Genetics 1999, 99(34) :705710. 203. Yu Q, Li B, Nelson CD, McKeand SE, Batista VB, Mullin TJ: Association of the cad n1 allele with increased stem growth and wood density in full sib families of loblolly pine Tree Genetics & Genomes 2006, 2(2):98108. 204. Bradshaw HD, Jr., Stettler RF : Molecular genetics of growth and development in populus. IV. Mapping QTLs with large effects on growth, form, and phenology traits in a forest tree. Genetics 1995, 139(2) :963 973. 205. Wu R, Bradshaw HD, Stettler RF: Developmental quantitative genetics o f growth in Populus Theoretical and Applied Genetics 1998, 97(7) :11101119. 206. Wullschleger S, Yin TM, DiFazio SP, Tschaplinski TJ, Gunter LE, Davis MF, Tuskan GA: Phenotypic variation in growth and biomass distribution for two advancedgeneration pedig rees of hybrid poplar Canadian Journal of Forest ResearchRevue Canadienne De Recherche Forestiere 2005, 35(8) :17791789. 207. Rae AM, Tricker PJ, Bunn SM, Taylor G: Adaptation of tree growth to elevated CO2: quantitative trait loci for biomass in Populus New Phytol 2007, 175(1) :59 69. 208. Finzi AC, Norby RJ, Calfapietra C, GalletBudynek A, Gielen B, Holmes WE, Hoosbeek MR, Iversen CM, Jackson RB, Kubiske ME et al : Increases in nitrogen uptake rather than nitrogen use efficiency support higher rates of temperate forest productivity under elevated CO2 Proc Natl Acad Sci U S A 2007, 104(35) :1401414019. 209. Oren R, Ellsworth DS, Johnsen KH, Phillips N, Ewers BE, Maier C, Schafer KV, McCarthy H, Hendrey G, McNulty SG et al : Soil fertility limits carbon sequestration by forest ecosystems in a CO2enriched atmosphere. Nature 2001, 411(6836) :469472. 210. Cooke JE, Martin TA, Davis JM: Short term physiological and developmental responses to nitrogen availability in hybrid poplar New Phytol 2005, 167(1) :4152. 211. 
Pitre FE, Cooke JEK, Mackay JJ: Short term effects of nitrogen availability on wood formation and fibre properties in hybrid poplar Trees Structure and Function 2007, 21(2) :249259. 212. Pitre FE, Pollet B, Lafarguette F, Cooke JE, MacKay JJ, Lapierre C: Effects of increased nitrogen supply on the lignification of poplar wood J Agric Food Chem 2007, 55(25) :1030610314. 213. Cooke JEK, Brown KA, Wu R, Davis JM: Gene expression associated with N induced shifts in resource allocation in poplar Plan t Cell and Environment 2003, 26(5) :757770.
134 214. Hocking D: Preparation and use of a nutrient solution for culturing seedlings of lodgepole Pine and White Spruce, with selected bibliography. In: Northern Forest Research Centre Information Report Nor X L. E dmonton, Canada.: Canadian Forest Service, Department of the Environment.; 1971. 215. Browning BL: Methods of wood chemistry. New York: Interscience Publishers; 1967. 216. Evans RJ, Milne TA: Molecular characterization of the pyrolysis of biomass. 1. Funda mentals. Energy and Fuels 1987, 1(2) :123 137. 217. Sykes R, Kodrzycki B, Tuskan G, Foutz K, Davis M: Within tree variability of lignin composition in Populus Wood Science and Technology 2008, 42(8) :649661. 218. Littell RC, Milliken GA, Stroup WW, Wolfinger RD: SAS System for Mixed Models Cary, NC: SAS Institute Inc; 1996. 219. Gilmour AR, Cullis BR, Welham SJ, Thompson R: ASReml User Guide Release 1.0. Hemel Hempstead, UK: VSN International; 2002. 220. Drost D R, Novaes E, Boaventura Novaes CRD, Benedict CI, Brown RS, Yin T, Tuskan GA, Kirst M: A Microarray Based Genotyping and Genetic Mapping Approach for Highly Heterozygous Outcrossing Species Localizes a Large Fraction of the Unassembled Populus trichocarpa G enome Sequence. Plant Journal 2009, Submitted. 221. Lander ES, Green P, Abrahamson J, Barlow A, Daly MJ, Lincoln SE, Newburg L: MAPMAKER: an interactive computer package for constructing primary genetic linkage maps of experimental and natural populations Genomics 1987, 1(2) :174181. 222. Kosambi DD: The estimation of map distances from recombination values. Annu Eugen 1944, 12:172175. 223. Zeng ZB: Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci Proc Natl Acad Sci USA 1993, 90(23) :1097210976. 224. Wang SC, Basten CJ, Zeng ZB: Windows QTL Cartographer 2.5. In Raleigh, NC.: Department of Statistics, North Carolina State University.; 2007. 225. 
Churchill GA, Doerge RW: Empirical Threshold Values for Qua ntitative Trait Mapping Genetics 1994, 138(3) :963971. 226. Chang H, Sarkanen KV: Species variation in lignin effect on rate of kraft delignification TAPPI 1973, 56:132134. 227. Ibrahim L, Proe MF, Cameron AD: Main effects of nitrogen supply and droug ht stress upon whole plant carbon allocation in poplar Can J Forest Res 1997, 27(9) :14131419.
135 228. Kleiner KW, Raffa KF, Ellis DD, McCown BH: Effect of nitrogen availability on the growth and phytochemistry of hybrid poplar and the efficacy of the Bacill us thuringiensis cry1A(a) d endotoxin on gypsy moth Can J Forest Res 1998, 28(7) :10551067. 229. Pregitzer KS, Dickmann DI, Hendrick R, Nguyen PV: Whole tree carbon and nitrogen partitioning in young hybrid poplars Tree Physiol 1990, 7(1_2_3_4) :7993. 230. Sultan SE: Phenotypic plasticity for plant development, function and life history Trends Plant Sci 2000, 5(12) :537542. 231. Gutierrez RA, Stokes TL, Thum K, Xu X, Obertello M, Katari MS, Tanurdzic M, Dean A, Nero DC, McClung CR et al : Systems approach identifies an organic nitrogen responsive gene network that is regulated by the master clock control gene CCA1 Proc Natl Acad Sci U S A 2008, 105(12) :49394944. 232. Scheible WR, Morcuende R, Czechowski T, Fritz C, Osuna D, Palacios Rojas N, Schindelasch D, Thimm O, Udvardi MK, Stitt M: Genome wide reprogramming of primary and secondary metabolism, protein synthesis, cellular growth processes, and the regulatory infrastructure of Arabidopsis in response to nitrogen Plant Physiol 2004, 136(1) :24832499. 233. Koch KE: Molecular crosstalk and the regulation of C and N responsive genes. In: A molecular approach to primary metabolism in plants. Edited by Foyer C, Quick P. London: Taylor and Francis Inc.; 1997: 105124. 234. Palenchar PM, Kouranov A, Lejay LV Coruzzi GM: Genome wide patterns of carbon and nitrogen regulation of gene expression validate the combined carbon and nitrogen (CN) signaling hypothesis in plants Genome Biol 2004, 5(11) :R91. 235. Weigelt K, Kuster H, Radchuk R, Muller M, Weichert H, F ait A, Fernie AR, Saalbach I, Weber H: Increasing amino acid supply in pea embryos reveals specific interactions of N and C metabolism, and highlights the importance of mitochondrial metabolism Plant J 2008, 55(6) :909926. 236. 
Calenge F, Saliba Colombani V, Mahieu S, Loudet O, Daniel Vedele F, Krapp A: Natural variation for carbohydrate content in Arabidopsis. Interaction with complex traits dissected by quantitative genetics Plant Physiol 2006, 141(4) :16301643. 237. MacMillan K, Emrich K, Piepho HP, Mullins CE, Price AH: Assessing the importance of genotype x environment interaction for root traits in rice using a mapping population II: conventional QTL analysis Theor Appl Genet 2006, 113(5) :953964. 238. Talukder ZI, McDonald AJ, Price AH: Loci contro lling partial resistance to rice blast do not show marked QTL x environment interaction when plant nitrogen status alters disease severity New Phytol 2005, 168(2) :455464.
136 239. Rauh BL, Basten C, Buckler ES: Quantitative trait loci analysis of growth resp onse to varying nitrogen sources in Arabidopsis thaliana. Theoretical and Applied Genetics 2002, 104(5) :743750. 240. Laperche A, Brancourt Hulmel M, Heumez E, Gardet O, Hanocq E, Devienne Barret F, Le Gouis J: Using genotype x nitrogen interaction variabl es to evaluate the QTL involved in wheat tolerance to nitrogen constraints Theor Appl Genet 2007, 115(3) :399415. 241. Rae AM, Ferris R, Tallis MJ, Taylor G: Elucidating genomic regions determining enhanced leaf growth and delayed senescence in elevated C O2 Plant Cell Environ 2006, 29(9) :17301741. 242. Mehrtens F, Kranz H, Bednarek P, Weisshaar B: The Arabidopsis transcription factor MYB12 is a flavonol specific regulator of phenylpropanoid biosynthesis Plant Physiol 2005, 138(2) :10831096. 243. Jansen RC, Nap JP: Genetical genomics: the added value from segregation Trends Genet 2001, 17(7) :388 391. 244. Boerjan W, Ralph, J., Baucher, M.: Lignin Biosynthesis Annual Reviews in Plant Biology 2003, 54:519546. 245. Yuan JS, Tiller KH, Al Ahmad H, Stewart NR, Stewart CN, Jr.: Plants to power: bioenergy to fuel the future. Trends Plant Sci 2008, 13(8) :421429. 246. Bhalerao R, Nilsson O, Sandberg G: Out of the woods: forest biotechnology enters the genomic era Current Opinion in Biotechnology 2003, 14(2) :206213. 247. Chen F, Dixon RA: Lignin modification improves fermentable sugar yields for biofuel production. Nature Biotechnol 2007, 25(7) :759761. 248. Novaes E, Osorio L, Drost DR, Miles BL, Boaventura Novaes CRD, B enedict C, Dervinis C, Yu Q, Sykes R, Davis M et al : Quantitative genetic analysis of biomass and wood chemistry of Populus under different nitrogen levels New Phytologist 2009, 182(4) :878890. 249. 
MacKay JJ, O'Malley DM, Presnell T, Booker FL, Campbell MM, Whetten RW, Sederoff RR: Inheritance, gene expression, and lignin characterization in a mutant pine deficient in cinnamyl alcohol dehydrogenase Proc Natl Acad Sci U S A 1997, 94(15) :82558260. 250. Schadt EE: Molecular networks as sensors and drivers of common human diseases Nature 2009, 461(7261) :218223. 251. Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, GuhaThakurta D, Sieberts SK, Monks S, Reitman M, Zhang CS et al : An integrative genomics approach to infer causal
137 associations between gene expressi on and disease Nature Genetics 2005, 37(7) :710717. 252. Drost DR, Novaes E, Boaventura Novaes C, Benedict CI, Brown RS, Yin T, Tuskan GA, Kirst M: A microarraybased genotyping and genetic mapping approach for highly heterozygous outcrossing species enab les localization of a large fraction of the unassembled Populus trichocarpa genome sequence Plant J 2009, 58(6) :10541067. 253. Thurston MI, Field D: Msatfinder: detection and characterisation of microsatellites. Distributed by the authors at http://www.genomics.ceh.ac.uk/msatfinder/ In CEH Oxford, Mansfield Road, Oxford OX1 3SR.; 2005. 254. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias Bioinformatics 2003, 19(2) :185193. 255. Zeng ZB: Precision mapping of quantitative trait loci Genetics 1994, 136(4) :14571468. 256. West MAL, Kim K, Kliebenstein DJ, van Leeuwen H, Michelmore RW, Doe rge RW, Clair DAS: Global eQTL mapping reveals the complex genetic architecture of transcript level variation in Arabidopsis Genetics 2007, 175(3) :14411450. 257. Storey JD, Tibshirani R: Statistical significance for genomewide studies Proc Natl Acad Sci USA 2003, 100(16) :94409445. 258. Yan T, Yoo D, Berardini TZ, Mueller LA, Weems DC, Weng S, Cherry JM, Rhee SY: PatMatch: a program for finding patterns in peptide and nucleotide sequences Nucleic Acids Research 2005, 33:W262W266. 259. Higo K, Ugawa Y, Iwamoto M, Korenaga T: Plant cis acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Research 1999, 27(1) :297300. 260. Goicoechea M, Lacombe E, Legay S, Mihaljevic S, Rech P, Jauneau A, Lapierre C, Pollet B, Verhaegen D, Chaubet Gigot N e t al : EgMYB2, a new transcriptional activator from Eucalyptus xylem, regulates secondary cell wall formation and lignin biosynthesis Plant J 2005, 43(4) :553567. 261. 
Harmer SL, Hogenesch JB, Straume M, Chang HS, Han B, Zhu T, Wang X, Kreps JA, Kay SA: Or chestrated transcription of key pathways in Arabidopsis by the circadian clock Science 2000, 290(5499) :21102113. 262. Raes J, Rohde A, Christensen JH, Van de Peer Y, Boerjan W: Genome Wide Characterization of the Lignification Toolbox in Arabidopsis Pla nt Physiology 2003, 133:10511071.
138 263. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N terminal amino acid sequence J Mol Biol 2000, 300(4) :10051016. 264. Bendtsen JD, Nielsen H, von Hei jne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0 J Mol Biol 2004, 340(4) :783795. 265. Nielsen H, Engelbrecht J, Brunak S, von Heijne G: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites Protein Eng 1997, 10(1) :1 6. 266. Quesada T, Li Z, Dervinis C, Li Y, Bocock PN, Tuskan GA, Casella G, Davis JM, Kirst M: Comparative analysis of the transcriptomes of Populus trichocarpa and Arabidopsis thaliana suggests extensive evolution of gene expression regulation in angiosperms New Phytol 2008, 180(2) :408420. 267. Courtois Moreau CL, Pesquet E, Sjodin A, Muniz L, Bollhoner B, Kaneda M, Samuels L, Jansson S, Tuominen H: A unique program for cell death in xylem fibers of Populus stem Plant J 2009, 58(2) :260274. 268. Brown DM, Zeef LA, Ellis J, Goodacre R, Turner SR: Identification of novel genes in Arabidopsis involved in secondary cell wall formation using expression profiling and reverse genetics Plant Cell 2005, 17:22812295. 269. Ko JH, Beers EP, Han KH: Global comparative transcriptome analysis identifies gene network regulating secondary xylem development in Arabidopsis thaliana Mol Genet Genomics 2006, 276(6) :517531. 270. Leple JC, Dauwe R, Morreel K, Storme V, Lapierre C, Pollet B, Naumann A, Kang KY, Kim H, Ruel K et al : Downregulation of cinnamoyl coenzyme A reductase in poplar: multiple level phenotyping reveals effects on cell wall polymer metabolism and structure. Plant Cell 2007, 19(11) :36693691. 271. Hancock JE, Loya WM, Giardina CP, Li LG, Chiang VL, Pregitzer KS: Plant growth, biomass partitioning and soil carbon formation in response to altered lignin biosynthesis in Populus tremuloides New Phytologist 2007, 173(4) :732742. 272. 
Coruzzi G, Bush DR: Nitrogen and carbon nutrient and metabolite signaling in plants Plant Physiol 2001, 125(1) :6164. 273. Coruzzi GM, Zhou L: Carbon and nitrogen sensing and signaling in plants: emerging 'matrix effects' Curr Opin Plant Biol 2001, 4(3) :247253. 274. Novae s E, Osorio L, Drost DR, Miles BL, Novaes CRDB, Benedict C, Dervinis C, Yu Q, Sykes R, Davis M et al : Quantitative genetic analysis of biomass and wood chemistry of Populus under different nitrogen levels New Phytologist 2009, (in press)
139 275. Hamann T, B ennett M, Mansfield J, Somerville C: Identification of cell wall stress as a hexose dependent and osmosensitive regulator of plant responses Plant J 2009, 57(6) :10151026. 276. Sjodin A, Street NR, Sandberg G, Gustafsson P, Jansson S: The Populus Genome Integrative Explorer (PopGenIE): a new resource for exploring the Populus genome New Phytol 2009. 277. Holt RA, Jones SJ: The new paradigm of flow cell sequencing Genome Res 2008, 18(6) :839846. 278. Metzker ML: Sequencing technologies the next generat ion Nat Rev Genet 2010, 11(1) :3146.
BIOGRAPHICAL SKETCH
Evandro Novaes was born in Piracicaba (São Paulo state, Brazil) on January 10th, 1981. He was the oldest son of Carlos and Marcia Novaes, who taught him early on about the importance of education. Evandro grew up with his parents and brother, Rodrigo Novaes, mostly in Araraquara (São Paulo state, Brazil). During high school, in a biology course, Evandro became fascinated with the disciplines of genetics and evolution, and decided to pursue a career in plant biology. In 1999, Evandro returned to his original city to start studying forestry engineering at the University of São Paulo (ESALQ-USP), one of the most prestigious in Brazil. His choice of forestry was influenced by his uncle Osni Sanchez, a forestry engineer and director of the Suzano Pulp and Paper Co. in Brazil. Evandro obtained his Bachelor of Science degree in January of 2004. During his last year as an undergraduate, Evandro contacted Dr. Dario Grattapaglia, a PhD from North Carolina State University and renowned forest geneticist, asking to do an internship in his laboratory in Brasília, the capital of Brazil. Dr. Grattapaglia accepted and was very influential in Evandro's scientific career, co-supervising his master's degree in genetics and improvement at the Federal University of Viçosa (UFV, Brazil). In the city of Viçosa, Evandro met a beautiful and intelligent graduate student who had ranked two positions ahead of him in the entrance exam for the master's program in genetics and improvement. Carolina Boaventura shared with him the passion for genetics and evolution. Evandro married Carolina in December of 2005, one month after she received her Master of Science degree and one month before he received his. During his master's work in 2004, Evandro read an elegant article describing the coordinated regulation of growth and lignin in Eucalyptus. The first author was Dr. Matias Kirst, a young professor at the University of Florida.
Evandro was selected for a brief talk at the same Brazilian meeting where Dr. Kirst was one of the keynote speakers. Evandro prepared himself to make a good first impression and, after his talk, asked Dr. Kirst about the possibility of starting PhD work in his laboratory. During the summer of 2006, Evandro and Carolina moved to Gainesville, where Evandro joined the Forest Genomics group and started his PhD in Forest Resources and Conservation. After approximately one year, Carolina also joined the Forest Genomics group, where she helped with Evandro's and many other projects. Since December of 2009, Evandro and Carolina have been working on their most intensive project: the parenthood of their beloved Sarah B. Novaes.