Citation
The Amborella Genome and the Evolution of Alternative Splicing across Eudicots

Material Information

Title:
The Amborella Genome and the Evolution of Alternative Splicing across Eudicots
Creator:
Chamala, Srikar
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
1 online resource (191 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Botany
Biology
Committee Chair:
BARBAZUK,WILLIAM BRADLEY
Committee Co-Chair:
SOLTIS,DOUGLAS EDWARD
Committee Members:
SOLTIS,PAMELA S
FOLTA,KEVIN M
Graduation Date:
8/9/2014

Subjects

Subjects / Keywords:
Angiosperms ( jstor )
Genomes ( jstor )
Introns ( jstor )
Libraries ( jstor )
RNA ( jstor )
Scaffolds ( jstor )
Sequencing ( jstor )
Soybeans ( jstor )
Species ( jstor )
Transcriptome ( jstor )
Biology -- Dissertations, Academic -- UF
alternative -- amborella -- assembly -- evolution -- genome -- plants -- sequencing -- splicing
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Botany thesis, Ph.D.

Notes

Abstract:
At the start of the 21st century, sequencing the genomes of most eukaryotes was expensive and laborious. Hence, early genome projects were restricted to species of high commercial and research interest. Over the past decade, improvements to sequencing technologies have increased throughput and lowered per-base sequencing cost, enabling researchers to sequence genomes of non-model species. Until now, extensive genetic and physical maps have been required to direct the sequencing effort and sequence assembly for species with large and complex genomes. As these resources are unavailable for most species, especially for non-model species, assembling high-quality and nearly finished genome sequences from next-generation sequencing (NGS) data remains challenging. However, despite the existence of only sparse genetic and genomic resources, we were successful in generating a high-quality reference genome sequence for Amborella trichopoda, a non-model species that is crucial to understanding flowering plant evolution. The strategy involves a whole genome shotgun (WGS) sequence assembly including a combination of FISH (fluorescent in situ hybridization), computational sequence assemblers, and whole genome restriction maps (derived from OpGen Inc.'s Whole Genome Mapping technology). An Amborella genome sequence can be compared to other sequenced angiosperm genomes, enabling the investigation of the evolution of key lineage-specific innovations within angiosperms such as well-differentiated flower structures. Comparative genomic analysis will also facilitate the reconstruction of the ancestral genomic features of the most recent common ancestor (MRCA) of all extant flowering plants, and the characterization of genomic differences between gymnosperms and angiosperms. Another important characteristic of Amborella is that, unlike all other sequenced angiosperm genomes, it has not undergone any recent lineage-specific genome duplication.
Therefore, the Amborella genome can be used to study genome, gene, and epigenetic changes that happened after whole genome duplication(s) in various angiosperm lineages. Specifically, in this dissertation the Amborella genome is used to examine conservation and evolution of alternative splicing (AS) across angiosperms (flowering plants). ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2014.
Local:
Adviser: BARBAZUK,WILLIAM BRADLEY.
Local:
Co-adviser: SOLTIS,DOUGLAS EDWARD.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2016-08-31
Statement of Responsibility:
by Srikar Chamala.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Embargo Date:
8/31/2016
Classification:
LD1780 2014 ( lcc )

Downloads

This item has the following downloads:


Full Text

PAGE 1

THE AMBORELLA GENOME AND THE EVOLUTION OF ALTERNATIVE SPLICING ACROSS EUDICOTS

By
SRIKAR CHAMALA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2014

PAGE 2

© 2014 Srikar Chamala

PAGE 3

To my wife, parents, grandparents, friends, and all well wishers

PAGE 4

ACKNOWLEDGMENTS

I am very grateful to my committee supervisor, Dr. William Bradley Barbazuk, for his support, encouragement, and guidance throughout my graduate education. I am also very thankful to my committee members Dr. Douglas E. Soltis, Dr. Pam Soltis, and Dr. Kevin M. Folta, not only for their advice on my dissertation but also for giving me the opportunity to collaborate on several of their projects and to publish several papers. It has been my pleasure to work with my lab mates: Brandon Walts, Christy Gault, Guanqiao Feng, Jason Brant, Jessica Sabo, Jon Boatwright, Leandro Gomide, Ruth Davenport, Stela Palii, Tales Sidronio, Wenbin Mei, and Xiaoxian Liu. I thank Brandon Walts for his contribution to the validation of the Amborella genome assembly (using physical maps) and size estimation (using k-mer frequencies and repeat expansion approaches). I also thank Dr. Andre S. Chanderbali for his contribution to the validation of the Amborella genome assembly using fluorescence in situ hybridization (FISH). The contributions from Walts and Dr. Chanderbali were published in Chamala et al. (2013).

PAGE 5

TABLE OF CONTENTS
page
ACKNOWLEDGMENTS ........ 4
LIST OF TABLES ........ 8
LIST OF FIGURES ........ 10
ABSTRACT ........ 12
CHAPTER
1 INTRODUCTION ........ 14
  Significance of the Amborella trichopoda Genome ........ 14
  Pre-messenger RNA Splicing ........ 16
    Basic Intron Splicing Mechanism ........ 17
    Alternative Splicing ........ 17
  Alternative Splicing in Plants ........ 19
  Whole-genome Duplication ........ 20
    Evolutionary Fates of Duplicated Gene Copies ........ 20
    Biased Gene Content Retention Following Whole-genome Duplication ........ 22
    Evolutionary Fates of AS in WGD Gene Copies ........ 23
2 GENOME ASSEMBLY AND VALIDATION OF THE NON-MODEL BASAL ANGIOSPERM AMBORELLA TRICHOPODA ........ 27
  Background ........ 27
    Genome Sequencing Strategies ........ 27
    Genome Validation and Super-scaffolding Strategies ........ 29
  Results ........ 32
    DNA Sequencing ........ 32
    Quality Filtering of DNA Sequence Data ........ 33
      454 sequence data ........ 33
      Illumina 3 kb mate-pair data ........ 36
    Genome Assembly and Size Estimate ........ 37
    Assessment of Data Saturation and Assembly Completeness ........ 38
      Incremental assemblies ........ 38
      Representation of BAC-end sequences and Amborella unigene sequences within the WGS assembly ........ 39
    Assessment of Assembly Based on Pre-existing Genomic Resources ........ 40
      Coverage of contig assemblies across finished sequence contigs ........ 40
      Comparison of the scaffolded assemblies against the available physical map contigs ........ 40
      Assessment of whole genome assembly using FISH ........ 41

PAGE 6

    Assessment of OpGen Whole Genome Maps and Assembly Improvement scaffolding ........ 42
      scaffolding of version 1.0 assembly ........ 42
      a new assembly (V1.1) incorporating additional mate-pair sequences ........ 43
    BAC-free Version 1.0 Assembly and Genome scaffolding ........ 44
  Discussion ........ 44
  Materials and Methods ........ 53
    Sequencing ........ 53
    454 Long Insert Paired-end Library Protocol ........ 53
    454 Sequence Data Quality Filtering ........ 54
      Identification of short reads ........ 54
      Organelle contaminants ........ 54
      Identification and removal of artificial duplicate reads ........ 54
    Sanger BAC-end Data ........ 55
      Quality trimming ........ 55
      Organelle contaminants ........ 55
    Illumina 3 Kb Mate-pair Library Protocol ........ 55
    Illumina 3 Kb Mate-pair Data ........ 55
      Quality trimming ........ 55
      Organelle contaminants ........ 56
      Removal of junction and inward-facing reads from Illumina mate-pair libraries ........ 56
      Identification and removal of artificial duplicate reads ........ 57
    Sequence Assembly ........ 57
    Coverage Analysis and Estimating the Size of the Amborella Genome Sequence ........ 57
      Genome size estimate based on sequence coverage across finished regions ........ 58
      Estimating genome size using k-mer frequencies ........ 58
      Genome size estimation by repeat expansion ........ 59
    Cytogenetics ........ 60
    OpGen Whole Genome Mapping ........ 60
3 GLOBAL CONSERVATION OF ALTERNATIVE SPLICING EVENTS ACROSS EUDICOTS USING AMBORELLA AS A REFERENCE ........ 117
  Introduction ........ 117
    Cross-species Alternative Splicing Comparisons in Plants ........ 117
    Whole-genome Duplication and Alternative Splicing in Plants ........ 118
    Previous Methods for Identifying Conserved AS Event Identification Methods ........ 119
  Results ........ 121
    Global Transcriptome Alignment and Assembly ........ 121
    Intron Retention is the Most Frequent AS Event ........ 122
    Up to 70 Percent of Expressed Multi-exonic Genes Exhibit AS ........ 123
    High-throughput Pipeline for Identifying Conserved AS Events ........ 123

PAGE 7

    More Than 5,000 Conserved AS Event Clusters between Common Bean and Soybean ........ 124
    Extensive Species-specific AS Events in WGD Orthologs ........ 125
    More Than 27,000 Conserved AS Events among Nine Angiosperm Species ........ 125
    Ancestral Angiosperm AS Events ........ 127
    Overrepresented GO Categories among Genes With Conserved AS ........ 128
  Discussion ........ 129
    Frequency of Genes With AS ........ 129
    Advantages of our Conserved AS Event Identification Strategy ........ 130
    Conserved AS Events between Common Bean and Soybean ........ 132
    Conserved AS Events in WGD Orthologs ........ 133
    Conserved AS Events among Nine Angiosperm Taxa ........ 135
    Application of Conserved AS Events and Future Research ........ 135
  Materials and Methods ........ 138
    Genomic and Transcriptomic Data Collection ........ 138
      Genome assemblies and annotations ........ 138
      Transcriptome collection ........ 138
    RNA-seq Data Processing and Assembly ........ 139
      Calculating maximum intron size ........ 139
      Trinity genome-guided assembly ........ 139
      Trinity de novo assembly ........ 140
      Cufflinks assembly ........ 141
    PASA Pipeline ........ 141
    Classification of Alternative Splicing Events ........ 142
    OrthoMCL Clustering ........ 142
    Identification of Conserved AS Events between Taxa ........ 143
4 CONCLUSIONS ........ 175
LIST OF REFERENCES ........ 179
BIOGRAPHICAL SKETCH ........ 191

PAGE 8

LIST OF TABLES
Table / page
2-1 Sequenced genome library statistics. ........ 74
2-2 Genome library contaminant statistics. ........ 75
2-3 Contig and scaffold assembly metrics for the Amborella assembly version 1. ........ 76
2-4 Scaffold metrics for incremental assemblies of 454 paired-end 11 Kb libraries. ........ 77
2-5 FPC contigs mapping to more than one assembled scaffold, determined by alignment of end-sequenced BACs. ........ 78
2-6 FISH assessments of assembly scaffolds. ........ 82
2-7 Amborella support status by FISH and/or Amborella V1.1 assembly. ........ 86
2-8 Contig and scaffold assembly metrics for the Amborella assembly version 1. ........ 98
2-9 Contig and scaffold assembly metrics for the BAC Free Amborella assembly. ........ 99
2-10 Amborella version 1.0 BAC ........ 100
2-11 A comparison of assembly statistics between Amborella and other NGS-based whole genome assemblies. ........ 113
2-12 Genome reads coverage across sequenced BAC contigs. ........ 115
2-13 K-mer frequencies for genome estimation. ........ 116
3-1 Genome sequence and annotation resources. ........ 157
3-2 EST, mRNA, 454, and RNA-seq sequence data summary. ........ 158
3-3 RNA-seq tissues types and download sources. ........ 159
3-4 Global AS events. ........ 162
3-5 Conserved AS events between common bean (CB) and Soybean (SB) at gene family level. ........ 163
3-6 Conserved AS events in WGD orthologs. ........ 164
3-7 Conserved AS events at gene family level. ........ 165

PAGE 9

3-8 Genes with conserved AS events across at least one other species. ........ 166
3-9 Intron sizes used while performing transcriptome alignments and assemblies. ........ 167
3-10 Conserved AS events retention and loss categories among WGD orthologs between common bean (CB) and soybean (SB) and their conservation with at least one other angiosperms. ........ 168
3-11 GO Enrichment Analysis (Fisher's Exact Test) with BLAST2GO of genes having conserved AS events conserved across at least six angiosperms. ........ 169
3-12 GO Enrichment Analysis (Fisher's Exact Test) with BLAST2GO of genes having CJ AS events conserved in 1:2 categories of soybean. ........ 173
3-13 Vitis vinifera cv. Corvina samples pooled for RNA-seq run. ........ 174

PAGE 10

LIST OF FIGURES
Figure / page
1-1 Amborella trichopoda. ........ 25
1-2 Phylogenetic tree of plant genomes used in this study. ........ 26
2-1 Whole genome shotgun strategy. ........ 61
2-2 Incremental assemblies with 20 Kb paired-end libraries. ........ 62
2-3 Average contig length growth of incremental assemblies of single-end data. ........ 63
2-4 Genome coverage of incremental assemblies of single-end data. ........ 64
2-5 N statistics of incremental assemblies of single-end data. ........ 65
2-6 Average scaffold size growth of incremental assemblies of 454 11 Kb inserts. ........ 66
2-7 Coverage of assembled contigs across BAC contig 431. ........ 67
2-8 Coverage of assembled contigs across BAC contig 1003. ........ 68
2-9 FISH co-localized signal. ........ 69
2-10 Read coverage of BAC contig 431. ........ 70
2-11 Read coverage of BAC contig 1003. ........ 71
2-12 K-mer volume plots. ........ 72
2-13 Contig depth frequency plot. ........ 73
3-1 Ancestral CJ AS events in flowering plants. ........ 144
3-2 Work flow of transcriptome data pre-processing and PASA assembly. ........ 145
3-3 Types of AS events. ........ 146
3-4 Frequencies of AS events. ........ 147
3-5 Frequencies of AS in expressed multi-exonic genes. ........ 148
3-6 Conserved junction (CJ) alternative splicing (AS) events identification pipeline. ........ 149
3-7 Percentages of conserved AS events shared between species. ........ 150

PAGE 11

3-8 category of genes having conserved AS events conserved across at least six angiosperms. ........ 151
3-9 category of genes having conserved AS events conserved across at least six angiosperms. ........ 152
3-10 category of genes having conserved AS events conserved across at least six angiosperms. ........ 153
3-11 Visualizati category of genes having conserved AS events in two WGD paralogs of soybean and their corresponding ortholog in common bean. ........ 154
3-12 category of genes having conserved AS events in two WGD paralogs of soybean and their corresponding ortholog in common bean. ........ 155
3-13 category of genes having conserved AS events in two WGD paralogs of soybean and their corresponding ortholog in common bean. ........ 156

PAGE 12

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

THE AMBORELLA GENOME AND THE EVOLUTION OF ALTERNATIVE SPLICING ACROSS EUDICOTS

By Srikar Chamala

August 2014

Chair: William Bradley Barbazuk
Major: Botany

At the start of the 21st century, sequencing the genomes of most eukaryotes was expensive and laborious. Hence, early genome projects were restricted to species of commercial interest or model organisms. Recent improvements in sequencing technology have increased throughput and lowered cost, enabling researchers to sequence genomes of non-model species. Until now, extensive genetic and physical maps have been required to direct the sequencing and assembly for species with large and complex genomes. As these resources are unavailable for most species, especially for non-model species, assembling high-quality and nearly finished genome sequences from next-generation sequencing (NGS) data remains challenging. However, despite the existence of only sparse genetic and genomic resources, I was successful in generating a high-quality reference genome sequence for Amborella trichopoda, a non-model species that is crucial for understanding flowering plant evolution. The strategy involves a whole genome shotgun (WGS) sequence assembly including a combination of FISH (fluorescent in situ hybridization), computational sequence assemblers, and whole genome restriction maps (derived from OpGen Inc.'s Whole Genome Mapping technology). An Amborella genome sequence can then be compared to other

PAGE 13

sequenced angiosperm genomes, enabling the investigation of the evolution of key lineage-specific innovations within angiosperms such as well-differentiated flower structures. Comparative genomic analysis will also facilitate the reconstruction of the ancestral genomic features of the most recent common ancestor (MRCA) of all extant flowering plants, and the characterization of genomic differences between gymnosperms and angiosperms. Another important characteristic of Amborella is that, unlike all other sequenced angiosperm genomes, it has not undergone any recent lineage-specific genome duplication. Therefore, the Amborella genome can be used to study genome, gene, and epigenetic changes that happened after whole genome duplication(s) in various angiosperm lineages. In this dissertation, the Amborella genome is compared to seven eudicot species and one monocot species to examine conservation and evolution of alternative splicing (AS) across angiosperms (flowering plants). This study identified 27,120 AS events that are conserved between at least two angiosperms, 9,129 putative ancestral AS events of angiosperms, high rates of lineage-specific AS events, and preferential retention of AS events in certain gene classes.

PAGE 14

CHAPTER 1
INTRODUCTION

Significance of the Amborella trichopoda Genome

Angiosperms (flowering plants) occupy nearly every habitable terrestrial environment and many aquatic ones. They have diversified to include 250,000-400,000 species in a relatively short period of time, estimated to be just over 130 million years (Soltis et al. 2008). They are the source of the majority of human food and animal feed and provide other important human necessities such as wood, medicine, fiber, and fuel. They also account for a huge proportion of land-based photosynthesis and carbon sequestration (Soltis et al. 2008). Consequently, angiosperm genetics is an important research focus, and several angiosperm sequencing projects have been undertaken in the past few years. These projects have focused on plants of economic (e.g., Populus, Vitis, Oryza, Zea) and/or model organism (e.g., Arabidopsis) importance and are mostly from two of the major groups of angiosperms: eudicots (such as rosids and asterids) and monocots (e.g., grasses). However, reconstructing the ancestral state of various traits and characteristics specific to angiosperms, including flowers and fruits, diverse pollination systems, double fertilization, large water-conducting vessel elements, and diverse biochemical pathways, requires the genome sequence of a species belonging to a lineage branching from the most basal node of the angiosperm tree (Williams and Friedman 2002; Soltis et al. 2002, 2005, 2008). Current phylogenetic studies place Amborella trichopoda (Amborellaceae), an understory shrub endemic to New Caledonia, as the single sister species to all other extant angiosperms, that is, between the two extant seed plant lineages, gymnosperms and the remaining angiosperms (Figure 1-1) (Soltis et al. 1999;

PAGE 15

Mack et al. 2005; Jansen et al. 2007; Moore et al. 2007; Soltis et al. 2008; Zuccolo et al. 2011; Goremykin et al. 2013; Drew et al. 2014). This makes the Amborella genome a pivotal reference for (i) studying the evolution of key lineage-specific innovations within angiosperms, (ii) reconstructing ancestral genomic features of the most recent common ancestor (MRCA) of all extant flowering plants, and (iii) characterizing the differences between the two extant seed plant lineages (Soltis et al. 2008). The Amborella genome will add resolution to angiosperm evolution in the same way that the genome of the duck-billed platypus (Ornithorhynchus anatinus), which is the sister group of all other extant mammals (Warren et al. 2008; Zuccolo et al. 2011), did for mammalian genome evolution.

Initially there was no strong evidence for whole genome duplication (WGD) in Amborella (Cui et al. 2006); this view has changed with deeper sequencing and gene family analyses (Jiao et al. 2011). Using ESTs as well as complete genome sequence data for several taxa, Jiao et al. (2011) have shown that many genes were apparently duplicated just before the origin of the angiosperms, which suggests a common causal factor, namely ancient polyploidy. However, this hypothesis remains to be tested more rigorously at the genome level, which is a question that a whole genome sequence of Amborella can help address.

Amborella is typical of the large majority of plants and animals that may be of exceptional biological interest but have either sparse or no genetic resources (e.g., a genetic map) or genome sequence. The published estimate of the genome size of Amborella based on flow cytometry is approximately 870 Mb (Leitch and Hanson 2002). This is over 6X the size of genome, making it complicated to

PAGE 16

assemble a high-quality genome. Although NGS provides deep genomic sequence coverage at low cost, assembly of large contiguous segments remains difficult, and assessing assembly accuracy is problematic in the absence of independently derived genomic maps. Since one major application of the Amborella genome sequence is in comparative genomic studies, a comprehensive and correctly assembled genome is necessary to support accurate comparisons. To generate a high-quality reference genome for Amborella, a novel whole genome assembly strategy was employed. This strategy combined computational WGS assembly, scaffolding with whole genome restriction maps, and fluorescent in situ hybridization (FISH) for evaluation. This sequencing strategy will be presented and discussed in detail in the next chapters. Once assembled and annotated, the Amborella genome sequence will be used to study alternative splicing (AS) conservation and evolution across eudicot lineages of angiosperms.

Pre-messenger RNA Splicing

Eukaryotic genes often contain introns. During pre-messenger RNA (pre-mRNA) processing, the introns are spliced out and the neighboring exons are ligated to produce a mature mRNA. This process of intron removal and exon ligation is called pre-mRNA splicing or RNA splicing (Kornblihtt et al. 2013). It takes place within the nucleus, either simultaneously with or after transcription (Kornblihtt et al. 2013). Some exons are always included during RNA splicing, while others are included in some transcripts and excluded from others. Regulated inclusion of exons generates multiple mRNAs from a single gene, a phenomenon known as AS (Kornblihtt et al. 2013).
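The intron removal and regulated exon inclusion described above can be illustrated with a toy model. The following is only a minimal sketch, not a method from this dissertation: the sequence, the uppercase/lowercase convention for exons and introns, and the function names `find_exons` and `splice` are all hypothetical and chosen for illustration.

```python
import re

def find_exons(pre_mrna):
    """Return (start, end) spans of exons, marked here by uppercase letters."""
    return [m.span() for m in re.finditer(r"[A-Z]+", pre_mrna)]

def splice(pre_mrna, skip=()):
    """Remove introns and ligate exons, omitting exons whose index is in `skip`."""
    exons = find_exons(pre_mrna)
    return "".join(pre_mrna[s:e] for i, (s, e) in enumerate(exons) if i not in skip)

# Hypothetical toy pre-mRNA: three exons (uppercase) separated by two introns (lowercase).
PRE_MRNA = "AUGGCC" + "guaagucuaacag" + "UUCGGA" + "guaaguuuacag" + "UGCUAA"

constitutive = splice(PRE_MRNA)            # all three exons ligated
exon_skipped = splice(PRE_MRNA, skip={1})  # middle exon excluded

print(constitutive)
print(exon_skipped)
```

The two calls yield different mature mRNAs from the same pre-mRNA, which is the sense in which a single gene can produce multiple isoforms. Real splicing decisions depend on splice-site signals and regulatory factors rather than on a caller-supplied `skip` set.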

PAGE 17

Basic Intron Splicing Mechanism

The spliceosome (reviewed by De Conti et al. 2012) is a ribonucleoprotein complex that plays a major role in splicing of pre-mRNA. The spliceosome machinery is made up of hundreds of particles and exhibits dynamic RNA-RNA, RNA-protein, and protein-protein interactions. Major components of the spliceosome include five uridine-rich small nuclear ribonucleoprotein particles (snRNPs U1, U2, U4, U5, and U6) and a large number of non-snRNP protein factors such as splicing factor 1 (SF1) and U2 auxiliary factor (U2AF). The stepwise interaction of spliceosome machinery with basic pre l initiate the splicing process. Some of the major events in splicing include (i) U1 snRNA base pairs to the intron junction consensus sequence is CAG/GURAGU, (ii) SF1 will bind to the branch point/site (BS), which is a loosely conserved s ss and is usually close to the ss, (iii) U2AF is recruited to the polypyrimidine tract, which is located between the ss, and this step brings splice sites together, (iv) U2 snRNA will displace SF1 and stabilize the interaction between itself, U1 snRNA, and U2AF, (v) U4, U5, and U6 snRNAs bind to the and (vi) U1 and U4 snRNAs will be displaced after recruiting the Prp19/CDC5L complex, which is followed by two transesterification reactions to excise the intron and splice together adjacent exons (De Conti et al. 2012).

Alternative Splicing

In addition to basic cis-acting motifs, cis enhancers, cis silencers, and trans-acting regulatory proteins may be required to avoid pseudo splice sites. These signals are responsible for determining alternative


inclusion/exclusion of introns/exons, resulting in alternative mRNA isoforms. The predominant AS events include intron retention (IntronR) (Wang and Brendel 2006; Kornblihtt et al. 2013). Enhancer and silencer elements are cis-regulatory sequences, which include exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs), intronic splicing enhancers (ISEs), and intronic splicing silencers (ISSs). As the names suggest, ESEs and ESSs reside in exons, whereas ISEs and ISSs reside in introns. ESEs and ISEs enhance exon inclusion, whereas ESSs and ISSs repress exon inclusion. Overall, enhancers and silencers have positive or negative impacts on the efficiency of spliceosome formation (De Conti et al. 2012). Trans-acting regulatory proteins regulate exon recognition either positively or negatively through their interactions with cis-regulatory enhancer and silencer motifs. Well-characterized trans-acting regulatory proteins that bind to ESEs/ISEs are members of the serine/arginine-rich (SR) protein family. Likewise, members of the heterogeneous nuclear ribonucleoprotein (hnRNP) protein family interact with ESS/ISS sites (Chen and Manley 2009). There are also tissue-specific splicing factors, which include nPTB and PTB, NOVA, and FOX (Chen and Manley 2009). AS is one of the mechanisms eukaryotes use to generate transcriptome and proteome diversity, as well as to regulate gene expression and protein abundance (Nilsen and Graveley 2010). For example, the Dscam gene in D. melanogaster can potentially produce up to 38,016 splice variants, or isoform mRNAs, far exceeding the total number of genes in the genome (about 14,500) (Nilsen and Graveley 2010; Schmucker et al. 2000).
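The figure of 38,016 potential Dscam isoforms follows from simple combinatorics: the gene carries four clusters of mutually exclusive alternative exons, and one exon from each cluster is chosen per transcript. The cluster sizes used below (12, 48, 33, and 2 alternatives for exons 4, 6, 9, and 17) are the values commonly cited from Schmucker et al. (2000); they are not stated in the text above and are included here as an assumption.

```python
from math import prod

# Number of mutually exclusive alternatives per Dscam exon cluster.
# Commonly cited values; an assumption, not given in this chapter.
dscam_clusters = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# One alternative is chosen independently from each cluster.
potential_isoforms = prod(dscam_clusters.values())
print(potential_isoforms)  # 38016
```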


Alternative Splicing in Plants

Due to the availability of transcriptome sequences and well-annotated gene models in metazoans (for example, human, mouse, and fruit fly), AS on a global scale has been well catalogued in these organisms. In contrast, the cataloguing of AS events in plant genomes is not as advanced. New high-throughput sequencing technologies are enabling more comprehensive studies of splicing that identify novel AS events. For example, in the early 1990s, <5% of human genes were thought to undergo AS (Sharp 1997), but this number increased to >90% as more and more transcriptome sequences were analyzed (Wang et al. 2008b). Similarly, AS events were thought to be rare in plants, but a recent comprehensive study in Arabidopsis using high-throughput transcriptome sequencing suggests that at least ~61% of intron-containing genes exhibit AS (Marquez et al. 2012). One possible reason that fewer AS events have been discovered in plants than in metazoans is that animals rely on AS to generate transcriptome and proteome diversity, while plants are thought to achieve this through whole or partial genome duplication, a process that is common in plant lineages but has not occurred frequently during the evolutionary history of metazoans (Otto and Whitton 2000; Barbazuk et al. 2008; Jiang et al. 2013; Cañestro et al. 2013). There is increasing evidence suggesting that AS is important in plant processes such as photosynthesis, defense response, flowering, and cereal grain quality (Reddy 2007; Barbazuk et al. 2008; Staiger and Brown 2013; Reddy et al. 2013). Despite the important role of AS in plants, the evolution and conservation of AS events across plants is understudied (Barbazuk et al. 2008; Darracq and Adams 2013). This is largely due to a lack of abundant transcriptome sequence data sampled from multiple tissues (Barbazuk et al. 2008; Syed et al. 2012). This limitation is largely mitigated via the


availability of high-throughput sequencing technologies, which have led to large collections of plant transcriptome sequences in public databases, and the sequencing of several plant genomes at various phylogenetic distances. Using both public and in-house transcriptome and genomic sequence resources, Chapter 3 examines the global conservation of AS events across large and short evolutionary distances in nine angiosperm taxa (shown in bold font in Figure 1-2), comprising seven species from the eudicot lineage, one monocot, and Amborella, which shares a common ancestor with all other extant angiosperms (Figure 1-2).

Whole-genome Duplication

Evolutionary Fates of Duplicated Gene Copies

After gene duplication, one of the gene copies may be lost or both copies may be retained. In the case of duplicate gene loss, one of the gene copies can be physically lost or become non-functional (pseudogenized) due to the accumulation of detrimental mutations, while the other copy maintains the function of the ancestral gene. Loss of a duplicate gene copy might be the most common fate of duplicate genes (Cañestro et al. 2013). On the other hand, when both copies are retained they could be functionally redundant or sub-functionalized, or one of the gene copies may become neo-functionalized (Cañestro et al. 2013). In the case of functional redundancy, both copies of a duplicated gene are maintained and continue to be expressed and to function in the same manner as the ancestral gene. This may be due to selection for increased gene product, where both copies are subjected to purifying selection to maintain ancestral function (McGrath and Lynch 2012; Cañestro et al. 2013). An example of this is ribosomal 5S gene copies in fishes


that have been expanded through WGD events and maintained through intense purifying selection (Pinhal et al. 2011; McGrath and Lynch 2012). In the case of sub-functionalization, ancestral gene function is partitioned between duplicated copies, and these copies are maintained under purifying selection (Ka/Ks < 1) (Cusack and Wolfe 2007; Roulin et al. 2012). This could be due to mutations in cis-regulatory sequences. There are two types of sub-functionalization, qualitative and quantitative (McGrath and Lynch 2012). In qualitative sub-functionalization, gene duplicates show tissue-, developmental-, or time-specific expression (McGrath and Lynch 2012). For example, although the small cytoplasmic adaptor proteins CASP and GRASP in vertebrates share highly similar gene and protein structures, CASP is exclusively expressed in immune systems, while GRASP is expressed only in nervous systems (MacNeil et al. 2008). In quantitative sub-functionalization, both duplicated copies are subjected to partial loss-of-function mutations, i.e., the expression of both gene copies is required to compensate for the original gene function (McGrath and Lynch 2012). Substantial decreases in the levels of gene expression after duplication in yeasts and mammals are linked to quantitative sub-functionalization (Qian et al. 2010). In neo-functionalization, one of the gene copies maintains the function of the ancestral copy, while the other copy adopts a novel function as the result of advantageous mutation(s) and is maintained by positive selection (Ka/Ks > 1) (Blanc and Wolfe 2004b; Roulin et al. 2012; Cañestro et al. 2013). The neo-functionalization phenomenon is well illustrated by the ABS and GOA paralogs in Arabidopsis (Erdmann et al. 2010). ABS and GOA originated from a duplication event during Brassicaceae diversification (Erdmann et al. 2010). ABS is known to function in


the regulation of endothelium development (Nesi et al. 2002; Kaufmann et al. 2005), while GOA acquired a new function involving fruit size control (Prasad et al. 2010) due to modification of a protein domain (Erdmann et al. 2010).

Biased Gene Content Retention Following Whole-genome Duplication

Recently acquired whole-genome and transcriptome sequences from several angiosperms suggest that WGD is common (Jiao et al. 2011; Van de Peer et al. 2009) and is followed by gene loss and diploidization (Jiao et al. 2011; Xu et al. 2012). Irrespective of whether a gene duplication is locally (tandem duplication) or globally (WGD) derived, the fate of duplicated gene copies is one of the four scenarios previously discussed. However, comparative genomics data provide increasing evidence that there is retention bias in gene families specific to the mode of duplication, either tandem duplication or WGD, suggesting different evolutionary constraints on each (Freeling 2009; Conant et al. 2014). Dosage-dependent genes, like those encoding transcription factors, protein kinases, and ribosomal proteins, are preferentially retained among genes duplicated by WGD and are under-retained among gene copies derived from tandem duplication (Freeling 2009), because loss of one of the duplicate copies relative to its interacting partners may result in reduced fitness (Conant et al. 2014). Another outcome of WGD is that all the genes involved in certain pathways or networks are duplicated, which might lead to neo- or sub-functionalization of the ancestral network (McGrath and Lynch 2012). Thus, WGD could contribute to the evolution of new traits resulting in increased fitness (Roulin et al. 2012). Also, WGD is thought to be the key mechanism through which angiosperms attained rapid diversification (within the last 140-150 million years) (Van de Peer et al. 2009; Soltis et al.


2009; Friedman 2009).

Evolutionary Fates of AS in WGD Gene Copies

Investigations of WGD have focused on the gene level, including gene collinearity (Tang et al. 2008), gene family expansion (Veron et al. 2007), bias in parental genome retention (Buggs et al. 2010; Schnable and Freeling 2012), and changes in gene expression (Roulin et al. 2012). However, advances in high-throughput sequencing technologies have made available several plant genome sequences and large RNA-seq data collections from diverse plant tissue types, and these are enabling investigations into the impact WGD has on AS. High rates of WGD in angiosperms (Jiao et al. 2011) make them good model systems for studying the changes in AS events after WGD. Only one study in plants has investigated the evolutionary conservation and divergence of AS patterns in genes duplicated by polyploidy during the evolutionary history of Arabidopsis thaliana (Zhang et al. 2010), but this analysis was limited to only 52 WGD duplicate gene pairs (Blanc et al. 2003) and relied on AS evidence available within a legacy database of plant alternative splicing (http://www.plantgdb.org/ASIP/) (Wang and Brendel 2006) constructed from sparse transcriptome resources. Along with an examination of global AS conservation in angiosperms, Chapter 3 will also examine the genome-wide conservation of AS events in duplicated genes after WGD using soybean (Glycine max) as a model system. Soybean underwent a recent lineage-specific WGD about 5-10 MYA (Roulin et al. 2012). I use Phaseolus vulgaris (common bean) as an outgroup for comparison because there is no evidence of it undergoing a WGD since its divergence from soybean about 19 MYA (McClean et al.


2010). To overcome the limitations of the Zhang et al. (2010) study, in this analysis I collected uniform and deep transcriptome data for both common bean and soybean from similar tissue types and examined 14,759 gene pairs derived from WGD for AS (data provided by Scott Jackson, University of Georgia, USA) (Schmutz et al. 2014).


Figure 1-1. Amborella trichopoda. It is endemic to New Caledonia and sister to all other extant angiosperms. A) Overview of angiosperm phylogeny showing the sister-group position of Amborella to all other extant angiosperms, B) Amborella inflorescence, C) Map showing the isolated location of New Caledonia (circled) (Amborella Genome Project 2013).


Figure 1-2. Phylogenetic tree of plant genomes used in this study. Species of interest are marked in bold. Branch lengths are not drawn to scale. Divergence times of clades and timings of WGD events are given in MYA and are italicized; these are based on the following: Roulin et al. 2012; Jiao et al. 2011; Amborella Genome Project 2013; Fawcett et al. 2009; Young et al. 2011; Tuskan et al. 2006; McClean et al. 2010; Paterson et al. 2012; Woodhouse et al. 2011.


CHAPTER 2
GENOME ASSEMBLY AND VALIDATION OF THE NON-MODEL BASAL ANGIOSPERM AMBORELLA TRICHOPODA

Contents of this chapter are published in the following peer-reviewed journal articles: S. Chamala et al., Science 342, 1516 (2013); Amborella Genome Project, Science 342, 1241089 (2013).

Background

Amborella trichopoda (Amborellaceae) is an understory shrub endemic to New Caledonia that diverged from all other extant angiosperms (flowering plants) ca. 160 million years ago (Amborella Genome Project 2013). This species holds a pivotal phylogenetic position as the single sister to all other extant angiosperms (Amborella Genome Project 2013). Generating a complete genome sequence of Amborella and comparing it to other sequenced angiosperm genomes enables the study of molecular changes that led to the innovation of key angiosperm traits (e.g., well-differentiated flower organs and vessel elements). Another important characteristic of Amborella is that, unlike all other angiosperms with sequenced genomes, it has not undergone any recent lineage-specific genome duplications. Therefore, the Amborella genome can be used to identify and study genome, gene, and epigenetic changes that occurred after whole-genome duplication(s) in various angiosperm lineages.

Genome Sequencing Strategies

There are two whole-genome sequencing strategies, Hierarchical Shotgun Sequencing (HSS) and Whole Genome Shotgun Sequencing (WGSS) (Waterston et al. 2002). HSS requires fragmentation of target genomic DNA, which is cloned into large-fragment vector libraries (e.g., BACs and fosmids) (International Human Genome Sequencing Consortium 2004). These libraries are fingerprinted and used to build a


physical map, which is consulted to organize the sequencing process. A minimally redundant set of overlapping large-insert clones that span the length of the physical map is identified, and these clones are individually sequenced and assembled, thus reconstructing the genome sequence (International Human Genome Sequencing Consortium 2004). This method has three main advantages: (i) it uses fewer computational resources in terms of computer memory and hardware, and requires less complex assembly algorithms; (ii) it generates fewer misassemblies due to repetitive sequences; and (iii) the sequence contigs and scaffolds are large and accurately assembled. These advantages are due to localized genome assemblies (individual clones) and advance knowledge of clone positions with respect to each other at the chromosome scale (physical maps). Although HSS can reconstruct high-quality genome sequences, constructing the clone resources and a comprehensive physical map is both time-consuming and expensive (Waterston et al. 2002). The alternative whole-genome sequencing strategy to HSS is WGSS. WGSS requires shearing the genomic DNA into smaller fragments that are randomly sequenced and assembled without the aid of a physical map (Pop et al. 2002; Miller et al. 2010). While WGSS was already faster and less expensive than HSS, its cost has been greatly reduced by high-volume next-generation sequencing (NGS) technologies that do not require clone libraries. However, WGSS requires a large number of sequence reads that are complicated to assemble. Large genomes with complex repeats and genome structure require large amounts of computer hardware, including hard disk space and memory. Also, sophisticated computational algorithms are required to resolve the repeat structures accurately and assemble large sequence data sets in a


timely manner (Pop et al. 2002; Miller et al. 2010). Developments in storage capacity, better assembly algorithms (Miller et al. 2010), and declining costs of computer hardware have reduced many of the original limitations of WGSS. Based on the above comparison between the HSS and WGSS genome assembly strategies, I chose WGSS using NGS data as the strategy to generate a draft Amborella genome sequence. Briefly, WGSS using NGS data involves sequencing and assembling single or short-insert paired-end DNA reads into contigs, then building the genome sequence scaffolds by linking together contigs based on evidence from long-insert (e.g., 3 kb, 8 kb, 20 kb, 120 kb) paired-end reads (Figure 2-1A and 2-1B) (Miller et al. 2010).

Genome Validation and Super-scaffolding Strategies

As mentioned above, reconstruction of a genome via WGSS using NGS data is fast and has relatively low cost. However, the reconstructed genome is often fragmented. Another potential drawback of these draft genomes is that the accuracy of the assembled scaffolds is unknown, so conclusions drawn from comparative genomic studies involving draft genome sequences assembled from NGS data may not be reliable. Traditionally, resources like physical or genetic maps were used to identify the relative positions of draft genome scaffolds with respect to each other (super-scaffolding) and assign them to chromosomes. These resources were also used to assess the overall accuracy of the assembled scaffolds (Anantharaman et al. 1999; Zhou et al. 2009). Often these resources are unavailable for non-model species, and generating them is cost-prohibitive. A cost-effective alternative is an optical map (Dong et al. 2012), which is a high-resolution, genome-scale ordered restriction map. Optical maps are made by (i) digesting immobilized single molecules of DNA on open glass surfaces


using restriction enzymes so as to preserve the order of the digested fragments, (ii) staining the fragments with a fluorescent dye and visualizing them with fluorescence microscopy to estimate fragment lengths, and (iii) identifying common restriction map patterns among single molecules and assembling these into a consensus optical map. These optical maps can represent large regions of the genome, even spanning large repeat regions (Anantharaman et al. 1999; Dong et al. 2012). Once optical maps are ready, in silico maps generated from the draft genome scaffold sequences are used to anchor scaffolds onto the optical maps; the relative positions of scaffolds with respect to each other are thereby resolved, and scaffolds are built into super-scaffolds (Anantharaman et al. 1999). Optical maps have been routinely used for finishing bacterial genomes (Latreille et al. 2007; Nagarajan et al. 2008), but have only recently been applied to complex genomes of model eukaryotic organisms (Zhou et al. 2009; Young et al. 2011; Dong et al. 2011; Chamala et al. 2013) to assist with scaffolding and correction of well-advanced genome assemblies. The aforementioned complex genome assembly projects used optical maps and required extensive manual intervention, which can be tedious and laborious. However, OpGen, Inc., which markets optical map technology as Whole Genome Maps (WGM), has automated the optical mapping and super-scaffold building steps (http://www.opgen.com/mapit). At the time of this project only the super-scaffolding step was automated, and error correction within scaffolds remained laborious. Hence, in the context of the Amborella Genome Project, WGM was used only to improve the contiguity of the whole-genome assembly generated from NGS data, making the Amborella whole-genome sequencing project a pilot for the application of WGM to large and complex genomes.
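The in silico maps mentioned above are ordered lists of restriction fragment lengths predicted from scaffold sequence. The following is a minimal sketch of how such a map could be computed. The recognition site GGATCC with a cut one base into the site (the BamHI pattern) is only an illustrative assumption, since this section does not name the enzyme used for the Amborella maps, and the function name is hypothetical. Only the forward strand is scanned, which is sufficient for a palindromic site.

```python
def in_silico_digest(sequence, site, cut_offset=0):
    """Return the ordered fragment lengths obtained by cutting `sequence`
    at every occurrence of the recognition `site`.

    cut_offset is the position of the cut within the site
    (e.g. 1 for a G^GATCC pattern).
    """
    sequence = sequence.upper()
    site = site.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    # Differences between consecutive boundaries are the fragment lengths;
    # zero-length fragments (a cut at the very start or end) are dropped.
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]

# Toy scaffold with two recognition sites (illustrative sequence only).
toy_scaffold = "AAAAGGATCCTTTTTTGGATCCAA"
fragments = in_silico_digest(toy_scaffold, "GGATCC", cut_offset=1)
print(fragments)  # [5, 12, 7]
```

Anchoring a scaffold then amounts to finding where its ordered fragment-length pattern matches the consensus optical map, a step that this sketch does not attempt.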


Scaffolds generated by Newbler and OpGen super-scaffolds were evaluated using fluorescence in situ hybridization (FISH). FISH has a wide range of applications, including detection of karyotypic alterations and mapping of single cloned or PCR-amplified sequences to chromosomes (reviewed in Chester et al. 2010). The ability of FISH to map sequences onto chromosomes can be applied to evaluate the fidelity of a genome assembly, which has obvious advantages for non-model organisms like Amborella with sparse genetic resources. The basic idea is to map FISH probes designed from distinct sections of a scaffold to chromosomes. Detecting probe signals in close proximity on a single chromosome suggests that the scaffold assembly is correct. In contrast, probe signals that fall relatively far apart on the same chromosome or on different chromosomes flag the scaffold as misassembled. The application of FISH to evaluate the fidelity of genome assembly was tested during Amborella whole-genome assembly on scaffolds from the NGS genome assembly and super-scaffolds from WGM. FISH was also used to assign genome scaffolds to individual chromosomes.

This chapter has two objectives. The first is the generation of a draft genome sequence of Amborella using NGS-mediated WGSS, including a discussion of the deficiencies of NGS data and the methods to identify and resolve them. The second objective is to assess the completeness of the Amborella assembly, validate it, and improve the contiguity of the draft whole-genome assembly. Additionally, the potential for using WGM-based super-scaffolding as a replacement for large-insert paired-end libraries will be examined.
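The FISH decision rule described in this section (probes from one scaffold should land close together on a single chromosome) can be written down as a small classifier. This is only a sketch of the logic: the function name, the chromosome labels, the position units, and the distance cutoff are all hypothetical, and in practice the judgment is made from microscope images rather than numeric coordinates.

```python
def evaluate_scaffold(probe_hits, max_distance):
    """Classify one scaffold from the chromosomal locations of its FISH probes.

    probe_hits   -- list of (chromosome, position) tuples, one per probe;
                    positions are in arbitrary units along the chromosome
    max_distance -- hypothetical cutoff separating "close" from "far apart"
    """
    if len(probe_hits) < 2:
        raise ValueError("at least two probes are needed to test a scaffold")
    chromosomes = {chromosome for chromosome, _ in probe_hits}
    if len(chromosomes) > 1:
        return "misassembled (different chromosomes)"
    positions = [position for _, position in probe_hits]
    if max(positions) - min(positions) > max_distance:
        return "misassembled (far apart on one chromosome)"
    return "consistent"

# Illustrative probe placements (invented values).
print(evaluate_scaffold([("chr3", 0.42), ("chr3", 0.45)], max_distance=0.10))
print(evaluate_scaffold([("chr3", 0.42), ("chr7", 0.45)], max_distance=0.10))
```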


Results

DNA Sequencing

Whole-genome shotgun sequencing of Amborella was undertaken using a combination of NGS technologies, leveraging the strengths of each technology to extend and improve the accuracy of our assembly. The sequence data (Table 2-1) were generated primarily on the Roche 454 GS FLX Titanium and FLX+ platforms, which provided the longest accurate reads of the NGS technologies available at the time. Longer reads help genome assemblers to resolve short repeats and result in larger contigs. Single-end reads included 14 Gb of 454 GS FLX Titanium and 12 Gb of 454 FLX+ data. Large eukaryotic genomes often have numerous and large repeat structures that limit an assembler's ability to reconstruct continuous genome sequence. These result in a fragmented assembly consisting of many small unordered islands of assembled sequence, or contigs. One of the main resources that help assemble contigs into scaffolds is large-insert paired-end sequence data. For this project, 2 Gb of 11 kb insert paired-end 454 GS FLX Titanium sequences and over 19 Gb of 3 kb insert Illumina paired-end reads were obtained. A collection of end sequences (48 Mb) derived from a 5.5X coverage Amborella BAC library with an average insert size of 123 kb (Zuccolo et al. 2011) was also included. In addition, 20 kb insert paired-end 454 GS FLX Titanium libraries were sequenced but were discarded due to their chimeric nature. Overall, the total raw sequences constitute 262 M reads representing 48 Gb, providing a theoretical coverage of 64.1X based on a genome size estimate of 870 Mb (Table 2-1).


Quality Filtering of DNA Sequence Data

Several quality, length, and contaminant filters (see below) were applied to ensure high-quality data for assembly. The results of filtering are presented in Table 2-2. Quality filtering reduced the total size of the data set by >51%, to 23 Gb.

454 sequence data

A minimum 454 read length threshold was established for our assembly by examining the effect that inclusion of short reads of various size ranges had on contig N50 and L50 values. This procedure amounts to empirically determining the point of diminishing returns for inclusion of short reads of various sizes. Based on this analysis, single-end reads shorter than 100 bp were discarded. To be retained for assembly, paired-end reads had to have a combined length of at least 150 bp, with one member of the read pair at least 100 bp long and the other at least 50 bp long. This resulted in discarding 5.44%, 3.77%, and 7.34% of the reads from the 454 GS FLX Titanium, 454 FLX+, and 454 paired-end (11 kb) libraries, respectively (Table 2-2). Tissue for DNA extractions was limited, so no attempt was made to enrich 454 libraries for nuclear genomic DNA. Therefore, our reads included organelle sequences that needed to be identified and removed before assembly. I identified potential organelle contamination by aligning the 454 reads to the Amborella mitochondrial (Rice et al. 2013) and chloroplast (NCBI Accession Number: NC_005086.1) genome sequences with MosaikAligner v1.1.0020 (Hillier et al. 2008). These alignments identified 14.87%, 6.01%, and 9.22% of the reads from the 454 GS FLX Titanium, 454 FLX+, and 454 paired-end (11 kb) libraries, respectively, as organelle-derived (Table 2-2).
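The 454 read-length criteria stated above can be expressed compactly. The thresholds are those given in the text; the function and constant names are hypothetical, and this sketch is not the filtering script that was actually used.

```python
MIN_SINGLE = 100      # single-end reads shorter than this are discarded
MIN_PAIR_TOTAL = 150  # minimum combined length of a read pair
MIN_PAIR_LONG = 100   # minimum length of the longer mate
MIN_PAIR_SHORT = 50   # minimum length of the shorter mate

def keep_single_read(length):
    """Return True if a single-end 454 read passes the length filter."""
    return length >= MIN_SINGLE

def keep_read_pair(len_a, len_b):
    """Return True if a 454 paired-end read pair passes the length filter."""
    longer, shorter = max(len_a, len_b), min(len_a, len_b)
    return (longer >= MIN_PAIR_LONG
            and shorter >= MIN_PAIR_SHORT
            and longer + shorter >= MIN_PAIR_TOTAL)

print(keep_single_read(99))     # False
print(keep_read_pair(100, 50))  # True
print(keep_read_pair(120, 40))  # False
```

Note that the 150 bp combined-length requirement is already implied by the two per-mate minima; it is kept here only to mirror the wording of the criteria.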


Artificial duplicate reads are sequencing artifacts in which some reads share the same start position and have exactly the same sequence, suggesting that they derive from duplicated DNA fragments. Recently, several studies reported the presence of artificial duplicate reads in 454 sequencing runs (Dong et al. 2011; Gomez-Alvarez et al. 2009; Niu et al. 2010; You et al. 2011). Gomez-Alvarez et al. (2009) reported that 11-35% of sequences in 454 metagenomic libraries are artificially replicated sequences. Two primary sources of these artificial duplicates are (a) transfer and amplification of genomic templates from amplified emulsion droplets into other emulsion droplets with empty beads during the emulsion PCR step, and (b) bleeding of the optical signal during sequencing into the space of an adjacent empty well (Gomez-Alvarez et al. 2009). Approximately 12.94% and 10.36% of the reads from the 454 Titanium and 454 Titanium plus libraries, respectively, were identified as artificial duplicates (Table 2-2) using the clustering program CD-HIT-454 (Fu et al. 2012); these values are within the range of previously reported artificial duplicate percentages (Gomez-Alvarez et al. 2009). There is an additional source through which 454 long-insert paired-end reads acquire artificial duplicates, resulting in a higher rate of artificial duplicate reads than in 454 unpaired reads. As the insert size increases, the fraction of DNA fragments that become circularized decreases rapidly (see Materials and Methods). This means that as insert size increases, the number of paired-end fragments that are passed on to the amplification step decreases, resulting in amplification bias. Sequencing will thus yield a high number of duplicate reads. For example, the amount of duplicate reads in libraries with an insert size of 5 kb > 8 kb > 20 kb.
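To make the notion of an artificial duplicate concrete, the sketch below flags reads that begin identically, treating a read as a duplicate when it is identical to, or a prefix of, a longer read that has already been kept. This is a deliberately simplified stand-in and does not reproduce the clustering performed by CD-HIT-454; the function name and the example reads are hypothetical.

```python
def remove_artificial_duplicates(reads):
    """Keep one representative per group of reads that start identically.

    A read is treated as a duplicate when it is identical to, or a prefix
    of, a longer read that has already been kept. Quadratic in the number
    of reads, so suitable only for illustration.
    """
    kept = []
    for read in sorted(reads, key=len, reverse=True):
        if not any(longer.startswith(read) for longer in kept):
            kept.append(read)
    return kept

# Invented example: two copies of one read, a truncated copy, and a unique read.
toy_reads = ["ACGTACGTAA", "ACGTACGT", "ACGTACGTAA", "TTGACCA"]
unique_reads = remove_artificial_duplicates(toy_reads)
duplicate_fraction = 1 - len(unique_reads) / len(toy_reads)
print(unique_reads)        # ['ACGTACGTAA', 'TTGACCA']
print(duplicate_fraction)  # 0.5
```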


For generating high-quality scaffolds, it is crucial to thoroughly remove artificial duplicates, especially those originating from long-insert paired-end reads. To differentiate between legitimate and spurious/chimeric contig joins, genome assemblers evaluate the number of paired-end reads supporting a join. High-quality joins are supported by at least a threshold number of paired-end reads, whereas spurious/chimeric joins are not. Paired-end read duplicates, especially from a poor-quality or chimeric library, give false support to contig joins and result in either misassembled scaffolds or poor contiguity. Using CD-HIT-454 (Fu et al. 2012), I identified 38.94% of reads from the 11 kb 454 paired-end library as artificial duplicates (Table 2-2), almost four times the rate of artificial duplicates in the 454 unpaired libraries. Another potential problem that I noticed with large paired-end libraries is the formation of chimeric reads. Chimera-rich libraries pose a serious problem for sequence assembly and would result in the generation of misassembled genome sequence scaffolds. The chimeric nature of paired-end libraries was flagged by examining scaffold length N statistics from assemblies produced with incremental addition of paired-end libraries. In the case of non-chimeric paired-end libraries, the expectation is that scaffold length N statistics for incremental paired-end assemblies should continue to increase, because adding more paired-end reads will join more contigs. After a certain number of incremental additions, the increase will plateau, suggesting that additional paired-end libraries of the same insert size are not helpful in improving scaffold contiguity. However, in the case of chimera-rich paired-end libraries, scaffold contiguity of the incremental assemblies does not increase and may even decrease.
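The scaffold length N statistics used to compare incremental assemblies can be computed as sketched below. For a fraction x, the Nx length is the length of the shortest sequence in the smallest set of longest sequences that together contain at least that fraction of the assembled bases, and the Nx count is the number of sequences in that set. The function name and the scaffold lengths are invented for illustration and are not Amborella data.

```python
def n_statistic(lengths, fraction=0.5):
    """Return (Nx length, Nx count) for a collection of sequence lengths.

    fraction=0.5 gives N50, fraction=0.9 gives N90.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")

# Invented scaffold lengths (kb) for a toy assembly.
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n_statistic(toy_scaffolds))       # (70, 2)
print(n_statistic(toy_scaffolds, 0.9))  # (30, 5)
```

Tracking these values across assemblies built with progressively more paired-end data gives the growth curves discussed above: a rising then flattening N50 is expected for a sound library, whereas a flat or falling N50 points to a chimera-rich one.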


This is due to the inability of the genome assembler to resolve conflicting evidence from chimeric and true joins between contigs, which results in broken scaffolds. Using this strategy, I was able to identify that the 20 kb insert paired-end libraries are chimeric, and I then excluded them from genome assembly. Figure 2-2 illustrates that the growth of average scaffold size N statistics across incremental assemblies of 454 20 kb insert paired-end data diminishes, clearly suggesting the chimeric nature or poor quality of these libraries.

Illumina 3 kb mate-pair data

Illumina 3 kb mate-pair data were trimmed and quality filtered to retain only high-confidence sequence data. Out of 97,665,053 pairs of raw reads, 72,993,916 (74.74%) passed the filtering criteria and were further subjected to organelle contaminant screening. Approximately 3.18% and 0.53% of reads correspond to mitochondrial and chloroplast genome sequences, respectively, and these sequences were eliminated from further analysis. Generation of 2-5 kb mate-pair data using the standard protocol of Illumina mate-pair library construction is inexpensive compared to the 454 long-insert paired-end data generation protocol. However, the former methodology produces junction and inward-facing artifact reads. Ideally, during Illumina mate-pair library construction all the size-selected fragments should be biotin labeled, with proper mate-pair orientation (reads facing outwards) and approximately 3 kb insert sizes. However, it is known that some non-biotin-labeled fragments, whose reads have an insert size of ~400 bp and inward-facing orientation, will also be selected (Illumina 2009). Another artifact is a junction read. A read is classified as a junction read when the sequenced read passes through the junction of the two joined ends. The position of biotin labeling on the 400 bp size


selected fragments is normally distributed and represents the junction point between the two joined ends of 3 kb fragments (see Materials and Methods). Sequencing fragments in which the biotin label lies within 100 bp of either end using Illumina HiSeq (2 x 100 bp) will result in sequencing through the junction. Inward-facing and junction read artifacts need to be removed prior to assembly. Junction and inward-facing read filtering was performed on the reads that remained after quality trimming and organelle contaminant screening, by aligning all mate-pair reads to contigs produced by assembling all clean 454 unpaired data (Table 2-1). Out of 70,867,732 pairs of reads that passed the quality and organelle screening steps, 14,653,624 (20.68%) passed the junction and inward-facing read filtering criteria (see Materials and Methods). These 14,653,624 sequences (2.2X coverage) were then subjected to duplicate filtering; 40% were identified as duplicate reads, and the remaining 8,865,354 were used for assembly. This high rate of duplicate reads is due to the PCR enrichment step during Illumina mate-pair library construction (see Materials and Methods).

Genome Assembly and Size Estimate

All cleaned sequences (Table 2-1) were assembled using the Newbler assembler (Margulies et al. 2005). Our decision to use Newbler was influenced by the large proportion of 454 sequences used and the ability of Newbler to handle multiple sequence data types, which allowed BAC-end, Illumina, and 454 data to be combined. Our current assembly consists of 43,234 contigs with an average size of 15,456 bp (min = 436 bp; max = 287,935 bp), an N50 size of 29,456 bp, and an N50 count of 6,448. Scaffolding with the cleaned paired-end reads resulted in 5,745 scaffolds, with an average size of 123 kb (min = 1,732 bp; max = 15.98 Mb), an N50 size of 4.93 Mb, and an N50 count of 50. Based on the N90 statistics, 90% of our assembled sequence


resides in 155 scaffolds with .16 Mb or larger. This assembly is named version 1.0, and details of the assembly metrics are provided in Table 2-3.

Flow cytometry estimated the genome size of Amborella to be approximately 870 Mb (Leitch and Hanson 2002). According to this estimate, 82% of the Amborella genome is recovered in the current Newbler assembly. Additional genome size estimates were obtained from sequence-based methods. These methods, based on k-mer frequencies, expanding putative repeat regions, and sequence coverage across finished regions, suggested genome sizes of 793 Mb, 713 Mb, and 736 Mb, respectively. These results consistently suggest that the genome size estimated via flow cytometry is an overestimate. This overestimation may be due to the variability associated with flow cytometry based estimation of absolute DNA amounts (Dolezel Fontes et al. 2011). Thus I predict the genome size of Amborella to be close to the average of the three sequence-based estimates, which is 748 Mb. Based on the input data size and a 748 Mb genome, the high-quality sequence represents an average depth of coverage of approximately 31x, and the scaffolds cover >94% of the genome.

Assessment of Data Saturation and Assembly Completeness

Incremental assemblies

To assess data saturation and assembly completeness, I monitored contig and scaffold growth as a function of total input bases (Figures 2-3, 2-4, 2-5, and 2-6). Examining contig length statistics and genome coverage from assemblies produced with progressively larger sequence collections provides an indication of data saturation and assembly completeness (Table 2-4; Figures 2-3, 2-4, 2-5, and 2-6). Plateaus in contig length (Figure 2-3), genome coverage (Figure 2-4), and contig N50 sizes (Figure 2-5)


suggest that the maximum contig coverage and contiguity attainable with single-end sequence reads has been reached and that additional single-end sequence data would offer little or no improvement. In addition to indicating when our assembly was saturated for single-end reads (indicating that contig growth was maximized), this process monitored the effectiveness of the paired-end (PE) data in making contig joins. This in turn enabled us to determine when further addition of PE sequences, which are generally expensive and of variable complexity, was unlikely to improve the assembly substantially (Figure 2-6, Table 2-4). Scaffold sizes are likely to increase with the further addition of long-range mate-pair sequence (Table 2-4, Figure 2-6). Rather than constructing additional PE libraries to improve contiguity, a gap closure method based on whole genome (formerly optical) mapping technology was undertaken in collaboration with OpGen, Inc. (Gaithersburg, MD, USA) (see below).

Representation of BAC-end sequences and Amborella unigene sequences within the WGS assembly

Available BAC-end sequences (Zuccolo et al. 2011) and an Amborella unigene collection ( http://ancangio.uga.edu/content/amborella trichopoda ) were aligned to the assembled contigs with WU-BLASTN (version 2.0) to determine the proportion of each sequence collection represented within the assembled contig collection. BAC-end and unigene sequences were required to align over a minimum of 90% of their length, with greater than 95% identity. Over 98% of the 63,924 BAC-end sequences could be aligned, while greater than 70% of the 49,356 Amborella unigenes (with length of at least 600 bp) could be aligned along their lengths to the current assemblies. Of the remaining 30% of Amborella unigenes that failed these filtering criteria, 8.68% have partial alignments to the Amborella genome, 9.15% appear to be contaminant sequences


redundant (nr) protein database, while the remaining 11.66% did not have any significant alignments to sequences in the NCBI non-redundant protein database, suggestive of misassembled unigenes or novel transcripts. After excluding the above-mentioned insect contaminants and putative misassemblies, 88% of the remaining 39,804 unigenes align with high confidence (90% coverage along the length and 95% identity), while 11% align partially to the Amborella genome. Overall, the current Amborella assembly is unable to account for 1% of the unigenes.

Assessment of Assembly Based on Pre-existing Genomic Resources

Coverage of contig assemblies across finished sequence contigs

Two finished BAC contigs (IDs 431 and 1003), which together represent approximately 1 Mb of the Amborella genome (Zuccolo et al. 2011), can be used to assess the accuracy of contig assembly. All assembled contigs were aligned to these reference sequences with MUMmer (Kurtz et al. 2004). Over 93% of the bases in these BAC contigs are covered by our assembled contigs, which were required to align over at least 95% of their length with greater than 95% identity. The depth of contig coverage along each BAC is illustrated in Figures 2-7 and 2-8.

Comparison of the scaffolded assemblies against the available physical map contigs

The position and order of those BAC-end sequences that were incorporated into the Amborella scaffolded assembly were compared to the position of their parent BACs within the available physical map BAC fingerprinted contigs (Zuccolo et al. 2011). BACs associated with one another in a BAC fingerprint (FPC) physical map contig should be incorporated into the same Newbler-generated scaffold. Conversely, BACs from a


single FPC contig incorporated into several Newbler scaffolds could be an indication of misassembly. The Amborella physical map consisted of 3,106 FPC contigs (Zuccolo et al. 2011). Of these, 2,941 had at least one BAC that was incorporated into a Newbler that are in Newbler scaffolds were incorporated exclusively into one scaffold. Custom Python scripts were developed to match BACs from FPC scaffolds to the BACs incorporated in our Newbler-assembled scaffolds. The vast majority of FPC contigs, 2,906 (99%), meet that congruent contig definition. Thirty-five FPC contigs map to two or more scaffolds. Thirty-three of these map to two scaffolds, one maps to three, and the remaining one (ctg4047) maps to over 50 scaffolds (Table 2-5), which is consistent with evidence that this contig is composed of BACs containing repeat elements (Zuccolo et al. 2011). This implies that our assembly supports nearly all BAC fingerprint assembled contigs representing the physical map.

Assessment of whole genome assembly using FISH

The accuracy of the genome assembly was further assessed by FISH analysis. BACs assembled in 104 scaffolds containing 430 Mb (68%) of the genome assembly were cytogenetically localized by FISH to assess scaffold integrity (Figure 2-9 and Table 2-6). For each scaffold, the distance (in bp) between the positional coordinates of the co-assembled BACs assessed is reported, and the assembly is classified as FISH supported if the probes are co-localized, FISH inconclusive when co-localization is ambiguous (often because one or both probes stain the centromeres of all chromosomes), and misassembled when the probes localize to different chromosomes or chromosome arms. For example, AT_SBa0003J06 and AT_SBa0003J19 are two BACs co-assembled in scaffold 23 at a distance of 1.3 Mb and co-localized by FISH (Figure 2-9).
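The physical-map comparison described above (checking whether the BACs of each FPC contig land in a single Newbler scaffold) can be sketched in a few lines of Python. This is only an illustrative reconstruction, not the custom scripts used in this work: the function name, the two input dictionaries, and the toy BAC, contig, and scaffold identifiers are all hypothetical.

```python
from collections import defaultdict


def fpc_scaffold_congruence(bac_to_fpc, bac_to_scaffold):
    """Split FPC contigs into congruent ones and ones spread over scaffolds.

    bac_to_fpc:      hypothetical mapping of BAC id -> FPC contig id
    bac_to_scaffold: hypothetical mapping of BAC id -> assembly scaffold id
                     (only BACs whose ends were incorporated appear here)
    """
    scaffolds_per_fpc = defaultdict(set)
    for bac, fpc in bac_to_fpc.items():
        scaffold = bac_to_scaffold.get(bac)
        if scaffold is not None:  # skip BACs absent from the assembly
            scaffolds_per_fpc[fpc].add(scaffold)

    # Congruent: every incorporated BAC of the FPC contig is in one scaffold.
    congruent = sorted(f for f, s in scaffolds_per_fpc.items() if len(s) == 1)
    # Split: the FPC contig maps to two or more scaffolds.
    split = {f: sorted(s) for f, s in scaffolds_per_fpc.items() if len(s) > 1}
    return congruent, split


# Toy identifiers, invented for illustration.
toy_bac_to_fpc = {"bacA": "ctg1", "bacB": "ctg1", "bacC": "ctg2",
                  "bacD": "ctg2", "bacE": "ctg3"}
toy_bac_to_scaffold = {"bacA": "scaffold23", "bacB": "scaffold23",
                       "bacC": "scaffold5", "bacD": "scaffold9"}
congruent, split = fpc_scaffold_congruence(toy_bac_to_fpc, toy_bac_to_scaffold)
print(congruent)
print(split)
```

FPC contigs with no incorporated BAC (ctg3 in the toy input) are left out of both results, mirroring the restriction of the comparison to contigs with at least one BAC in the assembly.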


This analysis confirmed contiguity across major regions (56%) of 66 scaffolds containing 306 Mb (44%) of the genome assembly. Significantly, co-assembled BACs that were cytogenetically mapped to different chromosomes indicated potential misassemblies in only two scaffolds (Table 2-6). However, multiple BACs from 38 scaffolds containing 154 Mb produced inconclusive genome-wide centromeric signals (Table 2-6). Sequence alignments associated with the promiscuous probes indicate extensive sequence similarity and the presence of tandem.

Assessment of OpGen Whole Genome Maps and Assembly Improvement

Super-scaffolding of version 1.0 assembly

Amborella assembly version 1.0 scaffolds from Newbler whose lengths are greater than 200 kb (219 scaffolds) are long enough to be potentially mapped and are further super-scaffolding of 163 assembly scaffolds (Table 2-7). This process improved the contiguity of the assembly substantially, increasing the assembly N50 approximately 2-fold (4.9 Mb to 9.3 Mb) and increasing the N90 approximately 2.4-fold (1.2 Mb to 2.9 Mb), while increasing the size of the largest scaffold 23 Mb (Table 2-7, Table 2-8).

The fidelity of the joins made by OpGen in building super-scaffolds was evaluated using FISH and by comparing the Whole Genome Map to a preliminary new assembly that incorporated additional read filtering and a modest increase in 454 paired-end data (see below). Twenty pairs of version 1.0 scaffolds that were joined by were shown to be co-localized by FISH (Table 2-7).
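Because the N50 and N90 sizes and counts are used throughout this chapter to compare assemblies, a minimal sketch of how such statistics are conventionally computed may be helpful. The function name and the toy lengths are hypothetical, and the definition assumed here is the common one (the length of the shortest sequence in the smallest set of longest sequences that together hold at least the given fraction of the assembled bases); the assembler may report the values slightly differently.

```python
def nx_stats(lengths, fraction=0.5):
    """Return (Nx size, Nx count) for a list of contig or scaffold lengths.

    fraction=0.5 gives N50, fraction=0.9 gives N90.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            # 'length' is the Nx size; 'count' sequences were needed.
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths (bp), invented for illustration.
toy_lengths = [10, 8, 6, 4, 2]
n50_size, n50_count = nx_stats(toy_lengths, 0.5)
n90_size, n90_count = nx_stats(toy_lengths, 0.9)
print(n50_size, n50_count, n90_size, n90_count)
```

Under this definition a larger Nx size together with a smaller Nx count indicates a more contiguous assembly, which is the sense in which the fold increases above are read.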


assembly (V1.1) incorporating additional mate-pair sequences

A second assembly was constructed after including additional mate-pair sequences and experimenting with improved data filtering. The original collection of paired-end 11 kb insert libraries used in assembly version 1.0 was pooled by library (rather than by plate) and screened for duplicates with CD-HIT-454 (Fu et al. 2012). This resulted in the removal of an additional 302,740 pairs (105 Mb) of mate-paired sequence data from the 11 kb insert libraries relative to the data used in version 1.0. Additionally, three plates of 454 mate-pair sequence data were generated from three 8 kb insert libraries, and one plate of 454 mate-pair sequence data was generated from a 6 kb insert library. Sequence filtering and duplicate removal resulted in 324,421 8 kb insert library sequence mate pairs (122 Mb) with a mean read size of 374 bp, and 195,085 6 kb insert library sequence mate pairs (74 Mb) with a mean read size of 378 bp. These sequence data were included in a re-assembly using the same Newbler assembler parameters described for version 1.0.

Amborella assembly 1.0 scaffolds (219 scaffolds, length > 200 kb) were aligned against the scaffolds resulting from the new assembly (hereafter referred to as V1.1) using the Assembly To Assembly Comparison (ATAC) software (Istrail et al. 2004) with default parameters. Version 1.0 scaffolds that aligned along a minimum of 80% of their lengths were examined to determine their relative order and orientation with respect to the V1.1 assembly scaffolds, and putative joins were made between adjacent version 1.0 scaffolds. The joins between version 1.0 scaffolds that were suggested by V1.1 were compared to the joins between version 1.0 scaffolds that were suggested by two joins between 71 version 1.0 scaffolds were


identified within V1.1. Of these 42 joins, 30 agree with joins (Table 2-7), while the remaining 12 were unique to the V1.1 assembly.

BAC-free Version 1.0 Assembly and Genome super-scaffolding

A BAC-free version 1.0 assembly was produced by assembling all data used for the version 1.0 assembly, in the absence of the BAC-end sequences, with the Newbler assembler using parameters identical to those used for the version 1.0 assembly. BAC-free version 1.0 assembly metrics are presented in Table 2-9 run using the BAC-free version 1.0 scaffold sequences as described for the version 1.0 assembly (see above), resulting in 237 joins between 325 of the 381 scaffolds (lengths > 200 kb) that represented the BAC-free assembly (Table 2-9, Table 2-10). The scaffolded, BAC-free version 1.0 assembly has an N50 and N90 of 7,665,886 bp (32 scaffolds) and 1,493,879 bp (95 scaffolds), respectively, which is superior to the version 1.0 assembly (including BAC-end sequences). This result super-scaffolding yields assemblies with greater substitute for assembling with BAC-end sequences. This is particularly important because comprehensive BAC resources and BAC-end sequences are costly, low throughput, and not presently available for most evolutionary model organisms. Comparisons of assembly metrics between Amborella (with and without BAC ends) and other recently reported NGS-based whole genome assemblies are reported in Table 2-11.

Discussion

De novo genome might result in poor quality genome reconstruction. I illustrate potential pitfalls associated with assembling NGS


also present methodologies to identify and resolve these pitfalls by successfully applying them in the generation of a high-quality draft genome of Amborella.

NGS data filtering for de novo genome assembly involves the removal of low-quality, duplicate, contaminant, and chimeric reads. The first three filtering steps may result in a drastic reduction of the data size, which in turn reduces the complexity of the genome assembly. The Amborella NGS data filtering resulted in the loss of about 50% of the raw data, i.e., data coverage went from 64.1X to 30.7X after filtering (Table 2-1). Chimeric filtering minimizes mis-joined contigs during assembly. It is clear from Figure 2-2 that the addition of the 20 kb insert paired-end 454 libraries results in inferior scaffold N statistics, and these libraries were therefore flagged as chimeric. The strategy I employed to identify poor-quality data is to generate several incremental assemblies and construct contiguity curves, which are examined for inflections that indicate the addition of poor-quality sequence data. Once identified, the poor-quality data should be eliminated, and only high-quality and non-ambiguous data should be supplied to the genome assembler to minimize misassemblies in the genome reconstruction. This suggests that the generation of a high-quality genome assembly using WGSS is not yet a completely automated process. Rather, the assembly process requires a good understanding of the potential artifacts associated with NGS technologies and the ability to build bioinformatics pipelines tailored to identify and remove artifacts before NGS data are fed to the genome assembly software. This point is demonstrated by the fact that 454 long-insert paired-end and Illumina 3 kb mate-pair data are associated with unique artifacts. For example, inward-facing and junction read artifacts are unique to Illumina 3 kb mate pairs and are processed differently from 454 long-insert paired-end data.
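The incremental-assembly strategy described above can be illustrated with a small sketch that scans a contiguity curve for steps at which adding a data set makes the scaffold N50 worse. This is a simplified stand-in for inspecting the curves in Figure 2-2 by eye, not the procedure actually used here: the function name, the tolerance parameter, and all numeric values are hypothetical and were invented for illustration.

```python
def flag_suspect_additions(steps, tolerance=0.0):
    """Return labels of data sets whose addition lowered scaffold N50.

    steps: list of (label, scaffold_n50_bp) tuples, ordered as the data
           sets were added to successive incremental assemblies.
    tolerance: fractional drop ignored as noise (hypothetical knob).
    """
    suspects = []
    for (_, previous_n50), (label, current_n50) in zip(steps, steps[1:]):
        if current_n50 < previous_n50 * (1.0 - tolerance):
            suspects.append(label)
    return suspects


# Invented toy values for illustration only; these are NOT the thesis data.
toy_steps = [
    ("454 unpaired", 30_000),
    ("+ 3 kb mate pairs", 900_000),
    ("+ 11 kb paired ends", 4_900_000),
    ("+ 20 kb paired ends", 3_100_000),
]
suspect_labels = flag_suspect_additions(toy_steps)
print(suspect_labels)
```

A flagged data set is only a candidate for exclusion; a drop in N50 could have other causes, so the library itself would still need to be examined before it is removed.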


Using the draft genome sequencing of Amborella, I further demonstrate the importance of performing sequence-based genome size estimation in addition to traditional genome size estimation via flow cytometry. Based on the flow cytometry method, the Amborella genome size is estimated to be ~870 Mb (Leitch and Hanson 2002), which is ~100 Mb greater than the sequence-based genome size estimates, including k-mer frequencies (793 Mb), expanding putative repeat regions (713 Mb), and sequence coverage across finished regions (736 Mb). Overestimating the genome size leads to an underestimate of the fraction of the genome reconstructed by the genome assembler. Dolezel (2005) suggested that variability in genome size estimates obtained via flow cytometry may be influenced by the following factors: (1) the need for fresh tissues complicates the transfer of samples from the field to the laboratory and/or their storage; (2) the role of cytosolic compounds that interfere with quantitative DNA staining is not well understood; and (3) the use of a set of internationally agreed DNA reference standards still remains an unrealized goal (Dolezel .

To my knowledge, Amborella is the first large plant genome for which a highly contiguous genome assembly was obtained using a combination of FISH (fluorescent in situ hybridization) and whole genome restriction maps (derived from OpGen Inc.'s Whole Genome Mapping technology) to assess assembly accuracy. Successful implementation of this strategy in a typical non-model organism like Amborella demonstrates its potential to be used routinely in genome sequencing projects of many other organisms with limited genomic resources, and this multi-faceted approach may also extend to genomes with a more complex repeat structure than that of Amborella.


The super-scaffolds generated after the Genome scaffolding step are long enough to accommodate large portions of the chromosome arms. These long super-scaffolds of Amborella will also be useful for synteny comparisons with other sequenced angiosperm genomes, for investigating angiosperm-wide genome duplications, and for reconstructing the ancestral gene order of angiosperms.

Assembling genomes solely from NGS data often results in several hundred or even thousands of scaffolds, which is especially true for complex genomes like Amborella. For example, in a BAC-free assembly, 90% of the assembled Amborella genome is represented in the 293 largest scaffolds (referred to as the N90 count in Table 2-3). In contrast, genome assembly with the BAC-end paired library results in an N90 scaffold count of 155 (Table 2-3), increasing contiguity by approximately 50%. However, generating BAC-end paired libraries is a laborious and expensive task. Procuring long-insert paired-end reads via next-generation library preparation protocols is relatively cheap, but may result in chimeric data sets, as mentioned above for the 20 kb insert 454 paired-end libraries. High-throughput whole genome restriction maps, such as those from OpGen, may help bypass the preparation of long-insert libraries. The value of this approach was illustrated by comparing the Amborella version 1.0 assembly containing BAC-end paired sequences to the Genome scaffolding of the BAC-free version 1.0 assembly using genome restriction maps; the latter produced greater contiguity (Table 2-9 and Table 2-10). These results suggest that Genome super-scaffolding with genome restriction maps can act as a surrogate for very long (>150 kb) PE libraries (e.g., BAC-end sequences), which are expensive and time consuming to construct.


Genome scaffolding of de novo assembled scaffolds using genome restriction maps resulted in great improvements in the contiguity of the Amborella draft genome assembly. However, the scale of improvement depends on the number of scaffolds that are eligible to enter the Genome scaffolding step. For the Amborella genome sequencing project, only de novo assembled scaffolds of at least 200 kb were considered long enough to be potentially mapped to single molecule maps ( and these were used for scaffold joining by de novo sequence assembly scaffold sequences into restriction maps by in silico restriction enzyme digestion and these maps are extended iteratively by overlapping single molecule maps, thus building super-scaffolds (Chamala et al. 2013; Dong et al. 2012). In general, the larger the de novo assembled scaffold, the greater the number of restriction sites it will contain, which can be identified by in silico restriction enzyme digestion. In turn, larger numbers of in silico detected restriction sites make it more likely that the de novo assembled contig can be mapped unambiguously to overlapping single molecule maps (Dong et al. 2012). The minimum de novo assembled scaffold length suggested by OpGen Inc. is 200 kb.

NGS-based whole genome assemblies (Table 2-11) such as those of Brassica rapa (Wang et al. 2011) and potato (Xu et al. 2011) have 159 and 622 scaffolds, respectively, of over 200 kb, representing more than 90% of the assembled genome. Like Amborella, these genome assemblies would likely have gained significant improvements in assembly contiguity had these projects employed a whole genome optical map. However, not all de novo genome assemblies may yield highly contiguous scaffolds such as those of Amborella, especially in the absence of large insert


paired-end libraries. In such cases, the GB super-scaffolding approach may have limited use. For example, in the Cajanus cajan (pigeonpea) BAC-free de novo genome assembly (Varshney et al. 2012) (Table 2-11), only about 50% of the assembled genome is present in scaffolds of about 200 kb or more, suggesting that only 50% of the assembled genome would benefit from additional GB super-scaffolding. In contrast, over 90% of the assembled Amborella genome is present in scaffolds greater than 200 kb (Table 2-11). The relatively poor BAC-free contiguity in pigeonpea compared to Amborella may be due to a more complex genome repeat structure in the former. Poor contiguity can also arise from differences in the quality and insert sizes of the NGS datasets used. The Amborella genome assembly is dominated by long 454 Titanium reads (~365-533 bp), almost double the combined read length of the Illumina paired-end data (2 x 100 bp) used in pigeonpea (Varshney et al. 2012). The bottom line is that the quality of the de novo assembly must be adequate to assemble much of the genome into scaffolds greater than 200 kb in order to benefit from OpGen super-scaffolding.

Assembling the complete Amborella genome sequence also successfully demonstrated the application of FISH for evaluating the accuracy of a draft genome assembly and of the super-scaffolds generated from Genome . The assemblies of scaffolds with at least two BAC-end sequences derived from independent BACs can potentially be evaluated with FISH. FISH probes are constructed from the parent BAC clones and hybridized to chromosome preparations. A scaffold is flagged as properly assembled if the signals from the FISH probes are seen adjacent to each other. In contrast, signals that appear on different chromosomes, or that are much farther apart from one another than expected based on estimates of their distance within the assembly, indicate


misassembly (Figure 2-9 and Table 2-6). One limitation of employing FISH is encountered when the BAC clones used to construct labeled probes contain repetitive sequences. In these cases the probe signal will be detected at multiple locations in the genome and cannot be interpreted conclusively.

As discussed previously, constructing BAC resources is relatively expensive. In this project I was able to take advantage of pre-existing Amborella BAC clones for FISH-based assembly validation (Zuccolo et al. 2011). An alternative for whole genome projects without BAC resources would be to synthesize FISH probes using long-range polymerase chain reaction from primers designed in various portions of an assembled scaffold, and then to test the validity of the genome assembly in a manner similar to that used with BAC clones. In this way the FISH strategy could still be used for cytogenetic validation of an assembly even without generating BAC resources.

The genome assembly strategies discussed above were successful in generating a high-quality, contiguous genome assembly for Amborella. These methodologies were conceived and implemented from 2010 to 2012. However, several advances in sequencing technologies have been made subsequent to this project, and these should be evaluated and incorporated in future WGS projects.

As mentioned, the Amborella genome sequencing relied heavily on 454 sequence data for building contigs. The main reason for choosing 454 sequencing technology was its superior read length in comparison to the other NGS technologies available at the time. Currently, the Illumina MiSeq can generate 2 x 300 bp paired-end reads ( http://www.illumina.com ) that can be constructed to overlap and generate sequence fragments of >350 bp. These are more accurate in terms of base calling than 454 and approximately two orders of magnitude


cheaper. Today, 454 sequence data could easily be replaced with MiSeq data, with only a modest difference in mean sequence read length but a major reduction in cost. Additionally, in October of 2013, Roche announced the shutdown of 454 Life Sciences and that it will end support for 454 instruments by mid-2016 ( http://www.bio-itworld.com/BioIT_Article.aspx?id=131053 ). Thus, using the 454 sequencing technology is not an option for future sequencing projects.

Recent improvements in the read lengths of the PacBio SMRT sequencing technology have great potential for generating high-quality de novo assemblies of eukaryotic genomes. While most NGS platforms generate read lengths of a few hundred base pairs, PacBio SMRT sequencing with the P5-C3 chemistry has the potential to generate >300 Mb of sequence in reads with an average read length of ~8.5 kb and a maximum read size exceeding 30 kb ( http://files.pacb.com/pdf/PacBio_RS_II_Brochure.pdf ). Clearly such reads could span large repeat regions, dispense with the need for paired-end libraries, and potentially simplify assembly. Additionally, the long reads can be used to scaffold between contigs generated with shorter-read technology. Although PacBio SMRT sequencing is low throughput compared to the Illumina platforms (~375 Mb and 50k reads vs. 10-600 Gb), it is significantly less expensive than generating 454 paired-end libraries. The main drawback of PacBio sequence is its high per-base error rate (~14%). Sequence errors need to be corrected before reads are assembled into contigs. There are two common methods for PacBio error correction and genome assembly. The first is a non-hybrid approach that corrects errors by generating a consensus sequence derived from PacBio SMRT sequences collected at sufficient sequence redundancy, which are then


assembled into contigs (Chin et al. 2013). Until recently this approach was routinely applied only to small genomes such as those of bacteria (Roberts et al. 2013), but recently the Arabidopsis thaliana (Ler-0) genome was sequenced and assembled solely with P5-C3 chemistry by PacBio scientists. The assembly contiguity is very impressive, with an N50 contig size of 6.36 Mb and a maximum length of 13.21 Mb at 18.74X assembly coverage ( http://files.pacb.com/pdf/CS_SMRTApproach_FinishingPlantAnimalGenomes.pdf ). The second methodology is a hybrid approach in which sequences from short-read platforms are mapped to PacBio sequences and errors are corrected based on the alignment consensus (Koren et al. 2012). The genome assembly strategy of the hybrid approach is to use the Illumina reads both to correct the PacBio reads and to assemble short-read contigs de novo, which can then be scaffolded with the error-corrected PacBio reads. This strategy requires much shallower genome coverage by PacBio sequence than the non-hybrid approach.

Another interesting sequencing technology that can produce long fragments is Illumina Long Read Sequencing Technology (formerly marketed as Moleculo). This technology can generate synthetic long reads of up to 10 kb derived from local assemblies of standard Illumina reads ( http://www.illumina.com ).

PacBio SMRT and Illumina Long Read sequencing technologies will play an important role in future whole genome sequencing projects. These technologies can complement each other and may even be used in conjunction with traditional NGS-based de novo assemblies to improve genome assembly contiguity without the need for large insert (>10 kb) paired-end read libraries. Whether such technologies obviate the need for BAC- or fosmid-end sequences remains to be determined. However, these


technologies will certainly increase contig and scaffold contiguity to a point where the addition of a high-throughput whole genome map from OpGen Inc. ( http://www.opgen.com/mapit ) or comparable technologies (BioNano Genomics; http://www.bionanogenomics.com/ ) will rapidly and cost-effectively lead to highly contiguous de novo genome assemblies. In addition to super-scaffolding, whole genome optical maps can be used to identify and correct contig and scaffold misassemblies, which will minimize the need for FISH beyond anchoring large contiguous scaffolds to an organism's karyotype.

Materials and Methods

Sequencing

Plant material for the reference genome sequence was obtained from a single plant and its clones located in the University of California at Santa Cruz Botanical Garden, the Atlanta Botanic Garden, and the University of Florida (Amborella Genome Project 2013). Single-end (SE) genomic 454 FLX and SE 454 FLX+, 11 kb paired-end 454 FLX, and 3 kb PE Illumina HiSeq reads were generated using the respective standard protocols at Pennsylvania State University by the group of Dr. Stephan C. Schuster (Amborella Genome Project 2013). Sanger-sequenced BAC-end sequence reads were downloaded from NCBI (Zuccolo et al. 2011). The sequence data are summarized in Table 2-1.

454 Long-Insert Paired-end Library Protocol

The standard 454/Roche long-insert paired-end library preparation protocol (Roche 2009) involves DNA fragmentation and size selection of the appropriate insert size, DNA circularization, fragmentation of the circularized DNA, ligation of sequencing adaptors to the paired-end library constructs, and amplification of the paired-end library constructs (personal communication with Dr. Stephan C. Sch .


454 Sequence Data Quality Filtering

Identification of short reads

Reads were retained only if they met the following length criteria: single-end reads had to be at least 100 bp; paired-end reads had to have a combined length of at least 150 bp, with one member of the read pair greater than or equal to 100 bp and the other member a minimum of 50 bp. Reads failing these criteria were designated short reads and discarded.

Organelle contaminants

454 reads were aligned to the Amborella mitochondrial (Rice et al. 2013) and chloroplast (NCBI Accession Number: NC_005086.1) genome sequences with MosaikAligner (Hillier et al. 2008) using the banded Smith-Waterman algorithm with a hash size of 15. Reads aligned with less than 5% mismatch over 95% of the aligned read length were considered to potentially originate from an organellar genome. Single-end reads meeting this criterion were discarded. Paired-end reads were examined more carefully, as a mate pair with only one end matching an organellar genome could provide valuable information about gene transfer between the nuclear and organellar genomes. Paired-end reads were discarded only if both ends appeared to be of organellar origin.

Identification and removal of artificial duplicate reads

The clustering program CD-HIT-454 (Fu et al. 2012) was used to identify artificial duplicates. Reads with greater than 96% sequence identity, with no more than 1 insertion or deletion, and sharing the same start position were gathered into artificial duplicate clusters. The representative sequence from each cluster was kept for assembly, and all other reads in the cluster were discarded.
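The read-length and organelle screening rules given at the start of this section are simple enough to express as predicates. The sketch below is only an illustration of those stated rules, assuming that read lengths and per-end organelle alignment calls have already been computed; the function names and the boolean-tuple representation are hypothetical and are not taken from the pipeline used here.

```python
def keep_single_end(read_length):
    """Single-end 454 reads must be at least 100 bp."""
    return read_length >= 100


def keep_pair_by_length(length_a, length_b):
    """One mate >= 100 bp, the other >= 50 bp, combined length >= 150 bp."""
    longer, shorter = max(length_a, length_b), min(length_a, length_b)
    # The combined-length test is implied by the per-mate minima; it is
    # kept only to mirror the criteria as they are stated in the text.
    return longer >= 100 and shorter >= 50 and (longer + shorter) >= 150


def discard_as_organellar(end_hits):
    """Decide whether a read (or read pair) is removed as organellar.

    end_hits: tuple of booleans, one per read end (one element for a
    single-end read, two for a pair); True means that end aligned to an
    organellar genome under the thresholds given above.
    """
    return len(end_hits) > 0 and all(end_hits)


print(keep_pair_by_length(120, 60))           # pair passes the length rule
print(discard_as_organellar((True, False)))   # kept: only one end matches
```

Treating single-end reads and pairs with the same `all(...)` test reflects the rule that a pair is removed only when both ends look organellar, while a single-end read is removed on its one hit.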


Sanger BAC-end Data

Quality trimming

Amborella Sanger BAC-end sequences (Zuccolo et al. 2011) were quality trimmed with Lucy v1.2 (Chou and Holmes 2001) using the following parameters: error 0.025 0.9; bracket 10 0.02; window 50 0.07. This step resulted in the removal of 70 low-quality BAC-end sequences.

Organelle contaminants

All Amborella BAC-end sequences were aligned against the Amborella mitochondrial and chloroplast genome sequences using WU-BLASTN (version 2.0), and reads aligning with at least 90% identity along at least 90% of their length were classified as potential mitochondrial and chloroplast contaminants, respectively. Reads were discarded only if both members of a BAC-end pair appeared to be of organellar genome origin.

Illumina 3 kb Mate-pair Library Protocol

The Illumina 2-5 kb mate-pair protocol was used to generate the 3 kb mate-pair data (Illumina 2009). This protocol involves the following steps: (i) fragmentation of DNA, (ii) biotin labeling, (iii) size selection of 3.5 kb biotin end-labeled molecules, (iv) circularization of the 3.5 kb molecules, (v) shearing of the 3.5 kb molecules, (vi) capture and size selection of only the biotin-labeled ~400 bp sheared fragments, (vii) ligation of the appropriate sequencing primers, and (viii) PCR enrichment of the size selected fragments.

Illumina 3 kb Mate-pair Data

Quality trimming

Modules from the Fastx toolkit (version 0.0.13) ( http://hannonlab.cshl.edu/fastx_toolkit/ ) were used to perform quality trimming of


Illumina 3 kb mate-pair data. First, fastq_quality_trimmer was run to trim nucleotides with quality values lower than 30 from the ends of sequences, and sequences shorter than 85 bp were removed. The output was processed through the fastx_artifacts_filter module run with default parameters, which removes reads in which all but three bases are identical. Finally, the fastq_quality_filter module was run with the following parameters: q (minimum quality score to keep) 25; p (minimum percent of bases that must have [q] quality) 90.

Organelle contaminants

Illumina 3 kb mate-pair data were aligned against the organellar genome sequences with the MosaikAligner package version 1.1.0020 (Hillier et al. 2008) using the following program parameters: hs (hash size) 15; act (alignment candidate threshold) 35; mm (number of mismatches allowed) 5; bw (uses the banded Smith-Waterman algorithm) 29; mmal (uses the aligned read length instead of the original read length when counting errors); minp (minimum percentage of the read length that should be aligned) 0.90; m (alignment mode) all; a (alignment algorithm) all; p (CPUs used) 8; mhp (maximum number of hash positions to use) 100. Only those read pairs whose individual ends both mapped to the organellar genome sequence were considered organelle contaminants.

Removal of junction and inward-facing reads from Illumina mate-pair libraries

Identification of junction and inward-facing reads was performed on the reads that remained after quality trimming and organelle contaminant screening. All mate-pair reads were aligned using Mosaik (v1.1.0020) (Hillier et al. 2008) to contigs produced by assembling all clean 454 unpaired data (Table 2-1) with the Newbler assembler (Margulies et al. 2005). Mosaik aligner was run with the following parameters: hs (hash

PAGE 57

size) 15; -act (alignment candidate threshold) 35; -mm (number of mismatches allowed) 5; -bw (uses the banded Smith-Waterman algorithm) 29; -m (alignment mode) unique; -a (alignment algorithm) all; -p (CPUs used) 7; -mhp (maximum number of hash positions to use) 100. Read pairs were discarded as junction/inward-facing reads unless (i) both read ends aligned completely, and (ii) when both reads aligned completely to the same contig, they were spaced at least 1000 bp, but not more than 5000 bp, from one another.

Identification and removal of artificial duplicate reads
Illumina read pairs remaining after filtering for quality, organelle contaminants, junction reads, and inward-facing reads were checked for artificial duplicates. Read pairs were first concatenated, and the concatemers were processed through the CD-HIT-454 (Fu et al. 2012) package using the same program parameters as described for removal of artificial duplicates from 454 data (above).

Sequence Assembly
All cleaned sequences were assembled using the Newbler assembler (Margulies et al. 2005) with the options -scaffold -het -large -cpu 3 -siod on a processor node with 256 GB of RAM. Assembly metrics of version 1.0 are detailed in Table 2-12.

Coverage Analysis and Estimating the Size of the Amborella Genome Sequence
An accurate estimate of sequencing depth in high-information-content (i.e., non-repeat) regions of our assembly is useful for a number of calculations, such as estimating genome size and completeness of sequencing coverage. Several methods were used to estimate sequencing depth in high-information-content regions: alignment to fully sequenced BAC contigs, alignment to ESTs, and k-mer frequency estimation.
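The retention rule for mate pairs described above (both ends aligned completely and, when on the same contig, spaced 1000 to 5000 bp apart) can be sketched as follows. This is a minimal illustration only, assuming the alignments have already been parsed into simple records; the `EndAlignment` class, its field names, and `keep_pair` are hypothetical and are not part of Mosaik, and measuring separation from start coordinate to start coordinate is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndAlignment:
    """Hypothetical record for one aligned read end (not a Mosaik data type)."""
    contig: str       # name of the contig the read end aligned to
    start: int        # leftmost aligned position on the contig, in bp
    aligned_len: int  # number of read bases included in the alignment
    read_len: int     # full length of the read end


def aligned_completely(end: Optional[EndAlignment]) -> bool:
    """A read end counts as completely aligned when every base is aligned."""
    return end is not None and end.aligned_len == end.read_len


def keep_pair(a: Optional[EndAlignment], b: Optional[EndAlignment],
              min_sep: int = 1000, max_sep: int = 5000) -> bool:
    """Return True if the pair is retained, False if it is discarded
    as a putative junction or inward-facing pair."""
    # Condition (i): both ends must align completely.
    if not (aligned_completely(a) and aligned_completely(b)):
        return False
    # Condition (ii): applies only when both ends hit the same contig.
    if a.contig == b.contig:
        separation = abs(a.start - b.start)  # assumption: start-to-start distance
        return min_sep <= separation <= max_sep
    # Ends on different contigs cannot be checked for spacing and are kept.
    return True
```

Under these assumptions, a pair whose ends sit 3 Kb apart on one contig is kept, while a pair spaced 300 bp apart on the same contig is discarded as a likely inward-facing pair.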

PAGE 58

Genome size estimate based on sequence coverage across finished regions
Two finished BAC contigs (IDs 431 and 1003), together comprising approximately 1 Mb of the Amborella genome (Zuccolo et al. 2011), were used as proxies to represent the sequence composition of the entire genome. All clean 454 unpaired reads were aligned to these contig sequences with MosaikAligner (Hillier et al. 2008) v1.1.0020, using the banded Smith-Waterman algorithm (-bw) and a hash size (-hs) of 15. Reads were required to align over a minimum of 95% of their length, with greater than 95% identity. The depth of read alignment coverage along each BAC contig is illustrated in Figures 2-10 and 2-11. The mean coverage depths across BAC contigs 431 and 1003 were both 32X (Table 2-12); the median coverage depth was 29X for both contigs. Small gaps representing 1.30% of BAC contig 431 and 0.57% of BAC contig 1003 remained after read alignment. The median coverage depth of the 454 sequence reads across these two assembled BAC contigs provides one estimate of genome size. Dividing 20.7 Gb of unpaired 454 sequences by the median depth of 29X gives an estimated genome size of 713 Mb.

Estimating genome size using k-mer frequencies
The Potato Genome Sequencing Consortium (Xu et al. 2011) described a method for estimating genome coverage, and thus genome size, based on the frequency of k-mers appearing in sequencing reads. Because this method uses only raw sequencing reads, the estimate is independent of any finished assembly. The k-mer size (k) is chosen to be large enough that 4^k is close to, but greater than, previous genome size estimates, so that each k-mer theoretically appears once in the genome. For the Amborella genome, this suggests a minimum k of 15. The number of appearances for each distinct k-mer (n) within the

PAGE 59

genome sequence is determined, and this is, in turn, multiplied by the number of distinct k-mer sequences appearing n times to yield the k-mer volume. Because each k-mer should appear once in a random DNA sequence, the n associated with the largest k-mer volume is inferred to represent the true depth of coverage in high-information-content regions of the genome. Two k-mer frequency analyses were performed using all clean 454 reads, one with k = 15 and a second with k = 17 (Figure 2-12, Table 2-13). The k-mer estimates suggest a genome size of 793 Mb.

Genome size estimation by repeat expansion
Repeats and other low-complexity genomic regions often assemble into deep contigs that inaccurately represent their distribution in the genome. Because highly repetitive sequences tend to assemble as overlapping fragments rather than extending contigs, these repeat-laden contigs are of higher depth than contigs representing non-repetitive portions of the genome. The sequence depth in low-complexity contigs can be compared to the sequence depth of high-complexity contigs to estimate the amount of excess read overlap. Assuming that the excess overlapped reads represent over-collapsed, repeat-laden regions of the genome enables a length estimate for the genomic region represented by the contig. Contigs were first partitioned into two classes: high complexity and low complexity. Figure 2-13 shows the frequency of contig depths. There are two peaks, one at a depth of 18X and another at 36X, which likely represent high-complexity genomic regions and repeat-laden regions, respectively. Contigs with a read depth higher than 36X putatively represent low-complexity regions. The 95th percentile value for contig depth is 73X, which was selected as a conservative cutoff; contigs with a depth of 74X or greater are designated low complexity. Assuming that depth in non-repeat-containing assembled contigs is 18X suggests that the genome

PAGE 60

has been sampled 18-fold. This implies a scaling factor for the low-complexity contigs of d/18, where d is the depth of the low-complexity contig. Multiplying a contig's length by its scaling factor approximates the length of the genome segment represented by the reads within the contig. Summing all contig lengths after scaling suggests that the assembled contigs represent 736 Mb of genome sequence. Based on these three independent assessments, the Amborella genome size was estimated as their mean, ~748 Mb.

Cytogenetics
Details of the methods and materials for this work are described in Chamala et al. (2013).

OpGen Whole Genome Mapping
OpGen was the service provider that generated the whole-genome optical map for Amborella and performed super-scaffolding of the Amborella scaffolds produced by the Newbler assembler (Margulies et al. 2005). Details of the methods and materials for this work are described in Chamala et al. (2013).
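The genome size arithmetic in the coverage analysis above can be checked with a few lines of Python. This is a sketch using only the rounded figures quoted in the text (20.7 Gb, 29X, and the three estimates of 713, 793, and 736 Mb), so the results differ slightly from the reported values; the function and variable names are illustrative only.

```python
def coverage_based_size_mb(total_bases_gb: float, median_depth: float) -> float:
    """Genome size in Mb = total sequenced bases / median depth of coverage."""
    return total_bases_gb * 1000.0 / median_depth


# Estimate from read depth across the finished BAC contigs (rounded inputs).
bac_estimate_mb = coverage_based_size_mb(20.7, 29)

# The three estimates as reported in the text, in Mb.
reported_mb = {
    "BAC contig coverage": 713,
    "k-mer frequency": 793,
    "repeat expansion": 736,
}
mean_mb = sum(reported_mb.values()) / len(reported_mb)

print(f"Coverage-based estimate: {bac_estimate_mb:.1f} Mb")
print(f"Mean of the three reported estimates: {mean_mb:.1f} Mb")
```

With these rounded inputs the coverage-based estimate comes out just under 714 Mb and the mean of the three estimates just over 747 Mb, in line with the 713 Mb and ~748 Mb figures given in the text.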

PAGE 61

Figure 2-1. Whole genome shotgun strategy. A) Single and short-insert paired-end DNA reads are assembled into contigs based on single and short-insert paired reads, B) Contigs are linked together into scaffolds based on evidence from long-insert (3 Kbp, 8 Kbp, 20 Kbp, 120 Kbp, etc.) paired-end/mate-pair reads, C) Scaffolds are further linked together to form super-contigs based on evidence from optical maps, physical maps, genetic maps, and FISH.

PAGE 62

Figure 2-2. Incremental assemblies with 20 Kb paired-end libraries. Growth of average scaffold size and N statistics across incremental assemblies of 454 20 Kb insert paired-end data, when added to 28.5 runs of single-end 454 data, all 8 Kb insert 454 paired-end data, and all BAC-end data.

PAGE 63

Figure 2-3. Average contig length growth of incremental assemblies of single-end data (Chamala et al. 2013).

PAGE 64

Figure 2-4. Genome coverage of incremental assemblies of single-end data (Chamala et al. 2013).

PAGE 65

Figure 2-5. N statistics of incremental assemblies of single-end data (Chamala et al. 2013).

PAGE 66

Figure 2-6. Average scaffold size growth of incremental assemblies of 454 11 Kb inserts (Chamala et al. 2013).

PAGE 67

Figure 2-7. Coverage of assembled contigs across BAC contig 431. MUMmer plot of WGS-assembled contigs aligned across Amborella Sanger-sequenced BAC contig 431 (487 Kb). Alignment position is represented along the X axis, while the percent (%) identity of the alignment is represented by the Y axis. Contigs that align along the forward strand are illustrated in red; reverse matches are illustrated in blue (Chamala et al. 2013).

PAGE 68

Figure 2-8. Coverage of assembled contigs across BAC contig 1003. MUMmer plot of WGS-assembled contigs aligned across Amborella Sanger-sequenced BAC contig 1003 (630 Kb). Alignment position is represented along the X axis, while the percent (%) identity of the alignment is represented by the Y axis. Contigs that align along the forward strand are illustrated in red; reverse matches are illustrated in blue (Chamala et al. 2013).

PAGE 69

Figure 2-9. FISH co-localized signal. Two BACs, AT_SBa0003J06 (green) and AT_SBa0003J19 (red), are co-assembled and separated by 1,298,028 bp in scaffold 23. The signals are co-localized, which supports the assembly that included the end sequences associated with these BACs (Chamala et al. 2013).

PAGE 70

Figure 2-10. Read coverage of BAC contig 431. Coverage of all 454 unpaired data across BAC contig 431 (red line indicates the median) (Chamala et al. 2013).

PAGE 71

Figure 2-11. Read coverage of BAC contig 1003. Coverage of all 454 unpaired data across BAC contig 1003 (red line indicates the median) (Chamala et al. 2013).

PAGE 72

Figure 2-12. K-mer volume plots. K-mer volume plots for Amborella k-mers of size 15 (A) and 17 (B) identified within 454 WGS sequences (Chamala et al. 2013).
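The k-mer volume plotted in Figure 2-12 can be illustrated with a small sketch. It assumes reads are plain strings of A, C, G, and T, ignores reverse complements and sequencing errors for simplicity, and uses hypothetical function names; it is a toy illustration of the calculation described in the text, not the pipeline used for the Amborella data.

```python
from collections import Counter


def minimum_k(genome_size_bp: int) -> int:
    """Smallest k for which 4**k exceeds the expected genome size."""
    k = 1
    while 4 ** k <= genome_size_bp:
        k += 1
    return k


def kmer_volumes(reads, k):
    """Map each multiplicity n to n * (number of distinct k-mers seen n times)."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read and count each k-mer.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # distinct_by_n[n] = how many distinct k-mers occur exactly n times.
    distinct_by_n = Counter(counts.values())
    return {n: n * distinct for n, distinct in distinct_by_n.items()}


def peak_depth(volumes):
    """Multiplicity n with the largest k-mer volume (the inferred depth)."""
    return max(volumes, key=volumes.get)
```

For a genome size near 748 Mb, `minimum_k(748_000_000)` returns 15, because 4^14 is about 268 million and 4^15 is about 1.07 billion, which agrees with the minimum k stated in the text.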

PAGE 73

Figure 2-13. Contig depth frequency plot. Frequency of contigs as a function of sequence depth (Chamala et al. 2013).
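The repeat-expansion estimate that relies on this depth distribution can be sketched as below. The sketch assumes each contig is given as a (length in bp, read depth) pair, uses the 18X single-copy depth and the 74X low-complexity cutoff stated in the text, and leaves contigs below the cutoff unscaled; the function names and input layout are hypothetical.

```python
def expanded_length(length_bp: int, depth: float,
                    single_copy_depth: float = 18.0,
                    low_complexity_cutoff: float = 74.0) -> float:
    """Scale a low-complexity contig by d / 18; leave other contigs unchanged."""
    if depth >= low_complexity_cutoff:
        return length_bp * depth / single_copy_depth
    return float(length_bp)


def repeat_expanded_genome_size(contigs) -> float:
    """Sum the scaled lengths of all (length_bp, depth) contig records."""
    return sum(expanded_length(length, depth) for length, depth in contigs)


# Toy input: two contigs below the cutoff and one collapsed repeat at 180X.
toy_contigs = [(1000, 18.0), (1000, 36.0), (1000, 180.0)]
print(repeat_expanded_genome_size(toy_contigs))
```

In the toy input only the 180X contig is expanded, by a factor of 180/18 = 10, so 3,000 bp of assembled sequence stands in for 12,000 bp of genome.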

PAGE 74

Table 2-1. Sequenced genome library statistics.
Library Type (Insert Size) | Sequencing Information | Reads, Raw | Reads, Clean | Data Size, Raw | Data Size, Clean | Mean Length, Raw | Mean Length, Clean | Theoretical Depth of Coverage, Raw | Theoretical Depth of Coverage, Clean
454 Titanium | 28.5 Plates | 38.8 M | 27.6 M | 14.2 Gb | 10.5 Gb | 365 bp | 381 bp | 18.9X | 13.9X
454 Titanium plus | 17 Plates | 22.4 M | 18.3 M | 12 Gb | 10.2 Gb | 533 bp | 557 bp | 16X | 13.6X
454 paired ends (11 Kb) | 5 Plates | 6.1 M | 1.92 M | 2.2 Gb | 0.699 Gb | 348 bp | 364 bp | 2.9X | 0.9X
Illumina mate pairs (3 Kb) | Full lane HiSeq2000 | 195 M | 17.7 M | 19.7 Gb | 1.68 Gb | 101 bp | 95 bp | 26.2X | 2.2X
Sanger BAC ends (130 Kb) | 5.2X Library | 69466 | 63924 | 48 Mb | 44 Mb | 695 bp | 689 bp | 0.06X | 0.06X
Total |  | 262 M | 66 M | 48 Gb | 23 Gb |  |  | 64.1X | 30.7X

PAGE 75

Table 2-2. Genome library contaminant statistics.
Library Type (Insert Size) | Mitochondrial | Chloroplast | Junction/Inward-Facing Reads | Artificial Duplicates | Short | Total
454 Titanium | 12.01% | 2.86% | N/A | 12.94% | 5.44% | 33.25%
454 Titanium plus | 5.02% | 0.99% | N/A | 10.36% | 3.77% | 20.14%
454 paired ends (11 Kb) | 7.34% | 1.88% | N/A | 38.94% | 7.34% | 55.5%
Illumina mate pairs (3 Kb) | 3.18% | 0.53% | 57.56% | 6% | 25.26% | 92.53%
Sanger BAC ends (130 Kb) | 1.09% | 2.89% | N/A | N/A | N/A | 3.98%

PAGE 76

Table 2-3. Contig and scaffold assembly metrics for the Amborella assembly version 1.
Metric | BAC-Free Version 1.0 Contigs | BAC-Free Version 1.0 Scaffolds | Version 1.0 Contigs | Version 1.0 Scaffolds
Number | 44,402 | 5,688 | 43,234 | 5,745
Total Size (bp) | 668,207,383 | 703,384,731 | 668,257,121 | 706,332,648
Genome Covered (%) | 88.9 | 93.5 | 88.9 | 93.9
Largest Size (bp) | 242,004 | 9,633,646 | 287,935 | 15,980,527
Mean Size (bp) | 15,049 | 123,661 | 15,456 | 122,947
N10 Size (bp) | 80,951 | 6,023,163 | 85,018 | 9,414,115
N10 Count | 619 | 10 | 579 | 7
N25 Size (bp) | 51,130 | 4,504,398 | 52,723 | 6,874,971
N25 Count | 2,226 | 30 | 2,129 | 20
N50 Size (bp) | 28,655 | 2,766,685 | 29,456 | 4,927,027
N50 Count | 6,685 | 81 | 6,448 | 50
N75 Size (bp) | 14,435 | 1,292,119 | 14,812 | 2,820,768
N75 Count | 14,860 | 176 | 14,404 | 97
N90 Size (bp) | 6,963 | 508,366 | 7,108 | 1,154,593
N90 Count | 24,624 | 293 | 23,962 | 155
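The N10 to N90 sizes and counts in Table 2-3 are not defined in the text. A minimal sketch, assuming the conventional definition (sort sequences from longest to shortest and accumulate lengths until the chosen percentage of the total assembly size is reached), is given below; `n_statistic` is a hypothetical helper, not part of Newbler.

```python
def n_statistic(lengths, percent):
    """Return (size, count) for the N<percent> statistic.

    size  = length of the sequence at which the running total of lengths,
            taken from longest to shortest, first reaches percent% of the
            total assembly size
    count = number of sequences needed to reach that point
    """
    ordered = sorted(lengths, reverse=True)
    threshold = percent * sum(ordered) / 100.0
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty and percent must be in (0, 100]")


# Toy example: nine contigs totalling 54 bp.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
n50_size, n50_count = n_statistic(toy_lengths, 50)
```

In the toy example the three longest contigs (10, 9, and 8 bp) sum to 27 bp, half of the 54 bp total, so the N50 size is 8 and the N50 count is 3.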

PAGE 77

Table 2-4. Scaffold metrics for incremental assemblies of 454 paired-end 11 Kb libraries.
454 Paired Reads (11 Kb insert size) | 0 | 323,174 | 646,349 | 969,524 | 1,292,699 | 1,615,874
N50 Scaffold Size (bp) | 30,008 | 489,143 | 1,430,625 | 2,399,258 | 3,450,317 | 3,987,348
N50 Scaffold Count | 3,935 | 395 | 157 | 96 | 67 | 58
N90 Scaffold Size (bp) | 6,933 | 12,644 | 30,888 | 76,499 | 365,895 | 676,301
N90 Scaffold Count | 24,755 | 5,483 | 1,431 | 444 | 260 | 198

PAGE 78

Table 2-5. FPC contigs mapping to more than one assembled scaffold, determined by alignment of end-sequenced BACs.
FPC Contig ID | v1.0 Scaffold
ctg2 | scaffold00040
ctg2 | scaffold00105
ctg25 | scaffold00041
ctg25 | scaffold00043
ctg42 | scaffold00049
ctg42 | scaffold00174
ctg82 | scaffold00031
ctg82 | scaffold00051
ctg140 | scaffold00039
ctg140 | scaffold00102
ctg169 | scaffold00173
ctg169 | scaffold00187
ctg262 | scaffold00076
ctg262 | scaffold00201
ctg323 | scaffold00078
ctg323 | scaffold00142
ctg353 | scaffold00015
ctg353 | scaffold00083
ctg385 | scaffold00063
ctg385 | scaffold00137
ctg397 | scaffold00110
ctg397 | scaffold00296
ctg415 | scaffold00041
ctg415 | scaffold00138
ctg460 | scaffold00015
ctg460 | scaffold00223
ctg528 | scaffold00055
ctg528 | scaffold00152
ctg607 | scaffold00005
ctg607 | scaffold00020
ctg609 | scaffold00048
ctg609 | scaffold00120
ctg616 | scaffold00004
ctg616 | scaffold00071
ctg661 | scaffold00033
ctg661 | scaffold00035
ctg664 | scaffold00054
ctg664 | scaffold00118

PAGE 79

Table 2-5. Continued
FPC Contig ID | v1.0 Scaffold
ctg786 | scaffold00054
ctg786 | scaffold00270
ctg835 | scaffold00106
ctg835 | scaffold00111
ctg839 | scaffold00139
ctg839 | scaffold00195
ctg870 | scaffold00067
ctg870 | scaffold00148
ctg895 | scaffold00066
ctg895 | scaffold00106
ctg895 | scaffold00129
ctg968 | scaffold00019
ctg968 | scaffold00028
ctg1223 | scaffold00006
ctg1223 | scaffold00061
ctg1242 | scaffold00037
ctg1242 | scaffold00290
ctg1789 | scaffold00011
ctg1789 | scaffold00012
ctg1879 | scaffold00068
ctg1879 | scaffold00217
ctg1981 | scaffold00005
ctg1981 | scaffold00058
ctg2758 | scaffold00075
ctg2758 | scaffold00093
ctg3317 | scaffold00027
ctg3317 | scaffold00067
ctg3620 | scaffold00010
ctg3620 | scaffold00011
ctg4047 | scaffold00001
ctg4047 | scaffold00002
ctg4047 | scaffold00003
ctg4047 | scaffold00004
ctg4047 | scaffold00006
ctg4047 | scaffold00007
ctg4047 | scaffold00008
ctg4047 | scaffold00009
ctg4047 | scaffold00010

PAGE 80

Table 2-5. Continued
FPC Contig ID | v1.0 Scaffold
ctg4047 | scaffold00011
ctg4047 | scaffold00014
ctg4047 | scaffold00015
ctg4047 | scaffold00016
ctg4047 | scaffold00018
ctg4047 | scaffold00019
ctg4047 | scaffold00021
ctg4047 | scaffold00022
ctg4047 | scaffold00023
ctg4047 | scaffold00024
ctg4047 | scaffold00028
ctg4047 | scaffold00029
ctg4047 | scaffold00033
ctg4047 | scaffold00036
ctg4047 | scaffold00038
ctg4047 | scaffold00041
ctg4047 | scaffold00044
ctg4047 | scaffold00045
ctg4047 | scaffold00046
ctg4047 | scaffold00049
ctg4047 | scaffold00050
ctg4047 | scaffold00053
ctg4047 | scaffold00056
ctg4047 | scaffold00057
ctg4047 | scaffold00058
ctg4047 | scaffold00060
ctg4047 | scaffold00062
ctg4047 | scaffold00066
ctg4047 | scaffold00077
ctg4047 | scaffold00080
ctg4047 | scaffold00092
ctg4047 | scaffold00094
ctg4047 | scaffold00095
ctg4047 | scaffold00099
ctg4047 | scaffold00108
ctg4047 | scaffold00131
ctg4047 | scaffold00132
ctg4047 | scaffold00140

PAGE 81

Table 2-5. Continued
FPC Contig ID | v1.0 Scaffold
ctg4047 | scaffold00142
ctg4047 | scaffold00144
ctg4047 | scaffold00193
ctg4048 | scaffold00064
ctg4048 | scaffold00118

PAGE 82

Table 2-6. FISH assessments of assembly scaffolds. Supported inter-scaffold joins and chromosomal assignments are given in brackets adjacent to scaffold IDs (superscript in the original).
AmTr_v1_Scaffold | Length (bp) | FISH Assessed (bp) | FISH Result
1 | 15980527 | 5830080 | FISH supported
2 [Chr2 and Chr4] | 11522362 | 1672153 | Mis-assembled
3 | 11085951 | 7368772 | FISH inconclusive
4 | 10537363 | 8472629 | FISH inconclusive
5 | 9585472 | 8322815 | FISH inconclusive
6 [Joined by pre-v1 assembly with 51] | 9414115 | 7997003 | FISH supported
7 [Chr12, Joined by post-v1 assembly with 131] | 9499498 | 8732041 | FISH supported
8 | 9263929 | 1906165 | FISH supported
9 [Chr7] | 9389330 | 3978571 | FISH supported
10 [Chr4] | 9150321 | 4438029 | FISH supported
11 [Chr3, Joined by GB with 48] | 8972411 | 5340435 | FISH supported
12 [Chr3, Joined by GB with 165] | 8757819 | 5287502 | FISH supported
13 [Chr6] | 8123900 | 7655272 | FISH supported
14 [Chr2, Joined by GB and post-v1 assembly with 83] | 7564463 | 6900472 | FISH supported
15 [Chr1, Joined by GB with 93] | 7405751 | 806918 | FISH supported
17 | 7341449 | 451981 | FISH supported
19 | 7204098 | 1256816 | FISH supported
20 [Chr1] | 6874971 | 3668083 | FISH inconclusive
21 [Chr1, Joined by GB with 130] | 6860876 | 4889892 | FISH supported
22 [Chr2] | 6709877 | 4279745 | FISH supported
23 [Chr8, Joined by GB with 80] | 6666511 | 2709136 | FISH supported
25 [Chr1, Joined by post-v1 assembly with 58] | 6368207 | 879552 | FISH supported
26 | 6376714 | 3773156 | FISH inconclusive
27 | 6196301 | 602662 | FISH inconclusive
28 | 6207899 | 3913965 | FISH inconclusive
29 [Chr9, Joined by GB with 171] | 6285685 | 6258528 | FISH supported
31 | 5744721 | 3913538 | FISH inconclusive
32 [Chr2] | 5900637 | 1025710 | FISH supported
34 | 5737664 | 3602236 | FISH inconclusive
35 | 5595234 | 3212839 | FISH inconclusive

PAGE 83

Table 2-6. Continued
AmTr_v1_Scaffold | Length (bp) | FISH Assessed (bp) | FISH Result
36 [Chr5, Joined by GB with 136] | 5613473 | 1691021 | FISH supported
38 [Chr1 and Chr4] | 5499001 | 1754441 | Mis-assembled
39 [Chr7, Joined by GB with 148] | 5379737 | 5376191 | FISH supported
41 [Joined by GB with 69] | 5369156 | 3213423 | FISH supported
42 [Chr2, Joined by GB and post-v1 assembly with 81] | 5115746 | 3652447 | FISH inconclusive
43 | 5098829 | 519832 | FISH supported
44 [Chr9, Joined by pre-v1 assembly with 29] | 5106096 | 2335861 | FISH supported
45 [Chr2, Joined by GB with 57] | 5143656 | 2775735 | FISH supported
47 [Joined by post-v1 assembly with 154] | 4990407 | 3239634 | FISH supported
48 [Chr3, Joined by GB with 11] | 5155698 | 2714625 | FISH supported
49 | 5090928 | 4133956 | FISH inconclusive
50 | 4903570 | 3264385 | FISH inconclusive
51 [Joined by pre-v1 assembly with 6] | 4861392 | 3968201 | FISH supported
52 | 4904977 | 1895828 | FISH inconclusive
53 | 4927027 | 2136024 | FISH inconclusive
54 | 4818897 | 2839825 | FISH inconclusive
56 [Chr5, Joined by GB with 109] | 4878105 | 4033512 | FISH supported
57 [Chr2, Joined by GB with 45] | 4629110 | 2838279 | FISH supported
58 [Chr1, Joined by post-v1 assembly with 25] | 4552514 | 4098607 | FISH supported
62 [Joined by GB and post-v1 assembly with 183] | 4280439 | 1496033 | FISH supported
64 | 4206236 | 4206236 | FISH inconclusive
66 [Chr6] | 3926670 | 590740 | FISH supported
67 [Chr7, Joined by GB with 39 and 148] | 3941259 | 3941259 | FISH supported
68 [Chr3] | 3967838 | 2240720 | FISH supported
69 [Joined by GB with 41] | 3857696 | 3537964 | FISH supported
70 | 3733942 | 1175646 | FISH inconclusive
71 [Joined by GB with 99] | 3793322 | 3679799 | FISH supported
72 [Chr7] | 3666727 | 1285937 | FISH supported

PAGE 84

Table 2-6. Continued
AmTr_v1_Scaffold | Length (bp) | FISH Assessed (bp) | FISH Result
73 | 3630538 | 511445 | FISH inconclusive
74 [Chr10] | 3745557 | 2883669 | FISH inconclusive
75 | 3588453 | 1421167 | FISH inconclusive
79 [Joined by post-v1 assembly with 105] | 3533628 | 1139225 | FISH supported
80 [Chr8, Joined by GB with 23] | 3254339 | 1973746 | FISH supported
81 [Chr2, Joined by GB and post-v1 assembly with 42] | 3248084 | 1821599 | FISH supported
82 | 3173775 | 2105734 | FISH inconclusive
83 [Chr2, Joined by GB and post-v1 assembly with 14] | 3070426 | 162792 | FISH supported
84 | 3047370 | 759770 | FISH inconclusive
85 [Joined by GB with 133] | 3064904 | 869025 | FISH supported
86 | 3105707 | 1696234 | FISH inconclusive
88 [Joined by GB with 6] | 3069342 | 1892510 | FISH supported
89 [Joined with 163] | 3118961 | 3036489 | FISH supported
91 | 2884017 | 865352 | FISH inconclusive
92 | 2989607 | 2694932 | FISH inconclusive
93 [Chr1, Joined by GB with 15] | 2935351 | 959176 | FISH inconclusive
94 | 2857628 | 2687213 | FISH inconclusive
98 | 2832988 | 496625 | FISH supported
99 [Joined by GB with 71] | 2678520 | 2673888 | FISH supported
100 | 2599581 | 568151 | FISH inconclusive
104 [Joined by GB with 146] | 2532649 | 2380108 | FISH supported
105 [Joined by post-v1 assembly with 79] | 2420675 | 1055552 | FISH inconclusive
106 [Joined by GB and post-v1 assembly with 111] | 2366178 | 2093324 | FISH supported
109 [Chr5, Joined by GB with 56] | 2285430 | 1117883 | FISH supported
111 [Joined by GB and post-v1 assembly with 106] | 2315763 | 750695 | FISH supported
113 | 2159203 | 1008658 | FISH inconclusive
116 | 2180460 | 1040602 | FISH inconclusive
118 | 2047860 | 267622 | FISH inconclusive

PAGE 85

Table 2-6. Continued
AmTr_v1_Scaffold | Length (bp) | FISH Assessed (bp) | FISH Result
122 [Joined by GB and post-v1 assembly with 146] | 2074012 | 1972182 | FISH supported
124 | 1772254 | 706657 | FISH inconclusive
125 | 1713774 | 646126 | FISH inconclusive
130 [Chr1, Joined by GB with 21] | 1605649 | 1540314 | FISH supported
131 [Chr12, Joined by post-v1 assembly with 7] | 1574043 | 1484394 | FISH supported
133 [Joined by GB with 85] | 1531399 | 479603 | FISH supported
136 [Chr5, Joined by GB with 36] | 1544621 | 637800 | FISH supported
139 | 1453393 | 970569 | FISH inconclusive
142 | 1470113 | 1341322 | FISH inconclusive
146 | 1311389 | 1311389 | FISH supported
148 [Chr7, Joined by GB with 39] | 1337138 | 341640 | FISH supported
154 [Joined by post-v1 assembly with 47] | 1210621 | 1038426 | FISH supported
160 [Joined by GB and post-v1 assembly with 62] | 945152 | 436863 | FISH supported
163 [Joined with 89] | 887276 | 678958 | FISH supported
165 [Chr8, Joined by GB with 12] | 914386 | 470684 | FISH supported
171 [Chr9, Joined by GB with 29] | 763677 | 101951 | FISH supported
183 [Chr9, Joined by GB with 62 and 160] | 543613 | 543613 | FISH supported
207 | 257758 | 171682 | FISH inconclusive

PAGE 86

Table 2-7. Amborella Amborella V1.1 assembly.
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1 = forward strand, -1 = reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
1 | 23365.04 | AmTr_v1.0_scaffold00007(-1), AmTr_v1.0_scaffold00153(1), AmTr_v1.0_scaffold00023(1), AmTr_v1.0_scaffold00080(1), AmTr_v1.0_scaffold00108(-1), AmTr_v1.0_scaffold00206(-1) | AmTr_v1.0_scaffold00023, AmTr_v1.0_scaffold00080: FISH supported; AmTr_v1.0_scaffold00080, AmTr_v1.0_scaffold00108: FISH supported; AmTr_v1.0_scaffold00108, AmTr_v1.0_scaffold00206: FISH inconclusive; AmTr_v1.0_scaffold00007, AmTr_v1.0_scaffold00153: V1.1 supported; AmTr_v1.0_scaffold00153, AmTr_v1.0_scaffold00023: V1.1 supported; AmTr_v1.0_scaffold00023, AmTr_v1.0_scaffold00080: V1.1 supported; AmTr_v1.0_scaffold00080, AmTr_v1.0_scaffold00108: V1.1 supported; AmTr_v1.0_scaffold00108, AmTr_v1.0_scaffold00206: V1.1 supported
2 | 20002.526 | AmTr_v1.0_scaffold00020(-1), AmTr_v1.0_scaffold00005(1), AmTr_v1.0_scaffold00079(1) | AmTr_v1.0_scaffold00005, AmTr_v1.0_scaffold00079: FISH inconclusive; AmTr_v1.0_scaffold00020, AmTr_v1.0_scaffold00005: FISH inconclusive; AmTr_v1.0_scaffold00020, AmTr_v1.0_scaffold00005: V1.1 supported
3 | 18405.535 | AmTr_v1.0_scaffold00186(-1), AmTr_v1.0_scaffold00017(-1), AmTr_v1.0_scaffold00180(1), AmTr_v1.0_scaffold00070(-1), AmTr_v1.0_scaffold00046(1), AmTr_v1.0_scaffold00184(-1) | AmTr_v1.0_scaffold00186, AmTr_v1.0_scaffold00017: FISH supported; AmTr_v1.0_scaffold00017, AmTr_v1.0_scaffold00180: FISH supported; AmTr_v1.0_scaffold00180, AmTr_v1.0_scaffold00070: FISH inconclusive; AmTr_v1.0_scaffold00070, AmTr_v1.0_scaffold00046: FISH inconclusive; AmTr_v1.0_scaffold00180, AmTr_v1.0_scaffold00070: V1.1 supported

PAGE 87

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1 = forward strand, -1 = reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
4 | 17768.939 | AmTr_v1.0_scaffold00176(-1), AmTr_v1.0_scaffold00138(1), AmTr_v1.0_scaffold00173(1), AmTr_v1.0_scaffold00187(1), AmTr_v1.0_scaffold00011(1), AmTr_v1.0_scaffold00048(-1) | AmTr_v1.0_scaffold00011, AmTr_v1.0_scaffold00048: FISH supported
5 | 17344.301 | AmTr_v1.0_scaffold00143(1), AmTr_v1.0_scaffold00001(-1) | N/A
6 | 17239.462 | AmTr_v1.0_scaffold00036(-1), AmTr_v1.0_scaffold00136(1), AmTr_v1.0_scaffold00064(1), AmTr_v1.0_scaffold00052(1) | AmTr_v1.0_scaffold00036, AmTr_v1.0_scaffold00136: FISH supported; AmTr_v1.0_scaffold00136, AmTr_v1.0_scaffold00064: FISH supported; AmTr_v1.0_scaffold00064, AmTr_v1.0_scaffold00052: FISH inconclusive
7 | 15893.422 | AmTr_v1.0_scaffold00037(-1), AmTr_v1.0_scaffold00035(1), AmTr_v1.0_scaffold00051(1) | AmTr_v1.0_scaffold00037, AmTr_v1.0_scaffold00035: FISH inconclusive; AmTr_v1.0_scaffold00035, AmTr_v1.0_scaffold00051: FISH inconclusive

PAGE 88

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1 = forward strand, -1 = reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
8 | 15196.57 | AmTr_v1.0_scaffold00112(-1), AmTr_v1.0_scaffold00151(-1), AmTr_v1.0_scaffold00074(1), AmTr_v1.0_scaffold00090(1), AmTr_v1.0_scaffold00055(-1) | AmTr_v1.0_scaffold00074, AmTr_v1.0_scaffold00090: FISH inconclusive; AmTr_v1.0_scaffold00090, AmTr_v1.0_scaffold00055: FISH inconclusive
9 | 15047.096 | AmTr_v1.0_scaffold00216(1), AmTr_v1.0_scaffold00008(1), AmTr_v1.0_scaffold00073(1), AmTr_v1.0_scaffold00128(-1) | AmTr_v1.0_scaffold00008, AmTr_v1.0_scaffold00073: FISH inconclusive; AmTr_v1.0_scaffold00073, AmTr_v1.0_scaffold00128: FISH inconclusive
10 | 14269.722 | AmTr_v1.0_scaffold00102(1), AmTr_v1.0_scaffold00039(1), AmTr_v1.0_scaffold00067(-1), AmTr_v1.0_scaffold00148(-1), AmTr_v1.0_scaffold00175(-1) | AmTr_v1.0_scaffold00039, AmTr_v1.0_scaffold00067: FISH supported; AmTr_v1.0_scaffold00067, AmTr_v1.0_scaffold00148: FISH supported; AmTr_v1.0_scaffold00067, AmTr_v1.0_scaffold00148: V1.1 supported
11 | 14187.169 | AmTr_v1.0_scaffold00002(-1), AmTr_v1.0_scaffold00135(1), AmTr_v1.0_scaffold00156(-1) | AmTr_v1.0_scaffold00002, AmTr_v1.0_scaffold00135: V1.1 supported

PAGE 89

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1 = forward strand, -1 = reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
12 | 13805.303 | AmTr_v1.0_scaffold00013(1), AmTr_v1.0_scaffold00117(-1), AmTr_v1.0_scaffold00214(1), AmTr_v1.0_scaffold00087(1) | AmTr_v1.0_scaffold00013, AmTr_v1.0_scaffold00117: FISH supported; AmTr_v1.0_scaffold00013, AmTr_v1.0_scaffold00117: V1.1 supported
13 | 13155.091 | AmTr_v1.0_scaffold00047(-1), AmTr_v1.0_scaffold00028(1), AmTr_v1.0_scaffold00123(1) | AmTr_v1.0_scaffold00047, AmTr_v1.0_scaffold00028: FISH inconclusive; AmTr_v1.0_scaffold00028, AmTr_v1.0_scaffold00123: FISH inconclusive
14 | 12483.457 | AmTr_v1.0_scaffold00006(1), AmTr_v1.0_scaffold00088(1) | AmTr_v1.0_scaffold00006, AmTr_v1.0_scaffold00088: FISH supported
15 | 12309.208 | AmTr_v1.0_scaffold00125(1), AmTr_v1.0_scaffold00004(-1) | AmTr_v1.0_scaffold00125, AmTr_v1.0_scaffold00004: FISH inconclusive
16 | 11941.797 | AmTr_v1.0_scaffold00078(1), AmTr_v1.0_scaffold00142(1), AmTr_v1.0_scaffold00122(1), AmTr_v1.0_scaffold00146(-1), AmTr_v1.0_scaffold00104(1), AmTr_v1.0_scaffold00188(1) | AmTr_v1.0_scaffold00104, AmTr_v1.0_scaffold00188: FISH inconclusive; AmTr_v1.0_scaffold00078, AmTr_v1.0_scaffold00142: V1.1 supported; AmTr_v1.0_scaffold00146, AmTr_v1.0_scaffold00104: V1.1 supported

PAGE 90

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1 = forward strand, -1 = reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
17 | 11620.888 | AmTr_v1.0_scaffold00164(-1), AmTr_v1.0_scaffold00086(1), AmTr_v1.0_scaffold00150(1), AmTr_v1.0_scaffold00026(1) | AmTr_v1.0_scaffold00164, AmTr_v1.0_scaffold00086: FISH inconclusive; AmTr_v1.0_scaffold00086, AmTr_v1.0_scaffold00150: FISH inconclusive; AmTr_v1.0_scaffold00086, AmTr_v1.0_scaffold00150: V1.1 supported
18 | 11419.074 | AmTr_v1.0_scaffold00178(1), AmTr_v1.0_scaffold00168(1), AmTr_v1.0_scaffold00012(1), AmTr_v1.0_scaffold00165(-1) | N/A
19 | 10989.949 | AmTr_v1.0_scaffold00140(1), AmTr_v1.0_scaffold00009(1) | N/A
20 | 10637.059 | AmTr_v1.0_scaffold00014(1), AmTr_v1.0_scaffold00083(1) | AmTr_v1.0_scaffold00014, AmTr_v1.0_scaffold00083: FISH supported; AmTr_v1.0_scaffold00014, AmTr_v1.0_scaffold00083: V1.1 supported
21 | 10276.242 | AmTr_v1.0_scaffold00038(-1), AmTr_v1.0_scaffold00190(1), AmTr_v1.0_scaffold00063(-1) | AmTr_v1.0_scaffold00190, AmTr_v1.0_scaffold00063: V1.1 supported

PAGE 91

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1 = forward strand, -1 = reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
22 | 10037.555 | AmTr_v1.0_scaffold00092(-1), AmTr_v1.0_scaffold00155(-1), AmTr_v1.0_scaffold00032(1) | N/A
23 | 9772.766 | AmTr_v1.0_scaffold00045(1), AmTr_v1.0_scaffold00057(-1) | AmTr_v1.0_scaffold00045, AmTr_v1.0_scaffold00057: FISH supported
24 | 9385.578 | AmTr_v1.0_scaffold00162(-1), AmTr_v1.0_scaffold00159(1), AmTr_v1.0_scaffold00053(1), AmTr_v1.0_scaffold00107(1) | AmTr_v1.0_scaffold00053, AmTr_v1.0_scaffold00107: FISH inconclusive
25 | 9308.884 | AmTr_v1.0_scaffold00170(-1), AmTr_v1.0_scaffold00130(1), AmTr_v1.0_scaffold00021(1) | AmTr_v1.0_scaffold00130, AmTr_v1.0_scaffold00021: FISH supported; AmTr_v1.0_scaffold00170, AmTr_v1.0_scaffold00130: V1.1 supported
26 | 9250.07 | AmTr_v1.0_scaffold00069(1), AmTr_v1.0_scaffold00041(1) | N/A

PAGE 92

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1 = forward strand, -1 = reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
27 | 9024.01 | AmTr_v1.0_scaffold00043(1), AmTr_v1.0_scaffold00072(1) | AmTr_v1.0_scaffold00043, AmTr_v1.0_scaffold00072: FISH supported
28 | 9017.598 | AmTr_v1.0_scaffold00066(1), AmTr_v1.0_scaffold00049(1) | N/A
29 | 8942.766 | AmTr_v1.0_scaffold00081(-1), AmTr_v1.0_scaffold00042(1), AmTr_v1.0_scaffold00192(-1) | AmTr_v1.0_scaffold00081, AmTr_v1.0_scaffold00042: FISH inconclusive; AmTr_v1.0_scaffold00042, AmTr_v1.0_scaffold00192: FISH inconclusive
30 | 8892.648 | AmTr_v1.0_scaffold00141(1), AmTr_v1.0_scaffold00096(-1), AmTr_v1.0_scaffold00085(-1), AmTr_v1.0_scaffold00133(1) | AmTr_v1.0_scaffold00096, AmTr_v1.0_scaffold00085: FISH supported; AmTr_v1.0_scaffold00085, AmTr_v1.0_scaffold00133: FISH supported; AmTr_v1.0_scaffold00141, AmTr_v1.0_scaffold00096: V1.1 supported; AmTr_v1.0_scaffold00085, AmTr_v1.0_scaffold00133: V1.1 supported
31 | 8496.213 | AmTr_v1.0_scaffold00030(1), AmTr_v1.0_scaffold00114(-1) | N/A

PAGE 93

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1 = forward strand, -1 = reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
32 | 8192.277 | AmTr_v1.0_scaffold00179(-1), AmTr_v1.0_scaffold00015(1) | N/A
33 | 8128.639 | AmTr_v1.0_scaffold00027(-1), AmTr_v1.0_scaffold00121(1) | AmTr_v1.0_scaffold00027, AmTr_v1.0_scaffold00121: FISH inconclusive
34 | 7905.272 | AmTr_v1.0_scaffold00094(1), AmTr_v1.0_scaffold00115(1), AmTr_v1.0_scaffold00098(1) | AmTr_v1.0_scaffold00094, AmTr_v1.0_scaffold00115: FISH inconclusive
35 | 7785.495 | AmTr_v1.0_scaffold00119(1), AmTr_v1.0_scaffold00101(1), AmTr_v1.0_scaffold00095(1) | AmTr_v1.0_scaffold00119, AmTr_v1.0_scaffold00101: V1.1 supported; AmTr_v1.0_scaffold00101, AmTr_v1.0_scaffold00095: V1.1 supported
36 | 7591.775 | AmTr_v1.0_scaffold00031(1), AmTr_v1.0_scaffold00195(1), AmTr_v1.0_scaffold00139(-1) | AmTr_v1.0_scaffold00031, AmTr_v1.0_scaffold00195: FISH inconclusive; AmTr_v1.0_scaffold00195, AmTr_v1.0_scaffold00139: V1.1 supported

PAGE 94

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1 = forward strand, -1 = reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
37 | 7586.109 | AmTr_v1.0_scaffold00019(1), AmTr_v1.0_scaffold00197(-1) | N/A
38 | 7379.052 | AmTr_v1.0_scaffold00193(1), AmTr_v1.0_scaffold00169(-1), AmTr_v1.0_scaffold00033(-1) | N/A
39 | 7256.435 | AmTr_v1.0_scaffold00202(-1), AmTr_v1.0_scaffold00203(1), AmTr_v1.0_scaffold00071(1), AmTr_v1.0_scaffold00099(1) | AmTr_v1.0_scaffold00202, AmTr_v1.0_scaffold00203: V1.1 supported; AmTr_v1.0_scaffold00071, AmTr_v1.0_scaffold00099: V1.1 supported
40 | 7188.826 | AmTr_v1.0_scaffold00109(1), AmTr_v1.0_scaffold00056(1) | AmTr_v1.0_scaffold00109, AmTr_v1.0_scaffold00056: FISH supported
41 | 7091.808 | AmTr_v1.0_scaffold00034(1), AmTr_v1.0_scaffold00145(1) | AmTr_v1.0_scaffold00034, AmTr_v1.0_scaffold00145: FISH inconclusive

PAGE 95

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
42 | 7062.332 | AmTr_v1.0_scaffold00171(1), AmTr_v1.0_scaffold00029(1) | N/A
43 | 6308.672 | AmTr_v1.0_scaffold00157(-1), AmTr_v1.0_scaffold00054(1) | N/A
44 | 6238.16 | AmTr_v1.0_scaffold00126(1), AmTr_v1.0_scaffold00060(1) | AmTr_v1.0_scaffold00126,AmTr_v1.0_scaffold00060:V1.1 supported;
45 | 6132.406 | AmTr_v1.0_scaffold00161(1), AmTr_v1.0_scaffold00076(-1), AmTr_v1.0_scaffold00127(1) | N/A
46 | 5964.405 | AmTr_v1.0_scaffold00149(1), AmTr_v1.0_scaffold00111(-1), AmTr_v1.0_scaffold00106(-1) | AmTr_v1.0_scaffold00111,AmTr_v1.0_scaffold00106:FISH supported; AmTr_v1.0_scaffold00149,AmTr_v1.0_scaffold00111:V1.1 supported; AmTr_v1.0_scaffold00111,AmTr_v1.0_scaffold00106:V1.1 supported;

PAGE 96

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
47 | 5955.251 | AmTr_v1.0_scaffold00152(-1), AmTr_v1.0_scaffold00113(1), AmTr_v1.0_scaffold00100(1) | AmTr_v1.0_scaffold00152,AmTr_v1.0_scaffold00113:FISH inconclusive; AmTr_v1.0_scaffold00113,AmTr_v1.0_scaffold00100:FISH inconclusive; AmTr_v1.0_scaffold00113,AmTr_v1.0_scaffold00100:V1.1 supported;
48 | 5781.993 | AmTr_v1.0_scaffold00058(-1), AmTr_v1.0_scaffold00154(1) | N/A
49 | 5775.402 | AmTr_v1.0_scaffold00062(1), AmTr_v1.0_scaffold00183(-1), AmTr_v1.0_scaffold00160(1) | AmTr_v1.0_scaffold00062,AmTr_v1.0_scaffold00183:FISH inconclusive; AmTr_v1.0_scaffold00062,AmTr_v1.0_scaffold00183:V1.1 supported; AmTr_v1.0_scaffold00183,AmTr_v1.0_scaffold00160:V1.1 supported;
50 | 5567.43 | AmTr_v1.0_scaffold00181(1), AmTr_v1.0_scaffold00166(-1), AmTr_v1.0_scaffold00089(1), AmTr_v1.0_scaffold00163(-1) | AmTr_v1.0_scaffold00089,AmTr_v1.0_scaffold00163:FISH supported; AmTr_v1.0_scaffold00181,AmTr_v1.0_scaffold00166:V1.1 supported;

PAGE 97

Table 2-7. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (Kb) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella Version 1.0 Scaffolds making up the OpGen GB Super-scaffolds | Support from FISH and/or the Amborella V1.1 Assembly
51 | 4960.867 | AmTr_v1.0_scaffold00132(1), AmTr_v1.0_scaffold00091(-1), AmTr_v1.0_scaffold00207(1) | AmTr_v1.0_scaffold00091,AmTr_v1.0_scaffold00207:FISH inconclusive;
52 | 4946.029 | AmTr_v1.0_scaffold00124(1), AmTr_v1.0_scaffold00082(-1) | AmTr_v1.0_scaffold00124,AmTr_v1.0_scaffold00082:FISH inconclusive;
53 | 4676.467 | AmTr_v1.0_scaffold00158(1), AmTr_v1.0_scaffold00075(1) | N/A

PAGE 98

Table 2-8. Contig and scaffold assembly metrics for the Amborella assembly version 1.
Version 1.0 | Contigs | Scaffolds | After GB Super-scaffolding
Number | 43,234 | 5,745 | 5,635
Total Size (bp) | 668,257,121 | 706,332,648 | 712,382,144
Genome Covered (%) | 88.9 | 93.9 | 94.7
Largest Size (bp) | 287,935 | 15,980,527 | 23,365,040
Mean Size (bp) | 15,456 | 122,947 | 126,409
N10 Size (bp) | 85,018 | 9,414,115 | 17,768,939
N10 Count | 579 | 7 | 4
N25 Size (bp) | 52,723 | 6,874,971 | 14,187,169
N25 Count | 2,129 | 20 | 11
N50 Size (bp) | 29,456 | 4,927,027 | 9,308,884
N50 Count | 6,448 | 50 | 26
N75 Size (bp) | 14,812 | 2,820,768 | 6,368,207
N75 Count | 14,404 | 97 | 49
N90 Size (bp) | 7,108 | 1,154,593 | 2,935,351
N90 Count | 23,962 | 155 | 70
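The N10 to N90 rows of Table 2-8 pair a size with a count: the Nx size is the length of the shortest sequence in the smallest set of longest sequences that together cover x% of the assembly, and the Nx count is the number of sequences in that set. The following is a minimal sketch of that calculation, assuming the percentage is taken against the total assembled length (the text does not state which denominator was used). The function name `nx_stats` and the toy lengths are illustrative and do not come from the Amborella data.

```python
def nx_stats(lengths, percent):
    """Return (Nx size, Nx count) for a list of sequence lengths.

    percent is an integer such as 10, 25, 50, 75 or 90.
    Assumption: the threshold is percent% of the summed lengths,
    not of an independently estimated genome size.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count
    raise ValueError("need at least one sequence length")


# Toy lengths (hypothetical, not taken from the Amborella assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50_size, n50_count = nx_stats(toy_lengths, 50)
print(n50_size, n50_count)
```

A larger Nx size together with a smaller Nx count indicates a more contiguous assembly, which is the pattern seen when moving from the Contigs column to the super-scaffolded column.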

PAGE 99

Table 2-9. Contig and scaffold assembly metrics for the BAC Free Amborella assembly.
BAC Free Version 1.0 | Contigs | Scaffolds | After GB Super-scaffolding
Number | 44,402 | 5,688 | 5,451
Total Size (bp) | 668,207,383 | 703,384,731 | 720,156,913
Genome Covered (%) | 88.9 | 93.5 | 95.8
Largest Size (bp) | 242,004 | 9,633,646 | 22,812,226
Mean Size (bp) | 15,049 | 123,661 | 132,114
N10 Size (bp) | 80,951 | 6,023,163 | 17,828,228
N10 Count | 619 | 10 | 4
N25 Size (bp) | 51,130 | 4,504,398 | 10,603,805
N25 Count | 2,226 | 30 | 12
N50 Size (bp) | 28,655 | 2,766,685 | 7,665,886
N50 Count | 6,685 | 81 | 32
N75 Size (bp) | 14,435 | 1,292,119 | 4,645,926
N75 Count | 14,860 | 176 | 62
N90 Size (bp) | 6,963 | 508,366 | 1,493,879
N90 Count | 24,624 | 293 | 95

PAGE 100

Table 2-10. Amborella version 1.0 BAC
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
1 | 9788841 | AmTr_v1.0_BacFree_scaffold00177(1), AmTr_v1.0_BacFree_scaffold00250(1), AmTr_v1.0_BacFree_scaffold00255(-1), AmTr_v1.0_BacFree_scaffold00280(-1), AmTr_v1.0_BacFree_scaffold00150(1), AmTr_v1.0_BacFree_scaffold00092(-1), AmTr_v1.0_BacFree_scaffold00097(-1)
2 | 10243434 | AmTr_v1.0_BacFree_scaffold00222(-1), AmTr_v1.0_BacFree_scaffold00140(1), AmTr_v1.0_BacFree_scaffold00019(-1), AmTr_v1.0_BacFree_scaffold00095(1), AmTr_v1.0_BacFree_scaffold00372(1)
3 | 8366568 | AmTr_v1.0_BacFree_scaffold00116(-1), AmTr_v1.0_BacFree_scaffold00009(1), AmTr_v1.0_BacFree_scaffold00322(1)
4 | 10676310 | AmTr_v1.0_BacFree_scaffold00004(1), AmTr_v1.0_BacFree_scaffold00082(1), AmTr_v1.0_BacFree_scaffold00351(-1)
5 | 5115448 | AmTr_v1.0_BacFree_scaffold00214(1), AmTr_v1.0_BacFree_scaffold00200(1), AmTr_v1.0_BacFree_scaffold00324(-1), AmTr_v1.0_BacFree_scaffold00239(1), AmTr_v1.0_BacFree_scaffold00226(-1)

PAGE 101

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
6 | 20992142 | AmTr_v1.0_BacFree_scaffold00260(1), AmTr_v1.0_BacFree_scaffold00132(-1), AmTr_v1.0_BacFree_scaffold00058(-1), AmTr_v1.0_BacFree_scaffold00047(1), AmTr_v1.0_BacFree_scaffold00338(1), AmTr_v1.0_BacFree_scaffold00359(1), AmTr_v1.0_BacFree_scaffold00371(1), AmTr_v1.0_BacFree_scaffold00300(1), AmTr_v1.0_BacFree_scaffold00170(1), AmTr_v1.0_BacFree_scaffold00035(1), AmTr_v1.0_BacFree_scaffold00173(-1), AmTr_v1.0_BacFree_scaffold00114(1), AmTr_v1.0_BacFree_scaffold00281(-1)
7 | 8939617 | AmTr_v1.0_BacFree_scaffold00146(-1), AmTr_v1.0_BacFree_scaffold00162(1), AmTr_v1.0_BacFree_scaffold00131(-1), AmTr_v1.0_BacFree_scaffold00227(-1), AmTr_v1.0_BacFree_scaffold00347(1), AmTr_v1.0_BacFree_scaffold00139(1), AmTr_v1.0_BacFree_scaffold00157(-1)
8 | 17828228 | AmTr_v1.0_BacFree_scaffold00154(1), AmTr_v1.0_BacFree_scaffold00236(1), AmTr_v1.0_BacFree_scaffold00144(1), AmTr_v1.0_BacFree_scaffold00103(-1), AmTr_v1.0_BacFree_scaffold00031(1), AmTr_v1.0_BacFree_scaffold00007(1)

PAGE 102

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
9 | 12390468 | AmTr_v1.0_BacFree_scaffold00006(-1), AmTr_v1.0_BacFree_scaffold00051(1), AmTr_v1.0_BacFree_scaffold00136(-1)
10 | 6038567 | AmTr_v1.0_BacFree_scaffold00354(-1), AmTr_v1.0_BacFree_scaffold00104(-1), AmTr_v1.0_BacFree_scaffold00091(1), AmTr_v1.0_BacFree_scaffold00256(1), AmTr_v1.0_BacFree_scaffold00301(1)
11 | 12154529 | AmTr_v1.0_BacFree_scaffold00311(1), AmTr_v1.0_BacFree_scaffold00307(1), AmTr_v1.0_BacFree_scaffold00059(1), AmTr_v1.0_BacFree_scaffold00036(-1), AmTr_v1.0_BacFree_scaffold00054(-1)
12 | 15500291 | AmTr_v1.0_BacFree_scaffold00023(-1), AmTr_v1.0_BacFree_scaffold00026(1), AmTr_v1.0_BacFree_scaffold00241(1), AmTr_v1.0_BacFree_scaffold00050(1), AmTr_v1.0_BacFree_scaffold00178(1)
13 | 22812226 | AmTr_v1.0_BacFree_scaffold00052(1), AmTr_v1.0_BacFree_scaffold00012(-1), AmTr_v1.0_BacFree_scaffold00209(1), AmTr_v1.0_BacFree_scaffold00211(1), AmTr_v1.0_BacFree_scaffold00057(1), AmTr_v1.0_BacFree_scaffold00176(1), AmTr_v1.0_BacFree_scaffold00063(-1), AmTr_v1.0_BacFree_scaffold00093(-1), AmTr_v1.0_BacFree_scaffold00348(1)

PAGE 103

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
14 | 7951732 | AmTr_v1.0_BacFree_scaffold00210(1), AmTr_v1.0_BacFree_scaffold00385(1), AmTr_v1.0_BacFree_scaffold00118(1), AmTr_v1.0_BacFree_scaffold00053(1)
15 | 6110994 | AmTr_v1.0_BacFree_scaffold00115(-1), AmTr_v1.0_BacFree_scaffold00141(1), AmTr_v1.0_BacFree_scaffold00272(-1), AmTr_v1.0_BacFree_scaffold00276(-1), AmTr_v1.0_BacFree_scaffold00244(1)
16 | 7201724 | AmTr_v1.0_BacFree_scaffold00219(1), AmTr_v1.0_BacFree_scaffold00048(1), AmTr_v1.0_BacFree_scaffold00086(-1)
17 | 19271728 | AmTr_v1.0_BacFree_scaffold00288(-1), AmTr_v1.0_BacFree_scaffold00124(1), AmTr_v1.0_BacFree_scaffold00094(-1), AmTr_v1.0_BacFree_scaffold00286(1), AmTr_v1.0_BacFree_scaffold00022(1), AmTr_v1.0_BacFree_scaffold00167(-1), AmTr_v1.0_BacFree_scaffold00005(1)
18 | 16157896 | AmTr_v1.0_BacFree_scaffold00309(-1), AmTr_v1.0_BacFree_scaffold00102(-1), AmTr_v1.0_BacFree_scaffold00072(-1), AmTr_v1.0_BacFree_scaffold00045(1), AmTr_v1.0_BacFree_scaffold00282(-1), AmTr_v1.0_BacFree_scaffold00013(1)

PAGE 104

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
19 | 8783149 | AmTr_v1.0_BacFree_scaffold00010(1), AmTr_v1.0_BacFree_scaffold00096(1), AmTr_v1.0_BacFree_scaffold00341(1)
20 | 9091300 | AmTr_v1.0_BacFree_scaffold00081(-1), AmTr_v1.0_BacFree_scaffold00089(1), AmTr_v1.0_BacFree_scaffold00336(-1), AmTr_v1.0_BacFree_scaffold00117(-1), AmTr_v1.0_BacFree_scaffold00203(1)
21 | 9121032 | AmTr_v1.0_BacFree_scaffold00062(1), AmTr_v1.0_BacFree_scaffold00128(-1), AmTr_v1.0_BacFree_scaffold00274(-1), AmTr_v1.0_BacFree_scaffold00069(-1), AmTr_v1.0_BacFree_scaffold00314(-1)
22 | 9377256 | AmTr_v1.0_BacFree_scaffold00046(1), AmTr_v1.0_BacFree_scaffold00099(1), AmTr_v1.0_BacFree_scaffold00213(1), AmTr_v1.0_BacFree_scaffold00217(-1), AmTr_v1.0_BacFree_scaffold00317(1), AmTr_v1.0_BacFree_scaffold00337(1)
23 | 11616674 | AmTr_v1.0_BacFree_scaffold00151(-1), AmTr_v1.0_BacFree_scaffold00262(-1), AmTr_v1.0_BacFree_scaffold00184(1), AmTr_v1.0_BacFree_scaffold00083(-1), AmTr_v1.0_BacFree_scaffold00075(-1), AmTr_v1.0_BacFree_scaffold00240(-1), AmTr_v1.0_BacFree_scaffold00370(-1), AmTr_v1.0_BacFree_scaffold00181(-1)

PAGE 105

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
24 | 4992265 | AmTr_v1.0_BacFree_scaffold00055(-1), AmTr_v1.0_BacFree_scaffold00159(1)
25 | 5001242 | AmTr_v1.0_BacFree_scaffold00271(-1), AmTr_v1.0_BacFree_scaffold00032(1)
26 | 5216086 | AmTr_v1.0_BacFree_scaffold00279(1), AmTr_v1.0_BacFree_scaffold00247(1), AmTr_v1.0_BacFree_scaffold00110(-1), AmTr_v1.0_BacFree_scaffold00138(-1)
27 | 4318825 | AmTr_v1.0_BacFree_scaffold00125(1), AmTr_v1.0_BacFree_scaffold00259(1), AmTr_v1.0_BacFree_scaffold00361(-1), AmTr_v1.0_BacFree_scaffold00266(1), AmTr_v1.0_BacFree_scaffold00232(-1)
28 | 6498479 | AmTr_v1.0_BacFree_scaffold00342(1), AmTr_v1.0_BacFree_scaffold00018(-1), AmTr_v1.0_BacFree_scaffold00253(1)
29 | 8991354 | AmTr_v1.0_BacFree_scaffold00268(-1), AmTr_v1.0_BacFree_scaffold00346(1), AmTr_v1.0_BacFree_scaffold00216(1), AmTr_v1.0_BacFree_scaffold00040(-1), AmTr_v1.0_BacFree_scaffold00206(-1), AmTr_v1.0_BacFree_scaffold00123(1)
30 | 7154171 | AmTr_v1.0_BacFree_scaffold00254(1), AmTr_v1.0_BacFree_scaffold00011(1), AmTr_v1.0_BacFree_scaffold00330(1)

PAGE 106

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
31 | 3045208 | AmTr_v1.0_BacFree_scaffold00166(1), AmTr_v1.0_BacFree_scaffold00133(1)
32 | 2251194 | AmTr_v1.0_BacFree_scaffold00153(1), AmTr_v1.0_BacFree_scaffold00267(-1)
33 | 9696895 | AmTr_v1.0_BacFree_scaffold00039(1), AmTr_v1.0_BacFree_scaffold00308(1), AmTr_v1.0_BacFree_scaffold00038(1), AmTr_v1.0_BacFree_scaffold00164(1)
34 | 6701897 | AmTr_v1.0_BacFree_scaffold00212(1), AmTr_v1.0_BacFree_scaffold00237(1), AmTr_v1.0_BacFree_scaffold00148(1), AmTr_v1.0_BacFree_scaffold00224(1), AmTr_v1.0_BacFree_scaffold00218(-1), AmTr_v1.0_BacFree_scaffold00223(1)
35 | 6008646 | AmTr_v1.0_BacFree_scaffold00065(1), AmTr_v1.0_BacFree_scaffold00087(1), AmTr_v1.0_BacFree_scaffold00349(-1)
36 | 4475323 | AmTr_v1.0_BacFree_scaffold00265(1), AmTr_v1.0_BacFree_scaffold00122(-1), AmTr_v1.0_BacFree_scaffold00367(1), AmTr_v1.0_BacFree_scaffold00129(1)
37 | 7563973 | AmTr_v1.0_BacFree_scaffold00017(-1), AmTr_v1.0_BacFree_scaffold00246(1), AmTr_v1.0_BacFree_scaffold00290(1), AmTr_v1.0_BacFree_scaffold00245(-1)

PAGE 107

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
38 | 8863753 | AmTr_v1.0_BacFree_scaffold00008(-1), AmTr_v1.0_BacFree_scaffold00198(-1), AmTr_v1.0_BacFree_scaffold00215(1)
39 | 7777885 | AmTr_v1.0_BacFree_scaffold00060(-1), AmTr_v1.0_BacFree_scaffold00183(1), AmTr_v1.0_BacFree_scaffold00156(1), AmTr_v1.0_BacFree_scaffold00189(1), AmTr_v1.0_BacFree_scaffold00312(-1)
40 | 2962569 | AmTr_v1.0_BacFree_scaffold00257(1), AmTr_v1.0_BacFree_scaffold00294(1), AmTr_v1.0_BacFree_scaffold00297(1), AmTr_v1.0_BacFree_scaffold00221(1)
41 | 6292435 | AmTr_v1.0_BacFree_scaffold00302(-1), AmTr_v1.0_BacFree_scaffold00225(1), AmTr_v1.0_BacFree_scaffold00079(1), AmTr_v1.0_BacFree_scaffold00289(1), AmTr_v1.0_BacFree_scaffold00158(-1)
42 | 10603805 | AmTr_v1.0_BacFree_scaffold00352(-1), AmTr_v1.0_BacFree_scaffold00085(-1), AmTr_v1.0_BacFree_scaffold00328(1), AmTr_v1.0_BacFree_scaffold00108(1), AmTr_v1.0_BacFree_scaffold00195(1), AmTr_v1.0_BacFree_scaffold00080(1), AmTr_v1.0_BacFree_scaffold00192(1)
43 | 1310215 | AmTr_v1.0_BacFree_scaffold00318(1), AmTr_v1.0_BacFree_scaffold00235(1)

PAGE 108

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
44 | 13518648 | AmTr_v1.0_BacFree_scaffold00027(1), AmTr_v1.0_BacFree_scaffold00024(-1), AmTr_v1.0_BacFree_scaffold00149(1), AmTr_v1.0_BacFree_scaffold00113(1)
45 | 9935816 | AmTr_v1.0_BacFree_scaffold00362(-1), AmTr_v1.0_BacFree_scaffold00001(1)
46 | 7227707 | AmTr_v1.0_BacFree_scaffold00199(-1), AmTr_v1.0_BacFree_scaffold00030(1), AmTr_v1.0_BacFree_scaffold00180(-1)
47 | 3574620 | AmTr_v1.0_BacFree_scaffold00208(1), AmTr_v1.0_BacFree_scaffold00145(1), AmTr_v1.0_BacFree_scaffold00228(1)
48 | 6515449 | AmTr_v1.0_BacFree_scaffold00169(-1), AmTr_v1.0_BacFree_scaffold00329(1), AmTr_v1.0_BacFree_scaffold00175(-1), AmTr_v1.0_BacFree_scaffold00090(1), AmTr_v1.0_BacFree_scaffold00278(-1)
49 | 3547691 | AmTr_v1.0_BacFree_scaffold00067(1), AmTr_v1.0_BacFree_scaffold00316(1)
50 | 2337805 | AmTr_v1.0_BacFree_scaffold00120(-1), AmTr_v1.0_BacFree_scaffold00293(1)
51 | 6969014 | AmTr_v1.0_BacFree_scaffold00285(1), AmTr_v1.0_BacFree_scaffold00339(1), AmTr_v1.0_BacFree_scaffold00016(-1), AmTr_v1.0_BacFree_scaffold00270(1)

PAGE 109

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
52 | 6271825 | AmTr_v1.0_BacFree_scaffold00242(1), AmTr_v1.0_BacFree_scaffold00101(-1), AmTr_v1.0_BacFree_scaffold00201(-1), AmTr_v1.0_BacFree_scaffold00152(1)
53 | 947801 | AmTr_v1.0_BacFree_scaffold00298(1), AmTr_v1.0_BacFree_scaffold00340(1)
54 | 7065739 | AmTr_v1.0_BacFree_scaffold00220(1), AmTr_v1.0_BacFree_scaffold00042(-1), AmTr_v1.0_BacFree_scaffold00147(1), AmTr_v1.0_BacFree_scaffold00313(1)
55 | 4762426 | AmTr_v1.0_BacFree_scaffold00171(1), AmTr_v1.0_BacFree_scaffold00165(1), AmTr_v1.0_BacFree_scaffold00185(1)
56 | 2369116 | AmTr_v1.0_BacFree_scaffold00234(1), AmTr_v1.0_BacFree_scaffold00343(-1), AmTr_v1.0_BacFree_scaffold00357(1), AmTr_v1.0_BacFree_scaffold00263(-1), AmTr_v1.0_BacFree_scaffold00364(-1)
57 | 1435513 | AmTr_v1.0_BacFree_scaffold00283(1), AmTr_v1.0_BacFree_scaffold00251(1)
58 | 2326677 | AmTr_v1.0_BacFree_scaffold00204(-1), AmTr_v1.0_BacFree_scaffold00190(1)
59 | 4214009 | AmTr_v1.0_BacFree_scaffold00186(1), AmTr_v1.0_BacFree_scaffold00376(-1), AmTr_v1.0_BacFree_scaffold00107(1)

PAGE 110

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
60 | 6100407 | AmTr_v1.0_BacFree_scaffold00014(1), AmTr_v1.0_BacFree_scaffold00331(1)
61 | 1304948 | AmTr_v1.0_BacFree_scaffold00284(-1), AmTr_v1.0_BacFree_scaffold00368(1), AmTr_v1.0_BacFree_scaffold00319(1)
62 | 4747816 | AmTr_v1.0_BacFree_scaffold00310(-1), AmTr_v1.0_BacFree_scaffold00350(-1), AmTr_v1.0_BacFree_scaffold00037(1)
63 | 3875567 | AmTr_v1.0_BacFree_scaffold00155(1), AmTr_v1.0_BacFree_scaffold00098(1)
64 | 8341480 | AmTr_v1.0_BacFree_scaffold00041(1), AmTr_v1.0_BacFree_scaffold00111(-1), AmTr_v1.0_BacFree_scaffold00305(-1), AmTr_v1.0_BacFree_scaffold00127(1)
65 | 8642329 | AmTr_v1.0_BacFree_scaffold00269(1), AmTr_v1.0_BacFree_scaffold00194(1), AmTr_v1.0_BacFree_scaffold00197(1), AmTr_v1.0_BacFree_scaffold00043(-1), AmTr_v1.0_BacFree_scaffold00137(-1)
66 | 7587208 | AmTr_v1.0_BacFree_scaffold00231(1), AmTr_v1.0_BacFree_scaffold00028(-1), AmTr_v1.0_BacFree_scaffold00193(1), AmTr_v1.0_BacFree_scaffold00292(-1)
67 | 6336093 | AmTr_v1.0_BacFree_scaffold00174(1), AmTr_v1.0_BacFree_scaffold00020(-1)

PAGE 111

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
68 | 4177843 | AmTr_v1.0_BacFree_scaffold00182(-1), AmTr_v1.0_BacFree_scaffold00261(1), AmTr_v1.0_BacFree_scaffold00252(1), AmTr_v1.0_BacFree_scaffold00172(-1)
69 | 1605499 | AmTr_v1.0_BacFree_scaffold00168(-1), AmTr_v1.0_BacFree_scaffold00356(1)
70 | 5805660 | AmTr_v1.0_BacFree_scaffold00088(-1), AmTr_v1.0_BacFree_scaffold00061(1)
71 | 1493879 | AmTr_v1.0_BacFree_scaffold00303(1), AmTr_v1.0_BacFree_scaffold00243(1)
72 | 3956102 | AmTr_v1.0_BacFree_scaffold00163(1), AmTr_v1.0_BacFree_scaffold00084(-1)
73 | 9011683 | AmTr_v1.0_BacFree_scaffold00034(-1), AmTr_v1.0_BacFree_scaffold00025(1)
74 | 4391510 | AmTr_v1.0_BacFree_scaffold00238(-1), AmTr_v1.0_BacFree_scaffold00056(1)
75 | 1274295 | AmTr_v1.0_BacFree_scaffold00299(1), AmTr_v1.0_BacFree_scaffold00295(1)
76 | 3684854 | AmTr_v1.0_BacFree_scaffold00249(1), AmTr_v1.0_BacFree_scaffold00077(-1)
77 | 4645926 | AmTr_v1.0_BacFree_scaffold00179(1), AmTr_v1.0_BacFree_scaffold00100(1), AmTr_v1.0_BacFree_scaffold00355(1)

PAGE 112

Table 2-10. Continued
OpGen Super-scaffold | Estimated OpGen Super-scaffold Length (bp) | Order (left to right) and Orientation (1=forward strand, -1=reverse complement) of Amborella BAC Free Scaffolds making up the OpGen GB Super-scaffolds
78 | 3441583 | AmTr_v1.0_BacFree_scaffold00126(1), AmTr_v1.0_BacFree_scaffold00130(1)
79 | 6595486 | AmTr_v1.0_BacFree_scaffold00068(-1), AmTr_v1.0_BacFree_scaffold00134(1), AmTr_v1.0_BacFree_scaffold00135(-1)
80 | 5623082 | AmTr_v1.0_BacFree_scaffold00207(1), AmTr_v1.0_BacFree_scaffold00029(1)
81 | 887146 | AmTr_v1.0_BacFree_scaffold00273(1), AmTr_v1.0_BacFree_scaffold00360(1)
82 | 3616908 | AmTr_v1.0_BacFree_scaffold00076(1), AmTr_v1.0_BacFree_scaffold00296(-1)
83 | 3053442 | AmTr_v1.0_BacFree_scaffold00119(1), AmTr_v1.0_BacFree_scaffold00229(-1)
84 | 5995047 | AmTr_v1.0_BacFree_scaffold00078(1), AmTr_v1.0_BacFree_scaffold00106(-1)
85 | 7665886 | AmTr_v1.0_BacFree_scaffold00015(1), AmTr_v1.0_BacFree_scaffold00105(1)
86 | 3139223 | AmTr_v1.0_BacFree_scaffold00230(1), AmTr_v1.0_BacFree_scaffold00109(1)
87 | 5365471 | AmTr_v1.0_BacFree_scaffold00049(1), AmTr_v1.0_BacFree_scaffold00142(-1)
88 | 1789638 | AmTr_v1.0_BacFree_scaffold00381(1), AmTr_v1.0_BacFree_scaffold00143(1)

PAGE 113

Table 2-11. A comparison of assembly statistics between Amborella and other NGS-based whole genome assemblies.
Organism | Sanger BAC Ends (BE) / Fosmid Ends (FE) | Genome Size (Mb) | Bases in Contigs (Mb) | Contig N50 (bp) [N50 (n)] | Contig N90 (bp) [N90 (n)]
Amborella BAC Free Assembly | 0 | 730 | 668 | 28,655 [6,685] | 6,963 [24,624]
Amborella BAC Free Assembly + Genome Builder | 0 | 730 | 668 | 28,655 [6,685] | 6,963 [24,624]
Amborella Assembly With BAC Ends | BE: 63,924 | 730 | 668 | 29,456 [6,448] | 7,108 [23,962]
Amborella Assembly With BAC Ends + Genome Builder | BE: 63,924 | 730 | 668 | 28,655 [6,685] | 6,963 [24,624]
Brassica | BE: 199,452 | 485 | 264 | 27,294 [2,778] | 5,593 [10,564]
Banana | BE: 90,542 | 523 | 390 | 43,122 [2,113] | 5,831 [12,505]
Potato | FE: 90,407; BE: 71,375 | 844 | 683 | 31,429 [6,446] | 6,858 [23,392]
Cacao | BE: 84,547 | 430 | 291 | 19,800 [4,097] | 4,800 [15,723]
Cod | BE: 78,034 | 830 | 536 | 2,778 [50,237] | n.a
Pigeonpea (BAC free) | 0 | 833 | 606 | 21,954 [7,815] | 4,494 [28,417]
Pigeonpea | BE: 82,604 | 833 | 606 | 21,954 [7,815] | 4,494 [28,417]
Melon | BE: 53,203 | 450 | n.a | 18,200 [n.a] | n.a
Apple | FE: 117,000; BE: 34,000 | 743 | 604 | 13,400 [16,171] | n.a

PAGE 114

Table 2-11. Continued
Organism | Bases in Scaffolds (Mb) | Scaffold N50 (bp) [N50 (n)] | Scaffold N90 (bp) [N90 (n)] | References
Amborella BAC Free Assembly | 703 | 2,766,685 [81] | 508,366 [293] | NA
Amborella BAC Free Assembly + Genome Builder | 720 | 7,665,886 [32] | 1,493,879 [95] | NA
Amborella Assembly With BAC Ends | 706 | 4,927,027 [50] | 1,154,593 [155] | NA
Amborella Assembly With BAC Ends + Genome Builder | 712 | 9,308,884 [26] | 2,935,351 [70] | NA
Brassica | 284 | 1,971,137 [39] | 357,979 [159] | (Wang et al. 2011)
Banana | 472 | 1,311,088 [65] | 54,335 [647] | et al. 2012)
Potato | 727 | 1,318,511 [167] | 253,760 [622] | (Xu et al. 2011)
Cacao | 327 | 473,800 [178] | 75,500 [854] | (Argout et al. 2011)
Cod | 611 | 687,709 [218] | n.a | (Star et al. 2011)
Pigeonpea (BAC free) | 606 | 247,882 [620] | 23,693 [3,485] | (Varshney et al. 2012)
Pigeonpea | 606 | 387,700 [380] | 24,993 [2,743] | (Varshney et al. 2012)
Melon | 375 | 4,677,790 [26] | 1,485,533 [78] | (Garcia-Mas et al. 2012)
Apple | 598 | 1,542,700 [102] | n.a | (Velasco et al. 2010)

PAGE 115

Table 2-12. Genome read coverage across sequenced BAC contigs.
Contig ID | Contig Length (kb) | Million Bases Aligned | Mean Depth | Median Depth
431 | 487 | 15.6 | 32 | 29
1003 | 630 | 19.9 | 31.6 | 29
Total | 1,117 | 35.5 | 31.7 | 29

PAGE 116

Table 2-13. K-mer frequencies for genome size estimation.
Read Library | K-mer Size (k) | Distinct k-mer at max. Volume (n) | Total K-mers | Estimated Size (MB)
454 | 15 | 22 | 14,793,485,875 | 672
454 | 17 | 19 | 15,057,519,382 | 793
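The estimated sizes in Table 2-13 are consistent with the common k-mer approach of dividing the total number of k-mers by the k-mer depth at the peak of the frequency distribution. The sketch below reproduces that arithmetic; reading the "at max. volume" column as the peak depth is an assumption (it does reproduce the tabulated sizes), and the function name `estimate_genome_size_mb` is illustrative.

```python
def estimate_genome_size_mb(total_kmers, peak_depth):
    """Estimate genome size in megabases from k-mer counts.

    Assumption: genome size ~= total k-mers / k-mer depth at the
    peak of the k-mer frequency histogram.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth / 1_000_000


# Values copied from Table 2-13 (454 read library).
size_k15 = estimate_genome_size_mb(14_793_485_875, 22)
size_k17 = estimate_genome_size_mb(15_057_519_382, 19)
print(round(size_k15), round(size_k17))
```

The two k values give estimates that differ by roughly 120 Mb, which illustrates how sensitive this kind of estimate is to the choice of k and to where the peak depth is called.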

PAGE 117

CHAPTER 3
GLOBAL CONSERVATION OF ALTERNATIVE SPLICING EVENTS ACROSS EUDICOTS USING AMBORELLA AS A REFERENCE

Introduction

Alternative splicing (AS) is a post-transcriptional mechanism in which differential processing of precursor mRNA (pre-mRNA) yields multiple distinct mRNAs from a single gene. It is one of the mechanisms through which eukaryotes generate transcriptome and proteome diversity, and it can also influence protein abundance (Barbazuk et al. 2008; Reddy 2007). There is increasing evidence that AS plays a crucial role in many essential plant functions, such as photosynthesis, defense response, flowering, and cereal grain quality (Barbazuk et al. 2008). Despite the important roles AS plays in angiosperms (flowering plants), the evolution and conservation of AS events across plant species are not well understood (Barbazuk et al. 2008). This is largely due to a lack of abundant transcriptome sequence data sampled from multiple, comparable tissues across diverse flowering plants (Barbazuk et al. 2008; Reddy 2007).

Cross-species Alternative Splicing Comparisons in Plants

Large-scale, cross-species AS comparisons in plants have been limited to identifying conserved AS events using cDNA and expressed sequence tag (EST) sequences. Most previous comparative studies in plants reported fewer than 50 conserved events between species (Wang and Brendel 2006; Baek et al. 2008; Wang et al. 2008a; Severing et al. 2009). A recent study comparing Brassica and Arabidopsis identified many more conserved AS events: 537 AS events in 485 genes (Darracq and Adams 2013). This increase in the number of conserved AS events is largely due to the inclusion of more cDNA and EST sequence data than in previous studies.

PAGE 118

However, the results of these studies may still be underestimates, because they relied only on cDNA and EST datasets that likely do not represent the transcriptome diversity present in all tissues (Darracq and Adams 2013). It is clear that high-throughput technologies and multi-tissue sampling can result in significant increases in estimates of the frequency of AS events (Syed et al. 2012). During the past few years, several plant genomes and transcriptomes (especially RNA-seq) spanning wide evolutionary distances have been sequenced. These resources allow the study of genome-wide AS event conservation and evolution in plants. Conserved events discovered across phylogenetically diverse organisms are candidates for detailed transcriptomic and proteomic functional studies (Barbazuk et al. 2008; Reddy 2007).

Whole-genome Duplication and Alternative Splicing in Plants

The high frequency of whole-genome duplications (WGDs) in angiosperms (Soltis et al. 2009; Jiao et al. 2011; Vanneste et al. 2014) makes them good model systems for studying changes in AS events after WGD. In spite of this, only one study in plants has investigated the evolutionary conservation and divergence of AS patterns in genes duplicated by polyploidy, in this case during the evolutionary history of Arabidopsis thaliana (Zhang et al. 2010). The limitations of this study were that the authors used only 52 WGD duplicate gene pairs in Arabidopsis, and that the AS events of these duplicates were inferred from a previous study (Wang and Brendel 2006), which reported that only 20% of genes in Arabidopsis undergo AS. It is now clear that more than 61% of intron-containing genes exhibit AS in Arabidopsis (Marquez et al. 2012), suggesting that the findings of the previous study are not comprehensive.

PAGE 119

In this chapter I investigated the conservation of AS patterns in genes duplicated by WGD events using the legume model systems common bean (Phaseolus vulgaris) and soybean (Glycine max). After the two species diverged from each other about 19 MYA (McClean et al. 2010), soybean underwent a lineage-specific WGD about 5-10 MYA (Roulin et al. 2012). Thus, a single gene in common bean should have at most two orthologs in soybean. There are about 14,759 orthologous gene sets in which one gene copy in common bean has two orthologous gene copies in soybean, which presumably arose from the soybean WGD.

Compared to the aforementioned investigation by Zhang et al. (2010) in Arabidopsis, this study has two main advantages. First, low genome fractionation in the common bean and soybean lineages after their divergence has led to high retention of duplicate pairs in soybean, preserving the expected 1:2 ratio of common bean genes to orthologs in soybean. This provides a large set of 1:2 orthologous gene sets with which to examine AS changes in the context of WGD within soybean. In contrast, Arabidopsis has undergone at least two lineage-specific WGD events and relatively high genome fractionation (Blanc et al. 2003; Blanc and Wolfe 2004a) (Figure 3-1), leaving behind fewer ortholog pairs from the recent WGD. The second strength of this study is that genome sequence is available for both species, together with abundant and diverse transcriptome sampling from similar tissue types, which are essential prerequisites for robust cross-species AS analysis.

Previous Methods for Identifying Conserved AS Events

Previous studies of cross-species AS event conservation primarily focused on two kinds of events: conserved position (CP) and conserved junction (CJ) events (Darracq and Adams 2013; Wang and Brendel 2006; Wang et al. 2008a). In CP AS events, the same type of event must be present at the same position between orthologous/paralogous

PAGE 120

genes, while CJ AS events are a relaxed form of conservation in which the same type of event must be present at orthologous intron-exon junctions. The first study of CP AS events was conducted in legumes (Medicago truncatula and Lotus japonicus) by Wang et al. (2008) and identified 22 CP AS events. This was followed by Darracq and Adams (2013), who examined CP AS events in the Brassicaceae and found 537 CP AS events (Wang et al. 2008a; Darracq and Adams 2013). Both of these studies used cross-species transcriptome alignments to identify CP AS events across up to three species. Scaling this approach to a large number of species would require the laborious task of performing all possible pairwise cross-species alignments in order to mine conserved AS events across all species and between subsets of species in the study. Another potential problem when examining many species is identifying orthologous relationships between species, which is complicated by lineage-specific polyploidy in angiosperms (Soltis et al. 2009; Vanneste et al. 2014).

In the previous studies identifying CJ AS events, the first step was to identify conserved orthologous exon-intron junctions and then to look for shared AS event types at these junctions (Wang and Brendel 2006; Wang et al. 2008a; Darracq and Adams 2013). Wang and Brendel (2006) mined CJ AS events between rice and Arabidopsis by first identifying potential orthologous gene pairs with reciprocal BLAST, then finding conserved intron-exon junctions by matching the intron plus 30 bp of flanking sequence between orthologs, and finally looking for shared AS event types at orthologous intron junctions (Wang and Brendel 2006). On the other hand, Wang et al. (2008) and Darracq and Adams (2013) used cross-species EST/transcript alignments to identify

PAGE 121

conserved intron-exon junctions and looked for shared AS event types across species (Wang et al. 2008a; Darracq and Adams 2013). Current methods for identifying CJ AS events suffer from limitations similar to those mentioned above for CP AS events.

In this chapter, three primary objectives are addressed using both public and in-house transcriptome and genomic sequence resources. First, I examined the global conservation of AS events across nine angiosperms, including seven eudicot species, one monocot species, and the basal angiosperm Amborella as the outgroup (Figure 3-1). Second, I investigated the conservation of AS events in WGD gene copies using the legume model systems common bean and soybean, which are part of the aforementioned eudicot clade. Third, I developed high-throughput computational pipelines and algorithms for identifying conserved AS events across taxa separated by both short and large evolutionary distances.

Results

Global Transcriptome Alignment and Assembly

Transcriptomic and genomic data were collected from nine angiosperm taxa: seven eudicots, one monocot (Oryza sativa, rice), and Amborella trichopoda, a pivotal species that is sister to all other angiosperms and is used here as an outgroup (Table 3-1). The transcriptome collection includes data from ESTs, mRNAs, 454, and RNA-seq (Table 3-2) from diverse tissue types (Table 3-3), which were rigorously quality filtered by removing low-quality and adapter regions (see Materials and Methods; Figure 3-2). PASA (Program to Assemble Spliced Alignments) (Haas et al. 2003), a software pipeline for reconstructing gene structures from spliced alignments of transcripts, was used to generate comprehensive transcript alignment assemblies for each taxon.
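The distinction between conserved position (CP) and conserved junction (CJ) events reviewed under the previous methods can be made concrete with a small sketch. It assumes, purely for illustration, that each AS event has already been reduced to a gene, a junction index, an event type, and an offset of the alternative site relative to that junction, and that a table of orthologous junctions is available. The names `ASEvent`, `conservation_class`, and the gene identifiers are hypothetical and are not part of any published pipeline.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class ASEvent(NamedTuple):
    gene: str      # gene identifier (hypothetical)
    junction: int  # index of the intron-exon junction within the gene
    kind: str      # "AltD", "AltA", "ExonS" or "IntronR"
    offset: int    # position of the event relative to the junction, in bp


Junction = Tuple[str, int]


def conservation_class(a: ASEvent, b: ASEvent,
                       junction_orthologs: Dict[Junction, Junction]) -> Optional[str]:
    """Return "CP", "CJ", or None for a pair of events from two species.

    Simplifying assumption: a CJ event needs the same event type at
    orthologous junctions; a CP event additionally needs the same offset.
    """
    if a.kind != b.kind:
        return None
    if junction_orthologs.get((a.gene, a.junction)) != (b.gene, b.junction):
        return None
    # Every CP event is also a CJ event; report the stricter label.
    return "CP" if a.offset == b.offset else "CJ"


orthologs = {("geneA", 3): ("geneB", 2)}
same_site = conservation_class(ASEvent("geneA", 3, "AltA", 12),
                               ASEvent("geneB", 2, "AltA", 12), orthologs)
shifted_site = conservation_class(ASEvent("geneA", 3, "AltA", 12),
                                  ASEvent("geneB", 2, "AltA", 9), orthologs)
print(same_site, shifted_site)
```

In practice the orthologous-junction table is the hard part: it is what the reciprocal BLAST and flanking-sequence matching, or the cross-species transcript alignments, described above are used to build.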


Directly feeding massive numbers of 454 and Illumina RNA-seq reads to PASA is impractical because of the enormous complexity involved in clustering and assembling the overlapping alignments that result from these reads. To overcome this limitation, reads from 454 and Illumina were first assembled via de novo and genome-guided transcriptome assembly methodologies, reducing millions of transcriptome reads to a few hundred thousand transcripts (Rhind et al. 2011; Haas et al. 2011). These assembled transcripts, along with ESTs and mRNAs, were aligned against the corresponding reference genomes using the splice-aware aligners GMAP (Wu and Watanabe 2005) and BLAT (Kent 2002), and were further assembled by PASA (Figure 3-2) (Rhind et al. 2011; Haas et al. 2011). Low-confidence PASA alignment assemblies were discarded based on junction read support, and the 2) (see Materials and Methods). PASA transcripts not belonging to annotated protein-coding genes were discarded as well (Figure 3-2). For downstream AS analysis, only multi-exonic protein-coding genes with support from PASA transcripts were considered; these genes are referred to as expressed multi-exonic protein-coding genes (Table 3-1).

Intron Retention is the Most Frequent AS Event

Along with transcript alignment assemblies, PASA also generates an AS classification report. The PASA AS classification output was processed using a custom software pipeline to obtain AS events (Figure 3-3) as defined in Wang and Brendel (2006). The four types of AS events examined in this study are AltD (alternative donor site), AltA (alternative acceptor site), ExonS (exon skipping), and IntronR (intron retention). As illustrated in Table 3-4 and Figure 3-4, IntronR is the most prevalent AS type among the seven species of eudicots, with Arabidopsis having the most abundant


IntronR event category (65.3%). On average, more than half of the AS events are IntronR (56%), followed by AltA (21%) and AltD (14%), with ExonS (9%) being the least frequent. These AS event frequencies are consistent with previous studies in plants (Wang and Brendel 2006; Wang et al. 2008a; Marquez et al. 2012).

Up to 70 Percent of Expressed Multi-exonic Genes Exhibit AS

Among all nine taxa, the fraction of intron-containing genes with at least one AS event is highest in Amborella (70.4%), followed by Vitis vinifera (grape; 64.4%), Populus trichocarpa (poplar; 53.2%), Arabidopsis thaliana (52.9%), Glycine max (soybean; 50.2%), Oryza sativa (rice; 46.4%), Phaseolus vulgaris (common bean; 44.9%), Medicago truncatula (44.7%), and Solanum lycopersicum (tomato; 39.1%) (Table 3-4 and Figure 3-5). These percentages are conservative estimates because our analysis is restricted to only four AS event types (AltA, AltD, ExonS, and IntronR). A previous comprehensive AS study in Arabidopsis reported that 61.2% of expressed multi-exonic genes exhibit AS; however, that study considered the ten most frequent types of AS to estimate AS frequency (Marquez et al. 2012).

High-throughput Pipeline for Identifying Conserved AS Events

Identifying conserved positions of AS events requires the same type of event at the same position between orthologous/paralogous genes, and cross-species alignments for identifying the precise splice junction positions. Although CP AS events provide a strictly defined set of evolutionarily conserved AS events (Darracq and Adams 2013), mining for these types of conserved events is troublesome between species separated by large evolutionary distances because of sequence divergence, which degrades conservation at the sequence and intron-exon boundary level and complicates cross-species EST/transcript alignments. The CP criterion is therefore limited in its application to species separated by short evolutionary distances. To study the conservation of AS events across large evolutionary distances, CJ AS events were chosen instead of CP AS events. The sequence conservation criteria for identifying CJ AS events are relaxed relative to CP and require only that the same type of event be present at orthologous intron-exon junctions. A high-throughput software pipeline for identifying CJ AS events was developed and successfully applied to nine species of angiosperms spanning various divergence times and lineage-specific WGD events (see Materials and Methods; Figure 3-6). From this point on, "conserved AS event" refers to a CJ AS event.

More Than 5,000 Conserved AS Event Clusters between Common Bean and Soybean

Conserved AS events between common bean and soybean were identified using the pipeline described in Figure 3-6 (see Materials and Methods). There are 5,202 AS events conserved between 4,020 genes in common bean and 5,671 genes in soybean (Table 3-5), corresponding to 45% and 31% of multi-exon expressed genes, respectively. IntronR was the most abundant conserved AS event type, followed by AltA, AltD, and ExonS, which is in line with the overall proportions of AS events. To the best of our knowledge, this is the largest number of conserved AS events reported to date between two species, far exceeding a recent study that reported only 694 conserved AS events in 597 genes between Arabidopsis thaliana and Brassica (Darracq and Adams 2013).
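As a rough illustration of the CJ criterion described above (the same event type at orthologous intron-exon junctions), the following minimal Python sketch groups AS events into conserved clusters. It is not the pipeline of Figure 3-6: the `ASEvent` record, the `junction_group` identifier (a stand-in for whatever step establishes junction orthology), and the toy gene names are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class ASEvent(NamedTuple):
    species: str
    gene: str
    junction_group: str  # hypothetical ID shared by orthologous junctions
    event_type: str      # one of: AltD, AltA, ExonS, IntronR


def conserved_clusters(events):
    """Group events by (junction group, event type) and keep only the
    clusters that contain events from at least two species."""
    clusters = defaultdict(list)
    for ev in events:
        clusters[(ev.junction_group, ev.event_type)].append(ev)
    return {
        key: members
        for key, members in clusters.items()
        if len({ev.species for ev in members}) >= 2
    }


# Toy input (hypothetical genes and junction groups).
events = [
    ASEvent("common_bean", "PvGeneA", "J1", "IntronR"),
    ASEvent("soybean", "GmGeneA1", "J1", "IntronR"),
    ASEvent("soybean", "GmGeneA2", "J1", "AltA"),      # same junction, different type
    ASEvent("common_bean", "PvGeneB", "J2", "ExonS"),  # seen in one species only
]
clusters = conserved_clusters(events)
```

In this toy input only the IntronR event at junction group J1 forms a conserved cluster; the AltA event at the same junction is not grouped with it because the event types differ.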


Extensive Species-specific AS Events in WGD Orthologs

Conserved AS events between a single gene in common bean and its orthologs resulting from a recent WGD in soybean were examined in 8,325 gene sets with a ratio of 1 common bean gene to 2 soybean orthologs. These 8,325 gene sets were chosen from the original 14,759 ortholog gene sets with 1:2 ratios based on the following criteria: all genes should be expressed, and at least one gene should be multi-exonic and exhibit a minimum of one AS event. AS events were assigned to five categories (1:2, 1:1, 0:2, 1:0, 0:1) based on their conservation status between orthologous gene copies of common bean and soybean. For example, placement into the 1:2 category means that an AS event is conserved in the one gene copy of common bean and its two corresponding orthologous gene copies in soybean. Similar explanations apply to the other gene copy ratios.

There are 1,432 conserved AS events in category '1:2' (Table 3-6), which are conserved between one gene copy of common bean and the corresponding two orthologous copies resulting from the recent WGD in soybean. Category '1:1' contains 2,230 AS events, which are conserved between one gene copy of common bean and at least one of the two orthologous copies resulting from the recent WGD in soybean (Table 3-6). There are 2,302 AS events that are present in both copies of soybean but absent in common bean (category '0:2' in Table 3-6). The two largest AS event categories are '1:0' (8,497 AS events only in common bean) and '0:1' (21,816 AS events only in soybean), which represent species-specific AS events.

More Than 27,000 Conserved AS Events among Nine Angiosperm Species

Conserved AS events among nine angiosperm species were identified using a novel high-throughput strategy illustrated in Figure 3-7. There are 27,120 conserved AS


events that are conserved between at least two of the nine angiosperm taxa used in this study (Table 3-7). There are 101 AS events conserved across all nine taxa, followed by 201 in eight taxa, 365 in seven taxa, 599 in six taxa, 1,168 in five taxa, 2,394 in four taxa, and 5,876 in three taxa. The majority of conserved AS events (16,416; 60.5%) are conserved between only two species. Among the conserved AS events, IntronR represents the highest proportion (65.6%), followed by AltA (20.5%) and AltD (10.0%), while ExonS makes up only 3.9% of all events (Table 3-8). The proportion of expressed protein-coding multi-exonic genes with at least one AS event conserved in at least one other species is highest for grape (36.2%), followed by Amborella (34.1%), poplar (29.4%), common bean (27.7%), soybean (26.7%), Arabidopsis (24%), Medicago (23.7%), tomato (17%), and rice (16.7%).

The percentage of conserved AS events relative to the total number of conserved AS events was calculated for each pairwise comparison of the species studied (Figure 3-7). Of all the conserved AS events identified in common bean that are conserved in one or more additional species, the largest fraction (68%) is conserved with soybean. This is also the pairwise comparison with the highest level of conservation among all nine taxa. The second-highest fraction of conserved events occurs between Medicago and soybean, with 58% of the conserved AS events within Medicago conserved with soybean (Figure 3-7). This is not unexpected, owing in part to their close phylogenetic relationship (Figure 3-1) and also to the availability of deep transcriptome data (Table 3-3). Interestingly, the majority of the species examined share their largest fraction of conserved AS events with grape (Figure 3-7). The most likely explanation for this is that grape has a


superior transcriptome collection compared to all other species: 114.7 M 100-bp paired-end RNA-seq reads (23 GB) generated by pooling RNA from 45 samples representing various developmental stages, as detailed in Table 3-13 (Venturini et al. 2013). Another possible explanation is that AS fractionation in grape may be very low compared to other species. Of all angiosperms in this study (Figure 3-1), Amborella is the only species that has not undergone any lineage-specific WGD events in addition to the ancient WGD shared by all angiosperms, while grape has undergone only one whole-genome triplication (i.e., two WGDs in close succession) (Jiao et al. 2012), and the rest of the species have undergone at least two or three WGD events, which may have led to high AS fractionation in these species compared to grape.

Ancestral Angiosperm AS Events

Ancestral AS events were estimated at each of the nodes within the species tree of the nine angiosperm species included in this study (Figure 3-1). The ancestral AS event numbers reported in boxes at each node were calculated by requiring that each AS event be conserved between an outgroup species and at least one ingroup species. There are 9,219 CJ AS events conserved between Amborella and at least one other species in the study, indicating that these events may have been present in the MRCA of angiosperms. The highest number of conserved AS events (9,424) is seen at the node between grape and eurosids. This extent of conservation may reflect the comprehensive nature of the grape transcriptome sequence collection, or it may suggest that the grape genome is evolving slowly and has maintained many of the AS events in common with its MRCA (with other rosids). All other nodes have fewer than 6,000 AS events. Because the rate of convergent evolution of AS events in plants is not known, I cannot be sure that all of these


events are strictly ancestral, or what fraction actually represents convergent gains of AS. To minimize interpreting convergently evolved AS events as the ancestral state of the MRCA for each node, I employed a more stringent criterion in which an AS event must be conserved in an outgroup species and at least two ingroup species. Using this criterion, 4,922 events are conserved between Amborella and the other angiosperms in this study, reducing the number of conserved events by about half compared to the previous estimate. The number of conserved events at each of the other nodes was similarly reduced when using the more stringent classification criterion (Figure 3-1).

Overrepresented GO Categories among Genes With Conserved AS

GO category enrichment analysis was done using the exact test module with default parameters (Conesa et al. 2005) to determine whether any GO categories are overrepresented within the 1,099 Arabidopsis genes (compared to all protein-coding genes) that have conserved AS events across at least six angiosperms. For the biological process category, cellular protein modification process, cell death, organic substance metabolism, primary metabolism, metabolism, regulation of cellular process, signaling, cellular process, biological regulation, and response to stimulus were overrepresented (Table 3-11; Figure 3-8). For the cellular component category, nucleus, cell periphery, membrane, plasma membrane, organelle, macromolecular complex, and protein complex were overrepresented (Table 3-11; Figure 3-9). For the molecular function category, kinase activity, catalytic activity, transferase activity, transporter activity, hydrolase activity, ion binding, and mRNA binding were overrepresented (Table 3-11; Figure 3-10). Overrepresentation of GO terms suggests that specific gene categories are retaining


the same AS events, and the apparent AS conservation observed between species may be partially due to selection of parallel events.

I also performed GO category enrichment analysis for 2,264 soybean genes in which conserved AS events are present in two WGD paralogs of soybean and their corresponding ortholog in common bean (1:2 category; Table 3-6). The enrichment was assessed with respect to the 8,325 soybean duplicate gene pairs derived from WGD (Table 3-6). For the biological process category, response to stimulus was overrepresented (Table 3-12; Figure 3-11). For the cellular component category, nucleus (including intracellular membrane-bounded organelle and membrane-bounded organelle) was overrepresented (Table 3-12; Figure 3-12). For the molecular function category, hydrolase activity and nuclear binding, which includes mRNA binding, were overrepresented (Table 3-12). Interestingly, all of these overrepresented GO categories are also part of the aforementioned GO enrichment observed in Arabidopsis genes with conserved AS events across at least six angiosperms. This suggests that genes with AS events that tend to be conserved across species are also preferentially retained in gene copies derived from WGD.

Discussion

Frequency of Genes With AS

To our knowledge, this study reports the largest collection of expressed multi-exonic genes exhibiting AS in plants. I computationally detected AS within genes of nine angiosperms whose annotations were supported by NGS data to enable an examination of AS conservation across plant species. The proportions of multi-exonic genes exhibiting AS within Amborella and grape are 70.4% and 64.4%, respectively, both of which exceed the current estimate in Arabidopsis (61.2%) (Marquez et al. 2012). Our


AS frequency estimates are based on only the four basic and most frequent AS event types (Table 3-4 and Figure 3-3), while the Marquez et al. (2012) AS estimate for Arabidopsis includes an additional six AS event types that are combinations of these four basic AS event types. Therefore, it is likely that our analysis remains an underestimate, and the true extent of AS within Amborella, grape, and the other seven species examined could be even higher than the numbers reported here.

The fraction of expressed multi-exonic genes exhibiting AS within the nine angiosperm taxa examined ranges from approximately 40% to 70%. One explanation for this range may be variation in transcriptome resources. For example, the Amborella RNA-seq reads are 2x100 bp paired-end sequences sampled from diverse tissue types with replicates (Table 3-2 and Table 3-3). The read length and configuration promote accurate mapping and AS event detection, while the diverse tissue sampling promotes identification of events that may be restricted to one or a small subset of tissues. In contrast, the majority of the tomato RNA-seq data consists of 50 bp reads sampled from fruit tissue. These features make the sampled transcripts less diverse and transcriptome reconstruction more difficult, thus affecting AS identification.

Advantages of our Conserved AS Event Identification Strategy

Previous studies investigating genome-wide conserved AS events in plants are limited to at most three species (Darracq and Adams 2013; Baek et al. 2008). The methodologies used in these studies relied either on cross-species transcriptome alignments, as described in the previous section, or on pairwise comparisons between close homologs, or potential orthologs, followed by a search for conserved AS events between gene pairs (Wang and Brendel 2006).
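To give a sense of how pairwise strategies scale with the number of species, the short Python sketch below counts the species pairs and the species subsets (of size two or more) that would have to be considered. This is plain combinatorics offered as an illustration, not part of the thesis pipeline; the function names are hypothetical, and the counts ignore the additional gene-level comparisons within each species pair.

```python
from math import comb


def pairwise_comparisons(n_species: int) -> int:
    """Number of unordered species pairs."""
    return comb(n_species, 2)


def multi_species_subsets(n_species: int) -> int:
    """Number of species subsets containing at least two species."""
    return 2 ** n_species - n_species - 1


for n in (2, 3, 9):
    print(n, pairwise_comparisons(n), multi_species_subsets(n))
```

For the nine taxa used in this chapter, this gives 36 species pairs and 502 subsets of two or more species, compared with 3 pairs and 4 subsets for a three-species study.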


The Darracq and Adams (2013) study aligned Arabidopsis thaliana and Brassica transcriptome data consisting mainly of ESTs to Arabidopsis genes, and conserved AS events were identified from these alignments (as described in the Introduction). Same-species alignments (i.e., Arabidopsis ESTs aligned to Arabidopsis genes) resulted in mapping ~70% of ESTs at high stringency criteria with identity and coverage scores of species alignments (i.e., Brassica sequences aligned to Arabidopsis genes) resulted in mapping only ~40% of ESTs even at low stringency 0). This is a reduction of 30 percentage points in EST alignments. The failure to align many of the Brassica ESTs could be a result of higher sequence divergence between them and their orthologous Arabidopsis genes, or they may represent Brassica-specific genes (i.e., genes that are present only in Brassica and absent in Arabidopsis). Most likely, the inability to identify cross-species alignments is the result of sequence divergence, because ~93% of gene families are common between Arabidopsis and Brassica rapa (Wang et al. 2011). Our AS conservation pipelines align transcriptome data directly against a reference genome of the same species rather than relying on cross-species alignment of transcriptome data (Darracq and Adams 2013), thus increasing alignment accuracy and efficiency.

Additionally, performing pairwise comparisons of close homologs is not easily scalable to simultaneously assessing a large number of taxa, for two reasons: (i) Plants often have lineage-specific WGD events in addition to shared ancient WGD, which means that one gene copy in one species could have more than one potential homolog in another species, i.e., co-orthologs (Gabaldón and Koonin 2013). Considering only one of several homologs as a potential ortholog for reciprocal BLAST-based pairwise


comparisons (Wang and Brendel 2006) will fail to capture conserved AS events with respect to other homologs. (ii) The number of pairwise comparisons to perform between gene pairs grows at least quadratically as the number of species included in the analysis increases.

To overcome these limitations, our conserved AS event identification methodology first identifies AS events for each species, creates a flanking exon sequence tag (FEST) dataset from all species, and subdivides this into separate datasets for each AS event type. The FEST AS event datasets were compared using TBLASTX and BLASTN to identify all possible events conserved between two or more species, which were placed into AS event clusters. TBLASTX alignments allow detection of alignments between orthologous sequences that have high sequence divergence at the nucleotide level but are conserved at the amino acid level. These event clusters were then rebuilt based on OrthoMCL gene clusters (see Materials and Methods and Figure 3-6), in which all orthologous genes are searched for AS conservation. Overall, our method does not rely on either cross-species alignments or pairwise gene comparisons.

Conserved AS Events between Common Bean and Soybean

This study reports 5,202 conserved AS event clusters between common bean and soybean, which is the largest number of conserved AS events between two species reported to date in plants. This number is high in comparison to a recent study that reported only 694 conserved events between Arabidopsis thaliana and Brassica species (Darracq and Adams 2013), both belonging to Brassicaceae. The estimated time of divergence of common bean and soybean is 19 MYA (McClean et al. 2010), while that of Arabidopsis thaliana and Brassica species is 20 MYA (Yang et al. 1999). These estimates are similar, suggesting that differences in divergence times are not likely to be


a major contributing factor to the difference in the number of conserved AS events between our study and Darracq and Adams (2013). The larger number of conserved AS events in our study could be explained by differences in transcriptome resources and in the AS identification methodologies used in these studies. Darracq and Adams (2013) examined only ESTs, whereas our study includes ESTs, mRNA, and high-depth RNA-seq data from diverse tissues, promoting the capture of even rare tissue-specific mRNA isoforms. Additionally, the specific features of our conserved AS identification algorithms discussed above also affect the identification of conserved AS events. Despite having collections of RNA-seq data from similar tissue types from both common bean and soybean (Table 3-3), only about 21-33% and 11-18% of AS events, respectively, are conserved (Table 3-5), which suggests that each species harbors substantial numbers of lineage-specific AS events.

Conserved AS Events in WGD Orthologs

This is the first study in plants to examine conserved AS events within gene pairs arising from WGD (soybean) with respect to their orthologous gene copies in an outgroup (common bean) without a lineage-specific WGD. Conserved AS events were investigated using 8,325 orthologous gene sets between common bean and soybean, in which single-copy genes in common bean have two orthologous copies from WGD in soybean. Only 36% and 19% of AS events in common bean and soybean, respectively, are conserved between a common bean gene and at least one member of the homeologous gene pair representing the soybean co-orthologs (1:2 and 1:1 categories of Table 3-6). These conserved AS event ratio categories represent the AS events that were likely present in the MRCA of common bean and soybean. Thus, the observation that ~65% of AS events (1:0 and 0:1 categories of Table 3-6) are


associated solely with common bean or soybean, respectively, suggests that there were rapid AS gains/losses within these species after their divergence from the MRCA, and some of this may reflect fractionation after the soybean WGD.

Approximately 8% of the conserved AS events are absent in common bean but present in both homeologous gene copies of soybean (0:2 category; Table 3-6). There are four possible scenarios that could account for this: (i) AS events may have arisen independently in duplicate copies at the same position in soybean; (ii) AS events were present in the MRCA of common bean and soybean but were lost in common bean after its divergence from the soybean lineage; (iii) AS events were not present in the MRCA of common bean and soybean but arose within the soybean lineage after their divergence and prior to the soybean-specific WGD event, such that both homeologs have the event; and (iv) the AS event is actually conserved in common bean but was not recovered in our transcriptome dataset. Only the second of these four possible explanations can readily be tested with our data and analysis. The presence of these AS events in the MRCA of common bean and the soybean lineage can be examined by looking for their conservation in a close outgroup. Approximately 43% of these events (Table 3-10) are conserved within at least one other angiosperm examined in this study, suggesting that the second scenario is the most parsimonious explanation for these cases. However, any of the remaining three explanations could account for this observation for at least some AS events, and none can be discounted at this point. Indeed, by the same argument, the remaining events (57%) might well have arisen from the other scenarios, further underscoring that they may be active in the evolution of alternative splice isoforms.
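The five conservation categories used for the 1:2 ortholog sets (1:2, 1:1, 0:2, 1:0, 0:1) amount to a simple decision rule, which can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the code used for Table 3-6: it assumes each AS event has already been scored as present or absent in the common bean gene and in each of the two soybean copies, the function name and boolean inputs are hypothetical, and category 1:1 is treated here as "exactly one soybean copy".

```python
def conservation_category(in_bean: bool, in_soy_copy1: bool, in_soy_copy2: bool) -> str:
    """Return the 'bean:soybean' category label for one AS event.

    The first number is 1 if the event is present in the common bean gene;
    the second is the number of soybean WGD copies (0, 1, or 2) with the event.
    Assumption: 1:1 means exactly one soybean copy shares the event.
    """
    n_soy = int(in_soy_copy1) + int(in_soy_copy2)
    if not in_bean and n_soy == 0:
        raise ValueError("event must be present in at least one gene of the set")
    return f"{int(in_bean)}:{n_soy}"


# Hypothetical presence/absence calls for three events.
print(conservation_category(True, True, True))    # conserved in all three genes
print(conservation_category(False, True, True))   # both soybean copies only
print(conservation_category(True, False, False))  # common bean only
```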


Conserved AS Events among Nine Angiosperm Taxa

There are 27,120 AS events (Table 3-7) found to be conserved between at least two of nine angiosperm taxa using the conserved AS event identification pipeline. This is the largest number of conserved AS events reported to date, a ~38X increase relative to the conserved AS events previously identified (Darracq and Adams 2013). Additionally, this is the first study to investigate genome-wide conserved AS events in plants across more than three species, and it includes a plant species from outside the eudicot and monocot lineages (i.e., Amborella). This approach provides new insight into AS conservation in plants across larger phylogenetic distances and across multiple lineages.

This high-throughput AS conservation methodology is easily scalable to future analyses involving a large number of species that represent a wide phylogenetic distribution. As sequenced genomes and transcriptome data increase in abundance, these methods will remain applicable. Furthermore, most of the earlier AS conservation studies relied on EST sequences, while this study is the first to incorporate all transcriptome sequence resources (ESTs, mRNA, 454, and RNA-seq from multiple tissue types), thus maximizing our ability to identify AS events. Overall, our study is both robust and comprehensive in its identification of conserved AS events.

Application of Conserved AS Events and Future Research

It is clear from our global comparative AS study that several thousand AS events are conserved across plant species, even across large phylogenetic distances. This conservation implies that many of these AS events are functional and have been retained during the course of evolution. Comparative AS studies in the past have helped to prioritize AS events as functional. For example, Fu et al. (2009) compared an exon-skipping event in TFIIIA of Arabidopsis thaliana with other species, including


Oryza sativa, Solanum lycopersicum, Selaginella moellendorffii, Physcomitrella patens (moss), and Chlamydomonas (green algae), and found this event to be highly conserved even across large evolutionary distances. Based on this evidence, they further investigated this AS event using molecular experiments to show its biological function in quantitative autoregulation of TFIIIA homeostasis (Fu et al. 2009). Our study also identified this same TFIIIA exon-skipping event, providing evidence that our pipeline effectively identifies cross-species AS events. Molecular characterization studies similar to Fu et al. (2009) could be initiated on the thousands of conserved AS events that have been identified by our study.

Another use of the conserved AS events identified in this study is to provide datasets that can be mined for gene families showing higher rates of AS, and then to examine AS conservation rates in these gene families both within and across species. One can also investigate correlations between the number of genes exhibiting AS and gene family size. There is evidence from previous studies that certain gene families show higher rates of AS than others (Richardson et al. 2011). A good example of such a gene family is the serine/arginine-rich protein gene family (SR proteins) in plants. SR proteins perform crucial functions in spliceosome assembly, as well as in constitutive and alternative splicing of pre-mRNAs, including their own transcripts (Richardson et al. 2011). Compared to vertebrates, angiosperms have nearly twice the number of genes encoding SR proteins, and AS within SR protein-encoding genes is common. For example, Homo sapiens has 11 SR genes, while Arabidopsis thaliana and Oryza sativa have 18 and 22 SR genes, respectively (Richardson et al. 2011), and 16 of 18 Arabidopsis SR protein genes undergo AS (Richardson et al.


2011). Using our conserved AS event identification pipeline, one can identify gene families that, like the SR protein family, undergo widespread AS and further investigate these events for functional relevance. It would also be interesting to look for AS conservation among gene family members within the same species and to investigate preferential expression of these isoforms from paralogous genes at various developmental stages or in various tissues. Our study identified 11 of 18 SR proteins that have conserved AS events with at least one other species, with the majority of them exhibiting conservation in at least 6 other angiosperms.

Ancestral reconstruction of gene family content and examination of gains and losses of genes relative to the MRCA of various plant lineages give interesting insights into how these changes may have been involved in the evolution of new traits, especially key innovations. To draw accurate conclusions about gene gains and losses, each species should have a nearly complete gene set, which is relatively easy to obtain for sequenced genomes. Similarly, to accurately identify lineage-specific gains or losses of AS events and their implications, one needs comprehensive and uniform transcriptome datasets for all species in the study. Currently, such comprehensive datasets across multiple species are not available in public databases. Although I was able to reconstruct the ancestral state of AS events at various nodes (Figure 3-1), it is not possible to infer the exact origin of these events because our transcriptome datasets are neither uniform nor comprehensive. It is important to continue investigating gains and losses of AS events across various lineages, and their functional implications, as transcriptome datasets increase in depth and sampling diversity.
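The rule used in the Results to place AS events at nodes of the species tree (an event counts as ancestral at a node when it is conserved in an outgroup species and in at least one ingroup species, or at least two under the more stringent criterion) can be sketched in Python as follows. This is an illustrative sketch only: the `is_ancestral` helper, the species labels, and the toy event are hypothetical, and the outgroup/ingroup split shown corresponds to the root node with Amborella as the outgroup.

```python
def is_ancestral(species_with_event, outgroup, ingroup, min_ingroup=1):
    """Return True if the event is seen in at least one outgroup species and
    in at least `min_ingroup` ingroup species for the node being evaluated."""
    present = set(species_with_event)
    has_outgroup = bool(present & set(outgroup))
    n_ingroup = len(present & set(ingroup))
    return has_outgroup and n_ingroup >= min_ingroup


# Root node of the nine-taxon tree: Amborella versus all other angiosperms.
outgroup = {"Amborella"}
ingroup = {"rice", "grape", "poplar", "Arabidopsis",
           "soybean", "common_bean", "Medicago", "tomato"}

event_species = {"Amborella", "grape"}  # hypothetical conserved event
relaxed = is_ancestral(event_species, outgroup, ingroup)                 # one ingroup species suffices
strict = is_ancestral(event_species, outgroup, ingroup, min_ingroup=2)   # needs two ingroup species
```

In this toy case the event passes the relaxed criterion but not the stringent one, which mirrors how the stricter rule lowers the counts reported at each node.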


In our comparison of AS in WGD gene copies of soybean with common bean, I identified an overrepresentation of GO terms among soybean genes (Table 3-12) having conserved AS events between their WGD paralogous copies and their ortholog in common bean. It would be interesting to examine whether genes with the same GO terms are enriched among duplicate gene copies of another species with an independent WGD. This will help in understanding whether certain genes are likely to retain AS events more often than others, irrespective of the lineage in which WGD occurs. With regard to AS event conservation in WGD gene copies, an aspect to be investigated is how AS expression changes within these WGD gene copies, i.e., whether there is preferential expression of one copy and its isoforms, sub-functionalization, or neo-functionalization.

Materials and Methods

Genomic and Transcriptomic Data Collection

Genome assemblies and annotations

Genome assemblies and protein-coding gene annotations corresponding to seven taxa involved in this study were mainly collected from Phytozome v9.0 (Goodstein et al. 2012), with the exception of Amborella and Medicago, which were collected from their corresponding genome sequencing websites. Table 3-1 summarizes the sources from which genome assemblies and their corresponding annotations were collected, along with basic gene annotation metrics. Only protein-coding genes with at least one intron were used in downstream analysis.

Transcriptome collection

Transcriptome data, including ESTs, mRNAs belonging to full and partial CDS, and RNA-seq data representing various tissue types, were collected from both public (NCBI database) and private resources (Table 3-2 and Table 3-3).


RNA-seq Data Processing and Assembly

Three different methodologies (Figure 3-3) were implemented for assembling RNA-seq data (Table 3-3) to maximize the recovery of all possible isoforms. Before feeding RNA-seq data into the various assembly pipelines, the data were trimmed for quality and adapter sequences using Cutadapt (Martin 2011) and Trimmomatic (Lohse et al. 2012). The parameters used to run Cutadapt were error-rate=0.1, times=1, overlap=5, and minimum-length=0, and those for Trimmomatic were HEADCROP:0, LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15, and MINLEN:40. Only reads of length 40 bp or above were used in the following assembly pipelines.

Calculating maximum intron size

One of the important parameters used while aligning and assembling transcriptome reads was the maximum intron size. Directly calculating the maximum intron size from all genes results in picking an extremely large intron size that is often an outlier belonging to a few genes. To avoid this, the intron size parameter was estimated by calculating the 99th percentile of maximum intron sizes and adding it to the mean maximum exon size (Table 3-9). For each intron-containing isoform, the maximum intron and maximum exon sizes were recorded, and overall metrics for them are reported in Table 3-9.

Trinity genome-guided assembly

Because the AS events considered in this study are from protein-coding genes only, Trinity genome-guided assembly was run only on RNA-seq alignments belonging to these gene blocks. Gene sequences along with 2000 bp of flanking regions were extracted with the following rules: (i) in the case of overlapping genes, flanking regions were extracted for the non-overlapping ends, and for the overlapping ends the original

PAGE 140

coordinates were used, and (ii) if genes were overlapping after adding the 2000 bp flanking regions, then the flanking regions were reduced by half. The GSNAP v2012-07-12 (Wu and Nacu 2010) alignment program was used to align RNA-seq data sets against the aforementioned gene sequences along with the genome with masked gene regions. Including the genome with masked gene regions avoids false alignments to genic regions, i.e., cases where an RNA-seq read would have its top alignment to a non-genic region but, in the absence of the genome sequence, would align to a genic region as the top hit. The following parameters were used for running GSNAP: batch=5, suboptimal-levels=0, novelsplicing=1, local-splice-penalty=0, distant-splice-penalty=1, npaths=5, quiet-if-excessive, max-mismatches=0.05, nofails, format=sam, sam-multiple-primaries, split-output, orientation=FR, pairexpect=200, and pairdev=25; localsplicedist and pairmax-rna were set to the same value, the 99th percentile of maximum intron sizes in Table 3-9. All alignments of RNA-seq data sets were de-duplicated and merged using the MarkDuplicates and MergeSamFiles programs of the Picard software package v1.72 (http://picard.sourceforge.net). Genome-guided Trinity vr20130225 (http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html) was run on the merged and cleaned alignments using default parameters.

Trinity de novo assembly

All RNA-seq data sets were merged, and Trinity's in silico read normalization (Haas et al. 2013) was run using the following parameters: JM 175G, max_cov 50, JELLY_CPU 20, min_kmer_cov 2, pairs_together, and PARALLEL_STATS. This normalization process reduces the number of reads that are fed into the Trinity de
novo assembler (Grabherr et al. 2011), thus greatly reducing the run time and memory requirements of the Trinity assembly.

Cufflinks assembly

RNA-seq data sets were aligned against the corresponding reference genomes using TopHat v2.0.9 (Kim et al. 2013). The following parameters were used: read-mismatches 5, read-gap-length 3, read-edit-dist 5, read-realign-edit-dist 0, mate-inner-dist 100, mate-std-dev 50, max-intron-length 8967, num-threads 20, max-multihits 5, library-type fr-unstranded, and GTF; max-segment-intron and max-coverage-intron were set to the same value, the 99th percentile of maximum intron sizes. Values for each species are listed in Table 3-9. All read alignments of RNA-seq data sets were de-duplicated and merged using the MarkDuplicates and MergeSamFiles programs of the Picard software package v1.72 (http://picard.sourceforge.net) and assembled using Cufflinks v2.1.1 (Trapnell et al. 2012b, 2012a), guided by the corresponding genomes and annotations. The following parameters were used for running Cufflinks: GTF-guide, frag-bias-correct, min-isoform-fraction 0.05, multi-read-correct, upper-quartile-norm, library-type fr-unstranded, min-frags-per-transfrag 10, min-intron-length 50, and no-faux-reads; max-intron-length for each species was the same as the 99th percentile of maximum intron sizes listed in Table 3-9. Transcripts resulting from Cufflinks were required to have an FPKM of at least 0.1.

PASA Pipeline

EST, mRNA, and RNA-seq assemblies were run through PASA 2.0 (Haas et al. 2003), which aligns these transcriptome resources to a reference genome, builds transcript assemblies from the alignments, and finally identifies alternative splicing (AS)
events. The following parameters were used for running the PASA pipeline: INVALIDATE_SINGLE_EXON_ESTS and MAX_INTRON_LENGTH (set to the 99th percentile of maximum intron sizes in Table 3-9). By default, PASA only keeps near-perfect transcript alignments with at least 95% identity that cover at least 90% of the transcript length (Campbell et al. 2006). Transcripts were discarded if at least one of their junctions was not supported by a minimum of two reads, or if the retained intron region had less than a median read coverage of two reads. These supporting reads come from the same species as the transcript. The filtered transcripts were rerun through PASA by importing their annotations to identify AS events.

Classification of Alternative Splicing Events

The definitions of the PASA AS event categories are described in Campbell et al. (2006) and differ from the AS event definitions in Wang and Brendel (2006) (Figure 3-3). The main difference is that the latter allow us to define alternative splice sites. AS events defined by PASA were processed through an in-house software pipeline to reclassify them in concordance with the AS event definitions described in Wang and Brendel (2006).

OrthoMCL Clustering

The OrthoMCL pipeline (Li et al. 2003) with standard settings was used to identify potential orthologous gene families (orthogroups) between the species listed in Table 3-1, using protein sequences from the longest isoform of each gene.
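
The read-support filter described under PASA Pipeline can be illustrated with a minimal sketch. It assumes that the number of reads supporting each splice junction and the per-base read coverage of each retained intron have already been computed from the alignments. The function name, argument names, and input layout are hypothetical and are not part of PASA or of the in-house pipeline.

```python
from statistics import median

MIN_JUNCTION_READS = 2     # each junction needs at least two supporting reads (from the text)
MIN_MEDIAN_INTRON_COV = 2  # retained introns need a median coverage of at least two reads


def keep_transcript(junction_read_counts, retained_intron_coverages=()):
    """Return True if a transcript passes the read-support filter.

    junction_read_counts: number of supporting reads for each splice junction.
    retained_intron_coverages: one per-base coverage list per retained intron
    (hypothetical input layout, assumed for this sketch).
    """
    # Discard if any junction has fewer than two supporting reads.
    if any(count < MIN_JUNCTION_READS for count in junction_read_counts):
        return False
    # Discard if any retained intron has a median coverage below two reads.
    for coverage in retained_intron_coverages:
        if not coverage or median(coverage) < MIN_MEDIAN_INTRON_COV:
            return False
    return True
```

Using the median rather than the mean keeps a few deeply covered bases from rescuing an intron that is otherwise unsupported.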

Identification of Conserved AS Events between Taxa

For each alternative splicing event, flanking sequences from the exons upstream and downstream of the intron involved in AS were extracted. From each exon, a sequence of at least 30 bp and up to 300 bp was extracted. These concatenated splice junction sequences were called flanking exon sequence tags (FESTs). Figure 3-6 illustrates the creation of FESTs. For each AS event type, FESTs were extracted and a WU-BLAST database was created. FESTs were aligned to the FEST database of their corresponding AS type using WU-BLASTN and WU-TBLASTX v2.0 (Gish 1996). AS events between two genes were classified as conserved if the two genes belonged to the same orthogroup, their FESTs exhibited the same AS type, and their FESTs matched.
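
The FEST construction and the three conservation criteria above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: all names are hypothetical, exons shorter than 30 bp are assumed to be skipped (the text does not say how they were handled), and the BLAST comparison is replaced by a boolean supplied by the caller.

```python
from dataclasses import dataclass

MIN_FLANK = 30   # minimum bases taken from each flanking exon (from the text)
MAX_FLANK = 300  # maximum bases taken from each flanking exon (from the text)


def build_fest(upstream_exon, downstream_exon):
    """Concatenate the exon ends that flank an intron into one tag.

    Returns None when either exon is shorter than MIN_FLANK
    (an assumption made for this sketch).
    """
    if len(upstream_exon) < MIN_FLANK or len(downstream_exon) < MIN_FLANK:
        return None
    # Take the 3' end of the upstream exon and the 5' start of the downstream exon.
    return upstream_exon[-MAX_FLANK:] + downstream_exon[:MAX_FLANK]


@dataclass(frozen=True)
class ASEvent:
    gene: str
    as_type: str     # "AltA", "AltD", "ExonS", or "IntronR"
    orthogroup: str  # hypothetical orthogroup label, e.g. from OrthoMCL


def is_conserved(a, b, fests_match):
    """Apply the three criteria: same orthogroup, same AS type, matching FESTs.

    fests_match stands in for a WU-BLAST hit between the two FESTs.
    """
    return a.orthogroup == b.orthogroup and a.as_type == b.as_type and fests_match
```

Keeping the sequence comparison outside `is_conserved` reflects that the match itself comes from an external aligner; the function only combines its result with the orthogroup and event-type checks.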

Figure 3-1. Ancestral CJ AS events in flowering plants. Species of interest are marked in bold. Ancestral AS events are reported in a box adjacent to each divergence node (events conserved with at least one other species other than the outgroup are in bold, and those conserved with at least two other species are in plain font). Branch lengths are not drawn to scale. Divergence of clades and WGD event timings are in MYA and are italicized; these are based on the following: (Roulin et al. 2012; Jiao et al. 2011; Amborella Genome Project 2013; Fawcett et al. 2009; Young et al. 2011; Tuskan et al. 2006; McClean et al. 2010; Paterson et al. 2012; Woodhouse et al. 2011).

Figure 3-2. Workflow of transcriptome data pre-processing and PASA assembly.

Figure 3-3. Types of AS events. A) Alternative acceptor. B) Alternative donor. C) Exon skip. D) Intron retention.

Figure 3-4. Frequencies of AS events.

Figure 3-5. Frequencies of AS in expressed multi-exonic genes.

Figure 3-6. Conserved junction (CJ) alternative splicing (AS) events identification pipeline.

Figure 3-7. Percentages of conserved AS events shared between species. To read this table: for example, 28.33% of the 12,242 conserved AS events of Amborella are conserved with Arabidopsis. Percentages are given with respect to the species in the rows.

Figure 3-8. Visualization of enriched GO terms associated with AS events conserved across at least six angiosperms (Supek et al. 2011).

Figure 3 conserved AS events conserved across at least six angiosperms (Supek et al. 2011).

Figure 3 conserved AS events conserved across at least six angiosperms (Supek et al. 2011).

Figure 3 AS events in two WGD paralogs of soybean and their corresponding ortholog in common bean (Supek et al. 2011).

Figure 3 category of genes having conserved AS events in two WGD paralogs of soybean and their corresponding ortholog in common bean (Supek et al. 2011).

Figure 3 category of genes having conserved AS events in two WGD paralogs of soybean and their corresponding ortholog in common bean (Supek et al. 2011).

Table 3-1. Genome sequence and annotation resources.
Species | Protein-Coding Genes: All | Multi-Exonic | Expressed Multi-Exonic | Source of Collection
Amborella trichopoda (Amborella) | 26,846 | 17,187 | 14,626 | http://amborella.org/; Annotation version 1.0 (Chamala et al. 2013; Amborella Genome Project 2013)
Arabidopsis thaliana (Arabidopsis) | 27,206 | 21,236 | 19,637 | Phytozome 9.0; Annotation TAIR 10 (Swarbreck et al. 2008)
Glycine max (Soybean) | 54,175 | 45,369 | 36,789 | Phytozome 9.0; Annotation Version 1.1 (Schmutz et al. 2010)
Medicago truncatula (Medicago) | 50,895 | 39,323 | 21,889 | Mt4.0v1; http://www.jcvi.org/medicago/ (Young et al. 2011; Tang et al. 2014)
Oryza sativa (Rice) | 38,867 | 29,098 | 20,760 | Phytozome 9.0; MSU Release 7.0 (excluding ChrUn and ChrSy molecules) (Ouyang et al. 2007)
Phaseolus vulgaris (Common bean) | 27,198 | 22,620 | 19,910 | Phytozome 9.0; Annotation version 1.0; http://www.phytozome.net/commonbean
Populus trichocarpa (Poplar) | 41,336 | 33,412 | 24,712 | Phytozome 9.0; Annotation version JGI v3.0 assembly v3 (Tuskan et al. 2006)
Solanum lycopersicum (Tomato) | 34,728 | 26,220 | 19,168 | Phytozome 9.0; Annotation version ITAG2.3 (Sato et al. 2012)
Vitis vinifera (Grape) | 26,347 | 24,448 | 18,053 | Phytozome 9.0; Annotation version as of March 2010 (Jaillon et al. 2007)

Table 3-2. EST, mRNA, 454, and RNA-seq sequence data summary.
Species | EST | mRNA | 454 Sequence | RNA-Seq
Amborella trichopoda | 38,147 | 14 | 2,243,371 | 2X72: 201.9M (30.2 Gb); 2X101: 242.7M (49 Gb)
Arabidopsis thaliana | 1,529,700 | 81,157 | NA | 1X82bp: 89.1M (7.31 Gb); 1X101bp: 948.4M (95.8 Gb); 2X76: 158.7M (24.1 Gb)
Glycine max | 1,461,722 | 2,231 | NA | 2X101: 593M (119.8 Gb); 1X76: 58.4M (4.4 Gb)
Medicago truncatula | 269,501 | 46,682 | NA | 2X101: 79.1M (16 Gb); 1X101: 489.8M (49 Gb)
Oryza sativa | 89,943 | 33,547 | NA | 2X104: 64.1M (13 Gb); 2X75: 81M (12.3 Gb)
Phaseolus vulgaris | 125,490 | 381 | NA | 2X100: 371M (74.2 Gb)
Populus trichocarpa | 89,943 | 393 | NA | 2X101: 563.8M (114 Gb)
Solanum lycopersicum | 298,306 | 1,518 | 3,399,630 | 2X50: 196.9M (19.5 Gb); 1X50: 110.2M (5.4 Gb)
Vitis vinifera | 446,668 | 978 | NA | 2X51: 147.1M (15 Gb); 2X100: 354.8M (71 Gb)

Table 3-3. RNA-seq tissue types and download sources.
Species | RNA-seq Tissue | Source of RNA-seq Data Collection
Amborella trichopoda | Apical meristem, flower, flower buds, fruit, leaves, roots, shoot, and whole plant normalized | Data from oneKP project (http://www.onekp.com): Sample code "URDJ". AAGP website (http://ancangio.uga.edu/illumina data): AmTr_ap_mer, AmTr_fem_bud, AmborellaWPN 1, and AmborellaWPN 2. Unpublished data from Amborella Genome Project (http://www.amborella.org/)
Arabidopsis thaliana | Floral bud, flower, root, seed, and siliques | NCBI SRA Accession: SRR314813, SRR314814, SRR314815, SRR360147, SRR360152, SRR360153, SRR360154, SRR360205, SRR391051, SRR391052, SRR505743, SRR505744, SRR505745, SRR505746
Glycine max | Floral buds, flower, leaves, nodules, pod, root, root hairs, SAM, seed, and stem | NCBI SRA Accession: SRR203366, SRR203367. Collaborators Jeremy Schmutz and Scott Jackson provided an additional RNA-seq data set.
Medicago truncatula | Multiple tissues pooled, root, and seedling | NCBI SRA Accession (Illumina): SRR670348, SRR670349, SRR670350, SRR670345, SRR670346, SRR670347, SRR670351, SRR670352, SRR670353, SRR670354, SRR670355, SRR670356, SRR670357, SRR670358, SRR670383, SRR670400, SRR670403, SRR670404. The pooled multiple-tissue RNA-seq data are from Tang et al. (2014).

Table 3-3. Continued
Species | RNA-seq Tissue | Source of RNA-seq Data Collection
Oryza sativa | Leaf, panicle, root, and young ear | NCBI SRA Accession (Illumina): DRR013722, DRR013723, SRR606414, SRR606408, SRR037739, SRR037738, SRR072076, and SRR072077
Phaseolus vulgaris | Flower buds, flowers, leaves, nodules, pods, roots, stem, and trifoliates | Collaborator Scott Jackson provided RNA-seq data, which can be accessed via the following URL: ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Pvulgaris/related_files/expression/bam/
Populus trichocarpa | Buds, cambium/phloem, flowers, leaves, petiole, roots, seeds, suckers, and twigs | The P. tremula RNA-Seq expression atlas dataset was provided by the Umeå Plant Science Centre (personal communication with Nathaniel Street, Umeå University, Sweden) and is available from the ENA repository under accession ID ERP004398. The data are also made available for visualisation of expression within the various samples at the PopGenIE resource (Sjödin et al. 2009).

Table 3-3. Continued
Species | RNA-seq Tissue | Source of RNA-seq Data Collection
Solanum lycopersicum | Flower, flower bud, fruit developmental stages, leaf, meristem, pericarp of fruit, pollen, pollinated style, root, stem, and unpollinated style | NCBI SRA Accession (454 GS FLX Titanium): SRR363116, SRR363117, SRR363118, SRR363119, SRR363120, SRR363121, SRR363122, SRR088753, SRR088751, SRR088749, SRR088748, SRR088747, SRR088745, SRR088744, SRR088743, SRR088742, SRR088741, SRR088740, SRR088739, SRR088738, SRR088737, SRR088736, SRR088735, SRR088734, SRR088733, SRR088732. NCBI SRA Accession (Illumina): SRR404309, SRR404310, SRR404311, SRR404312, SRR404313, SRR404314, SRR404315, SRR404316, SRR404317, SRR404318, SRR404319, SRR404320, SRR404321, SRR404322, SRR404324, SRR404325, SRR404326, SRR404327, SRR404328, SRR404329, SRR404331, SRR404333, SRR404334, SRR404336, SRR404338, SRR404339, SRR412747, SRR412748, SRR567999, SRR568000
Vitis vinifera | Fruit, leaves, and multiple tissues pooled | NCBI SRA Accession (Illumina): SRR519449, SRR519450, SRR519451, SRR519452, SRR519453, SRR519454, SRR519455, SRR519456, SRR520374, SRR520376, SRR520378, SRR520379, SRR520380, SRR520381, SRR520382, SRR520384, SRR520385, SRR520386, SRR520387, SRR520388, SRR522298, SRR522471, SRR522472, SRR522473, SRR522474, SRR522475, SRR522477, SRR522478, SRR522479, SRR522484

Table 3-4. Global AS events.
AS Type | Measure | Amborella | Arabidopsis | Soybean | Medicago | Rice | Common Bean | Poplar | Tomato | Grape
AltA | Events (%) | 9,427 (18.5%) | 5,377 (19.2%) | 13,056 (23.3%) | 6,115 (22.3%) | 6,110 (21.7%) | 5,675 (25.8%) | 7,807 (20.5%) | 3,918 (23.8%) | 7,373 (16.1%)
AltA | Genes (%) | 5,342 (36.5%) | 3,929 (20.0%) | 8,774 (23.8%) | 4,256 (19.4%) | 4,212 (20.3%) | 3,982 (20.0%) | 5,383 (21.8%) | 3,069 (16.0%) | 4,794 (26.6%)
AltD | Events (%) | 7,166 (14.0%) | 3,168 (11.3%) | 9,055 (16.2%) | 4,237 (15.4%) | 3,514 (12.4%) | 3,575 (16.2%) | 4,541 (12%) | 2,127 (12.9%) | 5,315 (11.6%)
AltD | Genes (%) | 5,342 (30.0%) | 3,929 (12.4%) | 8,774 (17.2%) | 4,256 (14.1%) | 4,212 (12.7%) | 3,982 (13.6%) | 5,383 (14.2%) | 3,069 (9.1%) | 4,794 (20.4%)
ExonS | Events (%) | 6,119 (12%) | 1,186 (4.2%) | 5,647 (10.1%) | 2,409 (8.8%) | 2,493 (8.8%) | 2,316 (10.5%) | 2,491 (6.6%) | 1,734 (10.6%) | 3,850 (8.4%)
ExonS | Genes (%) | 3,339 (22.8%) | 866 (4.4%) | 3,607 (9.8%) | 1,644 (7.5%) | 1,679 (8.1%) | 1,536 (7.7%) | 1,820 (7.4%) | 1,306 (6.8%) | 2,387 (13.2%)
IntronR | Events (%) | 28,328 (%) | 18,325 (65.3%) | 28,262 (50.4%) | 14,710 (53.6%) | 16,100 (57.1%) | 10,440 (47.4%) | 23,190 (61%) | 8,655 (52.7%) | 29,184 (63.8%)
IntronR | Genes (%) | 8,693 (59.4%) | 8,438 (43.0%) | 12,870 (35.0%) | 7,047 (32.2%) | 7,177 (34.6%) | 5,798 (29.1%) | 10,412 (42.1%) | 4,888 (25.5%) | 10,071 (55.8%)
Total | Events | 51,041 | 28,057 | 56,021 | 27,472 | 28,218 | 22,007 | 38,030 | 16,435 | 45,723
Total | Genes (%) | 10,292 (70.4%) | 10,398 (52.9%) | 18,476 (50.2%) | 9,781 (44.7%) | 9,641 (46.4%) | 8,932 (44.9%) | 13,152 (53.2%) | 7,503 (39.1%) | 11,628 (64.4%)

Table 3-5. Conserved AS events between common bean (CB) and soybean (SB) at gene family level.
AS Type | Measure | CB | SB
AltA | Conserved Event Clusters | 1,563 | 1,563
AltA | Conserved Events § | 1,976 (35%) | 2,737 (21%)
AltA | Conserved Event Genes § | 1,518 (38%) | 2,123 (24%)
AltA | Total Events | 5,675 | 13,056
AltA | Total Genes | 3,982 | 8,774
AltD | Conserved Event Clusters | 807 | 807
AltD | Conserved Events | 1,023 (29%) | 1,434 (16%)
AltD | Conserved Event Genes | 809 (30%) | 1,111 (18%)
AltD | Total Events | 3,575 | 9,055
AltD | Total Genes | 2,705 | 6,319
ExonS | Conserved Event Clusters | 295 | 295
ExonS | Conserved Events | 417 (18%) | 605 (11%)
ExonS | Conserved Event Genes | 300 (20%) | 420 (12%)
ExonS | Total Events | 2,316 | 5,647
ExonS | Total Genes | 1,536 | 3,607
IntronR | Conserved Event Clusters | 2,537 | 2,537
IntronR | Conserved Events | 3,798 (36%) | 5,286 (19%)
IntronR | Conserved Event Genes | 2,381 (41%) | 3,255 (25%)
IntronR | Total Events | 10,440 | 28,262
IntronR | Total Genes | 5,798 | 12,870
Total | Conserved Event Clusters | 5,202 | 5,202
Total | Conserved Events | 7,214 (33%) | 10,062 (18%)
Total | Conserved Event Genes | 4,020 (45%) | 5,671 (31%)
Total | Total Events | 22,006 | 56,020
Total | Total Genes | 8,931 | 18,475
§ Percentage is relative to total events and genes in each AS type.

Table 3-6. Conserved AS events in WGD orthologs. Column groups give the ratio of gene copies in common bean (CB) to gene copies in soybean (SB).
AS Type | Measure | 1:2 CB | 1:2 SB | 1:1 CB | 1:1 SB | 0:2 CB | 0:2 SB | 1:0 CB | 1:0 SB | 0:1 CB | 0:1 SB | Total Conserved CB | Total Conserved SB
AltA | Conserved Event Clusters | 489 | 489 | 581 | 581 | 0 | 540 | NA | NA | NA | NA | 1,070 | 1,070
AltA | Events | 599 | 1,218 | 653 | 693 | 0 | 1,288 | 1,938 | 0 | 0 | 4,878 | 1,252 | 1,911
AltA | Genes | 467 | 934 | 552 | 568 | 0 | 1,038 | 1,525 | 0 | 0 | 3,681 | 977 | 1,460
AltD | Conserved Event Clusters | 252 | 252 | 354 | 354 | 0 | 332 | NA | NA | NA | NA | 935 | 935
AltD | Events | 320 | 660 | 422 | 416 | 0 | 797 | 1,431 | 0 | 0 | 3,799 | 1,075 | 1,109
AltD | Genes | 249 | 498 | 342 | 347 | 0 | 651 | 1,159 | 0 | 0 | 2,865 | 576 | 829
ExonS | Conserved Event Clusters | 93 | 93 | 145 | 145 | 0 | 144 | NA | NA | NA | NA | 238 | 238
ExonS | Events | 123 | 264 | 197 | 191 | 0 | 387 | 930 | 0 | 0 | 2,303 | 320 | 455
ExonS | Genes | 92 | 184 | 136 | 139 | 0 | 272 | 662 | 0 | 0 | 1,577 | 223 | 317
IntronR | Conserved Event Clusters | 598 | 598 | 1,150 | 1,150 | 0 | 1,286 | NA | NA | NA | NA | 1,748 | 1,748
IntronR | Events | 865 | 1,772 | 1,454 | 1,458 | 0 | 3,236 | 4,198 | 0 | 0 | 10,836 | 2,319 | 3,230
IntronR | Genes | 523 | 1,046 | 1,016 | 1,049 | 0 | 2,132 | 2,773 | 0 | 0 | 5,961 | 1,433 | 1,984
Total | Conserved Event Clusters | 1,432 | 1,432 | 2,230 | 2,230 | 0 | 2,302 | NA | NA | NA | NA | 3,662 | 3,662
Total | Events | 1,907 (15%) | 3,914 (11%) | 2,726 (21%) | 2,758 (8%) | 0 | 5,708 (17%) | 8,497 (65%) | 0 | 0 | 21,816 (64%) | 4,633 (35%) | 6,672 (20%)
Total | Genes | 1,132 | 2,264 | 1,761 | 1,883 | 0 | 3,464 | 4,274 | 0 | 0 | 9,053 | 2,542 (31%) | 3,757 (23%)

Table 3-7. Conserved AS events at gene family level.
Number of Species | AltA | AltD | ExonS | IntronR | Total | Total %
2 | 3,691 | 1,945 | 808 | 9,972 | 16,416 | 60.5
3 | 1,053 | 435 | 142 | 4,246 | 5,876 | 21.7
4 | 362 | 141 | 36 | 1,855 | 2,394 | 8.8
5 | 217 | 70 | 22 | 859 | 1,168 | 4.3
6 | 116 | 57 | 13 | 413 | 599 | 2.2
7 | 64 | 34 | 13 | 254 | 365 | 1.3
8 | 39 | 24 | 10 | 128 | 201 | 0.7
9 | 20 | 8 | 5 | 68 | 101 | 0.4
Total | 5,562 | 2,714 | 1,049 | 17,795 | 27,120
Total % | 20.5 | 10.0 | 3.9 | 65.6 | 20.5
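
As a quick arithmetic check on Table 3-7, the short sketch below recomputes the overall total and the share of events conserved in at least three species from the Total column. It reads each row as the number of events conserved across exactly that many species, which is an interpretation suggested by the rows summing to the table total. The variable names are illustrative only.

```python
# Total column of Table 3-7, keyed by the number of species sharing the event.
events_by_species_count = {
    2: 16416, 3: 5876, 4: 2394, 5: 1168,
    6: 599, 7: 365, 8: 201, 9: 101,
}

# Sum of all rows; this should reproduce the table's overall total.
total = sum(events_by_species_count.values())

# Events shared by three or more species.
at_least_three = sum(
    count for species, count in events_by_species_count.items() if species >= 3
)

# Share of all conserved events that are found in at least three species.
share_at_least_three = round(100 * at_least_three / total, 1)

print(total, at_least_three, share_at_least_three)
```

The rows sum to the 27,120 events reported in the table, and the share for three or more species is the complement of the 60.5% listed for exactly two species.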

Table 3-8. Genes with conserved AS events across at least one other species.
AS Type | Number of Genes With Conserved AS Events | Number of Genes: Amborella | Arabidopsis | Soybean | Medicago | Rice | Common Bean | Poplar | Tomato | Grape
AltA | 14,745 | 1,322 | 1,030 | 3,563 | 1,690 | 830 | 2,040 | 1,848 | 893 | 1,529
AltD | 7,323 | 644 | 458 | 1,882 | 904 | 360 | 1,099 | 858 | 365 | 753
ExonS | 2,796 | 305 | 120 | 689 | 310 | 136 | 433 | 312 | 176 | 315
IntronR | 39,992 | 4,326 | 4,009 | 6,945 | 3,864 | 2,896 | 3,648 | 6,051 | 2,434 | 5,819
Total | 50,792 | 4,993 | 4,710 | 9,827 | 5,182 | 3,511 | 5,517 | 7,256 | 3,258 | 6,538
Total % § | | 34.1 | 24.0 | 26.7 | 23.7 | 16.9 | 27.7 | 29.4 | 17.0 | 36.2
§ Percentage is based on expressed protein-coding multi-exonic genes.

Table 3-9. Intron sizes used while performing transcriptome alignments and assemblies.
Species | Mean exon size | 99th percentile of maximum intron size per gene | Intron size used in transcriptome alignments and assemblies (mean exon size + 99th percentile)
Amborella trichopoda | 517 | 26,957 | 27,474
Arabidopsis thaliana | 753 | 1,586 | 2,339
Glycine max | 868 | 8,099 | 8,967
Medicago truncatula | 673 | 4,625 | 5,556
Oryza sativa | 344 | 5,625 | 6,623
Phaseolus vulgaris | 815 | 6,081 | 6,896
Populus trichocarpa | 826 | 4,590 | 5,416
Solanum lycopersicum | 627 | 6,731 | 7,358
Vitis vinifera | 636 | 20,134 | 20,770
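
The intron size parameter summarized in Table 3-9 (99th percentile of per-isoform maximum intron size plus the mean of per-isoform maximum exon size) can be sketched as below. The percentile method is not stated in the text, so a nearest-rank percentile is assumed here; the function names and the input layout are hypothetical.

```python
from statistics import mean


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct (assumed method, not from the text)."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def intron_size_parameter(isoforms):
    """isoforms: list of (intron_sizes, exon_sizes) tuples, one per isoform.

    Isoforms without introns are ignored, since only intron-containing
    isoforms contribute to the estimate.
    """
    with_introns = [(introns, exons) for introns, exons in isoforms if introns]
    max_introns = [max(introns) for introns, _ in with_introns]
    max_exons = [max(exons) for _, exons in with_introns]
    # 99th percentile of maximum intron sizes plus mean maximum exon size.
    return percentile_nearest_rank(max_introns, 99) + round(mean(max_exons))
```

For Amborella, for example, the values in Table 3-9 combine as 26,957 + 517 = 27,474. Taking the 99th percentile rather than the absolute maximum keeps a handful of outlier introns from inflating the search space for every alignment.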

Table 3-10. Conserved AS event retention and loss categories among WGD orthologs between common bean (CB) and soybean (SB), and their conservation with at least one other angiosperm.
Measure | 1:2 CB | 1:2 SB | 1:1 CB | 1:1 SB | 0:2 CB | 0:2 SB | 1:0 CB | 1:0 SB | 0:1 CB | 0:1 SB
AS Events | 1,907 | 3,914 | 2,726 | 2,758 | NA | 5,708 | 8,497 | NA | NA | 21,816
Conserved with one other species apart from CB and SB | 1,252 (65.7%) | 2,451 (62.6%) | 1,445 (53%) | 1,374 (49.8%) | NA | 2,460 (43.1%) | 2,574 (30.3%) | NA | NA | 5,378 (24.7%)

Table 3-11. GO Enrichment Analysis (Fisher's Exact Test) with BLAST2GO of genes having conserved AS events conserved across at least six angiosperms.
GO ID | Term | Category | P Value | Over/Under
GO:0003824 | catalytic activity | F | 1.42E-47 | over
GO:0016301 | kinase activity | F | 1.19E-35 | over
GO:0016772 | transferase activity, transferring phosphorus-containing groups | F | 4.29E-32 | over
GO:0043167 | ion binding | F | 8.64E-28 | over
GO:0071944 | cell periphery | C | 1.66E-26 | over
GO:0016740 | transferase activity | F | 2.14E-25 | over
GO:0005886 | plasma membrane | C | 1.42E-24 | over
GO:0016020 | Membrane | C | 1.49E-24 | over
GO:0044238 | primary metabolic process | P | 3.74E-24 | over
GO:0071704 | organic substance metabolic process | P | 3.77E-24 | over
GO:0005488 | Binding | F | 1.47E-20 | over
GO:0036211 | protein modification process | P | 8.18E-19 | over
GO:0006464 | cellular protein modification process | P | 8.18E-19 | over
GO:0009987 | cellular process | P | 1.05E-18 | over
GO:0043412 | macromolecule modification | P | 1.18E-18 | over
GO:0044699 | single-organism process | P | 2.10E-18 | over
GO:0022857 | transmembrane transporter activity | F | 1.13E-17 | over
GO:0055085 | transmembrane transport | P | 3.34E-16 | over
GO:0008152 | metabolic process | P | 4.04E-16 | over
GO:0050789 | regulation of biological process | P | 1.03E-15 | over

Table 3-11. Continued
GO ID | Term | Category | P Value | Over/Under
GO:0044763 | single-organism cellular process | P | 1.34E-15 | over
GO:0005215 | transporter activity | F | 1.38E-15 | over
GO:0007165 | signal transduction | P | 1.54E-15 | over
GO:0051716 | cellular response to stimulus | P | 1.54E-15 | over
GO:0050794 | regulation of cellular process | P | 1.56E-15 | over
GO:0023052 | Signaling | P | 9.42E-15 | over
GO:0007154 | cell communication | P | 9.42E-15 | over
GO:0044700 | single-organism signaling | P | 9.42E-15 | over
GO:0016787 | hydrolase activity | F | 1.18E-13 | over
GO:0016798 | hydrolase activity, acting on glycosyl bonds | F | 2.56E-13 | over
GO:0065007 | biological regulation | P | 2.80E-13 | over
GO:0044267 | cellular protein metabolic process | P | 1.32E-12 | over
GO:0019538 | protein metabolic process | P | 1.61E-12 | over
GO:0004871 | signal transducer activity | F | 2.51E-12 | over
GO:0060089 | molecular transducer activity | F | 2.51E-12 | over
GO:0044765 | single-organism transport | P | 5.30E-12 | over
GO:0050896 | response to stimulus | P | 1.13E-11 | over
GO:0043170 | macromolecule metabolic process | P | 3.19E-11 | over
GO:0044260 | cellular macromolecule metabolic process | P | 4.11E-11 | over
GO:0005975 | carbohydrate metabolic process | P | 8.85E-11 | over

Table 3-11. Continued
GO ID | Term | Category | P Value | Over/Under
GO:0051234 | establishment of localization | P | 2.83E-10 | over
GO:0006810 | Transport | P | 2.83E-10 | over
GO:0051179 | Localization | P | 2.85E-10 | over
GO:0008219 | cell death | P | 8.49E-09 | over
GO:0016265 | Death | P | 8.49E-09 | over
GO:0006397 | mRNA processing | P | 3.70E-08 | over
GO:0016071 | mRNA metabolic process | P | 3.70E-08 | over
GO:0006396 | RNA processing | P | 4.07E-08 | over
GO:0044237 | cellular metabolic process | P | 6.55E-08 | over
GO:0002376 | immune system process | P | 3.17E-07 | over
GO:0016070 | RNA metabolic process | P | 9.77E-07 | over
GO:0044710 | single-organism metabolic process | P | 2.21E-06 | over
GO:0061024 | membrane organization | P | 4.16E-06 | over
GO:0006950 | response to stress | P | 2.34E-05 | over
GO:0042578 | phosphoric ester hydrolase activity | F | 2.40E-05 | over
GO:0016791 | phosphatase activity | F | 2.40E-05 | over
GO:0016757 | transferase activity, transferring glycosyl groups | F | 7.38E-05 | over
GO:0016491 | oxidoreductase activity | F | 7.53E-05 | over
GO:0021700 | developmental maturation | P | 8.66E-05 | over
GO:0016788 | hydrolase activity, acting on ester bonds | F | 1.66E-04 | over

Table 3-11. Continued
GO ID | Term | Category | P Value | Over/Under
GO:0003723 | RNA binding | F | 6.57E-04 | over
GO:0007568 | Aging | P | 6.66E-04 | over
GO:0044281 | small molecule metabolic process | P | 1.98E-03 | over
GO:0019439 | aromatic compound catabolic process | P | 2.55E-03 | over
GO:1901361 | organic cyclic compound catabolic process | P | 2.55E-03 | over
GO:0044270 | cellular nitrogen compound catabolic process | P | 2.55E-03 | over
GO:0044248 | cellular catabolic process | P | 2.55E-03 | over
GO:0034655 | nucleobase-containing compound catabolic process | P | 2.55E-03 | over
GO:0046700 | heterocycle catabolic process | P | 2.55E-03 | over
GO:0003729 | mRNA binding | F | 2.56E-03 | over
GO:0044822 | poly(A) RNA binding | F | 2.56E-03 | over
GO:1901575 | organic substance catabolic process | P | 2.58E-03 | over
GO:0019748 | secondary metabolic process | P | 3.70E-03 | over
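
For readers unfamiliar with the test behind the P values in Table 3-11, the sketch below computes a one-sided Fisher's exact test for over-representation from the hypergeometric distribution. It is only an illustration: the exact Blast2GO settings (sidedness, multiple-testing correction) are not given in the text, and the counts in the usage line are made up.

```python
from math import comb


def fisher_exact_over(k, n, K, N):
    """One-sided P value for over-representation of a GO term.

    k: genes in the test set annotated with the term
    n: size of the test set
    K: genes in the reference set annotated with the term
    N: size of the reference set (which includes the test set)
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    # Sum the probabilities of all outcomes at least as extreme as k.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical counts: 40 of 200 test genes carry a term held by 500 of 10,000 genes.
p_value = fisher_exact_over(40, 200, 500, 10000)
```

Exact integer binomials are used so that very small tail probabilities are not lost to rounding before the final division.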

Table 3-12. GO Enrichment Analysis (Fisher's Exact Test) with BLAST2GO of genes having CJ AS events conserved in 1:2 categories of soybean. In Category Biological Process.
GO ID | Term | Category | P Value | Over/Under
GO:0003676 | nucleic acid binding | F | 1.30E-08 | over
GO:0005634 | Nucleus | C | 3.04E-05 | over
GO:0004518 | nuclease activity | F | 2.03E-04 | over
GO:0016788 | hydrolase activity, acting on ester bonds | F | 2.03E-04 | over
GO:0003723 | RNA binding | F | 2.11E-04 | over
GO:0097159 | organic cyclic compound binding | F | 1.31E-03 | over
GO:1901363 | heterocyclic compound binding | F | 1.31E-03 | over
GO:0009991 | response to extracellular stimulus | P | 2.25E-03 | over
GO:0043231 | intracellular membrane-bounded organelle | C | 3.64E-03 | over
GO:0043227 | membrane-bounded organelle | C | 3.64E-03 | over

Table 3-13. Vitis vinifera cv. Corvina samples pooled for RNA-seq run (23 Gb; 114.7 M; 2X100) by (Venturini et al. 2013).
Sample / organ | Developmental stages collected | Total Samples
Bud | first season latent bud, winter/dormant bud, bud scales opening, wooly cotton bud (green showing), bud after bud burst (shoot with one leaf visible) | 5
Inflorescence | young inflorescence, well-developed inflorescence (single flower separated), flowering begins (30% caps off) | 3
Tendril | young tendril, mature tendril | 2
Leaf | young leaf (5 leaves separated), mature leaf, senescing leaf (beginning of leaf fall) | 3
Berry (whole) | fruit set | 1
Berry Skin | post-fruit set, véraison, pre-ripening, ripening, post-harvest withering process I, II, and III | 7
Berry Flesh | post-fruit set, véraison, pre-ripening, ripening, post-harvest withering process I, II, and III | 7
Seed | fruit set, post-fruit set, véraison, full mature seed (pool from pre-ripening and ripening) | 4
Rachis | fruit set, post-fruit set, véraison, pre-ripening, and ripening | 5
Stem | green stem (from the cane), woody stem (complete cane maturation) | 2
 | Pool | 1
Seedling | pool (from 3 different stages of seedling) | 1
Anther | pool (from 2 different stages of flower development) | 1
Pollen | Pool | 1
Carpel | pool (from 2 different stages of flower development) | 1
Petal | pool (from 2 different stages of flower development) | 1
Total | | 45

CHAPTER 4
CONCLUSIONS

The goals of this research were to build a high-quality reference genome for Amborella trichopoda (Amborellaceae), to use this reference to examine the conservation of alternative splicing (AS) events across angiosperms, and to examine the impact of whole genome duplication on the evolution of AS.

The work presented in Chapter 2 demonstrates a new paradigm for carrying out future genome sequencing projects, in which high-throughput whole genome maps may replace traditional genetic or physical maps to assist the generation of contiguous genome assemblies. This is particularly relevant for genome sequencing projects of non-model organisms like Amborella, which have minimal or no pre-existing genome resources such as genetic or physical maps. More than 2-fold increases (Table 2-11) in Amborella scaffold length were achieved by implementing whole genome maps from OpGen for super-scaffolding relative to sequence assembly alone. The Amborella assembly benefited from the inclusion of 63,924 paired BAC-end sequences (insert size ~120 kb), which also facilitated scaffolding. However, performing assemblies without BAC ends, but with the inclusion of OpGen Genome Builder, resulted in 2.5-fold increases in N50 sizes with respect to a BAC-free de novo assembly, and a 1.5-fold increase compared to the Amborella assembly that includes BAC ends (Table 2-11). This implies that high-throughput whole genome maps like those constructed with OpGen Genome Builder can negate the requirement for expensive, large-insert clone-end sequences to ensure scaffold contiguity.
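
Because the comparisons above are stated as fold changes in N50, a minimal sketch of how N50 is commonly computed may be useful. It uses the standard definition (the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly); the scaffold lengths in the usage lines are made up and are not taken from Table 2-11.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical example: joining scaffolds raises N50 without adding sequence.
before = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
after = n50([2 + 3 + 4, 5 + 6, 7 + 8, 9 + 10])
fold_increase = after / before
```

In this toy case the total assembled length is unchanged; only the joining of scaffolds moves the statistic, which is the kind of gain that map-based super-scaffolding provides.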

Despite its speed and relatively low cost with respect to large-insert clone-end resources, the OpGen Genome Builder does suffer one disadvantage: it requires existing contigs and scaffolds to exceed a minimum length of 200 kb. In the case of Amborella, over 90% of the sequence-assembled scaffolds met or exceeded this size, and using Genome Builder was therefore beneficial. Sequencing technologies have matured further during the course of this project and now produce even higher volumes of sequence with longer read lengths. These features, and the availability of several techniques and technologies that further increase read, contig, and scaffold lengths, make attaining the 200 kb length minimum less challenging. In addition to exploring methods to increase contiguity, I also developed strategies to identify poor-quality and chimeric sequencing reads that hinder assembly, and developed methodologies for using FISH to quality-control assembled scaffolds. I expect that this knowledge will assist future NGS-based genome projects.

With the reference sequence for Amborella in hand, the next task was to identify instances of alternative splicing (AS) and examine the conservation of alternative splicing across angiosperms. Chapter 3 describes the development of algorithms and software pipelines to first identify and then compare alternative splicing events across plant species. This analysis was performed on a genome-wide scale in nine taxa distributed across the angiosperm phylogenetic tree. One highlight of the AS identification strategy is that it can integrate transcriptome datasets from various sequencing technologies (454, Illumina, and Sanger-based ESTs) to increase detection power. It then identifies events based on these data rather than relying on pre-existing mRNA isoform annotations. Among the nine angiosperms studied, the proportion of
expressed multi-exonic genes that exhibit alternative splicing ranged from 40% to 70% (Table 3-4 and Figure 3-5), with intron retention being most frequent in all species (Figure 3-3). The observed variation in the proportion of genes exhibiting AS in the species I assessed may reflect differences in the depth and breadth of the transcriptome resources available for each species. In this study, a high proportion of Amborella (70.4%) and grape (64.4%) genes have detectable AS, but these species also had the largest and most broadly sampled data sets. Overall, this study reiterates the fact that AS is not a rare phenomenon in plants, and observations of AS are likely to continue to increase as datasets increase in size. At any rate, the methodologies developed during this work easily scale up to assay more and larger datasets.

Chapter 3 also describes the identification of conserved AS events across species. This study is the first to develop and implement a method to detect conserved AS events across more than three species. These methods enabled the identification of conserved AS events across nine angiosperm species and easily scale to larger numbers of species as datasets become available. Overall, this study identified 27,120 AS events that are conserved between at least two of the nine angiosperm taxa studied (Table 3-7). About 40% of these are conserved in at least three species. The conservation of AS events across multiple species suggests that they may perform an important biological function and have been retained during the course of evolution. Our pipeline has identified thousands of candidate AS events, some of which have been conserved broadly across long evolutionary distances, and these data provide many interesting candidate genes for future functional studies. GO enrichment analysis was performed on those genes with AS events that are conserved across at least six
178 angiosperms to investigate whether conserv ed of AS events are enriched in genes pertaining to specific functional categories. S everal o ver represented GO terms for these AS con s er v ed genes suggest that there is preferential retention of AS events in certain gene classes ( Table 3 11 , Figure 3 9, Figure 3 10, and Figure 3 11). In addition to examining conserved events across a broad repres entation of angiosperms, genome wide conservation of AS events was examined with the model legumes, common bean and soybean. Common bean and soybean are the two most closely related species within our study , having diverged ~19 MYA (McClean et al. 2010) ; soybean under went a lineage specific WGD event about 5 10 MYA (Roulin et al. 2012) relative to common bean, enabl ing examination of the direct impact of WGD on alternative splicing. A single gene in common bean should have at most two orthologs in soyb ean resulting from WGD. Conservation of AS events was observed in 8,325 gene sets with 1 common bean gene: 2 soybean gene orthologs ratios. However, several instances of loss of AS in one copy of soybean were observed. Interestingly, even though common be an and soybean diverged only 19 MYA , merely 30% of the detected AS events are conserved between these species, suggesting that most events are lineage specific (Table 3 5).

LIST OF REFERENCES

Amborella Genome Project. 2013. The Amborella Genome and the Evolution of Flowering Plants. Science 342: 1241089.

Anantharaman TS, Mishra B, Schwartz DC. 1999. Genomics via optical mapping III: Contiging genomic DNA and variations. In The Seventh International Conference on Intelligent Systems for Molecular Biology, Vol. 7, pp. 18-27, Citeseer.

Argout X, Salse J, Aury JM, Guiltinan MJ, Droc G, Gouzy J, Allegre M, Chaparro C, Legavre T, Maximova SN, et al. 2011. The genome of Theobroma cacao. Nat Genet 43: 101-108.

Baek J-M, Han P, Iandolino A, Cook DR. 2008. Characterization and comparison of intron structure and alternative splicing between Medicago truncatula, Populus trichocarpa, Arabidopsis and rice. Plant Mol Biol 67: 499-510.

Barbazuk WB, Fu Y, McGinnis KM. 2008. Genome-wide analyses of alternative splicing in plants: opportunities and challenges. Genome Res 18: 1381-1392. http://www.ncbi.nlm.nih.gov/pubmed/18669480.

Blanc G, Hokamp K, Wolfe KH. 2003. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 13: 137-144.

Blanc G, Wolfe KH. 2004a. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell Online 16: 1679-1691.

Blanc G, Wolfe KH. 2004b. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell Online 16: 1667-1678.

Buggs RJ, Chamala S, Wu W, Gao L, May GD, Schnable PS, Soltis DE, Soltis PS, Barbazuk WB. 2010. Characterization of duplicate gene evolution in the recent natural allopolyploid Tragopogon miscellus by next-generation sequencing and Sequenom iPLEX MassARRAY genotyping. Mol Ecol 19 Suppl 1: 132-146. http://www.ncbi.nlm.nih.gov/pubmed/20331776.

Campbell M, Haas B, Hamilton J, Mount S, Buell CR. 2006. Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics 7: 327.

Cañestro C, Albalat R, Irimia M, Garcia-Fernàndez J. 2013. Impact of gene gains, losses and duplication modes on the origin and diversification of vertebrates. In Seminars in cell & developmental biology, Elsevier.

Chamala S, Chanderbali AS, Der JP, Lan T, Walts B, Albert VA, Leebens-Mack J, Rounsley S, Schuster SC, Wing RA. 2013. Assembly and Validation of the Genome of the Nonmodel Basal Angiosperm Amborella. Science 342: 1516-1517.

Chen M, Manley JL. 2009. Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat Rev Mol Cell Biol 10: 741-754.

Chester M, Leitch AR, Soltis PS, Soltis DE. 2010. Review of the application of modern cytogenetic methods (FISH/GISH) to the study of reticulation (polyploidy/hybridisation). Genes (Basel) 1: 166-192.

Chin C-S, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, Clum A, Copeland A, Huddleston J, Eichler EE. 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 10: 563-569.

Chou H-H, Holmes MH. 2001. DNA sequence quality trimming and vector removal. Bioinformatics 17: 1093-1104.

Conant GC, Birchler JA, Pires JC. 2014. Dosage, duplication, and diploidization: clarifying the interplay of multiple models for duplicate gene evolution over time. Curr Opin Plant Biol 19: 91-98.

Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21: 3674-3676.

Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A. 2006. Widespread genome duplications throughout the history of flowering plants. Genome Res 16: 738-749.

subfunctionalization. Trends Genet 23: 270-272.

Darracq A, Adams KL. 2013. Features of evolutionarily conserved alternative splicing events between Brassica and Arabidopsis. New Phytol.

De Conti L, Baralle M, Buratti E. 2012. Exon and intron definition in pre-mRNA splicing. Wiley Interdiscip Rev RNA. http://www.ncbi.nlm.nih.gov/pubmed/23044818.

ocs S, Droc G, Rouard M, et al. 2012. The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488: 213+.

Dolezel genome size. Ann Bot 95: 99-110.

Dong H, Chen Y, Shen Y, Wang S, Zhao G, Jin W. 2011. Artificial duplicate reads in sequencing data of 454 Genome Sequencer FLX System. Acta Biochim Biophys Sin (Shanghai) 43: 496-500.

Dong Y, Xie M, Jiang Y, Xiao N, Du X, Zhang W, Tosser-Klopp G, Wang J, Yang S, Liang J. 2012. Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat Biotechnol.

Drew BT, Ruhfel BR, Smith SA, Moore MJ, Briggs BG, Gitzendanner MA, Soltis PS, Soltis DE. 2014. Another look at the root of the angiosperms reveals a familiar tale. Syst Biol syt108.

Erdmann R, Gramzow L, Melzer R, Theißen G, Becker A. 2010. GORDITA (AGL63) is a young paralog of the Arabidopsis thaliana Bsister MADS box gene ABS (TT16) that has undergone neofunctionalization. Plant J 63: 914-924.

Fawcett JA, Maere S, Van de Peer Y. 2009. Plants with double genomes might have had a better chance to survive the Cretaceous-Tertiary extinction event. Proc Natl Acad Sci 106: 5737-5742.

Freeling M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu Rev Plant Biol 60: 433-453.

Am J Bot 96: 5-21.

Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28: 3150-3152.

Fu Y, Bannach O, Chen H, Teune J-H, Schmitz A, Steger G, Xiong L, Barbazuk WB. 2009. Alternative splicing of anciently exonized 5S rRNA regulates plant transcription factor TFIIIA. Genome Res 19: 913-921.

Gabaldón T, Koonin EV. 2013. Functional and evolutionary implications of gene orthology. Nat Rev Genet 14: 360-366.

Garcia-Mas J, Benjak A, Sanseverino W, Bourgeois M, Mir G, González VM, Hénaff E, Câmara F, Cozzuto L, Lowy E. 2012. The genome of melon (Cucumis melo L.). Proc Natl Acad Sci 109: 11872-11877.

Gish W. 1996. WU-BLAST. http://blast.wustl.edu.

Gomez-Alvarez V, Teal TK, Schmidt TM. 2009. Systematic artifacts in metagenomes from complex microbial communities. ISME J 3: 1314-1317.

Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N. 2012. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 40: D1178-D1186.

Goremykin VV, Nikiforova SV, Biggs PJ, Zhong B, Delange P, Martin W, Woetzel S, Atherton RA, Mclenachan PA, Lockhart PJ. 2013. The evolutionary root of flowering plants. Syst Biol 62: 50-61.

Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29: 644-652.

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith Jr RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD. 2003. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31: 5654-5666.

Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M. 2013. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc 8: 1494-1512.

Haas BJ, Zeng Q, Pearson MD, Cuomo CA, Wortman JR. 2011. Approaches to fungal genome annotation. Mycology 2: 118-141.

Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat Methods 5: 183-188.

Illumina. 2009. Mate Pair Library v2 Sample Preparation Guide For 2-5 kb Libraries.

International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431: 931-945.

Istrail S, Sutton GG, Florea L, Halpern AL, Mobarry CM, Lippert R, Walenz B, Shatkay H, Dew I, Miller JR. 2004. Whole-genome shotgun assembly and comparison of human genome assemblies. Proc Natl Acad Sci U S A 101: 1916-1921.

Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, et al. 2007. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449: 463-U5.

Jansen RK, Cai Z, Raubeson LA, Daniell H, Depamphilis CW, Leebens-Mack J, Müller KF, Guisinger-Bellian M, Haberle RC, Hansen AK. 2007. Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns. Proc Natl Acad Sci 104: 19369-19374.

Jiang W, Liu Y, Xia E, Gao L. 2013. Prevalent role of gene features in determining evolutionary fates of whole-genome duplication duplicated genes in flowering plants. Plant Physiol 161: 1844-1861.

Jiao Y, Leebens-Mack J, Ayyampalayam S, Bowers JE, McKain MR, McNeal J, Rolf M, Ruzicka DR, Wafula E, Wickett NJ. 2012. A genome triplication associated with early diversification of the core eudicots. Genome Biol 13: R3.

Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS, et al. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473: 97-100. http://www.ncbi.nlm.nih.gov/pubmed/21478875.

Kaufmann K, Anfang N, Saedler H, Theissen G. 2005. Mutant analysis, protein-protein interactions and subcellular localization of the Arabidopsis Bsister (ABS) protein. Mol Genet Genomics 274: 103-118.

Kent WJ. 2002. BLAT: the BLAST-like alignment tool. Genome Res 12: 656-664.

Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. 2013. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14: R36.

Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED. 2012. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 30: 693-700.

Kornblihtt AR, Schor IE, Alló M, Dujardin G, Petrillo E, Muñoz MJ. 2013. Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat Rev Mol Cell Biol.

Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. 2004. Versatile and open software for comparing large genomes. Genome Biol 5: R12.

Latreille P, Norton S, Goldman BS, Henkhaus J, Miller N, Barbazuk B, Bode HB, Darby C, Du Z, Forst S. 2007. Optical mapping as a routine tool for bacterial genome sequence finishing. BMC Genomics 8: 321.

Leebens-Mack J, Raubeson LA, Cui L, Kuehl JV, Fourcade MH, Chumley TW, Boore JL, Jansen RK. 2005. Identifying the basal angiosperm node in chloroplast genome ay out of the Felsenstein zone. Mol Biol Evol 22: 1948-1963.

Leitch IJ, Hanson L. 2002. DNA C-values in seven families fill phylogenetic gaps in the basal angiosperms. Bot J Linn Soc 140: 175-179.

Li L, Stoeckert CJ, Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178-2189.

Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B. 2012. RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res 40: W622-W627.

MacNeil AJ, McEachern LA, Pohajdak B. 2008. Gene duplication in early vertebrates results in tissue-specific subfunctionalized adaptor proteins: CASP and GRASP. J Mol Evol 67: 168-178.

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-380.

Marquez Y, Brown JWS, Simpson C, Barta A, Kalyna M. 2012. Transcriptome survey reveals increased complexity of the alternative splicing landscape in Arabidopsis. Genome Res 22: 1184-1195.

Martin M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17: pp. 10-12.

McClean PE, Mamidi S, McConnell M, Chikara S, Lee R. 2010. Synteny mapping between common bean and soybean reveals extensive blocks of shared loci. BMC Genomics 11: 184.

McGrath CL, Lynch M. 2012. Evolutionary significance of whole-genome duplication. In Polyploidy and Genome Evolution, pp. 1-20, Springer.

Miller JR, Koren S, Sutton G. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95: 315-327.

Moore MJ, Bell CD, Soltis PS, Soltis DE. 2007. Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proc Natl Acad Sci 104: 19363-19368.

Nagarajan N, Read TD, Pop M. 2008. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics 24: 1229-1235.

Nesi N, Debeaujon I, Jond C, Stewart AJ, Jenkins GI, Caboche M, Lepiniec L. 2002. The TRANSPARENT TESTA16 locus encodes the ARABIDOPSIS BSISTER MADS domain protein and is required for proper development and pigmentation of the seed coat. Plant Cell Online 14: 2463-2479.

Nilsen TW, Graveley BR. 2010. Expansion of the eukaryotic proteome by alternative splicing. Nature 463: 457-463.

Niu B, Fu L, Sun S, Li W. 2010. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 11: 187.

Otto SP, Whitton J. 2000. Polyploid incidence and evolution. Annu Rev Genet 34: 401-437.

Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, Zheng L. 2007. The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Res 35: D883-D887.

Paterson AH, Wang X, Li J, Tang H. 2012. Ancient and Recent Polyploidy in Monocots. In Polyploidy and Genome Evolution, pp. 93-108, Springer.

Pinhal D, Yoshimura TS, Araki CS, Martins C. 2011. The 5S rDNA family evolves through concerted and birth-and-death evolution in fish genomes: an example from freshwater stingrays. BMC Evol Biol 11: 151.

Pop M, Salzberg SL, Shumway M. 2002. Genome sequence assembly: Algorithms and issues. Computer (Long Beach Calif) 35: 47-54.

Praça-Fontes MM, Carvalho CR, Clarindo WR, Cruz CD. 2011. Revisiting the DNA C-values of the genome size standards used in plant flow cytometry to choose the Plant Cell Rep 30: 1183-1191.

Prasad K, Zhang X, Tobón E, Ambrose BA. 2010. The Arabidopsis B-sister MADS-box protein, GORDITA, represses fruit growth and contributes to integument development. Plant J 62: 203-214.

Qian W, Liao B-Y, Chang AY-F, Zhang J. 2010. Maintenance of duplicate genes and their functional redundancy by reduced expression. Trends Genet 26: 425-430.

Reddy AS. 2007. Alternative splicing of pre-messenger RNAs in plants in the genomic era. Annu Rev Plant Biol 58: 267-294. http://www.ncbi.nlm.nih.gov/pubmed/17222076.

Reddy ASN, Marquez Y, Kalyna M, Barta A. 2013. Complexity of the Alternative Splicing Landscape in Plants. Plant Cell Online tpc 113.

Rhind N, Chen Z, Yassour M, Thompson DA, Haas BJ, Habib N, Wapinski I, Roy S, Lin MF, Heiman DI. 2011. Comparative functional genomics of the fission yeasts. Science 332: 930-936.

Rice DW, Alverson AJ, Richardson AO, Young GJ, Sanchez-Puerta MV, Munzinger J, Barry K, Boore JL, Zhang Y, Knox EB. 2013. Horizontal Transfer of Entire Genomes via Mitochondrial Fusion in the Angiosperm Amborella. Science 342: 1468-1473.

Richardson DN, Rogers MF, Labadorf A, Ben-Hur A, Guo H, Paterson AH, Reddy ASN. 2011. Comparative analysis of serine/arginine-rich proteins across 27 eukaryotes: insights into sub-family classification and extent of alternative splicing. PLoS One 6: e24542.

Roberts RJ, Carneiro MO, Schatz MC. 2013. The advantages of SMRT sequencing. Genome Biol 14: 405.

Roche. 2009. GS FLX Paired End DNA Library Preparation Method Manual, GS FLX Titanium Series.

Roulin A, Auer PL, Libault M, Schlueter J, Farmer A, May G, Stacey G, Doerge RW, Jackson SA. 2012. The fate of duplicated genes in a polyploid plant genome. Plant J.

Sato S, Tabata S, Hirakawa H, Asamizu E, Shirasawa K, Isobe S, Kaneko T, Nakamura Y, Shibata D, Aoki K, et al. 2012. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485: 635-641.

Schmucker D, Clemens JC, Shu H, Worby CA, Xiao J, Muda M, Dixon JE, Zipursky SL. 2000. Drosophila Dscam is an axon guidance receptor exhibiting extraordinary molecular diversity. Cell 101: 671.

Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J. 2010. Genome sequence of the palaeopolyploid soybean. Nature 463: 178-183.

Schmutz J, McClean PE, Mamidi S, Wu GA, Cannon SB, Grimwood J, Jenkins J, Shu S, Song Q, Chavarro C. 2014. A reference genome for common bean and genome-wide analysis of dual domestications. Nat Genet.

Schnable JC, Freeling M. 2012. Maize (Zea Mays) as a Model for Studying the Impact of Gene and Regulatory Sequence Loss Following Whole-Genome Duplication. In Polyploidy and Genome Evolution, pp. 137-145, Springer.

Severing EI, Van Dijk AD, Stiekema WJ, Van Ham RC. 2009. Comparative analysis indicates that alternative splicing in plants has a limited role in functional expansion of the proteome. BMC Genomics 10: 154. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2674458&tool=pmcentrez&rendertype=abstract.

Sharp PA. 1997. Split genes and RNA splicing. Physiol Or Med 1991-1995 7: 145.

Sjödin A, Street NR, Sandberg G, Gustafsson P, Jansson S. 2009. The Populus Genome Integrative Explorer (PopGenIE): a new resource for exploring the Populus genome. New Phytol 182: 1013-1025.

Soltis DE, Albert VA, Leebens-Mack J, Bell CD, Paterson AH, Zheng C, Sankoff D, Wall PK, Soltis PS. 2009. Polyploidy and angiosperm diversification. Am J Bot 96: 336-348.

Soltis DE, Albert VA, Leebens-Mack J, Palmer JD, Wing RA, DePamphilis CW, Ma H, Carlson JE, Altman N, Kim S. 2008. The Amborella genome: an evolutionary reference for plant biology. Genome Biol 9: 402.

Soltis DE, Soltis PS, Albert VA, Oppenheimer DG, dePamphilis CW, Ma H, Frohlich MW, Theißen G. 2002. Missing links: the genetic architecture of flower and floral diversification. Trends Plant Sci 7: 22-31.

Soltis DE, Soltis PS, Endress PK, Chase MW. 2005. Phylogeny and evolution of angiosperms. Sinauer Associates Incorporated.

Soltis PS, Soltis DE, Chase MW. 1999. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402: 402-404.

Staiger D, Brown JWS. 2013. Alternative Splicing at the Intersection of Biological Timing, Development, and Stress Responses. Plant Cell Online tpc 113.

Star B, Nederbragt AJ, Jentoft S, Grimholt U, Malmstrøm M, Gregers TF, Rounge TB, Paulsen J, Solbakken MH, Sharma A. 2011. The genome sequence of Atlantic cod reveals a unique immune system. Nature 477: 207-210.

the earliest angiosperms: Amborella or monocots? BMC Evol Biol 4: 35.

Supek F, long lists of gene ontology terms. PLoS One 6: e21800.

Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L. 2008. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res 36: D1009-D1014.

Syed NH, Kalyna M, Marquez Y, Barta A, Brown JW. 2012. Alternative splicing in plants coming of age. Trends Plant Sci 17: 616-623. http://www.ncbi.nlm.nih.gov/pubmed/22743067.

Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. 2008. Synteny and collinearity in plant genomes. Science 320: 486-488.

Tang H, Krishnakumar V, Bidwell S, Rosen B, Chan A, Zhou S, Gentzbittel L, Childs KL, Yandell M, Gundlach H. 2014. An improved genome release (version Mt4.0) for the model legume Medicago truncatula. BMC Genomics 15: 312.

Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. 2012a. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31: 46-53.

Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. 2012b. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7: 562-578.

Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A. 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313: 1596-1604.

Van de Peer Y, Fawcett JA, Proost S, Sterck L, Vandepoele K. 2009. The flowering world: a tale of duplications. Trends Plant Sci 14: 680-688.

Vanneste K, Maere S, Van de Peer Y. 2014. Tangled up in two: a burst of genome duplications at the end of the Cretaceous and the consequences for plant evolution. Philos Trans R Soc B Biol Sci 369: 20130353.

Varshney RK, Chen W, Li Y, Bharti AK, Saxena RK, Schlueter JA, Donoghue MTA, Azam S, Fan G, Whaley AM. 2012. Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor farmers. Nat Biotechnol 30: 83-89.

Velasco R, Zharkikh A, Affourtit J, Dhingra A, Cestaro A, Kalyanaraman A, Fontana P, Bhatnagar SK, Troggio M, Pruss D. 2010. The genome of the domesticated apple (Malus × domestica Borkh.). Nat Genet 42: 833-839.

Venturini L, Ferrarini A, Zenoni S, Tornielli GB, Fasoli M, Dal Santo S, Minio A, Buson G, Tononi P, Zago ED. 2013. De novo transcriptome characterization of Vitis vinifera cv. Corvina unveils varietal diversity. BMC Genomics 14: 41.

Veron AS, Kaufmann K, Bornberg-Bauer E. 2007. Evidence of interaction network evolution by whole-genome duplications: a case study in MADS-box proteins. Mol Biol Evol 24: 670-678.

Wang B-B, Brendel V. 2006. Genomewide comparative analysis of alternative splicing in plants. Proc Natl Acad Sci U S A 103: 7175-7180. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1459036&tool=pmcentrez&rendertype=abstract.

Wang B species EST alignments reveal novel and conserved alternative splicing events in legumes. BMC Plant Biol 8: 17. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2277414&tool=pmcentrez&rendertype=abstract.

Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. 2008b. Alternative isoform regulation in human tissue transcriptomes. Nature 456: 470-476.

Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun J-H, Bancroft I, Cheng F. 2011. The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 43: 1035-1039.

Warren WC, Hillier LW, Graves JAM, Birney E, Ponting CP, Grützner F, Belov K, Miller W, Clarke L, Chinwalla AT. 2008. Genome analysis of the platypus reveals unique signatures of evolution. Nature 453: 175-183.

Waterston RH, Lander ES, Sulston JE. 2002. On the sequencing of the human genome. Proc Natl Acad Sci 99: 3712-3716.

Williams JH, Friedman WE. 2002. Identification of diploid endosperm in an early angiosperm lineage. Nature 415: 522-526.

Woodhouse MR, Tang H, Freeling M. 2011. Different gene families in Arabidopsis thaliana transposed in different epochs and at different frequencies throughout the rosids. Plant Cell Online 23: 4241-4253.

Wu TD, Nacu S. 2010. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26: 873-881.

Wu TD, Watanabe CK. 2005. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21: 1859-1875.

Xu G, Guo C, Shan H, Kong H. 2012. Divergence of duplicate genes in exon-intron structure. Proc Natl Acad Sci 109: 1187-1192.

Xu X, Pan S, Cheng S, Zhang B, Mu D, Ni P, Zhang G, Yang S, Li R, Wang J, et al. 2011. Genome sequence and analysis of the tuber crop potato. Nature 475: 189-195. http://www.ncbi.nlm.nih.gov/pubmed/21743474.

Yang Y-W, Lai K-N, Tai P-Y, Li W-H. 1999. Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J Mol Evol 48: 597-604.

You F, Huo N, Deal K, Gu Y, Luo M-C, McGuire P, Dvorak J, Anderson O. 2011. Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics 12: 59.

Young ND, Debelle F, Oldroyd GED, Geurts R, Cannon SB, Udvardi MK, Benedito VA, Mayer KFX, Gouzy J, Schoof H, et al. 2011. The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 480: 520-524.

Zhang PG, Huang SZ, Pin AL, Adams KL. 2010. Extensive Divergence in Alternative Splicing Patterns after Gene and Genome Duplication During the Evolutionary History of Arabidopsis. Mol Biol Evol 27: 1686-1697.

Zhou S, Wei F, Nguyen J, Bechner M, Potamousis K, Goldstein S, Pape L, Mehan MR, Churas C, Pasternak S. 2009. A single molecule scaffold for the maize genome. PLoS Genet 5: e1000711.

Zuccolo A, Bowers JE, Estill JC, Xiong Z, Luo M, Sebastian A, Goicoechea JL, Collura K, Yu Y, Jiao Y, et al. 2011. A physical map for the Amborella trichopoda genome sheds light on the evolution of angiosperm genome structure. Genome Biol 12: R48. http://www.ncbi.nlm.nih.gov/pubmed/21619600.

BIOGRAPHICAL SKETCH

Srikar Chamala received his Bachelor of Science in Bioinformatics from Brigham Young University, Provo, UT, where his research focused on mining for putative single nucleotide polymorphisms in mitochondrial genes associated with obesity in the Pima Indian population of Arizona. He later moved to the University of Illinois at Urbana-Champaign, where he pursued a Master of Science in Bioinformatics in the Department of Computer Science. After completing his master's degree, he began working full time in Dr. Barbazuk's lab at the University of Florida as a Biological Scientist, during which he worked on several plant genomics projects focusing on genome evolution and annotation, polyploidy, and transcriptome assembly. After working full time for a couple of years, he pursued his doctoral degree in the same lab under the supervision of Dr. Barbazuk, with a research focus on Amborella genome assembly and the evolution of alternative splicing in angiosperms (flowering plants) using Amborella as an outgroup.