Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies


Material Information

Title:
Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies
Physical Description:
Mixed Material
Language:
English
Creator:
Neale, David B.
Wegrzyn, Jill L.
Stevens, Kristian A.
Zimin, Aleksey V.
Puiu, Daniela
Crepeau, Marc W.
Publisher:
BioMed Central (Genome Biology)
Publication Date:
2014

Notes

Abstract:
Background: The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination. Results: We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length. Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome. Conclusions: In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied.
General Note:
Neale et al. Genome Biology 2014, 15:R59 http://genomebiology.com/2014/15/3/R59; Pages 1-13
General Note:
doi:10.1186/gb-2014-15-3-r59 Cite this article as: Neale et al.: Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome Biology 2014 15:R59.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
© 2014 Neale et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
System ID:
AA00024682:00001

Full Text


RESEARCH Open Access

Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies

David B Neale1*, Jill L Wegrzyn1, Kristian A Stevens2, Aleksey V Zimin3, Daniela Puiu4, Marc W Crepeau2, Charis Cardeno2, Maxim Koriabine5, Ann E Holtz-Morris5, John D Liechty1, Pedro J Martínez-García1, Hans A Vasquez-Gross1, Brian Y Lin1, Jacob J Zieve1, William M Dougherty2, Sara Fuentes-Soriano6, Le-Shin Wu7, Don Gilbert6, Guillaume Marçais3, Michael Roberts3, Carson Holt8, Mark Yandell8, John M Davis9, Katherine E Smith10, Jeffrey FD Dean11, W Walter Lorenz11, Ross W Whetten12, Ronald Sederoff12, Nicholas Wheeler1, Patrick E McGuire1, Doreen Main13, Carol A Loopstra14, Keithanne Mockaitis6, Pieter J de Jong5, James A Yorke3, Steven L Salzberg4 and Charles H Langley2

Abstract

Background: The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination.

Results: We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length. Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome.
Conclusions: In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied.

*Correspondence: dbneale@ucdavis.edu
1 Department of Plant Sciences, University of California, Davis, CA, USA
Full list of author information is available at the end of the article

Background

Advances in sequencing and assembly technologies have made it possible to obtain reference genome sequences for organisms once thought intractable, including the leviathan genomes (20 to 40 Gb) of conifers. Gymnosperms, represented principally by a diverse and majestic array of conifer species (approximately 630 species, distributed across eight families and 70 genera [1]), are one of the oldest of the major plant clades, having arisen from ancestral seed plants some 300 million years ago. Conifers will likely provide many genome-level insights on the origins of genetic diversity in higher plants. Though today's conifers may be considered relics of a once much-larger set of taxa that thrived throughout the age of the dinosaurs (250 to 65 million years ago) [2,3], they remain the dominant life forms in many of the temperate and boreal ecosystems in the Northern Hemisphere and extend into subtropical regions and the Southern Hemisphere.

We chose to investigate the loblolly pine (Pinus taeda L.) genome because of its well-developed scientific resources. Over 1.5 billion seedlings are planted annually, approximately 80% of which are genetically improved,


driving its selection as the reference conifer genome. Among conifers, its genetic resources are unsurpassed in that three tree improvement cooperatives have been breeding loblolly pine for more than 60 years and manage millions of trees in genetic trials. The current consensus reference genetic map for loblolly pine is made up of 2,308 genetic markers [4]. Extensive QTL and association mapping studies in loblolly pine have revealed a great deal about the genetic basis of complex traits such as physical and chemical wood properties, disease and insect resistance, growth, and adaptation to changing environments. Current research focuses on the potential of genomic selection for continued genetic improvement [5].

The tree selected for sequencing, '20-1010', is a member of the North Carolina State University-Industry Cooperative Tree Improvement Program and the property of the Commonwealth of Virginia Department of Forestry, which released this germplasm into the public domain. In accordance with open access policies [6], we released the first draft genome of loblolly pine in June 2012, which made it the first draft assembly available for any gymnosperm. The draft described here represents a significant advance over available gymnosperm reference sequences [7,8].

Results and discussion

Sequencing and assembly

The loblolly pine genome [9] joins the two other conifer reference sequences produced recently [7,8]. With an estimated 22 billion base pairs [10], it is the largest genome sequenced and assembled to date. Our experimental design leveraged a unique feature of the conifer life cycle and new computational approaches to reduce the assembly problem to a tractable scale [9,11]. From the first whole genome shotgun (WGS) assembly of the 1.8 million base pair Haemophilus influenzae genome in 1995 to the orders-of-magnitude larger three-billion-base-pair mammalian genomes that followed years later [12], the WGS protocol has been an efficient and effective method of producing high quality reference genomes. This was in part made possible by the overlap layout consensus (OLC) assembly paradigm championed by Myers [13] and ubiquitously implemented in first-generation WGS assemblers.
When next-generation sequencing disruptively ushered in a new era of WGS sequencing, the extremely large numbers of reads exceeded the capabilities of existing OLC assemblers. To circumvent this, new assemblers were developed, using short k-mer based methods first described by Pevzner [14]. The giant panda [15] was the first mammalian species to have its genome produced using strictly NGS reads. For loblolly pine, we utilized a hybrid assembly method that incorporates both k-mer based and OLC assembly methods.

Figure 1A illustrates the two sources of DNA that comprised the sequencing strategy. As outlined below (see [9] for details), the majority of the WGS sequence data in Table 1 was generated from a single pine seed megagametophyte. The small quantity of genomic DNA obtained from the haploid megagametophyte tissue was used to construct a series of 11 Illumina paired end libraries with sufficient complexity to form the basis of a high quality WGS assembly. The use of haploid DNA greatly simplifies assembly, but the limited quantity of haploid DNA was insufficient for the entire project. Diploid needle tissue served as an abundant source of parental DNA for the construction of long-insert linking libraries. This included 48 libraries ranging from 1 to 5.5 kilobase pairs (Kb) and nine fosmid DiTag libraries spanning 35 to 40 Kb.

An overview of the assembly process is presented in Figure 1B. The combined 63x coverage from megagametophyte libraries (approximately 15 billion reads) was used for error correction and for the construction of a database of 79-mers appearing in the haploid genome. This database was used to filter highly divergent haplotypes from the diploid sequence data. The super-read reduction implemented in the MaSuRCA assembler [11] condensed most of the haploid paired-end reads into a set of approximately 150 M longer 'super-reads'. Each super-read is a single contiguous haploid sequence that contains both ends of one or more paired-end reads.
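The 79-mer filtering step described above can be pictured with a small sketch. It is not the MaSuRCA implementation: the function names, the use of canonical k-mers, and the 0.5 acceptance threshold are assumptions made for illustration, and the test data use a tiny k instead of 79. The idea is simply to keep a diploid read only when enough of its k-mers also occur in the haploid k-mer database.

```python
from typing import Iterable, List, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement (strand-neutral key)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_set(haploid_reads: Iterable[str], k: int) -> Set[str]:
    """Collect every canonical k-mer observed in the haploid reads."""
    kmers: Set[str] = set()
    for read in haploid_reads:
        for i in range(len(read) - k + 1):
            kmers.add(canonical(read[i:i + k]))
    return kmers


def haploid_fraction(read: str, kmers: Set[str], k: int) -> float:
    """Fraction of a read's k-mers that are present in the haploid k-mer set."""
    total = len(read) - k + 1
    if total <= 0:
        return 0.0
    hits = sum(canonical(read[i:i + k]) in kmers for i in range(total))
    return hits / total


def filter_diploid_reads(diploid_reads: Iterable[str], kmers: Set[str], k: int,
                         min_fraction: float = 0.5) -> List[str]:
    """Keep diploid reads whose k-mer content mostly matches the haploid genome.

    min_fraction is a hypothetical threshold, not a value from the paper.
    """
    return [r for r in diploid_reads if haploid_fraction(r, kmers, k) >= min_fraction]
```

A read drawn from a highly divergent haplotype shares few k-mers with the haploid database and is dropped, while reads from the sequenced haplotype (on either strand) pass.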
The construction process ensured that no super-read was contained in another super-read. Critically, the number of megagametophyte-derived reads was reduced by a factor of 100. The combined dataset was 27-fold smaller than the original, and was sufficiently reduced in size to make overlap-based assembly using CABOG [16] possible. The output of the MaSuRCA assembly pipeline became assembly 1.0. Additional scaffolding methods were implemented to improve the assembly by taking advantage of the deeply sampled transcriptome data [17], ultimately producing assembly v1.01. Finally, to further assess completeness, a scan for the 248 conserved core genes in the CEGMA database [18] was performed on all conifer assemblies (Figure 1B). The resulting annotations are classified as full length and partial. The loblolly pine v1.01 assembly has the largest number of total annotations (203) of the three conifers as well as the largest fraction of full length annotations (91%).

For validation purposes, we used a large pool of approximately 4,600 fosmid clones to approximate a random sample of the genome [9]. The sequenced and assembled pool contained 3,798 contigs longer than 20,000 bp, each putatively representing more than half of a fosmid insert, with a total span of 109 Mbp. When aligned to the genome, 98.63% of the total length of these contigs was covered by the WGS assembly. A total of


Figure 1 (panels A and B). See legend below.


2,120 of the aligned contigs had 99.5% or higher similarity, implying a combined error rate of less than 0.5%.

Annotation

A de novo transcriptome assembly of 83,285 unique, full-length contigs from several tissue types and existing nucleotide resources (ESTs and conifer transcriptomes) supported a set of 50,172 unique gene models, derived from the MAKER-P annotation pipeline (Table 2) [19,20]. From the de novo transcriptome assembly, 42,822 aligned uniquely (98% identity and 95% coverage) to the genome. Of the 45,085 re-clustered loblolly pine EST sequences, 27,412 aligned (98% identity and 98% coverage). The frequent occurrence of pseudogenes (gene-like fragments representing 2.9% of the genome) required the use of conservative filters to define the final gene space [20]. The selected models represent coding sequence lengths between 120 bp and 12 Kbp. Gene and exon lengths were comparable with angiosperm species; however, the number of full-length genes identified, even in a more fragmented genome, was greater than in other species (Figure 2A). Introns numbered 144,579 with an average length of 2.7 Kbp and a maximum length of 318 Kbp. A total of 6,267 (4.4%) of the introns were greater than 20 Kbp in length. This distribution far exceeds the intron lengths reported in other plant species and is, on average, longer than estimates in Picea abies [8,20]. The final gene models were identified on 31,284 scaffolds that were at least 10 Kbp in length. A total of 3,835 scaffolds contained three or more genes. Given the fragmentation of the genome and long intron lengths observed, it is likely that the genome contains additional genes, but also that some of the 50,172 models defined here may later be merged together.

We clustered the protein sequences in order to identify orthologous groups of genes [23]. Comparisons with 14 species, ignoring transposable elements, yielded 20,646 gene families with two or more members and 1,476 gene families present in all species (Figure 2A). Of the full set, 1,554 were specific to conifers and 159 of those were specific to loblolly pine [20].

The majority of characterized plant resistance proteins (R proteins) are members of the NB-ARC and NB-LRR families, and are associated with disease resistance [24].
Several independent families were identified containing one or both of these domains. The largest contained 43 loblolly pine members and 14 spruce members. Several other smaller families contained members exclusively from loblolly pine ranging from two to five members each. Other gene families with roles in disease resistance were also identified, including chalcone synthases (CHS) (three in loblolly pine and one spruce member). Increased expression of CHS is associated with the salicylic acid defense pathway [25].

Response to environmental stress, such as salinity and drought, has been investigated at length in conifers. Three different sets of Dehydrin (DHN) domains were noted, the largest with 10 loblolly pine members and 12

Figure 1 (A) The sources of haploid and diploid genomic DNA. The reproductive cycle of a conifer showing the unique sources of haploid and diploid genomic DNA sequenced. Both the ova pronucleus and the megagametophyte are derived by mitotic divisions from a single one of the four haploid meiotic segregant megaspores. The tissue from a single megagametophyte formed the basis for all of our shorter insert paired end Illumina libraries (Table 1). To construct longer insert libraries (Illumina mate pair and Fosmid DiTag) requiring greater amounts of starting DNA, needles from the parental genotype (20-1010) were used. (B) Sequencing and assembly schematic. An overlap layout consensus assembly, made possible by MaSuRCA's critical reduction phase, was followed by additional scaffolding, incorporating transcript assemblies, to improve contiguity and completeness [9,11].
Table 1 Characteristics of the loblolly pine v1.01 draft assembly

Estimated 1N genome size: 22 Gbp [10]
Number of chromosomes: 12
G+C%: 38.2%
Sequence in contigs >64 bp: 20,148,103,497 bp
Total span of scaffolds: 23,180,477,227 bp
Contig N50: 8,206 bp
Scaffold N50: 66,920 bp
Haploid paired end libraries (200-600 bp):
  11 libraries
  7.5 billion x 2 reads (GA2x + HiSeq + MiSeq)
  1.4 trillion bp total read length
  63x sequence coverage
  150 million maximal super-reads
  52 billion total bp
  2.4x sequence coverage
Diploid mate pair libraries (1,000-5,500 bp):
  48 libraries
  863 million x 2 reads (GA2x)
  273 billion total read length
  270 million x 2 reads after filtering
  37x physical coverage
DiTag libraries (35-40 Kbp):
  9 libraries
  46 million x 2 reads (GA2x)
  4.5 million reads x 2 after filtering
  7.5x physical coverage
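Table 1 reports N50 values and fold coverage. As a reminder of how such summary figures are commonly derived, here is a minimal sketch. It assumes the usual definition of N50 (the length L such that sequences of length L or longer hold at least half of the assembled bases) and sequence coverage as total bases divided by genome size; the function names are illustrative and are not taken from the authors' pipeline.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length at which the running total of sorted lengths reaches half the sum."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average number of times each genome position is covered by sequence."""
    return total_bases / genome_size


# Toy scaffold set: total 290, half 145, reached at the second-longest scaffold.
print(n50([80, 70, 50, 40, 30, 20]))             # 70
# Super-reads from Table 1: 52 billion bp over an estimated 22 Gbp genome.
print(round(fold_coverage(52e9, 22e9), 1))       # 2.4
```

Applied to the haploid read total in Table 1, 1.4 trillion bp over 22 Gbp gives roughly 63.6, in line with the 63x reported given that both inputs are rounded.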


spruce members. The first genetic evidence of dehydrins playing a role in cellular protection during osmotic shock was in 2005, in Physcomitrella patens [26]. Subsequently, it was noted that transcription levels of a DHN increased in Pinus pinaster when exposed to drought conditions [27].

Cupins are members of a large, diverse family belonging to the Germin and Germin-like superfamily (GLP) [28]. In this analysis, several families, including one large family, contained one or more domains related to Cupin 1. The largest family contained 23 loblolly pine members and five spruce members. These genes, similar to other GLPs, are expressed during somatic embryogenesis in conifers [28]. Cupins have therefore been associated with plant growth, and more recently associated with disease resistance in rice [29].

The COPIC family (58 members) was the largest exclusively identified in loblolly pine. Vesicle coat protein complexes containing COPI family members mediate transport between the ER and Golgi, and interact with Ras-related transmembrane proteins, p23 and p24 [30]. Members of the Ras superfamily, Arf and ArfGap, also identified in loblolly pine, are involved in COPI vesicle formation [31]. These proteins were assigned to the small molecule binding GO category, which is enriched in pine and other conifers as compared with angiosperms. The other notable GO assignments include nucleic acid binding, protein binding, ion binding, and transferase activity, which are consistent with the most populated categories for the other species included in the comparison (Figure 2B).

Repetitive DNA content

Previous examination of the loblolly pine BAC and fosmid sequences led to the development of the Pine Interspersed Element Resource (PIER), a custom repeat library [32]. De novo analysis of 1% of the genome yielded 8,155 repeats, bringing PIER's total to 19,194 [20]. The plethora of novel repeat content may be explained in part by the highly diverged nature of the repeat sequences, which prevents accurate identification from a reference library due to obscured similarities. Homology analysis demonstrated that retrotransposons dominated, representing 62% of the genome (Figure 3A). Seventy percent of these were long terminal repeat (LTR) retrotransposons.
PtConagree [32] covered the largest portion of the genome, followed by TPE1 [33], PtRLC_3, PtRLX_3423 [20], PtOuachita, and IFG7 [34].

Among introns, the estimated repetitive content was 60%. Introns were relatively rich in DNA transposons, at 3.31% (Figure 3A). Overall, the combined similarity and de novo approaches estimate that 82% of the pine genome is repetitive in nature (Figure 3A).

Though the genome is inundated with interspersed repetitive content, analysis revealed that only 2.86% is composed of tandem repeats, the majority of which consist of millions of retrotransposon LTRs. This estimate is comparable to the frequencies observed in other members of the Pinaceae (2.71% in Picea glauca (v1.0) and 2.40% in Picea abies (v1.0)) [20]. The number of tandem repeats may be dependent on sequencing and assembly methodologies. As shown previously, loblolly pine ranges from 2.57% in the WGS assembly [20] to 3.3% in Sanger-derived BACs [32]. Similar to most species, the relative frequencies of the repeating units of tandem loci are heavily weighted towards minisatellites (between 9 and 100 bp). This attribute is ubiquitous across plants, but the smaller volume of microsatellites (1 to 9 bp repeating units) in conifers when compared to angiosperms and the increased contribution from heptanucleotides are significant (Figure 3B).
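The size classes used for tandem repeats above can be expressed as a small classifier. This is only a sketch under stated assumptions: the text gives overlapping bounds at 9 bp, so the code treats units of 1 to 8 bp as microsatellites (following the di- to octanucleotide categories of Figure 3B) and 9 to 100 bp as minisatellites; the "satellite" label for longer units and the function names are hypothetical additions for completeness.

```python
from collections import Counter
from typing import Iterable


def classify_tandem_repeat(period: int) -> str:
    """Assign a tandem repeat to a size class from its repeating-unit length in bp."""
    if period < 1:
        raise ValueError("repeat unit length must be at least 1 bp")
    if period <= 8:       # assumed upper bound for microsatellites
        return "microsatellite"
    if period <= 100:     # minisatellites: 9 to 100 bp
        return "minisatellite"
    return "satellite"    # assumed label for longer units


def summarize_periods(periods: Iterable[int]) -> Counter:
    """Count tandem repeat loci per size class."""
    return Counter(classify_tandem_repeat(p) for p in periods)


# A heptanucleotide unit such as TTTAGGG falls in the microsatellite class.
print(classify_tandem_repeat(len("TTTAGGG")))  # microsatellite
```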
A substantial number of loblolly pine's tandem repeats are

Table 2 Comparison of gene metrics among sequenced plant genomes

Columns, in order: Pinus taeda; Picea abies [8]; Arabidopsis thaliana [21]; Populus trichocarpa [21]; Vitis vinifera [21]; Amborella trichopoda [22]
Genome size (assembled) (Mbp): 20,148; 12,019 (a); 135; 423; 487; 706
Chromosomes: 12; 12; 5; 19; 19; 13
G+C content (%): 38.2; 37.9; 35.0; 33.3; 36.2; 35.5
TE content (%): 79; 70; 15.3; 42; 41.4; N/A
Number of genes (b): 50,172; 58,587 (c); 27,160; 36,393; 25,663; 25,347
Average CDS length (bps): 965; 723; 1102; 1143; 1095; 969
Average intron length (bps): 2,741; 1,020; 182; 366; 933; 1,538
Maximum intron length (bps): 318,524; 68,269; 10,234; 4,698; 38,166; 175,748
(a) Estimated genome size is 19.6 Gbp. (b) Number of full-length genes >150 bp in length and validated through current annotations. (c) High and medium confidence genes from the Congenie project [8].


telomeric sequences (TTTAGGG)n (approximately 23,926 loci) and candidate centromeric sequences (TGGAAACCCCAAATTTTGGGCGCCGGG)n (5,183 loci, 1.8 Mbp). Originally identified in Arabidopsis thaliana [35], the telomeric sequences were found interstitially as well as at the end of the chromosomes in loblolly pine and other conifers [36]. The interstitial presence of the heptanucleotide repeat may explain the increased observation of this microsatellite in conifers. Pines have especially long telomeres, reaching up to 57 Kbp as found in Pinus longaeva [37,38].

Figure 2 Unique gene families and Gene Ontology term assignments. (A) Identification of orthologous groups of genes for 14 species split into five categories: conifers (Picea abies, Picea sitchensis, and Pinus taeda), monocots (Oryza sativa and Zea mays), dicots (Arabidopsis thaliana, Glycine max, Populus trichocarpa, Ricinus communis, Theobroma cacao, and Vitis vinifera), early land plants (Selaginella moellendorffii and Physcomitrella patens), and a basal angiosperm (Amborella trichopoda). Here, we depict the number of clusters in common between the biological categories in the intersections. The total number of sequences for each species is provided under the name (total number of sequences/total number of clustered sequences). (B) Gene ontology molecular function term assignments by family for all species (red), conifers (green), and Pinus taeda exclusively (blue).
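To make the notation (TTTAGGG)n concrete, the sketch below scans a sequence for perfect tandem runs of the telomeric heptanucleotide. It is a simplified illustration, not the tandem repeat analysis used in the study: it assumes exact copies, looks at the forward strand only, and the function name and the default minimum of three copies are hypothetical choices.

```python
import re
from typing import List, Tuple

TELOMERE_UNIT = "TTTAGGG"


def find_telomeric_runs(sequence: str, min_copies: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start, end, copies) for each run of at least min_copies exact units.

    Coordinates are 0-based and half-open. Only the forward strand is scanned;
    the reverse-complement unit (CCCTAAA) would need a second pass.
    """
    pattern = re.compile(f"(?:{TELOMERE_UNIT}){{{min_copies},}}")
    runs = []
    for match in pattern.finditer(sequence.upper()):
        copies = (match.end() - match.start()) // len(TELOMERE_UNIT)
        runs.append((match.start(), match.end(), copies))
    return runs


# Four copies embedded in flanking sequence, followed by a shorter two-copy run.
example = "ACGT" + TELOMERE_UNIT * 4 + "CCGG" + TELOMERE_UNIT * 2
print(find_telomeric_runs(example))  # [(4, 32, 4)]
```

Runs found away from scaffold ends would correspond to the interstitial occurrences mentioned above, although real data would also call for tolerance of imperfect copies.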


Organelle genomes

The mitochondrial genome was identified and assembled separately, taking advantage of its deeper coverage and distinctive GC content. The assembly was built primarily from 28.5 million high-quality 255-bp MiSeq reads generated during WGS sequencing. The reads were first assembled with SOAPdenovo2 [39], and the resulting 7,559 scaffolds were aligned to the loblolly pine chloroplast genome [40] and to 557 complete and partial plant mitochondrial genomes. Twenty-seven scaffolds aligned

Figure 3 Interspersed and tandem repetitive content. (A) Overview of repetitive content in the Pinus taeda genome for similarity (blue) and de novo (yellow) approaches. Introns are evaluated with similarity methods against PIER 2.0 [32]. (B) Overview of microsatellite content across species with exclusion of mononucleotide repeats. Orange, green, and purple points represent angiosperm, gymnosperm, and lycophyte species, respectively. Each point displays both the density (point size) and length (y-axis) of di-, tri-, tetra-, penta-, hexa-, hepta-, and octanucleotide tandem repeats (x-axis). The Overall category is an accumulation of the previous seven categories.


to the chloroplast and 90 aligned to mitochondria. The mitochondrial scaffolds were distinguished by their coverage, which averaged >14x, while the coverage depth of chloroplast scaffolds was far deeper, more than 100x. The original reads represented just 0.3x coverage of the nuclear DNA. The mitochondrial GC-content was 44% versus 38.2% for the genome and 39.5% for the chloroplast. Based on these results, we identified 33 unaligned scaffolds longer than 1 Kb likely to be mitochondrial, with >=44% GC content and coverage between 8x and 50x. These plus the 90 previously aligned scaffolds were reassembled, using additional reads from two WGS jumping libraries (lengths 3,800 bp and 5,200 bp), extracting only those pairs that matched the mitochondrial contigs. The resulting mitochondrial genome assembly has 35 scaffolds containing 40 contigs, with a total contig length of 1,253,551 bp and a maximum contig size of 256,879 bp.

New insights in conifer functional biology

The draft genome sequence and transcriptome assemblies have enabled discovery of genes that underlie ecologically and evolutionarily important traits, illuminated larger-scale genomic organization of gene families, and revealed missing genes that evolved in angiosperms and not gymnosperms.

Disease resistance

The genome revealed that a partial EST containing a SNP was actually a candidate gene for rust resistance in loblolly pine. We mapped the SNP genetically, associated it with rust resistance, then determined it was a toll-interleukin receptor/nucleotide binding/leucine-rich repeat (TNL) gene [41] containing signature domains only present in the new transcript and genome assemblies. Rust pathosystems can provide useful insights into host-pathogen co-evolution, because host resistance genes interact genetically with pathogen avirulence genes [42,43]. Analysis of fusiform rust pathogen Cronartium quercuum (Berk.) Miyabe ex Shirai f. sp. fusiforme (Cqf) genetic interactions with Pinus hosts [44,45] led to mapping of Fusiform rust resistance 1 (Fr1; [46]) to LG2 [47,48]. P. lambertiana Cr1 for white pine blister rust resistance [49] was also mapped to the same linkage group using syntenic markers [50].
Two large mapping populations were used to assign genetic map positions to 2,308 SNP markers that were mapped to genomic scaffolds [4], whose SNPs were then tested for association with rust resistance [48] in a family-based, clonal population [51]. The top-ranked SNP for rust resistance in a parent segregating for Fr1 mapped to LG2 (31.3 cM; Figure 4A) and occurred in a transcript model encoded by a TNL-type gene located on a genomic scaffold (Figure 4B). The TNL-type gene is related to N from Nicotiana [52] that belongs to a large class of genes for resistance to biotrophic pathogen-induced diseases [41,53]. Prior to this work the loblolly pine gene product appeared to lack TIR and NB domains because the EST was truncated. Based on OrthoMCL analysis of the full-length proteins [20], the gene belongs to a class of TNLs that have expanded in conifers (N = 780 in loblolly pine; N = 180 in P. abies) but not Arabidopsis (N = 3). By contrast, most TNL genes in Arabidopsis belong to a large class (N = 138) not found in loblolly pine or P. abies. The genome sequence has revealed that distinct classes of TNLs have expanded in conifers and angiosperms, making it feasible to test conifer candidate genes for co-segregation with disease resistance, instead of using markers derived from incomplete ESTs or other species.
The transcript of the TNL gene is detected in young stems, reaction wood, in hymenial layers obtained from fusiform rust galls, and is a candidate for Fr1. Avr1, the avirulence gene that specifically interacts with Fr1 [54], has been genetically mapped on LG III of Cqf and the genome sequence is now available [55]. These genome-based discoveries open the door to understanding the effects of host Fr genes on allelic diversity and frequency in their corresponding Avr genes, and vice versa, at large geographic scales. The practical outcomes for Fr gene durability are significant given the widespread planting of >500 million loblolly pine seedlings each year that harbor one or more fusiform rust resistance loci as a consequence of parental selection, selective breeding, and screening for fusiform rust resistance [56]. Comparisons of Fr1 and Cr1 loci should generate new insights into evolution of resistance genes [57] within a genus that arose 102-190 million years ago [58].

Stress response

Conifers dominate a variety of biomes by virtue of their capacity to survive and thrive in the face of extreme abiotic stresses. For example, pronounced resistance to water stress, particularly in mature trees, has enabled conifers to spread across deserts and alpine areas, well beyond the range of most competing woody angiosperms. At the same time, water stress is a major cause of mortality for conifer seedlings [59], and predictions hold that differential susceptibility of conifer species to water stress will have profound consequences for forest and ecosystem dynamics under future climate change scenarios [60]. Variation in drought resistance in conifers has long been recognized to have a genetic basis [61,62], and substantial effort has previously been devoted to attempts at using molecular and genomic tools to uncover the responsible genetic determinants [63-65].

One of the first drought-responsive conifer genes to be cloned and characterized was lp3 [66], which was shown to share homology with a small family of nuclear-localized, ABA-inducible genes (termed ASR for ABA-, Stress- and Ripening) initially identified in tomato [67,68]. Subsequent work has shown that ASR genes are broadly distributed in higher plants and adaptive alleles of these genes are determinants of drought resistance in wild relatives of various domesticated crops [69-71]. Transcriptomic studies have detected differential expression of lp3 gene family members in drought-stressed pine [72,73], while other studies have linked expression of lp3 gene family members to aspects of wood formation, that is, xylem development [74,75] and cold tolerance [76]. Genetic studies indicate that lp3 alleles are under selection in pine and likely confer adaptive resistance to drought [77,78].

Protein sequences for the four distinct loblolly pine lp3 gene family members in GenBank (AAB07493, AAB02692, AAB96829, AAB03388) aligned optimally to four of the high-confidence gene models. The ASR genes in tomato are physically clustered and have been held out as examples of tandemly arrayed genes that are important for adaptation [70]. In the v1.01 assembly, two of the pine lp3 genes (AAB07493, AAB96829) were found to reside on the same scaffold (Figure 5). Very little is known about the physical clustering of gene family members in conifers, but the availability of the loblolly pine genome sequence now provides the opportunity to study such relationships and their contribution to adaptation in conifers.

Wood formation

The genome assembly and annotation provide new information on the roles of specific genes involved in wood formation. Pine secondary xylem contains large numbers of tracheids with abundant bordered pits for both mechanical support and water transport; by contrast, the secondary xylem of woody dicots typically has specialized vessel elements for water conduction, and fiber cells for mechanical support [79,80]. The chemical composition of gymnosperm xylem is characterized by a guaiacyl-rich (G type) lignin and the absence of syringyl (S type) subunits [81]. The lignins in the xylem of woody dicots, Gnetales, and Selaginella (a lycopod) are characterized by a mixed polymer of S and G subunits (S/G lignin) [82]. The hemicelluloses of pines are a mixture of heteromannans, while dicot hemicelluloses are typically
xylan-rich [83]. The functional and evolutionary differences in lignin composition, hemicellulose composition, and presence of vessel elements between gymnosperms and angiosperms are informed by the pine genome sequence. Expressed homologs of all but one of the known genes for lignin precursor (monolignol) biosynthesis [84] have been identified in the pine genome assembly. The exception is the gene encoding ferulate 5-hydroxylase (also called coniferaldehyde 5-hydroxylase), the key enzyme for the formation of sinapyl alcohol, the precursor for S subunits in S/G lignin [85]. The absence of a 5-hydroxylase homolog is significant because of the depth of pine sequencing and the quality of the annotation [20]. A putative homolog of a gene only recently implicated in monolignol biosynthesis, encoding caffeoyl shikimate esterase [86], has also been identified in the pine annotation. Some monolignol gene variants are associated with quantitative variation in growth and wood properties such as wood density or microfibril angle [87-89]. The draft pine genome assembly contains putative homologs of six cellulose synthase subunits (CesA1,

Figure 4 Identification of TNL candidate gene for Fr1. (A) Genome survey of rust resistance in segregating progeny of Fr1/fr1 Pinus taeda among clonally propagated half-siblings (upper) and full-siblings (lower). Bins with highest LOD scores contained SNP 2_5345_01 (*). (B) Translated gene model (G) on genome scaffold jcf7180063178873 is interrupted by three introns with sizes given in bp, previously available EST (E) containing SNP 2_5345_01 (*), fully assembled transcript Evg1_1A_all_VO_L_3760_240252 from RNAseq (R) and the domain structure of the protein model (P).


3, 4, 5, 7, and 8), and two putative gene models for glucomannan 4-beta-mannosyltransferases and two for xyloglucan glycosyltransferases, consistent with the hemicellulose composition of pine. The pine genome assembly also contains putative homologs for many genes that encode transcription factors that regulate wood cell types or the perennial growth habit [90,91]. This information is useful to guide the genetic improvement of wood properties as resources for biomaterials and bioenergy.

Conclusions

The loblolly pine reference genome joins the recent genomes of Norway spruce and white spruce, forming a foundation for conifer genomics. To tackle the problem of reconstructing reference sequences for these leviathan genomes, the three projects each used different approaches. The whole genome shotgun approach has been historically favored because it gives a rapid result. An alternative has been the expensive and time-consuming application of cloning to reduce the complexity of the problem for tractability or to obtain a better result [8]. Our combined strategy resulted in the most complete and contiguous conifer (gymnosperm) genome sequenced and assembled to date [9], with an assembled reference sequence consisting of 20.1 billion base pairs contained in scaffolds spanning 23.18 billion base pairs. Our efforts to improve the quality of the loblolly pine reference genome sequence for conifers are continuing.
The importance of a high quality and complete reference sequence for major taxonomic groups is well chronicled [92]. The loblolly pine reference genome was obtained from a single tree, 20-1010, for which significant and continuing open-access genome resources are freely available through the Dendrome Project and TreeGenes Database [93].

Materials and methods
Reference genotype tissue and DNA
All source material was obtained from grafted ramets of our reference Pinus taeda genotype 20-1010. Our haploid target megagametophyte was dissected from a wind-pollinated pine seed collected from a tree in a Virginia Department of Forestry seed orchard near Providence Forge, Virginia. Diploid tissue was obtained from needles collected from trees at the Erambert Genetic Resource Management Area near Brooklyn, Mississippi and the Harrison Experimental Forest near Saucier, Mississippi. A detailed description of the preparation and QC of DNA from these tissue samples is contained in [9].

Sequencing, assembly, and validation
A detailed description of the whole genome shotgun sequencing, assembly, and validation of the V1.0 and V1.01 loblolly pine genomes is contained in [9].
To compare the contiguity of our V1.01 whole genome shotgun assembly to contemporary conifer genome assemblies, the scaffold sequences for white spruce genome [7] and Norway spruce [8] were obtained from Genbank. CEGMA analysis of the core gene set [18] performed on the V1.0 and V1.01 loblolly pine genomes was obtained as described in [9]. Similarly, a Norway spruce analysis was performed with results consistent with those reported in [8]. The results for the white spruce assembly were taken directly from [7].
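Contiguity comparisons of this kind are usually summarized with the N50 statistic (the abstract reports an N50 scaffold size of 66.9 kbp). The following is a minimal sketch of how an N50 could be computed from scaffold sequences in FASTA format. It is not the authors' code; the function names n50 and fasta_lengths are illustrative assumptions, and the definition used is the common one (the length L such that scaffolds of length at least L contain at least half of all bases).

```python
from typing import Iterable, Iterator


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the sum.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty input


def fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the length of each record found in FASTA-formatted lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


if __name__ == "__main__":
    # Toy example: total is 20, and the two longest (6 + 5) pass half of it.
    print(n50([2, 3, 4, 5, 6]))  # 5
```

For a real assembly, the same functions could be applied to an open file handle, for example n50(fasta_lengths(open(path))), where path is a hypothetical location of the downloaded scaffold file.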
To assemble the mitochondrial genome, a subset of the WGS sequence consisting of 255 bp paired end MiSeq reads from four Illumina paired end libraries (median insert sizes: 325, 441, 565, and 637) was selected for an independent organelle assembly. The 28.5 Mbp of sequence, representing less than 0.3× nuclear genomic coverage, was assembled using SOAPdenovo2 (K = 127). The resulting contigs were aligned using nucmer to a database containing the loblolly pine chloroplast, sequencing vector, 102 BACs, and 50 complete plant mitochondria. Contigs were identified and labeled as mitochondrial if they aligned exclusively to existing mitochondrial sequence and had high coverage (>=8) and G+C% (>=44%). The contigs were then combined with additional linking libraries, the LPMP_23 mate pair library and all DiTag libraries, and assembled a second time with SOAPdenovo2. Subsequently, intra-scaffold gaps were closed using GapCloser (v1.12). The assembled sequences were iteratively scaffolded and gaps were closed until no assembly improvements could be made.

Figure 5 Pinus taeda lp3 sequences from Genbank (AAB07493, AAB96829) were aligned to the same scaffold in v1.01 and supported by two distinct MAKER-derived gene models.

Annotation
The assembled genome was annotated with the MAKER-P pipeline [19] as described in [20]. Prior to gene prediction, the sequence was masked with similarity searches against RepBase and the Pine Interspersed Element Resource (PIER) [32]. Following the annotation, the TRIBE-MCL pipeline [94] was used to cluster the 399,358 protein sequences from 14 species into orthologous groups as described in [20].

Repetitive DNA content
Interspersed repeat detection was carried out in two stages, homology-based and de novo, as described in [20]. For homology-based identification, RepeatMasker 3.3.0 [95] was run against the PIER 2.0 repeat library [32] for both the full genome and introns. REPET 2.0 [96] was implemented with the pipeline described in [32] for de novo repeat discovery. Only the 63 longest scaffolds were used in the all-vs-all alignment (approximately 1% of the genome). In addition, PIER 2.0, the spruce repeat database, and publicly available transcripts from Pinus taeda and Pinus elliottii were utilized as input for known repeat and host gene recognition.
To identify tandem repeats, Tandem Repeat Finder (v4.0.7b) [97] was run on both the genome and transcriptome as described in [20]. Filtering of multimeric repeats and overlaps with interspersed repeats helped assess total tandem coverage and relative frequencies of specific satellites.

Data availability
Primary sequence data may be obtained from NCBI and is indexed under BioProject PRJNA174450. The whole genome shotgun sequence obtained for this assembly is available from the sequence read archive (SRA: SRP034079). The V1.0 and V1.01 genome sequences are available at [98]. Access to gene models, annotations, and Genome Browsers [99,100] is available through the TreeGenes database [93,101].

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
DBN, JLW, KAS, AZ, DM, CAL, KM, PJDJ, JAY, SLS, and CHL designed the research. DBN, JLW, KAS, AZ, DP, MC, CC, MK, AEH, JDL, PJMG, HAVG, BYL, JJZ, WMD, SFS, LW, DG, GM, MR, CH, MY, JMD, KS, JFDD, WWL, RWW, RS, DM, CAL, KM, PJDJ, JAY, SLS, and CHL performed research and analyzed data.
DBN, JLW, KAS, JMD, JFDD, RWW, RS, NW, PEM, CAL, SLS, and CHL wrote the article. All authors read and approved the final manuscript.

Acknowledgements
Funding for this project was made available through the USDA/NIFA (2011-67009-30030) award to DBN at University of California, Davis. We gratefully acknowledge the assistance of C. Dana Nelson at the USDA Forest Service Southern Research Station for providing and verifying the genotype of target tree material. We also wish to thank the management and staff of the DNA Technologies core facility at the UC Davis Genome Center for providing expert and timely HiSeq sequencing services to this project.

Author details
1 Department of Plant Sciences, University of California, Davis, CA, USA.
2 Department of Evolution and Ecology, University of California, Davis, CA, USA.
3 Institute for Physical Sciences and Technology (IPST), University of Maryland, College Park, MD, USA.
4 Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA.
5 Children's Hospital Oakland Research Institute, Oakland, CA, USA.
6 Department of Biology, Indiana University, Bloomington, IN, USA.
7 National Center for Genome Analysis Support, Indiana University, Bloomington, IN, USA.
8 Eccles Institute of Human Genetics, University of Utah, Salt Lake City, UT, USA.
9 School of Forest Resources and Conservation, Genetics Institute, University of Florida, Gainesville, FL, USA.
10 Southern Institute of Forest Genetics, USDA Forest Service, Southern Research Station, Saucier, MS, USA.
11 Warnell School of Forestry and Natural Resources, University of Georgia, Athens, GA, USA.
12 Department of Forestry and Environmental Resources, North Carolina State University, Raleigh, NC, USA.
13 Department of Horticulture, Washington State University, Pullman, WA, USA.
14 Department of Ecosystem Science and Management, Texas A&M University, College Station, TX, USA.

Received: 11 February 2014 Accepted: 4 March 2014 Published: 4 March 2014

References
1. Farjon A: World Checklist and Bibliography of Conifers. 2nd edition. Richmond: Kew Publishing; 2001.
2. Farjon A: The Natural History of Conifers. Portland, OR: Timber Press; 2008.
3. Leslie AB, Beaulieu JM, Rai HS, Crane PR, Donoghue MJ, Mathews S: Hemisphere-scale differences in conifer evolutionary dynamics. Proc Natl Acad Sci USA 2012, 109:16217–16221.
4. Martínez-García PJ, Stevens K, Wegrzyn J, Liechty J, Crepeau M, Langley C, Neale D: Combination of multipoint maximum likelihood (MML) and regression mapping algorithms to construct a high-density genetic linkage map for loblolly pine (Pinus taeda L.). Tree Genetics & Genomes 2013, 9:1529.
5. Zapata-Valenzuela J, Isik F, Maltecca C, Wegrzyn J, Neale D, McKeand S, Whetten R: SNP markers trace familial linkages in a cloned population of Pinus taeda - prospects for genomic selection. Tree Genetics & Genomes 2012, 8:1307–1318.
6. Neale D, Langley C, Salzberg S, Wegrzyn J: Open access to tree genomes: the path to a better forest. Genome Biol 2013, 14:120.
7. Birol I, Raymond A, Jackman SD, Pleasance S, Coope R, Taylor GA, Saint Yuen MM, Keeling CI, Brand D, Vandervalk BP, Kirk H, Pandoh P, Moore RA, Zhao YJ, Mungall AJ, Jaquish B, Yanchuk A, Ritland C, Boyle B, Bousquet J, Ritland K, MacKay J, Bohlmann J, Jones SJM: Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics 2013, 29:1492–1497.
8. Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin YC, Scofield DG, Vezzi F, Delhomme N, Giacomello S, Alexeyenko A, Vicedomini R, Sahlin K, Sherwood E, Elfstrand M, Gramzow L, Holmberg K, Hallman J, Keech O, Klasson L, Koriabine M, Kucukoglu M, Kaller M, Luthman J, Lysholm F, Niittyla T, Olson A, Rilakovic N, Ritland C, Rossello JA, Sena J, et al: The Norway spruce genome sequence and conifer genome evolution. Nature 2013, 497:579–584.
9. Zimin A, Stevens K, Crepeau M, Holtz-Morris A, Koriabine M, Marais G, Puiu D, Roberts M, Wegrzyn J, de Jong P, Neale D, Salzberg S, Yorke J, Langley C: Sequencing and assembly of the 22-Gb loblolly pine genome. Genetics 2014, 196:875–890.
10. O'Brien IEW, Smith DR, Gardner RC, Murray BG: Flow cytometric determination of genome size in Pinus. Plant Sci 1996, 115:91–99.
11. Zimin A, Marais G, Puiu D, Roberts M, Salzberg S, Yorke J: The MaSuRCA genome assembler. Bioinformatics 2013, 29:2669–2677.
12. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XQH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang JH, Miklos GLG, Nelson C, Broder S, Clark AG, Nadeau C, McKusick VA, Zinder N: The sequence of the human genome. Science 2001, 291:1304–1351.
13. Myers E: Toward simplifying and accurately formulating fragment assembly. J Comput Biol 1995, 2:275–290.
14. Pevzner PA: 1-Tuple DNA sequencing: computer analysis. J Biomol Struct Dyn 1989, 7:63–73.
15. Li RQ, Fan W, Tian G, Zhu HM, He L, Cai J, Huang QF, Cai QL, Li B, Bai YQ, Zhang ZH, Zhang YP, Wang W, Li J, Wei FW, Li H, Jian M, Li JW, Zhang ZL, Nielsen R, Li DW, Gu WJ, Yang ZT, Xuan ZL, Ryder OA, Leung FCC, Zhou Y, Cao JJ, Sun X, Fu YG: The sequence and de novo assembly of the giant panda genome. Nature 2010, 463:311–317.
16. Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G: Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 2008, 24:2818–2824.
17. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res 2012, 40:D290–D301.
18. Parra G, Bradnam K, Korf I: CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 2007, 23:1061–1067.
19. Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ, Ware D, Shiu SH, Childs KL, Sun Y, Jiang N, Yandell M: MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol 2014, 164:513–524.
20. Wegrzyn JL, Liechty JD, Stevens KA, Wu L-S, Loopstra CA, Vasquez-Gross HA, Dougherty WM, Lin BY, Zieve JJ, Martinez-Garcia PJ, Holt C, Yandell M, Zimin AV, Yorke JA, Crepeau MW, Puiu D, Salzberg SL, de Jong PJ, Mockaitis K, Main D, Langley CH, Neale DB: Unique features of the loblolly pine (Pinus taeda L.) megagenome revealed through sequence annotation. Genetics 2014, 196:891–909.
21. Goodstein DM, Shu SQ, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, Rokhsar DS: Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 2012, 40:D1178–D1186.
22. Amborella genome project: The Amborella genome and the evolution of flowering plants. Science 2013, 342:1467.
23. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30:1575–1584.
24. Glowacki S, Macioszek VK, Kononowicz AK: R proteins as fundamentals of plant innate immunity. Cell Mol Biol Lett 2011, 16:1–24.
25. Dao TTH, Linthorst HJM, Verpoorte R: Chalcone synthase and its functions in plant resistance. Phytochem Rev 2011, 10:397–412.
26. Saavedra L, Svensson J, Carballo V, Izmendi D, Welin B, Vidal S: A dehydrin gene in Physcomitrella patens is required for salt and osmotic stress tolerance. Plant J 2006, 45:237–249.
27. Velasco-Conde T, Yakovlev I, Majada JP, Aranda I, Johnsen O: Dehydrins in maritime pine (Pinus pinaster) and their expression related to drought stress response. Tree Genetics & Genomes 2012, 8:957–973.
28. Mathieu M, Lelu-Walter MA, Blervacq AS, David H, Hawkins S, Neutelings G: Germin-like genes are expressed during somatic embryogenesis and early development of conifers. Plant Mol Biol 2006, 61:615–627.
29. Carrillo MGC, Goodwin PH, Leach JE, Leung H, Cruz CMV: Phylogenomic relationships of rice oxalate oxidases to the cupin superfamily and their association with disease resistance QTL. Rice 2009, 2:67–79.
30. Faini M, Prinz S, Beck R, Schorb M, Riches JD, Bacia K, Brugger B, Wieland FT, Briggs JAG: The structures of COPI-coated vesicles reveal alternate coatomer conformations and interactions. Science 2012, 336:1451–1454.
31. Pucadyil TJ, Schmid SL: Conserved functions of membrane active GTPases in coated vesicle formation. Science 2009, 325:1217–1220.
32. Wegrzyn J, Lin B, Zieve J, Dougherty W, Martínez-García P, Koriabine M, Holtz-Morris A, de Jong P, Crepeau M, Langley C, Puiu D, Salzberg S, Neale D, Stevens K: Insights into the loblolly pine genome: characterization of BAC and fosmid sequences. PLoS ONE 2013, 8:e72439.
33. Kamm A, Doudrick RL, Heslop-Harrison JS, Schmidt T: The genomic and physical organization of Ty1-copia-like sequences as a component of large genomes in Pinus elliottii var elliottii and other gymnosperms. Proc Natl Acad Sci USA 1996, 93:2708–2713.
34. Kossack DS, Kinlaw CS: IFG, a gypsy-like retrotransposon in Pinus (Pinaceae), has an extensive history in pines. Plant Mol Biol 1999, 39:417–426.
35. Richards EJ, Ausubel FM: Isolation of a higher eukaryotic telomere from Arabidopsis thaliana. Cell 1988, 53:127–136.
36. Leitch AR, Leitch IJ: Ecological and genetic factors linked to contrasting genome dynamics in seed plants. New Phytol 2012, 194:629–646.
37. Aronen T, Ryynanen L: Variation in telomeric repeats of Scots pine (Pinus sylvestris L.). Tree Genetics & Genomes 2012, 8:267–275.
38. Flanary BE, Kletetschka G: Analysis of telomere length and telomerase activity in tree species of various life-spans, and with age in the bristlecone pine Pinus longaeva. Biogerontology 2005, 6:101–111.
39. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung D, Yiu S-M, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam T-W, Wang J: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 2012, 1:18.
40. Parks M, Cronn R, Liston A: Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC Biol 2009, 7:84.
41. Dangl JL, Horvath DM, Staskawicz BJ: Pivoting the plant immune system from dissection to deployment. Science 2013, 341:746–751.
42. Flor HH: Inheritance of pathogenicity in Melampsora lini. Phytopathology 1942, 32:653–669.
43. Flor HH: Current status of the gene-for-gene concept. Annu Rev Phytopathol 1971, 9:275–296.
44. Griggs MM, Walkinshaw CH: Diallel analysis of genetic-resistance to Cronartium quercuum f. sp. fusiforme in slash pine. Phytopathology 1982, 72:816–818.
45. Powers HR: Pathogenic variation among single-aeciospore isolates of Cronartium quercuum f. sp. fusiforme. For Sci 1980, 26:280–282.
46. Wilcox PL, Amerson HV, Kuhlman EG, Liu BH, O'Malley DM, Sederoff RR: Detection of a major gene for resistance to fusiform rust disease in loblolly pine by genomic mapping. Proc Natl Acad Sci USA 1996, 93:3859–3864.
47. Amerson HV, Nelson CD, Kubisiak TL: R gene detection and mapping in fusiform rust disease. Forests 2013, in review.
48. Quesada T, Resende MFRJ, Munoz P, Wegrzyn JL, Neale DB, Kirst M, Peter GF, Gezan SA, Nelson CD, Davis JM: Mapping fusiform rust resistance genes within a complex mating design of loblolly pine. Forests 2013, 5:347–362.
49. Kinloch BB, Parks GK, Fowler CW: White pine blister rust. Simply inherited resistance in sugar pine. Science 1970, 167:193–195.
50. Jermstad KD, Eckert AJ, Wegrzyn JL, Delfino-Mix A, Davis DA, Burton DC, Neale DB: Comparative mapping in Pinus: sugar pine (Pinus lambertiana Dougl.) and loblolly pine (Pinus taeda L.). Tree Genetics & Genomes 2011, 7:457–468.
51. Kayihan GC, Huber DA, Morse AM, White TL, Davis JM: Genetic dissection of fusiform rust and pitch canker disease traits in loblolly pine. Theor Appl Genet 2005, 110:948–958.
52. Whitham S, Dineshkumar SP, Choi D, Hehl R, Corr C, Baker B: The product of the tobacco mosaic-virus resistance gene N: Similarity to toll and the interleukin-1 receptor. Cell 1994, 78:1101–1115.
53. Meyers BC, Morgante M, Michelmore RW: TIR-X and TIR-NBS proteins: two new families related to disease resistance TIR-NBS-LRR proteins encoded in Arabidopsis and other plant genomes. Plant J 2002, 32:77–92.
54. Kubisiak TL, Amerson HV, Nelson CD: Genetic interaction of the fusiform rust fungus with resistance gene Fr1 in loblolly pine. Phytopathology 2005, 95:376–380.
55. Kubisiak TL, Anderson CL, Amerson HV, Smith JA, Davis JM, Nelson CD: A genomic map enriched for markers linked to Avr1 in Cronartium quercuum f. sp. fusiforme. Fungal Genet Biol 2011, 48:266–274.
56. McKeand S, Mullin T, Byram T, White T: Deployment of genetically improved loblolly and slash pines in the south. J For 2003, 101:32–37.
57. Jermstad KD, Sheppard LA, Kinloch BB, Delfino-Mix A, Ersoz ES, Krutovsky KV, Neale DB: Isolation of a full-length CC–NBS–LRR resistance gene analog candidate from sugar pine showing low nucleotide diversity. Tree Genetics & Genomes 2006, 2:76–85.
58. Willyard A, Syring J, Gernandt DS, Liston A, Cronn R: Fossil calibration of molecular divergence infers a moderate mutation rate and recent radiations for Pinus. Mol Biol Evol 2007, 24:90–101.
59. Burns RM, Honkala BH: Silvics of North America. In Agriculture Handbook 654. 2nd edition. Washington, DC: U.S. Department of Agriculture, Forest Service; 1990:877.
60. Mueller RC, Scudder CM, Porter ME, Trotter RT, Gehring CA, Whitham TG: Differential tree mortality in response to severe drought: evidence for long-term vegetation shifts. J Ecol 2005, 93:1085–1093.
61. Brix H: Determination of viability of loblolly pine seedlings after wilting. Bot Gaz 1960, 121:220–223.
62. Ferrell WK, Woodard ES: Effects of seed origin on drought resistance of douglas-fir (Pseudotsuga menziesii) (Mirb) Franco. Ecology 1966, 47:499.
63. Hamanishi ET, Campbell MM: Genome-wide responses to drought in forest trees. Forestry 2011, 84:273–283.
64. Newton R, Padmanabhan V, Loopstra C, Dias M: Molecular responses to water-deficit stress in woody plants. In Handbook of Plant and Crop Stress. Boca Raton, FL: M. Pessarakli, CRC Press; 1999:641–657.
65. Newton RJ, Funkhouser EA, Fong F, Tauer CG: Molecular and physiological genetics of drought tolerance in forest species. For Ecol Manage 1991, 43:225–250.
66. Chang SJ, Puryear JD, Dias MADL, Funkhouser EA, Newton RJ, Cairney J: Gene expression under water deficit in loblolly pine (Pinus taeda): isolation and characterization of cDNA clones. Physiol Plant 1996, 97:139–148.
67. Frankel N, Carrari F, Hasson E, Iusem ND: Evolutionary history of the Asr gene family. Gene 2006, 378:74–83.
68. Iusem ND, Bartholomew DM, Hitz WD, Scolnik PA: Tomato (Lycopersicon esculentum) transcript induced by water-deficit and ripening. Plant Physiol 1993, 102:1353–1354.
69. Cortes AJ, Chavarro MC, Madrinan S, This D, Blair MW: Molecular ecology and selection in the drought-related Asr gene polymorphisms in wild and cultivated common bean (Phaseolus vulgaris L.). BMC Genet 2012, 13:58.
70. Fischer I, Camus-Kulandaivelu L, Allal F, Stephan W: Adaptation to drought in two wild tomato species: the evolution of the Asr gene family. New Phytol 2011, 190:1032–1044.
71. Shen G, Pang YZ, Wu WS, Deng ZX, Liu XF, Lin J, Zhao LX, Sun XF, Tang KX: Molecular cloning, characterization and expression of a novel Asr gene from Ginkgo biloba. Plant Physiol Biochem 2005, 43:836–843.
72. Lorenz WW, Alba R, Yu YS, Bordeaux JM, Simoes M, Dean JFD: Microarray analysis and scale-free gene networks identify candidate regulators in drought-stressed roots of loblolly pine (P. taeda L.). BMC Genomics 2011, 12:264.
73. Lorenz WW, Sun F, Liang C, Kolychev D, Wang HM, Zhao X, Cordonnier-Pratt MM, Pratt LH, Dean JFD: Water stress-responsive genes in loblolly pine (Pinus taeda) roots identified by analyses of expressed sequence tag libraries. Tree Physiol 2006, 26:1–16.
74. Gonzalez-Martinez SC, Wheeler NC, Ersoz E, Nelson CD, Neale DB: Association genetics in Pinus taeda L. I. Wood property traits. Genetics 2007, 175:399–409.
75. Paiva JAP, Garces M, Alves A, Garnier-Gere P, Rodrigues JC, Lalanne C, Porcon S, Le Provost G, Perez DD, Brach J, Frigerio JM, Claverol S, Barre A, Fevereiro P, Plomion C: Molecular and phenotypic profiling from the base to the crown in maritime pine wood-forming tissue. New Phytol 2008, 178:283–301.
76. Joosen RVL, Lammers M, Balk PA, Bronnum P, Konings MCJM, Perks M, Stattin E, Van Wordragen MF, van der Geest AHM: Correlating gene expression to physiological parameters and environmental conditions during cold acclimation of Pinus sylvestris, identification of molecular markers using cDNA microarrays. Tree Physiol 2006, 26:1297–1313.
77. Eveno E, Collada C, Guevara MA, Leger V, Soto A, Diaz L, Leger P, Gonzalez-Martinez SC, Cervera MT, Plomion C, Garnier-Gere PH: Contrasting patterns of selection at Pinus pinaster Ait. drought stress candidate genes as revealed by genetic differentiation analyses. Mol Biol Evol 2008, 25:417–437.
78. Grivet D, Sebastiani F, González-Martínez SC, Vendramin GG: Patterns of polymorphism resulting from long-range colonization in the Mediterranean conifer Aleppo pine. New Phytol 2009, 184:1016–1028.
79. Donaldson LA: Lignification and lignin topochemistry - an ultrastructural view. Phytochemistry 2001, 57:859–873.
80. Sperry JS, Hacke UG, Pittermann J: Size and function in conifer tracheids and angiosperm vessels. Am J Bot 2006, 93:1490–1500.
81. Sarkanen KV, Ludwig CH: Lignins: Occurrence, Formation, Structure and Reactions. Wiley Interscience: New York, NY; 1971.
82. Weng JK, Akiyama T, Bonawitz ND, Li X, Ralph J, Chapple C: Convergent evolution of syringyl lignin biosynthesis via distinct pathways in the lycophyte Selaginella and flowering plants. Plant Cell 2010, 22:1033–1045.
83. Scheller HV, Ulvskov P: Hemicelluloses. Annu Rev Plant Biol 2010, 61:263–289.
84. Shi R, Sun YH, Li QZ, Heber S, Sederoff R, Chiang VL: Towards a systems approach for lignin biosynthesis in Populus trichocarpa: transcript abundance and specificity of the monolignol biosynthetic genes. Plant Cell Physiol 2010, 51:144–163.
85. Osakabe K, Tsao CC, Li LG, Popko JL, Umezawa T, Carraway DT, Smeltzer RH, Joshi CP, Chiang VL: Coniferyl aldehyde 5-hydroxylation and methylation direct syringyl lignin biosynthesis in angiosperms. Proc Natl Acad Sci USA 1999, 96:8955–8960.
86. Vanholme R, Cesarino I, Rataj K, Xiao YG, Sundin L, Goeminne G, Kim H, Cross J, Morreel K, Araujo P, Welsh L, Haustraete J, McClellan C, Vanholme B, Ralph J, Simpson GG, Halpin C, Boerjan W: Caffeoyl shikimate esterase (CSE) is an enzyme in the lignin biosynthetic pathway in Arabidopsis. Science 2013, 341:1103–1106.
87. Gion JM, Carouche A, Deweer S, Bedon F, Pichavant F, Charpentier JP, Bailleres H, Rozenberg P, Carocha V, Ognouabi N, Verhaegen D, Grima-Pettenati J, Vigneron P, Plomion C: Comprehensive genetic dissection of wood properties in a widely-grown tropical tree: Eucalyptus. BMC Genomics 2011, 12:301.
88. Thumma BR, Nolan MR, Evans R, Moran GF: Polymorphisms in cinnamoyl CoA reductase (CCR) are associated with variation in microfibril angle in Eucalyptus spp. Genetics 2005, 171:1257–1265.
89. Yu Q, Li B, Nelson CD, McKeand SE, Batista VB, Mullin TJ: Association of the cad-n1 allele with increased stem growth and wood density in full-sib families of loblolly pine. Tree Genetics & Genomes 2006, 2:98–108.
90. Demura T, Fukuda H: Transcriptional regulation in wood formation. Trends Plant Sci 2007, 12:64–70.
91. Melzer S, Lens F, Gennen J, Vanneste S, Rohde A, Beeckman T: Flowering-time genes modulate meristem determinacy and growth form in Arabidopsis thaliana. Nat Genet 2008, 40:1489–1492.
92. Fraser CM, Eisen JA, Nelson KE, Paulsen IT, Salzberg SL: The value of complete microbial genome sequencing (you get what you pay for). J Bacteriol 2002, 184:6403–6405.
93. Wegrzyn JL, Lee JM, Tearse BR, Neale DB: TreeGenes: a forest tree genome database. Int J Plant Genom 2008, 2008:412875.
94. van Dongen S, Abreu-Goodger C: Using MCL to extract clusters from networks. Bacterial Molecular Networks 2012, 804:281–295.
95. RepeatMasker. [http://www.repeatmasker.org/]
96. Flutre T, Duprat E, Feuillet C, Quesneville H: Considering transposable element diversification in de novo annotation approaches. PLoS ONE 2011, 6:e16526.
97. Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 1999, 27:573–580.
98. Pine reference sequences. [http://www.pinegenome.org/pinerefseq]
99. Lee E, Helt GA, Reese JT, Munoz-Torres MC, Childers CP, Buels RM, Stein L, Holmes IH, Elsik CG, Lewis SE: Web Apollo: a web-based genomic annotation editing platform. Genome Biol 2013, 14:R93.
100. Stein LD, Mungall C, Shu SQ, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res 2002, 12:1599–1610.
101. TreeGenes. A forest tree genome database. [http://dendrome.ucdavis.edu/treegenes]

doi:10.1186/gb-2014-15-3-r59
Cite this article as: Neale et al.: Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome Biology 2014 15:R59.