Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication


Material Information

Title:
Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication
Physical Description:
Mixed Material
Language:
English
Creator:
Nossa, Carlos
Havlak, Paul
Yue, Jia-Xing
Lv, Jie
Vincent, Kimberly Y.
Publisher:
BioMed Central (GigaScience)
Publication Date:

Notes

Abstract:
Background: Horseshoe crabs are marine arthropods with a fossil record extending back approximately 450 million years. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits, and are often cited as examples of “living fossils.” As arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose sequenced genomes (including insects and nematodes) have thus far shown more divergence from the ancestral pattern of eumetazoan genome organization than cnidarians, deuterostomes and lophotrochozoans. However, much of ecdysozoan diversity remains unrepresented in comparative genomic analyses. Results: Here we apply a new strategy of combined de novo assembly and genetic mapping to examine the chromosome-scale genome organization of the Atlantic horseshoe crab, Limulus polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their parents at a mean redundancy of 1.1x per sample. The map includes 84,307 sequence markers grouped into 1,876 distinct genetic intervals and 5,775 candidate conserved protein coding genes. Conclusions: Comparison with other metazoan genomes shows that the L. polyphemus genome preserves ancestral bilaterian linkage groups, and that a common ancestor of modern horseshoe crabs underwent one or more ancient whole genome duplications 300 million years ago, followed by extensive chromosome fusion. These results provide a counter-example to the often noted correlation between whole genome duplication and evolutionary radiations. The new, low-cost genetic mapping method for obtaining a chromosome-scale view of non-model organism genomes that we demonstrate here does not require laboratory culture, and is potentially applicable to a broad range of other species. Keywords: Genotyping-by-sequencing (GBS), Genetic linkage mapping, Genome evolution, Limulus polyphemus
General Note:
Nossa et al. GigaScience 2014, 3:9 http://www.gigasciencejournal.com/content/3/1/9; Pages 1-21
General Note:
doi:10.1186/2047-217X-3-9 Cite this article as: Nossa et al.: Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication. GigaScience 2014 3:9.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
© 2014 Nossa et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
System ID:
AA00024687:00001

Full Text

PAGE 1

RESEARCH Open Access

Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication

Carlos W Nossa1,4, Paul Havlak1, Jia-Xing Yue1, Jie Lv1, Kimberly Y Vincent1, H Jane Brockmann3 and Nicholas H Putnam1,2*

Abstract

Background: Horseshoe crabs are marine arthropods with a fossil record extending back approximately 450 million years. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits, and are often cited as examples of “living fossils.” As arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose sequenced genomes (including insects and nematodes) have thus far shown more divergence from the ancestral pattern of eumetazoan genome organization than cnidarians, deuterostomes and lophotrochozoans. However, much of ecdysozoan diversity remains unrepresented in comparative genomic analyses.

Results: Here we apply a new strategy of combined de novo assembly and genetic mapping to examine the chromosome-scale genome organization of the Atlantic horseshoe crab, Limulus polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their parents at a mean redundancy of 1.1x per sample. The map includes 84,307 sequence markers grouped into 1,876 distinct genetic intervals and 5,775 candidate conserved protein coding genes.

Conclusions: Comparison with other metazoan genomes shows that the L. polyphemus genome preserves ancestral bilaterian linkage groups, and that a common ancestor of modern horseshoe crabs underwent one or more ancient whole genome duplications 300 million years ago, followed by extensive chromosome fusion. These results provide a counter-example to the often noted correlation between whole genome duplication and evolutionary radiations. The new, low-cost genetic mapping method for obtaining a chromosome-scale view of non-model organism genomes that we demonstrate here does not require laboratory culture, and is potentially applicable to a broad range of other species.
Keywords: Genotyping-by-sequencing (GBS), Genetic linkage mapping, Genome evolution, Limulus polyphemus

*Correspondence: nputnam@gmail.com
1Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
2Department of Biochemistry and Cell Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
Full list of author information is available at the end of the article

© 2014 Nossa et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Nossa et al. GigaScience 2014, 3:9 http://www.gigasciencejournal.com/content/3/1/9

Background

Comparative analysis of genome sequences from diverse metazoans has revealed much about their evolution over hundreds of millions of years. The discovery of extensive gene homology across large evolutionary distances has allowed researchers to track chromosome rearrangements and whole genome duplications. The resulting value of whole chromosome sequences presents a challenge for existing whole genome shotgun (WGS) assembly strategies.

Whole genome duplication events were long suspected [1], but only the availability of genome sequences has allowed confirmation of them in fungal, vertebrate, plant and ciliate lineages [2-5]. In contrast, when only a few chordate, insect and nematode genomes were available, conservation of gene linkage (i.e., synteny) and gene order were observed only between closely-related species, and consequently were not expected to be conserved between phyla. As more metazoan genomes have been sequenced, it has become clear that long-range linkage has been conserved over long time scales in many lineages. Sequencing the genomes of representatives of chordate, mollusk, annelid, cnidarian, placozoan and sponge clades has identified 17 or 18 ancestral linkage groups (ALGs)

PAGE 2

[6-10]. Each of these ALGs consists of a set of ancestral genes whose descendants share conserved synteny in multiple sequenced genomes. These ALGs have been interpreted to correspond to ancestral metazoan chromosomes, and correlations between inferred rates of gene movement between ALGs across the metazoan tree suggest that these ancestral linkage relationships are conserved through the action of selective constraints on a subset of genes [11].

The relatively small number of genomes from anciently distinct metazoan lineages and the fragmented nature of draft genome assemblies still limit both the search for ancient whole genome duplications and the power of the data to constrain models of chromosome-scale genome structure evolution. While WGS sequencing technology and assembly methods are active areas of research and technological development, and have improved at a dramatic pace in recent years, high quality de novo assembly of large, complex metazoan genomes remains a difficult and resource-intensive problem. Without genetic or physical maps, or reliance on a high-quality reference genome of a closely-related species, WGS sequencing projects still typically produce assemblies containing thousands of scaffolds, hundreds of scaffolds incorrectly joining sequence from different chromosomes, or both [12].

Next-generation sequencing has greatly reduced the cost of constructing high density genetic maps by eliminating the need to develop and genotype polymorphic markers individually [13]. This has been achieved either by focusing sequence coverage within or adjacent to genomic regions of distinct biochemical character, such as restriction sites with restriction site associated DNA sequencing (RAD-seq) and related methods [14,15], or by combining information across regions using a reference genome sequence [16,17]. While RAD-seq is applicable to organisms lacking a reference genome assembly, it is not directly applicable to comparisons of genome organization across long evolutionary time spans because such comparisons rely on the identification of homologous sequence markers (typically protein-coding genes), which typically have only a small overlap with the restriction-associated markers.
Here we present a genotype-by-sequencing method for constructing a high-density genetic map using low-coverage, low-cost, whole genome sequencing data from the offspring of a wild cross. In this joint assembly and mapping (JAM) approach, the traditionally independent and sequential steps of genome assembly, polymorphic marker identification and genetic map construction are combined. Existing assemblers expect lower densities of sequence polymorphism, deeper coverage, greater computer memory or more aggressive quality trimming that decreases sequence coverage [18-20]. Our current implementation focuses on conservative assembly of short scaffolds sufficient for map construction, but our results suggest that further integration of genetic mapping information within whole genome shotgun assembly methods can be a cost-effective way to produce assemblies of large, complex genomes with chromosome-scale contiguity.

We have applied this approach to produce a genetic map of the genome of the Atlantic horseshoe crab, Limulus polyphemus. Horseshoe crabs are marine arthropods with a fossil record extending back 450 million years [21]. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits [22], and are often cited as examples of “living fossils”. L. polyphemus has a genome about 90% the size of the human genome. It is an important species from ecological, commercial and conservation perspectives [23], and has been used as a model system for research in behavioral ecology, physiology and development [24]. The map and SNP markers described here will be a resource for the L. polyphemus genome project, research in horseshoe crab population biology, and comparisons of metazoan genome organization. By anchoring protein coding genes to this map, we are able to extend analysis of ancestral linkage groups and whole genome duplications to the chelicerate lineage.

Data description

A pair of naturally spawning horseshoe crabs and their eggs were collected from their natural habitat on the beach at Seahorse Key. The larvae hatched in the lab 4 weeks after the collection date. Tissue samples from the third walking legs of the parental horseshoe crabs and 34 larvae were used for DNA extraction and library preparation following the manufacturers’ standard protocols. The two parents and 34 larvae were individually barcoded during library preparation. Illumina paired-end libraries with insert sizes of approximately 300 bp were prepared for each sample. These libraries were pooled for subsequent sequencing on the Illumina HiSeq 2000 platform at the Medical College of Wisconsin Sequencing Service Core Facility. A total of 1.7 billion 100 bp paired-end (PE) reads were obtained after quality filtering. The total sequencing coverage was estimated as 38.9× based on the k-mer frequency distribution. The raw sequencing data can be retrieved from NCBI SRA via NCBI BioProject accession PRJNA187356.

Analyses

Assembly and mapping

The JAM method is designed to produce a combined assembly of polymorphic sequences, tagged by genomic regions with a maximum of one single nucleotide polymorphism (SNP) per k-mer window (Methods). Starting with genomic reads from a mating pair of adult L. polyphemus and 34 offspring (100 bp paired-end reads on 300 bp

PAGE 3

inserts), we analyzed 1.7 billion reads containing at least one high-quality 23-mer.

Fitting Poisson models for unduplicated sequence to the frequencies of filtered 23-mers suggests that 1.1 billion genomic loci are unique at this resolution (Figure 1). This fit assumes that the vast majority of mappable genomic loci have only one or two alleles represented in the parents’ four haploid contributions, which gives rise to the four components plotted in Figure 1. Of these sufficiently unique loci, 63% are modeled as homozygous, 27% as paired major-minor alleles, and the remaining 10% as tied allele pairs. The corresponding SNPs, if always at least 23 bases apart, would be 1.6% of bases in the unique loci. Dividing the total number of filtered 23-mers by the modeled homozygous depth of coverage d = 38.9 yields an estimated genome size of 2.74 billion bases, consistent with the measured DNA content of 2.8 pg (978 Mb per pg of DNA) [25].

We categorize specific 23-mers by their edit distances to others: having no neighbors within a single base substitution (unique tags) or with a single mutually unique one-substitution neighbor (“SNPmer pair” tags). A subset of these, including SNPmer pairs for approximately 7.9 million SNPs, constitute the tags used for contigging and scaffolding. The SNPmer pairs account for approximately 45% of the modeled fraction of alleles, the others being missed because of similarity to other sequences (e.g., due to repeats) or distance from each other (because of indels or multiple SNPs per 23-mer).

Chaining these 23-mers together (see Methods) produces an initial 6.6 million contigs, 3.9 million of which are linkable by paired reads for scaffolding. Applying Bambus [26] produces 944,000 scaffolds spanning 1.3 billion bases (Table 1). These scaffolds serve as markers incorporating multiple 23-mer tags, including SNPmer pairs used to identify haplotypes.

After assembly, the mean density of SNPs across the four parental haplotypes in assembled regions was estimated based on read re-alignments to be 7.6 per thousand bases.
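The tag categories described above (unique tags and SNPmer pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the toy 5-mers stand in for 23-mers, and reverse complements are ignored for simplicity.

```python
def one_substitution_neighbors(kmer):
    """Yield every k-mer that differs from `kmer` by exactly one base."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def classify_tags(kmer_set):
    """Split observed k-mers into unique tags and mutually unique SNPmer pairs.

    Simplifying assumption (not from the paper): all k-mers are on one strand.
    """
    neighbors = {
        km: [n for n in one_substitution_neighbors(km) if n in kmer_set]
        for km in kmer_set
    }
    # Unique tags have no observed neighbor within one substitution.
    unique = {km for km, ns in neighbors.items() if not ns}
    # SNPmer pairs: each member's only one-substitution neighbor is the other.
    pairs = set()
    for km, ns in neighbors.items():
        if len(ns) == 1 and neighbors[ns[0]] == [km]:
            pairs.add(tuple(sorted((km, ns[0]))))
    return unique, pairs


# Toy input: one clean SNP pair, one isolated k-mer, and a three-way cluster
# that is ambiguous and therefore belongs to neither category.
toy = {"ACGTA", "ACGTC", "GGGGG", "TTTAA", "TTTAC", "TTTAG"}
unique_tags, snpmer_pairs = classify_tags(toy)
```

In this toy set the three `TTTA*` k-mers are each other's neighbors, so none of them qualifies as a tag, mirroring how loci with more than two alleles or repeat-like sequence are left out.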
We jointly inferred the phases of these SNPs and segregation pattern (offspring genotypes) in the mapping cross for each marker in a maximum likelihood framework (Methods). We focused on the 91,320 markers with at least 18 inferred bi-allelic SNPs for constructing the linkage map. These markers grouped into 1,908 high-confidence map bins (i.e., unique segregation patterns, assumed to correspond to loci in the genome uninterrupted by meiotic recombination in the cross) [27]. Map bins fell into 32 linkage groups (Figure 2), close to the 26 pairs (2N = 52) previously found in a cytogenetic analysis of two chromosome spreads [28]. Twenty map bins were removed for having inconsistent positions in the maternal and paternal maps, and 12 were singletons.

To estimate the frequency of incorrect genotype calls as a function of the log likelihood difference between the called and alternative genotype (genotype confidence score), including contributions from uncertainty in SNP-mer identification, assembly and sampling noise, we carried out a simulation of the library pooling and sequencing, k-mer assembly and genotype inference protocols, using the sequenced Ciona intestinalis genome as a starting point.

Figure 1 Fitting Poisson distributions to Limulus 23-mer frequencies. The distribution of sequenced 23-mers, modeled as sampling genomic loci that are homozygous (single allele shared by all haplotypes), have two alleles that are tied (A and a present in parents as AAaa or AaAa), or have two alleles in a major-minor relationship (present in parents as AAAa or Aaaa). Alleles whose parental contributions are homozygous, tied, major or minor each have a frequency peak corresponding to their distinctive fraction of the overall depth of genomic sequencing d. Loci sharing simple or repetitive sequence not sufficiently unique at a 23-mer scale contribute to a long tail off the right edge of the plot.
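The numbers quoted for the 23-mer model can be cross-checked with a few lines of arithmetic. The sketch below assumes that the four parental haplotypes are sampled about equally, so that an allele carried by 4, 3, 2 or 1 of them peaks at d, 3d/4, d/2 or d/4; that fraction assignment is our reading of the Figure 1 caption, not a statement from the paper, and the variable names are illustrative.

```python
K = 23                 # k-mer length used in the analysis
DEPTH = 38.9           # modeled homozygous depth of coverage d
MB_PER_PG = 978        # conversion between DNA mass and sequence length
DNA_CONTENT_PG = 2.8   # measured haploid DNA content [25]

# Expected frequency peak for an allele carried by n of the 4 parental
# haplotypes (assumption: equal sampling of the four haplotypes).
peaks = {
    "homozygous": DEPTH * 4 / 4,
    "major": DEPTH * 3 / 4,
    "tied": DEPTH * 2 / 4,
    "minor": DEPTH * 1 / 4,
}

# Genome size implied by the measured DNA content, in megabases.
genome_mb = DNA_CONTENT_PG * MB_PER_PG

# 27% major-minor plus 10% tied loci are polymorphic. If SNPs are at least
# K bases apart, each SNP makes K overlapping k-mer loci polymorphic.
polymorphic_fraction = 0.27 + 0.10
snps_per_base = polymorphic_fraction / K

print(f"genome size from DNA content: {genome_mb / 1000:.2f} Gb")
print(f"SNP density in unique loci:   {snps_per_base:.1%}")
```

Both results agree with the text: about 2.74 Gb from the DNA content, and about 1.6% of bases in unique loci being SNPs.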

PAGE 4

In the simulated C. intestinalis dataset (Methods), a single stretched exponential distribution provided a good fit to the frequency of genotype calling errors as a function of the call confidence score for scores up to 6, or down to an error frequency of approximately 1%. The observed error frequency declined more slowly for higher confidence scores. The minimum χ² fit used for estimating the genotyping error rate in the L. polyphemus map bins was $p_e(s) = a_1 e^{-(s/c_1)^{b_1}} + a_2 e^{-(s/c_2)^{b_2}}$, with parameter values a1 = 0.49, b1 = 2.08, c1 = 1.26, a2 = 5.47, b2 = 0.17, c2 = 0.16 (Figure 3).

Applying this model to the L. polyphemus marker genome calls, we estimated that the genotype calling error rate in the map bin representative markers was 0.0099. We observed that 51% of adjacent map bin pairs are separated by a single inferred recombination event in the cross, and 94% are separated by three or fewer recombinants in each parent.

Of the 91,320 markers with at least 18 putative SNPs, 84,307 (92%) were assigned to their closest map bins with a threshold of (Methods), for an estimated genome-wide average density of one mapped sequence marker every 32 kb. A mean of 45 markers were mapped to each map bin, and the number of markers mapped was used to estimate the relative physical size of map bins. Approximately 46% of the scaffolds with 12–17 SNPs could be placed with the same threshold, for an additional 32,688 markers, or one marker every 23 kb.
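The marker-density figures in the preceding paragraph follow from simple division. A short check, assuming the 2.7 Gbp genome size given in the abstract (variable names are illustrative):

```python
GENOME_BP = 2.7e9          # genome size as stated in the abstract
CANDIDATE_MARKERS = 91_320 # markers with at least 18 putative SNPs
PLACED_MARKERS = 84_307    # of those, markers assigned to a map bin
EXTRA_MARKERS = 32_688     # placed scaffolds with 12-17 SNPs
MAP_BINS = 1_876           # distinct genetic intervals in the final map

placed_fraction = PLACED_MARKERS / CANDIDATE_MARKERS
spacing_primary_kb = GENOME_BP / PLACED_MARKERS / 1e3
spacing_combined_kb = GENOME_BP / (PLACED_MARKERS + EXTRA_MARKERS) / 1e3
markers_per_bin = PLACED_MARKERS / MAP_BINS

print(f"placed: {placed_fraction:.0%}")
print(f"one marker every {spacing_primary_kb:.1f} kb (>=18 SNPs only)")
print(f"one marker every {spacing_combined_kb:.1f} kb (with 12-17 SNP scaffolds)")
print(f"mean markers per map bin: {markers_per_bin:.1f}")
```

The results (about 92%, 32 kb, 23 kb and 45 markers per bin) match the values reported in the text.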
The total length of the scaffolds assigned to map bins was 411 Mb, and they contained 2.67 million bi-allelic SNPs assigned a phase with a posterior probability of at least 0.99. Of these, 72% were inferred to be unique to one of the four parental chromosomes. This is close to the 74% predicted under the finite sites neutral coalescent model given the observed SNP density [29].

Sequence composition and recombination rate

In the scaffolds longer than 1 kb (N = 378,506 and mean length = 2.9 kb), the G/C base content was 33.3 ± 2.8%, and the local relative frequency of CpG dinucleotides was bimodally distributed, with about 30% of sequences exhibiting depletion of CpG. TpG and CpA dinucleotides were over-represented on average and their local densities negatively correlated (r = −0.54, p < 2.2e-16) with CpG density, suggesting ongoing germ-line CpG methylation for a fraction of the genome [30].

The mean maternal and paternal recombination rates were estimated to be 1.28 and 0.76 centimorgans per megabase respectively, consistent with expectation based on the negative correlation between recombination intensity and genome size observed in previous studies [31]. We did not observe evidence of segregation distortion for any map bins. The correlation in local recombination rates in the two parents across the genome is estimated as r² = 7.1%; p < 1e-29, suggesting considerable variation in recombination landscape between the two sexes in Limulus [32-34]. A positive correlation between local recombination rate and local SNP density was observed (r² = 9.7%; p < 1e-40), which is consistent with previous observations in humans, with a comparable correlation coefficient [35].

Ancestral linkage group conservation

We found that 34,942 scaffolds have significant sequence conservation with 10,399 predicted proteins of the tick Ixodes scapularis: like Limulus, a chelicerate, but one with a well-annotated genome [36]. Of these hits, 6,246 formed reciprocal best pairs, of which 5,775 (92%) could be placed on the linkage map at a threshold of p <. These were used as conserved markers for comparisons of genome organization. When linkage groups were divided into 108 non-overlapping bins of 1,000 markers, 52 had significant (p < 0.05, after Bonferroni correction for 1,944 pairwise tests) enrichment in shared orthologs (or “hit”) with at least one of eighteen ancestral chordate linkage groups [7]. A hidden Markov model segmentation algorithm [6] identified 40 breakpoints in ALG composition in the linkage groups. Approximately 72% of the genome is spanned by 53 intervening segments that hit one or (for eight of them) two ALGs (Figures 4 and 5). Each of the eighteen ancestral ALGs has at least one hit among the 45 segments with a unique hit to the ALGs.

Table 1 K-mer contig and scaffold statistics
Assembled            Count       Total (bp)       Avg. span (bp)   n50 span (bp)
k-mer contigs        6,614,434   1,240,275,515    188              418
Linkable contigs     3,925,844   1,137,576,911    290              460
Initial scaffolds    944,246     1,261,263,172    1336             3047
Reference scaffolds  944,246     1,295,334,515    1372             2930
Reference bases                  1,131,458,744    1198             2553

Whole genome duplications

Whole genome duplication (WGD), or polyploidization, is a rare but dramatic genetic mutation event which doubles the size of a genome and creates a redundant pair of copies from every gene. Because it creates

PAGE 5

redundant copies of genes for entire biochemical pathways and genetic networks, it has been proposed that it creates unique raw material for the evolution of novel biological functions and increased complexity.

Figure 2 Each of the numbered blocks represents one of the 32 linkage groups of the L. polyphemus genetic map, and is composed of four columns: Two bands of the triangular matrices in which the color scale indicates the fraction of samples showing recombination between pairs of markers; maternal recombination frequency is shown on the left, paternal on the right. A column labeled “ALG” indicates segments of significant (p < 0.05 in Fisher’s Exact Test, after Bonferroni correction for multiple tests) conservation of gene content with ancestral bilaterian linkage groups. The column labeled HOX shows the map positions and types of predicted homeobox transcription factor genes. The two color scales are for: recombination frequency between pairs of markers and log p-value for enrichment in gene content with ancestral linkage groups.

Homeobox gene clusters

Homeobox genes encode a large family of transcription factors involved in diverse embryonic patterning and structure formation processes of eukaryotes. As a particular

PAGE 6

subfamily of homeobox genes, the Hox cluster is known to control metazoan body patterning along the anterior-posterior axis. We identified 155 scaffolds with significant homology to predicted chelicerate homeobox gene sequences in public databases. We classified these sequences into homeobox subfamilies (Methods) and placed them on the map by best hit. Two large clusters of Hox genes are found on linkage group (LG) 15 and LG21, each containing multiple members of the anterior, central and posterior classes. There are also two parahox cluster homologs, each with three homeobox genes: gsx and cdx orthologs and a third homeobox gene not confidently assigned to a subfamily in our analysis (LG5 and LG19). There are two smaller clusters containing multiple hox genes (LG18 and LG20), and clusters of other homeobox genes, including members of the msx, lbx, nk, evx and gbx families (Figure 2).

Figure 3 Genotype call error rate as a function of call confidence score for bins of 10,000 calls in simulated Ciona intestinalis genome data. The stippled blue region shows 95% confidence intervals of the Bayesian posterior probability distribution of the underlying error rate computed from the Beta distribution Beta(n_e + 1, n_c − n_e + 1), conjugate to the assumed binomial distribution of observed errors, where n_e and n_c are the number of errors and number of calls in each bin respectively. The red curve shows the best fit error model $a_1 e^{-(s/c_1)^{b_1}} + a_2 e^{-(s/c_2)^{b_2}}$, with parameter values a1 = 0.49, b1 = 2.08, c1 = 1.26, a2 = 5.47, b2 = 0.17, c2 = 0.16. Reduced χ² = 0.82.

Figure 4 Limulus-Human macro-synteny dot plot. Blue points indicate the position of human genes in reconstructed ancestral chordate ALGs (vertical displacement) and their candidate orthologs in the 30 L. polyphemus linkage groups (horizontal displacement).

Genomic distribution of paralogous genes

WGD creates many pairs of duplicate genes or “paralogs”. The distinctive features of these genes have been used to

PAGE 7

infer WGD events in fungal, vertebrate and plant genomes [2-4]. We examined the genomic distribution of 2,716 pairs of candidate paralogous gene markers in L. polyphemus for signatures of WGD. In 45% of these pairs both markers mapped to the same chromosome, compared with 5.3 ± 0.5% in 1,000 datasets with randomly-permuted paralogous gene identities. The mapped positions of pairs within the chromosomes were highly correlated (average r² = 0.81, and exceeding 0.95 for 8 of the large chromosomes; Figure 6), suggesting that many of the pairs represent recent tandem gene duplicates or single genes fragmented across multiple markers. In the following, these

Figure 5 Limulus-Human macro-synteny dot plot as in Figure 4, showing breaks introduced by hidden Markov model segmentation of the linkage groups as vertical lines.

Figure 6 Genomic distribution of candidate paralogs. The map positions of pairs of putatively paralogous protein coding genes within the L. polyphemus genome are plotted with blue points. Pairs are biased toward nearby map positions, and therefore concentrated along the diagonal. Also, paralogs split between linkage groups are significantly clustered into “paralogons”.

PAGE 8

same-chromosome paralogs are referred to as “tandem” duplicates.

Inter-chromosomal duplicates are clustered into conserved paralogous micro-synteny blocks (or “paralogons” [3]): there are 25 pairs of loci, each with at least six (mp = 6) independent paralog pairs clustered with a maximum gap (max-gap) of 300 markers between adjacent paralogs in each cluster. Paralog pairs are considered independent if they are based on homology to a distinct out-group gene, to guard against relying on either multiple exons of the same gene, or recently-duplicated genes as independent evidence of ancient segmental paralogy. These clusters span 25,044 markers, or 30% of the map, after removing redundancy from paralogons with overlapping footprints. In 1,000 datasets with randomly-permuted paralogous gene identities, the maximum number of such clusters observed was 11; the mean and standard deviation were 3.9 ± 1.0. The observed clustering into paralogons was greater than that in the randomized datasets over a broad range of choices of max-gap and mp. For example, for max-gap = 100, mp = 3 there are 52 clusters vs. 3.5 ± 1.9, range 0–10; for max-gap = 500, mp = 9 there are 12 vs. 2.9 ± 1.7, range 0–9. Because of the large proportion of apparent tandem gene duplicates (45%), this randomization scheme increases the number of inter-chromosomal paralog pairs relative to the data, making it a conservative significance test for inter-chromosomal paralog clustering. When genes with tandem duplicates are excluded from the randomizations, the observed number of clusters is greater than the maximum observed in 1000 randomizations for all the combinations of max-gap in the set (100, 200, 300, 400, 500, 600) and mp in the set (3, 4, 5, 6, 7, 8, 9). The 23 clusters found with max-gap = 600, mp = 7 span 59% of the map, compared to a respective mean number and map coverage of 3.3 ± 1.8 and 13 ± 6% in these randomizations.

Among the marker pairs mapping to different chromosomes, we found a significant excess of pairs relating segments derived from the same ALG relative to randomization controls (247 pairs vs. 102 ± 11, p < 0.001 in randomizations of all genes; 202 vs. 46 ± 7 when genes in tandem duplicates are excluded). This pattern is consistent with the creation of these segments by duplication (rather than fission).
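The randomization controls used throughout this section follow the usual permutation-test pattern: compute a statistic on the observed data, then recompute it on many datasets in which the gene identities have been shuffled. The sketch below illustrates that pattern for the simplest statistic in the text, the fraction of paralog pairs that map to the same chromosome. The toy data, the function names and the choice to shuffle chromosome labels among genes are all assumptions made for illustration; they are not taken from the authors' software.

```python
import random


def same_chromosome_fraction(pairs, chrom_of):
    """Fraction of gene pairs whose two members map to the same chromosome."""
    same = sum(1 for a, b in pairs if chrom_of[a] == chrom_of[b])
    return same / len(pairs)


def permutation_null(pairs, chrom_of, n_perm=1000, seed=0):
    """Recompute the statistic after shuffling chromosome labels among genes."""
    rng = random.Random(seed)
    genes = sorted(chrom_of)
    labels = [chrom_of[g] for g in genes]
    null = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        shuffled = dict(zip(genes, labels))
        null.append(same_chromosome_fraction(pairs, shuffled))
    return null


# Hypothetical toy map: 40 genes on 4 chromosomes, paired with a neighbor.
chrom_of = {f"g{i}": i // 10 for i in range(40)}
pairs = [(f"g{i}", f"g{i + 1}") for i in range(0, 40, 2)]

observed = same_chromosome_fraction(pairs, chrom_of)
null = permutation_null(pairs, chrom_of)
# Empirical p-value with the standard +1 correction.
p_value = (1 + sum(x >= observed for x in null)) / (1 + len(null))
```

The same scaffold applies to the paralogon counts: only the statistic changes, from a same-chromosome fraction to the number of max-gap clusters.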
The max-gap clusters have a significant amount of overlap among their footprints. For example, the footprints of the max-gap = 600, mp = 7 clusters had a total length of 72,072 markers, but a net footprint after redundancy removal of 49,545 markers. A genomic region a which has conserved synteny with two other regions of the genome (b and c) could arise either from mixing of adjacent regions (through local genome rearrangements) with homology to b and c respectively, or by successive duplications. In the latter case b and c are also homologous. We examined the relationships among the paralogons for evidence of successive rounds of duplication. We considered a graph in which nodes correspond to merged, non-redundant paralogon footprint regions. Nodes are connected with edges if a max-gap cluster connects the two nodes. The average clustering coefficient of this graph is equal to the probability that footprints a and c share a max-gap cluster, given that there are edges (a, b) and (b, c) in the graph. We compared the clustering coefficients to those found in random Erdős–Rényi graphs with the same number of nodes and edge probability as the observed graph. We found that the observed data shows significantly more clustering than these random graphs for a wide range of choices of max-gap and mp. For example, for max-gap = 600, mp = 7, the average clustering coefficient is 0.19, while 10,000 random graphs had coefficients of 0.034 ± 0.042, p = 0.0039.

Age distribution of paralogous genes

Because WGD events create many paralogs at the same time, they leave characteristic peaks in the age distribution of paralogous genes. In L. polyphemus the distribution shows peaks centered at 0.71 and 1.34 substitutions per synonymous site (Ks), values within the approximately linear response range of Ks estimates to WGD age [37] (Figure 7). For comparison, the synonymous site divergence between an Asian horseshoe crab species, Tachypleus tridentatus, and L. polyphemus has a mode of 0.35. The common ancestor of these species has been estimated to have lived 114–154 million years ago (MYA), coincident with the opening of the Atlantic Ocean [38], suggesting a WGD event 230–310 MYA, and possibly an older one 450–600 MYA.

Discussion

Our results demonstrate that a low cost, combined approach to whole genome sequencing and genetic mapping can be used to efficiently create a very high density genetic recombination map for a non-model organism with a large genome. Because the approach uses genome-wide sequencing, a large number of sequence markers can be anchored to the map, allowing comparisons of genome organization at the chromosome scale over very large evolutionary divergences. The identification of chromosomal segments with significant gene composition homology to each of the chordate ALGs demonstrates that the predominance of fusion and mixing of ancestral linkage groups previously observed in analyzed ecdysozoan genomes [10] is not ancestral to, or universal in, the clade.

The map allows quantitative characterization of other features of chromosome-scale organization, such as the correlation between local recombination rate and polymorphism levels. Similar positive correlations between local recombination rate and polymorphism level have been observed in other metazoans including humans

PAGE 9

[39-41] and plants [42,43]. Future comparisons with more closely related chelicerates will allow tests to distinguish whether these rates are positively correlated with interspecific divergence, consistent with a neutral process of correlated mutation and recombination rates [35]. Alternatively, the association could be explained by hitchhiking and background selection [44].

The enrichment of inter-chromosomal paralog pairs in segments of the same ALG origin is consistent with their creation by duplication (rather than fission), although because small-scale duplication is biased toward local (tandem) duplication, fission of segments could also leave behind an enrichment of paralogs. Such a mechanism, however, would not create the observed organization of paralogs, that is, their clustering into “paralogons”. The fact that these paralogons span a large portion of the map (59%) suggests that it was a whole genome duplication, rather than segmental duplications, that gave rise to the pattern.

The existence of duplicated hox and parahox clusters on four different chromosomes is highly suggestive of multiple whole genome duplication. Hox clusters have not been found in duplicate copies except in vertebrates where they have been created by whole genome duplication, and have only rarely been subsequently lost. The double-peaked shape of the distribution of synonymous site divergence between pairs of paralogs, combined with the existence of two small clusters of HOX genes in addition to the two complete HOX clusters, suggests that there may have been two rounds of whole genome duplication in the horseshoe crab lineage.
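The ages attached to the two Ks peaks in the Results can be approximately reproduced by scaling against the Limulus–Tachypleus divergence. The sketch below assumes a strictly linear relation between Ks and time, which the text describes only as approximately linear over this range, so the output should be read as a rough consistency check rather than a dating method.

```python
KS_SPECIES_SPLIT = 0.35        # modal Ks, L. polyphemus vs. T. tridentatus
SPLIT_AGE_MYA = (114, 154)     # estimated age range of their common ancestor
KS_PEAKS = (0.71, 1.34)        # paralog Ks peaks in L. polyphemus


def scale_age(ks_peak):
    """Scale the species-split age range linearly by the ratio of Ks values."""
    ratio = ks_peak / KS_SPECIES_SPLIT
    return tuple(ratio * age for age in SPLIT_AGE_MYA)


younger_wgd = scale_age(KS_PEAKS[0])
older_wgd = scale_age(KS_PEAKS[1])

for label, (low, high) in (("younger", younger_wgd), ("older", older_wgd)):
    print(f"{label} peak: roughly {low:.0f}-{high:.0f} MYA")
```

Linear scaling gives ranges close to the 230–310 MYA and 450–600 MYA quoted in the text; the small differences for the older peak presumably reflect rounding or saturation corrections not described in this section.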
WGDs preceded major species radiations in vertebrates, angiosperms and teleost fish, and the importance of their role in evolution is the subject of long-running debate [1-4]. The discovery of whole genome duplication in an invertebrate, and during horseshoe crabs’ long and famously conservative evolutionary history, suggests that such events may have been more common than previously assumed in metazoan evolution, and that while they may have provided raw material for adaptive evolution in some cases, they are not evolutionary drivers.

Methods

Joint assembly and mapping (JAM) overview

Barcoded genomic DNA libraries were created, pooled, and sequenced in four lanes on the Illumina HiSeq 2000 platform for a mating pair of L. polyphemus and 34 offspring.

Figure 7 Distribution of estimated non-synonymous (Ka; top) and synonymous (Ks; bottom) sequence divergence rates for pairs of putative L. polyphemus paralogs (left) and L. polyphemus – T. tridentatus orthologs (right).

PAGE 10

The JAM method proceeds through three major phases:

1. The frequencies of DNA sub-sequences of fixed length k (k-mers) are profiled to characterize the quality, uniqueness, polymorphisms and repetition in genomic reads, using software we developed building on work from the Atlas assembler [45]. Allelic pairs of k-mers representing alternate forms of SNPs are identified and tracked through the subsequent steps.
2. Contigs are assembled on a graph of unique k-mers and paired SNP k-mers sampled to reduce memory usage, then ordered and oriented using the Bambus scaffolder [26,46]. Each multi-SNP scaffold is treated as a single marker for the linkage mapping steps.
3. The paired SNP k-mers in each scaffold are combined with the read, mate-pair, and parent- or offspring-library associations of their alleles for haplotype phasing and construction of a high density genetic linkage map.

The software is publicly available as open source software at GitHub [47].

Sampling and sequencing

The parental horseshoe crabs and their eggs were collected from their natural habitat on the beach at Seahorse Key, an island along the west coast of north Florida, on 27 March 2010. This naturally spawning pair was observed as the eggs were being laid and fertilized (fertilization is external in this species, i.e., the eggs are fertilized in the sand under the female as the eggs are being laid). Tissue samples were collected from the third walking legs of this parental pair. We marked where the eggs were laid, returned a few hours later, dug up the nest and removed the fertilized eggs. We have also conducted paternity analyses that show that fertilization is by the associated male and not by extraneous sperm that might be at the nesting site (in this case the density of nesting females was low on this day, so we know that the eggs we collected were from the pair we observed). Trilobite larvae were reared in plastic dishes as previously described and hatched from the eggs 4 weeks later [48]. Tissue samples and larvae were preserved in RNAlater. Genomic DNA purification and library construction were carried out using Qiagen DNeasy, Illumina TruSeq and Nextera kits, following manufacturers’ protocols. Barcoded samples were pooled and sequenced on the Illumina HiSeq 2000 platform.
Limulus larvae were processed as follows: each larva, suspended in 100 μL of RNAlater and stored at −80°C in a 1.5 mL Eppendorf tube, was thawed on ice, after which the RNAlater was removed. DNA was extracted using the Qiagen DNeasy kit per manufacturer's protocols. DNA was quantified using the PicoGreen DNA quantitation kit. To prepare TruSeq libraries, DNA was first purified another time using Zymo genomic DNA clean columns per manufacturer's protocols. Adult L. polyphemus DNA was prepared as above, but using claw tissue rather than whole larvae. All DNA extracts were tested by gel electrophoresis to ensure DNA was not degraded. TruSeq libraries were prepared at the University of Georgia's Georgia Genomics Facility. 1-5 μg of sample DNA was subjected to fragmentation using a Covaris sonicator. Fragmented DNA was then used for library construction using Illumina TruSeq library prep kits. Libraries were pooled together in equimolar amounts (for 10 larvae) and used for the first sequencing run in separate lanes for the parental and larval pools. For larval samples 11-34, library prep was switched from TruSeq to Nextera kits. Nextera library preparation was performed according to manufacturer's protocol. The Nextera library product was quantified by PicoGreen, and fragment size distribution was checked using a Lonza FlashGel, to ensure that fragment size distribution was between 300 and 1,000 bp. Sample libraries were pooled in equimolar concentrations and sent for the second sequencing run in two lanes, each on a pool of 12 larvae. Both sequencing runs, comprising four library pools, were performed on the Illumina HiSeq 2000 platform at the Medical College of Wisconsin Sequencing Service Core Facility.
A total of 1.7 billion Illumina reads qualified for k-mer analysis and assembly by containing at least 23 consecutive q20 bases. The maternal library accounted for 13% of these reads, the paternal library 7.4%, and the 34 offspring libraries for 2.4% on average, 0.64% at minimum.

k-mer decomposition

We determined a lower bound on the k-mer size long enough for a given expectation of uniqueness in a random genome. While increasing k reduced the rate of coincidentally repeated k-mers, it also reduced the effective depth of coverage due to untrimmed errors and edge effects at read ends, and increased the cases of multiple SNPs per k-mer locus, which are not tracked in our current software implementation. We can approximately model a genome-scale string G of random nucleotides as G samples taken with uniform probability from the space of all k-mers (of size $4^k/2$ for odd k, treating reverse-complements as the same; slightly more for even k). The Poisson distribution then gives the probability that a location in G has its own, unique k-mer (shared with no other location) as $e^{-\lambda}$, where $\lambda = G/(4^k/2) = 2G/4^k$. The probability of a location sharing its k-mer is then $1 - e^{-\lambda}$; thus, to limit the maximum rate R of G-locations sharing k-mers, we require $k \ge \lceil \log_4(-2G/\ln(1-R)) \rceil$. For example, for a mammalian-scale genome of approximately 3 billion bases, and R = 0.1%, we chose $k \ge 22$. For Limulus polyphemus, the Animal Genome Size Database [25] reports an estimated haploid genome size of 2.80 pg and, as each picogram represents almost a billion nucleotide base-pairs of DNA, the mammalian-scale choice of k applies [19,49].

This lower bound ignores chemical and biological sequence biases, so selecting k for a real genome project requires attention to error rates, repeats (tandem and interspersed), and genome size, all known vaguely, if at all, before sequencing. Studying the k-mer distributions after sequencing can clarify these genomic properties as we select k to maximize the net yield of candidate k-mer tags, between errors and with at most one SNP location, in sequencing reads.

We converted Illumina/Solexa FASTQ format (paying attention to the different quality encodings of the software versions) into FASTA format [50], masking (replacing with 'N') any base with Phred-scale [51] quality below 20, and soft-masking (representing in lower case) other bases with quality below 30. For initial trimming experiments, we varied these quality thresholds as indicated below. We stored k-mers in hash tables with open addressing [52], supporting odd k <= 31. We tallied for each k-mer a bit vector for presence or absence in up to 64 sample libraries (36 for Limulus parents and offspring), and an overall count of occurrences in all libraries (count limited to 64 − 2k bits). Where the k-mer hash would be too large for available memories, we sampled the k-mers using a hash-slicing factor S (must be prime). Representing each k-mer as an integer, slice s consists of those k-mers whose remainder on division by S is s. We can tabulate one slice for a representative sample of 1/S of the k-mers (for initial estimation of depth of coverage and genome size) or, using S independent jobs, collect information for k-mers in all slices. Our hash tables stored odd-length k-mers so that reverse-complementary sequences can be combined without the ambiguous orientation of palindromic sequences (e.g., ATCGAT).

After selecting k as described above and making a full tabulation of k-mer counts and bit vectors, we filtered out k-mers not expected to represent genomic sequence.
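The lower bound on k and the hash-slicing rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the released JAM software: the function names (min_kmer_size, canonical_code, in_slice) are hypothetical, and the base-4 digit assignment A=0, C=1, G=2, T=3 is an assumption made for the example.

```python
import math

# Assumed 2-bit encoding; the paper only states that k-mers are base-4 integers.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}


def min_kmer_size(genome_size: float, max_shared_rate: float) -> int:
    """Smallest k satisfying k >= log4(-2G / ln(1 - R))."""
    bound = math.log(-2.0 * genome_size / math.log(1.0 - max_shared_rate), 4)
    return math.ceil(bound)


def canonical_code(kmer: str) -> int:
    """Base-4 integer of a k-mer, taking the smaller value of the two strands."""
    def encode(seq: str) -> int:
        value = 0
        for base in seq:
            value = value * 4 + CODE[base]
        return value

    revcomp = "".join(COMP[base] for base in reversed(kmer))
    return min(encode(kmer), encode(revcomp))


def in_slice(kmer: str, s: int, S: int) -> bool:
    """True if the k-mer falls in slice s under hash-slicing factor S."""
    return canonical_code(kmer) % S == s
```

With G = 3 billion and R = 0.1%, min_kmer_size reproduces the bound of 22 quoted in the text. Taking the minimum over both strands is one simple way to treat reverse complements as the same k-mer; the original software may canonicalize differently.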
k-mers were required to have three copies in the total sequence set, with at least one copy in the initial run and one in the second run. This was partly to filter out incomplete adapter sequences, which can be difficult to trim, but were different in the two runs.

Extending methods developed for the Atlas assembler [45] to heterozygous sequences, Figure 1 gives a rough decomposition of the k-mer frequency distribution for 23-mers with quality ≥ 20, minimizing the square of the residuals of k-mer counts on frequencies 3 through 70 while not exceeding the observed counts. Four linked distributions model fractions of the genome as monoallelic or biallelic: homozygous regions with d = 38.9-fold coverage (dark blue), minor alleles covered at d/4 (green), tied alleles at d/2 (red) and major alleles at 3d/4 (purple). This fit is robust enough to confirm the abundance of major-minor allele pairs (27% of k-loci, vs. 10% for tied alleles), with the broader peaks in the data than in the fitted curves consistent with less uniform sampling (for example, varying coverage of parents and offspring). The Poisson decomposition suggests a density of polymorphisms of 1.2% in major-minor allele pairs, based on dividing the modeled number of such sequenced pairs by k (assuming most polymorphisms are SNPs spaced at least k bases apart), by d (the estimated depth of sequencing) and by the estimated genome size of 2.74 billion bases.

SNPmer identification

The filtered k-mer counts, computed in parallel, are loaded into a hash table with additional fields to track k-mers that are uniquely within one mismatch of each other. Because this step analyzes all (non-error) k-mers in one table, it requires a single large-memory processor (on the order of 32 GiB).
For each k-mer, we check all its 3k one-substitution neighbors. The k-mers are each partitioned into one of three categories: unique: having no edit-neighbors within one substitution; ambiguous: having either multiple one-substitution neighbors, or one neighbor that has multiple neighbors; or partnered: uniquely pairable with exactly one other k-mer differing by one substitution, such k-mers being known from now on as SNPmers or SNPmer pairs. For each SNPmer, we save the position of the substitution, a bit mask for the change (transition, complement, or non-complement transversion), and whether the canonical form of the partner in the table has the same sense or is reverse complemented with respect to this k-mer.

Only partnered and unique k-mers will be further tracked. While this limited method cannot identify k-mers for genomic SNP and non-SNP locations with complete confidence, false pairing or missed pairing should have limited effects, as confirmed by assembly experiments with simulated Ciona sequence (see Error model calibration in Methods). False pairing, due to coincidental similarities or repeats, would combine nodes of the k-mer graph (see below) and cause noise in the scaffolding, haplotype phasing, and linkage analysis. Such misleading links are minimized by the robust edge requirements in contigging and scaffolding, described below. Missed pairing can happen from indel polymorphisms, SNPs separated by fewer than k − 1 positions, failure to sequence minor alleles, or ambiguity due to too many similar k-mers. Ambiguously non-unique k-mers will be skipped over (reducing connectivity of the k-mer graph if there are too many in a row). Where allelic k-mers misidentified as unique cause conflicting edges in the k-mer graph, nodes for unpartnered major alleles will either be chained into contigs with flanking unique sequence or left as orphaned fragments, and unpartnered minor alleles will be left as orphaned fragments. Overall, errors in identifying partnered and unique k-mers should shorten contigs and scaffolds and hide linkage, not promote false linkages.

Table 2 shows the totals and percentages of the different k-mer categories, counting each SNPmer pair as one k-mer. SNPmer pairs account for 16.3% of the putative genomically unique 23-mer markers; dividing by 23 gives us the fraction of bases in those markers that are putative SNPs: 0.71%.

Node k-mer selection

To reduce the memory requirements of our k-mer assembly graph, we select an approximately one-tenth subset of the SNPmer and unique k-mer tags.

In the case of a true SNP at least k − 1 bases from other SNPs and from gaps in error-free coverage of either allele, there will be k covering SNPmer pairs (provided that covering k-mers are also uniquely pairable). By taking only SNPmer pairs with the substitution in particular positions, we can reduce the size of the graph and its redundancy. Analyzing the distribution of substituted position for all the SNPmer pairs, we observe an enrichment for substitutions near the ends, probably due to proximity to low-quality sequence. By selecting positions 3, 12 and 21 of 23-base SNPmers, we avoid the most problematic positions and reduce this portion of k-mer nodes by a factor of 7.67 (23/3).

Unlike for SNPmers, there are no canonical positions that identify the unique, unpaired k-mers. Several mechanisms have been proposed for sampling k-mers in a representative way [53,54]. We use the more pseudo-random hash-slicing rule, already discussed above, to sample a single slice of k-mers: those whose integer encodings are congruent to a particular slice number s, modulo S (the hash slicing factor). We have found that on the finished human genome (results not shown), hash slicing is effectively a Poisson sampling, with sampled k-mers spaced according to an exponential distribution.

A caveat in applying hash slicing is that taking the remainder modulo a prime is not very pseudo-random for Mersenne primes (equal to $2^p - 1$ for some p), when k-mers are represented in base-4 encoding [52]. We therefore pick a slicing factor of 11, the smallest non-Mersenne prime greater than our SNPmer sampling factor.
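The three-way partition of k-mers into unique, ambiguous and partnered categories can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the names neighbors and classify are hypothetical, k-mers are held in an ordinary Python set rather than an open-addressing hash table, and reverse-complement canonicalization is ignored for brevity.

```python
from typing import Dict, List, Set

BASES = "ACGT"


def neighbors(kmer: str, table: Set[str]) -> List[str]:
    """K-mers in the table differing from `kmer` by exactly one substitution."""
    found = []
    for pos, base in enumerate(kmer):
        for alt in BASES:
            if alt != base:
                candidate = kmer[:pos] + alt + kmer[pos + 1:]
                if candidate in table:
                    found.append(candidate)
    return found


def classify(table: Set[str]) -> Dict[str, str]:
    """Label every k-mer as 'unique', 'partnered' (SNPmer) or 'ambiguous'."""
    nbrs = {kmer: neighbors(kmer, table) for kmer in table}
    labels = {}
    for kmer, hits in nbrs.items():
        if not hits:
            labels[kmer] = "unique"
        elif len(hits) == 1 and len(nbrs[hits[0]]) == 1:
            # Exactly one neighbor, which in turn has only this k-mer.
            labels[kmer] = "partnered"
        else:
            # Several neighbors, or one neighbor that itself has several.
            labels[kmer] = "ambiguous"
    return labels
```

Each k-mer of length k has 3k candidate neighbors, so the lookup cost per k-mer is linear in k, which matches the 3k neighbor checks mentioned in the text.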
The resulting k-mer subset has 86.0 million unique unpaired k-mers and 24.0 million SNPmer pairs, each reduced as predicted, for a total factor of 9.8 reduction in k-mer nodes for the next step.

Contigging and scaffolding

Each 23-mer tag (unique k-mer or SNPmer pair) in the above subset is a node in the k-mer graph. Nodes are connected when the corresponding k-mers appear consecutively in at least one read of the input (any intervening k-mers having been skipped due to sampling or ambiguity). The relative orientation, distance and number of supporting reads of the k-mers are stored in the edge. When conflicting distance or relative orientation is observed among different reads for the same pair of k-mer nodes, all edges from both nodes in the corresponding direction are ignored in contigging.

The nodes of the k-mer graph represent DNA tags and have distinct upstream and downstream ends. One edge at each end of a node is identified as robust if supported by a supermajority of the reads for all edges in that direction: the number of supporting reads is greater than or equal to both (1) two plus the sum of the read counts for all other edges in that direction and (2) twice the read count of the next-most supported edge in the same direction. By this construction, a node has at most one robust edge on each end.

A mutually robust edge is defined as one that is robust going in both directions between the two nodes it connects.

Contigs are the connected components of the subgraph consisting of mutually robust edges. Singleton and circular contigs are reported for diagnostic purposes, but ignored in subsequent analysis. Each retained "k-mer contig" of Table 1 therefore represents a chain of nodes for SNPmers and unique k-mers not shared with other contigs.

After assembly of k-mer contigs, we connect them in longer structures using the Bambus scaffolder [26]. Because the contigs do not contain detailed read information, we map templates (read pairs) to contigs based on shared k-mer content, and divide the resulting graph of contigs linked by templates into batches small enough for Bambus to process. Batches are divided so that no contigs in different batches share templates.
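The robust-edge rule used in contigging can be written down directly. The sketch below is a minimal illustration under assumptions: the function names robust_edge and mutually_robust are hypothetical, and the edges leaving one end of a node are represented as a plain dictionary from neighbor identifier to supporting read count, which is not necessarily how the original software stores them.

```python
from typing import Dict, Hashable, Optional


def robust_edge(read_counts: Dict[Hashable, int]) -> Optional[Hashable]:
    """Return the neighbor reached by the robust edge at one end of a node.

    An edge is robust if its read count is at least (1) two plus the summed
    counts of all other edges in that direction and (2) twice the count of
    the next-most supported edge. Returns None if no edge qualifies.
    """
    if not read_counts:
        return None
    ranked = sorted(read_counts.items(), key=lambda item: item[1], reverse=True)
    best, best_count = ranked[0]
    others = [count for _, count in ranked[1:]]
    runner_up = others[0] if others else 0
    if best_count >= 2 + sum(others) and best_count >= 2 * runner_up:
        return best
    return None


def mutually_robust(a: Hashable, b: Hashable,
                    edges_from_a: Dict[Hashable, int],
                    edges_from_b: Dict[Hashable, int]) -> bool:
    """True if the a-b edge is robust when viewed from both of its nodes."""
    return robust_edge(edges_from_a) == b and robust_edge(edges_from_b) == a
```

For example, an edge with 10 supporting reads against competitors with 3 and 2 reads qualifies, whereas an edge with 5 reads against a competitor with 4 does not. Note that condition (1) also means an edge with a single supporting read is never robust.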
We present templates linking contigs to Bambus using AMOS format [46] for the reads (template ends) mapped to each contig. A read is included only for the contig with which it shares the most k-mers, if the span of those k-mers is ≥ k and the other read-end of the template similarly qualifies in a different contig. Bambus infers links between contigs by matching template identifiers shared by reads in different "linkable contigs", then produces scaffolds as chains of contigs that are linked with consistent order and orientation.

Table 2 K-mer categories, counting a SNPmer pair as one k-mer

k-mer type            # Distinct       Percentage
No partner/unique     946,431,901      55.48%
Partnered/SNPmer      184,756,149      10.83%
Ambiguous             574,557,296      33.68%
TOTAL                 1,705,745,346

Contiguous consensus representations for k-mer contigs and scaffolds were generated in two phases. In the first phase, sequences spanned by selected SNPmers and subset k-mers (see sections above) are joined together, separated by a number of Ns corresponding to the number of bases not spanned by k-mers in the subset. In the second phase, a single pass is made through the read data set, and stretches of Ns that are spanned by single reads are replaced by the sequence of the read.

SNP phase and genotype inference

Each scaffold of the k-mer assembly constitutes a candidate marker for mapping. While the depth of sequence coverage on each member of the mapping panel is too low (~1X) to directly infer the genotype of individual members of the mapping panel at individual SNPs, the tight linkage between SNPs within markers means that learning a sample's genotype at any one reveals it at the others, effectively amplifying the sequence coverage by a factor proportional to the number of SNPs within the marker. This is the same principle exploited in genotyping-by-sequencing (GBS) approaches to genetic mapping in the presence of reference genomes, for example in recombinant inbred lines of reference rice strains [16], and in crosses between Drosophila species with sequenced genomes. Here we genotype offspring in the context of a cross between two outbred individuals, simultaneously inferring the phases of the SNPs (i.e., which base appears on each of the four parental chromosomes in the cross). While the data will be insufficient to infer genotypes at many markers, all those where confident inferences can be made can be used to build the linkage map.

For the purposes of genotype inference, a marker is treated as a collection of m SNPs (indexed in the following by $i \in \{1, 2, \ldots, m\}$) that have been inferred to be closely linked on the genome via the k-mer assembly step. If the four parental chromosomes are labeled a, b in one parent and c, d in the other, then the genotyping problem is to infer which of the four possible segregation states or genotypes ac, ad, bc, bd describes each sample at each marker locus. We index samples with j, and denote a sample genotype by $g_j$.
We assume that markers are very small compared to a chromosome, and ignore the possibility of a recombination event within individual markers. The data used for inference of the offspring genotypes consist of the number of reads from each barcoded sample j showing each of the four possible DNA bases b at each variable SNP position i, which we denote $n^b_{ij}$.

If the phase $\phi_i$ of SNP i were known, i.e. which base is present in each of the four parental chromosomes, then a choice of genotype $g_j$ implies a specific homozygous or heterozygous state $s_{ij} \in S$ = {AA, CC, TT, GG, AC, AT, AG, CT, CG, TG} for SNP i in sample j. For a given phase and genotype, the likelihood function for a given SNP position in a given sample is given by either a binomial (for homozygous states) or trinomial (for heterozygous states) probability mass function of the read counts, base-calling error rate $\epsilon$, and the site genotype $s_{ij}$:

$$L(\phi_i, g_j) = p(n^b_{ij} \mid \phi_i, g_j) = P_m(n^b_{ij} \mid s_{ij}, \epsilon) = \begin{cases} \binom{n}{m} \epsilon^m (1-\epsilon)^{n-m}, & \text{if } s_{ij} \text{ homozygous} \\ \frac{n!}{k!\,l!\,m!} \left(\frac{2\epsilon}{3}\right)^m \left(\frac{1 - 2\epsilon/3}{2}\right)^{k+l}, & \text{if } s_{ij} \text{ heterozygous} \end{cases}$$

where n is the total number of reads at SNP i; m is the number of observations of bases not in $s_{ij}$ (i.e., mismatches); k and l are the counts for each of the two bases of $s_{ij}$ for heterozygous sites.

Likelihood maximization

Searching for an optimal choice of SNP phases $\phi_i$ and sample genotypes $g_j$ is made difficult by the exponential size of the search space: for segregating bi-allelic SNP sites there are 14 possible phases to consider at each SNP site, so even for a mapping panel of only 20 siblings and a marker containing only 10 SNPs, the number of combinations to consider is vast. In simulation tests, we found that a variant of expectation maximization (EM), an iterative likelihood maximization method, can accurately infer a large proportion of marker genotypes.
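The per-site likelihood above can be evaluated with a short Python function. This is a hedged sketch rather than the authors' code: the name site_likelihood and the representation of read counts as a dictionary from base to count are assumptions made for illustration, and a production implementation would more likely work with log likelihoods to avoid underflow.

```python
import math
from typing import Dict


def site_likelihood(counts: Dict[str, int], state: str, eps: float) -> float:
    """P(read counts | site genotype), for a state such as "AA" or "AC".

    counts maps each observed base to its read count at this SNP and sample;
    eps is the base-calling error rate.
    """
    n = sum(counts.values())
    first, second = state[0], state[1]
    if first == second:
        # Homozygous: binomial in the number of mismatching reads m.
        m = n - counts.get(first, 0)
        return math.comb(n, m) * eps**m * (1.0 - eps) ** (n - m)
    # Heterozygous: trinomial over (allele 1, allele 2, mismatches).
    k = counts.get(first, 0)
    l = counts.get(second, 0)
    m = n - k - l
    multinomial = math.factorial(n) // (
        math.factorial(k) * math.factorial(l) * math.factorial(m)
    )
    p_allele = (1.0 - 2.0 * eps / 3.0) / 2.0
    return multinomial * (2.0 * eps / 3.0) ** m * p_allele ** (k + l)
```

In the heterozygous case the three category probabilities (two alleles at $(1 - 2\epsilon/3)/2$ each and mismatches at $2\epsilon/3$) sum to one, so the likelihoods over all possible count triples for a fixed n also sum to one.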
To initialize the iteration, the parental samples and a randomly selected offspring are, without loss of generality, assigned genotypes (a b), (c d) and (a c). At each step, we calculate the conditional probability distributions over the possible SNP phases $p(\phi_i)$ given the genotype assignments according to:

$$p^{(t)}(\phi_i) \equiv p(\phi_i \mid g^{(t)}) = \frac{\prod_{\sigma \in S} p(n^b_{i\sigma} \mid \phi_i, g^{(t)})}{\sum_k \prod_{\sigma \in S} p(n^b_{k\sigma} \mid \phi_k, g^{(t)})}$$

where we have labeled the chosen values for the genotypes at iteration t collectively by $g^{(t)}$, and $n^b_{i\sigma}$ is the combined total number of observations of base b at polymorphic SNP i for all samples included at iteration step t which have genotype $s(\phi_i, g_j) = \sigma$. On each iteration until all samples have been included, a randomly selected sample is added to the set after calculating $p^{(t)}(\phi_i)$. Then the next set of genotype assignments $g^{(t+1)}$ is determined by choosing those that maximize the expected value of the log likelihood:

$$E_{\phi \mid n, g_j}\left[\log L(g_j; n, \phi)\right] = \sum_i p(\phi_i) \log p(n^b_{ij} \mid g_j, \phi_i)$$

These steps are repeated until genotypes are being selected for all samples, and the expected log likelihood stops increasing. At the end of the iteration, the likelihood-maximizing genotypes are reported, along with the log likelihood difference between the best and second best choice of genotype for each sample, which provides an indicator of the confidence in the genotype call. To gauge convergence, this procedure is repeated 5 times for each marker, with different random choices of initial conditions. Markers which do not identify the same ML genotype multiple times in independent runs are not included among the high confidence genotype calls.

Map bins

Unique marker segregation patterns were included in the set of map bins if they met one of two criteria: (1) at least three independent markers were inferred to have the pattern independently, or (2) the pattern was inferred from at least one marker with at least 20 SNPs such that the mean of the estimated probabilities of the inferred SNP phases was greater than 0.9.

Error model calibration

The sequences of 14 Ciona intestinalis autosomes were downloaded from Ensembl [55]. These 14 chromosomes were used as the template in our genome simulation. Based on their sequence length, we used the Markovian coalescent simulator macs [56] to generate four haploid samples drawn from a population under the neutral Wright-Fisher model with population mutation rate of 0.012 and population recombination rate of 0.0085. Using the C. intestinalis genome as the reference sequence, two diploid parental genomes were constructed based on the macs output with realistic SNP and indel models inferred by several previous studies on the Ciona genome [57-59]. We wrote a Perl script to simulate the genomes of offspring generated by the cross of the two simulated parents. The software package dwgsim [60] was used to generate Illumina paired-end reads based on our simulated genomes of both parents and offspring, with coverage of 20X and 5X respectively.
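The two map-bin inclusion criteria can be expressed as a simple filter over genotyped markers. The sketch below is illustrative only: the Marker record, its field names and the function select_map_bins are hypothetical, and a segregation pattern is represented as a tuple with one genotype call (for example "ac") per sample.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]  # one genotype call per sample, e.g. ("ac", "bd", ...)


@dataclass
class Marker:
    pattern: Pattern          # inferred segregation pattern across samples
    n_snps: int               # number of SNPs in the marker
    mean_phase_prob: float    # mean estimated probability of inferred SNP phases


def select_map_bins(markers: List[Marker]) -> List[Pattern]:
    """Keep patterns seen in >= 3 markers, or in one strongly supported marker."""
    by_pattern: Dict[Pattern, List[Marker]] = {}
    for marker in markers:
        by_pattern.setdefault(marker.pattern, []).append(marker)
    bins = []
    for pattern, group in by_pattern.items():
        many_markers = len(group) >= 3
        strong_marker = any(
            mk.n_snps >= 20 and mk.mean_phase_prob > 0.9 for mk in group
        )
        if many_markers or strong_marker:
            bins.append(pattern)
    return bins
```

Grouping by the exact pattern tuple assumes that markers sharing a map bin have identical genotype calls in every sample; how the original pipeline treats patterns that differ only by chromosome relabeling is not covered by this sketch.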
To estimate the frequency of incorrect genotype calls as a function of the log likelihood difference between the called and alternative genotype, including contributions from uncertainty in SNPmer identification, assembly, and sampling noise, we carried out a simulation of the k-mer assembly and genotype inference protocols. Among high-confidence genotype calls, the observed error frequency as a function of call confidence score was well fit by a sum of two stretched exponential functions, allowing assignment of error probabilities to individual genotype calls.

Linkage group construction

We use the linkage p-value $p_{ab}$ between pairs of map bins a and b, defined as the minimum over the four possible relabelings r of the maternal and paternal chromosomes of the binomial p-value for the number of matching genotypes:

$$p_{ab} = \min_r \left[ 1 - \sum_{i=0}^{m_r - 1} \binom{n}{i} \left(\frac{1}{2}\right)^n \right]$$

where n is the total number of sample genotype calls (68 in the present case, or 34 in each parent) and $m_r$ is the number of matching genotypes under relabeling r.

We identified map bins with segregation patterns indicating either inconsistent placement in the maternal and paternal maps or genotyping error with a double threshold procedure as follows:

1. Map bins were partitioned into linkage groups by single linkage clustering at a threshold of $p_{ab} < p_1$.
2. Within each partition, we identified map bins which formed articulation points (i.e., nodes which, if removed, would cause the linkage group to fall apart into two disconnected subgraphs) in the graph of $p_{ab} < p_2$, where $p_2 > p_1$.
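The linkage p-value between two map bins can be computed as in the following Python sketch. It is a minimal illustration under assumptions not taken from the paper: the names binomial_upper_tail and linkage_p_value are hypothetical, each offspring call is encoded as a pair of bits (which maternal and which paternal chromosome was inherited), and missing genotype calls are not handled. The upper tail is summed directly, which is algebraically the same as one minus the lower-tail sum in the formula above but avoids floating-point cancellation for very small p-values.

```python
from math import comb
from typing import List, Tuple

# One call per offspring: (maternal haplotype, paternal haplotype), each 0 or 1.
Calls = List[Tuple[int, int]]


def binomial_upper_tail(n: int, m: int) -> float:
    """P(X >= m) for X ~ Binomial(n, 1/2), i.e. 1 - sum_{i<m} C(n,i) / 2^n."""
    return sum(comb(n, i) for i in range(m, n + 1)) / 2**n


def linkage_p_value(bin_a: Calls, bin_b: Calls) -> float:
    """Minimum binomial p-value over the four chromosome relabelings."""
    n = 2 * len(bin_a)  # one maternal and one paternal call per offspring
    p_values = []
    for flip_mat in (0, 1):
        for flip_pat in (0, 1):
            matches = sum(
                (ma == (mb ^ flip_mat)) + (pa == (pb ^ flip_pat))
                for (ma, pa), (mb, pb) in zip(bin_a, bin_b)
            )
            p_values.append(binomial_upper_tail(n, matches))
    return min(p_values)
```

With 34 offspring, two bins with identical segregation patterns (up to relabeling) give all 68 calls matching and therefore the smallest possible p-value, $2^{-68}$.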
This procedure identifies map bins which alone account for the merging of what would otherwise be two distinct partitions. We used the following pairs of thresholds p1, p2 to identify a total of 20 map bins for exclusion from the map: 10, 10; 10, 10; 10, 10. The remaining markers form locally consistent linkage groups in which all linkages defined at threshold p1 are corroborated by multiple linkages at p2, for the above values of p1 and p2.

Marker ordering

Markers were ordered within each linkage group using the following protocol. Within each linkage group a consistent labeling of the four parental chromosomes was achieved by constructing a graph G in which nodes correspond to map bins and edges are weighted by linkage p-value $p_{ab}$ (as defined above). The local chromosome labels are updated at each map bin as it is reached in a traversal of the minimum spanning tree of G to the labeling r that maximizes $p_{ab}$ along the incident edge of G used in the traversal. Markers within each linkage group were clustered by hierarchical clustering (marker-marker distance metric: cosine of the angle between the vectors of recombination distances to the other map bins; distance updating method: average linkage) into a binary tree data structure with leaves representing map bins. A node in the right subtree of the root node was rotated, interchanging its left and right subtrees, if its left subtree was not already closer (in average recombination distance) to the markers of the left subtree of the global root; and similarly for nodes in the left subtree of the root. An in-order traversal of the tree generates an ordering of the markers. Finally, three reversals of the order of markers in segments of the map were added based on visual inspection of the recombination distance matrix. In the final marker ordering, 51% of adjacent map bin pairs are separated by a single recombination event in the cross, and 94% are separated by three or fewer recombinants in each parent.

Figure 8 Unrooted phylogenetic tree of homeobox sequences (part 1). Nodes are labeled with Bayesian posterior probabilities. Highly supported partitions used to classify L. polyphemus sequences are drawn in red, with the abbreviation for the class shown in large letters. L. polyphemus homeobox sequences not grouped into one of these highly supported partitions are assigned to class "?". For ease of display, a large subtree consisting of HOX and parahox genes has been pruned at the position labeled "HOX", and is shown in Figure 9.

Placement of markers on the map

To anchor additional markers to the map, we computed the $p_{ab}$ (see above) between marker a to be placed on the map and each map bin b. Marker a is anchored to the map at the position of the bin b which minimizes $p_{ab}$ if $p_{ab} < 10^{-6}$.

SNP density estimation

Illumina reads were mapped to the assembled scaffold sequences with stampy [61] using default settings. For a sample of 9,228 scaffolds with lengths ranging from 5.0 to 5.5 kb, sequence variants were called with SAMtools [62] using a variant quality score threshold of 50, and ignoring indel positions.

A SNP density of 0.76% in four haplotypes corresponds to a predicted rate of pairwise sequence differences per site of 0.0042 under the finite sites model of mutation and the neutral coalescent model of the relationships among sampled alleles [63].

Estimation of local recombination rate

To estimate the local recombination rate for each map bin, we computed the linear regression of map distance in number of markers on physical distance using up to 10 neighboring map bins in each direction along the map (or fewer for bins within 10 map bins of the end of the linkage group). Map distance was calculated from recombination fraction r using Haldane's map distance $-\frac{1}{2}\log(1 - 2r)$ [64].

Ancestral linkage group conservation

To compare the genome organization in L. polyphemus to the ancestral metazoan ALGs, we used the reciprocal best blast hit (RBH) orthology criterion in an alignment of the Ixodes scapularis predicted proteins [36] to the consensus sequences for the marker scaffolds.
L. polyphemus scaffolds with an RBH passing the e-value cutoff were assigned to the same ancestral bilaterian gene orthology group as their I. scapularis ortholog, and thereby with human genes. Regions of the map were tested for enrichment in genes from particular ancestral linkage groups with Fisher's Exact Test, and breakpoints in ancestral linkage group composition were identified using a hidden Markov model, as previously described [6,7].

Figure 9 Phylogenetic tree of homeobox sequences, part 2. The rooted subtree pruned from the tree in Figure 8. Nodes are labeled with Bayesian posterior probabilities. Highly supported partitions used to classify L. polyphemus sequences are drawn in red, with the abbreviation for the class shown in large letters. L. polyphemus homeobox sequences not grouped into one of these highly supported partitions are assigned to class "hox?".

Table 3 Mixture model fits to Ks distribution

N   k    ln(L)      BIC       AIC      Mixture components
1   2    −273.93    559.74    551.87   1.32 ± 0.50
2   5    −259.32    548.31*   528.64   0.70 ± 0.14; 1.45 ± 0.45
3   8    −253.36    554.19    522.71   0.71 ± 0.17; 1.34 ± 0.29; 2.09 ± 0.19
4   11   −251.41    568.11    524.82   0.74 ± 0.18; 1.34 ± 0.20; 1.70 ± 0.04; 2.02 ± 0.22

N is the number of mixture components, k the number of model parameters, ln(L) the log likelihood of the data under the best fit model. BIC, Bayesian information criterion; AIC, Akaike information criterion. The BIC score of our selected model is marked with an asterisk.

Homeobox gene modeling

We identified 155 marker scaffolds with a tblastx alignment passing the e-value cutoff to a set of chelicerate homeobox gene sequences downloaded from GenBank using the NCBI online query interface (GenBank accessions AF071402.1, AF071403.1, AF071405.1, AF071406.1, AF071407.1, AF085352.1, AF151986.1, AF151987.1, AF151988.1, AF151989.1, AF151990.1, AF151991.1, AF151992.1, AF151993.1, AF151994.1, AF151995.1, AF151996.1, AF151997.1, AF151998.1, AF151999.1, AF152000.1, AF237818.1, AJ005643.1, AJ007431.1, AJ007432.1, AJ007433.1, AJ007434.1, AJ007435.1, AJ007436.1, AJ007437.1, AM419029.1, AM419030.1, AM419031.1, AM419032.1, DQ315728.1, DQ315729.1, DQ315730.1, DQ315731.1, DQ315732.1, DQ315733.1, DQ315734.1, DQ315735.1, DQ315736.1, DQ315737.1, DQ315738.1, DQ315739.1, DQ315740.1, DQ315741.1, DQ315742.1, DQ315743.1, DQ315744.1, EU870887.1, EU870888.1, EU870889.1, HE608680.1, HE608681.1, HE608682.1, HE805493.1, HE805494.1, HE805495.1, HE805496.1, HE805497.1, HE805498.1, HE805499.1, HE805500.1, HE805501.1, HE805502.1, S70005.1, S70006.1, S70008.1, and S70010.1). The reads of each marker (those with best stampy [61] alignment to the scaffold) were reassembled with PHRAP [65], with default parameters. The resulting contigs were aligned to a collection of homeobox-containing protein sequences (GenBank accessions NP_001034497.1, NP_001034510.1, AAL71874.1, NP_001034505.1, NP_001036813.1, CAA66399.1, NP_001107762.1, NP_001107807.1, EEZ99256.1, NP_001034519.1, NP_476954.1, NP_032840.1, NP_031699.2, AAI37770.1, EEN68949.1, NP_523700.2, NP_001034511.2, AAK16421.1, and AAK16422.1) with exonerate [66] in protein-to-genome mode. For each contig, the amino acid sequence predicted by the highest-scoring exonerate alignment was used in subsequent phylogenetic analysis, resulting in 104 putative homeobox-containing markers ranging in length from 18 to 147 amino acids.

Phylogenetic analysis of homeobox genes

A multiple sequence alignment of the predicted homeobox sequences combined with a collection of representative sequences from various classes of homeobox genes was constructed with muscle v3.8.31 [67] using default settings. The resulting alignment was trimmed to a 63 amino acid segment spanning the conserved homeodomain, and sequences with more than 50% gaps were removed, leaving 93 predicted L. polyphemus homeobox genes in the analysis. Bayesian phylogenetic analysis was carried out on the resulting 178 taxon, 63 amino acid character matrix (see Additional file 1) using MrBayes v3.2.1 [68] with a mixed model of amino acid substitutions, gamma-distributed rate variation among sites with a fixed shape parameter of 1.0, alignment gaps treated as missing data, 2,000,000 Monte Carlo steps, two independent runs with four Monte Carlo chains, and the initial 25% of sampled trees discarded as "burn-in". The Monte Carlo chains appeared to reach convergence, with an average standard deviation of the split frequencies of 0.022. The majority-rule consensus of the sampled trees is shown in Figures 8 and 9, and well-supported gene clades (posterior probability greater than 0.95) were used to group the predicted L. polyphemus genes into classes. The table in Additional file 1 lists the reassembled marker contigs, their inferred hox gene class, and maximum likelihood map positions. Predicted genes were anchored to the map as described above.

Genomic distribution of paralogs

We identified 2,716 pairs of Limulus markers that can both be placed on the map and have their best translated alignment to the same Ixodes scapularis gene. (I. scapularis genes with more than five best-hit markers were excluded from seeding such pairs.) To estimate the synonymous sequence divergence between pairs of candidate L. polyphemus paralogous genes, and between L. polyphemus genes and their T. tridentatus orthologs, we constructed codon alignments of predicted coding sequence. Conserved clusters of paralogs were identified using a variant of the "max-gap" criterion [3] in which two genes are placed in the same cluster if they and their paralogs lie within threshold distance.

Figure 10 Two component mixture model fit to the Ks peak on the range 0 <= Ks <= 2.5. The best-fitting model was selected by the Bayesian Information Criterion (Table 3). The component means are 0.7 and 1.45 substitutions per site. The position of the peak at lowest Ks was not sensitive to the addition of more mixture components.

Ka and Ks estimation for paralogs and T. tridentatus orthologs

Figure 6 shows the distribution across the map of pairs of candidate paralogs. To estimate the synonymous sequence divergence between pairs of candidate L. polyphemus paralogous genes, and between L. polyphemus genes and their T. tridentatus orthologs, we used the following protocol.

1. Reassemble reads from each marker with PHRAP [65], and create a predicted coding sequence using exonerate, as described for the annotation of homeobox gene models (see above).
2. Combine the exonerate alignments of codons to amino acids to create an alignment of codons for either a pair of L. polyphemus sequences, or for an L. polyphemus-T. tridentatus sequence pair.
3. Use the method of Yang and Nielsen [69] to estimate the non-synonymous and synonymous substitution rates Ka and Ks, as implemented in the KaKs Calculator package [70].
4. Discard estimates based on fewer than 30 sites (30 synonymous sites for estimates of Ks, non-synonymous sites for Ka).

GenBank accessions for Tachypleus tridentatus mRNA clones: JQ966943, AB353281, AB353280, HM156111, HQ221882, HQ221883, HQ221881, HQ386702, HM852953, TATTPP, TATPROCLOT, FN582225, FN582226, AF467804, AF227150, GQ260127, AF264067, AF264068, AB353279, AB005542, TATLICI, TATTGL, TATCFGB, TATLFC1, TATLFC2, AB201713, TATCFGA, TATLICI2, CS423581, CS423579, AB028144, AB201778, AB201776, AB201774, AB201772, AB201770, AB201768, AB201766, AB201779, AB201777, AB201775, AB201773, AB201771, AB201769, AB201767, AB201765, AB105059, AB002814, AX763473, TATCFBP, AB076186, AB076185, X04192, TATHCLL, AB037394, AB019116, AB019114, AB019112, AB019110, AB019108, AB019106, AB019104, AB019102, AB019100, AB019098, AB019096, AB019117, AB019115, AB019113, AB019111, AB019109, AB019107, AB019105, AB019103, AB019101, AB019099, AB019097, AB023783, AB024738, AB024739, AB024737, AB017484, D87214, D85756, D85341.
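For the mixture-model comparison summarized in Table 3, the information criteria follow the standard definitions AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L). The short Python sketch below recomputes the AIC column from the tabulated k and ln(L), taking ln(L) = −273.93 for the one-component model and so on. The function names aic and bic are hypothetical, and because the number of Ks observations n is not stated in this section, the BIC function is given only as a formula and not checked against the table.

```python
import math


def aic(k: int, log_likelihood: float) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(k: int, log_likelihood: float, n_obs: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L); n_obs is an input."""
    return k * math.log(n_obs) - 2 * log_likelihood


# (number of parameters k, ln(L)) for the 1- to 4-component fits in Table 3.
fits = [(2, -273.93), (5, -259.32), (8, -253.36), (11, -251.41)]
aic_values = [aic(k, lnl) for k, lnl in fits]
```

The recomputed AIC values agree with the tabulated ones to within rounding. AIC is lowest for the three-component fit, whereas the BIC column of Table 3, which penalizes parameters more heavily, is lowest for the two-component fit selected in the text.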
Figure 7 shows the distribution of Ka and Ks for paralogs and T. tridentatus orthologs. To estimate the number and age of peaks in the un-saturated range [37] of the Ks distribution (and of putative WGD events), we fit a series of univariate normal mixture models, with 1, 2, 3, and 4 components, to the paralog Ks distribution in the range 0 <= Ks <= 2.5 and selected the best model on the basis of the Bayesian Information Criterion (BIC) (Table 3). The best model had two components, with means at 0.7 and 1.45 substitutions per site. The position of the peak at lowest Ks was not sensitive to the addition of more mixture components. Figure 10 shows a comparison of the distribution and the components of the best-fitting model. Gaussian mixture models were estimated in R with mixtools [71].

Availability of supporting data

The raw sequencing reads are currently being submitted through the NCBI SRA and are accessible via NCBI BioProject accession PRJNA187356. The data sets supporting the results of this article are available in the GigaScience GigaDB repository [72].

Additional file

Additional file 1: The amino acid character matrix used for the phylogenetic analysis of homeobox genes.

Abbreviations

ALG: Ancestral linkage groups; AIC: Akaike information criterion; BIC: Bayesian information criterion; bp: Base pair; EM: Expectation maximization; GBS: Genotyping-by-sequencing; JAM: Joint assembly and mapping; LG: Linkage group; Max-gap: Maximum gap; Mb: Megabase; MYA: Million years ago; PE: Paired-end; RAD-Seq: Restriction site associated DNA sequencing; RBH: Reciprocal best blast hit; SNP: Single nucleotide polymorphism; WGD: Whole-genome duplication.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

NHP conceived and led the project. All authors wrote the paper. PH, NHP and J-XY wrote software. NHP, PH, J-XY and CWN carried out sequence analysis. JB collected and raised samples. CWN extracted genomic DNA and created the libraries for sequencing. All authors read and approved the final manuscript.
Acknowledgements
This research was supported by the National Science Foundation (EF-0850294 and IOB 06-41750), the Beckman Young Investigator Program, the University of Florida Division of Sponsored Research, the Department of Biology, and the UF Marine Laboratory at Seahorse Key. We thank the three reviewers, Dr. Hugues Roest Crollius, Dr. Stephen Richards and Dr. Brian Eads, for their valuable comments, which greatly helped to improve the quality of this paper.

Author details
1 Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA. 2 Department of Biochemistry and Cell Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA. 3 Department of Biology, University of Florida, P.O. Box 11-8525, Gainesville, FL 32611-8525, USA. 4 Current address: Gene by Gene, Ltd, Houston, TX 77008, USA.

Received: 23 October 2013 Accepted: 23 April 2014 Published: 14 May 2014

References
1. Ohno S: Evolution by Gene Duplication. Springer-Verlag; 1970.
2. Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature 1997, 387:708–713.


3. McLysaght A, Hokamp K, Wolfe KH: Extensive genomic duplication during early chordate evolution. Nat Genet 2002, 31:200–204.
4. Simillion C, Vandepoele K, Montagu MCEV, Zabeau M, de Peer YV: The hidden duplication past of Arabidopsis thaliana. Proc Natl Acad Sci 2002, 99:13627–13632.
5. Aury J-M, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Ségurens B, Daubin V, Anthouard V, Aiach N, Arnaiz O, Billaut A, Beisson J, Blanc I, Bouhouche K, Câmara F, Duharcourt S, Guigo R, Gogendeau D, Katinka M, Keller A-M, Kissmehl R, Klotz C, Koll F, Mouël AL, Lepère G, Malinsky S, Nowacki M, Nowak JK, Plattner H, et al: Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 2006, 444:171–178.
6. Putnam NH, Srivastava M, Hellsten U, Dirks B, Chapman J, Salamov A, Terry A, Shapiro H, Lindquist E, Kapitonov VV, Jurka J, Genikhovich G, Grigoriev IV, Lucas SM, Steele RE, Finnerty JR, Technau U, Martindale MQ, Rokhsar DS: Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 2007, 317:86–94.
7. Putnam NH, Butts T, Ferrier DEK, Furlong RF, Hellsten U, Kawashima T, Robinson-Rechavi M, Shoguchi E, Terry A, Yu J-K, Benito-Gutierrez E, Dubchak I, Garcia-Fernandez J, Gibson-Brown JJ, Grigoriev IV, Horton AC, de Jong PJ, Jurka J, Kapitonov VV, Kohara Y, Kuroki Y, Lindquist E, Lucas S, Osoegawa K, Pennacchio LA, Salamov AA, Satou Y, Sauka-Spengler T, Schmutz J, Shin-I T, et al: The amphioxus genome and the evolution of the chordate karyotype. Nature 2008, 453:1064–1071.
8. Srivastava M, Begovic E, Chapman J, Putnam NH, Hellsten U, Kawashima T, Kuo A, Mitros T, Salamov A, Carpenter ML, Signorovitch AY, Moreno MA, Kamm K, Grimwood J, Schmutz J, Shapiro H, Grigoriev IV, Buss LW, Schierwater B, Dellaporta SL, Rokhsar DS: The Trichoplax genome and the nature of placozoans. Nature 2008, 454:955–960.
9. Srivastava M, Simakov O, Chapman J, Fahey B, Gauthier MEA, Mitros T, Richards GS, Conaco C, Dacre M, Hellsten U, Larroux C, Putnam NH, Stanke M, Adamska M, Darling A, Degnan SM, Oakley TH, Plachetzki DC, Zhai Y, Adamski M, Calcino A, Cummins SF, Goodstein DM, Harris C, Jackson DJ, Leys SP, Shu S, Woodcroft BJ, Vervoort M, Kosik KS, et al: The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 2010, 466:720–726.
10. Simakov O, Marletaz F, Cho S-J, Edsinger-Gonzales E, Havlak P, Hellsten U, Kuo D-H, Larsson T, Lv J, Arendt D, Savage R, Osoegawa K, de Jong P, Grimwood J, Chapman JA, Shapiro H, Aerts A, Otillar RP, Terry AY, Boore JL, Grigoriev IV, Lindberg DR, Seaver EC, Weisblat DA, Putnam NH, Rokhsar DS: Insights into bilaterian evolution from three spiralian genomes. Nature 2013, 493:526–531.
11. Lv J, Havlak P, Putnam N: Constraints on genes shape long-term conservation of macro-synteny in metazoan genomes. BMC Bioinformatics 2011, 12(Suppl 9):S11.
12. Earl D, Bradnam K, John JS, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung W-K, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, et al: Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Res 2011, 21:2224–2241.
13. Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML: Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet 2011, 12:499–510.
14. Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 2000, 407:513–516.
15. Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson EA: Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE 2008, 3:e3376.
16. Huang X, Feng Q, Qian Q, Zhao Q, Wang L, Wang A, Guan J, Fan D, Weng Q, Huang T, Dong G, Sang T, Han B: High-throughput genotyping by whole-genome resequencing. Genome Res 2009, 19:1068–1076.
17. Andolfatto P, Davison D, Erezyilmaz D, Hu TT, Mast J, Sunayama-Morita T, Stern DL: Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Res 2011, 21:610–617.
18. Chapman JA, Ho I, Sunkara S, Luo S, Schroth GP, Rokhsar DS: Meraculous: De novo genome assembly with short paired-end reads. PLoS ONE 2011, 6:e23501.
19. Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res 2008, 18:810–820.
20. Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci 2011, 108:1513–1518.
21. Rudkin DM, Young GA: Horseshoe crabs – an ancient ancestry revealed. In Biol Conserv Horseshoe Crabs. Edited by Tanacredi JT, Botton ML, Smith D. New York: Springer; 2009:25–44.
22. Fisher DC: The Xiphosurida: archetypes of Bradytely? In Living Foss. Edited by Eldredge N, Stanley SM. New York: Springer; 1984:196–213.
23. Berkson J, Shuster CN Jr: The horseshoe crab: the battle for a true multiple-use resource. Fisheries 1999, 24:6–10.
24. Shuster CN Jr, Barlow RB, Brockmann HJ (Eds): The American Horseshoe Crab. Cambridge, MA: Harvard University Press; 2003.
25. Gregory TR: Animal Genome Size Database. http://www.genomesize.com.
26. Pop M, Kosack DS, Salzberg SL: Hierarchical scaffolding with Bambus. Genome Res 2004, 14:149–159.
27. Van Os H, Andrzejewski S, Bakker E, Barrena I, Bryan GJ, Caromel B, Ghareeb B, Isidore E, de Jong W, van Koert P, Lefebvre V, Milbourne D, Ritter E, van der Voort JNAMR, Rousselle-Bourgeois F, van Vliet J, Waugh R, Visser RGF, Bakker J, van Eck HJ: Construction of a 10,000-marker ultradense genetic recombination map of potato: providing a framework for accelerated gene isolation and a genomewide physical map. Genetics 2006, 173:1075–1087.
28. Sekiguchi K: Biology of Horseshoe Crabs. Tokyo: Science House Co., Ltd.; 1988:50–68.
29. Cartwright RA, Hussin J, Keebler JEM, Stone EA, Awadalla P: A family-based probabilistic method for capturing de novo mutations from high-throughput short-read sequencing data. Stat Appl Genet Mol Biol 2012, 11(2).
30. Bird AP: DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res 1980, 8:1499–1504.
31. Lynch M: The origins of eukaryotic gene structure. Mol Biol Evol 2006, 23:450–468.
32. Broman KW, Murray JC, Sheffield VC, White RL, Weber JL: Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 1998, 63:861–869.
33. Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G, Shlien A, Palsson ST, Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K: A high-resolution recombination map of the human genome. Nat Genet 2002, 31:241–247.
34. Coop G, Przeworski M: An evolutionary view of human recombination. Nat Rev Genet 2007, 8:23–34.
35. Hellmann I, Ebersberger I, Ptak SE, Pääbo S, Przeworski M: A neutral explanation for the correlation of diversity with recombination rates in humans. Am J Hum Genet 2003, 72:1527–1535.
36. VectorBase: Ixodes scapularis annotation, IscaW1. https://www.vectorbase.org/organisms/ixodes-scapularis.
37. Vanneste K, Van de Peer Y, Maere S: Inference of genome duplications from age distributions revisited. Mol Biol Evol 2013, 30:177–190.
38. Obst M, Faurby S, Bussarawit S, Funch P: Molecular phylogeny of extant horseshoe crabs (Xiphosura, Limulidae) indicates Paleogene diversification of Asian species. Mol Phylogenet Evol 2012, 62:21–26.
39. Begun DJ, Aquadro CF: Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 1992, 356:519–520.
40. Cutter AD, Payseur BA: Selection at linked sites in the partial selfer Caenorhabditis elegans. Mol Biol Evol 2003, 20:665–673.
41. Nachman MW: Single nucleotide polymorphisms and recombination rate in humans. Trends Genet 2001, 17:481–485.
42. Stephan W, Langley CH: DNA polymorphism in Lycopersicon and crossing-over per physical length. Genetics 1998, 150:1585–1593.
43. Roselius K, Stephan W, Städler T: The relationship of nucleotide polymorphism, recombination rate and selection in wild tomato species. Genetics 2005, 171:753–763.
44. Andolfatto P, Przeworski M: Regions of lower crossing over harbor more rare variants in African populations of Drosophila melanogaster. Genetics 2001, 158:657–665.
45. Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song X-Z, Weinstock GM, Gibbs RA: The Atlas genome assembly system. Genome Res 2004, 14:721–732.
46. Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M: Next generation sequence assembly with AMOS. Current Protocols in Bioinformatics 2011, 33:11.8.1–11.8.18.
47. The JAM-pipeline. https://github.com/putnamlab/jam-pipeline.
48. Johnson SL, Brockmann HJ: Costs of multiple mates: an experimental study in horseshoe crabs. Anim Behav 2010, 80:773–782.


49. Mullikin JC, Ning Z: The Phusion assembler. Genome Res 2003, 13:81–90.
50. Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010, 38:1767–1771.
51. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8:186–194.
52. Knuth DE: Searching and Sorting, Volume 3. Reading, MA: Addison-Wesley; 1973 [The Art of Computer Programming].
53. Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA: Reducing storage requirements for biological sequence comparison. Bioinformatics 2004, 20:3363–3369.
54. Ye C, Ma ZS, Cannon CH, Pop M, Yu DW: Exploiting sparseness in de novo genome assembly. BMC Bioinformatics 2012, 13(6):S1.
55. Ensembl. http://www.ensembl.org/index.html.
56. Chen GK, Marjoram P, Wall JD: Fast and flexible simulation of DNA sequence data. Genome Res 2009, 19:136–142.
57. Dehal P, Satou Y, Campbell RK, Chapman J, Degnan B, Tomaso AD, Davidson B, Gregorio AD, Gelpke M, Goodstein DM, Harafuji N, Hastings KEM, Ho I, Hotta K, Huang W, Kawashima T, Lemaire P, Martinez D, Meinertzhagen IA, Necula S, Nonaka M, Putnam N, Rash S, Saiga H, Satake M, Terry A, Yamada L, Wang H-G, Awazu S, Azumi K, et al: The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science 2002, 298:2157–2167.
58. Haubold B, Pfaffelhuber P, Lynch M: mlRho – a program for estimating the population mutation and recombination rates from shotgun-sequenced diploid genomes. Mol Ecol 2010, 19:277–284.
59. Small KS, Brudno M, Hill MM, Sidow A: Extreme genomic variation in a natural population. Proc Natl Acad Sci 2007, 104:5698–5703.
60. dwgsim. https://github.com/nh13/DWGSIM/releases.
61. Lunter G, Goodson M: Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res 2011, 21:936–939.
62. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/Map format and SAMtools. Bioinformatics 2009, 25:2078–2079.
63. Yang Z: Statistical properties of a DNA sample under the finite-sites model. Genetics 1996, 144:1941–1950.
64. Haldane JBS: The combination of linkage values and the calculation of distances between the loci of linked factors. J Genet 1919, 8:299–309.
65. Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res 1998, 8:195–202.
66. Slater GS, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005, 6:31.
67. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32:1792–1797.
68. Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003, 19:1572–1574.
69. Yang Z, Nielsen R: Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 2000, 17:32–43.
70. Zhang Z, Li J, Zhao X-Q, Wang J, Wong GK-S, Yu J: KaKs calculator: calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinformatics 2006, 4:259–263.
71. Benaglia T, Chauveau D, Hunter DR, Young DS: mixtools: An R package for analyzing finite mixture models. J Stat Softw 2009, 32:1–29.
72. Nossa CW, Havlak P, Yue JX, Lv J, Vincent KY, Brockmann HJ, Putnam NH: Supporting materials from "Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication" 2014. GigaScience Database. http://dx.doi.org/10.5524/100091.

doi:10.1186/2047-217X-3-9
Cite this article as: Nossa et al.: Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication. GigaScience 2014 3:9.




name sequence accession position class '.' for match LG:coord. EVX-1 QMRRYRTAFTREQIARLEKEFYRENYVSRPRRCELAAALNLPETTIKVWFQNRRMKDKRQRLA P23683.1 g_ayrxn_scaff_4445_1.Contig24 --...........L.......C..S..................S....R.------------1 : 7968 evx g_algxc_scaff_4707_1.Contig9 KE..S.....E..LEY..RT.E.TQ.PDVYT.E...QRTK.T.ARVQ.--------------1 : 2335 ? g_ajgny_scaff_5092_1.Contig4 KAK.I..I..A..LN...A..LQQQ.MVGRE.SH..SE...K.AQV........I.WRK---4 : 26846 ? g_ahyur_scaff_4335_1.Contig2 -AK.V..I..A..LD...T..E.QQ.MVG.E.LQ..S....T.AQV........I.WRK.HFD 1 : 3106 ? engrailed__Tribolium_castaneum_ EEK.P....SGA.L...KH..AENR.LTER..QQ.S.E.G.N.AQ..I....K.A.I.KASGQ NP_001034511.2 engrailed__isoform_A__Drosophila_melanogaster_ DEK.P....SS..L...KR..NENR.LTER..QQ.SSE.G.N.AQ..I....K.A.I.KSTGS NP_523700.2 engrailed__Branchiostoma_oridae_ EEK.P.....S..LQ..K...QENR.LTEQ..QD..RE.K.N.SQ..I....K.A.I.KAAGV EEN68949.1 g_addne_scaff_4497_1.Contig54 EEK.P.....TD.L...K...HENR.LTEK..Q...RE.Q.N.SQ..I....K.A.I.KSTGE 1 : 5794 en g_ausca_scaff_4858_1.Contig1 DEK.P.....AD.L...KQ..QENR.LTEK..QD..LD.Q.N.SQ..I....K.A.I.KASGQ 20 : 83001 en g_adwrx_scaff_4421_1.Contig72 DEK.P.....AD.L...KA..QENR.LTEK..QD..RG.Q.N.SQ..I....K.A.I.KATGH 15 : 71297 en g_avzuc_scaff_4437_1.Contig3 DEK.P.....TD.L...KA..QENK-LTEK..QD..RE------------------------21 : 85367 en g_adwrx_scaff_4421_1.Contig61 DEK.P.....TD.L...KA..QENR.LTEK..QD..RE.Q.N.SQ..I.-------------15 : 71297 en g_avzuc_scaff_4437_1.Contig4 -------------------..QENR.LTEK..QD..RE.Q.N.SQ..I....K.A.I.KATGH 21 : 85367 en g_adwrx_scaff_4421_1.Contig63 -EK.P.....AD.L...KA..QENR.LTEK..QD..RE.Q.N.SQ..I....K.A.I.KAVGH 15 : 71297 en g_ayadw_scaff_4575_1.Contig4 DEK.P.....AD.L...KA..QENR.LTEK..QD..RE.Q.N.SQ..I....K.A.I.KAVGH 3 : 22654 en Nkx-2.2a KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..SLIR.TP.QV.I....H.Y.M..A.AE NP_571497.1 Nkx-2.2 KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..SLIR.TP.QV.I....H.Y.M..A.AE NP_002500.1 Nkx-2.2__Gallus_gallus_ 
KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..SLIR.TP.QV.I....H.Y.M..A.AE XP_003643855.1 g_ajlua_scaff_4446_1.Contig10 KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..NFIR.TP.QV.I....H.Y.T..A.QE 1 : 1567 nkx2 g_acfsj_scaff_4532_1.Contig25 KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..SIIR.TP.QV.I....H.Y.T..A.QE 3 : 20058 nkx2 ladybird-like__3 KR.KS....SNQ.LFE..RR.V.QK.L.PAD.DQ..QR.A.SSAQVIT......A.L..DLDE NP_001038133.1 lady_bird-like__1_homolog KR.KS.....NH..YE...R.LYQK.L.PAD.DQI.QQ.G.TNAQVIT......A.L..DLEE NP_001007135.1 g_acdli_scaff_4927_1.Contig5 KK.KS.....NH..FE...R.LYQK.L.PAD.D.I.QS.G.TNAQVIT......A.L..---26 : 93066 lbx g_agfxy_scaff_4755_1.Contig8 KK.KS.....NQ..FE...R.LYQK.L.PAD.D.I.QS.R.TNAQVIT......A.L..---13 : 67147 lbx g_aplvp_scaff_4845_1.Contig14 KK.KS.....NH..FE...R.LYQK.L.PAD.D.I.QS.G.TNAQVIT......A.L..---1 : 3106 lbx AGAP003673PA__Anopheles_gambiae_str._PEST_ KK.KS.....NH..FE...R.LYQK.L.PSD.D.I....G.SNAQVIT......A.L..DMEE EAA04094 g_amhlt_scaff_4436_1.Contig3 KQ.KA.....DH.LQT...C.E.QK.L.VQE.M...SK...TD.QV.T.Y....--------8 : 46729 ? g_amhlt_scaff_4436_1.Contig8 KE.KA.....DY..QT...R.QTQK.L.VQD.M.I.SK...TD.HV.T.Y....--------8 : 46729 ? g_aeqlo_scaff_4727_1.Contig6 RKKKT..V...S.VFQ..ST.DMKR.L.SSE.AG...C.H.T..QV.I......N.W...--22 : 86973 ? g_agfda_scaff_4460_1.Contig8 KRKKS..T..GR..FE...Q.EIKK.L.SSE.A.M.KL..VT.QQVEI--------------26 : 93149 ? g_artzi_scaff_4429_1.Contig73 SN.KL..P..TQ.LLA..RK.LAKQ.L.IAE.AKFSIS.T.T.AQV.I......A.E..---5 : 32215 msx Msx__Saccoglossus_kowalevskii_ TN.KP..P..TS.LLA..RK.RQKQ.L.IAE.A.FS.S...T..QV.I......A.A..LQE. ABD97280 Msx___Nematostella_vectensis_ AN.KP..P..TS.LLA..RK.RQKQ.L.IAE.A.FS.S...T..QV.I......A.A..LHE. BAG11598 MSX-2__Mus_musculus_ TN.KP..P..TS.LLA..RK.RQKQ.L.IAE.A.FSSS...T..QV.I......A.A..LQE. NP_038629 MSX-2__Homo_sapiens_ TN.KP..P..TS.LLA..RK.RQKQ.L.IAE.A.FSSS...T..QV.I......A.A..LQE. BAA06549 MSX1___Homo_sapiens_ TN.KP..P..TA.LLA..RK.RQKQ.L.IAE.A.FSSS.S.T..QV.I......A.A..LQE. 
NP_002439 MSX1_variant__Mus_musculus_ TN.KP..P..TA.LLA..RK.RQKQ.L.IAE.A.FSSS.S.T..QV.I......A.A..LQE. AAG32466 g_aebwc_scaff_4685_2.Contig8 -----------------------KQ.L.IAE.A.FSSS.S.T..QV.I......A.E..LKE. 3 : 19118 msx g_aebwc_scaff_4685_2.Contig10 SN.KP..P..TQ.LLA..RK.RTKQ.L.IAE.A.FSSS...T..QV.I......A.E..---3 : 19118 msx g_aebwc_scaff_4685_2.Contig13 SN.KP..P..TQ.LLA..RK.RSKQ.L.IAE.A.FSSS...T..QV.I......A.E..---3 : 19118 msx g_aoopl_scaff_4606_1.Contig8 -----------------------KQ.L.IAE.A.FSSS...T..QV.I......A.EN.LKE26 : 93149 msx g_aoopl_scaff_4606_1.Contig18 SS.KP..P..TK.LLA...M.QTKQ.L.IAE.A.FSSSM..T..QV.I......A.E..---26 : 93149 msx g_agizq_scaff_4515_1.Contig28 --..R...Y.S..LLQ.....HAKK.L.LTE.SQISTT.Q.S.VQV.I......A.W..---3 : 23600 gbx unplugged__Drosophila_melanogaster_ KS..R.....S..LLE..R..HAKK.L.LTE.SQI.TS.K.S.VQV.I......A.W..VKAG NP_477146 AGAP006923PA__Anopheles_gambiae_str._PEST_ KS..R.....S..LLE..R..HAKK.L.LTE.SQI.TS.K.S.VQV.I......A.W..VKAG EAA04094 g_apstr_scaff_4456_1.Contig3 --..R.....N..LLE..N..HCKK.L.LTE.SQI.HS.Q.S.QQV.I......A.W..---1 : 8283 gbx GBX-1__Canis_lupus_familiaris_ KS..R.....S..LLE.....HCKK.L.LTE.SQI.H..K.S.VQV.I......A.W..IKAG XP_853664 le:///Users/yjx/Dropbox/gigasciencereviews/gigascience_R2_s... 1 of 4 4/22/14, 12:20 PM


name sequence accession position class '.' for match LG:coord. g_aglhj_scaff_4667_1.Contig13 --..R.....S..LLE.....HSKK.L.LTE.SQI.HS.Q.S.VQV.I......A.W..---11 : 60310 gbx g_aglhj_scaff_4783_1.Contig4 KS..R.....S..LLE.....HSKK.L.LTE.SHI.NS.Q.S.VQV.I......A.W..VKAG 18 : 80043 gbx g_acgnb_scaff_4629_1.Contig3 RFLLSLSQR.SRNGSQATDGLHFNRWLT.R..I.IVRT.C.SD-------------------20 : 83755 hox? g_awtvf_scaff_4744_1.Contig2 ETK.Q...Y..H..LE.....HFNR.LT.R..I.I.HSSC.ERIK.----------------21 : 85243 hox? g_awtvf_scaff_4744_1.Contig1 -----------------------SR.LT.R.QI.I.HS.C.S.RQ..N...I....W.KDNKL 21 : 85243 hox? Drosophila-Dfd__NM_057853_ EPK.Q...Y..H..LE.....HYNR.LT.R..I.I.HT.V.S.RQ..I........W.KD--NM_057853 Tribolium-Dfd__EEZ99253_ EPK.Q...Y..H..LE.....HYNR.LT.R..I.I.HT.V.S.RQ..I........W.KD--EEZ99253 g_ajivj_scaff_4414_1.Contig4 ETK.Q...Y..H..LE.....HFNR.LT.R..I.I.HL.C.S.RQ..I........W.KDNKL 15 : 71079 hox? g_ajivj_scaff_4414_1.Contig3 ETK.Q...Y..H..LE.....HFNR.LT.R..I.I.HS.C.S.RQ..I........W.KDNKL 15 : 71079 hox? Branchiostoma-Hox4__AB028208_ DTK.S...Y..Q.VLE.....HFNR.LT.R..I.I.HS.G.T.RQ..I........W.KD--AB028208 g_akvhc_scaff_4400_4.Contig7 EAK.Q...Y..H.VLE.....HFNR.LT.R..I.I.H..C.S.RQ..I........W.KDNKL 18 : 79524 hox? 
Capitella-Dfd__EU196540_ DSK.T...Y..H..LE.....HFNR.LT.R..I.I.HT.C.S.RQ..I........W.KE--EU196540 Lottia-Dfd__110972_ DSK.N...Y..H.VLE.....HFNR.LT.R..I.I.HT.C.S.RQ..I........W.KE--JGI Lotgi1 protein id: 110972 Branchiostoma-Hox12__AAF81903_ SS.KK.CPYSKV.LLE.....LYNM.IT.EQ.G.I.RKV..TDRQV.I........M..M--AAF81903 Branchiostoma-Hox10__Z35150_ VG.KK.CPY.KY..LE.....LFNM....E..Q.ISRHV..SDRQV.I........M..M--Z35150 Branchiostoma-Hox11__AAF81909_ ST.KK.CPY.KY.TLE.....LFNMF.T.E..Q.I.RQ...TDRQV.I........M..M--AAF81909 Branchiostoma-Hox9__ABX39493_ SS.KK.CPY..F.TLE.....LYNM.LT.E..Y.ISQHV..T.RQV.I........M.KM--ABX39493 Branchiostoma-Hox13__AAF81904_ GG.KK.CPY.KY.LSV..Q.YIQNR....ET.L..SQR...TDRQV.I........Q..L--AAF81904 Branchiostoma-Hox14__AAF81905_ PV.PK.RPYSKY.LNE..N.YVQNQ.I..DK.LQ.SQK...T.RQV.I......I.Q.KL--AAF81905 Capitella-post1__EU196546_ NPKKK.KPYSKP.VSA..N.YSTST.ITKA..K.V.RE.D.T.RQ..I.Y....I.E.KI--EU196546 Lottia-post1__100031_ TL.KR.RPYSKF...E..R.YNGST...KS..W..SQLI..S.RQ..I......I.A.KI--JGI Lotgi1 protein id: 100031 Drosophila-AbdB__NM_080157_ SV.KK.KPYSKF.TLE.....LFNA...KQK.W...RN.Q.T.RQV.I........N.KN--NM_080157 Tribolium-AbdB__EEZ99247_ TV.KK.KPYSKF.TLE.....LFNA...KQK.W...RN...T.RQV.I........N.KN--EEZ99247 g_algea_scaff_4559_1.Contig10 TV.KK.KPYSKF.TLE.....LFNA...KQK.W...RN...T.RQV.I........S.KNAQR 15 : 71736 abdB g_algea_scaff_4559_3.Contig7 ------KPYSKF.TLE.....LFNS...KQK.W...RN...T.RQV.I........S.KSSQR 21 : 85120 abdB Capitella-post2__EU196545_ KQ.KK.KPY..Y.TMV..N..INNS.IT.QK.W.ISCK.H.S.RQV..........R.KL--EU196545 Lottia-post2__89720_ KG.KK.KPY..Y.TMV..N..LSSS.IT.QK.W.ISCK.Q.S.RQV..........R.KL--JGI Lotgi1 protein id: 89720 g_aosol_scaff_4355_1.Contig18 ---------------------EFNR.ITQR..I.I.HY.S.S.RK..M.....S..ANEKH-20 : 83145 hox? 
Drosophila-ftz__NM_058159_ DSK.T.QTY..Y.TLE.....HFNR.IT.R..IDI.N..S.S.RQ..I........S.KD--NM_058159 CDX-2__Mus_musculus_ TKDK..VVY.DH.RLE.....HFSR.ITIR.KS....T.G.S.RQV.I......A.ERKIKKK AAA19645 g_amvdv_scaff_4163_3.Contig12 TRDK..VVY.DQ.RLE.....HYNR.I.IR.KT....MVG.S.RQV.I......A.ER.NFRK 5 : 33811 cdx g_aildw_scaff_4772_1.Contig14 TKDK...VY.DH.RLE.....HYSR.ITIR.KT...TM.G.SDRQV.I......A.ERK.ARK 19 : 80739 cdx g_anovg_scaff_4508_1.Contig5 TKEK..VVY.DH.RLE......YSR.ITIR.KS....M.G.S.RQV.I......A.ERK.ARK 3 : 21395 cdx caudal__isoform_A__Drosophila_melanogaster_ TKDK..VVY.DF.RLE....YCTSR.ITIR.KS...QT.S.S.RQV.I......A.ERK.NKK NP_476954.1 Ftz__Tribolium_castaneum_ GNK.T.QTY..Y.TLE.....HFNK.LT.R..I.I.ES.R.T.RQ..I........A.KDTKF AF321227_1 Tribolium-ftz__AAC46491_ GNK.T.QTY..Y.TLE.....HFNK.LT.R..I.I.ES.R.T.RQ..I........A.KD--AAC46491 g_acrop_scaff_4310_1.Contig8 APK.T.QTY..Y.TLE.....HFNK.LT.R..I.I.HT.G.T.RQ..I........E.KENK15 : 71825 hox? g_awjsf_scaff_4464_1.Contig11 --K.T.QTY..Y.TLE.....HFNK.LT.R..I.I.HT.G.T.RQ..I....----------21 : 85120 hox? g_awjsf_scaff_4464_1.Contig15 APK.T.QTY..Y.TLE.....HFNR.LT.R..I.I.HT.G.T.RQ..I........A.KENKF 21 : 85120 hox? g_avhgw_scaff_4510_1.Contig7 --K.T.QTY..Y.TLE.....HFNR.LT.R..I.I.HS.G.T.RQ..I........A.KEMPK 18 : 79607 hox? g_alkdk_scaff_4677_1.Contig4 DQ..C.QTYSFY.TLA.....HFDP.LTKR..I.I.HT...K.KQ..I........W.KEKRS 28 : 94741 hox? g_ascty_scaff_4257_1.Contig45 DQ..S.QTYSFY.TLA.....HLDP.LTQR..I.I.HT...T.KQ.QI........W.KENKT 17 : 77601 hox? hox_6_Saccoglossus -Q..G.QTY..Y.TLE.....HFNR.LT.R..I.I.HT.G.T.RQ..I........W.KEQKABK00020 hox_7_Saccoglossus -KK.C.QTY..Y.TLE.....HYNR.LT.R..I..SHL.G.T.RQ..I........Y.KESKAAP79287 Capitella-lox5__EU196542_ EQK.T.QTY..Y.TLE.....HYNR.LT.R..I.I.H..Q.T.RQ..I........Y.KE--EU196542 g_anoef_scaff_4410_1.Contig27 PR..G.QSY..F.TLE.....HFNH.LT.R..I...HVVC.T.RQ..I........L.KEVRV 3 : 23600 hox? le:///Users/yjx/Dropbox/gigasciencereviews/gigascience_R2_s... 2 of 4 4/22/14, 12:20 PM


name sequence accession position class '.' for match LG:coord. g_awjsf_scaff_4464_1.Contig6 ----G.Q.Y..F.TLE.....HYNH.LT.R..I.I.HVVC.T--------------------21 : 85120 hox? g_ajivj_scaff_4414_1.Contig2 ---.S.Q.Y..F.TLE.....HYNH.LT.R..I.I.HVVC.T--------------------15 : 71079 hox? g_anoef_scaff_4410_1.Contig17 --------------LE.....HYNH.LT.R..I.I.HVVC.T.RQ..I........L.KEIR. 3 : 23600 hox? Lottia-lox4__156931_ .R..G.QTYS.Y.TLE.....QFNH.LT.K..I.I.HT.C.T.RQ..I........M.KE--JGI Lotgi1 protein id: 156931 Capitella-lox4__EU196543_ .R..G.QTYS.Y.TLE.....QFNH.LT.K..I.I.H..C.T.RQ..I........L.KE--EU196543 g_awgsh_scaff_4551_1.Contig4 -R..G.QTY..F.TLE.....HFNH.LT.R..I.I.H..C.T.RQ..I........L.KEIR. 15 : 71699 hox? g_aqqvq_scaff_4781_1.Contig5 ------------------...HFNH.LT.R..I.I.H..C.T.RQ..I........L.KEIR. 21 : 85040 hox? g_asnox_scaff_4479_1.Contig10 ---------------------HFNH.LT.R..I.I.H..C.T.RQ..I........L.KEIR. 15 : 71793 hox? g_asnox_scaff_4479_1.Contig18 ---------------------HFNH.LT.R..I.I.H..C.T.RQ..I........L.KEIR. 15 : 71793 hox? Drosophila-abdA__NM_057345_ PR..G.QTY..F.TLE.....HFNH.LT.R..I.I.H..C.T.RQ..I........L.KE--NM_057345 Tribolium-abdA__AAB70263_ PR..G.QTY..F.TLE.....HFNH.LT.R..I.I.H..C.T.RQ..I........L.KE--AAB70263 g_aqqvq_scaff_4781_1.Contig10 --------YP.Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I........L.KEIQ. 21 : 85040 ubx g_asnox_scaff_4479_1.Contig17 -------------------------.LT.R..I.M.H..C.T.RQ..I........L.KEIQ. 15 : 71793 ubx g_asnox_scaff_4479_1.Contig21 LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.---------------------15 : 71793 ubx g_aityy_scaff_4289_1.Contig2 LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I......--------10 : 54444 ubx Drosophila-Ubx__NM_080500_ LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I........L.KE--NM_080500 Tribolium-Ubx__EEZ99249_ LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I........L.KE--EEZ99249 g_asnox_scaff_4479_1.Contig28 LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I........L.KEIQ. 
15 : 71793 ubx Capitella-lox2__EU196544_ .R..G.QTY..Y.TLE.....KFNR.LT.R..I..SHM.C.T.RQ..I........E.KE--EU196544 Lottia-lox2__185752_ .R..G.QTY..F.TLE.....KFNR.LT.R..I..SHM.C.T.RQ..I........E.KE--JGI Lotgi1 protein id: 185752 Branchiostoma-Hox8__ABX39492_ ER..G.QTYS.Y.TLE.....HFNK.LT.R..I.I.H..G.T.RQ..I........L.KE--ABX39492 Branchiostoma-Hox7__ABX39491_ ERK.G.QTY..Y.TLE.....HFNK.LT.R..I.I.H..C.T.RQ..I........W.KE--ABX39491 Lottia-lox5__89598_ EQK.T.QTY..Y.TLE.....HFNR.LT.R..I.V.HM.G.T.RQ..I........W.KE--JGI Lotgi1 protein id: 89598 Branchiostoma-Hox6__ABX39490_ EKK.G.QTY..Y.TLE.....HFNK.LT.K..I.I.HL.G.T.RQ..I........W.KE--ABX39490 g_aubsu_scaff_4356_1.Contig13 ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KENK. 15 : 71793 hox? g_aubsu_scaff_4356_1.Contig10 ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KENK. 15 : 71793 hox? Drosophila-Antp__NM_079525_ ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--NM_079525 Tribolium-Antp__EEZ99250_ ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--EEZ99250 Capitella-Antp__EU196547_ ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--EU196547 Lottia-Antp__177860_ NRK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--JGI Lotgi1 protein id: 177860 Hox8_Metacrinus -KK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.Q.VC.S.RQ..I........A.KETSBAF43726 g_atxvn_scaff_4590_1.Contig3 --------------------------LT.R..I.I.H..C.T.RQ..I........W.QENK. 5 : 33116 hox? g_awjsf_scaff_4464_1.Contig1 --K.G.QTY..Y.TLE.....HFNR.LT.R..I.I.--------------------------21 : 85120 hox? g_awjsf_scaff_4464_1.Contig8 ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.R------------------21 : 85120 hox? g_ascty_scaff_4257_1.Contig27 ------------.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I--------------17 : 77601 hox? g_alkdk_scaff_4677_1.Contig1 -------TY..Y.TLE.....HFNR.LT.R..I.I.H..C.T--------------------28 : 94741 hox? 
hox_5_Saccoglossus -AK.S...Y..Y.TLE.....HFNR.LT.R..I.I.H..G.S.RQ..I........W.KEHNNP_001158410 Branchiostoma-Hox5__ABX39489_ DNK.T...Y..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--ABX39489 Scr__Tribolium_castaneum_ ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KEHKM AF321227_2 Drosophila-Scr__X14475_ ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--X14475 Tribolium-Scr__EEZ99252_ ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--EEZ99252 g_anoef_scaff_4410_1.Contig19 ------------------...HFNR.LT.R..I.I.H..C.S.RQ..I........W.KEHKM 3 : 23600 hox? g_asnox_scaff_4479_1.Contig12 -------SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S.RQ..I--------------15 : 71793 hox? g_awjsf_scaff_4464_1.Contig7 ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S.RQ..I........W.KEH-21 : 85120 hox? g_anbio_scaff_4694_1.Contig9 ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S.RQ..I........W.KEHKM 15 : 71825 hox? le:///Users/yjx/Dropbox/gigasciencereviews/gigascience_R2_s... 3 of 4 4/22/14, 12:20 PM


name sequence accession position class '.' for match LG:coord. g_anbio_scaff_4694_1.Contig10 ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S.RQ..I........W.KEHKM 15 : 71825 hox? g_awjsf_scaff_4464_1.Contig5 --K.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S--------------------21 : 85120 hox? Lottia-Scr__86927_ DSK.S..SY..H.TLE.....HYNK.LT.R..I.I.H....T.RQ..I........W.KD--JGI Lotgi1 protein id: 86927 Capitella-Scr__EU196541_ DNK.T..SY..H.TLE.....HFNR.LT.R..I.I.HS...T.RQ..I........W.KE--EU196541 pancreas_duodenum___1__Mus_musculus_ ENK.T...Y..A.LLE.....LFNK.I.....V...VM...T.RH..I........W.KEEDK NP_032840 Branchiostoma-Hox2__AB028207_ SS..L..V..NT.LLE.....HYNK..CK...K.I.SY.D.N.RQV.I.......RQ..R--AB028207 g_axtpu_scaff_4896_1.Contig3 ---SG..N..TK.L....N..LFNK.LT.K..I.I...IQ.N..QV.I........Q.KRMKE 9 : 53730 labial g_axtpu_scaff_4896_1.Contig4 ---SG..N..TK.L.G..N..LFNK.LT.K..I.I...IQ.N..QV.---------------9 : 53730 labial Drosophila-labial__NM_057265_ -NNSG..N..NK.LTE.....HFNR.LT.A..I.I.NT.Q.N..QV.I........Q.KRV-NM_057265 Tribolium-labial__AAF64149_ CLNTG..N..NK.LTE.....HFNK.LT.A..I.I.S..Q.N..QV.I........Q.KR--AAF64149 g_axtpu_scaff_4896_1.Contig2 --------------------------LT.A..I.I.T..Q.N..QV.I........Q.KRMKE 9 : 53730 labial g_ajwnd_scaff_4604_1.Contig5 -VGSG..N..TK.LTE.....HFNK.LT.A..I.I.T..Q.N..QV.I........Q.KRMKE 15 : 71825 labial g_aavsb_scaff_4656_1.Contig15 ----G..N..TK.LTE.....HFNK.LT.A..I.I.N..Q.N..QV.I........Q.KRMKE 18 : 79564 labial g_arkho_scaff_4547_1.Contig2 ----G..N..TK.LTE.....HFNK.LT.A..I.I.NV.Q.N..QV.I........Q.----1 : 5456 labial g_aavsb_scaff_4656_1.Contig7 ------------------------------..I.I.NV.Q.N..QV.I........Q.KRMKE 18 : 79564 labial Branchiostoma-Hox1__AB028206_ GPNNG..N..TK.LTE.....HYNK.LT.A..V.I......N..QV.I........Q.KR--AB028206 Capitella-labial__EU196537_ -PNMG..N..NK.LTE.....HFNK.LT.A..I.I..S.G.N..QV.I........Q.KRL-EU196537 Lottia-labial__100648_ -PNSG..N..NK.LTE.....HFNK.LT.A..I.I..S.G.N..QV.I........Q.KR--JGI Lotgi1 protein id: 100648 
Lottia-pb__110623_ GS..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.T.RQV..........Y...--JGI Lotgi1 protein id: 110623 Drosophila-pb__NM_057322_ LP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.T.RQV..........H...--NM_057322 Tribolium-pb__EEZ99256_ LP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.T.RQV..........H...--EEZ99256 g_adlvn_scaff_4304_2.Contig14 MP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.S.RQV..........H...TMM 15 : 71870 ? g_agkld_scaff_4527_1.Contig4 MP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.S.RQV..........H...TVM 3 : 19288 ? g_apppz_scaff_4373_1.Contig11 MP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.S.RQV..........H...TTM 21 : 85040 ? g_adlvn_scaff_4304_2.Contig9 MP..L...Y.NT.LLE.....HFNK.LC....------------------------------15 : 71870 ? Capitella-pb__EU196538_ HP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.T.RQV..........F...--EU196538 Tribolium-zen__NP_001036813_ AGK.A...Y.SA.LVE..R..HHGK.L.....IQI.EN...S.RQ..I........H.KE--NP_001036813 g_adhdu_scaff_4669_1.Contig2 PTK.V...Y.SA.LVE.....HFNR.LC....I.M.NL...T.RQ..I........Y.K---21 : 85075 hox3 g_amdsu_scaff_4454_1.Contig21 --K.A...Y.SA.LVE.....HFNR.LC....V.M.NL.S.T.RQ..I........Y.KEQKN 15 : 71825 hox3 Capitella-Hox3__EU196539_ PSK.A...Y.SA.LVE.....HFNR.LC....I.M..L...T.RQ..I........Y.KD--EU196539 Lottia-Hox3__56601_ PAK.A...Y.SA.LVE.....HFNR.LC....I.M..L...S.RQ-----------------JGI Lotgi1 protein id: 56601 g_amdsu_scaff_4454_1.Contig17 --K.P...Y.TA.LTE.....HFTR.LC....V.M.GL.H.T.RQ..I.......-------15 : 71825 hox3 Branchiostoma-Hox3__X68045_ AGK.A...Y.SA.LVE.....HFNR.LC....V.M..M...T.RQ..I........Y.KE--X68045 Drosophila-zen__AAF54087_ KLK.S.....SV.LVE..N..KSNM.LY.T..I.I.QR.S.C.RQV.I........F.KD--AAF54087 g_aheja_scaff_4145_1.Contig5 KTK.T..S.NGV.LLE.....TNNM.L..L..I.I.NY...S.KQV.I......V.F.KEG-23 : 89444 gsx g_atxvn_scaff_4590_1.Contig4 SSKKI..VY..S.LLE..R..AANM.LT.L..I.I.TY.S.S.KQV.I......V.Y.KE--5 : 33116 gsx g_aneyu_scaff_4415_1.Contig10 SSK.I.....ST.LLE..R..SANM.L..L..I.I.TY...S.KQV.I......V.Y.KE--6 : 37542 gsx Gsx1___Mus_musculus_ 
SSK.M.....ST.LLE..R..ASNM.L..L..I.I.TY...S.KQV.I......V.H.KEGKG AAI37770.1 g_aucja_scaff_4557_1.Contig17 --K.I.....SA.LLE..R..STNM.L..I..I.ISKY...S.KQV.I......V.H.KEG-19 : 80662 gsx