RESEARCH Open Access

Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication

Carlos W Nossa1,4, Paul Havlak1, Jia-Xing Yue1, Jie Lv1, Kimberly Y Vincent1, H Jane Brockmann3 and Nicholas H Putnam1,2*

Abstract

Background: Horseshoe crabs are marine arthropods with a fossil record extending back approximately 450 million years. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits, and are often cited as examples of "living fossils". As arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose sequenced genomes (including insects and nematodes) have thus far shown more divergence from the ancestral pattern of eumetazoan genome organization than cnidarians, deuterostomes and lophotrochozoans. However, much of ecdysozoan diversity remains unrepresented in comparative genomic analyses.

Results: Here we apply a new strategy of combined de novo assembly and genetic mapping to examine the chromosome-scale genome organization of the Atlantic horseshoe crab, Limulus polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their parents at a mean redundancy of 1.1× per sample. The map includes 84,307 sequence markers grouped into 1,876 distinct genetic intervals and 5,775 candidate conserved protein coding genes.

Conclusions: Comparison with other metazoan genomes shows that the L. polyphemus genome preserves ancestral bilaterian linkage groups, and that a common ancestor of modern horseshoe crabs underwent one or more ancient whole genome duplications 300 million years ago, followed by extensive chromosome fusion. These results provide a counter-example to the often noted correlation between whole genome duplication and evolutionary radiations. The new, low-cost genetic mapping method for obtaining a chromosome-scale view of non-model organism genomes that we demonstrate here does not require laboratory culture, and is potentially applicable to a broad range of other species.
Keywords: Genotyping-by-sequencing (GBS), Genetic linkage mapping, Genome evolution, Limulus polyphemus

Background

Comparative analysis of genome sequences from diverse metazoans has revealed much about their evolution over hundreds of millions of years. The discovery of extensive gene homology across large evolutionary distances has allowed researchers to track chromosome rearrangements and whole genome duplications. The resulting value of whole chromosome sequences presents a challenge for existing whole genome shotgun (WGS) assembly strategies.

Whole genome duplication events were long suspected [1], but only the availability of genome sequences has allowed confirmation of them in fungal, vertebrate, plant and ciliate lineages [2-5]. In contrast, when only a few chordate, insect and nematode genomes were available, conservation of gene linkage (i.e., synteny) and gene order were observed only between closely-related species, and consequently were not expected to be conserved between phyla. As more metazoan genomes have been sequenced, it has become clear that long-range linkage has been conserved over long time scales in many lineages.
Sequencing the genomes of representatives of chordate, mollusk, annelid, cnidarian, placozoan and sponge clades has identified 17 or 18 ancestral linkage groups (ALGs) [6-10]. Each of these ALGs consists of a set of ancestral genes whose descendants share conserved synteny in multiple sequenced genomes. These ALGs have been interpreted to correspond to ancestral metazoan chromosomes, and correlations between inferred rates of gene movement between ALGs across the metazoan tree suggest that these ancestral linkage relationships are conserved through the action of selective constraints on a subset of genes [11].

*Correspondence: nputnam@gmail.com
1 Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
2 Department of Biochemistry and Cell Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
Full list of author information is available at the end of the article

© 2014 Nossa et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Nossa et al. GigaScience 2014, 3:9
http://www.gigasciencejournal.com/content/3/1/9
The relatively small number of genomes from anciently distinct metazoan lineages and the fragmented nature of draft genome assemblies still limit both the search for ancient whole genome duplications and the power of the data to constrain models of chromosome-scale genome structure evolution. While WGS sequencing technology and assembly methods are active areas of research and technological development, and have improved at a dramatic pace in recent years, high quality de novo assembly of large, complex metazoan genomes remains a difficult and resource-intensive problem. Without genetic or physical maps, or reliance on a high-quality reference genome of a closely-related species, WGS sequencing projects still typically produce assemblies containing thousands of scaffolds, hundreds of scaffolds incorrectly joining sequence from different chromosomes, or both [12].

Next-generation sequencing has greatly reduced the cost of constructing high density genetic maps by eliminating the need to develop and genotype polymorphic markers individually [13]. This has been achieved either by focusing sequence coverage within or adjacent to genomic regions of distinct biochemical character, such as restriction sites with restriction site associated DNA sequencing (RAD-seq) and related methods [14,15], or by combining information across regions using a reference genome sequence [16,17]. While RAD-seq is applicable to organisms lacking a reference genome assembly, it is not directly applicable to comparisons of genome organization across long evolutionary time spans because such comparisons rely on the identification of homologous sequence markers (typically protein-coding genes), which typically have only a small overlap with the restriction-associated markers.
Here we present a genotype-by-sequencing method for constructing a high-density genetic map using low-coverage, low-cost, whole genome sequencing data from the offspring of a wild cross. In this joint assembly and mapping (JAM) approach, the traditionally independent and sequential steps of genome assembly, polymorphic marker identification and genetic map construction are combined. Existing assemblers expect lower densities of sequence polymorphism, deeper coverage, greater computer memory or more aggressive quality trimming that decreases sequence coverage [18-20]. Our current implementation focuses on conservative assembly of short scaffolds sufficient for map construction, but our results suggest that further integration of genetic mapping information within whole genome shotgun assembly methods can be a cost effective way to produce assemblies of large, complex genomes with chromosome-scale contiguity.

We have applied this approach to produce a genetic map of the genome of the Atlantic horseshoe crab, Limulus polyphemus. Horseshoe crabs are marine arthropods with a fossil record extending back 450 million years [21]. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits [22], and are often cited as examples of "living fossils". L. polyphemus has a genome about 90% the size of the human genome. It is an important species from ecological, commercial and conservation perspectives [23], and has been used as a model system for research in behavioral ecology, physiology and development [24]. The map and SNP markers described here will be a resource for the L. polyphemus genome project, research in horseshoe crab population biology, and comparisons of metazoan genome organization. By anchoring protein coding genes to this map, we are able to extend analysis of ancestral linkage groups and whole genome duplications to the chelicerate lineage.

Data description

A pair of naturally spawning horseshoe crabs and their eggs were collected from their natural habitat on the beach at Seahorse Key. The larvae hatched in the laboratory 4 weeks after the collection date. Tissue samples from the third walking legs of the parental horseshoe crabs and 34 larvae were used for DNA
extraction and library preparation following the manufacturers' standard protocols. The two parents and 34 larvae were individually barcoded during library preparation. Illumina paired-end libraries with insert sizes of approximately 300 bp were prepared for each sample. These libraries were pooled for subsequent sequencing on the Illumina HiSeq 2000 platform at the Medical College of Wisconsin Sequencing Service Core Facility. A total of 1.7 billion 100 bp paired-end (PE) reads were obtained after quality filtering. The total sequencing coverage was estimated as 38.9× based on the k-mer frequency distribution. The raw sequencing data can be retrieved from NCBI SRA via NCBI BioProject accession PRJNA187356.

Analyses

Assembly and mapping

The JAM method is designed to produce a combined assembly of polymorphic sequences, tagged by genomic regions with a maximum of one single nucleotide polymorphism (SNP) per k-mer window (Methods). Starting with genomic reads from a mating pair of adult L. polyphemus and 34 offspring (100 bp paired-end reads on 300 bp inserts), we analyzed 1.7 billion reads containing at least one high-quality 23-mer.

Fitting Poisson models for unduplicated sequence to the frequencies of filtered 23-mers suggests that 1.1 billion genomic loci are unique at this resolution (Figure 1). This fit assumes that the vast majority of mappable genomic loci have only one or two alleles represented in the parents' four haploid contributions, which gives rise to the four components plotted in Figure 1. Of these sufficiently unique loci, 63% are modeled as homozygous, 27% as paired major-minor alleles, and the remaining 10% as tied allele pairs. The corresponding SNPs, if always at least 23 bases apart, would be 1.6% of bases in the unique loci. Dividing the total number of filtered 23-mers by the modeled homozygous depth of coverage d = 38.9 yields an estimated genome size of 2.74 billion bases, consistent with the measured DNA content of 2.8 pg (1 pg DNA ≈ 978 Mb) [25].
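The size and SNP-density figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function names are illustrative, and the SNP fraction assumes (as the text does) exactly one SNP per bi-allelic 23-mer locus.

```python
# Illustrative arithmetic only; constant and function names are not from the paper.
BASES_PER_PG = 978e6  # conversion quoted in the text: about 978 Mb per pg of DNA


def genome_size_from_mass(picograms):
    """Genome size in bases implied by a measured DNA content."""
    return picograms * BASES_PER_PG


def genome_size_from_kmers(total_kmers, homozygous_depth):
    """Genome size as (number of sequenced k-mers) / (depth of coverage d)."""
    return total_kmers / homozygous_depth


def snp_fraction(frac_major_minor, frac_tied, k):
    """Fraction of bases that are SNPs if each bi-allelic k-mer locus holds one SNP."""
    return (frac_major_minor + frac_tied) / k


if __name__ == "__main__":
    print(f"{genome_size_from_mass(2.8) / 1e9:.2f} Gb implied by 2.8 pg")
    print(f"{snp_fraction(0.27, 0.10, 23):.1%} of bases in unique loci")
```

With 27% major-minor and 10% tied loci, the SNP fraction is 0.37/23, which rounds to the 1.6% given in the text.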
We categorize specific 23-mers by their edit distances to others: having no neighbors within a single base substitution (unique tags) or with a single mutually unique one-substitution neighbor ("SNPmer pair" tags). A subset of these, including SNPmer pairs for approximately 7.9 million SNPs, constitute the tags used for contigging and scaffolding. The SNPmer pairs account for approximately 45% of the modeled fraction of alleles; the others are missed because of similarity to other sequences (e.g., due to repeats) or distance from each other (because of indels or multiple SNPs per 23-mer).

Chaining these 23-mers together (see Methods) produces an initial 6.6 million contigs, 3.9 million of which are linkable by paired reads for scaffolding. Applying Bambus [26] produces 944,000 scaffolds spanning 1.3 billion bases (Table 1). These scaffolds serve as markers incorporating multiple 23-mer tags, including SNPmer pairs used to identify haplotypes.

After assembly, the mean density of SNPs across the four parental haplotypes in assembled regions was estimated based on read re-alignments to be 7.6 per thousand bases. We jointly inferred the phases of these SNPs and segregation pattern (offspring genotypes) in the mapping cross for each marker in a maximum likelihood framework (Methods). We focused on the 91,320 markers with at least 18 inferred bi-allelic SNPs for constructing the linkage map. These markers grouped into 1,908 high-confidence map bins (i.e., unique segregation patterns, assumed to correspond to loci in the genome uninterrupted by meiotic recombination in the cross) [27]. Map bins fell into 32 linkage groups (Figure 2), close to the 26 pairs (2N = 52) previously found in a cytogenetic analysis of two chromosome spreads [28]. Twenty map bins were removed for having inconsistent positions in the maternal and paternal maps, and 12 were singletons.
To estimate the frequency of incorrect genotype calls as a function of the log likelihood difference between the called and alternative genotype (genotype confidence score), including contributions from uncertainty in SNPmer identification, assembly and sampling noise, we carried out a simulation of the library pooling and sequencing, k-mer assembly and genotype inference protocols, using the sequenced Ciona intestinalis genome as a starting point.

Figure 1 Fitting Poisson distributions to Limulus 23-mer frequencies. The distribution of sequenced 23-mers, modeled as sampling genomic loci that are homozygous (single allele shared by all haplotypes), have two alleles that are tied (A and a present in parents as AAaa or AaAa), or have two alleles in a major-minor relationship (present in parents as AAAa or Aaaa). Alleles whose parental contributions are homozygous, tied, major or minor each have a frequency peak corresponding to their distinctive fraction of the overall depth of genomic sequencing d. Loci sharing simple or repetitive sequence not sufficiently unique at a 23-mer scale contribute to a long tail off the right edge of the plot.

In the simulated C. intestinalis dataset (Methods), a single stretched exponential distribution provided a good fit to the frequency of genotype calling errors as a function of the call confidence score for scores up to 6, or down to an error frequency of approximately 1%. The observed error frequency declined more slowly for higher confidence scores. The minimum χ² fit used for estimating the genotyping error rate in the L. polyphemus map bins was $p_e(s) = a_1 e^{-s^{c_1}/b_1} + a_2 e^{-s^{c_2}/b_2}$, with parameter values a1 = 0.49, b1 = 2.08, c1 = 1.26, a2 = 5.47, b2 = 0.17, c2 = 0.16 (Figure 3).

Applying this model to the L. polyphemus marker genotype calls, we estimated that the genotype calling error rate in the map bin representative markers was 0.0099. We observed that 51% of adjacent map bin pairs are separated by a single inferred recombination event in the cross, and 94% are separated by three or fewer recombinants in each parent.
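To make the shape of the fitted error model concrete, the sketch below evaluates it at a few confidence scores. It assumes the model is a sum of two stretched exponentials of the form a·exp(−s^c / b); that functional form is a reading of the published formula rather than something confirmed here, and only the six parameter values are taken from the text. Function and variable names are illustrative.

```python
import math

# Parameter values quoted in the text. The functional form below is an ASSUMPTION.
PARAMS = {"a1": 0.49, "b1": 2.08, "c1": 1.26, "a2": 5.47, "b2": 0.17, "c2": 0.16}


def stretched_exp(s, a, b, c):
    """One stretched-exponential term, assumed form a * exp(-(s ** c) / b)."""
    return a * math.exp(-(s ** c) / b)


def error_rate(s, p=PARAMS):
    """Assumed two-term model for genotype-call error frequency at score s."""
    fast = stretched_exp(s, p["a1"], p["b1"], p["c1"])  # dominates at low scores
    slow = stretched_exp(s, p["a2"], p["b2"], p["c2"])  # dominates at high scores
    return fast + slow


if __name__ == "__main__":
    for score in (1, 3, 6, 10):
        print(score, f"{error_rate(score):.4f}")
```

Under this assumed form the first term carries most of the error up to a score of about 6 and the second term takes over beyond that, which is the behaviour the text describes (a slower decline at higher confidence scores).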
Of the 91,320 markers with at least 18 putative SNPs, 84,307 (92%) were assigned to their closest map bins with a threshold of (Methods), for an estimated genome-wide average density of one mapped sequence marker every 32 kb. A mean of 45 markers were mapped to each map bin, and the number of markers mapped was used to estimate the relative physical size of map bins. Approximately 46% of the scaffolds with 12–17 SNPs could be placed with the same threshold, for an additional 32,688 markers, or one marker every 23 kb.

The total length of the scaffolds assigned to map bins was 411 Mb, and they contained 2.67 million bi-allelic SNPs assigned a phase with a posterior probability of at least 0.99. Of these, 72% were inferred to be unique to one of the four parental chromosomes. This is close to the 74% predicted under the finite sites neutral coalescent model given the observed SNP density [29].

Sequence composition and recombination rate

In the scaffolds longer than 1 kb (N = 378,506 and mean length = 2.9 kb), the G/C base content was 33.3 ± 2.8%, and the local relative frequency of CpG dinucleotides was bimodally distributed, with about 30% of sequences exhibiting depletion of CpG. TpG and CpA dinucleotides were over-represented on average and their local densities negatively correlated (r = −0.54, p < 2.2e-16) with CpG density, suggesting ongoing germ-line CpG methylation for a fraction of the genome [30].
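The map-bin and marker-density figures quoted earlier in this section follow from simple ratios. The sketch below reproduces them as a check; it assumes the 2.74 Gb genome size estimated from the 23-mer fit, and the names are illustrative.

```python
# Check of the marker-density arithmetic; all inputs are numbers quoted in the text.
GENOME_SIZE_BP = 2.74e9          # estimated from the 23-mer frequency fit
MARKERS_18_PLUS = 84_307         # mapped markers with at least 18 SNPs
MARKERS_12_TO_17 = 32_688        # additional mapped markers with 12-17 SNPs
MAP_BINS = 1_908 - 20 - 12       # bins left after removing inconsistent bins and singletons


def bp_per_marker(genome_size_bp, n_markers):
    """Average physical spacing between mapped markers."""
    return genome_size_bp / n_markers


if __name__ == "__main__":
    print(MAP_BINS, "map bins")
    print(f"{bp_per_marker(GENOME_SIZE_BP, MARKERS_18_PLUS) / 1e3:.1f} kb per marker")
    combined = MARKERS_18_PLUS + MARKERS_12_TO_17
    print(f"{bp_per_marker(GENOME_SIZE_BP, combined) / 1e3:.1f} kb per marker (combined)")
    print(f"{MARKERS_18_PLUS / MAP_BINS:.1f} markers per map bin")
```

The bin count comes out at 1,876, matching the number of distinct genetic intervals given in the abstract.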
The mean maternal and paternal recombination rates were estimated to be 1.28 and 0.76 centimorgans per megabase respectively, consistent with expectation based on the negative correlation between recombination intensity and genome size observed in previous studies [31]. We did not observe evidence of segregation distortion for any map bins. The correlation in local recombination rates in the two parents across the genome is estimated as r² = 7.1% (p < 1e-29), suggesting considerable variation in recombination landscape between the two sexes in Limulus [32-34]. A positive correlation between local recombination rate and local SNP density was observed (r² = 9.7%; p < 1e-40), which is consistent with previous observations in human, with a comparable correlation coefficient [35].

Ancestral linkage group conservation

We found that 34,942 scaffolds have significant sequence conservation with 10,399 predicted proteins of the tick Ixodes scapularis: like Limulus, a chelicerate, but one with a well-annotated genome [36]. Of these hits, 6,246 formed reciprocal best pairs, of which 5,775 (92%) could be placed on the linkage map at a threshold of p <. These were used as conserved markers for comparisons of genome organization. When linkage groups were divided into 108 non-overlapping bins of 1,000 markers, 52 had significant (p < 0.05, after Bonferroni correction for 1,944 pairwise tests) enrichment in shared orthologs (or "hit") with at least one of eighteen ancestral chordate linkage groups [7]. A hidden Markov model segmentation algorithm [6] identified 40 breakpoints in ALG composition in the linkage groups.
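The 1,944 pairwise tests correspond to 108 map bins times 18 ALGs. As a rough sketch of how one such enrichment test could be run, the code below computes a one-sided Fisher exact p-value (the upper tail of the hypergeometric distribution) and compares it with the Bonferroni-corrected threshold. The function names and the example counts are hypothetical, and the authors' exact test setup may differ.

```python
from math import comb

N_TESTS = 108 * 18  # map bins x ancestral linkage groups


def enrichment_p_value(population, successes, draws, observed):
    """One-sided Fisher exact test: P(X >= observed) for a hypergeometric X.

    population: total conserved markers; successes: markers assigned to one ALG;
    draws: markers in one map bin; observed: markers in the bin assigned to that ALG.
    """
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return tail / comb(population, draws)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p = enrichment_p_value(population=5775, successes=300, draws=50, observed=15)
    print(p < bonferroni_threshold(0.05, N_TESTS))
```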
Approximately 72% of the genome is spanned by 53 intervening segments that hit one or (for eight of them) two ALGs (Figures 4 and 5). Each of the eighteen ancestral ALGs has at least one hit among the 45 segments with a unique hit to the ALGs.

Table 1 K-mer contig and scaffold statistics

| Assembled | Count | Total (bp) | Avg. span (bp) | n50 span (bp) |
| --- | --- | --- | --- | --- |
| k-mer contigs | 6,614,434 | 1,240,275,515 | 188 | 418 |
| Linkable contigs | 3,925,844 | 1,137,576,911 | 290 | 460 |
| Initial scaffolds | 944,246 | 1,261,263,172 | 1336 | 3047 |
| Reference scaffolds | 944,246 | 1,295,334,515 | 1372 | 2930 |
| Reference bases | | 1,131,458,744 | 1198 | 2553 |

Whole genome duplications

Whole genome duplication (WGD), or polyploidization, is a rare but dramatic genetic mutation event which doubles the size of a genome and creates a redundant pair of copies of every gene. Because it creates redundant copies of genes for entire biochemical pathways and genetic networks, it has been proposed that it creates unique raw material for the evolution of novel biological functions and increased complexity.

Figure 2 Each of the numbered blocks represents one of the 32 linkage groups of the L. polyphemus genetic map, and is composed of four columns: Two bands of the triangular matrices in which the color scale indicates the fraction of samples showing recombination between pairs of markers; maternal recombination frequency is shown on the left, paternal on the right. A column labeled "ALG" indicates segments of significant (p < 0.05 in Fisher's Exact Test, after Bonferroni correction for multiple tests) conservation of gene content with ancestral bilaterian linkage groups. The column labeled HOX shows the map positions and types of predicted homeobox transcription factor genes. The two color scales are for: recombination frequency between pairs of markers and log p-value for enrichment in gene content with ancestral linkage groups.

Homeobox gene clusters

Homeobox genes encode a large family of transcription factors involved in diverse embryonic patterning and structure formation processes of eukaryotes. As a particular
subfamily of homeobox genes, the Hox cluster is known to control metazoan body patterning along the anterior-posterior axis. We identified 155 scaffolds with significant homology to predicted chelicerate homeobox gene sequences in public databases. We classified these sequences into homeobox subfamilies (Methods) and placed them on the map by best hit. Two large clusters of Hox genes are found on linkage group (LG) 15 and LG21, each containing multiple members of the anterior, central and posterior classes. There are also two parahox cluster homologs, each with three homeobox genes: gsx and cdx orthologs and a third homeobox gene not confidently assigned to a subfamily in our analysis (LG5 and LG19). There are two smaller clusters containing multiple hox genes (LG18 and LG20), and clusters of other homeobox genes, including members of the msx, lbx, nk, evx and gbx families (Figure 2).

Figure 3 Genotype call error rate as a function of call confidence score for bins of 10,000 calls in simulated Ciona intestinalis genome data. The stippled blue region shows 95% confidence intervals of the Bayesian posterior probability distribution of the underlying error rate computed from the Beta distribution $\mathrm{Beta}(n_e + 1,\, n_c - n_e + 1)$ conjugate to the assumed binomial distribution of observed errors, where $n_e$ and $n_c$ are the number of errors and number of calls in each bin respectively. The red curve shows the best fit error model $a_1 e^{-s^{c_1}/b_1} + a_2 e^{-s^{c_2}/b_2}$, with parameter values a1 = 0.49, b1 = 2.08, c1 = 1.26, a2 = 5.47, b2 = 0.17, c2 = 0.16. Reduced χ² = 0.82.

Figure 4 Limulus-Human macro-synteny dot plot. Blue points indicate the position of human genes in reconstructed ancestral chordate ALGs (vertical displacement) and their candidate orthologs in the 30 L. polyphemus linkage groups (horizontal displacement).

Genomic distribution of paralogous genes

WGD creates many pairs of duplicate genes or "paralogs". The distinctive features of these genes have been used to
infer WGD events in fungal, vertebrate and plant genomes [2-4]. We examined the genomic distribution of 2,716 pairs of candidate paralogous gene markers in L. polyphemus for signatures of WGD. In 45% of these pairs both markers mapped to the same chromosome, compared with 5.3 ± 0.5% in 1,000 datasets with randomly-permuted paralogous gene identities. The mapped positions of pairs within the chromosomes were highly correlated (average r² = 0.81, and exceeding 0.95 for 8 of the large chromosomes; Figure 6), suggesting that many of the pairs represent recent tandem gene duplicates or single genes fragmented across multiple markers. In the following, these same-chromosome paralogs are referred to as "tandem duplicates".

Figure 5 Limulus-Human macro-synteny dot plot as in Figure 4, showing breaks introduced by hidden Markov model segmentation of the linkage groups as vertical lines.

Figure 6 Genomic distribution of candidate paralogs. The map positions of pairs of putatively paralogous protein coding genes within the L. polyphemus genome are plotted with blue points. Pairs are biased toward nearby map positions, and therefore concentrated along the diagonal. Also, paralogs split between linkage groups are significantly clustered into "paralogons".
Inter-chromosomal duplicates are clustered into conserved paralogous micro-synteny blocks (or "paralogons" [3]): there are 25 pairs of loci, each with at least six (mp = 6) independent paralog pairs clustered with a maximum gap (max-gap) of 300 markers between adjacent paralogs in each cluster. Paralog pairs are considered independent if they are based on homology to a distinct out-group gene, to guard against relying on either multiple exons of the same gene, or recently-duplicated genes, as independent evidence of ancient segmental paralogy. These clusters span 25,044 markers, or 30% of the map, after removing redundancy from paralogons with overlapping footprints. In 1,000 datasets with randomly-permuted paralogous gene identities, the maximum number of such clusters observed was 11; the mean and standard deviation were 3.9 ± 1.0. The observed clustering into paralogons was greater than that in the randomized datasets over a broad range of choices of max-gap and mp. For example, for max-gap = 100, mp = 3 there are 52 clusters vs. 3.5 ± 1.9, range 0–10; for max-gap = 500, mp = 9 there are 12 vs. 2.9 ± 1.7, range 0–9. Because of the large proportion of apparent tandem gene duplicates (45%), this randomization scheme increases the number of inter-chromosomal paralog pairs relative to the data, making it a conservative significance test for inter-chromosomal paralog clustering. When genes with tandem duplicates are excluded from the randomizations, the observed number of clusters is greater than the maximum observed in 1,000 randomizations for all the combinations of max-gap in the set (100, 200, 300, 400, 500, 600) and mp in the set (3, 4, 5, 6, 7, 8, 9). The 23 clusters found with max-gap = 600, mp = 7 span 59% of the map, compared with a mean number and map coverage of 3.3 ± 1.8 and 13 ± 6%, respectively, in these randomizations.

Among the marker pairs mapping to different chromosomes, we found a significant excess of pairs relating segments derived from the same ALG relative to randomization controls (247 pairs vs. 102 ± 11, p < 0.001 in randomizations of all genes; 202 vs. 46 ± 7 when genes in tandem duplicates are excluded). This pattern is consistent with the creation of these segments by duplication (rather than fission).
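The randomization controls used throughout this section share one pattern: compute a statistic on the observed paralog pairs, then recompute it on many datasets in which the pairing has been shuffled. The sketch below illustrates that pattern for the simplest statistic, the fraction of pairs on the same chromosome. It is a toy under stated assumptions: the data are synthetic, the shuffle simply reassigns partners at random, and the function names are not from the authors' software.

```python
import random


def same_chrom_fraction(pairs):
    """Fraction of paralog pairs whose two members lie on the same chromosome."""
    return sum(a == b for a, b in pairs) / len(pairs)


def permuted_fractions(pairs, n_perm=1000, seed=0):
    """Null distribution of the statistic when partner identities are shuffled."""
    rng = random.Random(seed)
    firsts = [a for a, _ in pairs]
    seconds = [b for _, b in pairs]
    null = []
    for _ in range(n_perm):
        rng.shuffle(seconds)  # break the real pairing, keep chromosome frequencies
        null.append(same_chrom_fraction(list(zip(firsts, seconds))))
    return null


def toy_pairs(n_pairs=100, n_same=45, n_chrom=10):
    """Synthetic pairs of chromosome labels: the first n_same pairs are co-located."""
    pairs = []
    for i in range(n_pairs):
        a = i % n_chrom
        b = a if i < n_same else (a + 1 + i % 3) % n_chrom  # offset 1..3, never equal to a
        pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    observed = same_chrom_fraction(toy_pairs())
    null = permuted_fractions(toy_pairs())
    exceed = sum(x >= observed for x in null)
    print(observed, sum(null) / len(null), (exceed + 1) / (len(null) + 1))
```

The same loop works for other statistics, such as the number of max-gap clusters, by swapping out `same_chrom_fraction`.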
The max-gap clusters have a significant amount of overlap among their footprints. For example, the footprints of the max-gap = 600, mp = 7 clusters had a total length of 72,072 markers, but a net footprint after redundancy removal of 49,545 markers. A genomic region a which has conserved synteny with two other regions of the genome (b and c) could arise either from mixing of adjacent regions (through local genome rearrangements) with homology to b and c respectively, or by successive duplications. In the latter case b and c are also homologous. We examined the relationships among the paralogons for evidence of successive rounds of duplication. We considered a graph in which nodes correspond to merged, non-redundant paralogon footprint regions. Nodes are connected with edges if a max-gap cluster connects the two nodes. The average clustering coefficient of this graph is equal to the probability that footprints a and c share a max-gap cluster, given that there are edges (a, b) and (b, c) in the graph. We compared the clustering coefficients to those found in random Erdős–Rényi graphs with the same number of nodes and edge probability as the observed graph. We found that the observed data show significantly more clustering than these random graphs for a wide range of choices of max-gap and mp. For example, for max-gap = 600, mp = 7, the average clustering coefficient is 0.19, while 10,000 random graphs had coefficients of 0.034 ± 0.042, p = 0.0039.

Age distribution of paralogous genes

Because WGD events create many paralogs at the same time, they leave characteristic peaks in the age distribution of paralogous genes. In L. polyphemus the distribution shows peaks centered at 0.71 and 1.34 substitutions per synonymous site (Ks), values within the approximately linear response range of Ks estimates to WGD age [37] (Figure 7). For comparison, the synonymous site divergence between an Asian horseshoe crab species, Tachypleus tridentatus, and L. polyphemus has a mode of 0.35. The common ancestor of these species has been estimated to have lived 114–154 million years ago (MYA), coincident with the opening of the Atlantic ocean [38], suggesting a WGD event 230–310 MYA, and possibly an older one 450–
600 MYA.

Discussion

Our results demonstrate that a low cost, combined approach to whole genome sequencing and genetic mapping can be used to efficiently create a very high density genetic recombination map for a non-model organism with a large genome. Because the approach uses genome-wide sequencing, a large number of sequence markers can be anchored to the map, allowing comparisons of genome organization at the chromosome scale over very large evolutionary divergences. The identification of chromosomal segments with significant gene composition homology to each of the chordate ALGs demonstrates that the predominance of fusion and mixing of ancestral linkage groups previously observed in analyzed ecdysozoan genomes [10] is not ancestral to, or universal in, the clade.

The map allows quantitative characterization of other features of chromosome-scale organization, such as the correlation between local recombination rate and polymorphism levels. Similar positive correlations between local recombination rate and polymorphism level have been observed in other metazoans including humans [39-41] and plants [42,43]. Future comparisons with more closely related chelicerates will allow tests to distinguish whether these rates are positively correlated with inter-specific divergence, consistent with a neutral process of correlated mutation and recombination rates [35]. Alternatively, the association could be explained by hitchhiking and background selection [44].

The enrichment of inter-chromosomal paralog pairs in segments of the same ALG origin is consistent with their creation by duplication (rather than fission), although because small-scale duplication is biased toward local (tandem) duplication, fission of segments could also leave behind an enrichment of paralogs. Such a mechanism, however, would not create the observed organization of paralogs, that is, their clustering into "paralogons". The fact that these paralogons span a large portion of the map (59%) suggests that it was a whole genome duplication, rather than segmental duplications, that gave rise to the pattern.
The existence of duplicated hox and parahox clusters on four different chromosomes is highly suggestive of multiple whole genome duplication. Hox clusters have not been found in duplicate copies except in vertebrates, where they have been created by whole genome duplication, and have only rarely been subsequently lost. The double-peaked shape of the distribution of synonymous site divergence between pairs of paralogs, combined with the existence of two small clusters of HOX genes in addition to the two complete HOX clusters, suggests that there may have been two rounds of whole genome duplication in the horseshoe crab lineage.

WGDs preceded major species radiations in vertebrates, angiosperms and teleost fish, and the importance of their role in evolution is the subject of long-running debate [1-4]. The discovery of whole genome duplication in an invertebrate, and during horseshoe crabs' long and famously conservative evolutionary history, suggests that such events may have been more common than previously assumed in metazoan evolution, and that while they may have provided raw material for adaptive evolution in some cases, they are not evolutionary drivers.

Methods

Joint assembly and mapping (JAM) overview

Barcoded genomic DNA libraries were created, pooled, and sequenced in four lanes on the Illumina HiSeq 2000 platform for a mating pair of L. polyphemus and 34 offspring.

Figure 7 Distribution of estimated non-synonymous (Ka; top) and synonymous (Ks; bottom) sequence divergence rates for pairs of putative L. polyphemus paralogs (left) and L. polyphemus–T. tridentatus orthologs (right).
The JAM method proceeds through three major phases:

1. The frequencies of DNA sub-sequences of fixed length k (k-mers) are profiled to characterize the quality, uniqueness, polymorphisms and repetition in genomic reads, using software we developed building on work from the Atlas assembler [45]. Allelic pairs of k-mers representing alternate forms of SNPs are identified and tracked through the subsequent steps.
2. Contigs are assembled on a graph of unique k-mers and paired SNP k-mers sampled to reduce memory usage, then ordered and oriented using the Bambus scaffolder [26,46]. Each multi-SNP scaffold is treated as a single marker for the linkage mapping steps.
3. The paired SNP k-mers in each scaffold are combined with the read, mate-pair, and parent- or offspring-library associations of their alleles for haplotype phasing and construction of a high density genetic linkage map.

The software is publicly available as open source software on GitHub [47].

Sampling and sequencing

The parental horseshoe crabs and their eggs were collected from their natural habitat on the beach at Seahorse Key, an island along the west coast of north Florida, on 27 March 2010. This naturally spawning pair was observed as the eggs were being laid and fertilized (fertilization is external in this species, i.e., the eggs are fertilized in the sand under the female as the eggs are being laid). Tissue samples were collected from the third walking legs of this parental pair. We marked where the egg samples were laid, returned a few hours later, dug up the nest and removed the fertilized eggs. We have also conducted paternity analyses that show that fertilization is by the associated male and not by extraneous sperm that might be at the nesting site (in this case the density of nesting females was low on this day, so we know that the eggs we collected were from the pair we observed). Trilobite larvae were reared in plastic dishes as previously described and hatched from the eggs 4 weeks later [48]. Tissue samples and larvae were preserved in RNAlater.
Genomic DNA purification and library construction were carried out using Qiagen DNeasy, Illumina TruSeq and Nextera kits, following manufacturers' protocols. Barcoded samples were pooled and sequenced on the Illumina HiSeq 2000 platform.

Limulus larvae were processed as follows: each larva, suspended in 100 µL of RNAlater and stored at −80°C in a 1.5 mL Eppendorf tube, was thawed on ice, after which the RNAlater was removed. DNA was extracted using the Qiagen DNeasy kit per the manufacturer's protocols. DNA was quantified using a PicoGreen DNA quantitation kit. To prepare TruSeq libraries, DNA was first purified another time using Zymo genomic DNA clean columns per the manufacturer's protocols. Adult L. polyphemus DNA was prepared as above, but using claw tissue rather than whole larvae. All DNA extracts were tested by gel electrophoresis to ensure DNA was not degraded. TruSeq libraries were prepared at the University of Georgia's Georgia Genomics Facility. 1–5 µg of sample DNA was subjected to fragmentation using a Covaris sonicator. Fragmented DNA was then used for library construction using Illumina TruSeq library prep kits. Libraries were pooled together in equimolar amounts (for 10 larvae) and used for the first sequencing run in separate lanes for the parental and larval pools. For larval samples 11–34, library prep was switched from TruSeq to Nextera kits. Nextera library preparation was performed according to the manufacturer's protocol. The Nextera library product was quantified by PicoGreen, and fragment size distribution was checked using a Lonza FlashGel, to ensure that fragment size distribution was between 300 and 1,000 bp. Sample libraries were pooled in equimolar concentrations and sent for the second sequencing run in two lanes, each on a pool of 12 larvae. Both sequencing runs, comprising four library pools, were performed on the Illumina HiSeq 2000 platform at the Medical College of Wisconsin Sequencing Service Core Facility.
A total of 1.7 billion Illumina reads qualified for k-mer analysis and assembly by containing at least 23 consecutive Q20 bases. The maternal library accounted for 13% of these reads, the paternal library 7.4%, and the 34 offspring libraries for 2.4% on average, 0.64% at minimum.

k-mer decomposition

We determined a lower bound on the k-mer size long enough for a given expectation of uniqueness in a random genome. While increasing k reduced the rate of coincidentally repeated k-mers, it also reduced the effective depth of coverage (due to untrimmed errors and edge effects at read ends) and increased the cases of multiple SNPs per k-mer locus, which are not tracked in our current software implementation. We can approximately model a genome-scale string G of random nucleotides as G samples taken with uniform probability from the space of all k-mers (of size $4^k/2$ for odd k-mers treating reverse-complements as the same; slightly more for even k). The Poisson distribution then gives the probability that a location in G has its own, unique k-mer (shared with no other location) as $e^{-\lambda}$, where $\lambda = G/(4^k/2) = 2G/4^k$. The probability of a location sharing its k-mer is then $1 - e^{-\lambda}$; thus, to limit the maximum rate R of G-locations sharing k-mers, we require $k \ge \lceil \log_4(-2G/\ln(1 - R)) \rceil$. For example, for a mammalian-scale genome of approximately 3 billion bases, and R = 0.1%, we chose k ≥ 22. For Limulus polyphemus, the Animal Genome Size Database [25] reports an estimated haploid genome size of 2.80 pg and, as each picogram represents almost a billion nucleotide base-pairs of DNA, the mammalian-scale choice of k applies [19,49].
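The lower bound on k can be evaluated directly. The sketch below is a minimal illustration of the Poisson argument above, assuming a random genome and a k-mer space of size 4^k/2; the function names are illustrative and not taken from the JAM software.

```python
import math


def shared_kmer_rate(genome_size, k):
    """Probability that a genome location shares its k-mer with another location."""
    lam = 2.0 * genome_size / 4 ** k  # expected coincidences; k-mer space is 4**k / 2
    return 1.0 - math.exp(-lam)


def min_k(genome_size, max_shared_rate):
    """Smallest k with shared-k-mer rate at most max_shared_rate (random-genome model)."""
    bound = math.log(-2.0 * genome_size / math.log(1.0 - max_shared_rate), 4)
    return math.ceil(bound)


if __name__ == "__main__":
    print(min_k(3e9, 0.001))      # mammalian-scale example from the text
    print(min_k(2.74e9, 0.001))   # Limulus-scale genome
    print(shared_kmer_rate(3e9, 23))
```

Both genome sizes give the same bound, and the next odd value, k = 23, is the one used for the Limulus analysis.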
This lower bound ignores chemical and biological sequence biases, so selecting k for a real genome project requires attention to error rates, repeats (tandem and interspersed), and genome size, all known vaguely, if at all, before sequencing. Studying the k-mer distributions after sequencing can clarify these genomic properties as we select k to maximize the net yield of candidate k-mer tags, between errors and with at most one SNP location, in sequencing reads. We converted Illumina/Solexa FASTQ format (paying attention to the different quality encodings of the software versions) into FASTA format [50], masking (replacing with N) any base with Phred-scale [51] quality below 20, and soft-masking (representing in lowercase) other bases with quality below 30. For initial trimming experiments, we varied these quality thresholds as indicated below. We stored k-mers in hash tables with open addressing [52], supporting odd k <= 31. We tallied for each k-mer a bit vector for presence or absence in up to 64 sample libraries (36 for Limulus parents and offspring), and an overall count of occurrences in all libraries (count limited to 64 − 2k bits). Where the k-mer hash would be too large for available memories, we sampled the k-mers using a hash-slicing factor S (must be prime). Representing each k-mer as an integer, slice s consists of those k-mers whose remainder on division by S is s. We can tabulate one slice for a representative sample of 1/S of the k-mers (for initial estimation of depth of coverage and genome size) or, using S independent jobs, collect information for k-mers in all slices. Our hash tables stored odd-length k-mers so that reverse-complementary sequences can be combined without the ambiguous orientation of palindromic sequences (e.g., ATCGAT).

After selecting k as described above and making a full tabulation of k-mer counts and bit vectors, we filtered out k-mers not expected to represent genomic sequence. k-mers were required to have three copies in the total sequence set, with at least one copy in the initial run and one in the second run. This was partly to filter out incomplete adapter sequences, which can be difficult to trim, but were different in the two runs.
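The hash-slicing rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the canonical form is taken to be the lexicographically smaller of a k-mer and its reverse complement, and the base-4 code A=0, C=1, G=2, T=3 is our choice. Neither detail is specified in the text, and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding

def canonical(kmer: str) -> str:
    """Smaller of a k-mer and its reverse complement (assumed convention)."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def encode(kmer: str) -> int:
    """Base-4 integer encoding of an uppercase A/C/G/T k-mer."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_CODE[base]
    return value

def sliced_kmers(read: str, k: int, s: int, S: int = 11):
    """Yield canonical k-mers of `read` whose encoding is congruent to s mod S.

    k-mers containing a hard-masked base (N) are skipped; soft-masked
    (lowercase) bases are not handled in this sketch.
    """
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if "N" in kmer:
            continue
        form = canonical(kmer)
        if encode(form) % S == s:
            yield form

read = "ACGTTGCAAC"
for s in range(11):
    print(s, list(sliced_kmers(read, 3, s)))
```

Each k-mer lands in exactly one slice, so running the S slices as independent jobs covers every k-mer once.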
Extending methods developed for the Atlas assembler [45] to heterozygous sequences, Figure 1 gives a rough decomposition of the k-mer frequency distribution for 23-mers with quality ≥ 20, minimizing the square of the residuals of k-mer counts on frequencies 3 through 70 while not exceeding the observed counts. Four linked distributions model fractions of the genome as monoallelic or biallelic: homozygous regions with d = 38.9-fold coverage (dark blue), minor alleles covered at d/4 (green), tied alleles at d/2 (red) and major alleles at 3d/4 (purple). This fit is robust enough to confirm the abundance of major-minor allele pairs (27% of k-loci, vs. 10% for tied alleles), with the broader peaks in the data than in the fitted curves consistent with less uniform sampling (for example, varying coverage of parents and offspring). The Poisson decomposition suggests a density of polymorphisms of 1.2% in major-minor allele pairs, based on dividing the modeled number of such sequenced pairs by k (assuming most polymorphisms are SNPs spaced at least k bases apart), by d (the estimated depth of sequencing) and by the estimated genome size of 2.74 billion bases.

SNPmer identification

The filtered k-mer counts, computed in parallel, are loaded into a hash table with additional fields to track k-mers that are uniquely within one mismatch of each other. Because this step analyzes all (non-error) k-mers in one table, this requires a single large-memory processor (on the order of 32 GiB).

For each k-mer, we check all its 3k one-substitution neighbors. The k-mers are partitioned each into one of three categories: unique: having no edit-neighbors within one substitution; ambiguous: having either multiple one-substitution neighbors, or one neighbor that has multiple neighbors; or partnered: uniquely pairable with exactly one other k-mer differing by one substitution, such k-mers also known from now on as SNPmers or SNPmer pairs. For each SNPmer, we save the position of the substitution, a bit mask for the change (transition, complement, or non-complement transversion), and whether the canonical form of the partner in the table has the same sense or is reverse complemented with respect to this k-mer.
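The three-way partition above can be illustrated with a small in-memory sketch. It is a simplification under stated assumptions: k-mers are compared as plain strings, ignoring reverse complements and the bit-mask bookkeeping, and the names `neighbors` and `classify` are hypothetical.

```python
def neighbors(kmer: str):
    """Yield the 3k k-mers that differ from `kmer` by one substitution."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]

def classify(kmers):
    """Label each k-mer as 'unique', 'partnered' or 'ambiguous'."""
    present = set(kmers)
    # Observed one-substitution neighbors of every k-mer in the table.
    observed = {km: [n for n in neighbors(km) if n in present] for km in present}
    labels = {}
    for km, nbrs in observed.items():
        if not nbrs:
            labels[km] = "unique"
        elif len(nbrs) == 1 and len(observed[nbrs[0]]) == 1:
            # Exactly one neighbor, whose only neighbor is this k-mer.
            labels[km] = "partnered"
        else:
            labels[km] = "ambiguous"
    return labels

table = ["AAAAA", "AAAAC", "CCCCC", "GGGGG", "GGGGT", "GGGTT"]
for kmer, label in sorted(classify(table).items()):
    print(kmer, label)
```

In this toy table, AAAAA and AAAAC form a SNPmer pair, CCCCC is unique, and the three G-rich k-mers are ambiguous because GGGGT has two neighbors.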
Only partnered and unique k-mers will be further tracked. While this limited method cannot identify k-mers for genomic SNP and non-SNP locations with complete confidence, false pairing or missed pairing should have limited effects, as confirmed by assembly experiments with simulated Ciona sequence (see Error model calibration in Methods). False pairing, due to coincidental similarities or repeats, would combine nodes of the k-mer graph (see below) and cause noise in the scaffolding, haplotype phasing, and linkage analysis. Such misleading links are minimized by the robust edge requirements in contigging and scaffolding, described below. Missed pairing can happen from indel polymorphisms, SNPs separated by fewer than k − 1 positions, failure to sequence minor alleles, or ambiguity due to too many similar k-mers. Ambiguously non-unique k-mers will be skipped over (reducing connectivity of the k-mer graph if there are too many in a row). Where allelic k-mers misidentified as unique cause conflicting edges in the k-mer graph, nodes for unpartnered major alleles will either be chained into contigs with flanking unique sequence or left as orphaned fragments, and unpartnered minor alleles will be left as orphaned fragments. Overall, errors in identifying partnered and unique k-mers should shorten contigs and scaffolds and hide linkage, not promote false linkages.

Table 2 shows the totals and percentages of the different k-mer categories, counting each SNPmer pair as one k-mer. SNPmer pairs account for 16.3% of the putative genomically unique 23-mer markers; dividing by 23 gives us the fraction of bases in those markers that are putative SNPs: 0.71%.

Node k-mer selection

To reduce the memory requirements of our k-mer assembly graph, we select an approximate one-tenth subset of the SNPmer and unique k-mer tags.
In the case of a true SNP at least k − 1 bases from other SNPs and from gaps in error-free coverage of either allele, there will be k covering SNPmer pairs (provided that covering k-mers are also uniquely pairable). By taking only SNPmer pairs with the substitution in particular positions, we can reduce the size of the graph and its redundancy. Analyzing the distribution of substituted position for all the SNPmer pairs, we observe an enrichment for substitutions near the ends, probably due to proximity to low-quality sequence. By selecting positions 3, 12 and 21 of 23-base SNPmers, we avoid the most problematic positions and reduce this portion of k-mer nodes by a factor of 7.67.

Unlike for SNPmers, there are no canonical positions that identify the unique, unpaired k-mers. Several mechanisms have been proposed for sampling k-mers in a representative way [53,54]. We use the more pseudo-random hash-slicing rule, already discussed above, to sample a single slice of k-mers: those whose integer encodings are congruent to a particular slice number s, modulo S (the hash slicing factor). We have found that on the finished human genome (results not shown), hash slicing is effectively a Poisson sampling, with sampled k-mers spaced according to an exponential distribution.

A caveat in applying hash slicing is that taking the remainder modulo a prime is not very pseudo-random for Mersenne primes (equal to $2^p - 1$ for some p), when k-mers are represented in base-4 encoding [52]. We therefore pick a slicing factor of 11, the smallest non-Mersenne prime greater than our SNPmer sampling factor.
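One way to explore the Mersenne-prime caveat is to look at the multiplicative order of 4 modulo the slicing factor: if $4^e \equiv 1 \pmod S$, then swapping two bases that sit e positions apart never changes the slice. The sketch below is our own illustration of that arithmetic fact, not the authors' analysis. The encoding A=0, C=1, G=2, T=3 and all function names are assumptions.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding

def encode(kmer: str) -> int:
    """Base-4 integer encoding of a k-mer."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_CODE[base]
    return value

def order_of_4(S: int) -> int:
    """Smallest e >= 1 with 4**e % S == 1 (S must be odd and greater than 1)."""
    e, power = 1, 4 % S
    while power != 1:
        power = (power * 4) % S
        e += 1
    return e

def swap(kmer: str, i: int, j: int) -> str:
    """Return `kmer` with the bases at positions i and j exchanged."""
    chars = list(kmer)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

for S in (3, 7, 11, 13, 31):
    print(S, order_of_4(S))

# With S = 7 the order is 3, so exchanging bases three positions apart
# leaves the remainder, and hence the slice, unchanged.
kmer = "ACGTTGCAACGTACGTTAGCAGT"
assert encode(kmer) % 7 == encode(swap(kmer, 0, 3)) % 7
```

For the Mersenne primes 3, 7 and 31 the order is 1, 3 and 5 respectively, so many rearranged k-mers collide in the same slice. Note that 11 also has order 5, so this calculation on its own describes the size of the effect rather than showing that 11 avoids it.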
The resulting k-mer subset has 86.0 million unique unpaired k-mers and 24.0 million SNPmer pairs, each reduced as predicted, for a total factor of 9.8 reduction in k-mer nodes for the next step.

Contigging and scaffolding

Each 23-mer tag (unique k-mer or SNPmer pair) in the above subset is a node in the k-mer graph. Nodes are connected when the corresponding k-mers appear consecutively in at least one read of the input (any intervening k-mers having been skipped due to sampling or ambiguity). The relative orientation, distance and number of supporting reads of the k-mers are stored in the edge. When conflicting distance or relative orientation is observed among different reads for the same pair of k-mer nodes, all edges from both nodes in the corresponding direction are ignored in contigging.

The nodes of the k-mer graph represent DNA tags and have distinct upstream and downstream ends. One edge at each end of a node is identified as robust if supported by a supermajority of the reads for all edges in that direction: the number of supporting reads is greater than or equal to both (1) two plus the sum of the read counts for all other edges in that direction and (2) twice the read count of the next-most supported edge in the same direction. By this construction, a node has at most one robust edge on each end.

A mutually robust edge is defined as one that is robust going in both directions between the two nodes it connects.

Contigs are the connected components of the subgraph consisting of mutually robust edges. Singleton and circular contigs are reported for diagnostic purposes, but ignored in subsequent analysis. Each retained k-mer contig of Table 1 therefore represents a chain of nodes for SNPmers and unique k-mers not shared with other contigs.

After assembly of k-mer contigs, we connect them in longer structures using the Bambus scaffolder [26]. Because the contigs do not contain detailed read information, we map templates (read pairs) to contigs based on shared k-mer content, and divide the resulting graph of contigs linked by templates into batches small enough for Bambus to process. Batches are divided so that no contigs in different batches share templates.
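The robust-edge rule for contigging can be written down directly. The sketch below is a minimal illustration: edges at one end of a node are represented as a hypothetical dictionary from neighbor identifier to supporting read count, which is an assumption about data layout rather than the authors' implementation.

```python
def robust_edge(read_counts):
    """Return the neighbor reached by the robust edge at one node end, or None.

    `read_counts` maps neighbor id -> number of supporting reads for the
    edges leaving this end of the node (hypothetical representation).
    """
    if not read_counts:
        return None
    ranked = sorted(read_counts.items(), key=lambda item: item[1], reverse=True)
    best_id, best = ranked[0]
    others_total = sum(count for _, count in ranked[1:])
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    # Condition (1): at least two more reads than all other edges combined.
    # Condition (2): at least twice the support of the next-best edge.
    if best >= 2 + others_total and best >= 2 * runner_up:
        return best_id
    return None

def mutually_robust(u, v, end_counts):
    """True if the u-v edge is robust seen from u and seen from v."""
    return robust_edge(end_counts[u]) == v and robust_edge(end_counts[v]) == u

end_counts = {"u": {"v": 6, "w": 1}, "v": {"u": 6}, "w": {"u": 1, "z": 5}}
print(mutually_robust("u", "v", end_counts))
print(mutually_robust("u", "w", end_counts))
```

An edge supported by a single read is never robust under condition (1), and an edge with counts 5 versus 3 fails condition (2) even though it passes condition (1).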
We present templates linking contigs to Bambus using AMOS format [46] for the reads (template ends) mapped to each contig. A read is included only for the contig with which it shares the most k-mers, if the span of those k-mers is k and the other read-end of the template similarly qualifies in a different contig. Bambus infers links between contigs by matching template identifiers shared by reads in different linkable contigs, then produces scaffolds as chains of contigs that are linked with consistent order and orientation.

Table 2 K-mer categories, counting a SNPmer pair as one k-mer

k-mer type            # Distinct        Percentage
No partner/unique     946,431,901       55.48%
Partnered/SNPmer      184,756,149       10.83%
Ambiguous             574,557,296       33.68%
TOTAL                 1,705,745,346

Contiguous consensus representations for k-mer contigs and scaffolds were generated in two phases. In the first phase, sequences spanned by selected SNPmers and subset k-mers (see sections above) are joined together, separated by a number of Ns corresponding to the number of bases not spanned by k-mers in the subset. In the second phase, a single pass is made through the read data set, and stretches of Ns that are spanned by single reads are replaced by the sequence of the read.

SNP phase and genotype inference

Each scaffold of the k-mer assembly constitutes a candidate marker for mapping. While the depth of sequence coverage on each member of the mapping panel is too low (~1X) to directly infer the genotype of individual members of the mapping panel at individual SNPs, the tight linkage between SNPs within markers means that learning a sample's genotype at any one reveals it at the others, effectively amplifying the sequence coverage by a factor proportional to the number of SNPs within the marker. This is the same principle exploited in genotype by sequencing (GBS) approaches to genetic mapping in the presence of reference genomes, for example in recombinant inbred lines of reference rice strains [16], and in crosses between Drosophila species with sequenced genomes. Here we genotype offspring in the context of a
cross between two outbred individuals, simultaneously inferring the phases of the SNPs (i.e., which bases appear on each of the four parental chromosomes in the cross). While the data will be insufficient to infer genotypes at many markers, all those where confident inferences can be made can be used to build the linkage map.

For the purposes of genotype inference, a marker is treated as a collection of m SNPs (indexed in the following by $i \in \{1, 2, \ldots, m\}$) that have been inferred to be closely linked on the genome via the k-mer assembly step. If the four parental chromosomes are labeled a, b in one parent and c, d in the other, then the genotyping problem is to infer which of the four possible segregation states or genotypes ac, ad, bc, bd describes each sample at each marker locus. We index samples with j, and denote a sample genotype by $g_j$.

We assume that markers are very small compared to a chromosome, and ignore the possibility of a recombination event within individual markers. The data used for inference of the offspring genotypes consist of the number of reads from each barcoded sample j showing each of the four possible DNA bases b at each variable SNP position i, which we denote $n^b_{ij}$.

If the phase $\phi_i$ of SNP i were known, i.e. which base is present in each of the four parental chromosomes, then a choice of genotype $g_j$ implies a specific homozygous or heterozygous state $s_{ij} \in S = \{AA, CC, TT, GG, AC, AT, AG, CT, CG, TG\}$ for SNP i in sample j. For a given phase and genotype, the likelihood function for a given SNP position in a given sample is given by either a binomial (for homozygous states) or trinomial probability mass function of the read counts, base-calling error rate $\epsilon$, and the site genotype $s_{ij}$:

$$
L(\phi_i, g_j) = p(n^b_{ij} \mid \phi_i, g_j) = P_M(n^b_{ij} \mid s_{ij}, \epsilon) =
\begin{cases}
\binom{n}{m}\, \epsilon^{m} (1 - \epsilon)^{n - m}, & \text{if } s_{ij} \text{ homozygous} \\[6pt]
\dfrac{n!}{k!\, l!\, m!} \left( \dfrac{2\epsilon}{3} \right)^{m} \left( \dfrac{1 - \frac{2}{3}\epsilon}{2} \right)^{k + l}, & \text{if } s_{ij} \text{ heterozygous}
\end{cases}
$$

where n is the total number of reads at SNP i; m is the number of observations of bases not in $s_{ij}$ (i.e., mismatches); k and l are the counts for each of the two bases of $s_{ij}$ for heterozygous sites.

Likelihood maximization

Searching for an optimal choice of SNP phases $\phi_i$ and sample genotypes $g_j$ is made difficult by the exponential size of the search space: for segregating bi-allelic SNP sites there are 14 possible phases to consider at each SNP site, so for a mapping panel of only 20 siblings and a marker containing only 10 SNPs, there are $14^{10} \times 4^{20}$ combinations to consider. In simulation tests, we found that a variant of expectation maximization (EM), an iterative likelihood maximization method, can accurately infer a large proportion of marker genotypes.

To initialize the iteration, the parental samples and a randomly selected offspring are, without loss of generality, assigned genotypes (a, b), (c, d) and (a, c). At each step, we calculate the conditional probability distributions over the possible SNP phases $p(\phi_i)$ given the genotype assignments according to:

$$
p^{(t)}(\phi_i) = p(\phi_i \mid g^{(t)}) = \frac{\prod_{\sigma \in S} p(n^{b\sigma}_{i} \mid \phi_i, g^{(t)})}{\sum_{k} \prod_{\sigma \in S} p(n^{b\sigma}_{k} \mid \phi_k, g^{(t)})}
$$

where we have labeled the chosen values for the genotypes at iteration t collectively by $g^{(t)}$, and $n^{b\sigma}_{i}$ is the combined total number of observations of base b at polymorphic SNP i for all samples included at iteration step t which have genotype $s(\phi_i, g_j) = \sigma$. On each iteration until all samples have been included, a randomly selected sample is added to the set after calculating $p^{(t)}(\phi_i)$. Then the next set of genotype assignments $g^{(t+1)}$ are determined by choosing those that maximize the expected value of the log likelihood:

$$
E_{\phi \mid n, g_j}\left[ \log L(g_j; n, \phi) \right] = \sum_{i} p(\phi_i) \log p(n^b_{ij} \mid g_j, \phi_i)
$$

These steps are repeated until genotypes are being selected for all samples, and the expected log likelihood stops increasing. At the end of the iteration, the likelihood-maximizing genotypes are reported, along with the log likelihood difference between the best and second best choice of genotype for each sample, which provides
an indicator of the confidence in the genotype call. To gauge convergence, this procedure is repeated 5 times for each marker, with different random choices of initial conditions. Markers which do not identify the same ML genotype multiple times in independent runs are not included among the high confidence genotype calls.

Map bins

Unique marker segregation patterns were included in the set of map bins if they met one of two criteria: (1) at least three independent markers were inferred to have the pattern independently, or (2) the pattern was inferred from at least one marker with at least 20 SNPs such that the mean of the estimated probabilities of the inferred SNP phases was greater than 0.9.

Error model calibration

The sequences of 14 Ciona intestinalis autosomes were downloaded from Ensembl [55]. These 14 chromosomes were used as the template in our genome simulation. Based on their sequence length, we used the Markovian coalescent simulator macs [56] to generate four haploid samples drawn from a population under the neutral Wright-Fisher model with population mutation rate of 0.012 and population recombination rate of 0.0085. Using the C. intestinalis genome as the reference sequence, two diploid parental genomes were constructed based on the macs output with realistic SNP and indel models inferred by several previous studies on the Ciona genome [57-59]. We wrote a Perl script to simulate the genomes of offspring generated by the cross of the two simulated parents. The software package dwgsim [60] was used to generate Illumina paired-end reads based on our simulated genomes of both parents and offspring, with coverage of 20X and 5X respectively.
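The per-site likelihood used in the genotype inference (binomial for homozygous states, trinomial for heterozygous states) and the count of 14 bi-allelic phases can be sketched as below. This is a minimal illustration under stated assumptions: read counts are passed as a hypothetical dictionary from base to count, the default error rate of 0.01 is a placeholder rather than the authors' value, and the function names are ours.

```python
import math
from itertools import product

def site_log_likelihood(counts, site_genotype, eps=0.01):
    """log P(read counts | site genotype, base-calling error rate eps).

    counts: dict base -> number of reads (hypothetical layout).
    site_genotype: two-letter string such as 'AA' or 'AC'.
    """
    n = sum(counts.values())
    first, second = site_genotype
    if first == second:
        # Homozygous: binomial in the number of mismatching reads m.
        m = n - counts.get(first, 0)
        return (math.log(math.comb(n, m))
                + m * math.log(eps) + (n - m) * math.log(1.0 - eps))
    # Heterozygous: trinomial over the two expected bases and mismatches.
    k, l = counts.get(first, 0), counts.get(second, 0)
    m = n - k - l
    log_coef = (math.lgamma(n + 1) - math.lgamma(k + 1)
                - math.lgamma(l + 1) - math.lgamma(m + 1))
    return (log_coef + m * math.log(2.0 * eps / 3.0)
            + (k + l) * math.log((1.0 - 2.0 * eps / 3.0) / 2.0))

def biallelic_phases(alleles=("A", "C")):
    """Assignments of two alleles to chromosomes a, b, c, d that segregate."""
    return [p for p in product(alleles, repeat=4) if len(set(p)) == 2]

print(len(biallelic_phases()))
print(site_log_likelihood({"A": 1, "C": 1}, "AC"))
print(site_log_likelihood({"A": 1, "C": 1}, "AA"))
```

Of the 16 ways to place two alleles on four chromosomes, the two monomorphic ones are excluded, which leaves the 14 phases mentioned in the text. With one read of each base, the heterozygous state has the higher log likelihood, as expected.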
To estimate the frequency of incorrect genotype calls as a function of the log likelihood difference between the called and alternative genotype, including contributions from uncertainty in SNP-mer identification, assembly, and sampling noise, we carried out a simulation of the k-mer assembly and genotype inference protocols. Among high-confidence genotype calls, the observed error frequency as a function of call confidence score was well fit by a sum of two stretched exponential functions, allowing assignment of error probabilities to individual genotype calls.

Linkage group construction

We use the linkage p-value $p_{ab}$ between pairs of map bins a and b, defined as the minimum over the four possible relabelings r of the maternal and paternal chromosomes of the binomial p-value for the number of matching genotypes:

$$
p_{ab} = \min_{r} \left[ 1 - \sum_{i=0}^{m_r - 1} \binom{n}{i} \left( \frac{1}{2} \right)^{n} \right]
$$

where n is the total number of sample genotype calls (68 in the present case, or 34 in each parent) and $m_r$ is the number of matching genotypes under relabeling r.

We identified map bins with segregation patterns indicating either inconsistent placement in the maternal and paternal maps or genotyping error with a double threshold procedure as follows:

1. Map bins were partitioned into linkage groups by single linkage clustering at a threshold of $p_{ab} < p_1$.
2. Within each partition, we identified map bins which formed articulation points (i.e., nodes which, if removed, would cause the linkage group to fall apart into two disconnected subgraphs) in the graph of links with $p_{ab} < p_2$, where $p_2 > p_1$.
This procedure identifies map bins which alone account for the merging of what would otherwise be two distinct partitions. We used the following pairs of thresholds $p_1$, $p_2$ to identify a total of 20 map bins for exclusion from the map: 10, 10; 10, 10; 10, 10. The remaining markers form locally consistent linkage groups in which all linkages defined at threshold $p_1$ are corroborated by multiple linkages at $p_2$, for the above values of $p_1$ and $p_2$.

Marker ordering

Markers were ordered within each linkage group using the following protocol. Within each linkage group a consistent labeling of the four parental chromosomes was achieved by constructing a graph G in which nodes correspond to map bins and edges are weighted by linkage p-value $p_{ab}$ (as defined above). The local chromosome labels are updated at each map bin as it is reached in a traversal of the minimum spanning tree of G to the labeling r that maximizes $p_{ab}$ along the incident edge of G used in the traversal. Markers within each linkage group were clustered by hierarchical clustering (marker-marker distance metric: cosine of the angle between the vectors of recombination distances to the other map bins; distance updating method: average linkage) into a binary tree data structure with leaves representing map bins. A node in the right subtree of the root node was rotated, interchanging its left and right subtrees, if its left subtree was not already closer (in average recombination distance) to the markers of the left subtree of the global root; and similarly for nodes in the left subtree of the root. An in-order traversal of the tree generates an ordering of the markers. Finally, three reversals of the order of markers in segments of the map were added based on visual inspection of the recombination distance matrix. In the final marker ordering, 51% of adjacent map bin pairs are separated by a single recombination event in the cross, and 94% are separated by three or fewer recombinants in each parent.

Figure 8 Unrooted phylogenetic tree of homeobox sequences (part 1). Nodes are labeled with Bayesian posterior probabilities. Highly supported partitions used to classify L. polyphemus sequences are drawn in red, with the abbreviation for the class shown in large letters. L. polyphemus homeobox sequences not grouped into one of these highly supported partitions are assigned to class "?". For ease of display, a large subtree consisting of HOX and parahox genes has been pruned at the position labeled "HOX", and is shown in Figure 9.

Placement of markers on the map

To anchor additional markers to the map, we computed the $p_{ab}$ (see above) between marker a to be placed on the map and each map bin b. Marker a is anchored to the map at the position of the bin b which minimizes $p_{ab}$ if $p_{ab} < 10^{-6}$.

SNP density estimation

Illumina reads were mapped to the assembled scaffold sequences with stampy [61] using default settings. For a sample of 9,228 scaffolds with lengths ranging from 5.0 to 5.5 kb, sequence variants were called with SAMtools [62] using a variant quality score threshold of 50, and ignoring indel positions.
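The linkage p-value used for linkage group construction and for anchoring markers can be sketched as follows. This is a minimal illustration under stated assumptions: each map bin is represented as a pair of 0/1 sequences (which maternal and which paternal chromosome each offspring inherited), and the four relabelings are taken to be swapping the two maternal labels, the two paternal labels, both, or neither. The representation and the function names are ours.

```python
from math import comb

def binomial_tail(n: int, m: int) -> float:
    """P(X >= m) for X ~ Binomial(n, 1/2), summed with exact integers.

    Equal to 1 minus the sum over i = 0..m-1 of C(n, i) / 2**n, but avoids
    floating-point cancellation for very small p-values.
    """
    return sum(comb(n, i) for i in range(m, n + 1)) / 2 ** n

def linkage_p(bin_a, bin_b) -> float:
    """Minimum binomial p-value over the four chromosome relabelings."""
    (mat_a, pat_a), (mat_b, pat_b) = bin_a, bin_b
    n = len(mat_a) + len(pat_a)
    best = 1.0
    for flip_mat in (0, 1):
        for flip_pat in (0, 1):
            matches = sum(x == (y ^ flip_mat) for x, y in zip(mat_a, mat_b))
            matches += sum(x == (y ^ flip_pat) for x, y in zip(pat_a, pat_b))
            best = min(best, binomial_tail(n, matches))
    return best

# 34 offspring, so n = 68 genotype calls per pair of map bins.
bin_a = ([0, 1] * 17, [1, 0] * 17)
bin_b = ([1, 0] * 17, [1, 0] * 17)  # same pattern with maternal labels swapped
print(linkage_p(bin_a, bin_a))
print(linkage_p(bin_a, bin_b))
```

Two bins with identical segregation patterns reach the smallest possible value, $2^{-68}$, and a pure relabeling of one parent's chromosomes gives the same result.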
A SNP density of 0.76% in four haplotypes corresponds to a predicted rate of pairwise sequence differences per site of 0.0042 under the finite sites model of mutation and the neutral coalescent model of the relationships among sampled alleles [63].

Estimation of local recombination rate

To estimate the local recombination rate for each map bin, we computed the linear regression of map distance in number of markers on physical distance using up to 10 neighboring map bins in each direction along the map (or fewer for bins within 10 map bins of the end of the linkage group). Map distance was calculated from recombination fraction r using Haldane's map distance, $-\frac{1}{2}\log(1 - 2r)$ [64].

Ancestral linkage group conservation

To compare the genome organization in L. polyphemus to the ancestral metazoan ALGs, we used the reciprocal best blast hit (RBH) orthology criterion in an alignment of the Ixodes scapularis predicted proteins [36] to the consensus sequences for the marker scaffolds. L. polyphemus scaffolds with an RBH passing the e-value threshold were assigned to the same ancestral bilaterian gene orthology group as their I. scapularis ortholog, and thereby with human genes. Regions of the map were tested for enrichment in genes from particular ancestral linkage groups with Fisher's Exact Test, and breakpoints in ancestral linkage group composition were identified using a hidden Markov model, as previously described [6,7].

Figure 9 Phylogenetic tree of homeobox sequences, part 2. The rooted subtree pruned from the tree in Figure 8. Nodes are labeled with Bayesian posterior probabilities. Highly supported partitions used to classify L. polyphemus sequences are drawn in red, with the abbreviation for the class shown in large letters. L. polyphemus homeobox sequences not grouped into one of these highly supported partitions are assigned to class "hox?".

Table 3 Mixture model fits to Ks distribution

N   k    ln(L)      BIC      AIC      Mixture components
1   2    −273.93    559.74   551.87   1.32 ± 0.50
2   5    −259.32    548.31   528.64   0.70 ± 0.14; 1.45 ± 0.45
3   8    −253.36    554.19   522.71   0.71 ± 0.17; 1.34 ± 0.29; 2.09 ± 0.19
4   11   −251.41    568.11   524.82   0.74 ± 0.18; 1.34 ± 0.20; 1.70 ± 0.04; 2.02 ± 0.22

N is the number of mixture components, k the number of model parameters, ln(L) the log likelihood of the data under the best fit model. BIC, Bayesian information criterion; AIC, Akaike information criterion. The selected model is the one with the lowest BIC score (N = 2, BIC 548.31).

Homeobox gene modeling

We identified 155 marker scaffolds with a tblastx alignment passing the e-value threshold to a set of chelicerate homeobox gene sequences downloaded from GenBank using the NCBI online query interface (GenBank accessions AF071402.1, AF071403.1, AF071405.1, AF071406.1, AF071407.1, AF085352.1, AF151986.1, AF151987.1, AF151988.1, AF151989.1, AF151990.1, AF151991.1, AF151992.1, AF151993.1, AF151994.1, AF151995.1, AF151996.1, AF151997.1, AF151998.1, AF151999.1, AF152000.1, AF237818.1, AJ005643.1, AJ007431.1, AJ007432.1, AJ007433.1, AJ007434.1, AJ007435.1, AJ007436.1, AJ007437.1, AM419029.1, AM419030.1, AM419031.1, AM419032.1, DQ315728.1, DQ315729.1, DQ315730.1, DQ315731.1, DQ315732.1, DQ315733.1, DQ315734.1, DQ315735.1, DQ315736.1, DQ315737.1, DQ315738.1, DQ315739.1, DQ315740.1, DQ315741.1, DQ315742.1, DQ315743.1, DQ315744.1, EU870887.1, EU870888.1, EU870889.1, HE608680.1, HE608681.1, HE608682.1, HE805493.1, HE805494.1, HE805495.1, HE805496.1, HE805497.1, HE805498.1, HE805499.1, HE805500.1, HE805501.1, HE805502.1, S70005.1, S70006.1, S70008.1, and S70010.1). The reads of each marker (those with best stampy [61] alignment to the scaffold) were reassembled with PHRAP [65], with default parameters. The resulting contigs were aligned to a collection of homeobox-containing protein sequences (GenBank accessions NP_001034497.1, NP_001034510.1, AAL71874.1, NP_001034505.1, NP_001036813.1, CAA66399.1, NP_001107762.1, NP_001107807.1, EEZ99256.1, NP_001034519.1, NP_476954.1, NP_032840.1, NP_031699.2, AAI37770.1, EEN68949.1, NP_523700.2, NP_001034511.2, AAK16421.1, and AAK16422.1) with exonerate [66] in protein-to-genome mode. For each contig, the amino acid sequence predicted by the highest-scoring exonerate alignment was used in subsequent phylogenetic analysis, resulting in 104 putative homeobox-containing markers ranging in length from
18 to 147 amino acids.

Phylogenetic analysis of homeobox genes

A multiple sequence alignment of the predicted homeobox sequences combined with a collection of representative sequences from various classes of homeobox genes was constructed with muscle v3.8.31 [67] using default settings. The resulting alignment was trimmed to a 63 amino acid segment spanning the conserved homeodomain, and sequences with more than 50% gaps were removed, leaving 93 predicted L. polyphemus homeobox genes in the analysis. Bayesian phylogenetic analysis was carried out on the resulting 178 taxon, 63 amino acid character matrix (see Additional file 1) using MrBayes v3.2.1 [68] with a mixed model of amino acid substitutions, gamma-distributed rate variation among sites with a fixed shape parameter of 1.0, alignment gaps treated as missing data, 2,000,000 Monte Carlo steps, two independent runs with four Monte Carlo chains, and the initial 25% of sampled trees discarded as burn-in. Monte Carlo appeared to reach convergence, with an average standard deviation of the split frequencies of 0.022. The majority-rule consensus of the sampled trees is shown in Figures 8 and 9, and well-supported gene clades (posterior probability greater than 0.95) were used to group the predicted L. polyphemus genes into classes. The table in Additional file 1 lists the reassembled marker contigs, their inferred hox gene class, and maximum likelihood map positions. Predicted genes were anchored to the map as described above.

Genomic distribution of paralogs

We identified 2,716 pairs of Limulus markers that can both be placed on the map and have their best translated alignment to the same Ixodes scapularis gene. (I. scapularis genes with more than five best-hit markers were excluded from seeding such pairs.) To estimate the synonymous sequence divergence between candidate L. polyphemus paralogous gene pairs, and between L. polyphemus genes and their T. tridentatus orthologs, we constructed codon alignments of predicted coding sequence. Conserved clusters of paralogs were identified using a variant of the max-gap criterion [3] in which two genes are placed in the same cluster if they and their paralogs lie within threshold distance.

Figure 10 Two-component mixture model fit to the Ks peak on the range 0 <= Ks <= 2.5. The best-fitting model was selected by the Bayesian Information Criterion (Table 3). The component means are 0.7 and 1.45 substitutions per site. The position of the peak at lowest Ks was not sensitive to the addition of more mixture components.

Ka and Ks estimation for paralogs and T. tridentatus orthologs

Figure 6 shows the distribution across the map of pairs of candidate paralogs. To estimate the synonymous sequence divergence between candidate L. polyphemus paralogous gene pairs, and between L. polyphemus genes and their T. tridentatus orthologs, we used the following protocol.

1. Reassemble reads from each marker with PHRAP [65], and create a predicted coding sequence using exonerate, as described for the annotation of homeobox gene models (see above).
2. Combine the exonerate alignments of codons to amino acids to create an alignment of codons for either a pair of L. polyphemus sequences, or for an L. polyphemus and T. tridentatus sequence pair.
3. Use the method of Yang and Nielsen [69] to estimate the synonymous and non-synonymous substitution rates Ks and Ka, as implemented in the KaKs Calculator package [70].
4. Discard estimates based on fewer than 30 sites (30 synonymous sites for estimates of Ks, non-synonymous sites for Ka).
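For the mixture models summarized in Table 3 and Figure 10, the reported information criteria can be cross-checked with a few lines of arithmetic. The sketch below recomputes AIC as 2k − 2 ln(L) and picks the model with the lowest BIC. It assumes the log likelihoods in Table 3 are negative, since the signs are not legible in the extracted table, and the variable names are ours.

```python
# (components N, parameters k, ln L, BIC, AIC) as read from Table 3.
# Assumption: the log likelihoods carry a minus sign.
TABLE3 = [
    (1, 2, -273.93, 559.74, 551.87),
    (2, 5, -259.32, 548.31, 528.64),
    (3, 8, -253.36, 554.19, 522.71),
    (4, 11, -251.41, 568.11, 524.82),
]

def aic(k: int, log_likelihood: float) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

for n_components, k, log_l, bic, reported_aic in TABLE3:
    print(n_components, round(aic(k, log_l), 2), reported_aic)

# Model selection by lowest BIC, as described for Table 3.
best_by_bic = min(TABLE3, key=lambda row: row[3])[0]
print("components selected by BIC:", best_by_bic)
```

The recomputed AIC values agree with the tabulated ones to within rounding, which supports reading the log likelihoods as negative, and the lowest BIC corresponds to the two-component model.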
GenBank accessions for Tachypleus tridentatus mRNA clones: JQ966943, AB353281, AB353280, HM156111, HQ221882, HQ221883, HQ221881, HQ386702, HM852953, TATTPP, TATPROCLOT, FN582225, FN582226, AF467804, AF227150, GQ260127, AF264067, AF264068, AB353279, AB005542, TATLICI, TATTGL, TATCFGB, TATLFC1, TATLFC2, AB201713, TATCFGA, TATLICI2, CS423581, CS423579, AB028144, AB201778, AB201776, AB201774, AB201772, AB201770, AB201768, AB201766, AB201779, AB201777, AB201775, AB201773, AB201771, AB201769, AB201767, AB201765, AB105059, AB002814, AX763473, TATCFBP, AB076186, AB076185, X04192, TATHCLL, AB037394, AB019116, AB019114, AB019112, AB019110, AB019108, AB019106, AB019104, AB019102, AB019100, AB019098, AB019096, AB019117, AB019115, AB019113, AB019111, AB019109, AB019107, AB019105, AB019103, AB019101, AB019099, AB019097, AB023783, AB024738, AB024739, AB024737, AB017484, D87214, D85756, D85341.

Figure 7 shows the distribution of Ka and Ks for paralogs and T. tridentatus orthologs. To estimate the number and age of peaks in the un-saturated range [37] of the Ks distribution (and of putative WGD events), we fit a series of univariate normal mixture models, with 1, 2, 3, and 4 components, to the paralog Ks distribution in the range 0 <= Ks <= 2.5 and selected the best model on the basis of the Bayesian Information Criterion (BIC) (Table 3). The best model had two components, with means at 0.7 and 1.45 substitutions per site. The position of the peak at lowest Ks was not sensitive to the addition of more mixture components. Figure 10 shows a comparison of the distribution and the components of the best-fitting model. Gaussian mixture models were estimated in R with mixtools [71].

Availability of supporting data

The raw sequencing reads are currently being submitted through the NCBI SRA and are accessible via NCBI BioProject accession PRJNA187356.

The data sets supporting the results of this article are available in the GigaScience GigaDB repository [72].

Additional file

Additional file 1: The amino acid character matrix used for the phylogenetic analysis of homeobox genes.
Abbreviations
ALG: Ancestral linkage groups; AIC: Akaike information criterion; BIC: Bayesian information criterion; bp: Base pair; EM: Expectation maximization; GBS: Genotyping-by-sequencing; JAM: Joint assembly and mapping; LG: Linkage group; Max-gap: Maximum gap; Mb: Megabase; MYA: Million years ago; PE: Paired-end; RAD-Seq: Restriction site associated DNA sequencing; RBH: Reciprocal best blast hit; SNP: Single nucleotide polymorphism; WGD: Whole-genome duplication.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
NHP conceived and led the project. All authors wrote the paper. PH, NHP and J-XY wrote software. NHP, PH, J-XY and CWN carried out sequence analysis. JB collected and raised samples. CWN extracted genomic DNA and created the libraries for sequencing. All authors read and approved the final manuscript.

Acknowledgements
This research was supported by the National Science Foundation (EF-0850294 and IOB 06-41750), the Beckman Young Investigator Program, the University of Florida Division of Sponsored Research, the Department of Biology, and the UF Marine Laboratory at Seahorse Key. We thank the three reviewers, Dr. Hugues Roest Crollius, Dr. Stephen Richards and Dr. Brian Eads, for their valuable comments which greatly helped to improve the quality of this paper.

Author details
1 Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA. 2 Department of Biochemistry and Cell Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA. 3 Department of Biology, University of Florida, P.O. Box 11-8525, Gainesville, FL 32611-8525, USA. 4 Current address: Gene by Gene, Ltd, Houston, TX 77008, USA.

Received: 23 October 2013 Accepted: 23 April 2014 Published: 14 May 2014

References
1. Ohno S: Evolution by Gene Duplication. Springer-Verlag; 1970.
2. Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature 1997, 387:708–713.
3. McLysaght A, Hokamp K, Wolfe KH: Extensive genomic duplication during early chordate evolution. Nat Genet 2002, 31:200–204.
4. Simillion C, Vandepoele K, Montagu MCEV, Zabeau M, de Peer YV: The hidden duplication past of Arabidopsis thaliana. Proc Natl Acad Sci 2002, 99:13627–13632.
5. Aury J-M, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Ségurens B, Daubin V, Anthouard V, Aiach N, Arnaiz O, Billaut A, Beisson J, Blanc I, Bouhouche K, Câmara F, Duharcourt S, Guigo R, Gogendeau D, Katinka M, Keller A-M, Kissmehl R, Klotz C, Koll F, Mouël AL, Lepère G, Malinsky S, Nowacki M, Nowak JK, Plattner H, et al: Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 2006, 444:171–178.
6. Putnam NH, Srivastava M, Hellsten U, Dirks B, Chapman J, Salamov A, Terry A, Shapiro H, Lindquist E, Kapitonov VV, Jurka J, Genikhovich G, Grigoriev IV, Lucas SM, Steele RE, Finnerty JR, Technau U, Martindale MQ, Rokhsar DS: Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 2007, 317:86–94.
7. Putnam NH, Butts T, Ferrier DEK, Furlong RF, Hellsten U, Kawashima T, Robinson-Rechavi M, Shoguchi E, Terry A, Yu J-K, Benito-Gutierrez E, Dubchak I, Garcia-Fernandez J, Gibson-Brown JJ, Grigoriev IV, Horton AC, de Jong PJ, Jurka J, Kapitonov VV, Kohara Y, Kuroki Y, Lindquist E, Lucas S, Osoegawa K, Pennacchio LA, Salamov AA, Satou Y, Sauka-Spengler T, Schmutz J, Shin-I T, et al: The amphioxus genome and the evolution of the chordate karyotype. Nature 2008, 453:1064–1071.
8. Srivastava M, Begovic E, Chapman J, Putnam NH, Hellsten U, Kawashima T, Kuo A, Mitros T, Salamov A, Carpenter ML, Signorovitch AY, Moreno MA, Kamm K, Grimwood J, Schmutz J, Shapiro H, Grigoriev IV, Buss LW, Schierwater B, Dellaporta SL, Rokhsar DS: The Trichoplax genome and the nature of placozoans. Nature 2008, 454:955–960.
9. Srivastava M, Simakov O, Chapman J, Fahey B, Gauthier MEA, Mitros T, Richards GS, Conaco C, Dacre M, Hellsten U, Larroux C, Putnam NH, Stanke M, Adamska M, Darling A, Degnan SM, Oakley TH, Plachetzki DC, Zhai Y, Adamski M, Calcino A, Cummins SF, Goodstein DM, Harris C, Jackson DJ, Leys SP, Shu S, Woodcroft BJ, Vervoort M, Kosik KS, et al: The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 2010, 466:720–726.
10.SimakovO,MarletazF,ChoS-J,Edsinger-GonzalesE,HavlakP,HellstenU, KuoD-H,LarssonT,LvJ,ArendtD,SavageR,OsoegawaK,deJongP, GrimwoodJ,ChapmanJA,ShapiroH,AertsA,OtillarRP,TerryAY,BooreJL, GrigorievIV,LindbergDR,SeaverEC,WeisblatDA,PutnamNH,RokhsarDS: Insightsintobilaterianevolutionfromthreespiraliangenomes. Nature 2013, 493: 526 531. 11.LvJ,HavlakP,PutnamN: Constraintsongenesshapelong-termconservation ofmacro-syntenyinmetazoangenomes. BMCBioinformatics 2011, 12 (Suppl9):S11. 12.EarlD,BradnamK,JohnJS,DarlingA,LinD,FassJ,YuHOK,BuffaloV, ZerbinoDR,DiekhansM,NguyenN,AriyaratnePN,SungW-K,NingZ, HaimelM,SimpsonJT,FonsecaNA,Birol ,DockingTR,HoIY,RokhsarDS,ChikhiR,LavenierD,ChapuisG,NaquinD,MailletN,SchatzMC,KelleyDR, PhillippyAM,KorenS, etal : Assemblathon1:Acompetitiveassessmentof denovoshortreadassemblymethods. GenomeRes 2011, 21: 2224 2241. 13.DaveyJW,HohenlohePA,EtterPD,BooneJQ,CatchenJM,BlaxterML: Genome-widegeneticmarkerdiscoveryandgenotypingusing next-generationsequencing. NatRevGenet 2011, 12: 499 510. 14.AltshulerD,PollaraVJ,CowlesCR,VanEttenWJ,BaldwinJ,LintonL,Lander ES: AnSNPmapofthehumangenomegeneratedbyreduced representationshotgunsequencing. Nature 2000, 407: 513 516. 15.BairdNA,EtterPD,AtwoodTS,CurreyMC,ShiverAL,LewisZA,SelkerEU, CreskoWA,JohnsonEA: RapidSNPdiscoveryandgeneticmappingusing sequencedRADmarkers. PLoSONE 2008, 3: e3376. 16.HuangX,FengQ,QianQ,ZhaoQ,WangL,WangA,GuanJ,FanD,Weng Q,HuangT,DongG,SangT,HanB: High-throughputgenotypingby whole-genomeresequencing. GenomeRes 2009, 19: 1068 1076. 17.AndolfattoP,DavisonD,ErezyilmazD,HuTT,MastJ,Sunayama-MoritaT, SternDL: Multiplexedshotgungenotypingforrapidandefficientgenetic mapping. GenomeRes 2011, 21: 610 617. 18.ChapmanJA,HoI,SunkaraS,LuoS,SchrothGP,RokhsarDS: Meraculous:De novogenomeassemblywithshortpaired-endreads. PLoSONE 2011, 6: e23501. 19.ButlerJ,MacCallumI,KleberM,ShlyakhterIA,BelmonteMK,LanderES, NusbaumC,JaffeDB: ALLPATHS:Denovoassemblyofwhole-genome shotgunmicroreads. GenomeRes 2008, 18: 810 820. 
20.GnerreS,MacCallumI,PrzybylskiD,RibeiroFJ,BurtonJN,WalkerBJ,Sharpe T,HallG,SheaTP,SykesS,BerlinAM,AirdD,CostelloM,DazaR,WilliamsL, NicolR,GnirkeA,NusbaumC,LanderES,JaffeDB: High-qualitydraft assembliesofmammaliangenomesfrommassivelyparallelsequence data. ProcNatlAcadSci 2011, 108: 1513 1518. 21.RudkinDM,YoungGA: Horseshoecrabs anancientancestryrevealed. In BiolConservHorseshoeCrabs. EditedbyTanacrediJT,BottonML,SmithD. NewYork:Springer;2009:25 44. 22.FisherDC: TheXiphosurida:archetypesofBradytely? In LivingFoss.Edited byEldregeN,StanleySM.NewYork:Springer;1984:196 213. 23.BerksonJ,ShusterCNJr: Thehorseshoecrab:thebattleforatrue multiple-useresource. Fisheries 1999, 24: 6 10. 24.ShusterCNJr,BarlowRB,BrockmannHJ(Eds): TheAmericanHorseshoeCrab. Cambridge,MA:HarvardUniversityPress;2003. 25.GregoryTR: AnimalGenomeSizeDatabase. http://www.genomesize.com. 26.PopM,KosackDS,SalzbergSL: Hierarchicalscaffoldingwithbambus. GenomeRes 2004, 14: 149 159. 27.VanOsH,AndrzejewskiS,BakkerE,BarrenaI,BryanGJ,CaromelB,Ghareeb B,IsidoreE,deJongW,vanKoertP,LefebvreV,MilbourneD,RitterE,van derVoortJNAMR,Rousselle-BourgeoisF,vanVlietJ,WaughR,VisserRGF, BakkerJ,vanEckHJ: Constructionofa10,000-markerultradensegenetic recombinationmapofpotato:providingaframeworkforaccelerated geneisolationandagenomewidephysicalmap. Genetics 2006, 173: 1075 1087. 28.SekiguchiK: BiologyofHorseshoeCrabs. Tokoyo:ScienceHouseCo.,Ltd.; 1988:50 68. 29.CartwrightRA,HussinJ,KeeblerJEM,StoneEA,AwadallaP: Afamily-based probabilisticmethodforcapturingdenovomutationsfromhighthroughputshort-readsequencingdata. StatApplGenetMolBiol 2012, 11 (2). 30.BirdAP: DNAmethylationandthefrequencyofCpGinanimalDNA. NucleicAcidsRes 1980, 8: 1499 1504. 31.LynchM: Theoriginsofeukaryoticgenestructure. MolBiolEvol 2006, 23: 450 468. 32.BromanKW,MurrayJC,SheffieldVC,WhiteRL,WeberJL: Comprehensive humangeneticmaps:individualandsexspecificvariationinrecombination. AmJHumGenet 1998, 63: 861 869. 
33.KongA,GudbjartssonDF,SainzJ,JonsdottirGM,GudjonssonSA,Richardsson B,SigurdardottirS,BarnardJ,HallbeckB,MassonG,ShlienA,PalssonST,Frigge ML,ThorgeirssonTE,GulcherJR,StefanssonK: Ahigh-resolution recombinationmapofthehumangenome. NatGenet 2002, 31: 241 247. 34.CoopG,PrzeworskiM: Anevolutionaryviewofhumanrecombination. NatRevGenet 2007, 8: 23 34. 35.HellmannI,EbersbergerI,PtakSE,PboS,PrzeworskiM: Aneutral explanationforthecorrelationofdiversitywithrecombinationratesin humans. AmJHumGenet 2003,72: 1527 1535. 36.VectorBase: Ixodesscapularisannotation,IscaW1. https://www.vectorbase. org/organisms/ixodes-scapularis. 37.VannesteK,VandePeerY,MaereS: Inferenceofgenomeduplications fromagedistributionsrevisited. MolBiolEvol 2013, 30: 177 190. 38.ObstM,FaurbyS,BussarawitS,FunchP: Molecularphylogenyofextant horseshoecrabs(Xiphosura,Limulidae)indicatesPaleogene diversificationofAsianspecies. MolPhylogenetEvol 2012, 62: 21 26. 39.BegunDJ,AquadroCF: Levelsofnaturallyoccurring DNApolymorphismcorrelate withrecombinationratesinD.melanogaster. Nature 1992, 356: 519 520. 40.CutterAD,PayseurBA: Selectionatlinkedsitesinthepartialselfer caenorhabditiselegans. MolBiolEvol 2003, 20: 665 673. 41.NachmanMW: Singlenucleotidepolymorphismsandrecombinationrate inhumans. TrendsGenet 2001, 17: 481 485. 42.StephanW,LangleyCH: DNApolymorphisminlycopersiconand crossing-overperphysicallength. Genetics 1998, 150: 1585 1593. 43.RoseliusK,StephanW,StdlerT: Therelationshipofnucleotide polymorphism,recombinationrateandselectioninwildtomatospecies. Genetics 2005, 171: 753 763. 44.AndolfattoP,PrzeworskiM: Regionsoflowercrossingoverharbormore rarevariantsinAfricanpopulationsofdrosophilamelanogaster. Genetics 2001, 158: 657 665. 45.HavlakP,ChenR,DurbinKJ,EganA,RenY,SongX-Z,WeinstockGM,Gibbs RA: Theatlasgenomeassemblysystem. GenomeRes 2004, 14: 721 732. 46.TreangenTJ,SommerDD,AnglyFE,KorenS,PopM: Nextgenerationsequence assemblywithAMOS. CurrentProtocolsinBioinformatics 2011, 33: 11.8.1 11.8.18. 
47.TheJAM-pipeline. https://github.com/putnamlab/jam-pipeline. 48.JohnsonSL,BrockmannHJ: Costsofmultiplemates:anexperimental studyinhorseshoecrabs. AnimBehav 2010, 80: 773 782.Nossa etal.GigaScience 2014, 3 :9 Page20of21 http://www.gigasciencejournal.com/content/3/1/9 PAGE 21 49.MullikinJC,NingZ: Thephusionassembler. GenomeRes 2003, 13: 81 90. 50.CockPJA,FieldsCJ,GotoN,HeuerML,RicePM: TheSangerFASTQfile formatforsequenceswithqualityscores,andtheSolexa/IlluminaFASTQ variants. NucleicAcidsRes 2010, 38: 1767 1771. 51.EwingB,GreenP: Base-callingofautomatedsequencertraces usingphred.ii.errorprobabilities. GenomeRes 1998, 8: 186 194. 52.KnuthDE: SearchingandSorting ,Volume3.Reading,MA:Addison-Wesley; 1973[ TheArtofComputerProgramming ]. 53.RobertsM,HayesW,HuntBR,MountSM,YorkeJA: Reducingstoragerequirements forbiologicalsequencecomparison. Bioinformatics 2004, 20: 3363 3369. 54.YeC,MaZS,CannonCH,PopM,YuDW: Exploitingsparsenessindenovo genomeassembly. BMCBioinformatics 2012, 13 (6) : S1. 55. Ensembl. http://www.ensembl.org/index.html. 56.ChenGK,MarjoramP,WallJD: FastandflexiblesimulationofDNA sequencedata. GenomeRes 2009, 19: 136 142. 57.DehalP,SatouY,CampbellRK,ChapmanJ,DegnanB,TomasoAD, DavidsonB,GregorioAD,GelpkeM,GoodsteinDM,HarafujiN,Hastings KEM,HoI,HottaK,HuangW,KawashimaT,LemaireP,MartinezD, MeinertzhagenIA,NeculaS,NonakaM,PutnamN,RashS,SaigaH,Satake M,TerryA,YamadaL,WangH-G,AwazuS,AzumiK, etal : Thedraft genomeofcionaintestinalis:insightsintochordateandvertebrate origins. Science 2002, 298: 2157 2167. 58.HauboldB,PfaffelhuberP,LynchM: mlRho aprogramforestimatingthe populationmutationandrecombinationratesfromshotgun-sequenced diploidgenomes. MolEcol 2010, 19: 277 284. 59.SmallKS,BrudnoM,HillMM,SidowA: Extremegenomicvariationina naturalpopulation. ProcNatlAcadSci 2007, 104: 5698 5703. 60. dwgsim. https://github.com/nh13/DWGSIM/releases. 61.LunterG,GoodsonM:Stampy:astatisticalalgorithmforsensitiveand fastmappingofIlluminasequencereads. GenomeRes 2011, 21: 936 939. 
62.LiH,HandsakerB,WysokerA,FennellT,RuanJ,HomerN,MarthG,Abecasis G,DurbinR: Thesequencealignment/MapformatandSAMtools. Bioinformatics 2009, 25: 2078 2079. 63.YangZ: StatisticalpropertiesofaDNAsampleunderthefinite-sites model. Genetics 1996, 144: 1941 1950. 64.HaldaneJBS: Thecombinationoflinkagevaluesandthecalculationof distancesbetweenthelocioflinkedfactors. JGenet 1919, 8: 299 309. 65.GordonD,AbajianC,GreenP: Consed:agraphicaltoolforsequence finishing. GenomeRes 1998, 8: 195 202. 66.SlaterGS,BirneyE: Automatedgenerationofheuristicsforbiological sequencecomparison. BMCBioinformatics 2005, 6: 31. 67.EdgarRC: MUSCLE:multiplesequencealignmentwithhighaccuracyand highthroughput. NucleicAcidsRes 2004, 32: 1792 1797. 68.RonquistF,HuelsenbeckJP: MrBayes3:Bayesianphylogeneticinference undermixedmodels. Bioinformatics 2003, 19: 1572 1574. 69.YangZ,NielsenR: Estimatingsynonymousandn onsynonymoussubstitution ratesunderrealisticevolutionarymodels. MolBiolEvol 2000, 17: 32 43. 70.ZhangZ,LiJ,ZhaoX-Q,WangJ,WongGK-S,YuJ: KaKscalculator:calculating KaandKsthroughmodelselectionandmodelaveraging. Genomics ProteomicsBioinformatics 2006, 4: 259 263. 71.BenagliaT,ChauveauD,Hunter,DavidR,Young,DerekS: mixtools:AnR packageforanalyzingfinitemixturemodels. JStatSoftw 2009, 32: 1 29. 72.NossaCW,HavlakP,YueJX,LvJ,VincentKY,BrockmannHJ,PutnamNH: Supportingmaterialsfrom Jointassemblyandgeneticmappingofthe Atlantichorseshoecrabgenomerevealsancientwholegenome duplication 2014.GigaScienceDatabase.http://dx.doi.org/10.5524/100091.doi:10.1186/2047-217X-3-9 Citethisarticleas: Nossa etal. : Jointassemblyandgeneticmappingof theAtlantichorseshoecrabgenomerevealsancientwholegenome duplication. GigaScience 2014 3 :9. 
name sequence accession position class '.' for match LG:coord. EVX-1 QMRRYRTAFTREQIARLEKEFYRENYVSRPRRCELAAALNLPETTIKVWFQNRRMKDKRQRLA P23683.1 g_ayrxn_scaff_4445_1.Contig24 --...........L.......C..S..................S....R.------------1 : 7968 evx g_algxc_scaff_4707_1.Contig9 KE..S.....E..LEY..RT.E.TQ.PDVYT.E...QRTK.T.ARVQ.--------------1 : 2335 ? g_ajgny_scaff_5092_1.Contig4 KAK.I..I..A..LN...A..LQQQ.MVGRE.SH..SE...K.AQV........I.WRK---4 : 26846 ? g_ahyur_scaff_4335_1.Contig2 -AK.V..I..A..LD...T..E.QQ.MVG.E.LQ..S....T.AQV........I.WRK.HFD 1 : 3106 ? 
engrailed__Tribolium_castaneum_ EEK.P....SGA.L...KH..AENR.LTER..QQ.S.E.G.N.AQ..I....K.A.I.KASGQ NP_001034511.2 engrailed__isoform_A__Drosophila_melanogaster_ DEK.P....SS..L...KR..NENR.LTER..QQ.SSE.G.N.AQ..I....K.A.I.KSTGS NP_523700.2 engrailed__Branchiostoma_floridae_ EEK.P.....S..LQ..K...QENR.LTEQ..QD..RE.K.N.SQ..I....K.A.I.KAAGV EEN68949.1 g_addne_scaff_4497_1.Contig54 EEK.P.....TD.L...K...HENR.LTEK..Q...RE.Q.N.SQ..I....K.A.I.KSTGE 1 : 5794 en g_ausca_scaff_4858_1.Contig1 DEK.P.....AD.L...KQ..QENR.LTEK..QD..LD.Q.N.SQ..I....K.A.I.KASGQ 20 : 83001 en g_adwrx_scaff_4421_1.Contig72 DEK.P.....AD.L...KA..QENR.LTEK..QD..RG.Q.N.SQ..I....K.A.I.KATGH 15 : 71297 en g_avzuc_scaff_4437_1.Contig3 DEK.P.....TD.L...KA..QENK-LTEK..QD..RE------------------------21 : 85367 en g_adwrx_scaff_4421_1.Contig61 DEK.P.....TD.L...KA..QENR.LTEK..QD..RE.Q.N.SQ..I.-------------15 : 71297 en g_avzuc_scaff_4437_1.Contig4 -------------------..QENR.LTEK..QD..RE.Q.N.SQ..I....K.A.I.KATGH 21 : 85367 en g_adwrx_scaff_4421_1.Contig63 -EK.P.....AD.L...KA..QENR.LTEK..QD..RE.Q.N.SQ..I....K.A.I.KAVGH 15 : 71297 en g_ayadw_scaff_4575_1.Contig4 DEK.P.....AD.L...KA..QENR.LTEK..QD..RE.Q.N.SQ..I....K.A.I.KAVGH 3 : 22654 en Nkx-2.2a KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..SLIR.TP.QV.I....H.Y.M..A.AE NP_571497.1 Nkx-2.2 KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..SLIR.TP.QV.I....H.Y.M..A.AE NP_002500.1 Nkx-2.2__Gallus_gallus_ KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..SLIR.TP.QV.I....H.Y.M..A.AE XP_003643855.1 g_ajlua_scaff_4446_1.Contig10 KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..NFIR.TP.QV.I....H.Y.T..A.QE 1 : 1567 nkx2 g_acfsj_scaff_4532_1.Contig25 KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..SIIR.TP.QV.I....H.Y.T..A.QE 3 : 20058 nkx2 ladybird-like__3 KR.KS....SNQ.LFE..RR.V.QK.L.PAD.DQ..QR.A.SSAQVIT......A.L..DLDE NP_001038133.1 lady_bird-like__1_homolog KR.KS.....NH..YE...R.LYQK.L.PAD.DQI.QQ.G.TNAQVIT......A.L..DLEE NP_001007135.1 g_acdli_scaff_4927_1.Contig5 KK.KS.....NH..FE...R.LYQK.L.PAD.D.I.QS.G.TNAQVIT......A.L..---26 : 93066 lbx 
g_agfxy_scaff_4755_1.Contig8 KK.KS.....NQ..FE...R.LYQK.L.PAD.D.I.QS.R.TNAQVIT......A.L..---13 : 67147 lbx g_aplvp_scaff_4845_1.Contig14 KK.KS.....NH..FE...R.LYQK.L.PAD.D.I.QS.G.TNAQVIT......A.L..---1 : 3106 lbx AGAP003673PA__Anopheles_gambiae_str._PEST_ KK.KS.....NH..FE...R.LYQK.L.PSD.D.I....G.SNAQVIT......A.L..DMEE EAA04094 g_amhlt_scaff_4436_1.Contig3 KQ.KA.....DH.LQT...C.E.QK.L.VQE.M...SK...TD.QV.T.Y....--------8 : 46729 ? g_amhlt_scaff_4436_1.Contig8 KE.KA.....DY..QT...R.QTQK.L.VQD.M.I.SK...TD.HV.T.Y....--------8 : 46729 ? g_aeqlo_scaff_4727_1.Contig6 RKKKT..V...S.VFQ..ST.DMKR.L.SSE.AG...C.H.T..QV.I......N.W...--22 : 86973 ? g_agfda_scaff_4460_1.Contig8 KRKKS..T..GR..FE...Q.EIKK.L.SSE.A.M.KL..VT.QQVEI--------------26 : 93149 ? g_artzi_scaff_4429_1.Contig73 SN.KL..P..TQ.LLA..RK.LAKQ.L.IAE.AKFSIS.T.T.AQV.I......A.E..---5 : 32215 msx Msx__Saccoglossus_kowalevskii_ TN.KP..P..TS.LLA..RK.RQKQ.L.IAE.A.FS.S...T..QV.I......A.A..LQE. ABD97280 Msx___Nematostella_vectensis_ AN.KP..P..TS.LLA..RK.RQKQ.L.IAE.A.FS.S...T..QV.I......A.A..LHE. BAG11598 MSX-2__Mus_musculus_ TN.KP..P..TS.LLA..RK.RQKQ.L.IAE.A.FSSS...T..QV.I......A.A..LQE. NP_038629 MSX-2__Homo_sapiens_ TN.KP..P..TS.LLA..RK.RQKQ.L.IAE.A.FSSS...T..QV.I......A.A..LQE. BAA06549 MSX1___Homo_sapiens_ TN.KP..P..TA.LLA..RK.RQKQ.L.IAE.A.FSSS.S.T..QV.I......A.A..LQE. NP_002439 MSX1_variant__Mus_musculus_ TN.KP..P..TA.LLA..RK.RQKQ.L.IAE.A.FSSS.S.T..QV.I......A.A..LQE. AAG32466 g_aebwc_scaff_4685_2.Contig8 -----------------------KQ.L.IAE.A.FSSS.S.T..QV.I......A.E..LKE. 
3 : 19118 msx g_aebwc_scaff_4685_2.Contig10 SN.KP..P..TQ.LLA..RK.RTKQ.L.IAE.A.FSSS...T..QV.I......A.E..---3 : 19118 msx g_aebwc_scaff_4685_2.Contig13 SN.KP..P..TQ.LLA..RK.RSKQ.L.IAE.A.FSSS...T..QV.I......A.E..---3 : 19118 msx g_aoopl_scaff_4606_1.Contig8 -----------------------KQ.L.IAE.A.FSSS...T..QV.I......A.EN.LKE26 : 93149 msx g_aoopl_scaff_4606_1.Contig18 SS.KP..P..TK.LLA...M.QTKQ.L.IAE.A.FSSSM..T..QV.I......A.E..---26 : 93149 msx g_agizq_scaff_4515_1.Contig28 --..R...Y.S..LLQ.....HAKK.L.LTE.SQISTT.Q.S.VQV.I......A.W..---3 : 23600 gbx unplugged__Drosophila_melanogaster_ KS..R.....S..LLE..R..HAKK.L.LTE.SQI.TS.K.S.VQV.I......A.W..VKAG NP_477146 AGAP006923PA__Anopheles_gambiae_str._PEST_ KS..R.....S..LLE..R..HAKK.L.LTE.SQI.TS.K.S.VQV.I......A.W..VKAG EAA04094 g_apstr_scaff_4456_1.Contig3 --..R.....N..LLE..N..HCKK.L.LTE.SQI.HS.Q.S.QQV.I......A.W..---1 : 8283 gbx GBX-1__Canis_lupus_familiaris_ KS..R.....S..LLE.....HCKK.L.LTE.SQI.H..K.S.VQV.I......A.W..IKAG XP_853664 g_aglhj_scaff_4667_1.Contig13 --..R.....S..LLE.....HSKK.L.LTE.SQI.HS.Q.S.VQV.I......A.W..---11 : 60310 gbx g_aglhj_scaff_4783_1.Contig4 KS..R.....S..LLE.....HSKK.L.LTE.SHI.NS.Q.S.VQV.I......A.W..VKAG 18 : 80043 gbx g_acgnb_scaff_4629_1.Contig3 RFLLSLSQR.SRNGSQATDGLHFNRWLT.R..I.IVRT.C.SD-------------------20 : 83755 hox? g_awtvf_scaff_4744_1.Contig2 ETK.Q...Y..H..LE.....HFNR.LT.R..I.I.HSSC.ERIK.----------------21 : 85243 hox? g_awtvf_scaff_4744_1.Contig1 -----------------------SR.LT.R.QI.I.HS.C.S.RQ..N...I....W.KDNKL 21 : 85243 hox? Drosophila-Dfd__NM_057853_ EPK.Q...Y..H..LE.....HYNR.LT.R..I.I.HT.V.S.RQ..I........W.KD--NM_057853 Tribolium-Dfd__EEZ99253_ EPK.Q...Y..H..LE.....HYNR.LT.R..I.I.HT.V.S.RQ..I........W.KD--EEZ99253 g_ajivj_scaff_4414_1.Contig4 ETK.Q...Y..H..LE.....HFNR.LT.R..I.I.HL.C.S.RQ..I........W.KDNKL 15 : 71079 hox? 
g_ajivj_scaff_4414_1.Contig3 ETK.Q...Y..H..LE.....HFNR.LT.R..I.I.HS.C.S.RQ..I........W.KDNKL 15 : 71079 hox? Branchiostoma-Hox4__AB028208_ DTK.S...Y..Q.VLE.....HFNR.LT.R..I.I.HS.G.T.RQ..I........W.KD--AB028208 g_akvhc_scaff_4400_4.Contig7 EAK.Q...Y..H.VLE.....HFNR.LT.R..I.I.H..C.S.RQ..I........W.KDNKL 18 : 79524 hox? Capitella-Dfd__EU196540_ DSK.T...Y..H..LE.....HFNR.LT.R..I.I.HT.C.S.RQ..I........W.KE--EU196540 Lottia-Dfd__110972_ DSK.N...Y..H.VLE.....HFNR.LT.R..I.I.HT.C.S.RQ..I........W.KE--JGI Lotgi1 protein id: 110972 Branchiostoma-Hox12__AAF81903_ SS.KK.CPYSKV.LLE.....LYNM.IT.EQ.G.I.RKV..TDRQV.I........M..M--AAF81903 Branchiostoma-Hox10__Z35150_ VG.KK.CPY.KY..LE.....LFNM....E..Q.ISRHV..SDRQV.I........M..M--Z35150 Branchiostoma-Hox11__AAF81909_ ST.KK.CPY.KY.TLE.....LFNMF.T.E..Q.I.RQ...TDRQV.I........M..M--AAF81909 Branchiostoma-Hox9__ABX39493_ SS.KK.CPY..F.TLE.....LYNM.LT.E..Y.ISQHV..T.RQV.I........M.KM--ABX39493 Branchiostoma-Hox13__AAF81904_ GG.KK.CPY.KY.LSV..Q.YIQNR....ET.L..SQR...TDRQV.I........Q..L--AAF81904 Branchiostoma-Hox14__AAF81905_ PV.PK.RPYSKY.LNE..N.YVQNQ.I..DK.LQ.SQK...T.RQV.I......I.Q.KL--AAF81905 Capitella-post1__EU196546_ NPKKK.KPYSKP.VSA..N.YSTST.ITKA..K.V.RE.D.T.RQ..I.Y....I.E.KI--EU196546 Lottia-post1__100031_ TL.KR.RPYSKF...E..R.YNGST...KS..W..SQLI..S.RQ..I......I.A.KI--JGI Lotgi1 protein id: 100031 Drosophila-AbdB__NM_080157_ SV.KK.KPYSKF.TLE.....LFNA...KQK.W...RN.Q.T.RQV.I........N.KN--NM_080157 Tribolium-AbdB__EEZ99247_ TV.KK.KPYSKF.TLE.....LFNA...KQK.W...RN...T.RQV.I........N.KN--EEZ99247 g_algea_scaff_4559_1.Contig10 TV.KK.KPYSKF.TLE.....LFNA...KQK.W...RN...T.RQV.I........S.KNAQR 15 : 71736 abdB g_algea_scaff_4559_3.Contig7 ------KPYSKF.TLE.....LFNS...KQK.W...RN...T.RQV.I........S.KSSQR 21 : 85120 abdB Capitella-post2__EU196545_ KQ.KK.KPY..Y.TMV..N..INNS.IT.QK.W.ISCK.H.S.RQV..........R.KL--EU196545 Lottia-post2__89720_ KG.KK.KPY..Y.TMV..N..LSSS.IT.QK.W.ISCK.Q.S.RQV..........R.KL--JGI Lotgi1 protein id: 89720 
g_aosol_scaff_4355_1.Contig18 ---------------------EFNR.ITQR..I.I.HY.S.S.RK..M.....S..ANEKH-20 : 83145 hox? Drosophila-ftz__NM_058159_ DSK.T.QTY..Y.TLE.....HFNR.IT.R..IDI.N..S.S.RQ..I........S.KD--NM_058159 CDX-2__Mus_musculus_ TKDK..VVY.DH.RLE.....HFSR.ITIR.KS....T.G.S.RQV.I......A.ERKIKKK AAA19645 g_amvdv_scaff_4163_3.Contig12 TRDK..VVY.DQ.RLE.....HYNR.I.IR.KT....MVG.S.RQV.I......A.ER.NFRK 5 : 33811 cdx g_aildw_scaff_4772_1.Contig14 TKDK...VY.DH.RLE.....HYSR.ITIR.KT...TM.G.SDRQV.I......A.ERK.ARK 19 : 80739 cdx g_anovg_scaff_4508_1.Contig5 TKEK..VVY.DH.RLE......YSR.ITIR.KS....M.G.S.RQV.I......A.ERK.ARK 3 : 21395 cdx caudal__isoform_A__Drosophila_melanogaster_ TKDK..VVY.DF.RLE....YCTSR.ITIR.KS...QT.S.S.RQV.I......A.ERK.NKK NP_476954.1 Ftz__Tribolium_castaneum_ GNK.T.QTY..Y.TLE.....HFNK.LT.R..I.I.ES.R.T.RQ..I........A.KDTKF AF321227_1 Tribolium-ftz__AAC46491_ GNK.T.QTY..Y.TLE.....HFNK.LT.R..I.I.ES.R.T.RQ..I........A.KD--AAC46491 g_acrop_scaff_4310_1.Contig8 APK.T.QTY..Y.TLE.....HFNK.LT.R..I.I.HT.G.T.RQ..I........E.KENK15 : 71825 hox? g_awjsf_scaff_4464_1.Contig11 --K.T.QTY..Y.TLE.....HFNK.LT.R..I.I.HT.G.T.RQ..I....----------21 : 85120 hox? g_awjsf_scaff_4464_1.Contig15 APK.T.QTY..Y.TLE.....HFNR.LT.R..I.I.HT.G.T.RQ..I........A.KENKF 21 : 85120 hox? g_avhgw_scaff_4510_1.Contig7 --K.T.QTY..Y.TLE.....HFNR.LT.R..I.I.HS.G.T.RQ..I........A.KEMPK 18 : 79607 hox? g_alkdk_scaff_4677_1.Contig4 DQ..C.QTYSFY.TLA.....HFDP.LTKR..I.I.HT...K.KQ..I........W.KEKRS 28 : 94741 hox? g_ascty_scaff_4257_1.Contig45 DQ..S.QTYSFY.TLA.....HLDP.LTQR..I.I.HT...T.KQ.QI........W.KENKT 17 : 77601 hox? hox_6_Saccoglossus -Q..G.QTY..Y.TLE.....HFNR.LT.R..I.I.HT.G.T.RQ..I........W.KEQKABK00020 hox_7_Saccoglossus -KK.C.QTY..Y.TLE.....HYNR.LT.R..I..SHL.G.T.RQ..I........Y.KESKAAP79287 Capitella-lox5__EU196542_ EQK.T.QTY..Y.TLE.....HYNR.LT.R..I.I.H..Q.T.RQ..I........Y.KE--EU196542 g_anoef_scaff_4410_1.Contig27 PR..G.QSY..F.TLE.....HFNH.LT.R..I...HVVC.T.RQ..I........L.KEVRV 3 : 23600 hox? 
g_awjsf_scaff_4464_1.Contig6 ----G.Q.Y..F.TLE.....HYNH.LT.R..I.I.HVVC.T--------------------21 : 85120 hox? g_ajivj_scaff_4414_1.Contig2 ---.S.Q.Y..F.TLE.....HYNH.LT.R..I.I.HVVC.T--------------------15 : 71079 hox? g_anoef_scaff_4410_1.Contig17 --------------LE.....HYNH.LT.R..I.I.HVVC.T.RQ..I........L.KEIR. 3 : 23600 hox? Lottia-lox4__156931_ .R..G.QTYS.Y.TLE.....QFNH.LT.K..I.I.HT.C.T.RQ..I........M.KE--JGI Lotgi1 protein id: 156931 Capitella-lox4__EU196543_ .R..G.QTYS.Y.TLE.....QFNH.LT.K..I.I.H..C.T.RQ..I........L.KE--EU196543 g_awgsh_scaff_4551_1.Contig4 -R..G.QTY..F.TLE.....HFNH.LT.R..I.I.H..C.T.RQ..I........L.KEIR. 15 : 71699 hox? g_aqqvq_scaff_4781_1.Contig5 ------------------...HFNH.LT.R..I.I.H..C.T.RQ..I........L.KEIR. 21 : 85040 hox? g_asnox_scaff_4479_1.Contig10 ---------------------HFNH.LT.R..I.I.H..C.T.RQ..I........L.KEIR. 15 : 71793 hox? g_asnox_scaff_4479_1.Contig18 ---------------------HFNH.LT.R..I.I.H..C.T.RQ..I........L.KEIR. 15 : 71793 hox? Drosophila-abdA__NM_057345_ PR..G.QTY..F.TLE.....HFNH.LT.R..I.I.H..C.T.RQ..I........L.KE--NM_057345 Tribolium-abdA__AAB70263_ PR..G.QTY..F.TLE.....HFNH.LT.R..I.I.H..C.T.RQ..I........L.KE--AAB70263 g_aqqvq_scaff_4781_1.Contig10 --------YP.Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I........L.KEIQ. 21 : 85040 ubx g_asnox_scaff_4479_1.Contig17 -------------------------.LT.R..I.M.H..C.T.RQ..I........L.KEIQ. 
15 : 71793 ubx g_asnox_scaff_4479_1.Contig21 LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.---------------------15 : 71793 ubx g_aityy_scaff_4289_1.Contig2 LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I......--------10 : 54444 ubx Drosophila-Ubx__NM_080500_ LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I........L.KE--NM_080500 Tribolium-Ubx__EEZ99249_ LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I........L.KE--EEZ99249 g_asnox_scaff_4479_1.Contig28 LR..G.QTY..Y.TLE.....HTNH.LT.R..I.M.H..C.T.RQ..I........L.KEIQ. 15 : 71793 ubx Capitella-lox2__EU196544_ .R..G.QTY..Y.TLE.....KFNR.LT.R..I..SHM.C.T.RQ..I........E.KE--EU196544 Lottia-lox2__185752_ .R..G.QTY..F.TLE.....KFNR.LT.R..I..SHM.C.T.RQ..I........E.KE--JGI Lotgi1 protein id: 185752 Branchiostoma-Hox8__ABX39492_ ER..G.QTYS.Y.TLE.....HFNK.LT.R..I.I.H..G.T.RQ..I........L.KE--ABX39492 Branchiostoma-Hox7__ABX39491_ ERK.G.QTY..Y.TLE.....HFNK.LT.R..I.I.H..C.T.RQ..I........W.KE--ABX39491 Lottia-lox5__89598_ EQK.T.QTY..Y.TLE.....HFNR.LT.R..I.V.HM.G.T.RQ..I........W.KE--JGI Lotgi1 protein id: 89598 Branchiostoma-Hox6__ABX39490_ EKK.G.QTY..Y.TLE.....HFNK.LT.K..I.I.HL.G.T.RQ..I........W.KE--ABX39490 g_aubsu_scaff_4356_1.Contig13 ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KENK. 15 : 71793 hox? g_aubsu_scaff_4356_1.Contig10 ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KENK. 15 : 71793 hox? Drosophila-Antp__NM_079525_ ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--NM_079525 Tribolium-Antp__EEZ99250_ ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--EEZ99250 Capitella-Antp__EU196547_ ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--EU196547 Lottia-Antp__177860_ NRK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--JGI Lotgi1 protein id: 177860 Hox8_Metacrinus -KK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.Q.VC.S.RQ..I........A.KETSBAF43726 g_atxvn_scaff_4590_1.Contig3 --------------------------LT.R..I.I.H..C.T.RQ..I........W.QENK. 5 : 33116 hox? 
g_awjsf_scaff_4464_1.Contig1 --K.G.QTY..Y.TLE.....HFNR.LT.R..I.I.--------------------------21 : 85120 hox? g_awjsf_scaff_4464_1.Contig8 ERK.G.QTY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.R------------------21 : 85120 hox? g_ascty_scaff_4257_1.Contig27 ------------.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I--------------17 : 77601 hox? g_alkdk_scaff_4677_1.Contig1 -------TY..Y.TLE.....HFNR.LT.R..I.I.H..C.T--------------------28 : 94741 hox? hox_5_Saccoglossus -AK.S...Y..Y.TLE.....HFNR.LT.R..I.I.H..G.S.RQ..I........W.KEHNNP_001158410 Branchiostoma-Hox5__ABX39489_ DNK.T...Y..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--ABX39489 Scr__Tribolium_castaneum_ ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KEHKM AF321227_2 Drosophila-Scr__X14475_ ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--X14475 Tribolium-Scr__EEZ99252_ ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.T.RQ..I........W.KE--EEZ99252 g_anoef_scaff_4410_1.Contig19 ------------------...HFNR.LT.R..I.I.H..C.S.RQ..I........W.KEHKM 3 : 23600 hox? g_asnox_scaff_4479_1.Contig12 -------SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S.RQ..I--------------15 : 71793 hox? g_awjsf_scaff_4464_1.Contig7 ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S.RQ..I........W.KEH-21 : 85120 hox? g_anbio_scaff_4694_1.Contig9 ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S.RQ..I........W.KEHKM 15 : 71825 hox? g_anbio_scaff_4694_1.Contig10 ETK.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S.RQ..I........W.KEHKM 15 : 71825 hox? g_awjsf_scaff_4464_1.Contig5 --K.Q..SY..Y.TLE.....HFNR.LT.R..I.I.H..C.S--------------------21 : 85120 hox? 
Lottia-Scr__86927_ DSK.S..SY..H.TLE.....HYNK.LT.R..I.I.H....T.RQ..I........W.KD--JGI Lotgi1 protein id: 86927 Capitella-Scr__EU196541_ DNK.T..SY..H.TLE.....HFNR.LT.R..I.I.HS...T.RQ..I........W.KE--EU196541 pancreas_duodenum___1__Mus_musculus_ ENK.T...Y..A.LLE.....LFNK.I.....V...VM...T.RH..I........W.KEEDK NP_032840 Branchiostoma-Hox2__AB028207_ SS..L..V..NT.LLE.....HYNK..CK...K.I.SY.D.N.RQV.I.......RQ..R--AB028207 g_axtpu_scaff_4896_1.Contig3 ---SG..N..TK.L....N..LFNK.LT.K..I.I...IQ.N..QV.I........Q.KRMKE 9 : 53730 labial g_axtpu_scaff_4896_1.Contig4 ---SG..N..TK.L.G..N..LFNK.LT.K..I.I...IQ.N..QV.---------------9 : 53730 labial Drosophila-labial__NM_057265_ -NNSG..N..NK.LTE.....HFNR.LT.A..I.I.NT.Q.N..QV.I........Q.KRV-NM_057265 Tribolium-labial__AAF64149_ CLNTG..N..NK.LTE.....HFNK.LT.A..I.I.S..Q.N..QV.I........Q.KR--AAF64149 g_axtpu_scaff_4896_1.Contig2 --------------------------LT.A..I.I.T..Q.N..QV.I........Q.KRMKE 9 : 53730 labial g_ajwnd_scaff_4604_1.Contig5 -VGSG..N..TK.LTE.....HFNK.LT.A..I.I.T..Q.N..QV.I........Q.KRMKE 15 : 71825 labial g_aavsb_scaff_4656_1.Contig15 ----G..N..TK.LTE.....HFNK.LT.A..I.I.N..Q.N..QV.I........Q.KRMKE 18 : 79564 labial g_arkho_scaff_4547_1.Contig2 ----G..N..TK.LTE.....HFNK.LT.A..I.I.NV.Q.N..QV.I........Q.----1 : 5456 labial g_aavsb_scaff_4656_1.Contig7 ------------------------------..I.I.NV.Q.N..QV.I........Q.KRMKE 18 : 79564 labial Branchiostoma-Hox1__AB028206_ GPNNG..N..TK.LTE.....HYNK.LT.A..V.I......N..QV.I........Q.KR--AB028206 Capitella-labial__EU196537_ -PNMG..N..NK.LTE.....HFNK.LT.A..I.I..S.G.N..QV.I........Q.KRL-EU196537 Lottia-labial__100648_ -PNSG..N..NK.LTE.....HFNK.LT.A..I.I..S.G.N..QV.I........Q.KR--JGI Lotgi1 protein id: 100648 Lottia-pb__110623_ GS..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.T.RQV..........Y...--JGI Lotgi1 protein id: 110623 Drosophila-pb__NM_057322_ LP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.T.RQV..........H...--NM_057322 Tribolium-pb__EEZ99256_ 
LP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.T.RQV..........H...--EEZ99256 g_adlvn_scaff_4304_2.Contig14 MP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.S.RQV..........H...TMM 15 : 71870 ? g_agkld_scaff_4527_1.Contig4 MP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.S.RQV..........H...TVM 3 : 19288 ? g_apppz_scaff_4373_1.Contig11 MP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.S.RQV..........H...TTM 21 : 85040 ? g_adlvn_scaff_4304_2.Contig9 MP..L...Y.NT.LLE.....HFNK.LC....------------------------------15 : 71870 ? Capitella-pb__EU196538_ HP..L...Y.NT.LLE.....HFNK.LC....I.I..S.D.T.RQV..........F...--EU196538 Tribolium-zen__NP_001036813_ AGK.A...Y.SA.LVE..R..HHGK.L.....IQI.EN...S.RQ..I........H.KE--NP_001036813 g_adhdu_scaff_4669_1.Contig2 PTK.V...Y.SA.LVE.....HFNR.LC....I.M.NL...T.RQ..I........Y.K---21 : 85075 hox3 g_amdsu_scaff_4454_1.Contig21 --K.A...Y.SA.LVE.....HFNR.LC....V.M.NL.S.T.RQ..I........Y.KEQKN 15 : 71825 hox3 Capitella-Hox3__EU196539_ PSK.A...Y.SA.LVE.....HFNR.LC....I.M..L...T.RQ..I........Y.KD--EU196539 Lottia-Hox3__56601_ PAK.A...Y.SA.LVE.....HFNR.LC....I.M..L...S.RQ-----------------JGI Lotgi1 protein id: 56601 g_amdsu_scaff_4454_1.Contig17 --K.P...Y.TA.LTE.....HFTR.LC....V.M.GL.H.T.RQ..I.......-------15 : 71825 hox3 Branchiostoma-Hox3__X68045_ AGK.A...Y.SA.LVE.....HFNR.LC....V.M..M...T.RQ..I........Y.KE--X68045 Drosophila-zen__AAF54087_ KLK.S.....SV.LVE..N..KSNM.LY.T..I.I.QR.S.C.RQV.I........F.KD--AAF54087 g_aheja_scaff_4145_1.Contig5 KTK.T..S.NGV.LLE.....TNNM.L..L..I.I.NY...S.KQV.I......V.F.KEG-23 : 89444 gsx g_atxvn_scaff_4590_1.Contig4 SSKKI..VY..S.LLE..R..AANM.LT.L..I.I.TY.S.S.KQV.I......V.Y.KE--5 : 33116 gsx g_aneyu_scaff_4415_1.Contig10 SSK.I.....ST.LLE..R..SANM.L..L..I.I.TY...S.KQV.I......V.Y.KE--6 : 37542 gsx Gsx1___Mus_musculus_ SSK.M.....ST.LLE..R..ASNM.L..L..I.I.TY...S.KQV.I......V.H.KEGKG AAI37770.1 g_aucja_scaff_4557_1.Contig17 --K.I.....SA.LLE..R..STNM.L..I..I.ISKY...S.KQV.I......V.H.KEG-19 : 80662 gsx 
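The table header states that '.' marks a match. A minimal sketch of how a row could be expanded back into a full residue string is given below. It assumes that every row is compared column by column against the EVX-1 reference sequence in the first row, and that '-' marks positions not covered by a contig. Both are readings of the table layout, not statements from the authors. The function names `expand_row` and `identity_fraction` are hypothetical and do not come from the JAM pipeline.

```python
# Reference homeodomain from the first table row (EVX-1, accession P23683.1).
EVX1 = "QMRRYRTAFTREQIARLEKEFYRENYVSRPRRCELAAALNLPETTIKVWFQNRRMKDKRQRLA"

# One dot-notation row copied from the table (Nkx-2.2, accession NP_002500.1).
NKX22_ROW = "KK.KR.VL.SKA.TYE..RR.RQQR.L.A.E.EH..SLIR.TP.QV.I....H.Y.M..A.AE"


def expand_row(reference: str, row: str, match_char: str = ".") -> str:
    """Replace each match character with the reference residue in that column.

    Assumption: rows are aligned column by column with the reference, so
    both strings must have the same length. Gap characters are kept as-is.
    """
    if len(reference) != len(row):
        raise ValueError("row and reference must have the same length")
    return "".join(ref if ch == match_char else ch
                   for ref, ch in zip(reference, row))


def identity_fraction(row: str, match_char: str = ".",
                      gap_char: str = "-") -> float:
    """Fraction of non-gap columns that match the reference.

    Assumption: gap_char marks positions absent from the sequence, so those
    columns are excluded from the denominator.
    """
    covered = [ch for ch in row if ch != gap_char]
    if not covered:
        return 0.0
    return sum(ch == match_char for ch in covered) / len(covered)


if __name__ == "__main__":
    print(expand_row(EVX1, NKX22_ROW))
    print(round(identity_fraction(NKX22_ROW), 3))
```

Rows whose trailing gap run is fused to the linkage-group number in this extraction would need to be split at the 63rd column before being passed to either function.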