![]() ![]() |
![]() |
UFDC Home | myUFDC Home | Help |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Full Text | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PAGE 1 1 PAGE 2 2 PAGE 3 3 PAGE 4 Firstofall,IwanttoexpressmydeepestgratitudetoDr.GeorgeCasella.Notonlyforhispatientadvisementfornumerousacademicproblems,histime,hispealsofwisdomsharedwithme,butalsoforforsupportingme,encouragingmeandinspiringmetoalwaystobebetter.IalsowanttothankDr.HaniDoss,Dr.JohnDavis,Dr.GaryPeterandDr.RonglingWuforservingasmycommittee.Iwouldliketothankmyparents,PingLiandXuemeiTang,fortheirendlesslove,constantemotionalsupportandfortheirbeliefinme.Ithankmysisterforalwaysbeingthereformeandlovingme.IcouldneverthankmyhusbandJianenoughforhisloveandhisemotionalsupport.Hehasalwaysbeenkeepingmeinpeaceandcalminthosedays.Withoutthat,Icouldnotnishthisjourney. 4 PAGE 5 page ACKNOWLEDGMENTS ................................. 4 LISTOFTABLES ..................................... 7 LISTOFFIGURES .................................... 8 ABSTRACT ........................................ 9 CHAPTER 1LITERATUREREVIEWANDPROJECTINTRODUCTION .......... 11 1.1IntroductiontotheProject .......................... 11 1.2IntroductiontoMissingData ......................... 14 1.2.1PatternsofMissingData ........................ 14 1.2.2MechanismofMissingData ...................... 17 1.3ExistingMethodsforMissingData ...................... 19 1.3.1MaximumLikelihood .......................... 21 1.3.2EMAlgorithm .............................. 21 1.3.3MultipleImputation ........................... 23 1.4BayesianVariableSelection .......................... 25 1.4.1SemiautomaticBayesianVariableSelectionMethod ......... 27 1.4.2AutomaticBayesianVariableSelectionMethod ............ 28 1.4.3StochasticSearchAlgorithm ...................... 30 1.5OutlineoftheDissertation ........................... 31 2AHIERARCHICALBAYESIANMODELFORGENOME-WIDEASSOCIA-TIONSTUDIESOFSNPSWITHMISSINGVALUES .............. 33 2.1Introduction ................................... 33 2.2ProposedMethod ................................ 39 2.2.1TheModelwithoutRametRandomEect .............. 39 2.2.2TheModelwithRemeteRandomEect ................ 45 2.2.3IncreasingComputationSpeed ..................... 50 2.3ResultsforSimulatedData ........................... 51 2.4ResultsforLoblollyPine ............................ 53 2.5QuantifyingtheCovarianceandVariance ................... 57 2.6Discussion .................................... 73 3BAYESIANVARIABLESELECTIONFORGENOMICDATAWITHMISS-INGCOVARITES .................................. 77 3.1Introduction ................................... 77 3.2BridgeSamplingExtension ........................... 79 3.2.1GeneralFormula ............................. 82 5 PAGE 6 .................. 92 3.2.3ComparisonwiththeSimplestModel ................. 94 3.2.4MarginalLikelihoodform(Y) ..................... 98 3.3MarkovChainMonteCarloProperty ..................... 101 3.3.1CandidateDistribution ......................... 103 3.3.2ConvergenceofBayesFactors ..................... 104 3.3.3ErgodicityPropertyofThisM-HChain ............... 105 3.3.3.1Fixedn,uniformlyergodicconvergestothedistribution^B(n) 106 3.3.3.2ErgodicconvergencetoB 108 3.4ComputationSpeed ............................... 110 3.4.1MatrixInversion ............................. 110 3.4.1.1TwocolumnsofparametersforonecolumnofSNPs .... 110 3.4.1.2Determinantcalculation ................... 118 3.4.2ReplaceZwithanAverage ....................... 122 3.5SimulationandRealDataAnalysis ...................... 128 3.5.1Simulation ................................ 128 3.5.2RealDataAnalysis ........................... 129 4SUMMARYANDFUTUREWORK ........................ 132 4.1Summary .................................... 132 4.2FutureWork ................................... 133 APPENDIX AERGODICITYOFGIBBSSAMPLINGWHENUPDATINGZMATRIXBYCOLUMNS ...................................... 135 BANALGORITHMFORCALCULATINGTHENUMERATORRELATION-SHIPMATRIXR 140 REFERENCES ....................................... 142 BIOGRAPHICALSKETCH ................................ 146 6 PAGE 7 Table page 1-1Illustrationofunivariatenon-response ....................... 14 1-2Illustrationofmultivariatemissing ........................ 15 1-3Illustrationofmonotonemissing .......................... 15 1-4Illustrationofgeneralmissingdatapattern .................... 16 1-5Filematchingmissingdatapattern ........................ 16 2-1Illustrationofcombinationsofmissingdataforoneobservation. ......... 37 2-2ThepercentagesofSNPcategoriesinthegenerateddatasets. .......... 50 2-3ThepercentagesofcorrectlyimputedSNPsfordierentprobabilitiesofSNPcategories.10%missingvaluesexist. ........................ 52 2-4Theestimatedmeansoffamilyeectsfordierentdatasetswithdierentper-centagesofmissingvalues.Thismethodologygiveaccurateestimatesasthepercentageofmissingvaluesgoesupto20%. .................... 52 2-5TheestimatedmeansofSNPeectsforthedatasetwithoutmissingvaluesandfordatasetswithdierentpercentagesofmissingvalues. ............ 53 3-1ComparisonoftimespentoninversecalculationusingstandardsoftwareandMiller'smethodwith2columnsand2rowsupdated. ............... 115 3-2Comparisonoftimespentincalculationofmatrixinversionfordierentmethods. 118 3-3RecordsofsubsetsindicatorswithactualvaluesofforTable 3-4 andTable 3-5 ........................................... 126 3-4Bayesfactorcalculationapproximation.Userst20000iterationasburninandtakethefollowing400samplesforthecalculation. ................ 127 3-5Bayesfactorcalculationcomparisons.Userst40000iterationasburninandtakethefollowing400samplesforthecalculation. ................ 130 3-6SimulationresultsofBayesianvariableselectionfor15SNPsand450observa-tions,10%randommissingvalues.UsingtheBayesfactorestimationformula ............................................. 131 3-7Bayesvariableselectionforlesionlengthdataset.UsingtheaverageofimputedmissingSNPsasifobserved.20000stepsofburninandanother20000stepsassamples. ....................................... 131 7 PAGE 8 Figure page 2-1Thersttraceplotisfortherst2familyeectparametersforthelesionlengthdata.ThesecondplotisforoneofSNPparameterforthecarbonisotopedatafromPaltaka,Florida.Thesamplesaretakenafterinitial40000stepsofburnin. ............................................. 56 2-295%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforlesionlength.The22thSNPhassignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe2ndandthe23thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. .............. 58 2-395%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforthecarbonisotopedatafromPaltaka,Florida.The16thSNPhassig-nicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe6ndandthe40thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologi-calexploration. ................................... 59 2-495%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforcarbonisotopedatafromCuthbert,Georgia.The8th,35thandthe36thSNPhavesignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe13ndandthe28thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. ............................ 60 8 PAGE 9 Withadvancingtechnology,largesinglenucleotidepolymorphism(SNP)datasetsareeasilyavailable.FortheADEPT2project,wehavecandidateSNPsandinterestingphenotypictraitvaluesavailable,whileabout10%oftheSNPsaremissing. StandardsoftwarepackagescannotdealadequatelywithmissingSNPdata.Forexample,SASeitherusesanavailablecaseanalysis(whichemploysallthecompletecasesfortheinferenceoftargetparameters)ortheprocedureMI(orMIANALYZE)whereSASassumesmultivariatenormaldistributionsforallthevariables.Somesoftwaredeletestheincompleteobservations,whichisgenerallyunacceptablefordatasetswithmanySNPs,becauseitcangivebiasedestimates,orpossiblydeleteallthedata.Morerecently,singleSNPassociation,linkagedisequilibriumbasedimputation,andhaplotypebasedimputationhavebeenproposed. IdescribeaBayesianhierarchicalmodeltoexplaintheSNPeectsforthephenotypictraits,andincorporatedfamilystructureinformationfortheobservations.Forthisassociationtest,theinformationofthedegreeoflinkagedisequilibriumisnotrequiredandmissingSNPsareimputedbasedonalltheavailableinformation.WeusedaGibbssamplertoestimatetheparametersandprovethatupdatingoneSNPateachiterationstillpreservestheergodicpropertyoftheMarkovchain,andatthesametimeitimprovescomputationspeed. 9 PAGE 10 10 PAGE 11 McKeeverandHoward(1996 ).Thepinespeciesinthesouthernstatesproduces58%ofthetimberintheUSand15:8%oftheworld'stimber,see WearandGreis(2002 ). Scientistsareinterestedindiscoveringtherelationshipbetweenthephenotypictraitsandthecomplexbiologicalrolesandfunctionsofgenesinloblollypine,whichcouldhelptoexplaintheevolutionofadaptivetraitsforotherlandplants.Rightnow,researchersfromseveraluniversitiesintheADEPT2projectarecollaboratingtoidentifyalleleswhichcontrolwoodpropertiesanddiseaseresistance.AnothergoaloftheADEPT2projectistoassociateallelicvariationwithphenotypicvariationforsomesubsetofgenes.Morespecically,thereare4objectives:1)toidentify5000targetcandidategenesforwoodpropertyanddiseaseresistancetraits;2)todiscoverallelesfor5000candidategenesusingahighthroughputresequencingpipeline;3)toestimatetheextentoflinkagedisequilibriuminregionsofasmallnumberofcandidategenes;4)todetectandverifyassociationsbetweenSNPsin1000candidategenesandasuiteofwoodpropertyphenotypes,diseaseresistancephenotypes. InthepreviousADEPTproject,SNP,orsinglenucleotidepolymorphismdiscoverywasdonefor50genesinvolvedinresistancetodiseaseandresponsetowater-decit.AsinglenucleotidepolymorphismisaDNAsequencevariationoccurringwhenanucleotide-A,T,C,orG-diersfromothernucleotideintheindividualsfromthesamespeciesatthesamelocus.TheSNPsmayoccurinthecodinggenesornon-codinggenes,soresearchersaretryingtodetecttheSNPswhichhavesignicantinuenceforthequantitativetraitsfromlargeamountofpotentialSNPs. 11 PAGE 12 Kayihanetal.(2005 ).Followingtheterminologyusedinthisproject,tworametswithinanycloneshareexactlythesamegeneticinformationwhiletwocloneswithinanyfamilyonlysharethesameparents,andtheycanbeconsideredtobesiblings. Loblollypineisthehostoftwofungalpathogens,whichcausesthefusiformrustdiseaseandpitchcankerdisease.Fusiformrusthasaspindle-shapedswellingonbranchesandstemsofloblollypinetreesandthisdiseaseiscausedbyafungus.Fusiformrustwillformstemgalls,whichleadtoshortsurvivaltimes,badwoodpropertiesandslowgrowth.ThisdiseasecauseslossesofhundredsofmillionsdollarsinsouthernUnitedStates.Pitchcankerisalsoaveryimportantdisease,whichiscausedbyfungusandproducesresinouslesionsandleadstoseedlingmortality,decreasedgrowthratesandcrowndieback. ThelargepitchcankerdatawererecordedintheUSDAForestServiceResistanceScreeningCenterinBentCreek,NorthCarolinaandthesmallerpitchcankerscreenwasconductedattheUniversityofFlorida.BothtengallrustscreensandonegallrustscreenswereinoculatedinNorthCarolinawithdierentdensityofspores=ml.ThisdatasetinformationiscalledCCLONESandwillbereferredbythenameCCLONESlater.WewillhaveanotherdatasetlaterfromNCSUwhichconsistsofindividualtreesnotrecentlymatedtoeachotherandcanserveasanaturaloccurringpinepopulationandwillbeusedforSNPvericationattheendoftheADEPTproject. Forthisdissertation,wehave46genotypedSNPsintheCCLONESpopulationforassociationdiscoveryandthemethodsdevelopedinthisstudywillbeappliedforthe7600SNPslater.Theresponsearemainlythemeasurementsoflesionlengthforfusiformrustandpitchcanker,wherelesionisaninfectedorwoundedpatchofskin. 12 PAGE 13 Thegoalsofthisdissertationare: 13 PAGE 14 1.2.1PatternsofMissingData UnivariateNon-response Table1-1. Illustrationofunivariatenon-response AgeWeight(lb)Height(feet)Race 281605.8white341156.1black552165.4white452306.3* Forthedatasetswithunivariatemissingpattern,onlyonecolumnhasmissingdataandalltheothercolumnshavecompleteinformation.Iuse\"todenotethemissingvalue.SupposewewanttoknowtherelationshipofHeightversusAge,Weight,andRacefromthedatainthistable.Ifthedatasetiscomplete,amostpossiblemethodtoemployislinearregression.Foroursituation,peoplemightsuggestdiscardingthelastobservationandcontinuetodolinearregression.Butisthedatamissingcompletelyatrandom?Willdeletingtheincompletecaseleadtoabiasedestimator? Multivariatemissing Insurveys,itisusualthatthedesignvariablesareknownbeforethestudy,butsomevariableswillnotalwaysbelledinbythepeoplebeingsurveyed.Thefollowingisanexampleofmultivariatemissingpattern. 14 PAGE 15 Illustrationofmultivariatemissing LocationImmigrationIncomeHouseholdnumberWorkingyears AlachuaYes53000510AlachuaYes6500054LeeNo7000062LeeNo***AlachuaNo*** IntheTable 1-2 thecountynameandimmigrationstatusaredesignvariablesandcanbelledinbeforethesurvey.Theleftcolumnsarequestionssupposedtobeansweredinaquestionnaireandsometimespeopledonotanswer.Again,beforedeletingtheincompletecases,weneedtoaskourselvesisthemissingdatamissingcompletelyatrandom?Arethereanybiasesofthemissingdata?Ifthemissingdataaremissingcompletelyatrandom,therewouldbenoharmtojustdeletingtheseincompletecases. Monotonemissing Thiskindofmissingpatternoftenhappensinlongitudinalstudieswheresubjectsdropoutovertimeandthecolumnrecordsinthelatertimetendtohavemoremissingdata.Forexample,Table 1-3 isamedicalexperimentandresearcherswanttotesttheeectofacertaindrugforbloodpressure: Table1-3. Illustrationofmonotonemissing Weight(lb)BPin1MBPin3MBPin6MBPin9MBPinyear1 150145150150156160167111132140145150200167150156160170230170****180210200*** Intheabovetable,\BPin1M"meanstherecordofbloodpressureinonemonth.Iftheobservationismissingat3months,thentherecordsforthelatermonthstendtobemissingtoo.Thereasonfortheoccurringofmissingvaluesmightbethatthesubjectmovedtoanewcity,orhewastoosicktocontinuetheexperimentorjustanyotherreasonthathecouldn'tcontinuetheexperiment.Itwilllikelytointroducebiasifwejust 15 PAGE 16 Generalmissingpattern Thegeneralmissingpatternisthepatternthatcannotbeputinanyspecialcat-egory.Forexampleingeneticdataanalysis,thefollowingdataaretypical:Inthis Table1-4. Illustrationofgeneralmissingdatapattern PhenotypicvaluefemaleparentmaleparentSNP1SNP2SNP3 15.331134513453ggtt*16.2351134513451*tcgc17.891135411542gc**19.311145311671**cc20.321134511651cccc* example,therstcolumnrecordsthephenotypicvalueofthetraitwhichtheresearcherisinterestedin.ThesecondandthethirdcolumnrecordstheparentsinformationandtheleftcolumnsaretheSNPinformation.Formicroarraydata,typicallywehavemorecolumnsofSNPinformationthanthatoftheabovetable.Solistwisedeletionistotallyinappropriatesinceittendstodeletethemajorityofthedataandwastelotsofeort.Moreappropriatemethodsshouldbeapplied. Filematching Inthiscategory,foreachobservation,onlyonevariableisobservedoutofthecovari-atestobelledin.Sointheworstcases,theremightbenocompletecases.Denitelybettermethodsthanlistwisedeletionneedtobeemployed.Thefollowingisanillustra-tion: Table1-5. Filematchingmissingdatapattern AgeJobtitleIncome(dollars) 55manager*40banker*31*7000023*3500049CEO* 16 PAGE 17 LittleandRubin(1987 ).Accordingtotheirdenition,themechanismofmissingdatacanbedividedinto3categories.SupposetherearemissingdataonvariableYandweusevariableMtodenotewhetherthevariableYisobservedornot.M=1meansthevariableismissingandM=0,otherwise. IfwedenotetheobservedpartofYbyYobs,andthemissingpartbyYmis,thenY=(Yobs;Ymis).Letbetheparameterofthemissingmechanism.ThesituationofmissingcompletelyatrandomcanbeexpressedasP(MjYobs;Ymis;)=P(Mj): PAGE 18 Onethingtonoteisthat,strictlyspeaking,wecannotverifythatthemissingdataisMARsincethemissingdataarenotobservedandthusimpossibletoverifywhethertheprobabilityofthemissingdataaretotallyindependentoftheunobservedvalue.However,MARisareasonableassumptioninmanycasesanditisaneasierassumptionforstatisticalinferencethanNMAR,unlessthereisevidenceofnotmissingatrandom,generallywecouldassumeitisMAR. Whenthedatasetiscomplete,directprocedureslikemaximumlikelihoodcanbeemployedforthestatisticalinference.Wewouldhave^=argmaxP(Yj): LittleandRubin(1987 ).Whenthedistinctnessholds,wehave:P(M;Yobs;Ymisj;)=P(MjYobs;Ymis;)P(Yobs;Ymisj): PAGE 19 =ZP(MjYobs;)P(Yobs;Ymis;j)dYmis=P(MjYobs;)ZP(Yobs;Ymisj)dYmis=P(MjYobs;)P(Yobsj): undertheassumptionofignorablenon-response,^=argmaxP(Yobsj); 19 PAGE 20 Thelistwisedeletionmethodbasicallydeletestheincompletecasesandprocessesthestatisticalanalysisasiftherewerenomissingdata.Thismethodisthemoststraightfor-wardmethodandisthedefaultoptioninalotofpopularsoftware.WhenthedataareMCAR,listwisedeletionisequivalenttoasub-samplingofthepopulationandtheresult-ingstatisticalinferenceislegitimateexceptthatwewillgetlargerstandarderrorbecauseoffewerobservations.WhentheassumptionofMCARisviolated,listwisedeletionwillgivebiasedestimatessincetheremainingdataareweightedmorethantheyshouldbe.Anotherdisadvantageofthismethodisthatittendstolosesubstantialinformationfromthedataifthemissingdataoccurinmultiplecovariates. Anothermethodispairwisedeletion,whichcanbeusedinlinearregressiontoestimatethemeansorthecovariancematrix.Theideaofpairwisedeletionistouseallthecasesthatareavailableforthesummarystatisticswhichwereintendedtobecomputed.Forexample,supposewehaveabivariateresponse(Y1;Y2)andeachvariablehasmissingdata.WhenwecalculatethemeanforY1weusealltheobserveddataforY1althoughthecorrespondingobservationforY2mightbemissing.WhenwecalculatethecovarianceforY1andY2,thecaseswithbothY1andY2areobservedwillbeused.ThebiggestproblemofpairwisedeletionisthattheestimatesandstandarderrorsinmostsoftwarearebiasedwhenthemissingvaluesarenotMCAR,sincetheprincipleofpairwisedeletionisambiguousinitsimplementationinsoftware.WhencomputingthecovarianceforY1andY2,thereisnocleardirectionaboutwhichobservedvaluesshouldbeusedtocalculatethemeanofY1,thecompletecasesforY1orthecompletecasesforY1andY2. 20 PAGE 21 Maximumlikelihoodisespeciallyeasywhenthemissingdatapatternismonotonic.Considerthetwovariablecasewhere(Y1;Y2)withY2havingmissingdata.Y1hasnobservationsandY2areonlyobservedformobservationsandtheremainingnmobservationsofY2aremissing.Obviouslythisisamonotonemissingpattern.Supposethatwewritetheobservedlikelihoodas:L(;jY)=mY1h(Y2jjY1j;)nY1g(Y1jj); Dempsteretal.(1977 ).ThebasicideaofEMistomaximizethelikelihood 21 PAGE 22 Inanyincomplete-dataproblem,thedensityofthecompletedatacanbefactoredas:p(Yj)=p(Yobsj)p(YmisjYobs;): wherel(jY)istheloglikelihoodofp(Yj)andl(jYobs)istheloglikelihoodofp(Yobsj)andcisaconstant.p(YmisjYobs;)iscalledthepredictivedistributionofthemissingdatagiven.BecausethesecondtermintheaboveformulahasYmisinit,wecannotmaximizeEquation 1{2 directly,insteadwewillintegrateouttheYmisoverthepredictivedistribution.Supposethecurrentrealizationofis(t)andthepredictivedistributionisP(YmisjYobs;(t)),theintegrationyields:l(jYobs)=Q(j(t))+H(j(t))+c; SotheEMalgorithmconsistsof2stepsofiterations: RepeatthetwostepsuntiltheMLestimatesconverge. Wu(1983 )wasthersttoprovethat(t)!^. Dempsteretal.(1977 )provedthatifwelet(t+1)bethevaluethat 22 PAGE 23 Dataaugmentationby TannerandWong(1987 )isanotherwidelyusedmethodwhichoftenisusedtoexploretheposteriordistributionwithmissingdata.IthasastrongresemblancetotheEMalgorithmanditisalsocomposedoftwosteps.Intherststepofdataaugmentation,itsimulatesacompletedatasucientstatisticwhileinEMalgorithmtherststepisusedtocalculatetheexpectation.Theotherstepindataaugmentationdrawsrandomsimulationofparametersfromtheposteriordistributionofcompletesucientdatawhichcomesfromtherststep,whiletheotherstepoftheEMalgorithmmaximizestheparameter.DataaugmentationassumesMARalso. Rubin(1978 ).Aconsiderablenumberofbookshavebeendevotedtoimplementingtheframeworkofmultipleimputation,suchasAnalysisofIncompleteMultivariateDataby Schafer(1997 )andMissingDataby Allison(2002 ).GenerallytheyassumeMARifthemissingdatamechanismisnotmodeled. Theideaofmultipleimputationistocompletethedatatogetausablelikelihood,basedonwhichstatisticalinferenceismucheasiertobeconducted.Togetcompletedata,randomimputationtakestheplaceofmissingvalues.Sinceweknowtheimputedvaluesarenotthetruevaluesofmissingdata,weneedtoimputethedatamultipletimestolettherandomeectofimputationcenteraroundtheunobservedtruevalue.Themultiplecompletedatasetsthenarecombinedtogiveestimates. SupposethereisdataY(mis)whichhasmissingvaluesinitandwecreateKcompletedatasetsY(C1);Y(C2);:::;Y(CK).SupposeweperformregressionanalysisforeachcompletedatasetY(Cj)andhaveestimate^(Cj),wethenaveragetheestimates^(Cj)togetabetter 23 PAGE 24 ThewithinvarianceiscalculatedasU=1 andthecorrespondingdegreesoffreedomVKisVK=(K1)[1+U B]2:ThiscombinedvariancewillapproximatelyhaveatdistributionwithdegreesoffreedomVK. BarnardandRubin(1999 )suggestedusingVK=[1 Lietal.(1991 )and MengandRubin(1992 )suggestthefollowingasanapproximatelikelihoodratiotest.DK=( PAGE 25 Lietal.(1991 )showthattestisquiterobusttotheviolationofthisassumptionwhenK3. OnequestionthatneedstobeansweredishowbigshouldKbe? Rubin(1987 )givesthefollowingformulatocalculatetheeciencyunderproportionofmissingdataandKcreatedcompletedatasets:Ef=(1+ K)1: Inpractice,multipleimputationrequiresamechanismtoimputethemissingdataandthereareanumberofwaystodothat. Reilley(1993 )considerstheHotDeckimputationwhichdrawsrandomsamplesfromtheactualobservedvalueswithequalweightsasanimputation.Thisisaneasilyemployedmethodand Efron(1994 )callstheHotDeckmethodanon-parametricbootstrapmethod. XieandPaik(1997 )proposedusingtheBayesianversionofHotDeckmethod;insteadofdrawingtherandomsamplewithequalweightsfromtheobservedvalues,theyputaDirichletpriorontheweights.Anotherimputationmethodassumesallofthecovariatesarenormallydistributed,evenforcategoricalvariables.Thecategoryisroundedtothenearestcumulativefunctioncategory. SmithandSpiegelhalter(1980 )proposedauniedapproachofpriorspecicationformodelchoiceandtheyshowedthatdierentpriorapproachescouldleadtoaSchwarz-typecriterionorAkaikeInformationCriterion. MitchellandBeauchamp(1988 )useda\spike 25 PAGE 26 GeorgeandMcCulloch(1993 ), GeorgeandMcCulloch(1995 ),and GeorgeandMcCulloch(1997 )advocatedtheMarkovchainMonteCarloapproachfortheposteriorcalculationandthismadethevariableselectionforlargecandidatespossible. Brownetal.(1998 )extendedtheapplicationofBayesianvariableselectiontomultivariateresponsesandaMarkovchainMonteCarloalgorithmwasemployedtospeedupthecomputation. BergerandPericchi(2001 )proposedtheobjectiveintrinsicBayesFactorsformodelselection.AnobjectivefullyautomaticBayesianprocedurewasproposedby CasellaandMoreno(2006 ).Theyusedintrinsicpriorstocalculatetheposteriorprobabilitiesandastochasticsearchalgorithmwasemployed.Justafewrelatedpapersarelistedaboveandmanyothersexistintheliterature. Innormallinearregression,wehaveadependentresponseYn1andaset(X1;:::;Xs)ofspotentialexplanatorypredictorsandweassumeY=X+,withYjXN(0;2I).Theactualvalueofsomeofthei;i=1;:::;s;maynotbesignicantlydierentfrom0ormaybecorrelatedwithanother.ThegoalofvariableselectionistondasetofpredictorsX1;:::;Xr,whichisasubsetofX1;:::;Xs,suchthateachcorrespondingihasapracticalsignicanteectonYandthefullmodelissimpliedtoacertaindegree.IwillemployanindexvectorwithlengthsandleteachelementofindexwhetherthecorrespondingpredictorXiisincludedintheselectedmodelornot.Thevaluei=1meansXiisintheselectedmodelandi=0otherwise.Sothevariableselectionproblemistondavectoraccordingtosomecriterion. ManyvariableselectionmethodshavebeenbasedontheAkaikeInformationCrite-rion,AIC Akaike(1973 ),Mallows'Cp )andBayesianInformationCriterion,BIC Schwarz(1973 )andhavebeenappliedtomanyproblemswhensisreasonablysmall. 26 PAGE 27 SupposeXrepresentstheselectedpredictormatrixindexedbyandlet^=(X0X)1XYdenotetheleastsquaresestimateof.Iusertorepresentthenumberofvariablesintheselectedsubsets.ThesumofsquaresforregressionisSS=^0X0X^: where^2isanestimateof2andCisapenalizedtermchosenbydierentcriterion.IfCtakesthevalueof2,CpisresultedandgivesexactBICtoo.ForAIC,takeC=logn. GeorgeandFoster(1994 )proposedtotakeC=2logpastheriskinationcriterion,RIC.Theabovementionedmethodshavedierentmotivations,unbiasedpredictiveriskestimateforCp,expectedinformationdistanceforAICandasymptoticBayesfactorforBIC.Whenthenumberofavailablepredictorsaresmall,theabovemethodshavebeenwidelyapplied.Whenconsideringproblemswithhundredsofpredictors,eachpotentialpredictorcaneitherbeincludedintheselectedmodelornotandthereforethereare2smodels,anumberthatcaneasilygobeyondthecomputationabilityandtheabovementionedcriteriawillbeimpossibletobeapplied.Newmethodologiestohandlehundredsofpredictorsareneeded. PAGE 28 CuiandGeorge(2008 )alsoproposedputtingpriorsforcandpiandintegratingthemout. Inthehierarchicalmodel,thepriorfor2needstobespeciedtooandnormallyitistakenastheinverseGammaGa(=2;),whereandarehyper-parameters.Moredetaileddiscussionabouthowtodecidethehyperparameterc;p;andhasbeengivenin GeorgeandMcCulloch(1993 ), GeorgeandMcCulloch(1997 ). Withtheabovespecication,theposteriordistributionf(jY)couldbecalculatedandthentheideaistoranktheposteriordistributiontodecidetheidealsetsofpredic-tors.Theactualposteriordistributioncalculationisachallengeandmethodstoaddressthiswillbediscussedinalatersubsection. )proposedthefullyautomaticBayesianvariableselectionprocedurebyusingintrinsicpriors.Letrepresentthesetofrealizationsofalland=1correspondstothefullmodelwithallpotentialpredictors.TheBayesfactorisdenedasB1=m(Y;X) PAGE 29 1+P2;6=1B1(Y;X);2; 2; Theintrinsicpriorapproachcomesfromthemodelstructureandisfreeofhyper-parameters(donotneedtoconsiderarangeofhyperparameters),soitcanbeusedasdefaultprioranditiscurrentlyoneoftheuniqueobjectiveprocedures. 29 PAGE 30 AgeneralwaytoimplementanMCMCprocedureistorunaGibbssampler(0;0;0);(1;1;1);:::;(t;t;t);::: CasellaandMoreno(2006 )havemoredetaileddiscussionaboutit. AnotherwaytoexploretheposteriordistributionofP(jY;X)istoconstructaMetropolis-HastingsMarkovChain.Normallythechainwillnotonlyvisitallthemodels,butalsovisitthebettermodelmore.Thisishowitworks:atiterationt,drawacandidatefromacandidatedistributionV;acceptthecandidatewithprobabilitymin1;P(t0jY;X)V(t) )usedtheGibbssamplertogeneratesamplesfromthejointdistributionofmodelandparametersconditionedontheresponse.Thismethodis 30 PAGE 31 Dellaportasetal.(2002 )proposedahybridGibbs-MetropolisstrategybasedonCarlinandChib'smethod. Green(1995 )suggestedthereversiblejumpMCMCmethodwhichcouldbeusedformodelselectionwithdierentdimensions. Chib(1995 )alsoproposedtoestimatethemarginallikelihoodbyusingblocksofparameters.Theselistedmethodsgenerallyworkwellformoderateamountsofcandidatevariablesandsomeofthemhavegoodapplicationresultsforspecialsituations.Forexample,reversiblejumpisgoodformixturemodelingwithunknownnumberofelements.However,thereisalmostnoinvestigationinvariableselectionwhenacertainpercentageofvariablesaremissing. InChapter2,anassociationtestisproposedtodetectthesignicantSNPsforthetargetphenotypictraits.Asthepopulationstructurecontainssubstantialgeneticinformationofthepopulationanditindexestheclosenessoftwoindividualswithinthepopulation,weuseanumeratorrelationshipmatrixtoquantifytheclosenessbetweenanytwoindividualswiththepopulation.ABayesianhierarchicalmodelisproposedandtheGibbssamplerisemployedtosampletheparametersofthejointlikelihood.InordertobeabletohandlethousandsofSNPstowardstheendofADEPT2project,methodstospeedupthecomputationareproposed:aGibbssamplingprocedurewhichjustsamplesonecolumnofSNPsineachcycleinsteadofallthecolumnsofSNPsineachcycle;andamatrixidentitywhichtakesadvantagesoftheupdatingschemeandavoidsdirectmatrixinversioncalculation.Simulationstudiesandrealdataanalysewillbedetailed. InChapter3,aBayesianmodelselectionmethodisproposedtoselect\good"subsetsofvariables,meanwhilemissingdataarebeingproperlytakencareof.Weusethe 31 PAGE 32 InChapter4,asummaryofthedissertationisgivenandthecontributionofthisdissertationwillbediscussed.Somefutureworkwillalsobediscussed. 32 PAGE 33 Asthetechnologiesaredevelopingsorapidly,SNPdataaregettingcheaperandeasilyavailable.Atthesametime,scientistsarehavingmoreopportunitytousehighthrough-putSNPdatasetswhichwerenotpossiblebefore.AlthoughmicroarraytechnologycouldprovidehighthroughputSNPdata,atthecurrentability,typically5%to10%ofgeno-typesaremissingaspointedby Daietal.(2006 ).OnechallengescientistsarefacingistouseavailableSNPandphenotypictraitinformationtodetecttheSNPswhichhavestrongassociationwithphenotypictraits,atthesametimeastatisticalmethodisneededtoproperlyaddressthemissingSNPs. Populationassociationinvolvingmissinggenotypicmarkershasmadealotofprogressforhumangenetics,especiallyinthecaseoftightlylinkedgenotypes,thatis,mostatten-tionhasbeenfocusedonne-scalemolecularregion.\Phase"packageby Stephensetal.(2001 )aimsathaplotypereconstructionofgenotypeddataandusesanEMalgorithmtomaximizethelikelihoodofhaplotypefrequencies.Theyuseanapproximatetypeofpriorfortheconditionalhaplotypedistribution. ScheetandStephens(2006 )proposed 33 PAGE 34 ChenandAbecasis(2007 )proposedfamilybasedtestsforgenomewideassociation.TheyuseanidenticalbydescentparametertomeasurecorrelationforthetestSNPs,andakinshipcoecienttomodelthecorrelationbetweensiblings.Theirmethodisusedforonefamily,oneSNPatatime,anditseemsnotfullyapplicableforcomplicatedpopulationpedigreesandforsimultaneousSNPtest-ing. ServinandStephens(2007 )proposedaBayesianregressionapproachforassociationtest.Themissinggenotypeswereimputedby\Fastphase"beforehand,andamixturepriorisusedfortheSNPeect.ThepriorsforthenumberofsignicantSNPsissettobesmall,andhassomeinuenceontheresults. Daietal.(2006 )usedEM,weightedEM,andanonparametricmethod(CART)forassociationstudies.Theyusedmultipleimputationsamplesfromthetreebasedalgorithms. Robertsetal.(2007 )proposedafastnearest-neighborbasedalgorithmtoinferthemissinggenotypes. Marchinietal.(2007 )proposedaunifyingframeworkofmissinggenotypeimputationandassociationtestingbasedonhaplotypesandotheravailablehumangenomicdatasets. SunandKardia(2008 )proposedaneuralnetworkmodelforthemissingSNPsandusedtheBICtochoosethepredictorsinthemodelandfurtherpredictthemissingSNPsaccordingtothechosenmodel. Balding(2006 )givesadetailedreviewofassociationstudiesandsomemissinggenotypemethodsbasedonhumangenetics.Thereismuchmoreliteratureonthistopic,theabovebeingjustasample.Almostalltheliteraturewefound,however,focusesonhumangeneticassociationwherethemarkersaretightlylinkedandhaplotypebasedinferencearedominant.Somepapersaredevotedtothesituationsofparentsinformationbeingmissingandproposedlikelihoodbasedmethods,see Weinberg(1999 ), Martinetal.(2003 ),and Boylesetal.(2005 ). 34 PAGE 35 2.2 EachrowofthematrixZ,Zi;i=1;:::ncorrespondstotheSNPgenotypeinforma-tionofoneindividualandforoneindividualwewriteZi=(Zoi;Zmi): WhenwellinthemissingdatawewriteZi=(Zoi;Zmi); (22)n=2Yi2I0exp1 22(YiXiZi)2 22(YiXiZi)2 Theobserveddatalikelihood,whichisthefunctionthatweeventuallyusetoestimatetheparameters,isbasedontheobserveddataonly.Wherethereismissingdata,thecompletedatalikelihoodmustbesummedoverallpossiblevaluesofthemissingdata.So 35 PAGE 36 (22)n=2Yi2I0exp1 22(YiXiZi)2Yi2IMXZiexp1 22(YiXiZi)2; 22(YiXiZi)2 PZiexp1 22(YiXiZi)2;(2{3) wherethesuminthedenominatorisoverallpossiblerealizationsofZi.ThisisadiscretedistributiononthemissingSNPdataforeachindividual.Tounderstandit,lookatoneindividual. Supposethattherearegpossiblegenotypes(typicallyg=2or3)andindividualihasmissingdataonkSNPs.SothedataforindividualiisZi=(Zo;Zm)whereZmhaskelements,eachofwhichcouldbeoneofgclass.Forexample,ifg=3andk=7,thenZmcantakevaluesinthefollowingTable 2-1 ,wheretheshowsonepossiblevalueoftheZmi.Fortheexample,thereare37=2187possiblevaluesforZmi.Inarealdatasetthiscouldgrowoutofhand.Forexample,iftherewere12missingSNPsthenthereare531,441possiblevaluesforZmi;with20missingSNPsthenumbergrowsto3,486,784,401(3.5billion).IfweemploytheEMalgorithm,ineveryiterationstep,foreachobservationweneedtosumoveragiganticnumberoftermstocalculatethedistributionofthemissingdataandthisiscomputationalinfeasible.SothenwethoughtwemightrunaGibbsSamplingprograminsideofeachEMsteptogeneratethemeansforthemissingdataandavoidsummingoverbillionsofterms. TheGibbsSamplerforthemissingdatasimulatesthesamplesofZmiaccordingtothedistributionofeachmissingelementconditionedontherestofthevector.MoreofthiswillbediscussedinSection 2.2 andherewejustgiveaavorwhytheGibbssampler 36 PAGE 37 Illustrationofcombinationsofmissingdataforoneobservation. SNP 22(YiXiZoioiZmi(j)mi(j)cmij)2 P3`=1exp1 22(YiXiZoioiZmi(j)mi(j)c`mij)2(2{4) wherethereareonlyg=3termsinthedenominatorsumandthisbasicallymadetheGibbssamplertobecomputationalpossibleforus. Forplantassociationorotheragriculturalgenomeassociationstudies,thewholegenomesequencelibrarymightnotbecompleteandthedegreeoflinkagedisequilibriuminformationmightnotbeavailable.Thusnewmethodologiesareneeded. Toimputemissingdata,multipleimputationhasreceivedmuchattentionsinceitsproposalby LittleandRubin(1987 ).Itacknowledgestheuncertaintyduetomissingdataandbasesinferenceonmultiplyimputeddatasets.Itaddressesthevariabilityinthemissingdatathroughmultipleimputation.Mostmethodsforlargegeneticdatasetsintheliterature,except Daietal.(2006 ),usesingleimputationandbasetheimputationonthelargestprobabilityforthecandidatecategories.Althoughsingleimputationtendstogivefastcalculation,itgenerallylosespowerandcangivebiasedparameterestimates.Notlikedoingsingleimputation,wetreatthemissingSNPsasparametersandimputethemeachiteration.ThiswouldbelessbiasedcomparedwithsingleSNPimputationandwouldmoreaccuratelycapturethevariationinthedata. Inanassociationstudyacrosstheentirepopulation,typicallythefamilypedigreeisveryimportant,sinceitexplainsthegeneticinformationsharedbyrelativesandwhichisnotnecessarygenotyped. Yuetal.(2005 )usedakinshipmatrixandpopulationstructure 37 PAGE 38 ChenandAbecasis(2007 )proposedfamily-basedassociationtesting,buthismethodcountstherelationshipwithinfamiliesandnottherelationshipbetweenfamilies.Weproposetouseanumeratorrelationshipmatrixtoexplaintherelationshipofindividualswithinfamiliesandbetweenfamiliesaswell.Theideabehindcalculationofanumeratorrelationshipmatrixwasoriginallydueto Henderson(1976 );seealso Quaas(1976 ). WiththemotivatingCCLONESdataset,weproposeaBayesianmethodologyforassociationstudies.WemodelthemissingSNPsasparametersanduseGibbssamplingtosampletheparameters,includingthemissingSNPs.FormissingSNPimputation,thisisessentiallymultipleimputation.Wetakeadvantageofthefamilyinformationfromtheavailablefamilypedigreeandparentpedigree.Itisnotahaplotype-basedmethodanddoesnotrequiretheSNPstobeinlinkagedisequilibriumandallowsthemodeltondthecorrelation(ifitexists).ThispropertyisquiteappealingforplantassociationoranyotherassociationforwhichwedonothaveasequencedgenomelibraryordetailedinformationaboutthecandidateSNPs.ItisespeciallyusefulforgenomescansandcanpointoutsomesignicantSNPsforfurtherstudy.Finally,weprovethatjustupdatingonemissingSNPforeachobservationineachiterationwillstillachievethetargetstationarydistribution.ThiswillsubstantiallyincreasethecalculationspeediftherearelargenumbersofSNPsforeachobservation,andgivesustheabilitytodealwithlargehighthroughputdatasets. Thischapterisorganizedasfollows.InSection 2.2 weexplaintwoproposedmodels,andalsoproposeamethodtoincreasethecomputationspeed.InSection 2.3 ,simulationresultsaregivenandalsowecomparewithotherresults.InSection 2.4 ,therealdataanalysisresultsarereportedhere.InSection 2.5 ,weexplainamethodtocategorizethecovariancerelationshipsforanygivepedigreesituation.Intheend,atSection 2.6 ,wegiveadiscussion. 38 PAGE 39 Inthissection,wediscusstwomodelsforthisdatasituations.Forthesimplicityofanalysis,weemployoneofthethemthroughthesimulationandrealdataanalysis. Themodelis 39 PAGE 40 EachrowofthematrixZ,Zi;i=1;:::n;correspondstotheSNPgenotypeinforma-tionofoneindividual.Someofthisinformationmaybemissing,andwewriteZi=(Zoi;Zmi) whereZoiaretheobservedgenotypesfortheithindividual,andZmiarethemissinggenotypes.Notetwothings: 1. ThevaluesofZmiarenotobserved.Thus,ifdenotesonemissingSNP,apossibleZiisZi=(1;;0;0;;;1): IndividualsmayhavedierentmissingSNPs.Sofor2dierentindividuals,wemighthaveZi=(1;;0;0;;;1)Zi0=(;;1;0;0;1;1); 40 PAGE 41 2) (2)a+1: HobertandCasella(1996 ),wetakea;b;c;dtobespeciedvalues,whichensuresaproperposteriordistribution. Nowletusspecifythevaluesofa;b;c;dandmakesurethattheposteriordistribu-tionsareproper.Firstofall,wecanwritethemodelspecicationasthefollows HobertandCasella(1996 ),thefollowinginequalitiesmustbesatised 41 PAGE 42 Furthersimplicationweleadtothefollowingequations:0>c>sa>n Asthenumberofobservationsinthedatasetisaboutonethousand,thenumberofparametersforSNPsareaboutonehundredandthedimensionofis61,wetakec=1,a=3,b=1,andd=1.Theabovespecicationmakessurethattheposteriordistributionsareproperandthepriorshavewideatranges. Asourdatasetisfromtheloblollypinegenome,aswithmanyotheragriculturegenomes,itisnotfullysequencedandthereisnotenoughpriorinformationintermsofthefrequencyofgenotypesforeachSNP.Also,thephysicalpositionsoftheSNPsarenotfullyrecordedandtheSNPsmaynotbetightlylinkedasclustersofhaplotypes.SononinformativepriorsareusedforthemissingSNPs.AstheSNPsconsideredherearebiallelic,eachmissingSNP,has3possiblegenotypes:homozygous,heterozygousandmutanthomozygous.Forexample,ifthetwoallelesinthelociareAandG,themissing 42 PAGE 43 ForthemissingSNPsinthedata,weassumethattheyaremissingatrandom,MAR.Inanotherwords,weassumethattheprobabilitythatwhethertheSNPismissingisrelatedtotheobserveddata,suchasthephenotypictraitorotherobservedSNPs,andisindependentoftheunobservedinformation.Conditionalonthisassumption,wewillimputethemissingSNPsbasedonthecorrelationbetweenSNPswithinindividualsandbetweenindividuals,andusethephenotypictraitinformationtoimprovethepowerofimputation.MARisareasonableassumptionandnotasstrictasmissingcompletelyatrandom,MCAR. Inthismodel,thecovariancematrix,R,modelsthecovariancebetweenindividualswithinthesamefamily,andthecovariancebetweenindividualsacrossfamilies.Phenotypictraitsofrelatedindividualsarealikebecausetheysharealargefractionofgeneticmaterial.Genotypesofrelativesaresimilarbecausetheysharethesameparentsorgrandparents(insomedegree).GenotypedSNPsmayexplainsomepartofthephenotypictraits,stilltheun-typedgeneticinformationcontributestothephenotypictraitsaswellasotherfactors,suchasenvironmentaleects.SimplyusingtheSNPmarkerinformationwithoutconsideringthefamilypedigreeorhistoryisnotawiseapproach.WeexpectthatincorporatingthefamilystructureinformationwillincreasethepowerofcapturingtheunderlingnatureofthedatasetandhelptodetectthesignicantSNPs. Intheliterature,therearedierentmethodstocalculatetherelationshipmatrix,suchasusingaco-ancestrymatrix,akinshipmatrix,etc.Thebasicideaistocalculatetheprobabilitythat2individualsshareonegeneorSNP,whichispasseddownfromthesameancestry.Someofthemethodsemploypairwisecalculationandthusdonotguaranteeapositivedeniterelationshipmatrix,whichisusuallynotsatisfactorywhentherelation-shipmatrixisusedascovariancematrix.Weusetherecursivecalculationmethoddueto Henderson(1976 ).Thismethodgivesanumeratorrelationshipmatrixwhichquantiesthe 43 PAGE 44 B .Noticethatevenifthefamilypedigreeorparentpedigreeisnotavailable,wecanstillusetheproposedmethodtocalculatetherelationshipmatrix.Inthefollowing,wehaveSection 2.5 dedicatedtocategorizingtherelationshipsforanypedigreehistory. SameasEquation 2{6 ,wehavethefollowingmodelspecicationp(Yj;;Z;2;2)N(X+Z;2R);()/1;()N(0;22I);(2)IG(a;b);(2)IG(c;d): 21Z0R1(YX);2Z0R1Z+I 21!21 (2)n=2+s=2+a+1exp(YXZ)0R1(YXZ)+jj2 (2)s 22: 44 PAGE 45 whereKc=(YiXiZoioiZmi(j)mi(j)cmij); Themodelis 45 PAGE 46 Thevariancetermisspeciedastheidentitymatrixrightnowforthesimplicity.Thefullspecicationofmodel 2{9 includesthefollowingpriordistributions: 46 PAGE 47 (2)(n+s+r)=2(2)s=2(2)r=2exp1 22jYXZWuj2exp1 222jj2 222juj2(2)(2)(2) Accordingto HobertandCasella(1996 ),thefollowinginequalitiesneedtobesatisedtomakesurethattheposteriordistributionsareproper: 2c>rank(Z) (2{12) 2e>rank(W)2a>nc<0e<0rank(Z)>rank(K)t2crank(W)>rank(K)t2en+2e+2c+2ap>0; 2{12 47 PAGE 48 2{11 ,withaatprioron2;2and2asweshowedinabovethattheyensurethepropertiesofposteriordistributions,wehavethefollowingfullconditionalsdistributions. (2)(n+s+r)=2 22jYXZWu)j2+1 (2)r=2exp1 22juj2 (2)s=2exp1 22jj2 (2{14) =!ij((Zoi;Zmi(j);c))exp1 22(YiXiWiuZoioiZmi(j)mi(j)cmij)2 Pg`=1!il((Zoi;Zmi(j);c`))exp1 22(YiXiWiuZoioiZmi(j)mi(j)c`mij)2; Usingtheabovemethod,weneedtosampleuineachiteration,andthisisfeasibletheoretically.However,thedimensionofuisreallybigandwedonotreallywantthesamplesofu.Nowwewanttotrytointegrateuoutandgetthemarginaldistribution. 48 PAGE 49 22(YXZ)0B(YXZ)exp1 222jj2 2{11 areensuredtobeproper.Withthesameatpriors,theposteriordistributionsfromtheEquation 2{15 willstillbeproperbecausetheonlydierencemadehereishavingubeingintegratedoutandthisdoeschangethepropertiesofotherposteriordistributions. BasedonEquation 2{15 ,wehavethefollowingfullconditionals (2)(n+s)=2exp1 22(YXZ)0B(YXZ)+1 (2)s=2exp1 22jj2 Weransomesimulationsonthisdatasetandwefoundoutthecomputationtimeisespeciallylong.Thereasonsforthatiswhenweconsiderobservationsbasedonremetes 49 PAGE 50 ThepercentagesofSNPcategoriesinthegenerateddatasets. PercentagesSNP1SNP2SNP3SNP4SNP5 Doublehomozygous:13%30%91%77%39% Heterozygous:53%38%7%19%54% Mutanthomozygous:33%32%2%4%7% insteadofaveragesofremetes,thenumberofobservationsisabout4timesbigger.Consequently,thecomputationtimeismuchlonger,intermsofmatrixinversionandmatrixdeterminant.Wedecidetousetherstmodeloftheanalysisanddonotconsiderthevariationwithintheremetes.NoticethatthisdoesnotchangetheinferenceontheSNPs. 2{7 )and( 2{8 ),ifinsteadofupdatingalltheparameters(t);(t);Z(t)np;2(t);2(t)ineachcycle,wejustupdate(t);(t);Z(t)j;2(t);2(t)ineachiteration,theMarkovChainachievesthesametargetstationarydistributionandergodicityalsoholds.HereZjdenotesthejthcolumnintheZnpmatrixforSNPs(thejthSNPforallobservations)andjchangesaccordingtotheiterationindex. A .TheactualmeaningofthistheoremisthatinsteadofupdatingtensorhundredsofSNPsinonecycle,wejustneedtoupdateoneSNPineachcycle.Thiswilldramaticallyspeedupcomputation,especiallywhentherearelargenumbersofSNPsinthedata. 50 PAGE 51 2-5 astheactualvalues.TheSNPeects(additiveanddominanteects),whichwereusedtosimulatethedata,arelistedasactualvaluesinTable 2-5 .Weletthevarianceparameter2be1.Theproposedmethodologywasappliedtoanalyzethedatawithoutmissingvaluesandwasalsoappliedtodatawithdierentpercentagesofmissingvalues.WewanttoseewhetheritcouldidentifythesignicantSNPsandcheckitsperformanceagainstdierentpercentagesofmissingvalues,aswellasitsperformancefordierentprobabilitiesofSNPgenotypecategory. OurultimategoalistondthesignicantSNPsfromthecandidateSNPs.Sincewebelievethatimputationisatooltoobtainbetterestimatesoftheparameters,wearenotparticularlyinterestedinrecoveringtheactualimputedvaluesforthemissingSNPs.Withthatbeingsaid,thesimulationresultsinTable 2-3 showthatwhentheprobabilityofonegenotypeforacertainSNPisdominantlyhigh,theimputedSNPsarecorrectlyidentiedwithsubstantiallyhighprobability.ThiscouldbeseeninTable 2-3 ,where 51 PAGE 52 ThepercentagesofcorrectlyimputedSNPsfordierentprobabilitiesofSNPcategories.10%missingvaluesexist. Probabilitiestobe\gg"0.13090.30120.81810.77190.3983Probabilitiestobe\gc"0.53070.38750.07960.19500.5425Probabilitiestobe\cc"0.33840.31130.10230.03310.0592Correctlyimputedprobabilities:0.550040.547880.633720.850750.65159 Probabilitiestobe\gg"0.03090.30120.31810.37190.0983Probabilitiestobe\gc"0.83070.38750.07960.19500.8425Probabilitiestobe\cc"0.13840.31130.60230.43310.0592Correctlyimputedprobabilities:0.75890.350520.976210.648290.85301 2-4 andTable 2-5 listtheparameterestimationforfamilyeectandSNPeects. Table2-4. Theestimatedmeansoffamilyeectsfordierentdatasetswithdierentpercentagesofmissingvalues.Thismethodologygiveaccurateestimatesasthepercentageofmissingvaluesgoesupto20%. Estimatedmeansfamily1:1family2:2family3:3 NomissingSNPs15.4520.6525.48 5%missingSNPs15.1620.7425.46 10%missingSNPs16.1821.3825.65 15%missingSNPs15.4519.6324.59 20%missingSNPs14.8720.1824.68 Estimatedmeansfamily4:4family5:5family6:6 NomissingSNPs29.8434.7640.40 5%missingSNPs28.2933.4338.62 10%missingSNPs30.7135.8640.81 15%missingSNPs30.1835.3840.18 20%missingSNPs30.0834.8840.13 52 PAGE 53 TheestimatedmeansofSNPeectsforthedatasetwithoutmissingvaluesandfordatasetswithdierentpercentagesofmissingvalues. ActualSNPvalue:SNP1:aSNP1:dSNP2:aSNP2:dSNP3:a -2.001.001.00-1.003.00 MeansfornomissingSNPs:-2.161.000.82-0.752.59 Meansfor5%missing:-1.861.141.16-1.053.00 Meansfor10%missing:-1.950.771.18-1.522.74 Meansfor15%missing:-1.800.780.99-0.962.48 Meansfor20%missing-2.081.291.21-0.763.10 ActualSNPvalue:SNP3:dSNP4:aSNP4:dSNP5:aSNP5:d 02.500.100.303.00 MeansfornomissingSNPs:0.302.430.60-0.042.38 Meansfor5%missing:0.052.21-0.200.482.88 Meansfor10%missing:0.182.510.130.002.53 Meansfor15%missing:0.672.430.470.733.20 Meansfor20%missing1.321.87-0.200.473.30 Allthecalculationswerebasedonsamplesobtainedaftertheinitial20000stepsofburn-in.TheresultsfromTable 2-4 andTable 2-5 showthatwhenthepercentageofmissingvaluesisnottoohigh,lessthan15%,theproposedmethodologygivesgoodestimatesfortheparametersthatweareinterestedin.Whenthepercentagesofmissingvaluesisbeyond15%percentage,weneedtobeverycarefulwithinterpretingtheresults.TakeSNP3asexample,thedominanteectforSNP3isactually0andtheestimatewas1:32whenthepercentageofmissingvaluesis20%,althoughtheestimatesareaccuratewhenthepercentagesofmissingvaluesislessthan10%.Thereasonforthat,webelieve,isonecategoryofgenotypeforSNP3hassubstantiallyhigherprobabilityanditoverdominatestheothertwocategoriesandthisoverdominance.Whenthepercentageofmissingvaluesgoesup,thedominatedgenotypecategoryhasonlyasmallchancetobewellrepresentedandthusmayhaveunreliableestimates.Generally,asmostmicroarraydatahavelessthan10%missingvalues,themethodologyperformswell. 53 PAGE 54 Forthisproject,wearespecicallyinterestedindetectingtherelationshipbetweenlesionlengthandthegenotypedSNPs,aslesionlengthisoneofthemostimportantquantitativetraitsofpitchcankerdisease.WealsohavephenotypicdataofcarbonisotopediscriminationvaluesfromloblollypineswhicharegrowninPaltaka,FloridaandCuth-bert,Georgia.Thecarbonisotopetraitisrelatedtothewateruseeciency,thushasanimportantroleinthesizegrowthofloblollypineandfurtherhassubstantialeconomicalvalues.Geneticallyspeaking,theloblollypinesinPaltaka,FloridaarereplicationsofloblollypinesinCuthbert,Georgia,excepttheyhavedierentenvironment,whichmightleadtodierentgenetic-environmentinteraction.Asforlesionlength,withreplicationofgeneticinformation,itisadierentphenotypictraitcomparedtothecarbonisotopedis-criminationtrait.Sowehavethreephenotypictraitsdata:CarbonisotopefromPaltaka,CarbonisotopefromCuthbertandlesionlengthdata.ThethreeshareonegeneticSNPdata. Forthedesignoftheexperiment,thereare61loblollypinefamiliesfromacirculardesignwithsomeo-diagonalcrossings.Forexample,family00isgeneratedfromparent24andparent23,whilefamily01isfromparent24andparent40.Family00andfamily01arenotindependentsincetheyshareoneparent.Originally,thiscirculardesignhad70familiesfrom44parentsalthoughourdatasetsjustcontainpartoftheexperimentdesign.Moredetailsoftheexperimentdesigncanbefoundin Kayihanetal.(2005 ).Therealso 54 PAGE 55 Forthe61families,wehavefamilypedigreeandparentpedigreeinformation.Byusingthe Henderson(1976 )method,wecalculatedthecovariancenumeratorrelationshipmatrix.ThedetailsofthecalculationareinAppendix B .Weuseauniformnoninfor-mativepriorforthefamilyeectandanormalpriorfortheSNPeectswhichweareinterestedin.Asforthevarianceparameters,weuseinvertedgammadistributionsandmakesurethattheposteriordistributionsareproperaccordingto HobertandCasella(1996 ).EachSNPisparameterizedwithadditiveanddominanteects.Forexample,ifthisSNPiscomposedby\A"and\T"nucleotides,theadditiveeectisonehalfofthedierenceoftheSNPeectbetweenhomozygous\AA"andmutanthomozygous\TT".Ifweuse(a;d)TtoparameterizetheSNPeect,SNP\AA"wouldhaveeect(1;0)(a;d)TandSNP\TT"wouldhaveeect(1;0)(a;d)T.Thedominanteectis 55 PAGE 56 Thersttraceplotisfortherst2familyeectparametersforthelesionlengthdata.ThesecondplotisforoneofSNPparameterforthecarbonisotopedatafromPaltaka,Florida.Thesamplesaretakenafterinitial40000stepsofburnin. thedierenceoftheheterozygouseectandtheaverageofhomozygouseects.Weusetheeect(0;1)(a;d)TtoparameterizethedominanteectforSNP\AT".WeupdatedthemissingSNPsbycolumnsasdetailedinSection 2.2.3 ,sothecomputationisfast(about3hoursfor30000iterations).Weuse40000iterationsasinitialburninandrecordthesamplesoffurther2000iterationsforthe3dataanalysis. TwotraceplotsareshowninFigures 2-1 .TherstoneisforoneoftherstfamilyeectparametersforthecarbonisotopediscriminationdatafromPaltaka,Florida.ThesecondplotisthetraceplotofvarianceparameterforthecarbonisotopediscriminationdatafromCuthbert,Georgia.BothplotssuggestthattheGibbssamplingsimulationconvergestothestationarydistributionafterburnin. Figures 2-2 2-3 ,and 2-4 showthecondenceintervalsforthe44SNPeectsofdierentdatasets.Wefoundthatthe22ndSNPforthelesionlengthdata,the16thSNPforthePaltakadataandthe8th,35th,36thSNPsfortheCuthbertdataaresignicant 56 PAGE 57 2-2 2-3 ,and 2-4 basedonthesamplesfromtheGibbssampling.WedidnotemploystandardmultipletestcorrectionprocedureforthetestsastherearenotmanysignicantSNPsandalsoourgoalistondsomestatisticallysignicantSNPsandletthebiologiststofurtherexplorethebiologicalpathway.ThesecondenceintervalplotspointoutthespecicSNPstofurtherinvestigateandwecouldhavemoreSNPstofollowifweloosenthecondencelimitsalitbit.InFigures 2-2 2-3 and 2-4 ,thexaxisisfortheindexofSNPs.AseachSNPhastwoparameters,with44SNPs,theindexrangesfrom1to88.TheyaxisisforthevalueofSNPeect.ForeachparameterofeachSNP,thereisasmallredlineinthetopofabluelineandasmallredlineonthebottomoftheblueline.Thesmallredlineonthetoprepresentstheupperboundofthe95%condenceintervalandthesmallredlineonthebottomrepresentsthelowerboundofthe95%condenceinterval.Thesmallgreenlineinthemiddlerepresentsthemeanoftheestimatedparameter.EachSNPhastwoparameters,additiveanddominant,correspondingly,ithastwolinesinalltheplots. 57 PAGE 58 95%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforlesionlength.The22thSNPhassignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe2ndandthe23thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. ontheotherhandiftheparent'sIDisbiggerorequalto35weknowthegrandparents'ID.So35isanimportantnumberandweneedtopayattentiontoitwhentryingtollinthecovariancematrixoffamilies.AnotherthingisfortheparentID38,originallyitsgrandparentswererecordedas13and0,outofconvenienceforprogramming,Ichangedtheminto13and1withoutchangingthenatureofthecovariance.Accordingtothedesignoftheexperiment,thefemaleandmaleparentarenotthesameforallthefamilies. Whencalculatingthecovariancebetweenindividualsfromdierentfamilies,onewayistodividethecombinationsoftwofamiliesinto11bigcategoriesandfurtherpartitioninsidethecategories.InthefollowingIwillexplaineachofthe11categories. 58 PAGE 59 95%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforthecarbonisotopedatafromPaltaka,Florida.The16thSNPhassignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe6ndandthe40thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. isthesimplestcaseandifitissatisedthetwofamilieshavezerocovariance.Inourdataset,weneedtocomputetotal2415(2415=1+2++69)covariancecellsanditturnsoutthereare507casesbelongtothiscategory. 59 PAGE 60 95%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforcarbonisotopedatafromCuthbert,Georgia.The8th,35thandthe36thSNPhavesignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe13ndandthe28thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. canbepartitionedinto8subcategoriesasthefollowing:0B@Family1:(a1;a2);bFamily2:b;d1CA;0B@Family1:(a1;a2);bFamily2:c;b1CA;0B@Family1:a;(b1;b2)Family2:a;d1CA;0B@Family1:a;(b1;b2)Family2:c;a1CA;0B@Family1:a;dFamily2:(c1;c2);d1CA;0B@Family1:d;bFamily2:(c1;c2);d1CA;0B@Family1:a;cFamily2:c;(d1;d2)1CA;0B@Family1:c;bFamily2:c;(d1;d2)1CA: PAGE 61 Nowwetake0B@(a1;a2);bb;d1CA PAGE 62 Theresultsshowthatforallthe896casesofthiscategory,noneoftheaboveequa-tionsissatised.Sothecovarianceforthese896casesareall0. PAGE 64 Group1a1=b;a2=b;c1=b;c2=b: 64 PAGE 65 Group1:c1=b;c2=b;a1=d;a2=d: PAGE 66 Group1:a1=d;a2=d;b1=d;b2=d: PAGE 67 Theprogramresultsshowthatinourdataset27out28casesinthiscategorydonothaveanyequationsholdingineithergroup,thatis,thesecasesarehalfsiblings.Onlyforthecasesoffamily65andfamily68,additionally,oneequationholdsfromgroup1. 67 PAGE 70 70 PAGE 71 Intheabove,wefoundthatinanyfamily,thetwoparentsarenotrelatedandstandardprocedurecanbeusedtocalculatethevarianceforeachfamily.Intermsofcovariance,incategory1,all507caseshavezerocovariance.Forcategory2,all88casesarehalfsiblings.Eightythreeoutof84casesincategory3arehalfsiblingsandwhenthecasecomestothecovariancebetweenfamily38andfamily70,ithasthefollowingrepresentation,0B@a;d(a;c2);d1CA: PAGE 73 22(YiXiZoioiZmi(j)mi(j)cmij)2 P3`=1exp1 22(YiXiZoioiZmi(j)mi(j)c`mij)2(2{17) wherethereareonly3termsinthedenominatorsumandthismakestheGibbssamplercomputationalfeasible. 73 PAGE 74 2{8 ,butratherthantheonerandomvariablegenerationperiterationthatisusedhere,MonteCarloEMwouldneedthousandsofZgenerationsperiteration,againprecludingcomputationalfeasibility. AsforthepriordistributionfortheSNPgenotypecategories,onealternativeistousetheavailablegenomeinformationortheobservedgenotypefrequenciesaspriorinformation.Foroursituation,wedonothaveasequencedgenomelibraryfortheloblollypine,andwedonotwanttousetheobservedSNPinformationastheprior,soweuseauniformpriorforthemissingSNPs. Inthischapter,theresponsesareassumedtobecontinuous,butthemethodcanbemadetoadapttodiscretecases.Forexample,inacasecontrolstudy,theresponsewouldbeeithercaseorcontrolstatus.Byemployingaprobitmodel,weaddatruncatedlatentvariableintheGibbssamplingcycleandthelatentvariablewouldactastheresponseinourpreviousdatasetexamples.Thismethodcanalsobemodiedtohandlethesituationofamultiplecategoryresponse. Letusassumethattheresponsesyisavectorwithitselementstobeeither0or1forthecasecontrolstudy.Weemployacontinuouslatentvariablelandassumethatlhasamultivariatenormaldistribution.Thenwecanspecifythewholemodelspecicationasthefollows: 74 PAGE 75 (2{18) Thenwecanwritethejointlikelihoodas: Withtheabovejointlikelihood,wecaneasilywritetheconditionaldistributionsasbefore,exceptthatnowwehaveconditionaldistributionoflgivenotherparametersandy.TheconditionalofliisliN(Xi+Zi;2)truncatedattheleftby0,ify1=1liN(Xi+Zi;2)truncatedattherightby0,ify1=0: FortheSNPeect,weuse[1;0][a;d]TtodenotetheeectofSNP\AA",[0;1][a;d]TtodenotetheeectofSNP\AC",and[1;0][a;d]TtodenotetheeectofSNP\CC". 75 PAGE 76 Forpopulationassociation,muchresearchworkhasbeenfocusedonusinghaplotypes.However,intheplant,orotheragriculturegenomeassociations,itisoftenthecasethatthewholegenomeisnotfullysequencedandnotmuchpriorlinkagedisequilibriuminformationisavailable.Block-wisehaplotypesorclustersofhaplotypesareimpossibletoconstruct.Theproposedmethodhaswideapplicationfortheeldofplantoragricultureassociationwithoutconstructinghaplotypes,andithasreasonablecalculationspeed.Basedonoursimulations,theproposedmethodcanadequatelyidentifythesignicantSNPs. 76 PAGE 77 Tobemoregeneral,therearesvariablestobeconsideredinthedataset.Whensisnite,andthevaluesofthecovariatesintheexperimentareobserved,muchprogresshasbeenmade.ThereviewoftheprogresshasbeendetailedinChapter1.However,forthesituationthatthecandidatevariableshaveacertainpercentageofmissingvalues,itisanovelresearcheldandnotmuchworkhasbeendone. Foreachcandidatevariable,therearetwochoices,eitherbeingincludedinthe\goodmodel"ornot.Sowithscandidatevariables,thereare2smodelchoicesinthemodelsamplespace.Whenthenumbersismoderatelylarge,thenumber2scouldbehuge.Typicallywhenthenumbersisbiggerthan15,itisunrealistictocalculateandcompareallthe2sposteriormodelprobabilities.Forthiscase,afeasiblemethodistodoastochasticsearch.Intheory,thestochasticsearchalgorithmhasthechancetosearchallthesamplespace.Thisstochasticsearchchainisexpectedtostaylongerwithmodelshavinghighposteriorprobabilities,andoccasionally,itvisitsmodelswithlowposteriorprobabilities.Bydoingthat,itavoidsgettingstuckinlocalmodesandhaschancetovisitthemodelsamplespaceglobally. Thefollowingisourplanforthemodelselection: 77 PAGE 78 WewillestimatetheBayesfactorsusingsamplesfromtheparallelGibbssampling. ThestochasticsearchchainisdrivenbyestimatedBayesfactors. NowletusintroducetheBayefactor.FortwomodelsMandM1withparametersand1,theBayesfactorisdenedas wherep(jM)istheparameterdistributionformodelMandp(1jM1)istheparame-terdistributionformodelM1.Obviouslyp(Yj;M)istheprobabilitydistributionundermodelMandp(Yj1;M1)istheprobabilitydistributionundermodelM1.Hereand1arescalars,buttheycanbeextendedtobeparametervectors.Theconstantm(Y)isthemarginallikelihoodformodelMandtheconstantm=1(Y)isthemarginallikelihoodformodelM1. Inthischapter,rstwewilldiscusshowtoestimatetheBayesfactorsasnoclosedformexists.ThenwewillproposeahybridstochasticsearchalgorithmtosearchthesamplespaceofthemodelsandwewillshowthattheergodicpropertiesofthederivedMarkovchain.Computationspeedisalwaysaconcernofvariableselection,sonallywescaletheproceduresuptohandlelargedatasets. 78 PAGE 79 (3{2) whereIG(a;b)representsinvertedGammadistributionwiththefollowingparameteriza-tionf(2)=ba 2g IfweuseM1;:::;M2storepresentallthemodelsinthemodelsamplespace,thegoalofthisprojectistoselectgoodmodelsfromthesecandidatemodels.Foreachoneofthemodelsinthemodelsamplespace,thereisacorrespondingindicatorvectorandeachmodelisdenedbyafamilyofdistributionswithparameterstobe;;;2;2.ForanytwodierentmodelsMjandMj0,theysharethesameparameterizationsof;2;2,exceptforadierentindicatorvector,thusalso. Severalthingstobenoticed: 79 PAGE 80 AswearedoingBayesianvariableselection,weuseBayesfactorasacriteriontoselectmodels.ThedicultpartofthisprojectisthatweareselectingsubsetswithdierentdimensionsandthemissingvaluesexistinthecovariatesofSNPs.Toovercometheproblem,weproposedaBayesfactorapproximationformulawhichtakescareofthedierentdimensionsandthemissingvalues.WeplantoemployastochasticsearchalgorithmtosearchthegoodsubsetsaccordingtotheBayesfactors.Foreachcandidatesubset,wecalculatetheBayesfactorforthecandidatesubsetversusthefullmodel.Herethefullmodelactsasthereferencemodel.DependingonthevalueofthecurrentBayesfactorandthepreviousBayesfactor,wedecidewhethertakethecandidatesubsetasthecurrentsubsetornot.SothefullmodelactsasareferencemodelwhencalculatingtheBayesfactorsintheaboveplan.Ontheotherhand,wemayusethesimplestmodelasareferencemodelwhencalculatingtheBayesfactors,whichdoesnotincludeanyvariablesinit.Wewillshowthatusingthesimplestmodeldoesnotgainanythingforusintermsofcomputation.Morediscussionofusingthesimplestmodelasareferencemodelwillbegivenlater. BeforegoingtothecalculationofBayesfactor,wewillintroducebridgesamplingproposedby MengandWong(1996 ). 80 PAGE 81 3{1 ,wehaveBF;=1=m(Y) PAGE 82 Whenhavesamplesofparameterfromthedistributionp2(),wecanestimatetheBayesfactorbyXq1((i)) Foroursituation,wegenerallyconsidertwomodelswithdierentdimensions,andwecannotdirectlyapplythisformulaasitassumesthesameparameterizationforthetwomodels.WewillproposeabridgesamplingtypeofBayesfactorestimation,whichaccommodatesmodelswithdierentdimensionsandmissingvaluesaswell.NextwewilldetailtheBayesfactorcalculationforoursituation.FirstweshowthataspecialfunctiongisneededforthepropercalculationofBayesfactor.Thenwewillshowwhatthisfunctioncouldbeforoursituation. SupposeaistheparameterformodelMaand(a;b)istheparametersformodelMb.ThelikelihoodformodelMaisfa(Yja)andthelikelihoodformodelMbisfb(Yja;b):Leta(a)andb(a;b)denotethepriorsformodelMaandmodelMb. 82 PAGE 83 NowletusprovetheTheorem 3.2.1 PAGE 84 =ZZa(a)g(a;b) 3{6 holding,then1 84 PAGE 85 3{4 ,agfunctionisneededandthemodiedbridgesamplingestimatoris 1 3{8 ,wemustndagfunctionthatsatisesthefollowingequation. 3{9 andcontainsafactorP(Zlmis=zlmis):Alsoiscomposedofandl.ThevectorparameterizestheSNPswhichareincludedinmodelMandlparameterizestheSNPswhichareexcludedinmodelM.ThematrixZiscomposedofZandZl,whereZlisthedesignmatrixofSNPswhicharenotincludedinmodelMandZlmisrepresentsthemissingSNPswithinZl.ThedistributionP(Zlmis=zlmis),forthediscreterandomvectorZlmis,couldbeanylegitimatediscretedistributionanditistakentobetheuniformdistribution,whichisthesameasthepriorforZlmisinmodelM1. 85 PAGE 86 3.2.1 .ApplyingTheorem 3.2.1 ,ais(;;2;2Zmis),andbis(Zlmis;).Thelikelihoodfor(a;b)isfb(a;b)=f1(Yj;;2;Zmis); 3.2.1 ,thefollowingequationmusthold:XZlmisZf1(Yj;;2;2;Zmis)g(;;2;2;Zmis)dl=f(Yj;;2;2;Zmis): 3{8 ) 1 86 PAGE 87 3{9 andfurtherwehaveaconsistentBayesfactorestimator.Thegis 2 22P(Zlmis=zlmis); 2exp(YX(i)Z(i)(i))0P(i)l(YX(i)Z(i)(i)) 22(i) NoticeP(Zlmis=zlmis)couldbeanylegitimatedistribution,anditistakenastheuni-formpriorinitssamplespace,whichisthesamepriorforZlmisin1(;;2;2;Zmis;Zlmis).Soduringthecalculation,theP(Zlmis=zlmis)withinthegfunctioniscanceledwiththepriorforZlmisin1(;;2;2;Zmis;Zlmis). NowwewanttoshowwhyEquation 3{12 isaconsistentestimator.Wehavetwomodels,MandM1,andthefollowingarethedetailsofaBayesianspecicationforthesetwomodels. 87 PAGE 88 (3{13) ForM1,themodelspecicationisdescribedin 3{2 anditisalmostthesameas 3{13 ,exceptthatthedesignmatrixforSNPsischangedfromZtoZasthemodelindicatorvectorischangedfromto=1.Correspondingly,ischangedto. Forbothmodels,thedesignmatrixforSNPs,ZandZ,containmissingSNPsandwedonotexplicitlymentionthatrightnow.ZisamatrixconcatenatedbythecolumnsofZandZl.ThemathematicalformulaZ=Z+Zldoesnothold,oneobviousreasonforthatisthenumberofcolumnsofZequalstothesumofthenumbersofcolumnsofmatrixZandZl.Thevector,,andlhavesimilarrelationship. Sameasbefore,weuse(;;2;2;Zmis)asthepriorformodelMandlet1(;;2;2;Zmis)asthepriorformodelM1. 88 PAGE 89 3{9 inCorollary 3.2.1 ,wehave: =XZlmisZg(;;2;2;Zmis)1 (22)n (22)n (22)n 3{14 ,wewillhave =XZlmis8<:expjYXZj2 2exp(YXZ)0Zl(Z0lZl)1Z0l(YXZ) 22; PAGE 90 3{11 ,thatis, 2exp(YXZ)0Pl(YXZ) 22P(Zlmis=zlmis); (3{16) =expjYXZj2 90 PAGE 91 3{9 issatised. WetakethepriorforMtobe(;;2;2;Zmis)=1expjj2 2) (2)a+1dc 2) (2)c+1P(Zmis=zmis); 2) (2)a+1dc 2) (2)c+1P(Zmis=zmis)P(Zlmis=zlmis); Withtheabovespecication,ifwehavesamples((i);(i);2(i);2(i);Z(i)mis)frommodelM1,andfollowtheEquation 3{10 ,wecanconsistentlyestimatetheBayesfactorasthefollowing:1 2exp(YX(i)Z(i)(i))0P(i)l(YX(i)Z(i)(i)) 22(i) 91 PAGE 92 3{9 ,wecanuse1 Wewillshowthatthefollowingg(;;Zmis;2;2)alsosatisestheEquation 3{9 22P(Zlmis=zlmis); 2+Z0lZl: 3.2.1 ,weneedshowthatEquation 3{9 issatised.TheleftsideofEquation 3{9 isequalto: 22expjYXZj2 92 PAGE 93 3{18 andwehave =XZlmis(22)sl=2jRj1=2exp(YXZ)0ZlR1Z0l(YXZ) 22(22)sl=2jRj1=2exp(YXZ)0ZlR1Z0l(YXZ) 22exp(jYXZj2) (22)2=nP(Zlmis=zlmis) Afterthecancelation,wewillhave: (22)2=nP(Zlmis=zlmis) (3{20) =exp(jYXZj2) (22)2=nXZlmisP(Zlmis=zlmis)=f(Yj;;Zmis;2): 3{17 andEquation 3{11 aretwodierentgfunctionsandtheybothsatisfytheEquation 3{9 .Thatshowsthegfunctionisnotunique.However,basicallywecanndagfunctionbyfollowingthedirections: PAGE 94 3{9 Solookingback,forthegfunctioninEquation 3{17 ,thecorrespondinggcandfunctionisgcand=expjj2 3{17 .AsfortheEquation 3{11 ,wejustletthegcandfunctionbe1,andfollowthesameprocedure:Calculatethehfunction,forthiscaseh=RRf1(;;Zmis;2;2)1dldZlmis furthergettheg=f(;;Zmis;2;2) 3{11 3.1 ,westatedthatweplantorunahybridM-Hchaintosearchthegoodsubsetsofvariablesaccordingtotheposteriorprobabilitiesofmodels.LaterwewillshowthatthisisequivalenttosearchingforthegoodsubsetsofvariablesaccordingtotheBayesfactors.Intheabovesubsections,wediscussedtheBayesfactorestimationforasubmodelversusthefullmodel.Sowhencomparinganytwosubmodels,theyareimplicitlycomparedwitheachotherthroughtheratioofBayesfactorsofsubmodelsversusthefullmodel.Forexample,ifweconsidertwosubmodelswithSNPindicators1and2,theratioofBayesfactors,BF1;=1 PAGE 95 Casella,Giron,Martinez,andMoreno(Casellaetal. )discussedthattherearetwowaystodoBaysianmodelselectionwhenusingintrinsicpriors:encompassingthemodelsfromabove,whichmeanscomparingallthemodelstothefullmodel;encompassingthemodelsfrombelow,thatis,comparingallthemodelstothesimplestmodel.Theyshowedwhenthenumberofsubsetsarenite,thesetwomethodswillessentiallygiveequivalentresults. Inthefollowing,wewillgivedetailsoftheBayesfactorapproximationwhenthesimplestmodelisusedasareference.Also,wewillprovidesomeexplanationsastowhywechoosetousethefullmodelasareference.SofortheBayesfactorofsubmodelversesthesimplestmodel,weareinterestedincalculatingBF;=0,where=0meanstheassociatedmodelhasnovariablesinit,alsoasthesimplestmodel.Byusingsimilarapproximationmethodsasabove,weareabletondtheBayesfactorapproximationforsmallmodelversusbigmodel,andalsoasBF;=0=1 NowthemodelM0hasthelikelihoodf0(Yj;2)andthemodelMhasthelikeli-hoodf(Yj;;2;2;Zmis). Nextwealsoshowhowwendthegfunctionforthissituation.ApplyingTheorem 3.2.1 ,theais(;2;2)andbis(;Zmis).Fortheequation( 3{6 )tobesatisedweneedZfb(Yja;b)g(a;b)db=fa(Yja): PAGE 96 2expjYXj2 WewillndthegfunctionfromtheaboveEquation 3{21 3{22 ,thenwewillhave 2exp(YX)0Z(Z0Z)1Z0(YX) 22g(;2;;2;Zmis)expjYXj2 (3{22) =(22)s 22P(Zlmis=zlmis); PAGE 97 Inabove,weshowedthattheEquation 3{21 holdsandwefoundthegfunctioninEquation 3{22 .Tomakethismethodwork,weneedsamples((i);(i);2(i);2(i);Zmis(i))fromthefullmodelM1whichhasalikelihoodof(Yj;;2;2;Zmis).AsweareplanningtorunaGibbssamplerforthefullmodelwhichcontainsalltheSNPs,fromtherewewillgetsamplesof((i);(i);2(i);2(i);Zmis(i))andithasthecorrectmarginalizedjointdistribution,accordingtoTheorem10:6:of RobertandCasella(2004 ). SowecanapproximatetheBayesfactorBF=0;by1 22(i) 2(i) wherenisthenumberofpairsofsamplesfromtheGibbssamplingandP(i)=Z(i)Z(i)0Z(i)1Z(i)0: 3{23 ,whichusethesimplestmodelasthereferencemodel,andEquation 3{12 ,whichusethefullmodelasthereference 97 PAGE 98 AccordingtothedenitionofBayesfactorinEquation 3{1 ,fortwomodelsMandM1withparametersand1,theBayesfactorisdenedasBF;=1=m(Y) LetMdenotethemodelY=X+Z+";whereisavectorwithlengthequaltosum(),andtheelementsofthisvectorarecomposedoftheelementsofvectorwiththecorrespondingelementofindicatorvectortobe1.ZistheisthematrixwithcolumnsbeingtakenfromtheZmatrixaccordingtothecorrespondingindicatorvector. Intheabovemodelspecication,Z=Z(obs);Z(mis);andforsimplicitywejustwriteZ.Weletstobethetotalnumberofelementsinequalto1.ThejointlikelihoodunderthemodelspecicationofMis: PAGE 99 expjYXZj2 =expjX(YZ)j2 2expjYZj2 3{25 andget: 2 22expjj2 exp(YZ)0P(YZ) 22expjj2 =exp0(Z0PZ+I 2) 3{27 andget: 99 PAGE 100 2(22)s 21 2expY0Z(Z0PZ+I 2)1Z0Y 3{29 alittlebitandwewillhave: 2jP;2j1 2 2. Thenextstepistointegrate2out,andasweknowtheprioris:(2)=ba 2 2jP;2j1 2 (a)Y0PY 100 PAGE 101 OurplanistosetupaMarkovchainwhichisdrivenbyBayesfactorssuchthatthestationarydistributionoftheMarkovchainisproportionaltothedistributionof,inotherwords,thedistributionofmodelsinthemodelsamplespace. Therearesvariablesinthedata,inourcase,sSNPs,sothereare2smodelsinthemodelspace.Eachofthe2smodelshasanassociatedindexvectori;i=1;;2s.Foramodelassociatedwithi;;ithasmarginallikelihoodmi(X)=Zf(Xji)(i)di;whereiistheparametervectorofmodeli.LetB(i)denotetheprobabilityofmodeliinthemodelsamplespace,thentheprobabilityofmodeliisB(i)=mi(X) 101 PAGE 102 1. Supposethecurrentstateis1,andwewilluseamixturedistributiontogeneratethecandidate2.Themixturedistributionis:withprobabilityp,drawasampleusingarandomwalkalgorithm;withprobability(1p),drawasamplefromtheuniformdistributiononthesamplespace.Thedrawsfromtheuniformsamplespacearei.i.d.andthetuningparameterpwillbesetas0:75toinsurerandomwalkmostofthetimeandalsohavingtheabilitytojumpoutoflocalmodes. 2. UsingthesamplesfromthepreviousGibbssampler,calculatetheapproximateBayesfactorsforthecandidatestateandcalculatetheacceptanceprobability(1;2). 3. DrawufromuniformdistributionU(0;1),ifu<(1;2)take2asthenextstate,otherwisestayat1: Repeatthesteps,untilthechainarrivesitsstationarydistribution. Theaboveisthechainwedesigned.Butinactualcomputation,thestepsarealittlebitdierent.WeruntheGibbssamplerrst.ThenwiththesamplesfromtheGibbschain,weruntheM-Hchaintondgoodsubsets. Weplantousethefollowingcandidaterandomwalkdistribution.Supposethecurrentchainisatstate1.Randomlychooseonenumberrfromtheuniformdiscretedistribution(1;2;:::;s),andiptherthelementin1from1to0orfrom0to1,andatthesametimekeepallotherelementsin1untouched.Denotethenewcandidatesamplefromrandomwalkas02.Let002denotethecandidatestatefromtheotherpartofthemixturedistribution,theuniformdistributioninthesamplespace.Eachelementof002isindependentandrandomlytakenaseither0or1. 102 PAGE 103 2s: 2s: 2s=(1p)Xk=1;:::;2sI(1=001k)1 2s=(1p) 2s s:justoneelementdierence Sotheaboveargumentsays,foranystateof1and2,thefollowingalwaysholdsq(2j1)=q(1j2): PAGE 104 UsingsamplesfromtheGibbssamplerandapplyingTheorem 3.2.1 ,wewillhave\BF(n)2;=1asaconsistentBayesfactorestimateforBF(2;=1)andnisthenumberofpairsofsamplersfromtheGibbssampler.WefurtherdenotetheratioofestimatedBayesfactoras\BF(n)2;=1 ^n(1;2)=min8<:\BF(n)2;=1 RobertandCasella(2004 )Thetheoremstates: RobertandCasella(2004 ),ifY(t)isergodic,thenthedistributiongisastationarydistributionforthechainY(t)andfisthelimitingdistributionofthesubchainX(t).HeregisthefulljointdistributionforX. RobertandCasella(2004 )isthefollowingalgorithm: 104 PAGE 105 whereg1(y1jy(t)2;;y(t)p);;gp(ypjy(t+1)1;;y(t+1)p1)aretheconditionaldistributions. Oursub-chainfromtheGibbssamplersatisestheregularconditionsandisHarris-recurrent,aperiodic,soitisergodic.Furthermore,bytheergodictheorem,wewillhave\BF(n)(2;=1) 105 PAGE 106 Now,letusshowitsatisesthedetailedbalancecondition.AsthetransitionkernelassociatedwiththisMarkovchainisK(1;2)=(1;2)q(2j1)I(26=1)+(1r(1))I(1=2); Theabovesecondequationisobvious,when1=2,theleftsideisequaltotherightside.When16=2,bothsidesequalto0.Asweshowedbeforethatq(2j1)=q(1j2),fortherstequation,byapplyingtheequation 3{32 ,weneedtoverifythatmin8<:\BF(n)2;=1 PAGE 107 If\BF(n)2;=1<\BF(n)1;=1,theleftsideofEquation 3{33 isequalto\BF(n)2;=1 3{33 isequalto\BF(n)2;=1 3{33 isequaltotheleftsideof( 3{33 ).Followthesamerational,Equation 3{33 holdswhen\BF(n)2;=1>\BF(n)1;=1: 3{33 issatised.ThenbyusingtheTheorem6:46of RobertandCasella(2004 ),^B(n)isthestationarydistributionofthischain. IntheaboveweshowedthattheempiricalM-HMarkovchainconvergesuniformlytothedistribution^B(n)whennisxed.However,ourultimategoalistoshowthischainconvergestothedistributionB(i).Thatis, 107 PAGE 108 3.3 .Theempiricalchainconvergestothetargetstationarydistributionwithergodicproperty. 3{34 2XkK(t)^Bn(0;)^Bn()k+1 2Xk^Bn()B()k: 2Xk^Bn()B()k!0;asn!1:(3{37) 3{35 wemustshow1 2XkK(t)^Bn(0;)^Bn()k!0;asn;t!1: PAGE 109 MeynandTweedie(2008 ),weknowthatwithnxed,wehavekK(t)^Bn(0;)^Bn()k>kK(t+1)^Bn(0;)^Bn()k: 2XkK(t)^Bn(0;)^Bn()k<: 2XkK(t)^Bn(0;)^Bn()k!0:(3{38) CombiningEquation 3{37 andEquation 3{38 wehave TheaboveproofshowsthatourM-Hchainwillconvergetothetargetstationarydistributionwiththeergodicproperty. 109 PAGE 110 Fromtheabovesections,weknowthatifwehavesamples((i);(i);2(i);2(i);Z(i)mis)generatedbyGibbssamplingfrommodelM1,thereferencemodel,alsothefullmodel,wecanapproximatetheBayesfactorby:1 2exp(YX(i)Z(i)(i))0P(i)l(YX(i)Z(i)(i)) 22(i) Intheaboveformula,foroneBayesfactorestimation,weneedtorepeatntimesofZ(i)l0Z(i)landntimesof(Z(i)l0Z(i)l)1.Asweknowthedeterminantcalculationiscloselyrelatedtotheinversecalculationandiftheinverseisavailable,thedeterminantcalculationismucheasier.Sowewanttotakeadvantagesofthispropertywithoutdirectlycalculatingthedeterminants. Inthissection,wewilltalkabouttwomethodstospeedupthecomputation.Firstwewillintroduceamethodtoupdatetheinversebasedoninversecalculationfromthelaststepwithoutcalculatingtheinversefromthebeginning.Second,wewilltalkaboutdirectreplacementofZ(i)withZ,andthejusticationofdoingthis. 3.4.1.1TwocolumnsofparametersforonecolumnofSNPs 110 PAGE 111 AnotherplacewheremostcalculationtimeisspentistondthejZ(i)l0Z(i)ljand(Z(i)l0Z(i)l)1.Rightnowwearedealingwithadatasetwith44SNPsandifwehavesum(1)=30,thenthereare30SNPsnotincludedinthesubmodel,tocalculateoneBayesfactor,weneedtodontimesmatrixinversionwithdimension6060,aseachcolumnofSNPsuses2columnsforcoding. Now,wewilldiscussthedierencebetweenthejustupdatedmatrixZ(i)andthepreviousdesignmatrixZ(i1).ThenonemethodofcalculatingtheinverseofmatrixZ(i)basedontheinverseofmatrixZ(i1)willbegiven.Thesediscussionwillgiveusaavorofthenalmethodweemploytocalculatetheinversematrix. SupposeinlaststepwehavematrixZ(0)=[Z1;:::;Zi;Zi+1;:::;Z2s],andinthecurrentstepthejthandthe(j+1)thcolumnofZ(0)willbeupdatedto_Zjand_Zj+1,asforeachiterationonlyonecolumnofSNPsareupdated,andweuse2columnsofthe 111 PAGE 112 Wecanwrite anddoingthemultiplicationwehave PAGE 113 WealreadyhaveZ0(0)Z(0)1,thatis,haveA1fromlaststepandnoticerank(Ha)=2andrank(Hb)=2.Wecanwritethefollowing where 113 PAGE 114 Miller(1981 ),ifHisarank2matrix,wehavethefollowingformula (I+H)1=I1 wherea=1+tr(H)and2b=tr(H)2tr(H2): Thentheproblembecomestond(I+A1Ha)1andI+I+A1Ha1A1Hb1:SupposeweareabletocalculateC=(I+A1Ha)1;thentheproblemistond(I+CA1Hb)1.Thenwewillhave ThequestionboilsdowntocalculatingCand(I+CA1Hb)1.Forbothproblems,wecanapplyEquation 3{43 tosolveit. AsforcalculatingA11,wewillhavethefollowingsub-steps: 1. CalculateHaandHb 114 PAGE 115 ComparisonoftimespentoninversecalculationusingstandardsoftwareandMiller'smethodwith2columnsand2rowsupdated. Matrixdimension:1001005005001000100020002000 Matlab'sdirectcalculation0.000s0.235s1.60911.985s Miller'smethod0.016s1.454s11.937s99.703s 2, calculateB1=(I+D)1C; nally,A11=B1A1: 3-1 recordsthetimespentonmatrixinversionwithdierentmethods. Tospeedupthecomputation,wechangethecodingofSNPsfrom2parametersperSNPto1parameterperSNP.NowweassumethattheeectofaSNPislineartothenumberofcopiesofcertainallelewithinthatSNP,insteadofparameterizingbyadditiveanddominanteect.Bychangingthecodingmethod,wedecreasethestepsneededforoneBayesfactortoonehalf,sincenowthereisonlyonecolumnupdateinZ(i)insteadof2columns. Asmentionedintheabove,wedecidetocodetheSNPsinanotherway:usingonecolumnofparametersforonecolumnofSNPs.Inthatway,thedesignmatrixfortheSNPsrecordsthenumberofcopyofalleleinthatSNP.Forexample,weusetorepresenttheallele\C"'seect.ThenforanindividualwithSNPgenotype\CC",theSNPeect 115 PAGE 116 Thefollowingmethodisdirectlydueto SadighiandKalra(1988 )anditisanapplicationoftheSherman-Morrison-Woodburyformula;see Woodbury(1950 )and Bartlett(1951 ).AreviewpaperabouttheSherman-Morrison-Woodburyformulaiswrittenby Hager(1989 ). SupposewehavetheinverseofamatrixA,A1,andweupdatethejthcolumnofAtobev.Let(A)jdenotethejthcolumnofA.Letejtobeavectorofallzeros,exceptthejthelementtobe1.AlsoletA1tobetheupdatedmatrix.TondA11=(A+(v(A)j)eTj)1,thefollowingstepsareneeded: 1+(b)jb,where(b)jisthejthelementofvectorb. Nextwewillshowwhytheabovemethodworks.FirstwriteA1=AB; PAGE 117 (3{46) Furthermore, 1+bj0000bj+1 SimilarlyifthejthrowofmatrixA1,A(j)1,isupdatedtobe,andA2istheupdatedmatrix.Thenc=(A(j)1)A11;C=(I+ejcT); (c)jejc;where(c)jisthejthelementofvectorcand (c)jejc)=A11A1 117 PAGE 118 Comparisonoftimespentincalculationofmatrixinversionfordierentmethods. Matrixdimension:500500100010001500150020002000 Matlab'sdirectcalculation0.266s1.547s5.01611.703s Miller'smethod0.031s0.188s0.32s0.719s Sanighi&Kalra'sdirectcalculation0.375s2.735s9.453s21.657s Sanighi&Kalra'ssimplication0.032s0.11s0.235s0.531s Foroursituation,whenweuseoneparameterforoneSNP,ineachGibbssampleriteration,wejustupdatethejthcolumnandjthrowofthematrixZ0Zandkeeptheotherpartsuntouched.Nextwewanttoshowatimetablecomparisonforthecalculationofmatrixinversionwithdierentdimensions.ForSanighi&Kalra'sdirectcalculationisIjustfollowtheirformulaandletMatlabcalculatethematrixmultiplicationdirectly.Sanighi&Kalra'ssimplicationmeanstodirectlyspecifythecellsandvectorswithoutmatrixmultiplication.Obviously,wewilltakethelastmethod,anditis20timesfasterthanthedefaultcalculationofMatlab.WedidthetestinRandtheresultsaresimilar. Wewrite whereejisthevectorwithonlythejthelementtobe1andallotherstobe0.Asshownbefore,wecanwriteA1=AB;withB=I+beTj; PAGE 119 Soforoursituation,jA1j=jAjjBj=jAjjI+beTjj: 119 PAGE 120 Similarly,forallotherrowsintheabovematrixwecandothesimilaroperations.Forexample,wecanmultiply(bj 120 PAGE 121 Sowehave wherebj=(A1)(j)(vAj)and(A1)(j)isthejthrowofA1. WhenweupdatethekthrowofmatrixA1tobeuT,wewrite withC=(I+ekcT);andc=(uA(k)1)A11: PAGE 122 Finally withckandbjdenedasabove. InthepreviousworkweshowedthatthefollowingformulacanbeusedfortheBayesfactorapproximation. 1 2exp(YX(i)Z(i)(i))0P(i)l(YX(i)Z(i)(i)) 22(i) whereallthesamplesarefromtheposteriordistribution1(;;2;2;ZmisjY): 122 PAGE 123 IftheZ(i)=(Z(i);Z(i)l)matrixdoesnotchangewithi,thenforeachBayesfactorestimation,insteadofcalculatingthedeterminantandmatrixinversentimes,wejustneedtodoonetimematrixdeterminantandmatrixinverse.Themethodproposedhereistoreplaceallthesamplesof(Z(i);Z(i)l)withtheexpectation(EZ;EZl),EZ.InthefollowingwejustifythereplacementwithEZ. OriginallywewanttocalculatetheBayesfactorBF;=1andweshowedthatitisequalto 2exp(YXZ)0Pl(YXZ) 22 2exp(YXZ)0Pl(YXZ) 22 123 PAGE 124 heretheintegralisimplicitlywithrespecttothemissingdatainZ.Asmentionedintheabove,wewanttoapproximateEquation 3{60 by RewritetheaboveintegralsasZZh(;Z)f(;Z)ddZ=ZZh(;Z)f(Zj)dZf()d; 3{61 ,weneedtocheckthefollowing: Further,ifwecanshowthatR(h(;Z)f(Zj)dZRh(;EZ)f(Zj)dZisclosetozero,itissucienttosaythatEquation 3{62 isclosetozerounderverygeneralconditions.Now 124 PAGE 125 h(;Z)h(;EZ) (3{63) 2(ZEZ)0@2[h(;Z)] 2(ZEZ)0@2[h(;Z)] Sowecanwrite =@h(;Z) 2(ZEZ)T@2[h(;Z)] 3{64 ,thenwecansayZh(;Z)f(Zj)dZZh(;EZ)f(Zj)dZ0: PAGE 126 WedidsomenumericalsimulationtocomparetheBayesfactorcalculations.Inthissimulateddataset,thereare15SNPs20familiesand800observations. Table 3-3 recordsthedierentmodelindicatorsusedforthesimulationstudiesinTable 3-4 and 3-5 .Weletthesvarysothattheycouldrepresentthepossibledierentmodelindicatorsthatmightoccur. Table3-3. RecordsofsubsetsindicatorswithactualvaluesofforTable 3-4 andTable 3-5 IndicatorvectorActualvaluesof:201130:1000013003 Table 3-4 showsthecalculationresultsofBayesfactorsbyusingdierentformulaofBayesfactorapproximation.TheupperpartofthetableusessamplesfromtheGibbssamplerwhereonlyonecolumnofmissingSNPsareupdatedpercyclewhilethebottompartofthetableusessamplesfromtheGibbssamplerwhereallthemissingSNPsareupdatedinonecycle.TherstcolumnofTable 3-4 istheindicatorvectorandtheactual 126 PAGE 127 Bayesfactorcalculationapproximation.Userst20000iterationasburninandtakethefollowing400samplesforthecalculation. sampleonecolumnofSNPperiterationdierentapplyingformula( 3{57 )ZbaroriginalZ sampleallSNPstogetherinonecycledierentapplyingformula( 3{57 )ZbaroriginalZ valuesofarefromTable 3-3 .ThesecondcolumnistheBayesfactorestimatesusingtheformulaweoriginallyplannedtouseanditensurestheBayesfactorestimatesareconsistent.ThethirdcolumnistheBayesfactorestimates,wherethemissingSNPsarereplacedwiththeaveragesofimputedvaluesZ.ThefourthcolumnistheBayesfactor 127 PAGE 128 Table 3-5 againaretheBayesfactorestimatesusingdierentmethodsforthemissingvalues.Thesecondcolumnassumesthereisnomissingvaluesinthedataset.thatis,Ididnotgeneratethemissingdataforthisanalysis.Inthethirdandfourthcolumns,Iassignthemissingvalueseitherall1orall1totesttheeectoftheextremehandlingofmissingvalues.TheupperpartofthetableusingsamplesfromtheGibbssamplerwherethemissingSNPswereimputed1columnpercyclewhilethebottompartofthetablearetheestimatesoftheBayesfactorswhereallthemissingSNPswereimputedineachcycle.Itshows,theseextremehandlingofmissingSNPsdonotgiveproperestimatesofBayesfactors.CombiningTable 3-4 andTable 3-5 ,weconcludethatusingZgivesproperestimatesandyetitprovidesfastcalculation. 3.5.1Simulation Inthesesimulationstudies,wehave15SNPsandthevaluesoftheSNPsarelistedinTable 3-6 .10%ofmissingSNPsweregeneratedatrandom.istheparameterfortheSNPeectand1isforSNP1,2isforSNP2,etc. 128 PAGE 129 3-6 representthedierentmodelsthatthestochasticchainvisited.Anymodelthathaslessthan1%frequencyofvisitsisnotlistedinthetable.FromtherstcolumnofTable 3-6 ,wecouldseethattworowshavehigherfrequenciesofvisitsthanallotherrows.Obviously,thestochasticchainspends25%ofitstimeonamodelwhichcapturesallthesignicantvariables,andthechainspendsaboutonehalfofitstimeonamodelwhichcapturesallthesignicantvariablesplusSNP7,SNP8,andSNP8.TheeectsofSNP7,SNP8,andSNP8arenotasbigasothervariablesintherealmodel,buttheireectsarenotinsignicantaccordingthecriterionofBayesfactor.Onethingthatneedstobementionedisthatthestochasticchainonlyspentabout4%ofitstimeonthemodelwhichincludesallthevariables,whichisadesirablepropertysinceourgoalistodiscoversimplesubsetswhichrepresenttheessentialcombinationsofvariablesinsteadoftakingthecomplicatedmodelssuchasthefullmodel. 3-7 arethefrequenciesofvisitsthechainspentondierentmodels.Asthemodelsamplespaceishuge,244,itisverylikelythatoncethechainleavesthevisitedmodel,itnevercomesbacktothepreviousmodel.Inthetablearelistedseveralmodelsthatthechainspentmoretimeonthanothermodelsthatarenotlisted.Althoughnoneofthefrequenciesofvisitsaresubstantiallyhigherthanothers,theanalysisprovidessomesubsetsofvariablestobefurtherinvestigatedat.OnethingtonoticeisthatwhenweaveragethevisitsperSNP,severalSNPshavemorethan50%visitfrequencies,andtheseSNPsappearoftenintheselectedmodelstoo.IfweareinterestedininvestigatetheSNPsonebyone,theseSNPsdenitelyaregoodcandidates. 129 PAGE 130 Bayesfactorcalculationcomparisons.Userst40000iterationasburninandtakethefollowing400samplesforthecalculation. sampleonecolumnofSNPperiteration dierentwithoutassignallassignallmissingmissingSNPmissingvaluesonevaluesminusone sampleallSNPstogetherinonecycle dierentwithoutassignallassignallmissingmissingSNPmissingvaluesonevaluesminusone 130 PAGE 131 SimulationresultsofBayesianvariableselectionfor15SNPsand450observations,10%randommissingvalues.UsingtheBayesfactorestimationformula posteriorActualvaluesofthe 0.0036novariables 0.000115 Bayesvariableselectionforlesionlengthdataset.UsingtheaverageofimputedmissingSNPsasifobserved.20000stepsofburninandanother20000stepsassamples. posteriormodelSubsetsofvariablesprobability>=0:29% 0:4%9;10;12;14;15;16;22;30;31;40;43 PAGE 132 InChapter2,weproposedaBayesianhierarchicalmodelforthedatastructure.Wetookadvantageofthepopulationstructurebycalculatinganumeratorrelationshipmatrix,whichquantiesthecovariancebetweentwoindividualloblollypinesinthispopulation.Formissingdata,weimputedthepossiblevaluesaccordingtotheposteriordistributionofthemissingvalues.Thatis,afteradjustingtheobservedvalues,weusedtheposteriordistributionofthemissingvaluesinsteadoftheonemeanvaluesubstitutionormultiplevalueaverages. OnenovelcontributionisweproposedanonstandardGibbssamplerprocedureandprovedthatthisGibbsprocedureensuresthetargetstationarydistribution.Byemployingthisproposedprocedure,thecomputationspeedisdramaticallyincreasedbydecreasingthenumberofimputedcolumnsineachupdatingcycle.ThepowerofthisprocedurewillbemoreobviouswhentherearemoreobservedSNPsinthedatasetandsothisisadesirablepropertysincewewillhavemanymoreSNPstowardtheendofADEPT2project. 132 PAGE 133 Intermsofcomputation,wetookadvantageoftheBayesfactorapproximationformulaandthespecialGibbssamplerupdatingprocedurebyapplyingtheSherman-Morrison-Woodburyformulaforoursituation.Themethoddecreasesthematrixinversion20timesforamatrixwithdimension20002000.Intheendofthedissertation,wealsoproposedtouseZtoreplacetheimputedvaluesforthecalculationofBayesfactors.Weshowedthatitgivesaccurateestimatesandmeanwhilefundamentallyscalesuptheprocedure. Inchapter3,weusedastochasticsearchalgorithmtosearchthegoodgroupofsubsetsandwespendmuchtimeinspeedingupthecomputation.ThebottleneckofthecomputationisduetotheBayesfactorestimates,especiallythemanyinversions 133 PAGE 134 AnotherfutureworktopicistoinvestigateotherpriorspecicationsforourmodelinChapter2andChapter3.Weusedtheconjugatepriorsandthatensurestheconditionalstobestandarddistributions.Wewouldliketoseethedierenceofthetheoreticalresultsandthedierenceofrealdataanalysiswhendierentpriorsareapplied,suchasgpriors,orintrinsicpriors. 134 PAGE 135 SupposewearerunningaGibbssamplerinwhichonecycleconsistsofupdating(X;Y;Z),andtheZnpmatrixcanbedecomposedas(Z1;Z2;:::;Zp).StandardGibbssamplingupdatesZnpalltogetherinonecycle,incontrast,wecanjustsystematicallyupdateonecolumnofZateachcycle.ByupdatingonecolumnofZineachiteration,thecomputationtimecandecreasedramatically,especiallywhenthenumberofcolumnsoftheZmatrixisintheorderofhundredsorthousands. Firstletuswritedownourupdatingscheme.Supposewehavestartingvalue(X(0);Y(0);Z(0)1;:::;Z(0)p): PAGE 136 ToprovethattheGibbssamplerhasthestationarydistribution,weneedtoshowthatthefollowingequationholds: =Z:::ZfX(1)jY(0);Z(0)1;:::;Z(0)pfY(1)jX(1);Z(0)1;:::;Z(0)pfZ(1)1jX(1);Y(1);Z(0)2Z(0)pfX(2)jY(1);Z(1)1;:::;Z(0)pfY(p)jX(p);Z(1)1;:::;Z(1)p1;Z(0)pfZ(1)pjX(p);Y(p);:::;Z(1)(p1)fZ(0)1;:::;Z(0)p;X(0);Y(0)dX(0)dY(0)dZ(0)1;:::;dZ(0)pdX(1)dY(1);:::;dX(p1)dY(p1): A{1 )becomes: =Z:::ZfX(1);Z(0)1;:::;Z0pfY(1)jX(1);Z(0)1;:::;Z(0)pfZ(1)1jX(1);Y(1);Z(0)2:::;Z(0)pfX(2)jY(1);Z(1)1;:::;Z(0)pfY(p)jX(p);Z(1)1;:::;Z(1)p1;Z(0)pfZ(1)pjX(p);Y(p);:::;Z(1)(p1)dZ(0)1;dZ(0)2;:::;dZ(0)pdX(1)dY(1);:::;dX(p1)dY(p1): PAGE 137 =Z:::ZfX(1);Y(1);Z(0)2:::;Z(0)pfZ(1)1jX(1);Y(1);Z(0)2:::;Z(0)pfX(2)jY(1);Z(1)1;:::;Z(0)p:::fY(p)jX(p);Z(1)1;:::;Z(1)p1;Z(0)pfZ(1)pjX(p);Y(p);:::;Z(1)(p1)dZ(0)2;:::;dZ(0)pdX(1)dY(1);:::;dX(p1)dY(p1): Doingthesimilarintegration,furtherweget: =ZZZfX(p1);Y(p1);Z(1)1;:::;Z(1)(p1);Z(0)pfX(p)jY(p1);Z(1)1;:::;Z(1)(p1);Z(0)pfY(p)jX(p);Z(1)1;:::;Z(1)(p1);Z(0)pfZ(1)pjX(p);Y(p);Z(1)1;:::;Z(1)(p1)dX(p1)dY(p1)dZ(0)p: PAGE 138 Inabove,weshowedthattheGibbssamplerchainupdatingonecolumnpercyclestillpreservesthetargetstationarycondition.Toprovethechainhasergodicityproperty,wecanciteTheorem10:10in RobertandCasella(2004 ).Thetheoremstates: Nowconsiderthemarginaldistributionofthesubvector(X(t);Y(t)).Wewillshowthatthismarginaldistributionsatisesthestationarycondition.Thatis,wewanttoshowthefollowingequationholds. A{6 )anditequalstotheleftsideof( A{6 ).Thefollowingarethecalculations: 138 PAGE 139 139 PAGE 140 Asweknowthesub-numeratorrelationshipmatrixfortherstunrelatedasubjectsistheidentity,nextwewillgivethedetailshowtocalculatetheremainingcellsofthenumeratorrelationshipmatrixfortherelatedsubjects.Considerthejthandtheithsubjectfromtheaboveorderedsubjects. 1. Ifbothparentsofthejthindividualareknown,saygandh,thenRji=Rij=0:5(Rig+Rih);i=1;:::;j1;Rjj=1+0:5Rgh 2. Ifonlyoneparentisknownforthejthsubject,sayitisg,thenRji=Rij=0:5Rig;i=1;:::;j1;Rjj=1: Ifneitherparentisknownforthejthsubject,Rji=Rij=0;i=1;:::j1;Rjj=1: 140 PAGE 141 141 PAGE 142 Akaike,H.(1973).Informationtheoryandanextensionofthemaximumlikelihoodprinciple.InB.N.PertrovandF.Csaki(Eds.),SecondInternationalSymposiumonInformationTheory,BudapestAkademiaiKiado,pp.267C281.Springer-Verlag. Allison,P.D.(2002).MissingData.Sage,CA:ThousandOaks. Balding,D.(2006).Atutorialonstatisticalmethodsforpopulationassociationstudies.NatureReviewsGenetics7,781. Barnard,J.andD.B.Rubin(1999).Smallsampledegreesoffreedomwithmultipleimputation.Biometrica,949{955. Bartlett,M.S.(1951).Aninversematrixadjustmentarisingindiscriminantanalysis.AnnualofMathematicalStatistics22,107{111. Berger,J.andL.Pericchi(2001).Objectivebayesianmethodsformodelselection:Introductionandcomparison(withdiscussion).ModelSelection,135{207. Boyles,A.L.,W.Scott,E.Martin,S.Schmidt,Y.J.Li,A.Ashley-Koch,M.P.Bass,M.Schmidt,M.A.Pericak-Vance,M.C.Speer,andE.R.Hauser(2005).Linkagedisequilibriuminatestypeerrorratesinmultipointlinkageanalysiswhenparentalgenotypesaremissing.HumanHeridity59,220{227. Brown,P.J.,M.Vannucci,andT.Fern(1998).Multivariatebayesianvariableselectionandprediction.JournalofRoyalStatistalSocietyB,627{641. Carlin,B.P.andS.Chib(1995).Bayesianmodelchoiceviamarkovchainmontecarlomethods.JournalofRoyalStatisticalSocietyB,473{484. Casella,G.,F.J.Giron,M.L.Martinez,andE.Moreno.Consistencyofbayesianproceduresforvariableselection.TheAnnalsofStatistics.(toappear). Casella,G.andE.Moreno(2006).Objectivebayesianvariableselection.JournaloftheAmericanStatisticalAssociation,157{167. Chen,W.M.andG.R.Abecasis(2007).family-basedassociationtestsforgenomewideassociationscan.TheAmericanJournalofHumanGenetics81,913{926. Chib,S.(1995).Marginallikelihoodfromthegibbsoutput.JournaloftheAmericanStatisticalAssociation,1313{1321. Cui,W.andE.I.George(2008).Empiricalbayesversusfullybayesvariableselection.JournalofStatisticalPlanningandInference,888{900. Dai,J.Y.,I.Ruczinski,M.LeBlanc,andC.Kooperberg(2006).Imputationmethodstoimproveinferenceinsnpassociationstudies.GeneticEpidemiology30,690{702. 142 PAGE 143 Dempster,A.P.,N.Laird,andD.B.Rubin(1977).Maximumlikelihoodestimationfromincompletedataviatheemalgorithm(withdiscussion).JournalRoyalStatisticalSocietySeriesB39,1{38. Efron,B.(1994).Missingdataimputationandthebootstrap.JournaloftheAmericanStatisticalAssociation,463{474. George,E.I.andD.Foster(1994).Theriskinationcriterionformultipleregression.AnnalsofStatistics,1947{1975. George,E.I.andR.E.McCulloch(1993).Variableselectionviagibbssampling.JournaloftheAmericanStatisticalSociety,881{889. George,E.I.andR.E.McCulloch(1995).Stochasticsearchvariableselection.MarkovChainMonteCarloinPractice,203{214. George,E.I.andR.E.McCulloch(1997).Approachestobayesianvariableselection.StatisticaSinica,339{379. Green,P.J.(1995).Reversiblejumpmarkovchainmontecarlocomputationandbayesianmodeldetermination.Biometrika,711{32. Hager,W.W.(1989).Updatingtheinverseofamatrix.SIAMReview31,221{239. Henderson,C.R.(1976).Asimplemethodforcomputingtheinverseofanumeratorrelationshipmatrixusedinpredictionofbreedingvalues.Biometrics39,69{83. Hobert,J.andG.Casella(1996).Theeectofimproperpriorsongibbssamplinginhierarchicallinearmixedmodels.JournaloftheAmericanStatisticalAssociation91,1461{1473. Kayihan,G.C.,D.A.Huber,A.M.Morse,T.T.White,andJ.M.Davis(2005).Geneticdissectionoffusiformrustandpitchcankerdiseasetraitsinloblollypine.TheoryofAppliedGenetics110,948{958. Li,K.H.,T.E.Raghunathan,andD.B.Rubin(1991).Large-samplesignicancelevelsfrommultiplyimputeddatausingmoment-basedstatisticsandanfreferencedistribution.J.AmericanStatisticalAssociation,1065{1073. Little,R.J.A.andD.B.Rubin(1987).StatisticalAnalysiswithMissingData.NewYork:Wiley&Sons. Mallows,C.(1973).Somecommentsoncp.Technometrics15,661{675. Marchini,J.,B.Howie,S.Myers,G.McVean,andP.Donnelly(2007).Anewmultipointmethodforgenome-wideassociationstudiesbyimputationofgenotypes.NatureGenetics39,doi:10.1038/ng2088. 143 PAGE 144 McKeever,D.B.andJ.L.Howard(1996).Valueoftimberandagriculturalproductsintheunitedstates,1991.ForestProductsJournal46,45{50. Meng,X.L.andD.B.Rubin(1992).Performinglikelihoodratiotestswithmultiply-imputeddatasets.Biometrica,103{111. Meng,X.L.andW.H.Wong(1996).Simulatingratiosofmormalizingconstantsviaasimpleidentity:Atheoreticalexploration.StatisticaSinica6,831{860. Meyn,S.P.andR.Tweedie(2008).MarkovChainsandStochasticStability.NewYork:Springer-Verlag. Miller,K.S.(1981).Ontheinverseofthesumofmatrices.MathematicsMagazine54,67{72. Mitchell,T.J.andJ.J.Beauchamp(1988).Bayesianvariableselectioninlinearregres-sion.JournaloftheAmericanStatisticalAssociation,1023{1032. Quaas,R.L.(1976).Computingthediagonalelementsandinverseofalargenumeratorrelationshipmatrix.Biometrics46,949{953. Reilley,M.(1993).Dataanalysisusinghotdeckmmultipleimputation.TheStatistician,307{313. Robert,C.P.andG.Casella(2004).MonteCarloStatisticalMethods.UnitedStatesofAmerica:Springer. Roberts,A.,L.McMillan,W.Wang,J.Parker,I.Rusyn,andD.Threadgill(2007).Inferringmissinggenotypesinlargesnppanelsusingfastnearest-neighborsearchesoverllidingwindows.Bioinformatics23,i401{i407. Rubin,D.B.(1978).Multipleimputationsinsamplesurveys-aphenomenologicalbayesianapproachtononresponse.JournaloftheAmericanStatisticalAssociation,20{34. Rubin,D.B.(1987).MultipleImputationforNonresponseinSurveys.NewYork:Wiley&Sons. Sadighi,I.andP.K.Kalra(1988).Approachesforupdatingthematrixinverseforcontrolsystemproblemswithspecialreferencetoroworcolumnperturbation.ElectricPowerSystemsResearch14,137{147. Schafer,J.L.(1997).AnalysisofIncompleteMultivariateData.London:Chapman&Hall. 144 PAGE 145 Schwarz,G.(1973).Estimatingthedimensionofamodel.AnnalsofStatistics6,461{464. Servin,B.andM.Stephens(2007).Imputation-basedanalysisofassociationatudies:candidateregionsandquantitativetraits.PLoSGenetics7,e114. Smith,A.M.F.andD.J.Spiegelhalter(1980).Bayesfactorsandchoicecriteriaforlinearmodels.JournaloftheRoyalStatisticalSociety,SerialB,213{220. Stephens,M.,S.N.J.,andP.Donnelly(2001).Anewstatisticalmethodforhaplotypereconstructionfrompopulationdata.TheAmericanJournalofHumanGenetics68,978{989. Sun,Y.V.andS.L.Kardia(2008).Imputingmissinggenotypicdataofsingle-nucleotidepolymorphismsusingneuralnetworks.EuropeanJournalofHumanGenetics16,487{495. Tanner,M.A.andW.H.Wong(1987).Thecalculationofposteriordistributionsbydataaugmentation.JournaloftheAmericanStatisticalAssociation,528{540. Wear,D.N.andJ.G.Greis(2002).Southernforestresourceassessment:Summaryofndings.JournalofForestry100,6{14. Weinberg,C.R.(1999).Allowingformissingparentsingeneticstudiesofcase-parenttriads.AmericanJournalofHumanGenetics64,1186{1193. Woodbury,M.(1950).Invertingmodiedmatrices.MemorandomRept.42,StatisticalResearchGroup,PrincetonUniversity. Wu,C.F.J.(1983).Ontheconvergencepropertiesoftheemalgorithm.TheAnnalsofStatistics,95{103. Xie,F.andM.Paik(1997).Multipleimputationmethodsforthemissingcovari-atesingeneralizedestimatingequation.Biometrics,1538{1546. Yu,J.M.,G.Pressoir,W.H.Briggs,I.V.Bi,M.Yamasaki,J.Doebley,M.D.McMullen,B.S.Gaut,D.M.Nielsen,J.B.Holland,S.Kresovich,andE.S.Buckler(2005).Auniedmixedmodelmethodforassociationmappingthataccountsformultiplelevelsofrelatedness.NatureGenetics38,e114. 145 PAGE 146 ZhenLiwasborninJiangsu,China.ShewasthesecondoftwodaughtersofPingLiandXuemeiTang.Shereceivedabachelor'sdegreeinmechanicalengineeringinNanjingUniversityofAeronauticsandAstronauticsin2001.Afterthat,shewasenrolledinamaster'sprogramofShanghaiJiaotongUniversity.In2004,shewasadmittedtotheStatisticsDepartmentofUniversityofFlorida.Shereceivedamaster'sdegreeinstatisticsin2006andisexpectingtoreceiveaPh.D.degreeinstatisticsin2008. 146 |