<%BANNER%>

Bayesian Methodologies for Genomic Data with Missing Covariates

Permanent Link: http://ufdc.ufl.edu/UFE0022460/00001

Material Information

Title: Bayesian Methodologies for Genomic Data with Missing Covariates
Physical Description: 1 online resource (146 p.)
Language: english
Creator: Li, Zhen
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: bayes, genomic, gibbs, metropolis, missing, variable
Statistics -- Dissertations, Academic -- UF
Genre: Statistics thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: With advancing technology, large single nucleotide polymorphism (SNP) datasets are easily available. For the ADEPT 2 project, we have candidate SNPs and interesting phenotypic trait values available, while about 10% of the SNPs are missing. Standard software packages cannot deal adequately with missing SNP data. For example, SAS either uses an available case analysis (which employs all the complete cases for the inference of target parameters) or the procedure MI (or MIANALYZE) where SAS assumes multivariate normal distributions for all the variables. Some software deletes the incomplete observations, which is generally unacceptable for datasets with many SNPs, because it can give biased estimates, or possibly delete all the data. More recently, single SNP association, linkage disequilibrium based imputation, and haplotype based imputation have been proposed. I describe a Bayesian hierarchical model to explain the SNP e?ects for the phenotypic traits, and incorporated family structure information for the observations. For this association test, the information of the degree of linkage disequilibrium is not required and missing SNPs are imputed based on all the available information. We used a Gibbs sampler to estimate the parameters and prove that updating one SNP at each iteration still preserves the ergodic property of the Markov chain, and at the same time it improves computation speed. We also ran a stochastic search algorithm to search the good subsets of variables or SNPs. Bayes factor is used as a model comparison criterion and a new Bayes factor approximation formula is proposed. The hybrid Metropolis-Hastings algorithm was used to search the good models in the model sample space and proven to have the ergodic convergence property. To improve the computation speed, ?rst a matrix identity was applied to avoid direct calculation of matrix inversions and determinants, then we replaced the imputed missing SNPs with the average of imputed SNPs, which substantially increased the computation speed.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Zhen Li.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Casella, George.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-12-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0022460:00001

Permanent Link: http://ufdc.ufl.edu/UFE0022460/00001

Material Information

Title: Bayesian Methodologies for Genomic Data with Missing Covariates
Physical Description: 1 online resource (146 p.)
Language: english
Creator: Li, Zhen
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: bayes, genomic, gibbs, metropolis, missing, variable
Statistics -- Dissertations, Academic -- UF
Genre: Statistics thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: With advancing technology, large single nucleotide polymorphism (SNP) datasets are easily available. For the ADEPT 2 project, we have candidate SNPs and interesting phenotypic trait values available, while about 10% of the SNPs are missing. Standard software packages cannot deal adequately with missing SNP data. For example, SAS either uses an available case analysis (which employs all the complete cases for the inference of target parameters) or the procedure MI (or MIANALYZE) where SAS assumes multivariate normal distributions for all the variables. Some software deletes the incomplete observations, which is generally unacceptable for datasets with many SNPs, because it can give biased estimates, or possibly delete all the data. More recently, single SNP association, linkage disequilibrium based imputation, and haplotype based imputation have been proposed. I describe a Bayesian hierarchical model to explain the SNP e?ects for the phenotypic traits, and incorporated family structure information for the observations. For this association test, the information of the degree of linkage disequilibrium is not required and missing SNPs are imputed based on all the available information. We used a Gibbs sampler to estimate the parameters and prove that updating one SNP at each iteration still preserves the ergodic property of the Markov chain, and at the same time it improves computation speed. We also ran a stochastic search algorithm to search the good subsets of variables or SNPs. Bayes factor is used as a model comparison criterion and a new Bayes factor approximation formula is proposed. The hybrid Metropolis-Hastings algorithm was used to search the good models in the model sample space and proven to have the ergodic convergence property. To improve the computation speed, ?rst a matrix identity was applied to avoid direct calculation of matrix inversions and determinants, then we replaced the imputed missing SNPs with the average of imputed SNPs, which substantially increased the computation speed.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Zhen Li.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Casella, George.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-12-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0022460:00001


This item has the following downloads:


Full Text

PAGE 1

1

PAGE 2

2

PAGE 3

3

PAGE 4

Firstofall,IwanttoexpressmydeepestgratitudetoDr.GeorgeCasella.Notonlyforhispatientadvisementfornumerousacademicproblems,histime,hispealsofwisdomsharedwithme,butalsoforforsupportingme,encouragingmeandinspiringmetoalwaystobebetter.IalsowanttothankDr.HaniDoss,Dr.JohnDavis,Dr.GaryPeterandDr.RonglingWuforservingasmycommittee.Iwouldliketothankmyparents,PingLiandXuemeiTang,fortheirendlesslove,constantemotionalsupportandfortheirbeliefinme.Ithankmysisterforalwaysbeingthereformeandlovingme.IcouldneverthankmyhusbandJianenoughforhisloveandhisemotionalsupport.Hehasalwaysbeenkeepingmeinpeaceandcalminthosedays.Withoutthat,Icouldnotnishthisjourney. 4

PAGE 5

page ACKNOWLEDGMENTS ................................. 4 LISTOFTABLES ..................................... 7 LISTOFFIGURES .................................... 8 ABSTRACT ........................................ 9 CHAPTER 1LITERATUREREVIEWANDPROJECTINTRODUCTION .......... 11 1.1IntroductiontotheProject .......................... 11 1.2IntroductiontoMissingData ......................... 14 1.2.1PatternsofMissingData ........................ 14 1.2.2MechanismofMissingData ...................... 17 1.3ExistingMethodsforMissingData ...................... 19 1.3.1MaximumLikelihood .......................... 21 1.3.2EMAlgorithm .............................. 21 1.3.3MultipleImputation ........................... 23 1.4BayesianVariableSelection .......................... 25 1.4.1SemiautomaticBayesianVariableSelectionMethod ......... 27 1.4.2AutomaticBayesianVariableSelectionMethod ............ 28 1.4.3StochasticSearchAlgorithm ...................... 30 1.5OutlineoftheDissertation ........................... 31 2AHIERARCHICALBAYESIANMODELFORGENOME-WIDEASSOCIA-TIONSTUDIESOFSNPSWITHMISSINGVALUES .............. 33 2.1Introduction ................................... 33 2.2ProposedMethod ................................ 39 2.2.1TheModelwithoutRametRandomEect .............. 39 2.2.2TheModelwithRemeteRandomEect ................ 45 2.2.3IncreasingComputationSpeed ..................... 50 2.3ResultsforSimulatedData ........................... 51 2.4ResultsforLoblollyPine ............................ 53 2.5QuantifyingtheCovarianceandVariance ................... 57 2.6Discussion .................................... 73 3BAYESIANVARIABLESELECTIONFORGENOMICDATAWITHMISS-INGCOVARITES .................................. 77 3.1Introduction ................................... 77 3.2BridgeSamplingExtension ........................... 79 3.2.1GeneralFormula ............................. 82 5

PAGE 6

.................. 92 3.2.3ComparisonwiththeSimplestModel ................. 94 3.2.4MarginalLikelihoodform(Y) ..................... 98 3.3MarkovChainMonteCarloProperty ..................... 101 3.3.1CandidateDistribution ......................... 103 3.3.2ConvergenceofBayesFactors ..................... 104 3.3.3ErgodicityPropertyofThisM-HChain ............... 105 3.3.3.1Fixedn,uniformlyergodicconvergestothedistribution^B(n) 106 3.3.3.2ErgodicconvergencetoB 108 3.4ComputationSpeed ............................... 110 3.4.1MatrixInversion ............................. 110 3.4.1.1TwocolumnsofparametersforonecolumnofSNPs .... 110 3.4.1.2Determinantcalculation ................... 118 3.4.2ReplaceZwithanAverage ....................... 122 3.5SimulationandRealDataAnalysis ...................... 128 3.5.1Simulation ................................ 128 3.5.2RealDataAnalysis ........................... 129 4SUMMARYANDFUTUREWORK ........................ 132 4.1Summary .................................... 132 4.2FutureWork ................................... 133 APPENDIX AERGODICITYOFGIBBSSAMPLINGWHENUPDATINGZMATRIXBYCOLUMNS ...................................... 135 BANALGORITHMFORCALCULATINGTHENUMERATORRELATION-SHIPMATRIXR 140 REFERENCES ....................................... 142 BIOGRAPHICALSKETCH ................................ 146 6

PAGE 7

Table page 1-1Illustrationofunivariatenon-response ....................... 14 1-2Illustrationofmultivariatemissing ........................ 15 1-3Illustrationofmonotonemissing .......................... 15 1-4Illustrationofgeneralmissingdatapattern .................... 16 1-5Filematchingmissingdatapattern ........................ 16 2-1Illustrationofcombinationsofmissingdataforoneobservation. ......... 37 2-2ThepercentagesofSNPcategoriesinthegenerateddatasets. .......... 50 2-3ThepercentagesofcorrectlyimputedSNPsfordierentprobabilitiesofSNPcategories.10%missingvaluesexist. ........................ 52 2-4Theestimatedmeansoffamilyeectsfordierentdatasetswithdierentper-centagesofmissingvalues.Thismethodologygiveaccurateestimatesasthepercentageofmissingvaluesgoesupto20%. .................... 52 2-5TheestimatedmeansofSNPeectsforthedatasetwithoutmissingvaluesandfordatasetswithdierentpercentagesofmissingvalues. ............ 53 3-1ComparisonoftimespentoninversecalculationusingstandardsoftwareandMiller'smethodwith2columnsand2rowsupdated. ............... 115 3-2Comparisonoftimespentincalculationofmatrixinversionfordierentmethods. 118 3-3RecordsofsubsetsindicatorswithactualvaluesofforTable 3-4 andTable 3-5 ........................................... 126 3-4Bayesfactorcalculationapproximation.Userst20000iterationasburninandtakethefollowing400samplesforthecalculation. ................ 127 3-5Bayesfactorcalculationcomparisons.Userst40000iterationasburninandtakethefollowing400samplesforthecalculation. ................ 130 3-6SimulationresultsofBayesianvariableselectionfor15SNPsand450observa-tions,10%randommissingvalues.UsingtheBayesfactorestimationformula ............................................. 131 3-7Bayesvariableselectionforlesionlengthdataset.UsingtheaverageofimputedmissingSNPsasifobserved.20000stepsofburninandanother20000stepsassamples. ....................................... 131 7

PAGE 8

Figure page 2-1Thersttraceplotisfortherst2familyeectparametersforthelesionlengthdata.ThesecondplotisforoneofSNPparameterforthecarbonisotopedatafromPaltaka,Florida.Thesamplesaretakenafterinitial40000stepsofburnin. ............................................. 56 2-295%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforlesionlength.The22thSNPhassignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe2ndandthe23thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. .............. 58 2-395%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforthecarbonisotopedatafromPaltaka,Florida.The16thSNPhassig-nicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe6ndandthe40thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologi-calexploration. ................................... 59 2-495%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforcarbonisotopedatafromCuthbert,Georgia.The8th,35thandthe36thSNPhavesignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe13ndandthe28thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. ............................ 60 8

PAGE 9

Withadvancingtechnology,largesinglenucleotidepolymorphism(SNP)datasetsareeasilyavailable.FortheADEPT2project,wehavecandidateSNPsandinterestingphenotypictraitvaluesavailable,whileabout10%oftheSNPsaremissing. StandardsoftwarepackagescannotdealadequatelywithmissingSNPdata.Forexample,SASeitherusesanavailablecaseanalysis(whichemploysallthecompletecasesfortheinferenceoftargetparameters)ortheprocedureMI(orMIANALYZE)whereSASassumesmultivariatenormaldistributionsforallthevariables.Somesoftwaredeletestheincompleteobservations,whichisgenerallyunacceptablefordatasetswithmanySNPs,becauseitcangivebiasedestimates,orpossiblydeleteallthedata.Morerecently,singleSNPassociation,linkagedisequilibriumbasedimputation,andhaplotypebasedimputationhavebeenproposed. IdescribeaBayesianhierarchicalmodeltoexplaintheSNPeectsforthephenotypictraits,andincorporatedfamilystructureinformationfortheobservations.Forthisassociationtest,theinformationofthedegreeoflinkagedisequilibriumisnotrequiredandmissingSNPsareimputedbasedonalltheavailableinformation.WeusedaGibbssamplertoestimatetheparametersandprovethatupdatingoneSNPateachiterationstillpreservestheergodicpropertyoftheMarkovchain,andatthesametimeitimprovescomputationspeed. 9

PAGE 10

10

PAGE 11

McKeeverandHoward(1996 ).Thepinespeciesinthesouthernstatesproduces58%ofthetimberintheUSand15:8%oftheworld'stimber,see WearandGreis(2002 ). Scientistsareinterestedindiscoveringtherelationshipbetweenthephenotypictraitsandthecomplexbiologicalrolesandfunctionsofgenesinloblollypine,whichcouldhelptoexplaintheevolutionofadaptivetraitsforotherlandplants.Rightnow,researchersfromseveraluniversitiesintheADEPT2projectarecollaboratingtoidentifyalleleswhichcontrolwoodpropertiesanddiseaseresistance.AnothergoaloftheADEPT2projectistoassociateallelicvariationwithphenotypicvariationforsomesubsetofgenes.Morespecically,thereare4objectives:1)toidentify5000targetcandidategenesforwoodpropertyanddiseaseresistancetraits;2)todiscoverallelesfor5000candidategenesusingahighthroughputresequencingpipeline;3)toestimatetheextentoflinkagedisequilibriuminregionsofasmallnumberofcandidategenes;4)todetectandverifyassociationsbetweenSNPsin1000candidategenesandasuiteofwoodpropertyphenotypes,diseaseresistancephenotypes. InthepreviousADEPTproject,SNP,orsinglenucleotidepolymorphismdiscoverywasdonefor50genesinvolvedinresistancetodiseaseandresponsetowater-decit.AsinglenucleotidepolymorphismisaDNAsequencevariationoccurringwhenanucleotide-A,T,C,orG-diersfromothernucleotideintheindividualsfromthesamespeciesatthesamelocus.TheSNPsmayoccurinthecodinggenesornon-codinggenes,soresearchersaretryingtodetecttheSNPswhichhavesignicantinuenceforthequantitativetraitsfromlargeamountofpotentialSNPs. 11

PAGE 12

Kayihanetal.(2005 ).Followingtheterminologyusedinthisproject,tworametswithinanycloneshareexactlythesamegeneticinformationwhiletwocloneswithinanyfamilyonlysharethesameparents,andtheycanbeconsideredtobesiblings. Loblollypineisthehostoftwofungalpathogens,whichcausesthefusiformrustdiseaseandpitchcankerdisease.Fusiformrusthasaspindle-shapedswellingonbranchesandstemsofloblollypinetreesandthisdiseaseiscausedbyafungus.Fusiformrustwillformstemgalls,whichleadtoshortsurvivaltimes,badwoodpropertiesandslowgrowth.ThisdiseasecauseslossesofhundredsofmillionsdollarsinsouthernUnitedStates.Pitchcankerisalsoaveryimportantdisease,whichiscausedbyfungusandproducesresinouslesionsandleadstoseedlingmortality,decreasedgrowthratesandcrowndieback. ThelargepitchcankerdatawererecordedintheUSDAForestServiceResistanceScreeningCenterinBentCreek,NorthCarolinaandthesmallerpitchcankerscreenwasconductedattheUniversityofFlorida.BothtengallrustscreensandonegallrustscreenswereinoculatedinNorthCarolinawithdierentdensityofspores=ml.ThisdatasetinformationiscalledCCLONESandwillbereferredbythenameCCLONESlater.WewillhaveanotherdatasetlaterfromNCSUwhichconsistsofindividualtreesnotrecentlymatedtoeachotherandcanserveasanaturaloccurringpinepopulationandwillbeusedforSNPvericationattheendoftheADEPTproject. Forthisdissertation,wehave46genotypedSNPsintheCCLONESpopulationforassociationdiscoveryandthemethodsdevelopedinthisstudywillbeappliedforthe7600SNPslater.Theresponsearemainlythemeasurementsoflesionlengthforfusiformrustandpitchcanker,wherelesionisaninfectedorwoundedpatchofskin. 12

PAGE 13

Thegoalsofthisdissertationare: 13

PAGE 14

1.2.1PatternsofMissingData UnivariateNon-response Table1-1. Illustrationofunivariatenon-response AgeWeight(lb)Height(feet)Race 281605.8white341156.1black552165.4white452306.3* Forthedatasetswithunivariatemissingpattern,onlyonecolumnhasmissingdataandalltheothercolumnshavecompleteinformation.Iuse\"todenotethemissingvalue.SupposewewanttoknowtherelationshipofHeightversusAge,Weight,andRacefromthedatainthistable.Ifthedatasetiscomplete,amostpossiblemethodtoemployislinearregression.Foroursituation,peoplemightsuggestdiscardingthelastobservationandcontinuetodolinearregression.Butisthedatamissingcompletelyatrandom?Willdeletingtheincompletecaseleadtoabiasedestimator? Multivariatemissing Insurveys,itisusualthatthedesignvariablesareknownbeforethestudy,butsomevariableswillnotalwaysbelledinbythepeoplebeingsurveyed.Thefollowingisanexampleofmultivariatemissingpattern. 14

PAGE 15

Illustrationofmultivariatemissing LocationImmigrationIncomeHouseholdnumberWorkingyears AlachuaYes53000510AlachuaYes6500054LeeNo7000062LeeNo***AlachuaNo*** IntheTable 1-2 thecountynameandimmigrationstatusaredesignvariablesandcanbelledinbeforethesurvey.Theleftcolumnsarequestionssupposedtobeansweredinaquestionnaireandsometimespeopledonotanswer.Again,beforedeletingtheincompletecases,weneedtoaskourselvesisthemissingdatamissingcompletelyatrandom?Arethereanybiasesofthemissingdata?Ifthemissingdataaremissingcompletelyatrandom,therewouldbenoharmtojustdeletingtheseincompletecases. Monotonemissing Thiskindofmissingpatternoftenhappensinlongitudinalstudieswheresubjectsdropoutovertimeandthecolumnrecordsinthelatertimetendtohavemoremissingdata.Forexample,Table 1-3 isamedicalexperimentandresearcherswanttotesttheeectofacertaindrugforbloodpressure: Table1-3. Illustrationofmonotonemissing Weight(lb)BPin1MBPin3MBPin6MBPin9MBPinyear1 150145150150156160167111132140145150200167150156160170230170****180210200*** Intheabovetable,\BPin1M"meanstherecordofbloodpressureinonemonth.Iftheobservationismissingat3months,thentherecordsforthelatermonthstendtobemissingtoo.Thereasonfortheoccurringofmissingvaluesmightbethatthesubjectmovedtoanewcity,orhewastoosicktocontinuetheexperimentorjustanyotherreasonthathecouldn'tcontinuetheexperiment.Itwilllikelytointroducebiasifwejust 15

PAGE 16

Generalmissingpattern Thegeneralmissingpatternisthepatternthatcannotbeputinanyspecialcat-egory.Forexampleingeneticdataanalysis,thefollowingdataaretypical:Inthis Table1-4. Illustrationofgeneralmissingdatapattern PhenotypicvaluefemaleparentmaleparentSNP1SNP2SNP3 15.331134513453ggtt*16.2351134513451*tcgc17.891135411542gc**19.311145311671**cc20.321134511651cccc* example,therstcolumnrecordsthephenotypicvalueofthetraitwhichtheresearcherisinterestedin.ThesecondandthethirdcolumnrecordstheparentsinformationandtheleftcolumnsaretheSNPinformation.Formicroarraydata,typicallywehavemorecolumnsofSNPinformationthanthatoftheabovetable.Solistwisedeletionistotallyinappropriatesinceittendstodeletethemajorityofthedataandwastelotsofeort.Moreappropriatemethodsshouldbeapplied. Filematching Inthiscategory,foreachobservation,onlyonevariableisobservedoutofthecovari-atestobelledin.Sointheworstcases,theremightbenocompletecases.Denitelybettermethodsthanlistwisedeletionneedtobeemployed.Thefollowingisanillustra-tion: Table1-5. Filematchingmissingdatapattern AgeJobtitleIncome(dollars) 55manager*40banker*31*7000023*3500049CEO* 16

PAGE 17

LittleandRubin(1987 ).Accordingtotheirdenition,themechanismofmissingdatacanbedividedinto3categories.SupposetherearemissingdataonvariableYandweusevariableMtodenotewhetherthevariableYisobservedornot.M=1meansthevariableismissingandM=0,otherwise. IfwedenotetheobservedpartofYbyYobs,andthemissingpartbyYmis,thenY=(Yobs;Ymis).Letbetheparameterofthemissingmechanism.ThesituationofmissingcompletelyatrandomcanbeexpressedasP(MjYobs;Ymis;)=P(Mj):

PAGE 18

Onethingtonoteisthat,strictlyspeaking,wecannotverifythatthemissingdataisMARsincethemissingdataarenotobservedandthusimpossibletoverifywhethertheprobabilityofthemissingdataaretotallyindependentoftheunobservedvalue.However,MARisareasonableassumptioninmanycasesanditisaneasierassumptionforstatisticalinferencethanNMAR,unlessthereisevidenceofnotmissingatrandom,generallywecouldassumeitisMAR. Whenthedatasetiscomplete,directprocedureslikemaximumlikelihoodcanbeemployedforthestatisticalinference.Wewouldhave^=argmaxP(Yj): LittleandRubin(1987 ).Whenthedistinctnessholds,wehave:P(M;Yobs;Ymisj;)=P(MjYobs;Ymis;)P(Yobs;Ymisj):

PAGE 19

=ZP(MjYobs;)P(Yobs;Ymis;j)dYmis=P(MjYobs;)ZP(Yobs;Ymisj)dYmis=P(MjYobs;)P(Yobsj): undertheassumptionofignorablenon-response,^=argmaxP(Yobsj); 19

PAGE 20

Thelistwisedeletionmethodbasicallydeletestheincompletecasesandprocessesthestatisticalanalysisasiftherewerenomissingdata.Thismethodisthemoststraightfor-wardmethodandisthedefaultoptioninalotofpopularsoftware.WhenthedataareMCAR,listwisedeletionisequivalenttoasub-samplingofthepopulationandtheresult-ingstatisticalinferenceislegitimateexceptthatwewillgetlargerstandarderrorbecauseoffewerobservations.WhentheassumptionofMCARisviolated,listwisedeletionwillgivebiasedestimatessincetheremainingdataareweightedmorethantheyshouldbe.Anotherdisadvantageofthismethodisthatittendstolosesubstantialinformationfromthedataifthemissingdataoccurinmultiplecovariates. Anothermethodispairwisedeletion,whichcanbeusedinlinearregressiontoestimatethemeansorthecovariancematrix.Theideaofpairwisedeletionistouseallthecasesthatareavailableforthesummarystatisticswhichwereintendedtobecomputed.Forexample,supposewehaveabivariateresponse(Y1;Y2)andeachvariablehasmissingdata.WhenwecalculatethemeanforY1weusealltheobserveddataforY1althoughthecorrespondingobservationforY2mightbemissing.WhenwecalculatethecovarianceforY1andY2,thecaseswithbothY1andY2areobservedwillbeused.ThebiggestproblemofpairwisedeletionisthattheestimatesandstandarderrorsinmostsoftwarearebiasedwhenthemissingvaluesarenotMCAR,sincetheprincipleofpairwisedeletionisambiguousinitsimplementationinsoftware.WhencomputingthecovarianceforY1andY2,thereisnocleardirectionaboutwhichobservedvaluesshouldbeusedtocalculatethemeanofY1,thecompletecasesforY1orthecompletecasesforY1andY2. 20

PAGE 21

Maximumlikelihoodisespeciallyeasywhenthemissingdatapatternismonotonic.Considerthetwovariablecasewhere(Y1;Y2)withY2havingmissingdata.Y1hasnobservationsandY2areonlyobservedformobservationsandtheremainingnmobservationsofY2aremissing.Obviouslythisisamonotonemissingpattern.Supposethatwewritetheobservedlikelihoodas:L(;jY)=mY1h(Y2jjY1j;)nY1g(Y1jj); Dempsteretal.(1977 ).ThebasicideaofEMistomaximizethelikelihood 21

PAGE 22

Inanyincomplete-dataproblem,thedensityofthecompletedatacanbefactoredas:p(Yj)=p(Yobsj)p(YmisjYobs;): wherel(jY)istheloglikelihoodofp(Yj)andl(jYobs)istheloglikelihoodofp(Yobsj)andcisaconstant.p(YmisjYobs;)iscalledthepredictivedistributionofthemissingdatagiven.BecausethesecondtermintheaboveformulahasYmisinit,wecannotmaximizeEquation 1{2 directly,insteadwewillintegrateouttheYmisoverthepredictivedistribution.Supposethecurrentrealizationofis(t)andthepredictivedistributionisP(YmisjYobs;(t)),theintegrationyields:l(jYobs)=Q(j(t))+H(j(t))+c; SotheEMalgorithmconsistsof2stepsofiterations: RepeatthetwostepsuntiltheMLestimatesconverge. Wu(1983 )wasthersttoprovethat(t)!^. Dempsteretal.(1977 )provedthatifwelet(t+1)bethevaluethat 22

PAGE 23

Dataaugmentationby TannerandWong(1987 )isanotherwidelyusedmethodwhichoftenisusedtoexploretheposteriordistributionwithmissingdata.IthasastrongresemblancetotheEMalgorithmanditisalsocomposedoftwosteps.Intherststepofdataaugmentation,itsimulatesacompletedatasucientstatisticwhileinEMalgorithmtherststepisusedtocalculatetheexpectation.Theotherstepindataaugmentationdrawsrandomsimulationofparametersfromtheposteriordistributionofcompletesucientdatawhichcomesfromtherststep,whiletheotherstepoftheEMalgorithmmaximizestheparameter.DataaugmentationassumesMARalso. Rubin(1978 ).Aconsiderablenumberofbookshavebeendevotedtoimplementingtheframeworkofmultipleimputation,suchasAnalysisofIncompleteMultivariateDataby Schafer(1997 )andMissingDataby Allison(2002 ).GenerallytheyassumeMARifthemissingdatamechanismisnotmodeled. Theideaofmultipleimputationistocompletethedatatogetausablelikelihood,basedonwhichstatisticalinferenceismucheasiertobeconducted.Togetcompletedata,randomimputationtakestheplaceofmissingvalues.Sinceweknowtheimputedvaluesarenotthetruevaluesofmissingdata,weneedtoimputethedatamultipletimestolettherandomeectofimputationcenteraroundtheunobservedtruevalue.Themultiplecompletedatasetsthenarecombinedtogiveestimates. SupposethereisdataY(mis)whichhasmissingvaluesinitandwecreateKcompletedatasetsY(C1);Y(C2);:::;Y(CK).SupposeweperformregressionanalysisforeachcompletedatasetY(Cj)andhaveestimate^(Cj),wethenaveragetheestimates^(Cj)togetabetter 23

PAGE 24

ThewithinvarianceiscalculatedasU=1 andthecorrespondingdegreesoffreedomVKisVK=(K1)[1+U B]2:ThiscombinedvariancewillapproximatelyhaveatdistributionwithdegreesoffreedomVK. BarnardandRubin(1999 )suggestedusingVK=[1 Lietal.(1991 )and MengandRubin(1992 )suggestthefollowingasanapproximatelikelihoodratiotest.DK=(

PAGE 25

Lietal.(1991 )showthattestisquiterobusttotheviolationofthisassumptionwhenK3. OnequestionthatneedstobeansweredishowbigshouldKbe? Rubin(1987 )givesthefollowingformulatocalculatetheeciencyunderproportionofmissingdataandKcreatedcompletedatasets:Ef=(1+ K)1: Inpractice,multipleimputationrequiresamechanismtoimputethemissingdataandthereareanumberofwaystodothat. Reilley(1993 )considerstheHotDeckimputationwhichdrawsrandomsamplesfromtheactualobservedvalueswithequalweightsasanimputation.Thisisaneasilyemployedmethodand Efron(1994 )callstheHotDeckmethodanon-parametricbootstrapmethod. XieandPaik(1997 )proposedusingtheBayesianversionofHotDeckmethod;insteadofdrawingtherandomsamplewithequalweightsfromtheobservedvalues,theyputaDirichletpriorontheweights.Anotherimputationmethodassumesallofthecovariatesarenormallydistributed,evenforcategoricalvariables.Thecategoryisroundedtothenearestcumulativefunctioncategory. SmithandSpiegelhalter(1980 )proposedauniedapproachofpriorspecicationformodelchoiceandtheyshowedthatdierentpriorapproachescouldleadtoaSchwarz-typecriterionorAkaikeInformationCriterion. MitchellandBeauchamp(1988 )useda\spike 25

PAGE 26

GeorgeandMcCulloch(1993 ), GeorgeandMcCulloch(1995 ),and GeorgeandMcCulloch(1997 )advocatedtheMarkovchainMonteCarloapproachfortheposteriorcalculationandthismadethevariableselectionforlargecandidatespossible. Brownetal.(1998 )extendedtheapplicationofBayesianvariableselectiontomultivariateresponsesandaMarkovchainMonteCarloalgorithmwasemployedtospeedupthecomputation. BergerandPericchi(2001 )proposedtheobjectiveintrinsicBayesFactorsformodelselection.AnobjectivefullyautomaticBayesianprocedurewasproposedby CasellaandMoreno(2006 ).Theyusedintrinsicpriorstocalculatetheposteriorprobabilitiesandastochasticsearchalgorithmwasemployed.Justafewrelatedpapersarelistedaboveandmanyothersexistintheliterature. Innormallinearregression,wehaveadependentresponseYn1andaset(X1;:::;Xs)ofspotentialexplanatorypredictorsandweassumeY=X+,withYjXN(0;2I).Theactualvalueofsomeofthei;i=1;:::;s;maynotbesignicantlydierentfrom0ormaybecorrelatedwithanother.ThegoalofvariableselectionistondasetofpredictorsX1;:::;Xr,whichisasubsetofX1;:::;Xs,suchthateachcorrespondingihasapracticalsignicanteectonYandthefullmodelissimpliedtoacertaindegree.IwillemployanindexvectorwithlengthsandleteachelementofindexwhetherthecorrespondingpredictorXiisincludedintheselectedmodelornot.Thevaluei=1meansXiisintheselectedmodelandi=0otherwise.Sothevariableselectionproblemistondavectoraccordingtosomecriterion. ManyvariableselectionmethodshavebeenbasedontheAkaikeInformationCrite-rion,AIC Akaike(1973 ),Mallows'Cp )andBayesianInformationCriterion,BIC Schwarz(1973 )andhavebeenappliedtomanyproblemswhensisreasonablysmall. 26

PAGE 27

SupposeXrepresentstheselectedpredictormatrixindexedbyandlet^=(X0X)1XYdenotetheleastsquaresestimateof.Iusertorepresentthenumberofvariablesintheselectedsubsets.ThesumofsquaresforregressionisSS=^0X0X^: where^2isanestimateof2andCisapenalizedtermchosenbydierentcriterion.IfCtakesthevalueof2,CpisresultedandgivesexactBICtoo.ForAIC,takeC=logn. GeorgeandFoster(1994 )proposedtotakeC=2logpastheriskinationcriterion,RIC.Theabovementionedmethodshavedierentmotivations,unbiasedpredictiveriskestimateforCp,expectedinformationdistanceforAICandasymptoticBayesfactorforBIC.Whenthenumberofavailablepredictorsaresmall,theabovemethodshavebeenwidelyapplied.Whenconsideringproblemswithhundredsofpredictors,eachpotentialpredictorcaneitherbeincludedintheselectedmodelornotandthereforethereare2smodels,anumberthatcaneasilygobeyondthecomputationabilityandtheabovementionedcriteriawillbeimpossibletobeapplied.Newmethodologiestohandlehundredsofpredictorsareneeded.

PAGE 28

CuiandGeorge(2008 )alsoproposedputtingpriorsforcandpiandintegratingthemout. Inthehierarchicalmodel,thepriorfor2needstobespeciedtooandnormallyitistakenastheinverseGammaGa(=2;),whereandarehyper-parameters.Moredetaileddiscussionabouthowtodecidethehyperparameterc;p;andhasbeengivenin GeorgeandMcCulloch(1993 ), GeorgeandMcCulloch(1997 ). Withtheabovespecication,theposteriordistributionf(jY)couldbecalculatedandthentheideaistoranktheposteriordistributiontodecidetheidealsetsofpredic-tors.Theactualposteriordistributioncalculationisachallengeandmethodstoaddressthiswillbediscussedinalatersubsection. )proposedthefullyautomaticBayesianvariableselectionprocedurebyusingintrinsicpriors.Letrepresentthesetofrealizationsofalland=1correspondstothefullmodelwithallpotentialpredictors.TheBayesfactorisdenedasB1=m(Y;X)

PAGE 29

1+P2;6=1B1(Y;X);2; 2; Theintrinsicpriorapproachcomesfromthemodelstructureandisfreeofhyper-parameters(donotneedtoconsiderarangeofhyperparameters),soitcanbeusedasdefaultprioranditiscurrentlyoneoftheuniqueobjectiveprocedures. 29

PAGE 30

AgeneralwaytoimplementanMCMCprocedureistorunaGibbssampler(0;0;0);(1;1;1);:::;(t;t;t);::: CasellaandMoreno(2006 )havemoredetaileddiscussionaboutit. AnotherwaytoexploretheposteriordistributionofP(jY;X)istoconstructaMetropolis-HastingsMarkovChain.Normallythechainwillnotonlyvisitallthemodels,butalsovisitthebettermodelmore.Thisishowitworks:atiterationt,drawacandidatefromacandidatedistributionV;acceptthecandidatewithprobabilitymin1;P(t0jY;X)V(t) )usedtheGibbssamplertogeneratesamplesfromthejointdistributionofmodelandparametersconditionedontheresponse.Thismethodis 30

PAGE 31

Dellaportasetal.(2002 )proposedahybridGibbs-MetropolisstrategybasedonCarlinandChib'smethod. Green(1995 )suggestedthereversiblejumpMCMCmethodwhichcouldbeusedformodelselectionwithdierentdimensions. Chib(1995 )alsoproposedtoestimatethemarginallikelihoodbyusingblocksofparameters.Theselistedmethodsgenerallyworkwellformoderateamountsofcandidatevariablesandsomeofthemhavegoodapplicationresultsforspecialsituations.Forexample,reversiblejumpisgoodformixturemodelingwithunknownnumberofelements.However,thereisalmostnoinvestigationinvariableselectionwhenacertainpercentageofvariablesaremissing. InChapter2,anassociationtestisproposedtodetectthesignicantSNPsforthetargetphenotypictraits.Asthepopulationstructurecontainssubstantialgeneticinformationofthepopulationanditindexestheclosenessoftwoindividualswithinthepopulation,weuseanumeratorrelationshipmatrixtoquantifytheclosenessbetweenanytwoindividualswiththepopulation.ABayesianhierarchicalmodelisproposedandtheGibbssamplerisemployedtosampletheparametersofthejointlikelihood.InordertobeabletohandlethousandsofSNPstowardstheendofADEPT2project,methodstospeedupthecomputationareproposed:aGibbssamplingprocedurewhichjustsamplesonecolumnofSNPsineachcycleinsteadofallthecolumnsofSNPsineachcycle;andamatrixidentitywhichtakesadvantagesoftheupdatingschemeandavoidsdirectmatrixinversioncalculation.Simulationstudiesandrealdataanalysewillbedetailed. InChapter3,aBayesianmodelselectionmethodisproposedtoselect\good"subsetsofvariables,meanwhilemissingdataarebeingproperlytakencareof.Weusethe 31

PAGE 32

InChapter4,asummaryofthedissertationisgivenandthecontributionofthisdissertationwillbediscussed.Somefutureworkwillalsobediscussed. 32

PAGE 33

Asthetechnologiesaredevelopingsorapidly,SNPdataaregettingcheaperandeasilyavailable.Atthesametime,scientistsarehavingmoreopportunitytousehighthrough-putSNPdatasetswhichwerenotpossiblebefore.AlthoughmicroarraytechnologycouldprovidehighthroughputSNPdata,atthecurrentability,typically5%to10%ofgeno-typesaremissingaspointedby Daietal.(2006 ).OnechallengescientistsarefacingistouseavailableSNPandphenotypictraitinformationtodetecttheSNPswhichhavestrongassociationwithphenotypictraits,atthesametimeastatisticalmethodisneededtoproperlyaddressthemissingSNPs. Populationassociationinvolvingmissinggenotypicmarkershasmadealotofprogressforhumangenetics,especiallyinthecaseoftightlylinkedgenotypes,thatis,mostatten-tionhasbeenfocusedonne-scalemolecularregion.\Phase"packageby Stephensetal.(2001 )aimsathaplotypereconstructionofgenotypeddataandusesanEMalgorithmtomaximizethelikelihoodofhaplotypefrequencies.Theyuseanapproximatetypeofpriorfortheconditionalhaplotypedistribution. ScheetandStephens(2006 )proposed 33

PAGE 34

ChenandAbecasis(2007 )proposedfamilybasedtestsforgenomewideassociation.TheyuseanidenticalbydescentparametertomeasurecorrelationforthetestSNPs,andakinshipcoecienttomodelthecorrelationbetweensiblings.Theirmethodisusedforonefamily,oneSNPatatime,anditseemsnotfullyapplicableforcomplicatedpopulationpedigreesandforsimultaneousSNPtest-ing. ServinandStephens(2007 )proposedaBayesianregressionapproachforassociationtest.Themissinggenotypeswereimputedby\Fastphase"beforehand,andamixturepriorisusedfortheSNPeect.ThepriorsforthenumberofsignicantSNPsissettobesmall,andhassomeinuenceontheresults. Daietal.(2006 )usedEM,weightedEM,andanonparametricmethod(CART)forassociationstudies.Theyusedmultipleimputationsamplesfromthetreebasedalgorithms. Robertsetal.(2007 )proposedafastnearest-neighborbasedalgorithmtoinferthemissinggenotypes. Marchinietal.(2007 )proposedaunifyingframeworkofmissinggenotypeimputationandassociationtestingbasedonhaplotypesandotheravailablehumangenomicdatasets. SunandKardia(2008 )proposedaneuralnetworkmodelforthemissingSNPsandusedtheBICtochoosethepredictorsinthemodelandfurtherpredictthemissingSNPsaccordingtothechosenmodel. Balding(2006 )givesadetailedreviewofassociationstudiesandsomemissinggenotypemethodsbasedonhumangenetics.Thereismuchmoreliteratureonthistopic,theabovebeingjustasample.Almostalltheliteraturewefound,however,focusesonhumangeneticassociationwherethemarkersaretightlylinkedandhaplotypebasedinferencearedominant.Somepapersaredevotedtothesituationsofparentsinformationbeingmissingandproposedlikelihoodbasedmethods,see Weinberg(1999 ), Martinetal.(2003 ),and Boylesetal.(2005 ). 34

PAGE 35

2.2 EachrowofthematrixZ,Zi;i=1;:::ncorrespondstotheSNPgenotypeinforma-tionofoneindividualandforoneindividualwewriteZi=(Zoi;Zmi): WhenwellinthemissingdatawewriteZi=(Zoi;Zmi); (22)n=2Yi2I0exp1 22(YiXiZi)2 22(YiXiZi)2 Theobserveddatalikelihood,whichisthefunctionthatweeventuallyusetoestimatetheparameters,isbasedontheobserveddataonly.Wherethereismissingdata,thecompletedatalikelihoodmustbesummedoverallpossiblevaluesofthemissingdata.So 35

PAGE 36

(22)n=2Yi2I0exp1 22(YiXiZi)2Yi2IMXZiexp1 22(YiXiZi)2; 22(YiXiZi)2 PZiexp1 22(YiXiZi)2;(2{3) wherethesuminthedenominatorisoverallpossiblerealizationsofZi.ThisisadiscretedistributiononthemissingSNPdataforeachindividual.Tounderstandit,lookatoneindividual. Supposethattherearegpossiblegenotypes(typicallyg=2or3)andindividualihasmissingdataonkSNPs.SothedataforindividualiisZi=(Zo;Zm)whereZmhaskelements,eachofwhichcouldbeoneofgclass.Forexample,ifg=3andk=7,thenZmcantakevaluesinthefollowingTable 2-1 ,wheretheshowsonepossiblevalueoftheZmi.Fortheexample,thereare37=2187possiblevaluesforZmi.Inarealdatasetthiscouldgrowoutofhand.Forexample,iftherewere12missingSNPsthenthereare531,441possiblevaluesforZmi;with20missingSNPsthenumbergrowsto3,486,784,401(3.5billion).IfweemploytheEMalgorithm,ineveryiterationstep,foreachobservationweneedtosumoveragiganticnumberoftermstocalculatethedistributionofthemissingdataandthisiscomputationalinfeasible.SothenwethoughtwemightrunaGibbsSamplingprograminsideofeachEMsteptogeneratethemeansforthemissingdataandavoidsummingoverbillionsofterms. TheGibbsSamplerforthemissingdatasimulatesthesamplesofZmiaccordingtothedistributionofeachmissingelementconditionedontherestofthevector.MoreofthiswillbediscussedinSection 2.2 andherewejustgiveaavorwhytheGibbssampler 36

PAGE 37

Illustrationofcombinationsofmissingdataforoneobservation. SNP 22(YiXiZoioiZmi(j)mi(j)cmij)2 P3`=1exp1 22(YiXiZoioiZmi(j)mi(j)c`mij)2(2{4) wherethereareonlyg=3termsinthedenominatorsumandthisbasicallymadetheGibbssamplertobecomputationalpossibleforus. Forplantassociationorotheragriculturalgenomeassociationstudies,thewholegenomesequencelibrarymightnotbecompleteandthedegreeoflinkagedisequilibriuminformationmightnotbeavailable.Thusnewmethodologiesareneeded. Toimputemissingdata,multipleimputationhasreceivedmuchattentionsinceitsproposalby LittleandRubin(1987 ).Itacknowledgestheuncertaintyduetomissingdataandbasesinferenceonmultiplyimputeddatasets.Itaddressesthevariabilityinthemissingdatathroughmultipleimputation.Mostmethodsforlargegeneticdatasetsintheliterature,except Daietal.(2006 ),usesingleimputationandbasetheimputationonthelargestprobabilityforthecandidatecategories.Althoughsingleimputationtendstogivefastcalculation,itgenerallylosespowerandcangivebiasedparameterestimates.Notlikedoingsingleimputation,wetreatthemissingSNPsasparametersandimputethemeachiteration.ThiswouldbelessbiasedcomparedwithsingleSNPimputationandwouldmoreaccuratelycapturethevariationinthedata. Inanassociationstudyacrosstheentirepopulation,typicallythefamilypedigreeisveryimportant,sinceitexplainsthegeneticinformationsharedbyrelativesandwhichisnotnecessarygenotyped. Yuetal.(2005 )usedakinshipmatrixandpopulationstructure 37

PAGE 38

ChenandAbecasis(2007 )proposedfamily-basedassociationtesting,buthismethodcountstherelationshipwithinfamiliesandnottherelationshipbetweenfamilies.Weproposetouseanumeratorrelationshipmatrixtoexplaintherelationshipofindividualswithinfamiliesandbetweenfamiliesaswell.Theideabehindcalculationofanumeratorrelationshipmatrixwasoriginallydueto Henderson(1976 );seealso Quaas(1976 ). WiththemotivatingCCLONESdataset,weproposeaBayesianmethodologyforassociationstudies.WemodelthemissingSNPsasparametersanduseGibbssamplingtosampletheparameters,includingthemissingSNPs.FormissingSNPimputation,thisisessentiallymultipleimputation.Wetakeadvantageofthefamilyinformationfromtheavailablefamilypedigreeandparentpedigree.Itisnotahaplotype-basedmethodanddoesnotrequiretheSNPstobeinlinkagedisequilibriumandallowsthemodeltondthecorrelation(ifitexists).ThispropertyisquiteappealingforplantassociationoranyotherassociationforwhichwedonothaveasequencedgenomelibraryordetailedinformationaboutthecandidateSNPs.ItisespeciallyusefulforgenomescansandcanpointoutsomesignicantSNPsforfurtherstudy.Finally,weprovethatjustupdatingonemissingSNPforeachobservationineachiterationwillstillachievethetargetstationarydistribution.ThiswillsubstantiallyincreasethecalculationspeediftherearelargenumbersofSNPsforeachobservation,andgivesustheabilitytodealwithlargehighthroughputdatasets. Thischapterisorganizedasfollows.InSection 2.2 weexplaintwoproposedmodels,andalsoproposeamethodtoincreasethecomputationspeed.InSection 2.3 ,simulationresultsaregivenandalsowecomparewithotherresults.InSection 2.4 ,therealdataanalysisresultsarereportedhere.InSection 2.5 ,weexplainamethodtocategorizethecovariancerelationshipsforanygivepedigreesituation.Intheend,atSection 2.6 ,wegiveadiscussion. 38

PAGE 39

Inthissection,wediscusstwomodelsforthisdatasituations.Forthesimplicityofanalysis,weemployoneofthethemthroughthesimulationandrealdataanalysis. Themodelis 39

PAGE 40

EachrowofthematrixZ,Zi;i=1;:::n;correspondstotheSNPgenotypeinforma-tionofoneindividual.Someofthisinformationmaybemissing,andwewriteZi=(Zoi;Zmi) whereZoiaretheobservedgenotypesfortheithindividual,andZmiarethemissinggenotypes.Notetwothings: 1. ThevaluesofZmiarenotobserved.Thus,ifdenotesonemissingSNP,apossibleZiisZi=(1;;0;0;;;1): IndividualsmayhavedierentmissingSNPs.Sofor2dierentindividuals,wemighthaveZi=(1;;0;0;;;1)Zi0=(;;1;0;0;1;1); 40

PAGE 41

2) (2)a+1: HobertandCasella(1996 ),wetakea;b;c;dtobespeciedvalues,whichensuresaproperposteriordistribution. Nowletusspecifythevaluesofa;b;c;dandmakesurethattheposteriordistribu-tionsareproper.Firstofall,wecanwritethemodelspecicationasthefollows HobertandCasella(1996 ),thefollowinginequalitiesmustbesatised 41

PAGE 42

Furthersimplicationweleadtothefollowingequations:0>c>sa>n Asthenumberofobservationsinthedatasetisaboutonethousand,thenumberofparametersforSNPsareaboutonehundredandthedimensionofis61,wetakec=1,a=3,b=1,andd=1.Theabovespecicationmakessurethattheposteriordistributionsareproperandthepriorshavewideatranges. Asourdatasetisfromtheloblollypinegenome,aswithmanyotheragriculturegenomes,itisnotfullysequencedandthereisnotenoughpriorinformationintermsofthefrequencyofgenotypesforeachSNP.Also,thephysicalpositionsoftheSNPsarenotfullyrecordedandtheSNPsmaynotbetightlylinkedasclustersofhaplotypes.SononinformativepriorsareusedforthemissingSNPs.AstheSNPsconsideredherearebiallelic,eachmissingSNP,has3possiblegenotypes:homozygous,heterozygousandmutanthomozygous.Forexample,ifthetwoallelesinthelociareAandG,themissing 42

PAGE 43

ForthemissingSNPsinthedata,weassumethattheyaremissingatrandom,MAR.Inanotherwords,weassumethattheprobabilitythatwhethertheSNPismissingisrelatedtotheobserveddata,suchasthephenotypictraitorotherobservedSNPs,andisindependentoftheunobservedinformation.Conditionalonthisassumption,wewillimputethemissingSNPsbasedonthecorrelationbetweenSNPswithinindividualsandbetweenindividuals,andusethephenotypictraitinformationtoimprovethepowerofimputation.MARisareasonableassumptionandnotasstrictasmissingcompletelyatrandom,MCAR. Inthismodel,thecovariancematrix,R,modelsthecovariancebetweenindividualswithinthesamefamily,andthecovariancebetweenindividualsacrossfamilies.Phenotypictraitsofrelatedindividualsarealikebecausetheysharealargefractionofgeneticmaterial.Genotypesofrelativesaresimilarbecausetheysharethesameparentsorgrandparents(insomedegree).GenotypedSNPsmayexplainsomepartofthephenotypictraits,stilltheun-typedgeneticinformationcontributestothephenotypictraitsaswellasotherfactors,suchasenvironmentaleects.SimplyusingtheSNPmarkerinformationwithoutconsideringthefamilypedigreeorhistoryisnotawiseapproach.WeexpectthatincorporatingthefamilystructureinformationwillincreasethepowerofcapturingtheunderlingnatureofthedatasetandhelptodetectthesignicantSNPs. Intheliterature,therearedierentmethodstocalculatetherelationshipmatrix,suchasusingaco-ancestrymatrix,akinshipmatrix,etc.Thebasicideaistocalculatetheprobabilitythat2individualsshareonegeneorSNP,whichispasseddownfromthesameancestry.Someofthemethodsemploypairwisecalculationandthusdonotguaranteeapositivedeniterelationshipmatrix,whichisusuallynotsatisfactorywhentherelation-shipmatrixisusedascovariancematrix.Weusetherecursivecalculationmethoddueto Henderson(1976 ).Thismethodgivesanumeratorrelationshipmatrixwhichquantiesthe 43

PAGE 44

B .Noticethatevenifthefamilypedigreeorparentpedigreeisnotavailable,wecanstillusetheproposedmethodtocalculatetherelationshipmatrix.Inthefollowing,wehaveSection 2.5 dedicatedtocategorizingtherelationshipsforanypedigreehistory. SameasEquation 2{6 ,wehavethefollowingmodelspecicationp(Yj;;Z;2;2)N(X+Z;2R);()/1;()N(0;22I);(2)IG(a;b);(2)IG(c;d): 21Z0R1(YX);2Z0R1Z+I 21!21 (2)n=2+s=2+a+1exp(YXZ)0R1(YXZ)+jj2 (2)s 22: 44

PAGE 45

whereKc=(YiXiZoioiZmi(j)mi(j)cmij); Themodelis 45

PAGE 46

Thevariancetermisspeciedastheidentitymatrixrightnowforthesimplicity.Thefullspecicationofmodel 2{9 includesthefollowingpriordistributions: 46

PAGE 47

(2)(n+s+r)=2(2)s=2(2)r=2exp1 22jYXZWuj2exp1 222jj2 222juj2(2)(2)(2) Accordingto HobertandCasella(1996 ),thefollowinginequalitiesneedtobesatisedtomakesurethattheposteriordistributionsareproper: 2c>rank(Z) (2{12) 2e>rank(W)2a>nc<0e<0rank(Z)>rank(K)t2crank(W)>rank(K)t2en+2e+2c+2ap>0; 2{12 47

PAGE 48

2{11 ,withaatprioron2;2and2asweshowedinabovethattheyensurethepropertiesofposteriordistributions,wehavethefollowingfullconditionalsdistributions. (2)(n+s+r)=2 22jYXZWu)j2+1 (2)r=2exp1 22juj2 (2)s=2exp1 22jj2 (2{14) =!ij((Zoi;Zmi(j);c))exp1 22(YiXiWiuZoioiZmi(j)mi(j)cmij)2 Pg`=1!il((Zoi;Zmi(j);c`))exp1 22(YiXiWiuZoioiZmi(j)mi(j)c`mij)2; Usingtheabovemethod,weneedtosampleuineachiteration,andthisisfeasibletheoretically.However,thedimensionofuisreallybigandwedonotreallywantthesamplesofu.Nowwewanttotrytointegrateuoutandgetthemarginaldistribution. 48

PAGE 49

22(YXZ)0B(YXZ)exp1 222jj2 2{11 areensuredtobeproper.Withthesameatpriors,theposteriordistributionsfromtheEquation 2{15 willstillbeproperbecausetheonlydierencemadehereishavingubeingintegratedoutandthisdoeschangethepropertiesofotherposteriordistributions. BasedonEquation 2{15 ,wehavethefollowingfullconditionals (2)(n+s)=2exp1 22(YXZ)0B(YXZ)+1 (2)s=2exp1 22jj2 Weransomesimulationsonthisdatasetandwefoundoutthecomputationtimeisespeciallylong.Thereasonsforthatiswhenweconsiderobservationsbasedonremetes 49

PAGE 50

ThepercentagesofSNPcategoriesinthegenerateddatasets. PercentagesSNP1SNP2SNP3SNP4SNP5 Doublehomozygous:13%30%91%77%39% Heterozygous:53%38%7%19%54% Mutanthomozygous:33%32%2%4%7% insteadofaveragesofremetes,thenumberofobservationsisabout4timesbigger.Consequently,thecomputationtimeismuchlonger,intermsofmatrixinversionandmatrixdeterminant.Wedecidetousetherstmodeloftheanalysisanddonotconsiderthevariationwithintheremetes.NoticethatthisdoesnotchangetheinferenceontheSNPs. 2{7 )and( 2{8 ),ifinsteadofupdatingalltheparameters(t);(t);Z(t)np;2(t);2(t)ineachcycle,wejustupdate(t);(t);Z(t)j;2(t);2(t)ineachiteration,theMarkovChainachievesthesametargetstationarydistributionandergodicityalsoholds.HereZjdenotesthejthcolumnintheZnpmatrixforSNPs(thejthSNPforallobservations)andjchangesaccordingtotheiterationindex. A .TheactualmeaningofthistheoremisthatinsteadofupdatingtensorhundredsofSNPsinonecycle,wejustneedtoupdateoneSNPineachcycle.Thiswilldramaticallyspeedupcomputation,especiallywhentherearelargenumbersofSNPsinthedata. 50

PAGE 51

2-5 astheactualvalues.TheSNPeects(additiveanddominanteects),whichwereusedtosimulatethedata,arelistedasactualvaluesinTable 2-5 .Weletthevarianceparameter2be1.Theproposedmethodologywasappliedtoanalyzethedatawithoutmissingvaluesandwasalsoappliedtodatawithdierentpercentagesofmissingvalues.WewanttoseewhetheritcouldidentifythesignicantSNPsandcheckitsperformanceagainstdierentpercentagesofmissingvalues,aswellasitsperformancefordierentprobabilitiesofSNPgenotypecategory. OurultimategoalistondthesignicantSNPsfromthecandidateSNPs.Sincewebelievethatimputationisatooltoobtainbetterestimatesoftheparameters,wearenotparticularlyinterestedinrecoveringtheactualimputedvaluesforthemissingSNPs.Withthatbeingsaid,thesimulationresultsinTable 2-3 showthatwhentheprobabilityofonegenotypeforacertainSNPisdominantlyhigh,theimputedSNPsarecorrectlyidentiedwithsubstantiallyhighprobability.ThiscouldbeseeninTable 2-3 ,where 51

PAGE 52

ThepercentagesofcorrectlyimputedSNPsfordierentprobabilitiesofSNPcategories.10%missingvaluesexist. Probabilitiestobe\gg"0.13090.30120.81810.77190.3983Probabilitiestobe\gc"0.53070.38750.07960.19500.5425Probabilitiestobe\cc"0.33840.31130.10230.03310.0592Correctlyimputedprobabilities:0.550040.547880.633720.850750.65159 Probabilitiestobe\gg"0.03090.30120.31810.37190.0983Probabilitiestobe\gc"0.83070.38750.07960.19500.8425Probabilitiestobe\cc"0.13840.31130.60230.43310.0592Correctlyimputedprobabilities:0.75890.350520.976210.648290.85301 2-4 andTable 2-5 listtheparameterestimationforfamilyeectandSNPeects. Table2-4. Theestimatedmeansoffamilyeectsfordierentdatasetswithdierentpercentagesofmissingvalues.Thismethodologygiveaccurateestimatesasthepercentageofmissingvaluesgoesupto20%. Estimatedmeansfamily1:1family2:2family3:3 NomissingSNPs15.4520.6525.48 5%missingSNPs15.1620.7425.46 10%missingSNPs16.1821.3825.65 15%missingSNPs15.4519.6324.59 20%missingSNPs14.8720.1824.68 Estimatedmeansfamily4:4family5:5family6:6 NomissingSNPs29.8434.7640.40 5%missingSNPs28.2933.4338.62 10%missingSNPs30.7135.8640.81 15%missingSNPs30.1835.3840.18 20%missingSNPs30.0834.8840.13 52

PAGE 53

TheestimatedmeansofSNPeectsforthedatasetwithoutmissingvaluesandfordatasetswithdierentpercentagesofmissingvalues. ActualSNPvalue:SNP1:aSNP1:dSNP2:aSNP2:dSNP3:a -2.001.001.00-1.003.00 MeansfornomissingSNPs:-2.161.000.82-0.752.59 Meansfor5%missing:-1.861.141.16-1.053.00 Meansfor10%missing:-1.950.771.18-1.522.74 Meansfor15%missing:-1.800.780.99-0.962.48 Meansfor20%missing-2.081.291.21-0.763.10 ActualSNPvalue:SNP3:dSNP4:aSNP4:dSNP5:aSNP5:d 02.500.100.303.00 MeansfornomissingSNPs:0.302.430.60-0.042.38 Meansfor5%missing:0.052.21-0.200.482.88 Meansfor10%missing:0.182.510.130.002.53 Meansfor15%missing:0.672.430.470.733.20 Meansfor20%missing1.321.87-0.200.473.30 Allthecalculationswerebasedonsamplesobtainedaftertheinitial20000stepsofburn-in.TheresultsfromTable 2-4 andTable 2-5 showthatwhenthepercentageofmissingvaluesisnottoohigh,lessthan15%,theproposedmethodologygivesgoodestimatesfortheparametersthatweareinterestedin.Whenthepercentagesofmissingvaluesisbeyond15%percentage,weneedtobeverycarefulwithinterpretingtheresults.TakeSNP3asexample,thedominanteectforSNP3isactually0andtheestimatewas1:32whenthepercentageofmissingvaluesis20%,althoughtheestimatesareaccuratewhenthepercentagesofmissingvaluesislessthan10%.Thereasonforthat,webelieve,isonecategoryofgenotypeforSNP3hassubstantiallyhigherprobabilityanditoverdominatestheothertwocategoriesandthisoverdominance.Whenthepercentageofmissingvaluesgoesup,thedominatedgenotypecategoryhasonlyasmallchancetobewellrepresentedandthusmayhaveunreliableestimates.Generally,asmostmicroarraydatahavelessthan10%missingvalues,themethodologyperformswell. 53

PAGE 54

Forthisproject,wearespecicallyinterestedindetectingtherelationshipbetweenlesionlengthandthegenotypedSNPs,aslesionlengthisoneofthemostimportantquantitativetraitsofpitchcankerdisease.WealsohavephenotypicdataofcarbonisotopediscriminationvaluesfromloblollypineswhicharegrowninPaltaka,FloridaandCuth-bert,Georgia.Thecarbonisotopetraitisrelatedtothewateruseeciency,thushasanimportantroleinthesizegrowthofloblollypineandfurtherhassubstantialeconomicalvalues.Geneticallyspeaking,theloblollypinesinPaltaka,FloridaarereplicationsofloblollypinesinCuthbert,Georgia,excepttheyhavedierentenvironment,whichmightleadtodierentgenetic-environmentinteraction.Asforlesionlength,withreplicationofgeneticinformation,itisadierentphenotypictraitcomparedtothecarbonisotopedis-criminationtrait.Sowehavethreephenotypictraitsdata:CarbonisotopefromPaltaka,CarbonisotopefromCuthbertandlesionlengthdata.ThethreeshareonegeneticSNPdata. Forthedesignoftheexperiment,thereare61loblollypinefamiliesfromacirculardesignwithsomeo-diagonalcrossings.Forexample,family00isgeneratedfromparent24andparent23,whilefamily01isfromparent24andparent40.Family00andfamily01arenotindependentsincetheyshareoneparent.Originally,thiscirculardesignhad70familiesfrom44parentsalthoughourdatasetsjustcontainpartoftheexperimentdesign.Moredetailsoftheexperimentdesigncanbefoundin Kayihanetal.(2005 ).Therealso 54

PAGE 55

Forthe61families,wehavefamilypedigreeandparentpedigreeinformation.Byusingthe Henderson(1976 )method,wecalculatedthecovariancenumeratorrelationshipmatrix.ThedetailsofthecalculationareinAppendix B .Weuseauniformnoninfor-mativepriorforthefamilyeectandanormalpriorfortheSNPeectswhichweareinterestedin.Asforthevarianceparameters,weuseinvertedgammadistributionsandmakesurethattheposteriordistributionsareproperaccordingto HobertandCasella(1996 ).EachSNPisparameterizedwithadditiveanddominanteects.Forexample,ifthisSNPiscomposedby\A"and\T"nucleotides,theadditiveeectisonehalfofthedierenceoftheSNPeectbetweenhomozygous\AA"andmutanthomozygous\TT".Ifweuse(a;d)TtoparameterizetheSNPeect,SNP\AA"wouldhaveeect(1;0)(a;d)TandSNP\TT"wouldhaveeect(1;0)(a;d)T.Thedominanteectis 55

PAGE 56

Thersttraceplotisfortherst2familyeectparametersforthelesionlengthdata.ThesecondplotisforoneofSNPparameterforthecarbonisotopedatafromPaltaka,Florida.Thesamplesaretakenafterinitial40000stepsofburnin. thedierenceoftheheterozygouseectandtheaverageofhomozygouseects.Weusetheeect(0;1)(a;d)TtoparameterizethedominanteectforSNP\AT".WeupdatedthemissingSNPsbycolumnsasdetailedinSection 2.2.3 ,sothecomputationisfast(about3hoursfor30000iterations).Weuse40000iterationsasinitialburninandrecordthesamplesoffurther2000iterationsforthe3dataanalysis. TwotraceplotsareshowninFigures 2-1 .TherstoneisforoneoftherstfamilyeectparametersforthecarbonisotopediscriminationdatafromPaltaka,Florida.ThesecondplotisthetraceplotofvarianceparameterforthecarbonisotopediscriminationdatafromCuthbert,Georgia.BothplotssuggestthattheGibbssamplingsimulationconvergestothestationarydistributionafterburnin. Figures 2-2 2-3 ,and 2-4 showthecondenceintervalsforthe44SNPeectsofdierentdatasets.Wefoundthatthe22ndSNPforthelesionlengthdata,the16thSNPforthePaltakadataandthe8th,35th,36thSNPsfortheCuthbertdataaresignicant 56

PAGE 57

2-2 2-3 ,and 2-4 basedonthesamplesfromtheGibbssampling.WedidnotemploystandardmultipletestcorrectionprocedureforthetestsastherearenotmanysignicantSNPsandalsoourgoalistondsomestatisticallysignicantSNPsandletthebiologiststofurtherexplorethebiologicalpathway.ThesecondenceintervalplotspointoutthespecicSNPstofurtherinvestigateandwecouldhavemoreSNPstofollowifweloosenthecondencelimitsalitbit.InFigures 2-2 2-3 and 2-4 ,thexaxisisfortheindexofSNPs.AseachSNPhastwoparameters,with44SNPs,theindexrangesfrom1to88.TheyaxisisforthevalueofSNPeect.ForeachparameterofeachSNP,thereisasmallredlineinthetopofabluelineandasmallredlineonthebottomoftheblueline.Thesmallredlineonthetoprepresentstheupperboundofthe95%condenceintervalandthesmallredlineonthebottomrepresentsthelowerboundofthe95%condenceinterval.Thesmallgreenlineinthemiddlerepresentsthemeanoftheestimatedparameter.EachSNPhastwoparameters,additiveanddominant,correspondingly,ithastwolinesinalltheplots. 57

PAGE 58

95%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforlesionlength.The22thSNPhassignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe2ndandthe23thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. ontheotherhandiftheparent'sIDisbiggerorequalto35weknowthegrandparents'ID.So35isanimportantnumberandweneedtopayattentiontoitwhentryingtollinthecovariancematrixoffamilies.AnotherthingisfortheparentID38,originallyitsgrandparentswererecordedas13and0,outofconvenienceforprogramming,Ichangedtheminto13and1withoutchangingthenatureofthecovariance.Accordingtothedesignoftheexperiment,thefemaleandmaleparentarenotthesameforallthefamilies. Whencalculatingthecovariancebetweenindividualsfromdierentfamilies,onewayistodividethecombinationsoftwofamiliesinto11bigcategoriesandfurtherpartitioninsidethecategories.InthefollowingIwillexplaineachofthe11categories. 58

PAGE 59

95%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforthecarbonisotopedatafromPaltaka,Florida.The16thSNPhassignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe6ndandthe40thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. isthesimplestcaseandifitissatisedthetwofamilieshavezerocovariance.Inourdataset,weneedtocomputetotal2415(2415=1+2++69)covariancecellsanditturnsoutthereare507casesbelongtothiscategory. 59

PAGE 60

95%CondenceintervalsfortheadditiveanddominantSNPeectofthe44SNPsforcarbonisotopedatafromCuthbert,Georgia.The8th,35thandthe36thSNPhavesignicantdominanteectswith95%condence,whiletheotherSNPsarenotsignicant.OtherSNPssuchasthe13ndandthe28thSNPsareapproximatelysignicantwith95%condence.Thesearegoodcandidatesforfurtherbiologicalexploration. canbepartitionedinto8subcategoriesasthefollowing:0B@Family1:(a1;a2);bFamily2:b;d1CA;0B@Family1:(a1;a2);bFamily2:c;b1CA;0B@Family1:a;(b1;b2)Family2:a;d1CA;0B@Family1:a;(b1;b2)Family2:c;a1CA;0B@Family1:a;dFamily2:(c1;c2);d1CA;0B@Family1:d;bFamily2:(c1;c2);d1CA;0B@Family1:a;cFamily2:c;(d1;d2)1CA;0B@Family1:c;bFamily2:c;(d1;d2)1CA:

PAGE 61

Nowwetake0B@(a1;a2);bb;d1CA

PAGE 62

Theresultsshowthatforallthe896casesofthiscategory,noneoftheaboveequa-tionsissatised.Sothecovarianceforthese896casesareall0.

PAGE 64

Group1a1=b;a2=b;c1=b;c2=b: 64

PAGE 65

Group1:c1=b;c2=b;a1=d;a2=d:

PAGE 66

Group1:a1=d;a2=d;b1=d;b2=d:

PAGE 67

Theprogramresultsshowthatinourdataset27out28casesinthiscategorydonothaveanyequationsholdingineithergroup,thatis,thesecasesarehalfsiblings.Onlyforthecasesoffamily65andfamily68,additionally,oneequationholdsfromgroup1. 67

PAGE 70

70

PAGE 71

Intheabove,wefoundthatinanyfamily,thetwoparentsarenotrelatedandstandardprocedurecanbeusedtocalculatethevarianceforeachfamily.Intermsofcovariance,incategory1,all507caseshavezerocovariance.Forcategory2,all88casesarehalfsiblings.Eightythreeoutof84casesincategory3arehalfsiblingsandwhenthecasecomestothecovariancebetweenfamily38andfamily70,ithasthefollowingrepresentation,0B@a;d(a;c2);d1CA:

PAGE 73

22(YiXiZoioiZmi(j)mi(j)cmij)2 P3`=1exp1 22(YiXiZoioiZmi(j)mi(j)c`mij)2(2{17) wherethereareonly3termsinthedenominatorsumandthismakestheGibbssamplercomputationalfeasible. 73

PAGE 74

2{8 ,butratherthantheonerandomvariablegenerationperiterationthatisusedhere,MonteCarloEMwouldneedthousandsofZgenerationsperiteration,againprecludingcomputationalfeasibility. AsforthepriordistributionfortheSNPgenotypecategories,onealternativeistousetheavailablegenomeinformationortheobservedgenotypefrequenciesaspriorinformation.Foroursituation,wedonothaveasequencedgenomelibraryfortheloblollypine,andwedonotwanttousetheobservedSNPinformationastheprior,soweuseauniformpriorforthemissingSNPs. Inthischapter,theresponsesareassumedtobecontinuous,butthemethodcanbemadetoadapttodiscretecases.Forexample,inacasecontrolstudy,theresponsewouldbeeithercaseorcontrolstatus.Byemployingaprobitmodel,weaddatruncatedlatentvariableintheGibbssamplingcycleandthelatentvariablewouldactastheresponseinourpreviousdatasetexamples.Thismethodcanalsobemodiedtohandlethesituationofamultiplecategoryresponse. Letusassumethattheresponsesyisavectorwithitselementstobeeither0or1forthecasecontrolstudy.Weemployacontinuouslatentvariablelandassumethatlhasamultivariatenormaldistribution.Thenwecanspecifythewholemodelspecicationasthefollows: 74

PAGE 75

(2{18) Thenwecanwritethejointlikelihoodas: Withtheabovejointlikelihood,wecaneasilywritetheconditionaldistributionsasbefore,exceptthatnowwehaveconditionaldistributionoflgivenotherparametersandy.TheconditionalofliisliN(Xi+Zi;2)truncatedattheleftby0,ify1=1liN(Xi+Zi;2)truncatedattherightby0,ify1=0: FortheSNPeect,weuse[1;0][a;d]TtodenotetheeectofSNP\AA",[0;1][a;d]TtodenotetheeectofSNP\AC",and[1;0][a;d]TtodenotetheeectofSNP\CC". 75

PAGE 76

Forpopulationassociation,muchresearchworkhasbeenfocusedonusinghaplotypes.However,intheplant,orotheragriculturegenomeassociations,itisoftenthecasethatthewholegenomeisnotfullysequencedandnotmuchpriorlinkagedisequilibriuminformationisavailable.Block-wisehaplotypesorclustersofhaplotypesareimpossibletoconstruct.Theproposedmethodhaswideapplicationfortheeldofplantoragricultureassociationwithoutconstructinghaplotypes,andithasreasonablecalculationspeed.Basedonoursimulations,theproposedmethodcanadequatelyidentifythesignicantSNPs. 76

PAGE 77

Tobemoregeneral,therearesvariablestobeconsideredinthedataset.Whensisnite,andthevaluesofthecovariatesintheexperimentareobserved,muchprogresshasbeenmade.ThereviewoftheprogresshasbeendetailedinChapter1.However,forthesituationthatthecandidatevariableshaveacertainpercentageofmissingvalues,itisanovelresearcheldandnotmuchworkhasbeendone. Foreachcandidatevariable,therearetwochoices,eitherbeingincludedinthe\goodmodel"ornot.Sowithscandidatevariables,thereare2smodelchoicesinthemodelsamplespace.Whenthenumbersismoderatelylarge,thenumber2scouldbehuge.Typicallywhenthenumbersisbiggerthan15,itisunrealistictocalculateandcompareallthe2sposteriormodelprobabilities.Forthiscase,afeasiblemethodistodoastochasticsearch.Intheory,thestochasticsearchalgorithmhasthechancetosearchallthesamplespace.Thisstochasticsearchchainisexpectedtostaylongerwithmodelshavinghighposteriorprobabilities,andoccasionally,itvisitsmodelswithlowposteriorprobabilities.Bydoingthat,itavoidsgettingstuckinlocalmodesandhaschancetovisitthemodelsamplespaceglobally. Thefollowingisourplanforthemodelselection: 77

PAGE 78

WewillestimatetheBayesfactorsusingsamplesfromtheparallelGibbssampling. ThestochasticsearchchainisdrivenbyestimatedBayesfactors. NowletusintroducetheBayefactor.FortwomodelsMandM1withparametersand1,theBayesfactorisdenedas wherep(jM)istheparameterdistributionformodelMandp(1jM1)istheparame-terdistributionformodelM1.Obviouslyp(Yj;M)istheprobabilitydistributionundermodelMandp(Yj1;M1)istheprobabilitydistributionundermodelM1.Hereand1arescalars,buttheycanbeextendedtobeparametervectors.Theconstantm(Y)isthemarginallikelihoodformodelMandtheconstantm=1(Y)isthemarginallikelihoodformodelM1. Inthischapter,rstwewilldiscusshowtoestimatetheBayesfactorsasnoclosedformexists.ThenwewillproposeahybridstochasticsearchalgorithmtosearchthesamplespaceofthemodelsandwewillshowthattheergodicpropertiesofthederivedMarkovchain.Computationspeedisalwaysaconcernofvariableselection,sonallywescaletheproceduresuptohandlelargedatasets. 78

PAGE 79

(3{2) whereIG(a;b)representsinvertedGammadistributionwiththefollowingparameteriza-tionf(2)=ba 2g IfweuseM1;:::;M2storepresentallthemodelsinthemodelsamplespace,thegoalofthisprojectistoselectgoodmodelsfromthesecandidatemodels.Foreachoneofthemodelsinthemodelsamplespace,thereisacorrespondingindicatorvectorandeachmodelisdenedbyafamilyofdistributionswithparameterstobe;;;2;2.ForanytwodierentmodelsMjandMj0,theysharethesameparameterizationsof;2;2,exceptforadierentindicatorvector,thusalso. Severalthingstobenoticed: 79

PAGE 80

AswearedoingBayesianvariableselection,weuseBayesfactorasacriteriontoselectmodels.ThedicultpartofthisprojectisthatweareselectingsubsetswithdierentdimensionsandthemissingvaluesexistinthecovariatesofSNPs.Toovercometheproblem,weproposedaBayesfactorapproximationformulawhichtakescareofthedierentdimensionsandthemissingvalues.WeplantoemployastochasticsearchalgorithmtosearchthegoodsubsetsaccordingtotheBayesfactors.Foreachcandidatesubset,wecalculatetheBayesfactorforthecandidatesubsetversusthefullmodel.Herethefullmodelactsasthereferencemodel.DependingonthevalueofthecurrentBayesfactorandthepreviousBayesfactor,wedecidewhethertakethecandidatesubsetasthecurrentsubsetornot.SothefullmodelactsasareferencemodelwhencalculatingtheBayesfactorsintheaboveplan.Ontheotherhand,wemayusethesimplestmodelasareferencemodelwhencalculatingtheBayesfactors,whichdoesnotincludeanyvariablesinit.Wewillshowthatusingthesimplestmodeldoesnotgainanythingforusintermsofcomputation.Morediscussionofusingthesimplestmodelasareferencemodelwillbegivenlater. BeforegoingtothecalculationofBayesfactor,wewillintroducebridgesamplingproposedby MengandWong(1996 ). 80

PAGE 81

3{1 ,wehaveBF;=1=m(Y)

PAGE 82

Whenhavesamplesofparameterfromthedistributionp2(),wecanestimatetheBayesfactorbyXq1((i)) Foroursituation,wegenerallyconsidertwomodelswithdierentdimensions,andwecannotdirectlyapplythisformulaasitassumesthesameparameterizationforthetwomodels.WewillproposeabridgesamplingtypeofBayesfactorestimation,whichaccommodatesmodelswithdierentdimensionsandmissingvaluesaswell.NextwewilldetailtheBayesfactorcalculationforoursituation.FirstweshowthataspecialfunctiongisneededforthepropercalculationofBayesfactor.Thenwewillshowwhatthisfunctioncouldbeforoursituation. SupposeaistheparameterformodelMaand(a;b)istheparametersformodelMb.ThelikelihoodformodelMaisfa(Yja)andthelikelihoodformodelMbisfb(Yja;b):Leta(a)andb(a;b)denotethepriorsformodelMaandmodelMb. 82

PAGE 83

NowletusprovetheTheorem 3.2.1

PAGE 84

=ZZa(a)g(a;b) 3{6 holding,then1 84

PAGE 85

3{4 ,agfunctionisneededandthemodiedbridgesamplingestimatoris 1 3{8 ,wemustndagfunctionthatsatisesthefollowingequation. 3{9 andcontainsafactorP(Zlmis=zlmis):Alsoiscomposedofandl.ThevectorparameterizestheSNPswhichareincludedinmodelMandlparameterizestheSNPswhichareexcludedinmodelM.ThematrixZiscomposedofZandZl,whereZlisthedesignmatrixofSNPswhicharenotincludedinmodelMandZlmisrepresentsthemissingSNPswithinZl.ThedistributionP(Zlmis=zlmis),forthediscreterandomvectorZlmis,couldbeanylegitimatediscretedistributionanditistakentobetheuniformdistribution,whichisthesameasthepriorforZlmisinmodelM1. 85

PAGE 86

3.2.1 .ApplyingTheorem 3.2.1 ,ais(;;2;2Zmis),andbis(Zlmis;).Thelikelihoodfor(a;b)isfb(a;b)=f1(Yj;;2;Zmis); 3.2.1 ,thefollowingequationmusthold:XZlmisZf1(Yj;;2;2;Zmis)g(;;2;2;Zmis)dl=f(Yj;;2;2;Zmis): 3{8 ) 1 86

PAGE 87

3{9 andfurtherwehaveaconsistentBayesfactorestimator.Thegis 2 22P(Zlmis=zlmis); 2exp(YX(i)Z(i)(i))0P(i)l(YX(i)Z(i)(i)) 22(i) NoticeP(Zlmis=zlmis)couldbeanylegitimatedistribution,anditistakenastheuni-formpriorinitssamplespace,whichisthesamepriorforZlmisin1(;;2;2;Zmis;Zlmis).Soduringthecalculation,theP(Zlmis=zlmis)withinthegfunctioniscanceledwiththepriorforZlmisin1(;;2;2;Zmis;Zlmis). NowwewanttoshowwhyEquation 3{12 isaconsistentestimator.Wehavetwomodels,MandM1,andthefollowingarethedetailsofaBayesianspecicationforthesetwomodels. 87

PAGE 88

(3{13) ForM1,themodelspecicationisdescribedin 3{2 anditisalmostthesameas 3{13 ,exceptthatthedesignmatrixforSNPsischangedfromZtoZasthemodelindicatorvectorischangedfromto=1.Correspondingly,ischangedto. Forbothmodels,thedesignmatrixforSNPs,ZandZ,containmissingSNPsandwedonotexplicitlymentionthatrightnow.ZisamatrixconcatenatedbythecolumnsofZandZl.ThemathematicalformulaZ=Z+Zldoesnothold,oneobviousreasonforthatisthenumberofcolumnsofZequalstothesumofthenumbersofcolumnsofmatrixZandZl.Thevector,,andlhavesimilarrelationship. Sameasbefore,weuse(;;2;2;Zmis)asthepriorformodelMandlet1(;;2;2;Zmis)asthepriorformodelM1. 88

PAGE 89

3{9 inCorollary 3.2.1 ,wehave: =XZlmisZg(;;2;2;Zmis)1 (22)n (22)n (22)n 3{14 ,wewillhave =XZlmis8<:expjYXZj2 2exp(YXZ)0Zl(Z0lZl)1Z0l(YXZ) 22;

PAGE 90

3{11 ,thatis, 2exp(YXZ)0Pl(YXZ) 22P(Zlmis=zlmis); (3{16) =expjYXZj2 90

PAGE 91

3{9 issatised. WetakethepriorforMtobe(;;2;2;Zmis)=1expjj2 2) (2)a+1dc 2) (2)c+1P(Zmis=zmis); 2) (2)a+1dc 2) (2)c+1P(Zmis=zmis)P(Zlmis=zlmis); Withtheabovespecication,ifwehavesamples((i);(i);2(i);2(i);Z(i)mis)frommodelM1,andfollowtheEquation 3{10 ,wecanconsistentlyestimatetheBayesfactorasthefollowing:1 2exp(YX(i)Z(i)(i))0P(i)l(YX(i)Z(i)(i)) 22(i) 91

PAGE 92

3{9 ,wecanuse1 Wewillshowthatthefollowingg(;;Zmis;2;2)alsosatisestheEquation 3{9 22P(Zlmis=zlmis); 2+Z0lZl: 3.2.1 ,weneedshowthatEquation 3{9 issatised.TheleftsideofEquation 3{9 isequalto: 22expjYXZj2 92

PAGE 93

3{18 andwehave =XZlmis(22)sl=2jRj1=2exp(YXZ)0ZlR1Z0l(YXZ) 22(22)sl=2jRj1=2exp(YXZ)0ZlR1Z0l(YXZ) 22exp(jYXZj2) (22)2=nP(Zlmis=zlmis) Afterthecancelation,wewillhave: (22)2=nP(Zlmis=zlmis) (3{20) =exp(jYXZj2) (22)2=nXZlmisP(Zlmis=zlmis)=f(Yj;;Zmis;2): 3{17 andEquation 3{11 aretwodierentgfunctionsandtheybothsatisfytheEquation 3{9 .Thatshowsthegfunctionisnotunique.However,basicallywecanndagfunctionbyfollowingthedirections:

PAGE 94

3{9 Solookingback,forthegfunctioninEquation 3{17 ,thecorrespondinggcandfunctionisgcand=expjj2 3{17 .AsfortheEquation 3{11 ,wejustletthegcandfunctionbe1,andfollowthesameprocedure:Calculatethehfunction,forthiscaseh=RRf1(;;Zmis;2;2)1dldZlmis furthergettheg=f(;;Zmis;2;2) 3{11 3.1 ,westatedthatweplantorunahybridM-Hchaintosearchthegoodsubsetsofvariablesaccordingtotheposteriorprobabilitiesofmodels.LaterwewillshowthatthisisequivalenttosearchingforthegoodsubsetsofvariablesaccordingtotheBayesfactors.Intheabovesubsections,wediscussedtheBayesfactorestimationforasubmodelversusthefullmodel.Sowhencomparinganytwosubmodels,theyareimplicitlycomparedwitheachotherthroughtheratioofBayesfactorsofsubmodelsversusthefullmodel.Forexample,ifweconsidertwosubmodelswithSNPindicators1and2,theratioofBayesfactors,BF1;=1

PAGE 95

Casella,Giron,Martinez,andMoreno(Casellaetal. )discussedthattherearetwowaystodoBaysianmodelselectionwhenusingintrinsicpriors:encompassingthemodelsfromabove,whichmeanscomparingallthemodelstothefullmodel;encompassingthemodelsfrombelow,thatis,comparingallthemodelstothesimplestmodel.Theyshowedwhenthenumberofsubsetsarenite,thesetwomethodswillessentiallygiveequivalentresults. Inthefollowing,wewillgivedetailsoftheBayesfactorapproximationwhenthesimplestmodelisusedasareference.Also,wewillprovidesomeexplanationsastowhywechoosetousethefullmodelasareference.SofortheBayesfactorofsubmodelversesthesimplestmodel,weareinterestedincalculatingBF;=0,where=0meanstheassociatedmodelhasnovariablesinit,alsoasthesimplestmodel.Byusingsimilarapproximationmethodsasabove,weareabletondtheBayesfactorapproximationforsmallmodelversusbigmodel,andalsoasBF;=0=1 NowthemodelM0hasthelikelihoodf0(Yj;2)andthemodelMhasthelikeli-hoodf(Yj;;2;2;Zmis). Nextwealsoshowhowwendthegfunctionforthissituation.ApplyingTheorem 3.2.1 ,theais(;2;2)andbis(;Zmis).Fortheequation( 3{6 )tobesatisedweneedZfb(Yja;b)g(a;b)db=fa(Yja):

PAGE 96

2expjYXj2 WewillndthegfunctionfromtheaboveEquation 3{21 3{22 ,thenwewillhave 2exp(YX)0Z(Z0Z)1Z0(YX) 22g(;2;;2;Zmis)expjYXj2 (3{22) =(22)s 22P(Zlmis=zlmis);

PAGE 97

Inabove,weshowedthattheEquation 3{21 holdsandwefoundthegfunctioninEquation 3{22 .Tomakethismethodwork,weneedsamples((i);(i);2(i);2(i);Zmis(i))fromthefullmodelM1whichhasalikelihoodof(Yj;;2;2;Zmis).AsweareplanningtorunaGibbssamplerforthefullmodelwhichcontainsalltheSNPs,fromtherewewillgetsamplesof((i);(i);2(i);2(i);Zmis(i))andithasthecorrectmarginalizedjointdistribution,accordingtoTheorem10:6:of RobertandCasella(2004 ). SowecanapproximatetheBayesfactorBF=0;by1 22(i) 2(i) wherenisthenumberofpairsofsamplesfromtheGibbssamplingandP(i)=Z(i)Z(i)0Z(i)1Z(i)0: 3{23 ,whichusethesimplestmodelasthereferencemodel,andEquation 3{12 ,whichusethefullmodelasthereference 97

PAGE 98

AccordingtothedenitionofBayesfactorinEquation 3{1 ,fortwomodelsMandM1withparametersand1,theBayesfactorisdenedasBF;=1=m(Y) LetMdenotethemodelY=X+Z+";whereisavectorwithlengthequaltosum(),andtheelementsofthisvectorarecomposedoftheelementsofvectorwiththecorrespondingelementofindicatorvectortobe1.ZistheisthematrixwithcolumnsbeingtakenfromtheZmatrixaccordingtothecorrespondingindicatorvector. Intheabovemodelspecication,Z=Z(obs);Z(mis);andforsimplicitywejustwriteZ.Weletstobethetotalnumberofelementsinequalto1.ThejointlikelihoodunderthemodelspecicationofMis:

PAGE 99

expjYXZj2 =expjX(YZ)j2 2expjYZj2 3{25 andget: 2 22expjj2 exp(YZ)0P(YZ) 22expjj2 =exp0(Z0PZ+I 2) 3{27 andget: 99

PAGE 100

2(22)s 21 2expY0Z(Z0PZ+I 2)1Z0Y 3{29 alittlebitandwewillhave: 2jP;2j1 2 2. Thenextstepistointegrate2out,andasweknowtheprioris:(2)=ba 2 2jP;2j1 2 (a)Y0PY 100

PAGE 101

OurplanistosetupaMarkovchainwhichisdrivenbyBayesfactorssuchthatthestationarydistributionoftheMarkovchainisproportionaltothedistributionof,inotherwords,thedistributionofmodelsinthemodelsamplespace. Therearesvariablesinthedata,inourcase,sSNPs,sothereare2smodelsinthemodelspace.Eachofthe2smodelshasanassociatedindexvectori;i=1;;2s.Foramodelassociatedwithi;;ithasmarginallikelihoodmi(X)=Zf(Xji)(i)di;whereiistheparametervectorofmodeli.LetB(i)denotetheprobabilityofmodeliinthemodelsamplespace,thentheprobabilityofmodeliisB(i)=mi(X) 101

PAGE 102

1. Supposethecurrentstateis1,andwewilluseamixturedistributiontogeneratethecandidate2.Themixturedistributionis:withprobabilityp,drawasampleusingarandomwalkalgorithm;withprobability(1p),drawasamplefromtheuniformdistributiononthesamplespace.Thedrawsfromtheuniformsamplespacearei.i.d.andthetuningparameterpwillbesetas0:75toinsurerandomwalkmostofthetimeandalsohavingtheabilitytojumpoutoflocalmodes. 2. UsingthesamplesfromthepreviousGibbssampler,calculatetheapproximateBayesfactorsforthecandidatestateandcalculatetheacceptanceprobability(1;2). 3. DrawufromuniformdistributionU(0;1),ifu<(1;2)take2asthenextstate,otherwisestayat1: Repeatthesteps,untilthechainarrivesitsstationarydistribution. Theaboveisthechainwedesigned.Butinactualcomputation,thestepsarealittlebitdierent.WeruntheGibbssamplerrst.ThenwiththesamplesfromtheGibbschain,weruntheM-Hchaintondgoodsubsets. Weplantousethefollowingcandidaterandomwalkdistribution.Supposethecurrentchainisatstate1.Randomlychooseonenumberrfromtheuniformdiscretedistribution(1;2;:::;s),andiptherthelementin1from1to0orfrom0to1,andatthesametimekeepallotherelementsin1untouched.Denotethenewcandidatesamplefromrandomwalkas02.Let002denotethecandidatestatefromtheotherpartofthemixturedistribution,theuniformdistributioninthesamplespace.Eachelementof002isindependentandrandomlytakenaseither0or1. 102

PAGE 103

2s: 2s: 2s=(1p)Xk=1;:::;2sI(1=001k)1 2s=(1p) 2s s:justoneelementdierence Sotheaboveargumentsays,foranystateof1and2,thefollowingalwaysholdsq(2j1)=q(1j2):

PAGE 104

UsingsamplesfromtheGibbssamplerandapplyingTheorem 3.2.1 ,wewillhave\BF(n)2;=1asaconsistentBayesfactorestimateforBF(2;=1)andnisthenumberofpairsofsamplersfromtheGibbssampler.WefurtherdenotetheratioofestimatedBayesfactoras\BF(n)2;=1 ^n(1;2)=min8<:\BF(n)2;=1 RobertandCasella(2004 )Thetheoremstates: RobertandCasella(2004 ),ifY(t)isergodic,thenthedistributiongisastationarydistributionforthechainY(t)andfisthelimitingdistributionofthesubchainX(t).HeregisthefulljointdistributionforX. RobertandCasella(2004 )isthefollowingalgorithm: 104

PAGE 105

whereg1(y1jy(t)2;;y(t)p);;gp(ypjy(t+1)1;;y(t+1)p1)aretheconditionaldistributions. Oursub-chainfromtheGibbssamplersatisestheregularconditionsandisHarris-recurrent,aperiodic,soitisergodic.Furthermore,bytheergodictheorem,wewillhave\BF(n)(2;=1) 105

PAGE 106

Now,letusshowitsatisesthedetailedbalancecondition.AsthetransitionkernelassociatedwiththisMarkovchainisK(1;2)=(1;2)q(2j1)I(26=1)+(1r(1))I(1=2); Theabovesecondequationisobvious,when1=2,theleftsideisequaltotherightside.When16=2,bothsidesequalto0.Asweshowedbeforethatq(2j1)=q(1j2),fortherstequation,byapplyingtheequation 3{32 ,weneedtoverifythatmin8<:\BF(n)2;=1

PAGE 107

If\BF(n)2;=1<\BF(n)1;=1,theleftsideofEquation 3{33 isequalto\BF(n)2;=1 3{33 isequalto\BF(n)2;=1 3{33 isequaltotheleftsideof( 3{33 ).Followthesamerational,Equation 3{33 holdswhen\BF(n)2;=1>\BF(n)1;=1: 3{33 issatised.ThenbyusingtheTheorem6:46of RobertandCasella(2004 ),^B(n)isthestationarydistributionofthischain. IntheaboveweshowedthattheempiricalM-HMarkovchainconvergesuniformlytothedistribution^B(n)whennisxed.However,ourultimategoalistoshowthischainconvergestothedistributionB(i).Thatis, 107

PAGE 108

3.3 .Theempiricalchainconvergestothetargetstationarydistributionwithergodicproperty. 3{34 2XkK(t)^Bn(0;)^Bn()k+1 2Xk^Bn()B()k: 2Xk^Bn()B()k!0;asn!1:(3{37) 3{35 wemustshow1 2XkK(t)^Bn(0;)^Bn()k!0;asn;t!1:

PAGE 109

MeynandTweedie(2008 ),weknowthatwithnxed,wehavekK(t)^Bn(0;)^Bn()k>kK(t+1)^Bn(0;)^Bn()k: 2XkK(t)^Bn(0;)^Bn()k<: 2XkK(t)^Bn(0;)^Bn()k!0:(3{38) CombiningEquation 3{37 andEquation 3{38 wehave TheaboveproofshowsthatourM-Hchainwillconvergetothetargetstationarydistributionwiththeergodicproperty. 109

PAGE 110

Fromtheabovesections,weknowthatifwehavesamples((i);(i);2(i);2(i);Z(i)mis)generatedbyGibbssamplingfrommodelM1,thereferencemodel,alsothefullmodel,wecanapproximatetheBayesfactorby:1 2exp(YX(i)Z(i)(i))0P(i)l(YX(i)Z(i)(i)) 22(i) Intheaboveformula,foroneBayesfactorestimation,weneedtorepeatntimesofZ(i)l0Z(i)landntimesof(Z(i)l0Z(i)l)1.Asweknowthedeterminantcalculationiscloselyrelatedtotheinversecalculationandiftheinverseisavailable,thedeterminantcalculationismucheasier.Sowewanttotakeadvantagesofthispropertywithoutdirectlycalculatingthedeterminants. Inthissection,wewilltalkabouttwomethodstospeedupthecomputation.Firstwewillintroduceamethodtoupdatetheinversebasedoninversecalculationfromthelaststepwithoutcalculatingtheinversefromthebeginning.Second,wewilltalkaboutdirectreplacementofZ(i)withZ,andthejusticationofdoingthis. 3.4.1.1TwocolumnsofparametersforonecolumnofSNPs 110

PAGE 111

AnotherplacewheremostcalculationtimeisspentistondthejZ(i)l0Z(i)ljand(Z(i)l0Z(i)l)1.Rightnowwearedealingwithadatasetwith44SNPsandifwehavesum(1)=30,thenthereare30SNPsnotincludedinthesubmodel,tocalculateoneBayesfactor,weneedtodontimesmatrixinversionwithdimension6060,aseachcolumnofSNPsuses2columnsforcoding. Now,wewilldiscussthedierencebetweenthejustupdatedmatrixZ(i)andthepreviousdesignmatrixZ(i1).ThenonemethodofcalculatingtheinverseofmatrixZ(i)basedontheinverseofmatrixZ(i1)willbegiven.Thesediscussionwillgiveusaavorofthenalmethodweemploytocalculatetheinversematrix. SupposeinlaststepwehavematrixZ(0)=[Z1;:::;Zi;Zi+1;:::;Z2s],andinthecurrentstepthejthandthe(j+1)thcolumnofZ(0)willbeupdatedto_Zjand_Zj+1,asforeachiterationonlyonecolumnofSNPsareupdated,andweuse2columnsofthe 111

PAGE 112

Wecanwrite anddoingthemultiplicationwehave

PAGE 113

WealreadyhaveZ0(0)Z(0)1,thatis,haveA1fromlaststepandnoticerank(Ha)=2andrank(Hb)=2.Wecanwritethefollowing where 113

PAGE 114

Miller(1981 ),ifHisarank2matrix,wehavethefollowingformula (I+H)1=I1 wherea=1+tr(H)and2b=tr(H)2tr(H2): Thentheproblembecomestond(I+A1Ha)1andI+I+A1Ha1A1Hb1:SupposeweareabletocalculateC=(I+A1Ha)1;thentheproblemistond(I+CA1Hb)1.Thenwewillhave ThequestionboilsdowntocalculatingCand(I+CA1Hb)1.Forbothproblems,wecanapplyEquation 3{43 tosolveit. AsforcalculatingA11,wewillhavethefollowingsub-steps: 1. CalculateHaandHb 114

PAGE 115

ComparisonoftimespentoninversecalculationusingstandardsoftwareandMiller'smethodwith2columnsand2rowsupdated. Matrixdimension:1001005005001000100020002000 Matlab'sdirectcalculation0.000s0.235s1.60911.985s Miller'smethod0.016s1.454s11.937s99.703s 2, calculateB1=(I+D)1C; nally,A11=B1A1: 3-1 recordsthetimespentonmatrixinversionwithdierentmethods. Tospeedupthecomputation,wechangethecodingofSNPsfrom2parametersperSNPto1parameterperSNP.NowweassumethattheeectofaSNPislineartothenumberofcopiesofcertainallelewithinthatSNP,insteadofparameterizingbyadditiveanddominanteect.Bychangingthecodingmethod,wedecreasethestepsneededforoneBayesfactortoonehalf,sincenowthereisonlyonecolumnupdateinZ(i)insteadof2columns. Asmentionedintheabove,wedecidetocodetheSNPsinanotherway:usingonecolumnofparametersforonecolumnofSNPs.Inthatway,thedesignmatrixfortheSNPsrecordsthenumberofcopyofalleleinthatSNP.Forexample,weusetorepresenttheallele\C"'seect.ThenforanindividualwithSNPgenotype\CC",theSNPeect 115

PAGE 116

Thefollowingmethodisdirectlydueto SadighiandKalra(1988 )anditisanapplicationoftheSherman-Morrison-Woodburyformula;see Woodbury(1950 )and Bartlett(1951 ).AreviewpaperabouttheSherman-Morrison-Woodburyformulaiswrittenby Hager(1989 ). SupposewehavetheinverseofamatrixA,A1,andweupdatethejthcolumnofAtobev.Let(A)jdenotethejthcolumnofA.Letejtobeavectorofallzeros,exceptthejthelementtobe1.AlsoletA1tobetheupdatedmatrix.TondA11=(A+(v(A)j)eTj)1,thefollowingstepsareneeded: 1+(b)jb,where(b)jisthejthelementofvectorb. Nextwewillshowwhytheabovemethodworks.FirstwriteA1=AB;

PAGE 117

(3{46) Furthermore, 1+bj0000bj+1 SimilarlyifthejthrowofmatrixA1,A(j)1,isupdatedtobe,andA2istheupdatedmatrix.Thenc=(A(j)1)A11;C=(I+ejcT); (c)jejc;where(c)jisthejthelementofvectorcand (c)jejc)=A11A1 117

PAGE 118

Comparisonoftimespentincalculationofmatrixinversionfordierentmethods. Matrixdimension:500500100010001500150020002000 Matlab'sdirectcalculation0.266s1.547s5.01611.703s Miller'smethod0.031s0.188s0.32s0.719s Sanighi&Kalra'sdirectcalculation0.375s2.735s9.453s21.657s Sanighi&Kalra'ssimplication0.032s0.11s0.235s0.531s Foroursituation,whenweuseoneparameterforoneSNP,ineachGibbssampleriteration,wejustupdatethejthcolumnandjthrowofthematrixZ0Zandkeeptheotherpartsuntouched.Nextwewanttoshowatimetablecomparisonforthecalculationofmatrixinversionwithdierentdimensions.ForSanighi&Kalra'sdirectcalculationisIjustfollowtheirformulaandletMatlabcalculatethematrixmultiplicationdirectly.Sanighi&Kalra'ssimplicationmeanstodirectlyspecifythecellsandvectorswithoutmatrixmultiplication.Obviously,wewilltakethelastmethod,anditis20timesfasterthanthedefaultcalculationofMatlab.WedidthetestinRandtheresultsaresimilar. Wewrite whereejisthevectorwithonlythejthelementtobe1andallotherstobe0.Asshownbefore,wecanwriteA1=AB;withB=I+beTj;

PAGE 119

Soforoursituation,jA1j=jAjjBj=jAjjI+beTjj: 119

PAGE 120

Similarly,forallotherrowsintheabovematrixwecandothesimilaroperations.Forexample,wecanmultiply(bj 120

PAGE 121

Sowehave wherebj=(A1)(j)(vAj)and(A1)(j)isthejthrowofA1. WhenweupdatethekthrowofmatrixA1tobeuT,wewrite withC=(I+ekcT);andc=(uA(k)1)A11:

PAGE 122

Finally withckandbjdenedasabove. InthepreviousworkweshowedthatthefollowingformulacanbeusedfortheBayesfactorapproximation. 1 2exp(YX(i)Z(i)(i))0P(i)l(YX(i)Z(i)(i)) 22(i) whereallthesamplesarefromtheposteriordistribution1(;;2;2;ZmisjY): 122

PAGE 123

IftheZ(i)=(Z(i);Z(i)l)matrixdoesnotchangewithi,thenforeachBayesfactorestimation,insteadofcalculatingthedeterminantandmatrixinversentimes,wejustneedtodoonetimematrixdeterminantandmatrixinverse.Themethodproposedhereistoreplaceallthesamplesof(Z(i);Z(i)l)withtheexpectation(EZ;EZl),EZ.InthefollowingwejustifythereplacementwithEZ. OriginallywewanttocalculatetheBayesfactorBF;=1andweshowedthatitisequalto 2exp(YXZ)0Pl(YXZ) 22 2exp(YXZ)0Pl(YXZ) 22 123

PAGE 124

heretheintegralisimplicitlywithrespecttothemissingdatainZ.Asmentionedintheabove,wewanttoapproximateEquation 3{60 by RewritetheaboveintegralsasZZh(;Z)f(;Z)ddZ=ZZh(;Z)f(Zj)dZf()d; 3{61 ,weneedtocheckthefollowing: Further,ifwecanshowthatR(h(;Z)f(Zj)dZRh(;EZ)f(Zj)dZisclosetozero,itissucienttosaythatEquation 3{62 isclosetozerounderverygeneralconditions.Now 124

PAGE 125

h(;Z)h(;EZ) (3{63) 2(ZEZ)0@2[h(;Z)] 2(ZEZ)0@2[h(;Z)] Sowecanwrite =@h(;Z) 2(ZEZ)T@2[h(;Z)] 3{64 ,thenwecansayZh(;Z)f(Zj)dZZh(;EZ)f(Zj)dZ0:

PAGE 126

WedidsomenumericalsimulationtocomparetheBayesfactorcalculations.Inthissimulateddataset,thereare15SNPs20familiesand800observations. Table 3-3 recordsthedierentmodelindicatorsusedforthesimulationstudiesinTable 3-4 and 3-5 .Weletthesvarysothattheycouldrepresentthepossibledierentmodelindicatorsthatmightoccur. Table3-3. RecordsofsubsetsindicatorswithactualvaluesofforTable 3-4 andTable 3-5 IndicatorvectorActualvaluesof:201130:1000013003 Table 3-4 showsthecalculationresultsofBayesfactorsbyusingdierentformulaofBayesfactorapproximation.TheupperpartofthetableusessamplesfromtheGibbssamplerwhereonlyonecolumnofmissingSNPsareupdatedpercyclewhilethebottompartofthetableusessamplesfromtheGibbssamplerwhereallthemissingSNPsareupdatedinonecycle.TherstcolumnofTable 3-4 istheindicatorvectorandtheactual 126

PAGE 127

Bayesfactorcalculationapproximation.Userst20000iterationasburninandtakethefollowing400samplesforthecalculation. sampleonecolumnofSNPperiterationdierentapplyingformula( 3{57 )ZbaroriginalZ sampleallSNPstogetherinonecycledierentapplyingformula( 3{57 )ZbaroriginalZ valuesofarefromTable 3-3 .ThesecondcolumnistheBayesfactorestimatesusingtheformulaweoriginallyplannedtouseanditensurestheBayesfactorestimatesareconsistent.ThethirdcolumnistheBayesfactorestimates,wherethemissingSNPsarereplacedwiththeaveragesofimputedvaluesZ.ThefourthcolumnistheBayesfactor 127

PAGE 128

Table 3-5 againaretheBayesfactorestimatesusingdierentmethodsforthemissingvalues.Thesecondcolumnassumesthereisnomissingvaluesinthedataset.thatis,Ididnotgeneratethemissingdataforthisanalysis.Inthethirdandfourthcolumns,Iassignthemissingvalueseitherall1orall1totesttheeectoftheextremehandlingofmissingvalues.TheupperpartofthetableusingsamplesfromtheGibbssamplerwherethemissingSNPswereimputed1columnpercyclewhilethebottompartofthetablearetheestimatesoftheBayesfactorswhereallthemissingSNPswereimputedineachcycle.Itshows,theseextremehandlingofmissingSNPsdonotgiveproperestimatesofBayesfactors.CombiningTable 3-4 andTable 3-5 ,weconcludethatusingZgivesproperestimatesandyetitprovidesfastcalculation. 3.5.1Simulation Inthesesimulationstudies,wehave15SNPsandthevaluesoftheSNPsarelistedinTable 3-6 .10%ofmissingSNPsweregeneratedatrandom.istheparameterfortheSNPeectand1isforSNP1,2isforSNP2,etc. 128

PAGE 129

3-6 representthedierentmodelsthatthestochasticchainvisited.Anymodelthathaslessthan1%frequencyofvisitsisnotlistedinthetable.FromtherstcolumnofTable 3-6 ,wecouldseethattworowshavehigherfrequenciesofvisitsthanallotherrows.Obviously,thestochasticchainspends25%ofitstimeonamodelwhichcapturesallthesignicantvariables,andthechainspendsaboutonehalfofitstimeonamodelwhichcapturesallthesignicantvariablesplusSNP7,SNP8,andSNP8.TheeectsofSNP7,SNP8,andSNP8arenotasbigasothervariablesintherealmodel,buttheireectsarenotinsignicantaccordingthecriterionofBayesfactor.Onethingthatneedstobementionedisthatthestochasticchainonlyspentabout4%ofitstimeonthemodelwhichincludesallthevariables,whichisadesirablepropertysinceourgoalistodiscoversimplesubsetswhichrepresenttheessentialcombinationsofvariablesinsteadoftakingthecomplicatedmodelssuchasthefullmodel. 3-7 arethefrequenciesofvisitsthechainspentondierentmodels.Asthemodelsamplespaceishuge,244,itisverylikelythatoncethechainleavesthevisitedmodel,itnevercomesbacktothepreviousmodel.Inthetablearelistedseveralmodelsthatthechainspentmoretimeonthanothermodelsthatarenotlisted.Althoughnoneofthefrequenciesofvisitsaresubstantiallyhigherthanothers,theanalysisprovidessomesubsetsofvariablestobefurtherinvestigatedat.OnethingtonoticeisthatwhenweaveragethevisitsperSNP,severalSNPshavemorethan50%visitfrequencies,andtheseSNPsappearoftenintheselectedmodelstoo.IfweareinterestedininvestigatetheSNPsonebyone,theseSNPsdenitelyaregoodcandidates. 129

PAGE 130

Bayesfactorcalculationcomparisons.Userst40000iterationasburninandtakethefollowing400samplesforthecalculation. sampleonecolumnofSNPperiteration dierentwithoutassignallassignallmissingmissingSNPmissingvaluesonevaluesminusone sampleallSNPstogetherinonecycle dierentwithoutassignallassignallmissingmissingSNPmissingvaluesonevaluesminusone 130

PAGE 131

SimulationresultsofBayesianvariableselectionfor15SNPsand450observations,10%randommissingvalues.UsingtheBayesfactorestimationformula posteriorActualvaluesofthe 0.0036novariables 0.000115 Bayesvariableselectionforlesionlengthdataset.UsingtheaverageofimputedmissingSNPsasifobserved.20000stepsofburninandanother20000stepsassamples. posteriormodelSubsetsofvariablesprobability>=0:29% 0:4%9;10;12;14;15;16;22;30;31;40;43

PAGE 132

InChapter2,weproposedaBayesianhierarchicalmodelforthedatastructure.Wetookadvantageofthepopulationstructurebycalculatinganumeratorrelationshipmatrix,whichquantiesthecovariancebetweentwoindividualloblollypinesinthispopulation.Formissingdata,weimputedthepossiblevaluesaccordingtotheposteriordistributionofthemissingvalues.Thatis,afteradjustingtheobservedvalues,weusedtheposteriordistributionofthemissingvaluesinsteadoftheonemeanvaluesubstitutionormultiplevalueaverages. OnenovelcontributionisweproposedanonstandardGibbssamplerprocedureandprovedthatthisGibbsprocedureensuresthetargetstationarydistribution.Byemployingthisproposedprocedure,thecomputationspeedisdramaticallyincreasedbydecreasingthenumberofimputedcolumnsineachupdatingcycle.ThepowerofthisprocedurewillbemoreobviouswhentherearemoreobservedSNPsinthedatasetandsothisisadesirablepropertysincewewillhavemanymoreSNPstowardtheendofADEPT2project. 132

PAGE 133

Intermsofcomputation,wetookadvantageoftheBayesfactorapproximationformulaandthespecialGibbssamplerupdatingprocedurebyapplyingtheSherman-Morrison-Woodburyformulaforoursituation.Themethoddecreasesthematrixinversion20timesforamatrixwithdimension20002000.Intheendofthedissertation,wealsoproposedtouseZtoreplacetheimputedvaluesforthecalculationofBayesfactors.Weshowedthatitgivesaccurateestimatesandmeanwhilefundamentallyscalesuptheprocedure. Inchapter3,weusedastochasticsearchalgorithmtosearchthegoodgroupofsubsetsandwespendmuchtimeinspeedingupthecomputation.ThebottleneckofthecomputationisduetotheBayesfactorestimates,especiallythemanyinversions 133

PAGE 134

AnotherfutureworktopicistoinvestigateotherpriorspecicationsforourmodelinChapter2andChapter3.Weusedtheconjugatepriorsandthatensurestheconditionalstobestandarddistributions.Wewouldliketoseethedierenceofthetheoreticalresultsandthedierenceofrealdataanalysiswhendierentpriorsareapplied,suchasgpriors,orintrinsicpriors. 134

PAGE 135

SupposewearerunningaGibbssamplerinwhichonecycleconsistsofupdating(X;Y;Z),andtheZnpmatrixcanbedecomposedas(Z1;Z2;:::;Zp).StandardGibbssamplingupdatesZnpalltogetherinonecycle,incontrast,wecanjustsystematicallyupdateonecolumnofZateachcycle.ByupdatingonecolumnofZineachiteration,thecomputationtimecandecreasedramatically,especiallywhenthenumberofcolumnsoftheZmatrixisintheorderofhundredsorthousands. Firstletuswritedownourupdatingscheme.Supposewehavestartingvalue(X(0);Y(0);Z(0)1;:::;Z(0)p):

PAGE 136

ToprovethattheGibbssamplerhasthestationarydistribution,weneedtoshowthatthefollowingequationholds: =Z:::ZfX(1)jY(0);Z(0)1;:::;Z(0)pfY(1)jX(1);Z(0)1;:::;Z(0)pfZ(1)1jX(1);Y(1);Z(0)2Z(0)pfX(2)jY(1);Z(1)1;:::;Z(0)pfY(p)jX(p);Z(1)1;:::;Z(1)p1;Z(0)pfZ(1)pjX(p);Y(p);:::;Z(1)(p1)fZ(0)1;:::;Z(0)p;X(0);Y(0)dX(0)dY(0)dZ(0)1;:::;dZ(0)pdX(1)dY(1);:::;dX(p1)dY(p1): A{1 )becomes: =Z:::ZfX(1);Z(0)1;:::;Z0pfY(1)jX(1);Z(0)1;:::;Z(0)pfZ(1)1jX(1);Y(1);Z(0)2:::;Z(0)pfX(2)jY(1);Z(1)1;:::;Z(0)pfY(p)jX(p);Z(1)1;:::;Z(1)p1;Z(0)pfZ(1)pjX(p);Y(p);:::;Z(1)(p1)dZ(0)1;dZ(0)2;:::;dZ(0)pdX(1)dY(1);:::;dX(p1)dY(p1):

PAGE 137

=Z:::ZfX(1);Y(1);Z(0)2:::;Z(0)pfZ(1)1jX(1);Y(1);Z(0)2:::;Z(0)pfX(2)jY(1);Z(1)1;:::;Z(0)p:::fY(p)jX(p);Z(1)1;:::;Z(1)p1;Z(0)pfZ(1)pjX(p);Y(p);:::;Z(1)(p1)dZ(0)2;:::;dZ(0)pdX(1)dY(1);:::;dX(p1)dY(p1): Doingthesimilarintegration,furtherweget: =ZZZfX(p1);Y(p1);Z(1)1;:::;Z(1)(p1);Z(0)pfX(p)jY(p1);Z(1)1;:::;Z(1)(p1);Z(0)pfY(p)jX(p);Z(1)1;:::;Z(1)(p1);Z(0)pfZ(1)pjX(p);Y(p);Z(1)1;:::;Z(1)(p1)dX(p1)dY(p1)dZ(0)p:

PAGE 138

Inabove,weshowedthattheGibbssamplerchainupdatingonecolumnpercyclestillpreservesthetargetstationarycondition.Toprovethechainhasergodicityproperty,wecanciteTheorem10:10in RobertandCasella(2004 ).Thetheoremstates: Nowconsiderthemarginaldistributionofthesubvector(X(t);Y(t)).Wewillshowthatthismarginaldistributionsatisesthestationarycondition.Thatis,wewanttoshowthefollowingequationholds. A{6 )anditequalstotheleftsideof( A{6 ).Thefollowingarethecalculations: 138

PAGE 139

139

PAGE 140

Asweknowthesub-numeratorrelationshipmatrixfortherstunrelatedasubjectsistheidentity,nextwewillgivethedetailshowtocalculatetheremainingcellsofthenumeratorrelationshipmatrixfortherelatedsubjects.Considerthejthandtheithsubjectfromtheaboveorderedsubjects. 1. Ifbothparentsofthejthindividualareknown,saygandh,thenRji=Rij=0:5(Rig+Rih);i=1;:::;j1;Rjj=1+0:5Rgh 2. Ifonlyoneparentisknownforthejthsubject,sayitisg,thenRji=Rij=0:5Rig;i=1;:::;j1;Rjj=1: Ifneitherparentisknownforthejthsubject,Rji=Rij=0;i=1;:::j1;Rjj=1: 140

PAGE 141

141

PAGE 142

Akaike,H.(1973).Informationtheoryandanextensionofthemaximumlikelihoodprinciple.InB.N.PertrovandF.Csaki(Eds.),SecondInternationalSymposiumonInformationTheory,BudapestAkademiaiKiado,pp.267C281.Springer-Verlag. Allison,P.D.(2002).MissingData.Sage,CA:ThousandOaks. Balding,D.(2006).Atutorialonstatisticalmethodsforpopulationassociationstudies.NatureReviewsGenetics7,781. Barnard,J.andD.B.Rubin(1999).Smallsampledegreesoffreedomwithmultipleimputation.Biometrica,949{955. Bartlett,M.S.(1951).Aninversematrixadjustmentarisingindiscriminantanalysis.AnnualofMathematicalStatistics22,107{111. Berger,J.andL.Pericchi(2001).Objectivebayesianmethodsformodelselection:Introductionandcomparison(withdiscussion).ModelSelection,135{207. Boyles,A.L.,W.Scott,E.Martin,S.Schmidt,Y.J.Li,A.Ashley-Koch,M.P.Bass,M.Schmidt,M.A.Pericak-Vance,M.C.Speer,andE.R.Hauser(2005).Linkagedisequilibriuminatestypeerrorratesinmultipointlinkageanalysiswhenparentalgenotypesaremissing.HumanHeridity59,220{227. Brown,P.J.,M.Vannucci,andT.Fern(1998).Multivariatebayesianvariableselectionandprediction.JournalofRoyalStatistalSocietyB,627{641. Carlin,B.P.andS.Chib(1995).Bayesianmodelchoiceviamarkovchainmontecarlomethods.JournalofRoyalStatisticalSocietyB,473{484. Casella,G.,F.J.Giron,M.L.Martinez,andE.Moreno.Consistencyofbayesianproceduresforvariableselection.TheAnnalsofStatistics.(toappear). Casella,G.andE.Moreno(2006).Objectivebayesianvariableselection.JournaloftheAmericanStatisticalAssociation,157{167. Chen,W.M.andG.R.Abecasis(2007).family-basedassociationtestsforgenomewideassociationscan.TheAmericanJournalofHumanGenetics81,913{926. Chib,S.(1995).Marginallikelihoodfromthegibbsoutput.JournaloftheAmericanStatisticalAssociation,1313{1321. Cui,W.andE.I.George(2008).Empiricalbayesversusfullybayesvariableselection.JournalofStatisticalPlanningandInference,888{900. Dai,J.Y.,I.Ruczinski,M.LeBlanc,andC.Kooperberg(2006).Imputationmethodstoimproveinferenceinsnpassociationstudies.GeneticEpidemiology30,690{702. 142

PAGE 143

Dempster,A.P.,N.Laird,andD.B.Rubin(1977).Maximumlikelihoodestimationfromincompletedataviatheemalgorithm(withdiscussion).JournalRoyalStatisticalSocietySeriesB39,1{38. Efron,B.(1994).Missingdataimputationandthebootstrap.JournaloftheAmericanStatisticalAssociation,463{474. George,E.I.andD.Foster(1994).Theriskinationcriterionformultipleregression.AnnalsofStatistics,1947{1975. George,E.I.andR.E.McCulloch(1993).Variableselectionviagibbssampling.JournaloftheAmericanStatisticalSociety,881{889. George,E.I.andR.E.McCulloch(1995).Stochasticsearchvariableselection.MarkovChainMonteCarloinPractice,203{214. George,E.I.andR.E.McCulloch(1997).Approachestobayesianvariableselection.StatisticaSinica,339{379. Green,P.J.(1995).Reversiblejumpmarkovchainmontecarlocomputationandbayesianmodeldetermination.Biometrika,711{32. Hager,W.W.(1989).Updatingtheinverseofamatrix.SIAMReview31,221{239. Henderson,C.R.(1976).Asimplemethodforcomputingtheinverseofanumeratorrelationshipmatrixusedinpredictionofbreedingvalues.Biometrics39,69{83. Hobert,J.andG.Casella(1996).Theeectofimproperpriorsongibbssamplinginhierarchicallinearmixedmodels.JournaloftheAmericanStatisticalAssociation91,1461{1473. Kayihan,G.C.,D.A.Huber,A.M.Morse,T.T.White,andJ.M.Davis(2005).Geneticdissectionoffusiformrustandpitchcankerdiseasetraitsinloblollypine.TheoryofAppliedGenetics110,948{958. Li,K.H.,T.E.Raghunathan,andD.B.Rubin(1991).Large-samplesignicancelevelsfrommultiplyimputeddatausingmoment-basedstatisticsandanfreferencedistribution.J.AmericanStatisticalAssociation,1065{1073. Little,R.J.A.andD.B.Rubin(1987).StatisticalAnalysiswithMissingData.NewYork:Wiley&Sons. Mallows,C.(1973).Somecommentsoncp.Technometrics15,661{675. Marchini,J.,B.Howie,S.Myers,G.McVean,andP.Donnelly(2007).Anewmultipointmethodforgenome-wideassociationstudiesbyimputationofgenotypes.NatureGenetics39,doi:10.1038/ng2088. 143

PAGE 144

McKeever,D.B.andJ.L.Howard(1996).Valueoftimberandagriculturalproductsintheunitedstates,1991.ForestProductsJournal46,45{50. Meng,X.L.andD.B.Rubin(1992).Performinglikelihoodratiotestswithmultiply-imputeddatasets.Biometrica,103{111. Meng,X.L.andW.H.Wong(1996).Simulatingratiosofmormalizingconstantsviaasimpleidentity:Atheoreticalexploration.StatisticaSinica6,831{860. Meyn,S.P.andR.Tweedie(2008).MarkovChainsandStochasticStability.NewYork:Springer-Verlag. Miller,K.S.(1981).Ontheinverseofthesumofmatrices.MathematicsMagazine54,67{72. Mitchell,T.J.andJ.J.Beauchamp(1988).Bayesianvariableselectioninlinearregres-sion.JournaloftheAmericanStatisticalAssociation,1023{1032. Quaas,R.L.(1976).Computingthediagonalelementsandinverseofalargenumeratorrelationshipmatrix.Biometrics46,949{953. Reilley,M.(1993).Dataanalysisusinghotdeckmmultipleimputation.TheStatistician,307{313. Robert,C.P.andG.Casella(2004).MonteCarloStatisticalMethods.UnitedStatesofAmerica:Springer. Roberts,A.,L.McMillan,W.Wang,J.Parker,I.Rusyn,andD.Threadgill(2007).Inferringmissinggenotypesinlargesnppanelsusingfastnearest-neighborsearchesoverllidingwindows.Bioinformatics23,i401{i407. Rubin,D.B.(1978).Multipleimputationsinsamplesurveys-aphenomenologicalbayesianapproachtononresponse.JournaloftheAmericanStatisticalAssociation,20{34. Rubin,D.B.(1987).MultipleImputationforNonresponseinSurveys.NewYork:Wiley&Sons. Sadighi,I.andP.K.Kalra(1988).Approachesforupdatingthematrixinverseforcontrolsystemproblemswithspecialreferencetoroworcolumnperturbation.ElectricPowerSystemsResearch14,137{147. Schafer,J.L.(1997).AnalysisofIncompleteMultivariateData.London:Chapman&Hall. 144

PAGE 145

Schwarz,G.(1973).Estimatingthedimensionofamodel.AnnalsofStatistics6,461{464. Servin,B.andM.Stephens(2007).Imputation-basedanalysisofassociationatudies:candidateregionsandquantitativetraits.PLoSGenetics7,e114. Smith,A.M.F.andD.J.Spiegelhalter(1980).Bayesfactorsandchoicecriteriaforlinearmodels.JournaloftheRoyalStatisticalSociety,SerialB,213{220. Stephens,M.,S.N.J.,andP.Donnelly(2001).Anewstatisticalmethodforhaplotypereconstructionfrompopulationdata.TheAmericanJournalofHumanGenetics68,978{989. Sun,Y.V.andS.L.Kardia(2008).Imputingmissinggenotypicdataofsingle-nucleotidepolymorphismsusingneuralnetworks.EuropeanJournalofHumanGenetics16,487{495. Tanner,M.A.andW.H.Wong(1987).Thecalculationofposteriordistributionsbydataaugmentation.JournaloftheAmericanStatisticalAssociation,528{540. Wear,D.N.andJ.G.Greis(2002).Southernforestresourceassessment:Summaryofndings.JournalofForestry100,6{14. Weinberg,C.R.(1999).Allowingformissingparentsingeneticstudiesofcase-parenttriads.AmericanJournalofHumanGenetics64,1186{1193. Woodbury,M.(1950).Invertingmodiedmatrices.MemorandomRept.42,StatisticalResearchGroup,PrincetonUniversity. Wu,C.F.J.(1983).Ontheconvergencepropertiesoftheemalgorithm.TheAnnalsofStatistics,95{103. Xie,F.andM.Paik(1997).Multipleimputationmethodsforthemissingcovari-atesingeneralizedestimatingequation.Biometrics,1538{1546. Yu,J.M.,G.Pressoir,W.H.Briggs,I.V.Bi,M.Yamasaki,J.Doebley,M.D.McMullen,B.S.Gaut,D.M.Nielsen,J.B.Holland,S.Kresovich,andE.S.Buckler(2005).Auniedmixedmodelmethodforassociationmappingthataccountsformultiplelevelsofrelatedness.NatureGenetics38,e114. 145

PAGE 146

ZhenLiwasborninJiangsu,China.ShewasthesecondoftwodaughtersofPingLiandXuemeiTang.Shereceivedabachelor'sdegreeinmechanicalengineeringinNanjingUniversityofAeronauticsandAstronauticsin2001.Afterthat,shewasenrolledinamaster'sprogramofShanghaiJiaotongUniversity.In2004,shewasadmittedtotheStatisticsDepartmentofUniversityofFlorida.Shereceivedamaster'sdegreeinstatisticsin2006andisexpectingtoreceiveaPh.D.degreeinstatisticsin2008. 146