UFDC Home myUFDC Home  |   Help
<%BANNER%>

# Statistical Functional Mapping for Genetic Control of Programmed Cell Death

PAGE 4

PAGE 5

page ACKNOWLEDGMENTS ............................. iv LISTOFTABLES ................................. viii LISTOFFIGURES ................................ ix ABSTRACT .................................... x CHAPTER 1INTRODUCTION .............................. 1 1.1BasicGenetics ............................. 1 1.2GeneticMapping ........................... 3 1.3TraditionalQTLMappingApproaches ............... 5 1.4FunctionalMapping .......................... 8 1.5ProgrammedCellDeath ....................... 10 1.6Objectives ............................... 13 2GENERALMAPPINGFRAMEWORK .................. 15 2.1GeneticExperimentDesigns ..................... 15 2.2StatisticalModels ........................... 16 2.2.1FiniteMixtureModels .................... 16 2.2.2LikelihoodFunction ...................... 18 2.2.3ModellingtheMeanVector .................. 19 2.2.4ModellingtheCovarianceMatrix ............... 21 2.2.5ComputationalAlgorithms .................. 26 2.2.6HypothesisTesting ...................... 28 2.2.7AsymptoticSamplingError .................. 29 3ANONPARAMETRICAPPROACHFORFUNCTIONALMAPPINGOFPROGRAMMEDCELLDEATH ................... 31 3.1Introduction .............................. 31 3.2TheFiniteMixtureModelandLikelihoodFunction ........ 33 3.3ModellingtheMeanPCDProcess .................. 34 3.3.1LegendrePolynomial ..................... 34 3.3.2Information-theoreticCriteria ................ 37 3.3.3ConsistencyofModelSelectionCriteria ........... 40 3.3.4SelectionPower ........................ 41 v

PAGE 6

..................... 42 3.4ModellingtheCovarianceStructure ................. 43 3.4.1AR(1)CovarianceModel ................... 43 3.4.2SAD(r)CovarianceModel ................... 43 3.5ComputationAlgorithmandHypothesisTesting .......... 44 3.6CalculatingtheHeritabilityLevel .................. 45 3.7LEPOrderSelectionSimulation ................... 46 3.8ModelAssessmentbySimulation .................. 48 3.9RealExample ............................. 51 3.10Discussion ............................... 59 4FUNCTIONALQTLMAPPINGPROGRAMMEDCELLDEATH:ASEMIPARAMETRICMODEL ...................... 62 4.1Introduction .............................. 62 4.2DierentPhasesofPCD ....................... 64 4.3StatisticalModel ........................... 67 4.3.1TheMixtureModel-BasedLikelihood ............ 67 4.3.2SemiparametricModellingoftheMeanVector ....... 69 4.3.3ModellingtheCovarianceStructure ............. 72 4.3.4ComputationAlgorithms ................... 74 4.3.5CalculatingCurveHeritability ................ 76 4.4HypothesisTestings .......................... 77 4.5Results ................................. 78 4.5.1AWorkedExample ...................... 78 4.5.2Simulation ........................... 82 4.6Discussion ............................... 84 5CONCLUDINGREMARKS ......................... 89 6PROSPECTS ................................. 93 6.1MissingDataProblem ........................ 93 6.2JointMappingofPCDandTime-To-Event ............. 95 6.3NonparametricModellingofCovarianceStructure ......... 96 APPENDIX ADERIVATIONOFEMALGORITHMWITHNONPARAMETRICAPPROACH ................................ 97 A.1NonparametricwithAR(1)CovarianceStructure .......... 98 A.2NonparametricwithSAD(1)CovarianceStructure ......... 100 BDERIVATIONOFEM-SIMPLEXALGORITHMWITHSEMIPARAMETRICAPPROACH .................... 103 vi

PAGE 7

......... 105 C.1TheModel ............................... 105 C.2OrderSelectionCriterion ....................... 106 C.3ConsistencyoftheSelectionCriterion ................ 107 REFERENCES ................................... 111 BIOGRAPHICALSKETCH ............................ 123 vii

PAGE 8

Table page 1{1DatastructureforabackcrossdesigninFunMap ............ 10 2{1ConditionalprobabilitiesofQTLgenotypesgivenontheankingmarkersinabackcrossdesign .......................... 16 3{1LEPorderselectionpowercomparisonusingdierentinformationcriteriawithAR(1)covariancestructure .................... 47 3{2PowercomparisonforLEPorderselectionusingdierentinformationcriteriawithSAD(1)covariancestructure ............... 48 3{3TheMLEsoftheQTLpositionandthemodelparametersderivedfrom100simulationreplicateswiththeAR(1)covariancestructure.ThesquaredrootsofthemeansquareerrorsoftheMLEsaregiveninparentheses .............................. 49 3{4TheMLEsofthemodelparametersandtheQTLpositionderivedfrom100simulationreplicateswiththeSAD(1)covariancestructure.ThesquaredrootsofthemeansquareerrorsoftheMLEsaregiveninparentheses .............................. 50 3{5TheMLEsoftheparametersandtheirasymptoticstandarderrorsintheparentheseswiththeAR(1)covariancestructure ......... 56 3{6TheMLEsoftheestimatedparametersandtheirasymptoticstandarderrorsintheparentheseswiththeSAD(1)covariancestructure ... 57 4{1ModelselectionforLegendrepolynomialordersbasedontheAICandBICvaluesundertheM-Ccovariance-structuringmodel ....... 79 4{2TheMLEsofthegrowthanddeathparameters(andtheirapproximatestandarderrorsintheparentheses)forsignicantQTLdetectedondierentchromosomesfortillernumbersinaDHricepopulation .. 81 4{3TheMLEsoftheQTLpositionandthemodelparametersderivedfrom100simulationreplicates.ThesquaredrootsofthemeansquareerrorsoftheMLEsaregiveninparentheses. ............. 82 viii

PAGE 9

Figure page 1{1Simplegrowthcurve ........................... 9 1{2Simpleprogrammedcelldeathcurve .................. 13 3{1Linkagemapforthericegenome ..................... 52 3{2Dynamicchangesofthenumberoftillersfor123DHlinesofriceasanexampleofPCDinplants ..................... 53 3{3Tillergrowth ............................... 54 3{4TheLRproleplot.ThesolidanddottedcurvescorrespondtotheLRprolewithSAD(1)andAR(1)covariancerespectively. ..... 55 3{5Thedynamicchangesoftillernumbersfortwocurveseachpresentingtwogroupsofgenotypes,QQandqq,ateachofthefourQTLdetectedwiththeAR(1)covariancestructure. ................. 58 3{6Twocurvesforthedynamicchangesoftillernumberseachpresentingtwogroupsofgenotypes,QQandqq,ateachofthefourQTLdetectedwiththeSAD(1)covariancestructure. ................ 59 4{1AtypicalexampleofPCDthatincludesvedierentstages,(1)lag,(2)exponential,(3)declininggrowth,(4)stationaryand(5)death. 65 4{2TheLRproleplot.ThegenomicpositionscorrespondingtothepeakofthecurvearetheMLEsoftheQTLlocalization(indicatedbythearrows). .............................. 80 4{3Plotofthedynamicchangesoftillernumbersfortwocurveseachpresentingtwogroupsofgenotypes,QQandqq,ateachofthethreeQTL.A)QTLdetectedbetweenmarkersRG146andRG345,andB)QTLbetweenmarkersRZ730andRZ801onchromosome1,andC)QTLonmarkerRZ792onchromosome9. ................. 85 ix

PAGE 10

PAGE 11

SeveralinformationcriteriaareproposedtoselecttheoptimalLegendreorderforboththenonparametricandsemiparametricmodels.Aseriesofstationaryandnonstationaryprocessesisappliedtomodelthecovariancestructureamongthephenotypesatdierenttimepoints.ThedevelopedmodelsallowfornumeroushypothesestestsofregardingthegeneticcontrolmechanismofPCDfeatures.TheEMandsimplexalgorithmsarederivedtoestimatetheparametersthatmodelthemeanvectorandthecovariancestructure.Themodelisstatisticallyinvestigatedthroughsimulationstudiesandisvalidatedbyarealexamplefortillernumberinrice.Ourmodelprovidesaquantitativeandtestableframeworkforassessingtheinterplaybetweengenesandorganismdevelopmentalpatterns,andwillhavegreatimplicationsforelucidatingthedetailedgeneticarchitectureofanylongitudinaltraitssuchasPCD. xi

PAGE 12

PAGE 13

PAGE 14

PAGE 15

PAGE 16

PAGE 17

PAGE 18

PAGE 19

8 1.4 FunctionalMapping Innature,mosteconomicallyandbiomedicallyimportanttraits(suchasgrain yield,milkproduction,tumorsize,anddrugresponse)showcontinuousvariation, usuallyaectedbyanetworkofsmall-sizedgenesfunctioninginacoordinated manner,andalsoaectedbyenvironmentalstimulusinacomplicatedway.Thus, identicationofspecicQTLinvolvedhadalwaysbeendicultandimprecise. Thephenotypicformationofaquantitativetraitiscontrolledbygeneticand environmentalfactors,alsoregulatedbyasuiteofdevelopmentalsignals.This isevidentinthefactthatalltraitsundergodevelopmentalontogeny,andchange withtime.KirkpatrickandHeckman(1989)calledsuchtime-dependenttraits innite-dimensionalcharacters ;PletcherandGeyer(1999)calledthem functionvaluedtraits .Althoughtherelationshipbetweengeneticcontrolanddevelopment isstatisticallydiculttoelucidate,somekeydicultieswereovercomebyWu andcolleagues(Wuetal.2002a,2003,2004a,2004b;Maetal.2002).Theyhave proposedageneralstatisticalframework(i.e., functionalmapping orFunMap)to genome-widemapspecicQTLthatdeterminethedevelopmentalpatternofa complextrait. Thebasicrationaleoffunctionalmappingliesintheconnectionbetweengene actionorenvironmentaleectsanddevelopmentbyparametricornonparametric models.FunctionalmappingmapsdynamicQTLresponsibleforabiological processthatismeasuredatanitenumberoftimepoints.Inresponseto gradientsofacontinuousenvironmentalfactor(suchastemperature,light intensity,ornutritionalconcentration),avarietyofmathematicalequations havebeendeveloped,eitherempiricallyorthroughtheoreticalderivationsbased onfundamentalprinciples,tomodelthephenotypicbiologicalprocesses.For example,aseriesofgrowthequationswerederivedtodescribegrowthinheight, size,orweight(Fig. 1{1 ;vonBertalany1957)thatoccurwhenevertheanabolic

PAGE 20

PAGE 21

10 integratesparameterestimationandtestprocessinabiologicallymeaningful andmixture-basedlikelihoodframework.Insteadofdirectlyestimatingthegene eectsatallpossibletimepoints,itestimatesparametersdeterminingshapesand functionsofaparticularbiologicalnetwork. Table 1{1 showsatypicaldatastructureforFunMapforabackcrossdesign inwhich isthetotalnumberofmeasurementtimepoints.Thephenotypictraits aremeasuredatdierenttimepointsandcouldbemodelledasafunctionoftime. Biologicallymeaningfulmathematicalfunctionscouldbeadoptedinthemodelling procedure.Therepeatedlongitudinalmeasurementforeachindividualthencould bedescribedbysomefunctions, y i = m j + e i i =1 ; ;n and j =1 ; ;J (1{1) where m j ( t k )= f ( t k )( k =1 ; ; ).Byestimatingthegenotype-specic parameterscontainedin m j ,thatdescribe reactionnorm acrossenvironmental gradients,functionalmappingallowstheassessmentofenvironmentalimpactson geneticvariationincomplextraitsandtestingofthedierenthypothesesthat mediatephenotypicdiversity.Theresultsderivedfromfunctionalmappingwillbe closertobiologicalrealms. Table1{1. DatastructureforabackcrossdesigninFunMap SampleMarkergenotypesPhenotypes n i M 1 M 2 M k y (1) y (2) y ( ) 1 M 1 M 1 M 2 m 2 M k m k y 1 (1) y 1 (2) y 1 ( ) 2 M 1 m 1 M 2 M 2 M k m k y 2 (1) y 2 (2) y 2 ( ) 3 M 1 m 1 M 2 m 2 M k M k y 3 (1) y 3 (2) y 3 ( ) . . . . . . . . . . . . . nM 1 M 1 M 2 M 2 M k m k y n (1) y n (2) y n ( ) 1.5 ProgrammedCellDeath Organismsconsistofthousandsofcelltypesoriginatingfromafertilized eggcell.Startingfromasinglecell,cellnumbersincreasedramaticallyduring

PAGE 22

PAGE 23

12 death.Thesescientistsidentiedseveralkeygenessuchas nuc-1 ced-3 and ced4 genesthatregulateorgandevelopmentandprogrammedcelldeathinmodel organisms C.elegans ,andpointedoutthatcorrespondinggenesmayexistinhigher species,includinghumanbeing.Theirseminalworkestablishedthefoundation ofcellsuicideasanormalphysiologicprocess.Theimplicationsarewideranging includingunderstandingorgandevelopmentandcancer.Wemaybeabletotreat diseasesbystimulatingthecellular\suicideprogram".Iftheabnormallydividing cellscanbekilledobyusingcertaindrugstotriggerthe\self-destruct"programs, itwouldbeanecientwaytotreatcancer. SincePCDissoimportantinbiology,itwouldbeessentialtoderivenew statisticalmodelsforidentifyinggenesthatcontrolthisprocess.Asdiscussed above,QTLmappingbasedonmolecularmarkershasprovenpowerfultoelucidate therelationshipbetweenthegenotypeandphenotypeincomplexsystems.Itwill beavaluabletoolforunderstandingthegeneticarchitectureofacomplexPCD process. Becauseofitsuniquedevelopmentpattern,studyofPCDmustbecoordinated withotherphysiologicalprocesses,suchasgrowthoccurringinthetissueor individual(KuriyamaandFukuda2002).Althoughnospecicmathematical functionsareavailabletodescribetheshapeofthePCDprocess,asabalanceof cell-divisionandcell-deathsignals(KerrandHarmon1991),PCDcanbebroadly dividedintotwodierentphases:growthanddeath(Fig. 1{2 ).Thegrowthphase thatspeciesanincreasingprocessofcellscanbedescribedbyalogisticcurve. Thedeathphase(withcelldeathexceedingcellbirth)correspondstoadecaying processthatcannotbedenedbyaspeciccurve.Morenaturally,thePCD processmaynotfollowanyparticularbehaviorpatternduringontogeny.Another naturalwaytomodelthisprocessistotreattheentiredevelopmentalprocessas onesinglecurveandtitwithpolynomials.Thismodellingstrategyhasgreat

PAGE 24

Figure1{2. Simpleprogrammedcelldeathcurve exibilitybuthascertainlimitationswhenoneintendstotesttheeectofQTLatcertaindevelopmentalstages.TwodierentmodellingstrategiesaredetailedinChapters3and4.DierentmodellingstrategiesgiveaclearpictureofthedynamicdevelopmentalpatternofPCD. 1.6 Objectives TheprocessofPCDistypicallyacomplextrait,controlledbymultiplegenesandvariousdevelopmentalandenvironmentalfactors.CharacterizingthegeneticarchitectureofdynamicPCDtraitwillimproveourunderstandingofthedevelopmentalprocessoforganismsandtheetiologyofdiseasesduetoinappropriateactivationofPCD.Toourknowledge,nostudieshavethusfarexaminedthegeneticmappingofQTLcontrollingPCD. Inthisstudy,IderivedastatisticalframeworkformappingPCD-specicQTLwithinthecontextoffunctionalmapping.TwodierentapproacheswereproposedtomodelthedevelopmentalPCDcurve.TherstapproachwastomodeltheentiredevelopmentalprocessnonparametricallybasedontheorthogonalLegendre

PAGE 25

PAGE 26

PAGE 27

16 fromtheparenttoprogeny,onecanderivetheconditionalprobabilitiesofan unknown QTLgenotype,conditionaluponthe known markergenotypes,interms oftheserecombinationfractions.Theconditionalprobabilities( ij )ofabackcross individual i ( i =1 ; ;n )whocarriesaQTLgenotypeconditionaluponfour ankingmarkergenotypes M M M +1 M +1 M M M +1 m +1 M m M +1 M +1 M m M +1 m +1 isgiveninTable 2{1 Table2{1. ConditionalprobabilitiesofQTLgenotypesgivenontheanking markersinabackcrossdesign Markergenotype QQQq M M M +1 M +1 (1 r 1 )(1 r 2 ) 1 r r 1 r 2 1 r M M M +1 m +1 (1 r 1 ) r 2 r r 1 (1 r 2 ) r M m M +1 M +1 r 1 (1 r 2 ) r (1 r 1 ) r 2 r M m M +1 m +1 r 1 r 2 1 r (1 r 1 )(1 r 2 ) 1 r where r 1 and r 2 aretherecombinationfractionsbetweentheQTLandmarker M and M +1 ,respectively,and r istherecombinationfractionbetweenthetwo markers. Othercrossdesignsornaturalpopulationshavedierentconditional probabilitytables.Conditionalprobabilitymeasuresthepossibilityofrecombination betweentheputativeQTLandtheankingmarkersduringmeiosis.Therefore, theseprobabilitiesdeterminethesegregationpatternoftheputativeQTL.My studyusetheHaldanemapfunction(assumingnointerference)toconvertthe mapdistance cM (centimorgan)torecombinationfraction r throughthefunction r = 1 2 (1 e 2 d= 100 ). 2.2 StatisticalModels 2.2.1 FiniteMixtureModels Ingeneral,ifoneknowstheQTLgenotype,thenndingasignicantQTL isequivalenttodoingatwo-sample t testforabackcross;andanANOVAtest

PAGE 28

PAGE 29

18 where m j =[ m j (1) ; ;m j ( )] T isavectorofexpectedvaluesforQTLgenotype j atdierentpoints.Ataparticulartime t ,therelationshipbetweentheobservation andexpectedmeancanbedescribedbyasimplelinearmodel y i ( t )= J X j =1 ij u j ( t )+ e i ( t ) ; (2{3) where ij istheindicatorvariabledenotedas1ifaQTLgenotype j isconsidered forindividual i and0otherwise; e i ( t )istheresidualerror(i.e.,theaccumulative eectofpolygenesanderrors)thatis iid (identicallyandindependentlydistributed) normalwithmeanzeroandvariance 2 ( t ).Thewithin-subjectmeasurementis correlatedandtheerrorsattwodierenttimepoints, t 1 and t 2 ,arecharacterized withcorrelation ( t 1 ;t 2 ). 2.2.2 LikelihoodFunction Inmystudy,eachobservationwasarepeatedmeasurementwithacertain correlationstructure.Becauseoftheindependenceassumptionbetweensubjects, thelikelihoodfunctionofthePCDtrait, y ,andthemarkergenotype, M ,atthe hypothesizedQTLlocationcanbeexpressedforthebackcrossas L ( j y ; M )= n Y i =1 [ 1 j i f 1 ( y i j )+ 0 j i f 0 ( y i j )](2{4) where 1 j i and 0 j i arethemixtureproportionscorrespondingtothefrequencies ofdierentQTLgenotypesforindividual i .Becausethemarkergenotypeof eachindividualisknown,thesemixtureproportionscanbeactuallyexpressed asconditionalprobabilities(Table 2{1 ).Theunknownvector thatdenes themixturemodel(Eq. 2{1 )containstwosetsofparameters,onedeningthe co-segregationbetweentheQTLandmarkersand,thereby,thelocationoftheQTL relativetothemarkers,denotedby s ,andtheseconddeningthedistribution ofthePCDtraitforeachQTLgenotype,denotedby q =( m ; v ),where m denesthemeanvectoratdierenttimepointsand v denesthecovariance

PAGE 30

19 matrixamongdierenttimepoints.TheMLEsoftheseunknownparameters isobtainedthroughmaximizingthelikelihoodfunction(Eq. 2{4 )byusingEM algorithm(Dempsteretal.1977)orotheroptimizationpackagessuchasSimplex (Nelder&Nead1965). 2.2.3 ModellingtheMeanVector ToestimatetheparametersofthelikelihoodfunctionimplementedwithPCD datameasuredatmultipletimepoints,onecanextendthetraditionalinterval mappingapproachtoaccommodatethemultivariatenatureoftime-dependent traits.However,thisextensionislimitedinthreeaspects:(1)Individualexpected meansofdierentQTLgenotypesatalltimepointsandallelementsinthematrix needtobeestimated,resultinginsubstantialcomputationaldicultieswhen thevectorandmatrixdimensionsarelarge;(2)Theresultfromthisapproach maynotbebiologicallymeaningfulbecausetheunderlyingbiologicalprinciple fordevelopmentalcharacterisnotincorporated;(3)Theapproachcannotbewell deployedinapracticalschemebecauseof(2).Thus,somebiologicallyinteresting questionscannotbeaskedandanswered. Basedontheseconcerns,anewstatisticalframeworkhasbeendevelopedto detectQTLaectinggrowthtrajectories(Wuetal.2002a,2004a,2004b;Maet al.2002).Thisframeworkallowsforthemodellingofthetime-dependentexpected meansofQTLgenotype j usingagrowthequation j ( t )= g ( t ; m j ) ; (2{5) whichisspeciedbyasetofcurveparametersarrayedin m j .Alllivingentities arecharacterizedbygrowthdenedastheirreversibleincreaseofsizewith time.Anumberofmathematicalmodels,suchaslogisticorS-shapedcurves, havebeenproposedtodescribegrowthtrajectories.Thefundamentalbiological aspectsofmathematicalmodellingofgrowthcurveshavebeenfoundedbyearlier

PAGE 31

PAGE 32

PAGE 33

PAGE 34

PAGE 35

PAGE 36

wherethecorrelationmatrixR()isafunctionofcorrelationparameter.Thechoiceofasuitablecorrelationmatrixdependsonthenatureoftherepeatedmeasurementfactoraswellasthemodellingrequirement(forexample,thepositivedeniterequirement).Whenvarianceisconstantacrosstheresponserangeandequaltosomevalue2,thenCov(e)=2R() However,inmostsituations,theheterogeneityofintra-individualvariancemaybeevident.Thishappenswhenmeasurementsaretakenovertimeandvariancetendstouctuatewiththelevelofresponses.Forexample,aPCDtraitshowsauniquedevelopmentalpatternasthevariationsatdierenttimepointskeepchangingasthemeanchanges.Amoregeneralintra-individualcovariancestructureneedstobespecied. Anaturalrelationshipthatappearstobeevidentforrepeatedmeasurementsdataisameanandvariancerelationship.Thevariancescanbemodelledasfunctionsofthemeans.Deneavariancefunctionv=v(m)anddenethediagonalvariancematrixasV(m)=diag[v2(m;t1);;v2(m;tk)] wheremisavectorofparametersforthemeanfunction;v2(m;t1);;v2(m;tk)isthevariancefunctionrelatedtodierenttimepoints.UsingthecorrelationmatrixR()denedabove,thecovariancematrixcanbespeciedasCov(e)=2V(m)1=2R()V(m)1=2

PAGE 37

26 isguaranteedtobepositivedenite.Bychoosingappropriatedvariancefunction v andcorrelationmatrix R ( ),thedevelopedcovariancestructurecanbettercapture theintrinsicintra-individualvariation.AdetailedmodellingstrategyforPCD traitswillbediscussedinChapter4. 2.2.5 ComputationalAlgorithms EMalgorithm .WeimplementedtheEMalgorithm,originallyproposedby Dempsteretal.(1977),toobtainthemaximumlikelihoodestimates(MLEs) ofthreegroupsofunknownparametersinaQTLmappingmodel.Thatis,the QTL-segregatingparameters( s )thatspecifytheco-segregationpatternsofQTL andmarkersinthepopulation,thecurveparameters( m j )thatmodelthemean vectorforgenotype j ,andtheparameters( v )thatmodelthestructureofthe covariancematrix.Theseunknowns,denotedcollectivelyas =( s ; m j ; v ), correspondtomixtureproportions,component-specicmeanparametersand common(co)varianceparameters,respectively,inthemixturemodeldescribedby Equation( 2{1 ).Therstderivativeofthelog-likelihoodfunctionwithrespectto specicparameterscontainedin isgivenby @ @ log  ( )= n X i =1 2 X j =1 ik @ @ f j ( y i j ) P 2 j =1 ij f j ( y i j ) = n X i =1 2 X j =1 ij f j ( y i j ) P 2 j =1 ij f k ( y i j ) @ @ log f j ( y i j ) = n X i =1 2 X j =1 ij @ @ log f j ( y i j ) wherewedene ij = ij f j ( y i j ) P 2 j =1 ij f j ( y i j ) IntheE-step,theconditionalprobabilities ij (treatedaspriors)oftheQTL genotypesgiventhemarkergenotypesandthenormaldistributionfunctionare

PAGE 38

PAGE 39

28 TheamountofsupportforaQTLataparticularmappositionisoftendisplayed graphicallythroughtheuseoflikelihoodratiomapsorproles,whichplotthe likelihoodratioteststatisticasafunctionofmapposition(ortestposition)ofthe putativeQTL.Thisplotiscalleda likelihoodratio(LR)proleplot .Thepeakof theprolethatpassescertainsignicantthresholdcorrespondstothepositionof theQTLoverthegenome. 2.2.6 HypothesisTesting Oneofthemostimportantbiologicaladvantagesforfunctionalmapping isthatitprovidesaquantitativeandtestableframeworkforunderstanding therelationshipbetweengeneactionandvariousdevelopmentalevents(Wuet al.2004a).Forexample,whendoestheQTLturnonorotodeterminethe developmentprocessduringtimecourse?Underthecurrentsettings,ourmodel allowsforanumberofhypothesisteststoexaminethegeneticcontrolofPCD processes.TestingwhetherthereexistspecicQTLtoaectthePCDprocess isarststeptowardtheunderstandingofthedetailedgeneticarchitectureof complexphenotypes.AftertheMLEsoftheparametersareobtained,theexistence ofaQTLaectinganoverallPCDcurveshouldbetestedbyformulatingtwo alternativehypotheses 8 > > < > > : H 0 : m j m j =1 ; ;J H 1 :atleastoneoftheequalitiesabovedoesnothold (2{6) where H 0 correspondstothereducedmodel,inwhichthedatacanbetbya singlePCDcurve,and H 1 correspondstothefullmodel,inwhichthereexist dierentPCDcurvestotthedata.Theteststatisticfortestingthehypothesesin Equation( 2{6 )iscalculatedasthelog-likelihoodratio(LR)ofthereducedtothe fullmodel LR= 2[log L ( e j y ; M ) log L ( b j y ; M )] ; (2{7)

PAGE 40

PAGE 41

Anothersimpleapproachforestimatingtheasymptoticvariance-covariancematrixistousenumericalderivatives(Pressetal.1992).Thepartialderivativeofagivenfunctionwithrespecttosomeparametersxandyisgivenby@2f @x@y=[f(x+h1;y+h2)f(x+h1;yh2)][f(xh1;y+h2)f(xh1;yh2)] 4h1h2 @x2=f(x+h)2f(xh)+f(x) @xcloseto0.Takingtheinverseofthenegativevaluesofthepartialderivativematrix,onecanobtainanestimateoftheasymptoticsample(co)variancematrix.Thediagonalelementsaretheasymptoticsamplevariancefortheestimatedparameters.

PAGE 42

PAGE 43

PAGE 44

33 Muller1988;Altman1990;FraimanandMeloche1994;AltmanandCasella1995; Boularanetal.1995;Ferreiraetal.1997).Thesemethodstreattimeastheonly explanatoryvariableandestimatethemeanresponsecurvebysmoothingtheraw data.Ingeneticstudies,therandomregression(RR)modelisanotherpopular modelusedformodellingthedynamicadditiveeect.TheLegendrepolynomial (LEP)hasbeenwidelyusedinRRtotthedevelopmentalcurvepartlyduetothe natureoftheexibilityoftheorthogonalpolynomial(Pooletal.2000a). Unlikegrowthcurvedata,thelongitudinalPCDtraitsnotonlyinvolvea growthphase,butalsoarecharacterizedbyadeathphaseforcertaincharacteristics suchasvolume,numberandheight.ThecomplexcharacteristicofPCDtraitsfaces anumberofstatisticalchallengesifapplyingtraditionalmethods,especially whenmodelsarebuiltunderthemixture-basedlikelihoodframework.Here, wedevelopamodelusingtheLEPwithaspecicordertotthePCDcurve andincorporatetwocovariancestructuresintothemodellingframeworkfor mappingQTLunderlyingthecomplexdevelopmentalPCDtraits.Orderselection usinginformation-theoreticcriteriaisprovided.Theconsistencyoftheselection criteriaisprovedandtheselectionpowerisevaluatedbasedonsimulationstudies. Detailedparameterestimationprocedureswaregiven.ExtensiveMonteCarlo simulationstudiesandarealworldexampleusingricetillernumberdataaregiven todemonstratetheusefulnessofthedevelopmentmodel. 3.2 TheFiniteMixtureModelandLikelihoodFunction Thestatisticalfoundationofthecurrentstudyisbasedonanitemixture modeldescribedinChapter2inEquation( 2{1 ).Themixtureproportionsare correspondingtoQTLfrequenciesconditionalontheankingmarkers.Consider astandardbackcrossdesign,initiatedwithtwocontrastinghomozygousinbred lines.Thisbackcrossprogenycontains n individuals,ineachofwhichtherearetwo groupsofgenotypesatalocus.Beforeconductingthestatisticalmappinganalysis,

PAGE 45

PAGE 46

35 Thepolynomialsoforder r isdenotedby P r ( x ).Thepolynomialsareeitherevenor oddfunctionsof x forevenoroddorders r .ThegeneralformofaLEPoforder k isgivenbythesum P r ( x )= K X k =0 ( 1) k (2 r 2 k )! 2 r k !( r k )!( r 2 k )! x r 2 k ; (3{1) where K = r= 2or(r-1)/2whicheverisaninteger.Thispolynomialisdenedover theinterval[-1,1].SolvingEquation( 3{1 ),itiseasytoshowthattherstfew polynomialsare P 0 ( x )=1 P 1 ( x )= x P 2 ( x )= 1 2 (3 x 2 1) P 3 ( x )= 1 2 (5 x 3 3 x ) P 4 ( x )= 1 8 (35 x 4 30 x 2 +3) P 5 ( x )= 1 8 (63 x 5 70 x 3 +15 x ) P 6 ( x )= 1 16 (231 x 6 315 x 4 +105 x 2 5) Bychoosingdierentordersoforthogonalpolynomials,theLegendrefunction hasthepotentialtoapproximatethefunctionalrelationshipsbetweentraitvalues anddierenttimepointstoanyspecieddegreeofprecision.Forthecurrent model,theindependentvariable x isexpressedastime t ,whichisadjustedto matchthemeasurementtimestotheorthogonalfunctionrange[-1,1],by t 0 = 1+ 2( t t min ) t max t min ; where t min and t max aretherstandlasttimepointsrespectively. Withanappropriateorder r ,thetime-dependentgenotypicvaluesfordierent QTLgenotypescanbettedbytheorthogonalLEP.Afamilyofsuchpolynomials

PAGE 47

isdenotedbyP(t0)=[P0(t0);P1(t0);;Pr(t0)]T Thus,forindividualiwhoseQTLgenotypeisj,itsgenotypemeansatdierenttimepointscanbemodelledbythefollowingvector Thismodellingapproachhasgreatexibilityinmodellingcurvesinwhichthelogisticorotherparametricmodelsdoesnott.Bychoosingappropriateorder,themodelcanbettercapturetheintrinsicdevelopmentalPCDtrend. OrderselectionforLEP .OneofthemajorconcernsusingtheLEPistochoosetheoptimalorder.IfthedegreeoftheLEPisk(thehighestpowerofthepolynomial),thentheorderofLEPwillbek+1(thenumberofthecoecientsdeningthepolynomialwhichismorethanitsdegree).Theorderselectionproblemissimilartoselectvariablesinaregressionsense.WeareactuallyselectingthecoecientsoftheLEP,i.e.,thebasegenotypicvectors.Normally,ahigherorderalwaysgivesabettert.However,ifthemodelcontains

PAGE 48

PAGE 49

andCover(1991),whichhastheformI(Yjh())+I(h()),wherethersttermrepresentstheamountofinformationIrequiredtoexpressthedataY,giventheinterestedmodelhwithparameters;thesecondtermrepresentstheamountofinformationrequiredtoexpressthemodelitself.Foraninformationcriterion,thersttermisequivalenttothenegativelog-likelihoodofthedata,andthesecondtermisthoughtofasapenaltyformodelcomplexity. AgeneralformofthebasicinformationcriteriontoselecttheLEPorderisgivenbyICr(n)=2lnL(bm;bvjY;r)+c(n)pr() wherethersttermisthenegativemaximumlog-likelihoodofthedataYgiventhemodelparameterestimates;bmandbvrepresenttheMLEofmeanandcovarianceparameters;pr()representsthenumberoffreeparameterswhichisonlydependentonthenumberofmeasurementtimewithorderr,sopr()=dimension(m;vjr);c(n)isapenaltyterm. Thegoalofinformation-theoreticmodelselectionistoselectparsimoniousmodels.Hence,modelsthatminimizethecriterionareselected.Variouscriteriahavebeenproposedbasedonthisgeneralformandtheyaredierentinc(n)terms.Amongavarietyofinformation-theoreticcriteriaproposedsofar,thefollowingseveralcriteriawillbediscussed. Akaike'sinformationcriterion(AIC) .Oneofthepopularinformation-theoreticmodelselectioncriteriaistheAICinformationcriterion(Akaike1974).Theoretically,thiscriterionderivesfromconsiderationoftheKullback-Leiblerdistancebetweenagivenmodelandthetruemodel.Specically,minimizingAICisapproximatelyequivalenttominimizingtheexpectedKullback-Leiblerdistancebetweenthetrueandtheestimateddensity.TheAICvaluewithaparticularorderriscalculatedby AIC=2lnL(bm;bvjr)+2pr();

PAGE 50

SinceAICservesasaconsistentestimatoroftheKullback-Leiblerdiscrepancybetweenthedistributionthatgeneratedthedataandthemodelthatapproximatesit(Takeuchi1976),itcanbethoughtofasmeasuringtherelativeineciencyofapproximatingthetruemodelbythemodelofinterest.ModelswithsmallervaluesofAICcanthusbethoughtofasmoreecientlyapproximatingthetruemodel,wherethetruemodelisunknown.BykeepingincreasingtheLEPorder,theselectionmethodpickuptheoptimalorderriftheAICvalueisnolongerincreasing. CorrectedAIC(CAIC) .SinceAICisonlyanapproximateestimateoftherelativeKullback-Leiblerdistanceandhasbeenshowntobebiasedinsmallsamples,acorrectedversionofAIC,calledCAIC,wasdevelopedtoadjustforthisbias(Sugiura1978;HurvichandTsai1989).Lateron,BurnhamandAnderson(1998)proposedamultivariategeneralizationofCAICwhichisdenedas CAIC=2lnL(bm;bvjr)+p+p(p+) wherenisthesamplesize,pisthetotalnumberofparametersinthemeanmodel,andisthenumberofparametersusedtoestimatethecovariancematrixofthedistribution. Bayesianinformationcriterion(BIC) .BICisanotherselectioncriterionfromaBayesianpointofview.ItisusuallyexplainedintermsofBayesiantheory,especiallyasanestimateoftheBayesfactor{theratiooftheposteriortotheprioroddsincomparisonsofamodeltoasaturatedmodel(Schwarz1978;Raftery1993).BICisdenedas BIC=2lnL(bm;bvjr)+pr()ln(n): wherealltheparametersaredenedsimilarlyasabove.Sinceourmodellingobjectiveistoselectparsimoniousmodelstobestdescribethedynamicdevelopment

PAGE 51

40 process,theBICinformationcriterionassignsmorepenaltytothelikelihood functionandhenceittendstochoosemoreparsimoniousmodels. ComparisonsoftherelativemeritsofAICandBICbasedonasymptotic consistency(as n !1 )haveourishedintheliterature.Investigationshave generallydemonstratedthatBICisdimensionconsistent,i.e.,tendstochoose thetruemodelwithprobabilityequaltooneinlargesamplesbutperforms poorlyinsmallsamples(HurvichandTsai1990;BickelandZhang1992)and AICisconsistentifthedimensionalityofthetruemodelincreaseswithn(at anappropriaterate)(Shibata1981).Sakamotoetal.(1986)alsonotedthatthe modelselectedbyAICcriterionisnotactuallythetruemodelbuttheoneforthe bestmodelt.Toevaluatethemodelselectioncriteria,weneedtoshowthatthe proposedcriteriaareconsistent. 3.3.3 ConsistencyofModelSelectionCriteria Theconsistencyofmodelselectioncriteriahasbeenshownintheliterature dependingondierentselectioncriteriaproposedanddierentmodelsunderstudy (GuyonandYao1999;Woodetal.2002;HuangandYang2004).Theconsistency ofaselectioncriterionisdenedasthesituationwheretheprobabilityofthe criterionchoosingthecorrectmodelis1asthesamplesizeincreasestoinnity. Hence,amodelselectioncriterionisconsistentif P f Choosethebettermodelm 2 g =P f IC k 1 (n) < IC k 2 (n) g! 1 ; asn !1 whereweassumethat m 2 isthecorrectmodel; IC k 1 ( n )and IC k 2 ( n )arethe informationcriteriaformodel m 1 and m 2 respectivelywithLEPorder k 1 and k 2 ToselecttheoptimalLEPorder,wehavethefollowingtwoassumptions, 1. ThesameLEPorderfordierentgenotypes, 2. Thenumberofmeanparametersareonlydependentonthenumberof measurementtimepoints whichleadstoniteparameterspacefor p r ( ).

PAGE 52

Stoppingrule .WeproposaforwardselectionbyincreasingoneorderatatimeuntiltheinformationcriterionICk(n)doesnotdecreaseanymore.TheorderwhichgivestheminimumICk(n)willbetheoptimalorder.Thefollowingtheoremshowstheconsistencyoftheorderselectioncrtieria. Theorem1 .Undertheabovetwoassumptions,withthedenedstoppingrule,amodelselectioncriterionICk(n),isconsistentifc(n)[ps()ps0()] Proof .SeetheAppendixCforarigorousproof. Therefore,theproposedBIC,CAICandAICinformationcriterionareallconsistent. 3.3.4 SelectionPower Theusualmodelselectionerrorrate()isdenedastheprobabilityofselectingawrongordergiventhedataissimulatedwiththetrueorder.If!0asn!1forcertainmodelselectioncriterion,thenthisselectioncriterionisconsistent.Bysubtractingfrom1,onegetstheselectionpower|theprobabilitytoselectthetruemodel. Errorratehasbeenusedasanimportantcriterionofevaluationofthemodelselectioncriteriainmanypublications(Bozdogan1987;HurvichandTsai1990;Shao1993,1996;ZhengandLou1995).However,itisnotsucientbyitselftoshowwhichmodelselectioncriterionisbetterthanothers,partlyduetothesamplesizeproblem.Inthisstudy,wewilluseitasanevaluationforchoosingdierentselectioncriteria.

PAGE 53

3.3.5 LEPOrderSelection Underthecurrentstudysetting,weconsideredtwosituationstoselecttheoptimalorderfortheLEPstotthedevelopmentalPCDcurve,i.e.,selectionunderthenullandalternativehypothesisrespectively.Evaluationofdierentcasesbysimulationstudycanprovideusageneralselectionguideline. Orderselectionunderthealternativehypothesis .Underthealternativehypothesisthatthereisonegenecontrollingthedevelopmentalprocess,thereare2or3meancurvescorrespondingtotwogenotypesforabackcrossdesignandthreegenotypesforaF2designrespectively.AIC,CAICandBICwillbeappliedtoselecttheoptimalorderbasedonthelikelihoodinformation.Orderselectionunderthealternativeiscomputationallytime-consumingbecauseoneneedstoselectthecorrectorderateverytestposition.Therstassumptionwithequalorderfordierentgenotypeismandatoryunderthecurrentsettingsbecauseofthehypothesistestingissue.Intherealworld,thisassumptionmightbeviolatedsincetheremightbedierentdevelopmentalpatternsforindividualscarryingdierentgenotypesandhencemayfollowdierentdevelopmentalpatternsorcurves.Consideringthecomputationburden,forsimplicitywewillnotconsiderthiscase.Selectingordersfordierentgenotypeswillbeconsideredinfuturework. Orderselectionunderthenullhypothesis .UnderthenullhypothesisthatthereisnoQTLcontrollingthedevelopmentalprocess,thereisonlyonemeancurve.Withthisinformation,onecanselecttheLEPorderunderthenullwithdierentcriteria,specically,AIC,CAICandBIC.Computationally,itisveryecienttoselecttheorderunderthenullbecausetheselectionprocedureonlyperformsonce.Beforeselectingtheorderunderthealternative,theorderselectedunderthenullcanprovideapieceofinformationabouttheorderunderthealternativebecauseitishighlypossiblethatthemeanprocessfordierentgenotypesmaintainsthesameorderunderthenullandalternative.

PAGE 54

43 3.4 ModellingtheCovarianceStructure TobetterunderstandingtheQTLunderlyingthedevelopmentalPCD process,covariancestructuremodellingisaveryimportantstep.Dissectionof theintra-individualcorrelationwillhelpustounderstandhowQTLmediatethe developmentalpattern.Twocovariancestructureswereproposedformodellingthe covariancematrixunderthecurrentdesign. 3.4.1 AR(1)CovarianceModel Therst-orderautoregressive[AR(1)]model(Diggleetal.2002)canbe utilizedtomodelthestructureofthecovariancematrix.Thedetaileddescription oftheAR(1)covariancestructureisgiveninChapter2.Someofthepropertiesof AR(1)modelisgiveninAppendixA.TBSmodellingstrategywasapplied(Wu et.al.2004b)toremovetheheteroscedasticityproblemoftheresidualvariance. 3.4.2 SAD(r)CovarianceModel Tomodelthenonstationarityofthecovariancestructure,theSADmodelis moreappropriate.TheSADmodelwithorder r formodellingtheerrortermin Equation( 2{3 )isgivenby e ( t i )= 1 e ( t i 1)+ + r e ( t i r )+ i (3{7) where i isthe\innovation"termassumedtobeindependent N (0 ; 2 t i )random variables.FortherstorderSADmodel,ifaconstantinnovationvariance 2 is assumed,analyticalformsforvarianceandcorrelationfunctionscanbeobtained. Foranytime t k ,thecovarianceandcorrelationfunctionofaSAD(1)modelis givenby(Jarezicetal.2003) ( t )= 1 2 t 1 2 2 ; (3{8) ( t;k )= t k s 1 2 t 1 2 k ; (3{9)

PAGE 55

44 Fromtheabovefunction,itisclearthatevenforthesimplestSAD(1)model, thecorrelationfunctionisnonstationary.Usingmatrixnotation,theerrortermin Equation( 3{7 )canbeexpressedas e = A where e =[ e (1) ; ;e ( )] T =[ (1) ; ; ( )] T and A = 0 B B B B B B B B B B @ 1000 1 100 . . . 1 1 2 1 1 1 1 C C C C C C C C C C A Thevariance-covariancematrixofthePCDdataisthenexpressedas = A A T ; (3{10) where istheinnovationvariance-covariancematrixandisexpressedas = 0 B B B B B B B @ 2 1 00 0 0 2 2 0 0 . . . . . . . 000 2 1 C C C C C C C A : Theclosedformsfortheinverseanddeterminantofmatrix helptoestimate theparameters,( 1 ; ; r ; 2 )(seeAppendixA).Formodelsimplicityand computationconsideration,weonlyconsidertheSAD(1)model. 3.5 ComputationAlgorithmandHypothesisTesting AnEMbasedalgorithmcanbeappliedtoestimatetheparametersas describedinAppendixA.AftertheMLEsoftheparametersareobtained,the existenceofaQTLaectingaPCDcurveshouldbetestedbyformulatingthe

PAGE 56

45 followinghypotheses 8 > > < > > : H 0 : m j m H 1 :atleastoneoftheequalitiesabovedoesnothold ; where H 0 correspondstothereducedmodel,inwhichthedatacanbetbya singlecurve,and H 1 correspondstothefullmodel,inwhichthereexistdierent curvestotthedata.Theteststatisticfortestingthehypothesesiscalculatedas thelog-likelihood(LR)ratioofthereducedtothefullmodel LR= 2[log L ( e j y ; M ) log L ( b j y ; M )] where e and b denotetheMLEsoftheunknownparametersunderH 0 andH 1 respectively.Permutationtests(ChurchillandDoerge1994)areusedtoassessthe statisticalsignicance. 3.6 CalculatingtheHeritabilityLevel Heritabilitymeasurestheamountofphenotypicvariabilityexplainedby geneticvariation.Thehighertheheritability,thehigherthevariationofcertain traitsexplainedbythedetectedQTL.Itiseasytocalculatetheheritabilitylevel ( H 2 )whentraitsaremeasuredatasingletimepoint.Butforlongitudinaltraits, heritabilitycalculationsaredierent.Weproposedtwowaystocalculate H 2 1. Calculate H 2 atasingletimepoint t wherethetraitsshowsthehighest variation.Forabackcrossdesign,thegeneticvariationisgivenby 2 g = 1 4 ( 1 ( t ) 0 ( t )) 2 ,and H 2 = 2 g 2 where 2 istheresidualvariance.Ifweknow H 2 or 2 ,basedon 2 g ,wecaneasilycalculateanotherone. 2. Thesecondapproachweproposedisbycalculatingtheareaunderthecurve (AUC).Functionalmappingmapsthedynamicgeneeectovertime.It shouldbemoreaccuratetomeasurethegeneticvariationthroughtheentire measurementstageinsteadofbasedonapointestimation.Weproposedto calculatethegeneticvariationby 2 g = 1 4 ( AUC 1 AUC 0 ) 2 ,whereAUC 1 and

PAGE 57

46 AUC 0 aretheAUCfortwodierentgenotypes.TheAUCiscalculatedby AUC j = Z 1 1 u 0 j P r ( t 0 ) dt 0 where P r ( t 0 )isavectorofLEPwithorder r ; u j isthebasegenotypicmean parametersand t 0 istheadjustedtimepoint. 3.7 LEPOrderSelectionSimulation Totesttheperformanceofdierentorderselectioncriteriawithmoderate samplesize,wedidasimulationstudy.Todemonstratetheselectionpowerand forsimplicity,wesimulatethetrueorderwith6thorderLEP.AputativeQTLis simulatedunderdierentsamplesizes( n =100,200)anddierentheritabilitylevels ( H 2 =0.1,0.4).Thetotalmeasurementtimepointsarexedat9pointstomatch thericetillernumberdata(describedlater).Thephenotypicdataaresimulated underdierentcovariancestructuredesigns(AR(1),SAD(1)).Thevariousdesigns allowustochecktheselectionperformanceunderdierentsituations.Theselection powerisevaluatedasthepercentageofsimulateddatasetinwhichcorrectedmodel wasselected.Bysubtractingthepowerfrom1,theselectionerrorratecanbe easilyobtained. Therearetwomajorreasonsforthecurrentsimulationstudytoselectorder underthenullhypothesis, 1. Generallyspeaking,itismoreinformativetoselectorderunderthe alternativefordierentgenotypes.Forthehypothesistestingconcern, theparametersunderthenullhastobenestedunderthealternative (i.e.,dierentgenotypesunderthealternativeshouldhavethesame numberofparametersasthenullhas).Ifsimulationstudysupportsthis hypothesis,thennoinformationwillbelosttoputthesameorderforthenull distributionasselectedunderthealternative. 2. Computationally,itisveryintensivetoselectorderunderthealternative becauseonehastoselecttheorderforeachtestposition.Ifthesimulation studysupportsthehypothesisthatorderselectedunderthenullisthesame asitisselectedunderthealternative,onecanjustselectorderunderthenull tosamecomputationtime.

PAGE 58

47 Table3{1. LEPorderselectionpowercomparisonusingdierentinformation criteriawithAR(1)covariancestructure H 2 =0 : 1 H 2 =0 : 4 Informationcriterian=100n=200n=100n=200 Orderselectionunderthenull AIC 0 0.991.001.001.00 CAIC 0 0.991.001.001.00 BIC 0 0.830.971.001.00 Orderselectionunderthealternative AIC 1 0.890.920.800.83 CAIC 1 0.900.920.800.84 BIC 1 0.881.001.001.00 Note:0and1referstothecriteriaunderthenullandalternativehypothesis respectively. Simulationresult .Comparisonsofthedierentmodelselectioncriteriaunder dierentcovariancestructuresaresummarizedinTables 3{1 and 3{2 .Each tablegivesthepercentofdatasetsinwhichthecorrectmodelwaschosen,for eachcombinationofsamplesize,heritabilitylevel,andselectioncriteria.Despite dierencesinperformanceamongmodelselectioncriteria,trendsholdingacross dierentcriteriawereevidentinthesimulationresults.First,asmightbeexpected, theoverallpowerofvariousmodelselectioncriteriatoselectthetruemodel generallyincreasedwithsamplesize.Second,adistinctpatternofperformance amongthemodelselectioncriteriaisthatAICandCAICgenerallyperforming similartooneanother.However,therewasaslightbutpersistenttendencyfor CAICtooutperformAICinmanyconditions.Whenthecovariancestructureis modelledwiththeAR(1)structure,BICuniformlyoutperformedAICandCAIC withheritabilitylevel0.4.However,itperformspoorlywhentheheritabilitylevel is0.1.Weknowthatthehigherheritabilitylevelcorrespondstolowerresidual variance.Thispieceofinformationsuggestsusingdierentcriteriadependingon theresidualvariance.AsimilarpatternisalsoshownwiththeSAD(1)covariance structure.Overall,toselecttheoptimalorderunderthealternativehypothesis,one

PAGE 59

PAGE 60

49 TheHaldanemapfunctionwasusedtoconvertthemapdistanceintothe recombinationfraction.Totestthemodelperformance,wesimulatedatawith dierentspecications,namelydierentheritabilitylevels(H 2 =0.1vs0.4)and dierentsamplesizes( n =100vs200).Eachbackcrossprogenyismeasuredat9 equallyspacedtimepoints.Thesimulationwasbasedontwodierentpatternsof covariancestructures,AR(1)andSAD(1). Table3{3. TheMLEsoftheQTLpositionandthemodelparametersderived from100simulationreplicateswiththeAR(1)covariancestructure. ThesquaredrootsofthemeansquareerrorsoftheMLEsaregivenin parentheses H 2 =0 : 1H 2 =0 : 4 Trueparametern=100n=200 n=100n=200 QTLposition =4846.38(3.10)46.32(2.62)47.28(2.59)47.36(1.83) Meanparametersfor QQ u 20 =9 : 04919.4006(0.56)9.4141(0.47)9.0932(0.17)9.0872(0.12) u 21 =1 : 15071.1258(0.28)1.1093(0.21)1.1565(0.11)1.1577(0.07) u 22 = 6 : 0197-6.2381(0.47)-6.2482(0.36)-6.0489(0.16)-6.0348(0.12) u 23 =2 : 65142.8064(0.34)2.8118(0.28)2.6786(0.13)2.6710(0.09) u 24 =0 : 65170.6941(0.23)0.6584(0.16)0.6684(0.09)0.6463(0.06) u 25 = 0 : 7969-0.8804(0.19)-0.8745(0.17)-0.8129(0.07)-0.8082(0.05) u 26 =0 : 62050.6430(0.21)0.6704(0.17)0.6163(0.09)0.6303(0.06) Meanparametersfor Qq u 00 =7 : 14766.9422(0.38)6.9685(0.32)7.1406(0.11)7.1623(0.11) u 01 =1 : 37881.3676(0.21)1.3862(0.13)1.3755(0.08)1.3835(0.05) u 02 = 4 : 4886-4.2814(0.37)-4.3142(0.30)-4.4751(0.12)-4.4978(0.10) u 03 =2 : 00421.8882(0.27)1.8685(0.21)2.0106(0.09)2.0063(0.06) u 04 =0 : 66190.6899(0.17)0.7067(0.13)0.6583(0.07)0.6639(0.05) u 05 = 0 : 8356-0.8563(0.14)-0.8568(0.09)-0.8457(0.06)-0.8458(0.03) u 06 =0 : 43210.4542(0.15)0.4433(0.11)0.4421(0.06)0.4402(0.04) Covarianceparameters 2 0 : 1 =0 : 14360.1199(0.02)0.1188(0.03) 2 0 : 4 =0 : 02390.0224(0.003)0.0223(0.002) =0 : 870.8206(0.05)0.8202(0.05)0.8573(0.02)0.8568(0.02) ThelocationofthesimulatedQTLisdescribedbythemapdistances(incM)fromthe rstmarkerofthelinkagegroup(100cMlong). Results: SimulationresultsaresummarizedinTables 3{3 and 3{4 .The precisionofparameterestimationisevaluatedintermsofthesquarerootofthe meansquarederrors(SMSE)oftheMLEs.Ingeneral,themodelcanprovide

PAGE 61

PAGE 62

51 ratherthanjustincreasingthesamplesize.Forexample,theSMSEofthemean parameters u 20 forgenotype QQ decreasesfrom0.56to0.47whenincreasingthe samplesizefrom100to200usingtheAR(1)covariancestructure.However,this increaseofprecisionislessattractivewhencomparedwiththeimprovementwhen heritabilitylevelincreasesfrom0.1to0.4forthesamesamplesize100.Thispiece ofinformationsuggeststhatinpracticewell-managedexperiments,throughwhich residualerrorsarereducedandthereforeH 2 isincreased,aremoreimportant thansimplyincreasingsamplesizesbecausenormallyitcostsmoretoincrease thesamplesize.Itisalsointerestingtonotethatbothcovariancemodelsdisplay similarestimationprecision. 3.9 RealExample Totestthedevelopedmodel,arealdatasetwasanalyzed.Twoinbred lines,semi-dwarfIR64andtallAzucena,werecrossedtogenerateanF 1 progeny population.Bydoublinghaploidchromosomesofthegametesderivedfromthe heterozygousF 1 ,adoubledhaploid(DH)populationof123lineswerefounded (Huangetal.1997).SuchaDHpopulationisequivalenttoabackcrosspopulation becauseitsmarkersegregationfollowsa1:1ratio.With123DHlines,Huangetal. genotyped135RFLPand40isozymeandRAPDmarkerstoconstructagenetic linkagemapoflength2005cMwithanaveragedistanceof11.5cM,representing goodcoverageof12ricechromosomes(Fig. 3{1 ). The123DHlinesandtheirparents,IR64andAzucena,wereplantedina randomizedcompleteblockdesignwithtwoblocksintheeld.Eachblockwas dividedintodierentplots,eachcontainingeightplantsperline.Startingfrom10 daysoftransplanting,tillernumbersweremeasuredevery10daysforvecentral plantsineachplotuntilalllineshadheaded.Weusedthemeansofthetwoblocks forQTLanalysis.

PAGE 63

PAGE 64

53 Figure3{2. Dynamicchangesofthenumberoftillersfor123DHlinesofriceasan exampleofPCDinplants Figure 3{2 showsthedynamicsoftillernumbersforeachDHlinemeasured at9timepoints.TillergrowthisthoughttobeanexcellentexampleofPCDin plants(Greenberg1996)sinceitexperiencesseveraldevelopmentalstagesduring riceontogeny(Fig. 3{3 ).Attheearlydevelopmentalstage,tillernumberincreases dramaticallycorrespondingtothevegetativephaseinrice.Duringthereproductive phase,tillernumberdeclinesastheinitiationofpanicle,emergenceoftheagleaf (thelastleaf),booting,heading,andoweringofthespikelets.Tillersthatdo notbearpaniclesarecalledineectivetillers,andwillbekilledbygeneticcontrol. Thenumberofineectivetillersisacloselyexaminedtraitinplantbreeding sinceitisundesirableinirrigatedvarieties.Theineectivetillersresultinmany unwantedproblemsinricesuchasoverconsumptionofnutritionandcompetition forspace.Thegeneticcontrolsystemwillplayitskeyroleinreducingtheover producedtillersandbalancethericemetabolismsystemtohaveoptimalnutrition consumption.

PAGE 65

54 Figure3{3. Tillergrowth Datawereanalyzedwiththedevelopedmodelswithmeanmodelledbythe LEPandcovariancemodelledbyAR(1)orSAD(1)structure.Agenome-wide scanwasconductedatevery2cMdistanceonthe12chromosome.Ateachtest position,aLRtestisconducted.Arandomtestpositionwaschosentocalculate theheritabilitylevelwithdierentorderrangingfrom2to8beforescantheentire genome.Resultshowsthat H 2 iscloseto0.4(datanotshown).Basedonthis information,LEPorderisselectedunderthealternativeateachtestpositionusing theBICinformationcriterion.Whendoingthepermutationtest,theselected ordersweremaintainedateachtestposition.Themodelsuccessfullyidentiedfour majorQTLthatcontrolthedynamicPCDdevelopmentalprocesswiththeSAD(1) covariancestructureandfourmajorQTLwiththeAR(1)covariancestructure. Figure 3{4 showsthelog-likelihoodproleplotbetweenthefullandreduced (noQTL)modelfortillernumbertrajectoriesacrossthe12ricechromosomes.The

PAGE 66

PAGE 67

56 makeadecisiononwhichcovariancemodelisthe\right"one.However,itisclear thatthemappingpowerissensitivetothecovariancestructurebeingapplied. Table3{5. TheMLEsoftheparametersandtheirasymptoticstandarderrorsin theparentheseswiththeAR(1)covariancestructure Parameters Q 12 Q 31 Q 32 Q 9 QTLposition( )200 cM 220 cM 260 cM 119.16 cM Markerinterval RZ 730 RZ 801 RZ 337 A RZ 448 RZ 519 Pgi 1 RZ 792 Parametersfor QQ u 00 8.1221(0.25)8.5010(0.21)8.7429(0.20)8.0997(0.21) u 01 1.8070(0.15)1.5895(0.13)1.6282(0.14)1.3295(0.13) u 02 -5.5342(0.24)-5.6862(0.20)-5.8039(0.19)-5.3113(0.20) u 03 1.3254(0.17)2.2708(0.15)2.3644(0.15)2.3670(0.15) u 04 0.9909(0.14)0.4513(0.13)0.3681(0.13)0.4101(0.12) u 05 -1.0331(0.13)-0.8325(0.09)-0.8934(0.09)-0.8583(0.09) u 06 0.5397(0.11)0.9730(0.11)1.0336(0.11)0.8386(0.11) u 07 0.8828(0.18)--Parametersfor qq u 20 7.6154(0.23)6.9148(0.21)6.8063(0.21)7.4884(0.21) u 21 1.0413(0.14)1.2773(0.13)1.2711(0.14)1.5255(0.13) u 22 -4.7521(0.21)-4.3434(0.20)-4.3360(0.20)-4.8555(0.19) u 23 2.7560(0.18)1.8608(0.14)1.8183(0.15)1.8480(0.14) u 24 0.0881(0.12)0.6113(0.12)0.7262(0.14)0.6375(0.12) u 25 -0.7394(0.11)-0.7097(0.08)-0.6921(0.09)-0.7289(0.08) u 26 0.8853(0.11)0.3981(0.11)0.3185(0.12)0.5650(0.10) u 27 -0.4783(0.16)--Covarianceparameters 2 0.0453(0.006)0.0428(0.004)0.0346(0.004)0.0475(0.004) 0.8709(0.02)0.8552(0.02)0.8308(0.02)0.8688(0.01) Note: Q 12 Q 31 Q 32 and Q 9 refertotheQTLdetectedonchromosome1,3and9;* referstothechromosome-wisesignicantQTL. Tables 3{5 and 3{6 tabulatetheestimatedQTLpositiononthechromosome, themarkerintervalwheretheQTLbelongsto,theMLEsofcurveparametersthat specifythedevelopmentalpatternaswellastheasymptoticstandarderrorsofthe estimates.Mosttheparameterscanbereasonablyestimatedwithsmallsampling error,indicatinggreatpowerofthedevelopedmodels.Asclearlyindicatedinthe table,theordersatdierenttestpositionsaredierent.Forexample,theQTL detectedinchromosome1( Q 12 )withtheAR(1)covariancemodelhasorder7

PAGE 68

57 andtheQTLdetectedinchromosome9( Q 9 )hasorder7.Morethan85%oftest positionstendtochoosethe6 th order(datanotshown)andAR(1)tendstochoose alowerorderthantheSAD(1)model. Table3{6. TheMLEsoftheestimatedparametersandtheirasymptoticstandard errorsintheparentheseswiththeSAD(1)covariancestructure Parameters Q 11 Q 12 Q 31 Q 32 QTLposition( )108 cM 198 cM 220 cM 262 cM Markerinterval RZ 276 RG 146 RZ 730 RZ 801 RZ 337 A RZ 448 RZ 519 Pgi 1 Parametersfor QQ u 00 10.1080(0.25)8.8476(0.26)8.9333(0.19)9.1363(0.18) u 01 1.4258(0.25)1.9202(0.19)1.5415(0.16)1.6244(0.16) u 02 -6.2307(0.33)-6.1187(0.20)-5.5936(0.20)-6.2130(0.15) u 03 2.4717(0.22)1.3324(0.15)2.0395(0.13)2.0758(0.15) u 04 0.7185(0.27)1.0686(0.13)0.8554(0.16)0.5325(0.12) u 05 -1.4245(0.16)-1.1540(0.11)-1.1371(0.09)-1.1813(0.10) u 06 1.3406(0.17)0.6267(0.11)0.9195(0.10)1.0541(0.11) u 07 0.9732(0.21)1.1440(0.17)0.6945(0.14)0.7148(0.16) u 08 -1.0999(0.38)--0.7288(0.25)Parametersfor qq u 20 7.4620(0.14)7.5165(0.19)7.1258(0.20)6.9824(0.19) u 21 1.3765(0.12)1.0548(0.16)1.3297(0.17)1.2889(0.17) u 22 -4.4309(0.17)-4.6904(0.14)-3.927(0.22)-4.3434(0.16) u 23 2.0271(0.10)2.7183(0.13)2.1169(0.14)2.1184(0.17) u 24 0.7610(0.13)0.0544(0.10)0.8066(0.18)0.4214(0.13) u 25 -0.7168(0.07)-0.7741(0.09)-0.5475(0.10)-0.5721(0.11) u 26 0.5365(0.09)0.9289(0.10)0.3967(0.12)0.5472(0.12) u 27 -0.0612(0.09)-0.4435(0.14)-0.4249(0.15)-0.3716(0.17) u 28 -0.5427(0.20)--0.7679(0.28)Covarianceparameters 2 0.7249(0.04)0.7483(0.04)0.7866(0.04)0.7886(0.04) 0.8477(0.02)0.9013(0.02)0.8836(0.02)0.8420(0.02) ThedevelopmentaltrajectoryoftheidentiedQTLareplottedandareshown inFigures 3{5 and 3{6 withtillernumbertrajectoriesforallindividualsindicated inbackground.The4QTLdetectedbytheAR(1)modelarebetweenmarkers RZ730andRZ801(Ch1-2( Q 12 ))onchromosome1,betweenmarkerRZ337A andRZ448(Ch3-1( Q 31 )),betweenmarkerRZ519andPgi 1(Ch3-2( Q 32 ))on chromosome3andonmarkerRZ792onchromosome9(Ch9( Q 9 )).ThreeQTL detectedbytheSAD(1)modelarelocatedatthesamemarkerintervalasthe

PAGE 69

PAGE 70

PAGE 71

PAGE 72

PAGE 73

PAGE 74

PAGE 75

64 thestructureofthecovariancematrixamongamultitudeofdierenttimepoints, which,therefore,largelyreducesthenumberofparametersbeingestimatedfor variancesandcovariances,especiallywhenthedimensionofdataishigh. Inthischapter,wewilldevelopanovelstatisticalmodelforthegenome-wide scanofQTLthatguidePCDtowardanactiveprocessofcelldeath.Thismodel incorporatestwosequentiallydistinctstagesofthedevelopmentalprocessintothe mappingframeworkconstructedwithinthecontextofGaussianmixturemodels. Therststage,growth,hasproventoobeysomeuniversalgrowthlawthatcan bemodelledmathematicallybycurveparameters(vonBertalany1957).While itisdiculttondapropermathematicalequationtodescribethesecondstage {death{thatissubjecttoafastexponentialdecayofcellsfollowedbyaslowly decreasingfunction(Balabanetal.2004),anonparametricapproachbasedon theLegendrefunctionisderivedtomodelthePCDprocess.Thecombinationof parametricmodellingofthegrowthprocessandnonparametricmodellingofthe deathprocesslaysafoundationfor semiparametricfunctionalmapping ofPCD. Weimplementanonstationarymean-dependentcovariancemodeltocharacterize thestructureofthecovariancematrixamongcellnumbersmeasuredatamultitude ofdierenttimes.Thestatisticalbehaviorofourmodelisinvestigatedthrough simulationsstudies.Theutilityofthemodelinarealexampleofricesuggeststhat ourmodelcanbeusefulinpractice. 4.2 DierentPhasesofPCD Ingeneral,thewholeprocessofPCDcanbedescribedbyvereasonablywell distinctphases(Fig. 4{1 ;FoggandThake1987):lag,exponential,declininggrowth rate,stationaryanddeath.Eachofthephasesisdenedbelow. Lagphase: Thisistheinitialgrowthphase,duringwhichcellnumber remainsrelativelyconstantpriortorapidgrowth.Duringthisphase,theorganism

PAGE 76

Figure4{1. AtypicalexampleofPCDthatincludesvedierentstages,(1)lag,(2)exponential,(3)declininggrowth,(4)stationaryand(5)death. preparestogrowandunseenbiochemicalchanges,celldivisionanddierentiationoftissuesoccurduringthistime.

PAGE 77

Thedurationandextentofeachphasewilldependontheorganismandenvironmentalconditions.Forexample,iftissuesfromthestationaryphasearesuppliedwithfreshnutrients,thelagphasewillbelongerthanforthecaseoftissuesfromthedecliningphase.Forgrowingtissuesfromtheexponentialphase,organismssuppliedwithfreshnutrientswilllikelyskipthelagphase.Ifthegrowthnutrientisrich,organismswillremainintheexponentialgrowthphaseforalongerperiodandproduceagreaterbiomass.Furthermore,theirrateofgrowthintheexponentialphasemayalsobegreater. Growthcurvesmustbedrawnfromaseriesofgrowthmeasurementsatdierenttimesduringthegrowthcurve.Mathematicalequationshavebeenderived

PAGE 78

tomodelthegrowthfromthelagtostationaryphase(Westetal.2001).Itisdiculttondaspecicmathematicalequationforthedeathphase. 4.3 StatisticalModel 4.3.1 TheMixtureModel-BasedLikelihood ThestatisticalfoundationoffunctionalmappingforthePCDprocessisbasedonanitemixturemodel.Inthismixturemodel,eachcurvecorrespondingtoaQTLgenotypettedbyanitesetofmeasurementswithtimepointsforanyindividualisassumedtohavearisenfromoneofaknownorunknownnumberofcomponentswitheachcomponentbeingmodelledbyamultivariatenormaldistribution.Thatis where$=($1;;$J)0arethemixtureproportions(i.e.,QTLgenotypefrequencies)forJQTLgenotypewhichareconstrainedtobenonnegativeandPJj=1$j=1;'=('1;;'J)0arethecomponent(orQTLgenotype)specicparameters,with'jbeingspecictocomponentj;andareparameterswhicharecommontoallcomponents. Considerastandardbackcrossdesign,initiatedwithtwocontrastinghomozygousinbredlines.Thisbackcrossprogenypopulationcontainstwogroupsofgenotypesatalocus.Assumethatageneticlinkagemapcoveringtheentiregenomehasbeenconstructedwithpolymorphicmarkers.SupposethereisasegregatingQTLwithallelesQandqthataectsPCDcurvesortrajectoriesinthebackcrosspopulation.ThisQTLcannotbedirectlyobservedratherthanshouldbeinferredonthebasisofknownmarkerinformation.InQTLintervalmapping,wewillusetwoankingmarkerswithknowngenotypestoinferthegenotypeofaQTLthatishypothesizedtobelocatedbetweenthetwomarkers(LanderandBotstein1989).Therecombinationfractionisalinkageparameterthatcandescribethe

PAGE 79

68 geneticdistancebetweentwogivenloci.Let r r 1 and r 2 betherecombination fractionsbetweenthetwomarkers M 1 and M 2 ,betweentheleftmarker M 1 and QTLandbetweentheQTLandrightmarker M 2 ,respectively.Basedonthe segregationandtransmissionofgenesfromtheparenttoprogeny,onecanderive theconditionalprobabilitiesofan unknown QTLgenotype,conditionaluponthe known markergenotypes,intermsoftheserecombinationfractions(Table 2{1 ). TheunknownparametersthatspecifythepositionofQTLwithinamarkerinterval arearrayedin s ThelikelihoodfunctionofthelongitudinalPCDtrait( y )andmarker information( M )collectedforthisbackcrosspopulationsatthishypothesized QTLwithtwogenotypes j =1 ; 0isconstructedas L ( s ; m ; v j y ; M )= n Y i =1 [ $1 j i f ( y i j m 1 ; v )+$ 0 j i f ( y i j m 0 ; v )](4{2) where $1 j i and$ 0 j i containedin s arethemixtureproportionscorresponding tothefrequenciesofdierentQTLgenotypes,expressedastheconditional probabilitiesofQTLgenotypesgivenmarkergenotypesforindividual i m j containstheparametersthatmodeltime-dependentmeansforgenotype j and v containstheparametersthatmodelthestructureoftheresidualcovariance matrixwhichisassumedtobecommontoallmixtures.Thelikelihoodfunction isconstructedonthebasisoftheassumptionofindependentresponsebetween individuals.Themultivariatenormaldistributionofeachmixtureforprogeny i measuredat timepointsisexpressedas f ( y i ; M j m j ; v )= 1 (2 ) = 2 j j 1 = 2 exp 1 2 ( y i m j ) T 1 ( y i m j ) ; (4{3) where y i =[ y i (1) ; ;y i ( )]isavectorofobservationforprogeny i and u j = [ u j (1) ; ;u j ( )]isameanvectorforallprogenieswithgenotype j .Thismeans thatallprogenieswiththesameQTLgenotypewillhavethesamemeanat

PAGE 80

69 dierenttimepoints.Ataparticulartimepoint,say t ,therelationshipbetweenthe observationandexpectedmeancanbedescribedbyalinearregressionmodel y i ( t )= i u 1 ( t )+(1 i ) u 0 ( t )+ e i ( t ) ; where i isanindicatorvariabledenotedas1forQTLgenotype j =1and0for QTLgenotype j =0and e i ( t )istheresidualerrorwhichis iid normalwithzero meanandvariance 2 ( t ).Theerrorsforprogeny i attwodierenttimepoints, t 1 and t 2 ,arecorrelatedwithcovariancecov( y i ( t 1 ) ;y i ( t 2 )).Thevariancesand covariancescomprisethecovariancematrix whoseelementsarethecommon parametersspeciedby v .Withthesegeneralsettings,thestatisticalchallenging becomeshowtomodelthemeanprocessandhowtostructurethecovariance matrix. 4.3.2 SemiparametricModellingoftheMeanVector Inabroadsense,theentirePCDprocessforaparticularindividual i canbe dividedintotwophases,growthanddeath(Fig. 4{1 ).Let t i bethetransition timepointwhichmarkstheendofgrowthphaseandbeginningofthedeath phase.ThemeanvectorinthemultivariatenormaldistributionofPCD(Eq. 4{3 ) forindividual i thatcarriesQTLgenotype j cannowbespeciedbytwomean sub-vectorsexpressedas u j j i =( u G j j i ; u D j j i )(4{4) where u G j j i and u D j j i correspondtothegrowthanddeathvectorbeforeandafter t i ,respectively. Theparametricmodelofgrowthphase .Theprocessofgrowth(before t i ) followsuniversalgrowthlawsandcanbedescribedbybiologicallymeaningful mathematicalfunctions.Asanearlyuniversalbiologicallawforlivingsystems,the sigmoidal(orlogistic)growthfunctioncanbettedtocaptureage-specicchanges duringthegrowthphase(Westetal.2001).Thelogisticfunctionismathematically

PAGE 81

70 describedforindividual i by g i ( t )= a i 1+ b i e c i t ; with t [1 ;t i ](4{5) where a i istheasymptoticorlimitingvalueof g i when t !1 a i = (1+ b i )isthe initialvalueof g i when t =0and c i istherelativerateofgrowth(vonBertalany 1957).Thus,withthesethreeparameters,onecanuniquelydeterminetheshapeof PCDinthegrowthphaseforindividual i thatcarriesQTLgenotype j ,andhave thetime-dependentmeanvectorforEq. 4{4 speciedby u G j j i =[ u G j j i (1) ; ;u G j j i ( t i )] = a j 1+ b j e c j ; ; a j 1+ b j e c j t i : (4{6) IfdierentgenotypesataputativeQTLhavedierentcombinationsofthe parameters( a j ;b j ;c j ),thisimpliesthatthisQTLplaysaroleingoverningthe dierenceofgrowthtrajectories. Thenonparametricmodelofdeathphase .Sincenoparticularmathematical functioncanbeusedtodescribethedeathphase(after t i ),thenonparametric approachbasedontheorthogonalLegendrepolynomial(LEP)isused.The exibilityofLEPwillgreatlyincreasetherobustnessoffunctionalmapping. Withappropriateorder r ,thetime-dependentgenotypicvaluesfordierent QTLgenotypesinthedeathphasecanbettedbytheorthogonalLEP.Afamily ofsuchpolynomialswithnormalizedtime t 0 isdenotedby P ( t 0 )=[ P 0 ( t 0 ) ;P 1 ( t 0 ) ; ;P r ( t 0 )] T andavectorofgenotypic-related,time-independentvalueswithorder r isdenoted by v j =[ v 0 j ;v 1 j ; ;v rj ] T

PAGE 82

wherevjiscalledthebasegenotypicvectorforQTLgenotypejandtheparameterswithinthevectorarecalledthebasegenotypicmeans.Thenormalizedtimet0isobtainedbyadjustingtheoriginalmeasurementtimettomatchtheorthogonalfunctionrange[-1,1],byt0=1+2(ttmin) Withthesespecications,thetime-dependentgenotypicvalues,uDj(t),inthedeathphasecanbedescribedasalinearcombinationofvjweightedbyseriesofthepolynomials.Thatis Thus,forindividualiwhoseQTLgenotypeisj,weusethefollowingexpressiontomodelgenotypemeansinthedeathphase Thisapproachhasagreatexibilityinmodellinglongitudinaldatathatcannotbettedbyaparametricmodel.Bychoosinganappropriateorder,thenonparametricmodelcanbettercapturetheintrinsicpatternofdevelopmentalPCD.Thenumberofparameterscanbereducediftheorderofthepolynomialislessthanthenumberoftimepoints,asanusualcase. Legendreorderselection .TodeterminewhichorderoftheLEPbesttsthedata,weneedtoselecttheoptimalorder.OneofthepopularmodelselectioncriteriaistheAICinformationcriterion(Akaike1974).TheAICvalueataparticularorder,r,iscalculatedby AIC=2lnL(bG;bD;bvjr)+2dimension(G;D;vjr); wherebG=fbaj;bbj;bcjg1j=0,andbD=fb0j;b1j;:::;brjg1j=0arethetheMLEsofparametersforthegrowthcurvefunctionandtheLegendrepolynomialoforderr,

PAGE 83

72 b v containstheMLEsofthecovarianceparameters,anddimension( G ; D ; v j r ) representsthenumberoffreeparametersunderorder r .Theoptimalorderisone thatdisplaystheminimumAICvalue. Anothermodelselectioncriteriontodeterminetheoptimalorderofthe LegendrefunctionistheBayesianInformationCriterion(BIC)(Schwarz1978), whichiscalculatedby BIC= 2ln L ( b G ; b D ; b v j r )+dimension( G ; D ; v j r )ln( n ) : (4{10) wherealltheparametersaredenessimilarlyasaboveexceptthat n isthetotal numberofobservationataparticulartimepoint.SincetheBICcriterionadjusts theeectofsamplesize,themodelselectedbyBICwillbemoreparsimonious. 4.3.3 ModellingtheCovarianceStructure Tomodelthecovariancestructureforlongitudinaldata,weneedtomakethe followingassumptions:(1)theerror e i ( t )inEq. 2{3 isnormallydistributedwith meanzeroandvariance 2 ( t ),and(2)theerror e i ( t )isindependentandidentically distributedamongdierentindividuals.Anumberofstatisticalmodelshavebeen usedtomodelthecovariancestructure(Diggleetal.2002).Inearlierfunctional mapping,therst-orderautoregressive(AR(1))modelhasbeenused(Maetal. 2002),whichisexpressedas 2 (1)= = 2 ( )= 2 forthevariance,and ( t 1 ;t 2 )= 2 j t 2 t 1 j (4{11) forthecovariancebetweenanytwotimeintervals t 1 and t 2 ,where0 << 1is theproportionparameterwithwhichthecorrelationdecayswithtimelag.The parametersthatmodelthestructureofthe(co)variancematrixarearrayedin v .

PAGE 84

PAGE 85

74 elementsof e i forindividual i isassumedtofollowapattern,expressedas corr( e i )= R i ( ) ; wherethecorrelationmatrix R i ( )isafunctionofavectorofcorrelation parameters .ThecorrelationstructurecanbedescribedbytheAR(1)model inwhich = (Eq. 4{11 ). Instep2,time-dependentvariancesarespeciedaccordingtoHorwitz's Ruleinanalyticalchemistry.Thisruleproposesthatthereexistsanempirical relationshipbetweenconcentrationandvariance(Atkinson2003).Thus,we cansimilarlymodeltime-dependentvariancesforindividual i byconsideringits genotypicmeansatvarioustimepoints,expressedas 2 i ( t )= 2 u 2 j j i ( t ) : Therefore,thecorrespondingcovariancematrixcanbemodelledby j j i =cov( y )= 2 u 1 2 j j i R i ( ) u 1 2 j j i ; (4{12) where u j j i =diag[ u 2 j j i (1) ; ;u 2 j j i ( )].Sincethecovariancestructureismodelledas afunctionofgenotypicmean,itiscalledthemean-covariance(M-C)model.The M-CmodelhasagreatexibilityformodellingthecovariancematrixofthePCD process.TheunknownparameterstobeestimatedintheM-Cmodelarearrayedin v =( 2 ; )withacceptedmeans. 4.3.4 ComputationAlgorithms Anumberofcomputationalalgorithmscanbeusedtoestimatetheunknown parameters, =( s ; m ; v ).Thesealgorithmsincludethemaximumlikelihood methodimplementedwiththeEMalgorithm(Dempsteretal.1977).Asusual, theQTLpositionparameter s canbeviewedasaknownparameterbecausea putativeQTLcanbesearchedatevery1or2cMonamapintervalbracketed

PAGE 86

PAGE 87

76 entirecurvecontinuous.Thesecondconstraintisthatthetwofunctionshasthe samescoreattime t i tomaketheentirecurvesmooth.Thesetwoconstraintsare expressedas 8 > < > : u G j j i ( t i )= u D j j i ( t i ) @ @t i u G j j i ( t i )= @ @t i u D j j i ( t i ) Withtheseconstraints,weobtaintheexpressionsofonegrowthparameter andonedeathparameterforanyQTLgenotype j .Forexample,iftheLegendre polynomialorderis3,wecansolvetheequationstoobtaintheestimatesof a j and v 1 j asfollows 8 > > < > > : a j = 2[ v 0 j 0 : 5(3 t 0 2 i +1) v 2 j 5 t 0 3 i v 3 j ](1+ b j e c j t i ) 2 1+ b j e c j t i b j c j ( t ) t 0 i e c j t i v 1 j = a j b j c j ( t ) e c j t i 2(1+ b j e c j t i ) 2 3 t 0 i v 2 j 0 : 5(15 t 0 2 i 3) v 3 j where t 0 i istheadjustedtimefor t i and t = t max t min 4.3.5 CalculatingCurveHeritability Itiseasytocalculatetheheritabilitylevel( H 2 )whentraitsaremeasuredata singletimepoint.Butforlongitudinaltraits,heritabilitycalculationisdicult.We proposetwowaystodoit: 1. Calculate H 2 atasingletimepoint t wherethetraitsshowsthehighest variation.Forabackcrossdesign,thegeneticvariationisgivenby 2 g = 1 4 [ u 1 ( t ) u 0 ( t )] 2 ,where u j ( t )( j =1 ; 0)isthegeneticmeanforgenotype j at time t ,andtheheritability H 2 ( t )= 2 g ( t ) = 2 ( t )iscalculatedwhere 2 ( t )is theresidualvarianceattime t 2. Calculate H 2 usingtheareaundercurve(AUC).Functionalmappingmaps thedynamicgeneeectovertime.Thegeneticvariationexplainedbythe entiremeasurementperiodismoreinformativethanthatbyindividualpoint time.Weproposetocalculatethegeneticvariationby 2 g = 1 4 (AUC 1 AUC 0 ) 2 ,whereAUC 1 andAUC 0 aretheAUCfortwodierentgenotypes.

PAGE 88

PAGE 89

78 fullmodel).Theteststatisticfortestingthehypothesesiscalculatedasthe log-likelihoodratioofthereducedtothefullmodelasgivenbelow LR= 2[log L ( e j y ; M ) log L ( b j y ; M )] ; (4{14) where e and b denotetheMLEsoftheunknownparametersunderH 0 andH 1 a respectively.ThecriticalthresholdvaluefordeclaringthepresenceofQTLcanbe empiricallycalculatedbasedonthepermutationtests(ChurchillandDoerge1994). OtherhypothesescanbemadetotestifthedetectedQTLonlycontrolsthe growthphasewiththefollowingalternativehypothesis H 1 b : u G 1 6= u G 0 and u D 1 = u D 0 (4{15) orifthedetectedQTLonlycontrolsthedeathphasewiththefollowingalternative H 1 c : u G 1 = u G 0 and u D 1 6= u D 0 (4{16) Thecriticalthresholdsfortheabovetwohypothesescanbedeterminedusing simulationstudies.Onlywhenboth H 1 b and H 1 c arerejected,thedetectedQTLis thoughttopleiotropicallyaectthegrowthanddeathphases. TheproposedmodelcanbeusedtotesttheinuenceofQTLongrowthin dierentstagesofdevelopment,lag,exponential,declininggrowth,stationaryand death.Thesetestscanbebasedontheareaundercurveduringatimecourseof interest.Simulationstudiesareusedtodeterminethecriticalthresholdsforeach test. 4.5 Results 4.5.1 AWorkedExample Iusetheproposedmodeltoanalyzearealdatasetofrice.Datadescriptionis giveninChapter3.Figure 3{2 showsthedynamicsoftillernumbersforeachDH linemeasuredat9timepoints.Tillergrowthisthoughttobeanexcellentexample

PAGE 90

79 Table4{1. ModelselectionforLegendrepolynomialordersbasedontheAICand BICvaluesundertheM-Ccovariance-structuringmodel SelectionOrder criterion01234 AIC2437.81096759.74 559.87 670.65 BIC2453.91109.4775.77 578.58 692.03 ofPCDinplants(Greenberg1996)sinceitexperiencesseveraldevelopmental stageswhenricesgrow.OursemiparametricmodelwasusedtomapspecicQTL thatdeterminethedynamicchangesoftillernumberduringontogeny.Although thegrowthphaseoftillernumbercanbewellmodelledbyalogisticequation denedbyparameters a;b and c (Eq. 4{5 ),noproperequationcanbeusedto modelthedeathphase.Forthisreason,anonparametricapproachbasedonthe LegendrepolynomialfunctionisadoptedintheframeworkofQTLmapping.But thiswillencountertheissueoforderdetermination.Todetectthebestorderofthe Legendrepolynomialfunctionforthisricedataset,wecalculatedtheAICandBIC informationcriteriaforvariousorders(Table 4{1 ).Boththecriteriaprovidethe consistentresultthatthedeathphaseoftillernumbercanbebestexplainedbythe Legendrepolynomialoforder3. BygenomewidescanningforQTLatevery2cMwithineachmarkerinterval across12ricechromosomes,ourmodelhasprobedthreemajorQTLthattrigger theireectsontheoverallPCDprocessoftillernumber.Asshownbythe genomewidelog-likelihoodratio(LR)prolebetweenthefullandreduced(no QTL)modelfortillernumbertrajectoriesinFig. 4{2 ,thesethreeQTLarelocated betweenmarkersRG146andRG345andbetweenmarkersRZ730andRZ801on chromosome1andonmarkerRZ792onchromosome9.Thepositionsofmarkers onthelinkagegroups(Huangetal.1997)areindicatedatticks.Thethreshold valueforclaimingtheexistenceofQTLisgivenasthehorizonaldottedlinefor thegenome-widelevelanddashedlineforthechromosome-widelevel.Ofthese

PAGE 91

80 Figure4{2. TheLRproleplot.Thegenomicpositionscorrespondingtothepeak ofthecurvearetheMLEsoftheQTLlocalization(indicatedbythe arrows). threedetectedQTL,therstissignicantgenomewide,whereastheothertwoare signicantchromosomewide,allatthe5%signicancelevelbasedonthecritical thresholdsdeterminedfromthepermutationtests. ToknowmoreaboutthebehaviorofthedetectedQTL,wetabulatetheMLEs ofcurveparametersthatspecifythegrowthphaseandgenotypicbasiseects thatspecifythedeathphase,alongwiththeapproximatestandarderrorsofthe estimates(Table 4{2 ).Alltheparametersthatspecifythegrowthanddeathphases fordierentQTLgenotypes( j =1for QQ or0for qq intheDHpopulation)are estimatedwithreasonablyhighprecisionasshownbythestandarderrors,although theestimationprecisiontendtobebetterforthegrowthparametersthanforthe deathparameters(Table 4{2 ).Theparametersthatmodelthestructureofthe covariancematrixbasedontheM-Cmodelcanalsobewellestimated,suggesting goodbehaviorofourmodel.

PAGE 92

81 Table4{2. TheMLEsofthegrowthanddeathparameters(andtheirapproximate standarderrorsintheparentheses)forsignicantQTLdetectedon dierentchromosomesfortillernumbersinaDHricepopulation ParametersChromosome1Chromosome2 Chromosome9 QTLposition( )120 cM 198 cM 119.1 cM MarkerintervalRG146{RG345RZ730{RZ801RZ792 Growthparameters a 2 10.597311.082410.7188 b 2 7.4938(0.20)7.5167(0.28)7.8547(0.35) r 2 1.7257(0.02)1.8486(0.03)1.6747(0.04) a 0 12.382311.546111.5739 b 0 9.6068(0.38)8.9779(0.34)8.3427(0.39) r 0 1.8834(0.03)1.661(0.03)1.8311(0.04) Deathparameters u 20 8.5001(0.22)8.6888(0.32)8.5837(0.32) u 21 -2.4290-2.7785-2.4647 u 22 0.1008(0.05)0.1326(0.07)0.0989(0.07) u 23 0.5118(0.04)0.5685(0.05)0.5312(0.05) u 00 9.6225(0.37)9.2706(0.31)9.1128(0.35) u 01 -3.2068-2.6222-2.8732 u 02 0.1472(0.08)0.0964(0.07)0.1030(0.07) u 03 0.6575(0.07)0.5765(0.05)0.5784(0.06) Covarianceparameters 2 0.0493(0.01)0.0508(0.004)0.0535(0.005) 0.8747(0.01)0.8738(0.01)0.8786(0.01) where*referstothechromosome-widesignicantQTL UsingtheMLEsofparametersforthegrowthanddeathphases,wedraw thedevelopmentaltrajectoriesoftillernumberforthetwodierentQTL genotypes(Fig. 4{3 )withtillernumbertrajectoriesforallindividualsindicated inbackground.EachQTLshowsauniquedevelopmentalpatternovertime.For example,thedynamicprocessofgeneticeectsfortheQTLlocatedbetween markersRZ730andRZ801onchromosome1isdierentthanthosefortheother twoQTL.Statisticaltestsbasedonhypotheses 4{15 and 4{16 showthattheQTL

PAGE 93

Table4{3. TheMLEsoftheQTLpositionandthemodelparametersderivedfrom100simulationreplicates.ThesquaredrootsofthemeansquareerrorsoftheMLEsaregiveninparentheses. Trueparametern=100n=200n=100n=200 ThelocationsofthetwoQTLaredescribedbythemapdistances(incM)fromtherstmarkerofthelinkagegroup(100cMlong). 4.5.2 Simulation Weperformedaseriesofsimulationstudiestoexaminethestatisticalpropertiesofthemodel.Sixequidistantmarkersaresimulatedfromabackcross

PAGE 94

83 populationandareorderedas M 1 M 6 onalinkagegroupwiththelengthof 100cM.TheHaldanemapfunctionwasusedtoconvertthemapdistanceinto therecombinationfraction.Dierentheritabilitylevels( H 2 =0 : 1vs0.4)and dierentsamplesizes( n =100vs200)wereconsideredinthesimulationstudy toexaminemodel'sperformancesunderdierentscenarios.TheputativeQTLis locatedbetweenmarker M 3 and M 4 ,at48cMfromtherstmarker.Dataare simulatedbyassumingthatQTLcontrolstheentiredevelopmentalprocess.The simulateddatahave9continuoustimepoints.Themeansatdierenttimepoints usedtomodelthecovariancematrixbasedontheM-Cmodelaretheaverage ofthetwogenotypicmeans.Table 4{3 liststheresultsfromsimulation.The trueparametersaregivenintheleftcolumn.Ingeneral,ourmodelcanprovide reasonableestimatesoftheQTLpositionsandthegrowthanddeathparameters determinedbytheQTL,withestimationprecisiondependingonheritability levelandsamplesize.Inallcasesofdierentsamplesizesandheritabilities,the maximumvaluesoftheLRlandscapesfrom100simulationreplicatesarebeyond thecriticalthresholdsatthe =0 : 001leveldeterminedfrom1000permutation testsforthesimulateddata,suggestingthatourmodelhas100%powertodetect QTLintheseconditions.Theprecisionofparameterestimationisevaluated intermsofthesquarerootofthemeansquarederrors(SMSE)oftheMLEs. TheQTLpositionsandeectscanbebetterestimatedwhenthePCDtrait hasahigherthanlowerheritabilityorwhensamplesizeislargerthansmaller (Table 4{3 ).Buttheincreaseof H 2 from0.1to0.4leadstomoresignicant improvementfortheestimationprecisionthantheincreaseof n from100to200. Forexample,theSMSEofthegrowthparameter c 0 forQTLgenotype qq reduces bymorethanonefoldwhen H 2 isincreasedfrom0.1to0.4foragivensamplesize, whereassuchreductionismuchsmallerwhen n isincreasedfrom100to200fora givenheritability.Thissuggeststhatinpracticeitismoreimportanttomanage

PAGE 95

PAGE 96

B C Plotofthedynamicchangesoftillernumbersfortwocurveseachpresentingtwogroupsofgenotypes,QQandqq,ateachofthethreeQTL.A)QTLdetectedbetweenmarkersRG146andRG345,andB)QTLbetweenmarkersRZ730andRZ801onchromosome1,andC)QTLonmarkerRZ792onchromosome9.

PAGE 97

PAGE 98

87 ItcanalsobeseenthatthePCDmappingmodelproposedinthisarticle isdierentfromearlierpublishedfunctionalmappingapproachespurelybased onparametricmodelling.ThecurrentmodelsplitsthePCDprocessintotwo sequentiallydistinctphases|growthanddeath.Asinaparametricmodel, universalgrowthlaws(Westetal.2001)areusedtomodelthegrowthphase, whereasnonparametricmodellingisperformedforthedeathphase.This combinationofparametricandnonparametricmodels,referredtoas semiparametricmodelling ,isaimedtoovercometheproblemofthedeathphasethatcannot bemathematicallydescribed.Althoughitisfeasibleforanonparametricapproach tomodelanyformofcurveincludingtheentiregrowth-deathcurvelikeFig. 4{1 thisapproachdoesnotmakefulluseofbiologicalinformationcontainedinthe growthphaseand,therefore,losestheadvantagesofparametricfunctionalmapping inthebiologicalinterpretationofresults(seeWuetal.2004a). NonparametricpartofoursemiparametricmodelisbasedonLegendre-polynomial approaches.AsshownbyKirkpatrickandHeckman(1989),Legendrepolynomials haveseveralfavorablepropertiesforcurvettingwhichinclude:(1)thefunctions areorthogonal,(2)itisexibletotsparsedata,(3)higherordersareestimable forhighlevelsofcurvecomplexityand(4)computationisfastbecauseofgood convergence.Othernonparametricregressionmethodsusingkernelestimates havebeenconsideredforthemeanstructureofgrowthcurvedatabyAltman (1990),Boularanetal.(1994),WangandRuppert(1995)andFerreiraetal. (1997).Relativetononparametricmodelingofthemeanstructure,nonparametric covariancemodellinghasreceivedlittleattention.LeonardandHsu(1992)derived aBayesianapproachfornonparametricestimatesofthecovariancestructure. DiggleandVerbyla(1998)usedkernel-weightedlocallinearregressionsmoothingof samplevariogramsordinatesandofsquaredresidualstoprovideanonparametric estimatorforthecovariancestructurewithoutassumingstationarity.Itis

PAGE 99

appealingtoincorporatethesenonparametricorsemiparametricapproachesintoourfunctionalmappingframeworktoincreasethemodel'sexibility.

PAGE 100

PAGE 101

PAGE 102

PAGE 103

PAGE 104

PAGE 105

missing-dataindicatormatrixcanbedenedasM=fMijgsuchthat IfwefurtherassumethatMisarandomvariable.ThenthejointdistributionofYandMcanbewrittenas wheretheunknownparameteriscalledthedistributionofthemissing-datamechanism;istheparameterofinterest;;istheparameterspaceof(;);fsatisescertaindistribution.Therefore,theunobservedcomponentsofYarecalledMCARiff(MjY;)=f(Mj),MARiff(MjY;)=f(MjYobs;),andMNARifthedistributionofMdependsonmissingvaluesofY. InQTLmappingstudy,typicallytherearethreetypeofmissinginformation:missingQTLinformation,missingmarkerinformationandmissingphenotypictraitdata.Therstmissinginformationisthemajormissingissues.ThegeneralstatisticalQTLmappingstudyisaninferenceprocessinwhichoneintendstosetuparelationshipofmissingQTLinformationwiththeobservedphenotypictraitvaluethroughtheobservedmarkerdata.Thisprocessallowsustodoanumberofhypothesistodetectanymajorgeneeectduringontogeny.MixturemodelhasplayedapivotalroletoinferthemissingQTLinformation.Theothertwomissingdataproblemshavebeenpaidlittleattention.Duetoanunexpectedsituation,missingmarkerandphenotypictraitsareverycommonintherealworld.ThesemissingdatarepresentcommonstatisticalproblemsinQTLmapping.Inpractice,ifatraitvalueormarkerinformationaremissingforanindividual,thatindividualwillbeomittedfromtheanalysis.Thiswillgreatlyreducethesamplesizeandcorrespondinglyreducethemappingpower.Ifmarkerdataaremissing

PAGE 106

PAGE 107

PAGE 108

TheMLEsoftheparameterscontainedin=(m;v)underdierentmodellingdesignsarederivedasfollowswiththesymbol]denotestheestimatesofparametersfromthepreviousstep.Thevaluesof(m];v])estimatedfromthefollowingequationswillbeusedtoprovidenewestimatorsof(m;v)inthenextstep.@ @'log()=nXi=12Xj=1ik@ @'fj(yij) @logfj(yij)=nXi=12Xj=1ij@ @'logfj(yij) wherewedene ij=ijfj(yij) TheMLEsoftheparameterscontainedin(m;v)areobtainedbysolving @'log()=0(A{2) However,directMLEscannotbeobtaineddirectlybecauseofthemixturedistributionproblem.EMalgorithmhasbeenappliedtosolvetheseunknownsiteratively. 97

PAGE 109

Ifwedene and whereXismatrixofaLEGbasefunctionwithorderrwhichissimilarasthedesignmatrixinregressionanalysisandjisthebasegenotypicvectorforgenotypejwhichcontainsthemeanparameterstobeestimated.Thenwehavethemeanvectoratdierenttimepointsasmj=Xj. A.1 NonparametricwithAR(1)CovarianceStructure TheAR(1)covariancestructureisgivenas=22666666641112...121377777775 (1)jj=(2)(12)1, (2)(yimj)01(yimj)=1

PAGE 110

99 where ( )= 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 1 ( 1+ 2 ) ( 1+ 2 ) 0 0 ( 1+ 2 ) 2 +1 ( 1+ 2 ) ( 1+ 2 ) 0 0 ( 1+ 2 ) 2 +1 ( 1+ 2 ) ( 1+ 2 ) . . . 0 0 ( 1+ 2 ) 2 +1 ( 1+ 2 ) ( 1+ 2 ) 0 ( 1+ 2 ) 1 ( 1+ 2 ) 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 (A{5) SolvingEquation( A{2 )withrespecttotheunknownstogettheMLEsofthe unknownparameters,weget ^ j = P n i =1 P 2 j =1 ij X 0 1 y i P n i =1 P 2 j =1 ij X 0 1 X = P n i =1 P 2 j =1 ij X 0 ( ) y i nX 0 ( ) X Furtheralgebrashowsthat ^ u sj = A B j =1 ; 0ands=1 ; ; r where A = n X i =1 ij ff 2[ y i (1) u 0 j ( s ) P ( s ) ( t 0 1 )] P s ( t 0 1 )+[ y i ( ) u 0 j ( s ) P ( s ) ( t 0 )] P s ( t 0 ) g + 1 X k =1 p s ( t 0 k )[ y i ( k ) u 0 j ( s ) P ( s ) ( t 0 k )+ y i ( k +1) u 0 j ( s ) P ( s ) ( t 0 k +1 )]( 2 +1) g B = n X i =1 ij f 2[ P 2 s ( t 0 1 )+ P 2 s ( t 0 )] +2 1 X k =1 P s ( t 0 k ) P s ( t 0 k +1 )( 2 +1) g andtheMLEsofcovarianceparameters ^ 2 = P n i =1 P 2 j =1 ij ( y i u 0 j P ( t 0 )) 0 (^ )( y i u 0 j P ( t 0 )) n ^ = n ^ 2 ( 1)^ ] 3 + C (^ ] 2 +1) n ( 1)^ 2 D

PAGE 111

PAGE 112

101 diagonalentries. L = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 100 0 100 0 0 10 0 . . . 0 0 10 0 0 1 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 and ( )= 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 2 +1 0 0 2 +1 0 0 2 +1 . . . 0 0 2 +1 0 0 1 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 SolvingEq. A{2 withrespecttotheunknownstogettheMLEsofthe unknownparameters,weget ^ = P n i =1 P 2 j =1 ij X 0 1 y i P n i =1 P 2 j =1 ij X 0 1 X = P n i =1 P 2 j =1 ij X 0 ( ) y i nX 0 ( ) X Furtheralgebrashowsthat ^ u sj = A B j =1 ; 0 s =1 ; ;r

PAGE 113

whereA=nXi=1ijf1Xk=1f[yi(k)u0j(s)P(s)(t0k)]Ps(t0k+1)+[yi(k+1)u0j(s)P(s)(t0k+1)]Ps(t0k)g1Xk=1[yi(k)u0j(s)P(s)(t0k)]Ps(t0k)(2+1)[yi()u0j(s)P(s)(t0)]Ps(t0)gB=nXi=1ijf21Xk=1[Ps(t0k)Ps(t0k+1)](2+1)1Xk=1P2s(t0k)P2s(t0)g

PAGE 114

APPENDIXB DERIVATIONOFEM-SIMPLEXALGORITHMWITHSEMIPARAMETRIC APPROACH EM-Simplexalgorithmisimplementedtoavoid\bad"convergenceof covarianceparameters.FortheM-CcovariancemodelwithAR(1)correlation structuregiveninEq. 4{12 ,wehave = 2 u 1 2 R ( ) u 1 2 where u = diag f u 2 (1) ; ; u 2 ( ) g u ( t )referstotheaveragemeanforthetwogenotypesattime t ,and R ( )isthe AR(1)correlationmatrix.Thismodelhasthefollowingproperties 1. log j j = log( 2 )+( 1)log(1 2 )+ X t =1 log( u j ( t )) 2. @ @ ( y i u j ) T 1 ( y i u j ) = 1 2 ( 2 1) 4 ( 2 1 X t =1 [ y i ( t ) u j ( t )][ y i ( t +1) u j ( t +1)](1+ 2 ) u j ( t ) u j ( t +1) + 2[ y i (1) u j (1))] u 2 j (1) + 2[ y i ( ) u j ( ))] u 2 j ( ) Therefore,bysolvingEq. A{2 ,wehave ^ 2 = P n i =1 P 1 j =0 i j j ( y i ^ u j ) T ^ u 1 2 (^ ) ^ u 1 2 ( y i ^ u j ) n and ^ = n ^ 2 ( 1)^ ] 3 + C (^ ] 2 +1) n ( 1)^ 2 D where ( ^ )isgivenin A{5 and C = n X i =1 1 X j =0 i j j 1 X t =1 [ y i ( t ) ^ u j ( t )][ y i ( t +1) ^ u j ( t +1)] u j ( t ) u j ( t +1) D = n X i =1 1 X j =0 i j j [ y i (1) ^ u j (1)] 2 u 2 j (1) + [ y i ( ) ^ u j ( )] 2 u 2 j ( ) 103

PAGE 115

104 Estimationprocedures: (1) E-step :Giveninitialvaluesfor( m ; v ),calculatetheposteriorprobability matrix = f i j j g inEq. A{1 (2) M-step :Withthemeanparameterscontainedin m fromthepreviousstep, calculatethemeanvector u andupdatethecovarianceparameters^ 2 and^ Theseupdatedcovarianceparametersareusedinthesimplexsteptomaximize themeanparameterscontainedin m (3)Theaboveproceduresareiterativelyrepeateduntilacertainconvergence criterionismet.TheconvergingvaluesaretheMLEsoftheparameters.

PAGE 116

APPENDIXC CONSISTENCYOFMODELSELECTIONCRITERIA C.1 TheModel Considerthemultipleregressionmodel y = X + e where y isan 1vectorofobservations, X isaknown ( r +1)design matrix, =( 1 ; ; r +1 )isanunknownparametervectorand e isan 1error vectorwhichisassumedhavemultivariatenormaldistributionwith MVN ( m ; ), whereisassumedtobestructured.Ifwehavearandomsampleofobservation y i ;i =1 ; ;n ,ourgoalistotaregressioncurvethatbestexplainsthetrue meancurveof m .Forthecurrentstudydesign,thedesignmatrixXisgivenin Eq. A{3 andtheparametervector j forgenotype j isgiveninEq. A{4 .Selecting thebestorderoftheLEPfunctionisequivalenttoselecttheLEPcoecients(or regressionvariables). Underthenullhypothesis,the-2log-likelihoodfunctionisexpressedas 2  ( m ; v j y ) / log j j + n X i =1 ( y i X ) 0 1 ( y i X ) andthescorefunctionof isgivenby U ( )= @ ( m ; v ) @ = n X i =1 X 0 1 ( y i X ) whichleadtotheMLEof as ^ =( P n i =1 X 0 1 X ) 1 P n i =1 X 0 1 y i Underthealternativehypothesis,therearetwogenotypesforabackross design.Themeanvectorforgenotype j atdierenttimepointsisexpressedas 105

PAGE 117

(2)=2jj1=2exp1 2(yiXj)01(yiXj) C.2 OrderSelectionCriterion Underthecurrentstudydesign,weaimtotthedevelopmentalPCDcurvewithLEPfunctionwithoptimalorderrthatbestexplainsthevariationofthedynamicPCDprocess.OurselectiongoalistoselectanoptimalorderthathavegreatexibilitytocapturethedevelopmentalPCDpatternandalsoisparsimoniousenough.ThemodelhasbeendescribedinChapter3indetails.WeproposetousethefollowinginformationcriterionICk(n)=2(kjy)+c(n)pk() whereICk(n)istheinformationformodelmk;k=(1k;0k;v)withvcontainingthecommoncovarianceparameters;c(n)isapenaltytermwithc(n)=2givingtheAICcriterionandc(n)=log(n)givingtheBICcriterion;pk()isthenumberoffreeparametersfortheselectedmodelmk.

PAGE 118

C.3 ConsistencyoftheSelectionCriterion ThefollowingargumentsarebasedontheassumptionthatMLEconvergesandisconsistent.

PAGE 119

108 Sincemodel m s containslessparametersthanmodel m s 0 ,thereforewehave  ( s )  ( s 0 ) n < 0 and P [ IC s ( n ) >IC s 0 ( n )]= P f 2  ( ^ s )  ( ^ s 0 ) n c ( n )[ p s ( ) p s 0 ( )] n < 0 g! 1 ; if c ( n )[ p s ( ) p s 0 ( )] n 0as n !1 : SobothAICandBICwillbeconsistentifandonlyif p s ( ) p s 0 ( )isnite,which issatised. CaseII:Overtting( s s 0 ) Basedonthesegregationprinciple,therearetwoQTLgenotypesata particulartestpositionforabackcrossprogeny.Dierentcombinationsofanking markersform4groups,say AABB;AABb;AaBB;AaBb .BasedonsimpleBayesian theorem,onecaneasilycalculatetheconditionalprobabilityofQTLgenotype givenonthe4ankingmarkergenotypes(Table 2{1 ).Thenumberofobservations canalsobegroupedaccordingtothese4ankingmarkergroupsas n 1 n 2 n 3 and n 4 correspondingtomarkertype AABB;AABb;AaBB;AaBb respectively. Consequently,thedatacanalsobegroupedaccordingtothese4markergenotypes as y k ( k =1 ; ; 4).Nowdene f ( y k;i j ; k i )= k i 1 f 1 ( y k;i j 1 )+ k i 0 f 0 ( y i j 0 )and k i 1 + k i 0 =1,for k =1 ; ; 4and i =1 ; ;n ,where k ij isthemixtureproportion forindividual i ingroup k withgenotype j f j isthemultivariatedensityfunction forgenotype j withparameterscontainedin f j g 1 j =0 .Thenthelog-likelihood

PAGE 120

functionwithparameterscanbepartitionedas(jy)=nXi=1logf(yij)=n1Xi=1logf(y1;ij;1i)+n2Xi=1logf(y2;ij;2i)+n3Xi=1logf(y3;ij;3i)+n4Xi=1logf(y4;ij;4i)=4Xk=1nkXi=1logf(yk;ij;ki) BytheconsistencyofMLEs,thenwehave(^s)(^s0)

PAGE 121

Therefore,1 Nowusingthefactthatlog(1 Hence,P[ICs(n)>ICs0(n)]=Pf2(^s)(^s0) ifc(n)[ps()ps0()]

PAGE 122

[1] AkaikeH.,1974ANewLookattheStatisticalModelIdentication.IEEETransactionsonAutomaticControlAC-19:716{23. [2] Altman,N.S.,1990.Kernelsmoothingofdatawithcorrelatederrors.J.Am.Stat.Assoc.90:508{515. [3] Altman,N.andG.Casella,1995NonparametricempiricalBayesgrowthcurveanalysisJ.Amer.Statist.Assoc.90,508{515. [4] AmeisenJ.C.,2002Ontheorigin,evolution,andnatureofprogrammedcelldeath:atimelineoffourbillionyears.CellDeathDier.9:367{393. [5] AnderssonL,C.S.Haley,H.Ellegren,S.A.Knott,M.Johansson,K.Andersson,L.Andersson-Eklund,I.Edfors-Lilja,M.Fredholm,I.Hansson,J.Hakansson,K.Lundstrom,1994Geneticmappingofquantitativetraitlociforgrowthandfatnessinpigs.Science263:1771{1774. [6] Atkinson,A.C.,2003Horwitz'srule,transformingbothsidesandthedesignofexperimentsformechanisticmodels.J.Roy.Stat.Soc.Ser.C52:261{278. [7] Balaban,N.Q.,J.Merrin,R.Chait,L.KowalikandL.Leibler,2004Bacterialpersistenceasaphenotypicswitch.Science305:1622{1625. [8] Barron,A.R.andT.M.Cover,1991Minimumcomplexitydensityestimation.IEEETrans.Inform.Theory37:1034{1054. [9] Berhane,K.andL.A.Weissfeld,2003Inferenceinspline-basedmodelsformultipletime-to-eventdata,withapplicationstoabreastcancerpreventiontrial.Biometrics,59(4):859{868. [10] Bickel,P.andP.Zhang,1992Variableselectioninnonparametricregressionwithcategoricalcovariates.J.Am.Stat.Assoc.87:90{97. [11] Boularan,J.,L.FerreandP.Vieu,1994.Growthcurves:atwo-stagenonparametricapproach.J.Stat.Plan.Infer.38:327{350. [12] Bozdogan,H.,1987ModelselectionandAkaikesinformationcriterion:Generaltheoryanditsanalyticalextensions,Psychometrika,Vol.52,No.3,345{370. 111

PAGE 123

[13] Brzustowicz,L.M.,K.A.Hodgkinson,E.W.C.Chow,W.C.Honer,A.S.Bassett,2000Locationofamajorsusceptibilitylocusforfamilialschizophreniaonchromosome1q21{q22.Science288:678{682 [14] Burnham,K.P.andD.R.Anderson,1998Modelselectionandinference:apracticalinformation-theoreticapproach.NewYork:Springer. [15] Cardon,L.R.,S.D.Smith,D.W.Fulker,W.J.Kimberling,B.F.Pennington,J.C.DeFries,1994Quantitativetraitlocusforreadingdisabilityonchromosome6.Science266:276{279 [16] Carroll,R.J.,D.Ruppert,1984Power-transformationswhenttingtheoreticalmodelstodata.J.Am.Stat.Assoc.79:321{328. [17] Cashio,P.,T.V.LeeandA.Bergmann,2005GeneticcontrolofprogrammedcelldeathinDrosophilamelanogaster.Sem.CellDevelop.Biol.16:225{235. [18] Chen,H.1995Analysisofirregularlyspacedlongitudinaldatausingakernelsmoothingapproach.UnpublishedPh.D.dissertation,DepartmentofStatisticsandActurialScience,TheUniversityofIowa. [19] Churchill,G.A.,andR.W.Doerge,1994Empiricalthresholdvaluesforquantitativetraitmapping.Genetics138:963{971. [20] Comuzzie,A.G.,J.E.Hixson,L.Almasy,B.D.Mitchell,M.C.Mahaney,T.D.Dyer,M.P.Stern,J.W.MacCluer,J.Blangero,1997Amajorquantitativetraitlocusdeterminingserumleptinlevelsandfatmassislocatedonhumanchromosome2.Nat.Genet.15:273{275 [21] Daniels,M.J.andM.Pourahmadi,2002Bayesiananalysisofcovariancematricesanddynamicmodelsforlongitudinaldata.Biometrika89,553{566. [22] Davidian,M.,andD.M.Giltinan,1995Nonlinearmodelsforrepeatedmeasurementdata.ChapmanandHall,London. [23] Davies,J.L.,Y.Kawaguchi,S.T.Bennett,J.B.Copeman,H.J.Cordell,L.E.Pritchard,P.W.Reed,S.C.Gough,S.C.Jenkins,S.M.Palmer,K.M.Balfour,B.R.Rowe,M.Farrall,A.H.Barnett,S.C.Bain,J.A.Todd,1994Agenome-widesearchforhumantype1diabetessusceptibilitygenes.Nature371:130{136. [24] Debajyoti,S.,1993Semiparametricbayesiananalysisofmultipleeventtimedata.J.Am.Stat.Assoc.88:979{983. [25] Dempster,A.P.,N.M.LairdandD.B.Rubin,1977MaximumlikelihoodfromincompletdataviatheEMalgorithm.J.R.Statist.Soc.B39(1):1{38 [26] Diggle,P.J.,P.Heagerty,K.Y.LiangandS.L.Zeger,2002AnalysisofLongitudinalData.Oxford,UK:OxfordUniversityPress.

PAGE 124

[27] Diggle,P.J.,K.Y.LiangandS.L.Zeger,1994AnalysisofLongitudinalData,OxfordUniversityPress. [28] Diggle,P.J.andA.P.Verbyla,1998.Nonparametricestimationofcovariancestructureinlongitudinaldata.Biometrics54:401{415. [29] Ellis,H.M.andH.R.Horvitz,1986GeneticcontrolofprogrammedcelldeathinthenematodeC.elegans.Cell44:817{29. [30] Ellis,R.E.,J.Y.YuanandH.R.Horvitz,1991Mechanismsandfunctionsofcelldeath.Annu.Rev.CellBiol.7,663{698. [31] Ertekin-TanerN,N.Gra-Radford,L.H.Younkin,C.Eckman,M.Baker,J.Adamson,J.Ronal,J.Blangero,M.HuttonandS.G.Younkin,2000.LinkageofplasmaA42toaquantitativelocusonchromosome10inlateonsetAlzheimersdiseasepedigrees.Science290:2303{2304. [32] Ferreira,E.,V.Nu~nez-AntonandJ.Rondriguez-Poo,1997Kernelregressionestimatesofgrowthcurvesusingnonstationarycorrelatederrors.Stat.Prob.Let.34:413{423. [33] Fogg,G.E.,B.Thake,1987Algalculturesandphytoplanktonecology.TheUniversityofWisconsinpress. [34] FraimanR.andJ.Meloche,1994Smoothingdependentobservations.Stat.Prob.Let.2:203{214. [35] Gabriel,K.R.,1962Ante-dependenceanalysisofanorderedsetofvariables.Ann.Math.Statist.33:201{212. [36] Glasbey,C.A.,1988Standarderrorsresilienttoerrorvariancemisspecifcation.Biometrika75:201{206. [37] Grady,J.J.andR.W.Helms,1995Modelselectiontechniquesforthecovariancematrixforincompletelongitudinaldata.Stat.Med.14(13):1397{1416. [38] Grattapaglia,D.,F.L.G.Bertolucci,R.PenchelandR.R.Sedero,1996GeneticmappingofquantitativetraitlocicontrollinggrowthandwoodqualitytraitsinEucalyptusgrandisusingamaternalhalf-sibfamilyandRAPDmarkers.Genetics144:1205{1214. [39] Greenberg,J.T.,1996Programmedcelldeath:Awayoflifeforplants.Proc.Natl.Acad.Sci.93:12094{12097. [40] Guyon,X.andJ.F.Yao,1999Onestimationofunderttedandoverttedmodelschosenbyamodelselectioncriterion.J.MultivariateAnalysis,70:221{249.

PAGE 125

[41] Haley,C.S.andS.A.Knott,1992Asimpleregressionmethodformappingquantitativetraitlociinlinecrossesusingankingmarkers.Heredity69:315{324. [42] Hall,P.andB.Turlach,1997Intopolationmethodsfornonlinearwaveletregressionwithirregularlyspaceddesign.TheAnnalsofStattistics,1912{1925. [43] Hanahan,D.andR.A.Weinberg,2000Thehallmarksofcancer.cell100:57{70. [44] Hanis,C.L.,E.Boerwinkle,R.Chakraborty,D.L.Ellsworth,P.Concannon,B.Stirling,V.A.Morrison,B.Wapelhorst,R.S.Spielman,K.J.Gogolin-Ewens,J.M.Shephard,S.R.Williams,N.Risch,D.Hinds,N.Iwasaki,M.Ogata,Y.Omori,C.Petzold,H.Rietzsch,H.-E.Schrder,J.Schulze,N.J.Cox,S.Menzel,V.V.Boriraj,X.Chen,L.R.Lim,T.Lindner,L.E.Mereu,Y.-Q.Wang,K.Xiang,K.Yamagata,Y.YangandG.I.Bell,1996Agenome-widesearchforhumannoninsulin-dependent(type2)diabetesgenesrevealsamajorsusceptibilitylocusonchromosome2.Nat.Genet.13:161-166. [45] Hart,J.D.andT.E.Wehrly,1986Kernalregressionestimationusingrepeatedmeasurementsdata.J.Am.Stat.Assoc.81,1080{1088. [46] Horvitz,H.R.,1999GeneticcontrolofprogrammedcelldeathinthenematodeCaenorhabditiselegans.CancerRes.59:1701{1706. [47] Horvitz,H.R.,2003Worms,life,anddeath(NobelLecture).Chembiochem.4:697{711. [48] Huang,J.,L.Yang,2004Identicationofnonlinearadditiveautoregressivemodels.J.R.Statist.Soc.B66:463{477. [49] Huang,N.,A.Parco,T.Mew,G.Magpantay,S.McCouch,E.Guiderdoni,J.Xu,P.Subudhi,E.R.Angeles,G.S.Khush,1997RFLPmappingofisozymes,RAPDandQTLforgrainshape,brownplanthopperresistanceinadoubledhaploidricepopulation.Mol.Breed.3:105{113. [50] Hurvich,C.M.andC-L.Tsai,1989Regressionandtimeseriesmodelselectioninsmallsamples.Biometrika76:297{307. [51] Hurvich,C.M.andC-L.Tsai,1990Theimpactofmodelselectiononinferenceinlinearregression.Am.Stat.44:214{217. [52] Jacobson,M.D.,M.WeilandM.C.Ra,1997Programmedcelldeathinanimaldevelopment.Cellvol(88):347{354.

PAGE 126

[53] Jarezic,F.,R.ThompsonandW.G.Hill,2003Structuredantedependencemodelsforgeneticanalysisofrepeatedmeasuresonmultiplequantitativetraits.Genet.Res.,Camb.82,55{65. [54] Jansen,R.C.,1993IntervalMappingofMultipleQuantitativeTraitLoci.Genetics135:205{211. [55] Jansen,R.C.,andP.Stam,1994Highresolutionofquantitativetraitsintomultiplelociviaintervalmapping.Genetics136:1447{1455. [56] Jiang,C.andZ.-B.Zeng,1997Mappingquantitativetraitlociwithdominantandmissingmarkersinvariouscrossesfromtwoinbredlines.Genetica101:47{58. [57] Kao,C.H.,Z.-B.ZengandR.D.Teasdale,1999Multipleintervalmappingforquantitativetraitloci.Genetics152:1203{1216. [58] Kerr,J.F.R.andB.V.Harmon,1991Denitionandincidenceofapoptosis:anhistoricalperspective.In:TomeiLD,CopeFO,eds.Apoptosis.TheMolecularBasisofCellDeath.Plainview,NY:ColdSpringHarborLaboratoryPress5{29. [59] Kirkpatrick,M.andN.Heckman,1989Aquantitativegeneticmodelforgrowth,shape,reactionnorms,andotherinnite-dimensionalcharacters.J.Math.Biol.27:429{450. [60] Kirkpatrick,M.,W.G.Hill,R.Thompson,1994Estimatingthecovariancestructureoftraitsduringgrowthandaging,illustratedwithlactationindairycattle.Genet.Res.64,57{69. [61] Kirkpatrick,M.,D.LofsvoldandM.Bulmer,1990Analysisoftheinheritance,selectionandevolutionofgrowthtrajectories.Genetics124:979{993. [62] Knoblauch,H.,B.Muller-Myhsok,A.Busjahn,A.L.Ben,S.Bahring,H.Baron,S.C.Heath,R.Uhlmann,H.D.Faulhaber,S.Shpitzen,A.Aydin,A.Reshef,M.Rosenthal,O.Eliav,A.Muhl,A.Lowe,D.Schurr,D.Harats,E.Jeschke,Y.Friedlander,H.Schuster,F.C.LuftandE.Leitersdorf,2000Acholesterol-loweringgenemapstochromosome13q.Am.J.Hum.Genet.66:157{166. [63] Knott,S.A.,L.Marklund,C.S.Haley,K.Andersson,W.Davies,H.Ellegren,M.Fredholm,I.Hansson,B.Hoyheim,K.Lundstrom,M.MollerandL.Andersson,1998MultiplemarkermappingofquantitativetraitlociinacrossbetweenoutbredwildboarandLargeWhitepigs.Genetics149:1069{1080.

PAGE 127

[64] Kreitman,M.,1983NucleotidepolymorphismatthealcoholdehydrogenaselocusofDrosophilamelanogaster.Nature304:412{417. [65] Kristensen,C.N.,D.Keleotis,T.Kristensen,A-L.Brresen-Dale,2001High-throughputmethodsfordetectionofgeneticvariation.BioTechniques30:318{332. [66] Kuriyama,H.andH.Fukuda,2002Developmentalprogrammedcelldeathinplants.CurrentOpinioninPlantBiology,5:568{573. [67] Lander,E.S.andD.Botstein,1989MappingendelianfactorsunderlyingquantitativetraitsusingRFLPlinkagemaps.Genetics121,185{199. [68] Langley,C.H.,E.MontgomeryandW.F.Quattlebaum,1982RestrictionmapvariationintheAdhregionofDrosophila.Proc.Natl.Acad.Sci.USA79:5631{5635. [69] Laurent-Crawford,A.G.,B.Krust,S.Muller,Y.Rivire,M.-A.Rey-Cuill,J.-M.Bchet,L.MontagnierandA.G.Hovanessian,1991ThecytopathiceectofHIVisassociatedwithapoptosis.Virology185,829{839. [70] LeipsJ.,T.F.C.Mackay,2000QuantitativetraitlociforlifespaninDrosophilamelanogaster:interactionswithgeneticbackgroundandlarvaldensity.Genetics1(5):1773{1788. [71] Leonard,T.andJ.Hsu,1992Bayesianinferenceforacovariancematrix.Ann.Stat.20:1669{1696. [72] Lin,M.,X.-Y.Lou,M.ChangandR.L.Wu,2003Ageneralstatisticalframeworkformappingquantitativetraitlociinnonmodelsystems:Issueforcharacterizinglinkagephases.Genetics165:901{913. [73] Lin,H.,C.E.McCullochandS.T.Mayne,2002Maximumlikelihoodestimationinthejointanalysisoftime-to-eventandmultiplelongitudinalvariables.StatMed,21(16):2369{2382. [74] Lin,M,C.Aquilante,J.A.JohnsonandR.Wu,2005SequencingdrugresponsewithHapMap.PharmacogenomicsJ.5:149{156. [75] Littell,R.C.,J.PendergastandR.Natarajan,2000Modellingcovariancestructureintheanalysisofrepeatedmeasuresdata.Statist.Med.19:1793{1819. [76] Liu,T.,W.Zhao,L.L.TianandR.L.Wu,2004Analgorithmformoleculardissectionoftumorprogression.JournalofMathematicalBiology50:336{354.

PAGE 128

[77] Louis,T.A.,1982FindingtheobservedinformationmatrixwhenusingtheEMalgorithm.J.R.Statist.Soc.B44,226{233. [78] Lynch,M.andB.Walsh,1998GeneticsandAnalysisofQuantitativeTraits.Sinauer,Sunderland,MA. [79] Ma,C.-X.,G.CasellaandR.L.Wu,2002Functionalmappingofquantitativetraitlociunderlyingthecharacterprocess:Atheoreticalframework.Genetics161:1751{1762. [80] Mackay,T.F.C.,2001QuantitativetraitlociinDrosophila.Nat.Rev.Genet.2:11{20. [81] Martinez,O.,andR.N.Curnow,1994Missingmarkerdatawhenestimatingquantitativetraitlociusingregressionmapping.Heredity73,198{206. [82] Mauricio,R.,2001Mappingquantitativetraitlociinplants:Usesandcaveatsforevolutionarybiology.Nat.Rev.Genet.2:370{381. [83] Mendel,G.J.,1866VersucheuberPanzen-Hybriden.VerhandlungendesnaturforschendenVereines,Abhandlungen,Brunn4:3{47. [84] Meng,X.L.andD.B.Rubin,1991UsingEMtoObtainAsymptoticVarianceCovarianceMatrices:TheSEMAlgorithm.J.Am.Stat.Assoc.86,899{909. [85] Muller,H.G.,1988NonparametricAnalysisofLongitudinalData.Springer-Verlag,Berlin. [86] Myers,A.,P.Holmans,I.H.Marshal,J.Kwon,D.Meyer,D.Ramic,S.Shears,J.Booth,F.W.DeVrieze,R.Crook,M.Hamshere,R.Abraham,N.Tunstall,F.Rice,S.Carty,S.Lillystone,P.Kehoe,V.Rudrasingham,L.Jones,S.Lovestone,J.Perez-Tur,J.Williams,M.J.Owen,J.HardyandA.M.Goate,2000SusceptibilitylocusforAlzheimersdiseaseonchromosome10.Science290:2304{2305. [87] Nelder,J.A.andR.Mead,1965Asimplexmethodforfunctionminimization.ComputerJ.7:308{313. [88] Nilsson-Ehle,H.,1909KreuzungsuntersuchungenanHaferundWeizen.LundsUniversitets_Arsskrift [89] Pan,J.andG.Machenzie,2003Onmodellingmean-covariancestructuresinlongitudinalstudies.Biometrika90,239{244. [90] Pankratz,V.S.,MdeAndradeandT.M.Therneau,2005Random-eectsCoxproportionalhazardsmodel:generalvariancecomponentsmethodsfortime-to-eventdata.GenetEpidemiol,28(2):97{109.

PAGE 129

[91] Paterson,A.H.,S.Damon,J.D.Hewitt,D.Zamir,H.D.Rabinowitch,S.E.Lincoln,E.S.LanderandS.D.Tanksley,1991Mendelianfactorsunderlyingquantitativetraitsintomato:comparisonacrossspecies,generations,andenvironments.Genetics127:181{197. [92] Pennell,R.I.,C.Lamb,1997Programmedcelldeathinplants.PlantCell9:1157{1168. [93] Pletcher,S.D.andC.J.Geyer,1999Thegeneticanalysisofage-dependenttraits:Modelingthecharacterprocess.Genetics153:825{835. [94] Pool,M.H.,L.L.G.JanssandT.H.E.Meuwissen,2000aGeneticParametersofLegendrePolynomialsforFirstParityLactationCurves.J.DairySci.83(11):2640{2649. [95] Pool,M.H.,andT.H.E.Meuwissen,2000bPredictionofdailymilkyieldsfromalimitednumberoftestdaysusingtestdaymodels.J.DairySci.82:1555{1564. [96] Pourahmadi,M.,1999Jointmean-covariancemodelswithapplicationstolongitudinaldata:Unconstrainedparameterisation.Biometrika86:677{690. [97] Pourahmadi,M.,2000Maximumlikelihoodestimationofgeneralisedlinearmodelsformultivariatenormalcovariancematrix.Biometrika87:425{435. [98] Press,W.H.,B.P.Flannery,S.A.TeukolskyandW.T.Vetterling,1992NumericalrecipesinC:theartofscienticcomputing.CambridgeUniversityPress,pp186{189. [99] Ra,M.C.,B.A.Barres,J.F.Burne,H.S.Coles,Y.IshizakiandM.D.Jacobson,1993Programmedcelldeathandthecontrolofcellsurvival:lessonsfromthenervoussystem.Science262:695{700. [100] Raftery,A.E.,1993Bayesianmodelselectioninstructuralequationmodels.InK.A.BollenandJ.S.Long(eds.),Testingstructuralequationmodels.NewburyPark,CA:Sage,pp.163{180. [101] Redner,R.,1981Noteontheconsistencyofthemaximumlikelihoodestimatefornonidentiabledistributions.Ann.Statist.9:225{228. [102] Rubin,D.B.,1987MultipleImputationforNonresponseinSurveys,NewYork:JohnWiley&Sons,Inc. [103] Sakamoto,Y.,M.IshiguroandG.Kitagawa,1986AkaikeInformationCriterionStatistics.Tokyo:D.Reidel. [104] Satagopan,J.M.,Y.S.Yandell,M.A.NewtonandT.C.Osborn,1996ABayesianapproachtodetectquantitativetraitlociusingMarkovchainMonteCarlo.Genetics144:805{816.

PAGE 130

[105] Sax,K.,1923Theassociationofsizedierencewithseed-coatpatternandpigmentationinPhaseolusvulgaris.Genetics8:552{560. [106] Schwarz,G.,1978EstimatingtheDimensionofaModel.Ann.Statist.6:461{464. [107] Shao,J.,1993LinearModelSelectionbyCross-Validation.J.Am.Stat.Assoc.88:486{494. [108] Shao,J.,1996Bootstrapmodelselection.J.Am.Stat.Assoc.91:655{665. [109] Shibata,R.,1981Anoptimalselectionofregressionvariables,Biometrika63:114{126. [110] Sillanpaa,M.J.,andE.Arjas,1999Bayesianmappingofmultiplequantitativetraitlocifromincompleteoutbredospringdata.Genetics151:1605{1619. [111] Simpson,S.P.,1989Detectionoflinkagebetweenquantitativetraitlociandrestrictionfragmentlengthpolymorphismsusinginbredlines.Theor.Appl.Genet.77:815{819. [112] Soller,M.,T.BrodyandA.Genizi,1976Onthepowerofexperimentaldesignsforthedetectionoflinkagebetweenmarkerlociandquantitativelociincrossesbetweeninbredlines.Theor.Appl.Genet.47:35{39. [113] Song,X.,M.DavidianandA.A.Tsiatis,2002Asemiparametriclikelihoodapproachtojointmodelingoflongitudinalandtime-to-eventdata.Biometrics58(4):742{53. [114] Spelman,R.J.,W.Coppieters,L.Karim,J.K.vanArendonkandH.Bovenhuis,1996QuantitativetraitlocianalysisforvemilkproductiontraitsonchromosomesixintheDutchHolstein-Friesianpopulation.Genetics144:1799{1808. [115] Steller,H.,1995MechanismsandGenesofCellularSuicide.Science267:1445{1449. [116] Stuber,C.W.,S.E.Lincoln,D.W.Wol,T.HeletjarisandE.S.Lander,1992Identicationofgeneticfactorscontributingtoheterosisinahybridfromtwoelitemaizeinbredlinesusingmolecularmarkers.Genetics132:823{839. [117] Sugiura,N.,1978FurtheranalysisofthedatabyAkaikesinformationcriterionandthenitecorrections.Commun.Stat.TheoryMeth.A7:13{26. [118] Takeuchi,K.,1976Distributionofinformationstatisticsandcriteriaforadequacyofmodels.Math.Sci.153:12{18.

PAGE 131

[119] Tanksley,S.D.,1993Mappinggenes.Annu.Rev.Genet.27:205{233. [120] Thoday,J.M.,1961Locationofpolygenes.Nature191:368{370. [121] VanDyk,D.A.andX.L.Meng,2001Theartofdataaugmentation(withdiscussion).J.ofComput.andGraph.Stat.10,1{111. [122] Vaudry,D.,A.Falluel-Morel,S.Leuillet,H.VaudryandB.J.Gonzalez,2003Regulatorsofcerebellargranulecelldevelopmentactthroughspecicsignalingpathways.Science300:1532{1534. [123] Vaux,D.L.andJ.K.Stanley,1999CellDeathinDevelopment.Cellvol(96):245{254. [124] VonBertalany,L.,1957Quantitativelawsinmetabolismandgrowth.Q.Rev.Biol.32:217{231. [125] Walling,G.A.,P.M.Visscher,L.Andersson,M.F.Rothschild,L.Wang,G.Moser,M.A.Groenen,J.P.Bidanel,S.Cepica,A.L.Archibald,H.Geldermann,D.J.deKoning,D.MilanandC.S.Haley,2000Combinedanalysesofdatafromquantitativetraitlocimappingstudies:chromosome4eectsonporcinegrowthandfatness.Genetics155:1369{1378. [126] Wang,N.,andD.Ruppert,1995Nonparametricestimationofthetransformationinthetransform-both-sidesregressionmodel.J.Am.Stat.Assoc.90:522{534. [127] Wang,Z.H.andR.L.Wu,2004Astatisticalmodelforhigh-resolutionmappingofquantitativetraitlocidetermininghumanHIV-1dynamics.Stat.Med.23(19):3033{3051. [128] Weller,J.I.,1986Maximumlikelihoodtechniquesforthemappingandanalysisofquantitativetraitlociwiththeaidofgeneticmarkers.Biometrics42:627{640. [129] Weller,J.I.,1987MappingandanalysisofquantitativetraitlociinLocyper-sicon(tomato)withtheaidofgeneticmakersusingapproximatemaximumlikelihoodmethods.Heredity59:413{421. [130] West,G.B.,J.H.BrownandB.J.Enquist,2001Ageneralmodelforontogeneticgrowth.Nature413:628{631. [131] White,K.,M.E.Grether,J.M.Abrams,L.Young,K.FarrellandH.Steller,1994GeneticcontrolofprogrammedcelldeathinDrosophila.Science264:677{683. [132] Williamson,P.R.,C.T.Smith,J.L.HuttonandA.G.Marson,2002Aggregatedatameta-analysiswithtime-to-eventoutcomes.Stat.Med.21(22):3337{51.

PAGE 132

[133] Wood,S.,R.Kohn,T.ShivelyandW.X.Jiang,2002Modelselectioninsplinenonparametricregression.J.R.Statist.Soc.B64:119{139. [134] Wu,R.L.,1998Thedetectionofplasticitygenesinheterogeneousenvironments.Evolution52:967{977. [135] Wu,R.L.,Y.F.Han,J.J.Hu,L.Li,M.L.LiandZ-B.Zeng,2000AnintegratedgeneticmapofPopulusbasedonampliedfragmentlengthpolymorphisms.Theor.Appl.Genet.100:1249{1256. [136] Wu,R.L.,B.Li,S.S.WuandG.Casella,2001Amaximumlikelihood-basedmethodforminingmajorgenesaectingaquantitativecharacter.Biometrics57(3):764{768. [137] Wu,R.L.,C.-X.Ma,M.Chang,R.C.Littell,S.S.Wu,T.M.Yin,M.-R.Huang,M.-X.WangandG.Casella,2002aAlogisticmixturemodelforcharacterizinggeneticdeterminantscausingdierentiationingrowthtrajectories.GenetRes79(3):235{245. [138] Wu,R.L.,C.-X.Ma,M.LinandG.Casella,2004aAgeneralframeworkforanalyzingthegeneticarchitectureofdevelopmentalcharacteristics.Genetics166:1541{1551. [139] Wu,R.L.,C.-X.Ma,M.Lin,Z.H.WangandG.Casella,2004bFunctionalmappingofquantitativetraitlociunderlyinggrowthtrajectoriesusingatransform-both-sideslogisticmodel.Biometrics60:729{738. [140] Wu,R.L.,C.-X.Ma,R.C.Littell,andG.Casella,2002bAstatisticalmodelforthegeneticoriginofallometricscalinglawsinbiology.J.theor.Biol.219:121{135. [141] Wu,R.L.,C.-X.Ma,W.ZhaoandG.Casella,2003Functionalmappingofquantitativetraitlociunderlyinggrowthrates:Aparametricmodel.Physiol.Genomics14:241{249. [142] Wu,R.L.,Z.H.Wang,W.ZhaoandJ.M.Cheverud,2004cAmechanisticmodelforgeneticmachineryofontogeneticgrowth.Genetics168:2383{2394. [143] Wu,W.B.andM.Pourahmadi,2003Nonparametricestimationoflargecovariancematricesoflongitudinaldata.Biometrika90:831{844. [144] Xu,S.Z.,1995Acommentonthesimpleregressionmethodforintervalmapping.Genetics141:1657{1659. [145] Xu,S.Z.andN.J.Yi,2000Mixedmodelanalysisofquantitativetraitloci.Proc.Natl.Acad.Sci.USA97:14542{14547.

PAGE 133

[146] Yan,J.Q.,J.Zhu,C.X.He,M.BenmoussaandP.Wu,1998Quantitativetraitlocianalysisforthedevelopmentalbehavioroftillernumberinrice.Theor.Appl.Genet.97:267{274. [147] Yuan,J.,andH.R.Horvitz,2004Arstinsightintothemolecularmechanismsofapoptosis.CellS116:S53{S56. [148] Zeng,Z.-B.,1994Precisionmappingofquantitativetraitloci.Genetics136:1457{1468. [149] Zeng,Z-B.,C.H.KaoandC.J.Basten,1999Estimatingthegeneticarchitectureofquantitativetraits.Genet.Res.74:279{289. [150] Zhang,Q.,D.Boichard,I.Hoeschele,C.Ernst,A.Eggen,B.Murkve,M.Pster-Genskow,L.A.Witte,F.E.Grignola,P.Uimari,G.ThallerandM.D.Bishop,1998Mappingquantitativetraitlociformilkproductionandhealthofdairycattleinalargeoutbredpedigree.Genetics149:1959{1973. [151] Zhao,W.,Y.Q.Chen,G.Casella,J.M.CheverudandR.L.Wu,2005Anonstationarymodelforfunctionalmappingofcomplextraits.Bioinformatics21:2469{2477. [152] Zhao,W.,R.L.Wu,C.-X.MaandG.Casella,2004Afastalgorithmforfunctionalmappingofcomplextraits.Genetics167:2133{2137. [153] Zheng,X.andW.Lou,1995Consistentvariableselectioninlinearmodels,J.Am.Stat.Assoc.90:151{156. [154] Zimmerman,D.L.andV.Nu~nez-Anton,1997Structuredantedependencemod-elsforlongitudinaldata.InModellingLongitudinalandSpatiallyCorrelatedData.Methods,Applications,andFutureDirections,63-76(T.G.Gregoire,D.R.Brillinger,P.J.Diggle,E.Russek-Cohen,W.G.Warren,andR.Woln-ger,eds.)Springer-Verlag,NewYork. [155] Zimmerman,D.L.andV.Nu~nez-Anton,2001Parametricmodelingofgrowthcurvedata:Anoverview.Test10:1{73.

PAGE 134

## Material Information

Title: Statistical Functional Mapping for Genetic Control of Programmed Cell Death
Physical Description: Mixed Material

## Record Information

Source Institution: University of Florida
Holding Location: University of Florida
System ID: UFE0011281:00001

## Material Information

Title: Statistical Functional Mapping for Genetic Control of Programmed Cell Death
Physical Description: Mixed Material

## Record Information

Source Institution: University of Florida
Holding Location: University of Florida
System ID: UFE0011281:00001

Full Text

STATISTICAL FUNCTIONAL MAPPING FOR GENETIC CONTROL OF
PROGRAMMED CELL DEATH

By

YUEHUA CUI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2005

by

Yuehua Cui

I dedicate this work to my wife, and my parents.

ACKNOWLEDGMENTS

First and most importantly, I would like to express my deepest gratitude to

Dr. Rongling Wu, my advisor, for his tremendous help and encouragement during

my doctoral research. He has been a constant source of energy and inspiration

for me. Without his forceful guidance and excellent advisement, this dissertation

would not have been possible. Further, I would like to thank Dr. Casella for giving

me valuable -1.- i.' and great help on my dissertation. I appreciate all of his

advice and encouragement. Also, I would like to thank the other members of my

committee, Drs. Kenneth Portier, Alex Trindade and Kathleen Shiverick for their

knowledge, wisdom and most of all, for their precious time. Each of them pl i,.

a unique and special role in my time here as a graduate student. Special thanks

also go to Dr. Ramon Littell, Dr. James Booth, Dr. Yongsung Joo, the statistical

genetics group members for discussing various problems with me, and the faculty

and students in the Department of Statistics for their help and friendship.

Finally, I would like to give my deepest thanks and appreciation to my parents

for the constant support and unconditional love they provided. I appreciate the

endless love and support I receive from my brothers and sisters and their respective

families. Most importantly, I would like to give my special thanks and love to my

wonderful wife Kelian for her constant love, help, support, understanding, and

encouragement since the di we started this journey.

page

ACKNOW LEDGMENTS ............................. iv

LIST OF TABLES .. .. ... .. .. .. ... .. .. .. .. ... .. .. .. viii

LIST OF FIGURES ................................ ix

A B ST R A CT . . . . . . . . . x

CHAPTER

1 INTRODUCTION .............................. 1

1.1 Basic G enetics . . . . . . . 1
1.2 Genetic M apping .. .. .. ... .. .. .. .. ... .. .. .. 3
1.3 Traditional QTL Mapping Approaches .............. 5
1.4 Functional M apping .......................... 8
1.5 Programmed Cell Death ....................... 10
1.6 O objectives . . . . . . . 13

2 GENERAL MAPPING FRAMEWORK . . . . 15

2.1 Genetic Experiment Designs . . . . . 15
2.2 Statistical M odels . . . . . . 16
2.2.1 Finite Mixture Models . . . . . 16
2.2.2 Likelihood Function . . . . . 18
2.2.3 Modelling the Mean Vector . . . . 19
2.2.4 Modelling the Covariance Matrix ... . . 21
2.2.5 Computational Algorithms . . . . 26
2.2.6 Hypothesis Testing . . . . . 28
2.2.7 Asymptotic Sampling Error . . . . 29

3 A NONPARAMETRIC APPROACH FOR FUNCTIONAL MAPPING
OF PROGRAMMED CELL DEATH . . . . 31

3.1 Introduction . . . ... . . 31
3.2 The Finite Mixture Model and Likelihood Function . ... 33
3.3 Modelling the Mean PCD Process . . . . 34
3.3.1 Legendre Polynomial . . . . . 34
3.3.2 Information-theoretic Criteria . . . . 37
3.3.3 Consistency of Model Selection Criteria . . 40
3.3.4 Selection Power . . . . . . 41

3.3.5 LEP Order Selection . . . . . 42
3.4 Modelling the Covariance Structure . . . . 43
3.4.1 AR(1) Covariance Model . . . . 43
3.4.2 SAD(r) Covariance Model. . . . . 43
3.5 Computation Algorithm and Hypothesis Testing . . ... 44
3.6 Calculating the Heritability Level . . . . 45
3.7 LEP Order Selection Simulation . . . . 46
3.8 Model Assessment by Simulation . . . . 48
3.9 Real Example . . . . . . . 51
3.10 Discussion . . . . . . . 59

4 FUNCTIONAL QTL MAPPING PROGRAMMED CELL DEATH: A
SEMIPARAMETRIC MODEL . . . . . 62

4.1 Introduction . . . . . . . 62
4.2 Different Phases of PCD . . . . . 64
4.3 Statistical Model . . . . . . 67
4.3.1 The Mixture Model-Based Likelihood . . ... 67
4.3.2 Semiparametric Modelling of the Mean Vector . ... 69
4.3.3 Modelling the Covariance Structure . . .... 72
4.3.4 Computation Algorithms . . . . 74
4.3.5 Calculating Curve Heritability . . . . 76
4.4 Hypothesis Testings . . . . . . 77
4.5 R results . . . . . . . . 78
4.5.1 A Worked Example . . . . . 78
4.5.2 Sim ulation . . . . . . 82
4.6 Discussion . . . . . . . 84

5 CONCLUDING REMARKS . . . . . . 89

6 PROSPECTS . . . . . . . . 93

6.1 Missing Data Problem . . . . . . 93
6.2 Joint Mapping of PCD and Time-To-Event . . ... 95
6.3 Nonparametric Modelling of Covariance Structure . ... 96

APPENDIX

A DERIVATION OF EM ALGORITHM WITH NONPARAMETRIC
APPROACH . . . . . . . . 97

A.1 Nonparametric with AR(1) Covariance Structure . .... 98
A.2 Nonparametric with SAD(1) Covariance Structure . ... 100

B DERIVATION OF EM-SIMPLEX ALGORITHM WITH
SEMIPARAMETRIC APPROACH .... . . 103

C CONSISTENCY OF MODEL SELECTION CRITERIA ........ 105

C .1 The M odel . . . . . . . 105
C.2 Order Selection Criterion ....................... 106
C.3 Consistency of the Selection Criterion . . 107

REFERENCES . . . . . . . . 111

BIOGRAPHICAL SKETCH . . . . . . . 123

LIST OF TABLES
Table page

1-1 Data structure for a backcross design in FunMap . . 10

2-1 Conditional probabilities of QTL .'i vpes given on the flanking markers
in a backcross design . . . . . . 16

3-1 LEP order selection power comparison using different information criteria
with AR(1) covariance structure . . . . . 47

3-2 Power comparison for LEP order selection using different information
criteria with SAD(1) covariance structure ... . . 48

3-3 The MLEs of the QTL position and the model parameters derived
from 100 simulation replicates with the AR(1) covariance structure.
The squared roots of the mean square errors of the MLEs are given
in parentheses . . . . . . . 49

3-4 The MLEs of the model parameters and the QTL position derived
from 100 simulation replicates with the SAD(1) covariance structure.
The squared roots of the mean square errors of the MLEs are given
in parentheses . . . . . . . 50

3-5 The MLEs of the parameters and their .,-vmptotic standard errors in
the parentheses with the AR(1) covariance structure . .... 56

3-6 The MLEs of the estimated parameters and their i,-'iiii ,1 ic standard
errors in the parentheses with the SAD(1) covariance structure 57

4-1 Model selection for Legendre polynomial orders based on the AIC and
BIC values under the M-C covariance-structuring model . ... 79

4-2 The MLEs of the growth and death parameters (and their approximate
standard errors in the parentheses) for significant QTL detected on
different chromosomes for tiller numbers in a DH rice population 81

4-3 The MLEs of the QTL position and the model parameters derived
from 100 simulation replicates. The squared roots of the mean square
errors of the MLEs are given in parentheses. . . . 82

LIST OF FIGURES
Figure page

1-1 Simple growth curve . . . . . . ... 9

1-2 Simple programmed cell death curve . . . . 13

3-1 Linkage map for the rice genome . . . . . 52

3-2 Dynamic changes of the number of tillers for 123 DH lines of rice as
an example of PCD in plants . . . . . 53

3-3 Tiller growth . . . . . . . 54

3-4 The LR profile plot. The solid and dotted curves correspond to the
LR profile with SAD(1) and AR(1) covariance respectively. .. 55

3-5 The dynamic changes of tiller numbers for two curves each presenting
two groups of .-., .1 irvpes, QQ and qq, at each of the four QTL detected
with the AR(1) covariance structure. ....... . . 58

3-6 Two curves for the dynamic changes of tiller numbers each presenting
two groups of .-., .1 irvpes, QQ and qq, at each of the four QTL detected
with the SAD(1) covariance structure. . . . . 59

4-1 A typical example of PCD that includes five different stages, (1) lag,
(2) exponential, (3) declining growth, (4) stationary and (5) death. 65

4-2 The LR profile plot. The genomic positions corresponding to the peak
of the curve are the MLEs of the QTL localization (indicated by
the arrows) . . . . . . . 80

4-3 Plot of the dynamic changes of tiller numbers for two curves each presenting
two groups of .., : i,' vpes, QQ and qq, at each of the three QTL. A)
QTL detected between markers RG146 and RG345, and B) QTL
between markers RZ730 and RZ801 on chromosome 1, and C) QTL
on marker RZ792 on chromosome 9. ...... . . 85

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

STATISTICAL FUNCTIONAL MAPPING FOR GENETIC CONTROL OF
PROGRAMMED CELL DEATH

By

Yuehua Cui

August 2005

C!i ,i: Rongling Wu
Major Department: Statistics

The development of any organism represents a complex dynamic process

which is genetically controlled by a network of genes and environmental factors.

Programmed cell death (PCD), a physiological cell suicide process, occurs during

the development of most organisms and is typically a complex longitudinal trait.

Identifying genes that contribute to the differentiation of cells has been one of the

most important and difficult tasks for PCD studies. With the new conceptual idea

of functional mapping and abundant molecular markers available, specific genes

that regulate PCD can now be detected.

In this dissertation, I extend the principle of functional mapping and proposed

a series of statistical models and computational algorithms for mapping genes

that are associated with the dynamic features of the PCD process. A typical PCD

process involves two alternate phases-growth and death. The nonparametric

approach developed based on the orthogonal Legendre polynomial is shown great

flexibility in capturing various developmental PCD patterns. The limitation of

the nonparametric model is its difficulty in implementing the biological principle

of growth phase. To enhance the biological relevance of the mapping model, a

semiparametric model is derived that integrates a parametric logistic function for

the growth process and a nonparametric Legendre function for the dead phase.

Several information criteria are proposed to select the optimal Legendre order

for both the nonparametric and semiparametric models. A series of stationary and

nonstationary processes is applied to model the covariance structure among the

phenotypes at different time points. The developed models allow for numerous

hypotheses tests of regarding the genetic control mechanism of PCD features. The

EM and simplex algorithms are derived to estimate the parameters that model the

mean vector and the covariance structure. The model is statistically investigated

through simulation studies and is validated by a real example for tiller number in

rice. Our model provides a quantitative and testable framework for assessing the

interplay between genes and organism developmental patterns, and will have great

implications for elucidating the detailed genetic architecture of any longitudinal

traits such as PCD.

CHAPTER 1
INTRODUCTION

1.1 Basic Genetics

In most organisms, chromosomes are paired: one is inherited from the paternal

parent and one from the maternal parent. A gene represents a segment of

functioned sequence. It could be a coding sequence that codes for a particular

protein, or a noncoding sequence regulating the expression of other genes. Each

gene is located at a specific position on a chromosome. This site (called a locus)

contains alleles of one particular gene. Alleles are alternative DNA sequences on

a particular locus. For a haploid organism, a gene contains two alleles. But for

a triploid organism (for instance, endosperm) a gene contains three alleles. The

number of genes varies from several thousand to more than 10,000 in different

species. For example, the human genome consists of 23 homologous pairs of

chromosomes containing more than 30,000 genes; in yeast, there are only about

6,000 genes. Genetic differences between two individuals exist because each genetic

locus could contain different alleles. Different combinations of alleles at a locus

form the g, I.-li,, that determine individual ph. ii.. ivpic values. P, I'.1li,", the

product of gene expression, denotes an observable characteristic (e.g., skin color

and body height). If there is a gene contributable to the variation of a certain

phenotype, then individuals carrying different genotypes of this gene will have

different phenotypes.

During meiosis, a process which produces the -i.,mr,/. I (sperm and egg cells),

the homologous chromosome strands pair up and may exchange chromosome

segments. This random shuffling of genetic materials during miosis is called

recombination (one of the i, i,' .r sources of genetic diversity in nature). At the

end of meiosis, the chromosome pairs are separated. As a result, each gamete

carries a single copy of chromosomes. A linear map of the relative positions

(chromosomal loci) of genes along a chromosome is called a 1.,.,,,.' map. Distances

on a map are established on the basis of recombination frequency by linkage

analysis. A 1.,.,,,g.' map does not give the physical distance (which is shown by

the jli/-.:. ', map) between genes but rather their relative positions. The closer

two genes are, the more tightly they are linked and hence the more often they will

be inherited together. I. ,.,.,.' distance (d) is measured in centiMorgans (cM).

The recombination fraction (r) can be calculated using the .i, .,,,.' distance. If no

interference is assumed, then based on the Poisson assumption of an even number

of crossovers, the Haldane map function can be used to calculate r

-1 (- -2d/100)
2

If interference is taken into account, the Kosambi map function should be used

64d/100 t
2(e4d/100 + 1)

Therefore, r 0.5 will give an infinite distance between two markers, which means

In traditional quantitative trait studies (Lynch and Walsh 1998), phenotypic

variation can be partition into genetic variation and nongenetic variation

VP = VP + VE

where Vp is the total phenotypic variance and is partitioned into genetic variance

(VG) and environmental variance (VE). The ratio H2 = V1/Vp heritabilityy)

measures the degree of phenotypic deviation transmitted from parents to their

offspring and provides useful information in predicting the response to selection.

1.2 Genetic Mapping

There are two basic approaches to detect the presence of i' 'r genes:

segregation analysis based on phenotypic data only (Wu et al. 2001) and linkage

analysis based on both phenotypic data and molecular marker data (Lander

and Botstein 1989). My study focused on the second approach. A quantitative

trait has measurable variation due to genetic and/or environmental influences.

Unlike qualitative traits (which cannot be measured quantitatively), quantitative

variation can consist of discrete or continuous values. For example, the number

of separate tumors in the lung of a cancer-prone mouse is a discrete trait. Plant

height/diameter and grain yield are examples of continuous traits. Genetic studies

in quantitative traits are not limited to morphological traits such as body height

or weight. Such studies have increased tremendously in other areas including

physiological and biochemical traits such as blood pressure and cholesterol level,

complex diseases such as heart disease and diabetes, and behavioral traits such as

reaction to alcoholism and smoking. Genetically speaking, quantitative variation

is multifactorial and is influenced by several polymorphic genes and environmental

factors as well as complicated interactions. However, most quantitative variations

caused by environment influence are not inheritable.

Quantitative trait loci (QTL) is a genetic location of one or more genes.

Quantitative differences among individuals are contributed by specific QTL that

function individually or interactively. The QTL reflect natural genetic variation

as it exists in organisms. Mendel (1866) first proposed the idea that quantitative

variation in a trait could be due to the action of multiple genes. Nilsson-Ehle

(1909) demonstrated this in wheat. With the development of chromosome theory

and the concept of genetic linkage by the 1920, Sax (1923) proposed that the

association between seed weight and seed coat color in beans was due to linkage

between genes controlling color and one or more genes controlling sizes. However,

it was a long time before the pioneered mapping framework was developed in the

1980 partly because of the small number of available markers (Thodv'- 1961) and

the lack of powerful statistical models and computation techniques.

Modern biological techniques make it possible to detect the abundant

variation of molecular polymorphism that segregate in most species in nature

and further make QTL mapping study feasible. Examples of polymorphism are

single nucleotides polymorphisms (or SNPs), short di-, tri- or tetra-nucleotide

tandem repeats (microsatellites), longer tandem repeats (minisatellites), and

insertion sites of transposable elements. These markers are genetically neutral

and show high polymorphism with great abundance, spanning entire genomes.

To detect molecular variation, many methods have been developed including

traditional techniques such as restriction fragment length polymorphism (RFLP)

analysis using Southern blots (Langley et at. 1982; Leips et. al. 2000) and direct

sequencing (Kreitman 1983), and new developments such as high-throughput

methods for both polymorphism discovery and ..n', vping (Kristensen 2001). With

the development of new molecular and DNA recombination techniques in the past

decades, it has made it possible to detect QTL underlying quantitative variation

of certain traits of interest and to map their chromosomal locations throughout

the entire genome with polymorphic markers. M nz QTL have been mapped in

numerous economically and agronomically important traits in crop and animal

species; for instance, tomato (Paterson et al. 1991), maize (Stuber et al. 1992),

eucalyptus (Grattapaglia et al. 1996), pigs (Andersson et al. 1994; Knott et al.

1998; Walling et al. 2000), and dairy cattle (Georges et al. 1995; Spelman et al.

1996; Zhang et al. 1998). Undoubtly, QTL are also responsible for most of the

genetic diversity in human disease susceptibility and severity. A1l in QTL have

been mapped for human susceptibility to type I (Davies et al. 1994) and type II

(Hanis et al. 1996), diabetes mellitus, schizophrenia (Brzustowicz et al. 2000),

obesity (Comuzzie et al. 1997), Alzheimer's disease (Ertekin-Taner et al. 2000;

Myers et al. 2000), cholesterol levels (Knoblauch et al. 2000), and dyslexia (Cardon

et al. 1994). Even though QTL present challenges of identification, they represent

a fascinating biological phenomenon that is fundamental to our understanding of

abundant variation in nature. With the completion of human, mouse, Arobidopsis

and many other organisms' genome sequences, we are entering an era in which the

identification of QTL can be approached both experimentally and mathematically.

QTL mapping is a phenotype-driven approach to gene function. By setting

up a relationship between marker genotype and phenotypic traits, one can identify

different gene expressions of individuals who carry a specific genotype. There are a

variety of methods for identifying QTL segregating in an experimental cross (such

as a backcross or an F2 intercross). The theoretical principle for analyzing QTL

dates back to Sax (1923) who first associated pattern and pigment markers with

seed size in beans. If a QTL is linked to a marker locus, there will be a difference

in mean values of the quantitative trait among individuals with different genotypes

at the marker loci. With this idea, single marker methods, ANOVA (Soller et al.

1976) maximum likelihood with a single marker (Weller 1986, 1987; Simpson 1989)

have been developed. However, these methods cannot estimate QTL positions

precisely. Putative QTL genotypic means and QTL positions are also confounded.

These confounding factors cause the biased estimator of QTL effects and low

statistical mapping power, particularly when linkage map density is low.

Statistical methodologies for mapping QTL on a high-density linkage map of

molecular markers had not been established until Lander and Botstein's (1989)

pioneering work. The mapping approach is based on the simple linear model

yi= pt+axi+i, i 1,2,... ,n

where y, is the phenotypic measurement, p is the overall genetic mean, a is the

genetic additive effect, xi is the indicator variable indicating different genotypes,

and ci is the environmental error which is assumed to be normally identically

distributed with mean 0 and variance a2. These authors used an EM-implemented

maximum likelihood approach, proposed by Dempster et al. (1977), to map QTL

on a particular chromosomal interval bracketed by two flanking markers. This

so-called interval mapping method was later improved by including markers from

other intervals as cofactors to control the overall genetic background effect (Jansen

and Stam 1994; Zeng 1994). The model is given as

K
yi u+axi+ ybkXjk +C, i= 1,2,... ,n
k=0
where bk is the background marker effect and Xjk indicates variables of different

-. in. .i pes. This approach is based on multiple linear regression of a quantitative

phenotype on .-, -n.ir pe in segregating generations obtained from line crosses.

The improved method, called composite interval mapping (CIM) by Zeng (1994),

di pl-,1- increased power in QTL detection because of reduced genetic background

noise (i.e., residual variance).

When only one QTL is considered at a time, it could bias QTL identification

and estimation if indeed multiple QTL are located in the same linkage group

(Jansen 1993; Zeng 1994). To deal with these problems and to further improve

QTL mapping precision, Kao et al. (1999) proposed using multiple marker intervals

simultaneously to map multiple QTL of epistatic interactions throughout a linkage

map. The developed model, called multiple interval ini',''.: ('\I \I), has the form

yj p+ ajxij +Z6jk("', )+ C, i 1,2< .,n
j=1 j1k

where p is the overall mean; xij is coded as 1/2 and -1/2 if the genotype of QTL

Qj is QjQj or Qjqj for the backcross design, respectively, aj is the corresponding

mean effect of Q ; and a',, is the epistatic effect between Qj and Qk; 6jk is an

indicator variable for epistasis between Q, and Qk; and ci is the environmental

error, which is assumed to follow N(0, a2). The MIM model has several advantages

(Kao et al. 1999): 1) it can increase sensitivity; 2) it can estimate epistatic effects

(i.e., interactions of alleles at different QTL). A stepwise model selection procedure

was applied in the CIM and the MIM model to select candidate QTL.

Besides these pioneering QTL mapping approaches, an upsurge of QTL

mapping methodologies have been developed to consider various situations

regarding different marker types (dominant or codominant), different marker

spaces (sparse or dense), different experimental designs (F2/backcross or full-sib

family), or different mapping populations (autogamous or allogamous) (Wu et

al. 2002b; Lin et al. 2003). Statistical methods pl i, a key role in QTL mapping

including model development and parameter estimation. The common statistical

methods used for QTL mapping include regression analysis (Haley and Knott 1992;

Xu 1995), maximum likelihood (Lander and Botstein 1989; Zeng 1994; Kao et al.

1999), and the B i,v -i oi approach (Satagopan et al. 1996; Sillanpaa and Arjas

1999; Xu and Yi 2000). Many of these mapping methods have been crucial in

identification of QTL responsible for variations in various complex traits important

to agriculture, forestry, biomedical or biological research (Tanksley 1993; Wu et al.

2000; M ,.1: iv 2001; Mauricio 2001).

However, the above-mentioned QTL mapping approaches only consider a

trait measured at a single time point. In nature, many traits display a dynamic

developmental pattern in their life span. Genes controlling these complex processes

may function during part or all of the developmental process. To further consider

the dynamic gene effect over time, new statistical mapping approaches are needed.

1.4 Functional Mapping

In nature, most economically and biomedically important traits (such as grain

yield, milk production, tumor size, and drug response) show continuous variation,

usually affected by a network of small-sized genes functioning in a coordinated

manner, and also affected by environmental stimulus in a complicated way. Thus,

identification of specific QTL involved had alvbi -, been difficult and imprecise.

The phenotypic formation of a quantitative trait is controlled by genetic and

environmental factors, also regulated by a suite of developmental signals. This

is evident in the fact that all traits undergo developmental ontogeny, and change

with time. Kirkpatrick and Heckman (1989) called such time-dependent traits

infinite-dimensional characters; Pletcher and Geyer (1999) called them function-

valued traits. Although the relationship between genetic control and development

is statistically difficult to elucidate, some key difficulties were overcome by Wu

and colleagues (Wu et al. 2002a, 2003, 2004a, 2004b; Ma et al. 2002). They have

proposed a general statistical framework (i.e., functional in',i' or FunMap) to

genome-wide map specific QTL that determine the developmental pattern of a

complex trait.

The basic rationale of functional mapping lies in the connection between gene

action or environmental effects and development by parametric or nonparametric

models. Functional mapping maps dynamic QTL responsible for a biological

process that is measured at a finite number of time points. In response to

gradients of a continuous environmental factor (such as temperature, light

intensity, or nutritional concentration), a variety of mathematical equations

have been developed, either empirically or through theoretical derivations based

on fundamental principles, to model the phenotypic biological processes. For

example, a series of growth equations were derived to describe growth in height,

size, or weight (Fig. 1-1; von Bertalanffy 1957) that occur whenever the anabolic

S5 -
0
( 4-

3-

2

1 2 3 4 5 6 7 8 9 10
Time

Figure 1-1. Simple growth curve

or metabolic rate exceeds the rate of catabolism. Based on fundamental principles

behind biological or biochemical networks, West et al. (2001) mathematically

proofed the universality of these growth equations. With mathematical functions

incorporated into the QTL mapping framework, functional mapping estimates

parameters that determine shapes and functions of a particular biological network,

instead of directly estimating the gene effects at all possible time points. Because

of such connections among these points through mathematical functions, functional

mapping strikingly reduces the number of parameters to be estimated. Hence,

it di-p-,1 increased statistical power and is advantageous because biological

principles are embedded in the process of estimating QTL parameters.

From a statistical perspective, functional mapping is a problem of jointly

modelling mean-covariance structures in longitudinal studies, an area that has

1999, 2000; Daniels and Pourahmadi 2002; Pan and Mackenzie 2003; Wu and

Pourahmadi 2003). Unlike general longitudinal modelling, functional mapping

integrates parameter estimation and test process in a biologically meaningful

and mixture-based likelihood framework. Instead of directly estimating the gene

effects at all possible time points, it estimates parameters determining shapes and

functions of a particular biological network.

Table 1 1 shows a typical data structure for FunMap for a backcross design

in which r is the total number of measurement time points. The phenotypic traits

are measured at different time points and could be modelled as a function of time.

Biologically meaningful mathematical functions could be adopted in the modelling

procedure. The repeated longitudinal measurement for each individual then could

be described by some functions,

yi mj + e i = 1,.. n and j 1,... ,J (1 -1)

where mj(tk) f(tk) (k = 1,--- ). By estimating the genotype-specific

parameters contained in mj, that describe reaction norm across environmental

gradients, functional mapping allows the assessment of environmental impacts on

genetic variation in complex traits and testing of the different hypotheses that

mediate phenotypic diversity. The results derived from functional mapping will be

closer to biological realms.

Sample Marker ._ I ,' rpes Phenotypes
nf M, M. .. M1 y(1) y(2) ... y(-r)
1 MIMI M11 i 1f Mk yi() yi(2) ... yi(r)
2 Mimi 1.f. f Mk Y2 ((t) Y2 (2) y2 7r)
3 MIm MIf 1n i 1, i3 y3() Y3(2) Y3 (7)

n MMI M.M 1i. M ,, yi() y,(2) ... y.(r)

1.5 Programmed Cell Death

Organisms consist of thousands of cell types originating from a fertilized

egg cell. Starting from a single cell, cell numbers increase dramatically during

development, to form the various tissues and organs. Meanwhile, parallel with cell

formation, another type of cell activity is cell death (a normal process regulating

an appropriate number of cells in the tissues). This controlled or programmed

elimination of cells is called programmed cell death (PCD). PCD is a physiological

cell suicide process that occurs during the development of most organisms (Ameisen

2002). Unlike necrosis-a form of nonphysiological cell-death process that results

from acute tissue injury involving cell swelling, lysis, and inflammatory leakage

of cell contents, PCD is carried out in a regulated process that generally confers

advantages during an organism's life cycle.

In plants, PCD is essential for development and survival, for example, the

senescence of various tissues (such as falling of leaves in autumn and dying of tillers

after a certain time point in rice) that helps plants to redistribute nutrients and

efficiently use resources during development. In animals, PCD is a cell suicide

mechanism that enables metazoans (multicellular animals) to control cell number

and eliminate cells that threaten the animal's survival (Elis 1991). In addition,

PCD is an important mechanism in defense against pathogens in both animals

and plants, and serves fundamental functions during both plant and metazoa

tissue development. The advantage of PCD is obvious. However, inappropriate

activation of PCD may cause or contribute to a variety of diseases, such as

acquired immunodeficiency syndrome (AIDS) (Laurent-Crawford et al. 1991),

neurodegenerative diseases, and ischemic stroke (Raff et al. 1993).

PCD is involved in multiple developmental stages of any organism (such as cell

differentiation; proliferation; ..in:-. and the final stage, death). The physiological

mechanisms of a PCD process are complicated and are still under intensive study

(Steller 1995; Pennell 1997; Jacobson et al. 1997; Vaux 1999). The 2002 Nobel

Prize in Medicine was awarded to Sydney Brenner, H. Robert Horvitz, and John

E. Sulston for their discoveries of the genetic regulation of programmed cell

death. These scientists identified several key genes such as nuc-1, ced-3 and ced-

4 genes that regulate organ development and programmed cell death in model

organisms C. 1. 1. 1- and pointed out that corresponding genes may exist in higher

species, including human being. Their seminal work established the foundation

of cell suicide as a normal physiologic process. The implications are wide ranging

including understanding organ development and cancer. We may be able to treat

diseases by stimulating the cellular "suicide pis o;i ,'i If the abnormally dividing

cells can be killed off by using certain drugs to trigger the "self-destruct" programs,

it would be an efficient way to treat cancer.

Since PCD is so important in biology, it would be essential to derive new

statistical models for identifying genes that control this process. As discussed

above, QTL mapping based on molecular markers has proven powerful to elucidate

the relationship between the genotype and ph. i, '1 v.pe in complex systems. It will

be a valuable tool for understanding the genetic architecture of a complex PCD

process.

Because of its unique development pattern, study of PCD must be coordinated

with other physiological processes, such as growth occurring in the tissue or

individual (Kuriyama and Fukuda 2002). Although no specific mathematical

functions are available to describe the shape of the PCD process, as a balance of

cell-division and cell-death signals (Kerr and Harmon 1991), PCD can be broadly

divided into two different phases: growth and death (Fig. 1-2). The growth phase

that specifies an increasing process of cells can be described by a logistic curve.

The death phase (with cell death exceeding cell birth) corresponds to a decaying

process that cannot be defined by a specific curve. More naturally, the PCD

process may not follow any particular behavior pattern during ontogeny. Another

natural way to model this process is to treat the entire developmental process as

one single curve and fit it with polynomials. This modelling strategy has great

8-

7-

6-

a5-

4-

3-

2--Growth phase -- Death phase -

1 2 3 4 5 t* 6 7 8 9 10
Time

Figure 1-2. Simple programmed cell death curve

flexibility but has certain limitations when one intends to test the effect of QTL

at certain developmental stages. Two different modelling strategies are detailed in

Ch'! pters 3 and 4. Different modelling strategies give a clear picture of the dynamic

developmental pattern of PCD.

1.6 Objectives

The process of PCD is typically a complex trait, controlled by multiple

genes and various developmental and environmental factors. CI'! i o ,terizing

the genetic architecture of dynamic PCD trait will improve our understanding

of the developmental process of organisms and the etiology of diseases due to

inappropriate activation of PCD. To our knowledge, no studies have thus far

examined the genetic mapping of QTL controlling PCD.

In this study, I derived a statistical framework for mapping PCD-specific QTL

within the context of functional mapping. Two different approaches were proposed

to model the developmental PCD curve. The first approach was to model the

entire developmental process nonparamrnel ,'1 -l'/ based on the orthogonal Legendre

polynomial (LEP). This modelling approach shows great fitting flexibility when

fitting complicated curves comparing to parametric approach. The second approach

incorporates a parametric function of the growth phase and a nonparametric

function of the death phase into the functional mapping paradigm, which leads

to a so-called semiparametric model. Thus, unlike the nonparametric approach,

this model capitalizes on the biologically meaningful information for cell growth

to better characterize the process of PCD. Also unlike previous parametric models

(\!, et al. 2002; Wu et al. 2004a, 2004b), the semiparametric model has better

flexibility, and could precisely and efficiently model the PCD dynamic process.

With the LEP, one statistical issue is to select the best order. Order selection

represents a statistical challenge when a model is built under the functional

mapping framework. Higher orders .1.v i give a better fit, but are less efficient

for parameter estimation. Lower orders are efficient but lose flexibility. We

need a parsimonious model that balances flexibility and efficiency. A number

of order-selecting criteria were proposed and their performance was compared.

Another statistical challenge for functional mapping is to model the structure

of the covariance matrix among time-dependent observations. Ma et al. (2002)

used the stationary first-order autoregressive (AR(1)) model to structure the

time-dependent changes of variances and correlations. I proposed several covariance

structure model strategies (based on the nature of PCD process) to better

capture the intrinsic pattern of time-dependent variation and the intra-individual

correlations.

CHAPTER 2
GENERAL MAPPING FRAMEWORK

2.1 Genetic Experiment Designs

In QTL mapping studies, most experiments begin with two pure-inbreeding

lines that have different traits of interest. The two lines result from intensive

inbreeding, which gives homozygous alleles at all loci. The result of crossing these

two lines is the first filial (Fi) generation. All the Fi generations are heterozygous

and are genetically identical. The backcross generation is produced by crossing Fi

with one of its parental lines. Hence, there will be two different genotypes for a

backcross generation. The current study is based on a backcross design, initiated

with two contrasting ', ii, v.-ous inbred lines. However, the developed models

can be easily extended to other cross designs (F2 or RIL) or natural populations.

This backcross population of size n has two groups of .n. i,' ,pes at a locus. A

marker-based genetic linkage map was constructed, to identify QTL affecting

a developmental PCD trait. Consider two flanking markers (M4,, and AM,+1)

_.1 inr ,iped from a backcross plant. The recombination fraction between these two

markers is denoted by r. Suppose there is a putative QTL (denoted by Q) located

between these two markers (measured by the recombination fraction r, with M,,

and r2 with A4M+i). The pl.1 i ..I l.pic PCD traits (such as cell counts or other

quantitative measurements at tissue or organ level) are thought to be genetically

controlled by this QTL. In interval mapping, we will use two flanking markers with

known genotypes to infer the QTL that is hypothesized to be located between the

two markers. The recombination fraction is a linkage parameter that can describe

the genetic distance between two given loci, and is defined as the frequency of

crossing over between two loci. Based on the segregation and transmission of genes

from the parent to progeny, one can derive the conditional probabilities of an

unknown QTL genotype, conditional upon the known marker genotypes, in terms

of these recombination fractions. The conditional probabilities (- .) of a backcross

individual i (i = 1,... n) who carries a QTL ,.1 vpe conditional upon four

flanking marker ., ,. ,l vpes f. 1 ,.1 1. ,+11,- [, if1.. f[.- ,i,+ Tn i1,,I ,. [ +11 i ,

if .u1 1., [T1+i is given in Table 2-1.

Table 2-1. Conditional probabilities of QTL ,. i,.' .vpes given on the flanking
markers in a backcross design

Marker genotype QQ Qq
(1-r1)(1-r2) rlr2
,+ 1-r 1-r

,2,. Im 2 (-rl (1 -r2)

where ri and r2 are the recombination fractions between the QTL and marker
MA. and MA,+I, respectively, and r is the recombination fraction between the two
markers.

Other cross designs or natural populations have different conditional

probability tables. Conditional probability measures the possibility of recombination

between the putative QTL and the flanking markers during meiosis. Therefore,

these probabilities determine the segregation pattern of the putative QTL. My

study use the Haldane map function (assuming no interference) to convert the

map distance cM (centimorgan) to recombination fraction r through the function

r (1 e-2d/100).

2.2 Statistical Models

2.2.1 Finite Mixture Models

In general, if one knows the QTL genotype, then finding a significant QTL

is equivalent to doing a two-sample t test for a backcross; and an ANOVA test

for an F2 population in the univariate case, or Hotelling's t-test for multivariate

data. However, in the real world, one can only observe the marker genotype

and the phenotypic trait data. The QTL genotype is missing. This produces a

statistical missing data problem how to set up the relationship of observed data

with missing QTL? One of the main tasks for QTL mapping is to build statistical

models that connect the phenotypic trait with the unobservable QTL data through

the observed marker data. The statistical foundation of QTL mapping lies in a

mixture model in which each observation y is assumed to have arisen from one

of a known or unknown number of components. Assuming that there are J QTL

-,. .:I ,,pes contributing to the variation of a quantitative trait, this mixture model

is expressed as

y ~ p(y /, ;, TI) = 7rwif(y; i, 0 ) + + jf((y; j, I), (2-1)

where y is the response; = (Q-i, 7wj)' are the mixture proportions (i.e., QTL

i. i..I vpe frequencies) which are constrained to be nonnegative and J 1 =7 1;

p = (1i, pj)' are the component (or QTL genotype) specific parameters, with

pj being specific to component j; and TI are parameters which are common to all

components.

In practice, the data are only observed at a finite set of times, 1,... 7-, rather

than a continuum, so we have only a finite set of data for each individual i, which

can be considered as a multivariate trait vector, y = ',/ (1), y ,(r)]. There are

two i, ii r assumptions: (1) independence of response between subjects; and (2)

multivariate normality of responses. Both assumptions make it feasible to construct

the likelihood function and make corresponding inferences. The density function for

each individual is expressed as

/(yi; m, Q) -(2 1 12 exp (yi mj)TX- (y, m) (2-2)
(27)7/21 / e/ 2 -

where mj = [mj (1), mj ()] is a vector of expected values for QTL ,-. vi,.1 vpe j

at different points. At a particular time t, the relationship between the observation

and expected mean can be described by a simple linear model

J

j 1

where (ij is the indicator variable denoted as 1 if a QTL genotype j is considered

for individual i and 0 otherwise; ei(t) is the residual error (i.e., the accumulative

effect of polygenes and errors) that is iid (identically and independently distributed)

normal with mean zero and variance a2 (t). The within-subject measurement is

correlated and the errors at two different time points, t1 and t2, are characterized

with correlation p(ti, t2)-

2.2.2 Likelihood Function

In my study, each observation was a repeated measurement with a certain

correlation structure. Because of the independence assumption between subjects,

the likelihood function of the PCD trait, y, and the marker .' n .1 vpe, M, at the

hypothesized QTL location can be expressed for the backcross as

L(Q|y, M) [7fTrifi(yi|[) + 7Tolfo(yi|[)] (2-4)
i= 1

where 7ri and 0Toi are the mixture proportions corresponding to the frequencies

of different QTL .., -.1 ivpes for individual i. Because the marker genotype of

each individual is known, these mixture proportions can be actually expressed

as conditional probabilities (Table 2-1). The unknown vector f2 that defines

the mixture model (Eq. 2-1) contains two sets of parameters, one defining the

co-segregation between the QTL and markers and, thereby, the location of the QTL

relative to the markers, denoted by Qfi, and the second defining the distribution

of the PCD trait for each QTL genotype, denoted by 2q = (f2, f,), where fm

defines the mean vector at different time points and 2, defines the covariance

matrix among different time points. The MLEs of these unknown parameters

is obtained through maximizing the likelihood function (Eq. 2-4) by using EM

algorithm (Dempster et al. 1977) or other optimization packages such as Simplex

2.2.3 Modelling the Mean Vector

To estimate the parameters of the likelihood function implemented with PCD

data measured at multiple time points, one can extend the traditional interval

mapping approach to accommodate the multivariate nature of time-dependent

traits. However, this extension is limited in three aspects: (1) Individual expected

means of different QTL genotypes at all time points and all elements in the matrix

E need to be estimated, resulting in substantial computational difficulties when

the vector and matrix dimensions are large; (2) The result from this approach

may not be biologically meaningful because the underlying biological principle

for developmental character is not incorporated; (3) The approach cannot be well

deploy, .1 in a practical scheme because of (2). Thus, some biologically interesting

Based on these concerns, a new statistical framework has been developed to

detect QTL affecting growth trajectories (Wu et al. 2002a, 2004a, 2004b; Ma et

al. 2002). This framework allows for the modelling of the time-dependent expected

means of QTL genotype j using a growth equation

p (t)= g (t; Qm), (2-5)

which is specified by a set of curve parameters array' '1 in fi. All living entities

are characterized by growth defined as the irreversible increase of size with

time. A number of mathematical models, such as logistic or S-shaped curves,

have been proposed to describe growth trajectories. The fundamental biological

aspects of mathematical modelling of growth curves have been founded by earlier

mathematical biologists, e.g., von Bertalanffy (1957), and recently revisited by

West et al. (2001).

Built under the functional mapping framework, we extend the principle of

functional mapping and developed PCD specific modelling strategies to map PCD

genes. The overall form of the PCD curve of QTL genotype j is determined by the

set of curve parameters contained in f2,. If different genotypes at a putative QTL

have different combinations of these parameters, this implies that this QTL pl ,-

a role in governing the differentiation of PCD trajectories. Thus, by testing for the

difference of f,j among different genotypes, we can determine whether there exists

a specific QTL that confers an effect on PCD curves.

The time-dependent PCD means for different genotypes, pj(t), can be used

to estimate the dynamic changes of various genetic effects, including the additive,

dominant and epistatic effects as well as the interaction between these effects

and environmental factors. For example, for a backcross population containing

two genotypes at a QTL, coded as 1 for QQ, 0 for Qq, the estimates of the

time-dependent additive effect at a particular time point t is given by a(t) =

pi(t) po(t) during PCD trajectories. The time-dependent epistatic and i vipe
environment effects can be characterized in a similar way if a multi-QTL model

is assumed or if an experimental design with multiple environments is used. The

similar settings can be applied to F2 population or other experimental cross

designs.

In this study, we propose two different v--- to model the developmental mean

process:

1. modelling the mean curve using the orthogonal Legendre polynomial (LEP),

2. joint modelling the mean process with logistic function and LEP.

Each modelling strategy has its own advantages. The first model has great

flexibility in terms of capturing the intrinsic development pattern when it is hard

to characterize the development process with any currently available mathematical

functions. In nature, there are variety of variations for different traits. In most

cases, characterizing their development pattern by some mathematical function

is impossible. LEG has been applied in animal science study to fit test-d v,

models in dairy cattle and shows great flexibility than the regular polynomial

(Pool et al. 2000a). The Legendre polynomials, as used in this study, have the

benefit that: (1) the functions are orthogonal, which is useful for analyzing

patterns corresponding to different -., n.' vpes; (2) higher orders were estimable

when conventional polynomials failed because of better convergence (Pool et

al. 2000b). However, there are certain limitations of this modelling method.

The estimated parameters have no specific biological meanings and are hard to

interpret. Moreover, over-fitting may result in capturing too much noise which is

due to environment influences.

The second approach models the entire developmental process as two separate

stages. The growth phase is modelled by the logistic growth function which fits

a universal growth law (West et al. 2001). The death phase is described by a

death function which is modelled by the LEP. The developed model is called a

semiparametric model, which combines the parametric and nonparametric approach

together. It is unique in that it captures the biological characteristic of the growth

phase with great flexibility to describe the death phase. Moreover, this joint

modelling approach allows us to do different hypotheses testing which will be

described in detail in Chapter 4.

2.2.4 Modelling the Covariance Matrix

Modelling a covariance matrix E for a longitudinal data set is difficult because

of: 1) the possible high dimensionality of the problem, and 2) the constraint that E

must be positive definite (Pourahmadi 1999).

Many standard v--,v- to model a covariance structure are available in

the published literature. For example, the simple covariance structure (SIM)

specifies that observations are independent between and within subject and

have homogeneous variance. SIM is easy to implement in terms of parameter

estimation but shows no power to model the intrinsic correlation for repeated

measurement. The unstructured covariance structure (UN) is another easy method

of modelling the covariance matrix. However, it is not parsimonious to estimate

all the elements in the within-subject covariance matrix among different time

points, especially when the dimension of measurements is large. The compound

symmetry (CS) structure has been used in numerous longitudinal studies. CS

assumes a homogeneous correlation structure within subject. It is efficient in

terms of parameter estimation, but may not be flexible enough to capture the

intra-individual correlation. Other structures such as the MA (Moving Average),

ARIMA (Auto-Regressive Integrated Moving Average), RC (Random Coefficients)

or AD (Antedependence) have also been proposed as models of covariance structure

(Littell et al. 2000; Zimmerman and Nilnez-Ant6n 2001).

Unlike the nonparametric modelling of the mean curve structure, nonparametric

covariance modelling has been little studied. Most authors have considered only

the stationary case (e.g., Glasbey 1988; Hall et al. 1994). The nonstationary case

was considered only recently (C'!, i. 1995; Diggle and Verbyla 1998). Diggle and

V. i ,1 i (1998) used kernel-weighted local linear regression smoothing techniques

to provide a nonparametric estimator for the covariance structure without

assuming stationarity. C'!. ii (1995) used kernel estimators to estimate covariance

functions. Kirkirpatric (1994) applied the LEP to model the covariance structure

for longitudinal data. This approach attempts to reduce the dimension of the

covariance matrix by detecting the most important factors that can explain most

of the information contained in the original covariance matrix. By modelling the

covariance as a function of time, this approach shows its uniqueness by smoothing

the covariance matrix.

Modelling the covariance structure of developmental PCD traits can be

accomplished either parametrically or nonparametrically. Due to the unique

development pattern of the PCD trait and the statistical challenges of modelling

the complicated covariance structure, only the parametric approaches were

developed in the current study with the nonparametric modelling approach left

for future work. We propose three modelling strategies with each one implementing

under different situations.

AR(1) covariance model. A common model for covariance structure is the

first-order autoregressive [AR(1)] structure (Diggle et al. 2002), expressed with

a 2(1) a2) a2

for the variance, and

o-(t1, t2) 2plt2-tlI

for the covariance between any two time intervals t1 and t2, where 0 < p < 1 is

the proportion parameter with which the correlation decays with time lag. The

parameters that model the structure of the (co)variance matrix are a iv d in f,.

Clearly, the AR(1) model assumes the homogeneity of variance.

However, for longitudinal PCD traits, the homogeneity assumption is easily

violated. To remove the heteroscedastic problem of the residual variance, two

approaches can be used. The first approach is to model the residual variance by a

parametric function of time, as originally proposed by Pletcher and Geyer (1999).

But this approach requires additional parameters to characterize the age-dependent

change of the variance. The second approach is to embed Carroll and Rupert's

(1984) transform-both-sides (TBS) model into the growth-incorporated finite

mixture model (Wu et al. 2004b), without using any additional parameters. Both

empirical analyses with real examples and computer simulations si,--L. -1 that

the TBS-based model can increase the precision of parameter estimation and

computational efficiency. Furthermore, the TBS model preserves the original

biological means of the curve parameters although statistical in &I -,-. are based

on transformed data. The variance is greatly stabilized by log transformations.

Hence the AR(1) model can be applied to model the PCD covariance structure

implemented with the TBS modelling strategy. Model performance will be

evaluated through simulation study.

SAD(r) covariance model. The TBS-based model has the potential to relax

the assumption of variance stationarity, but the covariance stationarity issue

remains unsolved. Zimmerman and Nuinez-Ant6n (1997) proposed a structured

antedependence (SAD) model to model the age-specific change of correlation

in the analysis of longitudinal traits. The antedependence (AD) model was

initially proposed by Gabriel (1962) with no structure restriction. The random

variables Xi, X, indexed by time, are said to be AD(r) if the conditional

distribution of Xt given Xt-1, Xi depends on Xt-1, Xt-, for all t > r

(Gabriel 1962). This concept is equivalent to XI, X having a Markovian

dependence of order r (Diggle et al. 1994, p. 85). The SAD model is an extension

of the AD model created by putting special structure on the model. Due to the

flexibility of modelling the nonstationary structure, the SAD model has been

employi- 1 in several studies and di -pl-1- many favorable properties (Zimmerman

and Nuinez-Ant6n 2001). A detailed modelling strategy will be described in the

following chapters.

Mean-covariance model. Correlation among observations on a given individual

is a common feature. In many cases, a systematic pattern of correlation is evident,

which may be characterized by a relatively simple model. Suppose that

Corr(e) R(p)

where the correlation matrix R(p) is a function of correlation parameter p. The

choice of a suitable correlation matrix depends on the nature of the repeated

measurement factor as well as the modelling requirement (for example, the positive

definite requirement). When variance is constant across the response range and

equal to some value a2, then

Cov(e) = 2R()

However, in most situations, the heterogeneity of intra-individual variance may

be evident. This happens when measurements are taken over time and variance

tends to fluctuate with the level of responses. For example, a PCD trait shows

a unique developmental pattern as the variations at different time points keep

changing as the mean changes. A more general intra-individual covariance structure

needs to be specified.

A natural relationship that appears to be evident for repeated measurements

data is a mean and variance relationship. The variances can be modelled as

functions of the means. Define a variance function v = v(Qm) and define the

diagonal variance matrix as

V(m) diag [v2 (m 1), U2( tk)]

where Qm is a vector of parameters for the mean function; v2(Qmti), *, 2(m, tk)

is the variance function related to different time points. Using the correlation

matrix R(p) defined above, the covariance matrix can be specified as

Cov(e) -= 2V(Qm) 1/2R(p)V(Qm)1/2

This covariance structure shows a great deal of versatility in characterizing

uncertainty in intra-individual response (Davidian and Giltinan 1995). It allows

separate consideration of heterogeneity of variance and the pattern of correlation. If

the defined correlation matrix R(p) is positive definite, then the covariance matrix

is guaranteed to be positive definite. By choosing appropriated variance function v

and correlation matrix R(p), the developed covariance structure can better capture

the intrinsic intra-individual variation. A detailed modelling strategy for PCD

traits will be discussed in C'! Ipter 4.

2.2.5 Computational Algorithms

EM algorithm. We implemented the EM algorithm, originally proposed by

Dempster et al. (1977), to obtain the maximum likelihood estimates (MLEs)

of three groups of unknown parameters in a QTL mapping model. That is, the

QTL-segregating parameters (Qf) that specify the co-segregation patterns of QTL

and markers in the population, the curve parameters (f m) that model the mean

vector for .-., .n vipe j, and the parameters (f ) that model the structure of the

covariance matrix. These unknowns, denoted collectively as 2 = (fQs, f,, Qf),

correspond to mixture proportions, component-specific mean parameters and

common (co)variance parameters, respectively, in the mixture model described by

Equation (2-1). The first derivative of the log-likelihood function with respect to

specific parameters contained in f2 is given by

9O, j1 (yi )

2 log fj(Yii )
i= 1 j= 1 Yj= 17 fyilQ) 0ki
n 2 0
S II log fj(yi l )
i= 1 j=1

where we define
.fj(yil )
Ei Z 1T ,fj(Yil)
In the E-step, the conditional probabilities 7 (treated as priors) of the QTL

,-.1_i vipes given the marker genotypes and the normal distribution function are

used to calculate ILi which could be thought of as a posterior probability with the

ith individual carrying the jth QTL genotype. Conditional on HII = {j}, we solve

for the zeros of a log (fQ) to get the estimates of Q (the M step).

Briefly, given initial values for f2, we calculate the posterior probability of H

(E-step), then condition on IIj we update the parameter estimates contained in f2

(\ !--I' p). The process is iterated for many times until the specified convergence

criteria is satisfied. The values at convergence are treated as the MLEs. The MLEs

of the mean and covariance parameters are strongly consistent under the equal

covariance matrix assumption (Redner 1981). A detailed derivation of the MLEs

for the nonparametric model is given in Appendix A.

EM-Simplex algorithm. When there is no closed form for some of the

parameters, one can use an integrated EM-Simplex algorithm. The Nelder-Mead

1965), can be used to estimate the mean and residual (co)variances parameters

(Zhao et al. 2004). The simplex algorithm is a direct search method for nonlinear

unconstrained optimization. It attempts to minimize a scalar-valued nonlinear

function using only function values, without any explicit or implicit derivative

information. The algorithm uses linear adjustment of the parameters until some

convergence criteria are met. The term -will! ::" arises because the feasible

solutions for the parameters may be represented by a polytope figure called a

-1iplI!. ::". The simplex is a line in one dimension, triangle in two dimensions

and tetrahedron in three dimensions, respectively. For the joint model described

in C'! Ilpter 4, EM-Simplex algorithm will be used to estimate the mean and

covariance parameters.

In practical computations, the QTL position parameter can be viewed as a

fixed parameter because a putative QTL can be searched at every 1 or 2 cM on

a linkage map interval bracketed by two markers throughout the entire genome.

The amount of support for a QTL at a particular map position is often di- ph i', '1

graphically through the use of likelihood ratio maps or profiles, which plot the

likelihood ratio test statistic as a function of map position (or test position) of the

putative QTL. This plot is called a likelihood ratio (LR) profile plot. The peak of

the profile that passes certain significant threshold corresponds to the position of

the QTL over the genome.

2.2.6 Hypothesis Testing

One of the most important biological advantages for functional mapping

is that it provides a quantitative and testable framework for understanding

the relationship between gene action and various developmental events (Wu et

al. 2004a). For example, when does the QTL turn on or off to determine the

development process during time course? Under the current settings, our model

allows for a number of hypothesis tests to examine the genetic control of PCD

processes. Testing whether there exist specific QTL to affect the PCD process

is a first step toward the understanding of the detailed genetic architecture of

complex phenotypes. After the MLEs of the parameters are obtained, the existence

of a QTL affecting an overall PCD curve should be tested by formulating two

alternative hypotheses

Ho: j nm t'= 1,--, J
(2-6)
H, : at least one of the qualities above does not hold

where Ho corresponds to the reduced model, in which the data can be fit by a

single PCD curve, and H1 corresponds to the full model, in which there exist

different PCD curves to fit the data. The test statistic for testing the hypotheses in

Equation (2-6) is calculated as the log-likelihood ratio (LR) of the reduced to the

full model

LR -2[logL(Q|y, M) logL(f2y, M)],

(2-7)

where f and 2 denote the MLEs of the unknown parameters under Ho and H1,

respectively.

By searching the entire genome by every 1 or 2 cM, the hypothesis test is

conducted at each test position. A LR profile plot is generated across the entire

genome and significant QTL is declared at a particular position if the peak at that

position passes certain significant threshold by a permutation test (C'li i !i!! and

Doerge 1994).

Other hypothesis tests can be about the control of a QTL over growth or

death at specific time points or time intervals, the timing at which an organism

starts to kill itself, or the timing of other developmental features of interest,

whether the same QTL pleiotropically affects the growth and death phases, how

the QTL mediate the transition point of the two phases. All these tests are helpful

to address important biological questions related to the genetic architecture of PCD

traits.

2.2.7 Asymptotic Sampling Error

After the point estimates of parameters are obtained by the specified

algorithm, we derive the .,-vmptotic variance-covariance matrix and evaluate

the sampling errors of the estimates ( ,, Qv). The techniques for doing so involve

calculation of the incomplete-data information matrix which is the negative

second-order derivative of the incomplete-data log-likelihood. The incomplete-data

information can be calculated by extracting the information for the missing data

from the information for the complete data (Louis 1982). A different so-called

supplemented EM algorithm or SEM algorithm was proposed by Meng and Rubin

(1991) to estimate the .,-, iii.Ilic variance-covariance matrices, which can also be

used for the calculations of the sampling errors for the MLEs of the parameters

(Q ).

Another simple approach for estimating the .. i,-,_ild. ic variance-covariance

matrix is to use numerical derivatives (Press et al. 1992). The partial derivative of

a given function with respect to some parameters x and y is given by

a2f [f(x +hi,y+ h2) f(x + hi,y h2) [f(x hi,y + h2) f(x -hy- h2)
OxOy 4h1h2

If x and y are the same parameters, the above formula simplifies to

2f f(x + h)- 2f(x- h) + f(x)
x.2 h2

here f is the log-likelihood function; x and y are any parameters of interest in the

current model; h is a gradient parameter which is chosen to make L close to 0.

Taking the inverse of the negative values of the partial derivative matrix, one can

obtain an estimate of the ,-il-, l ic sample (co)variance matrix. The diagonal

elements are the -, I ~ i ..ic sample variance for the estimated parameters.

CHAPTER 3
A NONPARAMETRIC APPROACH FOR FUNCTIONAL MAPPING OF
PROGRAMMED CELL DEATH

3.1 Introduction

Functional mapping has been utilized as a powerful statistical method for

detecting quantitative trait loci (QTL) that affect complex traits undergoing

developmental changes (\! I et al. 2002; Wu et al. 2003). By incorporating various

mathematical functions into the mapping framework, it has a variety of flexibility

in mapping genes that underlie complex traits. It was originally developed

for mapping the /;;;1,,/iK.. effects of QTL on growth trajectories. However, its

usefulness is not restricted to only model growth curves. It has been extended to

study many genetic mechanisms of biological or biomedical processes including

allometric scaling, tumor progression, HIV dynamics and drug response (Wu et

al. 2002b; Wang et al. 2004; Liu et al. 2005; Lin et al. 2005). Simulation and

real data a analysis from these studies indicate that functional mapping has

high power to detect lt;;.;:,i'.. QTL effects. However, in real life, many biological

characteristics showing a specific development pattern cannot be explained by any

currently available parametric mathematical functions or equations, for instance,

programmed cell death (PCD). To better understanding the /;;;.,i/K.. gene effect

underlying complex longitudinal PCD traits, new statistical models need to be

developed.

PCD, a physiological cell suicide process, occurs during the development of

most organisms and is typically a complex longitudinal trait (Ameisen 2002).

It is involved in multiple developmental stages of organism: cell differentiation,

proliferation, aging and dying and acts as a defense mechanism against disease

or virus attacks, balance organism's metabolism (Elis 1991; Pennell 1997). The

identification of genes that contribute to the growth of cells has been one of the

most important and difficult tasks for PCD study. With developed functional

mapping and abundant molecular markers, the detection of specific genes affecting

PCD is now possible.

There are two i, i i" issues in functional mapping: (1) how to fit the mean

curve to better capture the dynamic developmental pattern? (2) how to model the

variation at different time points and intra-individual correlation? These two issues

have to be addressed in order to dissect the genetic effects of ni i' QTL. There

have been numerous studies on the developmental growth curve either focusing

on modelling the mean process or the covariance structure. Historically, there are

several classical approaches for modelling growth curve data such as univariate

ANOVA, repeated measures ANOVA, and multivariate regression modelling

approaches. However, unlike the simple growth curve, the developmental PCD trait

represents a dynamic process and di-p',v a different pattern than typical growth

curves. The two-stage unique developmental characteristic presents statistical

challenges on modelling the mean process, especially when the goal is to detect

significant QTL. Traditional parametric approaches, such as parametric regression,

require specific quantitative information about the form of the regression. Hence,

they do not have enough flexibility to capture the nature of this process and have

certain limitations on fitting the developmental curve. Additionally, the classical

analysis-of-variance methods ignore the covariance structure and the classical

multivariate approach estimates the covariance matrix but imposes no structure on

it. Therefore, statistically speaking, they are all inefficient.

Nonparametric approaches for modelling growth mean curves have been

developed, for example, nonparametric regression. Nonparametric regression

using kernel estimators has flourished in the literature (Hart and Wehrly 1986;

Miller 1988; Altman 1990; Fraiman and Meloche 1994; Altman and Casella 1995;

Boularan et al. 1995; Ferreira et al. 1997). These methods treat time as the only

explanatory variable and estimate the mean response curve by smoothing the raw

data. In genetic studies, the random regression (RR) model is another popular

model used for modelling the dynamic additive effect. The Legendre polynomial

(LEP) has been widely used in RR to fit the developmental curve partly due to the

nature of the flexibility of the orthogonal polynomial (Pool et al. 2000a).

Unlike growth curve data, the longitudinal PCD traits not only involve a

growth phase, but also are characterized by a death phase for certain characteristics

such as volume, number and height. The complex characteristic of PCD traits faces

a number of statistical challenges if applying traditional methods, especially

when models are built under the mixture-based likelihood framework. Here,

we develop a model using the LEP with a specific order to fit the PCD curve

and incorporate two covariance structures into the modelling framework for

mapping QTL underlying the complex developmental PCD traits. Order selection

using information-theoretic criteria is provided. The consistency of the selection

criteria is proved and the selection power is evaluated based on simulation studies.

Detailed parameter estimation procedures ware given. Extensive Monte Carlo

simulation studies and a real world example using rice tiller number data are given

to demonstrate the usefulness of the development model.

3.2 The Finite Mixture Model and Likelihood Function

The statistical foundation of the current study is based on a finite mixture

model described in C!I Ipter 2 in Equation (2-1). The mixture proportions are

corresponding to QTL frequencies conditional on the flanking markers. Consider

a standard backcross design, initiated with two contrasting 1'.. ii v.-ous inbred

lines. This backcross progeny contains n individuals, in each of which there are two

groups of-, i.i vi.1pes at a locus. Before conducting the statistical mapping analysis,

two sets of data are collected. A genetic linkage map is constructed with molecular

markers. The longitudinal PCD trait such as cell counts or other measurements at

tissue or organ levels are observed at a finite set of time points for each individual.

These longitudinal traits are continuous in nature and can be modelled by some

parametric or nonparametric mathematical functions or equations. Suppose

there is a putative segregating QTL with alleles Q and q that affects PCD trait,

but with different degrees. The two flanking markers form 4 groups of joint

marker genotypes. The observed data then can be grouped according to these 4

marker genotypes as Yk, k = 1, 4, in each of which there are nk observations.

The independent and multivariate normality assumptions lead to the following

likelihood function

i= 1
4 n
1= fi(yki ) + fo(yk,i)
k= i 1

where 7k refers to the mixture proportion (or QTL frequency) corresponding to

each marker group. The unknown parameters in Q2 contains the mean curve and

covariance parameters which describe the shape of the mean process and covariance

structure. Statistical modelling of the mean and covariance processes is discussed in

the following sections.

3.3 Modelling the Mean PCD Process

3.3.1 Legendre Polynomial

Based on the properties of LEP mentioned in introduction, nonparametric

regression using LEP to characterized the mean process was proposed in this

chapter. The orthogonal LEPs are solutions to a differential equation, the Legendre

equation

(1 x2) 2x + r(r + 1)z 0.
( dx2 dx

The polynomials of order r is denoted by P, (x). The polynomials are either even or

odd functions of x for even or odd orders r. The general form of a LEP of order k

is given by the sum

Pr(x) ( ()k (2r 2r-2k! (3
k 2rk!(r k)!(r 2' ( )k)!

where K r/2 or (r-1)/2 whichever is an integer. This polynomial is defined over

the interval [-1, 1]. Solving Equation (3-1), it is easy to show that the first few

polynomials are

Po(x) 1

Pi(x) x

P2(x) (3X 12 )
2
P3(x) = 1(5 3x)
2
P4(x) = -(35X4 30 +3)
8
1
P(x) = -(63x 70x3 + 15x)
8
P6(x) (231x6 315?4 + 052os 5)

By choosing different orders of orthogonal polynomials, the Legendre function

has the potential to approximate the functional relationships between trait values

and different time points to any specified degree of precision. For the current

model, the independent variable x is expressed as time t, which is adjusted to

match the measurement times to the orthogonal function range [-1, 1], by

2+ 2(t -min)
t= -1 +
tmax tmin

where tmin and tmax are the first and last time points respectively.

With an appropriate order r, the time-dependent genotypic values for different

QTL ii. -n.1 ipes can be fitted by the orthogonal LEP. A family of such polynomials

is denoted by

p(t')=- [Po ),p (t'),--- ,Pr(t1)]

and a vector of genotypic-related, time-independent values with order r is denoted

by

Uj ( ,, Uj,' ,Ujr).

This -i* i.I pe-related vector is called the base ,.. '1,' ;'i.:+ vector and the

parameters within the vector are called the base /,. u.', */ i,' means. A vector of

polynomials P(t') is also called the base function. The time-dependent genotypic

values at time t, mDj(t), can be described as a linear combination of uj weighted

by series of the polynomials, i.e.

m j(t)= u'P(t') (3-2)

Thus, for individual i whose QTL i. I vpe is j, its 1J. I pe means at

different time points can be modelled by the following vector

mjj i = [mTji (1), ,mTjli(rT)] (3-3)

This modelling approach has great flexibility in modelling curves in which the

logistic or other parametric models does not fit. By choosing appropriate order, the

model can better capture the intrinsic developmental PCD trend.

Order selection for LEP. One of the in i. jr concerns using the LEP is to

choose the optimal order. If the degree of the LEP is k (the highest power

of the polynomial), then the order of LEP will be k + 1 (the number of the

coefficients defining the polynomial which is more than its degree). The order

selection problem is similar to select variables in a regression sense. We are

actually selecting the coefficients of the LEP, i.e., the base genotypic vectors.

Normally, a higher order ahv-- gives a better fit. However, if the model contains

too many parameters, it will greatly reduce the model efficiency and reduce the

mapping power. On the other hand, if we fit the curve with lower order, a model

that contains too few parameters will not be flexible enough to approximate

important features in the data. This will give a bias contribution to the misfit

due to lack of flexibility. In both cases, the use of poor or redundant orders can

be harmful. There should be a tradeoff between the number of LEP orders and

model efficiency. It is essential to select the iight" order for LEP without over- or

under-fitting the PCD curve. One choice is to conduct an order selection using the

information-theoretic criteria.

3.3.2 Information-theoretic Criteria

Statistical model selection has been studied for a long time with numerous

topics presented in this area. An important class of selection criteria is the

information-theoretic criteria. The basic principle of information theoretic

model selection criteria is to select statistical models that better fit the data

and also simplify the model. Specifically, information-theoretic methods focus on

minimizing the amount of information required for the model to explain the data.

Consequently, it results in the selection of models that are the most parsimonious

or efficient representations of observed data. In general, there are two types of

model selection criteria. One type is that a model is evaluated based on its ability

to describe not only the present data but also future or unseen observations

from the same process. Examples of this approach are AIC or cross-validation,

which are aimed to selecting models that yield the best prediction. Another type

of approach is the likelihood based approach which selects the model that is

most likely to generate the present observations, without accounting for future

observations. Example of this approach includes the classical BIC.

Generally speaking, most information-theoretic criteria can be considered

as special cases of minimum complexity density estimators proposed by Barron

and Cover (1991), which has the form I(Ylh(O)) + I(h(O)), where the first term

represents the amount of information I required to express the data Y, given the

interested model h with parameters 0; the second term represents the amount of

information required to express the model itself. For an information criterion, the

first term is equivalent to the negative log-likelihood of the data, and the second

term is thought of as a penalty for model complexity.

A general form of the basic information criterion to select the LEP order is

given by

IC,(n) -21nL(m, | Y,r) + c(n)pr(T)

where the first term is the negative maximum log-likelihood of the data Y given

the model parameter estimates; Qm and Q, represent the MLE of mean and

covariance parameters; pr(r-) represents the number of free parameters which is

only dependent on the number of measurement time with order r, so pr(') =

dimension(Qm, Q |r); c(n) is a penalty term.

The goal of information-theoretic model selection is to select parsimonious

models. Hence, models that minimize the criterion are selected. Various criteria

have been proposed based on this general form and they are different in c(n) terms.

Among a variety of information-theoretic criteria proposed so far, the following

several criteria will be discussed.

Akaike's information criterion (AIC). One of the popular information-theoretic

model selection criteria is the AIC information criterion (Akaike 1974). Theoretically,

this criterion derives from consideration of the Kullback-Leibler distance between

a given model and the true model. Specifically, minimizing AIC is approximately

equivalent to minimizing the expected Kullback-Leibler distance between the true

and the estimated density. The AIC value with a particular order r is calculated by

AIC -21nL(Qm, |r) + 2p,(r),(

(3-4)

Since AIC serves as a consistent estimator of the Kullback-Leibler discrepancy

between the distribution that generated the data and the model that approximates

it (Takeuchi 1976), it can be thought of as measuring the relative inefficiency

of approximating the true model by the model of interest. Models with smaller

values of AIC can thus be thought of as more efficiently approximating the true

model, where the true model is unknown. By keeping increasing the LEP order,

the selection method pick up the optimal order r if the AIC value is no longer

increasing.

Corrected AIC (CAIC). Since AIC is only an approximate estimate of the

relative Kullback-Leibler distance and has been shown to be biased in small

samples, a corrected version of AIC, called CAIC, was developed to adjust for this

bias (Sugiura 1978; Hurvich and Tsai 1989). Later on, Burnham and Anderson

(1998) proposed a multivariate generalization of CAIC which is defined as

CAIC -21nL(m,,r) +p+ p(p+v) (3-5)

where n is the sample size, p is the total number of parameters in the mean model,

and v is the number of parameters used to estimate the covariance matrix of the

distribution.

B i-, -i i information criterion (BIC). BIC is another selection criterion from

a B ,i, -i ,i point of view. It is usually explained in terms of B ,i, -i ,i theory,

especially as an estimate of the B- v.i factor-the ratio of the posterior to the prior

odds in comparisons of a model to a saturated model (Schwarz 1978; Raftery 1993).

BIC is defined as

BIC -21nL(Qm, v r) + p,()ln(n). (3-6)

where all the parameters are defined similarly as above. Since our modelling

objective is to select parsimonious models to best describe the dynamic development

process, the BIC information criterion assigns more penalty to the likelihood

function and hence it tends to choose more parsimonious models.

Comparisons of the relative merits of AIC and BIC based on .-i~,e I ic

consistency (as n -- oo) have flourished in the literature. Investigations have

generally demonstrated that BIC is dimension consistent, i.e., tends to choose

the true model with probability equal to one in large samples but performs

poorly in small samples (Hurvich and Tsai 1990; Bickel and Zhang 1992) and

AIC is consistent if the dimensionality of the true model increases with n (at

an appropriate rate) (Shibata 1981). Sakamoto et al.(1986) also noted that the

model selected by AIC criterion is not actually the true model but the one for the

best model fit. To evaluate the model selection criteria, we need to show that the

proposed criteria are consistent.

3.3.3 Consistency of Model Selection Criteria

The consistency of model selection criteria has been shown in the literature

depending on different selection criteria proposed and different models under study

(Guyon and Yao 1999; Wood et al. 2002; Huang and Yang 2004). The consistency

of a selection criterion is defined as the situation where the probability of the

criterion choosing the correct model is 1 as the sample size increases to infinity.

Hence, a model selection criterion is consistent if

P{C!i...-- the better model m2} P{ICk (n) < ICk2(n)} 1,as n -i oo

where we assume that m2 is the correct model; ICk (n) and ICk (n) are the

information criteria for model m1 and m2 respectively with LEP order k, and k2.

To select the optimal LEP order, we have the following two assumptions,

1. The same LEP order for different genotypes,

2. The number of mean parameters are only dependent on the number of
measurement time points 7 which leads to finite parameter space for pr (-).

Stopping rule. We propos a forward selection by increasing one order at a time

until the information criterion ICk(n) does not decrease any more. The order which

gives the minimum ICk(n) will be the optimal order. The following theorem shows

the consistency of the order selection crtieria.

Theorem 1. Under the above two assumptions, with the defined stopping rule,

a model selection criterion ICk(n), is consistent if

c(n)[p (7) pso (T)]
-- 0, as n -- o0

where So = {j, ,jk, k < oc} and s ={j.i ,i, I / k, I < oc00} are the optimal

and selected index set of LEP order respectively.

Proof. See the Appendix C for a rigorous proof.

Therefore, the proposed BIC, CAIC and AIC information criterion are all

consistent.

3.3.4 Selection Power

The usual model selection error rate (a) is defined as the probability of

selecting a wrong order given the data is simulated with the true order. If a -- 0

as n -- oc for certain model selection criterion, then this selection criterion is

consistent. By subtracting a from 1, one gets the selection power-the probability

to select the true model.

Error rate has been used as an important criterion of evaluation of the model

selection criteria in many publications (Bozdogan 1987; Hurvich and Tsai 1990;

Shao 1993, 1996; Zheng and Lou 1995). However, it is not sufficient by itself to

show which model selection criterion is better than others, partly due to the sample

size problem. In this study, we will use it as an evaluation for choosing different

selection criteria.

3.3.5 LEP Order Selection

Under the current study setting, we considered two situations to select the

optimal order for the LEPs to fit the developmental PCD curve, i.e., selection

under the null and alternative hypothesis respectively. Evaluation of different cases

by simulation study can provide us a general selection guideline.

Order selection under the alternative hypothesis. Under the alternative

hypothesis that there is one gene controlling the developmental process, there are

2 or 3 mean curves corresponding to two genotypes for a backcross design and

three genotypes for a F2 design respectively. AIC, CAIC and BIC will be applied

to select the optimal order based on the likelihood information. Order selection

under the alternative is computationally time-consuming because one needs to

select the correct order at every test position. The first assumption with equal

order for different genotype is mandatory under the current settings because of

the hypothesis testing issue. In the real world, this assumption might be violated

since there might be different developmental patterns for individuals carrying

different genotypes and hence may follow different developmental patterns or

curves. Considering the computation burden, for simplicity we will not consider

this case. Selecting orders for different genotypes will be considered in future work.

Order selection under the null hypothesis. Under the null hypothesis that there

is no QTL controlling the developmental process, there is only one mean curve.

With this information, one can select the LEP order under the null with different

criteria, specifically, AIC, CAIC and BIC. Computationally, it is very efficient to

select the order under the null because the selection procedure only performs once.

Before selecting the order under the alternative, the order selected under the null

can provide a piece of information about the order under the alternative because it

is highly possible that the mean process for different genotypes maintains the same

order under the null and alternative.

3.4 Modelling the Covariance Structure

To better understanding the QTL underlying the developmental PCD

process, covariance structure modelling is a very important step. Dissection of

the intra-individual correlation will help us to understand how QTL mediate the

developmental pattern. Two covariance structures were proposed for modelling the

covariance matrix under the current design.

3.4.1 AR(1) Covariance Model

The first-order autoregressive [AR(1)] model (Diggle et al. 2002) can be

utilized to model the structure of the covariance matrix. The detailed description

of the AR(1) covariance structure is given in C'!i pter 2. Some of the properties of

AR(1) model is given in Appendix A. TBS modelling strategy was applied (Wu

et.al. 2004b) to remove the heteroscedasticity problem of the residual variance.

To model the nonstationarity of the covariance structure, the SAD model is

more appropriate. The SAD model with order r for modelling the error term in

Equation (2-3) is given by

e(t}) = fie(t, 1) + ... + 4e(t r) + eC (3-7)

where ci is the 'iii ', ii, ,1" term assumed to be independent N(0, it2) random

variables. For the first order SAD model, if a constant innovation variance a2 is

assumed, analytical forms for variance and correlation functions can be obtained.

For any time t > k, the covariance and correlation function of a SAD(1) model is

given by (Jaffrdzic et al. 2003)

1 (2t2
7(t) 2 (38)

t 1 2 t
O (t, k) ,k 1 (3-9)
1-

44

From the above function, it is clear that even for the simplest SAD(1) model,

the correlation function is nonstationary. Using matrix notation, the error term in

Equation (3-7) can be expressed as

e A= A

where e = [e(1), .. e(r)]',

[(1), ..., ( )]T and

e variance-covariance matrix of te PCD data is then expressed as

The variance-covariance matrix of the PCD data is then expressed as

E AEAT,

(3-10)

where E, is the innovation variance-covariance matrix and is expressed as

XE

The closed forms for the inverse and determinant of matrix E help to estimate

the parameters, (1, 0r, a2) (see Appendix A). For model simplicity and

computation consideration, we only consider the SAD(1) model.

3.5 Computation Algorithm and Hypothesis Testing

An EM based algorithm can be applied to estimate the parameters as

described in Appendix A. After the MLEs of the parameters are obtained, the

existence of a QTL affecting a PCD curve should be tested by formulating the

following hypotheses

Ho : Omii- Qm

Hi: at least one of the qualities above does not hold,

where Ho corresponds to the reduced model, in which the data can be fit by a

single curve, and H1 corresponds to the full model, in which there exist different

curves to fit the data. The test statistic for testing the hypotheses is calculated as

the log-likelihood (LR) ratio of the reduced to the full model

LR -2[log L(2y, M) log L(Q|y, M)]

where f and 2 denote the MLEs of the unknown parameters under Ho and H1,

respectively. Permutation tests (C'i, !11!! and Doerge 1994) are used to assess the

statistical significance.

3.6 Calculating the Heritability Level

Heritability measures the amount of phenotypic variability explained by

genetic variation. The higher the heritability, the higher the variation of certain

traits explained by the detected QTL. It is easy to calculate the heritability level

(H2) when traits are measured at a single time point. But for longitudinal traits,

heritability calculations are different. We proposed two --v to calculate H2,

1. Calculate H2 at a single time point t where the traits shows the highest
variation. For a backcross design, the genetic variation is given by a2
(pui(t) po(t))2, and H2 =.2 where 72 is the residual variance. If we know
H2 or J2 based on a2 we can easily calculate another one.

2. The second approach we proposed is by calculating the area under the curve
(AUC). Functional mapping maps the dynamic gene effect over time. It
should be more accurate to measure the genetic variation through the entire
measurement stage instead of based on a point estimation. We proposed to
calculate the genetic variation by a 2= (AUC, AUCo)2, where AUCI and

AUCo are the AUC for two different genotypes. The AUC is calculated by

--1
AUCJ J 1u'P,P(t')dt'

where P,(t') is a vector of LEP with order r; uj is the base genotypic mean
parameters and t' is the adjusted time point.

3.7 LEP Order Selection Simulation

To test the performance of different order selection criteria with moderate

sample size, we did a simulation study. To demonstrate the selection power and

for simplicity, we simulate the true order with 6th order LEP. A putative QTL is

simulated under different sample sizes (n=100, 200) and different heritability levels

(H2 0.1, 0.4). The total measurement time points are fixed at 9 points to match

the rice tiller number data (described later). The phenotypic data are simulated

under different covariance structure designs (AR(1), SAD(l)). The various designs

allow us to check the selection performance under different situations. The selection

power is evaluated as the percentage of simulated dataset in which corrected model

was selected. By subtracting the power from 1, the selection error rate can be

easily obtained.

There are two nii, r reasons for the current simulation study to select order

under the null hypothesis,

1. Generally speaking, it is more informative to select order under the
alternative for different genotypes. For the hypothesis testing concern,
the parameters under the null has to be nested under the alternative
(i.e., different genotypes under the alternative should have the same
number of parameters as the null has). If simulation study supports this
hypothesis, then no information will be lost to put the same order for the null
distribution as selected under the alternative.

2. Computationally, it is very intensive to select order under the alternative
because one has to select the order for each test position. If the simulation
study supports the hypothesis that order selected under the null is the same
as it is selected under the alternative, one can just select order under the null
to same computation time.

Table 3-1. LEP order selection power comparison using different information
criteria with AR(1) covariance structure

H2 0.1 H2 0.4
Information criteria n 100 n 200 n 100 n 200
Order selection under the null
AICo 0.99 1.00 1.00 1.00
CAICo 0.99 1.00 1.00 1.00
BICo 0.83 0.97 1.00 1.00

Order selection under the alternative
AICI 0.89 0.92 0.80 0.83
CAICI 0.90 0.92 0.80 0.84
BICI 0.88 1.00 1.00 1.00
Note: 0 and 1 refers to the criteria under the null and alternative hypothesis
respectively.

Simulation result. Comparisons of the different model selection criteria under

different covariance structures are summarized in Tables 3-1 and 3-2. Each

table gives the percent of datasets in which the correct model was chosen, for

each combination of sample size, heritability level, and selection criteria. Despite

differences in performance among model selection criteria, trends holding across

different criteria were evident in the simulation results. First, as might be expected,

the overall power of various model selection criteria to select the true model

generally increased with sample size. Second, a distinct pattern of performance

among the model selection criteria is that AIC and CAIC generally performing

similar to one another. However, there was a slight but persistent tendency for

CAIC to outperform AIC in many conditions. When the covariance structure is

modelled with the AR(1) structure, BIC uniformly outperformed AIC and CAIC

with heritability level 0.4. However, it performs poorly when the heritability level

is 0.1. We know that the higher heritability level corresponds to lower residual

variance. This piece of information i-.-. -1I using different criteria depending on

the residual variance. A similar pattern is also shown with the SAD(1) covariance

structure. Overall, to select the optimal order under the alternative hypothesis, one

can choose CAIC when the heritability level is lower and choose BIC criterion when

the heritability is high.

Table 3-2. Power comparison for LEP order selection using different information

H2 0.1 H2 0.4
Information criteria n 100 n 200 n 100 n 200
Order selection under the null
AICo 0.81 0.96 0.97 0.98
CAICo 0.85 0.98 0.99 1.00
BICo 0.55 0.89 1.00 1.00

Order selection under the alternative
AICI 0.85 0.85 0.90 0.91
CAICI 0.86 0.86 0.95 0.90
BICI 0.65 0.93 1.00 1.00
Note: 0 and 1 refers to the criteria under the null and alternative hypothesis
respectively.

Another interesting finding is that most criteria perform better under the null

than under the alternative. This means one can select the order under the null

and put the same order for different genotypes under the alternative. However,

in practice, real world data might not exactly satisfy the multivariate normality

assumption (for instance, skewed normal). We still r-- .--- I selecting order under

the alternative for different genotypes if computationally it is feasible. If the order

under the null tends to maintain the same order as the alternative does, then the

same order will be assigned to the null due to hypothesis testing concerns.

3.8 Model Assessment by Simulation

With the appropriate LEP order selected, we would like to check the model

performance. A number of simulation studies are done with different objectives

under different study designs.

Consider a backcross population with which a 100cM long linkage group

composed of 6 equidistant markers is constructed. A QTL that affects the

PCD process is located at 48 cM from the first marker on the linkage group.

The Haldane map function was used to convert the map distance into the

recombination fraction. To test the model performance, we simulate data with

different specifications, namely different heritability levels (H2 0.1 vs 0.4) and

different sample sizes (n 100 vs 200). Each backcross progeny is measured at 9

equally spaced time points. The simulation was based on two different patterns of

Table 3-3. The MLEs of the QTL position and the model parameters derived
from 100 simulation replicates with the AR(1) covariance structure.
The squared roots of the mean square errors of the MLEs are given in
parentheses

True parameter
QTL position A 48

Mean parameters for QQ
U20 = 9.0491
U21 1.1507
U22 -6.0197
U23 2.6514
U24 0.6517
u25 -0.7969
U26 0.6205

Mean parameters for Qq
uoo = 7.1476
Uol 1.3788
o02 -4.4886
U03 2.0042
U04 0.6619
u05 -0.8356
o06 0.4321

Covariance parameters
CT 2. -0.1436
cr.4 2 0.0239

H2
n 100
46.38(3.10)

9.4006(0.56)
1.1258(0.28)
-6.2381(0.47)
2.8064(0.34)
0.6941(0.23)
-0.8804(0.19)
0.6430(0.21)

6.9422(0.38)
1.3676(0.21)
-4.2814(0.37)
1.8882(0.27)
0.6899(0.17)
-0.8563(0.14)
0.4542(0.15)

0.1199(0.02)

p = 0.87 0.8206(0.05)

0.1
n 200
46.32(2.62)

9.4141(0.47)
1.1093(0.21)
-6.2482(0.36)
2.8118(0.28)
0.6584(0.16)
-0.8745(0.17)
0.6704(0.17)

6.9685(0.32)
1.3862(0.13)
-4.3142(0.30)
1.8685(0.21)
0.7067(0.13)
-0.8568(0.09)
0.4433(0.11)

0.1188(0.03)

0.8202(0.05)

H2
n 100
47.28(2.59)

9.0932(0.17)
1.1565(0.11)
-6.0489(0.16)
2.6786(0.13)
0.6684(0.09)
-0.8129(0.07)
0.6163(0.09)

7.1406(0.11)
1.3755(0.08)
-4.4751(0.12)
2.0106(0.09)
0.6583(0.07)
-0.8457(0.06)
0.4421(0.06)

0.0224(0.003)

0.8573(0.02)

The location of the simulated QTL is described by the map distances (in
first marker of the linkage group (100 cM long).

0.4
n 200
47.36(1.83)

9.0872(0.12)
1.1577(0.07)
-6.0348(0.12)
2.6710(0.09)
0.6463(0.06)
-0.8082(0.05)
0.6303(0.06)

7.1623(0.11)
1.3835(0.05)
-4.4978(0.10)
2.0063(0.06)
0.6639(0.05)
-0.8458(0.03)
0.4402(0.04)

0.0223(0.002)

0.8568(0.02)

cM) from the

Results: Simulation results are summarized in Tables 3-3 and 3-4. The

precision of parameter estimation is evaluated in terms of the square root of the

mean squared errors (SMSE) of the MLEs. In general, the model can provide

reasonable estimates of the QTL positions (A) and effects of various kind,

Table 3-4. The MLEs of the model parameters and the QTL position derived
from 100 simulation replicates with the SAD(1) covariance structure.
The squared roots of the mean square errors of the MLEs are given in
parentheses

True parameter
QTL position A = 48

Mean parameters for QQ
U20 = 9.0491
U21 1.1507
1122 -6.0197
1123 2.6514
U24 0.6517
u2s -0.7969
U26 0.6205

Mean parameters for Qq
Uoo = 7.1476
U101 1.3788
o02 -4.4886
uo3 2.0042
04 0.6619
05 -0.8356
U06 0.4321

Covariance parameters
Co2.1 -5.0654
a.4 2 0.8442
0O.Q4

H2
n=100
47.98(4.62)

9.3293(0.59)
1.3365(0.45)
-5.6839(0.49)
2.5921(0.29)
0.7744(0.27)
-0.9747(0.27)
0.6198(0.25)

7.3446(0.49)
1.5202(0.40)
-3.9291(0.67)
1.8615(0.34)
0.6952(0.23)
-0.9042(0.24)
0.4103(0.24)

4.7322(0.41)

- 0.95 0.9150(0.04)

0.1
n 200
47.02(3.47)

9.3234(0.45)
1.3114(0.32)
-5.6770(0.44)
2.5802(0.23)
0.7527(0.20)
-0.9498(0.23)
0.6355(0.19)

7.3932(0.43)
1.5595(0.32)
-3.9811(0.59)
1.8429(0.25)
0.6996(0.17)
-0.9284(0.16)
0.4034(0.16)

4.7546(0.35)

0.9147(0.04)

H2
n 100
46.38(3.49)

9.0803(0.22)
1.1646(0.18)
-6.0056(0.13)
2.6576(0.11)
0.6718(0.10)
-0.8203(0.08)
0.6206(0.10)

7.139(0.19)
1.3699(0.17)
-4.4672(0.15)
2.0061(0.12)
0.6791(0.10)
-0.8618(0.09)
0.4475(0.10)

11 .' l7.j 1.04)

0.9506(0.02)

The location of the simulated QTL is described by the map distances (in
first marker of the linkage group (100 cM long).

0.4
n 200
46.26(2.96)

9.0911(0.15)
1.1682(0.13)
-6.0013(0.09)
2.6627(0.09)
0.6580(0.07)
-0.8260(0.07)
0.6264(0.06)

7.1454(0.17)
1.3742(0.13)
-4.4815(0.09)
1.9873(0.09)
0.6816(0.08)
-0.8522(0.07)
0.4448(0.07)

0.8292(0.03)

0.9482(0.01)

cM) from the

with estimation precision depending on heritability, sample size and sampling

strategy under the two covariance structure. The precision of different parameter

estimations fluctuates with different covariance structures. For example, the QTL

location can be better estimated with the AR(1) covariance structure than the

SAD(1) model, but poorer estimation for parameter u23 is shown by the AR(1)

than the SAD(1). As might be expected, the precision of the QTL parameters is

increased with increased sample size. It is interesting to note that the precision

of the parameter estimates is greatly improved with increased heritability level

rather than just increasing the sample size. For example, the SMSE of the mean

parameters u20 for genotype QQ decreases from 0.56 to 0.47 when increasing the

sample size from 100 to 200 using the AR(1) covariance structure. However, this

increase of precision is less attractive when compared with the improvement when

heritability level increases from 0.1 to 0.4 for the same sample size 100. This piece

of information -, '.-.- that in practice well-managed experiments, through which

residual errors are reduced and therefore H2 is increased, are more important

than simply increasing sample sizes because normally it costs more to increase

the sample size. It is also interesting to note that both covariance models display

similar estimation precision.

3.9 Real Example

To test the developed model, a real data set was analyzed. Two inbred

lines, semi-dwarf IR64 and tall Azucena, were crossed to generate an Fi progeny

population. By doubling haploid chromosomes of the gametes derived from the

heterozygous F1, a doubled haploid (DH) population of 123 lines were founded

(Huang et al. 1997). Such a DH population is equivalent to a backcross population

because its marker segregation follows a 1:1 ratio. With 123 DH lines, Huang et al.

_.-' I i.ped 135 RFLP and 40 isozyme and RAPD markers to construct a genetic

linkage map of length 2005 cM with an average distance of 11.5 cM, representing

good coverage of 12 rice chromosomes (Fig. 3-1).

The 123 DH lines and their parents, IR64 and Azucena, were planted in a

randomized complete block design with two blocks in the field. Each block was

divided into different plots, each containing eight plants per line. Starting from 10

d- v- of transplanting, tiller numbers were measured every 10 d-v- for five central

plants in each plot until all lines had headed. We used the means of the two blocks

for QTL analysis.

chrome

RG472

RG246

K5
U10
-- RG532
Wl
SRG173
AmylB
RZ276

RG146

RG345
RG381
-- RZ19
-- RG690

RZ730

j RZ801

RG810
RG331

chrom5

0.9 -

35.0 -
6.6 -
19.4 -
8.5 -
16.7 -
4.2 -
9.5 -

21.3 -

12.8 -
19.7 -

chrom2

C --- RG437
-- RG544
S RG171
-- RG157

RZ318

Pall
RZ58
CD0686
AmylA/C

RG95
S-- RG654

RG256

RZ213
RZ123

RG520

chrom6

RG556
RZ390

-- RG403
-- RG229
SRG13
CDO105
-- RZ649

-- RZ67
-- RZ70

-- RZ225

chrom9

3 C711
2 G103
1 RZ206
S-- RZ422
Amy3ABC
t RZ228
- RZ12
3--

-- RZ792
RZ404

20.7 -
10.0 _
0.0-
2.7
1.8
5.6
31.4 -

10.1 -
11.0 -

30.4 -

4.2-
11.8--
19.6 -
10.8 -
4.4 -
8.6 -

RZ398
RG213
Amp -3
Est -2
RZ144
RZ667
Pg -2
-- pRD lOB
RG648
RG424

RG162
RG172
CDO544
-- RG653
SAmy2A
SRG433

chrom3

7 RG104
---- RG348
1362 RZ329

9.8
98-- RZ892
2,8 RG100
17,5 RG191
RZ678

41,6
RZ574
371
-- RZ284
15,6
-- RZ394
pRD10A
S5 RZ403

286 RG179
19 CDO337
1229 RZ337A
-- RZ448
15,0
-- RZ519
32,1 --
Pgi-1
7A1 CDO87

9.2 RG910
179 -

chrom7

25.7 -

16.3 -
1.7 -
18.4 -
9.0 -

31.4-

1.8 -
14.1 -
3.2 -
13.0 -
15.1 -
11.5 -
5.8 -
9.4 -

chromlO

I-F G1084
6.6 -- RG257
27.2 -,
0.0 .'

3.0 a- CD093
16.4- CD098
.16.4 -- G2155
10.2 RG134
3.8 RZ500
1 RZ500

- RG773

- RG769
__ RZ488
\ RG511
RG477

PGMSO.'
SCD059
-- RG711
Es 9

CD0418

RG351

chrome 1
CDO127
-- RZ638
RZ400
3-
R RG1 18
-- RG1094
RG167
1-

) Npb44
85 RG247

-- RG103
4 -

-- RG1109

5 r=- RZ536

Npbl86

chrom4
R RG218
RZ262
6 _--- G9
RG190
RG908
RG91
RG449
RG788
RZ565
RZ675

RG163

RZ590
RG214
RG143

RG620

chrom8

RZ143
0 -- RG20
5.0

).0- A18A1120
2.3 --
I.8I
i.2
?.8
.5 RG978
5.5 RG1
1.1 Amy3D/E

5.1
RZ66
1.8 -- AC5
1.0 -- RG418B
4-- Amp -2
57 f-- CD099

chroml2
i. 574
3 7
3 -- '- 816

S -G341
5 16
3 3457
2 lth-1
7 1- ,3463
1 ,3901
1 /- 00344
S" 3958
S3181

Figure 3-1. Linkage map for the rice genome

20

18-

16

14

12

/ ,. -_ .- .
Z 10,. ..

4 t., ~ --

2/

1 2 3 4 5 6 7 8 9
Time

Figure 3-2. Dynamic changes of the number of tillers for 123 DH lines of rice as an
example of PCD in plants

Figure 3-2 shows the dynamics of tiller numbers for each DH line measured

at 9 time points. Tiller growth is thought to be an excellent example of PCD in

plants (Greenberg 1996) since it experiences several developmental stages during

rice ontogeny (Fig. 3-3). At the early developmental stage, tiller number increases

dramatically corresponding to the vegetative phase in rice. During the reproductive

phase, tiller number declines as the initiation of panicle, emergence of the flag leaf

(the last leaf), booting, 1, lii. and flowering of the spikelets. Tillers that do

not bear panicles are called ineffective tillers, and will be killed by genetic control.

The number of ineffective tillers is a closely examined trait in plant breeding

since it is undesirable in irrigated varieties. The ineffective tillers result in many

unwanted problems in rice such as over consumption of nutrition and competition

for space. The genetic control system will pl i, its key role in reducing the over

produced tillers and balance the rice metabolism system to have optimal nutrition

consumption.

Amount of growth

Tiller number
Plant height

Ineffective anicle
tillers

Grain weight

0 20 60 90 120
Days after germination
Vegetative | Reproductive --- Ripening --

Figure 3-3. Tiller growth

Data were analyzed with the developed models with mean modelled by the
LEP and covariance modelled by AR(1) or SAD(1) structure. A genome-wide
scan was conducted at every 2 cM distance on the 12 chromosome. At each test

position, a LR test is conducted. A random test position was chosen to calculate

the heritability level with different order ranging from 2 to 8 before scan the entire
genome. Result shows that H2 is close to 0.4 (data not shown). Based on this

information, LEP order is selected under the alternative at each test position using

the BIC information criterion. When doing the permutation test, the selected
orders were maintained at each test position. The model successfully identified four
iI i ri" QTL that control the dynamic PCD developmental process with the SAD(l)

covariance structure and four in, I ri QTL with the AR(1) covariance structure.

Figure 3-4 shows the log-likelihood profile plot between the full and reduced

(no QTL) model for tiller number trajectories across the 12 rice chromosomes. The

positions of markers on the linkage groups (Huang et al. 1997) are indicated at

ticks. The plot shows the LR at each test position when modelling the covariance

matrix with AR(1) and SAD(1) structure. The threshold value for claiming the

existence of QTL at the genome-wide level is given as the horizonal solid line for

SAD(1) covariance structure and dotted line for the AR(1) covariance structure.

The chromosome-wide significant level is given by the dashed line. The significant

rate shown in the plot is at 5'. level with permutation test.

chrom 1 chrom 2 chrom 3 AR(1)
80o ARM
60
40

1001,
chrom 4 chrom 5 chrom 6 chrom 7
80
60
r40

100
chrom 8 chrom 9 chrom 10 chrom 11 chrom 12
80
60

20 '

Test Position

Figure 3-4. The LR profile plot. The solid and dotted curves correspond to the LR
profile with SAD(1) and AR(1) covariance respectively.

As clearly shown in the plot, the AR(1) covariance structure detected 1

genome-wide significant QTL and 3 chromosome-wide significant QTL. SAD(l)

model detected 4 QTL all significant at the genome-wide significant level. Among

the QTL being detected by the two covariance models, 3 of them are located at

the same marker interval. These QTL detected by both covariance models might

be the same QTL. Whether they are the same QTL can be checked by calculating

the confidence interval applying bootstrap method (Zeng et al. 1999). It is hard to

make a decision on which covariance model is the light" one. However, it is clear

that the mapping power is sensitive to the covariance structure being applied.

Table 3-5. The MLEs of the parameters and their .~,-1,iii l.1 ic standard errors in
the parentheses with the AR(1) covariance structure

Parameters Q 12 311 Q2 9Q
QTL position (A) 200cM 220cM 260cM 119.16cM

Marker interval

RZ730 RZ801 RZ337A RZ448 RZ519 Pgi_1

Parameters for QQ
uoo
UOl
uo2
U02
u03
uo4
o05
u06
uo7

Parameters for qq
U20
U21
U22
U23
U24
U25
U26
U27

Covariance parameters
(72
P

8.1221(0.25)
1.8070(0.15)
-5.5342(0.24)
1.3254(0.17)
0.9909(0.14)
-1.0331(0.13)
0.5397(0.11)
0.8828(0.18)

7.6154(0.23)
1.0413(0.14)
-4.7521(0.21)
2.7560(0.18)
0.0881(0.12)
-0.7394(0.11)
0.8853(0.11)
-0. i7" ;.i,.16)

0.0453(0.006)
0.8709(0.02)

8.5010(0.21)
1.5895(0.13)
-5.6862(0.20)
2.2708(0.15)
0.4513(0.13)
-0.8325(0.09)
0.9730(0.11)

6.9148(0.21)
1.2773(0.13)
-4.3434(0.20)
1.8608(0.14)
0.6113(0.12)
-0.7097(0.08)
0.3981(0.11)

0.0428(0.004)
0.8552(0.02)

Note: Q12, Q31, Q32 and Q9 refer to the QTL detected
refers to the chromosome-wise significant QTL.

8.7429(0.20)
1.6282(0.14)
-5.8039(0.19)
2.3644(0.15)
0.3681(0.13)
-0.8934(0.09)
1.0336(0.11)

6.8063(0.21)
1.2711(0.14)
-4.3360(0.20)
1.8183(0.15)
0.7262(0.14)
-0.6921(0.09)
0.3185(0.12)

8.0997(0.21)
1.3295(0.13)
-5.3113(0.20)
2.3670(0.15)
0.4101(0.12)
-0.8583(0.09)
0.8386(0.11)

7.4884(0.21)
1.5255(0.13)
-4.8555(0.19)
1.8480(0.14)
0.6375(0.12)
-0.7289(0.08)
0.5650(0.10)

0.0346(0.004) 0.0475(0.004)
0.8308(0.02) 0.8688(0.01)

on chromosome 1, 3 and 9; *

Tables 3-5 and 3-6 tabulate the estimated QTL position on the chromosome,

the marker interval where the QTL belongs to, the MLEs of curve parameters that

specify the developmental pattern as well as the .,-,iiii.l, .1 ic standard errors of the

estimates. Most the parameters can be reasonably estimated with small sampling

error, indicating great power of the developed models. As clearly indicated in the

table, the orders at different test positions are different. For example, the QTL

detected in chromosome 1 (Q12) with the AR(1) covariance model has order 7

RZ792

and the QTL detected in chromosome 9 (Q9) has order 7. More than 85' of test

positions tend to choose the 6th order (data not shown) and AR(1) tends to choose

a lower order than the SAD(1) model.

Table 3-6. The MLEs of the estimated parameters and their .,-vmptotic standard
errors in the parentheses with the SAD(1) covariance structure

Parameters
QTL position (A)

Marker interval

Parameters for QQ
o00
UOl
u02
o03
uo4
u05
u06
uo7
u08

Parameters for qq
U20
U21
U22
U23
U24
U25
U26
U27
U28

Covariance parameters
(2
0

Qil
108cM

RZ276 RG146

10.1080(0.25)
1.4258(0.25)
-6.2307(0.33)
2.4717(0.22)
0.7185(0.27)
-1.4245(0.16)
1.3406(0.17)
0.9732(0.21)
-1.0999(0.38)

7.4620(0.14)
1.3765(0.12)
-4.4309(0.17)
2.0271(0.10)
0.7610(0.13)
-0.7168(0.07)
0.5365(0.09)
-0.0612(0.09)
-0.5427(0.20)

0.7249(0.04)
SI' 177,11.02)

Q12
198cM

RZ730 RZ801

8.8476(0.26)
1.9202(0.19)
-6.1187(0.20)
1.3324(0.15)
1.0686(0.13)
-1.1540(0.11)
0.6267(0.11)
1.1440(0.17)

7.5165(0.19)
1.0548(0.16)
-4.6904(0.14)
2.7183(0.13)
0.0544(0.10)
-0.7741(0.09)
0.9289(0.10)
-0.4435(0.14)

0.7483(0.04)
0.9013(0.02)

Q31
220cM

RZ337A RZ448

8.9333(0.19)
1.5415(0.16)
-5.5936(0.20)
2.0395(0.13)
0.8554(0.16)
-1.1371(0.09)
0.9195(0.10)
0.6945(0.14)
-0.7288(0.25)

7.1258(0.20)
1.3297(0.17)
-3.927(0.22)
2.1169(0.14)
0.8066(0.18)
-ii -, .(0.10)
0.3967(0.12)
-0.4249(0.15)
-0.7679(0.28)

0.7866(0.04)
0.8836(0.02)

Q32
262cM

RZ519 Pgi_1

9.1363(0.18)
1.6244(0.16)
-6.2130(0.15)
2.0758(0.15)
0.5325(0.12)
-1.1813(0.10)
1.0541(0.11)
0.7148(0.16)

6.9824(0.19)
1.2889(0.17)
-4.3434(0.16)
2.1184(0.17)
0.4214(0.13)
-0.5721(0.11)
1i -, 217_ 1'.12)
-0.3716(0.17)

0.7886(0.04)
0.8420(0.02)

The developmental trajectory of the identified QTL are plotted and are shown

in Figures 3-5 and 3-6 with tiller number trajectories for all individuals indicated

in background. The 4 QTL detected by the AR(1) model are between markers

RZ730 and RZ801 (Chl-2 (Q12)) on chromosome 1, between marker RZ337A

and RZ448 (Ch3-1 (Q31)), between marker RZ519 and Pgi_l (Ch3-2 (Q32)) on

chromosome 3 and on marker RZ792 on chromosome 9 (C'!i1, (Q9)). Three QTL

detected by the SAD(1) model are located at the same marker interval as the

20
Chl-2 (Q12)
15

10

5

1 2 3 4 5 6 7 8 9
Time

20
Ch3-2 (Q32)
15

10

1 2 3 4 5 6 7 8 9
Time

20
Ch3-1 (Q 1)
15

10
5qq
I 5

1 2 3 4 5 6 7 8 9
Time

Ch9 (Q9)
15
E

5----
10 2
0
1 2 3

5 6 7 8 9
Time

The dynamic changes of tiller numbers for two curves each presenting
two groups of genotypes, QQ and qq, at each of the four QTL detected
with the AR(1) covariance structure.

AR(1) model detected except for the one between markers RZ276 and RG146

(Qii). It is clear that different QTL show different developmental patterns.

Individual rice plants carry different .-, -:.1 i rpes showing different tiller increasing

and decline rates. For example, the QTL detected in chromosome 1 by the AR(1)

covariance model shows an unusual pattern than other QTL. Individuals with

:,. :.r ,ipe qq have fast tiller growth and death rate than individuals who carry

,,,I .ipe QQ. This information is very helpful in rice breeding. Plant breeders

can try to select species carrying the qq ,.. i vpe. The rapidly increasing tiller

numbers at the early developmental stage help the rice to build more tillers in case

of the environmental stress resulting in lower production. At the productive phase,

over produced tillers is not needed anymore. The fast killing process helps rice to

get ride of over produced tiller quickly and reduce nutrition consumption due to

these ineffective tillers. This is a very phenomenal developmental process. The

Figure 3-5.

20 20
Chl-1 (Q11) Chl-2 (Q12)
15 15

10 10

5 5

0 0-
1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9
Time Time

20 20
Ch3-1 (Q31) Ch3-2 (Q32)
15 15

10 10

5 5

0 0-
1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9
Time Time

Figure 3-6. Two curves for the dynamic changes of tiller numbers each presenting
two groups of genotypes, QQ and qq, at each of the four QTL detected

identification of in i r genes controlling this process will greatly help plant breeders

in their breeding design.

3.10 Discussion

To observe the pattern and shape of a developmental process, it is necessary to

develop a function that has enough flexibility to fit the measured data. Parametric

modelling of the developmental curve has been studied extensively (\! et al. 2002;

Zimmerman and Nnfiez-Ant6n 2001). Parametric approaches, however, require

specific quantitative information about the form of model, and hence have little

flexibility when the developmental pattern becomes complicated and does not

In this study, we proposed a novel nonparametric model for the longitudinal

PCD trait using LEP. Without making any assumptions on the underlying

developmental pattern, nonparametric regression with LEP shows greatly flexibility

to model the longitudinal PCD trait. However, for rapidly changing values of the

PCD curve to be approximated, the degree of the polynomial has to be increased.

Three model selection criteria are proposed to select the optimal order of the LEP.

A number of conclusions can be drawn, both about information theoretic fit criteria

in general, and about the performance of individual criteria in particular. The

proposed different criteria may not necessarily agree on the best order selected.

Among these criteria, BIC has a stronger penalty, which is a function of the

number of unknown parameters and sample size. Therefore, the BIC will likely

favor more parsimonious models compared to AIC or CAIC, and this just fits our

objective of parsimonious modelling of the PCD curve without losing too much

flexibility. If data come from the multivariate normal distribution, simulation

studies show that order selection under the null alvb--, outperforms the selection

under the alternative. However, in the real world data, it is not easy to test the

data distribution. We -i--::. -I selecting the LEP order under the alternative if

computationaly feasible. Selecting order with the unrestricted model ahb-- v

permutation test to assess the statistical significance, we need to fix the LEP order

for each test position. It might be expected that different genotypes may follow

different development patterns and hence results in different LEP orders. Order

selection assuming different orders for different .: il, vrpes need to be developed in

the future study.

Two covariance structure modelling strategies are also proposed and tested

by simulation and real data example. It is difficult to compare which structure is

the right one given that simulating the true biological mechanisms is unrealistic

due to our limited knowledge of the underlying true residual covariance structure.

Simulation studies show that both covariance models perform well if we simulate

data according to the corresponding structure. For the real data, one way we can

compare the two covariance structure is to see which one has higher mapping power

(the ability to detect more QTL). The one resulting in more QTL is probably

the more powerful model at the same significant a-level. The final set of QTL

should be a union of the QTL detected by both models. It is better to keep

the identified QTL as a real gene before we have knowledge to further prove

its usefulness. To test which covariance model performs better, model selection

using information-theoretic criteria can be applied, for example, AIC, BIC or

Cross-validation. Studies have been done in model selection of the covariance

model using these criteria (Grady and Helms 1995; Littell et al. 2000). However,

the selection is based on the same mean model (i.e., the mean functions have the

same number of parameters). Since the LEP order could be different at each test

position when using different covariance models, it is hard to compare the two

models under the current study design. Further methods need to be proposed to

better evaluate different covariance models.

CHAPTER 4
FUNCTIONAL QTL MAPPING PROGRAMMED CELL DEATH: A
SEMIPARAMETRIC MODEL

4.1 Introduction

The question of how an organism develops into a fully functioning adult from a

mass of undifferentiated cells has alv-- i attracted top researchers in diverse areas

of developmental biology (Horvitz 2003). Fundamentally, to produce a functioning

adult form, a living organism should coordinate various complementary and

sometimes antagonistic processes, which include cell proliferation and ]*i.',;/ "" 1

cell death (PCD) or apoptosis, during its development (Cashio et al. 2005). The

molecular and genetic characterization of these processes has been useful to

identify the specific signaling path.--W,-i that underlie a delicate balance between

cell proliferation and PCD (Vaudry et al. 2003) and, in the ultimate, enhance our

understanding of the roots of disease such as cancer (Hanahan and Weinberg 2000;

Yuan and Horvitz 2004). In particular, PCD has been thought to be a universal

phenomenon that occurs in predictable patterns in response to environmental or

developmental clues, whose study has become one of the most fascinating areas in

all of biology over the last decade (Greenberg 1996; Horvitz 2003).

Studies of the genetic control of development have used simple model systems,

the nematode Caenorhabditis elegans and fruit fly Drosophila, from which views

have been established that PCD involves specific genes and proteins and that these

genes and proteins interact within the cells that die (Ellis and Horvitz 1986; White

et al. 1994; Steller 1995; Horvitz 1999). For the adult hermaphrodite C. elegans

forms, there are 1090 somatic cells, of which 131 die by apoptosis. Each apoptosis

process is characterized by four stages (Steller 1995): (1) decision about whether

a cell should die or assume another fate, (2) death, (3) engulfment of the dead cell

by phagocytes and (4) degradation of the engulfed corpse. Each of these stages is

regulated by a number of genes. Mutations affecting the final three stages affect all

somatic cells, whereas genes affecting the death verdict affect very few cells.

The molecular genetic pathway that defines PCD in C. 1.,m.- and Drosophila

provides a basis for understanding apoptosis in more complex organisms, including

higher plants and humans, because genes responsible for PCD are evolutionarily

conserved (Horvitz 1999). Given the unique complexity of genetic path- -,.v of

these species, however, a rigorous, detailed and analytic approach should be

developed on its merit, which allows for the genome-wide identification of genes for

apoptosis in any complex organism. Genetic mapping based on molecular markers

(Lander and Botestein 1989; Zeng 1994; Lynch and Walsh 1998), by superimposing

real biological phenotypes on genome sequence and structural polymorphisms,

can provide an unbiased view of the network of gene actions and interactions of

quantitative trait loci (QTL) that build a complex phenotype like PCD.

Unlike traditional QTL mapping for a complex trait, the mapping of PCD

needs to incorporate the dynamic feature of this developmental process. Although

this presents one of the most difficult tasks in genetic studies, some of key issues

have been overcome by Wu and colleagues (Wu et al. 2002a, 2003, 2004a, 2004b,

2004c; Ma et al. 2002) who proposed a so-called functional i'."'.!:'!." to map and

identify specific QTL that underlie the developmental changes of complex traits.

The rationale of functional mapping is to express the .- i.. I vrpic values of QTL at

a series of time points in terms of a continuous growth function with respect to

time t. Under this formulation, the parameters describing longitudinal trajectories,

rather than time-dependent .-, 1 ,ivpic values, as carried out in traditional mapping

strategies, are estimated within a maximum likelihood framework. Also, unlike

traditional strategies, functional mapping estimates the parameters that model

the structure of the covariance matrix among a multitude of different time points,

which, therefore, largely reduces the number of parameters being estimated for

variances and covariances, especially when the dimension of data is high.

In this chapter, we will develop a novel statistical model for the genome-wide

scan of QTL that guide PCD toward an active process of cell death. This model

incorporates two sequentially distinct stages of the developmental process into the

mapping framework constructed within the context of Gaussian mixture models.

The first stage, growth, has proven to obey some universal growth law that can

be modelled mathematically by curve parameters (von Bertalanffy 1957). While

it is difficult to find a proper mathematical equation to describe the second stage

- death that is subject to a fast exponential decay of cells followed by a slowly

decreasing function (Balaban et al. 2004), a nonparametric approach based on

the Legendre function is derived to model the PCD process. The combination of

parametric modelling of the growth process and nonparametric modelling of the

death process 'lv a foundation for semiparametric functional i".Ij':j"!i of PCD.

We implement a nonstationary mean-dependent covariance model to characterize

the structure of the covariance matrix among cell numbers measured at a multitude

of different times. The statistical behavior of our model is investigated through

simulations studies. The utility of the model in a real example of rice -i--::. -I that

our model can be useful in practice.

4.2 Different Phases of PCD

In general, the whole process of PCD can be described by five reasonably well

distinct phases (Fig. 4-1; Fogg and Thake 1987): lag, exponential, declining growth

rate, stationary and death. Each of the phases is defined below.

Lag phase: This is the initial growth phase, during which cell number

remains relatively constant prior to rapid growth. During this phase, the organism

1 2 i i i i i i i
4

10 -

5

8

U6
CO6-
E

4

2-

Growth phase Death phase"
0-
1 2 3 4 5 6 t* 7 8 9 10 11
Time

Figure 4-1. A typical example of PCD that includes five different stages, (1) lag,
(2) exponential, (3) declining growth, (4) stationary and (5) death.

prepares to grow and unseen biochemical changes, cell division and differentiation

of tissues occur during this time.

Exponential phase: During this phase the tissues are growing and dividing

rapidly to take advantage of abundant nutrients. Growth rate, as a measure of the

increase in biomass over time, is determined from the exponential phase. Growth

rate is one important way of expressing the relative success of an organism in

adapting to its biotic or abiotic environment imposed upon it. The duration of

exponential phase depends upon the growth rate and the abundance of nutrients to

support tissue growth. If the growth phase is plotted (time on x-axis and biomass

on logarthmic y-axis), the exponential phase will be straightened out.

Declining growth: Declining growth normally occurs when either a specific

requirement for cell division is limiting or something else is inhibiting reproduction.

During this phase growth slows or the death rate increases. As a result, the

initiation of new tissues and the senescence and death of old ones start to come

into equilibrium. This phase typically occurs as nutrients become limiting for

growth.

Stationary phase: Tissues enter the stationary phase when net growth is

zero, and within a matter of time cells may undergo dramatic biochemical changes.

The nature of the changes depends upon the growth limiting factor. The shut down

of many biochemical pathv-s as the stationary phase proceeds means that the

longer the cells are held in this condition the longer the lag phase will be when cells

are returned to good growth conditions.

Death phase: When cell metabolism can no longer be maintained, the death

rate of a tissue is generally very rapid (Balaban et al. 2004). The steepness of the

decline is often more marked than that represented in the accompanying growth

figure.

The duration and extent of each phase will depend on the organism and

environmental conditions. For example, if tissues from the stationary phase are

supplied with fresh nutrients, the lag phase will be longer than for the case of

tissues from the declining phase. For growing tissues from the exponential phase,

organisms supplied with fresh nutrients will likely skip the lag phase. If the growth

nutrient is rich, organisms will remain in the exponential growth phase for a longer

period and produce a greater biomass. Furthermore, their rate of growth in the

exponential phase may also be greater.

Growth curves must be drawn from a series of growth measurements at

different times during the growth curve. Mathematical equations have been derived

to model the growth from the lag to stationary phase (West et al. 2001). It is

difficult to find a specific mathematical equation for the death phase.

4.3 Statistical Model

4.3.1 The Mixture Model-Based Likelihood

The statistical foundation of functional mapping for the PCD process is based

on a finite mixture model. In this mixture model, each curve corresponding to a

QTL .-, ii. .1pe fitted by a finite set of measurements with 7 time points for any

individual is assumed to have arisen from one of a known or unknown number

of components with each component being modelled by a multivariate normal

distribution. That is

y ~ p(r/| -7 TI) = oif(y; yi, TI) + + jf (y; j, TI), (4-1)

where w (l, wj)' are the mixture proportions (i.e., QTL .-. -I vpe

frequencies) for J QTL i,' .vpe which are constrained to be nonnegative and

J j1 =j 1; = (1, i)' are the component (or QTL genotype) specific
parameters, with mj being specific to component j; and T] are parameters which are

common to all components.

Consider a standard backcross design, initiated with two contrasting

1I': .. v.-ous inbred lines. This backcross progeny population contains two groups

of genotypes at a locus. Assume that a genetic linkage map covering the entire

genome has been constructed with polymorphic markers. Suppose there is a

segregating QTL with alleles Q and q that affects PCD curves or trajectories in the

backcross population. This QTL cannot be directly observed rather than should be

inferred on the basis of known marker information. In QTL interval rn ,ppii. we

will use two flanking markers with known genotypes to infer the .-., .1 ivpe of a QTL

that is hypothesized to be located between the two markers (Lander and Botstein

1989). The recombination fraction is a linkage parameter that can describe the

genetic distance between two given loci. Let r, r, and r2 be the recombination

fractions between the two markers MI and M2, between the left marker MI and

QTL and between the QTL and right marker M2, respectively. Based on the

segregation and transmission of genes from the parent to progeny, one can derive

the conditional probabilities of an unknown QTL, i.1 ,,vpe, conditional upon the

known marker .' .1 virpes, in terms of these recombination fractions (Table 2-1).

The unknown parameters that specify the position of QTL within a marker interval

are array, '1 in Qs.

The likelihood function of the longitudinal PCD trait (y) and marker

information (M) collected for this backcross populations at this hypothesized

QTL with two genotypes j = 1, 0 is constructed as

L(QfO, Q,, l|y, M) 1[1 if (yi|,, fIQ) + o|if (yi |O [o I1)] (4-2)
i= 1
where w li and wo0i contained in fs are the mixture proportions corresponding

to the frequencies of different QTL .,-. .1 vpes, expressed as the conditional

probabilities of QTL .. n. 1 pes given marker .. .1 vpes for individual i, fm,

contains the parameters that model time-dependent means for ., .'1 vipe j and

f contains the parameters that model the structure of the residual covariance

matrix which is assumed to be common to all mixtures. The likelihood function

is constructed on the basis of the assumption of independent response between

individuals. The multivariate normal distribution of each mixture for progeny i

measured at T time points is expressed as

f(y ,M |m, v) (2 ) /2 1/2 exp (yi- mj)TX-(yi m) (4-3)

where y = [y/(1), '' ,T u(r)] is a vector of observation for progeny i and uj =

[uj(1), uj (r)] is a mean vector for all progenies with genotype j. This means

that all progenies with the same QTL .1. ivpe will have the same mean at

different time points. At a particular time point, i- t, the relationship between the

observation and expected mean can be described by a linear regression model

yt) M (t) + (1 +)uo(t) + eM(t),

where (i is an indicator variable denoted as 1 for QTL genotype j = 1 and 0 for

QTL .ii-, i. vpe j = 0 and ei(t) is the residual error which is lid normal with zero

mean and variance O2(t). The errors for progeny i at two different time points,

t1 and t2, are correlated with covariance cov(yi(ti), yi(t2)). The variances and

covariances comprise the covariance matrix E whose elements are the common

parameters specified by fQ. With these general settings, the statistical challenging

becomes how to model the mean process and how to structure the covariance

matrix.

4.3.2 Semiparametric Modelling of the Mean Vector

In a broad sense, the entire PCD process for a particular individual i can be

divided into two phases, growth and death (Fig. 4-1). Let t* be the transition

time point which marks the end of growth phase and beginning of the death

phase. The mean vector in the multivariate normal distribution of PCD (Eq. 4-3)

for individual i that carries QTL genotype j can now be specified by two mean

sub-vectors expressed as

Uji (u ujl, UDj i) (4-4)

where ugQji and uDji correspond to the growth and death vector before and after

t*, respectively.

The parametric model of growth phase. The process of growth (before t*)

follows universal growth laws and can be described by biologically meaningful

mathematical functions. As a nearly universal biological law for living systems, the

sigmoidal (or logistic) growth function can be fitted to capture age-specific changes

during the growth phase (West et al. 2001). The logistic function is mathematically

described for individual i by

g9(t) = -, with t [1, t (4-5)
1 + bie-cit

where ai is the .,-vmptotic or limiting value of gi when t -+ oo, ai/(1 + bi) is the

initial value of gi when t 0 and ci is the relative rate of growth (von Bertalanffy

1957). Thus, with these three parameters, one can uniquely determine the shape of

PCD in the growth phase for individual i that carries QTL genotype j, and have

the time-dependent mean vector for Eq. 4-4 specified by

Ugjli [ugjli ), -"- Ugjli ti)1
= a a (4-6)
1 +bje-c'' 1+ bje-Aj "

If different genotypes at a putative QTL have different combinations of the

parameters (ay, bj, cy), this implies that this QTL p1 a role in governing the

difference of growth trajectories.

The nonparametric model of death phase. Since no particular mathematical

function can be used to describe the death phase (after t*), the nonparametric

approach based on the orthogonal Legendre polynomial (LEP) is used. The

flexibility of LEP will greatly increase the robustness of functional mapping.

With appropriate order r, the time-dependent genotypic values for different
QTL ,-., .I vipes in the death phase can be fitted by the orthogonal LEP. A family

of such polynomials with normalized time t' is denoted by

P(t') [Po (t/), Pl (t/), .. Pr (t1/)]T

and a vector of genotypic-related, time-independent values with order r is denoted

by

Vj rI,,. V j, -- Urjl]T

where vj is called the base /. i. 'ij*/"l'.:' vector for QTL ,-, 11. 1pe j and the parameters

within the vector are called the base /, i. 'i.*/l'.:, means. The normalized time t' is

obtained by adjusting the original measurement time t to match the orthogonal

function range [-1, 1], by
,/ 2(t- Imin)
t' -1+
tmax -min
where tmin and tmax are the first and last time point, respectively.

With these specifications, the time-dependent i i.1' pic values, uDj (t), in the

death phase can be described as a linear combination of vj weighted by series of

the polynomials. That is

UD (t) vTp(t') (4-7)

Thus, for individual i whose QTL .1 i,.'vpe is j, we use the following

expression to model genotype means in the death phase

UD l (t),. *. (7)]. (4-8)

This approach has a great flexibility in modelling longitudinal data that

cannot be fitted by a parametric model. By choosing an appropriate order, the

nonparametric model can better capture the intrinsic pattern of developmental

PCD. The number of parameters can be reduced if the order of the polynomial is

less than the number of time points, as an usual case.

Legendre order selection. To determine which order of the LEP best fits the

data, we need to select the optimal order. One of the popular model selection

criteria is the AIC information criterion (Akaike 1974). The AIC value at a

particular order, r, is calculated by

AIC- -21nL(6f, ,fD, 6,r) + 2 dimension(Qg, fD, fQr), (4-9)

where Qg {aj,bj,c}jo, and QD {Poj,PIj,... rj}j=o are the the MLEs of

parameters for the growth curve function and the Legendre polynomial of order r,

Q, contains the MLEs of the covariance parameters, and dimension(fg, D, L, r)

represents the number of free parameters under order r. The optimal order is one

that di-p'1,, the minimum AIC value.

Another model selection criterion to determine the optimal order of the

Legendre function is the B -i'l, Information Criterion (BIC) (Schwarz 1978),

which is calculated by

BIC -2 In L(Ig, 2, Lr) + dimension(QLg, fD, LIr) ln(n). (4-10)

where all the parameters are defines similarly as above except that n is the total

number of observation at a particular time point. Since the BIC criterion adjusts

the effect of sample size, the model selected by BIC will be more parsimonious.

4.3.3 Modelling the Covariance Structure

To model the covariance structure for longitudinal data, we need to make the

following assumptions: (1) the error ei(t) in Eq. 2-3 is normally distributed with

mean zero and variance a 2(t), and (2) the error ei(t) is independent and identically

distributed among different individuals. A number of statistical models have been

used to model the covariance structure (Diggle et al. 2002). In earlier functional

mapping, the first-order autoregressive (AR(1)) model has been used (V\!. et al.

2002), which is expressed as

a2(1) 2... 2 ) a2

for the variance, and

1r t1, t2) 0-2plt2-tl (4- 11)

for the covariance between any two time intervals t1 and t2, where 0 < p < 1 is

the proportion parameter with which the correlation decays with time lag. The

parameters that model the structure of the (co)variance matrix are an, i ,1 in f,.

To remove the heteroscedastic problem of the residual variance, which violates

a basic assumption of the simple AR(1) model, two approaches can be used.

The first approach is to model the residual variance by a parametric function of

time, as originally proposed by Pletcher and Geyer (1999). But this approach

needs to implement additional parameters for characterizing the age-dependent

change of the variance. The second approach is to embed Carroll Rupert's

(1984) transform-both-sides (TBS) model into the growth-incorporated finite

mixture model (Wu et al. 2004b), which does not need any more parameters.

Both empirical analyses with real examples and computer simulations -~i--.- -1

that the TBS-based model can increase the precision of parameter estimation

and computational efficiency. Furthermore, the TBS model preserves original

biological means of the curve parameters although statistical in &iv -. are based on

transformed data.

The TBS-based model di-pl '1., the potential to relax the assumption of

variance stationarity, but the covariance stationarity issue remains unsolved.

Zimmerman and Nnfiez-Ant6n (1997) proposed a so-called structured antedependence

(SAD) model to model the age-specific change of correlation in the analysis of

longitudinal traits. The SAD model has been employ. .1 in several studies and

di-pl' .v many favorable properties (Zimmerman and Nfinez-Ant6n 2001). Zhao et

al. (2005) incorporated the first-order SAD (SAD(1)) model into modelling of the

covariance matrix.

In this article, we use a different modelling approach that is as simple as the

AR(1) and as flexible as the SAD(1). This approach has two steps. In step 1, the

intra-individual correlation structure is modelled. In many cases, a systematic

pattern of correlation is evident, which may be characterized accurately by a

relatively simple model. The intra-individual correlation among the time-dependent

elements of e, for individual i is assumed to follow a pattern, expressed as

corr(ei) Ri(),

where the correlation matrix R,(Q) is a function of a vector of correlation

parameters Q. The correlation structure can be described by the AR(1) model

in which = p (Eq. 4-11).

In step 2, time-dependent variances are specified according to Horwitz's

Rule in analytical chemistry. This rule proposes that there exists an empirical

relationship between concentration and variance (Atkinson 2003). Thus, we

can similarly model time-dependent variances for individual i by considering its

-,.1 iJ vpic means at various time points, expressed as

a2 (t) -a2u t).

Therefore, the corresponding covariance matrix can be modelled by

1 1
Sli, cov(y) a2UjRi(2)uj, (4 12)

where uy, = di -[ i(1),. u (lT)]. Since the covariance structure is modelled as

a function of genotypic mean, it is called the mean-covariance (\ -C) model. The

M-C model has a great flexibility for modelling the covariance matrix of the PCD

process. The unknown parameters to be estimated in the M-C model are ai 1v, 1 in

Q, = (o2, Q) with accepted means.

4.3.4 Computation Algorithms

A number of computational algorithms can be used to estimate the unknown

parameters, f = (fs, fm, [v). These algorithms include the maximum likelihood

method implemented with the EM algorithm (Dempster et al. 1977). As usual,

the QTL position parameter fs can be viewed as a known parameter because a

putative QTL can be searched at every 1 or 2 cM on a map interval bracketed

by two markers throughout the entire linkage map. The amount of support for a

QTL at a particular map position is often di-phil i, d graphically through the use

of likelihood maps or profiles, which plot the likelihood ratio test statistic as a

function of map position of the putative QTL (Lander and Botstein 1989).

Because m, and Q, are contained in complex nonlinear equations, it is

difficult to derive a closed form of the maximum likelihood estimates (MLEs) for

them. Theoretically we could use Simplex algorithm to estimate all the parameters

contained in Q (Zhao et al. 2004). The Nelder-Mead simplex algorithm, originally

proposed by Nelder and Mead (Nelder and Mead 1965) is a direct search method

for nonlinear unconstrained optimization. It attempts to minimize a scalar-valued

nonlinear function using only function values, without any derivative information

(explicit or implicit). The algorithm uses linear adjustment of the parameters until

some convergence criterion is met. However, because of the complex nonlinear

function being minimized by Simplex algorithm, it cannot alvb--, guarantee the

correct convergence of covariance parameters aO2 and p during the minimization

process. This consequently results in negative infinity of the log-likelihood function

and convergence will never be reached. Based on these concerns, we developed a

EM-Simplex combined algorithm to estimate the unknown parameters contained

in Q. The detailed algorithm is given in Appendix B. The simplex algorithm

is integrated into the EM algorithm to estimate the mean parameters, namely

the logistic curve, LEP parameters with given covariance parameters. Then the

estimated mean parameters can be used to maximize the covariance parameters.

Iterative procedure will be processed until convergence.

Under the joint modelling framework, two mean functions, growth and death,

need to be connected. Two constraints are imposed to make the PCD curve

continuous at the transition time point t* for each individual. The first constraint

is to make the growth mean equal to the death mean at time t* to make the

entire curve continuous. The second constraint is that the two functions has the

same score at time ti to make the entire curve smooth. These two constraints are

expressed as

With these constraints, we obtain the expressions of one growth parameter

and one death parameter for any QTL genotype j. For example, if the Legendre

polynomial order is 3, we can solve the equations to obtain the estimates of aj and

Vlj as follows

2[,,, 0.5(3 i 1) 5t3 (1 + be- )2
S 1 + bje-ct bjc(At)t'e-cit:
Sajbjcj(At)e-cA t12 2
I =l 2(1 + bje-ct)2 3t2J 0.5(15t 3)v3

where t' is the adjusted time for t* and At tmax tmin.

4.3.5 Calculating Curve Heritability

It is easy to calculate the heritability level (H2) when traits are measured at a

single time point. But for longitudinal traits, heritability calculation is difficult. We

propose two --v to do it:

1. Calculate H2 at a single time point t where the traits shows the highest
variation. For a backcross design, the genetic variation is given by a2
l[uI(t) uo(t)]2, where uj(t) (j 1,0) is the genetic mean for genotype j at
time t, and the heritability H2(t) 7(t)/o2(t) is calculated where o 2(t) is
the residual variance at time t.

2. Calculate H2 using the area under curve (AUC). Functional mapping maps
the dynamic gene effect over time. The genetic variation explained by the
time. We propose to calculate the genetic variation by a2 = i(AUCI -
AUCo)2, where AUCI and AUCo are the AUC for two different genotypes.

The AUC is calculated by

AUC j 'j dt + / v Pr(t')dt'

a.[l. (bj @ ti) -- n(b -+ c ln(,jt-l)] I vIPr(t')dt'
Cj be '

where t* is the transition time point, aj, bj and rj are the growth parameters
for genotype j, Pr(t') is a vector of LEP with order r and u, is the base
,. iJ, vpic mean parameters and t' is the adjusted time point.

4.4 Hypothesis Testings

One of the main advantages of functional mapping is that it allows for

a number of hypothesis tests to examine the genetic control mechanisms of

growth throughout development and in response to varying environmental or

developmental clues. Wu et al. (2004a) have formulated some of these hypothesis

tests which include the global test of genetic effects on the entire developmental

process, the regional test of genetic control over a particular developmental period

of interest and the point test for the timing of developmental events. From a

genetic perspective, we can also test how different genetic action modes pl ,v a role

in regulating the developmental process. With such a complete set of tests, we are

able to address biological questions related to the genetic control mechanisms of

PCD traits.

Testing whether specific QTL exist to affect the PCD process is a first

step toward the understanding of the detailed genetic architecture of complex

phenotypes. The genetic control over entire developmental process of PCD can be

tested by formulating the following hypotheses

Ho: Ug = ugo and ui = uDo
(4-13)
HIa : At least one of the qualities above does not hold

The Ho states that there is no QTL affecting the dynamic PCD process

(the reduced model), whereas the Hia proposes that such a QTL does exist (the

full model). The test statistic for testing the hypotheses is calculated as the

log-likelihood ratio of the reduced to the full model as given below

LR -2[logL(Q|y, M) -log L(|y, M)], (4-14)

where f and 2 denote the MLEs of the unknown parameters under Ho and Hil,

respectively. The critical threshold value for declaring the presence of QTL can be

empirically calculated based on the permutation tests (C'!ni !In!! and Doerge 1994).

Other hypotheses can be made to test if the detected QTL only controls the

growth phase with the following alternative hypothesis

Hlb : ugi / UgO and uD = u O (4-15)

or if the detected QTL only controls the death phase with the following alternative

Hic : ugi = ugo and uDi / uDo (4-16)

The critical thresholds for the above two hypotheses can be determined using

simulation studies. Only when both Hlb and HIc are rejected, the detected QTL is

thought to pleiotropically affect the growth and death phases.

The proposed model can be used to test the influence of QTL on growth in

different stages of development, lag, exponential, declining growth, stationary and

death. These tests can be based on the area under curve during a time course of

interest. Simulation studies are used to determine the critical thresholds for each

test.

4.5 Results

4.5.1 A Worked Example

I use the proposed model to analyze a real data set of rice. Data description is

given in C'!i Ipter 3. Figure 3-2 shows the dynamics of tiller numbers for each DH

line measured at 9 time points. Tiller growth is thought to be an excellent example

Table 4-1. Model selection for Legendre polynomial orders based on the AIC and
BIC values under the M-C covariance-structuring model

Selection Order
criterion 0 1 2 3 4
AIC 2437.8 1096 759.74 559.87 670.65
BIC 2453.9 1109.4 775.77 578.58 692.03

of PCD in plants (Greenberg 1996) since it experiences several developmental

stages when rices grow. Our semiparametric model was used to map specific QTL

that determine the dynamic changes of tiller number during ontogeny. Although

the growth phase of tiller number can be well modelled by a logistic equation

defined by parameters a, b and c (Eq. 4-5), no proper equation can be used to

model the death phase. For this reason, a nonparametric approach based on the

Legendre polynomial function is adopted in the framework of QTL mapping. But

this will encounter the issue of order determination. To detect the best order of the

Legendre polynomial function for this rice data set, we calculated the AIC and BIC

information criteria for various orders (Table 4-1). Both the criteria provide the

consistent result that the death phase of tiller number can be best explained by the

Legendre polynomial of order 3.

By genomewide scanning for QTL at every 2 cM within each marker interval

across 12 rice chromosomes, our model has probed three i, i"r QTL that tri- -.r

their effects on the overall PCD process of tiller number. As shown by the

genomewide log-likelihood ratio (LR) profile between the full and reduced (no

QTL) model for tiller number trajectories in Fig. 4-2, these three QTL are located

between markers RG146 and RG345 and between markers RZ730 and RZ801 on

chromosome 1 and on marker RZ792 on chromosome 9. The positions of markers

on the linkage groups (Huang et al. 1997) are indicated at ticks. The threshold

value for claiming the existence of QTL is given as the horizonal dotted line for

the genome-wide level and dashed line for the chromosome-wide level. Of these

40 chrom 1 chrom 2 chrom 3
30
20
VoSOV

34U
30
S20
10
40
30
20
10
n

chrom 4 chrom 5 chrom 6 chrom 7

chrom 8 chrom 9 chrom 10 chrom 11 chrom 12

Test Position

Figure 4-2. The LR profile plot. The genomic positions corresponding to the peak
of the curve are the MLEs of the QTL localization (indicated by the
arrows).

three detected QTL, the first is significant genomewide, whereas the other two are

significant chromosomewide, all at the 5'. significance level based on the critical

thresholds determined from the permutation tests.

To know more about the behavior of the detected QTL, we tabulate the MLEs

of curve parameters that specify the growth phase and genotypic basis effects

that specify the death phase, along with the approximate standard errors of the

estimates (Table 4-2). All the parameters that specify the growth and death phases

for different QTL .-, n. .,ipes (j = 1 for QQ or 0 for qq in the DH population) are

estimated with reasonably high precision as shown by the standard errors, although

the estimation precision tend to be better for the growth parameters than for the

death parameters (Table 4-2). The parameters that model the structure of the

covariance matrix based on the M-C model can also be well estimated, -.-.--i -ii-

good behavior of our model.

Table 4-2. The MLEs of the growth and death parameters (and their approximate
standard errors in the parentheses) for significant QTL detected on
different chromosomes for tiller numbers in a DH rice population

Parameters
QTL position (7)
Marker interval

Growth parameters
a2
b2
rT2

Death
U20
U21
U22
U23

parameters

Covariance parameters
Oa2

120 cM
RG146 RG345

10.5973
7.,-!' ; (0.20)
1.7257(0.02)

12.3823
9.6068(0.38)
1.8834(0.03)

8.5001(0.22)
-2.4290
0.1008(0.05)
0.5118(0.04)

9.6225(0.37)
-3.2068
0.1472(0.08)
0.6575(0.07)

0.0493(0.01)
0.8747(0.01)

C '! i, i ..* 2*
198 cM
RZ730 RZ801

11.0824
7.5167(0.28)
1.8486(0.03)

11.5461
8.9779(0.34)
1.661(0.03)

8.6888(0.32)
-2.7785
0.1326(0.07)
0.5685(0.05)

9.2706(0.31)
-2.6222
0.0964(0.07)
0.5765(0.05)

0.0508(0.004)
0.8738(0.01)

'! i !.... n. 9 *
119.1 cM
RZ792

10.7188
7.8547(0.35)
1.6747(0.04)

11.5739
8.3427(0.39)
1.8311(0.04)

8.5837(0.32)
-2.4647
0.0989(0.07)
0.5312(0.05)

9.1128(0.35)
-2.8732
0.1030(0.07)
0.5784(0.06)

0.0535(0.005)
0.8786(0.01)

where refers to the chromosome-wide significant QTL

Using the MLEs of parameters for the growth and death phases, we draw

the developmental trajectories of tiller number for the two different QTL

.-.. -i' vpes (Fig. 4-3) with tiller number trajectories for all individuals indicated

in background. Each QTL shows a unique developmental pattern over time. For

example, the dynamic process of genetic effects for the QTL located between

markers RZ730 and RZ801 on chromosome 1 is different than those for the other

two QTL. Statistical tests based on hypotheses 4-15 and 4-16 show that the QTL

Table 4-3. The MLEs of the QTL position and the model parameters derived from
100 simulation replicates. The squared roots of the mean square errors
of the MLEs are given in parentheses.

True parameter
A -48

Growth parameters
a2 15.033
b2 8.324
r2 = 1.814

10.926
- 7.602
- 1.522

Death parameters
20 = 9.817
t21 = -6.453
Ut22 -0.366
t23 0.958

uoo = 7.893
U01 = -3.681
U02 = -0.208
U03 = 0.625

H2-
n 100
46.222(4.03)

8.3705(0.31)
1.I,1 .',(0.04)

7.6648(0.45)
1 I I_'(0.06)

9.8818(0.43)

-0.3623(0.10)
0.9695(0.08)

0.1
n =200
45.98(3.19)

8.3317(0.22)
1.812(0.02)

7 '1...' ,(0.31)
1.5255(0.03)

9.8442(0.25)

-0.3692(0.06)
0.9554(0.05)

8.1185(0.42) 7.9525(0.29)

-0.2127(0.09)
1, 1. L(0.08)

-0.2164(0.06)
0.6277(0.06)

H2
n 100
46.02(3.27)

8.3404(0.12)
1.8124(0.01)

7.5957(0.18)
1.5219(0.02)

9.8363(0.15)

-0.3661(0.03)
0.9589(0.03)

7.8879(0.13)

-0.2056(0.03)
0.6247(0.03)

0.4
n 200
46.06(2.69)

8.3211(0.09)
1.813(0.01)

7.6274(0.13)
1.5224(0.01)

9.8257(0.10)

-0.3673(0.03)
,',',11,,"(0.02)

7.9137(0.12)

-0.2109(0.02)
0.6260(0.02)

Covariance parameters
02.1 -0.1194
02.4 -0.0199

0.1096(0.02) 0.1167(0.01)

p = 0.85 0.8415(0.02) 0.8475(0.01)

0.0198(0.002) 0.0197(0.002)

0.8484(0.02) 0.8478(0.01)

The locations of the two QTL are described by the map distances (in cM) from the first
marker of the linkage group (100 cM long).

detected between markers RG146 and RG345 on chromosome 1 and on marker

RZ792 on chromosome 9 merely control the growth phase, whereas the second QTL

on chromosome 1 controls the entire developmental process (P < 0.05).

4.5.2 Simulation

We performed a series of simulation studies to examine the statistical

properties of the model. Six equidistant markers are simulated from a backcross

population and are ordered as MAI-MA6 on a linkage group with the length of

100 cM. The Haldane map function was used to convert the map distance into

the recombination fraction. Different heritability levels (H2 = 0.1 vs 0.4) and

different sample sizes (n = 100 vs 200) were considered in the simulation study

to examine model's performances under different scenarios. The putative QTL is

located between marker M3 and M4, at 48 cM from the first marker. Data are

simulated by assuming that QTL controls the entire developmental process. The

simulated data have 9 continuous time points. The means at different time points

used to model the covariance matrix based on the M-C model are the average

of the two genotypic means. Table 4-3 lists the results from simulation. The

true parameters are given in the left column. In general, our model can provide

reasonable estimates of the QTL positions and the growth and death parameters

determined by the QTL, with estimation precision depending on heritability

level and sample size. In all cases of different sample sizes and heritabilities, the

maximum values of the LR landscapes from 100 simulation replicates are beyond

the critical thresholds at the a = 0.001 level determined from 1000 permutation

tests for the simulated data, -~i-.-. -Iii.-; that our model has 10(1'. power to detect

QTL in these conditions. The precision of parameter estimation is evaluated

in terms of the square root of the mean squared errors (SMSE) of the MLEs.

The QTL positions and effects can be better estimated when the PCD trait

has a higher than lower heritability or when sample size is larger than smaller

(Table 4-3). But the increase of H2 from 0.1 to 0.4 leads to more significant

improvement for the estimation precision than the increase of n from 100 to 200.

For example, the SMSE of the growth parameter Co for QTL genotype qq reduces

by more than one fold when H2 is increased from 0.1 to 0.4 for a given sample size,

whereas such reduction is much smaller when n is increased from 100 to 200 for a

given heritability. This -i.:r-- -i that in practice it is more important to manage

experiments to reduce residual errors (and therefore increase H2) than simply

increase sample size.

4.6 Discussion

The growth of any tissue, whether normal or malignant, is determined by the

quantitative relationship between the rate at which cells proliferate and the rate

at which cells die. Depending on how the rate of cell proliferation is compromised

or coordinated with the rate of cell death, all tissues will undertake two distinct

processes, growth or death, during development (Jacobson et al. 1997). Unlike the

detailed framework for cell proliferation, an understanding of the initiation of cell

death and the cellular mechanics of this process is still in its infancy. Cell death

can occur as an active and orderly process of development, a process coined by the

term Ii'.;';'",n ,1 cell death (PCD) (Horvitz 2003).

PCD, or referred to as ipop ..-i- appears to be a universal feature of animal

development, and abnormalities in PCD have been associated with a broad v ii, I,

of human diseases, including certain cancers and neurodegenerative disorders

(Hanahan and Weinberg 2000). In plants, PCD is also ubiquitous for essential

development and survival, including xylogenesis, reproduction, senescence and

pathogenesis (Greenberg 1996). To make PCD control and execution efficient,

particular genetic mechanisms should be involved in regulating and modulating this

process in response to various developmental and environmental stimuli. The use

of organisms with simple structure like the the nematode Caenorhabditis elegans

and Drosophila has led to the identification of numerous genes responsible for PCD

(Ellis and Horvitz 1986; Horvitz 1999, 2003; Yuan and Horvitz 2004). The 2002

Nobel Prize in medicine was awarded to three scientists because of their discoveries

of the genetic regulation of PCD (Horvitz 2003).

Although the use of simple model systems can provide a wealth of information

about the genetics of PCD for more complicated organisms like animals and higher

Figure 4-3.

6 7 8 9

6 7 8 9

6 7 8 9

Plot of the dynamic changes of tiller numbers for two curves each
presenting two groups of ., -i ii .pes, QQ and qq, at each of the three
QTL. A) QTL detected between markers RG146 and RG345, and B)
QTL between markers RZ730 and RZ801 on chromosome 1, and C)
QTL on marker RZ792 on chromosome 9.

2 3 4 5
Time

2 3 4 5
Time

2 3 4 5
Time

plants, some key questions cannot be well addressed without a direct use of these

organisms. Recent development of high-throughput molecular technologies has

made it possible to generate a massive amount of genetic and genomic data for

any organism almost with no limit. This, thus, presents a pressing need for the

development of vigorous, detailed and analytical approaches to unravel the genetic

control and regulation mechanisms that underlie the PCD process. In this article,

we have for the first time developed a statistical model that can make a systematic

scan of quantitative trait loci (QTL) for PCD throughout the entire genome with a

well-covered genetic linkage map. This model has been validated by a real example

for the PCD process of tiller number in rice. Three QTL were detected to affect

tiller number trajectories during a growing season in the field. The locations for

two of the QTL detected on chromosome 1 are consistent with those estimated

from basic interval mapping of single traits (Yan et al. 1998), but the third QTL

detected on chromosome 9 was previously undetected by interval mapping. It seems

that our model has not been able to detect many other QTL detected by Yan et

al., which may be due to the difference in the threshold criteria used to claim the

existence of a QTL.

The rationale for this PCD mapping model is in spirit similar to that

established in earlier functional mapping of growth curves ('\ I, et al. 2002; Wu

et al. 2004a, 2004b, 2004c; Zhao et al. 2005). Both types of models integrate

biological principles into the framework for QTL mapping constructed on the

basis of Gaussian mixture models. They have provided a quantitative platform for

testing the interplay between gene actions and organ development. In addition, by

taking advantages of structuring the covariance matrix, these models estimate a

much fewer number of parameters, thereby increasing the statistical power of the

model.

It can also be seen that the PCD mapping model proposed in this article

is different from earlier published functional mapping approaches purely based

on parametric modelling. The current model splits the PCD process into two

sequentially distinct phases-growth and death. As in a parametric model,

universal growth laws (West et al. 2001) are used to model the growth phase,

whereas nonparametric modelling is performed for the death phase. This

combination of parametric and nonparametric models, referred to as semipara-

metric modelling, is aimed to overcome the problem of the death phase that cannot

be mathematically described. Although it is feasible for a nonparametric approach

to model any form of curve including the entire growth-death curve like Fig. 4-1,

this approach does not make full use of biological information contained in the

growth phase and, therefore, loses the advantages of parametric functional mapping

in the biological interpretation of results (see Wu et al. 2004a).

Nonparametric part of our semiparametric model is based on Legendre-polynomial

approaches. As shown by Kirkpatrick and Heckman (1989), Legendre polynomials

have several favorable properties for curve fitting which include: (1) the functions

are orthogonal, (2) it is flexible to fit sparse data, (3) higher orders are estimable

for high levels of curve complexity and (4) computation is fast because of good

convergence. Other nonparametric regression methods using kernel estimates

have been considered for the mean structure of growth curve data by Altman

(1990), Boularan et al. (1994), Wang and Ruppert (1995) and Ferreira et al.

(1997). Relative to nonparametric modeling of the mean structure, nonparametric

covariance modelling has received little attention. Leonard and Hsu (1992) derived

a B ,i,- ~i in approach for nonparametric estimates of the covariance structure.

Diggle and V' iI i 1, (1998) used kernel-weighted local linear regression smoothing of

sample variograms ordinates and of squared residuals to provide a nonparametric

estimator for the covariance structure without assuming stationarity. It is

88

appealing to incorporate these nonparametric or semiparametric approaches

into our functional mapping framework to increase the model's flexibility.

CHAPTER 5
CONCLUDING REMARKS

Programmed cell death (PCD), a deliberate cell suicide process in a multicellular

organism, is undertaken in a coordinated manner that generally confers advantages

during the organism's life cycle. It serves fundamental functions during both plant

and metazoa (multicellular animals) tissue development. The information about

how genes control or affect this process is crucial for a better understanding of the

physiological mechanisms of PCD. The identification of genes that contribute to the

differentiation of cells has been one of the most important and difficult tasks for

PCD research. While our understanding of the molecular mechanisms of PCD has

been largely based on simple organisms like the nematodes, a direct use of more

complex organisms such as plants, animals and humans presents a pressing need for

examining the genetic regulation of PCD for complex traits including grain yield,

meat production and cancer.

The genes or quantitative trait loci (QTL) that mediate PCD can be detected

using genetic mapping with available abundant information of molecular markers

and the conceptual idea of functional mapping. In this dissertation, I have for the

first time derived a series of statistical models and algorithms for detecting and

characterizing specific QTL that are responsible for PCD process under various

problem settings. The developed models can make a systematic scan of QTL for

PCD across the entire genome with a well-covered genetic linkage map.

The general mapping framework is stated in Chi Ipter 2 which presents a

conceptual idea on how to model and characterize the developmental PCD process.

Two interconnected core issues addressed in this chapter are how to model the

developmental mean PCD process and the covariance structure efficiently in order