UFDC Home  myUFDC Home  Help 



Full Text  
STATISTICAL FUNCTIONAL MAPPING FOR GENETIC CONTROL OF PROGRAMMED CELL DEATH By YUEHUA CUI A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2005 Copyright 2005 by Yuehua Cui I dedicate this work to my wife, and my parents. ACKNOWLEDGMENTS First and most importantly, I would like to express my deepest gratitude to Dr. Rongling Wu, my advisor, for his tremendous help and encouragement during my doctoral research. He has been a constant source of energy and inspiration for me. Without his forceful guidance and excellent advisement, this dissertation would not have been possible. Further, I would like to thank Dr. Casella for giving me valuable 1. i.' and great help on my dissertation. I appreciate all of his advice and encouragement. Also, I would like to thank the other members of my committee, Drs. Kenneth Portier, Alex Trindade and Kathleen Shiverick for their knowledge, wisdom and most of all, for their precious time. Each of them pl i,. a unique and special role in my time here as a graduate student. Special thanks also go to Dr. Ramon Littell, Dr. James Booth, Dr. Yongsung Joo, the statistical genetics group members for discussing various problems with me, and the faculty and students in the Department of Statistics for their help and friendship. Finally, I would like to give my deepest thanks and appreciation to my parents for the constant support and unconditional love they provided. I appreciate the endless love and support I receive from my brothers and sisters and their respective families. Most importantly, I would like to give my special thanks and love to my wonderful wife Kelian for her constant love, help, support, understanding, and encouragement since the di we started this journey. TABLE OF CONTENTS page ACKNOW LEDGMENTS ............................. iv LIST OF TABLES .. .. ... .. .. .. ... .. .. .. .. ... .. .. .. viii LIST OF FIGURES ................................ ix A B ST R A CT . . . . . . . . . x CHAPTER 1 INTRODUCTION .............................. 1 1.1 Basic G enetics . . . . . . . 1 1.2 Genetic M apping .. .. .. ... .. .. .. .. ... .. .. .. 3 1.3 Traditional QTL Mapping Approaches .............. 5 1.4 Functional M apping .......................... 8 1.5 Programmed Cell Death ....................... 10 1.6 O objectives . . . . . . . 13 2 GENERAL MAPPING FRAMEWORK . . . . 15 2.1 Genetic Experiment Designs . . . . . 15 2.2 Statistical M odels . . . . . . 16 2.2.1 Finite Mixture Models . . . . . 16 2.2.2 Likelihood Function . . . . . 18 2.2.3 Modelling the Mean Vector . . . . 19 2.2.4 Modelling the Covariance Matrix ... . . 21 2.2.5 Computational Algorithms . . . . 26 2.2.6 Hypothesis Testing . . . . . 28 2.2.7 Asymptotic Sampling Error . . . . 29 3 A NONPARAMETRIC APPROACH FOR FUNCTIONAL MAPPING OF PROGRAMMED CELL DEATH . . . . 31 3.1 Introduction . . . ... . . 31 3.2 The Finite Mixture Model and Likelihood Function . ... 33 3.3 Modelling the Mean PCD Process . . . . 34 3.3.1 Legendre Polynomial . . . . . 34 3.3.2 Informationtheoretic Criteria . . . . 37 3.3.3 Consistency of Model Selection Criteria . . 40 3.3.4 Selection Power . . . . . . 41 3.3.5 LEP Order Selection . . . . . 42 3.4 Modelling the Covariance Structure . . . . 43 3.4.1 AR(1) Covariance Model . . . . 43 3.4.2 SAD(r) Covariance Model. . . . . 43 3.5 Computation Algorithm and Hypothesis Testing . . ... 44 3.6 Calculating the Heritability Level . . . . 45 3.7 LEP Order Selection Simulation . . . . 46 3.8 Model Assessment by Simulation . . . . 48 3.9 Real Example . . . . . . . 51 3.10 Discussion . . . . . . . 59 4 FUNCTIONAL QTL MAPPING PROGRAMMED CELL DEATH: A SEMIPARAMETRIC MODEL . . . . . 62 4.1 Introduction . . . . . . . 62 4.2 Different Phases of PCD . . . . . 64 4.3 Statistical Model . . . . . . 67 4.3.1 The Mixture ModelBased Likelihood . . ... 67 4.3.2 Semiparametric Modelling of the Mean Vector . ... 69 4.3.3 Modelling the Covariance Structure . . .... 72 4.3.4 Computation Algorithms . . . . 74 4.3.5 Calculating Curve Heritability . . . . 76 4.4 Hypothesis Testings . . . . . . 77 4.5 R results . . . . . . . . 78 4.5.1 A Worked Example . . . . . 78 4.5.2 Sim ulation . . . . . . 82 4.6 Discussion . . . . . . . 84 5 CONCLUDING REMARKS . . . . . . 89 6 PROSPECTS . . . . . . . . 93 6.1 Missing Data Problem . . . . . . 93 6.2 Joint Mapping of PCD and TimeToEvent . . ... 95 6.3 Nonparametric Modelling of Covariance Structure . ... 96 APPENDIX A DERIVATION OF EM ALGORITHM WITH NONPARAMETRIC APPROACH . . . . . . . . 97 A.1 Nonparametric with AR(1) Covariance Structure . .... 98 A.2 Nonparametric with SAD(1) Covariance Structure . ... 100 B DERIVATION OF EMSIMPLEX ALGORITHM WITH SEMIPARAMETRIC APPROACH .... . . 103 C CONSISTENCY OF MODEL SELECTION CRITERIA ........ 105 C .1 The M odel . . . . . . . 105 C.2 Order Selection Criterion ....................... 106 C.3 Consistency of the Selection Criterion . . 107 REFERENCES . . . . . . . . 111 BIOGRAPHICAL SKETCH . . . . . . . 123 LIST OF TABLES Table page 11 Data structure for a backcross design in FunMap . . 10 21 Conditional probabilities of QTL .'i vpes given on the flanking markers in a backcross design . . . . . . 16 31 LEP order selection power comparison using different information criteria with AR(1) covariance structure . . . . . 47 32 Power comparison for LEP order selection using different information criteria with SAD(1) covariance structure ... . . 48 33 The MLEs of the QTL position and the model parameters derived from 100 simulation replicates with the AR(1) covariance structure. The squared roots of the mean square errors of the MLEs are given in parentheses . . . . . . . 49 34 The MLEs of the model parameters and the QTL position derived from 100 simulation replicates with the SAD(1) covariance structure. The squared roots of the mean square errors of the MLEs are given in parentheses . . . . . . . 50 35 The MLEs of the parameters and their .,vmptotic standard errors in the parentheses with the AR(1) covariance structure . .... 56 36 The MLEs of the estimated parameters and their i,'iiii ,1 ic standard errors in the parentheses with the SAD(1) covariance structure 57 41 Model selection for Legendre polynomial orders based on the AIC and BIC values under the MC covariancestructuring model . ... 79 42 The MLEs of the growth and death parameters (and their approximate standard errors in the parentheses) for significant QTL detected on different chromosomes for tiller numbers in a DH rice population 81 43 The MLEs of the QTL position and the model parameters derived from 100 simulation replicates. The squared roots of the mean square errors of the MLEs are given in parentheses. . . . 82 LIST OF FIGURES Figure page 11 Simple growth curve . . . . . . ... 9 12 Simple programmed cell death curve . . . . 13 31 Linkage map for the rice genome . . . . . 52 32 Dynamic changes of the number of tillers for 123 DH lines of rice as an example of PCD in plants . . . . . 53 33 Tiller growth . . . . . . . 54 34 The LR profile plot. The solid and dotted curves correspond to the LR profile with SAD(1) and AR(1) covariance respectively. .. 55 35 The dynamic changes of tiller numbers for two curves each presenting two groups of .., .1 irvpes, QQ and qq, at each of the four QTL detected with the AR(1) covariance structure. ....... . . 58 36 Two curves for the dynamic changes of tiller numbers each presenting two groups of .., .1 irvpes, QQ and qq, at each of the four QTL detected with the SAD(1) covariance structure. . . . . 59 41 A typical example of PCD that includes five different stages, (1) lag, (2) exponential, (3) declining growth, (4) stationary and (5) death. 65 42 The LR profile plot. The genomic positions corresponding to the peak of the curve are the MLEs of the QTL localization (indicated by the arrows) . . . . . . . 80 43 Plot of the dynamic changes of tiller numbers for two curves each presenting two groups of .., : i,' vpes, QQ and qq, at each of the three QTL. A) QTL detected between markers RG146 and RG345, and B) QTL between markers RZ730 and RZ801 on chromosome 1, and C) QTL on marker RZ792 on chromosome 9. ...... . . 85 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy STATISTICAL FUNCTIONAL MAPPING FOR GENETIC CONTROL OF PROGRAMMED CELL DEATH By Yuehua Cui August 2005 C!i ,i: Rongling Wu Major Department: Statistics The development of any organism represents a complex dynamic process which is genetically controlled by a network of genes and environmental factors. Programmed cell death (PCD), a physiological cell suicide process, occurs during the development of most organisms and is typically a complex longitudinal trait. Identifying genes that contribute to the differentiation of cells has been one of the most important and difficult tasks for PCD studies. With the new conceptual idea of functional mapping and abundant molecular markers available, specific genes that regulate PCD can now be detected. In this dissertation, I extend the principle of functional mapping and proposed a series of statistical models and computational algorithms for mapping genes that are associated with the dynamic features of the PCD process. A typical PCD process involves two alternate phasesgrowth and death. The nonparametric approach developed based on the orthogonal Legendre polynomial is shown great flexibility in capturing various developmental PCD patterns. The limitation of the nonparametric model is its difficulty in implementing the biological principle of growth phase. To enhance the biological relevance of the mapping model, a semiparametric model is derived that integrates a parametric logistic function for the growth process and a nonparametric Legendre function for the dead phase. Several information criteria are proposed to select the optimal Legendre order for both the nonparametric and semiparametric models. A series of stationary and nonstationary processes is applied to model the covariance structure among the phenotypes at different time points. The developed models allow for numerous hypotheses tests of regarding the genetic control mechanism of PCD features. The EM and simplex algorithms are derived to estimate the parameters that model the mean vector and the covariance structure. The model is statistically investigated through simulation studies and is validated by a real example for tiller number in rice. Our model provides a quantitative and testable framework for assessing the interplay between genes and organism developmental patterns, and will have great implications for elucidating the detailed genetic architecture of any longitudinal traits such as PCD. CHAPTER 1 INTRODUCTION 1.1 Basic Genetics In most organisms, chromosomes are paired: one is inherited from the paternal parent and one from the maternal parent. A gene represents a segment of functioned sequence. It could be a coding sequence that codes for a particular protein, or a noncoding sequence regulating the expression of other genes. Each gene is located at a specific position on a chromosome. This site (called a locus) contains alleles of one particular gene. Alleles are alternative DNA sequences on a particular locus. For a haploid organism, a gene contains two alleles. But for a triploid organism (for instance, endosperm) a gene contains three alleles. The number of genes varies from several thousand to more than 10,000 in different species. For example, the human genome consists of 23 homologous pairs of chromosomes containing more than 30,000 genes; in yeast, there are only about 6,000 genes. Genetic differences between two individuals exist because each genetic locus could contain different alleles. Different combinations of alleles at a locus form the g, I.li,, that determine individual ph. ii.. ivpic values. P, I'.1li,", the product of gene expression, denotes an observable characteristic (e.g., skin color and body height). If there is a gene contributable to the variation of a certain phenotype, then individuals carrying different genotypes of this gene will have different phenotypes. During meiosis, a process which produces the i.,mr,/. I (sperm and egg cells), the homologous chromosome strands pair up and may exchange chromosome segments. This random shuffling of genetic materials during miosis is called recombination (one of the i, i,' .r sources of genetic diversity in nature). At the end of meiosis, the chromosome pairs are separated. As a result, each gamete carries a single copy of chromosomes. A linear map of the relative positions (chromosomal loci) of genes along a chromosome is called a 1.,.,,,.' map. Distances on a map are established on the basis of recombination frequency by linkage analysis. A 1.,.,,,g.' map does not give the physical distance (which is shown by the jli/.:. ', map) between genes but rather their relative positions. The closer two genes are, the more tightly they are linked and hence the more often they will be inherited together. I. ,.,.,.' distance (d) is measured in centiMorgans (cM). The recombination fraction (r) can be calculated using the .i, .,,,.' distance. If no interference is assumed, then based on the Poisson assumption of an even number of crossovers, the Haldane map function can be used to calculate r 1 ( 2d/100) 2 If interference is taken into account, the Kosambi map function should be used 64d/100 t 2(e4d/100 + 1) Therefore, r 0.5 will give an infinite distance between two markers, which means they are unlinked. In traditional quantitative trait studies (Lynch and Walsh 1998), phenotypic variation can be partition into genetic variation and nongenetic variation VP = VP + VE where Vp is the total phenotypic variance and is partitioned into genetic variance (VG) and environmental variance (VE). The ratio H2 = V1/Vp heritabilityy) measures the degree of phenotypic deviation transmitted from parents to their offspring and provides useful information in predicting the response to selection. 1.2 Genetic Mapping There are two basic approaches to detect the presence of i' 'r genes: segregation analysis based on phenotypic data only (Wu et al. 2001) and linkage analysis based on both phenotypic data and molecular marker data (Lander and Botstein 1989). My study focused on the second approach. A quantitative trait has measurable variation due to genetic and/or environmental influences. Unlike qualitative traits (which cannot be measured quantitatively), quantitative variation can consist of discrete or continuous values. For example, the number of separate tumors in the lung of a cancerprone mouse is a discrete trait. Plant height/diameter and grain yield are examples of continuous traits. Genetic studies in quantitative traits are not limited to morphological traits such as body height or weight. Such studies have increased tremendously in other areas including physiological and biochemical traits such as blood pressure and cholesterol level, complex diseases such as heart disease and diabetes, and behavioral traits such as reaction to alcoholism and smoking. Genetically speaking, quantitative variation is multifactorial and is influenced by several polymorphic genes and environmental factors as well as complicated interactions. However, most quantitative variations caused by environment influence are not inheritable. Quantitative trait loci (QTL) is a genetic location of one or more genes. Quantitative differences among individuals are contributed by specific QTL that function individually or interactively. The QTL reflect natural genetic variation as it exists in organisms. Mendel (1866) first proposed the idea that quantitative variation in a trait could be due to the action of multiple genes. NilssonEhle (1909) demonstrated this in wheat. With the development of chromosome theory and the concept of genetic linkage by the 1920, Sax (1923) proposed that the association between seed weight and seed coat color in beans was due to linkage between genes controlling color and one or more genes controlling sizes. However, it was a long time before the pioneered mapping framework was developed in the 1980 partly because of the small number of available markers (Thodv' 1961) and the lack of powerful statistical models and computation techniques. Modern biological techniques make it possible to detect the abundant variation of molecular polymorphism that segregate in most species in nature and further make QTL mapping study feasible. Examples of polymorphism are single nucleotides polymorphisms (or SNPs), short di, tri or tetranucleotide tandem repeats (microsatellites), longer tandem repeats (minisatellites), and insertion sites of transposable elements. These markers are genetically neutral and show high polymorphism with great abundance, spanning entire genomes. To detect molecular variation, many methods have been developed including traditional techniques such as restriction fragment length polymorphism (RFLP) analysis using Southern blots (Langley et at. 1982; Leips et. al. 2000) and direct sequencing (Kreitman 1983), and new developments such as highthroughput methods for both polymorphism discovery and ..n', vping (Kristensen 2001). With the development of new molecular and DNA recombination techniques in the past decades, it has made it possible to detect QTL underlying quantitative variation of certain traits of interest and to map their chromosomal locations throughout the entire genome with polymorphic markers. M nz QTL have been mapped in numerous economically and agronomically important traits in crop and animal species; for instance, tomato (Paterson et al. 1991), maize (Stuber et al. 1992), eucalyptus (Grattapaglia et al. 1996), pigs (Andersson et al. 1994; Knott et al. 1998; Walling et al. 2000), and dairy cattle (Georges et al. 1995; Spelman et al. 1996; Zhang et al. 1998). Undoubtly, QTL are also responsible for most of the genetic diversity in human disease susceptibility and severity. A1l in QTL have been mapped for human susceptibility to type I (Davies et al. 1994) and type II (Hanis et al. 1996), diabetes mellitus, schizophrenia (Brzustowicz et al. 2000), obesity (Comuzzie et al. 1997), Alzheimer's disease (ErtekinTaner et al. 2000; Myers et al. 2000), cholesterol levels (Knoblauch et al. 2000), and dyslexia (Cardon et al. 1994). Even though QTL present challenges of identification, they represent a fascinating biological phenomenon that is fundamental to our understanding of abundant variation in nature. With the completion of human, mouse, Arobidopsis and many other organisms' genome sequences, we are entering an era in which the identification of QTL can be approached both experimentally and mathematically. 1.3 Traditional QTL Mapping Approaches QTL mapping is a phenotypedriven approach to gene function. By setting up a relationship between marker genotype and phenotypic traits, one can identify different gene expressions of individuals who carry a specific genotype. There are a variety of methods for identifying QTL segregating in an experimental cross (such as a backcross or an F2 intercross). The theoretical principle for analyzing QTL dates back to Sax (1923) who first associated pattern and pigment markers with seed size in beans. If a QTL is linked to a marker locus, there will be a difference in mean values of the quantitative trait among individuals with different genotypes at the marker loci. With this idea, single marker methods, ANOVA (Soller et al. 1976) maximum likelihood with a single marker (Weller 1986, 1987; Simpson 1989) have been developed. However, these methods cannot estimate QTL positions precisely. Putative QTL genotypic means and QTL positions are also confounded. These confounding factors cause the biased estimator of QTL effects and low statistical mapping power, particularly when linkage map density is low. Statistical methodologies for mapping QTL on a highdensity linkage map of molecular markers had not been established until Lander and Botstein's (1989) pioneering work. The mapping approach is based on the simple linear model yi= pt+axi+i, i 1,2,... ,n where y, is the phenotypic measurement, p is the overall genetic mean, a is the genetic additive effect, xi is the indicator variable indicating different genotypes, and ci is the environmental error which is assumed to be normally identically distributed with mean 0 and variance a2. These authors used an EMimplemented maximum likelihood approach, proposed by Dempster et al. (1977), to map QTL on a particular chromosomal interval bracketed by two flanking markers. This socalled interval mapping method was later improved by including markers from other intervals as cofactors to control the overall genetic background effect (Jansen and Stam 1994; Zeng 1994). The model is given as K yi u+axi+ ybkXjk +C, i= 1,2,... ,n k=0 where bk is the background marker effect and Xjk indicates variables of different . in. .i pes. This approach is based on multiple linear regression of a quantitative phenotype on ., n.ir pe in segregating generations obtained from line crosses. The improved method, called composite interval mapping (CIM) by Zeng (1994), di pl,1 increased power in QTL detection because of reduced genetic background noise (i.e., residual variance). When only one QTL is considered at a time, it could bias QTL identification and estimation if indeed multiple QTL are located in the same linkage group (Jansen 1993; Zeng 1994). To deal with these problems and to further improve QTL mapping precision, Kao et al. (1999) proposed using multiple marker intervals simultaneously to map multiple QTL of epistatic interactions throughout a linkage map. The developed model, called multiple interval ini',''.: ('\I \I), has the form yj p+ ajxij +Z6jk("', )+ C, i 1,2< .,n j=1 j1k where p is the overall mean; xij is coded as 1/2 and 1/2 if the genotype of QTL Qj is QjQj or Qjqj for the backcross design, respectively, aj is the corresponding mean effect of Q ; and a',, is the epistatic effect between Qj and Qk; 6jk is an indicator variable for epistasis between Q, and Qk; and ci is the environmental error, which is assumed to follow N(0, a2). The MIM model has several advantages (Kao et al. 1999): 1) it can increase sensitivity; 2) it can estimate epistatic effects (i.e., interactions of alleles at different QTL). A stepwise model selection procedure was applied in the CIM and the MIM model to select candidate QTL. Besides these pioneering QTL mapping approaches, an upsurge of QTL mapping methodologies have been developed to consider various situations regarding different marker types (dominant or codominant), different marker spaces (sparse or dense), different experimental designs (F2/backcross or fullsib family), or different mapping populations (autogamous or allogamous) (Wu et al. 2002b; Lin et al. 2003). Statistical methods pl i, a key role in QTL mapping including model development and parameter estimation. The common statistical methods used for QTL mapping include regression analysis (Haley and Knott 1992; Xu 1995), maximum likelihood (Lander and Botstein 1989; Zeng 1994; Kao et al. 1999), and the B i,v i oi approach (Satagopan et al. 1996; Sillanpaa and Arjas 1999; Xu and Yi 2000). Many of these mapping methods have been crucial in identification of QTL responsible for variations in various complex traits important to agriculture, forestry, biomedical or biological research (Tanksley 1993; Wu et al. 2000; M ,.1: iv 2001; Mauricio 2001). However, the abovementioned QTL mapping approaches only consider a trait measured at a single time point. In nature, many traits display a dynamic developmental pattern in their life span. Genes controlling these complex processes may function during part or all of the developmental process. To further consider the dynamic gene effect over time, new statistical mapping approaches are needed. 1.4 Functional Mapping In nature, most economically and biomedically important traits (such as grain yield, milk production, tumor size, and drug response) show continuous variation, usually affected by a network of smallsized genes functioning in a coordinated manner, and also affected by environmental stimulus in a complicated way. Thus, identification of specific QTL involved had alvbi , been difficult and imprecise. The phenotypic formation of a quantitative trait is controlled by genetic and environmental factors, also regulated by a suite of developmental signals. This is evident in the fact that all traits undergo developmental ontogeny, and change with time. Kirkpatrick and Heckman (1989) called such timedependent traits infinitedimensional characters; Pletcher and Geyer (1999) called them function valued traits. Although the relationship between genetic control and development is statistically difficult to elucidate, some key difficulties were overcome by Wu and colleagues (Wu et al. 2002a, 2003, 2004a, 2004b; Ma et al. 2002). They have proposed a general statistical framework (i.e., functional in',i' or FunMap) to genomewide map specific QTL that determine the developmental pattern of a complex trait. The basic rationale of functional mapping lies in the connection between gene action or environmental effects and development by parametric or nonparametric models. Functional mapping maps dynamic QTL responsible for a biological process that is measured at a finite number of time points. In response to gradients of a continuous environmental factor (such as temperature, light intensity, or nutritional concentration), a variety of mathematical equations have been developed, either empirically or through theoretical derivations based on fundamental principles, to model the phenotypic biological processes. For example, a series of growth equations were derived to describe growth in height, size, or weight (Fig. 11; von Bertalanffy 1957) that occur whenever the anabolic S5  0 ( 4 3 2 1 2 3 4 5 6 7 8 9 10 Time Figure 11. Simple growth curve or metabolic rate exceeds the rate of catabolism. Based on fundamental principles behind biological or biochemical networks, West et al. (2001) mathematically proofed the universality of these growth equations. With mathematical functions incorporated into the QTL mapping framework, functional mapping estimates parameters that determine shapes and functions of a particular biological network, instead of directly estimating the gene effects at all possible time points. Because of such connections among these points through mathematical functions, functional mapping strikingly reduces the number of parameters to be estimated. Hence, it dip,1 increased statistical power and is advantageous because biological principles are embedded in the process of estimating QTL parameters. From a statistical perspective, functional mapping is a problem of jointly modelling meancovariance structures in longitudinal studies, an area that has recently received a considerable interest in the statistical literature (Pourahmadi 1999, 2000; Daniels and Pourahmadi 2002; Pan and Mackenzie 2003; Wu and Pourahmadi 2003). Unlike general longitudinal modelling, functional mapping integrates parameter estimation and test process in a biologically meaningful and mixturebased likelihood framework. Instead of directly estimating the gene effects at all possible time points, it estimates parameters determining shapes and functions of a particular biological network. Table 1 1 shows a typical data structure for FunMap for a backcross design in which r is the total number of measurement time points. The phenotypic traits are measured at different time points and could be modelled as a function of time. Biologically meaningful mathematical functions could be adopted in the modelling procedure. The repeated longitudinal measurement for each individual then could be described by some functions, yi mj + e i = 1,.. n and j 1,... ,J (1 1) where mj(tk) f(tk) (k = 1, ). By estimating the genotypespecific parameters contained in mj, that describe reaction norm across environmental gradients, functional mapping allows the assessment of environmental impacts on genetic variation in complex traits and testing of the different hypotheses that mediate phenotypic diversity. The results derived from functional mapping will be closer to biological realms. Table 1 1. Data structure for a backcross design in FunMap Sample Marker ._ I ,' rpes Phenotypes nf M, M. .. M1 y(1) y(2) ... y(r) 1 MIMI M11 i 1f Mk yi() yi(2) ... yi(r) 2 Mimi 1.f. f Mk Y2 ((t) Y2 (2) y2 7r) 3 MIm MIf 1n i 1, i3 y3() Y3(2) Y3 (7) n MMI M.M 1i. M ,, yi() y,(2) ... y.(r) 1.5 Programmed Cell Death Organisms consist of thousands of cell types originating from a fertilized egg cell. Starting from a single cell, cell numbers increase dramatically during development, to form the various tissues and organs. Meanwhile, parallel with cell formation, another type of cell activity is cell death (a normal process regulating an appropriate number of cells in the tissues). This controlled or programmed elimination of cells is called programmed cell death (PCD). PCD is a physiological cell suicide process that occurs during the development of most organisms (Ameisen 2002). Unlike necrosisa form of nonphysiological celldeath process that results from acute tissue injury involving cell swelling, lysis, and inflammatory leakage of cell contents, PCD is carried out in a regulated process that generally confers advantages during an organism's life cycle. In plants, PCD is essential for development and survival, for example, the senescence of various tissues (such as falling of leaves in autumn and dying of tillers after a certain time point in rice) that helps plants to redistribute nutrients and efficiently use resources during development. In animals, PCD is a cell suicide mechanism that enables metazoans (multicellular animals) to control cell number and eliminate cells that threaten the animal's survival (Elis 1991). In addition, PCD is an important mechanism in defense against pathogens in both animals and plants, and serves fundamental functions during both plant and metazoa tissue development. The advantage of PCD is obvious. However, inappropriate activation of PCD may cause or contribute to a variety of diseases, such as acquired immunodeficiency syndrome (AIDS) (LaurentCrawford et al. 1991), neurodegenerative diseases, and ischemic stroke (Raff et al. 1993). PCD is involved in multiple developmental stages of any organism (such as cell differentiation; proliferation; ..in:. and the final stage, death). The physiological mechanisms of a PCD process are complicated and are still under intensive study (Steller 1995; Pennell 1997; Jacobson et al. 1997; Vaux 1999). The 2002 Nobel Prize in Medicine was awarded to Sydney Brenner, H. Robert Horvitz, and John E. Sulston for their discoveries of the genetic regulation of programmed cell death. These scientists identified several key genes such as nuc1, ced3 and ced 4 genes that regulate organ development and programmed cell death in model organisms C. 1. 1. 1 and pointed out that corresponding genes may exist in higher species, including human being. Their seminal work established the foundation of cell suicide as a normal physiologic process. The implications are wide ranging including understanding organ development and cancer. We may be able to treat diseases by stimulating the cellular "suicide pis o;i ,'i If the abnormally dividing cells can be killed off by using certain drugs to trigger the "selfdestruct" programs, it would be an efficient way to treat cancer. Since PCD is so important in biology, it would be essential to derive new statistical models for identifying genes that control this process. As discussed above, QTL mapping based on molecular markers has proven powerful to elucidate the relationship between the genotype and ph. i, '1 v.pe in complex systems. It will be a valuable tool for understanding the genetic architecture of a complex PCD process. Because of its unique development pattern, study of PCD must be coordinated with other physiological processes, such as growth occurring in the tissue or individual (Kuriyama and Fukuda 2002). Although no specific mathematical functions are available to describe the shape of the PCD process, as a balance of celldivision and celldeath signals (Kerr and Harmon 1991), PCD can be broadly divided into two different phases: growth and death (Fig. 12). The growth phase that specifies an increasing process of cells can be described by a logistic curve. The death phase (with cell death exceeding cell birth) corresponds to a decaying process that cannot be defined by a specific curve. More naturally, the PCD process may not follow any particular behavior pattern during ontogeny. Another natural way to model this process is to treat the entire developmental process as one single curve and fit it with polynomials. This modelling strategy has great 8 7 6 a5 4 3 2Growth phase  Death phase  1 2 3 4 5 t* 6 7 8 9 10 Time Figure 12. Simple programmed cell death curve flexibility but has certain limitations when one intends to test the effect of QTL at certain developmental stages. Two different modelling strategies are detailed in Ch'! pters 3 and 4. Different modelling strategies give a clear picture of the dynamic developmental pattern of PCD. 1.6 Objectives The process of PCD is typically a complex trait, controlled by multiple genes and various developmental and environmental factors. CI'! i o ,terizing the genetic architecture of dynamic PCD trait will improve our understanding of the developmental process of organisms and the etiology of diseases due to inappropriate activation of PCD. To our knowledge, no studies have thus far examined the genetic mapping of QTL controlling PCD. In this study, I derived a statistical framework for mapping PCDspecific QTL within the context of functional mapping. Two different approaches were proposed to model the developmental PCD curve. The first approach was to model the entire developmental process nonparamrnel ,'1 l'/ based on the orthogonal Legendre polynomial (LEP). This modelling approach shows great fitting flexibility when fitting complicated curves comparing to parametric approach. The second approach incorporates a parametric function of the growth phase and a nonparametric function of the death phase into the functional mapping paradigm, which leads to a socalled semiparametric model. Thus, unlike the nonparametric approach, this model capitalizes on the biologically meaningful information for cell growth to better characterize the process of PCD. Also unlike previous parametric models (\!, et al. 2002; Wu et al. 2004a, 2004b), the semiparametric model has better flexibility, and could precisely and efficiently model the PCD dynamic process. With the LEP, one statistical issue is to select the best order. Order selection represents a statistical challenge when a model is built under the functional mapping framework. Higher orders .1.v i give a better fit, but are less efficient for parameter estimation. Lower orders are efficient but lose flexibility. We need a parsimonious model that balances flexibility and efficiency. A number of orderselecting criteria were proposed and their performance was compared. Another statistical challenge for functional mapping is to model the structure of the covariance matrix among timedependent observations. Ma et al. (2002) used the stationary firstorder autoregressive (AR(1)) model to structure the timedependent changes of variances and correlations. I proposed several covariance structure model strategies (based on the nature of PCD process) to better capture the intrinsic pattern of timedependent variation and the intraindividual correlations. CHAPTER 2 GENERAL MAPPING FRAMEWORK 2.1 Genetic Experiment Designs In QTL mapping studies, most experiments begin with two pureinbreeding lines that have different traits of interest. The two lines result from intensive inbreeding, which gives homozygous alleles at all loci. The result of crossing these two lines is the first filial (Fi) generation. All the Fi generations are heterozygous and are genetically identical. The backcross generation is produced by crossing Fi with one of its parental lines. Hence, there will be two different genotypes for a backcross generation. The current study is based on a backcross design, initiated with two contrasting ', ii, v.ous inbred lines. However, the developed models can be easily extended to other cross designs (F2 or RIL) or natural populations. This backcross population of size n has two groups of .n. i,' ,pes at a locus. A markerbased genetic linkage map was constructed, to identify QTL affecting a developmental PCD trait. Consider two flanking markers (M4,, and AM,+1) _.1 inr ,iped from a backcross plant. The recombination fraction between these two markers is denoted by r. Suppose there is a putative QTL (denoted by Q) located between these two markers (measured by the recombination fraction r, with M,, and r2 with A4M+i). The pl.1 i ..I l.pic PCD traits (such as cell counts or other quantitative measurements at tissue or organ level) are thought to be genetically controlled by this QTL. In interval mapping, we will use two flanking markers with known genotypes to infer the QTL that is hypothesized to be located between the two markers. The recombination fraction is a linkage parameter that can describe the genetic distance between two given loci, and is defined as the frequency of crossing over between two loci. Based on the segregation and transmission of genes from the parent to progeny, one can derive the conditional probabilities of an unknown QTL genotype, conditional upon the known marker genotypes, in terms of these recombination fractions. The conditional probabilities ( .) of a backcross individual i (i = 1,... n) who carries a QTL ,.1 vpe conditional upon four flanking marker ., ,. ,l vpes f. 1 ,.1 1. ,+11, [, if1.. f[. ,i,+ Tn i1,,I ,. [ +11 i , if .u1 1., [T1+i is given in Table 21. Table 21. Conditional probabilities of QTL ,. i,.' .vpes given on the flanking markers in a backcross design Marker genotype QQ Qq (1r1)(1r2) rlr2 ,+ 1r 1r ,2,. Im 2 (rl (1 r2) where ri and r2 are the recombination fractions between the QTL and marker MA. and MA,+I, respectively, and r is the recombination fraction between the two markers. Other cross designs or natural populations have different conditional probability tables. Conditional probability measures the possibility of recombination between the putative QTL and the flanking markers during meiosis. Therefore, these probabilities determine the segregation pattern of the putative QTL. My study use the Haldane map function (assuming no interference) to convert the map distance cM (centimorgan) to recombination fraction r through the function r (1 e2d/100). 2.2 Statistical Models 2.2.1 Finite Mixture Models In general, if one knows the QTL genotype, then finding a significant QTL is equivalent to doing a twosample t test for a backcross; and an ANOVA test for an F2 population in the univariate case, or Hotelling's ttest for multivariate data. However, in the real world, one can only observe the marker genotype and the phenotypic trait data. The QTL genotype is missing. This produces a statistical missing data problem how to set up the relationship of observed data with missing QTL? One of the main tasks for QTL mapping is to build statistical models that connect the phenotypic trait with the unobservable QTL data through the observed marker data. The statistical foundation of QTL mapping lies in a mixture model in which each observation y is assumed to have arisen from one of a known or unknown number of components. Assuming that there are J QTL ,. .:I ,,pes contributing to the variation of a quantitative trait, this mixture model is expressed as y ~ p(y /, ;, TI) = 7rwif(y; i, 0 ) + + jf((y; j, I), (21) where y is the response; = (Qi, 7wj)' are the mixture proportions (i.e., QTL i. i..I vpe frequencies) which are constrained to be nonnegative and J 1 =7 1; p = (1i, pj)' are the component (or QTL genotype) specific parameters, with pj being specific to component j; and TI are parameters which are common to all components. In practice, the data are only observed at a finite set of times, 1,... 7, rather than a continuum, so we have only a finite set of data for each individual i, which can be considered as a multivariate trait vector, y = ',/ (1), y ,(r)]. There are two i, ii r assumptions: (1) independence of response between subjects; and (2) multivariate normality of responses. Both assumptions make it feasible to construct the likelihood function and make corresponding inferences. The density function for each individual is expressed as /(yi; m, Q) (2 1 12 exp (yi mj)TX (y, m) (22) (27)7/21 / e/ 2  where mj = [mj (1), mj ()] is a vector of expected values for QTL ,. vi,.1 vpe j at different points. At a particular time t, the relationship between the observation and expected mean can be described by a simple linear model J j 1 where (ij is the indicator variable denoted as 1 if a QTL genotype j is considered for individual i and 0 otherwise; ei(t) is the residual error (i.e., the accumulative effect of polygenes and errors) that is iid (identically and independently distributed) normal with mean zero and variance a2 (t). The withinsubject measurement is correlated and the errors at two different time points, t1 and t2, are characterized with correlation p(ti, t2) 2.2.2 Likelihood Function In my study, each observation was a repeated measurement with a certain correlation structure. Because of the independence assumption between subjects, the likelihood function of the PCD trait, y, and the marker .' n .1 vpe, M, at the hypothesized QTL location can be expressed for the backcross as L(Qy, M) [7fTrifi(yi[) + 7Tolfo(yi[)] (24) i= 1 where 7ri and 0Toi are the mixture proportions corresponding to the frequencies of different QTL .., .1 ivpes for individual i. Because the marker genotype of each individual is known, these mixture proportions can be actually expressed as conditional probabilities (Table 21). The unknown vector f2 that defines the mixture model (Eq. 21) contains two sets of parameters, one defining the cosegregation between the QTL and markers and, thereby, the location of the QTL relative to the markers, denoted by Qfi, and the second defining the distribution of the PCD trait for each QTL genotype, denoted by 2q = (f2, f,), where fm defines the mean vector at different time points and 2, defines the covariance matrix among different time points. The MLEs of these unknown parameters is obtained through maximizing the likelihood function (Eq. 24) by using EM algorithm (Dempster et al. 1977) or other optimization packages such as Simplex (Nelder & Nead 1965). 2.2.3 Modelling the Mean Vector To estimate the parameters of the likelihood function implemented with PCD data measured at multiple time points, one can extend the traditional interval mapping approach to accommodate the multivariate nature of timedependent traits. However, this extension is limited in three aspects: (1) Individual expected means of different QTL genotypes at all time points and all elements in the matrix E need to be estimated, resulting in substantial computational difficulties when the vector and matrix dimensions are large; (2) The result from this approach may not be biologically meaningful because the underlying biological principle for developmental character is not incorporated; (3) The approach cannot be well deploy, .1 in a practical scheme because of (2). Thus, some biologically interesting questions cannot be asked and answered. Based on these concerns, a new statistical framework has been developed to detect QTL affecting growth trajectories (Wu et al. 2002a, 2004a, 2004b; Ma et al. 2002). This framework allows for the modelling of the timedependent expected means of QTL genotype j using a growth equation p (t)= g (t; Qm), (25) which is specified by a set of curve parameters array' '1 in fi. All living entities are characterized by growth defined as the irreversible increase of size with time. A number of mathematical models, such as logistic or Sshaped curves, have been proposed to describe growth trajectories. The fundamental biological aspects of mathematical modelling of growth curves have been founded by earlier mathematical biologists, e.g., von Bertalanffy (1957), and recently revisited by West et al. (2001). Built under the functional mapping framework, we extend the principle of functional mapping and developed PCD specific modelling strategies to map PCD genes. The overall form of the PCD curve of QTL genotype j is determined by the set of curve parameters contained in f2,. If different genotypes at a putative QTL have different combinations of these parameters, this implies that this QTL pl , a role in governing the differentiation of PCD trajectories. Thus, by testing for the difference of f,j among different genotypes, we can determine whether there exists a specific QTL that confers an effect on PCD curves. The timedependent PCD means for different genotypes, pj(t), can be used to estimate the dynamic changes of various genetic effects, including the additive, dominant and epistatic effects as well as the interaction between these effects and environmental factors. For example, for a backcross population containing two genotypes at a QTL, coded as 1 for QQ, 0 for Qq, the estimates of the timedependent additive effect at a particular time point t is given by a(t) = pi(t) po(t) during PCD trajectories. The timedependent epistatic and i vipe environment effects can be characterized in a similar way if a multiQTL model is assumed or if an experimental design with multiple environments is used. The similar settings can be applied to F2 population or other experimental cross designs. In this study, we propose two different v to model the developmental mean process: 1. modelling the mean curve using the orthogonal Legendre polynomial (LEP), 2. joint modelling the mean process with logistic function and LEP. Each modelling strategy has its own advantages. The first model has great flexibility in terms of capturing the intrinsic development pattern when it is hard to characterize the development process with any currently available mathematical functions. In nature, there are variety of variations for different traits. In most cases, characterizing their development pattern by some mathematical function is impossible. LEG has been applied in animal science study to fit testd v, models in dairy cattle and shows great flexibility than the regular polynomial (Pool et al. 2000a). The Legendre polynomials, as used in this study, have the benefit that: (1) the functions are orthogonal, which is useful for analyzing patterns corresponding to different ., n.' vpes; (2) higher orders were estimable when conventional polynomials failed because of better convergence (Pool et al. 2000b). However, there are certain limitations of this modelling method. The estimated parameters have no specific biological meanings and are hard to interpret. Moreover, overfitting may result in capturing too much noise which is due to environment influences. The second approach models the entire developmental process as two separate stages. The growth phase is modelled by the logistic growth function which fits a universal growth law (West et al. 2001). The death phase is described by a death function which is modelled by the LEP. The developed model is called a semiparametric model, which combines the parametric and nonparametric approach together. It is unique in that it captures the biological characteristic of the growth phase with great flexibility to describe the death phase. Moreover, this joint modelling approach allows us to do different hypotheses testing which will be described in detail in Chapter 4. 2.2.4 Modelling the Covariance Matrix Modelling a covariance matrix E for a longitudinal data set is difficult because of: 1) the possible high dimensionality of the problem, and 2) the constraint that E must be positive definite (Pourahmadi 1999). Many standard v,v to model a covariance structure are available in the published literature. For example, the simple covariance structure (SIM) specifies that observations are independent between and within subject and have homogeneous variance. SIM is easy to implement in terms of parameter estimation but shows no power to model the intrinsic correlation for repeated measurement. The unstructured covariance structure (UN) is another easy method of modelling the covariance matrix. However, it is not parsimonious to estimate all the elements in the withinsubject covariance matrix among different time points, especially when the dimension of measurements is large. The compound symmetry (CS) structure has been used in numerous longitudinal studies. CS assumes a homogeneous correlation structure within subject. It is efficient in terms of parameter estimation, but may not be flexible enough to capture the intraindividual correlation. Other structures such as the MA (Moving Average), ARIMA (AutoRegressive Integrated Moving Average), RC (Random Coefficients) or AD (Antedependence) have also been proposed as models of covariance structure (Littell et al. 2000; Zimmerman and NilnezAnt6n 2001). Unlike the nonparametric modelling of the mean curve structure, nonparametric covariance modelling has been little studied. Most authors have considered only the stationary case (e.g., Glasbey 1988; Hall et al. 1994). The nonstationary case was considered only recently (C'!, i. 1995; Diggle and Verbyla 1998). Diggle and V. i ,1 i (1998) used kernelweighted local linear regression smoothing techniques to provide a nonparametric estimator for the covariance structure without assuming stationarity. C'!. ii (1995) used kernel estimators to estimate covariance functions. Kirkirpatric (1994) applied the LEP to model the covariance structure for longitudinal data. This approach attempts to reduce the dimension of the covariance matrix by detecting the most important factors that can explain most of the information contained in the original covariance matrix. By modelling the covariance as a function of time, this approach shows its uniqueness by smoothing the covariance matrix. Modelling the covariance structure of developmental PCD traits can be accomplished either parametrically or nonparametrically. Due to the unique development pattern of the PCD trait and the statistical challenges of modelling the complicated covariance structure, only the parametric approaches were developed in the current study with the nonparametric modelling approach left for future work. We propose three modelling strategies with each one implementing under different situations. AR(1) covariance model. A common model for covariance structure is the firstorder autoregressive [AR(1)] structure (Diggle et al. 2002), expressed with a 2(1) a2) a2 for the variance, and o(t1, t2) 2plt2tlI for the covariance between any two time intervals t1 and t2, where 0 < p < 1 is the proportion parameter with which the correlation decays with time lag. The parameters that model the structure of the (co)variance matrix are a iv d in f,. Clearly, the AR(1) model assumes the homogeneity of variance. However, for longitudinal PCD traits, the homogeneity assumption is easily violated. To remove the heteroscedastic problem of the residual variance, two approaches can be used. The first approach is to model the residual variance by a parametric function of time, as originally proposed by Pletcher and Geyer (1999). But this approach requires additional parameters to characterize the agedependent change of the variance. The second approach is to embed Carroll and Rupert's (1984) transformbothsides (TBS) model into the growthincorporated finite mixture model (Wu et al. 2004b), without using any additional parameters. Both empirical analyses with real examples and computer simulations si,L. 1 that the TBSbased model can increase the precision of parameter estimation and computational efficiency. Furthermore, the TBS model preserves the original biological means of the curve parameters although statistical in &I ,. are based on transformed data. The variance is greatly stabilized by log transformations. Hence the AR(1) model can be applied to model the PCD covariance structure implemented with the TBS modelling strategy. Model performance will be evaluated through simulation study. SAD(r) covariance model. The TBSbased model has the potential to relax the assumption of variance stationarity, but the covariance stationarity issue remains unsolved. Zimmerman and NuinezAnt6n (1997) proposed a structured antedependence (SAD) model to model the agespecific change of correlation in the analysis of longitudinal traits. The antedependence (AD) model was initially proposed by Gabriel (1962) with no structure restriction. The random variables Xi, X, indexed by time, are said to be AD(r) if the conditional distribution of Xt given Xt1, Xi depends on Xt1, Xt, for all t > r (Gabriel 1962). This concept is equivalent to XI, X having a Markovian dependence of order r (Diggle et al. 1994, p. 85). The SAD model is an extension of the AD model created by putting special structure on the model. Due to the flexibility of modelling the nonstationary structure, the SAD model has been employi 1 in several studies and di pl1 many favorable properties (Zimmerman and NuinezAnt6n 2001). A detailed modelling strategy will be described in the following chapters. Meancovariance model. Correlation among observations on a given individual is a common feature. In many cases, a systematic pattern of correlation is evident, which may be characterized by a relatively simple model. Suppose that Corr(e) R(p) where the correlation matrix R(p) is a function of correlation parameter p. The choice of a suitable correlation matrix depends on the nature of the repeated measurement factor as well as the modelling requirement (for example, the positive definite requirement). When variance is constant across the response range and equal to some value a2, then Cov(e) = 2R() However, in most situations, the heterogeneity of intraindividual variance may be evident. This happens when measurements are taken over time and variance tends to fluctuate with the level of responses. For example, a PCD trait shows a unique developmental pattern as the variations at different time points keep changing as the mean changes. A more general intraindividual covariance structure needs to be specified. A natural relationship that appears to be evident for repeated measurements data is a mean and variance relationship. The variances can be modelled as functions of the means. Define a variance function v = v(Qm) and define the diagonal variance matrix as V(m) diag [v2 (m 1), U2( tk)] where Qm is a vector of parameters for the mean function; v2(Qmti), *, 2(m, tk) is the variance function related to different time points. Using the correlation matrix R(p) defined above, the covariance matrix can be specified as Cov(e) = 2V(Qm) 1/2R(p)V(Qm)1/2 This covariance structure shows a great deal of versatility in characterizing uncertainty in intraindividual response (Davidian and Giltinan 1995). It allows separate consideration of heterogeneity of variance and the pattern of correlation. If the defined correlation matrix R(p) is positive definite, then the covariance matrix is guaranteed to be positive definite. By choosing appropriated variance function v and correlation matrix R(p), the developed covariance structure can better capture the intrinsic intraindividual variation. A detailed modelling strategy for PCD traits will be discussed in C'! Ipter 4. 2.2.5 Computational Algorithms EM algorithm. We implemented the EM algorithm, originally proposed by Dempster et al. (1977), to obtain the maximum likelihood estimates (MLEs) of three groups of unknown parameters in a QTL mapping model. That is, the QTLsegregating parameters (Qf) that specify the cosegregation patterns of QTL and markers in the population, the curve parameters (f m) that model the mean vector for .., .n vipe j, and the parameters (f ) that model the structure of the covariance matrix. These unknowns, denoted collectively as 2 = (fQs, f,, Qf), correspond to mixture proportions, componentspecific mean parameters and common (co)variance parameters, respectively, in the mixture model described by Equation (21). The first derivative of the loglikelihood function with respect to specific parameters contained in f2 is given by 9O, j1 (yi ) 2 log fj(Yii ) i= 1 j= 1 Yj= 17 fyilQ) 0ki n 2 0 S II log fj(yi l ) i= 1 j=1 where we define .fj(yil ) Ei Z 1T ,fj(Yil) In the Estep, the conditional probabilities 7 (treated as priors) of the QTL ,.1_i vipes given the marker genotypes and the normal distribution function are used to calculate ILi which could be thought of as a posterior probability with the ith individual carrying the jth QTL genotype. Conditional on HII = {j}, we solve for the zeros of a log (fQ) to get the estimates of Q (the M step). Briefly, given initial values for f2, we calculate the posterior probability of H (Estep), then condition on IIj we update the parameter estimates contained in f2 (\ !I' p). The process is iterated for many times until the specified convergence criteria is satisfied. The values at convergence are treated as the MLEs. The MLEs of the mean and covariance parameters are strongly consistent under the equal covariance matrix assumption (Redner 1981). A detailed derivation of the MLEs for the nonparametric model is given in Appendix A. EMSimplex algorithm. When there is no closed form for some of the parameters, one can use an integrated EMSimplex algorithm. The NelderMead simplex algorithm, originally proposed by Nelder and Mead (Nelder and Mead 1965), can be used to estimate the mean and residual (co)variances parameters (Zhao et al. 2004). The simplex algorithm is a direct search method for nonlinear unconstrained optimization. It attempts to minimize a scalarvalued nonlinear function using only function values, without any explicit or implicit derivative information. The algorithm uses linear adjustment of the parameters until some convergence criteria are met. The term will! ::" arises because the feasible solutions for the parameters may be represented by a polytope figure called a 1iplI!. ::". The simplex is a line in one dimension, triangle in two dimensions and tetrahedron in three dimensions, respectively. For the joint model described in C'! Ilpter 4, EMSimplex algorithm will be used to estimate the mean and covariance parameters. In practical computations, the QTL position parameter can be viewed as a fixed parameter because a putative QTL can be searched at every 1 or 2 cM on a linkage map interval bracketed by two markers throughout the entire genome. The amount of support for a QTL at a particular map position is often di ph i', '1 graphically through the use of likelihood ratio maps or profiles, which plot the likelihood ratio test statistic as a function of map position (or test position) of the putative QTL. This plot is called a likelihood ratio (LR) profile plot. The peak of the profile that passes certain significant threshold corresponds to the position of the QTL over the genome. 2.2.6 Hypothesis Testing One of the most important biological advantages for functional mapping is that it provides a quantitative and testable framework for understanding the relationship between gene action and various developmental events (Wu et al. 2004a). For example, when does the QTL turn on or off to determine the development process during time course? Under the current settings, our model allows for a number of hypothesis tests to examine the genetic control of PCD processes. Testing whether there exist specific QTL to affect the PCD process is a first step toward the understanding of the detailed genetic architecture of complex phenotypes. After the MLEs of the parameters are obtained, the existence of a QTL affecting an overall PCD curve should be tested by formulating two alternative hypotheses Ho: j nm t'= 1,, J (26) H, : at least one of the qualities above does not hold where Ho corresponds to the reduced model, in which the data can be fit by a single PCD curve, and H1 corresponds to the full model, in which there exist different PCD curves to fit the data. The test statistic for testing the hypotheses in Equation (26) is calculated as the loglikelihood ratio (LR) of the reduced to the full model LR 2[logL(Qy, M) logL(f2y, M)], (27) where f and 2 denote the MLEs of the unknown parameters under Ho and H1, respectively. By searching the entire genome by every 1 or 2 cM, the hypothesis test is conducted at each test position. A LR profile plot is generated across the entire genome and significant QTL is declared at a particular position if the peak at that position passes certain significant threshold by a permutation test (C'li i !i!! and Doerge 1994). Other hypothesis tests can be about the control of a QTL over growth or death at specific time points or time intervals, the timing at which an organism starts to kill itself, or the timing of other developmental features of interest, whether the same QTL pleiotropically affects the growth and death phases, how the QTL mediate the transition point of the two phases. All these tests are helpful to address important biological questions related to the genetic architecture of PCD traits. 2.2.7 Asymptotic Sampling Error After the point estimates of parameters are obtained by the specified algorithm, we derive the .,vmptotic variancecovariance matrix and evaluate the sampling errors of the estimates ( ,, Qv). The techniques for doing so involve calculation of the incompletedata information matrix which is the negative secondorder derivative of the incompletedata loglikelihood. The incompletedata information can be calculated by extracting the information for the missing data from the information for the complete data (Louis 1982). A different socalled supplemented EM algorithm or SEM algorithm was proposed by Meng and Rubin (1991) to estimate the .,, iii.Ilic variancecovariance matrices, which can also be used for the calculations of the sampling errors for the MLEs of the parameters (Q ). Another simple approach for estimating the .. i,,_ild. ic variancecovariance matrix is to use numerical derivatives (Press et al. 1992). The partial derivative of a given function with respect to some parameters x and y is given by a2f [f(x +hi,y+ h2) f(x + hi,y h2) [f(x hi,y + h2) f(x hy h2) OxOy 4h1h2 If x and y are the same parameters, the above formula simplifies to 2f f(x + h) 2f(x h) + f(x) x.2 h2 here f is the loglikelihood function; x and y are any parameters of interest in the current model; h is a gradient parameter which is chosen to make L close to 0. Taking the inverse of the negative values of the partial derivative matrix, one can obtain an estimate of the ,il, l ic sample (co)variance matrix. The diagonal elements are the , I ~ i ..ic sample variance for the estimated parameters. CHAPTER 3 A NONPARAMETRIC APPROACH FOR FUNCTIONAL MAPPING OF PROGRAMMED CELL DEATH 3.1 Introduction Functional mapping has been utilized as a powerful statistical method for detecting quantitative trait loci (QTL) that affect complex traits undergoing developmental changes (\! I et al. 2002; Wu et al. 2003). By incorporating various mathematical functions into the mapping framework, it has a variety of flexibility in mapping genes that underlie complex traits. It was originally developed for mapping the /;;;1,,/iK.. effects of QTL on growth trajectories. However, its usefulness is not restricted to only model growth curves. It has been extended to study many genetic mechanisms of biological or biomedical processes including allometric scaling, tumor progression, HIV dynamics and drug response (Wu et al. 2002b; Wang et al. 2004; Liu et al. 2005; Lin et al. 2005). Simulation and real data a analysis from these studies indicate that functional mapping has high power to detect lt;;.;:,i'.. QTL effects. However, in real life, many biological characteristics showing a specific development pattern cannot be explained by any currently available parametric mathematical functions or equations, for instance, programmed cell death (PCD). To better understanding the /;;;.,i/K.. gene effect underlying complex longitudinal PCD traits, new statistical models need to be developed. PCD, a physiological cell suicide process, occurs during the development of most organisms and is typically a complex longitudinal trait (Ameisen 2002). It is involved in multiple developmental stages of organism: cell differentiation, proliferation, aging and dying and acts as a defense mechanism against disease or virus attacks, balance organism's metabolism (Elis 1991; Pennell 1997). The identification of genes that contribute to the growth of cells has been one of the most important and difficult tasks for PCD study. With developed functional mapping and abundant molecular markers, the detection of specific genes affecting PCD is now possible. There are two i, i i" issues in functional mapping: (1) how to fit the mean curve to better capture the dynamic developmental pattern? (2) how to model the variation at different time points and intraindividual correlation? These two issues have to be addressed in order to dissect the genetic effects of ni i' QTL. There have been numerous studies on the developmental growth curve either focusing on modelling the mean process or the covariance structure. Historically, there are several classical approaches for modelling growth curve data such as univariate ANOVA, repeated measures ANOVA, and multivariate regression modelling approaches. However, unlike the simple growth curve, the developmental PCD trait represents a dynamic process and dip',v a different pattern than typical growth curves. The twostage unique developmental characteristic presents statistical challenges on modelling the mean process, especially when the goal is to detect significant QTL. Traditional parametric approaches, such as parametric regression, require specific quantitative information about the form of the regression. Hence, they do not have enough flexibility to capture the nature of this process and have certain limitations on fitting the developmental curve. Additionally, the classical analysisofvariance methods ignore the covariance structure and the classical multivariate approach estimates the covariance matrix but imposes no structure on it. Therefore, statistically speaking, they are all inefficient. Nonparametric approaches for modelling growth mean curves have been developed, for example, nonparametric regression. Nonparametric regression using kernel estimators has flourished in the literature (Hart and Wehrly 1986; Miller 1988; Altman 1990; Fraiman and Meloche 1994; Altman and Casella 1995; Boularan et al. 1995; Ferreira et al. 1997). These methods treat time as the only explanatory variable and estimate the mean response curve by smoothing the raw data. In genetic studies, the random regression (RR) model is another popular model used for modelling the dynamic additive effect. The Legendre polynomial (LEP) has been widely used in RR to fit the developmental curve partly due to the nature of the flexibility of the orthogonal polynomial (Pool et al. 2000a). Unlike growth curve data, the longitudinal PCD traits not only involve a growth phase, but also are characterized by a death phase for certain characteristics such as volume, number and height. The complex characteristic of PCD traits faces a number of statistical challenges if applying traditional methods, especially when models are built under the mixturebased likelihood framework. Here, we develop a model using the LEP with a specific order to fit the PCD curve and incorporate two covariance structures into the modelling framework for mapping QTL underlying the complex developmental PCD traits. Order selection using informationtheoretic criteria is provided. The consistency of the selection criteria is proved and the selection power is evaluated based on simulation studies. Detailed parameter estimation procedures ware given. Extensive Monte Carlo simulation studies and a real world example using rice tiller number data are given to demonstrate the usefulness of the development model. 3.2 The Finite Mixture Model and Likelihood Function The statistical foundation of the current study is based on a finite mixture model described in C!I Ipter 2 in Equation (21). The mixture proportions are corresponding to QTL frequencies conditional on the flanking markers. Consider a standard backcross design, initiated with two contrasting 1'.. ii v.ous inbred lines. This backcross progeny contains n individuals, in each of which there are two groups of, i.i vi.1pes at a locus. Before conducting the statistical mapping analysis, two sets of data are collected. A genetic linkage map is constructed with molecular markers. The longitudinal PCD trait such as cell counts or other measurements at tissue or organ levels are observed at a finite set of time points for each individual. These longitudinal traits are continuous in nature and can be modelled by some parametric or nonparametric mathematical functions or equations. Suppose there is a putative segregating QTL with alleles Q and q that affects PCD trait, but with different degrees. The two flanking markers form 4 groups of joint marker genotypes. The observed data then can be grouped according to these 4 marker genotypes as Yk, k = 1, 4, in each of which there are nk observations. The independent and multivariate normality assumptions lead to the following likelihood function i= 1 4 n 1= fi(yki ) + fo(yk,i) k= i 1 where 7k refers to the mixture proportion (or QTL frequency) corresponding to each marker group. The unknown parameters in Q2 contains the mean curve and covariance parameters which describe the shape of the mean process and covariance structure. Statistical modelling of the mean and covariance processes is discussed in the following sections. 3.3 Modelling the Mean PCD Process 3.3.1 Legendre Polynomial Based on the properties of LEP mentioned in introduction, nonparametric regression using LEP to characterized the mean process was proposed in this chapter. The orthogonal LEPs are solutions to a differential equation, the Legendre equation (1 x2) 2x + r(r + 1)z 0. ( dx2 dx The polynomials of order r is denoted by P, (x). The polynomials are either even or odd functions of x for even or odd orders r. The general form of a LEP of order k is given by the sum Pr(x) ( ()k (2r 2r2k! (3 k 2rk!(r k)!(r 2' ( )k)! where K r/2 or (r1)/2 whichever is an integer. This polynomial is defined over the interval [1, 1]. Solving Equation (31), it is easy to show that the first few polynomials are Po(x) 1 Pi(x) x P2(x) (3X 12 ) 2 P3(x) = 1(5 3x) 2 P4(x) = (35X4 30 +3) 8 1 P(x) = (63x 70x3 + 15x) 8 P6(x) (231x6 315?4 + 052os 5) By choosing different orders of orthogonal polynomials, the Legendre function has the potential to approximate the functional relationships between trait values and different time points to any specified degree of precision. For the current model, the independent variable x is expressed as time t, which is adjusted to match the measurement times to the orthogonal function range [1, 1], by 2+ 2(t min) t= 1 + tmax tmin where tmin and tmax are the first and last time points respectively. With an appropriate order r, the timedependent genotypic values for different QTL ii. n.1 ipes can be fitted by the orthogonal LEP. A family of such polynomials is denoted by p(t')= [Po ),p (t'), ,Pr(t1)] and a vector of genotypicrelated, timeindependent values with order r is denoted by Uj ( ,, Uj,' ,Ujr). This i* i.I perelated vector is called the base ,.. '1,' ;'i.:+ vector and the parameters within the vector are called the base /,. u.', */ i,' means. A vector of polynomials P(t') is also called the base function. The timedependent genotypic values at time t, mDj(t), can be described as a linear combination of uj weighted by series of the polynomials, i.e. m j(t)= u'P(t') (32) Thus, for individual i whose QTL i. I vpe is j, its 1J. I pe means at different time points can be modelled by the following vector mjj i = [mTji (1), ,mTjli(rT)] (33) This modelling approach has great flexibility in modelling curves in which the logistic or other parametric models does not fit. By choosing appropriate order, the model can better capture the intrinsic developmental PCD trend. Order selection for LEP. One of the in i. jr concerns using the LEP is to choose the optimal order. If the degree of the LEP is k (the highest power of the polynomial), then the order of LEP will be k + 1 (the number of the coefficients defining the polynomial which is more than its degree). The order selection problem is similar to select variables in a regression sense. We are actually selecting the coefficients of the LEP, i.e., the base genotypic vectors. Normally, a higher order ahv gives a better fit. However, if the model contains too many parameters, it will greatly reduce the model efficiency and reduce the mapping power. On the other hand, if we fit the curve with lower order, a model that contains too few parameters will not be flexible enough to approximate important features in the data. This will give a bias contribution to the misfit due to lack of flexibility. In both cases, the use of poor or redundant orders can be harmful. There should be a tradeoff between the number of LEP orders and model efficiency. It is essential to select the iight" order for LEP without over or underfitting the PCD curve. One choice is to conduct an order selection using the informationtheoretic criteria. 3.3.2 Informationtheoretic Criteria Statistical model selection has been studied for a long time with numerous topics presented in this area. An important class of selection criteria is the informationtheoretic criteria. The basic principle of information theoretic model selection criteria is to select statistical models that better fit the data and also simplify the model. Specifically, informationtheoretic methods focus on minimizing the amount of information required for the model to explain the data. Consequently, it results in the selection of models that are the most parsimonious or efficient representations of observed data. In general, there are two types of model selection criteria. One type is that a model is evaluated based on its ability to describe not only the present data but also future or unseen observations from the same process. Examples of this approach are AIC or crossvalidation, which are aimed to selecting models that yield the best prediction. Another type of approach is the likelihood based approach which selects the model that is most likely to generate the present observations, without accounting for future observations. Example of this approach includes the classical BIC. Generally speaking, most informationtheoretic criteria can be considered as special cases of minimum complexity density estimators proposed by Barron and Cover (1991), which has the form I(Ylh(O)) + I(h(O)), where the first term represents the amount of information I required to express the data Y, given the interested model h with parameters 0; the second term represents the amount of information required to express the model itself. For an information criterion, the first term is equivalent to the negative loglikelihood of the data, and the second term is thought of as a penalty for model complexity. A general form of the basic information criterion to select the LEP order is given by IC,(n) 21nL(m,  Y,r) + c(n)pr(T) where the first term is the negative maximum loglikelihood of the data Y given the model parameter estimates; Qm and Q, represent the MLE of mean and covariance parameters; pr(r) represents the number of free parameters which is only dependent on the number of measurement time with order r, so pr(') = dimension(Qm, Q r); c(n) is a penalty term. The goal of informationtheoretic model selection is to select parsimonious models. Hence, models that minimize the criterion are selected. Various criteria have been proposed based on this general form and they are different in c(n) terms. Among a variety of informationtheoretic criteria proposed so far, the following several criteria will be discussed. Akaike's information criterion (AIC). One of the popular informationtheoretic model selection criteria is the AIC information criterion (Akaike 1974). Theoretically, this criterion derives from consideration of the KullbackLeibler distance between a given model and the true model. Specifically, minimizing AIC is approximately equivalent to minimizing the expected KullbackLeibler distance between the true and the estimated density. The AIC value with a particular order r is calculated by AIC 21nL(Qm, r) + 2p,(r),( (34) Since AIC serves as a consistent estimator of the KullbackLeibler discrepancy between the distribution that generated the data and the model that approximates it (Takeuchi 1976), it can be thought of as measuring the relative inefficiency of approximating the true model by the model of interest. Models with smaller values of AIC can thus be thought of as more efficiently approximating the true model, where the true model is unknown. By keeping increasing the LEP order, the selection method pick up the optimal order r if the AIC value is no longer increasing. Corrected AIC (CAIC). Since AIC is only an approximate estimate of the relative KullbackLeibler distance and has been shown to be biased in small samples, a corrected version of AIC, called CAIC, was developed to adjust for this bias (Sugiura 1978; Hurvich and Tsai 1989). Later on, Burnham and Anderson (1998) proposed a multivariate generalization of CAIC which is defined as CAIC 21nL(m,,r) +p+ p(p+v) (35) where n is the sample size, p is the total number of parameters in the mean model, and v is the number of parameters used to estimate the covariance matrix of the distribution. B i, i i information criterion (BIC). BIC is another selection criterion from a B ,i, i ,i point of view. It is usually explained in terms of B ,i, i ,i theory, especially as an estimate of the B v.i factorthe ratio of the posterior to the prior odds in comparisons of a model to a saturated model (Schwarz 1978; Raftery 1993). BIC is defined as BIC 21nL(Qm, v r) + p,()ln(n). (36) where all the parameters are defined similarly as above. Since our modelling objective is to select parsimonious models to best describe the dynamic development process, the BIC information criterion assigns more penalty to the likelihood function and hence it tends to choose more parsimonious models. Comparisons of the relative merits of AIC and BIC based on .i~,e I ic consistency (as n  oo) have flourished in the literature. Investigations have generally demonstrated that BIC is dimension consistent, i.e., tends to choose the true model with probability equal to one in large samples but performs poorly in small samples (Hurvich and Tsai 1990; Bickel and Zhang 1992) and AIC is consistent if the dimensionality of the true model increases with n (at an appropriate rate) (Shibata 1981). Sakamoto et al.(1986) also noted that the model selected by AIC criterion is not actually the true model but the one for the best model fit. To evaluate the model selection criteria, we need to show that the proposed criteria are consistent. 3.3.3 Consistency of Model Selection Criteria The consistency of model selection criteria has been shown in the literature depending on different selection criteria proposed and different models under study (Guyon and Yao 1999; Wood et al. 2002; Huang and Yang 2004). The consistency of a selection criterion is defined as the situation where the probability of the criterion choosing the correct model is 1 as the sample size increases to infinity. Hence, a model selection criterion is consistent if P{C!i... the better model m2} P{ICk (n) < ICk2(n)} 1,as n i oo where we assume that m2 is the correct model; ICk (n) and ICk (n) are the information criteria for model m1 and m2 respectively with LEP order k, and k2. To select the optimal LEP order, we have the following two assumptions, 1. The same LEP order for different genotypes, 2. The number of mean parameters are only dependent on the number of measurement time points 7 which leads to finite parameter space for pr (). Stopping rule. We propos a forward selection by increasing one order at a time until the information criterion ICk(n) does not decrease any more. The order which gives the minimum ICk(n) will be the optimal order. The following theorem shows the consistency of the order selection crtieria. Theorem 1. Under the above two assumptions, with the defined stopping rule, a model selection criterion ICk(n), is consistent if c(n)[p (7) pso (T)]  0, as n  o0 where So = {j, ,jk, k < oc} and s ={j.i ,i, I / k, I < oc00} are the optimal and selected index set of LEP order respectively. Proof. See the Appendix C for a rigorous proof. Therefore, the proposed BIC, CAIC and AIC information criterion are all consistent. 3.3.4 Selection Power The usual model selection error rate (a) is defined as the probability of selecting a wrong order given the data is simulated with the true order. If a  0 as n  oc for certain model selection criterion, then this selection criterion is consistent. By subtracting a from 1, one gets the selection powerthe probability to select the true model. Error rate has been used as an important criterion of evaluation of the model selection criteria in many publications (Bozdogan 1987; Hurvich and Tsai 1990; Shao 1993, 1996; Zheng and Lou 1995). However, it is not sufficient by itself to show which model selection criterion is better than others, partly due to the sample size problem. In this study, we will use it as an evaluation for choosing different selection criteria. 3.3.5 LEP Order Selection Under the current study setting, we considered two situations to select the optimal order for the LEPs to fit the developmental PCD curve, i.e., selection under the null and alternative hypothesis respectively. Evaluation of different cases by simulation study can provide us a general selection guideline. Order selection under the alternative hypothesis. Under the alternative hypothesis that there is one gene controlling the developmental process, there are 2 or 3 mean curves corresponding to two genotypes for a backcross design and three genotypes for a F2 design respectively. AIC, CAIC and BIC will be applied to select the optimal order based on the likelihood information. Order selection under the alternative is computationally timeconsuming because one needs to select the correct order at every test position. The first assumption with equal order for different genotype is mandatory under the current settings because of the hypothesis testing issue. In the real world, this assumption might be violated since there might be different developmental patterns for individuals carrying different genotypes and hence may follow different developmental patterns or curves. Considering the computation burden, for simplicity we will not consider this case. Selecting orders for different genotypes will be considered in future work. Order selection under the null hypothesis. Under the null hypothesis that there is no QTL controlling the developmental process, there is only one mean curve. With this information, one can select the LEP order under the null with different criteria, specifically, AIC, CAIC and BIC. Computationally, it is very efficient to select the order under the null because the selection procedure only performs once. Before selecting the order under the alternative, the order selected under the null can provide a piece of information about the order under the alternative because it is highly possible that the mean process for different genotypes maintains the same order under the null and alternative. 3.4 Modelling the Covariance Structure To better understanding the QTL underlying the developmental PCD process, covariance structure modelling is a very important step. Dissection of the intraindividual correlation will help us to understand how QTL mediate the developmental pattern. Two covariance structures were proposed for modelling the covariance matrix under the current design. 3.4.1 AR(1) Covariance Model The firstorder autoregressive [AR(1)] model (Diggle et al. 2002) can be utilized to model the structure of the covariance matrix. The detailed description of the AR(1) covariance structure is given in C'!i pter 2. Some of the properties of AR(1) model is given in Appendix A. TBS modelling strategy was applied (Wu et.al. 2004b) to remove the heteroscedasticity problem of the residual variance. 3.4.2 SAD(r) Covariance Model To model the nonstationarity of the covariance structure, the SAD model is more appropriate. The SAD model with order r for modelling the error term in Equation (23) is given by e(t}) = fie(t, 1) + ... + 4e(t r) + eC (37) where ci is the 'iii ', ii, ,1" term assumed to be independent N(0, it2) random variables. For the first order SAD model, if a constant innovation variance a2 is assumed, analytical forms for variance and correlation functions can be obtained. For any time t > k, the covariance and correlation function of a SAD(1) model is given by (Jaffrdzic et al. 2003) 1 (2t2 7(t) 2 (38) t 1 2 t O (t, k) ,k 1 (39) 1 44 From the above function, it is clear that even for the simplest SAD(1) model, the correlation function is nonstationary. Using matrix notation, the error term in Equation (37) can be expressed as e A= A where e = [e(1), .. e(r)]', [(1), ..., ( )]T and e variancecovariance matrix of te PCD data is then expressed as The variancecovariance matrix of the PCD data is then expressed as E AEAT, (310) where E, is the innovation variancecovariance matrix and is expressed as XE The closed forms for the inverse and determinant of matrix E help to estimate the parameters, (1, 0r, a2) (see Appendix A). For model simplicity and computation consideration, we only consider the SAD(1) model. 3.5 Computation Algorithm and Hypothesis Testing An EM based algorithm can be applied to estimate the parameters as described in Appendix A. After the MLEs of the parameters are obtained, the existence of a QTL affecting a PCD curve should be tested by formulating the following hypotheses Ho : Omii Qm Hi: at least one of the qualities above does not hold, where Ho corresponds to the reduced model, in which the data can be fit by a single curve, and H1 corresponds to the full model, in which there exist different curves to fit the data. The test statistic for testing the hypotheses is calculated as the loglikelihood (LR) ratio of the reduced to the full model LR 2[log L(2y, M) log L(Qy, M)] where f and 2 denote the MLEs of the unknown parameters under Ho and H1, respectively. Permutation tests (C'i, !11!! and Doerge 1994) are used to assess the statistical significance. 3.6 Calculating the Heritability Level Heritability measures the amount of phenotypic variability explained by genetic variation. The higher the heritability, the higher the variation of certain traits explained by the detected QTL. It is easy to calculate the heritability level (H2) when traits are measured at a single time point. But for longitudinal traits, heritability calculations are different. We proposed two v to calculate H2, 1. Calculate H2 at a single time point t where the traits shows the highest variation. For a backcross design, the genetic variation is given by a2 (pui(t) po(t))2, and H2 =.2 where 72 is the residual variance. If we know H2 or J2 based on a2 we can easily calculate another one. 2. The second approach we proposed is by calculating the area under the curve (AUC). Functional mapping maps the dynamic gene effect over time. It should be more accurate to measure the genetic variation through the entire measurement stage instead of based on a point estimation. We proposed to calculate the genetic variation by a 2= (AUC, AUCo)2, where AUCI and AUCo are the AUC for two different genotypes. The AUC is calculated by 1 AUCJ J 1u'P,P(t')dt' where P,(t') is a vector of LEP with order r; uj is the base genotypic mean parameters and t' is the adjusted time point. 3.7 LEP Order Selection Simulation To test the performance of different order selection criteria with moderate sample size, we did a simulation study. To demonstrate the selection power and for simplicity, we simulate the true order with 6th order LEP. A putative QTL is simulated under different sample sizes (n=100, 200) and different heritability levels (H2 0.1, 0.4). The total measurement time points are fixed at 9 points to match the rice tiller number data (described later). The phenotypic data are simulated under different covariance structure designs (AR(1), SAD(l)). The various designs allow us to check the selection performance under different situations. The selection power is evaluated as the percentage of simulated dataset in which corrected model was selected. By subtracting the power from 1, the selection error rate can be easily obtained. There are two nii, r reasons for the current simulation study to select order under the null hypothesis, 1. Generally speaking, it is more informative to select order under the alternative for different genotypes. For the hypothesis testing concern, the parameters under the null has to be nested under the alternative (i.e., different genotypes under the alternative should have the same number of parameters as the null has). If simulation study supports this hypothesis, then no information will be lost to put the same order for the null distribution as selected under the alternative. 2. Computationally, it is very intensive to select order under the alternative because one has to select the order for each test position. If the simulation study supports the hypothesis that order selected under the null is the same as it is selected under the alternative, one can just select order under the null to same computation time. Table 31. LEP order selection power comparison using different information criteria with AR(1) covariance structure H2 0.1 H2 0.4 Information criteria n 100 n 200 n 100 n 200 Order selection under the null AICo 0.99 1.00 1.00 1.00 CAICo 0.99 1.00 1.00 1.00 BICo 0.83 0.97 1.00 1.00 Order selection under the alternative AICI 0.89 0.92 0.80 0.83 CAICI 0.90 0.92 0.80 0.84 BICI 0.88 1.00 1.00 1.00 Note: 0 and 1 refers to the criteria under the null and alternative hypothesis respectively. Simulation result. Comparisons of the different model selection criteria under different covariance structures are summarized in Tables 31 and 32. Each table gives the percent of datasets in which the correct model was chosen, for each combination of sample size, heritability level, and selection criteria. Despite differences in performance among model selection criteria, trends holding across different criteria were evident in the simulation results. First, as might be expected, the overall power of various model selection criteria to select the true model generally increased with sample size. Second, a distinct pattern of performance among the model selection criteria is that AIC and CAIC generally performing similar to one another. However, there was a slight but persistent tendency for CAIC to outperform AIC in many conditions. When the covariance structure is modelled with the AR(1) structure, BIC uniformly outperformed AIC and CAIC with heritability level 0.4. However, it performs poorly when the heritability level is 0.1. We know that the higher heritability level corresponds to lower residual variance. This piece of information i.. 1I using different criteria depending on the residual variance. A similar pattern is also shown with the SAD(1) covariance structure. Overall, to select the optimal order under the alternative hypothesis, one can choose CAIC when the heritability level is lower and choose BIC criterion when the heritability is high. Table 32. Power comparison for LEP order selection using different information criteria with SAD(1) covariance structure H2 0.1 H2 0.4 Information criteria n 100 n 200 n 100 n 200 Order selection under the null AICo 0.81 0.96 0.97 0.98 CAICo 0.85 0.98 0.99 1.00 BICo 0.55 0.89 1.00 1.00 Order selection under the alternative AICI 0.85 0.85 0.90 0.91 CAICI 0.86 0.86 0.95 0.90 BICI 0.65 0.93 1.00 1.00 Note: 0 and 1 refers to the criteria under the null and alternative hypothesis respectively. Another interesting finding is that most criteria perform better under the null than under the alternative. This means one can select the order under the null and put the same order for different genotypes under the alternative. However, in practice, real world data might not exactly satisfy the multivariate normality assumption (for instance, skewed normal). We still r . I selecting order under the alternative for different genotypes if computationally it is feasible. If the order under the null tends to maintain the same order as the alternative does, then the same order will be assigned to the null due to hypothesis testing concerns. 3.8 Model Assessment by Simulation With the appropriate LEP order selected, we would like to check the model performance. A number of simulation studies are done with different objectives under different study designs. Consider a backcross population with which a 100cM long linkage group composed of 6 equidistant markers is constructed. A QTL that affects the PCD process is located at 48 cM from the first marker on the linkage group. The Haldane map function was used to convert the map distance into the recombination fraction. To test the model performance, we simulate data with different specifications, namely different heritability levels (H2 0.1 vs 0.4) and different sample sizes (n 100 vs 200). Each backcross progeny is measured at 9 equally spaced time points. The simulation was based on two different patterns of covariance structures, AR(1) and SAD(1). Table 33. The MLEs of the QTL position and the model parameters derived from 100 simulation replicates with the AR(1) covariance structure. The squared roots of the mean square errors of the MLEs are given in parentheses True parameter QTL position A 48 Mean parameters for QQ U20 = 9.0491 U21 1.1507 U22 6.0197 U23 2.6514 U24 0.6517 u25 0.7969 U26 0.6205 Mean parameters for Qq uoo = 7.1476 Uol 1.3788 o02 4.4886 U03 2.0042 U04 0.6619 u05 0.8356 o06 0.4321 Covariance parameters CT 2. 0.1436 cr.4 2 0.0239 H2 n 100 46.38(3.10) 9.4006(0.56) 1.1258(0.28) 6.2381(0.47) 2.8064(0.34) 0.6941(0.23) 0.8804(0.19) 0.6430(0.21) 6.9422(0.38) 1.3676(0.21) 4.2814(0.37) 1.8882(0.27) 0.6899(0.17) 0.8563(0.14) 0.4542(0.15) 0.1199(0.02) p = 0.87 0.8206(0.05) 0.1 n 200 46.32(2.62) 9.4141(0.47) 1.1093(0.21) 6.2482(0.36) 2.8118(0.28) 0.6584(0.16) 0.8745(0.17) 0.6704(0.17) 6.9685(0.32) 1.3862(0.13) 4.3142(0.30) 1.8685(0.21) 0.7067(0.13) 0.8568(0.09) 0.4433(0.11) 0.1188(0.03) 0.8202(0.05) H2 n 100 47.28(2.59) 9.0932(0.17) 1.1565(0.11) 6.0489(0.16) 2.6786(0.13) 0.6684(0.09) 0.8129(0.07) 0.6163(0.09) 7.1406(0.11) 1.3755(0.08) 4.4751(0.12) 2.0106(0.09) 0.6583(0.07) 0.8457(0.06) 0.4421(0.06) 0.0224(0.003) 0.8573(0.02) The location of the simulated QTL is described by the map distances (in first marker of the linkage group (100 cM long). 0.4 n 200 47.36(1.83) 9.0872(0.12) 1.1577(0.07) 6.0348(0.12) 2.6710(0.09) 0.6463(0.06) 0.8082(0.05) 0.6303(0.06) 7.1623(0.11) 1.3835(0.05) 4.4978(0.10) 2.0063(0.06) 0.6639(0.05) 0.8458(0.03) 0.4402(0.04) 0.0223(0.002) 0.8568(0.02) cM) from the Results: Simulation results are summarized in Tables 33 and 34. The precision of parameter estimation is evaluated in terms of the square root of the mean squared errors (SMSE) of the MLEs. In general, the model can provide reasonable estimates of the QTL positions (A) and effects of various kind, Table 34. The MLEs of the model parameters and the QTL position derived from 100 simulation replicates with the SAD(1) covariance structure. The squared roots of the mean square errors of the MLEs are given in parentheses True parameter QTL position A = 48 Mean parameters for QQ U20 = 9.0491 U21 1.1507 1122 6.0197 1123 2.6514 U24 0.6517 u2s 0.7969 U26 0.6205 Mean parameters for Qq Uoo = 7.1476 U101 1.3788 o02 4.4886 uo3 2.0042 04 0.6619 05 0.8356 U06 0.4321 Covariance parameters Co2.1 5.0654 a.4 2 0.8442 0O.Q4 H2 n=100 47.98(4.62) 9.3293(0.59) 1.3365(0.45) 5.6839(0.49) 2.5921(0.29) 0.7744(0.27) 0.9747(0.27) 0.6198(0.25) 7.3446(0.49) 1.5202(0.40) 3.9291(0.67) 1.8615(0.34) 0.6952(0.23) 0.9042(0.24) 0.4103(0.24) 4.7322(0.41)  0.95 0.9150(0.04) 0.1 n 200 47.02(3.47) 9.3234(0.45) 1.3114(0.32) 5.6770(0.44) 2.5802(0.23) 0.7527(0.20) 0.9498(0.23) 0.6355(0.19) 7.3932(0.43) 1.5595(0.32) 3.9811(0.59) 1.8429(0.25) 0.6996(0.17) 0.9284(0.16) 0.4034(0.16) 4.7546(0.35) 0.9147(0.04) H2 n 100 46.38(3.49) 9.0803(0.22) 1.1646(0.18) 6.0056(0.13) 2.6576(0.11) 0.6718(0.10) 0.8203(0.08) 0.6206(0.10) 7.139(0.19) 1.3699(0.17) 4.4672(0.15) 2.0061(0.12) 0.6791(0.10) 0.8618(0.09) 0.4475(0.10) 11 .' l7.j 1.04) 0.9506(0.02) The location of the simulated QTL is described by the map distances (in first marker of the linkage group (100 cM long). 0.4 n 200 46.26(2.96) 9.0911(0.15) 1.1682(0.13) 6.0013(0.09) 2.6627(0.09) 0.6580(0.07) 0.8260(0.07) 0.6264(0.06) 7.1454(0.17) 1.3742(0.13) 4.4815(0.09) 1.9873(0.09) 0.6816(0.08) 0.8522(0.07) 0.4448(0.07) 0.8292(0.03) 0.9482(0.01) cM) from the with estimation precision depending on heritability, sample size and sampling strategy under the two covariance structure. The precision of different parameter estimations fluctuates with different covariance structures. For example, the QTL location can be better estimated with the AR(1) covariance structure than the SAD(1) model, but poorer estimation for parameter u23 is shown by the AR(1) than the SAD(1). As might be expected, the precision of the QTL parameters is increased with increased sample size. It is interesting to note that the precision of the parameter estimates is greatly improved with increased heritability level rather than just increasing the sample size. For example, the SMSE of the mean parameters u20 for genotype QQ decreases from 0.56 to 0.47 when increasing the sample size from 100 to 200 using the AR(1) covariance structure. However, this increase of precision is less attractive when compared with the improvement when heritability level increases from 0.1 to 0.4 for the same sample size 100. This piece of information , '.. that in practice wellmanaged experiments, through which residual errors are reduced and therefore H2 is increased, are more important than simply increasing sample sizes because normally it costs more to increase the sample size. It is also interesting to note that both covariance models display similar estimation precision. 3.9 Real Example To test the developed model, a real data set was analyzed. Two inbred lines, semidwarf IR64 and tall Azucena, were crossed to generate an Fi progeny population. By doubling haploid chromosomes of the gametes derived from the heterozygous F1, a doubled haploid (DH) population of 123 lines were founded (Huang et al. 1997). Such a DH population is equivalent to a backcross population because its marker segregation follows a 1:1 ratio. With 123 DH lines, Huang et al. _.' I i.ped 135 RFLP and 40 isozyme and RAPD markers to construct a genetic linkage map of length 2005 cM with an average distance of 11.5 cM, representing good coverage of 12 rice chromosomes (Fig. 31). The 123 DH lines and their parents, IR64 and Azucena, were planted in a randomized complete block design with two blocks in the field. Each block was divided into different plots, each containing eight plants per line. Starting from 10 d v of transplanting, tiller numbers were measured every 10 dv for five central plants in each plot until all lines had headed. We used the means of the two blocks for QTL analysis. chrome RG472 RG246 K5 U10  RG532 Wl SRG173 AmylB RZ276 RG146 RG345 RG381  RZ19  RG690 RZ730 j RZ801 RG810 RG331 chrom5 0.9  35.0  6.6  19.4  8.5  16.7  4.2  9.5  21.3  12.8  19.7  chrom2 C  RG437  RG544 S RG171  RG157 RZ318 Pall RZ58 CD0686 AmylA/C RG95 S RG654 RG256 RZ213 RZ123 RG520 chrom6 RG556 RZ390  RG403  RG229 SRG13 CDO105  RZ649  RZ67  RZ70  RZ225 chrom9 3 C711 2 G103 1 RZ206 S RZ422 Amy3ABC t RZ228  RZ12 3  RZ792 RZ404 20.7  10.0 _ 0.0 2.7 1.8 5.6 31.4  10.1  11.0  30.4  4.2 11.8 19.6  10.8  4.4  8.6  RZ398 RG213 Amp 3 Est 2 RZ144 RZ667 Pg 2  pRD lOB RG648 RG424 RG162 RG172 CDO544  RG653 SAmy2A SRG433 chrom3 7 RG104  RG348 1362 RZ329 9.8 98 RZ892 2,8 RG100 17,5 RG191 RZ678 41,6 RZ574 371  RZ284 15,6  RZ394 pRD10A S5 RZ403 286 RG179 19 CDO337 1229 RZ337A  RZ448 15,0  RZ519 32,1  Pgi1 7A1 CDO87 9.2 RG910 179  chrom7 25.7  16.3  1.7  18.4  9.0  31.4 1.8  14.1  3.2  13.0  15.1  11.5  5.8  9.4  chromlO IF G1084 6.6  RG257 27.2 , 0.0 .' 3.0 a CD093 16.4 CD098 .16.4  G2155 10.2 RG134 3.8 RZ500 1 RZ500  RG773  RG769 __ RZ488 \ RG511 RG477 PGMSO.' SCD059  RG711 Es 9 CD0418 RG351 chrome 1 CDO127  RZ638 RZ400 3 R RG1 18 Adh 1  RG1094 RG167 1 ) Npb44 85 RG247  RG103 4   RG1109 5 r= RZ536 Npbl86 chrom4 R RG218 RZ262 6 _ G9 RG190 RG908 RG91 RG449 RG788 RZ565 RZ675 RG163 RZ590 RG214 RG143 RG620 chrom8 RZ143 0  RG20 5.0 ).0 A18A1120 2.3  I.8I i.2 ?.8 .5 RG978 5.5 RG1 1.1 Amy3D/E 5.1 RZ66 1.8  AC5 1.0  RG418B 4 Amp 2 57 f CD099 chroml2 i. 574 3 7 3  ' 816 S G341 5 16 3 3457 2 lth1 7 1 ,3463 1 ,3901 1 / 00344 S" 3958 S3181 Figure 31. Linkage map for the rice genome 20 18 16 14 12 / ,. _ . . Z 10,. .. 4 t., ~  2/ 1 2 3 4 5 6 7 8 9 Time Figure 32. Dynamic changes of the number of tillers for 123 DH lines of rice as an example of PCD in plants Figure 32 shows the dynamics of tiller numbers for each DH line measured at 9 time points. Tiller growth is thought to be an excellent example of PCD in plants (Greenberg 1996) since it experiences several developmental stages during rice ontogeny (Fig. 33). At the early developmental stage, tiller number increases dramatically corresponding to the vegetative phase in rice. During the reproductive phase, tiller number declines as the initiation of panicle, emergence of the flag leaf (the last leaf), booting, 1, lii. and flowering of the spikelets. Tillers that do not bear panicles are called ineffective tillers, and will be killed by genetic control. The number of ineffective tillers is a closely examined trait in plant breeding since it is undesirable in irrigated varieties. The ineffective tillers result in many unwanted problems in rice such as over consumption of nutrition and competition for space. The genetic control system will pl i, its key role in reducing the over produced tillers and balance the rice metabolism system to have optimal nutrition consumption. Amount of growth Tiller number Plant height Ineffective anicle tillers Grain weight 0 20 60 90 120 Days after germination Vegetative  Reproductive  Ripening  Figure 33. Tiller growth Data were analyzed with the developed models with mean modelled by the LEP and covariance modelled by AR(1) or SAD(1) structure. A genomewide scan was conducted at every 2 cM distance on the 12 chromosome. At each test position, a LR test is conducted. A random test position was chosen to calculate the heritability level with different order ranging from 2 to 8 before scan the entire genome. Result shows that H2 is close to 0.4 (data not shown). Based on this information, LEP order is selected under the alternative at each test position using the BIC information criterion. When doing the permutation test, the selected orders were maintained at each test position. The model successfully identified four iI i ri" QTL that control the dynamic PCD developmental process with the SAD(l) covariance structure and four in, I ri QTL with the AR(1) covariance structure. Figure 34 shows the loglikelihood profile plot between the full and reduced (no QTL) model for tiller number trajectories across the 12 rice chromosomes. The positions of markers on the linkage groups (Huang et al. 1997) are indicated at ticks. The plot shows the LR at each test position when modelling the covariance matrix with AR(1) and SAD(1) structure. The threshold value for claiming the existence of QTL at the genomewide level is given as the horizonal solid line for SAD(1) covariance structure and dotted line for the AR(1) covariance structure. The chromosomewide significant level is given by the dashed line. The significant rate shown in the plot is at 5'. level with permutation test. 100 SAD(1) chrom 1 chrom 2 chrom 3 AR(1) 80o ARM 60 40 1001, chrom 4 chrom 5 chrom 6 chrom 7 80 60 r40 100 chrom 8 chrom 9 chrom 10 chrom 11 chrom 12 80 60 20 ' Test Position Figure 34. The LR profile plot. The solid and dotted curves correspond to the LR profile with SAD(1) and AR(1) covariance respectively. As clearly shown in the plot, the AR(1) covariance structure detected 1 genomewide significant QTL and 3 chromosomewide significant QTL. SAD(l) model detected 4 QTL all significant at the genomewide significant level. Among the QTL being detected by the two covariance models, 3 of them are located at the same marker interval. These QTL detected by both covariance models might be the same QTL. Whether they are the same QTL can be checked by calculating the confidence interval applying bootstrap method (Zeng et al. 1999). It is hard to make a decision on which covariance model is the light" one. However, it is clear that the mapping power is sensitive to the covariance structure being applied. Table 35. The MLEs of the parameters and their .~,1,iii l.1 ic standard errors in the parentheses with the AR(1) covariance structure Parameters Q 12 311 Q2 9Q QTL position (A) 200cM 220cM 260cM 119.16cM Marker interval RZ730 RZ801 RZ337A RZ448 RZ519 Pgi_1 Parameters for QQ uoo UOl uo2 U02 u03 uo4 o05 u06 uo7 Parameters for qq U20 U21 U22 U23 U24 U25 U26 U27 Covariance parameters (72 P 8.1221(0.25) 1.8070(0.15) 5.5342(0.24) 1.3254(0.17) 0.9909(0.14) 1.0331(0.13) 0.5397(0.11) 0.8828(0.18) 7.6154(0.23) 1.0413(0.14) 4.7521(0.21) 2.7560(0.18) 0.0881(0.12) 0.7394(0.11) 0.8853(0.11) 0. i7" ;.i,.16) 0.0453(0.006) 0.8709(0.02) 8.5010(0.21) 1.5895(0.13) 5.6862(0.20) 2.2708(0.15) 0.4513(0.13) 0.8325(0.09) 0.9730(0.11) 6.9148(0.21) 1.2773(0.13) 4.3434(0.20) 1.8608(0.14) 0.6113(0.12) 0.7097(0.08) 0.3981(0.11) 0.0428(0.004) 0.8552(0.02) Note: Q12, Q31, Q32 and Q9 refer to the QTL detected refers to the chromosomewise significant QTL. 8.7429(0.20) 1.6282(0.14) 5.8039(0.19) 2.3644(0.15) 0.3681(0.13) 0.8934(0.09) 1.0336(0.11) 6.8063(0.21) 1.2711(0.14) 4.3360(0.20) 1.8183(0.15) 0.7262(0.14) 0.6921(0.09) 0.3185(0.12) 8.0997(0.21) 1.3295(0.13) 5.3113(0.20) 2.3670(0.15) 0.4101(0.12) 0.8583(0.09) 0.8386(0.11) 7.4884(0.21) 1.5255(0.13) 4.8555(0.19) 1.8480(0.14) 0.6375(0.12) 0.7289(0.08) 0.5650(0.10) 0.0346(0.004) 0.0475(0.004) 0.8308(0.02) 0.8688(0.01) on chromosome 1, 3 and 9; * Tables 35 and 36 tabulate the estimated QTL position on the chromosome, the marker interval where the QTL belongs to, the MLEs of curve parameters that specify the developmental pattern as well as the .,,iiii.l, .1 ic standard errors of the estimates. Most the parameters can be reasonably estimated with small sampling error, indicating great power of the developed models. As clearly indicated in the table, the orders at different test positions are different. For example, the QTL detected in chromosome 1 (Q12) with the AR(1) covariance model has order 7 RZ792 and the QTL detected in chromosome 9 (Q9) has order 7. More than 85' of test positions tend to choose the 6th order (data not shown) and AR(1) tends to choose a lower order than the SAD(1) model. Table 36. The MLEs of the estimated parameters and their .,vmptotic standard errors in the parentheses with the SAD(1) covariance structure Parameters QTL position (A) Marker interval Parameters for QQ o00 UOl u02 o03 uo4 u05 u06 uo7 u08 Parameters for qq U20 U21 U22 U23 U24 U25 U26 U27 U28 Covariance parameters (2 0 Qil 108cM RZ276 RG146 10.1080(0.25) 1.4258(0.25) 6.2307(0.33) 2.4717(0.22) 0.7185(0.27) 1.4245(0.16) 1.3406(0.17) 0.9732(0.21) 1.0999(0.38) 7.4620(0.14) 1.3765(0.12) 4.4309(0.17) 2.0271(0.10) 0.7610(0.13) 0.7168(0.07) 0.5365(0.09) 0.0612(0.09) 0.5427(0.20) 0.7249(0.04) SI' 177,11.02) Q12 198cM RZ730 RZ801 8.8476(0.26) 1.9202(0.19) 6.1187(0.20) 1.3324(0.15) 1.0686(0.13) 1.1540(0.11) 0.6267(0.11) 1.1440(0.17) 7.5165(0.19) 1.0548(0.16) 4.6904(0.14) 2.7183(0.13) 0.0544(0.10) 0.7741(0.09) 0.9289(0.10) 0.4435(0.14) 0.7483(0.04) 0.9013(0.02) Q31 220cM RZ337A RZ448 8.9333(0.19) 1.5415(0.16) 5.5936(0.20) 2.0395(0.13) 0.8554(0.16) 1.1371(0.09) 0.9195(0.10) 0.6945(0.14) 0.7288(0.25) 7.1258(0.20) 1.3297(0.17) 3.927(0.22) 2.1169(0.14) 0.8066(0.18) ii , .(0.10) 0.3967(0.12) 0.4249(0.15) 0.7679(0.28) 0.7866(0.04) 0.8836(0.02) Q32 262cM RZ519 Pgi_1 9.1363(0.18) 1.6244(0.16) 6.2130(0.15) 2.0758(0.15) 0.5325(0.12) 1.1813(0.10) 1.0541(0.11) 0.7148(0.16) 6.9824(0.19) 1.2889(0.17) 4.3434(0.16) 2.1184(0.17) 0.4214(0.13) 0.5721(0.11) 1i , 217_ 1'.12) 0.3716(0.17) 0.7886(0.04) 0.8420(0.02) The developmental trajectory of the identified QTL are plotted and are shown in Figures 35 and 36 with tiller number trajectories for all individuals indicated in background. The 4 QTL detected by the AR(1) model are between markers RZ730 and RZ801 (Chl2 (Q12)) on chromosome 1, between marker RZ337A and RZ448 (Ch31 (Q31)), between marker RZ519 and Pgi_l (Ch32 (Q32)) on chromosome 3 and on marker RZ792 on chromosome 9 (C'!i1, (Q9)). Three QTL detected by the SAD(1) model are located at the same marker interval as the 20 Chl2 (Q12) 15 10 5 1 2 3 4 5 6 7 8 9 Time 20 Ch32 (Q32) 15 10 1 2 3 4 5 6 7 8 9 Time 20 Ch31 (Q 1) 15 10 5qq I 5 1 2 3 4 5 6 7 8 9 Time Ch9 (Q9) 15 E 5 10 2 0 1 2 3 5 6 7 8 9 Time The dynamic changes of tiller numbers for two curves each presenting two groups of genotypes, QQ and qq, at each of the four QTL detected with the AR(1) covariance structure. AR(1) model detected except for the one between markers RZ276 and RG146 (Qii). It is clear that different QTL show different developmental patterns. Individual rice plants carry different ., :.1 i rpes showing different tiller increasing and decline rates. For example, the QTL detected in chromosome 1 by the AR(1) covariance model shows an unusual pattern than other QTL. Individuals with :,. :.r ,ipe qq have fast tiller growth and death rate than individuals who carry ,,,I .ipe QQ. This information is very helpful in rice breeding. Plant breeders can try to select species carrying the qq ,.. i vpe. The rapidly increasing tiller numbers at the early developmental stage help the rice to build more tillers in case of the environmental stress resulting in lower production. At the productive phase, over produced tillers is not needed anymore. The fast killing process helps rice to get ride of over produced tiller quickly and reduce nutrition consumption due to these ineffective tillers. This is a very phenomenal developmental process. The Figure 35. 20 20 Chl1 (Q11) Chl2 (Q12) 15 15 10 10 5 5 0 0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 Time Time 20 20 Ch31 (Q31) Ch32 (Q32) 15 15 10 10 5 5 0 0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 Time Time Figure 36. Two curves for the dynamic changes of tiller numbers each presenting two groups of genotypes, QQ and qq, at each of the four QTL detected with the SAD(1) covariance structure. identification of in i r genes controlling this process will greatly help plant breeders in their breeding design. 3.10 Discussion To observe the pattern and shape of a developmental process, it is necessary to develop a function that has enough flexibility to fit the measured data. Parametric modelling of the developmental curve has been studied extensively (\! et al. 2002; Zimmerman and NnfiezAnt6n 2001). Parametric approaches, however, require specific quantitative information about the form of model, and hence have little flexibility when the developmental pattern becomes complicated and does not follow any particular mathematical function. In this study, we proposed a novel nonparametric model for the longitudinal PCD trait using LEP. Without making any assumptions on the underlying developmental pattern, nonparametric regression with LEP shows greatly flexibility to model the longitudinal PCD trait. However, for rapidly changing values of the PCD curve to be approximated, the degree of the polynomial has to be increased. Three model selection criteria are proposed to select the optimal order of the LEP. A number of conclusions can be drawn, both about information theoretic fit criteria in general, and about the performance of individual criteria in particular. The proposed different criteria may not necessarily agree on the best order selected. Among these criteria, BIC has a stronger penalty, which is a function of the number of unknown parameters and sample size. Therefore, the BIC will likely favor more parsimonious models compared to AIC or CAIC, and this just fits our objective of parsimonious modelling of the PCD curve without losing too much flexibility. If data come from the multivariate normal distribution, simulation studies show that order selection under the null alvb, outperforms the selection under the alternative. However, in the real world data, it is not easy to test the data distribution. We i::. I selecting the LEP order under the alternative if computationaly feasible. Selecting order with the unrestricted model ahb v provides more information than with the restricted model. When conducting the permutation test to assess the statistical significance, we need to fix the LEP order for each test position. It might be expected that different genotypes may follow different development patterns and hence results in different LEP orders. Order selection assuming different orders for different .: il, vrpes need to be developed in the future study. Two covariance structure modelling strategies are also proposed and tested by simulation and real data example. It is difficult to compare which structure is the right one given that simulating the true biological mechanisms is unrealistic due to our limited knowledge of the underlying true residual covariance structure. Simulation studies show that both covariance models perform well if we simulate data according to the corresponding structure. For the real data, one way we can compare the two covariance structure is to see which one has higher mapping power (the ability to detect more QTL). The one resulting in more QTL is probably the more powerful model at the same significant alevel. The final set of QTL should be a union of the QTL detected by both models. It is better to keep the identified QTL as a real gene before we have knowledge to further prove its usefulness. To test which covariance model performs better, model selection using informationtheoretic criteria can be applied, for example, AIC, BIC or Crossvalidation. Studies have been done in model selection of the covariance model using these criteria (Grady and Helms 1995; Littell et al. 2000). However, the selection is based on the same mean model (i.e., the mean functions have the same number of parameters). Since the LEP order could be different at each test position when using different covariance models, it is hard to compare the two models under the current study design. Further methods need to be proposed to better evaluate different covariance models. CHAPTER 4 FUNCTIONAL QTL MAPPING PROGRAMMED CELL DEATH: A SEMIPARAMETRIC MODEL 4.1 Introduction The question of how an organism develops into a fully functioning adult from a mass of undifferentiated cells has alv i attracted top researchers in diverse areas of developmental biology (Horvitz 2003). Fundamentally, to produce a functioning adult form, a living organism should coordinate various complementary and sometimes antagonistic processes, which include cell proliferation and ]*i.',;/ "" 1 cell death (PCD) or apoptosis, during its development (Cashio et al. 2005). The molecular and genetic characterization of these processes has been useful to identify the specific signaling path.W,i that underlie a delicate balance between cell proliferation and PCD (Vaudry et al. 2003) and, in the ultimate, enhance our understanding of the roots of disease such as cancer (Hanahan and Weinberg 2000; Yuan and Horvitz 2004). In particular, PCD has been thought to be a universal phenomenon that occurs in predictable patterns in response to environmental or developmental clues, whose study has become one of the most fascinating areas in all of biology over the last decade (Greenberg 1996; Horvitz 2003). Studies of the genetic control of development have used simple model systems, the nematode Caenorhabditis elegans and fruit fly Drosophila, from which views have been established that PCD involves specific genes and proteins and that these genes and proteins interact within the cells that die (Ellis and Horvitz 1986; White et al. 1994; Steller 1995; Horvitz 1999). For the adult hermaphrodite C. elegans forms, there are 1090 somatic cells, of which 131 die by apoptosis. Each apoptosis process is characterized by four stages (Steller 1995): (1) decision about whether a cell should die or assume another fate, (2) death, (3) engulfment of the dead cell by phagocytes and (4) degradation of the engulfed corpse. Each of these stages is regulated by a number of genes. Mutations affecting the final three stages affect all somatic cells, whereas genes affecting the death verdict affect very few cells. The molecular genetic pathway that defines PCD in C. 1.,m. and Drosophila provides a basis for understanding apoptosis in more complex organisms, including higher plants and humans, because genes responsible for PCD are evolutionarily conserved (Horvitz 1999). Given the unique complexity of genetic path ,.v of these species, however, a rigorous, detailed and analytic approach should be developed on its merit, which allows for the genomewide identification of genes for apoptosis in any complex organism. Genetic mapping based on molecular markers (Lander and Botestein 1989; Zeng 1994; Lynch and Walsh 1998), by superimposing real biological phenotypes on genome sequence and structural polymorphisms, can provide an unbiased view of the network of gene actions and interactions of quantitative trait loci (QTL) that build a complex phenotype like PCD. Unlike traditional QTL mapping for a complex trait, the mapping of PCD needs to incorporate the dynamic feature of this developmental process. Although this presents one of the most difficult tasks in genetic studies, some of key issues have been overcome by Wu and colleagues (Wu et al. 2002a, 2003, 2004a, 2004b, 2004c; Ma et al. 2002) who proposed a socalled functional i'."'.!:'!." to map and identify specific QTL that underlie the developmental changes of complex traits. The rationale of functional mapping is to express the . i.. I vrpic values of QTL at a series of time points in terms of a continuous growth function with respect to time t. Under this formulation, the parameters describing longitudinal trajectories, rather than timedependent ., 1 ,ivpic values, as carried out in traditional mapping strategies, are estimated within a maximum likelihood framework. Also, unlike traditional strategies, functional mapping estimates the parameters that model the structure of the covariance matrix among a multitude of different time points, which, therefore, largely reduces the number of parameters being estimated for variances and covariances, especially when the dimension of data is high. In this chapter, we will develop a novel statistical model for the genomewide scan of QTL that guide PCD toward an active process of cell death. This model incorporates two sequentially distinct stages of the developmental process into the mapping framework constructed within the context of Gaussian mixture models. The first stage, growth, has proven to obey some universal growth law that can be modelled mathematically by curve parameters (von Bertalanffy 1957). While it is difficult to find a proper mathematical equation to describe the second stage  death that is subject to a fast exponential decay of cells followed by a slowly decreasing function (Balaban et al. 2004), a nonparametric approach based on the Legendre function is derived to model the PCD process. The combination of parametric modelling of the growth process and nonparametric modelling of the death process 'lv a foundation for semiparametric functional i".Ij':j"!i of PCD. We implement a nonstationary meandependent covariance model to characterize the structure of the covariance matrix among cell numbers measured at a multitude of different times. The statistical behavior of our model is investigated through simulations studies. The utility of the model in a real example of rice i::. I that our model can be useful in practice. 4.2 Different Phases of PCD In general, the whole process of PCD can be described by five reasonably well distinct phases (Fig. 41; Fogg and Thake 1987): lag, exponential, declining growth rate, stationary and death. Each of the phases is defined below. Lag phase: This is the initial growth phase, during which cell number remains relatively constant prior to rapid growth. During this phase, the organism 1 2 i i i i i i i 4 10  5 8 U6 CO6 E 4 2 Growth phase Death phase" 0 1 2 3 4 5 6 t* 7 8 9 10 11 Time Figure 41. A typical example of PCD that includes five different stages, (1) lag, (2) exponential, (3) declining growth, (4) stationary and (5) death. prepares to grow and unseen biochemical changes, cell division and differentiation of tissues occur during this time. Exponential phase: During this phase the tissues are growing and dividing rapidly to take advantage of abundant nutrients. Growth rate, as a measure of the increase in biomass over time, is determined from the exponential phase. Growth rate is one important way of expressing the relative success of an organism in adapting to its biotic or abiotic environment imposed upon it. The duration of exponential phase depends upon the growth rate and the abundance of nutrients to support tissue growth. If the growth phase is plotted (time on xaxis and biomass on logarthmic yaxis), the exponential phase will be straightened out. Declining growth: Declining growth normally occurs when either a specific requirement for cell division is limiting or something else is inhibiting reproduction. During this phase growth slows or the death rate increases. As a result, the initiation of new tissues and the senescence and death of old ones start to come into equilibrium. This phase typically occurs as nutrients become limiting for growth. Stationary phase: Tissues enter the stationary phase when net growth is zero, and within a matter of time cells may undergo dramatic biochemical changes. The nature of the changes depends upon the growth limiting factor. The shut down of many biochemical pathvs as the stationary phase proceeds means that the longer the cells are held in this condition the longer the lag phase will be when cells are returned to good growth conditions. Death phase: When cell metabolism can no longer be maintained, the death rate of a tissue is generally very rapid (Balaban et al. 2004). The steepness of the decline is often more marked than that represented in the accompanying growth figure. The duration and extent of each phase will depend on the organism and environmental conditions. For example, if tissues from the stationary phase are supplied with fresh nutrients, the lag phase will be longer than for the case of tissues from the declining phase. For growing tissues from the exponential phase, organisms supplied with fresh nutrients will likely skip the lag phase. If the growth nutrient is rich, organisms will remain in the exponential growth phase for a longer period and produce a greater biomass. Furthermore, their rate of growth in the exponential phase may also be greater. Growth curves must be drawn from a series of growth measurements at different times during the growth curve. Mathematical equations have been derived to model the growth from the lag to stationary phase (West et al. 2001). It is difficult to find a specific mathematical equation for the death phase. 4.3 Statistical Model 4.3.1 The Mixture ModelBased Likelihood The statistical foundation of functional mapping for the PCD process is based on a finite mixture model. In this mixture model, each curve corresponding to a QTL ., ii. .1pe fitted by a finite set of measurements with 7 time points for any individual is assumed to have arisen from one of a known or unknown number of components with each component being modelled by a multivariate normal distribution. That is y ~ p(r/ 7 TI) = oif(y; yi, TI) + + jf (y; j, TI), (41) where w (l, wj)' are the mixture proportions (i.e., QTL .. I vpe frequencies) for J QTL i,' .vpe which are constrained to be nonnegative and J j1 =j 1; = (1, i)' are the component (or QTL genotype) specific parameters, with mj being specific to component j; and T] are parameters which are common to all components. Consider a standard backcross design, initiated with two contrasting 1I': .. v.ous inbred lines. This backcross progeny population contains two groups of genotypes at a locus. Assume that a genetic linkage map covering the entire genome has been constructed with polymorphic markers. Suppose there is a segregating QTL with alleles Q and q that affects PCD curves or trajectories in the backcross population. This QTL cannot be directly observed rather than should be inferred on the basis of known marker information. In QTL interval rn ,ppii. we will use two flanking markers with known genotypes to infer the .., .1 ivpe of a QTL that is hypothesized to be located between the two markers (Lander and Botstein 1989). The recombination fraction is a linkage parameter that can describe the genetic distance between two given loci. Let r, r, and r2 be the recombination fractions between the two markers MI and M2, between the left marker MI and QTL and between the QTL and right marker M2, respectively. Based on the segregation and transmission of genes from the parent to progeny, one can derive the conditional probabilities of an unknown QTL, i.1 ,,vpe, conditional upon the known marker .' .1 virpes, in terms of these recombination fractions (Table 21). The unknown parameters that specify the position of QTL within a marker interval are array, '1 in Qs. The likelihood function of the longitudinal PCD trait (y) and marker information (M) collected for this backcross populations at this hypothesized QTL with two genotypes j = 1, 0 is constructed as L(QfO, Q,, ly, M) 1[1 if (yi,, fIQ) + oif (yi O [o I1)] (42) i= 1 where w li and wo0i contained in fs are the mixture proportions corresponding to the frequencies of different QTL .,. .1 vpes, expressed as the conditional probabilities of QTL .. n. 1 pes given marker .. .1 vpes for individual i, fm, contains the parameters that model timedependent means for ., .'1 vipe j and f contains the parameters that model the structure of the residual covariance matrix which is assumed to be common to all mixtures. The likelihood function is constructed on the basis of the assumption of independent response between individuals. The multivariate normal distribution of each mixture for progeny i measured at T time points is expressed as f(y ,M m, v) (2 ) /2 1/2 exp (yi mj)TX(yi m) (43) where y = [y/(1), '' ,T u(r)] is a vector of observation for progeny i and uj = [uj(1), uj (r)] is a mean vector for all progenies with genotype j. This means that all progenies with the same QTL .1. ivpe will have the same mean at different time points. At a particular time point, i t, the relationship between the observation and expected mean can be described by a linear regression model yt) M (t) + (1 +)uo(t) + eM(t), where (i is an indicator variable denoted as 1 for QTL genotype j = 1 and 0 for QTL .ii, i. vpe j = 0 and ei(t) is the residual error which is lid normal with zero mean and variance O2(t). The errors for progeny i at two different time points, t1 and t2, are correlated with covariance cov(yi(ti), yi(t2)). The variances and covariances comprise the covariance matrix E whose elements are the common parameters specified by fQ. With these general settings, the statistical challenging becomes how to model the mean process and how to structure the covariance matrix. 4.3.2 Semiparametric Modelling of the Mean Vector In a broad sense, the entire PCD process for a particular individual i can be divided into two phases, growth and death (Fig. 41). Let t* be the transition time point which marks the end of growth phase and beginning of the death phase. The mean vector in the multivariate normal distribution of PCD (Eq. 43) for individual i that carries QTL genotype j can now be specified by two mean subvectors expressed as Uji (u ujl, UDj i) (44) where ugQji and uDji correspond to the growth and death vector before and after t*, respectively. The parametric model of growth phase. The process of growth (before t*) follows universal growth laws and can be described by biologically meaningful mathematical functions. As a nearly universal biological law for living systems, the sigmoidal (or logistic) growth function can be fitted to capture agespecific changes during the growth phase (West et al. 2001). The logistic function is mathematically described for individual i by g9(t) = , with t [1, t (45) 1 + biecit where ai is the .,vmptotic or limiting value of gi when t + oo, ai/(1 + bi) is the initial value of gi when t 0 and ci is the relative rate of growth (von Bertalanffy 1957). Thus, with these three parameters, one can uniquely determine the shape of PCD in the growth phase for individual i that carries QTL genotype j, and have the timedependent mean vector for Eq. 44 specified by Ugjli [ugjli ), " Ugjli ti)1 = a a (46) 1 +bjec'' 1+ bjeAj " If different genotypes at a putative QTL have different combinations of the parameters (ay, bj, cy), this implies that this QTL p1 a role in governing the difference of growth trajectories. The nonparametric model of death phase. Since no particular mathematical function can be used to describe the death phase (after t*), the nonparametric approach based on the orthogonal Legendre polynomial (LEP) is used. The flexibility of LEP will greatly increase the robustness of functional mapping. With appropriate order r, the timedependent genotypic values for different QTL ,., .I vipes in the death phase can be fitted by the orthogonal LEP. A family of such polynomials with normalized time t' is denoted by P(t') [Po (t/), Pl (t/), .. Pr (t1/)]T and a vector of genotypicrelated, timeindependent values with order r is denoted by Vj rI,,. V j,  Urjl]T where vj is called the base /. i. 'ij*/"l'.:' vector for QTL ,, 11. 1pe j and the parameters within the vector are called the base /, i. 'i.*/l'.:, means. The normalized time t' is obtained by adjusting the original measurement time t to match the orthogonal function range [1, 1], by ,/ 2(t Imin) t' 1+ tmax min where tmin and tmax are the first and last time point, respectively. With these specifications, the timedependent i i.1' pic values, uDj (t), in the death phase can be described as a linear combination of vj weighted by series of the polynomials. That is UD (t) vTp(t') (47) Thus, for individual i whose QTL .1 i,.'vpe is j, we use the following expression to model genotype means in the death phase UD l (t),. *. (7)]. (48) This approach has a great flexibility in modelling longitudinal data that cannot be fitted by a parametric model. By choosing an appropriate order, the nonparametric model can better capture the intrinsic pattern of developmental PCD. The number of parameters can be reduced if the order of the polynomial is less than the number of time points, as an usual case. Legendre order selection. To determine which order of the LEP best fits the data, we need to select the optimal order. One of the popular model selection criteria is the AIC information criterion (Akaike 1974). The AIC value at a particular order, r, is calculated by AIC 21nL(6f, ,fD, 6,r) + 2 dimension(Qg, fD, fQr), (49) where Qg {aj,bj,c}jo, and QD {Poj,PIj,... rj}j=o are the the MLEs of parameters for the growth curve function and the Legendre polynomial of order r, Q, contains the MLEs of the covariance parameters, and dimension(fg, D, L, r) represents the number of free parameters under order r. The optimal order is one that dip'1,, the minimum AIC value. Another model selection criterion to determine the optimal order of the Legendre function is the B i'l, Information Criterion (BIC) (Schwarz 1978), which is calculated by BIC 2 In L(Ig, 2, Lr) + dimension(QLg, fD, LIr) ln(n). (410) where all the parameters are defines similarly as above except that n is the total number of observation at a particular time point. Since the BIC criterion adjusts the effect of sample size, the model selected by BIC will be more parsimonious. 4.3.3 Modelling the Covariance Structure To model the covariance structure for longitudinal data, we need to make the following assumptions: (1) the error ei(t) in Eq. 23 is normally distributed with mean zero and variance a 2(t), and (2) the error ei(t) is independent and identically distributed among different individuals. A number of statistical models have been used to model the covariance structure (Diggle et al. 2002). In earlier functional mapping, the firstorder autoregressive (AR(1)) model has been used (V\!. et al. 2002), which is expressed as a2(1) 2... 2 ) a2 for the variance, and 1r t1, t2) 02plt2tl (4 11) for the covariance between any two time intervals t1 and t2, where 0 < p < 1 is the proportion parameter with which the correlation decays with time lag. The parameters that model the structure of the (co)variance matrix are an, i ,1 in f,. To remove the heteroscedastic problem of the residual variance, which violates a basic assumption of the simple AR(1) model, two approaches can be used. The first approach is to model the residual variance by a parametric function of time, as originally proposed by Pletcher and Geyer (1999). But this approach needs to implement additional parameters for characterizing the agedependent change of the variance. The second approach is to embed Carroll Rupert's (1984) transformbothsides (TBS) model into the growthincorporated finite mixture model (Wu et al. 2004b), which does not need any more parameters. Both empirical analyses with real examples and computer simulations ~i. 1 that the TBSbased model can increase the precision of parameter estimation and computational efficiency. Furthermore, the TBS model preserves original biological means of the curve parameters although statistical in &iv . are based on transformed data. The TBSbased model dipl '1., the potential to relax the assumption of variance stationarity, but the covariance stationarity issue remains unsolved. Zimmerman and NnfiezAnt6n (1997) proposed a socalled structured antedependence (SAD) model to model the agespecific change of correlation in the analysis of longitudinal traits. The SAD model has been employ. .1 in several studies and dipl' .v many favorable properties (Zimmerman and NfinezAnt6n 2001). Zhao et al. (2005) incorporated the firstorder SAD (SAD(1)) model into modelling of the covariance matrix. In this article, we use a different modelling approach that is as simple as the AR(1) and as flexible as the SAD(1). This approach has two steps. In step 1, the intraindividual correlation structure is modelled. In many cases, a systematic pattern of correlation is evident, which may be characterized accurately by a relatively simple model. The intraindividual correlation among the timedependent elements of e, for individual i is assumed to follow a pattern, expressed as corr(ei) Ri(), where the correlation matrix R,(Q) is a function of a vector of correlation parameters Q. The correlation structure can be described by the AR(1) model in which = p (Eq. 411). In step 2, timedependent variances are specified according to Horwitz's Rule in analytical chemistry. This rule proposes that there exists an empirical relationship between concentration and variance (Atkinson 2003). Thus, we can similarly model timedependent variances for individual i by considering its ,.1 iJ vpic means at various time points, expressed as a2 (t) a2u t). Therefore, the corresponding covariance matrix can be modelled by 1 1 Sli, cov(y) a2UjRi(2)uj, (4 12) where uy, = di [ i(1),. u (lT)]. Since the covariance structure is modelled as a function of genotypic mean, it is called the meancovariance (\ C) model. The MC model has a great flexibility for modelling the covariance matrix of the PCD process. The unknown parameters to be estimated in the MC model are ai 1v, 1 in Q, = (o2, Q) with accepted means. 4.3.4 Computation Algorithms A number of computational algorithms can be used to estimate the unknown parameters, f = (fs, fm, [v). These algorithms include the maximum likelihood method implemented with the EM algorithm (Dempster et al. 1977). As usual, the QTL position parameter fs can be viewed as a known parameter because a putative QTL can be searched at every 1 or 2 cM on a map interval bracketed by two markers throughout the entire linkage map. The amount of support for a QTL at a particular map position is often diphil i, d graphically through the use of likelihood maps or profiles, which plot the likelihood ratio test statistic as a function of map position of the putative QTL (Lander and Botstein 1989). Because m, and Q, are contained in complex nonlinear equations, it is difficult to derive a closed form of the maximum likelihood estimates (MLEs) for them. Theoretically we could use Simplex algorithm to estimate all the parameters contained in Q (Zhao et al. 2004). The NelderMead simplex algorithm, originally proposed by Nelder and Mead (Nelder and Mead 1965) is a direct search method for nonlinear unconstrained optimization. It attempts to minimize a scalarvalued nonlinear function using only function values, without any derivative information (explicit or implicit). The algorithm uses linear adjustment of the parameters until some convergence criterion is met. However, because of the complex nonlinear function being minimized by Simplex algorithm, it cannot alvb, guarantee the correct convergence of covariance parameters aO2 and p during the minimization process. This consequently results in negative infinity of the loglikelihood function and convergence will never be reached. Based on these concerns, we developed a EMSimplex combined algorithm to estimate the unknown parameters contained in Q. The detailed algorithm is given in Appendix B. The simplex algorithm is integrated into the EM algorithm to estimate the mean parameters, namely the logistic curve, LEP parameters with given covariance parameters. Then the estimated mean parameters can be used to maximize the covariance parameters. Iterative procedure will be processed until convergence. Under the joint modelling framework, two mean functions, growth and death, need to be connected. Two constraints are imposed to make the PCD curve continuous at the transition time point t* for each individual. The first constraint is to make the growth mean equal to the death mean at time t* to make the entire curve continuous. The second constraint is that the two functions has the same score at time ti to make the entire curve smooth. These two constraints are expressed as With these constraints, we obtain the expressions of one growth parameter and one death parameter for any QTL genotype j. For example, if the Legendre polynomial order is 3, we can solve the equations to obtain the estimates of aj and Vlj as follows 2[,,, 0.5(3 i 1) 5t3 (1 + be )2 S 1 + bject bjc(At)t'ecit: Sajbjcj(At)ecA t12 2 I =l 2(1 + bject)2 3t2J 0.5(15t 3)v3 where t' is the adjusted time for t* and At tmax tmin. 4.3.5 Calculating Curve Heritability It is easy to calculate the heritability level (H2) when traits are measured at a single time point. But for longitudinal traits, heritability calculation is difficult. We propose two v to do it: 1. Calculate H2 at a single time point t where the traits shows the highest variation. For a backcross design, the genetic variation is given by a2 l[uI(t) uo(t)]2, where uj(t) (j 1,0) is the genetic mean for genotype j at time t, and the heritability H2(t) 7(t)/o2(t) is calculated where o 2(t) is the residual variance at time t. 2. Calculate H2 using the area under curve (AUC). Functional mapping maps the dynamic gene effect over time. The genetic variation explained by the entire measurement period is more informative than that by individual point time. We propose to calculate the genetic variation by a2 = i(AUCI  AUCo)2, where AUCI and AUCo are the AUC for two different genotypes. The AUC is calculated by AUC j 'j dt + / v Pr(t')dt' a.[l. (bj @ ti)  n(b + c ln(,jtl)] I vIPr(t')dt' Cj be ' where t* is the transition time point, aj, bj and rj are the growth parameters for genotype j, Pr(t') is a vector of LEP with order r and u, is the base ,. iJ, vpic mean parameters and t' is the adjusted time point. 4.4 Hypothesis Testings One of the main advantages of functional mapping is that it allows for a number of hypothesis tests to examine the genetic control mechanisms of growth throughout development and in response to varying environmental or developmental clues. Wu et al. (2004a) have formulated some of these hypothesis tests which include the global test of genetic effects on the entire developmental process, the regional test of genetic control over a particular developmental period of interest and the point test for the timing of developmental events. From a genetic perspective, we can also test how different genetic action modes pl ,v a role in regulating the developmental process. With such a complete set of tests, we are able to address biological questions related to the genetic control mechanisms of PCD traits. Testing whether specific QTL exist to affect the PCD process is a first step toward the understanding of the detailed genetic architecture of complex phenotypes. The genetic control over entire developmental process of PCD can be tested by formulating the following hypotheses Ho: Ug = ugo and ui = uDo (413) HIa : At least one of the qualities above does not hold The Ho states that there is no QTL affecting the dynamic PCD process (the reduced model), whereas the Hia proposes that such a QTL does exist (the full model). The test statistic for testing the hypotheses is calculated as the loglikelihood ratio of the reduced to the full model as given below LR 2[logL(Qy, M) log L(y, M)], (414) where f and 2 denote the MLEs of the unknown parameters under Ho and Hil, respectively. The critical threshold value for declaring the presence of QTL can be empirically calculated based on the permutation tests (C'!ni !In!! and Doerge 1994). Other hypotheses can be made to test if the detected QTL only controls the growth phase with the following alternative hypothesis Hlb : ugi / UgO and uD = u O (415) or if the detected QTL only controls the death phase with the following alternative Hic : ugi = ugo and uDi / uDo (416) The critical thresholds for the above two hypotheses can be determined using simulation studies. Only when both Hlb and HIc are rejected, the detected QTL is thought to pleiotropically affect the growth and death phases. The proposed model can be used to test the influence of QTL on growth in different stages of development, lag, exponential, declining growth, stationary and death. These tests can be based on the area under curve during a time course of interest. Simulation studies are used to determine the critical thresholds for each test. 4.5 Results 4.5.1 A Worked Example I use the proposed model to analyze a real data set of rice. Data description is given in C'!i Ipter 3. Figure 32 shows the dynamics of tiller numbers for each DH line measured at 9 time points. Tiller growth is thought to be an excellent example Table 41. Model selection for Legendre polynomial orders based on the AIC and BIC values under the MC covariancestructuring model Selection Order criterion 0 1 2 3 4 AIC 2437.8 1096 759.74 559.87 670.65 BIC 2453.9 1109.4 775.77 578.58 692.03 of PCD in plants (Greenberg 1996) since it experiences several developmental stages when rices grow. Our semiparametric model was used to map specific QTL that determine the dynamic changes of tiller number during ontogeny. Although the growth phase of tiller number can be well modelled by a logistic equation defined by parameters a, b and c (Eq. 45), no proper equation can be used to model the death phase. For this reason, a nonparametric approach based on the Legendre polynomial function is adopted in the framework of QTL mapping. But this will encounter the issue of order determination. To detect the best order of the Legendre polynomial function for this rice data set, we calculated the AIC and BIC information criteria for various orders (Table 41). Both the criteria provide the consistent result that the death phase of tiller number can be best explained by the Legendre polynomial of order 3. By genomewide scanning for QTL at every 2 cM within each marker interval across 12 rice chromosomes, our model has probed three i, i"r QTL that tri .r their effects on the overall PCD process of tiller number. As shown by the genomewide loglikelihood ratio (LR) profile between the full and reduced (no QTL) model for tiller number trajectories in Fig. 42, these three QTL are located between markers RG146 and RG345 and between markers RZ730 and RZ801 on chromosome 1 and on marker RZ792 on chromosome 9. The positions of markers on the linkage groups (Huang et al. 1997) are indicated at ticks. The threshold value for claiming the existence of QTL is given as the horizonal dotted line for the genomewide level and dashed line for the chromosomewide level. Of these 40 chrom 1 chrom 2 chrom 3 30 20 VoSOV 34U 30 S20 10 40 30 20 10 n chrom 4 chrom 5 chrom 6 chrom 7 chrom 8 chrom 9 chrom 10 chrom 11 chrom 12 Test Position Figure 42. The LR profile plot. The genomic positions corresponding to the peak of the curve are the MLEs of the QTL localization (indicated by the arrows). three detected QTL, the first is significant genomewide, whereas the other two are significant chromosomewide, all at the 5'. significance level based on the critical thresholds determined from the permutation tests. To know more about the behavior of the detected QTL, we tabulate the MLEs of curve parameters that specify the growth phase and genotypic basis effects that specify the death phase, along with the approximate standard errors of the estimates (Table 42). All the parameters that specify the growth and death phases for different QTL ., n. .,ipes (j = 1 for QQ or 0 for qq in the DH population) are estimated with reasonably high precision as shown by the standard errors, although the estimation precision tend to be better for the growth parameters than for the death parameters (Table 42). The parameters that model the structure of the covariance matrix based on the MC model can also be well estimated, ..i ii good behavior of our model. Table 42. The MLEs of the growth and death parameters (and their approximate standard errors in the parentheses) for significant QTL detected on different chromosomes for tiller numbers in a DH rice population Parameters QTL position (7) Marker interval Growth parameters a2 b2 rT2 Death U20 U21 U22 U23 parameters Covariance parameters Oa2 120 cM RG146 RG345 10.5973 7.,!' ; (0.20) 1.7257(0.02) 12.3823 9.6068(0.38) 1.8834(0.03) 8.5001(0.22) 2.4290 0.1008(0.05) 0.5118(0.04) 9.6225(0.37) 3.2068 0.1472(0.08) 0.6575(0.07) 0.0493(0.01) 0.8747(0.01) C '! i, i ..* 2* 198 cM RZ730 RZ801 11.0824 7.5167(0.28) 1.8486(0.03) 11.5461 8.9779(0.34) 1.661(0.03) 8.6888(0.32) 2.7785 0.1326(0.07) 0.5685(0.05) 9.2706(0.31) 2.6222 0.0964(0.07) 0.5765(0.05) 0.0508(0.004) 0.8738(0.01) '! i !.... n. 9 * 119.1 cM RZ792 10.7188 7.8547(0.35) 1.6747(0.04) 11.5739 8.3427(0.39) 1.8311(0.04) 8.5837(0.32) 2.4647 0.0989(0.07) 0.5312(0.05) 9.1128(0.35) 2.8732 0.1030(0.07) 0.5784(0.06) 0.0535(0.005) 0.8786(0.01) where refers to the chromosomewide significant QTL Using the MLEs of parameters for the growth and death phases, we draw the developmental trajectories of tiller number for the two different QTL ... i' vpes (Fig. 43) with tiller number trajectories for all individuals indicated in background. Each QTL shows a unique developmental pattern over time. For example, the dynamic process of genetic effects for the QTL located between markers RZ730 and RZ801 on chromosome 1 is different than those for the other two QTL. Statistical tests based on hypotheses 415 and 416 show that the QTL Table 43. The MLEs of the QTL position and the model parameters derived from 100 simulation replicates. The squared roots of the mean square errors of the MLEs are given in parentheses. True parameter A 48 Growth parameters a2 15.033 b2 8.324 r2 = 1.814 10.926  7.602  1.522 Death parameters 20 = 9.817 t21 = 6.453 Ut22 0.366 t23 0.958 uoo = 7.893 U01 = 3.681 U02 = 0.208 U03 = 0.625 H2 n 100 46.222(4.03) 8.3705(0.31) 1.I,1 .',(0.04) 7.6648(0.45) 1 I I_'(0.06) 9.8818(0.43) 0.3623(0.10) 0.9695(0.08) 0.1 n =200 45.98(3.19) 8.3317(0.22) 1.812(0.02) 7 '1...' ,(0.31) 1.5255(0.03) 9.8442(0.25) 0.3692(0.06) 0.9554(0.05) 8.1185(0.42) 7.9525(0.29) 0.2127(0.09) 1, 1. L(0.08) 0.2164(0.06) 0.6277(0.06) H2 n 100 46.02(3.27) 8.3404(0.12) 1.8124(0.01) 7.5957(0.18) 1.5219(0.02) 9.8363(0.15) 0.3661(0.03) 0.9589(0.03) 7.8879(0.13) 0.2056(0.03) 0.6247(0.03) 0.4 n 200 46.06(2.69) 8.3211(0.09) 1.813(0.01) 7.6274(0.13) 1.5224(0.01) 9.8257(0.10) 0.3673(0.03) ,',',11,,"(0.02) 7.9137(0.12) 0.2109(0.02) 0.6260(0.02) Covariance parameters 02.1 0.1194 02.4 0.0199 0.1096(0.02) 0.1167(0.01) p = 0.85 0.8415(0.02) 0.8475(0.01) 0.0198(0.002) 0.0197(0.002) 0.8484(0.02) 0.8478(0.01) The locations of the two QTL are described by the map distances (in cM) from the first marker of the linkage group (100 cM long). detected between markers RG146 and RG345 on chromosome 1 and on marker RZ792 on chromosome 9 merely control the growth phase, whereas the second QTL on chromosome 1 controls the entire developmental process (P < 0.05). 4.5.2 Simulation We performed a series of simulation studies to examine the statistical properties of the model. Six equidistant markers are simulated from a backcross population and are ordered as MAIMA6 on a linkage group with the length of 100 cM. The Haldane map function was used to convert the map distance into the recombination fraction. Different heritability levels (H2 = 0.1 vs 0.4) and different sample sizes (n = 100 vs 200) were considered in the simulation study to examine model's performances under different scenarios. The putative QTL is located between marker M3 and M4, at 48 cM from the first marker. Data are simulated by assuming that QTL controls the entire developmental process. The simulated data have 9 continuous time points. The means at different time points used to model the covariance matrix based on the MC model are the average of the two genotypic means. Table 43 lists the results from simulation. The true parameters are given in the left column. In general, our model can provide reasonable estimates of the QTL positions and the growth and death parameters determined by the QTL, with estimation precision depending on heritability level and sample size. In all cases of different sample sizes and heritabilities, the maximum values of the LR landscapes from 100 simulation replicates are beyond the critical thresholds at the a = 0.001 level determined from 1000 permutation tests for the simulated data, ~i.. Iii.; that our model has 10(1'. power to detect QTL in these conditions. The precision of parameter estimation is evaluated in terms of the square root of the mean squared errors (SMSE) of the MLEs. The QTL positions and effects can be better estimated when the PCD trait has a higher than lower heritability or when sample size is larger than smaller (Table 43). But the increase of H2 from 0.1 to 0.4 leads to more significant improvement for the estimation precision than the increase of n from 100 to 200. For example, the SMSE of the growth parameter Co for QTL genotype qq reduces by more than one fold when H2 is increased from 0.1 to 0.4 for a given sample size, whereas such reduction is much smaller when n is increased from 100 to 200 for a given heritability. This i.:r i that in practice it is more important to manage experiments to reduce residual errors (and therefore increase H2) than simply increase sample size. 4.6 Discussion The growth of any tissue, whether normal or malignant, is determined by the quantitative relationship between the rate at which cells proliferate and the rate at which cells die. Depending on how the rate of cell proliferation is compromised or coordinated with the rate of cell death, all tissues will undertake two distinct processes, growth or death, during development (Jacobson et al. 1997). Unlike the detailed framework for cell proliferation, an understanding of the initiation of cell death and the cellular mechanics of this process is still in its infancy. Cell death can occur as an active and orderly process of development, a process coined by the term Ii'.;';'",n ,1 cell death (PCD) (Horvitz 2003). PCD, or referred to as ipop ..i appears to be a universal feature of animal development, and abnormalities in PCD have been associated with a broad v ii, I, of human diseases, including certain cancers and neurodegenerative disorders (Hanahan and Weinberg 2000). In plants, PCD is also ubiquitous for essential development and survival, including xylogenesis, reproduction, senescence and pathogenesis (Greenberg 1996). To make PCD control and execution efficient, particular genetic mechanisms should be involved in regulating and modulating this process in response to various developmental and environmental stimuli. The use of organisms with simple structure like the the nematode Caenorhabditis elegans and Drosophila has led to the identification of numerous genes responsible for PCD (Ellis and Horvitz 1986; Horvitz 1999, 2003; Yuan and Horvitz 2004). The 2002 Nobel Prize in medicine was awarded to three scientists because of their discoveries of the genetic regulation of PCD (Horvitz 2003). Although the use of simple model systems can provide a wealth of information about the genetics of PCD for more complicated organisms like animals and higher Figure 43. 6 7 8 9 6 7 8 9 6 7 8 9 Plot of the dynamic changes of tiller numbers for two curves each presenting two groups of ., i ii .pes, QQ and qq, at each of the three QTL. A) QTL detected between markers RG146 and RG345, and B) QTL between markers RZ730 and RZ801 on chromosome 1, and C) QTL on marker RZ792 on chromosome 9. 2 3 4 5 Time 2 3 4 5 Time 2 3 4 5 Time plants, some key questions cannot be well addressed without a direct use of these organisms. Recent development of highthroughput molecular technologies has made it possible to generate a massive amount of genetic and genomic data for any organism almost with no limit. This, thus, presents a pressing need for the development of vigorous, detailed and analytical approaches to unravel the genetic control and regulation mechanisms that underlie the PCD process. In this article, we have for the first time developed a statistical model that can make a systematic scan of quantitative trait loci (QTL) for PCD throughout the entire genome with a wellcovered genetic linkage map. This model has been validated by a real example for the PCD process of tiller number in rice. Three QTL were detected to affect tiller number trajectories during a growing season in the field. The locations for two of the QTL detected on chromosome 1 are consistent with those estimated from basic interval mapping of single traits (Yan et al. 1998), but the third QTL detected on chromosome 9 was previously undetected by interval mapping. It seems that our model has not been able to detect many other QTL detected by Yan et al., which may be due to the difference in the threshold criteria used to claim the existence of a QTL. The rationale for this PCD mapping model is in spirit similar to that established in earlier functional mapping of growth curves ('\ I, et al. 2002; Wu et al. 2004a, 2004b, 2004c; Zhao et al. 2005). Both types of models integrate biological principles into the framework for QTL mapping constructed on the basis of Gaussian mixture models. They have provided a quantitative platform for testing the interplay between gene actions and organ development. In addition, by taking advantages of structuring the covariance matrix, these models estimate a much fewer number of parameters, thereby increasing the statistical power of the model. It can also be seen that the PCD mapping model proposed in this article is different from earlier published functional mapping approaches purely based on parametric modelling. The current model splits the PCD process into two sequentially distinct phasesgrowth and death. As in a parametric model, universal growth laws (West et al. 2001) are used to model the growth phase, whereas nonparametric modelling is performed for the death phase. This combination of parametric and nonparametric models, referred to as semipara metric modelling, is aimed to overcome the problem of the death phase that cannot be mathematically described. Although it is feasible for a nonparametric approach to model any form of curve including the entire growthdeath curve like Fig. 41, this approach does not make full use of biological information contained in the growth phase and, therefore, loses the advantages of parametric functional mapping in the biological interpretation of results (see Wu et al. 2004a). Nonparametric part of our semiparametric model is based on Legendrepolynomial approaches. As shown by Kirkpatrick and Heckman (1989), Legendre polynomials have several favorable properties for curve fitting which include: (1) the functions are orthogonal, (2) it is flexible to fit sparse data, (3) higher orders are estimable for high levels of curve complexity and (4) computation is fast because of good convergence. Other nonparametric regression methods using kernel estimates have been considered for the mean structure of growth curve data by Altman (1990), Boularan et al. (1994), Wang and Ruppert (1995) and Ferreira et al. (1997). Relative to nonparametric modeling of the mean structure, nonparametric covariance modelling has received little attention. Leonard and Hsu (1992) derived a B ,i, ~i in approach for nonparametric estimates of the covariance structure. Diggle and V' iI i 1, (1998) used kernelweighted local linear regression smoothing of sample variograms ordinates and of squared residuals to provide a nonparametric estimator for the covariance structure without assuming stationarity. It is 88 appealing to incorporate these nonparametric or semiparametric approaches into our functional mapping framework to increase the model's flexibility. CHAPTER 5 CONCLUDING REMARKS Programmed cell death (PCD), a deliberate cell suicide process in a multicellular organism, is undertaken in a coordinated manner that generally confers advantages during the organism's life cycle. It serves fundamental functions during both plant and metazoa (multicellular animals) tissue development. The information about how genes control or affect this process is crucial for a better understanding of the physiological mechanisms of PCD. The identification of genes that contribute to the differentiation of cells has been one of the most important and difficult tasks for PCD research. While our understanding of the molecular mechanisms of PCD has been largely based on simple organisms like the nematodes, a direct use of more complex organisms such as plants, animals and humans presents a pressing need for examining the genetic regulation of PCD for complex traits including grain yield, meat production and cancer. The genes or quantitative trait loci (QTL) that mediate PCD can be detected using genetic mapping with available abundant information of molecular markers and the conceptual idea of functional mapping. In this dissertation, I have for the first time derived a series of statistical models and algorithms for detecting and characterizing specific QTL that are responsible for PCD process under various problem settings. The developed models can make a systematic scan of QTL for PCD across the entire genome with a wellcovered genetic linkage map. The general mapping framework is stated in Chi Ipter 2 which presents a conceptual idea on how to model and characterize the developmental PCD process. Two interconnected core issues addressed in this chapter are how to model the developmental mean PCD process and the covariance structure efficiently in order 