TWO-STAGE GENOME SEARCH DESIGN IN AFFECTED-SIB-PAIR METHOD

By CHIHSE TENG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1997

Copyright 1997 by CHIHSE TENG

ACKNOWLEDGMENTS

I would like to thank the chairman of my committee, Professor Mark C.K. Yang, for introducing me to the field of genetic statistics and for his wise guidance throughout my graduate study at UF. I also want to thank Professor Randy Carter, Professor Frank Martin, Professor Sue McGorrary, and Professor Jin Xiong She. Their advice, encouragement, and patience were essential. Last but not least, I want to thank Ms. Margaret Joyner for her gracious editorial help and effort. Finally, my deepest appreciation goes to my parents, brother, sister, and parents-in-law for their love and never-ending support, and to my wife HsiaoYun and our daughter Gillian for their sweet company.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
ABSTRACT
CHAPTERS
1 INTRODUCTION
2 LITERATURE REVIEW
  2.1 Map Function
  2.2 An Introduction to Human Linkage Data for Genetic Dissection
  2.3 Identical by State, Identical by Descent
  2.4 Distributions of IBD
  2.5 Risk Ratio
  2.6 Test Statistics Based on IBD Score
  2.7 Likelihood Ratio Test
  2.8 Heterogeneity and Homogeneity
  2.9 Polygenes
  2.10 Two-stage Genome Search
3 TWO-STAGE GENOME SEARCH FOR SIMPLE MENDELIAN DISEASE
  3.1 Assumptions
  3.2 Two-Stage Procedure
  3.3 Probability of Allocating the Correct Marker
  3.4 Type I Error and Power of Claiming Linkage
  3.5 Discussion
4 TWO-STAGE GENOME SEARCH FOR COMPLEX DISEASE
  4.1 Genetic Model and Assumptions
  4.2 Two-stage Genome Search
  4.3 Probability of Allocating the Correct Marker for a Complex Disease
  4.4 Discussion
5 CONCLUDING REMARKS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1.1 Results of Mendel's crosses experiments.
1.2 The results of Bateson and Punnett's experiment.
2.1 Multilocus-feasibility of several map functions.
2.2 Some sampling issues in linkage studies.
2.3 The values of Pr(IBD of trait gene | IBD of marker and relationship).
2.4 The probability of IBD score for different relatives.
3.1 The joint distribution of IBD score of the locus 1 and locus 2.
3.2 The joint distribution of Xij and Xi,j.
3.3 Optimal resource allocation in two-stage genome search for rare recessive gene.
3.4 Optimal resource allocation in two-stage genome search for rare dominant gene.
3.5 The 95% percentile of the unique maximum of R Binomial(n2, 0.25).
3.6 The 95% percentile of the unique maximum marker group of R Binomial(n2, 0.25).
4.1 Trait IBD distribution conditional on parental genotype and E vector.
4.2 Exclusive events for the case "G1 is found but G2 is not."
4.3 Exclusive events for the case "G2 is found but G1 is not."
4.4 Exclusive events for the case "G1 and G2 are found."
4.5 Conditional distribution of X1j and X2j given
4.6 P(both affected | E1, E2) and possible trait IBD given E1 and E2.
4.7 Optimal resource allocation in two-stage genome search for rare recessive gene, assumed two genes.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TWO-STAGE GENOME SEARCH DESIGN IN AFFECTED-SIB-PAIR METHOD

By CHIHSE TENG

December, 1997

Chairman: Mark C.K. Yang
Major Department: Statistics

Since Penrose proposed the sib-pair method in 1935 and the affected-sib-pair (ASP) method in 1950 and 1953, these methods have been widely used in linkage studies. Combined with contemporary DNA-level genetic marker technology, the ASP method can now be used for genome-wide searches for disease genes. A two-stage search involves first a screening search to eliminate nonviable marker loci, then an intensive search to identify the gene location using the remaining markers. Although this approach has been suggested in the literature, its properties have not been thoroughly investigated.
The spacing between markers, the number of ASPs to be used in each stage of the experiment, and the criteria for markers to pass the first stage need to be determined. The major difficulties of the two-stage approach are (1) the joint distribution of the nonindependent statistics is difficult to handle; (2) the test statistics in the second stage have a "random" dependence structure; and (3) for most practically available sample sizes, the high number of ties in test scores makes asymptotic approaches inappropriate. This dissertation aims to provide optimal two-stage designs for rare autosomal recessive and dominant diseases. Power computations using the multinomial distribution, supported by simulation, show that a two-stage design is usually, but not always, better than a one-stage design. Combining data from the two stages yields a small increase in accuracy. The optimal designs for resource allocation in each of the two stages are obtained and presented in tables.

CHAPTER 1
INTRODUCTION

Modern genetics began with the work of Gregor Mendel (1822-84), conducted between 1856 and 1863, which formed the basis for his 1866 paper (Klug and Cummings, 1997). Mendel, an Austrian monk, performed a series of experiments on garden peas. Based on the results of these experiments, he proposed a particulate inheritance theory, which hypothesized that heritable biological characteristics were carried and controlled by individual "units." We call these units genes today, but Mendel called each unit a "Merkmal," the German word for "character" (Levine and Miller, 1991).

In his garden pea experiments, Mendel studied seven characters: round or wrinkled ripe seeds, yellow or green seed interiors, purple or white petals, inflated or pinched ripe pods, green or yellow unripe pods, axial or terminal flowers, and long or short stems (Griffiths et al., 1993). For each of these seven characters, Mendel obtained pure lines of plants. "A pure line is a population that all offspring produced by selfing or crossing within the population show the same form of the character being studied" (Griffiths et al., 1993). The parental generation in Mendel's experiment, denoted F0, consists of the plants of these pure lines. Mendel first studied the characters separately: for every character he crossed two pure lines, one for each phenotype, to obtain the first filial generation, denoted F1. All individuals of the F1 show only one phenotype of each character. Next, Mendel self-fertilized the F1 and obtained the second filial generation, denoted F2. The results are shown in Table 1.1.

Table 1.1. Results of Mendel's crosses experiments.

Parental phenotype           F1             F2                            F2 ratio
Round x wrinkled seeds       All round      5474 round; 1850 wrinkled     2.96:1
Yellow x green seeds         All yellow     6022 yellow; 2001 green       3.01:1
Purple x white petals        All purple     705 purple; 224 white         3.15:1
Inflated x pinched pods      All inflated   882 inflated; 299 pinched     2.95:1
Green x yellow pods          All green      428 green; 152 yellow         2.82:1
Axial x terminal flowers     All axial      651 axial; 207 terminal       3.14:1
Long x short stems           All long       787 long; 277 short           2.84:1

Source: Griffiths et al., 1993, p. 23.

Mendel established two principles to explain the pattern of the data (Levine and Miller, 1991; Griffiths et al., 1993):

The characteristics of an organism are determined by individual units of heredity called genes. Each adult organism has two alleles for each gene, one from each parent. These alleles are segregated (separated) from each other when reproductive cells are formed, and each gamete receives one of the two alleles with equal chance. This is the principle of segregation, also known as Mendel's First Law.

In an organism with contrasting alleles for the same gene, one allele may be dominant over another. This is known as the principle of dominance.

He assumed that each plant had two alleles for each trait he studied. This assumption was correct.

An "allele is one of the different forms of a gene that can exist at a single locus" (Griffiths et al., 1993, p. 783). Mendel introduced the notation A for the dominant allele and a for the recessive allele. Since the parental generations were pure lines, their genotypes were homozygous, AA and aa. Thus, the F1 population all had the heterozygous genotype Aa. The F2, offspring of the F1, were expected to be AA, Aa, and aa in the ratio 1:2:1, and to show the dominant (AA and Aa) and recessive (aa) phenotypes in the ratio 3:1.

Mendel cultivated 10 seeds from each of 100 dominant F2 plants; if all 10 offspring from a single F2 plant showed the dominant character, he concluded that this plant was homozygous. Mendel used this experiment to demonstrate the 1:2 ratio of homozygous dominant (AA) to heterozygous dominant (Aa) F2 plants. However, a heterozygous F2 parent still has probability $(0.75)^{10} = 0.0563$ of producing 10 dominant offspring, and hence of being misclassified as homozygous. The expected proportion classified as homozygous is therefore $1/3 + (2/3)(0.0563) = 0.3709$, so the true expected ratio should be 0.3709:0.6291 instead of 1:2. Fisher suggested that Mendel's data are too close to 1:2 rather than to the correct values, and thus suspected some manipulation or omission of data (Weir, 1996).

Mendel also studied two characters together, for example the shape and the color of the seeds. We denote the round allele as R and the wrinkled allele as r, and the yellow allele as Y and the green allele as y. He crossed a parental line of plants with yellow, round seeds (genotype RRYY) with a line with green, wrinkled seeds (genotype rryy). The seeds of the F1 were all round and yellow (RrYy), showing that the two alleles R and Y are dominant over r and y. Then he crossed the F1 plants and got four types of seeds: 315 round/yellow, 108 round/green, 101 wrinkled/yellow, and 32 wrinkled/green. The ratio was approximately 9:3:3:1 (Levine and Miller, 1991).
All possible genotype combinations of the F2 are given in the Punnett square below (Griffiths et al., 1993); the first row represents the genotype of the gamete from the father, the first column represents the genotype of the gamete from the mother, and the numbers in parentheses are the expected frequencies if the alleles of these two characters segregate independently. Mendel established the following principle to explain the pattern of the data: each of the seven genes that he investigated segregated independently (independent assortment). This is also known as Mendel's Second Law (Levine and Miller, 1991; McPeek, 1997). Regardless of the controversy over his data, Mendel's theory that genes control biological characters is well accepted.

Correns (1900) observed the phenomenon of complete linkage, in which alleles of two or more different characters appeared always to be inherited together rather than independently, as Mendel's Second Law would predict (McPeek, 1997). This violation led to an extension of Mendel's theory, the chromosome theory of heredity, which says that genes are parts of specific cellular structures, the chromosomes (Griffiths et al., 1993). "In 1902, Walter Sutton and Theodor Boveri, independently published papers linking their discoveries of the behavior of chromosomes during meiosis to the Mendel's principles of segregation and independent assortment. Sutton and Boveri are credited with initiating the chromosomal theory of heredity" (Klug and Cummings, 1997, p. 61). This theory provided a physical mechanism for Mendel's Laws. It was assumed that the characters Mendel studied lay on different chromosomes, and that characters which were completely linked lay on the same chromosome (McPeek, 1997).
Punnett square for the F2 of the RrYy x RrYy cross (expected frequencies under independent segregation in parentheses):

              RY (1/4)       Ry (1/4)       ry (1/4)       rY (1/4)
RY (1/4)      RRYY (1/16)    RRYy (1/16)    RrYy (1/16)    RrYY (1/16)
Ry (1/4)      RRYy (1/16)    RRyy (1/16)    Rryy (1/16)    RrYy (1/16)
ry (1/4)      RrYy (1/16)    Rryy (1/16)    rryy (1/16)    rrYy (1/16)
rY (1/4)      RrYY (1/16)    RrYy (1/16)    rrYy (1/16)    rrYY (1/16)

Bateson and Punnett performed an experiment on sweet peas to study two characters: flower color, purple (dominant) or red (recessive), and the form of the pollen, long (dominant) or round (recessive). First, they crossed purple, long (PPLL) plants with red, round (ppll) plants to create a heterozygous (PpLl) F1 generation; then they crossed the F1 generation among themselves and produced 381 plants. The results are given in Table 1.2 (Levine and Miller, 1991, p. 193).

Table 1.2. The results of Bateson and Punnett's experiment.

Phenotype        Number      Percent     Percent expected   Percent expected
                 observed    observed    if unlinked        if completely linked
Purple, long     284         74%         56%                75%
Purple, round    21          6%          19%                -
Red, long        21          6%          19%                -
Red, round       55          14%         6%                 25%

Source: Levine and Miller, 1991.

These results neither matched Mendel's Second Law nor indicated that the two genes were completely linked. Morgan (1911) provided an explanation for Bateson and Punnett's observation and for the results of his own fruit fly experiments, which were similar to theirs. He suggested that an exchange of genetic material had occurred between two homologous chromosomes when they paired during cell division; this exchange was called a crossover (McPeek, 1997). Recombination is the process by which progeny derive a combination of genes different from that of either parent (DOE, 1992). Each individual inherits one chromosome of each pair from its father and the other from its mother. When this individual produces gametes during meiosis, it does not pass on to its offspring an intact copy of one of the two chromosomes it inherited from its parents, but a "blended" one. Meiosis produces crossovers that blend the two chromosomes and cause recombination. If there is no crossover or an even number of crossovers between two loci, then recombination does not happen between these two loci.
On the other hand, if there is an odd number of crossovers, then recombination happens. The probability that recombination happens between two loci is called the recombination fraction. With current technology, crossovers are not directly observable, but recombination between two genes (or markers) is (Ott, 1991). An obvious consequence of the crossover process is that the closer two genes are to each other on a chromosome, the smaller the chance that a crossover happens between them, and hence the smaller the chance that recombination happens between them. This idea became the foundation of linkage analysis. Recombination is the chief source of variation among species (Rhodes et al., 1974).

The purpose of a linkage study is to find the relative locations of genes or markers and finally to present them in a linkage map. The distance between two loci on a chromosome is measured as genetic distance, which is discussed in Section 2.1, "Map Function".

Lander and Schork (1994) identified four major categories of methods for genetic dissection. One is "genetic analysis of large crosses in model organisms such as the mouse and rat." The other three use human genetic data: pedigree analysis, allele-sharing methods, and association studies in human populations. They are discussed in more detail in Section 2.2, "An Introduction to Human Linkage Data for Genetic Dissection". The affected-sib-pair method, proposed by Penrose (1953), is one of the allele-sharing methods. Combined with modern DNA-level genetic marker technology, such as Restriction Fragment Length Polymorphism (RFLP) markers (Botstein et al., 1980), the affected-sib-pair method can be used for genome-wide gene searches.

This dissertation discusses a two-stage genome-wide approach to searching for disease genes with the affected-sib-pair method. Only designs for rare recessive and dominant diseases with a dichotomous phenotype (affected or not affected) are studied.
Chapter 2 provides a literature review of related topics, including map functions, the sib-pair method, affected-sib-pair methods, identity by descent, identity by state, the affected-relative-members method, heterogeneity, and polygeneity. Chapter 3 discusses how to design an optimal two-stage search for a simple Mendelian disease gene, and Chapter 4 discusses how to design an optimal two-stage search for two low-penetrance, non-interacting, unlinked, recessive complex disease genes. Brief concluding remarks are given in the last chapter.

CHAPTER 2
LITERATURE REVIEW

In this chapter, the following topics are covered: map functions (without biological interference); the concepts of identical-by-descent (IBD) and identical-by-state (IBS); test statistics based on IBD scores, IBS scores, and the likelihood ratio; the sib-pair method and the affected-sib-pair method; heterogeneity and homogeneity; polygeneity; and the two-stage approach.

2.1 Map Function

The genetic map distance between two genes, in units called Morgans, is defined as the expected number of crossovers occurring between the two genes on a single chromosome strand (Ott, 1991). When the distance is so small that the probability of multiple crossovers is negligible, the genetic map distance is equivalent to the recombination fraction. A centimorgan is a distance between two genes that produces a probability of approximately 0.01 that recombination takes place between them. When the distance is large, the probability of multiple crossovers increases. If there is zero or an even number of crossovers, we will not observe recombination, but if there is an odd number of crossovers, we will. Hence, the recombination fraction is not necessarily equal to the map distance when two loci are farther apart. A map function is used to relate the additive, but hard to measure, genetic or map distance to the nonadditive, but more readily estimable, recombination fraction (Speed, 1997).
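The parity argument above (recombination is observed only when the number of crossovers is odd) can be illustrated numerically. The sketch below assumes, purely for illustration, that the number of crossovers between two loci on a strand follows a Poisson distribution whose mean is the map distance in Morgans; this no-interference assumption is not made in the paragraph above, and the function names are hypothetical. Under it, the probability of an odd count has the closed form $\tfrac{1}{2}(1 - e^{-2x})$.

```python
import math


def prob_odd_crossovers(map_distance, max_terms=200):
    """P(odd number of crossovers) when the count is Poisson with
    mean `map_distance` (in Morgans), summed term by term."""
    total = 0.0
    term = math.exp(-map_distance)      # P(N = 0)
    for k in range(1, max_terms):
        term *= map_distance / k        # now P(N = k)
        if k % 2 == 1:
            total += term
    return total


def odd_parity_closed_form(map_distance):
    """Closed form of the same probability: 0.5 * (1 - exp(-2x))."""
    return 0.5 * (1.0 - math.exp(-2.0 * map_distance))


if __name__ == "__main__":
    for x in (0.01, 0.1, 0.5, 1.0, 3.0):
        print(x, prob_odd_crossovers(x), odd_parity_closed_form(x))
```

For a distance of 0.01 Morgan (one centimorgan) the recombination fraction is very close to 0.01, while for large distances it levels off below 0.5, which is why map distance and recombination fraction diverge for loci that are far apart.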
A map function should preserve the additivity of map distance, but not every map function does. For example, consider three loci $x_1$, $x_2$, $x_3$ on a chromosome, let $\theta_{12}$ and $\theta_{23}$ be the map distances between $x_1$ and $x_2$ and between $x_2$ and $x_3$, respectively, and let $r_{12}$ and $r_{23}$ be the corresponding recombination fractions. If we assume that $x_1$, $x_2$, and $x_3$ are so close that the chance of multiple crossovers between them is negligible, then the Morgan map function does not preserve the additivity of distance, but the Haldane (1919) map function does.

For the Morgan map function, $r_{12} = \theta_{12}$ and $r_{23} = \theta_{23}$. The map distance between $x_1$ and $x_3$ is $\theta_{13} = \theta_{12} + \theta_{23}$, hence the recombination fraction between $x_1$ and $x_3$ is $r_{13} = r_{12} + r_{23}$. On the other hand, since the chance of multiple crossovers is negligible,

\[
\begin{aligned}
\Pr(\text{recombination between } x_1 \text{ and } x_3)
&= \Pr(\text{recombination between } x_1 \text{ and } x_2 \text{ but not between } x_2 \text{ and } x_3) \\
&\quad + \Pr(\text{no recombination between } x_1 \text{ and } x_2 \text{ but recombination between } x_2 \text{ and } x_3) \\
&= \theta_{12}(1 - \theta_{23}) + (1 - \theta_{12})\theta_{23} \\
&= \theta_{12} + \theta_{23} - 2\theta_{12}\theta_{23} \\
&\neq r_{12} + r_{23}.
\end{aligned}
\]

This is a contradiction.

For the Haldane map function, since the map distance is $\theta_{12} + \theta_{23}$, $r_{13}$ is equal to $\tfrac{1}{2}\left(1 - e^{-2(\theta_{12} + \theta_{23})}\right)$, and

\[
\begin{aligned}
\Pr(\text{recombination between } x_1 \text{ and } x_3)
&= \tfrac{1}{2}\left(1 - e^{-2\theta_{12}}\right)\left[1 - \tfrac{1}{2}\left(1 - e^{-2\theta_{23}}\right)\right]
 + \tfrac{1}{2}\left(1 - e^{-2\theta_{23}}\right)\left[1 - \tfrac{1}{2}\left(1 - e^{-2\theta_{12}}\right)\right] \\
&= \tfrac{1}{2}\left\{\left(1 - e^{-2\theta_{12}}\right) + \left(1 - e^{-2\theta_{23}}\right) - \left(1 - e^{-2\theta_{12}}\right)\left(1 - e^{-2\theta_{23}}\right)\right\} \\
&= \tfrac{1}{2}\left(1 - e^{-2(\theta_{12} + \theta_{23})}\right).
\end{aligned}
\]

The additivity of map distance is preserved.

Also, "most map functions are determined by three loci comparisons which may not be consistent in terms of four or more loci" (Liberman and Karlin, 1984). If a map function is consistent for any number of loci, it is called multilocus-feasible. Several counterexamples in Ott (1991, p. 127) show that a map function that is not multilocus-feasible may yield a negative probability. Liberman and Karlin (1984) investigated these problems.
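The additivity argument above can also be checked numerically. The sketch below is an illustration with hypothetical function names: it combines the recombination fractions of two adjacent intervals with the rule $r_{12}(1 - r_{23}) + (1 - r_{12})r_{23}$ used in the derivation, and compares the result with the map function evaluated at the summed distance.

```python
import math


def haldane(theta):
    """Haldane map function: recombination fraction for map distance theta."""
    return 0.5 * (1.0 - math.exp(-2.0 * theta))


def morgan(theta):
    """Morgan map function: recombination fraction equals map distance."""
    return theta


def combine(r12, r23):
    """Recombination across two adjacent intervals: exactly one of the
    two intervals must show recombination."""
    return r12 * (1.0 - r23) + (1.0 - r12) * r23


def additivity_gap(map_fn, theta12, theta23):
    """Difference between M(theta12 + theta23) and the value obtained by
    combining the two intervals; zero means additivity is preserved."""
    return map_fn(theta12 + theta23) - combine(map_fn(theta12), map_fn(theta23))


if __name__ == "__main__":
    print("Haldane gap:", additivity_gap(haldane, 0.1, 0.2))
    print("Morgan gap: ", additivity_gap(morgan, 0.1, 0.2))
```

The Haldane gap vanishes up to floating-point error, while the Morgan gap equals $2\theta_{12}\theta_{23}$, the term that produces the contradiction in the text.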
Their conclusions can be summarized as follows. There are two methods for constructing a genetic map function: the first starts with a model of the recombination process, and the second uses a differential equation method. "The renewal crossover formation process model assumes that crossovers occur in succession along the chromosome, starting at a natural biological site (e.g., the centromere, some origin of replication, or a telomere) such that the lengths of the intervals between successive crossovers are positive random variables independently and identically distributed." For the first method, if a renewal crossover formation model is specified, the only feasible map function is the Haldane map function (Haldane, 1919). The corresponding crossover formation process is a Poisson process.

Theorem 1. Consider a crossover formation model in the form of a renewal process with intercrossover distribution $F(t)$, the distribution function of the distance between two successive crossovers, such that $d^{n}F(0)/dt^{n} \neq 0$ for some positive integer $n$. Then a map function exists if and only if the intercrossover distribution agrees with that of an exponential distribution, i.e., the renewal process is a Poisson process (Liberman and Karlin, 1984).

On the issue of multilocus feasibility of map functions, they derived necessary and sufficient conditions.

Theorem 2. Let $r$ be the recombination fraction and $x$ the genetic distance.
Sufficient conditions: A map function $r = M(x)$ is multilocus-feasible if its derivatives $M^{(n)}$ obey the inequalities $(-1)^{n} M^{(n)}(x) < 0$, $n = 1, 2, \ldots$, for all $x > 0$.
Necessary conditions: Let $r = M(x)$ be a map function that is multilocus-feasible, and suppose that all derivatives $M^{(n)}(x)$ exist. Then $(-1)^{n} M^{(n)}(0) < 0$ for all $n = 1, 2, \ldots$.

Table 2.1 shows the multilocus feasibility of several map functions.
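As a worked check of the conditions in Theorem 2, the derivation below applies them to two map functions in their standard closed forms, $M(x) = \tfrac{1}{2}(1 - e^{-2x})$ for Haldane and $M(x) = \tfrac{1}{2}\tanh(2x)$ for Kosambi. It is a sketch intended only to show how the sign conditions are used; the choice of these two examples is ours.

```latex
% Haldane: every derivative has the sign required by the sufficient condition.
\[
M(x) = \tfrac{1}{2}\left(1 - e^{-2x}\right), \qquad
M^{(n)}(x) = (-1)^{n+1}\, 2^{\,n-1} e^{-2x}, \quad n = 1, 2, \ldots
\]
\[
(-1)^{n} M^{(n)}(x) = -\,2^{\,n-1} e^{-2x} < 0 \quad \text{for all } x > 0 .
\]

% Kosambi: the necessary condition already fails at n = 3.
\[
M(x) = \tfrac{1}{2}\tanh(2x), \qquad
M'(x) = \operatorname{sech}^{2}(2x), \qquad
M''(x) = -4\operatorname{sech}^{2}(2x)\tanh(2x),
\]
\[
M'''(x) = 16\operatorname{sech}^{2}(2x)\tanh^{2}(2x) - 8\operatorname{sech}^{4}(2x),
\qquad
(-1)^{3} M'''(0) = 8 > 0 .
\]
```

The Haldane function therefore satisfies the sufficient condition for every $n$, whereas the Kosambi function violates the necessary condition and cannot be multilocus-feasible.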
Note that the multilocus-feasible map functions are constructed using the crossover formation process method; in the table, $r$ is the recombination fraction and $x$ is the genetic map distance. Further extensions incorporate genetic interference; they are not discussed here, and the reader is referred to Karlin and Liberman (1994) and Speed (1997). Most map functions constructed by the differential equation method are not multilocus-feasible and are also not discussed.

Table 2.1. Multilocus-feasibility of several map functions.

Source                        Map function $r = M(x)$
Haldane (1919)                $r = \tfrac{1}{2}\left(1 - e^{-2x}\right)$
Ludwig (1934)                 $r = \tfrac{1}{2}\sin(2x)$
Kosambi (1944)                $r = \tfrac{1}{2}\tanh(2x)$
Carter and Falconer (1951)    $x = \tfrac{1}{4}\left[\tan^{-1}(2r) + \tanh^{-1}(2r)\right]$
Sturt (1976)                  $r = \tfrac{1}{2}\left[1 - (1 - x/L)\, e^{-x(2L-1)/L}\right]$
Rao et al. (1977)             $x = \tfrac{1}{6}\left[p(2p-1)(1-4p)\ln(1-2r) + 16p(p-1)(2p-1)\tan^{-1}(2r) + 2p(1-p)(8p+2)\tanh^{-1}(2r) + 6(1-p)(1-2p)(1-4p)\,r\right]$
Felsenstein (1979)            $r = \dfrac{1 - e^{2(K-2)x}}{2\left(1 - (K-1)\, e^{2(K-2)x}\right)}$

[Column "Multilocus feasible": entries illegible in the source.]

Source: Liberman and Karlin, 1984.

2.2 An Introduction to Human Linkage Data for Genetic Dissection

The purpose of a linkage study is to locate the gene of interest. With contemporary DNA-level genetic marker technology (Botstein et al., 1980), one way to achieve this is to study the linkage between genes and markers. "When two genes are inherited independently of each other, recombinants and nonrecombinants are expected in equal proportions among the offspring. For some pairs of genes, one observes a consistent deviation from the 1:1 ratio of recombinant to nonrecombinant offspring. ... In other words, alleles of different genes appear to be genetically coupled, and this phenomenon is called genetic linkage" (Ott, 1991, p. 6). A marker is a small identifiable physical region on a chromosome whose inheritance can be monitored (DOE, 1992). The closer a marker is to the gene, the smaller the chance that they will be separated by recombination during meiosis.
Therefore, if a marker has a strong correlation with the phenotype of a gene, the gene should be in proximity to the marker. There are three major categories of methods using human genetic data: pedigree analysis, allele-sharing methods, and association studies. Risch and Merikangas (1996) summarized these methods as follows.

In pedigree analysis, we first collect pedigrees that contain affected members. Next we propose a genetic model (location of the gene, allele frequency, mode of inheritance, and so on), say $M_1$, to explain the pattern observed in the data, and compare the likelihood under $M_1$ with the likelihood under the null hypothesis $M_0$, which assumes no gene in the region of the marker, by the likelihood ratio

\[
\frac{L(\text{data} \mid M_0)}{L(\text{data} \mid M_1)},
\]

or equivalently by the logarithm of the odds (LOD) score (Barnard, 1949),

\[
\log_{10} \frac{L(\text{data} \mid M_1)}{L(\text{data} \mid M_0)}.
\]

Since this method requires specifying a genetic model, it is mostly used to identify simple Mendelian trait genes.

In allele-sharing methods, we first collect relative pairs or groups (most of the time, affected relatives) and try to show that the pattern observed in the data is not due to random Mendelian segregation. Most of these methods use identical-by-descent scores, that is, the number of identical copies of the marker alleles that the relatives share. The major advantage is that no specific genetic model is required for the test, so these methods are considered nonparametric. They are mostly used for detecting non-Mendelian complex trait genes.

An association study is a case-control study. Instead of using familial data, it collects unrelated affected and unaffected individuals and compares the allele frequencies of the candidate genes or markers. If the disease gene is one of the candidate genes, or is very strongly associated with some marker alleles, then the affected group will have higher frequencies of those alleles than the control group.
This method is mostly used after the possible region of the genes has been narrowed down, because it involves examining genes and/or markers on a very fine scale, i.e., examining many markers in a small area of a chromosome. "Association studies seem to be of greater power than linkage studies. But of course, the limitation of association studies is that the actual gene or genes involved in the disease must be tentatively identified before the test can be performed. ... Thus, the primary limitation of genomewide association tests is not a statistical one but a technological one" (Risch and Merikangas, 1996, p. 1517).

Allele-sharing methods include the affected-sib-pair (ASP) method and its variants. Penrose is considered the first person to have proposed the sib-pair method (Conneally and Rivas, 1980; Shah and Green, 1994). In 1935, Penrose proposed a simple $\chi^2$ test to detect linkage between two characters. In 1938, he extended this method to "graded" characters, that is, characters that can take intermediate values. In 1950, Penrose proposed the concept of the affected-sib-pair method as a means of analyzing red hair phenotype and ABO blood group data. He wrote: "Accuracy is preserved and uninformative dead wood excluded if a set of sibships is selected by the presence of one of the test characters ..." A general form of the sib-pair method was proposed by Penrose in 1953. These methods have since been extended to examine identical-by-descent scores (elaborated in the following sections), identical-by-state scores of markers, and the risk ratio between two relatives.

The idea of the affected-sib-pair method is to include in the study only pairs of sibs who are both affected. If the disease is caused by a gene, then both affected sibs tend to have received the same gene. Hence, if a marker is close to the gene, the marker will tend to segregate with the gene during meiosis. How to detect the linkage is covered in later sections.
The major advantages of the affected-sib-pair method over pedigree analysis were summarized by Holmans and Craddock (1997): (1) the affected-sib-pair method does not require specification of a genetic model, which is important for complex diseases where the mode of inheritance is unclear; (2) it is generally easier to collect affected sibling data than to collect a large, multigeneration pedigree with multiple affected members; and (3) affected sib-pairs are more likely to be informative for linkage than large pedigrees under oligogenic epistatic models, which are plausible for a number of complex traits.

The disadvantage of the affected-sib-pair method is that it is considered less powerful than traditional pedigree analysis when the genetic model can be specified. If the genetic model cannot be specified, the recombination fraction cannot be estimated with affected-sib-pair methods. Suarez, Rich, and Reich (1978) and Blackwelder and Elston (1985) suggested that sampling only affected sib-pairs is more powerful than sampling affected-unaffected sib-pairs under a fixed sample size constraint. Some sampling issues in linkage studies, as summarized by Gershon et al. (1994), appear in Table 2.2.

Table 2.2. Some sampling issues in linkage studies.

Issue: Very large pedigree
  Pro: Statistical power high under homogeneity assumption.
  Con: Extension may actually introduce heterogeneity. Very hard to find.
Issue: Medium-size pedigree series
  Pro: Less likely to have heterogeneity within pedigrees. More likely to get generalizable results.
  Con: Heterogeneity between pedigrees.
Issue: Nuclear families
  Pro: (none listed)
  Con: Lower power: very many needed if heterogeneity present.
Issue: Affected sib-pairs
  Pro: Model-free.
  Con: Even lower statistical power.

Source: Gershon et al., 1994.

2.3 Identical by State, Identical by Descent

IBS stands for "identical by state." It means that two alleles are the same, regardless of whether or not they are copies of the same ancestral allele. IBD stands for "identical by descent."
Two alleles are said to be IBD if they are copies of the same ancestral allele. Since humans are diploid, that is, they carry two haploid sets of chromosomes, two sibs can share 2, 1, or 0 marker alleles IBD at a locus. Consequently, their IBD score equals 2, 1, or 0, respectively. We will discuss the distribution of the IBD score in the next section. We consider two alleles identical by descent only if they are copies of the same ancestral allele in the pedigree under study, i.e., if the alleles are traceable.

With the affected-sib-pair method, the sib pairs are selected on the condition that both sibs are affected with the same disease (biological character). If the disease is caused by a gene, then a marker adjacent to the disease gene has a higher probability of segregating with the gene during meiosis. Hence, this marker tends to have a higher IBD score. This method can therefore serve as a tool to locate trait genes.

2.4 Distributions of IBD

Under ideal conditions, in using the affected-sib-pair method, we can use IBD scores to make inferences about the recombination fraction between the trait gene(s) and the marker(s). Hence it is important to know the distribution of the IBD scores. According to Mendel's First Law, each of a parent's two haplotypes has an equal chance of being passed to an offspring. Therefore, if sibs are not selected conditionally on any biological character, the probability that a pair of full sibs has an IBD score of 2 is 0.25, of 1 is 0.5, and of 0 is 0.25. If sibs are selected conditional on a biological character and the marker(s) examined are unlinked to the trait gene(s) responsible for that character, then, regardless of the inheritance mode, penetrance, and allele frequency of the trait gene, the probability mass function is again 0.25, 0.5, and 0.25 for IBD equal to 2, 1, and 0, respectively.
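The null probabilities 0.25, 0.5, and 0.25 can be checked by brute-force enumeration. The following is a minimal sketch, assuming fully informative parents whose two haplotypes are labeled 0 and 1 (these labels and the function name are illustrative only, not part of the cited work).

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def null_ibd_distribution():
    """Enumerate the equally likely transmissions to two sibs under Mendel's first law.

    Each sib receives one of the father's two haplotypes (0 or 1) and one of the
    mother's two haplotypes (0 or 1). The IBD score of the pair is the number of
    parental haplotypes the two sibs have in common.
    """
    transmissions = list(product((0, 1), repeat=2))  # (paternal, maternal) choice
    counts = Counter()
    for sib1, sib2 in product(transmissions, repeat=2):
        ibd = int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])
        counts[ibd] += 1
    total = sum(counts.values())
    return {score: Fraction(counts[score], total) for score in (2, 1, 0)}


dist = null_ibd_distribution()
print(dist)
```

The enumeration covers 16 equally likely joint transmissions, of which 4 give IBD = 2, 8 give IBD = 1, and 4 give IBD = 0.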
Therefore, if the observed IBD scores deviate from the null mass function (0.25, 0.5, 0.25), this is evidence that they do not arise from the random (unlinked) model. Nevertheless, the distribution of the IBD score under specified conditions is still worth investigating, especially for developing more powerful test statistics or for power analysis.

Li and Sacks (1954) developed a method using stochastic matrices to derive the joint distribution of the genotypes of two relatives. They used an example to demonstrate their method: a gene with two alleles, say A and a, with population frequencies $p$ and $q = 1 - p$. First, obtain a conditional probability matrix $P = \{p_{ij}\}$, with rows indexed by the genotype of relative 1 and columns by the genotype of relative 2, both in the order AA, Aa, aa:

$$P = \begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{pmatrix},$$

where $p_{11}$ is the conditional probability, given that relative 1 has AA, that relative 2 has AA; $p_{12}$ that relative 2 has Aa; $p_{13}$ that relative 2 has aa; and so on. Once such a matrix is obtained, the absolute frequencies of all combinations of genotypes of relative pairs are obtained by multiplying each row by the corresponding genotype frequency of relative 1. In their example, the first row $(p_{11}, p_{12}, p_{13})$ is multiplied by $p^2$, the second row by $2pq$, and the third row by $q^2$.

Li and Sacks' method uses three basic transition probability matrices, $I$, $T$, and $O$, to construct $P$. The matrices $I$, $T$, and $O$ are defined in the same way as $P$, except that they are conditional on the pair of relatives having both, one, or no genes identical by descent, respectively. In their example,

$$I = \{i_{lj}\} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad T = \{t_{lj}\} = \begin{pmatrix} p & q & 0 \\ \tfrac12 p & \tfrac12 & \tfrac12 q \\ 0 & p & q \end{pmatrix}, \quad O = \{o_{lj}\} = \begin{pmatrix} p^2 & 2pq & q^2 \\ p^2 & 2pq & q^2 \\ p^2 & 2pq & q^2 \end{pmatrix},$$

where $l = 1, 2, 3$ and $j = 1, 2, 3$. For the matrix $I$: given that relatives 1 and 2 share two genes identical by descent, the probability that relative 2 has genotype AA given that relative 1 has AA equals 1, so $i_{11} = 1$. For the same reason, the diagonal elements $i_{ll}$ of $I$ are equal to one, and the off-diagonal elements are equal to zero.
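The construction of the $I$, $T$, and $O$ matrices and of the joint genotype frequencies can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, exact fractions are used to avoid rounding, and the full-sib weights (1/4, 1/2, 1/4) are taken from the null IBD distribution of full sibs given earlier in this section.

```python
from fractions import Fraction


def ito_matrices(p):
    """Return the I, T, O matrices for allele frequency p of A (q = 1 - p).

    Rows index the genotype of relative 1 (AA, Aa, aa); columns index relative 2.
    """
    q = 1 - p
    half = Fraction(1, 2)
    I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    T = [[p, q, 0], [half * p, half, half * q], [0, p, q]]
    O = [[p * p, 2 * p * q, q * q] for _ in range(3)]
    return I, T, O


def mix(weights, matrices):
    """Weighted sum of 3x3 matrices, e.g. P = c2*I + c1*T + c0*O."""
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(3)]
            for i in range(3)]


def joint_frequencies(P, p):
    """Multiply each row of P by the genotype frequency of relative 1."""
    q = 1 - p
    geno = [p * p, 2 * p * q, q * q]
    return [[geno[i] * P[i][j] for j in range(3)] for i in range(3)]


p = Fraction(3, 10)  # illustrative allele frequency
I, T, O = ito_matrices(p)
S = mix([Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)], [I, T, O])  # full sibs
joint = joint_frequencies(S, p)
```

Each row of `S` sums to one, and `joint` is symmetric, as a joint distribution of two exchangeable sibs should be.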
Now consider the matrix $T$. Given that the two relatives share only one gene IBD, if relative 1 has genotype AA and relative 2 also has genotype AA, then relative 2 must have received one of his A alleles from a source other than the ancestor from whom relative 1 received his A allele. Since the population frequency of A is $p$, $t_{11} = p$. The other elements of $T$ and $O$ can be obtained by the same reasoning.

They showed several examples of how to use $I$, $T$, and $O$ to construct $P$. For the parent-offspring relationship, the transition matrix is $P = T$; the grandparent-grandchild pair has transition matrix

$$P = T^2 = \tfrac12 T + \tfrac12 O,$$

and for general parent-offspring type relatives,

$$P = T^{n+1} = \left(\tfrac12\right)^{n} T + \left[1 - \left(\tfrac12\right)^{n}\right] O,$$

where $n + 1$ is the number of generations between the two relatives. For full sibs, the matrix is

$$P = S = \tfrac14 I + \tfrac12 T + \tfrac14 O.$$

Campbell and Elston (1971) extended Li and Sacks' method to derive the transition probability matrices for different modes of inheritance and multiple loci. Haseman and Elston (1972) constructed the joint probability matrix for sibs. Risch (1990b) and Bishop and Williamson (1990) summarized these works and gave a table of the trait IBD distribution conditional on marker IBD and relationship for sibs, grandparent-grandchild, uncle-nephew, half-sibs, and first cousins. It is adapted here in Table 2.3.

Table 2.3. The values of Pr(IBD of trait gene | IBD of marker and relationship).

Relationship | IBD of marker | IBD of trait = 2 | IBD of trait = 1 | IBD of trait = 0
Sibs | 2 | $\Psi^2$ | $2\Psi(1-\Psi)$ | $(1-\Psi)^2$
Sibs | 1 | $\Psi(1-\Psi)$ | $1 - 2\Psi + 2\Psi^2$ | $\Psi(1-\Psi)$
Sibs | 0 | $(1-\Psi)^2$ | $2\Psi(1-\Psi)$ | $\Psi^2$
Grandparent-grandchild | 1 | | $1-\theta$ | $\theta$
Grandparent-grandchild | 0 | | $\theta$ | $1-\theta$
Uncle-nephew | 1 | | $\Psi(1-\theta) + \tfrac12\theta$ | $\tfrac12\theta + (1-\Psi)(1-\theta)$
Uncle-nephew | 0 | | $\tfrac12\theta + (1-\Psi)(1-\theta)$ | $\Psi(1-\theta) + \tfrac12\theta$
Half-sibs | 1 | | $\Psi$ | $1-\Psi$
Half-sibs | 0 | | $1-\Psi$ | $\Psi$
First cousins | 1 | | $\Psi(1-\theta)^2 + \tfrac12\theta^2$ | $1 - \tfrac12\theta^2 - \Psi(1-\theta)^2$
First cousins | 0 | | $\tfrac13\{1 - \tfrac12\theta^2 - \Psi(1-\theta)^2\}$ | $\tfrac13\{2 + \Psi(1-\theta)^2 + \tfrac12\theta^2\}$

Here $\theta$ is the recombination fraction between the marker and the trait gene, and $\Psi = \theta^2 + (1-\theta)^2$.

Holmans (1993) showed that for an affected sib pair, the possible values of the probability mass function $p = (p_0, p_1, 1 - p_0 - p_1)$ of marker IBD scores 0, 1, and 2 are restricted to what he called the "possible triangle," which is the intersection of $p_0 \ge 0$, $p_1 \le \tfrac12$, and $p_1 \ge 2p_0$, for all genetic models. His proof is included here. Let $p_i$ and $z_i$ be the probabilities that a sib pair shares $i$ alleles IBD at the marker locus and at the trait locus, respectively, let $\theta$ be the recombination fraction between the marker and the trait gene, and let $\Psi = \theta^2 + (1-\theta)^2$. Then,

$$p_0 = \Psi^2 z_0 + \Psi(1-\Psi) z_1 + (1-\Psi)^2 z_2,$$
$$p_1 = 2\Psi(1-\Psi) z_0 + [\Psi^2 + (1-\Psi)^2] z_1 + 2\Psi(1-\Psi) z_2,$$
$$p_2 = (1-\Psi)^2 z_0 + \Psi(1-\Psi) z_1 + \Psi^2 z_2.$$

From the second equation,

$$p_1 = 2\Psi(1-\Psi)(z_0 + z_1 + z_2) + [\Psi^2 + (1-\Psi)^2 - 2\Psi(1-\Psi)] z_1 = 2\Psi(1-\Psi) + (2\Psi-1)^2 z_1.$$

Since $\tfrac12 \le \Psi \le 1$ and it can be shown that $z_1 \le \tfrac12$,

$$p_1 \le 2\Psi(1-\Psi) + (2\Psi-1)^2/2 = \tfrac12.$$

From the first and second equations,

$$\begin{aligned}
p_1 - 2p_0 &= (2\Psi - 4\Psi^2) z_0 + (2\Psi-1)^2 z_1 + \{(1-\Psi)[2\Psi - 2(1-\Psi)]\} z_2 \\
&= (2\Psi-1)^2 z_1 + 2(1-\Psi)(2\Psi-1)(1 - z_1 - z_0) - 2\Psi(2\Psi-1) z_0 \\
&\ge (2\Psi-1)^2 z_1 + 2(1-\Psi)(2\Psi-1) - 3(1-\Psi)(2\Psi-1) z_1 - \Psi(2\Psi-1) z_1 \\
&= 2(1-\Psi)(2\Psi-1)(1 - 2z_1) \ge 0,
\end{aligned}$$

where the inequality step uses $z_0 \le z_1/2$ at the trait locus. This proves the "possible triangle" restriction. He also proposed a procedure for finding the maximum likelihood estimator under the above restriction. That procedure is as follows:

1. Obtain $\hat z = (\hat z_0, \hat z_1, 1 - \hat z_0 - \hat z_1)$ from the unrestricted method, i.e., the sample frequencies.
2. If $\hat z_1 > \tfrac12$, remaximize subject to the constraint $z_1 = \tfrac12$. If the resulting $\hat z_0 > \tfrac14$, reset $\hat z$ to the null value, which is (0.25, 0.5, 0.25).
3. If $2\hat z_0 > \hat z_1$, remaximize subject to the constraint $z_1 = 2z_0$. If the resulting $\hat z_0 > \tfrac14$, reset $\hat z$ to the null value.
4. If $\hat z$ is in the triangle, leave it as it is.

This concludes the procedure.

Suarez, Rice, and Reich (1978) derived a generalized sib-pair IBD score distribution in which the IBD score distribution is conditioned on the number of affected sibs for a trait with two alleles. This conditional distribution showed that "the distribution of IBD scores depends only on the additive and dominance variances and the population prevalence of the disorder" (p. 94).

One problem with using IBD scores is that, in order to establish them, the marker must be highly polymorphic. "Often, identity by descent cannot unequivocally be established" (Ott, 1991, p. 79). To overcome this problem, Lange (1986a) proposed an affected-sib-set method based on IBS (identical by state), and at the same time he extended the affected-sib-pair method to the affected-sib-set (ASS) method (Lange, 1986b). He proposed a test statistic for the affected sibs in a single nuclear family,

$$Z = \sum_{i<j} X_{ij}, \tag{2.1}$$

$$X_{ij} = \begin{cases} 1 & i \text{ and } j \text{ concordant (i.e., IBS = 2)}, \\ \tfrac12 & i \text{ and } j \text{ half-concordant (i.e., IBS = 1)}, \\ 0 & i \text{ and } j \text{ discordant (i.e., IBS = 0)}, \end{cases} \tag{2.2}$$

where $i$ and $j$ index the members of the sib set. Lange noted that since they "will ultimately apply the Central Limit Theorem, it suffices to derive the mean and variance of Z for a particular affected sib set." The ASS method uses population allele frequencies to calculate the probabilities of all possible parental mating types and then, conditional on the parental mating type, calculates the probability of IBS for the sib pair. Lange gave the mean and variance of $Z$ "in the limiting case of an infinite number of alleles, each of infinitesimal frequency" (this author takes that to mean a very highly polymorphic marker) as follows:

$$E(Z) = s(s-1)/4, \tag{2.3}$$
$$\mathrm{Var}(Z) = s(s-1)/16, \tag{2.4}$$

where $s$ is the size of the sib set. He then presented a way to combine the $Z$ statistics from different sib sets,

$$T = \frac{\sum_s \sum_r w_{rs}\,(Z_{rs} - E(Z_{rs}))}{\left(\sum_s \sum_r w_{rs}^2\, \mathrm{Var}(Z_{rs})\right)^{1/2}}, \tag{2.5}$$

where $Z_{rs}$ denotes the $Z$ statistic for the $r$th affected sib set of size $s$, and $w_{rs}$ is a weight. Regarding the choice of $w_{rs}$, he claimed that if the weights depend only on $s$ and not on $r$, and if the number of sib sets is large, then $T$ should approximately follow a standard normal distribution. Based on Hodge's (1984) results, Lange recommended using the weights

$$w_{rs} = \frac{\sqrt{s-1}}{\sqrt{\mathrm{Var}(Z_{rs})}}.$$

In 1988, Weeks and Lange (1988) generalized the ASS method to pedigrees, calling the refined method the affected-pedigree-member (APM) method.
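As an illustration of the sib-set statistic in (2.1) and (2.2), the sketch below scores each pair of sibs by the number of marker alleles shared in state and compares the total with the limiting moments (2.3) and (2.4). It assumes unordered genotypes given as pairs of allele labels; the function names and the example genotypes are hypothetical, and the infinite-allele moments are used only as the reference case quoted above.

```python
from itertools import combinations


def ibs_count(g1, g2):
    """Number of alleles two unordered genotypes share in state (0, 1, or 2)."""
    remaining = list(g2)
    shared = 0
    for allele in g1:
        if allele in remaining:
            remaining.remove(allele)  # each allele of g2 can be matched only once
            shared += 1
    return shared


def sib_set_z(genotypes):
    """Z = sum over sib pairs of X_ij, where X_ij = 1, 1/2, 0 for IBS = 2, 1, 0."""
    return sum(ibs_count(a, b) / 2 for a, b in combinations(genotypes, 2))


def limiting_moments(s):
    """E(Z) and Var(Z) from (2.3) and (2.4) for a sib set of size s."""
    return s * (s - 1) / 4, s * (s - 1) / 16


sibs = [("a1", "a2"), ("a1", "a3"), ("a1", "a2")]  # made-up genotypes
z = sib_set_z(sibs)
mean, var = limiting_moments(len(sibs))
standardized = (z - mean) / var ** 0.5
```

For real markers with finite allele frequencies, the mean and variance would have to be computed from the parental mating-type probabilities, as described in the text, rather than from the limiting formulas.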
In this method only those pedigree members who are both affected and typed at the marker locus enter the definition of the test statistic. They modified the above test statistic $T$ in two ways. First, the $Z$ statistic was modified to

$$Z_{ij} = \tfrac14\,\delta(G_{ix}, G_{jx}) + \tfrac14\,\delta(G_{ix}, G_{jy}) + \tfrac14\,\delta(G_{iy}, G_{jx}) + \tfrac14\,\delta(G_{iy}, G_{jy}),$$

where $i$ and $j$ index the affected members in a pedigree, $G_{ix}$ and $G_{jx}$ are the maternal marker alleles of members $i$ and $j$, respectively, $G_{iy}$ and $G_{jy}$ are the paternal marker alleles of members $i$ and $j$, respectively, and

$$\delta(G, G') = \begin{cases} 1 & G \text{ and } G' \text{ match in state}, \\ 0 & G \text{ and } G' \text{ do not match in state}. \end{cases}$$

They claim this modification "permits computation of the theoretical mean and variance of the new test statistic for each pedigree by taking advantage of the theory of multiple-person ibd relation." Second, they introduced a new weighting factor,

$$w_m = \frac{\sqrt{r_m - 1}}{\sqrt{\mathrm{Var}(Z_m)}},$$

where $r_m$ is the number of affected and typed individuals in the $m$th pedigree, and $Z_m$ is the $Z$ statistic of the $m$th pedigree. Later, Weeks and Lange (1992) proposed a multilocus extension of the APM method.

Bishop and Williamson (1990) studied the power of IBS methods on affected relative pairs and showed that several factors have a major influence on the power. These factors are the relationship of the affected pair, the polymorphism of the marker, the recombination distance between a trait locus and the closest marker, and the mode of inheritance of the trait.

2.5 Risk Ratio

In 1990, Risch published a series of papers addressing risk ratio methods. In the first of this series (1990a), a risk ratio, $\lambda_R$, is defined as the ratio of the probability of being affected, given that a type-R relative is affected, to the population prevalence. The relationship subscripts are as follows: M = monozygotic twin; S = sibling (or dizygotic twin); O = parent (or offspring); 2 = second-degree relative; 3 = third-degree relative.
For a single-locus model, two recurrence-risk patterns were derived:

$$\lambda_O - 1 = 2(\lambda_2 - 1) = 4(\lambda_3 - 1) \tag{2.6}$$

and

$$\lambda_M = 4\lambda_S - 2\lambda_O - 1. \tag{2.7}$$

Similar results were obtained for multiplicative, additive, and genetic heterogeneity two-locus models and a general multilocus model. Important conclusions are: "for a single-locus model, $\lambda_R - 1$ decreases by a factor of two with each degree of relationship. . . . For a multiplicative (epistasis) model, $\lambda_R - 1$ decreases more rapidly than by a factor of two with each degree of relationship. Examination of $\lambda_R$ values for various classes of relatives can potentially suggest the presence of multiple loci and epistasis."

In the second paper (1990b), Risch derived the probabilities of IBD scores for a completely polymorphic marker for several types of relatives under the assumption that the recurrence-risk pattern (2.6) holds. According to the first paper of the series, this assumption might be violated if multiple contributing loci are present. The IBD distributions are summarized in Table 2.4. He concluded that "for diseases with large $\lambda$ values and for small $\theta$ value, distant relatives offer greater power. For large $\theta$ values, grandparent-grandchild pairs are best; for small $\lambda$ values, sibs are best." The third paper (1990c) took into account the effect of marker polymorphism on power. There were some errors in the third paper, and Risch later wrote a response addressing and correcting them (Risch, 1992). The feasibility of his method depends on whether a suitable control group is included in the study or an appropriate estimate of the general population risk can be obtained.

Table 2.4. The probability of IBD score for different relatives.

Relative pair | Marker IBD | Probability
Affected sib pairs | 2 | $\tfrac14 + \tfrac{1}{4\lambda_S}(2\Psi-1)[(\lambda_S - 1) + 2\Psi(\lambda_S - \lambda_O)]$
Affected sib pairs | 1 | $\tfrac12 - \tfrac{1}{2\lambda_S}(2\Psi-1)^2(\lambda_S - \lambda_O)$
Affected sib pairs | 0 | $\tfrac14 - \tfrac{1}{4\lambda_S}(2\Psi-1)[(\lambda_S - 1) + 2(1-\Psi)(\lambda_S - \lambda_O)]$
Grandparent-grandchild pairs | 1 | $\tfrac12 + \tfrac12(1-2\theta)\,\frac{\lambda_O - 1}{\lambda_O + 1}$
Grandparent-grandchild pairs | 0 | $\tfrac12 - \tfrac12(1-2\theta)\,\frac{\lambda_O - 1}{\lambda_O + 1}$
Uncle (aunt)-nephew (niece) pairs | 1 | $\tfrac12 + \tfrac12(1-\theta)(1-2\theta)^2\,\frac{\lambda_O - 1}{\lambda_O + 1}$
Uncle (aunt)-nephew (niece) pairs | 0 | $\tfrac12 - \tfrac12(1-\theta)(1-2\theta)^2\,\frac{\lambda_O - 1}{\lambda_O + 1}$
Half-sib pairs | 1 | $\tfrac12 + \tfrac12(2\Psi-1)\,\frac{\lambda_O - 1}{\lambda_O + 1}$
Half-sib pairs | 0 | $\tfrac12 - \tfrac12(2\Psi-1)\,\frac{\lambda_O - 1}{\lambda_O + 1}$
First-cousin pairs | 1 | $\tfrac14 + [(1-\theta)^4 + \theta^2(1-\theta)^2 + \tfrac12\theta^2 - \tfrac14]\,\frac{\lambda_O - 1}{\lambda_O + 3}$
First-cousin pairs | 0 | $\tfrac34 - [(1-\theta)^4 + \theta^2(1-\theta)^2 + \tfrac12\theta^2 - \tfrac14]\,\frac{\lambda_O - 1}{\lambda_O + 3}$

Here $\theta$ is the recombination fraction between gene and marker, and $\Psi = \theta^2 + (1-\theta)^2$.

2.6 Test Statistics Based on IBD Score

There are several test statistics based on the IBD score of the sib-pair method to detect linkage. One is the Two-allele (proportion) test statistic, which is the number (or proportion) of sib pairs that share two alleles, proposed by Day and Simons (1976) and Suarez, Rice, and Reich (1978). The second, the Mean test statistic, is the sum (or sample mean) of IBD scores, proposed by Green and Woodrow (1977). The third is to use the Pearson chi-square goodness-of-fit test (Pearson, 1900) on the distribution of IBD scores. All these tests assume that the IBD score frequencies are 0.25, 0.5, and 0.25 under the unlinked condition, and use the deviation of the observed IBD scores from these frequencies to detect linkage.

Several papers discuss the merits of these test statistics. The Two-allele test statistic proposed by Day and Simons (1976) was based on the supposition that "the probability of sharing both haplotypes deviates more from the probability under the null hypothesis than does the probability of having one, or at least one, haplotype in common. One would test for the existence of the DS (disease susceptibility) gene by comparing the observed frequency of having both haplotypes in common." Suarez, Rice, and Reich (1978) suggested pooling categories IBD = 0 and IBD = 1, based on their generalized sib-pair IBD distribution mentioned in Section 2.4, "Distributions of IBD." They demonstrated that the affected sib pairs that share two alleles carry the most information, noting that "with minimal loss of information, we can pool categories Pr{IBD=0,1}."
Green and Woodrow (1977) used the total number of "repeats," i.e., the sum of the IBD scores, as the test criterion. They also suggested using only affected sib pairs because the distribution of IBD scores is symmetric under the null hypothesis, so the sum of "repeats" approaches a normal distribution more quickly than under other sampling schemes. Blackwelder and Elston (1985) examined the Two-allele test statistic, the Mean test statistic, and the Goodness-of-fit test statistic by calculating the exact probabilities. They concluded that the Mean test statistic is generally more powerful than either of the other two.

Schaid and Nick (1990) proposed a test statistic that uses a linear combination of the numbers of marker alleles IBD. If the alternative IBD distribution can be specified, this test statistic can be optimized. They also suggested a $t_{max}$ test statistic, the maximum of the Mean test statistic and the Two-allele test statistic, and showed that $t_{max}$ is more robust in the sense that, if the mode of inheritance (recessive, dominant, and so on) is misspecified, the loss of power may not be too great.

Faraway (1993) proposed a modified chi-square test. This test is a special case of a general method, the $\chi^2$ test with restricted alternatives, given by Lehmann (1986, p. 480). Faraway showed that it is more powerful than the Two-allele, the Mean, the Pearson chi-square, or Schaid and Nick's $t_{max}$ tests for finite samples and claimed that it is the asymptotically uniformly most powerful invariant test. More details are given in the following. For testing the hypothesis $H_0: p_0 = p_2 = 0.25,\ p_1 = 0.5$ vs. $H_1: p_0 + p_1 + p_2 = 1$, let the chi-square goodness-of-fit test statistic be

$$Y = 4n(\hat p_0 - 0.25)^2 + 2n(\hat p_1 - 0.5)^2 + 4n(\hat p_2 - 0.25)^2,$$

where $\hat p_i$ is the sample frequency of IBD = $i$ and $n$ is the total sample size.
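The three classical statistics discussed in this section can all be computed from the counts of sib pairs with IBD scores 0, 1, and 2. The sketch below is illustrative only: the normal standardizations of the Mean and Two-allele statistics are an assumption based on the null frequencies (0.25, 0.5, 0.25), not formulas quoted from the cited papers, and the function name is hypothetical.

```python
import math


def asp_test_statistics(n0, n1, n2):
    """Return (mean test z, two-allele test z, Pearson Y) from IBD counts.

    n0, n1, n2 are the numbers of affected sib pairs with IBD score 0, 1, 2.
    """
    n = n0 + n1 + n2
    # Mean test: under the null, the total IBD score has mean n and variance n/2.
    mean_z = (n1 + 2 * n2 - n) / math.sqrt(n / 2)
    # Two-allele test: under the null, n2 is Binomial(n, 1/4).
    two_allele_z = (n2 - n / 4) / math.sqrt(3 * n / 16)
    # Pearson goodness-of-fit statistic Y against (0.25, 0.5, 0.25).
    p0, p1, p2 = n0 / n, n1 / n, n2 / n
    y = 4 * n * (p0 - 0.25) ** 2 + 2 * n * (p1 - 0.5) ** 2 + 4 * n * (p2 - 0.25) ** 2
    return mean_z, two_allele_z, y


mean_z, two_allele_z, y = asp_test_statistics(n0=15, n1=45, n2=40)
```

For these made-up counts the sharing is shifted toward IBD = 2, so all three statistics are positive; with counts exactly at the null proportions all three are zero.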
Faraway showed that the possible values of the IBD score distribution, $p_i$, $i = 0, 1, 2$, are restricted to a region $F$ that is the intersection of $p_1 + p_2 \le 1$, $p_1 \le 1/2$, and $3p_1/2 + p_2 \ge 1$, and claimed that this region is equivalent to the "possible triangle" in Holmans (1993). He then showed that, under this restriction, the $(\hat p_0, \hat p_1, \hat p_2)$ that maximizes the statistic $Y$ turns $Y$ into one of the Two-allele test statistic, the Mean test statistic, $Y$ itself, or 0. The results are as follows.

All these discussions regarding the merits of different test statistics were based on a model in which the trait locus has only two alleles. Since the distribution of the IBD score depends on the penetrance of the genotypes as well as on their frequencies, whether these results extend to a model that assumes three or more alleles for a disease susceptibility gene needs further investigation.

Based on the same two-allele model and on Suarez, Rice, and Reich's (1978) generalized sib-pair IBD score distribution, Knapp, Seuchter, and Baur (1994a) showed that if $f_1^2 = f_0 f_2$, the Mean test is the uniformly most powerful test in $\theta$ (the recombination fraction) regardless of the allele frequency of t, where $f_0$ is the penetrance of the genotype tt (t is the susceptibility allele and T is the normal allele), $f_1$ is that of the genotype Tt, and $f_2$ is that of the genotype TT. It is clear that recessive diseases ($f_1 = f_2 = 0$) satisfy this condition, with either complete or incomplete penetrance ($f_0 = 1$ or $f_0 < 1$). The authors also proved that the Mean test is the locally most powerful test (there is no test with a larger power for any alternative within a neighborhood of $H_0$, according to Rao, 1973), irrespective of the mode of inheritance. They stated, "Because uniform optimality implies local optimality, no other test than the mean test can be uniformly most powerful."
In another paper (1994b), Knapp, Seuchter, and Baur presented the equivalence of the mean test and the parametric maximum LOD score analysis with an assumed recessive mode of inheritance.

Region | Test statistic
$\hat p \in F$ | $Y$
$2\hat p_2 + \hat p_1 > 1$, $\hat p_1 > 1/2$ | Mean
$3\hat p_1/2 + \hat p_2 < 1$, $\hat p_2 > 1/4$ | Two-allele
$2\hat p_2 + \hat p_1 < 1$, $\hat p_2 < 1/4$ | 0

2.7 Likelihood Ratio Test

The generalized likelihood ratio test has been commonly used in genetics for detecting linkage and heterogeneity. The LOD, an acronym for "logarithm of the odds ratio" (Barnard, 1949), is a special version of the likelihood ratio. However, the classical likelihood ratio test theory cannot be applied directly. Let us first review the theorem. In his book "Approximation Theorems of Mathematical Statistics" (1980, p. 144), Serfling states the following:

Regularity Conditions on $F$. Consider $\Theta$ to be an open interval (not necessarily finite) in $R$. We assume: (R1) For each $\theta \in \Theta$, the derivatives . . . ,

where $F$ is the family of distribution functions, and $\Theta$ is the parameter space of $F$.

Assuming the regularity conditions, it can be shown that the maximum likelihood estimator of $\theta$ has an asymptotic normal distribution (Serfling, 1980, p. 145). With this property and Lemma A (Serfling, p. 153), the likelihood ratio test theorem (Serfling, p. 155) can be proven: Consider testing $H_0: \theta = \theta_0$, where $\theta_0$ is a $k$-dimensional vector in $\Theta$. Let

$$\Lambda_n = \frac{L(\theta_0)}{\sup_{\theta \in \Theta} L(\theta)},$$

where $n$ is the sample size. Then, under $H_0$, the statistic $-2\ln\Lambda_n$ converges in distribution to $\chi^2_k$ (Serfling, 1980). The purpose of the regularity conditions is to ensure the existence of the Taylor expansion of the distribution functions (Serfling, p. 145). The parameter space was assumed to be an open interval in $R$; in fact, if $\theta_0$ is an interior point of the parameter space $\Theta$, the theorem holds regardless of whether the parameter space is an open interval or not. Let

$$\mathrm{LOD} = \log_{10} \frac{\max_{0 \le \theta \le 0.5} L(\theta)}{L(\theta = 0.5)}. \tag{2.8}$$
If the genetic model is known or is assumed to be known, some authors use the LOD score to test $H_0: \theta = 0.5$ vs. $H_1: \theta < 0.5$, where $\theta$ is the recombination fraction between a marker and a gene, and claim that $2\ln(10) \times \mathrm{LOD} \sim \chi^2_1$ (Chotai, 1984; Elston, 1994; Shute, 1988). However, this is incorrect because $\theta_0 = 0.5$ is on the boundary of the parameter space [0, 0.5]. Therefore, the maximum likelihood estimator does not have an asymptotic normal distribution. Similar errors occur when testing heterogeneity, where the parameter space is [0, 1] and the null hypothesis of homogeneity, i.e., $\alpha = 1$, is on the boundary of the parameter space (Ott, 1983), where $\alpha$ is the proportion of families belonging to the linked group. Ott (1991) attempted to correct this error, arguing that LOD is a "one-sided" test. The correct asymptotic distribution of the maximum likelihood ratio when the null parameter value is on the boundary of the parameter space was given by Self and Liang (1987). The asymptotic distribution of $2\ln(10)$ times the LOD score in Eq. (2.8) is a 50:50 mixture of a $\chi^2_1$ distribution and a point mass at 0.

To make inferences, a critical value of the LOD score needs to be decided. For simple Mendelian disease, the conventional criterion for claiming linkage is LOD > 3 (Morton, 1955, 1956). The probability of obtaining a LOD > 3 under the null hypothesis is about 0.0001. One reason for such a low significance level is that the prior probability of the gene being within a certain distance from the marker is small; another reason is that we would rather have no linkage map than have a wrong linkage map (Morton, 1955, 1956).

Ott (1994) gave a Bayesian argument, asking us to consider two hypotheses, $H_0$: free recombination or absence of linkage ($\theta = 0.5$), and $H_1$: $\theta < 0.5$. The "prior" probability that a gene and a marker are within a measurable distance (40 cM) is small; based on Elston and Lange's result in 1975, he took the probability of a marker and a gene being within 40 cM to be 0.02.
The posterior odds for linkage are

$$\frac{P(H_1 \mid \text{data})}{P(H_0 \mid \text{data})} = \left(\frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}\right)\left(\frac{P(H_1)}{P(H_0)}\right).$$

Since, with LOD = 3, the first term on the right-hand side is about 1,000:1 and the second term is about 1:50, the posterior odds for linkage are about 20:1, i.e., the posterior odds against linkage are about 0.05. Since a multiple-gene effect cannot be ruled out for a complex trait, he added one more hypothesis, the "other" hypothesis, $H_2$. The prior probabilities are given as follows, with $h$ being the prior probability of $H_2$.

Hypothesis | Prior probability
$H_0$: Single gene, no linkage | $0.98(1-h)$
$H_1$: Single gene, linkage | $0.02(1-h)$
$H_2$: "Other" | $h$

The posterior odds are equal to

$$Q = \frac{P(H_1 \mid \text{data})}{P(\bar H_1 \mid \text{data})} = \frac{P(\text{data} \mid H_1)\,P(H_1)}{P(\text{data} \mid H_0)\,P(H_0) + P(\text{data} \mid H_2)\,P(H_2)},$$

where $\bar H_1$ means "not $H_1$." Let $R = P(\text{data} \mid H_2)/P(\text{data} \mid H_0)$; then

$$Q = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)} \cdot \frac{0.02(1-h)}{0.98(1-h) + hR}.$$

If the disease in the family has a prior chance of 90% of being due to the "other" mode of inheritance, then the critical value has to increase from 3 to 4 in order to keep the posterior odds against linkage at 0.05.

Ott's Bayesian argument above applies to a single-marker situation. In a genome-wide search, many markers are used and multiple tests are performed. While the overall false positive rate increases due to the nature of multiple tests, Ott (1991) indicated that, because the prior probability also increases with the number of markers, the critical limit could remain at 3. He also reported that Lander and Botstein (1989) investigated the problem and concluded that, for the human genome, in order to keep the overall significance level at 5%, the appropriate critical level of LOD is between 2 and 3. Later, Lander and colleagues (Lander and Schork, 1994; Lander and Kruglyak, 1995) strongly advocated higher LOD thresholds for genome-wide detection, with results based on dense markers.
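Ott's posterior-odds calculation can be reproduced numerically. The sketch below assumes the prior of 0.02 for linkage quoted above and, as a labeled assumption not stated in the text, takes the likelihood ratio of the "other" hypothesis to be $R = 1$; the function names are hypothetical.

```python
import math


def posterior_odds_for_linkage(lod, h=0.0, r=1.0, prior_linkage=0.02):
    """Posterior odds of H1 (linkage) against 'not H1'.

    lod: LOD score, so the likelihood ratio P(data|H1)/P(data|H0) is 10**lod.
    h:   prior probability of the 'other' hypothesis H2.
    r:   P(data|H2)/P(data|H0); r = 1 is an assumption made for illustration.
    """
    prior_h1 = prior_linkage * (1 - h)
    prior_h0 = (1 - prior_linkage) * (1 - h)
    return 10 ** lod * prior_h1 / (prior_h0 + h * r)


def lod_for_target_odds(target, h=0.0, r=1.0, prior_linkage=0.02):
    """LOD score needed to reach the given posterior odds for linkage."""
    prior_h1 = prior_linkage * (1 - h)
    prior_h0 = (1 - prior_linkage) * (1 - h)
    return math.log10(target * (prior_h0 + h * r) / prior_h1)


odds_at_3 = posterior_odds_for_linkage(3.0)          # roughly 20:1 for linkage
lod_single_gene = lod_for_target_odds(20.0, h=0.0)   # close to 3
lod_with_other = lod_for_target_odds(20.0, h=0.9)    # close to 4
```

Under these assumptions, keeping the odds against linkage at 0.05 (that is, 20:1 for linkage) requires a LOD score near 3 when $h = 0$ and near 4 when $h = 0.9$, in line with the figures quoted in the text.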
This author has not tried to verify their "dense markers" approach and hence will not include their results here, but would like to point out that the distribution of the LOD score differs between methods. Thus, different thresholds should be applied to different methods, as Lander and Schork, and Lander and Kruglyak did.

In practice, it seems that most researchers have forgotten the other part of Morton's suggestion, that is, when LOD < -2, linkage to the marker should be excluded. This author has not seen a published paper using the method of exclusion.

Clerget-Darpoux, Bonaïti-Pellié, and Hochez (1986) studied the effects of misspecified genetic parameters in LOD score analysis. They reported that "the power of the linkage test is sensitive to the degree of dominance, and slightly to the penetrance, but not to the gene frequency. In contrast, the estimation of the recombination fraction may be strongly affected by an error on any genetic parameter." MacLean et al. (1993) proposed a "MOD" method, which is similar to LOD but maximizes the LOD score not only over the recombination fraction but also over the inheritance mode.

2.8 Heterogeneity and Homogeneity

There are two types of heterogeneity in linkage analysis. One is allelic and the other is nonallelic; the latter is also referred to as locus heterogeneity. "With allelic heterogeneity, individuals differ from each other by having different alleles at the same locus responsible for the disease; in nonallelic heterogeneity, however, the disease is caused by different loci" (Ott, 1991). We will discuss only nonallelic heterogeneity in this section because, in that case, there is more than one recombination fraction that can be detected by linkage analysis. There are several methods to test for homogeneity. These include a method proposed by Morton (1956) and a method by Smith (1963). Morton's method tests whether all the families have the same recombination fraction or have different recombination fractions.
He proposed a simple statistic,

$$2\ln(10) \times \left[\sum_{i=1}^{k} Z_i(\hat\theta_i) - Z(\hat\theta)\right], \tag{2.9}$$

where $\hat\theta_i$ is the recombination fraction that maximizes the LOD score of the $i$th family, $Z_i(\hat\theta_i)$ denotes that maximized LOD score, $i = 1$ to $k$, and $Z(\hat\theta)$ is the maximum of the total LOD score, which occurs at the value $\hat\theta$ for all families combined. This statistic is assumed to have a $\chi^2$ distribution with $k - 1$ df.

Smith (1963) assumed that there are two groups, a linked group and an unlinked group. The linked group is the group of families that have a disease gene at a locus linked with the markers. The members of the unlinked group either have no gene that causes the disease or have a gene at a locus that is unlinked to the markers. Smith proposed a test of whether all families in the study are from the linked group. If $\alpha$ is the true but unknown proportion of families belonging to the linked group with recombination fraction $\theta$, let $\hat\alpha$ and $\hat\theta$ be the MLEs of $\alpha$ and $\theta$. Under $H_1$, the LOD score of the $i$th family is given as

$$Z_i(\alpha, \theta) = \log_{10}[\alpha L_i(\theta) + (1-\alpha) L_i(\tfrac12)].$$

The total LOD score is $Z(\alpha, \theta) = \sum_i Z_i(\alpha, \theta)$. He assessed the significance of nonallelic heterogeneity by the statistic

$$2\ln(10) \times [Z(\hat\alpha, \hat\theta \mid H_1) - Z(\alpha = 1, \hat\theta \mid H_0)]. \tag{2.10}$$

If $H_0$ is true, because $\alpha = 1$ is on the boundary of the parameter space, the statistic follows a 50:50 mixture of a chi-square distribution with 1 df and a point mass at 0, not a chi-square distribution with 1 df as originally suggested.

Ott (1983) compared Smith's method, which he called the A-test, and Morton's method, which he called the PS-test (M-test in his 1991 book). For a mixed situation (families with or without linkage between gene locus and marker locus), Ott compared the M-test with the A-test and found that the A-test was generally superior. However, since he erred on the distribution of the test statistic (2.10), his conclusion might need modification.
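The two statistics of this section can be sketched numerically. The code below is an illustration under stated assumptions: each family's LOD curve is evaluated on a common grid of recombination fractions (so the maximization is only approximate), the input values are made up, and the function names are hypothetical. The p-value function implements the 50:50 mixture of a chi-square distribution with 1 df and a point mass at 0 that applies when the null value lies on the boundary.

```python
import math


def morton_homogeneity_statistic(family_lods):
    """Statistic (2.9) from per-family LOD curves on a common theta grid.

    family_lods[i][g] is the LOD score of family i at the g-th grid value of theta.
    Returns the statistic and its nominal degrees of freedom, k - 1.
    """
    # Each family maximized at its own recombination fraction.
    separate = sum(max(curve) for curve in family_lods)
    # All families maximized at one common recombination fraction.
    pooled = max(sum(column) for column in zip(*family_lods))
    return 2 * math.log(10) * (separate - pooled), len(family_lods) - 1


def boundary_mixture_pvalue(statistic):
    """Tail probability under a 50:50 mixture of chi-square(1 df) and a mass at 0."""
    if statistic <= 0:
        return 1.0
    # For 1 df, P(chi-square > x) = erfc(sqrt(x / 2)).
    return 0.5 * math.erfc(math.sqrt(statistic / 2))


lods = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]  # two made-up families, three grid points
stat, df = morton_homogeneity_statistic(lods)
p_lod3 = boundary_mixture_pvalue(2 * math.log(10) * 3.0)
```

Applied to $2\ln(10) \times 3$, the mixture tail probability is close to 0.0001, which agrees with the null probability quoted in Section 2.7 for LOD > 3.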
Using IBD scores, Chakravarti, Badner, and Li (1987) proposed a method to test linkage and homogeneity in the case of an autosomal recessive gene. The test is basically a goodness-of-fit test. They concluded that while the power of their method for detecting linkage from sib-pair data is excellent, its power for detecting heterogeneity of linkage is not. Procedural details are as follows.

For an autosomal recessive disease, the genotypes of unaffected parents in multiplex families are Dd x Dd, where D and d are the normal and disease alleles, respectively. Among the affected sib pairs, let the probabilities of marker IBD = 2, 1, and 0 be

$$P_2 = \Pr(\mathrm{IBD} = 2) = y^2, \tag{2.11}$$
$$P_1 = \Pr(\mathrm{IBD} = 1) = 2xy, \tag{2.12}$$
$$P_0 = \Pr(\mathrm{IBD} = 0) = x^2, \tag{2.13}$$

where $x = 2\theta(1-\theta) \le 0.5$ and $y = 1 - x$. The maximum likelihood estimator (MLE) for $x$ is

$$\hat x = \frac{n_1 + 2n_0}{2n}, \tag{2.14}$$

where $n_i$ is the number of affected sib pairs sharing $i$ marker alleles IBD, and $n = \sum_i n_i$. Because the true value should not be greater than 0.5, they proposed a new estimator, $\hat u$, where

$$\hat u = \begin{cases} \hat x & \text{if } \hat x < 0.5, \\ 0.5 & \text{if } \hat x \ge 0.5. \end{cases}$$

This new estimator has a smaller variance than $\hat x$ under both the null and alternative hypotheses. The recombination value may be estimated by

$$\hat\theta = \tfrac12 - \tfrac12\sqrt{1 - 2\hat u}.$$

Under genetic heterogeneity, the IBD score distribution will be a mixture of two binomial distributions, under $\theta < 0.5$ and $\theta = 0.5$, in the proportions $c : 1 - c$. Thus the $P_i$'s in (2.11) through (2.13) become

$$P_2 = c(1-x)^2 + (1-c)/4, \tag{2.15}$$
$$P_1 = 2cx(1-x) + (1-c)/2, \tag{2.16}$$
$$P_0 = cx^2 + (1-c)/4. \tag{2.17}$$

Solving the above equations, they got

$$c = \frac{(P_2 - P_0)^2}{P_2 - P_1 + P_0}.$$

Replacing $P_i$ with its MLE, they obtained the MLEs for $c$ and $x$ under heterogeneity,

$$\hat c = \frac{(n_2 - n_0)^2}{n(n_2 - n_1 + n_0)}, \tag{2.18}$$
$$\hat x_h = \frac{n_1 - 2n_0}{2(n_2 - n_0)}. \tag{2.19}$$

Then, they proposed a two-stage method to test linkage and heterogeneity. First they test for linkage; if the null hypothesis of no linkage is rejected, they then test for heterogeneity.
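The estimators in (2.14), (2.18), and (2.19) are simple functions of the IBD counts. The following sketch is illustrative: the function names and the example counts are hypothetical, and the expression used for (2.19) is the one obtained by solving (2.15) through (2.17) for $x$.

```python
import math


def recessive_asp_estimates(n2, n1, n0):
    """Return (x_hat, u_hat, theta_hat) for an autosomal recessive model."""
    n = n2 + n1 + n0
    x_hat = (n1 + 2 * n0) / (2 * n)                    # MLE of x, as in (2.14)
    u_hat = min(x_hat, 0.5)                            # truncated at 0.5
    theta_hat = 0.5 * (1 - math.sqrt(1 - 2 * u_hat))   # solves x = 2*theta*(1 - theta)
    return x_hat, u_hat, theta_hat


def heterogeneity_estimates(n2, n1, n0):
    """Return (c_hat, x_h) under the mixture model (2.15)-(2.17)."""
    n = n2 + n1 + n0
    if n2 == n0 or n2 - n1 + n0 == 0:
        raise ValueError("estimates are undefined for these counts")
    c_hat = (n2 - n0) ** 2 / (n * (n2 - n1 + n0))      # as in (2.18)
    x_h = (n1 - 2 * n0) / (2 * (n2 - n0))              # as in (2.19)
    return c_hat, x_h


x_hat, u_hat, theta_hat = recessive_asp_estimates(n2=40, n1=45, n0=15)
c_hat, x_h = heterogeneity_estimates(n2=40, n1=45, n0=15)
```

With these made-up counts, substituting `c_hat` and `x_h` back into (2.15) and (2.17) reproduces the observed proportions 0.40 and 0.15, which is a quick consistency check on the two formulas. Note that the estimates are not constrained to lie in [0, 1] or [0, 0.5] for arbitrary counts.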
In the first stage, the statistic

$$T = 2\sqrt{2n}\,(\tfrac12 - \hat x)$$

was used, and the authors claimed that $T$ has an asymptotic normal distribution. They did not explain why they did not use $\hat u$. The statistic

$$G = \frac{n_2^2}{n(1-x)^2} + \frac{n_1^2}{2nx(1-x)} + \frac{n_0^2}{nx^2} - n \tag{2.20}$$

is used to test heterogeneity. The authors claimed that $G$ is asymptotically distributed as a $\chi^2$ variable with 1 df. However, they neither proved this nor pointed out which value should be used for $x$: $\hat x$ or $\hat x_h$. Monte Carlo simulations were used to evaluate the power.

2.9 Polygenes

There are many traits in a population that show more variation and cannot easily be categorized into distinct classes (Klug and Cummings, 1997). Traits exhibiting continuous variation may be controlled by two or more genes. Such traits are said to exhibit continuous or quantitative variation and are examples of polygenic inheritance. The hypothesis that a large number of factors or genes are responsible for a continuous phenotype is called the multiple-factor or multiple-gene hypothesis. These genes have a special name, polygenes (Griffiths et al., 1993). Klug and Cummings (1997, p. 95) summarized some major characteristics of the multiple-factor hypothesis:

1. Characters controlled by multiple factors can usually be quantified by measuring, weighing, counting, etc.
2. Two or more pairs of genes, located throughout the genome, account for the hereditary influence on the phenotype in an additive way. Because many genes may be involved, inheritance of this type is often called polygenic.
3. Each gene locus may be occupied by either an additive allele, which contributes a set amount to the phenotype, or by a nonadditive allele, which does not contribute quantitatively to the phenotype.
4. The total effect of each additive allele at each locus, while small, is approximately equivalent to that of all other additive alleles at other gene sites.
5. Together, the genes controlling a single character produce substantial phenotypic variation.
6. Analysis of polygenic traits requires the study of large numbers of progeny from a population of organisms.

If more than two genes control the phenotype, then the number of genes involved and the effects of those genes are important to know. To address this, Tan and Chang (1972) proposed a method for estimating the number of genes for self-fertilized populations, assuming that there are only two alleles for each gene and that the effect of each gene is the same. This work was further expanded by Tan and D'Angelo (1979) to estimate the numbers and effects of major genes and polygenes, assuming that all major genes have the same effect and all polygenes have the same effect. Because their work was developed for self-fertilized populations, we will not cover the details.

2.10 Two-stage Genome Search

Using the DNA-level genetic marker technology proposed by Botstein et al. (1980), Lander and Botstein (1986) concluded that the affected-sib-pair method can be used for a genome search for disease genes. A two-stage design, first a screening search to eliminate nonviable marker loci and then an intensive search to identify the gene location, is an intuitive design. Although this design has already been used in genome-wide searches, such as those by Davies et al. (1994) and Luo et al. (1995), the optimality of their designs was not discussed. Thus, there were no guidelines on how to space the markers or how to allocate the available ASPs to each stage. While Elston (1992, 1994) studied the "optimal" two-stage design and concluded that two-stage designs are more efficient than one-stage designs, he considered neither the statistical complexity of the multiple tests, nor the interval-search nature of genome-wide linkage detection, nor the resource constraints. In another paper, Brown et al. (1994) studied by simulation the multiple-stage approach for genome search using affected pedigree members. Since Brown et al.
used only simulation and considered only one pedigree, it is not obvious how to apply their results to the ASP method. Darvasi and Soller (1994) studied the optimal spacing of genetic markers for QTL traits without considering the two-stage approach. Holmans and Craddock (1997) conducted a simulation study on efficient strategies for genome scanning using maximum-likelihood affected-sib-pair analysis. The situations they considered are: a sample of 200 affected sib pairs, four different sample allocation strategies, and five grid-tightening strategies. The risk ratio of sibs is equal to 2 or 3, each marker locus has five equifrequent alleles, and there are five possible gene locations. Since their studies were simulation studies, it is not clear how to generalize their results. Furthermore, they did not consider different sample sizes, nor optimal strategies under resource constraints.

The main goal of this dissertation is to answer the two-stage design questions as they apply to rare autosomal recessive and dominant diseases with a dichotomous phenotype (affected or not affected).

CHAPTER 3
TWO-STAGE GENOME SEARCH FOR SIMPLE MENDELIAN DISEASE

In a two-stage search method, the first stage uses part of the ASPs with widely spread markers in the genome. The rest of the resource is used on the promising markers identified in the first stage. In this study, only the least favorable configuration, in which the gene lies in the middle of two adjacent markers, is used for constructing the designs. The three major statistical problems involved in deriving analytic solutions are summarized as follows:

1. Because all the markers on the same chromosome are linked, the IBD scores are not independent. In the case of a genome search, there are many markers spread along the chromosomes. We have to handle a high-dimensional joint distribution.
2. In the first stage, a number of loci will be chosen for the more intensive linkage analysis in the second stage. The number and position of the loci that pass to the second stage are random. This makes calculation of the exact distribution in the second stage extremely difficult.
3. Since in the first stage only a small number of ASPs and a large number of markers are used, there will be many ties in the IBD scores. Asymptotic solutions using a continuous normal distribution cannot handle these ties.

In this dissertation, Problem 3 was handled by a multinomial distribution. We used an independence model as an approximation for Problems 1 and 2, and checked the results with simulation.

3.1 Assumptions

1. There is one and only one disease gene that increases the probability of an individual being affected. However, the disease may have nongenetic causes.
2. Highly polymorphic, equally spaced markers are available. When $m$ markers are assigned in the first stage, their positions are $\frac{L}{2m}, \frac{3L}{2m}, \ldots, \frac{(2m-1)L}{2m}$, where $L$, the length of the genome, is equal to 3300 cM (Lewin, 1990).
3. For a rare autosomal recessive disease, the parents' disease genotypes are $D_1d$ and $D_2d$; for a rare dominant disease, the parents' disease genotypes are $D_1d$ and $D_2D_3$, where $d$ is the disease gene.
4. In the same family, the probability that the disease of one affected sib is caused by a gene and that of the other is not is negligible.
5. The cost of typing alleles is a constant, i.e., the cost of typing $k$ markers from one person is the same as typing one marker from $k$ persons. This assumption on the cost ratio can be relaxed to suit practical situations, but the numerical results would need to be recalculated.
6. We use the two-alleles statistic (Day and Simons, 1976), i.e., let $X_{ij} = 1$ if the IBD score of the $i$th marker is 2 for the $j$th sib pair, and $X_{ij} = 0$ otherwise. In our analytic approach, the $X_{ij}$ are assumed to be independent random variables, except for the two $X_{ij}$ adjacent to the gene.
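To make Assumption 2 concrete, the sketch below lists the marker positions for a given $m$ and converts the least favorable gene-to-marker distance (half the marker spacing) into a recombination fraction. It is a hypothetical illustration rather than the dissertation's program: the function names are invented here, and the Haldane map function $\theta = (1 - e^{-2d})/2$, with $d$ in Morgans, is assumed because the simulation section states that Haldane's function was used.

```python
import math

GENOME_LENGTH_CM = 3300.0  # L in Assumption 2


def marker_positions(m, length_cm=GENOME_LENGTH_CM):
    """Positions L/(2m), 3L/(2m), ..., (2m-1)L/(2m) of m equally spaced markers."""
    return [(2 * i + 1) * length_cm / (2 * m) for i in range(m)]


def haldane_theta(distance_cm):
    """Haldane map function: recombination fraction for a map distance in cM."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def worst_case_theta(m, length_cm=GENOME_LENGTH_CM):
    """Recombination fraction when the gene sits midway between two adjacent markers."""
    half_spacing_cm = length_cm / (2 * m)
    return haldane_theta(half_spacing_cm)
```

With $m = 100$ markers the spacing is 33 cM, so the gene in the least favorable configuration lies 16.5 cM from each flanking marker.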
3.2 Two-Stage Procedure

Suppose there are $n$ ASPs and there are enough resources to type $N$ marker loci. Three numbers need to be determined in a two-stage design: $n_1$ and $m$, the number of ASPs and the number of markers to be used in the first stage (Stage I), and $r$, the number of markers to be used in the second stage (Stage II). The markers for the second stage are chosen based on the statistic used by Day and Simons,

$${}_1S_i = \sum_{j=1}^{n_1} X_{ij}, \tag{3.1}$$

in the first stage, where $X_{ij}$ is defined in Assumption 6. Ideally, the $r$ markers with the highest scores are chosen for Stage II. However, in the event of ties, more than $r$ of them may have to be chosen. Thus, $R$, the actual number of markers used in Stage II, is a random variable. The formal definition of $R$ is: $R \ge r$, but if the marker(s) with the lowest score in this group (the markers for the Stage II study) is (are) taken away, then the total number of remaining markers is smaller than $r$.

In Stage II, $R$ markers are typed on $N_2$ ASPs, where $N_2$ is the largest number of sib pairs that can be used subject to the resource constraint. Since $R$ is a random variable, $N_2$ is also a random variable. Without loss of generality, let the $R$ markers be $0, 1, \ldots, R-1$, and the sib pairs in Stage II be $n_1+1, n_1+2, \ldots, n_1+N_2$. Thus, $N_2$ is the largest $x$ such that $mn_1 + Rx \le N$. Define

$${}_2S_i = \sum_{j=n_1+1}^{n_1+N_2} X_{ij}. \tag{3.2}$$

Then the marker with the uniquely highest ${}_2S_i$ is claimed to be the locus nearest the disease gene. If two adjacent markers share the same highest score, then the gene is claimed to lie between them, and the two markers are called a marker group; otherwise, the gene location is undetermined.

Once the location, $I$, has been chosen, say $I = i$, the next step is to check whether we can claim linkage. Let $t_{\alpha, n_2, R}$ be the $100(1-\alpha)$ percentile of the unique maximum of $R$ binomial$(n_2, 0.25)$ random variables. If ${}_2S_I > t_{\alpha, n_2, R}$, then we claim that there is linkage at locus $i$ at significance level $\alpha$.
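The selection rule with ties and the resource constraint $mn_1 + RN_2 \le N$ can be sketched in a few lines of Python. This is an illustrative sketch under the definitions of this section only; the function names are hypothetical, and any further cap on $N_2$ (for instance, the number of sib pairs actually left over after Stage I) is deliberately not modeled.

```python
def select_stage2_markers(stage1_scores, r):
    """Indices of markers passing Stage I: the r highest scores plus all ties
    with the r-th highest score, so the returned count R satisfies R >= r."""
    ranked = sorted(stage1_scores, reverse=True)
    cutoff = ranked[r - 1]                      # r-th highest Stage I score
    return [i for i, s in enumerate(stage1_scores) if s >= cutoff]


def stage2_sample_size(total_resource, m, n1, num_selected):
    """Largest x with m * n1 + num_selected * x <= total_resource."""
    return (total_resource - m * n1) // num_selected


def locate_gene(stage2_scores, selected):
    """Return the marker with the uniquely highest Stage II score, a pair of
    adjacent markers tied for the highest score, or None if undetermined."""
    best = max(stage2_scores)
    winners = [mk for mk, s in zip(selected, stage2_scores) if s == best]
    if len(winners) == 1:
        return winners[0]
    if len(winners) == 2 and abs(winners[0] - winners[1]) == 1:
        return tuple(winners)                   # a marker group
    return None
```

For Stage I scores `[3, 1, 2, 2, 0]` and `r = 2`, markers 0, 2, and 3 pass because of the tie at score 2, so $R = 3$; with $N = 100$, $m = 5$, and $n_1 = 4$ this leaves $N_2 = 26$.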
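3.3 Probability of Allocating the Correct Marker

3.3.1 Analytic Approach

To make the analytic solution tractable, Assumption 6 is made on the distributions of the $X_{ij}$. The results are later compared with a more realistic situation by simulation. Let the markers be numbered from 0 to $m-1$. Without loss of generality, if the gene is at the end of the chromosome, we let the nearest marker be marker 0; and if the gene is between two markers, we let it be located in the middle of markers 0 and 1. If we assume that the gene location is uniformly distributed along the genome, the probability of the gene being at the end is $1/m$. Let ${}_1S_{(r)}$ be the $r$th largest statistic in the set $\{{}_1S_i\}$ in the first stage. The probability of finding the gene by the two-stage method is

$$\begin{aligned}
&P(\text{the marker closest to the gene is found}) \\
&\quad = P(\text{gene is at the end of the chromosome, marker 0 is chosen in Stage II}) &(3.3)\\
&\quad\; + P(\text{gene is not at the end, either marker 0 or 1 is chosen in Stage II}). &(3.4)
\end{aligned}$$

Eq. (3.3) can be written as a sum of the probabilities of three exclusive events, Eq. (3.5) through Eq. (3.7):

$$\begin{aligned}
&\frac{1}{m} P(\text{marker 0 is chosen at the end of Stage II} \mid \text{gene is at the end}) \\
&= \frac{1}{m} P({}_1S_0 \text{ passes Stage I and } {}_2S_0 \text{ is the highest in Stage II}) \\
&= \frac{1}{m} \sum_{k=0}^{n_1} P({}_1S_0 \text{ passes Stage I, } {}_2S_0 \text{ is the highest in Stage II, } {}_1S_{(r-1)} = k) \\
&= \frac{1}{m} \sum_{k=0}^{n_1} \big\{ P({}_1S_0 \ge {}_1S_{(r-1)},\ {}_1S_{(r-1)} = k,\ {}_2S_0 \text{ is the highest in Stage II}) &(3.5)\\
&\qquad + P(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)},\ {}_2S_0 \text{ is the highest in Stage II}) &(3.6)\\
&\qquad + P(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)},\ {}_2S_0 \text{ is the highest in Stage II}) \big\}. &(3.7)
\end{aligned}$$

Eq. (3.5) is equal to (assuming independence)

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-1)},\ {}_1S_{(r-1)} = k,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{i=0}^{r-2} \sum_{l=r-1-i}^{m-1-i} P\big({}_1S_0 \ge k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ (m-1-i-l)\ {}_1S_j\text{'s} < k,\ {}_2S_0 > \max_{t \ne 0} {}_2S_t\big) \\
&= \sum_{i=0}^{r-2} \sum_{l=r-1-i}^{m-1-i} \Big\{ P({}_1S_0 \ge k)\, \frac{(m-1)!}{i!\, l!\, (m-1-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-1-i-l} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{i+l} \Big] \Big\},
\end{aligned}$$

where the double sum runs over $i < r-1$ and $i + l \ge r-1$, and $N_2 = \left[ \frac{N - n_1 m}{i+l+1} \right]_{Int}$.

Eq. (3.6) is equal to

$$\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 > {}_1S_{(r)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} P({}_1S_{(r-1)} = k > {}_1S_0 > {}_1S_{(r)} = i,\ {}_2S_0 \text{ is uniquely highest}) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} P\big(k > {}_1S_0 > i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ is uniquely highest}\big) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} \Big\{ P(k > {}_1S_0 > i)\, \frac{(m-1)!}{(r-1-l)!\, l!\, h!\, ((m-1)-(r-1)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-1)-(r-1)-h} \Big[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r} \right]_{Int}$.

Eq. (3.7) is equal to

$$\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_{(r)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-1} P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_{(r)} = i,\ {}_2S_0 \text{ is uniquely highest}) \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} P\big({}_1S_0 = i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ is uniquely highest}\big) \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} \Big\{ P({}_1S_0 = i)\, \frac{(m-1)!}{(r-1-l)!\, l!\, h!\, ((m-1)-(r-1)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-1)-(r-1)-h} \Big[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1+h} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r+h} \right]_{Int}$.

Eq. (3.4) can be written as

$$\begin{aligned}
&\frac{m-1}{m} P(\text{marker 0 or 1 is chosen at the end of Stage II} \mid \text{gene is not at the end}) \\
&= \frac{m-1}{m} \big\{ P({}_1S_0 \text{ passes Stage I and } {}_1S_1 \text{ does not, } {}_2S_0 \text{ is the highest in Stage II}) &(3.8)\\
&\qquad + P({}_1S_1 \text{ passes Stage I and } {}_1S_0 \text{ does not, } {}_2S_1 \text{ is the highest in Stage II}) &(3.9)\\
&\qquad + P({}_1S_0 \text{ and } {}_1S_1 \text{ pass Stage I and } {}_2S_0 \text{ or } {}_2S_1 \text{ is the highest in Stage II}) \big\}. &(3.10)
\end{aligned}$$

Since the gene is assumed to be in the middle of two markers, Eq. (3.8) is equal to Eq. (3.9), and both are equal to

$$\begin{aligned}
&P({}_1S_0 \text{ passes Stage I and } {}_1S_1 \text{ does not, } {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{k=1}^{n_1} P({}_1S_0 \text{ passes Stage I and } {}_1S_1 \text{ does not, } {}_2S_0 \text{ is the highest in Stage II, } {}_1S_{(r-1)} = k) \\
&= \sum_{k=1}^{n_1} \big\{ P({}_1S_0 \ge {}_1S_{(r-1)} = k > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) &(3.11)\\
&\qquad + P({}_1S_{(r-1)} = k > {}_1S_0 > {}_1S_{(r)},\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) &(3.12)\\
&\qquad + P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_{(r)},\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \big\}. &(3.13)
\end{aligned}$$

Eq. (3.11) is equal to

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-1)} = k > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{i=0}^{r-2} \sum_{l=r-1-i}^{m-2-i} P\big({}_1S_0 \ge k,\ {}_1S_1 < k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ (m-2-i-l)\ {}_1S_j\text{'s} < k,\ {}_2S_0 > \max_{t \ne 0} {}_2S_t\big) \\
&= \sum_{i=0}^{r-2} \sum_{l=r-1-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 < k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{i+l} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{i+l+1} \right]_{Int}$.

Eq. (3.12) is equal to

$$\begin{aligned}
&P(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)},\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} P\big(k > {}_1S_0 > i,\ {}_1S_0 > {}_1S_1,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-1-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 > \max_{t \ne 0,1} {}_2S_t\big) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Big\{ P(k > {}_1S_0 > i,\ {}_1S_0 > {}_1S_1)\, \frac{(m-2)!}{(r-1-l)!\, l!\, h!\, ((m-2)-(r-1)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{m-r-1-h} \Big[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r} \right]_{Int}$.

Eq. (3.13) is equal to

$$\begin{aligned}
&P(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)},\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{i=1}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} P\big({}_1S_0 = i,\ {}_1S_1 < i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-1-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 > \max_{t \ne 0,1} {}_2S_t\big) \\
&= \sum_{i=1}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Big\{ P({}_1S_0 = i,\ {}_1S_1 < i)\, \frac{(m-2)!}{(r-1-l)!\, l!\, h!\, ((m-2)-(r-1)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{m-r-1-h} \Big[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1+h} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r+h} \right]_{Int}$.

Eq.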
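(3.10) is equivalent to

$$\begin{aligned}
&P({}_1S_0 \text{ and } {}_1S_1 \text{ pass Stage I, gene is found in Stage II}) \\
&= \sum_{k=1}^{n_1} \big\{ P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ \text{gene is found in Stage II}) &(3.14)\\
&\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.15)\\
&\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.16)\\
&\quad + P({}_1S_1 \ge {}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.17)\\
&\quad + P({}_1S_1 \ge {}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.18)\\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.19)\\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.20)\\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_0 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.21)\\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_0 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.22)\\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_0 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.23)\\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_0 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) &(3.24)\\
&\quad + P({}_1S_{(r-1)} = k > {}_1S_1 = {}_1S_0 > {}_1S_{(r)},\ \text{gene is found in Stage II}) &(3.25)\\
&\quad + P({}_1S_{(r-1)} = k > {}_1S_1 = {}_1S_0 = {}_1S_{(r)},\ \text{gene is found in Stage II}) \big\}. &(3.26)
\end{aligned}$$

Again, because the gene is in the middle of two markers, Eq. (3.15) is equal to Eq. (3.17), (3.16) is equal to (3.18), (3.19) is equal to (3.21), and (3.20) is equal to (3.22).

Eq. (3.14) is equal to

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 \text{ is uniquely highest in Stage II}) \\
&+ P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&+ P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \\
&= P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) &\text{(A)}\\
&+ P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 = {}_2S_1 \text{ is the highest}). &\text{(B)}
\end{aligned}$$

$$\begin{aligned}
\text{(A)} &= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} P\big({}_1S_0 \ge k,\ {}_1S_1 \ge k,\ {}_1S_{(r-2)} = k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ (m-2-i-l)\ {}_1S_j\text{'s} < k, \\
&\qquad\qquad {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}\big) \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\qquad \times \big[ P({}_2S_0 > \max_{t \ne 0,1} {}_2S_t,\ {}_2S_0 > {}_2S_1) + P({}_2S_1 > \max_{t \ne 0,1} {}_2S_t,\ {}_2S_1 > {}_2S_0) \big] \Big\} \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} 2 P({}_2S_0 = t,\ {}_2S_1 < t)\, P({}_2S_j < t)^{i+l} \Big] \Big\},
\end{aligned}$$

where the double sum runs over $i < r-2$ and $i + l \ge r-2$, and $N_2 = \left[ \frac{N - n_1 m}{i+l+2} \right]_{Int}$.

$$\begin{aligned}
\text{(B)} &= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} P\big({}_1S_0 \ge k,\ {}_1S_1 \ge k,\ {}_1S_{(r-2)} = k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ (m-2-i-l)\ {}_1S_j\text{'s} < k, \\
&\qquad\qquad {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}\big) \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\qquad \times P({}_2S_0 = {}_2S_1 > \max_{t \ne 0,1} {}_2S_t) \Big\} \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} P({}_2S_0 = t,\ {}_2S_1 = t)\, P({}_2S_j < t)^{i+l} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{i+l+2} \right]_{Int}$. Therefore

$$\begin{aligned}
\text{(A)} + \text{(B)} &= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big) P({}_2S_j < t)^{i+l} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{i+l+2} \right]_{Int}$.

Eq. (3.15) and (3.17) are equal to

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} \big\{ P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \big\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \big\{ P({}_1S_0 \ge k,\ k > {}_1S_1 > i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_0 \ge k,\ k > {}_1S_1 > i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \big\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P({}_1S_0 \ge k,\ k > {}_1S_1 > i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big) P({}_2S_j < t)^{r-2} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r} \right]_{Int}$.

Eq.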
(3.16) and (3.18) are equal to

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \big\{ P({}_1S_0 \ge k,\ {}_1S_1 = i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \big\} \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 = i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big) P({}_2S_j < t)^{r-2+h} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r+h} \right]_{Int}$.

Eq. (3.19) and (3.21) are equal to

$$\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-3} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \big\{ P(k > {}_1S_0 > {}_1S_1 > i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \big\} \\
&= \sum_{i=0}^{k-3} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P(k > {}_1S_0 > {}_1S_1 > i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big) P({}_2S_j < t)^{r-2} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r} \right]_{Int}$.

Eq. (3.20) and (3.22) are equal to

$$\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \big\{ P(k > {}_1S_0 > i,\ {}_1S_1 = i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \big\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P(k > {}_1S_0 > i,\ {}_1S_1 = i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big) P({}_2S_j < t)^{r-2+h} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r+h} \right]_{Int}$.

Eq. (3.23) is equal to

$$\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \big\{ P(k > {}_1S_0 = {}_1S_1 > i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \big\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P(k > {}_1S_0 = {}_1S_1 > i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big) P({}_2S_j < t)^{r-2} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r} \right]_{Int}$.

Eq. (3.24) is equal to

$$\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \big\{ P({}_1S_0 = {}_1S_1 = i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \big\} \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P({}_1S_0 = {}_1S_1 = i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big) P({}_2S_j < t)^{r-2+h} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r+h} \right]_{Int}$.

Eq. (3.25) is equal to

$$\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \big\{ P(k > {}_1S_0 = {}_1S_1 > i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-1-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \big\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Big\{ P(k > {}_1S_0 = {}_1S_1 > i)\, \frac{(m-2)!}{(r-1-l)!\, l!\, h!\, ((m-2)-(r-1)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-1)-h} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big) P({}_2S_j < t)^{r-1} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r+1} \right]_{Int}$.

Eq.
(3.26) is equal to

$$\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \big\{ P({}_1S_0 = {}_1S_1 = i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-1-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \big\} \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Big\{ P({}_1S_0 = {}_1S_1 = i)\, \frac{(m-2)!}{(r-1-l)!\, l!\, h!\, ((m-2)-(r-1)-h)!} \\
&\qquad\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-1)-h} \\
&\qquad\qquad \times \Big[ \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big) P({}_2S_j < t)^{r-1+h} \Big] \Big\},
\end{aligned}$$

where $N_2 = \left[ \frac{N - n_1 m}{r+1+h} \right]_{Int}$.

The joint distribution of the IBD scores of two markers (or genes), P(IBD of marker (gene) 1, IBD of marker (gene) 2), is given in Table 3.1, adapted from Table 2.3, where $\Psi = \theta^2 + (1-\theta)^2$ and $\theta$ is the recombination fraction between the two loci.

Table 3.1. The joint distribution of the IBD scores of locus 1 and locus 2, where $\Psi = \theta^2 + (1-\theta)^2$ and $\theta$ is the recombination fraction between the two loci.

                      IBD of locus 2
IBD of locus 1    2                     1                               0
2                 $\Psi^2/4$            $\Psi(1-\Psi)/2$                $(1-\Psi)^2/4$
1                 $\Psi(1-\Psi)/2$      $[\Psi^2+(1-\Psi)^2]/2$         $\Psi(1-\Psi)/2$
0                 $(1-\Psi)^2/4$        $\Psi(1-\Psi)/2$                $\Psi^2/4$

From Table 3.1, we can deduce the joint distribution of $X_{ij}$ and $X_{i'j}$ for $i \ne i'$. This joint distribution is shown in Table 3.2, with $\Psi$ defined as in Table 3.1.

If a marker is not linked with the disease gene, then the probabilities that its IBD score equals 2, 1, and 0 are 0.25, 0.5, and 0.25, respectively. Thus, $X_{ij}$ has a Bernoulli distribution with parameter 0.25. The distribution of ${}_lS_i$ under the null hypothesis is Binomial$(n_l, 0.25)$ for $l = 1, 2$. Let $\epsilon$ be the probability that the disease is caused by the gene, and let $\theta$ be the recombination fraction between marker 0 and the gene. For equations 3.5 to 3.7, where the recessive gene is at the end, $X_{0j}$ is distributed as Bernoulli$(\epsilon \Psi^2 + (1-\epsilon)\,0.25)$ and ${}_1S_0$ as Binomial$(n_1, \epsilon \Psi^2 + (1-\epsilon)\,0.25)$.
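The Bernoulli parameter of $X_{0j}$ for the marker nearest the gene can be evaluated directly. The sketch below is a hypothetical helper, not the dissertation's code: it uses $\epsilon\Psi^2 + (1-\epsilon)/4$ for a recessive gene and $\epsilon\Psi/2 + (1-\epsilon)/4$ for a dominant gene, the two forms stated in this section, and a plain binomial probability mass function for ${}_1S_0$.

```python
from math import comb


def psi(theta):
    """Psi = theta^2 + (1 - theta)^2 for recombination fraction theta."""
    return theta ** 2 + (1.0 - theta) ** 2


def success_prob(theta, eps, mode="recessive"):
    """P(X_0j = 1) for the marker at recombination fraction theta from the gene.

    eps is the probability that the disease is caused by the gene.
    """
    p = psi(theta)
    linked = p ** 2 if mode == "recessive" else p / 2.0
    return eps * linked + (1.0 - eps) * 0.25


def binomial_pmf(k, n, p):
    """P(S = k) for S ~ Binomial(n, p), e.g. the Stage I score of marker 0."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)
```

At $\theta = 0.5$ (no linkage) both modes reduce to the null value 0.25, whatever $\epsilon$ is; at $\theta = 0$ and $\epsilon = 1$ the recessive parameter is 1 and the dominant parameter is 0.5.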
Similarly, for a dominant disease gene, $X_{0j}$ is distributed as Bernoulli$(\epsilon \Psi/2 + (1-\epsilon)\,0.25)$ and ${}_1S_0$ as Binomial$(n_1, \epsilon \Psi/2 + (1-\epsilon)\,0.25)$.

For Eq. 3.11 to 3.13, we need to find the joint distribution $P({}_lS_0 = i,\ {}_lS_1 = j)$, $i = 0, 1, \ldots, n_l$, $j = 0, 1, \ldots, n_l$, $l = 1, 2$. Let $n_{abl}$ be the number of $(X_{0j}, X_{1j}) = (a, b)$ in stage $l$. Then $(n_{00l}, n_{01l}, n_{10l}, n_{11l})$ has a multinomial distribution $(n_l;\ p_{00}, p_{01}, p_{10}, p_{11})$. Thus, the joint distribution

$$\begin{aligned}
&P({}_lS_0 = i,\ {}_lS_1 = j) &(3.27)\\
&= P(n_{10l} + n_{11l} = i,\ n_{01l} + n_{11l} = j) &(3.28)\\
&= \sum_{k=\max(0,\, i+j-n_l)}^{\min(i,j)} P(n_{11l} = k,\ n_{10l} = i-k,\ n_{01l} = j-k,\ n_{00l} = n_l - i - j + k) &(3.29)
\end{aligned}$$

can be computed once $p_{00}$, $p_{01}$, $p_{10}$, and $p_{11}$ are specified. Although we assume that the gene is in the middle of the two adjacent markers, the formulae are derived for a gene anywhere in between. Let the recombination fraction between the gene and marker 0 be $\theta_0$ and between the gene and marker 1 be $\theta_1$. Let $\theta_2$ be the recombination fraction between the two markers.

Table 3.2. The joint distribution of $X_{ij}$ and $X_{i'j}$.

                  $X_{i'j}$
$X_{ij}$     1                   0
1            $\Psi^2/4$          $(1-\Psi^2)/4$
0            $(1-\Psi^2)/4$      $(\Psi^2+2)/4$

The joint distribution of $X_{0j}$ and $X_{1j}$, for $x = 0, 1$, $y = 0, 1$, is

$$\begin{aligned}
&P(X_{0j} = x,\ X_{1j} = y) &(3.30)\\
&= P(X_{0j} = x,\ X_{1j} = y \mid \text{gene IBD}=2)\,P(\text{gene IBD}=2) \\
&\quad + P(X_{0j} = x,\ X_{1j} = y \mid \text{gene IBD}=1)\,P(\text{gene IBD}=1) \\
&\quad + P(X_{0j} = x,\ X_{1j} = y \mid \text{gene elsewhere})\,P(\text{gene elsewhere}) &(3.31)\\
&= P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=2)\,P(X_{1j} = y \mid \text{gene IBD}=2)\,P(\text{gene IBD}=2) \\
&\quad + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=1)\,P(X_{1j} = y \mid \text{gene IBD}=1)\,P(\text{gene IBD}=1) \\
&\quad + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene elsewhere})\,P(X_{1j} = y \mid \text{gene elsewhere})\,P(\text{gene elsewhere}). &(3.32)
\end{aligned}$$

Since an individual has to have two recessive disease genes in order to be affected, Eq. (3.32) becomes

$$P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=2)\,P(X_{1j} = y \mid \text{gene IBD}=2)\,\epsilon + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene elsewhere})\,P(X_{1j} = y \mid \text{gene elsewhere})\,(1-\epsilon).$$

For a dominant disease, if the disease of an ASP is caused by a gene, then both sibs must at least share the disease gene, and there is a 50-50 chance that they share the other allele. Thus, Eq.
(3.32) becomes

$$\begin{aligned}
&P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=2)\,P(X_{1j} = y \mid \text{gene IBD}=2)\,(\epsilon/2) \\
&+ P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=1)\,P(X_{1j} = y \mid \text{gene IBD}=1)\,(\epsilon/2) \\
&+ P(X_{0j} = x \mid X_{1j} = y,\ \text{gene elsewhere})\,P(X_{1j} = y \mid \text{gene elsewhere})\,(1-\epsilon). &(3.33)
\end{aligned}$$

Let $\Psi_l = \theta_l^2 + (1-\theta_l)^2$, $l = 0, 1, 2$, and $p_{xy} = P(X_{0j} = x,\ X_{1j} = y)$, $x = 0, 1$, $y = 0, 1$. Then, applying Table 3.2, for a recessive disease we have

$$p_{00} = (1-\Psi_0^2)(1-\Psi_1^2)\,\epsilon + \frac{\Psi_2^2 + 2}{4}\,(1-\epsilon), \tag{3.34}$$
$$p_{10} = \Psi_0^2(1-\Psi_1^2)\,\epsilon + \frac{1-\Psi_2^2}{4}\,(1-\epsilon), \tag{3.35}$$
$$p_{01} = (1-\Psi_0^2)\Psi_1^2\,\epsilon + \frac{1-\Psi_2^2}{4}\,(1-\epsilon), \tag{3.36}$$
$$p_{11} = \Psi_0^2\Psi_1^2\,\epsilon + \frac{\Psi_2^2}{4}\,(1-\epsilon). \tag{3.37}$$

For a dominant disease,

$$p_{00} = (1-\Psi_0^2)(1-\Psi_1^2)\,\frac{\epsilon}{2} + \big(1-\Psi_0(1-\Psi_0)\big)\big(1-\Psi_1(1-\Psi_1)\big)\,\frac{\epsilon}{2} + \frac{\Psi_2^2 + 2}{4}\,(1-\epsilon), \tag{3.38}$$
$$p_{10} = \Psi_0^2(1-\Psi_1^2)\,\frac{\epsilon}{2} + \Psi_0(1-\Psi_0)\big(1-\Psi_1(1-\Psi_1)\big)\,\frac{\epsilon}{2} + \frac{1-\Psi_2^2}{4}\,(1-\epsilon), \tag{3.39}$$
$$p_{01} = (1-\Psi_0^2)\Psi_1^2\,\frac{\epsilon}{2} + \big(1-\Psi_0(1-\Psi_0)\big)\Psi_1(1-\Psi_1)\,\frac{\epsilon}{2} + \frac{1-\Psi_2^2}{4}\,(1-\epsilon), \tag{3.40}$$
$$p_{11} = \Psi_0^2\Psi_1^2\,\frac{\epsilon}{2} + \Psi_0(1-\Psi_0)\Psi_1(1-\Psi_1)\,\frac{\epsilon}{2} + \frac{\Psi_2^2}{4}\,(1-\epsilon). \tag{3.41}$$

Thus, P(the marker closest to the gene is found) can be calculated. In addition to the analytic computation for Eq. 3.3 and 3.4, simulations were also done.

3.3.2 Simulation under More Realistic Assumptions

The assumption of independent $X_{ij}$ violates the fact that some loci are linked. Simulations were done under a Markov chain model with the combinations of resource $N$ = 1000, 2000, 5000, and 10,000; number of ASPs $n$ = 10, 25, 50, 75, and 100; $\epsilon$ = 0.25, 0.5, 0.75, and 1; $m$ from 50 to 350 with an increment of 25 for the two-stage design and 10 for the one-stage design; $r$ from 5 to $m$ with an increment of $(m-5)/10$; and $n_1$ from 5 to $n-5$ with an increment of 5. The simulations were conducted as follows:

1. Read in the parameters: resource $N$, heterogeneity $\epsilon$, number of ASPs $n$, and $m$, $n_1$, and $r$ of the design.
2. Given $n_1$ and $m$, generate $y$ from uniform(0,1). If $y \le \epsilon$, then generate the gene location from the uniform distribution on (0, 3300). Let locus $l$ and locus $l+1$ be the two adjacent loci that flank it, i.e., the gene location is between $\frac{(2l-1)L}{2m}$ and $\frac{(2l+1)L}{2m}$, where $L$ is the total length of the genome. This step determines which interval contains the gene, and then:
   (a) For a recessive disease, generate IBD scores at loci $l$ and $l+1$ conditional on both sibs of the ASP carrying two disease genes and the gene locus being in the middle of the two markers. The Haldane map function is used to convert map distance into the recombination fraction $\theta$.
   (b) For a dominant disease, if $y \le 0.5\epsilon$, generate IBD scores at loci $l$ and $l+1$ conditional on both sibs of the ASP sharing two alleles at the gene locus and the gene locus being in the middle of the two markers. If $0.5\epsilon < y \le \epsilon$, then generate IBD scores at loci $l$ and $l+1$ conditional on both sibs of the ASP sharing one allele at the gene locus and, again, the gene locus being in the middle of the two markers.
   If $y > \epsilon$, let $l = 0$, and generate the IBD score at locus 0 from Bernoulli(0.25), as if there were no gene linked with marker 0.
3. For the Markov chain model simulation, generate the IBD score at locus $i$, with $i$ starting from $l-1$ and decreasing to 0, conditional on the IBD score at locus $i+1$, and then generate the IBD score at locus $i$ conditional on the IBD score at locus $i-1$ for $i = l+2, \ldots, m-1$. For the independent model simulation, generate IBD scores independently from Trinomial$(n_1, 0.25, 0.5, 0.25)$. Conditional probability formulas are given by Table 2.3.
4. Convert the IBD scores into the statistics $X_{ij}$.
5. Repeat steps 2, 3, and 4 $n_1$ times and then calculate ${}_1S_i$.
6. Check for ties and adjust $R$ to include all the ties with the $r$th highest ${}_1S_i$ for Stage II; check whether ${}_1S_0$ passes Stage I. If yes, then calculate $N_2$ according to the resource constraint and go to the next step. If not, record the detection as a failure and go back to step 2.
7. The $X_{ij}$ in Stage II are generated in the same way as those in Stage I, except that only the chosen markers are used and the generation is repeated $N_2$ times.
8. Check whether ${}_2S_0$ or ${}_2S_1$ is the unique largest among the ${}_2S_i$ in Stage II.
9. If marker 0 or 1 is chosen, it is a success; otherwise it is a failure. Go back to step 2 until enough simulations have been done.

The programs were compiled using GCC version 2.7.2 on a PC with a Pentium 166 CPU and 48M RAM in an OS/2 environment. The random number generator for the simulation was adapted from Press (1992).

3.3.3 Results

Based on analytic computation and simulation, the designs with the highest probability of finding the right marker (power) were identified. These optimal designs are given in Table 3.3 for searching for a recessive gene, and in Table 3.4 for a dominant gene. In both tables, the ASP column reports the number of available affected sib pairs; the $\epsilon$ column shows the probability that the disease is caused by the gene, representing heterogeneity; the m column, the number of marker loci used in the first stage; the r column, the proposed number of loci to be chosen in Stage I for the Stage II study; the $n_1$ column, the number of ASPs used in the first stage; the F2 column, the probability of locating the right marker by the best two-stage design obtained by the analytic formula; the Indep and Markov columns, the probabilities obtained by simulation under the independence assumption and the Markov chain assumption without combining first- and second-stage data; and the Comb. column, the simulated probability of the two-stage design under the Markov chain model with first- and second-stage data combined. The F1 column gives the probability of the best one-stage design, i.e., with optimal $m$ and $n$ subject to $mn \le N$, obtained by the analytic formula. The last column, F2-F1, shows the increase in probability of the two-stage design over the one-stage design. An asterisk marks an increase of over 0.35 for Table 3.3 (recessive) and over 0.15 for Table 3.4 (dominant).
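Step 3 of the simulation in Section 3.3.2 generates IBD scores along the chromosome as a Markov chain. The sketch below shows only that mechanism for markers unlinked to any disease gene; it is a simplified illustration and not the original GCC program. The transition probabilities are the conditional distributions obtained by dividing the joint probabilities of Table 3.1 by the marginals (1/4, 1/2, 1/4); the function names and the use of Python's `random` module are choices made here.

```python
import random

IBD_STATES = [2, 1, 0]


def transition_probs(current_ibd, psi):
    """P(next IBD = 2, 1, 0 | current IBD), with psi = theta^2 + (1 - theta)^2
    for the recombination fraction theta between adjacent markers."""
    if current_ibd == 2:
        return [psi ** 2, 2 * psi * (1 - psi), (1 - psi) ** 2]
    if current_ibd == 1:
        return [psi * (1 - psi), psi ** 2 + (1 - psi) ** 2, psi * (1 - psi)]
    return [(1 - psi) ** 2, 2 * psi * (1 - psi), psi ** 2]


def simulate_ibd_chain(m, psi, rng):
    """IBD scores at m equally spaced markers for one sib pair (null model)."""
    chain = rng.choices(IBD_STATES, weights=[0.25, 0.5, 0.25])
    for _ in range(m - 1):
        weights = transition_probs(chain[-1], psi)
        chain.append(rng.choices(IBD_STATES, weights=weights)[0])
    return chain


def to_two_allele_statistic(chain):
    """Step 4: X = 1 if the IBD score is 2, and X = 0 otherwise."""
    return [1 if score == 2 else 0 for score in chain]
```

A call such as `simulate_ibd_chain(100, 0.75, random.Random(0))` yields one sib pair's scores at 100 markers; summing the converted $X_{ij}$ over $n_1$ simulated pairs gives the Stage I statistics. Conditioning on the disease gene (step 2) would replace the starting draw and is not included in this sketch.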
Table 3.3. Optimal resource allocation in two-stage genome search for rare recessive gene.

Resource N = 1000
Parameter  Two-stage design                           One-stage design                Improv.
ASP  ε     m    r    n1  F2     Indep  Markov  Comb.  m    n    F1     Indep  Markov  F2-F1
10   0.25  125  5    5   0.040  0.041  0.041   0.045  100  10   0.042  0.043  0.042   0.001
10   0.50  175  5    5   0.149  0.152  0.146   0.167  100  10   0.121  0.126  0.116   0.027
10   0.75  175  5    5   0.366  0.369  0.352   0.412  100  10   0.269  0.268  0.272   0.098
10   1.00  175  5    5   0.662  0.662  0.631   0.723  100  10   0.481  0.484  0.477   0.181
25   0.25  125  5    5   0.092  0.095  0.090   0.093  50   20   0.059  0.062  0.059   0.034
25   0.50  125  5    5   0.324  0.339  0.316   0.324  70   14   0.127  0.126  0.130   0.197
25   0.75  125  5    5   0.642  0.654  0.630   0.642  100  10   0.269  0.269  0.268   0.374*
25   1.00  150  5    5   0.897  0.888  0.882   0.894  100  10   0.481  0.495  0.474   0.416*
50   0.25  100  5    5   0.126  0.130  0.135   0.135  50   20   0.059  0.061  0.060   0.068
50   0.50  125  5    5   0.388  0.385  0.372   0.377  70   14   0.127  0.129  0.120   0.260
50   0.75  125  5    5   0.697  0.701  0.680   0.685  100  10   0.269  0.277  0.272   0.428*
50   1.00  125  5    5   0.902  0.898  0.894   0.897  100  10   0.481  0.479  0.474   0.421*
75   0.25  100  5    5   0.135  0.141  0.134   0.136  50   20   0.059  0.059  0.062   0.077
75   0.50  100  5    5   0.400  0.404  0.401   0.404  70   14   0.127  0.125  0.133   0.273
75   0.75  125  5    5   0.697  0.707  0.689   0.696  100  10   0.269  0.278  0.269   0.429*
75   1.00  125  5    5   0.902  0.898  0.887   0.892  100  10   0.481  0.488  0.474   0.421*
100  0.25  100  5    5   0.137  0.138  0.136   0.138  50   20   0.059  0.059  0.061   0.078
100  0.50  100  5    5   0.402  0.419  0.387   0.393  70   14   0.127  0.129  0.125   0.274
100  0.75  125  5    5   0.697  0.703  0.685   0.691  100  10   0.269  0.268  0.268   0.429*
100  1.00  125  5    5   0.902  0.900  0.891   0.895  100  10   0.481  0.481  0.477   0.421*

Table 3.3 (continued).

Resource N = 2000
Parameter  Two-stage design                           One-stage design                Improv.
ASP  ε     m    r    n1  F2     Indep  Markov  Comb.  m    n    F1     Indep  Markov  F2-F1
10   0.25  125  5    5   0.040  0.042  0.041   0.044  200  10   0.041  0.043  0.044   0.001
10   0.50  350  5    5   0.181  0.198  0.154   0.170  200  10   0.178  0.188  0.177   0.003
10   0.75  350  5    5   0.500  0.513  0.419   0.481  200  10   0.460  0.472  0.441   0.040
10   1.00  350  5    5   0.883  0.880  0.745   0.834  200  10   0.807  0.808  0.779   0.076
25   0.25  175  5    10  0.108  0.120  0.114   0.118  80   25   0.085  0.090  0.088   0.023
25   0.50  175  5    10  0.441  0.455  0.427   0.453  100  20   0.269  0.272  0.265   0.172
25   0.75  175  5    10  0.824  0.830  0.801   0.832  110  18   0.565  0.570  0.563   0.259
25   1.00  175  5    10  0.984  0.984  0.974   0.985  140  14   0.841  0.845  0.829   0.143
50   0.25  150  5    10  0.177  0.184  0.168   0.173  60   33   0.088  0.095  0.094   0.089
50   0.50  150  5    10  0.594  0.611  0.578   0.588  100  20   0.269  0.275  0.266   0.326
50   0.75  150  5    10  0.893  0.905  0.887   0.891  110  18   0.565  0.570  0.561   0.329
50   1.00  175  5    10  0.990  0.990  0.987   0.992  140  14   0.841  0.852  0.829   0.149
75   0.25  150  5    5   0.213  0.221  0.206   0.208  60   33   0.088  0.087  0.093   0.124
75   0.50  150  5    10  0.614  0.626  0.605   0.610  100  20   0.269  0.274  0.269   0.345
75   0.75  100  14   10  0.897  0.901  0.891   0.905  110  18   0.565  0.565  0.567   0.332
75   1.00  175  5    10  0.990  0.990  0.986   0.991  140  14   0.841  0.838  0.828   0.149
100  0.25  150  5    5   0.229  0.240  0.219   0.221  60   33   0.088  0.089  0.098   0.141
100  0.50  125  5    10  0.623  0.638  0.618   0.621  100  20   0.269  0.283  0.266   0.354*
100  0.75  100  14   10  0.897  0.903  0.884   0.899  110  18   0.565  0.579  0.559   0.332
100  1.00  175  5    10  0.990  0.990  0.988   0.991  140  14   0.841  0.848  0.837   0.149

Table 3.3 (continued).

Resource N = 5000
Parameter  Two-stage design                           One-stage design                Improv.
ASP  ε     m    r    n1  F2     Indep  Markov  Comb.  m    n    F1     Indep  Markov  F2-F1
10   0.25  125  5    5   0.040  0.039  0.037   0.041  210  10   0.041  0.046  0.035   0.001
10   0.50  350  5    5   0.181  0.202  0.152   0.171  350  10   0.197  0.212  0.175   0.016
10   0.75  350  5    5   0.500  0.521  0.425   0.480  350  10   0.552  0.566  0.491   0.052
10   1.00  350  5    5   0.883  0.882  0.742   0.830  350  10   0.929  0.933  0.838   0.047
25   0.25  300  5    15  0.118  0.130  0.114   0.128  200  25   0.127  0.138  0.131   0.009
25   0.50  350  73   5   0.578  0.615  0.516   0.572  200  25   0.555  0.577  0.534   0.022
25   0.75  350  73   5   0.951  0.956  0.888   0.930  200  25   0.926  0.938  0.911   0.025
25   1.00  350  73   5   1.000  1.000  0.986   0.994  350  14   0.989  0.992  0.943   0.011
50   0.25  250  29   10  0.270  0.291  0.257   0.278  100  50   0.193  0.198  0.197   0.077
50   0.50  250  29   10  0.860  0.881  0.825   0.853  130  38   0.640  0.658  0.632   0.221
50   0.75  225  27   15  0.995  0.997  0.986   0.995  170  29   0.937  0.951  0.932   0.058
50   1.00  200  5    20  1.000  1.000  0.999   1.000  350  14   0.989  0.992  0.943   0.011
75   0.25  200  24   10  0.367  0.395  0.354   0.366  100  50   0.193  0.194  0.191   0.174
75   0.50  200  24   15  0.915  0.928  0.893   0.907  130  38   0.640  0.661  0.634   0.275
75   0.75  150  19   20  0.998  0.998  0.995   0.997  170  29   0.937  0.948  0.929   0.061
75   1.00  125  5    30  1.000  1.000  1.000   1.000  350  14   0.989  0.992  0.941   0.011
100  0.25  175  22   15  0.418  0.441  0.405   0.424  100  50   0.193  0.200  0.193   0.225
100  0.50  150  19   20  0.929  0.938  0.921   0.934  130  38   0.640  0.659  0.637   0.289
100  0.75  125  5    35  0.999  0.997  0.997   0.998  170  29   0.937  0.949  0.928   0.061
100  1.00  100  5    35  1.000  1.000  1.000   1.000  350  14   0.989  0.991  0.944   0.011

Table 3.3 (continued).

Resource N = 10,000
Parameter  Two-stage design                           One-stage design                Improv.
ASP  ε     m    r    n1  F2     Indep  Markov  Comb.  m    n    F1     Indep  Markov  F2-F1
10   0.25  125  5    5   0.040  0.042  0.042   0.044  210  10   0.041  0.043  0.046   0.001
10   0.50  350  5    5   0.181  0.200  0.156   0.175  350  10   0.197  0.219  0.179   0.016
10   0.75  350  5    5   0.500  0.515  0.422   0.474  350  10   0.552  0.570  0.493   0.052
10   1.00  350  5    5   0.883  0.885  0.739   0.830  350  10   0.929  0.928  0.835   0.047
25   0.25  300  5    15  0.118  0.127  0.114   0.129  350  25   0.132  0.148  0.129   0.014
25   0.50  350  73   5   0.578  0.606  0.523   0.579  350  25   0.647  0.685  0.590   0.069
25   0.75  350  73   5   0.951  0.958  0.900   0.930  350  25   0.974  0.982  0.934   0.023
25   1.00  350  73   5   1.000  0.999  0.985   0.993  350  25   0.997  1.000  0.993   0.003
50   0.25  350  73   5   0.294  0.326  0.269   0.279  200  50   0.295  0.314  0.281   0.000
50   0.50  325  101  10  0.907  0.927  0.852   0.902  200  50   0.894  0.912  0.872   0.013
50   0.75  325  101  10  0.999  1.000  0.990   0.996  350  28   0.985  0.992  0.958   0.014
50   1.00  225  5    20  1.000  1.000  0.999   1.000  350  28   0.997  1.000  0.994   0.003
75   0.25  275  59   15  0.449  0.488  0.430   0.467  140  71   0.361  0.381  0.356   0.088
75   0.50  275  59   15  0.976  0.980  0.953   0.969  160  62   0.918  0.937  0.906   0.059
75   0.75  250  5    35  1.000  1.000  0.996   1.000  350  28   0.985  0.992  0.956   0.014
75   1.00  125  5    30  1.000  1.000  1.000   1.000  350  28   0.997  1.000  0.996   0.003
100  0.25  250  29   25  0.560  0.600  0.527   0.561  100  100  0.386  0.408  0.386   0.174
100  0.50  225  27   30  0.990  0.993  0.977   0.990  160  62   0.918  0.934  0.912   0.073
100  0.75  150  5    40  1.000  1.000  0.999   1.000  350  28   0.985  0.991  0.957   0.015
100  1.00  100  5    35  1.000  1.000  1.000   1.000  350  28   0.997  1.000  0.996   0.003

Table 3.4. Optimal resource allocation in two-stage genome search for rare dominant gene.

Resource N = 1000
Parameter  Two-stage design                           One-stage design                Improv.
ASP  ε     m    r    n1  F2     Indep  Markov  Comb.  m    n    F1     Indep  Markov  F2-F1
10   0.25  50   5    5   0.028  0.029  0.027   0.030  50   10   0.028  0.026  0.029   0.000
10   0.50  50   5    5   0.036  0.034  0.037   0.037  50   10   0.037  0.035  0.039   0.000
10   0.75  100  5    5   0.050  0.053  0.051   0.059  100  10   0.053  0.059  0.053   0.003
10   1.00  175  5    5   0.082  0.084  0.082   0.092  100  10   0.082  0.078  0.080   0.000
25   0.25  50   5    5   0.042  0.043  0.039   0.040  50   20   0.036  0.036  0.041   0.005
25   0.50  75   5    10  0.063  0.064  0.066   0.067  50   20   0.053  0.054  0.053   0.009
25   0.75  125  5    5   0.119  0.122  0.121   0.124  50   20   0.075  0.076  0.074   0.044
25   1.00  125  5    5   0.201  0.205  0.197   0.204  50   20   0.103  0.096  0.106   0.098
50   0.25  50   5    10  0.053  0.053  0.052   0.053  50   20   0.036  0.037  0.039   0.016
50   0.50  100  5    5   0.091  0.093  0.088   0.089  50   20   0.053  0.055  0.058   0.037
50   0.75  100  5    5   0.168  0.168  0.162   0.162  50   20   0.075  0.078  0.070   0.092
50   1.00  100  5    5   0.270  0.264  0.258   0.262  50   20   0.103  0.104  0.102   0.166*
75   0.25  50   5    10  0.057  0.058  0.058   0.059  50   20   0.036  0.035  0.036   0.020
75   0.50  75   5    5   0.102  0.100  0.105   0.107  50   20   0.053  0.053  0.056   0.048
75   0.75  100  5    5   0.179  0.179  0.169   0.171  50   20   0.075  0.075  0.079   0.104
75   1.00  100  5    5   0.284  0.279  0.286   0.289  50   20   0.103  0.104  0.105   0.181*
100  0.25  50   5    5   0.058  0.055  0.059   0.058  50   20   0.036  0.039  0.040   0.022
100  0.50  75   5    5   0.107  0.104  0.103   0.102  50   20   0.053  0.053  0.053   0.054
100  0.75  75   5    5   0.184  0.184  0.183   0.183  50   20   0.075  0.080  0.081   0.109
100  1.00  100  5    5   0.286  0.288  0.286   0.291  50   20   0.103  0.110  0.110   0.183*

Table 3.4 (continued).

Resource N = 2000
Parameter  Two-stage design                           One-stage design                Improv.
ASP  ε     m    r    n1  F2     Indep  Markov  Comb.  m    n    F1     Indep  Markov  F2-F1
10   0.25  50   5    5   0.028  0.027  0.028   0.029  50   10   0.028  0.026  0.025   0.000
10   0.50  50   5    5   0.036  0.033  0.036   0.039  50   10   0.037  0.037  0.040   0.000
10   0.75  100  5    5   0.050  0.048  0.051   0.056  150  10   0.052  0.054  0.054   0.002
10   1.00  225  5    5   0.082  0.084  0.081   0.090  200  10   0.090  0.095  0.087   0.008
25   0.25  50   5    5   0.042  0.044  0.043   0.046  50   25   0.040  0.041  0.038   0.002
25   0.50  125  5    10  0.066  0.066  0.067   0.072  80   25   0.066  0.069  0.066   0.000
25   0.75  175  5    10  0.136  0.130  0.133   0.143  80   25   0.117  0.122  0.111   0.019
25   1.00  175  5    10  0.246  0.241  0.245   0.259  80   25   0.189  0.198  0.192   0.057
50   0.25  50   5    15  0.053  0.048  0.045   0.050  50   40   0.048  0.046  0.048   0.005
50   0.50  100  14   10  0.113  0.115  0.111   0.117  50   40   0.079  0.075  0.079   0.034
50   0.75  150  5    10  0.226  0.225  0.226   0.232  60   33   0.122  0.121  0.126   0.104
50   1.00  150  5    10  0.385  0.392  0.372   0.381  80   25   0.189  0.187  0.191   0.197*
75   0.25  50   5    25  0.062  0.058  0.061   0.066  50   40   0.048  0.047  0.050   0.014
75   0.50  125  5    10  0.137  0.138  0.132   0.133  50   40   0.079  0.080  0.081   0.058
75   0.75  125  5    10  0.274  0.282  0.269   0.276  60   33   0.122  0.123  0.121   0.151*
75   1.00  125  5    10  0.438  0.435  0.432   0.437  80   25   0.189  0.185  0.186   0.249*
100  0.25  50   5    25  0.068  0.072  0.066   0.068  50   40   0.048  0.044  0.047   0.020
100  0.50  125  5    5   0.152  0.154  0.149   0.148  50   40   0.079  0.077  0.078   0.073
100  0.75  125  5    10  0.293  0.295  0.285   0.288  60   33   0.122  0.124  0.124   0.170*
100  1.00  125  5    10  0.456  0.445  0.445   0.449  80   25   0.189  0.192  0.186   0.267*

Table 3.4 (continued).

Resource N = 5000
Parameter  Two-stage design                           One-stage design                Improv.
ASP  ε     m    r    n1  F2     Indep  Markov  Comb.  m    n    F1     Indep  Markov  F2-F1
10   0.25  50   5    5   0.028  0.029  0.028   0.031  50   10   0.028  0.029  0.029   0.000
10   0.50  50   5    5   0.036  0.036  0.035   0.039  50   10   0.037  0.040  0.037   0.000
10   0.75  100  5    5   0.050  0.055  0.052   0.058  150  10   0.052  0.052  0.051   0.002
10   1.00  225  5    5   0.082  0.084  0.080   0.089  240  10   0.089  0.085  0.088   0.007
25   0.25  50   5    5   0.042  0.040  0.043   0.045  50   25   0.040  0.042  0.038   0.002
25   0.50  125  5    10  0.066  0.067  0.062   0.067  180  25   0.069  0.073  0.067   0.004
25   0.75  225  27   5   0.141  0.141  0.132   0.144  200  25   0.159  0.160  0.145   0.018
25   1.00  300  63   5   0.275  0.270  0.252   0.287  200  25   0.301  0.302  0.286   0.026
50   0.25  50   5    15  0.053  0.043  0.049   0.056  50   50   0.052  0.052  0.056   0.000
50   0.50  200  43   5   0.136  0.141  0.132   0.138  100  50   0.129  0.135  0.130   0.007
50   0.75  250  29   10  0.325  0.333  0.306   0.321  100  50   0.269  0.275  0.260   0.055
50   1.00  250  29   10  0.572  0.570  0.532   0.560  100  50   0.457  0.464  0.446   0.115
75   0.25  75   40   5   0.062  0.050  0.052   0.065  60   75   0.063  0.064  0.058   0.000
75   0.50  175  22   10  0.200  0.202  0.196   0.202  70   71   0.144  0.144  0.140   0.056
75   0.75  200  24   10  0.447  0.453  0.416   0.429  80   62   0.279  0.279  0.280   0.168*
75   1.00  200  24   15  0.701  0.698  0.659   0.687  100  50   0.457  0.459  0.445   0.245*
100  0.25  100  23   15  0.074  0.056  0.057   0.074  50   100  0.071  0.071  0.069   0.003
100  0.50  150  19   10  0.246  0.111  0.101   0.137  70   71   0.144  0.147  0.148   0.102
100  0.75  175  22   15  0.510  0.513  0.480   0.502  80   62   0.279  0.285  0.284   0.231*
100  1.00  125  17   20  0.757  0.745  0.744   0.764  100  50   0.457  0.465  0.459   0.300*

Table 3.4 (continued).

Resource N = 10,000
Parameter  Two-stage design                           One-stage design                Improv.
ASP  ε     m    r    n1  F2     Indep  Markov  Comb.  m    n    F1     Indep  Markov  F2-F1
10   0.25  50   5    5   0.028  0.031  0.027   0.028  50   10   0.028  0.030  0.028   0.000
10   0.50  50   5    5   0.036  0.038  0.040   0.042  50   10   0.037  0.038  0.036   0.000
10   0.75  100  5    5   0.050  0.054  0.051   0.057  150  10   0.052  0.057  0.054   0.002
10   1.00  225  5    5   0.082  0.081  0.086   0.097  240  10   0.089  0.090  0.086   0.007
25   0.25  50   5    5   0.042  0.042  0.043   0.043  50   25   0.040  0.038  0.041   0.002
25   0.50  125  5    10  0.066  0.066  0.063   0.068  180  25   0.069  0.072  0.071   0.004
25   0.75  225  27   5   0.141  0.139  0.134   0.144  300  25   0.158  0.162  0.146   0.017
25   1.00  300  63   5   0.275  0.274  0.244   0.270  350  25   0.315  0.316  0.270   0.040
50   0.25  50   5    15  0.053  0.047  0.044   0.058  50   50   0.052  0.050  0.050   0.000
50   0.50  225  93   5   0.140  0.145  0.134   0.144  200  50   0.153  0.161  0.144   0.013
50   0.75  250  101  5   0.345  0.351  0.319   0.344  200  50   0.368  0.383  0.353   0.023
50   1.00  250  125  5   0.607  0.605  0.550   0.591  200  50   0.630  0.639  0.591   0.023
75   0.25  100  68   5   0.062  0.045  0.047   0.058  130  75   0.064  0.067  0.060   0.002
75   0.50  225  71   10  0.224  0.217  0.205   0.228  130  75   0.218  0.227  0.216   0.006
75   0.75  275  59   15  0.522  0.528  0.470   0.514  130  75   0.477  0.490  0.464   0.045
75   1.00  275  59   15  0.802  0.793  0.731   0.775  140  71   0.736  0.747  0.724   0.067
100  0.25  125  65   5   0.079  0.056  0.048   0.070  100  100  0.081  0.085  0.080   0.002
100  0.50  225  49   15  0.303  0.238  0.231   0.285  100  100  0.255  0.267  0.259   0.047
100  0.75  225  49   15  0.645  0.657  0.594   0.622  100  100  0.526  0.541  0.522   0.119
100  1.00  250  29   25  0.886  0.876  0.825   0.859  100  100  0.780  0.785  0.784   0.106

3.4 Type I Error and Power of Claiming Linkage

Tables 3.3 and 3.4 give the optimal designs and the probabilities of finding the locus linked to the responsible gene when the gene exists, but they do not provide information on the probability of causing a false conclusion when there is no gene responsible for the disease. The usual requirement is that the LOD score in Stage II should be greater than a certain threshold, t, in order to claim linkage.
Given R = r, we let the threshold be t, the 100(1 − α) percentile of the unique maximum, T, of r Binomial($n_2$, 0.25) random variables. The probability mass function of T is

\[
P(T = s) = P({}_2S_{I_0} = s \mid {}_2S_{I_0} \text{ is the unique maximum})
= \frac{P({}_2S_{I_0} = s,\ {}_2S_{I_0} \text{ is the unique maximum})}{P({}_2S_{I_0} \text{ is the unique maximum})}
= \frac{r f(s) F(s-1)^{r-1}}{\sum_{s'=1}^{n_2} r f(s') F(s'-1)^{r-1}} .
\]

The probability mass function for the marker group is

\[
P(T = s) = P({}_2S_{I_0} = {}_2S_{I_0+1} = s \mid {}_2S_{I_0} \text{ and } {}_2S_{I_0+1} \text{ are the unique maximum})
= \frac{P({}_2S_{I_0} = {}_2S_{I_0+1} = s,\ {}_2S_{I_0} \text{ and } {}_2S_{I_0+1} \text{ are the unique maximum})}{P({}_2S_{I_0} \text{ and } {}_2S_{I_0+1} \text{ are the unique maximum})}
= \frac{(r-1) f(s)^2 F(s-1)^{r-2}}{\sum_{s'=1}^{n_2} (r-1) f(s')^2 F(s'-1)^{r-2}} ,
\]

where f(s) and F(s) are the probability mass function and the cumulative distribution function of Binomial($n_2$, 0.25).

Tables 3.5 and 3.6 give the value of t for the unique maximum and for the marker group, respectively. To use Tables 3.5 and 3.6, first find your $n_2$ in the $n_2$ column, then find your R in the R column. If your R is not in the table, use the largest tabulated R that is smaller than your R for the same $n_2$. The value in the t column is the 95% percentile. The range of $n_2$ is from 5 to 95, and the range of R is from 5 to 300.

A linkage can be claimed if and only if ${}_2S_{I_0} > t$, where $I_0$ is one of the markers that was chosen in Stage II. Clearly,

\[
P(\text{linkage claim is incorrect}) \le P({}_2S_{I_0} > t \mid \text{no gene is responsible})\, P(\text{no gene is responsible})
+ P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible})\, P(\text{a gene is responsible}).
\]

The (prior) P(no gene is responsible) is usually unknown, but we can conclude that

\[
P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible}) \le P({}_2S_{I_0} > t \mid \text{no gene is responsible}).
\]

A proof is as follows. Let A denote the event {the gene is between marker $i_0$ and marker $i_0 + 1$}.
\begin{align}
& P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible}) \tag{3.42}\\
&= P({}_2S_{I_0} > {}_2S_j\ \forall j \ne I_0,\ {}_2S_{I_0} > t, \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \tag{3.43}\\
&= \sum_{k=t+1}^{n_2} P(k = {}_2S_{I_0} > {}_2S_j\ \forall j \ne I_0, \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \tag{3.44}\\
&= \sum_{k=t+1}^{n_2} P(k > {}_2S_j\ \forall j \ne I_0, i_0, \text{ or } i_0 + 1,\ k > {}_2S_{i_0},\ k > {}_2S_{i_0+1},\ {}_2S_{I_0} = k \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \tag{3.45}\\
&= \sum_{k=t+1}^{n_2} \big\{ P(k > {}_2S_j\ \forall j \ne I_0, i_0, \text{ or } i_0 + 1,\ I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \tag{3.46}\\
&\qquad \times P(k > {}_2S_{i_0},\ k > {}_2S_{i_0+1} \mid A) \tag{3.47}\\
&\qquad \times P({}_2S_{I_0} = k,\ I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \big\}, \tag{3.48}
\end{align}

Table 3.5. The 95% percentile of the unique maximum of R Binomial(n2, 0.25).

n2 R   t    n2 R   t    n2 R   t    n2 R   t    n2 R   t    n2 R   t

Table 3.6. The 95% percentile of the unique maximum marker group of R Binomial(n2, 0.25).

n2 R   t    n2 R   t    n2 R   t    n2 R   t    n2 R   t    n2 R   t
5  5   4    28 64  15   45 ?   18   59 108 26   72 95  30   84 ?   34
5  12  5    28 203 16   45 16  19   59 248 27   72 201 31   84 227 35
5  173 5*   29 5   12   45 33  20   60 5   21   73 5   25   85 5   28
6  5   4    29 8   13   45 76  21   60 7   22   73 8   26   85 6   29
6  6   5    29 17  14   45 193 22   60 11  23   73 12  27   85 10  30
6  38  6    29 44  15   46 5   17   60 20  24   73 21  28   85 16  31
7  5   5    29 131 16   46 7   18   60 40  25   73 39  29   85 27  32
7  16  6    30 5   12   46 13  19   60 83  26   73 76  30   85 49  33
7  132 7    30 6   13   46 26  20   60 187 27   73 158 31   85 92  34
8  5   5    30 13  14   46 57  21   61 5   21   74 5   25   85 181 35
8  8   6    30 31  15   46 140 22   61 6   22   74 7   26   86 5   28
8  43  7    30 87  16   47 5   17   61 10  23   74 11  27   86 6   29
9  5   6    30 278 17   47 6   18   61 17  24   74 18  28   86 9   30
9  21  7    31 5   13   47 10  19   61 32  25   74 32  29   86 14  31
9  130 8    31 10  14   47 21  20   61 65  26   74 62  30   86 23  32
10 5   6    31 23  15   47 44  21   61 143 27   74 125 31   86 41  33
10 12  7    31 60  16   47 103 22   62 5   22   74 265 32   86 75  34
10 54  8    31 179 17   47 262 23   62 8   23   75 5   25   86 146 35
11 5   6    32 5   13   48 5   18   62 14  24   75 6   26   86 296 36
11 7   7    32 8   14   48 9   19   62 26  25   75 9   27   87 5   29
11 28  8    32 17  15   48 17  20   62 52  26   75 15  28   87 8   30
11 151 9    32 43  16   48 34  21   62 110 27   75 27  29   87 12  31
12 5   7    32 120 17   48 77  22   62 249 28   75 50  30   87 20  32
12 16  8    33 5   13   48 189 23   63 5   22   75 99  31   87 34  33
12 70  9    33 7   14   49 5   18   63 7   23   75 207 32   87 62  34
13 5   7    33 13  15   49 7   19   63 12  24   76 5   26   87 118 35
13 10  8    33 31  16   49 13  20   63 22  25   76 8   27   87 235 36
13 37  9    33 82  17   49 27  21   63 42  26   76 13  28   88 5   29
13 187 10   33 246 18   49 58  22   63 86  27   76 23  29   88 7   30
14 5   7    34 5   14   49 138 23   63 189 28   76 41  30   88 10  31
14 7   8    34 10  15   50 5   18   64 5   22   76 80  31   88 17  32
14 22  9    34 23  16   50 6   19   64 6   23   76 163 32   88 29  33
14 93  10   34 58  17   50 11  20   64 10  24   77 5   26   88 51  34
15 5   8    34 164 18   50 21  21   64 18  25   77 7   27   88 96  35
15 14  9    35 5   14   50 45  22   64 34  26   77 11  28   88 188 36
15 51  10   35 8   15   50 103 23   64 68  27   77 19  29   89 5   29
15 240 11   35 18  16   50 255 24   64 146 28   77 34  30   89 6   30
16 5   8    35 42  17   51 5   19   65 5   23   77 65  31   89 9   31
16 9   9    35 113 18   51 9   20   65 9   24   77 129 32   89 15  32
16 30  10   36 5   14   51 17  21   65 15  25   77 272 33   89 24  33
16 124 11   36 7   15   51 35  22   65 28  26   78 5   26   89 43  34
17 5   8    36 14  16   51 78  23   65 54  27   78 6   27   89 79  35
17 7   9    36 31  17   51 186 24   65 113 28   78 10  28   89 152 36
17 19  10   36 79  18   52 5   19   65 252 29   78 16  29   90 5   29
17 69  11   36 225 19   52 8   20   66 5   23   78 29  30   90 6   30
18 5   9    37 5   14   52 14  21   66 8   24   78 53  31   90 8   31
18 13  10   37 6   15   52 28  22   66 13  25   78 103 32   90 13  32
18 41  11   37 11  16   52 60  23   66 23  26   78 213 33   90 21  33
18 166 12   37 24  17   52 138 24   66 44  27   79 5   26   90 36  34
19 5   9    37 57  18   53 5   19   66 89  28   79 6   27   90 65  35
19 9   10   37 154 19   53 7   20   66 193 29   79 9   28   90 123 36
19 26  11   38 5   15   53 12  21   67 5   23   79 14  29   90 244 37
19 94  12   38 9   16   53 22  22   67 7   24   79 24  30   91 5   30
20 5   9    38 18  17   53 47  23   67 11  25   79 44  31   91 7   31
20 7   10   38 42  18   53 104 24   67 19  26   79 83  32   91 11  32
20 18  11   38 108 19   53 251 25   67 35  27   79 168 33   91 18  33
20 57  12   39 5   15   54 5   19   67 70  28   80 5   27   91 31  34
20 223 13   39 7   16   54 6   20   67 149 29   80 8   28   91 54  35
21 5   10   39 14  17   54 10  21   68 5   23   80 12  29   91 101 36
21 13  11   39 32  18   54 18  22   68 6   24   80 20  30   91 196 37
21 36  12   39 77  19   54 37  23   68 9   25   80 36  31   92 5   30
21 129 13   39 210 20   54 79  24   68 16  26   80 68  32   92 7   31
22 5   10   40 5   15   54 185 25   68 29  27   80 134 33   92 10  32
22 9   11   40 6   16   55 5   20   68 56  28   80 279 34   92 16  33
22 24  12   40 12  17   55 8   21   68 116 29   81 5   27   92 26  34
22 78  13   40 24  18   55 15  22   68 255 30   81 7   28   92 45  35
23 5   10   40 57  19   55 29  23   69 5   24   81 11  29   92 83  36
23 7   11   40 147 20   55 61  24   69 8   25   81 17  30   92 159 37
23 17  12   41 5   16   55 139 25   69 14  26   81 30  31   93 5   30
23 50  13   41 9   17   56 5   20   69 24  27   81 56  32   93 6   31
23 176 14   41 19  18   56 7   21   69 46  28   81 108 33   93 9   32
24 5   10   41 42  19   56 13  22   69 92  29   81 220 34   93 14  33
24 6   11   41 105 20   56 24  23   69 197 30   82 5   27   93 22  34
24 13  12   41 286 21   56 48  24   70 5   24   82 6   28   93 38  35
24 34  13   42 5   16   56 105 25   70 7   25   82 9   29   93 69  36
24 108 14   42 8   17   56 248 26   70 12  26   82 15  30   93 129 37
25 5   11   42 15  18   57 5   20   70 20  27   82 26  31   93 254 38
25 9   12   42 32  19   57 6   21   70 37  28   82 46  32   94 5   31
25 23  13   42 77  20   57 11  22   70 73  29   82 87  33   94 8   32
25 69  14   42 200 21   57 19  23   70 153 30   82 174 34   94 12  33
25 240 15   43 5   16   57 38  24   71 5   24   83 5   28   94 19  34
26 5   11   43 7   17   57 81  25   71 6   25   83 8   29   94 32  35
26 7   12   43 12  18   57 186 26   71 10  26   83 13  30   94 57  36
26 17  13   43 25  19   58 5   21   71 17  27   83 22  31   94 106 37
26 46  14   43 57  20   58 9   22   71 31  28   83 38  32   94 204 38
26 148 15   43 142 21   58 16  23   71 59  29   83 71  33   95 5   31
27 5   11   44 5   16   58 31  24   71 120 30   83 140 34   95 7   32
27 6   12   44 6   17   58 63  25   71 260 31   83 287 35   95 11  33
27 13  13   44 10  18   58 140 26   72 5   24   84 5   28   95 17  34
27 32  14   44 20  19   59 5   21   72 6   25   84 7   29   95 28  35
27 95  15   44 43  20   59 8   22   72 9   26   84 11  30   95 48  36
28 5   12   44 103 21   59 13  23   72 14  27   84 18  31   95 87  37
28 10  13   44 272 22   59 25  24   72 25  28   84 32  32   95 166 38
28 23  14   45 5   17   59 50  25   72 48  29   84 59  33

and

\begin{align}
& P({}_2S_{I_0} > t \mid \text{no gene is responsible}) \tag{3.49}\\
&= \sum_{k=t+1}^{n_2} \big\{ P(k > {}_2S_j\ \forall j \ne I_0, i_0, \text{ or } i_0 + 1, \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid \text{no gene}) \tag{3.50}\\
&\qquad \times P(k > {}_2S_{i_0},\ k > {}_2S_{i_0+1} \mid \text{no gene}) \tag{3.51}\\
&\qquad \times P({}_2S_{I_0} = k \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid \text{no gene}) \big\}. \tag{3.52}
\end{align}

Eq. (3.47) is less than Eq.
(3.51), and the other corresponding terms are equal. Therefore,

\[
P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible}) \le P({}_2S_{I_0} > t \mid \text{no gene is responsible}),
\]

and thus

\begin{align*}
P(\text{linkage claim is incorrect}) &\le P({}_2S_{I_0} > t \mid \text{no gene is responsible})\\
&= \sum_{\substack{\text{all possible results}\\ \text{of Stage I}}} \Big\{ \sum_{i=0}^{m} P({}_2S_i > t \mid I_0 = i)\, P(I_0 = i) \Big\}\, P(\text{the result of Stage I})\\
&\le \sum_{i=0}^{m} \alpha\, P(I_0 = i).
\end{align*}

3.5 Discussion

The Monte Carlo results indicated that the relative errors between the probabilities calculated from the formulas and those from the Markov chain simulations were under 7% in the dominant cases and under 15% in the recessive cases. Consequently, the approximation using the independence assumption for dependent marker loci was acceptable. The simulation studies also showed that, in the dominant cases, combining Stage I data with Stage II data did not have any significant advantage: the probability of locating the correct marker increased by less than 3%. For the recessive cases, there was some advantage when the ASPs were few; the probability of locating the correct marker could increase by as much as 10%. However, it is very difficult to combine data in the theoretical derivation.

The power to find the gene depends on the exact gene location between markers. Since Tables 3.3 and 3.4 were constructed under the least favorable configuration, the actual power should be higher.

The two-stage approach indeed boosted the probability of finding the correct gene location under resource constraints. In many cases this probability can increase by 20% to 30%. In searching for recessive disease genes, there were several instances in which the improvement exceeded 35% (Table 3.3). However, if there are enough resources that almost all available markers can be typed on all ASPs, then the one-stage approach may have higher power (see, e.g., N = 5000 and ASP = 10, or N = 10,000 and ASP = 10 or 25, in Table 3.3). Since the power loss in those cases is so small, we may always choose the optimal two-stage design.
As shown in Tables 3.3 and 3.4, it requires many more resources to locate a dominant than a recessive disease gene. For example, in a two-stage design with 25 ASPs and N = 1000 we can locate a recessive disease gene with probability 0.88 when ε = 1. For a dominant disease gene, however, more than 100 ASPs and/or more than N = 10,000 are needed to achieve the same probability.

Another point worth noting is that phenocopy can severely reduce the probability of finding the correct gene location. For example, to locate a recessive gene with resource N = 5000, 25 ASPs, and ε = 1, the chance of finding it is 99.9%. However, when ε = 0.5, even if we double the resource to N = 10,000 and 50 ASPs, the chance of finding the correct locus is only 90%. Although both cases have about 25 ASPs whose disease is caused by the gene, the extra phenocopy ASPs reduce the probability considerably.

CHAPTER 4
TWO-STAGE GENOME SEARCH FOR COMPLEX DISEASE

In the previous chapter, we focused on finding a single disease gene. However, genetic diseases are not always single-gene diseases. Many of them are complex diseases, i.e., diseases caused by several genes. For example, insulin-dependent diabetes mellitus (IDDM) is influenced by a number of susceptibility genes and environmental factors (Luo et al., 1995). A disease phenotype controlled by genes at several different loci is considered to have "nonallelic heterogeneity" and "when this disease is relatively rare and mutation rates are low, individuals within a family are generally homogeneous. Locus heterogeneity then leads to the situation that the recombination fraction between disease phenotype and marker will be different in different families" (Ott, 1991, p. 199). This chapter discusses the probability of finding two unlinked recessive genes that may cause the same disease in a two-stage search under the following genetic model.

4.1 Genetic Model and Assumptions

Throughout this chapter, we assume the following:

1. No epistasis, i.e., no interaction between genes.
2. When an individual carries the two disease genes, the penetrance is additive, i.e., the probability of being affected for an individual who has two recessive genes at two loci is twice as high as for an individual who has two recessive genes at only one locus.
3. For an individual who has no disease gene, the probability of being affected is $k_0$ times the probability of being affected for an individual who has two recessive genes at one locus.
4. Receiving one gene has no effect on receiving other genes.

Other assumptions, necessary for simplifying the analytic study and similar to those in the previous chapter, are:

1. There are two alleles at each gene locus, denoted d (disease gene) and D (normal gene).
2. The population frequency of the disease gene is p for both loci.
3. Genes are in the middle of two adjacent markers.
4. In this study, only three maps, 5 cM (centimorgan), 10 cM, and 20 cM, are available, and every marker is highly polymorphic.
5. Marker positions on each chromosome are at 0 cM, 5 cM, 10 cM, ..., for the 5 cM map; 0 cM, 10 cM, ..., for the 10 cM map; and 0 cM, 20 cM, 40 cM, ..., for the 20 cM map.
6. The cost of typing alleles is a constant.

4.2 Two-stage Genome Search

For simplicity, we use a slightly different two-stage approach for searching for complex disease genes. In the first stage, we choose a threshold for the statistics instead of choosing a number of loci. In the second stage, we choose the loci on different chromosomes where the statistics have the highest overall value. Details are as follows.

Following Chapter 3, suppose there are n ASPs and there are enough resources to type N markers for ASPs. Again, there are three numbers that must be determined in a two-stage design: $n_1$ and m, the number of ASPs and the number of markers to be used in Stage I, and k, the threshold for markers to be studied in Stage II. Define

\[
{}_1S_i = \sum_{j=1}^{n_1} X_{ij} \tag{4.1}
\]

in the same way as in Chapter 3, where i is the index of loci, j is the index of ASPs, and $X_{ij}$ is defined by assumption 6 in Section 3.1. If in Stage I marker i has a ${}_1S_i$ value higher than the threshold k, we study this marker again in Stage II. Let the number of markers that passed Stage I be R. In Stage II, R markers on $N_2$ ASPs are to be typed, where $N_2$ is the largest number allowed by the resource constraint and R. Since R is a random variable, $N_2$ is also a random variable. Thus, $N_2$ is the largest x such that $m n_1 + R x \le N$. We define

\[
{}_2S_i = \sum_{j=n_1+1}^{n_1+N_2} X_{ij} \tag{4.2}
\]

for Stage II. Then markers that meet the following criteria are declared to have a gene nearby:

1. A marker with the uniquely highest ${}_2S_i$ is claimed to be the marker nearest to the disease gene.
2. A marker group (see page 42) that has the uniquely highest score is claimed to have the gene lying between its markers.
3. If two markers or marker groups have the same highest score but are on different chromosomes, then each is declared to have a gene nearby in the corresponding way above.
4. If none of the above applies, then the gene location is considered undetermined.

Once the location(s) has (have) been chosen, the next step, as in Chapter 3, is to check whether we can claim linkage. Let t be the 100(1 − α) percentile of the maximum of r Binomial($n_2$, 0.25) random variables. If ${}_2S_i$ of the chosen location(s) is (are) greater than t, then we claim there is linkage at that location (those locations). Since we imposed some restrictions on declaring a marker to have a gene nearby, the actual type I error will be less than α. The 95% percentiles for the unique maximum and for the marker group were given in Tables 3.5 and 3.6, respectively.

4.3 Probability of Allocating the Correct Marker for a Complex Disease

In this section we discuss an analytical approach and some simulation results. Section 4.3.1 is a preparation for the discussion.
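As a side note on Section 4.2, the two quantities that drive Stage II, the sample size $N_2$ and the linkage threshold t, can be sketched numerically. The code below is a minimal illustration, not the program used for Tables 3.5 and 3.6: the function names are hypothetical, $n_2 \ge 1$ is assumed, the percentile is taken as the smallest s whose cumulative probability reaches 1 − α, and the probability mass function is the single-marker form from Section 3.4, proportional to $r f(s) F(s-1)^{r-1}$. The tabulated values may follow a different rounding convention.

```python
from math import comb


def stage_two_sample_size(N: int, m: int, n1: int, R: int) -> int:
    """Largest x with m*n1 + R*x <= N; 0 if no marker passed or the budget is spent."""
    if R <= 0:
        return 0
    return max((N - m * n1) // R, 0)


def unique_max_pmf(r: int, n2: int, p: float = 0.25) -> list:
    """P(T = s) proportional to r * f(s) * F(s-1)^(r-1), normalized over s = 0..n2."""
    f = [comb(n2, s) * p**s * (1 - p) ** (n2 - s) for s in range(n2 + 1)]
    weights = []
    cdf_below = 0.0  # F(s-1), which is 0 for s = 0
    for s in range(n2 + 1):
        weights.append(r * f[s] * cdf_below ** (r - 1))
        cdf_below += f[s]
    total = sum(weights)
    return [w / total for w in weights]


def percentile_of_unique_max(r: int, n2: int, alpha: float = 0.05) -> int:
    """Smallest s with P(T <= s) >= 1 - alpha (one possible percentile convention)."""
    acc = 0.0
    for s, q in enumerate(unique_max_pmf(r, n2)):
        acc += q
        if acc >= 1 - alpha:
            return s
    return n2


# Budget N = 1000, Stage I with m = 125 markers on n1 = 5 ASPs, R = 5 markers passed.
print(stage_two_sample_size(1000, 125, 5, 5))
print(percentile_of_unique_max(r=5, n2=5))
```

With these inputs the first call gives 75 ASPs for Stage II, since 125 × 5 = 625 typings are spent in Stage I and the remaining 375 are divided among 5 markers.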
4.3.1 Possible Parental Genotypes and Trait IBD Distribution

For a disease gene locus, say locus i, let d and D denote the disease allele and the normal allele, respectively. Let the population frequency of d be p, and let PG denote the parental genotype. Let

$E_i = (e_{i1}, e_{i2}) =$
(1, 1), if both sibs received two recessive genes at locus i;
(1, 0), if sib 1 received two recessive genes and sib 2 did not at locus i;
(0, 1), if sib 2 received two recessive genes and sib 1 did not at locus i;
(0, 0), if neither sib received two recessive genes at locus i.

Then the distribution of the trait IBD, $I_t$, conditional on the parental genotype and E is given in Table 4.1. The numbers in the table are easy to verify. For example, for parental genotype 2, if the order of the chromosomes is specified, say 1, 2, 3, and 4, where 1 and 2 belong to the father and 3 and 4 belong to the mother, then the probability of the parents having Dd and dd, where D is on chromosome 4, is $p^3(1-p)$. Permuting the order of the chromosomes gives 4 different permutations; therefore, the probability of the parents having genotypes Dd and dd is $4p^3(1-p)$ in a random mating population. When the parental genotype is Dd and dd, the possible sib genotypes are Dd and dd with equal frequency; therefore,

\[
P(E_i = (1,1) \mid PG = 2) = P(E_i = (1,0) \mid PG = 2) = P(E_i = (0,1) \mid PG = 2) = P(E_i = (0,0) \mid PG = 2) = \tfrac{1}{4}.
\]

In order to illustrate the conditional distribution of the trait IBD score, the parental genotypes are written with subscripts as $d_1 d_2$ and $D d_3$. When the parental genotype is dd and Dd and $E_i = (0,0)$, a sib's genotype can only be $D d_1$ or $D d_2$ with equal chance; therefore,

\[
P(I_t = 2 \mid PG = 2, E_i = (0,0)) = P(I_t = 1 \mid PG = 2, E_i = (0,0)) = \tfrac{1}{2}.
\]

4.3.2 Analytic Approach

Assume there are 2 unlinked genes, namely G1 and G2. Without loss of generality, let $X_{1j}$ and $X_{2j}$ be the statistics of the markers next to G1, and $X_{3j}$ and $X_{4j}$ be the statistics of the markers next to G2, for ASP j; and let ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ be the respective sums of the X's in Stage I, as defined at the beginning of this section.
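The conditional probabilities described in Section 4.3.1 can be checked by brute-force enumeration. The sketch below is an illustration only: the function name `trait_ibd_table` and the encoding of a parent as a two-character string are assumptions made here. It enumerates the 16 equally likely transmission patterns (each sib receives one of two chromosomes from each parent), records whether each sib is dd, and counts the chromosomes shared identical by descent.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def trait_ibd_table(father: str, mother: str):
    """Return (P(E | PG), P(I_t | PG, E)) for parents such as 'Dd' and 'dd'."""
    joint = Counter()
    # f1, m1: chromosome indices sib 1 receives from father and mother; f2, m2: sib 2.
    for f1, m1, f2, m2 in product((0, 1), repeat=4):
        e1 = int(father[f1] == "d" and mother[m1] == "d")  # sib 1 is dd
        e2 = int(father[f2] == "d" and mother[m2] == "d")  # sib 2 is dd
        ibd = int(f1 == f2) + int(m1 == m2)  # parental chromosomes shared IBD
        joint[((e1, e2), ibd)] += Fraction(1, 16)
    p_e = Counter()
    for (e, _), pr in joint.items():
        p_e[e] += pr
    cond = {(e, ibd): pr / p_e[e] for (e, ibd), pr in joint.items()}
    return dict(p_e), cond


# Parental genotype 2 (Dd and dd).
p_e, cond = trait_ibd_table("Dd", "dd")
print(p_e[(0, 0)], cond[((0, 0), 2)], cond[((0, 0), 1)])
```

For parental genotype 2 this reproduces the values 1/4 for each E vector and 1/2 for both $I_t = 2$ and $I_t = 1$ given E = (0, 0), as stated in Section 4.3.1. Exact fractions are used so that the comparison does not depend on floating-point rounding.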
Also, let $I_1$ and $I_2$ be the IBD scores of G1 and G2 for the affected sib pair. Let the penetrance for carrying two recessive genes at one locus be A, at two loci be 2A, and at no locus be $k_0 A$, where $k_0$ is the relative risk of being affected for an individual carrying no gene compared with an individual carrying two recessive genes at one locus. A is unknown but is assumed to be small; it cancels out in the formula. Then

\begin{align}
& P(\text{a gene or both genes are found}) \notag\\
&= P(G_1 \text{ is found and } G_2 \text{ is not}) + P(G_2 \text{ is found and } G_1 \text{ is not}) + P(G_1 \text{ and } G_2 \text{ are found}) \notag\\
&= \{ P(G_1 \text{ passes Stage I, } G_2 \text{ does not pass Stage I, } G_1 \text{ is found}) \tag{4.3}\\
&\qquad + P(G_1 \text{ and } G_2 \text{ pass Stage I, } G_1 \text{ is found}) \} \tag{4.4}\\
&\quad + \{ P(G_2 \text{ passes Stage I, } G_1 \text{ does not pass Stage I, } G_2 \text{ is found}) \tag{4.5}\\
&\qquad + P(G_1 \text{ and } G_2 \text{ pass Stage I, } G_2 \text{ is found}) \} \tag{4.6}\\
&\quad + P(G_1 \text{ and } G_2 \text{ pass Stage I, } G_1 \text{ and } G_2 \text{ are found}). \tag{4.7}
\end{align}

For a given threshold k and the event {G1 is found but G2 is not}, the possible relationships between k and ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ in Stage I, and the corresponding relationships among ${}_2S_1$, ${}_2S_2$, ${}_2S_3$, and ${}_2S_4$ in Stage II, are given in Table 4.2. For example, for event 6, given that G1 is found but G2 is not, in Stage I ${}_1S_1 > k$, ${}_1S_2 > k$, ${}_1S_3 < k$, and ${}_1S_4 > k$; then in Stage II the maximum of ${}_2S_1$ and ${}_2S_2$ must be the largest value and also larger than ${}_2S_4$.

For a given threshold k and the event {G2 is found but G1 is not}, the possible relationships between k and ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ in Stage I, and the corresponding relationships in Stage II, are given in Table 4.3. For a given threshold k and the event {G1 and G2 are found}, they are given in Table 4.4.

For a given threshold k, the sum of probabilities (4.3) through (4.7) is

\[
\sum_{l=0}^{m-4} \sum_{i=1}^{33} \sum_{h=1}^{N_2} P(\text{event } i, \text{ and } l \text{ of the } {}_1S_j \text{ pass Stage I, the largest statistic} = h \text{ in Stage II}).
\]

Table 4.1. Trait IBD distribution conditional on parental genotype and E vector. Combinations of PG and E that are not listed have P(E | PG) = 0.

PG          P(PG)          E      P(E | PG)  P(It=2 | PG, E)  P(It=1 | PG, E)  P(It=0 | PG, E)
1. dd dd    p^4            (1,1)  1          1/4              1/2              1/4
2. Dd dd    4p^3(1-p)      (1,1)  1/4        1/2              1/2              0
                           (1,0)  1/4        0                1/2              1/2
                           (0,1)  1/4        0                1/2              1/2
                           (0,0)  1/4        1/2              1/2              0
3. Dd Dd    4p^2(1-p)^2    (1,1)  1/16       1                0                0
                           (1,0)  3/16       0                2/3              1/3
                           (0,1)  3/16       0                2/3              1/3
                           (0,0)  9/16       1/3              4/9              2/9
4. DD dd    2p^2(1-p)^2    (0,0)  1          1/4              1/2              1/4
5. DD Dd    4p(1-p)^3      (0,0)  1          1/4              1/2              1/4
6. DD DD    (1-p)^4        (0,0)  1          1/4              1/2              1/4

Table 4.2. Exclusive events for the case "G1 is found but G2 is not."

       Stage I (relation to k)   Stage II
event  1S1  1S2  1S3  1S4        the largest statistics and relationship
1      >    >    <    <          max(2S1, 2S2)
2      >    <    <    <          2S1
3      <    >    <    <          2S2
4      >    >    >    >          max(2S1, 2S2), and > max(2S3, 2S4)
5      >    >    >    <          max(2S1, 2S2), and > 2S3
6      >    >    <    >          max(2S1, 2S2), and > 2S4
7      >    <    >    >          2S1, and > max(2S3, 2S4)
8      <    >    >    >          2S2, and > max(2S3, 2S4)
9      >    <    >    <          2S1, and > 2S3
10     >    <    <    >          2S1, and > 2S4
11     <    >    >    <          2S2, and > 2S3
12     <    >    <    >          2S2, and > 2S4

Table 4.3. Exclusive events for the case "G2 is found but G1 is not."

       Stage I (relation to k)   Stage II
event  1S1  1S2  1S3  1S4        the largest statistics and relationship
13     <    <    >    >          max(2S3, 2S4)
14     <    <    >    <          2S3
15     <    <    <    >          2S4
16     >    >    >    >          max(2S3, 2S4), and > max(2S1, 2S2)
17     >    <    >    >          max(2S3, 2S4), and > 2S1
18     <    >    >    >          max(2S3, 2S4), and > 2S2
19     >    >    >    <          2S3, and > max(2S1, 2S2)
20     >    >    <    >          2S4, and > max(2S1, 2S2)
21     >    <    >    <          2S3, and > 2S1
22     >    <    <    >          2S4, and > 2S1
23     <    >    >    <          2S3, and > 2S2
24     <    >    <    >          2S4, and > 2S2

Table 4.4. Exclusive events for the case "G1 and G2 are found."
Stage I Stage II event 1S1 182 183 184 the largest statistics and relationship 25 > > > > max(2S3, 2S4)= max(2SI, 2S2) 26 > < > > max(2S3, 2S'4)= 281 27 < > > > max(2S3, 2S4)= 2S2 28 > > > K max(2S1, 2S2)= 283 29 > > < > max(2SI, 2S2)= 2S4 30 > < > < 283=281 31 > < < > 2S4=281 32 < > > < 2S3= 2S2 33 < > < > 2S4=2S2 where N2 = [R ]Int For example for event 25; P(event 25, and I 1Sis pass Stage I) = P(1S > k, 1S2 > k, 1S3>k, 1S4 > k, I Sj >k max(2S3, 2S4)= max(2S1, 2S2)=h, and all other I 2Sjs < h) (with assumption of independence) = P(1S1 > k, 152 > k, 153 > k, IS4 > k)P(l 1Sj pass Stage I) P(max(2S3, 2S4)= max(2SI, 2S2)=h)P(all other I 2Sjs < h), where j 5 1,2,3,4. The distribution of 2Sj is Binomial(N2, 0.25). Therefore, if we want to know probabilities (4.3) through (4.7), we need to know the joint distribution of 1S1, IS2, 1S3 and 1S4 conditional on both sibs are affected. In order to calculate this joint distribution, we need to know the joint distribution of Xij and X2j and X3j and X4j conditional on both sibs are affected, which is, Pxyuv (4.8) = P(Xlj = x, X2j = y, X3j = u, X4j = v I both affected) (4.9) 2 2 = ) T P(X1j = x, X2j = y, X3j = u, X4j = v I = i1, /2 = i2 1 both affected) il =0 i2=0 2 2 = E f{P(Xlj =x, X2j = y, X3j =u, X4j =v 1i = i 1,2 = i2, both affected) i1 =0 i2=0 P(I1 = ii, 12 = i I both affected)} (4.10) 2 2 = > P(Xl, = X, X2j = Y I = ii)P(X3j = u,X4j = v 2 = i2) i1=0 12=0 P(1 = i1, 12 = i2 both affected) (4.11) Table 4.5. Conditional distribution of Xl, and X2j given I1i. The probability P(Xij = x, X2j = y I I, = i) in general case, i.e. gene is anywhere in between 2 markers, can be derived from Table 2.3, and is given in Table 4.5. Where Tj = 09+(1oj)2, j=1,2, 01 and 02 are the recombination fraction between gene and marker 1, and between gene and marker 2, respectively. The probability P(Xaj = x, X4j = y 1 12 = i) is same except using different 0s. 
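The mixture in (4.11) can be sketched numerically as follows. This is only an illustrative sketch, not the author's program. It assumes that each $X$ is a 0/1 marker indicator (defined earlier in the chapter) and that the Table 4.5 entries factor into per-marker terms $\Psi^2$, $\Psi(1-\Psi)$, and $(1-\Psi)^2$ for $I = 2, 1, 0$. The function names and the example joint distribution of $(I_1, I_2)$ are hypothetical; in the text that joint distribution comes from the derivation conditional on both sibs being affected.

```python
from itertools import product


def psi(theta):
    """Psi = theta^2 + (1 - theta)^2, as defined in the text."""
    return theta ** 2 + (1.0 - theta) ** 2


def p_marker_one(i, theta):
    """P(marker indicator = 1 | gene IBD score = i) for one flanking marker.

    Assumed per-marker factor of Table 4.5: Psi^2, Psi(1 - Psi), (1 - Psi)^2.
    """
    s = psi(theta)
    return {2: s * s, 1: s * (1.0 - s), 0: (1.0 - s) ** 2}[i]


def pair_dist(i, theta_a, theta_b):
    """P(X_a = x, X_b = y | I = i) as a dict keyed by (x, y)."""
    qa = p_marker_one(i, theta_a)
    qb = p_marker_one(i, theta_b)
    return {
        (x, y): (qa if x else 1.0 - qa) * (qb if y else 1.0 - qb)
        for x, y in product((0, 1), repeat=2)
    }


def p_xyuv(joint_ibd, thetas):
    """Equation (4.11): mix the marker distributions over (I1, I2).

    joint_ibd[(i1, i2)] is P(I1 = i1, I2 = i2 | both affected), supplied
    by the caller; thetas = (theta1, theta2, theta3, theta4).
    """
    t1, t2, t3, t4 = thetas
    out = {key: 0.0 for key in product((0, 1), repeat=4)}
    for (i1, i2), weight in joint_ibd.items():
        d12 = pair_dist(i1, t1, t2)  # markers flanking G1
        d34 = pair_dist(i2, t3, t4)  # markers flanking G2
        for (x, y), a in d12.items():
            for (u, v), b in d34.items():
                out[(x, y, u, v)] += a * b * weight
    return out


# Hypothetical input for illustration only: independent IBD scores with
# the unconditional (1/4, 1/2, 1/4) distribution at both loci.
null = {0: 0.25, 1: 0.5, 2: 0.25}
joint = {(i1, i2): null[i1] * null[i2] for i1 in null for i2 in null}
dist = p_xyuv(joint, (0.05, 0.05, 0.10, 0.10))
```

With $\theta = 0.5$ every per-marker term equals 0.25 whatever the value of $I$, which agrees with the Binomial($N_2$, 0.25) distribution used above for statistics at unlinked markers.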
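Table 4.5. Conditional distribution of $X_{1j}$ and $X_{2j}$ given $I_1$.

$i$   $x$   $y$   $P(X_{1j} = x, X_{2j} = y \mid I_1 = i)$
2     0     0     $(1 - \Psi_1^2)(1 - \Psi_2^2)$
2     1     0     $\Psi_1^2 (1 - \Psi_2^2)$
2     0     1     $(1 - \Psi_1^2)\Psi_2^2$
2     1     1     $\Psi_1^2 \Psi_2^2$
1     0     0     $(1 - \Psi_1 + \Psi_1^2)(1 - \Psi_2 + \Psi_2^2)$
1     1     0     $\Psi_1(1 - \Psi_1)(1 - \Psi_2 + \Psi_2^2)$
1     0     1     $(1 - \Psi_1 + \Psi_1^2)\Psi_2(1 - \Psi_2)$
1     1     1     $\Psi_1(1 - \Psi_1)\Psi_2(1 - \Psi_2)$
0     0     0     $(2\Psi_1 - \Psi_1^2)(2\Psi_2 - \Psi_2^2)$
0     1     0     $(1 - \Psi_1)^2(2\Psi_2 - \Psi_2^2)$
0     0     1     $(2\Psi_1 - \Psi_1^2)(1 - \Psi_2)^2$
0     1     1     $(1 - \Psi_1)^2(1 - \Psi_2)^2$

The probability $P(I_1 = i_1, I_2 = i_2 \mid \text{both affected})$ is equal to

\begin{align*}
&\frac{P(I_1 = i_1, I_2 = i_2, \text{both affected})}{P(\text{both affected})} \\
&= \frac{\sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} P(I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2, \text{both affected})}{P(\text{both affected})} \\
&= \Big\{\sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} P(\text{both affected} \mid I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2) \\
&\qquad \times P(I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2)\Big\} \Big/ P(\text{both affected}) \\
&= \frac{\sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} P(\text{both affected} \mid E_1, E_2)\, P(I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2)}{P(\text{both affected})} \\
&= \frac{\sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} P(\text{both affected} \mid E_1, E_2)\, P(I_1 = i_1, E_1, PG_1)\, P(I_2 = i_2, E_2, PG_2)}{P(\text{both affected})} \\
&= \frac{1}{P(\text{both affected})} \Big\{\sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} P(\text{both affected} \mid E_1, E_2)\, P(I_1 = i_1 \mid E_1, PG_1)\, P(E_1 \mid PG_1)\, P(PG_1) \\
&\qquad \times P(I_2 = i_2 \mid E_2, PG_2)\, P(E_2 \mid PG_2)\, P(PG_2)\Big\}
\end{align*}

Because both sibs are affected, when $k_0 = 0$ each sib must carry two recessive genes at one locus at least, so the following restriction applies when summing over $E_1$ and $E_2$: if $E_1 \neq (1,1)$ and $E_2 \neq (1,1)$, then $(E_1, E_2)$ must be $((1,0),(0,1))$ or $((0,1),(1,0))$. $P(\text{both affected} \mid E_1, E_2)$ is given in Table 4.6. The probability of both sibs being affected is $P(\text{both affected}) = \sum_{i=0}^{2}\sum_{j=0}^{2} P(I_1 = i, I_2 = j, \text{both affected})$, and the probabilities $P(I_1 = i, I_2 = j, \text{both affected})$ are given below. The remaining probabilities were given in Table 4.1. Therefore, $P(I_1 = i_1, I_2 = i_2 \mid \text{both affected})$ can be found.

Table 4.6. $P(\text{both affected} \mid E_1, E_2)$ and possible trait IBD given $E_1$ and $E_2$.

$E_1$   $E_2$   $P(BA \mid E_1, E_2)$   possible trait IBD scores $(I_1, I_2)$, given $E_1$ and $E_2$
(1,1)   (0,0)   $A^2$                   (2,0), (2,1), (2,2)
(1,1)   (0,1)   $2A^2$                  (2,0), (2,1)
(1,1)   (1,0)   $2A^2$                  (2,0), (2,1)
(1,1)   (1,1)   $4A^2$                  (2,2)
(1,0)   (0,0)   $k_0A^2$                (1,0), (1,1), (1,2), (0,0), (0,1), (0,2)
(1,0)   (0,1)   $A^2$                   (1,0), (1,1), (0,0), (0,1)
(1,0)   (1,0)   $2k_0A^2$               (1,0), (1,1), (0,0), (0,1)
(1,0)   (1,1)   $2A^2$                  (1,2), (0,2)
(0,1)   (0,0)   $k_0A^2$                (1,0), (1,1), (1,2), (0,0), (0,1), (0,2)
(0,1)   (0,1)   $2k_0A^2$               (1,0), (1,1), (0,0), (0,1)
(0,1)   (1,0)   $A^2$                   (1,0), (1,1), (0,0), (0,1)
(0,1)   (1,1)   $2A^2$                  (1,2), (0,2)
(0,0)   (0,0)   $k_0^2A^2$              (2,0), (2,1), (2,2), (1,0), (1,1), (1,2), (0,0), (0,1), (0,2)
(0,0)   (0,1)   $k_0A^2$                (2,0), (2,1), (1,0), (1,1), (0,0), (0,1)
(0,0)   (1,0)   $k_0A^2$                (2,0), (2,1), (1,0), (1,1), (0,0), (0,1)
(0,0)   (1,1)   $A^2$                   (2,2), (1,2), (0,2)

Let $p_1 = P(PG = 1)$, $p_2 = P(PG = 2)$, $p_3 = P(PG = 3)$, and $p_{456} = P(PG = 4, 5, \text{or } 6)$, and let BA denote the event that both sibs are affected. Then,

\begin{align*}
&P(I_1 = 0, I_2 = 0, BA) \\
&= 4k_0A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac13 p_3\right)\left(\tfrac{9}{16}\cdot\tfrac29 p_3 + \tfrac14 p_{456}\right)
 + 2A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac13 p_3\right)^2 \\
&\quad + 4k_0A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac13 p_3\right)^2
 + k_0^2A^2\left(\tfrac{9}{16}\cdot\tfrac29 p_3 + \tfrac14 p_{456}\right)^2 \\
&= \tfrac{1}{128}A^2(2p_2 + p_3)^2 + \tfrac{1}{64}k_0A^2(2p_2 + p_3)(2p_2 + 3p_3 + 4p_{456}) + \tfrac{1}{64}k_0^2A^2(p_3 + 2p_{456})^2
\end{align*}

\begin{align*}
&P(I_1 = 1, I_2 = 0, BA) = P(I_1 = 0, I_2 = 1, BA) \\
&= 2k_0A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac23 p_3\right)\left(\tfrac{9}{16}\cdot\tfrac29 p_3 + \tfrac14 p_{456}\right)
 + 2A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac23 p_3\right)\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac13 p_3\right) \\
&\quad + 4k_0A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac23 p_3\right)\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac13 p_3\right)
 + 2k_0A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{9}{16}\cdot\tfrac49 p_3 + \tfrac12 p_{456}\right)\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac13 p_3\right) \\
&\quad + k_0^2A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{9}{16}\cdot\tfrac49 p_3 + \tfrac12 p_{456}\right)\left(\tfrac{9}{16}\cdot\tfrac29 p_3 + \tfrac14 p_{456}\right) \\
&= \tfrac{1}{64}A^2(p_2 + p_3)(2p_2 + p_3) + \tfrac{1}{16}k_0A^2(p_2 + p_3)(p_2 + p_3 + p_{456}) \\
&\quad + \tfrac{1}{64}k_0A^2(p_2 + 2p_3 + 4p_{456})(2p_2 + p_3) + \tfrac{1}{64}k_0^2A^2(p_2 + 2p_3 + 4p_{456})(p_3 + 2p_{456})
\end{align*}

\begin{align*}
&P(I_1 = 2, I_2 = 0, BA) = P(I_1 = 0, I_2 = 2, BA) \\
&= A^2\left(p_1 + \tfrac14 p_2 + \tfrac{1}{16} p_3\right)\left(\tfrac{9}{16}\cdot\tfrac29 p_3 + \tfrac14 p_{456}\right)
 + 4A^2\left(p_1 + \tfrac14 p_2 + \tfrac{1}{16} p_3\right)\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac13 p_3\right) \\
&\quad + 2k_0A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{9}{16}\cdot\tfrac39 p_3 + \tfrac14 p_{456}\right)\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac13 p_3\right) \\
&\quad + k_0^2A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{9}{16}\cdot\tfrac39 p_3 + \tfrac14 p_{456}\right)\left(\tfrac{9}{16}\cdot\tfrac29 p_3 + \tfrac14 p_{456}\right) \\
&= \tfrac{1}{128}A^2(16p_1 + 4p_2 + p_3)(4p_2 + 3p_3 + 2p_{456}) + \tfrac{1}{128}k_0A^2(2p_2 + 3p_3 + 4p_{456})(2p_2 + p_3) \\
&\quad + \tfrac{1}{128}k_0^2A^2(2p_2 + 3p_3 + 4p_{456})(p_3 + 2p_{456})
\end{align*}

\begin{align*}
&P(I_1 = 1, I_2 = 1, BA) \\
&= 4k_0A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac23 p_3\right)\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{9}{16}\cdot\tfrac49 p_3 + \tfrac12 p_{456}\right)
 + 2A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac23 p_3\right)^2 \\
&\quad + 4k_0A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{3}{16}\cdot\tfrac23 p_3\right)^2
 + k_0^2A^2\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{9}{16}\cdot\tfrac49 p_3 + \tfrac12 p_{456}\right)^2 \\
&= \tfrac{1}{32}A^2(p_2 + p_3)^2 + \tfrac{1}{16}k_0A^2(p_2 + p_3)(2p_2 + 3p_3 + 4p_{456}) + \tfrac{1}{64}k_0^2A^2(p_2 + 2p_3 + 4p_{456})^2
\end{align*}

$P(I_1 = 2, I_2 = 1, BA) = P(I_1 = 1, I_2 = 2, BA) = A^2\left(p_1 + \tfrac14 p_2 + \tfrac{1}{16} p_3\right)\left(\tfrac14\cdot\tfrac12 p_2 + \tfrac{9}{16}\cdot\tfrac49 p_3 + \tfrac12 p_{456}\right)$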