Two-stage genome search design in affected-sib-pair method


Material Information

Title:
Two-stage genome search design in affected-sib-pair method
Physical Description:
Book
Creator:
Teng, Chi-Hse
Publication Date:

Record Information

Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 28638544
oclc - 38827144
System ID:
AA00020475:00001

Table of Contents
    Title Page
    Acknowledgement
    Table of Contents
    List of Tables
    Abstract
    Chapter 1. Introduction
    Chapter 2. Literature review
    Chapter 3. Two-stage genome search for simple Mendelian disease
    Chapter 4. Two-stage genome search for complex disease
    Chapter 5. Concluding remarks
    References
    Biographical sketch
Full Text











TWO-STAGE GENOME SEARCH DESIGN IN AFFECTED-SIB-PAIR METHOD


By

CHI-HSE TENG













A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1997






























Copyright 1997

by

CHI-HSE TENG















ACKNOWLEDGMENTS


I would like to thank the chairman of my committee, Professor Mark C.K. Yang,

for introducing me to the field of genetic statistics and his wise guidance throughout

my graduate study at UF. I also want to thank Professor Randy Carter, Professor

Frank Martin, Professor Sue McGorrary, and Professor Jin Xiong She. Their advice,

encouragement, and patience were essential. Last but not least, I want to thank

Ms. Margaret Joyner for her gracious editorial help and effort. Finally, my deepest

appreciation goes to my parents, brother, sister, and parents-in-law for their love and

never-ending support and to my wife Hsiao-Yun and our daughter Gillian for their

sweet company.















TABLE OF CONTENTS




ACKNOWLEDGMENTS .......... iii

LIST OF TABLES .......... vi

ABSTRACT .......... viii

CHAPTERS

1 INTRODUCTION .......... 1

2 LITERATURE REVIEW .......... 8

    2.1 Map Function .......... 8
    2.2 An Introduction to Human Linkage Data for Genetic Dissection .......... 11
    2.3 Identical by State, Identical by Descent .......... 16
    2.4 Distributions of IBD .......... 17
    2.5 Risk Ratio .......... 24
    2.6 Test Statistics Based on IBD Score .......... 25
    2.7 Likelihood Ratio Test .......... 30
    2.8 Heterogeneity and Homogeneity .......... 33
    2.9 Polygenes .......... 37
    2.10 Two-stage Genome Search .......... 38

3 TWO-STAGE GENOME SEARCH FOR SIMPLE MENDELIAN DISEASE .......... 40

    3.1 Assumptions .......... 41
    3.2 Two-Stage Procedure .......... 42
    3.3 Probability of Allocating the Correct Marker .......... 43
    3.4 Type I Error and Power of Claiming Linkage .......... 72
    3.5 Discussion .......... 76

4 TWO-STAGE GENOME SEARCH FOR COMPLEX DISEASE .......... 78

    4.1 Genetic Model and Assumptions .......... 78
    4.2 Two-stage Genome Search .......... 79
    4.3 Probability of Allocating the Correct Marker for a Complex Disease .......... 81
    4.4 Discussion .......... 110

5 CONCLUDING REMARKS .......... 112

REFERENCES .......... 114

BIOGRAPHICAL SKETCH .......... 120














LIST OF TABLES


Table                                                                   Page

1.1 Results of Mendel's crossing experiments .......... 2

1.2 The results of Bateson and Punnett's experiment .......... 5

2.1 Multilocus-feasibility of several map functions .......... 12

2.2 Some sampling issues in linkage studies .......... 16

2.3 The values of Pr(IBD of trait gene | IBD of marker and relationship) .......... 20

2.4 The probability of IBD score for different relatives .......... 26

3.1 The joint distribution of IBD score of the locus 1 and locus 2 .......... 57

3.2 The joint distribution of Xij and Xi,j .......... 58

3.3 Optimal resource allocation in two-stage genome search for rare recessive gene .......... 64

3.4 Optimal resource allocation in two-stage genome search for rare dominant gene .......... 68

3.5 The 95th percentile of the unique maximum of R Binomial(n2, 0.25) .......... 74

3.6 The 95th percentile of the unique maximum marker group of R Binomial(n2, 0.25) .......... 75

4.1 Trait IBD distribution conditional on parental genotype and E vector .......... 83

4.2 Exclusive events for the case "G1 is found but G2 is not." .......... 84

4.3 Exclusive events for the case "G2 is found but G1 is not." .......... 85

4.4 Exclusive events for the case "G1 and G2 are found." .......... 85

4.5 Conditional distribution of X1j and X2j given .......... 87

4.6 P(both affected | E1, E2) and possible trait IBD given E1 and E2 .......... 89

4.7 Optimal resource allocation in two-stage genome search for rare recessive gene, assumed two genes .......... 95














Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment
of the Requirements for the Degree of
Doctor of Philosophy

TWO-STAGE GENOME SEARCH DESIGN IN AFFECTED-SIB-PAIR METHOD

By

CHI-HSE TENG

December, 1997

Chairman: Mark C.-K. Yang
Major Department: Statistics

Since Penrose proposed the sib-pair method in 1935 and the affected-sib-pair
(ASP) method in 1950 and 1953, these methods have been widely used in linkage
studies. Combined with contemporary DNA-level genetic marker technology, the ASP
method can now be used for genome-wide searches for disease genes.

A two-stage search involves first a screening search to eliminate nonviable marker
loci, then an intensive search to identify the gene location using the remaining markers.
Although this approach has been suggested in the literature, its properties have not
been thoroughly investigated. The spacing between markers, the number of ASPs to
be used in each stage, and the criteria for markers to pass the first
stage need to be determined. The major difficulties of the two-stage approach are that (1)
the joint distribution of the nonindependent test statistics is difficult to handle; (2) the
test statistics in the second stage have a "random" dependence structure; and (3) for
most practically available sample sizes, the high number of ties in test scores makes
asymptotic approaches inappropriate. This dissertation aims to provide a method for
constructing optimal designs for rare autosomal recessive and dominant diseases.

Power computations using the multinomial distribution, supported by simulation,
show that a two-stage design is better than a one-stage design in most cases,
but not always. Combining data from the two stages yields a small increase in accuracy.
The optimal designs for resource allocation in each of the two stages are obtained and
presented in tables.














CHAPTER 1
INTRODUCTION



Modern genetics began with the work of Gregor Mendel (1822-84), conducted
between 1856 and 1863, which formed the basis for his 1866 paper (Klug and Cummings,
1997). Mendel, an Austrian monk, performed a series of experiments on
garden peas. Based on the results of these experiments, he proposed a particulate
inheritance theory, which hypothesized that heritable biological characteristics are
carried and controlled by individual "units." We call these units genes today, but
Mendel called each unit a "Merkmal," the German word for "character" (Levine and
Miller, 1991).

In his garden pea experiments, Mendel studied seven characters:
round or wrinkled ripe seeds, yellow or green seed interiors, purple or white
petals, inflated or pinched ripe pods, green or yellow unripe pods, axial or terminal
flowers, and long or short stems (Griffiths et al., 1993). For each of these seven
characters, Mendel obtained pure lines of plants. A pure line is a population in which
all offspring produced by selfing or crossing within the population show the same form
of the character being studied (Griffiths et al., 1993). The parental generation,
denoted F0, in Mendel's experiments consisted of plants from these pure lines. Mendel first
studied the characters separately: for each character, he crossed two pure lines, one
for each phenotype, to obtain the first filial generation, denoted F1. All
F1 individuals showed only one phenotype of each character. Next, Mendel
self-fertilized the F1 plants and obtained the second filial generation, denoted F2. The results are
shown in Table 1.1.

Table 1.1. Results of Mendel's crossing experiments.

Parental phenotype          F1             F2                           F2 ratio
Round x wrinkled seeds      All round      5474 round; 1850 wrinkled    2.96:1
Yellow x green seeds        All yellow     6022 yellow; 2001 green      3.01:1
Purple x white petals       All purple     705 purple; 224 white        3.15:1
Inflated x pinched pods     All inflated   882 inflated; 299 pinched    2.95:1
Green x yellow pods         All green      428 green; 152 yellow        2.82:1
Axial x terminal flowers    All axial      651 axial; 207 terminal      3.14:1
Long x short stems          All long       787 long; 277 short          2.84:1

Source: Griffiths et al., 1993, p. 23.

Mendel established two principles to explain the pattern of the
data (Levine and Miller, 1991; Griffiths et al., 1993):

The characteristics of an organism are determined by individual units of heredity
called genes. Each adult organism has two alleles for each gene, one from
each parent. These alleles are segregated (separated) from each other when
reproductive cells are formed, and each gamete receives one of the two alleles with
equal probability. This is the principle of segregation, also known as
Mendel's First Law.

In an organism with contrasting alleles for the same gene, one allele may be
dominant over the other. This is known as the principle of dominance.

He assumed that each plant had two alleles for each trait he studied. This assumption
was correct. "An allele is one of the different forms of a gene that can exist at a
single locus" (Griffiths et al., 1993, p. 783). He introduced the notation A for the
dominant allele and a for the recessive allele. Since the parental generations were
pure lines, their genotypes were homozygous, AA and aa. Thus, the F1 population
consisted entirely of the heterozygous genotype Aa. The F2, the offspring of the F1, were
expected to be AA, Aa, and aa in the ratio 1:2:1, and hence to show the dominant (AA and Aa)
and recessive (aa) phenotypes in the ratio 3:1. Mendel cultivated 10 seeds from each of 100
dominant F2 plants; if all 10 offspring from a single F2 plant showed the dominant character,
he concluded that the plant was homozygous. Mendel used this experiment to demonstrate
the 1:2 ratio of homozygous dominant (AA) to heterozygous dominant (Aa) F2 plants.
However, a heterozygous F2 parent still has probability $(0.75)^{10} = 0.0563$ of
producing 10 dominant offspring, and hence of being misclassified as homozygous.
The expected proportion classified as homozygous is therefore
$1/3 + (2/3)(0.0563) = 0.3709$, so the true expected ratio is 0.3709:0.6291 instead of 1:2. Fisher
suggested that Mendel's data are too close to 1:2 rather than to the correct values, and
thus suspected some manipulation or omission of data (Weir, 1996).
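The misclassification arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `classified_homozygous_fraction` is an illustrative choice, and the calculation assumes, as in the text, that one third of dominant F2 plants are AA and two thirds are Aa.

```python
from fractions import Fraction


def classified_homozygous_fraction(n_offspring: int = 10) -> float:
    """Expected fraction of dominant F2 plants that get classified as homozygous (AA)."""
    p_homozygous = Fraction(1, 3)    # true AA among dominant F2 plants
    p_heterozygous = Fraction(2, 3)  # true Aa among dominant F2 plants
    # An Aa plant is misclassified when all of its offspring show the dominant character.
    p_misclassified = Fraction(3, 4) ** n_offspring
    return float(p_homozygous + p_heterozygous * p_misclassified)


fraction_aa = classified_homozygous_fraction(10)
print(f"misclassification probability: {0.75 ** 10:.4f}")
print(f"expected ratio: {fraction_aa:.4f} : {1 - fraction_aa:.4f}")
```

Raising the number of offspring tested per plant shrinks the misclassification term, so the expected ratio moves back toward 1:2.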

Mendel also studied two characters together, for example the shape and the color of the
seeds. We denote the round allele by R and the wrinkled allele by r,
and the yellow allele by Y and the green allele by y. He crossed a parental line of plants with
yellow and round seeds (genotype RRYY) with a line with green and wrinkled seeds
(genotype rryy). The F1 seeds were all round and yellow (RrYy), showing that
the alleles R and Y are dominant over r and y. He then crossed the F1 plants and obtained
four types of seeds: 315 round/yellow, 108 round/green, 101 wrinkled/yellow,
and 32 wrinkled/green. The ratio was approximately 9:3:3:1 (Levine and Miller,
1991). All possible F2 genotype combinations are given in the following Punnett
square (Griffiths et al., 1993); the first row gives the genotype of the gamete from the
father, the first column gives the genotype of the gamete from the mother, and the
numbers in parentheses are the expected frequencies if the alleles of the two
characters segregate independently.

















             RY (1/4)       Ry (1/4)       ry (1/4)       rY (1/4)
RY (1/4)     RRYY (1/16)    RRYy (1/16)    RrYy (1/16)    RrYY (1/16)
Ry (1/4)     RRYy (1/16)    RRyy (1/16)    Rryy (1/16)    RrYy (1/16)
ry (1/4)     RrYy (1/16)    Rryy (1/16)    rryy (1/16)    rrYy (1/16)
rY (1/4)     RrYY (1/16)    RrYy (1/16)    rrYy (1/16)    rrYY (1/16)

Mendel established the following principle to explain the pattern of the
data:

Each of the seven genes that Mendel investigated segregated independently
(independent assortment). This is also known as Mendel's Second Law
(Levine and Miller, 1991; McPeek, 1997).

Regardless of the controversy over his data, Mendel's theory that genes control
biological characters is well accepted.

Correns (1900) observed the phenomenon of complete linkage, in which alleles
of two or more different characters appeared always to be inherited together, rather
than independently as Mendel's Second Law predicts (McPeek, 1997). This violation led to
an extension of Mendel's theory, the chromosome theory of heredity, which says
that genes are parts of specific cellular structures, the chromosomes (Griffiths et
al., 1993). "In 1902, Walter Sutton and Theodor Boveri, independently published
papers linking their discoveries of the behavior of chromosomes during meiosis to the
Mendel's principles of segregation and independent assortment. Sutton and Boveri
are credited with initiating the chromosomal theory of heredity" (Klug and Cummings, 1997, p. 61).
This theory provided a physical mechanism for Mendel's Laws. It was assumed that
the characters that Mendel studied lay on different chromosomes, and that those
which were completely linked lay on the same chromosome (McPeek, 1997).

Bateson and Punnett performed an experiment on sweet peas to study two characters:
flower color, purple (dominant) or red (recessive), and pollen form, long
(dominant) or round (recessive). First, they crossed purple, long (PPLL) plants with
red, round (ppll) plants to create heterozygous (PpLl) plants (the F1 generation); then they
crossed the F1 generation among themselves and produced 381 plants. The results are
shown in Table 1.2 (Levine and Miller, 1991, p. 193).

Table 1.2. The results of Bateson and Punnett's experiment.

Phenotype        Number      Percent     Percent expected   Percent expected
                 observed    observed    if unlinked        if completely linked
Purple, long     284         74%         56%                75%
Purple, round    21          6%          19%                -
Red, long        21          6%          19%                -
Red, round       55          14%         6%                 25%

Source: Levine and Miller, 1991.

These results matched neither Mendel's laws nor complete linkage of the two
genes. Morgan (1911) provided an explanation for Bateson
and Punnett's observation and for the similar results of his own fruit fly experiments.
He suggested that an exchange of genetic material had
occurred between two homologous chromosomes when they paired during cell division;
this exchange was called a crossover (McPeek, 1997). Recombination is the
process by which progeny derive a combination of genes different from that of either
parent (DOE, 1992). Each individual inherits one chromosome of each pair from its father and
the other from its mother. When this individual produces a chromosome to
pass to its offspring during meiosis, it does not pass on intact one of the pair that it inherited
from its parents but a "blended" one. Meiosis produces crossovers that blend the two
chromosomes and cause recombination. If there is no crossover or an even number
of crossovers between two loci, then recombination does not occur between these two
loci. If, on the other hand, there is an odd number of crossovers, then recombination
occurs. The probability of recombination between two loci is called
the recombination fraction. With current technology, crossovers
are not directly observable, but recombination between two genes (or markers) is
(Ott, 1991). An obvious consequence of the crossover process is that the closer two
genes are to each other on a chromosome, the smaller the chance that a crossover occurs
between them, and hence the smaller the chance of recombination between
them. This idea became the foundation of linkage analysis. Recombination is the
chief source of variation among species (Rhodes et al., 1974).
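The comparison in Table 1.2 can be reproduced with a short calculation. The sketch below takes the observed counts from the table and assumes the two reference models named there: independent assortment (a 9:3:3:1 ratio) and complete linkage of the dominant alleles (a 3:1 ratio with no recombinant classes). The helper `as_percent` is an illustrative name. Percentages are printed to one decimal place, so they differ slightly from the whole-number values in the table.

```python
from fractions import Fraction

# Observed counts from Table 1.2.
observed = {"purple, long": 284, "purple, round": 21, "red, long": 21, "red, round": 55}
total = sum(observed.values())

# Expected proportions under independent assortment (9:3:3:1).
unlinked = {
    "purple, long": Fraction(9, 16),
    "purple, round": Fraction(3, 16),
    "red, long": Fraction(3, 16),
    "red, round": Fraction(1, 16),
}

# Expected proportions under complete linkage (only parental classes, 3:1).
linked = {
    "purple, long": Fraction(3, 4),
    "purple, round": Fraction(0),
    "red, long": Fraction(0),
    "red, round": Fraction(1, 4),
}


def as_percent(value) -> float:
    """Convert a proportion to a percentage rounded to one decimal place."""
    return round(100 * float(value), 1)


for phenotype, count in observed.items():
    print(
        f"{phenotype:14s} observed {as_percent(Fraction(count, total)):5.1f}%  "
        f"unlinked {as_percent(unlinked[phenotype]):5.1f}%  "
        f"linked {as_percent(linked[phenotype]):5.1f}%"
    )
```

The observed recombinant classes (purple, round and red, long) fall between the two models, which is the pattern Morgan's crossover explanation accounts for.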

The purpose of a linkage study is to find the relative locations of genes or
markers and ultimately to present them in a linkage map. The distance between two loci
on a chromosome is measured by genetic distance, which is discussed in Section 2.1,
"Map Function."

Lander and Schork (1994) identified four major categories of
methods for genetic dissection. One is "genetic analysis of large crosses in model
organisms such as the mouse and rat." The other three use human genetic data:
pedigree analysis, allele-sharing methods, and association studies
in human populations. These are discussed in more detail in Section 2.2, "An Introduction
to Human Linkage Data for Genetic Dissection." The affected-sib-pair method, proposed by
Penrose (1953), is an allele-sharing method. Combined with
modern DNA-level genetic marker technology, such as restriction fragment length
polymorphism (RFLP) markers (Botstein et al., 1980), it can be used for genome-wide
gene searches.

This dissertation discusses a two-stage genome-wide approach within the affected-sib-pair
method for locating disease genes. Only designs for
rare recessive and dominant diseases with a dichotomous phenotype (affected or not
affected) are studied. Chapter 2 reviews the literature on related topics, including
map functions, the sib-pair and affected-sib-pair methods, identity by descent,
identity by state, the affected-relative-members method, heterogeneity, and polygenes.
Chapter 3 discusses how to design an optimal two-stage search for a simple Mendelian
disease gene, and Chapter 4 discusses how to design an optimal two-stage search for
two unlinked, non-interacting, low-penetrance recessive genes underlying a complex disease.
Brief concluding remarks are given in the last chapter.














CHAPTER 2
LITERATURE REVIEW



In this chapter, the following topics will be covered:

map function (without biological interference),

concepts of identical-by-descent (IBD) and identical-by-state (IBS),

test statistics based on IBD scores, IBS scores, and the likelihood ratio,

sib-pair method and affected-sib-pair method,

heterogeneity and homogeneity,

polygeneity,

two-stage approach.


2.1 Map Function


The genetic map distance between two genes, measured in units called Morgans, is defined
as the expected number of crossovers occurring between the two genes on a single
chromosome strand (Ott, 1991). When the distance is so small that the probability of
multiple crossovers is negligible, the genetic map distance is equivalent to the
recombination fraction. One centimorgan is the distance between two genes that produces a
probability of approximately 0.01 that recombination takes place between them. When
the distance is large, the probability of multiple crossovers increases. If there is zero
or an even number of crossovers, then we will not observe recombination, but if there
is an odd number of crossovers, we will. Hence, the recombination
fraction is not necessarily equal to the map distance when two loci are far
apart. A map function is used to relate the additive, but hard to measure,
genetic or map distance to the non-additive, but more readily estimable, recombination
fraction (Speed, 1997). A map function should preserve the additivity of map distance,
but not every map function does. For example, consider three loci $x_1$, $x_2$, $x_3$ on a
chromosome. Let $\theta_{12}$ and $\theta_{23}$ be the map distances between $x_1$ and $x_2$
and between $x_2$ and $x_3$, respectively, and let $r_{12}$ and $r_{23}$ be the corresponding
recombination fractions. If we assume that $x_1$, $x_2$, and $x_3$ are so close
that the chance of multiple crossovers between them is negligible, then the Morgan
map function does not preserve the additivity of distance, but the Haldane (1919) map
function does. For the Morgan map function, $r_{12} = \theta_{12}$ and $r_{23} = \theta_{23}$. The map
distance between $x_1$ and $x_3$ is $\theta_{13} = \theta_{12} + \theta_{23}$, hence the recombination
fraction between $x_1$ and $x_3$ is $r_{13} = r_{12} + r_{23}$. On the other hand, since the chance of
multiple crossovers is negligible,

\[
\begin{aligned}
\Pr(\text{recombination between } x_1 \text{ and } x_3)
&= \Pr(\text{recombination between } x_1 \text{ and } x_2 \text{ but not between } x_2 \text{ and } x_3) \\
&\quad + \Pr(\text{no recombination between } x_1 \text{ and } x_2 \text{ but recombination between } x_2 \text{ and } x_3) \\
&= \theta_{12}(1 - \theta_{23}) + (1 - \theta_{12})\theta_{23} \\
&= \theta_{12} + \theta_{23} - 2\theta_{12}\theta_{23} \\
&\neq r_{12} + r_{23}.
\end{aligned}
\]

This is a contradiction. For the Haldane map function, since the map distance is
$\theta_{12} + \theta_{23}$, $r_{13}$ is equal to $0.5\,(1 - e^{-2(\theta_{12} + \theta_{23})})$, and

\[
\begin{aligned}
\Pr(\text{recombination between } x_1 \text{ and } x_3)
&= 0.5\,(1 - e^{-2\theta_{12}})\,[1 - 0.5\,(1 - e^{-2\theta_{23}})]
 + 0.5\,(1 - e^{-2\theta_{23}})\,[1 - 0.5\,(1 - e^{-2\theta_{12}})] \\
&= 0.5\,\{(1 - e^{-2\theta_{12}}) + (1 - e^{-2\theta_{23}}) - (1 - e^{-2\theta_{12}})(1 - e^{-2\theta_{23}})\} \\
&= 0.5\,(1 - e^{-2(\theta_{12} + \theta_{23})}).
\end{aligned}
\]

The additivity of map distance is preserved.
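The two derivations can also be checked numerically. The following is a minimal sketch: the map distances 0.10 and 0.15 Morgans are arbitrary illustrative values, and `combine` is a hypothetical helper name for the rule that recombination between the outer loci occurs when exactly one of the two intervals recombines.

```python
import math


def morgan(x: float) -> float:
    """Morgan map function: recombination fraction equals map distance."""
    return x


def haldane(x: float) -> float:
    """Haldane map function: r = 0.5 * (1 - exp(-2x))."""
    return 0.5 * (1.0 - math.exp(-2.0 * x))


def combine(r12: float, r23: float) -> float:
    """Recombination between the outer loci: exactly one interval recombines."""
    return r12 * (1.0 - r23) + (1.0 - r12) * r23


theta12, theta23 = 0.10, 0.15  # illustrative map distances in Morgans

for name, map_function in (("Morgan", morgan), ("Haldane", haldane)):
    direct = map_function(theta12 + theta23)
    combined = combine(map_function(theta12), map_function(theta23))
    print(f"{name:8s} direct = {direct:.6f}  combined = {combined:.6f}")
```

For the Morgan function the two values differ by $2\theta_{12}\theta_{23}$, while for the Haldane function they agree up to floating-point error.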

Also, "most map functions are determined by three loci comparisons which may
not be consistent in terms of four or more loci" (Liberman and Karlin, 1984). A
map function that is consistent for any number of loci is called multilocus-feasible.
Several counterexamples in Ott (1991, p. 127) show that a map function that is not
multilocus-feasible can yield negative probabilities.

Liberman and Karlin (1984) investigated these problems. Their conclusions can be

summarized as follows: there are two methods for constructing a genetic map function,

the first starts with a model of the recombination process, the second uses a differential

equation method. "The renewal crossover formation process model assumes that

crossovers occur in succession along the chromosome, starting at a natural biological

site (e.g., the centromere, some origin of replication, or a telomere) such that the

lengths of the intervals between successive crossovers are positive random variables

independently and identically distributed."

For the first method, if a renewal crossover formation model is specified, the only

feasible map function is the Haldane map function (Haldane, 1919). The correspond-

ing crossover formation process is a Poisson process.
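The link between the Poisson process and the Haldane map function can be illustrated numerically. The sketch below assumes that the number of crossovers between two loci at map distance $x$ (in Morgans) is Poisson with mean $x$, and uses the rule that recombination is observed exactly when that number is odd. The function names and the truncation point `max_k` are illustrative choices.

```python
import math


def prob_odd_crossovers(x: float, max_k: int = 60) -> float:
    """P(odd number of crossovers) when the count is Poisson with mean x."""
    total = 0.0
    term = math.exp(-x)  # P(K = 0)
    for k in range(1, max_k + 1):
        term *= x / k    # P(K = k), built up from P(K = k - 1)
        if k % 2 == 1:
            total += term
    return total


def haldane(x: float) -> float:
    """Haldane map function: r = 0.5 * (1 - exp(-2x))."""
    return 0.5 * (1.0 - math.exp(-2.0 * x))


for x in (0.01, 0.1, 0.5, 1.0):
    print(f"x = {x:4.2f}  P(odd) = {prob_odd_crossovers(x):.6f}  Haldane = {haldane(x):.6f}")
```

The series is summed directly instead of being simulated so that the comparison is deterministic.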

Theorem 1 Consider a crossover formation model in the form of a renewal process
with intercrossover distribution $F(t)$, the distribution function of the distance between
two successive crossovers, such that $\frac{d^n}{dt^n}F(0) \neq 0$ for some positive integer $n$. Then a
map function exists if and only if the intercrossover distribution is
exponential, i.e., the renewal process is a Poisson process (Liberman
and Karlin, 1984).









On the issue of multilocus feasibility of map functions, they derived sufficient
conditions and necessary conditions for multilocus-feasibility.

Theorem 2 Let $r$ be the recombination fraction and $x$ the genetic distance. Sufficient
conditions: A map function $r = M(x)$ is multilocus-feasible if its derivatives
$M^{(n)}$ obey the inequalities

\[
(-1)^n M^{(n)}(x) < 0, \quad n = 1, 2, \ldots, \quad \text{for all } x > 0.
\]

Necessary conditions: Let $r = M(x)$ be a map function that is multilocus-feasible
and suppose that all derivatives $M^{(n)}(x)$ exist. Then

\[
(-1)^n M^{(n)}(0) < 0, \quad \forall n = 1, 2, \ldots.
\]
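As an illustrative check, taking Theorem 2 as stated above, the conditions can be evaluated by hand for the Haldane and Kosambi map functions listed in Table 2.1. This worked calculation is a sketch added for clarity and is not part of the cited theorem.

```latex
% Haldane: every derivative has the required sign.
M(x) = \tfrac{1}{2}\left(1 - e^{-2x}\right), \qquad
M^{(n)}(x) = -\tfrac{1}{2}(-2)^n e^{-2x}, \qquad
(-1)^n M^{(n)}(x) = -2^{\,n-1} e^{-2x} < 0 \quad (n \ge 1,\; x \ge 0).

% Kosambi: the third derivative at zero has the wrong sign.
M(x) = \tfrac{1}{2}\tanh(2x) = x - \tfrac{4}{3}x^{3} + O(x^{5}), \qquad
M^{(3)}(0) = -8, \qquad
(-1)^{3} M^{(3)}(0) = 8 > 0.
```

Under these conditions the Haldane function satisfies the sufficient condition, while the Kosambi function violates the necessary condition at $n = 3$.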

Table 2.1 shows the multilocus feasibility of several map functions, where $r$ is the
recombination fraction and $x$ is the genetic map distance. Note that the
multilocus-feasible map functions are constructed by the crossover formation
process method. Further extensions incorporate genetic interference and are not
discussed here; see Karlin and Liberman (1994) and Speed (1997).
Most map functions constructed by the differential equation method are not
multilocus-feasible and are also not discussed.

Table 2.1. Multilocus-feasibility of several map functions.

Source                        Map function r = M(x)

Haldane (1919)                $r = \frac{1}{2}\left(1 - e^{-2x}\right)$

Ludwig (1934)                 $r = \frac{1}{2}\sin(2x)$

Kosambi (1944)                $r = \frac{1}{2}\tanh(2x)$

Carter and Falconer (1951)    $x = \frac{1}{4}\left[\tan^{-1}(2r) + \tanh^{-1}(2r)\right]$

Sturt (1976)                  $r = \frac{1}{2}\left[1 - \left(1 - \frac{x}{L}\right) e^{-x(2L-1)/L}\right]$

Rao et al. (1977)             $x = \frac{1}{6}\left[\,p(2p-1)(1-4p)\ln(1-2r) + 16p(p-1)(2p-1)\tan^{-1}(2r) + 2p(1-p)(8p+2)\tanh^{-1}(2r) + 6(1-p)(1-2p)(1-4p)\,r\,\right]$

Felsenstein (1979)            $r = \dfrac{1 - e^{2(K-2)x}}{2\left(1 - (K-1)\,e^{2(K-2)x}\right)}$

[Entries of the "Multilocus feasible" column are not legible in the source.]

Source: Liberman and Karlin, 1984.


2.2 An Introduction to Human Linkage Data for Genetic Dissection


The purpose of a linkage study is to locate the gene of interest.
With contemporary DNA-level genetic marker technology (Botstein et al., 1980), one
way to achieve this is by studying the linkage between genes and markers. "When two
genes are inherited independently of each other, recombinants and nonrecombinants
are expected in equal proportions among the offspring. For some pairs of genes, one
observes a consistent deviation from the 1:1 ratio of recombinant to nonrecombinant
offspring. . . . In other words, alleles of different genes appear to be genetically
coupled, and this phenomenon is called genetic linkage" (Ott, 1991, p. 6). A
marker is a small identifiable physical region on a chromosome whose inheritance
can be monitored (DOE, 1992). The closer a marker is to the gene, the
smaller the chance that they will be separated by recombination during meiosis.
Therefore, if a marker is strongly correlated with the phenotype of a gene, the gene
should be close to the marker. There are three major categories of methods
using human genetic data: pedigree analysis, allele-sharing methods, and association
studies. Risch and Merikangas (1996) summarized these methods as follows.
In pedigree analysis, we first collect pedigrees that contain affected members. Next,
we propose a genetic model (location of the gene, allele frequency, mode of inheritance,
and so on), say $M_1$, to explain the pattern observed in the data, and compare the
likelihood under $M_1$ with the likelihood under the null hypothesis $M_0$, which assumes no
gene in the region of the marker, by the likelihood ratio

\[
\frac{L(\text{data} \mid M_0)}{L(\text{data} \mid M_1)},
\]

or equivalently by the logarithm of the odds (LOD) score (Barnard, 1949),

\[
\log_{10} \frac{L(\text{data} \mid M_1)}{L(\text{data} \mid M_0)}.
\]

Since this method requires specifying a genetic model, it is mostly used to identify
simple Mendelian trait genes.
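To make the LOD score concrete, the sketch below evaluates it for the simplest textbook setting, which is an assumption made here for illustration and not a model taken from this chapter: $n$ phase-known informative meioses of which $k$ are recombinant, with $M_1$ specifying a recombination fraction $\theta < 0.5$ and $M_0$ specifying no linkage ($\theta = 0.5$). The function name `lod_score` and the example counts are hypothetical.

```python
import math


def lod_score(n_recombinants: int, n_meioses: int, theta: float) -> float:
    """LOD score log10[L(theta) / L(0.5)] for phase-known meioses (assumed model)."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= n_recombinants <= n_meioses:
        raise ValueError("need 0 <= n_recombinants <= n_meioses")
    k, n = n_recombinants, n_meioses
    # Binomial likelihoods; the binomial coefficient cancels in the ratio.
    log_l1 = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_l0 = n * math.log10(0.5)
    return log_l1 - log_l0


# Hypothetical data: 2 recombinants among 20 informative meioses.
for theta in (0.05, 0.10, 0.20, 0.50):
    print(f"theta = {theta:.2f}  LOD = {lod_score(2, 20, theta):.3f}")
```

In this setting the LOD score is zero at $\theta = 0.5$ and is largest near the observed recombinant proportion $k/n$.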
In allele-sharing methods, we first collect relative pairs or groups (most often
affected relatives) and try to show that the pattern observed in
the data is not due to random Mendelian segregation. Most of these methods use
identical-by-descent scores, that is, the number of identical copies of the marker alleles that
relatives share. The major advantage is that no specific genetic model is required for
the test, so these methods are considered nonparametric. They are
mostly used for detecting non-Mendelian complex trait genes.
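The reference point for such tests is the sharing expected under random Mendelian segregation alone. As a minimal sketch, the enumeration below counts, for a pair of full sibs at a single marker, how many parental alleles the sibs share identical by descent across the 16 equally likely transmission patterns. It assumes fully informative parents (all four parental alleles distinguishable), and the function name is illustrative.

```python
from fractions import Fraction
from itertools import product


def sib_pair_ibd_distribution() -> dict:
    """Exact distribution of the IBD count (0, 1, 2) for a sib pair under random segregation."""
    counts = {0: 0, 1: 0, 2: 0}
    # Each sib receives one of two paternal alleles and one of two maternal alleles.
    for pat1, mat1, pat2, mat2 in product((0, 1), repeat=4):
        ibd = int(pat1 == pat2) + int(mat1 == mat2)
        counts[ibd] += 1
    total = sum(counts.values())
    return {score: Fraction(n, total) for score, n in counts.items()}


for score, probability in sib_pair_ibd_distribution().items():
    print(f"IBD = {score}: {probability}")
```

A marker near a disease gene would be expected to show more sharing among affected sib pairs than this baseline.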

An association study is a case-control study. Instead of using familial data, it
collects unrelated affected and unaffected individuals and compares the allele
frequencies of candidate genes or markers between the two groups. If the disease gene is one of
the candidate genes, or is very strongly associated with some marker alleles, then the
affected group will have higher frequencies of those alleles than the control group. This
method is mostly used after the possible region of the genes has been narrowed down,
because it involves examining genes and/or markers on a very fine scale,
i.e., examining many markers in a small area of a chromosome. "Association studies
seem to be of greater power than linkage studies. But of course, the limitation of
association studies is that the actual gene or genes involved in the disease must be
tentatively identified before the test can be performed. . . . Thus, the primary
limitation of genome-wide association tests is not a statistical one but a technological
one" (Risch and Merikangas, 1996, p. 1517).
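One common way to compare allele frequencies between the two groups is a Pearson chi-square test on a 2 x 2 table of allele counts. The sketch below uses that test as an illustration; the text does not prescribe a specific statistic, and both the function name and the counts are hypothetical.

```python
def allele_chi_square(case_a: int, case_other: int, control_a: int, control_other: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for a 2 x 2 allele-count table."""
    a, b, c, d = case_a, case_other, control_a, control_other
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must contain at least one count")
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts of a candidate allele "A" versus all other alleles.
statistic = allele_chi_square(case_a=60, case_other=40, control_a=40, control_other=60)
print(f"chi-square = {statistic:.2f}")
```

A statistic above 3.84 corresponds to significance at the 0.05 level for one degree of freedom, before any correction for testing many markers.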

Allele-sharing methods include the affected-sib-pair (ASP) method and its vari-

ants. Penrose was considered the first person to propose the sib-pair method (Con-

neally and Rivas, 1980; Shah and Green, 1994). In 1935, Penrose proposed a simple

X2 test, to detect linkage between two characters. In 1938, he extended this method
to "graded" characters, which means the characters could have intermediate value.

In 1950, Penrose proposed the concept of affected-sib-pair method as a means for

analyzing red hair phenotype and ABO blood group data. He wrote: "Accuracy is

preserved and uninformative dead wood excluded if a set of sibships is selected by the
presence of one of the test characters . ." A general form of the sib-pair method was

proposed by Penrose in 1953. Now, these methods have been extended to examine

identical-by-descent scores (elaborated in the following sections), identical-by-state









scores of markers, and the risk ratio between two relatives. The idea of the affected-sib-pair

method is to include in the study only pairs of sibs who are both affected.

If the disease is caused by a gene, then both affected sibs tend to receive the

same gene. Hence, if a marker is close to the gene, then the marker will tend to

segregate with the gene during meiosis. We will cover how to detect the linkage in

the later sections.

The major advantages of affected-sib-pair method over pedigree analysis were

summarized by Holmans and Craddock in 1997:

1. The affected-sib-pair method does not require specification of a genetic model, which

is important for complex diseases where the mode of inheritance is unclear.

2. It is generally easier to collect affected sibling data than to collect a large,

multigeneration pedigree with multiple affected members.

3. Affected sib-pairs are more likely to be informative for linkage than large pedigrees

under oligogenic epistatic models, which are plausible for a number of complex

traits.

The disadvantage of affected-sib-pair method is that it is considered less powerful than

traditional pedigree analysis when the genetic model can be specified. If the genetic

model cannot be specified in the affected-sib-pair methods, the recombination fraction

cannot be estimated.

Suarez, Rice, and Reich (1978) and Blackwelder and Elston (1985) suggested that

sampling only affected sib-pairs is more powerful than sampling affected-unaffected

sibs under a fixed sample size constraint. Some sampling issues in linkage studies as

summarized by Gershon et al. (1994) appear in Table 2.2.









Table 2.2. Some sampling issues in linkage studies.


Issue                         Pro                                 Con

Very large pedigree           Statistical power high under        Extension may actually introduce
                              homogeneity assumption.             heterogeneity. Very hard to find.

Medium-size pedigree series   Less likely to have heterogeneity   Heterogeneity between pedigrees.
                              within pedigrees. More likely to
                              get generalizable results.

Nuclear families                                                  Lower power: very many needed if
                                                                  heterogeneity present.

Affected sib-pairs            Model-free                          Even lower statistical power.

Source: Gershon et al., 1994.


2.3 Identical by State, Identical by Descent


IBS stands for "identical by state." It means two alleles are the same regardless

of whether the alleles are copies of the same ancestral alleles or not. IBD stands

for "identical by descent." Two alleles are said to be IBD if those two alleles are

copies of the same ancestral alleles. Since humans are diploid, that is, having two

haploid sets of chromosomes, two sibs can share 2, 1, or 0 marker alleles at a locus.

Consequently, their IBD scores would be equal to 2, 1, or 0, respectively. We will

discuss the distribution of IBD scores in the next section. We consider two alleles to be

identical by descent only if they are copies of the same ancestral allele in the pedigree

under study, i.e., if the alleles are traceable.









With the affected-sib-pair method, since the sib-pairs were selected for study on

the condition that both sibs are sick with the same disease (biological character),

if the disease is caused by a gene, then a marker adjacent to the disease gene has

a higher probability to segregate with the gene during meiosis. Hence, this marker

tends to have a higher IBD score. This method can serve as a tool to locate the trait

genes.


2.4 Distributions of IBD


Under ideal conditions, in using the affected-sib-pair method, we can take IBD

scores to make inferences about the recombination fraction between the trait gene(s)

and the marker(s). Hence it is important to know the distribution of the IBD scores.

According to Mendel's First Law, each of a parent's haplotypes has an equal chance

to be passed to offspring. Therefore, if sibs were not selected conditionally on any

other biological character, then the probability for a pair of full sibs to have an IBD

score equal to 2 is 0.25, to 1 is 0.5, and to 0 is 0.25. If sibs are selected conditional on

a biological character and if the marker(s) we examine is unlinked to the trait gene(s)

that is responsible for the biological character, then, regardless of the inheritance mode,

penetrance, and allele frequency of the trait gene the probability mass function is also

0.25, 0.5, and 0.25 for IBD equal to 2, 1, and 0, respectively. Therefore, if the observed

IBD score distribution deviates from this mass function, it is evidence that the observed IBD

scores are not from a random model. Nevertheless, the distribution of the IBD score

under specified conditions is still worth investigating, especially to develop more

powerful test statistics or to do power analysis.
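The null distribution (0.25, 0.5, 0.25) can be checked by direct enumeration. The following is a minimal sketch, not part of the original text; the function name `sib_ibd_distribution` is hypothetical. It assumes only Mendel's First Law: each sib independently receives one of the father's two alleles and one of the mother's two alleles with equal probability.

```python
from fractions import Fraction
from itertools import product

def sib_ibd_distribution():
    """Exact distribution of the IBD score of two full sibs under no selection."""
    counts = {0: 0, 1: 0, 2: 0}
    # f1, m1: which paternal / maternal allele (0 or 1) sib 1 receives; f2, m2: same for sib 2.
    for f1, m1, f2, m2 in product((0, 1), repeat=4):
        ibd = int(f1 == f2) + int(m1 == m2)  # alleles shared IBD: 0, 1, or 2
        counts[ibd] += 1
    total = sum(counts.values())  # 16 equally likely transmission patterns
    return {score: Fraction(c, total) for score, c in counts.items()}

if __name__ == "__main__":
    for score, prob in sorted(sib_ibd_distribution().items()):
        print(score, prob)
```

The enumeration gives probability 1/4 for IBD = 0, 1/2 for IBD = 1, and 1/4 for IBD = 2, in agreement with the mass function stated above.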

Li and Sacks (1954) developed a method using stochastic matrices to derive the

joint distribution of the genotype of two relatives. They used an example to demon-

strate their method: a gene with two alleles, say A and a, with population frequencies








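$p$ and $q = 1 - p$. First, obtain a conditional probability matrix $P = \{p_{ij}\}$,

$$
\begin{array}{ll|ccc}
 & & \multicolumn{3}{c}{\text{relative 2}} \\
 & & AA & Aa & aa \\ \hline
 & AA & p_{11} & p_{12} & p_{13} \\
\text{relative 1} & Aa & p_{21} & p_{22} & p_{23} \\
 & aa & p_{31} & p_{32} & p_{33}
\end{array}
$$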

where $p_{11}$ is the conditional probability that relative 2 has AA given that relative 1
has AA; $p_{12}$ that relative 2 has Aa; $p_{13}$ that relative 2 has aa; and so on.
Once such a matrix is obtained, the absolute frequencies of all combinations
of genotypes of relative pairs can be obtained by multiplying each
row by the corresponding genotype frequency of relative 1. In their example, that means multiplying
the first row, $(p_{11}, p_{12}, p_{13})$, by $p^2$, the second row by $2pq$, and the third row by $q^2$. Li
and Sacks' method uses three basic transition probability matrices, $I$, $T$, and $O$, to
construct $P$. The matrices $I$, $T$, and $O$ are defined in the same way as $P$, except that they are
conditional on the relative pair having both, one, or no genes identical by descent, respectively. One
can see that, in their example,

$$
I = \{i_{lj}\} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
T = \{t_{lj}\} = \begin{pmatrix} p & q & 0 \\ \tfrac{1}{2}p & \tfrac{1}{2} & \tfrac{1}{2}q \\ 0 & p & q \end{pmatrix}, \quad
O = \{o_{lj}\} = \begin{pmatrix} p^2 & 2pq & q^2 \\ p^2 & 2pq & q^2 \\ p^2 & 2pq & q^2 \end{pmatrix},
$$

where $l = 1, 2, 3$ and $j = 1, 2, 3$. For the matrix $I$: given that relatives 1 and 2
share two genes identical by descent, the probability that the genotype of relative
2 is AA, given that the genotype of relative 1 is AA, is equal to 1, so $i_{11} = 1$. For the same
reason, the diagonal elements of $I$, $i_{ll}$, are equal to one, and the off-diagonal elements
are equal to zero. Now consider the matrix $T$. Given that the two relatives share
only one gene IBD, if relative 1 has genotype AA and relative 2 also has genotype
AA, relative 2 must have received one of his A alleles from somewhere other than the
ancestor from whom relative 1 received his A allele. Since the population frequency of A
is $p$, $t_{11}$ is equal to $p$. The other elements of $T$ and $O$ can be
obtained by the same reasoning.
They showed several examples of how to use $I$, $T$, and $O$ to construct $P$. For
the parent-offspring relationship, the transition matrix is $P = T$; the
grandparent-grandchild pair has transition matrix

$$P = T^2 = \tfrac{1}{2}T + \tfrac{1}{2}O;$$

and for the general parent-offspring type relatives,

$$P = T^{n+1} = \left(\tfrac{1}{2}\right)^{n} T + \left[1 - \left(\tfrac{1}{2}\right)^{n}\right] O,$$

where $n + 1$ is the number of generations between the two relatives. For full sibs,
the matrix is

$$P = S = \tfrac{1}{4}I + \tfrac{1}{2}T + \tfrac{1}{4}O.$$
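The construction of Li and Sacks can be illustrated numerically. The sketch below is not from the original sources; the helper names (`li_sacks_matrices`, `combine`, `matmul`, `joint_distribution`) and the allele frequency 3/10 are assumptions chosen for illustration. It builds $I$, $T$, and $O$ for a two-allele locus, forms the full-sib matrix, and checks the grandparent-grandchild identity $T^2 = \tfrac{1}{2}T + \tfrac{1}{2}O$ in exact arithmetic.

```python
from fractions import Fraction as F

def li_sacks_matrices(p):
    """I, T, O for genotypes ordered (AA, Aa, aa); p is the frequency of allele A."""
    q = 1 - p
    half = F(1, 2)
    I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    T = [[p, q, 0], [half * p, half, half * q], [0, p, q]]
    O = [[p * p, 2 * p * q, q * q] for _ in range(3)]
    return I, T, O

def combine(coeffs, mats):
    """Entrywise linear combination sum_k coeffs[k] * mats[k]."""
    return [[sum(c * m[r][k] for c, m in zip(coeffs, mats)) for k in range(3)]
            for r in range(3)]

def matmul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[r][m] * B[m][k] for m in range(3)) for k in range(3)]
            for r in range(3)]

def joint_distribution(P, p):
    """Multiply each row of P by the genotype frequency of relative 1."""
    q = 1 - p
    freq = [p * p, 2 * p * q, q * q]
    return [[freq[r] * P[r][k] for k in range(3)] for r in range(3)]

if __name__ == "__main__":
    p = F(3, 10)  # illustrative allele frequency
    I, T, O = li_sacks_matrices(p)
    S = combine([F(1, 4), F(1, 2), F(1, 4)], [I, T, O])  # full sibs
    print(matmul(T, T) == combine([F(1, 2), F(1, 2)], [T, O]))
    print(sum(sum(row) for row in joint_distribution(S, p)))
```

Using `Fraction` keeps the comparison exact, so the identity is verified without rounding tolerance.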

Campbell and Elston (1971) extended Li and Sacks' method to derive the transi-
tion probability matrices for different modes of inheritance and multiple loci. Hase-
man and Elston (1972) constructed the joint probability matrix for sibs. Risch (1990b)
and Bishop and Williamson (1990) summarized these works and gave a table of trait
IBD distribution conditional on marker IBD and relationship for sibs, grandparent-
grandchild, uncle-nephew, half-sibs, and first cousins. It is adapted here in Table 2.3.
Holmans (1993) showed that for an affected sib-pair, the possible values of the probability
mass function $p = (p_0, p_1, 1 - p_0 - p_1)$ of marker IBD scores 0, 1, and 2 are
restricted to what he called "the possible triangle", which is the intersection of
$p_0 \ge 0$, $p_1 \le \tfrac{1}{2}$, and $p_1 \ge 2p_0$, for all genetic models. His proof is included here.
Let $p_i$ and $z_i$ be the probabilities of a sib-pair sharing $i$ alleles IBD at the marker locus and
the trait locus, respectively, and let $\theta$ be the recombination fraction between the marker and the














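Table 2.3. The values of Pr(IBD of trait gene | IBD of marker and relationship).

IBD of marker             IBD of trait
                          2                   1                                              0
sibs
  2                       $\Psi^2$            $2\Psi(1-\Psi)$                                $(1-\Psi)^2$
  1                       $\Psi(1-\Psi)$      $1 - 2\Psi + 2\Psi^2$                          $\Psi(1-\Psi)$
  0                       $(1-\Psi)^2$        $2\Psi(1-\Psi)$                                $\Psi^2$
grandparent-grandchild
  1                                           $1-\theta$                                     $\theta$
  0                                           $\theta$                                       $1-\theta$
uncle-nephew
  1                                           $\Psi(1-\theta) + \tfrac12\theta$              $1 - \tfrac12\theta - \Psi(1-\theta)$
  0                                           $1 - \tfrac12\theta - \Psi(1-\theta)$          $\Psi(1-\theta) + \tfrac12\theta$
half-sibs
  1                                           $\Psi$                                         $1-\Psi$
  0                                           $1-\Psi$                                       $\Psi$
first cousins
  1                                           $\Psi(1-\theta)^2 + \tfrac12\theta^2$          $1 - \tfrac12\theta^2 - \Psi(1-\theta)^2$
  0                                           $\tfrac13\{1 - \tfrac12\theta^2 - \Psi(1-\theta)^2\}$   $\tfrac13\{2 + \Psi(1-\theta)^2 + \tfrac12\theta^2\}$

Where $\theta$ is the recombination fraction between the marker and trait gene, and
$\Psi = \theta^2 + (1-\theta)^2$.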








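trait gene, and $\Psi = \theta^2 + (1 - \theta)^2$. Then,

$$p_0 = \Psi^2 z_0 + \Psi(1 - \Psi) z_1 + (1 - \Psi)^2 z_2,$$

$$p_1 = 2\Psi(1 - \Psi) z_0 + [\Psi^2 + (1 - \Psi)^2] z_1 + 2\Psi(1 - \Psi) z_2,$$

$$p_2 = (1 - \Psi)^2 z_0 + \Psi(1 - \Psi) z_1 + \Psi^2 z_2.$$

From the second equation,

$$p_1 = 2\Psi(1 - \Psi)(z_0 + z_1 + z_2) + [\Psi^2 + (1 - \Psi)^2 - 2\Psi(1 - \Psi)] z_1 = 2\Psi(1 - \Psi) + (2\Psi - 1)^2 z_1.$$

Since $\tfrac{1}{2} \le \Psi \le 1$ and it can be shown that $z_1 \le \tfrac{1}{2}$,

$$p_1 \le 2\Psi(1 - \Psi) + (2\Psi - 1)^2/2 = \tfrac{1}{2}.$$

From the first and second equations, and using $z_0 \le z_1/2$ at the trait locus in the inequality step,

$$
\begin{aligned}
p_1 - 2p_0 &= (2\Psi - 4\Psi^2) z_0 + (2\Psi - 1)^2 z_1 + \{(1 - \Psi)[2\Psi - 2(1 - \Psi)]\} z_2 \\
&= (2\Psi - 1)^2 z_1 + 2(1 - \Psi)(2\Psi - 1)(1 - z_1 - z_0) - 2\Psi(2\Psi - 1) z_0 \\
&\ge (2\Psi - 1)^2 z_1 + 2(1 - \Psi)(2\Psi - 1) - 3(1 - \Psi)(2\Psi - 1) z_1 - \Psi(2\Psi - 1) z_1 \\
&= 2(1 - \Psi)(2\Psi - 1)(1 - 2z_1) \\
&\ge 0.
\end{aligned}
$$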
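The restriction can also be checked numerically. The sketch below is illustrative only and is not Holmans' procedure. It assumes, as the proof does, that the trait-locus probabilities $(z_0, z_1, z_2)$ themselves satisfy $z_1 \le 1/2$ and $z_1 \ge 2z_0$; the function names, the random sampling scheme, and the tolerance are assumptions. It draws random trait-locus points and recombination fractions, maps them to marker IBD probabilities with the three equations above, and confirms that every image lies in the possible triangle.

```python
import random

def marker_ibd_probs(z0, z1, z2, theta):
    """Marker IBD probabilities (p0, p1, p2) from trait-locus IBD probabilities."""
    psi = theta ** 2 + (1 - theta) ** 2
    p0 = psi ** 2 * z0 + psi * (1 - psi) * z1 + (1 - psi) ** 2 * z2
    p1 = 2 * psi * (1 - psi) * z0 + (psi ** 2 + (1 - psi) ** 2) * z1 + 2 * psi * (1 - psi) * z2
    p2 = (1 - psi) ** 2 * z0 + psi * (1 - psi) * z1 + psi ** 2 * z2
    return p0, p1, p2

def in_possible_triangle(p0, p1, tol=1e-12):
    """p0 >= 0, p1 <= 1/2, p1 >= 2*p0, up to a small floating-point tolerance."""
    return p0 >= -tol and p1 <= 0.5 + tol and p1 >= 2 * p0 - tol

def check_possible_triangle(trials=10000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        # Random convex combination of the triangle's vertices in (z0, z1):
        # (0, 0), (0, 1/2), and (1/4, 1/2).
        w = [rng.random() for _ in range(3)]
        a, b, c = (x / sum(w) for x in w)
        z0, z1 = c / 4, (b + c) / 2
        z2 = 1 - z0 - z1
        theta = rng.uniform(0.0, 0.5)
        p0, p1, _ = marker_ibd_probs(z0, z1, z2, theta)
        if not in_possible_triangle(p0, p1):
            return False
    return True

if __name__ == "__main__":
    print(check_possible_triangle())
```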

This proves the "possible triangle" restriction. He also proposed a procedure for finding
the maximum likelihood estimator under the above restriction. That procedure
is as follows:

1. Obtain $\hat{z} = (\hat{z}_0, \hat{z}_1, 1 - \hat{z}_0 - \hat{z}_1)$ from the unrestricted method, which is the sample
frequency.

2. If $\hat{z}_1 > \tfrac{1}{2}$, remaximize $\hat{z}$ subject to the constraint $z_1 = \tfrac{1}{2}$. If the resultant $\hat{z}_0 > \tfrac{1}{4}$,
then reset $\hat{z}$ to the null value, which is (0.25, 0.5, 0.25).

3. If $2\hat{z}_0 > \hat{z}_1$, remaximize $\hat{z}$ subject to the constraint $z_1 = 2z_0$. If the resultant
$\hat{z}_0 > \tfrac{1}{4}$, then reset $\hat{z}$ to the null value.

4. If $\hat{z}$ is in the triangle, leave it as it is.

This concludes the procedure.
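The four steps above can be sketched in code for the simplest case. This is an illustration under stated assumptions, not Holmans' implementation: it assumes fully informative markers, so that the data reduce to counts `n0`, `n1`, `n2` of sib pairs sharing 0, 1, and 2 alleles IBD and the unrestricted estimate is the sample frequency. Under that assumption the two constrained maximizations have closed forms: with $z_1 = \tfrac12$ fixed, $\hat{z}_0 = \tfrac12 n_0/(n_0 + n_2)$; with $z_1 = 2z_0$, $\hat{z}_0 = (n_0 + n_1)/(3n)$. The threshold $\tfrac14$ in the reset rule is the boundary of the triangle along each constraint. The function name `restricted_mle` is hypothetical.

```python
from fractions import Fraction

NULL = (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4))

def restricted_mle(n0, n1, n2):
    """Possible-triangle MLE (z0, z1, z2) from fully informative IBD counts."""
    n = n0 + n1 + n2
    z0, z1 = Fraction(n0, n), Fraction(n1, n)  # step 1: sample frequencies
    if z1 > Fraction(1, 2):
        # Step 2: maximize n0*log(z0) + n2*log(1/2 - z0) on the edge z1 = 1/2.
        z0 = Fraction(n0, 2 * (n0 + n2)) if (n0 + n2) > 0 else Fraction(1, 4)
        if z0 > Fraction(1, 4):
            return NULL
        return (z0, Fraction(1, 2), Fraction(1, 2) - z0)
    if 2 * z0 > z1:
        # Step 3: maximize (n0 + n1)*log(z0) + n2*log(1 - 3*z0) on the edge z1 = 2*z0.
        z0 = Fraction(n0 + n1, 3 * n)
        if z0 > Fraction(1, 4):
            return NULL
        return (z0, 2 * z0, 1 - 3 * z0)
    return (z0, z1, 1 - z0 - z1)  # step 4: already inside the triangle

if __name__ == "__main__":
    print(restricted_mle(10, 50, 40))
    print(restricted_mle(30, 40, 30))
```

With markers that are not fully informative, the constrained maximizations generally require an iterative method, which this sketch does not attempt.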
Suarez, Rice, and Reich (1978) derived a generalized sib-pair IBD score distribu-
tion in which the IBD score distribution is conditioned on the number of affected sibs
for a trait with two alleles. This conditional distribution showed that "the distribu-
tion of IBD scores depends only on the additive and dominance variances and the
population prevalence of the disorder" (p. 94).
One problem with using the IBD scores is, in order to establish IBD scores, marker
alleles must be highly polymorphic. "Often, identity by descent cannot unequivocally
be established" (Ott, 1991, p. 79). To overcome this problem with the IBD score,
Lange (1986a) proposed an affected-sib-sets method based on IBS (identical by state)
and at the same time extended the affected-sib-pair method to the affected-sib-set
(ASS) method (Lange, 1986b). He proposed a test statistic for the affected sibs in a
single nuclear family,

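$$Z = \sum_{i<j} X_{ij}, \qquad (2.1)$$

$$
X_{ij} = \begin{cases}
1 & i \text{ and } j \text{ concordant (i.e. IBS=2)}, \\
\tfrac{1}{2} & i \text{ and } j \text{ half-concordant (i.e. IBS=1)}, \\
0 & i \text{ and } j \text{ discordant (i.e. IBS=0)},
\end{cases} \qquad (2.2)
$$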

where i and j are indexes of sib-set members. Lange noted that since they "will
ultimately apply the Central Limit Theorem, it suffices to derive the mean and vari-
ance of Z for a particular affected sib set." This ASS method uses population allele
frequency to calculate the probabilities of all the possible parent mating types, then,
conditional on the parental mating type, to calculate the probability of IBS of the








sib-pair. Lange gave the mean and variance of Z "in the limiting case of an infinite
number of alleles, each of infinitesimal frequency," (this author believes that means
very highly polymorphic) as follows,

$$E(Z) = s(s - 1)/4, \qquad (2.3)$$

$$\mathrm{Var}(Z) = s(s - 1)/16, \qquad (2.4)$$

where s is the size of the sib-set. He then presented a way to combine Z statistics
from different sib-sets,

$$T = \frac{\sum_s \sum_r w_{rs}\,(Z_{rs} - E(Z_{rs}))}{\sqrt{\sum_s \sum_r w_{rs}^2\,\mathrm{Var}(Z_{rs})}}, \qquad (2.5)$$

where $Z_{rs}$ denotes the $Z$ statistic for the $r$th affected sib set of size $s$, and $w_{rs}$ is
a weight. Regarding the choice of $w_{rs}$, he claimed that if the weights $w_s$ depend only
on $s$ and not on $r$, and if the number of sib sets is large, then $T$ should approximately follow a
standard normal distribution. Based on Hodge's (1984) results, Lange
recommended using the weights

$$w_{rs} = \frac{\sqrt{s - 1}}{\sqrt{\mathrm{Var}(Z_{rs})}}.$$
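The moments (2.3) and (2.4) can be verified by enumeration in the limiting case Lange describes. The sketch below is illustrative and not taken from Lange's papers. It assumes the statistic is the sum over sib pairs of the scores 1, 1/2, and 0 for IBS = 2, 1, and 0, and it represents the infinite-allele limit by giving the parents four distinct alleles (a/b and c/d), so that IBS coincides with IBD. The function name `z_moments` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

def z_moments(s):
    """Exact mean and variance of Z for a sib set of size s, parents a/b x c/d."""
    child_genotypes = list(product("ab", "cd"))  # four equally likely genotypes
    total = Fraction(0)
    total_sq = Fraction(0)
    n_outcomes = 0
    for sibs in product(child_genotypes, repeat=s):
        z = Fraction(0)
        for g1, g2 in combinations(sibs, 2):
            shared = int(g1[0] == g2[0]) + int(g1[1] == g2[1])  # IBS = 0, 1, or 2
            z += Fraction(shared, 2)  # scores 0, 1/2, 1
        total += z
        total_sq += z * z
        n_outcomes += 1
    mean = total / n_outcomes
    variance = total_sq / n_outcomes - mean ** 2
    return mean, variance

if __name__ == "__main__":
    for s in (2, 3, 4):
        print(s, z_moments(s))
```

For each sib-set size tried, the enumerated mean and variance agree with $s(s-1)/4$ and $s(s-1)/16$.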

In 1988, Weeks and Lange (1988) generalized the ASS method to pedigrees, calling
the refined method the affected-pedigree-member (APM) method. In this method
only those pedigree members who are both affected and typed at the marker locus
enter the definition of the test statistic. They modified the above test statistic $T$ in two
ways. First, they modified the $Z$ statistic to be

$$Z_{ij} = \tfrac{1}{4}\delta(G_{ix}, G_{jx}) + \tfrac{1}{4}\delta(G_{ix}, G_{jy}) + \tfrac{1}{4}\delta(G_{iy}, G_{jx}) + \tfrac{1}{4}\delta(G_{iy}, G_{jy}),$$

where $i$ and $j$ are the indexes of affected members in a pedigree, $G_{ix}$ and $G_{jx}$ are the
maternal marker alleles of members $i$ and $j$, respectively, $G_{iy}$ and $G_{jy}$ are the paternal
marker alleles of members $i$ and $j$, respectively, and

$$
\delta(G, G') = \begin{cases}
1 & G \text{ and } G' \text{ match in state}, \\
0 & G \text{ and } G' \text{ do not match in state}.
\end{cases}
$$

They claim this modification "permits computation of the theoretical mean and variance
of the new test statistic for each pedigree by taking advantage of the theory of
multiple-person ibd relation." Second, they introduced a new weighting factor,

$$w_m = \frac{\sqrt{r_m - 1}}{\sqrt{\mathrm{Var}(Z_m)}},$$

where $r_m$ is the number of affected and typed individuals in the $m$th pedigree, and
$Z_m$ is the $Z$ statistic of the $m$th pedigree. Later, Weeks and Lange (1992) proposed a
multilocus extension of this APM method.
Bishop and Williamson (1990) studied the power of IBS methods on affected

relative pairs and showed that several factors have a major influence on the power.

These factors are the relationship of the affected pairs, the polymorphism of the
marker, the recombination distance between a trait locus and the closest marker, and
the mode of inheritance of the trait.


2.5 Risk Ratio


In 1990, Risch published a series of papers addressing risk ratio methods. In the

first of this series (1990a), a risk ratio, $\lambda_R$, is defined as the ratio of the probability of
being affected given that a type $R$ relative is affected to the population prevalence. Let
the relationship subscripts be as follows: M = monozygotic twin; S = sibling (or dizygotic
twin); O = parent (or offspring); 2 = second-degree relative; 3 = third-degree relative.
For a single-locus model, two recurrence-risk patterns were derived:

$$\lambda_O - 1 = 2(\lambda_2 - 1) = 4(\lambda_3 - 1) \qquad (2.6)$$

and

$$\lambda_M = 4\lambda_S - 2\lambda_O - 1. \qquad (2.7)$$
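Patterns (2.6) and (2.7) can be confirmed for any particular single-locus model by conditioning on the number of alleles a relative pair shares IBD. The sketch below is an illustration, not Risch's derivation. It assumes a two-allele locus in Hardy-Weinberg equilibrium; the allele frequency and penetrances in the usage example are arbitrary, and the function name `recurrence_risk_ratios` is hypothetical. The IBD sharing probabilities used for each relationship are the standard ones (for example, 1/4, 1/2, 1/4 for sibs).

```python
from fractions import Fraction as F

def recurrence_risk_ratios(p, f_AA, f_Aa, f_aa):
    """Risk ratios lambda_R for a single two-allele locus with allele A frequency p."""
    freq = {"A": p, "a": 1 - p}
    pen = {("A", "A"): f_AA, ("A", "a"): f_Aa, ("a", "A"): f_Aa, ("a", "a"): f_aa}
    alleles = "Aa"
    # Population prevalence K.
    K = sum(freq[x] * freq[y] * pen[x, y] for x in alleles for y in alleles)
    # P(both affected | pair shares 2, 1, or 0 alleles IBD).
    both2 = sum(freq[x] * freq[y] * pen[x, y] ** 2 for x in alleles for y in alleles)
    both1 = sum(freq[x] * sum(freq[y] * pen[x, y] for y in alleles) ** 2 for x in alleles)
    both0 = K * K

    def lam(k2, k1, k0):
        return (k2 * both2 + k1 * both1 + k0 * both0) / (K * K)

    return {
        "M": lam(1, 0, 0),
        "S": lam(F(1, 4), F(1, 2), F(1, 4)),
        "O": lam(0, 1, 0),
        "2": lam(0, F(1, 2), F(1, 2)),
        "3": lam(0, F(1, 4), F(3, 4)),
    }

if __name__ == "__main__":
    lam = recurrence_risk_ratios(F(1, 10), F(9, 10), F(1, 2), F(1, 20))
    print(lam["O"] - 1 == 2 * (lam["2"] - 1) == 4 * (lam["3"] - 1))
    print(lam["M"] == 4 * lam["S"] - 2 * lam["O"] - 1)
```

Both relations hold exactly here because each $\lambda_R$ is linear in the IBD sharing probabilities of the relationship.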

Similar results were obtained for multiplicative, additive, and genetic heterogeneity

two-locus models and a general multilocus model. Important conclusions are: "for a

single-locus model, $\lambda_R - 1$ decreases by a factor of two with each degree of relationship.

. . . For a multiplicative (epistasis) model, $\lambda_R - 1$ decreases more rapidly than by a

factor of two with each degree of relationship. Examination of $\lambda_R$ values for various

classes of relatives can potentially suggest the presence of multiple loci and epistasis".

In the second paper (1990b), Risch derived the probabilities of IBD scores for

a completely polymorphic marker for several relatives under the assumption that

the recurrence-risk pattern (2.6) holds. According to his first paper of this series,

this assumption might be violated if multiple contributing loci are present. The IBD

distributions are summarized in Table 2.4. He concluded that, "for diseases with

large $\lambda$ values and for small $\theta$ value, distant relatives offer greater power. For large $\theta$

values, grandparent-grandchild pairs are best; for small $\lambda$ values, sibs are best." The

third paper (1990c) took into account the effect of marker polymorphism on power.

There were some errors in the third paper, and Risch wrote a response to address and

correct them (Risch 1992). The feasibility of his method is dependent on whether

a suitable control group is included in the study or an appropriate estimate of general

population risk can be obtained.
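The affected-sib-pair probabilities of the kind listed in Table 2.4 can be reproduced from the mixing equations of Section 2.4. The sketch below is illustrative and rests on labeled assumptions: the trait-locus IBD probabilities for affected sib pairs are taken to be $z_0 = 1/(4\lambda_S)$, $z_1 = \lambda_O/(2\lambda_S)$, and $z_2 = \lambda_M/(4\lambda_S)$ with $\lambda_M$ given by (2.7), and the closed forms in `marker_ibd_closed_form` are this editor's algebraic simplification rather than a transcription of Risch's table. All function names are hypothetical, and arguments should be `Fraction` values so that the comparison is exact.

```python
from fractions import Fraction as F

def trait_locus_ibd(lam_s, lam_o):
    """(z0, z1, z2) for affected sib pairs at the trait locus (assumed form)."""
    lam_m = 4 * lam_s - 2 * lam_o - 1  # relation (2.7)
    return 1 / (4 * lam_s), lam_o / (2 * lam_s), lam_m / (4 * lam_s)

def marker_ibd_by_mixing(lam_s, lam_o, theta):
    """Marker IBD probabilities from the mixing equations with Psi."""
    psi = theta ** 2 + (1 - theta) ** 2
    z0, z1, z2 = trait_locus_ibd(lam_s, lam_o)
    p0 = psi ** 2 * z0 + psi * (1 - psi) * z1 + (1 - psi) ** 2 * z2
    p1 = 2 * psi * (1 - psi) * (z0 + z2) + (psi ** 2 + (1 - psi) ** 2) * z1
    p2 = (1 - psi) ** 2 * z0 + psi * (1 - psi) * z1 + psi ** 2 * z2
    return p0, p1, p2

def marker_ibd_closed_form(lam_s, lam_o, theta):
    """Closed-form marker IBD probabilities (p0, p1, p2) for affected sib pairs."""
    psi = theta ** 2 + (1 - theta) ** 2
    d = lam_s - lam_o
    p2 = F(1, 4) + (2 * psi - 1) * ((lam_s - 1) + 2 * psi * d) / (4 * lam_s)
    p1 = F(1, 2) - (2 * psi - 1) ** 2 * d / (2 * lam_s)
    p0 = F(1, 4) - (2 * psi - 1) * ((lam_s - 1) + 2 * (1 - psi) * d) / (4 * lam_s)
    return p0, p1, p2

if __name__ == "__main__":
    args = (F(3), F(2), F(1, 10))  # illustrative lambda_S, lambda_O, theta
    print(marker_ibd_by_mixing(*args) == marker_ibd_closed_form(*args))
```

At $\theta = 0.5$ both functions return the null distribution (1/4, 1/2, 1/4), as expected for an unlinked marker.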


2.6 Test Statistics Based on IBD Score


There are several test statistics based on IBD score of the sib-pair method to detect

linkage. One is the Two-allele (proportion) test statistic, which is the number















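Table 2.4. The probability of IBD score for different relatives.

Marker IBD    Probability
Affected sib pairs
  2           $\tfrac14 + \tfrac{1}{4\lambda_S}(2\Psi - 1)[(\lambda_S - 1) + 2\Psi(\lambda_S - \lambda_O)]$
  1           $\tfrac12 - \tfrac{1}{2\lambda_S}(2\Psi - 1)^2(\lambda_S - \lambda_O)$
  0           $\tfrac14 - \tfrac{1}{4\lambda_S}(2\Psi - 1)[(\lambda_S - 1) + 2(1 - \Psi)(\lambda_S - \lambda_O)]$
Grandparent-grandchild pairs
  1           $\tfrac12 + \tfrac12(1 - 2\theta)\,\frac{\lambda_O - 1}{\lambda_O + 1}$
  0           $\tfrac12 - \tfrac12(1 - 2\theta)\,\frac{\lambda_O - 1}{\lambda_O + 1}$
Uncle(Aunt)-Nephew(Niece)
  1           $\tfrac12 + \tfrac12(1 - \theta)(1 - 2\theta)^2\,\frac{\lambda_O - 1}{\lambda_O + 1}$
  0           $\tfrac12 - \tfrac12(1 - \theta)(1 - 2\theta)^2\,\frac{\lambda_O - 1}{\lambda_O + 1}$
Half-sib pairs
  1           $\tfrac12 + \tfrac12(2\Psi - 1)\,\frac{\lambda_O - 1}{\lambda_O + 1}$
  0           $\tfrac12 - \tfrac12(2\Psi - 1)\,\frac{\lambda_O - 1}{\lambda_O + 1}$
First-cousin pairs
  1           $\tfrac14 + [(1 - \theta)^4 + \theta^2(1 - \theta)^2 + \tfrac12\theta^2 - \tfrac14]\,\frac{\lambda_O - 1}{\lambda_O + 3}$
  0           $\tfrac34 - [(1 - \theta)^4 + \theta^2(1 - \theta)^2 + \tfrac12\theta^2 - \tfrac14]\,\frac{\lambda_O - 1}{\lambda_O + 3}$

Where $\theta$ is the recombination fraction between gene and marker, $\Psi = \theta^2 + (1 - \theta)^2$.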









(or proportion) of sib-pairs that share two alleles, proposed by Day and Simons (1976)
and Suarez, Rice, and Reich (1978). The second, the Mean test statistic, is the

sum (or sample mean) of IBD scores, proposed by Green and Woodrow (1977). The

third is to use the Pearson chi-square goodness-of-fit test (Pearson, 1900) on the

distribution of IBD scores. All these tests assume that the IBD score frequencies are

0.25, 0.5, and 0.25 under unlinked conditions, and use the deviation of the observed

IBD scores from these frequencies to detect linkage.
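The three statistics can be computed from the counts of sib pairs with IBD scores 0, 1, and 2. The sketch below is a minimal illustration, not any cited author's implementation. It assumes fully informative IBD scores and standardizes the Two-allele and Mean statistics with their null moments (proportion 1/4 sharing two alleles; IBD score with mean 1 and variance 1/2) for use with a normal approximation. The function name `asp_test_statistics` is hypothetical.

```python
import math

def asp_test_statistics(n0, n1, n2):
    """Two-allele z, Mean z, and Pearson chi-square from IBD counts n0, n1, n2."""
    n = n0 + n1 + n2
    # Two-allele (proportion) test: proportion of pairs with IBD = 2 versus 1/4.
    two_allele = (n2 / n - 0.25) / math.sqrt(0.25 * 0.75 / n)
    # Mean test: sum of IBD scores versus its null mean n and null variance n/2.
    mean = (n1 + 2 * n2 - n) / math.sqrt(n / 2)
    # Pearson goodness-of-fit against expected counts (n/4, n/2, n/4); 2 df.
    expected = (n / 4, n / 2, n / 4)
    gof = sum((o - e) ** 2 / e for o, e in zip((n0, n1, n2), expected))
    return two_allele, mean, gof

if __name__ == "__main__":
    print(asp_test_statistics(20, 50, 30))
```

The first two statistics are one-sided (linkage increases sharing), whereas the goodness-of-fit statistic responds to deviations in any direction.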

Several papers discuss the merit of these test statistics. The Two-allele test
statistic proposed by Day and Simons (1976) was based on the supposition that

"the probability of sharing both haplotypes deviates more from the probability under

the null hypothesis than does the probability of having one, or at least one, haplotype

in common. One would test for the existence of the DS (disease susceptibility) gene

by comparing the observed frequency of having both haplotypes in common." Suarez,

Rice, and Reich (1978) suggested pooling categories IBD=0 and IBD=1, based on

their generalized sib-pair IBD distribution that we mentioned in 2.4 "Distributions
of IBD". They demonstrated that the affected-sib-pair that shared two alleles had

the most information, noting that "with minimal loss of information, we can pool

categories Pr{IBD=0,1}."

Green and Woodrow (1977) used the total number of "repeats," i.e., the sum
of IBD scores, as the test criterion. They also suggested using only affected

sib-pairs because the distribution of IBD scores is symmetric under the null hypothesis,

so the sum of "repeats" approaches a normal distribution more quickly than under

other sampling schemes.

Blackwelder and Elston (1985) examined the Two-allele test statistic, the Mean
test statistics, and Goodness-of-fit test statistic, by calculating the exact probabilities.
They concluded that the Mean test statistic is generally more powerful than either
of the other two. Schaid and Nick (1990) proposed a test statistic that is a linear

combination of the number of marker alleles IBD. If the alternative

IBD distribution can be specified, then this test statistic can be optimized. They also

suggested a tmax test statistic which was the maximum of the Mean test statistic and
the Two-alleles test statistic and showed that tmax was a more robust test statistic

in the sense that, if the mode of inheritance (recessive, dominant, and so on) is

misspecified, loss of power may not be too great.

Faraway (1993) proposed a modified chi-square test. This test is a special case of

a general method, the $\chi^2$ test with restricted alternatives, given by Lehmann (1986, p.

480). Faraway showed it was more powerful than the Two-allele, the Mean, the Pearson
chi-square, and Schaid and Nick's $t_{max}$ tests for finite samples, and claimed it was
the asymptotically uniformly most powerful invariant test. More details are given in
the following. For testing the hypothesis

$$H_0: p_0 = p_2 = 0.25,\ p_1 = 0.5 \quad \text{vs.} \quad H_1: p_0 + p_1 + p_2 = 1,$$

let the chi-square goodness-of-fit test statistic be

$$Y = 4n(\hat{p}_0 - 0.25)^2 + 2n(\hat{p}_1 - 0.5)^2 + 4n(\hat{p}_2 - 0.25)^2,$$

where $\hat{p}_i$ is the sample frequency of IBD = $i$ and $n$ is the total sample size. Faraway showed
that the possible values of the IBD score distribution, $p_i$, $i = 0, 1, 2$, are restricted to the region
$F$ that is the intersection of $p_1 + p_2 \le 1$, $p_1 \le 1/2$, and $3p_1/2 + p_2 \ge 1$, and claimed
this region is equivalent to the "possible triangle" in Holmans (1993). He then showed that,
under this restriction, the $(\tilde{p}_0, \tilde{p}_1, \tilde{p}_2)$ that maximizes the statistic $Y$ will actually turn
$Y$ into one of the Two-allele test statistic, the Mean test statistic, $Y$, and 0. The results are
as follows.


















All these discussions regarding the merit of different test statistics were based on
a model where the trait locus has only two alleles. Since the distribution of the IBD score
depends on the penetrance of the genotype as well as the frequency of the

genotype, whether these results can be extended to a model that assumes
three or more alleles for a disease susceptibility gene needs further investigation.

Based on the same two-allele model, and Suarez, Rice, and Reich's (1978) gen-

eralized sib-pair IBD score distribution, Knapp, Seuchter, and Baur (1994a) showed

that if $f_1^2 = f_0 f_2$, the Mean test is the uniformly most powerful test in $\theta$ (the recombination

fraction) regardless of the allele frequency of t, where $f_0$ is the penetrance
of the genotype tt (t is the susceptibility allele and T is the normal allele), $f_1$ is that of

the genotype Tt, and $f_2$ is that of the genotype TT. It is clear that recessive diseases

($f_1 = f_2 = 0$) satisfy this condition, with either complete or incomplete penetrance
($f_0 = 1$ or $f_0 < 1$). The authors also proved that the Mean test is the locally most powerful
test (there is no test with a larger power for any alternative within a neighborhood
of $H_0$, according to Rao, 1973), irrespective of the mode of inheritance. They stated,

"Because uniform optimality implies local optimality, no other test than the mean
test can be uniformly most powerful." In another paper (1994b) Knapp, Seuchter,

and Baur presented the equivalence of the mean test and the parametric maximum

LOD score analysis with an assumed recessive mode of inheritance.


Region                                                    Test statistic
$F$                                                       $Y$
$2\hat{p}_2 + \hat{p}_1 > 1$, $\hat{p}_1 > 1/2$           Mean
$3\hat{p}_1/2 + \hat{p}_2 < 1$, $\hat{p}_2 > 1/4$         Two-allele
$2\hat{p}_2 + \hat{p}_1 < 1$, $\hat{p}_2 < 1/4$           0









2.7 Likelihood Ratio Test


The generalized likelihood ratio test has been commonly used in genetics for

detecting linkage and heterogeneity. The LOD, an acronym for "logarithm of the

odds ratio" (Barnard, 1949), is a special version of the likelihood ratio. However, the

classical theory of the likelihood ratio test cannot be directly applied. Let us review this

theorem first.

In Serfling's book "Approximation Theorems of Mathematical Statistics", (1980,

p. 144), he states the following:

Regularity Conditions on $\mathcal{F}$. Consider $\Theta$ to be an open interval (not

necessarily finite) in R. We assume:

(R1) For each $\theta \in \Theta$, the derivatives . . . ,

where $\mathcal{F}$ is the family of distribution functions, and $\Theta$ is the parameter space of $\mathcal{F}$.

Assuming the regularity conditions, it can be shown that the maximum likelihood estimator

of $\theta$ has an asymptotic normal distribution (Serfling, 1980, p. 145). With

this property and Lemma A (Serfling, p. 153), the likelihood ratio test theorem

(Serfling, p. 155) can be proven:

Consider testing $H_0: \theta = \theta_0$, where $\theta_0$ is a $k$-dimensional vector in $\Theta$. Let

$$\Lambda_n = \frac{L(\theta_0)}{\sup_{\theta \in \Theta} L(\theta)},$$

where $n$ is the sample size. Then, under $H_0$, the statistic $-2 \ln \Lambda_n$ converges in distribution

to $\chi^2_k$ (Serfling, 1980).

The purpose of the regularity conditions is to ensure the existence of the
Taylor expansion of the distribution functions (Serfling, p. 145). The parameter

space was assumed to be an open interval in R; in fact, if $\theta_0$ is an interior point of
the parameter space $\Theta$, the theorem holds regardless of whether the parameter space

is an open interval or not.









Let

$$\mathrm{LOD} = \log_{10} \frac{\max_{0 \le \theta \le 0.5} L(\theta)}{L(\theta = 0.5)}. \qquad (2.8)$$

If the genetic model is known or is assumed to be known, some people use the LOD to
test
$$H_0: \theta = 0.5 \quad \text{vs.} \quad H_1: \theta < 0.5,$$

where $\theta$ is the recombination fraction between a marker and a gene, and claim that
$2 \ln(10) \times \mathrm{LOD} \sim \chi^2_1$ (Chotai, 1984; Elston, 1994; Shute, 1988). However, this is incorrect

because $\theta_0 = 0.5$ is on the boundary of the parameter space [0, 0.5]. Therefore,

the maximum likelihood estimator does not have an asymptotic normal distribution.

Similar errors happen when testing heterogeneity, where the parameter space is [0, 1]
and the null hypothesis, monogeneity, i.e., $\alpha = 1$, is on the boundary of the parameter
space (Ott, 1983), where $\alpha$ is the proportion of families belonging to the linked group.

Ott (1991) attempted to correct this error, arguing that LOD is a "one-sided" test.

The correct asymptotic distribution of the maximum likelihood ratio under the condition that

the null parameter value is on the boundary of the parameter space was given by Self and

Liang (1987). The asymptotic distribution of $2 \ln(10)$ times the LOD of Eq. (2.8) is a 50:50 mixture

of a $\chi^2_1$ distribution and a point mass at 0. To make inferences, a critical value of the LOD score
needs to be decided. For simple Mendelian disease, the conventional criterion for

claiming linkage is LOD > 3 (Morton, 1955, 1956). The probability of obtaining
a LOD > 3 under the null hypothesis is about 0.0001. One reason for such a low
significance level is that the prior probability of the gene being within a certain
distance from the marker is small; another reason is that we would rather have
no linkage map than have a wrong linkage map (Morton, 1955, 1956). Ott (1994) gave

a Bayesian argument, asking us to consider two hypotheses, $H_0$: free recombination
or absence of linkage ($\theta = 0.5$), and $H_1$: $\theta < 0.5$. The "prior" probability that a

gene and a marker are within a measurable distance (40 cM) is small, and based on








Elston and Lange's result in 1975, he said the probability of a marker and a gene
being within 40 cM was equal to 0.02. The posterior odds for linkage are

$$\frac{P(H_1 \mid \text{data})}{P(H_0 \mid \text{data})} = \left(\frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}\right)\left(\frac{P(H_1)}{P(H_0)}\right).$$

Since, with LOD = 3, the first term on the right-hand side is about 1,000:1 and the
second term is about 1:50, the posterior odds against linkage are equal to 0.05. Since
a multiple gene effect cannot be ruled out for a complex trait, he added one more
hypothesis, the "other" hypothesis, $H_2$. The prior probabilities are given as follows,
with $h$ being the prior probability of $H_2$.

Hypothesis                         Prior probability
$H_0$: Single gene, no linkage     0.98(1-h)
$H_1$: Single gene, linkage        0.02(1-h)
$H_2$: "Other"                     h

The posterior odds are equal to

$$Q = \frac{P(H_1 \mid \text{data})}{P(\bar{H}_1 \mid \text{data})} = \frac{P(\text{data} \mid H_1)\,P(H_1)}{P(\text{data} \mid H_0)\,P(H_0) + P(\text{data} \mid H_2)\,P(H_2)},$$

where $\bar{H}_1$ means "not $H_1$." Let $R = P(\text{data} \mid H_2)/P(\text{data} \mid H_0)$; then

$$Q = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)} \cdot \frac{0.02(1 - h)}{0.98(1 - h) + hR}.$$

If the disease in the family has a prior chance of 90% of being due to the "other"
mode of inheritance, then the critical value has to increase from 3 to 4 in order to
retain the posterior odds at 0.05.
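The numbers quoted in this section can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. The tail probability assumes the 50:50 mixture of a $\chi^2_1$ distribution and a point mass at 0 applies to $2\ln(10) \times \mathrm{LOD}$. The posterior odds use the prior probability 0.02 from the text, take the likelihood ratio for linkage to be $10^{\mathrm{LOD}}$, and default to $R = 1$, which is this editor's assumption since the text does not give a value for $R$. The function names are hypothetical.

```python
import math

def lod_tail_probability(lod):
    """P(LOD exceeds the given value) under H0, using the 50:50 mixture."""
    x = 2 * math.log(10) * lod
    # For one degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    return 0.5 * math.erfc(math.sqrt(x / 2))

def posterior_odds(lod, h=0.0, r=1.0, prior_linkage=0.02):
    """Posterior odds for linkage; h is the prior probability of the 'other' hypothesis."""
    likelihood_ratio = 10 ** lod
    numerator = prior_linkage * (1 - h)
    denominator = (1 - prior_linkage) * (1 - h) + h * r
    return likelihood_ratio * numerator / denominator

if __name__ == "__main__":
    print(lod_tail_probability(3))    # on the order of 1e-4
    print(posterior_odds(3))          # about 20:1 for linkage, i.e. about 0.05 against
    print(posterior_odds(3, h=0.9))   # much lower once the "other" hypothesis is likely
    print(posterior_odds(4, h=0.9))   # about 20:1 again with the threshold raised to 4
```

With $h = 0.9$ and $R = 1$, a LOD of 4 restores posterior odds of roughly 20:1, consistent with the increase of the critical value from 3 to 4 described above.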
Ott's Bayesian argument above applies to a single-marker situation. When we
do a genome-wide search, we use many markers to do multiple tests. While the overall









false positive rate increases due to the nature of multiple tests, Ott (1991) indicated

that, because the prior probability also increases with the number of markers, the

critical limit could remain at 3. He also reported that Lander and Botstein (1989)

investigated the problem and concluded that for the human genome, in order to keep

the overall significance level at 5%, the appropriate critical level of LOD is between 2

and 3. Later, Lander et al. (Lander and Schork, 1994; Lander and Kruglyak, 1995)

strongly advocated higher LOD thresholds for genome-wide detection, results based

on dense markers. This author has not tried to verify their "dense markers" approach

and hence will not include their results here, but would like to point out that for different

methods the distributions of the LOD are different. Thus, different thresholds should be

applied to different methods as Lander and Schork, and Lander and Kruglyak did.

In practice, it seems that most researchers forgot the other part of Morton's sug-

gestion, that is, when LOD < -2, the marker should be excluded. This author has

not seen a paper published using the method of exclusion.

Clerget-Darpoux, Bonaïti-Pellié, and Hochez (1986) studied the effects of misspecified

genetic parameters in LOD score analysis. They reported that "the power of the

linkage test is sensitive to the degree of dominance, and slightly to the penetrance,

but not to the gene frequency. In contrast, the estimation of the recombination fraction

may be strongly affected by an error on any genetic parameter." MacLean et al.

(1993) proposed a "MOD" method, which is similar to the LOD but maximizes

not only over the recombination fraction but also over the inheritance mode.


2.8 Heterogeneity and Homogeneity


There are two types of heterogeneity in linkage analysis. One is allelic and the

other is nonallelic; the latter is also referred to as locus heterogeneity. "With allelic

heterogeneity, individuals differ from each other by having different alleles at the same









locus responsible for the disease; in nonallelic heterogeneity, however, the disease is
caused by different loci" (Ott, 1991). We will discuss only nonallelic heterogeneity

in this section because it gives rise to more than one recombination fraction, which can be

detected by linkage analysis.
There are several methods to test for homogeneity. These methods include a

method proposed by Morton (1956), and a method by Smith (1963). Morton's method
to test homogeneity is to test whether all the families have the same recombination
fraction or have different recombination fractions. He proposed a simple statistic,

$$2 \ln(10) \times \left[\sum_{i=1}^{k} Z_i(\hat{\theta}_i) - Z(\hat{\theta})\right], \qquad (2.9)$$

where $\hat{\theta}_i$ is the recombination fraction that maximizes the LOD score of the $i$th family,

and $Z_i(\hat{\theta}_i)$ denotes the maximized LOD score, $i = 1$ to $k$. $Z(\hat{\theta})$ is the total LOD-score

maximum, which occurs at a value $\hat{\theta}$ for all families combined. This statistic is

assumed to have a $\chi^2$ distribution with $k - 1$ df.
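Statistic (2.9) can be illustrated with a deliberately simple likelihood. The sketch below is not Morton's original computation. It assumes phase-known families summarized as pairs (r, n), meaning r recombinants among n informative meioses, so that the likelihood of a family is $\theta^r(1-\theta)^{n-r}$ and the maximizing value is $\min(r/n, 0.5)$. The function names and the data are hypothetical.

```python
import math

def family_lod(r, n, theta):
    """LOD score of one phase-known family with r recombinants in n meioses."""
    likelihood_ratio = theta ** r * (1 - theta) ** (n - r) / 0.5 ** n
    return math.log10(likelihood_ratio)

def theta_mle(r, n):
    """MLE of the recombination fraction, restricted to [0, 0.5]."""
    return min(r / n, 0.5)

def morton_statistic(families):
    """Return (statistic, df) for Morton's homogeneity test, Eq. (2.9)."""
    # Sum of the per-family maximized LOD scores.
    separate = sum(family_lod(r, n, theta_mle(r, n)) for r, n in families)
    # Maximum of the total LOD score with a common recombination fraction.
    r_total = sum(r for r, _ in families)
    n_total = sum(n for _, n in families)
    pooled_theta = theta_mle(r_total, n_total)
    pooled = sum(family_lod(r, n, pooled_theta) for r, n in families)
    return 2 * math.log(10) * (separate - pooled), len(families) - 1

if __name__ == "__main__":
    print(morton_statistic([(0, 10), (5, 10)]))  # one tightly linked, one unlinked family
    print(morton_statistic([(1, 10), (1, 10)]))  # identical families
```

When all families give the same estimate, the statistic is zero; it grows as the per-family estimates diverge from the pooled estimate.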

Smith (1963) assumed that there were two groups, a linked group and an unlinked
group. The linked group is the group of families that have a disease gene at a locus
linked with the markers. The members of unlinked group either have no gene that
caused the disease or have a gene at a locus that is unlinked to the markers. Smith

proposed a test for testing whether all families in the study are from the linked
group. If $\alpha$ is the true but unknown proportion of families belonging to the linked

group with recombination fraction $\theta$, let $\hat{\alpha}$ and $\hat{\theta}$ be the MLEs of $\alpha$ and $\theta$. Under $H_1$,
the LOD score of the $i$th family is given as $Z_i(\alpha, \theta) = \log[\alpha L_i(\theta) + (1 - \alpha)L_i(\tfrac{1}{2})]$. The

total LOD score is equal to $Z(\alpha, \theta) = \sum_i Z_i(\alpha, \theta)$. He assessed the significance of
nonallelic heterogeneity by the statistic

$$2 \ln(10) \times [Z(\hat{\alpha}, \hat{\theta} \mid H_1) - Z(\alpha = 1, \hat{\theta} \mid H_0)]. \qquad (2.10)$$

If $H_0$ is true, because $\alpha = 1$ is on the boundary of the parameter space, the statistic follows

a 50:50 mixture of a chi-square distribution with 1 df and a point mass at

0, not a chi-square distribution with 1 df as originally suggested.

Ott (1983) compared Smith's method, which he called the A-test, and Morton's
method, which he called the PS-test (M-test in his 1991 book). For a mixed situation

(families with or without linkages between gene locus and marker locus), Ott com-
pared the M-test with the A-test and found that the A-test was generally superior.

However, since he erred on the distribution of the test statistics (2.10), his conclusion

might need modification.

Chakravarti, Badner, and Li (1987) proposed a method to test

linkage and homogeneity using IBD scores in the case of an autosomal recessive gene.

The test was basically a goodness-of-fit test. They concluded that while the power of
their method for detecting linkage from sib-pair data is excellent, that for detecting

the heterogeneity of linkage is not. Procedural details are as follows.

For an autosomal recessive disease, the genotypes of unaffected parents in multiplex families are $Dd \times Dd$, where $D$ and $d$ are the normal and disease alleles, respectively. Among the affected sib-pairs, let the probabilities of marker IBD = 2, 1, and 0 be

\[ P_2 = \Pr(\mathrm{IBD} = 2) = y^2, \tag{2.11} \]

\[ P_1 = \Pr(\mathrm{IBD} = 1) = 2xy, \tag{2.12} \]

\[ P_0 = \Pr(\mathrm{IBD} = 0) = x^2, \tag{2.13} \]

where $x = 2\theta(1 - \theta) \le 0.5$ and $y = 1 - x$. The maximum likelihood estimator (MLE) for $x$ is

\[ \hat{x} = \frac{n_1 + 2n_0}{2n}, \tag{2.14} \]

where $n_i$ is the number of affected sib-pairs sharing $i$ markers IBD, and $n = \sum_i n_i$.

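Because the true value should not be greater than 0.5, they proposed a new estimator, $\tilde{x}$, where

\[
\tilde{x} =
\begin{cases}
\hat{x} & \text{if } \hat{x} < 0.5, \\
0.5 & \text{if } \hat{x} \ge 0.5.
\end{cases}
\]

This new estimator has a smaller variance than $\hat{x}$ under both the null and alternative hypotheses. The recombination value may be estimated by

\[
\hat\theta = \frac{1 - \sqrt{1 - 2\tilde{x}}}{2}.
\]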
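As an illustration of the estimators above, the following minimal sketch computes the MLE of $x$, its truncated version, and the implied recombination estimate from the three IBD counts. The function and variable names are hypothetical; the recombination estimate is the root of $x = 2\theta(1 - \theta)$ that lies in $[0, 0.5]$.

```python
from math import sqrt

def sib_pair_estimates(n2, n1, n0):
    """Return (x_hat, x_trunc, theta_hat) from counts of pairs with IBD = 2, 1, 0."""
    n = n2 + n1 + n0
    x_hat = (n1 + 2 * n0) / (2 * n)      # MLE of x, Eq. (2.14)
    x_trunc = min(x_hat, 0.5)            # truncated estimator, never above 0.5
    # Solve x = 2*theta*(1 - theta) for the root in [0, 0.5].
    theta_hat = (1 - sqrt(1 - 2 * x_trunc)) / 2
    return x_hat, x_trunc, theta_hat
```

For example, counts proportional to the null probabilities (25, 50, 25) give an estimate of 0.5 for both $x$ and the recombination fraction, i.e., no evidence of linkage.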

Under genetic heterogeneity, the IBD score distribution will be a mixture of two binomial distributions, under $\theta < 0.5$ and $\theta = 0.5$, in the proportions $c : 1 - c$. Thus the $P_i$'s in (2.11) through (2.13) become

\[ P_2 = c(1 - x)^2 + (1 - c)/4, \tag{2.15} \]

\[ P_1 = 2cx(1 - x) + (1 - c)/2, \tag{2.16} \]

\[ P_0 = cx^2 + (1 - c)/4. \tag{2.17} \]

Solving the above equations, they got $c = (P_2 - P_0)^2 / (P_2 - P_1 + P_0)$. Replacing $P_i$ with its MLE, they obtained the MLEs for $c$ and $x$ under heterogeneity,

\[ \hat{c} = \frac{(n_2 - n_0)^2}{n(n_2 - n_1 + n_0)}, \tag{2.18} \]

\[ \hat{x}_h = \frac{n_1 - 2n_0}{2(n_2 - n_0)}. \tag{2.19} \]

Then they proposed a two-stage method to test linkage and heterogeneity. First they tested linkage; if the null hypothesis of no linkage was rejected, they then tested for heterogeneity. In the first stage, the statistic

\[
T = 2\sqrt{2n}\,\bigl(\hat{x} - \tfrac{1}{2}\bigr),
\]

with $\hat{x}$ the MLE in (2.14), was used, and the authors claimed that $T$ has an asymptotic normal distribution. They did not explain why the truncated estimator was not used.

The statistic

\[
G = \frac{n_2^2}{n(1 - x)^2} + \frac{n_1^2}{2nx(1 - x)} + \frac{n_0^2}{nx^2} - n \tag{2.20}
\]

is used to test heterogeneity. The authors claimed that $G$ is asymptotically distributed as a $\chi^2$ variable with 1 df. However, they neither proved this nor pointed out which value should be used for $x$: the estimate obtained under homogeneity or $\hat{x}_h$. Monte Carlo simulations were used to evaluate the power.
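The quantities in this procedure are simple functions of the three IBD counts. The sketch below is illustrative only: the function names are hypothetical, the sign convention of $T$ (estimate minus 0.5) is an assumption, and the formulas for the heterogeneity estimates follow from solving (2.15) through (2.17) for $c$ and $x$.

```python
from math import sqrt

def linkage_T(n2, n1, n0):
    """First-stage statistic, written here as 2*sqrt(2n)*(x_hat - 0.5)."""
    n = n2 + n1 + n0
    x_hat = (n1 + 2 * n0) / (2 * n)
    return 2 * sqrt(2 * n) * (x_hat - 0.5)

def heterogeneity_estimates(n2, n1, n0):
    """Moment-type estimates (c_hat, x_h) of Eq. (2.18) and (2.19)."""
    n = n2 + n1 + n0
    c_hat = (n2 - n0) ** 2 / (n * (n2 - n1 + n0))
    x_h = (n1 - 2 * n0) / (2 * (n2 - n0))
    return c_hat, x_h

def goodness_of_fit_G(n2, n1, n0, x):
    """Pearson-type statistic of Eq. (2.20): sum of observed^2/expected, minus n."""
    n = n2 + n1 + n0
    return (n2 ** 2 / (n * (1 - x) ** 2)
            + n1 ** 2 / (2 * n * x * (1 - x))
            + n0 ** 2 / (n * x ** 2)
            - n)
```

With counts (53, 34, 13), which match the expected proportions for $c = 0.5$ and $x = 0.1$ exactly, `heterogeneity_estimates` recovers those two values. Which value of `x` to pass to `goodness_of_fit_G` is left to the caller, reflecting the ambiguity noted in the text.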


2.9 Polygenes


Many traits in a population show continuous variation and cannot easily be categorized into distinct classes (Klug and Cummings, 1997). Traits exhibiting continuous variation may be controlled by two or more genes. Such traits are said to exhibit continuous or quantitative variation and are examples of polygenic inheritance. The hypothesis that a large number of factors or genes are responsible for a continuous phenotype is called the multiple-factor or multiple-gene hypothesis. These genes have a special name, polygenes (Griffiths et al., 1993). Klug and Cummings (1997, p. 95) summarized some major characteristics of the multiple-factor hypothesis:

1. Characters controlled by multiple-factors can usually be quantified by measur-

ing, weighing, counting, etc.

2. Two or more pairs of genes, located throughout the genome, account for the

hereditary influence on the phenotype in an additive way. Because many genes

may be involved, inheritance of this type is often called polygenic.









3. Each gene locus may be occupied by either an additive allele, which contributes

a set amount to the phenotype, or by a nonadditive allele, which does not

contribute quantitatively to the phenotype.

4. The total effect of each additive allele at each locus, while small, is approximately equivalent to that of all other additive alleles at other gene sites.

5. Together, the genes controlling a single character produce substantial pheno-

typic variation.

6. Analysis of polygenic traits requires the study of large numbers of progeny from

a population of organisms.

If there are more than two genes controlling the phenotype, then it is important to know the number of genes involved and their effects. To address this, Tan and Chang (1972) proposed a method for estimating the number of genes in self-fertilized populations, assuming that there are only two alleles for each gene and that the effect of each gene is the same. This work was extended by Tan and D'Angelo (1979) to estimate the numbers and effects of major genes and polygenes, assuming that all major genes have the same effect and all polygenes have the same effect. Because their work was developed for self-fertilized populations, we will not cover the details.


2.10 Two-stage Genome Search


Using the DNA-level genetic marker technology proposed by Botstein et al. (1980), Lander and Botstein (1986) concluded that the affected-sib-pair method can be used for a genome search for disease genes. A two-stage design, with a screening search to eliminate nonviable marker loci followed by an intensive search to identify the gene location, is an intuitive design. Although this design has already been used in genome-wide searches, such as those by Davies et al. (1994) and Luo et al. (1995), the optimality of their designs was not discussed. Thus, there were no guidelines on how to space the markers or how to allocate the available ASPs to each stage. While Elston (1992, 1994) studied the "optimal" two-stage design and concluded that two-stage designs are more efficient than one-stage designs, he did not consider the statistical complexity of multiple testing, the interval-search nature of genome-wide linkage detection, or resource constraints. In another paper, Brown et al. (1994) studied, by simulation, the multiple-stage approach for genome search using the affected-pedigree-member method. Since Brown et al. used only simulation and considered only one pedigree, it is not obvious how to apply their results to the ASP method. Darvasi and Soller (1994) studied the optimal spacing of genetic markers for QTL traits without considering the two-stage approach. Holmans and Craddock (1997) conducted a simulation study of efficient strategies for genome scanning using maximum-likelihood affected-sib-pair analysis. The situations they considered were a sample of 200 affected sib-pairs, four different sample allocation strategies, and five grid-tightening strategies. The risk ratio of sibs was equal to 2 or 3, each marker locus had five equi-frequent alleles, and there were five possible gene locations. Since their studies were simulation studies, it is not clear how to generalize their results. Furthermore, they did not consider different sample sizes or optimal strategies under resource constraints.

The main goal of this dissertation is to answer the two-stage design question as it applies to rare autosomal recessive and dominant diseases with a dichotomous phenotype (affected or not affected).














CHAPTER 3
TWO-STAGE GENOME SEARCH FOR SIMPLE MENDELIAN DISEASE



In a two-stage search method, the first stage uses part of the ASPs with markers spread widely across the genome. The rest of the resource is used on the promising markers identified in the first stage. In this study, only the least favorable configuration, in which the gene lies in the middle of two adjacent markers, will be used for constructing the designs.

The three major statistical problems involved in deriving analytic solutions are summarized as follows:

1. Because all the markers on the same chromosome are linked, the IBD scores

are not independent. In the case of a genome search, there are many markers

spread along the chromosomes. We have to handle a high-dimensional joint

distribution.

2. In the first stage, a number of loci will be chosen for the more intensive linkage

analysis in the second stage. The number and position of these loci that pass

to the second stage are random. This makes calculation of exact distribution

in the second stage extremely difficult.

3. Since in the first stage only a small number of ASPs and a large number of

markers are used, there will be many ties in the IBD scores. The asymptotical

solutions using a continuous normal distribution cannot handle these ties.









In this dissertation, Problem 3 was handled by a multinomial distribution. We used an independence model as an approximation for Problems 1 and 2, and checked the results by simulation.


3.1 Assumptions


1. There is one and only one disease gene that increases the probability of an

individual being affected. However, the disease may have nongenetic causes.

2. Highly polymorphic, equally spaced markers are available. When $m$ markers are assigned in the first stage, their positions are $\frac{L}{2m}, \frac{3L}{2m}, \ldots, \frac{(2m-1)L}{2m}$, where $L$, the length of the genome, is equal to 3300 cM (Lewin, 1990).

3. For a rare autosomal recessive disease, the parents' disease genotypes are $D_1d$ and $D_2d$; and for a rare dominant disease, the parents' disease genotypes are $D_1d$ and $D_2D_3$, where $d$ is the disease gene.

4. In the same family, the probability that the disease of one affected sib is caused

by a gene and the other is not is negligible.

5. The cost of typing alleles is a constant, i.e., the cost of typing k markers from

one person is the same as typing one marker from k persons. This assumption

of cost ratio can be relaxed to suit practical situations, but the numerical result

would need to be re-calculated.

6. We use the two-alleles statistic (Day and Simons, 1976), i.e., let $X_{ij} = 1$ if the IBD score of the $i$th marker is 2 for the $j$th sib-pair, and $X_{ij} = 0$ otherwise. In our analytic approach, the $X_{ij}$ are assumed to be independent random variables, except for the two $X_{ij}$ adjacent to the gene.








3.2 Two-Stage Procedure


Suppose there are $n$ ASPs and there are enough resources to type $N$ marker loci. Three numbers need to be determined in a two-stage design: $n_1$ and $m$, the number of ASPs and the number of markers to be used in the first stage (Stage I), and $r$, the number of markers to be used in the second stage (Stage II). The markers for the second stage are chosen on the basis of the statistic used by Day and Simons,

\[
{}_1S_i = \sum_{j=1}^{n_1} X_{ij}, \tag{3.1}
\]

computed in the first stage, where $X_{ij}$ is defined in Assumption 6. Ideally, the $r$ markers with the highest scores are chosen for Stage II. However, in the event of ties, more than $r$ of them may have to be chosen. Thus, $R$, the actual number of markers used in Stage II, is a random variable. Formally, $R \ge r$, but if the marker(s) with the lowest score in this group (the markers for the Stage II study) is (are) taken away, then the number of remaining markers is smaller than $r$.

In Stage II, $R$ markers on $N_2$ ASPs are to be typed, where $N_2$ is the largest number of sib-pairs that can be used subject to the resource constraint. Since $R$ is a random variable, $N_2$ is also a random variable. Without loss of generality, let the $R$ markers be $0, 1, \ldots, R-1$, and the sib-pairs in Stage II be $n_1+1, n_1+2, \ldots, n_1+N_2$. Thus, $N_2$ is the largest $x$ such that $mn_1 + Rx \le N$. Define

\[
{}_2S_i = \sum_{j=n_1+1}^{n_1+N_2} X_{ij}. \tag{3.2}
\]

Then the marker with the uniquely highest ${}_2S_i$ is claimed to be the locus nearest the disease gene. If two adjacent markers share the same highest score, then the gene is claimed to lie between them, and the two markers are called a marker group; otherwise, the gene location is undetermined.
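The selection and allocation rules of the two-stage procedure can be sketched in a few lines. This is a minimal illustration under the definitions above, not the dissertation's program; the function names and the dictionary representation of Stage II scores are assumptions made for the example.

```python
def select_stage2_markers(stage1_scores, r):
    """Indices of markers passing Stage I: the top r scores plus any ties."""
    threshold = sorted(stage1_scores, reverse=True)[r - 1]  # r-th highest score
    return [i for i, s in enumerate(stage1_scores) if s >= threshold]

def stage2_pairs(N, n1, m, R):
    """Largest N2 such that m*n1 + R*N2 <= N."""
    return (N - m * n1) // R

def locate_gene(stage2_scores):
    """Apply the Stage II decision rule.

    stage2_scores maps marker index -> Stage II score.  Returns a 1-tuple for a
    uniquely highest marker, a 2-tuple for an adjacent tied pair (a marker
    group), or None when the location is undetermined.
    """
    top = max(stage2_scores.values())
    winners = sorted(i for i, s in stage2_scores.items() if s == top)
    if len(winners) == 1:
        return (winners[0],)
    if len(winners) == 2 and winners[1] - winners[0] == 1:
        return (winners[0], winners[1])
    return None
```

For instance, with Stage I scores `[3, 5, 2, 5, 4, 4, 1]` and `r = 3`, the threshold is 4 and four markers pass because of the tie, so $R = 4$ exceeds $r$ and the number of Stage II sib-pairs shrinks accordingly.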









Once the location, $I$, has been chosen, say $I = i$, the next step is to check whether we can claim linkage. Let $t_{\alpha,n_2,R}$ be the $100(1 - \alpha)$ percentile of the unique maximum of $R$ Binomial($n_2$, 0.25) random variables. If ${}_2S_I > t_{\alpha,n_2,R}$, then we claim that there is linkage at locus $i$ at significance level $\alpha$.


3.3 Probability of Allocating the Correct Marker


3.3.1 Analytic Approach


To make the analytic solution tractable, Assumption (6) is made on the distributions of the $X_{ij}$. The results are later compared with a more realistic situation by simulation. Let the markers be numbered from 0 to $m-1$. Without loss of generality, if the gene is at the end of the chromosome, we let the nearest marker be marker 0; and if the gene is between two markers, we let it be located in the middle of markers 0 and 1. If we assume that the gene location is uniformly distributed along the genome, the probability that the gene is at the end is $1/m$.

Let ${}_1S_{(r)}$ be the $r$th largest statistic in the set $\{{}_1S_i\}$ in the first stage. The probability of finding the gene by the two-stage method is

\[
\begin{aligned}
&P(\text{The marker closest to the gene is found}) \\
&= P(\text{Gene is at the end of the chromosome, marker 0 is chosen in Stage II}) && (3.3) \\
&\quad + P(\text{Gene is not at the end, either marker 0 or 1 is chosen in Stage II}). && (3.4)
\end{aligned}
\]

Eq. (3.3) can be written as a sum of the probabilities of three mutually exclusive events, Eq. (3.5) through Eq. (3.7).


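\[
\begin{aligned}
&\frac{1}{m} P(\text{Marker 0 is chosen at the end of Stage II} \mid \text{gene is at the end}) \\
&= \frac{1}{m} P({}_1S_0 \text{ passes Stage I and } {}_2S_0 \text{ is the highest in Stage II}) \\
&= \frac{1}{m} \sum_{k=0}^{n_1} P({}_1S_0 \text{ passes Stage I, } {}_2S_0 \text{ is the highest in Stage II, } {}_1S_{(r-1)} = k) \\
&= \frac{1}{m} \sum_{k=0}^{n_1} \bigl\{ P({}_1S_0 \ge {}_1S_{(r-1)},\ {}_1S_{(r-1)} = k,\ {}_2S_0 \text{ is the highest in Stage II}) && (3.5) \\
&\qquad + P(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)},\ {}_2S_0 \text{ is the highest in Stage II}) && (3.6) \\
&\qquad + P(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)},\ {}_2S_0 \text{ is the highest in Stage II}) \bigr\}. && (3.7)
\end{aligned}
\]

Eq. (3.5) is equal to (assuming independence)

\[
\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-1)},\ {}_1S_{(r-1)} = k,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{\substack{i,\,l:\ i + l \ge r-1, \\ i < r-1}} P\bigl({}_1S_0 \ge k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ (m-1-i-l)\ {}_1S_j\text{'s} < k,\ {}_2S_0 > \max_{t \ne 0} {}_2S_t\bigr) \\
&= \sum_{i=0}^{r-2} \sum_{l=r-1-i}^{m-1-i} \Bigl\{ P({}_1S_0 \ge k)\, \frac{(m-1)!}{i!\,l!\,(m-1-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-1-i-l} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{i+l} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{i + l + 1}\right]_{\mathrm{Int}}$.

Eq. (3.6) is equal to

\[
\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 > {}_1S_{(r)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} P({}_1S_{(r-1)} = k > {}_1S_0 > {}_1S_{(r)} = i,\ {}_2S_0 \text{ is uniquely highest}) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} P\bigl(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)} = i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k, \\
&\qquad\qquad h\ {}_1S_j\text{'s} = i,\ (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ is uniquely highest}\bigr) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} \Bigl\{ P(k > {}_1S_0 > i)\, \frac{(m-1)!}{(r-1-l)!\,l!\,h!\,((m-1)-(r-1)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-1)-(r-1)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r}\right]_{\mathrm{Int}}$.

Eq. (3.7) is equal to

\[
\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_{(r)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-1} P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_{(r)} = i,\ {}_2S_0 \text{ is uniquely highest}) \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} P\bigl(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)} = i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k, \\
&\qquad\qquad h\ {}_1S_j\text{'s} = i,\ (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ is uniquely highest}\bigr) \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} \Bigl\{ P(k > {}_1S_0 = i)\, \frac{(m-1)!}{(r-1-l)!\,l!\,h!\,((m-1)-(r-1)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-1)-(r-1)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1+h} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r + h}\right]_{\mathrm{Int}}$.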
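Each of the expressions for Eq. (3.5) through (3.7) ends with the same kind of Stage II factor: the probability that the score of marker 0 strictly exceeds the scores of all other markers carried into Stage II. The following minimal sketch evaluates that factor under the independence assumption. The function names are hypothetical, and the success probability of marker 0 (`p_linked`) is treated as a given input.

```python
from math import comb

def binom_pmf(n, p, t):
    """P(S = t) for S ~ Binomial(n, p)."""
    return comb(n, t) * p ** t * (1 - p) ** (n - t)

def binom_prob_below(n, p, t):
    """P(S < t) for S ~ Binomial(n, p)."""
    return sum(binom_pmf(n, p, s) for s in range(t))

def stage2_sample_size(N, n1, m, R):
    """Integer part of (N - n1*m) / R, the number of Stage II sib-pairs."""
    return (N - n1 * m) // R

def prob_unique_highest(N2, p_linked, R, p_null=0.25):
    """Sum over t of P(2S0 = t) * P(2Sj < t)^(R - 1).

    Marker 0 has a Binomial(N2, p_linked) score; the other R - 1 markers are
    treated as independent Binomial(N2, p_null) scores.
    """
    return sum(
        binom_pmf(N2, p_linked, t) * binom_prob_below(N2, p_null, t) ** (R - 1)
        for t in range(1, N2 + 1)
    )
```

The factor decreases as $R$ grows, both because more competing markers must be beaten and because a larger $R$ leaves fewer sib-pairs for Stage II under a fixed resource $N$.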

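Eq. (3.4) can be written as

\[
\begin{aligned}
&\frac{m-1}{m} P(\text{Marker 0 or 1 is chosen at the end of Stage II} \mid \text{gene is not at the end}) \\
&= \frac{m-1}{m} \bigl\{ P({}_1S_0 \text{ passes Stage I and } {}_1S_1 \text{ does not, } {}_2S_0 \text{ is the highest in Stage II}) && (3.8) \\
&\qquad + P({}_1S_1 \text{ passes Stage I and } {}_1S_0 \text{ does not, } {}_2S_1 \text{ is the highest in Stage II}) && (3.9) \\
&\qquad + P({}_1S_0 \text{ and } {}_1S_1 \text{ pass Stage I and } {}_2S_0 \text{ or } {}_2S_1 \text{ is the highest in Stage II}) \bigr\}. && (3.10)
\end{aligned}
\]

Since the gene is assumed to be in the middle of two markers, Eq. (3.8) is equal to Eq. (3.9), and both are equal to

\[
\begin{aligned}
&P({}_1S_0 \text{ passes Stage I and } {}_1S_1 \text{ does not, } {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{k=1}^{n_1} P({}_1S_0 \text{ passes Stage I and } {}_1S_1 \text{ does not, } {}_2S_0 \text{ is the highest in Stage II, } {}_1S_{(r-1)} = k) \\
&= \sum_{k=1}^{n_1} \bigl\{ P({}_1S_0 \ge {}_1S_{(r-1)} = k > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) && (3.11) \\
&\qquad + P({}_1S_{(r-1)} = k > {}_1S_0 > {}_1S_{(r)},\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) && (3.12) \\
&\qquad + P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_{(r)},\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \bigr\}. && (3.13)
\end{aligned}
\]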

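Eq. (3.11):

\[
\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-1)} = k > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{\substack{i,\,l:\ i + l \ge r-1, \\ i < r-1}} P\bigl({}_1S_0 \ge k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ (m-2-i-l)\ {}_1S_j\text{'s} < k,\ {}_1S_1 < k,\ {}_2S_0 > \max_{t \ne 0} {}_2S_t\bigr) \\
&= \sum_{i=0}^{r-2} \sum_{l=r-1-i}^{m-2-i} \Bigl\{ P({}_1S_0 \ge k,\ {}_1S_1 < k)\, \frac{(m-2)!}{i!\,l!\,(m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{i+l} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{i + l + 1}\right]_{\mathrm{Int}}$.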










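Eq. (3.12) is equal to

\[
\begin{aligned}
&P(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)},\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} P\bigl(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)} = i,\ {}_1S_0 > {}_1S_1,\ (r-1-l)\ {}_1S_j\text{'s} > k, \\
&\qquad\qquad l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i,\ (m-r-1-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 > \max_{t \ne 0,1} {}_2S_t\bigr) \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Bigl\{ P(k > {}_1S_0 > i,\ {}_1S_0 > {}_1S_1)\, \frac{(m-2)!}{(r-1-l)!\,l!\,h!\,((m-2)-(r-1)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{m-r-1-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r}\right]_{\mathrm{Int}}$.

Eq. (3.13) is equal to

\[
\begin{aligned}
&\sum_{i=1}^{k-2} P(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)} = i,\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&= \sum_{i=1}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} P\bigl(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)} = i,\ {}_1S_0 > {}_1S_1,\ (r-1-l)\ {}_1S_j\text{'s} > k, \\
&\qquad\qquad l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i,\ (m-r-1-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 > \max_{t \ne 0,1} {}_2S_t\bigr) \\
&= \sum_{i=1}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Bigl\{ P({}_1S_0 = i,\ {}_1S_1 < i)\, \frac{(m-2)!}{(r-1-l)!\,l!\,h!\,((m-2)-(r-1)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{m-r-1-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1+h} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r + h}\right]_{\mathrm{Int}}$.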

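Eq. (3.10) is equivalent to

\[
\begin{aligned}
&P({}_1S_0 \text{ and } {}_1S_1 \text{ pass Stage I, gene is found in Stage II}) \\
&= \sum_{k=1}^{n_1} \bigl\{ P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ \text{gene is found in Stage II}) && (3.14) \\
&\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.15) \\
&\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.16) \\
&\quad + P({}_1S_1 \ge {}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.17) \\
&\quad + P({}_1S_1 \ge {}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.18) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.19) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.20) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_0 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.21) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_0 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.22) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_0 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.23) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_0 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.24) \\
&\quad + P({}_1S_{(r-1)} = k > {}_1S_1 = {}_1S_0 > {}_1S_{(r)},\ \text{gene is found in Stage II}) && (3.25) \\
&\quad + P({}_1S_{(r-1)} = k > {}_1S_1 = {}_1S_0 = {}_1S_{(r)},\ \text{gene is found in Stage II}) \bigr\}. && (3.26)
\end{aligned}
\]

Again, because the gene is in the middle of two markers, Eq. (3.15) is equal to Eq. (3.17), (3.16) is equal to (3.18), (3.19) is equal to (3.21), and (3.20) is equal to (3.22).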








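Eq. (3.14) is equal to

\[
\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 \text{ is uniquely highest in Stage II}) \\
&\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \\
&= P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) && \text{(A)} \\
&\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 = {}_2S_1 \text{ is the highest}). && \text{(B)}
\end{aligned}
\]

(A)

\[
\begin{aligned}
&= \sum_{\substack{i,\,l:\ i + l \ge r-2, \\ i < r-2}} P\bigl({}_1S_0 \ge k,\ {}_1S_1 \ge k,\ {}_1S_{(r-2)} = k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ (m-2-i-l)\ {}_1S_j\text{'s} < k, \\
&\qquad\qquad {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}\bigr) \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Bigl\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\,l!\,(m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad \times \bigl[ P({}_2S_0 > \max_{t \ne 0,1} {}_2S_t,\ {}_2S_0 > {}_2S_1) + P({}_2S_1 > \max_{t \ne 0,1} {}_2S_t,\ {}_2S_1 > {}_2S_0) \bigr] \Bigr\} \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Bigl\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\,l!\,(m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad \times \Bigl[ 2 \sum_{t=1}^{N_2} P({}_2S_0 = t,\ {}_2S_1 < t)\, P({}_2S_j < t)^{i+l} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{i + l + 2}\right]_{\mathrm{Int}}$.

(B)

\[
\begin{aligned}
&= \sum_{\substack{i,\,l:\ i + l \ge r-2, \\ i < r-2}} P\bigl({}_1S_0 \ge k,\ {}_1S_1 \ge k,\ {}_1S_{(r-2)} = k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ (m-2-i-l)\ {}_1S_j\text{'s} < k, \\
&\qquad\qquad {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}\bigr) \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Bigl\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\,l!\,(m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad \times P({}_2S_0 = {}_2S_1 > \max_{t \ne 0,1} {}_2S_t) \Bigr\} \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Bigl\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\,l!\,(m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} P({}_2S_0 = t,\ {}_2S_1 = t)\, P({}_2S_j < t)^{i+l} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{i + l + 2}\right]_{\mathrm{Int}}$.

(A) + (B)

\[
\begin{aligned}
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Bigl\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\,l!\,(m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} \bigl[ 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \bigr] P({}_2S_j < t)^{i+l} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{i + l + 2}\right]_{\mathrm{Int}}$.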


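Eq. (3.15) and (3.17) are equal to

\[
\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} \bigl\{ P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \bigl\{ P({}_1S_0 \ge k,\ k > {}_1S_1 > i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{the same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Bigl\{ P({}_1S_0 \ge k,\ k > {}_1S_1 > i)\, \frac{(m-2)!}{(r-2-l)!\,l!\,h!\,((m-2)-(r-2)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} \bigl( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \bigr) P({}_2S_j < t)^{r-2} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r}\right]_{\mathrm{Int}}$.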

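Eq. (3.16) and (3.18) are equal to

\[
\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-1} \bigl\{ P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \bigr\} \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \bigl\{ P({}_1S_0 \ge k,\ {}_1S_1 = i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{the same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \bigr\} \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Bigl\{ P({}_1S_0 \ge k,\ {}_1S_1 = i)\, \frac{(m-2)!}{(r-2-l)!\,l!\,h!\,((m-2)-(r-2)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} \bigl( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \bigr) P({}_2S_j < t)^{r-2+h} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r + h}\right]_{\mathrm{Int}}$.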

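Eq. (3.19) and (3.21) are equal to

\[
\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-3} \bigl\{ P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \bigr\} \\
&= \sum_{i=0}^{k-3} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \bigl\{ P(k > {}_1S_0 > {}_1S_1 > i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{the same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \bigr\} \\
&= \sum_{i=0}^{k-3} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Bigl\{ P(k > {}_1S_0 > {}_1S_1 > i)\, \frac{(m-2)!}{(r-2-l)!\,l!\,h!\,((m-2)-(r-2)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} \bigl( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \bigr) P({}_2S_j < t)^{r-2} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r}\right]_{\mathrm{Int}}$.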

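Eq. (3.20) and (3.22) are equal to

\[
\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} \bigl\{ P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \bigl\{ P(k > {}_1S_0 > i,\ {}_1S_1 = i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{the same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Bigl\{ P(k > {}_1S_0 > i,\ {}_1S_1 = i)\, \frac{(m-2)!}{(r-2-l)!\,l!\,h!\,((m-2)-(r-2)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} \bigl( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \bigr) P({}_2S_j < t)^{r-2+h} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r + h}\right]_{\mathrm{Int}}$.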

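Eq. (3.23) is equal to

\[
\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} \bigl\{ P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \bigl\{ P(k > {}_1S_0 = {}_1S_1 > i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{the same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Bigl\{ P(k > {}_1S_0 = {}_1S_1 > i)\, \frac{(m-2)!}{(r-2-l)!\,l!\,h!\,((m-2)-(r-2)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} \bigl( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \bigr) P({}_2S_j < t)^{r-2} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r}\right]_{\mathrm{Int}}$.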

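Eq. (3.24) is equal to

\[
\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-1} \bigl\{ P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \bigl\{ P({}_1S_0 = {}_1S_1 = i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{the same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Bigl\{ P({}_1S_0 = {}_1S_1 = i)\, \frac{(m-2)!}{(r-2-l)!\,l!\,h!\,((m-2)-(r-2)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} \bigl( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \bigr) P({}_2S_j < t)^{r-2+h} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r + h}\right]_{\mathrm{Int}}$.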

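Eq. (3.25) is equal to

\[
\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-2} \bigl\{ P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \bigl\{ P(k > {}_1S_0 = {}_1S_1 > i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-1-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{the same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Bigl\{ P(k > {}_1S_0 = {}_1S_1 > i)\, \frac{(m-2)!}{(r-1-l)!\,l!\,h!\,((m-2)-(r-1)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-1)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} \bigl( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \bigr) P({}_2S_j < t)^{r-1} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r + 1}\right]_{\mathrm{Int}}$.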

Eq. (3.26) is equal to

\[
\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r)},\ \text{gene is found}) \\
&= \sum_{i=0}^{k-1} \bigl\{ P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \bigl\{ P({}_1S_0 = {}_1S_1 = i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
&\qquad\qquad (m-r-1-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad + P(\text{the same Stage I event},\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \bigr\} \\
&= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Bigl\{ P({}_1S_0 = {}_1S_1 = i)\, \frac{(m-2)!}{(r-1-l)!\,l!\,h!\,((m-2)-(r-1)-h)!} \\
&\qquad \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-1)-h} \\
&\qquad \times \Bigl[ \sum_{t=1}^{N_2} \bigl( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \bigr) P({}_2S_j < t)^{r-1+h} \Bigr] \Bigr\},
\end{aligned}
\]

where $N_2 = \left[\frac{N - n_1 m}{r + 1 + h}\right]_{\mathrm{Int}}$.

The joint distribution of the IBD scores of two markers (or genes), P(IBD of marker (gene) 1, IBD of marker (gene) 2), is given in Table 3.1, adapted from Table 2.3, where $\Psi = \theta^2 + (1 - \theta)^2$ and $\theta$ is the recombination fraction between the two loci. From Table 3.1, we can deduce the joint distribution of $X_{ij}$ and $X_{i'j}$ for $i \ne i'$. The joint distribution of $X_{ij}$ and $X_{i'j}$ is shown in Table 3.2, with $\Psi_i$ defined as the $\Psi$ in Table 3.1.

Table 3.1. The joint distribution of the IBD scores of locus 1 and locus 2.

                      IBD of locus 2
IBD of locus 1        2                       1                               0
2                     $\Psi^2/4$              $\Psi(1-\Psi)/2$                $(1-\Psi)^2/4$
1                     $\Psi(1-\Psi)/2$        $[\Psi^2 + (1-\Psi)^2]/2$       $\Psi(1-\Psi)/2$
0                     $(1-\Psi)^2/4$          $\Psi(1-\Psi)/2$                $\Psi^2/4$

Where $\Psi = \theta^2 + (1 - \theta)^2$ and $\theta$ is the recombination fraction between the two loci.

If a marker is not linked with the disease gene, then the probabilities that the IBD score of the marker equals 2, 1, and 0 are 0.25, 0.5, and 0.25, respectively. Thus, $X_{ij}$ has a Bernoulli distribution with parameter 0.25. The distribution of ${}_lS_i$ under the null hypothesis is Binomial($n_l$, 0.25) for $l = 1, 2$. Let $\epsilon$ be the probability that the disease is caused by the gene, and let $\theta$ be the recombination fraction between marker 0 and the gene. For equations 3.5-3.7, where the recessive gene is at the end, $X_{0j}$ is distributed as Bernoulli($\epsilon\Psi^2 + (1 - \epsilon)0.25$) and ${}_1S_0$ as Binomial($n_1$, $\epsilon\Psi^2 + (1 - \epsilon)0.25$). Similarly, for a dominant disease gene, $X_{0j}$ is distributed as Bernoulli($\epsilon\Psi/2 + (1 - \epsilon)0.25$) and ${}_1S_0$ as Binomial($n_1$, $\epsilon\Psi/2 + (1 - \epsilon)0.25$).
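The relation between the joint IBD table and the Bernoulli parameters quoted above can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the table entries are written in terms of $\Psi = \theta^2 + (1 - \theta)^2$ in the standard form for two linked loci.

```python
def psi(theta):
    """Psi = theta^2 + (1 - theta)^2 for recombination fraction theta."""
    return theta ** 2 + (1 - theta) ** 2

def ibd_joint_table(theta):
    """Joint probabilities P(IBD at locus 1 = a, IBD at locus 2 = b)."""
    p = psi(theta)
    q = 1 - p
    return {
        (2, 2): p * p / 4, (2, 1): p * q / 2,           (2, 0): q * q / 4,
        (1, 2): p * q / 2, (1, 1): (p * p + q * q) / 2, (1, 0): p * q / 2,
        (0, 2): q * q / 4, (0, 1): p * q / 2,           (0, 0): p * p / 4,
    }

def two_allele_success_prob(theta, eps, dominant=False):
    """P(X = 1) for a marker at recombination fraction theta from the gene.

    eps is the probability that the disease is caused by the gene.  For a
    recessive gene the linked term is Psi^2; for a dominant gene it is Psi/2.
    """
    p = psi(theta)
    linked = p / 2 if dominant else p * p
    return eps * linked + (1 - eps) * 0.25
```

At $\theta = 0.5$ both versions reduce to 0.25, the null value, while at $\theta = 0$ with $\epsilon = 1$ the recessive case gives 1 and the dominant case 0.5, which shows why recessive diseases are easier to map with this statistic.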
For Eq. 3.11-3.13, we need to find the joint distribution $P({}_lS_0 = i, {}_lS_1 = j)$, $i = 0, 1, \ldots, n_l$, $j = 0, 1, \ldots, n_l$, $l = 1, 2$. Let $n_{abl}$ be the number of $(X_{0j}, X_{1j}) = (a, b)$ in stage $l$. Then $(n_{00l}, n_{01l}, n_{10l}, n_{11l})$ has a multinomial distribution $(n_l, p_{00}, p_{01}, p_{10}, p_{11})$. Thus, the joint distribution

\[
\begin{aligned}
&P({}_lS_0 = i,\ {}_lS_1 = j) && (3.27) \\
&= P(n_{10l} + n_{11l} = i,\ n_{01l} + n_{11l} = j) && (3.28) \\
&= \sum_{k=\max(0,\ i+j-n_l)}^{\min(i,\,j)} P(n_{11l} = k,\ n_{10l} = i - k,\ n_{01l} = j - k,\ n_{00l} = n_l - i - j + k) && (3.29)
\end{aligned}
\]

can be computed once $p_{00}$, $p_{01}$, $p_{10}$, and $p_{11}$ are specified.
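The sum in (3.29) is straightforward to evaluate directly. The following is a minimal sketch with hypothetical function names; it takes the four cell probabilities as inputs and makes no assumption about how they were obtained.

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Multinomial probability of the given cell counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)           # exact integer division at every step
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

def joint_score_pmf(i, j, n, p00, p01, p10, p11):
    """P(S0 = i, S1 = j) for n sib-pairs, following Eq. (3.27)-(3.29)."""
    total = 0.0
    for k in range(max(0, i + j - n), min(i, j) + 1):
        # k pairs with (X0, X1) = (1, 1); the other counts follow from i and j.
        counts = (n - i - j + k, j - k, i - k, k)   # n00, n01, n10, n11
        total += multinomial_pmf(counts, (p00, p01, p10, p11))
    return total
```

Summing `joint_score_pmf` over `j` recovers the Binomial($n$, $p_{10} + p_{11}$) marginal of the first score, which is a convenient sanity check on any set of cell probabilities.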
Table 3.2. The joint distribution of $X_{ij}$ and $X_{i'j}$.

                  $X_{i'j}$
$X_{ij}$          1                       0
1                 $\Psi_i^2/4$            $(1-\Psi_i^2)/4$
0                 $(1-\Psi_i^2)/4$        $(\Psi_i^2+2)/4$

Although we assume the gene is in the middle of the two adjacent markers, the formulae are derived for the gene anywhere in between. Let the recombination fraction between the gene and marker 0 be $\theta_0$ and between the gene and marker 1 be $\theta_1$. Let $\theta_2$ be the recombination fraction between the two markers. The joint distribution of $X_{0j}$ and $X_{1j}$, for $x = 0, 1$, $y = 0, 1$, is

\[
\begin{aligned}
&P(X_{0j} = x,\ X_{1j} = y) && (3.30) \\
&= P(X_{0j} = x,\ X_{1j} = y \mid \text{gene IBD} = 2)\,P(\text{gene IBD} = 2) \\
&\quad + P(X_{0j} = x,\ X_{1j} = y \mid \text{gene IBD} = 1)\,P(\text{gene IBD} = 1) \\
&\quad + P(X_{0j} = x,\ X_{1j} = y \mid \text{gene elsewhere})\,P(\text{gene elsewhere}) && (3.31) \\
&= P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD} = 2)\,P(X_{1j} = y \mid \text{gene IBD} = 2)\,P(\text{gene IBD} = 2) \\
&\quad + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD} = 1)\,P(X_{1j} = y \mid \text{gene IBD} = 1)\,P(\text{gene IBD} = 1) \\
&\quad + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene elsewhere})\,P(X_{1j} = y \mid \text{gene elsewhere})\,P(\text{gene elsewhere}). && (3.32)
\end{aligned}
\]

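Since an individual has to have two recessive disease genes in order to be affected, Eq. (3.32) becomes

\[
\begin{aligned}
&P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD} = 2)\,P(X_{1j} = y \mid \text{gene IBD} = 2)\,\epsilon \\
&\quad + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene elsewhere})\,P(X_{1j} = y \mid \text{gene elsewhere})\,(1 - \epsilon).
\end{aligned}
\]

For a dominant disease, if an ASP is caused by a gene, then both sibs must share at least the disease gene, and there is a 50-50 chance that they share the other allele. Thus, Eq. (3.32) becomes

\[
\begin{aligned}
&P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD} = 2)\,P(X_{1j} = y \mid \text{gene IBD} = 2)\,(\epsilon/2) \\
&\quad + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD} = 1)\,P(X_{1j} = y \mid \text{gene IBD} = 1)\,(\epsilon/2) \\
&\quad + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene elsewhere})\,P(X_{1j} = y \mid \text{gene elsewhere})\,(1 - \epsilon).
\end{aligned}
\tag{3.33}
\]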

Let $\Psi_l = \theta_l^2 + (1 - \theta_l)^2$, $l = 0, 1, 2$, and $p_{xy} = P(X_{0j} = x, X_{1j} = y)$, $x = 0, 1$, $y = 0, 1$. Then, applying Table 3.2, for a recessive disease we have

\[ p_{00} = (1 - \Psi_0^2)(1 - \Psi_1^2)\,\epsilon + \frac{\Psi_2^2 + 2}{4}\,(1 - \epsilon), \tag{3.34} \]

\[ p_{10} = \Psi_0^2 (1 - \Psi_1^2)\,\epsilon + \frac{1 - \Psi_2^2}{4}\,(1 - \epsilon), \tag{3.35} \]

\[ p_{01} = (1 - \Psi_0^2)\,\Psi_1^2\,\epsilon + \frac{1 - \Psi_2^2}{4}\,(1 - \epsilon), \tag{3.36} \]

\[ p_{11} = \Psi_0^2\,\Psi_1^2\,\epsilon + \frac{\Psi_2^2}{4}\,(1 - \epsilon). \tag{3.37} \]

For a dominant disease,

\[ p_{00} = (1 - \Psi_0^2)(1 - \Psi_1^2)\,\frac{\epsilon}{2} + [1 - \Psi_0(1 - \Psi_0)][1 - \Psi_1(1 - \Psi_1)]\,\frac{\epsilon}{2} + \frac{\Psi_2^2 + 2}{4}\,(1 - \epsilon), \tag{3.38} \]

\[ p_{10} = \Psi_0^2 (1 - \Psi_1^2)\,\frac{\epsilon}{2} + \Psi_0(1 - \Psi_0)[1 - \Psi_1(1 - \Psi_1)]\,\frac{\epsilon}{2} + \frac{1 - \Psi_2^2}{4}\,(1 - \epsilon), \tag{3.39} \]

\[ p_{01} = (1 - \Psi_0^2)\,\Psi_1^2\,\frac{\epsilon}{2} + [1 - \Psi_0(1 - \Psi_0)]\,\Psi_1(1 - \Psi_1)\,\frac{\epsilon}{2} + \frac{1 - \Psi_2^2}{4}\,(1 - \epsilon), \tag{3.40} \]

\[ p_{11} = \Psi_0^2\,\Psi_1^2\,\frac{\epsilon}{2} + \Psi_0(1 - \Psi_0)\,\Psi_1(1 - \Psi_1)\,\frac{\epsilon}{2} + \frac{\Psi_2^2}{4}\,(1 - \epsilon). \tag{3.41} \]

Thus, P(the marker closest to the gene is found) can be calculated. In addition to analytic computation for Eq. (3.3) and (3.4), simulations were also done.
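The four cell probabilities for the two markers flanking the gene can be computed from the recessive and dominant forms of Eq. (3.32). The sketch below is a hedged illustration rather than the dissertation's code: the function names are hypothetical, and it assumes that, given the IBD status at the gene, the two flanking indicators are independent (the gene lies between the markers), with the unlinked term taken from the joint distribution of two markers at recombination fraction `theta2`.

```python
def psi(theta):
    """Psi = theta^2 + (1 - theta)^2."""
    return theta ** 2 + (1 - theta) ** 2

def cell_probs(theta0, theta1, theta2, eps, dominant=False):
    """Return {(x, y): P(X0 = x, X1 = y)} for the two flanking markers."""
    a, b, c = psi(theta0), psi(theta1), psi(theta2)
    # P(X = 1 | gene IBD = 2) and P(X = 1 | gene IBD = 1) for each marker.
    a2, b2 = a * a, b * b
    a1, b1 = a * (1 - a), b * (1 - b)
    # Joint distribution of the two indicators when no gene is linked.
    unlinked = {(1, 1): c * c / 4, (1, 0): (1 - c * c) / 4,
                (0, 1): (1 - c * c) / 4, (0, 0): (2 + c * c) / 4}
    probs = {}
    for x in (0, 1):
        for y in (0, 1):
            g2 = (a2 if x else 1 - a2) * (b2 if y else 1 - b2)
            g1 = (a1 if x else 1 - a1) * (b1 if y else 1 - b1)
            # Recessive: affected pairs share both alleles at the gene.
            # Dominant: gene IBD is 2 or 1 with probability 1/2 each.
            linked = 0.5 * g2 + 0.5 * g1 if dominant else g2
            probs[(x, y)] = eps * linked + (1 - eps) * unlinked[(x, y)]
    return probs
```

The resulting dictionary can be passed, cell by cell, to a multinomial calculation of the joint distribution of the two flanking scores as in (3.27) through (3.29).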


3.3.2 Simulation under More Realistic Assumptions


The assumption of independent $X_{ij}$ violates the fact that some loci are linked. Simulations were done under a Markov chain model with combinations of resource $N$ = 1000, 2000, 5000, and 10,000; number of ASPs $n$ = 10, 25, 75, and 100; $\epsilon$ = 0.25, 0.5, 0.75, and 1; $m$ from 50 to 350 with an increment of 25 for the two-stage design and 10 for the one-stage design; $r$ from 5 to $m$ with an increment of $(m - 5)/10$; and $n_1$ from 5 to $n - 5$ with an increment of 5.
The simulations were conducted as follows:









1. Read in the parameters: resource N, heterogeneity ε, number of ASPs n, and the
design values m, n1, and r.

2. Given n1 and m, generate y from uniform(0,1). If y ≤ ε, then generate the gene
location from the uniform distribution on (0, 3300). Let locus l and locus l + 1 be
the two adjacent loci flanking the gene, i.e., the gene location is between
(2l − 1)L/(2m) and (2l + 1)L/(2m), where L is the total length of the genome.
This step determines which interval contains the gene, and then:

(a) For a recessive disease, generate IBD scores at loci l and l + 1 conditional
on both sibs of the ASP carrying two disease genes and the gene locus being in
the middle of the two markers. The Haldane map function is used to convert map
distance into the recombination fraction θ.

(b) For a dominant disease, if y < 0.5ε, generate IBD scores at loci l and l + 1
conditional on both sibs of the ASP sharing two alleles at the gene locus and the
gene locus being in the middle of the two markers. If 0.5ε < y ≤ ε, then generate
IBD scores at loci l and l + 1 conditional on both sibs of the ASP sharing one allele
at the gene locus and, again, the gene locus being in the middle of the two markers.

If y > ε, let l = 0 and generate the IBD score at locus 0 from Bernoulli(0.25), as if
there is no gene linked with marker 0.

3. For the Markov chain model simulation, generate the IBD score at locus i, with i
starting from l − 1 and decreasing to 0, conditional on the IBD score at locus i + 1,
and then generate the IBD score at locus i conditional on the IBD score at locus
i − 1 for i = l + 1, ..., m − 1. For the independent model simulation, generate IBD
scores independently from Trinomial(n1, 0.25, 0.5, 0.25). The conditional probability
formulas are given in Table 2.3.

4. Convert the IBD scores into the statistics Xij.

5. Repeat steps 2, 3, and 4 n1 times and then calculate the 1S_i.

6. Check for ties and adjust R to include all the ties with the rth highest 1S_i. For
Stage II, check whether 1S_0 passes Stage I. If yes, then calculate N2 according to the
resource constraint and go to the next step. If not, record the detection as a
failure and go back to step 2.

7. The Xij in Stage II are generated in the same way as those in Stage I, except that
only the chosen markers are used and the generation is repeated N2 times.

8. Check whether 2S_0 or 2S_1 is the unique largest among the 2S_i in Stage II.

9. If marker 0 or 1 is chosen, it is a success; otherwise it is a failure. Go back to
step 2 until enough simulations have been done.
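The Markov chain part of step 3 can be sketched in a few lines. The sketch below is an illustration only, not the original program: the function names are hypothetical, and the transition probabilities are the standard sib-pair IBD transition probabilities written in terms of ψ = θ² + (1 − θ)², which are assumed here to agree with Table 2.3. Map distance is converted to θ with the Haldane map function.

```python
import math
import random

def haldane_theta(distance_cm):
    """Haldane map function: map distance in cM -> recombination fraction."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def ibd_transition(theta):
    """3x3 matrix P[a][b] = P(IBD = b at next locus | IBD = a at this locus).

    Assumes the standard sib-pair result with psi = theta^2 + (1 - theta)^2.
    """
    psi = theta ** 2 + (1.0 - theta) ** 2
    return [
        [psi ** 2, 2 * psi * (1 - psi), (1 - psi) ** 2],
        [psi * (1 - psi), psi ** 2 + (1 - psi) ** 2, psi * (1 - psi)],
        [(1 - psi) ** 2, 2 * psi * (1 - psi), psi ** 2],
    ]

def simulate_chain(start_ibd, n_markers, spacing_cm, rng):
    """Walk away from a starting marker, one equally spaced marker at a time."""
    matrix = ibd_transition(haldane_theta(spacing_cm))
    ibd = [start_ibd]
    for _ in range(n_markers - 1):
        ibd.append(rng.choices([0, 1, 2], weights=matrix[ibd[-1]])[0])
    return ibd

# Example: 20 markers spaced 10 cM apart, starting from IBD = 2.
print(simulate_chain(2, 20, 10.0, random.Random(1)))
```

Because this chain is reversible with respect to the (0.25, 0.5, 0.25) distribution, the same matrix could be used both for the leftward pass (from locus l − 1 down to 0) and the rightward pass described in step 3.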

The programs were compiled using GCC version 2.7.2 on a PC with a Pentium 166
CPU and 48 MB of RAM in an OS/2 environment. The random number generator for the
simulation was adapted from Press (1992).


3.3.3 Results


Based on analytic computation and simulation, the designs with the highest
probability of finding the right marker (power) were identified. These optimal designs
are given in Table 3.3 for searching a recessive gene, and Table 3.4 for a dominant gene.
In both tables, the ASP column reports the number of available affected sib pairs;
the ε column shows the probability that the disease is caused by the gene,
representing heterogeneity; the m column, the number of marker loci used in the first
stage; the r column, the proposed number of loci to be chosen in Stage I for the Stage
II study; the n1 column, the number of ASPs used in the first stage; the F2 column,
the probability of locating the right marker by the best two-stage design obtained by
the analytic formula; the Indep and Markov columns show the probabilities
obtained by simulation under the independence assumption and the Markov chain
assumption, without combining first- and second-stage data; and the Comb. column
shows the simulated probability of the two-stage design under the Markov chain model
with first- and second-stage data combined. The F1 column gives the probability of the
best one-stage design, i.e., with optimal m and n subject to mn ≤ N, obtained by the
analytic formula. The last column, F2-F1, shows the increase in probability of the
two-stage design over the one-stage design. An asterisk marks rows where the increase
was over 0.35 for Table 3.3 (recessive) and 0.15 for Table 3.4 (dominant).
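Since the tables are printed as flat rows, a small reader can help when checking them. The following sketch is an illustration only: the field names are hypothetical labels for the fifteen columns described above, and the 0.0015 tolerance is an assumption that allows for the reported F2-F1 having been computed before rounding.

```python
FIELDS = ["asp", "eps", "m", "r", "n1", "F2", "indep2", "markov2", "comb",
          "m_one", "n_one", "F1", "indep1", "markov1", "improv"]

def parse_row(line):
    """Split one table row into named numeric fields plus an asterisk flag."""
    tokens = line.split()
    starred = tokens[-1].endswith("*")
    tokens[-1] = tokens[-1].rstrip("*")
    if len(tokens) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(tokens)))
    row = {name: float(tok) for name, tok in zip(FIELDS, tokens)}
    row["starred"] = starred
    return row

def check_row(row, star_cutoff, tol=0.0015):
    """Return (improvement matches F2 - F1, asterisk matches the cutoff rule)."""
    consistent = abs((row["F2"] - row["F1"]) - row["improv"]) <= tol
    star_ok = (row["improv"] > star_cutoff) == row["starred"]
    return consistent, star_ok

# One row of Table 3.3 (N=1000, 25 ASPs, heterogeneity 0.75), cutoff 0.35.
row = parse_row("25 0.75 125 5 5 0.642 0.654 0.630 0.642 100 10 0.269 0.269 0.268 0.374*")
print(check_row(row, star_cutoff=0.35))
```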













Table 3.3. Optimal resource allocation in two-stage genome search for rare recessive gene.

Resource N=1000
Parameter Two-stage design One-stage design Improv.
ASP ε m r n1 F2 Indep Markov Comb. m n F1 Indep Markov F2-F1
10 0.25 125 5 5 0.040 0.041 0.041 0.045 100 10 0.042 0.043 0.042 -0.001
10 0.50 175 5 5 0.149 0.152 0.146 0.167 100 10 0.121 0.126 0.116 0.027
10 0.75 175 5 5 0.366 0.369 0.352 0.412 100 10 0.269 0.268 0.272 0.098
10 1.00 175 5 5 0.662 0.662 0.631 0.723 100 10 0.481 0.484 0.477 0.181
25 0.25 125 5 5 0.092 0.095 0.090 0.093 50 20 0.059 0.062 0.059 0.034
25 0.50 125 5 5 0.324 0.339 0.316 0.324 70 14 0.127 0.126 0.130 0.197
25 0.75 125 5 5 0.642 0.654 0.630 0.642 100 10 0.269 0.269 0.268 0.374*
25 1.00 150 5 5 0.897 0.888 0.882 0.894 100 10 0.481 0.495 0.474 0.416*
50 0.25 100 5 5 0.126 0.130 0.135 0.135 50 20 0.059 0.061 0.060 0.068
50 0.50 125 5 5 0.388 0.385 0.372 0.377 70 14 0.127 0.129 0.120 0.260
50 0.75 125 5 5 0.697 0.701 0.680 0.685 100 10 0.269 0.277 0.272 0.428*
50 1.00 125 5 5 0.902 0.898 0.894 0.897 100 10 0.481 0.479 0.474 0.421*
75 0.25 100 5 5 0.135 0.141 0.134 0.136 50 20 0.059 0.059 0.062 0.077
75 0.50 100 5 5 0.400 0.404 0.401 0.404 70 14 0.127 0.125 0.133 0.273
75 0.75 125 5 5 0.697 0.707 0.689 0.696 100 10 0.269 0.278 0.269 0.429*
75 1.00 125 5 5 0.902 0.898 0.887 0.892 100 10 0.481 0.488 0.474 0.421*
100 0.25 100 5 5 0.137 0.138 0.136 0.138 50 20 0.059 0.059 0.061 0.078
100 0.50 100 5 5 0.402 0.419 0.387 0.393 70 14 0.127 0.129 0.125 0.274
100 0.75 125 5 5 0.697 0.703 0.685 0.691 100 10 0.269 0.268 0.268 0.429*
100 1.00 125 5 5 0.902 0.900 0.891 0.895 100 10 0.481 0.481 0.477 0.421*












Table 3.3-continued.


Resource N=2000
Parameter Two-stage design One-stage design Improv.
ASP ε m r n1 F2 Indep Markov Comb. m n F1 Indep Markov F2-F1
10 0.25 125 5 5 0.040 0.042 0.041 0.044 200 10 0.041 0.043 0.044 -0.001
10 0.50 350 5 5 0.181 0.198 0.154 0.170 200 10 0.178 0.188 0.177 0.003
10 0.75 350 5 5 0.500 0.513 0.419 0.481 200 10 0.460 0.472 0.441 0.040
10 1.00 350 5 5 0.883 0.880 0.745 0.834 200 10 0.807 0.808 0.779 0.076
25 0.25 175 5 10 0.108 0.120 0.114 0.118 80 25 0.085 0.090 0.088 0.023
25 0.50 175 5 10 0.441 0.455 0.427 0.453 100 20 0.269 0.272 0.265 0.172
25 0.75 175 5 10 0.824 0.830 0.801 0.832 110 18 0.565 0.570 0.563 0.259
25 1.00 175 5 10 0.984 0.984 0.974 0.985 140 14 0.841 0.845 0.829 0.143
50 0.25 150 5 10 0.177 0.184 0.168 0.173 60 33 0.088 0.095 0.094 0.089
50 0.50 150 5 10 0.594 0.611 0.578 0.588 100 20 0.269 0.275 0.266 0.326
50 0.75 150 5 10 0.893 0.905 0.887 0.891 110 18 0.565 0.570 0.561 0.329
50 1.00 175 5 10 0.990 0.990 0.987 0.992 140 14 0.841 0.852 0.829 0.149
75 0.25 150 5 5 0.213 0.221 0.206 0.208 60 33 0.088 0.087 0.093 0.124
75 0.50 150 5 10 0.614 0.626 0.605 0.610 100 20 0.269 0.274 0.269 0.345
75 0.75 100 14 10 0.897 0.901 0.891 0.905 110 18 0.565 0.565 0.567 0.332
75 1.00 175 5 10 0.990 0.990 0.986 0.991 140 14 0.841 0.838 0.828 0.149
100 0.25 150 5 5 0.229 0.240 0.219 0.221 60 33 0.088 0.089 0.098 0.141
100 0.50 125 5 10 0.623 0.638 0.618 0.621 100 20 0.269 0.283 0.266 0.354*
100 0.75 100 14 10 0.897 0.903 0.884 0.899 110 18 0.565 0.579 0.559 0.332
100 1.00 175 5 10 0.990 0.990 0.988 0.991 140 14 0.841 0.848 0.837 0.149













Table 3.3-continued.


Resource N=5000
Parameter Two-stage design One-stage design Improv.
ASP ε m r n1 F2 Indep Markov Comb. m n F1 Indep Markov F2-F1
10 0.25 125 5 5 0.040 0.039 0.037 0.041 210 10 0.041 0.046 0.035 -0.001
10 0.50 350 5 5 0.181 0.202 0.152 0.171 350 10 0.197 0.212 0.175 -0.016
10 0.75 350 5 5 0.500 0.521 0.425 0.480 350 10 0.552 0.566 0.491 -0.052
10 1.00 350 5 5 0.883 0.882 0.742 0.830 350 10 0.929 0.933 0.838 -0.047
25 0.25 300 5 15 0.118 0.130 0.114 0.128 200 25 0.127 0.138 0.131 -0.009
25 0.50 350 73 5 0.578 0.615 0.516 0.572 200 25 0.555 0.577 0.534 0.022
25 0.75 350 73 5 0.951 0.956 0.888 0.930 200 25 0.926 0.938 0.911 0.025
25 1.00 350 73 5 1.000 1.000 0.986 0.994 350 14 0.989 0.992 0.943 0.011
50 0.25 250 29 10 0.270 0.291 0.257 0.278 100 50 0.193 0.198 0.197 0.077
50 0.50 250 29 10 0.860 0.881 0.825 0.853 130 38 0.640 0.658 0.632 0.221
50 0.75 225 27 15 0.995 0.997 0.986 0.995 170 29 0.937 0.951 0.932 0.058
50 1.00 200 5 20 1.000 1.000 0.999 1.000 350 14 0.989 0.992 0.943 0.011
75 0.25 200 24 10 0.367 0.395 0.354 0.366 100 50 0.193 0.194 0.191 0.174
75 0.50 200 24 15 0.915 0.928 0.893 0.907 130 38 0.640 0.661 0.634 0.275
75 0.75 150 19 20 0.998 0.998 0.995 0.997 170 29 0.937 0.948 0.929 0.061
75 1.00 125 5 30 1.000 1.000 1.000 1.000 350 14 0.989 0.992 0.941 0.011
100 0.25 175 22 15 0.418 0.441 0.405 0.424 100 50 0.193 0.200 0.193 0.225
100 0.50 150 19 20 0.929 0.938 0.921 0.934 130 38 0.640 0.659 0.637 0.289
100 0.75 125 5 35 0.999 0.997 0.997 0.998 170 29 0.937 0.949 0.928 0.061
100 1.00 100 5 35 1.000 1.000 1.000 1.000 350 14 0.989 0.991 0.944 0.011












Table 3.3-continued.


Resource N=10,000
Parameter Two-stage design One-stage design Improv.
ASP ε m r n1 F2 Indep Markov Comb. m n F1 Indep Markov F2-F1
10 0.25 125 5 5 0.040 0.042 0.042 0.044 210 10 0.041 0.043 0.046 -0.001
10 0.50 350 5 5 0.181 0.200 0.156 0.175 350 10 0.197 0.219 0.179 -0.016
10 0.75 350 5 5 0.500 0.515 0.422 0.474 350 10 0.552 0.570 0.493 -0.052
10 1.00 350 5 5 0.883 0.885 0.739 0.830 350 10 0.929 0.928 0.835 -0.047
25 0.25 300 5 15 0.118 0.127 0.114 0.129 350 25 0.132 0.148 0.129 -0.014
25 0.50 350 73 5 0.578 0.606 0.523 0.579 350 25 0.647 0.685 0.590 -0.069
25 0.75 350 73 5 0.951 0.958 0.900 0.930 350 25 0.974 0.982 0.934 -0.023
25 1.00 350 73 5 1.000 0.999 0.985 0.993 350 25 0.997 1.000 0.993 0.003
50 0.25 350 73 5 0.294 0.326 0.269 0.279 200 50 0.295 0.314 0.281 -0.000
50 0.50 325 101 10 0.907 0.927 0.852 0.902 200 50 0.894 0.912 0.872 0.013
50 0.75 325 101 10 0.999 1.000 0.990 0.996 350 28 0.985 0.992 0.958 0.014
50 1.00 225 5 20 1.000 1.000 0.999 1.000 350 28 0.997 1.000 0.994 0.003
75 0.25 275 59 15 0.449 0.488 0.430 0.467 140 71 0.361 0.381 0.356 0.088
75 0.50 275 59 15 0.976 0.980 0.953 0.969 160 62 0.918 0.937 0.906 0.059
75 0.75 250 5 35 1.000 1.000 0.996 1.000 350 28 0.985 0.992 0.956 0.014
75 1.00 125 5 30 1.000 1.000 1.000 1.000 350 28 0.997 1.000 0.996 0.003
100 0.25 250 29 25 0.560 0.600 0.527 0.561 100 100 0.386 0.408 0.386 0.174
100 0.50 225 27 30 0.990 0.993 0.977 0.990 160 62 0.918 0.934 0.912 0.073
100 0.75 150 5 40 1.000 1.000 0.999 1.000 350 28 0.985 0.991 0.957 0.015
100 1.00 100 5 35 1.000 1.000 1.000 1.000 350 28 0.997 1.000 0.996 0.003













Table 3.4. Optimal resource allocation in two-stage genome search for rare dominant gene.

Resource N=1000
Parameter Two-stage design One-stage design Improv.
ASP ε m r n1 F2 Indep Markov Comb. m n F1 Indep Markov F2-F1
10 0.25 50 5 5 0.028 0.029 0.027 0.030 50 10 0.028 0.026 0.029 0.000
10 0.50 50 5 5 0.036 0.034 0.037 0.037 50 10 0.037 0.035 0.039 -0.000
10 0.75 100 5 5 0.050 0.053 0.051 0.059 100 10 0.053 0.059 0.053 -0.003
10 1.00 175 5 5 0.082 0.084 0.082 0.092 100 10 0.082 0.078 0.080 -0.000
25 0.25 50 5 5 0.042 0.043 0.039 0.040 50 20 0.036 0.036 0.041 0.005
25 0.50 75 5 10 0.063 0.064 0.066 0.067 50 20 0.053 0.054 0.053 0.009
25 0.75 125 5 5 0.119 0.122 0.121 0.124 50 20 0.075 0.076 0.074 0.044
25 1.00 125 5 5 0.201 0.205 0.197 0.204 50 20 0.103 0.096 0.106 0.098
50 0.25 50 5 10 0.053 0.053 0.052 0.053 50 20 0.036 0.037 0.039 0.016
50 0.50 100 5 5 0.091 0.093 0.088 0.089 50 20 0.053 0.055 0.058 0.037
50 0.75 100 5 5 0.168 0.168 0.162 0.162 50 20 0.075 0.078 0.070 0.092
50 1.00 100 5 5 0.270 0.264 0.258 0.262 50 20 0.103 0.104 0.102 0.166*
75 0.25 50 5 10 0.057 0.058 0.058 0.059 50 20 0.036 0.035 0.036 0.020
75 0.50 75 5 5 0.102 0.100 0.105 0.107 50 20 0.053 0.053 0.056 0.048
75 0.75 100 5 5 0.179 0.179 0.169 0.171 50 20 0.075 0.075 0.079 0.104
75 1.00 100 5 5 0.284 0.279 0.286 0.289 50 20 0.103 0.104 0.105 0.181*
100 0.25 50 5 5 0.058 0.055 0.059 0.058 50 20 0.036 0.039 0.040 0.022
100 0.50 75 5 5 0.107 0.104 0.103 0.102 50 20 0.053 0.053 0.053 0.054
100 0.75 75 5 5 0.184 0.184 0.183 0.183 50 20 0.075 0.080 0.081 0.109
100 1.00 100 5 5 0.286 0.288 0.286 0.291 50 20 0.103 0.110 0.110 0.183*













Table 3.4-continued.
Resource N=2000
Parameter Two-stage design One-stage design Improv.
ASP ε m r n1 F2 Indep Markov Comb. m n F1 Indep Markov F2-F1
10 0.25 50 5 5 0.028 0.027 0.028 0.029 50 10 0.028 0.026 0.025 0.000
10 0.50 50 5 5 0.036 0.033 0.036 0.039 50 10 0.037 0.037 0.040 -0.000
10 0.75 100 5 5 0.050 0.048 0.051 0.056 150 10 0.052 0.054 0.054 -0.002
10 1.00 225 5 5 0.082 0.084 0.081 0.090 200 10 0.090 0.095 0.087 -0.008
25 0.25 50 5 5 0.042 0.044 0.043 0.046 50 25 0.040 0.041 0.038 0.002
25 0.50 125 5 10 0.066 0.066 0.067 0.072 80 25 0.066 0.069 0.066 -0.000
25 0.75 175 5 10 0.136 0.130 0.133 0.143 80 25 0.117 0.122 0.111 0.019
25 1.00 175 5 10 0.246 0.241 0.245 0.259 80 25 0.189 0.198 0.192 0.057
50 0.25 50 5 15 0.053 0.048 0.045 0.050 50 40 0.048 0.046 0.048 0.005
50 0.50 100 14 10 0.113 0.115 0.111 0.117 50 40 0.079 0.075 0.079 0.034
50 0.75 150 5 10 0.226 0.225 0.226 0.232 60 33 0.122 0.121 0.126 0.104
50 1.00 150 5 10 0.385 0.392 0.372 0.381 80 25 0.189 0.187 0.191 0.197*
75 0.25 50 5 25 0.062 0.058 0.061 0.066 50 40 0.048 0.047 0.050 0.014
75 0.50 125 5 10 0.137 0.138 0.132 0.133 50 40 0.079 0.080 0.081 0.058
75 0.75 125 5 10 0.274 0.282 0.269 0.276 60 33 0.122 0.123 0.121 0.151*
75 1.00 125 5 10 0.438 0.435 0.432 0.437 80 25 0.189 0.185 0.186 0.249*
100 0.25 50 5 25 0.068 0.072 0.066 0.068 50 40 0.048 0.044 0.047 0.020
100 0.50 125 5 5 0.152 0.154 0.149 0.148 50 40 0.079 0.077 0.078 0.073
100 0.75 125 5 10 0.293 0.295 0.285 0.288 60 33 0.122 0.124 0.124 0.170*
100 1.00 125 5 10 0.456 0.445 0.445 0.449 80 25 0.189 0.192 0.186 0.267*












Table 3.4-continued.

Resource N=5000
Parameter Two-stage design One-stage design Improv.
ASP ε m r n1 F2 Indep Markov Comb. m n F1 Indep Markov F2-F1
10 0.25 50 5 5 0.028 0.029 0.028 0.031 50 10 0.028 0.029 0.029 0.000
10 0.50 50 5 5 0.036 0.036 0.035 0.039 50 10 0.037 0.040 0.037 -0.000
10 0.75 100 5 5 0.050 0.055 0.052 0.058 150 10 0.052 0.052 0.051 -0.002
10 1.00 225 5 5 0.082 0.084 0.080 0.089 240 10 0.089 0.085 0.088 -0.007
25 0.25 50 5 5 0.042 0.040 0.043 0.045 50 25 0.040 0.042 0.038 0.002
25 0.50 125 5 10 0.066 0.067 0.062 0.067 180 25 0.069 0.073 0.067 -0.004
25 0.75 225 27 5 0.141 0.141 0.132 0.144 200 25 0.159 0.160 0.145 -0.018
25 1.00 300 63 5 0.275 0.270 0.252 0.287 200 25 0.301 0.302 0.286 -0.026
50 0.25 50 5 15 0.053 0.043 0.049 0.056 50 50 0.052 0.052 0.056 0.000
50 0.50 200 43 5 0.136 0.141 0.132 0.138 100 50 0.129 0.135 0.130 0.007
50 0.75 250 29 10 0.325 0.333 0.306 0.321 100 50 0.269 0.275 0.260 0.055
50 1.00 250 29 10 0.572 0.570 0.532 0.560 100 50 0.457 0.464 0.446 0.115
75 0.25 75 40 5 0.062 0.050 0.052 0.065 60 75 0.063 0.064 0.058 -0.000
75 0.50 175 22 10 0.200 0.202 0.196 0.202 70 71 0.144 0.144 0.140 0.056
75 0.75 200 24 10 0.447 0.453 0.416 0.429 80 62 0.279 0.279 0.280 0.168*
75 1.00 200 24 15 0.701 0.698 0.659 0.687 100 50 0.457 0.459 0.445 0.245*
100 0.25 100 23 15 0.074 0.056 0.057 0.074 50 100 0.071 0.071 0.069 0.003
100 0.50 150 19 10 0.246 0.111 0.101 0.137 70 71 0.144 0.147 0.148 0.102
100 0.75 175 22 15 0.510 0.513 0.480 0.502 80 62 0.279 0.285 0.284 0.231*
100 1.00 125 17 20 0.757 0.745 0.744 0.764 100 50 0.457 0.465 0.459 0.300*













Table 3.4-continued.
Resource N=10,000
Parameter Two-stage design One-stage design Improv.
ASP ε m r n1 F2 Indep Markov Comb. m n F1 Indep Markov F2-F1
10 0.25 50 5 5 0.028 0.031 0.027 0.028 50 10 0.028 0.030 0.028 0.000
10 0.50 50 5 5 0.036 0.038 0.040 0.042 50 10 0.037 0.038 0.036 -0.000
10 0.75 100 5 5 0.050 0.054 0.051 0.057 150 10 0.052 0.057 0.054 -0.002
10 1.00 225 5 5 0.082 0.081 0.086 0.097 240 10 0.089 0.090 0.086 -0.007
25 0.25 50 5 5 0.042 0.042 0.043 0.043 50 25 0.040 0.038 0.041 0.002
25 0.50 125 5 10 0.066 0.066 0.063 0.068 180 25 0.069 0.072 0.071 -0.004
25 0.75 225 27 5 0.141 0.139 0.134 0.144 300 25 0.158 0.162 0.146 -0.017
25 1.00 300 63 5 0.275 0.274 0.244 0.270 350 25 0.315 0.316 0.270 -0.040
50 0.25 50 5 15 0.053 0.047 0.044 0.058 50 50 0.052 0.050 0.050 0.000
50 0.50 225 93 5 0.140 0.145 0.134 0.144 200 50 0.153 0.161 0.144 -0.013
50 0.75 250 101 5 0.345 0.351 0.319 0.344 200 50 0.368 0.383 0.353 -0.023
50 1.00 250 125 5 0.607 0.605 0.550 0.591 200 50 0.630 0.639 0.591 -0.023
75 0.25 100 68 5 0.062 0.045 0.047 0.058 130 75 0.064 0.067 0.060 -0.002
75 0.50 225 71 10 0.224 0.217 0.205 0.228 130 75 0.218 0.227 0.216 0.006
75 0.75 275 59 15 0.522 0.528 0.470 0.514 130 75 0.477 0.490 0.464 0.045
75 1.00 275 59 15 0.802 0.793 0.731 0.775 140 71 0.736 0.747 0.724 0.067
100 0.25 125 65 5 0.079 0.056 0.048 0.070 100 100 0.081 0.085 0.080 -0.002
100 0.50 225 49 15 0.303 0.238 0.231 0.285 100 100 0.255 0.267 0.259 0.047
100 0.75 225 49 15 0.645 0.657 0.594 0.622 100 100 0.526 0.541 0.522 0.119
100 1.00 250 29 25 0.886 0.876 0.825 0.859 100 100 0.780 0.785 0.784 0.106









3.4 Type I Error and Power of Claiming Linkage


Tables 3.3 and 3.4 give the optimal designs and the probabilities of finding the
locus linked to the responsible gene when the gene exists, but they do not provide
information on the probability of causing a false conclusion when there is no gene
responsible for the disease. The usual requirement is that the LOD score in Stage II
should be greater than a certain threshold, t, in order to claim linkage. Given R = r,
we let the threshold be t, which is the 100(1 − α) percentile of the unique maximum,
T, of r Binomial(n2, 0.25) random variables. The probability mass function of T is

\begin{align*}
P(T = s) &= P({}_2S_{I_0} = s \mid {}_2S_{I_0} \text{ is the unique maximum}) \\
&= \frac{P({}_2S_{I_0} = s,\ {}_2S_{I_0} \text{ is the unique maximum})}
        {P({}_2S_{I_0} \text{ is the unique maximum})} \\
&= \frac{r f(s) F(s-1)^{r-1}}{\sum_{u=1}^{n_2} r f(u) F(u-1)^{r-1}}.
\end{align*}

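The probability mass function of the marker group is

\begin{align*}
P(T = s) &= P({}_2S_{I_0} = {}_2S_{I_0+1} = s \mid {}_2S_{I_0} \text{ and } {}_2S_{I_0+1} \text{ are the unique maximum}) \\
&= \frac{P({}_2S_{I_0} = {}_2S_{I_0+1} = s,\ {}_2S_{I_0} \text{ and } {}_2S_{I_0+1} \text{ are the unique maximum})}
        {P({}_2S_{I_0} \text{ and } {}_2S_{I_0+1} \text{ are the unique maximum})} \\
&= \frac{(r-1) f(s)^2 F(s-1)^{r-2}}{\sum_{u=1}^{n_2} (r-1) f(u)^2 F(u-1)^{r-2}}.
\end{align*}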

Here f(s) and F(s) are the probability mass function and the cumulative distribution
function of Binomial(n2, 0.25). Tables 3.5 and 3.6 give the values of t for the unique
maximum and the marker group, respectively. To use Tables 3.5 and 3.6, first find your
n2 in the n2 column, then find your R in the R column; if your R is not in the table,
use the largest tabulated R that is smaller than your R for the same n2. The value in
the t column is the 95% percentile. The range of n2 is from 5 to 95, and that of R is
from 5 to 300.
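The two probability mass functions above are simple enough to evaluate directly. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, and the percentile is taken as the smallest t with P(T ≤ t) ≥ 1 − α, a convention assumed here that may differ from the one used to build the printed tables.

```python
from math import comb

def binomial_pmf(n, p=0.25):
    """f(s) for s = 0..n under Binomial(n, p)."""
    return [comb(n, s) * p ** s * (1 - p) ** (n - s) for s in range(n + 1)]

def max_pmf(n2, r, marker_group=False, p=0.25):
    """PMF of T, the unique maximum (single marker or two-marker group).

    The s = 0 term is kept in the loop; its weight is zero whenever
    F(-1) = 0 is raised to a positive power.
    """
    if r < (2 if marker_group else 1):
        raise ValueError("r is too small for the requested case")
    f = binomial_pmf(n2, p)
    weights, cdf_below = [], 0.0          # cdf_below holds F(s - 1)
    for s in range(n2 + 1):
        if marker_group:
            weights.append((r - 1) * f[s] ** 2 * cdf_below ** (r - 2))
        else:
            weights.append(r * f[s] * cdf_below ** (r - 1))
        cdf_below += f[s]
    total = sum(weights)
    return [w / total for w in weights]

def percentile(pmf, alpha=0.05):
    """Smallest t with P(T <= t) >= 1 - alpha (assumed convention)."""
    cumulative = 0.0
    for t, prob in enumerate(pmf):
        cumulative += prob
        if cumulative >= 1 - alpha:
            return t
    return len(pmf) - 1

# Example: r = 5 markers carried to Stage II, n2 = 5 sib pairs.
print(percentile(max_pmf(5, 5)), percentile(max_pmf(5, 5, marker_group=True)))
```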
Linkage can be claimed if and only if 2S_{I0} > t, where I0 is one of the markers
that were chosen in Stage II. Clearly,

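\begin{align*}
P(\text{linkage claim is incorrect})
\le{}& P({}_2S_{I_0} > t \mid \text{no gene is responsible})\, P(\text{no gene is responsible}) \\
&+ P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible})\, P(\text{a gene is responsible}).
\end{align*}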

The (prior) P(no gene is responsible) is usually unknown, but we can conclude that
P(I0 is wrong and 2S_{I0} > t | a gene is responsible) ≤ P(2S_{I0} > t | no gene is
responsible). A proof is as follows. Let A denote the event {the gene is between
marker i0 and marker i0 + 1}.



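\begin{align}
& P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible}) \tag{3.42} \\
&= P({}_2S_{I_0} > {}_2S_j\ \forall j \ne I_0,\ {}_2S_{I_0} > t \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \tag{3.43} \\
&= \sum_{k=t+1}^{n_2} P(k = {}_2S_{I_0} > {}_2S_j\ \forall j \ne I_0, \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \tag{3.44} \\
&= \sum_{k=t+1}^{n_2} P(k > {}_2S_j\ \forall j \ne I_0, i_0, \text{ or } i_0 + 1,\ k > {}_2S_{i_0},\ k > {}_2S_{i_0+1}, \notag \\
&\qquad\qquad {}_2S_{I_0} = k \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \tag{3.45} \\
&= \sum_{k=t+1}^{n_2} \bigl\{ P(k > {}_2S_j\ \forall j \ne I_0, i_0, \text{ or } i_0 + 1,\ I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \tag{3.46} \\
&\qquad\qquad \times P(k > {}_2S_{i_0},\ k > {}_2S_{i_0+1} \mid A) \tag{3.47} \\
&\qquad\qquad \times P({}_2S_{I_0} = k,\ I_0 \ne i_0 \text{ or } i_0 + 1 \mid A) \bigr\}, \tag{3.48}
\end{align}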









Table 3.5. The 95% percentile of the unique maximum of R Binomial(n2, 0.25).

n2 R t n2 R t n2 R t n2 R t n2 R t n2 R t
















Table 3.6. The 95% percentile of the unique maximum marker group of R Binomial(n2, 0.25).

n2 R t n2 R t n2 R t n2 R t n2 R t n2 R t
s5 4 28 64 15 45 a 18 59 108 26 72 95
5 12 5 28 203 16 45 16 19 59 248 27 72 201 31 84 227 35
5 173 5* 29 5 12 45 33 20 60 5 21 73 5 25 85 5 28
6 5 4 29 8 13 45 76 21 60 7 22 73 8 26 85 6 29
6 6 5 29 17 14 45 193 22 60 11 23 73 12 27 85 10 30
6 38 6 29 44 15 46 5 17 60 20 24 73 21 28 85 16 31
7 5 5 29 131 16 46 7 18 60 40 25 73 39 29 85 27 32
7 16 6 30 5 12 46 13 19 60 83 26 73 76 30 85 49 33
7 132 7 30 6 13 46 26 20 60 187 27 73 158 31 85 92 34
8 5 5 30 13 14 46 57 21 61 5 21 74 5 25 85 181 35
8 8 6 30 31 15 46 140 22 61 6 22 74 7 26 86 5 28
8 43 7 30 87 16 47 5 17 61 10 23 74 11 27 86 6 29
9 5 6 30 278 17 47 6 18 61 17 24 74 18 28 86 9 30
9 21 7 31 5 13 47 10 19 61 32 25 74 32 29 86 14 31
9 130 8 31 10 14 47 21 20 61 65 26 74 62 30 86 23 32
10 5 6 31 23 15 47 44 21 61 143 27 74 125 31 86 41 33
10 12 7 31 60 16 47 103 22 62 5 22 74 265 32 86 75 34
10 54 8 31 179 17 47 262 23 62 8 23 75 5 25 86 146 35
11 5 6 32 5 13 48 5 18 62 14 24 75 6 26 86 296 36
11 7 7 32 8 14 48 9 19 62 26 25 75 9 27 87 5 29
11 28 8 32 17 15 48 17 20 62 52 26 75 15 28 87 8 30
11 151 9 32 43 16 48 34 21 62 110 27 75 27 29 87 12 31
12 5 7 32 120 17 48 77 22 62 249 28 75 50 30 87 20 32
12 16 8 33 5 13 48 189 23 63 5 22 75 99 31 87 34 33
12 70 9 33 7 14 49 5 18 63 7 23 75 207 32 87 62 34
13 5 7 33 13 15 49 7 19 63 12 24 76 5 26 87 118 35
13 10 8 33 31 16 49 13 20 63 22 25 76 8 27 87 235 36
13 37 9 33 82 17 49 27 21 63 42 26 76 13 28 88 5 29
13 187 10 33 246 18 49 58 22 63 86 27 76 23 29 88 7 30
14 5 7 34 5 14 49 138 23 63 189 28 76 41 30 88 10 31
14 7 8 34 10 15 50 5 18 64 5 22 76 80 31 88 17 32
14 22 9 34 23 16 50 6 19 64 6 23 76 163 32 88 29 33
14 93 10 34 58 17 50 11 20 64 10 24 77 5 26 88 51 34
15 5 8 34 164 18 50 21 21 64 18 25 77 7 27 88 96 35
15 14 9 35 5 14 50 45 22 64 34 26 77 11 28 88 188 36
15 51 10 35 8 15 50 103 23 64 68 27 77 19 29 89 5 29
15 240 11 35 18 16 50 255 24 64 146 28 77 34 30 89 6 30
16 5 8 35 42 17 51 5 19 65 5 23 77 65 31 89 9 31
16 9 9 35 113 18 51 9 20 65 9 24 77 129 32 89 15 32
16 30 10 36 5 14 51 17 21 65 15 25 77 272 33 89 24 33
16 124 11 36 7 15 51 35 22 65 28 26 78 5 26 89 43 34
17 5 8 36 14 16 51 78 23 65 54 27 78 6 27 89 79 35
17 7 9 36 31 17 51 186 24 65 113 28 78 10 28 89 152 36
17 19 10 36 79 18 52 5 19 65 252 29 78 16 29 90 5 29
17 69 11 36 225 19 52 8 20 66 5 23 78 29 30 90 6 30
18 5 9 37 5 14 52 14 21 66 8 24 78 53 31 90 8 31
18 13 10 37 6 15 52 28 22 66 13 25 78 103 32 90 13 32
18 41 11 37 11 16 52 60 23 66 23 26 78 213 33 90 21 33
18 166 12 37 24 17 52 138 24 66 44 27 79 5 26 90 36 34
19 5 9 37 57 18 53 5 19 66 89 28 79 6 27 90 65 35
19 9 10 37 154 19 53 7 20 66 193 29 79 9 28 90 123 36
19 26 11 38 5 15 53 12 21 67 5 23 79 14 29 90 244 37
19 94 12 38 9 16 53 22 22 67 7 24 79 24 30 91 5 30
20 5 9 38 18 17 53 47 23 67 11 25 79 44 31 91 7 31
20 7 10 38 42 18 53 104 24 67 19 26 79 83 32 91 11 32
20 18 11 38 108 19 53 251 25 67 35 27 79 168 33 91 18 33
20 57 12 39 5 15 54 5 19 67 70 28 80 5 27 91 31 34
20 223 13 39 7 16 54 6 20 67 149 29 80 8 28 91 54 35
21 5 10 39 14 17 54 10 21 68 5 23 80 12 29 91 101 36
21 13 11 39 32 18 54 18 22 68 6 24 80 20 30 91 196 37
21 36 12 39 77 19 54 37 23 68 9 25 80 36 31 92 5 30
21 129 13 39 210 20 54 79 24 68 16 26 80 68 32 92 7 31
22 5 10 40 5 15 54 185 25 68 29 27 80 134 33 92 10 32
22 9 11 40 6 16 55 5 20 68 56 28 80 279 34 92 16 33
22 24 12 40 12 17 55 8 21 68 116 29 81 5 27 92 26 34
22 78 13 40 24 18 55 15 22 68 255 30 81 7 28 92 45 35
23 5 10 40 57 19 55 29 23 69 5 24 81 11 29 92 83 36
23 7 11 40 147 20 55 61 24 69 8 25 81 17 30 92 159 37
23 17 12 41 5 16 55 139 25 69 14 26 81 30 31 93 5 30
23 50 13 41 9 17 56 5 20 69 24 27 81 56 32 93 6 31
23 176 14 41 19 18 56 7 21 69 46 28 81 108 33 93 9 32
24 5 10 41 42 19 56 13 22 69 92 29 81 220 34 93 14 33
24 6 11 41 105 20 56 24 23 69 197 30 82 5 27 93 22 34
24 13 12 41 286 21 56 48 24 70 5 24 82 6 28 93 38 35
24 34 13 42 5 16 56 105 25 70 7 25 82 9 29 93 69 36
24 108 14 42 8 17 56 248 26 70 12 26 82 15 30 93 129 37
25 5 11 42 15 18 57 5 20 70 20 27 82 26 31 93 254 38
25 9 12 42 32 19 57 6 21 70 37 28 82 46 32 94 5 31
25 23 13 42 77 20 57 11 22 70 73 29 82 87 33 94 8 32
25 69 14 42 200 21 57 19 23 70 153 30 82 174 34 94 12 33
25 240 15 43 5 16 57 38 24 71 5 24 83 5 28 94 19 34
26 5 11 43 7 17 57 81 25 71 6 25 83 8 29 94 32 35
26 7 12 43 12 18 57 186 26 71 10 26 83 13 30 94 57 36
26 17 13 43 25 19 58 5 21 71 17 27 83 22 31 94 106 37
26 46 14 43 57 20 58 9 22 71 31 28 83 38 32 94 204 38
26 148 15 43 142 21 58 16 23 71 59 29 83 71 33 95 5 31
27 5 11 44 5 16 58 31 24 71 120 30 83 140 34 95 7 32
27 6 12 44 6 17 58 63 25 71 260 31 83 287 35 95 11 33
27 13 13 44 10 18 58 140 26 72 5 24 84 5 28 95 17 34
27 32 14 44 20 19 59 5 21 72 6 25 84 7 29 95 28 35
27 95 15 44 43 20 59 8 22 72 9 26 84 11 30 95 48 36
28 5 12 44 103 21 59 13 23 72 14 27 84 18 31 95 87 37
28 10 13 44 272 22 59 25 24 72 25 28 84 32 32 95 166 38
28 23 14 45 5 17 59 50 25 72 48 29 84 59 33








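and

\begin{align}
& P({}_2S_{I_0} > t \mid \text{no gene is responsible}) \tag{3.49} \\
&= \sum_{k=t+1}^{n_2} \bigl\{ P(k > {}_2S_j\ \forall j \ne I_0, i_0, \text{ or } i_0 + 1 \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid \text{no gene}) \tag{3.50} \\
&\qquad\qquad \times P(k > {}_2S_{i_0},\ k > {}_2S_{i_0+1} \mid \text{no gene}) \tag{3.51} \\
&\qquad\qquad \times P({}_2S_{I_0} = k \text{ and } I_0 \ne i_0 \text{ or } i_0 + 1 \mid \text{no gene}) \bigr\}. \tag{3.52}
\end{align}

Eq. (3.47) is less than Eq. (3.51) and the other corresponding terms are equal; therefore

\[
P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible})
\le P({}_2S_{I_0} > t \mid \text{no gene is responsible})
\]

and thus,

\begin{align*}
P(\text{linkage claim is incorrect})
&\le P({}_2S_{I_0} > t \mid \text{no gene is responsible}) \\
&= \sum_{\substack{\text{all possible results} \\ \text{of Stage I}}}
   \Bigl\{ \sum_{i=0}^{m} P({}_2S_i > t \mid I_0 = i)\, P(I_0 = i) \Bigr\}\,
   P(\text{the result of Stage I}) \\
&\le \alpha \sum_{i=0}^{m} P(I_0 = i).
\end{align*}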


3.5 Discussion


The Monte Carlo results indicated that the relative errors between the probabilities
calculated from the formulas and those from the Markov chain simulations were under
7% in the dominant cases and under 15% in the recessive cases. Consequently, the
approximation using the independence assumption for dependent marker loci was acceptable.
The simulation studies also showed that, in the dominant cases, combining Stage
I data with Stage II data did not have any significant advantage: the probability
of allocating the correct marker increased by less than 3%. For the recessive cases,
there was some advantage when the ASPs were few: the probability of allocating
the correct marker could increase by as much as 10%. However, it is very difficult to
combine data in the theoretical derivation.
The power to find the gene depends on the exact gene location between markers.
Since Tables 3.3 and 3.4 were constructed under the least favorable configuration, the
actual power should be higher.
The two-stage approach indeed boosted the probability of finding the correct gene
location under resource constraints. In many cases this probability increased by
20% to 30%. In searching for recessive disease genes, there were several instances when
the improvement exceeded 35% (Table 3.3). However, if there are enough resources
that almost all available markers can be typed on all ASPs, then the one-stage
approach may have higher power (see, e.g., N=5000 with ASP=10, or N=10,000 with
ASP=10 or 25, in Table 3.3). Since the power loss in these cases is so small, we may
always choose the optimal two-stage design.
As shown in Tables 3.3 and 3.4, many more resources are required to locate a
dominant than a recessive disease gene. For example, in a two-stage design with
25 ASPs and N=1000, we can locate a recessive disease gene with probability 0.88
when ε = 1. But for a dominant disease gene, more than 100 ASPs and/or more than
N=10,000 are needed to achieve the same probability.
Another point worth noting is that phenocopy can severely reduce the probability
of finding the correct gene location. For example, to locate a recessive gene with
resource N=5000, 25 ASPs, and ε = 1, the chance of finding it is 99.9%. However, when
ε = 0.5, even if we double the resources to N=10,000 and 50 ASPs, the chance of
finding the correct locus is only 90%. Although both settings have about 25 ASPs
whose disease is caused by the gene, the extra phenocopy ASPs reduce the probability
considerably.














CHAPTER 4
TWO-STAGE GENOME SEARCH FOR COMPLEX DISEASE



In the previous chapter, we focused on finding a single disease gene. However,
genetic diseases are not always single-gene diseases. Many of them are complex
diseases, i.e., diseases caused by several genes. For example, insulin-dependent diabetes
mellitus (IDDM) is influenced by a number of susceptibility genes and environmental
factors (Luo et al., 1995). A disease phenotype controlled by genes at several
different loci is considered to have "nonallelic heterogeneity" and "when this disease is
relatively rare and mutation rates are low, individuals within a family are generally
homogeneous. Locus heterogeneity then leads to the situation that the recombination
fraction between disease phenotype and marker will be different in different families"
(Ott, 1991, p. 199). This chapter discusses the probability of finding two unlinked
recessive genes that may cause the same disease in a two-stage search under the following
genetic model.


4.1 Genetic Model and Assumptions


Throughout this chapter, we assume the following:

No epistasis, i.e., no interaction between genes.

When an individual carries both disease genes, the penetrance is additive,
i.e., the probability of being affected for an individual who has two recessive
genes at two loci is twice as high as for an individual who has two recessive
genes at only one locus.

For an individual who has no disease gene, the probability of being affected is k0
times the probability of being affected for an individual who has two recessive
genes at one locus.

Receiving one gene has no effect on receiving other genes.

Other assumptions, necessary for simplifying the analytic study and similar to those
in the previous chapter, are:

There are two alleles at each gene locus, denoted disease gene d and normal gene
D. The population frequency of the disease gene is p for both loci.

Genes are in the middle of two adjacent markers.

In this study, only three maps, 5 cM (centimorgan), 10 cM, and 20 cM, are
available, and every marker is highly polymorphic. Marker positions on
each chromosome are at 0 cM, 5 cM, 10 cM, ..., for the 5 cM map; 0 cM, 10 cM, ...,
for the 10 cM map; and 0 cM, 20 cM, 40 cM, ..., for the 20 cM map.

The cost of typing alleles is a constant.
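As a small illustration of the additive penetrance and marker map assumptions listed above, the following sketch encodes them directly. The function names, the baseline penetrance f for an individual homozygous at exactly one locus, and the chromosome length argument are hypothetical; they are introduced only for this example.

```python
def penetrance(hom_locus1, hom_locus2, f, k0):
    """Probability of being affected under the additive, no-epistasis model.

    f  : penetrance when the individual is dd at exactly one locus (assumed name)
    k0 : ratio applied when the individual is dd at neither locus
    """
    if not 0.0 <= 2 * f <= 1.0:
        raise ValueError("2 * f must be a valid probability")
    n_hom = int(bool(hom_locus1)) + int(bool(hom_locus2))
    if n_hom == 0:
        return k0 * f
    return n_hom * f          # f for one locus, 2f for both loci

def marker_positions(chrom_length_cm, spacing_cm):
    """Marker positions (cM) on one chromosome for the 5, 10, or 20 cM map."""
    if spacing_cm not in (5, 10, 20):
        raise ValueError("only the 5, 10, and 20 cM maps are considered")
    return list(range(0, int(chrom_length_cm) + 1, spacing_cm))

print(penetrance(True, True, 0.25, 0.1), marker_positions(40, 20))
```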


4.2 Two-stage Genome Search


For simplicity, we use a slightly different two-stage approach for searching complex
disease genes. In the first stage, we choose a threshold for the statistics instead of
choosing a number of loci. In the second stage, we choose the loci on different
chromosomes where the statistics have the highest overall value. Details are as follows:

Following Chapter 3, suppose there are n ASPs and there are enough resources to
type N markers for ASPs. Again, there are three numbers that must be determined
in a two-stage design: n1 and m, the number of ASPs and the number of markers to
be used in Stage I, and k, the threshold for markers to be studied in Stage II. Define

\[
{}_1S_i = \sum_{j=1}^{n_1} X_{ij} \tag{4.1}
\]








in the same way as in Chapter 3, where i is the index of loci, j is the index of ASPs,
and Xij is defined by assumption 6 in Section 3.1. If, in Stage I, marker i has a 1S_i
value higher than the threshold k, we study this marker again in Stage II. Let the
number of markers that pass Stage I be R.
In Stage II, R markers on N2 ASPs are to be typed, where N2 is the largest number
subject to the resource constraint and R. Since R is a random variable, N2 is also a
random variable. Thus, N2 is the largest x such that m n1 + Rx ≤ N. We define

\[
{}_2S_i = \sum_{j=n_1+1}^{n_1+N_2} X_{ij} \tag{4.2}
\]

for Stage II. Then, markers that meet the following criteria are declared to have a
gene nearby:

A marker with the uniquely highest 2S_i is claimed to be the marker nearest to
the disease gene.

A marker group (see page 42) with the uniquely highest score is claimed to have
the gene lying between its markers.

If two markers or marker groups have the same highest score but are on different
chromosomes, then each is declared to have a gene nearby in the corresponding way
above.

If none of the above applies, then the gene location is considered undetermined.

Once the location(s) has (have) been chosen, the next step, as in Chapter 3, is to check whether we can claim linkage. Let $t$ be the $100(1-\alpha)$ percentile of the maximum of $r$ Binomial($n_2$, 0.25) random variables. If ${}_2S_i$ of the chosen location(s) is (are) greater than $t$, then we claim there is linkage at that location(s). Since we imposed some restrictions on declaring a marker to have a gene nearby, the actual type I error will be less than $\alpha$. The 95th percentiles for the unique maximum and for the marker group were given in Tables 3.5 and 3.6, respectively.
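The threshold $t$ can be approximated numerically. The sketch below is only an illustration: it assumes the $r$ binomial variables are independent, takes the percentile to be the smallest $t$ whose cumulative probability reaches $1-\alpha$, and ignores the uniqueness restrictions behind Tables 3.5 and 3.6. The function names are hypothetical.

```python
from math import comb


def binom_cdf(x, n, p=0.25):
    """P(S <= x) for S ~ Binomial(n, p)."""
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) for s in range(x + 1))


def max_binomial_percentile(r, n2, alpha=0.05, p=0.25):
    """Smallest t with P(max of r independent Binomial(n2, p) <= t) >= 1 - alpha."""
    for t in range(n2 + 1):
        # Under independence, P(max <= t) is the single-variable CDF to the power r.
        if binom_cdf(t, n2, p) ** r >= 1 - alpha:
            return t
    return n2


for r in (1, 5, 20):
    print(r, max_binomial_percentile(r, n2=100))
```

The threshold can only grow with $r$, which is why passing fewer markers to Stage II makes linkage easier to claim for a given ${}_2S_i$.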









4.3 Probability of Allocating the Correct Marker for a Complex Disease


In this section we discuss an analytical approach and some simulation results. Section 4.3.1 prepares for that discussion.


4.3.1 Possible Parental Genotypes and Trait IBD distribution


For a disease gene locus, say locus $i$, let $d$ and $D$ denote the disease allele and the normal allele, respectively. Let the population frequency of $d$ be $p$, and let PG denote the parental genotype. Let

$$
E_i = (e_{i1}, e_{i2}) =
\begin{cases}
(1,1), & \text{if both sibs received two recessive genes at locus } i \\
(1,0), & \text{if sib 1 received two recessive genes and sib 2 did not at locus } i \\
(0,1), & \text{if sib 2 received two recessive genes and sib 1 did not at locus } i \\
(0,0), & \text{if neither sib received two recessive genes at locus } i.
\end{cases}
$$

The distribution of the trait IBD, $I_t$, conditional on the parental genotype and $E$ is then given in Table 4.1. The numbers in the table are easy to verify. For example, for parental genotype 2, if the order of the chromosomes is specified, say 1, 2, 3, and 4, where 1 and 2 belong to the father and 3 and 4 belong to the mother, then the probability of the parental genotypes being $dd$ and $Dd$, with $D$ on chromosome 4, is $p^3(1-p)$. Permuting the order of the chromosomes gives 4 different arrangements; therefore, the probability of the parents having genotypes $Dd$ and $dd$ is $4p^3(1-p)$ in a random mating population. When the parental genotypes are $Dd$ and $dd$, the possible sib genotypes are $Dd$ and $dd$ with equal frequency; therefore, $P(E_i=(1,1) \mid PG=2) = P(E_i=(1,0) \mid PG=2) = P(E_i=(0,1) \mid PG=2) = P(E_i=(0,0) \mid PG=2) = \frac{1}{4}$. To illustrate the conditional distribution of the trait IBD score, label the parental alleles with subscripts as $d_1d_2$ and $Dd_3$. When the parental genotypes are $dd$ and $Dd$ and $E_i = (0,0)$, each sib's genotype can only be $Dd_1$ or $Dd_2$, with equal chance; therefore, $P(I_t=2 \mid PG=2, E_i=(0,0)) = P(I_t=1 \mid PG=2, E_i=(0,0)) = \frac{1}{2}$.
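The two statements about parental genotype 2 can be checked by brute force. The sketch below enumerates the 16 equally likely transmissions from a $d_1d_2$ parent and a $Dd_3$ parent to two sibs; the allele labels follow the text, while the function names and the allele-level IBD count are assumptions of this example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

FATHER = ("d1", "d2")  # the dd parent, alleles labelled as in the text
MOTHER = ("D", "d3")   # the Dd parent


def is_dd(genotype):
    """True if both alleles are the recessive allele d."""
    return all(allele.startswith("d") for allele in genotype)


def enumerate_pg2():
    """Joint distribution of (E, allele-level IBD) over the 16 transmissions."""
    joint = defaultdict(Fraction)
    for f1, m1, f2, m2 in product(range(2), repeat=4):
        sib1 = (FATHER[f1], MOTHER[m1])
        sib2 = (FATHER[f2], MOTHER[m2])
        e = (int(is_dd(sib1)), int(is_dd(sib2)))
        ibd = int(f1 == f2) + int(m1 == m2)  # shared paternal + shared maternal
        joint[(e, ibd)] += Fraction(1, 16)
    return joint


def prob_e(joint, e):
    """P(E = e | PG = 2)."""
    return sum(pr for (e_, _), pr in joint.items() if e_ == e)


def ibd_given_e(joint, e):
    """Allele-level IBD distribution given E = e."""
    total = prob_e(joint, e)
    return {ibd: pr / total for (e_, ibd), pr in joint.items() if e_ == e}


joint = enumerate_pg2()
print({e: prob_e(joint, e) for e in [(1, 1), (1, 0), (0, 1), (0, 0)]})
print(ibd_given_e(joint, (0, 0)))
```

The count treats $d_1$, $d_2$ and $d_3$ as distinguishable. For $E = (1,1)$, Table 4.6 lists only the trait IBD score 2, so that case is not meant to be read off this allele-level count.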


4.3.2 Analytic Approach


Assume there are 2 unlinked genes, namely $G_1$ and $G_2$. Without loss of generality, let $X_{1j}$ and $X_{2j}$ be the statistics of the markers next to $G_1$, and $X_{3j}$ and $X_{4j}$ be the statistics of the markers next to $G_2$, for ASP $j$; let ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ be the corresponding sums of the $X$s in Stage I, as defined at the beginning of this section. Also, let $I_1$ and $I_2$ be the IBD scores of $G_1$ and $G_2$ for the affected sib pair. Let the penetrance for carrying two recessive genes at one locus be $\lambda$, at two loci be $2\lambda$, and at no locus be $k_0\lambda$, where $k_0$ is the relative risk of being affected for an individual carrying no gene compared with an individual carrying two recessive genes at one locus. The value of $\lambda$ is unknown but is assumed to be small; it cancels out in the formulas. Then

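\begin{align*}
& P(\text{a gene or both genes are found}) \\
&= P(G_1 \text{ is found and } G_2 \text{ is not}) \\
&\quad + P(G_2 \text{ is found and } G_1 \text{ is not}) \\
&\quad + P(G_1 \text{ and } G_2 \text{ are found}) \\
&= \{P(G_1 \text{ passes Stage I, } G_2 \text{ does not pass Stage I, } G_1 \text{ is found}) \tag{4.3} \\
&\quad + P(G_1 \text{ and } G_2 \text{ pass Stage I, } G_1 \text{ is found})\} \tag{4.4} \\
&\quad + \{P(G_2 \text{ passes Stage I, } G_1 \text{ does not pass Stage I, } G_2 \text{ is found}) \tag{4.5} \\
&\quad + P(G_1 \text{ and } G_2 \text{ pass Stage I, } G_2 \text{ is found})\} \tag{4.6} \\
&\quad + P(G_1 \text{ and } G_2 \text{ pass Stage I, } G_1 \text{ and } G_2 \text{ are found}) \tag{4.7}
\end{align*}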

For a given threshold $k$ and the event {$G_1$ is found but $G_2$ is not}, the possible relationships between $k$ and ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ in Stage I, and the corresponding relationships among ${}_2S_1$, ${}_2S_2$, ${}_2S_3$, and ${}_2S_4$ in Stage II, are given in Table 4.2. For example, for event 6, given that $G_1$ is found but $G_2$ is not, in Stage I ${}_1S_1 > k$, ${}_1S_2 > k$, ${}_1S_3 < k$, and ${}_1S_4 > k$; then in Stage II, the maximum of ${}_2S_1$ and ${}_2S_2$ must be the largest value and also larger than ${}_2S_4$.


Table 4.1. Trait IBD distribution conditional on parental genotype and E vector.

PG           P(PG)          E      P(E | PG)   P(It=2 | PG,E)   P(It=1 | PG,E)   P(It=0 | PG,E)
1. dd × dd   p^4            (1,1)  1           1                0                0
2. Dd × dd   4p^3(1-p)      (1,1)  1/4         1                0                0
                            (1,0)  1/4         0                1/2              1/2
                            (0,1)  1/4         0                1/2              1/2
                            (0,0)  1/4         1/2              1/2              0
3. Dd × Dd   4p^2(1-p)^2    (1,1)  1/16        1                0                0
                            (1,0)  3/16        0                2/3              1/3
                            (0,1)  3/16        0                2/3              1/3
                            (0,0)  9/16        3/9              4/9              2/9
4. DD × dd   2p^2(1-p)^2    (0,0)  1           1/4              1/2              1/4
5. DD × Dd   4p(1-p)^3      (0,0)  1           1/4              1/2              1/4
6. DD × DD   (1-p)^4        (0,0)  1           1/4              1/2              1/4

E vectors with P(E | PG) = 0 are omitted.


Table 4.2. Exclusive events for the case "G1 is found but G2 is not."

        Stage I                                        Stage II
event   ${}_1S_1$  ${}_1S_2$  ${}_1S_3$  ${}_1S_4$     the largest statistics and relationship
1       >          >          <          <             max(${}_2S_1$, ${}_2S_2$)
2       >          <          <          <             ${}_2S_1$
3       <          >          <          <             ${}_2S_2$
4       >          >          >          >             max(${}_2S_1$, ${}_2S_2$), and > max(${}_2S_3$, ${}_2S_4$)
5       >          >          >          <             max(${}_2S_1$, ${}_2S_2$), and > ${}_2S_3$
6       >          >          <          >             max(${}_2S_1$, ${}_2S_2$), and > ${}_2S_4$
7       >          <          >          >             ${}_2S_1$, and > max(${}_2S_3$, ${}_2S_4$)
8       <          >          >          >             ${}_2S_2$, and > max(${}_2S_3$, ${}_2S_4$)
9       >          <          >          <             ${}_2S_1$, and > ${}_2S_3$
10      >          <          <          >             ${}_2S_1$, and > ${}_2S_4$
11      <          >          >          <             ${}_2S_2$, and > ${}_2S_3$
12      <          >          <          >             ${}_2S_2$, and > ${}_2S_4$
For a given threshold $k$ and the event {$G_2$ is found but $G_1$ is not}, the possible relationships between $k$ and ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ in Stage I, and the corresponding relationships in Stage II, are given in Table 4.3.
For a given threshold $k$ and the event {$G_1$ and $G_2$ are found}, the possible relationships between $k$ and ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ in Stage I, and the corresponding relationships in Stage II, are given in Table 4.4.
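The Stage I patterns in Tables 4.2 through 4.4 can be counted mechanically, on the reading that a gene can be found only if at least one of its two flanking markers passes Stage I. The sketch below enumerates the pass/fail patterns of the four flanking markers under that reading; the function name and the boolean encoding are assumptions of this example.

```python
from itertools import product


def enumerate_events():
    """Stage I pass/fail patterns for (1S1, 1S2, 1S3, 1S4), grouped by case."""
    g1_found, g2_found, both_found = [], [], []
    for pattern in product((True, False), repeat=4):
        g1_candidate = pattern[0] or pattern[1]  # a marker next to G1 passed
        g2_candidate = pattern[2] or pattern[3]  # a marker next to G2 passed
        if g1_candidate:
            g1_found.append(pattern)    # patterns for "G1 is found but G2 is not"
        if g2_candidate:
            g2_found.append(pattern)    # patterns for "G2 is found but G1 is not"
        if g1_candidate and g2_candidate:
            both_found.append(pattern)  # patterns for "G1 and G2 are found"
    return g1_found, g2_found, both_found


def show(pattern):
    """Render a pattern with the > and < symbols used in the tables."""
    return " ".join(">" if passed else "<" for passed in pattern)


g1_found, g2_found, both_found = enumerate_events()
print(len(g1_found), len(g2_found), len(both_found))
for pattern in g1_found:
    print(show(pattern))
```

The three counts add up to the number of events indexed in the three tables, which is a quick consistency check on the event list.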
For a given threshold $k$, the sum of probabilities (4.3) through (4.7) is

$$ \sum_{l=0}^{m-4} \sum_{i=1}^{33} \sum_{h=1}^{N_2} P(\text{event } i, \text{ and } l \ {}_1S_j\text{'s pass Stage I, the largest statistic} = h \text{ in Stage II}), $$









Table 4.3. Exclusive events for the case "G2 is found but G1 is not."

        Stage I                                        Stage II
event   ${}_1S_1$  ${}_1S_2$  ${}_1S_3$  ${}_1S_4$     the largest statistics and relationship
13      <          <          >          >             max(${}_2S_3$, ${}_2S_4$)
14      <          <          >          <             ${}_2S_3$
15      <          <          <          >             ${}_2S_4$
16      >          >          >          >             max(${}_2S_3$, ${}_2S_4$), and > max(${}_2S_1$, ${}_2S_2$)
17      >          <          >          >             max(${}_2S_3$, ${}_2S_4$), and > ${}_2S_1$
18      <          >          >          >             max(${}_2S_3$, ${}_2S_4$), and > ${}_2S_2$
19      >          >          >          <             ${}_2S_3$, and > max(${}_2S_1$, ${}_2S_2$)
20      >          >          <          >             ${}_2S_4$, and > max(${}_2S_1$, ${}_2S_2$)
21      >          <          >          <             ${}_2S_3$, and > ${}_2S_1$
22      >          <          <          >             ${}_2S_4$, and > ${}_2S_1$
23      <          >          >          <             ${}_2S_3$, and > ${}_2S_2$
24      <          >          <          >             ${}_2S_4$, and > ${}_2S_2$


Table 4.4. Exclusive events for the case "G1 and G2 are found."

        Stage I                                        Stage II
event   ${}_1S_1$  ${}_1S_2$  ${}_1S_3$  ${}_1S_4$     the largest statistics and relationship
25      >          >          >          >             max(${}_2S_3$, ${}_2S_4$) = max(${}_2S_1$, ${}_2S_2$)
26      >          <          >          >             max(${}_2S_3$, ${}_2S_4$) = ${}_2S_1$
27      <          >          >          >             max(${}_2S_3$, ${}_2S_4$) = ${}_2S_2$
28      >          >          >          <             max(${}_2S_1$, ${}_2S_2$) = ${}_2S_3$
29      >          >          <          >             max(${}_2S_1$, ${}_2S_2$) = ${}_2S_4$
30      >          <          >          <             ${}_2S_3$ = ${}_2S_1$
31      >          <          <          >             ${}_2S_4$ = ${}_2S_1$
32      <          >          >          <             ${}_2S_3$ = ${}_2S_2$
33      <          >          <          >             ${}_2S_4$ = ${}_2S_2$








where $N_2 = \left[ \dfrac{N - m n_1}{R} \right]_{\mathrm{Int}}$.

For example, for event 25,

\begin{align*}
& P(\text{event 25, and } l \ {}_1S_j\text{'s pass Stage I}) \\
&= P({}_1S_1 > k,\ {}_1S_2 > k,\ {}_1S_3 > k,\ {}_1S_4 > k,\ l \ {}_1S_j > k, \\
&\qquad \max({}_2S_3, {}_2S_4) = \max({}_2S_1, {}_2S_2) = h, \text{ and all other } l \ {}_2S_j\text{'s} < h) \\
&\quad \text{(with the assumption of independence)} \\
&= P({}_1S_1 > k,\ {}_1S_2 > k,\ {}_1S_3 > k,\ {}_1S_4 > k)\, P(l \ {}_1S_j \text{ pass Stage I}) \\
&\qquad \times P(\max({}_2S_3, {}_2S_4) = \max({}_2S_1, {}_2S_2) = h)\, P(\text{all other } l \ {}_2S_j\text{'s} < h),
\end{align*}

where $j \ne 1, 2, 3, 4$. The distribution of ${}_2S_j$ is Binomial($N_2$, 0.25). Therefore, to obtain probabilities (4.3) through (4.7), we need the joint distribution of ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ conditional on both sibs being affected. To calculate this joint distribution, we need the joint distribution of $X_{1j}$, $X_{2j}$, $X_{3j}$, and $X_{4j}$ conditional on both sibs being affected, which is

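\begin{align*}
& P_{xyuv} \tag{4.8} \\
&= P(X_{1j} = x, X_{2j} = y, X_{3j} = u, X_{4j} = v \mid \text{both affected}) \tag{4.9} \\
&= \sum_{i_1=0}^{2} \sum_{i_2=0}^{2} P(X_{1j} = x, X_{2j} = y, X_{3j} = u, X_{4j} = v, I_1 = i_1, I_2 = i_2 \mid \text{both affected}) \\
&= \sum_{i_1=0}^{2} \sum_{i_2=0}^{2} \{ P(X_{1j} = x, X_{2j} = y, X_{3j} = u, X_{4j} = v \mid I_1 = i_1, I_2 = i_2, \text{both affected}) \\
&\qquad \times P(I_1 = i_1, I_2 = i_2 \mid \text{both affected}) \} \tag{4.10} \\
&= \sum_{i_1=0}^{2} \sum_{i_2=0}^{2} P(X_{1j} = x, X_{2j} = y \mid I_1 = i_1)\, P(X_{3j} = u, X_{4j} = v \mid I_2 = i_2) \\
&\qquad \times P(I_1 = i_1, I_2 = i_2 \mid \text{both affected}) \tag{4.11}
\end{align*}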
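The double sum in (4.11) is straightforward to evaluate once its ingredients are available. The sketch below is illustrative only: it assumes that each marker statistic is a 0/1 variable with $P(X = 1 \mid I = 2) = \Psi^2$, $P(X = 1 \mid I = 1) = \Psi(1-\Psi)$ and $P(X = 1 \mid I = 0) = (1-\Psi)^2$, where $\Psi = \theta^2 + (1-\theta)^2$, that the two flanking markers are independent given the gene IBD score (the product form used in Table 4.5), and that the joint IBD distribution `q` is supplied by the caller. All function names and the numerical values are hypothetical.

```python
from itertools import product


def psi(theta):
    """Psi = theta^2 + (1 - theta)^2 for recombination fraction theta."""
    return theta ** 2 + (1 - theta) ** 2


def p_marker_one(i, psi_value):
    """Assumed P(X = 1 | gene IBD = i) for one marker."""
    return {2: psi_value ** 2,
            1: psi_value * (1 - psi_value),
            0: (1 - psi_value) ** 2}[i]


def p_pair(x, y, i, psi_a, psi_b):
    """P(X_a = x, X_b = y | I = i), the two markers being independent given I."""
    pa, pb = p_marker_one(i, psi_a), p_marker_one(i, psi_b)
    return (pa if x else 1 - pa) * (pb if y else 1 - pb)


def p_xyuv(x, y, u, v, q, thetas):
    """Right-hand side of (4.11); q[(i1, i2)] = P(I1=i1, I2=i2 | both affected)."""
    psi1, psi2, psi3, psi4 = (psi(t) for t in thetas)
    return sum(p_pair(x, y, i1, psi1, psi2) * p_pair(u, v, i2, psi3, psi4) * q[(i1, i2)]
               for i1, i2 in product(range(3), repeat=2))


# Placeholder q: independent loci with the no-linkage IBD distribution (1/4, 1/2, 1/4).
null = {0: 0.25, 1: 0.5, 2: 0.25}
q = {(i1, i2): null[i1] * null[i2] for i1, i2 in product(range(3), repeat=2)}
thetas = (0.05, 0.10, 0.05, 0.10)
table = {xyuv: p_xyuv(*xyuv, q, thetas) for xyuv in product((0, 1), repeat=4)}
print(sum(table.values()))
```

With the placeholder `q`, each marker has marginal probability 0.25 of scoring 1 whatever the recombination fractions, which matches the Binomial($N_2$, 0.25) distribution used for unlinked markers. Replacing `q` by the distribution conditional on both sibs being affected is what shifts these probabilities.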








The probability $P(X_{1j} = x, X_{2j} = y \mid I_1 = i)$ in the general case, i.e., when the gene lies anywhere between the 2 markers, can be derived from Table 2.3 and is given in Table 4.5, where $\Psi_j = \theta_j^2 + (1-\theta_j)^2$, $j = 1, 2$, and $\theta_1$ and $\theta_2$ are the recombination fractions between the gene and marker 1 and between the gene and marker 2, respectively. The probability $P(X_{3j} = x, X_{4j} = y \mid I_2 = i)$ is the same except that it uses different $\theta$s.

Table 4.5. Conditional distribution of $X_{1j}$ and $X_{2j}$ given $I_1$.
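i   x   y   $P(X_{1j} = x, X_{2j} = y \mid I_1 = i)$
2   0   0   $(1-\Psi_1^2)(1-\Psi_2^2)$
2   1   0   $\Psi_1^2(1-\Psi_2^2)$
2   0   1   $(1-\Psi_1^2)\Psi_2^2$
2   1   1   $\Psi_1^2\Psi_2^2$
1   0   0   $(1-\Psi_1+\Psi_1^2)(1-\Psi_2+\Psi_2^2)$
1   1   0   $\Psi_1(1-\Psi_1)(1-\Psi_2+\Psi_2^2)$
1   0   1   $(1-\Psi_1+\Psi_1^2)\Psi_2(1-\Psi_2)$
1   1   1   $\Psi_1(1-\Psi_1)\Psi_2(1-\Psi_2)$
0   0   0   $(2\Psi_1-\Psi_1^2)(2\Psi_2-\Psi_2^2)$
0   1   0   $(1-\Psi_1)^2(2\Psi_2-\Psi_2^2)$
0   0   1   $(2\Psi_1-\Psi_1^2)(1-\Psi_2)^2$
0   1   1   $(1-\Psi_1)^2(1-\Psi_2)^2$

The probability $P(I_1 = i_1, I_2 = i_2 \mid \text{both affected})$ is equal to

\begin{align*}
& \frac{P(I_1 = i_1, I_2 = i_2, \text{both affected})}{P(\text{both affected})} \\
&= \frac{\sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} P(I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2, \text{both affected})}{P(\text{both affected})} \\
&= \frac{\sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} P(\text{both affected} \mid I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2)\, P(I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2)}{P(\text{both affected})} \\
&= \frac{\sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} P(\text{both affected} \mid E_1, E_2)\, P(I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2)}{P(\text{both affected})} \\
&= \frac{\sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} P(\text{both affected} \mid E_1, E_2)\, P(I_1 = i_1, E_1, PG_1)\, P(I_2 = i_2, E_2, PG_2)}{P(\text{both affected})} \\
&= \frac{1}{P(\text{both affected})} \sum_{E_1}\sum_{E_2}\sum_{PG_1}\sum_{PG_2} \{ P(\text{both affected} \mid E_1, E_2)\, P(I_1 = i_1 \mid E_1, PG_1)\, P(E_1 \mid PG_1)\, P(PG_1) \\
&\qquad \times P(I_2 = i_2 \mid E_2, PG_2)\, P(E_2 \mid PG_2)\, P(PG_2) \}.
\end{align*}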

Because both sibs are affected, when $k_0 = 0$ the following restriction applies in summing over $E_1$ and $E_2$: if $E_1 \ne (1,1)$ and $E_2 \ne (1,1)$, then $(E_1, E_2)$ must be $((1,0),(0,1))$ or $((0,1),(1,0))$. $P(\text{both affected} \mid E_1, E_2)$ is given in Table 4.6. The probability $P(\text{both affected}) = \sum_{i=0}^{2}\sum_{j=0}^{2} P(I_1 = i, I_2 = j, \text{both affected})$, and the probabilities $P(I_1 = i, I_2 = j, \text{both affected})$ are given as follows. The remaining probabilities were given in Table 4.1. Therefore, $P(I_1 = i_1, I_2 = i_2 \mid \text{both affected})$ can be found.
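The summation over $E_1$, $E_2$, $PG_1$ and $PG_2$ can also be carried out numerically. The sketch below is a minimal illustration under stated assumptions: the conditional probabilities are entered as read from Table 4.1 (zero entries omitted), the penetrance model is the one behind Table 4.6, both loci are taken to have the same allele frequency $p$, and $\lambda$ is set to 1 because it cancels. The function names and the example values of $p$ and $k_0$ are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

HALF, QUARTER = F(1, 2), F(1, 4)
NO_DD = {(0, 0): (F(1), {2: QUARTER, 1: HALF, 0: QUARTER})}
# TABLE_4_1[PG][E] = (P(E | PG), {i: P(I_t = i | PG, E)}), zero entries omitted.
TABLE_4_1 = {
    1: {(1, 1): (F(1), {2: F(1)})},
    2: {(1, 1): (QUARTER, {2: F(1)}),
        (1, 0): (QUARTER, {1: HALF, 0: HALF}),
        (0, 1): (QUARTER, {1: HALF, 0: HALF}),
        (0, 0): (QUARTER, {2: HALF, 1: HALF})},
    3: {(1, 1): (F(1, 16), {2: F(1)}),
        (1, 0): (F(3, 16), {1: F(2, 3), 0: F(1, 3)}),
        (0, 1): (F(3, 16), {1: F(2, 3), 0: F(1, 3)}),
        (0, 0): (F(9, 16), {2: F(3, 9), 1: F(4, 9), 0: F(2, 9)})},
    4: NO_DD, 5: NO_DD, 6: NO_DD,
}


def pg_probs(p):
    """P(PG) for parental genotypes 1..6 under random mating."""
    q = 1 - p
    return {1: p**4, 2: 4 * p**3 * q, 3: 4 * p**2 * q**2,
            4: 2 * p**2 * q**2, 5: 4 * p * q**3, 6: q**4}


def locus_joint(p):
    """P(I = i, E) at one locus, summed over parental genotypes."""
    out = {}
    for pg, p_pg in pg_probs(p).items():
        for e, (p_e, ibd) in TABLE_4_1[pg].items():
            for i, p_i in ibd.items():
                out[(i, e)] = out.get((i, e), 0) + p_i * p_e * p_pg
    return out


def p_both_affected(e1, e2, k0, lam=1):
    """P(BA | E1, E2): penetrance lam, 2*lam or k0*lam per sib."""
    def penetrance(n_loci):
        return lam * (n_loci if n_loci else k0)
    return penetrance(e1[0] + e2[0]) * penetrance(e1[1] + e2[1])


def joint_ibd_given_affected(p, k0):
    """P(I1 = i1, I2 = i2 | both affected); lam cancels, so lam = 1 is used."""
    locus = locus_joint(p)  # same allele frequency assumed at both loci
    num = {}
    for ((i1, e1), a), ((i2, e2), b) in product(locus.items(), repeat=2):
        num[(i1, i2)] = num.get((i1, i2), 0) + p_both_affected(e1, e2, k0) * a * b
    total = sum(num.values())
    return {key: value / total for key, value in num.items()}


dist = joint_ibd_given_affected(F(1, 10), F(1, 5))
for key in sorted(dist):
    print(key, float(dist[key]))
```

Exact fractions are used so that the result can be compared term by term with closed-form expressions in $p_1$, $p_2$, $p_3$ and $p_{456}$; setting $k_0 = 0$ automatically drops the combinations of $E_1$ and $E_2$ excluded by the restriction above.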
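Table 4.6. P(both affected | E1, E2) and possible trait IBD given E1 and E2.

E1      E2      P(BA | E1, E2)       possible trait IBD score, (I1, I2), given E1 and E2
(1,1)   (0,0)   $\lambda^2$          (2,0), (2,1), (2,2)
(1,1)   (0,1)   $2\lambda^2$         (2,0), (2,1)
(1,1)   (1,0)   $2\lambda^2$         (2,0), (2,1)
(1,1)   (1,1)   $4\lambda^2$         (2,2)
(1,0)   (0,0)   $k_0\lambda^2$       (1,0), (1,1), (1,2), (0,0), (0,1), (0,2)
(1,0)   (0,1)   $\lambda^2$          (1,0), (1,1), (0,0), (0,1)
(1,0)   (1,0)   $2k_0\lambda^2$      (1,0), (1,1), (0,0), (0,1)
(1,0)   (1,1)   $2\lambda^2$         (1,2), (0,2)
(0,1)   (0,0)   $k_0\lambda^2$       (1,0), (1,1), (1,2), (0,0), (0,1), (0,2)
(0,1)   (0,1)   $2k_0\lambda^2$      (1,0), (1,1), (0,0), (0,1)
(0,1)   (1,0)   $\lambda^2$          (1,0), (1,1), (0,0), (0,1)
(0,1)   (1,1)   $2\lambda^2$         (1,2), (0,2)
(0,0)   (0,0)   $k_0^2\lambda^2$     (2,0), (2,1), (2,2), (1,0), (1,1), (1,2), (0,0), (0,1), (0,2)
(0,0)   (0,1)   $k_0\lambda^2$       (2,0), (2,1), (1,0), (1,1), (0,0), (0,1)
(0,0)   (1,0)   $k_0\lambda^2$       (2,0), (2,1), (1,0), (1,1), (0,0), (0,1)
(0,0)   (1,1)   $\lambda^2$          (2,2), (1,2), (0,2)


Let $p_1 = P(PG = 1)$, $p_2 = P(PG = 2)$, $p_3 = P(PG = 3)$, and $p_{456} = P(PG = 4, 5, \text{ or } 6)$, and let BA denote the event that both sibs are affected. Then,

\begin{align*}
& P(I_1 = 0, I_2 = 0, BA) \\
&= 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)\left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right) \\
&\quad + 2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)^2 \\
&\quad + 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)^2 \\
&\quad + k_0^2\lambda^2 \left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right)^2 \\
&= \tfrac{1}{128}\lambda^2 (2p_2 + p_3)^2 \\
&\quad + \tfrac{1}{64} k_0\lambda^2 (2p_2 + p_3)(2p_2 + 3p_3 + 4p_{456}) \\
&\quad + \tfrac{1}{64} k_0^2\lambda^2 (p_3 + 2p_{456})^2
\end{align*}

\begin{align*}
& P(I_1 = 1, I_2 = 0, BA) = P(I_1 = 0, I_2 = 1, BA) \\
&= 2k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)\left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right) \\
&\quad + 2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)\left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right) \\
&\quad + 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)\left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right) \\
&\quad + 2k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right)\left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right) \\
&\quad + k_0^2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right)\left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right) \\
&= \tfrac{1}{64}\lambda^2 (p_2 + p_3)(2p_2 + p_3) \\
&\quad + \tfrac{1}{16} k_0\lambda^2 (p_2 + p_3)(p_2 + p_3 + p_{456}) \\
&\quad + \tfrac{1}{64} k_0\lambda^2 (p_2 + 2p_3 + 4p_{456})(2p_2 + p_3) \\
&\quad + \tfrac{1}{64} k_0^2\lambda^2 (p_2 + 2p_3 + 4p_{456})(p_3 + 2p_{456})
\end{align*}

\begin{align*}
& P(I_1 = 2, I_2 = 0, BA) = P(I_1 = 0, I_2 = 2, BA) \\
&= \lambda^2 \left(p_1 + \tfrac14\,p_2 + \tfrac{1}{16}\,p_3\right)\left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right) \\
&\quad + 4\lambda^2 \left(p_1 + \tfrac14\,p_2 + \tfrac{1}{16}\,p_3\right)\left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right) \\
&\quad + k_0^2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac39\,p_3 + \tfrac14\,p_{456}\right)\left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right) \\
&\quad + 2k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac39\,p_3 + \tfrac14\,p_{456}\right)\left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right) \\
&= \tfrac{1}{128}\lambda^2 (16p_1 + 4p_2 + p_3)(4p_2 + 3p_3 + 2p_{456}) \\
&\quad + \tfrac{1}{128} k_0\lambda^2 (2p_2 + 3p_3 + 4p_{456})(2p_2 + p_3) \\
&\quad + \tfrac{1}{128} k_0^2\lambda^2 (2p_2 + 3p_3 + 4p_{456})(p_3 + 2p_{456})
\end{align*}

\begin{align*}
& P(I_1 = 1, I_2 = 1, BA) \\
&= 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)\left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right) \\
&\quad + 2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)^2 \\
&\quad + 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)^2 \\
&\quad + k_0^2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right)^2 \\
&= \tfrac{1}{32}\lambda^2 (p_2 + p_3)^2 \\
&\quad + \tfrac{1}{16} k_0\lambda^2 (p_2 + p_3)(2p_2 + 3p_3 + 4p_{456}) \\
&\quad + \tfrac{1}{64} k_0^2\lambda^2 (p_2 + 2p_3 + 4p_{456})^2
\end{align*}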


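$$ P(I_1 = 2, I_2 = 1, BA) = P(I_1 = 1, I_2 = 2, BA) $$

$$ = \lambda^2 \left(p_1 + \tfrac14\,p_2 + \tfrac{1}{16}\,p_3\right)\left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right) $$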