Statistical Designs and Algorithms for Mapping Cancer Genes

Permanent Link: http://ufdc.ufl.edu/UFE0024903/00001

Material Information

Title: Statistical Designs and Algorithms for Mapping Cancer Genes
Physical Description: 1 online resource (130 p.)
Language: english
Creator: Li, Yao
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009


Subjects / Keywords: aneuploidy, cancer, em, haplotyping, imprinting, quantitative, transgeneration
Statistics -- Dissertations, Academic -- UF
Genre: Statistics thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation


Abstract: The identification of genes that are directly involved in tumor initiation and maintenance is instrumental for understanding the phenotypic variation of cancer and ultimately designing crucial therapeutic drugs to treat this disease. In recent years, the completed genome sequence of humans and cancers has markedly enhanced cancer gene identification. The overall goal of this dissertation is to develop a warehouse of statistical tools for identifying cancer genes with growingly increasing sequence data. These tools are founded on the latest discoveries for the genetic and developmental roots of cancer formation, including somatic mutations, aneuploid induction, epigenetic modifications, transgenerational imprinting, copy number variants, and host-tumor genetic interactions. New statistical methods and algorithms will be developed to integrate each of these discoveries. By comparing the difference in the DNA structure and sequence between the human and cancer genomes, a disequilibrium model has been formulated to identify and test the genetic mutations or 'drivers' that cause cancer. A quantitative model is derived to unravel the aneuploidy control of cancer and estimate the genetic effects of aneuploid loci on cancer risk. Using a commonly used three-generation design, a two-stage hierarchical model is developed to estimate and test the transgenerational alteration of genetic effects and identify genetic imprinting effects due to different parental origins of the same allele. This hierarchical model allows the characterization of genetic interactions between additive and dominant effects and imprinting effects over generations. Cancer susceptibility may be controlled not only by host genes and mutated genes in cancer cells, but also by the epistatic interactions between genes from the host and cancer genomes. A model was derived to estimate genome-genome interactions of host DNA and cancer DNA. 
Models for cancer gene identifications require the solution of missing data problems given the fact that cancer genes and their incidence in a natural population cannot be observed directly. For this reason, I have built up the models within the mixture model framework. The maximum likelihood approaches, implemented with the EM algorithm, have been derived to provide the estimates of genetic parameters related to mutation rates, chromosome duplication rates, genetic imprinting, genetic interactions, and haplotype frequencies. I have performed various sets of computer simulation to investigate the statistical properties of the new models in terms of power, estimation precision, and false positive rates. A series of practical computational issues, including convergence rates and choices of initial values, are discussed. I have also formulated various testable hypotheses about the frequencies of genetic mutations and the effects of host genes, cancer genes, and their interactions on cancer susceptibility. This dissertation provides a most complete set of statistical models for cancer gene identification thus far in the literature. The biological relevance and statistical sophistication of these models will make them practically useful to unlock the genetic secrets of cancer.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Yao Li.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Wu, Rongling.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0024903:00001

© 2009 Yao Li

To my dearest mother, Shuqin Ma


Today, I can no longer touch the smile on my mother's face if she could read this

dissertation. Nonetheless, I know she would be proud of me, as she had always been,

ever since I was born. She was my first teacher, tireless mentor, and dearest friend. She

endowed me with the strength to endure and to fight for my ideal. Every time I looked

back, I felt so secure since I knew she's always there for me, until the last minute of her

life. Even now, four years after I lost her to cancer, her endless love is still supporting

me through my pursuit of the Ph.D. degree. How I wish that I could hug her once again

and tell her how lucky I am to have her as my mother. I can never pay back the love I

have received from her. Here is one thing I can do for her: I chose this topic for her, and I

promise that I will never give up the efforts to fight cancer, all my life.

I am deeply indebted to my academic advisor and committee chair, Dr. Rongling Wu.

He is the kindest and most caring advisor one can ever wish to have. His enthusiasm and

passion in research has encouraged and inspired me so much, and his painstaking guidance

has led me through the low days. He gave me countless help from the first day I came to UF.

But for his great patience and understanding, I could not have completed this work.

I would like to express my gratitude to the other committee members: Dr. Arthur

Berg, Dr. Myron Chang, and Dr. Robert Dorazio. I want to thank all the other faculty

and staff members of the Department of Statistics, University of Florida; I owe so much to

them for their warmhearted advice and help. I would also like to extend my gratitude

to the Penn State University College of Medicine, where I have received generous support

in the last year of my Ph.D. study. I want to give my thanks to my colleagues there for

their interest and valuable hints.

I am also obliged to my friends at the University of Florida. They have always been

by my side, cheered me on, offered their selfless assistance, and shared my joys and tears.

They made my memories of the time I spent at UF so colorful and indelible.


ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

1.1 Introduction
1.2 Genetic Mapping
1.2.1 From Genetic Mapping to Genetic Haplotyping
1.2.2 From Genetic Action to Genetic Interaction
1.2.3 From Mendelian Inheritance to Genetic Imprinting
1.2.4 From Genomics to Proteomics
1.3 Molecular Mechanisms for Cancer
1.3.1 Genetic Mutations
1.3.2 Clii- Translocations
1.3.3 Aneuploidy
1.3.4 Epigenetic Modifications
1.4 Statistical Issues for Cancer Gene Identification
1.4.1 Normal-Tumor Sampling Design
1.4.2 Family Design
1.4.3 Longitudinal Studies
1.5 Dissertation Goals
1.5.1 Detecting Genetic Mutations
1.5.2 Modeling Genetic Imprinting
1.5.3 Modeling Aneuploidy
1.5.4 Modeling Cancer-Host Interactions

FOR CANCER

2.1 Introduction
2.2 Model
2.2.1 Study Design
2.2.2 Genetic Model
2.2.3 Estimation
2.2.4 Hypothesis Tests
2.3 Computer Simulation
2.4 Discussion

3.1 Introduction
3.2 Model
3.2.1 Study Design
3.2.2 Chromosome Duplication
3.2.3 Quantitative Genetic Parameters
3.2.4 Estimation
3.2.5 Hypothesis Tests
3.3 Application to Simulated Data
3.4 Discussion

4.1 Introduction
4.2 Design
4.2.1 Sampling Strategies
4.2.2 Genetic Models
4.2.3 Estimation
4.2.4 Hypothesis Tests
4.3 Haplotyping Model
4.4 Computer Simulation
4.5 Discussion

SEQUENCE DATA

5.1 Introduction
5.2 Design
5.2.1 Sampling Strategies
5.2.2 Genetic Models
5.2.3 Estimation Procedures
5.2.4 Hypothesis Tests
5.3 Computer Simulation
5.4 Discussion

6 FUTURE DIRECTIONS

REFERENCES

BIOGRAPHICAL SKETCH



2-1 Genotypes at SNPs A and B for normal and cancer cells.

2-2 Genotype frequencies at SNPs A (causal) and B (neutral) in the -!-pini-. " population after the mutation occurs.

2-3 Sex-specific mutations at SNP A for normal and cancer cells before and after the mutation.

2-4 MLEs of the simulated sample when the association between a cancer gene and the SNP is strong.

2-5 MLEs of the simulated sample when the association between a cancer gene and the SNP is moderate.

2-6 MLEs of the simulated sample when the association between a cancer gene and the SNP is weak.

2-7 MLEs of the simulated sample when disequilibrium coefficients occur at high frequency (Case 1).

2-8 MLEs of the simulated sample when disequilibrium coefficients occur at high frequency (Case 2).

3-1 The changes of genotypes and genotype frequencies after chromosomal duplication.

3-2 Genotypic values and proportions of different configurations of a triploid genotype at a duplicated gene.

3-3 The simulation results with different sample size and heritability combinations for both the population parameters (p, u, v) and the genetic parameters.

4-1 A three-generation family design used to study transgenerational inheritance.

4-2 MLEs of genetic effects in two generations with two simulation strategies.

4-3 Estimation results for haplotype analysis.

4-4 A three-generation family design used to study transgenerational inheritance in haplotype analysis.

5-1 Disequilibrium compositions of four-SNP haplotype frequencies derived from the host and cancer genomes.

5-2 Additive, dominance, and epistatic compositions of the genotypic value of a composite diplotype constructed with haplotypes from the host and cancer genomes.

5-3 Observed 81 joint host-cancer SNP genotypes and their frequencies described in terms of their haplotype/diplotype compositions.

5-4 MLEs of population genetic parameters for two host SNPs and two cancer SNPs.

5-5 Log-likelihood values for different combinations of risk haplotypes from the host and cancer genomes.

5-6 MLEs of quantitative genetic parameters of haplotypes for SNPs typed from the host and cancer genomes.

Figure

1-1 Main types of genomic and epigenomic aberration that cause cancers.

1-2 Karyotype analysis for a normal human cell and tumor cell.

3-1 Diagram for chromosome duplication and the resulting changes of genotypes at an aneuploid gene A.

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Yao Li

August 2009

Chair: Rongling Wu
Major: Statistics

The identification of genes that are directly involved in tumor initiation and

maintenance is instrumental for understanding the phenotypic variation of cancer and

ultimately designing crucial therapeutic drugs to treat this disease. In recent years, the

completed genome sequence of humans and cancers has markedly enhanced cancer gene

identification. The overall goal of this dissertation is to develop a warehouse of statistical

tools for identifying cancer genes with growingly increasing sequence data. These tools

are founded on the latest discoveries for the genetic and developmental roots of cancer

formation, including somatic mutations, aneuploid induction, epigenetic modifications,

transgenerational imprinting, copy number variants, and host-tumor genetic interactions.

New statistical methods and algorithms will be developed to integrate each of these

discoveries. By comparing the difference in the DNA structure and sequence between

the human and cancer genomes, a disequilibrium model has been formulated to identify

and test the genetic mutations or "drivers" that cause cancer. A quantitative model is

derived to unravel the aneuploidy control of cancer and estimate the genetic effects of

aneuploid loci on cancer risk. Using a commonly used three-generation design, a two-stage

hierarchical model is developed to estimate and test the transgenerational alteration of

genetic effects and identify genetic imprinting effects due to different parental origins of

the same allele. This hierarchical model allows the characterization of genetic interactions

between additive and dominant effects and imprinting effects over generations. Cancer

susceptibility may be controlled not only by host genes and mutated genes in cancer cells,

but also by the epistatic interactions between genes from the host and cancer genomes. A

model was derived to estimate genome-genome interactions of host DNA and cancer DNA.

Models for cancer gene identifications require the solution of missing data problems

given the fact that cancer genes and their incidence in a natural population cannot

be observed directly. For this reason, I have built up the models within the mixture

model framework. The maximum likelihood approaches, implemented with the EM

algorithm, have been derived to provide the estimates of genetic parameters related to

mutation rates, chromosome duplication rates, genetic imprinting, genetic interactions, and

haplotype frequencies. I have performed various sets of computer simulation to investigate

the statistical properties of the new models in terms of power, estimation precision, and

false positive rates. A series of practical computational issues, including convergence

rates and choices of initial values, are discussed. I have also formulated various testable

hypotheses about the frequencies of genetic mutations and the effects of host genes, cancer

genes, and their interactions on cancer susceptibility. This dissertation provides a most

complete set of statistical models for cancer gene identification thus far in the literature.

The biological relevance and statistical sophistication of these models will make them

practically useful to unlock the genetic secrets of cancer.


1.1 Introduction

Cancer is a leading killer. According to the World Cancer Report published in 2003 by

the World Health Organization, more than 10 million people are diagnosed with cancer and

6 million die of it (or 12% of all deaths) worldwide every year

(http://www.who.int/mediacentre/news/releases/2003/pr27/en/). These numbers are estimated to

increase by 50% by 2020. In the United States, about 1.3 million people were diagnosed

with cancer and more than half a million deaths were caused by cancer, i.e., one in every

four deaths, in 2002.

Despite painstaking cumulative efforts to fight against cancer by researchers

worldwide in the past five decades, we have still not achieved substantial progress in

diagnosis, prevention and treatment of this disease. In the US, the cancer death rate was

0.19% in 1950 and still 0.1- i !I I' in 2001

(http://www.who.int/mediacentre/news/releases/2003/pr27/en/). We are in crisis in fighting against cancer!

It is impossible to achieve a significant increment in the cure rate of cancer without

more profound knowledge of cancer pathophysiology. A firm understanding of

cancer biology is key to discovering new anticancer agents and developing new biomedical

technologies. The emerging convergence of cancer genetics, cancer etiology and drug

development brings new hope for significant breakthroughs in controlling cancer in the near future.


The recent completion of the Human Genome Project and HapMap Project

provides a warehouse of genetic data that allows the categorization, documentation,

and organization of the human genome related to complex traits and biological processes.

Although this will facilitate the detection of cancer genes, there is a pressing

need to integrate genetic mapping strategies with knowledge of how cancer

originates and how it can be selectively advantageous in the host environment. To do so,

we need to formulate cost-efficient statistical designs for data generation under biologically

relevant hypotheses, derive powerful statistical algorithms for data analysis and parameter

estimation, and then use integrative biology to interpret results and produce a next round

of hypotheses.

In this chapter, I will review several recent developments in genetic mapping

strategies that can be useful for cancer gene identification and then outline the latest

discoveries in the genetic mechanisms of cancer initiation. In the following chapters,

we will provide specific statistical designs based on each of these genetic discoveries.

Statistical algorithms will be derived to estimate genetic parameters from the data. We

will show how a number of new hypotheses about cancer genetics can be generated from

the results obtained. Finally, I will describe the goals of this dissertation work in cancer

gene identification.

1.2 Genetic Mapping

For complex traits controlled by multiple genes and sensitive to the environment,

genetic mapping has proven a powerful tool for identifying their genetic underpinnings

(Lander and Botstein, 1989). By mapping quantitative trait loci (QTLs), the genetic

architecture of complex traits can be elucidated. Recently, the availability of the human

genome sequence and progress in sequencing and bioinformatic technologies have enabled

genetic mapping to study the detailed mechanisms for trait variation. I will pinpoint

the aspects of results which genetic mapping can achieve through appropriate statistical


1.2.1 From Genetic Mapping to Genetic Haplotyping

Genetic mapping has the capacity to map individual QTLs responsible for a complex

trait. However, a QTL may contain multiple genes that operate in a collective way. It is

not possible to study the DNA structure, organization and function of a QTL detected

from a mapping approach. A more accurate and useful approach for the characterization

of genetic variants contributing to quantitative variation is to directly analyze DNA

sequences, known as quantitative trait nucleotides (QTN), associated with a particular

disease. If a string of DNA sequence for a QTN is known to increase disease risk, this

risk can be reduced by the alteration of this string of DNA sequence using a specialized

drug. The control of this disease can be made more efficient if all possible DNA sequences

determining its variation are identified in the entire genome. The elucidation of the entire

human genome has been made possible with the construction of the haplotype map, or

"HapMap", using single nucleotide polymorphisms (SNPs).

A statistical model for detecting the effects of QTN, tlii:--. 1I by individual

haplotypes, has been derived (Liu et al., 2004). Haplotype is defined as a linear arrangement

of alleles (i.e., nucleotides) at different SNPs on a single chromosome, or part of a

chromosome. The main idea of our model is to discern the unobservable

diplotypes (i.e., pairs of haplotypes) underlying an observable SNP genotype. For example, a

double heterozygous genotype, AaBb, can be formed by either diplotype AB|ab or Ab|aB,

where | separates the paternally- and maternally-derived haplotypes. These two diplotypes,

although identical in genotype, can differ in their effects on cancer risk. Liu et al.'s

approach allows the separation of diplotypes in the genetic control of cancer.
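
To illustrate the phase ambiguity described above, the following Python sketch enumerates the diplotypes compatible with a two-SNP genotype and estimates haplotype frequencies with a basic EM algorithm. It is a minimal illustration under stated assumptions (two biallelic SNPs, Hardy-Weinberg equilibrium, unrelated individuals) and is not the model of Liu et al. (2004), which additionally links diplotypes to the phenotype. The function names and the genotype coding (number of capital alleles at each SNP) are hypothetical choices made for this example.

```python
from itertools import product

HAPLOTYPES = ("AB", "Ab", "aB", "ab")


def compatible_diplotypes(genotype_a, genotype_b):
    """Return the unordered haplotype pairs consistent with a two-SNP genotype.

    genotype_a, genotype_b: number of capital alleles (0, 1 or 2) at SNP A and SNP B.
    """
    pairs = set()
    for h1, h2 in product(HAPLOTYPES, repeat=2):
        count_a = (h1[0] == "A") + (h2[0] == "A")
        count_b = (h1[1] == "B") + (h2[1] == "B")
        if (count_a, count_b) == (genotype_a, genotype_b):
            pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def em_haplotype_frequencies(genotype_counts, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from unphased two-SNP genotype counts.

    genotype_counts: dict mapping (genotype_a, genotype_b) -> number of individuals.
    Assumes Hardy-Weinberg equilibrium (an assumption of this sketch).
    """
    freqs = {h: 0.25 for h in HAPLOTYPES}  # arbitrary uniform starting values
    total_haplotypes = 2 * sum(genotype_counts.values())
    for _ in range(n_iter):
        expected = {h: 0.0 for h in HAPLOTYPES}
        # E-step: split each genotype class among its compatible diplotypes.
        for (ga, gb), n in genotype_counts.items():
            dips = compatible_diplotypes(ga, gb)
            weights = [(1 if h1 == h2 else 2) * freqs[h1] * freqs[h2]
                       for h1, h2 in dips]
            total = sum(weights)
            if total == 0:  # guard: fall back to an equal split
                weights, total = [1.0] * len(dips), float(len(dips))
            for (h1, h2), w in zip(dips, weights):
                share = n * w / total
                expected[h1] += share
                expected[h2] += share
        # M-step: expected haplotype counts become the new frequencies.
        new_freqs = {h: expected[h] / total_haplotypes for h in HAPLOTYPES}
        change = max(abs(new_freqs[h] - freqs[h]) for h in HAPLOTYPES)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


if __name__ == "__main__":
    # Only the double heterozygote (1, 1) has more than one compatible diplotype.
    print(compatible_diplotypes(1, 1))
    # Made-up counts for illustration only.
    print(em_haplotype_frequencies({(2, 2): 40, (0, 0): 40, (1, 1): 20}))
```

Only the double heterozygote contributes to the E-step in a nontrivial way here; every other genotype has a single compatible diplotype, so its haplotypes are counted directly.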

1.2.2 From Genetic Action to Genetic Interaction

Most of the current approaches for cancer gene identification allow the characterization

of small numbers of genes or proteins involved in cancer susceptibility. These approaches

are often insufficient to study the molecular mechanisms of neoplastic processes. First,

cancer is a complex trait involving a network of genes that interact in a coordinated

manner. These so-called genetic interactions or epistasis occur when the action of one

gene measured through a molecular, cellular or organism phenotype is modified by one or

more other genes. To date, genetic interactions remain largely unknown on a large scale

in human systems. Previous reviews and ,i gave excellent descriptions of the role of

genetic interactions in understanding phenotypic variability (Boone et al., 2007; Hartman

et al., 2001; Lehner, 2007). The study of genetic interactions will not only help better

understand cancer susceptibility and progression, but also, and most importantly, develop

novel anticancer treatments.
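
As a rough numerical illustration of this definition, consider two genes with two genotype classes each. Under purely additive action, the effect of one gene is the same in every background of the other; epistasis appears as a nonzero difference between those effects. The sketch below assumes a hypothetical 2 x 2 table of mean phenotypes; the function name and the numbers are made up for illustration only.

```python
def interaction_contrast(means):
    """Return the departure from additivity in a 2 x 2 table of genotype means.

    means[(i, j)]: mean phenotype for genotype class i (0 or 1) at gene 1
    and genotype class j (0 or 1) at gene 2.
    """
    effect_gene1_in_background0 = means[(1, 0)] - means[(0, 0)]
    effect_gene1_in_background1 = means[(1, 1)] - means[(0, 1)]
    # Zero means gene 1 acts the same way regardless of gene 2 (no epistasis).
    return effect_gene1_in_background1 - effect_gene1_in_background0


# Hypothetical mean phenotypes, for illustration only.
additive = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 1.5, (1, 1): 2.5}
epistatic = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 1.5, (1, 1): 4.0}

if __name__ == "__main__":
    print(interaction_contrast(additive))
    print(interaction_contrast(epistatic))
```

In the second table the effect of gene 1 depends on the genotype at gene 2, which is the situation the text calls a genetic interaction.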

In general, cancer incidence and development are not only affected by the host

genes, but also by genes derived from the cancer cells themselves. These two different

systems of genes operate interactively or epistatically to alter the course of cancer growth.

Following the completion of human genome sequencing, the cancer genome is

now being sequenced (Kaiser, 2005). The data from these two types of genomes will provide

tremendous resources for characterizing the patterns and organization of genome-genome

interactions in cancer susceptibility.

1.2.3 From Mendelian Inheritance to Genetic Imprinting

Each copy of a gene (called an allele) contains an individual and specific instruction for

a particular task to be carried out in the cells of the body. These instructions expressed

in a genetic code (i.e., a sequence of letters in the DNA) guide the cells to produce a

protein. There are two copies of every gene carried on the chromosomes in each cell of

the body: One comes from the mother (the maternal copy of the gene on the chromosome

copy inherited from the mother) and the second comes from the father (the paternal copy

of the gene on the chromosome copy inherited from the father). According to Mendelian

inheritance, the information contained in both the maternal copy of the gene and the

paternal copy is equally used by the cells to make protein products. However, the alleles

of certain genes are expressed only when they pass to a child through the sperm or the

egg. This phenomenon of a gene being "stamped" according to the paternal or maternal

origin of a gene copy is called genetic imprinting. A number of mechanisms have been

thought to "stamp" genes so that the expression of the inherited genetic information is

modified according to whether it is passed to a child through the egg or the sperm. This

modification determines whether the information contained in the gene copy is expressed

or not. The imprinting modification process is reversible in the next generation. The

production of genetic imprinting may be due to epigenetic marks that influence the

expression of genes by switching the genetic information on and off.

1.2.4 From Genomics to Proteomics

With the advent of large-scale functional genomic and proteomic ("omic") data,

additional mechanistic insights into neoplasia can be uncovered. Whole-genome association

studies for cancer risk variants and somatic mutation screening projects will provide a

"parts list" of cancer genes. Transcript analyses have identified expression profiles that

provide accurate prognoses for cancer patients (Fan et al., 2006). Currently, systematic

mapping of protein-protein interactions has received considerable attention with the

launch of the so-called "interactome" mapping projects. This will elucidate the wiring

diagram of protein associations in cells (Rual et al., 2005; Stelzl et al., 2005). These types

of gene and/or protein (gene/protein) functional relationships can be modeled together to

provide better understanding and prediction of the molecular mechanisms of neoplasia (Barabasi

and Oltvai, 2004; Hernandez et al., 2007; Khalil and Hill, 2005; Pujana et al., 2007;

Rhodes and Chinnaiyan, 2005).

1.3 Molecular Mechanisms for Cancer

Recent advances in sequencing array and biomedical technologies have made it

possible to explore the genetic mechanisms of complicated biological traits, like cancer.

An emerging trend to study cancer genetics is to conduct large-scale sequencing of cancer

genomes (Greenman et al., 2007), compare the sequences found in tumor samples and

those of the originating normal tissues, and identify regions of the genome that differ

between two types of tissues (Parmigiani et al., 2009). A detailed survey has identified the

numerous alterations of cancer genomes at the level of the chromosomes, the chromatin

(the fibers that constitute the chromosomes) and the nucleotides. These alterations may

be due to irreversible aberrations in the DNA sequence or structure and in the number

of particular sequences, genes or chromosomes (i.e., the copy number of the DNA), or

to potentially reversible changes, known as epigenetic modifications to the DNA and/or

to the histone proteins closely associated with the DNA in chromatin. By affecting gene

expressions and/or regulatory transcripts, these reversible and irreversible changes result

in the activation or inhibition of various biological events that cause angiogenesis, immune

evasion, metastasis, and altered cell growth, death and metabolism (Chin and Gray,

2008). Figure 1-1 is a diagram that describes an overall picture of the genetic aberrations

related to cancer. A major challenge is how to predict each of these types of aberrations

with rapidly accumulating high-dimensional genomic data. We argue that these can be

predicted and diagnosed by effectively analyzing genomic data collected from well-designed


1.3.1 Genetic Mutations

Somatic mutations arise spontaneously in the genomes of dividing cells (Fig. 1-1A). In

particular, probably all adult organisms are mosaics of somatically mutated cells. Somatic

mutations may occur after exposure to carcinogens through cytosine deamination. A

mutator phenotype caused by mutations in polymerases and/or in mismatch-repair genes

can also lead to spontaneous mutations and chromosomal instability (Baudot et al., 2009;

Loeb et al., 2008). Somatic changes can be identified through re-sequencing the genome

and their involvement in cancer can be detected by linkage or linkage disequilibrium

analysis with SNPs. A large-scale re-sequencing project of the human genome led

to the discovery that about 60% of malignant melanomas, about 10% of colorectal cancers and a

smaller percentage of other cancers may be caused by somatic mutations in BRAF, which

encodes a serine/threonine kinase (Davies et al., 2002). This discovery has stimulated

the development of BRAF inhibitors, leading to the use of several drugs in clinical trials.

Other notable discoveries include frequent mutations in PIK3CA (which encodes the

catalytic subunit of phosphatidylinositol-3-OH kinase) and AKT1 (Carpten et al., 2007)

(which encodes a serine/threonine kinase) in many cancer types, as well as in ERBB2

and EGFR in non-small-cell lung cancer (Sharma et al., 2007; Stephens et al., 2004).

Notably, the mutation status of EGFR can predict responses to treatment with the EGFR

inhibitors gefitinib or erlotinib in patients with advanced non-small-cell lung cancer.

These studies not only highlight the importance of somatic mutations in cancer

susceptibility, but also show a possible role for mutation testing in choosing an

appropriate treatment or intervention for individual patients.

Figure 1-1. Main types of genomic and epigenomic aberration that cause cancers. The left
panel shows the types of genomic and epigenomic aberration, and the right
panel shows examples of how they can be detected. A) Changes in DNA
sequence, such as point mutations (detected by DNA sequencing, showing somatic
mutations). B) Changes in genomic organization (detected by fluorescence in situ
hybridization, showing aneuploidy and translocation). DNA segments exchanged
between the two (blue and green) DNA molecules are shown. C) Changes in DNA
copy number, such as those that result from amplification (detected by genomic
hybridization, showing copy-number aberrations). D) Changes in DNA methylation
and the resultant changes in chromatin structure (detected by chromatin
immunoprecipitation, showing methylation or histone modification). All these
changes alter the expression of coding mRNA and non-coding microRNA (change in
level of expression, shift in pattern of alternatively spliced variants,
presence of aberrant transcripts such as fusion transcripts) and ultimately
translate into altered functions, leading to cancers. Adapted from (Chin and
Gray, 2008).


1.3.2 Chromosomal Translocations

The Philadelphia chromosome, discovered by Peter Nowell and David Hungerford

in 1960 (Nowell, 2007), was the first example of a chromosomal translocation that causes

human malignancy. A chromosomal translocation is the rearrangement of parts between

nonhomologous chromosomes (Fig. 1-1B); the Philadelphia chromosome results from a translocation

between chromosomes 9 and 22. Since the pioneering discovery of the Philadelphia

chromosome, a number of cancer-causing recurrent translocations have been detected

in human leukaemias and lymphomas by using molecular cytogenetic analyses (Rowley,

1999). Although it is difficult to find causal translocations in solid tumors, with an

increasing amount of genomic information and analytical techniques, recurrent structural

aberrations related to solid tumors can still be detected. For example, human prostate

cancer shows a high frequency of translocations between TMPRSS2 (which is upregulated

in response to androgenic hormones) and the ETS-family genes ERG, ETV1 and ETV4

(which encode transcription factors) (Tomlins et al., 2005).

1.3.3 Aneuploidy

Aneuploidy is a disorder in which a cell has extra or missing chromosomes, and is

often associated with tumors. Figure 1-2 illustrates karyotype analyses for normal and

tumor cells. The normal cell includes 23 pairs of standard chromosomes, whereas the tumor

cell exhibits the irregular karyotype described as aneuploid: some whole chromosomes

(e.g., chromosome 2) are missing, others have extra copies (e.g., chromosome 3), and
many have traded fragments (e.g., chromosome 14). Although it is still unclear whether

it is a cause or a consequence of malignant transformation, strong associations between

aneuploidy and cancer have been observed. Cells can become aneuploid in two ways:

alterations in the number of intact chromosomes, known as whole-chromosome

aneuploidy and originating from errors in cell division (mitosis), and rearrangements in

chromosome structure (deletions, amplifications or translocations) arising from breaks

in DNA (Pellman, 2007). While the latter is a well-established cause of tumor

development, the role of the former has long been debated. The challenge for cancer

genetic research is how to predict the occurrence of aneuploidy and how aneuploid

loci determine whether aneuploidy leads to cell death or cancer.

1.3.4 Epigenetic Modifications

Aberrant epigenetic modifications have been increasingly recognized to play a

major role in the tumorigenic process (Gal-Yam et al., 2008). These modifications are

imposed on chromatin, but do not alter the DNA sequence and structure, and they are

maintained and inherited through cell divisions (Fig. 1-1D). Epigenetic modifications

provide somatic cells with altered functions and features by expressing or

repressing a certain set of genes. These epigenetic alterations may be due to covalent

modifications of amino acid residues in the histones around which the DNA is wrapped,

and changes in the methylation status of cytosine bases (C) in the context of CpG

dinucleotides within the DNA itself (Gronbaek et al., 2007). Methylation of clusters of

CpGs (called "CpG-islands") in the promoters of genes has been associated with heritable

gene silencing. More recently, new initiatives such as human epigenome projects (Jones and

Martienssen, 2005) and epigenetic therapies (\ I 1: 2006) have been initiated to study how

epigenetic changes can modify genome-wide gene expression.

One of the consequences of epigenetic modifications is genomic imprinting, a

phenomenon where the father and mother contribute different epigenetic patterns

for specific genomic loci in their germ cells. A growing body of evidence shows that

genomic imprinting may have a transgenerational effect (Crews et al., 2007; Nilsson

et al., 2008). One notable example of transgenerational imprinting is given by

Marcus Pembrey and colleagues (Pembrey et al., 2006). Using the Avon Longitudinal

Study of Parents and Children and historical records from Överkalix in Sweden, they observed that men whose

paternal (but not maternal) grandfathers were exposed during preadolescence to famine

in the 19th century were less likely to die of cardiovascular disease. However, if food

was plentiful then diabetes mortality in the grandchildren increased. The opposite effect

was also observed for females, i.e., the paternal (but not maternal) granddaughters
of women who experienced famine while in the womb lived shorter lives on average.
Transgenerational imprinting may trigger some effects on cancer susceptibility, and the
understanding of this epigenetic modification can provide a unique perspective for cancer
research, and ultimately offer new insights into novel diagnostic and therapeutic strategies
(Anway and Skinner, 2008; Skinner and Anway, 2007).

Figure 1-2. Karyotype analysis for a normal human cell (left) and tumor cell (right).
Numbers under each one specify the sources of its fragments; plus and minus
signs identify those that are larger or smaller than usual. Adapted from
(Duesberg, 2007).



1.4 Statistical Issues for Cancer Gene Identification

To study a complete genetic architecture of cancer, we need appropriate statistical

designs with which genetic data can be collected, analyzed, and interpreted. On the other

hand, the latest discoveries in cancer genetics have important implications for the design

of efficient statistical strategies. Below, I will discuss several issues related to the choices of

optimal statistical designs that can be used to address various problems in cancer genetics.

1.4.1 Normal-Tumor Sampling Design

As a powerful design, case-control studies can be used to investigate the genetic

control of cancer. This design samples a panel of diseased subjects (cases) and an

approximately equivalent number of normal subjects (controls) that match cases in

demographic factors such as age, race, sex, and life style. The same types of markers

are genotyped from normal cells of these two groups. By testing differences in gene

segregation at each marker between the two groups, genes that may contribute to cancer

risk can be characterized. The disadvantage of case-control designs is that they can only

identify cancer genes located on the genome from normal cells.
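As an illustration of testing for a difference in genotype distribution between cases and controls at a single marker, the sketch below computes a Pearson chi-square statistic on a 2 x 3 table of genotype counts. This is only one common choice of test, not necessarily the one used in this dissertation; the function name and the counts are hypothetical.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    table: list of rows, e.g. [case_counts, control_counts], where each row holds
    the counts of the three genotypes at one marker (hypothetical layout).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical genotype counts (AA, Aa, aa) for 100 cases and 100 controls.
cases = [20, 30, 50]
controls = [30, 30, 40]
statistic, df = pearson_chi_square([cases, controls])
```

A large statistic relative to a chi-square distribution with `df` degrees of freedom would suggest that the marker's genotype distribution differs between the two groups.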

Cancer arises from genomic aberrations of normal cells. Thus, a comparison between

the DNA sequence, structure, and organization derived from cancer cells and their

originating normal tissue will help to detect these aberrations. Genomic and sequencing

data from these two types of tissues for the same subjects should be collected. Such a

normal-tumor sampling design can serve two purposes:

First, somatic mutation detection. By screening the normal and tumor

genomes from the patient, we can detect those regions that differ between the two

genomes. The differences detected from this comparison include "driver" mutations, which are

directly involved in cancer development (Stratton et al., 2009). There is also a group

of mutations that may not cause cancer, and they are called "passenger" mutations.

Therefore, it is crucial to develop a design that can identify these two types of mutations

and rank genes based on the likelihood that they may be drivers. Equally importantly,

this design should be able to allow the detection of driver genes through non-random

associations with SNPs in genome-wide association studies.

Second, host-tumor genetic interaction modeling. The development of cancer

is viewed as an event initiated by mutational events in the tumor cell, but profoundly

influenced by the interaction of these cells with the molecular and cellular components

that make up the tumor microenvironment. Complex interactions between the host and

tumor determine key processes, such as tumor invasion, angiogenesis, and metastasis,

and may also drive the progression of the tumor to a stage that resists standard forms

of therapy. By interrupting host-tumor communication, tumor progression can be

attenuated or reversed, providing an incentive to understand the molecular nature of

these communication signals and to develop targeted therapies that capitalize on the

powerful control a "normal" environment can exert over the tumor. The normal-tumor

sampling design will help to characterize the genetic interactions between genes from the

host and cancer genomes.

1.4.2 Family Design

A family design samples multiple members of a family including parents and their

offspring. Genetic information should be collected for two generations, but phenotypic

data for cancer can only be obtained for the offspring. The advantage of this family-based

design is that it allows the estimation of genetic imprinting effects, or parent-of-origin

effects, on cancer risk. Also, as a foundation of association studies, linkage disequilibrium

(LD) shows tremendous potential to fine map functional genes for a complex trait, but it

may provide a spurious estimate of LD in practice when the association between genes is

due to evolutionary forces, such as mutation, drift, selection, population structure, and

admixture. A mapping strategy that samples unrelated families (composed of parents and

offspring) from a natural population (Dupuis et al., 2007; Wu and Zeng, 2001) is helpful

for overcoming the limitation of LD mapping by simultaneously estimating the linkage and

linkage disequilibrium.

In humans, adequate numbers of progeny cannot be generated from a single

family, nor can controlled crosses be made. For this species, a nuclear family

with multiple successive generations is often used in order to accumulate a sufficient

number of progeny for genetic mapping of cancers. This design can be particularly

informative for inherited diseases, such as diabetes or cancer, because it provides a way of

estimating transgenerational imprinting effects. The recombination fraction and identity

by descent (IBD) coefficient are the key for genetic mapping with multigenerational


1.4.3 Longitudinal Studies

Cancer is a dynamic process, with tumor growth typically following a sigmoid curve. Also,

the incidence of cancer increases with age. These characteristics suggest that cancer

studies should be longitudinal, with tumor traits measured over a time course.

1.5 Dissertation Goals

The purpose of this dissertation is to develop a battery of statistical models for

identifying the genetic architecture of cancer susceptibility. The specific objectives are as follows.

1.5.1 Detecting Genetic Mutations

I will derive a disequilibrium model for detecting driver mutations that cause cancer.

This model will be based on the association between cancer genes and SNPs typed

from the human genome. The model of association analysis is founded on the linkage

disequilibria that occur at the zygotic level. The evolutionary basis of this model is

from commonly used population genetic theory that the equilibrium state of a natural

population can be violated by genetic mutations, leading to non-random associations

between different genes that are subject to selection. I will derive the EM algorithm

to estimate the rate of cancer-causing genetic mutation in a human population and

zygotic disequilibria between cancer genes and SNPs. A series of hypothesis tests is

formulated to test the significance of each parameter. I will perform simulation studies

to study the statistical behavior of the model, including the power of gene detection, the

precision of parameter estimation, and false positive rates.

1.5.2 Modeling Genetic Imprinting

I will propose a statistical design for detecting imprinted loci based on a random set

of three-generation families from a natural population by using genotyped SNPs. This

design provides a pathway for characterizing the effects of imprinted genes on a complex

disease at different generations and testing transgenerational changes of imprinted effects.

The design is integrated with population and cytogenetic principles of gene segregation

and transmission from the previous generation to the next. The implementation of the EM

algorithm within the design framework leads to the estimation of genetic parameters that

define imprinted effects. A simulation study is used to investigate the statistical properties

of the model and validate its utilization.

1.5.3 Modeling Aneuploidy

I will present a statistical model for detecting genes that determine cancer pathogenesis

through chromosomal instability due to aneuploidy, a syndrome caused by extra or missing

chromosomes and constituting some of the most widely recognized genetic disorders in

humans. I will develop an algorithm for estimating and testing the imprinting effects of

genes on cancer risk by distinguishing the parental origin of chromosome doubling. If the

SNPs used are sampled from an entire genome, the proposed model can be expanded to

construct an imprinted map of the cancer genome. The imprinting model, whose statistical

properties are investigated through simulation studies, will provide a useful tool for

studying the genetic architecture of cancer risk.

1.5.4 Modeling Cancer-Host Interactions

I will derive a statistical model for cancer gene identification by integrating the gene

mutation hypothesis of cancer formation into the mixture-model framework. Within

this framework, genetic interactions of DNA sequences (or haplotypes) between host

and cancer genes responsible for cancer risk are defined in terms of quantitative genetic

principles. The model is founded on a commonly used genetic association design in

which a random sample of patients is drawn from a natural human population. Each

patient is typed for single nucleotide polymorphisms (SNPs) on normal and cancer cells

and measured for cancer susceptibility. The model is formulated within the maximum

likelihood context and implemented with the EM algorithm, allowing the estimation

of both population and quantitative genetic parameters. The model provides a general

procedure for testing the distribution of haplotypes constructed by SNPs from host and

cancer genes and the linkage disequilibria of different orders among the SNPs. The model

also formulates a series of testable hypotheses about the effects of host genes, cancer

genes, and their interactions on cancer susceptibility. I will carry out simulation studies to

examine the statistical properties of the model. The implications of this model for cancer

gene identification are discussed.

This dissertation will for the first time establish a systematic procedure for cancer

gene identification based on the latest discoveries in the genetic architecture of cancer

initiation and progression. The designs, models, and algorithms developed, coupled with

recent advances in sequencing technologies, will provide important tools for probing

the molecular genetic mechanisms of cancer and ultimately helping to design effective

interventions for this important disease.


2.1 Introduction

By altering the behavior of cells, mutations in key regulatory genes (tumor suppressors

and protooncogenes) can lead to unregulated growth that may potentially develop

into cancer. So far, more than 350 mutated genes have been identified to be causally

implicated in cancer development (Futreal et al., 2004). These cancer genes have

mostly been identified using physical and genetic mapping strategies. However, since each

of these strategies can only identify a subset of cancer genes, the question of how

many genes in total are involved in cancer development remains largely unanswered. A

recent study surveying the human genome shows that more gene mutations may drive

cancer than previously thought (Greenman et al., 2007). The advent of the human

genome sequence and cancer genome sequence provides an unprecedented opportunity

to reveal the full compendium of mutations in individual cancers (Stratton et al., 2009;

The Cancer Genome Atlas Research Network, 2008; Velculescu, 2008) and thereby use this

information to develop accurately targeted treatments.

A great majority of the somatic mutations observed were single-base substitutions

(Velculescu, 2008), although mutations may also encompass several other classes of

DNA sequence change, such as insertions or deletions of small or large segments of DNAs,

DNA rearrangements, copy number increases, and copy number reductions (Stratton

et al., 2009). The first somatic point mutation that causes cancer was identified in two

independent experiments of transforming cancer DNA into normal cells (Reddy et al.,

1982; Tabin et al., 1982). The transformed normal cells become cancerous due to the

single-base G > T substitution that leads to a glycine-to-valine substitution in codon 12 of

the HRAS gene. This discovery prompted a growing number of studies associating genetic

mutations with cancer. The substitution of C:G base pairs by T:A base pairs or by G:C

base pairs is correlated with colorectal cancer and breast cancer and may be

a cause of these cancers (Greenman et al., 2007). In addition, the chromosomal

distribution and impact of these mutations were found to differ between these two

types of cancer, suggesting that the mechanisms underlying mutagenesis and repair are

cancer-dependent. Somatic mutations in the genome occur through misincorporation

during DNA replication or through exposure to exogenous or endogenous mutagens. This may

include two different biological processes. In one process, mutations confer growth

advantage on the cell in which they occur, thus they have been positively selected.

The mutations in this process are called "driver" mutations, and the genes that carry them are

known as cancer genes. The mutations in the second process are "passenger" mutations,

which are neutrally present in the cell that was the progenitor of the final clonal expansion

of the cancer and have not been subject to selection. In practice, there is a pressing

need to distinguish driver from passenger mutations and characterize the incidence and

distribution of driver mutations in the human genome.

In this study, we will develop a statistical model for detecting driver mutations in

a genome-wide association study project. The model assumes that driver mutations

occur at individual points in somatic chromosomes, which are in strong associations with

common genetic variants (such as single nucleotide polymorphisms, SNPs) due to some

evolutionary forces. The basic idea of the model is to use SNPs genotyped in the genome

of a normal tissue to predict driver point mutations that are expressed in the cancer

genome. The model considers the differences in the probability of mutations occurring

in alleles derived from the male and female chromosomes. However, because observed

genotypes at individual markers hide the information about the parental origins of an

allele, a mixture model is formulated to reflect these differences. We derive an elegant EM

algorithm to estimate and test parameters related to sex-specific differences in mutation

rate. The estimated coefficients of disequilibrium between SNPs in the human genome

and cancer gene can be used to predict the occurrence of driver mutations responsible for


2.2 Model

2.2.1 Study Design

Suppose there is a natural human population from which a random set of n samples is

drawn for genome-wide genotyping by SNP arrays. Of these samples, some are diagnosed

to have a cancer arising from point mutations. The cancer genomes of these patients are

also typed. Consider two SNPs, A and B, segregating in this sampled population. Assume

that the first SNP A is subject to the driver mutation, whereas the second SNP B is

normal. In the normal cells, SNP A has two alleles A1 and A2, forming three genotypes

A1A1, A1A2, and A2A2 with genotype frequencies of P1, P2, and P3, respectively. We

assume that one of the two alleles (say A1) at SNP A is mutated to a new allele A3.

During the mutation, subjects with genotype A1A1 may have three outcomes: (1) they

are changed into A1A3 because of the mutation of only one copy of allele A1, (2)

they are changed into A3A3 because of the mutation of both copies of allele A1, and (3)

they keep genotype A1A1 unchanged, possibly because they carry some other genes that

prevent the mutation. Similarly, genotype A1A2 has two possible fates: (1) A2A3 with

the mutation and (2) A1A2 with no mutation. The last genotype A2A2 will remain

unchanged. Because the mutation is considered as a driver mutation, subjects

with mutated genotypes A1A3, A2A3, and A3A3 carry the cancer.

Table 2-1 gives the data structure of genotype observations from the normal and

cancer cells for a made-up example composed of 10 subjects in which SNP A is subject to

mutation but SNP B is not. Subjects 1, 2, and 5 are diagnosed to have a cancer because

of the driver mutation occurring in their genomes. The other subjects have no cancer.

To find such a driver mutation, we genotype genome-wide SNPs in normal cells, some

of which may serve as predictors. In Table 2-1, SNP B is an example of these

predictors. Of its three genotypes B1B1, B1B2, and B2B2 with genotype frequencies of

Q1, Q2, and Q3, respectively, B1B1 is consistently in agreement with the driver mutation

and can be used as a predictor. Such predictors in practice can be detected by genotyping

the whole genome of normal cells and screening all the typed SNPs with the model

developed in the next sections. With this information, we can make an early diagnosis of

cancer, without waiting until the cancer occurs.

Table 2-1. Genotypes at SNPs A and B for normal and cancer cells.

       Normal                  Cancer
No.    A       B       ...     A
1      A1A1    B1B1    ...     A1A3
2      A1A1    B1B1    ...     A3A3
3      A1A1    B1B2
4      A1A1    B2B2
5      A1A2    B1B1    ...     A2A3
6      A1A2    B1B2
7      A1A2    B2B2
8      A1A2    B1B2
9      A2A2    B2B2
10     A2A2    B2B2
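The screening idea behind Table 2-1 can be sketched in a few lines: for each genotype at a candidate predictor SNP, compute the fraction of subjects whose cancer genome carries the mutant allele A3. This is only an illustration on the ten toy subjects of Table 2-1, not the model developed in the following sections; the data layout and function name are assumptions, and `None` marks subjects with no mutated cancer genotype listed.

```python
from collections import defaultdict

# Toy data transcribed from Table 2-1:
# (subject, normal genotype at A, normal genotype at B, cancer genotype at A or None).
TABLE_2_1 = [
    (1, "A1A1", "B1B1", "A1A3"),
    (2, "A1A1", "B1B1", "A3A3"),
    (3, "A1A1", "B1B2", None),
    (4, "A1A1", "B2B2", None),
    (5, "A1A2", "B1B1", "A2A3"),
    (6, "A1A2", "B1B2", None),
    (7, "A1A2", "B2B2", None),
    (8, "A1A2", "B1B2", None),
    (9, "A2A2", "B2B2", None),
    (10, "A2A2", "B2B2", None),
]


def mutation_fraction_by_predictor(rows):
    """Fraction of subjects carrying mutant allele A3, per genotype at SNP B."""
    carriers = defaultdict(int)
    totals = defaultdict(int)
    for _subject, _normal_a, normal_b, cancer_a in rows:
        totals[normal_b] += 1
        if cancer_a is not None and "A3" in cancer_a:
            carriers[normal_b] += 1
    return {genotype: carriers[genotype] / totals[genotype] for genotype in totals}


fractions = mutation_fraction_by_predictor(TABLE_2_1)
```

In this toy example every B1B1 subject carries the driver mutation and no other subject does, which is what makes B1B1 a perfect predictor here; real data would show partial association, which is why a formal disequilibrium model is needed.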

2.2.2 Genetic Model

It is possible that the mutation rate at a gene may differ between the two sexes.

In the following description, we will consider such a difference. Let $u_1$ and $u_2$ denote

the mutation rates of allele A1 at SNP A if it is derived from the father's and mother's

chromosome, respectively. We use | to denote the configuration of a genotype, where

the left and right sides of the vertical line present the mother's and father's origin of the

composed alleles, respectively. Table 2-3 lists the possible configurations of each genotype

before the mutation occurs. The two configurations of "parental" genotype A1A2 are

A1|A2 and A2|A1, with frequencies denoted as $P_{1|2}$ and $P_{2|1}$, respectively, which sum to $P_2$.

The table also provides the relative proportions of "offspring" configurations mutated from

the same "parental" genotype in terms of sex-specific mutation rates. From the table, we

calculate the frequencies of "offspring" genotypes in the population after the mutation

occurs, which are arrayed as

$$
\begin{array}{lll}
\text{No.} & \text{Genotype} & \text{Frequency} \\
1 & A_1A_1 & R_1 = (1 - u_1 - u_2 + u_1u_2)P_1 \\
2 & A_1A_2 & R_2 = (1 - u_2)P_{1|2} + (1 - u_1)P_{2|1} \\
3 & A_2A_2 & R_3 = P_3 \\
4 & A_1A_3 & R_4 = (u_1 + u_2 - 2u_1u_2)P_1 \\
5 & A_2A_3 & R_5 = u_2P_{1|2} + u_1P_{2|1} \\
6 & A_3A_3 & R_6 = u_1u_2P_1
\end{array}
$$

Because "offspring" genotypes A1A3, A3A3, and A1A1 are

derived from "parental" genotype A1A1, we have $R_1 + R_4 + R_6 = P_1$. Similarly, it can be

seen that $R_2 + R_5 = P_2$ and $R_3 = P_3$.
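The mapping from "parental" to "offspring" genotype frequencies can be checked numerically. The sketch below assumes that `P12` and `P21` stand for the frequencies of the configurations A1|A2 (A1 from the mother) and A2|A1 (A1 from the father), that `u1` and `u2` are the paternal and maternal mutation rates of A1, and that the two copies of A1 in an A1A1 subject mutate independently. The function name and the numerical inputs are hypothetical.

```python
def offspring_genotype_frequencies(u1, u2, P1, P12, P21, P3):
    """Return [R1, ..., R6] for genotypes A1A1, A1A2, A2A2, A1A3, A2A3, A3A3."""
    R1 = (1 - u1) * (1 - u2) * P1            # neither copy of A1 mutated
    R2 = (1 - u2) * P12 + (1 - u1) * P21     # the single A1 copy not mutated
    R3 = P3                                  # A2A2 carries no A1 allele
    R4 = (u1 + u2 - 2 * u1 * u2) * P1        # exactly one copy of A1 mutated
    R5 = u2 * P12 + u1 * P21                 # the single A1 copy mutated
    R6 = u1 * u2 * P1                        # both copies of A1 mutated
    return [R1, R2, R3, R4, R5, R6]


# Hypothetical inputs: maternal and paternal A1 frequencies of 0.3 and 0.4.
P1, P12, P21, P3 = 0.3 * 0.4, 0.3 * 0.6, 0.7 * 0.4, 0.7 * 0.6
R = offspring_genotype_frequencies(0.2, 0.1, P1, P12, P21, P3)
```

With any valid inputs, `R[0] + R[3] + R[5]` should reproduce `P1` and `R[1] + R[4]` should reproduce `P12 + P21`, which mirrors the constraints stated in the text.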

Theorem: The Hardy-Weinberg equilibrium (HWE) of a natural human population

can be violated by genetic mutations.

Proof: The allele frequencies of alleles A1 and A2 in the "parental" population before

the mutation are expressed as $p_1 = P_1 + \frac{1}{2}P_2$ and $p_2 = P_3 + \frac{1}{2}P_2$, respectively. After

the mutation, allele A1 is dissolved into two parts, A1 (wild-type) and A3 (mutant). Let

$p'_1$ and $p'_3$ denote the allele frequencies of alleles A1 and A3 in the "offspring" population

after the mutation. Thus, we have $p_1 = p'_1 + p'_3$. Since HWE is assumed in the "parental"

population, we have

$$
\begin{aligned}
P_1 &= p_1^2 \\
    &= (p'_1 + p'_3)^2 \\
    &= p'^2_1 + p'^2_3 + 2p'_1p'_3 && \text{(2-1)} \\
    &= (p'^2_1 + D_1) + (p'^2_3 + D_3) + (2p'_1p'_3 - D_1 - D_3) && \text{(2-2)} \\
    &= R_1 + R_4 + R_6,
\end{aligned}
$$

where

$$
\begin{aligned}
R_1 &= p'^2_1 + D_1, \\
R_4 &= p'^2_3 + D_3, \\
R_6 &= 2p'_1p'_3 - D_1 - D_3.
\end{aligned}
\qquad \text{(2-3)}
$$

It can be seen from equations (2-1) and (2-2) that we can always find non-zero values for $D_1$ and

$D_3$ which make equation (2-3) hold. When $D_1 \neq 0$ and/or $D_3 \neq 0$, SNP

A is at Hardy-Weinberg disequilibrium in the "offspring" population. For genotype A2A2,

we have

$$
\begin{aligned}
P_3 &= p_2^2 && \text{in the "parental" population} \\
    &= (1 - p_1)^2 && \text{in the "parental" population} \\
    &= (1 - p'_1 - p'_3)^2 && \text{in the "offspring" population} \\
    &= p'^2_2 && \text{in the "offspring" population} \\
    &= R_3 && \text{in the "offspring" population.}
\end{aligned}
$$

Finally, we establish the relationships between allele frequencies and genotype frequencies

in the "offspring" population as

$$
\begin{aligned}
R_1 &= p'^2_1 + D_1, \\
R_2 &= 2p'_1p'_2 - D_1, \\
R_3 &= p'^2_2, \\
R_4 &= p'^2_3 + D_3, \\
R_5 &= 2p'_2p'_3 - D_3, \\
R_6 &= 2p'_1p'_3 - D_1 - D_3.
\end{aligned}
$$

This theorem tells us that the traditional theory of linkage disequilibrium analysis

based on the HWE assumption will not be useful for association studies in a cancer

population due to genetic mutations.
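A small numerical check can make the theorem concrete. The sketch below starts from a "parental" population in HWE, applies sex-specific mutation of A1 to A3 (assuming the two copies mutate independently and that the A1 frequency is the same in both sexes), and measures how far the A1A1 frequency departs from the square of the A1 allele frequency after mutation. The function names and inputs are hypothetical.

```python
def offspring_frequencies(p1, u1, u2):
    """Genotype frequencies after mutation, starting from HWE with A1 frequency p1."""
    p2 = 1.0 - p1
    # Parental configuration frequencies under HWE (mother | father).
    P1, P12, P21, P3 = p1 * p1, p1 * p2, p2 * p1, p2 * p2
    return {
        "A1A1": (1 - u1) * (1 - u2) * P1,
        "A1A2": (1 - u2) * P12 + (1 - u1) * P21,
        "A2A2": P3,
        "A1A3": (u1 + u2 - 2 * u1 * u2) * P1,
        "A2A3": u2 * P12 + u1 * P21,
        "A3A3": u1 * u2 * P1,
    }


def homozygote_excess_A1(p1, u1, u2):
    """Frequency of A1A1 minus the squared A1 allele frequency after mutation."""
    R = offspring_frequencies(p1, u1, u2)
    p1_after = R["A1A1"] + 0.5 * (R["A1A2"] + R["A1A3"])
    return R["A1A1"] - p1_after ** 2


excess_unequal = homozygote_excess_A1(0.4, 0.2, 0.1)   # sex-specific rates
excess_equal = homozygote_excess_A1(0.4, 0.15, 0.15)   # equal rates
```

Under these assumptions the excess is non-zero when the two rates differ, so HWE no longer holds after mutation; in this toy calculation it vanishes when `u1` equals `u2`.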

2.2.3 Estimation

The aim of the model is to detect mutated genes for cancer by scanning their

associations with the SNPs typed from the human genome. For the samples drawn from

a natural population, there are 18 joint genotypes formed by six genotypes A1A1

(1), A1A2 (2), A2A2 (3), A1A3 (4), A2A3 (5), and A3A3 (6) at mutated SNP A in the

human and cancer genomes and three genotypes B1B1 (1), B1B2 (2), and B2B2 (3) at

neutral SNP B in the human genome. Let $n_{ij}$ (summing to $N$) and $P_{ij}$ denote the observed count and

frequency of joint genotype $ij$ ($i = 1, \ldots, 6$ for SNP A; $j = 1, 2, 3$ for SNP B), respectively. Based on these genotype

frequencies, we construct a multinomial likelihood function for observed SNPs A and

B as

$$
\log L(A, B) = c + \sum_{i=1}^{6} \sum_{j=1}^{3} n_{ij} \log P_{ij}, \qquad \text{(2-4)}
$$

where $c$ is a constant, from which the maximum likelihood estimate (MLE) of each genotype

frequency can be obtained as $\hat{P}_{ij} = n_{ij}/N$.
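The likelihood (2-4) and its closed-form maximizer can be written down directly. The sketch below evaluates the multinomial log-likelihood up to the additive constant c and returns the estimates n_ij / N; the function names and the count table are hypothetical.

```python
import math


def genotype_mle(counts):
    """Return the MLEs P_ij = n_ij / N for a table of joint-genotype counts."""
    total = sum(sum(row) for row in counts)
    return [[n / total for n in row] for row in counts]


def log_likelihood(counts, freqs):
    """Multinomial log-likelihood of (2-4) without the constant c.

    Cells with a zero count contribute nothing and are skipped so that a zero
    frequency estimate does not trigger log(0).
    """
    return sum(
        n * math.log(p)
        for row_n, row_p in zip(counts, freqs)
        for n, p in zip(row_n, row_p)
        if n > 0
    )


# Hypothetical 6 x 3 table: rows are SNP A genotypes 1..6, columns SNP B genotypes 1..3.
counts = [
    [30, 50, 35],
    [90, 200, 118],
    [70, 170, 120],
    [20, 15, 7],
    [40, 22, 10],
    [3, 0, 0],
]
p_hat = genotype_mle(counts)
loglik_at_mle = log_likelihood(counts, p_hat)
```

Any other set of frequencies, for example the uniform value 1/18 in every cell, gives a log-likelihood no larger than `loglik_at_mle`.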

Table 2-2 gives two different compositions of each joint genotype frequency at SNPs A

and B in the "offspring" population after the mutation occurs. The first is the product of

genotype frequencies at SNPs A (determined by u1, u2, P1, P2, and P3) and B (denoted as

Q1, Q2, and Q3) and the second is the linkage disequilibrium at the zygotic level between

the two SNPs. Of 18 genotypes, we determine ten independent linkage disequilibria D1, ..., D10.


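By maximizing likelihood (2-4), we obtain the analytical solutions of genotype

frequencies, expressed as

$$
\begin{aligned}
\hat{P}_1 &= \frac{1}{N}\sum_{j=1}^{3}(n_{1j} + n_{4j} + n_{6j}), \\
\hat{P}_2 &= \frac{1}{N}\sum_{j=1}^{3}(n_{2j} + n_{5j}), \\
\hat{P}_3 &= \frac{1}{N}\sum_{j=1}^{3} n_{3j}, \\
\hat{Q}_j &= \frac{1}{N}\sum_{i=1}^{6} n_{ij}, \quad j = 1, 2, 3.
\end{aligned}
\qquad \text{(2-5)}
$$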
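The analytical estimates of the marginal genotype frequencies amount to summing rows and columns of the count table. The sketch below assumes the table is indexed with the six SNP A genotypes (A1A1, A1A2, A2A2, A1A3, A2A3, A3A3) as rows and the three SNP B genotypes as columns; the function name and counts are hypothetical.

```python
def marginal_estimates(counts):
    """Estimate parental frequencies P1, P2, P3 at SNP A and Q1, Q2, Q3 at SNP B."""
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    # Offspring genotypes A1A1, A1A3, A3A3 (rows 1, 4, 6) all come from parental A1A1.
    P1 = (row_totals[0] + row_totals[3] + row_totals[5]) / total
    # Offspring genotypes A1A2 and A2A3 (rows 2, 5) come from parental A1A2.
    P2 = (row_totals[1] + row_totals[4]) / total
    # A2A2 (row 3) is unaffected by the mutation.
    P3 = row_totals[2] / total
    Q = [sum(row[j] for row in counts) / total for j in range(3)]
    return P1, P2, P3, Q


counts = [
    [30, 50, 35],
    [90, 200, 118],
    [70, 170, 120],
    [20, 15, 7],
    [40, 22, 10],
    [3, 0, 0],
]
P1, P2, P3, Q = marginal_estimates(counts)
```

These marginal estimates do not require iteration; only the mutation rates, configuration frequencies, and disequilibria need the iterative procedure.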

[Table 2-2. Compositions of the joint genotype frequencies at SNPs A and B in the "offspring" population after the mutation occurs. The table body is not recoverable from the source.]

An iterative procedure is needed to estimate the mutation rates, configuration

frequencies, and disequilibria. This iterative procedure is implemented using the

structure of Table 2-2. In step 1, for a particular joint genotype in Table 2-2, calculate

the proportions of compositions related to the mutation rates by

[Equation (2-6), defined for $j = 1, 2, 3$, is not recoverable from the source.]

related to the configuration frequencies by

[Equation (2-7) is not recoverable from the source.]

and related to the disequilibria by

$$
\begin{aligned}
\phi_{11} &= \frac{D_1}{P_{11}}, & \phi_{12} &= \frac{D_6}{P_{12}}, \\
\phi_{21} &= \frac{D_2}{P_{21}}, & \phi_{22} &= \frac{D_7}{P_{22}}, \\
\phi_{31} &= \frac{D_3}{P_{31}}, & \phi_{32} &= \frac{D_8}{P_{32}}, \\
\phi_{41} &= \frac{D_4}{P_{41}}, & \phi_{42} &= \frac{D_9}{P_{42}}, \\
\phi_{51} &= \frac{D_5}{P_{51}}, & \phi_{52} &= \frac{D_{10}}{P_{52}}.
\end{aligned}
\qquad \text{(2-8)}
$$

In step 2, estimate the sex-specific mutation rates $u_1$ and $u_2$ by equations (2-9) and (2-10),

the configuration frequencies by equations (2-11) and (2-12), and the disequilibria $D_1, \ldots, D_{10}$

by equations (2-13) through (2-22), each of which combines the observed counts $n_{ij}$ with the

proportions calculated in step 1.

[Equations (2-9) through (2-22) are not recoverable from the source.]

Both steps 1 and 2 are iterated until the estimates are stable. The stable estimates are the

maximum likelihood estimates (MLEs) of parameters.
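The two-step iteration can be illustrated on a simplified version of the problem. The sketch below is not the algorithm of equations (2-6) through (2-22): it ignores SNP B and the disequilibria, uses only the six genotype counts at SNP A, and assumes the two copies of A1 mutate independently. Here `P12` and `P21` stand for the frequencies of configurations A1|A2 (maternal A1) and A2|A1 (paternal A1), and all function names, counts, and starting values are hypothetical. Because the roles of the two sexes can be swapped without changing the likelihood in this reduced model, the estimates depend on the starting values.

```python
import math


def genotype_freqs(u1, u2, P1, P12, P21, P3):
    """Frequencies of A1A1, A1A2, A2A2, A1A3, A2A3, A3A3 after mutation."""
    return [
        (1 - u1) * (1 - u2) * P1,
        (1 - u2) * P12 + (1 - u1) * P21,
        P3,
        (u1 + u2 - 2 * u1 * u2) * P1,
        u2 * P12 + u1 * P21,
        u1 * u2 * P1,
    ]


def log_lik(n, params):
    """Multinomial log-likelihood of the six counts, without the constant term."""
    return sum(c * math.log(r) for c, r in zip(n, genotype_freqs(*params)) if c > 0)


def em_step(n, params):
    u1, u2, P1, P12, P21, P3 = params
    n1, n2, n3, n4, n5, n6 = n
    total = sum(n)
    # Step 1 (E-step): posterior proportions of the hidden compositions.
    w4 = u1 * (1 - u2) / (u1 * (1 - u2) + u2 * (1 - u1))       # A1A3: paternal copy mutated
    lam2 = (1 - u2) * P12 / ((1 - u2) * P12 + (1 - u1) * P21)  # A1A2 has configuration A1|A2
    lam5 = u2 * P12 / (u2 * P12 + u1 * P21)                    # A2A3 arose from A1|A2
    # Expected numbers of paternal and maternal A1 alleles, mutated and in total.
    paternal_mutated = n4 * w4 + n5 * (1 - lam5) + n6
    paternal_total = n1 + n4 + n6 + n2 * (1 - lam2) + n5 * (1 - lam5)
    maternal_mutated = n4 * (1 - w4) + n5 * lam5 + n6
    maternal_total = n1 + n4 + n6 + n2 * lam2 + n5 * lam5
    # Step 2 (M-step): update mutation rates and configuration frequencies.
    return (
        paternal_mutated / paternal_total,
        maternal_mutated / maternal_total,
        (n1 + n4 + n6) / total,
        (n2 * lam2 + n5 * lam5) / total,
        (n2 * (1 - lam2) + n5 * (1 - lam5)) / total,
        n3 / total,
    )


def run_em(n, start, tol=1e-10, max_iter=10000):
    """Iterate steps 1 and 2 until the log-likelihood stops increasing."""
    params = start
    trace = [log_lik(n, params)]
    for _ in range(max_iter):
        params = em_step(n, params)
        trace.append(log_lik(n, params))
        if trace[-1] - trace[-2] < tol:
            break
    return params, trace


# Hypothetical counts of A1A1, A1A2, A2A2, A1A3, A2A3, A3A3 in 1000 subjects.
counts = [115, 408, 360, 42, 72, 3]
estimates, trace = run_em(counts, (0.3, 0.1, 0.2, 0.2, 0.3, 0.3))
```

The log-likelihood stored in `trace` should never decrease from one iteration to the next, which is a convenient check that both steps are coded consistently.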

2.2.4 Hypothesis Tests

After the MLEs of the parameters are obtained, we need to test their significance.

The significance of the association between SNP B and mutated gene A can be tested by

formulating a null hypothesis:

$$H_0: D_1 = D_2 = D_3 = D_4 = D_5 = D_6 = D_7 = D_8 = D_9 = D_{10} = 0, \tag{2-23}$$

under which the parameters P1, P2, P3, Q1, Q2, and Q3 are estimated with an analytical

form, and the mutation rates u1 and u2 and the two configuration frequencies of genotype A1A2 are estimated with the iterative algorithm. The 10

zygotic disequilibria that specify the association between a SNP and cancer gene can be

tested individually or collectively. A similar EM algorithm can be implemented to estimate

the parameters for these tests.

It would be interesting to test whether the mutation rate is sex-specific. This can be

performed by using the following null hypothesis:

$$H_0: u_1 = u_2 = u. \tag{2-24}$$

When u1 = u2 = u, the two configurations of "parental" genotype A1A2 will be collapsed into

a single genotype whose frequency is P2. All the genotype frequencies can be estimated

using an analytical formula, but the estimate of u under hypothesis (2-24) should be

based on the EM algorithm.

Because all the hypothesis tests described above nest the null hypothesis within the

alternative hypothesis, the log-likelihood ratio test statistic can be regarded as following a $\chi^2$ distribution with

the degrees of freedom equal to the difference in the number of parameters to be estimated

under the null and alternative hypotheses.
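
The nested test described above can be illustrated with a short Python sketch that turns two maximised log-likelihoods into a test statistic and a chi-square p-value. This is only a sketch under stated assumptions: the function names `chi2_sf` and `lr_test` and all numbers in the usage note are hypothetical and do not come from the analysis in this chapter, and the tail probability relies on the closed-form finite sums that exist for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term = 1.0
        total = 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: two-sided normal tail plus a finite correction sum.
    total = 0.0
    term = math.sqrt(x)  # x^(r - 1/2) / (1 * 3 * ... * (2r - 1)) for r = 1
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    return math.erfc(math.sqrt(x / 2.0)) + 2.0 * density * total


def lr_test(loglik_null, loglik_alt, n_params_null, n_params_alt):
    """Return (statistic, degrees of freedom, p-value) for nested hypotheses."""
    stat = -2.0 * (loglik_null - loglik_alt)
    df = n_params_alt - n_params_null
    return stat, df, chi2_sf(stat, df)
```

For instance, `lr_test(-1005.0, -1000.0, 6, 16)` (made-up log-likelihoods and parameter counts) gives a statistic of 10 on 10 degrees of freedom, corresponding to testing all ten disequilibria jointly.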

2.3 Computer Simulation

A simulation study is used to investigate the statistical properties of the model

developed. We simulate a natural human population at HWE with a SNP A that tends

to mutate and a neutral SNP B that is associated with SNP A. In normal cells, allele

frequencies of these two SNPs are given as follows: 0.3 or 0.4 for allele A1 if it is derived

from the left-side or right-side parent and 0.7 or 0.6 for allele A2 if it is derived from the

left-side or right-side parent at SNP A (Table 2-3), and 0.4 for allele B1 and 0.6 for allele

B2, regardless of the parental origin, at SNP B. Originally, the two SNPs may or may not

be associated, but become associated after the mutation from A1 to A3 happens at SNP

A. The coefficients of genetic associations at the zygote level between the two SNPs are

given according to the magnitude and occurrence of disequilibria:

(1) Strong disequilibria: D1 = 0.05, D2 = -0.05, D3 = D4 = D5 = 0, D6 = -0.03,
D7 = 0.03, D8 = D9 = D10 = 0,

(2) Moderate disequilibria: D1 = 0.02, D2 = -0.02, D3 = D4 = D5 = 0, D6 = -0.01,
D7 = 0.01, D8 = D9 = D10 = 0,

(3) Weak disequilibria: D1 = 0.001, D2 = 0.001, D3 = -0.001, D4 = 0.001, D5 = 0,
D6 = -0.001, D7 = 0.001, D8 = -0.001, D9 = 0.001, D10 = 0,

(4) High occurrence rate of disequilibria (case 1): D1 = -0.001, D2 = 0.04, D3 = -0.04,
D4 = 0.001, D5 = -0.0001, D6 = 0.0001, D7 = -0.04, D8 = 0.04, D9 = -0.001,
D10 = 0.001,

(5) High occurrence rate of disequilibria (case 2): D1 = 0.04, D2 = -0.04, D3 = -0.001,
D4 = -0.001, D5 = 0.001, D6 = -0.03, D7 = 0.001, D8 = 0.03, D9 = 0.0001,
D10 = -0.001.

It should be noted that the determination of these disequilibrium coefficients needs

to meet P_ij > 0 for all i = 1,..., 6 and j = 1,..., 3. We wrote a small program to allow

the choices of suitable disequilibrium values. The mutation rates of allele A1 derived from

left-side and right-side parents are 0.05 and 0.08, respectively. Based on the genotype

frequencies and mutation rates given, we simulate the number of subjects carrying mutated

cancer genes for sample sizes of N = 400, 800, and 2000.

Table 2-3. Sex-specific mutations at SNP A for normal and cancer cells before and after
the mutation.

"Parental"   "Parental"       Mutation   "Offspring"
Genotype     Configuration    Occurs     Configuration    Proportion
A1A1         A1|A1            =>         A1|A3            u1
A1A1         A1|A1            =>         A3|A1            u2
A1A1         A1|A1            =>         A3|A3            u1 u2
A1A1         A1|A1            =>         A1|A1            1 - u1 - u2 - u1 u2

A1A2         A1|A2            =>         A3|A2            u2
A1A2         A1|A2            =>         A1|A2            1 - u2
A1A2         A2|A1            =>         A2|A3            u1
A1A2         A2|A1            =>         A2|A1            1 - u1

A2A2         A2|A2            =>         A2|A2            1
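
The rows of Table 2-3 can be read as a lookup from a "parental" configuration to its possible "offspring" configurations. The Python sketch below encodes that lookup exactly as tabulated; the function name `offspring_configurations` and the string encoding of configurations are illustrative assumptions, not part of the original program.

```python
def offspring_configurations(parental, u1, u2):
    """Map a 'parental' configuration (e.g. 'A1|A2') to a dict of
    {offspring configuration: proportion}, following the rows of Table 2-3.

    u1 and u2 are the two parent-specific mutation rates of allele A1.
    """
    left, right = parental.split("|")
    if left == "A1" and right == "A1":
        # Either copy, both copies, or neither copy of A1 mutates to A3.
        return {
            "A1|A3": u1,
            "A3|A1": u2,
            "A3|A3": u1 * u2,
            "A1|A1": 1.0 - u1 - u2 - u1 * u2,
        }
    if left == "A1":   # configuration A1|A2: only the left allele can mutate
        return {"A3|A2": u2, "A1|A2": 1.0 - u2}
    if right == "A1":  # configuration A2|A1: only the right allele can mutate
        return {"A2|A3": u1, "A2|A1": 1.0 - u1}
    return {parental: 1.0}  # A2|A2 carries no mutable allele
```

With the simulated rates u1 = 0.05 and u2 = 0.08, the proportions returned for each parental configuration sum to one, which is a quick sanity check on the table.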

Tables 2-4 to 2-8 provide the MLEs of the allele frequencies, mutation rates and zygotic

disequilibria between SNPs A and B with the model derived. It is not surprising that

the allele frequencies and genotype frequencies at the SNP marker, as well as genotype

frequencies (and also allele frequencies) for the cancer gene before its mutation occurs, can

be very well estimated. A sample size of 400 will be largely sufficient to precisely estimate

those parameters. The model has power to estimate the frequencies of two configurations

each with two different alleles derived from a reciprocal parent for a heterozygote at the

cancer gene. The frequencies of an allele (say A1) for the cancer gene derived from the right-side

(PR(A1)) and left-side (PL(A1)) parents can be estimated reasonably, but a sufficiently

large sample size (say 2000 or more) is needed to assure their estimates at an acceptably

high precision.

Mutation rates are an important parameter for studying the genetic control of a

cancer. It is not technically difficult to estimate the mutation rate of an allele by analyzing

and comparing the data collected from cancer and normal genomes. The merit of our

model lies in the estimation of the mutation rates of alleles derived from different parents.

In general, parent-specific mutation rates (u1 and u2) can be well estimated, with the

estimation precision increasing substantially with sample size.

All the disequilibrium coefficients can be estimated although a convergence problem

may occur when these values are small. Under strong or moderately strong disequilibria,

a sample size of 800 or 2000 seems to be sufficient for reasonably good estimates (Tables

2-4 and 2-5). For weak disequilibria, a larger sample size (say 2000 or more) is needed

(Table 2-6). The power to detect linkage disequilibria depends on the size of disequilibria. For

weak disequilibria, power is only about 25-35% for a sample size of 400 to 2000 (Table

2-6). The power can increase dramatically when the disequilibria are strong or moderately

strong (Tables 2-4 and 2-5).

We also examined the estimates of parameters when all 10 disequilibria occur. The

results summarized in Tables 2-7 and 2-8 suggest that all parameters can be estimated,

but for those parameters estimated from numerical iterations a large sample size is

required. We investigated the false positive rates of disequilibrium tests under different

simulation scenarios by simulating joint genotype data under the assumption of no zygotic

disequilibria. In most cases, the false positive rates are about 5 to 1 (data not shown).

2.4 Discussion

In this article, we present a novel computational algorithm for detecting mutated

genes for cancer by association studies of SNPs. The idea behind the model derivation

is that all cancers arise as a result of somatically acquired changes that have occurred

in the DNA sequence of the genomes of cancer cells (Stratton et al., 2009). Over the past 20

years, much knowledge has accumulated about these mutations and the abnormal genes

that operate in human cancers. It is now time to characterize genes for cancers using a

complete DNA sequence of cancer genomes. To elucidate the detailed and comprehensive

picture of the genetic architecture of cancer, there is a pressing need to develop a

statistical model for analyzing the data of the normal and cancer genomes, aimed at the

genetic mutations for cancer.

Because not all the somatic abnormalities present in a cancer genome have been

involved in development of the cancer, the terms "driver" and "passenger" have been

coined, stressing that only a driver mutation is causally implicated in oncogenesis

(Greenman et al., 2007; Haber and Settleman, 2007; Stratton et al., 2009). Our model

intends to detect those drivers for cancer based on the assumption that some particular

SNPs are associated with driver mutations because of evolutionary force. The model needs

the design in which both normal and cancer cells are typed with SNPs for a set of subjects

sampled from a natural human population.

Our model is constructed on linkage disequilibria expressed at the zygote level.

Traditional theory for linkage disequilibrium analysis is based on the non-random

association between different genes at the gametic level. With the assumption of

Hardy-Weinberg equilibrium (HWE), gametic linkage disequilibria are estimated from

observed zygotic data. For an HWE population, this assumption may be violated

when gene mutations occur. Thus, the traditional theory cannot be used for linkage

disequilibrium analysis in a cancer population with genetic mutations. In this article, we

introduce the concept of zygotic linkage disequilibrium defined as non-random associations

between genotypes at different genes. By scanning SNPs throughout the genome, the

model will be able to detect those subjects that are likely susceptible to cancer based on

their genotypes at nearby SNPs.

The model assumes that genetic mutations occur at different probabilities depending

on the parental origin of the allele (Arnheim and Calabrese, 2009). It further assumes

that the occurrence of genetic mutation is independent between the two sexes. While

the first assumption has some biological foundation because of the ubiquity of genetic

imprinting, the second assumption needs to be justified.

Given the growing popularity of genome-wide association studies, the model needs

to be expanded to consider the impact of multiple gene mutations in the genome. In

principle, such an extension is straightforward with the framework constructed in

this article. However, computational issues with high-dimensional data and multiple

comparisons due to correlated SNPs in similar genome regions are needed to be well



It has been recognized that genetic mutations in specific nucleotides may give

rise to cancer via the alteration of signaling pathways. Thus, the detection of those

cancer-causing mutations has received considerable interest in cancer genetic research.

Here, we propose a statistical model for characterizing genes that lead to cancer through

point mutations using genome-wide single nucleotide polymorphism (SNP) data. The

basic idea of the model is that mutated genes may be in high association with their nearby

SNPs because of evolutionary force. By genotyping SNPs in both normal and cancer

cells, we formulate a polynomial likelihood to estimate the population genetic parameters

related to cancer, such as allele frequencies of cancer-causing alleles, mutation rates of

alleles derived from maternal or paternal parents, and zygotic linkage disequilibria between

different loci after the mutation occurs. We implement the EM algorithm to estimate some

of these parameters because of the missing information in the likelihood construct. The

model allows the elegant tests of the significant associations between mutated cancer genes

and genome-wide SNPs, thus providing a way for predicting the occurrence and formation

of cancer with genetic information. The model, validated through computer simulation,

may help cancer geneticists design efficient experiments and formulate hypotheses for

cancer gene identification.

Table 2-4. The maximum likelihood estimates (MLEs) of allele frequencies, configuration
frequencies, genotype frequencies, mutation rates, and zygotic disequilibria
when a cancer gene is strongly associated with the SNP simulated. The
standard errors of the MLEs calculated from 500 simulation runs are given in
parentheses.

                                      Sample Size
Parameter      True       400                 800                 2000
SNP Frequencies
Q1             0.1600     0.1595 (0.0179)     0.1601 (0.0125)     0.1602 (0.0080)
Q2             0.4800     0.4816 (0.0223)     0.4807 (0.0173)     0.4800 (0.0111)
Q3                        0.3589 (0.0217)
Allele B1                 0.4003 (0.0164)
Driver Gene
P1             0.1200     0.1190 (0.0170)
P2             0.4600     0.4611 (0.0261)
P3             0.4200     0.4199 (0.0251)
P(A1|A2)       0.1800     0.1734 (0.1504)
P(A2|A1)       0.2800     0.2877 (0.1552)
PL(A1)         0.3000     0.2924 (0.1514)
PR(A1)         0.4000     0.4068 (0.1550)
Mutation Rate
u1             0.1000     0.1041 (0.0859)
u2             0.1200     0.1173 (0.0999)
Zygotic Disequilibria
D1             0.0500     0.0492 (0.0097)     0.0502 (0.0069)     0.0502 (0.0043)
D2            -0.0500    -0.0496 (0.0078)    -0.0499 (0.0053)    -0.0500 (0.0034)
D3             0.0000     0.0003 (0.0081)    -0.0003 (0.0063)    -0.0000 (0.0038)
D4             0.0000     0.0002 (0.0032)    -0.0001 (0.0021)     0.0000 (0.0014)
D5             0.0000    -0.0001 (0.0039)     0.0001 (0.0029)    -0.0001 (0.0019)
D6            -0.0300    -0.0294 (0.0073)    -0.0301 (0.0048)    -0.0301 (0.0031)
D7             0.0300     0.0295 (0.0126)     0.0304 (0.0085)     0.0298 (0.0054)
D8             0.0000    -0.0000 (0.0117)     0.0001 (0.0083)     0.0001 (0.0054)
D9             0.0000    -0.0000 (0.0042)    -0.0001 (0.0029)     0.0000 (0.0018)
D10            0.0000    -0.0001 (0.0054)    -0.0004 (0.0038)     0.0002 (0.0024)
Power (%)









Table 2-5. The maximum likelihood estimates (MLEs) of allele frequencies, configuration
frequencies, genotype frequencies, mutation rates, and zygotic disequilibria
when a cancer gene is moderately associated with the SNP simulated. The
standard errors of the MLEs calculated from 500 simulation runs are given in
parentheses.

                                      Sample Size
Parameter      True       400                 800                 2000
SNP Frequencies
Q1             0.1600     0.1596 (0.0188)     0.1592 (0.0123)     0.1603 (0.0077)
Q2             0.4800     0.4803 (0.0246)     0.4811 (0.0184)     0.4793 (0.0110)
Q3                                            0.3597 (0.0174)     0.3604 (0.0105)
Allele B1                                     0.3997 (0.0119)     0.3999 (0.0074)
Driver Gene
P1             0.1200     0.1198              0.1203 (0.0112)     0.1202 (0.0070)
P2             0.4600     0.4627              0.4598 (0.0173)     0.4593 (0.0115)
P3             0.4200     0.4175              0.4200 (0.0173)     0.4205 (0.0117)
P(A1|A2)       0.1800     0.1898              0.1772 (0.1377)     0.1604 (0.1107)
P(A2|A1)       0.2800     0.2729              0.2826 (0.1391)     0.2989 (0.1115)
PL(A1)         0.3000     0.3096              0.2974 (0.1384)     0.2806 (0.1100)
PR(A1)         0.4000     0.3927              0.4029 (0.1388)     0.4191 (0.1124)
Mutation Rate
u1             0.1000                         0.1015 (0.0631)     0.1019 (0.0452)
u2             0.1200                         0.1219 (0.0774)     0.1204 (0.0539)
Zygotic Disequilibria
D1             0.0200                         0.0200 (0.0051)     0.0202 (0.0033)
D2            -0.0200                        -0.0200 (0.0060)    -0.0203 (0.0039)
D3                       -0.0003 (0.0091)     0.0001 (0.0062)     0.0002 (0.0041)
D4                        0.0002 (0.0032)    -0.0001 (0.0021)    -0.0000 (0.0014)
D5                        0.0004 (0.0043)    -0.0000 (0.0028)     0.0000 (0.0018)
D6                       -0.0102 (0.0068)    -0.0102 (0.0047)    -0.0102 (0.0031)
D7                        0.0101 (0.0126)     0.0102 (0.0088)     0.0103 (0.0056)
D8                        0.0002 (0.0123)    -0.0002 (0.0084)    -0.0003 (0.0056)
D9                        0.0000 (0.0040)     0.0001 (0.0028)     0.0000 (0.0019)
D10                      -0.0002 (0.0053)     0.0000 (0.0039)     0.0001 (0.0025)
Power (%)

Table 2-6. The maximum likelihood estimates (MLEs) of allele frequencies, configuration
frequencies, genotype frequencies, mutation rates, and zygotic disequilibria
when a cancer gene is weakly associated with the SNP simulated. The standard
errors of the MLEs calculated from 500 simulation runs are given in
parentheses.

                                      Sample Size
Parameter      True       400                 800                 2000
SNP Frequencies
Q1             0.1600     0.1600 (0.0174)     0.1596 (0.0128)     0.1607 (0.0084)
Q2             0.4800     0.4793 (0.0245)     0.4804 (0.0184)     0.4797 (0.0108)
Driver Gene
P1             0.1200     0.1203
P2             0.4600     0.4589
P3             0.4200     0.4208
P(A1|A2)       0.1800     0.2210
P(A2|A1)       0.2800     0.2378
PL(A1)         0.3000     0.3413
PR(A1)         0.4000     0.3581
Mutation Rate
u1             0.0500
u2             0.0800
Zygotic Disequilibria
D1                       -0.0011 (0.0056)    -0.0009 (0.0041)    -0.0009 (0.0024)
D2                        0.0012 (0.0074)     0.0011 (0.0058)     0.0010 (0.0035)
D3                       -0.0012 (0.0073)    -0.0011 (0.0057)    -0.0011 (0.0035)
D4                        0.0013 (0.0038)     0.0011 (0.0026)     0.0010 (0.0016)
D5                        0.0001 (0.0035)     0.0001 (0.0026)     0.0001 (0.0016)
D6                       -0.0013 (0.0067)    -0.0009 (0.0051)    -0.0008 (0.0031)
D7                        0.0003 (0.0091)     0.0014 (0.0071)     0.0008 (0.0045)
D8                        0.0002 (0.0096)    -0.0012 (0.0073)    -0.0012 (0.0047)
D9                        0.0008 (0.0041)     0.0009 (0.0029)     0.0010 (0.0016)
D10                      -0.0002 (0.0047)    -0.0003 (0.0034)     0.0002 (0.0020)
Power (%)


Table 2-7. The maximum likelihood estimates (MLEs) of allele frequencies, configuration
frequencies, genotype frequencies, mutation rates, and zygotic disequilibria
when the association between a cancer and the SNP is determined by all
different disequilibrium coefficients in case 1. The standard errors of the MLEs
calculated from 500 simulation runs are given in parentheses.

                                      Sample Size
Parameter      True       400                 800                 2000
SNP Frequencies
Q1             0.1600     0.1598 (0.0193)     0.1597 (0.0133)     0.1600 (0.0083)
Q2             0.4800     0.4803 (0.0243)     0.4807 (0.0182)     0.4798 (0.0103)
Driver Gene
P1             0.1200     0.1198
P2             0.4600     0.4617
P3             0.4200     0.4185
P(A1|A2)       0.1800     0.1994
P(A2|A1)       0.2800     0.2623
PL(A1)         0.3000     0.3192
PR(A1)         0.4000     0.3821
Mutation Rate
u1             0.0500
u2             0.0800
Zygotic Disequilibria
D1                       -0.0009 (0.0043)    -0.0008 (0.0031)    -0.0010 (0.0021)
D2                        0.0399 (0.0092)     0.0398 (0.0065)     0.0400 (0.0041)
D3                       -0.0399 (0.0083)    -0.0401 (0.0056)    -0.0400 (0.0038)
D4                        0.0011 (0.0027)     0.0012 (0.0020)     0.0011 (0.0013)
D5                       -0.0001 (0.0005)    -0.0001 (0.0005)    -0.0001 (0.0004)
D6                        0.0001 (0.0006)     0.0001 (0.0006)     0.0001 (0.0005)
D7                       -0.0399 (0.0114)    -0.0399 (0.0086)    -0.0396 (0.0056)
D8                        0.0395 (0.0114)     0.0399 (0.0086)     0.0395 (0.0056)
D9                       -0.0011 (0.0029)    -0.0011 (0.0022)    -0.0010 (0.0015)
D10                       0.0014 (0.0034)     0.0010 (0.0025)     0.0010 (0.0016)
Power (%)


Table 2-8. The maximum likelihood estimates (MLEs) of allele frequencies, configuration
frequencies, genotype frequencies, mutation rates, and zygotic disequilibria
when the association between a cancer and the SNP is determined by all
different disequilibrium coefficients in case 2. The standard errors of the MLEs
calculated from 500 simulation runs are given in parentheses.

                                      Sample Size
Parameter      True       400                 800                 2000
SNP Frequencies
Q1             0.1600     0.1600 (0.0178)     0.1600 (0.0130)     0.1597 (0.0085)
Q2             0.4800     0.4790 (0.0243)     0.4800 (0.0179)     0.4801 (0.0113)
Driver Gene
P1             0.1200     0.1199
P2             0.4600     0.4600
P3             0.4200     0.4201
P(A1|A2)       0.1800     0.2050
P(A2|A1)       0.2800     0.2550
PL(A1)         0.3000     0.3249
PR(A1)         0.4000     0.3749
Mutation Rate
u1             0.0500
u2             0.0800
Zygotic Disequilibria
D1             0.0400     0.0398 (0.0080)     0.0403 (0.0059)     0.0400 (0.0038)
D2                       -0.0402 (0.0074)    -0.0400 (0.0051)    -0.0399 (0.0035)
D3                       -0.0001 (0.0009)    -0.0001 (0.0007)    -0.0001 (0.0007)
D4                       -0.0007 (0.0030)    -0.0011 (0.0020)    -0.0010 (0.0012)
D5                        0.0011 (0.0038)     0.0011 (0.0026)     0.0012 (0.0017)
D6                       -0.0297 (0.0068)    -0.0302 (0.0051)    -0.0299 (0.0032)
D7                        0.0007 (0.0094)     0.0016 (0.0068)     0.0008 (0.0048)
D8                        0.0302 (0.0109)     0.0296 (0.0077)     0.0301 (0.0050)
D9                        0.0001 (0.0008)     0.0001 (0.0007)     0.0001 (0.0006)
D10                      -0.0012 (0.0041)    -0.0011 (0.0029)    -0.0010 (0.0019)
Power (%)







3.1 Introduction

Aneuploidy is defined as any chromosome number that is not an exact multiple of the

haploid number, or describes the occurrence of one or more extra or missing chromosomes

leading to an unbalanced chromosome complement. Over 100 years ago, aneuploidy

was proposed to cause cancer. However, the aneuploidy hypothesis has been abandoned

since the 1960s, in favor of the gene mutation hypothesis, because no specific chromosomal

rearrangements have been described that correlate with specific cancer types. Nevertheless, a growing

body of evidence supports a role for aneuploidy in the genetic underpinning of cancer

(Duesberg et al., 1999; Hanks and Rahman, 2005; Stock and Bialy, 2003; Weaver and

Cleveland, 2006). According to extensive work by Duesberg and his group, aneuploidy

offers a simple, coherent explanation of all cancer-specific phenotypes in terms of the

following aspects:

(1) Aneuploidy is confirmed to generate abnormal phenotypes, such as Down syndrome
in humans and cancer in animals;

(2) The degrees of aneuploidy are correlated with the resulting phenotype abnormalities;

(3) Since aneuploidy imbalances the highly balance-sensitive components of the spindle
apparatus it destabilizes symmetrical chromosome segregation;

(4) Both non-genotoxic and genotoxic carcinogens can cause aneuploidy by physical or
chemical interaction with mitosis proteins.

It is concluded that when aneuploidy exceeds a certain threshold it is sufficient to cause all

cancer-specific phenotypes (Duesberg et al., 2007; Duesberg, 2007). Kops et al. (2005) outlined

the cytological mechanisms for aneuploid formation from checkpoint signalling.

Normally, chromosome mis-segregation can be prevented by the mitotic

checkpoint by delaying cell-cycle progression through mitosis until all chromosomes have

successfully made spindle-microtubule attachments. Any defect in the mitotic checkpoint

will generate aneuploidy, which might facilitate tumorigenesis and cause increased resistance to

anti-cancer therapies (Suijkerbuijk and Kops, 2008).

In the preceding chapter, I have proposed a model for detecting somatic mutations

that cause cancer. This mutation hypothesis, along with the hypothesis that cancer results

from aneuploidy, presents two alternative views about the genetic control of cancer. In the

past decade, an intense debate has arisen about which hypothesis can better explain the

causes of cancer formation (Li et al., 2000; Parris, 2005), although these two hypotheses

may cooperate to govern carcinogenesis (Zhang et al., 2006). In this chapter, I will

present a statistical model for detecting the aneuploidy control mechanism of cancer. The

model is constructed with a random set of samples from a natural cancerous population.

By assuming the duplication of chromosomes derived from different parents, the model is

able to test the genetic imprinting of alleles due to their different parental origins. If

the aneuploidy hypothesis continues to be confirmed, this model will provide a timely

tool to quantify the genetic effects of aneuploidy loci on cancer susceptibility with the

genetic data collected from the cancer genome project. Also, by comparing with the

model for detecting somatic mutations, this new model will help to determine the relative

importance of these two hypotheses in cancer studies.

3.2 Model

3.2.1 Study Design

Suppose there is a natural population from which cancerous patients are sampled.

At particular regions of the genome, chromosomes are multiplied to form a triploid,

tetraploid, or a polyploid of any higher order. To describe our idea simply, we only

consider a triploid and tetraploid. Each patient is typed at duplicated or triplicated

chromosomal segments with molecular markers, although the parental origin of chromosomal

duplication or triplication is unknown. A phenotype that defines cancer is measured for

all subjects. A model will be derived to distinguish between the genetic effects of alleles

inherited from the maternal (M) and paternal (P) parents.

3.2.2 Chromosome Duplication

Triploid Model: Consider a gene of interest A, with two alleles A and a, on a

chromosome (say chromosome 3). Figure 3-1 describes the process of a pair of normal

chromosomes that are duplicated into a triploid for a portion of chromosome 3. For

a normal diploid, the genotypes at this gene may be AA, Aa, or aa. Considering

parent-specific origins of alleles, we use A|A, A|a, a|A, and a|a to denote the configurations

of these genotypes, respectively, where the left- and right-side alleles of the vertical lines

represent two alleles from different parents. Of these, configurations A|a and a|A are

genotypically observed as the same genotype Aa. When the chromosomal segment that

harbors this gene is duplicated for only one single chromosome, triploids with two copies

from one parent and the third copy from the other parent will result. It is possible that a

single chromosome derived from maternal and paternal parents may both be duplicated,

but with a different frequency. Thus, through such a duplication, four configurations in

the normal diploid will form a total of eight triploid configurations, which are classified

into four different genotypes:

(1) AAA including configurations AA|A, duplicated from the left-side parent, and A|AA,
duplicated from the right-side parent, of configuration A|A;

(2) AAa including configurations AA|a, duplicated from the left-side parent of
configuration A|a, and a|AA, duplicated from the right-side parent of configuration a|A;

(3) Aaa including configurations aa|A, duplicated from the left-side parent of configuration
a|A, and A|aa, duplicated from the right-side parent of configuration A|a;

(4) aaa including configurations aa|a, duplicated from the left-side parent, and a|aa,
duplicated from the right-side parent, of configuration a|a.

Let p and q (p + q = 1) be the allele frequencies of A and a in the original population

before chromosome duplication. For a natural population at Hardy-Weinberg equilibrium

(HWE), genotype frequencies can be expressed as p^2 for genotype AA, 2pq for genotype

Aa, and q^2 for genotype aa.

Figure 3-1. Diagram for chromosome duplication and the resulting changes of genotypes at
an aneuploid gene A. Alleles in different colors denote their parent-specific
origins separated by the vertical lines.

Theorem: For an HWE diploid population, chromosome duplication operating on
particular loci can violate the equilibrium status of the population.
Proof: Let g and h denote the proportions of alleles A and a that are duplicated,
respectively. Thus, of diploid genotype AA, a proportion g will become AAA, with
the remaining proportion 1 - g unduplicated. Similarly, proportions h and 1 - h will
be aaa and aa after duplication for diploid genotype aa. Diploid genotype Aa will have
three possibilities, AAa with a proportion of g, Aaa with a proportion of h, and Aa with a
proportion of 1 - g - h. In a duplicated population purely composed of triploids, we will
have allele frequencies for A and a, respectively, as

$$\tilde{p} = p^2 g + \tfrac{2}{3}(2pqg) + \tfrac{1}{3}(2pqh) = p\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right),$$

$$\tilde{q} = q^2 h + \tfrac{2}{3}(2pqh) + \tfrac{1}{3}(2pqg) = q\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right).$$

The genotype frequency of triploid AAA in the duplicated population is expressed as

$$P(AAA) = p^2 g = \tilde{p}^3 \times \frac{g}{p\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right)^3} = \tilde{p}^3 + \tilde{p}^3\left(\frac{g}{p\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right)^3} - 1\right).$$

Similarly, we have the genotype frequency of triploid aaa as

$$P(aaa) = q^2 h = \tilde{q}^3 \times \frac{h}{q\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right)^3} = \tilde{q}^3 + \tilde{q}^3\left(\frac{h}{q\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right)^3} - 1\right).$$

Thus, unless $g = p\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right)^3$ and $h = q\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right)^3$, the duplicated

population will be at Hardy-Weinberg disequilibrium.
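
The statement of the theorem can be checked numerically. The Python sketch below evaluates the allele frequencies of the duplicated population and the gap between the triploid homozygote frequencies and their cubed allele frequencies. It assumes that the share of allele A is 2/3 in AAa and 1/3 in Aaa (and the reverse for allele a), uses the expressions of the proof as written without renormalising them to sum to one, and the function names and the example values of g and h are hypothetical.

```python
def duplicated_allele_freqs(p, g, h):
    """Allele frequencies of A and a among triploids, given the diploid
    allele frequency p of A and duplication proportions g (A) and h (a)."""
    q = 1.0 - p
    p_new = p * (p * g + 4.0 / 3.0 * q * g + 2.0 / 3.0 * q * h)
    q_new = q * (q * h + 4.0 / 3.0 * p * h + 2.0 / 3.0 * p * g)
    return p_new, q_new


def hwe_deviation(p, g, h):
    """Return (P(AAA) - p_new**3, P(aaa) - q_new**3); both are zero only
    when the duplicated population remains at equilibrium."""
    q = 1.0 - p
    p_new, q_new = duplicated_allele_freqs(p, g, h)
    return p * p * g - p_new ** 3, q * q * h - q_new ** 3
```

For example, `hwe_deviation(0.6, 0.3, 0.4)` returns two clearly non-zero values, so a population that was at HWE before duplication is no longer in equilibrium afterwards.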

This theorem shows that traditional HWE theory for population genetic studies will

not be useful for cancer gene identification. Meanwhile, this theorem provides a foundation

for my model to perform association studies of cancer.

3.2.3 Quantitative Genetic Parameters

Our analysis will be based on the duplicated population purely composed of triploids,

from which a total of N subjects are sampled at random. Each subject is typed for a

series of markers throughout the genome and phenotyped for a cancer trait. Let n_k denote

the observation of triploid genotype k (k = 1 for AAA, 2 for AAa, 3 for Aaa, and 4

for aaa). It has been proven that chromosomal duplication may violate the Hardy-Weinberg

equilibrium of the population. Thus, genotype frequencies, P_k, are expressed as the

products of allele frequencies plus disequilibrium parameters. Let D1 and D2 denote the

Hardy-Weinberg disequilibrium coefficients associated with allele A and a, respectively, at

the duplicated gene. Thus, the frequencies of the four genotypes can be expressed in terms of

allele frequencies and disequilibria (Table 3-1).

Table 3-1. The changes of genotypes and genotype frequencies after chromosomal
duplication.

Non-duplicated                       Duplicated
Genotype   Frequency   Duplication   Genotype   Frequency                     Observation
AA         p^2         =>            AAA        P1 = p^3 + D1                 n1
Aa         2pq         =>            AAa        P2 = 3p^2 q - 2D1 + D2        n2
                                     Aaa        P3 = 3p q^2 - 2D2 + D1        n3
aa         q^2         =>            aaa        P4 = q^3 + D2                 n4
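
The parameterisation of Table 3-1 can be made concrete with a short Python sketch. The first function evaluates the four triploid genotype frequencies from p, D1 and D2; the second inverts the table by allele counting (p = P1 + 2 P2/3 + P3/3, D1 = P1 - p^3, D2 = P4 - q^3). The inversion is a direct algebraic consequence of the table and is offered as a cross-check, not as the EM procedure used in this chapter; both function names are hypothetical.

```python
def triploid_genotype_freqs(p, d1, d2):
    """Frequencies (P1, P2, P3, P4) of AAA, AAa, Aaa, aaa as in Table 3-1."""
    q = 1.0 - p
    return (
        p ** 3 + d1,
        3.0 * p * p * q - 2.0 * d1 + d2,
        3.0 * p * q * q - 2.0 * d2 + d1,
        q ** 3 + d2,
    )


def recover_parameters(freqs):
    """Recover (p, D1, D2) from genotype frequencies by allele counting."""
    p1, p2, p3, p4 = freqs
    p = p1 + 2.0 * p2 / 3.0 + p3 / 3.0  # D1 and D2 cancel in this sum
    q = 1.0 - p
    return p, p1 - p ** 3, p4 - q ** 3
```

With the values later used in the simulation (p = 0.6, D1 = 0.08, D2 = 0.06), the four frequencies are 0.296, 0.332, 0.248 and 0.124, which sum to one, and `recover_parameters` returns the original three parameters.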

The same triploid genotype at the duplicated gene may have different values when the

expression of its alleles depends on the origin of parents. For example, triploid genotype

AAA may be formed from normal diploid genotype AA when either the maternally- (M)

or paternally-derived allele (P) is doubled. Thus, the configuration of genotype AAA

can be either $A_M A_M A_F$ or $A_M A_F A_F$, where the subscripts denote the parental origin of

alleles. Table 3-2 gives the genotypic values ($\mu_{k1}$, $\mu_{k2}$) of the two possible configurations of each

triploid genotype. These genotypic values are partitioned into eight different components,

the overall mean ($\mu$), additive genetic effect (a), dominance genetic effects

of AA over a (d) and A over aa (d'), genetic imprinting effects due to different origins

of alleles ($\lambda$), and interactions between the additive and imprinting effects ($I_{a\lambda}$), the d

dominance and imprinting effects ($I_{d\lambda}$), and the d' dominance and imprinting effects ($I_{d'\lambda}$).


For each triploid genotype, the relative proportions of two underlying configurations

can be different, depending on the rate of the duplication of parent-specific chromosomes.

Let u and 1 - u be the proportions of the duplication of allele A derived from the maternal

and paternal parents, respectively. Similarly, let v and 1 - v be the proportions of the

duplication of allele a derived from the maternal and paternal parents, respectively (Table

3-2). These proportions can be estimated from genotype data.

Table 3-2. Genotypic values and proportions of different configurations of a triploid
genotype at a duplicated gene.

Duplicated                                  Duplication
Genotype   Configuration   Genetic Value   Rate

A AMAMA f /il p + a + A + !I., u
AAA P12 = p +a A- JA 1 a

aMAFA [1122 = p +!a A -- d t U

Aaa AM aF 321 + + d + A + -IaA d' 1
{AmaFaF [1^31 /-I !a + d' A + Il|A Idx 1 V

aMaMAF P-132 = a + d'+ A aA + Id'A V v

( aMapaF 41 = p a A + aA 1 V
aaMaMapF 42 = p + |aA

3.2.4 Estimation

It is straightforward to estimate the frequency of a triploid genotype with genotype

observations using

$$\hat{P}_k = \frac{n_k}{N}, \quad k = 1, \ldots, 4, \tag{3-1}$$

which is derived from a polynomial likelihood. The EM algorithm is implemented to

estimate the allele frequencies (p and q) and HWD coefficients from the triploid genotype

observations of the aneuploid population sampled (Table 3-1). It is described as follows:

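In the E step, we calculate the proportion of an allele within a triploid genotype using

$$\phi_{1A} = \frac{3p^3}{P_1}, \quad \phi_{2A} = \frac{6p^2 q}{P_2}, \quad \phi_{3A} = \frac{3pq^2}{P_3} \tag{3-2}$$

for allele A, and

$$\phi_{2a} = \frac{3p^2 q}{P_2}, \quad \phi_{3a} = \frac{6pq^2}{P_3}, \quad \phi_{4a} = \frac{3q^3}{P_4} \tag{3-3}$$

for allele a.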

In the M step, the allele frequencies are then estimated with the following equations:

$$p = \frac{n_1\phi_{1A} + n_2\phi_{2A} + n_3\phi_{3A}}{3(n_1 + n_2 + n_3 + n_4)}, \tag{3-4}$$

$$q = \frac{n_2\phi_{2a} + n_3\phi_{3a} + n_4\phi_{4a}}{3(n_1 + n_2 + n_3 + n_4)}, \tag{3-5}$$

and the HWD coefficients are calculated by

21- 1
D = 2 1A (3 6)
P2 P3
1 1
D2 = 2 T4. (37)
P3 P2

To estimate the duplication rate and genotypic values for each configuration, we

need to formulate a mixture model because each triploid genotype contains two unknown

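configurations. The likelihood of genotype observations at the duplicated gene (Table 3-1)

and phenotypic values (y) measured for all subjects is constructed as

$$\begin{aligned}
L(\Omega \mid y) = {} & \prod_{i=1}^{n_1}\left[u f_{11}(y_{1i}) + (1-u) f_{12}(y_{1i})\right] \prod_{i=1}^{n_2}\left[u f_{21}(y_{2i}) + (1-u) f_{22}(y_{2i})\right] \\
& \times \prod_{i=1}^{n_3}\left[(1-v) f_{31}(y_{3i}) + v f_{32}(y_{3i})\right] \prod_{i=1}^{n_4}\left[(1-v) f_{41}(y_{4i}) + v f_{42}(y_{4i})\right],
\end{aligned} \tag{3-8}$$

where $\Omega = (u, v, \{\mu_{k1}, \mu_{k2}\}_{k=1}^{4}, \sigma^2)$ is the vector of unknown parameters, and

$f_{kj}(y_{ki}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(y_{ki} - \mu_{kj})^2}{2\sigma^2}\right]$ (k = 1, ..., 4; j = 1, 2) is the normal density of the phenotypic trait

with mean $\mu_{kj}$ and variance $\sigma^2$.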

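To obtain the maximum likelihood estimates (MLEs) of the parameters, we

apply the EM algorithm to the likelihood (3-8). In the E step, the posterior

probability with which a subject i with a specific triploid genotype has a configuration

j is calculated using

$$\begin{aligned}
\Phi_{11i} &= \frac{u f_{11}(y_{1i})}{u f_{11}(y_{1i}) + (1-u) f_{12}(y_{1i})}, &
\Phi_{12i} &= \frac{(1-u) f_{12}(y_{1i})}{u f_{11}(y_{1i}) + (1-u) f_{12}(y_{1i})}, \\
\Phi_{21i} &= \frac{u f_{21}(y_{2i})}{u f_{21}(y_{2i}) + (1-u) f_{22}(y_{2i})}, &
\Phi_{22i} &= \frac{(1-u) f_{22}(y_{2i})}{u f_{21}(y_{2i}) + (1-u) f_{22}(y_{2i})}, \\
\Phi_{31i} &= \frac{(1-v) f_{31}(y_{3i})}{(1-v) f_{31}(y_{3i}) + v f_{32}(y_{3i})}, &
\Phi_{32i} &= \frac{v f_{32}(y_{3i})}{(1-v) f_{31}(y_{3i}) + v f_{32}(y_{3i})}, \\
\Phi_{41i} &= \frac{(1-v) f_{41}(y_{4i})}{(1-v) f_{41}(y_{4i}) + v f_{42}(y_{4i})}, &
\Phi_{42i} &= \frac{v f_{42}(y_{4i})}{(1-v) f_{41}(y_{4i}) + v f_{42}(y_{4i})}.
\end{aligned} \tag{3-9}$$

In the M step, by solving the log-likelihood equations, the parameters are estimated with

the calculated posterior probabilities, i.e.,

$$\mu_{kj} = \frac{\sum_{i=1}^{n_k} \Phi_{kji}\, y_{ki}}{\sum_{i=1}^{n_k} \Phi_{kji}}, \quad k = 1, \ldots, 4;\ j = 1, 2, \tag{3-10}$$

$$\sigma^2 = \frac{1}{N} \sum_{k=1}^{4} \sum_{j=1}^{2} \sum_{i=1}^{n_k} \Phi_{kji}\,(y_{ki} - \mu_{kj})^2, \tag{3-11}$$

$$u = \frac{\sum_{i=1}^{n_1} \Phi_{11i} + \sum_{i=1}^{n_2} \Phi_{21i}}{n_1 + n_2}, \tag{3-12}$$

$$v = \frac{\sum_{i=1}^{n_3} \Phi_{32i} + \sum_{i=1}^{n_4} \Phi_{42i}}{n_3 + n_4}. \tag{3-13}$$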
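
The E and M steps for the mixture likelihood (3-8) can be sketched in Python as follows. This is a minimal illustration rather than the program used for the analyses: it assumes the mixing proportions (u, 1 - u) for genotypes AAA and AAa and (1 - v, v) for Aaa and aaa, a common variance, and closed-form M-step updates (posterior-weighted means, pooled posterior-weighted variance, averaged posterior probabilities). All function names, the data layout (a dict from genotype index k to a list of phenotypes) and the `simulate` helper are hypothetical.

```python
import math
import random


def normal_pdf(y, mu, sigma2):
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def weights(k, u, v):
    """Mixing proportions of configurations j = 1, 2 within genotype k."""
    return (u, 1.0 - u) if k in (1, 2) else (1.0 - v, v)


def log_likelihood(y, u, v, mu, sigma2):
    total = 0.0
    for k in (1, 2, 3, 4):
        w = weights(k, u, v)
        for yi in y[k]:
            total += math.log(sum(w[j] * normal_pdf(yi, mu[k][j], sigma2) for j in (0, 1)))
    return total


def em_step(y, u, v, mu, sigma2):
    # E step: posterior probability of each configuration for every subject.
    post = {}
    for k in (1, 2, 3, 4):
        w = weights(k, u, v)
        post[k] = []
        for yi in y[k]:
            num = [w[j] * normal_pdf(yi, mu[k][j], sigma2) for j in (0, 1)]
            s = num[0] + num[1]
            post[k].append((num[0] / s, num[1] / s))
    # M step: posterior-weighted means, pooled variance, duplication proportions.
    new_mu, ss, n = {}, 0.0, 0
    for k in (1, 2, 3, 4):
        new_mu[k] = []
        for j in (0, 1):
            wsum = sum(ph[j] for ph in post[k])
            new_mu[k].append(sum(ph[j] * yi for ph, yi in zip(post[k], y[k])) / wsum)
        for ph, yi in zip(post[k], y[k]):
            ss += sum(ph[j] * (yi - new_mu[k][j]) ** 2 for j in (0, 1))
        n += len(y[k])
    new_u = (sum(ph[0] for ph in post[1]) + sum(ph[0] for ph in post[2])) / (len(y[1]) + len(y[2]))
    new_v = (sum(ph[1] for ph in post[3]) + sum(ph[1] for ph in post[4])) / (len(y[3]) + len(y[4]))
    return new_u, new_v, new_mu, ss / n


def fit(y, u, v, mu, sigma2, max_iter=500, tol=1e-8):
    """Iterate E and M steps until the log-likelihood gain falls below tol."""
    ll = log_likelihood(y, u, v, mu, sigma2)
    for _ in range(max_iter):
        u, v, mu, sigma2 = em_step(y, u, v, mu, sigma2)
        new_ll = log_likelihood(y, u, v, mu, sigma2)
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    return u, v, mu, sigma2, ll


def simulate(n_per_genotype, u, v, mu, sigma2, seed=0):
    """Draw phenotypes for each genotype from its two-component mixture."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    y = {}
    for k in (1, 2, 3, 4):
        w = weights(k, u, v)
        y[k] = [rng.gauss(mu[k][0] if rng.random() < w[0] else mu[k][1], sd)
                for _ in range(n_per_genotype)]
    return y
```

As with any mixture model, the result depends on the starting values, and the two configurations within a genotype can swap labels if the initial means are poorly chosen; several starts are advisable in practice.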





A loop of the E and M steps is iterated between equations (3-9) and (3-10), (3-11),

(3-12) and (3-13). Thus, the parameter estimates are obtained when the estimates

converge to stable values. The MLEs of genetic effects can be obtained by solving a

system of linear equations given in Table 3-2, i.e.,

I I1 +112 P41 P42)

(21 + 22) Iu1 + P12) + |(p41 + 42)

(31 + P32) I(41 + P142) + 1(11 + P12)





A '( ( 12) 41 42) (317)

'ciA (P 11- 12) + |(41 -P42) (3 18)

iA (P21 22) P(1- 12) + 1(p41 P42) (3 -t9)

d'A (P32- 31) + 3 (41 42) ( 12) (3-20)

3.2.5 Hypothesis Tests

How a duplicated gene deviates from Hardy-Weinberg equilibrium can be tested by

formulating the null hypothesis as follows:

H0 : D1 = D2 = 0,

under which genotype frequencies can be estimated from the estimated allele frequencies

using equation (3-1). The log-likelihood ratio calculated under the null and alternative

hypotheses follows a X2-distribution with 2 degrees of freedom. It is interesting to test

the two disequilibria separately. Under the null hypothesis Ho : D1 = 0, genotype

frequencies are estimated using equation (3-1), but with a constraint P3 = p3, in addition

to constraint P1 + P2 + P3 + P4 = 1. Similarly, genotype frequencies are estimated with a

constraint P4 = q3 for testing whether D2 = 0.
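As a small worked illustration of the likelihood ratio test with 2 degrees of freedom, the sketch below converts two maximized log-likelihoods into a test statistic and a p-value. The function names are hypothetical; the only fact used is that the upper tail of a χ²-distribution with 2 degrees of freedom is exp(-x/2).

```python
import math


def lrt_statistic(loglik_null, loglik_alt):
    """Likelihood ratio statistic: twice the gain in maximized log-likelihood."""
    return 2.0 * (loglik_alt - loglik_null)


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


# Example with made-up log-likelihood values.
statistic = lrt_statistic(-105.0, -100.0)
p_value = chi2_sf_2df(statistic)
```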

Whether the duplicated gene is significantly associated with cancer susceptibility can be tested using the null hypothesis H0: μ_kj ≡ μ for k = 1, ..., 4; j = 1, 2. The additive effect and two types of dominance effects can be tested jointly or separately by formulating the relevant null hypotheses based on equations (3-14), (3-15), and (3-16). The imprinting effect and its interactions with the additive and the two dominance effects can be tested with null hypotheses that set the corresponding effect to zero, constructed with equations (3-17), (3-18), (3-19), and (3-20), respectively.

The model can also be used to test the significance of the duplication rate for a parent-specific chromosome by formulating the null hypothesis H0: u = 1 or H0: v = 1.

This information helps to understand the genetic structure and evolutionary process of

cancer risk.

3.3 Application to Simulated Data

Simulation studies were used to investigate the statistical properties of the model

in terms of estimation precision, power, and false positive rates. We simulated a cancer

population of triploids for a portion of a chromosome. The allele frequencies at a triploid

locus are p = 0.6 and q = 0.4. The HWD coefficients at this locus are assumed as

D1 = 0.08, D2 = 0.06. By assuming the duplication rates of 0.3 and 0.4 for two parental

chromosomes, respectively, the distribution of four different triploid genotypes AAA, AAa,

Aaa, and aaa can be simulated. The phenotypic values of cancer traits were simulated

by summing the additive, dominance, imprinting, and their interaction effects given with

particular values and the errors of measurement within each triploid genotype following

a normal distribution with variance scaled by a heritability of 0.1 and 0.4, respectively.

Different sample sizes, 400, 800, and 2,000 are considered.
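The scaling of the residual variance by heritability can be sketched as follows. The sketch assumes the usual definition H² = σ_g² / (σ_g² + σ_e²), which the text does not state explicitly; the function names and the example genotype frequencies and values are hypothetical and are not the simulation settings of this chapter.

```python
import random


def genetic_variance(freqs, values):
    """Variance of genotypic values across genotypes with the given frequencies."""
    mean = sum(f * g for f, g in zip(freqs, values))
    return sum(f * (g - mean) ** 2 for f, g in zip(freqs, values))


def residual_variance(var_g, h2):
    """Residual variance implied by heritability h2 = var_g / (var_g + var_e)."""
    return var_g * (1.0 - h2) / h2


def simulate_phenotypes(freqs, values, h2, n, seed=0):
    """Draw n genotypes and add normal errors scaled to the target heritability."""
    rng = random.Random(seed)
    sd_e = residual_variance(genetic_variance(freqs, values), h2) ** 0.5
    genotypes = rng.choices(range(len(freqs)), weights=freqs, k=n)
    return [(g, values[g] + rng.gauss(0.0, sd_e)) for g in genotypes]
```

With a heritability of 0.1 the residual variance is nine times the genetic variance, whereas with 0.4 it is only 1.5 times as large, which is why the less heritable trait needs far more subjects.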

The model was used to estimate allele frequencies, HWD, parent-specific duplication

rates, and genetic effects for a cancer population (Table 3-3). As expected, allele

frequencies can be precisely estimated with a modest sample size (400). A larger sample

size (say, 800) is needed to provide precise estimation of the two HWD coefficients D1 and D2.

Because duplication rates determine the mixture proportions for each triploid genotype,

their estimates will be affected by the heritability level. If the cancer trait has a larger

heritability, then a sample size of 400 will provide good estimates of duplication rates. For

a less heritable trait, a large sample size (even 2000) is needed for good estimation. The

additive effect can be generally well estimated, but the estimates of the dominant effects

need much larger sample size. The estimation precision of the imprinting effect seems

to be intermediate between that of the additive and dominant effects. It is interesting

to see that the additive x imprinting interaction effect can be better estimated than the

imprinting effect alone. It is hard to estimate the interactions between the dominant and

















imprinting effects unless an extremely large sample size (> 2000) is used. Overall, the impact of heritability is larger than that of sample size, suggesting that it is important to allocate limited resources to measuring phenotypes precisely rather than to increasing sample size.


The power to detect the overall genetic effect and imprinting effect was investigated.

In general, the model has a great power for the identification of aneuploid loci causing

cancer. To achieve adequate power for imprinting effect detection, a large sample size

and/or large heritability is required. Overall, a sample size of 400 with a heritability of

0.4 can reach power of over 0.85 for the detection of imprinting effects. We also performed

simulation studies to examine the false positive rates for detecting overall genetic effects

and imprinting effects at aneuploid loci. It appears that in each case the false positive

rates can be controlled to be below 5-10%.

3.4 Discussion

Over the past 100 years since Theodor Boveri hypothesized that mitotic defects that

result in tetraploidy promote oncogenesis ('\ .il[ i-pacher, 2008), a tremendous concern has

been given to explore the genetic cause of tumorigenesis. It has been partly established

that aneuploidy has an effect on proliferation and survival of tumors. The recent discovery

of components of the mitotic checkpoint, as well as the realization that many of the classic

tumour suppressors and oncogene products regulate mitotic progression, has renewed

interest in the role of aneuploidy in tumorigenesis (Hanks and Rahman, 2005; Kops et al.,

2005; Suijkerbuijk and Kops, 2008). With the completion of the human genome projects

and HapMap project, there is a pressing need for the development of statistical models for

estimating the genetic effect of aneuploid loci on cancer risk.

In this article, we present a statistical strategy for detecting the genetic control

of cancer traits through genotyping aneuploids of cancer cells. The model proposed

presents two novelties. First, it has for the first time integrated the latest discovery of

cancer genetic studies with statistical principles and directly pushed the modeling effort

of cancer gene identification at the frontier of cancer biology. The experimental design

used is founded on biologically relevant hypotheses from which data can be collected

in an effective way. The derived closed forms for the EM algorithm to estimate various

parameters will provide an efficient computation for any data set. Second, the model

capitalizes on traditional quantitative genetic theory, allowing the partition of overall

genetic control into different components. Particularly, we are able to estimate and test

the effect of genetic imprinting on cancer risk and, thus, draw a detailed picture of genetic

control triggered from different parental chromosomes. The model can also characterize

the interactions of additive and dominant effects with imprinting effects, helping to gain a

better insight into the complexity of the genetic architecture of cancer.

We performed computer simulation to examine the statistical properties of the model.

Results from simulation studies were investigated, from which an appropriate sample size

is determined for a cancer trait with a particular heritability. Analyses of model power

and false positive rates validated the possible usefulness of the model when practical data

sets are available. Through a simple mathematical proof, I found that the Hardy-Weinberg

equilibrium of an original population can be destroyed when some chromosomes are

duplicated. Indeed, it is precisely because of this Hardy-Weinberg disequilibrium that cancer genes can be detected from association studies of genome-wide SNP data.

The idea of the model can be extended to several more complicated situations.

First, the aneuploidy control of cancer may be derived from high-order aneuploids, such as tetraploids. A high-order polyploid not only contains more allelic combinations, but also involves a larger amount of missing data due to the duplication of different chromosomes with unknown parental origins. To model the tetraploidy control of cancer, a more sophisticated algorithm is required to obtain efficient estimates of parameters. Second, different aneuploid loci responsible for cancer traits may be associated in the duplication population and interact in a coordinated manner. Modeling of multi-locus associations and multi-locus epistasis deserves further investigation, since these pieces of information can better explain the genetic variation of cancer than single loci. Third, other factors, such as sex, race, and lifestyle, also contribute to cancer. It is crucial to incorporate these factors and study the effects of each of them and their interactions with genes in tumorigenesis.


Aneuploidy has long been found to be a common characteristic of cancer cells. A

growing body of evidence suggests that tumorigenesis may be promoted by aneuploidy, at least at low frequency, arising from errors of the mitotic checkpoint, the major cell cycle control mechanism that acts to prevent chromosome missegregation. However, so far, there has not been a model that can be used to test this hypothesis. Here, I develop a statistical model for testing the association between aneuploid loci and cancer risk in

a genome-wide association study. The model incorporates quantitative genetic principles

into a mixture-model framework in which various genetic effects, including additive,

dominant, imprinting, and their interactions, can be estimated by implementing the EM

algorithm. I formulate a series of hypothesis tests for the pattern of the genetic control

of cancer through aneuploid loci. Simulation studies were performed to investigate the

statistical behavior of the model.


4.1 Introduction

Genetic imprinting arises from a gene when either the maternally or paternally derived copy of it is expressed while the other copy is silenced (Reik and Walter, 2001). Caused by an epigenetic mark of differential methylation set during gametogenesis, genetic imprinting has been shown to play a pivotal role in regulating the formation, development,

function, and evolution of complex traits and diseases (Constancia et al., 2004; Isles and

Wilkinson, 2000; Itier et al., 1998; Li et al., 1999; Wilkinson et al., 2007). While most

studies of genetic imprinting focus on the epigenetic and molecular mechanisms of this

phenomenon (Sha, 2008), information about the number and distribution of imprinted

genes and their epistatic interactions is still very limited, limiting our ability to predict

the effects of imprinting genes on the diversity of a biological trait or process. Several

authors have started to use genome-wide association and linkage studies to identify the

regions of the genome that contain imprinted sequence variants and further understand

the epigenetic variation of complex traits (Cheverud et al., 2008; De Koning et al., 2000; Liu et al., 2007; Wolf et al., 2008).

In a series of recent studies, Cheverud, Wolf, and colleagues categorized genetic imprinting into different types based on the pattern of its expression, i.e., maternal expression, paternal expression, bipolar dominance, polar overdominance, and polar underdominance (Cheverud et al., 2008; Wolf et al., 2008). With a three-generation F2

design, they identified these types of imprinted quantitative trait loci (iQTL) affecting

body weight and growth in mice, displaying much more complex and diverse effect

patterns than previously assumed. A different design based on reciprocal backcrosses was

proposed to test and estimate the distribution of iQTL responsible for physiological

traits related to endosperm development in maize (Li et al., 2007). By modeling

identical-by-descent relationships in multiple related families of canines, Liu et al. (2007)

derived a random effect model based on linkage analysis to scan the genome for the

existence of iQTL that affect canine hip dysplasia.

While epigenetic marks resulting in genetic imprinting can be generally stable in

an organism's lifetime, they may undergo reprogramming, i.e., a faithful clearing of the

epigenetic state established in the previous generation, in the new generation during

gametogenesis and early embryogenesis (Morgan et al., 2005; Sasaki and Matsui, 2008).

However, a growing body of evidence since the early 1980s indicates that genes may

escape such reprogramming and, thus, transmit their imprinting effects to subsequent generations

(Cropley et al., 2006; Dolinoy et al., 2006; McGrath and Solter, 1984; Morgan et al.,

1999; Skinner, 2008; Surani et al., 1984). Two fundamental questions will naturally arise

from this discovery: how common are imprinted genes of this type and how strong is the

evidence for their existence in humans and other organisms? If epigenetic changes through

imprinted genes can be inherited across generations, this would significantly alter the way

we think about the inheritance of phenotype (Whitelaw and Whitelaw, 2008; Youngson

and Whitelaw, 2008). Such transgenerational epigenetic inheritance, i.e., modifications of the chromosomes that pass to the next generation through gametes, may be related to health and disease through a mechanism for transmitting environmental exposure information that alters gene expression in the next generation(s) (Pembrey et al., 2006).

The identification of imprinted loci displaying transgenerational epigenetic inheritance

will be greatly helpful for addressing the two questions mentioned above, in a quest to

elucidate the detailed genetic architecture of complex traits and diseases.

The motivation of this chapter is to develop a novel strategy for identifying

imprinted genes and understanding the transgenerational changes of their effects with

a three-generation family design. This design samples multiple unrelated nuclear families,

each composed of the grandfather, grandmother, father, mother, and grandchildren,

from a natural population. The inheritance of alleles at a gene from a male or female

parent is traced by observing the segregation of the gene in the progeny generation.

We implement the EM algorithm to estimate the effects of imprinted genes and their

changes across generations. A testing procedure is proposed to study the pattern of

transgenerational epigenetic inheritance. The statistical behavior of the model is examined

through simulation studies.

4.2 Design

4.2.1 Sampling Strategies

Suppose there is a natural human population at Hardy-Weinberg equilibrium (HWE)

from which a panel of three-generation families, each composed of the grandfather,

grandmother, father, mother, and grandchildren, are sampled. Each member in a family

is typed for single nucleotide polymorphisms (SNPs) from the human genome. Consider

a SNP with two alleles, A with a frequency of p and a with a frequency of q, leading to three genotypes AA, Aa, and aa with the frequencies of p², 2pq, and q², respectively. In the grandparent generation, these three genotypes mate randomly to produce nine cross types (Table 4-1). Given a cross type, the genotypes of sons or daughters can be inferred. Here we first assume one sex (say, sons) in the second generation, although both sexes can be considered. The sons from a family serve as the fathers to mate with females, as the mothers, derived from a natural population with genotypes AA, Aa, and aa, characterized by frequencies p², 2pq, and q², respectively. Each of such second-generation families produces a certain number of grandchildren. The genotype frequencies in the third generation are derived according to Mendel's first law.
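The nine cross types and the offspring genotype frequencies they imply can be enumerated directly. The following is a minimal sketch under the stated assumptions of random mating and Hardy-Weinberg equilibrium; the function names and the ordering of cross types are choices made here for illustration and need not match Table 4-1.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")


def hwe_frequencies(p):
    """Genotype frequencies p^2, 2pq, q^2 under Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def prob_transmit_A(genotype):
    """Probability that a parent of the given genotype transmits allele A."""
    return {"AA": 1.0, "Aa": 0.5, "aa": 0.0}[genotype]


def offspring_distribution(father, mother):
    """Mendelian segregation of one biallelic locus for a given cross."""
    a_f, a_m = prob_transmit_A(father), prob_transmit_A(mother)
    return {
        "AA": a_f * a_m,
        "Aa": a_f * (1.0 - a_m) + (1.0 - a_f) * a_m,
        "aa": (1.0 - a_f) * (1.0 - a_m),
    }


def cross_types(p):
    """List of (father, mother, cross frequency, offspring distribution)."""
    freq = hwe_frequencies(p)
    return [
        (gf, gm, freq[gf] * freq[gm], offspring_distribution(gf, gm))
        for gf, gm in product(GENOTYPES, repeat=2)
    ]
```

Summing the offspring distributions over the nine cross types, weighted by cross frequency, returns p², 2pq, and q² again, as expected for a population at equilibrium.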

According to this design, the grandfathers and grandmothers are founders whose

parents are unknown. Alleles of sons from a first-generation family can be traced

directly or indirectly, but the females used to generate the second-generation family

are the founders with the unknown origin of alleles. For this reason, we will measure

the phenotype for sons from the first-generation families and grandchildren from the

second-generation families. This design will allow us to characterize imprinting effects of a

gene in the second- and third-generations.

[Table 4-1 appeared here; it lists the nine first-generation mating types, but its contents are not recoverable from the source.]





4.2.2 Genetic Models

There are three genotypes, AA, Aa, and aa, for a biallelic gene according to

Mendelian segregation pattern. Considering the parent-of-origin of alleles, these genotypes are described by four configurations, A|A (coded as 2), A|a (coded as 1), a|A (coded as 1'), and a|a (coded as 0), where the symbol | is used to separate the maternally- (left) and paternally-derived alleles (right). The genotypic values of the four configurations in two different generations are defined as follows:

Configuration    Paternal                              Offspring
A|A              $\mu_{F2} = \mu_F + a_F$              $\mu_{O2} = \mu_O + a_O$,
A|a              $\mu_{F1} = \mu_F + d_F + i_F$        $\mu_{O1} = \mu_O + d_O + i_O$,        (4-1)
a|A              $\mu_{F1'} = \mu_F + d_F - i_F$       $\mu_{O1'} = \mu_O + d_O - i_O$,
a|a              $\mu_{F0} = \mu_F - a_F$              $\mu_{O0} = \mu_O - a_O$,

where $\mu_F$ and $\mu_O$ are the overall means of the paternal and offspring generations, $a_F$, $d_F$, and $i_F$ are the additive, dominant and imprinting genetic effects of the gene in the parental generation, and $a_O$, $d_O$, and $i_O$ are the additive, dominant and imprinting genetic effects of the gene in the offspring generation.

The difference in the genetic architecture of a complex trait between two different generations is described as

\[ \Delta a = a_F - a_O, \qquad (4\text{-}2) \]

\[ \Delta d = d_F - d_O, \qquad (4\text{-}3) \]

\[ \Delta i = i_F - i_O. \qquad (4\text{-}4) \]

By testing whether these differences are equal to zero jointly or individually, we can

determine the transgenerational changes of the pattern of genetic control. If a significant

imprinting effect is detected, we can test the type of genetic imprinting, i.e., parental or bipolar dominance, by incorporating the imprinting models of Cheverud et al. (2008).
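The parameterization of the genetic model can be made concrete with a short sketch. It assumes that the four configurations have values μ + a, μ + d + i, μ + d - i, and μ - a for A|A, A|a, a|A, and a|a, respectively (the minus sign on the imprinting effect for a|A is what distinguishes the two heterozygous configurations). The function names are hypothetical.

```python
def genotypic_values(mu, a, d, i):
    """Genotypic values of the four parent-of-origin configurations."""
    return {"A|A": mu + a, "A|a": mu + d + i, "a|A": mu + d - i, "a|a": mu - a}


def effects(values):
    """Recover (mu, a, d, i) from the four configuration values."""
    mu = (values["A|A"] + values["a|a"]) / 2.0
    a = (values["A|A"] - values["a|a"]) / 2.0
    d = (values["A|a"] + values["a|A"]) / 2.0 - mu
    i = (values["A|a"] - values["a|A"]) / 2.0
    return mu, a, d, i


def transgenerational_changes(parental_values, offspring_values):
    """Differences (delta_a, delta_d, delta_i) between the two generations."""
    _, a_f, d_f, i_f = effects(parental_values)
    _, a_o, d_o, i_o = effects(offspring_values)
    return a_f - a_o, d_f - d_o, i_f - i_o
```

For example, if the imprinting effect is 0.5 in the parental generation and 0.25 in the offspring generation while the other effects are unchanged, `transgenerational_changes` returns (0.0, 0.0, 0.25).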

4.2.3 Estimation

The grandfathers and grandmothers in the first generation, sampled from a natural population, constitute 3 x 3 = 9 mating types for the three genotypes. For the jth first-generation mating type listed in Table 4-1 (j = 1, ..., 9), let $N_j$ denote the number of families of this mating type. Each first-generation family may have one or multiple sons who serve as the fathers of the second generation. Those families in the second generation with the father derived from the jth first-generation mating type and the mother of a particular genotype from the natural population are summed together, denoted by $N^M_{jl}$ for mother genotype l (l = 2 for AA, 1 for Aa, and 0 for aa). Thus, we have a total of $N^M_l = \sum_{j=1}^{9} N^M_{jl}$ second-generation mothers who carry genotype l.

It is not difficult to derive the maximum likelihood estimate of allele frequency from

the three-generation family design as

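\[
\hat{p} = \frac{4N_1 + 3(N_2 + N_4) + 2(N_3 + N_5 + N_7) + (N_6 + N_8) + 2N^M_2 + N^M_1}{4\sum_{j=1}^{9} N_j + 2(N^M_2 + N^M_1 + N^M_0)},
\]

\[
\hat{q} = \frac{4N_9 + 3(N_6 + N_8) + 2(N_3 + N_5 + N_7) + (N_2 + N_4) + 2N^M_0 + N^M_1}{4\sum_{j=1}^{9} N_j + 2(N^M_2 + N^M_1 + N^M_0)}.
\]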
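The allele-frequency estimate is a gene-counting estimate: each first-generation family contributes the four alleles of the two grandparents and each second-generation mother contributes two. A minimal sketch follows, assuming that mating types j = 1, ..., 9 are ordered AA x AA, AA x Aa, AA x aa, Aa x AA, ..., aa x aa; this ordering and the function name are assumptions for illustration.

```python
# Number of A alleles carried by the two grandparents of mating types 1..9,
# assuming the order AA x AA, AA x Aa, AA x aa, Aa x AA, ..., aa x aa.
A_ALLELES_PER_MATING_TYPE = (4, 3, 2, 3, 2, 1, 2, 1, 0)


def allele_frequencies(n_mating, n_mothers):
    """Gene-counting estimates of p and q.

    n_mating  -- counts N_1..N_9 of first-generation families by mating type
    n_mothers -- dict mapping mother genotype code (2 = AA, 1 = Aa, 0 = aa)
                 to the number of second-generation mothers
    """
    total_alleles = 4 * sum(n_mating) + 2 * sum(n_mothers.values())
    a_alleles = sum(c * n for c, n in zip(A_ALLELES_PER_MATING_TYPE, n_mating))
    a_alleles += 2 * n_mothers[2] + n_mothers[1]
    p = a_alleles / total_alleles
    return p, 1.0 - p
```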

The male individuals from the first generation are typed for the marker, with four distinct configurations, A|A, A|a, a|A, and a|a. Let $N^F_{jk}$ denote the cumulative number of male individuals (as the fathers for the second generation) bearing configuration k (k = 2, 1, 1', 0) from the $N_j$ first-generation families of mating type j. In the third generation, only genotypes rather than configurations can be observed. We use $N^O_{jkls}$ to denote the number of children who carry genotype s (s = 2, 1, 0) from second-generation families with father k (from the jth first-generation mating type) and mother l from a natural population. The phenotypic values measured are expressed as $y^F_{jki}$ ($i = 1, \ldots, N^F_{jk}$) for the second-generation fathers and $y^O_{jklsi}$ ($i = 1, \ldots, N^O_{jkls}$) for the third-generation children. Both $y^F_{jki}$ and $y^O_{jklsi}$ are assumed to follow a normal distribution with mean depending on genotypes (shown in expression (4-1)) and residual variances $\sigma_F^2$ and $\sigma_O^2$, respectively.

The joint likelihood of marker (M) and phenotypic (y) data from the three

generations is formulated as

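\[
\log L(\Omega_F, \Omega_O \mid y_F, y_O, M^{GF}, M^{GM}, M^{F}, M^{M}, M^{O})
= \log L(\Omega_F \mid y_F, M^{GF}, M^{GM}, M^{F}) + \log L(\Omega_O \mid y_O, M^{GF}, M^{GM}, M^{F}, M^{M}, M^{O}), \qquad (4\text{-}5)
\]

where $\Omega_F = (\mu_F, a_F, d_F, i_F, \sigma_F^2)$ and $\Omega_O = (\mu_O, a_O, d_O, i_O, \sigma_O^2)$, and the superscripts on M index the marker data of the grandfathers, grandmothers, fathers, mothers, and offspring. When genotypic effects are assumed to be different between two generations, maximizing joint likelihood (4-5) is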

equivalent to maximizing its two likelihood components independently. The estimates that

maximize the first component can be obtained with the EM algorithm. In the E step, the

posterior probability with which the double heterozygote father of the second generation

from the 5th first-generation mating type has a particular configuration is calculated by

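\[
\Phi^F_{51i} = \frac{\tfrac{1}{2} f_1(y^F_{51i})}{\tfrac{1}{2} f_1(y^F_{51i}) + \tfrac{1}{2} f_{1'}(y^F_{51i})}
\quad \text{and} \quad
\Phi^F_{51'i} = \frac{\tfrac{1}{2} f_{1'}(y^F_{51i})}{\tfrac{1}{2} f_1(y^F_{51i}) + \tfrac{1}{2} f_{1'}(y^F_{51i})}. \qquad (4\text{-}6)
\]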

In the M step, the genotypic values of configurations and variance are calculated by

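\[
\hat{\mu}_{F2} = \frac{\sum_{j \in \{1,2,4,5\}} \sum_{i=1}^{N^F_{j2}} y^F_{j2i}}{N^F_{12} + N^F_{22} + N^F_{42} + N^F_{52}},
\]

\[
\hat{\mu}_{F1} = \frac{\sum_{j \in \{2,3,6\}} \sum_{i=1}^{N^F_{j1}} y^F_{j1i} + \sum_{i=1}^{N^F_{51}} \Phi^F_{51i}\, y^F_{51i}}{N^F_{21} + N^F_{31} + N^F_{61} + \sum_{i=1}^{N^F_{51}} \Phi^F_{51i}},
\]

\[
\hat{\mu}_{F1'} = \frac{\sum_{j \in \{4,7,8\}} \sum_{i=1}^{N^F_{j1'}} y^F_{j1'i} + \sum_{i=1}^{N^F_{51}} \Phi^F_{51'i}\, y^F_{51i}}{N^F_{41'} + N^F_{71'} + N^F_{81'} + \sum_{i=1}^{N^F_{51}} \Phi^F_{51'i}},
\]

\[
\hat{\mu}_{F0} = \frac{\sum_{j \in \{5,6,8,9\}} \sum_{i=1}^{N^F_{j0}} y^F_{j0i}}{N^F_{50} + N^F_{60} + N^F_{80} + N^F_{90}},
\]

\[
\begin{aligned}
\hat{\sigma}_F^2 = \frac{1}{N^F} \Bigg\{ & \sum_{j \in \{1,2,4,5\}} \sum_{i} (y^F_{j2i} - \hat{\mu}_{F2})^2 + \sum_{j \in \{2,3,6\}} \sum_{i} (y^F_{j1i} - \hat{\mu}_{F1})^2 + \sum_{j \in \{4,7,8\}} \sum_{i} (y^F_{j1'i} - \hat{\mu}_{F1'})^2 \\
& + \sum_{j \in \{5,6,8,9\}} \sum_{i} (y^F_{j0i} - \hat{\mu}_{F0})^2 + \sum_{i=1}^{N^F_{51}} \left[ \Phi^F_{51i} (y^F_{51i} - \hat{\mu}_{F1})^2 + \Phi^F_{51'i} (y^F_{51i} - \hat{\mu}_{F1'})^2 \right] \Bigg\}, \qquad (4\text{-}7)
\end{aligned}
\]

where $N^F_{51}$ is the number of heterozygous fathers derived from the 5th mating type, whose configuration is unobserved, and $N^F$ is the total number of phenotyped fathers.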

The EM algorithm can also be derived to estimate genetic parameters in the third

generation. In the E step, the posterior probability with which the double heterozygote

offspring of the third generation derived from the combination of two double heterozygote

parents in the second generation has a particular configuration is calculated by

\[
\Phi^O_{jkl1i} = \frac{\tfrac{1}{2} f_1(y^O_{jkl1i})}{\tfrac{1}{2} f_1(y^O_{jkl1i}) + \tfrac{1}{2} f_{1'}(y^O_{jkl1i})}
\quad \text{and} \quad
\Phi^O_{jkl1'i} = \frac{\tfrac{1}{2} f_{1'}(y^O_{jkl1i})}{\tfrac{1}{2} f_1(y^O_{jkl1i}) + \tfrac{1}{2} f_{1'}(y^O_{jkl1i})}. \qquad (4\text{-}8)
\]

In the M step, the genotypic values of configurations and variance are calculated by

[Equation (4-9), which gives the updates of $\mu_{O2}$, $\mu_{O1}$, $\mu_{O1'}$, $\mu_{O0}$, and $\sigma_O^2$ as sums over j = 1, ..., 9, k = 0, 1', 1, 2, and l = 0, 1, 2, is not legible in the source.]

where the three indicator variables appearing in (4-9) are defined as 1 if offspring i in the third generation from the combination of father k from the jth first-generation mating type and mother l from the natural population has genotype AA, Aa, and aa, respectively, and 0 otherwise. The EM steps are iterated between equations (4-6) and (4-7) to obtain the MLEs of $\Omega_F$ and between equations (4-8) and (4-9) to obtain the MLEs of $\Omega_O$.

4.2.4 Hypothesis Tests

It is imperative to know whether significant SNPs exist to be associated with a

complex trait and how a significant SNP triggers an additive, dominant, or imprinting

effect on the trait. These questions can be addressed by formulating the following

hypotheses. The first one has the null hypothesis H0: a_F = d_F = i_F = a_O = d_O = i_O = 0. The log-likelihood ratio under the null and alternative hypotheses is calculated. The critical threshold for claiming the existence of a significant SNP is determined from permutation tests (Churchill and Doerge, 1994). The second one has the null hypotheses, H0: a_F = a_O = 0, H0: d_F = d_O = 0, and H0: i_F = i_O = 0, respectively. Because each of these null hypotheses is nested within its alternative, the log-likelihood ratio test statistic can be thought to asymptotically follow a χ²-distribution for a large sample size.

The transgenerational changes of different genetic effects can also be tested, with null hypotheses H0: Δa = 0, H0: Δd = 0, and H0: Δi = 0, respectively. These null hypotheses

can be considered singly or jointly, in order to better study the transgenerational changes

of the genetic architecture of a trait.
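A permutation threshold of the kind used for the first test can be sketched as follows. The sketch shuffles phenotypes against fixed genotypes and takes an upper quantile of the resulting null statistics. In the model described here the statistic would be the log-likelihood ratio; the `mean_difference` statistic below is a toy stand-in, and all names and defaults are assumptions for illustration.

```python
import math
import random


def mean_difference(y, g):
    """Toy statistic: absolute difference in mean phenotype between two classes."""
    y0 = [v for v, c in zip(y, g) if c == 0]
    y1 = [v for v, c in zip(y, g) if c == 1]
    return abs(sum(y1) / len(y1) - sum(y0) / len(y0))


def permutation_threshold(y, g, statistic, n_perm=1000, alpha=0.05, seed=0):
    """Empirical (1 - alpha) quantile of the statistic under permuted phenotypes."""
    rng = random.Random(seed)
    shuffled = list(y)
    null_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        null_stats.append(statistic(shuffled, g))
    null_stats.sort()
    index = min(n_perm - 1, int(math.ceil((1.0 - alpha) * n_perm)) - 1)
    return null_stats[index]
```

A SNP is then declared significant when its observed statistic exceeds the returned threshold.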

4.3 Haplotyping Model

Recent molecular surveys suggest that the human genome contains many discrete haplotype blocks that are sites of closely located SNPs (Dawson et al., 2002; Gabriel et al., 2002; Patil et al., 2001). Each block may have a few common haplotypes which account for a large proportion of chromosomal variation. Between adjacent blocks there are large regions, called hotspots, in which recombination events occur with high frequencies. Several algorithms have been developed to identify a minimal subset of SNPs, i.e., tagging SNPs, that can characterize the most common haplotypes (Zhang et al., 2002). The number and type of tagging SNPs within each haplotype block can be determined prior

to association studies. In this section, I will derive a model for detecting the association

between haplotypes constructed by alleles at a set of SNPs and complex traits.

Consider two SNPs, A (with two alleles A and a) and B (with two alleles B and

b) to clearly describe the model. Four haplotypes formed by these two SNPs are AB,

Ab, aB, and ab, with respective frequencies denoted as p11, p10, p01, and p00. The two markers produce nine joint genotypes, AABB (coded as 1), AABb (coded as 2), ..., aabb (coded as 9), which are observed. Thus, each subject will bear one of these genotypes, and the parents in each family will be one of 9 x 9 = 81 possible genotype

by genotype combinations. If each parent for a combination is homozygous for both

markers, their offspring will have one genotype. As long as one parent is heterozygous

for one marker, the offspring will have two or more genotypes. However, only when both

markers are heterozygous for at least one parent, the genotype frequencies of offspring

will be determined by the recombination fraction between the markers (r). Table 4-4

shows the structure and frequencies of mother by father genotype combinations under

random mating and their offspring genotype frequencies. For a double heterozygote

AaBb, its observed genotype may be derived from two possible diplotypes, AB|ab (with the probability of p11 p00) or Ab|aB (with the probability of p10 p01). Each of these two diplotypes produces four haplotypes AB, Ab, aB, and ab, whose frequencies are expressed as

Diplotype    AB           Ab           aB           ab
AB|ab        (1 - r)/2    r/2          r/2          (1 - r)/2
Ab|aB        r/2          (1 - r)/2    (1 - r)/2    r/2
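The gamete frequencies of a double heterozygote can be computed with a few lines of code. This is a minimal sketch with hypothetical function names. The second function averages over the two diplotypes when the phase is unobserved, weighting them by p11 p00 and p10 p01 as described above.

```python
def gamete_frequencies(diplotype, r):
    """Frequencies of gametes AB, Ab, aB, ab produced by a double heterozygote."""
    parental, recombinant = 0.5 * (1.0 - r), 0.5 * r
    if diplotype == "AB|ab":
        return {"AB": parental, "Ab": recombinant, "aB": recombinant, "ab": parental}
    if diplotype == "Ab|aB":
        return {"AB": recombinant, "Ab": parental, "aB": parental, "ab": recombinant}
    raise ValueError("diplotype must be 'AB|ab' or 'Ab|aB'")


def double_heterozygote_gametes(p11, p10, p01, p00, r):
    """Expected gamete output of an AaBb parent whose diplotype is unobserved."""
    phi = p11 * p00 / (p11 * p00 + p10 * p01)  # probability of diplotype AB|ab
    first = gamete_frequencies("AB|ab", r)
    second = gamete_frequencies("Ab|aB", r)
    return {h: phi * first[h] + (1.0 - phi) * second[h] for h in first}
```

With r = 0.05, an AB|ab parent transmits AB and ab with probability 0.475 each and the two recombinant haplotypes with probability 0.025 each.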

A likelihood similar to (4-5) can be formulated for haplotype models. A more complicated EM algorithm is derived to estimate haplotype frequencies using the parental information. Let $N_{ij}$ denote the observed number of mating types between genotype i for one parent and genotype j for the second parent. In the E step, calculate the proportion of a diplotype for a heterozygous genotype for a particular mating design by

[The E-step equations, which involve the diplotype proportion $\phi$ of double heterozygotes and the recombination fraction $r$, are not legible in the source.]

In the M step, estimate the haplotype frequencies and recombination fraction by

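\[
\begin{aligned}
\hat{p}_{11} = \Big\{ & 4N_{11} + 3(N_{12} + N_{21} + N_{14} + N_{41}) + 2(N_{22} + N_{44} + N_{13} + N_{31} + N_{16} + N_{61} + N_{17} + N_{71} \\
& + N_{18} + N_{81} + N_{19} + N_{91} + N_{24} + N_{42}) + N_{23} + N_{32} + N_{26} + N_{62} + N_{27} + N_{72} + N_{28} + N_{82} \\
& + N_{29} + N_{92} + N_{34} + N_{43} + N_{46} + N_{64} + N_{47} + N_{74} + N_{48} + N_{84} + N_{49} + N_{94} \\
& + \phi\,[3(N_{15} + N_{51}) + 2(N_{25} + N_{52} + N_{45} + N_{54}) + N_{35} + N_{53} + N_{56} + N_{65} + N_{57} + N_{75} + N_{58} + N_{85} + N_{59} + N_{95}] \\
& + (1 - \phi)\,[2(N_{15} + N_{51}) + N_{25} + N_{52} + N_{45} + N_{54}] + 2\phi N_{55} \Big\} \Big/ \Big( 4 \sum_{i=1}^{9} \sum_{j=1}^{9} N_{ij} \Big),
\end{aligned}
\]

\[
\begin{aligned}
\hat{p}_{10} = \Big\{ & 4N_{33} + 3(N_{23} + N_{32} + N_{36} + N_{63}) + 2(N_{22} + N_{66} + N_{13} + N_{31} + N_{26} + N_{62} + N_{34} + N_{43} \\
& + N_{37} + N_{73} + N_{38} + N_{83} + N_{39} + N_{93}) + N_{12} + N_{21} + N_{16} + N_{61} + N_{24} + N_{42} + N_{27} + N_{72} \\
& + N_{28} + N_{82} + N_{29} + N_{92} + N_{46} + N_{64} + N_{67} + N_{76} + N_{68} + N_{86} + N_{69} + N_{96} \\
& + \phi\,[2(N_{35} + N_{53}) + N_{25} + N_{52} + N_{56} + N_{65}] \\
& + (1 - \phi)\,[3(N_{35} + N_{53}) + 2(N_{25} + N_{52} + N_{56} + N_{65}) + N_{15} + N_{51} + N_{45} + N_{54} + N_{57} + N_{75} + N_{58} + N_{85} + N_{59} + N_{95}] \\
& + 2(1 - \phi) N_{55} \Big\} \Big/ \Big( 4 \sum_{i=1}^{9} \sum_{j=1}^{9} N_{ij} \Big),
\end{aligned}
\]

\[
\begin{aligned}
\hat{p}_{01} = \Big\{ & 4N_{77} + 3(N_{47} + N_{74} + N_{78} + N_{87}) + 2(N_{44} + N_{88} + N_{17} + N_{71} + N_{27} + N_{72} + N_{37} + N_{73} \\
& + N_{67} + N_{76} + N_{48} + N_{84} + N_{79} + N_{97}) + N_{14} + N_{41} + N_{18} + N_{81} + N_{24} + N_{42} + N_{28} + N_{82} \\
& + N_{34} + N_{43} + N_{38} + N_{83} + N_{46} + N_{64} + N_{49} + N_{94} + N_{68} + N_{86} + N_{89} + N_{98} \\
& + \phi\,[2(N_{57} + N_{75}) + N_{45} + N_{54} + N_{58} + N_{85}] \\
& + (1 - \phi)\,[3(N_{57} + N_{75}) + 2(N_{45} + N_{54} + N_{58} + N_{85}) + N_{15} + N_{51} + N_{25} + N_{52} + N_{35} + N_{53} + N_{56} + N_{65} + N_{59} + N_{95}] \\
& + 2(1 - \phi) N_{55} \Big\} \Big/ \Big( 4 \sum_{i=1}^{9} \sum_{j=1}^{9} N_{ij} \Big),
\end{aligned}
\]

\[
\begin{aligned}
\hat{p}_{00} = \Big\{ & 4N_{99} + 3(N_{69} + N_{96} + N_{89} + N_{98}) + 2(N_{66} + N_{88} + N_{19} + N_{91} + N_{29} + N_{92} + N_{39} + N_{93} \\
& + N_{49} + N_{94} + N_{68} + N_{86} + N_{79} + N_{97}) + N_{16} + N_{61} + N_{18} + N_{81} + N_{26} + N_{62} + N_{28} + N_{82} \\
& + N_{36} + N_{63} + N_{38} + N_{83} + N_{46} + N_{64} + N_{48} + N_{84} + N_{67} + N_{76} + N_{78} + N_{87} \\
& + \phi\,[3(N_{59} + N_{95}) + 2(N_{56} + N_{65} + N_{58} + N_{85}) + N_{15} + N_{51} + N_{25} + N_{52} + N_{35} + N_{53} + N_{45} + N_{54} + N_{57} + N_{75}] \\
& + (1 - \phi)\,[2(N_{59} + N_{95}) + N_{56} + N_{65} + N_{58} + N_{85}] + 2\phi N_{55} \Big\} \Big/ \Big( 4 \sum_{i=1}^{9} \sum_{j=1}^{9} N_{ij} \Big),
\end{aligned}
\]

where $\phi$ is the proportion of diplotype AB|ab among double heterozygotes AaBb calculated in the E step.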

[The M-step equation for the recombination fraction $r$, which counts recombinant gametes transmitted by double-heterozygous parents, is not legible in the source.]
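The haplotype-frequency updates amount to counting, for every pair of parental genotypes, the expected number of copies of each haplotype carried by the two parents. The sketch below expresses this compactly. It assumes the genotype coding 1 = AABB, 2 = AABb, 3 = AAbb, 4 = AaBB, 5 = AaBb, 6 = Aabb, 7 = aaBB, 8 = aaBb, 9 = aabb, and takes `phi` to be the proportion of diplotype AB|ab among double heterozygotes; the function names and the 9 x 9 list layout of the counts are hypothetical.

```python
def haplotype_copies(phi):
    """Expected copies of each haplotype carried by genotypes coded 1..9."""
    return {
        "AB": (2, 1, 0, 1, phi, 0, 0, 0, 0),
        "Ab": (0, 1, 2, 0, 1.0 - phi, 1, 0, 0, 0),
        "aB": (0, 0, 0, 1, 1.0 - phi, 0, 2, 1, 0),
        "ab": (0, 0, 0, 0, phi, 1, 0, 1, 2),
    }


def update_haplotype_frequencies(n, phi):
    """Gene-counting update of haplotype frequencies.

    n[i][j] is the number of families whose parents have genotypes
    i + 1 and j + 1 (a 9 x 9 nested list).
    """
    total = 4.0 * sum(sum(row) for row in n)  # four haplotypes per parental pair
    copies = haplotype_copies(phi)
    return {
        h: sum(n[i][j] * (c[i] + c[j]) for i in range(9) for j in range(9)) / total
        for h, c in copies.items()
    }
```

The four updated frequencies sum to one for any value of `phi` between 0 and 1, because every genotype carries exactly two haplotypes.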

The E and M steps are iterated until the estimates converge to a stable value. These

stable values are the maximum likelihood estimates (MLEs) of parameters. The estimated

haplotype frequencies and recombination fraction are embedded into a mixture model for

estimating genotypic values and variances for different generations.

4.4 Computer Simulation

Simulation studies were performed to examine the statistical behavior of the model.

A three-generation design is simulated which includes a certain number of first-generation

families sorted into 9 mating types (as shown in Table 4-1) according to the genotype

frequencies. Assume that the allele frequencies of a gene are 0.6 and 0.4 in a natural

population at Hardy-Weinberg equilibrium. Our simulation will focus on the investigation

of the impacts of different sampling strategies and heritabilities on parameter estimation

and model power. For a given sample size, two sampling strategies are simulated: (1) a large family number vs. small family size, and (2) a small family number vs. large family size.


The first strategy samples 200 unrelated grandfathers and 200 unrelated grandmothers,

who marry to form 200 first-generation families. Each first-generation family is

assumed to have one son who, as the father, forms a second-generation family with the

mother from the natural population. There is one child for each second-generation family.

This allocation results in a total of 1000 subjects. All members in the design are typed for

the gene, but only the fathers and offspring of the third generation are phenotyped for a

normally distributed trait. The second strategy samples 50 unrelated grandfathers and 50

unrelated grandmothers. In each first-generation family, 3 sons are simulated, forming 150

second-generation families in which 4 children are assumed. This strategy also results in

1000 subjects.
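The subject totals of the two strategies can be checked with simple arithmetic: grandparents, fathers (sons), mothers sampled from the natural population, and children. The helper below is a hypothetical illustration of that bookkeeping.

```python
def total_subjects(first_generation_families, sons_per_family, children_per_family):
    """Number of typed subjects in the three-generation design."""
    grandparents = 2 * first_generation_families
    fathers = first_generation_families * sons_per_family
    mothers = fathers  # one mother from the natural population per father
    children = fathers * children_per_family
    return grandparents + fathers + mothers + children


strategy_1 = total_subjects(200, 1, 1)  # 400 + 200 + 200 + 200
strategy_2 = total_subjects(50, 3, 4)   # 100 + 150 + 150 + 600
```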

Different genetic effects of the gene, additive, dominant, and imprinting, are simulated

for the second- and third-generations (Table 4-4). Two different heritability levels, 0.1

and 0.4, are simulated for each generation, from which variances are determined. Table

4-2 tabulates the estimates of population and quantitative genetic parameters from the

three-generation design. As expected, allele frequency can be very well estimated. The

model provides reasonable estimation accuracy and precision for all genetic parameters

under different sampling strategies, even for a modest heritability level. The model is

powerful to detect differences of genetic effects between two consecutive generations.

More interestingly, the difference of imprinting effect between different generations, i.e.,

transgenerational inheritance of genetic imprinting, can be discerned with our statistical model.


Table 4-2. The maximum likelihood estimates (MLEs) of additive (a), dominant (d), and
imprinting effects (i) of a functional SNP on a complex trait in parental (F)
and offspring (O) generations under two different strategies. The estimates are
the means of MLEs obtained from 200 simulation replicates, with standard
errors given in parentheses.

          Strategy 1                          Strategy 2
H2 = 0.1         H2 = 0.4           H2 = 0.1         H2 = 0.4
1.0198(0.0236)   1.0028(0.0112)     1.0387(0.0333)   0.9975(0.0125)
0.5702(0.0340)   0.6184(0.0144)     0.6272(0.0392)   0.6046(0.0185)
0.6024(0.0306)   0.5991(0.0093)     0.5897(0.0328)   0.6037(0.012)











We conducted an additional simulation study in which the same genetic effects are

assumed between the two generations. The model detects only a small proportion of simulation

replicates which display transgenerational differences. This suggests that the model has

a small type I error rate for detecting the transgenerational difference of overall genetic

effects. We specifically tested the type I error rate for the transgenerational difference of

genetic imprinting, which is also very small.










The haplotype model is also examined through simulation studies. I simulated

two SNPs with a recombination fraction of r = 0.05 that are segregating in a human

population. Of the four haplotypes, one is assumed to function as a risk haplotype. The

remaining three are collectively called the non-risk haplotypes. The genetic values of composite

diplotypes constituted by risk and non-risk haplotypes include the additive (a), dominant

(d), and imprinting (i) genetic effects. I assume that some of these effects are different,
and the others are the same between the parental and offspring generations. Combinations

of different heritabilities between the two generations are simulated.

Table 4-3 tabulates the results of simulation for different heritabilities and sample

sizes. Overall, all parameters can be estimated reasonably well. As expected, the precision

of parameter estimation increases with heritability and sample size. The additive genetic

effects in both generations can be estimated well with a modest sample size (about 400) for a

small heritability (0.1). A larger sample size (about 800) is needed to provide a good estimate

of genetic imprinting effects for a small heritability. To estimate dominant genetic effects

well, an even larger sample size (about 2000) is required for the same level of heritability.

4.5 Discussion

The traditional view assumes that the maternally and paternally derived alleles

of each gene are expressed simultaneously at a similar level. This view is violated by a

growing body of evidence that alleles are expressed from only one of the two parental

chromosomes (Reik and Walter, 2001; Wilkins and Haig, 2003). This so-called genetic

imprinting or parent-of-origin effect has been thought to play a pivotal role in regulating

the phenotypic variation of a complex trait (Constancia et al., 2004; Isles and Wilkinson,

2000; Itier et al., 1998; Li et al., 1999; Wilkinson et al., 2007). With the discovery of

more imprinting genes involved in trait control through molecular and bioinformatics

approaches, we will be in a position to elucidate the genetic architecture of quantitative

variation for various organisms including humans.








Evidence for transgenerational epigenetic inheritance in humans would represent a

significant shift in our current understanding of inheritance and disease aetiology. Despite

the development of new technologies that are reducing the time and cost of sequencing

by several orders of magnitude (CI i1, 2005), ruling out underlying genetic events will be

challenging. Furthermore, copy number variants (CNVs), which are not readily detectable

by standard sequencing technology, must also be considered since they are now believed

to be as prevalent as single nucleotide polymorphisms (Beckmann et al., 2007). And so,

the task of correlating bona fide epimutations with disease, let alone trying to trace their

inheritance across generations, remains a formidable one.

This study assumes offspring of a single sex (sons) from each first-generation family. One

can also assume daughters with no change of the model. In fact, our model can allow the

involvement of both sexes so that in the second generation sex-specific genetic effects can

be characterized. If the sexes in the third generation are considered, the model can be

extended to study the transgenerational changes of gene-sex interactions. Although a basic

premise of epigenetic processes was that, once established, these marks were maintained

through rounds of mitotic cell division and stable for the life of the organism, several

recent studies have shown that at some loci the epigenetic state can be altered by the

environment (Jirtle and Skinner, 2007). The questions are how common are genes of this

type and how strong is the evidence for their existence in humans? The development

of our design and model will help to address these biological questions of fundamental

importance in elucidating the genetic architecture of complex traits.




5.1 Introduction

Since the recognition of cancer as a genetic disease, a number of familial cancer

genes with high-penetrance mutations, such as oncogenes and the tumor suppressors,

have been chromosomally identified, isolated or cloned (Balman et al., 2003; Rand et al.,

2008). However, growing evidence shows that most cancer is the result of an intricate

interaction of low-penetrance genetic variants with environmental exposures that humans

experience (Brennan, 2002). These low-penetrance cancer genes, each usually with a

minor effect and cooperating with others in a complicated web, are difficult to detect and,

therefore, their contribution to the risk of cancer development remains unclear. There is

a pressing demand on the development of powerful statistical models and computational

algorithms for identifying and mapping specific DNA sequence variants that regulate

cancer susceptibility.

Human cancer cells frequently possess large-scale chromosomal rearrangements due

to chromosomal instability (CIN) (Stock and Bialy, 2003; Thompson and Compton, 2008)

or gene mutation (Greenman et al., 2007; Jallepalli and Lengauer, 2001). CIN makes

whole chromosomes or large fractions of chromosomes gained or lost during cell division,

resulting in an imbalance in the number of chromosomes per cell (aneuploidy) and an

enhanced rate of loss of heterozygosity. Thus, the "aneuploidy hypothesis of cancer"

(Stock and Bialy, 2003) proposes that the main differences between normal and abnormal

(cancer) cells result from the number of genes rather than the types of genes differentially
expressed, as opposed to the "gene-mutation hypothesis" (Jallepalli and Lengauer, 2001).

In general, cancer incidence and development are not only affected by the host genes,

but also by genes derived from the cancer cells themselves. Given strong mechanistic

interactions between the host and cancer tissues (Araujo and McElwain, 2006), these two

different systems of genes operate interactively or epistatically to alter the course of cancer

progression. Thus, it can be anticipated that any statistical model for gene detection

that incorporates these two hypotheses is likely to make groundbreaking discoveries of

cancer genes.

Genetic mapping has proven to be a powerful approach for detecting quantitative

trait loci (QTLs) for complex traits. But a QTL may contain multiple genes that operate

in a collective way. It is not possible to study the DNA structure, organization and

function of a QTL detected from a mapping approach. A more accurate and useful

approach for the characterization of genetic variants contributing to quantitative variation

is to directly analyze DNA sequences, known as haplotypes, associated with a particular

disease (Lin and Wu, 2006; Liu et al., 2004). If a string of DNA sequence is known to

increase disease risk, this risk can be prevented by inhibiting the expression of this string

using a specialized drug. The control of this disease can be made more efficient if all

possible DNA sequences determining its variation are identified in the entire genome. The

elucidation of the entire human genome has been accelerated by the haplotype map, or

HapMap, constructed by SNPs (The International HapMap Consortium, 2003). More

recently, ambitious plans to sequence cancer genomes (Kaiser, 2005) will provide

unprecedented fuel for studying the genetic architecture of cancer risk.

In this article, we will derive a statistical model for detecting the actions and

interactions of haplotypes derived from the host and cancer genomes for cancer susceptibility.

We will incorporate the "gene-mutation hypothesis" into the model. The "aneuploidy

hypothesis of cancer" will be considered in a subsequent paper. Through the release of software

to the public, our statistical model will serve as a routine means for the genetic diagnosis

of cancer risk. Results from the model will provide scientific guidance for clinical doctors

to design an optimal treatment scheme in terms of cancer genes and patient's genes.

5.2 Design

5.2.1 Sampling Strategies

Suppose there is a natural human population at Hardy-Weinberg equilibrium (HWE)

from which a random sample is drawn for cancer gene identification. In order to identify

DNA sequences responsible for cancer susceptibility, we genotype SNPs from the entire

host genome and also SNPs from the cancer genome for the same patient. We assume that

the cancer genome is diploid and that its difference from the normal genome follows the

"gene-mutation hypothesis." Recent molecular surveys suggest that the human genome

contains many discrete haplotype blocks that are sites of closely located SNPs (Daly

et al., 2001; Gabriel et al., 2002; Patil et al., 2001). Each block may have a minimal subset

of SNPs, i.e., tagging SNPs, that can characterize the most common haplotypes. Our

model will be based on tagging SNPs within each haplotype block. Although no detailed

information about the structure of the cancer genome is available, we can assume that a

particular set of SNPs may contribute to cancer formation at the haplotype level. The

tenet of our epistatic model is that the effect of a given DNA sequence in the host genome

on cancer is masked or enhanced by one or more sequences in the cancer genome.

5.2.2 Genetic Models

Population Genetic Model: Consider a set of $R$ tagging SNPs from a haplotype

block of the host genome and a set of $S$ SNPs from the cancer genome. We denote the two

alleles of SNP $r$ from the host genome by $H^r_{k_r}$ ($k_r = 1, 0$; $r = 1, \ldots, R$) and the two alleles of

SNP $s$ from the cancer genome by $C^s_{l_s}$ ($l_s = 1, 0$; $s = 1, \ldots, S$). Let $p^H_{k_r}$ and $p^C_{l_s}$ denote allele

frequencies at the corresponding SNP from the host and cancer genomes, respectively.

All the SNPs considered from the host and cancer genomes form $2^{R+S}$ possible joint

haplotypes expressed as $(H^1_{k_1} H^2_{k_2} \cdots H^R_{k_R})(C^1_{l_1} C^2_{l_2} \cdots C^S_{l_S})$. The corresponding haplotype

frequencies are denoted by $p_{(k_1 k_2 \cdots k_R)(l_1 l_2 \cdots l_S)}$, which are composed of allele frequencies at

each SNP and linkage disequilibria of different orders among SNPs within and between the

genomes (Wu and Lin, 2008). A general expression for the relationships between haplotype

frequencies and allele frequencies and linkage disequilibria was originally given by Bennett

(1954). Table 5-1 lists the compositions of the frequency of a haplotype constructed jointly

by two SNPs from the host genome and two SNPs from the cancer genome in which

linkage disequilibria are specified with two, three, and four sites.

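Table 5-1. Disequilibrium compositions of four-SNP haplotype frequencies derived from
the host and cancer genomes. The haplotype frequency $p_{(k_1k_2)(l_1l_2)}$ is the sum of all terms.

Term                                                              Description
$p^H_{k_1} p^H_{k_2} p^C_{l_1} p^C_{l_2}$                         No LD
$(-1)^2(-1)^{k_1+k_2}\, p^C_{l_1} p^C_{l_2}\, D_{H_1H_2}$         Digenic LD within the host genome (H)
$(-1)^2(-1)^{l_1+l_2}\, p^H_{k_1} p^H_{k_2}\, D_{C_1C_2}$         Digenic LD within the cancer genome (C)
$(-1)^2(-1)^{k_2+l_1}\, p^H_{k_1} p^C_{l_2}\, D_{H_2C_1}$         Digenic LD between SNP 2 of H and SNP 1 of C
$(-1)^2(-1)^{k_2+l_2}\, p^H_{k_1} p^C_{l_1}\, D_{H_2C_2}$         Digenic LD between SNP 2 of H and SNP 2 of C
$(-1)^2(-1)^{k_1+l_1}\, p^H_{k_2} p^C_{l_2}\, D_{H_1C_1}$         Digenic LD between SNP 1 of H and SNP 1 of C
$(-1)^2(-1)^{k_1+l_2}\, p^H_{k_2} p^C_{l_1}\, D_{H_1C_2}$         Digenic LD between SNP 1 of H and SNP 2 of C
$(-1)^3(-1)^{k_1+k_2+l_1}\, p^C_{l_2}\, D_{H_1H_2C_1}$            Trigenic LD between H and SNP 1 of C
$(-1)^3(-1)^{k_1+k_2+l_2}\, p^C_{l_1}\, D_{H_1H_2C_2}$            Trigenic LD between H and SNP 2 of C
$(-1)^3(-1)^{k_1+l_1+l_2}\, p^H_{k_2}\, D_{H_1C_1C_2}$            Trigenic LD between SNP 1 of H and C
$(-1)^3(-1)^{k_2+l_1+l_2}\, p^H_{k_1}\, D_{H_2C_1C_2}$            Trigenic LD between SNP 2 of H and C
$(-1)^4(-1)^{k_1+k_2+l_1+l_2}\, D_{H_1H_2C_1C_2}$                 Quadrigenic LD between H and C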

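From these compositions, linkage disequilibria are expressed as

$$\begin{aligned}
D_{H_1H_2} ={}& (p_{(11)(11)} + p_{(11)(10)} + p_{(11)(01)} + p_{(11)(00)})(p_{(00)(11)} + p_{(00)(10)} + p_{(00)(01)} + p_{(00)(00)}) \\
&- (p_{(10)(11)} + p_{(10)(10)} + p_{(10)(01)} + p_{(10)(00)})(p_{(01)(11)} + p_{(01)(10)} + p_{(01)(01)} + p_{(01)(00)}),
\end{aligned} \tag{5-1}$$

$$\begin{aligned}
D_{C_1C_2} ={}& (p_{(11)(11)} + p_{(10)(11)} + p_{(01)(11)} + p_{(00)(11)})(p_{(11)(00)} + p_{(10)(00)} + p_{(01)(00)} + p_{(00)(00)}) \\
&- (p_{(11)(10)} + p_{(10)(10)} + p_{(01)(10)} + p_{(00)(10)})(p_{(11)(01)} + p_{(10)(01)} + p_{(01)(01)} + p_{(00)(01)}),
\end{aligned} \tag{5-2}$$

$$\begin{aligned}
D_{H_2C_1} ={}& (p_{(11)(11)} + p_{(11)(10)} + p_{(01)(11)} + p_{(01)(10)})(p_{(10)(01)} + p_{(10)(00)} + p_{(00)(01)} + p_{(00)(00)}) \\
&- (p_{(11)(01)} + p_{(11)(00)} + p_{(01)(01)} + p_{(01)(00)})(p_{(10)(11)} + p_{(10)(10)} + p_{(00)(11)} + p_{(00)(10)}),
\end{aligned} \tag{5-3}$$

$$\begin{aligned}
D_{H_2C_2} ={}& (p_{(11)(11)} + p_{(11)(01)} + p_{(01)(11)} + p_{(01)(01)})(p_{(10)(10)} + p_{(10)(00)} + p_{(00)(10)} + p_{(00)(00)}) \\
&- (p_{(11)(10)} + p_{(11)(00)} + p_{(01)(10)} + p_{(01)(00)})(p_{(10)(11)} + p_{(10)(01)} + p_{(00)(11)} + p_{(00)(01)}),
\end{aligned} \tag{5-4}$$

$$\begin{aligned}
D_{H_1C_1} ={}& (p_{(11)(11)} + p_{(11)(10)} + p_{(10)(11)} + p_{(10)(10)})(p_{(01)(01)} + p_{(01)(00)} + p_{(00)(01)} + p_{(00)(00)}) \\
&- (p_{(11)(01)} + p_{(11)(00)} + p_{(10)(01)} + p_{(10)(00)})(p_{(01)(11)} + p_{(01)(10)} + p_{(00)(11)} + p_{(00)(10)}),
\end{aligned} \tag{5-5}$$

$$\begin{aligned}
D_{H_1C_2} ={}& (p_{(11)(11)} + p_{(11)(01)} + p_{(10)(11)} + p_{(10)(01)})(p_{(01)(10)} + p_{(01)(00)} + p_{(00)(10)} + p_{(00)(00)}) \\
&- (p_{(11)(10)} + p_{(11)(00)} + p_{(10)(10)} + p_{(10)(00)})(p_{(01)(11)} + p_{(01)(01)} + p_{(00)(11)} + p_{(00)(01)}),
\end{aligned} \tag{5-6}$$

$$\begin{aligned}
D_{H_1H_2C_1} ={}& [(p_{(11)(11)} + p_{(11)(10)})(p_{(10)(01)} + p_{(10)(00)}) + (p_{(01)(01)} + p_{(01)(00)})(p_{(00)(11)} + p_{(00)(10)})] \\
&- [(p_{(00)(01)} + p_{(00)(00)})(p_{(01)(11)} + p_{(01)(10)}) + (p_{(11)(01)} + p_{(11)(00)})(p_{(10)(11)} + p_{(10)(10)})],
\end{aligned} \tag{5-7}$$

$$\begin{aligned}
D_{H_1H_2C_2} ={}& [(p_{(11)(11)} + p_{(11)(01)})(p_{(10)(10)} + p_{(10)(00)}) + (p_{(01)(10)} + p_{(01)(00)})(p_{(00)(11)} + p_{(00)(01)})] \\
&- [(p_{(00)(10)} + p_{(00)(00)})(p_{(01)(11)} + p_{(01)(01)}) + (p_{(11)(10)} + p_{(11)(00)})(p_{(10)(11)} + p_{(10)(01)})],
\end{aligned} \tag{5-8}$$

$$\begin{aligned}
D_{H_1C_1C_2} ={}& [(p_{(11)(11)} + p_{(10)(11)})(p_{(11)(00)} + p_{(10)(00)}) + (p_{(01)(10)} + p_{(00)(10)})(p_{(01)(01)} + p_{(00)(01)})] \\
&- [(p_{(01)(00)} + p_{(00)(00)})(p_{(01)(11)} + p_{(00)(11)}) + (p_{(11)(10)} + p_{(10)(10)})(p_{(11)(01)} + p_{(10)(01)})],
\end{aligned} \tag{5-9}$$

$$\begin{aligned}
D_{H_2C_1C_2} ={}& [(p_{(11)(11)} + p_{(01)(11)})(p_{(11)(00)} + p_{(01)(00)}) + (p_{(10)(10)} + p_{(00)(10)})(p_{(10)(01)} + p_{(00)(01)})] \\
&- [(p_{(10)(00)} + p_{(00)(00)})(p_{(10)(11)} + p_{(00)(11)}) + (p_{(11)(10)} + p_{(01)(10)})(p_{(11)(01)} + p_{(01)(01)})],
\end{aligned} \tag{5-10}$$

$$\begin{aligned}
(-1)^4(-1)^{k_1+k_2+l_1+l_2} D_{H_1H_2C_1C_2} ={}& p_{(k_1k_2)(l_1l_2)} - p^H_{k_1} p^H_{k_2} p^C_{l_1} p^C_{l_2} \\
&- (-1)^2(-1)^{k_1+k_2} p^C_{l_1} p^C_{l_2} D_{H_1H_2} - (-1)^2(-1)^{l_1+l_2} p^H_{k_1} p^H_{k_2} D_{C_1C_2} \\
&- (-1)^2(-1)^{k_2+l_1} p^H_{k_1} p^C_{l_2} D_{H_2C_1} - (-1)^2(-1)^{k_2+l_2} p^H_{k_1} p^C_{l_1} D_{H_2C_2} \\
&- (-1)^2(-1)^{k_1+l_1} p^H_{k_2} p^C_{l_2} D_{H_1C_1} - (-1)^2(-1)^{k_1+l_2} p^H_{k_2} p^C_{l_1} D_{H_1C_2} \\
&- (-1)^3(-1)^{k_1+k_2+l_1} p^C_{l_2} D_{H_1H_2C_1} - (-1)^3(-1)^{k_1+k_2+l_2} p^C_{l_1} D_{H_1H_2C_2} \\
&- (-1)^3(-1)^{k_1+l_1+l_2} p^H_{k_2} D_{H_1C_1C_2} - (-1)^3(-1)^{k_2+l_1+l_2} p^H_{k_1} D_{H_2C_1C_2}.
\end{aligned} \tag{5-11}$$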
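
The digenic disequilibria (5-1) to (5-6) all have the same form: collapse the 16 joint haplotype frequencies to the two-site marginal for the pair of SNPs and compute $p_{11}p_{00} - p_{10}p_{01}$. The following is a minimal sketch of that calculation, not the dissertation's software. The haplotype key layout `(k1, k2, l1, l2)`, the function names, and the example frequencies are all assumptions made for illustration.

```python
from itertools import product

def marginal(p, sites, alleles):
    """Sum haplotype frequencies over all haplotypes carrying `alleles` at `sites`."""
    return sum(f for hap, f in p.items()
               if all(hap[s] == a for s, a in zip(sites, alleles)))

def digenic_ld(p, i, j):
    """D_ij = p_11 * p_00 - p_10 * p_01 on the two-site marginal of sites i and j."""
    m = lambda a, b: marginal(p, (i, j), (a, b))
    return m(1, 1) * m(0, 0) - m(1, 0) * m(0, 1)

# Hypothetical example: the two host SNPs are in LD (D = 0.05),
# the two cancer SNPs are independent of each other and of the host SNPs.
host = {(1, 1): 0.35, (1, 0): 0.25, (0, 1): 0.15, (0, 0): 0.25}
cancer_allele_1 = (0.7, 0.4)  # frequency of allele 1 at cancer SNPs 1 and 2
p = {}
for k1, k2, l1, l2 in product((1, 0), repeat=4):
    c1 = cancer_allele_1[0] if l1 else 1 - cancer_allele_1[0]
    c2 = cancer_allele_1[1] if l2 else 1 - cancer_allele_1[1]
    p[(k1, k2, l1, l2)] = host[(k1, k2)] * c1 * c2

d_host = digenic_ld(p, 0, 1)    # host SNP 1 vs host SNP 2, as in (5-1)
d_cancer = digenic_ld(p, 2, 3)  # cancer SNP 1 vs cancer SNP 2, as in (5-2)
d_cross = digenic_ld(p, 1, 2)   # host SNP 2 vs cancer SNP 1, as in (5-3)
```

In this constructed example only the host pair carries disequilibrium, so `d_host` recovers the value built into the table and the other two vanish up to rounding.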

The random combination of maternal and paternal haplotypes generates $2^{R+S-1}(2^{R+S} + 1)$ diplotypes expressed as $(H^1_{k_1} H^2_{k_2} \cdots H^R_{k_R})(C^1_{l_1} C^2_{l_2} \cdots C^S_{l_S}) \,|\, (H^1_{k_1'} H^2_{k_2'} \cdots H^R_{k_R'})(C^1_{l_1'} C^2_{l_2'} \cdots C^S_{l_S'})$ ($k_1 \ge k_1', k_2 \ge k_2', \ldots, k_R \ge k_R' = 1, 0$; $l_1 \ge l_1', l_2 \ge l_2', \ldots, l_S \ge l_S' = 1, 0$). We use a vertical line to separate two haplotypes derived from the maternal and paternal parents, respectively, for a given diplotype. Under the HWE assumption, diplotype frequencies are expressed as the products of the frequencies of the two haplotypes that constitute the diplotype, i.e.,

$$P_{(k_1 \cdots k_R)(l_1 \cdots l_S)\,|\,(k_1' \cdots k_R')(l_1' \cdots l_S')} =
\begin{cases}
p^2_{(k_1 \cdots k_R)(l_1 \cdots l_S)} & \text{if } k_1 = k_1', \ldots, k_R = k_R',\ l_1 = l_1', \ldots, l_S = l_S', \\
2\, p_{(k_1 \cdots k_R)(l_1 \cdots l_S)}\, p_{(k_1' \cdots k_R')(l_1' \cdots l_S')} & \text{otherwise.}
\end{cases}$$

In practice, diplotypes cannot be observed, although observable zygotic genotypes will be the same as diplotypes when at most one SNP is heterozygous. Thus, the number of zygotic genotypes, $3^{R+S}$, will be less than the number of diplotypes.
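
To make the counting concrete, the short sketch below tabulates the number of joint haplotypes, diplotypes, and zygotic genotypes for given $R$ and $S$. The function name `counts` is hypothetical and the code simply evaluates the three expressions stated in the text.

```python
def counts(R, S):
    """Return (haplotypes, diplotypes, zygotic genotypes) for R host and S cancer SNPs."""
    m = R + S
    n_haplotypes = 2 ** m
    # unordered pairs of haplotypes, including pairs of identical haplotypes
    n_diplotypes = n_haplotypes * (n_haplotypes + 1) // 2
    n_genotypes = 3 ** m
    return n_haplotypes, n_diplotypes, n_genotypes

two_by_two = counts(R=2, S=2)
print(two_by_two)
```

For two host SNPs and two cancer SNPs this gives 16 haplotypes, 136 diplotypes, and 81 genotypes, so 55 diplotypes are hidden inside genotypes that are heterozygous at two or more SNPs.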

Quantitative Genetic Model: Our model will be derived to characterize

haplotypes that are responsible for complex traits because the association between

haplotype diversity and phenotypic variation has been detected by several genetic studies

(Bader, 2001; Judson et al., 2000; Rha et al., 2007). Among all possible haplotypes, there

are some particular haplotypes, called the risk haplotypes ($A$), that perform differently

from the rest of the haplotypes, called the non-risk haplotypes ($\bar{A}$). The combinations

between the risk and non-risk haplotypes, $AA$, $A\bar{A}$, and $\bar{A}\bar{A}$, are called composite

diplotypes (Liu et al., 2004; Wu and Lin, 2008). Thus, by testing the differences in the

genotypic value of a trait among composite diplotypes, we can estimate the genetic effects

of haplotypes on the trait. It is also feasible to detect epistatic interactions between

haplotypes from different genomes.

Let $A$, $\bar{A}$ and $B$, $\bar{B}$ denote the risk and non-risk haplotypes for a series of

SNPs genotyped from the host and cancer genomes, respectively. These two genomes form

nine different composite diplotypes expressed as $AABB$, $AAB\bar{B}$, $AA\bar{B}\bar{B}$, $A\bar{A}BB$, $A\bar{A}B\bar{B}$,

$A\bar{A}\bar{B}\bar{B}$, $\bar{A}\bar{A}BB$, $\bar{A}\bar{A}B\bar{B}$, and $\bar{A}\bar{A}\bar{B}\bar{B}$. We will use Mather and Jinks's (1982) formulation

for genetic epistasis between different loci (Table 5-2) to model the genetic effects of the

composite diplotypes. The genotypic value ($\mu_{j_1 j_2}$) of a joint composite diplotype from the

two genomes can be decomposed into nine different components as follows:

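$$\begin{aligned}
\mu_{j_1 j_2} ={}& \mu && \text{Overall mean} \\
&+ (j_1 - 1)a_1 + (j_2 - 1)a_2 && \text{Additive effects} \\
&+ j_1(2 - j_1)d_1 + j_2(2 - j_2)d_2 && \text{Dominance effects} \\
&+ (j_1 - 1)(j_2 - 1)\, i_{aa} && \text{Additive} \times \text{additive effect} \\
&+ (j_1 - 1)\, j_2(2 - j_2)\, i_{ad} && \text{Additive} \times \text{dominance effect} \\
&+ j_1(2 - j_1)(j_2 - 1)\, i_{da} && \text{Dominance} \times \text{additive effect} \\
&+ j_1(2 - j_1)\, j_2(2 - j_2)\, i_{dd} && \text{Dominance} \times \text{dominance effect,}
\end{aligned} \tag{5-12}$$

where

$$j_1, j_2 = \begin{cases}
2 & \text{for } AA \text{ or } BB, \\
1 & \text{for } A\bar{A} \text{ or } B\bar{B}, \\
0 & \text{for } \bar{A}\bar{A} \text{ or } \bar{B}\bar{B}
\end{cases}$$

stand for the composite diplotypes from the host and cancer genomes, respectively; $\mu$ is the overall mean; $a_1$ and $a_2$ are the additive effects of haplotypes from the host and cancer genomes; $d_1$ and $d_2$, the dominance effects of haplotypes; and $i_{aa}$, $i_{ad}$, $i_{da}$, and $i_{dd}$, the additive x additive, additive x dominance, dominance x additive, and dominance x dominance epistatic effects between the haplotypes from the two different genomes (Table 5-2).

Table 5-2. Additive, dominance, and epistatic compositions of the genotypic value of a
composite diplotype constructed with haplotypes from the host and cancer genomes.

Host \ Cancer        $BB$                              $B\bar{B}$                        $\bar{B}\bar{B}$
$AA$                 $\mu + a_1 + a_2 + i_{aa}$        $\mu + a_1 + d_2 + i_{ad}$        $\mu + a_1 - a_2 - i_{aa}$
$A\bar{A}$           $\mu + d_1 + a_2 + i_{da}$        $\mu + d_1 + d_2 + i_{dd}$        $\mu + d_1 - a_2 - i_{da}$
$\bar{A}\bar{A}$     $\mu - a_1 + a_2 - i_{aa}$        $\mu - a_1 + d_2 - i_{ad}$        $\mu - a_1 - a_2 + i_{aa}$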

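Different types of genetic actions and interactions can be expressed in terms of genotypic values by solving the system of equations (5-12). Writing $\mu_{j_1 j_2}$ for the genotypic value of the composite diplotype with $j_1$ and $j_2$ copies of the risk haplotype in the host and cancer genomes, this lets us describe the overall mean, additive, dominance, and four kinds of epistatic effects between the two genomes by

$$\begin{aligned}
\mu &= \tfrac{1}{4}(\mu_{22} + \mu_{20} + \mu_{02} + \mu_{00}), \\
a_1 &= \tfrac{1}{4}(\mu_{22} + \mu_{20} - \mu_{02} - \mu_{00}), \\
a_2 &= \tfrac{1}{4}(\mu_{22} - \mu_{20} + \mu_{02} - \mu_{00}), \\
d_1 &= \tfrac{1}{2}(\mu_{12} + \mu_{10}) - \mu, \\
d_2 &= \tfrac{1}{2}(\mu_{21} + \mu_{01}) - \mu, \\
i_{aa} &= \tfrac{1}{4}(\mu_{22} - \mu_{20} - \mu_{02} + \mu_{00}), \\
i_{ad} &= \tfrac{1}{2}(\mu_{21} - \mu_{01}) - a_1, \\
i_{da} &= \tfrac{1}{2}(\mu_{12} - \mu_{10}) - a_2, \\
i_{dd} &= \mu_{11} - \mu - d_1 - d_2.
\end{aligned} \tag{5-13}$$

Thus, by testing the significance of $i_{aa}$, $i_{ad}$, $i_{da}$, and $i_{dd}$, we can judge whether there is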

epistasis and how the epistasis affects a phenotypic trait.
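
The relationship between the nine genotypic values and the nine genetic parameters can be illustrated with a round trip: build the values from a set of effects, then solve for the effects again. This is a sketch under stated assumptions rather than the author's implementation. It assumes the coding $j = 2, 1, 0$ for the number of risk haplotypes and uses $j(2 - j)$ as the heterozygote indicator, which reproduces the entries of Table 5-2. The function names and the effect sizes are hypothetical.

```python
def genotypic_value(j1, j2, e):
    """Genotypic value of a composite diplotype; j1, j2 in {2, 1, 0} count risk haplotypes."""
    x1, x2 = j1 - 1, j2 - 1                    # additive codes: +1, 0, -1
    z1, z2 = j1 * (2 - j1), j2 * (2 - j2)      # dominance codes: 1 only for heterozygotes
    return (e["mu"] + x1 * e["a1"] + x2 * e["a2"] + z1 * e["d1"] + z2 * e["d2"]
            + x1 * x2 * e["iaa"] + x1 * z2 * e["iad"]
            + z1 * x2 * e["ida"] + z1 * z2 * e["idd"])

def solve_effects(m):
    """Recover the nine parameters from the nine genotypic values m[(j1, j2)]."""
    mu = (m[2, 2] + m[2, 0] + m[0, 2] + m[0, 0]) / 4
    a1 = (m[2, 2] + m[2, 0] - m[0, 2] - m[0, 0]) / 4
    a2 = (m[2, 2] - m[2, 0] + m[0, 2] - m[0, 0]) / 4
    iaa = (m[2, 2] - m[2, 0] - m[0, 2] + m[0, 0]) / 4
    d1 = (m[1, 2] + m[1, 0]) / 2 - mu
    d2 = (m[2, 1] + m[0, 1]) / 2 - mu
    iad = (m[2, 1] - m[0, 1]) / 2 - a1
    ida = (m[1, 2] - m[1, 0]) / 2 - a2
    idd = m[1, 1] - mu - d1 - d2
    return {"mu": mu, "a1": a1, "a2": a2, "d1": d1, "d2": d2,
            "iaa": iaa, "iad": iad, "ida": ida, "idd": idd}

# Hypothetical effect sizes, used only to exercise the two functions.
effects = {"mu": 10.0, "a1": 1.0, "a2": 0.5, "d1": 0.3, "d2": -0.2,
           "iaa": 0.4, "iad": 0.1, "ida": -0.1, "idd": 0.25}
values = {(j1, j2): genotypic_value(j1, j2, effects)
          for j1 in (2, 1, 0) for j2 in (2, 1, 0)}
recovered = solve_effects(values)
```

Because there are nine values and nine parameters, the system is exactly determined and `recovered` matches `effects` up to rounding.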

5.2.3 Estimation Procedures

We will estimate two types of parameters for cancer gene identification. The first is

the population genetic parameters ($\Omega_p$) that describe the distribution and diversity of

haplotypes in the sampled population quantified by haplotype frequencies for multiple

SNPs from the host and cancer genomes, allele frequencies at these SNPs and their

linkage disequilibria. The second is the quantitative genetic parameters ($\Omega_q$) that describe the

genotypic values of composite diplotypes, specified by the action and interaction effects

of haplotypes on cancer susceptibility, and residual variance. Given observed phenotypic

($y$) and marker data from the host ($H$) and cancer ($C$) genomes, we construct a likelihood
and factorize it into two components:

$$\log L(\Omega_p, \Omega_q \mid y, H, C) = \log L(\Omega_p \mid H, C) + \log L(\Omega_q \mid y, H, C, \Omega_p), \tag{5-14}$$

where the first component is related to haplotype frequencies and the second component

related to haplotype effects and variance. Thus, maximizing the likelihood (5-14) is

equivalent to maximizing its two components separately.

Estimating Across-Genome Haplotype Frequencies: Because the same

genotype may be formed from multiple diplotypes, we need to incorporate the EM

algorithm to estimate the unknown diplotype of a genotype, which is statistically viewed

as a missing data problem. An observed zygotic genotype is generally expressed as

$(H^1_{k_1}H^1_{k_1'}/H^2_{k_2}H^2_{k_2'}/\cdots/H^R_{k_R}H^R_{k_R'})(C^1_{l_1}C^1_{l_1'}/C^2_{l_2}C^2_{l_2'}/\cdots/C^S_{l_S}C^S_{l_S'})$, where the slashes are used
to separate genotypes at different SNPs. Let $n_{(k_1k_1'/k_2k_2'/\cdots/k_Rk_R')(l_1l_1'/l_2l_2'/\cdots/l_Sl_S')}$ (which sum

to a total sample size of $n$ subjects) be the observation of a typical joint-SNP genotype

from the host and cancer genomes. Table 5-3 is an example of data structure for genotypic

observations of two SNPs derived from the host genome and two SNPs from the cancer

genome. The table also provides the expected frequencies of different genotypes in terms

of haplotype frequencies.
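
The data structure described above can be generated mechanically for the two-by-two case: every unordered pair of the 16 joint haplotypes is a diplotype, and diplotypes with the same alleles at each SNP collapse into one zygotic genotype. The sketch below is an illustration only. The haplotype layout `(k1, k2, l1, l2)`, the function names, and the uniform frequencies in the example are assumptions, and HWE is assumed when computing expected genotype frequencies.

```python
from itertools import product, combinations_with_replacement
from collections import defaultdict

HAPLOTYPES = list(product((1, 0), repeat=4))  # (k1, k2, l1, l2): two host, two cancer SNPs

def genotype_of(h1, h2):
    """Unphased genotype: the unordered allele pair at each SNP, larger allele first."""
    return tuple(tuple(sorted((a, b), reverse=True)) for a, b in zip(h1, h2))

def diplotypes_by_genotype():
    """Map each of the 81 zygotic genotypes to the diplotypes that can produce it."""
    table = defaultdict(list)
    for h1, h2 in combinations_with_replacement(HAPLOTYPES, 2):
        table[genotype_of(h1, h2)].append((h1, h2))
    return table

def genotype_frequency(diplotypes, p):
    """Expected genotype frequency under HWE: p^2 for homozygous, 2*p*p' otherwise."""
    return sum((1 if h1 == h2 else 2) * p[h1] * p[h2] for h1, h2 in diplotypes)

table = diplotypes_by_genotype()
full_het = ((1, 0),) * 4                       # genotype (10/10)(10/10)
uniform = {h: 1 / 16 for h in HAPLOTYPES}      # hypothetical haplotype frequencies
total = sum(genotype_frequency(d, uniform) for d in table.values())
```

The genotype that is heterozygous at all four SNPs is consistent with eight diplotypes, which is why its expected frequency is a sum of eight products of haplotype frequencies.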

Based on the information about observed data, it is not difficult to construct a

multinomial likelihood, $\log L(\Omega_p \mid H, C)$, in which a mixture model is incorporated for

those genotypes that are heterozygous at two or more SNPs. The EM algorithm is derived

to maximize this observed-data likelihood. In the E step, we calculate the expected

number of a particular across-genome haplotype $(H^1_{k_1}H^2_{k_2} \cdots H^R_{k_R})(C^1_{l_1}C^2_{l_2} \cdots C^S_{l_S})$ within

the mixture of diplotypes that form the same genotype. For example, such an expected

number is calculated for two SNPs from the host genome and two SNPs from the cancer

Table 5-3. Observed 81 joint host-cancer SNP genotypes and their frequencies described in
terms of their haplotype/diplotype compositions. Host and cancer genotypes are each ordered as 11/11, 11/10, 11/00, 10/11, 10/10, 10/00, 00/11, 00/10, 00/00; representative rows are shown, and the remaining rows follow the same pattern.

No.  Host genotype              Cancer genotype            Observation             Frequency
1    $H^1_1H^1_1/H^2_1H^2_1$    $C^1_1C^1_1/C^2_1C^2_1$    $n_{(11/11)(11/11)}$    $p^2_{(11)(11)}$
2    $H^1_1H^1_1/H^2_1H^2_1$    $C^1_1C^1_1/C^2_1C^2_0$    $n_{(11/11)(11/10)}$    $2p_{(11)(11)}p_{(11)(10)}$
5    $H^1_1H^1_1/H^2_1H^2_1$    $C^1_1C^1_0/C^2_1C^2_0$    $n_{(11/11)(10/10)}$    $2p_{(11)(11)}p_{(11)(00)} + 2p_{(11)(10)}p_{(11)(01)}$
10   $H^1_1H^1_1/H^2_1H^2_0$    $C^1_1C^1_1/C^2_1C^2_1$    $n_{(11/10)(11/11)}$    $2p_{(11)(11)}p_{(10)(11)}$
14   $H^1_1H^1_1/H^2_1H^2_0$    $C^1_1C^1_0/C^2_1C^2_0$    $n_{(11/10)(10/10)}$    $2p_{(11)(11)}p_{(10)(00)} + 2p_{(11)(10)}p_{(10)(01)} + 2p_{(10)(11)}p_{(11)(00)} + 2p_{(10)(10)}p_{(11)(01)}$
41   $H^1_1H^1_0/H^2_1H^2_0$    $C^1_1C^1_0/C^2_1C^2_0$    $n_{(10/10)(10/10)}$    $2p_{(11)(11)}p_{(00)(00)} + 2p_{(11)(10)}p_{(00)(01)} + 2p_{(11)(01)}p_{(00)(10)} + 2p_{(11)(00)}p_{(00)(11)} + 2p_{(10)(11)}p_{(01)(00)} + 2p_{(10)(10)}p_{(01)(01)} + 2p_{(10)(01)}p_{(01)(10)} + 2p_{(10)(00)}p_{(01)(11)}$
81   $H^1_0H^1_0/H^2_0H^2_0$    $C^1_0C^1_0/C^2_0C^2_0$    $n_{(00/00)(00/00)}$    $p^2_{(00)(00)}$

genome using the following formulas. For a joint genotype that is heterozygous at exactly two SNPs, two diplotypes are possible. When both cancer SNPs are heterozygous, the posterior probability of the diplotype $(k_1k_2)(l_1l_2)\,|\,(k_1k_2)(l_1'l_2')$ is

$$\phi_1 = \frac{p_{(k_1k_2)(l_1l_2)}\, p_{(k_1k_2)(l_1'l_2')}}{p_{(k_1k_2)(l_1l_2)}\, p_{(k_1k_2)(l_1'l_2')} + p_{(k_1k_2)(l_1l_2')}\, p_{(k_1k_2)(l_1'l_2)}},$$

and when both host SNPs are heterozygous, the posterior probability of the diplotype $(k_1k_2)(l_1l_2)\,|\,(k_1'k_2')(l_1l_2)$ is

$$\phi_6 = \frac{p_{(k_1k_2)(l_1l_2)}\, p_{(k_1'k_2')(l_1l_2)}}{p_{(k_1k_2)(l_1l_2)}\, p_{(k_1'k_2')(l_1l_2)} + p_{(k_1k_2')(l_1l_2)}\, p_{(k_1'k_2)(l_1l_2)}}.$$

The probabilities $\phi_2, \ldots, \phi_5$ are defined in the same way for the four pairs formed by one host SNP and one cancer SNP. In general, for a joint genotype $g$ with the set $\mathcal{D}(g)$ of diplotypes consistent with it, the posterior probability of diplotype $h|h'$ is

$$\phi_{h|h'}(g) = \frac{p_h\, p_{h'}}{\sum_{u|u' \in \mathcal{D}(g)} p_u\, p_{u'}}, \tag{5-15}$$

where the denominator contains two terms for a genotype heterozygous at two SNPs, four terms for a genotype heterozygous at three SNPs, and, for the genotype heterozygous at all four SNPs, the eight terms

$$\begin{aligned}
P_\psi ={}& p_{(k_1k_2)(l_1l_2)}p_{(k_1'k_2')(l_1'l_2')} + p_{(k_1k_2)(l_1l_2')}p_{(k_1'k_2')(l_1'l_2)} + p_{(k_1k_2)(l_1'l_2)}p_{(k_1'k_2')(l_1l_2')} + p_{(k_1k_2)(l_1'l_2')}p_{(k_1'k_2')(l_1l_2)} \\
&+ p_{(k_1k_2')(l_1l_2)}p_{(k_1'k_2)(l_1'l_2')} + p_{(k_1k_2')(l_1l_2')}p_{(k_1'k_2)(l_1'l_2)} + p_{(k_1k_2')(l_1'l_2)}p_{(k_1'k_2)(l_1l_2')} + p_{(k_1k_2')(l_1'l_2')}p_{(k_1'k_2)(l_1l_2)}.
\end{aligned}$$

In the M step, we estimate a haplotype frequency with the expected number of haplotypes calculated above and the observations given in Table 5-3 by

$$\hat{p}_{(k_1k_2)(l_1l_2)} = \frac{1}{2n} \sum_{g} n_g \sum_{h|h' \in \mathcal{D}(g)} \phi_{h|h'}(g)\, c_{(k_1k_2)(l_1l_2)}(h|h'), \tag{5-16}$$

where $n_g$ is the observation of genotype $g$ and $c_{(k_1k_2)(l_1l_2)}(h|h')$ is the number of copies (0, 1, or 2) of haplotype $(k_1k_2)(l_1l_2)$ in diplotype $h|h'$. Thus the fully homozygous genotype contributes $2n_{(k_1k_1/k_2k_2)(l_1l_1/l_2l_2)}$, each genotype heterozygous at a single SNP contributes its full observation, and each genotype heterozygous at two or more SNPs contributes its observation weighted by the posterior probabilities from the E step.

Both the E and M steps are iterated between equations (5-15) and (5-16) until the

estimates converge to stable values. The estimates at convergence are the maximum

likelihood estimates (MLEs) of haplotype frequencies. The MLEs of allele frequencies at

different SNPs and their linkage disequilibria of different orders can be solved from these

estimated haplotype frequencies using a system of equations given in Table 5-1.
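
The E and M steps can be sketched generically for any number of biallelic SNPs. The code below is a minimal illustration under stated assumptions, not the dissertation's software: genotypes are coded as the number of copies of allele 1 at each SNP (2, 1, or 0), HWE is assumed, iteration starts from uniform frequencies, and the function name and arguments are hypothetical.

```python
from itertools import product, combinations_with_replacement
from collections import defaultdict

def em_haplotype_frequencies(genotype_counts, n_snps, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from unphased genotype counts by EM."""
    haps = list(product((1, 0), repeat=n_snps))
    # All diplotypes compatible with each genotype (allele-1 copy numbers per SNP).
    compat = defaultdict(list)
    for h1, h2 in combinations_with_replacement(haps, 2):
        compat[tuple(a + b for a, b in zip(h1, h2))].append((h1, h2))
    n = sum(genotype_counts.values())
    p = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        expected = dict.fromkeys(haps, 0.0)
        for g, count in genotype_counts.items():
            # E step: posterior probability of each compatible diplotype.
            weights = [(1 if h1 == h2 else 2) * p[h1] * p[h2] for h1, h2 in compat[g]]
            total = sum(weights)
            for (h1, h2), w in zip(compat[g], weights):
                expected[h1] += count * w / total
                expected[h2] += count * w / total
        # M step: expected haplotype counts divided by the 2n sampled haplotypes.
        new_p = {h: expected[h] / (2 * n) for h in haps}
        change = max(abs(new_p[h] - p[h]) for h in haps)
        p = new_p
        if change < tol:
            break
    return p

# Hypothetical two-SNP data without double heterozygotes, so phase is never ambiguous.
counts = {(2, 2): 6, (2, 1): 4, (0, 0): 10}
p_hat = em_haplotype_frequencies(counts, n_snps=2)
```

With no ambiguous genotypes the estimates reduce to simple haplotype counting. Passing `n_snps=4` with keys such as `(1, 1, 1, 1)` applies the same routine to two host SNPs and two cancer SNPs.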

Estimating Across-Genome Haplotype Interactions: To detect how haplotypes

or diplotypes are associated with phenotypic variation in a trait (y), we will first assume a

risk haplotype from the host genome and a risk haplotype from the cancer genome. These

two types of risk haplotypes will generate composite diplotypes. As an example shown

in Table 5-3, in which there are two SNPs from each genome, we assume $H^1_1H^2_1$ from the

host genome and $C^1_1C^2_1$ from the cancer genome as two risk haplotypes. This leads to nine

different across-genome composite diplotypes. A mixture-based likelihood for quantitative

genetic parameters ($\Omega_q$) is formulated as

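$$\begin{aligned}
\log L(\Omega_q \mid y, H, C, \hat{\Omega}_p)
={}& \sum_{i=1}^{n_{(11/11)(11/11)}} \log f_{AABB}(y_i) + \sum_{i=1}^{n_{(11/11)(\cdot)}} \log f_{AAB\bar{B}}(y_i) + \sum_{i=1}^{n_{(11/11)(\cdot\cdot)}} \log f_{AA\bar{B}\bar{B}}(y_i) \\
&+ \sum_{i=1}^{n_{(11/11)(10/10)}} \log[\omega_C f_{AAB\bar{B}}(y_i) + (1 - \omega_C) f_{AA\bar{B}\bar{B}}(y_i)] \\
&+ \sum_{i=1}^{n_{(\cdot)(11/11)}} \log f_{A\bar{A}BB}(y_i) + \sum_{i=1}^{n_{(\cdot)(\cdot)}} \log f_{A\bar{A}B\bar{B}}(y_i) + \sum_{i=1}^{n_{(\cdot)(\cdot\cdot)}} \log f_{A\bar{A}\bar{B}\bar{B}}(y_i) \\
&+ \sum_{i=1}^{n_{(\cdot)(10/10)}} \log[\omega_C f_{A\bar{A}B\bar{B}}(y_i) + (1 - \omega_C) f_{A\bar{A}\bar{B}\bar{B}}(y_i)] \\
&+ \sum_{i=1}^{n_{(\cdot\cdot)(11/11)}} \log f_{\bar{A}\bar{A}BB}(y_i) + \sum_{i=1}^{n_{(\cdot\cdot)(\cdot)}} \log f_{\bar{A}\bar{A}B\bar{B}}(y_i) + \sum_{i=1}^{n_{(\cdot\cdot)(\cdot\cdot)}} \log f_{\bar{A}\bar{A}\bar{B}\bar{B}}(y_i) \\
&+ \sum_{i=1}^{n_{(\cdot\cdot)(10/10)}} \log[\omega_C f_{\bar{A}\bar{A}B\bar{B}}(y_i) + (1 - \omega_C) f_{\bar{A}\bar{A}\bar{B}\bar{B}}(y_i)] \\
&+ \sum_{i=1}^{n_{(10/10)(11/11)}} \log[\omega_H f_{A\bar{A}BB}(y_i) + (1 - \omega_H) f_{\bar{A}\bar{A}BB}(y_i)] \\
&+ \sum_{i=1}^{n_{(10/10)(\cdot)}} \log[\omega_H f_{A\bar{A}B\bar{B}}(y_i) + (1 - \omega_H) f_{\bar{A}\bar{A}B\bar{B}}(y_i)] \\
&+ \sum_{i=1}^{n_{(10/10)(\cdot\cdot)}} \log[\omega_H f_{A\bar{A}\bar{B}\bar{B}}(y_i) + (1 - \omega_H) f_{\bar{A}\bar{A}\bar{B}\bar{B}}(y_i)] \\
&+ \sum_{i=1}^{n_{(10/10)(10/10)}} \log[\omega_1 f_{A\bar{A}B\bar{B}}(y_i) + \omega_2 f_{A\bar{A}\bar{B}\bar{B}}(y_i) + \omega_3 f_{\bar{A}\bar{A}B\bar{B}}(y_i) + \omega_4 f_{\bar{A}\bar{A}\bar{B}\bar{B}}(y_i)],
\end{aligned} \tag{5-17}$$

where, with $\mathcal{S} = \{11/10,\ 10/11\}$ denoting the genotypes that carry exactly one risk haplotype and $\mathcal{T} = \{11/00,\ 10/00,\ 00/11,\ 00/10,\ 00/00\}$ the genotypes that carry none,

$$\begin{aligned}
n_{(11/11)(\cdot)} &= n_{(11/11)(11/10)} + n_{(11/11)(10/11)}, \\
n_{(11/11)(\cdot\cdot)} &= n_{(11/11)(11/00)} + n_{(11/11)(10/00)} + n_{(11/11)(00/11)} + n_{(11/11)(00/10)} + n_{(11/11)(00/00)}, \\
n_{(\cdot)(11/11)} &= n_{(11/10)(11/11)} + n_{(10/11)(11/11)}, \\
n_{(\cdot\cdot)(11/11)} &= n_{(11/00)(11/11)} + n_{(10/00)(11/11)} + n_{(00/11)(11/11)} + n_{(00/10)(11/11)} + n_{(00/00)(11/11)}, \\
n_{(\cdot)(\cdot)} &= n_{(11/10)(11/10)} + n_{(11/10)(10/11)} + n_{(10/11)(11/10)} + n_{(10/11)(10/11)}, \\
n_{(\cdot)(\cdot\cdot)} &= \sum_{g \in \mathcal{S}} \sum_{g' \in \mathcal{T}} n_{(g)(g')}, \qquad
n_{(\cdot\cdot)(\cdot)} = \sum_{g \in \mathcal{T}} \sum_{g' \in \mathcal{S}} n_{(g)(g')}, \qquad
n_{(\cdot\cdot)(\cdot\cdot)} = \sum_{g \in \mathcal{T}} \sum_{g' \in \mathcal{T}} n_{(g)(g')}, \\
n_{(\cdot)(10/10)} &= n_{(11/10)(10/10)} + n_{(10/11)(10/10)}, \\
n_{(10/10)(\cdot)} &= n_{(10/10)(11/10)} + n_{(10/10)(10/11)}, \\
n_{(\cdot\cdot)(10/10)} &= n_{(11/00)(10/10)} + n_{(10/00)(10/10)} + n_{(00/11)(10/10)} + n_{(00/10)(10/10)} + n_{(00/00)(10/10)}, \\
n_{(10/10)(\cdot\cdot)} &= n_{(10/10)(11/00)} + n_{(10/10)(10/00)} + n_{(10/10)(00/11)} + n_{(10/10)(00/10)} + n_{(10/10)(00/00)},
\end{aligned}$$

and the mixture proportions are

$$\omega_C = \frac{p^C_{11}\, p^C_{00}}{p^C_{11}\, p^C_{00} + p^C_{10}\, p^C_{01}}, \qquad
\omega_H = \frac{p^H_{11}\, p^H_{00}}{p^H_{11}\, p^H_{00} + p^H_{10}\, p^H_{01}},$$

with

$$p^H_{k_1k_2} = \sum_{l_1=0}^{1} \sum_{l_2=0}^{1} p_{(k_1k_2)(l_1l_2)}, \qquad
p^C_{l_1l_2} = \sum_{k_1=0}^{1} \sum_{k_2=0}^{1} p_{(k_1k_2)(l_1l_2)},$$

and $\omega_1, \ldots, \omega_4$ proportional, with a common normalizing constant so that they sum to one, to

$$\begin{aligned}
\omega_1 &\propto p_{(11)(11)}p_{(00)(00)} + p_{(11)(00)}p_{(00)(11)}, \\
\omega_2 &\propto p_{(11)(10)}p_{(00)(01)} + p_{(11)(01)}p_{(00)(10)}, \\
\omega_3 &\propto p_{(10)(11)}p_{(01)(00)} + p_{(01)(11)}p_{(10)(00)}, \\
\omega_4 &\propto p_{(10)(10)}p_{(01)(01)} + p_{(10)(01)}p_{(01)(10)}.
\end{aligned}$$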

In likelihood (5-17), we model each density $f(y_i)$ by a normal distribution with a composite-diplotype-specific mean $\mu$ (subscripted by the composite diplotype, e.g. $\mu_{AABB}$) and a common variance $\sigma^2$. The EM algorithm is implemented to estimate the means and variance that maximize the likelihood. In the E step, we calculate the posterior probability of a particular diplotype within a genotype for SNPs across the genomes using













\[
\Phi^{C}_{(g)(10/10)i} = \frac{\omega_C f_{aB\bar{B}}(y_i)}{\omega_C f_{aB\bar{B}}(y_i) + (1-\omega_C) f_{a\bar{B}\bar{B}}(y_i)},
\qquad (g, a) \in \{(11/11, AA),\ (\cdot, A\bar{A}),\ (\cdot\cdot, \bar{A}\bar{A})\},
\]

\[
\Phi^{H}_{(10/10)(g)i} = \frac{\omega_H f_{A\bar{A}b}(y_i)}{\omega_H f_{A\bar{A}b}(y_i) + (1-\omega_H) f_{\bar{A}\bar{A}b}(y_i)},
\qquad (g, b) \in \{(11/11, BB),\ (\cdot, B\bar{B}),\ (\cdot\cdot, \bar{B}\bar{B})\},
\]

\[
\Phi^{j}_{(10/10)(10/10)i} = \frac{\omega_j f_j(y_i)}{\Delta_i},
\qquad j \in \{A\bar{A}B\bar{B},\ A\bar{A}\bar{B}\bar{B},\ \bar{A}\bar{A}B\bar{B},\ \bar{A}\bar{A}\bar{B}\bar{B}\},
\tag{5-18}
\]

with $\Delta_i = \sum_{j} \omega_j f_j(y_i)$, where $\omega_C$, $\omega_H$ and $\omega_j$ are the prior probabilities of the corresponding diplotypes within the double-heterozygous genotypes.

In the M step, the quantitative genetic parameters are estimated by

\[
\hat{\mu}_j = \frac{\sum_{i=1}^{n} \Pi_{j|i}\, y_i}{\sum_{i=1}^{n} \Pi_{j|i}},
\qquad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j} \Pi_{j|i}\, (y_i - \hat{\mu}_j)^2,
\tag{5-19}
\]

where $j$ runs over the nine composite diplotypes and $\Pi_{j|i}$ is the weight with which individual $i$ contributes to composite diplotype $j$: it equals 1 if the genotype of individual $i$ determines $j$ uniquely, the posterior probability from the E step if the genotype is a double heterozygote compatible with $j$ (with $1-\Phi$ for the alternative diplotype of a two-component mixture), and 0 otherwise. For example,

\[
\hat{\mu}_{AAB\bar{B}} =
\frac{\sum_{i=1}^{n_{(11/11)(\cdot)}} y_i + \sum_{i=1}^{n_{(11/11)(10/10)}} \Phi^{C}_{(11/11)(10/10)i}\, y_i}
{n_{(11/11)(\cdot)} + \sum_{i=1}^{n_{(11/11)(10/10)}} \Phi^{C}_{(11/11)(10/10)i}},
\qquad
\hat{\mu}_{AA\bar{B}\bar{B}} =
\frac{\sum_{i=1}^{n_{(11/11)(\cdot\cdot)}} y_i + \sum_{i=1}^{n_{(11/11)(10/10)}} \left(1-\Phi^{C}_{(11/11)(10/10)i}\right) y_i}
{n_{(11/11)(\cdot\cdot)} + \sum_{i=1}^{n_{(11/11)(10/10)}} \left(1-\Phi^{C}_{(11/11)(10/10)i}\right)},
\]

\[
\hat{\mu}_{A\bar{A}B\bar{B}} =
\frac{\sum_{i=1}^{n_{(\cdot)(\cdot)}} y_i
+ \sum_{i=1}^{n_{(\cdot)(10/10)}} \Phi^{C}_{(\cdot)(10/10)i}\, y_i
+ \sum_{i=1}^{n_{(10/10)(\cdot)}} \Phi^{H}_{(10/10)(\cdot)i}\, y_i
+ \sum_{i=1}^{n_{(10/10)(10/10)}} \Phi^{A\bar{A}B\bar{B}}_{(10/10)(10/10)i}\, y_i}
{n_{(\cdot)(\cdot)}
+ \sum_{i=1}^{n_{(\cdot)(10/10)}} \Phi^{C}_{(\cdot)(10/10)i}
+ \sum_{i=1}^{n_{(10/10)(\cdot)}} \Phi^{H}_{(10/10)(\cdot)i}
+ \sum_{i=1}^{n_{(10/10)(10/10)}} \Phi^{A\bar{A}B\bar{B}}_{(10/10)(10/10)i}}.
\]


A loop of the E and M steps is formulated between equations (5-18) and (5-19) to obtain

the MLEs of the genotypic values and variance.
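
The E/M loop can be sketched generically as follows. This is an illustrative simplification, not the dissertation's implementation: each individual is given a vector of prior probabilities over the composite diplotype classes (a one-hot vector when the genotype determines the class, the mixture proportions $\omega$ when it does not), and the function name `em_genotypic_values`, the starting values, and the fixed number of iterations are assumptions.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_genotypic_values(y, priors, n_iter=200):
    """Estimate class means and a common variance for a normal mixture.

    y       : list of phenotypes
    priors  : priors[i][j] is the prior probability that individual i belongs
              to class j (each row sums to 1); these stay fixed during the EM.
    Assumes the data are not degenerate (the variance stays positive).
    """
    n, k = len(y), len(priors[0])
    grand_mean = sum(y) / n
    var = sum((v - grand_mean) ** 2 for v in y) / n
    # Starting values: prior-weighted class means.
    mu = []
    for j in range(k):
        weight = sum(p[j] for p in priors)
        mu.append(sum(p[j] * v for p, v in zip(priors, y)) / weight if weight > 0 else grand_mean)
    for _ in range(n_iter):
        # E step: posterior class probabilities for every individual.
        post = []
        for v, p in zip(y, priors):
            w = [p[j] * normal_pdf(v, mu[j], var) for j in range(k)]
            total = sum(w)
            post.append([wj / total for wj in w] if total > 0 else list(p))
        # M step: posterior-weighted means and pooled variance.
        for j in range(k):
            weight = sum(q[j] for q in post)
            if weight > 0:
                mu[j] = sum(q[j] * v for q, v in zip(post, y)) / weight
        var = sum(q[j] * (v - mu[j]) ** 2 for q, v in zip(post, y) for j in range(k)) / n
    return mu, var
```

In the across-genome model there would be nine classes, and the prior rows of the phase-ambiguous individuals would hold $\omega_C$, $\omega_H$, or the four proportions of the genotype that is doubly heterozygous in both genomes.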

For a practical data set, the risk haplotypes are unknown. A combinatorial approach is used to detect the optimal combination of risk haplotypes from the host and cancer genomes. With two SNPs per genome, each genome has four candidate haplotypes, so the combination is chosen from all 16 possible combinations. The combination that gives the largest likelihood is taken as the best risk-haplotype combination. Under this optimal combination, we estimate the genotypic values of the composite diplotypes. From the estimated $\mu$ values, we can then solve for the additive, dominance, and epistatic effects between haplotypes from the two genomes using the system of equations (5-13).
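
The step from nine genotypic values to nine genetic effects can be illustrated with a small sketch. System (5-13) is not reproduced in this section, so the code below assumes the common two-locus parameterization in which each genome is coded 1 (two copies of the risk haplotype), 0 (one copy), or -1 (no copy); the effect names `i_aa`, `i_ad`, `i_da`, and `i_dd` are labels chosen for the example and may differ from the dissertation's notation.

```python
# Codes: 1 = homozygous for the risk haplotype, 0 = heterozygous, -1 = no risk haplotype.
CODES = (1, 0, -1)


def genotypic_value(effects, x, z):
    """Genotypic value for host code x and cancer code z (assumed model)."""
    hx, hz = int(x == 0), int(z == 0)  # heterozygote indicators
    return (effects["mu"]
            + x * effects["a_H"] + z * effects["a_C"]
            + hx * effects["d_H"] + hz * effects["d_C"]
            + x * z * effects["i_aa"] + x * hz * effects["i_ad"]
            + hx * z * effects["i_da"] + hx * hz * effects["i_dd"])


def solve_effects(m):
    """Invert the assumed model; m[(x, z)] holds the nine genotypic values."""
    mu = (m[1, 1] + m[1, -1] + m[-1, 1] + m[-1, -1]) / 4
    a_h = (m[1, 1] + m[1, -1] - m[-1, 1] - m[-1, -1]) / 4
    a_c = (m[1, 1] - m[1, -1] + m[-1, 1] - m[-1, -1]) / 4
    i_aa = (m[1, 1] - m[1, -1] - m[-1, 1] + m[-1, -1]) / 4
    d_h = (m[0, 1] + m[0, -1]) / 2 - mu
    d_c = (m[1, 0] + m[-1, 0]) / 2 - mu
    i_ad = (m[1, 0] - m[-1, 0]) / 2 - a_h   # host additive x cancer dominance
    i_da = (m[0, 1] - m[0, -1]) / 2 - a_c   # host dominance x cancer additive
    i_dd = m[0, 0] - mu - d_h - d_c
    return {"mu": mu, "a_H": a_h, "a_C": a_c, "d_H": d_h, "d_C": d_c,
            "i_aa": i_aa, "i_ad": i_ad, "i_da": i_da, "i_dd": i_dd}
```

Because the model has nine parameters and nine genotypic values, the mapping is one-to-one, so the effects are obtained by simple contrasts of the estimated means.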

5.2.4 Hypothesis Tests

It is imperative to know how different SNPs are associated within and between the host and cancer genomes and how haplotypes trigger cancer susceptibility singly or epistatically across different genomes. Two kinds of major hypotheses can be made to address these questions.

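Linkage Disequilibrium Tests: The association between different SNPs within each genome and between the two genomes can be examined by testing their linkage disequilibria (LD). For example, the LD among the four SNPs from the two genomes shown in Table 5-1 can be tested using the following hypotheses:

\[
\begin{aligned}
H_0 &: D_{H_1H_2} = D_{C_1C_2} = D_{H_1C_1} = D_{H_1C_2} = D_{H_2C_1} = D_{H_2C_2} = D_{H_1H_2C_1} \\
&\quad = D_{H_1H_2C_2} = D_{H_1C_1C_2} = D_{H_2C_1C_2} = D_{H_1H_2C_1C_2} = 0 \\
H_1 &: \text{At least one of the LD above is not equal to zero.}
\end{aligned}
\tag{5-20}
\]

The log-likelihood ratio test statistic for the significance of LD is calculated by comparing the likelihood values under $H_1$ (full model) and $H_0$ (reduced model) using

\[
LR_D = -2\left[\log L(\hat{p}_{H_1}, \hat{p}_{H_2}, \hat{p}_{C_1}, \hat{p}_{C_2}, \mathrm{LD} = 0 \mid H, C) - \log L(\hat{\Omega}_p \mid H, C)\right]
\tag{5-21}
\]

where $\hat{p}_{H_1}$, $\hat{p}_{H_2}$, $\hat{p}_{C_1}$, and $\hat{p}_{C_2}$ are the MLEs of allele frequencies at the four SNPs from the two genomes and $\hat{\Omega}_p$ denotes the MLEs of the haplotype frequencies under the full model. The $LR_D$ calculated under the $H_0$ and $H_1$ hypotheses is considered to asymptotically follow a $\chi^2$ distribution with 11 degrees of freedom, one for each LD coefficient constrained to zero under $H_0$.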
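
As a worked illustration of how such a likelihood ratio statistic is converted into a p-value, the sketch below evaluates the upper tail of a $\chi^2$ distribution with integer degrees of freedom using only the standard library. The function names and the example log-likelihood values are hypothetical; the log-likelihoods themselves would come from fitting the reduced and full models with the EM algorithm.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # Q(1/2, z)
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    """Return the LR statistic -2[log L(reduced) - log L(full)] and its p-value."""
    lr = -2.0 * (loglik_reduced - loglik_full)
    return lr, chi2_sf(lr, df)


# Hypothetical log-likelihoods for the overall LD test with 11 degrees of freedom.
lr_d, p_value = likelihood_ratio_test(-1530.2, -1515.7, df=11)
```

The same helper applies to the order-specific LD tests and to the epistasis test by changing the degrees of freedom.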

It is also interesting to test whether the linkage disequilibria of a particular order are significant. This can be done by formulating the hypotheses

$H_0: D_{H_1H_2} = D_{C_1C_2} = D_{H_1C_1} = D_{H_1C_2} = D_{H_2C_1} = D_{H_2C_2} = 0$

$H_1$: At least one of the LD above is not equal to zero,

for the digenic LD,

$H_0: D_{H_1H_2C_1} = D_{H_1H_2C_2} = D_{H_1C_1C_2} = D_{H_2C_1C_2} = 0$

$H_1$: At least one of the LD above is not equal to zero,

for the trigenic LD, and

$H_0: D_{H_1H_2C_1C_2} = 0$

$H_1: D_{H_1H_2C_1C_2} \neq 0$,

for the quadrigenic LD.

Each LD can also be tested separately. Under the null hypothesis of no LD, haplotype

frequencies are estimated with the same EM algorithm derived to estimate the frequency

parameters under the alternative hypothesis, except for the constraint posed on the

relationships of haplotype frequencies under the null hypothesis. Depending on which

type of LD is tested, these constraints can be obtained from equations (5-1)-(5-11).


Genome-Genome Epistasis Tests: The significance of an assumed risk haplotype for its effect on cancer susceptibility should be tested by formulating the following hypotheses:

\[
\begin{aligned}
H_0 &: \mu_{AABB} = \mu_{AAB\bar{B}} = \cdots = \mu_{\bar{A}\bar{A}\bar{B}\bar{B}} \equiv \mu \\
H_1 &: \text{At least one equality in } H_0 \text{ does not hold.}
\end{aligned}
\tag{5-25}
\]

The log-likelihood ratio test statistic ($LR_E$) under these two hypotheses can be calculated similarly. The $LR_E$ may asymptotically follow a $\chi^2$ distribution with eight degrees of freedom. However, the $\chi^2$ approximation may be inappropriate when some regularity conditions, such as normality and uncorrelated residuals, are violated. The permutation test approach (Churchill and Doerge, 1994), which does not rely upon the distribution of the $LR_E$, may be used to determine the critical threshold for declaring the existence of risk haplotypes.
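
The permutation idea can be sketched as follows: phenotypes are shuffled against the fixed genotypes, the test statistic is recomputed for each shuffle, and an upper percentile of the resulting null distribution serves as the critical threshold. In the model described here the statistic would be $LR_E$, recomputed by refitting the EM algorithm for every permutation; the stand-in statistic `mean_difference`, the function names, and the default of 1000 permutations are assumptions made to keep the example short.

```python
import math
import random


def mean_difference(y, groups):
    """Stand-in test statistic: absolute difference between two group means."""
    g0 = [v for v, g in zip(y, groups) if g == 0]
    g1 = [v for v, g in zip(y, groups) if g == 1]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def permutation_threshold(y, groups, statistic, n_perm=1000, alpha=0.05, seed=0):
    """Empirical (1 - alpha) critical value of `statistic` under permutation."""
    rng = random.Random(seed)
    shuffled = list(y)
    null_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the phenotype-genotype association
        null_stats.append(statistic(shuffled, groups))
    null_stats.sort()
    index = min(n_perm - 1, int(math.ceil((1.0 - alpha) * n_perm)) - 1)
    return null_stats[index]


# Toy data: two genotype groups with clearly different phenotype means.
groups = [0] * 10 + [1] * 10
y = [float(v) for v in range(10)] + [float(v) for v in range(20, 30)]
observed = mean_difference(y, groups)
threshold = permutation_threshold(y, groups, mean_difference)
significant = observed > threshold
```

An observed statistic above the threshold is declared significant at level `alpha` without reference to the $\chi^2$ approximation.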

Different genetic effects, such as the additive, dominance, and additive x additive,

additive x dominance, dominance x additive, and dominance x dominance effects

between haplotypes from the host and cancer genomes can also be tested individually,

with respective null hypotheses formulated as

$H_0: a_H = 0$, (5-26)

$H_0: a_C = 0$, (5-27)

$H_0: d_H = 0$, (5-28)

$H_0: d_C = 0$, (5-29)

$H_0: aa = 0$, (5-30)

$H_0: ad = 0$, (5-31)

$H_0: da = 0$, (5-32)

$H_0: dd = 0$. (5-33)

The parameter estimation under each of these null hypotheses can be obtained using

the same EM algorithm as described for the alternative hypothesis (full model) of

equation (5-25), with a constraint derived from a system of equations (5-13). The

critical thresholds for these individual effects (5-26)-(5-33) can be determined on the basis

of simulation studies.

5.3 Computer Simulation

The statistical behavior of the model proposed for cancer gene identification is investigated through simulation studies. We simulate a population of cancer patients in Hardy-Weinberg equilibrium (HWE) in which a set of SNPs from the host genome is associated with a different set of SNPs from the cancer genome. The haplotypes of these SNPs within and across the genomes trigger main and interaction effects on the susceptibility to cancer. The allele frequencies of two SNPs from each genome and their linkage disequilibria of different orders are given in Table 5-4. Four sample sizes are considered: modest (200), intermediate (400), large (800), and very large (2000). The population genetic parameters that specify the distribution and diversity of haplotypes can be well estimated with the model. A modest sample size is adequate for estimating allele frequencies and digenic linkage disequilibria. An intermediate to large sample size is needed to estimate higher-order linkage disequilibria. In particular, to estimate the quadrigenic linkage disequilibrium precisely, a sample of 2000 is recommended (Table 5-4).

For the assumed population in which multiple SNPs are typed, a quantitative trait that describes cancer susceptibility was simulated, following a normal distribution with means depending on the composite diplotypes of SNPs and a residual variance. The genotypic values of the composite diplotypes are determined by assuming specific values for the additive, dominance, and epistatic effects of haplotypes on the cancer trait. Composite diplotypes are formed by assuming two risk haplotypes for the SNPs, one from the host genome ($H_1^1H_2^1$) and the second from the cancer genome ($C_1^1C_2^1$). The genotypic values of a total of 16 composite diplotypes, along with their probabilities calculated from haplotype frequencies, are used to compute the genetic variance. The residual variance is then determined by assuming different heritability levels (0.1 and 0.4).

Table 5-4. The MLEs of population genetic parameters for two host SNPs and two cancer SNPs and the standard deviations of the estimates (in parentheses) in a simulated cancer population of varying sample sizes.

Table 5-5. Log-likelihood values for different combinations of risk haplotypes from the host and cancer genomes in a simulated cancer population of 200 subjects with a heritability of 0.1.

Table 5-5 gives the log-likelihoods of population and quantitative genetic parameters obtained by assuming different combinations of risk haplotypes from the host and cancer genomes for simulated data with a sample size of 200 and a heritability of 0.1. The risk haplotype combination ($H_1^1H_2^1$)($C_1^1C_2^1$) corresponds to the maximum likelihood among all possible combinations, suggesting that the model has correctly selected the risk haplotypes. The power to correctly select the optimal combination of risk haplotypes appears to be very high, even when a modest sample size and/or heritability is assumed (data not shown). The model provides reasonable estimates of quantitative genetic parameters (Table 5-6). The estimation precision of the parameters increases with sample size and heritability. For the additive genetic effects, a modest sample size (200) is sufficient even when the heritability is low (0.1). Good estimation of the dominance genetic effects needs an intermediate sample size (400 or more) for a small heritability. Epistasis, especially the dominance x dominance interaction, can be reasonably estimated when a large sample size (2000) is used. This is especially true for a small heritability.

Table 5-6. The MLEs of quantitative genetic parameters of haplotypes for SNPs typed from the host and cancer genomes and the standard deviations of the estimates (in parentheses) in a simulated cancer population of varying sample sizes and heritabilities.
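
The way a residual variance follows from a chosen heritability can be sketched as below. The example assumes the usual definition $H^2 = \sigma_g^2 / (\sigma_g^2 + \sigma^2)$, so that $\sigma^2 = \sigma_g^2 (1 - H^2) / H^2$; the function names and the toy genotypic values and probabilities are illustrative and are not the simulation settings of Table 5-4.

```python
import math
import random


def genetic_variance(values, probs):
    """Variance of genotypic values weighted by composite-diplotype probabilities."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))


def residual_variance(var_g, h2):
    """Residual variance implied by H^2 = var_g / (var_g + var_e)."""
    return var_g * (1.0 - h2) / h2


def simulate_phenotypes(values, probs, h2, n, seed=0):
    """Draw n (class index, phenotype) pairs for the given heritability."""
    rng = random.Random(seed)
    sd = math.sqrt(residual_variance(genetic_variance(values, probs), h2))
    classes = rng.choices(range(len(values)), weights=probs, k=n)
    return [(j, rng.gauss(values[j], sd)) for j in classes]


# Toy example with two equally frequent classes (hypothetical values).
toy_values, toy_probs = [0.0, 2.0], [0.5, 0.5]
sample = simulate_phenotypes(toy_values, toy_probs, h2=0.1, n=200)
```

Lower heritability inflates the residual variance relative to the genetic variance, which is why larger samples are needed to estimate dominance and epistatic effects when $H^2 = 0.1$.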

5.4 Discussion

Despite painstaking cumulative efforts by researchers worldwide over the past five decades, we have still not achieved substantial progress in the diagnosis, prevention, and treatment of cancer. The latest research, however, has raised the possibility of treating, controlling, and preventing cancer by using gene therapy. The successful use of this technique relies on a profound understanding of the genetic architecture of cancer susceptibility and progression. With the tremendous progress that has taken place in genotyping and sequencing the human genome and cancer genome, we are now in a good position to study the genetic control mechanisms of cancer. In this article, we have developed a statistical model for characterizing DNA sequence variants that encode cancer susceptibility.

The novelty of the model lies in three aspects. First, we incorporate into the model the latest discovery in cancer genetics that gene mutations cause cancer (Balman et al., 2003; Greenman et al., 2007; Jallepalli and Lengauer, 2001; Stock and Bialy, 2003). The model not only characterizes how gene mutations in the cancer genome act to regulate cancer, but also detects the genetic interactions between host genes and cancer mutations. The model allows tests of haplotype distribution and diversity in the cancer population and of the patterns of genetic actions and interactions. Second, the model is integrated with multilocus SNP data, detecting cancer genes at the DNA sequence level (Liu et al., 2004; Wu and Lin, 2008). This will provide significant insights into the genetic regulation mechanisms of cancer and the cloning of cancer genes. Third, the model is built on the interactions of genes between different genomes. Modeling genome-genome interactions has received increasing interest in studies of the genetic architecture of seed development (Cui and Wu, 2005) and pathogenesis (Foster et al., 2003; Wang et al., 2005).

The model was investigated in terms of its statistical behavior through simulation

studies. Different schemes of simulation that consider varying sample sizes and heritabilities

were used. The estimation precision of parameters and the power to detect genetic

variants for cancer were explored under different schemes. The results from simulation

provide scientific support for the model to be used for cancer gene identification in

practical data sets. Although we did not include real data to validate our model, the

statistical design and algorithm proposed in this work will help cancer geneticists and

clinicians launch a novel experiment to test hypotheses about the genetic control of cancer.

Although this article presents a general framework for haplotyping cancer genes, it can be extended to include gene x environment interactions, haplotyping in a case-control study, genetic imprinting, and an arbitrary number of SNPs. Because cancer is an inherited disease, genetic research on cancer benefits from an informative family-structured design in which one or both of the parents and their offspring are sampled simultaneously. The general principle of haplotyping genome-genome interactions can be used for such a family design, facilitating our understanding of cancer genetics. Also, cancer can be better viewed as a dynamic trait that undergoes marked developmental transitions. Functional mapping advocated by our group (Liu et al., 2005; Ma et al., 2002; Wu and Lin, 2006) can be implemented in the haplotyping model to explore developmental changes in the genetic control of cancer over time. In this article, we focus on the gene mutation hypothesis of cancer formation in deriving the epistatic model. Other hypotheses, such as the aneuploidy hypothesis of cancer (Stock and Bialy, 2003), should also be integrated into the model to better understand the genetic mechanisms of cancer formation and progression. A model that incorporates the aneuploidy control of cancer will be reported elsewhere.


Cancer susceptibility may be controlled not only by host genes and mutated genes in

cancer cells, but also by the epistatic interactions between genes from the host and cancer

genomes. We derive a novel statistical model for cancer gene identification by integrating

the gene mutation hypothesis of cancer formation into the mixture-model framework.

Within this framework, genetic interactions of DNA sequences (or haplotypes) between

host and cancer genes responsible for cancer risk are defined in terms of quantitative

genetic principles. Our model was founded on a commonly used genetic association design

in which a random sample of patients is drawn from a natural human population. Each

patient is typed for single nucleotide polymorphisms (SNPs) on normal and cancer cells

and measured for cancer susceptibility. The model is formulated within the maximum

likelihood context and implemented with the EM algorithm, allowing the estimation

of both population and quantitative genetic parameters. The model provides a general

procedure for testing the distribution of haplotypes constructed by SNPs from host and

cancer genes and the linkage disequilibria of different orders among the SNPs. The model

also formulates a series of testable hypotheses about the effects of host genes, cancer

genes, and their interactions on cancer susceptibility. We carried out simulation studies to

examine the statistical properties of the model. The implications of this model for cancer

gene identification are discussed.


This dissertation provides a comprehensive set of statistical models for cancer gene identification. Unlike traditional approaches, these models are built on solid biological aspects of cancer initiation and formation, including genetic mutations, aneuploid metabolic control, transgenerational epigenetics, genetic imprinting, and host-tumor interactions. These models should find immediate application in current cancer genetic research with the advent of massive amounts of genomic data. However, to enhance the biological relevance of these models, we need to integrate biological principles of cancer.

As a high-order phenotype, cancer formation undergoes a series of biochemical pathways: from DNA to mRNA through transcription, from mRNA to protein through translation, and from protein to cancer via various biosynthetic processes. "Omic" data in each of these steps have been increasingly accumulated, and thus it is time to integrate these data with powerful statistical models. However, it should be pointed out that such a Central

Dogma of biology is now surpassed by a plethora of new discoveries, including alternative

splicing, RNA editing, post-translational modifications, moonlighting proteins, feedback

circuits, and epigenetic inheritance. Techniques have been developed to detect the

regulatory control of these phenomena at each step of life formation. Metabolites are the

end products of nearly all cellular regulatory processes and reflect the ultimate outcome

of potential changes directed by genomic and proteomic adjustments stemming from an

environmental stimulus or genetic modification. To fully explain the nature of growth

and development, it is crucial to combine genomics, transcriptomics, proteomics, and

metabonomics into a network of interactions by developing powerful and robust statistical

models. Such a network biology approach will provide an unprecedented opportunity to

study the dynamic network of genes that determines the physiology of cancer over time

and space, thus understanding its growth and development processes in a comprehensive



Anway, M. D. and Skinner, M. K. "Transgenerational effects of the endocrine disruptor
vinclozolin on the prostate transcriptome and adult onset disease." Prostate 68 (2008):

Araujo, R. P. and McElwain, D. L. S. "The role of mechanical host-tumour interactions in
the collapse of tumour blood vessels and tumour growth dynamics." J. Theor. Biol. 238
(2006): 817-827.

Arnheim, N. and Calabrese, P. "Understanding what determines the frequency and
pattern of human germline mutations." Nat. Rev. Genet. 10 (2009): 478-488.

Bader, J. S. "The relative power of snps and haplotype as genetic markers for association
tests." Pharmacogenomics 2 (2001): 11-24.

Balman, A., Gray, J., and Ponder, B. "The genetics and genomics of cancer." Nat. Genet.
33 (2003): 238-244.

Barabasi, A. L. and Oltvai, Z. N. "Network biology: understanding the cell's functional
organization." Nat. Rev. Genet. 5 (2004).2: 101-113.

Baudot, A., Real, F. X., Izarzugaza, J. M., and Valencia, A. "From cancer genomes to
cancer models: bridging the gaps." EMBO Rep. 10 (2009).4: 359-366.

Beckmann, J. S., Estivill, X., and Antonarakis, S. E. "Copy number variants and genetic
traits: closer to the resolution of phenotypic to genotypic variability." Nat. Rev. Genet.
8 (2007): 639-646.

Boone, C., Bussey, H., and Andrews, B. J. "Exploring genetic interactions and networks
with yeast." Nat. Rev. Genet. 8 (2007).6: 437-449.

Brennan, P. "Gene-environment interaction and aetiology of cancer: what does it mean
and how can we measure it?" Carcinogenesis 23 (2002): 381-387.

Carpten, J. D., Faber, A. L., Horn, C., Donoho, G. P., Briggs, S. L., Robbins, C. M.,
Hostetter, G., Boguslawski, S., Moses, T. Y., Savage, S., Uhlik, M., Lin, A., Du, J.,
Qian, Y. W., Zeckner, D. J., Tucker-Kellogg, G., Touchman, J., Patel, K., Mousses,
S., Bittner, M., Schevitz, R., Lai, M. H. T., Blanchard, K. L., and Thomas, J. E. "A
transforming mutation in the pleckstrin homology domain of AKT1 in cancer." Nature
448 (2007): 439-444.

Chan, E. Y. "Advances in sequencing technology." Mutat. Res. 573 (2005): 13-40.

Cheverud, J. M., Hager, R., Roseman, C., Fawcett, G., Wang, B., and Wolf, J. B.
"Genomic imprinting effects on adult body composition in mice." Proc. Natl. Acad.
Sci. 105 (2008): 4253-4258.

Chin, Lynda and Gray, Joe W. "Translating insights from the cancer genome into clinical
practice." Nature 452 (2008): 553-563.

Churchill, G. A. and Doerge, R. W. "Empirical Threshold Values for Quantitative Trait
Mapping." Genetics 138 (1994): 963-971.

Constancia, M., Kelsey, G., and Reik, W. "Resourceful imprinting." Nature 432 (2004):

Crews, D., Gore, A. C., Hsu, T. S., Dangleben, N. L., Spinetta, M., Schallert, T., Anway,
M. D., and Skinner, M. K. "Transgenerational epigenetic imprints on mate preference."
Proc. Natl. Acad. Sci. U S A 104 (2007): 5942-5946.

Cropley, J. E., Suter, C. M., Beckman, K. B., and Martin, D. I. "Germ-line epigenetic
modification of the murine Avy allele by nutritional supplementation." Proc Natl Acad
Sci USA 103 (2006): 17308-17312.

Cui, Y. H. and Wu, R. L. "Mapping genome-genome epistasis: A multi-dimensional
model." Bioinformatics 21 (2005): 2447-2455.

Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J., and Lander, E. S.
"High-resolution haplotype structure in the human genome." Nat. Genet. 29 (2001):

Davies, H., Bignell, G. R., Cox, C., Stephens, P., Edkins, S., Clegg, S., Teague, J.,
Woffendin, H., Garnett, M. J., Bottomley, W., Davis, N., Dicks, E., Ewing, R., Floyd,
Y., Gray, K., Hall, S., Hawes, R., Hughes, J., Kosmidou, V., Menzies, A., Mould, C.,
Parker, A., Stevens, C., Watt, S., Hooper, S., Wilson, R., Jayatilake, H., Gusterson,
B. A., Cooper, C., Shipley, J., Hargrave, D., Pritchard-Jones, K., Maitland, N.,
Chenevix-Trench, G., Riggins, G. J., Bigner, D. D., Palmieri, G., Cossu, A., Flanagan,
A., Nicholson, A., Ho, J. W. C., Leung, S. Y., Yuen, S. T., Weber, B. L., Seigler, H. F.,
Darrow, T. L., Paterson, H., Marais, R., Marshall, C. J., Wooster, R., Stratton, M. R.,
and Futreal, P. A. "Mutations of the BRAF gene in human cancer." Nature 417 (2002):

Dawson, E., Abecasis, G. R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D. M., Pabial,
J, Dibling, T., Tinsley, E., Kirby, S., Carter, D., Papaspyridonos, M., Livingstone, S.,
Ganske, R., Lohmussaar, E., Zernant, J., Tonisson, N., Remm, M., Magi, R., Puurand,
T., Vilo, J., Kurg, A., Rice, K., Deloukas, P., Mott, R., Metspalu, A., Bentley, D. R.,
Cardon, L. R., and Dunham, I. "A first-generation linkage disequilibrium map of human
chromosome." Nature 418 (2002): 544-548.

De Koning, D. J., Rattniek, A. P., Harlizius, B., Arendonk, J. A. M., Brascamp, E. W.,
and Groenen, M. A. M. "Genome-wide scan for body composition in pigs reveals
important role of imprinting." Proc Natl Acad Sci USA 97 (2000): 7947-7950.

Dolinoy, D. C., Weidman, J. R., Waterland, R. A., and Jirtle, R. L. "Maternal genistein
alters coat color and protects Avy mouse offspring from obesity by modifying the fetal
epigenome." Environ Health Perspect 114 (2006): 567-572.

Duesberg, P., Li, R., Fabarius, A., and Hehlmann, R. "Orders of magnitude change in
phenotype rate caused by mutation." Cell Oncol. 29 (2007): 71-72.

Duesberg, P., Rasnick, D., Li, R., Winters, L., Rausch, C., and Hehlmann, R. "How
aneuploidy may cause cancer and genetic instability." Anticancer Res. 19 (1999):

Duesberg, Peter. "Chromosomal Chaos and Cancer." Scientific American 296 (2007):

Dupuis, J., Siegmund, D., and Yakir, B. "A unified framework for linkage and association
analysis of quantitative traits." Proceedings of the National Academy of Sciences of the
United States of America 104 (2007): 20210-20215.

Fan, C., Oh, D. S., Wessels, L., Weigelt, B., Nuyten, D. S., Nobel, A. B., van't Veer, L. J.,
and Perou, C. M. "Concordance among gene-expression-based predictors for breast
cancer." N. Engl. J. Med. 355 (2006).6: 560-569.

Foster, J. S., Palmer, R. J., Jr., and Kolenbrander, P. E. "Human oral cavity as a model
for the study of genome-genome interactions." Biol. Bull. 204 (2003): 200-204.

Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., Rahman, N.,
and Stratton, M. R. "A census of human cancer genes." Nature Rev. Cancer 4 (2004):

Gabriel, S. B., Schaffner, S. F., Nguyen, H., Moore, J. M., Roy, J., Blumenstiel, B.,
Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S. N., Rotimi, C.,
Adeyemo, A., Cooper, R., Ward, R., Lander, E. S., Daly, M. J., and Altshuler, D. "The
structure of haplotype blocks in the human genome." Science 296 (2002): 2225-2229.

Gal-Yam, E. N., Saito, Y., Egger, G., and Jones, P. A. "Cancer Epigenetics:
Modifications, Screening, and Therapy." Annual Review of Medicine 59 (2008):

Greenman, C., Stephens, P., Smith, R., Dalgliesh, G. L., Hunter, C., Bignell, G., Davies,
H., Teague, J., Butler, A., Stevens, C., Edkins, S., O'Meara, S., Vastrik, I., Schmidt,
E. E., Avis, T., Barthorpe, S., Bhamra, G., Buck, G., Choudhury, B., Clements, J.,
Cole, J., Dicks, E., Forbes, S., Gray, K., Halliday, K., Harrison, R., Hills, K., Hinton, J.,
Jenkinson, A., Jones, D., Menzies, A., Mironenko, T., Perry, J., Raine, K., Richardson,
D., Shepherd, R., Small, A., Tofts, C., Varian, J., Webb, T., West, S., Widaa, S., Yates,
A., Cahill, D. P., Louis, D.N., Goldstraw, P., Nicholson, A. G., Brasseur, F., Looijenga,
L., Weber, B.L., Chiew, Y. E., DeFazio, A., Greaves, M. F., Green, A. R., Campbell, P.,
Birney, E., Easton, D. F., Chenevix-Trench, G., Tan, M. H., Khoo, S. K., Teh, B. T.,

Yuen, S. T., Leung, S. Y., Wooster, R., Futreal, P. A., and Stratton, M.R. "Patterns of
somatic mutation in human cancer genomes." Nature 446 (2007): 153-158.

Gronbaek, K., Hother, C., and Jones, P. A. "Epigenetic changes in cancer." Acta
Pathologica, Microbiologica et Immunologica Scandinavica 115 (2007): 1039-1059.

Haber, D. A. and Settleman, J. "Cancer: Drivers and passengers." Nature 446 (2007):

Hanks, S. and Rahman, N. "Aneuploidy-cancer predisposition syndromes: a new link
between the mitotic spindle checkpoint and cancer." Cell Cycle 4 (2005): 225-227.

Hartman, J. L., Garvik, B., and Hartwell, L. "Principles for the buffering of genetic
variation." Science 291 (2001).5506: 1001-1004.

Hernandez, P., Huerta-Cepas, J., Montaner, D., Al-Shahrour, F., Valls, J., Gomez, L.,
Capella, G., Dopazo, J., and Pujana, M. A. "Evidence for systems-level molecular
mechanisms of tumorigenesis." BMC Genomics 8 (2007): 185.

Isles, A. R. and Wilkinson, L. S. "Imprinted genes, cognition and behaviour." Trends
Cogn Sci 4 (2000): 309-318.

Itier, J. M., Tremp, G., Leonard, J. F., Multon, M. C., Ret, G., Schweighoffer, F., Tocqué,
B., Bluet-Pajot, M. T., Cormier, V., and Dautry, F. "Imprinted gene in postnatal
growth role." Nature 393 (1998): 125-126.

Jallepalli, P. V. and Lengauer, C. "Chromosome segregation and cancer: Cutting through
the mystery." Nat. Rev. Cancer 1 (2001): 109-117.

Jirtle, R. L. and Skinner, M. K. "Environmental epigenomics and disease susceptibility."
Nat. Rev. Genet. 8 (2007): 253-262.

Jones, P. A. and Martienssen, R. "A blueprint for a Human Epigenome Project: the
AACR Human Epigenome Workshop." Cancer Res. 65 (2005): 11241-11246.

Judson, R., Stephens, J. C., and Windemuth, A. "The predictive power of haplotypes in
clinical response." Pharmacogenomics 1 (2000): 15-26.

Kaiser, J. "Tackling the cancer genome." Science 309 (2005): 6.

Khalil, I. G. and Hill, C. "Systems biology for cancer." Curr. Opin. Oncol. 17 (2005).1:

Kops, G. J., Weaver, B. A., and Cleveland, D. W. "On the road to cancer: aneuploidy and
the mitotic checkpoint." Nat. Rev. Cancer 5 (2005): 773-785.

Lander, E. S. and Botstein, D. "Mapping mendelian factors underlying quantitative traits
using RFLP linkage maps." Genetics 121 (1989): 185-199.

Lehner, B. "Modelling genotype-phenotype relationships and human disease with genetic
interaction networks." J. Exp. Biol. 210 (2007).Pt 9: 1559-1566.

Li, L. L., Keverne, E. B., Aparicio, S. A., Ishino, F., Barton, S. C., and Surani, M. A.
"Regulation of maternal behaviour and offspring growth by paternally expressed Peg3."
Science 284 (1999): 330-333.

Li, R. H., Sonik, A., Stindl, R., Rasnick, D., and Duesberg, P. "Aneuploidy vs. gene
mutation hypothesis of cancer: Recent study claims mutation but is found to support
aneuploidy." Proceedings of the National Academy of Sciences of the United States of
America 97 (2000): 3236-3241.

Li, Y. C., Coelho, C. M., Liu, T., Wu, S., Zeng, Y. R., Li, Y., Hunter, B., Dante, R. A.,
Larkins, B. A., and Wu, R. L. "A statistical strategy to estimate maternal-zygotic
interactions and parent-of-origin effects of QTLs for seed development." PLoS ONE 3
(2007): e3131.

Lin, M. and Wu, R. L. "Detecting sequence-sequence interactions for complex diseases."
Current Genomics 7 (2006): 59-72.

Liu, T., Johnson, J. A., Casella, G., and Wu, R. L. "Sequencing complex diseases with
HapMap." Genetics 168 (2004): 503-511.

Liu, T., Todhunter, R. J., Wu, S., Hou, W., Mateescu, R., Zhang, Z. W., Burton-Wurster,
N. I., Acland, G. M., Lust, G., and Wu, R. L. "A random model for mapping imprinted
quantitative trait loci in a structured pedigree: An implication for mapping canine hip
dysplasia." Genomics 90 (2007): 276-284.

Liu, T., Zhao, W., Tian, L. L., and Wu, R. L. "An algorithm for molecular dissection of
tumor progression." Journal of Mathematical Biology 50 (2005): 336-354.

Loeb, L. A., Bielas, J. H., and Beckman, R. A. "Cancers exhibit a mutator phenotype:
clinical implications." Cancer Res. 68 (2008): 3551-3557.

Ma, C. X., Casella, G., and Wu, R. L. "Functional mapping of quantitative trait loci
underlying the character process: A theoretical framework." Genetics 161 (2002):

Mack, G. S. "Epigenetic cancer therapy makes headway." J. Natl. Cancer Inst. 98 (2006):

Maderspacher, Florian. "Theodor Boveri and the natural experiment." Current Biology 18

McGrath, J. and Solter, D. "Inability of mouse blastomere nuclei transferred to enucleated
zygotes to support development in vitro." Science 226 (1984): 1317-1319.

Morgan, H. D., Santos, F., Green, K., Dean, W., and Reik, W. "Epigenetic
reprogramming in mammals." Hum. Mol. Genet. 14 (2005): R47-R58.

Morgan, H. D., Sutherland, H. G., Martin, D. I., and Whitelaw, E. "Epigenetic inheritance
at the agouti locus in the mouse." Nat. Genet. 23 (1999): 314-318.

Nilsson, E. E., Anway, M. D., Stanfield, J., and Skinner, M. K. "Transgenerational
epigenetic effects of the endocrine disruptor vinclozolin on pregnancies and female adult
onset disease." Reproduction 135 (2008): 713-721.

Nowell, P. C. "Discovery of the Philadelphia chromosome: a personal perspective." J.
Clin. Invest. 117 (2007): 2033-2035.

Parmigiani, G., Boca, S., Lin, J., Kenneth, K. W., Velculescu, V., and Vogelstein, B.
"Design and analysis issues in genome-wide somatic mutation studies of cancer."
Genomics 93 (2009): 17-21.

Parris, George E. "Clinically significant cancer evolves from transient mutated and/or
aneuploid neoplasia by cell fusion to form unstable syncytia that give rise to ecologically
viable parasite species." Medical Hypotheses 65 (2005).5.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer,
C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T., Norris, M. C.,
Sheehan, J. B., Shen, N., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O.,
Vyas, K. R., Frazer, K. A., Fodor, S. P., and Cox, D. R. "Blocks of limited haplotype
diversity revealed by high-resolution scanning of human chromosome 21." Science 294
(2001): 1719-1723.

Pellman, David. "Cell biology: Aneuploidy and cancer." Nature 446 (2007): 38-39.

Pembrey, M. E., Bygren, L. O., Kaati, G., Edvinsson, S., Northstone, K., Sjöström, M.,
Golding, J., and The ALSPAC Study Team. "Sex-specific, male-line transgenerational
responses in humans." Eur. J. Hum. Genet. 14 (2006): 159-166.

Pujana, M. A., Han, J. D., Starita, L. M., Stevens, K. N., Tewari, M., and Ahn, J. S.
"Network modeling links breast cancer susceptibility and centrosome dysfunction." Nat.
Genet. 39 (2007).11: 1338-1349.

Rand, V., Prebble, E., Ridley, L., Howard, M., Wei, W., Brundler, M. A., Fee, B. E.,
Riggins, G. J., Coyle, B., and Grundy, R. G. "Investigation of chromosome 1q reveals
differential expression of members of the S100 family in clinical subgroups of intracranial
paediatric ependymoma." Br. J. Cancer (2008).

Reddy, E. P., Reynolds, R. K., Santos, E., and Barbacid, M. "A point mutation is
responsible for the acquisition of transforming properties by the T24 human bladder
carcinoma oncogene." Nature 300 (1982): 149-152.

Reik, W. and Walter, J. "Genomic imprinting: parental influence on the genome." Nat.
Rev. Genet. 2 (2001): 21-32.

Rha, S. Y., Jeung, H. C., Choi, Y. H., Yang, W. I., Yoo, J. H., Kim, B. S., Roh, J. K.,
and Chung, H. C. "An association between RRM1 haplotype and gemcitabine-induced
neutropenia in breast cancer patients." Oncologist 12 (2007): 622-630.

Rhodes, D. R. and Chinnaiyan, A. M. "Integrative analysis of the cancer transcriptome."
Nat. Genet. 37 (2005): S31-S37.

Rowley, J. D. "The role of chromosome translocations in leukemogenesis." Semin.
Hematol. 36 (1999): 59-72.

Rual, J. F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., and Li, N.
"Towards a proteome-scale map of the human protein-protein interaction network."
Nature 437 (2005).7062: 1173-1178.

Sasaki, H. and Matsui, Y. "Epigenetic events in mammalian germ-cell development:
reprogramming and beyond." Nat. Rev. Genet. 9 (2008): 129-140.

Sha, K. "A mechanistic view of genomic imprinting." Annu. Rev. Genomics Hum. Genet.
9 (2008): 197-216.

Sharma, S. V., Bell, D. W., Settleman, J., and Haber, D. A. "Epidermal growth factor
receptor mutations in lung cancer." Nat. Rev. Cancer 7 (2007): 169-181.

Skinner, M. K. "What is an epigenetic transgenerational phenotype? F3 or F2." Reprod.
Toxicol. 25 (2008): 2-6.

Skinner, M. K. and Anway, M. D. "Epigenetic transgenerational actions of vinclozolin on
the development of disease and cancer." Crit. Rev. Oncog. 13 (2007): 75-82.

Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F. H., and Goehler, H. "A
human protein-protein interaction network: a resource for annotating the proteome."
Cell 122 (2005).6: 957-968.

Stephens, P., Hunter, C., Bignell, G., Edkins, S., Davies, H., Teague, J., Stevens, C.,
O'Meara, S., Smith, R., Parker, A., Barthorpe, A., Blow, M., Brackenbury, L., Butler,
A., Clarke, O., Cole, J., Dicks, E., Dike, A., Drozd, A., Edwards, K., Forbes, S.,
Foster, R., Gray, K., Greenman, C., Halliday, K., Hills, K., Kosmidou, V., Lugg, R.,
Menzies, A., Perry, J., Petty, R., Raine, K., Ratford, L., Shepherd, R., Small, A.,
Stephens, Y., Tofts, C., Varian, J., West, S., Widaa, S., Yates, A., Brasseur, F., Cooper,
C. S., Flanagan, A. M., Knowles, M., Leung, S. Y., Louis, D. N., Looijenga, L. H. J.,
Malkowicz, B., Pierotti, M. A., Teh, B., Chenevix-Trench, G., Weber, B. L., Yuen, S. T.,
Harris, G., Goldstraw, P., Nicholson, A. G., Futreal, P. A., Wooster, R., and Stratton,
M. R. "Lung cancer: intragenic ERBB2 kinase mutations in tumours." Nature 431
(2004): 525-526.

Stock, R. P. and Bialy, H. "The sigmoidal curve of cancer." Nat. Biotech. 21 (2003):

Stratton, M. R., Campbell, P. J., and Futreal, P. A. "The cancer genome." Nature 458
(2009): 719-724.

Suijkerbuijk, S. J. and Kops, G. J. "Preventing aneuploidy: the contribution of mitotic
checkpoint proteins." Biochim. Biophys. Acta 1786 (2008): 24-31.

Surani, M. A., Barton, S. C., and Norris, M. L. "Development of reconstituted mouse eggs
suggests imprinting of the genome during gametogenesis." Nature 308 (1984): 548-550.

Tabin, C. J., Bradley, S. M., Bargmann, C. I., Weinberg, R. A., Papageorge, A. G.,
Scolnick, E. M., Dhar, R., Lowy, D. R., and Chang, E. H. "Mechanism of activation of a
human oncogene." Nature 300 (1982): 143-149.

The Cancer Genome Atlas Research Network. "Comprehensive genomic characterization
defines human glioblastoma genes and core pathways." Nature 455 (2008): 1061-1068.

The International HapMap Consortium. "The International HapMap Project." Nature 426
(2003): 789-794.

Thompson, S. L. and Compton, D. A. "Examining the link between chromosomal
instability and aneuploidy in human cells." J. Cell Biol. 180 (2008): 665-672.

Tomlins, S. A., Rhodes, D. R., Perner, S., Dhanasekaran, S. M., Mehra, R., Sun, X. W.,
Varambally, S., Cao, X., Tchinda, J., Kuefer, R., Lee, C., Montie, J. E., Shah, R. B.,
Pienta, K. J., Rubin, M. A., and Chinnaiyan, A. M. "Recurrent fusion of TMPRSS2
and ETS transcription factor genes in prostate cancer." Science 310 (2005): 644-648.

Velculescu, V. E. "Defining the blueprint of the cancer genome." Carcinogenesis 29
(2008): 1087-1091.

Wang, Z. H., Hou, W., and Wu, R. L. "A statistical model to analyze quantitative trait
locus interactions for HIV dynamics from the virus and human genomes." Stat. Med. 25
(2005): 495-511.

Weaver, B. A. and Cleveland, D. W. "Does aneuploidy cause cancer?" Curr. Opin. Cell
Biol. 18 (2006): 658-667.

Whitelaw, N. C. and Whitelaw, E. "Transgenerational epigenetic inheritance in health and
disease." Curr. Opin. Genet. Dev. 18 (2008): 273-279.

Wilkins, J. F. and Haig, D. "What good is genomic imprinting: The function of
parent-specific gene expression." Nat. Rev. Genet. 4 (2003): 359-368.

Wilkinson, L. S., Davies, W., and Isles, A. R. "Genomic imprinting effects on brain
development and function." Nat. Rev. Neurosci. 4 (2007): 1-19.

Wolf, J. B., Cheverud, J. M., Roseman, C., and Hager, R. "Genome-Wide Analysis
Reveals a Complex Pattern of Genomic Imprinting in Mice." PLoS. Genet. 4 (2008):

Wu, R. L. and Lin, M. "Functional mapping: A new tool to study the genetic
architecture of dynamic complex trait." Nat. Rev. Genet. 7 (2006): 229-237.

Wu, R. L. and Zeng, Z.-B. "Joint linkage and linkage disequilibrium mapping in natural
populations." Genetics 157 (2001): 899-909.

Wu, Rongling and Lin, Min. Statistical and Computational Pharmacogenomics. London:
Chapman & Hall/CRC, 2008.

Youngson, N. A. and Whitelaw, E. "Transgenerational epigenetic effects." Annu. Rev.
Genomics Hum. Genet. 9 (2008): 233-257.

Zhang, F., Zhao, D., Chen, G., and Li, Q. "Gene mutation and aneuploidy might
cooperate to carcinogenesis by dysregulation of asymmetric division of adult stem cells."
Medical Hypotheses 67 (2006).4: 995-996.

Zhang, K., Deng, M., Chen, T., Waterman, M. S., and Sun, F. "A dynamic programming
algorithm for haplotype block partitioning." Proc. Natl. Acad. Sci. USA 99 (2002):


Yao Li, originally trained in biology at the University of Science and Technology of
China and the University of Illinois at Urbana-Champaign, became a graduate student in
the Department of Statistics at the University of Florida in the fall of 2003. She received
her Ph.D. from the University of Florida in the summer of 2009, under the supervision of
Professor Rongling Wu. She also worked in the Department of Public Health Sciences
at the Penn State College of Medicine as a visiting student before her graduation. She
has accepted a tenure-track position in the Department of Statistics at West Virginia
University. She is intrigued by the development of statistical and computational models
for identifying genes that control complex traits and diseases. She is eager to use her
models and algorithms to solve complicated real-world genetic problems.








Today, I can no longer touch the smile on my mother's face if she could read this dissertation. Nonetheless, I know she would be proud of me, as she had always been, ever since I was born. She was my first teacher, tireless mentor, and dearest friend. She endowed me with the strength to endure and to fight for my ideal. Every time I looked back, I felt so secure since I knew she was always there for me, until the last minute of her life. Even now, four years after I lost her to cancer, her endless love is still supporting me through my pursuit of the Ph.D. degree. How I wish that I could hug her once again and tell her how lucky I am to have her as my mother. I can never pay back the love I have received from her. Here is one thing I can do for her: I chose this topic for her, and I promise that I will never give up the efforts to fight cancer, all my life.

I am deeply indebted to my academic advisor and committee chair, Dr. Rongling Wu. He is the kindest and most caring advisor one can ever wish to have. His enthusiasm and passion in research have encouraged and inspired me so much, and his painstaking guidance has led me through the low days. He gave me countless help from the day I came to UF. But for his great patience and understanding, I could not have completed this work. I would like to express my gratitude to the other committee members: Dr. Arthur Berg, Dr. Myron Chang and Dr. Robert Dorazio. I want to thank all the other faculty and staff members of the Department of Statistics, University of Florida; I owe so much to them for their warmhearted advice and help. I would also like to extend my gratefulness to the Penn State University College of Medicine, where I have received generous support in the last year of my Ph.D. study. I want to give my thanks to my colleagues there for their interest and valuable hints. I am also obliged to my friends at the University of Florida. They have always been by my side, cheered me on, offered their selfless assistance, and shared my joys and tears. They made my memories of the time I spent at UF so colorful and indelible.


ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 MAPPING CANCER GENES: CHALLENGES AND OPPORTUNITIES
  1.1 Introduction
  1.2 Genetic Mapping
    1.2.1 From Genetic Mapping to Genetic Haplotyping
    1.2.2 From Genetic Action to Genetic Interaction
    1.2.3 From Mendelian Inheritance to Genetic Imprinting
    1.2.4 From Genomics to Proteomics
  1.3 Molecular Mechanisms for Cancer
    1.3.1 Genetic Mutations
    1.3.2 Chromosomal Translocations
    1.3.3 Aneuploidy
    1.3.4 Epigenetic Modifications
  1.4 Statistical Issues for Cancer Gene Identification
    1.4.1 Normal-Tumor Sampling Design
    1.4.2 Family Design
    1.4.3 Longitudinal Studies
  1.5 Dissertation Goals
    1.5.1 Detecting Genetic Mutations
    1.5.2 Modeling Genetic Imprinting
    1.5.3 Modeling Aneuploidy
    1.5.4 Modeling Cancer-Host Interactions
2 A DISEQUILIBRIUM MODEL FOR DETECTING GENETIC MUTATIONS FOR CANCER
  2.1 Introduction
  2.2 Model
    2.2.1 Study Design
    2.2.2 Genetic Model
    2.2.3 Estimation
    2.2.4 Hypothesis Tests
  2.3 Computer Simulation
  2.4 Discussion


  3.1 Introduction
  3.2 Model
    3.2.1 Study Design
    3.2.2 Chromosome Duplication
    3.2.3 Quantitative Genetic Parameters
    3.2.4 Estimation
    3.2.5 Hypothesis Tests
  3.3 Application to Simulated Data
  3.4 Discussion
4 MODELING TRANSGENERATIONAL IMPRINTING
  4.1 Introduction
  4.2 Design
    4.2.1 Sampling Strategies
    4.2.2 Genetic Models
    4.2.3 Estimation
    4.2.4 Hypothesis Tests
  4.3 Haplotyping Model
  4.4 Computer Simulation
  4.5 Discussion
5 MODELING HOST-CANCER GENETIC INTERACTIONS WITH MULTILOCUS SEQUENCE DATA
  5.1 Introduction
  5.2 Design
    5.2.1 Sampling Strategies
    5.2.2 Genetic Models
    5.2.3 Estimation Procedures
    5.2.4 Hypothesis Tests
  5.3 Computer Simulation
  5.4 Discussion
6 FUTURE DIRECTIONS
REFERENCES
BIOGRAPHICAL SKETCH


Table

2-1 Genotypes at SNPs A and B for normal and cancer cells.
2-2 Genotype frequencies at SNPs A (causal) and B (neutral) in the "offspring" population after the mutation occurs.
2-3 Sex-specific mutations at SNP A for normal and cancer cells before and after the mutation.
2-4 MLEs of the simulated sample when the association between a cancer gene and the SNP is strong.
2-5 MLEs of the simulated sample when the association between a cancer gene and the SNP is moderate.
2-6 MLEs of the simulated sample when the association between a cancer gene and the SNP is weak.
2-7 MLEs of the simulated sample when disequilibrium coefficients occur at high frequency (Case 1)
2-8 MLEs of the simulated sample when disequilibrium coefficients occur at high frequency (Case 2)
3-1 The changes of genotypes and genotype frequencies after chromosomal duplication.
3-2 Genotypic values and proportions of different configurations of a triploid genotype at a duplicated gene.
3-3 The simulation results with different sample size and heritability combinations for both the population parameters (p, u, v) and the genetic parameters.
4-1 A three-generation family design used to study transgenerational inheritance.
4-2 MLEs of genetic effects in two generations with two simulation strategies.
4-3 Estimation results for haplotype analysis
4-4 A three-generation family design used to study transgenerational inheritance in haplotype analysis.
5-1 Disequilibrium compositions of four-SNP haplotype frequencies derived from the host and cancer genomes.
5-2 Additive, dominance, and epistatic compositions of the genotypic value of a composite diplotype constructed with haplotypes from the host and cancer genomes.


5-4 MLEs of population genetic parameters for two host SNPs and two cancer SNPs
5-5 Log-likelihood values for different combinations of risk haplotypes from the host and cancer genomes
5-6 MLEs of quantitative genetic parameters of haplotypes for SNPs typed from the host and cancer genomes


Figure

1-1 Main types of genomic and epigenomic aberration that cause cancers.
1-2 Karyotype analysis for a normal human cell and tumor cell.
3-1 Diagram for chromosome duplication and the resulting changes of genotypes at an aneuploid gene A.


The identification of genes that are directly involved in tumor initiation and maintenance is instrumental for understanding the phenotypic variation of cancer and ultimately designing crucial therapeutic drugs to treat this disease. In recent years, the completed genome sequence of humans and cancers has markedly enhanced cancer gene identification. The overall goal of this dissertation is to develop a warehouse of statistical tools for identifying cancer genes with rapidly growing sequence data. These tools are founded on the latest discoveries for the genetic and developmental roots of cancer formation, including somatic mutations, aneuploid induction, epigenetic modifications, transgenerational imprinting, copy number variants, and host-tumor genetic interactions. New statistical methods and algorithms will be developed to integrate each of these discoveries. By comparing the difference in the DNA structure and sequence between the human and cancer genomes, a disequilibrium model has been formulated to identify and test the genetic mutations or "drivers" that cause cancer. A quantitative model is derived to unravel the aneuploidy control of cancer and estimate the genetic effects of aneuploid loci on cancer risk. Using a commonly used three-generation design, a two-stage hierarchical model is developed to estimate and test the transgenerational alteration of genetic effects and identify genetic imprinting effects due to different parental origins of the same allele. This hierarchical model allows the characterization of genetic interactions between additive and dominant effects and imprinting effects over generations. Cancer susceptibility may be controlled not only by host genes and mutated genes in cancer cells, but also by the epistatic interactions between genes from the host and cancer genomes. A model was derived to estimate genome-genome interactions of host DNA and cancer DNA.


Models for cancer gene identification require the solution of missing data problems, given that cancer genes and their incidence in a natural population cannot be observed directly. For this reason, I have built the models within the mixture model framework. Maximum likelihood approaches, implemented with the EM algorithm, have been derived to provide estimates of genetic parameters related to mutation rates, chromosome duplication rates, genetic imprinting, genetic interactions, and haplotype frequencies. I have performed various sets of computer simulations to investigate the statistical properties of the new models in terms of power, estimation precision, and false positive rates. A series of practical computational issues, including convergence rates and choices of initial values, are discussed. I have also formulated various testable hypotheses about the frequencies of genetic mutations and the effects of host genes, cancer genes, and their interactions on cancer susceptibility. This dissertation provides the most complete set of statistical models for cancer gene identification thus far in the literature. The biological relevance and statistical sophistication of these models will make them practically useful for unlocking the genetic secrets of cancer.
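To illustrate the kind of missing-data problem described above, the following minimal sketch applies the EM algorithm to the textbook case of estimating haplotype frequencies at two biallelic SNPs from unphased genotype counts. It is not the dissertation's own implementation: the function name `em_haplotype_freqs`, the genotype coding (number of A alleles, number of B alleles), and the Hardy-Weinberg assumption are illustrative choices. Only the double heterozygote AaBb is ambiguous, since it may be either AB|ab or Ab|aB, so the E-step splits those individuals between the two phases.

```python
# Haplotype pairs carried by each unambiguous two-SNP genotype.
# Genotypes are coded as (number of A alleles, number of B alleles).
FIXED = {
    (2, 2): ("AB", "AB"), (2, 1): ("AB", "Ab"), (2, 0): ("Ab", "Ab"),
    (1, 2): ("AB", "aB"), (1, 0): ("Ab", "ab"),
    (0, 2): ("aB", "aB"), (0, 1): ("aB", "ab"), (0, 0): ("ab", "ab"),
}


def em_haplotype_freqs(counts, tol=1e-10, max_iter=1000):
    """Estimate haplotype frequencies from unphased genotype counts by EM.

    counts maps (nA, nB) to the number of individuals with that genotype.
    Assumes random mating (Hardy-Weinberg equilibrium); illustrative only.
    """
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no individuals supplied")

    # Haplotype counts that are known with certainty.
    base = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    for genotype, haplotypes in FIXED.items():
        for h in haplotypes:
            base[h] += counts.get(genotype, 0)

    n_dh = counts.get((1, 1), 0)   # double heterozygotes (phase unknown)
    p = {h: 0.25 for h in base}    # initial values

    for _ in range(max_iter):
        # E-step: posterior probability that a double heterozygote is AB|ab.
        cis = p["AB"] * p["ab"]
        trans = p["Ab"] * p["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5

        # M-step: update frequencies from expected haplotype counts.
        new = {
            "AB": (base["AB"] + w * n_dh) / (2 * n),
            "ab": (base["ab"] + w * n_dh) / (2 * n),
            "Ab": (base["Ab"] + (1 - w) * n_dh) / (2 * n),
            "aB": (base["aB"] + (1 - w) * n_dh) / (2 * n),
        }
        converged = max(abs(new[h] - p[h]) for h in p) < tol
        p = new
        if converged:
            break
    return p
```

Calling `em_haplotype_freqs({(1, 1): 10, (2, 2): 5, (0, 0): 5})` assigns the ambiguous individuals almost entirely to the AB|ab phase, because the unambiguous genotypes contain only AB and ab haplotypes. The choice of initial values matters in general, which is one of the computational issues the text mentions.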


Despite painstaking cumulative efforts to fight against cancer by researchers worldwide in the past five decades, we have still not achieved substantial progress in diagnosis, prevention and treatment of this disease. In the US, the cancer death rate was 0.1939% in 1950 and still 0.1940% in 2001. It is impossible to achieve a significant increase in the cure rate of cancer without more profound knowledge of cancer pathophysiology. A firm understanding of cancer biology is key to discovering new anticancer agents and developing new biomedical technologies. The emerging convergence between cancer genetics, cancer etiology and drug development brings new hope for significant breakthroughs in controlling cancer in the near future.

The recent completion of the Human Genome Project and HapMap Project provides a warehouse of genetic data that allows the categorization, documentation, and organization of the human genome related to complex traits and biological processes. Although this will facilitate the detection of cancer genes, there is a pressing need to integrate genetic mapping strategies with knowledge of how cancer originates and how it can be selectively advantageous in the host environment. To do so,


In this chapter, I will review several recent developments in genetic mapping strategies that can be useful for cancer gene identification and then outline the latest discoveries in the genetic mechanisms of cancer initiation. In the following chapters, we will provide specific statistical designs based on each of these genetic discoveries. Statistical algorithms will be derived to estimate genetic parameters from the data. We will show how a number of new hypotheses about cancer genetics can be generated from the results obtained. Finally, I will describe the goals of this dissertation work in cancer gene identification.

(Lander and Botstein 1989). By mapping quantitative trait loci (QTLs), the genetic architecture of complex traits can be elucidated. Recently, the availability of the human genome sequence and progress in sequencing and bioinformatic technologies have enabled genetic mapping to study the detailed mechanisms for trait variation. I will pinpoint the aspects of results which genetic mapping can achieve through appropriate statistical analyses.


A statistical model for detecting the effects of QTN, triggered by individual haplotypes, has been derived (Liu et al. 2004). A haplotype is defined as a linear arrangement of alleles (i.e., nucleotides) at different SNPs on a single chromosome, or part of a chromosome. The main idea of our model is to distinguish the different diplotypes (i.e., pairs of haplotypes) that can underlie the same observed SNP genotype. For example, a double heterozygous genotype, AaBb, can be formed by either diplotype AB|ab or Ab|aB, where | separates the paternally and maternally derived haplotypes. These two diplotypes, although identical in genotype, can respond differently to cancer risk. Liu et al.'s approach allows the separation of diplotypes in the genetic control of cancer.

(Boone et al. 2007; Hartman et al. 2001; Lehner 2007). The study of genetic interactions will not only help better


In general, cancer incidence and development are affected not only by the host genes, but also by genes derived from the cancer cells themselves. These two different systems of genes operate interactively or epistatically to alter the course of cancer growth. Following the completion of human genome sequencing, the cancer genome is now being sequenced (Kaiser 2005). The data from these two types of genomes will provide tremendous resources for characterizing the patterns and organization of genome-genome interactions in cancer susceptibility.


(Fan et al. 2006). Currently, systematic mapping of protein-protein interactions has received considerable attention with the launch of the so-called "interactome" mapping projects. This will elucidate the wiring diagram of protein associations in cells (Rual et al. 2005; Stelzl et al. 2005). These types of gene and/or protein functional relationships can be modeled together to better understand and predict the molecular mechanisms of neoplasia (Barabasi and Oltvai 2004; Hernandez et al. 2007; Khalil and Hill 2005; Pujana et al. 2007; Rhodes and Chinnaiyan 2005).

(Greenman et al. 2007), compare the sequences found in tumor samples and those of the originating normal tissues, and identify regions of the genome that differ between the two types of tissues (Parmigiani et al. 2009). A detailed survey has identified the numerous alterations of cancer genomes at the level of the chromosomes, the chromatin (the fibers that constitute the chromosomes) and the nucleotides. These alterations may be due to irreversible aberrations in the DNA sequence or structure and in the number of particular sequences, genes or chromosomes (i.e., the copy number of the DNA), or to potentially reversible changes, known as epigenetic modifications to the DNA and/or


(Chin and Gray 2008). Figure 1-1 is a diagram that describes an overall picture of the genetic aberrations related to cancer. A major challenge is how to predict each of these types of aberrations from rapidly accumulating high-dimensional genomic data. We argue that these can be predicted and diagnosed by effectively analyzing genomic data collected from well-designed experiments.

(Fig. 1-1 A). In particular, probably all adult organisms are mosaics of somatically mutated cells. Somatic mutations may occur after exposure to carcinogens through cytosine deamination. A mutator phenotype caused by mutations in polymerases and/or in mismatch-repair genes can also lead to spontaneous mutations and chromosomal instability (Baudot et al. 2009; Loeb et al. 2008). Somatic changes can be identified through re-sequencing the genome, and their involvement in cancer can be detected by linkage or linkage disequilibrium analysis with SNPs. A large-scale re-sequencing project of the human genome led to the discovery that 60% of malignant melanomas, 10% of colorectal cancers and a smaller percentage of other cancers may be caused by somatic mutations in BRAF, which encodes a serine/threonine kinase (Davies et al. 2002). This discovery has stimulated the development of BRAF inhibitors, leading to the use of several drugs in clinical trials. Other notable discoveries include frequent mutations in PIK3CA (which encodes the catalytic subunit of phosphatidylinositol-3-OH kinase) and AKT1 (which encodes a serine/threonine kinase) (Carpten et al. 2007) in many cancer types, as well as in ERBB2 and EGFR in non-small-cell lung cancer (Sharma et al. 2007; Stephens et al. 2004). It is interesting to find that the mutation status of EGFR can predict responses to treatment


with the EGFR inhibitors gefitinib or erlotinib in patients with advanced non-small cell lung cancer. These studies not only highlight the importance of somatic mutations in cancer susceptibility, but also show a possible role of testing mutations in deciding an appropriate treatment or intervention for individual patients.

Figure 1-1. Main types of genomic and epigenomic aberration that cause cancers. The left panel shows the types of genomic and epigenomic aberration, and the right panel shows examples of how they can be detected. A) Changes in DNA sequence, such as point mutations. B) Changes in genomic organization. DNA segments exchanged between the two (blue and green) DNA molecules are shown. C) Changes in DNA copy number, such as those that result from amplification. D) Changes in DNA methylation and the resultant changes in chromatin structure. All these changes ultimately translate into altered functions, leading to cancers. Adapted from (Chin and Gray 2008).


(Nowell 2007) is the first example of chromosomal translocations that cause human malignancy. A chromosome translocation is the rearrangement of parts between nonhomologous chromosomes (Fig. 1-1 B); the Philadelphia chromosome is the translocation between chromosomes 9 and 22. Since the pioneering discovery of the Philadelphia chromosome, a number of cancer-causing recurrent translocations have been detected in human leukaemias and lymphomas by using molecular cytogenetic analyses (Rowley 1999). Although it is difficult to find causal translocations in solid tumors, with an increasing amount of genomic information and analytical techniques, recurrent structural aberrations related to solid tumors can still be detected. For example, human prostate cancer shows a high frequency of translocations between TMPRSS2 (which is upregulated in response to androgenic hormones) and the ETS-family genes ERG, ETV1 and ETV4 (which encode transcription factors) (Tomlins et al. 2005).

Figure 1-2 illustrates karyotype analyses for normal and tumor cells. The normal cell includes 23 pairs of standard chromosomes, whereas the tumor cell exhibits the irregular karyotype described as aneuploid: some whole chromosomes (e.g., chromosome 2) are missing, others have extra copies (e.g., chromosome 3), and many have traded fragments (e.g., chromosome 14). Although it is still unclear whether aneuploidy is a cause or a consequence of malignant transformation, strong associations between aneuploidy and cancer have been observed. Cells can become aneuploid in two ways: alterations in the number of intact chromosomes, known as whole-chromosome aneuploidy and originating from errors in cell division (mitosis), and rearrangements in chromosome structure (deletions, amplifications or translocations) arising from breaks in DNA (Pellman 2007). While the latter way is a well-established cause of tumor


(Gal-Yam et al. 2008). These modifications are imposed on chromatin, but do not alter the DNA sequence and structure, and they are maintained and inherited through cell divisions (Fig. 1-1 D). Epigenetic modifications tend to provide somatic cells with altered functions and features by expressing or repressing a certain set of genes. These epigenetic alterations may be due to covalent modifications of amino acid residues in the histones around which the DNA is wrapped, and changes in the methylation status of cytosine bases (C) in the context of CpG dinucleotides within the DNA itself (Grønbaek et al. 2007). Methylation of clusters of CpGs (called "CpG-islands") in the promoters of genes has been associated with heritable gene silencing. More recently, human epigenome projects (Jones and Martienssen 2005) and epigenetic therapies (Mack 2006) have been initiated to study how epigenetic changes can modify genome-wide gene expression.

One of the consequences of epigenetic modifications is genomic imprinting, a phenomenon where the father and mother contribute different epigenetic patterns for specific genomic loci in their germ cells. A growing body of evidence shows that genomic imprinting may have a transgenerational effect (Crews et al. 2007; Nilsson et al. 2008). One of the notable examples of transgenerational imprinting is given by Marcus Pembrey and colleagues (Pembrey et al. 2006). Using the Avon Longitudinal Study of Parents and Children and historical records from Sweden, they observed that male offspring whose paternal (but not maternal) grandfathers were exposed during preadolescence to famine in the 19th century were less likely to die of cardiovascular disease. However, if food was plentiful, then diabetes mortality in the grandchildren increased. The opposite effect


(Anway and Skinner 2008; Skinner and Anway 2007).

Figure 1-2. Karyotype analysis for a normal human cell (left) and tumor cell (right). Numbers under each one specify the sources of its fragments; plus and minus signs identify those that are larger or smaller than usual. Adapted from (Duesberg 2007).


Cancer arises from genomic aberrations of normal cells. Thus, a comparison between the DNA sequence, structure, and organization derived from cancer cells and their originating normal tissue will help to detect these aberrations. Genomic and sequencing data from these two types of tissues for the same subjects should be collected. Such a normal-tumor sampling design can serve two purposes:

First, somatic mutation detection: By screening the normal and tumor genomes from the patient, we can detect those regions that differ between the two genomes. The mutations detected from this comparison that are directly involved in cancer development are called "driver" mutations (Stratton et al. 2009). There is also a group of mutations that may not cause cancer, and they are called "passenger" mutations. Therefore, it is crucial to develop a design that can distinguish these two types of mutations and rank genes based on the likelihood that they may be drivers. Equally importantly,


Second, host-tumor genetic interaction modeling: The development of cancer is viewed as an event initiated by mutational events in the tumor cell, but profoundly influenced by the interaction of these cells with the molecular and cellular components that make up the tumor microenvironment. Complex interactions between the host and tumor determine key processes, such as tumor invasion, angiogenesis, and metastasis, and may also drive the progression of the tumor to a stage that resists standard forms of therapy. By interrupting host-tumor communication, tumor progression can be attenuated or reversed, providing an incentive to understand the molecular nature of these communication signals and to develop targeted therapies that capitalize on the powerful control a "normal" environment can exert over the tumor. The normal-tumor sampling design will help to characterize the genetic interactions between genes from the host and cancer genomes.

(Dupuis et al. 2007; Wu and Zeng 2001) is helpful for overcoming the limitation of LD mapping by simultaneously estimating the linkage and linkage disequilibrium.






This dissertation will for the first time establish a systematic procedure for cancer gene identification based on the latest discoveries in the genetic architecture of cancer initiation and progression. The designs, models, and algorithms developed, coupled with recent advances in sequencing technologies, will provide important tools for probing the molecular genetic mechanisms of cancer and ultimately helping to design effective interventions for this important disease.


Futreal et al. 2004). These cancer genes have mostly been identified using physical and genetic mapping strategies. However, since each of these strategies can only identify a subset of cancer genes, the question of how many genes in total are involved in cancer development is largely unanswered. A recent study surveying the human genome shows that more gene mutations may drive cancer than previously thought (Greenman et al. 2007). The advent of the human genome sequence and cancer genome sequence provides an unprecedented opportunity to reveal the full compendium of mutations in individual cancers (Stratton et al. 2009; The Cancer Genome Atlas Research Network 2008; Velculescu 2008) and thereby use this information to develop accurately targeted treatments.

The great majority of somatic mutations observed were single-base substitutions (Velculescu 2008), although mutations may also encompass several other classes of DNA sequence change, such as insertions or deletions of small or large segments of DNA, DNA rearrangements, copy number increases, and copy number reductions (Stratton et al. 2009). The first somatic point mutation that causes cancer was identified in two independent experiments in which cancer DNA was introduced into normal cells (Reddy et al. 1982; Tabin et al. 1982). The transformed normal cells become cancerous due to a single-base G>T substitution that leads to a glycine-to-valine substitution in codon 12 of the HRAS gene. This discovery prompted a growing number of studies associating genetic mutations with cancer. The substitution of C:G base pairs by T:A base pairs or by G:C base pairs is correlated with colorectal cancer and breast cancer and may be explained


Greenman et al. 2007). In addition, the chromosomal distribution and impact of these mutations were found to differ between these two types of cancer, suggesting that the mechanisms underlying mutagenesis and repair are cancer-dependent. Somatic mutations in the genome occur through misincorporation in DNA replication during exposure to exogenous or endogenous mutagens. This may include two different biological processes. In one process, mutations confer a growth advantage on the cell in which they occur, and thus they have been positively selected. The mutations in this process are called "driver" mutations; the genes that carry them are the so-called cancer genes. The mutations in the second process are "passenger" mutations, which are neutrally present in the cell that was the progenitor of the final clonal expansion of the cancer and have not been subject to selection. In practice, there is a pressing need to distinguish driver from passenger mutations and to characterize the incidence and distribution of driver mutations in the human genome.

In this study, we will develop a statistical model for detecting driver mutations in a genome-wide association study. The model assumes that driver mutations occur at individual points in somatic chromosomes, which are in strong association with common genetic variants (such as single nucleotide polymorphisms, SNPs) due to some evolutionary forces. The basic idea of the model is to use SNPs genotyped in the genome of a normal tissue to predict driver point mutations that are expressed in the cancer genome. The model considers the differences in the probability of mutations occurring in alleles derived from the male and female chromosomes. However, because observed genotypes at individual markers hide the information about the parental origins of an allele, a mixture model is formulated to reflect these differences. We derive an EM algorithm to estimate and test parameters related to sex-specific differences in mutation rate. The estimated coefficients of disequilibrium between SNPs in the human genome and the cancer gene can be used to predict the occurrence of driver mutations responsible for cancer.


2.2.1 Study Design

Table 2-1 gives the data structure of genotype observations from the normal and cancer cells for a hypothetical example composed of 10 subjects in which SNP A is subject to mutation but SNP B is not. Subjects 1, 2, and 5 are diagnosed with cancer because of the driver mutation occurring in their genomes. The other subjects have no cancer. To find such a driver mutation, we genotype genome-wide SNPs in normal cells, some of which are hoped to serve as predictors. In Table 2-1, SNP B is an example of these predictors. Of its three genotypes B1B1, B1B2, and B2B2, with genotype frequencies of Q1, Q2, and Q3, respectively, B1B1 is consistently in agreement with the driver mutation and can be used as a predictor. Such predictors in practice can be detected by genotyping


Table 2-1. Genotypes at SNPs A and B for normal and cancer cells. Columns: No.; Normal (A, B, ...); Cancer (A).

2-3 lists the possible configurations of each genotype before the mutation occurs. The two configurations of "parental" genotype A1A2 are A1|A2 and A2|A1, with frequencies denoted as $\dot{P}_2$ and $\ddot{P}_2$, respectively, which sum to $P_2$. The table also provides the relative proportions of "offspring" configurations mutated from the same "parental" genotype in terms of sex-specific mutation rates. From the table, we calculate the frequencies of "offspring" genotypes in the population after the mutation occurs, which are arrayed as

Because "offspring" genotypes A1A3, A3A3, and A1A1 are


Proof: The frequencies of alleles A1 and A2 in the "parental" population before the mutation are expressed as $p_1 = P_1 + \frac{1}{2}P_2$ and $p_2 = P_3 + \frac{1}{2}P_2$, respectively. After the mutation, allele A1 is split into two parts, A1 (wild-type) and A3 (mutant). Let $p'_1$ and $p'_3$ denote the frequencies of alleles A1 and A3 in the "offspring" population after the mutation. Thus, we have $p_1 = p'_1 + p'_3$. Since HWE is assumed in the "parental" population, we have

$$P_1 = p_1^2 = (p'_1 + p'_3)^2 = (p'^2_1 + D_1) + (p'^2_3 + D_3) + (2p'_1 p'_3 - D_1 - D_3) \qquad (2\text{-}2)$$
$$\phantom{P_1} = R_1 + R_4 + R_6;$$
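The bookkeeping in equation (2-2) can be checked numerically with a minimal sketch. The function name and the numerical values are hypothetical, and the mapping of $R_1$, $R_4$, and $R_6$ to genotypes A1A1, A3A3, and A1A3 is an assumption read off the order of the terms in the equation.

```python
import math

def offspring_split(p1_wild, p3_mut, d1, d3):
    """Split the 'parental' A1A1 frequency into three 'offspring' classes.

    Assumed mapping (not confirmed by the source): R1 = A1A1, R4 = A3A3,
    R6 = A1A3. d1 and d3 are the disequilibrium terms D1 and D3.
    """
    r1 = p1_wild ** 2 + d1
    r4 = p3_mut ** 2 + d3
    r6 = 2.0 * p1_wild * p3_mut - d1 - d3
    return r1, r4, r6

# Hypothetical values: p'_1 = 0.45, p'_3 = 0.05, so p_1 = 0.5.
r1, r4, r6 = offspring_split(0.45, 0.05, 0.01, 0.002)
total = r1 + r4 + r6  # should equal p_1 squared
```

The disequilibrium terms cancel in the sum, so the total equals $p_1^2$ whatever values $D_1$ and $D_3$ take.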




$$\log L(A, B) = c + \sum_{i=1}^{3}\sum_{j=1}^{6} n_{ij}\log(P_{ij}),$$

where $c$ is a constant, from which the maximum likelihood estimate (MLE) of each genotype frequency can be obtained as $\hat{P}_{ij} = n_{ij}/N$.

Table 2-2 gives two different compositions of each genotype frequency at SNPs A and B in the "offspring" population after the mutation occurs. The first is the product of the genotype frequencies at SNPs A (determined by $u_1$, $u_2$, $P_1$, $P_2$, and $P_3$) and B (denoted as $Q_1$, $Q_2$, and $Q_3$), and the second is the linkage disequilibrium at the zygotic level between the two SNPs. Of the 18 genotypes, we determine ten independent linkage disequilibria $D_1, \ldots, D_{10}$.

By maximizing likelihood (2-4), we obtain the analytical solutions of genotype frequencies, expressed as


Table 2-2. Genotype frequencies at SNPs A (causal) and B (neutral) in the "offspring" population after the mutation occurs. Rows: SNP A; columns: SNP B.
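The decomposition of each joint genotype frequency into a product of marginal frequencies plus a zygotic disequilibrium can be illustrated with a minimal sketch. It assumes a 6-by-3 layout (six SNP A genotypes in rows, three SNP B genotypes in columns), uses hypothetical counts, and estimates the margins empirically rather than through the mutation-rate parameters $u_1$ and $u_2$, so it illustrates the definition and does not implement the full model.

```python
import math

def joint_mle(counts):
    """MLE of joint genotype frequencies: P_ij = n_ij / N."""
    total = sum(sum(row) for row in counts)
    return [[n / total for n in row] for row in counts]

def log_likelihood_kernel(counts, freqs):
    """Multinomial log-likelihood without the constant c."""
    return sum(
        n * math.log(p)
        for row_n, row_p in zip(counts, freqs)
        for n, p in zip(row_n, row_p)
        if n > 0
    )

def zygotic_disequilibria(freqs):
    """D_ij = P_ij - P_i. * P_.j, with empirical margins (an assumption)."""
    row = [sum(r) for r in freqs]
    col = [sum(c) for c in zip(*freqs)]
    return [
        [freqs[i][j] - row[i] * col[j] for j in range(len(col))]
        for i in range(len(row))
    ]

# Hypothetical joint genotype counts (rows: SNP A, columns: SNP B).
counts = [
    [30, 20, 10],
    [25, 40, 15],
    [12, 18, 20],
    [8, 6, 4],
    [5, 9, 7],
    [3, 4, 6],
]
freqs = joint_mle(counts)
loglik = log_likelihood_kernel(counts, freqs)
D = zygotic_disequilibria(freqs)
```

Under this layout every row and every column of `D` sums to zero, which leaves (6 - 1)(3 - 1) = 10 free values, in line with the ten independent disequilibria mentioned in the text.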


2-2. In step 1, for a particular joint genotype in Table 2-2, calculate the proportions of compositions related to the mutation rates by

related to the configuration frequencies by

and related to the disequilibria by


^u1=P3j=1(4jn4j+5jn5j+6jn6j1jn1j) (2-9)
^u2=P3j=1(04jn4j+05jn5j+6jn6j01jn1j)

the configuration frequencies by

^_P2=P2P3j=1(2jn2j+5jn5j) (2-11)
^P2=P2P3j=1(02jn2j+05jn5j)

and the disequilibria by

^D1=n1111
^D2=n2121
^D3=n3131
^D4=n4141
^D5=n5151
^D6=n1212
^D7=n2222
^D8=n3232
^D9=n4242


Both steps 1 and 2 are iterated until the estimates are stable. The stable estimates are the maximum likelihood estimates (MLEs) of the parameters.

under which the parameters $P_1$, $P_2$, $P_3$, $Q_1$, $Q_2$, and $Q_3$ are estimated with an analytical form and the parameters $u_1$, $u_2$, $\dot{P}_2$, and $\ddot{P}_2$ are estimated with the iterative algorithm. The 10 zygotic disequilibria that specify the association between a SNP and the cancer gene can be tested individually or collectively. A similar EM algorithm can be implemented to estimate the parameters for these tests.

It would be interesting to test whether the mutation rate is sex-specific. This can be performed by using the following null hypothesis:

$$H_0: u_1 = u_2 = u \qquad (2\text{-}24)$$

When $u_1 = u_2 = u$, the two configurations of "parental" genotype A1A2 collapse into a single genotype whose frequency is $P_2$. All the genotype frequencies can be estimated using an analytical formula, but the estimate of $u$ under hypothesis (2-24) should be based on the EM algorithm.

Because all the hypothesis tests described above nest the null hypothesis within the alternative hypothesis, the likelihood ratio test statistic can be regarded as following a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters estimated under the null and alternative hypotheses.
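The nested-hypothesis test described above can be sketched as follows. This is a minimal illustration that takes the maximized log-likelihoods as given; the function names and the numerical values are hypothetical, and the $\chi^2$ tail probability is computed with the standard recurrence for the regularized incomplete gamma function, using only the standard library.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)               # Q(1, h) = exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))    # Q(1/2, h) = erfc(sqrt(h))
    while a < df / 2.0:
        # Recurrence: Q(a + 1, h) = Q(a, h) + h^a exp(-h) / Gamma(a + 1)
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_null, loglik_alt, n_params_null, n_params_alt):
    """Return the test statistic, degrees of freedom, and p-value."""
    stat = 2.0 * (loglik_alt - loglik_null)
    df = n_params_alt - n_params_null
    return stat, df, chi2_sf(stat, df)

# Hypothetical maximized log-likelihoods for H0: u1 = u2 against H1: u1 != u2.
stat, df, p_value = likelihood_ratio_test(-1050.0, -1047.5, 9, 10)
```

With these made-up inputs the statistic is 5.0 on 1 degree of freedom; the same function applies to the disequilibrium tests once the corresponding parameter counts are supplied.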


2-3), and 0.4 for allele B1 and 0.6 for allele B2, regardless of the parental origin, at SNP B. Originally, the two SNPs may or may not be associated, but become associated after the mutation from A1 to A3 happens at SNP A. The coefficients of genetic associations at the zygote level between the two SNPs are given according to the magnitude and occurrence of disequilibria: (1)


Table 2-3. Sex-specific mutations at SNP A for normal and cancer cells before and after the mutation. Columns: "Parental" (Genotype, Configuration); Mutation Occurs; "Offspring" (Configuration, Proportion).

Tables 2-4 to 2-8 provide the MLEs of the allele frequencies, mutation rates, and zygotic disequilibria between SNPs A and B with the model derived. It is not surprising that the allele frequencies and genotype frequencies at the SNP marker, as well as the genotype frequencies (and also allele frequencies) for the cancer gene before its mutation occurs, can be very well estimated. A sample size of 400 is largely sufficient to estimate those parameters precisely. The model has the power to estimate the frequencies of the two configurations of a heterozygote at the cancer gene, which carry the same two alleles with reciprocal parental origins. The frequencies of an allele (say A1) for the cancer gene derived from the right-side parent ($P_R(A_1)$) and the left-side parent ($P_L(A_1)$) can be estimated reasonably well, but a sufficiently large sample size (say 2000 or more) is needed to ensure that their estimates reach acceptably high precision.

Mutation rates are an important parameter for studying the genetic control of a cancer. It is not technically difficult to estimate the mutation rate of an allele by analyzing and comparing the data collected from cancer and normal genomes. The merit of our model lies in the estimation of the mutation rates of alleles derived from different parents.


All the disequilibrium coefficients can be estimated, although a convergence problem may occur when these values are small. Under strong or moderately strong disequilibria, a sample size of 800 or 2000 seems to be sufficient for reasonably good estimates (Tables 2-4 and 2-5). For weak disequilibria, a larger sample size (say 2000 or more) is needed (Table 2-6). The power to detect linkage disequilibria depends on the size of the disequilibria. For weak disequilibria, power is only about 25 to 35% for a sample size of 400 to 2000 (Table 2-6). The power increases dramatically when the disequilibria are strong or moderately strong (Tables 2-4 and 2-5).

We also examined the estimates of parameters when all 10 disequilibria occur. The results summarized in Tables 2-7 and 2-8 suggest that all parameters can be estimated, but a large sample size is required for those parameters estimated by numerical iteration. We investigated the false positive rates of the disequilibrium tests under different simulation scenarios by simulating joint genotype data under the assumption of no zygotic disequilibria. In most cases, the false positive rates are about 5 to 1% (data not shown).

Stratton et al. 2009). Over the past 20 years, much knowledge has accumulated about these mutations and the abnormal genes that operate in human cancers. Now it is time to characterize genes for cancers using a complete DNA sequence of cancer genomes. To elucidate a detailed and comprehensive picture of the genetic architecture of cancer, there is a pressing need to develop a statistical model for analyzing the data of the normal and cancer genomes, aimed at the genetic mutations for cancer.


Greenman et al. 2007; Haber and Settleman 2007; Stratton et al. 2009). Our model intends to detect those drivers for cancer based on the assumption that some particular SNPs are associated with driver mutations because of evolutionary forces. The model requires a design in which both normal and cancer cells are genotyped at SNPs for a set of subjects sampled from a natural human population.

Our model is constructed on linkage disequilibria expressed at the zygote level. Traditional theory for linkage disequilibrium analysis is based on the non-random association between different genes at the gametic level. With the assumption of Hardy-Weinberg equilibrium (HWE), gametic linkage disequilibria are estimated from observed zygotic data. Even for a population originally at HWE, this assumption may be violated when gene mutations occur. Thus, the traditional theory cannot be used for linkage disequilibrium analysis in a cancer population with genetic mutations. In this article, we introduce the concept of zygotic linkage disequilibrium, defined as non-random associations between genotypes at different genes. By scanning SNPs throughout the genome, the model will be able to detect those subjects that are likely susceptible to cancer based on their genotypes at nearby SNPs.

The model assumes that genetic mutations occur with different probabilities depending on the parental origin of the allele (Arnheim and Calabrese 2009). It further assumes that the occurrence of genetic mutation is independent between the two sexes. While the first assumption has some biological foundation because of the ubiquity of genetic imprinting, the second assumption needs to be justified.

Given the growing popularity of genome-wide association studies, the model needs to be expanded to consider the impact of multiple gene mutations in the genome. In principle, such an extension is straightforward with the framework constructed in




Table 2-4. The maximum likelihood estimates (MLEs) of allele frequencies, configuration frequencies, genotype frequencies, mutation rates, and zygotic disequilibria when a cancer gene is strongly associated with the SNP simulated. The standard errors of the MLEs calculated from 500 simulation runs are given in parentheses. Columns: Parameter; True; Sample Size 400, 800, 2000.


Table 2-5. The maximum likelihood estimates (MLEs) of allele frequencies, configuration frequencies, genotype frequencies, mutation rates, and zygotic disequilibria when a cancer gene is moderately associated with the SNP simulated. The standard errors of the MLEs calculated from 500 simulation runs are given in parentheses. Columns: Parameter; True; Sample Size 400, 800, 2000.


Table 2-6. The maximum likelihood estimates (MLEs) of allele frequencies, configuration frequencies, genotype frequencies, mutation rates, and zygotic disequilibria when a cancer gene is weakly associated with the SNP simulated. The standard errors of the MLEs calculated from 500 simulation runs are given in parentheses. Columns: Parameter; True; Sample Size 400, 800, 2000.


Table 2-7. The maximum likelihood estimates (MLEs) of allele frequencies, configuration frequencies, genotype frequencies, mutation rates, and zygotic disequilibria when the association between a cancer and the SNP is determined by all different disequilibrium coefficients in case 1. The standard errors of the MLEs calculated from 500 simulation runs are given in parentheses. Columns: Parameter; True; Sample Size 400, 800, 2000.


Table 2-8. The maximum likelihood estimates (MLEs) of allele frequencies, configuration frequencies, genotype frequencies, mutation rates, and zygotic disequilibria when the association between a cancer and the SNP is determined by all different disequilibrium coefficients in case 2. The standard errors of the MLEs calculated from 500 simulation runs are given in parentheses. Columns: Parameter; True; Sample Size 400, 800, 2000.


Duesberg et al. 1999; Hanks and Rahman 2005; Stock and Bialy 2003; Weaver and Cleveland 2006). According to extensive work by Duesberg and his group, aneuploidy offers a simple, coherent explanation of all cancer-specific phenotypes in terms of the following aspects:

(1) Aneuploidy is confirmed to generate abnormal phenotypes, such as Down syndrome in humans and cancer in animals;
(2) The degrees of aneuploidy are correlated with the resulting phenotype abnormalities;
(3) Since aneuploidy imbalances the highly balance-sensitive components of the spindle apparatus, it destabilizes symmetrical chromosome segregation;
(4) Both non-genotoxic and genotoxic carcinogens can cause aneuploidy by physical or chemical interaction with mitosis proteins.

It is concluded that when aneuploidy exceeds a certain threshold it is sufficient to cause all cancer-specific phenotypes (Duesberg et al. 2007; Duesberg 2007). Kops et al. (2005) outlined the cytological mechanisms for aneuploid formation from checkpoint signalling. Normally, chromosome mis-segregation can be prevented by the mitotic checkpoint, which delays cell-cycle progression through mitosis until all chromosomes have successfully made spindle-microtubule attachments. Any defect in the mitotic checkpoint


Suijkerbuijk and Kops 2008).

In the preceding chapter, I proposed a model for detecting somatic mutations that cause cancer. This mutation hypothesis and the hypothesis that cancer results from aneuploidy present two alternative views of the genetic control of cancer. In the past decade, a heated debate has arisen about which hypothesis better explains the causes of cancer formation (Li et al. 2000; Parris 2005), although the two mechanisms may cooperate to govern carcinogenesis (Zhang et al. 2006). In this chapter, I will present a statistical model for detecting the aneuploidy control mechanism of cancer. The model is constructed with a random set of samples from a natural cancerous population. By assuming the duplication of chromosomes derived from different parents, the model is able to test the genetic imprinting of alleles due to their different parental origins. If the aneuploidy hypothesis continues to be confirmed, this model will provide a timely tool to quantify the genetic effects of aneuploid loci on cancer susceptibility with the genetic data collected from the cancer genome project. Also, by comparison with the model for detecting somatic mutations, this new model will help to determine the relative importance of these two hypotheses in cancer studies.

3.2.1 Study Design


Triploid Model: Consider a gene of interest A, with two alleles A and a, on a chromosome (say chromosome 3). Figure 3-1 describes the process by which a pair of normal chromosomes is duplicated into a triploid for a portion of chromosome 3. For a normal diploid, the genotypes at this gene may be AA, Aa, or aa. Considering parent-specific origins of alleles, we use A|A, A|a, a|A, and a|a to denote the configurations of these genotypes, respectively, where the left- and right-side alleles of the vertical lines represent two alleles from different parents. Of these, configurations A|a and a|A are observed as the same genotype Aa. When the chromosomal segment that harbors this gene is duplicated for only one single chromosome, triploids with two copies from one parent and the third copy from the other parent will result. It is possible that single chromosomes derived from the maternal and paternal parents may both be duplicated, but with different frequencies. Thus, through such a duplication, four configurations in the normal diploid will form a total of eight triploid configurations, which are classified into four different genotypes: (1) AAA, (2) AAa, (3) Aaa, and (4) aaa.

Let $p$ and $q$ ($p + q = 1$) be the allele frequencies of A and a in the original population before chromosome duplication. For a natural population at Hardy-Weinberg equilibrium (HWE), genotype frequencies can be expressed as $p^2$ for genotype AA, $2pq$ for genotype Aa, and $q^2$ for genotype aa.


Figure 3-1. Diagram for chromosome duplication and the resulting changes of genotypes at an aneuploid gene A. Alleles in different colors denote their parent-specific origins separated by the vertical lines.

Proof: Let $g$ and $h$ denote the proportions of alleles A and a that are duplicated, respectively. Thus, of diploid genotype AA, a proportion $g$ will become AAA, with the remaining proportion $1 - g$ unduplicated. Similarly, proportions $h$ and $1 - h$ of diploid genotype aa will be aaa and aa after duplication. Diploid genotype Aa will have three possibilities: AAa with a proportion of $g$, Aaa with a proportion of $h$, and Aa with a proportion of $1 - g - h$. In a duplicated population purely composed of triploids, we will have allele frequencies for A and a, respectively, as

$$\tilde{p} = p^2 g + \tfrac{2}{3}(2pqg) + \tfrac{1}{3}(2pqh) = p\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right)$$
$$\tilde{q} = q^2 h + \tfrac{2}{3}(2pqh) + \tfrac{1}{3}(2pqg) = q\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right)$$


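$$p^2 g = p^3\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right)^3 \frac{g}{p\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right)^3} = \tilde{p}^3 \frac{g}{p\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right)^3} = \tilde{p}^3 + \tilde{p}^3\left[\frac{g}{p\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right)^3} - 1\right].$$

$$q^2 h = q^3\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right)^3 \frac{h}{q\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right)^3} = \tilde{q}^3 \frac{h}{q\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right)^3} = \tilde{q}^3 + \tilde{q}^3\left[\frac{h}{q\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right)^3} - 1\right].$$

Unless $g = p\left(pg + \tfrac{4}{3}qg + \tfrac{2}{3}qh\right)^3$ and $h = q\left(qh + \tfrac{4}{3}ph + \tfrac{2}{3}pg\right)^3$, the duplicated population will be at Hardy-Weinberg disequilibrium.

This theorem shows that traditional HWE theory for population genetic studies will not be useful for cancer gene identification. Meanwhile, this theorem provides a foundation for my model to perform association studies of cancer.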
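The quantities in the proof can be evaluated numerically with a minimal sketch. The function name and the input values are hypothetical, the genotype frequencies are used exactly as written in the proof (they are not renormalized to sum to one), and the Hardy-Weinberg disequilibria are taken as the differences between the homozygous triploid frequencies and the cubes of the allele frequencies.

```python
import math

def triploid_summary(p, g, h):
    """Allele frequencies and HWD terms after duplication.

    p: frequency of allele A before duplication (q = 1 - p)
    g, h: proportions of alleles A and a that are duplicated
    """
    q = 1.0 - p
    # Triploid genotype frequencies as they appear in the proof.
    freq = {
        "AAA": p * p * g,
        "AAa": 2.0 * p * q * g,
        "Aaa": 2.0 * p * q * h,
        "aaa": q * q * h,
    }
    p_tilde = p * (p * g + 4.0 / 3.0 * q * g + 2.0 / 3.0 * q * h)
    q_tilde = q * (q * h + 4.0 / 3.0 * p * h + 2.0 / 3.0 * p * g)
    d1 = freq["AAA"] - p_tilde ** 3  # deviation of AAA from its HWE value
    d2 = freq["aaa"] - q_tilde ** 3  # deviation of aaa from its HWE value
    return {"freq": freq, "p_tilde": p_tilde, "q_tilde": q_tilde,
            "D1": d1, "D2": d2}

# Hypothetical values: p = 0.6, g = 0.3, h = 0.2.
result = triploid_summary(0.6, 0.3, 0.2)
```

For these inputs both deviations are clearly non-zero, which illustrates why the duplicated population is generally not at Hardy-Weinberg equilibrium.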


3-1).

Table 3-1. The changes of genotypes and genotype frequencies after chromosomal duplication. Columns: Non-duplicated (Genotype, Frequency); Duplication; Duplicated (Genotype, Frequency, Observation).

For each triploid genotype, the relative proportions of the two underlying configurations can be different, depending on the rate of the duplication of parent-specific chromosomes. Let $u$ and $1 - u$ be the proportions of the duplication of allele A derived from the maternal


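3-2). These proportions can be estimated from genotype data.

Table 3-2. Genotypic values and proportions of different configurations of a triploid genotype at a duplicated gene.

Duplicated Genotype   Configuration              Genetic Value            Duplication Rate
AAA                   A_M A_M A_F, A_M A_F A_F   $\mu_{11}$, $\mu_{12}$   $u$, $1-u$
AAa                   A_M A_M a_F, a_M A_F A_F   $\mu_{21}$, $\mu_{22}$   $u$, $1-u$
Aaa                   A_M a_F a_F, a_M a_M A_F   $\mu_{31}$, $\mu_{32}$   $1-v$, $v$
aaa                   a_M a_F a_F, a_M a_M a_F   $\mu_{41}$, $\mu_{42}$   $1-v$, $v$

Each genotypic value $\mu_{kj}$ (genotype $k$, configuration $j$) is expressed as a linear combination of the overall mean, the additive effect $a$, the dominance effects $d$ and $d'$, the imprinting effect, and the imprinting interactions $I_a$, $I_d$, and $I_{d'}$.

$\hat{P}_k = n_k/N$, which is derived from a multinomial likelihood. The EM algorithm is implemented to estimate the allele frequencies ($\tilde{p}$ and $\tilde{q}$) and HWD coefficients from the triploid genotype observations of the aneuploid population sampled (Table 3-1). It is described as follows: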


1A=3~p3 P2;3A=3~p~q2 for allele A, and

2a=3~p2~q P2;3a=6~p~q2 for allele a.

In the M step, the allele frequencies are then estimated with the following equations:

~p=n11A+n22A+n33A
~q=n22a+n33a+n44a

and the HWD coefficients are calculated by

1A 4a

To estimate the duplication rate and genotypic values for each configuration, we need to formulate a mixture model because each triploid genotype contains two unknown configurations. The likelihood of genotype observations at the duplicated gene (Table 3-1) and phenotypic values ($y$) measured for all subjects is constructed as

where =(u;v;fk1;2kg4k=1) is the vector of unknown parameters, and fkj(yki)=1


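3-8). In the E step, the posterior probability with which a subject $i$ with a specific triploid genotype has a configuration $j$ is calculated using

$$\Phi_{11i} = \frac{u f_{11}(y_i)}{\cdots}, \quad \ldots, \quad \Phi_{32i} = \frac{v f_{32}(y_i)}{(1-v) f_{31}(y_i) + v f_{32}(y_i)},$$
$$\Phi_{41i} = \frac{(1-v) f_{41}(y_i)}{(1-v) f_{41}(y_i) + v f_{42}(y_i)}, \quad \Phi_{42i} = \frac{v f_{42}(y_i)}{(1-v) f_{41}(y_i) + v f_{42}(y_i)}.$$

In the M step, by solving the log-likelihood equations, the parameters are estimated with the calculated posterior probabilities, i.e.,

(3-10)

A loop of the E and M steps is iterated between equations (3-9) and (3-10), (3-11), (3-12), and (3-13). Thus, the parameter estimates are obtained when the estimates converge to stable values. The MLEs of genetic effects can be obtained by solving a system of linear equations given in Table 3-2, i.e.,

$$\hat{a} = \tfrac{1}{6}(\hat{\mu}_{11} + \hat{\mu}_{12} - \hat{\mu}_{41} - \hat{\mu}_{42}) \qquad (3\text{-}14)$$
$$\hat{d} = \tfrac{1}{2}(\hat{\mu}_{21} + \hat{\mu}_{22}) - \tfrac{2}{3}(\hat{\mu}_{11} + \hat{\mu}_{12}) + \tfrac{1}{6}(\hat{\mu}_{41} + \hat{\mu}_{42}) \qquad (3\text{-}15)$$
$$\hat{d}' = \tfrac{1}{2}(\hat{\mu}_{31} + \hat{\mu}_{32}) - \tfrac{2}{3}(\hat{\mu}_{41} + \hat{\mu}_{42}) + \tfrac{1}{6}(\hat{\mu}_{11} + \hat{\mu}_{12}) \qquad (3\text{-}16)$$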


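$$\ldots 2(\hat{\mu}_{11} - \hat{\mu}_{12}) - \tfrac{1}{4}(\hat{\mu}_{41} - \hat{\mu}_{42}) \qquad (3\text{-}17)$$
$$\hat{I}_a = \tfrac{1}{6}(\hat{\mu}_{11} - \hat{\mu}_{12}) + \tfrac{1}{6}(\hat{\mu}_{41} - \hat{\mu}_{42}) \qquad (3\text{-}18)$$
$$\hat{I}_d = \tfrac{1}{2}(\hat{\mu}_{21} - \hat{\mu}_{22}) - \tfrac{1}{3}(\hat{\mu}_{11} - \hat{\mu}_{12}) + \tfrac{1}{6}(\hat{\mu}_{41} - \hat{\mu}_{42}) \qquad (3\text{-}19)$$
$$\hat{I}_{d'} = \tfrac{1}{2}(\hat{\mu}_{32} - \hat{\mu}_{31}) + \tfrac{1}{3}(\hat{\mu}_{41} - \hat{\mu}_{42}) - \tfrac{1}{6}(\hat{\mu}_{11} - \hat{\mu}_{12}) \qquad (3\text{-}20)$$

3-1). The log-likelihood ratio calculated under the null and alternative hypotheses follows a $\chi^2$-distribution with 2 degrees of freedom. It is interesting to test the two disequilibria separately. Under the null hypothesis $H_0: D_1 = 0$, genotype frequencies are estimated using equation (3-1), but with a constraint $P_3 = p^3$, in addition to the constraint $P_1 + P_2 + P_3 + P_4 = 1$. Similarly, genotype frequencies are estimated with a constraint $P_4 = q^3$ for testing whether $D_2 = 0$.

Whether the duplicated gene is significantly associated with cancer susceptibility can be tested using the null hypothesis $H_0: \mu_{kj} \equiv \mu$ for $k = 1, \ldots, 4$; $j = 1, 2$. The additive effect and the two types of dominance effects can be tested jointly or separately by formulating the relevant null hypotheses based on equations (3-14), (3-15), and (3-16). The imprinting effect and its interactions with the additive and dominance effects can be tested by using the null hypotheses that the imprinting effect is zero, $H_0: I_a = 0$, $H_0: I_d = 0$, and $H_0: I_{d'} = 0$, constructed with equations (3-17), (3-18), (3-19), and (3-20), respectively.

The model can also be used to test the significance of the duplication rate for a parent-specific chromosome by formulating the null hypothesis $H_0: u = 1$ or $H_0: v = 1$.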
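The E and M steps described above for the mixture model can be sketched for a single triploid genotype class. This is a simplified illustration and not the full algorithm: it treats one genotype with two configurations, a mixing proportion `u`, two normal densities with a common variance, and simulated phenotypes. The M-step updates are the standard ones for a two-component normal mixture, used here as an assumption because the source equations for this step are not available; all names are hypothetical.

```python
import math
import random

def normal_pdf(y, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_configurations(y, n_iter=500, tol=1e-10):
    """EM for a two-configuration normal mixture with common variance."""
    n = len(y)
    mean_all = sum(y) / n
    var = sum((v - mean_all) ** 2 for v in y) / n
    sd = math.sqrt(var)
    u, mu1, mu2 = 0.5, mean_all + sd, mean_all - sd  # crude starting values
    for _ in range(n_iter):
        # E step: posterior probability that subject i has configuration 1.
        post = []
        for v in y:
            w1 = u * normal_pdf(v, mu1, var)
            w2 = (1.0 - u) * normal_pdf(v, mu2, var)
            post.append(w1 / (w1 + w2))
        # M step: update the proportion, the two means, and the variance.
        s1 = sum(post)
        u_new = s1 / n
        mu1 = sum(w * v for w, v in zip(post, y)) / s1
        mu2 = sum((1.0 - w) * v for w, v in zip(post, y)) / (n - s1)
        var = sum(
            w * (v - mu1) ** 2 + (1.0 - w) * (v - mu2) ** 2
            for w, v in zip(post, y)
        ) / n
        converged = abs(u_new - u) < tol
        u = u_new
        if converged:
            break
    return u, mu1, mu2, var

# Simulated phenotypes: configuration 1 (mean 2) with probability 0.7,
# configuration 2 (mean -2) otherwise; unit variance in both.
random.seed(1)
phenotypes = [
    random.gauss(2.0, 1.0) if random.random() < 0.7 else random.gauss(-2.0, 1.0)
    for _ in range(2000)
]
u_hat, mu1_hat, mu2_hat, var_hat = em_two_configurations(phenotypes)
```

In the model of this chapter the duplication rates are shared across genotypes ($u$ by AAA and AAa, $v$ by Aaa and aaa), so a faithful implementation would pool the posterior probabilities over those genotype classes when updating each rate.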


The model was used to estimate allele frequencies, HWD, parent-specific duplication rates, and genetic effects for a cancer population (Table 3-3). As expected, allele frequencies can be precisely estimated with a modest sample size (400). A larger sample size (say 800) is needed to provide precise estimation of the two HWD coefficients $D_1$ and $D_2$. Because duplication rates determine the mixture proportions for each triploid genotype, their estimates are affected by the heritability level. If the cancer trait has a larger heritability, then a sample size of 400 will provide good estimates of duplication rates. For a less heritable trait, a large sample size (even 2000) is needed for good estimation. The additive effect can generally be well estimated, but the estimates of the dominant effects need a much larger sample size. The estimation precision of the imprinting effect seems to be intermediate between that of the additive and dominant effects. Interestingly, the additive-imprinting interaction effect can be better estimated than the imprinting effect alone. It is hard to estimate the interactions between the dominant and


Table 3-3. The simulation results with different sample size and heritability combinations for both the population parameters ($p$, $u$, $v$) and the genetic parameters. Columns: Sample Size; $H^2$; $p$; $u$; $v$; $a$; $d$; $d'$; $I_a$; $I_d$; $I_{d'}$.


The power to detect the overall genetic effect and imprinting effect was investigated. In general, the model has great power for the identification of aneuploid loci causing cancer. To achieve adequate power for imprinting effect detection, a large sample size and/or large heritability is required. Overall, a sample size of 400 with a heritability of 0.4 can reach a power of over 0.85 for the detection of imprinting effects. We also performed simulation studies to examine the false positive rates for detecting overall genetic effects and imprinting effects at aneuploid loci. It appears that in each case the false positive rates can be controlled to be below 5 to 10%.

Maderspacher 2008), tremendous attention has been given to exploring the genetic cause of tumorigenesis. It has been partly established that aneuploidy has an effect on the proliferation and survival of tumors. The recent discovery of components of the mitotic checkpoint, as well as the realization that many of the classic tumour suppressors and oncogene products regulate mitotic progression, has renewed interest in the role of aneuploidy in tumorigenesis (Hanks and Rahman 2005; Kops et al. 2005; Suijkerbuijk and Kops 2008). With the completion of the human genome project and the HapMap project, there is a pressing need for the development of statistical models for estimating the genetic effect of aneuploid loci on cancer risk.

In this article, we present a statistical strategy for detecting the genetic control of cancer traits through genotyping aneuploids of cancer cells. The model proposed presents two novelties. First, it has for the first time integrated the latest discovery of cancer genetic studies with statistical principles and directly pushed the modeling effort


We performed computer simulation to examine the statistical properties of the model. From the simulation results, an appropriate sample size can be determined for a cancer trait with a particular heritability. Analyses of model power and false positive rates support the potential usefulness of the model when practical data sets are available. Through a simple mathematical proof, I found that the Hardy-Weinberg equilibrium of an original population can be destroyed when some chromosomes are duplicated. Indeed, it is precisely because of the occurrence of Hardy-Weinberg disequilibrium that cancer genes can be detected from association studies of genome-wide SNP data.

The idea of the model can be extended to several more complicated situations. First, the aneuploidy control of cancer may be derived from high-order aneuploids, such as tetraploids. A high-order polyploid not only contains more allelic combinations, but also a larger amount of missing data due to the duplication of different chromosomes with unknown parental origins. To model the tetraploidy control of cancer, a more sophisticated algorithm is required to obtain efficient estimates of parameters. Second, different aneuploid loci responsible for cancer traits may be associated in the duplicated population and interact in a coordinated manner. Modeling of multi-locus associations and multi-locus epistasis deserves further investigation, although these pieces of




Reik and Walter 2001). Caused by an epigenetic mark of differential methylation set during gametogenesis, genetic imprinting has been shown to play a pivotal role in regulating the formation, development, function, and evolution of complex traits and diseases (Constancia et al. 2004; Isles and Wilkinson 2000; Itier et al. 1998; Li et al. 1999; Wilkinson et al. 2007). While most studies of genetic imprinting focus on the epigenetic and molecular mechanisms of this phenomenon (Sha 2008), information about the number and distribution of imprinted genes and their epistatic interactions is still very limited, which restricts our ability to predict the effects of imprinted genes on the diversity of a biological trait or process. Several authors have started to use genome-wide association and linkage studies to identify the regions of the genome that contain imprinted sequence variants and further understand the epigenetic variation of complex traits (Cheverud et al. 2008; De Koning et al. 2000; Liu et al. 2007; Wolf et al. 2008).

In a series of recent studies, Cheverud, Wolf, and colleagues categorized genetic imprinting into different types based on the pattern of its expression, i.e., maternal expression, paternal expression, bipolar dominance, polar overdominance, and polar underdominance (Cheverud et al. 2008; Wolf et al. 2008). With a three-generation F2 design, they identified these types of imprinted quantitative trait loci (iQTL) affecting body weight and growth in mice, displaying much more complex and diverse effect patterns than previously assumed. A different design based on reciprocal backcrosses was proposed to test and estimate the distribution of iQTL responsible for physiological traits related to endosperm development in maize (Li et al. 2007). By modeling identical-by-descent relationships in multiple related families of canines, Liu et al. (2007)


While epigenetic marks resulting in genetic imprinting can be generally stable in an organism's lifetime, they may undergo reprogramming, i.e., a faithful clearing of the epigenetic state established in the previous generation, in the new generation during gametogenesis and early embryogenesis (Morgan et al. 2005; Sasaki and Matsui 2008). However, a growing body of evidence since the early 1980s indicates that genes may escape such reprogramming and, thus, pass their imprinting effects on to subsequent generations (Cropley et al. 2006; Dolinoy et al. 2006; McGrath and Solter 1984; Morgan et al. 1999; Skinner 2008; Surani et al. 1984). Two fundamental questions naturally arise from this discovery: how common are imprinted genes of this type, and how strong is the evidence for their existence in humans and other organisms? If epigenetic changes through imprinted genes can be inherited across generations, this would significantly alter the way we think about the inheritance of phenotype (Whitelaw and Whitelaw 2008; Youngson and Whitelaw 2008). Such transgenerational epigenetic inheritance, i.e., modifications of the chromosomes that pass to the next generation through gametes, may be related to health and disease through a mechanism for transmitting environmental exposure information that alters gene expression in the next generation(s) (Pembrey et al. 2006). The identification of imprinted loci displaying transgenerational epigenetic inheritance will be of great help in addressing the two questions mentioned above, in a quest to elucidate the detailed genetic architecture of complex traits and diseases.

The motivation of this chapter is to develop a novel strategy for identifying imprinted genes and understanding the transgenerational changes of their effects with a three-generation family design. This design samples multiple unrelated nuclear families, each composed of the grandfather, grandmother, father, mother, and grandchildren, from a natural population. The inheritance of alleles at a gene from a male or female parent is traced by observing the segregation of the gene in the progeny generation.


4.2.1 Sampling Strategies

4-1). Given a cross type, the genotypes of sons or daughters can be inferred. Here we first assume one sex (say sons) in the second generation, although both sexes can be considered. The sons from a family serve as fathers and mate with females, serving as mothers, drawn from a natural population, with genotypes AA, Aa, and aa characterized by frequencies $p^2$, $2pq$, and $q^2$, respectively. Each such second-generation family produces a certain number of grandchildren. The genotype frequencies in the third generation are derived according to Mendel's first law.

According to this design, the grandfathers and grandmothers are founders whose parents are unknown. Alleles of sons from a first-generation family can be traced directly or indirectly, but the females used to generate the second-generation family are founders with unknown origin of alleles. For this reason, we will measure the phenotype for sons from the first-generation families and grandchildren from the second-generation families. This design will allow us to characterize imprinting effects of a gene in the second and third generations.


Table 4-1. A three-generation family design used to study transgenerational inheritance. The first generation is composed of the grandfather and grandmother of three genotypes, AA (with a frequency of $p^2$), Aa (with a frequency of $2pq$), and aa (with a frequency of $q^2$), from a natural population; the second generation includes the father, i.e., a son from the first generation, and the mother sampled from the natural population; and the third generation includes a mix of sons and daughters from the second generation. The phenotype is measured for the father in the second generation and children in the third generation. A diplotype is shown by a vertical line, with the paternally derived allele on the left.

First generation (mating types) and second-generation father:

Mating   Grandfather   Grandmother   Father configuration (proportion)
1        AA (p^2)      AA (p^2)      A|A (1)
2        AA (p^2)      Aa (2pq)      A|A (1/2), A|a (1/2)
3        AA (p^2)      aa (q^2)      A|a (1)
4        Aa (2pq)      AA (p^2)      A|A (1/2), a|A (1/2)
5        Aa (2pq)      Aa (2pq)      A|A (1/4), A|a (1/4), a|A (1/4), a|a (1/4)
6        Aa (2pq)      aa (q^2)      A|a (1/2), a|a (1/2)
7        aa (q^2)      AA (p^2)      a|A (1)
8        aa (q^2)      Aa (2pq)      a|A (1/2), a|a (1/2)
9        aa (q^2)      aa (q^2)      a|a (1)

Third generation (proportions of offspring configurations for each second-generation family):

Father configuration   Mother genotype   A|A    A|a    a|A    a|a
A|A                    AA (p^2)          1      0      0      0
                       Aa (2pq)          1/2    1/2    0      0
                       aa (q^2)          0      1      0      0
A|a or a|A             AA (p^2)          1/2    0      1/2    0
                       Aa (2pq)          1/4    1/4    1/4    1/4
                       aa (q^2)          0      1/2    0      1/2
a|a                    AA (p^2)          0      0      1      0
                       Aa (2pq)          0      0      1/2    1/2
                       aa (q^2)          0      0      0      1
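The segregation proportions in Table 4-1 follow from Mendel's first law and can be reproduced with a minimal sketch. The function names are hypothetical, and the convention that the paternally derived allele is written to the left of the vertical line is an assumption inferred from the tabulated proportions.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)

def alleles(genotype):
    """Return the two alleles of a genotype such as 'Aa' or 'A|a'."""
    return [a for a in genotype if a != "|"]

def offspring_configurations(father, mother):
    """Distribution of offspring configurations 'paternal|maternal'."""
    dist = {}
    for pat, mat in product(alleles(father), alleles(mother)):
        key = pat + "|" + mat
        # Each parent transmits either of its two alleles with probability 1/2.
        dist[key] = dist.get(key, Fraction(0)) + HALF * HALF
    return dist

def third_generation(father, p):
    """Average over mothers drawn from a population at HWE (A frequency p)."""
    q = 1 - p
    mothers = {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}
    total = {}
    for mother, freq in mothers.items():
        for config, prob in offspring_configurations(father, mother).items():
            total[config] = total.get(config, Fraction(0)) + freq * prob
    return total

# Sons of mating type 2 (grandfather AA, grandmother Aa).
sons = offspring_configurations("AA", "Aa")
# Grandchildren of an A|a father, with a hypothetical allele frequency of 3/5.
grandchildren = third_generation("A|a", Fraction(3, 5))
```

Exact fractions are used so that the output can be compared entry by entry with the table; the same functions cover all nine mating types.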


Configuration   Paternal   Offspring   (4-1)

where $\mu_F$ and $\mu_O$ are the overall means of the paternal and offspring generations; $a_F$, $d_F$, and $i_F$ are the additive, dominant, and imprinting genetic effects of the gene in the parental generation; and $a_O$, $d_O$, and $i_O$ are the additive, dominant, and imprinting genetic effects of the gene in the offspring generation.

The difference in the genetic architecture of a complex trait between the two generations is described as

$$\Delta a = a_F - a_O, \qquad \Delta d = d_F - d_O, \qquad \Delta i = i_F - i_O.$$

By testing whether these differences are equal to zero, jointly or individually, we can determine the transgenerational changes in the pattern of genetic control. If a significant imprinting effect is detected, we can test the type of genetic imprinting, i.e., parental or bipolar dominance, by incorporating the imprinting models of Cheverud et al. (2008).


4-1 ($j = 1, \ldots, 9$), let $N_j$ denote the number of families of this mating type. Each first-generation family may have one or multiple sons who serve as fathers of the second generation. The second-generation families whose father is derived from the $j$th first-generation mating type and whose mother carries a particular genotype from the natural population are summed together, denoted by $N_{Mjl}$ for mother genotype $l$ ($l = 2$ for AA, 1 for Aa, and 0 for aa). Thus, we have a total of $N_{Ml} = \sum_{j=1}^{9} N_{Mjl}$ second-generation mothers who carry genotype $l$.

It is not difficult to derive the maximum likelihood estimate of the allele frequency from the three-generation family design as

$$\hat{p} = \frac{4N_1 + 3(N_2 + N_4) + 2(N_3 + N_5 + N_7) + (N_6 + N_8) + 2N_{M2} + N_{M1}}{4\sum_{j=1}^{9} N_j + 2\sum_{l=0}^{2} N_{Ml}}.$$
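The allele-frequency estimate above amounts to counting copies of allele A among the founders (grandparents and second-generation mothers). The sketch below illustrates that gene counting; it assumes the denominator is the total number of founder alleles, four per first-generation family and two per mother, which the extracted text does not show explicitly. The function name and argument layout are hypothetical.

```python
from fractions import Fraction

# Copies of allele A carried by the two grandparents of each mating type
# (1: AAxAA, 2: AAxAa, 3: AAxaa, 4: AaxAA, 5: AaxAa, 6: Aaxaa, 7: aaxAA, 8: aaxAa, 9: aaxaa).
A_COPIES = {1: 4, 2: 3, 3: 2, 4: 3, 5: 2, 6: 1, 7: 2, 8: 1, 9: 0}


def allele_frequency(n_mating, n_mothers):
    """Gene-counting estimate of the frequency of allele A.

    n_mating  : dict mapping mating type j (1..9) to the family count N_j
    n_mothers : dict mapping mother genotype l (2 = AA, 1 = Aa, 0 = aa) to N_Ml
    """
    a_alleles = sum(A_COPIES[j] * n for j, n in n_mating.items())
    a_alleles += sum(l * n for l, n in n_mothers.items())
    # Assumed denominator: every founder contributes two alleles.
    total = 4 * sum(n_mating.values()) + 2 * sum(n_mothers.values())
    return Fraction(a_alleles, total)


if __name__ == "__main__":
    print(allele_frequency({1: 10, 9: 10}, {2: 5, 0: 5}))
```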


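where $\Omega_F = (\mu_F, a_F, d_F, i_F, \sigma_F^2)$ and $\Omega_O = (\mu_O, a_O, d_O, i_O, \sigma_O^2)$. When genotypic effects are assumed to be different between the two generations, maximizing the joint likelihood (5) is equivalent to maximizing its two likelihood components independently. The estimates that maximize the first component can be obtained with the EM algorithm. In the E step, the posterior probability with which a heterozygous father of the second generation from the 5th first-generation mating type has a particular configuration is calculated by

$$\Pi^F_{51i} = \frac{\tfrac{1}{2} f_1(y^F_{51i})}{\tfrac{1}{2} f_1(y^F_{51i}) + \tfrac{1}{2} f_{1'}(y^F_{51i})} \quad \text{and} \quad \Pi^F_{51'i} = \frac{\tfrac{1}{2} f_{1'}(y^F_{51i})}{\tfrac{1}{2} f_1(y^F_{51i}) + \tfrac{1}{2} f_{1'}(y^F_{51i})}.$$

In the M step, the genotypic values of configurations and the variance are calculated by

$$\hat{\mu}^F_2 = \frac{\sum_{i=1}^{N^F_{12}} y^F_{12i} + \sum_{i=1}^{N^F_{22}} y^F_{22i} + \sum_{i=1}^{N^F_{42}} y^F_{42i} + \sum_{i=1}^{N^F_{52}} y^F_{52i}}{N^F_{12} + N^F_{22} + N^F_{42} + N^F_{52}}$$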
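The E step for a heterozygous father can be illustrated numerically. The sketch below assumes normal densities with a common standard deviation and takes the two configuration means as inputs; how those means are parameterized by additive, dominant, and imprinting effects is not recoverable from the extracted text, so `mean_Aa` and `mean_aA` are hypothetical inputs. The helper `weighted_mean` shows the kind of posterior-weighted update an M step would use for a configuration mean; it is an illustration, not the dissertation's full M step.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def e_step(y, mean_Aa, mean_aA, sd):
    """Posterior probabilities that a heterozygote with phenotype y is A|a or a|A.

    Both configurations have prior probability 1/2, as for sons of an Aa x Aa mating.
    """
    w1 = 0.5 * normal_pdf(y, mean_Aa, sd)
    w2 = 0.5 * normal_pdf(y, mean_aA, sd)
    total = w1 + w2
    return w1 / total, w2 / total


def weighted_mean(ys, weights):
    """Posterior-weighted mean of phenotypes for one configuration."""
    return sum(w * y for y, w in zip(ys, weights)) / sum(weights)


if __name__ == "__main__":
    phenotypes = [1.4, 0.7, 1.1]
    post = [e_step(y, 1.5, 0.5, 1.0) for y in phenotypes]
    print(post)
    print(weighted_mean(phenotypes, [p[0] for p in post]))
```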


The EM algorithm can also be derived to estimate the genetic parameters in the third generation. In the E step, the posterior probability with which a heterozygous offspring of the third generation, derived from the combination of two heterozygous parents in the second generation, has a particular configuration is calculated by

$$\Pi^O_{jkl1i} = \frac{\tfrac{1}{4} f_1(y^O_{jkli})}{\tfrac{1}{4} f_1(y^O_{jkli}) + \tfrac{1}{4} f_{1'}(y^O_{jkli})} \quad \text{and} \quad \Pi^O_{jkl1'i} = \frac{\tfrac{1}{4} f_{1'}(y^O_{jkli})}{\tfrac{1}{4} f_1(y^O_{jkli}) + \tfrac{1}{4} f_{1'}(y^O_{jkli})}.$$

In the M step, the genotypic values of configurations and the variance are calculated by


where the indicator variables with subscripts $jkl2i$, $jkl1i$, and $jkl0i$ are defined as 1 if offspring $i$ in the third generation, from the combination of father $k$ from the $j$th first-generation mating type and mother $l$ from the natural population, has genotype AA, Aa, and aa, respectively, and 0 otherwise. The EM steps are iterated between equations (6) and (7) to obtain the MLEs of $\Omega_F$ and between equations (8) and (9) to obtain the MLEs of $\Omega_O$.

Churchill and Doerge 1994). The second one has the null hypotheses $H_0: a_F = a_O = 0$, $H_0: d_F = d_O = 0$, and $H_0: i_F = i_O = 0$, respectively. Because each of these null hypotheses is nested within its alternative, the log-likelihood ratio test statistic can be thought to asymptotically follow a $\chi^2$-distribution for a large sample size.

The transgenerational changes of different genetic effects can also be tested, with null hypotheses $H_0: \Delta a = 0$, $H_0: \Delta d = 0$, and $H_0: \Delta i = 0$, respectively. These null hypotheses can be considered singly or jointly, in order to better study the transgenerational changes of the genetic architecture of a trait.


Dawson et al. 2002; Gabriel et al. 2002; Patil et al. 2001). Each block may have a few common haplotypes which account for a large proportion of chromosomal variation. Between adjacent blocks are large regions, called hotspots, in which recombination events occur with high frequencies. Several algorithms have been developed to identify a minimal subset of SNPs, i.e., tagging SNPs, that can characterize the most common haplotypes (Zhang et al. 2002). The number and type of tagging SNPs within each haplotype block can be determined prior to association studies. In this section, I will derive a model for detecting the association between haplotypes constructed by alleles at a set of SNPs and complex traits.

Consider two SNPs, A (with two alleles A and a) and B (with two alleles B and b), to clearly describe the model. The four haplotypes formed by these two SNPs are AB, Ab, aB, and ab, with respective frequencies denoted as $p_{11}$, $p_{10}$, $p_{01}$, and $p_{00}$. The two markers produce nine joint genotypes, AABB (coded as 1), AABb (coded as 2), ..., aabb (coded as 9), which are observed. Thus, each subject will bear one of these genotypes, and the parents in each family will be one of $9 \times 9 = 81$ possible genotype-by-genotype combinations. If each parent in a combination is homozygous for both markers, their offspring will have one genotype. As long as one parent is heterozygous for one marker, the offspring will have two or more genotypes. However, only when both markers are heterozygous for at least one parent will the genotype frequencies of offspring be determined by the recombination fraction between the markers ($r$). Table 4-4 shows the structure and frequencies of mother-by-father genotype combinations under random mating and their offspring genotype frequencies. For a double heterozygote AaBb, its observed genotype may be derived from two possible diplotypes, AB|ab (with probability $p_{11}p_{00}$) or Ab|aB (with probability $p_{10}p_{01}$). Each of these two


              Haplotype
Diplotype     AB           Ab           aB           ab
AB|ab         (1 - r)/2    r/2          r/2          (1 - r)/2
Ab|aB         r/2          (1 - r)/2    (1 - r)/2    r/2

!12=r !23=(1)r+r !1+!24=(1)2r2+(1)r(1r) 21!27=2r2+2(1)r(1r)+(1)2r2
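The haplotype (gamete) frequencies produced by the two diplotypes of a double heterozygote depend on the recombination fraction $r$: parental haplotypes are transmitted with probability $(1 - r)/2$ each and recombinant haplotypes with probability $r/2$ each. A minimal sketch of this rule follows; the function name `gamete_freqs` is hypothetical.

```python
from fractions import Fraction


def gamete_freqs(diplotype, r):
    """Frequencies of haplotypes AB, Ab, aB, ab transmitted by a two-SNP diplotype.

    diplotype is written as 'XY|xy' (e.g. 'AB|ab'); r is the recombination fraction.
    """
    h1, h2 = diplotype.split("|")
    # Recombinant haplotypes swap the allele at the second SNP.
    rec1 = h1[0] + h2[1]
    rec2 = h2[0] + h1[1]
    freq = {h: 0 for h in ("AB", "Ab", "aB", "ab")}
    freq[h1] += (1 - r) / 2
    freq[h2] += (1 - r) / 2
    freq[rec1] += r / 2
    freq[rec2] += r / 2
    return freq


if __name__ == "__main__":
    r = Fraction(1, 10)
    print(gamete_freqs("AB|ab", r))
    print(gamete_freqs("Ab|aB", r))
```

The same function also handles diplotypes that are heterozygous at only one SNP, in which case $r$ cancels out of the result.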




4-1) according to the genotype frequencies. Assume that the allele frequencies of a gene are 0.6 and 0.4 in a natural population at Hardy-Weinberg equilibrium. Our simulation will focus on the impacts of different sampling strategies and heritabilities on parameter estimation and model power. For a given sample size, two sampling strategies are simulated: (1) a large number of small families and (2) a small number of large families.

The first strategy samples 200 unrelated grandfathers and 200 unrelated grandmothers, who marry to form 200 first-generation families. Each first-generation family is assumed to have one son who, as the father, forms a second-generation family with a mother from the natural population. There is one child for each second-generation family. This allocation results in a total of 1000 subjects. All members in the design are typed for the gene, but only the fathers and the offspring of the third generation are phenotyped for a normally distributed trait. The second strategy samples 50 unrelated grandfathers and 50 unrelated grandmothers. In each first-generation family, 3 sons are simulated, forming 150 second-generation families in which 4 children are assumed. This strategy also results in 1000 subjects.

Different genetic effects of the gene, additive, dominant, and imprinting, are simulated for the second and third generations (Table 4-4). Two different heritability levels, 0.1 and 0.4, are simulated for each generation, from which variances are determined. Table


tabulates the estimates of population and quantitative genetic parameters from the three-generation design. As expected, the allele frequency can be estimated very well. The model provides reasonable estimation accuracy and precision for all genetic parameters under different sampling strategies, even for a modest heritability level. The model is powerful in detecting differences of genetic effects between two consecutive generations. More interestingly, the difference of the imprinting effect between generations, i.e., transgenerational inheritance of genetic imprinting, can be discerned with our statistical design.

Table 4-2. The maximum likelihood estimates (MLEs) of additive (a), dominant (d), and imprinting effects (i) of a functional SNP on a complex trait in parental (F) and offspring (O) generations under two different strategies. The estimates are the means of MLEs obtained from 200 simulation replicates, with standard errors given in parentheses.

Genetic     True     Strategy 1                Strategy 2
Parameter   Value    H^2 = 0.1    H^2 = 0.4    H^2 = 0.1    H^2 = 0.4

We conducted an additional simulation study in which the same genetic effects are assumed for the two generations. The model detects transgenerational differences in only a small proportion of simulation replicates. This suggests that the model has a small type I error rate for detecting the transgenerational difference of overall genetic effects. We specifically tested the type I error rate for the transgenerational difference of genetic imprinting, which is very small.
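The two sampling strategies used in the simulation are designed to give the same total sample size. The arithmetic can be checked with a short sketch, assuming one mother per second-generation family and that every son founds exactly one such family; the function name is hypothetical.

```python
def total_subjects(n_families, sons_per_family, children_per_family):
    """Total number of genotyped subjects in the three-generation design."""
    grandparents = 2 * n_families                 # one grandfather and one grandmother per family
    fathers = n_families * sons_per_family        # sons who found second-generation families
    mothers = fathers                             # one mother from the natural population per father
    grandchildren = fathers * children_per_family
    return grandparents + fathers + mothers + grandchildren


if __name__ == "__main__":
    print(total_subjects(200, 1, 1))  # strategy 1: many small families
    print(total_subjects(50, 3, 4))   # strategy 2: few large families
```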


Table 4-3 tabulates the results of simulation for different heritabilities and sample sizes. Overall, all parameters can be estimated reasonably well. As expected, the precision of parameter estimation increases with heritability and sample size. The additive genetic effects in both generations can be estimated well with a modest sample size (say 400) for a small heritability (0.1). A larger sample size (say 800) is needed to provide a good estimate of genetic imprinting effects for a small heritability. To estimate dominant genetic effects well, an even larger sample size (say 2000) is required for the same level of heritability.

Reik and Walter 2001; Wilkins and Haig 2003). This so-called genetic imprinting or parent-of-origin effect has been thought to play a pivotal role in regulating the phenotypic variation of a complex trait (Constancia et al. 2004; Isles and Wilkinson 2000; Itier et al. 1998; Li et al. 1999; Wilkinson et al. 2007). With the discovery of more imprinting genes involved in trait control through molecular and bioinformatics approaches, we will be in a position to elucidate the genetic architecture of quantitative variation for various organisms including humans.


Table 4-3. Simulation results for comparisons of transgenerational imprinting effects. The genetic design scenarios are combinations of different heritabilities and sample sizes: $H_1^2 = 0.1/0.4$, $H_2^2 = 0.1/0.4$, $n = 400, 800, 2000$.

First Generation Parameters   Second Generation Parameters

n     H1^2  H2^2
400   0.1   0.1   0.0561(0.0049)  0.9936(0.0185)  0.5228(0.0263)  0.3824(0.0195)  1.0182(0.0266)  1.4761(0.0405)  0.6592(0.0269)
      0.1   0.4   0.0450(0.0049)  0.9975(0.0162)  0.4690(0.0279)  0.4437(0.0184)  0.9860(0.0106)  1.4936(0.0174)  0.5937(0.0115)
      0.4   0.1   0.0639(0.0055)  1.0011(0.0082)  0.5032(0.0107)  0.4013(0.0082)  1.0210(0.0277)  1.5275(0.0383)  0.5631(0.0285)
      0.4   0.4   0.0681(0.0054)  1.0137(0.0075)  0.5072(0.0115)  0.3916(0.0083)  0.9939(0.0108)  0.4920(0.0144)  0.5975(0.0110)
800   0.1   0.1   0.0486(0.0044)  1.0003(0.0134)  0.4610(0.0183)  0.4162(0.0126)  1.0079(0.0194)  1.5219(0.0284)  0.6322(0.0175)
      0.1   0.4   0.0461(0.0040)  0.9956(0.0135)  0.4933(0.0180)  0.3796(0.0114)  1.0049(0.0075)  1.4978(0.0112)  0.5874(0.0072)
      0.4   0.1   0.0516(0.0041)  1.0047(0.0046)  0.5074(0.0077)  0.3915(0.0055)  0.9916(0.0083)  1.5027(0.0091)  0.6002(0.0073)
      0.4   0.4   0.0567(0.0036)  1.0011(0.0056)  0.5023(0.0080)  0.3976(0.0061)  0.9773(0.0077)  1.5069(0.0104)  0.5926(0.0083)
2000  0.1   0.1   0.0516(0.0032)  1.0109(0.0079)  0.4951(0.0116)  0.4059(0.0089)  1.0053(0.0122)  1.5095(0.0150)  0.5878(0.0119)
      0.1   0.4   0.0536(0.0029)  1.0078(0.0094)  0.5283(0.0107)  0.4038(0.0099)  1.0017(0.0042)  1.5011(0.0064)  0.5912(0.0053)
      0.4   0.1   0.0488(0.0027)  0.9996(0.0034)  0.5076(0.0044)  0.4064(0.0036)  0.9830(0.0115)  1.4997(0.0152)  0.6001(0.0138)
      0.4   0.4   0.0545(0.0028)  1.0009(0.0033)  0.5043(0.0047)  0.3986(0.0033)  0.9993(0.0050)  1.5012(0.0064)  0.5996(0.0048)


Chan 2005), ruling out underlying genetic events will be challenging. Furthermore, copy number variants (CNVs), which are not readily detectable by standard sequencing technology, must also be considered since they are now believed to be as prevalent as single nucleotide polymorphisms (Beckmann et al. 2007). And so, the task of correlating bona fide epimutations with disease, let alone trying to trace their inheritance across generations, remains a formidable one.

This study assumes a single sex (sons) produced from the first-generation family. One can also assume daughters with no change to the model. In fact, our model can allow the involvement of both sexes, so that sex-specific genetic effects can be characterized in the second generation. If the sexes in the third generation are considered, the model can be extended to study the transgenerational changes of gene-sex interactions. Although a basic premise of epigenetic processes was that, once established, these marks were maintained through rounds of mitotic cell division and stable for the life of the organism, several recent studies have shown that at some loci the epigenetic state can be altered by the environment (Jirtle and Skinner 2007). The questions are how common genes of this type are and how strong the evidence is for their existence in humans. The development of our design and model will help to address these biological questions of fundamental importance in elucidating the genetic architecture of complex traits.


Athree-generationfamilydesignusedtostudytransgenerationalinheritance.Therst-generationiscomposedofthegrandfatherandgrandmotherofninegenotypesfromanaturalpopulation,thesecondgenerationincludesthefather,i.e.,asonfromtherstgeneration,andthemothersampledfromthenaturalpopulation,andthethirdgenerationincludesamixofsonsanddaughtersfromthesecondgeneration.Thephenotypeismeasuredforthefatherinthesecondgenerationandchildreninthethirdgeneration.Adiplotypeisshownbyaverticalline. 21 23AABB(p211)AAbb(p210)14AABB(p211)AaBB(2p11p01)1 21 25AABB(p211)AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)1 2(1r)1 2r1 2r1 2(1r)1 2r1 2(1r)1 2(1r)1 2r6AABB(p211)Aabb(2p10p00)1 21 27AABB(p211)aaBB(p201)18AABB(p211)aaBb(2p01p00)1 21 29AABB(p211)aabb(p200)110AABb(2p11p10)AABB(p211)1 21 211AABb(2p11p10)AABb(2p11p10)1 41 41 41 412AABb(2p11p10)AAbb(p210)1 21 213AABb(2p11p10)AaBB(2p11p01)1 41 41 41 414AABb(2p11p10)AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)1 4(1r)1 4r1 4r1 4(1r)1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 4(1r)1 4r1 4r1 4(1r)1 4(1r)1 4r15AABb(2p11p10)Aabb(2p10p00)1 41 41 41 416AABb(2p11p10)aaBB(p201)1 21 217AABb(2p11p10)aaBb(2p01p00)1 41 41 41 418AABb(2p11p10)aabb(p200)1 21 219AAbb(p210)AABB(p211)120AAbb(p210)AABb(2p11p10)1 21 221AAbb(p210)AAbb(p210)122AAbb(p210)AaBB(2p11p01)1 21 223AAbb(p210)AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)1 2(1r)1 2r1 2r1 2(1r)1 2r1 2(1r)1 2(1r)1 2r


4-4 .Continued FirstGenerationSecondGeneration 21 225AAbb(p210)aaBB(p201)126AAbb(p210)aaBb(2p01p00)1 21 227AAbb(p210)aabb(p200)128AaBB(2p11p01)AABB(p211)1 21 229AaBB(2p11p01)AABb(2p11p10)1 41 41 41 430AaBB(2p11p01)AAbb(p210)1 21 231AaBB(2p11p01)AaBB(2p11p01)1 41 41 41 432AaBB(2p11p01)AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 4(1r)1 4r1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 4(1r)1 4r33AaBB(2p11p01)Aabb(2p10p00)1 41 41 41 434AaBB(2p11p01)aaBB(p201)1 21 235AaBB(2p11p01)aaBb(2p01p00)1 41 41 41 436AaBB(2p11p01)aabb(p200)1 21 237AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)AABB(p211)1 2(1r)1 2r1 2r1 2(1r)1 2r1 2(1r)1 2(1r)1 2r38AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)AABb(2p11p10)1 4(1r)1 4r1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 4r1 4(1r)1 4r1 4(1r)1 4(1r)1 4r1 4(1r)1 4r39AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)AAbb(p210)1 2(1r)1 2r1 2r1 2(1r)1 2r1 2(1r)1 2(1r)1 2r40AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)AaBB(2p11p01)1 4(1r)1 4r1 4r1 4(1r)1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 4(1r)1 4r1 4r1 4(1r)1 4(1r)1 4r41AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)1 4(1r)21 4r21 4(1r)r1 4r(1r)1 4r(1r)1 4(1r)r1 4r21 4(1r)21 4(1r)r1 4r(1r)1 4r(1r)1 4(1r)r1 4(1r)21 4r21 4r21 4(1r)21 4r21 4(1r)21 4(1r)r1 4r(1r)1 4r(1r)1 4(1r)r1 4(1r)r1 4r(1r)1 4r21 4(1r)21 4r(1r)1 4(1r)r1 4(1r)r1 4r(1r)1 4(1r)21 4r242AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)Aabb(2p10p00)1 4(1r)1 4r1 4r1 4(1r)1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 4(1r)1 4r1 4r1 4(1r)1 4(1r)1 4r43AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)aaBB(p201)1 2(1r)1 2r1 2r1 2(1r)1 2r1 2(1r)1 2(1r)1 2r44AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)aaBb(2p01p00)1 4(1r)1 4r1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 4r1 4(1r)1 4r1 4(1r)1 4(1r)1 4r1 4(1r)1 4r45AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)aabb(p200)1 2(1r)1 2r1 2r1 2(1r)1 2r1 2(1r)1 2(1r)1 2r46Aabb(2p10p00)AABB(p211)1 21 247Aabb(2p10p00)AABb(2p11p10)1 41 41 41 448Aabb(2p10p00)AAbb(p210)1 21 249Aabb(2p10p00)AaBB(2p11p01)1 41 41 41 450Aabb(2p10p00)AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 
4(1r)1 4r1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 4(1r)1 4r


4-4 .Continued FirstGenerationSecondGeneration 41 41 41 452Aabb(2p10p00)aaBB(p201)1 21 253Aabb(2p10p00)aaBb(2p01p00)1 41 41 41 454Aabb(2p10p00)aabb(p200)1 21 255aaBB(p201)AABB(p211)156aaBB(p201)AABb(2p11p10)1 21 257aaBB(p201)AAbb(p210)158aaBB(p201)AaBB(2p11p01)1 21 259aaBB(p201)AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)1 2(1r)1 2r1 2r1 2(1r)1 2r1 2(1r)1 2(1r)1 2r60aaBB(p201)Aabb(2p10p00)1 21 261aaBB(p201)aaBB(p201)162aaBB(p201)aaBb(2p01p00)1 21 263aaBB(p201)aabb(p200)164aaBb(2p01p00)AABB(p211)1 21 265aaBb(2p01p00)AABb(2p11p10)1 41 41 41 466aaBb(2p01p00)AAbb(p210)1 21 267aaBb(2p01p00)AaBB(2p11p01)1 41 41 41 468aaBb(2p01p00)AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)1 4(1r)1 4r1 4r1 4(1r)1 4(1r)1 4r1 4r1 4(1r)1 4r1 4(1r)1 4(1r)1 4r1 4r1 4(1r)1 4(1r)1 4r69aaBb(2p01p00)Aabb(2p10p00)1 41 41 41 470aaBb(2p01p00)aaBB(p201)1 21 271aaBb(2p01p00)aaBb(2p01p00)1 41 41 41 472aaBb(2p01p00)aabb(p200)1 21 273aabb(p200)AABB(p211)174aabb(p200)AABb(2p11p10)1 21 2


4-4 .Continued FirstGenerationSecondGeneration 21 277aabb(p200)AaBb8<:ABjab(2p11p00)AbjaB(2p10p01)1 2(1r)1 2r1 2r1 2(1r)1 2r1 2(1r)1 2(1r)1 2r78aabb(p200)Aabb(2p10p00)1 21 279aabb(p200)aaBB(p201)180aabb(p200)aaBb(2p01p00)1 21 281aabb(p200)aabb(p200)1


Balman et al. 2003; Rand et al. 2008). However, growing evidence shows that most cancer is the result of an intricate interaction of low-penetrance genetic variants with environmental exposures that humans experience (Brennan 2002). These low-penetrance cancer genes, each usually with a minor effect and cooperating with others in a complicated web, are difficult to detect and, therefore, their contribution to the risk of cancer development remains unclear. There is a pressing demand for the development of powerful statistical models and computational algorithms for identifying and mapping specific DNA sequence variants that regulate cancer susceptibility.

Human cancer cells frequently possess large-scale chromosomal rearrangements due to chromosomal instability (CIN) (Stock and Bialy 2003; Thompson and Compton 2008) or gene mutation (Greenman et al. 2007; Jallepalli and Lengauer 2001). CIN causes whole chromosomes or large fractions of chromosomes to be gained or lost during cell division, resulting in an imbalance in the number of chromosomes per cell (aneuploidy) and an enhanced rate of loss of heterozygosity. Thus, the "aneuploidy hypothesis of cancer" (Stock and Bialy 2003) proposes that the main differences between normal and abnormal (cancer) cells result from the number of genes rather than the types of genes differentially expressed, as opposed to the "gene-mutation hypothesis" (Jallepalli and Lengauer 2001). In general, cancer incidence and development are affected not only by the host genes, but also by genes derived from the cancer cells themselves. Given strong mechanistic interactions between the host and cancer tissues (Araujo and McElwain 2006), these two different systems of genes operate interactively or epistatically to alter the course of cancer


Genetic mapping has proven to be a powerful approach for detecting quantitative trait loci (QTLs) for complex traits. But a QTL may contain multiple genes that operate in a collective way. It is not possible to study the DNA structure, organization, and function of a QTL detected from a mapping approach. A more accurate and useful approach for characterizing the genetic variants that contribute to quantitative variation is to directly analyze DNA sequences, known as haplotypes, associated with a particular disease (Lin and Wu 2006; Liu et al. 2004). If a string of DNA sequence is known to increase disease risk, this risk can be prevented by inhibiting the expression of this string using a specialized drug. The control of this disease can be made more efficient if all possible DNA sequences determining its variation are identified in the entire genome. The elucidation of the entire human genome has been accelerated by the haplotype map, or HapMap, constructed from SNPs (The International HapMap Consortium 2003). More recently, the plans for sequencing the cancer genomes (Kaiser 2005) will provide unprecedented fuel for studying the genetic architecture of cancer risk.

In this article, we will derive a statistical model for detecting the actions and interactions of haplotypes derived from the host and cancer genomes for cancer susceptibility. We will incorporate the "gene-mutation hypothesis" into the model. The "aneuploidy hypothesis of cancer" will be considered in a subsequent paper. Through the release of software to the public, our statistical model will serve as a routine means for the genetic diagnosis of cancer risk. Results from the model will provide scientific guidance for clinical doctors to design an optimal treatment scheme in terms of cancer genes and the patient's genes.


5.2.1 Sampling Strategies

Daly et al. 2001; Gabriel et al. 2002; Patil et al. 2001). Each block may have a minimal subset of SNPs, i.e., "tagging" SNPs, that can characterize the most common haplotypes. Our model will be based on tagging SNPs within each haplotype block. Although no detailed information about the structure of the cancer genome is available, we can assume that a particular set of SNPs may contribute to cancer formation at the haplotype level. The tenet of our epistatic model is that the effect of a given DNA sequence in the host genome on cancer is masked or enhanced by one or more sequences in the cancer genome.

Population Genetic Model: Consider a set of $R$ tagging SNPs from a haplotype block of the host genome and a set of $S$ SNPs from the cancer genome. We denote the two alleles of SNP $r$ from the host genome by $H^r_{k_r}$ ($k_r = 1, 0$; $r = 1, \ldots, R$) and the two alleles of SNP $s$ from the cancer genome by $C^s_{l_s}$ ($l_s = 1, 0$; $s = 1, \ldots, S$). Let $p^H_{k_r}$ and $p^C_{l_s}$ denote the allele frequencies at the corresponding SNP from the host and cancer genomes, respectively. All the SNPs considered from the host and cancer genomes form $2^{R+S}$ possible joint haplotypes expressed as $(H^1_{k_1} H^2_{k_2} \cdots H^R_{k_R})(C^1_{l_1} C^2_{l_2} \cdots C^S_{l_S})$. The corresponding haplotype frequencies are denoted by $p_{(k_1 k_2 \cdots k_R)(l_1 l_2 \cdots l_S)}$, which are composed of allele frequencies at each SNP and linkage disequilibria of different orders among SNPs within and between the genomes (Wu and Lin 2008). A general expression for the relationships between haplotype


5-1 lists the compositions of the frequency of a haplotype constructed jointly by two SNPs from the host genome and two SNPs from the cancer genome, in which linkage disequilibria are specified among two, three, and four sites.

Table 5-1. Disequilibrium compositions of four-SNP haplotype frequencies derived from the host and cancer genomes.

Term   Composition                                                     Remark
(1)    $p^H_{k_1} p^H_{k_2} p^C_{l_1} p^C_{l_2}$                        No LD
(2)    $(-1)^2(-1)^{k_1+k_2} p^C_{l_1} p^C_{l_2} D_{H_1H_2}$            Digenic LD within the host genome (H)
(3)    $(-1)^2(-1)^{l_1+l_2} p^H_{k_1} p^H_{k_2} D_{C_1C_2}$            Digenic LD within the cancer genome (C)
(4)    $(-1)^2(-1)^{k_2+l_1} p^H_{k_1} p^C_{l_2} D_{H_2C_1}$            Digenic LD between SNP 2 of H and SNP 1 of C
(5)    $(-1)^2(-1)^{k_2+l_2} p^H_{k_1} p^C_{l_1} D_{H_2C_2}$            Digenic LD between SNP 2 of H and SNP 2 of C
(6)    $(-1)^2(-1)^{k_1+l_1} p^H_{k_2} p^C_{l_2} D_{H_1C_1}$            Digenic LD between SNP 1 of H and SNP 1 of C
(7)    $(-1)^2(-1)^{k_1+l_2} p^H_{k_2} p^C_{l_1} D_{H_1C_2}$            Digenic LD between SNP 1 of H and SNP 2 of C
(8)    $(-1)^3(-1)^{k_1+k_2+l_1} p^C_{l_2} D_{H_1H_2C_1}$               Trigenic LD between H and SNP 1 of C
(9)    $(-1)^3(-1)^{k_1+k_2+l_2} p^C_{l_1} D_{H_1H_2C_2}$               Trigenic LD between H and SNP 2 of C
(10)   $(-1)^3(-1)^{k_1+l_1+l_2} p^H_{k_2} D_{H_1C_1C_2}$               Trigenic LD between SNP 1 of H and C
(11)   $(-1)^3(-1)^{k_2+l_1+l_2} p^H_{k_1} D_{H_2C_1C_2}$               Trigenic LD between SNP 2 of H and C
(12)   $(-1)^4(-1)^{k_1+k_2+l_1+l_2} D_{H_1H_2C_1C_2}$                  Quadrigenic LD between H and C
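The twelve terms of Table 5-1 sum to the frequency of one four-SNP haplotype. The sketch below assembles that sum for given allele frequencies and disequilibrium coefficients; the function name, the argument layout, and the dictionary keys such as `"H1H2"` are hypothetical conveniences, and the numerical values in the demo are invented for illustration only.

```python
from itertools import product


def haplotype_frequency(k1, k2, l1, l2, pH, pC, D):
    """Frequency of haplotype (k1 k2)(l1 l2), with alleles coded 1 or 0.

    pH, pC : frequencies of allele 1 at the two host and two cancer SNPs
    D      : dict of linkage disequilibrium coefficients keyed by the SNPs involved
    """
    def p(freq_of_1, allele):
        return freq_of_1 if allele == 1 else 1.0 - freq_of_1

    a, b = p(pH[0], k1), p(pH[1], k2)
    c, d = p(pC[0], l1), p(pC[1], l2)

    def sign(order, *alleles):
        return (-1) ** order * (-1) ** sum(alleles)

    terms = [
        a * b * c * d,                                   # (1) no LD
        sign(2, k1, k2) * c * d * D["H1H2"],             # (2)
        sign(2, l1, l2) * a * b * D["C1C2"],             # (3)
        sign(2, k2, l1) * a * d * D["H2C1"],             # (4)
        sign(2, k2, l2) * a * c * D["H2C2"],             # (5)
        sign(2, k1, l1) * b * d * D["H1C1"],             # (6)
        sign(2, k1, l2) * b * c * D["H1C2"],             # (7)
        sign(3, k1, k2, l1) * d * D["H1H2C1"],           # (8)
        sign(3, k1, k2, l2) * c * D["H1H2C2"],           # (9)
        sign(3, k1, l1, l2) * b * D["H1C1C2"],           # (10)
        sign(3, k2, l1, l2) * a * D["H2C1C2"],           # (11)
        sign(4, k1, k2, l1, l2) * D["H1H2C1C2"],         # (12)
    ]
    return sum(terms)


if __name__ == "__main__":
    D = {key: 0.0 for key in ("H1H2", "C1C2", "H2C1", "H2C2", "H1C1", "H1C2",
                              "H1H2C1", "H1H2C2", "H1C1C2", "H2C1C2", "H1H2C1C2")}
    D["H1H2"] = 0.05
    D["H1C1"] = 0.02
    total = sum(haplotype_frequency(k1, k2, l1, l2, (0.6, 0.5), (0.7, 0.4), D)
                for k1, k2, l1, l2 in product((1, 0), repeat=4))
    print(total)
```

Because every disequilibrium term carries an alternating sign in the alleles it involves, the sixteen haplotype frequencies sum to one and the single-SNP marginals reproduce the allele frequencies.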




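$$
\begin{aligned}
(-1)^4(-1)^{k_1+k_2+l_1+l_2} D_{H_1H_2C_1C_2} = {} & p_{(k_1k_2)(l_1l_2)} - p^H_{k_1} p^H_{k_2} p^C_{l_1} p^C_{l_2} \\
& - (-1)^2(-1)^{k_1+k_2} p^C_{l_1} p^C_{l_2} D_{H_1H_2} - (-1)^2(-1)^{l_1+l_2} p^H_{k_1} p^H_{k_2} D_{C_1C_2} \\
& - (-1)^2(-1)^{k_2+l_1} p^H_{k_1} p^C_{l_2} D_{H_2C_1} - (-1)^2(-1)^{k_2+l_2} p^H_{k_1} p^C_{l_1} D_{H_2C_2} \\
& - (-1)^2(-1)^{k_1+l_1} p^H_{k_2} p^C_{l_2} D_{H_1C_1} - (-1)^2(-1)^{k_1+l_2} p^H_{k_2} p^C_{l_1} D_{H_1C_2} \\
& - (-1)^3(-1)^{k_1+k_2+l_1} p^C_{l_2} D_{H_1H_2C_1} - (-1)^3(-1)^{k_1+k_2+l_2} p^C_{l_1} D_{H_1H_2C_2} \\
& - (-1)^3(-1)^{k_1+l_1+l_2} p^H_{k_2} D_{H_1C_1C_2} - (-1)^3(-1)^{k_2+l_1+l_2} p^H_{k_1} D_{H_2C_1C_2}
\end{aligned}
$$

The random combination of maternal and paternal haplotypes generates $2^{R+S-1}(2^{R+S}+1)$ diplotypes expressed as $(H^1_{k_1} H^2_{k_2} \cdots H^R_{k_R})(C^1_{l_1} C^2_{l_2} \cdots C^S_{l_S}) \mid (H^1_{k'_1} H^2_{k'_2} \cdots H^R_{k'_R})(C^1_{l'_1} C^2_{l'_2} \cdots C^S_{l'_S})$ ($k_1 \ge k'_1, k_2 \ge k'_2, \ldots, k_R \ge k'_R = 1, 0$; $l_1 \ge l'_1, l_2 \ge l'_2, \ldots, l_S \ge l'_S = 1, 0$). We use a vertical line to separate the two haplotypes derived from the maternal and paternal parents, respectively, for a given diplotype. Under the HWE assumption, diplotype frequencies are expressed as the products of the frequencies of the two haplotypes that constitute the diplotype, i.e.,

$$
p_{(k_1 k_2 \cdots k_R)(l_1 l_2 \cdots l_S) \mid (k'_1 k'_2 \cdots k'_R)(l'_1 l'_2 \cdots l'_S)} =
\begin{cases}
p^2_{(k_1 k_2 \cdots k_R)(l_1 l_2 \cdots l_S)} & k_1 = k'_1, \ldots, k_R = k'_R;\ l_1 = l'_1, \ldots, l_S = l'_S \\
2\, p_{(k_1 k_2 \cdots k_R)(l_1 l_2 \cdots l_S)}\, p_{(k'_1 k'_2 \cdots k'_R)(l'_1 l'_2 \cdots l'_S)} & \text{otherwise.}
\end{cases}
$$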
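The diplotype count and the Hardy-Weinberg diplotype frequencies can be checked with a short sketch. It treats a diplotype as an unordered pair of joint haplotypes, which is what the count $2^{R+S-1}(2^{R+S}+1)$ implies; the function name and the uniform haplotype frequencies in the demo are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations_with_replacement, product


def diplotype_frequencies(hap_freq):
    """Hardy-Weinberg frequencies of unordered haplotype pairs.

    hap_freq maps each joint haplotype (any hashable, sortable label) to its frequency.
    """
    out = {}
    for h1, h2 in combinations_with_replacement(sorted(hap_freq), 2):
        if h1 == h2:
            out[(h1, h2)] = hap_freq[h1] ** 2            # homozygous diplotype
        else:
            out[(h1, h2)] = 2 * hap_freq[h1] * hap_freq[h2]  # heterozygous diplotype
    return out


if __name__ == "__main__":
    R, S = 2, 2
    haplotypes = ["".join(bits) for bits in product("10", repeat=R + S)]
    freq = {h: Fraction(1, len(haplotypes)) for h in haplotypes}
    dip = diplotype_frequencies(freq)
    print(len(dip), 2 ** (R + S - 1) * (2 ** (R + S) + 1), sum(dip.values()))
```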


Bader 2001; Judson et al. 2000; Rha et al. 2007). Among all possible haplotypes there are some particular haplotypes, called the risk haplotype ($A$), that perform differently from the rest of the haplotypes, called the non-risk haplotype ($\bar{A}$). The combinations between the risk and non-risk haplotypes, $AA$, $A\bar{A}$, and $\bar{A}\bar{A}$, are called composite diplotypes (Liu et al. 2004; Wu and Lin 2008). Thus, by testing the differences in the genotypic value of a trait among composite diplotypes, we can estimate the genetic effects of haplotypes on the trait. It is also feasible to detect epistatic interactions between haplotypes from different genomes.

Let $A$, $\bar{A}$ and $B$, $\bar{B}$ denote the risk haplotypes and non-risk haplotypes for a series of SNPs genotyped from the host and cancer genomes, respectively. These two genomes form nine different composite diplotypes expressed as $AABB$, $AAB\bar{B}$, $AA\bar{B}\bar{B}$, $A\bar{A}BB$, $A\bar{A}B\bar{B}$, $A\bar{A}\bar{B}\bar{B}$, $\bar{A}\bar{A}BB$, $\bar{A}\bar{A}B\bar{B}$, and $\bar{A}\bar{A}\bar{B}\bar{B}$. We will use Mather and Jinks's (1982) formulation for genetic epistasis between different loci (Table 5-2) to model the genetic effects of the composite diplotypes. The genotypic value ($\mu_{j_1 j_2}$) of a joint composite diplotype from the two genomes can be decomposed into nine different components as follows:


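Table 5-2. Additive, dominance, and epistatic compositions of the genotypic value of a composite diplotype constructed with haplotypes from the host and cancer genomes.

                   $BB$                                              $B\bar{B}$                                             $\bar{B}\bar{B}$
$AA$               $\mu_{AABB} = \mu + a_1 + a_2 + i_{aa}$            $\mu_{AAB\bar{B}} = \mu + a_1 + d_2 + i_{ad}$            $\mu_{AA\bar{B}\bar{B}} = \mu + a_1 - a_2 - i_{aa}$
$A\bar{A}$         $\mu_{A\bar{A}BB} = \mu + d_1 + a_2 + i_{da}$      $\mu_{A\bar{A}B\bar{B}} = \mu + d_1 + d_2 + i_{dd}$      $\mu_{A\bar{A}\bar{B}\bar{B}} = \mu + d_1 - a_2 - i_{da}$
$\bar{A}\bar{A}$   $\mu_{\bar{A}\bar{A}BB} = \mu - a_1 + a_2 - i_{aa}$ $\mu_{\bar{A}\bar{A}B\bar{B}} = \mu - a_1 + d_2 - i_{ad}$ $\mu_{\bar{A}\bar{A}\bar{B}\bar{B}} = \mu - a_1 - a_2 + i_{aa}$

5-2).

Different types of genetic actions and interactions can be expressed in terms of genotypic values by solving a group of regular equations (5-12). This lets us describe the overall mean, additive, dominance, and four kinds of epistatic effects between the two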


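$$
\begin{aligned}
\mu &= \tfrac{1}{4}\left(\mu_{AABB} + \mu_{AA\bar{B}\bar{B}} + \mu_{\bar{A}\bar{A}BB} + \mu_{\bar{A}\bar{A}\bar{B}\bar{B}}\right) \\
a_H &= \tfrac{1}{4}\left(\mu_{AABB} - \mu_{\bar{A}\bar{A}BB} + \mu_{AA\bar{B}\bar{B}} - \mu_{\bar{A}\bar{A}\bar{B}\bar{B}}\right) \\
a_C &= \tfrac{1}{4}\left(\mu_{AABB} - \mu_{AA\bar{B}\bar{B}} - \mu_{\bar{A}\bar{A}\bar{B}\bar{B}} + \mu_{\bar{A}\bar{A}BB}\right) \\
d_H &= \tfrac{1}{4}\left(2\mu_{A\bar{A}BB} - \mu_{AABB} - \mu_{AA\bar{B}\bar{B}} - \mu_{\bar{A}\bar{A}BB} - \mu_{\bar{A}\bar{A}\bar{B}\bar{B}} + 2\mu_{A\bar{A}\bar{B}\bar{B}}\right) \\
d_C &= \tfrac{1}{4}\left(2\mu_{AAB\bar{B}} - \mu_{AABB} - \mu_{AA\bar{B}\bar{B}} - \mu_{\bar{A}\bar{A}BB} - \mu_{\bar{A}\bar{A}\bar{B}\bar{B}} + 2\mu_{\bar{A}\bar{A}B\bar{B}}\right) \\
i_{aa} &= \tfrac{1}{4}\left(\mu_{AABB} - \mu_{AA\bar{B}\bar{B}} - \mu_{\bar{A}\bar{A}BB} + \mu_{\bar{A}\bar{A}\bar{B}\bar{B}}\right) \\
i_{ad} &= \tfrac{1}{4}\left(2\mu_{AAB\bar{B}} - \mu_{AABB} - 2\mu_{\bar{A}\bar{A}B\bar{B}} + \mu_{\bar{A}\bar{A}BB} - \mu_{AA\bar{B}\bar{B}} + \mu_{\bar{A}\bar{A}\bar{B}\bar{B}}\right) \\
i_{da} &= \tfrac{1}{4}\left(2\mu_{A\bar{A}BB} - 2\mu_{A\bar{A}\bar{B}\bar{B}} + \mu_{AA\bar{B}\bar{B}} + \mu_{\bar{A}\bar{A}\bar{B}\bar{B}} - \mu_{AABB} - \mu_{\bar{A}\bar{A}BB}\right) \\
i_{dd} &= \tfrac{1}{4}\left(4\mu_{A\bar{A}B\bar{B}} + \mu_{AABB} + \mu_{AA\bar{B}\bar{B}} + \mu_{\bar{A}\bar{A}BB} + \mu_{\bar{A}\bar{A}\bar{B}\bar{B}} - 2\mu_{AAB\bar{B}} - 2\mu_{A\bar{A}BB} - 2\mu_{A\bar{A}\bar{B}\bar{B}} - 2\mu_{\bar{A}\bar{A}B\bar{B}}\right).
\end{aligned}
$$

Thus, by testing the significance of $i_{aa}$, $i_{ad}$, $i_{da}$, and $i_{dd}$, we can judge whether there is epistasis and how the epistasis affects a phenotypic trait.

$$\log L(\Omega_p, \Omega_q \mid y, H, C) = \log L(\Omega_p \mid H, C) + \log L(\Omega_q \mid y, H, C, \Omega_p),$$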
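The mapping between the nine genotypic values of the composite diplotypes and the nine genetic parameters can be verified by a round trip: build the genotypic values from chosen effects, then recover the effects with the contrasts given above. In the sketch below, composite diplotypes are coded by the number of risk haplotypes carried in the host and cancer genomes (2, 1, or 0); this coding and the function names are hypothetical conveniences.

```python
from fractions import Fraction


def genotypic_values(mu, aH, aC, dH, dC, iaa, iad, ida, idd):
    """Genotypic values keyed by (host, cancer) counts of risk haplotypes."""
    return {
        (2, 2): mu + aH + aC + iaa, (2, 1): mu + aH + dC + iad, (2, 0): mu + aH - aC - iaa,
        (1, 2): mu + dH + aC + ida, (1, 1): mu + dH + dC + idd, (1, 0): mu + dH - aC - ida,
        (0, 2): mu - aH + aC - iaa, (0, 1): mu - aH + dC - iad, (0, 0): mu - aH - aC + iaa,
    }


def effects(m):
    """Recover the overall mean and genetic effects from the nine genotypic values."""
    corners = m[2, 2] + m[2, 0] + m[0, 2] + m[0, 0]
    return {
        "mu": corners / 4,
        "aH": (m[2, 2] - m[0, 2] + m[2, 0] - m[0, 0]) / 4,
        "aC": (m[2, 2] - m[2, 0] + m[0, 2] - m[0, 0]) / 4,
        "dH": (2 * m[1, 2] + 2 * m[1, 0] - corners) / 4,
        "dC": (2 * m[2, 1] + 2 * m[0, 1] - corners) / 4,
        "iaa": (m[2, 2] - m[2, 0] - m[0, 2] + m[0, 0]) / 4,
        "iad": (2 * m[2, 1] - 2 * m[0, 1] - m[2, 2] - m[2, 0] + m[0, 2] + m[0, 0]) / 4,
        "ida": (2 * m[1, 2] - 2 * m[1, 0] - m[2, 2] + m[2, 0] - m[0, 2] + m[0, 0]) / 4,
        "idd": (4 * m[1, 1] + corners
                - 2 * m[2, 1] - 2 * m[1, 2] - 2 * m[1, 0] - 2 * m[0, 1]) / 4,
    }


if __name__ == "__main__":
    true = dict(mu=Fraction(10), aH=Fraction(1), aC=Fraction(2), dH=Fraction(1, 2),
                dC=Fraction(3, 2), iaa=Fraction(1, 4), iad=Fraction(1, 5),
                ida=Fraction(3, 10), idd=Fraction(2, 5))
    print(effects(genotypic_values(**true)) == true)
```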


5-14) is equivalent to maximizing its two components separately.

5-3 is an example of the data structure for genotypic observations of two SNPs derived from the host genome and two SNPs from the cancer genome. The table also provides the expected frequencies of different genotypes in terms of haplotype frequencies.

Based on the information about observed data, it is not difficult to construct a multinomial likelihood, $\log L(\Omega_p \mid H, C)$, in which a mixture model is incorporated for those genotypes that are heterozygous at two or more SNPs. By maximizing the observed data likelihood, the EM algorithm is derived. In the E step, we calculate the expected number of a particular across-genome haplotype $(H^1_{k_1} H^2_{k_2} \cdots H^R_{k_R})(C^1_{l_1} C^2_{l_2} \cdots C^S_{l_S})$ within the mixture of diplotypes that form the same genotypes. For example, such an expected number is calculated for two SNPs from the host genome and two SNPs from the cancer


Observed81jointhost-cancerSNPgenotypesandtheirfrequenciesdescribedintermsoftheirhaplotype/diplotypecompositions. Genotype No.HostCancerObservationFrequency 1H11H11=H21H21C11C11=C21C21n(11=11)(11=11)p2(11)(11)2H11H11=H21H21C11C11=C21C20n(11=11)(11=10)2p(11)(11)p(11)(10)3H11H11=H21H21C11C11=C20C20n(11=11)(11=00)p2(11)(10)4H11H11=H21H21C11C10=C21C21n(11=11)(10=11)2p(11)(11)p(11)(01)5H11H11=H21H21C11C10=C21C20n(11=11)(10=10)2p(11)(11)p(11)(00)+2p(11)(10)p(11)(01)6H11H11=H21H21C11C10=C20C20n(11=11)(10=00)2p(11)(10)p(11)(00)7H11H11=H21H21C10C10=C21C21n(11=11)(00=11)p2(11)(01)8H11H11=H21H21C10C10=C21C20n(11=11)(00=10)2p(11)(01)p(11)(00)9H11H11=H21H21C10C10=C20C20n(11=11)(00=00)p2(11)(00)10H11H11=H21H20C11C11=C21C21n(11=10)(11=11)2p(11)(11)p(10)(11)11H11H11=H21H20C11C11=C21C20n(11=10)(11=10)2p(11)(11)p(10)(10)+2p(10)(11)p(11)(10)12H11H11=H21H20C11C11=C20C20n(11=10)(11=00)2p(11)(10)p(10)(10)13H11H11=H21H20C11C10=C21C21n(11=10)(10=11)2p(11)(11)p(10)(01)+2p(10)(11)p(11)(01)14H11H11=H21H20C11C10=C21C20n(11=10)(10=10)2p(11)(11)p(10)(00)+2p(11)(10)p(10)(01)+2p(10)(11)p(10)(00)+2p(10)(10)p(11)(01)15H11H11=H21H20C11C10=C20C20n(11=10)(10=00)2p(11)(10)p(10)(00)+2p(11)(00)p(10)(10)16H11H11=H21H20C10C10=C21C21n(11=10)(00=11)2p(11)(01)p(10)(01)17H11H11=H21H20C10C10=C21C20n(11=10)(00=10)2p(11)(01)p(10)(00)+2p(11)(00)p(10)(01)18H11H11=H21H20C10C10=C20C20n(11=10)(00=00)2p(11)(00)p(10)(00)19H11H11=H20H20C11C11=C21C21n(11=00)(11=11)p2(10)(11)20H11H11=H20H20C11C11=C21C20n(11=00)(11=10)2p(10)(11)p(10)(10)


5-3 .Continued 21H11H11=H20H20C11C11=C20C20n(11=00)(11=00)p2(10)(10)22H11H11=H20H20C11C10=C21C21n(11=00)(10=11)2p(10)(11)p(10)(01)23H11H11=H20H20C11C10=C21C20n(11=00)(10=10)2p(10)(11)p(10)(00)+2p(10)(10)p(10)(01)24H11H11=H20H20C11C10=C20C20n(11=00)(10=00)2p(10)(10)p(10)(00)25H11H11=H20H20C10C10=C21C21n(11=00)(00=11)p2(10)(01)26H11H11=H20H20C10C10=C21C20n(11=00)(00=10)2p(10)(01)p(10)(00)27H11H11=H20H20C10C10=C20C20n(11=00)(00=00)2p2(10)(00)28H11H10=H21H21C11C11=C21C21n(10=11)(11=11)2p(11)(11)p(01)(11)29H11H10=H21H21C11C11=C21C20n(10=11)(11=10)2p(11)(11)p(01)(10)+2p(01)(11)p(11)(10)30H11H10=H21H21C11C11=C20C20n(10=11)(11=00)2p(11)(10)p(01)(10)31H11H10=H21H21C11C10=C21C21n(10=11)(10=11)2p(11)(11)p(01)(01)+2p(01)(11)p(11)(10)32H11H10=H21H21C11C10=C21C20n(10=11)(10=10)2p(11)(11)p(01)(00)+2p(11)(10)p(01)(01)+2p(01)(11)p(11)(00)+2p(01)(10)p(11)(01)33H11H10=H21H21C11C10=C20C20n(10=11)(10=00)2p(11)(10)p(00)(10)+2p(10)(10)p(01)(10)34H11H10=H21H21C10C10=C21C21n(10=11)(00=11)2p(11)(01)p(01)(01)35H11H10=H21H21C10C10=C21C20n(10=11)(00=10)2p(11)(01)p(01)(00)+2p(01)(01)p(11)(00)36H11H10=H21H21C10C10=C20C20n(10=11)(00=00)2p(11)(00)p(01)(00)37H11H10=H21H20C11C11=C21C21n(10=10)(11=11)2p(11)(11)p(00)(11)+2p(10)(11)p(01)(11)38H11H10=H21H20C11C11=C21C20n(10=10)(11=10)2p(11)(11)p(00)(10)+2p(10)(11)p(01)(10)+2p(11)(10)p(00)(11)+2p(10)(10)p(01)(11)39H11H10=H21H20C11C11=C20C20n(10=10)(11=00)2p(11)(10)p(00)(10)+2p(10)(10)p(01)(10)40H11H10=H21H20C11C10=C21C21n(10=10)(10=11)2p(11)(11)p(00)(01)+2p(10)(11)p(01)(01)+2p(11)(01)p(00)(11)+2p(10)(01)p(01)(11)


5-3 .Continued 41H11H10=H21H20C11C10=C21C20n(10=10)(10=10)2p(11)(11)p(00)(00)+2p(11)(10)p(00)(01)+2p(10)(10)p(01)(01)+2p(10)(10)p(01)(01)+2p(01)(11)p(10)(00)+2p(01)(10)p(10)(01)+2p(00)(10)p(11)(01)+2p(00)(10)p(11)(01)42H11H10=H21H20C11C10=C20C20n(10=10)(10=00)2p(11)(10)p(00)(00)+2p(10)(10)p(01)(00)+2p(00)(10)p(11)(00)+2p(01)(10)p(10)(00)43H11H10=H21H20C10C10=C21C21n(10=10)(00=11)2p(11)(01)p(00)(01)+2p(10)(01)p(01)(01)44H11H10=H21H20C10C10=C21C20n(10=10)(00=10)2p(11)(01)p(00)(00)+2p(11)(00)p(00)(01)+2p(10)(01)p(01)(00)+2p(10)(00)p(01)(01)45H11H10=H21H20C10C10=C20C20n(10=10)(00=00)2p(11)(00)p(00)(00)+2p(10)(00)p(01)(00)46H11H10=H20H20C11C11=C21C21n(10=00)(11=11)2p(10)(11)p(00)(11)47H11H10=H20H20C11C11=C21C20n(10=00)(11=10)2p(10)(11)p(00)(10)+2p(00)(11)p(10)(10)48H11H10=H20H20C11C11=C20C20n(10=00)(11=00)2p(10)(10)p(00)(10)49H11H10=H20H20C11C10=C21C21n(10=00)(10=11)2p(10)(11)p(00)(01)+2p(00)(11)p(10)(01)50H11H10=H20H20C11C10=C21C20n(10=00)(10=10)2p(10)(11)p(00)(00)+2p(10)(10)p(00)(01)+2p(00)(11)p(10)(00)+2p(10)(10)p(00)(01)51H11H10=H20H20C11C10=C20C20n(10=00)(10=00)2p(10)(10)p(00)(00)+2p(10)(00)p(00)(10)52H11H10=H20H20C10C10=C21C21n(10=00)(00=11)2p(10)(01)p(00)(01)53H11H10=H20H20C10C10=C21C20n(10=00)(00=10)2p(10)(01)p(00)(00)+2p(10)(00)p(00)(01)54H11H10=H20H20C10C10=C20C20n(10=00)(00=00)2p(10)(00)p(00)(00)55H10H10=H21H21C11C11=C21C21n(00=11)(11=11)p2(01)(11)56H10H10=H21H21C11C11=C21C20n(00=11)(11=10)2p(01)(11)p(01)(10)57H10H10=H21H21C11C11=C20C20n(00=11)(11=00)p2(01)(10)58H10H10=H21H21C11C10=C21C21n(00=11)(10=11)2p(01)(11)p(01)(01)59H10H10=H21H21C11C10=C21C20n(00=11)(10=10)2p(01)(11)p(01)(00)+p(01)(10)p(01)(01)


5-3 .Continued 60H10H10=H21H21C11C10=C20C20n(00=11)(10=00)2p(01)(10)p(01)(00)61H10H10=H21H21C10C10=C21C21n(00=11)(00=11)p2(01)(01)62H10H10=H21H21C10C10=C21C20n(00=11)(00=10)2p(01)(01)p(01)(00)63H10H10=H21H21C10C10=C20C20n(00=11)(00=00)p2(01)(00)64H10H10=H21H20C11C11=C21C21n(00=10)(11=11)2p(01)(11)p(00)(11)65H10H10=H21H20C11C11=C21C20n(00=10)(11=10)2p(01)(11)p(00)(10)+2p(00)(11)p(01)(10)66H10H10=H21H20C11C11=C20C20n(00=10)(11=00)2p(01)(10)p(00)(10)67H10H10=H21H20C11C10=C21C21n(00=10)(10=11)2p(01)(11)p(00)(01)+2p(00)(11)p(01)(01)68H10H10=H21H20C11C10=C21C20n(00=10)(10=10)2p(01)(11)p(00)(00)+2p(01)(10)p(00)(01)+2p(00)(11)p(01)(00)+2p(00)(10)p(01)(01)69H10H10=H21H20C11C10=C20C20n(00=10)(10=00)2p(01)(10)p(00)(00)+2p(01)(00)p(00)(10)70H10H10=H21H20C10C10=C21C21n(00=10)(00=11)2p(01)(01)p(00)(01)71H10H10=H21H20C10C10=C21C20n(00=10)(00=10)2p(01)(01)p(00)(00)+2p(01)(00)p(00)(01)72H10H10=H21H20C10C10=C20C20n(00=10)(00=00)2p(01)(00)p(00)(00)73H10H10=H20H20C11C11=C21C21n(00=00)(11=11)p2(00)(11)74H10H10=H20H20C11C11=C21C20n(00=00)(11=10)2p(00)(11)p(00)(10)75H10H10=H20H20C11C11=C20C20n(00=00)(11=00)p2(00)(10)76H10H10=H20H20C11C10=C21C21n(00=00)(10=11)2p(00)(11)p(11)(01)77H10H10=H20H20C11C10=C21C20n(00=00)(10=10)2p(00)(11)p(00)(00)+2p(00)(10)p(00)(01)78H10H10=H20H20C11C10=C20C20n(00=00)(10=00)2p(00)(10)p(00)(00)79H10H10=H20H20C10C10=C21C21n(00=00)(00=11)p2(00)(01)80H10H10=H20H20C10C10=C21C20n(00=00)(00=10)2p(00)(01)p(01)(00)81H10H10=H20H20C10C10=C20C20n(00=00)(00=00)p2(00)(00)


5-3 by 2n[2n(k1k1=k2k2)(l1l1=l2l2)+n(k1k1=k2k2)(l1l1=l2l02);l02

5-15) and (5-16) until the estimates converge to stable values. The estimates at convergence are the maximum likelihood estimates (MLEs) of haplotype frequencies. The MLEs of allele frequencies at different SNPs and their linkage disequilibria of different orders can be solved from these estimated haplotype frequencies using the system of equations given in Table 5-1.

5-3, in which there are two SNPs from each genome, we assume $H^1_1 H^2_1$ from the host genome and $C^1_1 C^2_1$ from the cancer genome as two risk haplotypes. This leads to nine different across-genome composite diplotypes. A mixture-based likelihood for quantitative


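$$
\begin{aligned}
\log L(q \mid y, H, C, \hat{p})
={}& \sum_{i=1}^{n_{(11/11)(11/11)}} \log f_{AABB}(y_i)
 + \sum_{i=1}^{n_{(11/11)(\ldots)}} \log f_{AAB\bar{B}}(y_i)
 + \sum_{i=1}^{n_{(11/11)(\ldots)}} \log f_{AA\bar{B}\bar{B}}(y_i) \\
&+ \sum_{i=1}^{n_{(11/11)(10/10)}} \log\left[\omega^{C} f_{AAB\bar{B}}(y_i) + (1-\omega^{C}) f_{AA\bar{B}\bar{B}}(y_i)\right] \\
&+ \sum_{i=1}^{n_{(\ldots)(11/11)}} \log f_{A\bar{A}BB}(y_i)
 + \sum_{i=1}^{n_{(\ldots)(\ldots)}} \log f_{A\bar{A}B\bar{B}}(y_i)
 + \sum_{i=1}^{n_{(\ldots)(\ldots)}} \log f_{A\bar{A}\bar{B}\bar{B}}(y_i) \\
&+ \sum_{i=1}^{n_{(\ldots)(10/10)}} \log\left[\omega^{C} f_{A\bar{A}B\bar{B}}(y_i) + (1-\omega^{C}) f_{A\bar{A}\bar{B}\bar{B}}(y_i)\right] \\
&+ \sum_{i=1}^{n_{(\ldots)(11/11)}} \log f_{\bar{A}\bar{A}BB}(y_i)
 + \sum_{i=1}^{n_{(\ldots)(\ldots)}} \log f_{\bar{A}\bar{A}B\bar{B}}(y_i)
 + \sum_{i=1}^{n_{(\ldots)(\ldots)}} \log f_{\bar{A}\bar{A}\bar{B}\bar{B}}(y_i) \\
&+ \sum_{i=1}^{n_{(\ldots)(10/10)}} \log\left[\omega^{C} f_{\bar{A}\bar{A}B\bar{B}}(y_i) + (1-\omega^{C}) f_{\bar{A}\bar{A}\bar{B}\bar{B}}(y_i)\right] \\
&+ \sum_{i=1}^{n_{(10/10)(11/11)}} \log\left[\omega^{H} f_{A\bar{A}BB}(y_i) + (1-\omega^{H}) f_{\bar{A}\bar{A}BB}(y_i)\right] \\
&+ \sum_{i=1}^{n_{(10/10)(\ldots)}} \log\left[\omega^{H} f_{A\bar{A}B\bar{B}}(y_i) + (1-\omega^{H}) f_{\bar{A}\bar{A}B\bar{B}}(y_i)\right] \\
&+ \sum_{i=1}^{n_{(10/10)(\ldots)}} \log\left[\omega^{H} f_{A\bar{A}\bar{B}\bar{B}}(y_i) + (1-\omega^{H}) f_{\bar{A}\bar{A}\bar{B}\bar{B}}(y_i)\right] \\
&+ \sum_{i=1}^{n_{(10/10)(10/10)}} \log\Big[\omega^{(10/10)(10/10)}_{A\bar{A}B\bar{B}} f_{A\bar{A}B\bar{B}}(y_i)
 + \omega^{(10/10)(10/10)}_{A\bar{A}\bar{B}\bar{B}} f_{A\bar{A}\bar{B}\bar{B}}(y_i) \\
&\qquad\qquad + \omega^{(10/10)(10/10)}_{\bar{A}\bar{A}B\bar{B}} f_{\bar{A}\bar{A}B\bar{B}}(y_i)
 + \omega^{(10/10)(10/10)}_{\bar{A}\bar{A}\bar{B}\bar{B}} f_{\bar{A}\bar{A}\bar{B}\bar{B}}(y_i)\Big],
\end{aligned}
$$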


[...] (5–17), we model $f_{\cdot\cdot\cdot\cdot}(y_i)$ by a normal distribution with diplotype-specific mean $\mu_{\cdot\cdot\cdot\cdot}$ and variance $\sigma^2$. The EM algorithm is implemented to estimate these means and variance that maximize the likelihood. In the E step, we calculate the posterior probability of a particular diplotype within a genotype for SNPs across the genomes using


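$$
\Pi^{d}_{(10/10)(10/10)i} = \frac{\omega^{(10/10)(10/10)}_{d}\, f_{d}(y_i)}{\phi_i},
\qquad d \in \{A\bar{A}B\bar{B},\ A\bar{A}\bar{B}\bar{B},\ \bar{A}\bar{A}B\bar{B},\ \bar{A}\bar{A}\bar{B}\bar{B}\},
$$

with

$$
\phi_i = \omega^{(10/10)(10/10)}_{A\bar{A}B\bar{B}} f_{A\bar{A}B\bar{B}}(y_i)
 + \omega^{(10/10)(10/10)}_{A\bar{A}\bar{B}\bar{B}} f_{A\bar{A}\bar{B}\bar{B}}(y_i)
 + \omega^{(10/10)(10/10)}_{\bar{A}\bar{A}B\bar{B}} f_{\bar{A}\bar{A}B\bar{B}}(y_i)
 + \omega^{(10/10)(10/10)}_{\bar{A}\bar{A}\bar{B}\bar{B}} f_{\bar{A}\bar{A}\bar{B}\bar{B}}(y_i).
$$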
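The E step for a subject who is doubly heterozygous in both genomes amounts to normalizing weighted normal densities over the four candidate composite diplotypes. The following is a minimal sketch of that calculation. The ASCII labels are an assumption made for illustration: an upper-case letter stands for a risk haplotype and a lower-case letter for any non-risk haplotype, so "AaBb" denotes one risk haplotype in each genome. The weights and means in the example are hypothetical values, not estimates from the dissertation.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_diplotype_probs(y, weights, means, var):
    """Posterior probability of each candidate diplotype given the phenotype y.

    weights: prior mixture proportions of the diplotypes within the genotype
    means:   diplotype-specific genotypic means
    var:     common residual variance
    """
    joint = {d: weights[d] * normal_pdf(y, means[d], var) for d in weights}
    total = sum(joint.values())
    return {d: value / total for d, value in joint.items()}


# Hypothetical inputs for one doubly heterozygous subject.
example_weights = {"AaBb": 0.4, "Aabb": 0.1, "aaBb": 0.1, "aabb": 0.4}
example_means = {"AaBb": 12.0, "Aabb": 10.0, "aaBb": 9.0, "aabb": 8.0}
example_posterior = posterior_diplotype_probs(11.5, example_weights, example_means, 1.0)
```

In the M step these posterior probabilities would serve as fractional memberships when the diplotype-specific means and the common variance are updated.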


A loop of the E and M steps is formulated between equations (5–18) and (5–19) to obtain the MLEs of the genotypic values and variance.

For a practical data set, risk haplotypes are unknown. A combinatory approach is used to detect an optimal combination of risk haplotypes derived from the host and cancer [...]


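[...] (5–13).

[...] 5-1 can be tested using the two hypotheses as follows: [...]

The log-likelihood ratio test statistic for the significance of LD is calculated by comparing the likelihood values under H1 (full model) and H0 (reduced model) using

$$LR_D = -2\left[\log L(\hat{p}^{H}_{k_1}, \hat{p}^{H}_{k_2}, \hat{p}^{C}_{l_1}, \hat{p}^{C}_{l_2}, \mathrm{LD}=0 \mid H, C) - \log L(\hat{p} \mid H, C)\right] \tag{5–21}$$

where $\hat{p}^{H}_{k_1}$, $\hat{p}^{H}_{k_2}$, $\hat{p}^{C}_{l_1}$, and $\hat{p}^{C}_{l_2}$ are the MLEs of allele frequencies at four SNPs from the two genomes. The $LR_D$ calculated under the H0 and H1 hypotheses is considered to asymptotically follow a $\chi^2$ distribution with 11 degrees of freedom, the difference between the 15 free haplotype frequencies of the full model and the four allele frequencies of the reduced model.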
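The structure of this test can be illustrated with a short sketch. It assumes, purely for illustration, that phased four-SNP haplotype counts are available, so that both likelihoods are simple multinomials. In the dissertation the haplotype frequencies are instead estimated from unphased genotypes by the EM algorithm. The function name and the coding of a haplotype as a tuple of 0/1 alleles are hypothetical.

```python
import math
from itertools import product


def lr_linkage_equilibrium(counts):
    """Likelihood ratio statistic for H0: no LD of any order among four SNPs.

    counts maps a haplotype, coded as a 4-tuple of 0/1 alleles, to its count.
    Returns the statistic and its degrees of freedom.
    """
    total = sum(counts.values())
    # Full model: unrestricted haplotype frequencies (15 free parameters).
    loglik_full = sum(c * math.log(c / total) for c in counts.values() if c > 0)
    # Reduced model: haplotype frequency is the product of allele frequencies.
    allele1 = [sum(c for h, c in counts.items() if h[s] == 1) / total for s in range(4)]
    loglik_null = 0.0
    for h, c in counts.items():
        if c > 0:
            p = 1.0
            for s in range(4):
                p *= allele1[s] if h[s] == 1 else 1.0 - allele1[s]
            loglik_null += c * math.log(p)
    stat = -2.0 * (loglik_null - loglik_full)
    df = (2 ** 4 - 1) - 4  # 15 haplotype frequencies minus 4 allele frequencies
    return stat, df


# Hypothetical data: every haplotype equally frequent, so no LD is present.
balanced = {h: 10 for h in product((0, 1), repeat=4)}
stat_balanced, df_balanced = lr_linkage_equilibrium(balanced)
```

For the balanced example the statistic is zero up to rounding, while a sample made up only of haplotypes 0000 and 1111 gives a value far above 19.68, the 5% critical value of a chi-square distribution with 11 degrees of freedom.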


[...] for the digenic LD, [...] for the trigenic LD, and [...] for the quadrigenic LD.

Each LD can also be tested separately. Under the null hypothesis of no LD, haplotype frequencies are estimated with the same EM algorithm derived to estimate the frequency parameters under the alternative hypothesis, except for the constraint posed on the relationships of haplotype frequencies under the null hypothesis. Depending on which type of LD is tested, these constraints can be obtained from equations (5–1)–(5–11), respectively.
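To make the EM iteration for haplotype frequencies concrete, here is a deliberately reduced sketch for two SNPs in a single genome, where only the double heterozygote has an ambiguous phase. This is a simplification chosen for brevity: the model in this chapter works with four SNPs across two genomes and 81 genotypes, and the constrained versions used under the null hypotheses are not shown. Genotypes are coded as the number of "1" alleles at each SNP, and all names are illustrative.

```python
# Haplotype pair implied by each genotype whose phase is unambiguous.
UNAMBIGUOUS = {
    (2, 2): ("11", "11"), (2, 1): ("11", "10"), (2, 0): ("10", "10"),
    (1, 2): ("11", "01"), (1, 0): ("10", "00"),
    (0, 2): ("01", "01"), (0, 1): ("01", "00"), (0, 0): ("00", "00"),
}


def em_haplotype_frequencies(geno_counts, max_iter=500, tol=1e-12):
    """Estimate two-SNP haplotype frequencies from unphased genotype counts."""
    n = sum(geno_counts.values())
    p = {h: 0.25 for h in ("11", "10", "01", "00")}
    for _ in range(max_iter):
        # E step: probability that a double heterozygote carries 11|00
        # rather than 10|01, given the current frequencies.
        cis = p["11"] * p["00"]
        trans = p["10"] * p["01"]
        phi = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M step: expected haplotype counts divided by 2n.
        counts = dict.fromkeys(p, 0.0)
        for geno, c in geno_counts.items():
            if geno == (1, 1):
                counts["11"] += phi * c
                counts["00"] += phi * c
                counts["10"] += (1.0 - phi) * c
                counts["01"] += (1.0 - phi) * c
            else:
                for h in UNAMBIGUOUS[geno]:
                    counts[h] += c
        new_p = {h: counts[h] / (2.0 * n) for h in p}
        converged = max(abs(new_p[h] - p[h]) for h in p) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical sample: 5 subjects 11/11, 5 subjects 00/00, 10 double heterozygotes.
estimates = em_haplotype_frequencies({(2, 2): 5, (0, 0): 5, (1, 1): 10})
```

In this example the unambiguous homozygotes pull the double heterozygotes toward the 11|00 phase, so the estimates approach 0.5 for haplotypes 11 and 00.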


[...] (5–25)

The log-likelihood ratio test statistic ($LR_E$) under these two hypotheses can be similarly calculated. The $LR_E$ may asymptotically follow a $\chi^2$ distribution with eight degrees of freedom. However, the approximation of a $\chi^2$ distribution may be inappropriate when some regularity conditions, such as normality and uncorrelated residuals, are violated. The permutation test approach (Churchill and Doerge 1994), which does not rely upon the distribution of the $LR_E$, may be used to determine the critical threshold for determining the existence of risk haplotypes.

Different genetic effects, such as the additive, dominance, and additive × additive, additive × dominance, dominance × additive, and dominance × dominance effects between haplotypes from the host and cancer genomes can also be tested individually, with respective null hypotheses formulated as [...]


[...] (5–25), with a constraint derived from a system of equations (5–13). The critical thresholds for these individual effects (5–26)–(5–33) can be determined on the basis of simulation studies.

[...] 5-4. Four sample sizes from modest (200) to intermediate (400) to large (800) to very large (2000) are considered. The population genetic parameters that specify the distribution and diversity of haplotypes can be well estimated with the model. A modest sample size is adequate for estimating allele frequencies and digenic linkage disequilibria. An intermediate to large sample size is needed to estimate higher-order linkage disequilibria. In particular, to estimate the quadrigenic linkage disequilibrium precisely, a sample of 2000 is recommended (Table 5-4).

For the assumed population in which multiple SNPs are typed, a quantitative trait that describes cancer susceptibility was simulated, following a normal distribution with means depending on composite diplotypes of SNPs and a residual variance. The genotypic values of composite diplotypes are determined by assuming specific values for the additive, dominance and epistatic effects of haplotypes on the cancer trait. Composite diplotypes are formed by assuming two risk haplotypes for the SNPs, one from the host genome ($H^1_1H^2_1$) and the second from the cancer genome ($C^1_1C^2_1$). The genotypic values of a total of 16 composite diplotypes, along with their probabilities calculated from haplotype frequencies, are used to compute the genetic variance. The residual variance is then determined by assuming different heritability levels (0.1 and 0.4).

Table 5-5 gives the log-likelihoods of population and quantitative genetic parameters by assuming different combinations of risk haplotypes from the host and cancer genomes for simulation data with sample size 200 and heritability 0.1. It can be seen that risk haplotype combination ($H^1_1H^2_1$)($C^1_1C^2_1$) corresponds to the maximum likelihood among all possible combinations, suggesting that the model has correctly selected risk haplotypes. It appears that the power to correctly select the optimal combination of risk haplotypes is very high, even when a modest sample size and/or heritability is assumed (data not shown). The model provides reasonable estimates of quantitative genetic parameters (Table 5-6). The estimation precision of parameters increases with sample size and heritability. For the additive genetic effects, a modest sample size (200) is sufficient even when the heritability is low (0.1). The good estimation of dominance genetic effects [...]

Table 5-4. The MLEs of population genetic parameters for two host SNPs and two cancer SNPs and the standard deviations of the estimates (in parentheses) in a simulated cancer population of varying sample sizes. Columns: Parameters; True; MLE at sample sizes 200, 400, 800, and 2000.

Table 5-5. Log-likelihood values for different combinations of risk haplotypes from the host and cancer genomes in a simulated cancer population of 200 subjects with a heritability of 0.1. Rows: Host; columns (Cancer): $C^1_1C^2_1$, $C^1_1C^2_0$, $C^1_0C^2_1$, $C^1_0C^2_0$.

Table 5-6. The MLEs of quantitative genetic parameters of haplotypes for SNPs typed from the host and cancer genomes and the standard deviations of the estimates (in parentheses) in a simulated cancer population of varying sample sizes and heritabilities. Columns: Parameters; True; MLE at sample sizes 200, 400, 800, and 2000.
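The way a residual variance follows from a chosen heritability in such a simulation can be sketched as below. The sketch assumes that heritability is the ratio of genetic variance to total phenotypic variance, so that the residual variance equals the genetic variance times (1 − H²)/H². The genotypic values and diplotype probabilities in the example are hypothetical placeholders, not the values used in the dissertation.

```python
import random


def genetic_variance(values, probs):
    """Variance of genotypic values weighted by diplotype probabilities."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))


def residual_variance(var_g, heritability):
    """Residual variance implied by H^2 = var_g / (var_g + var_e)."""
    return var_g * (1.0 - heritability) / heritability


def simulate_trait(values, probs, heritability, n, seed=0):
    """Draw n phenotypes: sample a diplotype, then add normal residual noise."""
    rng = random.Random(seed)
    sd = residual_variance(genetic_variance(values, probs), heritability) ** 0.5
    diplotypes = rng.choices(range(len(values)), weights=probs, k=n)
    return [rng.gauss(values[d], sd) for d in diplotypes]


# Hypothetical example with two equally likely diplotype classes.
example_values = [1.0, 0.0]
example_probs = [0.5, 0.5]
example_var_g = genetic_variance(example_values, example_probs)
example_trait = simulate_trait(example_values, example_probs, 0.1, 200)
```

With these placeholder values the genetic variance is 0.25, so a heritability of 0.1 implies a residual variance of 2.25 and a heritability of 0.4 implies 0.375.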


The novelty of the model lies in three aspects. First, we incorporate into the model the recent discovery in cancer genetics that gene mutations cause cancer (Balman et al. 2003; Greenman et al. 2007; Jallepalli and Lengauer 2001; Stock and Bialy 2003). The model can not only characterize how gene mutations in the cancer genome act to regulate cancer, but can also detect the genetic interactions between host genes and cancer mutations. It allows tests of haplotype distribution and diversity in the cancer population and of the patterns of genetic actions and interactions. Second, the model is integrated with multilocus SNP data, detecting cancer genes at the DNA sequence level (Liu et al. 2004; Wu and Lin 2008). This will provide significant insights into the genetic regulation mechanisms of cancer and the cloning of cancer genes. Third, the model was built on the interactions of genes between different genomes. Modeling genome-genome interactions has received increasing interest in studying the genetic architecture of seed development (Cui and Wu 2005) and pathogenesis (Foster et al. 2003; Wang et al. 2005).


Although this article presents a general framework for haplotyping cancer genes, it can be extended to include gene × environment interactions, haplotyping in a case-control study, genetic imprinting, and an arbitrary number of SNPs. Because cancer is an inherited disease, its genetic research benefits from an informative family-structured design in which one or both of the parents and their offspring are sampled simultaneously. The general principle of haplotyping genome-genome interactions can be used for such a family design, facilitating our understanding of cancer genetics. Also, cancer can be better viewed as a dynamic trait which undergoes marked developmental transition. Functional mapping advocated by our group (Liu et al. 2005; Ma et al. 2002; Wu and Lin 2006) can be implemented into the haplotyping model to explore the developmental change of the genetic control of cancer over the time course. In this article, we focused on the "gene-mutation hypothesis" of cancer formation when the epistatic model was derived. Other hypotheses, such as the "aneuploidy hypothesis of cancer" (Stock and Bialy 2003), should also be integrated into the model to better understand the genetic mechanisms of cancer formation and progression. The model that incorporates the aneuploidy control of cancer will be reported elsewhere.




This dissertation provides a comprehensive set of statistical models for cancer gene identification. Unlike traditional approaches, these models were built on solid biological aspects of cancer initiation and formation. They include genetic mutations, aneuploid metabolic control, transgenerational epigenetics, genetic imprinting, and host-tumor interactions. These models should find immediate application in current cancer genetic research with the advent of massive amounts of genomic data. However, to enhance the biological relevance of these models, we need to integrate biological principles of cancer.

As a high-order phenotype, cancer formation undergoes a series of biochemical pathways from DNA to mRNA through transcription, from mRNA to protein through translation, and from protein to cancer via various biosynthetic processes. "Omic" data in each of these steps have been increasingly accumulated and, thus, it is time to integrate these data by powerful statistical models. However, it should be pointed out that such a Central Dogma of biology is now surpassed by a plethora of new discoveries, including alternative splicing, RNA editing, post-translational modifications, moonlighting proteins, feedback circuits, and epigenetic inheritance. Techniques have been developed to detect the regulatory control of these phenomena at each step of life formation. Metabolites are the end products of nearly all cellular regulatory processes and reflect the ultimate outcome of potential changes directed by genomic and proteomic adjustments stemming from an environmental stimulus or genetic modification. To fully explain the nature of growth and development, it is crucial to combine genomics, transcriptomics, proteomics, and metabonomics into a network of interactions by developing powerful and robust statistical models. Such a network biology approach will provide an unprecedented opportunity to study the dynamic network of genes that determines the physiology of cancer over time [...]




Anway, M.D. and Skinner, M.K. "Transgenerational effects of the endocrine disruptor vinclozolin on the prostate transcriptome and adult onset disease." Prostate 68 (2008): 517–529.

Araujo, R.P. and McElwain, D.L.S. "The role of mechanical host-tumour interactions in the collapse of tumour blood vessels and tumour growth dynamics." J. Theor. Biol. 238 (2006): 817–827.

Arnheim, N. and Calabrese, P. "Understanding what determines the frequency and pattern of human germline mutations." Nat. Rev. Genet. 10 (2009): 478–488.

Bader, J.S. "The relative power of SNPs and haplotype as genetic markers for association tests." Pharmacogenomics 2 (2001): 11–24.

Balman, A., Gray, J., and Ponder, B. "The genetics and genomics of cancer." Nat. Genet. 33 (2003): 238–244.

Barabasi, A.L. and Oltvai, Z.N. "Network biology: understanding the cell's functional organization." Nat. Rev. Genet. 5 (2004).2: 101–113.

Baudot, A., Real, F.X., Izarzugaza, J.M., and Valencia, A. "From cancer genomes to cancer models: bridging the gaps." EMBO Rep. 10 (2009).4: 359–366.

Beckmann, J.S., Estivill, X., and Antonarakis, S.E. "Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability." Nat. Rev. Genet. 8 (2007): 639–646.

Boone, C., Bussey, H., and Andrews, B.J. "Exploring genetic interactions and networks with yeast." Nat. Rev. Genet. 8 (2007).6: 437–449.

Brennan, P. "Gene-environment interaction and aetiology of cancer: what does it mean and how can we measure it?" Carcinogenesis 23 (2002): 381–387.

Carpten, J.D., Faber, A.L., Horn, C., Donoho, G.P., Briggs, S.L., Robbins, C.M., Hostetter, G., Boguslawski, S., Moses, T.Y., Savage, S., Uhlik, M., Lin, A., Du, J., Qian, Y.W., Zeckner, D.J., Tucker-Kellogg, G., Touchman, J., Patel, K., Mousses, S., Bittner, M., Schevitz, R., Lai, M.H.T., Blanchard, K.L., and Thomas, J.E. "A transforming mutation in the pleckstrin homology domain of AKT1 in cancer." Nature 448 (2007): 439–444.

Chan, E.Y. "Advances in sequencing technology." Mutat. Res. 573 (2005): 13–40.

Cheverud, J.M., Hager, R., Roseman, C., Fawcett, G., Wang, B., and Wolf, J.B. "Genomic imprinting effects on adult body composition in mice." Proc. Natl. Acad. Sci. 105 (2008): 4253–4258.


Churchill, G.A. and Doerge, R.W. "Empirical Threshold Values for Quantitative Trait Mapping." Genetics 138 (1994): 963–971.

Constancia, M., Kelsey, G., and Reik, W. "Resourceful imprinting." Nature 432 (2004): 53–57.

Crews, D., Gore, A.C., Hsu, T.S., Dangleben, N.L., Spinetta, M., Schallert, T., Anway, M.D., and Skinner, M.K. "Transgenerational epigenetic imprints on mate preference." Proc. Natl. Acad. Sci. USA 104 (2007): 5942–5946.

Cropley, J.E., Suter, C.M., Beckman, K.B., and Martin, D.I. "Germ-line epigenetic modification of the murine Avy allele by nutritional supplementation." Proc. Natl. Acad. Sci. USA 103 (2006): 17308–17312.

Cui, Y.H. and Wu, R.L. "Mapping genome-genome epistasis: A multi-dimensional model." Bioinformatics 21 (2005): 2447–2455.

Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., and Lander, E.S. "High-resolution haplotype structure in the human genome." Nat. Genet. 29 (2001): 229–232.

Davies, H., Bignell, G.R., Cox, C., Stephens, P., Edkins, S., Clegg, S., Teague, J., Woffendin, H., Garnett, M.J., Bottomley, W., Davis, N., Dicks, E., Ewing, R., Floyd, Y., Gray, K., Hall, S., Hawes, R., Hughes, J., Kosmidou, V., Menzies, A., Mould, C., Parker, A., Stevens, C., Watt, S., Hooper, S., Wilson, R., Jayatilake, H., Gusterson, B.A., Cooper, C., Shipley, J., Hargrave, D., Pritchard-Jones, K., Maitland, N., Chenevix-Trench, G., Riggins, G.J., Bigner, D.D., Palmieri, G., Cossu, A., Flanagan, A., Nicholson, A., Ho, J.W.C., Leung, S.Y., Yuen, S.T., Weber, B.L., Seigler, H.F., Darrow, T.L., Paterson, H., Marais, R., Marshall, C.J., Wooster, R., Stratton, M.R., and Futreal, P.A. "Mutations of the BRAF gene in human cancer." Nature 417 (2002): 949–954.

Dawson, E., Abecasis, G.R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D.M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S., Carter, D., Papaspyridonos, M., Livingstone, S., Ganske, R., Lõhmussaar, E., Zernant, J., Tõnisson, N., Remm, M., Magi, R., Puurand, T., Vilo, J., Kurg, A., Rice, K., Deloukas, P., Mott, R., Metspalu, A., Bentley, D.R., Cardon, L.R., and Dunham, I. "A first-generation linkage disequilibrium map of human chromosome." Nature 418 (2002): 544–548.

De Koning, D.J., Rattniek, A.P., Harlizius, B., Arendonk, J.A.M., Brascamp, E.W., and Groenen, M.A.M. "Genome-wide scan for body composition in pigs reveals important role of imprinting." Proc. Natl. Acad. Sci. USA 97 (2000): 7947–7950.


Duesberg, P., Li, R., Fabarius, A., and Hehlmann, R. "Orders of magnitude change in phenotype rate caused by mutation." Cell Oncol. 29 (2007): 71–72.

Duesberg, P., Rasnick, D., Li, R., Winters, L., Rausch, C., and Hehlmann, R. "How aneuploidy may cause cancer and genetic instability." Anticancer Res. 19 (1999): 4887–4906.

Duesberg, Peter. "Chromosomal Chaos and Cancer." Scientific American 296 (2007): 52–59.

Dupuis, J., Siegmund, D., and Yakir, B. "A unified framework for linkage and association analysis of quantitative traits." Proceedings of the National Academy of Sciences of the United States of America 104 (2007): 20210–20215.

Fan, C., Oh, D.S., Wessels, L., Weigelt, B., Nuyten, D.S., Nobel, A.B., van't Veer, L.J., and Perou, C.M. "Concordance among gene-expression-based predictors for breast cancer." N. Engl. J. Med. 355 (2006).6: 560–569.

Foster, J.S., Palmer, R.J., Jr., and Kolenbrander, P.E. "Human oral cavity as a model for the study of genome-genome interactions." Biol. Bull. 204 (2003): 200–204.

Futreal, P.A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., Rahman, N., and Stratton, M.R. "A census of human cancer genes." Nature Rev. Cancer 4 (2004): 177–183.

Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S.N., Rotimi, C., Adeyemo, A., Cooper, R., Ward, R., Lander, E.S., Daly, M.J., and Altshuler, D. "The structure of haplotype blocks in the human genome." Science 296 (2002): 2225–2229.

Gal-Yam, E.N., Saito, Y., Egger, G., and Jones, P.A. "Cancer Epigenetics: Modifications, Screening, and Therapy." Annual Review of Medicine 59 (2008): 267–280.

Greenman, C., Stephens, P., Smith, R., Dalgliesh, G.L., Hunter, C., Bignell, G., Davies, H., Teague, J., Butler, A., Stevens, C., Edkins, S., O'Meara, S., Vastrik, I., Schmidt, E.E., Avis, T., Barthorpe, S., Bhamra, G., Buck, G., Choudhury, B., Clements, J., Cole, J., Dicks, E., Forbes, S., Gray, K., Halliday, K., Harrison, R., Hills, K., Hinton, J., Jenkinson, A., Jones, D., Menzies, A., Mironenko, T., Perry, J., Raine, K., Richardson, D., Shepherd, R., Small, A., Tofts, C., Varian, J., Webb, T., West, S., Widaa, S., Yates, A., Cahill, D.P., Louis, D.N., Goldstraw, P., Nicholson, A.G., Brasseur, F., Looijenga, L., Weber, B.L., Chiew, Y.E., DeFazio, A., Greaves, M.F., Green, A.R., Campbell, P., Birney, E., Easton, D.F., Chenevix-Trench, G., Tan, M.H., Khoo, S.K., Teh, B.T., [...]


Grønbaek, K., Hother, C., and Jones, P.A. "Epigenetic changes in cancer." Acta Pathologica, Microbiologica et Immunologica Scandinavica 115 (2007): 1039–1059.

Haber, D.A. and Settleman, J. "Cancer: Drivers and passengers." Nature 446 (2007): 145–146.

Hanks, S. and Rahman, N. "Aneuploidy-cancer predisposition syndromes: a new link between the mitotic spindle checkpoint and cancer." Cell Cycle 4 (2005): 225–227.

Hartman, J.L., Garvik, B., and Hartwell, L. "Principles for the buffering of genetic variation." Science 291 (2001).5506: 1001–1004.

Hernandez, P., Huerta-Cepas, J., Montaner, D., Al-Shahrour, F., Valls, J., Gomez, L., Capella, G., Dopazo, J., and Pujana, M.A. "Evidence for systems-level molecular mechanisms of tumorigenesis." BMC Genomics 8 (2007): 185.

Isles, A.R. and Wilkinson, L.S. "Imprinted genes, cognition and behaviour." Trends Cogn. Sci. 4 (2000): 309–318.

Itier, J.M., Tremp, G., Leonard, J.F., Multon, M.C., Ret, G., Schweighoffer, F., Tocque, B., Bluet-Pajot, M.T., Cormier, V., and Dautry, F. "Imprinted gene in postnatal growth role." Nature 393 (1998): 125–126.

Jallepalli, P.V. and Lengauer, C. "Chromosome segregation and cancer: Cutting through the mystery." Nat. Rev. Cancer 1 (2001): 109–117.

Jirtle, R.L. and Skinner, M.K. "Environmental epigenomics and disease susceptibility." Nat. Rev. Genet. 8 (2007): 253–262.

Jones, P.A. and Martienssen, R. "A blueprint for a Human Epigenome Project: the AACR Human Epigenome Workshop." Cancer Res. 65 (2005): 11241–11246.

Judson, R., Stephens, J.C., and Windemuth, A. "The predictive power of haplotypes in clinical response." Pharmacogenomics 1 (2000): 15–26.

Kaiser, J. "Tackling the cancer genome." Science 309 (2005): 6.

Khalil, I.G. and Hill, C. "Systems biology for cancer." Curr. Opin. Oncol. 17 (2005).1: 44–48.

Kops, G.J., Weaver, B.A., and Cleveland, D.W. "On the road to cancer: aneuploidy and the mitotic checkpoint." Nat. Rev. Cancer 5 (2005): 773–785.

Lander, E.S. and Botstein, D. "Mapping mendelian factors underlying quantitative traits using RFLP linkage maps." Genetics 121 (1989): 185–199.


Li, L.L., Keverne, E.B., Aparicio, S.A., Ishino, F., Barton, S.C., and Surani, M.A. "Regulation of maternal behaviour and offspring growth by paternally expressed Peg3." Science 284 (1999): 330–333.

Li, R.H., Sonik, A., Stindl, R., Rasnick, D., and Duesberg, P. "Aneuploidy vs. gene mutation hypothesis of cancer: Recent study claims mutation but is found to support aneuploidy." Proceedings of the National Academy of Sciences of the United States of America 97 (2000): 3236–3241.

Li, Y.C., Coelho, C.M., Liu, T., Wu, S., Zeng, Y.R., Li, Y., Hunter, B., Dante, R.A., Larkins, B.A., and Wu, R.L. "A statistical strategy to estimate maternal-zygotic interactions and parent-of-origin effects of QTLs for seed development." PLoS ONE 3 (2007): e3131.

Lin, M. and Wu, R.L. "Detecting sequence-sequence interactions for complex diseases." Current Genomics 7 (2006): 59–72.

Liu, T., Johnson, J.A., Casella, G., and Wu, R.L. "Sequencing complex diseases with HapMap." Genetics 168 (2004): 503–511.

Liu, T., Todhunter, R.J., Wu, S., Hou, W., Mateescu, R., Zhang, Z.W., Burton-Wurster, N.I., Acland, G.M., Lust, G., and Wu, R.L. "A random model for mapping imprinted quantitative trait loci in a structured pedigree: An implication for mapping canine hip dysplasia." Genomics 90 (2007): 276–284.

Liu, T., Zhao, W., Tian, L.L., and Wu, R.L. "An algorithm for molecular dissection of tumor progression." Journal of Mathematical Biology 50 (2005): 336–354.

Loeb, L.A., Bielas, J.H., and Beckman, R.A. "Cancers exhibit a mutator phenotype: clinical implications." Cancer Res. 68 (2008): 3551–3557.

Ma, C.X., Casella, G., and Wu, R.L. "Functional mapping of quantitative trait loci underlying the character process: A theoretical framework." Genetics 161 (2002): 1751–1762.

Mack, G.S. "Epigenetic cancer therapy makes headway." J. Natl. Cancer Inst. 98 (2006): 1443–1444.

Maderspacher, Florian. "Theodor Boveri and the natural experiment." Current Biology 18 (2008).7.

McGrath, J. and Solter, D. "Inability of mouse blastomere nuclei transferred to enucleated zygotes to support development in vitro." Science 226 (1984): 1317–1319.

Morgan, H.D., Santos, F., Green, K., Dean, W., and Reik, W. "Epigenetic reprogramming in mammals." Hum. Mol. Genet. 14 (2005): R47–R58.


Nilsson, E.E., Anway, M.D., Stanfield, J., and Skinner, M.K. "Transgenerational epigenetic effects of the endocrine disruptor vinclozolin on pregnancies and female adult onset disease." Reproduction 135 (2008): 713–721.

Nowell, P.C. "Discovery of the Philadelphia chromosome: a personal perspective." J. Clin. Invest. 117 (2007): 2033–2035.

Parmigiani, G., Boca, S., Lin, J., Kenneth, K.W., Velculescu, V., and Vogelstein, B. "Design and analysis issues in genome-wide somatic mutation studies of cancer." Genomics 93 (2009): 17–21.

Parris, George E. "Clinically significant cancer evolves from transient mutated and/or aneuploid neoplasia by cell fusion to form unstable syncytia that give rise to ecologically viable parasite species." Medical Hypotheses 65 (2005).5.

Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., Nguyen, B.T., Norris, M.C., Sheehan, J.B., Shen, N., Stern, D., Stokowski, R.P., Thomas, D.J., Trulson, M.O., Vyas, K.R., Frazer, K.A., Fodor, S.P., and Cox, D.R. "Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21." Science 294 (2001): 1719–1723.

Pellman, David. "Cell biology: Aneuploidy and cancer." Nature 446 (2007): 38–39.

Pembrey, M.E., Bygren, L.O., Kaati, G., Edvinsson, S., Northstone, K., Sjostrom, M., Golding, J., and The ALSPAC Study Team. "Sex-specific, male-line transgenerational responses in humans." Eur. J. Hum. Genet. 14 (2006): 159–166.

Pujana, M.A., Han, J.D., Starita, L.M., Stevens, K.N., Tewari, M., and Ahn, J.S. "Network modeling links breast cancer susceptibility and centrosome dysfunction." Nat. Genet. 39 (2007).11: 1338–1349.

Rand, V., Prebble, E., Ridley, L., Howard, M., Wei, W., Brundler, M.A., Fee, B.E., Riggins, G.J., Coyle, B., and Grundy, R.G. "Investigation of chromosome 1q reveals differential expression of members of the S100 family in clinical subgroups of intracranial paediatric ependymoma." Br. J. Cancer (2008).

Reddy, E.P., Reynolds, R.K., Santos, E., and Barbacid, M. "A point mutation is responsible for the acquisition of transforming properties by the T24 human bladder carcinoma oncogene." Nature 300 (1982): 149–152.

Reik, W. and Walter, J. "Genomic imprinting: parental influence on the genome." Nat. Rev. Genet. 2 (2001): 21–32.


Rhodes, D.R. and Chinnaiyan, A.M. "Integrative analysis of the cancer transcriptome." Nat. Genet. 37 (2005): S31–S37.

Rowley, J.D. "The role of chromosome translocations in leukemogenesis." Semin. Hematol. 36 (1999): 59–72.

Rual, J.F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., and Li, N. "Towards a proteome-scale map of the human protein-protein interaction network." Nature 437 (2005).7062: 1173–1178.

Sasaki, H. and Matsui, Y. "Epigenetic events in mammalian germ-cell development: reprogramming and beyond." Nat. Rev. Genet. 9 (2008): 129–140.

Sha, K. "A mechanistic view of genomic imprinting." Annu. Rev. Genomics Hum. Genet. 9 (2008): 197–216.

Sharma, S.V., Bell, D.W., Settleman, J., and Haber, D.A. "Epidermal growth factor receptor mutations in lung cancer." Nature Rev. Cancer 7 (2007): 169–181.

Skinner, M.K. "What is an epigenetic transgenerational phenotype? F3 or F2." Reprod. Toxicol. 25 (2008): 2–6.

Skinner, M.K. and Anway, M.D. "Epigenetic transgenerational actions of vinclozolin on the development of disease and cancer." Crit. Rev. Oncog. 13 (2007): 75–82.

Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F.H., and Goehler, H. "A human protein-protein interaction network: a resource for annotating the proteome." Cell 122 (2005).6: 957–968.

Stephens, P., Hunter, C., Bignell, G., Edkins, S., Davies, H., Teague, J., Stevens, C., O'Meara, S., Smith, R., Parker, A., Barthorpe, A., Blow, M., Brackenbury, L., Butler, A., Clarke, O., Cole, J., Dicks, E., Dike, A., Drozd, A., Edwards, K., Forbes, S., Foster, R., Gray, K., Greenman, C., Halliday, K., Hills, K., Kosmidou, V., Lugg, R., Menzies, A., Perry, J., Petty, R., Raine, K., Ratford, L., Shepherd, R., Small, A., Stephens, Y., Tofts, C., Varian, J., West, S., Widaa, S., Yates, A., Brasseur, F., Cooper, C.S., Flanagan, A.M., Knowles, M., Leung, S.Y., Louis, D.N., Looijenga, L.H.J., Malkowicz, B., Pierotti, M.A., Teh, B., Chenevix-Trench, G., Weber, B.L., Yuen, S.T., Harris, G., Goldstraw, P., Nicholson, A.G., Futreal, P.A., Wooster, R., and Stratton, M.R. "Lung cancer: intragenic ERBB2 kinase mutations in tumours." Nature 431 (2004): 525–526.

Stock, R.P. and Bialy, H. "The sigmoidal curve of cancer." Nat. Biotech. 21 (2003): 13–14.


Suijkerbuijk, S.J. and Kops, G.J. "Preventing aneuploidy: the contribution of mitotic checkpoint proteins." Biochem. Biophys. Acta. 1786 (2008): 24–31.

Surani, M.A., Barton, S.C., and Norris, M.L. "Development of reconstituted mouse eggs suggests imprinting of the genome during gametogenesis." Nature 308 (1984): 548–550.

Tabin, C.J., Bradley, S.M., Bargmann, C.I., Weinberg, R.A., Papageorge, A.G., Scolnick, E.M., Dhar, R., Lowy, D.R., and Chang, E.H. "Mechanism of activation of a human oncogene." Nature 300 (1982): 143–149.

The Cancer Genome Atlas Research Network. "Comprehensive genomic characterization defines human glioblastoma genes and core pathways." Nature 455 (2008): 1061–1068.

The International HapMap Consortium. "The International HapMap Project." Nature 426 (2003): 789–794.

Thompson, S.L. and Compton, D.A. "Examining the link between chromosomal instability and aneuploidy in human cells." J. Cell Biol. 180 (2008): 665–672.

Tomlins, S.A., Rhodes, D.R., Perner, S., Dhanasekaran, S.M., Mehra, R., Sun, X.W., Varambally, S., Cao, X., Tchinda, J., Kuefer, R., Lee, C., Montie, J.E., Shah, R.B., Pienta, K.J., Rubin, M.A., and Chinnaiyan, A.M. "Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer." Science 310 (2005): 644–648.

Velculescu, V.E. "Defining the blueprint of the cancer genome." Carcinogenesis 29 (2008): 1087–1091.

Wang, Z.H., Hou, W., and Wu, R.L. "A statistical model to analyze quantitative trait locus interactions for HIV dynamics from the virus and human genomes." Stat. Med. 25 (2005): 495–511.

Weaver, B.A. and Cleveland, D.W. "Does aneuploidy cause cancer?" Curr. Opin. Cell Biol. 18 (2006): 658–667.

Whitelaw, N.C. and Whitelaw, E. "Transgenerational epigenetic inheritance in health and disease." Curr. Opin. Genet. Dev. 18 (2008): 273–279.

Wilkins, J.F. and Haig, D. "What good is genomic imprinting: The function of parent-specific gene expression." Nat. Rev. Genet. 4 (2003): 359–368.

Wilkinson, L.S., Davies, W., and Isles, A.R. "Genomic imprinting effects on brain development and function." Nat. Rev. Neurosci. 4 (2007): 1–19.

Wolf, J.B., Cheverud, J.M., Roseman, C., and Hager, R. "Genome-Wide Analysis Reveals a Complex Pattern of Genomic Imprinting in Mice." PLoS. Genet. 4 (2008): e1000091.


Wu, R.L. and Zeng, Z.-B. "Joint linkage and linkage disequilibrium mapping in natural populations." Genetics 157 (2001): 899–909.

Wu, Rongling and Lin, Min. Statistical and Computational Pharmacogenomics. London: Chapman & Hall/CRC, 2008.

Youngson, N.A. and Whitelaw, E. "Transgenerational epigenetic effects." Annu. Rev. Genomics Hum. Genet. 9 (2008): 233–257.

Zhang, F., Zhao, D., Chen, G., and Li, Q. "Gene mutation and aneuploidy might cooperate to carcinogenesis by dysregulation of asymmetric division of adult stem cells." Medical Hypotheses 67 (2006).4: 995–996.

Zhang, K., Deng, M., Chen, T., Waterman, M.S., and Sun, F. "A dynamic programming algorithm for haplotype block partitioning." Proc. Natl. Acad. Sci. USA 99 (2002): 7335–7339.


Yao Li, originally trained in biology at the University of Science and Technology of China and the University of Illinois at Urbana-Champaign, became a graduate student in the Department of Statistics at the University of Florida in the fall of 2003. She received her Ph.D. from the University of Florida in the summer of 2009, under the supervision of Professor Rongling Wu. She also worked in the Department of Public Health Sciences at the Penn State College of Medicine as a visiting student before her graduation. She has obtained a tenure-track position in the Department of Statistics at West Virginia University. She is intrigued by the development of statistical and computational models for identifying genes that control complex traits and diseases. She is eager to use her models and algorithms to solve complicated real-world genetic problems.